
My last post received quite a few responses, but before actually settling on the curriculum of the upcoming ML course, I’ve decided to “eat my own dog food” and use ML to analyze what the data is telling me.

More specifically, I’m interested in the classes of ML enthusiasts that exist in Cluj-Napoca and in identifying what those classes represent. I’d like to understand who my audience is, what they know, and what they’d like to learn next.

I’d also like to get an insight into why they are interested in joining this course.

The basic strategy that I’m going to use is:

  • data cleaning
  • feature engineering
  • feature analysis
  • clustering
  • cluster analysis

In the end I should have the insight I’m seeking, so let’s proceed!

Load the data

Code
%matplotlib inline
import sklearn 
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

I have a WorkAt field that contains sensitive information, so I’m hiding it from the displayed output (the Email address and Name columns are dropped entirely).

Code
df = pd.read_csv("Cluj_ML_Enthusiasts.csv", sep='\t', na_values=['-']).drop(columns=['Email address', 'Name'])
df.drop(columns=['WorkAt']).head()
Timestamp Knowledge Effort Degree IsResearcher IsML IsDeveloper IsTeaching IsStudent
0 29/10/2018 16:05:09 15 16 4 True True True False False
1 29/10/2018 16:11:16 7 32 3 False False True False False
2 29/10/2018 17:18:45 6 16 3 False False True False False
3 29/10/2018 17:21:18 6 4 3 False False True False False
4 29/10/2018 17:25:51 10 4 3 False True True False False

The Knowledge field contains values that represent, in ascending order, a self-assessment of ML knowledge. The actual number-to-question mapping is found below:

1 - I heard about ML
2 - I understand what ML is and should be doing
3 - I've used some packaged libraries / APIs that used ML in the background
4 - I regularly read news articles or blogs about ML and wish to transition into ML
5 - I've started online ML courses but gave up before finishing
6 - I've finished 1-2 courses about ML (at university or online)
7 - I've followed a few tutorials and executed their code on my machine / forked the GitHub project
8 - I've registered on Kaggle and ran some kernels other people wrote
9 - I've written some ML hobby projects / played with some Kaggle datasets myself
10 - I've been recently employed in an ML position
11 - I've finished a PoC ML code for my employer but the project is young
12 - I know how to debug my ML model, I understand what it does and what its shortcomings are
13 - I know the academic name, SoTA, previous work for the class of problems that I work on
14 - I've competed and finished in top 10% in at least one Kaggle competition / I have active ML code in production
15 - I read ML academic papers and am trying to reproduce their results / Pursuing or have a PhD in AI
16 - I'm able to reproduce ML academic papers and can evaluate the correctness of their claims
17 - When I read a paper I can easily see where it fits in the current SoTA, and can anticipate some future directions of research
18 - I write ML papers with code released online that gets published in recognized venues / I work at DeepMind, FAIR or Microsoft Research

The Degree field takes one of four values, according to the following enumeration:

1 - Student
2 - BSc
3 - MSc
4 - PhD
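
If readable labels are ever needed instead of codes (e.g. when annotating plots later), a small mapping like the hypothetical one below can translate the Degree values back; the Knowledge scale could be handled the same way.

Code
# Hypothetical helper, not part of the original survey processing:
# map the numeric Degree codes back to human-readable labels for display.
degree_labels = {1: 'Student', 2: 'BSc', 3: 'MSc', 4: 'PhD'}
df.Degree.map(degree_labels).head()  # e.g. the first respondent maps to 'PhD'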

The other Is... fields are just expansions of the information contained in the Knowledge, Degree and WorkAt columns taken as a whole, and I’ve filled them in myself.

Clean the data

Making the WorkAt column a categorical column. This hides the sensitive information and, at the same time, converts everything into numbers. ML models love numbers!

Code
df.WorkAt = df.WorkAt.astype('category').cat.as_ordered().cat.codes + 1

Making a numerical copy of the dataframe and anonymising the data

Code
df_num = df.copy()
df_num.WorkAt = df.WorkAt.astype('category').cat.as_ordered().cat.codes + 1
df_num.tail()
Timestamp Knowledge Effort Degree IsResearcher IsML IsDeveloper IsTeaching IsStudent WorkAt
67 01/11/2018 16:26:09 12 4 3 True True False True False 21
68 02/11/2018 09:20:57 1 16 2 False False True False False 22
69 02/11/2018 22:25:43 2 16 1 True False False True False 20
70 02/11/2018 22:29:13 2 16 3 False False True False False 13
71 06/11/2018 22:14:40 5 16 1 False False False False True 21

Converting the Timestamp column to datetime. This needs some attention, and in this case a specific format parameter was required for pandas. It usually infers the date format correctly, but in this instance some of the November entries were detected as being from February.

Code
df_num.Timestamp = pd.to_datetime(df_num.Timestamp, dayfirst=True, format="%d/%m/%Y %H:%M:%S")
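
As a quick illustration of the ambiguity (a sketch, not an output from the original run): without an explicit day-first hint, pandas parses these strings month-first, so 2 November becomes 11 February.

Code
pd.to_datetime("02/11/2018 09:20:57")                 # Timestamp('2018-02-11 09:20:57')
pd.to_datetime("02/11/2018 09:20:57", dayfirst=True)  # Timestamp('2018-11-02 09:20:57')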

Some feature engineering

Adding an HoursElapsed column, inferred from the Timestamp column. HoursElapsed is a synthetic feature that counts the number of hours passed between me publishing the news and the actual registration. It’s useful to know how fast a certain user reacted when seeing the event.

Code
import numpy as np
hours_elapsed = df_num.Timestamp.astype(np.int64) // ((10 ** 9) * 3600) # hours
hours_elapsed -= min(hours_elapsed)
df_num['HoursElapsed'] = hours_elapsed
df_num.head()
Timestamp Knowledge Effort Degree IsResearcher IsML IsDeveloper IsTeaching IsStudent WorkAt HoursElapsed
0 2018-10-29 16:05:09 15 16 4 True True True False False 6 0
1 2018-10-29 16:11:16 7 32 3 False False True False False 6 0
2 2018-10-29 17:18:45 6 16 3 False False True False False 10 1
3 2018-10-29 17:21:18 6 4 3 False False True False False 17 1
4 2018-10-29 17:25:51 10 4 3 False True True False False 18 1

The HoursElapsed values per se are just a quick proxy for measuring how enthusiastic a certain user is about the idea of an ML course. Arguably, they’re also a proxy for how much time he/she spends on Facebook or Twitter, but I’ll assume the former.

I’ll do a simple transformation on HoursElapsed by which the most recent registrations get a high value, while the more distant ones fade towards 0 enthusiasm.

I’m modeling people’s enthusiasm by assuming that the velocity of their response follows a positive Gaussian kernel, which defines the Enthusiasm feature:

Code
from math import pi
plt.rcParams['figure.figsize'] = [10, 10]

def gaussian(x, std): 
    return np.exp(-0.5*((x/std)**2)) / (std * np.sqrt(2*pi))


from matplotlib import pyplot as plt
plt.scatter(df_num.HoursElapsed, gaussian(df_num.HoursElapsed, 40))

[Figure: scatter plot of the Gaussian kernel value against HoursElapsed]

We are going to add the Enthusiasm values to our data below.

Code
kernel = gaussian(df_num.HoursElapsed, 40)
kernel = (kernel - kernel.mean()) / kernel.std()
kernel

df_num['Enthusiasm'] = kernel
df_num.tail()
Timestamp Knowledge Effort Degree IsResearcher IsML IsDeveloper IsTeaching IsStudent WorkAt HoursElapsed Enthusiasm
67 2018-11-01 16:26:09 12 4 3 True True False True False 21 72 -1.789470
68 2018-11-02 09:20:57 1 16 2 False False True False False 22 89 -2.122546
69 2018-11-02 22:25:43 2 16 1 True False False True False 20 102 -2.255501
70 2018-11-02 22:29:13 2 16 3 False False True False False 13 102 -2.255501
71 2018-11-06 22:14:40 5 16 1 False False False False True 21 198 -2.368870

Now that we’ve added the Enthusiasm column, we can remove the HoursElapsed and Timestamp columns since they are correlated with Enthusiasm.

Code
df_num = df_num.drop(columns=['Timestamp', 'HoursElapsed'])

Feature analysis

Do a feature analysis and remove highly correlated features

Code
def plot_correlation(df_num):
    plt.rcParams['figure.figsize'] = [140, 105]
    corr = df_num.corr().abs()
    plt.matshow(df_num.corr())
    plt.xticks(ticks=np.arange(len(df_num.columns.values)), labels=df_num.columns)
    plt.yticks(ticks=np.arange(len(df_num.columns.values)), labels=df_num.columns)
    plt.show()
    return corr

corr = plot_correlation(df_num)

[Figure: correlation matrix of the remaining features]

Code
# https://stackoverflow.com/questions/17778394/list-highest-correlation-pairs-from-a-large-correlation-matrix-in-pandas
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

get_top_abs_correlations(df_num, n=20)
IsResearcher  IsTeaching      0.624758
Knowledge     IsML            0.585793
IsResearcher  IsML            0.535484
IsDeveloper   IsStudent       0.452602
IsTeaching    Enthusiasm      0.422794
IsDeveloper   IsTeaching      0.408248
Knowledge     IsResearcher    0.402911
Degree        IsStudent       0.384491
IsML          IsTeaching      0.369175
WorkAt        Enthusiasm      0.363629
IsResearcher  IsDeveloper     0.324617
Degree        IsTeaching      0.315283
IsDeveloper   WorkAt          0.293584
Degree        IsResearcher    0.282929
IsResearcher  IsStudent       0.266398
Effort        IsML            0.254194
Knowledge     Degree          0.253271
Degree        IsML            0.239952
IsTeaching    IsStudent       0.234521
IsResearcher  Enthusiasm      0.219462
dtype: float64

We can see that some of the pairs are highly correlated with each other. From the most correlated pairs we’re going to drop one column each, as it contains redundant information.

Decided on dropping the following columns:

IsML, Degree, IsResearcher

In addition, WorkAt brings no evident value, so I’ll drop it too.

Code
df_num = df_num.drop(columns=['IsML', 'Degree', 'IsResearcher', 'WorkAt'])
df_num.head()
Knowledge Effort IsDeveloper IsTeaching IsStudent Enthusiasm
0 15 16 True False False 0.558944
1 7 32 True False False 0.558944
2 6 16 True False False 0.558029
3 6 4 True False False 0.558029
4 10 4 True False False 0.558029

Data plotting

We will attempt to plot the data we have on a 2D plot. In order to do this, we will reduce its dimensionality using PCA. This is useful because we want to get a rough idea of how many clusters we have.

Code
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA


pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('dim_reducer', PCA(n_components=2))
])


reduced = pipe.fit_transform(df_num.values.astype(np.float32))

plt.rcParams['figure.figsize'] = [10, 10]
plt.scatter(reduced[:, 0], reduced[:, 1])

for i, (x, y) in enumerate(reduced[:, [0, 1]]):
    plt.text(x, y, "Name " + str(df.index[i]))

[Figure: 2D PCA projection of the respondents, each point labeled with its row index]

If you squint, it seems we have roughly 4 generic clusters (with a couple of outliers). We will use KMeans clustering with this value, but only after we normalize all the features. Let’s see what we get!
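
As an extra sanity check on that eyeballed number (this step wasn’t part of the original analysis), a quick silhouette-score sweep over a few candidate cluster counts can be run on the scaled features; if the visual guess is sound, k=4 should score competitively. After that, on to the actual clustering.

Code
from sklearn.metrics import silhouette_score

# Sweep a few candidate cluster counts and report the silhouette score (higher is better).
X = StandardScaler().fit_transform(df_num.values.astype(np.float32))
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))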

Code
df_ = df_num.copy()
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('cluster', KMeans(n_clusters=4))
])

df_['Cluster'] = pipe.fit_predict(df_num)
df_.sort_values('Cluster')
Knowledge Effort IsDeveloper IsTeaching IsStudent Enthusiasm Cluster
0 15 16 True False False 0.558944 0
36 12 16 True False False 0.558944 0
32 1 16 True False False -2.122546 0
37 7 32 True False False 0.558944 0
38 3 16 True False False 0.558029 0
27 6 16 True False False 0.246598 0
25 7 16 True False False 0.333842 0
42 3 16 True False False 0.558029 0
63 9 16 True False False 0.246598 0
45 7 16 True False True 0.555286 0
46 6 16 True False False 0.555286 0
47 6 16 True False False 0.555286 0
34 4 16 True False False -2.255501 0
50 9 16 True False False 0.544341 0
49 3 16 True False False 0.550721 0
6 4 16 True False False 0.558029 0
1 7 32 True False False 0.558944 0
2 6 16 True False False 0.558029 0
61 10 16 True False False 0.333842 0
14 6 16 True False False 0.544341 0
68 1 16 True False False -2.122546 0
70 2 16 True False False -2.255501 0
10 6 16 True False False 0.555286 0
11 4 16 True False False 0.555286 0
13 6 16 True False False 0.550721 0
9 6 16 True False True 0.555286 0
64 3 4 True False False -0.681784 1
53 1 4 True False True 0.536159 1
41 4 4 True False False 0.558029 1
62 7 4 True False False 0.277015 1
... ... ... ... ... ... ... ...
24 1 4 True False False 0.360152 1
23 4 4 True False False 0.514453 1
22 5 4 True False True 0.514453 1
21 10 4 True False True 0.514453 1
3 6 4 True False False 0.558029 1
4 10 4 True False False 0.558029 1
17 2 4 True False True 0.536159 1
16 5 4 True False False 0.536159 1
15 2 4 True False False 0.544341 1
5 7 4 True False False 0.558029 1
55 1 4 False True False 0.526190 2
31 15 4 False True False -1.789470 2
67 12 4 False True False -1.789470 2
33 2 16 False True False -2.255501 2
19 15 4 False True False 0.526190 2
65 5 16 True True False -1.231252 2
69 2 16 False True False -2.255501 2
29 3 16 True True False -1.231252 2
35 6 16 False False True -2.368870 3
54 1 16 False False True 0.526190 3
48 1 16 False False True 0.550721 3
44 14 4 False False True 0.558029 3
43 5 4 False False True 0.558029 3
20 1 16 False False True 0.526190 3
18 1 16 False False True 0.526190 3
12 2 16 False False True 0.550721 3
8 2 4 False False True 0.558029 3
7 6 4 False False True 0.558029 3
56 1 16 False False True 0.526190 3
71 5 16 False False True -2.368870 3

72 rows × 7 columns

From the initial looks of it, the clusters have the following characteristics:

  • 0 - High commitment Developers (36.1%)
  • 1 - Quick win Developers (36.1%)
  • 2 - Academics / Researchers (11.1%)
  • 3 - Unemployed students (16.7%)
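
For reference, the percentages above are just the relative cluster sizes; something along these lines reproduces them.

Code
# Relative cluster sizes, as percentages of the 72 respondents.
(df_['Cluster'].value_counts(normalize=True) * 100).round(1)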

AffinityPropagation is a clustering algorithm that doesn’t require the number of clusters as an input. I’m going to use it to estimate the number of clusters, because maybe I’m missing some.

Code
nr_clusters = np.unique(AffinityPropagation().fit_predict(df_num))
nr_clusters
array([0, 1, 2, 3, 4])

It seems that it finds roughly the same number of clusters. After some inspection (sketched just after the list below), it seems the model decided to split the ‘Academics’ into high and low commitment ones.

  • 1 - Quick win Developers (36.1%)
    • < 4h work / week
    • employed
  • 2 - Unemployed students (16.7%)
  • 3 - High commitment Developers (36.1%)
    • 16+ hours / week
    • employed
  • 4 - Academics (11.1%)
    • high commitment (> 16h/week) (5.6%)
    • low commitment (< 4h/week) (5.6%)
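
The inspection itself is nothing fancy: attach the AffinityPropagation labels to the data and look at the per-cluster feature means and sizes. A minimal sketch (not from the original notebook):

Code
# Attach the AffinityPropagation labels and inspect each cluster's profile.
df_ap = df_num.copy()
df_ap['Cluster'] = AffinityPropagation().fit_predict(df_num)
print(df_ap.groupby('Cluster').mean())                                  # per-cluster feature means
print((df_ap['Cluster'].value_counts(normalize=True) * 100).round(1))   # cluster sizes in %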

So there you go! I won’t draw any conclusions from the above findings, I’ll leave the interpretation of the results up to you, but people seem generally interested in ML ;)
