28 minute read

image.png

On the 26th of May, we had the European parliamentary elections. In Romania, the results and progress of the vote were published online in real time on the official electoral site.

As far as I know it’s the first time we had such data exposed to the public, and with such granularity.

Since my daily work involves working closely with data, I couldn’t miss the opportunity to get my hands on that dataset. Previously I’ve shown how I got the data.

In this second post I’ll try to do some quick analysis of it.

Loading the data

Just to remind ourselves what we’re working with.

This is the dataset:

Code
import pandas as pd
import numpy as np
df = pd.read_csv("_data/final.csv")
df.head()
liste_permanente lista_suplimentare total urna_mobila county_code county_name id_county id_locality id_precinct id_uat ... Femei 96 Femei 97 Femei 98 Femei 99 Femei 100 Femei 101 Femei 102 Femei 103 Femei 104 Femei 109
0 696 63 759 0 VS VASLUI 39 9015 16128 2936 ... 0 0 0 0 0 0 0 0 0 0
1 140 10 150 0 VS VASLUI 39 9015 16187 2936 ... 0 0 0 0 0 0 0 0 0 0
2 501 25 526 0 VS VASLUI 39 9006 16086 2933 ... 0 0 0 0 0 0 0 0 0 0
3 571 41 612 0 VS VASLUI 39 9006 16087 2933 ... 0 0 0 0 0 0 0 0 0 0
4 680 55 736 1 VS VASLUI 39 9006 16088 2933 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 241 columns

It has 19k rows (places where voting was held).

df.shape
(19171, 242)

Some descriptive statistics about the dataset.

Code
df.describe(include="all")
liste_permanente lista_suplimentare total urna_mobila county_code county_name id_county id_locality id_precinct id_uat ... Femei 96 Femei 97 Femei 98 Femei 99 Femei 100 Femei 101 Femei 102 Femei 103 Femei 104 Femei 109
count 19171.000000 19171.000000 19171.000000 19171.000000 19171 19171 19171.000000 19171.000000 19171.000000 19171.000000 ... 19171.000000 19171.000000 19171.000000 19171.000000 19171.000000 19171.000000 19171.000000 19171.000000 19171.000000 19171.000000
unique NaN NaN NaN NaN 43 43 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
top NaN NaN NaN NaN B MUNICIPIUL BUCUREŞTI NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN 1269 1269 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean 402.484847 81.316468 486.681863 2.880549 NaN NaN 22.577852 5338.597517 9590.956132 1674.926817 ... 0.008137 0.004016 0.001982 0.001565 0.000782 0.000156 0.000365 0.000052 0.000104 0.000052
std 234.090739 162.995607 279.595538 12.756191 NaN NaN 13.044442 3033.239896 5536.418093 992.618029 ... 0.093818 0.064069 0.044478 0.039528 0.027962 0.012509 0.019106 0.007222 0.010214 0.007222
min 0.000000 0.000000 4.000000 0.000000 NaN NaN 1.000000 1.000000 1.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 204.000000 27.000000 260.000000 0.000000 NaN NaN 11.000000 2790.500000 4798.500000 807.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 401.000000 44.000000 470.000000 0.000000 NaN NaN 23.000000 5355.000000 9591.000000 1681.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 582.000000 76.000000 672.000000 0.000000 NaN NaN 34.000000 8051.000000 14383.500000 2554.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 1263.000000 2577.000000 2577.000000 314.000000 NaN NaN 43.000000 10419.000000 19198.000000 3283.000000 ... 2.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

11 rows × 242 columns

And all the available columns we have.

Code
list(df.columns)
[
 'liste_permanente', 'lista_suplimentare', 'total', 'urna_mobila', 'county_code', 'county_name', 'id_county', 'id_locality', 'id_precinct', 'id_uat', 'latitude', 'locality_name', 'longitude', 'medium', 'precinct_name', 'precinct_nr', 'presence', 'siruta', 'uat_code', 'uat_name', 'men_18_24', 'men_25_34', 'men_35_44', 'men_45_64', 'men_65+', 'women_18_24', 'women_25_34', 'women_35_44', 'women_45_64', 'women_65+', 'Cod birou electoral', 'Localitate_x', 'Secție', 'Tip', 'Total alegatori', 'Total lista permanenta', 'Total urna mobila', 'Total prezenti', 'Prezenti lista permanenta', 'Prezenti urna mobila', 'Prezenti lista suplimentara', 'Total voturi', 'Voturi nefolosite', 'Voturi valabile', 'Voturi anulate', 'Contestatii', 'Starea sigiliilor', 'PSD', 'USR-PLUS', 'PRO Romania', 'UDMR', 'PNL', 'ALDE', 'PRODEMO', 'PMP', 'Partidul Socialist Roman', 'Partidul Social Democrat Independent', 'Partidul Romania Unita', 'Uniunea Nationala Pentur Progresul Romaniei', 'Blocul Unitatii Nationale', 'Gregoriana-Carmen Tudoran', 'George-Nicaolae Simion', 'Peter Costea', 'Siruta', 'Votanti lista', 'Barbati 18', 'Barbati 19',... 'Barbati 100', 'Barbati 101', 'Barbati 102', 'Barbati 103', 'Barbati 105', 'Barbati 107', 'Barbati 111', 'Femei 18', 'Femei 19', 'Femei 20', 'Femei 21', 'Femei 22',... 'Femei 99', 'Femei 100', 'Femei 101', 'Femei 102', 'Femei 103', 'Femei 104', 'Femei 109'# Score validations

The first thing I’d like to see is the percentage of votes that each party has received.

Code
def score(party):
    return df[party].sum() / df['total'].sum()

parties = ["PSD", "USR-PLUS", "PRO Romania", "UDMR", "PNL", "ALDE", "PMP",]
sorted([(party, np.round(score(party)*100, 2)) for party in parties], reverse=True, key=lambda x: x[1])
[('PNL', 26.25),
 ('PSD', 21.87),
 ('USR-PLUS', 21.74),
 ('PRO Romania', 6.26),
 ('PMP', 5.6),
 ('UDMR', 5.11),
 ('ALDE', 4.0)]

Officially (according to here) we have the following results.

  • PNL 27,0%
  • PSD 22,51%
  • ALIANȚA 2020 USR-PLUS 22,36%
  • PRO ROMÂNIA 6,55%
  • PMP 5,66%
  • UDMR 5,44%
  • ALDE 4,0%
  • Others 6,1%

This is a bit at odds with the results that I’ve got from my computations. I’d like to see exactly by how much..

Investigating some discrepancies in the numbers

I’m going to compute the difference between my numbers and the official numbers in two ways:

  • absolute improvement, meaning the amount of percentage points that each party has changed with
  • relative score increase (i.e. how much did the absolute improvement above meant for the scores that each party got)
Code
my_results = {party: np.round(score(party)*100, 2) for party in parties}
official_results = {
    "PNL": 27.0,
    "PSD": 22.51,
    "USR-PLUS": 22.36,
    "PRO Romania": 6.55,
    "PMP": 5.66,
    "UDMR": 5.44,
    "ALDE": 4.0,
    "Others": 6.1,
}

sorted([(
    party, 
    f"+{np.round((official_results[party] - my_results[party]) / my_results[party] * 100 , 2)}% relative score increase",
    f"+{np.round((official_results[party] - my_results[party]), 2)}% absolute improvement"
) for party in parties], key=lambda x: x[1], reverse=True)
[('UDMR', '+6.46% relative score increase', '+0.33% absolute improvement'),
 ('PRO Romania',
  '+4.63% relative score increase',
  '+0.29% absolute improvement'),
 ('PSD', '+2.93% relative score increase', '+0.64% absolute improvement'),
 ('PNL', '+2.86% relative score increase', '+0.75% absolute improvement'),
 ('USR-PLUS', '+2.85% relative score increase', '+0.62% absolute improvement'),
 ('PMP', '+1.07% relative score increase', '+0.06% absolute improvement'),
 ('ALDE', '+0.0% relative score increase', '+0.0% absolute improvement')]
score_changes = pd.DataFrame.from_records([{"party":party, 
    "relative score improvement": np.round((official_results[party] - my_results[party]) / my_results[party] * 100 , 2),
    "absolute improvement": np.round((official_results[party] - my_results[party]), 2)
 } for party in parties])

score_changes = score_changes[["party", "absolute improvement", "relative score improvement"]]
score_changes.sort_values(by="relative score improvement", ascending=False)
party absolute improvement relative score improvement
3 UDMR 0.33 6.46
2 PRO Romania 0.29 4.63
0 PSD 0.64 2.93
4 PNL 0.75 2.86
1 USR-PLUS 0.62 2.85
6 PMP 0.06 1.07
5 ALDE 0.00 0.00

So it seems that in the end, the change between the official results and the one I’ve calculated for (example) UDMR meant a boost of +6.4% of their score ( from 5.11 to 5.44 ). In absolute terms they’ve only gained 0.33%, but that 0.33% increase meant a 6.4% boost when you consider that they had 5.11% to begin with.

From the data, it seems that virtually all the parties benefited from some form of increase in final scores. I’m unsure to what this is due. The data already counts in a separate row the invalidated votes and I suspect that for each party only the valid votes are counted. So in essence, invalidated vote shouldn’t be the cause of these changes. We’ve also counted the data form the foreign offices.

Column correlations

In the following sections we will analyze the dependence of various columns to others in order and discuss what the results mean (by using correlation scores).

What is a correlation

A correlation is a statistical measure that shows how likely are to variables to move in sync. This is usually due to a common underlying factor or in other cases because one of the variables influences the other (causation).

Bear in mind that causation is not the same as correlation! Sometimes, if two variables are correlated it might be because they have a cause and effect relation, but it’s not a given that if we observe correlation we have causation. For example, read this seminal paper that shows some great fallacies we might end up with if we consider that correlation means causation.

So please keep an eye on this statement while reading the sections bellow.

Types of correlations

Two of the most common type of correlation measures are the Pearson correlation coefficient and Spearman’s rank correlation.

Both of them show how likely two variables are correlated, but using a slightly different approach and thus, have a different interpretation of their results:

Pearson correlation

..evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other variable.

For example, you might use a Pearson correlation to evaluate whether increases in temperature at your production facility are associated with decreasing thickness of your chocolate coating.

Spearman rank-order correlation

..evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data.

Spearman correlation is often used to evaluate relationships involving ordinal variables. For example, you might use a Spearman correlation to evaluate whether the order in which employees complete a test exercise is related to the number of months they have been employed.

Pearson measures (for two data series F1 and F2) how well a change of size a in the series F1 leads to a change of size C * a (proportionally large) in F2 when C is a constant

Spearman measures how common is for two variables to move in the same direction (not by how much)

Generic correlations between columns

As a first example, we will take all the columns of our dataset and print the pairs with the highest correlation between them. Note that we’ve excluded from the results the (A, A) pairs that will have a score of 1 (naturally).

Code
_corr = df[[column for column in df.columns if "Barbati" not in column and "Femei" not in column and "id_" not in column]].corr()
_correlations = _corr.fillna(0).values.flatten()

_pairs = [(_idx // _corr.shape[0], _idx % _corr.shape[0], _correlations[_idx]) for _idx in np.where(_correlations > 0.5)[0]]
_pairs = {(i, j, _corr_val) if i < j else (j, i, _corr_val) for i, j, _corr_val in _pairs if i != j}
_pairs = {(_corr_val, _corr.columns[i], _corr.columns[j]) for i, j, _corr_val in _pairs}
sorted(_pairs, reverse=True)
    [
     (0.9999988405995148, 'Total alegatori', 'Total lista permanenta'),
     (0.9998662230466443, 'Total alegatori', 'Votanti lista'),
     (0.9998609492610934, 'Total lista permanenta', 'Votanti lista'),
     (0.9986963671087307, 'Total prezenti', 'Voturi valabile'),
     (0.998433713220943, 'liste_permanente', 'Prezenti lista permanenta'),
     (0.9981585374630918, 'total', 'Voturi valabile'),
     (0.9978683615024313, 'total', 'Total prezenti'),
     (0.9933939092376602, 'lista_suplimentare', 'Prezenti lista suplimentara'),
     (0.9847303875910568, 'siruta', 'Cod birou electoral'),
     (0.9363222144727064, 'Total voturi', 'Voturi nefolosite'),
     (0.9340859461869343, 'men_25_34', 'women_25_34'),
     (0.9207424840375259, 'women_35_44', 'Voturi valabile'),
     (0.9199551991340589, 'total', 'women_35_44'),
     (0.9189659926817213, 'women_35_44', 'Total prezenti'),
     (0.9163024438767253, 'men_45_64', 'women_45_64'),
     (0.9111562909051516, 'men_35_44', 'women_35_44'),
     (0.9107249415201683, 'liste_permanente', 'Votanti lista'),
     (0.9106335321830499, 'liste_permanente', 'Total alegatori'),
     (0.9106224412691745, 'liste_permanente', 'Total lista permanenta'),
     (0.908997006641406, 'Prezenti lista permanenta', 'Votanti lista'),
     (0.9089103759743806, 'Total alegatori', 'Prezenti lista permanenta'),
     (0.9088995061839914, 'Total lista permanenta', 'Prezenti lista permanenta'),
     (0.9074301748600953, 'Total urna mobila', 'Prezenti urna mobila'),
     (0.9039615687968309, 'total', 'men_45_64'),
     (0.9037799402851376, 'total', 'women_45_64'),
     (0.9030713989974236, 'men_65+', 'women_65+'),
     (0.9022989994403603, 'men_45_64', 'Total prezenti'),
     (0.8998027165804341, 'women_45_64', 'Total prezenti'),
     (0.8991597124396662, 'women_45_64', 'Voturi valabile'),
     (0.8985049758239619, 'men_45_64', 'Voturi valabile'),
     (0.8967270820273472, 'men_18_24', 'women_18_24'),
     (0.8908618299795728, 'men_35_44', 'Voturi valabile'),
     (0.8888218041604524, 'women_25_34', 'USR-PLUS'),
     (0.8887240874383147, 'men_35_44', 'Total prezenti'),
     ...]

So you can see some obvious correlations at the top, like, ('Total alegatori', 'Total lista permanenta'). This means that the total number of votes is strongly correlated with the number of people allowed to vote, and this makes sense:

  • whenever you have a higher number of people on the voting lists, you will end up with a higher number (in absolute terms) of people who actually voted (spearman). Even more than this, the increase in total voters is proportional with the number of eligible votes (pearson).

The above correlation coefficients were computed using the pearson method, which is the pandas default setting.

Further down the lists you start to see some interesting results:

(0.8888218041604524, 'women_25_34', 'USR-PLUS')
..
(0.7261606942057695, 'men_45_64', 'PNL')

These indicate that when we have an increase in each category, we have a proportional increase if the other category. In other words, women_25_34 high attendance resulted in proportionally higher votes for USR-PLUS.

One conclusion you could draw from the above is that, by a large margin (0.88 correlation), women_25_34 voted for USR-PLUS. To a somewhat leaser extent, but by the same reasoning, men_45_64 voted for PNL.

Correlation matrix

Since a correlation compares two variables at a time, we can compute the scores for all the possible pairs. We end up with a symmetric matrix whose diagonal values are all 1 (the correlation of the pair (column X, column X) is always 1).

For example, this is the correlation matrix that we get as a result of comparing all the demographic buckets to each other.

Code
df.iloc[:,range(20, 30)].corr().fillna(0)
men_18_24 men_25_34 men_35_44 men_45_64 men_65+ women_18_24 women_25_34 women_35_44 women_45_64 women_65+
men_18_24 1.000000 0.701936 0.616640 0.561348 0.338181 0.896727 0.687798 0.608060 0.532537 0.352328
men_25_34 0.701936 1.000000 0.872921 0.633216 0.222204 0.643599 0.934086 0.761192 0.571337 0.247303
men_35_44 0.616640 0.872921 1.000000 0.761757 0.362060 0.547091 0.874717 0.911156 0.704520 0.370324
men_45_64 0.561348 0.633216 0.761757 1.000000 0.625348 0.491986 0.658756 0.795008 0.916302 0.571230
men_65+ 0.338181 0.222204 0.362060 0.625348 1.000000 0.313547 0.286544 0.478729 0.670303 0.903071
women_18_24 0.896727 0.643599 0.547091 0.491986 0.313547 1.000000 0.666430 0.572266 0.501070 0.343935
women_25_34 0.687798 0.934086 0.874717 0.658756 0.286544 0.666430 1.000000 0.845163 0.658797 0.320458
women_35_44 0.608060 0.761192 0.911156 0.795008 0.478729 0.572266 0.845163 1.000000 0.813859 0.499351
women_45_64 0.532537 0.571337 0.704520 0.916302 0.670303 0.501070 0.658797 0.813859 1.000000 0.646758
women_65+ 0.352328 0.247303 0.370324 0.571230 0.903071 0.343935 0.320458 0.499351 0.646758 1.000000

What each party is correlated with

Now that we know what a correlation is and how to obtain it, we will compute the full (pearson in this case) correlation matrix for all columns, and then get the top 10 most correlated columns for each party. The top most correlated column will always be itself (the diagonal value) so we will continue from 1.

Code
_corr = df.corr().fillna(0)
_corr['PSD'].sort_values(ascending=False)[1:10]
PSD
===========================
men_65+                      0.617477
women_65+                    0.544377
Prezenti lista permanenta    0.523147
liste_permanente             0.517126
men_45_64                    0.502112
Votanti lista                0.485241
Total alegatori              0.485193
Total lista permanenta       0.485173
Voturi anulate               0.474736

For PSD it seems that the top two are men_65+ and women_65+.

Code
_corr['PNL'].sort_values(ascending=False)[1:10]
PNL
===========================
men_45_64          0.726161
Total prezenti     0.722173
Voturi valabile    0.719738
total              0.718390
men_35_44          0.674777
women_35_44        0.649057
Total voturi       0.641204
women_45_64        0.640049
women_25_34        0.606771
Code
_corr['Voturi anulate'].sort_values(ascending=False)[1:10]
Voturi anulate
===========================
Votanti lista                0.475865
Total alegatori              0.475692
Total lista permanenta       0.475680
PSD                          0.474736
men_45_64                    0.459099
Prezenti lista permanenta    0.422869
liste_permanente             0.421464
men_65+                      0.394126
total                        0.387026

Of course there is a certain threshold bellow which, the numbers might mean noise but this number is dataset related. In our case, after working a bit with this dataset, I’d say that 0.5 is a good one for pearson and 0.6 for spearman. Anything bellow we ignore.

We will list bellow the top 3 correlated columns, with each individual column using both methods. For some columns it’s not worth showing the results because they are meaningless (id_* columns, totals columns and the like) so we will skip these.

Code
na_columns = df.columns[df.isna().sum() > 0]
non_na_columns = df.columns.difference(na_columns)
non_na_columns

def correlations(corr_matrix, column, irrelevant_score_threshold=0.4):
    """
    Returns the top 3 most correlated columns with the given one, in descending order.
    For columns that have a correlation lower than 0.4 we ignore the results and consider the 
    coefficient score irrelevant.
    """
    _corr = corr_matrix[column][corr_matrix[column] > irrelevant_score_threshold].sort_values(ascending=False)[1:4]
    return [(name, round(score, 2)) for name, score in _corr.items() if "id_" not in name]

def is_column_ignored(column):
    """"We exclude some columns because the output is too verbose"""
    return \
        "id_" in column or \
        "total" in column.lower() or \
        "Barbati" in column or \
        "Femei" in column or \
        column.lower().startswith("list") or \
        column == "siruta"
_corr = df[non_na_columns]._get_numeric_data().corr(method="pearson")
for column in _corr.columns:
    if is_column_ignored(column): continue
    correlated_var = correlations(_corr, column, irrelevant_score_threshold=0.5)
    if correlated_var:
        print(f"{column} -> {correlated_var}")
Pearson correlation pairs
=========================
George-Nicaolae Simion -> [('USR-PLUS', 0.7), ('men_35_44', 0.68), ('Voturi valabile', 0.68)]
Gregoriana-Carmen Tudoran -> [('USR-PLUS', 0.58), ('Voturi valabile', 0.56), ('total', 0.56)]
PMP -> [('Voturi valabile', 0.67), ('total', 0.67), ('Total prezenti', 0.67)]
PNL -> [('men_45_64', 0.73), ('Total prezenti', 0.72), ('Voturi valabile', 0.72)]
PRO Romania -> [('women_45_64', 0.63), ('liste_permanente', 0.61), ('Prezenti lista permanenta', 0.6)]
PSD -> [('men_65+', 0.62), ('women_65+', 0.54), ('Prezenti lista permanenta', 0.52)]
Partidul Romania Unita -> [('Voturi valabile', 0.57), ('Total prezenti', 0.57), ('total', 0.57)]
Prezenti lista permanenta -> [('liste_permanente', 1.0), ('Votanti lista', 0.91), ('Total alegatori', 0.91)]
Prezenti lista suplimentara -> [('lista_suplimentare', 0.99), ('men_25_34', 0.74), ('men_35_44', 0.71)]
Prezenti urna mobila -> [('Total urna mobila', 0.91)]
USR-PLUS -> [('women_25_34', 0.89), ('men_25_34', 0.85), ('Voturi valabile', 0.83)]
Votanti lista -> [('Total alegatori', 1.0), ('Total lista permanenta', 1.0), ('liste_permanente', 0.91)]
Voturi nefolosite -> [('Total voturi', 0.94), ('men_35_44', 0.62), ('Total prezenti', 0.59)]
Voturi valabile -> [('Total prezenti', 1.0), ('total', 1.0), ('women_35_44', 0.92)]
men_18_24 -> [('women_18_24', 0.9), ('Barbati 21', 0.87), ('Barbati 22', 0.86)]
men_25_34 -> [('women_25_34', 0.93), ('Barbati 30', 0.91), ('Barbati 31', 0.91)]
men_35_44 -> [('women_35_44', 0.91), ('Voturi valabile', 0.89), ('Total prezenti', 0.89)]
men_45_64 -> [('women_45_64', 0.92), ('total', 0.9), ('Total prezenti', 0.9)]
men_65+ -> [('women_65+', 0.9), ('Prezenti lista permanenta', 0.75), ('liste_permanente', 0.75)]
women_18_24 -> [('Femei 21', 0.9), ('men_18_24', 0.9), ('Femei 20', 0.89)]
women_25_34 -> [('men_25_34', 0.93), ('Femei 29', 0.89), ('Femei 31', 0.89)]
women_35_44 -> [('Voturi valabile', 0.92), ('total', 0.92), ('Total prezenti', 0.92)]
women_45_64 -> [('men_45_64', 0.92), ('total', 0.9), ('Total prezenti', 0.9)]
women_65+ -> [('men_65+', 0.9), ('Femei 69', 0.76), ('Femei 70', 0.75)]

We can see that George-Nicolae Simion, Gregoriana-Carmen Tudoran are correlated with USR-PLUS. My view is that these three parties catered to the same kind of demographics. So whenever we had an increase in these kind of people, all three rose with the same velocity.

USR-PLUS specifically though, is mostly correlated (by far) with women_25_34 and men_25_34. Even more, being also highly correlated with Voturi valabile means that whenever there were more eligible votes its score rose. Since the increase in votes leads to better scores for USR+ and increase votes means larger and larger cities, we can deduce that USR-PLUS was especially well represented as cities got larger, and the population density increased.

Other interesting results worth highlighting:

  • PSD -> [('men_65+', 0.62), ('women_65+', 0.54)]
  • PRO Romania -> [('women_45_64', 0.63)]
  • PNL -> [('men_45_64', 0.73)
  • UDMR isn’t present in the above because it seems none of the columns have a high correlation with it.

Interesting how, on the 45-65 age range, women choose to vote PRO Romania whereas men, PNL.

Code
_corr = df[non_na_columns]._get_numeric_data().corr(method="spearman")
for column in _corr.columns:
    if is_column_ignored(column): continue
    correlated_var = correlations(_corr, column, irrelevant_score_threshold=0.6)
    if correlated_var:
        print(f"{column} -> {correlated_var}")
Spearman correlation pairs
=========================
ALDE -> [('Voturi valabile', 0.65), ('total', 0.65), ('Total prezenti', 0.65)]
George-Nicaolae Simion -> [('USR-PLUS', 0.8), ('Gregoriana-Carmen Tudoran', 0.76), ('PMP', 0.75)]
Gregoriana-Carmen Tudoran -> [('USR-PLUS', 0.78), ('George-Nicaolae Simion', 0.76), ('PMP', 0.74)]
PMP -> [('USR-PLUS', 0.83), ('Voturi valabile', 0.79), ('total', 0.79)]
PNL -> [('men_45_64', 0.74), ('total', 0.74), ('Total prezenti', 0.74)]
PRO Romania -> [('women_45_64', 0.76), ('PMP', 0.75), ('USR-PLUS', 0.75)]
PRODEMO -> [('PRO Romania', 0.6)]
PSD -> [('men_65+', 0.68), ('women_65+', 0.64), ('Prezenti lista permanenta', 0.63)]
Partidul Romania Unita -> [('USR-PLUS', 0.69), ('PMP', 0.66), ('George-Nicaolae Simion', 0.65)]
Peter Costea -> [('USR-PLUS', 0.68), ('women_25_34', 0.63), ('Voturi valabile', 0.63)]
Prezenti lista permanenta -> [('liste_permanente', 1.0), ('Votanti lista', 0.92), ('Total alegatori', 0.92)]
Prezenti lista suplimentara -> [('lista_suplimentare', 0.98), ('men_25_34', 0.63), ('Voturi valabile', 0.62)]
Prezenti urna mobila -> [('Total urna mobila', 0.86)]
USR-PLUS -> [('Voturi valabile', 0.86), ('total', 0.85), ('Total prezenti', 0.85)]
Uniunea Nationala Pentur Progresul Romaniei -> [('USR-PLUS', 0.64), ('PMP', 0.62), ('Voturi valabile', 0.6)]
Votanti lista -> [('Total alegatori', 1.0), ('Total lista permanenta', 1.0), ('liste_permanente', 0.92)]
Voturi nefolosite -> [('Total voturi', 0.93), ('Votanti lista', 0.82), ('Total lista permanenta', 0.82)]
Voturi valabile -> [('total', 1.0), ('Total prezenti', 1.0), ('women_35_44', 0.95)]
men_18_24 -> [('women_18_24', 0.87), ('total', 0.85), ('Total prezenti', 0.85)]
men_25_34 -> [('women_25_34', 0.94), ('total', 0.92), ('Voturi valabile', 0.92)]
men_35_44 -> [('women_35_44', 0.95), ('total', 0.95), ('Voturi valabile', 0.95)]
men_45_64 -> [('women_45_64', 0.94), ('total', 0.93), ('Total prezenti', 0.93)]
men_65+ -> [('women_65+', 0.92), ('Prezenti lista permanenta', 0.79), ('liste_permanente', 0.79)]
women_18_24 -> [('men_18_24', 0.87), ('men_25_34', 0.86), ('women_25_34', 0.86)]
women_25_34 -> [('men_25_34', 0.94), ('total', 0.93), ('Voturi valabile', 0.93)]
women_35_44 -> [('men_35_44', 0.95), ('total', 0.95), ('Voturi valabile', 0.95)]
women_45_64 -> [('Voturi valabile', 0.95), ('total', 0.95), ('Total prezenti', 0.95)]
women_65+ -> [('men_65+', 0.92), ('Prezenti lista permanenta', 0.79), ('liste_permanente', 0.79)]

In the spearman analysis, we can see there is an even stronger effect of the same-people-overlap of the following: [George-Nicolae Simion, Gregoriana-Carmen Tudoran, PRO Romania, Partidul Romania Unita, Peter Costea, Uniunea Nationala Pentur Progresul Romaniei] linked to USR-PLUS. So whenever USR-PLUS scores increased, these candidate’ scores increased as well. We can assume that these being all niche candidates they were mostly voted by progressives seeking out alternatives to the older, more established parties.

It’s interesting to observe PRODEMO -> [('PRO Romania', 0.6)] so an increase in PRO Romania means an increase in PRODEMO. This might be due to a confusion of naming, both starting with PRO, and due to the higher scores obtained by PRO Romania and given the fact that it’s being led by a former prime-minister, I’d say PRODEMO were the ones that gained more votes from this confusion (and not the other way around).

Again, interesting how, on the 45-65 age range, women choose to vote PRO Romania whereas men, PNL.

PNL -> [('men_45_64', 0.74)]
PRO Romania -> [('women_45_64', 0.76)]

Party correlation with each demographic category

Using all the columns is somewhat distracting because some columns (like the totals are obscuring other possible patterns). In this section we’re mainly interested to see which demographic category (men and women of various ages) are correlated with each party. We will only do computations on the main parties that got scores above > 4%.

What we do is compute the correlation between each party and all the granular demographic data that we have, and then show the top 10 correlated columns for each party.

Code
only_granular_demographic_columns = [column for column in df.columns if "Barbati" in column or "Femei" in column]
_granular_demographics = df[parties+only_granular_demographic_columns].corr(method='pearson').fillna(0)
from IPython.display import display

def correlations(corr_matrix, party):
    _corr = corr_matrix[party].sort_values(ascending=False)[1:10]
    return _corr

for party in parties:
    display(f"{party}", correlations(_granular_demographics, party))
PSD
===========================
Barbati 69    0.464600
Femei 69      0.452073
Barbati 70    0.449218
Barbati 68    0.439194
Barbati 71    0.437439
Barbati 66    0.429882
Femei 70      0.424870
Barbati 67    0.422452
Barbati 72    0.422372

USR-PLUS
===========================
Femei 31      0.790589
Barbati 34    0.789017
Femei 29      0.787429
Femei 32      0.787392
Femei 33      0.787345
Femei 30      0.786208
Femei 34      0.784999
Barbati 33    0.784203
Barbati 32    0.769324

PRO Romania
===========================
Femei 51      0.575745
Femei 50      0.519486
Barbati 51    0.496390
Femei 49      0.488390
Femei 63      0.483246
Femei 61      0.481602
Femei 62      0.481364
Femei 60      0.473459
Femei 59      0.469298

UDMR
===========================
Femei 75      0.184590
Femei 77      0.183656
Femei 74      0.172165
Femei 76      0.162428
Barbati 75    0.150304
Barbati 77    0.146041
Barbati 74    0.144867
Barbati 76    0.125835
Femei 67      0.120261

PNL
===========================
Barbati 44    0.606405
Barbati 46    0.599908
Barbati 45    0.598910
Barbati 43    0.597983
Barbati 48    0.595641
Barbati 47    0.594866
Barbati 49    0.591697
Barbati 42    0.589587
Barbati 50    0.588989

ALDE
===========================
Femei 51      0.337525
Barbati 51    0.333236
Femei 50      0.312419
Femei 49      0.293341
Barbati 50    0.284151
Femei 48      0.282056
Femei 44      0.281466
Femei 63      0.280512
Femei 64      0.280196

PMP
===========================
USR-PLUS    0.593511
Femei 51    0.566518
Femei 50    0.538230
Femei 34    0.537767
Femei 39    0.536655
Femei 38    0.536526
Femei 35    0.534226
Femei 36    0.532233
Femei 42    0.532082

So by age, the 10 most correlated groups for each party indicate the main support demographics (obvious to read and understand the results).

Joint correlation graphs

We can read the above results but is rather hard to make an idea of the whole picture at once. In this section we will put all the numbers above on a single graph (one for males and one for females) s follows:

  • each age will represent a bar
  • it will be a stacked percentage bar plot, so
    • the bar will be made out of multiple segments, each representing one party
    • the sum of the dimensions of the segments will be equals for all ages and so, the segment heights will actually be percentages of 100% (total voted for each age)
    • only the significant result will be included (correlation > 0.4)

For easier interpretation we will add some dominant colors for each party.

Code
party_colors = {
    "PNL": "xkcd:yellow", 
    "PSD": "r", 
    "USR-PLUS": "xkcd:sky blue", 
    "PMP": "xkcd:yellowgreen",
    "UDMR": "g",
    "ALDE": "xkcd:darkblue",
    "PRO Romania": "xkcd:orange"
}
_males_columns = [column for column in only_granular_demographic_columns if "Barbati" in column]
_female_columns = [column for column in only_granular_demographic_columns if "Femei" in column]

def drop_constant_column(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    return dataframe.loc[:, (dataframe != dataframe.iloc[0]).any()]

def male_female_series(corr_matrix, party, correlation_threshold=0.4):
    _party_demographics = correlations(corr_matrix, party)[only_granular_demographic_columns]
    _party_males = _party_demographics[_males_columns].copy()
    _party_females = _party_demographics[_female_columns].copy()
    _party_males.loc[_party_males < correlation_threshold] = 0
    _party_females.loc[_party_females < correlation_threshold] = 0
    return _party_males, _party_females
    
def process_gender(_males, column_prefix="Barbati "):
    _males = pd.DataFrame({party: _series for party, _series in zip(parties, _males)}).T
    _males = drop_constant_column(_males)
    _males.columns = [int(column.replace(column_prefix, "")) for column in _males.columns]
    return _males

def get_gender_correlations(_granular_demographics, correlation_threshold=0.4):
    _male_females = [male_female_series(_granular_demographics, party, correlation_threshold=correlation_threshold) for party in parties]
    males = list(zip(*_male_females))[0]
    females = list(zip(*_male_females))[1]

    males = process_gender(males, column_prefix="Barbati ")
    females = process_gender(females, column_prefix="Femei ")
    
    return males, females
from cycler import cycler
import matplotlib.pyplot as plt

def stacked_percentage_bar_plot(_df, title): 
    _df = (_df / _df.sum()).T
    bar_l = range(_df.shape[0])

    cm = plt.get_cmap('nipy_spectral')

    f, ax = plt.subplots(1, figsize=(15,15))
    ax.set_prop_cycle(cycler('color', [party_colors[party] for party in parties]))

    bottom = np.zeros_like(bar_l).astype('float')
    for i, column in enumerate(_df.columns):
        ax.bar(bar_l, _df[column], bottom = bottom, label=column)
        bottom += _df[column].values

    ax.set_xticks(bar_l)
    ax.set_xticklabels(_df.T.columns, rotation=90, size='x-small')
    ax.legend(ncol=2, fontsize='x-small')
    f.subplots_adjust(right=0.75, bottom=0.4)
    ax.set_title(title)
    f.show()

    
def plot_graphs(df, method='pearson', correlation_threshold=0.4):
    _correlation_matrix = df.corr(method=method).fillna(0)
    males, females = get_gender_correlations(_correlation_matrix, correlation_threshold=correlation_threshold)
    stacked_percentage_bar_plot(males, "Males")
    stacked_percentage_bar_plot(females, "Females")
plot_graphs(df[parties+only_granular_demographic_columns], method="pearson", correlation_threshold=0.4)

So here we have the results:

png

png

So, my interpretation:

  • PNL had a surprising grip on the 18 year olds
  • USR-PLUS took most of the votes for ages 19-20 in both men and women
  • until the age of 36, the votes were (mostly) split between USR-PLUS and PNL with USR-PLUS starting strong and gradually decreasing.
  • PRO Romania was strongly voted by women aged between 37 and 68 (30 year span), and men aged 57-67 (only 10 year span)
  • PSD was mainly voted by the elderly (60+) in both men and women, but women between 60 and 64 more likely to have voted for PRO Romania instead.
  • PMP had a steady fan-base of people between 25-64 in both women and men (but as percentages show this is makes for aprox. 5% of the electorate).

The above were computed on the perason correlation scores. We can do the same with the spearman method.

Code
plot_graphs(df[parties+only_granular_demographic_columns], method="spearman", correlation_threshold=0.5)

png

png

These graphs show ALDE for the first time, and it seems to mostly be liked by women.

In both men and women there’s something happening at the age of 51, where all of a sudden we see PSD. After this, there is a gap of around 11 years in both men and women until we see again a correlation with PSD.

It’s possible that this is the exact age of people born during the natality boom Romania had in the 2 years that followed the Decree 770. The first year this decree started to have an effect on, was 1967 and the peak lasted to around 2 years. The people born between 1967 and 1970 should have between 50 and 52 years now, and should be almost twice as many as the people aged 54-56.

Depending on when in 1967 each person was born, on 26 may one may have been either 51 or 52 [1967 + (51, 52) = (2018, 2019)]

Drawing

Correlation with bucketed demographics

Since we also have age buckets already computed (men_25_34, etc..) we will also print the correlations of parties relative to these only. The interpretation is easy to make.

Code
from IPython.display import display

def correlations(corr_matrix, party):
    _corr = corr_matrix.T[_bucketed_demographics_columns].T[party].sort_values(ascending=False)[1:]
    return _corr[_corr > 0.5]


_bucketed_demographics_columns = [column for column in df.columns if "men_" in column or "women_" in column]
_bucketed_demographics = df[parties+_bucketed_demographics_columns].corr().fillna(0)
for party in parties:
    display(f"{party}", correlations(_bucketed_demographics, party))
    print("=====================")
PSD
===========================
women_65+    0.544377
men_45_64    0.502112

USR-PLUS
===========================
men_25_34      0.850179
men_35_44      0.826530
women_35_44    0.819073
men_18_24      0.683041
women_18_24    0.672514
women_45_64    0.647503
men_45_64      0.595842

PRO Romania
===========================
men_45_64      0.561500
women_35_44    0.522041

UDMR
===========================
Series([])

PNL
===========================
men_35_44      0.674777
women_35_44    0.649057
women_45_64    0.640049
women_25_34    0.606771
men_25_34      0.606319
men_18_24      0.537984

ALDE
===========================
Series([])

PMP
===========================
women_35_44    0.632382
men_45_64      0.620332
men_35_44      0.586705
women_25_34    0.573251
men_25_34      0.529779

Inter-party correlation

We’ve explored in some detail the effect of age and gender on the vote results associated with each candidate.

In this section I’m going to try to plot the correlation that each party had, relative to the other parties. By doing this I’m trying to by how much, each party attracts the same kind of votes. So by inference, by how much each two parties share the same message / carters to the same group of interests.

In order to do this I’ll use a dendogram plot, where at each step, the two most similar elements are linked together.

Code
import scipy
from scipy.cluster import hierarchy as hc
from matplotlib import pyplot as plt

def plot_correlation(_df, method='pearson'):
    # compute the correlation matrix
    corr = _df.corr(method=method)
    corr.fillna(value=0, inplace=True)
    
    # compress into a condensed format (only the top triangle of a symetric matrix)
    corr_condensed = hc.distance.squareform(1-corr, checks=False)
    
    # compute the linakages (what is closer to what) 
    z = hc.linkage(corr_condensed, method='average')
    fig = plt.figure(figsize=(10,10))
    plt.title(f"{method.title()} correlation dendogram")
    dendrogram = hc.dendrogram(z, labels=corr.columns, orientation='left', leaf_font_size=16)
    plt.show()
    
plot_correlation(df[parties], method='pearson')

png

Code
_party_correlation = df[parties].corr().fillna(0)

def correlations(corr_matrix, party):
    _corr = corr_matrix[party].sort_values(ascending=False)[1:]
    return _corr
#     return _corr[_corr > 0.5]

for party in parties:
    display(f"{party}", correlations(_party_correlation, party))
    print("=====================")
PSD
===========================
PRO Romania    0.382307
ALDE           0.261416
PMP            0.218082
PNL            0.206330
USR-PLUS       0.079658
UDMR          -0.223707

USR-PLUS
===========================
PMP            0.593511
PNL            0.501996
PRO Romania    0.418803
ALDE           0.231465
PSD            0.079658
UDMR          -0.042722

PRO Romania
===========================
PMP         0.462144
USR-PLUS    0.418803
PSD         0.382307
PNL         0.267807
ALDE        0.255237
UDMR       -0.148242

UDMR
===========================
USR-PLUS      -0.042722
ALDE          -0.113350
PMP           -0.130799
PNL           -0.139204
PRO Romania   -0.148242
PSD           -0.223707

PNL
===========================
USR-PLUS       0.501996
PMP            0.447272
PRO Romania    0.267807
PSD            0.206330
ALDE           0.168811
UDMR          -0.139204

ALDE
===========================
PSD            0.261416
PRO Romania    0.255237
PMP            0.237606
USR-PLUS       0.231465
PNL            0.168811
UDMR          -0.113350

PMP
===========================
USR-PLUS       0.593511
PRO Romania    0.462144
PNL            0.447272
ALDE           0.237606
PSD            0.218082
UDMR          -0.130799
  • Close to 0 scores mean that the two parties have orthogonal behavior (when one’s score is increasing it doesn’t influence the score of the other - this means that they address wildly different people categories and interests).
  • Minus scores mean that if the score of one party is increasing, the score of the other is decreasing. The only such case is UDMR, that is inversely correlated with all the other parties. This means that whenever UDMR has a high score, the scores of the others go in the opposite way, and whenever the others start to gain traction, UDMR is loosing scores.

Urban vs Rural attendance

The data that we have also contains the medium information, telling us if votes were cast in rural areas versus urban ones. Since we have this, I’d like to see the how men and women behaved in each setting.

In order to do this we first have to do some data manipulation, adding the gender column, and separating the data for each one.

Code
non_demographic_columns = set(df.columns) - \
    {column for column in df.columns if column.startswith("men_")} - \
    {column for column in df.columns if column.startswith("women_")} - \
    set(_males_columns) - \
    set(_female_columns)
_male_columns = df[_males_columns + ["precinct_nr"]]
_male_columns.columns = [int(column.replace("Barbati ", "")) for column in _males_columns] + ["precinct_nr"]
_male_columns["gender"] = "males"

_fem_columns = df[_female_columns + ["precinct_nr"]]
_fem_columns.columns = [int(column.replace("Femei ", "")) for column in _female_columns] + ["precinct_nr"]
_fem_columns["gender"] = "females"


df_by_gender = pd.concat((
    pd.merge(df[non_demographic_columns].copy(), _male_columns, on="precinct_nr"),
    pd.merge(df[non_demographic_columns].copy(), _fem_columns, on="precinct_nr")
))
age_columns = list(range(18, 90))

_df = df_by_gender[["gender", "medium"] + age_columns].fillna(0)

urban_males_votes = np.sum(_df[(_df.medium == "U") & (_df.gender == "males")][age_columns].values)
urban_females_votes = np.sum(_df[(_df.medium == "U") & (_df.gender == "females")][age_columns].values)

rural_males_votes = np.sum(_df[(_df.medium == "R") & (_df.gender == "males")][age_columns].values)
rural_females_votes = np.sum(_df[(_df.medium == "R") & (_df.gender == "females")][age_columns].values)

# scale the numbers according to their total magnitude. Urban dwellers are more numerous
_pivot = _df.pivot_table(index=["medium", "gender"], aggfunc=lambda x: x.sum())

_pivot.loc[("U", "males"), age_columns] /= urban_males_votes
_pivot.loc[("U", "females"), age_columns] /= urban_females_votes
_pivot.loc[("R", "males"), age_columns] /= rural_males_votes
_pivot.loc[("R", "females"), age_columns] /= rural_females_votes
_pivot.loc[[("U", "males"), ("U", "females")], :].T.plot(figsize=(10,10), title="In cities, men attendence is higher until the age of 47, after which women attendence is higher.")

And here are the results. You can read their interpretation from the titles.

png

_pivot.loc[[("R", "males"), ("R", "females")], :].T.plot(figsize=(10,10), title="In rural areas, males surpass females in attendence, until the age of 58, after which the trend reverses.")

png

Other interesting findings

Columns highly correlated with “Urna mobila” (mobile voting booth)

Code
df[[column for column in df.columns if "Femei " not in column and "Barbati " not in column]].corrwith(df.urna_mobila).sort_values(ascending=False)[1:20]
PSD                          0.233645
women_65+                    0.197278
men_65+                      0.190935
Voturi anulate               0.148569
men_45_64                    0.124989
Total prezenti               0.115288
total                        0.112613
Voturi valabile              0.111136
presence                     0.106723
men_18_24                    0.098921
UDMR                         0.098836
Prezenti lista permanenta    0.093936
women_45_64                  0.087840
liste_permanente             0.082491
Total alegatori              0.073915
Total lista permanenta       0.073891
Votanti lista                0.073797
men_35_44                    0.073775
men_25_34                    0.070998

Columns highly correlated with “Liste suplimentare” (mobile voting booth)

Code
df[[column for column in df.columns if "Femei " not in column and "Barbati " not in column]].corrwith(df.lista_suplimentare).sort_values(ascending=False)[1:20]
Prezenti lista suplimentara    0.993394
men_25_34                      0.739754
men_35_44                      0.712754
women_25_34                    0.647773
George-Nicaolae Simion         0.630228
USR-PLUS                       0.628024
Total voturi                   0.580983
Voturi valabile                0.555929
Total prezenti                 0.551418
total                          0.540754
women_35_44                    0.522387
PNL                            0.509740
Voturi nefolosite              0.498012
men_18_24                      0.492730
women_18_24                    0.443678
men_45_64                      0.442544
PMP                            0.431459
Partidul Romania Unita         0.346605
Peter Costea                   0.329325

Columns highly correlated with “Voturi anulate” (mobile voting booth)

Code
df[[column for column in df.columns if "Femei " not in column and "Barbati " not in column]].corrwith(df["Voturi anulate"]).sort_values(ascending=False)[1:20]
Votanti lista                0.475865
Total alegatori              0.475692
Total lista permanenta       0.475680
PSD                          0.474736
men_45_64                    0.459099
Prezenti lista permanenta    0.422869
liste_permanente             0.421464
men_65+                      0.394126
total                        0.387026
Total prezenti               0.386483
Total voturi                 0.380412
women_45_64                  0.375835
Voturi valabile              0.356550
PNL                          0.342141
women_65+                    0.323280
Partidul Socialist Roman     0.320872
Voturi nefolosite            0.313705
women_35_44                  0.312228
men_18_24                    0.281524

Comments