
Getting data of an election


On the 26th of May, we had the European parliamentary elections. In Romania, the results and the progress of the vote were published online, in real time, on the official electoral site.

As far as I know, it’s the first time we had such data exposed to the public, and with such granularity.

Since my daily work involves working closely with data, I couldn’t miss the opportunity to get my hands on that dataset. Of course, the site is not documented and there aren’t any publicly available APIs to begin with, so I spent some hours debugging the underlying stack to figure out how to query it and compile the data into a usable format.

As far as I can tell, it’s built with React, with some NoSQL database as the backend. I’m betting on NoSQL because, while doing the ETL, I’ve found some schema inconsistencies that shouldn’t normally happen if the data sat on top of a SQL DB.

Understanding the API

Maybe there’s a better way to do this, but what I did was open the browser’s developer console, refresh the election page, and look for requests that seemed to contain the data I was looking for. Using this approach, I found the following endpoints that I could query.
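
For reference, the same requests can also be made programmatically. Below is a minimal sketch using Python’s requests library, equivalent to the curl calls used throughout this post (the URL and referer header are copied straight from the network tab):

Code
import requests

# Fetch the per-county results JSON (AR is the code for ARAD),
# mirroring the headers the browser sends.
response = requests.get(
    "https://prezenta.bec.ro/europarlamentare26052019/data/pv/json//pv_AR.json",
    headers={"referer": "https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part"},
)
data = response.json()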

An example of how you query the BEC site (here, for the per-county results data). You need to know the county code (in this case AR, the code for ARAD).

Code
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/json//pv_AR.json' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' --compressed -o "_data/AR.json"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 73871    0 73871    0     0   319k      0 --:--:-- --:--:-- --:--:--  319k

There is also an endpoint publishing the presence count (the number of people who have voted so far). Again, we need to query it for each county.

Code
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/presence/json/presence_AR_now.json' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-precincts' -H 'authority: prezenta.bec.ro' --compressed -o "_data/AR-presence.json"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 60786    0 60786    0     0   690k      0 --:--:-- --:--:-- --:--:--  690k

There is also a CSV we can use, which contains the presence data in one single big file.

Code
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/presence/csv/presence_now.csv' -H 'Referer: https://prezenta.bec.ro/europarlamentare26052019/abroad-pv-part' --compressed -o "_data/all_presence.csv"

Fetching the data

Getting all the data out of the site: each county has a dedicated page which contains information about its stats. By looking over the source of the site, we can compile a list of all the counties we need to inspect. It’s interesting that S1..S6 (Bucharest’s sectors) were modeled as counties.

Code
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "B", "SR", "S1", "S2", "S3", "S4", "S5", "S6"]
len(counties)

The vote information is stored on the data/pv/json/ route, specific to each county. In order not to hit the server repeatedly while testing, we first cache all the results locally so we can refer to them later on.
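
The caching itself boils down to “skip the download if the file already exists”. As a helper, it might look like the sketch below (hypothetical; the cells that follow simply inline this logic with !curl -o):

Code
import os
import subprocess

def fetch_cached(url, out_path):
    # Re-download only if we don't already have the file on disk.
    if not os.path.exists(out_path):
        subprocess.run(["curl", "--compressed", "-o", out_path, url], check=True)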

Above we’ve deduced the counties we can have, but I’ve found that there are slight asymmetries in certain cases (mostly regarding the expat data and the way Bucharest is represented).

It’s because of this that we need to handle the counties list in a case-by-case fashion.
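
One way to make those variants explicit is to derive them from a shared base list. This is a possible refactor, not what the notebook does (the cells below keep the original literal lists):

Code
# The 41 county codes, plus the special entries each step needs.
BASE_COUNTIES = ["AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ",
                 "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ",
                 "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT",
                 "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN"]
PV_COUNTIES = BASE_COUNTIES + ["S1", "S2", "S3", "S4", "S5", "S6"]  # results: Bucharest per sector
PRESENCE_COUNTIES = BASE_COUNTIES + ["B", "SR"]  # presence: Bucharest as a whole, plus abroad (SR)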

Code
import json
from tqdm import tqdm_notebook as tqdm
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "S1", "S2", "S3", "S4", "S5", "S6"]
for county in tqdm(counties):
    !curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/json//pv_{county}.json' -H 'accept-encoding: gzip, deflate, br'  -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' --compressed -o "_data/{county}.json"

The presence data is stored (as above) on a different route (data/presence/json/), specific to each county. Again, we fetch everything and cache it locally.

Code
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "B", "SR"]
for county in tqdm(counties[-8:-6]):
    !curl 'https://prezenta.bec.ro/europarlamentare26052019/data/presence/json/presence_{county}_now.json' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-precincts' -H 'authority: prezenta.bec.ro' --compressed -o "_data/{county}-presence.json"

Let’s also fetch the all-in-one presence data.

Code
!curl 'https://prezenta.bec.ro/europarlamentare26052019/data/presence/csv/presence_now.csv' -H 'Referer: https://prezenta.bec.ro/europarlamentare26052019/abroad-pv-part' --compressed -o "_data/all_presence.csv"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2618k    0 2618k    0     0  5729k      0 --:--:-- --:--:-- --:--:-- 5729k

Compiling the data

Loading a presence file

When reading a presence file, there is some manipulation we need to do, because the returned JSON contains lots of information that seemed either useless, redundant (info we already had in other places), or that I didn’t know how to interpret.

There was also the age_ranges field, which actually contained a list of values that I needed to expand into individual columns using a transform function.

Code
import json
import pandas as pd

def read_json_file(file_name):
    # Small helper: load one of the locally cached JSON files.
    with open(file_name) as f:
        return json.load(f)

def process_row(row):
    # Expand the age_ranges field into a flat tuple of counts.
    return tuple(row.age_ranges.values())

def load_presence(presence_file_name):
    _json = read_json_file(presence_file_name)
    _df = pd.DataFrame.from_records(_json['precinct'])
    _df[["men_18_24", "men_25_34", "men_35_44", "men_45_64", "men_65+", "women_18_24", "women_25_34", "women_35_44", "women_45_64", "women_65+"]] = _df.apply(process_row, axis=1, result_type='expand')
    _df.drop(columns=['age_ranges'], inplace=True)
    _df.columns = [
        'liste_permanente', 
        'lista_suplimentare', 
        'total', 
        'urna_mobila', 
        'county_code',
        'county_name',
        'id_county',
        'id_locality',
        'id_precinct',
        'id_uat',
        'initial_count',
        'latitude',
        'locality_name',
        'longitude',
        'medium',
        'precinct_name',
        'precinct_nr',
        'presence',
        'siruta',
        'uat_code',
        'uat_name',
        'men_18_24',
        'men_25_34',
        'men_35_44',
        'men_45_64',
        'men_65+',
        'women_18_24',
        'women_25_34',
        'women_35_44',
        'women_45_64',
        'women_65+',
    ]
    return _df

tulcea = load_presence("_data/TL-presence.json")
tulcea.head()
liste_permanente lista_suplimentare total urna_mobila county_code county_name id_county id_locality id_precinct id_uat ... men_18_24 men_25_34 men_35_44 men_45_64 men_65+ women_18_24 women_25_34 women_35_44 women_45_64 women_65+
0 409 31 440 0 TL TULCEA 38 8884 15848 2882 ... 13 23 36 84 29 17 22 55 110 51
1 471 73 544 0 TL TULCEA 38 8884 15849 2882 ... 11 28 55 90 71 10 40 62 92 85
2 14 20 34 0 TL TULCEA 38 8909 15943 2892 ... 1 1 5 10 4 0 2 4 5 2
3 134 35 169 0 TL TULCEA 38 8914 15948 2894 ... 5 6 17 37 21 4 9 7 37 26
4 638 70 708 0 TL TULCEA 38 8912 15946 2894 ... 19 42 46 185 78 17 42 53 142 84

5 rows × 31 columns

Getting all the individual result files, for each precinct_nr

The voting results (who voted for what, and where) are stored on the data/pv route. The problem is that if we only use the county files downloaded earlier, there aren’t any keys linking the results to the presence rows. To be more exact, the results are listed per voting facility, whereas the presence is aggregated at the locality level. This means we can’t really join the two.

Fortunately, I’ve found that if you know the id of a specific voting facility, you can ask for the results of that specific facility through a version of the original results API.

So the final strategy that worked was something along the following lines:

  • for each county, load the presence file associated with it
  • inspect all the precinct_nr values it contains
  • individually query the API for the results of each precinct_nr
Code
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "SR"]
for county in tqdm(counties):
    df_county = load_presence(f"_data/{county}-presence.json")
    for precinct_nr in tqdm(df_county['precinct_nr'].values, leave=False):
        file_name = f"_data/{county}_results_{precinct_nr}.json"
        if not os.path.exists(file_name): 
            !curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/csv/pv_{county}_{precinct_nr}_EUP_PART.csv' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' -H 'cookie: _ga=GA1.2.772980748.1558943895; _gid=GA1.2.1466959792.1561374632' --compressed --silent -o "_data/{county}_results_{precinct_nr}.json"
        else:
            with open(file_name) as f:
                file_contents = f.read()
            if "Cod birou electoral" not in file_contents:
                print(f"File: {file_name} has bad content {file_contents[:50]}. Will retry")
                os.remove(file_name)
                !curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/csv/pv_{county}_{precinct_nr}_EUP_PART.csv' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' -H 'cookie: _ga=GA1.2.772980748.1558943895; _gid=GA1.2.1466959792.1561374632' --compressed --silent -o "_data/{county}_results_{precinct_nr}.json"

Bucharest is a special case. It’s treated as a county, but the results are stored per sector, so we need to do things a bit differently.

Code
county = "B"
df_county = load_presence(f"_data/{county}-presence.json")
for id_sector in tqdm(df_county.id_locality.unique()):
    sector = f"S{int(id_sector) - 9910 + 1}"
    print(f"Processing: {sector}")
    county = sector
    for precinct_nr in tqdm(df_county[df_county.id_locality == id_sector]['precinct_nr'].values, leave=False):
        file_name = f"_data/{county}_results_{precinct_nr}.json"
        if not os.path.exists(file_name): 
            !curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/csv/pv_{county}_{precinct_nr}_EUP_PART.csv' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' -H 'cookie: _ga=GA1.2.772980748.1558943895; _gid=GA1.2.1466959792.1561374632' --compressed --silent -o "_data/{county}_results_{precinct_nr}.json"
        else:
            with open(file_name) as f:
                file_contents = f.read()
            if "Cod birou electoral" not in file_contents:
                print(f"File: {file_name} has bad content {file_contents[:50]}. Will retry")
                os.remove(file_name)
                !curl 'https://prezenta.bec.ro/europarlamentare26052019/data/pv/csv/pv_{county}_{precinct_nr}_EUP_PART.csv' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' -H 'accept: */*' -H 'referer: https://prezenta.bec.ro/europarlamentare26052019/romania-pv-part' -H 'authority: prezenta.bec.ro' -H 'cookie: _ga=GA1.2.772980748.1558943895; _gid=GA1.2.1466959792.1561374632' --compressed --silent -o "_data/{county}_results_{precinct_nr}.json"

The SR county holds the data about the foreign (abroad) polling offices.

Code
load_presence("_data/SR-presence.json").head().T
0 1 2 3 4
liste_permanente 0 0 0 0 0
lista_suplimentare 18 57 865 79 1330
total 18 57 865 79 1330
urna_mobila 0 0 0 0 0
county_code SR SR SR SR SR
county_name STRAINATATE STRAINATATE STRAINATATE STRAINATATE STRAINATATE
id_county 43 43 43 43 43
id_locality 10244 10178 10334 10206 9996
id_precinct 18619 18627 19096 18716 18723
id_uat 3230 3218 3245 3219 3186
initial_count 0 0 0 0 0
latitude None None None None None
locality_name KABUL BUENOS AIRES RENNES TBILISI STUTTGART
longitude None None None None None
medium U U U U U
precinct_name Eticheta Credentiale/Tableta - 1 Eticheta Credentiale/Tableta - 10 Eticheta Credentiale/Tableta - 100 Eticheta Credentiale/Tableta - 101 Eticheta Credentiale/Tableta - 102
precinct_nr 1 10 100 101 102
presence 0 0 0 0 0
siruta None None None None None
uat_code AF AR FR GE DE
uat_name AFGANISTAN ARGENTINA FRANȚA GEORGIA GERMANIA
men_18_24 0 1 50 1 71
men_25_34 4 3 155 9 258
men_35_44 11 4 172 18 270
men_45_64 3 7 141 14 249
men_65+ 0 1 8 1 4
women_18_24 0 0 37 1 50
women_25_34 0 10 119 13 182
women_35_44 0 13 93 11 128
women_45_64 0 11 76 9 112
women_65+ 0 7 14 2 6

Reading all the presence data into a single DataFrame

We now have all the presence data cached, and we’ll read it into a single dataframe.

Code
counties = ["AR", "AB", "AR", "AG", "BC", "BH", "BN", "BT", "BV", "BR", "BZ", "CS", "CL", "CJ", "CT", "CV", "DB", "DJ", "GL", "GR", "GJ", "HR", "HD", "IL", "IS", "IF", "MM", "MH", "MS", "NT", "OT", "PH", "SM", "SJ", "SB", "SV", "TR", "TM", "TL", "VS", "VL", "VN", "SR", "B"]
 
df_precints = pd.concat((load_presence(f) for f in tqdm(glob("_data/*-presence.json"))), ignore_index=True)
df_precints.shape
(19171, 31)
Code
df_precints.head().T
0 1 2 3 4
liste_permanente 696 140 501 571 680
lista_suplimentare 63 10 25 41 55
total 759 150 526 612 736
urna_mobila 0 0 0 0 1
county_code VS VS VS VS VS
county_name VASLUI VASLUI VASLUI VASLUI VASLUI
id_county 39 39 39 39 39
id_locality 9015 9015 9006 9006 9006
id_precinct 16128 16187 16086 16087 16088
id_uat 2936 2936 2933 2933 2933
initial_count 1470 1840 1354 1375 1570
latitude 46.6389853639550 46.6421141774663 46.2240238056566 46.2278431009305 46.2278431009305
locality_name VASLUI VASLUI BÂRLAD BÂRLAD BÂRLAD
longitude 27.7326775437114 27.7289502189002 27.6775710052581 27.6686353095150 27.6686353095150
medium U U U U U
precinct_name CASA DE CULTURĂ A SINDICATELOR ”CONSTANTIN TĂN... ȘCOALA GIMNAZIALĂ ”CONSTANTIN PARFENE” ŞCOALA GIMNAZIALĂ ”VICTOR IOAN POPA” CASA DE CULTURĂ A SINDICATELOR ”GEORGE TUTOVEANU” CASA DE CULTURĂ A SINDICATELOR ”GEORGE TUTOVEANU”
precinct_nr 1 10 100 101 102
presence 51.6327 8.1522 38.8479 44.5091 46.879
siruta 161954 161954 161801 161801 161801
uat_code None None None None None
uat_name MUNICIPIUL VASLUI MUNICIPIUL VASLUI MUNICIPIUL BÂRLAD MUNICIPIUL BÂRLAD MUNICIPIUL BÂRLAD
men_18_24 13 4 18 15 11
men_25_34 31 5 25 29 40
men_35_44 66 17 60 52 60
men_45_64 110 28 104 113 158
men_65+ 115 20 43 72 81
women_18_24 22 6 17 22 14
women_25_34 34 8 33 32 46
women_35_44 78 15 64 55 52
women_45_64 171 28 117 127 178
women_65+ 119 19 45 95 96

The all_presence.csv file contains age-group information that is more granular than the bucketed info found in the county files. We will merge it with the current dataframe.

Code
_all_df = pd.read_csv("_data/all_presence.csv")
_all_df.head().T
0 1 2 3 4
Judet AB AB AB AB AB
UAT MUNICIPIUL ALBA IULIA MUNICIPIUL ALBA IULIA MUNICIPIUL SEBEŞ MUNICIPIUL SEBEŞ MUNICIPIUL SEBEŞ
Localitate ALBA IULIA ALBA IULIA SEBEŞ SEBEŞ SEBEŞ
Siruta 1026 1026 1883 1883 1883
Nr sectie de votare 1 10 100 101 102
Nume sectie de votare CENTRUL DE ZI PENTRU PERSOANE VÂRSTNICE COLEGIUL NAŢIONAL „HOREA CLOŞCA ŞI CRIŞAN” ŞCOALA GIMNAZIALĂ NR. 2 SEBEŞ COLEGIUL NAŢIONAL ”LUCIAN BLAGA” SEBEŞ COLEGIUL NAŢIONAL ”LUCIAN BLAGA” SEBEŞ
Mediu U U U U U
Votanti lista 1612 1443 1415 1303 1362
LP 901 648 769 697 765
LS 45 143 66 73 27
UM 0 0 14 42 0
LT 946 791 849 812 792
Barbati 18-24 26 38 34 24 31
Barbati 25-34 58 68 56 56 35
Barbati 35-44 88 69 84 70 82
Barbati 45-64 165 128 157 136 136
Barbati 65+ 102 71 88 78 58
Femei 18-24 32 40 30 30 35
Femei 25-34 72 56 61 61 48
Femei 35-44 107 79 76 92 122
Femei 45-64 178 161 163 167 160
Femei 65+ 118 81 100 98 85
Barbati 18 7 9 5 4 3
Barbati 19 4 4 4 6 5
Barbati 20 5 7 6 2 9
Barbati 21 1 2 2 2 5
Barbati 22 2 7 7 1 7
Barbati 23 5 4 4 4 0
Barbati 24 2 5 6 5 2
Barbati 25 5 9 2 2 1
... ... ... ... ... ...
Femei 91 0 0 0 0 0
Femei 92 1 0 0 0 0
Femei 93 0 0 0 0 0
Femei 94 0 0 0 0 0
Femei 95 0 0 0 0 0
Femei 96 0 0 0 0 0
Femei 97 0 0 0 0 0
Femei 98 0 0 0 0 0
Femei 99 0 0 0 0 0
Femei 100 0 0 0 0 0
Femei 101 0 0 0 0 0
Femei 102 0 0 0 0 0
Femei 103 0 0 0 0 0
Femei 104 0 0 0 0 0
Femei 105 0 0 0 0 0
Femei 106 0 0 0 0 0
Femei 107 0 0 0 0 0
Femei 108 0 0 0 0 0
Femei 109 0 0 0 0 0
Femei 110 0 0 0 0 0
Femei 111 0 0 0 0 0
Femei 112 0 0 0 0 0
Femei 113 0 0 0 0 0
Femei 114 0 0 0 0 0
Femei 115 0 0 0 0 0
Femei 116 0 0 0 0 0
Femei 117 0 0 0 0 0
Femei 118 0 0 0 0 0
Femei 119 0 0 0 0 0
Femei 120 0 0 0 0 0

228 rows × 5 columns

It’s interesting that this file contains presence data grouped per year of birth (which is more granular than the 10-year buckets we had before).
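
If the two representations are consistent, summing the per-year columns should reproduce the corresponding bucket. A quick check (a sketch, using the column names from the output above):

Code
# "Barbati 18" .. "Barbati 24" should add up to "Barbati 18-24".
per_year = _all_df[[f"Barbati {age}" for age in range(18, 25)]].sum(axis=1)
print((per_year == _all_df["Barbati 18-24"]).all())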

Reading all the results data into a single dataframe.

We will now load the individual result files, fetched two steps above, into a single big pandas DataFrame.

Code
df_results = pd.concat((pd.read_csv(f) for f in tqdm(glob("_data/*_results_*"))), ignore_index=True)
df_results.shape
(19171, 36)
df_results.head().T
0 1 2 3 4
Cod birou electoral 22 5 35 35 6
Județ HUNEDOARA BIHOR SUCEAVA SUCEAVA BISTRIŢA-NĂSĂUD
Uat ORAŞ CĂLAN CEICA VICOVU DE JOS MUNICIPIUL SUCEAVA BUDEŞTI
Localitate CĂLAN BUCIUM VICOVU DE JOS SUCEAVA BUDEŞTI-FÂNAŢE
Secție ŞCOALA GIMNAZIALĂ ȘCOALA BUCIUM SCOALA CU CLASELE I-VIII IOAN VICOVEANU GRĂDINIŢA CU PROGRAM NORMAL NR.7 ŞCOALA PRIMARĂ BUDEŞTI-FÎNAŢE
Nr 190 320 532 61 97
Tip Europarlamentare Europarlamentare Europarlamentare Europarlamentare Europarlamentare
a 1471 172 1344 1393 256
a1 1471 172 1344 1393 256
a2 0 0 0 0 0
b 534 116 520 625 162
b1 505 88 479 560 141
b2 0 0 0 0 0
b3 29 28 41 65 21
c 1600 188 1500 1500 300
d 1066 72 980 875 138
e 514 113 504 605 153
f 20 3 16 20 9
h 0 0 NU ESTE CAZUL NU ESTE CAZUL 0
i 0 0 FOARTE BUNA FOARTE BUNA 0
g1 111 46 174 128 18
g2 86 14 51 126 29
g3 29 5 34 43 7
g4 15 0 0 2 1
g5 176 28 162 153 62
g6 18 4 19 27 10
g7 2 3 2 6 1
g8 25 5 29 55 11
g9 6 0 2 3 3
g10 3 0 3 1 2
g11 3 0 3 2 2
g12 1 0 2 3 2
g13 2 0 0 1 0
g14 18 0 4 21 3
g15 9 2 9 12 1
g16 10 6 10 22 1

Joining the results with the presence data

Some cleanup is needed first: in order to join the two dataframes, we need to make slight conversions so that the keys from both sides match.

Code
df_results.loc[df_results.Județ == "STRĂINĂTATE", "Județ"] = "STRAINATATE"
df_results.loc[df_results.Uat == "OMAN", "Uat"] = "SULTANATUL OMAN"
df_results.loc[df_results.Județ == "SECTOR 1", "Județ"] = "MUNICIPIUL BUCUREŞTI"
df_results.loc[df_results.Județ == "SECTOR 2", "Județ"] = "MUNICIPIUL BUCUREŞTI"
df_results.loc[df_results.Județ == "SECTOR 3", "Județ"] = "MUNICIPIUL BUCUREŞTI"
df_results.loc[df_results.Județ == "SECTOR 4", "Județ"] = "MUNICIPIUL BUCUREŞTI"
df_results.loc[df_results.Județ == "SECTOR 5", "Județ"] = "MUNICIPIUL BUCUREŞTI"
df_results.loc[df_results.Județ == "SECTOR 6", "Județ"] = "MUNICIPIUL BUCUREŞTI"

Now, if we merge the two, we get a single big dataframe with the same number of rows but double the number of columns.

Code
df_precint_with_results = pd.merge(left=df_precints, right=df_results, left_on=["county_name", "uat_name", "precinct_nr"], right_on=["Județ", "Uat", "Nr"])
df_precint_with_results.shape
(19171, 67)
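
Since both inputs had 19171 rows, a quick optional assertion confirms the merge was one-to-one and nothing was dropped:

Code
assert len(df_precint_with_results) == len(df_precints) == len(df_results)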

Let’s print an example of how one entry looks in practice.

Code
dict(df_precint_with_results.iloc[0])
{'liste_permanente': 696,
 'lista_suplimentare': 63,
 'total': 759,
 'urna_mobila': 0,
 'county_code': 'VS',
 'county_name': 'VASLUI',
 'id_county': '39',
 'id_locality': '9015',
 'id_precinct': '16128',
 'id_uat': '2936',
 'initial_count': 1470,
 'latitude': '46.6389853639550',
 'locality_name': 'VASLUI',
 'longitude': '27.7326775437114',
 'medium': 'U',
 'precinct_name': 'CASA DE CULTURĂ A SINDICATELOR ”CONSTANTIN TĂNASE”',
 'precinct_nr': 1,
 'presence': 51.6327,
 'siruta': '161954',
 'uat_code': None,
 'uat_name': 'MUNICIPIUL VASLUI',
 'men_18_24': 13,
 'men_25_34': 31,
 'men_35_44': 66,
 'men_45_64': 110,
 'men_65+': 115,
 'women_18_24': 22,
 'women_25_34': 34,
 'women_35_44': 78,
 'women_45_64': 171,
 'women_65+': 119,
 'Cod birou electoral': 39,
 'Județ': 'VASLUI',
 'Uat': 'MUNICIPIUL VASLUI',
 'Localitate': 'VASLUI',
 'Secție': 'CASA DE CULTURĂ A SINDICATELOR ”CONSTANTIN TĂNASE”',
 'Nr': 1,
 'Tip': 'Europarlamentare',
 'a': 1470,
 'a1': 1470,
 'a2': 0,
 'b': 759,
 'b1': 695,
 'b2': 0,
 'b3': 64,
 'c': 1500,
 'd': 741,
 'e': 741,
 'f': 18,
 'h': 0,
 'i': 0,
 'g1': 185,
 'g2': 232,
 'g3': 51,
 'g4': 0,
 'g5': 118,
 'g6': 37,
 'g7': 2,
 'g8': 68,
 'g9': 0,
 'g10': 4,
 'g11': 5,
 'g12': 3,
 'g13': 2,
 'g14': 18,
 'g15': 9,
 'g16': 7}

We will also join the data with the all_presence.csv file.

Code
df_full = pd.merge(left=df_precint_with_results, right=_all_df, left_on=["county_code", "uat_name", "precinct_nr"], right_on=["Judet", "UAT", "Nr sectie de votare"])
df_full.shape
(19171, 295)

Applying the legend

Some of the columns in the above dataframe are not quite obvious (g1, .., g16, etc.). These are party names that I was only able to find in a legend, in the dropdown of a button in the site’s UI. I’ve copied it here, along with explanations of some fields that I’ve been able to figure out by looking over the PDFs of the official scanned documents.

We also need to convert these column names into more meaningful labels.

Code
df_full.rename(columns={
    "g1": "PSD",
    "g2": "USR-PLUS",
    "g3": "PRO Romania",
    "g4": "UDMR",
    "g5": "PNL",
    "g6": "ALDE",
    "g7": "PRODEMO",
    "g8": "PMP",
    "g9": "Partidul Socialist Roman",
    "g10": "Partidul Social Democrat Independent",
    "g11": "Partidul Romania Unita",
    "g12": "Uniunea Nationala Pentur Progresul Romaniei",
    "g13": "Blocul Unitatii Nationale",
    "g14": "Gregoriana-Carmen Tudoran",
    "g15": "George-Nicaolae Simion",
    "g16": "Peter Costea",
    "a": "Total alegatori",
    "a1": "Total lista permanenta",
    "a2": "Total urna mobila",
    "b": "Total prezenti",
    "b1": "Prezenti lista permanenta",
    "b2": "Prezenti urna mobila",
    "b3": "Prezenti lista suplimentara",
    "c": "Total voturi",
    "d": "Voturi nefolosite",
    "e": "Voturi valabile",
    "f": "Voturi anulate",
    "h": "Contestatii",
    "i": "Starea sigiliilor"
}, inplace=True)

OK, let’s check the amount of missing data:

Code
na_series = df_full.isna().sum()
na_series[na_series != 0]
latitude               479
longitude              479
siruta                 441
uat_code             18730
Contestatii             19
Starea sigiliilor        8
Siruta                 441

Removing duplicate columns

Because we’ve basically merged two types of presence datasets (the per-county ones and the all_presence.csv one), we ended up with some duplicate columns in the joined dataframe. We also have duplicates among the columns we joined on, and among columns that contain the same type of information.

We want to eliminate those. We will find the duplicated columns by:

  • using the pandas duplicated method (applied to the transposed matrix, since duplicated only works on rows)
  • looking at the correlation matrix of the resulting columns and picking the pairs with the highest correlation (a sketch of this follows the output below)
Code
duplicate_columns = df_full.columns[df_full.T.duplicated(keep=False)]
duplicate_columns
Index(['liste_permanente', 'lista_suplimentare', 'total', 'urna_mobila',
       'county_code', 'county_name', 'initial_count', 'locality_name',
       'medium', 'precinct_name', 'precinct_nr', 'uat_name', 'men_18_24',
       'men_25_34', 'men_35_44', 'men_45_64', 'men_65+', 'women_18_24',
       'women_25_34', 'women_35_44', 'women_45_64', 'women_65+', 'Județ',
       'Uat', 'Nr', 'Judet', 'UAT', 'Localitate_y', 'Nr sectie de votare',
       'Nume sectie de votare', 'Mediu', 'Votanti lista', 'LP', 'LS', 'UM',
       'LT', 'Barbati 18-24', 'Barbati 25-34', 'Barbati 35-44',
       'Barbati 45-64', 'Barbati 65+', 'Femei 18-24', 'Femei 25-34',
       'Femei 35-44', 'Femei 45-64', 'Femei 65+', 'Barbati 104', 'Barbati 106',
       'Barbati 108', 'Barbati 109', 'Barbati 110', 'Barbati 112',
       'Barbati 113', 'Barbati 114', 'Barbati 115', 'Barbati 116',
       'Barbati 117', 'Barbati 118', 'Barbati 119', 'Barbati 120', 'Femei 105',
       'Femei 106', 'Femei 107', 'Femei 108', 'Femei 110', 'Femei 111',
       'Femei 112', 'Femei 113', 'Femei 114', 'Femei 115', 'Femei 116',
       'Femei 117', 'Femei 118', 'Femei 119', 'Femei 120'],
      dtype='object')
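
As for the correlation-based check mentioned above, it could look something like the sketch below (the exact-equality comparison that follows is what we actually rely on):

Code
# Candidate duplicates: pairs of numeric columns with (near-)perfect correlation.
numeric = df_full[duplicate_columns].select_dtypes(include="number")
corr = numeric.corr()
candidates = [(a, b)
              for i, a in enumerate(corr.columns)
              for b in corr.columns[i + 1:]
              if corr.loc[a, b] > 0.999]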

With these, we will compare each against each and see which series are equal. This results in a long list of pairs of columns that are duplicates of one another.

Code
_pairs = set()
for i, _a in enumerate(duplicate_columns):
    for _b in duplicate_columns[i+1:]:
        if (df_full[_a] == df_full[_b]).all():
            _pairs.add(tuple(sorted([_a, _b])))
_pairs
{('Barbati 104', 'Barbati 106'),
 ('Barbati 104', 'Barbati 108'),
 ('Barbati 104', 'Barbati 109'),
 ...
 ('Barbati 120', 'Femei 120'),
 ('Barbati 18-24', 'men_18_24'),
 ('Barbati 25-34', 'men_25_34'),
 ('Barbati 35-44', 'men_35_44'),
 ('Barbati 45-64', 'men_45_64'),
 ('Barbati 65+', 'men_65+'),
 ('Femei 105', 'Femei 106'),
 ('Femei 105', 'Femei 107'),
 ...
 ('Femei 119', 'Femei 120'),
 ('Femei 18-24', 'women_18_24'),
 ('Femei 25-34', 'women_25_34'),
 ('Femei 35-44', 'women_35_44'),
 ('Femei 45-64', 'women_45_64'),
 ('Femei 65+', 'women_65+'),
 ('Judet', 'county_code'),
 ('Județ', 'county_name'),
 ('LP', 'liste_permanente'),
 ('LS', 'lista_suplimentare'),
 ('LT', 'total'),
 ('Localitate_y', 'locality_name'),
 ('Mediu', 'medium'),
 ('Nr', 'Nr sectie de votare'),
 ('Nr', 'precinct_nr'),
 ('Nr sectie de votare', 'precinct_nr'),
 ('Nume sectie de votare', 'precinct_name'),
 ('UAT', 'Uat'),
 ('UAT', 'uat_name'),
 ('UM', 'urna_mobila'),
 ('Uat', 'uat_name'),
 ('Votanti lista', 'initial_count')}

There’s only one more step that we need to do: find the groups of columns that hold the same information. There are cases where the columns are not only duplicated but triplicated, which results in (A == B), (B == C), (C == A) pairs in the analysis above.

This is the perfect job for the disjoint-set data structure.
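
The DisjointSets class used below isn’t defined in this notebook (it presumably came from a small utility module). A minimal union-find sketch with the same interface would be:

Code
class DisjointSets:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Path-compressing find; unseen elements become their own root.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        # Merge the sets containing a and b.
        self.parent[self.find(a)] = self.find(b)

    def sets(self):
        # Group all elements by their root.
        groups = {}
        for x in list(self.parent):
            groups.setdefault(self.find(x), []).append(x)
        return list(groups.values())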

Code
_groups = DisjointSets()
for _a, _b in _pairs:
    _groups.union(_a, _b)
_groups.sets()
[['Barbati 109',
  'Femei 118',
  'Femei 110',
  'Femei 114',
  'Barbati 104',
  'Femei 115',
  'Barbati 116',
  'Barbati 117',
  'Barbati 114',
  'Femei 107',
  'Femei 119',
  'Femei 105',
  'Barbati 115',
  'Femei 108',
  'Barbati 118',
  'Barbati 108',
  'Barbati 119',
  'Femei 120',
  'Femei 116',
  'Barbati 112',
  'Femei 113',
  'Barbati 113',
  'Barbati 120',
  'Femei 117',
  'Barbati 106',
  'Barbati 110',
  'Femei 106',
  'Femei 112',
  'Femei 111'],
 ['LP', 'liste_permanente'],
 ['Femei 35-44', 'women_35_44'],
 ['LT', 'total'],
 ['Nr sectie de votare', 'precinct_nr', 'Nr'],
 ['UM', 'urna_mobila'],
 ['Mediu', 'medium'],
 ['Barbati 65+', 'men_65+'],
 ['Barbati 35-44', 'men_35_44'],
 ['Femei 18-24', 'women_18_24'],
 ['Votanti lista', 'initial_count'],
 ['Femei 25-34', 'women_25_34'],
 ['Barbati 25-34', 'men_25_34'],
 ['UAT', 'Uat', 'uat_name'],
 ['Barbati 18-24', 'men_18_24'],
 ['Barbati 45-64', 'men_45_64'],
 ['Localitate_y', 'locality_name'],
 ['Femei 45-64', 'women_45_64'],
 ['Nume sectie de votare', 'precinct_name'],
 ['Judet', 'county_code'],
 ['Femei 65+', 'women_65+'],
 ['Județ', 'county_name'],
 ['LS', 'lista_suplimentare']]

From the list above, we choose to drop the following columns:

Code
drop_columns = ['Barbati 104', 'Barbati 106',
       'Barbati 108', 'Barbati 109', 'Barbati 110', 'Barbati 112',
       'Barbati 113', 'Barbati 114', 'Barbati 115', 'Barbati 116',
       'Barbati 117', 'Barbati 118', 'Barbati 119', 'Barbati 120', 'Femei 105',
       'Femei 106', 'Femei 107', 'Femei 108', 'Femei 110', 'Femei 111',
       'Femei 112', 'Femei 113', 'Femei 114', 'Femei 115', 'Femei 116',
       'Femei 117', 'Femei 118', 'Femei 119', 'Femei 120', 'LP', 'Femei 35-44', 'LT', 'Nr sectie de votare', 'Nr', 'UM', 
       'Mediu', 'Barbati 65+', 'Barbati 35-44','Femei 18-24',  'initial_count', 'Femei 25-34', 'Barbati 25-34', 
       'UAT', 'Uat', 'Barbati 18-24', 'Barbati 45-64', 'Localitate_y', 'Femei 45-64', 'Nume sectie de votare',
       'Judet', 'Femei 65+', 'Județ', 'LS', 
]
df_final = df_full.drop(columns=drop_columns)
df_final.columns
Index(['liste_permanente', 'lista_suplimentare', 'total', 'urna_mobila',
       'county_code', 'county_name', 'id_county', 'id_locality', 'id_precinct',
       'id_uat',
       ...
       'Femei 96', 'Femei 97', 'Femei 98', 'Femei 99', 'Femei 100',
       'Femei 101', 'Femei 102', 'Femei 103', 'Femei 104', 'Femei 109'],
      dtype='object', length=242)

And we end up with:

Code
df_final.head().T
0 1 2 3 4
liste_permanente 696 140 501 571 680
lista_suplimentare 63 10 25 41 55
total 759 150 526 612 736
urna_mobila 0 0 0 0 1
county_code VS VS VS VS VS
county_name VASLUI VASLUI VASLUI VASLUI VASLUI
id_county 39 39 39 39 39
id_locality 9015 9015 9006 9006 9006
id_precinct 16128 16187 16086 16087 16088
id_uat 2936 2936 2933 2933 2933
latitude 46.6389853639550 46.6421141774663 46.2240238056566 46.2278431009305 46.2278431009305
locality_name VASLUI VASLUI BÂRLAD BÂRLAD BÂRLAD
longitude 27.7326775437114 27.7289502189002 27.6775710052581 27.6686353095150 27.6686353095150
medium U U U U U
precinct_name CASA DE CULTURĂ A SINDICATELOR ”CONSTANTIN TĂN... ȘCOALA GIMNAZIALĂ ”CONSTANTIN PARFENE” ŞCOALA GIMNAZIALĂ ”VICTOR IOAN POPA” CASA DE CULTURĂ A SINDICATELOR ”GEORGE TUTOVEANU” CASA DE CULTURĂ A SINDICATELOR ”GEORGE TUTOVEANU”
precinct_nr 1 10 100 101 102
presence 51.6327 8.1522 38.8479 44.5091 46.879
siruta 161954 161954 161801 161801 161801
uat_code None None None None None
uat_name MUNICIPIUL VASLUI MUNICIPIUL VASLUI MUNICIPIUL BÂRLAD MUNICIPIUL BÂRLAD MUNICIPIUL BÂRLAD
men_18_24 13 4 18 15 11
men_25_34 31 5 25 29 40
men_35_44 66 17 60 52 60
men_45_64 110 28 104 113 158
men_65+ 115 20 43 72 81
women_18_24 22 6 17 22 14
women_25_34 34 8 33 32 46
women_35_44 78 15 64 55 52
women_45_64 171 28 117 127 178
women_65+ 119 19 45 95 96
... ... ... ... ... ...
Femei 76 5 2 0 7 2
Femei 77 4 0 2 0 0
Femei 78 7 0 2 2 3
Femei 79 4 1 1 2 3
Femei 80 5 0 3 1 1
Femei 81 4 0 0 3 0
Femei 82 2 0 0 4 3
Femei 83 1 0 1 2 1
Femei 84 0 1 0 3 1
Femei 85 1 0 0 1 0
Femei 86 2 1 0 1 0
Femei 87 1 1 0 0 2
Femei 88 2 1 0 0 0
Femei 89 0 0 0 0 1
Femei 90 0 0 0 0 0
Femei 91 0 0 0 0 0
Femei 92 0 0 0 0 0
Femei 93 0 0 0 0 0
Femei 94 0 0 0 0 0
Femei 95 0 0 0 0 0
Femei 96 0 0 0 0 0
Femei 97 0 0 0 0 0
Femei 98 0 0 0 0 0
Femei 99 0 0 0 0 0
Femei 100 0 0 0 0 0
Femei 101 0 0 0 0 0
Femei 102 0 0 0 0 0
Femei 103 0 0 0 0 0
Femei 104 0 0 0 0 0
Femei 109 0 0 0 0 0

242 rows × 5 columns

Saving the data to CSV format

We’re almost done. We only need to save the dataset to disk and start using it (to be continued in a future post)!

Code
df_final.to_csv("_data/final.csv")
