Statistical overview of the data sourcing and mapping


In [9]:
import sys
import os

sys.path.append(os.getcwd())

Collect mapping data


In [3]:
from common.bootstrap import collect_sources
from pandas import DataFrame

mappings = DataFrame()

for source in collect_sources():
    mappings = mappings.append(source.fields_mapping, ignore_index=True)

In [11]:
mappings.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 939 entries, 0 to 938
Data columns (total 10 columns):
decimal          1 non-null object
description      11 non-null object
mapping          1 non-null object
maps_to          707 non-null object
missingValue     1 non-null object
name             939 non-null object
pipeline_id      939 non-null object
translates_to    832 non-null object
translation      98 non-null object
type             36 non-null object
dtypes: object(10)
memory usage: 73.4+ KB

How many times has each field been used?


In [5]:
fiscal_fields = mappings.groupby('maps_to').size().sort_values(ascending=False)
fiscal_fields = fiscal_fields.reindex(fiscal_fields.index.rename('Fiscal field'))
DataFrame(fiscal_fields, columns=['Count'])


Out[5]:
Count
Fiscal field
beneficiary_name 94
project_name 89
total_amount 69
total_amount_eligible 68
starting_date 45
project_description 44
eu_cofinancing_amount 40
completion_date 36
approval_date 32
beneficiary_country 19
beneficiary_postal_code 18
project_id 17
beneficiary_city 16
beneficiary_nuts_region 14
theme_code 14
fund_acronym 12
theme_name 12
member_state_amount 10
cci_program_code 9
beneficiary_address 8
cci_program_title 7
priority_label 7
final_payment_date 7
priority_number 6
beneficiary_nuts_code 4
total_amount_applied 3
start_date 2
beneficiary_url 2
beneficiary_country_code 2
third_party_amount 1

In [7]:
from bokeh.charts import Bar, output_notebook, show
from bokeh.sampledata.autompg import autompg as df
output_notebook()

p = Bar(fiscal_fields, legend=None, plot_width=900)
show(p)


Loading BokehJS ...

What fields did not find a mapping?

First: how many fields are not mapped to a fiscal field?


In [32]:
empty_mappings = mappings[mappings['maps_to'].isnull()]
print('There are {} fields out of {} that have not been mapped...'.format(len(empty_mappings), len(mappings)))
print('In other words, that\'s {0:.0f}%.'.format(len(empty_mappings) / len(mappings) * 100))


There are 232 fields out of 939 that have not been mapped...
In other words, that's 25%.

How many of these empty mappings do not have a translation?


In [37]:
nb_without_translation = len(empty_mappings[empty_mappings['translates_to'].isnull()])
print('There are {} fields that have not been mapped that are missing a translation'.format(nb_without_translation))


There are 64 fields that have not been mapped that are missing a translation

After removing those without a translation, the set of fields without a mapping is...


In [39]:
set(empty_mappings['translates_to'][empty_mappings['translates_to'].notnull()].values)


Out[39]:
{'Actually paid',
 'Approved Amount in Euro',
 'Artists',
 'Beneficiary',
 'CCD',
 'Call number',
 'Certified funds (EU sources)',
 'Certified funds (public funds total)',
 'Claim status',
 'Contract date',
 'Conveyor item number',
 'County',
 'Date of last update of the list of operations',
 'Date of last updated',
 'Date of latest update',
 'Duration (Months)',
 'EU cofinancing amount eligible',
 'Email Address',
 'FIELD (e.g. Competitiveness of enterprises and research excellence, Development of Regions, etc.)',
 'FV number',
 'Finance by Beneficiary',
 'Financing',
 'Focus Area of Intervention',
 'Fund (ERDF, ESF...)',
 'Fund (either ERDF or ESF)',
 'G = commited, A = paid out',
 'General allocation of funds (e.g. Call of Tender)',
 'Government body (authority)',
 'ID (not change)',
 'Implementation Procedures',
 'Instrument',
 'Intervention field',
 'Intervention field and number (includes category number and description of field)',
 'Investment Priority',
 'Is the project being implemented on an area of more than one administrative unit?',
 'Line Ministry',
 'Multimedia',
 'Municipality',
 'Name of Contract',
 'Name of category of intervention for the operation',
 'Name of the Project in English',
 'Number and title of the program',
 'Operating Type',
 'Operational Objective',
 'Paid-financing',
 'Person responsible for project (lastname)',
 'President of the company',
 'Priority Axis',
 'Project Status',
 'Project Website',
 'Project funds disbursed',
 'Project name (short)',
 'Project number',
 'Project will be implemented',
 'Projects Website',
 'Ranking in the framework of the project (filter)',
 'Region',
 'Region of the applicant - name',
 'SOURCE',
 'Status',
 'Status - G committed - A paid',
 'Status G = committed A = paid',
 'Status of Project - G confirmed A paid',
 'Status of project',
 'Total value',
 'Unions co-financing rate in percent per priority',
 'Value of the project application',
 'Year Expenditure Approval / Year Final Payment (two dates)',
 'Year allocation',
 'administrative district (county)',
 'amount of EU funding',
 'applicant code',
 'beneficiary county',
 'bus number',
 'category of Interventionskategorie',
 'co-beneficiaries',
 'contact person',
 'contracts number',
 'cooperation authority',
 'county',
 'date last updated',
 'directive code',
 'duration',
 'economic activity',
 'email of person responsible for project',
 'field',
 'first name of person responsible',
 'form of financing (numerical)',
 'funding agency',
 'house number',
 'legal form of applicant',
 'legal form of group',
 'list of project_name',
 'location / county',
 'name of cofinancing programme',
 'names of the executor',
 'number of seats (execution)',
 'number support',
 'percent rate that EU will cover',
 'percentage of EU cofinancing',
 'person responsible email',
 'person responsible for project',
 'programme',
 'programme / county',
 'project id',
 'project status',
 'project theme / subtheme',
 'responsible authority',
 'some number defining the project either 1 or 2',
 'source',
 'sources',
 'specific county number for county within Nordrhein-Westfalen',
 'status',
 'supported action',
 'telephone number',
 'telephone number of person responsible',
 'territory type',
 'the amount of EU funds disbursed',
 'thematic objective',
 'total public funds',
 'unclear',
 'unclear some number',
 'village applicants - code',
 'village applicants - name'}

How complete are the datasets?

Also what we might wanna know is what are the dataset that do do have the minimum requirements. So here goes...


In [59]:
minimum_fields = [
    'beneficary_name', 
    'project_name', 
    'total_amount', 
    'fund_acronym', 
    'beneficiary_nuts_region', 
    'program_name'
]

In [66]:
pipeline_ids = mappings['pipeline_id'].unique()
print('There are currently {} datasets'.format(len(pipeline_ids))


There are currently 97 datasets

In [77]:
minimum_requirements = []
minimum_requirements_counter = {}

for pipeline_id in pipeline_ids:
    pipeline = mappings[mappings['pipeline_id'] == pipeline_id]
    dataset = {'pipeline_id': pipeline_id}
    counter = 0
    for field in minimum_fields:
        has_field = field in pipeline['maps_to'].values
        dataset.update({field: has_field})
        counter = counter + 1 if has_field else counter
    minimum_requirements.append(dataset)
    minimum_requirements_counter.update({pipeline_id: counter})
    
minimum_requirements = DataFrame(minimum_requirements)
minimum_requirements_counter


Out[77]:
{'AT.austria/AT11.burgenland': 2,
 'AT.austria/AT12.niederoesterreich': 2,
 'AT.austria/AT21.kaernten': 2,
 'AT.austria/AT22.steiermark': 1,
 'AT.austria/AT31.oberoesterreich': 2,
 'AT.austria/AT32.salzburg': 1,
 'AT.austria/AT33.tirol': 2,
 'AT.austria/AT34.vorarlberg': 1,
 'AT.austria/national': 0,
 'BE.belgium/BE1.brussels': 2,
 'BE.belgium/BE2.vlaams.gewest/ERDF 2007-2013': 2,
 'BE.belgium/BE2.vlaams.gewest/ERDF 2014-2020': 2,
 'BE.belgium/BE3.region.wallonne/all funds 2014-2020': 3,
 'BG.bulgaria': 0,
 'CY.cyprus/all funds 2007-2013': 2,
 'CZ.czech-republic/2007-2013': 2,
 'CZ.czech-republic/2014-2020': 1,
 'DE.germany/DE1.baden-wuerttemberg/ERDF 2007-2013': 2,
 'DE.germany/DE1.baden-wuerttemberg/ERDF 2014-2020': 1,
 'DE.germany/DE1.baden-wuerttemberg/ESF 2007-2013': 2,
 'DE.germany/DE2.bayern/ERDF 2007-2013': 1,
 'DE.germany/DE2.bayern/ESF 2007-2013': 2,
 'DE.germany/DE3.berlin/ERDF 2007-2013': 2,
 'DE.germany/DE3.berlin/ERDF 2014-2020': 1,
 'DE.germany/DE3.berlin/ESF 2007-2013': 2,
 'DE.germany/DE4.brandenburg/ESF 2007-2013': 2,
 'DE.germany/DE4.brandenburg/ESF 2014-2020': 1,
 'DE.germany/DE5.bremen/ERDF 2007-2013': 1,
 'DE.germany/DE5.bremen/ERDF 2014-2020': 1,
 'DE.germany/DE5.bremen/ESF 2007-2013': 2,
 'DE.germany/DE6.hamburg/ERDF 2007-2013': 2,
 'DE.germany/DE6.hamburg/ESF 2007-2013': 2,
 'DE.germany/DE6.hamburg/ESF 2014-2020': 2,
 'DE.germany/DE7.hessen/ERDF 2007-2013': 2,
 'DE.germany/DE7.hessen/ESF 2007-2013': 2,
 'DE.germany/DE7.hessen/ESF 2014-2020': 1,
 'DE.germany/DE8.mecklenburg-vorpommern/ERDF 2007-2013': 2,
 'DE.germany/DE8.mecklenburg-vorpommern/ERDF 2014-2020': 1,
 'DE.germany/DE8.mecklenburg-vorpommern/ESF 2007-2013': 2,
 'DE.germany/DE8.mecklenburg-vorpommern/ESF 2014-2020': 1,
 'DE.germany/DE9.niedersachsen/ERDF 2007-2013': 2,
 'DE.germany/DE9.niedersachsen/ERDF 2014-2020': 2,
 'DE.germany/DE9.niedersachsen/ESF 2007-2013': 2,
 'DE.germany/DE9.niedersachsen/ESF 2014-2020': 2,
 'DE.germany/DEA.nordrhein-westfalen/ERDF 2007-2013': 2,
 'DE.germany/DEA.nordrhein-westfalen/ERDF 2014-2020': 1,
 'DE.germany/DEA.nordrhein-westfalen/ESF 2007-2013': 1,
 'DE.germany/DEA.nordrhein-westfalen/ESF 2014-2020': 1,
 'DE.germany/DEB.rheinland-pfalz/ERDF 2007-2013': 2,
 'DE.germany/DEB.rheinland-pfalz/ERDF 2014-2020': 2,
 'DE.germany/DEB.rheinland-pfalz/ESF 2007-2013': 2,
 'DE.germany/DEB.rheinland-pfalz/ESF 2014-2020': 1,
 'DE.germany/DEC.saarland/ERDF 2007-2013': 2,
 'DE.germany/DEC.saarland/ESF 2007-2013': 0,
 'DE.germany/DEC.saarland/ESF 2014-2020': 2,
 'DE.germany/DED.sachsen/ERDF 2007-2013': 2,
 'DE.germany/DED.sachsen/ERDF 2014-2020': 1,
 'DE.germany/DED.sachsen/ESF 2007-2013': 2,
 'DE.germany/DED.sachsen/ESF 2014-2020': 1,
 'DE.germany/DEE.sachsen-anhalt/ERDF 2007-2013': 1,
 'DE.germany/DEE.sachsen-anhalt/ESF 2007-2013': 1,
 'DE.germany/DEF.schleswig-holstein/ERDF 2007-2013': 2,
 'DE.germany/DEF.schleswig-holstein/ERDF 2014-2020': 1,
 'DE.germany/DEF.schleswig-holstein/ESF 2007-2013': 2,
 'DE.germany/DEF.schleswig-holstein/ESF 2014-2020': 1,
 'DE.germany/DEG.thueringen/ERDF 2007-2013': 2,
 'DE.germany/DEG.thueringen/ERDF 2014-2020': 1,
 'DE.germany/DEG.thueringen/ESF 2007-2013': 2,
 'DE.germany/DEG.thueringen/ESF 2014-2020': 1,
 'DK.denmark/2007-2013': 4,
 'DK.denmark/2014-2020': 4,
 'EE.estonia': 0,
 'EL.greece/2007-2013': 2,
 'EL.greece/2014-2020': 0,
 'ES.spain/ERDF 2007-2013': 2,
 'ES.spain/ESF 2007-2013': 3,
 'FI.finland/ERDF 2007-2013': 3,
 'FI.finland/ERDF 2014-2020': 3,
 'FR.france/2007-2013': 4,
 'FR.france/2014-2020': 2,
 'HR.croatia': 0,
 'IE.ireland/North_Western_Ireland': 2,
 'IE.ireland/Southern_Eastern_Ireland': 1,
 'LT.lithuania/2007-2013': 2,
 'LT.lithuania/2014-2020': 2,
 'LU.luxemburg': 2,
 'LV.latvia/2007-2013': 2,
 'LV.latvia/2014-2020': 2,
 'MT.malta': 3,
 'NL.the-netherlands': 2,
 'PL.poland': 2,
 'PT.portugal/all_funds_2007-2013': 2,
 'PT.portugal/all_funds_2014-2020': 2,
 'SE.sweden/all_funds_2007-2013': 3,
 'SE.sweden/all_funds_2014-2020': 3,
 'SI.slovenia/2007-2013': 3,
 'SI.slovenia/2014-2020': 2}

Now how complete is each dataset? In other words, how many fields do each dataset have?


In [52]:
datasets = mappings.groupby('pipeline_id').size()

In [57]:
datasets.sort_values(ascending=False)


Out[57]:
pipeline_id
DE.germany/DEB.rheinland-pfalz/ESF 2007-2013            36
CZ.czech-republic/2007-2013                             28
SE.sweden/all_funds_2007-2013                           22
BE.belgium/BE2.vlaams.gewest/ERDF 2007-2013             21
AT.austria/national                                     21
DK.denmark/2014-2020                                    19
SI.slovenia/2014-2020                                   18
SE.sweden/all_funds_2014-2020                           18
SI.slovenia/2007-2013                                   17
NL.the-netherlands                                      17
DE.germany/DED.sachsen/ERDF 2014-2020                   17
MT.malta                                                16
DK.denmark/2007-2013                                    15
LU.luxemburg                                            14
FI.finland/ERDF 2007-2013                               14
BE.belgium/BE3.region.wallonne/all funds 2014-2020      14
EL.greece/2014-2020                                     14
DE.germany/DED.sachsen/ESF 2014-2020                    13
PT.portugal/all_funds_2007-2013                         13
DE.germany/DEB.rheinland-pfalz/ERDF 2014-2020           13
FI.finland/ERDF 2014-2020                               13
DE.germany/DE9.niedersachsen/ESF 2014-2020              12
DE.germany/DE9.niedersachsen/ERDF 2014-2020             12
EL.greece/2007-2013                                     12
DE.germany/DE9.niedersachsen/ESF 2007-2013              12
FR.france/2014-2020                                     12
DE.germany/DEA.nordrhein-westfalen/ESF 2014-2020        11
HR.croatia                                              11
DE.germany/DE8.mecklenburg-vorpommern/ERDF 2014-2020    11
IE.ireland/Southern_Eastern_Ireland                     11
                                                        ..
DE.germany/DEF.schleswig-holstein/ESF 2007-2013          6
DE.germany/DE9.niedersachsen/ERDF 2007-2013              6
DE.germany/DEA.nordrhein-westfalen/ERDF 2007-2013        6
DE.germany/DE3.berlin/ERDF 2007-2013                     6
DE.germany/DEF.schleswig-holstein/ERDF 2007-2013         6
DE.germany/DE3.berlin/ESF 2007-2013                      6
DE.germany/DE4.brandenburg/ESF 2007-2013                 5
DE.germany/DEB.rheinland-pfalz/ERDF 2007-2013            5
DE.germany/DEE.sachsen-anhalt/ESF 2007-2013              5
DE.germany/DE5.bremen/ERDF 2007-2013                     5
DE.germany/DEE.sachsen-anhalt/ERDF 2007-2013             5
DE.germany/DE6.hamburg/ERDF 2007-2013                    5
DE.germany/DED.sachsen/ESF 2007-2013                     5
DE.germany/DE8.mecklenburg-vorpommern/ERDF 2007-2013     5
DE.germany/DEC.saarland/ESF 2007-2013                    5
CY.cyprus/all funds 2007-2013                            5
DE.germany/DEC.saarland/ERDF 2007-2013                   5
DE.germany/DEA.nordrhein-westfalen/ESF 2007-2013         5
DE.germany/DE7.hessen/ERDF 2007-2013                     5
DE.germany/DE1.baden-wuerttemberg/ESF 2007-2013          5
DE.germany/DE2.bayern/ERDF 2007-2013                     5
DE.germany/DE1.baden-wuerttemberg/ERDF 2007-2013         5
AT.austria/AT33.tirol                                    5
AT.austria/AT34.vorarlberg                               5
DE.germany/DEG.thueringen/ERDF 2007-2013                 5
AT.austria/AT32.salzburg                                 4
AT.austria/AT31.oberoesterreich                          4
AT.austria/AT22.steiermark                               4
AT.austria/AT21.kaernten                                 4
AT.austria/AT12.niederoesterreich                        4
dtype: int64

In [ ]: