Statistical overview of the data sourcing and mapping



In [9]:

    
import sys
import os

sys.path.append(os.getcwd())

Collect mapping data



In [3]:

    
from common.bootstrap import collect_sources
from pandas import DataFrame

mappings = DataFrame()

for source in collect_sources():
    mappings = mappings.append(source.fields_mapping, ignore_index=True)



In [11]:

    
mappings.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 939 entries, 0 to 938
Data columns (total 10 columns):
decimal          1 non-null object
description      11 non-null object
mapping          1 non-null object
maps_to          707 non-null object
missingValue     1 non-null object
name             939 non-null object
pipeline_id      939 non-null object
translates_to    832 non-null object
translation      98 non-null object
type             36 non-null object
dtypes: object(10)
memory usage: 73.4+ KB

How many times has each field been used?



In [5]:

    
fiscal_fields = mappings.groupby('maps_to').size().sort_values(ascending=False)
fiscal_fields = fiscal_fields.reindex(fiscal_fields.index.rename('Fiscal field'))
DataFrame(fiscal_fields, columns=['Count'])









    Out[5]:






  
    
      
      Count
    
    
      Fiscal field
      
    
  
  
    
      beneficiary_name
      94
    
    
      project_name
      89
    
    
      total_amount
      69
    
    
      total_amount_eligible
      68
    
    
      starting_date
      45
    
    
      project_description
      44
    
    
      eu_cofinancing_amount
      40
    
    
      completion_date
      36
    
    
      approval_date
      32
    
    
      beneficiary_country
      19
    
    
      beneficiary_postal_code
      18
    
    
      project_id
      17
    
    
      beneficiary_city
      16
    
    
      beneficiary_nuts_region
      14
    
    
      theme_code
      14
    
    
      fund_acronym
      12
    
    
      theme_name
      12
    
    
      member_state_amount
      10
    
    
      cci_program_code
      9
    
    
      beneficiary_address
      8
    
    
      cci_program_title
      7
    
    
      priority_label
      7
    
    
      final_payment_date
      7
    
    
      priority_number
      6
    
    
      beneficiary_nuts_code
      4
    
    
      total_amount_applied
      3
    
    
      start_date
      2
    
    
      beneficiary_url
      2
    
    
      beneficiary_country_code
      2
    
    
      third_party_amount
      1



In [7]:

    
from bokeh.charts import Bar, output_notebook, show
from bokeh.sampledata.autompg import autompg as df
output_notebook()

p = Bar(fiscal_fields, legend=None, plot_width=900)
show(p)









    





    
        
        Loading BokehJS ...

What fields did not find a mapping?

First: how many fields are not mapped to a fiscal field?



In [32]:

    
empty_mappings = mappings[mappings['maps_to'].isnull()]
print('There are {} fields out of {} that have not been mapped...'.format(len(empty_mappings), len(mappings)))
print('In other words, that\'s {0:.0f}%.'.format(len(empty_mappings) / len(mappings) * 100))









    



There are 232 fields out of 939 that have not been mapped...
In other words, that's 25%.

How many of these empty mappings do not have a translation?



In [37]:

    
nb_without_translation = len(empty_mappings[empty_mappings['translates_to'].isnull()])
print('There are {} fields that have not been mapped that are missing a translation'.format(nb_without_translation))









    



There are 64 fields that have not been mapped that are missing a translation

After removing those without a translation, the set of fields without a mapping is...



In [39]:

    
set(empty_mappings['translates_to'][empty_mappings['translates_to'].notnull()].values)









    Out[39]:





{'Actually paid',
 'Approved Amount in Euro',
 'Artists',
 'Beneficiary',
 'CCD',
 'Call number',
 'Certified funds (EU sources)',
 'Certified funds (public funds total)',
 'Claim status',
 'Contract date',
 'Conveyor item number',
 'County',
 'Date of last update of the list of operations',
 'Date of last updated',
 'Date of latest update',
 'Duration (Months)',
 'EU cofinancing amount eligible',
 'Email Address',
 'FIELD (e.g. Competitiveness of enterprises and research excellence, Development of Regions, etc.)',
 'FV number',
 'Finance by Beneficiary',
 'Financing',
 'Focus Area of Intervention',
 'Fund (ERDF, ESF...)',
 'Fund (either ERDF or ESF)',
 'G = commited, A = paid out',
 'General allocation of funds (e.g. Call of Tender)',
 'Government body (authority)',
 'ID (not change)',
 'Implementation Procedures',
 'Instrument',
 'Intervention field',
 'Intervention field and number (includes category number and description of field)',
 'Investment Priority',
 'Is the project being implemented on an area of more than one administrative unit?',
 'Line Ministry',
 'Multimedia',
 'Municipality',
 'Name of Contract',
 'Name of category of intervention for the operation',
 'Name of the Project in English',
 'Number and title of the program',
 'Operating Type',
 'Operational Objective',
 'Paid-financing',
 'Person responsible for project (lastname)',
 'President of the company',
 'Priority Axis',
 'Project Status',
 'Project Website',
 'Project funds disbursed',
 'Project name (short)',
 'Project number',
 'Project will be implemented',
 'Projects Website',
 'Ranking in the framework of the project (filter)',
 'Region',
 'Region of the applicant - name',
 'SOURCE',
 'Status',
 'Status - G committed - A paid',
 'Status G = committed A = paid',
 'Status of Project - G confirmed A paid',
 'Status of project',
 'Total value',
 'Unions co-financing rate in percent per priority',
 'Value of the project application',
 'Year Expenditure Approval / Year Final Payment (two dates)',
 'Year allocation',
 'administrative district (county)',
 'amount of EU funding',
 'applicant code',
 'beneficiary county',
 'bus number',
 'category of Interventionskategorie',
 'co-beneficiaries',
 'contact person',
 'contracts number',
 'cooperation authority',
 'county',
 'date last updated',
 'directive code',
 'duration',
 'economic activity',
 'email of person responsible for project',
 'field',
 'first name of person responsible',
 'form of financing (numerical)',
 'funding agency',
 'house number',
 'legal form of applicant',
 'legal form of group',
 'list of project_name',
 'location / county',
 'name of cofinancing programme',
 'names of the executor',
 'number of seats (execution)',
 'number support',
 'percent rate that EU will cover',
 'percentage of EU cofinancing',
 'person responsible email',
 'person responsible for project',
 'programme',
 'programme / county',
 'project id',
 'project status',
 'project theme / subtheme',
 'responsible authority',
 'some number defining the project either 1 or 2',
 'source',
 'sources',
 'specific county number for county within Nordrhein-Westfalen',
 'status',
 'supported action',
 'telephone number',
 'telephone number of person responsible',
 'territory type',
 'the amount of EU funds disbursed',
 'thematic objective',
 'total public funds',
 'unclear',
 'unclear some number',
 'village applicants - code',
 'village applicants - name'}

How complete are the datasets?

Also what we might wanna know is what are the dataset that do do have the minimum requirements. So here goes...



In [59]:

    
minimum_fields = [
    'beneficary_name', 
    'project_name', 
    'total_amount', 
    'fund_acronym', 
    'beneficiary_nuts_region', 
    'program_name'
]



In [66]:

    
pipeline_ids = mappings['pipeline_id'].unique()
print('There are currently {} datasets'.format(len(pipeline_ids))









    



There are currently 97 datasets



In [77]:

    
minimum_requirements = []
minimum_requirements_counter = {}

for pipeline_id in pipeline_ids:
    pipeline = mappings[mappings['pipeline_id'] == pipeline_id]
    dataset = {'pipeline_id': pipeline_id}
    counter = 0
    for field in minimum_fields:
        has_field = field in pipeline['maps_to'].values
        dataset.update({field: has_field})
        counter = counter + 1 if has_field else counter
    minimum_requirements.append(dataset)
    minimum_requirements_counter.update({pipeline_id: counter})
    
minimum_requirements = DataFrame(minimum_requirements)
minimum_requirements_counter









    Out[77]:





{'AT.austria/AT11.burgenland': 2,
 'AT.austria/AT12.niederoesterreich': 2,
 'AT.austria/AT21.kaernten': 2,
 'AT.austria/AT22.steiermark': 1,
 'AT.austria/AT31.oberoesterreich': 2,
 'AT.austria/AT32.salzburg': 1,
 'AT.austria/AT33.tirol': 2,
 'AT.austria/AT34.vorarlberg': 1,
 'AT.austria/national': 0,
 'BE.belgium/BE1.brussels': 2,
 'BE.belgium/BE2.vlaams.gewest/ERDF 2007-2013': 2,
 'BE.belgium/BE2.vlaams.gewest/ERDF 2014-2020': 2,
 'BE.belgium/BE3.region.wallonne/all funds 2014-2020': 3,
 'BG.bulgaria': 0,
 'CY.cyprus/all funds 2007-2013': 2,
 'CZ.czech-republic/2007-2013': 2,
 'CZ.czech-republic/2014-2020': 1,
 'DE.germany/DE1.baden-wuerttemberg/ERDF 2007-2013': 2,
 'DE.germany/DE1.baden-wuerttemberg/ERDF 2014-2020': 1,
 'DE.germany/DE1.baden-wuerttemberg/ESF 2007-2013': 2,
 'DE.germany/DE2.bayern/ERDF 2007-2013': 1,
 'DE.germany/DE2.bayern/ESF 2007-2013': 2,
 'DE.germany/DE3.berlin/ERDF 2007-2013': 2,
 'DE.germany/DE3.berlin/ERDF 2014-2020': 1,
 'DE.germany/DE3.berlin/ESF 2007-2013': 2,
 'DE.germany/DE4.brandenburg/ESF 2007-2013': 2,
 'DE.germany/DE4.brandenburg/ESF 2014-2020': 1,
 'DE.germany/DE5.bremen/ERDF 2007-2013': 1,
 'DE.germany/DE5.bremen/ERDF 2014-2020': 1,
 'DE.germany/DE5.bremen/ESF 2007-2013': 2,
 'DE.germany/DE6.hamburg/ERDF 2007-2013': 2,
 'DE.germany/DE6.hamburg/ESF 2007-2013': 2,
 'DE.germany/DE6.hamburg/ESF 2014-2020': 2,
 'DE.germany/DE7.hessen/ERDF 2007-2013': 2,
 'DE.germany/DE7.hessen/ESF 2007-2013': 2,
 'DE.germany/DE7.hessen/ESF 2014-2020': 1,
 'DE.germany/DE8.mecklenburg-vorpommern/ERDF 2007-2013': 2,
 'DE.germany/DE8.mecklenburg-vorpommern/ERDF 2014-2020': 1,
 'DE.germany/DE8.mecklenburg-vorpommern/ESF 2007-2013': 2,
 'DE.germany/DE8.mecklenburg-vorpommern/ESF 2014-2020': 1,
 'DE.germany/DE9.niedersachsen/ERDF 2007-2013': 2,
 'DE.germany/DE9.niedersachsen/ERDF 2014-2020': 2,
 'DE.germany/DE9.niedersachsen/ESF 2007-2013': 2,
 'DE.germany/DE9.niedersachsen/ESF 2014-2020': 2,
 'DE.germany/DEA.nordrhein-westfalen/ERDF 2007-2013': 2,
 'DE.germany/DEA.nordrhein-westfalen/ERDF 2014-2020': 1,
 'DE.germany/DEA.nordrhein-westfalen/ESF 2007-2013': 1,
 'DE.germany/DEA.nordrhein-westfalen/ESF 2014-2020': 1,
 'DE.germany/DEB.rheinland-pfalz/ERDF 2007-2013': 2,
 'DE.germany/DEB.rheinland-pfalz/ERDF 2014-2020': 2,
 'DE.germany/DEB.rheinland-pfalz/ESF 2007-2013': 2,
 'DE.germany/DEB.rheinland-pfalz/ESF 2014-2020': 1,
 'DE.germany/DEC.saarland/ERDF 2007-2013': 2,
 'DE.germany/DEC.saarland/ESF 2007-2013': 0,
 'DE.germany/DEC.saarland/ESF 2014-2020': 2,
 'DE.germany/DED.sachsen/ERDF 2007-2013': 2,
 'DE.germany/DED.sachsen/ERDF 2014-2020': 1,
 'DE.germany/DED.sachsen/ESF 2007-2013': 2,
 'DE.germany/DED.sachsen/ESF 2014-2020': 1,
 'DE.germany/DEE.sachsen-anhalt/ERDF 2007-2013': 1,
 'DE.germany/DEE.sachsen-anhalt/ESF 2007-2013': 1,
 'DE.germany/DEF.schleswig-holstein/ERDF 2007-2013': 2,
 'DE.germany/DEF.schleswig-holstein/ERDF 2014-2020': 1,
 'DE.germany/DEF.schleswig-holstein/ESF 2007-2013': 2,
 'DE.germany/DEF.schleswig-holstein/ESF 2014-2020': 1,
 'DE.germany/DEG.thueringen/ERDF 2007-2013': 2,
 'DE.germany/DEG.thueringen/ERDF 2014-2020': 1,
 'DE.germany/DEG.thueringen/ESF 2007-2013': 2,
 'DE.germany/DEG.thueringen/ESF 2014-2020': 1,
 'DK.denmark/2007-2013': 4,
 'DK.denmark/2014-2020': 4,
 'EE.estonia': 0,
 'EL.greece/2007-2013': 2,
 'EL.greece/2014-2020': 0,
 'ES.spain/ERDF 2007-2013': 2,
 'ES.spain/ESF 2007-2013': 3,
 'FI.finland/ERDF 2007-2013': 3,
 'FI.finland/ERDF 2014-2020': 3,
 'FR.france/2007-2013': 4,
 'FR.france/2014-2020': 2,
 'HR.croatia': 0,
 'IE.ireland/North_Western_Ireland': 2,
 'IE.ireland/Southern_Eastern_Ireland': 1,
 'LT.lithuania/2007-2013': 2,
 'LT.lithuania/2014-2020': 2,
 'LU.luxemburg': 2,
 'LV.latvia/2007-2013': 2,
 'LV.latvia/2014-2020': 2,
 'MT.malta': 3,
 'NL.the-netherlands': 2,
 'PL.poland': 2,
 'PT.portugal/all_funds_2007-2013': 2,
 'PT.portugal/all_funds_2014-2020': 2,
 'SE.sweden/all_funds_2007-2013': 3,
 'SE.sweden/all_funds_2014-2020': 3,
 'SI.slovenia/2007-2013': 3,
 'SI.slovenia/2014-2020': 2}

Now how complete is each dataset? In other words, how many fields do each dataset have?



In [52]:

    
datasets = mappings.groupby('pipeline_id').size()



In [57]:

    
datasets.sort_values(ascending=False)









    Out[57]:





pipeline_id
DE.germany/DEB.rheinland-pfalz/ESF 2007-2013            36
CZ.czech-republic/2007-2013                             28
SE.sweden/all_funds_2007-2013                           22
BE.belgium/BE2.vlaams.gewest/ERDF 2007-2013             21
AT.austria/national                                     21
DK.denmark/2014-2020                                    19
SI.slovenia/2014-2020                                   18
SE.sweden/all_funds_2014-2020                           18
SI.slovenia/2007-2013                                   17
NL.the-netherlands                                      17
DE.germany/DED.sachsen/ERDF 2014-2020                   17
MT.malta                                                16
DK.denmark/2007-2013                                    15
LU.luxemburg                                            14
FI.finland/ERDF 2007-2013                               14
BE.belgium/BE3.region.wallonne/all funds 2014-2020      14
EL.greece/2014-2020                                     14
DE.germany/DED.sachsen/ESF 2014-2020                    13
PT.portugal/all_funds_2007-2013                         13
DE.germany/DEB.rheinland-pfalz/ERDF 2014-2020           13
FI.finland/ERDF 2014-2020                               13
DE.germany/DE9.niedersachsen/ESF 2014-2020              12
DE.germany/DE9.niedersachsen/ERDF 2014-2020             12
EL.greece/2007-2013                                     12
DE.germany/DE9.niedersachsen/ESF 2007-2013              12
FR.france/2014-2020                                     12
DE.germany/DEA.nordrhein-westfalen/ESF 2014-2020        11
HR.croatia                                              11
DE.germany/DE8.mecklenburg-vorpommern/ERDF 2014-2020    11
IE.ireland/Southern_Eastern_Ireland                     11
                                                        ..
DE.germany/DEF.schleswig-holstein/ESF 2007-2013          6
DE.germany/DE9.niedersachsen/ERDF 2007-2013              6
DE.germany/DEA.nordrhein-westfalen/ERDF 2007-2013        6
DE.germany/DE3.berlin/ERDF 2007-2013                     6
DE.germany/DEF.schleswig-holstein/ERDF 2007-2013         6
DE.germany/DE3.berlin/ESF 2007-2013                      6
DE.germany/DE4.brandenburg/ESF 2007-2013                 5
DE.germany/DEB.rheinland-pfalz/ERDF 2007-2013            5
DE.germany/DEE.sachsen-anhalt/ESF 2007-2013              5
DE.germany/DE5.bremen/ERDF 2007-2013                     5
DE.germany/DEE.sachsen-anhalt/ERDF 2007-2013             5
DE.germany/DE6.hamburg/ERDF 2007-2013                    5
DE.germany/DED.sachsen/ESF 2007-2013                     5
DE.germany/DE8.mecklenburg-vorpommern/ERDF 2007-2013     5
DE.germany/DEC.saarland/ESF 2007-2013                    5
CY.cyprus/all funds 2007-2013                            5
DE.germany/DEC.saarland/ERDF 2007-2013                   5
DE.germany/DEA.nordrhein-westfalen/ESF 2007-2013         5
DE.germany/DE7.hessen/ERDF 2007-2013                     5
DE.germany/DE1.baden-wuerttemberg/ESF 2007-2013          5
DE.germany/DE2.bayern/ERDF 2007-2013                     5
DE.germany/DE1.baden-wuerttemberg/ERDF 2007-2013         5
AT.austria/AT33.tirol                                    5
AT.austria/AT34.vorarlberg                               5
DE.germany/DEG.thueringen/ERDF 2007-2013                 5
AT.austria/AT32.salzburg                                 4
AT.austria/AT31.oberoesterreich                          4
AT.austria/AT22.steiermark                               4
AT.austria/AT21.kaernten                                 4
AT.austria/AT12.niederoesterreich                        4
dtype: int64



In [ ]:

	Count
Fiscal field
beneficiary_name	94
project_name	89
total_amount	69
total_amount_eligible	68
starting_date	45
project_description	44
eu_cofinancing_amount	40
completion_date	36
approval_date	32
beneficiary_country	19
beneficiary_postal_code	18
project_id	17
beneficiary_city	16
beneficiary_nuts_region	14
theme_code	14
fund_acronym	12
theme_name	12
member_state_amount	10
cci_program_code	9
beneficiary_address	8
cci_program_title	7
priority_label	7
final_payment_date	7
priority_number	6
beneficiary_nuts_code	4
total_amount_applied	3
start_date	2
beneficiary_url	2
beneficiary_country_code	2
third_party_amount	1