Translate dataset

The main language of the project is English: it mixes well with programming languages like Python and lowers the barrier for non-Brazilian contributors. Today, the dataset we make available to them by default is a set of XMLs from the Chamber of Deputies, in Portuguese. We need attribute names and categorical values translated to English.

This file is intended to serve as a base for the script to translate current and future datasets in the same format.



In [1]:

    
import pandas as pd

data = pd.read_csv('../data/2016-08-08-AnoAtual.csv')
data.shape









    Out[1]:





(185459, 29)



In [2]:

    
data.head()









    Out[2]:






  
    
      
   ideDocumento  txNomeParlamentar  ideCadastro  nuCarteiraParlamentar  nuLegislatura  sgUF  sgPartido  codLegislatura  numSubCota  txtDescricao                                       ...
0  5928744.0     ABEL MESQUITA JR.  178957.0     1.0                    2015           RR    DEM        55.0            1           MANUTENÇÃO DE ESCRITÓRIO DE APOIO À ATIVIDADE ...  ...
1  5970849.0     ABEL MESQUITA JR.  178957.0     1.0                    2015           RR    DEM        55.0            1           MANUTENÇÃO DE ESCRITÓRIO DE APOIO À ATIVIDADE ...  ...
2  6024670.0     ABEL MESQUITA JR.  178957.0     1.0                    2015           RR    DEM        55.0            1           MANUTENÇÃO DE ESCRITÓRIO DE APOIO À ATIVIDADE ...  ...
3  6024668.0     ABEL MESQUITA JR.  178957.0     1.0                    2015           RR    DEM        55.0            1           MANUTENÇÃO DE ESCRITÓRIO DE APOIO À ATIVIDADE ...  ...
4  5928732.0     ABEL MESQUITA JR.  178957.0     1.0                    2015           RR    DEM        55.0            1           MANUTENÇÃO DE ESCRITÓRIO DE APOIO À ATIVIDADE ...  ...

   vlrLiquido  numMes  numAno  numParcela  txtPassageiro  txtTrecho  numLote  numRessarcimento  vlrRestituicao  nuDeputadoId
0  31.37       1       2016    0           NaN            NaN        1268870  5369.0            NaN             3074
1  37.24       3       2016    0           NaN            NaN        1282185  5417.0            NaN             3074
2  103.73      4       2016    0           NaN            NaN        1299456  5492.0            NaN             3074
3  43.11       5       2016    0           NaN            NaN        1299457  5492.0            NaN             3074
4  33.30       2       2016    0           NaN            NaN        1268878  5369.0            NaN             3074
    
  

5 rows × 29 columns



In [3]:

    
data.iloc[0]









    Out[3]:





ideDocumento                                                       5.92874e+06
txNomeParlamentar                                            ABEL MESQUITA JR.
ideCadastro                                                             178957
nuCarteiraParlamentar                                                        1
nuLegislatura                                                             2015
sgUF                                                                        RR
sgPartido                                                                  DEM
codLegislatura                                                              55
numSubCota                                                                   1
txtDescricao                 MANUTENÇÃO DE ESCRITÓRIO DE APOIO À ATIVIDADE ...
numEspecificacaoSubCota                                                      0
txtDescricaoEspecificacao                                                  NaN
txtFornecedor                          COMPANHIA DE AGUAS E ESGOTOS DE RORAIMA
txtCNPJCPF                                                         5.93947e+12
txtNumero                                                               146439
indTipoDocumento                                                             0
datEmissao                                                 2016-02-15T00:00:00
vlrDocumento                                                             37.37
vlrGlosa                                                                     6
vlrLiquido                                                               31.37
numMes                                                                       1
numAno                                                                    2016
numParcela                                                                   0
txtPassageiro                                                              NaN
txtTrecho                                                                  NaN
numLote                                                                1268870
numRessarcimento                                                          5369
vlrRestituicao                                                             NaN
nuDeputadoId                                                              3074
Name: 0, dtype: object
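The output above shows txtCNPJCPF parsed as a float (5.93947e+12). CNPJ numbers have 14 digits, so a 13-digit value suggests a leading zero was dropped during parsing. The sketch below is not part of the original notebook; it uses a tiny made-up CSV (the identifier in it is invented) to show how the `dtype` argument of `pd.read_csv` could keep such identifiers as text, assuming that is the desired behaviour for this dataset.

```python
import io
import pandas as pd

# Tiny made-up CSV mimicking two of the dataset's columns (values are invented).
csv_text = 'txtCNPJCPF,vlrLiquido\n05939467000115,31.37\n'

# Default parsing infers a number, so the leading zero is lost.
as_number = pd.read_csv(io.StringIO(csv_text))

# Requesting text for that column keeps the identifier as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'txtCNPJCPF': str})

print(as_number['txtCNPJCPF'].iloc[0])
print(as_text['txtCNPJCPF'].iloc[0])
```

The same `dtype` mapping could be passed when loading the real file, listing whichever identifier columns should stay textual.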

The new names are based on the "Nome do Dado" column of the table available at data/2016-08-08-datasets-format.html, not on "Elemento de Dado", which holds the current names.



In [4]:

    
data.rename(columns={
        'ideDocumento': 'document_id',
        'txNomeParlamentar': 'congressperson_name',
        'ideCadastro': 'congressperson_id',
        'nuCarteiraParlamentar': 'congressperson_document',
        'nuLegislatura': 'term',
        'sgUF': 'state',
        'sgPartido': 'party',
        'codLegislatura': 'term_id',
        'numSubCota': 'subquota_number',
        'txtDescricao': 'subquota_description',
        'numEspecificacaoSubCota': 'subquota_group_id',
        'txtDescricaoEspecificacao': 'subquota_group_description',
        'txtFornecedor': 'supplier',
        'txtCNPJCPF': 'cnpj_cpf',
        'txtNumero': 'document_number',
        'indTipoDocumento': 'document_type',
        'datEmissao': 'issue_date',
        'vlrDocumento': 'document_value',
        'vlrGlosa': 'remark_value',
        'vlrLiquido': 'net_value',
        'numMes': 'month',
        'numAno': 'year',
        'numParcela': 'installment',
        'txtPassageiro': 'passenger',
        'txtTrecho': 'leg_of_the_trip',
        'numLote': 'batch_number',
        'numRessarcimento': 'reimbursement_number',
        'vlrRestituicao': 'reimbursement_value',
        'nuDeputadoId': 'applicant_id',
    }, inplace=True)
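Since this notebook is meant as a base for translating future datasets in the same format, it may help to detect columns that the mapping does not cover, because `rename` silently leaves unknown columns untouched. The following is a minimal sketch under that assumption: `untranslated_columns` is a hypothetical helper, the mapping is only a subset of the one above, and `novaColuna` is a made-up column standing in for a future addition.

```python
import pandas as pd

# Hypothetical helper, not part of the original notebook.
def untranslated_columns(data, translations):
    """Return the column names in `data` that have no entry in `translations`."""
    return [name for name in data.columns if name not in translations]

# A small subset of the mapping used above, for illustration only.
translations = {
    'ideDocumento': 'document_id',
    'txNomeParlamentar': 'congressperson_name',
    'sgUF': 'state',
}

sample = pd.DataFrame({
    'ideDocumento': [5928744.0],
    'txNomeParlamentar': ['ABEL MESQUITA JR.'],
    'sgUF': ['RR'],
    'novaColuna': [1],  # made-up column standing in for a future addition
})

missing = untranslated_columns(sample, translations)
print(missing)
```

Running such a check before `rename` would flag any Portuguese column name that would otherwise slip through into the translated dataset.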



In [5]:

    
data['subquota_description'] = data['subquota_description'].astype('category')
data['subquota_description'].cat.categories









    Out[5]:





Index(['ASSINATURA DE PUBLICAÇÕES', 'COMBUSTÍVEIS E LUBRIFICANTES.',
       'CONSULTORIAS, PESQUISAS E TRABALHOS TÉCNICOS.',
       'DIVULGAÇÃO DA ATIVIDADE PARLAMENTAR.', 'Emissão Bilhete Aéreo',
       'FORNECIMENTO DE ALIMENTAÇÃO DO PARLAMENTAR',
       'HOSPEDAGEM ,EXCETO DO PARLAMENTAR NO DISTRITO FEDERAL.',
       'LOCAÇÃO OU FRETAMENTO DE AERONAVES',
       'LOCAÇÃO OU FRETAMENTO DE EMBARCAÇÕES',
       'LOCAÇÃO OU FRETAMENTO DE VEÍCULOS AUTOMOTORES',
       'MANUTENÇÃO DE ESCRITÓRIO DE APOIO À ATIVIDADE PARLAMENTAR',
       'PARTICIPAÇÃO EM CURSO, PALESTRA OU EVENTO SIMILAR', 'PASSAGENS AÉREAS',
       'PASSAGENS TERRESTRES, MARÍTIMAS OU FLUVIAIS',
       'SERVIÇO DE SEGURANÇA PRESTADO POR EMPRESA ESPECIALIZADA.',
       'SERVIÇO DE TÁXI, PEDÁGIO E ESTACIONAMENTO', 'SERVIÇOS POSTAIS',
       'TELEFONIA'],
      dtype='object')

When localizing categorical values, I prefer direct translation over adaptation as much as possible. I am not sure what values each attribute will contain, so I leave the interpretation to the people analyzing the data in the future.



In [6]:

    
# Categories are renamed by position, so this list must follow the order shown in Out[5].
# The result is assigned back because recent pandas versions no longer accept `inplace` here.
data['subquota_description'] = data['subquota_description'].cat.rename_categories([
        'Publication subscriptions',
        'Fuels and lubricants',
        'Consultancy, research and technical work',
        'Publicity of parliamentary activity',
        'Flight ticket issue',
        'Congressperson meal',
        'Lodging, except for congressperson from Distrito Federal',
        'Aircraft renting or charter',
        'Watercraft renting or charter',
        'Automotive vehicle renting or charter',
        'Maintenance of office supporting parliamentary activity',
        'Participation in course, talk or similar event',
        'Flight tickets',
        'Terrestrial, maritime and fluvial tickets',
        'Security service provided by specialized company',
        'Taxi, toll and parking',
        'Postal services',
        'Telecommunication',
    ])
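The cell above renames categories by position, which depends on the sorted order of the Portuguese values. For future datasets, where categories may be added or removed, a mapping keyed by the original value could be more robust. This is only a sketch of that alternative, not the notebook's method: `translate_categories` is a hypothetical helper and the mapping is a three-entry subset of the translations above.

```python
import pandas as pd

# Subset of the translations above, keyed by the original Portuguese value.
subquota_translations = {
    'PASSAGENS AÉREAS': 'Flight tickets',
    'SERVIÇOS POSTAIS': 'Postal services',
    'TELEFONIA': 'Telecommunication',
}

# Hypothetical helper, not part of the original notebook.
def translate_categories(series, translations):
    """Rename categories via an explicit mapping, failing on unknown values."""
    series = series.astype('category')
    unknown = set(series.cat.categories) - set(translations)
    if unknown:
        raise ValueError('Missing translations for: {}'.format(sorted(unknown)))
    # rename_categories accepts a dict; extra keys in the mapping are ignored.
    return series.cat.rename_categories(translations)

sample = pd.Series(['TELEFONIA', 'PASSAGENS AÉREAS', 'TELEFONIA'])
translated = translate_categories(sample, subquota_translations)
print(translated.tolist())
```

Raising on unknown values is a design choice made here so that a new category in a future dataset is noticed instead of being left in Portuguese.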



In [7]:

    
data.head()









    Out[7]:






  
    
      
   document_id  congressperson_name  congressperson_id  congressperson_document  term  state  party  term_id  subquota_number  subquota_description                               ...
0  5928744.0    ABEL MESQUITA JR.    178957.0           1.0                      2015  RR     DEM    55.0     1                Maintenance of office supporting parliamentary...  ...
1  5970849.0    ABEL MESQUITA JR.    178957.0           1.0                      2015  RR     DEM    55.0     1                Maintenance of office supporting parliamentary...  ...
2  6024670.0    ABEL MESQUITA JR.    178957.0           1.0                      2015  RR     DEM    55.0     1                Maintenance of office supporting parliamentary...  ...
3  6024668.0    ABEL MESQUITA JR.    178957.0           1.0                      2015  RR     DEM    55.0     1                Maintenance of office supporting parliamentary...  ...
4  5928732.0    ABEL MESQUITA JR.    178957.0           1.0                      2015  RR     DEM    55.0     1                Maintenance of office supporting parliamentary...  ...

   net_value  month  year  installment  passenger  leg_of_the_trip  batch_number  reimbursement_number  reimbursement_value  applicant_id
0  31.37      1      2016  0            NaN        NaN              1268870       5369.0                NaN                  3074
1  37.24      3      2016  0            NaN        NaN              1282185       5417.0                NaN                  3074
2  103.73     4      2016  0            NaN        NaN              1299456       5492.0                NaN                  3074
3  43.11      5      2016  0            NaN        NaN              1299457       5492.0                NaN                  3074
4  33.30      2      2016  0            NaN        NaN              1268878       5369.0                NaN                  3074
    
  

5 rows × 29 columns



In [8]:

    
data.iloc[0]









    Out[8]:





document_id                                                         5.92874e+06
congressperson_name                                           ABEL MESQUITA JR.
congressperson_id                                                        178957
congressperson_document                                                       1
term                                                                       2015
state                                                                        RR
party                                                                       DEM
term_id                                                                      55
subquota_number                                                               1
subquota_description          Maintenance of office supporting parliamentary...
subquota_group_id                                                             0
subquota_group_description                                                  NaN
supplier                                COMPANHIA DE AGUAS E ESGOTOS DE RORAIMA
cnpj_cpf                                                            5.93947e+12
document_number                                                          146439
document_type                                                                 0
issue_date                                                  2016-02-15T00:00:00
document_value                                                            37.37
remark_value                                                                  6
net_value                                                                 31.37
month                                                                         1
year                                                                       2016
installment                                                                   0
passenger                                                                   NaN
leg_of_the_trip                                                             NaN
batch_number                                                            1268870
reimbursement_number                                                       5369
reimbursement_value                                                         NaN
applicant_id                                                               3074
Name: 0, dtype: object


