In [1]:

    
import pandas as pd

If a required package is missing, install it from the console, or from inside the notebook by putting ! at the start of the cell.



In [2]:

    
!pip install xlrd









    



Requirement already satisfied: xlrd in c:\programdata\anaconda3\lib\site-packages (1.2.0)

In pandas, tables are DataFrames.



In [3]:

    
df=pd.read_excel('set-date-deschise-vtp2015.xlsx')

Row labels (the index)



In [4]:

    
df.index









    Out[4]:





RangeIndex(start=0, stop=880, step=1)

Column names



In [5]:

    
df.columns









    Out[5]:





Index(['gen', 'varsta', 'Destinatie_tara1', 'Cetatenie', 'mediu_prov',
       'judet_domiciliu', 'Nivel educatie', 'Relatia cu recrutorul',
       'Forma de exploatare'],
      dtype='object')

First few rows



In [6]:

    
df.head(2)









    Out[6]:







  
    
      
     gen       varsta  Destinatie_tara1  Cetatenie  mediu_prov  judet_domiciliu  Nivel educatie     Relatia cu recrutorul  Forma de exploatare
0    feminin   25      Spania            Romana     rural       Galati           studii gimnaziale  partener(a)/sot(ie)    exploatare sexuala
1    feminin   38      Spania            Romana     rural       Vrancea          NaN                cunostinta/prieten(a)  exploatare sexuala

Last few rows



In [7]:

    
df.tail(2)









    Out[7]:







  
    
      
     gen       varsta  Destinatie_tara1  Cetatenie  mediu_prov  judet_domiciliu  Nivel educatie     Relatia cu recrutorul  Forma de exploatare
878  feminin   39      Romania           Straina    urban       NaN              studii liceale     cunostinta/prieten(a)  exploatare sexuala
879  feminin   24      Romania           Straina    rural       NaN              studii gimnaziale  proxenet               exploatare sexuala

Selecting several columns



In [8]:

    
az_en_oszlopom=['varsta','Cetatenie','gen']
df2=df[az_en_oszlopom].head(2)
df2









    Out[8]:







  
    
      
     varsta  Cetatenie  gen
0    25      Romana     feminin
1    38      Romana     feminin

Selecting a single column returns either a DataFrame or a Series, depending on whether the key is a list or a plain string.



In [9]:

    
print(type(df[['varsta']]))
print(type(df['varsta']))









    



<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
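The difference comes from the type of the key: a string gives back a one-dimensional Series, a list of names gives back a two-dimensional DataFrame. A minimal sketch on a made-up two-row frame (the toy data is invented for illustration and is not the real dataset):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({'varsta': [25, 38], 'gen': ['feminin', 'feminin']})

as_series = toy['varsta']    # string key -> Series (1-D)
as_frame = toy[['varsta']]   # list key   -> DataFrame (2-D, one column)

print(as_series.shape)  # (2,)
print(as_frame.shape)   # (2, 1)
```

The DataFrame form is convenient when the result is passed on to methods such as to_excel that should keep the column header.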

Selecting rows



In [10]:

    
az_en_indexem=[2,4,67]
df3=df.loc[az_en_indexem]
df3









    Out[10]:







  
    
      
     gen       varsta  Destinatie_tara1  Cetatenie  mediu_prov  judet_domiciliu  Nivel educatie     Relatia cu recrutorul  Forma de exploatare
2    feminin   22      Danemarca         Romana     urban       Constanta        studii gimnaziale  NaN                    exploatare sexuala
4    masculin  25      Portugalia        Romana     urban       Ilfov            NaN                cunostinta/prieten(a)  exploatare prin munca
67   feminin   27      Italia            Romana     urban       Dolj             studii liceale     rude                   obligarea la cersetorie

The pandas documentation describes many other kinds of selection (iloc, xs, slicing) and much more.
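As a hedged illustration of one of those alternatives: loc selects by label, while iloc selects by integer position. The sketch below uses an invented frame whose index labels deliberately differ from the row positions.

```python
import pandas as pd

# Toy frame with a non-default index, invented for illustration
toy = pd.DataFrame({'varsta': [22, 25, 27]}, index=[2, 4, 67])

by_label = toy.loc[67, 'varsta']   # label-based: the row labelled 67
by_position = toy.iloc[0, 0]       # position-based: first row, first column
last_two = toy.iloc[-2:]           # positional slice: the last two rows

print(by_label, by_position)
print(list(last_two.index))
```

With the default RangeIndex the two coincide, which is why df.loc[[2,4,67]] above also happens to pick the rows at positions 2, 4 and 67.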

Adding a new column.



In [11]:

    
df['uj']=42

A new column built from a list-like object must have exactly as many elements as the DataFrame has rows.



In [12]:

    
df['uj2']=range(880)      # works, but hard-codes the row count
df['uj2']=range(len(df))  # better: follows the actual length
#df['uj2']=range(870)     # does not work: length mismatch
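The length rule can be checked on a small frame. The sketch below (toy data and column names are invented) shows that a scalar is broadcast to every row, a sequence of matching length is accepted, and a sequence of the wrong length raises a ValueError.

```python
import pandas as pd

toy = pd.DataFrame({'varsta': [25, 38, 22]})

toy['const'] = 42                  # a scalar is broadcast to every row
toy['counter'] = range(len(toy))   # a sequence must match the row count

try:
    toy['too_short'] = [1, 2]      # 2 values for 3 rows
    failed = False
except ValueError as err:
    failed = True
    print('ValueError:', err)
```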

Exporting, e.g. to Excel



In [13]:

    
df[['gen','varsta','Destinatie_tara1']].to_excel('enyim.xlsx')

Checking uniqueness



In [14]:

    
len(df)









    Out[14]:





880



In [15]:

    
len(df['Destinatie_tara1'].unique())









    Out[15]:





27

Counting values, NaN excluded.



In [16]:

    
df.count()









    Out[16]:





gen                      880
varsta                   880
Destinatie_tara1         879
Cetatenie                880
mediu_prov               865
judet_domiciliu          874
Nivel educatie           821
Relatia cu recrutorul    849
Forma de exploatare      880
uj                       880
uj2                      880
dtype: int64

Counting unique values, NaN excluded.



In [17]:

    
df.nunique()









    Out[17]:





gen                        2
varsta                    59
Destinatie_tara1          26
Cetatenie                  2
mediu_prov                 2
judet_domiciliu           42
Nivel educatie             7
Relatia cu recrutorul      6
Forma de exploatare        6
uj                         1
uj2                      880
dtype: int64
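This explains the apparent mismatch above: len(df['Destinatie_tara1'].unique()) returned 27 while nunique() reports 26, because unique() keeps NaN as a value and the column has one missing entry (879 non-null out of 880). A minimal sketch on an invented Series:

```python
import pandas as pd
import numpy as np

# Toy column with one missing value, invented for illustration
s = pd.Series(['Spania', 'Italia', np.nan, 'Spania'])

n_with_nan = len(s.unique())              # unique() keeps NaN
n_without_nan = s.nunique()               # nunique() drops NaN by default
n_counting_nan = s.nunique(dropna=False)  # opt in to counting NaN
n_non_missing = s.count()                 # non-NaN entries only

print(n_with_nan, n_without_nan, n_counting_nan, n_non_missing)
```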

Grouping by columns



In [18]:

    
df.groupby(['judet_domiciliu','Destinatie_tara1']).mean(numeric_only=True) # the grouping key can be several columns









    Out[18]:







  
    
      
      
                                   varsta     uj    uj2
judet_domiciliu  Destinatie_tara1
Alba             Austria           19.500000  42.0  698.500000
                 Franta            26.000000  42.0  705.000000
                 Italia            20.000000  42.0  685.000000
                 Romania           15.500000  42.0  699.000000
Arad             Germania          18.000000  42.0  656.000000
...              ...               ...        ...   ...
Vrancea          Grecia            19.500000  42.0  532.000000
                 Italia            24.000000  42.0  521.000000
                 Olanda            18.000000  42.0  779.000000
                 Romania           18.818182  42.0  529.181818
                 Spania            38.000000  42.0  1.000000

217 rows × 3 columns



In [19]:

    
df4=df.groupby(['Destinatie_tara1']).mean(numeric_only=True)



In [20]:

    
df4=df4[['varsta']]
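The two steps above (group, then keep only varsta) can also be written in one go by selecting the column before aggregating. This is a sketch on invented data; picking the numeric column first also sidesteps the problem that newer pandas versions refuse to average text columns.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Destinatie_tara1': ['Spania', 'Spania', 'Italia'],
    'gen': ['feminin', 'masculin', 'feminin'],
    'varsta': [25, 35, 27],
})

# Group by destination, then average only the age column
mean_age = toy.groupby('Destinatie_tara1')[['varsta']].mean()
print(mean_age)
```

The group keys become the index of the result, sorted alphabetically, which is what list(df4.index) relies on in the next cell.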



In [21]:

    
az_en_indexem=list(df4.index)



In [22]:

    
az_en_indexem[2]









    Out[22]:





'Austria'



In [23]:

    
az_en_indexem[2:]









    Out[23]:





['Austria',
 'Bahamas',
 'Belgia',
 'Cehia',
 'Cipru',
 'Danemarca',
 'Elvetia',
 'Finlanda',
 'Franta',
 'Germania',
 'Grecia',
 'Irlanda',
 'Italia',
 'Libia',
 'Norvegia',
 'Olanda',
 'Portugalia',
 'Qatar',
 'Romania',
 'Spania',
 'Suedia',
 'Turcia',
 'UK',
 'Ungaria']

Slicing lists



In [24]:

    
az_en_indexem[:2]









    Out[24]:





[18, 999]



In [25]:

    
az_en_indexem[2:7]









    Out[25]:





['Austria', 'Bahamas', 'Belgia', 'Cehia', 'Cipru']

Step: take every n-th element



In [26]:

    
az_en_indexem[::3]









    Out[26]:





[18,
 'Bahamas',
 'Cipru',
 'Finlanda',
 'Grecia',
 'Libia',
 'Portugalia',
 'Spania',
 'UK']



In [27]:

    
az_en_indexem[::-1]









    Out[27]:





['Ungaria',
 'UK',
 'Turcia',
 'Suedia',
 'Spania',
 'Romania',
 'Qatar',
 'Portugalia',
 'Olanda',
 'Norvegia',
 'Libia',
 'Italia',
 'Irlanda',
 'Grecia',
 'Germania',
 'Franta',
 'Finlanda',
 'Elvetia',
 'Danemarca',
 'Cipru',
 'Cehia',
 'Belgia',
 'Bahamas',
 'Austria',
 999,
 18]
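The slices used above all follow the pattern list[start:stop:step], where stop is exclusive and any of the three parts may be omitted. A short recap on a shortened copy of the list (the shortened list is an assumption made to keep the example small):

```python
items = [18, 999, 'Austria', 'Bahamas', 'Belgia', 'Cehia', 'Cipru']

first_two = items[:2]         # from the start up to, not including, index 2
middle = items[2:5]           # indices 2, 3, 4
every_third = items[::3]      # indices 0, 3, 6
reversed_items = items[::-1]  # negative step walks backwards
tail = items[-2:]             # negative index counts from the end

print(first_two, middle, every_third, tail)
```

Slicing always returns a new list, so the original az_en_indexem is left unchanged by any of these operations.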

Pythonic list comprehensions

for and if statements



In [28]:

    
for elem in az_en_indexem[:4]:
    print(elem)









    



18
999
Austria
Bahamas



In [29]:

    
for i,elem in enumerate(az_en_indexem[:4]):
    print(i,elem)









    



0 18
1 999
2 Austria
3 Bahamas



In [30]:

    
for i in range(5):
    print(i,az_en_indexem[i])









    



0 18
1 999
2 Austria
3 Bahamas
4 Belgia

range(start, stop, step=1)



In [31]:

    
for i in range(3,8):
    print(i,az_en_indexem[i])









    



3 Bahamas
4 Belgia
5 Cehia
6 Cipru
7 Danemarca
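A quick sketch of the three forms of range; as with slices, stop is never included:

```python
up_to_five = list(range(5))          # start defaults to 0
three_to_seven = list(range(3, 8))   # explicit start and stop
every_third = list(range(0, 10, 3))  # explicit step
countdown = list(range(5, 0, -1))    # negative step counts down

print(up_to_five, three_to_seven, every_third, countdown)
```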

More elegant filtering



In [32]:

    
for elem in az_en_indexem:
    if ( (elem==18) or (elem==999) ):
        print(elem)



In [33]:

    
tiltolista=[18,999]
for elem in az_en_indexem:
    if elem in tiltolista:
        print(elem)



In [34]:

    
for elem in az_en_indexem:
    if ( (elem!=18) and (elem!=999) ):
        print(elem)









    



Austria
Bahamas
Belgia
Cehia
Cipru
Danemarca
Elvetia
Finlanda
Franta
Germania
Grecia
Irlanda
Italia
Libia
Norvegia
Olanda
Portugalia
Qatar
Romania
Spania
Suedia
Turcia
UK
Ungaria



In [35]:

    
for elem in az_en_indexem:
    if elem not in tiltolista:
        print(elem)









    



Austria
Bahamas
Belgia
Cehia
Cipru
Danemarca
Elvetia
Finlanda
Franta
Germania
Grecia
Irlanda
Italia
Libia
Norvegia
Olanda
Portugalia
Qatar
Romania
Spania
Suedia
Turcia
UK
Ungaria



In [36]:

    
uj_index=[]
for elem in az_en_indexem:
    if elem not in [18,999]:
        uj_index.append(elem)



In [37]:

    
uj_index









    Out[37]:





['Austria',
 'Bahamas',
 'Belgia',
 'Cehia',
 'Cipru',
 'Danemarca',
 'Elvetia',
 'Finlanda',
 'Franta',
 'Germania',
 'Grecia',
 'Irlanda',
 'Italia',
 'Libia',
 'Norvegia',
 'Olanda',
 'Portugalia',
 'Qatar',
 'Romania',
 'Spania',
 'Suedia',
 'Turcia',
 'UK',
 'Ungaria']



In [38]:

    
[i for i in range(5)]









    Out[38]:





[0, 1, 2, 3, 4]



In [39]:

    
uj_index2=[elem for elem in az_en_indexem if elem not in tiltolista]



In [40]:

    
uj_index2









    Out[40]:





['Austria',
 'Bahamas',
 'Belgia',
 'Cehia',
 'Cipru',
 'Danemarca',
 'Elvetia',
 'Finlanda',
 'Franta',
 'Germania',
 'Grecia',
 'Irlanda',
 'Italia',
 'Libia',
 'Norvegia',
 'Olanda',
 'Portugalia',
 'Qatar',
 'Romania',
 'Spania',
 'Suedia',
 'Turcia',
 'UK',
 'Ungaria']
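The loop in In [36] and the comprehension in In [39] build the same list. The sketch below puts them side by side on a shortened list and adds a third variant, which is only a suggestion and not part of the original notebook: filtering by type, which needs no blocklist when the unwanted entries are the only non-strings.

```python
index_values = [18, 999, 'Austria', 'Bahamas']
blocklist = [18, 999]

# Explicit loop: start empty, append what passes the test
kept_loop = []
for value in index_values:
    if value not in blocklist:
        kept_loop.append(value)

# The same logic as a list comprehension
kept = [value for value in index_values if value not in blocklist]

# Alternative (assumption: all valid entries are strings)
only_names = [value for value in index_values if isinstance(value, str)]

print(kept_loop, kept, only_names)
```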

Selecting by the new index



In [41]:

    
df4.loc[uj_index2].head(4)









    Out[41]:







  
    
      
                  varsta
Destinatie_tara1
Austria           19.565217
Bahamas           37.000000
Belgia            30.153846
Cehia             31.000000



In [42]:

    
df5=df4.loc[uj_index2]



In [43]:

    
df5=df5.reset_index()



In [44]:

    
df5[['Destinatie_tara1']].replace('Belgia','Belgium').replace('Cehia','Cseh')









    Out[44]:







  
    
      
    Destinatie_tara1
0   Austria
1   Bahamas
2   Belgium
3   Cseh
4   Cipru
5   Danemarca
6   Elvetia
7   Finlanda
8   Franta
9   Germania
10  Grecia
11  Irlanda
12  Italia
13  Libia
14  Norvegia
15  Olanda
16  Portugalia
17  Qatar
18  Romania
19  Spania
20  Suedia
21  Turcia
22  UK
23  Ungaria

Python dictionary: {key1: value1, key2: value2, ...}



In [45]:

    
nev_cserelo={'Belgia':'Belgium','Cehia':'Cseh'}
df5['Destinatie_tara2']=df5['Destinatie_tara1'].replace(nev_cserelo)
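Passing a dictionary to replace swaps every value that appears as a key and leaves all other values untouched. The same idea in plain Python, as a sketch on a shortened list of names:

```python
rename = {'Belgia': 'Belgium', 'Cehia': 'Cseh'}
names = ['Austria', 'Belgia', 'Cehia']

# dict.get(key, default) returns the default when the key is missing,
# so names without an entry in the dictionary pass through unchanged
renamed = [rename.get(name, name) for name in names]

print(renamed)
```

Keeping the mapping in one dictionary scales better than chaining .replace() calls as in In [44]: a new rename is one more key, not one more method call.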



In [46]:

    
df5.set_index('Destinatie_tara2')[['varsta']].to_excel('df5.xlsx')



In [47]:

    
df5=df5.set_index('Destinatie_tara2')[['varsta']]
df5.to_excel('df5.xlsx')

Attaching geometry. Geometries are usually distributed as shapefiles, a format that comes from ArcGIS. Shapefiles are large and awkward to handle outside dedicated GIS software, which is why an open, text-based standard, GeoJSON, is often used instead. (GeoJSON is itself fairly old, and there is a newer related format, TopoJSON, which produces smaller files.)

Converting between these formats is straightforward.

We need a country-level GeoJSON: https://github.com/johan/world.geo.json/blob/master/countries.geo.json

Romanian counties: https://github.com/deldersveld/topojson/blob/master/countries/romania/romania-counties.json

County-level subdivision of Europe: https://data.europa.eu/euodp/en/data/dataset/HZKBS2y8ycdZijX0PMHPA

Reading a JSON file



In [48]:

    
pd.read_json('countries.geo.json').head()









    Out[48]:







  
    
      
   type               features
0  FeatureCollection  {'type': 'Feature', 'id': 'AFG', 'properties':...
1  FeatureCollection  {'type': 'Feature', 'id': 'AGO', 'properties':...
2  FeatureCollection  {'type': 'Feature', 'id': 'ALB', 'properties':...
3  FeatureCollection  {'type': 'Feature', 'id': 'ARE', 'properties':...
4  FeatureCollection  {'type': 'Feature', 'id': 'ARG', 'properties':...



In [49]:

    
import json



In [50]:

    
# the with-block closes the file automatically after reading
with open('countries.geo.json','r') as f:
    countries=json.load(f)



In [51]:

    
countries=countries['features']
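pd.read_json squeezes the nested file into a table with one feature per row, whereas the json module keeps the nesting as ordinary dictionaries and lists, which is easier to navigate here. A self-contained sketch with an invented two-feature collection (the ids, names and coordinates are made up):

```python
import json

# A tiny FeatureCollection, invented for illustration
raw = '''
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature", "id": "AAA", "properties": {"name": "Alpha"},
    "geometry": {"type": "Point", "coordinates": [1.0, 2.0]}},
   {"type": "Feature", "id": "BBB", "properties": {"name": "Beta"},
    "geometry": {"type": "Point", "coordinates": [3.0, 4.0]}}
 ]}
'''

collection = json.loads(raw)        # text -> nested dicts and lists
features = collection['features']   # the list of Feature objects
names = [feature['properties']['name'] for feature in features]

print(collection['type'], names)
```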



In [52]:

    
countries[0]









    Out[52]:





{'type': 'Feature',
 'id': 'AFG',
 'properties': {'name': 'Afghanistan'},
 'geometry': {'type': 'Polygon',
  'coordinates': [[[61.210817, 35.650072],
    [62.230651, 35.270664],
    [62.984662, 35.404041],
    [63.193538, 35.857166],
    [63.982896, 36.007957],
    [64.546479, 36.312073],
    [64.746105, 37.111818],
    [65.588948, 37.305217],
    [65.745631, 37.661164],
    [66.217385, 37.39379],
    [66.518607, 37.362784],
    [67.075782, 37.356144],
    [67.83, 37.144994],
    [68.135562, 37.023115],
    [68.859446, 37.344336],
    [69.196273, 37.151144],
    [69.518785, 37.608997],
    [70.116578, 37.588223],
    [70.270574, 37.735165],
    [70.376304, 38.138396],
    [70.806821, 38.486282],
    [71.348131, 38.258905],
    [71.239404, 37.953265],
    [71.541918, 37.905774],
    [71.448693, 37.065645],
    [71.844638, 36.738171],
    [72.193041, 36.948288],
    [72.63689, 37.047558],
    [73.260056, 37.495257],
    [73.948696, 37.421566],
    [74.980002, 37.41999],
    [75.158028, 37.133031],
    [74.575893, 37.020841],
    [74.067552, 36.836176],
    [72.920025, 36.720007],
    [71.846292, 36.509942],
    [71.262348, 36.074388],
    [71.498768, 35.650563],
    [71.613076, 35.153203],
    [71.115019, 34.733126],
    [71.156773, 34.348911],
    [70.881803, 33.988856],
    [69.930543, 34.02012],
    [70.323594, 33.358533],
    [69.687147, 33.105499],
    [69.262522, 32.501944],
    [69.317764, 31.901412],
    [68.926677, 31.620189],
    [68.556932, 31.71331],
    [67.792689, 31.58293],
    [67.683394, 31.303154],
    [66.938891, 31.304911],
    [66.381458, 30.738899],
    [66.346473, 29.887943],
    [65.046862, 29.472181],
    [64.350419, 29.560031],
    [64.148002, 29.340819],
    [63.550261, 29.468331],
    [62.549857, 29.318572],
    [60.874248, 29.829239],
    [61.781222, 30.73585],
    [61.699314, 31.379506],
    [60.941945, 31.548075],
    [60.863655, 32.18292],
    [60.536078, 32.981269],
    [60.9637, 33.528832],
    [60.52843, 33.676446],
    [60.803193, 34.404102],
    [61.210817, 35.650072]]]}}



In [54]:

    
orszagok_angol=[country['properties']['name'] for country in countries]



In [55]:

    
orszagok_roman=list(df5.index)

The Levenshtein function is defined in a separate Python .py file.



In [56]:

    
import lev
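The contents of lev.py are not shown in this notebook. As a stand-in, here is one common way to compute the Levenshtein (edit) distance with dynamic programming. The function name and implementation below are assumptions for illustration and may differ from the actual lev.levenshteinDistance.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions
    or substitutions needed to turn string a into string b."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into '' takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein_distance('Italia', 'Italy'))
```

Only two rows of the distance table are kept in memory at a time, which is enough because each cell depends only on the current and the previous row.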

All pairwise distances



In [57]:

    
orszag_matrix={}
for orszag1 in orszagok_roman:
    # one inner dictionary of distances per Romanian name
    if orszag1 not in orszag_matrix:
        orszag_matrix[orszag1]={}
    for orszag2 in orszagok_angol:
        orszag_matrix[orszag1][orszag2]=lev.levenshteinDistance(orszag1,orszag2)

Minimum distances



In [58]:

    
orszag_matrix_jo={}
for orszag1 in orszag_matrix:
    orszag_matrix_jo[orszag1]=min(orszag_matrix[orszag1],key=orszag_matrix[orszag1].get)
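The expression min(d, key=d.get) iterates over the keys of d but compares them by their values, so it returns the key with the smallest value. A sketch with invented distances:

```python
# Hypothetical distances from one Romanian name to three English names
distances = {'Israel': 5, 'Italy': 2, 'Mali': 4}

closest = min(distances, key=distances.get)  # key with the smallest value
smallest = min(distances.values())           # the smallest value itself

print(closest, smallest)
```

The nearest name by edit distance is not necessarily the correct translation, which is why the result still has to be checked and patched by hand in the next cell.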

Fixing the mismatches manually



In [59]:

    
orszag_matrix_jo['Cseh']='Czech Republic'
orszag_matrix_jo['Elvetia']='Switzerland'
orszag_matrix_jo['Grecia']='Greece'
orszag_matrix_jo['Norvegia']='Norway'
orszag_matrix_jo['Olanda']='Netherlands'
orszag_matrix_jo['Suedia']='Sweden'
orszag_matrix_jo['Turcia']='Turkey'
orszag_matrix_jo['UK']='United Kingdom'
orszag_matrix_jo['Ungaria']='Hungary'
orszag_matrix_jo['Bahamas']='The Bahamas'
orszag_matrix_jo









    Out[59]:





{'Austria': 'Austria',
 'Bahamas': 'The Bahamas',
 'Belgium': 'Belgium',
 'Cseh': 'Czech Republic',
 'Cipru': 'Cyprus',
 'Danemarca': 'Denmark',
 'Elvetia': 'Switzerland',
 'Finlanda': 'Finland',
 'Franta': 'France',
 'Germania': 'Germany',
 'Grecia': 'Greece',
 'Irlanda': 'Ireland',
 'Italia': 'Italy',
 'Libia': 'Libya',
 'Norvegia': 'Norway',
 'Olanda': 'Netherlands',
 'Portugalia': 'Portugal',
 'Qatar': 'Qatar',
 'Romania': 'Romania',
 'Spania': 'Spain',
 'Suedia': 'Sweden',
 'Turcia': 'Turkey',
 'UK': 'United Kingdom',
 'Ungaria': 'Hungary'}

Extracting the ages from the df5 DataFrame



In [60]:

    
eletkorok=df5.to_dict()['varsta']



In [61]:

    
eletkorok









    Out[61]:





{'Austria': 19.565217391304348,
 'Bahamas': 37.0,
 'Belgium': 30.153846153846153,
 'Cseh': 31.0,
 'Cipru': 37.666666666666664,
 'Danemarca': 42.490196078431374,
 'Elvetia': 21.0,
 'Finlanda': 24.0,
 'Franta': 24.48148148148148,
 'Germania': 25.672727272727272,
 'Grecia': 26.0,
 'Irlanda': 24.9375,
 'Italia': 25.52252252252252,
 'Libia': 48.666666666666664,
 'Norvegia': 27.666666666666668,
 'Olanda': 22.166666666666668,
 'Portugalia': 33.88235294117647,
 'Qatar': 35.0,
 'Romania': 18.172680412371133,
 'Spania': 32.205128205128204,
 'Suedia': 22.2,
 'Turcia': 34.333333333333336,
 'UK': 24.666666666666668,
 'Ungaria': 29.5}

Convert the list of countries into a dictionary



In [62]:

    
country_dict={country['properties']['name']:country for country in countries}

Next, replace the dictionary keys, swapping the English country names for the Romanian ones.



In [63]:

    
country_dict_ro={orszag1:country_dict[orszag_matrix_jo[orszag1]] for orszag1 in orszag_matrix_jo}

Add the Romanian country names and the ages to the country dictionary



In [64]:

    
for country in country_dict_ro:
    country_dict_ro[country]['properties']['name_ro']=country
    country_dict_ro[country]['properties']['varsta']=eletkorok[country]

Convert back to list format



In [65]:

    
country_dict_ro_values=list(country_dict_ro.values())
country_dict_ro_updated={'features':country_dict_ro_values}
country_dict_ro_updated['type']='FeatureCollection'
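The whole round trip (list of features, dictionary keyed by name, extra properties, back to a FeatureCollection) can be followed on a small example. Everything below is invented for illustration: two features with empty geometry, a two-entry name mapping and made-up ages.

```python
import json

# Toy inputs, invented for illustration only
features = [
    {'type': 'Feature', 'properties': {'name': 'Italy'}, 'geometry': None},
    {'type': 'Feature', 'properties': {'name': 'Spain'}, 'geometry': None},
]
ro_to_en = {'Italia': 'Italy', 'Spania': 'Spain'}
ages = {'Italia': 25.5, 'Spania': 32.2}

# 1. list -> dictionary keyed by the English name
by_name = {f['properties']['name']: f for f in features}
# 2. re-key by the Romanian name
by_ro = {ro: by_name[en] for ro, en in ro_to_en.items()}
# 3. attach the extra properties (the feature dicts are modified in place)
for ro, feature in by_ro.items():
    feature['properties']['name_ro'] = ro
    feature['properties']['varsta'] = ages[ro]
# 4. dictionary -> list, wrapped as a FeatureCollection
collection = {'type': 'FeatureCollection', 'features': list(by_ro.values())}

text = json.dumps(collection)
print(text)
```

Only the countries present in the name mapping end up in the output, so the exported file is much smaller than the original world file.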

Export as a new GeoJSON



In [66]:

    
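# the with-block makes sure the file is flushed and closed
with open('countries_ro.geo.json','w') as export_file:
    n_chars=export_file.write(json.dumps(country_dict_ro_updated))
n_chars  # number of characters written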









    Out[66]:





27460
