In [1]:
import pandas as pd

JSON file beolvasás


In [3]:
pd.read_json('data.json')


Out[3]:
id birth name studwhere livesin other1 marriedto workwhere workwhat studwhat schoolyear studyear marriedsince other2 school other4 other5 other6 other3
0 0 keresztur Boholt Lore Orbán Balázs Gimnázium, Székelykeresztúr Szèkelykeresztur, Harghita, Romania NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 csikszereda Nelut Kémenes NaN Csikszentdomokos Venczel József Iskolaközpont - Csíkszereda NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 udvarhely Kovács Iulia None Angyalföld, Budapest, Hungary NaN Péterfy Eniko Drogmentes Világért Alapítvány - Magyarország helyettes ügyvezető NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 barot Monica-Ana Szep NaN Budapest, Hungary NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 4 szentgyorgy Matei Szabolcs NaN NaN NaN László Tünde NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49215 49215 keresztur Ágota) Andrea NaN Szèkelykeresztur, Harghita, Romania NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
49216 49216 udvarhely Tubák Galaczi NaN Székelyudvarhely NaN NaN Agro Demcomplex NaN NaN 92 NaN NaN NaN NaN NaN NaN NaN NaN
49217 49217 toplita Coco Laci NaN Toplita NaN NaN konyuipar Mechanikai Muszeresz NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
49218 49218 kovaszna Szots Kovacs Universitatea BABEŞ - BOLYAI Sfântu-Gheorghe, Covasna, Romania NaN NaN sc sergiana prodimpex srl lucrator comercial Stiinte Economice si Gestiunea Afacerilor NaN 12 NaN NaN NaN NaN NaN NaN NaN
49219 49219 csikszereda (Szandi) Sándor EMTE NaN NaN NaN Benedek Elek Tanítóképző NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

49220 rows × 19 columns

Excel file beolvasás: sorok kihagyhatók a file tetejéről, munkalap neve választható.


In [4]:
df=pd.read_excel('2.17deaths causes.xls',sheet_name='2.17',skiprows=5)

numpy egy matematikai bővítőcsomag


In [5]:
import numpy as np

A nan értékek numpy-ban vannak definiálva.


In [6]:
df=df.set_index('Unnamed: 0').dropna(how='any').replace('-',np.nan)

In [7]:
df2=pd.read_excel('2.17deaths causes.xls',sheet_name='2.17',skiprows=4)

ffill azt jelenti forward fill, és a nan-okat kitölti a balra vagy fölötte álló értékkel. Az axis=0 a sorokat jelenti, az axis=1 az oszlopokat.


In [8]:
df2.loc[[0]].ffill(axis=1)


Out[8]:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 persoane / persons Unnamed: 12 Unnamed: 13
0 NaN 2011 2011 2012 2012 2013 2013 20141) 20141) 2015 2015 20162) 20162) 20162)

Sorok/oszlopok törlése.


In [ ]:
df=df.drop('Unnamed: 13',axis=1)

In [11]:
df.columns


Out[11]:
Index([         2011,  'Unnamed: 2',          2012,  'Unnamed: 4',
                2013,  'Unnamed: 6',      '20141)',  'Unnamed: 8',
                2015, 'Unnamed: 10',      '20162)', 'Unnamed: 12'],
      dtype='object')

In [12]:
[year for year in range(2011,2017)]


Out[12]:
[2011, 2012, 2013, 2014, 2015, 2016]

In [13]:
df.columns=[year for year in range(2011,2017) for k in range(2)]

Nested pythonic lista - két felsorolás egymás után


In [14]:
[str(year)+'-'+str(k) for year in range(2011,2017) for k in range(2)]


Out[14]:
['2011-0',
 '2011-1',
 '2012-0',
 '2012-1',
 '2013-0',
 '2013-1',
 '2014-0',
 '2014-1',
 '2015-0',
 '2015-1',
 '2016-0',
 '2016-1']

In [15]:
nemek=['Masculin','Feminin']
[str(year)+'-'+nem for year in range(2011,2017) for nem in nemek]


Out[15]:
['2011-Masculin',
 '2011-Feminin',
 '2012-Masculin',
 '2012-Feminin',
 '2013-Masculin',
 '2013-Feminin',
 '2014-Masculin',
 '2014-Feminin',
 '2015-Masculin',
 '2015-Feminin',
 '2016-Masculin',
 '2016-Feminin']

In [16]:
df.columns=[str(year)+'-'+nem for year in range(2011,2017) for nem in nemek]
df


Out[16]:
2011-Masculin 2011-Feminin 2012-Masculin 2012-Feminin 2013-Masculin 2013-Feminin 2014-Masculin 2014-Feminin 2015-Masculin 2015-Feminin 2016-Masculin 2016-Feminin
Unnamed: 0
Total 132180 119259 132408.0 121308 129209.0 118266 133197.0 121768 135886.0 125408 134063.0 122413
Boli infecţioase şi parazitare 1575 658 1626.0 770 1548.0 808 1676.0 995 1745.0 992 1911.0 1215
Tuberculoză 1052 231 986.0 263 900.0 233 876.0 249 838.0 220 785.0 184
Tumori 28718 19638 29143.0 19875 29351.0 20313 30123.0 20473 30316.0 20999 30670.0 21084
metabolism 1074 1283 1120.0 1311 1142.0 1208 1126.0 1282 1257.0 1411 1350.0 1551
Diabet zaharat 1024 1199 1051.0 1243 1069.0 1146 1043.0 1200 1182.0 1328 1239.0 1441
Tulburări mentale şi de comportament 315 112 259.0 88 265.0 79 275.0 91 239.0 82 275.0 93
Boli ale sistemului nervos 1384 1394 1445.0 1733 1393.0 1729 1532.0 1922 1640.0 2100 1692.0 2223
Boli ale aparatului circulator 70845 80693 71025.0 82188 68486.0 78808 68947.0 80190 71237.0 82673 69048.0 79401
Boala ischemică a inimii 25796 25069 25534.0 25851 24786.0 24828 25189.0 25505 26066.0 26643 24773.0 24903
Boli cerebro-vasculare 21278 26295 21103.0 26249 20503.0 25603 19876.0 24825 19763.0 24542 19059.0 23789
Boli ale aparatului respirator 7923 4537 8194.0 5046 7855.0 4829 8337.0 5075 9146.0 5856 8906.0 5622
Boli ale aparatului digestiv 8684 5815 8702.0 5759 8407.0 5631 8694.0 5861 8663.0 5718 8774.0 5724
Boli ale aparatului genito-urinar 1437 1219 1625.0 1332 1643.0 1510 1701.0 1571 1823.0 1633 1877.0 1745
Sarcină, naştere şi lăuzie - 50 NaN 23 NaN 27 NaN 24 NaN 27 NaN 17
situează în perioada perinatală 380 277 354.0 269 372.0 291 404.0 238 317.0 230 286.0 209
anomalii cromozomiale 329 239 289.0 245 277.0 223 240.0 199 233.0 192 216.0 180
consecinţe ale cauzelor externe 8156 2378 8207.0 2344 7738.0 2249 7746.0 2197 7527.0 2235 7446.0 2287
Alte cauze 1360 966 419.0 325 732.0 561 2396.0 1650 1743.0 1260 1612.0 1062

In [17]:
evek=[str(year) for year in range(2011,2017) for nem in nemek]
nemlista=[nem for year in range(2011,2017) for nem in nemek]

In [18]:
df=df.T

Új oszlopok a dimenzióknak.


In [21]:
df['Ev']=evek
df['Nem']=nemlista

In [22]:
df.head(6)


Out[22]:
Unnamed: 0 Total Boli infecţioase şi parazitare Tuberculoză Tumori metabolism Diabet zaharat Tulburări mentale şi de comportament Boli ale sistemului nervos Boli ale aparatului circulator Boala ischemică a inimii ... Boli ale aparatului respirator Boli ale aparatului digestiv Boli ale aparatului genito-urinar Sarcină, naştere şi lăuzie situează în perioada perinatală anomalii cromozomiale consecinţe ale cauzelor externe Alte cauze Ev Nem
2011-Masculin 132180 1575 1052 28718 1074 1024 315 1384 70845 25796 ... 7923 8684 1437 - 380 329 8156 1360 2011 Masculin
2011-Feminin 119259 658 231 19638 1283 1199 112 1394 80693 25069 ... 4537 5815 1219 50 277 239 2378 966 2011 Feminin
2012-Masculin 132408 1626 986 29143 1120 1051 259 1445 71025 25534 ... 8194 8702 1625 NaN 354 289 8207 419 2012 Masculin
2012-Feminin 121308 770 263 19875 1311 1243 88 1733 82188 25851 ... 5046 5759 1332 23 269 245 2344 325 2012 Feminin
2013-Masculin 129209 1548 900 29351 1142 1069 265 1393 68486 24786 ... 7855 8407 1643 NaN 372 277 7738 732 2013 Masculin
2013-Feminin 118266 808 233 20313 1208 1146 79 1729 78808 24828 ... 4829 5631 1510 27 291 223 2249 561 2013 Feminin

6 rows × 21 columns


In [23]:
df.set_index(['Ev','Nem'])


Out[23]:
Unnamed: 0 Total Boli infecţioase şi parazitare Tuberculoză Tumori metabolism Diabet zaharat Tulburări mentale şi de comportament Boli ale sistemului nervos Boli ale aparatului circulator Boala ischemică a inimii Boli cerebro-vasculare Boli ale aparatului respirator Boli ale aparatului digestiv Boli ale aparatului genito-urinar Sarcină, naştere şi lăuzie situează în perioada perinatală anomalii cromozomiale consecinţe ale cauzelor externe Alte cauze
Ev Nem
2011 Masculin 132180 1575 1052 28718 1074 1024 315 1384 70845 25796 21278 7923 8684 1437 - 380 329 8156 1360
Feminin 119259 658 231 19638 1283 1199 112 1394 80693 25069 26295 4537 5815 1219 50 277 239 2378 966
2012 Masculin 132408 1626 986 29143 1120 1051 259 1445 71025 25534 21103 8194 8702 1625 NaN 354 289 8207 419
Feminin 121308 770 263 19875 1311 1243 88 1733 82188 25851 26249 5046 5759 1332 23 269 245 2344 325
2013 Masculin 129209 1548 900 29351 1142 1069 265 1393 68486 24786 20503 7855 8407 1643 NaN 372 277 7738 732
Feminin 118266 808 233 20313 1208 1146 79 1729 78808 24828 25603 4829 5631 1510 27 291 223 2249 561
2014 Masculin 133197 1676 876 30123 1126 1043 275 1532 68947 25189 19876 8337 8694 1701 NaN 404 240 7746 2396
Feminin 121768 995 249 20473 1282 1200 91 1922 80190 25505 24825 5075 5861 1571 24 238 199 2197 1650
2015 Masculin 135886 1745 838 30316 1257 1182 239 1640 71237 26066 19763 9146 8663 1823 NaN 317 233 7527 1743
Feminin 125408 992 220 20999 1411 1328 82 2100 82673 26643 24542 5856 5718 1633 27 230 192 2235 1260
2016 Masculin 134063 1911 785 30670 1350 1239 275 1692 69048 24773 19059 8906 8774 1877 NaN 286 216 7446 1612
Feminin 122413 1215 184 21084 1551 1441 93 2223 79401 24903 23789 5622 5724 1745 17 209 180 2287 1062

unstack paranccsal egy MultiIndex (azaz többszintes index) pivot-álható.


In [25]:
df.set_index(['Ev','Nem'])[['Total']].unstack()


Out[25]:
Unnamed: 0 Total
Nem Feminin Masculin
Ev
2011 119259 132180
2012 121308 132408
2013 118266 129209
2014 121768 133197
2015 125408 135886
2016 122413 134063

Hiányzó értékek (nan-ok) helyettesítése.


In [26]:
pd.DataFrame([0,3,4,5,'gfgf',np.nan]).replace(np.nan,'Mas')


Out[26]:
0
0 0
1 3
2 4
3 5
4 gfgf
5 Mas

In [27]:
pd.DataFrame([0,3,4,5,'gfgf',np.nan]).fillna('Mas')


Out[27]:
0
0 0
1 3
2 4
3 5
4 gfgf
5 Mas

join - több DataFrame összefűzése. Az index ugyanaz kell legyen. Az oszlopok nevei különbözőek. Az index neve nem számít.


In [28]:
df1=pd.read_excel('pensiunea comfort 1.xlsx',sheet_name='Sheet1')
df2=pd.read_excel('pensiunea comfort 1.xlsx',sheet_name='Sheet2')
df3=pd.read_excel('pensiunea comfort 1.xlsx',sheet_name='Sheet3')

In [29]:
df1=df1.dropna(how='all',axis=0).dropna(how='all',axis=1).set_index(2019)
df2=df2.dropna(how='all',axis=0).dropna(how='all',axis=1).set_index(2019)
df3=df3.dropna(how='all',axis=0).dropna(how='all',axis=1).set_index('2019/ NR. DE NOPTI')

In [30]:
df1.join(df2).join(df3)


Out[30]:
CARD NUMERAR VOUCHER AGENTIA DE TURISM TOTAL TELEFON BOOKING TRAVELMINIT LA LIBER EUROPA TRADE ... 3 NOPTI 4 NOPTI 5 NOPTI 6 NOPTI 7 NOPTI 8 NOPTI 9 NOPTI 10 NOPTI 11 NOPTI 12 NOPTI
2019
APRILIE 2040.0 3230.0 6000.0 1370.0 12640.0 5.0 7.0 2.0 0.0 0.0 ... 3.0 3.0 5.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0
MAI 7200.0 4640.0 15350.0 4500.0 31690.0 24.0 22.0 9.0 1.0 1.0 ... 11.0 8.0 3.0 3.0 0.0 0.0 1.0 0.0 0.0 0.0
IUNIE 17940.0 10110.0 14000.0 5030.0 47080.0 29.0 25.0 5.0 3.0 0.0 ... 14.0 6.0 11.0 7.0 2.0 3.0 1.0 0.0 1.0 0.0
IULIE 18080.0 26890.0 21650.0 9700.0 76320.0 45.0 14.0 13.0 2.0 0.0 ... 10.0 1.0 11.0 12.0 11.0 2.0 2.0 5.0 0.0 1.0
AUGUST 12128.0 12672.0 24450.0 11455.0 60705.0 52.0 8.0 5.0 7.0 0.0 ... 14.0 10.0 11.0 14.0 5.0 3.0 0.0 1.0 0.0 0.0
SEPTEMBRIE 13880.0 13790.0 14200.0 10250.0 52120.0 34.0 27.0 7.0 3.0 0.0 ... 14.0 6.0 8.0 6.0 4.0 2.0 1.0 1.0 1.0 1.0

6 rows × 29 columns


In [ ]: