In [55]:
import numpy as np
import pandas as pd
from IPython.display import display, Image, YouTubeVideo
%matplotlib inline
Cost — compare the capabilities of the commercial software you already pay for with the open-source tools shown here.
Allows a diversity of platforms on a given team
They're amenable to sketching, and they're amenable to reproducibility.
You can retrace your own steps and also make a narrative for someone else to follow.
Built-in documentation streamlines your workflow; magic commands and tab completion anticipate it.
# Tab-completion demo: type `pd.re` and press <TAB> to list completions
# (pd.read_csv, pd.read_excel, ...). Completion also works on imports,
# e.g. `from math import <TAB>`.
# NOTE(review): the original cell contained the invalid fragment
# "pd.re from math import"; commented out so the cell runs cleanly.
In [2]:
# prefix `?` opens IPython's help pane with the object's signature and docstring
?pd.read_csv
$\bar{x} = \frac{\sum_{i=1}^{n}w_i\cdot x_i}{\sum_{i=1}^{n}w_i}$
# R demo: a vector of 0..10 plus an outlier of 50
x <- c(0:10, 50)
# plain arithmetic mean — pulled upward by the outlier
xm <- mean(x)
# compare the plain mean with a 10% trimmed mean, which discards extreme values
c(xm, mean(x, trim = 0.10))
The final cell type is "Raw NBConvert"
In [3]:
# open IPython's quick-reference card of magics and shortcuts
%quickref
In [4]:
%%python2
# cell magic: run this cell's body in a Python 2 subprocess (print is a statement there)
print "hello world"
In [5]:
# show the input history of the current session
%history
In [6]:
# list the available line magics
# (output also includes cell magics and the automagic setting)
%lsmagic
Out[6]:
In [7]:
# embed a YouTube video by id; the returned object renders via its rich repr
video_id = "L4Hbv4ugUWk"
YouTubeVideo(video_id)
Out[7]:
In [8]:
# `!` runs a shell command from the notebook; list files in the working directory
!ls
Use what's here or roll your own.
Source: IRS.gov
In [9]:
# trailing `?` is equivalent to the `?`-prefix form: open help for pandas.read_csv
pd.read_csv?
In [10]:
# read in a CSV of IRS adjusted-gross-income data
# zipcode must be parsed as a string so leading zeros survive
IRS_CSV = 'IRS_data/12zpallagi.csv'
AGI = pd.read_csv(IRS_CSV, dtype={'zipcode': str})
In [11]:
# column dtypes, non-null counts, and memory footprint of the raw frame
AGI.info()
In [12]:
# keep only the columns this analysis needs (label-based selection)
keep_cols = ['STATE', 'AGI_STUB', 'zipcode', 'N1', 'A00100']
AGI_column_subset = AGI.loc[:, keep_cols]
In [13]:
# get information about type for a given field, and how many values you can expect for each
AGI_column_subset.info()
In [14]:
# summary statistics (count / mean / std / quartiles) for the numeric columns
AGI_column_subset.describe()
Out[14]:
In [15]:
# note this is inclusive!
# .ix was deprecated in pandas 0.20 and removed in 1.0; .loc is the
# label-based replacement and is likewise end-inclusive on this RangeIndex
AGI_column_subset.loc[6:11]
Out[15]:
In [16]:
# give the cryptic IRS field codes human-readable names
friendly_names = {'N1': 'population', 'A00100': 'amount'}
AGI_column_subset = AGI_column_subset.rename(columns=friendly_names)
# (rename(..., inplace=True) would mutate in place; reassignment keeps the cell idempotent)
In [17]:
# easy way to look at a subset of data without having to specify rows
AGI_column_subset.head()
Out[17]:
In [18]:
# group by zipcode and sum the numeric values, resetting the index
# numeric_only=True keeps modern pandas from trying to sum the STATE strings;
# older pandas silently dropped non-numeric columns, and STATE is merged back later
AGI_grouped = AGI_column_subset.groupby('zipcode').sum(numeric_only=True).reset_index()
In [19]:
# sanity-check the grouped frame
AGI_grouped.head()
Out[19]:
In [20]:
# mean population per zipcode — placeholder zips ('00000'/'99999') still skew this
AGI_grouped['population'].mean()
Out[20]:
In [21]:
#this can also be done using the na_values parameter upon being read in
# treat the IRS placeholder zipcodes as missing values
placeholder_zips = ['00000', '99999']
null_zips = AGI_grouped['zipcode'].isin(placeholder_zips)
AGI_grouped.loc[null_zips, 'zipcode'] = np.nan
In [22]:
# verify the placeholder zipcodes are now NaN
AGI_grouped.head()
Out[22]:
In [23]:
# copy with the NaN-zipcode rows removed (AGI_grouped itself is untouched here)
AGI_notnull = AGI_grouped.dropna()
In [24]:
# mean population again — compare with the earlier value to see the placeholder effect
AGI_notnull['population'].mean()
Out[24]:
In [25]:
# drop the NaN-zipcode rows; reassignment is preferred over inplace=True
# (no performance benefit to inplace, and it blocks method chaining)
AGI_grouped = AGI_grouped.dropna()
In [26]:
# derive the real dollar amount — the source figures are in thousands
AGI_grouped = AGI_grouped.assign(actual_amount=AGI_grouped['amount'] * 1000)
Keep in mind you have options, and use magic commands like %timeit to test implementations inline!
In [27]:
# %timeit runs the statement many times and reports the best timing;
# apply() invokes a Python lambda per element, so expect this to be slow
%timeit applied = AGI_grouped['amount'].apply(lambda x: x * 1000)
In [28]:
#being vectorized operations, this is happening at the C level and thereby much faster
# compare against the apply() timing in the previous cell
%timeit vectorized = AGI_grouped['amount'] * 1000
In [29]:
# confirm the new actual_amount column at the end of the frame
AGI_grouped.tail()
Out[29]:
In [30]:
# make a mean, using standard math operations!
# dollars per return for each zipcode
per_return = AGI_grouped['actual_amount'] / AGI_grouped['population']
AGI_grouped['weighted_mean_AGI'] = per_return
In [31]:
# round every value in the column to whole dollars
# Series.round(0) is vectorized (C level) and gives the same result as
# apply(lambda x: round(x, 0)) — both use round-half-to-even — but far faster
AGI_grouped['weighted_mean_AGI'] = AGI_grouped['weighted_mean_AGI'].round(0)
In [32]:
# check dtypes and non-null counts after the derived columns were added
AGI_grouped.info()
In [33]:
# drop columns you won't need
# reassign instead of inplace=True: no performance benefit to inplace,
# and reassignment keeps the frame chainable
AGI_grouped = AGI_grouped.drop(columns=['AGI_STUB', 'amount', 'actual_amount'])
In [34]:
# verify only zipcode, population, and weighted_mean_AGI remain
AGI_grouped.head()
Out[34]:
In [35]:
# also look into pandas.Series.unique
# one row per (zipcode, STATE) pair, to re-attach STATE after the groupby
AGI_subset_geo = AGI[['zipcode','STATE']].drop_duplicates()
In [36]:
# peek at the zipcode-to-state mapping instead of dumping the entire frame
AGI_subset_geo.head()
Out[36]:
In [37]:
#merge rather than join if you want to use a common column other than the index
# left merge keeps every grouped zipcode even if it has no state match
AGI_final = AGI_grouped.merge(AGI_subset_geo, how='left', on='zipcode')
In [38]:
# spot-check the merged frame, including the re-attached STATE column
AGI_final.tail()
Out[38]:
In [39]:
# this gives you the greatest weighted_mean_AGI first
AGI_final.sort_values(by='weighted_mean_AGI',ascending=False).head()
Out[39]:
In [40]:
# chain methods!
# numeric_only=True: zipcode is a string column and cannot be averaged —
# modern pandas raises on it, older pandas silently dropped it
AGI_final.groupby('STATE').mean(numeric_only=True).sort_values(by='weighted_mean_AGI',ascending=False)
Out[40]:
In [41]:
# five states with the highest average weighted_mean_AGI
# numeric_only=True keeps .mean() from choking on the string zipcode column
top_5_states = (
    AGI_final.groupby('STATE')
    .mean(numeric_only=True)
    .sort_values(by='weighted_mean_AGI', ascending=False)
    .reset_index()
    .head()
)
In [42]:
# bar chart of the five highest-AGI states
top_5_states.plot(x='STATE', kind='bar')
Out[42]:
In [43]:
# `cat` is an easy way to examine file contents in place
# !cat tourism_data/581a4d76-9f6d-4786-b22c-73d59627d1e2_v2.csv
In [44]:
# read in a CSV, skipping the four metadata lines at the top of the file
# (without skiprows=4 the header row is mis-detected)
TOURISM_CSV = 'tourism_data/581a4d76-9f6d-4786-b22c-73d59627d1e2_v2.csv'
df = pd.read_csv(TOURISM_CSV, skiprows=4)
In [45]:
# get information about type for a given field, and how many values you can expect for each
df.info()
In [46]:
# discard columns that are entirely empty
df_subset = df.dropna(axis='columns', how='all')
In [47]:
# confirm the all-NaN columns are gone
df_subset.info()
In [48]:
# these two fields are constant across the dataset, so drop them
# (reassignment rather than inplace=True keeps the cell idempotent)
indicator_cols = ['Indicator Name', 'Indicator Code']
df_subset = df_subset.drop(columns=indicator_cols)
In [49]:
# one row per country, one column per year — wide format
df_subset.head()
Out[49]:
In [50]:
# reshape wide -> long: one row per country-year observation
df_melted = df_subset.melt(id_vars=['Country Name', 'Country Code'])
In [51]:
# show just the first rows instead of dumping the entire melted frame
df_melted.head()
Out[51]:
In [52]:
# give the melt defaults meaningful names; reassign rather than inplace=True
# (no performance benefit to inplace, and it blocks chaining)
df_melted = df_melted.rename(columns={'variable':'Year','value':'Tourists'})
In [53]:
# rows for the Dominican Republic only (ISO code DOM)
is_dom = df_melted['Country Code'] == 'DOM'
df_melted[is_dom]
Out[53]:
In [54]:
# tourist arrivals over time for the Dominican Republic
dom_rows = df_melted[df_melted['Country Code'] == 'DOM']
dom_rows.plot(x='Year')
Out[54]: