Python Workshop in Text analysis: Introduction

Javier Garcia-Bernardo, Jan 21st-22nd (The University of Amsterdam)



In [162]:

    
%matplotlib inline
from IPython.display import Image

0. Structure

About Python
Data types, structures and code
Working with text
Writing and reading files
Error debugging
Summary

1. About python

Python: Easy programming language, great for text analysis = English. (R = Dutch)
IPython notebook: where you write Python = Word
Python packages: extensions = tabs in Word



In [229]:

    
#This is a comment
print("Hello World")









    



Hello World



In [231]:

    
## HOW TO IMPORT PACKAGES
## Create a dataframe (spreadsheet) called spreadsheet

#Way 1 (recommended)
import pandas
spreadsheet = pandas.DataFrame() 

import pandas as pd
spreadsheet = pd.DataFrame()

#Way 2
from pandas import DataFrame
spreadsheet = DataFrame()

from pandas import *
spreadsheet = DataFrame()



In [233]:

    
import pandas as pd



In [234]:

    
df = pd.DataFrame()



In [235]:

    
df?

To install new packages you can use pip. Open the "Anaconda Prompt" (under the Start menu) and type:

pip install selenium
pip install python-dateutil
pip install pycountry
pip install labMTsimple
pip install lda
pip install wbdata
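
A quick way to check whether a package is already installed, as a minimal sketch. The helper name is_installed is made up for this example. Note that the name used in "import" can differ from the name used in "pip install" (for example, python-dateutil is imported as dateutil).

```python
import importlib.util

# Hypothetical helper for this example (not part of the workshop material)
def is_installed(import_name):
    """Return True if Python can find a module with this import name."""
    return importlib.util.find_spec(import_name) is not None

# "json" ships with Python; the other two only appear if you installed them
for name in ["json", "pandas", "selenium"]:
    print(name, "installed" if is_installed(name) else "missing")
```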

2. Data types, structures and code

Variables

Data types: Numbers, strings and others
Data structures:
- Lists, tables... (full of data types)

Code

Instructions to modify variables
Can be organized in functions

2.AB Variables: Data Types and Structures



In [217]:

    
## DATA TYPES
this_is_variable1 = 3.5
this_is_variable3 = "I'm a string"
this_is_variable2 = False

## DATA STRUCTURES (list)
this_is_variable4 = [3.5,"I'm another string",4]
this_is_variable5 = [this_is_variable1,this_is_variable2,this_is_variable3,this_is_variable4]



In [218]:

    
print(this_is_variable5)









    



[3.5, False, "I'm a string", [3.5, "I'm another string", 4]]

2.A Data Types

number
- int: -2, 0, 1
- float: 3.5, 4.23
string: "I'm a string"
boolean: False/True



In [167]:

    
print(type(3))
print(type(3.5))
print(type("I'm a string"))
print(type(False))
print(type(None))









    



<class 'int'>
<class 'float'>
<class 'str'>
<class 'bool'>
<class 'NoneType'>

2.A.Strings: Beware of the encoding

The computer uses 0s and 1s to encode strings.

We used to use ASCII encoding, which reads blocks of 7 bits (0/1): 2^7 = 128 characters. That is enough for lower- and upper-case letters and some punctuation, but not for accented or other special symbols (e.g. é, ó, í). It's the default in Python 2 (bad for text analysis).

Nowadays we use UTF-8 encoding, which can handle the symbols of any language. It's the default in Python 3.

But some programs still use UTF-16, ASCII or ISO-8859-1, which makes everything crappy. If you read a file and the content is full of weird symbols, this is likely the problem. Look for an "encoding" option in the code.
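
A small sketch of what goes wrong when the encodings do not match. The word "café" and the file name in the last comment are illustrative assumptions, not workshop data.

```python
text = "café"

# The same string takes a different number of bytes in each encoding
utf8_bytes = text.encode("utf-8")          # é becomes two bytes
latin1_bytes = text.encode("iso-8859-1")   # é becomes one byte
print(len(utf8_bytes), len(latin1_bytes))

# Decoding with the wrong encoding produces the "weird symbols"
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)

# ASCII cannot represent é at all
try:
    text.encode("ascii")
except UnicodeEncodeError as error:
    print("ASCII failed:", error.reason)

# When reading a file, state the encoding explicitly (hypothetical file name):
# with open("my_file.txt", encoding="utf-8") as f:
#     content = f.read()
```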



In [238]:

    
## THE COMPUTER READS THE CODE LINE BY LINE, TOP TO BOTTOM


## OPERATIONS ON DATA TYPES
print(3*5.0)
print(3 == 5)
b = 3
print(b == 3)
b = 5
print(b == 5)

## CONVERT BETWEEN TYPES
print(b)
print(type(b))
b = float(b)
print(b)
print(type(b))









    



15.0
False
True
True
5
<class 'int'>
5.0
<class 'float'>



In [239]:

    
3 = 4









    



  File "<ipython-input-239-16286b76249b>", line 1
    3 = 4
         ^
SyntaxError: can't assign to literal



In [240]:

    
this_is_a_variable6 = not_defined_variable









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-240-8717ac856325> in <module>()
----> 1 this_is_a_variable6 = not_defined_variable

NameError: name 'not_defined_variable' is not defined

2.B Data Structures

list = notebook
tuple = book
dictionary = index in a book
numpy array = fast list for math
pandas dataframe = spreadsheet

They have methods = ways to edit the data structure. For example add, delete, find, sort... (= functions in excel)

2.B.Lists

Combines a series of variables in order
Fast to add and delete variables, slow to find variables (needs to go over all the elements)



In [241]:

    
[1,2,3]









    Out[241]:





[1, 2, 3]



In [169]:

    
## A list
print([0]*10)









    



[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]



In [170]:

    
## A list
this_is_a_list = [1,3,2,"b"]
print("Original: ", this_is_a_list)









    



Original:  [1, 3, 2, 'b']



In [248]:

    
## Add elements
this_is_a_list.append("Hoi")
print("Added Hoi: ", this_is_a_list)









    



Added Hoi:  [1, 3, 2, 'b', 'Hoi']



In [242]:

    
## Get element
print("Fourth element: ", this_is_a_list[3])









    



Fourth element:  b



In [243]:

    
[0,1,2,3,4]









    Out[243]:





[0, 1, 2, 3, 4]



In [173]:

    
## Get slices
print("Second to end element: ", this_is_a_list[1:])









    



Second to end element:  [3, 2, 'b', 'Hoi']



In [249]:

    
print(this_is_a_list)
print(this_is_a_list[1:3])
print(this_is_a_list[:-1])









    



[1, 3, 2, 'b', 'Hoi']
[3, 2]
[1, 3, 2, 'b']



In [174]:

    
## Remove 4th element
this_is_a_list.pop(3)
print("Removed fourth element: ", this_is_a_list)









    



Removed fourth element:  [1, 3, 2, 'Hoi']



In [175]:

    
#Search
"Hoi" in this_is_a_list









    Out[175]:





True



In [176]:

    
## Sort
this_is_a_list = [1, 3, 2]
this_is_a_list.sort()
print("Sorted: ", this_is_a_list)









    



Sorted:  [1, 2, 3]

ipython help



In [177]:

    
this_is_a_list?



In [250]:

    
this_is_a_list.pop?

2.B.Tuples

Combines a series of variables in order
Immutable (cannot be modified after creation)
Faster than lists
I never use them
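
A minimal sketch of what "immutable" means in practice. The coordinates dictionary is a made-up example.

```python
this_is_a_tuple = (1, 3, 2, "b")
this_is_a_list = [1, 3, 2, "b"]

# Lists can be modified in place
this_is_a_list[0] = 99

# Tuples cannot: Python raises a TypeError
try:
    this_is_a_tuple[0] = 99
except TypeError as error:
    print("Error:", error)

# Because they never change, tuples can be used as dictionary keys
coordinates = {(52.37, 4.90): "Amsterdam"}
print(coordinates[(52.37, 4.90)])
```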



In [179]:

    
this_is_a_tuple = (1,3,2,"b")
print(this_is_a_tuple)
this_is_a_list = list(this_is_a_tuple)
print(this_is_a_list)









    



(1, 3, 2, 'b')
[1, 3, 2, 'b']



In [251]:

    
print(list(range(3,16)))
print(list(range(3,16,3)))









    



[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[3, 6, 9, 12, 15]

2.B.Sets

One element of each kind
Allows for intersecting
Example: Which words are different between two sets
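
The word example could look like this, as a sketch with two made-up sentences. It uses the operators - (difference) and & (intersection).

```python
text1 = "the cat sat on the mat"
text2 = "the dog sat on the sofa"

# split() cuts the string into words; set() keeps one copy of each word
words1 = set(text1.split())
words2 = set(text2.split())

only_in_1 = words1 - words2   # words that appear only in the first text
only_in_2 = words2 - words1   # words that appear only in the second text
shared = words1 & words2      # words that appear in both

# Sets have no order, so sort them before printing
print(sorted(only_in_1))
print(sorted(only_in_2))
print(sorted(shared))
```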



In [181]:

    
Image(filename='./images/setOp.png')









    Out[181]:



In [182]:

    
this_is_a_set1 = set([1,1,1,2,3])
this_is_a_set2 = set({1,2,4})
print(this_is_a_set1)
print(this_is_a_set2)









    



{1, 2, 3}
{1, 2, 4}



In [183]:

    
## Union
print(this_is_a_set1 | this_is_a_set2)









    



{1, 2, 3, 4}



In [184]:

    
## Intersection
print(this_is_a_set1 & this_is_a_set2)









    



{1, 2}



In [185]:

    
## Difference set1 - set2
print(this_is_a_set1 - this_is_a_set2)

{3}

2.B.Dictionary

Like the index of a book: finds a page very, very fast.
Slower to create than a list
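
In text analysis, dictionaries are often used to count words. A minimal sketch with a made-up sentence (it uses a for loop, explained in section 2.C):

```python
words = "to be or not to be".split()

word_counts = {}
for word in words:
    # get() returns the current count, or 0 if the word is not in the dictionary yet
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)
# Looking up a missing word with get() does not raise an error
print(word_counts.get("question", 0))
```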



In [252]:

    
#Dictionary
this_is_a_dict = {"Javier": 63434234234, "Friend1": 4234423243, "Friend2": 4234423243}
print(this_is_a_dict)









    



{'Friend1': 4234423243, 'Friend2': 4234423243, 'Javier': 63434234234}



In [188]:

    
this_is_a_dict["Friend2"]









    Out[188]:





4234423243



In [189]:

    
#Creating dictionary using two lists
list_names = ["Javier", "Friend1", "Friend2"]
list_numbers = [63434234234,4234423243,424233345]

#Put both together
this_is_a_dict = dict(zip(list_names,list_numbers))
print(this_is_a_dict)









    



{'Friend1': 4234423243, 'Friend2': 424233345, 'Javier': 63434234234}



In [190]:

    
print(zip(list_names,list_numbers))









    



<zip object at 0x7fa6fbc3fc08>



In [191]:

    
print(list(zip(list_names,list_numbers)))









    



[('Javier', 63434234234), ('Friend1', 4234423243), ('Friend2', 424233345)]




In [255]:

    
import time
numElements = 50000
this_is_a_list = list(range(numElements))
print(this_is_a_list[:10])

this_is_a_dict = dict(zip(range(numElements),[0]*numElements))
print(([0]*numElements)[:10])
      
start = time.time()
for i in range(numElements): this_is_a_dict.get(i)
print(time.time() - start)

start = time.time()
for i in range(numElements): this_is_a_list.index(i)
print(time.time() - start)









    



[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
0.018622636795043945
20.1564302444458

2.B.Numpy array

Part of the numpy package
Extremely fast (and easy) to do math



In [257]:

    
import numpy as np
import scipy.stats

numElements = 1000
this_is_an_list = list(range(numElements))
this_is_an_array = np.array(range(numElements))
print(this_is_an_array[:10])

## Mean, standard deviation, median, mode and skewness
print(np.mean(this_is_an_array))
print(np.std(this_is_an_array))
print(np.median(this_is_an_array))
print(scipy.stats.mode(this_is_an_array))
print(scipy.stats.skew(this_is_an_array))









    



[0 1 2 3 4 5 6 7 8 9]
499.5
288.674990257
499.5
ModeResult(mode=array([0]), count=array([1]))
0.0

2.B.Pandas dataframe

Part of the pandas package
Read/write csv files
We will use it to keep compatibility with excel
First rule of the "good people club": Save raw spreadsheets as .csv data

What's a CSV? A comma-separated values file:

newspaper,number_something
ABC,1
ABC,3
ABC,2
ElPais,10
ElPais,15

The values can be separated by other characters and it's still called a csv (a tsv is the same thing separated with tabs).
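
To see that only the separator changes, here is a sketch using Python's built-in csv module rather than pandas. The text and the helper read_rows are made up for the example.

```python
import csv
import io

csv_text = "newspaper,number_something\nABC,1\nABC,3\nElPais,10\n"
tsv_text = csv_text.replace(",", "\t")   # same data, separated by tabs

# Hypothetical helper: read text into a list of rows
def read_rows(text, delimiter):
    # io.StringIO lets us treat a string as if it were an open file
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

rows_csv = read_rows(csv_text, ",")
rows_tsv = read_rows(tsv_text, "\t")

print(rows_csv[0])            # header
print(rows_csv[1])            # first data row (values are read as strings)
print(rows_csv == rows_tsv)   # same content, different separator
```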



In [258]:

    
import pandas as pd
## Read excel
excelFrame = pd.read_excel("./data/nl_data.xlsx",sheet_name = 0, header = 0,skiprows=4)
## Read csv
csvFrame = pd.read_csv("./data/nl_data.csv",sep=",",index_col=None,skiprows=4,na_values=["-999"])

## Print first 5 rows
csvFrame.head()









    Out[258]:






  
    
      
   Country Name  Country Code  Indicator Name                                     Indicator Code
0  Netherlands   NLD           Agricultural machinery, tractors                   AG.AGR.TRAC.NO
1  Netherlands   NLD           Fertilizer consumption (% of fertilizer produc...  AG.CON.FERT.PT.ZS
2  Netherlands   NLD           Fertilizer consumption (kilograms per hectare ...  AG.CON.FERT.ZS
3  Netherlands   NLD           Agricultural land (sq. km)                         AG.LND.AGRI.K2
4  Netherlands   NLD           Agricultural land (% of land area)                 AG.LND.AGRI.ZS

   1960          1961          1962          1963          1964          1965  ...
0   NaN  62000.000000  70000.000000  78000.000000  86000.000000  94000.000000  ...
1   NaN           NaN           NaN           NaN           NaN           NaN  ...
2   NaN           NaN           NaN           NaN           NaN           NaN  ...
3   NaN  23140.000000  23030.000000  22890.000000  22680.000000  22550.000000  ...
4   NaN     68.542654     68.216825     67.802133     67.180095     66.795024  ...

           2007          2008          2009          2010          2011
0           NaN           NaN           NaN           NaN           NaN
1     19.293867     17.687940     16.708960     18.603141     15.604367
2    302.139083    267.708607    238.171044    293.325836    246.811133
3  19144.000000  19293.000000  19174.000000  18723.000000  18584.000000
4     56.706161     57.147512     56.845538     55.508449     55.112693

           2012          2013  2014  2015  Unnamed: 60
0           NaN           NaN   NaN   NaN          NaN
1     19.844205     13.081956   NaN   NaN          NaN
2    345.989120    231.127696   NaN   NaN          NaN
3  18417.000000  18476.000000   NaN   NaN          NaN
4     54.617438     54.873775   NaN   NaN          NaN
    
  

5 rows × 61 columns



In [195]:

    
## Describe
print(csvFrame.describe())









    



               1960          1961          1962          1963          1964  \
count  1.580000e+02  1.780000e+02  1.910000e+02  1.950000e+02  1.950000e+02   
mean   9.644333e+09  8.671782e+09  8.733247e+09  8.952507e+09  9.810818e+09   
std    3.216011e+10  3.059052e+10  3.169151e+10  3.259826e+10  3.540193e+10   
min   -7.734878e+08 -7.255122e+08 -5.424170e+08 -8.969276e+08 -2.042674e+09   
25%    7.648415e+00  4.676500e+00  7.729379e+00  6.793045e+00  7.720433e+00   
50%    7.421578e+01  6.246398e+01  6.642467e+01  5.977151e+01  5.681000e+01   
75%    9.840219e+07  8.098655e+05  9.581490e+05  7.080155e+05  9.143374e+04   
max    1.632038e+11  1.628373e+11  1.753224e+11  1.807355e+11  1.966831e+11   

               1965          1966          1967          1968          1969  \
count  2.070000e+02  2.000000e+02  2.060000e+02  2.020000e+02  2.210000e+02   
mean   1.041116e+10  1.119155e+10  1.151534e+10  1.265630e+10  1.404333e+10   
std    3.753462e+10  3.927049e+10  4.082562e+10  4.396270e+10  4.543976e+10   
min   -8.078442e+08 -1.486532e+09 -1.482232e+09 -1.766168e+09 -1.666834e+09   
25%    7.728036e+00  5.973497e+00  6.816228e+00  8.626535e+00  9.986721e+00   
50%    6.042472e+01  6.085083e+01  6.100371e+01  6.179447e+01  7.353951e+01   
75%    9.546090e+05  1.131976e+06  9.601292e+05  1.149812e+07  8.094900e+08   
max    2.166166e+11  2.223483e+11  2.342650e+11  2.490582e+11  2.651387e+11   

          ...               2007          2008          2009          2010  \
count     ...       7.540000e+02  7.500000e+02  7.310000e+02  7.730000e+02   
mean      ...       6.199006e+10  6.281328e+10  5.955810e+10  5.837514e+10   
std       ...       1.766378e+11  1.759854e+11  1.632613e+11  1.645582e+11   
min       ...      -1.410985e+11 -9.209300e+10 -2.831900e+10 -5.156719e+10   
25%       ...       6.000000e+00  5.925000e+00  6.040510e+00  6.000000e+00   
50%       ...       5.825905e+01  5.891979e+01  5.970000e+01  5.890000e+01   
75%       ...       1.029768e+05  6.349576e+04  4.908945e+05  6.221985e+04   
max       ...       1.545905e+12  9.784070e+11  8.847164e+11  8.945230e+11   

               2011          2012          2013          2014         2015  \
count  7.220000e+02  7.450000e+02  6.370000e+02  4.590000e+02    50.000000   
mean   6.596796e+10  6.249261e+10  7.204636e+10  9.519222e+10   246.625697   
std    1.791602e+11  1.725278e+11  1.877609e+11  2.134125e+11  1229.121808   
min   -2.269100e+10 -2.145100e+10 -1.696441e+10 -1.716528e+10     0.000000   
25%    6.611412e+00  6.000000e+00  6.800000e+00  6.214552e+00     3.425000   
50%    6.350000e+01  6.240000e+01  6.819122e+01  6.590000e+01     8.500000   
75%    1.063354e+06  2.025720e+05  8.998325e+06  1.099990e+10    97.650000   
max    1.026275e+12  9.775700e+11  1.021033e+12  9.889670e+11  8700.000000   

       Unnamed: 60  
count            0  
mean           NaN  
std            NaN  
min            NaN  
25%            NaN  
50%            NaN  
75%            NaN  
max            NaN  

[8 rows x 57 columns]



In [196]:

    
## Calculate mean
print(csvFrame.mean(axis=0)) #By columns
print(csvFrame.mean(axis=1)) #By rows









    



1960           9.644333e+09
1961           8.671782e+09
1962           8.733247e+09
1963           8.952507e+09
1964           9.810818e+09
1965           1.041116e+10
1966           1.119155e+10
1967           1.151534e+10
1968           1.265630e+10
1969           1.404333e+10
1970           2.034449e+10
1971           1.773767e+10
1972           1.867170e+10
1973           2.105561e+10
1974           2.242130e+10
1975           2.187833e+10
1976           2.354577e+10
1977           2.457515e+10
1978           2.672740e+10
1979           2.830681e+10
1980           2.687033e+10
1981           2.653267e+10
1982           2.613718e+10
1983           2.515379e+10
1984           2.776158e+10
1985           2.571123e+10
1986           3.144424e+10
1987           3.016809e+10
1988           3.061985e+10
1989           3.199168e+10
1990           3.187026e+10
1991           3.438141e+10
1992           3.514316e+10
1993           3.607781e+10
1994           3.854710e+10
1995           3.829632e+10
1996           3.987001e+10
1997           4.083188e+10
1998           4.259209e+10
1999           4.338528e+10
2000           4.226315e+10
2001           4.542415e+10
2002           4.505648e+10
2003           4.772440e+10
2004           4.981190e+10
2005           5.110916e+10
2006           5.738841e+10
2007           6.199006e+10
2008           6.281328e+10
2009           5.955810e+10
2010           5.837514e+10
2011           6.596796e+10
2012           6.249261e+10
2013           7.204636e+10
2014           9.519222e+10
2015           2.466257e+02
Unnamed: 60             NaN
dtype: float64
0       1.508709e+05
1       1.977243e+01
2       3.200545e+02
3       2.036600e+04
4       6.033293e+01
5       8.940453e+05
6       6.151689e-02
7       2.648631e+01
8       2.629354e+05
9       1.045027e+00
10      5.846905e+01
11      3.585870e+03
12      1.069640e+01
13      1.055161e+01
14      7.780000e+02
15      3.375407e+04
16      1.766827e+03
17      1.535190e+06
18      8.051811e+01
19      8.773226e+01
20      9.020547e+01
21      4.152889e+04
22      6.339111e+03
23      3.139955e+01
24      6.614991e+01
25      2.765265e+11
26      5.431615e+11
27      3.978890e+00
28      4.069067e+11
29      1.362548e+11
            ...     
1315    1.214338e+01
1316    1.389317e+01
1317    1.974906e+00
1318    5.828606e+01
1319    2.876095e+00
1320    2.316059e+00
1321    1.847732e+11
1322    7.993843e+01
1323    7.524936e+00
1324    1.343265e+00
1325    1.675852e+00
1326    1.208438e+00
1327    1.529114e+00
1328    5.167672e-01
1329    1.950753e+00
1330    1.253663e+01
1331    1.795774e+11
1332             NaN
1333    1.703854e+02
1334    5.630338e+01
1335    9.700789e+10
1336    4.106099e+10
1337    2.441667e+01
1338    2.826600e+01
1339    1.345571e+01
1340    2.000000e+00
1341             NaN
1342             NaN
1343    1.103590e+00
1344             NaN
dtype: float64



In [197]:

    
## Keep a subset
print(csvFrame["Indicator Code"])









    



0          AG.AGR.TRAC.NO
1       AG.CON.FERT.PT.ZS
2          AG.CON.FERT.ZS
3          AG.LND.AGRI.K2
4          AG.LND.AGRI.ZS
5          AG.LND.ARBL.HA
6       AG.LND.ARBL.HA.PC
7          AG.LND.ARBL.ZS
8          AG.LND.CREL.HA
9          AG.LND.CROP.ZS
10         AG.LND.EL5M.ZS
11         AG.LND.FRST.K2
12         AG.LND.FRST.ZS
13      AG.LND.IRIG.AG.ZS
14         AG.LND.PRCP.MM
15         AG.LND.TOTL.K2
16         AG.LND.TRAC.ZS
17         AG.PRD.CREL.MT
18         AG.PRD.CROP.XD
19         AG.PRD.FOOD.XD
20         AG.PRD.LVSK.XD
21         AG.SRF.TOTL.K2
22         AG.YLD.CREL.KG
23      BG.GSR.NFSV.GD.ZS
24         BM.GSR.CMCP.ZS
25         BM.GSR.FCTY.CD
26         BM.GSR.GNFS.CD
27         BM.GSR.INSF.ZS
28         BM.GSR.MRCH.CD
29         BM.GSR.NFSV.CD
              ...        
1315    TX.VAL.FUEL.ZS.UN
1316    TX.VAL.ICTG.ZS.UN
1317    TX.VAL.INSF.ZS.WT
1318    TX.VAL.MANF.ZS.UN
1319    TX.VAL.MMTL.ZS.UN
1320    TX.VAL.MRCH.AL.ZS
1321    TX.VAL.MRCH.CD.WT
1322    TX.VAL.MRCH.HI.ZS
1323    TX.VAL.MRCH.OR.ZS
1324    TX.VAL.MRCH.R1.ZS
1325    TX.VAL.MRCH.R2.ZS
1326    TX.VAL.MRCH.R3.ZS
1327    TX.VAL.MRCH.R4.ZS
1328    TX.VAL.MRCH.R5.ZS
1329    TX.VAL.MRCH.R6.ZS
1330    TX.VAL.MRCH.RS.ZS
1331    TX.VAL.MRCH.WL.CD
1332    TX.VAL.MRCH.WR.ZS
1333    TX.VAL.MRCH.XD.WD
1334    TX.VAL.OTHR.ZS.WT
1335    TX.VAL.SERV.CD.WT
1336       TX.VAL.TECH.CD
1337    TX.VAL.TECH.MF.ZS
1338    TX.VAL.TRAN.ZS.WT
1339    TX.VAL.TRVL.ZS.WT
1340          VC.BTL.DETH
1341       VC.IDP.TOTL.HE
1342       VC.IDP.TOTL.LE
1343       VC.IHR.PSRC.P5
1344       VC.PKP.TOTL.UN
Name: Indicator Code, dtype: object



In [198]:

    
print(csvFrame["Indicator Code"]=="AG.LND.AGRI.K2")









    



0       False
1       False
2       False
3        True
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
1315    False
1316    False
1317    False
1318    False
1319    False
1320    False
1321    False
1322    False
1323    False
1324    False
1325    False
1326    False
1327    False
1328    False
1329    False
1330    False
1331    False
1332    False
1333    False
1334    False
1335    False
1336    False
1337    False
1338    False
1339    False
1340    False
1341    False
1342    False
1343    False
1344    False
Name: Indicator Code, dtype: bool



In [199]:

    
print(csvFrame.loc[3,:])









    



Country Name                     Netherlands
Country Code                             NLD
Indicator Name    Agricultural land (sq. km)
Indicator Code                AG.LND.AGRI.K2
1960                                     NaN
1961                                   23140
1962                                   23030
1963                                   22890
1964                                   22680
1965                                   22550
1966                                   22450
1967                                   22390
1968                                   22270
1969                                   22100
1970                                   21930
1971                                   21280
1972                                   21140
1973                                   21010
1974                                   20940
1975                                   20820
1976                                   20730
1977                                   20600
1978                                   20460
1979                                   20340
1980                                   20200
1981                                   20110
1982                                   20050
1983                                   20090
1984                                   20160
1985                                   20190
                             ...            
1987                                   20140
1988                                   20120
1989                                   20040
1990                                   20060
1991                                   19910
1992                                   19860
1993                                   19880
1994                                   19710
1995                                   19640
1996                                   19810
1997                                   19660
1998                                   19730
1999                                   19670
2000                                   19560
2001                                   19310
2002                                   19490
2003                                   19230
2004                                   19494
2005                                   19377
2006                                   19196
2007                                   19144
2008                                   19293
2009                                   19174
2010                                   18723
2011                                   18584
2012                                   18417
2013                                   18476
2014                                     NaN
2015                                     NaN
Unnamed: 60                              NaN
Name: 3, dtype: object



In [200]:

    
csvFrame.loc[csvFrame["Indicator Code"]=="AG.LND.AGRI.K2",["Indicator Code","1970","1971"]]









    Out[200]:






  
    
      
   Indicator Code   1970   1971
3  AG.LND.AGRI.K2  21930  21280



In [201]:

    
### More advanced stuff
import matplotlib.pyplot as plt
## Plot
print(csvFrame.columns)
columns_to_keep = ['1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969',
       '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978',
       '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987',
       '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996',
       '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005',
       '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014',
       '2015']
print(csvFrame.loc[3,"Indicator Name"])
print(csvFrame.loc[5,"Indicator Name"])

years = []
for year in columns_to_keep:
    years.append(int(year))
print(years)

## Agricultural land is in sq. km: multiply by 100 to get hectares
plt.plot(years,csvFrame.loc[3,columns_to_keep]*100,color="blue",label="Agricultural land")
plt.plot(years,csvFrame.loc[5,columns_to_keep],color="red",label="Arable land")
plt.xlabel("Year")
plt.ylabel("Land (hectares)")
plt.legend()
plt.show()









    



Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', 'Unnamed: 60'],
      dtype='object')
Agricultural land (sq. km)
Arable land (hectares)
[1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015]



In [202]:

    
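## Keep the indicator names plus 20 years (1994-2013)
csvFrame2 = csvFrame.loc[:,["Indicator Name"]+columns_to_keep[-22:-2]]
## One row per year, one column per indicator
csvFrame_transposed = csvFrame2.set_index("Indicator Name").transpose()
## Drop the indicators with missing values
csvFrame_transposed = csvFrame_transposed.dropna(axis=1)

## Keep three indicators and give them short names (needed for the formulas below)
csvFrame_transposedShort = csvFrame_transposed[["Agricultural land (sq. km)","Arable land (hectares)","Cereal production (metric tons)"]].copy()
csvFrame_transposedShort.columns = ["agric_land","arable_land","cereal_prod"]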



In [203]:

    
import statsmodels.formula.api as sm

result = sm.ols(formula="cereal_prod ~ agric_land * arable_land", data=csvFrame_transposedShort).fit()
result.summary()









    Out[203]:





OLS Regression Results

  Dep. Variable:        cereal_prod      R-squared:             0.454
  Model:                    OLS          Adj. R-squared:        0.352
  Method:              Least Squares     F-statistic:           4.439
  Date:              Tue, 19 Jan 2016    Prob (F-statistic):   0.0188
  Time:                  16:35:25        Log-Likelihood:      -263.77
  No. Observations:           20         AIC:                   535.5
  Df Residuals:               16         BIC:                   539.5
  Df Model:                    3
  Covariance Type:       nonrobust

                            coef      std err       t       P>|t|  [95.0% Conf. Int.]
  Intercept                7.142e+07   4.32e+07      1.651   0.118  -2.03e+07  1.63e+08
  agric_land              -3634.2121   2219.807     -1.637   0.121  -8339.993  1071.569
  arable_land               -68.3397     43.271     -1.579   0.134   -160.070    23.391
  agric_land:arable_land      0.0036      0.002      1.604   0.128     -0.001     0.008

  Omnibus:         1.428    Durbin-Watson:         1.911
  Prob(Omnibus):   0.490    Jarque-Bera (JB):      0.940
  Skew:            0.174    Prob(JB):              0.625
  Kurtosis:        1.996    Cond. No.           2.53e+13


In [204]:

    
result = sm.ols(formula="cereal_prod ~ agric_land + arable_land", data=csvFrame_transposedShort).fit()
result.summary()









    Out[204]:





OLS Regression Results

  Dep. Variable:        cereal_prod      R-squared:             0.367
  Model:                    OLS          Adj. R-squared:        0.292
  Method:              Least Squares     F-statistic:           4.918
  Date:              Tue, 19 Jan 2016    Prob (F-statistic):   0.0206
  Time:                  16:35:25        Log-Likelihood:      -265.26
  No. Observations:           20         AIC:                   536.5
  Df Residuals:               17         BIC:                   539.5
  Df Model:                    2
  Covariance Type:       nonrobust

                 coef      std err       t       P>|t|  [95.0% Conf. Int.]
  Intercept      2.14e+06   2.06e+06      1.041   0.312   -2.2e+06  6.48e+06
  agric_land     -77.2548     92.732     -0.833   0.416   -272.902   118.392
  arable_land      1.0483      0.471      2.223   0.040      0.053     2.043

  Omnibus:         2.038    Durbin-Watson:         1.529
  Prob(Omnibus):   0.361    Jarque-Bera (JB):      1.048
  Skew:            0.079    Prob(JB):              0.592
  Kurtosis:        1.890    Cond. No.           5.98e+07



In [259]:

    
result = sm.ols(formula="cereal_prod ~ arable_land", data=csvFrame_transposedShort).fit()
result.summary()









    Out[259]:





OLS Regression Results

  Dep. Variable:        cereal_prod      R-squared:             0.341
  Model:                    OLS          Adj. R-squared:        0.304
  Method:              Least Squares     F-statistic:           9.299
  Date:              Wed, 20 Jan 2016    Prob (F-statistic):   0.00690
  Time:                  14:02:07        Log-Likelihood:      -265.66
  No. Observations:           20         AIC:                   535.3
  Df Residuals:               18         BIC:                   537.3
  Df Model:                    1
  Covariance Type:       nonrobust

                 coef      std err       t       P>|t|  [95.0% Conf. Int.]
  Intercept     4.608e+05      4e+05      1.152   0.265   -3.8e+05   1.3e+06
  arable_land      1.2414      0.407      3.049   0.007      0.386     2.097

  Omnibus:         2.053    Durbin-Watson:         1.456
  Prob(Omnibus):   0.358    Jarque-Bera (JB):      1.038
  Skew:           -0.011    Prob(JB):              0.595
  Kurtosis:        1.884    Cond. No.           1.17e+07

Group and plot



In [205]:

    
## pandas 2.0 removed DataFrame.append, so build the table from a list of rows
rows = [{"Newspaper":"ABC","Number_something": 1},
        {"Newspaper":"ABC","Number_something": 2},
        {"Newspaper":"ABC","Number_something": 1},
        {"Newspaper":"ElPais","Number_something": 10},
        {"Newspaper":"ElPais","Number_something": 25},
        {"Newspaper":"ElPais","Number_something": 25}]
data = pd.DataFrame(rows, columns=["Year", "Newspaper","Number_something"])
print(data)


## Average only the numeric column (the empty "Year" column is left out)
meanData = data.groupby("Newspaper")["Number_something"].mean().reset_index()
meanData.plot(x = "Newspaper", y = "Number_something", kind="bar",edgecolor="none",color=(70/255,140/255,210/255),legend=False)
plt.show()









    



  Year Newspaper  Number_something
0  NaN       ABC                 1
1  NaN       ABC                 2
2  NaN       ABC                 1
3  NaN    ElPais                10
4  NaN    ElPais                25
5  NaN    ElPais                25

Look for info

http://stackoverflow.com/questions/12096252/use-a-list-of-values-to-select-rows-from-a-pandas-dataframe
http://stanford.edu/~mwaskom/software/seaborn/examples/hexbin_marginals.html

2.C Code: Operations, functions, control flow and loops

We have the data in data structures, composed of several data types.
We need code to edit everything

2.C.Operations

Change a data type or structure



In [206]:

    
## OPERATIONS ON DATA TYPES
print(3*5.0)
print(3 == 5)
b = 3
print(b == 3)
b = 5
print(b == 5)

## CONVERT BETWEEN TYPES
print(type(b))
b = float(b)
print(type(b))









    



15.0
False
True
True
<class 'int'>
<class 'float'>

2.C.Functions

A fragment of code that takes some standard input and gives some standard output.
Example: the mean function. It takes a list of numbers as input and gives the mean as output. It gives an error if you try to calculate the mean of strings.
We have already seen many functions: add, mean...



In [207]:

    
## INDENTATION!! The body of a function must be indented
import numpy as np

## Our own functions
def mean(listOfNumbers):
    return np.sum(listOfNumbers)/len(listOfNumbers)



In [208]:

    
aList = [2,3,4]
print(mean(aList))

3.0
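
The error mentioned above can be seen with a version of the function that uses only the built-in sum(). This is a sketch; mean_plain and raised are names invented for this example.

```python
def mean_plain(list_of_numbers):
    # Same idea as mean() above, using the built-in sum() instead of numpy
    return sum(list_of_numbers) / len(list_of_numbers)

print(mean_plain([2, 3, 4]))

# Strings cannot be added to numbers, so Python raises a TypeError
raised = False
try:
    mean_plain(["a", "b", "c"])
except TypeError as error:
    raised = True
    print("TypeError:", error)
```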

2.C.Scope: Global vs local variables

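Variables created inside a function (local) are only seen by that function
Variables created outside functions (global) can be read by every function, and changed by any function that declares them with "global" or modifies them in place (dangerous)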



In [209]:

    
s = "I hate spam." 

## What's s?
def f(): 
    s = "Me too."
    return s

f()
print(s)









    



I hate spam.



In [210]:

    
s = "I hate spam." 

## What's s?
def f(): 
    s = "Me too."
    return s

s = f()
print(s)









    



Me too.
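
A short sketch of the three ways a function can interact with a variable defined outside it. The names counter, words, read_only, rebind and mutate are made up for this example.

```python
counter = 0
words = []

def read_only():
    # Reading a global variable works without any declaration
    return counter + 1

def rebind():
    # "global" is needed to assign a new value to the outside variable
    global counter
    counter = counter + 1

def mutate():
    # Changing a list in place affects the outside list, no "global" needed
    words.append("spam")

print(read_only())
rebind()
mutate()
print(counter, words)
```

This is why global variables are dangerous: any function can change them, and it is hard to see where that happened.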

2.C.Control flow = if-else statements

Controls the flow: decides which lines of code run
If something, do something. Else, do another thing.



In [262]:

    
#Count how many times each name is entered
aDict = dict({"Bob":5, "Pep":3})

name = input("Enter Bob or Pep: ")

if name == "Bob":
    aDict["Bob"] = aDict["Bob"] + 1  
elif name == "Pep":
    aDict["Pep"] = aDict["Pep"] + 1 
else:
    print("Wrong name")
    
print(aDict)









    



Enter Bob or Pep: Bob
{'Bob': 6, 'Pep': 3}
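
The same if-else logic can go inside a function, so it works for any number of names without typing them one by one. A sketch, where count_name and counts are hypothetical names:

```python
def count_name(counts, name):
    # Add one to the counter of a known name, complain otherwise
    if name in counts:
        counts[name] = counts[name] + 1
    else:
        print("Wrong name:", name)
    return counts

counts = {"Bob": 5, "Pep": 3}
for name in ["Bob", "Pep", "Bob", "Ann"]:
    counts = count_name(counts, name)
print(counts)
```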

2.C.Loops

Iterate over something (a list, a string, a file...) and run the same code for each element
for loop



In [276]:

    
import numpy as np
list_numbers = [1,9,121,2335432432432423434877733543544533.]

print(np.sqrt(list_numbers[0]))
print(np.sqrt(list_numbers[1]))
print(np.sqrt(list_numbers[2]))
print(np.sqrt(list_numbers[3]))
#...









    



1.0
3.0
11.0
4.83263120094e+16



In [279]:

    
for index in [0,1,2,3]:
    print(index)



In [280]:

    
for index in [0,1,2,3]:
    print(np.sqrt(list_numbers[index]))









    



1.0
3.0
11.0
4.83263120094e+16



In [283]:

    
for element in list_numbers:
    print(np.sqrt(element))









    



1.0
3.0
11.0
4.83263120094e+16
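
Two other common ways to write the same loop, sketched with the standard math module instead of numpy. The names numbers and roots are only for this example.

```python
import math

numbers = [1, 9, 121]

# enumerate() gives the position and the element at the same time
for index, element in enumerate(numbers):
    print(index, math.sqrt(element))

# A list comprehension builds a new list with the results of the loop
roots = [math.sqrt(element) for element in numbers]
print(roots)
```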

3.A.Working with strings

Delete punctuation
Convert the string to a list of words
Remove stop words
Join a list to a string



In [284]:

    
#Slice just like lists
this_is_a_string = "Hello my name is"
print(this_is_a_string[:10])









    



Hello my n



In [286]:

    
print("-"*10)









    



----------



In [285]:

    
#Upper and lower case
print("Hello All".lower())
print("Hello All".upper())









    



hello all
HELLO ALL



In [287]:

    
#Strip end spaces or return characters
this_is_a_string = "Hello my name is\n" #tab = "\t"
print("-"*10)
print(this_is_a_string)
print("-"*10)
print(this_is_a_string.strip())
print("-"*10)









    



----------
Hello my name is

----------
Hello my name is
----------



In [290]:

    
#Formatting (\t = tab, \n = return)
print("{0}\t{1}-----{2}\n".format("Hello","my name","is"))









    



Hello	my name-----is
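
In Python 3.6 or newer the same formatting can be written with an f-string. A sketch comparing both (the variable names are illustrative):

```python
first, second, third = "Hello", "my name", "is"

# str.format, as above
with_format = "{0}\t{1}-----{2}".format(first, second, third)

# f-string: the variable names go directly inside the braces
with_fstring = f"{first}\t{second}-----{third}"

print(with_fstring)
print(with_format == with_fstring)

# Numbers can be rounded while formatting
rounded = "{0:.2f}".format(3.14159)
print(rounded)
```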



In [369]:

    
#Delete punctuation
def remove_punctuation(string_to_remove):
    import string
    transtable = {ord(c): None for c in string.punctuation}
    return string_to_remove.translate(transtable).lower()

initial_string = "Hello. I'm having breakfast with my brothers. A nice one"
print(initial_string)
new_string = remove_punctuation(initial_string)
print(new_string)









    



Hello. I'm having breakfast with my brothers. A nice one
hello im having breakfast with my brothers a nice one
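
string.punctuation only contains ASCII symbols, so typographic quotes such as ’ or “ are left in the text. A possible extension, assuming those four characters are the ones you want to delete (EXTRA_PUNCTUATION and remove_punctuation_extended are hypothetical names):

```python
import string

# string.punctuation only has ASCII symbols; the extra characters are an assumption
EXTRA_PUNCTUATION = "’‘“”"

def remove_punctuation_extended(text):
    # The third argument of str.maketrans lists the characters to delete
    transtable = str.maketrans("", "", string.punctuation + EXTRA_PUNCTUATION)
    return text.translate(transtable).lower()

cleaned_quotes = remove_punctuation_extended("You don’t know the “boat people”.")
print(cleaned_quotes)
```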



In [375]:

    
#Remove word endings (stemming)
def stem_string(string_to_stem,language="english"):
    from nltk.stem.snowball import SnowballStemmer
    stemmer = SnowballStemmer(language)
    return " ".join([stemmer.stem(word) for word in string_to_stem.split(" ")])

new_string = stem_string(new_string)
print(new_string)









    



hello im have breakfast with my brother a nice one



In [371]:

    
#Splitting: convert the string to a list that we can iterate over
splitted_text = new_string.split(" ")
print(splitted_text)









    



['hello', 'im', 'have', 'breakfast', 'with', 'my', 'brother', 'a', 'nice', 'one']



In [372]:

    
#Join them
joined_text = " ".join(splitted_text)
print(joined_text)









    



hello im have breakfast with my brother a nice one



In [ ]:

    
#Download the stop word lists (only needed once)
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")



In [373]:

    
cached_stop = stopwords.words("english")
print(cached_stop)

def remove_stop_words_not_obscure(text):
    #Step-by-step version
    #split
    words = text.split()
    #keep only the words that are not stop words
    new_text = []
    for word in words:
        if word not in cached_stop:
            new_text.append(word)
    #join together
    return ' '.join(new_text)

def remove_stop_words(text):
    #Same thing in one line
    return ' '.join([word for word in text.split() if word not in cached_stop])
print()
print(joined_text)
print(remove_stop_words(joined_text))









    



['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

hello im have breakfast with my brother a nice one
hello im breakfast brother nice one
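
For long texts it helps to store the stop words in a set, because checking "word in a_set" is faster than checking a list. A sketch with a tiny hand-made stop list (small_stop_list is an assumption for this example, not the NLTK list):

```python
# A tiny hand-made stop word list, only for illustration
small_stop_list = ["a", "with", "my", "have"]
stop_set = set(small_stop_list)

def remove_stop_words_set(text, stop_words=stop_set):
    # Keep the words that are not in the set and join them back
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned = remove_stop_words_set("hello im have breakfast with my brother a nice one")
print(cleaned)
```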

3.B.An example



In [300]:

    
def updateDictionary(aDict,name):
    if aDict.get(name):
        aDict[name] = aDict[name] + 1
    else:
        aDict[name] = 1 
    
    return aDict

text = "That is what happens when you flee your homeland. You don’t know that you are going to become part of a flood of refugees. I later would learn that I was one of the nearly 130,000 people who fled Saigon that day and one of the estimated two million “boat people” who fled Vietnam by boat and other means over the next two decades. But I didn’t set out to come to America; I left my house when my parents said I should."
print(text)
print()
text_no_punc = remove_punctuation(text)
print(text_no_punc)
print()
text_no_stop = remove_stop_words(text_no_punc)
print(text_no_stop)
print()

aDict = dict()
list_text_no_stop = text_no_stop.split()
print(list_text_no_stop)
print()
for word in list_text_no_stop:
    aDict = updateDictionary(aDict,word)
print(aDict)
print()









    



That is what happens when you flee your homeland. You don’t know that you are going to become part of a flood of refugees. I later would learn that I was one of the nearly 130,000 people who fled Saigon that day and one of the estimated two million “boat people” who fled Vietnam by boat and other means over the next two decades. But I didn’t set out to come to America; I left my house when my parents said I should.

that is what happens when you flee your homeland you don’t know that you are going to become part of a flood of refugees i later would learn that i was one of the nearly 130000 people who fled saigon that day and one of the estimated two million “boat people” who fled vietnam by boat and other means over the next two decades but i didn’t set out to come to america i left my house when my parents said i should

happens flee homeland don’t know going become part flood refugees later would learn one nearly 130000 people fled saigon day one estimated two million “boat people” fled vietnam boat means next two decades didn’t set come america left house parents said

['happens', 'flee', 'homeland', 'don’t', 'know', 'going', 'become', 'part', 'flood', 'refugees', 'later', 'would', 'learn', 'one', 'nearly', '130000', 'people', 'fled', 'saigon', 'day', 'one', 'estimated', 'two', 'million', '“boat', 'people”', 'fled', 'vietnam', 'boat', 'means', 'next', 'two', 'decades', 'didn’t', 'set', 'come', 'america', 'left', 'house', 'parents', 'said']

{'day': 1, 'america': 1, 'fled': 2, 'boat': 1, 'said': 1, 'one': 2, '“boat': 1, 'come': 1, 'means': 1, 'million': 1, 'going': 1, 'would': 1, 'flood': 1, 'refugees': 1, 'decades': 1, 'two': 2, '130000': 1, 'left': 1, 'vietnam': 1, 'learn': 1, 'estimated': 1, 'house': 1, 'become': 1, 'happens': 1, 'homeland': 1, 'parents': 1, 'saigon': 1, 'flee': 1, 'later': 1, 'people”': 1, 'didn’t': 1, 'people': 1, 'nearly': 1, 'don’t': 1, 'know': 1, 'set': 1, 'part': 1, 'next': 1}



In [301]:

    
from collections import Counter
print(Counter(list_text_no_stop))









    



Counter({'fled': 2, 'one': 2, 'two': 2, 'day': 1, 'america': 1, 'boat': 1, 'said': 1, '“boat': 1, 'come': 1, 'means': 1, 'million': 1, 'going': 1, 'would': 1, 'flood': 1, 'refugees': 1, 'decades': 1, '130000': 1, 'left': 1, 'vietnam': 1, 'learn': 1, 'estimated': 1, 'house': 1, 'become': 1, 'happens': 1, 'homeland': 1, 'parents': 1, 'saigon': 1, 'flee': 1, 'later': 1, 'people”': 1, 'didn’t': 1, 'people': 1, 'nearly': 1, 'don’t': 1, 'know': 1, 'set': 1, 'part': 1, 'next': 1})
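
A Counter has a few conveniences over a plain dictionary. A short sketch with a made-up word list:

```python
from collections import Counter

word_list = ["fled", "one", "two", "boat", "fled", "one", "two", "fled"]
word_counts = Counter(word_list)

# The most frequent words, as (word, count) pairs
print(word_counts.most_common(2))

# Words that never appeared count as 0 instead of raising a KeyError
print(word_counts["saigon"])
```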

3.C.Working with dates

Dates are nasty. What date is this? 05/06/2015

Luckily we have Python

from dateutil.parser import parse
http://dateutil.readthedocs.org/en/latest/parser.html#dateutil.parser.parse

dayfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the day (True) or month (False). If yearfirst is set to True, this distinguishes between YDM and YMD. If set to None, this value is retrieved from the current parserinfo object (which itself defaults to False).
yearfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the year. If True, the first number is taken to be the year, otherwise the last number is taken to be the year. If this is set to None, the value is retrieved from the current parserinfo object (which itself defaults to False).
fuzzy – Whether to allow fuzzy parsing, allowing for string like “Today is January 1, 2047 at 8:21:00AM”.



In [307]:

    
from dateutil.parser import parse
print(parse("05-06-2015",dayfirst=True).date())
print(parse("05-06-2015",dayfirst=False).date())
print(parse("05/06-2015").date())
print(parse("Today is January 1, 2047 at 8:21:00AM",fuzzy=True).date())
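
When you already know the format of your dates, the standard library can parse them with an explicit pattern, which removes the ambiguity. This sketch uses datetime.strptime instead of dateutil; the variable names are illustrative.

```python
from datetime import datetime

ambiguous = "05/06/2015"

# %d = day, %m = month, %Y = four-digit year
day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()
month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()

print(day_first)     # read as 5 June 2015
print(month_first)   # read as 6 May 2015
```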

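4.Writing and reading files + examples

with open(filename, how_to_open) as f:
    code goes here

f: variable name, whatever you want
how_to_open:
- "w": Write (creates the file, or empties it if it already exists)
- "r": Read (the file must already exist)
- "a": Append (adds to the end, creates the file if it doesn't exist)
Adding a + (e.g. "r+", "w+", "a+") opens the file for both reading and writing. It is "w" and "a" that create a missing file, not the +: "r+" still fails if the file doesn't exist.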

4.A.Writing files
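
A minimal sketch of the three basic modes, written in a temporary folder so no real file is touched. The file name modes_demo.txt is made up for this example.

```python
import os
import tempfile

# Work in a temporary folder so no real file is touched
folder = tempfile.mkdtemp()
path = os.path.join(folder, "modes_demo.txt")

with open(path, "w") as f:      # "w" creates the file (or empties an existing one)
    f.write("first line\n")

with open(path, "a") as f:      # "a" adds to the end of the file
    f.write("second line\n")

with open(path, "r") as f:      # "r" only reads
    content = f.read()
print(content)

with open(path, "w") as f:      # opening with "w" again erases the old content
    f.write("only line\n")

with open(path, "r") as f:
    content_after = f.read()
print(content_after)
```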



In [309]:

    
with open("./data/file_to_write.csv","w+") as f:
    f.write("I'm line number {0}".format(0))
    f.write("I'm line number {0}".format(1))
    f.write("I'm line number {0}".format(2))
    f.write("I'm line number {0}".format(3))
    f.write("I'm line number {0}".format(4))



In [310]:

    
with open("./data/file_to_write.csv","w+") as f:
    f.write("I'm line number {0}\n".format(0))
    f.write("I'm line number {0}\n".format(1))
    f.write("I'm line number {0}\n".format(2))
    f.write("I'm line number {0}\n".format(3))
    f.write("I'm line number {0}\n".format(4))



In [312]:

    
list(range(10))









    Out[312]:





[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]



In [313]:

    
#Beware the newline: write() does not add "\n" for you
with open("./data/file_to_write.csv","w+") as f:
    for i in range(10):
        f.write("I'm line number {0}\n".format(i))

4.B.Reading files



In [317]:

    
with open("./data/file_to_write.csv","r+") as f:
    splitted_by_line_1 = f.readlines()
        
with open("./data/file_to_write.csv","r+") as f:
    all_together = f.read()
splitted_by_line_2 = all_together.split("\n")

splitted_by_line_3 = []
with open("./data/file_to_write.csv","r+") as f:
    for line in f:
        splitted_by_line_3.append(line)



In [318]:

    
print(splitted_by_line_1)
print(splitted_by_line_2)
print(splitted_by_line_3)









    



["I'm line number 0\n", "I'm line number 1\n", "I'm line number 2\n", "I'm line number 3\n", "I'm line number 4\n", "I'm line number 5\n", "I'm line number 6\n", "I'm line number 7\n", "I'm line number 8\n", "I'm line number 9\n"]
["I'm line number 0", "I'm line number 1", "I'm line number 2", "I'm line number 3", "I'm line number 4", "I'm line number 5", "I'm line number 6", "I'm line number 7", "I'm line number 8", "I'm line number 9", '']
["I'm line number 0\n", "I'm line number 1\n", "I'm line number 2\n", "I'm line number 3\n", "I'm line number 4\n", "I'm line number 5\n", "I'm line number 6\n", "I'm line number 7\n", "I'm line number 8\n", "I'm line number 9\n"]
["I'm line number 0", "I'm line number 1", "I'm line number 2", "I'm line number 3", "I'm line number 4", "I'm line number 5", "I'm line number 6", "I'm line number 7", "I'm line number 8", "I'm line number 9"]



In [ ]:

    
#strip() removes the newline and any other whitespace at both ends of the line
splitted_by_line_3 = []
with open("./data/file_to_write.csv","r+") as f:
    for line in f:
        splitted_by_line_3.append(line.strip())
        
print(splitted_by_line_3)
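
If the whole file was read at once, splitlines() is an alternative to split("\n") that does not leave an empty string at the end. A sketch with a short made-up text instead of a real file:

```python
text_from_file = "I'm line number 0\nI'm line number 1\nI'm line number 2\n"

# split("\n") leaves an empty string at the end, after the last newline
print(text_from_file.split("\n"))

# splitlines() does not
lines = text_from_file.splitlines()
print(lines)
```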

4.C.Try - except

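Exception handling. If an error happens while a file is open, the file may never be closed, which wastes system resources and can lose data that was not yet written.

The "with open() as f" block handles this: it closes the file even when an error occurs.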



In [ ]:

    
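splitted_by_line_3 = []
f = open("./data/file_to_write.csv","r+")
try:
    for line in f:
        splitted_by_line_3.append(line.strip())
    print(splitted_by_line_3)
except Exception as error:
    #Something failed while reading
    print("Error while reading:", error)
finally:
    #Runs with or without an error, so the file is always closed
    f.close()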
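
try-except is also useful to keep going when a file is missing. A sketch that catches only that specific error; read_lines and the file name are hypothetical.

```python
def read_lines(filename):
    # Return the stripped lines of a file, or an empty list if it does not exist
    try:
        with open(filename, "r") as f:
            return [line.strip() for line in f]
    except FileNotFoundError:
        print("Could not find", filename)
        return []

missing = read_lines("this_file_should_not_exist_12345.txt")
print(missing)
```

Catching a specific error (instead of a bare "except:") means other, unexpected errors are still shown.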

5.Error debugging



In [319]:

    
Image("http://i.imgur.com/WRuJV6r.png")









    Out[319]:

Errors

IndexError: List is too short
NameError: Misspelling, the variable/function/module is not defined
SyntaxError: You're missing parentheses, colons...
FileNotFoundError/IOError: The file doesn't exist
KeyError: In a dictionary, the key doesn't exist
IndentationError: You have a mixture of tabs and spaces
TypeError: The data structure doesn't allow for that operation, or a variable is None instead of having a value
AttributeError: The data structure doesn't have that method (e.g. calling mean() on a list)

IndexError: List is too short



In [320]:

    
this_is_a_list = [1,2,3,4,5]
len_list = len(this_is_a_list)
print(len_list)
this_is_a_list[len_list]









    



5






    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-320-1f99355ae5f1> in <module>()
      2 len_list = len(this_is_a_list)
      3 print(len_list)
----> 4 this_is_a_list[len_list]

IndexError: list index out of range



In [321]:

    
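#Careful: removing elements from a list while looping over it raises no error,
#but the loop stops early (only 1, 2 and 3 are printed)
this_is_a_list = [1,2,3,4,5]
for element in this_is_a_list:
    this_is_a_list.pop(-1)    
    print(element)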
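
Changing a list while looping over it makes the loop end before all the elements are visited. One way around it, sketched here, is to loop over a copy of the list (seen is a name made up for this example):

```python
this_is_a_list = [1, 2, 3, 4, 5]
seen = []

# list(...) makes a copy, so the loop is not affected by the pops
for element in list(this_is_a_list):
    this_is_a_list.pop(-1)
    seen.append(element)

print(seen)
print(this_is_a_list)
```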

NameError: Misspelling, the variable/function/module is not defined



In [323]:

    
this_is_a_list = [1,2,3,4,5]
for element in this_is_a_list:
    sum_all = sum_all + element









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-323-73d4f0ef9860> in <module>()
      1 this_is_a_list = [1,2,3,4,5]
      2 for element in this_is_a_list:
----> 3     sum_all = sum_all + element

NameError: name 'sum_all' is not defined
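
The fix is to give sum_all a value before the loop uses it:

```python
this_is_a_list = [1, 2, 3, 4, 5]

sum_all = 0   # define the variable before using it inside the loop
for element in this_is_a_list:
    sum_all = sum_all + element
print(sum_all)
```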

SyntaxError: You're missing parentheses, colons...



In [324]:

    
def function()
    return 0









    



  File "<ipython-input-324-b90b96076647>", line 1
    def function()
                  ^
SyntaxError: invalid syntax



In [326]:

    
3 = 5









    



  File "<ipython-input-326-dc9bf34ad6e8>", line 1
    3 = 5
         ^
SyntaxError: can't assign to literal



In [327]:

    
3 == "3"









    Out[327]:





False



In [328]:

    
3 == int("3")









    Out[328]:





True



In [330]:

    
"A" == "a"









    Out[330]:





False

FileNotFoundError/IOError: The file doesn't exist



In [331]:

    
open("non_existing_file","r")









    



---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-331-29e273de833d> in <module>()
----> 1 open("non_existing_file","r")

FileNotFoundError: [Errno 2] No such file or directory: 'non_existing_file'

KeyError: In a dictionary, the key doesn't exist



In [332]:

    
d = dict({"You": 0, "Her": 1})
d["Him"]









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-332-1e1ab586bf06> in <module>()
      1 d = dict({"You": 0, "Her": 1})
----> 2 d["Him"]

KeyError: 'Him'
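
Two common ways to avoid the KeyError, sketched below: check that the key exists first, or ask for a default value with get(). The default -1 is an arbitrary choice for this example.

```python
d = {"You": 0, "Her": 1}

# Check first...
if "Him" in d:
    print(d["Him"])
else:
    print("Him is not a key")

# ...or ask for a default value instead of an error
value = d.get("Him", -1)
print(value)
```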

IndentationError: You have a mixture of tabs and spaces

IPython notebooks handle this for you

TypeError: The data structure doesn't allow for that operation, or a variable is None instead of having a value



In [333]:

    
this_is_a_list = [0,1,2,3,4]
this_is_a_list + 8









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-333-c0a4b0d1fa13> in <module>()
      1 this_is_a_list = [0,1,2,3,4]
----> 2 this_is_a_list + 8

TypeError: can only concatenate list (not "int") to list



In [334]:

    
this_is_a_list = [0,1,2,3,4]
this_is_a_list + [8]









    Out[334]:





[0, 1, 2, 3, 4, 8]

AttributeError: The data structure doesn't have that method (e.g. calling mean() on a list)



In [335]:

    
this_is_a_list = [0,1,2,3,4]
this_is_a_list.add(8)









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-335-8ad22b5651d0> in <module>()
      1 this_is_a_list = [0,1,2,3,4]
----> 2 this_is_a_list.add(8)

AttributeError: 'list' object has no attribute 'add'

In-place algorithms



In [336]:

    
this_is_a_list = [4,3,2,1,0]
this_is_a_list = sorted(this_is_a_list)
print(this_is_a_list)









    



[0, 1, 2, 3, 4]



In [337]:

    
this_is_a_list = [4,3,2,1,0]
this_is_a_list = this_is_a_list.sort()
print(this_is_a_list)









    



None

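sort() sorts the list in place and returns None, so assigning its result back to the variable replaces the list with None.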



In [338]:

    
this_is_a_list = [4,3,2,1,0]
this_is_a_list.sort() #IN-PLACE SORTING!!
print(this_is_a_list)









    



[0, 1, 2, 3, 4]
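
A side-by-side sketch of the two behaviours, plus two other list methods that also work in place and return None. The names original, new_list and result are illustrative.

```python
original = [4, 3, 2, 1, 0]

new_list = sorted(original)      # returns a new sorted list, original is unchanged
print(original, new_list)

result = original.sort()         # sorts original itself and returns None
print(original, result)

# Other list methods that work in place and return None
print(original.append(8), original.reverse())
print(original)
```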

6.Summary

We have

Python
External packages
- numpy and scipy: math
- pandas: spreadsheet
- matplotlib (pylab): plot
- statsmodels: regression

Python and packages have

Data structures: list, numpy arrays, pandas dataframes

That are composed of

Other data structures
Data types: int, floats, strings, dates

We manipulate the data structures with code

Operations
Functions (from python/packages)
If-else statements
Loops

