Python Workshop in Text analysis: Introduction

Javier Garcia-Bernardo, Jan 21st-22nd (The University of Amsterdam)


In [162]:
%matplotlib inline
from IPython.display import Image

0. Structure

  1. About Python
  2. Data types, structures and code
  3. Working with text
  4. Writing and reading files
  5. Error debugging
  6. Summary

1. About Python

  • Python: Easy programming language, great for text analysis = English. (R = Dutch)
  • IPython notebook: where you write Python = Word
  • Python packages: extensions = Tabs in Word

In [229]:
#This is a comment
print("Hello World")


Hello World

In [231]:
## HOW TO IMPORT PACKAGES
## Create a dataframe (spreadsheet) called spreadsheet

#Way 1 (recommended)
import pandas
spreadsheet = pandas.DataFrame() 

import pandas as pd
spreadsheet = pd.DataFrame()

#Way 2
from pandas import DataFrame
spreadsheet = DataFrame()

from pandas import *
spreadsheet = DataFrame()

In [233]:
import pandas as pd

In [234]:
df = pd.DataFrame()

In [235]:
df?

To install new packages you can use pip. Open the "Anaconda Prompt" (under the start menu) and write:

  • pip install selenium
  • pip install python-dateutil
  • pip install pycountry
  • pip install labMTsimple
  • pip install lda
  • pip install wbdata

2. Data types, structures and code

Variables

  • Data types: Numbers, strings and others
  • Data structures:
    • Lists, tables... (full of data types)

Code

  • Instructions to modify variables
  • Can be organized in functions

2.AB Variables: Data Types and Structures


In [217]:
## DATA TYPES
this_is_variable1 = 3.5
this_is_variable3 = "I'm a string"
this_is_variable2 = False

## DATA STRUCTURES (list)
this_is_variable4 = [3.5,"I'm another string",4]
this_is_variable5 = [this_is_variable1,this_is_variable2,this_is_variable3,this_is_variable4]

In [218]:
print(this_is_variable5)


[3.5, False, "I'm a string", [3.5, "I'm another string", 4]]

2.A Data Types

  • number
    • int: -2, 0, 1
    • float: 3.5, 4.23
  • string: "I'm a string"
  • boolean: False/True
  • None: the absence of a value

In [167]:
print(type(3))
print(type(3.5))
print(type("I'm a string"))
print(type(False))
print(type(None))


<class 'int'>
<class 'float'>
<class 'str'>
<class 'bool'>
<class 'NoneType'>

2.A.Strings: Beware of the encoding

The computer uses 0s and 1s to encode strings

We used to use ASCII encoding, which reads blocks of 7 bits (0/1): 2^7 = 128 characters. That is enough for lower and upper case letters and some punctuation, but not for accented or other special characters (e.g. é, ó, í). It's the default in Python 2 (bad for text analysis).

Nowadays we use UTF-8 encoding, which can handle all symbols in any language. It's the default in Python 3.

But some programs use UTF-16, ASCII or ISO-8859-1, which garbles the text. If you are reading a file and the content shows up as weird symbols, this is likely the problem. Look for an "encoding" option in the code.
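
A minimal sketch of what happens when the encodings do not match. The word "café" is only an illustrative example, it does not come from the workshop data.

```python
# Encode a string with UTF-8, then decode the bytes with the wrong encoding
text = "café"
utf8_bytes = text.encode("utf-8")        # é is stored as two bytes in UTF-8

wrong = utf8_bytes.decode("iso-8859-1")  # every byte is read as one character
right = utf8_bytes.decode("utf-8")       # same encoding on both sides

print(len(text), len(utf8_bytes))
print(wrong)
print(right)

# ASCII cannot represent é at all
try:
    text.encode("ascii")
except UnicodeEncodeError as error:
    print("ASCII failed:", error.reason)
```

The same idea applies when reading files: the built-in open function accepts an encoding argument, so the encoding can be stated explicitly instead of relying on the default.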


In [238]:
## THE COMPUTER READS THE CODE LINE BY LINE, TOP TO BOTTOM


## OPERATIONS ON DATA TYPES
print(3*5.0)
print(3 == 5)
b = 3
print(b == 3)
b = 5
print(b == 5)

## CONVERT BETWEEN TYPES
print(b)
print(type(b))
b = float(b)
print(b)
print(type(b))


15.0
False
True
True
5
<class 'int'>
5.0
<class 'float'>

In [239]:
3 = 4


  File "<ipython-input-239-16286b76249b>", line 1
    3 = 4
         ^
SyntaxError: can't assign to literal

In [240]:
this_is_a_variable6 = not_defined_variable


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-240-8717ac856325> in <module>()
----> 1 this_is_a_variable6 = not_defined_variable

NameError: name 'not_defined_variable' is not defined

2.B Data Structures

  • list = notebook
  • tuple = book
  • dictionary = index in a book
  • numpy array = fast list for math
  • pandas dataframe = spreadsheet

They have methods = ways to edit the data structure. For example add, delete, find, sort... (= functions in excel)

2.B.Lists

  • Combines a series of variables in order
  • Fast to add and delete elements at the end, slow to find elements (it needs to go over all of them)

In [241]:
[1,2,3]


Out[241]:
[1, 2, 3]

In [169]:
## A list
print([0]*10)


[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [170]:
## A list
this_is_a_list = [1,3,2,"b"]
print("Original: ", this_is_a_list)


Original:  [1, 3, 2, 'b']

In [248]:
## Add elements
this_is_a_list.append("Hoi")
print("Added Hoi: ", this_is_a_list)


Added Hoi:  [1, 3, 2, 'b', 'Hoi']

In [242]:
## Get element
print("Fourth element: ", this_is_a_list[3])


Fourth element:  b

In [243]:
[0,1,2,3,4]


Out[243]:
[0, 1, 2, 3, 4]

In [173]:
## Get slices
print("Second to end element: ", this_is_a_list[1:])


Second to end element:  [3, 2, 'b', 'Hoi']

In [249]:
print(this_is_a_list)
print(this_is_a_list[1:3])
print(this_is_a_list[:-1])


[1, 3, 2, 'b', 'Hoi']
[3, 2]
[1, 3, 2, 'b']

In [174]:
## Remove 4th element
this_is_a_list.pop(3)
print("Removed fourth element: ", this_is_a_list)


Removed fourth element:  [1, 3, 2, 'Hoi']

In [175]:
#Search
"Hoi" in this_is_a_list


Out[175]:
True

In [176]:
## Sort
this_is_a_list = [1, 3, 2]
this_is_a_list.sort()
print("Sorted: ", this_is_a_list)


Sorted:  [1, 2, 3]

IPython help


In [177]:
this_is_a_list?

In [250]:
this_is_a_list.pop?

2.B.Tuples

  • Combines a series of variables in order
  • Immutable (cannot be changed after creation)
  • Faster than lists
  • I never use them
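
A short sketch of what immutable means in practice. The variable names are only illustrative.

```python
this_is_a_tuple = (1, 3, 2, "b")

# Reading an element works exactly like with a list
first = this_is_a_tuple[0]
print(first)

# Changing an element is not allowed and raises a TypeError
try:
    this_is_a_tuple[0] = 10
except TypeError as error:
    message = str(error)
    print(message)

# To "change" a tuple you build a new one (or convert it to a list first)
longer_tuple = this_is_a_tuple + ("Hoi",)
print(longer_tuple)
```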

In [179]:
this_is_a_tuple = (1,3,2,"b")
print(this_is_a_tuple)
this_is_a_list = list(this_is_a_tuple)
print(this_is_a_list)


(1, 3, 2, 'b')
[1, 3, 2, 'b']

In [251]:
print(list(range(3,16)))
print(list(range(3,16,3)))


[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
[3, 6, 9, 12, 15]

2.B.Sets

  • One element of each kind
  • Allows for intersecting
  • Example: Which words are different between two sets
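
A small sketch of the word example above, using two made-up sentences.

```python
sentence1 = "the cat sat on the mat"
sentence2 = "the dog sat on the sofa"

# Splitting gives a list of words; set() keeps one copy of each word
words1 = set(sentence1.split())
words2 = set(sentence2.split())

only_in_1 = words1 - words2   # words that appear only in the first sentence
only_in_2 = words2 - words1   # words that appear only in the second sentence
shared = words1 & words2      # words that appear in both
different = words1 ^ words2   # words that appear in exactly one of the two

# Sets have no order, so sort them before printing
print(sorted(only_in_1))
print(sorted(only_in_2))
print(sorted(shared))
print(sorted(different))
```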

In [181]:
Image(filename='./images/setOp.png')


Out[181]:

In [182]:
this_is_a_set1 = set([1,1,1,2,3])
this_is_a_set2 = set({1,2,4})
print(this_is_a_set1)
print(this_is_a_set2)


{1, 2, 3}
{1, 2, 4}

In [183]:
## Union
print(this_is_a_set1 | this_is_a_set2)


{1, 2, 3, 4}

In [184]:
## Intersection
print(this_is_a_set1 & this_is_a_set2)


{1, 2}

In [185]:
## Difference set1 - set2
print(this_is_a_set1 - this_is_a_set2)


{3}

2.B.Dictionary

  • Like the index of a book: finds a page very fast.
  • Slower to create than a list
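
A minimal sketch of the most common dictionary operations. The name phone_book and the numbers are made up for illustration.

```python
phone_book = {"Javier": 63434234234, "Friend1": 4234423243}

# Look up a value by its key
print(phone_book["Javier"])

# Check whether a key exists
print("Friend2" in phone_book)

# .get returns a default value instead of raising a KeyError
missing = phone_book.get("Friend2", "unknown")
print(missing)

# Add a new key (or overwrite an existing one)
phone_book["Friend2"] = 424233345

# Iterate over keys and values together
for name, number in phone_book.items():
    print(name, number)
```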

In [252]:
#Dictionary
this_is_a_dict = {"Javier": 63434234234, "Friend1": 4234423243, "Friend2": 4234423243}
print(this_is_a_dict)


{'Friend1': 4234423243, 'Friend2': 4234423243, 'Javier': 63434234234}

In [188]:
this_is_a_dict["Friend2"]


Out[188]:
4234423243

In [189]:
#Creating dictionary using two lists
list_names = ["Javier", "Friend1", "Friend2"]
list_numbers = [63434234234,4234423243,424233345]

#Put both together
this_is_a_dict = dict(zip(list_names,list_numbers))
print(this_is_a_dict)


{'Friend1': 4234423243, 'Friend2': 424233345, 'Javier': 63434234234}

In [190]:
print(zip(list_names,list_numbers))


<zip object at 0x7fa6fbc3fc08>

In [191]:
print(list(zip(list_names,list_numbers)))


[('Javier', 63434234234), ('Friend1', 4234423243), ('Friend2', 424233345)]


In [255]:
import time
numElements = 50000
this_is_a_list = list(range(numElements))
print(this_is_a_list[:10])

this_is_a_dict = dict(zip(range(numElements),[0]*numElements))
print(([0]*numElements)[:10])
      
start = time.time()
for i in range(numElements): this_is_a_dict.get(i)
print(time.time() - start)

start = time.time()
for i in range(numElements): this_is_a_list.index(i)
print(time.time() - start)


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
0.018622636795043945
20.1564302444458

2.B.Numpy array

  • Part of the numpy package
  • Extremely fast (and easy) to do math

In [257]:
import numpy as np
import scipy.stats

numElements = 1000
this_is_an_list = list(range(numElements))
this_is_an_array = np.array(range(numElements))
print(this_is_an_array[:10])

## Summary statistics: mean, standard deviation, median, mode and skewness
print(np.mean(this_is_an_array))
print(np.std(this_is_an_array))
print(np.median(this_is_an_array))
print(scipy.stats.mode(this_is_an_array))
print(scipy.stats.skew(this_is_an_array))


[0 1 2 3 4 5 6 7 8 9]
499.5
288.674990257
499.5
ModeResult(mode=array([0]), count=array([1]))
0.0

2.B.Pandas dataframe

  • Part of the pandas package
  • Read/write csv files
  • We will use it to keep compatibility with excel
  • First rule of the "good people club": Save raw spreadsheets as .csv data

What's a CSV? A comma-separated values file:

newspaper,number_something
ABC,1
ABC,3
ABC,2
ElPais,10
ElPais,15

The values can be separated by other characters and the file is often still called a CSV (with tabs it is a TSV, tab-separated values).
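
To make the format concrete, here is a sketch that parses the small example with Python's standard-library csv module (not pandas, which the workshop uses for real files). The text is kept in a string so the example runs without any file on disk.

```python
import csv
import io

# The same small table, written as comma-separated text
csv_text = "newspaper,number_something\nABC,1\nABC,3\nABC,2\nElPais,10\nElPais,15\n"

# The first line is used as the header; every row becomes a dictionary
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    print(row["newspaper"], row["number_something"])

# The same table separated by tabs only needs a different delimiter
tsv_text = csv_text.replace(",", "\t")
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
print(rows == tsv_rows)
```

Note that the csv module reads every value as a string, so numbers have to be converted (for example with int) before doing math. pandas does that conversion automatically.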


In [258]:
import pandas as pd
## Read excel
excelFrame = pd.read_excel("./data/nl_data.xlsx", sheet_name=0, header=0, skiprows=4)
## Read csv
csvFrame = pd.read_csv("./data/nl_data.csv",sep=",",index_col=None,skiprows=4,na_values=["-999"])

## Print first 5 rows
csvFrame.head()


Out[258]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 Unnamed: 60
0 Netherlands NLD Agricultural machinery, tractors AG.AGR.TRAC.NO NaN 62000.000000 70000.000000 78000.000000 86000.000000 94000.000000 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Netherlands NLD Fertilizer consumption (% of fertilizer produc... AG.CON.FERT.PT.ZS NaN NaN NaN NaN NaN NaN ... 19.293867 17.687940 16.708960 18.603141 15.604367 19.844205 13.081956 NaN NaN NaN
2 Netherlands NLD Fertilizer consumption (kilograms per hectare ... AG.CON.FERT.ZS NaN NaN NaN NaN NaN NaN ... 302.139083 267.708607 238.171044 293.325836 246.811133 345.989120 231.127696 NaN NaN NaN
3 Netherlands NLD Agricultural land (sq. km) AG.LND.AGRI.K2 NaN 23140.000000 23030.000000 22890.000000 22680.000000 22550.000000 ... 19144.000000 19293.000000 19174.000000 18723.000000 18584.000000 18417.000000 18476.000000 NaN NaN NaN
4 Netherlands NLD Agricultural land (% of land area) AG.LND.AGRI.ZS NaN 68.542654 68.216825 67.802133 67.180095 66.795024 ... 56.706161 57.147512 56.845538 55.508449 55.112693 54.617438 54.873775 NaN NaN NaN

5 rows × 61 columns


In [195]:
## Describe
print(csvFrame.describe())


               1960          1961          1962          1963          1964  \
count  1.580000e+02  1.780000e+02  1.910000e+02  1.950000e+02  1.950000e+02   
mean   9.644333e+09  8.671782e+09  8.733247e+09  8.952507e+09  9.810818e+09   
std    3.216011e+10  3.059052e+10  3.169151e+10  3.259826e+10  3.540193e+10   
min   -7.734878e+08 -7.255122e+08 -5.424170e+08 -8.969276e+08 -2.042674e+09   
25%    7.648415e+00  4.676500e+00  7.729379e+00  6.793045e+00  7.720433e+00   
50%    7.421578e+01  6.246398e+01  6.642467e+01  5.977151e+01  5.681000e+01   
75%    9.840219e+07  8.098655e+05  9.581490e+05  7.080155e+05  9.143374e+04   
max    1.632038e+11  1.628373e+11  1.753224e+11  1.807355e+11  1.966831e+11   

               1965          1966          1967          1968          1969  \
count  2.070000e+02  2.000000e+02  2.060000e+02  2.020000e+02  2.210000e+02   
mean   1.041116e+10  1.119155e+10  1.151534e+10  1.265630e+10  1.404333e+10   
std    3.753462e+10  3.927049e+10  4.082562e+10  4.396270e+10  4.543976e+10   
min   -8.078442e+08 -1.486532e+09 -1.482232e+09 -1.766168e+09 -1.666834e+09   
25%    7.728036e+00  5.973497e+00  6.816228e+00  8.626535e+00  9.986721e+00   
50%    6.042472e+01  6.085083e+01  6.100371e+01  6.179447e+01  7.353951e+01   
75%    9.546090e+05  1.131976e+06  9.601292e+05  1.149812e+07  8.094900e+08   
max    2.166166e+11  2.223483e+11  2.342650e+11  2.490582e+11  2.651387e+11   

          ...               2007          2008          2009          2010  \
count     ...       7.540000e+02  7.500000e+02  7.310000e+02  7.730000e+02   
mean      ...       6.199006e+10  6.281328e+10  5.955810e+10  5.837514e+10   
std       ...       1.766378e+11  1.759854e+11  1.632613e+11  1.645582e+11   
min       ...      -1.410985e+11 -9.209300e+10 -2.831900e+10 -5.156719e+10   
25%       ...       6.000000e+00  5.925000e+00  6.040510e+00  6.000000e+00   
50%       ...       5.825905e+01  5.891979e+01  5.970000e+01  5.890000e+01   
75%       ...       1.029768e+05  6.349576e+04  4.908945e+05  6.221985e+04   
max       ...       1.545905e+12  9.784070e+11  8.847164e+11  8.945230e+11   

               2011          2012          2013          2014         2015  \
count  7.220000e+02  7.450000e+02  6.370000e+02  4.590000e+02    50.000000   
mean   6.596796e+10  6.249261e+10  7.204636e+10  9.519222e+10   246.625697   
std    1.791602e+11  1.725278e+11  1.877609e+11  2.134125e+11  1229.121808   
min   -2.269100e+10 -2.145100e+10 -1.696441e+10 -1.716528e+10     0.000000   
25%    6.611412e+00  6.000000e+00  6.800000e+00  6.214552e+00     3.425000   
50%    6.350000e+01  6.240000e+01  6.819122e+01  6.590000e+01     8.500000   
75%    1.063354e+06  2.025720e+05  8.998325e+06  1.099990e+10    97.650000   
max    1.026275e+12  9.775700e+11  1.021033e+12  9.889670e+11  8700.000000   

       Unnamed: 60  
count            0  
mean           NaN  
std            NaN  
min            NaN  
25%            NaN  
50%            NaN  
75%            NaN  
max            NaN  

[8 rows x 57 columns]

In [196]:
## Calculate mean
print(csvFrame.mean(axis=0, numeric_only=True)) #By columns
print(csvFrame.mean(axis=1, numeric_only=True)) #By rows


1960           9.644333e+09
1961           8.671782e+09
1962           8.733247e+09
1963           8.952507e+09
1964           9.810818e+09
1965           1.041116e+10
1966           1.119155e+10
1967           1.151534e+10
1968           1.265630e+10
1969           1.404333e+10
1970           2.034449e+10
1971           1.773767e+10
1972           1.867170e+10
1973           2.105561e+10
1974           2.242130e+10
1975           2.187833e+10
1976           2.354577e+10
1977           2.457515e+10
1978           2.672740e+10
1979           2.830681e+10
1980           2.687033e+10
1981           2.653267e+10
1982           2.613718e+10
1983           2.515379e+10
1984           2.776158e+10
1985           2.571123e+10
1986           3.144424e+10
1987           3.016809e+10
1988           3.061985e+10
1989           3.199168e+10
1990           3.187026e+10
1991           3.438141e+10
1992           3.514316e+10
1993           3.607781e+10
1994           3.854710e+10
1995           3.829632e+10
1996           3.987001e+10
1997           4.083188e+10
1998           4.259209e+10
1999           4.338528e+10
2000           4.226315e+10
2001           4.542415e+10
2002           4.505648e+10
2003           4.772440e+10
2004           4.981190e+10
2005           5.110916e+10
2006           5.738841e+10
2007           6.199006e+10
2008           6.281328e+10
2009           5.955810e+10
2010           5.837514e+10
2011           6.596796e+10
2012           6.249261e+10
2013           7.204636e+10
2014           9.519222e+10
2015           2.466257e+02
Unnamed: 60             NaN
dtype: float64
0       1.508709e+05
1       1.977243e+01
2       3.200545e+02
3       2.036600e+04
4       6.033293e+01
5       8.940453e+05
6       6.151689e-02
7       2.648631e+01
8       2.629354e+05
9       1.045027e+00
10      5.846905e+01
11      3.585870e+03
12      1.069640e+01
13      1.055161e+01
14      7.780000e+02
15      3.375407e+04
16      1.766827e+03
17      1.535190e+06
18      8.051811e+01
19      8.773226e+01
20      9.020547e+01
21      4.152889e+04
22      6.339111e+03
23      3.139955e+01
24      6.614991e+01
25      2.765265e+11
26      5.431615e+11
27      3.978890e+00
28      4.069067e+11
29      1.362548e+11
            ...     
1315    1.214338e+01
1316    1.389317e+01
1317    1.974906e+00
1318    5.828606e+01
1319    2.876095e+00
1320    2.316059e+00
1321    1.847732e+11
1322    7.993843e+01
1323    7.524936e+00
1324    1.343265e+00
1325    1.675852e+00
1326    1.208438e+00
1327    1.529114e+00
1328    5.167672e-01
1329    1.950753e+00
1330    1.253663e+01
1331    1.795774e+11
1332             NaN
1333    1.703854e+02
1334    5.630338e+01
1335    9.700789e+10
1336    4.106099e+10
1337    2.441667e+01
1338    2.826600e+01
1339    1.345571e+01
1340    2.000000e+00
1341             NaN
1342             NaN
1343    1.103590e+00
1344             NaN
dtype: float64

In [197]:
## Keep a subset
print(csvFrame["Indicator Code"])


0          AG.AGR.TRAC.NO
1       AG.CON.FERT.PT.ZS
2          AG.CON.FERT.ZS
3          AG.LND.AGRI.K2
4          AG.LND.AGRI.ZS
5          AG.LND.ARBL.HA
6       AG.LND.ARBL.HA.PC
7          AG.LND.ARBL.ZS
8          AG.LND.CREL.HA
9          AG.LND.CROP.ZS
10         AG.LND.EL5M.ZS
11         AG.LND.FRST.K2
12         AG.LND.FRST.ZS
13      AG.LND.IRIG.AG.ZS
14         AG.LND.PRCP.MM
15         AG.LND.TOTL.K2
16         AG.LND.TRAC.ZS
17         AG.PRD.CREL.MT
18         AG.PRD.CROP.XD
19         AG.PRD.FOOD.XD
20         AG.PRD.LVSK.XD
21         AG.SRF.TOTL.K2
22         AG.YLD.CREL.KG
23      BG.GSR.NFSV.GD.ZS
24         BM.GSR.CMCP.ZS
25         BM.GSR.FCTY.CD
26         BM.GSR.GNFS.CD
27         BM.GSR.INSF.ZS
28         BM.GSR.MRCH.CD
29         BM.GSR.NFSV.CD
              ...        
1315    TX.VAL.FUEL.ZS.UN
1316    TX.VAL.ICTG.ZS.UN
1317    TX.VAL.INSF.ZS.WT
1318    TX.VAL.MANF.ZS.UN
1319    TX.VAL.MMTL.ZS.UN
1320    TX.VAL.MRCH.AL.ZS
1321    TX.VAL.MRCH.CD.WT
1322    TX.VAL.MRCH.HI.ZS
1323    TX.VAL.MRCH.OR.ZS
1324    TX.VAL.MRCH.R1.ZS
1325    TX.VAL.MRCH.R2.ZS
1326    TX.VAL.MRCH.R3.ZS
1327    TX.VAL.MRCH.R4.ZS
1328    TX.VAL.MRCH.R5.ZS
1329    TX.VAL.MRCH.R6.ZS
1330    TX.VAL.MRCH.RS.ZS
1331    TX.VAL.MRCH.WL.CD
1332    TX.VAL.MRCH.WR.ZS
1333    TX.VAL.MRCH.XD.WD
1334    TX.VAL.OTHR.ZS.WT
1335    TX.VAL.SERV.CD.WT
1336       TX.VAL.TECH.CD
1337    TX.VAL.TECH.MF.ZS
1338    TX.VAL.TRAN.ZS.WT
1339    TX.VAL.TRVL.ZS.WT
1340          VC.BTL.DETH
1341       VC.IDP.TOTL.HE
1342       VC.IDP.TOTL.LE
1343       VC.IHR.PSRC.P5
1344       VC.PKP.TOTL.UN
Name: Indicator Code, dtype: object

In [198]:
print(csvFrame["Indicator Code"]=="AG.LND.AGRI.K2")


0       False
1       False
2       False
3        True
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
1315    False
1316    False
1317    False
1318    False
1319    False
1320    False
1321    False
1322    False
1323    False
1324    False
1325    False
1326    False
1327    False
1328    False
1329    False
1330    False
1331    False
1332    False
1333    False
1334    False
1335    False
1336    False
1337    False
1338    False
1339    False
1340    False
1341    False
1342    False
1343    False
1344    False
Name: Indicator Code, dtype: bool

In [199]:
print(csvFrame.loc[3,:])


Country Name                     Netherlands
Country Code                             NLD
Indicator Name    Agricultural land (sq. km)
Indicator Code                AG.LND.AGRI.K2
1960                                     NaN
1961                                   23140
1962                                   23030
1963                                   22890
1964                                   22680
1965                                   22550
1966                                   22450
1967                                   22390
1968                                   22270
1969                                   22100
1970                                   21930
1971                                   21280
1972                                   21140
1973                                   21010
1974                                   20940
1975                                   20820
1976                                   20730
1977                                   20600
1978                                   20460
1979                                   20340
1980                                   20200
1981                                   20110
1982                                   20050
1983                                   20090
1984                                   20160
1985                                   20190
                             ...            
1987                                   20140
1988                                   20120
1989                                   20040
1990                                   20060
1991                                   19910
1992                                   19860
1993                                   19880
1994                                   19710
1995                                   19640
1996                                   19810
1997                                   19660
1998                                   19730
1999                                   19670
2000                                   19560
2001                                   19310
2002                                   19490
2003                                   19230
2004                                   19494
2005                                   19377
2006                                   19196
2007                                   19144
2008                                   19293
2009                                   19174
2010                                   18723
2011                                   18584
2012                                   18417
2013                                   18476
2014                                     NaN
2015                                     NaN
Unnamed: 60                              NaN
Name: 3, dtype: object

In [200]:
csvFrame.loc[csvFrame["Indicator Code"]=="AG.LND.AGRI.K2",["Indicator Code","1970","1971"]]


Out[200]:
Indicator Code 1970 1971
3 AG.LND.AGRI.K2 21930 21280

In [201]:
### More advanced stuff
import matplotlib.pyplot as plt
## Plot
print(csvFrame.columns)
columns_to_keep = ['1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969',
       '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978',
       '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987',
       '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996',
       '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005',
       '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014',
       '2015']
print(csvFrame.loc[3,"Indicator Name"])
print(csvFrame.loc[5,"Indicator Name"])

years = []
for year in columns_to_keep:
    years.append(int(year))
print(years)

## Agricultural land is in sq. km: multiply by 100 to convert it to hectares
plt.plot(years,csvFrame.loc[3,columns_to_keep]*100,color="blue",label="Agricultural land")
plt.plot(years,csvFrame.loc[5,columns_to_keep],color="red",label="Arable land")
plt.xlabel("Year")
plt.ylabel("Land (hectares)")
plt.legend()
plt.show()


Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', 'Unnamed: 60'],
      dtype='object')
Agricultural land (sq. km)
Arable land (hectares)
[1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015]

In [202]:
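## Keep the indicator names and the years 1994-2013
csvFrame2 = csvFrame.loc[:,["Indicator Name"]+columns_to_keep[-22:-2]]
## Transpose: one row per year, one column per indicator
csvFrame_transposed = csvFrame2.set_index("Indicator Name").transpose()
## Drop the indicators with missing values
csvFrame_transposed = csvFrame_transposed.dropna(axis=1)

## Keep three indicators and give them short names (copy to avoid modifying a view)
csvFrame_transposedShort = csvFrame_transposed[["Agricultural land (sq. km)","Arable land (hectares)","Cereal production (metric tons)"]].copy()
csvFrame_transposedShort.columns = ["agric_land","arable_land","cereal_prod"]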

In [203]:
import statsmodels.formula.api as sm

result = sm.ols(formula="cereal_prod ~ agric_land * arable_land", data=csvFrame_transposedShort).fit()
result.summary()


Out[203]:
OLS Regression Results
Dep. Variable: cereal_prod R-squared: 0.454
Model: OLS Adj. R-squared: 0.352
Method: Least Squares F-statistic: 4.439
Date: Tue, 19 Jan 2016 Prob (F-statistic): 0.0188
Time: 16:35:25 Log-Likelihood: -263.77
No. Observations: 20 AIC: 535.5
Df Residuals: 16 BIC: 539.5
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 7.142e+07 4.32e+07 1.651 0.118 -2.03e+07 1.63e+08
agric_land -3634.2121 2219.807 -1.637 0.121 -8339.993 1071.569
arable_land -68.3397 43.271 -1.579 0.134 -160.070 23.391
agric_land:arable_land 0.0036 0.002 1.604 0.128 -0.001 0.008
Omnibus: 1.428 Durbin-Watson: 1.911
Prob(Omnibus): 0.490 Jarque-Bera (JB): 0.940
Skew: 0.174 Prob(JB): 0.625
Kurtosis: 1.996 Cond. No. 2.53e+13

In [204]:
result = sm.ols(formula="cereal_prod ~ agric_land + arable_land", data=csvFrame_transposedShort).fit()
result.summary()


Out[204]:
OLS Regression Results
Dep. Variable: cereal_prod R-squared: 0.367
Model: OLS Adj. R-squared: 0.292
Method: Least Squares F-statistic: 4.918
Date: Tue, 19 Jan 2016 Prob (F-statistic): 0.0206
Time: 16:35:25 Log-Likelihood: -265.26
No. Observations: 20 AIC: 536.5
Df Residuals: 17 BIC: 539.5
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 2.14e+06 2.06e+06 1.041 0.312 -2.2e+06 6.48e+06
agric_land -77.2548 92.732 -0.833 0.416 -272.902 118.392
arable_land 1.0483 0.471 2.223 0.040 0.053 2.043
Omnibus: 2.038 Durbin-Watson: 1.529
Prob(Omnibus): 0.361 Jarque-Bera (JB): 1.048
Skew: 0.079 Prob(JB): 0.592
Kurtosis: 1.890 Cond. No. 5.98e+07

In [259]:
result = sm.ols(formula="cereal_prod ~ arable_land", data=csvFrame_transposedShort).fit()
result.summary()


Out[259]:
OLS Regression Results
Dep. Variable: cereal_prod R-squared: 0.341
Model: OLS Adj. R-squared: 0.304
Method: Least Squares F-statistic: 9.299
Date: Wed, 20 Jan 2016 Prob (F-statistic): 0.00690
Time: 14:02:07 Log-Likelihood: -265.66
No. Observations: 20 AIC: 535.3
Df Residuals: 18 BIC: 537.3
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 4.608e+05 4e+05 1.152 0.265 -3.8e+05 1.3e+06
arable_land 1.2414 0.407 3.049 0.007 0.386 2.097
Omnibus: 2.053 Durbin-Watson: 1.456
Prob(Omnibus): 0.358 Jarque-Bera (JB): 1.038
Skew: -0.011 Prob(JB): 0.595
Kurtosis: 1.884 Cond. No. 1.17e+07

Group and plot


In [205]:
## Build a small dataframe from a list of rows (the Year column is left empty)
rows = [{"Newspaper":"ABC","Number_something": 1},
        {"Newspaper":"ABC","Number_something": 2},
        {"Newspaper":"ABC","Number_something": 1},
        {"Newspaper":"ElPais","Number_something": 10},
        {"Newspaper":"ElPais","Number_something": 25},
        {"Newspaper":"ElPais","Number_something": 25}]
data = pd.DataFrame(rows, columns=["Year", "Newspaper","Number_something"])
print(data)


## Mean of Number_something for each newspaper
meanData = data.groupby("Newspaper")["Number_something"].mean().reset_index()
meanData.plot(x = "Newspaper", y = "Number_something", kind="bar",edgecolor="none",color=(70/255,140/255,210/255),legend=False)
plt.show()


  Year Newspaper  Number_something
0  NaN       ABC                 1
1  NaN       ABC                 2
2  NaN       ABC                 1
3  NaN    ElPais                10
4  NaN    ElPais                25
5  NaN    ElPais                25

2.C Code: Operations, functions, control flow and loops

  • We have the data in data structures, composed of several data types.
  • We need code to edit everything

2.C.Operations

  • Change a data type or structure

In [206]:
## OPERATIONS ON DATA TYPES
print(3*5.0)
print(3 == 5)
b = 3
print(b == 3)
b = 5
print(b == 5)

## CONVERT BETWEEN TYPES
print(type(b))
b = float(b)
print(type(b))


15.0
False
True
True
<class 'int'>
<class 'float'>

2.C.Functions

  • A fragment of code that takes some standard input to give some standard output.
  • Example: The mean function. Gets a list of numbers as input, gives the mean as output. Gives an error if you try to calculate the mean of some strings.
  • We have already seen many functions. Add, mean...

In [207]:
## INDENTATION MATTERS: the body of a function must be indented


## Our own functions
def mean(listOfNumbers):
    return np.sum(listOfNumbers)/len(listOfNumbers)

In [208]:
aList = [2,3,4]
print(mean(aList))


3.0

2.C.Scope: Global vs local variables

  • Variables inside functions are only seen by that function
  • Variables outside functions can be read by all functions, and modified if the function declares them global or changes a mutable object in place (dangerous)
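
A sketch of the three cases, with made-up names: assigning inside a function creates a local variable, the global keyword gives access to the outer one, and mutable objects such as lists can be changed in place without it.

```python
counter = 0
seen_words = []

def set_counter_locally():
    counter = 100            # creates a new local variable; the global one is untouched
    return counter

def increase_counter():
    global counter           # explicitly ask for the variable defined outside
    counter = counter + 1

def remember(word):
    seen_words.append(word)  # no assignment: the outer list is changed in place

set_counter_locally()
print(counter)     # still 0

increase_counter()
print(counter)     # now 1

remember("spam")
print(seen_words)
```

Returning a value and assigning it outside the function, as in s = f(), is usually easier to follow than changing global variables.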

In [209]:
s = "I hate spam." 

## What's s?
def f(): 
    s = "Me too."
    return s

f()
print(s)


I hate spam.

In [210]:
s = "I hate spam." 

## What's s?
def f(): 
    s = "Me too."
    return s

s = f()
print(s)


Me too.

2.C.Control flow = if-else statements

  • Controls the flow
  • If something, do something. Else do another thing.

In [262]:
#Count words
aDict = dict({"Bob":5, "Pep":3})

name = input("Enter Bob or Pep: ")

if name == "Bob":
    aDict["Bob"] = aDict["Bob"] + 1  
elif name == "Pep":
    aDict["Pep"] = aDict["Pep"] + 1 
else:
    print("Wrong name")
    
print(aDict)


Enter Bob or Pep: Bob
{'Bob': 6, 'Pep': 3}

2.C.Loops

  • Iterate over something
  • for loop

In [276]:
import numpy as np
list_numbers = [1,9,121,2335432432432423434877733543544533.]

print(np.sqrt(list_numbers[0]))
print(np.sqrt(list_numbers[1]))
print(np.sqrt(list_numbers[2]))
print(np.sqrt(list_numbers[3]))
#...


1.0
3.0
11.0
4.83263120094e+16

In [279]:
for index in [0,1,2,3]:
    print(index)


0
1
2
3

In [280]:
for index in [0,1,2,3]:
    print(np.sqrt(list_numbers[index]))


1.0
3.0
11.0
4.83263120094e+16

In [283]:
for element in list_numbers:
    print(np.sqrt(element))


1.0
3.0
11.0
4.83263120094e+16
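
Two other common ways to write such loops, sketched with math.sqrt from the standard library instead of numpy: enumerate gives the index and the element at the same time, and a list comprehension builds a new list in one line.

```python
import math

list_numbers = [1, 9, 121]

# A for loop that stores the results in a new list
roots_loop = []
for element in list_numbers:
    roots_loop.append(math.sqrt(element))

# The same loop written as a list comprehension
roots_comprehension = [math.sqrt(element) for element in list_numbers]

# enumerate gives the position and the element together
for index, element in enumerate(list_numbers):
    print(index, element, roots_loop[index])

print(roots_comprehension)
```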

3.A.Working with strings

  • Delete punctuation
  • Convert the string to a list of words
  • Remove stop words
  • Join a list to a string

In [284]:
#Slice just like lists
this_is_a_string = "Hello my name is"
print(this_is_a_string[:10])


Hello my n

In [286]:
print("-"*10)


----------

In [285]:
#Upper and lower case
print("Hello All".lower())
print("Hello All".upper())


hello all
HELLO ALL

In [287]:
#Strip leading and trailing spaces or return characters
this_is_a_string = "Hello my name is\n" #tab = "\t"
print("-"*10)
print(this_is_a_string)
print("-"*10)
print(this_is_a_string.strip())
print("-"*10)


----------
Hello my name is

----------
Hello my name is
----------

In [290]:
#Formatting (\t = tab, \n = return)
print("{0}\t{1}-----{2}\n".format("Hello","my name","is"))


Hello	my name-----is


In [369]:
#Delete punctuation
def remove_punctuation(string_to_remove):
    import string
    transtable = {ord(c): None for c in string.punctuation}
    return string_to_remove.translate(transtable).lower()

initial_string = "Hello. I'm having breakfast with my brothers. A nice one"
print(initial_string)
new_string = remove_punctuation(initial_string)
print(new_string)


Hello. I'm having breakfast with my brothers. A nice one
hello im having breakfast with my brothers a nice one

In [375]:
#Stemming: remove word endings
def stem_string(string_to_stem,language="english"):
    from nltk.stem.snowball import SnowballStemmer
    stemmer = SnowballStemmer(language)
    return " ".join([stemmer.stem(word) for word in string_to_stem.split(" ")])

new_string = stem_string(new_string)
print(new_string)


hello im have breakfast with my brother a nice one

In [371]:
#Splitting. Convert the string to a list that we can iterate over
splitted_text = new_string.split(" ")
print(splitted_text)


['hello', 'im', 'have', 'breakfast', 'with', 'my', 'brother', 'a', 'nice', 'one']

In [372]:
#Join them
joined_text = " ".join(splitted_text)
print(joined_text)


hello im have breakfast with my brother a nice one

In [ ]:
#Download the stop word lists (only needed once)
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")

In [373]:
cached_stop = stopwords.words("english")
print(cached_stop)

def remove_stop_words_not_obscure(text):
    #split
    text = text.split()
    #remove stop words
    new_text = []
    for word in text:
        if word in cached_stop: pass
        else: new_text.append(word)
    #join together
    text = ' '.join(new_text)
    return text

def remove_stop_words(text):
    return ' '.join([word for word in text.split() if word not in cached_stop])
print()
print(joined_text)
print(remove_stop_words(new_string))


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

hello im have breakfast with my brother a nice one
hello im breakfast brother nice one
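
cached_stop is a list, so every "word not in cached_stop" check walks through the whole list. A possible speed-up for long texts is to turn the stop words into a set first. The sketch below uses a tiny made-up stop list (small_stop_list, not the real NLTK list) so that it runs without downloading anything; the function name remove_stop_words_set is also only for illustration.

```python
# Tiny hand-written stop list, only for illustration (assumption: in the
# workshop you would use stopwords.words("english") instead)
small_stop_list = ["i", "my", "with", "a", "have"]

# Looking up a word in a set is much faster than in a long list
stop_set = set(small_stop_list)

def remove_stop_words_set(text, stop_words=stop_set):
    # Keep only the words that are not in the stop word set
    return " ".join([word for word in text.split() if word not in stop_words])

cleaned = remove_stop_words_set("hello im have breakfast with my brother a nice one")
print(cleaned)
```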

3.B An example


In [300]:
def updateDictionary(aDict,name):
    if aDict.get(name):
        aDict[name] = aDict[name] + 1
    else:
        aDict[name] = 1 
    
    return aDict

text = "That is what happens when you flee your homeland. You don’t know that you are going to become part of a flood of refugees. I later would learn that I was one of the nearly 130,000 people who fled Saigon that day and one of the estimated two million “boat people” who fled Vietnam by boat and other means over the next two decades. But I didn’t set out to come to America; I left my house when my parents said I should."
print(text)
print()
text_no_punc = remove_punctuation(text)
print(text_no_punc)
print()
text_no_stop = remove_stop_words(text_no_punc)
print(text_no_stop)
print()

aDict = dict()
list_text_no_stop = text_no_stop.split()
print(list_text_no_stop)
print()
for word in list_text_no_stop:
    aDict = updateDictionary(aDict,word)
print(aDict)
print()


That is what happens when you flee your homeland. You don’t know that you are going to become part of a flood of refugees. I later would learn that I was one of the nearly 130,000 people who fled Saigon that day and one of the estimated two million “boat people” who fled Vietnam by boat and other means over the next two decades. But I didn’t set out to come to America; I left my house when my parents said I should.

that is what happens when you flee your homeland you don’t know that you are going to become part of a flood of refugees i later would learn that i was one of the nearly 130000 people who fled saigon that day and one of the estimated two million “boat people” who fled vietnam by boat and other means over the next two decades but i didn’t set out to come to america i left my house when my parents said i should

happens flee homeland don’t know going become part flood refugees later would learn one nearly 130000 people fled saigon day one estimated two million “boat people” fled vietnam boat means next two decades didn’t set come america left house parents said

['happens', 'flee', 'homeland', 'don’t', 'know', 'going', 'become', 'part', 'flood', 'refugees', 'later', 'would', 'learn', 'one', 'nearly', '130000', 'people', 'fled', 'saigon', 'day', 'one', 'estimated', 'two', 'million', '“boat', 'people”', 'fled', 'vietnam', 'boat', 'means', 'next', 'two', 'decades', 'didn’t', 'set', 'come', 'america', 'left', 'house', 'parents', 'said']

{'day': 1, 'america': 1, 'fled': 2, 'boat': 1, 'said': 1, 'one': 2, '“boat': 1, 'come': 1, 'means': 1, 'million': 1, 'going': 1, 'would': 1, 'flood': 1, 'refugees': 1, 'decades': 1, 'two': 2, '130000': 1, 'left': 1, 'vietnam': 1, 'learn': 1, 'estimated': 1, 'house': 1, 'become': 1, 'happens': 1, 'homeland': 1, 'parents': 1, 'saigon': 1, 'flee': 1, 'later': 1, 'people”': 1, 'didn’t': 1, 'people': 1, 'nearly': 1, 'don’t': 1, 'know': 1, 'set': 1, 'part': 1, 'next': 1}
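
Look at the dictionary above: '“boat' and 'people”' are counted separately from 'boat' and 'people', and 'don’t' kept its apostrophe. The reason is that string.punctuation only contains ASCII punctuation, so curly quotes survive remove_punctuation. One possible fix, sketched here with the standard unicodedata module, is to drop every character whose Unicode category starts with "P" (punctuation). The function name remove_unicode_punctuation is made up for this example.

```python
import unicodedata

def remove_unicode_punctuation(text):
    # Keep every character whose Unicode category is not punctuation ("P...")
    kept = [c for c in text if not unicodedata.category(c).startswith("P")]
    return "".join(kept).lower()

sample = "“boat people” who didn’t flee."
cleaned_sample = remove_unicode_punctuation(sample)
print(cleaned_sample)
```

Unlike string.punctuation, this approach does not remove symbols such as $ or +, because Unicode classifies them as symbols rather than punctuation.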


In [301]:
from collections import Counter
print(Counter(list_text_no_stop))


Counter({'fled': 2, 'one': 2, 'two': 2, 'day': 1, 'america': 1, 'boat': 1, 'said': 1, '“boat': 1, 'come': 1, 'means': 1, 'million': 1, 'going': 1, 'would': 1, 'flood': 1, 'refugees': 1, 'decades': 1, '130000': 1, 'left': 1, 'vietnam': 1, 'learn': 1, 'estimated': 1, 'house': 1, 'become': 1, 'happens': 1, 'homeland': 1, 'parents': 1, 'saigon': 1, 'flee': 1, 'later': 1, 'people”': 1, 'didn’t': 1, 'people': 1, 'nearly': 1, 'don’t': 1, 'know': 1, 'set': 1, 'part': 1, 'next': 1})
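
A Counter can also tell you directly which words are the most frequent, through its most_common method. A minimal sketch with a short made-up word list (not the text above):

```python
from collections import Counter

# Short made-up list of words, only for illustration
words = ["fled", "fled", "fled", "one", "one", "boat"]
counts = Counter(words)

# The two most frequent words, as (word, count) pairs
top_two = counts.most_common(2)
print(top_two)

# A Counter returns 0 for words it has never seen (no KeyError)
print(counts["vietnam"])
```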

3.C Working with dates

Dates are nasty. What date is this? 05/06/2015 (June 5th or May 6th?)

Luckily we have Python

from dateutil.parser import parse
http://dateutil.readthedocs.org/en/latest/parser.html#dateutil.parser.parse

  • dayfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the day (True) or month (False). If yearfirst is set to True, this distinguishes between YDM and YMD. If set to None, this value is retrieved from the current parserinfo object (which itself defaults to False).

  • yearfirst – Whether to interpret the first value in an ambiguous 3-integer date (e.g. 01/05/09) as the year. If True, the first number is taken to be the year, otherwise the last number is taken to be the year. If this is set to None, the value is retrieved from the current parserinfo object (which itself defaults to False).

  • fuzzy – Whether to allow fuzzy parsing, allowing for strings like “Today is January 1, 2047 at 8:21:00AM”.


In [307]:
from dateutil.parser import parse
print(parse("05-06-2015",dayfirst=True).date())
print(parse("05-06-2015",dayfirst=False).date())
print(parse("05/06-2015").date())
print(parse("Today is January 1, 2047 at 8:21:00AM",fuzzy=True).date())


2015-06-05
2015-05-06
2015-05-06
2047-01-01
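
When you already know how the dates are written, an alternative is to spell the format out with datetime.strptime from the standard library instead of letting the parser guess. A minimal sketch (the variable names are only for illustration):

```python
from datetime import datetime

# %d = day, %m = month, %Y = four-digit year
day_first = datetime.strptime("05/06/2015", "%d/%m/%Y").date()
month_first = datetime.strptime("05/06/2015", "%m/%d/%Y").date()

print(day_first)
print(month_first)
```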

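4. Writing and reading files + examples

with open(filename,how_to_open) as f:
    code goes here

  • f: variable name, whatever you want
  • how_to_open:
    • "w": Write (creates the file if it doesn't exist, empties it if it does)
    • "r": Read (the file must already exist)
    • "a": Append (creates the file if it doesn't exist)
    • Adding a + ("w+", "r+", "a+") opens the file for both reading and writing. It is "w" and "a" that create a missing file, not the +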

4.A Writing files


In [309]:
with open("./data/file_to_write.csv","w+") as f:
    f.write("I'm line number {0}".format(0))
    f.write("I'm line number {0}".format(1))
    f.write("I'm line number {0}".format(2))
    f.write("I'm line number {0}".format(3))
    f.write("I'm line number {0}".format(4))

In [310]:
with open("./data/file_to_write.csv","w+") as f:
    f.write("I'm line number {0}\n".format(0))
    f.write("I'm line number {0}\n".format(1))
    f.write("I'm line number {0}\n".format(2))
    f.write("I'm line number {0}\n".format(3))
    f.write("I'm line number {0}\n".format(4))

In [312]:
list(range(10))


Out[312]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [313]:
#Beware the enter: write() does not add the "\n" for you
with open("./data/file_to_write.csv","w+") as f:
    for i in range(10):
        f.write("I'm line number {0}\n".format(i))
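
The cells above always open the file in write mode, which starts from an empty file every time. Append mode keeps what is already there and adds to the end. A small sketch, using a temporary folder (an assumption, so that it does not touch ./data) and a made-up file name:

```python
import os
import tempfile

# Temporary folder and hypothetical file name, only for this example
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "file_to_append.csv")

# "w" creates the file (or empties it if it already exists)
with open(path, "w") as f:
    f.write("I'm line number 0\n")

# "a" keeps the existing content and writes at the end
with open(path, "a") as f:
    f.write("I'm line number 1\n")

with open(path, "r") as f:
    lines = f.read().splitlines()
print(lines)
```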

4.B Reading files


In [317]:
with open("./data/file_to_write.csv","r+") as f:
    splitted_by_line_1 = f.readlines()
        
with open("./data/file_to_write.csv","r+") as f:
    all_together = f.read()
splitted_by_line_2 = all_together.split("\n")

splitted_by_line_3 = []
with open("./data/file_to_write.csv","r+") as f:
    for line in f:
        splitted_by_line_3.append(line)

In [318]:
print(splitted_by_line_1)
print(splitted_by_line_2)
print(splitted_by_line_3)


["I'm line number 0\n", "I'm line number 1\n", "I'm line number 2\n", "I'm line number 3\n", "I'm line number 4\n", "I'm line number 5\n", "I'm line number 6\n", "I'm line number 7\n", "I'm line number 8\n", "I'm line number 9\n"]
["I'm line number 0", "I'm line number 1", "I'm line number 2", "I'm line number 3", "I'm line number 4", "I'm line number 5", "I'm line number 6", "I'm line number 7", "I'm line number 8", "I'm line number 9", '']
["I'm line number 0\n", "I'm line number 1\n", "I'm line number 2\n", "I'm line number 3\n", "I'm line number 4\n", "I'm line number 5\n", "I'm line number 6\n", "I'm line number 7\n", "I'm line number 8\n", "I'm line number 9\n"]

In [ ]:
#strip() removes the return ("\n") and any other spaces at both ends of the line
splitted_by_line_3 = []
with open("./data/file_to_write.csv","r+") as f:
    for line in f:
        splitted_by_line_3.append(line.strip())

print(splitted_by_line_3)


["I'm line number 0", "I'm line number 1", "I'm line number 2", "I'm line number 3", "I'm line number 4", "I'm line number 5", "I'm line number 6", "I'm line number 7", "I'm line number 8", "I'm line number 9"]

4.C Try - except

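Exception handling. Opening a file and not closing it is (very) bad for the system.

The "with open() as f" block closes the file for you, even if an error happens inside it. Without "with" you have to close the file yourself: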


In [ ]:
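splitted_by_line_3 = []
f = None
try:
    f = open("./data/file_to_write.csv","r+")
    for line in f:
        splitted_by_line_3.append(line.strip())
    print(splitted_by_line_3)
except IOError as error:
    #Catch file problems only, a bare "except:" would hide every other error too
    print(error)
finally:
    #Runs with or without an error, so the file is always closed
    if f is not None:
        f.close()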
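
It is usually safer to catch one specific exception than everything at once. The sketch below combines "with" and try - except to return an empty list when the file is missing; the function name read_lines_or_empty and the file name are made up for this example.

```python
def read_lines_or_empty(filename):
    try:
        # "with" closes the file, also when an error happens while reading
        with open(filename, "r") as f:
            return [line.strip() for line in f]
    except FileNotFoundError:
        # Only a missing file is handled here; other errors still show up
        print("File not found:", filename)
        return []

result = read_lines_or_empty("non_existing_file_for_this_example.txt")
print(result)
```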

5. Error debugging


In [319]:
Image("http://i.imgur.com/WRuJV6r.png")


Out[319]:

Errors

  • IndexError: List is too short
  • NameError: Misspelling, the variable/function/module is not defined
  • SyntaxError: You're missing parentheses, colons...
  • FileNotFoundError/IOError: The file doesn't exist
  • KeyError: In a dictionary, the key doesn't exist
  • IndentationError: You have a mixture of tabs and spaces
  • TypeError: The data structure doesn't allow for that operation, or a variable is None instead of having a value
  • AttributeError: The data structure doesn't have the method (e.g. calling mean() on a list)

IndexError: List is too short


In [320]:
this_is_a_list = [1,2,3,4,5]
len_list = len(this_is_a_list)
print(len_list)
this_is_a_list[len_list]


5
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-320-1f99355ae5f1> in <module>()
      2 len_list = len(this_is_a_list)
      3 print(len_list)
----> 4 this_is_a_list[len_list]

IndexError: list index out of range

In [321]:
this_is_a_list = [1,2,3,4,5]
for element in this_is_a_list:
    this_is_a_list.pop(-1)    
    print(element)


1
2
3
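
The loop above prints only 1, 2 and 3 without raising any error: each pop(-1) shortens the list while the for loop is still walking over it, so the loop runs out of elements early. If you need to change a list while looping, one common approach is to loop over a copy, as in this sketch (the list called printed is only there to collect the results):

```python
this_is_a_list = [1,2,3,4,5]
printed = []

# list(...) makes a copy, so popping from the original does not affect the loop
for element in list(this_is_a_list):
    this_is_a_list.pop(-1)
    printed.append(element)

print(printed)
print(this_is_a_list)
```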

NameError: Misspelling, the variable/function/module is not defined


In [323]:
this_is_a_list = [1,2,3,4,5]
for element in this_is_a_list:
    sum_all = sum_all + element


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-323-73d4f0ef9860> in <module>()
      1 this_is_a_list = [1,2,3,4,5]
      2 for element in this_is_a_list:
----> 3     sum_all = sum_all + element

NameError: name 'sum_all' is not defined
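
Here the name is not misspelled: sum_all is used on the right side of the = before it ever got a value. Giving it a starting value before the loop fixes the error:

```python
this_is_a_list = [1,2,3,4,5]

# Define the variable before using it inside the loop
sum_all = 0
for element in this_is_a_list:
    sum_all = sum_all + element

print(sum_all)
```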

SyntaxError: You're missing parentheses, colons...


In [324]:
def function()
    return 0


  File "<ipython-input-324-b90b96076647>", line 1
    def function()
                  ^
SyntaxError: invalid syntax

In [326]:
3 = 5


  File "<ipython-input-326-dc9bf34ad6e8>", line 1
    3 = 5
         ^
SyntaxError: can't assign to literal

In [327]:
3 == "3"


Out[327]:
False

In [328]:
3 == int("3")


Out[328]:
True

In [330]:
"A" == "a"


Out[330]:
False

FileNotFoundError/IOError: The file doesn't exist


In [331]:
open("non_existing_file","r")


---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-331-29e273de833d> in <module>()
----> 1 open("non_existing_file","r")

FileNotFoundError: [Errno 2] No such file or directory: 'non_existing_file'

KeyError: In a dictionary, the key doesn't exist


In [332]:
d = dict({"You": 0, "Her": 1})
d["Him"]


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-332-1e1ab586bf06> in <module>()
      1 d = dict({"You": 0, "Her": 1})
----> 2 d["Him"]

KeyError: 'Him'
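
Two common ways to avoid the KeyError are checking with "in" before reading the key, or using get(), which returns a default value when the key is missing. A short sketch with the same dictionary (the default value -1 is an arbitrary choice):

```python
d = dict({"You": 0, "Her": 1})

# Option 1: check if the key exists
has_him = "Him" in d
print(has_him)

# Option 2: get() returns the second argument if the key is missing
value_him = d.get("Him", -1)
value_her = d.get("Her", -1)
print(value_him, value_her)
```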

IndentationError: You have a mixture of tabs and spaces

ipython notebooks handle this

TypeError: The data structure doesn't allow for that operation, or a variable is None instead of having a value


In [333]:
this_is_a_list = [0,1,2,3,4]
this_is_a_list + 8


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-333-c0a4b0d1fa13> in <module>()
      1 this_is_a_list = [0,1,2,3,4]
----> 2 this_is_a_list + 8

TypeError: can only concatenate list (not "int") to list

In [334]:
this_is_a_list = [0,1,2,3,4]
this_is_a_list + [8]


Out[334]:
[0, 1, 2, 3, 4, 8]

AttributeError: The data structure doesn't have the method (e.g. calling mean() on a list)


In [335]:
this_is_a_list = [0,1,2,3,4]
this_is_a_list.add(8)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-335-8ad22b5651d0> in <module>()
      1 this_is_a_list = [0,1,2,3,4]
----> 2 this_is_a_list.add(8)

AttributeError: 'list' object has no attribute 'add'
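
Lists have append() instead of add() (add() belongs to sets). If you are not sure whether a method exists, hasattr() tells you, and dir(this_is_a_list) lists everything that is available. A small sketch:

```python
this_is_a_list = [0,1,2,3,4]

# hasattr checks if the object has a method or attribute with that name
has_add = hasattr(this_is_a_list, "add")
has_append = hasattr(this_is_a_list, "append")
print(has_add, has_append)

# append modifies the list in place
this_is_a_list.append(8)
print(this_is_a_list)
```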

In-place algorithms


In [336]:
this_is_a_list = [4,3,2,1,0]
this_is_a_list = sorted(this_is_a_list)
print(this_is_a_list)


[0, 1, 2, 3, 4]

In [337]:
this_is_a_list = [4,3,2,1,0]
this_is_a_list = this_is_a_list.sort()
print(this_is_a_list)


None

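sort() sorts the list in place and returns None, so the assignment above replaces the list with None. Either use sorted(), or call sort() without assigning the result: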


In [338]:
this_is_a_list = [4,3,2,1,0]
this_is_a_list.sort() #IN-PLACE SORTING!!
print(this_is_a_list)


[0, 1, 2, 3, 4]
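
This is also a typical way to end up with a TypeError because a variable is None: the result of sort() gets stored and then used later. A small sketch that catches the error so you can see the message (the variable names result, first and message are only for illustration):

```python
this_is_a_list = [4,3,2,1,0]
result = this_is_a_list.sort()  # sort() returns None

try:
    first = result[0]  # None cannot be indexed
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# The original list was sorted in place
print(this_is_a_list)
```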

6. Summary

We have

  • Python
  • External packages
    • numpy and scipy: math
    • pandas: spreadsheet
    • matplotlib (pylab): plot
    • statsmodels: regression

Python and packages have

  • Data structures: lists, numpy arrays, pandas dataframes

That are composed of

  • Other data structures
  • Data types: ints, floats, strings, dates

We manipulate the data structures with code

  • Operations
  • Functions (from python/packages)
  • If-else statements
  • Loops