Working with data 2017. Class 1

Contact

Javier Garcia-Bernardo garcia@uva.nl

0. Structure

  1. About Python
  2. Data types, structures and code
  3. Read csv files to dataframes
  4. Basic operations with dataframes
  5. My first plots
  6. Debugging python
  7. Summary

In [12]:
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell

#Print the plots in this screen
%matplotlib inline 

#Be able to plot images saved in the hard drive
from IPython.display import Image 

#Make the notebook wider
from IPython.core.display import display, HTML 
display(HTML("<style>.container { width:90% !important; }</style>"))



3. Reading files

CSV = comma separated values file

  • The problem with CSVs is that the values inside your fields can themselves contain commas. One solution is to put quotes around all the fields; if you do this, the computer understands that commas inside quotes do not separate fields (see the quoted example below). Another solution is to separate fields with tabs (\t)

TSV = tab separated values file

  • However most people (including me) call them csv.
  • An example of a csv:

person year score
1 2000 8
2 2000 1
3 2000 3
1 2010 7
2 2010 3
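
To see how quoting protects commas, here is a small sketch (the data is made up) that builds a quoted csv in memory and reads it with pandas:

In [ ]:
import io
import pandas as pd

#A made-up csv where the "name" field contains a comma;
#the quotes tell the parser that this comma is part of the value, not a separator
raw = 'person,name,year\n1,"Garcia, Javier",2000\n2,"Smith, Anna",2010\n'

df_quoted = pd.read_csv(io.StringIO(raw))
print(df_quoted)   #the comma stays inside the "name" column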

We use pandas to read them and save them in a data structure called a DataFrame

  • We are going to use the files in the data directory. You can go to the dashboard of jupyter notebook and check that the files are there.

In [13]:
#First import required library
import pandas as pd

In [4]:
## Read excel
excelFrame = pd.read_excel("data/class1_test_excel.xlsx",sheetname = 0)
#Print the first 5 lines
print(excelFrame.head(5))


   person  year  treatment  score
0       1  2000          1      4
1       2  2000          1      3
2       3  2000          2      6
3       4  2000          2      4
4       1  2005          1      8

In [7]:
#However, Jupyter notebooks display the output of the last command in a cell, so you can skip the print and it looks nicer
excelFrame.head(5)


Out[7]:
person year treatment score
0 1 2000 1 4
1 2 2000 1 3
2 3 2000 2 6
3 4 2000 2 4
4 1 2005 1 8

In [8]:
## Read stata
stataFrame = pd.read_stata("data/class1_test_stata.dta")
stataFrame.head(5)


Out[8]:
index person year treatment score
0 0 1 2000 1 4
1 1 2 2000 1 3
2 2 3 2000 2 6
3 3 4 2000 2 4
4 4 1 2005 1 8

In [10]:
## Read csv
csvFrame = pd.read_csv("data/class1_test_csv.csv",
                       sep="\t",skiprows=4,na_values=["-9"])
csvFrame.head()


Out[10]:
person year treatment score
0 1 2000 1 4.0
1 2 2000 1 3.0
2 3 2000 2 6.0
3 4 2000 2 4.0
4 1 2005 1 8.0

1 "a" 2000 2 "b" 3000


In [ ]:

We'll focus on CSVs because they are universal: you can read them with any text editor, and you can export your data as a csv from almost any program (a small pandas export example follows the list below)

  • From stata to csv: outsheet id gender race read write science using outfile.csv , comma
  • From excel to csv: Save as -> csv (or text to use tabs)
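
You can also export a DataFrame to csv from pandas itself with df.to_csv; a small sketch (the file names are just examples):

In [ ]:
import pandas as pd

df_export = pd.DataFrame({"person": [1, 2], "year": [2000, 2010], "score": [8, 3]})

#comma-separated, keep the header, drop the row index
df_export.to_csv("data/example_export.csv", sep=",", index=False)

#tab-separated ("tsv") version of the same data
df_export.to_csv("data/example_export.tsv", sep="\t", index=False)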

3.1 pd.read_csv

  • Pandas function to read csv files.
  • A function is a piece of code that takes some input and returns some output.
  • In this case, it takes a file name as input and returns a DataFrame

Other Examples

  • sorted() is also a function: it takes a list as input and returns a sorted copy of the list
  • sum() is also a function: it takes a list as input and returns the sum of its elements
  • .pop() is also a function (a list method): it takes an index as input, removes the element at that index from the list, and returns it
  • np.mean() is another example (see the short sketch below)
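
A short sketch of these functions in action (the list is made up):

In [ ]:
import numpy as np

my_list = [3, 1, 2]

print(sorted(my_list))    #[1, 2, 3] -> a new, sorted list
print(sum(my_list))       #6 -> sum of the elements
print(my_list.pop(0))     #3 -> removes and returns the element at index 0
print(my_list)            #[1, 2] -> the list itself changed
print(np.mean(my_list))   #1.5 -> mean of the remaining elements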

Argument of a function

  • What is inside the parentheses are the arguments; they tell the function how to do its work

Arguments of pd.read_csv() (a combined example follows the list)

  • path (required, first argument, no need to write path=): the name of the file. If the file is inside a folder, you need to include the folder in the path. For instance, if the file "example.csv" is inside the folder "data", you write data/example.csv
  • sep (default ","): "\t" for tab, "," for comma, ";" for semicolon, etc
  • header (default 0): 0 if the first line has column names. None if the first line already contains data.
  • skiprows (default 0): number of lines to skip
  • skipfooter (default 0): number of lines to skip at the end
  • usecols (default None): what columns do you want to read? The default is all, but you can say [0,3,4] or ["column_x","column_y"]
  • na_values (default None): what other values should be considered missing (e.g. ["n.a.","-9","-999"])
  • thousands (default None): what is the thousands separator, usually there is None
  • decimal (default "."): Americans use "."; Europeans use ","; in science we use ".".
  • encoding (default "UTF-8"): "UTF-8" (great), "UTF-16", "ISO-8859-1" (W Europe), "SHIFT-JIS" (Japan), "ASCII" (US files)
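
A combined, hypothetical call using several of these arguments (the file name, columns and values are only an illustration, not one of our class files):

In [ ]:
import pandas as pd

#hypothetical file: tab-separated, 2 comment lines at the top,
#we only want three columns, and "n.a." and -9 mean missing
df_example = pd.read_csv("data/some_survey.csv",
                         sep="\t",
                         skiprows=2,
                         usecols=["person","year","score"],
                         na_values=["n.a.","-9"],
                         encoding="UTF-8")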

In [11]:
#To find more about the function use 
pd.read_csv?

In [12]:
#To find a lot more about the function use
pd.read_csv??

4. Basic functions on dataframes

We will be using a very small dataset (data/class1_test_csv.csv)

  • Uses TABS (\t) as separator: sep="\t"
  • Does not have an index_col: index_col=None
  • Has 4 rows at the start with comments: skiprows=4
  • Uses "-9" as missing value: na_values=["-9"]
  • The rest are the default options, so we don't need to write them

In [14]:
#First we read our data
import pandas as pd
df = pd.read_csv("data/class1_test_csv.csv",sep="\t",skiprows=4,na_values=["-9"])

In [15]:
#And print it
df


Out[15]:
person year treatment score
0 1 2000 1 4.0
1 2 2000 1 3.0
2 3 2000 2 6.0
3 4 2000 2 4.0
4 1 2005 1 8.0
5 2 2005 1 7.0
6 3 2005 2 5.0
7 4 2005 2 5.0
8 1 2010 1 9.0
9 2 2010 1 7.0
10 3 2010 2 6.0
11 3 2010 2 NaN

4.1 Descriptives


In [16]:
#Describe the data
#use df.describe? (outside a comment) to get help

df.describe()


/opt/anaconda/anaconda3/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[16]:
person year treatment score
count 12.000000 12.000000 12.000000 11.000000
mean 2.416667 2005.000000 1.500000 5.818182
std 1.083625 4.264014 0.522233 1.834022
min 1.000000 2000.000000 1.000000 3.000000
25% 1.750000 2000.000000 1.000000 NaN
50% 2.500000 2005.000000 1.500000 NaN
75% 3.000000 2010.000000 2.000000 NaN
max 4.000000 2010.000000 2.000000 9.000000

You can calculate the mean with df.mean() (or the median, std, etc)


In [19]:
## Calculate mean by columns
## axis is a very common argument. By default the mean is computed by column
#df.mean() === df.mean(axis=0)
df.mean(axis=0)


Out[19]:
person          2.416667
year         2005.000000
treatment       1.500000
score           5.818182
dtype: float64

In [20]:
df.mean(axis=1) #By rows


Out[20]:
0     501.500000
1     501.500000
2     502.750000
3     502.500000
4     503.750000
5     503.750000
6     503.750000
7     504.000000
8     505.250000
9     505.000000
10    505.250000
11    671.666667
dtype: float64
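
The same pattern works for the other descriptives; a short sketch, assuming df is the DataFrame we read above:

In [ ]:
#per-column descriptives (axis=0 is the default)
print(df.median())          #median of every column
print(df.std())             #standard deviation of every column

#descriptives of a single column
print(df["score"].mean())   #mean score
print(df["score"].max())    #highest score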

4.2 Keeping columns


In [18]:
df.head()


Out[18]:
person year treatment score
0 1 2000 1 4.0
1 2 2000 1 3.0
2 3 2000 2 6.0
3 4 2000 2 4.0
4 1 2005 1 8.0

In [19]:
## Keep ONE column
df["treatment"]


Out[19]:
0     1
1     1
2     2
3     2
4     1
5     1
6     2
7     2
8     1
9     1
10    2
11    2
Name: treatment, dtype: int64

In [22]:
## Keep SEVERAL columns
df[["year","treatment"]]


Out[22]:
year treatment
0 2000 1
1 2000 1
2 2000 2
3 2000 2
4 2005 1
5 2005 1
6 2005 2
7 2005 2
8 2010 1
9 2010 1
10 2010 2
11 2010 2
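
Note the difference between single and double brackets; a small sketch, assuming df as above:

In [ ]:
one_col = df["treatment"]              #single brackets -> a Series (one column)
two_cols = df[["year","treatment"]]    #double brackets -> a DataFrame

print(type(one_col))    #<class 'pandas.core.series.Series'>
print(type(two_cols))   #<class 'pandas.core.frame.DataFrame'>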

4.3 Keeping rows (slicing like a list): df.iloc[slice] (not too useful)


In [21]:
df.iloc[:5]


Out[21]:
person year treatment score
0 1 2000 1 4.0
1 2 2000 1 3.0
2 3 2000 2 6.0
3 4 2000 2 4.0
4 1 2005 1 8.0

4.4 Keeping rows (filtering like a numpy array): df[filter] (very useful)


In [23]:
df


Out[23]:
person year treatment score
0 1 2000 1 4.0
1 2 2000 1 3.0
2 3 2000 2 6.0
3 4 2000 2 4.0
4 1 2005 1 8.0
5 2 2005 1 7.0
6 3 2005 2 5.0
7 4 2005 2 5.0
8 1 2010 1 9.0
9 2 2010 1 7.0
10 3 2010 2 6.0
11 3 2010 2 NaN

In [49]:
df["year"].isin([2000,2010])


Out[49]:
0      True
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8      True
9      True
10     True
Name: year, dtype: bool

In [53]:
cond = df["year"].isin( [2000,2010] )
cond


Out[53]:
0      True
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8      True
9      True
10     True
Name: year, dtype: bool

In [54]:
df.loc[cond]


Out[54]:
person year treatment score score_sq happiness events
0 1 2000 1 4.0 16.0 1 1
1 2 2000 1 3.0 9.0 2 2
2 3 2000 2 6.0 36.0 3 3
3 4 2000 2 4.0 16.0 4 4
8 1 2010 1 9.0 81.0 9 9
9 2 2010 1 7.0 49.0 10 10
10 3 2010 2 6.0 36.0 11 11

In [59]:
df.loc[df["year"] == 2000]


Out[59]:
person year treatment score score_sq happiness events
0 1 2000 1 4.0 16.0 1 1
1 2 2000 1 3.0 9.0 2 2
2 3 2000 2 6.0 36.0 3 3
3 4 2000 2 4.0 16.0 4 4

In [ ]:
#Parentheses () are used to call functions: mean(), list(), sorted(), .isin()
#Square brackets [] are used to select/index and to write lists: df.loc[], df["Year"], ["A","b","c"]

In [26]:
df_2000 = df[df["year"] == 2000]
df_2000


Out[26]:
person year treatment score
0 1 2000 1 4.0
1 2 2000 1 3.0
2 3 2000 2 6.0
3 4 2000 2 4.0

In [23]:
#For example we want to keep the rows with the year 2000
#We can create the condition
condition = df["year"] == 2000
print(condition)


0      True
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
Name: year, dtype: bool

In [68]:
df.loc[df["year"] == 2000 ,  ["score","events"]  ]


Out[68]:
score events
0 4.0 1
1 3.0 2
2 6.0 3
3 4.0 4

In [26]:
#And then filter. In a numpy array you could do np_array[condition]. Here you do df[condition]
condition = df["year"] == 2000
df[condition] # df[df["year"]==2000]


Out[26]:
person year treatment score
0 1 2000 1 4.0
1 2 2000 1 3.0
2 3 2000 2 6.0
3 4 2000 2 4.0

In [30]:
df[df["year"].isin([2000,2010])]


Out[30]:
person year treatment score
0 1 2000 1 4.0
1 2 2000 1 3.0
2 3 2000 2 6.0
3 4 2000 2 4.0
8 1 2010 1 9.0
9 2 2010 1 7.0
10 3 2010 2 6.0
11 3 2010 2 NaN

In [27]:
#Keep rows whose year is any of several values, saving the condition in a variable first
condition = df["year"].isin([2000,2010])
df[condition]


Out[27]:
person year treatment score
0 1 2000 1 4.0
1 2 2000 1 3.0
2 3 2000 2 6.0
3 4 2000 2 4.0
8 1 2010 1 9.0
9 2 2010 1 7.0
10 3 2010 2 6.0
11 3 2010 2 NaN
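
You can also combine conditions with & (and) and | (or); this goes slightly beyond what we did above, but it follows the same filtering idea. Note the parentheses around each condition:

In [ ]:
#rows from the year 2000 that received treatment 1
df[(df["year"] == 2000) & (df["treatment"] == 1)]

#rows from either 2000 or 2010 (same result as .isin above)
df[(df["year"] == 2000) | (df["year"] == 2010)]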

In [30]:
df_treat_and_year = df[["treatment","year"]]

In [34]:
df_treat_and_year.head()


Out[34]:
treatment year
0 1 2000
1 1 2000
2 2 2000
3 2 2000
4 1 2005

Keeping rows and columns (very useful) df.loc[condition,[columns]]


In [ ]:
df.loc[df["year"]==2000, ["year","treatment"]]

In [35]:
#Keeping the columns year and treatment for the year 2000
condition = df["year"] == 2000
df.loc[condition,["year","treatment"]]


Out[35]:
year treatment
0 2000 1
1 2000 1
2 2000 2
3 2000 2

In [37]:
#df[["year","treatment"]]
df.loc[:,["year","treatment"]].head()


Out[37]:
year treatment
0 2000 1
1 2000 1
2 2000 2
3 2000 2
4 2005 1

In [69]:
df


Out[69]:
person year treatment score score_sq happiness events
0 1 2000 1 4.0 16.0 1 1
1 2 2000 1 3.0 9.0 2 2
2 3 2000 2 6.0 36.0 3 3
3 4 2000 2 4.0 16.0 4 4
4 1 2005 1 8.0 64.0 5 5
5 2 2005 1 7.0 49.0 6 6
6 3 2005 2 5.0 25.0 7 7
7 4 2005 2 5.0 25.0 8 8
8 1 2010 1 9.0 81.0 9 9
9 2 2010 1 7.0 49.0 10 10
10 3 2010 2 6.0 36.0 11 11

In [71]:
df["test"] = df["score"]/2
df


/opt/anaconda/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
Out[71]:
person year treatment score score_sq happiness events test
0 1 2000 1 4.0 16.0 1 1 2.0
1 2 2000 1 3.0 9.0 2 2 1.5
2 3 2000 2 6.0 36.0 3 3 3.0
3 4 2000 2 4.0 16.0 4 4 2.0
4 1 2005 1 8.0 64.0 5 5 4.0
5 2 2005 1 7.0 49.0 6 6 3.5
6 3 2005 2 5.0 25.0 7 7 2.5
7 4 2005 2 5.0 25.0 8 8 2.5
8 1 2010 1 9.0 81.0 9 9 4.5
9 2 2010 1 7.0 49.0 10 10 3.5
10 3 2010 2 6.0 36.0 11 11 3.0

In [37]:
df["score_sq"] = df["score"]**2
df


Out[37]:
person year treatment score score_sq
0 1 2000 1 4.0 16.0
1 2 2000 1 3.0 9.0
2 3 2000 2 6.0 36.0
3 4 2000 2 4.0 16.0
4 1 2005 1 8.0 64.0
5 2 2005 1 7.0 49.0
6 3 2005 2 5.0 25.0
7 4 2005 2 5.0 25.0
8 1 2010 1 9.0 81.0
9 2 2010 1 7.0 49.0
10 3 2010 2 6.0 36.0
11 3 2010 2 NaN NaN

Creating new variables


In [38]:
df["happiness"] = [1,2,3,4,5,6,7,8,9,10,11,12]
df["events"] = [1,2,3,4,5,6,7,8,9,10,11,12]

In [40]:
df


Out[40]:
person year treatment score happiness events
0 1 2000 1 4.0 1 1
1 2 2000 1 3.0 2 2
2 3 2000 2 6.0 3 3
3 4 2000 2 4.0 4 4
4 1 2005 1 8.0 5 5
5 2 2005 1 7.0 6 6
6 3 2005 2 5.0 7 7
7 4 2005 2 5.0 8 8
8 1 2010 1 9.0 9 9
9 2 2010 1 7.0 10 10
10 3 2010 2 6.0 11 11
11 3 2010 2 NaN 12 12

In [41]:
#create a new row (values in the same order as the columns)
df.loc[12] = [2,2017,2,9.,10,23]

Sorting the dataframe


In [43]:
df.sort_values(by=["treatment","score"],ascending=[True,False])


Out[43]:
person year treatment score happiness events
8 1.0 2010.0 1.0 9.0 9.0 9.0
4 1.0 2005.0 1.0 8.0 5.0 5.0
5 2.0 2005.0 1.0 7.0 6.0 6.0
9 2.0 2010.0 1.0 7.0 10.0 10.0
0 1.0 2000.0 1.0 4.0 1.0 1.0
1 2.0 2000.0 1.0 3.0 2.0 2.0
12 2.0 2017.0 2.0 9.0 10.0 23.0
2 3.0 2000.0 2.0 6.0 3.0 3.0
10 3.0 2010.0 2.0 6.0 11.0 11.0
6 3.0 2005.0 2.0 5.0 7.0 7.0
7 4.0 2005.0 2.0 5.0 8.0 8.0
3 4.0 2000.0 2.0 4.0 4.0 4.0
11 3.0 2010.0 2.0 NaN 12.0 12.0

Dropping rows with missing values


In [46]:
df_no_nan = df.dropna(subset=["score"])
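
Two related sketches (same idea as dropna above): counting missing values and dropping rows with a missing value in any column:

In [ ]:
#how many missing values per column?
print(df.isnull().sum())

#drop rows with a missing value in ANY column
df_complete = df.dropna()

#drop rows only when "score" is missing (what we did above)
df_no_nan = df.dropna(subset=["score"])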

Checking and modifying the names of the columns


In [47]:
df.columns


Out[47]:
Index(['person', 'year', 'treatment', 'score', 'happiness', 'events'], dtype='object')

In [48]:
df.columns = ["ID","year","treatment","score","happiness","events"]
df.head()


Out[48]:
ID year treatment score happiness events
0 1.0 2000.0 1.0 4.0 1.0 1.0
1 2.0 2000.0 1.0 3.0 2.0 2.0
2 3.0 2000.0 2.0 6.0 3.0 3.0
3 4.0 2000.0 2.0 4.0 4.0 4.0
4 1.0 2005.0 1.0 8.0 5.0 5.0
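
If you only want to rename some of the columns (instead of overwriting the whole list), df.rename is an alternative; a brief sketch:

In [ ]:
#rename only the "ID" column back to "person"; the other columns keep their names
df_renamed = df.rename(columns={"ID": "person"})
df_renamed.head()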

5. My first plots (scatter, line, box plot)

Finally!

  • We will use the libraries matplotlib and seaborn.
  • seaborn makes beautiful plots. matplotlib is easier (and it is what seaborn is built on).

In [49]:
#this tells the computer to plot everything here
%matplotlib inline 

#importing this library makes the default colors be beautiful
import seaborn as sns 

#this imports matplotlib (pylab bundles matplotlib.pyplot and numpy)
import pylab as plt

5.1 Basic commands


In [ ]:
#create a figure with a size (measured in inches!)
plt.figure(figsize=(4,3)) 

#add a title to the figure
plt.title("Title")

#add a label in the x and y axis
plt.xlabel("X axis label")
plt.ylabel("Y axis label",fontsize=14) #we can add the font size to all the functions where we pass text

#add a legend
plt.legend()

#use log scale in the x and y axis
plt.xscale("log")
plt.yscale("log")

#trim the x axis between 1 and 100 (to make it look like you want, it depends on your specific values)
plt.xlim((1,100))

#add a grid on the minor ticks (vertical/horizontal lines) with 50% transparency
plt.grid(which='minor',alpha=0.5)

#take out the grid
plt.grid(False)

#save the figure (I CAN'T STRESS ENOUGH: SAVE AS PDF FOR ANY PAPER YOU WRITE!)
plt.savefig("plots/name_of_figure.pdf") 

#show the figure (not required in jupyter notebooks but still good to write it)
plt.show()
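
Putting a few of these commands together, a minimal self-contained sketch (the numbers are made up):

In [ ]:
import pylab as plt

plt.figure(figsize=(4,3))
plt.plot([2000,2005,2010],[4,8,9],marker="o",label="Person 1")
plt.title("Title")
plt.xlabel("Year")
plt.ylabel("Score",fontsize=14)
plt.legend()
plt.show()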

5.2 Scatter plot: plt.scatter(arguments)

Used to plot two quantitative variables against each other. We can add one extra quant. variable if we use bubble size and one qualitative if we use bubble color.

Important arguments:

  • x: x values (an array)
  • y: y values (an array)
  • c (optional, default = "blue"): color (can be an array, or a string such as "blue")
  • s (optional, default = 20): size (can be an array, or a number)
  • alpha (optional, default = 1): transparency
  • edgecolor (optional, default = "black"): "none" for none
  • cmap: which colormap to use: http://matplotlib.org/examples/color/colormaps_reference.html
  • label: label of the plot for the legend

In [50]:
plt.scatter?

In [51]:
df.head()


Out[51]:
ID year treatment score happiness events
0 1.0 2000.0 1.0 4.0 1.0 1.0
1 2.0 2000.0 1.0 3.0 2.0 2.0
2 3.0 2000.0 2.0 6.0 3.0 3.0
3 4.0 2000.0 2.0 4.0 4.0 4.0
4 1.0 2005.0 1.0 8.0 5.0 5.0

In [39]:
df = df.dropna()

In [40]:
x = df["score"] #x values
y = df["happiness"] #y values
c = df["treatment"] #color
s = df["events"] #size

#to convert the pandas column into a np.array you need to write ".values".
#It's not needed for plotting, but it makes the printout nicer
print(x.values) 
print(y.values)
print(c.values) 
print(s.values) 

#Create a figure
plt.figure(figsize=(6,4)) 

#Make the scatter plot, using treatment as color, events*20 as marker size,
#no edge color and a Red-Yellow-Blue colormap
plt.scatter(x,y,c=c,s=s*20,edgecolor="none",cmap="RdYlBu")
plt.xlabel("Score",fontsize=12)
plt.ylabel("Happiness",fontsize=12)
plt.show()


[ 4.  3.  6.  4.  8.  7.  5.  5.  9.  7.  6.]
[ 1  2  3  4  5  6  7  8  9 10 11]
[1 1 2 2 1 1 2 2 1 1 2]
[ 1  2  3  4  5  6  7  8  9 10 11]

In [42]:
plt.scatter(df["score"],df["happiness"])


Out[42]:
<matplotlib.collections.PathCollection at 0x7ff72aab4128>

5.3 Line plot

Used to plot two quantitative variables against each other. We can add one qualitative if we use several lines. Importantly, the x variable must be ordered.

Important arguments:

  • x: x values (an array)
  • y: y values (an array)
  • color (optional, default = "blue"): color (a string, hex string, or rgb tuple e.g. (1,0,0) = red)
  • marker (optional, default = "none"): which marker symbol to use
  • ms (optional, default = 20): marker size (can be an array, or a number)
  • markeredgecolor (optional, default = same as color): color of the edge of the marker
  • alpha (optional, default = 1): transparency
  • label (optional, default =""): label of the plot for the legend

In [ ]:
plt.plot?

In [58]:
#Data of person 1,2,3 and 4
df_1 = df.loc[df["ID"]==1,["year","score"]]
df_2 = df.loc[df["ID"]==2,["year","score"]]
df_3 = df.loc[df["ID"]==3,["year","score"]]
df_4 = df.loc[df["ID"]==4,["year","score"]]

#let's use the default matplotlib colors (instead of the seaborn colors)
sns.reset_orig() #sns.set() to bring back the seaborn colors

#create plot
plt.figure(figsize=(6,4)) 


#plot the score for all years for the people
plt.plot(df_1["year"],df_1["score"],marker="o",color="#0D4F8B",linewidth=2,label="Treatment 1")
plt.plot(df_2["year"],df_2["score"],marker="o",color="#0D4F8B",linewidth=2,label="") #no legend for this guy
plt.plot(df_3["year"],df_3["score"],marker="o",color="#e60000",linewidth=2,label="Treatment 2")
plt.plot(df_4["year"],df_4["score"],marker="o",color="#e60000",linewidth=2,label="") #no legend for this guy

#Make a legend; loc=0 means "best" location, or you can pick a specific corner with the "loc" argument
plt.legend(loc=0) 

#Labels
plt.xlabel("Year",fontsize=14)
plt.ylabel("Score",fontsize=14)

#Add some more space so the markers are not cut
plt.xlim(1999.8,2010.2)
plt.ylim(2.8,9.2)


plt.show()


5.4 Box plot (using seaborn)

Used to plot the ranges of one quantitative variable for different categories. We can add another qualitative variable splitting into colors (hue).

Important arguments:

  • x: x values (an array)
  • y: y values (an array)
  • hue (optional): if we want to divide the data using another column in our dataframe
  • data: name of the dataframe
  • orient (optional): "v" | "h" for vertical or horizontal
  • width (optional): width of the boxes
  • palette (optional): name of the colormap: http://matplotlib.org/examples/color/colormaps_reference.html

In [ ]:
sns.boxplot?

In [5]:
import pylab as plt

In [11]:
sns.set(font_scale=1.2) #20% larger fonts

#Create figure
plt.figure(figsize=(6,4))

#Make box plot
sns.boxplot(x="treatment", y="score",hue='year', data=df,palette="Blues")

#Remove the top and right axis lines (spines)
sns.despine(trim=True)
plt.savefig("plots/d.pdf")
plt.show()


But is this the most appropriate plot?

  • What would we want to visualize?
  • What can be more effective for it?
  • Which variables are continuous (and can thus be connected by a line)?

5.5 Line plot (using seaborn, called a point plot)

A point plot represents an estimate of central tendency for a numeric variable by the position of scatter plot points and provides some indication of the uncertainty around that estimate using error bars.

Important arguments:

  • x: x values (an array)
  • y: y values (an array)
  • hue (optional): if we want to divide the data using another column in our dataframe
  • data: name of the dataframe
  • errwidth (default None): width of the error bars
  • palette (optional): name of the colormap: http://matplotlib.org/examples/color/colormaps_reference.html

In [ ]:
sns.pointplot?

In [60]:
sns.set(font_scale=1.) #back to the default font size

#Create figure
plt.figure(figsize=(6,4))

#Make point plot
sns.pointplot(x="year", y="score", hue="treatment",data=df)

#Remove the top and right axis lines (spines)
sns.despine(trim=True)

plt.show()


Thursday we'll learn more about data visualization and other types of plots

  • Bar plot
  • Histogram/violin plot
  • Slope plot and parallel coordinates

6. Error debugging


In [61]:
Image("http://i.imgur.com/WRuJV6r.png")


Out[61]:

Errors

  • IndexError: List is too short
  • NameError: Misspelling, the variable/function/module is not defined
  • SyntaxError: You're missing parentheses, colons...
  • FileNotFoundError/IOError: The file doesn't exist
  • KeyError: In a dictionary, the key doesn't exist
  • IndentationError: You have a mixture of tabs and spaces
  • TypeError: The data structure doesn't allow that operation, or a variable is None instead of having a value

IndexError: List is too short


In [62]:
#we have a list
this_is_a_list = [1,2,3,4,5]

#this is the length
len_list = len(this_is_a_list)
print(len_list)

#we try to get the element at index 5 (len_list), it doesn't exist (index 4 = fifth and last element)
this_is_a_list[len_list]


5
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-62-9bc349a85fd7> in <module>()
      7 
      8 #we try to get the element at index 5 (len_list), it doesn't exist (index 4 = fifth and last element)
----> 9 this_is_a_list[len_list]

IndexError: list index out of range

NameError: Misspelling, the variable/function/module is not defined


In [63]:
this_is_a_list = [1,2,3,4,5]
#we try to add the fourth element to a variable that doesn't exist yet
sum_all = sum_all + this_is_a_list[3]


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-63-6509ed142267> in <module>()
      1 this_is_a_list = [1,2,3,4,5]
      2 #we try to add the fourth element to a variable that doesn't exist yet
----> 3 sum_all = sum_all + this_is_a_list[3]

NameError: name 'sum_all' is not defined

SyntaxError: You're missing parentheses, colons...


In [64]:
#missing parenthesis
sum([1,2,3]


  File "<ipython-input-64-afa9be58c273>", line 2
    sum([1,2,3]
               ^
SyntaxError: unexpected EOF while parsing

In [65]:
#you cannot tell the computer that 3 is 5; the left side of = must be a variable name
3 = 5


  File "<ipython-input-65-24b5f3b93612>", line 2
    3 = 5
         ^
SyntaxError: can't assign to literal

In [66]:
#Careful with this one: string comparison is case-sensitive, so this is False (not an error)
"A" == "a"


Out[66]:
False

FileNotFoundError/IOError: The file doesn't exist


In [67]:
open("non_existing_file","r")


---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-67-29e273de833d> in <module>()
----> 1 open("non_existing_file","r")

FileNotFoundError: [Errno 2] No such file or directory: 'non_existing_file'

KeyError: In a dictionary, the key doesn't exist


In [ ]:
#The mistake from earlier
d = dict({"Him": 0, "Her": 1})
d["You"]

IndentationError: You have a mixture of tabs and spaces

Jupyter/IPython notebooks largely prevent this (pressing Tab inserts spaces in the editor)
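
IndentationError also appears when a line is indented where it should not be; a minimal made-up example:

In [ ]:
x = 1
    y = 2  #IndentationError: unexpected indent (this line is indented for no reason)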

TypeError: The data structure doesn't allow that operation, or a variable is None instead of having a value


In [68]:
this_is_a_list = [0,1,2,3,4]
this_is_a_list + 8


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-68-c0a4b0d1fa13> in <module>()
      1 this_is_a_list = [0,1,2,3,4]
----> 2 this_is_a_list + 8

TypeError: can only concatenate list (not "int") to list

In [ ]:
this_is_a_list = [0,1,2,3,4]
this_is_a_list + [8]

AttributeError: The data structure doesn't have that method (e.g. calling .mean() on a list)


In [69]:
this_is_a_list = [0,1,2,3,4]
this_is_a_list.add(8)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-69-8ad22b5651d0> in <module>()
      1 this_is_a_list = [0,1,2,3,4]
----> 2 this_is_a_list.add(8)

AttributeError: 'list' object has no attribute 'add'

In-place algorithms

  • Some functions work "in place". I won't use them in the class, to keep things a bit simpler, but you may encounter them.
  • A function works in place if it modifies the data structure directly

In [ ]:
#THIS DOES NOT WORK IN PLACE
this_is_a_list = [4,3,2,1,0]
print(this_is_a_list)
print(sorted(this_is_a_list))
print(this_is_a_list)

In [ ]:
#THIS WORKS IN PLACE
this_is_a_list = [4,3,2,1,0]
print(this_is_a_list)
print(this_is_a_list.sort())
print(this_is_a_list)

In [ ]:
#So usually you would do
this_is_a_list = [4,3,2,1,0]
sorted_list = sorted(this_is_a_list)
print(this_is_a_list)
print(sorted_list)

In [ ]:
#But if you do that with a function that works in place, you may not get what you expect
this_is_a_list = [4,3,2,1,0]
sorted_list = this_is_a_list.sort()
print(this_is_a_list)
print(sorted_list)

In [ ]:
#Some functions that you will use work in place: APPEND to a list
this_is_a_list = [4,3,2,1,0]
this_is_a_list.append(3) #add a 3 to the end
print(this_is_a_list)

In [ ]:
#Some functions that you will use work in place: POP from a list
this_is_a_list = [4,3,2,1,0]
this_is_a_list.pop(-1) #remove last element
print(this_is_a_list)

7. Summary

We have

  • Python
  • External packages
    • numpy and scipy: math
    • pandas: spreadsheet
    • matplotlib (pylab): plot
    • statsmodels: regression (next time)

Python and packages have

  • Data structures: list, numpy arrays, pandas dataframes

That are composed of

  • Other data structures
  • Data types: int, floats, strings, dates

We manipulate the data structures with code

  • Operations
  • Functions (from python/packages)
  • If-else statements (next time)
  • Loops (next time)

Plan for next class

  • Python: Custom functions, If statement, for loops
  • Data (pandas): Merge files, group by attribute, tidy data
  • Data visualization