Working with data 2017. Class 1

Contact

Javier Garcia-Bernardo garcia@uva.nl

0. Structure

About Python
Data types, structures and code
Read csv files to dataframes
Basic operations with dataframes
My first plots
Debugging python
Summary



In [12]:

    
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell

#Print the plots in this screen
%matplotlib inline 

#Be able to plot images saved in the hard drive
from IPython.display import Image 

#Make the notebook wider
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

3. Reading files

CSV = comma separated values file

The problem with CSVs is that the values inside your fields can contain commas. One solution is to put quotes around all the fields: the computer then understands that commas inside quotes do not separate fields. Another solution is to separate the fields with tabs (\t).

TSV = tab separated values file

However most people (including me) call them csv.
An example of a csv:

person  year  score
1       2000  8
2       2000  1
3       2000  3
1       2010  7
2       1010  3

We use pandas to read them and save them in a data structure called a dataframe.

We are going to use the files in the data directory. You can go to the dashboard of jupyter notebook and check that the files are there.
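A minimal sketch of the quoting rule described above, using made-up data (the names and scores are hypothetical, not taken from the course files). The name field contains a comma, but because it is wrapped in quotes pandas keeps it as a single field. The same data separated by tabs needs no quotes.

```python
import io
import pandas as pd

# Hypothetical CSV text: the name field contains a comma, so it is quoted
csv_text = 'name,year,score\n"Garcia, Ana",2000,8\n"Smith, Bob",2010,7\n'

# io.StringIO lets pandas read a string as if it were a file
quoted = pd.read_csv(io.StringIO(csv_text))
print(quoted)

# The same data separated by tabs does not need quotes
tsv_text = "name\tyear\tscore\nGarcia, Ana\t2000\t8\nSmith, Bob\t2010\t7\n"
tabbed = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(tabbed)
```

Both calls should give a table with 2 rows and 3 columns.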



In [13]:

    
#First import required library
import pandas as pd



In [4]:

    
## Read excel
excelFrame = pd.read_excel("data/class1_test_excel.xlsx",sheet_name = 0)
#Print the first 5 lines
print(excelFrame.head(5))









    



   person  year  treatment  score
0       1  2000          1      4
1       2  2000          1      3
2       3  2000          2      6
3       4  2000          2      4
4       1  2005          1      8



In [7]:

    
#Jupyter notebooks display the result of the last line of a cell, so you can skip the print (and it looks nicer)
excelFrame.head(5)



In [8]:

    
## Read stata
stataFrame = pd.read_stata("data/class1_test_stata.dta")
stataFrame.head(5)



In [10]:

    
## Read csv
csvFrame = pd.read_csv("data/class1_test_csv.csv",
                       sep="\t",skiprows=4,na_values=["-9"])
csvFrame.head()

1 "a" 2000 2 "b" 3000



In [ ]:

We'll focus on CSVs because they are universal: you can read them with any text editor, and you can export your data as CSV from almost any program.

From stata to csv: outsheet id gender race read write science using outfile.csv , comma
From excel to csv: Save as -> csv (or text to use tabs)

3.1 pd.read_csv

Pandas function to read csv files.
A function is a piece of code that takes some input and returns some output.
In this case, it takes a file name as input and returns a DataFrame.

Other Examples

sorted() is also a function: it takes a list as input and returns the sorted list
sum() is also a function: it takes a list as input and returns the sum of its elements
.pop() is also a function: it takes as input a list and the index to delete, and returns the element at that index
np.mean() is another
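The functions listed above can be tried on a small list. This is only an illustrative sketch; the list "numbers" is made up.

```python
import numpy as np

numbers = [3, 1, 2]

print(sorted(numbers))   # a new sorted list: [1, 2, 3]
print(sum(numbers))      # the sum of the elements: 6
print(np.mean(numbers))  # the mean: 2.0

# .pop() removes the element at the given index and returns it
removed = numbers.pop(0)
print(removed)           # 3
print(numbers)           # [1, 2]
```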

Argument of a function

What is inside the parentheses are the arguments. They tell the function how to work.

Arguments of pd.read_csv()

path (required, first argument, no need to write path=): the name of the file. If the file is inside a folder you need to write the name of the folder too. For instance, if the file "example.csv" is inside the folder "data", you need to write "data/example.csv"
sep (default ","): "\t" for tab, "," for comma, ";" for semicolon, etc
header (default 0): 0 if the first line has column names. None if the first line has already data.
skiprows (default 0): number of lines to skip
skipfooter (default 0): number of lines to skip at the end
usecols (default None): what columns do you want to read? The default is all, but you can say [0,3,4] or ["column_x","column_y"]
na_values (default None): what other values should be considered missing (e.g. ["n.a.","-9","-999"])
thousands (default None): what is the thousands separator, usually there is None
decimal (default "."): Americans use "."; Europeans use ","; in science we use ".".
encoding (default "UTF-8"): "UTF-8" (great), "UTF-16", "ISO-8859-1" (W Europe), "SHIFT-JIS" (Japan), "ASCII" (US files)
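A small sketch of how several of these arguments combine. The file contents are hypothetical and are read from a string (with io.StringIO) instead of a real file, so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical file contents: 2 comment lines, ";" as separator,
# "," as decimal mark and "n.a." for missing values
text = (
    "comment line 1\n"
    "comment line 2\n"
    "person;year;score\n"
    "1;2000;8,5\n"
    "2;2000;n.a.\n"
    "3;2010;7,0\n"
)

example = pd.read_csv(
    io.StringIO(text),            # normally the path, e.g. "data/example.csv"
    sep=";",                      # fields are separated by semicolons
    skiprows=2,                   # skip the two comment lines
    na_values=["n.a."],           # treat "n.a." as missing
    decimal=",",                  # European decimal mark
    usecols=["person", "score"],  # read only these columns
)
print(example)
```

The column year is not read at all, and the missing score shows up as NaN.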



In [11]:

    
#To find more about the function use 
pd.read_csv?



In [12]:

    
#To find a lot more about the function use
pd.read_csv??

4. Basic functions on dataframes

We will be using a very small dataset (data/class1_test_csv.csv)

Uses TABS (\t) as separator: sep="\t"
Does not have an index_col: index_col=None
Has 4 rows at the start with comments: skiprows=4
Uses "-9" as missing value: na_values=["-9"]
The rest are the default options, so we don't need to write them



In [14]:

    
#First we read our data
import pandas as pd
df = pd.read_csv("data/class1_test_csv.csv",sep="\t",skiprows=4,na_values=["-9"])



In [15]:

    
#And print it
df

4.1 Descriptives



In [16]:

    
#Describe the data
#use df.describe? (outside a comment) to get help

df.describe()









    



/opt/anaconda/anaconda3/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)






    Out[16]:






  
    
      
          person         year  treatment      score
count  12.000000    12.000000  12.000000  11.000000
mean    2.416667  2005.000000   1.500000   5.818182
std     1.083625     4.264014   0.522233   1.834022
min     1.000000  2000.000000   1.000000   3.000000
25%     1.750000  2000.000000   1.000000        NaN
50%     2.500000  2005.000000   1.500000        NaN
75%     3.000000  2010.000000   2.000000        NaN
max     4.000000  2010.000000   2.000000   9.000000

You can calculate the mean with df.mean() (or the median, std, etc)



In [19]:

    
## Calculate mean by columns
## axis is a very common argument. The computer by default gets the mean by column 
#df.mean() === df.mean(axis=0)
df.mean(axis=0)









    Out[19]:





person          2.416667
year         2005.000000
treatment       1.500000
score           5.818182
dtype: float64



In [20]:

    
df.mean(axis=1) #By rows









    Out[20]:





0     501.500000
1     501.500000
2     502.750000
3     502.500000
4     503.750000
5     503.750000
6     503.750000
7     504.000000
8     505.250000
9     505.000000
10    505.250000
11    671.666667
dtype: float64
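A short sketch of the axis argument on a made-up table. It also shows that missing values are skipped by default, which is why row 11 in the output above (671.67) is the mean of only three values.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column table with one missing value
small = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, np.nan]})

by_column = small.mean(axis=0)  # one mean per column
by_row = small.mean(axis=1)     # one mean per row

print(by_column)
print(by_row)
```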

4.2 Keeping columns



In [18]:

    
df.head()



In [19]:

    
## Keep ONE column
df["treatment"]









    Out[19]:





0     1
1     1
2     2
3     2
4     1
5     1
6     2
7     2
8     1
9     1
10    2
11    2
Name: treatment, dtype: int64



In [22]:

    
## Keep SEVERAL columns
df[["year","treatment"]]

4.3 Keeping rows (slicing like a list). df.iloc[slice] (not too useful)



In [21]:

    
df.iloc[:5]

4.4 Keeping rows (filtering like a numpy array). df[filter] (very useful)



In [23]:

    
df



In [49]:

    
df["year"].isin([2000,2010])









    Out[49]:





0      True
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8      True
9      True
10     True
Name: year, dtype: bool



In [53]:

    
cond = df["year"].isin( [2000,2010] )
cond









    Out[53]:





0      True
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8      True
9      True
10     True
Name: year, dtype: bool



In [54]:

    
df.loc[cond]



In [59]:

    
df.loc[df["year"] == 2000]



In [ ]:

    
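#Parentheses () are used to call functions and methods:
#mean(), list(), sorted(), .isin()

#Square brackets [] are used to select data and to write lists:
#df.loc[], df["year"], ["A","b","c"]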



In [26]:

    
df_2000 = df[df["year"] == 2000]



In [23]:

    
#For example we want to keep the rows with the year 2000
#We can create the condition
condition = df["year"] == 2000
print(condition)









    



0      True
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
Name: year, dtype: bool



In [68]:

    
df.loc[df["year"] == 2000 ,  ["score","events"]  ]



In [26]:

    
#And then filter. In a numpy array you could do np_array[condition]. Here you do df[condition]
condition = df["year"] == 2000
df[condition] # df[df["year"]==2000]



In [30]:

    
df[df["year"].isin([2000,2010])]



In [27]:

    
#If we want to keep the rows where the year is any of several values
condition = df["year"].isin([2000,2010])
df[condition]
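A sketch of how conditions can be combined, on made-up data with the same column names as the class file. isin() checks one column against several values; two different conditions can be joined with & (and) or | (or), each inside its own parentheses.

```python
import pandas as pd

# Hypothetical data with the same column names as the class file
toy = pd.DataFrame({
    "year": [2000, 2000, 2005, 2010],
    "treatment": [1, 2, 1, 2],
    "score": [4.0, 6.0, 8.0, 6.0],
})

# Rows where year is 2000 OR 2010
in_years = toy["year"].isin([2000, 2010])

# Rows where year is 2000 AND treatment is 2
# (each condition needs its own parentheses)
both = (toy["year"] == 2000) & (toy["treatment"] == 2)

print(toy[in_years])
print(toy[both])
```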



In [30]:

    
df_treat_and_year = df[["treatment","year"]]



In [34]:

    
df_treat_and_year.head()

Keeping rows and columns (very useful) df.loc[condition,[columns]]



In [31]:

    
x = [1,2,3]



In [ ]:

    
df.loc[df["year"]==2000, ["year","treatment"]]



In [35]:

    
#Keeping the columns year and treatment for the year 2000
condition = df["year"] == 2000
df.loc[condition,["year","treatment"]]



In [37]:

    
#df[["year","treatment"]]
df.loc[:,["year","treatment"]].head()



In [69]:

    
df



In [71]:

    
df["test"] = df["score"]/2
df









    



/opt/anaconda/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':






    Out[71]:






  
    
      
    person  year  treatment  score  score_sq  happiness  events  test
0        1  2000          1    4.0      16.0          1       1   2.0
1        2  2000          1    3.0       9.0          2       2   1.5
2        3  2000          2    6.0      36.0          3       3   3.0
3        4  2000          2    4.0      16.0          4       4   2.0
4        1  2005          1    8.0      64.0          5       5   4.0
5        2  2005          1    7.0      49.0          6       6   3.5
6        3  2005          2    5.0      25.0          7       7   2.5
7        4  2005          2    5.0      25.0          8       8   2.5
8        1  2010          1    9.0      81.0          9       9   4.5
9        2  2010          1    7.0      49.0         10      10   3.5
10       3  2010          2    6.0      36.0         11      11   3.0


In [37]:

    
df["score_sq"] = df["score"]**2
df

Creating new variables



In [38]:

    
df["happiness"] = [1,2,3,4,5,6,7,8,9,10,11,12]
df["events"] = [1,2,3,4,5,6,7,8,9,10,11,12]



In [40]:

    
df



In [41]:

    
#create new rows
df.loc[12] = [2,2017,2,9.,10,23]

Sorting the dataframe



In [43]:

    
df.sort_values(by=["treatment","score"],ascending=[True,False])

Dropping rows with missing values



In [46]:

    
df_no_nan = df.dropna(subset=["score"])
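Both sort_values() and dropna() return a new dataframe and leave the original untouched, so the result has to be saved in a variable. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "treatment": [2, 1, 1, 2],
    "score": [6.0, 3.0, 8.0, np.nan],
})

# Sort by treatment (ascending) and, within treatment, by score (descending)
toy_sorted = toy.sort_values(by=["treatment", "score"], ascending=[True, False])

# Drop the rows where score is missing
toy_no_nan = toy.dropna(subset=["score"])

print(toy_sorted)
print(toy_no_nan)
print(len(toy))  # still 4: neither call changed the original
```

Rows with a missing score are placed last within their group when sorting.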

Checking and modifying the name of the columns



In [47]:

    
df.columns









    Out[47]:





Index(['person', 'year', 'treatment', 'score', 'happiness', 'events'], dtype='object')



In [48]:

    
df.columns = ["ID","year","treatment","score","happiness","events"]
df.head()
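Assigning to df.columns replaces all the names at once, so the list must have one name per column. To change only some names, an alternative is rename(), sketched here on a made-up table:

```python
import pandas as pd

toy = pd.DataFrame({"person": [1, 2], "year": [2000, 2000]})

# Rename only "person"; the other columns keep their names
renamed = toy.rename(columns={"person": "ID"})

print(renamed.columns.tolist())
print(toy.columns.tolist())  # rename returns a new dataframe; toy is unchanged
```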

5. My first plots (scatter, line, box plot, point plot)

Finally!

We will use the libraries matplotlib and seaborn.
seaborn makes beautiful plots. matplotlib is easier (and it's the base of seaborn).



In [49]:

    
#this tells the computer to plot everything here
%matplotlib inline 

#importing this library makes the default colors be beautiful
import seaborn as sns 

#this imports matplotlib
import matplotlib.pyplot as plt

5.1 Basic commands



In [ ]:

    
#create a figure with a size (measured in inches!)
plt.figure(figsize=(4,3)) 

#add a title to the figure
plt.title("Title")

#add a label in the x and y axis
plt.xlabel("X axis label")
plt.ylabel("Y axis label",fontsize=14) #we can add the font size to all the functions where we pass text

#add a legend
plt.legend()

#use log scale in the x and y axis
plt.xscale("log")
plt.yscale("log")

#trim the x axis between 1 and 100 (to make it look like you want, it depends on your specific values)
plt.xlim((1,100))

#add grid lines at the minor ticks with transparency 50%
plt.grid(which='minor',alpha=0.5)

#take out the grid
plt.grid(False)

#save the figure (I CAN'T STRESS ENOUGH: SAVE AS PDF FOR ANY PAPER YOU WRITE!)
plt.savefig("plots/name_of_figure.pdf") 

#show the figure (not required in jupyter notebooks but still good to write it)
plt.show()
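A sketch that puts several of the commands above together on made-up data, so that there is something to draw and something to show in the legend. It saves the figure to memory only to keep the example self-contained; the "Agg" line is an assumption for running outside a notebook and is not needed in Jupyter.

```python
import io
import matplotlib
matplotlib.use("Agg")  # draw without opening a window (not needed in a notebook)
import matplotlib.pyplot as plt

# Hypothetical data
x = [1, 10, 100]
y = [2, 20, 200]

plt.figure(figsize=(4, 3))
plt.plot(x, y, marker="o", label="made-up data")  # the label is used by the legend
plt.title("Title")
plt.xlabel("X axis label")
plt.ylabel("Y axis label", fontsize=14)
plt.xscale("log")
plt.yscale("log")
plt.legend()

# Save to memory here; in practice pass a file name such as "plots/name_of_figure.pdf"
buffer = io.BytesIO()
plt.savefig(buffer, format="pdf")
plt.close()
```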

5.2 Scatter plot: plt.scatter(arguments)

Used to plot two quantitative variables against each other. We can add one extra quant. variable if we use bubble size and one qualitative if we use bubble color.

Important arguments:

x: x values (an array)
y: y values (an array)
c (optional, default = "blue"): color (can be an array, or a string such as "blue")
s (optional, default = 20): size (can be an array, or a number)
alpha (optional, default = 1): transparency
edgecolor (optional, default = "black"): "none" for none
cmap: which colormap to use: http://matplotlib.org/examples/color/colormaps_reference.html
label: label of the plot for the legend



In [50]:

    
plt.scatter?



In [51]:

    
df.head()



In [39]:

    
df = df.dropna()



In [40]:

    
x = df["score"] #x values
y = df["happiness"] #y values
c = df["treatment"] #color
s = df["events"] #size

#to convert the pandas column into a np.array you need to write "values". 
#It's not needed to plot but it is to print it nicely
print(x.values) 
print(y.values)
print(c.values) 
print(s.values) 

#Create a figure
plt.figure(figsize=(6,4)) 

#Make the scatter plot, using treatment as color, 20 times the number of events as size of the marker,
#no edgecolor and a Red-Yellow-Blue colormap
plt.scatter(x,y,c=c,s=s*20,edgecolor="none",cmap="RdYlBu")
plt.xlabel("Score",fontsize=12)
plt.ylabel("Happiness",fontsize=12)
plt.show()









    



[ 4.  3.  6.  4.  8.  7.  5.  5.  9.  7.  6.]
[ 1  2  3  4  5  6  7  8  9 10 11]
[1 1 2 2 1 1 2 2 1 1 2]
[ 1  2  3  4  5  6  7  8  9 10 11]



In [42]:

    
plt.scatter(df["score"],df["happiness"])









    Out[42]:





<matplotlib.collections.PathCollection at 0x7ff72aab4128>

5.3 Line plot

Used to plot two quantitative variables against each other. We can add one qualitative if we use several lines. Importantly, the x variable must be ordered.

Important arguments:

x: x values (an array)
y: y values (an array)
color (optional, default = "blue"): color (a string, hex string, or rgb tuple e.g. (1,0,0) = red)
marker (optional, default = "none"): which marker symbol to use
ms (optional, default = 6): marker size (a number)
markeredgecolor (optional, default = same as color): color of the edge of the marker
alpha (optional, default = 1): transparency
label (optional, default =""): label of the plot for the legend



In [ ]:

    
plt.plot?



In [58]:

    
#Data of person 1,2,3 and 4
df_1 = df.loc[df["ID"]==1,["year","score"]]
df_2 = df.loc[df["ID"]==2,["year","score"]]
df_3 = df.loc[df["ID"]==3,["year","score"]]
df_4 = df.loc[df["ID"]==4,["year","score"]]

#let's use the default matplotlib colors (instead of the seaborn colors)
sns.reset_orig() #sns.set() to bring back the seaborn colors

#create plot
plt.figure(figsize=(6,4)) 


#plot the score for all years for the people
plt.plot(df_1["year"],df_1["score"],marker="o",color="#0D4F8B",linewidth=2,label="Treatment 1")
plt.plot(df_2["year"],df_2["score"],marker="o",color="#0D4F8B",linewidth=2,label="") #no legend for this guy
plt.plot(df_3["year"],df_3["score"],marker="o",color="#e60000",linewidth=2,label="Treatment 2")
plt.plot(df_4["year"],df_4["score"],marker="o",color="#e60000",linewidth=2,label="") #no legend for this guy

#Make a legend. loc=0 ("best") lets matplotlib choose the location; other values of "loc" fix the corner
plt.legend(loc=0) 

#Labels
plt.xlabel("Year",fontsize=14)
plt.ylabel("Score",fontsize=14)

#Add some more space so the markers are not cut
plt.xlim(1999.8,2010.2)
plt.ylim(2.8,9.2)


plt.show()

5.4 Box plot (using seaborn)

Used to plot the ranges of one quantitative variable for different categories. We can add another qualitative variable splitting into colors (hue).

Important arguments:

x: x values (an array)
y: y values (an array)
hue (optional): if we want to divide the data using another column in our dataframe
data: name of the dataframe
orient (optional): "v" | "h" for vertical or horizontal
width (optional): width of the boxes
palette (optional): name of the colormap: http://matplotlib.org/examples/color/colormaps_reference.html



In [ ]:

    
sns.boxplot?



In [5]:

    
import matplotlib.pyplot as plt



In [11]:

    
sns.set(font_scale=1.2) #20% larger fonts

#Create figure
plt.figure(figsize=(6,4))

#Make box plot
sns.boxplot(x="treatment", y="score",hue='year', data=df,palette="Blues")

#Remove the top and right axis lines
sns.despine(trim=True)
plt.savefig("plots/d.pdf")
plt.show()

But is this the most appropriate plot?

What do we want to visualize?
What can be more effective for it?
Which variables are continuous (and thus can be connected by a line?)

5.5 Line plot (using seaborn, where it is called point plot)

A point plot represents an estimate of central tendency for a numeric variable by the position of scatter plot points and provides some indication of the uncertainty around that estimate using error bars.

Important arguments:

x: x values (an array)
y: y values (an array)
hue (optional): if we want to divide the data using another column in our dataframe
data: name of the dataframe
errwidth (default None): width of the error bars
palette (optional): name of the colormap: http://matplotlib.org/examples/color/colormaps_reference.html



In [ ]:

    
sns.pointplot?



In [60]:

    
sns.set(font_scale=1.) #back to the default font size

#Create figure
plt.figure(figsize=(6,4))

#Make point plot
sns.pointplot(x="year", y="score", hue="treatment",data=df)

#Remove the top and right axis lines
sns.despine(trim=True)

plt.show()

Thursday we'll learn more about data visualization and other types of plots

Bar plot
Histogram/violin plot
Slope plot and parallel coordinates

6. Error debugging



In [61]:

    
Image("http://i.imgur.com/WRuJV6r.png")









    Out[61]:

Errors

IndexError: List is too short
NameError: Misspelling, the variable/function/module is not defined
SyntaxError: You're missing parentheses, colons...
FileNotFoundError/IOError: The file doesn't exist
KeyError: In a dictionary, the key doesn't exist
IndentationError: You have a mixture of tabs and spaces
TypeError: The data structure doesn't allow for that operation, a variable is None instead of having a value
AttributeError: The data structure doesn't have the method (e.g. calling mean() in a list)

IndexError: List is too short



In [62]:

    
#we have a list
this_is_a_list = [1,2,3,4,5]

#this is the length
len_list = len(this_is_a_list)
print(len_list)

#we try to get the element at index 5, but it doesn't exist (the last one is index 4 = fifth element)
this_is_a_list[len_list]









    



5






    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-62-9bc349a85fd7> in <module>()
      7 
      8 #we try to get the element at index 5, but it doesn't exist (the last one is index 4 = fifth element)
----> 9 this_is_a_list[len_list]

IndexError: list index out of range
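A sketch of two ways to get the last element without the error. Indexes start at 0, so the last valid index is the length minus one.

```python
this_is_a_list = [1, 2, 3, 4, 5]

# Indexes start at 0, so the last valid index is len(list) - 1
last_index = len(this_is_a_list) - 1
print(this_is_a_list[last_index])

# A negative index counts from the end, so -1 is also the last element
print(this_is_a_list[-1])
```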

NameError: Misspelling, the variable/function/module is not defined



In [63]:

    
this_is_a_list = [1,2,3,4,5]
#we try to sum the fourth element to a variable
sum_all = sum_all + this_is_a_list[3]









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-63-6509ed142267> in <module>()
      1 this_is_a_list = [1,2,3,4,5]
      2 #we try to sum the fourth element to a variable
----> 3 sum_all = sum_all + this_is_a_list[3]

NameError: name 'sum_all' is not defined
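The error goes away if the variable is created before it is used on the right-hand side. A minimal sketch:

```python
this_is_a_list = [1, 2, 3, 4, 5]

# Create the variable first...
sum_all = 0
# ...and only then use it on the right-hand side
sum_all = sum_all + this_is_a_list[3]
print(sum_all)
```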

SyntaxError: You're missing parentheses, colons...



In [64]:

    
#missing parenthesis
sum([1,2,3]









    



  File "<ipython-input-64-afa9be58c273>", line 2
    sum([1,2,3]
               ^
SyntaxError: unexpected EOF while parsing



In [65]:

    
#you cannot tell that 3 is 5, the computer is smarter than that
3 = 5









    



  File "<ipython-input-65-24b5f3b93612>", line 2
    3 = 5
         ^
SyntaxError: can't assign to literal



In [66]:

    
#Careful with this one
"A" == "a"









    Out[66]:





False

FileNotFoundError/IOError: The file doesn't exist



In [67]:

    
open("non_existing_file","r")









    



---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-67-29e273de833d> in <module>()
----> 1 open("non_existing_file","r")

FileNotFoundError: [Errno 2] No such file or directory: 'non_existing_file'

KeyError: In a dictionary, the key doesn't exist



In [ ]:

    
#The mistake from earlier
d = dict({"Him": 0, "Her": 1})
d["You"]
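Two ways to avoid the KeyError, sketched with the same dictionary: check whether the key exists with "in", or use .get(), which returns a default value when the key is missing (the default -1 here is an arbitrary choice).

```python
d = {"Him": 0, "Her": 1}

# Check whether the key exists before using it
print("You" in d)

# .get() returns a default value instead of raising a KeyError
value = d.get("You", -1)
print(value)
```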

IndentationError: You have a mixture of tabs and spaces

ipython notebooks handle this

TypeError: The data structure doesn't allow for that operation, a variable is None instead of having a value



In [68]:

    
this_is_a_list = [0,1,2,3,4]
this_is_a_list + 8









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-68-c0a4b0d1fa13> in <module>()
      1 this_is_a_list = [0,1,2,3,4]
----> 2 this_is_a_list + 8

TypeError: can only concatenate list (not "int") to list



In [ ]:

    
this_is_a_list = [0,1,2,3,4]
this_is_a_list + [8]

AttributeError: The data structure doesn't have the method (e.g. calling mean() in a list)



In [69]:

    
this_is_a_list = [0,1,2,3,4]
this_is_a_list.add(8)









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-69-8ad22b5651d0> in <module>()
      1 this_is_a_list = [0,1,2,3,4]
----> 2 this_is_a_list.add(8)

AttributeError: 'list' object has no attribute 'add'

In-place algorithms

Some functions work "in place". I won't use them much in class, to keep things a bit simpler, but you may encounter them.
A function works in place if it modifies the data structure directly



In [ ]:

    
#THIS DOES NOT WORK IN PLACE
this_is_a_list = [4,3,2,1,0]
print(this_is_a_list)
print(sorted(this_is_a_list))
print(this_is_a_list)



In [ ]:

    
#THIS WORKS IN PLACE
this_is_a_list = [4,3,2,1,0]
print(this_is_a_list)
print(this_is_a_list.sort())
print(this_is_a_list)



In [ ]:

    
#So usually you would do
this_is_a_list = [4,3,2,1,0]
sorted_list = sorted(this_is_a_list)
print(this_is_a_list)
print(sorted_list)



In [ ]:

    
#But if you do that with a function that works in place you may not get what you expect
this_is_a_list = [4,3,2,1,0]
sorted_list = this_is_a_list.sort()
print(this_is_a_list)
print(sorted_list)



In [ ]:

    
#Some functions that you will use and work in place: APPEND to list
this_is_a_list = [4,3,2,1,0]
this_is_a_list.append(3) #add a 3 to the end
print(this_is_a_list)



In [ ]:

    
#Some functions that you will use and work in place: POP to list
this_is_a_list = [4,3,2,1,0]
this_is_a_list.pop(-1) #remove last element
print(this_is_a_list)

7. Summary

We have

Python
External packages
- numpy and scipy: math
- pandas: spreadsheet
- matplotlib (pylab): plot
- statsmodels: regression (next time)

Python and packages have

Data structures: list, numpy arrays, pandas dataframes

That are composed of

Other data structures
Data types: int, floats, strings, dates

We manipulate the data structures with code

Operations
Functions (from python/packages)
If-else statements (next time)
Loops (next time)

Plan for next class

Python: Custom functions, If statement, for loops
Data (pandas): Merge files, group by attribute, tidy data
Data visualization

	person	year	treatment	score
0	1	2000	1	4.0
1	2	2000	1	3.0
2	3	2000	2	6.0
3	4	2000	2	4.0
4	1	2005	1	8.0
5	2	2005	1	7.0
6	3	2005	2	5.0
7	4	2005	2	5.0
8	1	2010	1	9.0
9	2	2010	1	7.0
10	3	2010	2	6.0
11	3	2010	2	NaN


	person	year	treatment	score	score_sq	happiness	events
0	1	2000	1	4.0	16.0	1	1
1	2	2000	1	3.0	9.0	2	2
2	3	2000	2	6.0	36.0	3	3
3	4	2000	2	4.0	16.0	4	4
8	1	2010	1	9.0	81.0	9	9
9	2	2010	1	7.0	49.0	10	10
10	3	2010	2	6.0	36.0	11	11

	person	year	treatment	score	happiness	events
8	1.0	2010.0	1.0	9.0	9.0	9.0
4	1.0	2005.0	1.0	8.0	5.0	5.0
5	2.0	2005.0	1.0	7.0	6.0	6.0
9	2.0	2010.0	1.0	7.0	10.0	10.0
0	1.0	2000.0	1.0	4.0	1.0	1.0
1	2.0	2000.0	1.0	3.0	2.0	2.0
12	2.0	2017.0	2.0	9.0	10.0	23.0
2	3.0	2000.0	2.0	6.0	3.0	3.0
10	3.0	2010.0	2.0	6.0	11.0	11.0
6	3.0	2005.0	2.0	5.0	7.0	7.0
7	4.0	2005.0	2.0	5.0	8.0	8.0
3	4.0	2000.0	2.0	4.0	4.0	4.0
11	3.0	2010.0	2.0	NaN	12.0	12.0