In [ ]:

    
import numpy as np

Pandas

Sometimes you want a spreadsheet.

Starting point



In [ ]:

    
some_dict = {"x":{"a":1,"b":2,"c":3},
             "y":{"a":4,"b":5,"c":6}}

Write a piece of code that prints the number 5, taken from some_dict.



In [ ]:

    
# Answer
some_dict["y"]["b"]

A pandas dataframe is a dictionary of dictionaries on steroids. (Note, its said "pan-dis" not "pandas" like the cute bears).

We import it in the cell below. It's almost always imported this way (import pandas as pd) just like numpy is almost always imported import numpy as np.



In [ ]:

    
import pandas as pd



In [ ]:

    
# Turn the dictionary into a data frame.
# Note that jupyter renders the data frame in pretty form.
df = pd.DataFrame(some_dict)
df

Getting values

A pandas dataframe organizes data with named columns and named rows



In [ ]:

    
# column names
print(df.columns)

# row names
print(df.index)

Predict

What will this cell print out when run?



In [ ]:

    
print(df.loc["a","y"])

Summarize

How does loc work?

loc accesses data by row name and column name: loc[row_name,column_name]

Predict

What will this cell print out when run?



In [ ]:

    
print(df.iloc[0,1])

Summarize

How does iloc work?

iloc access data by row number and column number: loc[row_number,column_number]

Modify

Change the cell below so it prints out all of column "y" (hint: you can slice)



In [ ]:

    
print(df.loc["a","y"])

# Answer
print(df.loc[:,"y"])

Predict what this will return



In [ ]:

    
df.loc[["a","b"],:]

Modify

Change this cell so it prints out column "x" and rows "b" and "c".



In [ ]:

    
df.loc[["a","b"],:]

#Answer
df.loc[["b","c"],"x"]

Summarize

How does loc slicing work for a dataframe?

loc slicing works using specified lists of named rows and columns (or ":" for all rows/columns).

Modify

Change this cell so it prints out only the first column.



In [ ]:

    
df.iloc[:,:]

#Answer
df.iloc[:,0]

Summarize

How does iloc slicing work for a dataframe?

iloc slicing works like list or numpy array slicing: 0, -1, : etc.

IMPORTANT CONCEPT

`loc`:

loc['x','y'] will always refer to the same data. This is true even if you delete or reorder rows and columns.
loc['x','y'] will not always refer to the same place in the data frame.

`iloc`:

iloc[i,j] will always refer to the same place in the data frame. iloc[0,0] is the top-left, iloc[n_row-1,n_col-1] is the bottom-right.
iloc[i,j] will not always refer to the same data. If you delete or reorder rows and columns, different data could be in each cell.

Confusing thing that happens with `loc` and `iloc`

Data frames often have numbers as labels for each row. This means loc takes a number for the row. But a loc row number is different from an iloc row number. I'll demonstrate:



In [ ]:

    
another_dict = {"col1":[1,2,3],"col2":[3,4,5]}
another_df = pd.DataFrame(another_dict)
another_df

Notice that the row names are 0, 1, and 2. (We did not specify row names in our input dictionary, so pandas gave these row names 0-2).

We can access rows using loc with these integer row labels:



In [ ]:

    
print(another_df.loc[1,"col1"])

Now, let's whack out the middle row.



In [ ]:

    
another_df = another_df.loc[[0,2],:]
another_df

You might expect you can now access second row with loc[1,"col1"], but you can't. The cell below will fail:



In [ ]:

    
another_df.loc[1,"col1"]

REMEMBER: loc[x,y] will always point to the same data. This means another_df.loc[1,"col"] can't point to anything because we deleted the data. The row labeled 1 is gone. The other two rows are still there:



In [ ]:

    
print(another_df.loc[0,"col1"])
print(another_df.loc[2,"col1"])

iloc, though, points to the location in the dataframe. This has now changed:



In [ ]:

    
print(another_df.iloc[0,0])
print(another_df.iloc[1,0])

The following line will now fail because there is no more 3rd row in the data frame:



In [ ]:

    
print(another_df.iloc[2,0])

After deleting the second row (labeld 1):

The first row (labeled 0) is accessed by loc[0,:] or iloc[0,:].
The second row (labeled 2) is accessed by loc[2,:] or iloc[1,:].

Confused yet? KEEP IN MIND:loc refers to data and iloc refers to location in the data frame.

Final note on accessing data: there are lots of ways (too many ways?) to access data in pandas DataFrames

The following calls all access the same values.

I'm putting these here in case you run across them in the wild, but I strongly recommend using loc and iloc exclusively, both for your own sanity and for readability...



In [ ]:

    
df = pd.DataFrame({"x":{"a":1,"b":2,"c":3},
                   "y":{"a":4,"b":5,"c":6}})

print(df)
print("---")
print(df.loc["a","y"])
print(df.iloc[0,1])
print(df["y"]["a"])
print(df["y"][0])
print(df.y[0])

Setting values

Predict

What does the data frame look like after this code is run?



In [ ]:

    
df = pd.DataFrame({"x":{"a":1,"b":2,"c":3},
                   "y":{"a":4,"b":5,"c":6}})
df.iloc[0,0] = 22
df.loc["a","y"] = 14
df

Summarize:

How can you set values in a data frame?

You set values using = and a specified location in the dataframe using loc or iloc.

Fancy setting

You can do all kinds of interesting setting moves:



In [ ]:

    
df = pd.DataFrame({"x":{"a":1,"b":2,"c":3},
                   "y":{"a":4,"b":5,"c":6}})

# Setting multiple locations to a single value
df.loc[("a","b"),("x","y")] = 5
df



In [ ]:

    
# Setting a square of locations (rows a,b, columns x,y) to 1 using 
# a 2x2 array
df.loc[("a","b"),("x","y")] = np.ones((2,2),dtype=np.int)
df

Creating and saving dataframes

Summarize:

What does the following do?



In [ ]:

    
# You can write to a csv file
df.to_csv("a-file.csv")

# You can read the csv back in
new_df = pd.read_csv("a-file.csv",index_col=0)
new_df

The program wrote a csv file called a-file.csv and then read it back in as a new data frame new_df.

Predict:

What does the data frame look like that comes out? (This is a very common way to generate a data frame.)



In [ ]:

    
names = {"harry":[],"jane":[],"sally":[]}
for i in range(5):
    names["harry"].append(i)
    names["jane"].append(i*5)
    names["sally"].append(-i)
    
df_names = pd.DataFrame(names)
df_names

Summary

You can get a data frame by:

reading in a spreadsheet by pd.read_csv (or pd.read_excel or many other options)
constructing one by pd.DataFrame(some_dictionary)

You can save out a data frame by:

df.to_csv (or df.to_excel or many other options)

Some Really Useful Stuff Presented In Non-Inductive Fashion

Sorting by a column



In [ ]:

    
df = pd.DataFrame({"x":{"a":1,"b":2,"c":3},
                   "y":{"a":4,"b":5,"c":6}})
sorted_df = df.sort_values("y",ascending=False)
sorted_df

Accessing elements using True/False arrays (masks)

(Note this only works for loc, not iloc)



In [ ]:

    
mask = np.array([True,False,True],dtype=np.bool)
df.loc[mask,"x"]

This lets you do some powerful stuff. Below, I am going to set all values in this data frame that are less than 4 to 0:



In [ ]:

    
# Copy the df_names data frame we made above... 
new_df_names = df_names.copy()
new_df_names



In [ ]:

    
mask = new_df_names < 4
new_df_names[mask] = 0
new_df_names

Final aside: you can do this sort of mask slicing on numpy arrays too:



In [ ]:

    
mask = np.arange(10) > 6
x = np.arange(10)
x[mask] = 42
x

Summary

You get access to data frames by import pandas as pd

You can create a data frame from a dictionary by

df = pd.DataFrame({"col1":[v1,v2,...],"col2":[vi,vj,...],...})

You can load a dataframe from a spreadsheet by:
```
df = pd.read_csv(csv_file)
```
You can access and set values in data frames using:
- loc[row_name,col_name] (refers to a piece of data)
- iloc[row_number,col_number] (refers to a location in the data frame)
- Both loc and iloc allow slicing.
You can do a lot with data frames (sorting, masking, etc.)

Implement

Download the following spreadsheet: https://github.com/harmsm/pythonic-science/raw/master/labs/02_regression/data/time_course_4.csv
Read the file as a pandas data frame
Find the mean of the obs column (hint, use np.mean)
Sort from decreasing to increasing obs
Find the standard deviation of the 5 lowest obs values (hint, us np.std)



In [ ]:

    
df = pd.read_csv("time_course_4.csv",index_col=0)
mean_value = np.mean(df.loc[:,"obs"])
print(mean_value)

# sorted!
sorted_df = df.sort_values("obs")

lowest_five = sorted_df.iloc[:5,0]
print(np.std(lowest_five))



In [ ]:

Pandas

Starting point

Getting values

A pandas dataframe organizes data with named columns and named rows

Predict

Summarize

Predict

Summarize

Modify

Predict what this will return

Modify

Summarize

Modify

Summarize

IMPORTANT CONCEPT

loc:

iloc:

Confusing thing that happens with loc and iloc

Final note on accessing data: there are lots of ways (too many ways?) to access data in pandas DataFrames

Setting values

Predict

Summarize:

Fancy setting

Creating and saving dataframes

Summarize:

Predict:

Summary

Some Really Useful Stuff Presented In Non-Inductive Fashion

Sorting by a column

Accessing elements using True/False arrays (masks)

Summary

Implement

`loc`:

`iloc`:

Confusing thing that happens with `loc` and `iloc`