In [ ]:
import numpy as np
In [ ]:
some_dict = {"x":{"a":1,"b":2,"c":3},
"y":{"a":4,"b":5,"c":6}}
Write a piece of code that prints the number 5, taken from some_dict.
In [ ]:
# Answer
some_dict["y"]["b"]
A pandas dataframe is a dictionary of dictionaries on steroids. (Note, its said "pan-dis" not "pandas" like the cute bears).
We import it in the cell below. It's almost always imported this way (import pandas as pd) just like numpy is almost always imported import numpy as np.
In [ ]:
import pandas as pd
In [ ]:
# Turn the dictionary into a data frame.
# Note that jupyter renders the data frame in pretty form.
df = pd.DataFrame(some_dict)
df
In [ ]:
# column names
print(df.columns)
# row names
print(df.index)
In [ ]:
print(df.loc["a","y"])
loc accesses data by row name and column name: loc[row_name,column_name]
In [ ]:
print(df.iloc[0,1])
iloc access data by row number and column number: loc[row_number,column_number]
In [ ]:
print(df.loc["a","y"])
# Answer
print(df.loc[:,"y"])
In [ ]:
df.loc[["a","b"],:]
In [ ]:
df.loc[["a","b"],:]
#Answer
df.loc[["b","c"],"x"]
loc slicing works using specified lists of named rows and columns (or ":" for all rows/columns).
In [ ]:
df.iloc[:,:]
#Answer
df.iloc[:,0]
iloc slicing works like list or numpy array slicing: 0, -1, : etc.
loc:loc['x','y'] will always refer to the same data. This
is true even if you delete or reorder rows and columns. loc['x','y'] will not always refer to the same place
in the data frame. iloc:iloc[i,j] will always refer to the same place in the data
frame. iloc[0,0] is the top-left, iloc[n_row-1,n_col-1] is
the bottom-right.iloc[i,j] will not always refer to the same data. If you
delete or reorder rows and columns, different data could be
in each cell.
In [ ]:
another_dict = {"col1":[1,2,3],"col2":[3,4,5]}
another_df = pd.DataFrame(another_dict)
another_df
Notice that the row names are 0, 1, and 2. (We did not specify row names in our input dictionary, so pandas gave these row names 0-2).
We can access rows using loc with these integer row labels:
In [ ]:
print(another_df.loc[1,"col1"])
Now, let's whack out the middle row.
In [ ]:
another_df = another_df.loc[[0,2],:]
another_df
You might expect you can now access second row with loc[1,"col1"], but you can't. The cell below will fail:
In [ ]:
another_df.loc[1,"col1"]
REMEMBER: loc[x,y] will always point to the same data. This means another_df.loc[1,"col"] can't point to anything because we deleted the data. The row labeled 1 is gone. The other two rows are still there:
In [ ]:
print(another_df.loc[0,"col1"])
print(another_df.loc[2,"col1"])
iloc, though, points to the location in the dataframe. This has now changed:
In [ ]:
print(another_df.iloc[0,0])
print(another_df.iloc[1,0])
The following line will now fail because there is no more 3rd row in the data frame:
In [ ]:
print(another_df.iloc[2,0])
After deleting the second row (labeld 1):
0) is accessed by loc[0,:] or iloc[0,:].2) is accessed by loc[2,:] or iloc[1,:]. Confused yet? KEEP IN MIND:loc refers to data and iloc refers to location in the data frame.
The following calls all access the same values.
I'm putting these here in case you run across them in the wild, but I strongly recommend using loc and iloc exclusively, both for your own sanity and for readability...
In [ ]:
df = pd.DataFrame({"x":{"a":1,"b":2,"c":3},
"y":{"a":4,"b":5,"c":6}})
print(df)
print("---")
print(df.loc["a","y"])
print(df.iloc[0,1])
print(df["y"]["a"])
print(df["y"][0])
print(df.y[0])
In [ ]:
df = pd.DataFrame({"x":{"a":1,"b":2,"c":3},
"y":{"a":4,"b":5,"c":6}})
df.iloc[0,0] = 22
df.loc["a","y"] = 14
df
You set values using = and a specified location in the dataframe using loc or iloc.
In [ ]:
df = pd.DataFrame({"x":{"a":1,"b":2,"c":3},
"y":{"a":4,"b":5,"c":6}})
# Setting multiple locations to a single value
df.loc[("a","b"),("x","y")] = 5
df
In [ ]:
# Setting a square of locations (rows a,b, columns x,y) to 1 using
# a 2x2 array
df.loc[("a","b"),("x","y")] = np.ones((2,2),dtype=np.int)
df
In [ ]:
# You can write to a csv file
df.to_csv("a-file.csv")
# You can read the csv back in
new_df = pd.read_csv("a-file.csv",index_col=0)
new_df
The program wrote a csv file called a-file.csv and then read it back in as a new data frame new_df.
In [ ]:
names = {"harry":[],"jane":[],"sally":[]}
for i in range(5):
names["harry"].append(i)
names["jane"].append(i*5)
names["sally"].append(-i)
df_names = pd.DataFrame(names)
df_names
In [ ]:
df = pd.DataFrame({"x":{"a":1,"b":2,"c":3},
"y":{"a":4,"b":5,"c":6}})
sorted_df = df.sort_values("y",ascending=False)
sorted_df
In [ ]:
mask = np.array([True,False,True],dtype=np.bool)
df.loc[mask,"x"]
This lets you do some powerful stuff. Below, I am going to set all values in this data frame that are less than 4 to 0:
In [ ]:
# Copy the df_names data frame we made above...
new_df_names = df_names.copy()
new_df_names
In [ ]:
mask = new_df_names < 4
new_df_names[mask] = 0
new_df_names
Final aside: you can do this sort of mask slicing on numpy arrays too:
In [ ]:
mask = np.arange(10) > 6
x = np.arange(10)
x[mask] = 42
x
You get access to data frames by import pandas as pd
You can create a data frame from a dictionary by
df = pd.DataFrame({"col1":[v1,v2,...],"col2":[vi,vj,...],...})
You can load a dataframe from a spreadsheet by:
df = pd.read_csv(csv_file)
You can access and set values in data frames using:
loc[row_name,col_name] (refers to a piece of data)iloc[row_number,col_number] (refers to a location in the
data frame)loc and iloc allow slicing. You can do a lot with data frames (sorting, masking, etc.)
obs column (hint, use np.mean)obsobs values (hint, us np.std)
In [ ]:
df = pd.read_csv("time_course_4.csv",index_col=0)
mean_value = np.mean(df.loc[:,"obs"])
print(mean_value)
# sorted!
sorted_df = df.sort_values("obs")
lowest_five = sorted_df.iloc[:5,0]
print(np.std(lowest_five))
In [ ]: