In [ ]:
import numpy as np
In [ ]:
some_dict = {"x":{"a":1,"b":2,"c":3},
"y":{"a":4,"b":5,"c":6}}
Write a piece of code that prints the number 5, taken from some_dict.
In [ ]:
# Answer
print(some_dict["y"]["b"])
A pandas dataframe is a dictionary of dictionaries on steroids. (Note: it's said "pan-dis", not "pandas" like the cute bears.)
We import it in the cell below. It's almost always imported this way (import pandas as pd), just like numpy is almost always imported as import numpy as np.
In [ ]:
import pandas as pd
In [ ]:
# Turn the dictionary into a data frame.
# Note that jupyter renders the data frame in pretty form.
df = pd.DataFrame(some_dict)
df
In [ ]:
# column names
print(df.columns)
# row names
print(df.index)
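Here is a minimal sketch of how the nested dictionary maps onto the data frame, reusing the same dictionary as above. The variable names column_names and row_names are just for illustration.

```python
import pandas as pd

some_dict = {"x": {"a": 1, "b": 2, "c": 3},
             "y": {"a": 4, "b": 5, "c": 6}}
df = pd.DataFrame(some_dict)

# Outer dictionary keys become the column names
column_names = list(df.columns)
# Inner dictionary keys become the row names (the index)
row_names = list(df.index)

print(column_names)
print(row_names)
# The shape is (number of rows, number of columns)
print(df.shape)
```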
In [ ]:
print(df.loc["a","y"])
In [ ]:
print(df.iloc[0,1])
In [ ]:
df.loc[["a","b"],:]
loc slicing works using specified lists of named rows and columns (or ":" for all rows/columns).
In [ ]:
df.iloc[:,:]
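A small sketch comparing the two kinds of slicing, assuming the same small data frame used throughout. The names by_label, by_position, and label_range are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"x": {"a": 1, "b": 2, "c": 3},
                   "y": {"a": 4, "b": 5, "c": 6}})

# loc: lists of labels, or ":" for everything
by_label = df.loc[["a", "b"], :]

# iloc: integer positions, sliced like a numpy array (the end is excluded)
by_position = df.iloc[0:2, :]

# Both select rows "a" and "b" here
print(by_label.equals(by_position))

# loc also accepts label ranges; unlike iloc, the end label is included
label_range = df.loc["a":"b", "y"]
print(list(label_range))
```

One thing to watch: a label range in loc includes its end label, while an integer range in iloc stops just before its end position.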
loc:
loc['x','y'] will always refer to the same data. This is true even if you delete or reorder rows and columns.
loc['x','y'] will not always refer to the same place in the data frame.
iloc:
iloc[i,j] will always refer to the same place in the data frame. iloc[0,0] is the top-left, iloc[n_row-1,n_col-1] is the bottom-right.
iloc[i,j] will not always refer to the same data. If you delete or reorder rows and columns, different data could be in each cell.
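To make the difference concrete, here is a rough sketch that reorders the rows of a small data frame and then looks up the same cell both ways. The name reordered is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": {"a": 1, "b": 2, "c": 3},
                   "y": {"a": 4, "b": 5, "c": 6}})

# Reorder the rows so "c" comes first
reordered = df.loc[["c", "a", "b"], :]

# loc follows the data: row "a", column "y" holds the same value in both
print(df.loc["a", "y"], reordered.loc["a", "y"])

# iloc follows the position: the top-left cell now holds different data
print(df.iloc[0, 0], reordered.iloc[0, 0])
```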
In [ ]:
another_dict = {"col1":[1,2,3],"col2":[3,4,5]}
another_df = pd.DataFrame(another_dict)
another_df
Notice that the row names are 0, 1, and 2. (We did not specify row names in our input dictionary, so pandas gave these rows the names 0-2.)
We can access rows using loc with these integer row labels:
In [ ]:
print(another_df.loc[1,"col1"])
Now, let's whack out the middle row.
In [ ]:
another_df = another_df.loc[[0,2],:]
another_df
You might expect that you can now access the second row with loc[1,"col1"], but you can't. The cell below will fail with a KeyError:
In [ ]:
another_df.loc[1,"col1"]
REMEMBER: loc[x,y] will always point to the same data. This means another_df.loc[1,"col1"] can't point to anything because we deleted the data. The row labeled 1 is gone. The other two rows are still there:
In [ ]:
print(another_df.loc[0,"col1"])
print(another_df.loc[2,"col1"])
iloc, though, points to the location in the data frame. This has now changed:
In [ ]:
print(another_df.iloc[0,0])
print(another_df.iloc[1,0])
The following line will now fail with an IndexError because there is no longer a third row in the data frame:
In [ ]:
print(another_df.iloc[2,0])
After deleting the second row (labeled 1):
The first row (labeled 0) is accessed by loc[0,:] or iloc[0,:].
The second row (labeled 2) is accessed by loc[2,:] or iloc[1,:].
Confused yet? KEEP IN MIND: loc refers to data and iloc refers to location in the data frame.
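If you lose track of which labels survive, the index itself can tell you. This is a minimal sketch, rebuilding the same two-row data frame, that checks for a label and translates it into its current position. The name position is just for illustration.

```python
import pandas as pd

another_df = pd.DataFrame({"col1": [1, 2, 3], "col2": [3, 4, 5]})
another_df = another_df.loc[[0, 2], :]

# The surviving row labels
print(list(another_df.index))

# Check whether a label exists before using it with loc
print(1 in another_df.index)
print(2 in another_df.index)

# Translate a row label into its current position
position = another_df.index.get_loc(2)
print(position)

# Both indexers now reach the same value
print(another_df.loc[2, "col1"], another_df.iloc[position, 0])
```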
The following calls all access the same values. I'm putting these here in case you run across them in the wild, but I strongly recommend using loc and iloc exclusively, both for your own sanity and for readability...
In [ ]:
df = pd.DataFrame({"x":{"a":1,"b":2,"c":3},
"y":{"a":4,"b":5,"c":6}})
print(df)
print("---")
print(df.loc["a","y"])
print(df.iloc[0,1])
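print(df["y"]["a"])
# The next two use an integer position on a column with named rows.
# Recent pandas versions deprecate this and newer ones may fail; prefer iloc.
print(df["y"][0])
print(df.y[0])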
In [ ]:
df = pd.DataFrame({"x":{"a":1,"b":2,"c":3},
"y":{"a":4,"b":5,"c":6}})
df.iloc[0,0] = 22
df.loc["a","y"] = 14
df
In [ ]:
df = pd.DataFrame({"x":{"a":1,"b":2,"c":3},
"y":{"a":4,"b":5,"c":6}})
# Setting multiple locations to a single value
df.loc[["a","b"],["x","y"]] = 5
df
In [ ]:
# Setting a square of locations (rows a,b, columns x,y) to 1 using
# a 2x2 array
df.loc[["a","b"],["x","y"]] = np.ones((2,2),dtype=int)
df
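A short sketch of the two ways of writing to a block of cells, assuming the same small data frame. The array named block is made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": {"a": 1, "b": 2, "c": 3},
                   "y": {"a": 4, "b": 5, "c": 6}})

# A single value is copied into every selected cell
df.loc[["a", "b"], ["x", "y"]] = 0

# An array must match the shape of the selection (2 rows x 2 columns here)
block = np.array([[10, 20],
                  [30, 40]])
df.iloc[0:2, 0:2] = block

print(df)
```

Row c is outside both selections, so it keeps its original values.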
In [ ]:
# You can write to a csv file
df.to_csv("a-file.csv")
# You can read the csv back in
new_df = pd.read_csv("a-file.csv",index_col=0)
new_df
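Here is a rough sketch of what index_col=0 is doing in the cell above. To avoid touching the disk it writes to an in-memory io.StringIO buffer rather than a-file.csv. That swap is my assumption for the example, not something the cell above does.

```python
import io
import pandas as pd

df = pd.DataFrame({"x": {"a": 1, "b": 2, "c": 3},
                   "y": {"a": 4, "b": 5, "c": 6}})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer)

# With index_col=0, the first csv column is used as the row names
buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)

# Without it, the row names come back as an ordinary extra column
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.index))
print(without_index.shape)
```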
In [ ]:
names = {"harry":[],"jane":[],"sally":[]}
for i in range(5):
    names["harry"].append(i)
    names["jane"].append(i*5)
    names["sally"].append(-i)
df_names = pd.DataFrame(names)
df_names
In [ ]:
df = pd.DataFrame({"x":{"a":1,"b":2,"c":3},
"y":{"a":4,"b":5,"c":6}})
sorted_df = df.sort_values("y",ascending=False)
sorted_df
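A minimal sketch of what sorting does to row names and positions, using the same data frame and the same sort_values call as above.

```python
import pandas as pd

df = pd.DataFrame({"x": {"a": 1, "b": 2, "c": 3},
                   "y": {"a": 4, "b": 5, "c": 6}})

sorted_df = df.sort_values("y", ascending=False)

# The row names travel with their rows
print(list(sorted_df.index))

# The original data frame is left untouched
print(list(df.index))

# loc still finds the same data; iloc now sees a different top row
print(sorted_df.loc["a", "y"], sorted_df.iloc[0, 1])
```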
In [ ]:
mask = np.array([True,False,True],dtype=bool)
df.loc[mask,"x"]
This lets you do some powerful stuff. Below, I am going to set all values in this data frame that are less than 4 to 0:
In [ ]:
# Copy the df_names data frame we made above...
new_df_names = df_names.copy()
new_df_names
In [ ]:
mask = new_df_names < 4
new_df_names[mask] = 0
new_df_names
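Masks can also be built from a single column and used to pick out whole rows. This is a small sketch on the x/y data frame from earlier. The name selected is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": {"a": 1, "b": 2, "c": 3},
                   "y": {"a": 4, "b": 5, "c": 6}})

# Comparing a column to a value gives one True/False per row
mask = df["y"] > 4
print(list(mask))

# Use the mask with loc to pick rows, then choose a column
selected = df.loc[mask, "x"]
print(list(selected))

# The same mask can be used to set values in just those rows
df.loc[mask, "x"] = 0
print(list(df["x"]))
```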
Final aside: you can do this sort of mask slicing on numpy arrays too:
In [ ]:
mask = np.arange(10) > 6
x = np.arange(10)
x[mask] = 42
x
You get access to data frames by import pandas as pd.
You can create a data frame from a dictionary by:
df = pd.DataFrame({"col1":[v1,v2,...],"col2":[vi,vj,...],...})
You can load a data frame from a spreadsheet by:
df = pd.read_csv(csv_file)
You can access and set values in data frames using:
loc[row_name,col_name] (refers to a piece of data)
iloc[row_number,col_number] (refers to a location in the data frame)
loc and iloc allow slicing.
You can do a lot with data frames (sorting, masking, etc.)
obs column (hint, use np.mean)
obs
obs values (hint, use np.std)
In [ ]: