In [605]:
import pandas as pd
import numpy as np
In [606]:
df = pd.read_csv('./data.csv')
df
Out[606]:
The head method of our data frame returns the top 5 items. This can take a parameter for the number of items you want returned.
In [608]:
df.head(3)
Out[608]:
The tail method does the opposite and returns the last 5 items. It can also take a parameter for the number of items you want returned.
In [609]:
df.tail(2)
Out[609]:
The copy method returns a new data frame as a copy. This is useful to have a copy of the original data frame around in case you need to revert back to it.
In [610]:
df_original = df.copy()
df_original
Out[610]:
Often you'll be reading in data from the web or from a local file. The pd.read_* methods can read in all sorts of data. The two most popular ones you'll use are pd.read_csv and pd.read_table.
To look at the documentation for any method, type a ? at the end of it - pd.read_table? - or put the cursor on the method you want to see the documentation for and hit Shift+Tab.
Other read functions are:
read_clipboard - Pandas will read in whatever is in the clipboard. Useful for when copying tables from a web page.
read_json - Read JSON strings, files, or web calls. Very useful for API calls to get data.
read_html - Read in HTML. Mainly used for web scraping. Note: the beautifulsoup package is required for this as it uses its APIs.
read_excel - Similar to reading a CSV, this reads an Excel spreadsheet and can specify what sheet to read in.
Many more ways to read in data can be found at the API reference page.
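As a quick illustration, here is a minimal sketch of a couple of these readers. The file names (sales.xlsx, sales.json) and the sheet name are hypothetical and only show the call shapes; read_excel also needs an Excel engine such as openpyxl installed.
In [ ]:
# Read a specific sheet from an Excel workbook (hypothetical file and sheet names).
df_excel = pd.read_excel('sales.xlsx', sheet_name='Sheet1')

# Read records from a JSON file (hypothetical file name).
df_json = pd.read_json('sales.json')

# Read whatever table is currently on the clipboard (e.g. copied from a web page).
df_clip = pd.read_clipboard()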
In [ ]:
df.to_
Also, there are methods to write to any of these types of files - type df.to_ and hit Tab, as in the cell above, to see them. They can be accessed on the DataFrame or Series object itself.
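For example, a minimal sketch of writing our data frame back out (the output file names here are hypothetical):
In [ ]:
# Write the data frame to a CSV file, without the index column (hypothetical file name).
df.to_csv('output.csv', index=False)

# Write to an Excel spreadsheet instead (requires an Excel writer engine such as openpyxl).
df.to_excel('output.xlsx', sheet_name='Sales', index=False)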
In [611]:
print(df.values)
type(df.values)
Out[611]:
Even the values in a single row are of type ndarray.
In [612]:
print(df.values[0])
type(df.values[0])
Out[612]:
In [613]:
df.head()
Out[613]:
First, let's handle the -999 values with the replace method.
In [614]:
df = df.replace(-999, 9)
df
Out[614]:
Now, let's find if there are any missing values in our data frame.
In [615]:
df.isnull().values.any()
Out[615]:
There is missing data - denoted by NaN.
Now we select the rows that have any values that pandas considers null - NaN and None are both selected. The axis='columns' parameter (the same as axis=1) tells the any method to check across the columns of each row, so we get one result per row instead of per column.
In [617]:
df[df.isnull().any(axis='columns')]
Out[617]:
We can do a boolean selection within our data frame to access our data. For example, if I want to find, within my data frame, all the rows in which the "Items Sold" column is less than five:
In [618]:
df["Items Sold"] < 5
Out[618]:
In [619]:
df[df["Items Sold"] < 5]
Out[619]:
There are two main ways to deal with missing data: dropping the rows or filling in the values.
Dropping is OK if only a very small percentage of items would be dropped compared to the entire data set, but if that's not the case then filling in the values is the best option.
You can drop all missing data with the df.dropna method.
In [620]:
df.dropna()
Out[620]:
Filling in values for missing data can be tricky, but there are a few things you can do. You can pick the mean (average), median (middle), or most frequent value. Depending on the data, having a good foundation in the business domain really helps in this case.
Here I'm using the mean to fill in my missing data. Note the inplace=True here instead of setting the data frame equal to the result like in the previous cells.
In [621]:
df["Items Sold"].fillna(round(df["Items Sold"].mean(), 0), inplace=True)
df["Order Date"].fillna("July 25, 2017", inplace=True)
df
Out[621]:
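Note that calling fillna with inplace=True on a single column selection modifies the data through a chained selection, which newer versions of pandas warn against. A minimal equivalent sketch that assigns the results back instead:
In [ ]:
# Equivalent to the cell above, but assigning the results back instead of using inplace=True.
df["Items Sold"] = df["Items Sold"].fillna(round(df["Items Sold"].mean(), 0))
df["Order Date"] = df["Order Date"].fillna("July 25, 2017")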
In [622]:
df.duplicated()
Out[622]:
The duplicated method returns a boolean Series that marks rows which are duplicates of an earlier row. We can use that as a boolean mask to filter items from our data frame.
In [623]:
df[df.duplicated()]
Out[623]:
To show all duplicated items, including the first instance of each duplicate, set the keep parameter to False.
In [624]:
df[df.duplicated(keep=False)]
Out[624]:
We can then use the drop_duplicates method to drop all duplicates from our data frame.
In [625]:
df = df.drop_duplicates()
df
Out[625]:
In [626]:
df.dtypes
Out[626]:
We can convert the types of columns pretty easily. For dates, pandas offers us a very convenient to_datetime method.
In [627]:
df["Order Date"] = pd.to_datetime(df["Order Date"])
df
Out[627]:
For other types, we can call the astype method and specify the type we want to convert to.
In [628]:
df["Items Sold"] = df["Items Sold"].astype(int)
df
Out[628]:
We can call the info method on the data frame to get an idea of the data types.
In [629]:
df.info()
The groupby method groups by a column or a list of columns. Doing a groupby on its own returns a DataFrameGroupBy object; in order to show a data frame, an aggregation must be applied to the grouping.
In [632]:
df.groupby("Sales Person").sum()
Out[632]:
We can call sort_values after grouping to sort by a specific column. To sort in descending order, set ascending to False.
In [633]:
df.groupby("Sales Person").sum().sort_values("Item Price", ascending=False)
Out[633]:
Or you can group by multiple columns, since pandas has the ability to use multiple indexes.
In [634]:
df_grouped = df.groupby(["Sales Person", "Order Date"]).sum()
print(df_grouped.index)
df_grouped
Out[634]:
If you have more than one source of the same type of data, the concat operation will append the "new" data to the "old" data into a single data frame.
In [635]:
df2 = pd.DataFrame({"Order Date": ["December 22, 2017"],
"Sales Person": ["Mary"],
"Items Sold": 12,
"Item Price": 9.99})
df2
Out[635]:
The concat method takes in a list of data frames to concatenate together.
In [636]:
df = pd.concat([df, df2])
df
Out[636]:
Now the second data frame has been added on to the original. However, this keeps the index of the second data frame when it adds to the first. We can fix this with the reset_index method.
In [637]:
df = df.reset_index(drop=True)
df
Out[637]:
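Alternatively, pd.concat can build the fresh 0-to-n index in one step by passing ignore_index=True. A small illustrative sketch (not meant to be re-run here, since df above has already been concatenated and re-indexed):
In [ ]:
# ignore_index=True gives the concatenated frame a fresh RangeIndex,
# so a separate reset_index call isn't needed.
df_alt = pd.concat([df, df2], ignore_index=True)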
In [638]:
df.head()
Out[638]:
In [639]:
df['Total Price'] = df['Item Price'] * df['Items Sold']
df
Out[639]:
From statistics, dummy variables are variables that represent categorical variables as numbers.
With no parameters, get_dummies produces dummy variables for all columns with a type of object or category.
In [640]:
pd.get_dummies(df)
Out[640]:
However, you can specify the columns that it uses.
In [641]:
pd.get_dummies(df, columns=['Sales Person'])
Out[641]:
In [642]:
df.describe()
Out[642]:
You can also pinpoint the descriptive statistics to just the columns you want.
In [643]:
df[['Item Price']].describe()
Out[643]:
pandas has methods to give a specific statistic instead of a table of a select few. A full list of these methods is in the pandas documentation.
In [644]:
df['Item Price'].mean()
Out[644]:
In [645]:
df['Item Price'].median()
Out[645]:
In [646]:
df['Item Price'].std()
Out[646]:
Another useful statistical method available on the data frame is the correlation of each column's data with the other columns. You can get this with the corr method.
In [647]:
df.corr()
Out[647]:
Import matplotlib into our notebook. matplotlib is the most popular and battle-tested visualization package in Python. Its API is very heavily influenced by MATLAB.
The %matplotlib inline magic tells Jupyter to inline all plots. This prevents plots from showing up in a separate popup window and means we don't always have to run plt.show().
Seaborn is a package that helps make matplotlib graphs look nicer and includes statistical graphs. It is typically imported as import seaborn as sns, and sns.set() can be called to apply its styling to plots made from pandas data frames.
In [648]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
matplotlib plotting is built right into pandas, so if we want to plot all of the numerical values in our data frame, we can just call the plot method.
In [649]:
df.plot()
Out[649]:
Scatter plots are just as easy. We just need to specify the x and y columns.
In [650]:
df.plot.scatter(x='Item Price', y='Items Sold')
Out[650]:
Bar charts are just as simple and are created in the same way.
In [651]:
df.plot.bar(x='Sales Person', y='Items Sold')
Out[651]:
Box plots are very informative. They show a box with a line that indicates the median. The top of the box is the 75th percentile and the bottom of the box is the 25th percentile of the data. The top whisker extends to the maximum value and the bottom whisker to the minimum value, excluding outliers. Any outliers that fall well outside this range are marked with a circle.
In [652]:
df.plot.box()
Out[652]:
There's also a kernel density plot. You can also use the df.plot.kde() method to get the same plot.
In [653]:
df.plot.density()
Out[653]:
We can also plot a single column by selecting it by name from the data frame.
In [654]:
df["Item Price"].plot.box()
Out[654]:
Grouped data frames can also be plotted.
In [655]:
df.groupby("Sales Person").sum().sort_values("Item Price", ascending=False).plot(kind='bar')
Out[655]:
In [656]:
plot = df.plot(title='Sales Plot')
plot.set_xlabel('Count')
plot.set_ylabel('Price')
Out[656]:
We can change the colors of our plots.
In [657]:
df.plot.scatter(x='Item Price', y='Items Sold', c='red')
Out[657]: