"I don't know, what I don't know"
We want to first visually explore the data to see if we can confirm some of our initial hypotheses as well as make new hypothesis about the problem we are trying to solve.
For this we will start by loading the data and understanding the data structure of the dataframe we have.
In [ ]:
# Import the library we need, which is Pandas
import pandas as pd
You will find the variable df
used quite often to store a dataframe
In [ ]:
# Read the csv file of Monthwise Quantity and Price csv file we have.
df = pd.read_csv('MonthWiseMarketArrivals_clean.csv')
In [ ]:
df.shape
In [ ]:
df.head()
So we have ten columns in our dataset. Let us understand what each one is.
Three are about the location of the Wholesale Market where Onion where sold.
Three are related to the
Four are about quantity and price in these wholesale market.
We would expect the following the columns to be of the following type
Let us see what pandas dataframe is reading these columns as.
In [ ]:
# Get the typeof each column
df.dtypes
So we are getting the quantitive columns are correctly being shown as integers and the categorical columns are showing as objects(strings) which is fine.
However, the date
columns is being read as an object and not a Time-Interval. Let us at least fix the date
column and make it into a datetime object
In [ ]:
# Changing the date column to a Time Interval columnn
df.date = pd.DatetimeIndex(df.date)
In [ ]:
# Now checking for type of each column
df.dtypes
In [ ]:
# Let us see the dataframe again now
df.head()
In [ ]:
df2010 = df[df.year == 2010]
In [ ]:
df2010.head()
In [ ]:
# We can also subset on multiple criterias
df2010Bang = df[(df.year == 2010) & (df.city == 'BANGALORE')]
In [ ]:
df2010Bang.head()
In [ ]:
# Group by using city
df2010City = df2010.groupby(['city']).sum()
In [ ]:
df2010City.head()
In [ ]:
# If we only want to apply the sum function on quantity, then we specify the quantity column
df2010City = df2010.groupby(['city']).quantity.sum()
In [ ]:
# Let us see this dataframe
df2010City.head()
In [ ]:
# To create a dataframe again, it is best to specify index as false
df2010City = df2010.groupby(['city'], as_index=False).quantity.sum()
In [ ]:
df2010City.head()
In [ ]:
# Sort the Dataframe by Quantity to see which one is on top
df2010City = df2010City.sort_values(by = "quantity", ascending = False)
df2010City.head()
In [ ]:
# Load the visualisation libraries - Matplotlib
import matplotlib.pyplot as plt
In [ ]:
# Let us see the output plots in the notebook itself
%matplotlib inline
In [ ]:
# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)
In [ ]:
# Plot the Data
df2010City.plot(kind ="bar", x = 'city', y = 'quantity')
In [ ]:
In [ ]:
In [ ]:
In [ ]:
df.head()
In [ ]:
dfBang = df[df.city == 'BANGALORE']
In [ ]:
dfBang.head()
In [ ]:
dfBang.describe()
In [ ]:
# Reduce the precision of numbers - so that it is easy to read
pd.set_option('precision', 0)
In [ ]:
dfBang.describe()
In [ ]:
dfBang.head()
In [ ]:
dfBang.index
In [ ]:
# Set the index as date
dfBang = dfBang.sort_values(by = "date")
dfBang.head()
In [ ]:
# Set the Index for the Dataframe
dfBang.index = pd.PeriodIndex(dfBang.date, freq='M')
In [ ]:
dfBang.head()
In [ ]:
dfBang.priceMod.plot()
In [ ]:
dfBang.plot(kind = "line", y = ['priceMin', 'priceMod', 'priceMax'])
To calculate the range of change, we will create a new price difference variable - which is the difference between the priceMin and priceMax
In [ ]:
dfBang['priceDiff'] = dfBang['priceMax'] - dfBang['priceMin']
In [ ]:
dfBang.head()
In [ ]:
dfBang.plot(kind = 'line', y = 'priceDiff')
In [ ]:
# Create new variable for Integer Month
dfBang['monthVal'] = pd.DatetimeIndex(dfBang['date']).month
In [ ]:
dfBangPivot = pd.pivot_table(dfBang, values = "priceDiff", columns = "year", index = "monthVal")
In [ ]:
dfBangPivot.head()
In [ ]:
dfBangPivot.plot()
In [ ]:
dfBangPivot.plot(subplots = True, figsize=(15, 15), layout=(3, 5), sharey=True)
In [ ]:
In [ ]: