4. Explore the Data

"I don't know, what I don't know"

We want to first visually explore the data to see if we can confirm some of our initial hypotheses as well as make new hypothesis about the problem we are trying to solve.

For this we will start by loading the data and understanding the data structure of the dataframe we have.

Lets read the data


In [ ]:
# Import the library we need, which is Pandas
import pandas as pd

You will find the variable df used quite often to store a dataframe


In [ ]:
# Read the csv file of Monthwise Quantity and Price csv file we have.
df = pd.read_csv('MonthWiseMarketArrivals_clean.csv')

Understand Data Structure and Types


In [ ]:
df.shape

In [ ]:
df.head()

Data Structure

So we have ten columns in our dataset. Let us understand what each one is.

Three are about the location of the Wholesale Market where Onion where sold.

  • state: This is the 2/3 letter abbreviation for the state in India (PB is Punjab and so on)
  • city: This is the city in India (ABOHAR, BANGALORE and so on)
  • market: This is a string with the combination of the state and city

Three are related to the

  • month: Month in January, February and so on.
  • year: Year in YYYY representastion
  • date: The combination of the two above.

Four are about quantity and price in these wholesale market.

  • quantity: The quanity of Onion arriving in the market in that month in quintals (100 kg)
  • priceMin: The minimum price in the month in Rs./quintal
  • priceMax: The maximum price in the month in Rs./quintal
  • priceMod: The modal price in the month in Rs./quintal

We would expect the following the columns to be of the following type

  • CATEGORICAL: state, city, market
  • TIME INTERVAL: month, year, date
  • QUANTITATIVE: quantity, priceMin, priceMax, priceModal

Let us see what pandas dataframe is reading these columns as.


In [ ]:
# Get the typeof each column
df.dtypes

So we are getting the quantitive columns are correctly being shown as integers and the categorical columns are showing as objects(strings) which is fine. However, the date columns is being read as an object and not a Time-Interval. Let us at least fix the date column and make it into a datetime object


In [ ]:
# Changing the date column to a Time Interval columnn
df.date = pd.DatetimeIndex(df.date)

In [ ]:
# Now checking for type of each column
df.dtypes

In [ ]:
# Let us see the dataframe again now
df.head()

Question 1 - How big is the Bangalore onion market compared to other cities in India?

Let us try to do this examination for one of the year only. So we want to reduce our dataframe for only where the year = 2010. This process is called subset.

PRINCIPLE: Subset a Dataframe

How do you subset a dataframe on a given criteria

newDataframe = df[ <subset condition> ]


In [ ]:
df2010 = df[df.year == 2010]

In [ ]:
df2010.head()

In [ ]:
# We can also subset on multiple criterias
df2010Bang = df[(df.year == 2010) & (df.city == 'BANGALORE')]

In [ ]:
df2010Bang.head()

Principle: Split Apply Combine

How do we get the sum of quantity for each city.

We need to SPLIT the data by each city, APPLY the sum to the quantity row and then COMBINE the data again

In pandas, we use the groupby function to do this.


In [ ]:
# Group by using city
df2010City = df2010.groupby(['city']).sum()

In [ ]:
df2010City.head()

In [ ]:
# If we only want to apply the sum function on quantity, then we specify the quantity column
df2010City = df2010.groupby(['city']).quantity.sum()

In [ ]:
# Let us see this dataframe
df2010City.head()

In [ ]:
# To create a dataframe again, it is best to specify index as false
df2010City = df2010.groupby(['city'], as_index=False).quantity.sum()

In [ ]:
df2010City.head()

In [ ]:
# Sort the Dataframe by Quantity to see which one is on top
df2010City = df2010City.sort_values(by = "quantity", ascending = False)
df2010City.head()

PRINCIPLE: Visual Exploration

Lets load the libraries required for plotting in python


In [ ]:
# Load the visualisation libraries - Matplotlib
import matplotlib.pyplot as plt

In [ ]:
# Let us see the output plots in the notebook itself
%matplotlib inline

In [ ]:
# Set some parameters to get good visuals - style to ggplot and size to 15,10
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 10)

In [ ]:
# Plot the Data
df2010City.plot(kind ="bar", x = 'city', y = 'quantity')

Exercise: Find the State with Highest Quantity Sales in 2015?


In [ ]:


In [ ]:


In [ ]:

Question 2 - Have the price variation in Onion prices in Bangalore really gone up over the years?


In [ ]:
df.head()

In [ ]:
dfBang = df[df.city == 'BANGALORE']

In [ ]:
dfBang.head()

In [ ]:
dfBang.describe()

In [ ]:
# Reduce the precision of numbers - so that it is easy to read
pd.set_option('precision', 0)

In [ ]:
dfBang.describe()

PRINCIPLE: Setting Index


In [ ]:
dfBang.head()

In [ ]:
dfBang.index

In [ ]:
# Set the index as date
dfBang = dfBang.sort_values(by = "date")
dfBang.head()

In [ ]:
# Set the Index for the Dataframe
dfBang.index = pd.PeriodIndex(dfBang.date, freq='M')

In [ ]:
dfBang.head()

In [ ]:
dfBang.priceMod.plot()

In [ ]:
dfBang.plot(kind = "line", y = ['priceMin', 'priceMod', 'priceMax'])

To calculate the range of change, we will create a new price difference variable - which is the difference between the priceMin and priceMax


In [ ]:
dfBang['priceDiff'] = dfBang['priceMax'] - dfBang['priceMin']

In [ ]:
dfBang.head()

In [ ]:
dfBang.plot(kind = 'line', y = 'priceDiff')

PRINCIPLE: Pivot Table

Pivot table is a way to summarize data frame data into index (rows), columns and value


In [ ]:
# Create new variable for Integer Month
dfBang['monthVal'] = pd.DatetimeIndex(dfBang['date']).month

In [ ]:
dfBangPivot = pd.pivot_table(dfBang, values = "priceDiff", columns = "year", index = "monthVal")

In [ ]:
dfBangPivot.head()

In [ ]:
dfBangPivot.plot()

In [ ]:
dfBangPivot.plot(subplots = True, figsize=(15, 15), layout=(3, 5), sharey=True)

Exercise: Find the price variation for LASALGAON city?


In [ ]:


In [ ]: