When we summarize data, which metrics should we look at?
In this notebook, we will explore the car dataset.
For details on how the data was acquired, please see this repo.
In [2]:
#Import the required libraries
import numpy as np
import pandas as pd
from datetime import datetime as dt
from scipy import stats
In [3]:
cars = pd.read_csv("cars_v1.csv", encoding = "ISO-8859-1")
In [4]:
cars.head()
Out[4]:
Exercise
In [5]:
#Display the first 10 records
cars.head(10)
Out[5]:
In [6]:
#Display the last 5 records
cars.tail()
Out[6]:
In [7]:
#Find the number of rows and columns in the dataset
cars.shape
Out[7]:
In [8]:
#What are the column names in the dataset?
cars.columns
Out[8]:
In [9]:
#What are the types of those columns ?
cars.dtypes
Out[9]:
In [ ]:
cars.head()
In [11]:
#How to check if there are null values in any of the columns?
#Hint: use the isnull() function (how about using sum or values/any with it?)
cars.isnull().sum()
Out[11]:
How to handle missing values?
In [15]:
#fillna function
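One common way to handle missing values is to fill them with a summary statistic of the same column, or to drop the affected rows. The sketch below uses a small, made-up DataFrame (the column names and values are hypothetical and are not taken from cars_v1.csv) to show fillna and dropna side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT taken from cars_v1.csv
toy = pd.DataFrame({
    "Make": ["A", "B", "C", "D"],
    "Mileage": [18.0, np.nan, 22.0, 20.0],
})

# Count the missing values per column before filling
missing_before = toy.isnull().sum()

# mean() skips NaN by default, so this is the mean of the observed values
mileage_mean = toy["Mileage"].mean()

# Fill only the Mileage column; a dict maps column name -> fill value
toy_filled = toy.fillna({"Mileage": mileage_mean})

# Alternative: drop every row that contains at least one missing value
toy_dropped = toy.dropna()

print(missing_before)
print(toy_filled)
print(toy_dropped)
```

Whether to fill or to drop depends on how much data is missing and why; filling with the mean keeps the row count but shrinks the variance of the column.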
In [17]:
#Find mean of price
cars.Price.mean()
Out[17]:
In [18]:
#Find mean of Mileage
cars.Mileage.mean()
Out[18]:
Let's do something fancier: find the mean mileage of every make.
Hint: use groupby
In [23]:
#cars.groupby('Make') : Finish the code
cars.groupby('Make').Mileage.mean().reset_index()
Out[23]:
If the count n is odd, the median is the value at position (n+1)/2 of the sorted data;
otherwise it is the average of the values at positions n/2 and n/2 + 1.
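To make the positional rule concrete, here is a minimal sketch that computes the median by hand and compares it with np.median. For an even count, the two middle values sit at 1-based positions n/2 and n/2 + 1 of the sorted data. The helper name median_by_position and the sample values are hypothetical, chosen only for illustration.

```python
import numpy as np

def median_by_position(values):
    """Median via the positional rule (positions in the rule are 1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # Odd count: 1-based position (n + 1) / 2 is 0-based index n // 2
        return ordered[n // 2]
    # Even count: average the values at 1-based positions n/2 and n/2 + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_sample = [7, 1, 5]       # hypothetical values
even_sample = [7, 1, 5, 3]   # hypothetical values

print(median_by_position(odd_sample), np.median(odd_sample))
print(median_by_position(even_sample), np.median(even_sample))
```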
Find median of mileage
In [24]:
cars.Mileage.median()
Out[24]:
Find the mode of the car Type
In [55]:
#Let's first find count of each of the car Types
#Hint: use value_counts
In [26]:
cars.Type.value_counts()
Out[26]:
In [56]:
#Mode of cars
In [30]:
cars.Type
Out[30]:
In [28]:
cars.Type.mode()
Out[28]:
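Note that mode() returns a Series rather than a single value, because more than one value can be tied for the highest count. A small sketch with made-up category labels (hypothetical, not taken from the car dataset) shows the tie case:

```python
import pandas as pd

# Hypothetical labels, NOT taken from cars_v1.csv
types = pd.Series(["Hatchback", "Sedan", "Sedan", "SUV", "Hatchback"])

# Frequency of each label
counts = types.value_counts()

# Every label tied for the highest count is returned, in sorted order
modes = types.mode()

print(counts)
print(modes)
```

If a single value is needed, modes.iloc[0] picks the first of the tied values.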
In [29]:
cars.head()
Out[29]:
Find variance of mileage
In [31]:
cars.Mileage.var()
Out[31]:
Find standard deviation of mileage
In [32]:
cars.Mileage.std()
Out[32]:
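One detail worth knowing: pandas var() and std() default to the sample estimate (dividing by n - 1, i.e. ddof=1), whereas NumPy's np.var and np.std default to the population estimate (dividing by n, i.e. ddof=0). The sketch below checks both by hand on made-up numbers (hypothetical values, not taken from the car dataset).

```python
import numpy as np
import pandas as pd

mileage = pd.Series([18.0, 20.0, 22.0, 24.0])  # hypothetical values

# Squared deviations from the mean
squared_dev = (mileage - mileage.mean()) ** 2

# Sample variance divides by n - 1; population variance divides by n
sample_var = squared_dev.sum() / (len(mileage) - 1)
population_var = squared_dev.sum() / len(mileage)

print(sample_var, mileage.var())                    # pandas default: ddof=1
print(population_var, np.var(mileage.to_numpy()))   # NumPy default: ddof=0

# Standard deviation is the square root of the variance
print(np.sqrt(sample_var), mileage.std())
```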
In [33]:
cars.describe()
Out[33]:
Covariance is a measure of the (average) co-variation between two variables, say x and y: it describes how much the two variables change together. Its sign indicates the direction of their relationship (positive when they tend to move in the same direction, negative when they move in opposite directions), while its magnitude depends on how spread out each variable is. Compare this to variance, which measures how far a single variable spreads around its own mean.
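As a rough illustration, the sample covariance can be computed by hand as the sum of the products of paired deviations from the mean, divided by n - 1, and then compared with Series.cov. The two series below are made-up numbers (hypothetical, not taken from the car dataset).

```python
import pandas as pd

# Hypothetical paired observations, NOT taken from cars_v1.csv
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 5.0, 9.0])

# Deviations of each variable from its own mean
dx = x - x.mean()
dy = y - y.mean()

# Sample covariance: sum of products of paired deviations, divided by n - 1
manual_cov = (dx * dy).sum() / (len(x) - 1)

print(manual_cov, x.cov(y))

# The covariance of a variable with itself is its variance
print(x.cov(x), x.var())
```

Because the deviations are multiplied pair by pair, the two variables need the same number of observations, matched by index.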
In [34]:
pd.unique(cars.GearType)
Out[34]:
In [36]:
cars_Automatic = cars[cars.GearType==' Automatic'].copy().reset_index()
In [37]:
cars_Manual = cars[cars.GearType==' Manual'].copy().reset_index()
In [38]:
cars_Automatic.head()
Out[38]:
In [85]:
cars_Manual.head()
Out[85]:
In [86]:
cars_Manual.shape
Out[86]:
In [87]:
cars_Automatic.shape
Out[87]:
The number of observations has to be the same in both groups. For the current exercise, let's take the first 300 observations from each dataset.
In [39]:
#.ix has been removed from pandas; .iloc[:300, :] selects the first 300 rows by position
cars_Automatic = cars_Automatic.iloc[:300, :]
cars_Manual = cars_Manual.iloc[:300, :]
In [92]:
cars_Automatic.shape
Out[92]:
In [93]:
cars_Manual.shape
Out[93]:
In [40]:
cars_manual_automatic = pd.DataFrame([cars_Automatic.Mileage, cars_Manual.Mileage])
In [97]:
cars_manual_automatic
Out[97]:
In [41]:
cars_manual_automatic = cars_manual_automatic.T
In [101]:
cars_manual_automatic.head()
Out[101]:
In [42]:
cars_manual_automatic.columns = ['Mileage_Automatic', 'Mileage_Manual']
In [104]:
cars_manual_automatic.head()
Out[104]:
In [44]:
#Covariance matrix between the mileages of automatic and manual:
cars_manual_automatic.cov()
Out[44]:
In [106]:
#Find the correlation between the mileages of automatic and manual in the above dataset
In [45]:
cars_manual_automatic.corr()
Out[45]:
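Correlation (Pearson, the default for corr) is covariance rescaled by the two standard deviations, which puts it on a unit-free scale between -1 and 1. The sketch below verifies that relationship on a made-up two-column DataFrame (hypothetical column names and values, not taken from the car dataset) and also shows what corrwith returns.

```python
import pandas as pd

# Hypothetical paired observations, NOT taken from cars_v1.csv
pairs = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 5.0, 9.0],
})

# Pearson correlation = covariance / (std of x * std of y)
cov_xy = pairs["x"].cov(pairs["y"])
manual_corr = cov_xy / (pairs["x"].std() * pairs["y"].std())

# Full correlation matrix; the diagonal is 1 by construction
corr_matrix = pairs.corr()

# corrwith correlates every column of the DataFrame with one other Series
corr_with_y = pairs.corrwith(pairs["y"])

print(manual_corr)
print(corr_matrix)
print(corr_with_y)
```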
In [46]:
cars_manual_automatic.corrwith?