Basic Metrics

When we think about summarizing data, what are the metrics that we look at?

In this notebook, we will look at the car dataset

To read how the data was acquired, please read this repo to get more information


In [2]:
#Import the required libraries
import numpy as np
import pandas as pd
from datetime import datetime as dt
from scipy import stats

Read the dataset


In [3]:
cars = pd.read_csv("cars_v1.csv", encoding = "ISO-8859-1")

Warm up


In [4]:
cars.head()


Out[4]:
Make Model Price Type ABS BootSpace GearType AirBag Engine FuelCapacity Mileage
0 Ashok Leyland Stile Ashok Leyland Stile LE 8-STR (Diesel) 750 MPV No 500.0 Manual No 1461.0 50.0 20.7
1 Ashok Leyland Stile Ashok Leyland Stile LS 8-STR (Diesel) 800 MPV No 500.0 Manual No 1461.0 50.0 20.7
2 Ashok Leyland Stile Ashok Leyland Stile LX 8-STR (Diesel) 830 MPV No 500.0 Manual No 1461.0 50.0 20.7
3 Ashok Leyland Stile Ashok Leyland Stile LS 7-STR (Diesel) 850 MPV No 500.0 Manual No 1461.0 50.0 20.7
4 Ashok Leyland Stile Ashok Leyland Stile LS 7-STR Alloy (Diesel) 880 MPV No 500.0 Manual No 1461.0 50.0 20.7

Exercise


In [5]:
#Display the first 10 records
cars.head(10)


Out[5]:
Make Model Price Type ABS BootSpace GearType AirBag Engine FuelCapacity Mileage
0 Ashok Leyland Stile Ashok Leyland Stile LE 8-STR (Diesel) 750 MPV No 500.0 Manual No 1461.0 50.0 20.7
1 Ashok Leyland Stile Ashok Leyland Stile LS 8-STR (Diesel) 800 MPV No 500.0 Manual No 1461.0 50.0 20.7
2 Ashok Leyland Stile Ashok Leyland Stile LX 8-STR (Diesel) 830 MPV No 500.0 Manual No 1461.0 50.0 20.7
3 Ashok Leyland Stile Ashok Leyland Stile LS 7-STR (Diesel) 850 MPV No 500.0 Manual No 1461.0 50.0 20.7
4 Ashok Leyland Stile Ashok Leyland Stile LS 7-STR Alloy (Diesel) 880 MPV No 500.0 Manual No 1461.0 50.0 20.7
5 Ashok Leyland Stile Ashok Leyland Stile LX 7-STR (Diesel) 900 MPV No 500.0 Manual No 1461.0 50.0 20.7
6 Ashok Leyland Stile Ashok Leyland Stile LX 7-STR Alloy (Diesel) 930 MPV No 500.0 Manual No 1461.0 50.0 20.7
7 Aston Martin Rapide Aston Martin Rapide LUXE (Petrol) 35000 Sedan Yes 300.0 Automatic Yes 5935.0 90.5 7.0
8 Aston Martin Rapide S Aston Martin Rapide S (Petrol) 44000 Sedan Yes NaN Automatic Yes 5935.0 90.0 11.9
9 Aston Martin V12 Vantage Aston Martin V12 Vantage Coupe (Petrol) 35000 Coupe Yes 300.0 Automatic Yes 5935.0 80.0 9.0

In [6]:
#Display the last 5 records
cars.tail()


Out[6]:
Make Model Price Type ABS BootSpace GearType AirBag Engine FuelCapacity Mileage
828 Volvo V40 Volvo V40 D3 R-Design (Diesel) 2770 Hatchback Yes 324.0 Automatic Yes 1984.0 62.0 27.8
829 Volvo V40 Cross Country Volvo V40 D3 Cross Country (Diesel) 3180 Hatchback Yes 335.0 Automatic Yes 1984.0 60.0 16.8
830 Volvo V40 Cross Country Volvo V40 T4 Cross Country (Petrol) 3240 Hatchback Yes 324.0 Automatic Yes 1596.0 50.0 16.0
831 Volvo XC90 Volvo XC 90 D5 Momentum(Diesel) 6490 SUV Yes NaN Automatic Yes 1969.0 68.0 11.1
832 Volvo XC90 Volvo XC 90 D5 Inscription(Diesel) 7790 SUV Yes NaN Automatic Yes 1969.0 68.0 11.1

In [7]:
#Find the number of rows and columns in the dataset
cars.shape


Out[7]:
(833, 11)

In [8]:
#What are the column names in the dataset?
cars.columns


Out[8]:
Index(['Make', 'Model', 'Price', 'Type', 'ABS', 'BootSpace', 'GearType',
       'AirBag', 'Engine', 'FuelCapacity', 'Mileage'],
      dtype='object')

In [9]:
#What are the types of those columns ? 
cars.dtypes


Out[9]:
Make             object
Model            object
Price             int64
Type             object
ABS              object
BootSpace       float64
GearType         object
AirBag           object
Engine          float64
FuelCapacity    float64
Mileage         float64
dtype: object

In [ ]:
cars.head()

In [11]:
#How to check if there are null values in any of the columns?

#Hint: use the isnull() function  (how about using sum or values/any with it?)
cars.isnull().sum()


Out[11]:
Make              0
Model             0
Price             0
Type              0
ABS              16
BootSpace       179
GearType         16
AirBag           21
Engine            7
FuelCapacity      0
Mileage         171
dtype: int64

In [20]:



Out[20]:
Make              0
Model             0
Price             0
Type              0
ABS              16
BootSpace       179
GearType         16
AirBag           21
Engine            7
FuelCapacity      0
Mileage         171
dtype: int64

How to handle missing values?


In [15]:
#fillna function

Mean, Median, Variance, Standard Deviation

Mean

arithmetic average of a range of values or quantities, computed by dividing the total of all values by the number of values.


In [17]:
#Find mean of price
cars.Price.mean()


Out[17]:
3159.4957983193276

In [18]:
#Find mean of Mileage
cars.Mileage.mean()


Out[18]:
17.480407854984893

Let's do something fancier. Let's find mean mileage of every make.

Hint: need to use groupby


In [23]:
#cars.groupby('Make') : Finish the code
cars.groupby('Make').Mileage.mean().reset_index()


Out[23]:
Make Mileage
0 Ashok Leyland Stile 20.700000
1 Aston Martin Rapide 7.000000
2 Aston Martin Rapide S 11.900000
3 Aston Martin V12 Vantage 9.000000
4 Aston Martin V8 Vantage 5.000000
5 Aston Martin Vanquish 8.000000
6 Audi A3 Cabriolet 19.186667
7 Audi A4 14.804000
8 Audi A6 15.260000
9 Audi A7 14.400000
10 Audi A8 L 12.575000
11 Audi Q3 16.260000
12 Audi Q5 13.063333
13 Audi Q7 12.466667
14 Audi R8 9.515000
15 Audi RS5 9.000000
16 Audi RS6 10.410000
17 Audi RS7 13.300000
18 Audi S6 10.300000
19 Audi TT 8.300000
20 BMW 1 Series 20.580000
21 BMW 3 Series 17.857500
22 BMW 3 Series GT 19.590000
23 BMW 5 Series 13.982500
24 BMW 5 Series GT NaN
25 BMW 6 Series 17.540000
26 BMW 7 Series 11.962500
27 BMW M3 10.750000
28 BMW M5 13.250000
29 BMW M6 Gran Coupe 10.100000
... ... ...
173 Tata Movus NaN
174 Tata Nano 24.083333
175 Tata Safari DICOR 13.930000
176 Tata Safari Storme 13.875000
177 Tata Sumo Gold 14.612500
178 Tata Sumo Grande 13.550000
179 Tata Venture 15.420000
180 Tata Winger NaN
181 Tata Xenon 12.745000
182 Tata Zest 20.600000
183 Toyota Corolla Altis 2014 21.322500
184 Toyota Etios 20.185000
185 Toyota Etios Cross 13.300000
186 Toyota Etios Liva NaN
187 Toyota Etios Xclusive 16.780000
188 Toyota Fortuner 12.550000
189 Volkswagen Cross Polo 19.285000
190 Volkswagen New JETTA NaN
191 Volkswagen New Phaeton 12.800000
192 Volkswagen Passat 17.286667
193 Volkswagen Polo 18.305000
194 Volkswagen Polo GT 18.450000
195 Volkswagen Taigun NaN
196 Volkswagen Up 21.000000
197 Volkswagen Vento 18.957000
198 Volvo S60 19.133333
199 Volvo S80 16.660000
200 Volvo V40 27.800000
201 Volvo V40 Cross Country 16.400000
202 Volvo XC90 11.100000

203 rows × 2 columns

Exercise

How about finding the average mileage for every Type-GearType combination?


In [46]:



Out[46]:
Type GearType Mileage
0 Convertible Automatic 12.147143
1 Coupe Automatic 9.746071
2 Hatchback Automatic 19.446111
3 Hatchback Manual 20.804878
4 Hatchback No 21.748000
5 MPV Automatic 20.434286
6 MPV Manual 18.825862
7 MUV Automatic 12.466667
8 MUV Manual 15.634091
9 MUV No 13.912000
10 SUV Automatic 14.603176
11 SUV Manual 15.772308
12 SUV No 11.973333
13 Sedan Automatic 16.670000
14 Sedan Manual 20.143626
15 Sedan No 19.500000

Median

Denotes value or quantity lying at the midpoint of a frequency distribution of observed values or quantities, such that there is an equal probability of falling above or below it. Simply put, it is the middle value in the list of numbers.

If count is odd, the median is the value at (n+1)/2,

else it is the average of n/2 and (n+1)/2

Find median of mileage


In [24]:
cars.Mileage.median()


Out[24]:
17.985

Mode

It is the number which appears most often in a set of numbers.

Find the mode of Type of cars


In [55]:
#Let's first find count of each of the car Types
#Hint: use value_counts

In [26]:
cars.Type.value_counts()


Out[26]:
Sedan          294
Hatchback      222
SUV            186
MPV             47
MUV             40
Coupe           33
Convertible     11
Name: Type, dtype: int64

In [56]:
#Mode of cars

In [30]:
cars.Type


Out[30]:
0              MPV
1              MPV
2              MPV
3              MPV
4              MPV
5              MPV
6              MPV
7            Sedan
8            Sedan
9            Coupe
10           Coupe
11           Coupe
12           Coupe
13           Sedan
14           Sedan
15           Sedan
16           Sedan
17           Sedan
18           Sedan
19     Convertible
20           Sedan
21           Sedan
22           Sedan
23           Sedan
24           Sedan
25           Sedan
26           Sedan
27           Sedan
28           Sedan
29           Sedan
          ...     
803      Hatchback
804      Hatchback
805      Hatchback
806      Hatchback
807      Hatchback
808      Hatchback
809            SUV
810      Hatchback
811          Sedan
812          Sedan
813          Sedan
814          Sedan
815          Sedan
816          Sedan
817          Sedan
818          Sedan
819          Sedan
820          Sedan
821          Sedan
822          Sedan
823          Sedan
824          Sedan
825          Sedan
826          Sedan
827      Hatchback
828      Hatchback
829      Hatchback
830      Hatchback
831            SUV
832            SUV
Name: Type, dtype: object

In [28]:
cars.Type.mode()


Out[28]:
0    Sedan
dtype: object

In [29]:
cars.head()


Out[29]:
Make Model Price Type ABS BootSpace GearType AirBag Engine FuelCapacity Mileage
0 Ashok Leyland Stile Ashok Leyland Stile LE 8-STR (Diesel) 750 MPV No 500.0 Manual No 1461.0 50.0 20.7
1 Ashok Leyland Stile Ashok Leyland Stile LS 8-STR (Diesel) 800 MPV No 500.0 Manual No 1461.0 50.0 20.7
2 Ashok Leyland Stile Ashok Leyland Stile LX 8-STR (Diesel) 830 MPV No 500.0 Manual No 1461.0 50.0 20.7
3 Ashok Leyland Stile Ashok Leyland Stile LS 7-STR (Diesel) 850 MPV No 500.0 Manual No 1461.0 50.0 20.7
4 Ashok Leyland Stile Ashok Leyland Stile LS 7-STR Alloy (Diesel) 880 MPV No 500.0 Manual No 1461.0 50.0 20.7

Variance

Once two statistician of height 4 feet and 5 feet have to cross a river of AVERAGE depth 3 feet. Meanwhile, a third person comes and said, "what are you waiting for? You can easily cross the river"

It's the average distance of the data values from the mean

Find variance of mileage


In [31]:
cars.Mileage.var()


Out[31]:
21.018811179847432

Standard Deviation

It is the square root of variance. This will have the same units as the data and mean.

Find standard deviation of mileage


In [32]:
cars.Mileage.std()


Out[32]:
4.5846277035161132

Using Pandas built-in function


In [33]:
cars.describe()


/Users/amitkaps/miniconda3/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[33]:
Price BootSpace Engine FuelCapacity Mileage
count 833.000000 654.000000 826.000000 833.000000 662.000000
mean 3159.495798 409.536697 1900.533898 51.341693 17.480408
std 6272.519112 175.351525 1065.149758 17.643132 4.584628
min 199.000000 80.000000 624.000000 0.000000 5.000000
25% 619.000000 NaN NaN 42.000000 NaN
50% 880.000000 NaN NaN 50.000000 NaN
75% 2329.000000 NaN NaN 64.000000 NaN
max 50000.000000 1000.000000 6592.000000 120.000000 30.000000

Co-variance

covariance as a measure of the (average) co-variation between two variables, say x and y. Covariance describes both how far the variables are spread out, and the nature of their relationship, Covariance is a measure of how much two variables change together. Compare this to Variance, which is just the range over which one measure (or variable) varies.





Co-variance of mileage of Automatic and Manual Gear Type


In [34]:
pd.unique(cars.GearType)


Out[34]:
array([' Manual', ' Automatic', nan, ' No'], dtype=object)

In [36]:
cars_Automatic = cars[cars.GearType==' Automatic'].copy().reset_index()

In [37]:
cars_Manual = cars[cars.GearType==' Manual'].copy().reset_index()

In [38]:
cars_Automatic.head()


Out[38]:
index Make Model Price Type ABS BootSpace GearType AirBag Engine FuelCapacity Mileage
0 7 Aston Martin Rapide Aston Martin Rapide LUXE (Petrol) 35000 Sedan Yes 300.0 Automatic Yes 5935.0 90.5 7.0
1 8 Aston Martin Rapide S Aston Martin Rapide S (Petrol) 44000 Sedan Yes NaN Automatic Yes 5935.0 90.0 11.9
2 9 Aston Martin V12 Vantage Aston Martin V12 Vantage Coupe (Petrol) 35000 Coupe Yes 300.0 Automatic Yes 5935.0 80.0 9.0
3 10 Aston Martin V8 Vantage Aston Martin V8 Vantage Coupe (Petrol) 13500 Coupe Yes 300.0 Automatic Yes 4735.0 80.0 5.0
4 11 Aston Martin V8 Vantage Aston Martin V8 Vantage S Coupe (Petrol) 25500 Coupe Yes 300.0 Automatic Yes 4735.0 80.0 5.0

In [85]:
cars_Manual.head()


Out[85]:
index Make Model Price Type ABS BootSpace GearType AirBag Engine FuelCapacity Mileage
0 0 Ashok Leyland Stile Ashok Leyland Stile LE 8-STR (Diesel) 750 MPV No 500.0 Manual No 1461.0 50.0 20.7
1 1 Ashok Leyland Stile Ashok Leyland Stile LS 8-STR (Diesel) 800 MPV No 500.0 Manual No 1461.0 50.0 20.7
2 2 Ashok Leyland Stile Ashok Leyland Stile LX 8-STR (Diesel) 830 MPV No 500.0 Manual No 1461.0 50.0 20.7
3 3 Ashok Leyland Stile Ashok Leyland Stile LS 7-STR (Diesel) 850 MPV No 500.0 Manual No 1461.0 50.0 20.7
4 4 Ashok Leyland Stile Ashok Leyland Stile LS 7-STR Alloy (Diesel) 880 MPV No 500.0 Manual No 1461.0 50.0 20.7

In [86]:
cars_Manual.shape


Out[86]:
(421, 12)

In [87]:
cars_Automatic.shape


Out[87]:
(372, 12)

The number of observations have to be same. For the current exercise, let's take the first 300 observations in both the datasets


In [39]:
cars_Automatic = cars_Automatic.ix[:299,:]
cars_Manual = cars_Manual.ix[:299,:]

In [92]:
cars_Automatic.shape


Out[92]:
(300, 12)

In [93]:
cars_Manual.shape


Out[93]:
(300, 12)

In [40]:
cars_manual_automatic = pd.DataFrame([cars_Automatic.Mileage, cars_Manual.Mileage])

In [97]:
cars_manual_automatic


Out[97]:
0 1 2 3 4 5 6 7 8 9 ... 290 291 292 293 294 295 296 297 298 299
Mileage 7.0 11.9 9.0 5.0 5.0 5.0 8.0 16.60 17.0 16.55 ... 13.1 13.10 17.2 13.7 17.20 14.80 NaN NaN NaN 19.30
Mileage 20.7 20.7 20.7 20.7 20.7 20.7 20.7 17.32 13.7 13.70 ... 20.5 20.45 20.5 19.0 19.01 19.01 19.01 13.05 19.87 19.87

2 rows × 300 columns


In [41]:
cars_manual_automatic = cars_manual_automatic.T

In [101]:
cars_manual_automatic.head()


Out[101]:
Mileage Mileage
0 7.0 20.7
1 11.9 20.7
2 9.0 20.7
3 5.0 20.7
4 5.0 20.7

In [42]:
cars_manual_automatic.columns = ['Mileage_Automatic', 'Mileage_Manual']

In [104]:
cars_manual_automatic.head()


Out[104]:
Mileage_Automatic Mileage_Manual
0 7.0 20.7
1 11.9 20.7
2 9.0 20.7
3 5.0 20.7
4 5.0 20.7

In [44]:
#Co-variance matrix between the mileages of automatic and manual:
cars_manual_automatic.cov()


Out[44]:
Mileage_Automatic Mileage_Manual
Mileage_Automatic 22.375515 0.446292
Mileage_Manual 0.446292 12.776373

In [ ]:

Correlation

Extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.





In [106]:
#### Find the correlation between the mileages of automatic and manual in the above dataset

In [45]:
cars_manual_automatic.corr()


Out[45]:
Mileage_Automatic Mileage_Manual
Mileage_Automatic 1.000000 0.026011
Mileage_Manual 0.026011 1.000000

In [46]:
cars_manual_automatic.corrwith?

Correlation != Causation

correlation between two variables does not necessarily imply that one causes the other.


In [ ]: