Imports needed for this notebook. Uses Python 2.7.13, numpy 1.11.3, pandas 0.19.2, scipy 0.18.1, matplotlib 2.0.0 and statsmodels 0.6.1.
In [19]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.sandbox.stats.multicomp as mc
Set up a color palette reputed to be color blind friendly. From: https://gist.github.com/thriveth/8560036.
In [20]:
colors = ('#377eb8', '#ff7f00', '#4daf4a', '#f781bf', '#a65628', '#984ea3', '#999999', '#e41a1c') # , '#dede00')
Read in the data. I am using a modified version of the UCI Auto MPG data set. Missing horsepower values have been looked up and a column to mark diesel vehicles added.
In [21]:
df = pd.read_csv('uci_auto_mpg_mod.csv')
print df.head()
Let's check the effect of the number of engine cylinders on fuel economy (mpg) using a box plot.
In [22]:
mpg3cyl = df['mpg'][df['cylinders'] == 3]
mpg4cyl = df['mpg'][df['cylinders'] == 4]
mpg5cyl = df['mpg'][df['cylinders'] == 5]
mpg6cyl = df['mpg'][df['cylinders'] == 6]
mpg8cyl = df['mpg'][df['cylinders'] == 8]
cyl = [mpg3cyl, mpg4cyl, mpg5cyl, mpg6cyl, mpg8cyl]
bplot = plt.boxplot(cyl,
                    notch=False,        # no notches
                    vert=True,          # vertical box alignment
                    patch_artist=True,  # fill with color
                    labels=['3 Cyl', '4 Cyl', '5 Cyl', '6 Cyl', '8 Cyl'])  # x-tick labels
for patch, color in zip(bplot['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.5)
for line in bplot['medians']:
    line.set_color('k')
plt.xlabel("Number of Engine Cylinders")
plt.ylabel("MPG")
plt.title("Engine Cylinders vs. MPG")
plt.show()
What is going on with 3 cylinder vehicles? Why do they break the pattern of fuel economy decreasing as the number of cylinders increases?
In [23]:
df3cyl = df.loc[df['cylinders'] == 3]
print ' 3 cylinder cars'
print df3cyl['car_name']
The 3 cylinder cars are all Mazdas with rotary engines. These actually have 3 chambers per rotor and 2 rotors per engine, so 3 x 2 = 6. As the box plot shows, they do indeed appear to share fuel economy characteristics with 6 cylinder engines. In a machine learning model it might be best to recode these as 6 cylinders, especially since there are so few of them.
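As a minimal sketch of that substitution (the cylinders_eff column name is my own, not part of the data set):
df['cylinders_eff'] = df['cylinders'].replace(3, 6)  # treat the rotaries as 6 cylinder cars
print df['cylinders_eff'].value_counts().sort_index()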
Let's try a plot of vehicle weight vs. fuel economy (mpg) with a regression line added.
In [24]:
# fit the line
z = np.polyfit(df['weight'], df['mpg'], 1)
y_poly = [z[0] * x + z[1] for x in range(1500, 5600, 100)]
x_poly = [x for x in range(1500, 5600, 100)]
# plot
plt.plot(df['weight'], df['mpg'], ".", color=colors[0])
plt.plot(x_poly, y_poly, "-", color=colors[1])
plt.xlim(1500, 5500)
plt.ylim(0, 50)
plt.xlabel("Vehicle Weight")
plt.ylabel("MPG")
plt.title("Weight vs. MPG")
plt.show()
plt.close()
It appears that the relationship is not really linear, so try a second order polynomial fit.
In [25]:
# second order fit
z = np.polyfit(df['weight'], df['mpg'], 2)
y_poly2 = [z[0] * x ** 2 + z[1] * x + z[2] for x in range(1500, 5600, 100)]
# plot it
plt.plot(df['weight'], df['mpg'], ".", color=colors[0])
plt.plot(x_poly, y_poly2, "-", color=colors[1])
plt.xlim(1500, 5500)
plt.ylim(0, 50)
plt.xlabel("Vehicle Weight")
plt.ylabel("MPG")
plt.title("Weight vs. MPG")
plt.show()
plt.close()
The second order polynomial does appear to fit better than the first order (straight line) fit. Hence, a machine learning model should either capture the higher order effect with squared terms (and possibly interactions), or use a method that copes well with non-linearity, such as Random Forests or similar tree-based algorithms.
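As a hedged sketch of the squared-term idea (the engineered column names below are hypothetical, not from the data set):
df['weight_sq'] = df['weight'] ** 2  # squared term to capture the curvature
df['weight_x_cyl'] = df['weight'] * df['cylinders']  # one possible interaction term
print df[['weight', 'weight_sq', 'weight_x_cyl']].head()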
Here is a more complex scatter plot showing the results for weight vs. mpg broken out by the number of cylinders in the engine.
In [26]:
markers = ('s', 'D')
cyls = pd.unique(df.cylinders.values)
cyls = np.sort(cyls)
for i, cyl in enumerate(cyls):
    if cyl == 3 or cyl == 5:
        s = 16          # larger diamond markers for the rare 3 and 5 cylinder cars
        m = markers[1]
    else:
        s = 4
        m = markers[0]
    # note: colors[i - 1] wraps around to the last palette color when i == 0
    plt.scatter(df['weight'].loc[df['cylinders'] == cyl],
                df['mpg'].loc[df['cylinders'] == cyl],
                s=s, c=colors[i - 1], marker=m, label='%d Cylinders' % cyl)
plt.plot(x_poly, y_poly, "-", color=colors[5])
plt.plot(x_poly, y_poly2, "-", color=colors[6])
plt.xlim(1500, 5500)
plt.ylim(0, 50)
plt.xlabel("Vehicle Weight")
plt.ylabel("MPG")
plt.title("Weight vs. MPG")
plt.legend()
plt.show()
plt.close()
We can observe that more cylinders generally go with greater weight and lower mpg. Multi-collinearity could therefore be a problem in this data set, here between weight and cylinders.
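One quick way to gauge this is a Pearson correlation matrix over the suspect predictors (a minimal sketch):
print df[['weight', 'cylinders', 'displacement', 'horsepower']].corr()
Pairwise correlations close to 1 would confirm the collinearity suspicion.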
Next, a look at displacement vs. horsepower, another possible source of multi-collinearity, since the relationship is roughly linear.
In [27]:
z = np.polyfit(df['displacement'], df['horsepower'], 1)
y_poly = [z[0] * x + z[1] for x in range(0, 600, 50)]
x_poly = [x for x in range(0, 600, 50)]
plt.plot(df['displacement'], df['horsepower'], '.', color=colors[0])
plt.plot(x_poly, y_poly, '-', color=colors[1])
plt.xlim(0, 500)
plt.ylim(0, 300)
plt.xlabel("Displacement")
plt.ylabel("Horsepower")
plt.title("Displacement vs. Horsepower")
plt.show()
plt.close()
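Multi-collinearity could also be quantified with variance inflation factors (VIFs). A minimal sketch using statsmodels, assuming these four predictors are the ones of interest:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = ['weight', 'cylinders', 'displacement', 'horsepower']
X = sm.add_constant(df[predictors].values.astype(float))  # prepend an intercept column
for i, name in enumerate(predictors):
    # column i + 1 in X, since column 0 is the constant
    print 'VIF for %s: %.2f' % (name, variance_inflation_factor(X, i + 1))
VIFs well above 5 or 10 are commonly read as a sign of strong collinearity.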
Mean mpg by the region of origin of a vehicle, with the origin code translated to a more human readable format. Either origin itself is related to fuel economy, or Japan and Europe simply produced lighter vehicles than the US in this period (1970s to 1980s).
In [28]:
df['origin_name'] = df['origin'] - 1  # origin codes are 1-3; shift to 0-based category codes
df['origin_name'] = df['origin_name'].astype('category')
df['origin_name'] = pd.Series(pd.Categorical.from_codes(df['origin_name'], categories=['US', 'EUR', 'JPN']))
mean_mpg = df.groupby('origin_name', as_index=False)['mpg'].mean()
print 'Mean value of mpg, by vehicle origin:'
print mean_mpg.to_string(index=False)
Box plot for fuel economy by vehicle origin.
In [29]:
mpg_usa = df['mpg'][df['origin_name'] == 'US']
mpg_eur = df['mpg'][df['origin_name'] == 'EUR']
mpg_jpn = df['mpg'][df['origin_name'] == 'JPN']
origin_mpg = (mpg_usa, mpg_eur, mpg_jpn)
bplot = plt.boxplot(origin_mpg,
                    notch=False,        # no notches
                    vert=True,          # vertical box alignment
                    patch_artist=True,  # fill with color
                    labels=['USA', 'EUR', 'JPN'])  # x-tick labels
for patch, color in zip(bplot['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.5)
for line in bplot['medians']:
    line.set_color('k')
plt.xlabel("Vehicle Origin")
plt.ylabel("MPG")
plt.title("Vehicle Origin vs. MPG")
plt.show()
plt.close()
A scatter plot of mpg vs. weight with 2nd order regression curves fitted separately by vehicle origin. It shows that vehicle origin may have some effect on fuel economy even after vehicle weight is considered. The relationship is complex: the lighter Japanese cars have higher fuel economy, and the heavier European cars have higher fuel economy than comparable American cars.
In [30]:
df_usa = df.loc[df['origin_name'] == 'US']
z = np.polyfit(df_usa['weight'], df_usa['mpg'], 2)
x_poly_usa = [x for x in range(np.min(df_usa['weight']), np.max(df_usa['weight']) + 100, 100)]
y_poly_usa = [z[0] * x ** 2 + z[1] * x + z[2] for x in x_poly_usa]
df_eur = df.loc[df['origin_name'] == 'EUR']
z = np.polyfit(df_eur['weight'], df_eur['mpg'], 2)
x_poly_eur = [x for x in range(np.min(df_eur['weight']), np.max(df_eur['weight']) + 100, 100)]
y_poly_eur = [z[0] * x ** 2 + z[1] * x + z[2] for x in x_poly_eur]
df_jpn = df.loc[df['origin_name'] == 'JPN']
z = np.polyfit(df_jpn['weight'], df_jpn['mpg'], 2)
x_poly_jpn = [x for x in range(np.min(df_jpn['weight']), np.max(df_jpn['weight']) + 100, 100)]
y_poly_jpn = [z[0] * x ** 2 + z[1] * x + z[2] for x in x_poly_jpn]
plt.plot(df['weight'], df['mpg'], ".", color=colors[0], alpha=0.33)
plt.plot(x_poly_usa, y_poly_usa, "-", color=colors[1])
plt.plot(x_poly_eur, y_poly_eur, "-", color=colors[2])
plt.plot(x_poly_jpn, y_poly_jpn, "-", color=colors[4])
plt.xlim(1500, 5500)
plt.ylim(0, 50)
plt.xlabel("Vehicle Weight")
plt.ylabel("MPG")
plt.title("Weight vs. MPG, Fitted by Vehicle Origin")
plt.show()
plt.close()
In the plot above, the orange line shows the weight vs. mpg fit for US cars, covering the weight range of US cars. The green line is the same for European cars and the brown line for Japanese cars, each curve covering the weight range for its respective origin code.
Can we verify statistically what the box plot for mpg by cylinders appears to show, that cylinders are significant as a predictor of fuel economy? I will work with only the 4, 6 and 8 cylinder cars here since 3 and 5 cylinder cars are rare.
Set up a data frame for 4, 6 and 8 cylinder vehicles. Leave out diesel vehicles since they are likely to be outliers and there are not many of them. Thus, this checks the effect of cylinders for gasoline engine cars only.
In [31]:
df468 = df.loc[df.cylinders != 3]
df468 = df468.loc[df468.cylinders != 5]
df468 = df468.loc[df468.diesel != 1]  # filter on df468, not df, to keep the mask aligned
Check whether the mpg data are normally distributed. The test fails for all three cylinder counts: note the p-values are far below the usual significance level of 0.05.
In [32]:
s, p = stats.normaltest(df468['mpg'].loc[df468['cylinders'] == 4])
print 'Normality test for mpg with 4 cylinders, p-value: %.4f' % p
s, p = stats.normaltest(df468['mpg'].loc[df468['cylinders'] == 6])
print 'Normality test for mpg with 6 cylinders, p-value: %.4f' % p
s, p = stats.normaltest(df468['mpg'].loc[df468['cylinders'] == 8])
print 'Normality test for mpg with 8 cylinders, p-value: %.4f' % p
Histograms show skewed distributions.
In [33]:
plt.hist(df468['mpg'].loc[df468['cylinders'] == 4], bins=32)
plt.show()
plt.close()
plt.hist(df468['mpg'].loc[df468['cylinders'] == 6], bins=32)
plt.show()
plt.close()
plt.hist(df468['mpg'].loc[df468['cylinders'] == 8], bins=32)
plt.show()
plt.close()
Even with non-normal data, Levene's test can be used to check for equal variances. It shows low p-values for the 4 cylinder vehicles against both the 6 and 8 cylinder vehicles, so we cannot assume equal variances going forward.
In [34]:
W, p = stats.levene(df468['mpg'].loc[df468['cylinders'] == 4], df468['mpg'].loc[df468['cylinders'] == 6])
print 'Levene test for equal variances, 4 vs 6 cylinder vehicles, p-value: %.4f' % p
W, p = stats.levene(df468['mpg'].loc[df468['cylinders'] == 4], df468['mpg'].loc[df468['cylinders'] == 8])
print 'Levene test for equal variances, 4 vs 8 cylinder vehicles, p-value: %.4f' % p
W, p = stats.levene(df468['mpg'].loc[df468['cylinders'] == 6], df468['mpg'].loc[df468['cylinders'] == 8])
print 'Levene test for equal variances, 6 vs 8 cylinder vehicles, p-value: %.4f' % p
The Kruskal-Wallis test, a non-parametric alternative to one-way ANOVA, can be used when the conditions of normality and equal variances are not satisfied. The p-value is near zero, so cylinders do seem to have an effect on mpg, though we may need to consider multi-collinearity with other variables such as weight and displacement.
In [35]:
grps = pd.unique(df468.cylinders.values)
d_data = {grp: df468['mpg'][df468.cylinders == grp] for grp in grps}
kruskal = stats.kruskal(d_data[4], d_data[6], d_data[8])
print 'Kruskal-Wallis H-statistic: %.4f, p-value: %.4g' % kruskal
Mood's Median Test is another way of checking whether the number of cylinders is predictive of mpg; put another way, whether automobiles with differing numbers of cylinders have significantly different median mpg. This compares the 4, 6 and 8 cylinder cars.
In [36]:
stat, p, med, tbl = stats.median_test(d_data[4], d_data[6], d_data[8])
print "Mood's median p-value: %.4f" % p
print "Mood's median value: %.4f (median mpg)" % med
We can use the statsmodels multipletests function to correct the p-values for the three test pairs so that we can see which pairs are significant. This adjustment gives a better result than simple pairwise comparison because it controls the family-wise error rate across the multiple tests. Here the corrected p-values are all effectively zero, so each combination of engine cylinders has a significantly different median mpg.
In [37]:
stat, p46, med, tbl = stats.median_test(d_data[4], d_data[6])
stat, p48, med, tbl = stats.median_test(d_data[4], d_data[8])
stat, p68, med, tbl = stats.median_test(d_data[6], d_data[8])
rej, pvc, aS, aB = mc.multipletests((p46, p48, p68))
print 'Corrected p-values:', pvc
In conclusion, this partial EDA of the UCI Auto MPG Data Set suggests that a machine learning model may need to deal with several issues: non-linearity in some of the relationships and potential multi-collinearity among predictors.
Citation: Auto MPG Data Set (1993). UCI Machine Learning Repository. Retrieved from: https://archive.ics.uci.edu/ml/datasets/auto+mpg