How to follow along:
git clone https://github.com/dataweekends/pyladies_intro_to_data_science
cd pyladies_intro_to_data_science
ipython notebook
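Note: on recent installations the notebook app ships with Jupyter rather than IPython, so if ipython notebook is not found, jupyter notebook starts the same interface.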
We start by importing the necessary libraries:
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Load the csv file into memory using Pandas
In [ ]:
df = pd.read_csv('iris-2-classes.csv')
What's the content of df?
In [ ]:
df.iloc[[0,1,98,99]]
Describe each attribute (is it discrete? is it continuous? is it a number? is it text?)
In [ ]:
df.info()
Quick stats on the features
In [ ]:
df.describe()
Ah, yes... describe() only covers the numeric columns. What about the type of Iris flower?
In [ ]:
df['iris_type'].value_counts()
Check if any values are missing
In [ ]:
df.info()
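df.info() reports the number of non-null entries per column. As a complementary check, a short sketch on the same DataFrame counts missing values per column directly; if the data is complete, every count comes back 0:
In [ ]:
# Count missing (NaN) values in each column
df.isnull().sum()
Now let's plot how Sepal Length is distributed for the two species: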
In [ ]:
df[df['iris_type'] == 'virginica']['sepal_length_cm'].plot(kind='hist', bins=10, range=(4, 7),
                                                            alpha=0.3, color='b')
df[df['iris_type'] == 'versicolor']['sepal_length_cm'].plot(kind='hist', bins=10, range=(4, 7),
                                                             alpha=0.3, color='g')
plt.title('Distribution of Sepal Length', size=20)
plt.xlabel('Sepal Length (cm)', size=20)
plt.ylabel('Number of flowers', size=20)
plt.legend(['Virginica', 'Versicolor'])
plt.show()
In [ ]:
plt.scatter(df[df['iris_type'] == 'virginica']['petal_length_cm'].values,
            df[df['iris_type'] == 'virginica']['sepal_length_cm'].values,
            label='Virginica', c='b', s=40)
plt.scatter(df[df['iris_type'] == 'versicolor']['petal_length_cm'].values,
            df[df['iris_type'] == 'versicolor']['sepal_length_cm'].values,
            label='Versicolor', c='r', marker='s', s=40)
plt.legend(loc=2)
plt.title('Iris Flowers', size=20)
plt.xlabel('Petal Length (cm)', size=20)
plt.ylabel('Sepal Length (cm)', size=20)
plt.show()
OK, so the flowers seem to have different characteristics.
Let's build a simple model to test that.
Define a new target column called target, like this:
iris_type == 'virginica' ===> target = 1
iris_type == 'versicolor' ===> target = 0
In [ ]:
df['target'] = df['iris_type'].map({'virginica': 1, 'versicolor': 0})
print(df[['iris_type', 'target']].head(2))
print()
print(df[['iris_type', 'target']].tail(2))
Define the simplest model as a benchmark
The simplest model is one that predicts 0 for every flower, i.e. all Versicolor.
How good is it?
In [ ]:
df['target'].value_counts()
If I predict every flower is Versicolor, I'm correct 50% of the time
We need to do better than that
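As a sanity check, the benchmark accuracy can be computed explicitly. A minimal sketch, using the target column defined above:
In [ ]:
# Accuracy of always predicting 0 (Versicolor): the fraction of rows where target == 0
(df['target'] == 0).mean()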
Define features (X) and target (y) variables
In [ ]:
X = df[['sepal_length_cm', 'sepal_width_cm',
'petal_length_cm', 'petal_width_cm']]
y = df['target']
Initialize a Decision Tree model
In [ ]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=0)
model
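If you are curious about the default hyperparameters (e.g. criterion, max_depth), they can be inspected with get_params(); a quick sketch:
In [ ]:
# Show the classifier's hyperparameters as a dict (all defaults except random_state)
model.get_params()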
Split the features and the target into Train and Test subsets.
The ratio should be 70/30.
In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3, random_state=0)
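To confirm the 70/30 split, a quick sketch checking the shapes of the resulting subsets:
In [ ]:
# The training set should hold roughly 70% of the rows and the test set roughly 30%
print(X_train.shape, X_test.shape)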
Train the model
In [ ]:
model.fit(X_train, y_train)
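A fitted decision tree also exposes feature_importances_, which hints at which measurements the splits rely on. A small sketch, assuming the column order of X:
In [ ]:
# Relative importance of each feature in the fitted tree (the values sum to 1)
for name, importance in zip(X.columns, model.feature_importances_):
    print("%s: %.2f" % (name, importance))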
Calculate the model score
In [ ]:
my_score = model.score(X_test, y_test)
print "Classification Score: %0.2f" % my_score
Print the confusion matrix
In [ ]:
from sklearn.metrics import confusion_matrix
y_pred = model.predict(X_test)
print "\n=======confusion matrix=========="
print confusion_matrix(y_test, y_pred)
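The raw matrix can be hard to read; wrapping it in a DataFrame with labelled rows (true class) and columns (predicted class) makes the layout explicit. A sketch, assuming the 0/1 encoding defined above:
In [ ]:
# Rows: true class, Columns: predicted class
pd.DataFrame(confusion_matrix(y_test, y_pred),
             index=['true versicolor (0)', 'true virginica (1)'],
             columns=['pred versicolor (0)', 'pred virginica (1)'])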