See also: Scikit-Learn Quick Start
See also: Scikit-Learn Tutorials
See also: Scikit-Learn User Guide
See also: Scikit-Learn API Reference
See also: Scikit-Learn Support Vector Machines
See also: Scikit-Learn Nearest Neighbors
In [2]:
# 1. Import sklearn, import pandas as pd, and pd.read_csv the CFPB CSV file into dataframe 'df'.
import sklearn
import pandas as pd
df = pd.read_csv('data/cfpb_complaints_with_fictitious_data.csv')
In [3]:
# 2. Filter your df down to 'Product', 'Consumer Claim', 'Amount Received' using [[]] notation. Which is our target?
df = df[['Product', 'Consumer Claim', 'Amount Received']]
df.head(5) # Our target is "Product"
Out[3]:
In [6]:
# 3. From sklearn.model_selection import train_test_split (the old sklearn.cross_validation module was removed in scikit-learn 0.20). Make a train/test split 80/20 (we won't use it though).
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=.2)
print(len(train))
print(len(test))
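The split above is shuffled at random, so the rows that land in each part change from run to run. A minimal sketch of a reproducible, class-balanced split follows. It assumes a small synthetic dataframe (`demo_df`, with made-up values) in place of the CFPB file, and the `random_state` and `stratify` settings are illustrative choices rather than part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the CFPB dataframe: synthetic values, not real complaints.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'Product': rng.choice(['Mortgage', 'Credit card', 'Student loan'], size=100),
    'Consumer Claim': rng.uniform(50, 5000, size=100),
    'Amount Received': rng.uniform(0, 4000, size=100),
})

# random_state fixes the shuffle; stratify keeps the class mix similar in both parts.
demo_train, demo_test = train_test_split(
    demo_df, test_size=0.2, random_state=42, stratify=demo_df['Product'])

print(len(demo_train), len(demo_test))
```

With 100 rows and `test_size=0.2`, the split holds out 20 rows and keeps 80 for training.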
In [7]:
# 4. Assign df[['Consumer Claim', 'Amount Received']] to 'X'
X = df[['Consumer Claim', 'Amount Received']]
In [8]:
# 5. Convert to raw values df['Product'].values and assign to 'y'
y = df['Product'].values
In [9]:
# 6. From sklearn.preprocessing import StandardScaler. From sklearn.pipeline import Pipeline.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
In [11]:
# 7. From sklearn.neighbors import KNeighborsClassifier. Make a scaler/knn pipeline.
from sklearn.neighbors import KNeighborsClassifier
pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', KNeighborsClassifier())])
In [12]:
# 8. Fit your pipeline with your X and y.
pipe.fit(X, y)
Out[12]:
In [13]:
# 9. Use your newly fitted pipeline to predict classifications for [[100, 80], [5000, 4000], [350, 900]] .
pipe.predict([[100, 80], [5000, 4000], [350, 900]])
Out[13]:
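When a pipeline is fitted on a dataframe, recent scikit-learn releases remember the column names and warn if `predict` later receives a bare list. One way to avoid that is to wrap the new samples in a dataframe with the same columns. The sketch below assumes synthetic data and made-up `'high'`/`'low'` labels (`demo_X`, `demo_y`) in place of the CFPB columns.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: two numeric features and a made-up two-class label.
rng = np.random.default_rng(0)
cols = ['Consumer Claim', 'Amount Received']
demo_X = pd.DataFrame(rng.uniform(0, 5000, size=(60, 2)), columns=cols)
demo_y = np.where(demo_X['Consumer Claim'] > 2500, 'high', 'low')

demo_pipe = Pipeline([('scaler', StandardScaler()),
                      ('classifier', KNeighborsClassifier())])
demo_pipe.fit(demo_X, demo_y)

# New samples carry the same column names as the training data.
new_claims = pd.DataFrame([[100, 80], [5000, 4000], [350, 900]], columns=cols)
demo_pred = demo_pipe.predict(new_claims)
print(demo_pred)

# The fitted scaler is reachable by its step name, e.g. to inspect the learned means.
print(demo_pipe.named_steps['scaler'].mean_)
```

The pipeline applies the scaler's stored mean and variance to `new_claims` before the classifier sees them, so the new rows do not need to be scaled by hand.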
In [16]:
# 10. From sklearn.model_selection import cross_val_score (formerly in sklearn.cross_validation). Run cross val score on your pipeline.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y)
scores
Out[16]:
In [15]:
# 11. Get the mean of cross validation scores from your pipeline.
score_mean = scores.mean()
score_mean
Out[15]:
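Called without a `cv` argument, `cross_val_score` uses whatever number of folds the installed release defaults to, so the length of `scores` can differ between environments. A sketch that pins the folds explicitly is shown below. It assumes synthetic data from `make_classification` in place of `X` and `y`, and the five shuffled folds are an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for X and y (hypothetical, not the CFPB file).
demo_X, demo_y = make_classification(n_samples=200, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

demo_pipe = Pipeline([('scaler', StandardScaler()),
                      ('classifier', KNeighborsClassifier())])

# Shuffled, stratified 5-fold split with a fixed seed so repeated runs agree.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
demo_scores = cross_val_score(demo_pipe, demo_X, demo_y, cv=folds, scoring='accuracy')

print(demo_scores)
print(demo_scores.mean(), demo_scores.std())
```

Reporting the standard deviation next to the mean gives a rough sense of how much the score moves from fold to fold.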
In [ ]:
# 12. Now repeat with Support Vector Machine Classifier (sklearn.svm.SVC) pipeline. Which yields better results?
from sklearn.svm import SVC
pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', SVC())])
pipe.fit(X, y)
print(pipe.predict([[100, 80], [5000, 4000], [350, 900]]))
scores = cross_val_score(pipe, X, y)
print(scores)
print(scores.mean())
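To answer "which yields better results?" fairly, both classifiers should be scored on the same folds. The sketch below loops over the two pipelines with a shared splitter. It assumes synthetic data in place of `X` and `y`, and the names `candidates` and `mean_scores` are hypothetical helpers, not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-feature data standing in for X and y (hypothetical, not the CFPB file).
demo_X, demo_y = make_classification(n_samples=200, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

# One splitter with a fixed seed, so both models are scored on identical folds.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {'knn': KNeighborsClassifier(), 'svc': SVC()}
mean_scores = {}
for name, estimator in candidates.items():
    candidate_pipe = Pipeline([('scaler', StandardScaler()),
                               ('classifier', estimator)])
    mean_scores[name] = cross_val_score(candidate_pipe, demo_X, demo_y, cv=folds).mean()

best = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print('best:', best)
```

On real data the gap between the two means may be smaller than the fold-to-fold spread, in which case neither model is clearly ahead.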