See also: Scikit-Learn Quick Start
See also: Scikit-Learn Tutorials
See also: Scikit-Learn User Guide
See also: Scikit-Learn API Reference
See also: Scikit-Learn Support Vector Machines
See also: Scikit-Learn Nearest Neighbors
In [2]:
    
# 1. Import sklearn, import pandas as pd, and pd.read_csv the CFPB CSV file into dataframe 'df'.
import sklearn
import pandas as pd
df = pd.read_csv('data/cfpb_complaints_with_fictitious_data.csv')
    
In [3]:
    
# 2. Filter your df down to 'Product', 'Consumer Claim', 'Amount Received' using [[]] notation. Which is our target?
df = df[['Product', 'Consumer Claim', 'Amount Received']]
df.head(5) # Our target is "Product"
    
    Out[3]:
In [6]:
    
# 3. From sklearn.model_selection import train_test_split. Make a train/test split 80/20 (we won't use it though).
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=.2)
print(len(train))
print(len(test))
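
The split above is not used again in this exercise. As a minimal sketch of how a held-out split is typically used, the block below fits a scaler/KNN pipeline on the training rows and scores it on the test rows. The data comes from `make_blobs` as a synthetic stand-in (an assumption, not the CFPB file), and the names `X_demo`, `y_demo`, and `demo_pipe` are hypothetical.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 2 numeric features, 3 classes.
X_demo, y_demo = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)

# 80/20 split; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

demo_pipe = Pipeline([('scaler', StandardScaler()),
                      ('classifier', KNeighborsClassifier())])
demo_pipe.fit(X_train, y_train)

# Accuracy on rows the pipeline never saw during fitting.
test_accuracy = demo_pipe.score(X_test, y_test)
print(len(X_train), len(X_test))
print(test_accuracy)
```

Scoring on held-out rows gives a single estimate of generalization; the cross-validation step later in the exercise averages several such estimates.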
    
    
In [7]:
    
# 4. Assign df[['Consumer Claim', 'Amount Received']] to 'X'
X = df[['Consumer Claim', 'Amount Received']]
    
In [8]:
    
# 5. Convert df['Product'] to raw values with .values and assign to 'y'
y = df['Product'].values
    
In [9]:
    
# 6. From sklearn.preprocessing import StandardScaler. From sklearn.pipeline import Pipeline.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
    
In [11]:
    
# 7. From sklearn.neighbors import KNeighborsClassifier. Make a scaler/knn pipeline.
from sklearn.neighbors import KNeighborsClassifier
pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', KNeighborsClassifier())])
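
KNN classifies by distance, so a feature measured on a larger scale can dominate the neighbor search unless the features are rescaled first. The sketch below shows what the scaler step does on its own: each column ends up with mean 0 and standard deviation 1. The array `demo` holds made-up values (an assumption) shaped like the prediction inputs used in this exercise.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values (assumption): two numeric columns on different scales.
demo = np.array([[100.0, 80.0], [5000.0, 4000.0], [350.0, 900.0]])

# fit_transform learns each column's mean and standard deviation, then rescales.
scaled = StandardScaler().fit_transform(demo)

col_means = scaled.mean(axis=0)  # approximately 0 for every column
col_stds = scaled.std(axis=0)    # approximately 1 for every column
print(col_means)
print(col_stds)
```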
    
In [12]:
    
# 8. Fit your pipeline with your X and y.
pipe.fit(X, y)
    
    Out[12]:
In [13]:
    
# 9. Use your newly fitted pipeline to predict classifications for [[100, 80], [5000, 4000], [350, 900]] .
pipe.predict([[100, 80], [5000, 4000], [350, 900]])
    
    Out[13]:
In [16]:
    
# 10. From sklearn.model_selection import cross_val_score. Run cross val score on your pipeline.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y)
scores
    
    Out[16]:
In [15]:
    
# 11. Get the mean of cross validation scores from your pipeline.
score_mean = scores.mean()
score_mean
    
    Out[15]:
In [ ]:
    
# 12. Now repeat with a Support Vector Machine classifier (sklearn.svm.SVC) pipeline. Which yields better results?
from sklearn.svm import SVC
pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', SVC())])
pipe.fit(X, y)
print(pipe.predict([[100, 80], [5000, 4000], [350, 900]]))
scores = cross_val_score(pipe, X, y)
print(scores)
print(scores.mean())
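
To answer "which yields better results?" fairly, it helps to score both pipelines on identical folds. The block below is one possible sketch of that comparison, not the exercise's required solution. It uses synthetic `make_blobs` data as a stand-in (an assumption), and the names `folds`, `candidate`, and `mean_scores` are hypothetical.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 150 rows, 2 numeric features, 3 classes.
X_demo, y_demo = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)

# Fixed, seeded folds so both classifiers are scored on identical splits.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_scores = {}
for name, clf in [('knn', KNeighborsClassifier()), ('svc', SVC())]:
    candidate = Pipeline([('scaler', StandardScaler()), ('classifier', clf)])
    # cross_val_score refits a fresh copy of the pipeline on each fold.
    mean_scores[name] = cross_val_score(candidate, X_demo, y_demo, cv=folds).mean()

print(mean_scores)
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no statistics from the held-out fold leak into the scaling step.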