See also: Scikit-Learn Quick Start
See also: Scikit-Learn Tutorials
See also: Scikit-Learn User Guide
See also: Scikit-Learn API Reference
See also: Scikit-Learn Support Vector Machines
See also: Scikit-Learn Nearest Neighbors
In [ ]:
# 1. Import sklearn, import pandas as pd, and pd.read_csv the CFPB CSV file into dataframe 'df'.
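# A minimal sketch for this cell (the CSV filename is an assumption; point it at your local CFPB export):
import sklearn
import pandas as pd

df = pd.read_csv('consumer_complaints.csv')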
In [ ]:
# 2. Filter your df down to 'Product', 'Consumer Claim', 'Amount Received' using [[]] notation. Which is our target?
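# One possible answer: keep only the three columns. 'Product' is the target we want to predict.
df = df[['Product', 'Consumer Claim', 'Amount Received']]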
In [ ]:
# 3. From sklearn.model_selection import train_test_split. Make an 80/20 train/test split (we won't use it, though).
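# A minimal sketch; random_state=42 is an arbitrary choice for reproducibility:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[['Consumer Claim', 'Amount Received']], df['Product'],
    test_size=0.2, random_state=42)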
In [ ]:
# 4. Assign df[['Consumer Claim', 'Amount Received']] to 'X'
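# Sketch: the two dollar-amount columns become the feature matrix.
X = df[['Consumer Claim', 'Amount Received']]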
In [ ]:
# 5. Convert to raw values df['Product'].values and assign to 'y'
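# Sketch: .values gives the raw NumPy array of product labels.
y = df['Product'].values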
In [ ]:
# 6. From sklearn.preprocessing import StandardScaler. From sklearn.pipeline import Pipeline.
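# Just the two imports requested here; they are used in the next step.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline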
In [ ]:
# 7. From sklearn.neighbors import KNeighborsClassifier. Make a scaler/KNN pipeline.
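# One way to build it (the step names 'scaler' and 'knn' are arbitrary labels):
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])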
In [ ]:
# 8. Fit your pipeline with your X and y.
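# Sketch, assuming X and y from steps 4-5 and the pipeline from step 7:
pipe.fit(X, y)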
In [ ]:
# 9. Use your newly fitted pipeline to predict classifications for [[100, 80], [5000, 4000], [350, 900]].
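# Sketch: each inner list is one (Consumer Claim, Amount Received) pair to classify.
pipe.predict([[100, 80], [5000, 4000], [350, 900]])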
In [ ]:
# 10. From sklearn.model_selection import cross_val_score. Run cross_val_score on your pipeline.
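# Sketch: cross-validate the whole pipeline (cv=5 is an arbitrary choice of fold count):
from sklearn.model_selection import cross_val_score

knn_scores = cross_val_score(pipe, X, y, cv=5)
knn_scores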
In [ ]:
# 11. Get the mean of the cross-validation scores from your pipeline.
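# Sketch: average the fold scores from step 10.
knn_scores.mean()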
In [ ]:
# 12. Now repeat with Support Vector Machine Classifier (sklearn.svm.SVC) pipeline. Which yields better results?
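# Sketch: same pipeline shape with SVC swapped in; compare its mean score to the KNN mean from step 11.
from sklearn.svm import SVC

svc_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC()),
])
svc_scores = cross_val_score(svc_pipe, X, y, cv=5)
svc_scores.mean()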