In [ ]:
import pandas as pd
import time
Grab the DLA HAR dataset from:
After extracting it, load the dataset into a dataframe named X and do your regular dataframe examination:
In [ ]:
# .. your code here ..
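A hedged sketch of the load step. The real filename and field separator depend on the archive you extracted (some distributions of this dataset use `;` as the separator, so check the raw file first); the in-memory buffer below is a hypothetical stand-in so the pattern runs as-is:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the extracted CSV; swap the buffer
# for your real file path. Note the ';' separator assumption.
sample = io.StringIO("user;gender;x1\ndebora;Woman;-3\njose;Man;5\n")
X = pd.read_csv(sample, sep=';')

# Regular dataframe examination
print(X.head())
print(X.describe())
```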
Encode the gender column such that 0 is male and 1 is female:
In [ ]:
# .. your code here ..
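One common approach is `Series.map`. The label strings below are assumptions; the raw file may spell them differently (e.g. 'Man'/'Woman'), so check `X.gender.unique()` first:

```python
import pandas as pd

# Tiny demo frame; replace the literal labels with whatever
# X.gender.unique() actually shows in your data.
X = pd.DataFrame({'gender': ['male', 'female', 'female']})
X['gender'] = X['gender'].map({'male': 0, 'female': 1})
print(X['gender'].tolist())
```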
Clean up any columns with commas in them so that they're properly represented as decimals:
In [ ]:
# .. your code here ..
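One way to sketch the cleanup is vectorized `str.replace`. The column name `z4` here is hypothetical; apply the same idea to whichever columns actually contain comma decimals:

```python
import pandas as pd

# Demo column with comma decimals, mimicking the problem.
X = pd.DataFrame({'z4': ['1,5', '-2,3']})

# Swap ',' for '.' so the strings parse as decimals later.
X['z4'] = X['z4'].str.replace(',', '.', regex=False)
print(X['z4'].tolist())
```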
Let's take a peek at your data types:
In [ ]:
X.dtypes
Convert any column that needs it to numeric, using errors='raise'. This will alert you if something ends up being problematic.
In [ ]:
# .. your code here ..
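A minimal sketch of the conversion, continuing the hypothetical `z4` column: after the comma cleanup the values are still strings, and `pd.to_numeric` with `errors='raise'` throws on anything it cannot parse.

```python
import pandas as pd

# Still string dtype after the comma cleanup...
X = pd.DataFrame({'z4': ['1.5', '-2.3']})

# ...so convert, raising loudly on any unparseable value.
X['z4'] = pd.to_numeric(X['z4'], errors='raise')
print(X.dtypes)
```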
If you find any problematic records, drop them before calling the to_numeric method above.
Okay, now encode your y value as a Pandas dummies version of your dataset's class column:
In [ ]:
# .. your code here ..
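A sketch using `pd.get_dummies` on a demo frame; the class labels below are hypothetical stand-ins for the dataset's actual activity classes:

```python
import pandas as pd

# Hypothetical class labels for illustration only.
X = pd.DataFrame({'class': ['sitting', 'walking', 'sitting']})

# One indicator column per class label.
y = pd.get_dummies(X['class'])
print(y.columns.tolist())
```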
In fact, get rid of the user and class columns:
In [ ]:
# .. your code here ..
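Dropping both columns in one call might look like this (demo frame; the extra `x1` column is a stand-in for your feature columns):

```python
import pandas as pd

X = pd.DataFrame({'user': ['a', 'b'],
                  'class': ['sitting', 'walking'],
                  'x1': [1, 2]})

# Remove the identifier and label columns, keeping only features.
X = X.drop(columns=['user', 'class'])
print(X.columns.tolist())
```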
Let's take a look at your handiwork:
In [ ]:
X.describe()
You can also easily display which rows have NaNs in them, if any:
In [ ]:
X[pd.isnull(X).any(axis=1)]
Create a random forest classifier (RandomForestClassifier) named model, and set n_estimators=30, max_depth=10, oob_score=True, and random_state=0:
In [ ]:
# .. your code here ..
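A sketch of the constructor call, assuming scikit-learn's `RandomForestClassifier`:

```python
from sklearn.ensemble import RandomForestClassifier

# All four parameters come straight from the instructions above.
model = RandomForestClassifier(n_estimators=30,
                               max_depth=10,
                               oob_score=True,
                               random_state=0)
```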
Split your data into test / train sets. Your test size can be 30%, with random_state 7. Use variable names: X_train, X_test, y_train, and y_test:
In [ ]:
# .. your code here ..
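A sketch of the split, assuming scikit-learn's `train_test_split`; the arrays below are synthetic stand-ins so the pattern runs here, but in the lab you pass your real X and y:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2

# 30% held out for testing, reproducible via random_state=7.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7)
print(len(X_train), len(X_test))
```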
In [ ]:
print("Fitting...")
s = time.time()
# TODO: train your model on your training set
# .. your code here ..
print("Fitting completed in: ", time.time() - s)
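A hedged sketch of the fit step, run on synthetic stand-in data so it executes end to end; in the lab, call `fit` with your real X_train and y_train instead:

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in training data (120 samples, 4 features, 2 classes).
rng = np.random.RandomState(0)
X_train = rng.rand(120, 4)
y_train = rng.randint(0, 2, 120)

model = RandomForestClassifier(n_estimators=30, max_depth=10,
                               oob_score=True, random_state=0)
print("Fitting...")
s = time.time()
model.fit(X_train, y_train)
print("Fitting completed in: ", time.time() - s)
```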
Display the OOB score of your model:
In [ ]:
score = model.oob_score_
print("OOB Score: ", round(score*100, 3))
In [ ]:
print("Scoring...")
s = time.time()
# TODO: score your model on your test set
# .. your code here ..
print("Score: ", round(score*100, 3))
print("Scoring completed in: ", time.time() - s)
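A sketch of the scoring step on synthetic stand-in data (`model.score` returns mean accuracy); in the lab, score your already-fitted model on your real X_test and y_test:

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in train/test data for illustration only.
rng = np.random.RandomState(0)
X_tr, y_tr = rng.rand(100, 4), rng.randint(0, 2, 100)
X_test, y_test = rng.rand(30, 4), rng.randint(0, 2, 30)

model = RandomForestClassifier(n_estimators=30, max_depth=10, random_state=0)
model.fit(X_tr, y_tr)

print("Scoring...")
s = time.time()
score = model.score(X_test, y_test)  # mean accuracy on the test set
print("Score: ", round(score * 100, 3))
print("Scoring completed in: ", time.time() - s)
```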
At this point, go ahead and answer the lab questions, then return here to experiment more --
Try playing around with the gender column. For example, encode gender as Male:1 and Female:0. Also try encoding it as a Pandas dummies variable and see what effect that has. You can also try dropping gender entirely from the dataframe. How does that change the model's score? This will be a key insight into how your feature encoding alters your overall scoring, and why it's important to choose good encodings.
In [ ]:
# .. your code changes above ..
After that, try messing with y. Right now it's encoded with dummies, but try other encoding methods to see what effects they have.
In [ ]: