by Alejandro Correa Bahnsen, Iván Torroledo, and Jesus Solano
version 1.5, February 2019
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Kevin Markham.
In [2]:
import pandas as pd
url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/titanic.csv.zip'
titanic = pd.read_csv(url, index_col=0)
titanic.head()
Out[2]:
In [2]:
# check for missing values
titanic.isnull().sum()
Out[2]:
One possible strategy is to drop missing values:
In [3]:
# drop rows with any missing values
titanic.dropna().shape
Out[3]:
In [4]:
# drop rows where Age is missing
titanic[titanic.Age.notnull()].shape
Out[4]:
Sometimes a better strategy is to impute missing values:
In [5]:
# mean Age
titanic.Age.mean()
Out[5]:
In [6]:
# median Age
titanic.Age.median()
Out[6]:
In [7]:
titanic.loc[titanic.Age.isnull()]
Out[7]:
In [8]:
# fill missing values for Age with the median age
titanic['Age'] = titanic.Age.fillna(titanic.Age.median())
Another strategy would be to build a KNN model just to impute missing values. How would we do that?
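A hedged sketch of that idea, assuming a recent scikit-learn that provides KNNImputer (an illustration, not part of the original notebook): fill each missing Age with the mean Age of the k most similar rows, measured on the other numeric columns. Because Age was already filled with the median above, the sketch re-reads the raw data.
In [ ]:
# sketch (assumption): KNN-based imputation with scikit-learn's KNNImputer
# (requires scikit-learn >= 0.22); each missing Age becomes the mean Age of
# the 5 nearest rows, measured on the other numeric columns
from sklearn.impute import KNNImputer

titanic_raw = pd.read_csv(url, index_col=0)                  # re-read, Age was already filled above
numeric_cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']   # hypothetical choice of features
imputer = KNNImputer(n_neighbors=5)
titanic_raw[numeric_cols] = imputer.fit_transform(titanic_raw[numeric_cols])
titanic_raw.Age.isnull().sum()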
If values are missing from a categorical feature, we could treat the missing values as another category. Why might that make sense?
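A minimal sketch of that approach (again an illustration, not part of the original notebook): fill the missing Embarked values with an explicit 'Missing' label before creating dummy variables, so the model can learn from the missingness itself.
In [ ]:
# sketch (assumption): treat missing Embarked values as their own category
embarked_filled = titanic.Embarked.fillna('Missing')
pd.get_dummies(embarked_filled, prefix='Embarked').head()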
How do we choose between all of these strategies?
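One reasonable (hedged) answer: treat the choice as a modeling decision and compare strategies empirically, for example by cross-validating the same model under each imputation strategy while keeping everything else fixed. A rough sketch, using a hypothetical set of features:
In [ ]:
# sketch (assumption): compare imputation strategies by cross-validated accuracy
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

titanic_raw = pd.read_csv(url, index_col=0)
y_cv = titanic_raw.Survived
for label, fill in [('median', titanic_raw.Age.median()), ('mean', titanic_raw.Age.mean())]:
    X_cv = titanic_raw[['Pclass', 'Parch']].assign(Age=titanic_raw.Age.fillna(fill))
    print(label, cross_val_score(LogisticRegression(solver='liblinear'), X_cv, y_cv, cv=5).mean())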
In [10]:
titanic.head(10)
Out[10]:
In [11]:
# encode Sex_Female feature
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
In [12]:
# create a DataFrame of dummy variables for Embarked
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
# drop the first dummy column so the remaining dummies are not redundant (the dropped level becomes the baseline)
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)
# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)
In [13]:
titanic.head(1)
Out[13]:
In [14]:
# define X and y
feature_cols = ['Pclass', 'Parch', 'Age', 'Sex_Female', 'Embarked_Q', 'Embarked_S']
X = titanic[feature_cols]
y = titanic.Survived
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# train a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)
# make predictions for testing set
y_pred_class = logreg.predict(X_test)
# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.
Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
In [87]:
url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/agaricus-lepiota.zip'
data = pd.read_csv(url, index_col=None)
data = data.drop(['capcolor', 'stalkcolorabovering', 'odor', 'gillsize', 'sporeprintcolor', 'stalksurfaceabovering',
'ringtype', 'stalkroot', 'bruises'], axis=1)
data = data.sample(frac=1, random_state=42)
data.head()
Out[87]:
In [88]:
data.columns
Out[88]:
Attribute Information: (classes: edible=e, poisonous=p)
1. cap-shape: bell=b,conical=c,convex=x,flat=f,
knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,
pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,
musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,
green=r,orange=o,pink=p,purple=u,red=e,
white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e,
rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,
pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,
pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,
none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,
orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n,
scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p,
urban=u,waste=w,woods=d
In [89]:
y = (data['class'] == 'p') * 1.0
In [90]:
y.mean(), y.shape
Out[90]:
In [91]:
X = data.drop(['class'], axis=1)
In [92]:
X = pd.get_dummies(X)
X.head()
Out[92]:
In [93]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
In [94]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X, y, cv=10, scoring='accuracy')).describe()
Out[94]:
In [107]:
from sklearn.decomposition import PCA
In [108]:
X_ = PCA(n_components=8).fit_transform(X)
In [110]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()
Out[110]:
Note that PCA recovers at most min(num_observations, num_columns) components, so it is most useful for dimensionality reduction when num_columns < num_observations.
In [97]:
!pip install category_encoders
In [98]:
import category_encoders as ce
In [126]:
X_ = ce.BinaryEncoder().fit_transform(data.drop(['class'], axis=1))
In [127]:
X_.head()
Out[127]:
In [128]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()
Out[128]:
Feature Hashing for Large Scale Multitask Learning
https://alex.smola.org/papers/2009/Weinbergeretal09.pdf
Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case — multitask learning with hundreds of thousands of tasks.
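A toy illustration of the mechanism (an assumption for exposition, not necessarily how the encoder is implemented internally): each category string is hashed, and the hash value modulo the number of output columns decides which column that category lands in; collisions simply share a column.
In [ ]:
# toy illustration (assumption): the hashing trick maps a category to a column index
import hashlib

def hash_column(value, n_components=8):
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % n_components

hash_column('convex'), hash_column('flat'), hash_column('bell')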
In [111]:
X_ = ce.HashingEncoder(n_components=8).fit_transform(data.drop(['class'], axis=1))
In [112]:
X_.head()
Out[112]:
In [105]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()
Out[105]:
Helmert coding compares each level of a categorical variable to the mean of the subsequent levels. Hence, the first contrast compares the mean of the dependent variable for level 1 of race with the mean of all of the subsequent levels of race (levels 2, 3, and 4), the second contrast compares the mean of the dependent variable for level 2 of race with the mean of all of the subsequent levels of race (levels 3 and 4), and the third contrast compares the mean of the dependent variable for level 3 of race with the mean of all of the subsequent levels of race (level 4).
Gregory Carey (2003). Coding Categorical Variables
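A hedged toy example of what this produces: encoding a single four-level column with HelmertEncoder yields one contrast column per comparison described above (category_encoders may also add an intercept column).
In [ ]:
# toy example (assumption): Helmert contrasts for a single 4-level column
toy = pd.DataFrame({'level': ['a', 'b', 'c', 'd']})
ce.HelmertEncoder(cols=['level']).fit_transform(toy)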
In [113]:
X_ = ce.HelmertEncoder().fit_transform(data.drop(['class'], axis=1))
In [114]:
X_.head()
Out[114]:
In [115]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()
Out[115]:
In [122]:
X_ = ce.OrdinalEncoder().fit_transform(data.drop(['class'], axis=1))
In [123]:
X_.head()
Out[123]:
In [124]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()
Out[124]:
Polynomial contrast coding for the encoding of categorical features
Orthogonal polynomial coding is a form of trend analysis in that it is looking for the linear, quadratic and cubic trends in the categorical variable. This type of coding system should be used only with an ordinal variable in which the levels are equally spaced. Examples of such a variable might be income or education.
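As a hedged toy example, encoding one equally spaced four-level ordinal column with PolynomialEncoder produces linear, quadratic, and cubic contrast columns:
In [ ]:
# toy example (assumption): polynomial contrasts for one 4-level ordinal column
toy = pd.DataFrame({'grade': ['low', 'medium', 'high', 'very_high']})
ce.PolynomialEncoder(cols=['grade']).fit_transform(toy)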
In [119]:
X_ = ce.PolynomialEncoder().fit_transform(data.drop(['class'], axis=1))
In [120]:
X_.head()
Out[120]:
In [121]:
pd.Series(cross_val_score(RandomForestClassifier(n_estimators=10), X_, y, cv=10, scoring='accuracy')).describe()
Out[121]:
In [ ]: