In [28]:
%matplotlib inline
import diogenes
import numpy as np
wine_data = diogenes.read.open_csv_url('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv',
delimiter=';')
We will then separate labels from features using :func:`diogenes.utils.remove_cols`.
In [29]:
labels = wine_data['quality']
M = diogenes.utils.remove_cols(wine_data, 'quality')
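For readers who want to see the same split without Diogenes, here is a minimal NumPy-only sketch. It assumes a toy structured array (the `toy` data and its values are made up for illustration) and uses `numpy.lib.recfunctions.drop_fields`, which may differ in detail from what `remove_cols` does internally.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Toy stand-in for wine_data; values are illustrative only
toy = np.array(
    [(7.0, 0.27, 6), (6.3, 0.30, 6), (8.1, 0.28, 5)],
    dtype=[('fixed_acidity', 'f8'), ('volatile_acidity', 'f8'), ('quality', 'i8')])

# Pull the label column out by name, then drop it from the feature array
toy_labels = toy['quality']
toy_features = rfn.drop_fields(toy, 'quality', usemask=False)
```

After this, `toy_features.dtype.names` no longer contains `'quality'`, while `toy_labels` is an ordinary 1-dimensional array.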
Finally, we alter labels to make this into a binary classification problem: each label becomes True when the wine's quality is below the average quality and False otherwise. (At this point, all Diogenes features are available for binary classification, but other kinds of ML have more limited support.)
In [30]:
labels = labels < np.average(labels)
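The thresholding step can be checked on a small hand-made array. The `quality` values below are hypothetical, chosen so the average is easy to verify by hand.

```python
import numpy as np

# Hypothetical quality scores; their average is exactly 6.0
quality = np.array([5, 6, 7, 6, 4, 8])
threshold = np.average(quality)

# True marks below-average quality; a score equal to the average is False
is_below_average = quality < threshold
```

Note that the comparison is strict, so wines sitting exactly at the average land in the False class.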
We can look at our summary statistics with :func:`diogenes.display.display.describe_cols`. Like most functions in Diogenes, describe_cols produces a NumPy structured array.
In [31]:
summary_stats = diogenes.display.describe_cols(M)
print(summary_stats.dtype)
In [32]:
print(summary_stats)
Default structured array printing makes it hard to tell which numbers belong to which statistics, so we provide :func:`diogenes.display.display.pprint_sa` to print small structured arrays more readably.
In [33]:
diogenes.display.pprint_sa(summary_stats)
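To make the idea of a summary stored as a structured array concrete, here is a rough sketch. `summarize_columns` and its field names are hypothetical and are not taken from Diogenes; the output of describe_cols may use different names and statistics.

```python
import numpy as np

def summarize_columns(sa):
    """Hypothetical helper: one row of statistics per column of a structured array."""
    out_dtype = [('column', 'U32'), ('count', 'i8'), ('mean', 'f8'),
                 ('std', 'f8'), ('min', 'f8'), ('max', 'f8')]
    rows = [(name, sa[name].size, sa[name].mean(), sa[name].std(),
             sa[name].min(), sa[name].max())
            for name in sa.dtype.names]
    return np.array(rows, dtype=out_dtype)

# Toy data, for illustration only
toy = np.array([(1.0, 10.0), (3.0, 20.0)], dtype=[('a', 'f8'), ('b', 'f8')])
toy_summary = summarize_columns(toy)
```

Each statistic can then be read by name, for example `toy_summary['mean']`, which is what makes structured arrays convenient for this kind of table.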
Similarly, we have a number of tools that visualize data. They all return figures, in case the user wants to save them or plot them later.
In [34]:
figure = diogenes.display.plot_correlation_matrix(M)
figure = diogenes.display.plot_correlation_scatter_plot(M)
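The numbers behind a correlation matrix plot can be sketched with plain NumPy. This assumes Pearson correlation between columns, which is an assumption about what the plot shows; the toy data is made up.

```python
import numpy as np

# Toy structured array; column b is exactly 2 * a
toy = np.array(
    [(1.0, 2.0, 9.0), (2.0, 4.0, 7.0), (3.0, 6.0, 8.0), (4.0, 8.0, 3.0)],
    dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

# np.corrcoef treats each row as one variable, so stack columns as rows
columns = np.vstack([toy[name] for name in toy.dtype.names])
corr = np.corrcoef(columns)
```

Here `corr[0, 1]` is 1 up to floating-point error because `b` is a linear function of `a`.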
There are also a number of tools for exploring the distribution of data in a single column (i.e., a 1-dimensional NumPy array).
In [35]:
chlorides = M['chlorides']
figure = diogenes.display.plot_box_plot(chlorides)
figure = diogenes.display.plot_kernel_density(chlorides)
figure = diogenes.display.plot_simple_histogram(chlorides)
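As a rough guide to what these single-column plots summarize, the quartiles drawn in a box plot and the bin counts of a histogram can be computed directly. The `sample` array and the choice of three bins are illustrative assumptions, not Diogenes defaults.

```python
import numpy as np

# Illustrative 1-dimensional column
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Quartiles: the box in a box plot spans q1 to q3, with a line at the median
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Histogram: counts per equal-width bin, plus the bin edges
counts, edges = np.histogram(sample, bins=3)
```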
In [36]:
diogenes.display.pprint_sa(diogenes.display.crosstab(np.round(chlorides, 1), labels))
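The cell above cross-tabulates rounded chloride values against the binary labels. A minimal NumPy sketch of such a contingency table follows; the data is made up, and the layout of the table that crosstab returns may differ.

```python
import numpy as np

# Made-up rounded values and binary labels
values = np.array([0.0, 0.1, 0.0, 0.1, 0.1, 0.2])
flags = np.array([True, False, True, True, False, False])

# Map each entry to the index of its distinct value
row_keys, row_idx = np.unique(values, return_inverse=True)
col_keys, col_idx = np.unique(flags, return_inverse=True)

# counts[i, j] = number of rows with value row_keys[i] and label col_keys[j]
counts = np.zeros((row_keys.size, col_keys.size), dtype=int)
np.add.at(counts, (row_idx, col_idx), 1)
```

The columns follow `col_keys`, which sorts False before True.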
First, we will arrange and execute a quick grid_search experiment with :class:`diogenes.grid_search.experiment.Experiment`. This will run Random Forest on our data with a number of different hyper-parameters and a number of different train/test splits. See the documentation for grid_search for more detail.
In [37]:
from sklearn.ensemble import RandomForestClassifier
clfs = [{'clf': RandomForestClassifier, 'n_estimators': [10,50],
'max_features': ['sqrt','log2'], 'random_state': [0]}]
exp = diogenes.grid_search.experiment.Experiment(M, labels, clfs=clfs)
_ = exp.run()
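To give a feel for how many configurations a parameter dictionary like the one above describes, here is a sketch that expands a grid into every combination. `expand_grid` is a hypothetical helper; it illustrates the usual Cartesian-product approach and is not Diogenes's implementation.

```python
import itertools

def expand_grid(param_grid):
    """Hypothetical helper: list every combination of the listed parameter values."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]

# Same parameter values as the clfs entry above, minus the 'clf' key
grid = {'n_estimators': [10, 50],
        'max_features': ['sqrt', 'log2'],
        'random_state': [0]}
combos = expand_grid(grid)
```

With two values for `n_estimators`, two for `max_features`, and one for `random_state`, this yields four parameter settings.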
Now, we will extract a single run, which gives us a single fitted classifier and a single set of test data.
In [38]:
run = exp.trials[0].runs[0][0]
fitted_classifier = run.clf
# scikit-learn does not accept structured arrays, so convert to a plain (homogeneous) ndarray
M_test = diogenes.utils.cast_np_sa_to_nd(M[run.test_indices])
labels_test = labels[run.test_indices]
scores = fitted_classifier.predict_proba(M_test)[:,1]
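The conversion in the cell above can be pictured with plain NumPy: stack the named columns side by side, then select the test rows by index. This is a sketch of the idea on toy data and is not necessarily how `cast_np_sa_to_nd` is implemented.

```python
import numpy as np

# Toy structured array and hypothetical test indices
toy = np.array([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)],
               dtype=[('a', 'f8'), ('b', 'f8')])
test_indices = np.array([0, 2])

# One 2-D float array: rows are samples, columns follow the field order
plain = np.column_stack([toy[name] for name in toy.dtype.names])
plain_test = plain[test_indices]
```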
We can use our fitted classifier and test data to make an ROC curve or a precision-recall curve showing us how well the classifier performs.
In [39]:
roc_fig = diogenes.display.plot_roc(labels_test, scores)
prec_recall_fig = diogenes.display.plot_prec_recall(labels_test, scores)
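For intuition about what an ROC curve plots, the points can be computed by sweeping a threshold down through the sorted scores. The helper `roc_points` below is hypothetical, assumes all scores are distinct and both classes are present, and is not the code that plot_roc uses.

```python
import numpy as np

def roc_points(y_true, scores):
    """Hypothetical helper: (fpr, tpr) arrays, assuming distinct scores."""
    order = np.argsort(-scores, kind='mergesort')  # highest score first
    y_sorted = y_true[order].astype(bool)
    tps = np.cumsum(y_sorted)    # true positives at each cutoff
    fps = np.cumsum(~y_sorted)   # false positives at each cutoff
    tpr = np.concatenate([[0.0], tps / float(tps[-1])])
    fpr = np.concatenate([[0.0], fps / float(fps[-1])])
    return fpr, tpr

# Toy labels and scores
y = np.array([True, False, True, False])
s = np.array([0.9, 0.8, 0.7, 0.1])
fpr, tpr = roc_points(y, s)

# Area under the curve by the trapezoid rule
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
```

In this toy case three of the four positive/negative pairs are ranked correctly, which matches an area of 0.75.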
For classifiers that offer feature importances, we provide a convenience method to get the top n features.
In [40]:
top_features = diogenes.display.get_top_features(fitted_classifier, M=M)
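Ranking features by importance amounts to sorting the importance vector and keeping the first n entries. The sketch below uses a hypothetical `top_n` helper and made-up importances; in practice the values would come from an attribute such as scikit-learn's `feature_importances_`.

```python
import numpy as np

def top_n(names, importances, n=3):
    """Hypothetical helper: the n (name, importance) pairs with the largest importance."""
    order = np.argsort(importances)[::-1][:n]
    return [(names[i], float(importances[i])) for i in order]

# Made-up feature names and importances
names = ['alcohol', 'density', 'chlorides', 'pH']
importances = np.array([0.4, 0.3, 0.1, 0.2])
best_two = top_n(names, importances, n=2)
```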
For random forest classifiers, we also provide a function to examine consecutive occurrences of features in decision trees. See :func:`diogenes.display.display.feature_pairs_in_rf` for more detail.
In [41]:
results = diogenes.display.feature_pairs_in_rf(fitted_classifier, n=3)
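One possible reading of "consecutive occurrence" is a parent split on one feature followed directly by a child split on another. The sketch below counts such pairs for a single tree stored in the array layout that scikit-learn uses for `tree_` (child index -1 marks a leaf). The helper name and the toy tree are hypothetical, and feature_pairs_in_rf may define and report pairs differently.

```python
from collections import Counter

def count_parent_child_pairs(children_left, children_right, feature):
    """Hypothetical helper: count (parent feature, child feature) split pairs."""
    pairs = Counter()
    for node, feat in enumerate(feature):
        if children_left[node] == -1:  # leaf: no split here
            continue
        for child in (children_left[node], children_right[node]):
            if children_left[child] != -1:  # child is also a split node
                pairs[(feat, feature[child])] += 1
    return pairs

# Toy tree: root splits on feature 0, both children split on feature 1
children_left = [1, 3, 5, -1, -1, -1, -1]
children_right = [2, 4, 6, -1, -1, -1, -1]
feature = [0, 1, 1, -2, -2, -2, -2]
pairs = count_parent_child_pairs(children_left, children_right, feature)
```

Summing such counters over every tree in a forest would give forest-wide pair frequencies.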
Finally, diogenes.display provides a simple way to make PDF reports using :class:`diogenes.display.display.Report`. The relevant methods are:
diogenes.display.display.Report.add_heading
diogenes.display.display.Report.add_text
diogenes.display.display.Report.add_table
diogenes.display.display.Report.add_fig
diogenes.display.display.Report.to_pdf
In [42]:
report = diogenes.display.Report(report_path='display_sample_report.pdf')
report.add_heading('My Great Report About RF', level=1)
report.add_text('I did an experiment with the wine data set '
'(http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv)')
report.add_heading('Top Features', level=2)
report.add_table(top_features)
report.add_heading('ROC Plot', level=2)
report.add_fig(roc_fig)
full_report_path = report.to_pdf(verbose=False)
Here's the result:
In [43]:
from IPython.display import HTML
HTML('<iframe src=display_sample_report.pdf width=700 height=350></iframe>')