Title: Make Simulated Data For Classification
Slug: make_simulated_data_for_classification
Summary: Make a simulated dataset for classification using scikit-learn.
Date: 2017-01-16 12:00
Category: Machine Learning
Tags: Basics
Authors: Chris Albon

Preliminaries


In [1]:
from sklearn.datasets import make_classification
import pandas as pd

Create Simulated Data


In [2]:
# Create a simulated feature matrix and output vector with 100 samples,
features, output = make_classification(n_samples = 100,
                                       # ten features
                                       n_features = 10,
                                       # five features that actually predict the output's classes
                                       n_informative = 5,
                                       # no redundant features (linear combinations of the informative ones),
                                       # so the remaining five features are random noise unrelated to the output's classes
                                       n_redundant = 0,
                                       # three output classes
                                       n_classes = 3,
                                       # with 20% of observations in the first class, 30% in the second class,
                                       # and 50% in the third class. ('None' makes balanced classes)
                                       weights = [.2, .3, .5])
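
To see what `n_redundant` controls: scikit-learn builds redundant features as random linear combinations of the informative ones, and fills any columns left over with random noise. The sketch below compares the matrix rank of two simulated feature matrices. The `random_state = 0` seed, the explicit `tol`, and the variable names are assumptions made here for illustration, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification

# Five informative features plus five redundant ones (linear combinations)
redundant_features, _ = make_classification(n_samples = 100,
                                            n_features = 10,
                                            n_informative = 5,
                                            n_redundant = 5,
                                            n_classes = 3,
                                            random_state = 0)

# Five informative features plus five leftover noise features
noisy_features, _ = make_classification(n_samples = 100,
                                        n_features = 10,
                                        n_informative = 5,
                                        n_redundant = 0,
                                        n_classes = 3,
                                        random_state = 0)

# Redundant columns add no new directions, so the rank stays at 5;
# independent noise columns raise it to the full 10
rank_with_redundant = np.linalg.matrix_rank(redundant_features, tol = 1e-8)
rank_with_noise = np.linalg.matrix_rank(noisy_features, tol = 1e-8)
print(rank_with_redundant, rank_with_noise)
```

If the goal is a set of features that carry no information about the classes, leaving `n_redundant` at zero and letting the extra columns be noise is the setting that matches.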

View Data


In [3]:
# View the first five observations and their 10 features
pd.DataFrame(features).head()


Out[3]:
            0          1          2          3          4          5          6          7          8          9
0  -1.338796   2.218025   3.333541   2.586772  -2.050240  -5.289060   4.364050   3.010074   3.073564   0.827317
1   1.535519   1.964163  -0.053789   0.610150  -4.256450  -6.044707   7.617702   4.654903   0.632368   3.234648
2   0.249576  -4.051890  -4.578764  -1.629710   2.188123   1.488968  -1.977744  -2.888737  -4.957220   3.599833
3   3.778789  -4.797895  -1.187821   0.724315   1.083952   0.165924  -0.352818   0.615942  -4.392519   1.683278
4   0.856266   0.568888  -0.520666  -1.970701   0.597743   2.224923   0.065515   0.250906  -1.512495  -0.859869
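
The integer column labels above can be hard to read once the data is passed around. A minimal sketch of one alternative is to name the columns and attach the classes to the same DataFrame. The names `feature_0` ... `feature_9` and `target`, and the `random_state = 0` seed, are hypothetical choices for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Simulate the data (seed fixed here only so the sketch is repeatable)
features, output = make_classification(n_samples = 100,
                                       n_features = 10,
                                       n_informative = 5,
                                       n_redundant = 0,
                                       n_classes = 3,
                                       weights = [.2, .3, .5],
                                       random_state = 0)

# Build one readable name per feature column
feature_names = ['feature_{}'.format(i) for i in range(features.shape[1])]

# Put the features and the classes in a single labeled DataFrame
df = pd.DataFrame(features, columns = feature_names)
df['target'] = output

# View the first five observations with their features and class
print(df.head())
```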

In [4]:
# View the first five observations' classes
pd.DataFrame(output).head()


Out[4]:
0
0 2
1 2
2 1
3 2
4 2
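
Five rows are not enough to judge whether the requested class balance was honored. A rough check, sketched below, is to compute each class's share of all observations. The shares will usually be close to the requested weights rather than exact, because `make_classification` randomly reassigns a small fraction of labels by default (its `flip_y` parameter). The `random_state = 0` seed and the variable name `class_shares` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Simulate data with a 20% / 30% / 50% class split
features, output = make_classification(n_samples = 100,
                                       n_features = 10,
                                       n_informative = 5,
                                       n_redundant = 0,
                                       n_classes = 3,
                                       weights = [.2, .3, .5],
                                       random_state = 0)

# Share of observations in each class, ordered by class label
class_shares = pd.Series(output).value_counts(normalize = True).sort_index()
print(class_shares)
```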