Title: Make Simulated Data For Classification
Slug: make_simulated_data_for_classification
Summary: Make a simulated dataset for classification using scikit-learn.
Date: 2017-01-16 12:00
Category: Machine Learning
Tags: Basics
Authors: Chris Albon

Preliminaries



In [1]:

    
from sklearn.datasets import make_classification
import pandas as pd

Create Simulated Data



In [2]:

    
# Create a simulated feature matrix and output vector with 100 samples,
features, output = make_classification(n_samples = 100,
                                       # ten features
                                       n_features = 10,
                                       # five features that actually predict the output's classes
                                       n_informative = 5,
                                       # five features that are random linear combinations of the
                                       # informative features (so they add no new information)
                                       n_redundant = 5,
                                       # three output classes
                                       n_classes = 3,
                                       # with roughly 20% of observations in the first class, 30% in the
                                       # second class, and 50% in the third class. The weights should sum
                                       # to 1. ('None' makes balanced classes)
                                       weights = [.2, .3, .5])
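
A quick way to check that the simulated classes follow the intended 20% / 30% / 50% split is to count the labels. The sketch below is illustrative only: it assumes weights of [.2, .3, .5] (which sum to 1), and the `random_state=0` seed and the names `class_counts` and `class_shares` are choices made for this example, not part of the original recipe. Because `make_classification` randomly reassigns a small fraction of labels by default (`flip_y=0.01`), the shares should be close to the weights but not necessarily exact.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as above, with weights that sum to 1 and a fixed seed.
# random_state=0 is an arbitrary value assumed here for reproducibility.
features, output = make_classification(n_samples=100,
                                       n_features=10,
                                       n_informative=5,
                                       n_redundant=5,
                                       n_classes=3,
                                       weights=[.2, .3, .5],
                                       random_state=0)

# Count the observations in each class, then convert counts to proportions
class_counts = np.bincount(output, minlength=3)
class_shares = class_counts / class_counts.sum()

print(class_counts)
print(class_shares)
```

Fixing `random_state` also makes the simulated data identical on every run, which helps when comparing models on the same dataset.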

View Data



In [3]:

    
# View the first five observations and their 10 features
pd.DataFrame(features).head()

	0	1	2	3	4	5	6	7	8	9
0	-1.338796	2.218025	3.333541	2.586772	-2.050240	-5.289060	4.364050	3.010074	3.073564	0.827317
1	1.535519	1.964163	-0.053789	0.610150	-4.256450	-6.044707	7.617702	4.654903	0.632368	3.234648
2	0.249576	-4.051890	-4.578764	-1.629710	2.188123	1.488968	-1.977744	-2.888737	-4.957220	3.599833
3	3.778789	-4.797895	-1.187821	0.724315	1.083952	0.165924	-0.352818	0.615942	-4.392519	1.683278
4	0.856266	0.568888	-0.520666	-1.970701	0.597743	2.224923	0.065515	0.250906	-1.512495	-0.859869



In [4]:

    
# View the first five observations' classes
pd.DataFrame(output).head()
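
The features and classes can also be viewed together in one labeled DataFrame. This is a minimal sketch, not part of the original recipe: the column names `feature_0` through `feature_9` and `class`, the variable `df`, and the `random_state=0` seed are all hypothetical choices made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Recreate the simulated data (random_state=0 is an assumed, arbitrary seed)
features, output = make_classification(n_samples=100,
                                       n_features=10,
                                       n_informative=5,
                                       n_redundant=5,
                                       n_classes=3,
                                       weights=[.2, .3, .5],
                                       random_state=0)

# Give each feature column a readable (hypothetical) name
feature_names = [f"feature_{i}" for i in range(features.shape[1])]
df = pd.DataFrame(features, columns=feature_names)

# Add the class labels as a final column
df["class"] = output

# View the first five rows and the number of observations per class
print(df.head())
print(df["class"].value_counts().sort_index())
```

Keeping the labels next to the features makes it easier to spot-check individual observations or group by class.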