Title: Make Simulated Data For Classification
Slug: make_simulated_data_for_classification
Summary: Make a simulated dataset for classification using scikit-learn.
Date: 2017-01-16 12:00
Category: Machine Learning
Tags: Basics
Authors: Chris Albon
In [1]:
from sklearn.datasets import make_classification
import pandas as pd
In [2]:
# Create a simulated feature matrix and output vector with 100 samples,
features, output = make_classification(n_samples = 100,
                                       # ten features
                                       n_features = 10,
                                       # five features that actually predict the output's classes
                                       n_informative = 5,
                                       # five redundant features, generated as random linear
                                       # combinations of the informative features
                                       n_redundant = 5,
                                       # three output classes
                                       n_classes = 3,
                                       # with roughly 20% of observations in the first class, 30% in the
                                       # second class, and 50% in the third class. ('None' makes balanced classes)
                                       weights = [.2, .3, .5])
In [3]:
# View the first five observations and their 10 features
pd.DataFrame(features).head()
Out[3]:
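A quick way to see what `n_redundant` does is to check the rank of the feature matrix. The sketch below rebuilds the same kind of dataset and computes the rank with NumPy. The `random_state=0` argument and the variable name `rank` are assumptions added here for repeatability; they are not part of the original call.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as the dataset above; random_state is fixed only so that
# this sketch is repeatable (an assumption, not in the original call).
features, output = make_classification(n_samples=100,
                                       n_features=10,
                                       n_informative=5,
                                       n_redundant=5,
                                       n_classes=3,
                                       weights=[.2, .3, .5],
                                       random_state=0)

# Redundant features are linear combinations of the informative ones,
# so they should not add any new independent directions to the matrix.
rank = np.linalg.matrix_rank(features)

print("Shape:", features.shape)
print("Rank:", rank)
```

With five informative and five redundant columns, one would expect the rank to match the number of informative features rather than the total number of columns.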
In [4]:
# View the first five observations' classes
pd.DataFrame(output).head()
Out[4]:
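To confirm that `weights` produced the intended class balance, it can help to tabulate the share of observations in each class. This is a minimal sketch using pandas; `random_state=0` and the name `proportions` are assumptions introduced for illustration. The shares are only approximate, because `make_classification` randomly reassigns the labels of a small fraction of samples by default (`flip_y=0.01`).

```python
import pandas as pd
from sklearn.datasets import make_classification

# Same settings as the dataset above; random_state is fixed only so that
# this sketch is repeatable (an assumption, not in the original call).
features, output = make_classification(n_samples=100,
                                       n_features=10,
                                       n_informative=5,
                                       n_redundant=5,
                                       n_classes=3,
                                       weights=[.2, .3, .5],
                                       random_state=0)

# Fraction of observations in each class, ordered by class label
proportions = pd.Series(output).value_counts(normalize=True).sort_index()

print(proportions)
```

If the shares are far from the requested weights, the weights list is the first thing to check: it should have one entry per class, and the entries should sum to 1.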