Data Loading

Get some data to play with


In [ ]:
from sklearn.datasets import load_digits
import numpy as np
digits = load_digits()
digits.keys()

In [ ]:
digits.data.shape

In [ ]:
digits.target.shape

In [ ]:
digits.target

In [ ]:
np.bincount(digits.target)

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
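# Use `%matplotlib notebook` instead for an interactive interface

# Each sample is a flat vector of 64 pixel values; reshape it to 8x8 to display it
plt.matshow(digits.data[0].reshape(8, 8), cmap=plt.cm.Greys)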

In [ ]:
digits.target[0]

In [ ]:
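# Show the first 16 digits in a 4x4 grid, each titled with its label
fig, axes = plt.subplots(4, 4)
for x, y, ax in zip(digits.data, digits.target, axes.ravel()):
    ax.set_title(y)
    ax.imshow(x.reshape(8, 8), cmap="gray_r")
    # Hide the axis ticks; they carry no information for images
    ax.set_xticks(())
    ax.set_yticks(())
plt.tight_layout()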

scikit-learn expects data as a two-dimensional NumPy array (or sparse matrix) of shape (n_samples, n_features): one row per sample, one column per feature.
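
To make the (n_samples, n_features) layout concrete, here is a minimal sketch that flattens the 8x8 images in digits.images into rows and compares the result with digits.data. The variable names n_samples and X are chosen here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# digits.images stores every sample as an 8x8 grid of pixel intensities.
n_samples = digits.images.shape[0]

# Flatten each image into a single row, giving one column per pixel.
X = digits.images.reshape(n_samples, -1)

print(digits.images.shape)  # (n_samples, 8, 8)
print(X.shape)              # (n_samples, n_features)

# The flattened images should match the ready-made data array.
print(np.array_equal(X, digits.data))
```

The same reshape trick applies to any image data you load yourself: estimators take the flat two-dimensional array, not the stack of images.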

Split the data to get going


In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=1
)

In [ ]:
digits.data.shape

In [ ]:
X_train.shape

In [ ]:
X_test.shape
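
As a quick sanity check on the split, the sketch below confirms that the two parts add up to the full dataset and that roughly a quarter of the samples ended up in the test set. It also prints the class counts on each side with np.bincount. The names n_total and test_fraction are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=1
)

n_total = digits.data.shape[0]

# Every sample lands in exactly one of the two sets.
print(X_train.shape[0] + X_test.shape[0] == n_total)

# The test set holds about the requested fraction of the samples.
test_fraction = X_test.shape[0] / n_total
print(round(test_fraction, 2))

# Class counts in each set; the split is random, so they need not match exactly.
print(np.bincount(y_train))
print(np.bincount(y_test))
```

Fixing random_state makes the split reproducible, so rerunning the cell gives the same training and test sets.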

Exercises

Load the iris dataset from the sklearn.datasets module using the load_iris function. Like digits, the function returns a dictionary-like object with attributes such as data and target.

What is the number of classes, features and data points in this dataset? Use a scatterplot to visualize the dataset.

You can look at the DESCR attribute to learn more about the dataset.

Usually data doesn't come in that nice a format. You can find the csv file that contains the iris dataset at the following path:

import sklearn.datasets
import os
iris_path = os.path.join(sklearn.datasets.__path__[0], 'data', 'iris.csv')

Try loading the data from there using the pandas pd.read_csv function.


In [ ]:
# %load solutions/load_iris.py