Get some data to play with
In [ ]:
from sklearn.datasets import load_digits
import numpy as np
digits = load_digits()
digits.keys()
In [ ]:
digits.data.shape
In [ ]:
digits.target.shape
In [ ]:
digits.target
In [ ]:
np.bincount(digits.target)
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
# %matplotlib notebook <- interactive interface
plt.matshow(digits.data[0].reshape(8, 8), cmap=plt.cm.Greys)
In [ ]:
digits.target[0]
In [ ]:
fig, axes = plt.subplots(4, 4)
for x, y, ax in zip(digits.data, digits.target, axes.ravel()):
    ax.set_title(y)
    ax.imshow(x.reshape(8, 8), cmap="gray_r")
    ax.set_xticks(())
    ax.set_yticks(())
plt.tight_layout()
In scikit-learn, data is always a NumPy array (or sparse matrix) of shape (n_samples, n_features).
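To make the (n_samples, n_features) layout concrete, here is a minimal sketch that flattens the raw 8x8 digit images into rows of 64 features. It assumes the `images` attribute of the object returned by `load_digits`, which the cells above do not use; the variable names `n_samples` and `X` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# digits.images holds the raw images: one 8x8 array per sample.
n_samples = len(digits.images)

# Flatten each image into a single row of 64 pixel features.
X = digits.images.reshape(n_samples, -1)

print(digits.images.shape)  # (n_samples, 8, 8)
print(X.shape)              # (n_samples, 64)

# The flattened array should match the ready-made digits.data.
print(np.array_equal(X, digits.data))
```

Each row is one sample and each column is one pixel, which is why the plotting cells above call `reshape(8, 8)` before displaying a row as an image.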
Split the data to get going
In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=1)
In [ ]:
digits.data.shape
In [ ]:
X_train.shape
In [ ]:
X_test.shape
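The shapes above can be checked against each other. The following sketch repeats the same split and verifies that the two subsets partition the data, that roughly a quarter of the samples are held out, and that a fixed `random_state` reproduces the split. The names `n_total`, `X_train2` and `X_test2` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=1)

n_total = digits.data.shape[0]

# Every sample ends up in exactly one of the two subsets.
print(X_train.shape[0] + X_test.shape[0] == n_total)

# Fraction of samples held out for testing (close to test_size).
print(X_test.shape[0] / n_total)

# Repeating the call with the same random_state gives the same split.
X_train2, X_test2, _, _ = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=1)
print(np.array_equal(X_test, X_test2))

# Number of samples per digit class in each subset.
print(np.bincount(y_train))
print(np.bincount(y_test))
```

The split is random, so the class counts in the two subsets are only approximately proportional to those in the full dataset.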
Load the iris dataset from the sklearn.datasets module using the load_iris function.
The function returns a dictionary-like object that has the same attributes as digits.
What is the number of classes, features and data points in this dataset? Use a scatterplot to visualize the dataset.
You can look at the DESCR attribute to learn more about the dataset.
Usually data doesn't come in that nice a format. You can find the csv file that contains the iris dataset at the following path:
import sklearn.datasets
import os
iris_path = os.path.join(sklearn.datasets.__path__[0], 'data', 'iris.csv')
Try loading the data from there using the pandas pd.read_csv function.
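As a starting point rather than a full solution, the sketch below peeks at the raw file before parsing it. It assumes that the first line of the bundled iris.csv is a summary row (sample count, feature count, class names) rather than column names, which may differ between scikit-learn versions; the name `iris_df` is illustrative.

```python
import os

import pandas as pd
import sklearn.datasets

iris_path = os.path.join(sklearn.datasets.__path__[0], 'data', 'iris.csv')

# Look at the first raw lines to see how the file is laid out.
with open(iris_path) as f:
    for _ in range(3):
        print(f.readline().rstrip())

# Assumption: the first line is a summary row, not column names,
# so skip it and tell pandas that the file has no header row.
iris_df = pd.read_csv(iris_path, skiprows=1, header=None)

print(iris_df.shape)
print(iris_df.head())
```

If the printed first line looks like ordinary column names instead, drop the `skiprows` and `header` arguments and let pandas infer the header.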
In [ ]:
# %load solutions/load_iris.py