Here we use the California Housing dataset to learn more about machine learning.
In [4]:
import pandas as pd
housing = pd.read_csv('housing.csv')
housing.head()
Out[4]:
In [5]:
housing.info()
In [8]:
housing.describe()
Out[8]:
Along the way we also learn more about pandas. The following cells use the value_counts method to inspect the values of individual features. This method returns how many times each distinct value occurs in the given feature.
In [6]:
housing['total_rooms'].value_counts()
Out[6]:
In [7]:
housing['ocean_proximity'].value_counts()
Out[7]:
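To make the behaviour of value_counts concrete, here is a minimal sketch on a small hand-made Series. The values are illustrative only and are not taken from housing.csv; the names s, counts and shares are assumptions of this example.

```python
import pandas as pd

# Toy categorical feature (made-up values, not the real dataset)
s = pd.Series(["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN", "INLAND", "NEAR BAY"])

# Number of occurrences of each distinct value, most frequent first
counts = s.value_counts()

# Same information expressed as fractions of the total
shares = s.value_counts(normalize=True)

print(counts)
print(shares)
```

For a categorical feature such as ocean_proximity the result is short and readable. For a numeric feature such as total_rooms most values occur only a few times, so the output is usually less informative.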
See the difference between the loc and iloc indexers on a simple pandas DataFrame. loc selects by index label (or by a boolean mask), while iloc selects by integer position.
In [26]:
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}]).iloc[1]
Out[26]:
In [21]:
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}]).loc[1]
Out[21]:
In [23]:
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}]).loc[1, ['b']]
Out[23]:
In [27]:
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}]).loc[[True, True, False]]
Out[27]:
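With the default integer index used above, loc[1] and iloc[1] return the same row, so the difference is easy to miss. The sketch below uses a hypothetical frame with a custom index (10, 20, 30) to make it visible; the frame and variable names are assumptions of this example.

```python
import pandas as pd

# Custom index labels so that labels and positions no longer coincide
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}, index=[10, 20, 30])

by_label = df.loc[20]      # row whose index label is 20
by_position = df.iloc[1]   # second row, regardless of its label

# Label slices include the end label, positional slices exclude the end position
label_slice = df.loc[10:20]   # rows labelled 10 and 20
pos_slice = df.iloc[0:1]      # only the first row

# There is no label 1 in this index, so loc[1] fails
try:
    df.loc[1]
    raised = False
except KeyError:
    raised = True

print(by_label, by_position, label_slice, pos_slice, raised, sep="\n\n")
```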
Here we look at the pandas apply method on a specific feature.
In [35]:
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}])['a'].apply(lambda a: a > 10)
Out[35]:
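As a small sketch of what apply does, the example below calls the function once per element and compares the result with the equivalent vectorized comparison. The frame and the names via_apply and vectorized are assumptions of this example; the value 30 is chosen so that at least one element passes the test.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 30]})

# apply calls the lambda once for every element of the column
via_apply = df["a"].apply(lambda a: a > 10)

# For a simple comparison, the vectorized form gives the same values
vectorized = df["a"] > 10

print(via_apply.tolist())
print(vectorized.tolist())
```

apply is most useful when the per-element logic cannot be written as a vectorized expression, as in the hash-based check used for the train/test split.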
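The following functions split the given dataset into train and test sets. Each row is assigned to the test set based on a hash of its identifier, so the assignment stays the same across runs.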
In [32]:
from zlib import crc32
import numpy as np
def test_set_check(identifier, test_ratio):
    # A row belongs to the test set if the hash of its id falls in the lowest test_ratio share of the 32-bit range
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda _id: test_set_check(_id, test_ratio))
    # Returns (train_set, test_set)
    return data.loc[~in_test_set], data.loc[in_test_set]
In [36]:
housing_with_id = housing.reset_index() # adds an "index" column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, 'index')
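The point of hashing the identifier is that a row keeps its assignment even when more rows are appended later. The sketch below illustrates this on two toy frames. It is a variant of the functions above, not the same code: it hashes the id as 8 little-endian bytes via int.to_bytes, and the names in_test_set, split_by_id, small and large are hypothetical.

```python
from zlib import crc32

import pandas as pd


def in_test_set(identifier, test_ratio):
    # Assumption of this sketch: hash the id as 8 little-endian signed bytes
    digest = crc32(int(identifier).to_bytes(8, "little", signed=True))
    return digest < test_ratio * 2**32


def split_by_id(data, test_ratio, id_column):
    mask = data[id_column].apply(lambda i: in_test_set(i, test_ratio))
    return data.loc[~mask], data.loc[mask]  # (train, test)


# The second frame contains the first 100 rows plus 100 new ones
small = pd.DataFrame({"id": range(100), "value": range(100)})
large = pd.DataFrame({"id": range(200), "value": range(200)})

train_small, test_small = split_by_id(small, 0.2, "id")
train_large, test_large = split_by_id(large, 0.2, "id")

# Ids below 100 land in the same set in both splits
test_ids_small = set(test_small["id"])
test_ids_large_first_100 = {i for i in test_large["id"] if i < 100}
print(test_ids_small == test_ids_large_first_100)
```

Note that using the row index as the identifier, as done with reset_index above, only stays stable if new data is appended at the end and no rows are deleted.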
In [39]:
housing = train_set.copy()
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
Out[39]:
In [41]:
import matplotlib.pyplot as plt
housing.plot(kind='scatter', x='longitude', y='latitude',
             alpha=0.4, s=housing['population']/100, label='population',
             c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True,
             )
Out[41]: