Here we are using the California Housing dataset to learn more about Machine Learning.


In [4]:
import pandas as pd

housing = pd.read_csv('housing.csv')
housing.head()


Out[4]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY

In [5]:
housing.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

In [8]:
housing.describe()


Out[8]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

In the meanwhile we are trying to have more information about pandas. In the following sections we are using the value_counts method to have more information about each feature values. This method specify number of different values for given feature.


In [6]:
housing['total_rooms'].value_counts()


Out[6]:
1527.0    18
1613.0    17
1582.0    17
2127.0    16
1703.0    15
          ..
7784.0     1
7916.0     1
6859.0     1
6846.0     1
5639.0     1
Name: total_rooms, Length: 5926, dtype: int64

In [7]:
housing['ocean_proximity'].value_counts()


Out[7]:
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

See the difference between loc and iloc methods in a simple pandas DataFrame.


In [26]:
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}]).iloc[1]


Out[26]:
a    2
b    1
Name: 1, dtype: object

In [21]:
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}]).loc[1]


Out[21]:
a    2
b    1
Name: 1, dtype: object

In [23]:
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}]).loc[1, ['b']]


Out[23]:
b    1
Name: 1, dtype: object

In [27]:
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}]).loc[[True, True, False]]


Out[27]:
a b
0 1 1
1 2 1

Here we want to see the apply function of pandas for an specific feature.


In [35]:
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}])['a'].apply(lambda a: a > 10)


Out[35]:
0    False
1    False
2    False
Name: a, dtype: bool

The following function helps to split the given dataset into test and train sets.


In [32]:
from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda _id: test_set_check(_id, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

In [36]:
housing_with_id = housing.reset_index() # adds an "index" column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, 'index')

In [39]:
housing = train_set.copy()
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)


Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f699b340390>

In [41]:
import matplotlib.pyplot as plt

housing.plot(kind='scatter', x='longitude', y='latitude',
             alpha=0.4, s=housing['population']/100, label='population',
             c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True,
            )


Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f695a1ad090>