This notebook uses the California Housing dataset to explore the basics of machine learning.



In [4]:

    
import pandas as pd

housing = pd.read_csv('housing.csv')
housing.head()









    Out[4]:

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY



In [5]:

    
housing.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB



In [8]:

    
housing.describe()









    Out[8]:

          longitude      latitude  housing_median_age   total_rooms  total_bedrooms    population    households  median_income  median_house_value
count  20640.000000  20640.000000        20640.000000  20640.000000    20433.000000  20640.000000  20640.000000   20640.000000        20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081      537.870553   1425.476744    499.539680       3.870671       206855.816909
std        2.003532      2.135952           12.585558   2181.615252      421.385070   1132.462122    382.329753       1.899822       115395.615874
min     -124.350000     32.540000            1.000000      2.000000        1.000000      3.000000      1.000000       0.499900        14999.000000
25%     -121.800000     33.930000           18.000000   1447.750000      296.000000    787.000000    280.000000       2.563400       119600.000000
50%     -118.490000     34.260000           29.000000   2127.000000      435.000000   1166.000000    409.000000       3.534800       179700.000000
75%     -118.010000     37.710000           37.000000   3148.000000      647.000000   1725.000000    605.000000       4.743250       264725.000000
max     -114.310000     41.950000           52.000000  39320.000000     6445.000000  35682.000000   6082.000000      15.000100       500001.000000

Along the way, we also get more familiar with pandas. The following cells use the value_counts method to inspect the values of individual features. For a given feature, this method returns each distinct value together with the number of rows in which it occurs, sorted from most to least frequent.



In [6]:

    
housing['total_rooms'].value_counts()









    Out[6]:





1527.0    18
1613.0    17
1582.0    17
2127.0    16
1703.0    15
          ..
7784.0     1
7916.0     1
6859.0     1
6846.0     1
5639.0     1
Name: total_rooms, Length: 5926, dtype: int64



In [7]:

    
housing['ocean_proximity'].value_counts()









    Out[7]:





<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
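
As a small, self-contained illustration of what value_counts reports, the sketch below uses a made-up Series (the values are not taken from housing.csv). It also shows two optional parameters, normalize and dropna, which are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy data (hypothetical, not from housing.csv); None stands for a missing value.
proximity = pd.Series(['INLAND', 'NEAR BAY', 'INLAND', None, 'INLAND', 'NEAR BAY'])

# Default: absolute counts per distinct value, most frequent first;
# missing values are skipped.
counts = proximity.value_counts()

# normalize=True returns fractions of the non-missing rows instead of counts.
shares = proximity.value_counts(normalize=True)

# dropna=False adds a separate entry for the missing values.
counts_with_missing = proximity.value_counts(dropna=False)

print(counts)
print(shares)
print(counts_with_missing)
```

The counts with dropna=False add up to the length of the Series, which is a quick way to check whether a feature contains missing values.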

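The following cells compare the loc and iloc indexers on a small pandas DataFrame. iloc selects by integer position and loc selects by index label; because this DataFrame has the default integer index (0, 1, 2), both return the same row here.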



In [26]:

    
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}]).iloc[1]









    Out[26]:





a    2
b    1
Name: 1, dtype: object



In [21]:

    
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}]).loc[1]









    Out[21]:





a    2
b    1
Name: 1, dtype: object



In [23]:

    
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}]).loc[1, ['b']]









    Out[23]:





b    1
Name: 1, dtype: object



In [27]:

    
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}]).loc[[True, True, False]]
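
With the default integer index, loc[1] and iloc[1] happen to return the same row, which hides the difference between them. The sketch below gives a toy DataFrame a custom index (the labels 10, 20, 30 are made up for illustration) so that labels and positions no longer coincide.

```python
import pandas as pd

# Toy frame with a hypothetical non-default index.
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']}, index=[10, 20, 30])

by_position = df.iloc[1]   # second row, regardless of its label
by_label = df.loc[20]      # row whose index label is 20 (the same row here)

# A boolean list works with loc as a row mask: rows marked True are kept.
masked = df.loc[[True, True, False]]

# loc[1] raises KeyError here, because no row carries the label 1.
try:
    df.loc[1]
except KeyError:
    print('label 1 not found')

print(by_position.equals(by_label))
print(masked)
```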

Next, we try the pandas apply method on a single feature. It calls the given function on each value of the column and returns the results as a new Series.



In [35]:

    
pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}])['a'].apply(lambda a: a > 10)









    Out[35]:





0    False
1    False
2    False
Name: a, dtype: bool
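
For a simple comparison like the one above, the same result can be obtained without apply by comparing the column directly, which pandas evaluates element-wise. The sketch below checks that both forms agree on the same toy column (the threshold 2 is chosen only so that the result is not all False).

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': '1'}, {'a': 2, 'b': 1}, {'a': 3, 'b': 1}])

# apply calls the lambda once per value of column 'a'.
with_apply = df['a'].apply(lambda a: a >= 2)

# The vectorized comparison produces the same boolean Series.
vectorized = df['a'] >= 2

print(with_apply.equals(vectorized))
```

apply remains useful when the per-value logic cannot be written as a single vectorized expression, as in the hash-based check used for the train/test split.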

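The following functions split a dataset into training and test sets. Each row's identifier is hashed with CRC32, and the row goes to the test set if its hash falls below test_ratio of the 32-bit hash range, so a given row always lands in the same set across runs.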



In [32]:

    
from zlib import crc32
import numpy as np

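def test_set_check(identifier, test_ratio):
    # Hash the identifier and compare it against test_ratio of the 32-bit hash range.
    return (crc32(np.int64(identifier)) & 0xffffffff) < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda _id: test_set_check(_id, test_ratio))
    # Rows below the threshold form the test set; the rest form the training set.
    return data.loc[~in_test_set], data.loc[in_test_set]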
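
One way to see why the split is keyed on a hash of the identifier: a row's assignment depends only on its own id, so appending new rows later does not move existing rows between sets. The sketch below illustrates this on made-up data. The helper hash_fraction is hypothetical (it is not part of this notebook); it expresses the same CRC32 check as a fraction in [0, 1), and the column name 'index' and the row counts are assumptions for illustration.

```python
from zlib import crc32

import numpy as np
import pandas as pd


def hash_fraction(identifier):
    # Hypothetical helper: map an id to a number in [0, 1) via its CRC32 hash.
    return (crc32(np.int64(identifier)) & 0xffffffff) / 2**32


def split_by_hash(data, test_ratio, id_column):
    # Rows whose hash fraction is below test_ratio go to the test set.
    in_test = data[id_column].apply(hash_fraction) < test_ratio
    return data.loc[~in_test], data.loc[in_test]


# Made-up data: 1,000 rows identified by their row number.
small = pd.DataFrame({'index': np.arange(1000)})
train_small, test_small = split_by_hash(small, 0.2, 'index')

# The same data with 500 extra rows appended afterwards.
large = pd.DataFrame({'index': np.arange(1500)})
train_large, test_large = split_by_hash(large, 0.2, 'index')

# Every id that was in the small test set is still in the larger test set,
# and no id from the small training set has moved into the test set.
test_is_stable = set(test_small['index']) <= set(test_large['index'])
train_is_stable = set(train_small['index']).isdisjoint(test_large['index'])
print(len(train_small), len(test_small), test_is_stable, train_is_stable)
```

A purely random split would not have this property unless the random seed and the row order were both kept fixed.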



In [36]:

    
housing_with_id = housing.reset_index() # adds an "index" column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, 'index')



In [39]:

    
housing = train_set.copy()
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)









    Out[39]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f699b340390>



In [41]:

    
import matplotlib.pyplot as plt

housing.plot(kind='scatter', x='longitude', y='latitude',
             alpha=0.4, s=housing['population']/100, label='population',
             c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True,
            )









    Out[41]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f695a1ad090>
