Ch02 End-to-end ML project

If you are a data scientist starting a new ML project, what should you do?

  • look at the big picture

  • get the data

  • discover and visualize the data to gain insights

  • prepare the data for ML algorithms

  • select a model and train it

  • fine-tune the model

  • present your solution

  • launch, monitor, and maintain the system

Working with real data

Experimenting with real-world data is the best way to learn ML. Here are some places we can get data:

  • Popular open data repositories:

    • UC Irvine ML repository

    • Kaggle

    • Amazon's AWS datasets

Appendix-B Machine learning project checklist

Frame the problem and look at the big picture

Here is a list of issues to consider:

  1. What is the objective?

  2. How can performance be measured (e.g., root mean square error or mean absolute error)?

...
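The two performance measures mentioned above are easy to compute directly; here is a minimal sketch with NumPy (the toy arrays are made up for illustration):

```python
import numpy as np

def rmse(y_true, y_pred):
    # root mean square error: penalizes large errors more heavily
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    # mean absolute error: more robust to outliers than RMSE
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])
print(mae(y_true, y_pred))   # 1.0
print(rmse(y_true, y_pred))  # ~1.291 (sqrt(5/3))
```

RMSE is the usual choice for regression tasks, but MAE may be preferable when the data contains many outlier districts.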

Get the data

Note: We had better automate the data-fetching steps as much as possible so that we can always get fresh data.

  1. list the data we need and how much we need

  2. find and document where we can get the data

  3. ...

  4. Last but not least, sample a test set, put it aside, and never look at it (no data snooping)
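Setting the test set aside can be done in one line with scikit-learn; a minimal sketch (the toy DataFrame stands in for the real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy dataset standing in for the real one
data = pd.DataFrame({"x": range(100), "y": range(100)})

# a fixed random_state makes the split reproducible, so the same rows
# stay in the test set across runs (helps avoid data snooping)
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
print(len(train_set), len(test_set))  # 80 20
```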

Explore the data

Note: get insights from a field expert for these steps.

  1. create a copy of the data to explore

  2. use a Jupyter notebook to keep a record

  3. study each attribute and its characteristics

...

Prepare the data

Notes:

  • work on copies of the data (keep the original dataset intact)

  • write functions for all the data transformations you apply, for five reasons:

    • we can easily prepare the data the next time we get a fresh dataset

    • we can apply these transformations in future projects

    • to clean and prepare the test set

    • to clean and prepare new data instances once our solution is live

    • to make it easy to treat your preparation choices as hyperparameters

  1. data cleaning

  2. feature selection: drop attributes that are useless or meaningless

  3. feature engineering:

    • discretize continuous features

    • decompose features (e.g., categorical, date/time)

    • add promising transformations of features (e.g., log(x))

    • aggregate features into promising new features

  4. feature scaling: standardize or normalize features
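The cleaning and scaling steps above can be packaged into a single reusable transformation, which also satisfies the "write functions for all transformations" note; a minimal sketch using a scikit-learn Pipeline (column names and values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# toy numerical data with one missing value
df = pd.DataFrame({"rooms": [2.0, 4.0, np.nan, 6.0],
                   "income": [1.5, 3.0, 2.5, 8.0]})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # step 1: cleaning
    ("scaler", StandardScaler()),                   # step 4: scaling
])

prepared = num_pipeline.fit_transform(df)
print(prepared.shape)         # (4, 2)
print(prepared.mean(axis=0))  # each column now has ~zero mean
```

Because the fitted pipeline stores the medians and scaling parameters, the very same object can later transform the test set and fresh production data.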

Short-list promising models

  1. quickly train many rough models from different categories using standard parameters

  2. compare their performance

    • for each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure

  3. analyze the most significant variables for each algorithm

  4. analyze the types of errors the models make

  5. do a quick round of feature selection and feature engineering

  6. do a few quick iterations of the previous five steps
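The cross-validation step above can be sketched with scikit-learn; the model choice and synthetic data here are illustrative only:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data standing in for the real features/labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 10-fold CV; scikit-learn reports *negative* MSE, so negate before the sqrt
scores = cross_val_score(DecisionTreeRegressor(random_state=42), X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print(rmse_scores.mean(), rmse_scores.std())
```

The mean tells us the typical error, and the standard deviation tells us how stable that estimate is across folds, which is exactly what we want when comparing rough candidate models.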

Fine-tune the system

Present your solution

Launch

Working with real data (California housing prices)

Look at the big picture

We are asked to build a model of housing prices in California using California census data.


In the first example of chapter 02, we use this [dataset](./datasets/housing/housing.tgz) (note that it is a compressed file with the .tgz extension)

Certainly, we could download the dataset from GitHub by hand and then decompress the file into a CSV, but it is preferable to create a small function to do that. This is particularly useful if the data changes regularly, since it allows us to write a small script that we can run whenever we need fresh data. We can even set up a scheduled job to do that automatically at regular intervals.


I have written the Python fetch script in [fetch_data_github.py](./ch02/fetch_data_github.py)

In my notebook, I keep the dataset in the current directory.


In [4]:
!pwd


/home/ywfang/FANG/git/readingnotes/machine-learning/handson_scikitlearn_tf_2017

In [3]:
!ls -l


total 248
drwxr-xr-x  4 ywfang  staff    128 Jun  5 18:25 ch01
-rw-r--r--  1 ywfang  staff  34893 Jun  5 18:30 ch01-notebook.ipynb
drwxr-xr-x  4 ywfang  staff    128 Jun  1 16:17 ch02
-rw-r--r--  1 ywfang  staff  86787 Jun  5 18:33 ch02-notebook.ipynb
drwxr-xr-x  5 ywfang  staff    160 May 31 00:20 datasets

Take a quick look at the data structure


In [5]:
import pandas as pd
import os

HOUSING_PATH = os.path.join("datasets", "housing")
print(HOUSING_PATH)

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)


datasets/housing

In [6]:
housing = load_housing_data()
print("10 attributes: ", housing.columns)
housing.head(20)


10 attributes:  Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
Out[6]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
5 -122.25 37.85 52.0 919.0 213.0 413.0 193.0 4.0368 269700.0 NEAR BAY
6 -122.25 37.84 52.0 2535.0 489.0 1094.0 514.0 3.6591 299200.0 NEAR BAY
7 -122.25 37.84 52.0 3104.0 687.0 1157.0 647.0 3.1200 241400.0 NEAR BAY
8 -122.26 37.84 42.0 2555.0 665.0 1206.0 595.0 2.0804 226700.0 NEAR BAY
9 -122.25 37.84 52.0 3549.0 707.0 1551.0 714.0 3.6912 261100.0 NEAR BAY
10 -122.26 37.85 52.0 2202.0 434.0 910.0 402.0 3.2031 281500.0 NEAR BAY
11 -122.26 37.85 52.0 3503.0 752.0 1504.0 734.0 3.2705 241800.0 NEAR BAY
12 -122.26 37.85 52.0 2491.0 474.0 1098.0 468.0 3.0750 213500.0 NEAR BAY
13 -122.26 37.84 52.0 696.0 191.0 345.0 174.0 2.6736 191300.0 NEAR BAY
14 -122.26 37.85 52.0 2643.0 626.0 1212.0 620.0 1.9167 159200.0 NEAR BAY
15 -122.26 37.85 50.0 1120.0 283.0 697.0 264.0 2.1250 140000.0 NEAR BAY
16 -122.27 37.85 52.0 1966.0 347.0 793.0 331.0 2.7750 152500.0 NEAR BAY
17 -122.27 37.85 52.0 1228.0 293.0 648.0 303.0 2.1202 155500.0 NEAR BAY
18 -122.26 37.84 50.0 2239.0 455.0 990.0 419.0 1.9911 158700.0 NEAR BAY
19 -122.27 37.84 52.0 1503.0 298.0 690.0 275.0 2.6033 162900.0 NEAR BAY

The info() function is useful for getting a quick description of the data


In [19]:
housing.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
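Note that total_bedrooms has only 20,433 non-null entries, i.e. 207 districts are missing this feature. A quick way to count missing values per column is `isnull().sum()`; sketched here on a toy DataFrame (on the real housing data it reports 207 for total_bedrooms):

```python
import numpy as np
import pandas as pd

# toy frame standing in for housing, with one missing bedroom count
df = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0],
                   "population": [322.0, 2401.0, 496.0]})
missing = df.isnull().sum()
print(missing[missing > 0])  # total_bedrooms    1
```

We will have to handle these missing values in the data-preparation step (drop the rows, drop the attribute, or impute a value such as the median).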

In [22]:
ocean_proximity_series = housing['ocean_proximity']
print(type(ocean_proximity_series))


<class 'pandas.core.series.Series'>

In [23]:
ocean_proximity_series.value_counts()


Out[23]:
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

In [7]:
housing.describe()   # shows a summary of the **numerical** attributes


Out[7]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

In [10]:
%matplotlib inline

import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20, 15))
# see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html
plt.show()


Create a test set