Ch02 End-to-end ML project

If you are a data scientist starting a new ML project, what should you do?

  • look at the big picture

  • get the data

  • discover and visualize the data to gain insights

  • prepare the data for ML algorithms

  • select a model and train it

  • fine-tune the model

  • present your solution

  • launch, monitor, and maintain the system

Working with real data

Experimenting with real-world data is the best way to learn ML. Here are some places we can get data:

  • Popular open data repositories:

    • UC Irvine ML repository

    • Kaggle

    • Amazon's AWS datasets

Appendix-B Machine learning project checklist

Frame the problem and look at the big picture

Here is a list of issues to consider:

  1. What is the objective?

  2. How can performance be measured (e.g., root mean square error or mean absolute error)?

...
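The two performance measures mentioned above are easy to compute directly; here is a minimal sketch with NumPy (the toy arrays are made up for illustration):

```python
import numpy as np

def rmse(y_true, y_pred):
    # root mean square error: penalizes large errors more heavily
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    # mean absolute error: more robust to outliers than RMSE
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])
print(mae(y_true, y_pred))   # 1.0
print(rmse(y_true, y_pred))  # ~1.291 (sqrt(5/3))
```

RMSE is the usual choice for regression tasks, but MAE may be preferable when the data contains many outlier districts.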

Get the data

Note: We had better automate the data-fetching steps as much as possible so that we can always get fresh data.

  1. list the data we need and how much we need

  2. find and document where we can get the data

  3. ...

  4. Last but not least, sample a test set, put it aside, and never look at it (no data snooping)
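Setting the test set aside can be done in one line with scikit-learn; a minimal sketch (the toy DataFrame stands in for the real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy dataset standing in for the real one
data = pd.DataFrame({"x": range(100), "y": range(100)})

# a fixed random_state makes the split reproducible, so the same rows
# stay in the test set across runs (helps avoid data snooping)
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
print(len(train_set), len(test_set))  # 80 20
```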

Explore the data

Note: get insights from a field expert for these steps.

  1. create a copy of the data to explore

  2. use a Jupyter notebook to keep a record

  3. study each attribute and its characteristics

...

Prepare the data

Notes:

  • work on copies of the data (keep the original dataset intact)

  • write functions for all the data transformations you apply, for five reasons:

    • we can easily prepare the data the next time we get a fresh dataset

    • we can apply these transformations in future projects

    • to clean and prepare the test set

    • to clean and prepare new data instances once our solution is live

    • to make it easy to treat your preparation choices as hyperparameters

  1. data cleaning

  2. feature selection: drop attributes that are useless or meaningless

  3. feature engineering:

    • discretize continuous features

    • decompose features (e.g., categorical, date/time)

    • add promising transformations of features (e.g., log(x))

    • aggregate features into promising new features

  4. feature scaling: standardize or normalize features
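The cleaning and scaling steps above can be packaged into a single reusable transformation, which also satisfies the "write functions for all transformations" note; a minimal sketch using a scikit-learn Pipeline (column names and values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# toy numerical data with one missing value
df = pd.DataFrame({"rooms": [2.0, 4.0, np.nan, 6.0],
                   "income": [1.5, 3.0, 2.5, 8.0]})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # step 1: cleaning
    ("scaler", StandardScaler()),                   # step 4: scaling
])

prepared = num_pipeline.fit_transform(df)
print(prepared.shape)         # (4, 2)
print(prepared.mean(axis=0))  # each column now has ~zero mean
```

Because the fitted pipeline stores the medians and scaling parameters, the very same object can later transform the test set and fresh production data.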

Short-list promising models

  1. quickly train many rough models from different categories using standard parameters

  2. compare their performance

    • for each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure

  3. analyze the most significant variables for each algorithm

  4. analyze the types of errors the models make

  5. do a quick round of feature selection and feature engineering

  6. do a few quick iterations of the previous five steps
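The cross-validation step above can be sketched with scikit-learn; the model choice and synthetic data here are illustrative only:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data standing in for the real features/labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 10-fold CV; scikit-learn reports *negative* MSE, so negate before the sqrt
scores = cross_val_score(DecisionTreeRegressor(random_state=42), X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print(rmse_scores.mean(), rmse_scores.std())
```

The mean tells us the typical error, and the standard deviation tells us how stable that estimate is across folds, which is exactly what we want when comparing rough candidate models.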

Fine-tune the system

Present your solution

Launch

Working with real data (California housing prices)

Look at the big picture

We are asked to build a model of housing prices in California using California census data.


In the first example of chapter 02, we use this [dataset](./datasets/housing/housing.tgz) (note that it is a compressed file with the .tgz extension)

Certainly, we could download the dataset from GitHub by hand and then decompress the file into a CSV, but it is preferable to create a small function to do that. This is particularly useful if the data changes regularly, since it allows us to write a small script that we can run whenever we need fresh data. We can even set up a scheduled job to do that automatically at regular intervals.


I have written the Python fetch script in [fetch_data_github.py](./ch02/fetch_data_github.py)

In my notebook, I keep the dataset in the current directory.


In [4]:
!pwd


/home/ywfang/FANG/git/readingnotes/machine-learning/handson_scikitlearn_tf_2017

In [3]:
!ls -l


total 248
drwxr-xr-x  4 ywfang  staff    128 Jun  5 18:25 ch01
-rw-r--r--  1 ywfang  staff  34893 Jun  5 18:30 ch01-notebook.ipynb
drwxr-xr-x  4 ywfang  staff    128 Jun  1 16:17 ch02
-rw-r--r--  1 ywfang  staff  86787 Jun  5 18:33 ch02-notebook.ipynb
drwxr-xr-x  5 ywfang  staff    160 May 31 00:20 datasets

Take a quick look at the data structure


In [5]:
import pandas as pd
import os

HOUSING_PATH = os.path.join("datasets", "housing")
print(HOUSING_PATH)

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)


datasets/housing

In [6]:
housing = load_housing_data()
print("10 attributes: ", housing.columns)
housing.head(20)


10 attributes:  Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')
Out[6]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
5 -122.25 37.85 52.0 919.0 213.0 413.0 193.0 4.0368 269700.0 NEAR BAY
6 -122.25 37.84 52.0 2535.0 489.0 1094.0 514.0 3.6591 299200.0 NEAR BAY
7 -122.25 37.84 52.0 3104.0 687.0 1157.0 647.0 3.1200 241400.0 NEAR BAY
8 -122.26 37.84 42.0 2555.0 665.0 1206.0 595.0 2.0804 226700.0 NEAR BAY
9 -122.25 37.84 52.0 3549.0 707.0 1551.0 714.0 3.6912 261100.0 NEAR BAY
10 -122.26 37.85 52.0 2202.0 434.0 910.0 402.0 3.2031 281500.0 NEAR BAY
11 -122.26 37.85 52.0 3503.0 752.0 1504.0 734.0 3.2705 241800.0 NEAR BAY
12 -122.26 37.85 52.0 2491.0 474.0 1098.0 468.0 3.0750 213500.0 NEAR BAY
13 -122.26 37.84 52.0 696.0 191.0 345.0 174.0 2.6736 191300.0 NEAR BAY
14 -122.26 37.85 52.0 2643.0 626.0 1212.0 620.0 1.9167 159200.0 NEAR BAY
15 -122.26 37.85 50.0 1120.0 283.0 697.0 264.0 2.1250 140000.0 NEAR BAY
16 -122.27 37.85 52.0 1966.0 347.0 793.0 331.0 2.7750 152500.0 NEAR BAY
17 -122.27 37.85 52.0 1228.0 293.0 648.0 303.0 2.1202 155500.0 NEAR BAY
18 -122.26 37.84 50.0 2239.0 455.0 990.0 419.0 1.9911 158700.0 NEAR BAY
19 -122.27 37.84 52.0 1503.0 298.0 690.0 275.0 2.6033 162900.0 NEAR BAY

The info() function is useful for getting a quick description of the data


In [19]:
housing.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
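Note that total_bedrooms has only 20,433 non-null entries, i.e. 207 districts are missing this feature. A quick way to count missing values per column is `isnull().sum()`; sketched here on a toy DataFrame (on the real housing data it reports 207 for total_bedrooms):

```python
import numpy as np
import pandas as pd

# toy frame standing in for housing, with one missing bedroom count
df = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0],
                   "population": [322.0, 2401.0, 496.0]})
missing = df.isnull().sum()
print(missing[missing > 0])  # total_bedrooms    1
```

We will have to handle these missing values in the data-preparation step (drop the rows, drop the attribute, or impute a value such as the median).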

In [22]:
ocean_proximity_series = housing['ocean_proximity']
print(type(ocean_proximity_series))


<class 'pandas.core.series.Series'>

In [23]:
ocean_proximity_series.value_counts()


Out[23]:
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

In [7]:
housing.describe()   # shows a summary of the **numerical** attributes


Out[7]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

In [10]:
%matplotlib inline

import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20, 15))
# see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html
plt.show()


Create a test set