If you are a data scientist and you get a new ML project, what will you do? The main steps are:
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for ML algorithms.
5. Select a model and train it.
6. Fine-tune the model.
7. Present your solution.
8. Launch, monitor, and maintain the system.
Experimenting with real-world data is the best way to learn ML. There are several places where we can get data:
- Popular open data repositories:
  - the UC Irvine ML Repository
  - Kaggle
  - Amazon's AWS datasets
- Meta portals (they usually list the open data repositories that are available):
  - https://opendatamonitor.eu/ (mainly data from Europe)
  - https://www.quandl.com/ (financial data)
- Other pages listing many popular open data repositories:
  - Wikipedia's list of ML datasets
  - Quora questions
  - the Datasets subreddit
Before diving in, there is a list of issues to address:
- What is the objective?
- How will performance be measured? Common choices are the root mean square error (RMSE) and the mean absolute error (MAE); see the formulas after this list.
- List the data we need and how much of it we need.
- Find and document where we can get the data.
- ...
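For reference, these two measures are defined as follows for a hypothesis $h$ evaluated on $m$ instances (these are the standard definitions, not anything specific to this project):

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^{2}}$$

$$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(\mathbf{x}^{(i)}) - y^{(i)}\right|$$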
Last but not least, sample a test set, put it aside, and never look at it (no data snooping). A minimal sketch of how to do this follows.
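Assuming the data has already been loaded into a pandas DataFrame (here named `data`, which is just a placeholder of mine), one simple way to set the test set aside is scikit-learn's `train_test_split`:

```python
# Minimal sketch: split off a 20% test set and never touch it again.
# `data` is a placeholder DataFrame name; fixing random_state keeps the split stable.
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
print(len(train_set), "train instances /", len(test_set), "test instances")
```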
While exploring the data:
- Create a copy of the data to explore.
- Use a Jupyter notebook to keep a record.
- Study each attribute and its characteristics.
- Work on copies of the data (keep the original dataset intact).
Write functions for all the data transformations you apply (see the sketch after this list), for five reasons:
- we can easily prepare the data the next time we get a fresh dataset
- we can apply these transformations in future projects
- to clean and prepare the test set
- to clean and prepare new data instances once our solution is live
- to make it easy to treat your preparation choices as hyperparameters
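A hedged sketch of what such a transformation function might look like (the function name `prepare_features` and the column choice are my own illustrative assumptions):

```python
import numpy as np
import pandas as pd

def prepare_features(df: pd.DataFrame, log_cols=("total_rooms",)) -> pd.DataFrame:
    """Return a transformed copy of df; the original DataFrame is left intact."""
    out = df.copy()
    for col in log_cols:
        # log1p handles zeros gracefully; which columns to log-transform is a
        # choice we can later treat as a hyperparameter.
        out[col] = np.log1p(out[col])
    return out
```

Because the logic lives in one function, the same call can be reused on the test set and on new instances once the solution is live.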
Data preparation:
- Data cleaning.
- Feature selection: drop attributes that are useless or meaningless.
- Feature engineering (a small sketch follows this list):
  - discretize continuous features
  - decompose features (e.g., categorical, date/time)
  - add promising transformations of features (e.g., log(x))
  - aggregate features into promising new features
- Feature scaling: standardize or normalize features.
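Here is a small sketch of a few of these steps on a toy DataFrame (the column names, values, and bin edges are assumptions of mine, loosely modeled on the housing data used later):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "total_rooms":   [880, 7099, 1467],
    "households":    [126, 1138, 177],
    "median_income": [8.3252, 8.3014, 7.2574],
})

# aggregate two raw features into a promising new one
df["rooms_per_household"] = df["total_rooms"] / df["households"]
# add a promising transformation, e.g. log(x)
df["log_income"] = np.log(df["median_income"])
# discretize a continuous feature into categories
df["income_cat"] = pd.cut(df["median_income"],
                          bins=[0.0, 3.0, 6.0, np.inf], labels=[1, 2, 3])
# feature scaling: standardize the numeric features
scaled = StandardScaler().fit_transform(df[["rooms_per_household", "log_income"]])
```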
Next, shortlist promising models:
1. Quickly train several rough, quick-and-dirty models.
2. Compare the performance of these models.
3. Analyze the most significant parameters for each algorithm.
4. Analyze the types of errors the models make.
5. Do a quick round of feature selection and feature engineering.
6. Do a few quick iterations of the previous five steps (a sketch of steps 1–2 follows).
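A rough sketch of steps 1–2, training two quick models and comparing them with cross-validation (X_train and y_train are placeholders for whatever prepared training data we have):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
}
for name, model in models.items():
    # 10-fold cross-validation; scores come back as negative MSE
    scores = cross_val_score(model, X_train, y_train,
                             scoring="neg_mean_squared_error", cv=10)
    rmse = np.sqrt(-scores)
    print(f"{name}: RMSE mean={rmse.mean():.1f}, std={rmse.std():.1f}")
```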
## Fine-tune the system
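One common way to fine-tune is a grid search over hyperparameters. Here is a minimal sketch (the estimator, the parameter grid, and the X_train/y_train names are assumptions of mine, not the final choices for this project):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Try every combination of these hyperparameter values with 5-fold CV.
param_grid = {
    "n_estimators": [10, 30, 100],
    "max_features": [2, 4, 6, 8],
}
grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid,
                           cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
print(grid_search.best_estimator_)
```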
In the first example of chapter 02, we use this [dataset](./datasets/housing/housing.tgz) (note that it is a compressed file with a .tgz extension).
Certainly, we could download the dataset from GitHub by hand and decompress it into a CSV file, but it is preferable to write a small function to do that. This is particularly useful if the data changes regularly, since it lets us write a small script that we can run whenever we need fresh data. We could even set up a scheduled job to do this automatically at regular intervals.
I have written the Python fetch script in [fetch_data_github.py](./ch02/fetch_data_github.py).
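For reference, a minimal sketch of what such a fetch function can look like (the DOWNLOAD_URL below is a placeholder, not the actual URL used in fetch_data_github.py):

```python
import os
import tarfile
import urllib.request

DOWNLOAD_URL = "https://example.com/datasets/housing/housing.tgz"  # placeholder URL
HOUSING_PATH = os.path.join("datasets", "housing")

def fetch_housing_data(url=DOWNLOAD_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(url, tgz_path)       # download the .tgz archive
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)   # unpack housing.csv
```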
In my notebook, I keep the dataset in the current directory.
In [4]:
!pwd
In [3]:
!ls -l
In [5]:
import pandas as pd
import os
HOUSING_PATH = os.path.join("datasets", "housing")
print(HOUSING_PATH)
def load_housing_data(housing_path=HOUSING_PATH):
    # read datasets/housing/housing.csv into a pandas DataFrame
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
In [6]:
housing = load_housing_data()
print("10 attributes: ", housing.columns)
housing.head(20)
Out[6]:
The info() function is useful for getting a quick description of the data.
In [19]:
housing.info()
In [22]:
ocean_proximity_series = housing['ocean_proximity']
print(type(ocean_proximity_series))
In [23]:
ocean_proximity_series.value_counts()
Out[23]:
In [7]:
housing.describe() # shows a summary of the **numerical** attributes
Out[7]:
In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20, 15))
# see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html
plt.show()