This lab is an introduction to linear regression using Python and Scikit-Learn. It serves as a foundation for the more complex algorithms and machine learning models you will encounter in the course. We will train a linear regression model to predict housing prices.
Each learning objective will correspond to a #TODO in the student lab notebook -- try to complete that notebook first before reviewing this solution notebook.
In [ ]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst
In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns # Seaborn is a Python data visualization library based on matplotlib.
%matplotlib inline
Here, we create a directory called explore under ../data. This directory will hold the dataset that we copy from Google Cloud Storage.
In [3]:
if not os.path.isdir("../data/explore"):
os.makedirs("../data/explore")
Next, we copy the USA housing dataset from Google Cloud Storage.
In [4]:
!gsutil cp gs://cloud-training-demos/feat_eng/housing/housing_pre-proc.csv ../data/explore
Then we use the "ls" command to list files in the directory. This ensures that the dataset was copied.
In [5]:
!ls -l ../data/explore
Next, we read the dataset into a Pandas dataframe.
In [6]:
# TODO 1
df_USAhousing = pd.read_csv('../data/explore/housing_pre-proc.csv')
In [7]:
# Show the first five rows.
df_USAhousing.head()
Out[7]:
Let's check for any null values.
In [8]:
df_USAhousing.isnull().sum()
Out[8]:
In [9]:
df_stats = df_USAhousing.describe()
df_stats = df_stats.transpose()
df_stats
Out[9]:
In [10]:
df_USAhousing.info()
Let's print a quick summary of the dataset: row and column counts, feature names, and the number of missing and unique values per column.
In [11]:
print("Rows    :", df_USAhousing.shape[0])
print("Columns :", df_USAhousing.shape[1])
print("\nFeatures :\n", df_USAhousing.columns.tolist())
print("\nMissing values :", df_USAhousing.isnull().sum().values.sum())
print("\nUnique values :\n", df_USAhousing.nunique())
In [12]:
sns.heatmap(df_USAhousing.corr())
Out[12]:
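Note that recent pandas releases no longer silently drop non-numeric columns in `DataFrame.corr()`, so a string column such as `ocean_proximity` would raise an error. A minimal sketch of a version-safe variant, shown here on a small stand-in dataframe rather than the real `df_USAhousing`:

```python
import pandas as pd
import seaborn as sns

# Small stand-in frame; the notebook uses df_USAhousing instead.
df = pd.DataFrame({
    "median_income": [2.5, 3.1, 4.8, 5.0],
    "median_house_value": [150000, 180000, 260000, 270000],
    "ocean_proximity": ["INLAND", "NEAR BAY", "NEAR BAY", "INLAND"],
})

# Keep only numeric columns before computing pairwise correlations;
# string columns would otherwise raise in newer pandas.
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
```

The `annot=True` option prints the correlation coefficient inside each cell, which makes the heatmap easier to read than color alone.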
Create a distplot showing "median_house_value".
In [13]:
# TODO 2a
sns.distplot(df_USAhousing['median_house_value'])
Out[13]:
In [14]:
sns.set_style('whitegrid')
df_USAhousing['median_house_value'].hist(bins=30)
plt.xlabel('median_house_value')
Out[14]:
In [15]:
x = df_USAhousing['median_income']
y = df_USAhousing['median_house_value']
plt.scatter(x, y)
plt.show()
Create a jointplot showing "median_income" versus "median_house_value".
In [16]:
# TODO 2b
sns.jointplot(x='median_income',y='median_house_value',data=df_USAhousing)
Out[16]:
In [17]:
sns.countplot(x = 'ocean_proximity', data=df_USAhousing)
Out[17]:
In [18]:
# Histogram of households, faceted by ocean_proximity.
g = sns.FacetGrid(df_USAhousing, col="ocean_proximity")
g.map(plt.hist, "households");
In [60]:
# Histogram of median_income, faceted by ocean_proximity.
g = sns.FacetGrid(df_USAhousing, col="ocean_proximity")
g.map(plt.hist, "median_income");
You can see below that this is the state of California!
In [20]:
x = df_USAhousing['longitude']
y = df_USAhousing['latitude']
plt.scatter(x, y)
plt.show()
In this notebook, we will explore data corresponding to taxi rides in New York City to build a Machine Learning model in support of a fare-estimation tool. The idea is to suggest a likely fare to taxi riders so that they are not surprised, and so that they can protest if the charge is much higher than expected.
First, restart the Kernel. Now, let's start with the Python imports that we need.
In [1]:
from google.cloud import bigquery
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
The dataset that we will use is the BigQuery public dataset `nyc-tlc.yellow.trips`. Open it in the BigQuery console and look at the column names. Switch to the Details tab to verify that the number of records is one billion, and then switch to the Preview tab to look at a few rows.
Let's write a SQL query to pick up interesting fields from the dataset. It's a good idea to get the timestamp in a predictable format.
In [2]:
%%bigquery
SELECT
FORMAT_TIMESTAMP(
"%Y-%m-%d %H:%M:%S %Z", pickup_datetime) AS pickup_datetime,
pickup_longitude, pickup_latitude, dropoff_longitude,
dropoff_latitude, passenger_count, trip_distance, tolls_amount,
fare_amount, total_amount
# TODO 3
FROM
`nyc-tlc.yellow.trips`
LIMIT 10
Out[2]:
Let's increase the number of records so that we can do some neat graphs. There is no guarantee about the order in which records are returned, and so no guarantee about which records get returned if we simply increase the LIMIT. To properly sample the dataset, let's use the HASH of the pickup time and return 1 in 100,000 records -- because there are 1 billion records in the data, we should get back approximately 10,000 records if we do this.
We will also store the BigQuery result in a Pandas dataframe named "trips".
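As an aside, the repeatable-sampling trick described above can be sketched in plain Python. `FARM_FINGERPRINT` is BigQuery-specific, so the sketch below substitutes MD5 purely for illustration; the bucketing logic is the same:

```python
import hashlib

def keep_row(pickup_datetime: str, buckets: int = 100_000, target: int = 1) -> bool:
    """Deterministically decide whether a row falls in the ~1/buckets sample.

    MD5 stands in for BigQuery's FARM_FINGERPRINT here: hash the pickup
    timestamp, reduce it modulo the bucket count, and keep the row only
    when the remainder matches the target bucket.
    """
    digest = hashlib.md5(pickup_datetime.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets == target

# The same timestamp always lands in the same bucket, so rerunning the
# query returns the same sample.
print(keep_row("2012-02-27 09:19:10 UTC") == keep_row("2012-02-27 09:19:10 UTC"))  # prints True
```

Because the decision depends only on the hashed value, sampling 1 bucket out of 100,000 from roughly 1 billion rows yields approximately 10,000 rows, and the sample is stable across runs.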
In [3]:
%%bigquery trips
SELECT
FORMAT_TIMESTAMP(
"%Y-%m-%d %H:%M:%S %Z", pickup_datetime) AS pickup_datetime,
pickup_longitude, pickup_latitude,
dropoff_longitude, dropoff_latitude,
passenger_count,
trip_distance,
tolls_amount,
fare_amount,
total_amount
FROM
`nyc-tlc.yellow.trips`
WHERE
ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 100000)) = 1
Out[3]:
In [4]:
print(len(trips))
In [5]:
# We can slice Pandas dataframes as if they were arrays
trips[:10]
Out[5]:
Let's explore this dataset and clean it up as necessary. We'll use the Python Seaborn package to visualize graphs and Pandas to do the slicing and filtering.
In [6]:
# TODO 4
ax = sns.regplot(
x="trip_distance", y="fare_amount",
fit_reg=False, ci=None, truncate=True, data=trips)
ax.figure.set_size_inches(10, 8)
Hmm ... do you see something wrong with the data that needs addressing?
It appears that we have a lot of invalid data that is being coded as zero distance and some fare amounts that are definitely illegitimate. Let's remove them from our analysis. We can do this by modifying the BigQuery query to keep only trips longer than zero miles and fare amounts that are at least the minimum cab fare ($2.50).
Note the extra WHERE clauses.
In [7]:
%%bigquery trips
SELECT
FORMAT_TIMESTAMP(
"%Y-%m-%d %H:%M:%S %Z", pickup_datetime) AS pickup_datetime,
pickup_longitude, pickup_latitude,
dropoff_longitude, dropoff_latitude,
passenger_count,
trip_distance,
tolls_amount,
fare_amount,
total_amount
FROM
`nyc-tlc.yellow.trips`
WHERE
ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 100000)) = 1
# TODO 4a
AND trip_distance > 0
AND fare_amount >= 2.5
Out[7]:
In [8]:
print(len(trips))
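The same two filters could also have been applied on the Pandas side after fetching the unfiltered sample. A minimal sketch, using a tiny stand-in dataframe in place of the real `trips`:

```python
import pandas as pd

# Tiny stand-in for the trips dataframe fetched from BigQuery.
trips = pd.DataFrame({
    "trip_distance": [0.0, 1.2, 3.4, 0.0],
    "fare_amount":   [2.5, 6.0, 12.5, 0.0],
})

# Keep only plausible rides: positive distance and at least the
# $2.50 minimum cab fare -- mirroring the extra WHERE clauses.
clean = trips[(trips["trip_distance"] > 0) & (trips["fare_amount"] >= 2.5)]
print(len(clean))  # 2 rows survive
```

Filtering in the query is preferable here because it avoids transferring invalid rows out of BigQuery, but the Pandas form is handy for exploratory iteration.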
In [9]:
ax = sns.regplot(
x="trip_distance", y="fare_amount",
fit_reg=False, ci=None, truncate=True, data=trips)
ax.figure.set_size_inches(10, 8)
What's up with the streaks around 45 and 50 dollars? Those are fixed-amount rides from JFK and La Guardia airports into Manhattan, so they are to be expected. Let's list the data to make sure the values look reasonable.
Let's also examine whether the toll amount is captured in the total amount.
In [10]:
tollrides = trips[trips["tolls_amount"] > 0]
tollrides[tollrides["pickup_datetime"] == "2012-02-27 09:19:10 UTC"]
Out[10]:
In [11]:
notollrides = trips[trips["tolls_amount"] == 0]
notollrides[notollrides["pickup_datetime"] == "2012-02-27 09:19:10 UTC"]
Out[11]:
Looking at a few samples above, it should be clear that the total amount reflects the fare amount, tolls, and tip somewhat arbitrarily -- when customers pay cash, the tip is not known, so it cannot be recorded consistently. We will therefore use the sum of fare_amount and tolls_amount as the value to be predicted. Tips are discretionary and do not need to be included in our fare estimation tool.
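Under that reasoning, the label for the model is the fare plus the tolls. A minimal sketch of deriving it in Pandas, using hypothetical stand-in rows and a made-up `target_amount` column name:

```python
import pandas as pd

# Stand-in rows; in the notebook these come from the BigQuery sample.
trips = pd.DataFrame({
    "fare_amount":  [6.0, 45.0],
    "tolls_amount": [0.0, 5.33],
    "total_amount": [7.2, 55.0],  # includes a discretionary tip
})

# The prediction target excludes tips, which are unknown for cash fares.
trips["target_amount"] = trips["fare_amount"] + trips["tolls_amount"]
print(trips[["fare_amount", "tolls_amount", "target_amount"]])
```

Note that `target_amount` deliberately differs from `total_amount` whenever a tip was recorded.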
Let's also look at the distribution of values within the columns.
In [12]:
trips.describe()
Out[12]:
Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.