Data Exploration

This lab is optional. It demonstrates advanced Pandas usage and in-depth data analysis.


Learning objectives:

  1. Learn useful patterns for exploring data before modeling
  2. Gain an understanding of the dataset and identify any data issues.

The goal of this notebook is to explore our base tables before we begin feature engineering and modeling. We will explore the price history of stocks in the S&P 500.

  • Price history: price history for individual stocks
  • S&P 500: a list of all companies in the S&P 500 and their ticker symbols

For our analysis, let's limit the price history to the year 2000 and later. In general, the further back historical data goes, the lower its predictive power tends to be.


In [ ]:
PROJECT = 'your-gcp-project'  # Change to your project.

In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt
from google.cloud import bigquery
from IPython.core.magic import register_cell_magic
from IPython import get_ipython

bq = bigquery.Client(project=PROJECT)

In [ ]:
# Allows you to easily reference Python variables in SQL queries.
@register_cell_magic('with_globals')
def with_globals(line, cell):
    contents = cell.format(**globals())
    if 'print' in line:
        print(contents)
    get_ipython().run_cell(contents)

Let's look at the tables and columns we have for analysis by querying the INFORMATION_SCHEMA.

Learning objective 1.


In [ ]:
%%with_globals
%%bigquery --project {PROJECT}
SELECT table_name, column_name, data_type
FROM `asl-ml-immersion.stock_src.INFORMATION_SCHEMA.COLUMNS`
ORDER BY table_name, ordinal_position

Price History

TODO: Visualize stock symbols from the dataset.


In [ ]:
%%with_globals
%%bigquery --project {PROJECT}
SELECT *
FROM `asl-ml-immersion.stock_src.price_history`
LIMIT 10
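One way to approach the TODO above (a sketch, not the official solution): count trading days per symbol in BigQuery, then plot the distribution in pandas. The query is built as a string here; the commented lines show how it would run with the `bq` client from the setup cell.

```python
# SQL to count rows (trading days) per symbol -- a sketch.
SYMBOL_COUNT_SQL = """
SELECT symbol, COUNT(*) AS n_days
FROM `asl-ml-immersion.stock_src.price_history`
GROUP BY symbol
ORDER BY n_days DESC
"""

# In the notebook, run it and visualize the distribution:
# df_symbols = bq.query(SYMBOL_COUNT_SQL).to_dataframe()
# df_symbols['n_days'].hist(bins=50)
```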

In [ ]:
def query_stock(symbol):
    return bq.query('''
    # TODO: query a specific stock
    '''.format(symbol)).to_dataframe()
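One possible completion of `query_stock` (a sketch, not the official solution; `bq` is the BigQuery client created at the top of this notebook, and the table path matches the one used elsewhere in the lab). Building the SQL in a helper keeps the query easy to inspect before sending it to BigQuery.

```python
def stock_history_sql(symbol):
    """Return the SQL for one symbol's full price history."""
    return """
    SELECT *
    FROM `asl-ml-immersion.stock_src.price_history`
    WHERE symbol = '{symbol}'
    ORDER BY Date
    """.format(symbol=symbol)


def query_stock(symbol):
    # `bq` is the BigQuery client created in the setup cell.
    return bq.query(stock_history_sql(symbol)).to_dataframe()
```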

In [ ]:
df_stock = query_stock('GOOG')

In [ ]:
df_stock.Date = pd.to_datetime(df_stock.Date)
ax = df_stock.plot(x='Date', y='Close', title='price')

# Add smoothed plot.
df_stock['Close_smoothed'] = df_stock.Close.rolling(100, center=True).mean()
df_stock.plot(x='Date', y='Close_smoothed', ax=ax);

TODO 2: Compare individual stocks to the S&P 500.


In [ ]:
SP500_SYMBOL = 'gspc'
df_sp = query_stock(SP500_SYMBOL)

# TODO: visualize S&P 500 price
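A possible visualization for the TODO above (a sketch; it assumes the query result has `Date` and `Close` columns, as in the GOOG example). If `df_sp` has not been created by the query, a tiny synthetic frame with hypothetical values lets the cell run standalone.

```python
import pandas as pd

# Fall back to hypothetical values (illustration only) if the query
# above has not been run yet.
if 'df_sp' not in globals():
    df_sp = pd.DataFrame({
        'Date': pd.date_range('2000-01-03', periods=5),
        'Close': [1455.2, 1399.4, 1402.1, 1403.5, 1441.5],
    })

df_sp['Date'] = pd.to_datetime(df_sp['Date'])
ax = df_sp.plot(x='Date', y='Close', title='S&P 500 index')
```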

Let's see how the prices of stocks change over time on a yearly basis. Using the LAG function, we can compute the change in stock price year-over-year.

Let's compute the average close difference for each year. This could, of course, be done in Pandas. Oftentimes it's useful to use a combination of BigQuery and Pandas for exploratory analysis. In general, it's most effective to let BigQuery do the heavy-duty processing and then use Pandas for the smaller results and visualization.

Learning objective 1, 2


In [ ]:
%%with_globals
%%bigquery --project {PROJECT} df
WITH 
with_year AS
    (
    SELECT symbol, 
    EXTRACT(YEAR FROM date) AS year,
    close
    FROM `asl-ml-immersion.stock_src.price_history`
    WHERE symbol in (SELECT symbol FROM `asl-ml-immersion.stock_src.snp500`)
    ),
year_aggregated AS
    (
    SELECT year, symbol, AVG(close) as avg_close
    FROM with_year
    WHERE year >= 2000
    GROUP BY year, symbol
    )
SELECT year, symbol, avg_close as close,
(LAG( 
    -- TODO: compute a year lag on avg_close
)) 
    AS next_yr_close
FROM year_aggregated
ORDER BY symbol, year
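One way to fill the SQL TODO above (a sketch, not the official solution) is `LAG(avg_close) OVER (PARTITION BY symbol ORDER BY year DESC)`; with years descending, the lagged row is the *following* year, matching the `next_yr_close` alias. The same computation can be sanity-checked in pandas on a toy frame:

```python
import pandas as pd

# Toy data: three years of symbol A, two years of symbol B.
toy = pd.DataFrame({
    'symbol': ['A', 'A', 'A', 'B', 'B'],
    'year':   [2000, 2001, 2002, 2000, 2001],
    'close':  [10.0, 12.0, 9.0, 5.0, 6.0],
})
# With years ascending, shift(-1) within each symbol pulls the next
# year's close -- the pandas analogue of the windowed lag above.
toy['next_yr_close'] = toy.groupby('symbol')['close'].shift(-1)
print(toy)
```

The last year of each symbol has no following year, so its `next_yr_close` is NaN; those rows are dropped by the `df.dropna` call below.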

Compute the year-over-year percentage increase.


In [ ]:
df.dropna(inplace=True)
df['percent_increase'] = (df.next_yr_close - df.close) / df.close

Let's visualize the yearly performance of some stocks.


In [ ]:
def get_random_stocks(n=5):
    random_stocks = df.symbol.sample(n=n, random_state=3)
    rand = df.merge(random_stocks)
    return rand[['year', 'symbol', 'percent_increase']]


rand = get_random_stocks()
for symbol, _df in rand.groupby('symbol'):
    plt.figure()
    sns.barplot(x='year', y="percent_increase", data=_df)
    plt.title(symbol)

There have been some major fluctuations in individual stocks. For example, there were major drops during the early 2000s for tech companies.


In [ ]:
df.sort_values('percent_increase').head()

In [ ]:
stock_symbol = 'YHOO'

In [ ]:
%%with_globals
%%bigquery --project {PROJECT} df
SELECT date, close
FROM `asl-ml-immersion.stock_src.price_history`
WHERE symbol='{stock_symbol}'
ORDER BY date

In [ ]:
ax = df.plot(x='date', y='close')

Stock splits can also impact our data, causing a stock price to drop rapidly. In practice, we would need to clean all of our stock data to account for this. This would be a major effort! Fortunately, in the case of IBM, for example, all stock splits occurred before the year 2000.

Learning objective 2

TODO: Query the IBM stock history and visualize how stock splits affect the data. A stock split shows up as a sudden drop in price.


In [ ]:
stock_symbol = 'IBM'

In [ ]:
%%with_globals
%%bigquery --project {PROJECT} df
SELECT date, close
FROM `asl-ml-immersion.stock_src.price_history`
WHERE symbol='{stock_symbol}'
ORDER BY date

In [ ]:
# TODO: can you visualize when the major stock splits occurred?
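One way to approach this TODO (a sketch, not the official solution): plot the close price and flag large one-day percentage drops, which is how splits show up in unadjusted data. If `df` has not been loaded by the query above, a tiny synthetic frame with a hypothetical 2:1 split lets the cell run standalone.

```python
import pandas as pd

# Fall back to hypothetical prices with a 2:1 split (illustration only)
# if the IBM query above has not been run yet.
if 'df' not in globals():
    df = pd.DataFrame({
        'date': pd.date_range('1979-05-28', periods=6),
        'close': [300.0, 302.0, 304.0, 152.0, 153.0, 155.0],
    })

df['date'] = pd.to_datetime(df['date'])
# Splits appear as large one-day drops in the close price.
daily_change = df['close'].pct_change()
suspect_splits = df[daily_change < -0.2]  # >20% drop in one day
ax = df.plot(x='date', y='close', title='IBM close')
print(suspect_splits)
```

The 20% threshold is an assumption chosen for illustration; common split ratios (2:1, 4:1) produce much larger single-day drops than ordinary market moves.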

S&P 500 companies list


In [ ]:
%%with_globals
%%bigquery --project {PROJECT} df
SELECT *
FROM `asl-ml-immersion.stock_src.snp500`

In [ ]:
df.industry.value_counts().plot(kind='barh');

We can join the price history table with the S&P 500 table to compare industries:

Learning objective 1,2


In [ ]:
%%with_globals
%%bigquery --project {PROJECT} df
WITH sp_prices AS
    (
    SELECT a.*, b.industry
    FROM `asl-ml-immersion.stock_src.price_history` a 
        JOIN `asl-ml-immersion.stock_src.snp500` b
    USING (symbol)
    WHERE date >= "2000-01-01"
    )
SELECT Date, industry, AVG(close) as close
FROM sp_prices
GROUP BY Date, industry
ORDER BY industry, Date

In [ ]:
df.head()

Using pandas, we can "unstack" our table so that each industry has its own column. This will be useful for plotting.


In [ ]:
# Pandas `unstack` to make each industry a column. Useful for plotting.
df_ind = df.set_index(['industry', 'Date']).unstack(0).dropna()
df_ind.columns = [c[1] for c in df_ind.columns]
df_ind.head()

In [ ]:
ax = df_ind.plot(figsize=(16, 8))
# Move legend down.
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), shadow=True, ncol=2)

Let's scale each industry using min/max scaling. This puts all of the stocks on the same scale; otherwise, it can be hard to see the changes in stocks over time across industries.

Learning objective 1


In [ ]:
def min_max_scale(df):
    # Scale each column to the [0, 1] range.
    return (df - df.min()) / (df.max() - df.min())
scaled = min_max_scale(df_ind)
ax = scaled.plot(figsize=(16, 8))
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), shadow=True, ncol=2);

We can also create a smoothed version of the plot above using a rolling mean. This is a useful transformation to make when visualizing time-series data.


In [ ]:
SMOOTHING_WINDOW = 30  # Days.
rolling = scaled.copy()
for col in scaled.columns:
    rolling[col] = scaled[col].rolling(SMOOTHING_WINDOW).mean()
ax = rolling.plot(figsize=(16, 8))
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), shadow=True, ncol=2);

Information technology had a large crash during the early 2000s and again, along with all other industries, in 2008/2009. After 2008, some industries were slower to recover than others.

BONUS: In the next lab, we will want to predict the price of the stock in the future. What are some features that we can use to predict future price? Try visualizing some of these features.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License


In [ ]: