Learning Objectives
BigQuery is Google's fully managed, NoOps, low-cost analytics database. With BigQuery you can query terabytes of data without any infrastructure to manage or a database administrator. BigQuery uses SQL and a pay-as-you-go model, so you can focus on analyzing data to find meaningful insights.
We have a publicly available ecommerce dataset that has millions of Google Analytics records for the Google Merchandise Store loaded into a table in BigQuery. In this lab, you use a copy of that dataset. Sample scenarios are provided in which you look at the data and at ways to remove duplicate information. The lab then steps you through further analysis of the data.
BigQuery can be accessed through its own browser-based interface, Google Data Studio, and many third-party tools. In this lab you use BigQuery directly in notebook cells with the IPython magic command %%bigquery.
The steps you will follow in the lab are analogous to what you would do to prepare data for use in advanced ML operations. You will follow the notebook to experiment with the BigQuery queries provided to analyze the data.
In [15]:
import os
import pandas as pd
PROJECT = "<YOUR PROJECT>" #TODO Replace with your project id
os.environ["PROJECT"] = PROJECT
pd.options.display.max_columns = 50
Scenario: You were provided with Google Analytics logs for an eCommerce website in a BigQuery dataset. The data analyst team created a new BigQuery table of all the raw eCommerce visitor session data. This data tracks user interactions, location, device types, time on page, and details of any transaction. Your ultimate plan is to use this data in an ML capacity to create a model that delivers highly accurate predictions of user behavior to support tailored marketing campaigns.
First, a few notes on using BigQuery within a Python notebook context. Any cell that starts with %%bigquery (the BigQuery magic) is interpreted as a SQL query that is executed on BigQuery, and the result is printed to the notebook.
BigQuery supports two flavors of SQL syntax: legacy SQL and standard SQL. Standard SQL is preferred because it complies with the official SQL:2011 standard. To instruct BigQuery to interpret our syntax as standard SQL, we start the query with #standardSQL.
Our first query is accessing the BigQuery Information Schema which stores all object-related metadata. In this case we want to see metadata details for the "all_sessions_raw" table.
Tip: To run the current cell, click the cell and press Shift+Enter.
TODO 2
In [17]:
%%bigquery --project $PROJECT
#standardSQL
SELECT *
EXCEPT
(table_catalog, table_schema, is_generated, generation_expression, is_stored,
is_updatable, is_hidden, is_system_defined, is_partitioning_column, clustering_ordinal_position)
FROM `data-to-insights.ecommerce.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name="all_sessions_raw"
Out[17]:
Next examine how many rows are in the table.
TODO 1
In [18]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*)
FROM `data-to-insights.ecommerce.all_sessions_raw`
Out[18]:
Now take a quick look at a few rows of data in the table.
In [19]:
%%bigquery --project $PROJECT
#standardSQL
SELECT *
FROM `data-to-insights.ecommerce.all_sessions_raw`
LIMIT 7
Out[19]:
Seeing a sample amount of data may give you greater intuition for what is included in the dataset. But since the table is quite large, a preview is not likely to render meaningful results. As you scan and scroll through the sample rows you see there is no singular field that uniquely identifies a row, so you need advanced logic to identify duplicate rows.
The query below uses the SQL GROUP BY clause on every field and counts (COUNT) the rows that share the same values across every field.
If every row is unique, the COUNT returns 1, because no other rows have the exact same values for all fields. If two or more rows have the same values for all fields, they are grouped together and the COUNT is greater than 1. The last part of the query is an aggregation filter using HAVING to show only the results with a duplicate COUNT greater than 1. Run the following query to find duplicate records across all columns.
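The same duplicate-finding logic (GROUP BY every column, then filter with HAVING) can be sketched locally with pandas on a toy DataFrame. The column names here are illustrative, not the full table schema:

```python
import pandas as pd

# Toy data: the last two rows are exact duplicates of each other.
df = pd.DataFrame({
    "fullvisitorid": ["a", "b", "c", "c"],
    "country": ["US", "DE", "US", "US"],
    "pageviews": [3, 1, 2, 2],
})

# GROUP BY every column, COUNT(*) per group...
dup_counts = (
    df.groupby(list(df.columns))
      .size()
      .reset_index(name="num_duplicate_rows")
)
# ...HAVING num_duplicate_rows > 1
duplicates = dup_counts[dup_counts["num_duplicate_rows"] > 1]
print(duplicates)
```

Only the fully identical rows collapse into a group with a count above 1, mirroring what the BigQuery query reports.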
TODO 3
In [20]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*) AS num_duplicate_rows,
*
FROM `data-to-insights.ecommerce.all_sessions_raw`
GROUP BY fullvisitorid,
channelgrouping,
time,
country,
city,
totaltransactionrevenue,
transactions,
timeonsite,
pageviews,
sessionqualitydim,
date,
visitid,
type,
productrefundamount,
productquantity,
productprice,
productrevenue,
productsku,
v2productname,
v2productcategory,
productvariant,
currencycode,
itemquantity,
itemrevenue,
transactionrevenue,
transactionid,
pagetitle,
searchkeyword,
pagepathlevel1,
ecommerceaction_type,
ecommerceaction_step,
ecommerceaction_option
HAVING num_duplicate_rows > 1;
Out[20]:
As you can see, there are quite a few "duplicate" records (615) when analyzed with these parameters.
In your own datasets, even if you have a unique key, it is still beneficial to confirm the uniqueness of the rows with COUNT, GROUP BY, and HAVING before you begin your analysis.
In this section you use a deduplicated table called all_sessions.
Scenario: Your data analyst team has provided you with a relevant query, and your schema experts have identified the key fields that must be unique for each record per your schema.
Run the query to confirm that no duplicates exist, this time against the "all_sessions" table:
In [21]:
%%bigquery --project $PROJECT
#standardSQL
SELECT fullvisitorid, # the unique visitor ID
visitid, # a visitor can have multiple visits
date, # session date stored as string YYYYMMDD
time, # time of the individual site hit (can be 0 or more)
v2productname, # not unique since a product can have variants like Color
productsku, # unique for each product
type, # visit and/or event trigger
ecommerceaction_type, # maps to 'add to cart', 'completed checkout'
ecommerceaction_step,
ecommerceaction_option,
transactionrevenue, # revenue of the order
transactionid, # unique identifier for revenue bearing transaction
count(*) AS row_count
FROM `data-to-insights.ecommerce.all_sessions`
GROUP BY 1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12
HAVING row_count > 1 # find duplicates
Out[21]:
The query returns zero records, indicating that no duplicates exist.
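A quick local analogue of this uniqueness check is pandas `duplicated` over the candidate key columns. The toy data below uses only three of the key fields for brevity:

```python
import pandas as pd

df = pd.DataFrame({
    "fullvisitorid": ["a", "a", "b"],
    "visitid": [1, 2, 1],
    "productsku": ["X", "X", "Y"],
})

key_cols = ["fullvisitorid", "visitid", "productsku"]
# Rows sharing the same values in every key column would be flagged True;
# here all key combinations are distinct, so no duplicates are found.
has_dups = df.duplicated(subset=key_cols).any()
print(has_dups)
```

This is the same confirmation the HAVING filter performs: if no key combination repeats, the result set is empty.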
In this section, you query for insights on the ecommerce dataset.
A good first path of analysis is to find the total number of unique visitors. The query below determines the total number of views by counting product_views, and the number of unique visitors by counting distinct fullVisitorID values.
In [22]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*) AS product_views,
count(DISTINCT fullvisitorid) AS unique_visitors
FROM `data-to-insights.ecommerce.all_sessions`;
Out[22]:
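The COUNT(*) and COUNT(DISTINCT ...) pattern maps directly onto pandas `len` and `nunique`, sketched here on a toy sample rather than the real table:

```python
import pandas as pd

# Toy visitor log: two visits by "a", two by "b".
df = pd.DataFrame({"fullvisitorid": ["a", "a", "b", "b"]})

product_views = len(df)                          # COUNT(*)
unique_visitors = df["fullvisitorid"].nunique()  # COUNT(DISTINCT fullvisitorid)
print(product_views, unique_visitors)  # 4 2
```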
The next query shows total unique visitors (fullVisitorID) by the referring site (channelGrouping):
In [23]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(DISTINCT fullvisitorid) AS unique_visitors,
channelgrouping
FROM `data-to-insights.ecommerce.all_sessions`
GROUP BY 2
ORDER BY 2 DESC;
Out[23]:
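Grouping a distinct count by channel has a direct pandas equivalent: `groupby(...).nunique()`, with the ORDER BY 2 DESC mirrored by a descending sort on the channel name. Toy data, illustrative channel labels:

```python
import pandas as pd

df = pd.DataFrame({
    "fullvisitorid": ["a", "a", "b", "c"],
    "channelgrouping": ["Direct", "Direct", "Organic", "Direct"],
})

# COUNT(DISTINCT fullvisitorid) ... GROUP BY channelgrouping
by_channel = (
    df.groupby("channelgrouping")["fullvisitorid"]
      .nunique()
      .sort_index(ascending=False)  # ORDER BY channelgrouping DESC
)
print(by_channel)
```

Note that "a" appears twice under Direct but is counted once, just as COUNT(DISTINCT ...) would do.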
To find deeper insights in the data, the next query lists the five products with the most views (product_views) from unique visitors. The query counts the number of times a product (v2ProductName) was viewed (product_views), sorts the list in descending order, and lists the top 5 entries:
In [24]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*) AS product_views,
v2productname AS ProductName
FROM `data-to-insights.ecommerce.all_sessions`
WHERE type = 'PAGE'
GROUP BY v2productname
ORDER BY product_views DESC
LIMIT 5;
Out[24]:
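The top-N-by-group-count pattern (GROUP BY, ORDER BY count DESC, LIMIT) can be sketched in pandas with `groupby(...).size()`, a descending sort, and `head`. Product names below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "v2productname": ["Bottle", "Bottle", "Tee", "Bottle", "Mug", "Tee"],
    "type": ["PAGE"] * 6,
})

# WHERE type = 'PAGE' ... GROUP BY v2productname ORDER BY count DESC LIMIT 5
top = (
    df[df["type"] == "PAGE"]
      .groupby("v2productname")
      .size()
      .sort_values(ascending=False)
      .head(5)
)
print(top)
```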
Now expand your previous query to include the total number of distinct products ordered and the total number of units ordered (productQuantity):
In [25]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*) AS product_views,
count(productquantity) AS orders,
sum(productquantity) AS quantity_product_ordered,
v2productname
FROM `data-to-insights.ecommerce.all_sessions`
WHERE type = 'PAGE'
GROUP BY v2productname
ORDER BY product_views DESC
LIMIT 5;
Out[25]:
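A key subtlety in this query is that COUNT(productquantity) skips NULLs (views without an order) while COUNT(*) does not; SUM likewise ignores NULLs. Pandas named aggregation shows the same behavior on toy data, where `None` stands in for a view with no order:

```python
import pandas as pd

df = pd.DataFrame({
    "v2productname": ["Bottle", "Bottle", "Bottle", "Tee"],
    "productquantity": [2.0, None, 4.0, 1.0],  # None = view with no order
})

stats = df.groupby("v2productname").agg(
    product_views=("productquantity", "size"),   # COUNT(*): all rows
    orders=("productquantity", "count"),         # COUNT(col): non-null only
    quantity_product_ordered=("productquantity", "sum"),  # SUM ignores nulls
)
print(stats)
```

The Bottle group has 3 views but only 2 orders, because the null row counts toward views but not orders.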
Lastly, expand the query to include the average number of units per order (total units ordered divided by total orders, or SUM(productQuantity)/COUNT(productQuantity)).
In [26]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*) AS product_views,
count(productquantity) AS orders,
sum(productquantity) AS quantity_product_ordered,
sum(productquantity) / count(productquantity) AS avg_per_order,
v2productname AS productName
FROM `data-to-insights.ecommerce.all_sessions`
WHERE type = 'PAGE'
GROUP BY v2productname
ORDER BY product_views DESC
LIMIT 5;
Out[26]:
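The avg_per_order arithmetic is just the ratio of the two aggregates from the previous step; because both SUM and COUNT skip NULLs, rows without an order drop out of both numerator and denominator. A minimal sketch with toy values:

```python
import pandas as pd

df = pd.DataFrame({"productquantity": [2.0, None, 4.0]})

orders = df["productquantity"].count()  # COUNT(productquantity): 2
units = df["productquantity"].sum()     # SUM(productquantity): 6.0
avg_per_order = units / orders          # 6.0 / 2
print(avg_per_order)  # 3.0
```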
Among these top 5 products by product views, the 22 oz YouTube Bottle Infuser had the highest avg_per_order, with 9.38 units per order.
You have completed this lab exercise. In this situation the "all_sessions" table was provided to you with deduplicated records. In your own future analysis you may have to create such a table yourself in BigQuery using the CREATE TABLE DATASET.TABLE2 AS SELECT DISTINCT * FROM DATASET.TABLE1 syntax.
Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.