Analyze a large dataset with Google BigQuery

Learning Objectives

  1. Access an ecommerce dataset
  2. Look at the dataset metadata
  3. Remove duplicate entries
  4. Write and execute queries

Introduction

BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage or needing a database administrator. BigQuery uses SQL and can take advantage of the pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.

We have a publicly available ecommerce dataset that has millions of Google Analytics records for the Google Merchandise Store loaded into a table in BigQuery. In this lab, you use a copy of that dataset. Sample scenarios are provided, from which you look at the data and ways to remove duplicate information. The lab then steps you through further analysis the data.

BigQuery can be accessed by its own browser-based interface, Google Data Studio, and many third party tools. In this lab you will use the BigQuery directly in notebook cells using the iPython magic command %%bigquery.

The steps you will follow in the lab are analogous to what you would do to prepare data for use in advanced ML operations. You will follow the notebook to experiment with the BigQuery queries provided to analyze the data.

Set up the notebook environment

VERY IMPORTANT: In the cell below you must replace the text <YOUR PROJECT> with you GCP project id.


In [15]:
import os

import pandas as pd

PROJECT = "<YOUR PROJECT>" #TODO Replace with your project id

os.environ["PROJECT"] = PROJECT

pd.options.display.max_columns = 50

Explore eCommerce data and identify duplicate records

Scenario: You were provided with Google Analytics logs for an eCommerce website in a BigQuery dataset. The data analyst team created a new BigQuery table of all the raw eCommerce visitor session data. This data tracks user interactions, location, device types, time on page, and details of any transaction. Your ultimate plan is to use this data in an ML capacity to create a model that delivers highly accurate predictions of user behavior to support tailored marketing campaigns.

First, a few notes on BigQuery within a python notebook context. Any cell that starts with %%bigquery (the BigQuery Magic) will be interpreted as a SQL query that is executed on BigQuery, and the result is printed to our notebook.

BigQuery supports two flavors of SQL syntax: legacy SQL and standard SQL. The preferred is standard SQL because it complies with the official SQL:2011 standard. To instruct BigQuery to interpret our syntax as such we start the query with #standardSQL.

Our first query is accessing the BigQuery Information Schema which stores all object-related metadata. In this case we want to see metadata details for the "all_sessions_raw" table.

Tip: To run the current cell you can click the cell and hit shift enter

TODO 2


In [17]:
%%bigquery --project $PROJECT
#standardsql
SELECT * 
EXCEPT 
      (table_catalog, table_schema, is_generated, generation_expression, is_stored, 
      is_updatable, is_hidden, is_system_defined, is_partitioning_column, clustering_ordinal_position)
FROM  `data-to-insights.ecommerce.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name="all_sessions_raw"


Out[17]:
table_name column_name ordinal_position is_nullable data_type
0 all_sessions_raw fullVisitorId 1 YES STRING
1 all_sessions_raw channelGrouping 2 YES STRING
2 all_sessions_raw time 3 YES INT64
3 all_sessions_raw country 4 YES STRING
4 all_sessions_raw city 5 YES STRING
5 all_sessions_raw totalTransactionRevenue 6 YES INT64
6 all_sessions_raw transactions 7 YES INT64
7 all_sessions_raw timeOnSite 8 YES INT64
8 all_sessions_raw pageviews 9 YES INT64
9 all_sessions_raw sessionQualityDim 10 YES INT64
10 all_sessions_raw date 11 YES STRING
11 all_sessions_raw visitId 12 YES INT64
12 all_sessions_raw type 13 YES STRING
13 all_sessions_raw productRefundAmount 14 YES INT64
14 all_sessions_raw productQuantity 15 YES INT64
15 all_sessions_raw productPrice 16 YES INT64
16 all_sessions_raw productRevenue 17 YES INT64
17 all_sessions_raw productSKU 18 YES STRING
18 all_sessions_raw v2ProductName 19 YES STRING
19 all_sessions_raw v2ProductCategory 20 YES STRING
20 all_sessions_raw productVariant 21 YES STRING
21 all_sessions_raw currencyCode 22 YES STRING
22 all_sessions_raw itemQuantity 23 YES INT64
23 all_sessions_raw itemRevenue 24 YES INT64
24 all_sessions_raw transactionRevenue 25 YES INT64
25 all_sessions_raw transactionId 26 YES STRING
26 all_sessions_raw pageTitle 27 YES STRING
27 all_sessions_raw searchKeyword 28 YES STRING
28 all_sessions_raw pagePathLevel1 29 YES STRING
29 all_sessions_raw eCommerceAction_type 30 YES STRING
30 all_sessions_raw eCommerceAction_step 31 YES INT64
31 all_sessions_raw eCommerceAction_option 32 YES STRING

Next examine how many rows are in the table.

TODO 1


In [18]:
%%bigquery --project $PROJECT 
#standardSQL
SELECT count(*)
FROM `data-to-insights.ecommerce.all_sessions_raw`


Out[18]:
f0_
0 21552195

Now take a quick at few rows of data in the table.


In [19]:
%%bigquery --project $PROJECT 
#standardSQL
SELECT *
FROM `data-to-insights.ecommerce.all_sessions_raw`
LIMIT 7


Out[19]:
fullVisitorId channelGrouping time country city totalTransactionRevenue transactions timeOnSite pageviews sessionQualityDim date visitId type productRefundAmount productQuantity productPrice productRevenue productSKU v2ProductName v2ProductCategory productVariant currencyCode itemQuantity itemRevenue transactionRevenue transactionId pageTitle searchKeyword pagePathLevel1 eCommerceAction_type eCommerceAction_step eCommerceAction_option
0 140505387240227138 Direct 0 Ghana not available in demo dataset None None None 1 None 20161007 1475872666 PAGE None None 9990000 None GGOEGDHG014499 Google Infuser-Top Water Bottle Home/Drinkware/ (not set) USD None None None None Drinkware | Google Merchandise Store None /store.html 0 1 None
1 2161300302561027053 Organic Search 0 India not available in demo dataset None None None 1 None 20170521 1495355218 PAGE None None 99990000 None GGOEGBRJ037299 Google Alpine Style Backpack Home/Bags/ (not set) USD None None None None Bags | Google Merchandise Store None /google+redesign/ 0 1 None
2 5809923035957342173 Direct 0 United States not available in demo dataset None None None 1 None 20161008 1475972048 PAGE None None 99990000 None GGOEGBRA037499 Waterproof Backpack Home/Bags/ (not set) USD None None None None Bags None /google+redesign/ 0 1 None
3 8178337623496064877 Referral 0 United States Mountain View None None None 1 None 20160909 1473454898 PAGE None None 10990000 None GGOEGCLB020832 Softsided Travel Pouch Set Home/Bags/ (not set) USD None None None None Bags None /google+redesign/ 0 1 None
4 9730509990517739822 Direct 0 Canada Toronto None None None 1 None 20160928 1475105477 PAGE None None 3500000 None GGOEGBJR018199 Reusable Shopping Bag Home/Bags/ (not set) USD None None None None Bags None /google+redesign/ 0 1 None
5 9977935485234401557 Referral 0 United Kingdom London None None None 1 None 20160811 1470905998 PAGE None None 4990000 None GGOEGOAA017199 Rubber Grip Ballpoint Pen 4 Pack Home/Office/ (not set) USD None None None None Office None /google+redesign/ 0 1 None
6 0064667731979082203 Organic Search 0 United States New York None None None 1 None 20161012 1476318035 PAGE None None 17990000 None GGOEGOAB016099 Leather and Metal Ballpoint Pen Home/Office/ (not set) USD None None None None Office | Google Merchandise Store None /google+redesign/ 0 1 None

Identify duplicate rows

Seeing a sample amount of data may give you greater intuition for what is included in the dataset. But since the table is quite large, a preview is not likely to render meaningful results. As you scan and scroll through the sample rows you see there is no singular field that uniquely identifies a row, so you need advanced logic to identify duplicate rows.

The query below uses the SQL GROUP BY function on every field and counts (COUNT) where there are rows that have the same values across every field.

If every field is unique, the COUNT will return 1 as there are no other groupings of rows with the exact same value for all fields. If there is a row with the same values for all fields, they will be grouped together and the COUNT will be greater than 1. The last part of the query is an aggregation filter using HAVING to only show the results that have a COUNT of duplicates greater than 1. Run the following query to find duplicate records across all columns.

TODO 3


In [20]:
%%bigquery --project $PROJECT 
#standardSQL
SELECT count(*) AS num_duplicate_rows, 
       * 
FROM   `data-to-insights.ecommerce.all_sessions_raw` 
GROUP BY fullvisitorid, 
          channelgrouping, 
          time, 
          country, 
          city, 
          totaltransactionrevenue, 
          transactions, 
          timeonsite, 
          pageviews, 
          sessionqualitydim, 
          date, 
          visitid, 
          type, 
          productrefundamount, 
          productquantity, 
          productprice, 
          productrevenue, 
          productsku, 
          v2productname, 
          v2productcategory, 
          productvariant, 
          currencycode, 
          itemquantity, 
          itemrevenue, 
          transactionrevenue, 
          transactionid, 
          pagetitle, 
          searchkeyword, 
          pagepathlevel1, 
          ecommerceaction_type, 
          ecommerceaction_step, 
          ecommerceaction_option 
HAVING num_duplicate_rows > 1;


Out[20]:
num_duplicate_rows fullVisitorId channelGrouping time country city totalTransactionRevenue transactions timeOnSite pageviews sessionQualityDim date visitId type productRefundAmount productQuantity productPrice productRevenue productSKU v2ProductName v2ProductCategory productVariant currencyCode itemQuantity itemRevenue transactionRevenue transactionId pageTitle searchKeyword pagePathLevel1 eCommerceAction_type eCommerceAction_step eCommerceAction_option
0 2 4890832126611809362 Organic Search 0 Sri Lanka not available in demo dataset NaN NaN NaN 2 NaN 20170610 1497087233 PAGE None NaN 20990000 NaN GGOEGAAX0356 YouTube Men's Vintage Tank Home/Shop by Brand/YouTube/ (not set) USD None None NaN None YouTube | Shop by Brand | Google Merchandise S... None /google+redesign/ 0 1 None
1 2 477997596462135678 Direct 653506 United States San Jose 246000000.0 1.0 931.0 36 NaN 20170118 1484796890 PAGE None NaN 119000000 NaN GGOENEBQ078999 Nest® Cam Outdoor Security Camera - USA Nest-USA Single Option Only USD None None NaN ORD201701182177 Checkout Confirmation None /ordercompleted.html 6 1 None
2 2 4890832126611809362 Organic Search 0 Sri Lanka not available in demo dataset NaN NaN NaN 2 NaN 20170610 1497087233 PAGE None NaN 24990000 NaN GGOEYHPA003610 YouTube Wool Heather Cap Heather/Black Home/Shop by Brand/YouTube/ (not set) USD None None NaN None YouTube | Shop by Brand | Google Merchandise S... None /google+redesign/ 0 1 None
3 4 3211801082901013750 Referral 417832 United States New York 409000000.0 1.0 418.0 23 NaN 20170601 1496375475 PAGE None 1.0 100000000 102250000.0 GGOEGGCX056199 Gift Card- $100.00 Gift Cards Single Option Only USD None None 409000000.0 ORD201706013038 Checkout Confirmation None /ordercompleted.html 6 1 None
4 2 8306703617933176158 Organic Search 559533 Canada not available in demo dataset 151000000.0 1.0 567.0 18 NaN 20170511 1494505602 PAGE None NaN 149000000 NaN GGOENEBB081499 Nest® Cam Indoor Security Camera - CA Nest-Canada Single Option Only USD None None NaN ORD201705112202 Checkout Confirmation None /ordercompleted.html 6 1 None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
610 2 6706553303219862080 Organic Search 78319 Australia Melbourne NaN NaN 88.0 8 NaN 20170301 1488431352 PAGE None NaN 4990000 NaN GGOEYDHJ056099 22 oz YouTube Bottle Infuser Home/Shop by Brand/YouTube/ (not set) None None None NaN None YouTube | Shop by Brand | Google Merchandise S... None /google+redesign/ 2 1 None
611 4 9758235511148216157 Direct 682296 United States Sunnyvale 134000000.0 2.0 705.0 22 NaN 20170615 1497575168 PAGE None NaN 25000000 NaN GGOEGGCX056299 Gift Card - $25.00 Gift Cards Single Option Only USD None None NaN None Payment Method None /payment.html 5 2 Payment
612 3 1915538933685278364 Referral 487234 United States San Jose NaN NaN 638.0 22 NaN 20161221 1482356339 PAGE None 1.0 100000000 NaN GGOEGGCX056199 Gift Card- $100.00 Gift Cards Single Option Only USD None None NaN None Checkout Your Information None /yourinfo.html 5 1 Billing and Shipping
613 2 8368489856222393707 Direct 331079 United States Mountain View NaN NaN 423.0 13 NaN 20170216 1487286356 EVENT None NaN 16990000 NaN GGOEGAAX0104 Google Men's 100% Cotton Short Sleeve Hero Tee... Home/Apparel/Men's/ (not set) None None None NaN None Men's Apparel | Google Merchandise Store None /google+redesign/ 1 1 None
614 2 917551604501805376 Direct 86855 Russia not available in demo dataset NaN NaN 145.0 53 NaN 20170428 1493445465 EVENT None NaN 24990000 NaN GGOEAHPA004110 Android Wool Heather Cap Heather/Black Home/Shop by Brand/Android/ (not set) None None None NaN None Android | Shop by Brand | Google Merchandise S... None /store.html 1 1 None

615 rows × 33 columns

As you can see there are quite a few "duplicate" records (615) when analyzed with these parameters.

In your own datasets, even if you have a unique key, it is still beneficial to confirm the uniqueness of the rows with COUNT, GROUP BY, and HAVING before you begin your analysis.

Analyze the new all_sessions table

In this section you use a deduplicated table called all_sessions.

Scenario: Your data analyst team has provided you with a relevant query, and your schema experts have identified the key fields that must be unique for each record per your schema.

Run the query to confirm that no duplicates exist, this time against the "all_sessions" table:


In [21]:
%%bigquery --project $PROJECT
#standardSQL
SELECT fullvisitorid, # the unique visitor ID 
       visitid, # a visitor can have multiple visits 
       date, # session date stored as string YYYYMMDD 
       time, # time of the individual site hit  (can be 0 or more) 
       v2productname, # not unique since a product can have variants like Color 
       productsku, # unique for each product 
       type, # visit and/or event trigger 
       ecommerceaction_type, # maps to ‘add to cart', ‘completed checkout' 
       ecommerceaction_step, 
       ecommerceaction_option, 
       transactionrevenue, # revenue of the order 
       transactionid, # unique identifier for revenue bearing transaction 
       count(*) AS row_count 
FROM   `data-to-insights.ecommerce.all_sessions` 
GROUP  BY 1, 
          2, 
          3, 
          4, 
          5, 
          6, 
          7, 
          8, 
          9, 
          10, 
          11, 
          12 
HAVING row_count > 1 # find duplicates


Out[21]:
fullvisitorid visitid date time v2productname productsku type ecommerceaction_type ecommerceaction_step ecommerceaction_option transactionrevenue transactionid row_count

The query returns zero records indicating no duplicates exist.

Write basic SQL against the eCommerce data (TODO 4)

In this section, you query for insights on the ecommerce dataset.

A good first path of analysis is to find the total unique visitors The query below determines the total views by counting product_views and the number of unique visitors by counting fullVisitorID.


In [22]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*)                      AS product_views, 
       count(DISTINCT fullvisitorid) AS unique_visitors 
FROM   `data-to-insights.ecommerce.all_sessions`;


Out[22]:
product_views unique_visitors
0 21493109 389934

The next query shows total unique visitors(fullVisitorID) by the referring site (channelGrouping):


In [23]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(DISTINCT fullvisitorid) AS unique_visitors, 
       channelgrouping 
FROM   `data-to-insights.ecommerce.all_sessions` 
GROUP  BY 2 
ORDER  BY 2 DESC;


Out[23]:
unique_visitors channelgrouping
0 38101 Social
1 57308 Referral
2 11865 Paid Search
3 211993 Organic Search
4 3067 Display
5 75688 Direct
6 5966 Affiliates
7 62 (Other)

To find deeper insights in the data, the next query lists the five products with the most views (product_views) from unique visitors. The query counts number of times a product (v2ProductName) was viewed (product_views), puts the list in descending order, and lists the top 5 entries:


In [24]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*)          AS product_views, 
       ( v2productname ) AS ProductName 
FROM   `data-to-insights.ecommerce.all_sessions` 
WHERE  type = 'PAGE' 
GROUP  BY v2productname 
ORDER  BY product_views DESC 
LIMIT  5;


Out[24]:
product_views ProductName
0 316482 Google Men's 100% Cotton Short Sleeve Hero Tee...
1 221558 22 oz YouTube Bottle Infuser
2 210700 YouTube Men's Short Sleeve Hero Tee Black
3 202205 Google Men's 100% Cotton Short Sleeve Hero Tee...
4 200789 YouTube Custom Decals

Now expand your previous query to include the total number of distinct products ordered and the total number of total units ordered (productQuantity):


In [25]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*)               AS product_views, 
       count(productquantity) AS orders, 
       sum(productquantity)   AS quantity_product_ordered, 
       v2productname 
FROM   `data-to-insights.ecommerce.all_sessions` 
WHERE  type = 'PAGE' 
GROUP  BY v2productname 
ORDER  BY product_views DESC 
LIMIT  5;


Out[25]:
product_views orders quantity_product_ordered v2productname
0 316482 3158 6352 Google Men's 100% Cotton Short Sleeve Hero Tee...
1 221558 508 4769 22 oz YouTube Bottle Infuser
2 210700 949 1114 YouTube Men's Short Sleeve Hero Tee Black
3 202205 2713 8072 Google Men's 100% Cotton Short Sleeve Hero Tee...
4 200789 1703 11336 YouTube Custom Decals

Lastly, expand the query to include the average amount of product per order (total number of units ordered/total number of orders, or SUM(productQuantity)/COUNT(productQuantity)).


In [26]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*)                                      AS product_views, 
       count(productquantity)                        AS orders, 
       sum(productquantity)                          AS quantity_product_ordered, 
       sum(productquantity) / Count(productquantity) AS avg_per_order, 
       v2productname                                 AS productName 
FROM   `data-to-insights.ecommerce.all_sessions` 
WHERE  type = 'PAGE' 
GROUP  BY v2productname 
ORDER  BY product_views DESC 
LIMIT  5;


Out[26]:
product_views orders quantity_product_ordered avg_per_order productName
0 316482 3158 6352 2.011400 Google Men's 100% Cotton Short Sleeve Hero Tee...
1 221558 508 4769 9.387795 22 oz YouTube Bottle Infuser
2 210700 949 1114 1.173867 YouTube Men's Short Sleeve Hero Tee Black
3 202205 2713 8072 2.975304 Google Men's 100% Cotton Short Sleeve Hero Tee...
4 200789 1703 11336 6.656489 YouTube Custom Decals

You can see that among these top 5 products by product views that the 22 oz YouTube Bottle Infuser had the highest avg_per_order with 9.38 units per order.

You have completed this lab exercise. In this situation the "all_sessions" was provided to you with the deduplicated records. In the course of your own future analysis you may have to create this on your own using BigQuery and the create table DATASET.TABLE2 as select * from DATASET.TABLE1 syntax.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.