Analyze a large dataset with Google BigQuery

Learning Objectives

  1. Access an ecommerce dataset
  2. Look at the dataset metadata
  3. Remove duplicate entries
  4. Write and execute queries

Introduction

BigQuery is Google's fully managed, NoOps, low-cost analytics database. With BigQuery you can query terabytes of data without having any infrastructure to manage or needing a database administrator. BigQuery uses SQL, follows a pay-as-you-go pricing model, and lets you focus on analyzing data to find meaningful insights.

We have a publicly available ecommerce dataset that has millions of Google Analytics records for the Google Merchandise Store loaded into a table in BigQuery. In this lab, you use a copy of that dataset. Sample scenarios are provided in which you look at the data and at ways to remove duplicate information. The lab then steps you through further analysis of the data.

BigQuery can be accessed through its own browser-based interface, Google Data Studio, and many third-party tools. In this lab you will use BigQuery directly in notebook cells via the IPython magic command %%bigquery.

The steps you will follow in the lab are analogous to what you would do to prepare data for use in advanced ML operations. You will follow the notebook to experiment with the BigQuery queries provided to analyze the data.

Set up the notebook environment

VERY IMPORTANT: In the cell below you must replace the text <YOUR PROJECT> with your GCP project ID.


In [ ]:
import os

import pandas as pd

PROJECT = "<YOUR PROJECT>"  # TODO: Replace with your project id

# Also export the project id as an environment variable (e.g. for shell commands).
os.environ["PROJECT"] = PROJECT

# Show up to 50 columns when displaying query results.
pd.options.display.max_columns = 50
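
Note: in most Google Cloud notebook environments the %%bigquery magic used throughout this lab is registered out of the box. If your environment does not recognize it, you can load it from the BigQuery client library:


In [ ]:
%load_ext google.cloud.bigquery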

Explore eCommerce data and identify duplicate records

Scenario: You were provided with Google Analytics logs for an eCommerce website in a BigQuery dataset. The data analyst team created a new BigQuery table of all the raw eCommerce visitor session data. This data tracks user interactions, location, device types, time on page, and details of any transaction. Your ultimate plan is to use this data in an ML capacity to create a model that delivers highly accurate predictions of user behavior to support tailored marketing campaigns.

First, a few notes on BigQuery within a Python notebook context. Any cell that starts with %%bigquery (the BigQuery magic) is interpreted as a SQL query, executed on BigQuery, and the result is printed to the notebook.
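
The magic is a convenience wrapper around the BigQuery Python client library. If you prefer explicit calls, an equivalent looks like the following (a minimal sketch using the google-cloud-bigquery package; the query shown is just a placeholder):


In [ ]:
from google.cloud import bigquery

# Create a client bound to our project, run a query, and fetch the
# result as a pandas DataFrame.
client = bigquery.Client(project=PROJECT)
client.query("SELECT 17 AS example_value").to_dataframe()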

BigQuery supports two flavors of SQL syntax: legacy SQL and standard SQL. Standard SQL is preferred because it complies with the official SQL:2011 standard. To instruct BigQuery to interpret our syntax as standard SQL, we start each query with #standardSQL.
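
For reference, the two dialects even reference tables differently (legacy SQL is shown only for contrast; every query in this lab uses standard SQL):

#legacySQL
SELECT COUNT(*) FROM [data-to-insights:ecommerce.all_sessions_raw]

#standardSQL
SELECT COUNT(*) FROM `data-to-insights.ecommerce.all_sessions_raw`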

Our first query accesses the BigQuery INFORMATION_SCHEMA, which stores all object-related metadata. In this case we want to see metadata details for the all_sessions_raw table.

Tip: To run the current cell, click the cell and press Shift+Enter.


In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT * EXCEPT (table_catalog, table_schema, is_generated, generation_expression,
                 is_stored, is_updatable, is_hidden, is_system_defined,
                 is_partitioning_column, clustering_ordinal_position)
FROM  `data-to-insights.ecommerce.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = "all_sessions_raw"

Next, examine how many rows are in the table.


In [ ]:
%%bigquery --project $PROJECT 
#standardSQL
SELECT count(*)
FROM `data-to-insights.ecommerce.all_sessions_raw`

Now take a quick look at a few rows of data in the table.


In [ ]:
%%bigquery --project $PROJECT 
#standardSQL
SELECT *
FROM `data-to-insights.ecommerce.all_sessions_raw`
LIMIT 7
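
Tip: the magic can also save a result set into a pandas DataFrame for further analysis; pass a variable name immediately after %%bigquery (sample_rows below is just an illustrative name):


In [ ]:
%%bigquery sample_rows --project $PROJECT
#standardSQL
SELECT *
FROM `data-to-insights.ecommerce.all_sessions_raw`
LIMIT 7

The result is then stored in the sample_rows DataFrame instead of being printed; for example, sample_rows.columns lists the field names.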

Identify duplicate rows

Seeing a sample of the data can give you greater intuition for what is included in the dataset. But since the table is quite large, a preview alone is not likely to render meaningful results. As you scan and scroll through the sample rows, you see there is no single field that uniquely identifies a row, so you need more advanced logic to identify duplicate rows.

The query below uses the SQL GROUP BY clause on every field and uses COUNT to find rows that share the same values across every field.

If every row is unique, the COUNT returns 1, as there are no other rows with exactly the same values for all fields. If two or more rows share the same values for every field, they are grouped together and the COUNT is greater than 1. The last part of the query is an aggregation filter, HAVING, which only shows results with a duplicate COUNT greater than 1. Run the following query to find duplicate records across all columns.


In [ ]:
%%bigquery --project $PROJECT 
#standardSQL
SELECT count(*) AS num_duplicate_rows, 
       * 
FROM   `data-to-insights.ecommerce.all_sessions_raw` 
GROUP BY fullvisitorid, 
          channelgrouping, 
          time, 
          country, 
          city, 
          totaltransactionrevenue, 
          transactions, 
          timeonsite, 
          pageviews, 
          sessionqualitydim, 
          date, 
          visitid, 
          type, 
          productrefundamount, 
          productquantity, 
          productprice, 
          productrevenue, 
          productsku, 
          v2productname, 
          v2productcategory, 
          productvariant, 
          currencycode, 
          itemquantity, 
          itemrevenue, 
          transactionrevenue, 
          transactionid, 
          pagetitle, 
          searchkeyword, 
          pagepathlevel1, 
          ecommerceaction_type, 
          ecommerceaction_step, 
          ecommerceaction_option 
HAVING num_duplicate_rows > 1;

As you can see, there are quite a few "duplicate" records (615) when analyzed with these parameters.

In your own datasets, even if you have a unique key, it is still beneficial to confirm the uniqueness of the rows with COUNT, GROUP BY, and HAVING before you begin your analysis.
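
As a reusable template (the table and column names below are placeholders for your own data), that uniqueness check looks like:

#standardSQL
SELECT your_key_column,
       COUNT(*) AS row_count
FROM   `your_project.your_dataset.your_table`
GROUP  BY your_key_column
HAVING row_count > 1 -- any rows returned indicate duplicate keys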

Analyze the new all_sessions table

In this section you use a deduplicated table called all_sessions.

Scenario: Your data analyst team has provided you with a relevant query, and your schema experts have identified the key fields that must be unique for each record per your schema.

Run the following query, this time against the all_sessions table, to confirm that no duplicates exist:


In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT fullvisitorid, # the unique visitor ID 
       visitid, # a visitor can have multiple visits 
       date, # session date stored as string YYYYMMDD 
       time, # time of the individual site hit  (can be 0 or more) 
       v2productname, # not unique since a product can have variants like Color 
       productsku, # unique for each product 
       type, # visit and/or event trigger 
       ecommerceaction_type, # maps to 'add to cart', 'completed checkout' 
       ecommerceaction_step, 
       ecommerceaction_option, 
       transactionrevenue, # revenue of the order 
       transactionid, # unique identifier for revenue bearing transaction 
       count(*) AS row_count 
FROM   `data-to-insights.ecommerce.all_sessions` 
GROUP  BY 1, 
          2, 
          3, 
          4, 
          5, 
          6, 
          7, 
          8, 
          9, 
          10, 
          11, 
          12 
HAVING row_count > 1 # find duplicates

The query returns zero records, indicating that no duplicates exist.
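
In an automated data-preparation pipeline you would want this check to fail loudly rather than rely on visual inspection. One way (a sketch that assumes the previous cell was run with a destination variable, i.e. %%bigquery duplicates --project $PROJECT) is to assert that the result is empty:


In [ ]:
# Assumes the previous query's result was captured into a DataFrame
# named `duplicates` by adding the variable name after %%bigquery.
assert duplicates.empty, f"Found {len(duplicates)} sets of duplicate rows"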

Write basic SQL against the eCommerce data

In this section, you query for insights on the ecommerce dataset.

A good first path of analysis is to find the total product views and unique visitors. The query below determines the total views by counting all rows (product_views) and the number of unique visitors by counting distinct fullVisitorID values (unique_visitors).


In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*)                      AS product_views, 
       count(DISTINCT fullvisitorid) AS unique_visitors 
FROM   `data-to-insights.ecommerce.all_sessions`;

The next query shows total unique visitors (fullVisitorID) by the referring site (channelGrouping):


In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(DISTINCT fullvisitorid) AS unique_visitors, 
       channelgrouping 
FROM   `data-to-insights.ecommerce.all_sessions` 
GROUP  BY 2 
ORDER  BY 2 DESC;

To find deeper insights in the data, the next query lists the five products with the most views (product_views) from unique visitors. The query counts the number of times a product (v2ProductName) was viewed (product_views), sorts the list in descending order, and lists the top 5 entries:


In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*)      AS product_views, 
       v2productname AS ProductName 
FROM   `data-to-insights.ecommerce.all_sessions` 
WHERE  type = 'PAGE' 
GROUP  BY v2productname 
ORDER  BY product_views DESC 
LIMIT  5;

Now expand your previous query to include the total number of distinct products ordered and the total number of units ordered (productQuantity):


In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*)               AS product_views, 
       count(productquantity) AS orders, 
       sum(productquantity)   AS quantity_product_ordered, 
       v2productname 
FROM   `data-to-insights.ecommerce.all_sessions` 
WHERE  type = 'PAGE' 
GROUP  BY v2productname 
ORDER  BY product_views DESC 
LIMIT  5;

Lastly, expand the query to include the average amount of product per order (total number of units ordered/total number of orders, or SUM(productQuantity)/COUNT(productQuantity)).


In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*)                                      AS product_views, 
       count(productquantity)                        AS orders, 
       sum(productquantity)                          AS quantity_product_ordered, 
       sum(productquantity) / count(productquantity) AS avg_per_order, 
       v2productname                                 AS productName 
FROM   `data-to-insights.ecommerce.all_sessions` 
WHERE  type = 'PAGE' 
GROUP  BY v2productname 
ORDER  BY product_views DESC 
LIMIT  5;

You can see that among these top 5 products by product views, the 22 oz YouTube Bottle Infuser had the highest avg_per_order, with 9.38 units per order.
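
One caveat on the division: COUNT(productquantity) is zero for products that were viewed but never ordered, and dividing by zero raises an error in BigQuery. The top sellers above all have orders, but if you extend the query beyond them, SAFE_DIVIDE returns NULL instead of failing. A variant of the ratio:


In [ ]:
%%bigquery --project $PROJECT
#standardSQL
SELECT count(*)               AS product_views, 
       count(productquantity) AS orders, 
       sum(productquantity)   AS quantity_product_ordered, 
       -- SAFE_DIVIDE returns NULL rather than erroring when orders is 0
       SAFE_DIVIDE(sum(productquantity), count(productquantity)) AS avg_per_order, 
       v2productname          AS productName 
FROM   `data-to-insights.ecommerce.all_sessions` 
WHERE  type = 'PAGE' 
GROUP  BY v2productname 
ORDER  BY product_views DESC 
LIMIT  5;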

You have completed this lab exercise. In this situation, the all_sessions table was provided to you with deduplicated records. In the course of your own future analysis you may have to create such a table yourself using BigQuery's CREATE TABLE DATASET.TABLE2 AS SELECT * FROM DATASET.TABLE1 syntax.
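
For illustration, a deduplicated copy could be created like this (mydataset is a placeholder; the public data-to-insights project is read-only, so the destination table must live in a dataset you own):


In [ ]:
%%bigquery --project $PROJECT
#standardSQL
-- Create a new table containing only the distinct rows of the raw data.
-- `mydataset` must be an existing dataset in your own project.
CREATE TABLE mydataset.all_sessions_dedup AS
SELECT DISTINCT *
FROM `data-to-insights.ecommerce.all_sessions_raw`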

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.