Conversion Analysis with Google Analytics Data

This sample notebook demonstrates working with Google Analytics page views and session data exported to Google BigQuery.

Google Analytics offers BigQuery export as part of its premium offering. If you're a premium user, you have the ability to export any of your analytics views to a BigQuery dataset that you own. If you're not, you can use the Analytics API to retrieve and import the data used to generate the default Analytics dashboards.

The sample data used in this notebook shares the same schema as the Google Analytics BigQuery export, but it is from a sample, pubicly available account. It is also small in size. This notebook demonstrates one possible custom analytics scenario, and is not based upon actual data.

Related Links:


In [1]:
import google.datalab.bigquery as bq

Understanding the Hits Data

It's helpful to inspect the schema and a sample of the data we're working with.


In [2]:
%%bq tables describe -n "google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910"


Out[2]:

The Google Analytics dataset has a large schema. It should be interesting to inspect some of the data in important columns.


In [3]:
%%bq query -n sessions 
SELECT fullVisitorId, visitId, hit.hitNumber as hitNumber, hit.page.pagePath as path
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
  CROSS JOIN UNNEST(hits) as hit
ORDER BY visitStartTime, hitNumber

In [4]:
%bq execute --query sessions


Out[4]:
fullVisitorIdvisitIdhitNumberpath
287971356260898352513788031731/
287971356260898352513788031732/vests/
287971356260898352513788031733/vests/orange.html
287971356260898352513788031734/vests/orange.html
287971356260898352513788031735/
287971356260898352513788031736/vests/
287971356260898352513788031737/vests/yellow.html
287971356260898352513788031738/vests/yellow.html
287971356260898352513788032501/helmets/
695573261279394985013788033561/helmets/heavy.html
695573261279394985013788033562/helmets/heavy.html
695573261279394985013788033563/helmets/heavy.html
695573261279394985013788033564/login.html
695573261279394985013788033565/login.html
695573261279394985013788033566/basket.html
695573261279394985013788033567/shipping.html
695573261279394985013788033568/billing.html
695573261279394985013788033569/confirm.html
6955732612793949850137880335610/confirm.html
6955732612793949850137880335611/confirm.html
167714015729620549813788033861/vests/orange.html
167714015729620549813788033862/vests/orange.html
167714015729620549813788033863/vests/orange.html
167714015729620549813788033864/basket.html
167714015729620549813788033865/login.html

(rows: 362, time: 0.3s, cached, job: job_CL1g5-WnW4U_LS2on5lJG24Pckg)

The data is organized as a set of visits (or sessions), with each visit containing a set of hits (or page views), in succession. Each hit has a URL path associated with it. Here is another query that shows paths and the number of hits across sessions.


In [5]:
%%bq query -n hits
SELECT hit.page.pagePath as path, COUNT(visitId) as hitCount
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
  CROSS JOIN UNNEST(hits) as hit
GROUP BY path
ORDER BY hitCount DESC

In [6]:
%%bq execute -q hits


Out[6]:
pathhitCount
/confirm.html51
/45
/login.html43
/helmets/foldable.html38
/vests/yellow.html24
/vests/23
/helmets/23
/basket.html21
/helmets/heavy.html20
/vests/orange.html20
/shipping.html20
/helmets/light.html16
/billing.html16
/vests/yellow.html?uym_source=google2

(rows: 14, time: 0.3s, cached, job: job_nUF0qmIuv0zFu1ltQndoqLDX3RY)

Producing Conversion Data

For the purposes of this sample, the question to be answered is "Which path leads to higher conversion ratio: users landing on the landing page (path = '/') or users landing on a product page (eg. '/vests/yellow.html')?" "Conversion" is defined as the user loading the '/confirm/' page within a single session.


In [7]:
%%bq query -n conversions
WITH
  AnnotatedVisits AS (
    SELECT
      visitId,
      hit.page.pagePath AS path,
      hit.hitNumber AS hitNumber,
      '/confirm.html' IN (SELECT page.pagePath FROM UNNEST(hits)) AS transacted
    FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
      CROSS JOIN UNNEST(hits) AS hit
    ORDER BY visitStartTime, hitNumber)
SELECT
  IF (path = '/', 'home', 'product') AS start,
  IF (transacted, 'completed', 'abandoned') AS outcome,
  COUNT(*) AS count
FROM AnnotatedVisits
WHERE hitNumber = 1
GROUP BY start, outcome
ORDER BY outcome, start

In [8]:
%%bq execute -q conversions


Out[8]:
startoutcomecount
homeabandoned18
productabandoned29
homecompleted5
productcompleted11

(rows: 4, time: 0.2s, cached, job: job_6Ur3LlQplBQqC-m5EusnlgE_1d8)

Visualizing the Conversion Path

The matrix tells us a bit about completed visits vs. abandoned visits, depending on the starting point. However, this is more easily seen in a sankey diagram, which is provided by the Google Charting API.


In [9]:
%%chart sankey --data conversions
{
  "sankey": {
    "node": {
      "colors": [ "black", "red", "black", "green" ]
    }
  }
}


Out[9]:

Insights

The results and the diagram tell us that the product pages lead to a slightly better conversion.

Going Beyond the Sample

This sample demonstrated how you can apply the power of BigQuery to your Google Analytics, and more generally to any analytics telemetry data. Essentially, you can build custom metrics and visualizations, and track the behavior and engagement of your audience more effectively by joining analytics data with your own data.