Copyright 2017 Google Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Imagine you're analyzing a subset of patents and want to do some text analysis of the first independent claim. To do this, you'd need to be able to join your list of patent publication numbers with a dataset containing the patent text. Additionally, you'd need a method to extract the first claim from the rest of the claims. This notebook demonstrates one method of performing this analysis using Python, BigQuery, and Google's new public dataset on patents.
What follows is a walkthrough of how to extract the first claim from a set of US patent publications and run a few summary statistics against the first claim. The set of patents we analyze in this notebook is a random sample of 20,000 patents filed between 1995 and the present in the G06F Cooperative Patent Classification (CPC) code; however, this analysis could be repeated for any arbitrary list of patents of interest.
The steps we'll use to get there are:
1. Create a new BigQuery dataset to hold our results.
2. Upload our list of publication numbers to a table in that dataset.
3. Join that list against the public patent data to pull in the claim text.
4. Use a user-defined function to split out the first claim.
5. Compute and plot summary statistics on the first claim.
In [0]:
from __future__ import print_function
from google.cloud import bigquery
import pandas as pd
import matplotlib.pylab as plt
from oauth2client.client import GoogleCredentials
from googleapiclient import discovery
import uuid
import time
# Variables to be used throughout the notebook, replace with your project
# and desired dataset name.
PROJECT_ID = 'my_project_name'
DEST_DATASET = 'claims_analysis'
# Create a python client we can use for executing table creation queries
client = bigquery.Client(project=PROJECT_ID)
# Create an HTTP client for additional functionality.
credentials = GoogleCredentials.get_application_default()
http_client = discovery.build('bigquery', 'v2', credentials=credentials)
In [0]:
%matplotlib inline
During this analysis, we're going to leverage BigQuery to do most of the computation, and we'll store our results in BigQuery tables. To keep this organized, we'll use the API to create a new dataset to work in. For more on working with the BigQuery Python client library, see this page.
In [0]:
dataset_ref = client.dataset(DEST_DATASET)
dataset = bigquery.Dataset(dataset_ref)
client.create_dataset(dataset)
print('Dataset {} created.'.format(dataset.dataset_id))
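Note that create_dataset raises a Conflict error if the dataset already exists. If you expect to re-run the notebook, a more tolerant variant might look like this (a sketch, not part of the original flow):
In [0]:
from google.cloud.exceptions import Conflict

try:
    client.create_dataset(bigquery.Dataset(client.dataset(DEST_DATASET)))
except Conflict:
    # The dataset was created on an earlier run; safe to continue.
    print('Dataset {} already exists.'.format(DEST_DATASET))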
In this hypothetical example, we are planning to inspect the first claims from a set of patents. To do this, we need to get our list of publication numbers into BigQuery. Since this is just an example, we're going to use a random sample of 20k G06F patents that we have saved to a CSV file and added to this repo, but this could be any set of patents you're interested in. We'll leverage the pandas "to_gbq" method to simplify this. (Note: you could also do this from the web UI or the API.)
The steps we need to complete are:
1. Load the list of publication numbers from the CSV file into a pandas DataFrame.
2. Upload the DataFrame to a table in our BigQuery dataset with "to_gbq".
In [0]:
# Load a list of patents from disk
df = pd.read_csv('./data/20k_G_and_H_publication_numbers.csv')
df.head(5)
Out[0]:
In [0]:
# Upload these to our dataset on BigQuery.
pubs_table = 'claim_text_publications'
full_table_path = '{}.{}'.format(DEST_DATASET, pubs_table)
df.to_gbq(destination_table=full_table_path,
project_id=PROJECT_ID,
if_exists='replace')
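As a quick sanity check on the upload (optional; a sketch using the variables defined above), we can count the rows that landed in the new table:
In [0]:
# Verify the publication numbers arrived in BigQuery.
query = 'SELECT COUNT(*) AS cnt FROM `{}.{}.{}`'.format(
    PROJECT_ID, DEST_DATASET, pubs_table)
pd.read_gbq(query, project_id=PROJECT_ID, dialect='standard')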
In the next section, we'll begin using the public patent data available on BigQuery. Among numerous other fields, this table includes the text of the patent claims, the CPC code, and the priority date of the patent claims. Because BigQuery doesn't have a "create table as" syntax, we've created a function to handle construction of a call to the BigQuery Python API. This is simply a helper function, which takes a query and produces a table. This can also easily be done using the web UI.
The steps to be completed are:
1. Define a helper function that runs a query and writes the results to a destination table.
2. Join our publication numbers against the public publications table to retrieve the English claim text.
3. Write the joined results to a new table.
In [0]:
def create_table_as(query_string, dest_table, project_id=PROJECT_ID, dest_dataset=DEST_DATASET,
                    http_client=http_client, overwrite=True, use_legacy_sql=False):
  """Simulates a 'create table as' statement in BigQuery.

  Args:
    query_string: A string query that produces rows of data.
    dest_table: String table name to use for storing results.
    project_id: String project id to use for running query and storing results.
    dest_dataset: String name of dataset to use for storing results.
    http_client: An http client for use in inserting the BigQuery job.
    overwrite: Should new data be appended to existing data or overwritten?
    use_legacy_sql: Defaults to standard SQL, but option to use legacy.

  Raises:
    Exception: If the BigQuery job finishes with an error, a general Exception
      is raised with the error message from BigQuery included.
  """
  write_disposition = 'WRITE_TRUNCATE'
  if not overwrite:
    write_disposition = 'WRITE_APPEND'
  config = {
      'kind': 'bigquery#job',
      # Supply a client-generated job id so we can poll for completion.
      'jobReference': {
          'projectId': project_id,
          'jobId': str(uuid.uuid4())
      },
      'configuration': {
          'query': {
              'query': query_string,
              'destinationTable': {
                  'projectId': project_id,
                  'datasetId': dest_dataset,
                  'tableId': dest_table
              },
              'createDisposition': 'CREATE_IF_NEEDED',
              'writeDisposition': write_disposition,
              'useLegacySql': use_legacy_sql
          }
      }
  }
  request = http_client.jobs().insert(
      projectId=project_id, body=config).execute()
  job_id = request['jobReference']['jobId']
  done = False
  while not done:
    jobs = http_client.jobs().list(projectId=project_id).execute()
    matches = [j for j in jobs['jobs'] if j['jobReference']['jobId'] == job_id]
    if matches and matches[0]['state'] == 'DONE':
      if matches[0].get('errorResult'):
        raise Exception(
            'Create table failed: {}'.format(matches[0]['errorResult']))
      done = True
    else:
      print('Job still running...')
      time.sleep(5)
  print('Created table {}.'.format(dest_table))
In [0]:
query = """
#standardSQL
WITH P AS (
SELECT
DISTINCT publication_number,
substr(cpc.code, 1,4) cpc4,
floor(priority_date / 10000) priority_yr
FROM `patents-public-data.patents.publications`,
unnest(cpc) as cpc
WHERE substr(cpc.code, 1,4) = 'G06F'
AND floor(priority_date / 10000) >= 1995
AND country_code = 'US'
)
SELECT
P.publication_number,
P.priority_yr,
P.cpc4,
claims.text
FROM `patents-public-data.patents.publications` as pubs,
UNNEST(claims_localized) as claims
JOIN P
ON P.publication_number = pubs.publication_number
JOIN `{}.{}.{}` my_pubs
ON pubs.publication_number = my_pubs.publication_number
WHERE claims.language = 'en'
""".format(PROJECT_ID, DEST_DATASET, pubs_table)
create_table_as(query, '20k_G06F_pubs_after_1994')
We've now written these results to a table; let's take a look at the output. First, we'll pull down a random sample of 500 rows using a BigQuery trick for selecting a random sample (see below). We'll get the data with a handy pandas function, "read_gbq". Once we have the data in memory, let's look at the distribution of priority years and inspect a few example claims.
To get a random sample of roughly n_rows, we can simply include a statement in our where clause that looks like:
where rand() < n_rows / total_rows_in_table
Since we have about 20k rows and we want a sample of about 500, this becomes:
where rand() < 500 / 20000
In [0]:
query = """
SELECT *
FROM `{}.{}.20k_G06F_pubs_after_1994`
WHERE RAND() < 500/20000
""".format(PROJECT_ID, DEST_DATASET)
df = pd.read_gbq(query, project_id=PROJECT_ID, dialect='standard')
In [0]:
# Quick visualization of the first 5 rows
df.head()
Out[0]:
In [0]:
# Do we have a mix of data from 1995-present?
df.priority_yr.value_counts().sort_index()
Out[0]:
In [0]:
# What do the claims look like?
for claim in df.sample(2).text:
print('-----' * 5, '\n\nEXAMPLE CLAIM: \n\n', claim[:3000])
Since we have the full claim text above, we need a way to parse out just the first claim. BigQuery has user-defined functions (UDFs), which let the user pass columns through some arbitrary function. In our case, we need some logic to parse the claims and split after the first claim has ended. In most cases we can simply split on the claim-number markers, but there are also cases where the text is not so cleanly formatted. The UDF below shows an example of using JavaScript regular expressions to parse the first claim.
The basic logic is to work down the following options, returning the first positive match:
1. Split on the start of claim 2: a period followed by whitespace and '2.' or '2 .'.
2. Split on the first reference to 'claim 1', which should only appear in a dependent claim.
3. If leading claims were canceled, find the first non-canceled claim and split on the number of the claim after it.
4. Split on the first sentence beginning with 'The', which should only happen after claim 1 has been defined.
5. If none of the above match, return the first 2,000 characters.
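Before looking at the JavaScript itself, here's a quick local sanity check of the first option using Python's re module (a sketch with a made-up claim string; the in-database version below is JavaScript):
In [0]:
import re

# A period, then whitespace, then '2.' or '2 .' marks the start of claim 2.
first_claim_pattern = re.compile(r'[.]\s+2\s*[.]')

sample = '1. A method comprising: receiving data. 2. The method of claim 1...'
print(first_claim_pattern.split(sample)[0])
# -> 1. A method comprising: receiving data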
In [0]:
# The JS script to run against the text of our claims.
js = r"""
// Regex to find a period followed by any number of spaces and '2.' or '2 .'
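// (The doubled backslashes below are intentional: this code is embedded in a
// BigQuery string literal, which unescapes '\\s' to '\s' before the JS runs.)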
var pattern = new RegExp(/[.][\\s]+[2][\\s]*[.]/, 'g');
if (pattern.test(text)) {
return text.split(pattern)[0];
}
// If that didn't match, try to find a reference to claim 1.
if (text.indexOf('claim 1 ') > -1) {
return text.split('claim 1 ')[0];
}
// Look for claim cancellations and return the first non-canceled claim.
// Normalize spelling variants; the 'g' flag replaces every occurrence.
text = text.replace(/canceled/gi, 'canceled');
text = text.replace(/cancelled/gi, 'canceled');
if (text.indexOf('(canceled)') > -1) {
text = text.split('(canceled)')
canceled = text[0];
not_canceled = text[1];
// Determine first non-cancelled claim
if (canceled.indexOf('-') > -1) {
    next_claim = parseInt(canceled.split('-')[1], 10) + 1;
}
else {
next_claim = 2;
}
// Split on the number of the next claim to isolate the first non-canceled claim.
return not_canceled.split(next_claim + 1)[0].trim();
}
// If none of the above worked, try to find a sentence starting with 'The'
// This should only happen after claim 1 has been defined.
if (text.indexOf(' The ') > -1) {
return text.split(' The ')[0];
}
// if none of the above worked return the first 2000 characters.
return text.slice(0,2000);
"""
In [0]:
# Insert the JS code above into this string, and pass it to BigQuery.
query = r'''
#standardSQL
CREATE TEMPORARY FUNCTION get_first_claim(text STRING)
RETURNS STRING
LANGUAGE js AS """
{}
""";
SELECT
pubs.*,
get_first_claim(text) first_claim
FROM `{}.{}.20k_G06F_pubs_after_1994` pubs
'''.format(js, PROJECT_ID, DEST_DATASET)
CLAIM_TEXT_TABLE = '20k_G06F_pubs_after_1994_split_first_claim'
create_table_as(query, CLAIM_TEXT_TABLE)
In [0]:
query = """
SELECT * FROM `{}.{}.{}` WHERE RAND() < 500/20000
""".format(PROJECT_ID, DEST_DATASET, CLAIM_TEXT_TABLE)
df = pd.read_gbq(query, project_id=PROJECT_ID, dialect='standard')
One way to test the UDF is to run a query in BigQuery against an example where we know the expected answer. For example:
SELECT get_first_claim("1. test claim 1 2. test claim 2");
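Since get_first_claim is a temporary function, it must be defined in the same query that calls it. A self-contained version of that test might look like this (a sketch reusing the js string defined above):
In [0]:
# Define the temporary function and evaluate it on a toy claim string.
test_query = r'''
#standardSQL
CREATE TEMPORARY FUNCTION get_first_claim(text STRING)
RETURNS STRING
LANGUAGE js AS """
{}
""";
SELECT get_first_claim("1. test claim 1 2. test claim 2") first_claim
'''.format(js)
pd.read_gbq(test_query, project_id=PROJECT_ID, dialect='standard')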
We can also run the function against real data and pull down a sample, which is shown below. We've now created a new table which has both the full claim text as well as our extracted first claim. Let's compare a few to see how well it worked.
In [0]:
for idx, row in df.sample(2).iterrows():
print('------' * 10, '\nORIGINAL CLAIM TEXT:\n', '------' * 10)
print(row.text[:2000], '\n')
print('------' * 10, '\nRESULT AFTER PARSING WITH UDF:\n', '------' * 10)
print(row.first_claim, '...\n\n')
From the start, the goal here was to compute some summary statistics about the claims of a specific set of patents. Namely, we're interested in whether the number of elements, character counts, or word counts of the first claim have varied over time.
In [0]:
# Average Word Count by Year:
query = r"""
#standardSQL
with words as (
SELECT
publication_number,
priority_yr,
SPLIT(REGEXP_REPLACE(first_claim, r'\s{2,}', ' '), ' ') word
FROM `%s.%s.%s`
)
SELECT
priority_yr,
avg(num_words) avg_word_cnt
FROM (
SELECT
publication_number,
priority_yr,
count(*) as num_words
from words, unnest(word)
group by 1,2
)
GROUP BY 1
ORDER BY 1
""" %(PROJECT_ID, DEST_DATASET, CLAIM_TEXT_TABLE)
df = pd.read_gbq(query, project_id=PROJECT_ID, dialect='standard')
In [0]:
plt.figure(figsize=(14, 6))
plt.plot(df.priority_yr, df.avg_word_cnt)
plt.xlabel('Priority Year', fontsize=14)
plt.xlim(1995, 2016)
plt.ylabel('Average Word Count')
plt.title('Average Word Count for the First Independent Claim of\nG06F Patents, '
          '1995 to present', fontsize=16)
plt.show()
In [0]:
query = """
#standardSQL
with elements as (
SELECT
publication_number,
priority_yr,
SPLIT(first_claim, ';') element
FROM `%s.%s.%s`
)
SELECT
priority_yr,
avg(num_elements) avg_element_cnt
FROM (
SELECT
publication_number,
priority_yr,
count(*) as num_elements
from elements, unnest(element)
group by 1,2
)
GROUP BY 1
ORDER BY 1
""" %(PROJECT_ID, DEST_DATASET, CLAIM_TEXT_TABLE)
df = pd.read_gbq(query, project_id=PROJECT_ID, dialect='standard')
In [0]:
plt.figure(figsize=(14, 6))
plt.plot(df.priority_yr, df.avg_element_cnt, color='g')
plt.xlabel('Priority Year', fontsize=14)
plt.xlim(1995, 2016)
plt.ylabel('Average Element Count')
plt.title('Average Element Count for the First Independent Claim of\n'
'G06F Patents, 1995 to present', fontsize=16)
plt.show()
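We've now looked at word and element counts; the character counts mentioned earlier follow the same pattern. Here's a sketch against the same table (CHAR_LENGTH is BigQuery standard SQL; the column alias is illustrative):
In [0]:
# Average character count of the first claim by priority year.
query = """
#standardSQL
SELECT
  priority_yr,
  AVG(CHAR_LENGTH(first_claim)) avg_char_cnt
FROM `{}.{}.{}`
GROUP BY 1
ORDER BY 1
""".format(PROJECT_ID, DEST_DATASET, CLAIM_TEXT_TABLE)
df = pd.read_gbq(query, project_id=PROJECT_ID, dialect='standard')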
In [0]: