This notebook walks through the process of creating a patent landscape as described in the paper Automated Patent Landscaping (Abood, Feltenberger 2016). The basic outline is: load a human-curated seed set of patents, expand it through citations and CPC codes (and sample random "anti-seed" patents as negative examples), train a neural network on the resulting labeled data, and use the trained model to classify patents as in or out of the landscape.
If you haven't already, please make sure you've set up an environment following the instructions in the README so you have all the necessary dependencies.
Copyright 2017 Google Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
You'll need a Google Cloud project with BigQuery enabled (it's enabled by default) for this notebook and its associated code to work. If you don't already have an account, go to cloud.google.com to create one; creating the Cloud account is free and you won't be auto-billed. Then copy your project ID and paste it into bq_project below.
In [1]:
# Example: bq_project = 'patent-landscape-165715'
bq_project = ''
In [2]:
import tensorflow as tf
import pandas as pd
import os
seed_name = 'video_codec'
seed_file = 'seeds/video_codec.seed.csv'
patent_dataset = 'patents-public-data:patents.publications_latest'
num_anti_seed_patents = 15000
if bq_project == '':
    raise Exception('You must enter a bq_project above for this code to run.')
We provide a pre-trained word2vec word embedding model that we trained on 5.9 million patent abstracts. See also word2vec.py in this repo if you'd like to train your own (though gensim is likely an easier path). The code below will download the model from Google Cloud Storage (GCS) and store it on the local filesystem, or simply load it from local disk if it's already present.
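For reference, here is a minimal sketch of how you might train comparable embeddings yourself with gensim, as the note above suggests. The corpus file name and the hyperparameters here are hypothetical illustrations, not what produced the 5.9m model (this uses the gensim 4.x API):
# Hypothetical corpus file: one tokenized, lowercased abstract per line.
from gensim.models import Word2Vec as GensimWord2Vec

with open('patent_abstracts.txt') as f:
    corpus = [line.lower().split() for line in f]

w2v = GensimWord2Vec(corpus, vector_size=300, window=5, min_count=5, workers=4)
w2v.save('my_patent_w2v.model')
print(w2v.wv.most_similar('codec', topn=10))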
In [3]:
from word2vec import W2VModelDownload
model_name = '5.9m'
model_download = W2VModelDownload(bq_project)
model_download.download_w2v_model('patent_landscapes', model_name)
print('Done downloading model {}!'.format(model_name))
This loads the Word2Vec embeddings from a model trained on 5.9 million patent abstracts. As a demonstration, it also finds the k most similar words to a given word, ranked by closeness in the embedding space. Finally, we use t-SNE to visualize word closeness in 2-dimensional space.
Note that the actual model files are fairly large (e.g., the 5.9m model is 760MB per checkpoint), so they are not stored in the GitHub repository. They're stored in the patent_landscapes Google Cloud Storage bucket under the models/ folder. If you'd like to use them, download the models folder and put it into the root repository folder (e.g., if you checked out this repository into patent-models, the 5.9m model should be in patent-models/models/5.9m).
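Conceptually, find_similar is a nearest-neighbor lookup by cosine similarity in the embedding space. A minimal sketch of that mechanism follows; the embedding matrix and vocab arguments are hypothetical stand-ins, not the repo's internals:
import numpy as np

def most_similar(word, k, embeddings, vocab):
    # embeddings: (vocab_size, dim) matrix; vocab: word -> row index
    v = embeddings[vocab[word]]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v)
    sims = embeddings @ v / norms
    top = np.argsort(-sims)[1:k + 1]  # rank 0 is the query word itself
    idx_to_word = {i: w for w, i in vocab.items()}
    return [(idx_to_word[i], float(sims[i])) for i in top]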
In [4]:
from word2vec import Word2Vec
word2vec5_9m = Word2Vec('5.9m')
w2v_runtime = word2vec5_9m.restore_runtime()
In [14]:
w2v_runtime.find_similar('codec', 10)
Out[14]:
In [6]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
w2v_runtime.visualize_embeddings(500)
In [7]:
import expansion
expander = expansion.PatentLandscapeExpander(
    seed_file,
    seed_name,
    bq_project=bq_project,
    patent_dataset=patent_dataset,
    num_antiseed=num_anti_seed_patents)
This does the actual expansion and displays the head of the final training data dataframe.
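Conceptually, the expansion finds patents related to the seed through citations and shared CPC codes against the public BigQuery patents dataset. As a heavily hedged sketch of the citation half, a query along these lines retrieves the publications cited by the seed set (the table name matches the dataset configured above, but this is not the repo's actual SQL):
l1_citation_query = '''
SELECT DISTINCT cite.publication_number
FROM `patents-public-data.patents.publications_latest` AS pub,
     UNNEST(pub.citation) AS cite
WHERE pub.publication_number IN UNNEST(@seed_publication_numbers)
'''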
In [8]:
training_data_full_df, seed_patents_df, l1_patents_df, l2_patents_df, anti_seed_patents = \
    expander.load_from_disk_or_do_expansion()

training_df = training_data_full_df[
    ['publication_number', 'title_text', 'abstract_text', 'claims_text',
     'description_text', 'ExpansionLevel', 'refs', 'cpcs']]
training_data_full_df.head()
Out[8]:
In [9]:
print('Seed/Positive examples:')
print(training_df[training_df.ExpansionLevel == 'Seed'].count())
print('\n\nAnti-Seed/Negative examples:')
print(training_df[training_df.ExpansionLevel == 'AntiSeed'].count())
In [10]:
import train_data
import tokenizer
# TODO: persist this tokenization data too
td = train_data.LandscapeTrainingDataUtil(training_df, w2v_runtime)
td.prepare_training_data(
    training_df.ExpansionLevel,   # labels ('Seed' / 'AntiSeed')
    training_df.abstract_text,
    training_df.refs,
    training_df.cpcs,
    0.8,      # train/test split fraction (presumed)
    50000,    # max vocabulary size (presumed)
    500)      # max sequence length (presumed)
Out[10]:
In [11]:
pos_idx = -1
neg_idx = -1
# find the first positive (label 0 = Seed) and negative (label 1 = AntiSeed) examples
for idx in range(0, len(td.prepped_labels)):
    if td.prepped_labels[idx] == 0 and pos_idx < 0:
        pos_idx = idx
    if td.prepped_labels[idx] == 1 and neg_idx < 0:
        neg_idx = idx
    if pos_idx > -1 and neg_idx > -1:
        break
print('Showing positive example (instance #{}).'.format(pos_idx))
td.show_instance_details(pos_idx)
print('\n------------------------------------\n')
print('Showing negative example (instance #{}).'.format(neg_idx))
td.show_instance_details(neg_idx)
The following cells specify the hyperparameters, define the neural network architecture (using Keras), and then actually train and evaluate the model.
The model is generally composed of an LSTM over the abstract's word embeddings, combined with one-hot encoded citations (refs) and CPC codes, with dropout and a dense sigmoid output that scores instances between 0 (seed) and 1 (anti-seed); a rough sketch follows the hyperparameter cell below.
In [12]:
batch_size = 64
dropout_pct = 0.4
num_epochs = 4
# Convolution (not currently used)
kernel_size = 5
filters = 64
pool_size = 4
# LSTM
lstm_size = 64
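Using the hyperparameters above, here is a rough sketch of that general shape in the Keras functional API. This is not the repo's exact wire_model_functional; the input dimensions (embedding_dim, num_refs, num_cpcs) are placeholders:
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout, concatenate
from tensorflow.keras.models import Model

embedding_dim, num_refs, num_cpcs = 300, 50000, 5000  # placeholder dimensions

abstract_in = Input(shape=(500, embedding_dim))  # padded abstract word embeddings
refs_in = Input(shape=(num_refs,))               # one-hot encoded citations
cpcs_in = Input(shape=(num_cpcs,))               # one-hot encoded CPC codes

x = LSTM(lstm_size)(abstract_in)
x = concatenate([x, refs_in, cpcs_in])
x = Dropout(dropout_pct)(x)
out = Dense(1, activation='sigmoid')(x)  # ~0 = Seed, ~1 = AntiSeed

sketch = Model(inputs=[abstract_in, refs_in, cpcs_in], outputs=out)
sketch.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])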
In [13]:
import model
import importlib
# reload so that edits to model.py are picked up without restarting the kernel
importlib.reload(model)

# note: this rebinds the name `model` from the module to a LandscapeModel instance
model = model.LandscapeModel(td, 'data', seed_name)
model.wire_model_functional(lstm_size, dropout_pct, sequence_len=td.sequence_len)
In [15]:
model.train_or_load_model(batch_size, num_epochs)
In [16]:
score, acc, p, r, f1 = model.evaluate_model(batch_size)
NOTE: the inference code below assumes you've already trained a model in the currently active kernel, so that the following variables are set: model is the Keras-trained DNN, and l1_patents_df is the dataframe returned by the PatentLandscapeExpander. Future iterations will save model checkpoints and allow saving models off and then loading them later just for inference.
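In the meantime, here is a hedged sketch of how you could save the trained network off yourself, assuming the underlying Keras model is reachable from the LandscapeModel instance (the attribute name is a guess, not the repo's API):
from tensorflow.keras.models import load_model

keras_net = model.model                      # hypothetical attribute name
keras_net.save('data/video_codec_model.h5')  # architecture + weights

# later, in a fresh kernel, for inference only:
restored = load_model('data/video_codec_model.h5')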
In [17]:
subset_l1_pub_nums, l1_texts, padded_abstract_embeddings, refs_one_hot, cpc_one_hot = \
    expander.sample_for_inference(td, 0.2)
We use a very rough heuristic to determine the "ground truth" for our L1 expansion: whether the abstract contains the string "video" or the string "codec". If it contains either, we consider it a positive example; if it contains neither, we consider it a negative example. We can explore the true positives, false positives, true negatives, and false negatives in the confusion matrix and Pandas DataFrame below.
In [18]:
predictions = model.batch_predict(padded_abstract_embeddings, refs_one_hot, cpc_one_hot)
In [20]:
l1_texts['score'] = pd.DataFrame(predictions, columns=['score'])['score']
l1_texts['label'] = 'Seed'
l1_texts[['publication_number', 'score', 'label', 'abstract_text', 'refs', 'cpcs']]
l1_texts['has_video'] = l1_texts.abstract_text.str.lower().str.contains('video')
l1_texts['has_codec'] = l1_texts.abstract_text.str.lower().str.contains('codec')
l1_texts.loc[~l1_texts.has_video & ~l1_texts.has_codec, 'label'] = 'AntiSeed'
# Shows the CPC codes from an example item from our inference dataframe.
l1_texts['cpcs'].iloc[0]
Out[20]:
In [21]:
classify_report, confusion_matrix = model.reports(l1_texts)
print(classify_report)
print('Confusion matrix:\n{}'.format(confusion_matrix))
model.show_confusion_matrix(confusion_matrix)
In [22]:
predicted_seed_class = l1_texts.score <= 0.5
predicted_nonseed_class = l1_texts.score > 0.5
l1_texts[
    predicted_seed_class & l1_texts.has_video & l1_texts.has_codec
].abstract_text.iloc[1]
Out[22]:
If you want to experiment with ad-hoc classification of data, edit the fields below for abstract text, references, and CPC codes to see how they influence the model's output. Somewhat counter-intuitively, a score closer to 0 means the instance is more likely to belong to the seed class (e.g., more likely to be about video codecs); conversely, the closer the score is to 1, the more likely the model thinks the instance belongs to the anti-seed class.
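For instance, a tiny helper mirroring the thresholding used in the cells above (the 0.5 cutoff matches the earlier predicted_seed_class logic; the function itself is ours, not part of the repo):
def score_to_label(score, threshold=0.5):
    # scores near 0 -> seed class; near 1 -> anti-seed class
    return 'Seed' if score <= threshold else 'AntiSeed'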
In [24]:
# Alternative example inputs you can paste into `text` and `cpcs` below. The two
# abstracts and the CPC codes are from a hair-dryer landscape, so a model trained
# on the video_codec seed should score them toward the anti-seed end (near 1).
# A conditioner infuser cartridge for use with a dryer attachment having an attachment end for engagement with a hair dryer barrel, an opposite air outlet end and a perforated portion between the ends having at least one air intake, the cartridge configured for engagement near the attachment end and including a conditioner element constructed and arranged for retaining a supply of vaporizable conditioner and a support frame receiving the conditioner element and securing same in the attachment.
# An electric lamp bulb or electric resistance coil is provided to radiate heat to one or more adjacent surfaces which have an active oxidation catalyst such that fumes and odors in the confines of a room that are drawn over the catalytic surface will be converted to less objectionable products. A catalytic device with an incandescent light bulb type of heating element is of particular advantage in that it can readily be screwed into a lamp base or mounted in other forms of current supplying receptacles and, in addition to a light source, will provide a heat emitting surface for heating the catalyst surface and inducing natural air convection current flow past the catalytic surface. Also, a fume control device which utilizes a resistance heating coil can readily provide both radiant heat and convection heat so that there will be the dual function of fume oxidation from air flow past a heated catalyst surface and radiant heat into a room area. Various types of catalyst coatings and/or catalytic wrappings may be used on the refractory surfaces which will be radiantly heated by the manually mountable-demountable form of bulb or resistance heating element
# A45D20/10,A45D20/08,A45D20/10,A45D20/08,A45D20/10,A45D20/08,A45D20/10,A45D20/08,A45D20/10,A45D20/08
text = 'Embodiments of a method and system for motion compensation in decoding video data are described herein. In various embodiments, a high-compression-ratio codec (such as H.264) is part of the encoding scheme for the video data. Embodiments pre-process control maps that were generated from encoded video data, and generating intermediate control maps comprising information regarding decoding the video data. The control maps indicate which one of multiple prediction operations is to be used in performing motion compensation on particular units of data in a frame. In an embodiment, motion compensation is performed on a frame basis such that each of the multiple prediction operations is performed on an entire frame at one time. In other embodiments, processing of different frames is interleaved. Embodiments increase the efficiency of the motion compensation such as to allow decoding of high-compression-ratio encoded video data on personal computers or comparable equipment without special, additional decoding hardware.'
refs = ''
cpcs = ''
model.predict(td, text, refs, cpcs)
Out[24]: