Welcome! This Jupyter Notebook will go through the process of reading patent data, analyzing the word usage in the patent abstracts, and identifying words which are used more frequently by city or state. Hopefully, we can learn about regional differences in patent filing.
This analysis is possible because the USPTO makes bulk patent data available online through a collaboration with Reed Tech (http://patents.reedtech.com/).
Questions, comments, suggestions, and corrections can be directed to mgebhard@gmail.com.
If you want to work through the code and follow along on your own computer, you'll need the Python packages lxml, pandas, scikit-learn (imported as sklearn), and numpy; the re module used below ships with the Python standard library.
In [1]:
import numpy as np
import pandas as pd
import lxml.etree as ET
import re
from sklearn.feature_extraction.text import CountVectorizer
In [2]:
files = ['ipg160119.xml', 'ipg150120.xml', 'ipg160816.xml', 'ipg160809.xml', 'ipg160802.xml',
'ipg160719.xml', 'ipg160726.xml']
We want to extract the location data and the abstract text from each patent in these files. Unfortunately, the files are not well-formed XML documents; each one is a concatenation of individual XML documents, one per patent, which makes reading them more difficult. We have to use a bit of code to split these files apart and extract the data into a pandas dataframe.
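As a side note, one alternative way to handle this structure (a minimal sketch, not the approach used below, and assuming a bulk file fits comfortably in memory) is to split the file on the repeated XML declarations and parse each chunk directly:

import lxml.etree as ET  # already imported above

def split_bulk_file(path):
    with open(path, 'rb') as f:
        text = f.read()
    # Each patent grant begins with its own '<?xml' declaration, so splitting
    # on that marker yields one chunk per patent (the piece before the first
    # declaration is empty and is dropped).
    documents = ['<?xml' + doc for doc in text.split('<?xml')[1:]]
    return [ET.fromstring(doc) for doc in documents]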
In [3]:
class patent(object):
    def __init__(self, location, abstract):
        self.location = location
        self.abstract = abstract

    def getLocation(self):
        return self.location

    def getAbstract(self):
        return self.abstract


def filesToDataFrame(files):
    patents = []
    for path in files:
        with open(path, 'rb') as f:
            lines = f.readlines()
        # Each patent entry begins with its own XML declaration.
        patent_entries_start = []
        for i, line in enumerate(lines):
            if line.startswith('<?xml'):
                patent_entries_start.append(i)
        # Sentinel index so the final patent in the file is not skipped.
        patent_entries_start.append(len(lines))
        for i in range(len(patent_entries_start) - 1):
            start = patent_entries_start[i]
            stop = patent_entries_start[i + 1]
            patent_entry = lines[start:stop]
            location, abstract = process_entry(patent_entry)
            patents.append(patent(location, abstract))
    patent_df = pd.DataFrame.from_records([(x.getLocation()[0],
                                            x.getLocation()[1],
                                            x.getLocation()[2],
                                            x.getAbstract()) for x in patents],
                                          columns=['city', 'state', 'country', 'abstract'])
    patent_df.dropna(inplace=True)
    return patent_df


def process_entry(entry):
    # Write the single-patent entry to a temporary file so lxml can parse it
    # as a well-formed document on its own.
    with open('temp.xml', 'w') as xml_entry:
        for line in entry:
            xml_entry.write(line)
    # Use 'end' events so each element's subtree is fully parsed when we read it.
    context = ET.iterparse('temp.xml', events=('end',))
    city = None
    state = None
    country = None
    paragraph = None
    for event, elem in iter(context):
        if elem.tag == 'inventor':
            # Drill down: inventor -> addressbook -> address -> city/state/country
            for child in elem:
                for grandchild in child:
                    for info in grandchild:
                        if info.tag == 'city':
                            city = info.text
                        if info.tag == 'state':
                            state = info.text
                        if info.tag == 'country':
                            country = info.text
        if elem.tag == 'abstract':
            paragraph = elem.findtext('p')
    return (city, state, country), paragraph
patent_df = filesToDataFrame(files)
Let's take a look at the dataframe we've constructed.
In [4]:
patent_df.info()
patent_df.sample(10)
We now want to write a function that can take in our dataframe (or a slice of it) and keep track of word usage in the abstracts that appear in it. To do this, we take the following steps: (1) combine all of the abstracts into one long string; (2) normalize the text by lowercasing it and stripping out punctuation; (3) count word occurrences with CountVectorizer, ignoring common English stop words and keeping at most vocab_size distinct words; and (4) normalize the counts by the count of the most frequent word, so each word's frequency falls between 0 and 1.
In [5]:
def getOverallWordFrequency(patent_df, vocab_size):
    # Combine all of the abstracts into one long string.
    combinedWords = ''
    for entry in patent_df['abstract']:
        combinedWords += entry + ' '
    text = processText(combinedWords)
    # Count word occurrences, ignoring common English stop words.
    cv = CountVectorizer(min_df=0,
                         decode_error='ignore',
                         stop_words='english',
                         max_features=vocab_size)
    counts = cv.fit_transform([text]).toarray().ravel()
    words = np.array(cv.get_feature_names())
    # Normalize by the count of the most frequent word.
    counts = counts / float(counts.max())
    return zip(words, counts)


def processText(text):
    text = text.lower()
    text = re.sub(r'\W', ' ', text)  # replace non-word characters with spaces
    text = text.split()
    return ' '.join(text)
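As a quick sanity check (illustrative only, not part of the original analysis), we could peek at the most common normalized words across all abstracts; the exact words will depend on which bulk files were loaded:

# Top ten words by normalized frequency across every abstract.
overall = getOverallWordFrequency(patent_df, 1000)
print(sorted(overall, key=lambda pair: pair[1], reverse=True)[:10])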
In [6]:
def getHighestScoredWordInRegion(regionalUsage, overallUsage):
    highestScore = 0
    wordWithHighestScore = None
    for wordAndFreq in regionalUsage:
        overallFreq = getOverallFreq(wordAndFreq[0], overallUsage)
        score = getScore(wordAndFreq[1], overallFreq)
        if score > highestScore:
            wordWithHighestScore = wordAndFreq[0]
            highestScore = score
    return wordWithHighestScore


def getOverallFreq(word, overallUsage):
    for wordAndFreq in overallUsage:
        if word == wordAndFreq[0]:
            return wordAndFreq[1]
    # Fall back to 1 (the maximum normalized frequency) if the word is not in
    # the overall vocabulary, so such words are not given inflated scores.
    return 1
Above, we used the getScore function to assign a score to a word based on its frequency in the region compared to its frequency in the overall data. A simple scoring function is just the ratio of the word's occurrence in the region to its occurrence in the overall data. Other scoring functions will, of course, highlight different patterns in word usage.
In [7]:
def getScore(singleFreq, overallFreq):
    return (singleFreq / overallFreq)
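As an example of an alternative, a hypothetical smoothed score (not used in this notebook) would dampen words that are extremely rare in the overall data, since a tiny overallFreq can otherwise produce a huge ratio:

def getSmoothedScore(singleFreq, overallFreq, smoothing=0.01):
    # The small constant in the denominator keeps words with a near-zero
    # overall frequency from dominating purely because the ratio blows up.
    return singleFreq / (overallFreq + smoothing)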
We can put this all together into a final function:
In [8]:
def bestWordInRegion(region, city=False, state=False):
    # Baseline word frequencies across every abstract in the dataframe.
    overallUsage = getOverallWordFrequency(patent_df, 100000)
    output = []
    for location in region:
        if city:
            regionalUsage = getOverallWordFrequency(patent_df[patent_df['city'] == location], 500)
            output.append((location,
                           len(patent_df[patent_df['city'] == location].index),
                           getHighestScoredWordInRegion(regionalUsage, overallUsage)))
        if state:
            regionalUsage = getOverallWordFrequency(patent_df[patent_df['state'] == location], 200)
            output.append((location,
                           len(patent_df[patent_df['state'] == location].index),
                           getHighestScoredWordInRegion(regionalUsage, overallUsage)))
    output_df = pd.DataFrame.from_records(output, columns=['location', 'patents analyzed', 'frequent word'])
    print(output_df)
    return
In [9]:
region_states = ['AL', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS',
'KY', 'LA', 'ME', 'MD', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY',
'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA',
'WV', 'WI', 'WY']
bestWordInRegion(region_states, state=True)
Additionally, we can look at patents by city. Granted, this may not be as reliable as analyzing by state, for a few reasons: (1) there are likely far fewer patents from any single city than from a state; (2) the location is based on only one inventor, even though patents often list multiple inventors; and (3) the location is the inventor's residence address, not necessarily the city where the work was done.
But analyzing by city can still be interesting. Let's look at some cities in the Bay Area.
In [10]:
region_cities = ['San Jose', 'Santa Clara', 'Sunnyvale', 'Cupertino', 'Mountain View', 'Palo Alto',
'Los Altos', 'Menlo Park', 'Redwood City', 'San Mateo', 'South San Francisco',
'San Francisco', 'Berkeley', 'Oakland']
bestWordInRegion(region_cities, city=True)
What did I learn technically?
Non-standard .XML files can be difficult to work with, but lxml.etree can still be useful in extracting the desired data.
CountVectorizer from sklearn.feature_extraction.text is incredibly helpful for analyzing large amounts of text.
What did I learn from the patents?
Certain state/word pairings make intuitive sense and confirm that our word search algorithm is probably working correctly: Texas/wellbore, Iowa/ethanologen, Michigan/telematics.
Some words that show up seem random but convey information. Missouri's 20453014 is a soybean variant.
What could I do to improve the analysis?
Add more patent data to improve results.
Adjust the vocab_size and scoring function to find more representative words.