Analyzing Patents With USPTO Open Data

Welcome! This Jupyter Notebook will go through the process of reading patent data, analyzing the word usage in the patent abstracts, and identifying words which are used more frequently by city or state. Hopefully, we can learn about regional differences in patent filing.

This analysis is possible because the USPTO makes bulk patent data available online through a collaboration with Reed Tech (http://patents.reedtech.com/).

Questions, comments, suggestions, and corrections can be directed to mgebhard@gmail.com.

If you want to work through the code and follow along on your own computer, you'll need the python packages: lxml, pandas, sklearn, re, and numpy.


In [1]:
import numpy as np
import pandas as pd
import lxml.etree as ET
import re
from sklearn.feature_extraction.text import CountVectorizer

Download and Preprocess the Data

  1. Download .ZIP files from http://patents.reedtech.com/pgrbft.php. Because of the different formatting, we'll stick to files that fall into 2001-Present. Each file contains all of the patents issued by the USPTO for one week. These files are each around 100 MB zipped and unzip to 300-700 MB so we can't work with too many at once. Extract them to the working directory.
  2. The more patent data we can analyze at once, the more our word frequency output will be due to true regional differences and not just low sample sizes. Let's make a list of file paths from some recent releases so we can process multiple files.

In [2]:
files = ['ipg160119.xml', 'ipg150120.xml', 'ipg160816.xml', 'ipg160809.xml', 'ipg160802.xml',
         'ipg160719.xml', 'ipg160726.xml']

We want to extract location data and the abstract text from each patent in these files. Unfortunately, the files are not properly constructed .XML files but instead contain a concatenation of individual .XML files for each patent, which makes reading them more difficult. We have to use a bit of code to read these files and extract the data to a pandas dataframe.


In [3]:
class patent(object):
    def __init__(self, location, abstract):
        self.location = location
        self.abstract = abstract
    def getLocation(self):
        return self.location
    def getAbstract(self):
        return self.abstract
    
def filesToDataFrame(files):
    patents = []
    for path in files:
        with open(path, 'rb') as f:
            lines = f.readlines()
        patent_entries_start = []
        for line in range(len(lines)):
            if lines[line][0:5] == '<?xml':
                patent_entries_start.append(line)
        for i in range(len(patent_entries_start)-1):
            start = patent_entries_start[i]
            stop = patent_entries_start[i+1]
            patent_entry = lines[start:stop]
            location, abstract = process_entry(patent_entry)
            patents.append(patent(location, abstract))
    patent_df = pd.DataFrame.from_records([(x.getLocation()[0], 
                                            x.getLocation()[1], 
                                            x.getLocation()[2], 
                                            x.getAbstract()) for x in patents], 
                                              columns=['city', 'state', 'country', 'abstract'])
    patent_df.dropna(inplace=True)
    return patent_df

def process_entry(entry):
    with open('temp.xml', 'w') as xml_entry:
        for line in entry:
            xml_entry.write(line)
    context = ET.iterparse('temp.xml', events=('start', 'end'))
    city = None
    state = None
    country = None
    paragraph = None
    for event, elem in iter(context):
        if event == 'start' and elem.tag == 'inventor':
            for child in elem:
                for grandchild in child:
                    for info in grandchild:
                        if info.tag == 'city':
                            city = info.text
                        if info.tag == 'state':
                            state = info.text
                        if info.tag == 'country':
                            country = info.text
        if event == 'start' and elem.tag == 'abstract':
            paragraph = elem.findtext('p')
    return (city, state, country), paragraph

patent_df = filesToDataFrame(files)

Let's take a look at the dataframe we've constructed.


In [4]:
patent_df.info()
patent_df.sample(10)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 17333 entries, 252 to 38750
Data columns (total 4 columns):
city        17333 non-null object
state       17333 non-null object
country     17333 non-null object
abstract    17333 non-null object
dtypes: object(4)
memory usage: 677.1+ KB
Out[4]:
city state country abstract
13769 Hillsborough CA US The present document relates to methods and sy...
8898 Kirkland WA US Supplemental computing devices that provide pr...
34189 Morganville NJ US A mobile device operates in a standard mode of...
36431 Phelan CA US The ceiling fan with air ionizing fan blades i...
33275 Yorktown Heights NY US An approach to forming a semiconductor structu...
31888 Gilbert AZ US A process produces a time release pesticide gr...
34008 Lawrenceville GA US A system for maintaining an address book, wher...
24503 Berlin CT US A rifle configured for firing a 7.62×39 mm rou...
32057 Staatsburg NY US Execution of instructions in a transactional e...
14336 San Diego CA US A system for producing an exclusionary buffer ...

Find Word Frequencies in a String of Text

We now want to write a function that can take in our dataframe (or a piece of it) and keep track of word usage in the abstracts that appear in the dataframe. To do this, we take the following steps:

  1. Combine all of the abstracts into one long string.
  2. Process the string to remove uppercase letters and non-alphanumeric characters.
  3. Use CountVectorizer from sklearn.feature_extraction.text to convert the text to a matrix of token counts. This is especially useful because it removes "stop words" which are short, common words that don't have meaning for our purpose. We can also limit the vocab size using max_features. In other words, we can choose to only look at, for example, the top 200 most frequent words. Documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [5]:
def getOverallWordFrequency(patent_df, vocab_size):
    combinedWords = ''
    for entry in patent_df['abstract']:
        combinedWords += entry + ' '
    text = processText(combinedWords)
    cv = CountVectorizer(min_df=0, 
                         decode_error='ignore', 
                         stop_words='english', 
                         max_features=vocab_size)
    counts = cv.fit_transform([text]).toarray().ravel()
    words = np.array(cv.get_feature_names())
    counts = counts / float(counts.max())
    return zip(words, counts)

def processText(text):
    text = text.lower()
    text = re.sub(r'\W', ' ', text)
    text = text.split()
    return ' '.join(text)

Compare Regional Word Frequencies to Overall Word Frequencies

Now that we can generate a list of words and their occurrances in the patent abstracts, we need to be able to compare a list created from a region with the list created from the overall data.


In [6]:
def getHighestScoredWordInRegion(regionalUsage, overallUsage):
    highestScore = 0
    wordWithHighestScore = None
    for wordAndFreq in regionalUsage:
        overallFreq = getOverallFreq(wordAndFreq[0], overallUsage)
        score = getScore(wordAndFreq[1], overallFreq)
        if score > highestScore:
            wordWithHighestScore = wordAndFreq[0]
            highestScore = score
    return wordWithHighestScore

def getOverallFreq(word, overallUsage):
    for wordAndFreq in overallUsage:
        if word == wordAndFreq[0]:
            return wordAndFreq[1]
    return 1

Above, we used the getScore function to assign a score to a word based on its frequency in the region compared to its frequency in the overall data. A simple scoring function is just the ratio of the occurrance in the region to the occurrance in the overall data. Other scoring functions will, of course, identify different patterns in word usage.


In [7]:
def getScore(singleFreq, overallFreq):
    return (singleFreq / overallFreq)

We can put this all together into a final function:


In [8]:
def bestWordInRegion(region, city=False, state=False):
    overallUsage = getOverallWordFrequency(patent_df, 100000)
    output = []
    for location in region:
        if city:
            regionalUsage = getOverallWordFrequency(patent_df[patent_df['city'] == location], 500)
            output.append((location, 
                           len(patent_df[patent_df['city'] == location].index), 
                           getHighestScoredWordInRegion(regionalUsage, overallUsage)))
        if state:
            regionalUsage = getOverallWordFrequency(patent_df[patent_df['state'] == location], 200)
            output.append((location, 
                           len(patent_df[patent_df['state'] == location].index), 
                           getHighestScoredWordInRegion(regionalUsage, overallUsage))) 
    output_df = pd.DataFrame.from_records(output, columns=['location', 'patents analyzed', 'frequent word'])
    print output_df
    return

Look at Word Usage by State and City

The function bestWordInRegion takes in a list of city or state regions and tells us the word for each region that stand out. Let's look at states. Alaska is left off since it has so few patents.


In [9]:
region_states = ['AL', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 
                 'KY', 'LA', 'ME', 'MD', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY',
                 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA',
                 'WV', 'WI', 'WY']
bestWordInRegion(region_states, state=True)


   location  patents analyzed  frequent word
0        AL                74            ejb
1        AR                29          cl172
2        CA              4969          clock
3        CO               347     television
4        CT               264       buttress
5        DE                37            tfe
6        FL               525        antenna
7        GA               291         carton
8        HI                12       forearms
9        ID                92   microfeature
10       IL               579      hydraulic
11       IN               229        humeral
12       IA               104    ethanologen
13       KS                94  concatenation
14       KY                99          rolls
15       LA                54   hydrocyclone
16       ME                34       hookworm
17       MD               249     mesothelin
18       MI               678     telematics
19       MN               511         pacing
20       MS                16         asador
21       MO               144       20453014
22       MT                22    antisolvent
23       NE                50           muc4
24       NV                86         patron
25       NH               111       monocore
26       NJ               558     subscriber
27       NM                35           lccd
28       NY              1037            fin
29       NC               453        droplet
30       ND                 7        loaders
31       OH               381           tire
32       OK                64      acidizing
33       OR               383        feather
34       PA               446       fixation
35       RI                49      calibrant
36       SC               118            phr
37       SD                16       barreled
38       TN               138     rotisserie
39       TX              1211       wellbore
40       UT               173            pcd
41       VT                47            rdf
42       VA               249        webbing
43       WA               812       aircraft
44       WV                12       succinic
45       WI               246           bean
46       WY                 7       scatting

Additionally, we can look at patents by city. Granted, this may not be as accurate as analyzing by state for a few reasons. (1) There are likely fewer patents from a city than from a state. (2) The location is based on only one inventor when there are often multiple inventors listed per patent. (3) The location is based on the inventor's residence address and not necessarily the city where the work was done.

But analyzing by city can still be interesting. Let's look at some cities in the Bay Area.


In [10]:
region_cities = ['San Jose', 'Santa Clara', 'Sunnyvale', 'Cupertino', 'Mountain View', 'Palo Alto',
                 'Los Altos', 'Menlo Park', 'Redwood City', 'San Mateo', 'South San Francisco',
                 'San Francisco', 'Berkeley', 'Oakland']
bestWordInRegion(region_cities, city=True)


               location  patents analyzed     frequent word
0              San Jose               507               dbi
1           Santa Clara               138  chemonucleolysis
2             Sunnyvale               242             chord
3             Cupertino               162             asisp
4         Mountain View               187      whitelisting
5             Palo Alto               163          antifuse
6             Los Altos                74    disambiguation
7            Menlo Park                58        guidepiece
8          Redwood City                49          airspace
9             San Mateo                47           bureaus
10  South San Francisco                 7               c10
11        San Francisco               325            mentor
12             Berkeley                34      choreography
13              Oakland                42             actor

Conclusions

What did I learn technically?

  • Non-standard .XML files can be difficult to work with, but lxml.etree can still be useful in extracting the desired data.

  • CountVectorizer from sklearn.feature_extraction.text is incredibly helpful for analyzing large amounts of text.

What did I learn from the patents?

  • Certain state/word pairings make intuitive sense and confirm that our word search algorithm is probably working correctly: Texas/wellbore, Iowa/ethanologen, Michigan/telematics.

  • Some words that show up seem random but convey information. Missouri's 20453014 is a soybean variant.

Future

  • Add more patent data to improve results.

  • Adjust the vocab_size and scoring function to find more representative words.