In [7]:
# Modules
import pandas as pd  # data frames
import sqlite3 as sql  # sqlite for reading the database
import numpy as np  # mathematics
#import pandasql as pdsql  # library to interface pandas and sqlite3 - doesn't work though...

# Globals
FILENAME = '/run/media/potterzot/My Passport/potterzot/data/fcc/nn_comments.db'
The data is stored in an SQLite3 database, which we'll use as the base for the analysis. Initially I tried this with HDF5, but my laptop couldn't handle the memory requirements of so much text, and chunking with pandas wasn't very effective (or perhaps I should say, I wasn't very effective with it).
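For the record, the chunked route would have looked something like the sketch below, using the imports and FILENAME from the cell above. With a chunksize argument, pd.read_sql_query returns an iterator of data frames rather than one big frame; the 50,000-row chunk size here is arbitrary.

# A sketch of chunked reading: pull 50,000 rows at a time so that
# no single DataFrame ever holds all of the comment text at once.
for chunk in pd.read_sql_query("SELECT id, text FROM comments",
                               sql.connect(FILENAME), chunksize=50000):
    pass  # process each chunk here, e.g. tokenize or count terms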
In [10]:
# Initialize the connection
db = sql.connect(FILENAME)
cursor = db.cursor()
# How many comments were there?
cursor.execute("SELECT COUNT(*) FROM comments")
cursor.fetchone()
Out[10]:
So we've got almost 450,000 comments. I read somewhere that nearly 150,000 of them were identical, though, so let's see how many unique comments we have.
In [13]:
cursor.execute("SELECT COUNT(*) FROM (SELECT DISTINCT text FROM comments)")
cursor.fetchone()
Out[13]:
Hmm, well that's not in accordance with what the FCC said. We should look at that more closely. And let's also get the important data moved into a data frame, if possible without breaking the bank.
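As a note for when we come back to the duplicates question, one quick way to see which comments are repeated most often is a GROUP BY on the text itself. A sketch using the same cursor (the LIMIT of 5 is arbitrary):

q = """SELECT COUNT(*) AS n, text
       FROM comments
       GROUP BY text
       ORDER BY n DESC
       LIMIT 5"""
cursor.execute(q)
dupes = cursor.fetchall()  # list of (count, text) tuples, most-duplicated first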
In [23]:
# See what fields are available and save them as a data frame
cursor.execute("PRAGMA table_info(comments)")
metadata = pd.DataFrame(cursor.fetchall(), columns=('id', 'field', 'type', '0', 'None', '02'))
metadata
Out[23]:
Okay, so let's create one data frame that contains the geographical and other important data, but not the text itself, since that's the major memory consumer.
In [26]:
# Create the query and load everything but the comment text
q = ("SELECT id, applicant, author, city, stateCD AS state, "
     "zip, pages, score, dateRcpt AS received FROM comments")
cursor.execute(q)
nontextdata = pd.DataFrame(cursor.fetchall(),
                           columns=('id', 'applicant', 'author', 'city', 'state',
                                    'zip', 'pages', 'score', 'received'))
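An equivalent and slightly tidier route is to let pandas run the query itself; pd.read_sql_query takes its column names from the SELECT aliases. A deep memory_usage call then gives a rough check that we haven't broken the bank (availability and exact numbers vary by pandas version):

nontextdata = pd.read_sql_query(q, db)
nontextdata.memory_usage(deep=True).sum()  # total bytes, including object columns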
In [33]:
# See the number of unique names, cities, states, and zips
for v in ['applicant', 'state', 'city', 'zip']:
    print(v + ": " + str(len(nontextdata[v].unique())))
28,873 cities and 60 states. A quick look shows we've got a few extra 'states' thrown into the comments.
In [37]:
nontextdata['state'].unique()
Out[37]:
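Once we know what the extras look like, we can mask them out. A minimal sketch, assuming we only want the 50 states plus DC; the valid_states set here is an illustration, not something taken from the data itself:

valid_states = {
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA',
    'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA',
    'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY',
    'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
    'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'
}
# Keep only rows whose state code is a real US state (or DC)
instate = nontextdata[nontextdata['state'].isin(valid_states)]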