To access our browsing history, we go to ~/Library/Safari and look for the database History.db. We make a copy of it in a folder in our workspace, e.g. to ~/Workspace/web_browsing/hs.db.
In [1]:
%%bash
cp ~/Library/Safari/History.db ~/Workspace/web_browsing/hs.db
Now let us fire up sqlite3 and see what tables are inside the database.
In [2]:
%%script sqlite3 hs.db
.tables
Among the four (4) tables above, we'll be using 'history_items' and 'history_visits'. Let's take a look at their schemata.
In [3]:
%%script sqlite3 hs.db
.schema history_items
In [4]:
%%script sqlite3 hs.db
.schema history_visits
The first thing to note is that each history visit is associated with one history item, as we can tell from the 'history_item' column of the 'history_visits' table.
history_item INTEGER NOT NULL REFERENCES history_items(id) ON DELETE CASCADE
Moreover, as we can tell from the columns of the two tables, 'history_visits' contains information about individual visits to websites, like the time of the visit and the title of the page, while 'history_items' contains general and aggregate information about the addresses we visit, such as their url and total number of visits.
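To see this one-to-many link in action, here is a minimal cross-check (a sketch, using Python's standard-library sqlite3 module on the 'hs.db' copy we made above): for a few of the most-visited items, we count the rows of 'history_visits' that reference them and compare against the stored visit_count.
import sqlite3
# a sketch: count the visits recorded for each history item and compare
# with the aggregate visit_count kept in history_items (the two may not
# match exactly, depending on how Safari maintains the counter)
conn = sqlite3.connect('hs.db')
query = """
    SELECT I.url, I.visit_count, COUNT(*) AS counted_visits
    FROM history_items AS I JOIN history_visits AS V ON V.history_item = I.id
    GROUP BY I.id
    ORDER BY I.visit_count DESC
    LIMIT 5;"""
for url, visit_count, counted_visits in conn.execute(query):
    print('%s  visit_count=%d  counted=%d' % (url, visit_count, counted_visits))
conn.close()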
Let's take a look now at the addresses that we visit most often (and despair later)...
In [5]:
%%script sqlite3 hs.db
SELECT id, url, visit_count FROM history_items ORDER BY visit_count DESC LIMIT 5;
...As well as our most recent visits.
In [6]:
%%script sqlite3 hs.db
SELECT V.title, I.url, V.visit_time
FROM history_items as I,
(SELECT history_item, title, visit_time FROM history_visits ORDER BY visit_time DESC LIMIT 5) as V
WHERE I.id = V.history_item;
Note that, from what I can tell at least, visit_time is expressed as the number of seconds since the beginning of 2001-01-01 (UTC).
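To make that concrete, here is a tiny sketch of the conversion (the visit_time value below is made up purely for illustration):
import datetime
# Safari seems to store visit_time as seconds since 2001-01-01,
# so we add the timestamp to that epoch to get a regular datetime
safari_epoch = datetime.datetime(2001, 1, 1)
visit_time = 443799552.0  # a hypothetical value read from history_visits
print(safari_epoch + datetime.timedelta(seconds=visit_time))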
Finally, let's dump to a text file ('visits.txt') only the information that interests us: for now, that's just the web address visited (url) and the associated timestamp.
In [7]:
%%bash --out visits
sqlite3 hs.db \
"SELECT I.url, V.visit_time \
FROM history_items as I, \
(SELECT history_item, title, visit_time FROM history_visits) as V \
WHERE I.id = V.history_item;"
In [8]:
f = open('visits.txt', 'w')
f.write(visits)
f.close()
That's it: our browsing history is now in the file 'visits.txt'.
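As an aside, if you prefer to stay inside Python, the same dump can be sketched with the standard-library sqlite3 module (it assumes the 'hs.db' copy from above and writes the same pipe-separated lines):
import sqlite3
# a sketch of the same dump done from Python instead of bash;
# it writes the same url|visit_time lines to visits.txt
conn = sqlite3.connect('hs.db')
rows = conn.execute(
    "SELECT I.url, V.visit_time "
    "FROM history_items AS I JOIN history_visits AS V ON I.id = V.history_item;")
out = open('visits.txt', 'w')
for url, visit_time in rows:
    out.write('%s|%s\n' % (url, visit_time))
out.close()
conn.close()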
In [9]:
%%bash
head -5 visits.txt
echo "... ... ..."
tail -5 visits.txt
In [10]:
%matplotlib inline
In [11]:
import sys
import matplotlib.pyplot as plt
import time
import datetime
Let us parse the text file we created and extract the timestamps of our visits.
In [12]:
f = open('visits.txt')
tt = []  # stores timestamps of our visits
for line in f:
    try:
        tokens = line.rstrip().split('|')
        time_sec = float(tokens[-1])
        tt.append(time_sec)
    except:
        sys.stderr.write(line)
        sys.stderr.write('\n')
f.close()
# sort the timestamps
tt.sort()
# turn into standard POSIX by adding
# the first second of 2001-1-1 UTC
zero_sec = time.mktime(datetime.datetime(2001, 1, 1).timetuple())
for i in range(len(tt)): tt[i] += zero_sec
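As a quick sanity check (assuming the list above came out non-empty), we can print the earliest and the latest visit as regular dates:
# print the first and last visit in human-readable form
print(datetime.datetime.fromtimestamp(tt[0]))
print(datetime.datetime.fromtimestamp(tt[-1]))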
Let us now plot the daily number of pages we've visited over the period covered by our history.
In [17]:
bin_size = 24 * 3600  # have one bin per 24 hours
num_of_bins = int((tt[-1] - tt[0]) / bin_size)
# figure out intervals to accommodate the number of labels on the x-axis
num_of_xlabels = 10
label_bin = (tt[-1] - tt[0]) / (num_of_xlabels - 1)
locs = [tt[0] + label_bin * i for i in range(num_of_xlabels)]
labels = [str(datetime.datetime.fromtimestamp(x)).split()[0] for x in locs]
# plot the histogram
plt.figure(figsize=(15, 5))
plt.hist(tt, num_of_bins, range=(tt[0], tt[-1]))
plt.xticks(locs, labels)
plt.ylabel("Daily Visits")
Out[17]:
What are the websites we've visited most often? We've already gotten an idea about this above, but now let's do some more elaborate processing.
In [14]:
import re
import numpy as np
website_pattern = re.compile(r'http(s)?://(\w+\.)?(\w+?)\.\w+?/')
verbose = False
f = open('visits.txt')
website_counts = {} # stores counts for each website
addresses = {} # stores the different addresses (base-urls) for each website
for line in f:
    try:
        # keep only the url, ignore the timestamp
        tokens = line.rstrip().split('|')
        m = website_pattern.search("".join(tokens[:-1]))
        try:
            # parse the url to extract the website name
            # for example, if the url is http://www.facebook.com/ab238ub
            # the website name is 'facebook'
            website = m.group(3)
            # update count
            website_counts[website] = website_counts.get(website, 0) + 1
            # keep track of the different base urls that match this website
            # for example, if the url is http://www.facebook.com/ab238ub
            # the base url is http://www.facebook.com/
            if website not in addresses:
                addresses[website] = {}
            addresses[website][m.group(0)] = addresses[website].get(m.group(0), 0) + 1
        except Exception as e:
            if verbose:
                sys.stderr.write(str(e))
                sys.stderr.write('\n')
    except:
        sys.stderr.write(line)
        sys.stderr.write('\n')
f.close()
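To make the pattern concrete, here is what it extracts from a made-up URL (the address is purely illustrative):
# illustrate the regex on a made-up URL
m = website_pattern.search('https://www.example.com/some/page?q=1')
print(m.group(3))  # the website name: 'example'
print(m.group(0))  # the base url: 'https://www.example.com/'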
Next, let's plot the number of visits to the websites we visit most frequently.
In [15]:
k = 5  # The number of top websites to present
# Sort the websites by number of visits and keep the top-k
sorted_websites = sorted(website_counts.items(),
                         key=lambda x: x[1], reverse=True)
k = min(k, len(sorted_websites))
top_websites = sorted_websites[:k]
# The number of visits to websites that are not among the top-k
other_visits = sum(x[1] for x in sorted_websites[k:])
# Let's make a bar chart for the top-k websites
plt.figure(figsize=(10, 5))  # create new figure
bar_width = 0.5
xpos = np.arange(len(top_websites)) + bar_width  # bar positions on the x-axis
plt.bar(xpos, [w[1] for w in top_websites], bar_width)  # place the bars
plt.xticks(xpos + bar_width / 2, [w[0] for w in top_websites])  # ticks on the x-axis
Out[15]:
In [16]:
## For each of the top-k websites, show the most popular
## addresses (base urls)
m = 3       # Show at most m addresses...
pct = 0.90  # ... or stop at 90% of visits
# For each of the top-k websites...
for website_name, website_visits in top_websites:
    # ... store the most frequently visited base urls ...
    website_addresses = sorted(addresses[website_name].items(),
                               key=lambda x: x[1], reverse=True)
    top_addresses = []
    visits = 0.
    for address, address_count in website_addresses:
        visits += address_count
        top_addresses.append((address, address_count))
        if len(top_addresses) > m or visits / website_visits >= pct:
            break
    # ... and the number of visits to addresses that are not among the top
    other_num = website_visits - visits
    # make a pie-chart for this website
    # if it is associated with many addresses
    if len(top_addresses) >= m:  # TODO change this condition if you want
        plt.figure(figsize=(7, 7))
        labels = [x[0] for x in top_addresses] + ["other"]
        sizes = [x[1] for x in top_addresses] + [other_num]
        # arbitrary choice of colors -- I like blue
        n = float(len(sizes))
        colors = [(p / (n + 1.), 0., 1. - p / (n + 1.))
                  for p in range(len(sizes))]
        colors[-1] = 'grey'
        plt.pie(sizes, labels=labels, colors=colors,
                labeldistance=1.1, autopct='%1.1f%%', startangle=90)
For more information on the tools we used in this session, check the following websites.
In [ ]: