Introduction

This notebook explores patterns in my reading habits. Pocket is a handy tool for keeping track of all the nice articles you might otherwise miss because of an "important" meeting.

I have used data fetched through the nice API provided by Pocket. I wanted to play more with the data at hand, but it is better to start with simple things; in the future, I might dig for some deeper insights. Let's start with the basics for now.


In [4]:
import json
import pandas as pd
import datetime
import requests
import matplotlib.pyplot as plt
import numpy as np
from wordcloud import WordCloud
from urllib.parse import urlparse

First, we need to get the data from Pocket through its API.

For that, we first need to create a consumer key. An access token can then be generated using the API or fxneumann's OneClickPocket.
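For reference, Pocket's token flow is a three-step OAuth handshake: request a temporary code, have the user authorize it in a browser, then exchange it for the permanent access token. A rough sketch of that flow (the `get_access_token` helper, payload builders, and the `http://localhost` redirect URI are illustrative, not part of this notebook):

```python
import json

REQUEST_URL = "https://getpocket.com/v3/oauth/request"
AUTHORIZE_URL = "https://getpocket.com/v3/oauth/authorize"
HEADERS = {"Content-Type": "application/json", "X-Accept": "application/json"}

def request_token_payload(consumer_key, redirect_uri="http://localhost"):
    # Step 1: exchange the consumer key for a temporary request token ("code")
    return {"consumer_key": consumer_key, "redirect_uri": redirect_uri}

def access_token_payload(consumer_key, code):
    # Step 3: after the user authorizes the app, trade the code for an access token
    return {"consumer_key": consumer_key, "code": code}

def get_access_token(consumer_key):
    import requests  # imported here so the payload helpers stay dependency-free
    r = requests.post(REQUEST_URL, data=json.dumps(request_token_payload(consumer_key)),
                      headers=HEADERS)
    code = r.json()["code"]
    # Step 2: the user must approve the app in a browser before step 3 works
    input(f"Authorize at https://getpocket.com/auth/authorize?request_token={code}"
          f"&redirect_uri=http://localhost then press Enter...")
    r = requests.post(AUTHORIZE_URL, data=json.dumps(access_token_payload(consumer_key, code)),
                      headers=HEADERS)
    return r.json()["access_token"]
```

OneClickPocket automates exactly this dance, so either route ends with the same `access_token`.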


In [5]:
consumer_key = "" # Consumer key required for Pocket API
access_token = "" # Access Token required for Pocket API
time_added_limit = 1483228800 # Fetch only items added after this Unix timestamp; default 1 Jan 2017 (UTC)

In [6]:
if consumer_key == "" or access_token == "":
    raise ValueError("Please generate Consumer Key and Access Token.")


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-a63675820804> in <module>()
      1 if consumer_key == "" or access_token == "":
----> 2     raise ValueError("Please generate Consumer Key and Access Token.")

ValueError: Please generate Consumer Key and Access Token.

It is good practice to limit the number of API requests and the amount of data you fetch, so specify a reasonable time limit. Now it's time to fetch the data.
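The raw epoch value used for `time_added_limit` can be derived from a calendar date; a quick sketch:

```python
import datetime

# 1 Jan 2017 00:00 UTC expressed as a Unix timestamp
limit = datetime.datetime(2017, 1, 1, tzinfo=datetime.timezone.utc)
time_added_limit = int(limit.timestamp())
print(time_added_limit)  # → 1483228800
```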


In [ ]:
data_points = {
  "item_id" : "int64",
  "resolved_title" : "object",
  "resolved_url" : "object",
  "time_added" : "int64",
  "time_read": "int64",
  "excerpt" : "object",
  "word_count" : "int32",
  "is_article": "int32",
  "status" : "int32"
}

rawReq = {
  "consumer_key" : consumer_key,
  "access_token" : access_token,
  "sort" : "newest",
  "state" : "all",
  "detailType" : "complete"
}

url = "https://getpocket.com/v3/get"
headers = {"Content-Type": "application/json"}
counter = 0
items_per_request = 500
rows = []

while True:
    req = rawReq.copy()
    req.update({"offset": counter * items_per_request, "count": items_per_request})
    r = requests.post(url, data=json.dumps(req), headers=headers)
    rjson = r.json()
    all_items_found = False
    if len(rjson["list"]) == 0:
        break

    for item in rjson["list"].values():
        # Items arrive newest first, so the first item older than the
        # limit means everything after it is older too
        if int(item["time_added"]) < time_added_limit:
            all_items_found = True
            break

        try:
            rows.append([item[x] for x in data_points])
        except KeyError:
            continue # Ignoring ill-formed data

    if all_items_found:
        break
    counter += 1

df = pd.DataFrame(rows, columns=list(data_points))
for col in df:
    df[col] = df[col].astype(data_points[col])

First, I would like to see which websites I visit most when reading the saved articles.
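The grouping below keys each URL by `urlparse(...).netloc`, which keeps subdomains distinct; a quick illustration with made-up URLs:

```python
from urllib.parse import urlparse

urls = [
    "https://medium.com/some-article",
    "https://towardsdatascience.medium.com/another-article",
]
# The subdomain counts as a separate site from the bare domain
print([urlparse(u).netloc for u in urls])
# → ['medium.com', 'towardsdatascience.medium.com']
```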


In [ ]:
number_of_top_websites = 10

# Different subdomains are considered as different entities for the sake of simplicity
group_by_domain = df["resolved_url"].apply(lambda x:urlparse(x).netloc)
top_domains = group_by_domain.value_counts().head(number_of_top_websites)

_, ax = plt.subplots(figsize=(12, 10))
y_pos = np.arange(len(top_domains), 0, -1)
ax.set_yticks(y_pos)
ax.set_yticklabels(top_domains.index.values)
ax.barh(y_pos, top_domains, color="green")
ax.set_title("Top Websites for reading")
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

plt.show()

No surprises there.

Next, I want to know at what time of the week I am most active saving and reading articles in Pocket. Let's build a cool heatmap to find that out.
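The heatmap buckets each timestamp by weekday and hour, using Python's Monday = 0 convention; a minimal sketch of that bucketing:

```python
import datetime

# 2 Jan 2017 was a Monday; weekday() returns 0 for Monday, 6 for Sunday
dt = datetime.datetime(2017, 1, 2, 8, 30)
print(dt.weekday(), dt.hour)  # → 0 8
```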


In [ ]:
def get_day_and_hour(ts):
    dt = datetime.datetime.fromtimestamp(ts)
    return (dt.date().weekday(), dt.hour)

def get_weekday_hour_heatmap_data(ts_series):
    dfa = ts_series.apply(lambda x: pd.Series(get_day_and_hour(x), index=["day", "hour"]))
    dfg = dfa.groupby(["day", "hour"]).size().reset_index(name="count")
    days_hmp = np.zeros((7, 24))
    # Fill each (weekday, hour) cell directly via fancy indexing
    days_hmp[dfg["day"].values, dfg["hour"].values] = dfg["count"].values
    return days_hmp

def plot_weekday_hour_heatmap(days_hmp, ax, vmin, vmax, title):
    ax.matshow(days_hmp, cmap='summer', vmin=vmin, vmax=vmax)

    ax.set_xticks(np.arange(0, 24, 1))
    ax.set_yticks(np.arange(0, 7, 1))

    ax.set_yticklabels(["M", "T", "W", "T", "F", "S", "S"])
    ax.set_xticklabels(np.arange(0, 24, 1))

    ax.set_xticks(np.arange(-.5, 23, 1), minor=True)
    ax.set_yticks(np.arange(-.5, 6, 1), minor=True)
    ax.set_title(title)
    ax.grid(which="minor", linestyle="-", color='black', linewidth=1)

   
fig, [ax1, ax2] = plt.subplots(nrows=2, figsize=(14, 10))
plot_weekday_hour_heatmap(get_weekday_hour_heatmap_data(df["time_added"]), ax1, 10, 200, "Add")
plot_weekday_hour_heatmap(get_weekday_hour_heatmap_data(df[df["status"]!=0]["time_read"]), ax2, 0, 150, "Read")
plt.tight_layout()
plt.show()

No surprises again! I usually save articles while commuting to the office on weekdays. As for reading the saved articles, there is no clear pattern; reading is spread across the week.

Let's look at the number of words I might have read in the given time period.


In [ ]:
df[df["status"] != 0]["word_count"].sum()
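To put a total like that in perspective, we can convert it to reading time, assuming an average speed of about 250 words per minute (the total below is a made-up figure, not my actual count):

```python
total_words = 1_500_000   # hypothetical total; substitute the sum from the cell above
words_per_minute = 250    # assumed average reading speed
hours = total_words / words_per_minute / 60
print(f"~{hours:.0f} hours of reading")  # → ~100 hours
```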

What about the number of articles (excluding videos and other content)?


In [ ]:
len(df[(df["status"] != 0) & (df["is_article"] == 1)])

Oh! Let's end this thing with a nice word cloud. I should get a T-shirt printed with this cool word cloud.


In [ ]:
read_wordcloud = WordCloud(
    max_font_size=50,
    background_color="white",
    width=800,
    height=400).generate(" ".join(df[df["status"] != 0]["excerpt"]))

_, ax = plt.subplots(figsize=(12, 10))
ax.imshow(read_wordcloud, interpolation="bilinear")
plt.axis("off")
plt.tight_layout()
plt.show()