Homework 5


In [ ]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Latex, Markdown

!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('hw5.ok')

This assignment

In this assignment we will study a fundamental problem in crawling the web: deciding when to re-index changing pages.

Search engines like those deployed at Google and Microsoft periodically visit, or crawl, web pages to check their current status and record any changes to the page. Some pages, such as news websites, are updated frequently while other pages, like this one, are never updated.

Why not crawl the entire web all the time?

The web is big, and even giant search engines don't have the resources to re-read the entire web continuously. Fortunately, different sites change at different rates, so an intuitive strategy is to visit sites more often if they change more frequently. But this requires us to know, or at least estimate, how often each site changes. This is the main question you'll answer in this assignment.

Along the way you will:

  • do some EDA to understand the site dataset.
  • build several models for the site changes.
  • use simulation to get a concrete understanding of your models.
  • compute maximum-likelihood estimates of model parameters.
  • use simulation and visualization to understand the accuracy of your estimates of site change frequencies.

Examining the data

Our dataset consists of observations for about 1000 pages over a 30-day period. Every hour, each page was downloaded, and it was determined whether the page had been changed since the previous hour. For the first hour, we can't tell whether it has changed, so there are 719 such "checks" for each page.

Important fact about the data: Not every check succeeded. Sometimes the page failed to load, or it was otherwise impossible to tell whether it had changed since the previous hour. These hours are omitted from the dataset. A field in the dataset indicates the number of successful checks for each page.

The dataset is in JSON format in the file crawl.json. For each page, we have: the URL, called url; the number of successful visits to the page, called number of checks; and the checks when a change was detected, called positive checks. Examine the crawl.json file. You might find it convenient to load it into Python using json.load. The next two cells are provided for that.


In [ ]:
# Use this cell to examine the dataset, if you like.
!head -n 25 crawl.json

In [ ]:
# This cell loads the JSON data into Python.
import json
with open("crawl.json", "r") as f:
    crawl_json = json.load(f)

Question 1

Fill in the function json_description to determine:

  1. the number of records, and
  2. the set of possible top level keys (field names) in each record.

In [ ]:
def json_description(crawl_json_records):
    """Produces information about a JSON object containing crawl data.
    
    Args:
      crawl_json_records (list): A list of JSON objects such as the
                                 crawl_json variable above.
    
    Returns:
      2-tuple: An (int, set) pair.  The integer is the number of records
               in the given list.  The set is a set (constructed with
               Python's set() function) of strings.  It contains all the
               top-level field names that appear in any record
               in crawl_json_records.
    """
    n = len(crawl_json_records)
    keys = set(k for record in crawl_json_records for k in record.keys())
    return (n, keys)

In [ ]:
_ = ok.grade('q01')
_ = ok.backup()

In [ ]:
# Display results
n, keys = json_description(crawl_json)
print("Number of records:", n)
print("Keys in the records:", keys)

Question 2

What is the granularity of the dataset as represented in crawl.json? Write a one to two sentence description in the cell below:


In [ ]:
# Use this cell for your explorations.
q2_answer = r"""

**SOLUTION:** Each top-level JSON object represents all the crawls for one site.

"""

display(Markdown(q2_answer))

Question 3

It will be more convenient to work with the data in a Pandas DataFrame with a rectangular format. Fill in the function make_crawls_dataframe in the cell below. Then run the cell below that to create the table crawls.


In [ ]:
def make_crawls_dataframe(crawl_json):
    """Creates a Pandas DataFrame from the given list of JSON records.
    
    The DataFrame corresponds to the following relation:

        crawls(primary key (url, hour), updated)

    `updated` is a boolean value indicating whether the check for that
    hour found a change.

    The result is sorted by URL in ascending order and **further**
    sorted by hour in ascending order among the rows for each URL.
    
    Args:
      crawl_json (list): A list of JSON objects such as the
                         crawl_json variable above.
    
    Returns:
      DataFrame: A table whose schema (and sort order) is described
                 above.
    """
    
    positive_crawls = [{'url': page['url'], 'hour': hour-1, 'updated': True} 
                   for page in crawl_json 
                   for hour in page['positive checks']]

    negative_crawls = [{'url': page['url'], 'hour': hour-1, 'updated': False} 
                   for page in crawl_json 
                   for hour in (set(range(1, page['number of checks']+1)) - 
                               set(page['positive checks']))]

    crawls = pd.DataFrame.from_records(positive_crawls + negative_crawls)
    crawls.set_index(['url', 'hour'], inplace=True)
    crawls.sort_index(inplace=True)
    return crawls

In [ ]:
# Run this cell before you continue.
crawls = make_crawls_dataframe(crawl_json)

In [ ]:
_ = ok.grade('q03')
_ = ok.backup()

Question 4

There are other reasonable ways to represent the data in a relational database (or in DataFrames). One alternate schema uses 2 relations to represent the data. The first relation is:

crawls_that_found_changes(primary key (url, hour))

The primary key for the second relation is url. Define a schema for that 2nd relation. Your schema should ensure that the two resulting tables could be used to answer any question that could be answered using crawls.


In [ ]:
# Use this cell for your explorations.
q4_answer = r"""

**SOLUTION:** `pages(primary key (url), num_crawls)`

"""

display(Markdown(q4_answer))
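To make the proposed two-relation schema concrete, here is an optional sketch (the names crawls_that_found_changes and pages are just illustrative) showing how both relations could be derived from the crawls table you built in Question 3:

In [ ]:
# Illustrative only: derive the two relations from `crawls`.
# crawls_that_found_changes(primary key (url, hour))
crawls_that_found_changes = (
    crawls[crawls['updated']]
        .reset_index()[['url', 'hour']]
)

# pages(primary key (url), num_crawls): one row per URL with its number of checks
pages = (
    crawls
        .groupby(level='url')
        .size()
        .rename('num_crawls')
        .reset_index()
)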

Question 5

In the following we will construct visualizations to understand several key quantities of interest. The following cell constructs some key summary statistics that we will be using in the next few questions.


In [ ]:
crawl_stats = (
    crawls['updated']
        .groupby(crawls.index.get_level_values('url'))
        .agg(['count', 'mean', 'sum'])
        .rename(columns={
            'count': 'number of crawls',
            'mean': 'proportion of updates',
            'sum': 'number of updates'
        })
)

Part 1:

What was the distribution of the number of crawls for each page? Did most get crawled all 719 times? (For this and the other parts of this question, create a visualization to answer the question.)


In [ ]:
crawl_stats['number of crawls'].hist(bins=20)
plt.xlabel("Number of Crawls")
plt.ylabel("Number of Pages")

# Leave this for grading purposes
q5_p1_plot = plt.gcf()

Part 2

What was the distribution of the number of positive checks for each page?


In [ ]:
crawl_stats['number of updates'].hist(bins=20)
plt.xlabel("Number of Updates")
plt.ylabel("Number of Pages")

# Leave this for grading purposes
q5_p2_plot = plt.gcf()

Part 3

What is the relationship between the number of crawls for each page and the number of positive checks? Construct a scatter plot relating these two quantities.


In [ ]:
crawl_stats.plot.scatter('number of crawls', 'number of updates')

# Leave this for grading purposes
q5_p3_plot = plt.gcf()

Part 4

In 2 or 3 sentences, describe what you discovered in your initial explorations.


In [ ]:
q5_p4_answer = r"""

**SOLUTION:** Most sites were crawled every time, but a substantial number weren't.  
Most sites had only a few positive checks, but some sites were changed very frequently.  
It appears that the number of positive checks increases with the number of crawls.  
Of course, the number of positive checks is always less than the number of crawls!

"""

display(Markdown(q5_p4_answer))

Making a timeline from one site

It will be useful to be able to look at timelines of positive checks or changes for sites. The function display_points, defined below, will help.


In [ ]:
def display_points(points, xlim, title):
    """Displays a timeline with points on it.
    
    Args:
      points (ndarray): A list of floats in the range [xlim[0], xlim[1]],
                        each one a point to display in the timeline.
      xlim (list-like): A list/tuple/array with 2 elements giving the
                        start and end of the timeline.
      title (str): The title to display on the plot.
      
    Example:
        >>> # plot the points at [1, 4, 30, 50]
        >>> display_points([1,4,30,50], [0, 75], "Example")
    """
    fig, ax = plt.subplots(1)
    fig.set_size_inches(8, 1)
    ax.scatter(points, np.repeat(0, len(points)), alpha=0.5)
    ax.axhline(0, color="grey", zorder=-1, lw=.5)
    ax.yaxis.set_visible(False)
    ax.xaxis.set_visible(True)
    ax.set_xlim(xlim)
    ax.set_title("{} ({:d} total points)".format(title, len(points)))
    plt.show()

Example Usage


In [ ]:
display_points([1,4,30,50], [0, 75], "Example")

Question 6

We want to understand the behavior of page changes in order to determine how often a page should be visited. To do this, we need more than summary statistics of the 1000 sites. We also need to examine the patterns of positive checks for sites. Let's examine when the positive checks occurred for a handful of sites. The display_points function is helpful here.

We selected a variety of sites - some changed frequently, others rarely, and some had only a few successful crawls. Their URLs are available in pages_to_display. Use display_points to examine each page. You should notice that the visualization is not very informative in the last 3 cases.


In [ ]:
# This cell identifies a few categories of pages and 
# associates a label in the 'Desc' column of crawl_stats

crawl_stats['Desc'] = "" # default blank description

# Normal pages have a moderate number of updates and 
# were successfully crawled most times.
crawl_stats.loc[
        (crawl_stats['proportion of updates'] > 0.1)
        & (crawl_stats['proportion of updates'] < 0.9)
        & (crawl_stats['number of crawls'] >= 700), 
    'Desc'] = 'Normal'

crawl_stats.loc[
        (crawl_stats['proportion of updates'] < .1)
        & (crawl_stats['number of crawls'] >= 700), 
    'Desc'] = 'Rare Update'

crawl_stats.loc[
        (crawl_stats['proportion of updates'] > .9)
        & (crawl_stats['number of crawls'] >= 700), 
    'Desc'] = 'Frequent Update'

crawl_stats.loc[
        crawl_stats['number of crawls'] < 50, 
    'Desc'] = 'Few Crawls'


# Build a dataframe with a few examples from each type of webpage
num_of_each = 3
pages_to_display = pd.concat([
    crawl_stats[crawl_stats['Desc'] == "Normal"].head(num_of_each),
    crawl_stats[crawl_stats['Desc'] == 'Rare Update'].head(num_of_each),
    crawl_stats[crawl_stats['Desc'] == 'Frequent Update'].head(num_of_each),
    crawl_stats[crawl_stats['Desc'] == 'Few Crawls'].head(num_of_each),
])

In [ ]:
# Construct the timeline point diagrams for each example
for (url, desc) in pages_to_display['Desc'].items():
    crawls_for_site = crawls.loc[url]
    display_points(
        crawls_for_site[crawls_for_site['updated']].index,
        [0, len(crawls_for_site)], desc)

Examining the Times of Positive Checks

Consider what you learned about the positive checks in the previous question. Do these occur at regular intervals? Or do they appear to occur more randomly in time? Write a 2 to 3 sentence explanation of what you see:


In [ ]:
q6_answer = r"""

**SOLUTION:** They do seem to occur randomly in time.  However, it's hard to tell even with the
visualization - our brains are often not very good at distinguishing random noise
from patterns!

"""

display(Markdown(q6_answer))

Understanding this distribution of positive checks will help us determine the distribution of changes. However, determining this distribution is difficult to do if we do not crawl the page sufficiently often. (We will soon see that it is also hard to do if the site changed too many times.) For these reasons, we will focus on "normal" pages, i.e., pages that were successfully crawled at least 700 times and that had between 50 and 200 updates.

Do the positive checks occur uniformly at random? Let's compare the observed distribution of the positive checks to the uniform distribution for a few of our pages.


Question 7

To compare the data distribution to a hypothetical probability distribution, we can compare the histogram of the data to the probability density function. However, a more effective comparison is to compare the quantiles of the observed distribution to those of the probability distribution.

The $q^{th}$ quantile of the data, for $0 < q < 1$, is a point $x_q$ such that at least a fraction $q$ of the observations are at or below $x_q$ and at least a fraction $1-q$ of the observations are at or above $x_q$. For example, the median is the 0.5 quantile and the lower and upper quartiles are the 0.25 and 0.75 quantiles.

The $q^{th}$ quantile of a continuous probability distribution, for $0 < q < 1$ is that point $x_q$ such that $P(X \leq x_q) = q$ and $P(X \geq x_q) = 1-q$. The quantiles of a uniform distribution are easy to compute. For example, for a Uniform$(0, N)$ distribution, we have $x_q = qN$.
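As a quick numerical check of this fact (assuming scipy is available in your environment):

In [ ]:
# The q-th quantile of Uniform(0, N) is q*N.
from scipy.stats import uniform
N = 719
for q in [0.25, 0.5, 0.75]:
    # uniform(loc=0, scale=N) is the uniform distribution on [0, N]
    print(q, uniform(loc=0, scale=N).ppf(q), q * N)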

If the data come from the same probability distribution, then their quantiles should roughly match. If we make a scatter plot of pairs of the empirical and theoretical quantiles, like these 3 points:

(lower quartile of the data, lower quartile of the pdf),
(median of data, median of the pdf),
(upper quartile of the data, upper quartile of the pdf)

...the points should fall roughly on a line.

Such a plot is called a quantile-quantile or Q-Q plot. For each website with Desc = 'Normal' in pages_to_display, make a Q-Q plot for that site's updates, as follows:

  • Let $N$ be the number of crawls for this page and $n$ be the number of positive checks.
  • Order the positive check values for a site from smallest to largest. We use these as the quantiles. That is, the $k^{th}$ smallest observation is the $k/(n+1)$-quantile.
  • Compute the $n$ corresponding quantiles for the uniform distribution on $(0, N)$. These are simply $$k \times \frac{N}{n+1},$$ for $k = 1, \ldots, n$. (Why?)
  • Plot the $n$ pairs of quantiles in a scatter plot.

Part 1


In [ ]:
def compute_q_q_pairs(url):
    """Computes lists of uniform-distribution-quantiles and data-quantiles for a page.
    
    Args:
      url (str): The URL of a page.
    
    Returns:
      2-tuple: A pair (AKA a 2-tuple).  Both components are lists or arrays of
               length n, where n is the number of positive checks for the page.
               The first component contains quantiles of a uniform distribution,
               and the second contains quantiles of the distribution of positive
               check times for the page.  The kth element of a component is the
               (k+1)/(n+1)-th quantile of its distribution or dataset.
               
               The first component of the pair contains quantiles of the uniform
               distribution on [0, N], where N is the number of checks performed
               for the page.
    """
    updates = crawls.loc[url]['updated']
    obs_q = np.array(updates[updates].index)
    N = len(updates)
    n = len(obs_q)
    pdf_q = np.arange(1, n + 1) * N / (n + 1)
    return (pdf_q, obs_q)

In [ ]:
_ = ok.grade('q07')
_ = ok.backup()

The following code constructs the Q-Q plots for the "Normal" sites defined above.


In [ ]:
for url in pages_to_display[pages_to_display['Desc'] == "Normal"].index.values:
    print(url)
    (pdf_q, obs_q) = compute_q_q_pairs(url)
    sns.regplot(x=pdf_q, y=np.array(obs_q))
    plt.xlabel("True uniform quantile")
    plt.ylabel("Actual quantile")
    plt.show()

Part 2

Describe your findings.


In [ ]:
# SOLUTION
q7_p2_answer = r"""

**SOLUTION:** The pages we examined appear to have roughly uniform updates.
There are deviations from a straight line, but it's hard to tell whether they
could have happened by chance.

"""

display(Markdown(q7_p2_answer))

Question 8

Even if the updates were distributed uniformly, the Q-Q plot will not be exactly a straight line. To get a sense of what a uniform-quantile plot might look like if the data were truly distributed according to the uniform distribution, simulate $n$ observations from the uniform distribution on the interval (0, 719) and make a uniform-quantile plot for these simulated data. $n$ is defined in the next cell; write your code in the cell below that.


In [ ]:
url = 'http://a.hatena.ne.jp/Syako/simple'
n = np.count_nonzero(crawls.loc[url]['updated'])

In [ ]:
# SOLUTION
obs_q = np.round(np.random.uniform(0, 719, n))
obs_q.sort()

N = 719
pdf_q = np.arange(1, n + 1) * N / (n + 1)

sns.regplot(x=pdf_q, y=np.array(obs_q))
plt.xlabel("True uniform quantile")
plt.ylabel("Actual quantile")

# Leave this for grading purposes
q8_plot = plt.gcf()

How do the empirical quantile plots from the previous question compare to your simulated quantile plot? Optionally, we suggest looking at a few other sites and a few other simulated sets of data to see how well they match.


In [ ]:
q8_answer = r"""

**SOLUTION:** Our data drawn from a uniform distribution produced a Q-Q plot
that's a *little* more "line-like" than the Q-Q plot for the actual positive
checks for the first page in the previous question.  The difference was not so
big; we would need to run a more formal test to see if the positive checks for
the first page were really abnormal.  The other pages were even more
uniform than our simulated data.

"""

display(Markdown(q8_answer))

Estimating the Update Rate: A Simple Approach

How would you estimate the change rate for a page?

For example, imagine that in 720 hourly visits to a page, we observed a change in the page on 36 visits. One estimate for the rate of changes is:

$$\frac {36\text{ changes}}{720\text{ hours}} = \frac{1}{20} \text{ changes per hour}$$

There is a small problem with our estimate. What if a page changes twice in an hour? We would see only one positive check. We do not observe the true number of changes that happened to a page in 30 days, only the number of hours in which at least one change was detected. Think about how this could affect our interpretation of how often websites change.

To help answer this question, we use a probability model for the change behavior of a site and examine the impact of the incomplete observations on our simple estimate of a site's rate of change.

A Model for Page Updates

What model should we use?

In our earlier Q-Q plot analysis we found that the updates appear to occur uniformly at random.

However, the number of positive checks (and therefore the number of page changes) is not the same from one page to the next. That is, both the number of changes and the hours of the changes appear to be random.

When events can happen independently at any point in a span of time, and they're equally likely to happen at any of the times, a good model for the number of events that happen is the Poisson distribution.

The Poisson distribution has a simple probability mass function,

$$P(k) = \frac{\lambda^k}{k!} e^{-\lambda},$$

for $k = 0, 1, 2, \ldots$. The parameter $\lambda$ is called the rate, and this is a Poisson$(\lambda)$ distribution.

For example, if $\lambda$ is the hourly rate of changes to a page, then the chance of $k$ changes in one hour is:

$$P(k) = \frac{\lambda^k}{k!} e^{-\lambda}.$$

If we count the number of changes over $N$ hours, then the number of changes in this time period has a Poisson$(N \lambda)$ distribution. That is,

$$P(k \texttt{ updates in }N \texttt{ hours}) = \frac{(N\lambda)^k}{k!} e^{-(N\lambda)}.$$
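For concreteness, here is a small numerical illustration of this formula (the rate used below is just an example value, and scipy is assumed to be available):

In [ ]:
# P(k updates in N hours) for a Poisson process with hourly rate lam.
from scipy.stats import poisson

lam = 0.05   # illustrative hourly change rate
N = 719      # hours observed
for k in [0, 1, 36, 50]:
    # poisson.pmf(k, mu) computes (mu^k / k!) * exp(-mu), here with mu = N*lam
    print(k, poisson.pmf(k, N * lam))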

Question 9

Suppose that we observe $n$ changes for a site that was visited for $N$ hours. If the number of changes follows the Poisson distribution, show that $\frac{n}{N}$ is the MLE for $\lambda$. In fancy notation, we could call this $\lambda^{\text{MLE}}$. (Note that we are currently ignoring the problem of not being able to detect multiple changes in an hour.)

Note: It's okay to write your answer in plain text. If you know $\LaTeX$ or would like to learn it, now is a good time to try it out. (If you double-click on this cell or some of the cells above, you can see some $\LaTeX$ examples.)


In [ ]:
q9_answer = r"""

The probability of the data given $\lambda$ is:

$$L(\lambda) = e^{-(\lambda N)} \frac{(\lambda N)^{n}}{n!}$$

...or in log form:

$$\log L(\lambda) = -(\lambda N) + n \log(\lambda) + n \log(N) - \log(n!)$$

Setting the derivative to 0, we find:

$$0 = \left[\frac{d}{d \lambda} \log L \right](\lambda^{\text{MLE}}) = 
-N + \frac{n}{\lambda^{\text{MLE}}}$$

(Notice that the terms in $\log L$ that didn't involve $\lambda$, like $\log(n!)$,
conveniently disappeared when we took the derivative of the log-likelihood.)

Therefore,

$$\lambda^{\text{MLE}} = \frac{n}{N}$$
"""

display(Markdown(q9_answer))
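As an optional sanity check on this derivation, we can evaluate the log-likelihood on a grid of $\lambda$ values (the $n$ and $N$ below are made-up example counts) and confirm that it peaks at $n/N$:

In [ ]:
# Numerically verify that the Poisson log-likelihood peaks at lambda = n/N.
n, N = 36, 719                         # illustrative counts
lams = np.linspace(0.001, 0.2, 2000)
log_L = -lams * N + n * np.log(lams)   # terms not involving lambda are omitted
print("argmax of log-likelihood:", lams[np.argmax(log_L)])
print("n / N:                   ", n / N)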

Question 10

  1. Add a 'simple mle' column to the crawl_stats table containing the 'simple mle' estimator we derived earlier for each website.

  2. Then, make a plot that displays the distribution of these MLEs for the sites with at least 700 crawls.


In [ ]:
crawl_stats['simple mle'] = crawl_stats['proportion of updates']

crawl_stats[crawl_stats['number of crawls'] >= 700]['simple mle'].hist(bins=20)
plt.title("The distribution of simplified MLEs for the change rate for each site");

q10_plot = plt.gcf()

In [ ]:
_ = ok.grade('q10')
_ = ok.backup()

The Impact of Hourly Observations

The histogram that you made for the previous problem has a small mode at 1. Why is this? It is not possible for our estimate of the rate of changes to be greater than once an hour because we only observe the page once an hour. So if the rate of changes is large, we are likely to underestimate it. Let's try to assess the impact of this.

We will carry out a Monte Carlo (repeated, randomized simulation) study of the process that generates our observations. To perform such a study, we need a model for how the checks came out. For that, we need a model for the page changes themselves.

We previously said that the number of changes follows a Poisson distribution with a certain rate for each page. We also saw that the positive checks are distributed roughly uniformly. It seems reasonable to assume that the changes themselves are also distributed uniformly.

When the number of events follows the Poisson distribution and the location of the events (along the time interval) follows the uniform distribution, then we have what is called a Poisson process. This is a random object that's a set of numbers rather than just one number.

To perform our Monte Carlo study, for various values of $\lambda$ we will generate the simulated changes according to the Poisson process distribution and then reduce the changes to the hour of the change. More specifically, we can simulate the data as follows:

  • Generate $M$, the number of updates, by drawing from the Poisson(719$\lambda$) distribution.
  • Place each of the $M$ updates uniformly at random on the interval (0, 719).
  • "Snap" each update to the next hour. For example, a 3:15 update time becomes 4:00, which is the time when we would have observed it. (What do you do when more than one update occurs within an hour?)

Question 11

Following the above description of the Poisson process, complete the function sample_poisson_process.

Hint: The function np.random.poisson will be useful.


In [ ]:
def sample_poisson_process(rate, length):
    """Generates n points from Poisson(rate*length) and locates them
    uniformly at random on [0, length].
    
    Args:
      rate (float): The average number of points per unit length.
      length (float): The length of the line segment.
    
    Returns:
      ndarray: An array of points scattered randomly on [0, length].
               The number of points has a Poisson(rate*length)
               distribution.
    """
    total = np.random.poisson(rate*length)
    # Scatter `total` points uniformly at random on [0, length]; sort for convenience.
    return np.sort(np.random.uniform(0, length, total))

In [ ]:
_ = ok.grade('q11')
_ = ok.backup()

The snap_times function in the next cell will help you simulate the hourly observations from the crawler.


In [ ]:
def snap_times(update_times, window_length, process_length):
    """Given a list of change times, produces a list of the windows that detected a
    change (that is, where at least one change happened inside the window).
    
    This has the effect of 'snapping' each change to the next hour (or whatever the
    window_length is).  For periods where more than one change happened, the output
    will still only list the period once.
    
    In other words, it produces a list of the positive checks given a list of true
    change times.
    
    Args:
      update_times (ndarray): A list of times when changes happened for a page.
                              All times are between 0 and process_length.
      window_length (float): The width of each window.  (For example, if time
                             is being measured in hours, and an observation
                             happens once per hour, then this should be 1.)
      process_length (float): The last time any change could have happened.
    
    Returns:
      ndarray: A list of numbers, each one the right endpoint of a window where at
               least one change happened."""
    window_ends = np.arange(0, process_length, window_length) + window_length
    num_windows = len(window_ends)
    event_windows = np.floor(np.array(update_times) / window_length).astype(int)
    events_by_window = np.bincount(event_windows, minlength=num_windows)
    event_happened = events_by_window > 0
    return window_ends[event_happened]
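Here is a quick deterministic example of snap_times (the update times below are arbitrary) so you can see the snapping behavior before applying it to simulated data:

In [ ]:
# An update at 3:15 or 3:45 is first observed at hour 4; one at 10:30 at hour 11.
# The two updates inside the hour ending at 4 produce only one positive check.
snap_times([3.25, 3.75, 10.5], 1, 24)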

Question 12

Use the functions sample_poisson_process and snap_times to examine what happens when we visit hourly. Look at examples where

  • the length of time is 24 hours,
  • the rate is 1/8, 1/4, 1/2, 1, and 2 changes per hour, and
  • the window_length is 1.

For each value of rate, simulate one set of changes on the interval (0 hours, 24 hours), and plot the resulting points on a timeline using display_points. Then snap these data to the next hour with snap_times and plot the resulting positive check times.

Then, compare the actual change times to the censored times for the various rates. What happens as the rate changes? Do you think hourly visits to the site are a problem if the rate is 1/8? How about 2?


In [ ]:
for rate in [1/8, 1/4, 1/2, 1, 2]:
    draws = sample_poisson_process(rate, 24)
    windows = snap_times(draws, 1, 24)
    display_points(draws, (0, 24), "Actual update times (rate={:f})".format(rate))
    display_points(windows, (0, 24), "Crawler checks that found updates")
q12_answer = r"""

**SOLUTION:** It seems problematic for rates 1/2 or higher, but possibly okay if the rate is 1/8 or 1/4.

"""

display(Markdown(q12_answer))

Simulation study of the rate estimate

Now let us extend our Monte Carlo study to compute the MLE from the observed positive checks. Since we know the true change rate in the simulations, we can compare the MLE to the true change rate and see how accurate it is as an estimator. (We couldn't have done that with our real data, since we don't know the true change rates.)


Question 13

Suppose a website is visited every hour for 30 days, so it has 719 visits. For a particular rate $\lambda$,

  • Generate the observed positive checks with our earlier functions (sample_poisson_process and snap_times).
  • Calculate the MLE from these simulated data.
  • Repeat 1000 times.

This average estimate will change as we increase the true rate of changes, so vary $\lambda$ between 0 and 4 and make a plot showing the estimate as a function of $\lambda$. For comparison, show the true change rates in your plot as well.


In [ ]:
def simulate_censored_rate_estimate(true_rate, num_visits):
    """Simulates updates to a page and visits by a web crawler, and
    returns the proportion of visits in which an update was observed.
    
    Args:
      true_rate (float): The average number of updates per unit length
                         of time.  (The units are irrelevant for the
                         purposes of this function, but you can imagine
                         that they are hours.)
      num_visits (float): The number of visits made to the site.  One
                          visit is made per unit length of time, so
                          this is also equal to the duration of time
                          simulated.
    
    Returns:
      float: the MLE for true_rate lambda, based on the number of 
      observed positive checks.
    """
    draws = sample_poisson_process(true_rate, num_visits)
    windows = snap_times(draws, 1, num_visits)
    n = len(windows)
    return n / num_visits

The following helper function can be used to simulate many rate estimates using the function you just completed.


In [ ]:
def simulate_many_rate_estimates(true_rate, num_visits):
    return np.mean([simulate_censored_rate_estimate(true_rate, num_visits) 
                    for _ in range(1000)])

The following code will simulate rate estimates for various values of $\lambda$ (this cell may take up to 1 minute to run):

Finally, the following code will plot the estimated values for $\lambda$ against their true values. You will need this plot to answer the next question.
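The two cells referred to above could look something like this (the grid of $\lambda$ values and the reference line are illustrative choices rather than required ones):

In [ ]:
# Simulate the average rate estimate for a grid of true rates.
true_rates = np.linspace(0, 4, 21)
estimated_rates = np.array(
    [simulate_many_rate_estimates(r, 719) for r in true_rates])

In [ ]:
# Plot the estimated rates against the true rates, with y = x for reference.
plt.plot(true_rates, estimated_rates, label="average estimated rate")
plt.plot(true_rates, true_rates, linestyle="--", label="true rate")
plt.xlabel(r"true rate $\lambda$")
plt.ylabel("estimate")
plt.legend();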


In [ ]:
_ = ok.grade('q13')
_ = ok.backup()

Question 14

Part 1

Explain why the estimated rates seem to level off at 1.


In [ ]:
q14_p1_answer = r"""

**SOLUTION:**

The proportion of hours in which an update was observed can never be greater 
than 1 because we can't observe updates more than once an hour.  
The estimate for $\lambda$ is closest to the true value for small $\lambda$. 
For $\lambda = 0.25$, the average estimate is about $1 - e^{-0.25} \approx 0.22$, 
so it is off by only about 0.03 (roughly 12% in relative terms). 


"""

display(Markdown(q14_p1_answer))

Part 2

How far off is the estimate from the truth for $\lambda$ less than 0.25?


In [ ]:
q14_p2_answer = r"""

**SOLUTION:** It's not so far off: the relative error is at most about 12% (near $\lambda = 0.25$) and shrinks as $\lambda$ gets smaller.

"""

display(Markdown(q14_p2_answer))

Estimating the update rate: An improved approach

Our simulation study has shown us that our MLE estimate for the rate is biased. It systematically underestimates the quantity we're trying to estimate, the rate. This bias is small for small $\lambda$, but can we eliminate it?

We can recast the problem slightly by reconsidering our observations. Although the possible times of page changes may be continuous, we only observe the data hourly, and we only observe whether there was at least one update in each hour. Does this remind you of a distribution that you have worked with?

What if we consider the data for a page as 719 0-1 values, where 0 indicates no change and 1 indicates at least 1 change in the hour? Do you see why these 0-1 values are equivalent to our representation of the data as positive checks? For example, if N is 10 and we observe changes at hours 3, 7, and 9, then this information is equivalent to the sequence 0,0,1,0,0,0,1,0,1,0.
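Here is a minimal sketch of that conversion (the names N and positive_hours below are just illustrative, matching the example in the text):

In [ ]:
# Convert a list of positive-check hours into the equivalent 0-1 sequence.
N = 10
positive_hours = [3, 7, 9]
indicators = np.isin(np.arange(1, N + 1), positive_hours).astype(int)
print(indicators)   # [0 0 1 0 0 0 1 0 1 0]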

The distribution of each 0-1 value is Bernoulli($p$), where $p$ is the chance of at least one change in an hour. Under our assumption that the changes come from a Poisson process, these values are independent.

This distribution incorporates the censoring of the data. How can we estimate the rate? We can use the Poisson distribution to compute the chance of a 0 or 1, which depends on the rate. Then we use the Bernoulli distribution to find the MLE of $\lambda$.


Question 15

What is the chance that a Poisson($\lambda$) random variable is equal to 0? What is the chance that it's greater than or equal to 1?


In [ ]:
# SOLUTION
q15_answer = r"""

**SOLUTION:** $e^{-\lambda}$ and $1 - e^{-\lambda}$, respectively.

"""

display(Markdown(q15_answer))

To check your answers, fill in the numerical values of each probability for $\lambda = 0.5$.


In [ ]:
# The probability that a Poisson(0.5) random variable is equal to 0
Prob_pois_half_equals_0 = np.exp(-0.5)
# The probability that a Poisson(0.5) random variable is greater than or equal to 1
Prob_pois_half_gte_1 = 1 - np.exp(-0.5)

In [ ]:
_ = ok.grade('q15')
_ = ok.backup()

Question 16

With this modified model, the MLE of $\lambda$, given $n$ updates in $N$ visits for a site, is:

$$\lambda^* = \log \left(\frac{n}{N - n} + 1 \right)$$

Show that that's true.

Hint: If you find an expression like $\frac{e^{-\lambda}}{1 - e^{-\lambda}}$, it often helps to convert such an expression to $\frac{1}{e^{\lambda} - 1}$ by multiplying its numerator and denominator by $e^{\lambda}$.


In [ ]:
q16_answer = r"""

$$L(\lambda) = (e^{-\lambda})^{N - n} (1 - e^{-\lambda})^{n}$$

$$\log L(\lambda) = (N - n) (-\lambda) + n \log(1 - e^{-\lambda})$$

$$0 = \left[\frac{d}{d \lambda} \log L \right](\lambda^*) = -(N - n) + n \frac{e^{-\lambda^*}}{1 - e^{-\lambda^*}}$$

$$e^{\lambda^*} = \frac{n}{N - n} + 1$$

$$\lambda^* = \log \left(\frac{n}{N - n} + 1 \right)$$

"""

display(Markdown(q16_answer))

What happens when we observe a change every hour? Then, our MLE $\lambda^*$ takes the value $\infty$.

Why is that true? Intuitively, if there is a change on each visit, we can always increase the likelihood of our data (which is the likelihood that every hour saw at least one change) by increasing $\lambda$.

An MLE of $\infty$ is clearly problematic. We can use the idea of Laplace smoothing to modify the estimator a bit. We'll use the following "modified MLE" estimator, which adds to every site a single imagined visit where "half" saw a change and the other "half" didn't:

$$\lambda^+ = \log \left(\frac{n + .5}{N - n + .5} + 1 \right)$$

You can imagine that this estimate incorporates a prior conservative intuition that a site might not change all the time even if it happened to change every hour in the particular 30-day period we observed.
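To get a feel for how $\lambda^+$ compares to the earlier estimators, here is a small illustrative computation using the numbers from the 36-changes-in-720-visits example, plus the degenerate case where a change is observed on every visit:

In [ ]:
# Compare the three estimators on the earlier example: 36 positive checks in 720 visits.
n, N = 36, 720
simple = n / N                                    # simple MLE from Question 9
censored = np.log(n / (N - n) + 1)                # lambda*, the MLE under the Bernoulli model
smoothed = np.log((n + .5) / (N - n + .5) + 1)    # lambda+, with Laplace smoothing
print(simple, censored, smoothed)

# In the degenerate case n == N (a change seen every hour), lambda* is infinite,
# but lambda+ stays finite:
print(np.log((N + .5) / 0.5 + 1))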


Question 17

  1. Add a 'modified mle' column to the crawl_stats table containing the 'modified mle' estimator $\lambda^+$ for each website.

  2. Then, make a histogram that displays the distribution of the modified MLEs for the sites with at least 700 crawls.

Hint: You may find your work in question 10 helpful.


In [ ]:
crawl_stats["modified mle"] = np.log(
    (crawl_stats['number of updates'] + .5)
    / (crawl_stats['number of crawls'] - crawl_stats['number of updates'] + .5)
    + 1)
crawl_stats["modified mle"].hist(bins=20);

In [ ]:
_ = ok.grade('q17')
_ = ok.backup()

Notice that this distribution is quite different from the earlier one we found in Question 10. Now there is no pile-up of estimates at 1.

How accurate are our estimates?

We don't know the true update rates of the pages, so we can't know exactly how accurate our estimates are. But this is often the case in data science. Let's try to figure out a reasonable way to estimate the accuracy of our estimates.

The strategy is similar to the bootstrap. In pseudocode:

for each page $i$ :

... Assume the true change rate $r_i$ equals our estimate $\lambda^+_i$.

... for _ in range(num_simulations):

... ... Simulate changes and positive checks.

... ... Compute a new maximum likelihood estimate of $r_i$ based on the simulated data.

... Plot these estimates in a histogram or summarize them in some other way.

This is called a parametric bootstrap method, because instead of resampling directly from our data, we use our data to estimate its distribution's parameters and then sample from that distribution.


Question 18

Complete the definitions of simulate_change_rate_estimate and modified_mle in the cell below.


In [ ]:
def simulate_change_rate_estimate(change_rate, num_observations, estimator):
    """Simulates hourly change observations for a website and produces an
    estimate of the change rate.
    
    Args:
      change_rate (float): The hourly change rate of the website to be used in
                           the simulation.
      num_observations (int): The number of observations (equivalently, the
                              number of hours) to simulate.
      estimator (func): A function to apply to the simulated observations to
                        produce an estimate of the change rate.  Has the same
                        signature as the function modified_mle below.
    """
    draws = sample_poisson_process(change_rate, num_observations)
    windows = snap_times(draws, 1, num_observations)
    return estimator(windows, num_observations)

def modified_mle(positive_checks, num_observations):
    """Produces the modified MLE for a dataset.
    
    Args:
      positive_checks (ndarray): A list or array of the hours when a
                                 change was observed.
      num_observations (int): The number of hours in which we could
                              have observed a change.
    
    Returns:
      (float): The modified MLE.
    """
    num_updates = len(positive_checks)
    return np.log((num_updates + .5) / (num_observations - num_updates + .5) + 1)

In [ ]:
_ = ok.grade('q18')
_ = ok.backup()

Question 19

Complete the definition of plot_bootstrap_rate_estimates in the cell below. Read the docstring carefully so you know what to do. Then run the provided cell below that to plot the bootstrap rate estimate distributions for a few pages.


In [ ]:
def plot_bootstrap_rate_estimates(page_url, num_simulations):
    """Simulates positive check observations for a website many times.  For
    each simulation, the page's change rate is estimated based on the simulated
    positive checks.  The estimation method is the modified_mle function
    implemented above.  Then this function produces a histogram that shows
    the distribution of these estimated change rates.
    
    When conducting each simulation, the change rate is taken to be the
    estimated update rate for the page.  *That* estimate is the modified
    MLE computed from the actual positive check observations for that page.
    
    The modified MLE computed from the actual observations for the page
    is also displayed as a vertical red line at an appropriate horizontal
    position on the histogram.
    
    Here is a diagram that might be helpful:
    
                                             (This happens 1 time)
                                                     |
                                                     v
    Positive check observations for this page ---modified_mle---> Estimate of change rate for this page
    
                                         (once per simulation)
    Estimate of change rate for this page ---simulation---> Simulated positive checks
    
                              (once per simulation)
    Simulated positive checks ---modified_mle---> Bootstrap estimate of change rate for this page
    
    1000 bootstrap estimates of change rate for this page ---> Histogram
                                                               ^
                                                               |
    Estimate of change rate for this page --------------------/
                                              (displayed somewhere on the histogram)
    
    Args:
      page_url (str): The URL of the page to simulate.
      num_simulations (int): The number of simulations to run.
    
    Returns:
      None: This function doesn't return anything; it just makes a histogram
      appear as described above.
    """
    update_rate_estimate = crawl_stats.loc[page_url]["modified mle"]
    num_observations = crawl_stats.loc[page_url]['number of crawls']
    bootstrap_estimates = np.array([
        simulate_change_rate_estimate(update_rate_estimate, num_observations, modified_mle) 
        for _ in range(num_simulations)])
    plt.hist(bootstrap_estimates, bins=20)
    plt.axvline(crawl_stats.loc[page_url]["modified mle"], color='red')
    plt.show()

In [ ]:
# Run this cell to make plots for several pages using your
# function.  Since no automatic tests are provided for this
# question, we suggest examining the plots to make sure they
# make sense to you.
many_updates_url = crawl_stats[np.logical_and(
    crawl_stats['number of crawls'] >= 700,
    np.logical_and(
        .2 <= crawl_stats['proportion of updates'],
        crawl_stats['proportion of updates'] <= .8))].index[0]
plot_bootstrap_rate_estimates(many_updates_url, 1000)

few_updates_url = crawl_stats[np.logical_and(
    crawl_stats['number of crawls'] >= 700,
    np.logical_and(
        .05 <= crawl_stats['proportion of updates'],
        crawl_stats['proportion of updates'] <= .15))].index[0]
plot_bootstrap_rate_estimates(few_updates_url, 1000)

Looking at the error distribution is good, but it's also useful to characterize error with a single number. (This facilitates comparisons among different estimators, for example.)

A common way to summarize error in estimation is the root mean squared error, or RMSE. The RMSE is the square root of the MSE, or mean square error. We calculate the MSE as follows:

  • the average (across many simulations, where each simulation results in 1 estimate) of
  • the squared difference between the estimate and its actual value.

Working with the RMSE, rather than the MSE, is useful because the RMSE is in the same units as the estimates, so its magnitude can be compared directly with them.
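As a tiny illustration of the calculation described above (the estimates and the assumed true value below are made up):

In [ ]:
# RMSE of a few made-up estimates around an assumed true value of 0.05.
estimates = np.array([0.04, 0.05, 0.07])
true_value = 0.05
rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
print(rmse)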

We don't know the true change rate for each website, so we can't compute the true RMSE. However, if we compute the RMSE using the data making up the histogram you produced in the previous question, using the original estimate (the red line) as the actual change rate, that is a bootstrap estimate of the RMSE.

Question 20


Complete the definition of estimate_rmse in the cell below. Then, use it to estimate the RMSE for each page's modified MLE. Add these RMSEs as a column named 'rmse' in crawl_stats.

Note: There are 1000 pages, so this will take some time. We recommend using only 100 simulations per page, which should be sufficient for reasonable estimates. Try your code on a single page before running it on every page.


In [ ]:
def estimate_rmse(page_url, num_simulations):
    """Simulates update observations for a website many times.  For each
    simulation, the page's change rate is estimated based on the simulated
    observations.  (The estimation method is the modified MLE.)  Then this
    function produces an estimate of the RMSE of estimates of the change
    rate for this page.
    
    When conducting each simulation, the change rate is taken to be the
    estimated change rate for the page.  *That* estimate is the modified
    MLE computed from the actual observations for the page.
    
    We compute the modified MLE for each set of simulated observations.  That
    constitutes num_simulations estimates of the change rate.
    
    Then we compute the RMSE of those estimates.  The "true" change rate in
    that calculation is taken to be the modified MLE computed from the
    actual observations for the page.
    
    Args:
      page_url (str): The URL of the page to simulate.
      num_simulations (int): The number of simulations to run.
    
    Returns:
      float: The estimated RMSE of the modified MLE for the given page,
             based on num_simulations simulations.
    """
    change_rate_estimate = crawl_stats.loc[page_url]["modified mle"]
    num_observations = crawl_stats.loc[page_url]['number of crawls']
    bootstrap_estimates = np.array(
        [simulate_change_rate_estimate(change_rate_estimate, num_observations, modified_mle) 
         for _ in range(num_simulations)])
    return np.mean((bootstrap_estimates - change_rate_estimate)**2)**0.5

print(estimate_rmse(many_updates_url, 1000))
print(estimate_rmse(few_updates_url, 1000))

In [ ]:
crawl_stats['rmse'] = [estimate_rmse(i, 100) for i in crawl_stats.index]

In [ ]:
_ = ok.grade('q20')
_ = ok.backup()

Question 21

Create a visualization to display the RMSEs you computed. Then, create another visualization to see the relationship (across the 1000 pages) between RMSE and the modified MLE.


In [ ]:
crawl_stats["rmse"].hist(bins=20);
crawl_stats[crawl_stats['number of crawls'] == 719].plot.scatter("modified mle", "rmse");

Is the model reasonable?

All the foregoing work has assumed a particular probability model for website updates: the number of changes follows a Poisson($\lambda$) distribution and the locations of the changes, given the number of changes, follows a uniform distribution. Like most models, this is certainly imperfect. However, it still may produce reasonable estimates. Let's check some of the conditions of the model to see if they are reasonable.

Embedded in our model is the assumption that the rate of changes for a page is constant over time. If we find a strong pattern in the rates of updates, that is evidence against this assumption. For example, you might guess that changes for certain pages might happen more often at certain times of the day, or on certain days of the week.

Before you group by day of week, here are some difficulties presented by the dataset:

  1. The day of the week is not given.
  2. The hour of the day is not given.
  3. We don't know when the checks started.
  4. When a site had a failed check, the check was just omitted from the dataset, so we don't know when the failed checks happened, only their total count for each page.
  5. Not all sites are in the same or nearby time zones.

Here is what we do know:

  1. Checks for all pages started at the same time, at midnight PST (US Pacific Coast time).
  2. Checks were attempted every hour for 30 days, and then they stopped.
  3. We know each page's URL. A URL may tell us something about its geography and therefore its rough time zone. For example, the top-level domains .edu, .com, .gov, and .net include mostly US sites.

Question 22

Propose a plan to answer one of these questions:

  1. How much did the rate of changes vary by hour of the day?
  2. How much did the rate of changes vary by day of the week?

Or, you may choose your own pattern to test.

Your plan doesn't need to produce results that hold for all pages, only a reasonably-large subset. Be sure to make clear exactly what question your analysis would answer.

Note: You don't need to actually implement your plan!


In [ ]:
q22_answer = r"""

TODO: Write a solution.

"""

display(Markdown(q22_answer))

Submitting your assignment

Congratulations, you're done with this homework!

Run the next cell to run all the tests at once.


In [ ]:
_ = ok.grade_all()

Finally, run the next cell to submit the assignment to OkPy so that the staff will know to grade it. You can submit as many times as you want, and you can choose which submission you want us to grade by going to https://okpy.org/cal/data100/sp17/. After you've done that, make sure you've pushed your changes to Github as well!


In [ ]:
_ = ok.submit()