Summary

The purpose of this notebook is to clearly demonstrate the problem I ran into with $offset, and to work through resolving it.

By default, querying the public API of data.sfgov.org returns only the first 1,000 rows of data. That limit is reasonable and good practice, but it should then be easy to paginate without getting duplicate data.

According to the API documentation, it's important to use $order together with $offset so that the data is returned in a consistent order.

What's bizarre is that ordering by a specific column and then requesting the next offset doesn't necessarily return the next batch.

The API documentation (https://dev.socrata.com/docs/paging.html) cautions that an order clause is needed for paging:

Heads Up! The order of the results of a query are not implicitly ordered, so if you're paging, make sure you provide an order clause or at a minimum $order=:id. That will guarantee that the order of your results will be stable as you page through the dataset.
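The docs' advice can be sketched as a small helper that always pairs $offset with an explicit $order clause; ":id" is the system row identifier the docs recommend as a stable fallback ordering. (The helper name is mine, not part of the API.)

```python
# A minimal sketch of the documented advice: never page without $order.
BASE = 'https://data.sfgov.org/resource/wbb6-uh78.json'  # dataset used in this notebook

def stable_page_query(offset, limit=1000):
    """Build a query URL whose row order is stable across pages ($order=:id)."""
    return '{0}?$order=:id&$limit={1}&$offset={2}'.format(BASE, limit, offset)

print(stable_page_query(0))
print(stable_page_query(1000))
```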

The problem, though, is that the results do not appear to be stable.

Steps to Repro

As shown below, the steps to reproduce the issue are:

  • run the query using $order and $offset=0
  • check the min and max of the ordered column
  • change the $offset and run the query again
  • check the min and max of the ordered column

Expected/Wanted:

  • The min and max values of the first query should not overlap with those of the subsequent offset query

Got:

  • There appears to be quite a bit of overlap!
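The overlap check above can be sketched as pure logic (no network needed): given the (min, max) of the ordered column for two consecutive pages, the two ranges should not intersect if paging is working. The values below are the ones observed later in this journal; ISO timestamps compare correctly as strings.

```python
# Sketch of the overlap check from the repro steps above.
def ranges_overlap(lo1, hi1, lo2, hi2):
    """True if the closed ranges [lo1, hi1] and [lo2, hi2] intersect."""
    return lo1 <= hi2 and lo2 <= hi1

# min/max of close_dttm observed for offset=0 and (buggy) offset=1 below:
page0 = ('2016-06-24T16:28:44.000', '2016-07-10T22:08:15.000')
page1 = ('2016-06-24T16:05:07.000', '2016-07-10T19:52:41.000')
print(ranges_overlap(page0[0], page0[1], page1[0], page1[1]))  # True: the pages overlap
```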

Update:

I was in touch with someone from Socrata on the IRC channel, and here's what came up:

https://data.sfgov.org/resource/wbb6-uh78.json?$order=close_dttm%20DESC&$offset=0&$limit=3000

That’d get all 3000

If you want to get them in pages, change $limit and $offset

...and the problem was still there... The offset query still only retrieved rows within the same close_dttm date range, 5/27 through 7/10.

Update Number 2:

I'm an idiot. The reason it behaves this way is that the offset should be 3,000; it is NOT the page number!
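The arithmetic behind that realization: $offset counts rows, not pages, so page n (zero-indexed) with page size limit starts at offset n * limit. A one-line helper makes the distinction explicit:

```python
# $offset is a ROW offset, not a page number.
def page_offset(page, limit=1000):
    """Row offset for zero-indexed page number `page` at page size `limit`."""
    return page * limit

print(page_offset(0))  # 0    -> rows 1..1000
print(page_offset(1))  # 1000 -> rows 1001..2000
print(page_offset(2))  # 2000 -> rows 2001..3000
```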


Original Journal


In [30]:
from __future__ import division, print_function # always future proof for python3
import pandas as pd

In [36]:
# what is meant to be the most recent 1,000 rows, ordered by close date-time descending
df = pd.read_json('https://data.sfgov.org/resource/wbb6-uh78.json?$order=close_dttm%20DESC&$offset=0&$limit=1000')

In [37]:
df.close_dttm.min()


Out[37]:
u'2016-06-24T16:28:44.000'

In [38]:
df.close_dttm.max()


Out[38]:
u'2016-07-10T22:08:15.000'

With the above dates, my next step was to apply an offset and try again. I expected the next 1,000 rows to have dates entirely outside the range above: all greater than the max or all less than the min.


In [39]:
# and here is what I assumed would be the next 1,000 rows
df = pd.read_json('https://data.sfgov.org/resource/wbb6-uh78.json?$order=close_dttm%20DESC&$offset=1&$limit=1000')

In [40]:
df.close_dttm.min()


Out[40]:
u'2016-06-24T16:05:07.000'

In [41]:
df.close_dttm.max()


Out[41]:
u'2016-07-10T19:52:41.000'

As you can see above, the results are slightly different from the first query, but still within the same date range. Clearly I was doing something wrong.

The Solution

After some back and forth with @chrismetcalf on the IRC channel for Socrata API help, it was revealed that my error was really, really basic. I was treating $offset as a page number, when it is (as the name says) just a row offset. So if I want the next 1,000 items, I need an offset of 1,000; the page after that has an offset of 2,000.

It was so simple!

OK, back to work, but first here is the solution correctly implemented:

(notice how df2's min and max both fall at or before the minimum of the first df)


In [46]:
df = pd.read_json('https://data.sfgov.org/resource/wbb6-uh78.json?$order=close_dttm%20DESC&$offset=0&$limit=1000')

In [47]:
print(df.close_dttm.min())
print(df.close_dttm.max())


2016-06-24T16:28:44.000
2016-07-10T22:08:15.000

In [48]:
df2 = pd.read_json('https://data.sfgov.org/resource/wbb6-uh78.json?$order=close_dttm%20DESC&$offset=1000&$limit=1000')

In [49]:
print(df2.close_dttm.min())
print(df2.close_dttm.max())


2016-06-10T11:37:39.000
2016-06-24T16:05:07.000
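The fix generalizes to a simple paging loop. Below is a hedged sketch over the same dataset: the `page_url` helper and the `max_pages` safety cap are my additions, not part of the API, and the loop stops when a page comes back shorter than the limit (i.e. the last page).

```python
import pandas as pd

BASE = ('https://data.sfgov.org/resource/wbb6-uh78.json'
        '?$order=close_dttm%20DESC&$offset={offset}&$limit={limit}')

def page_url(page, limit=1000):
    """URL for zero-indexed page `page`; the offset is page * limit, in rows."""
    return BASE.format(offset=page * limit, limit=limit)

def fetch_all(limit=1000, max_pages=10):
    """Fetch consecutive pages until a short page signals the end of the data."""
    frames = []
    for page in range(max_pages):
        df = pd.read_json(page_url(page, limit))
        frames.append(df)
        if len(df) < limit:
            break
    return pd.concat(frames, ignore_index=True)
```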
