The purpose of this notebook is to clearly demonstrate a problem I ran into with $offset, and to try to resolve it.
By default, querying the public API of data.sfgov.org will only return the first 1,000 rows of data. That's fine, and good practice, but it should then be easy to paginate through the rest without getting duplicate data.
According to the API documentation, it's important to use $order when using $offset, so that the data is returned in the same order each time.
What's bizarre is that any attempt to order by a specific column and then request the next offset doesn't necessarily give you the next batch.
The API documentation (https://dev.socrata.com/docs/paging.html) cautions that $order is needed for paging:

"Heads Up! The order of the results of a query are not implicitly ordered, so if you're paging, make sure you provide an $order clause or at a minimum $order=:id. That will guarantee that the order of your results will be stable as you page through the dataset."
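For reference, that recommended ordering is just one more query parameter. A stably-ordered request against this dataset might look like the sketch below (using the :id system field the docs mention; I haven't verified how it behaves on this particular dataset):

import pandas as pd

# order by the :id system field so paging stays stable even if close_dttm has ties
df = pd.read_json('https://data.sfgov.org/resource/wbb6-uh78.json?$order=:id&$offset=0&$limit=1000')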
The problem, though, is that the results do not appear to be stable. As shown below, the steps to reproduce the issue are: query with an $order clause and $offset=0, note the close_dttm range returned, then query again with a different offset and compare.
I was in touch with someone at Socrata on the IRC channel, and here's what came up:
https://data.sfgov.org/resource/wbb6-uh78.json?$order=close_dttm%20DESC&$offset=0&$limit=3000
That’d get all 3000
If you want to get them in pages, change $limit and $offset
...and the problem was still there... The offset still only retrieved 3,000 rows within the same close_dttm date range, 5/27 through 7/10.
I'm an idiot. The reason it acts like this is that the offset should be at 3,000: it is NOT a page number!
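In other words (my own restatement of the fix): $offset counts rows to skip, so with a page size of 1,000 the pages begin at offsets 0, 1000, 2000, and so on:

limit = 1000
# $offset is a row count, not a page number: page n starts n * limit rows in
offsets = [page * limit for page in range(4)]
print(offsets)  # [0, 1000, 2000, 3000]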
Original Journal
In [30]:
from __future__ import division, print_function # always future proof for python3
import pandas as pd
In [36]:
# here is what should be the most recent 1,000 rows, ordered by close date/time descending
df = pd.read_json('https://data.sfgov.org/resource/wbb6-uh78.json?$order=close_dttm%20DESC&$offset=0&$limit=1000')
In [37]:
df.close_dttm.min()
Out[37]:
In [38]:
df.close_dttm.max()
Out[38]:
With the above dates in hand, my next step was to set an offset and try again. What I expected was that the next 1,000 rows would fall outside that range, with dates either all greater than the max or all less than the min above.
In [39]:
# and here is what I expected to be the next 1,000 rows
df = pd.read_json('https://data.sfgov.org/resource/wbb6-uh78.json?$order=close_dttm%20DESC&$offset=1&$limit=1000')
In [40]:
df.close_dttm.min()
Out[40]:
In [41]:
df.close_dttm.max()
Out[41]:
As you can see above, the results are slightly different from the first batch, but still within the same date range. This showed I was doing something really wrong.
After some back and forth with @chrismetcalf on the IRC channel for Socrata API help, it was revealed that my error was really, really basic. I was treating offset as a page number, when it (strangely enough) is just an offset: a count of rows to skip. So if I want the next 1,000 items, I need the offset to be 1,000, and the page after that starts at an offset of 2,000.
It was so simple!
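To make the arithmetic concrete, here is a tiny helper (hypothetical, my own naming, not part of the Socrata API) that turns a zero-based page number into a query URL:

def page_url(page, limit=1000):
    # hypothetical helper: zero-based page number -> Socrata query URL
    base = 'https://data.sfgov.org/resource/wbb6-uh78.json'
    return '{}?$order=close_dttm%20DESC&$offset={}&$limit={}'.format(
        base, page * limit, limit)

print(page_url(0))  # offset 0
print(page_url(1))  # offset 1000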
OK, back to work, but first here is the solution correctly implemented:
(notice how the df2 min and max both fall before the minimum of the first df)
In [46]:
df = pd.read_json('https://data.sfgov.org/resource/wbb6-uh78.json?$order=close_dttm%20DESC&$offset=0&$limit=1000')
In [47]:
print(df.close_dttm.min())
print(df.close_dttm.max())
In [48]:
df2 = pd.read_json('https://data.sfgov.org/resource/wbb6-uh78.json?$order=close_dttm%20DESC&$offset=1000&$limit=1000')
In [49]:
print(df2.close_dttm.min())
print(df2.close_dttm.max())
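As a quick sanity check (my own addition, not part of the original run): with a descending sort, nothing in df2 should be newer than the oldest row of df, so the following should hold (<= rather than < to allow ties at the page boundary):

# page 2 must not contain anything newer than page 1's oldest row
assert df2.close_dttm.max() <= df.close_dttm.min()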
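Finally, for completeness, a sketch of walking the entire dataset page by page (untested against the live endpoint; it assumes the API returns an empty list once the offset passes the last row):

import pandas as pd

base = 'https://data.sfgov.org/resource/wbb6-uh78.json'
limit = 1000
offset = 0
pages = []
while True:
    url = '{}?$order=close_dttm%20DESC&$offset={}&$limit={}'.format(base, offset, limit)
    page = pd.read_json(url)
    if page.empty:  # an empty page means we have walked past the last row
        break
    pages.append(page)
    offset += limit  # advance by the page size, not by a page number
all_rows = pd.concat(pages, ignore_index=True) if pages else pd.DataFrame()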