Implementing the method described in “Joseph, K., Tan, C. H., & Carley, K. M. (2012). Beyond “Local”, “Categories” and “Friends”: Clustering Foursquare Users with Latent “Topics.” UbiComp'12 (pp. 919–926). doi:10.1145/2370216.2370422”


In [1]:
from pymongo import MongoClient
from collections import defaultdict
import numpy as np
import scipy.io as sio
import scipy.sparse as sp
import gensim
import pandas as pd
import sys
sys.path.append('..')
import persistent as p
from scipy.spatial import ConvexHull
import shapely.geometry as geom
import folium

Pull the venues out of the DB


In [2]:
cl = MongoClient()
db = cl.combined
fullcity, city = 'San Francisco', 'sanfrancisco'
scaler = p.load_var('../sandbox/paper-models/{}.scaler'.format(city))
venue_infos = {}
for venue in db.venues.find({ "bboxCity": fullcity}):
    venue_infos[venue['_id']] = (venue['coordinates'], venue['name'], None if len(venue['categories']) == 0 else venue['categories'][0])


Read checkins to start building the venues×users matrix


In [3]:
user_visits = defaultdict(lambda : defaultdict(int))
venue_visitors = defaultdict(lambda : defaultdict(int))

for vid in venue_infos:
    for checkin in  db.checkins.find({"venueId": vid}):
        uid = checkin['foursquareUserId']
        user_visits[uid][vid] += 1
        venue_visitors[vid][uid] += 1

user_visits = {us: {vn: c for vn, c in usval.items()} for us, usval in user_visits.items()}
venue_visitors = {us: {vn: c for vn, c in usval.items()} for us, usval in venue_visitors.items()}

Then apply the pruning described in the paper

In addition, we remove those users with less than 5 unique venue check ins and those venues with less than 10 check ins, repeating the pruning iteratively until all such venues and users are removed.


In [4]:
still_work_todo = True
print(len(user_visits), len(venue_visitors))
while still_work_todo:
    users_to_remove = set()
    new_users = {}
    for u, vst in user_visits.items():
        rvst = {vid: num for vid, num in vst.items() if vid in venue_visitors}
        # keep only users with at least 5 distinct venues
        num_visit = len(rvst)
        if num_visit < 5:
            users_to_remove.add(u)
        else:
            new_users[u] = dict(rvst)
    #print(len(users_to_remove), len(new_users))
    venues_to_remove = set()
    new_venues = {}
    for u, vst in venue_visitors.items():
        rvst = {user: num for user, num in vst.items() if user in new_users}
        # keep only venues with at least 10 check-ins from the remaining users
        num_visit = sum(rvst.values())
        if num_visit < 10:
            venues_to_remove.add(u)
        else:
            new_venues[u] = dict(rvst)
    #print(len(venues_to_remove), len(new_venues))
    user_visits = new_users
    venue_visitors = new_venues
    still_work_todo = len(users_to_remove) > 0 or len(venues_to_remove) > 0
print('At the end of the pruning, we are left with {} unique users who have made {} checkins in {} unique venues'.format(len(user_visits), sum(map(len, user_visits.values())), len(venue_visitors)))


20443 22376
At the end of the pruning, we are left with 8989 unique users who have made 214583 checkins in 6198 unique venues
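
As a quick sanity check (not in the original notebook), the fixed point reached by this pruning loop should satisfy both thresholds simultaneously:

assert all(len(vst) >= 5 for vst in user_visits.values())               # at least 5 distinct venues per user
assert all(sum(vst.values()) >= 10 for vst in venue_visitors.values())  # at least 10 check-ins per venue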

Build the corpus using gensim. Each user is a “document” represented by a bag of words, where the “words” are venues.

I also remove a few venues appearing too often (in more than 8% of the total number of users, like railway stations), but maybe that's not so smart… To quote the paper:

In the text modeling domain, “stop words” are often removed, words such as “a” and “the” which are highly frequent. It might behoove a model of places to do the same. However, while it might make sense to remove uninteresting places such as airports and bus stations, it is unclear if popular places representative of interests, like stadiums, should really be removed. While we considered this avenue, we did not obtain rigorous findings in this direction.


In [5]:
sorted_venues = {v: i for i, v in enumerate(sorted(venue_visitors))}
sorted_users = sorted(user_visits)
# turn a user's visit counts into a "document": each venue id is repeated once per check-in
write_doc = lambda vst: [v for v, c in vst.items() for _ in range(c)]

texts = [write_doc(user_visits[uid]) for uid in sorted_users]

dictionary = gensim.corpora.Dictionary(texts)
print(dictionary)
dictionary.filter_extremes(no_below=0, no_above=.08, keep_n=None)
print(dictionary)
corpus = [dictionary.doc2bow(text) for text in texts]
gensim.corpora.MmCorpus.serialize('{}_corpus.mm'.format(city), corpus)
dictionary.save('{}_venues_dict.dict'.format(city))


Dictionary(6198 unique tokens: ['4a898355f964a520350820e3', '4ac6b79af964a520fab520e3', '4b76efbff964a520406c2ee3', '40b52f80f964a52061001fe3', '4a6265bcf964a520f2c31fe3']...)
Dictionary(6191 unique tokens: ['4a898355f964a520350820e3', '4ac6b79af964a520fab520e3', '49f226d4f964a520026a1fe3', '4b76efbff964a520406c2ee3', '40b52f80f964a52061001fe3']...)
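
Out of curiosity (this is not part of the original analysis), here is a small sketch to see which venues the 8% cut-off actually removed; it rebuilds a fresh Dictionary so the already-filtered one above is left untouched:

full_dict = gensim.corpora.Dictionary(texts)
cutoff = 0.08 * full_dict.num_docs
for tid, df in full_dict.dfs.items():
    if df > cutoff:
        vid = full_dict[tid]
        cat = venue_infos[vid][2]
        print(venue_infos[vid][1], '/', cat['name'] if cat else 'no category', '({} users)'.format(df))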

They don't mention anything about tf-idf weighting, so let's not apply it.
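
For reference only, here is a sketch of what tf-idf weighting would look like with gensim if we did want it; it is not applied anywhere below:

tfidf = gensim.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]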

On the other hand, the paper does give some indication about the number of topics:

In the case studies below, we set the number of hidden topics to be twenty. We complete sensitivity tests, as suggested in [3], and find that our model is most effective and most interpretable when we use twenty clusters
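
The sensitivity tests themselves are not reproduced here; below is a rough sketch of how one could compare a few topic counts with gensim's held-in per-word perplexity bound (indicative only, not the paper's procedure):

for k in (10, 20, 30):
    m = gensim.models.ldamulticore.LdaMulticore(corpus, id2word=dictionary, workers=10, num_topics=k)
    print(k, m.log_perplexity(corpus))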


In [6]:
%time model = gensim.models.ldamulticore.LdaMulticore(corpus, id2word=dictionary, workers=10, num_topics=20)


CPU times: user 5.92 s, sys: 324 ms, total: 6.25 s
Wall time: 6.04 s

Display the top venues of some topics, keeping in mind that there is no topic ordering

Unlike LSA, there is no natural ordering between the topics in LDA. The returned subset of all topics is therefore arbitrary and may change between two LDA training runs.


In [7]:
venues_per_topic = 10
num_topics = 5  # only display the first 5 of the 20 fitted topics
# two-level row index: (topic label, rank of the venue within that topic)
index_name = [['Topic {}'.format(i+1) for i in range(num_topics) for _ in range(venues_per_topic)],
              [str(_) for i in range(num_topics) for _ in range(venues_per_topic)]]

res =[]

for topic in range(num_topics):
    for vidx, weight in model.get_topic_terms(topic, venues_per_topic):
        vid = dictionary.id2token[vidx]
        name = venue_infos[vid][1]
        link = 'https://foursquare.com/v/'+vid
        cat = venue_infos[vid][2]['name']
        res.append([name, cat, weight, link])

pd.DataFrame(res, index=index_name, columns=['Venue', 'Category', 'Weight', 'URL'])


Out[7]:
Venue Category Weight URL
Topic 1 0 Montgomery St. BART Station Metro Station 0.014999 https://foursquare.com/v/455f77abf964a520903d1fe3
1 Fitness SF Castro Gym 0.007627 https://foursquare.com/v/464f17daf964a520c5461fe3
2 Blue Bottle Coffee Coffee Shop 0.006722 https://foursquare.com/v/49d68c61f964a520e65c1fe3
3 Salesforce Office 0.006557 https://foursquare.com/v/4abbbc4ff964a5209f8420e3
4 Blue Bottle Coffee Coffee Shop 0.004590 https://foursquare.com/v/49ca8f4df964a520b9581fe3
5 The Lusty Lady Strip Club 0.004576 https://foursquare.com/v/4456128ff964a520ce321fe3
6 Tapjoy Inc. Tech Startup 0.003762 https://foursquare.com/v/4d35dc3f6c7c721e33e3ce56
7 Powell St. BART Station Metro Station 0.003676 https://foursquare.com/v/455f7871f964a520913d1fe3
8 Twitter HQ Office 0.003603 https://foursquare.com/v/4ee0ecde29c2c6e332924109
9 WeWork SOMA Coworking Space 0.003601 https://foursquare.com/v/4e7b95099a52e6aecea6b40b
Topic 2 0 Westfield San Francisco Centre Mall 0.005822 https://foursquare.com/v/452b81ddf964a520393b1fe3
1 Sumazi.com World HQ Tech Startup 0.005731 https://foursquare.com/v/5130e938e4b0a24f7c6509c2
2 Vara Apartments Residential Building (Apartment / Condo) 0.005404 https://foursquare.com/v/51ec36f0498e2020c2740167
3 The Amazing Sumazi.com HQ Tech Startup 0.005277 https://foursquare.com/v/4da4b6dab521224b708938ee
4 Century San Francisco Centre 9 & XD Multiplex 0.004943 https://foursquare.com/v/454cf014f964a520c53c1fe3
5 The Corner Studio Gym / Fitness Center 0.004633 https://foursquare.com/v/4c1bcdf3b306c92849d162b7
6 Pier 39 Pier 0.004469 https://foursquare.com/v/409d7480f964a520f2f21ee3
7 24 Hour Fitness Gym / Fitness Center 0.003737 https://foursquare.com/v/4a53a9e0f964a52096b21fe3
8 AMC Van Ness 14 Multiplex 0.003704 https://foursquare.com/v/4390a026f964a5204d2b1fe3
9 24 Hour Fitness Gym / Fitness Center 0.003499 https://foursquare.com/v/4be612cecf200f479a31143c
Topic 3 0 24 Hour Fitness Gym / Fitness Center 0.008069 https://foursquare.com/v/4acfadeef964a52049d520e3
1 Rogue Ales Public House Bar 0.006819 https://foursquare.com/v/43c453c6f964a520582d1fe3
2 Westfield San Francisco Centre Mall 0.006427 https://foursquare.com/v/452b81ddf964a520393b1fe3
3 24 Hour Fitness Gym / Fitness Center 0.005983 https://foursquare.com/v/4a53a9e0f964a52096b21fe3
4 Butter Bar 0.005948 https://foursquare.com/v/410c3280f964a520ba0b1fe3
5 Montgomery St. BART Station Metro Station 0.004792 https://foursquare.com/v/455f77abf964a520903d1fe3
6 Twitter HQ Office 0.004600 https://foursquare.com/v/4ee0ecde29c2c6e332924109
7 Hult International Business School University 0.004533 https://foursquare.com/v/4e1b9be71850caeb9c9aef56
8 AMC Van Ness 14 Multiplex 0.004467 https://foursquare.com/v/4390a026f964a5204d2b1fe3
9 U.S. Environmental Protection Agency (EPA) Government Building 0.004187 https://foursquare.com/v/4af9f19ff964a520751522e3
Topic 4 0 Whole Foods Market Grocery Store 0.018729 https://foursquare.com/v/46002d20f964a52093441fe3
1 San Francisco-Oakland Bay Bridge Bridge 0.011451 https://foursquare.com/v/4a71e4cff964a520ccd91fe3
2 Louise M. Davies Symphony Hall Concert Hall 0.004991 https://foursquare.com/v/4aa48566f964a520024720e3
3 Moscone West Convention Center 0.004626 https://foursquare.com/v/43c52dc7f964a520672d1fe3
4 Deddy's home Home (private) 0.004328 https://foursquare.com/v/4e50a3328877402b06d5b89a
5 24 Hour Fitness Gym / Fitness Center 0.004282 https://foursquare.com/v/4a53a9e0f964a52096b21fe3
6 Adobe Office 0.003775 https://foursquare.com/v/49b9cab2f964a52052531fe3
7 CBS Interactive Office 0.003599 https://foursquare.com/v/453faf95f964a520383c1fe3
8 21st Amendment Brewery & Restaurant Brewery 0.003588 https://foursquare.com/v/42af6f80f964a5205e251fe3
9 Midnight Sun Gay Bar 0.003570 https://foursquare.com/v/42911d00f964a5200f241fe3
Topic 5 0 Fitness SF Castro Gym 0.034896 https://foursquare.com/v/464f17daf964a520c5461fe3
1 FITNESS SF SoMa Gym / Fitness Center 0.008577 https://foursquare.com/v/4a4dce3ef964a52010ae1fe3
2 Sundance Kabuki Cinemas Movie Theater 0.008337 https://foursquare.com/v/444df333f964a5208d321fe3
3 Whole Foods Market Grocery Store 0.006552 https://foursquare.com/v/46002d20f964a52093441fe3
4 Candlestick Park Football Stadium 0.006540 https://foursquare.com/v/430e5b80f964a52044271fe3
5 Louise M. Davies Symphony Hall Concert Hall 0.005727 https://foursquare.com/v/4aa48566f964a520024720e3
6 City of San Francisco City 0.004761 https://foursquare.com/v/4c82f252d92ea09323185072
7 Lookout Gay Bar 0.004323 https://foursquare.com/v/4735a4c2f964a520344c1fe3
8 Cabin Bar 0.003919 https://foursquare.com/v/51dee1f0498e40e6123add13
9 Cathay Pacific Office 0.003681 https://foursquare.com/v/4e80a0df61af5299ac9bbed1

How many venues are needed in each topic to reach 15% of the probability mass?


In [8]:
# full probability distribution of each of the 20 topics over all venues, sorted by decreasing weight
weights = np.array([[_[1] for _ in model.get_topic_terms(i, 500000)] for i in range(20)])

In [9]:
# index of the first venue at which a topic's cumulative probability mass exceeds 15%
top_venues_per_topic = (weights.cumsum(-1) > .15).argmax(1)
top_venues_per_topic


Out[9]:
array([54, 59, 47, 42, 34, 57, 53, 36, 33, 49, 60, 50, 54, 49, 54, 60, 45, 46, 25, 56])

In [14]:
mf = folium.Map(location=[37.76,-122.47])
feats=[]
for topic, num_venues in enumerate(top_venues_per_topic):
    pts=[]
    for vidx, _ in model.get_topic_terms(topic, num_venues):
        vid = dictionary.id2token[vidx]
        pts.append(venue_infos[vid][0])
    pts = np.array(pts)

    # convex hull computed on the scaled coordinates, then mapped back to the raw points
    spts = scaler.transform(pts)

    hull = pts[ConvexHull(spts).vertices, :]

    geojson_geo = geom.mapping(geom.Polygon(hull))
    feats.append({ "type": "Feature", "geometry": geojson_geo, "properties": {"fill": "#BB900B"}})

_=folium.GeoJson({"type": "FeatureCollection", "features": feats},
                 style_function=lambda x: {
                     'opacity': 0.2,
                     'fillColor': x['properties']['fill'],
                 }).add_to(mf)

Although their method is nice, the main problem is that, by disregarding spatial information, it becomes very difficult to compare with ours.

Here I tried to plot the convex hull of the top venues of each topic, and it basically covers the whole city, so…

They justify it as follows:

While the dataset gives a diverse set of information, we describe each user simply by the places they go and how often they go there, thus choosing to ignore geospatial and social information which exists in the data. In addition, we ignore information on the category of different places, as explained in later section.

In addition, by not specifying any presumed factors to be responsible for similar check in locations between users, we avoid restrictions of the types of groups our model might find. For example, explicitly using geo-spatial features may restrict our ability to understand groups of users with similar interests which are spread throughout a city, such as the tourists described above.

Another limitation they mention, which we partially lift, is the temporal aspect:

One clear limitation of our model - by ignoring temporal information in the data, we assume that groupings of users (and thus the factors affecting their check in behaviors) are heavily static, which is likely not the case. Topic models which consider temporal information, such as periodicity [25], may be able to garner interesting clusters over time.


In [15]:
mf


Out[15]:

In case the live JavaScript map doesn't show up, below is a static screenshot of it…
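
To keep an offline copy of the interactive map itself, folium can also write it out as a standalone HTML file; a one-line sketch (the filename is illustrative):

mf.save('sf_ref44_map.html')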


In [18]:
from IPython.display import Image

In [19]:
Image('sf_ref44_static.png')


Out[19]: