notebook.community

Edit and run

Implementing the method described in “Joseph, K., Tan, C. H., & Carley, K. M. (2012). Beyond “Local”, “Categories” and “Friends”: Clustering Foursquare Users with Latent “Topics.” UbiComp'12 (pp. 919–926). doi:10.1145/2370216.2370422”



In [1]:

    
from pymongo import MongoClient
from collections import defaultdict
import scipy.io as sio
import scipy.sparse as sp
import gensim
import pandas as pd
import sys
sys.path.append('..')
import persistent as p
from scipy.spatial import ConvexHull
import shapely.geometry as geom
import folium

Pull the venues out of the DB



In [2]:

    
cl = MongoClient()
db = cl.combined
fullcity, city = 'San Francisco', 'sanfrancisco'
scaler = p.load_var('../sandbox/paper-models/{}.scaler'.format(city))
venue_infos = {}
for venue in db.venues.find({ "bboxCity": fullcity}):
    venue_infos[venue['_id']] = (venue['coordinates'], venue['name'], None if len(venue['categories']) == 0 else venue['categories'][0])









    



/home/chercheur/geraud/venvs/chr351/lib/python3.5/site-packages/sklearn/utils/fixes.py:55: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
  if 'order' in inspect.getargspec(np.copy)[0]:

Read checkins to start building the venues×users matrix



In [3]:

    
user_visits = defaultdict(lambda : defaultdict(int))
venue_visitors = defaultdict(lambda : defaultdict(int))

for vid in venue_infos:
    for checkin in  db.checkins.find({"venueId": vid}):
        uid = checkin['foursquareUserId']
        user_visits[uid][vid] += 1
        venue_visitors[vid][uid] += 1

user_visits = {us: {vn: c for vn, c in usval.items()} for us, usval in user_visits.items()}
venue_visitors = {us: {vn: c for vn, c in usval.items()} for us, usval in venue_visitors.items()}

Then apply the pruning described in the paper

In addition, we remove those users with less than 5 unique venue check ins and those venues with less than 10 check ins, repeating the pruning iteratively until all such venues and users are removed.



In [4]:

    
still_work_todo = True
print(len(user_visits), len(venue_visitors))
while still_work_todo:
    users_to_remove = set()
    new_users = {}
    for u, vst in user_visits.items():
        rvst = {vid: num for vid, num in vst.items() if vid in venue_visitors}
        num_visit = len(rvst.values())
        if num_visit < 5:
            users_to_remove.add(u)
        else:
            new_users[u] = dict(rvst)
    #print(len(users_to_remove), len(new_users))
    venues_to_remove = set()
    new_venues = {}
    for u, vst in venue_visitors.items():
        rvst = {user: num for user, num in vst.items() if user in new_users}
        num_visit = sum(rvst.values())
        if num_visit < 10:
            venues_to_remove.add(u)
        else:
            new_venues[u] = dict(rvst)
    #print(len(venues_to_remove), len(new_venues))
    user_visits = new_users
    venue_visitors = new_venues
    still_work_todo = len(users_to_remove) > 0 or len(venues_to_remove) > 0
print('At the end of the pruning, we are left with {} unique users who have made {} checkins in {} unique venues'.format(len(user_visits), sum(map(len, user_visits.values())), len(venue_visitors)))









    



20443 22376
At the end of the pruning, we are left with 8989 unique users who have made 214583 checkins in 6198 unique venues

Build the corpus using gensim. There, each user is a “document” represented by a bag of words, where “words” are venues.

I also remove a few venues appearing to often (in more than 8% of the total number of users, like railway station) but maybe that's no so smart… To quote the paper:

In the text modeling domain, “stop words” are often removed, words such as “a” and “the” which are highly frequent. It might behoove a model of places to do the same. However, while it might make sense to remove uninteresting places such as airports and bus stations, it is unclear if popular places representative of interests, like stadiums, should really be removed. While we considered this avenue, we did not obtain rigorous findings in this direction.



In [5]:

    
sorted_venues = {v: i for i, v in enumerate(sorted(venue_visitors))}
sorted_users = sorted(user_visits) 
write_doc = lambda vst: [v for v, c in vst.items() for _ in range(c)]

texts = [write_doc(user_visits[uid]) for uid in sorted_users]

dictionary = gensim.corpora.Dictionary(texts)
print(dictionary)
dictionary.filter_extremes(no_below=0, no_above=.08, keep_n=None)
print(dictionary)
corpus = [dictionary.doc2bow(text) for text in texts]
gensim.corpora.MmCorpus.serialize('{}_corpus.mm'.format(city), corpus)
dictionary.save('{}_venues_dict.dict'.format(city))









    



Dictionary(6198 unique tokens: ['4a898355f964a520350820e3', '4ac6b79af964a520fab520e3', '4b76efbff964a520406c2ee3', '40b52f80f964a52061001fe3', '4a6265bcf964a520f2c31fe3']...)
Dictionary(6191 unique tokens: ['4a898355f964a520350820e3', '4ac6b79af964a520fab520e3', '49f226d4f964a520026a1fe3', '4b76efbff964a520406c2ee3', '40b52f80f964a52061001fe3']...)

They don't mention anything about tf-idf so let's not do it

On the other hand, there is some indication for the number of topics

In the case studies below, we set the number of hidden topics to be twenty. We complete sensitivity tests, as suggested in [3], and find that our model is most effective and most interpretable when we use twenty clusters



In [6]:

    
%time model = gensim.models.ldamulticore.LdaMulticore(corpus, id2word=dictionary, workers=10, num_topics=20)









    



CPU times: user 5.92 s, sys: 324 ms, total: 6.25 s
Wall time: 6.04 s

Display the top venues in some topic, keeping in mind there's not topic ordering

Unlike LSA, there is no natural ordering between the topics in LDA. The returned subset of all topics is therefore arbitrary and may change between two LDA training runs.



In [7]:

    
venues_per_topic = 10
num_topics = 5
index_name = [['Topic {}'.format(i+1) for i in range(num_topics) for _ in range(venues_per_topic) ],
          [str(_) for i in range(num_topics) for _ in range(venues_per_topic) ]]

res =[]

for topic in range(num_topics):
    for vidx, weight in model.get_topic_terms(topic, venues_per_topic):
        vid = dictionary.id2token[vidx]
        name = venue_infos[vid][1]
        link = 'https://foursquare.com/v/'+vid
        cat = venue_infos[vid][2]['name']
        res.append([name, cat, weight, link])

pd.DataFrame(res, index=index_name, columns=['Venue', 'Category', 'Weight', 'URL'])









    Out[7]:






  
    
      
      
      Venue
      Category
      Weight
      URL
    
  
  
    
      Topic 1
      0
      Montgomery St. BART Station
      Metro Station
      0.014999
      https://foursquare.com/v/455f77abf964a520903d1fe3
    
    
      1
      Fitness SF Castro
      Gym
      0.007627
      https://foursquare.com/v/464f17daf964a520c5461fe3
    
    
      2
      Blue Bottle Coffee
      Coffee Shop
      0.006722
      https://foursquare.com/v/49d68c61f964a520e65c1fe3
    
    
      3
      Salesforce
      Office
      0.006557
      https://foursquare.com/v/4abbbc4ff964a5209f8420e3
    
    
      4
      Blue Bottle Coffee
      Coffee Shop
      0.004590
      https://foursquare.com/v/49ca8f4df964a520b9581fe3
    
    
      5
      The Lusty Lady
      Strip Club
      0.004576
      https://foursquare.com/v/4456128ff964a520ce321fe3
    
    
      6
      Tapjoy Inc.
      Tech Startup
      0.003762
      https://foursquare.com/v/4d35dc3f6c7c721e33e3ce56
    
    
      7
      Powell St. BART Station
      Metro Station
      0.003676
      https://foursquare.com/v/455f7871f964a520913d1fe3
    
    
      8
      Twitter HQ
      Office
      0.003603
      https://foursquare.com/v/4ee0ecde29c2c6e332924109
    
    
      9
      WeWork SOMA
      Coworking Space
      0.003601
      https://foursquare.com/v/4e7b95099a52e6aecea6b40b
    
    
      Topic 2
      0
      Westfield San Francisco Centre
      Mall
      0.005822
      https://foursquare.com/v/452b81ddf964a520393b1fe3
    
    
      1
      Sumazi.com World HQ
      Tech Startup
      0.005731
      https://foursquare.com/v/5130e938e4b0a24f7c6509c2
    
    
      2
      Vara Apartments
      Residential Building (Apartment / Condo)
      0.005404
      https://foursquare.com/v/51ec36f0498e2020c2740167
    
    
      3
      The Amazing Sumazi.com HQ
      Tech Startup
      0.005277
      https://foursquare.com/v/4da4b6dab521224b708938ee
    
    
      4
      Century San Francisco Centre 9 & XD
      Multiplex
      0.004943
      https://foursquare.com/v/454cf014f964a520c53c1fe3
    
    
      5
      The Corner Studio
      Gym / Fitness Center
      0.004633
      https://foursquare.com/v/4c1bcdf3b306c92849d162b7
    
    
      6
      Pier 39
      Pier
      0.004469
      https://foursquare.com/v/409d7480f964a520f2f21ee3
    
    
      7
      24 Hour Fitness
      Gym / Fitness Center
      0.003737
      https://foursquare.com/v/4a53a9e0f964a52096b21fe3
    
    
      8
      AMC Van Ness 14
      Multiplex
      0.003704
      https://foursquare.com/v/4390a026f964a5204d2b1fe3
    
    
      9
      24 Hour Fitness
      Gym / Fitness Center
      0.003499
      https://foursquare.com/v/4be612cecf200f479a31143c
    
    
      Topic 3
      0
      24 Hour Fitness
      Gym / Fitness Center
      0.008069
      https://foursquare.com/v/4acfadeef964a52049d520e3
    
    
      1
      Rogue Ales Public House
      Bar
      0.006819
      https://foursquare.com/v/43c453c6f964a520582d1fe3
    
    
      2
      Westfield San Francisco Centre
      Mall
      0.006427
      https://foursquare.com/v/452b81ddf964a520393b1fe3
    
    
      3
      24 Hour Fitness
      Gym / Fitness Center
      0.005983
      https://foursquare.com/v/4a53a9e0f964a52096b21fe3
    
    
      4
      Butter
      Bar
      0.005948
      https://foursquare.com/v/410c3280f964a520ba0b1fe3
    
    
      5
      Montgomery St. BART Station
      Metro Station
      0.004792
      https://foursquare.com/v/455f77abf964a520903d1fe3
    
    
      6
      Twitter HQ
      Office
      0.004600
      https://foursquare.com/v/4ee0ecde29c2c6e332924109
    
    
      7
      Hult International Business School
      University
      0.004533
      https://foursquare.com/v/4e1b9be71850caeb9c9aef56
    
    
      8
      AMC Van Ness 14
      Multiplex
      0.004467
      https://foursquare.com/v/4390a026f964a5204d2b1fe3
    
    
      9
      U.S. Environmental Protection Agency (EPA)
      Government Building
      0.004187
      https://foursquare.com/v/4af9f19ff964a520751522e3
    
    
      Topic 4
      0
      Whole Foods Market
      Grocery Store
      0.018729
      https://foursquare.com/v/46002d20f964a52093441fe3
    
    
      1
      San Francisco-Oakland Bay Bridge
      Bridge
      0.011451
      https://foursquare.com/v/4a71e4cff964a520ccd91fe3
    
    
      2
      Louise M. Davies Symphony Hall
      Concert Hall
      0.004991
      https://foursquare.com/v/4aa48566f964a520024720e3
    
    
      3
      Moscone West
      Convention Center
      0.004626
      https://foursquare.com/v/43c52dc7f964a520672d1fe3
    
    
      4
      Deddy's home
      Home (private)
      0.004328
      https://foursquare.com/v/4e50a3328877402b06d5b89a
    
    
      5
      24 Hour Fitness
      Gym / Fitness Center
      0.004282
      https://foursquare.com/v/4a53a9e0f964a52096b21fe3
    
    
      6
      Adobe
      Office
      0.003775
      https://foursquare.com/v/49b9cab2f964a52052531fe3
    
    
      7
      CBS Interactive
      Office
      0.003599
      https://foursquare.com/v/453faf95f964a520383c1fe3
    
    
      8
      21st Amendment Brewery & Restaurant
      Brewery
      0.003588
      https://foursquare.com/v/42af6f80f964a5205e251fe3
    
    
      9
      Midnight Sun
      Gay Bar
      0.003570
      https://foursquare.com/v/42911d00f964a5200f241fe3
    
    
      Topic 5
      0
      Fitness SF Castro
      Gym
      0.034896
      https://foursquare.com/v/464f17daf964a520c5461fe3
    
    
      1
      FITNESS SF SoMa
      Gym / Fitness Center
      0.008577
      https://foursquare.com/v/4a4dce3ef964a52010ae1fe3
    
    
      2
      Sundance Kabuki Cinemas
      Movie Theater
      0.008337
      https://foursquare.com/v/444df333f964a5208d321fe3
    
    
      3
      Whole Foods Market
      Grocery Store
      0.006552
      https://foursquare.com/v/46002d20f964a52093441fe3
    
    
      4
      Candlestick Park
      Football Stadium
      0.006540
      https://foursquare.com/v/430e5b80f964a52044271fe3
    
    
      5
      Louise M. Davies Symphony Hall
      Concert Hall
      0.005727
      https://foursquare.com/v/4aa48566f964a520024720e3
    
    
      6
      City of San Francisco
      City
      0.004761
      https://foursquare.com/v/4c82f252d92ea09323185072
    
    
      7
      Lookout
      Gay Bar
      0.004323
      https://foursquare.com/v/4735a4c2f964a520344c1fe3
    
    
      8
      Cabin
      Bar
      0.003919
      https://foursquare.com/v/51dee1f0498e40e6123add13
    
    
      9
      Cathay Pacific
      Office
      0.003681
      https://foursquare.com/v/4e80a0df61af5299ac9bbed1

How much venues are needed in each topic to reach 15% of the probability mass?



In [8]:

    
weights = np.array([[_[1] for _ in model.get_topic_terms(i, 500000)] for i in range(20)])



In [9]:

    
top_venues_per_topic = (weights.cumsum(-1)>.15).argmax(1)
top_venues_per_topic









    Out[9]:





array([54, 59, 47, 42, 34, 57, 53, 36, 33, 49, 60, 50, 54, 49, 54, 60, 45, 46, 25, 56])



In [14]:

    
mf = folium.Map(location=[37.76,-122.47])
feats=[]
for topic, num_venues in enumerate(top_venues_per_topic):
    pts=[]
    for vidx, _ in model.get_topic_terms(topic, num_venues):
        vid = dictionary.id2token[vidx]
        pts.append(venue_infos[vid][0])
    pts = np.array(pts)

    spts = scaler.transform(pts)

    hull = pts[ConvexHull(spts).vertices, :]

    geojson_geo = geom.mapping(geom.Polygon(hull))
    feats.append({ "type": "Feature", "geometry": geojson_geo, "properties": {"fill": "#BB900B"}})

_=folium.GeoJson({"type": "FeatureCollection", "features": feats},
                 style_function=lambda x: {
                     'opacity': 0.2,
                     'fillColor': x['properties']['fill'],
                 }).add_to(mf)

Although their method is nice, the main problem is that disregarding spatial info makes it very difficult to compare with us.

Here I tried to plot the convex hull of the top venues of each topic and it basically cover the whole city so…

They justify it as follow:

While the dataset gives a diverse set of information, we describe each user simply by the places they go and how often they go there, thus choosing to ignore geospatial and social information which exists in the data. In addition, we ignore information on the category of different places, as explained in later section.

In addition, by not specifying any presumed factors to be responsible for similar check in locations between users, we avoid restrictions of the types of groups our model might find. For example, explicitly using geo-spatial features may restrict our ability to understand groups of users with similar interests which are spread throughout a city, such as the tourists described above.

Another limitation they give which we partially lift is temporal aspect:

One clear limitation of our model - by ignoring temporal information in the data, we assume that groupings of users (and thus the factors affecting their check in behaviors) are heavily static, which is likely not the case. Topic models which consider temporal information, such as periodicity [25], may be able to garner interesting clusters over time.



In [15]:

    
mf









    Out[15]:

In case the live javascript map doesn't show up, below is a static screenshot of it…



In [18]:

    
from IPython.display import Image



In [19]:

    
Image('sf_ref44_static.png')









    Out[19]:

		Venue	Category	Weight	URL
Topic 1	0	Montgomery St. BART Station	Metro Station	0.014999	https://foursquare.com/v/455f77abf964a520903d1fe3
	1	Fitness SF Castro	Gym	0.007627	https://foursquare.com/v/464f17daf964a520c5461fe3
	2	Blue Bottle Coffee	Coffee Shop	0.006722	https://foursquare.com/v/49d68c61f964a520e65c1fe3
	3	Salesforce	Office	0.006557	https://foursquare.com/v/4abbbc4ff964a5209f8420e3
	4	Blue Bottle Coffee	Coffee Shop	0.004590	https://foursquare.com/v/49ca8f4df964a520b9581fe3
	5	The Lusty Lady	Strip Club	0.004576	https://foursquare.com/v/4456128ff964a520ce321fe3
	6	Tapjoy Inc.	Tech Startup	0.003762	https://foursquare.com/v/4d35dc3f6c7c721e33e3ce56
	7	Powell St. BART Station	Metro Station	0.003676	https://foursquare.com/v/455f7871f964a520913d1fe3
	8	Twitter HQ	Office	0.003603	https://foursquare.com/v/4ee0ecde29c2c6e332924109
	9	WeWork SOMA	Coworking Space	0.003601	https://foursquare.com/v/4e7b95099a52e6aecea6b40b
Topic 2	0	Westfield San Francisco Centre	Mall	0.005822	https://foursquare.com/v/452b81ddf964a520393b1fe3
	1	Sumazi.com World HQ	Tech Startup	0.005731	https://foursquare.com/v/5130e938e4b0a24f7c6509c2
	2	Vara Apartments	Residential Building (Apartment / Condo)	0.005404	https://foursquare.com/v/51ec36f0498e2020c2740167
	3	The Amazing Sumazi.com HQ	Tech Startup	0.005277	https://foursquare.com/v/4da4b6dab521224b708938ee
	4	Century San Francisco Centre 9 & XD	Multiplex	0.004943	https://foursquare.com/v/454cf014f964a520c53c1fe3
	5	The Corner Studio	Gym / Fitness Center	0.004633	https://foursquare.com/v/4c1bcdf3b306c92849d162b7
	6	Pier 39	Pier	0.004469	https://foursquare.com/v/409d7480f964a520f2f21ee3
	7	24 Hour Fitness	Gym / Fitness Center	0.003737	https://foursquare.com/v/4a53a9e0f964a52096b21fe3
	8	AMC Van Ness 14	Multiplex	0.003704	https://foursquare.com/v/4390a026f964a5204d2b1fe3
	9	24 Hour Fitness	Gym / Fitness Center	0.003499	https://foursquare.com/v/4be612cecf200f479a31143c
Topic 3	0	24 Hour Fitness	Gym / Fitness Center	0.008069	https://foursquare.com/v/4acfadeef964a52049d520e3
	1	Rogue Ales Public House	Bar	0.006819	https://foursquare.com/v/43c453c6f964a520582d1fe3
	2	Westfield San Francisco Centre	Mall	0.006427	https://foursquare.com/v/452b81ddf964a520393b1fe3
	3	24 Hour Fitness	Gym / Fitness Center	0.005983	https://foursquare.com/v/4a53a9e0f964a52096b21fe3
	4	Butter	Bar	0.005948	https://foursquare.com/v/410c3280f964a520ba0b1fe3
	5	Montgomery St. BART Station	Metro Station	0.004792	https://foursquare.com/v/455f77abf964a520903d1fe3
	6	Twitter HQ	Office	0.004600	https://foursquare.com/v/4ee0ecde29c2c6e332924109
	7	Hult International Business School	University	0.004533	https://foursquare.com/v/4e1b9be71850caeb9c9aef56
	8	AMC Van Ness 14	Multiplex	0.004467	https://foursquare.com/v/4390a026f964a5204d2b1fe3
	9	U.S. Environmental Protection Agency (EPA)	Government Building	0.004187	https://foursquare.com/v/4af9f19ff964a520751522e3
Topic 4	0	Whole Foods Market	Grocery Store	0.018729	https://foursquare.com/v/46002d20f964a52093441fe3
	1	San Francisco-Oakland Bay Bridge	Bridge	0.011451	https://foursquare.com/v/4a71e4cff964a520ccd91fe3
	2	Louise M. Davies Symphony Hall	Concert Hall	0.004991	https://foursquare.com/v/4aa48566f964a520024720e3
	3	Moscone West	Convention Center	0.004626	https://foursquare.com/v/43c52dc7f964a520672d1fe3
	4	Deddy's home	Home (private)	0.004328	https://foursquare.com/v/4e50a3328877402b06d5b89a
	5	24 Hour Fitness	Gym / Fitness Center	0.004282	https://foursquare.com/v/4a53a9e0f964a52096b21fe3
	6	Adobe	Office	0.003775	https://foursquare.com/v/49b9cab2f964a52052531fe3
	7	CBS Interactive	Office	0.003599	https://foursquare.com/v/453faf95f964a520383c1fe3
	8	21st Amendment Brewery & Restaurant	Brewery	0.003588	https://foursquare.com/v/42af6f80f964a5205e251fe3
	9	Midnight Sun	Gay Bar	0.003570	https://foursquare.com/v/42911d00f964a5200f241fe3
Topic 5	0	Fitness SF Castro	Gym	0.034896	https://foursquare.com/v/464f17daf964a520c5461fe3
	1	FITNESS SF SoMa	Gym / Fitness Center	0.008577	https://foursquare.com/v/4a4dce3ef964a52010ae1fe3
	2	Sundance Kabuki Cinemas	Movie Theater	0.008337	https://foursquare.com/v/444df333f964a5208d321fe3
	3	Whole Foods Market	Grocery Store	0.006552	https://foursquare.com/v/46002d20f964a52093441fe3
	4	Candlestick Park	Football Stadium	0.006540	https://foursquare.com/v/430e5b80f964a52044271fe3
	5	Louise M. Davies Symphony Hall	Concert Hall	0.005727	https://foursquare.com/v/4aa48566f964a520024720e3
	6	City of San Francisco	City	0.004761	https://foursquare.com/v/4c82f252d92ea09323185072
	7	Lookout	Gay Bar	0.004323	https://foursquare.com/v/4735a4c2f964a520344c1fe3
	8	Cabin	Bar	0.003919	https://foursquare.com/v/51dee1f0498e40e6123add13
	9	Cathay Pacific	Office	0.003681	https://foursquare.com/v/4e80a0df61af5299ac9bbed1