GeosPy: Geolocation Inference Made Easy

"Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training - typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy."

Wikipedia on Semi-supervised learning

Semi-supervised Learning Example

In this example, we'll use the Backstrom model again.

We'll access the Backstrom model directly to view its default parameters, as described in the paper "Find Me If You Can: Improving Geographical Prediction with Social and Spatial Proximity".


In [1]:
from GeosPy.models import Backstrom
backstrom = Backstrom()
params = (backstrom.A, backstrom.B, backstrom.C)
print(params)


(0.0019000000320374966, 0.19599999487400055, -0.014999999664723873)

The parameters above were fitted empirically. Every dataset has a slightly different topology and therefore different optimal parameters. Consider the following dataset of user locations and friendships:


In [2]:
user_locations = {'Tyler': (42.4, -71.1), 'Lougee': (39.0, -105.0), 'Nate': (35.5, -98.0), 'Tim': (37.0, -120.0),
                  'Ryan': (35.4, -80.1), 'Conor': (47.5, -120.5), 'Sam': (44.0, -71.5)}

friendships = {'Tyler': ['Sam', 'Ryan'], 'Sam': ['Ryan', 'Tyler'], 'Conor': ['Tim', 'Lougee'],
               'Lougee': ['Conor', 'Nate'], 'Nate': ['Lougee', 'Ryan'], 'Ryan': ['Tyler', 'Sam', 'Nate'],
               'Tim': ['Conor']}
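
A hand-written friendship map like this is easy to get subtly wrong, for example with a bare string where a list of names belongs, or with a friend who has no entry in the locations dict. The check below is a minimal sketch of a sanity pass over such data. It assumes each value should be a list of names, as most entries in this dataset suggest. `check_friendships` and the sample data are hypothetical helpers for illustration, not part of GeosPy.

```python
def check_friendships(locations, friendships):
    """Return a list of problems found in a {user: [friends]} mapping."""
    problems = []
    for user, friends in friendships.items():
        # Iterating over a bare string yields characters, not names, so flag it.
        if isinstance(friends, str):
            problems.append(f"{user}: expected a list of names, got the string {friends!r}")
            continue
        for friend in friends:
            if friend not in locations:
                problems.append(f"{user}: friend {friend!r} has no entry in locations")
    return problems


# Hypothetical miniature dataset, not the one used in this notebook.
sample_locations = {'Ann': (42.4, -71.1), 'Bob': (39.0, -105.0)}
sample_friendships = {'Ann': ['Bob', 'Cy'], 'Bob': 'Ann'}

problems = check_friendships(sample_locations, sample_friendships)
for problem in problems:
    print(problem)
```

An empty list means no problems were found, so the same call can be run on `user_locations` and `friendships` before training.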

Here's a plot of the user locations to help visualize the dataset. The example attempts to cluster users by location.


In [3]:
import plotly
plotly.offline.init_notebook_mode()
import pandas as pd

# Columns are user names; row 0 holds latitudes and row 1 holds longitudes.
df = pd.DataFrame.from_dict(user_locations)

scl = [ [0,"rgb(5, 10, 172)"],[0.35,"rgb(40, 60, 190)"],[0.5,"rgb(70, 100, 245)"],[0.6,"rgb(90, 120, 245)"],
       [0.7,"rgb(106, 137, 247)"],[1,"rgb(220, 220, 220)"] ]

data = [ dict(
        type = 'scattergeo',
        locationmode = 'USA-states',
        lat = list(df.iloc[0]),
        lon = list(df.iloc[1]),
        text = list(df.columns.values),
        mode = 'markers',
        marker = dict( 
            size = 8, 
            opacity = 0.8,
            reversescale = True,
            autocolorscale = False,
            symbol = 'square',
            line = dict(
                width=1,
                color='rgb(102, 102, 102)'
            ),
#             colorscale = scl,
            cmin = 10
        ))]

layout = dict(
        title = 'Location of Users',
        colorbar = True,   
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showland = True,
            landcolor = "rgb(250, 250, 250)",
            subunitcolor = "rgb(217, 217, 217)",
            countrycolor = "rgb(217, 217, 217)",
            countrywidth = 0.5,
            subunitwidth = 0.5        
        ),
    )

fig = dict( data=data, layout=layout )
plotly.offline.iplot(fig, validate=False)



In [4]:
backstrom_trained = Backstrom().train(user_locations, friendships)
new_params = (backstrom_trained.A, backstrom_trained.B, backstrom_trained.C)
print([abs(param - new_param) for param, new_param in zip(params, new_params)])


[0.020721061620861292, 0.33885714411735535, 0.30336077604442835]

This shows training in action: it alters the internal parameters used to compute the likelihood of friendship given distance. The printed values are the absolute changes in A, B, and C relative to the defaults.
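
To get a feel for what such parameters control, here is a small sketch of a power-law "likelihood of friendship given distance" of the form a * (b + d) ** c, the kind of curve the Backstrom paper fits. Whether GeosPy's A, B, and C map onto a, b, and c in exactly this way is an assumption, and `friendship_probability` is a hypothetical helper rather than a GeosPy function. The "steeper" exponent is invented purely to show how changing one parameter reshapes the curve.

```python
def friendship_probability(distance, a, b, c):
    """Power-law decay a * (b + distance) ** c, with distance >= 0 and c < 0."""
    return a * (b + distance) ** c


# Rounded defaults printed in cell 1; mapping A, B, C onto a, b, c is an assumption.
default = dict(a=0.0019, b=0.196, c=-0.015)
# Hypothetical steeper exponent, chosen only to show the effect of changing c.
steeper = dict(a=0.0019, b=0.196, c=-0.3)

for d in (1, 10, 100, 1000):
    p_default = friendship_probability(d, **default)
    p_steeper = friendship_probability(d, **steeper)
    print(f"{d:>5}: default={p_default:.6f}  steeper={p_steeper:.6f}")
```

A more negative exponent makes the probability fall off faster with distance, which is the kind of adjustment training makes when a dataset's friendships are more (or less) local than the defaults assume.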

When using the GeosPy package, we usually don't need to access Backstrom directly; we can instead use the Geos wrapper that GeosPy provides.


In [5]:
from GeosPy import Geos
backstrom_geos = Geos('backstrom')
backstrom_geos_trained = backstrom_geos.train(user_locations, friendships)

In [6]:
# redefine the same user_locations as above
user_locations = {'Tyler': (42.4, -71.1), 'Lougee': (39.0, -105.0), 'Nate': (35.5, -98.0),
                  'Tim': (37.0, -120.0), 'Ryan': (35.4, -80.1), 'Conor': (47.5, -120.5),
                  'Sam': (44.0, -71.5)}
# grab the known users
known_user_names = list(user_locations.keys())
# add a user of unknown location
user_locations['OffTheGrid'] = None
OffTheGrid_friends = {'OffTheGrid': known_user_names}

print(backstrom_geos_trained.locate(user_locations, OffTheGrid_friends))


{'Nate': (35.5, -98.0), 'OffTheGrid': (42.4, -71.1), 'Sam': (44.0, -71.5), 'Lougee': (39.0, -105.0), 'Tyler': (42.4, -71.1), 'Conor': (47.5, -120.5), 'Tim': (37.0, -120.0), 'Ryan': (35.4, -80.1)}
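
In the output above, OffTheGrid is assigned Tyler's coordinates, that is, the location of one of its friends. One way to arrive at an answer of that kind is to treat each friend's location as a candidate and keep the one that maximizes the likelihood of the observed friendships. The sketch below is a simplified version of that idea, not necessarily how GeosPy implements `locate`: it scores candidates using friends only (the paper's fuller likelihood also accounts for non-friends), it assumes the power-law form a * (b + d) ** c, and `distance_miles` and `locate_user` are hypothetical helpers.

```python
import math


def distance_miles(p, q):
    """Great-circle (haversine) distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))


def locate_user(friend_locations, a=0.0019, b=0.196, c=-0.015):
    """Return the friend location with the highest summed log-probability."""
    def log_likelihood(candidate):
        # Sum of log p(d) over all friends, with p(d) = a * (b + d) ** c (assumed form).
        return sum(math.log(a * (b + distance_miles(candidate, loc)) ** c)
                   for loc in friend_locations)
    return max(friend_locations, key=log_likelihood)


# Locations of the seven known users from this notebook.
friend_locations = [(42.4, -71.1), (39.0, -105.0), (35.5, -98.0), (37.0, -120.0),
                    (35.4, -80.1), (47.5, -120.5), (44.0, -71.5)]
print(locate_user(friend_locations))
```

Because the candidates are the friends' own locations, the result is always one of the known coordinates, which matches the shape of the answer GeosPy returns here. The particular friend chosen depends on the parameters and on the exact likelihood used.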
