Easy Automated Geocoding of Text with CLIFF-up

By Andy Halterman. November 22, 2014

MIT's CLIFF is a piece of software for extracting geolocation data from text, bundled into a server that can be accessed via API calls. I've packaged CLIFF into a Vagrant virtual machine for people (like me) who aren't thrilled about learning how to set up Tomcat servers and get Java configurations right. See my CLIFF-up repo on GitHub for the code to get CLIFF running easily inside a Vagrant virtual machine. Follow the instructions there to get CLIFF-up running, and then come back here for a walkthrough of how to use CLIFF once it's up.

CLIFF is built on a number of free and open source projects, including Berico Technologies' CLAVIN geoparsing software, Stanford's CoreNLP natural language software, and Geonames.org's free gazetteer of place names and coordinates.

You only need two modules to work with CLIFF in this example, including the excellent requests module.


In [1]:
import requests
import json

Give it a sentence you're interested in geolocating (this one is from the New York Times):


In [2]:
sentence = "In Sweden, the episode brought back memories of another incident in 1981, when Sweden discovered that a Soviet submarine had run aground off Swedish shores at Karlskrona in the south of the country."

print sentence


In Sweden, the episode brought back memories of another incident in 1981, when Sweden discovered that a Soviet submarine had run aground off Swedish shores at Karlskrona in the south of the country.

CLIFF's API expects data passed to it in the form of an HTTP GET request. The requests module makes it very easy to do this. We construct a payload consisting of the parameter that CLIFF expects ("q") and the sentence we want parsed. We can then send it as a GET request to the port where CLIFF is listening, which is 8999 if you're using the Vagrantfile I have on my GitHub to get it running.


In [5]:
payload = {"q":sentence}
located = requests.get("http://localhost:8999/CLIFF-2.1.1/parse/text", params=payload)
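
To make the request above less of a black box, here is a minimal sketch of the URL that ends up being sent. It uses the standard library's urlencode rather than requests itself, so it is only an approximation of what requests does internally; the helper name build_cliff_url is my own and not part of CLIFF or requests.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2, as used in the rest of this notebook

# Same endpoint as the requests.get() call above.
BASE_URL = "http://localhost:8999/CLIFF-2.1.1/parse/text"


def build_cliff_url(text, base_url=BASE_URL):
    """Return the GET URL for parsing `text` (hypothetical helper).

    The text is URL-encoded and attached as the "q" query parameter.
    """
    return base_url + "?" + urlencode({"q": text})


example_url = build_cliff_url("Karlskrona in the south")
print(example_url)
```

Spaces are encoded as "+" and characters such as "&" are percent-escaped, which is why it's safer to hand the raw sentence to params= than to paste it into the URL by hand.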

We convert the resulting data into a dictionary, and then we can start to look at its components.


In [7]:
t = located.json()
t


Out[7]:
{u'milliseconds': 6764,
 u'results': {u'organizations': [],
  u'people': [],
  u'places': {u'focus': {u'cities': [{u'countryCode': u'SE',
      u'countryGeoNameId': u'2661886',
      u'featureClass': u'P',
      u'featureCode': u'PPLA',
      u'id': 2701713,
      u'lat': 56.16156,
      u'lon': 15.58661,
      u'name': u'Karlskrona',
      u'population': 32309,
      u'score': 1,
      u'stateCode': u'02',
      u'stateGeoNameId': u'2721357'}],
    u'countries': [{u'countryCode': u'SE',
      u'countryGeoNameId': u'2661886',
      u'featureClass': u'A',
      u'featureCode': u'PCLI',
      u'id': 2661886,
      u'lat': 62.0,
      u'lon': 15.0,
      u'name': u'Kingdom of Sweden',
      u'population': 9555893,
      u'score': 3,
      u'stateCode': u'00',
      u'stateGeoNameId': u''}],
    u'states': [{u'countryCode': u'SE',
      u'countryGeoNameId': u'2661886',
      u'featureClass': u'A',
      u'featureCode': u'ADM1',
      u'id': 2721357,
      u'lat': 56.33333,
      u'lon': 15.33333,
      u'name': u'Blekinge',
      u'population': 152315,
      u'score': 1,
      u'stateCode': u'02',
      u'stateGeoNameId': u'2721357'}]},
   u'mentions': [{u'confidence': 1.0,
     u'countryCode': u'SE',
     u'countryGeoNameId': u'2661886',
     u'featureClass': u'A',
     u'featureCode': u'PCLI',
     u'id': 2661886,
     u'lat': 62.0,
     u'lon': 15.0,
     u'name': u'Kingdom of Sweden',
     u'population': 9555893,
     u'source': {u'charIndex': 3, u'string': u'Sweden'},
     u'stateCode': u'00',
     u'stateGeoNameId': u''},
    {u'confidence': 1.0,
     u'countryCode': u'SE',
     u'countryGeoNameId': u'2661886',
     u'featureClass': u'A',
     u'featureCode': u'PCLI',
     u'id': 2661886,
     u'lat': 62.0,
     u'lon': 15.0,
     u'name': u'Kingdom of Sweden',
     u'population': 9555893,
     u'source': {u'charIndex': 79, u'string': u'Sweden'},
     u'stateCode': u'00',
     u'stateGeoNameId': u''},
    {u'confidence': 1.0,
     u'countryCode': u'SE',
     u'countryGeoNameId': u'2661886',
     u'featureClass': u'P',
     u'featureCode': u'PPLA',
     u'id': 2701713,
     u'lat': 56.16156,
     u'lon': 15.58661,
     u'name': u'Karlskrona',
     u'population': 32309,
     u'source': {u'charIndex': 159, u'string': u'Karlskrona'},
     u'stateCode': u'02',
     u'stateGeoNameId': u'2721357'}]}},
 u'status': u'ok',
 u'version': u'2.1.1'}

You'll see that it returns some data about the query (time elapsed, status, version), a list of organizations mentioned, a list of people, and the places in the story. One of CLIFF's big selling points is that it distinguishes between "focus" places (the locations the text is really about) and "mention" places that appear peripherally in the text. You'll also notice that it's fast: 52 milliseconds (although the particular run captured above reports 6764), which is the return on the long Lucene index building process when you ran vagrant up.
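
Each entry under "mentions" also records where in the text the place name was found. The sketch below uses a trimmed, hand-copied version of the mentions from Out[7] (not a live response) to pair each surface string with the place CLIFF resolved it to. Treating charIndex as a 0-based character offset into the submitted text is my reading of this one response, not something I've confirmed in CLIFF's documentation, and mention_spans is a name I made up for the example.

```python
sentence = (
    "In Sweden, the episode brought back memories of another incident in 1981, "
    "when Sweden discovered that a Soviet submarine had run aground off Swedish "
    "shores at Karlskrona in the south of the country."
)

# Trimmed copy of the "mentions" list from Out[7]; only the fields used here.
mentions = [
    {"name": "Kingdom of Sweden", "source": {"charIndex": 3, "string": "Sweden"}},
    {"name": "Kingdom of Sweden", "source": {"charIndex": 79, "string": "Sweden"}},
    {"name": "Karlskrona", "source": {"charIndex": 159, "string": "Karlskrona"}},
]


def mention_spans(text, mentions):
    """Return (surface string, resolved name, start offset, offset_matches) per mention.

    offset_matches is True when slicing `text` at charIndex gives back the
    surface string, i.e. when the 0-based-offset assumption holds.
    """
    rows = []
    for m in mentions:
        start = m["source"]["charIndex"]
        surface = m["source"]["string"]
        found = text[start:start + len(surface)]
        rows.append((surface, m["name"], start, found == surface))
    return rows


spans = mention_spans(sentence, mentions)
for surface, name, start, ok in spans:
    print("%r at %d -> %s (offset matches text: %s)" % (surface, start, name, ok))
```

This is handy when the resolved name differs from what the text says, as with "Sweden" becoming "Kingdom of Sweden" here.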

We can cut the results down to just the "focus" places, which is presumably what we're interested in.


In [8]:
t['results']['places']['focus']


Out[8]:
{u'cities': [{u'countryCode': u'SE',
   u'countryGeoNameId': u'2661886',
   u'featureClass': u'P',
   u'featureCode': u'PPLA',
   u'id': 2701713,
   u'lat': 56.16156,
   u'lon': 15.58661,
   u'name': u'Karlskrona',
   u'population': 32309,
   u'score': 1,
   u'stateCode': u'02',
   u'stateGeoNameId': u'2721357'}],
 u'countries': [{u'countryCode': u'SE',
   u'countryGeoNameId': u'2661886',
   u'featureClass': u'A',
   u'featureCode': u'PCLI',
   u'id': 2661886,
   u'lat': 62.0,
   u'lon': 15.0,
   u'name': u'Kingdom of Sweden',
   u'population': 9555893,
   u'score': 3,
   u'stateCode': u'00',
   u'stateGeoNameId': u''}],
 u'states': [{u'countryCode': u'SE',
   u'countryGeoNameId': u'2661886',
   u'featureClass': u'A',
   u'featureCode': u'ADM1',
   u'id': 2721357,
   u'lat': 56.33333,
   u'lon': 15.33333,
   u'name': u'Blekinge',
   u'population': 152315,
   u'score': 1,
   u'stateCode': u'02',
   u'stateGeoNameId': u'2721357'}]}

Or we can pare it down even further and just look at the city names:


In [9]:
for i in t['results']['places']['focus']['cities']:
    print i['name']


Karlskrona
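
If you want every focus place rather than just the cities, one possible approach is to flatten the three lists into rows. This is a sketch over a trimmed, hand-copied version of Out[8]; flatten_focus is a hypothetical helper, and it uses .get() so that a response with no cities (or no states) simply contributes no rows.

```python
# Trimmed copy of the "focus" dictionary from Out[8]; only the fields used here.
focus = {
    "cities": [{"name": "Karlskrona", "lat": 56.16156, "lon": 15.58661}],
    "countries": [{"name": "Kingdom of Sweden", "lat": 62.0, "lon": 15.0}],
    "states": [{"name": "Blekinge", "lat": 56.33333, "lon": 15.33333}],
}


def flatten_focus(focus, levels=("cities", "states", "countries")):
    """Return a list of (level, name, lat, lon) tuples, most specific level first."""
    rows = []
    for level in levels:
        # A level that is absent from the response is treated as an empty list.
        for place in focus.get(level, []):
            rows.append((level, place["name"], place["lat"], place["lon"]))
    return rows


rows = flatten_focus(focus)
for row in rows:
    print("%-9s %-18s %9.5f %9.5f" % row)
```

With a live response you would pass t['results']['places']['focus'] instead of the hand-copied dictionary.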

Multi-sentence example: Syria

The point of automated geocoding is obviously to do it at scale, perhaps as part of an event data project. Indeed, the impetus behind getting CLIFF to work was to use it to geocode the event data that the Open Event Data Alliance is producing, specifically the completely open-source Phoenix data set.

Let's take a look at an example a little closer to what we'll be doing with event data, where we'd like to use it to extract the places where events are happening. Our event extraction software, PETRARCH, handles the event extraction, but we will rely on a separate program to figure out the places associated with events in each sentence.

Here, I'm giving it a list of sentences from a recent Reuters story about Syria. Normally, I would split the paragraph into sentences automatically using CoreNLP's sentence splitter, but I've done that step by hand here to keep this example lightweight.


In [10]:
paragraph = ["The United States continued its assault on Islamic State militants this week, conducting 14 airstrikes in recent days in Syria and Iraq, U.S. Central Command said, three of them near the predominantly Kurdish border town of Kobani.", "Turkish President Tayyip Erdogan has criticized the U.S.-led coalition's focus on Kobani, which has been besieged by Islamic State for more than a month, and warned its attention needed to be turned to other parts of the conflict.", "The Syrian civil war has killed close to 200,000 people and forced more than 3 million refugees to flee the country, according to the United Nations.", "At least 11 children were killed in Damascus when mortars fell on a school in an eastern district of the Syrian capital, the Britain-based Syrian Observatory for Human Rights, which monitors the war, said on Wednesday.","The school was in a rebel-held part of Qaboun, a district in the east of the city which is contested between government and rebel forces, the monitoring group said.","The death toll was expected to rise because a number of those wounded were in critical condition, it said.","Fighters linked to al Qaeda also took ground from moderate Syrian rebels last week in the northern province of Idlib, expanding their control.","A member of the Syrian rebel forces based in southeastern Turkey said on Wednesday the Nusra Front had made further gains in recent days."]

for i in paragraph:
    print i
    print "\n"


The United States continued its assault on Islamic State militants this week, conducting 14 airstrikes in recent days in Syria and Iraq, U.S. Central Command said, three of them near the predominantly Kurdish border town of Kobani.


Turkish President Tayyip Erdogan has criticized the U.S.-led coalition's focus on Kobani, which has been besieged by Islamic State for more than a month, and warned its attention needed to be turned to other parts of the conflict.


The Syrian civil war has killed close to 200,000 people and forced more than 3 million refugees to flee the country, according to the United Nations.


At least 11 children were killed in Damascus when mortars fell on a school in an eastern district of the Syrian capital, the Britain-based Syrian Observatory for Human Rights, which monitors the war, said on Wednesday.


The school was in a rebel-held part of Qaboun, a district in the east of the city which is contested between government and rebel forces, the monitoring group said.


The death toll was expected to rise because a number of those wounded were in critical condition, it said.


Fighters linked to al Qaeda also took ground from moderate Syrian rebels last week in the northern province of Idlib, expanding their control.


A member of the Syrian rebel forces based in southeastern Turkey said on Wednesday the Nusra Front had made further gains in recent days.


I've defined a little function here to construct our request and get the resulting data.


In [16]:
def geolocate(text):
    payload = {"q":text}
    located = requests.get("http://localhost:8999/CLIFF-2.1.1/parse/text", params=payload)
    return located.json()
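
For longer runs, a slightly more defensive variant may be worth having. The sketch below is one way to do it, not part of CLIFF-up: geolocate_checked is a hypothetical name, the 30-second timeout is an arbitrary choice, and treating any status other than "ok" as a failure is an assumption based only on the "status" field in the output shown earlier. It takes the session as an argument so that one requests.Session() can be reused across many sentences.

```python
CLIFF_URL = "http://localhost:8999/CLIFF-2.1.1/parse/text"


def geolocate_checked(text, session, url=CLIFF_URL, timeout=30):
    """Send `text` to CLIFF through `session` and return the parsed JSON.

    `session` is anything with a requests-style .get() method,
    for example a requests.Session().
    """
    response = session.get(url, params={"q": text}, timeout=timeout)
    # Turn HTTP-level failures (404, 500, ...) into exceptions.
    response.raise_for_status()
    data = response.json()
    # Assumption: a successful parse always reports status "ok".
    if data.get("status") != "ok":
        raise ValueError("CLIFF returned status %r" % data.get("status"))
    return data


# Usage, with the CLIFF-up virtual machine running:
#   import requests
#   session = requests.Session()
#   result = geolocate_checked(sentence, session)
```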

In [17]:
for num, sentence in enumerate(paragraph):
    p = geolocate(sentence)
    # Not every sentence has a focus city, so check for the key first.
    if 'cities' in p['results']['places']['focus']:
        for city in p['results']['places']['focus']['cities']:
            place = city['name']
            lat = city['lat']
            lon = city['lon']
            print "Cities in sentence " + str(num) + ": " + place + " (" + str(lat) + ", " + str(lon) + ")"


Cities in sentence 0: ‘Ayn al ‘Arab (36.89095, 38.35347)
Cities in sentence 3: Damascus (33.5102, 36.29128)
Cities in sentence 4: Qābūn (33.54309, 36.33604)
Cities in sentence 6: Idlib (35.93062, 36.63393)

Although the first one doesn't look right (the first sentence [#0] is about Kobani), it turns out that ‘Ayn al ‘Arab is actually the same place. (See it on a map).

As the makers of CLIFF point out, assessing automated document geolocation is very difficult. Over the next few weeks, we at the Open Event Data Alliance will start evaluating CLIFF for our needs to see if we can move it into production as our geolocating service.