Simple Tools for Extracting Quantities from Strings

Suppose we have a report and we want to find the sentences that mention numerical quantities...

Originally inspired by When you get data in sentences: how to use a spreadsheet to extract numbers from phrases, Paul Bradshaw, Online Journalism blog, from which some of the example sentences (sic!) are taken.

Distribution: https://twitter.com/paulbradshaw/status/1158752556958519297

Potentially Useful Python Packages

  • quantulum: extract quantities from natural language text;
  • ctparse: extract time / date related quantities from natural language text;
  • r1chardj0n3s/parse: easy scrape / regex extraction from semi-structured text using format() like patterns; example use here;
  • dateparser [docs]: "easily parse localized dates in almost any string formats commonly found on web pages" (includes foreign language detection);
  • invoice2data: extract structured data from PDF invoices.

Example Sentences

Make a start on some sample test sentences...


In [152]:
sentences = [
    '4 years and 6 months’ imprisonment with a licence extension of 2 years and 6 months',
    'No quantities here',
    'I measured it as 2 meters and 30 centimeters.',
    "four years and six months' imprisonment with a licence extension of 2 years and 6 months",
    'it cost £250... bargain...',
    'it weighs four hundred kilograms.',
    'It weighs 400kg.',
    'three million, two hundred & forty, you say?',
    'it weighs four hundred and twenty kilograms.'
    
]

quantulum3

quantulum3 is a Python package "for information extraction of quantities from unstructured text".


In [153]:
#!pip3 install quantulum3
from quantulum3 import parser

In [154]:
for sent in sentences:
    print(sent)
    p = parser.parse(sent)
    if p:
        print('\tSpoken:',parser.inline_parse_and_expand(sent))
        print('\tNumeric elements:')
        for q in p:
            display(q)
            print('\t\t{} :: {}'.format(q.surface, q))
    print('\n---------\n')


4 years and 6 months’ imprisonment with a licence extension of 2 years and 6 months
	Spoken: four years and six months’ imprisonment with a licence extension of two years and six months
	Numeric elements:
Quantity(4, "Unit(name="year", entity=Entity("time"), uri=Year)")
		4 years :: four years
Quantity(6, "Unit(name="month", entity=Entity("time"), uri=Month)")
		6 months :: six months
Quantity(2, "Unit(name="year", entity=Entity("time"), uri=Year)")
		2 years :: two years
Quantity(6, "Unit(name="month", entity=Entity("time"), uri=Month)")
		6 months :: six months

---------

No quantities here

---------

I measured it as 2 meters and 30 centimeters.
	Spoken: I measured it as two metres and thirty centimetres.
	Numeric elements:
Quantity(2, "Unit(name="metre", entity=Entity("length"), uri=Metre)")
		2 meters :: two metres
Quantity(30, "Unit(name="centimetre", entity=Entity("length"), uri=Centimetre)")
		30 centimeters :: thirty centimetres

---------

four years and six months' imprisonment with a licence extension of 2 years and 6 months
	Spoken: four years and six months imprisonment with a licence extension of two years and six months
	Numeric elements:
Quantity(4, "Unit(name="year", entity=Entity("time"), uri=Year)")
		four years :: four years
Quantity(6, "Unit(name="month", entity=Entity("time"), uri=Month)")
		six months' :: six months
Quantity(2, "Unit(name="year", entity=Entity("time"), uri=Year)")
		2 years :: two years
Quantity(6, "Unit(name="month", entity=Entity("time"), uri=Month)")
		6 months :: six months

---------

it cost £250... bargain...
	Spoken: it cost two hundred and fifty pounds sterling, zero pence... bargain...
	Numeric elements:
Quantity(250, "Unit(name="pound sterling", entity=Entity("currency"), uri=Pound_sterling)")
		£250 :: two hundred and fifty pounds sterling, zero pence

---------

it weighs four hundred kilograms.
	Spoken: it weighs four hundred kilograms.
	Numeric elements:
Quantity(400, "Unit(name="kilogram", entity=Entity("mass"), uri=Kilogram)")
		four hundred kilograms :: four hundred kilograms

---------

It weighs 400kg.
	Spoken: It weighs four hundred kilograms.
	Numeric elements:
Quantity(400, "Unit(name="kilogram", entity=Entity("mass"), uri=Kilogram)")
		400kg :: four hundred kilograms

---------

three million, two hundred & forty, you say?
	Spoken: three million, two hundred & forty, you say?
	Numeric elements:
Quantity(3e+06, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")
		three million :: three million
Quantity(200, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")
		two hundred :: two hundred
Quantity(40, "Unit(name="dimensionless", entity=Entity("dimensionless"), uri=Dimensionless_quantity)")
		forty :: forty

---------

it weighs four hundred and twenty kilograms.
	Spoken: it weighs four hundred and twenty kilograms.
	Numeric elements:
Quantity(420, "Unit(name="kilogram", entity=Entity("mass"), uri=Kilogram)")
		four hundred and twenty kilograms :: four hundred and twenty kilograms

---------

Finding quantity statements in large texts

If we have a large block of text, we might want to quickly skim it for sentences that contain quantities. We can do something like the following...


In [155]:
import spacy
nlp = spacy.load('en_core_web_lg', disable = ['ner'])

In [171]:
text = '''
Once upon a time, there was a thing. The thing weighed forty kilogrammes and cost £250. 
It was blue. It took forty five minutes to get it home. 
What a day that was. I didn't get back until 2.15pm. Then I had some cake for tea.
'''

In [172]:
doc = nlp(text)
for sent in doc.sents:
    print(sent)


Once upon a time, there was a thing.
The thing weighed forty kilogrammes and cost £250. 

It was blue.
It took forty five minutes to get it home. 

What a day that was.
I didn't get back until 2.15pm.
Then I had some cake for tea.


In [173]:
for sent in doc.sents:
    sent = sent.text
    p = parser.parse(sent)
    if p:
        print('\tSpoken:',parser.inline_parse_and_expand(sent))
        print('\tNumeric elements:')
        for q in p:
            display(q)
            print('\t\t{} :: {}'.format(q.surface, q))
    print('\n---------\n')


	Spoken: 
Once upon one instance, there was a thing.
	Numeric elements:
Quantity(1, "Unit(name="count", entity=Entity("dimensionless"), uri=Count_data)")
		a time :: one instance

---------

	Spoken: The thing weighed forty kilograms and cost two hundred and fifty pounds sterling, zero pence. 

	Numeric elements:
Quantity(40, "Unit(name="kilogram", entity=Entity("mass"), uri=Kilogram)")
		forty kilogrammes :: forty kilograms
Quantity(250, "Unit(name="pound sterling", entity=Entity("currency"), uri=Pound_sterling)")
		£250 :: two hundred and fifty pounds sterling, zero pence

---------


---------

	Spoken: It took forty-five minutes to get it home. 

	Numeric elements:
Quantity(45, "Unit(name="minute of arc", entity=Entity("angle"), uri=Minute_and_second_of_arc)")
		forty five minutes :: forty-five minutes

---------

	Spoken: What one day that was.
	Numeric elements:
Quantity(1, "Unit(name="day", entity=Entity("time"), uri=Day)")
		a day :: one day

---------

	Spoken: I didn't get back until two point one five picometres.
	Numeric elements:
Quantity(2.15, "Unit(name="picometre", entity=Entity("length"), uri=Picometre)")
		2.15pm :: two point one five picometres

---------


---------
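Two of the results above are misreadings: "2.15pm" comes back as 2.15 picometres, and "forty five minutes" as minutes of arc. One possible workaround for the first case, sketched here with only the standard library, is to spot clock times with a regular expression before the text reaches the quantity parser. The `CLOCK_TIME` pattern and the `find_clock_times` and `mask_clock_times` helpers are illustrative assumptions, not part of quantulum3, and the pattern only covers simple 12-hour times.

```python
import re

# Hypothetical pattern for 12-hour clock times such as "2.15pm", "2:15 pm" or "11am".
CLOCK_TIME = re.compile(
    r'\b(1[0-2]|0?[1-9])(?:[.:]([0-5]\d))?\s?(am|pm)\b',
    re.IGNORECASE,
)

def find_clock_times(text):
    """Return (matched text, hour, minute, 'am' or 'pm') for each clock time found."""
    found = []
    for m in CLOCK_TIME.finditer(text):
        hour = int(m.group(1))
        minute = int(m.group(2) or 0)  # "11am" has no minutes part
        found.append((m.group(0), hour, minute, m.group(3).lower()))
    return found

def mask_clock_times(text, placeholder='TIME'):
    """Replace each clock time with a placeholder so it is not read as a quantity."""
    return CLOCK_TIME.sub(placeholder, text)

print(find_clock_times("I didn't get back until 2.15pm."))
print(mask_clock_times("I didn't get back until 2.15pm."))
```

The masked sentence could then be passed to `parser.parse()` in place of the original, with the clock times reported separately. Whether the placeholder itself confuses the parser would need checking.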

Annotating a dataset

Can we extract numbers from sentences in a CSV file? Yes we can...


In [1]:
url = 'https://raw.githubusercontent.com/BBC-Data-Unit/unduly-lenient-sentences/master/ULS%20for%20Sankey.csv'

In [2]:
import pandas as pd

df = pd.read_csv(url)
df.head()


Out[2]:
   Year  Offence category REFINED                    Original sentence (refined)                        Crown Court  Outcome of Decision  Revised?  People  Top 7
0  2015  Drug offence                                3 years imprisonment                               Bristol      Not referred         No        1       Y
1  2015  Death or serious injury - unlawful driving  6 years imprisonment - Disqualified driving - ...  Portsmouth   Not referred         No        1       Y
2  2015  Sexual offence                              9 months imprisonment suspended for 2 years        Nottingham   Out of time          No        1       Y
3  2015  Theft offence                               4 years and 10 months imprisonment - consecuti...  St Albans    Not referred         No        1       Y
4  2015  Theft offence                               unknown                                            unknown      Not in scheme        No        1       Y

In [178]:
#get a row
df.iloc[1]


Out[178]:
Year                                                                        2015
Offence category REFINED              Death or serious injury - unlawful driving
Original sentence (refined)    6 years imprisonment - Disqualified driving - ...
Crown Court                                                           Portsmouth
Outcome of Decision                                                 Not referred
Revised?                                                                      No
People                                                                         1
Top 7                                                                          Y
Name: 1, dtype: object

In [179]:
#and a, erm. sentence...
df.iloc[1]['Original sentence (refined)']


Out[179]:
'6 years imprisonment - Disqualified driving - 8 years'

In [180]:
parser.parse(df.iloc[1]['Original sentence (refined)'])


Out[180]:
[Quantity(6, "Unit(name="year", entity=Entity("time"), uri=Year)"),
 Quantity(8, "Unit(name="year", entity=Entity("time"), uri=Year)")]

In [206]:
def amountify(txt):
    # txt may be some flavour of NaN rather than a string...
    # handle scruffily for now: anything unparseable gives an empty string
    try:
        if txt:
            p = parser.parse(txt)
            x = []
            for q in p:
                x.append('{} {}'.format(q.value, q.unit.name))
            return '::'.join(x)
        return ''
    except Exception:
        return ''

In [207]:
df['amounts'] = df['Original sentence (refined)'].apply(amountify)

In [208]:
df.head()


Out[208]:
   Year  Offence category REFINED                    Original sentence (refined)                        Crown Court  Outcome of Decision  Revised?  People  Top 7  amounts
0  2015  Drug offence                                3 years imprisonment                               Bristol      Not referred         No        1       Y      3.0 year
1  2015  Death or serious injury - unlawful driving  6 years imprisonment - Disqualified driving - ...  Portsmouth   Not referred         No        1       Y      6.0 year::8.0 year
2  2015  Sexual offence                              9 months imprisonment suspended for 2 years        Nottingham   Out of time          No        1       Y      9.0 month::2.0 year
3  2015  Theft offence                               4 years and 10 months imprisonment - consecuti...  St Albans    Not referred         No        1       Y      4.0 year::10.0 month
4  2015  Theft offence                               unknown                                            unknown      Not in scheme        No        1       Y

We could then do something to split multiple amounts into multiple rows or columns...
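As a rough sketch of that splitting step, the helper below turns one `amounts` string back into a list of (value, unit) pairs. The name `split_amounts` is made up for illustration, and it assumes the `'{value} {unit}'` items joined by `::` that `amountify()` produces.

```python
def split_amounts(amounts):
    """Turn '9.0 month::2.0 year' into [(9.0, 'month'), (2.0, 'year')]."""
    pairs = []
    # Missing values (None, NaN) and empty strings give an empty list
    if not isinstance(amounts, str) or not amounts:
        return pairs
    for item in amounts.split('::'):
        # Split on the first space only: unit names such as
        # 'pound sterling' contain spaces themselves
        value, unit = item.split(' ', 1)
        pairs.append((float(value), unit))
    return pairs

print(split_amounts('9.0 month::2.0 year'))
print(split_amounts('250.0 pound sterling'))
```

Applied with `df['amounts'].apply(split_amounts)`, this gives one list per row, which pandas' `DataFrame.explode()` could then expand into one row per amount.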

Parsing Semi-Structured Sentences

The sentencing sentences look to have a reasonable degree of structure to them (or at least, there are some commonalities in the way some of them are structured).

We can exploit this structure by writing some more specific pattern matches to pull out even more information.


In [6]:
df['Original sentence (refined)'][:20].apply(print);


3 years imprisonment
6 years imprisonment - Disqualified driving - 8 years
9 months imprisonment suspended for 2 years
4 years and 10 months imprisonment - consecutive to any other periods of imprisonment
unknown
unknown
3 year community sentence attend sex offenders group and pay surcharge of £60 within 2 months
£850 Fine
12-months disqualification
Community Sentence / SOPO for 5 years/ pay a surcharge of £60 within 3 months
Bound over in the sum of £100.00 for 12 months
18 months imprisonment suspended for 2 years
9 years imprisonment
13 months imprisonment
14 years and 6 months imprisonment
3 years and 9 months imprisonment
Life imprisonment with a minimum of 25 years
6 years and 3 months imprisonment
4 years imprisonment
12 months imprisonment - confiscation under POCA 2002

It makes sense to try to build a default hierarchy that extracts from more specific to less specific structures...

For example:

  • 9 months imprisonment suspended for 2 years is more specific than 9 months imprisonment
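
One way to sketch such a hierarchy is an ordered list of patterns, tried from most specific to least specific, where the first match wins. The example below uses the standard library `re` module rather than the `parse` package listed earlier. The pattern labels, group names and the `classify` helper are all hypothetical, and the patterns cover only a few of the sentence shapes shown above.

```python
import re

# Ordered from most specific to least specific; the first match wins.
PATTERNS = [
    ('suspended', re.compile(
        r'^(?P<term>\d+) (?P<term_unit>years?|months?) imprisonment '
        r'suspended for (?P<period>\d+) (?P<period_unit>years?|months?)')),
    ('imprisonment_years_months', re.compile(
        r'^(?P<years>\d+) years? and (?P<months>\d+) months? imprisonment')),
    ('imprisonment', re.compile(
        r'^(?P<term>\d+) (?P<term_unit>years?|months?) imprisonment')),
]

def classify(sentence):
    """Return (label, captured fields) for the first pattern that matches."""
    for label, pattern in PATTERNS:
        m = pattern.search(sentence)
        if m:
            return label, m.groupdict()
    return None, {}

for s in ['9 months imprisonment suspended for 2 years',
          '9 months imprisonment',
          '4 years and 10 months imprisonment - consecutive to any other periods of imprisonment',
          'unknown']:
    print(classify(s))
```

Because the 'suspended' pattern is tried first, a suspended sentence is never reported as plain imprisonment, even though the less specific pattern would also match it.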