You are currently looking at version 1.0 of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the Jupyter Notebook FAQ course resource.


Working With Text


In [1]:
text1 = "Ethics are built right into the ideals and objectives of the United Nations "

len(text1) # The length of text1


Out[1]:
76

In [2]:
text2 = text1.split(' ') # Return a list of the words in text2, separating by ' '.

len(text2)


Out[2]:
14

In [3]:
text2


Out[3]:
['Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations',
 '']


List comprehension allows us to find specific words:


In [4]:
[w for w in text2 if len(w) > 3] # Words that are greater than 3 letters long in text2


Out[4]:
['Ethics',
 'built',
 'right',
 'into',
 'ideals',
 'objectives',
 'United',
 'Nations']

In [5]:
[w for w in text2 if w.istitle()] # Capitalized words in text2


Out[5]:
['Ethics', 'United', 'Nations']

In [6]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'


Out[6]:
['Ethics', 'ideals', 'objectives', 'Nations']


We can find unique words using set().


In [7]:
text3 = 'To be or not to be'
text4 = text3.split(' ')

len(text4)


Out[7]:
6

In [8]:
len(set(text4))


Out[8]:
5

In [9]:
set(text4)


Out[9]:
{'To', 'be', 'not', 'or', 'to'}

In [10]:
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.


Out[10]:
4

In [11]:
set([w.lower() for w in text4])


Out[11]:
{'be', 'not', 'or', 'to'}

Processing free-text


In [12]:
text5 = '"Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text6 = text5.split(' ')

text6


Out[12]:
['"Ethics',
 'are',
 'built',
 'right',
 'into',
 'the',
 'ideals',
 'and',
 'objectives',
 'of',
 'the',
 'United',
 'Nations"',
 '#UNSG',
 '@',
 'NY',
 'Society',
 'for',
 'Ethical',
 'Culture',
 'bit.ly/2guVelr']


Finding hastags:


In [13]:
[w for w in text6 if w.startswith('#')]


Out[13]:
['#UNSG']


Finding callouts:


In [14]:
[w for w in text6 if w.startswith('@')]


Out[14]:
['@']

In [15]:
text7 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
text8 = text7.split(' ')


We can use regular expressions to help us with more complex parsing.

For example '@[A-Za-z0-9_]+' will return all words that:

  • start with '@' and are followed by at least one:
  • capital letter ('A-Z')
  • lowercase letter ('a-z')
  • number ('0-9')
  • or underscore ('_')

In [16]:
import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]


Out[16]:
['@UN', '@UN_Women']