Text Processing Primer



In [1]:

    
text1 = "Julia is a high-level high-performance dynamic programming language for numerical computing and Julia is used ..."

len(text1)









    Out[1]:





113



In [2]:

    
text2 = text1.split(' ')
text2









    Out[2]:





['Julia',
 'is',
 'a',
 'high-level',
 'high-performance',
 'dynamic',
 'programming',
 'language',
 'for',
 'numerical',
 'computing',
 'and',
 'Julia',
 'is',
 'used',
 '...']



In [3]:

    
len(text2)









    Out[3]:





16
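
Note that split(' ') splits on single space characters only. The sketch below uses a hypothetical string, messy, that is not part of the notebook, to show how this differs from split() with no argument when the text contains repeated spaces, tabs, or newlines:

```python
# Hypothetical input with a double space, a tab, and a trailing newline.
messy = "Julia  is a\thigh-level language\n"

# Splitting on a single space keeps empty strings and other whitespace.
by_space = messy.split(' ')

# Splitting with no argument treats any run of whitespace as one separator.
by_whitespace = messy.split()

print(by_space)
print(by_whitespace)
```

For clean, single-spaced text such as text1, both calls should give the same list.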



In [4]:

    
[w for w in text2 if len(w) > 3] # Words in text2 longer than 3 characters









    Out[4]:





['Julia',
 'high-level',
 'high-performance',
 'dynamic',
 'programming',
 'language',
 'numerical',
 'computing',
 'Julia',
 'used']



In [5]:

    
[w for w in text2 if w.istitle()] # Capitalized words in text2









    Out[5]:





['Julia', 'Julia']



In [6]:

    
[w for w in text2 if w.endswith('l')] # Words in text2 that end in 'l'









    Out[6]:





['high-level', 'numerical']

We can find unique words using set().



In [7]:

    
len(set(text2))









    Out[7]:





14



In [8]:

    
set(text2)









    Out[8]:





{'...',
 'Julia',
 'a',
 'and',
 'computing',
 'dynamic',
 'for',
 'high-level',
 'high-performance',
 'is',
 'language',
 'numerical',
 'programming',
 'used'}



In [9]:

    
set([w.lower() for w in text2])









    Out[9]:





{'...',
 'a',
 'and',
 'computing',
 'dynamic',
 'for',
 'high-level',
 'high-performance',
 'is',
 'julia',
 'language',
 'numerical',
 'programming',
 'used'}
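
In text2 the word 'Julia' always appears capitalized, so lowercasing does not change the number of unique words. A small sketch with a hypothetical sentence, sample, which is not from the notebook, shows the case where it does:

```python
# Hypothetical sentence in which the same word appears in two capitalizations.
sample = "Julia is fast and julia is dynamic"
words = sample.split(' ')

unique_as_is = set(words)                    # 'Julia' and 'julia' stay distinct
unique_lowered = {w.lower() for w in words}  # both collapse to 'julia'

print(len(unique_as_is), len(unique_lowered))
```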

Processing free-text



In [10]:

    
text3 = 'Demystifying Dynamic Programming @freecamp @bostongroup @ NY #algorithms'
text4 = text3.split(' ')

text4









    Out[10]:





['Demystifying',
 'Dynamic',
 'Programming',
 '@freecamp',
 '@bostongroup',
 '@',
 'NY',
 '#algorithms']

Finding hashtags:



In [11]:

    
[w for w in text4 if w.startswith('#')]









    Out[11]:





['#algorithms']

Finding callouts:



In [12]:

    
[w for w in text4 if w.startswith('@')]









    Out[12]:





['@freecamp', '@bostongroup', '@']

Problem

The startswith('@') filter also returns the lone '@', which is not a callout.

Solution

Use regular expressions for more precise parsing.

For example, the pattern '@[A-Za-z0-9_]+' selects words in which '@' is followed by one or more of these characters:

capital letter ('A-Z')
lowercase letter ('a-z')
digit ('0-9')
underscore ('_')



In [14]:

    
import re  # for regular expressions

# re.match anchors the pattern at the start of each word
[w for w in text4 if re.match(r'@[A-Za-z0-9_]+', w)]









    Out[14]:





['@freecamp', '@bostongroup']
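
The same pattern can also be applied to the raw string instead of the pre-split list. This is a minimal sketch, not the notebook's own approach: it uses re.findall, and the hashtag pattern and the '\w' variant with re.ASCII are assumptions added here for illustration.

```python
import re

text3 = 'Demystifying Dynamic Programming @freecamp @bostongroup @ NY #algorithms'

# findall returns every non-overlapping match in the string;
# the lone '@' is skipped because no allowed character follows it.
callouts = re.findall(r'@[A-Za-z0-9_]+', text3)

# Assumed analogous pattern for hashtags.
hashtags = re.findall(r'#[A-Za-z0-9_]+', text3)

# With re.ASCII, '\w' is shorthand for [A-Za-z0-9_].
callouts_short = re.findall(r'@\w+', text3, flags=re.ASCII)

print(callouts, hashtags)
```

Without the re.ASCII flag, '\w' also matches non-ASCII letters and digits, which may or may not be what you want for user handles.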
