This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for UW's [Astro 599](http://www.astro.washington.edu/users/vanderplas/Astr599/) course. Source and license info is on [GitHub](https://github.com/jakevdp/2013_fall_ASTR599/).

Advanced String Manipulation & File I/O

One of the areas where Python has a distinct (and huge) advantage over lower-level languages like C is in its string manipulation. Operations that are downright painful in other languages can be accomplished very straightforwardly in Python.

The `string` module

We can get a preview of what's available by examining the built-in `string` module:



In [1]:

    
import string
dir(string)









    Out[1]:





['Formatter',
 'Template',
 '_TemplateMetaclass',
 '__builtins__',
 '__doc__',
 '__file__',
 '__name__',
 '__package__',
 '_float',
 '_idmap',
 '_idmapL',
 '_int',
 '_long',
 '_multimap',
 '_re',
 'ascii_letters',
 'ascii_lowercase',
 'ascii_uppercase',
 'atof',
 'atof_error',
 'atoi',
 'atoi_error',
 'atol',
 'atol_error',
 'capitalize',
 'capwords',
 'center',
 'count',
 'digits',
 'expandtabs',
 'find',
 'hexdigits',
 'index',
 'index_error',
 'join',
 'joinfields',
 'letters',
 'ljust',
 'lower',
 'lowercase',
 'lstrip',
 'maketrans',
 'octdigits',
 'printable',
 'punctuation',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rsplit',
 'rstrip',
 'split',
 'splitfields',
 'strip',
 'swapcase',
 'translate',
 'upper',
 'uppercase',
 'whitespace',
 'zfill']
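Many of the functions in this listing (`upper`, `split`, `strip`, and so on) are also available as methods on every string, which is how the rest of this notebook uses them. The module's constants are still handy on their own. Here is a minimal sketch with made-up sample strings; it is written so that it should run under both Python 2 and Python 3:

```python
import string

# Constants available in both Python 2 and Python 3
print(string.digits)            # the ten decimal digits
print(string.ascii_lowercase)   # 'a' through 'z'

# A constant can serve as the character set for strip():
# every leading or trailing punctuation mark is removed.
raw = "...Hello, world!?"       # hypothetical sample text
trimmed = raw.strip(string.punctuation)
print(trimmed)                  # Hello, world

# Membership tests against a constant are a quick way to classify characters
only_hex = all(c in string.hexdigits for c in "2aF9")
print(only_hex)                 # True
```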

Modifying Case:

`lower()`, `upper()`, `title()`, `capitalize()`, `swapcase()`



In [2]:

    
s = "HeLLo tHEre MY FriEND"



In [3]:

    
s.upper()









    Out[3]:





'HELLO THERE MY FRIEND'



In [4]:

    
s.lower()









    Out[4]:





'hello there my friend'



In [5]:

    
s.title()









    Out[5]:





'Hello There My Friend'



In [6]:

    
s.capitalize()









    Out[6]:





'Hello there my friend'



In [7]:

    
s.swapcase()









    Out[7]:





'hEllO TheRE my fRIend'

Splitting, Cleaning, and Joining

`split()`, `strip()`, `join()`, `replace()`



In [8]:

    
s.split()









    Out[8]:





['HeLLo', 'tHEre', 'MY', 'FriEND']



In [9]:

    
L = s.capitalize().split()
print(L)









    



['Hello', 'there', 'my', 'friend']



In [10]:

    
s = '_'.join(L)
print(s)









    



Hello_there_my_friend



In [11]:

    
s.split('_')









    Out[11]:





['Hello', 'there', 'my', 'friend']



In [12]:

    
''.join(s.split('_'))









    Out[12]:





'Hellotheremyfriend'



In [13]:

    
s = "    Too many spaces!    "
s.strip()









    Out[13]:





'Too many spaces!'



In [14]:

    
s = "*~*~*~*Super!!**~*~**~*~**~"
s.strip('*~')









    Out[14]:





'Super!!'



In [15]:

    
s.rstrip('*~')









    Out[15]:





'*~*~*~*Super!!'



In [16]:

    
s.lstrip('*~')









    Out[16]:





'Super!!**~*~**~*~**~'



In [17]:

    
s.replace('*', '')









    Out[17]:





'~~~Super!!~~~~~'



In [18]:

    
s.replace('*', '').replace('~', '')









    Out[18]:





'Super!!'
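These methods combine naturally. As a small sketch (the sample string is invented for illustration), `split()` followed by `join()` collapses any run of whitespace into a single space:

```python
messy = "   The   quick \t brown\n fox   "   # hypothetical sample text

# With no argument, split() breaks on any run of whitespace and
# drops empty strings, so the ends are trimmed for free.
words = messy.split()
clean = ' '.join(words)
print(clean)                    # The quick brown fox

# With an explicit separator, every occurrence counts, so
# consecutive separators produce empty strings.
fields = "a,,b".split(',')
print(fields)                   # ['a', '', 'b']
```

The difference between `split()` and `split(',')` matters when parsing delimited data, where an empty field usually needs to be kept.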

Finding substrings

`find()`, `startswith()`, `endswith()`



In [19]:

    
s = "The quick brown fox jumped"
s.find("fox")









    Out[19]:





16



In [20]:

    
s[16:]









    Out[20]:





'fox jumped'



In [21]:

    
s.find('booyah')









    Out[21]:





-1



In [22]:

    
s.startswith('The')









    Out[22]:





True



In [23]:

    
s.endswith('jumped')









    Out[23]:





True



In [24]:

    
s.endswith('fox')









    Out[24]:





False
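One thing to watch with `find()`: the `-1` it returns on a miss is itself a valid index. The sketch below shows the pitfall and two alternatives, the `in` operator and `index()` (which appears in the `string` listing above):

```python
s = "The quick brown fox jumped"

# find() signals "not found" with -1, which is also a valid index
pos = s.find("booyah")
tail = s[pos:]                  # silently slices from the last character
print(repr(tail))               # 'd'

# The in operator answers the yes/no question directly
found = "fox" in s
print(found)                    # True

# index() behaves like find() but raises ValueError on a miss
try:
    s.index("booyah")
    raised = False
except ValueError:
    raised = True
print(raised)                   # True
```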

Checking a string's contents

`isdigit()`, `isalpha()`, `islower()`, `isupper()`, `isspace()`, etc.



In [25]:

    
'1234'.isdigit()









    Out[25]:





True



In [26]:

    
'123.45'.isdigit()









    Out[26]:





False



In [27]:

    
'ABC'.isalpha()









    Out[27]:





True



In [28]:

    
'ABC123'.isalpha()









    Out[28]:





False



In [29]:

    
"ABC123".isalnum()









    Out[29]:





True



In [30]:

    
'ABC easy as 123'.isalnum()









    Out[30]:





False



In [31]:

    
'hello'.islower()









    Out[31]:





True



In [32]:

    
'HELLO'.isupper()









    Out[32]:





True



In [33]:

    
'Hello'.istitle()









    Out[33]:





True



In [34]:

    
'   '.isspace()









    Out[34]:





True
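As `'123.45'.isdigit()` shows, these checks look at individual characters, so they cannot tell you whether a string is a valid number. One common approach is to attempt the conversion with `float()` and catch the failure. The helper name `is_float` below is hypothetical, not part of the standard library:

```python
def is_float(text):
    """Return True if float() accepts the string (hypothetical helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False

print(is_float('123.45'))       # True
print(is_float('-42'))          # True
print(is_float('1e-3'))         # True
print(is_float('ABC'))          # False

# isdigit() rejects the sign as well as the decimal point
print('-42'.isdigit())          # False
```

Note that `float()` also accepts strings such as `'nan'` and `'inf'`, so a stricter check may be needed depending on the data.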

String Formatting

The old way

The old-style string formatting operations will look familiar to those who have used C. Essentially, any % in the string indicates a replacement.

The basic interface is

"%(format)" % value



In [35]:

    
from math import pi
"my favorite integer is %d, but my favorite float is %f." % (42, pi)









    Out[35]:





'my favorite integer is 42, but my favorite float is 3.141593.'



In [36]:

    
"in exponential notation it's %e" % pi









    Out[36]:





"in exponential notation it's 3.141593e+00"



In [37]:

    
"to choose smartly if exponential is needed: %g" % pi









    Out[37]:





'to choose smartly if exponential is needed: 3.14159'



In [38]:

    
"or with a bigger number: %g" % 123456787654321.0









    Out[38]:





'or with a bigger number: 1.23457e+14'



In [39]:

    
"rounded to three decimal places it's %.3f" % pi









    Out[39]:





"rounded to three decimal places it's 3.142"



In [40]:

    
"an integer padded with spaces: %10d" % 42









    Out[40]:





'an integer padded with spaces:         42'



In [41]:

    
"an integer padded on the right: %-10d" % 42









    Out[41]:





'an integer padded on the right: 42        '



In [42]:

    
"an integer padded with zeros: %010d" % 42









    Out[42]:





'an integer padded with zeros: 0000000042'



In [43]:

    
"we can also name our arguments: %(value)d" % dict(value=3)









    Out[43]:





'we can also name our arguments: 3'



In [44]:

    
"Escape the percent sign with an extra symbol: the %d%%" % 99









    Out[44]:





'Escape the percent sign with an extra symbol: the 99%'

Read more about formats in the Python docs.

Formatting: the new way

New-style string formatting uses curly braces {} to contain the formats, which can be referenced by argument number or by name:

"{0} {name}".format(first, name=second)



In [45]:

    
"{}{}".format("ABC", 123)









    Out[45]:





'ABC123'



In [46]:

    
"{0}{1}".format("ABC", 123)









    Out[46]:





'ABC123'



In [47]:

    
"{0}{0}".format("ABC", 123)









    Out[47]:





'ABCABC'



In [48]:

    
"{1}{0}".format("ABC", 123)









    Out[48]:





'123ABC'

The format specification comes after the colon (`:`):



In [49]:

    
("%.2f" % 3.14159) ==  "{:.2f}".format(3.14159)









    Out[49]:





True



In [50]:

    
"{0:d} is an integer; {1:.3f} is a float".format(42, pi)









    Out[50]:





'42 is an integer; 3.142 is a float'



In [51]:

    
"{the_answer:010d} is an integer; {pi:.5g} is a float".format(the_answer=42,
                                                              pi=pi)









    Out[51]:





'0000000042 is an integer; 3.1416 is a float'



In [52]:

    
'{desire} to {place}'.format(desire='Fly me',
                             place='The Moon')









    Out[52]:





'Fly me to The Moon'



In [53]:

    
# using a pre-defined dictionary
f = {"desire": "Won't you take me",
     "place": "funky town?"}

'{desire} to {place}'.format(**f)









    Out[53]:





"Won't you take me to funky town?"



In [54]:

    
# format also supports binary numbers
"int: {0:d};  hex: {0:x};  oct: {0:o};  bin: {0:b}".format(42)









    Out[54]:





'int: 42;  hex: 2a;  oct: 52;  bin: 101010'
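The padding examples from the old-style section have new-style counterparts. The sketch below uses the alignment characters `<`, `>`, and `^` and compares the results against the `%` versions shown earlier:

```python
# Width and alignment: < left, > right, ^ centered
right = "{:>10d}".format(42)
left = "{:<10d}".format(42)
center = "{:^10d}".format(42)
zeros = "{:010d}".format(42)

print(repr(right))    # '        42'
print(repr(left))     # '42        '
print(repr(center))   # '    42    '
print(repr(zeros))    # '0000000042'

# These match the old-style results shown earlier
print(right == "%10d" % 42)     # True
print(left == "%-10d" % 42)     # True
print(zeros == "%010d" % 42)    # True
```

Centering has no direct equivalent in the old-style syntax.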

File I/O

Let's create a file for us to read:



In [55]:

    
%%file inout.dat
Here is a nice file
with a couple lines of text
it is a haiku









    



Overwriting inout.dat



In [56]:

    
f = open('inout.dat')
print(f.read())
f.close()









    



Here is a nice file
with a couple lines of text
it is a haiku



In [57]:

    
f = open('inout.dat')
print(f.readlines())
f.close()









    



['Here is a nice file\n', 'with a couple lines of text\n', 'it is a haiku']



In [58]:

    
for line in open('inout.dat'):
    print(line.split())









    



['Here', 'is', 'a', 'nice', 'file']
['with', 'a', 'couple', 'lines', 'of', 'text']
['it', 'is', 'a', 'haiku']



In [59]:

    
# write() is the opposite of read()
contents = open('inout.dat').read()
out = open('my_output.dat', 'w')
out.write(contents.replace(' ', '_'))
out.close()



In [60]:

    
!cat my_output.dat









    



Here_is_a_nice_file
with_a_couple_lines_of_text
it_is_a_haiku



In [61]:

    
# writelines() is the opposite of readlines()
lines = open('inout.dat').readlines()
out = open('my_output.dat', 'w')
out.writelines(lines)
out.close()



In [62]:

    
!cat my_output.dat









    



Here is a nice file
with a couple lines of text
it is a haiku
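The examples above call `close()` by hand, or leave the file to be closed whenever Python gets around to it. A `with` block is a common alternative that closes the file automatically, even if an error occurs inside the block. The sketch below writes to a temporary directory so that it does not touch `inout.dat`; the file name `inout_demo.dat` is made up for illustration:

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'inout_demo.dat')

# The file is closed as soon as the with block ends
with open(path, 'w') as out:
    out.write("Here is a nice file\nwith a couple lines of text\n")

# Iterating over the file yields one line at a time, newline included
with open(path) as f:
    lines = [line.rstrip('\n') for line in f]

print(lines)
print(f.closed)                 # True
```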

Breakout: cleaning up some output

Here is some code that creates a comma-delimited file of numbers with random precision, leading spaces, and formatting:



In [63]:

    
# Don't modify this: it simply writes the example file
f = open('messy_data.dat', 'w')
import random
for i in range(100):
    for j in range(5):
        f.write(' ' * random.randint(0, 6))
        f.write('%0*.*g' % (random.randint(8, 12),
                            random.randint(5, 10),
                            100 * random.random()))
        if j != 4:
            f.write(',')
    f.write('\n')
f.close()
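The `'%0*.*g'` format in the cell above uses a feature not covered earlier: a `*` in place of the width or precision means "take this number from the argument tuple". A small sketch with a made-up value:

```python
x = 3.14159265   # hypothetical sample value

# Each * consumes the next item of the tuple: first the width,
# then the precision, then the value itself.
starred = '%0*.*g' % (10, 5, x)
explicit = '%010.5g' % x
print(starred)                  # 00003.1416
print(starred == explicit)      # True

# New-style formatting can nest fields to get the same effect
nested = '{:0{width}.{prec}g}'.format(x, width=10, prec=5)
print(nested)                   # 00003.1416
```

This is how the example file ends up with a different width and precision for every number.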



In [64]:

    
# Look at the first four lines of the file:
!head -4 messy_data.dat









    



 00000095.945, 0000096.1158,    014.15002,  0050.46316, 000014.6082
 0000070.778,00073.821, 57.85960388,      0008.85737,     00000092.04
      077.012237,0038.6466,     34.87242,      0000003.3876,     25.07738969
   00068.3471,    00009.9584,      020.02878, 65.9716241,   00063.43892

Your task: Write a program that reads in the contents of "messy_data.dat" and extracts the numbers from each line, using the string manipulations we used above (remember that float() will convert a suitable string to a floating-point number).

Next write out a new file named "clean_data.dat". The new file should contain the same data as the old file, but with uniform formatting and aligned columns.



In [65]:

    
# your solution here

The numpy solution

What you did above with text wrangling, numpy can do much more easily:



In [66]:

    
import numpy as np
data = np.loadtxt("messy_data.dat", delimiter=',')
np.savetxt("clean_data.dat", data,
           delimiter=',', fmt="%8.4f")



In [67]:

    
!head -5 clean_data.dat









    



 95.9450, 96.1158, 14.1500, 50.4632, 14.6082
 70.7780, 73.8210, 57.8596,  8.8574, 92.0400
 77.0122, 38.6466, 34.8724,  3.3876, 25.0774
 68.3471,  9.9584, 20.0288, 65.9716, 63.4389
 87.9833,  7.8228, 60.3212, 82.9680, 22.4530

Still, text manipulation is a very good skill to have under your belt!
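For comparison with the numpy version, here is one possible pure-string approach to cleaning a single line. It is a sketch rather than the reference solution: the function name `clean_line` and its `fmt` parameter are hypothetical, and the sample input is the first line of the `head` output shown earlier.

```python
def clean_line(line, fmt="%8.4f"):
    """Return one comma-separated line rewritten with uniform formatting."""
    # Split on commas, strip the padding, and let float() handle leading zeros
    values = [float(field.strip()) for field in line.split(',')]
    return ','.join(fmt % v for v in values)

# First line of the sample file shown earlier
messy = " 00000095.945, 0000096.1158,    014.15002,  0050.46316, 000014.6082"
cleaned = clean_line(messy)
print(cleaned)
```

To process the whole file, loop over the lines of `messy_data.dat`, pass each one through the function, and write the result followed by a newline to `clean_data.dat`.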