by Megat Harun Al Rashid bin Megat Ahmad
last updated: April 14, 2016
Python has built-in classes that support the processing of text, commonly known as strings. Strings can be declared by quoting the text with single (') or double (") quotes.
In [1]:
'Hello World, I am an evangelical Python enthusiast'
Out[1]:
Just like numerical values, strings can be assigned to variables and used later.
In [2]:
str1 = 'Hello World'
str2 = 'I am an evangelical Python enthusiast'
In [3]:
str1
Out[3]:
In [4]:
str2
Out[4]:
Strings can be stitched together using the '+' operator...
In [5]:
sente = str1 + ", " + str2
sente
Out[5]:
...as well as multiplied with the '*' operator:
In [6]:
5*'hello ,'
Out[6]:
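As a small sketch of the two operators together (the variable names are illustrative and not part of the original notebook), note that '+' only joins strings with strings, so a number has to be converted with str() first. Each print() call takes a single argument so that it behaves the same in Python 2 and Python 3.

```python
# Illustrative names; not part of the original notebook.
word = 'ab'
repeated = 3 * word              # same as word + word + word
joined = word + '-' + repeated   # concatenation with +

# A number must be converted to a string before it can be concatenated.
label = 'Item ' + str(5)

print(joined)
print(label)
```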
The quoting of strings can also be done with triple single ($'''$) and triple double ($"""$) quotes.
In [7]:
str3 = 'I am ' + '''not ''' + "the " + """we of anyone"""
In [8]:
str3
Out[8]:
This allows the usage of both the single and double quotes as part of the strings.
In [9]:
'Katheline Kelly: "That is amazing, you can spell "fox"! Can you spell "dog"?"'
Out[9]:
In [10]:
'''Joe Fox: "...but I am in the middle of a project \
that needs "tweaking""'''
Out[10]:
You will notice the backslash character '\' between the words "project" and "that". A backslash at the very end of a line continues the string (or any Python statement) on the next line, and the line break itself does not become part of the string. A long Python statement can therefore be split over several lines to improve readability.
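The sketch below compares three ways of spreading text over several lines, assuming shortened example strings rather than the notebook's own. It also shows that a triple-quoted string without the backslash keeps the line break as a character.

```python
# Backslash inside a string literal: the line break is not part of the string.
quote_a = 'but I am in the middle of a project \
that needs tweaking'

# Adjacent string literals inside parentheses are joined automatically,
# so no backslash is needed.
quote_b = ('but I am in the middle of a project '
           'that needs tweaking')

# A triple-quoted string without a backslash keeps the line break.
multi = '''first
second'''

# Backslash outside a string: continues an ordinary statement.
total = 1 + 2 + \
        3 + 4

print(quote_a == quote_b)
print(total)
```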
The length of a string, i.e. the number of characters in it, can be obtained using the **len()** function.
In [11]:
len(str3)
Out[11]:
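Python string classes support the sequence type operations, i.e. strings in Python behave much like a list of characters. Strings can therefore be indexed and sliced. Unlike lists, however, strings are immutable: individual characters cannot be changed in place, although a new string can always be assigned to the same variable.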
In [12]:
str0 = 'There is a wisdom of the head, and \
a wisdom of the heart'
In [13]:
str0
Out[13]:
Characters in a string can be extracted by specifying a position or a range of positions. Positions can be counted from either end of the string: from the left they start at '0' and increase, whereas from the right they start at '-1' and decrease. Both can be used, and mixed, when slicing. If one of the two positions is omitted, the slice extends to the end of the string in that direction. The end position is exclusive, so it must be one higher than the position of the last character wanted. This convention is helpful in control flow operations because the first position is '0': the slice [0:n], for example, returns the first n characters.
In [14]:
str0[0]
Out[14]:
In [15]:
str0[-1]
Out[15]:
In [16]:
str0[11:17]
Out[16]:
In [17]:
str0[-45:-39]
Out[17]:
In [18]:
str0[11:-39]
Out[18]:
In [19]:
str0[-9:]
Out[19]:
In [20]:
str0[:-27]
Out[20]:
In [21]:
str0[:]
Out[21]:
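One point worth spelling out with a short sketch is that slicing always produces a new string, while the original cannot be modified in place. The example below uses a shortened, illustrative sentence and assumes only built-in behaviour.

```python
text = 'There is a wisdom of the head'

# Slices with an omitted start or end run to that end of the string.
first_word = text[:5]      # 'There'
last_word = text[-4:]      # 'head'

# Strings are immutable: item assignment raises TypeError.
try:
    text[0] = 'W'
    changed_in_place = True
except TypeError:
    changed_in_place = False

# A modified copy is built from slices instead.
new_text = 'Where' + text[5:]
print(new_text)
```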
Below is a table that shows the position of each character in the string 'Python vs. Perl' counted from left to right and from right to left:
Character | P | y | t | h | o | n | (space) | v | s | . | (space) | P | e | r | l |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Position starting from left | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
Position starting from right | -15 | -14 | -13 | -12 | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 |
In [22]:
str_table = 'Python vs. Perl'
str_table[-4:-1] + str_table[-7] + str_table[4:6]
Out[22]:
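The relation between the two numbering schemes can be checked directly: for a string of length n, the negative position i refers to the same character as n + i. The following sketch (variable names are illustrative) verifies this for the string in the table.

```python
s = 'Python vs. Perl'
n = len(s)                      # 15 characters

# A negative position i refers to the same character as n + i.
same_char = s[-7] == s[n - 7]   # both are position 8, the character 's'

# The same slice written with negative and with positive positions.
from_right = s[-4:-1]
from_left = s[n - 4:n - 1]

word = s[-4:-1] + s[-7] + s[4:6]
print(word)
```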
Python has the **print** statement (in Python 3 it is the **print()** function) that can print formatted output.
In [23]:
print "Hello Everyone"
In [24]:
print "Hello ","Everyone"
In [25]:
x = 45.6
y = 98.1
print x + y
Formatted output can be printed using format specifiers and escape characters.
In [26]:
x = 45.678
y = 98.142
print '%.2f added to %.3f gives \n%.1f' % (x,y,x+y)
The number and order of the format specifiers in the string must match the arguments in the parentheses that follow. The arguments can be numbers, strings or expressions. In the above example, the '%.2f' term is replaced by the value of $x$, displayed with only two decimal places (as specified by '.2') instead of the original three. The letter 'f' indicates that the value is formatted as a floating point number. The '\n' is an escape character; here it means that the output continues on a new line.
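As a sketch of the matching rule, the formatted string can be stored in a variable before printing, and supplying too few arguments raises a TypeError. The names `message` and `mismatch_detected` are illustrative.

```python
x = 45.678
y = 98.142

# Three specifiers, three arguments, matched left to right.
message = '%.2f added to %.3f gives \n%.1f' % (x, y, x + y)
print(message)

# Too few arguments for the specifiers raises TypeError.
try:
    '%.2f and %.2f' % (x,)
    mismatch_detected = False
except TypeError:
    mismatch_detected = True
print(mismatch_detected)
```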
Some of the format specifiers that can be used:
Format Specifier | Type |
---|---|
%c | single character |
%d | decimal integer |
%s | string |
%f | floating point number |
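A brief sketch of the four specifiers in one line follows. The optional width and alignment forms (`%5d`, `%-8s`, `%8.2f`) go beyond the table above but are standard parts of the same '%' formatting syntax.

```python
# One value per specifier, in order.
row = '%c | %d | %s | %f' % ('A', 42, 'text', 3.5)
print(row)

# %f shows six decimal places by default and also accepts an integer.
from_int = '%f' % 2

# %d drops the fractional part of a float.
truncated = '%d' % 3.9

# A number before the letter sets a minimum width; '-' aligns to the left.
padded = '%5d|%-8s|%8.2f' % (7, 'ab', 3.14159)
print(padded)
```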
Some of the escape characters that can be used:
Escape Character | Function |
---|---|
\t | tab |
\n | new line |
\\ | backslash (\) |
\v | vertical tab |
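The sketch below shows that each escape sequence is stored as a single character, using made-up example strings. The raw string prefix `r` is an addition not covered in the table; it tells Python to keep backslashes exactly as typed.

```python
tabbed = 'Name\tMark'
two_lines = 'first\nsecond'
backslash = 'C:\\temp'       # \\ stands for one backslash
raw = r'C:\temp'             # raw string: backslashes are kept as typed

print(tabbed)
print(two_lines)
print(backslash)
print(len('a\tb'))           # the escape counts as one character
```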
In [27]:
str_list = 'Friya (78), Darylene (84), Femi (91), Aiko (100) \
and Holda (80)'
n1 = str_list[0:5]; n2 = str_list[12:20]; n3 = str_list[27:31]
n4 = str_list[-25:-21]; n5 = str_list[-10:-5]
# Checking whether extract/slice is correct
print '%s\t%s\t%s\t%s\t%s\n' % (n1,n2,n3,n4,n5)
m1 = str_list[7:9]; m2 = str_list[22:24]; m3 = str_list[33:35]
m4 = str_list[-19:-16]; m5 = str_list[-3:-1]
print '%s\t%s\t%s\t%s\t%s\n' % (m1,m2,m3,m4,m5)
'''Printing the results
in two columns'''
print 'Names\t\tMarks'
print 21*'-'
print '%s\t\t%s' % (n1,m1)
print '%s\t%s' % (n2,m2)
print '%s\t\t%s' % (n3,m3)
print '%s\t\t%s' % (n4,m4)
print '%s\t\t%s' % (n5,m5)
print '\nThe average mark is %d' % ((int(m1)+int(m2)+int(m3)\
+int(m4)+int(m5))/5)
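In this exercise, '#' is placed before a one-line comment in the program; everything after it on that line is not executed. Comments spanning several lines can be written by enclosing them in triple single (''') or triple double (""") quotes. Strictly speaking, such a block is a string that Python evaluates and then discards, but it is widely used as a multi-line comment. Comments assist program readability.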
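To illustrate the difference between the two comment styles, here is a minimal sketch using a hypothetical helper function, `average`, which is not part of the original exercise. A triple-quoted string placed first in a function is kept as its documentation, whereas a '#' comment is ignored entirely. The sketch also computes the mean with float division; the exercise above uses '%d', which prints only the integer part.

```python
def average(marks):
    """Return the mean of a list of numbers (this string is a docstring)."""
    # A hash comment: ignored entirely by the interpreter.
    return sum(marks) / float(len(marks))

mean_mark = average([78, 84, 91, 100, 80])
print('%.1f' % mean_mark)
print(average.__doc__)
```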
Apart from writing numbers and quoted strings directly in the program and assigning them to variables, Python can take input typed at the keyboard or read from a file. The output can also be saved into a file. In general, the input and output options available are:
1.1. Key-in input from keyboard: input, raw_input
1.2. Reading from file
2.1. Print on the screen
2.2. Write to file
In [28]:
# input function: key-in integer value
int_x = input('Key-in any integer (and press Enter): ')
print 'The input value is %d' % int_x
print '8 x %d = %d' % (int_x,8*int_x)
In [29]:
# input function: key-in a string
str_x = input('Key-in any word (and press Enter): ')
print 'The input word is "%s"' % str_x
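The **input()** function evaluates what is typed as a Python expression, so it accepts numbers and quoted strings, whereas the **raw_input()** function returns everything typed as a string (without the need to quote it). In Python 3, **raw_input()** was renamed **input()** and the evaluating version was removed.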
In [30]:
int_x = raw_input("Key-in any integer (and press Enter): ")
print "The input value is %s" % int_x
str_x = raw_input("Key-in any word (and press Enter): ")
print "The input word is %s" % str_x
Here the value of the $int\_x$ variable is actually a string and cannot be used in arithmetic. Therefore, it needs to be converted to a number.
In [31]:
int_x = int(raw_input("Key-in any integer (and press Enter): "))
print "The input value when multiplied by %d is %d" % (4,int_x*4)
fx = float(raw_input("\nKey-in any floating number (and press Enter): "))
print "The floating value when divided \nby %d is %.2f" % (8,fx/8)
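Because typed input may not be a valid number, the conversion can fail. The sketch below uses a hypothetical helper, `parse_int`, and fixed strings standing in for what raw_input() would return, so that it runs without a keyboard. It also shows what happens when the conversion is forgotten: '*' repeats the string instead of multiplying.

```python
def parse_int(text):
    """Return text converted to int, or None if it is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None

# Strings standing in for what raw_input() would return.
good = parse_int('12')
bad = parse_int('twelve')
spaced = parse_int(' 7\n')    # int() ignores surrounding whitespace

# Without conversion, * repeats the string instead of multiplying.
repeated = '12' * 4

print(good * 4)
print(repeated)
```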
Reading from and writing to a file can be done by using the **open()** function.
In [32]:
# Opening a file
file_read = open("Tutorial2/les miserables.txt")
# file_read = open("Tutorial2/les miserables.txt","r")
file_read.close()
Here the **open()** function is followed by the quoted name of the file in parentheses. By default a file is opened for reading, i.e. in 'r' mode, so the mode is usually omitted. The file needs to be closed after reading it. In the example below, the $les miserables.txt$ file is opened for reading (with the reading mode specified) and the content of the file is printed. If the file is not in the same directory as the working notebook, the directory path of the file needs to be specified.
In [33]:
file_read = open("Tutorial2/les miserables.txt","r")
print 'Name of the file: %s\n' % file_read.name
# Read and print the whole file as strings
texts = file_read.read()
print texts
# Closing the file
file_read.close()
In the following example, the content of the file is read as a string and assigned to a variable, so that excerpts from the file can be extracted.
In [34]:
file_read = open("Tutorial2/les miserables.txt","r")
texts = file_read.read()
'''printing the number of characters
(including empty spaces) in the file'''
print len(texts)
# Extracting certain portion of the file
excerpt = texts[290:419]
print excerpt
file_read.close()
Writing this extracted excerpt into another file can be done by opening that file in writing mode 'w'.
In [35]:
file_write = open("Tutorial2/contents.txt","w")
file_write.write("Important points from Victor Hugo's Les Miserables:\n\n")
file_write.write(excerpt)
file_write.close()
In [36]:
# Reading back the written file
file_new = open("Tutorial2/contents.txt","r")
print file_new.read()
file_new.close()
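An alternative to calling close() by hand is the `with` statement, which closes the file automatically when the block ends. The sketch below is self-contained: it writes a throwaway file in a temporary directory (the path and contents are placeholders, not the tutorial's data) and reads it back.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained;
# the path and contents are placeholders, not the tutorial's data.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'sample.txt')

with open(path, 'w') as handle:
    handle.write('first line\nsecond line\n')

# The with block closes the file automatically, even if an error occurs.
with open(path, 'r') as handle:
    texts = handle.read()

print(len(texts))
print(handle.closed)
```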
Some important file opening modes:
Mode | Function |
---|---|
r | Read only, pointer at beginning of the file, default mode |
w | Write only, creates the file or overwrites an existing one |
r+ | Read and write, pointer at beginning of the file |
a | Append, pointer at the end of the file |
a+ | Append and read, pointer at the end of the file |
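The difference between the 'w' and 'a' modes can be seen in a short sketch. It assumes a placeholder file in a temporary directory and uses the `with` statement, which closes each file automatically.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')  # placeholder file

with open(path, 'w') as f:      # 'w' creates the file (or empties an existing one)
    f.write('one\n')
with open(path, 'a') as f:      # 'a' adds to the end
    f.write('two\n')
with open(path, 'r') as f:
    after_append = f.read()

with open(path, 'w') as f:      # opening with 'w' again discards the old content
    f.write('three\n')
with open(path, 'r') as f:
    after_rewrite = f.read()

print(after_append)
print(after_rewrite)
```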
The characters those affected by society environments in Victor Hugo's Les Miserables:
1) The degradation of man by poverty.
- Jean Valjean
2) The ruin of woman by starvation.
- Fantine
3) The dwarfing of childhood by physical and spiritual night.
- Cosette
In [37]:
file_read = open("Tutorial2/les miserables.txt","r")
texts = file_read.read()
# Extracting certain portion of the file
excerpt = texts[290:419]
# print excerpt # if you want to check the excerpt first
file_read.close()
excerpt1 = excerpt[0:33]
excerpt2 = excerpt[35:66]
excerpt3 = excerpt[68:130]
excerpt3_1 = excerpt3[0:29]
excerpt3_2 = excerpt3[30:]
file_write = open("Tutorial2/Answer_to_2_1.txt","w")
file_write.write("The characters those affected by society \
environments\nin Victor Hugo's Les Miserables:\n")
file_write.write('\n1.\t'+excerpt1)
file_write.write('\n\t-\tJean Valjean\n')
file_write.write('\n2.\t'+excerpt2)
file_write.write('\n\t-\tFantine\n')
file_write.write('\n3.\t'+excerpt3_1+' '+excerpt3_2)
file_write.write('\n\t-\tCosette')
file_write.close()
# Checking back by reading the written file
file_new = open("Tutorial2/Answer_to_2_1.txt","r")
print file_new.read()
file_new.close()
Python has regular expression libraries, the built-in **re** and the third-party **regex**, that can perform sophisticated text processing. Regular expressions are a very large topic and will not be discussed here.
Further information on the Python **regex** library can be found at https://pypi.python.org/pypi/regex whereas the native Python **re** library is documented at https://docs.python.org/2/library/re.html
Instead, users can focus on the **split()** and **replace()** functions, which help a great deal with text processing. They are string methods, however, and not part of Python's regular expression libraries.
In [38]:
line = 'Oz: "Greatness?"; Glenda: "No, better than that, goodness"'
In [39]:
line_list = line.split(";")
line_list
Out[39]:
In the above instance, the content of $line$ is split into separate elements at each ";" and assigned to the $line\_list$ variable. The variable $line\_list$ now contains two elements (separated by a comma) and is a list (we will explore lists further in The Sequence topic).
In [40]:
line_list[0]
Out[40]:
In [41]:
line_list[1]
Out[41]:
In [42]:
line_list = line.split(":")
line_list
Out[42]:
In [43]:
line_list[2]
Out[43]:
If the **split()** function is used without any argument, then $line$ is split at whitespace (spaces, tabs and new lines).
In [44]:
line_list = line.split()
print line_list
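Calling split() without an argument is not quite the same as splitting on a single space. The sketch below, using a made-up string, shows that the argument-free form treats any run of whitespace as one separator, whereas split(' ') cuts at every individual space.

```python
spaced = 'a  b \t c\n'

by_whitespace = spaced.split()      # runs of spaces, tabs and new lines
by_one_space = spaced.split(' ')    # every single space is a separator

print(by_whitespace)
print(by_one_space)
```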
Splitting on several markers can be performed by first replacing the markers with a single common marker (or removing them) and then executing the **split()** function. The replacement can be carried out using the **replace()** function.
In [45]:
line_list = (line.replace(":", "").replace(";", "").replace(",", "")
             .replace('"', "").replace("?", "").split())
print line_list
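Both variants of this idea are sketched below on the same $line$: removing a set of markers in a loop before splitting at whitespace, and turning one marker into another so that a single split() call handles both. The loop is only one possible way of writing the chained replace() calls.

```python
line = 'Oz: "Greatness?"; Glenda: "No, better than that, goodness"'

# Variant 1: remove every marker, then split at whitespace.
cleaned = line
for marker in ':;,"?':
    cleaned = cleaned.replace(marker, '')
words = cleaned.split()

# Variant 2: turn ':' into ';' so that one split(';') handles both markers.
fields = line.replace(':', ';').split(';')

print(words)
print(fields)
```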
The **replace($x$,$y$)** function receives two arguments, $x$ and $y$: every occurrence of the string $x$ is replaced by the string $y$, and the result is returned as a new string.
In [46]:
no_ser="1\t2\t3\t4\n5\t6\t7\t8\n"
print no_ser
no_list = no_ser.replace("\t"," ").replace("\n"," ").split()
print no_list
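As a possible follow-up, the sketch below shows that split() without arguments gives the same flat list without the replace() calls, and that splitting at "\n" first keeps the row structure of the data. The list comprehensions are an assumption about style; a plain loop would work equally well.

```python
no_ser = "1\t2\t3\t4\n5\t6\t7\t8\n"

# split() with no argument already treats tabs and new lines as separators.
flat = no_ser.split()

# Splitting at "\n" first keeps the row structure; empty rows are skipped.
rows = [row.split("\t") for row in no_ser.split("\n") if row]

# Converting the strings to integers allows arithmetic on them.
numbers = [int(item) for item in flat]

print(flat)
print(rows)
print(sum(numbers))
```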