This notebook uses code snippets and explanations from this course.
The first thing you learned was printing a simple sentence: "Hello, world!" This sentence, as any text, was stored by Python as a string. Since many disciplines within the Humanities and Social Sciences work with texts, quite naturally we will focus a lot on manipulating texts in this course. Therefore, strings will be an important data type for us. This Notebook is devoted to this object type.
If you have questions about this chapter, please refer to the forum on Canvas.
In [ ]:
# Here are some strings:
string_1 = "Hello, world!"
string_2 = 'I ❤️ cheese' # If you are using Python 2, your computer will not like this.
string_3 = '1,2,3,4,5,6,7,8,9'
There is no difference in declaring a string with single or double quotes. However, if your string contains a quote symbol it can lead to errors if you try to enclose it with the same quotes.
In [ ]:
# Run this cell to see the error generated by the following line.
restaurant = 'Wendy's'
In the example above the error indicates that there is something wrong with the letter s. This is because the single quote closes the string we started, and anything after that is unexpected. To solve this we can enclose the string in double quotes, as follows:
In [ ]:
restaurant = "Wendy's"
# Similarly, we can enclose a string containing double quotes with single quotes:
quotes = 'Using "double" quotes enclosed by a single quote.'
We can also use the escape character "\" in front of the quote, which will tell Python not to treat this specific quote as the end of the string.
In [ ]:
restaurant = 'Wendy\'s'
print(restaurant)
restaurant = "Wendy\"s"
print(restaurant)
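As a quick check, the sketch below suggests that the escaped and the unescaped spelling produce exactly the same string: the backslash is only part of the code, not of the stored text. The variable names `escaped` and `unescaped` are illustrative and do not come from the course.

```python
# Both spellings create the same seven-character string.
escaped = 'Wendy\'s'
unescaped = "Wendy's"

print(escaped == unescaped)  # compares the stored text, not the spelling in the code
print(len(escaped))          # the backslash is not counted as a character
```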
Strings in Python can also span multiple lines, which is useful when you have a very long string, or when you want to format the output of the string in a certain way. This can be achieved in two ways:
We will first demonstrate how this works when you enclose the string in a single pair of double or single quotes.
In [ ]:
# This example also works with single-quotes.
long_string = "A very long string\n\
can be split into multiple\n\
sentences by appending a newline symbol\n\
to the end of the line."
print(long_string)
The \n (newline) symbol indicates that the rest of the text should start on a new line in the string; the backslash that follows it indicates that the string continues on the next line of the code. This difference can be hard to grasp, and is best illustrated with an example that does not include the \n symbol.
In [ ]:
long_string = "A very long string \
can be split into multiple \
sentences by appending a backslash \
to the end of the line."
print(long_string)
As you can see, Python now interprets this example as a single line of text. If we use the recommended way in Python to write multiline strings, with triple double or single quotes, you will see that the \n or newline symbol is automatically included.
In [ ]:
long_string = """A very long string
can also be split into multiple
sentences by enclosing the string
with three double or single quotes."""
print(long_string)
print()
another_long_string = '''A very long string
can also be split into multiple
sentences by enclosing the string
with three double or single quotes.'''
print(another_long_string)
What will happen if you remove the backslash characters in the example? Try it out in the cell below.
In [ ]:
long_string = "A very long string\
can be split into multiple\
sentences by appending a backslash\
to the end of the line."
print(long_string)
As we have seen above, it is possible to make strings that span multiple lines. Here are two ways to do so:
In [ ]:
multiline_text_1 = """This is a multiline text, so it is enclosed by triple quotes.
Pretty cool stuff!
I always wanted to type more than one line, so today is my lucky day!"""
multiline_text_2 = "This is a multiline text, so it is enclosed by triple quotes.\nPretty cool stuff!\nI always wanted to type more than one line, so today is my lucky day!"
print(multiline_text_1)
print() # this just prints an empty line
print(multiline_text_2)
Internally, these two strings are represented identically. We can check that with the double equals sign, which checks whether two objects have the same value:
In [ ]:
print(multiline_text_1 == multiline_text_2)
So from this we can conclude that multiline_text_1 has the same hidden characters (in this case \n, which stands for 'new line') as multiline_text_2. You can show that this is indeed true by using the built-in repr() function (which gives you the Python-internal representation of an object).
In [ ]:
# Show the internal representation of multiline_text_1.
print(repr(multiline_text_1))
print(repr(multiline_text_2))
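The following sketch illustrates one consequence of this: although \n is typed as two symbols in the code, it is stored as a single character. The string `two_lines` is a made-up example, and the snippet uses the built-in len() function, which returns the number of characters in a string.

```python
# "\n" is written with two symbols in the code, but it is one character in the string.
two_lines = "first\nsecond"

print(len(two_lines))          # 5 letters + 1 newline + 6 letters
print(two_lines.count("\n"))   # number of newline characters
print(repr(two_lines))         # repr() makes the hidden character visible
```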
Another hidden character that is often used is \t, which represents tabs:
In [ ]:
colors = "yellow\tgreen\tblue\tred"
print(colors)
print(repr(colors))
Strings are simply sequences of characters. Each character in a string therefore has a position, which can be referred to by the index number of that position. The index numbers start at 0 and run up to the length of the string minus one. You can also count backwards using negative indices, where -1 refers to the last character. The following table shows all characters of the sentence "Sandwiches are yummy" (including its two spaces) in the first row. The second and third rows show the positive and negative indices for each character, respectively:
Characters | S | a | n | d | w | i | c | h | e | s |   | a | r | e |   | y | u | m | m | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Positive index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
Negative index | -20 | -19 | -18 | -17 | -16 | -15 | -14 | -13 | -12 | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 |
You can access the characters of a string as follows:
In [ ]:
my_string = "Sandwiches are yummy"
print(my_string[1])
print(my_string[-1])
Length: Python has a built-in function called len() that lets you compute the length of a sequence. It works like this:
In [ ]:
number_of_characters = len(my_string)
print(number_of_characters) # Note that spaces count as characters too!
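Combining len() with indexing gives a way to see how positive and negative indices relate. The sketch below is only an illustration (the names `last_positive`, `same_last` and `same_first` are not from the course): the highest valid positive index is the length minus one, and a negative index i points at the same character as len(my_string) + i.

```python
my_string = "Sandwiches are yummy"

last_positive = len(my_string) - 1   # highest valid positive index

# A negative index i refers to the same character as len(my_string) + i.
same_last = my_string[-1] == my_string[last_positive]
same_first = my_string[0] == my_string[-len(my_string)]

print(last_positive, same_last, same_first)
```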
In [ ]:
my_string = "Sandwiches are yummy"
print(my_string[1:4])
This is called string slicing. So how does this notation work?
my_string[i] # Get the character at index i.
my_string[start:end] # Get the substring starting at 'start' and ending *before* 'end'.
my_string[start:end:stepsize] # Get all characters starting from 'start', ending before 'end',
# with a specific step size.
You can also leave parts out:
my_string[:i] # Get the substring starting at index 0 and ending just before i.
my_string[i:] # Get the substring starting at i and running all the way to the end.
my_string[::i] # Get a string going from start to end with step size i.
You can also use a negative step size. my_string[::-1] is the idiomatic way to reverse a string.
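Here is a small sketch of the shortened notations on a different string, so that the questions below stay open. The string "notebook" and the variable names are hypothetical examples. Note that splitting a string at any index and gluing the two slices back together yields the original string.

```python
word = "notebook"  # hypothetical example string

head = word[:4]          # same as word[0:4]
tail = word[4:]          # same as word[4:len(word)]
every_other = word[::2]  # characters at index 0, 2, 4, 6
backwards = word[::-1]   # negative step: walk from the end to the start

print(head, tail, every_other, backwards)
print(head + tail == word)  # the two halves rebuild the original
```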
Do you know what the following statements will print?
In [ ]:
print(my_string[1:4])
In [ ]:
print(my_string[1:4:1])
In [ ]:
print(my_string[11:14])
In [ ]:
print(my_string[15:])
In [ ]:
print(my_string[:9])
In [ ]:
print('cow'[::2])
In [ ]:
print('cow'[::-2])
In [ ]:
# This is fine, because we are creating a new string
fruit = 'guanabana'
island = fruit[:5]
print(island, 'island')
In [ ]:
# This works because we are creating a new string and overwriting our old one
fruit = fruit[5:] + 'na'
print(fruit)
In [ ]:
# This does not work (it raises a TypeError) because now we are trying to change an existing string
fruit[4:] = 'an'
print(fruit)
In [ ]:
# If we want to do this then we need to do:
fruit = fruit[:4] + 'an'
print(fruit)
The reasons why strings are immutable are beyond the scope of this notebook. Just remember that if you want to modify a string, you need to overwrite the entire string; you cannot modify parts of it by assigning to individual indices or slices.
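The sketch below puts both halves of that rule in one cell. It wraps the failing assignment in try/except (a construct this notebook has not introduced, used here only so that the cell keeps running) and then builds a new string from a slice instead. The variable names are illustrative.

```python
fruit = "banana"

try:
    fruit[0] = "B"              # item assignment is not supported for strings
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# Building a new string from slices works instead.
capitalised = "B" + fruit[1:]
print(capitalised, fruit)       # the original string is unchanged
```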
In Python it is possible to use comparison operators (as used in conditional statements) on strings. These operators are: ==, !=, <, <=, >, and >=.
String comparison is always case-sensitive. Some of the comparison operations (greater/smaller than) are useful for putting words in lexicographical order. This is similar to the alphabetical order you would use with a dictionary, except that all the uppercase letters come before all the lowercase letters (so first A, B, C, etc. and then a, b, c, etc.).
In [ ]:
print('a' == 'a')
print('a' != 'b')
print('a' == 'A') # string comparison is case-sensitive
print('a' < 'b') # alphabetical order
print('A' < 'a') # uppercase comes before lowercase
print('B' < 'a') # uppercase comes before lowercase
print()
print('orange' == 'Orange')
print('orange' > 'Orange')
print('orange' < 'Orange')
print('orange' > 'banana')
print('Orange' > 'banana')
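Why do uppercase letters come first? Python compares strings character by character using each character's Unicode code point, which the built-in ord() function reveals. The sketch below also shows one possible way to get a case-insensitive ordering with sorted() and a key; lists and sorted() have not been covered yet, so treat this as a preview rather than as the course's method.

```python
# ord() returns the Unicode code point of a single character.
print(ord('A'), ord('Z'), ord('a'))   # uppercase letters have lower numbers

words = ['orange', 'Orange', 'banana']
case_sensitive = sorted(words)                   # uppercase first
case_insensitive = sorted(words, key=str.lower)  # compare lowercased copies instead

print(case_sensitive)
print(case_insensitive)
```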
Another way of comparing strings is to check whether a string is part of another string, which can be done using the in operator. It returns True if the string contains the relevant substring, and False if it doesn't. These two values (True and False) are called boolean values, or booleans for short. We'll talk about them in more detail later. Here are some examples to try:
In [ ]:
"fun" in "function"
In [ ]:
"I" in "Team"
In [ ]:
"App" in "apple" # Capitals are not the same as lowercase characters!
In [ ]:
print("Hello", "World")
print("Hello " + "World")
Even though they may look similar, there are two different things happening here. Put simply: the plus in the expression is doing concatenation, but the comma is not.
The print() function, which we have seen many times now, prints every expression in a comma-separated sequence to your screen as a string, and it separates the results with a single space by default. Note that you can mix types: anything that is not already a string is automatically converted to its string representation.
In [ ]:
number = 5
print("I have", number, "apples")
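The single space is only a default. As a brief sketch, print() accepts a keyword argument called sep that sets what is placed between the comma-separated values:

```python
number = 5

# sep controls what print() puts between the comma-separated values.
print("I have", number, "apples")             # default separator: one space
print("I have", number, "apples", sep="")     # no separator at all
print("I have", number, "apples", sep=" - ")  # any string can be a separator
```

With sep="" the output looks like that of concatenation, but print() still converts the integer for you.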
String concatenation, on the other hand, happens when we merge two strings into a single object using the + operator. No spaces are inserted, and you cannot mix types. So, if you want to merge a string and an integer, you will need to convert the integer to a string first.
In [ ]:
number = 5
print("I have " + str(number) + " apples")
Optionally, we can assign the concatenated string to a variable:
In [ ]:
my_string = "I have " + str(number) + " apples"
print(my_string)
In addition to using + to concatenate strings, we can also use the multiplication sign * in combination with an integer for repeating strings (note that we again need to add a blank after 'apples' if we want it to be inserted):
In [ ]:
my_string = "apples " * 5
print(my_string)
The difference between "," and "+" when printing and concatenating strings can be confusing at first. Have a look at these examples to get a better sense of their differences.
In [ ]:
print("Hello", "World")
In [ ]:
print("Hello" + "World")
In [ ]:
print("Hello " + "World")
In [ ]:
print(5, "eggs")
In [ ]:
print(str(5), "eggs")
In [ ]:
print(5 + " eggs")
In [ ]:
print(str(5) + " eggs")
In [ ]:
text = "Hello" + "World"
print(text)
print(type(text))
In [ ]:
text = "Hello", "World"
print(text)
print(type(text))
We can imagine that string concatenation can get rather confusing and unreadable if we have more variables. Consider the following example:
In [ ]:
name = "Chantal"
age = 27
country = "The Netherlands"
introduction = "Hello. My name is " + name + ". I'm " + str(age) + " years old and I'm from " + country + "."
print(introduction)
Luckily, there is a way to make the code much easier to read and more nicely formatted. In Python, you can use a string formatting mechanism called Literal String Interpolation. Strings that are formatted using this mechanism are called f-strings, after the leading character used to denote such strings, which stands for "formatted strings". It works as follows:
In [ ]:
name="Chantal"
age=27
country="The Netherlands"
introduction = f"Hello. My name is {name}. I'm {age} years old and I'm from {country}."
introduction
We can even do cool stuff like this with f-strings:
In [ ]:
text = f"Next year, I'm turning {age+1} years old."
print(text)
Other formatting methods that you may come across include %-formatting and str.format(), but we recommend that you use f-strings because they are the most intuitive.
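So that you can recognise the other two styles when you meet them, the sketch below writes one sentence in all three ways and checks that the results are identical. It also shows a format specifier (the part after the colon inside the braces), here used to round a number to two decimals. The variable names and the value of `share` are made up for illustration.

```python
name = "Chantal"
age = 27

with_fstring = f"{name} is {age} years old."
with_format = "{} is {} years old.".format(name, age)
with_percent = "%s is %d years old." % (name, age)

print(with_fstring == with_format == with_percent)

# A format specifier after a colon controls how a value is displayed.
share = 2 / 3
rounded = f"Share: {share:.2f}"   # two digits after the decimal point
print(rounded)
```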
In [ ]:
string_1 = 'Hello, world!'
print(string_1) # The original string.
print(string_1.lower()) # Lowercased.
print(string_1.upper()) # Uppercased.
So how do you find out what kind of methods an object has? There are two options: look them up online, or ask Python itself with the built-in dir() function, which returns a list of method names (as well as attributes of the object). If you want to know what a specific method does, use the help() function. Run the code below to see what the output of dir() looks like.
The method names that start and end with double underscores ('dunder methods') are Python-internal. They are what makes general functions like len() work (len() internally calls the string's __len__() method), and they let Python know what to do when you, for example, use a for-loop with a string.
The other method names indicate common and useful methods.
In [ ]:
# Run this cell to see all methods for strings
dir(str)
If you'd like to know what one of these methods does, you can just use help() (or look it up online):
In [ ]:
help(str.upper)
It's important to note that string methods only return the result. They do not change the string itself.
In [ ]:
x = 'test' # Defining x.
y = x.upper() # Using x.upper(), assigning the result to variable y.
print(y) # Print y.
print(x) # Print x. It is unchanged.
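Because every string method returns a new string, method calls can be chained: the next method is applied to the result of the previous one. A minimal sketch, with an invented example string:

```python
raw = "  Hello, World!  "

# Each method returns a new string, so calls can be chained left to right.
cleaned = raw.strip().lower()

print(repr(raw))       # still has its spaces and capitals
print(repr(cleaned))
```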
Below we illustrate some of the string methods. Try to understand what is happening. Use the help() function to find more information about each of these methods.
In [ ]:
# Find out more about each of the methods used below by changing the name of the method
help(str.strip)
In [ ]:
s = ' Humpty Dumpty sat on the wall '
print(s)
s = s.strip()
print(s)
print(s.upper())
print(s.lower())
print(s.count("u"))
print(s.count("U"))
print(s.find('sat'))
print(s.find('t', 12))
print(s.find('q', 12))
print(s.replace('sat on', 'fell off'))
words = s.split() # This returns a list, which we will talk about later.
for word in words: # But you can iterate over each word in this manner
    print(word.capitalize())
print('-'.join(words))
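Two details from the cell above are worth isolating. The sketch below (variable names are illustrative) suggests that find() returns an index you can feed straight into a slice, that it returns -1 rather than raising an error when nothing is found, and that joining the result of split() with a single space rebuilds a sentence whose words were separated by single spaces.

```python
s = 'Humpty Dumpty sat on the wall'

position = s.find('sat')         # index of the first match
missing = s.find('q')            # -1 signals "not found"
print(position, missing)
print(s[position:position + 3])  # slicing at that index gives the match back

words = s.split()
rebuilt = ' '.join(words)
print(rebuilt == s)              # joining with a space rebuilds the sentence
```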
In [ ]:
print("A message").
print("A message')
print('A message"')
In [ ]:
my_string = "Sandwiches are yummy"
# your code here
Can you print the following? Try using both positive and negative indices.
In [ ]:
# your code here
In [ ]:
# your code here
Can you print 'banana' in reverse ('ananab')?
In [ ]:
# your code here
Can you exchange the first and last characters in my_string ('aananb')? Create a new variable new_string to store your result.
In [ ]:
my_string = "banana"
new_string = # your code here
In [ ]:
name = "Bruce Banner"
alterego = "The Hulk"
colour = "Green"
country = "USA"
print("His name is" + name + "and his alter ego is" + alterego +
", a big" + colour + "superhero from the" + country + ".")
How would you print the same sentence using ","?
In [ ]:
name = "Bruce Banner"
alterego = "The Hulk"
colour = "Green"
country = "USA"
print("His name is" + name + "and his alter ego is" + alterego +
", a big" + colour + "superhero from the" + country + ".")
Can you rewrite the code below using an f-string?
In [ ]:
name = "Bruce Banner"
alterego = "The Hulk"
colour = "green"
country = "the USA"
birth_year = 1969
current_year = 2017
print("His name is " + name + " and his alter ego is " + alterego +
", a big " + colour + " superhero from " + country + ". He was born in " + str(birth_year) +
", so he must be " + str(current_year - birth_year - 1) + " or " + str(current_year - birth_year) +
" years old now.")
In [ ]:
my_string = "banana"
# your code here
Remove all spaces in the sentence using a string method.
In [ ]:
my_string = "Humpty Dumpty sat on the wall"
# your code here
What do the methods lstrip() and rstrip() do? Try them out below.
In [ ]:
# find out what lstrip() and rstrip() do
What do the methods startswith() and endswith() do? Try them out below.
In [ ]:
# find out what startswith() and endswith() do