Strings

A string is a sequence of characters. You can treat a string's characters much like the items of a list: each character has an index, starting at 0, and you access it with square brackets, just like a list.

my_str = "M y   S t r i n g"
#index    0 1 2 3 4 5 6 7 8

print(my_str[0])

In [ ]:
chinese_zodiac = "Rat Ox Tiger Rabbit Dragon Snake Horse Goat Monkey Rooster Dog Pig"
print(chinese_zodiac[0])
print(chinese_zodiac[1])

TRY IT

Create a string with the 5 elements of the Chinese zodiac (wood fire earth metal water) and store it in a variable called elements. Get the 4th letter of elements.


In [ ]:

You can get the length (number of characters) of a string by using the built-in len function:

len(my_str)

In [ ]:
print(len(chinese_zodiac))

Be careful with the length: it is the number of characters, not the last index.

The last index is len(string) - 1.


In [ ]:
zlen = len(chinese_zodiac)
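# WRONG: zlen is one past the last index, so this line raises an IndexError
# (uncomment it to see the error)
# print(chinese_zodiac[zlen])
# RIGHT
print(chinese_zodiac[zlen - 1])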

You can also use negative indexing to get the last character in a string: -1 is the last character, -2 is the second to last, and so on.

Wondering why negative indexing starts with -1 and not 0? It's because -0 and 0 are the same thing, so you would just get the first character.


In [ ]:
print(chinese_zodiac[-1])
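Here is a small sketch of how negative and positive indices line up. The variable name zodiac is just a stand-in for this illustration; the point is that -0 is the same as 0, and that index -k refers to the same character as index len(string) - k.

```python
zodiac = "Rat Ox Tiger Rabbit Dragon Snake Horse Goat Monkey Rooster Dog Pig"

# -0 is just 0, so this is the FIRST character, not the last
first_char = zodiac[-0]
print(first_char)

# -2 counts back from the end: it is the same position as len(zodiac) - 2
second_to_last = zodiac[-2]
same_position = zodiac[len(zodiac) - 2]
print(second_to_last, same_position)
```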

TRY IT

Create a string with your name and store it in a variable called name. Print the last character of your name using both indexing methods (positive and negative).


In [ ]:

String Slices

You can take more than a single character; you can take a whole slice. To take a slice, give the first index and then the last index + 1. (The first index is inclusive; the second index is exclusive.)

string[0:5]

In [ ]:
second_animal = chinese_zodiac[4:6]
print(second_animal)

You can omit the first index and the slice will start at the beginning; you can omit the last index and it will go to the end.


In [ ]:
first_six = chinese_zodiac[:32]
print(first_six)

last_six = chinese_zodiac[33:]
print(last_six)
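A few more slicing details, sketched with a stand-in variable called zodiac: the length of a slice is the second index minus the first, negative indices work in slices too, and a slice that runs past the end is quietly clipped instead of raising an error.

```python
zodiac = "Rat Ox Tiger Rabbit Dragon Snake Horse Goat Monkey Rooster Dog Pig"

start, stop = 7, 12
tiger = zodiac[start:stop]
print(tiger)                        # Tiger
print(len(tiger) == stop - start)   # True

# Negative indices work in slices: the last three characters
last_three = zodiac[-3:]
print(last_three)                   # Pig

# A slice that runs past the end is clipped rather than failing
clipped = zodiac[63:1000]
print(clipped)                      # Pig
```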

TRY IT

What happens when you omit both indices? Try it on the chinese_zodiac.


In [ ]:

Immutable strings

Strings are immutable. You cannot change their contents; you must make a new string.


In [ ]:
# This will fail with a TypeError
chinese_zodiac[0] = 'A'

In [ ]:
# This is better
print(chinese_zodiac)
chinese_zodiac_minus_r = chinese_zodiac[1:]
print(chinese_zodiac_minus_r)

ate_the_rat = "C" + chinese_zodiac_minus_r
print(ate_the_rat)
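The same idea works for a character in the middle of a string. This is only a sketch (the names word and index are made up for the example): slice out everything before and after the position, then concatenate the pieces around the new character.

```python
word = "Rat"
index = 1  # position of the character we want to swap out

# Everything before the index + the new character + everything after it
new_word = word[:index] + "o" + word[index + 1:]

print(new_word)  # Rot
print(word)      # Rat (the original string is unchanged)
```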

Strings and in

The in operator checks if a substring is in a string. It returns a boolean.

'substring' in 'string'

Hint: case matters.


In [ ]:
print('Cat in zodiac:') 
print('Cat' in chinese_zodiac)

print('Dragon in zodiac:')
print('Dragon' in chinese_zodiac)
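Keep in mind that in looks for the exact run of characters, not for whole words. A short sketch (zodiac is a stand-in variable for this example):

```python
zodiac = "Rat Ox Tiger Rabbit Dragon Snake Horse Goat Monkey Rooster Dog Pig"

spans_words = 'at O' in zodiac        # matches across the space in "Rat Ox"
skips_a_word = 'Rat Tiger' in zodiac  # these two are not next to each other
wrong_case = 'rat' in zodiac          # lowercase r does not match "Rat"

print(spans_words)   # True
print(skips_a_word)  # False
print(wrong_case)    # False
```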

Looping through strings

You can loop through each character in a string using a for loop (or a while loop, but a for loop is more straightforward):

for character in string:
    print(character)

In [ ]:
for character in 'Rat':
    print(character)

In [ ]:
# Print only vowels
for letter in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
    if letter in 'AEIOUY':
        print(letter)
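For comparison, here is roughly what the while loop version looks like. It is a sketch with made-up names (word, index, visited); notice that you have to manage the index yourself, which is why the for loop is usually the easier choice.

```python
word = 'Rat'
index = 0
visited = ''  # collects the characters in the order we see them

# Keep going until the index reaches the length (one past the last index)
while index < len(word):
    print(word[index])
    visited = visited + word[index]
    index += 1
```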

TRY IT

Loop through the chinese_zodiac characters, printing only the letters that are in your name.


In [ ]:

String Comparison

You can compare strings using the ==, >, >=, <, and <= operators.

Digits come first, then capital letters, and then lowercase letters. Strings are compared character by character, based on each character's ASCII value (in Python 3, its Unicode code point): http://ascii.cl/


In [ ]:
print('A' > 'a')
print('A' < 'a')
print('A' > 'B')
print('0' > 'A')

In [ ]:
# Case matters in equality
print('cat' == 'Cat')
print('cat' == 'cat')
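If you want to see the numbers behind these comparisons, the built-in ord function returns a character's code. The sketch below also shows that longer strings are compared one character at a time, and that a string sorts after any shorter string it starts with.

```python
# The codes explain the ordering: digits < capitals < lowercase
codes = (ord('0'), ord('A'), ord('a'))
print(codes)  # (48, 65, 97)

# Every capital letter comes before every lowercase letter
capital_first = 'Zebra' < 'apple'
print(capital_first)  # True

# 'R' and 'a' match, so the third character decides: 'b' < 't'
rabbit_first = 'Rabbit' < 'Rat'
print(rabbit_first)  # True

# A prefix sorts before the longer string
prefix_first = 'Ox' < 'Oxen'
print(prefix_first)  # True
```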

String Methods

There are several built-in methods you can use on your strings. To find them all, use dir(str).

'my string'.method_name(params)

In [ ]:
dir(chinese_zodiac)

lower, upper, title, capitalize, and swapcase each return a new copy of the string with its case changed. (The original string is not modified.)

HINT: use one of these methods to normalize the input to your functions so that you don't have to worry about what case the user's input is in.


In [ ]:
print(chinese_zodiac.lower())
print(chinese_zodiac.upper())
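One way to use the hint above, sketched with made-up names (zodiac, user_input): lowercase both sides before checking with in, so the user's capitalization doesn't matter.

```python
zodiac = "Rat Ox Tiger Rabbit Dragon Snake Horse Goat Monkey Rooster Dog Pig"
user_input = 'dRaGoN'

# A direct check fails because case matters
exact_match = user_input in zodiac
print(exact_match)  # False

# Lowercasing both strings makes the check case-insensitive
loose_match = user_input.lower() in zodiac.lower()
print(loose_match)  # True

# title() capitalizes the first letter of each word
tidy = user_input.title()
print(tidy)  # Dragon
```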

The split method splits the string into a list. It takes one parameter, the character(s) to split on.

my_str.split(slice_string)

The join method merges a list into a string. It is called on the 'glue' string, and the list is the parameter.

glue_string.join(my_list)

In [ ]:
cz_list = chinese_zodiac.split(' ')
print(cz_list)

print(', '.join(cz_list))
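One detail worth knowing, shown here as a sketch with a made-up string called messy: splitting on ' ' treats every single space as a separator, so repeated spaces produce empty strings, while calling split with no parameter splits on any run of whitespace.

```python
messy = 'Rat  Ox   Tiger'  # two spaces, then three spaces

on_single_space = messy.split(' ')
print(on_single_space)  # ['Rat', '', 'Ox', '', '', 'Tiger']

on_whitespace = messy.split()
print(on_whitespace)    # ['Rat', 'Ox', 'Tiger']

# split and join undo each other when you use the same separator
round_trip = ' '.join(messy.split(' '))
print(round_trip == messy)  # True
```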

The find method returns the index of the first occurrence of a substring, or -1 if the substring doesn't exist. (Why -1 and not 0?)

count counts the occurrences of a substring.

startswith and endswith check whether a string starts or ends with a given substring and return a boolean.


In [ ]:
print(chinese_zodiac.find('Snake'))

print(chinese_zodiac.count('at'))

print(chinese_zodiac.startswith('Ra'))
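A sketch of what find returns at the edges (zodiac is a stand-in variable). Because 0 is a perfectly valid index, compare the result against -1 instead of using it directly as a true/false value.

```python
zodiac = "Rat Ox Tiger Rabbit Dragon Snake Horse Goat Monkey Rooster Dog Pig"

rat_idx = zodiac.find('Rat')
cat_idx = zodiac.find('Cat')
print(rat_idx)  # 0, because 'Rat' starts at the very first character
print(cat_idx)  # -1, because 'Cat' is not in the string

# Pitfall: 0 counts as False, so this branch is skipped even though 'Rat' was found
if zodiac.find('Rat'):
    print('found Rat')

# Safer: compare against -1 explicitly
found_rat = rat_idx != -1
found_cat = cat_idx != -1
print(found_rat, found_cat)  # True False

ends_with_pig = zodiac.endswith('Pig')
print(ends_with_pig)  # True
```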

You can chain string methods together, as long as each method in the chain (except possibly the last) returns a string.


In [ ]:
print(''.join(cz_list).lower().startswith('ra'))

TRY IT

Change the case of your elements string to upper case and then split the result on spaces (' ').

Challenge: do this by chaining string methods.


In [ ]:

Parsing Strings

You can use string methods and indexing to get exactly the substring you want.


In [ ]:
# Let's find the zodiac animals between Ox and Monkey
ox_idx = chinese_zodiac.find('Ox')
monkey_idx = chinese_zodiac.find('Monkey')

print(chinese_zodiac[ox_idx:monkey_idx])

In [ ]:
# Wait, I wanted to exclude Ox
ox_end = ox_idx + len('Ox ')
print(chinese_zodiac[ox_end:monkey_idx])
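One thing to watch for when parsing this way: if find doesn't locate the word, it returns -1, and -1 is still a valid index in a slice, so you get a wrong answer instead of an error. The sketch below uses made-up names (zodiac, end_word, result) and shows one possible way to guard against that.

```python
zodiac = "Rat Ox Tiger Rabbit Dragon Snake Horse Goat Monkey Rooster Dog Pig"

start_idx = zodiac.find('Ox')
end_word = 'Cat'  # not in the zodiac
end_idx = zodiac.find(end_word)

# end_idx is -1, so this slice silently runs to the second-to-last character
silently_wrong = zodiac[start_idx:end_idx]
print(silently_wrong)

# Checking for -1 first avoids the surprise
if start_idx == -1 or end_idx == -1:
    result = None
    print('Could not find both words')
else:
    result = zodiac[start_idx:end_idx]
    print(result)
```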

TRY IT

Use string parsing strategies to extract the host name (in this case gmail) from the email address. (And no, just counting doesn't count.)


In [ ]:
email = 'my.name@gmail.com'

Formatting

String concatenation gets old really fast, and casting numbers and booleans as strings does too. Luckily, there is a better option.

String formatting allows you to include variables directly in your string.

'string {0} '.format(var1)

Each parameter passed to format has an index and can be accessed in the string using {idx}. They don't have to be in order and can be repeated.


In [ ]:
print('The {0} has {1} toes per limb and thus is considered {2}'.format('ox', 4, 'yin'))
print('The {0} has {1} toes per limb and thus is considered {2}'.format('tiger', 5, 'yang'))

# I learned something when creating this notebook

You don't have to reference the parameters in order, and you can reference the same one more than once.


In [ ]:
print('The {1} has {0} toes per limb and thus is considered {2}'.format('ox', 4, 'yin'))
print('The {2} has {2} toes per limb and thus is considered {2}'.format('tiger', 5, 'yang'))

You can also use variable names. In the parameters, use key=value syntax, or unpack a dictionary (you'll learn about these soon) with **.
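Here is a sketch of both options side by side. The names template and traits are made up for this example; the ** in front of the dictionary unpacks it into key=value parameters.

```python
template = "The {animal}'s attribute is {attribute}"

# Option 1: key=value parameters
from_keywords = template.format(animal='snake', attribute='flexibility')
print(from_keywords)

# Option 2: a dictionary, unpacked with **
traits = {'animal': 'snake', 'attribute': 'flexibility'}
from_dict = template.format(**traits)
print(from_dict)

print(from_keywords == from_dict)  # True
```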


In [ ]:
"The {animal}'s attribute is {attribute}".format(animal='snake', attribute='flexibility')

You can even format the variables in various ways. Refer to the docs for the full list; there is far more you can do than I have time to show you.

https://docs.python.org/3.4/library/string.html


In [ ]:
print('{:<30}'.format('left aligned'))
print('{:>30}'.format('right aligned'))
print('{:0.2f}; {:0.7f}'.format(3.14, -3.14))
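A few more format specifiers you may find handy, as a quick sketch (the variable names are only for illustration): rounding to a fixed number of decimal places, padding a number with zeros, adding thousands separators, and combining alignments in one string.

```python
rounded = '{:.2f}'.format(3.14159)
print(rounded)  # 3.14

zero_padded = '{:05d}'.format(42)
print(zero_padded)  # 00042

grouped = '{:,}'.format(1234567)
print(grouped)  # 1,234,567

# Left-align the animal in 8 characters, right-align the number in 4
row = '{:<8}{:>4}'.format('ox', 4)
print(row)
```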

TRY IT

Use string formatting to make a sentence "[Your name] finds string formatting [difficulty level]"


In [ ]:

PROJECT: DNA EXTRAVAGANZA

You are going to create a program that performs some very simple bioinformatics operations on a DNA input.

Background

A little bit of molecular biology. Codons are non-overlapping triplets of nucleotides.

ATG CCC CTG GTA ... - this corresponds to four codons; spaces added for emphasis

The start codon is 'ATG'

Stop codons can be 'TGA', 'TAA', or 'TAG', but they must be 'in frame' with the start codon. The first stop codon usually determines the end of the gene. In other words:

'ATGCCTGA...' - here TGA is not a stop codon, because the T is part of CCT
'ATGCCTTGA...' - here TGA is a stop codon because it is in frame (i.e. a multiple of 3 nucleotides from ATG)

The gene runs from the start codon to the stop codon, inclusive. Example:

dna - GGCATGAAAGTCAGGGCAGAGCCATCTATTTGAGCTTAC
gene - ATGAAAGTCAGGGCAGAGCCATCTATTTGA
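To make 'in frame' concrete, here is a small sketch that checks the two short examples above. It is not one of the project functions, and the variable names are made up; it only shows that the distance from the start codon to an in-frame codon is a multiple of 3, which you can test with the % (remainder) operator.

```python
out_of_frame_dna = 'ATGCCTGA'
in_frame_dna = 'ATGCCTTGA'

# Distance from the start codon (index 0 in both examples) to the first 'TGA'
out_distance = out_of_frame_dna.find('TGA') - out_of_frame_dna.find('ATG')
in_distance = in_frame_dna.find('TGA') - in_frame_dna.find('ATG')

print(out_distance, out_distance % 3 == 0)  # 5 False -> not a stop codon
print(in_distance, in_distance % 3 == 0)    # 6 True  -> a real stop codon
```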

Instructions

  1. Write a function called numCodons that takes a dna string and returns how many codons are in it (a codon is a group of 3 DNA bases). Examples: AAACCC -> 2, GT -> 0
  2. Write a function called startCodonIndex which finds the index of the first start codon 'ATG' and returns -1 if none are found.
  3. Write a function called stopCodonIndex which finds the index of the first stop codon 'TAA' or 'TAG' or 'TGA' in frame with the start codon (found from startCodonIndex) and returns -1 if none are found.
  4. Write a function called codingDNA which returns the substring of the DNA from the beginning of the start codon to the end of the stop codon (please for the love of all things, use the functions you already wrote to calculate start and stop)
  5. Write a function called transcription that takes the DNA and transcribes it to RNA. Each letter should be converted using these mappings (A->U), (T->A), (C->G), (G->C).
  6. Write a function called DNAExtravaganza that calls your functions and prints out (using string formatting):

    DNA: [DNA]

    CODONS: [Number of codons]

    START: [start index]

    STOP: [stop index]

    CODING DNA: [coding DNA string]

    TRANSCRIBED RNA: [transcribed DNA]

You can use these as test DNA strings:

   dna='GGCATGAAAGTCAGGGCAGAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTTTCTGATAGGCACTGACTCTCTCTGCCTATTGGTCTATTTTCCCACCCTTAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGGTGAGTCTATGGGACCCTTGATGTTTTCTTTCCCCTTCTTTTCTATGGTTAAGTTCATGTCATAGGAAGGGGAGAAGTAACAGGGTACAGTTTAGAATGGGAAACAGACGAATGATT'

    dna = 'GGGATGTTTGGGCCCTACGGGCCCTGATCGGCT'

In [ ]: