Importing modules and libraries

Like other languages, Python can import external modules (or libraries) into the current program. These modules may be part of the standard library that is included with the Python installation, extra libraries that you install separately, or other Python programs you have written yourself. Whatever its source, a module is brought into a program with an import statement.

For example, if we wish to access the mathematical constants pi and e, we can use the import keyword to load the module named math and access its contents with dot notation:


In [ ]:
import math
print(math.pi, math.e)

We can also use the as keyword to give the module a different name in our code, which can be useful for brevity and for avoiding name conflicts:


In [ ]:
import math as m
print(m.pi, m.e)
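
To illustrate the name-conflict point, here is a small sketch. The choice of the cmath module is ours and is not part of the original example: both math and cmath define a function called sqrt, and aliasing lets us use the two side by side.

```python
# Both modules define sqrt; the aliases keep the two versions apart.
from math import sqrt as real_sqrt
from cmath import sqrt as complex_sqrt

print(real_sqrt(4))      # square root computed with real numbers
print(complex_sqrt(-4))  # square root computed with complex numbers
```

Without the aliases, the second import would silently replace the first sqrt.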

Alternatively, we can import individual components using the from … import keyword combination:


In [ ]:
from math import pi, e
print(pi, e)

We can import multiple components from a single module, either on one line as seen above or on separate lines:


In [ ]:
from math import pi
from math import e

Listing module contents

To list the contents of a module, call the built-in function dir() and pass it the module:


In [ ]:
import math
dir(math)

or pass an instance directly, such as this string:


In [ ]:
dir("mystring")

or pass the object type:


In [ ]:
dir(str)
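
The list returned by dir() includes many special names that begin with an underscore. Here is a minimal sketch of filtering them out, assuming you only want the public names (the variable name public_names is illustrative):

```python
import math

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(math) if not name.startswith("_")]
print(public_names)
```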

Getting help from the official Python documentation

The most useful information is available online on the https://www.python.org/ website, which should be used as a reference guide.
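
Part of this documentation is also available offline through docstrings. In an interactive session, calling the built-in help() function on an object, for instance help(math.sqrt), displays its documentation. As a small sketch, inspect.getdoc() from the standard library returns the same docstring as text:

```python
import inspect
import math

# inspect.getdoc() returns the docstring of an object as a string.
doc = inspect.getdoc(math.sqrt)
print(doc)
```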

Python file library

os.path — Common pathname manipulations

  • exists(path) : returns whether path exists
  • isfile(path) : returns whether path is a “regular” file (as opposed to a directory)
  • isdir(path) : returns whether path is a directory
  • islink(path) : returns whether path is a symbolic link
  • join(*paths) : joins the paths together into one long path
  • dirname(path) : returns directory containing the path
  • basename(path) : returns the path minus the dirname(path) in front
  • split(path) : returns (dirname(path), basename(path))
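
The relationship between join, dirname, basename and split can be sketched as follows. The path used here is hypothetical, and none of these functions require the file to exist:

```python
import os.path

# Build a path from its components (hypothetical file location).
path = os.path.join("data", "results", "genes.txt")

directory = os.path.dirname(path)  # everything before the last component
filename = os.path.basename(path)  # the last component only

print(directory)
print(filename)

# split() returns both parts at once, and join() puts them back together.
print(os.path.split(path) == (directory, filename))
print(os.path.join(directory, filename) == path)
```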

os — Miscellaneous operating system interfaces

  • chdir(path) : change the current working directory to be path
  • getcwd() : return the current working directory
  • listdir(path) : returns a list of files/directories in the directory path
  • mkdir(path) : create the directory path
  • rmdir(path) : remove the directory path
  • remove(path) : remove the file path
  • rename(src, dst) : move the file/directory from src to dst
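
Here is a short sketch of several of these functions working together. It runs inside a temporary directory created with the tempfile module (our addition, so that no real files are touched), and the directory and file names are made up for the example:

```python
import os
import tempfile

# Work inside a temporary directory that is deleted automatically at the end.
with tempfile.TemporaryDirectory() as tmp_dir:
    project_dir = os.path.join(tmp_dir, "project")
    os.mkdir(project_dir)  # create a new directory

    old_path = os.path.join(project_dir, "draft.txt")
    new_path = os.path.join(project_dir, "final.txt")
    with open(old_path, "w") as f:
        f.write("some text\n")

    os.rename(old_path, new_path)       # move draft.txt to final.txt
    contents = os.listdir(project_dir)  # list what the directory holds now
    print(contents)

    os.remove(new_path)    # delete the file
    os.rmdir(project_dir)  # delete the (now empty) directory
    still_there = os.path.exists(project_dir)
    print(still_there)
```

Note that rmdir only removes an empty directory, which is why the file is removed first.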

Building the path to your file from a list of directories and a file name, rather than hard-coding the separator, allows your script to run on any platform.


In [ ]:
import os.path
os.path.join("data", "mydata.txt")
# data/mydata.txt - Unix
# data\mydata.txt - Windows

Check if a file exists before opening it:


In [ ]:
import os.path
data_file = os.path.join("data", "mydata.txt")
if os.path.exists(data_file):
    print("file", data_file, "exists")
    with open(data_file) as f:
        print(f.read())
else:
    print("file", data_file, "not found!")
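
Note that os.path.exists() also returns True for a directory, which open() cannot read. Here is a hedged variant, assuming we only want to read regular files. It wraps the check in a small function, where read_if_file is a hypothetical helper name and not part of the os module:

```python
import os.path
import tempfile

def read_if_file(path):
    """Return the text content of path, or None if path is not a regular file."""
    if os.path.isfile(path):
        with open(path) as f:
            return f.read()
    return None

# Try it out in a temporary directory with made-up content.
with tempfile.TemporaryDirectory() as tmp_dir:
    data_file = os.path.join(tmp_dir, "mydata.txt")
    with open(data_file, "w") as f:
        f.write("1 Human 1.076\n")

    found = read_if_file(data_file)            # a regular file
    directory_result = read_if_file(tmp_dir)   # a directory
    missing = read_if_file(os.path.join(tmp_dir, "missing.txt"))  # no such file

print(found, directory_result, missing)
```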

Exercise 3.1

Write a script that reads a tab-delimited file which has 4 columns: gene, chromosome, start and end coordinates. Check if the file exists, then compute the length of each gene and store its name and corresponding length in a dictionary. Write the results into a new tab-separated file. You can find a data file at data/genes.txt in the course materials.

Using the csv module

The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. The csv module provides functions and classes to read and write tabular data in CSV format.

The csv module’s reader() and writer() functions read and write CSV files. You can also read and write data in dictionary form using the DictReader and DictWriter classes.

For more information about this built-in Python library, see the CSV File Reading and Writing section of the official Python documentation.

Let's now read our space-separated file data/mydata.txt using the csv module.


In [ ]:
import csv
with open("data/mydata.txt", newline="") as f:  # newline="" is recommended by the csv documentation
    reader = csv.reader(f, delimiter=" ")  # default delimiter is ","
    for row in reader:
        print(row)

Replace csv.reader() with csv.DictReader() and each row is returned as a dictionary whose keys are taken from the column headers in the first line of the file.


In [ ]:
import csv
with open("data/mydata.txt", newline="") as f:
    reader = csv.DictReader(f, delimiter=" ")  # the first row provides the dictionary keys
    for row in reader:
        print(row)
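
The csv module returns every field as a string, so numeric columns need converting before any arithmetic. The sketch below illustrates this with a small in-memory file. The column names and values are made up for the example and are not the actual content of data/mydata.txt:

```python
import csv
import io

# Hypothetical space-separated data held in memory instead of a file.
text = "Index Organism Score\n1 Human 1.076\n2 Mouse 1.202\n"

scores = {}
for row in csv.DictReader(io.StringIO(text), delimiter=" "):
    # Each value is a string; convert the score to a number explicitly.
    scores[row["Organism"]] = float(row["Score"])

print(scores)
```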

In [ ]:
# Write a tab delimited file using the csv module
import csv

mydata = [
    ['1', 'Human', '1.076'], 
    ['2', 'Mouse', '1.202'], 
    ['3', 'Frog', '2.2362'], 
    ['4', 'Fly', '0.9853']
]

with open("data.txt", "w", newline="") as f:  # newline="" avoids extra blank lines on Windows
    writer = csv.writer(f, delimiter='\t' )
    writer.writerow( [ "Index", "Organism", "Score" ] ) # write header
    for record in mydata:
        writer.writerow( record )

# Open the output file and print out its content
with open("data.txt") as f:
    print(f.read())

In [ ]:
# Write a delimited file using the csv module from a list of dictionaries 
import csv

mydata = [
    {'Index': '1', 'Score': '1.076', 'Organism': 'Human'}, 
    {'Index': '2', 'Score': '1.202', 'Organism': 'Mouse'}, 
    {'Index': '3', 'Score': '2.2362', 'Organism': 'Frog'}, 
    {'Index': '4', 'Score': '0.9853', 'Organism': 'Fly'}
]

with open("dict_data.txt", "w", newline="") as f:  # newline="" avoids extra blank lines on Windows
    writer = csv.DictWriter(f, mydata[0].keys(), delimiter='\t')
    writer.writeheader() # write header

    for record in mydata:
        writer.writerow( record )

# Open the output file and print out its content
with open("dict_data.txt") as f:
    print(f.read())

Exercise 3.2

Now change the script you wrote for Exercise 3.1 to make use of the csv module.

Create your own module

So far we have been writing Python code in files as executable scripts, without noting that these files are also modules whose functions can be imported and called from other programs.

A module is a file containing Python definitions and statements. The file name is the module name with the suffix .py appended. Create a file called my_first_module.py in the current directory with the following contents:


In [ ]:
def say_hello(user):
    print('hello', user, '!')

Now start the Python interpreter from the directory in which you created the my_first_module.py file and import the say_hello function from this module with the following commands:

python3
Python 3.5.2 (default, Jun 30 2016, 18:10:25) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from my_first_module import say_hello
>>> say_hello('Anne')
hello Anne !
>>>

A module called my_first_module.py is already stored in the course directory. To import it into this notebook, run the cell below. If you edit this file to change the code or add another function, you will have to restart the notebook, using the restart the kernel button in the menu bar, for the changes to take effect.


In [ ]:
from my_first_module import say_hello
say_hello('Anne')

A module can contain executable statements as well as function definitions. These statements are intended to initialize the module. They are executed only the first time the module name is encountered in an import statement. They are also run if the file is executed as a script.

Comment out these executable statements if you do not want them to run when your module is imported.
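
Instead of commenting statements out, a common Python idiom is to place them under a check of the special variable __name__, which is set to "__main__" only when the file is run as a script. Here is a minimal sketch, assuming it is saved as a module file (make_greeting is a hypothetical helper added for illustration):

```python
def make_greeting(user):
    """Return the greeting text for user (hypothetical helper)."""
    return "hello " + user + " !"

def say_hello(user):
    """Print the greeting for user."""
    print(make_greeting(user))

if __name__ == "__main__":
    # This block runs when the file is executed as a script,
    # but not when the file is imported as a module.
    say_hello("Anne")
```

With this layout, importing the module only defines the two functions, while running the file directly also prints the greeting.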

For more information about modules, see https://docs.python.org/3/tutorial/modules.html.

Exercise 3.3

Calculate the GC content of a DNA sequence

Write a function that calculates the GC content of a DNA sequence.

Extract the list of all overlapping sub-sequences

Write a function that extracts a list of overlapping sub-sequences for a given window size from a given sequence. Do not forget to test it on a given DNA sequence.

Exercise 3.4

Calculate GC content along the DNA sequence

Combine the two functions written above to calculate the GC content of each overlapping sliding window along a DNA sequence from start to end.

Import the two functions you wrote in Exercise 3.3 to solve this exercise.

The new function should take two arguments, the DNA sequence and the size of the sliding window, and reuse the functions already written to calculate the GC content of a DNA sequence and to extract the list of all overlapping sub-sequences. It should return a list of GC% values along the DNA sequence.

Next session

Go to our next notebook: python_functions_and_modules_4