Discovering the python environment

The practical lab is based on the following tools:

  • python : a general purpose language
    • modern language / very clean / not limited
    • gentle learning curve - easy for beginners
  • python standard library
    • HUGE
  • ipython an interactive layer over the python language
    • many convenient utilities added
    • check the %magic commands
  • numpy a library making numeric vectors first class python objects
  • scipy a library holding most standard mathematical and statistical methods
    • other libraries are available.
  • matplotlib a library for scientific plots
  • ipython notebook aka jupyter, a nice GUI for all of this, which also supports R and other environments
  • there are also libraries that we will not use, but which can be very useful for data scientists:
    • pandas is a very comprehensive data-analysis environment for python - similar to R
    • SymPy is a symbolic environment similar to Maple
    • scikit-learn a machine-learning environment for python

getting documentation

This is a very important part: you will find yourself spending more time reading documentation than writing code!

Do not hesitate to go through the different help systems, all available from here (look at the Help menu of this page, you will recognize the list).
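
The same documentation can also be reached from plain Python, without any menu. A minimal sketch, using only built-ins and the standard inspect module (the choice of sorted and str as subjects is arbitrary):

```python
import inspect

# the docstring of any object is stored in its __doc__ attribute
doc = sorted.__doc__
print(doc.splitlines()[0])

# dir() lists the attributes and methods of an object
string_methods = [name for name in dir(str) if not name.startswith("_")]
print(string_methods[:5])

# inspect.getdoc() returns the cleaned-up docstring as a string
upper_doc = inspect.getdoc(str.upper)
print(upper_doc)

# help(str.upper) would display the same text in an interactive pager
```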

ipython notebook environment

you've already seen it - this is a notebook

main features

Within a notebook, you can freely mix

  • text, in markdown syntax, where you can have
    • headers,
    • lists and sub-lists,
    • typographic enhancements
    • URLs, http://www.python.org, pictures from the internet
  • example of code :

    sp = numpy.fft.rfft(fid)
  • equation, using the $\LaTeX$ syntax : $$ \ell_p(\mathbf{x}) = \left( \sum_{i=0}^N {(x_i)^p} \right)^{\frac{1}{p}} $$ but also in line : $ \ell_p(\mathbf{x}) = \left( \sum_{i=0}^N {(x_i)^p} \right)^{\frac{1}{p}} $

  • programs in python, but also R, julia, etc... (more about this later on)
    • running either locally, or on a remote ipython server
  • results from the programs (texts, graphics, etc..)
  • even interactive environment
  • etc...

please double-click on this cell to see the internal magic!

convenient user interface

  • all the language documentation is available from this page
  • comprehensive IDE on top of the python language: try
    • object? for help
    • object?? for code
    • obje + tab-key for code completion
    • object. + tab-key for attribute list
    • function( + shift-tab-key ) for interface description
  • convenient interactive commands
    • _ for last results
    • %history
    • %timeit
    • %debug
    • many other %magic commands
  • basic Unix shell
    • ls, cat, pwd - values are returned as python variables !
    • %cd is slightly special
    • ! any command

In [1]:
pwd


Out[1]:
u'/Users/mad/Documents/ mad/ en cours/python/MemoBio2015'

In [2]:
ls


FTICR-Files/        LICENSE             Pract_CS.ipynb      README.md           embryos.tif
FTICR_1.ipynb       OMP_example.ipynb   Presentation.ipynb  clown.jpg

python language

main features

Python is a scripting language, meant to tie things together. Over time, many capabilities have been added; the one we are going to use is the scientific stack, which lets you program efficient computational tasks very rapidly and at a very high level.

One confusion to clear up at the very beginning: there are basically two flavors of the Python language:

  • python 2.7 - the one we are going to use
  • python 3.x (currently 3.5) with more features, but not fully adopted yet

The differences are minor, and most code that works in 2.7 will also work in 3.x as long as you check the following points:

  • print syntax: print(something) (both 2.7 and 3.5) rather than print something (2.7 only)
  • integer division: 1/2 gives 0 in 2.7 but 0.5 in 3.5, where integer division has to be requested explicitly with 1//2

This repository is meant to run under python 2.7. Most of the features should also work under 3.x, but you might need some tuning.


In [3]:
# this simple line allows the code to be version independent (kind of)
from __future__ import division, print_function
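
The division difference is the one most likely to change numerical results silently. A minimal sketch of expressions that give the same answer under both versions, with or without the import above (the variable names are illustrative):

```python
a, b = 7, 2

floor_q = a // b        # floor division, an integer in both versions
true_q = float(a) / b   # forcing a float gives a true division everywhere
neg_q = -7 // 2         # floor division rounds toward minus infinity

print("%d %.1f %d" % (floor_q, true_q, neg_q))   # 3 3.5 -4
```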

several native types are available


In [4]:
a = 1 # integer
b = 3.14 # floats
c = 1.1 + 2j  # complex
# but also unlimited precision integers :
l = 123456789012345678901234567890L
print ("l^2 = ", l*l)


l^2 =  15241578753238836750495351562536198787501905199875019052100

In [5]:
d = "Mary had a little cat "   #strings - strings are immutable, d[3] = "g"  will fail
dd = 'George had one too '   # ' and " are just the same
ddd = """ triple quotes indicate multi line string
very convenient for large texts
where you can easily use " and '
"""

m = None   # some predefined constants
n = True
o = False

e = (1.1, a, (b,c), d, a)      # tuples
f = [1.1, a, (b,c), d, a]    # lists  - lists and tuples are ordered, tuples are immutable
empty = []  # initialize an empty list
empty2 = ()  # even empty tuple

# index in tuples and lists start at 0.
print ("e[1] ",e[1])

# there are tools for creating lists and string
line = "*"*30   # this is 30 "*" in a row
line_extended = "#" + line + "#"    # is '#******************************#'
r = range(10)   # this is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
r.append("end")  # now r is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 'end']
# and MANY other tools

g = {"key1": 1.0,  "key2":  "horsemen",  1:"keys can be anything"}     # dictionaries
empty3 = {}     # initialize an empty dictionary
empty3["here"] = "there"   # set values in dictionaries

print  ('g["key2"]: ', g["key2"])
print ('g.keys(): ', g.keys(), 'g.values(): ', g.values())
h = set((1.1, a, (b,c), d, a))   # sets do not hold duplicated values, so there is only one a here
# dictionaries and sets are unordered

# MANY MANY other stuff (see standard library for types and associated functions )


e[1]  1
g["key2"]:  horsemen
g.keys():  ['key2', 'key1', 1] g.values():  ['horsemen', 1.0, 'keys can be anything']
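
The comments in the cell above state that strings and tuples are immutable and that sets drop duplicates. A small sketch that checks these claims on made-up values (the variable names are illustrative only):

```python
d = "Mary had a little cat"

# strings are immutable: item assignment raises a TypeError
try:
    d[3] = "g"
    string_is_mutable = True
except TypeError:
    string_is_mutable = False

# build a new string instead of modifying the old one
d_new = d[:3] + "g" + d[4:]

# tuples are immutable too, while a list built from one can be modified
t = (1, 2, 3)
lst = list(t)
lst[0] = 99

# a set keeps a single copy of each value
h = set([1, 1, 2, 2, 3])

print(string_is_mutable, d_new, lst, sorted(h))
```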

control structures

Indentation is meaningful and delimits the execution blocks (no need for { } or begin-end).


In [6]:
if (1==2):
    do(this)

for i in e:
    print(i)

s = 0
for j in range(10):
    s = s+j**2

while abs(c)<100:
    print(c)
    c = c**2


1.1
1
(3.14, (1.1+2j))
Mary had a little cat 
1
(1.1+2j)
(-2.79+4.4j)
(-11.5759-24.552j)

In [7]:
# range
m2 = range(10)      # 10 values from 0 to 9
m3 = range(2,15,3)  # 2 to 14, by steps of 3
print('m2 :', m2)
print('m3 :', m3)


m2 : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
m3 : [2, 5, 8, 11, 14]

In [8]:
# indexing
print(m2[3:])  # m2 from 3 to the end
print(m2[:5])  # m2 from beginning to 4
print(m2[::2])  # m2 by step of 2
print(m2[:7:2])  # m2 from beginning to 6 by step of 2
print(m2[:-3])   # m2 with all but last 3
print(m2[::-1])  # m2 reversed


[3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4]
[0, 2, 4, 6, 8]
[0, 2, 4, 6]
[0, 1, 2, 3, 4, 5, 6]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

functions


In [9]:
def func1(arg1):
    "example of function"
    value = 2*arg1   # do something with the argument
    return value

# arguments may have default value, in which case, they are optional (but come last in arg list)
def func2(arg1, arg2="default", arg3=3, arg4=None):
    "example of default arguments in function definition"
    if arg4 is None:  # preferred to == None
        # NEVER EVER use a mutable ([] for instance) as default value
        arg4 = []
    return (arg1, arg2, arg3, arg4)

print ( func2(5) )


(5, 'default', 3, [])

In [10]:
func2(6)


Out[10]:
(6, 'default', 3, [])
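
The warning about mutable default values deserves an illustration. A minimal sketch, with two hypothetical functions written only for this purpose: the default list is created once, when the function is defined, and is then shared by all calls.

```python
def bad_append(item, bucket=[]):
    "the default list is created once, when the function is defined"
    bucket.append(item)
    return bucket

def good_append(item, bucket=None):
    "a fresh list is created at each call"
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

first = bad_append(1)
second = bad_append(2)
print(second)            # [1, 2] : the default list kept the first item
print(first is second)   # True : both names point to the same list

print(good_append(1))    # [1]
print(good_append(2))    # [2]
```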

Unlike in some scripting languages, values are typed. However, a variable can successively hold values of different types, and a function accepts arguments of any type as long as the operations it performs are defined for them.


In [11]:
def combine(x,y):
    " combines two vars, using + and *"
    return x + 2*y

In [12]:
# this works
print ( combine(1, 2))
print ( combine("a", "b"))
print ( combine(d, dd))
# this doesn't
print ( combine("a", 2))


5
abb
Mary had a little cat George had one too George had one too 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-94a45fea46c0> in <module>()
      4 print ( combine(d, dd))
      5 # this doesn't
----> 6 print ( combine("a", 2))

<ipython-input-11-b9452edd5bd7> in combine(x, y)
      1 def combine(x,y):
      2     " combines two vars, using + and *"
----> 3     return x + 2*y

TypeError: cannot concatenate 'str' and 'int' objects

this is called "duck typing":

  • if it quacks like a duck, consider it a duck
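
Duck typing also applies to your own objects. A minimal sketch, assuming a hypothetical Vec2 class defined only for this illustration: since it provides + and multiplication by a number, combine() accepts it without any change.

```python
def combine(x, y):
    "same function as above"
    return x + 2*y

class Vec2(object):
    "hypothetical 2D vector, defined only for this illustration"
    def __init__(self, u, v):
        self.u = u
        self.v = v
    def __add__(self, other):
        # called for  self + other
        return Vec2(self.u + other.u, self.v + other.v)
    def __rmul__(self, factor):
        # called for  factor * self  when factor does not know about Vec2
        return Vec2(factor*self.u, factor*self.v)
    def __repr__(self):
        return "Vec2(%r, %r)" % (self.u, self.v)

result = combine(Vec2(1, 2), Vec2(10, 20))
print(result)                # Vec2(21, 42)
print(combine([1], [2]))     # [1, 2, 2] : lists quack too
```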

python has Classes and methods for Object Oriented Programming


In [13]:
class MyClass(object):  # here we inherit from basic object - may inherit from any other class
    "a minimal class"
    def __init__(self, arg):
        "this is the 'creator' of the object"
        self.arg1 = arg  # here you create an object attribute
        self.arg2 = "initial"  # here another
    def method1(self):
        "here we define a method for this object"
        if self.arg1:
            v = self.arg2.upper()
        else:
            v = self.arg2.lower()
        return v
# then we can create
ob1 = MyClass(True)
ob1.arg2 = "ExAmPlE"
print ( ob1.method1() )
ob1.arg1 = False
print ( ob1.method1() )


EXAMPLE
example
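
The comment on the class line mentions inheritance from any other class. A minimal sketch of what this could look like, with a hypothetical subclass named Emphatic (MyClass is repeated, without comments, to keep the example self-contained):

```python
class MyClass(object):
    "same minimal class as above"
    def __init__(self, arg):
        self.arg1 = arg
        self.arg2 = "initial"
    def method1(self):
        if self.arg1:
            return self.arg2.upper()
        return self.arg2.lower()

class Emphatic(MyClass):
    "hypothetical subclass: inherits __init__, overrides method1"
    def method1(self):
        # reuse the parent implementation, then extend it
        return MyClass.method1(self) + " !"

ob2 = Emphatic(True)
print(ob2.method1())              # INITIAL !
print(isinstance(ob2, MyClass))   # True
```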

libraries

Standard python has a complete library of packages, which cover about everything you want to do with a computer : (regular expression, socket, web sites, interface with OS, cryptographic, threads, multiprocessing, etc...)

To load and use a library in a program, simply do one of these:

import library
#then use
library.tool()

import library as lib  # just an alias
#then use
lib.tool()

from  library import tool
#then use
tool()
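
The three forms above, applied to modules of the standard library (the modules and the file path are picked arbitrarily for the illustration):

```python
import math
import os.path as osp             # just an alias, shorter to type
from collections import Counter

print(math.sqrt(16.0))                      # 4.0
print(osp.basename("/tmp/data/run1.txt"))   # run1.txt
counts = Counter("abracadabra")
print(counts["a"])                          # 5
```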

You should definitely check the documentation

and MUCH more

python is a real, full-fledged language, created to be simple yet not limited. You should go through the on-line tutorial to get an idea of the possibilities of the language.

This is in contrast with most scripting languages, which are usually limited and/or started as a quick hack, and contain some initial defects which are hard to get rid of.

It is also in contrast with specialized languages (R, Matlab, PHP), which are optimized for a given task but have a hard time doing anything else (try doing big statistics in PHP, or a web site in Matlab!)

Shell mode

Jupyter can be used as a browser over a file system

we can also use IPython as a simple 'Unix-like' shell (ls, pwd, cat, etc...)


In [14]:
ls 'FTICR/Files/bruker ubiquitin file/ESI_pos_Ubiquitin_000006.d/'


ls: FTICR/Files/bruker ubiquitin file/ESI_pos_Ubiquitin_000006.d/: No such file or directory

In [15]:
cat 'FTICR/Files/bruker ubiquitin file/readme.txt'


cat: FTICR/Files/bruker ubiquitin file/readme.txt: No such file or directory

and ! can be used to call more specific Unix commands


In [16]:
!find . -name '*.method'


./FTICR-Files/ESI_pos_Ubiquitin_000006.d/ESI_pos_150_3000.m/apexAcquisition.method

numpy

numpy is the library that provides a numerical multidimensional array type, which makes numerical computations efficient.

All the elements in a numpy array have the same type (int, float, complex, etc...), and the computations are performed by calls to optimized machine-level code.


In [17]:
import numpy    # this is how you load an external library
import numpy as np   # this is the standard way of loading numpy

x = np.linspace(0,5,1000)   # create a series of 1000 points ranging from 0.0 to 5.0
y = 1 + np.sin(x)**2   # do some arithmetic with x
print('y_100: ',y[100])   # elements are then accessed as in simple lists


y_100:  1.23027013899
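
The fact that all the elements of an array share one type has visible consequences. A minimal sketch on made-up values: numpy picks the smallest type able to hold every element, and converting to integers drops the fractional part.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])     # one float is enough: everything becomes float
cplx = np.array([1, 2.5, 3j])     # one complex: everything becomes complex

print(ints.dtype.kind, mixed.dtype, cplx.dtype)

# converting to integers drops the fractional part
truncated = np.array([3.9, -3.9]).astype(int)
print(truncated)
```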

In [18]:
# multidimensional
mat = np.array([[0,1,2],[3,4,5],[6,7,8]])
print(mat)
print(mat[1,2])
print(mat[1,:])
print(mat[:,2])
print(mat.T)


[[0 1 2]
 [3 4 5]
 [6 7 8]]
5
[3 4 5]
[2 5 8]
[[0 3 6]
 [1 4 7]
 [2 5 8]]

In [19]:
# creators
print(np.zeros(10))
print(np.zeros(5, dtype=complex))
print(np.ones(10))
print(np.arange(10))   # note the int
print(np.arange(10.0)) # note the float


[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
[0 1 2 3 4 5 6 7 8 9]
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]

In [20]:
print("initialize to 0 or 1")
print(np.zeros( (2,3) ) )  # note the tuple
print(np.ones( (3,2) ) )
print("a diagonal matrix")
print(np.eye(5))
print( np.eye(5).shape )
print ("a random array")
print(np.random.randn( 5,3 ) )    # note the 2 arguments
print ("you have more than 2 dimension")
print(np.random.randn( 4,3,2 ) )  #


initialize to 0 or 1
[[ 0.  0.  0.]
 [ 0.  0.  0.]]
[[ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]]
a diagonal matrix
[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]]
(5, 5)
a random array
[[-0.10394185  0.62937842  0.60273962]
 [ 2.91122215 -0.40419939  2.26364137]
 [-1.00206994 -1.91517684 -0.26611194]
 [-1.77874634  0.6531998   1.01299359]
 [-1.48771406  0.36463991 -0.93075   ]]
you have more than 2 dimension
[[[ 0.46371463 -2.47986633]
  [ 3.2300485   1.83383053]
  [-2.40167934  1.3126498 ]]

 [[-0.38123022  0.42393782]
  [ 0.6102568   1.02480006]
  [-0.22208881 -1.07729831]]

 [[-1.69359621  2.19085341]
  [-1.07862808 -0.65571953]
  [ 0.38365083  1.39372566]]

 [[ 1.36922795  0.36695679]
  [-1.22182109  0.54491912]
  [-0.70386927 -0.92887869]]]

In [21]:
A = np.eye(5)
B = np.random.randn(5,5)
print("you can do arithmetics with array")
D = A -2*B    # arithmetic
x = np.arange(5.0)
y = x*x      # this is an element-wise multiplication
print (np.dot(y,y))  # this is the scalar product
print (np.dot(D,y))  # this is the matrix product


you can do arithmetics with array
354.0
[ 30.16290767   5.93351722  63.71724411  -1.73457151  76.8875269 ]

In [22]:
A


Out[22]:
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])

speed

The point is that these computations are very fast, as they are not performed by the python interpreter, but rather by a call to an optimized library, written in C or FORTRAN.


In [23]:
x = np.linspace(0,10,100000)  # on 1E5 points
%timeit y = np.sin(2*x + 3*x**3)


100 loops, best of 3: 4.61 ms per loop
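
To get a feeling for the gain, the same expression can be evaluated with a pure python loop. A minimal sketch using the standard timeit module instead of the %timeit magic, so that it also runs outside IPython; the two helper functions are hypothetical, and the timings depend on the machine:

```python
import math
import timeit
import numpy as np

x = np.linspace(0, 10, 10000)

def with_loop(values):
    "pure python: one interpreter round-trip per element"
    return [math.sin(2*v + 3*v**3) for v in values]

def with_numpy(values):
    "vectorized: the loop runs inside compiled code"
    return np.sin(2*values + 3*values**3)

t_loop = timeit.timeit(lambda: with_loop(x), number=5)
t_numpy = timeit.timeit(lambda: with_numpy(x), number=5)
print("loop: %.4f s   numpy: %.4f s" % (t_loop, t_numpy))

# both approaches compute the same values
same = np.allclose(with_loop(x), with_numpy(x))
print("same result: %s" % same)
```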

Difference with Matlab

This is very close to the Matlab approach. However, there are some differences (simplified here).

  • we're in python, so an $n$ long array starts at 0 and finishes at $n-1$
  • all indexing techniques presented above for lists work as well
  • all operations are element-wise, so * in numpy is equivalent to .* in Matlab
  • if you need a matrix product, use numpy.dot()
  • the transpose of the matrix A is A.T
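
The differences listed above, checked on two small made-up matrices (the Matlab equivalents are given in the comments):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

elementwise = A * B          # like A .* B in Matlab
matrix_prod = np.dot(A, B)   # like A * B in Matlab
transposed = A.T             # like A' in Matlab (for real matrices)
first = A[0, 0]              # indices start at 0: A(1,1) in Matlab
last_row = A[-1, :]          # negative indices count from the end

print(elementwise)
print(matrix_prod)
```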

Additionally, memory management is somewhat better than in Matlab: in-place operators such as -= modify an existing array instead of creating a new one.


In [24]:
D = A - 2*B   # this creates a new matrix in memory
A -=  2*B    # this does not
print(A)


[[ 2.07846945  0.93857031  0.39444619 -0.68274077  2.11195122]
 [ 3.0745776   2.4800333  -0.35641843 -1.73957114  1.28345612]
 [-3.48531224 -1.99364509  5.80436863  0.72341778  2.24891591]
 [-0.58052093  1.94567111  3.29439194 -3.94639945  1.16623654]
 [-2.59917792 -2.66250906  0.71678823  1.56256371  3.9137381 ]]
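
One way to convince yourself that -= works in place is to keep a second name on the array. A minimal sketch on small made-up arrays (note that the right-hand side 2*B is still a temporary array in both cases):

```python
import numpy as np

A = np.eye(3)
B = np.ones((3, 3))
alias = A              # a second name for the same array

A -= 2*B               # in place: the existing array is modified
in_place_same = alias is A            # True
alias_changed = alias[0, 0] == -1.0   # True, since 1 - 2 = -1

A = A - 2*B            # a new array is created, and the name A is rebound to it
rebinding_same = alias is A           # False
print(in_place_same, alias_changed, rebinding_same)
```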

scipy

this library contains many mathematical tools

  • optimizers
  • special functions
  • statistical tools
  • Fourier transform
  • linear algebra; eigenvalues
  • integrals, ODE
  • sparse arrays
  • ...

check the doc !
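
To give a taste of the list above, here is a minimal sketch using three scipy routines on textbook problems whose answers are known. The problems are picked only for the illustration, and scipy is assumed to be installed next to numpy:

```python
import numpy as np
from scipy import integrate, optimize

# integral of sin(x) between 0 and pi, the exact value is 2
area, abs_err = integrate.quad(np.sin, 0, np.pi)

# root of cos(x) between 0 and 2, the exact value is pi/2
root = optimize.brentq(np.cos, 0, 2)

# minimum of (x - 1)**2 + 3, reached at x = 1
res = optimize.minimize_scalar(lambda x: (x - 1)**2 + 3)

print(area, root, res.x)
```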

matplotlib

This is the plotting library, very similar to the Matlab one.

There are two ways of using it:

  • basic - using simple, Matlab-like routines, grouped in a module called pylab

    import matplotlib.pylab as plt
    
  • advanced - using the OOP approach

We'll stick to the simple one.


In [25]:
import matplotlib.pylab as plt   # traditional import
# this magic command embeds graphics into the page
%matplotlib inline

x = np.linspace(0, 4*np.pi, 100)

plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x), 'r')  # 'r' means red


Out[25]:
[<matplotlib.lines.Line2D at 0x106439dd0>]
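
For comparison, here is a sketch of the same plot in the OOP style, where the figure and the axes are explicit objects. It is written to run as a plain script, hence the Agg backend and the in-memory buffer, which are assumptions of this example; in a notebook, %matplotlib inline makes both unnecessary.

```python
import io
import matplotlib
matplotlib.use("Agg")   # assumption: no display available, render off-screen
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 4*np.pi, 100)

fig, ax = plt.subplots(figsize=(6, 4))   # explicit figure and axes objects
ax.plot(x, np.sin(x), label="sin")
ax.plot(x, np.cos(x), "r", label="cos")  # 'r' means red
ax.set_xlabel("x")
ax.legend()

buf = io.BytesIO()                       # write the picture to memory, not to disk
fig.savefig(buf, format="png")
```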

check

  • semilogx() semilogy() loglog() for log-plots
  • scatter() stem() bar() for different formats
  • contour() contourf() for 2D contour plots
  • imshow() for images
  • ...

Documentation is a bit complex and confusing!

Useful references:

some more advanced examples :

Showing the generic kaiser() function, which approximates many different windows used for apodisation in Fourier spectroscopy


In [26]:
plt.figure(figsize=(8,6))    # forces size  (x,y)
for beta in range(11):
    plt.plot(np.kaiser(100, beta), label=r"$\beta=%.1f$"%beta)
    # create a label, using LaTeX syntax and % operator for string formatting
plt.legend(loc=0)    # show the legend, loc=0 means "optimal" zone


Out[26]:
<matplotlib.legend.Legend at 0x1065b9350>

using the scatter function, to code 4 values : x, y, size, color


In [27]:
N = 50
x = np.random.rand(N)   # generates random values
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2 # 0 to 15 point radii

plt.scatter(x, y, s=area, c=colors, alpha=0.5)   # alpha is transparency


Out[27]:
<matplotlib.collections.PathCollection at 0x106c22550>

Contour plots are possible also, as well as multi-images


In [28]:
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
xx, yy = np.meshgrid(x, y)
z = np.sin(xx**2 + yy**2) / (xx**2 + yy**2)
plt.figure(figsize=(10,4))    # forces size  (x,y)
plt.subplot(121)      # 1 row, 2 columns, first panel
h = plt.contourf(x,y,z)    # 'filled' contours
plt.subplot(122)
h = plt.contour(x,y,z)     # empty ones


as a conclusion

This whole series (python, ipython, jupyter, numpy, scipy, matplotlib) makes up a very nice environment for scientists. It is free, fast, quite complete, and very efficient.

There is in ipython a magic command that imports everything into the current namespace:

%pylab inline

We are not going to use it, as it is a quick hack, good for tiny projects, and considered harmful by many, and we are here in a school!

examples

There is a large number of possibilities and tricks to play with this environment.


In [29]:
print("compute a difference")
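x = numpy.linspace(0,10,1000)
y = numpy.sin(x)
dx = x[1] - x[0]   # sampling step: 10/999, as linspace includes both ends
yp = (y[1:] - y[:-1])/dx   # finite-difference estimate of the derivative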
plt.plot(x, y, label='y' )
plt.plot(x[1:], yp, label="yp")
plt.legend()


compute a difference
Out[29]:
<matplotlib.legend.Legend at 0x10752ac90>
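
The quality of such a finite difference can be checked against the known derivative of the sine. A minimal sketch, using np.diff (which computes y[1:] - y[:-1]) and comparing at the mid-points of the sampling intervals, where the estimate is most accurate:

```python
import numpy as np

x = np.linspace(0, 10, 1000)
y = np.sin(x)

dx = x[1] - x[0]                  # sampling step
yp = np.diff(y) / dx              # same as (y[1:] - y[:-1]) / dx
x_mid = 0.5*(x[1:] + x[:-1])      # mid-points of the sampling intervals

err = np.max(np.abs(yp - np.cos(x_mid)))
print("largest deviation from cos(x): %g" % err)
```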

In [30]:
print("accessing pictures")
s = plt.imread("clown.jpg")
print(s.shape)
plt.imshow(s)


accessing pictures
(200, 320, 3)
Out[30]:
<matplotlib.image.AxesImage at 0x10769b2d0>

In [31]:
plt.plot(s[70,:,1])


Out[31]:
[<matplotlib.lines.Line2D at 0x10779f850>]

In [32]:
print("compute histogram")
sg = s.sum(axis=2)/3.0
print (sg.shape)
h = plt.hist(sg.ravel(), bins=255)


compute histogram
(200, 320)

In [33]:
print("accessing pictures")
s = plt.imread("embryos.tif")
print(s.shape)
plt.imshow(s)


accessing pictures
(1200, 1600, 3)
Out[33]:
<matplotlib.image.AxesImage at 0x10809c750>

In [34]:
for i in range(3):
    plt.figure()
    plt.imshow(s[:,:,i], cmap='gray')



In [35]:
c1 = 1.0*s[:,:,0]
h = plt.hist(c1.ravel(), bins=255)



In [36]:
plt.plot(h[0])


Out[36]:
[<matplotlib.lines.Line2D at 0x11365f990>]

In [37]:
print("thresholding")
mask = numpy.where(c1<162,1,0)
plt.imshow(mask, cmap="gray_r")


thresholding
Out[37]:
<matplotlib.image.AxesImage at 0x1084a3790>

In [38]:
cleaned = c1*mask
plt.imshow(cleaned, cmap="gray")


Out[38]:
<matplotlib.image.AxesImage at 0x1087c8610>
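
The thresholding step is easier to follow on a tiny array. A minimal sketch with a synthetic 2x2 "image" (made-up values, same threshold of 162 as above), also showing that a boolean array can serve directly as a mask:

```python
import numpy as np

img = np.array([[ 10.0, 200.0],
                [161.0, 162.0]])    # synthetic 2x2 "image"

mask = np.where(img < 162, 1, 0)    # 1 where the pixel is below the threshold
cleaned = img * mask                # pixels at or above the threshold become 0

bool_mask = img < 162               # a boolean array works as a mask too
cleaned_bis = np.where(bool_mask, img, 0.0)

print(mask)
print(cleaned)
```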

In [39]:
import matplotlib.cbook as cbook
lena = plt.imread(cbook.get_sample_data("lena.png"))
plt.imshow(lena)
print("Lena tells you good bye!")


Lena tells you good bye!

A word on open science

science is about sharing

This is obvious: your research is useless to the community unless it is accessible to others.

That is why we write publications and present at conferences.

programming is science

A program is a way of presenting your ideas

Whether it is

  • an algorithm
  • an analysis
  • a modeling
  • ...

It actually presents what you did (or want to do)

program as a publication

So a program is as valuable as the text of the publication. It expresses science.

For this reason it should be

  • commented
  • tested
  • reproducible
  • accessible
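
As an illustration of the first three points, here is a minimal sketch of a commented, tested and reproducible piece of code. The function noisy_mean and the seed value are hypothetical, chosen only for this example:

```python
import random

def noisy_mean(n, seed):
    """Return the mean of n uniform random numbers in [0, 1).

    The generator is seeded explicitly, so that two runs with
    the same seed give exactly the same value.
    """
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / float(n)

# a minimal test: same seed, same result, and a sensible value
first = noisy_mean(1000, seed=2015)
second = noisy_mean(1000, seed=2015)
assert first == second
assert 0.4 < first < 0.6
```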

tools

There are tools and methods to help manage programs

this school

One of the purposes of this school was to raise awareness among scientists that

  • there are new and important approaches to data processing / analysis
  • there is life beyond Excel
  • programming is an important task

Software Carpentry

is an organization dedicated to teaching computing skills to scientists, with support from the Alfred P. Sloan Foundation and the Mozilla Foundation.

Activities:

  • short intensive workshops (“boot camps”)
  • online courses

http://software-carpentry.org/

