Course Introduction

What is Python?

If you are not a programmer, Python is a language designed for you. Here is what Guido van Rossum, the creator of Python, wrote in 1999 in his funding proposal to DARPA, titled "Computer Programming for Everybody":

"...while many people nowadays use a computer, few of them are computer programmers. Non-programmers aren't really "empowered" in how they can use their computer: they are confined to using applications in ways that "programmers" have determined for them. One doesn't need to be a visionary to see the limitations here."

So Python was conceived to be plain and simple, yet to let you do everything a more complex language would. Many languages have tried to open up to a wider audience, but few had this goal as their primary founding principle. The result is that today Python has one of the most vibrant developer communities of any language, and that translates into good libraries and readily available expertise.

Just like a human language, a computer language carries a certain culture, together with a set of values. A popular source describing Python's culture is the Zen of Python, a collection of 19 aphorisms similar in style to a Taoist book. Most of them are addressed to programmers, but some are easy for anyone to understand:

Beautiful is better than ugly
Explicit is better than implicit
Simple is better than complex
Complex is better than complicated
Readability counts
etc ...
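The full text ships with every Python installation: typing 'import this' in a console prints it. The sketch below does the same thing quietly, so that the text can be handled as data. It relies on the 'this' module keeping its text ROT13-encoded in an attribute called 's', which is a detail of the standard CPython interpreter and not a documented interface, so treat it as an assumption.

```python
import codecs
import contextlib
import io

# Importing 'this' prints the Zen of Python once; silence it here so we
# can work with the text instead.
with contextlib.redirect_stdout(io.StringIO()):
    import this

# The module keeps its text ROT13-encoded in 'this.s' (a CPython detail).
zen_text = codecs.decode(this.s, "rot13")
lines = zen_text.splitlines()

title = lines[0]
aphorisms = [line for line in lines[1:] if line.strip()]

print(title)
print("Number of aphorisms:", len(aphorisms))
for line in aphorisms[:5]:
    print(line)
```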

The course I am trying to hold fits the Python creed well. It is addressed to anyone, it is open source, it is short and simple, and yet it introduces the participants to subjects that would otherwise take years to master. I could probably hold this course in ten languages, but only in Python can I cover all the subjects in three days.

A short history

The first programming languages had a concise syntax and were able to control the computer at a very low level. The best surviving dinosaur from that era is probably C, a language that lets you directly control the memory the processor is using. In time, as software accumulated and the IT infrastructure evolved, languages became specialized, with languages like C++ and Java dominating the commercial sector while niche languages such as Matlab and R were employed in engineering and statistics respectively. Meanwhile, languages became more distant from the hardware, being interpreted by an engine that sits between the operating system and the program itself. Languages that are ready to run the moment the writing stops are commonly called scripting languages. Another important change occurred in the way data structures are managed, with some languages allowing the type of an object to be checked and changed during runtime. These are called dynamic languages.
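To make the last point concrete, here is a minimal sketch of what "dynamic" looks like in Python. The variable 'sample' and the helper 'describe' are invented for illustration.

```python
def describe(value):
    """Return the name of the value's type, looked up while the program runs."""
    return type(value).__name__


sample = 42                 # the name 'sample' refers to an integer
first = describe(sample)

sample = "ACGT"             # the same name now refers to a string
second = describe(sample)

sample = [0.1, 0.2]         # ... and now to a list
third = describe(sample)

print(first, second, third)
print(isinstance(sample, list))   # types can be checked at runtime
```

In a static language such as C or Java, the type of 'sample' would be fixed when the program is compiled, and the second assignment would be rejected.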

Rise of the scripting languages. It is difficult to explain why scripting languages became so ubiquitous. It has to do with lack of time and with safety. In C it is relatively easy to make mistakes that corrupt memory or crash the machine, it takes more time to get something done if you are not skilled, and you write ugly code if you are not talented. The general purpose scripting languages that stood out were Perl and Python. Perl came first, however, and had an intentionally obfuscated syntax, because it was primarily addressed to programmers, who love syntax puzzles. Many sequencing libraries were developed in Perl or integrated with it. R also developed, as an open source alternative to SAS, a language used in statistics. When Python started, Perl dominated text processing, R dominated open source statistical computing and Javascript dominated the web. But Python was addressed to everyone, and within a decade it became the number one scripting language in terms of popularity.

Future. I am not trying to speculate; Python has made its point. Today older languages are copying Python features, while new languages are emerging that grew out of Python concepts and will ultimately compete for broad audiences. I myself have programmed in 30 or more different languages and kept my sanity, so there is no reason for alarm. Python is probably the best introduction to programming at this moment in time, which is why many universities and schools are including it as the main tool in their curricula.

As a new "pythonista", who will be your friends?

As a data scientist working in biology, or as a biologist working with data science, inviting Python to a party brings along a few friends. I recommend you get to know them in time, and maybe some of them are already familiar. The beginner's mistake is to think one knows programming after a three day course. The following list should not discourage you: in three days you will already be able to do what programmers spend most of their time on, and maybe free some of them for more important work.

  • R. It is hard to extract your data if you avoid R. Many Python fans complain of R's byzantine syntax, or quote what R experts say, that R is a platform with a language instead of the opposite. This is all true of course, but keep in mind that R, just like Python, has the real programming done in C, and most of the best libraries are also compiled from C. Much of the bioinformatics toolset today is found in R, and Python is a perfect glue for R calls.

  • C. No matter how good computers become, core stuff will always need to run fast and not hog resources. C is the mother of most software, including the interpreters for most programming languages. The best Python libraries for scientific computing, for example, are mere calls to compiled C code. C is not as hard as it used to be. Notably, Cython is a way to call C and C++ to and from Python. SWIG is another language-agnostic tool which makes it easy to design bindings for C code.

  • Perl. Perl is viewed as a dying language since the community migrated to Python, but there is a lot of bioinformatics code written in Perl. Some still think that Perl text processing is the best, and Perl is slightly more geekish and more natively integrated with the Unix toolset. It fits you better than Python if you hate the "mainstream".

  • HPC. High performance computing is all the rage today, and Python is struggling to keep up with younger languages that were designed with cloud computing, server farms and super clusters in mind. Have a go with some of the new kids on the block, such as Go, Scala, Julia, etc. However, Python has implementations of the latest methodologies for distributed computing, such as MapReduce or deep learning, and has decent multiprocessing libraries.

  • Java, C++, C#. Together these languages account for a lot of bioinformatics software. Many NGS sequencing programs and many data science programs are written in Java. Through Jython one can integrate Java classes with Python modules. Python claims it too, but Java is probably the only platform that can truly claim to come with "batteries included", since most of Python's batteries are in C. C++ was originally an object oriented extension of C, and Python can treat it with the same tools. C# is what Microsoft did when it felt that too many programmers were abandoning Visual C++ for Java. The Microsoft clone of Python is called IronPython, in case you wonder.

  • Matlab. Most of the scientific computing software in Python is an open source version of Matlab libraries. Yes, Matlab is that good, if you can afford it. Matlab is still setting the trend in the field of numerical computation. For the poor, Octave is a free and open source alternative. SageMath is another popular collection of open source libraries that includes most of Python's scientific computing libraries.

  • Javascript. While Python dominates the multipurpose scripting languages, one scripting language dominates the browsers and is slowly making its way into server-side scripting too. Javascript, much like R, is viewed with disgust by many programmers due to its semantic imperfections, but a recent statistic claims that 70% of the world's code is running on a Javascript interpreter. I wrote this entire course in the browser using IPython without ever using a Python editor, and the only community that is as vibrant as Python's is probably the HTML5/.js community, which embraces the latest open web standards. Some claim operating systems are superfluous, and ChromeOS, running on Javascript, is certainly there to prove it. In terms of data visualization and interactivity, .js has no equal. It is a bit too fanciful to imagine that .js will improve enough to defeat all languages; more likely, assemblers will make the entire question of languages irrelevant, and people will use whatever comes handiest and translate the code as it suits them.

  • Linux. It would probably be more suitable to write "open source and open hardware" here, but unfortunately Linux is the only OS that qualifies, although both Apple and Microsoft have been "converted" and release more and more core OS functionality as open source. Linux powers the most popular mobile OS (Android) and is native to the most popular PC gaming and home entertainment platform (Steam + SteamOS). Free and open source software such as Python, R, C, Java and Perl feels most at home on Linux. The same can be said about open hardware. The revolution Python created by lowering the bar in programming is currently under way in hardware. From Raspberry Pi minicomputers to Arduino microcontrollers, architecture is opening up to the common people, and Python/Linux is their main development environment. It will not be long until researchers sit in three day courses on how to assemble sequencers, PCR machines and mass-spec chambers from multipurpose parts.

  • Google, Wikipedia, Biostar, IRC channels, mailing lists and Stackoverflow. Today you will not become productive just by reading the documentation, reading the whole book or attending all the courses. I am not trying to make you lazy; it is important to learn. But the best learning is doing, and the online communities are more than helpful. You should not give up because you cannot do something with Python. Learn to ask!
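The claim in the C bullet above, that much of Python is really compiled C underneath, can be checked from Python itself. This is a small sketch assuming the standard CPython interpreter (the one Anaconda ships); 'pure_python_sqrt' is a made-up function used only for comparison.

```python
import inspect
import math


def pure_python_sqrt(x):
    """A square root written in plain Python, for comparison only."""
    return x ** 0.5


# Functions implemented in compiled C are reported as "built-in" by CPython,
# while functions written with 'def' are ordinary Python functions.
print(inspect.isbuiltin(math.sqrt))
print(inspect.isfunction(pure_python_sqrt))

# Both give the same answer; only the implementation language differs.
print(math.sqrt(16.0), pure_python_sqrt(16.0))
```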

Questions:

  • Why "Python"?
  • What are dynamic languages? What are static languages?
  • What are interpreted languages? What are native languages, OSes and programs?
  • What does "compiling" a language do? What about an assembler?
  • How does a program run inside a computer?

How to make Python work for this course

Let us now descend, rather abruptly, to more practical matters. I suggest that you use a Scientific Python distribution. It makes things easy, especially if you are not on Linux. I feel it would be unfair to recommend one above another. Each distribution has its own way to install and update packages. Some packages may be missing, though, in which case they have to be installed manually.

In this course we will try to stick to the Anaconda distribution's specific way of managing packages. If a package outside the core set is missing, it is up to you whether you take the time to install it. For me, on Linux, all packages were installed with simple commands. For the purpose of the course I am using a core set of the most common libraries. But for presentation purposes I am also using libraries that may be hard to install in certain situations. Unfortunately installing software is not always easy, and each language has multiple distributions, each with a whole philosophy about library management. Welcome to the Hell of programming!

Anaconda installation and management

Please install the Anaconda distribution for Python 3.5, available here:

https://www.continuum.io/downloads

Anaconda installs a package manager called conda. Use it to create a microenvironment running Python 3. 'py35' is our invented name for this microenvironment, 'anaconda' is a way of telling conda that we want all the standard packages available in the distribution to be available for our environment:

conda create -n py35 python=3.5 anaconda
source activate py35    # on Windows type only: activate py35
source deactivate       # on Windows type only: deactivate
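Once an environment is activated, it is worth checking from inside Python that you are really running the interpreter you expect. Here is a minimal sketch using only the standard 'sys' module; the printed paths will differ on your machine.

```python
import sys

# Version of the running interpreter, e.g. major=3, minor=5 for this course.
major, minor = sys.version_info[:2]
print("Python version: {}.{}".format(major, minor))

# Full path of the interpreter executable. Inside an activated conda
# environment this normally points somewhere under that environment's folder.
print("Interpreter:", sys.executable)

# Root folder of the current Python installation or environment.
print("Prefix:", sys.prefix)

if major < 3:
    print("Warning: this course assumes Python 3.")
```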

What if we want only a selection of packages to be made available? Here is an example. If an environment named py35 already exists, choose a different name in the last command.

conda list
conda search biopython
conda create --name py35 biopython scipy

What if we want to install a new package inside a microenvironment?

conda install --name py35 beautiful-soup

Packages that are not part of standard Anaconda can be installed with pip. Find more here: http://conda.pydata.org/docs/using/pkgs.html

  • Native package management: pip, virtualenv, setuptools

Python's approach to package management is very ... modular. One normally starts by installing Python after downloading it from a main location. The exact way Python and other packages can be installed depends on your OS. Once Python is there, we have a few cross-platform methods for installing packages. These are the more important options:

pip is a package manager that will download and install packages. The basic command is:

pip install SomePackage
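Before installing, you may want to know whether a package is already available to your interpreter. Here is a small sketch using the standard 'importlib' module (Python 3.4 or later). The helper name 'is_installed' is invented for this example, and the name to check is the import name, which is not always the same as the name used with pip (Biopython, for instance, is imported as 'Bio').

```python
import importlib.util


def is_installed(module_name):
    """Return True if 'module_name' can be imported by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


for name in ["json", "numpy", "Bio", "surely_not_a_real_package"]:
    status = "available" if is_installed(name) else "missing, try pip or conda"
    print("{:<28} {}".format(name, status))
```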

virtualenv allows multiple isolated Python environments, each with its own Python version and libraries, to coexist on the same computer without conflicts, so that we don't start crying in case we need libraries that require different Python versions. This may seem a little scary to a beginner, but all programming languages have this problem. Similarly, old programs will not work on new operating systems, so one can install virtualization programs like VirtualBox, which let you run, for example, ancient versions of Windows, Linux and Mac OS all from the same OS. I only mention this library; hopefully it will not be needed during the course.

setuptools is useful if we want to deploy a Python program on another computer. That too is perhaps not important for the course.

  • Installing from source. Since only a few Python packages are in native Python code, this also entails having compilers for C/C++ in many cases.

Download the source archive of the package and decompress it. Open a terminal, navigate inside the source directory of the package (all this is done differently on different platforms) and type:

python setup.py build

The build step is not really required; it compiles the package but does not deploy the compiled files into their specific locations on your OS filesystem. The next command does that:

python setup.py install

That is it. You will see seemingly meaningless text running on the screen, and it will eventually end either with an OKAY or with some errors. There is more to it, but not for this course.

  • Version 2 or 3?

Python users have been oscillating between version 2 and version 3 for many years, because it is difficult to update all libraries to the new language specification. I am a relaxed person and also too busy to look for ways around it, and in consequence I am still using version 2.7 and never needed to switch. Most of the code here will probably work if you have v3. But we will try to be nice to the great minds behind Python, who decided that the whole community should slowly migrate to Python 3, and we will try to use version 3.

In particular, instructors introducing Python to new programmers may want to consider teaching Python 3 first and then introducing the differences in Python 2 afterwards (if necessary), since Python 3 eliminates many quirks that can unnecessarily trip up beginning programmers trying to learn Python 2.
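Two of the differences a beginner meets first are printing and division. The lines below are a small sketch of Python 3 behaviour; the comments describe what Python 2 does differently.

```python
# In Python 3, print is an ordinary function and needs parentheses.
# Python 2 used a statement instead:  print "Hello"
print("Hello from Python 3")

# In Python 3, '/' always performs true division, even between integers.
# In Python 2, 7 / 2 between integers silently gave 3.
true_division = 7 / 2

# '//' is floor division in both versions.
floor_division = 7 // 2

print(true_division, floor_division)
```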

Questions:

  • What is a Python package?
  • What is a Python distribution?
  • Why and when do we need micro-environments?

Jupyter

http://jupyter.org/

Reproducibility is very important these days. A verbose record of your work helps others who want to repeat your study, and even yourself when, after a few months or years, you would like to repeat your own work. It is the future of research and learning. There are several competing platforms that try to offer similar things. For R lovers there is also http://shiny.rstudio.com/.

Jupyter is what you see here, and much more.

  • Jupyter notebooks will run code in 40 different programming languages including Python and R
  • Will give you sophisticated consoles for many of these languages. The Python specific Jupyter kernel is called IPython, and is probably the best console for Python.
  • Can be used for teaching and support purposes, massive scale data science as well as small individual note taking.
  • Can be used for interactive web presentations.

On Windows, you can find a launcher for IPython Notebook under Anaconda in the Start menu. Alternatively you can open a command prompt, navigate to your folder and use the command below.

On Linux or OS X, you can start IPython Notebook from the command line. First open a terminal window, use 'cd' to navigate to the directory where you want to store your Python files and notebook document files. Then run this command:

jupyter notebook

You will see something like:

[NotebookApp] Serving notebooks from current/directory/pathway

Make sure the course material is in that directory.

Here is how to specify a different home directory:

jupyter notebook --notebook-dir=/path/to/course/dir

Keep in mind that on Windows the paths are specified in a different manner.
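The difference is mostly the drive letter and the direction of the slashes. As a sketch, the standard 'pathlib' module (Python 3.4 or later) can represent both styles on any machine; the folder and file names here are made up.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical course folders, one per platform style.
linux_dir = PurePosixPath("/home/student/course")
windows_dir = PureWindowsPath(r"C:\Users\student\course")

# The '/' operator joins path components using the right separator.
linux_notebook = linux_dir / "intro.ipynb"
windows_notebook = windows_dir / "intro.ipynb"

print(linux_notebook)
print(windows_notebook)
```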

Questions:

  • Why the name 'Jupyter'?
  • Why do we need reproducible science? (joking)

In [1]:
print("Hello World!")


Hello World!

Python console

The console is a command line interface directly to the Python interpreter.

https://docs.python.org/3/tutorial/interpreter.html

To open a console you have to open a terminal inside your operating system and type 'python'. The console is useful for interrogating the Python interpreter.
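Here is a sketch of the kind of questions you can put to the interpreter. In a console you would type the expressions one at a time and see each answer immediately; they are wrapped in print calls here so that they also work when run as a file.

```python
import sys

answer = 6 * 7
print(answer)                     # the console doubles as a calculator
print(type(answer).__name__)      # ask what type an object has
print("upper" in dir(str))        # dir() lists what an object can do
print(sys.version_info.major)     # ask which Python you are talking to

# help(str.upper) would print the documentation of a method; it is left
# commented out here because its output is long.
```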

Python editors

Editors for Python range from any simple raw text editor to the most complex IDE (integrated development environment).

In the first category I recommend Notepad and Notepad++ for Windows, Emacs for MacOS and Linux, and nano, vim and geany for Linux.

Among IDEs, Spyder is a simpler editor with an interface similar to Matlab and native integration of the IPython interpreter, and we will use that for the purpose of this class. A much more complex favorite of mine is PyCharm from JetBrains.

Task:

Create a 'src' folder inside your working directory. Use a raw text editor to make a hello world program inside it and run it on the command line. Now open the same file inside Spyder and run it inside the interpreter embedded into Spyder. Magic!
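As a starting point, the file could look like the sketch below. The file name 'src/hello.py' and the function 'greeting' are only suggestions; from the working directory you would run it with 'python src/hello.py'.

```python
def greeting(name="World"):
    """Build the greeting text for the given name."""
    return "Hello {}!".format(name)


# This block runs when the file is executed as a program,
# but not when it is imported from another file.
if __name__ == "__main__":
    print(greeting())
```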

