Just like a human language, a computer language is the container of a certain culture, together with a set of values. A popular source describing Python's culture is the Zen of Python, a collection of 19 aphorisms similar in style to a Taoist book. Most of them are addressed to programmers, but some are easy for anyone to understand:


In [1]:
import this


The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

If you are a seasoned programmer, Python also matters because it is common. The TIOBE index of language popularity consistently ranks it first after the infrastructure languages. It is popular in both the data science industry and academia, which translates into good support.

If you are not a professional programmer, Python is a language designed for you. Here is what Guido van Rossum, the creator of Python, wrote in his funding submission to DARPA, titled Computer Programming for Everybody, back in 1999:

"...while many people nowadays use a computer, few of them are computer programmers. Non-programmers aren't really "empowered" in how they can use their computer: they are confined to using applications in ways that "programmers" have determined for them. One doesn't need to be a visionary to see the limitations here."

So Python was conceived to be plain and simple, yet to allow you to do everything a complex language would. Many languages have tried to open up to a wider audience, but few had this goal as their founding principle. The result is that today Python has one of the most vibrant developer communities of any language, and that translates into good libraries and expertise.

The course I am trying to hold fits the Python creed well. It is addressed to anyone, it is open source, it is short and simple, and yet it introduces the participants to subjects that would otherwise take years to master. I could probably hold this course in ten languages, but only in Python can I cover all the subjects in three days.

How do computers work

Discuss:

  • Computer architecture
  • Grid computing
  • Cloud computing
  • Parallel computing: multiprocessing, multithreading
  • Speed considerations
  • Price considerations
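To make the multithreading item concrete for the discussion, here is a minimal sketch using the standard library's concurrent.futures module. The function name fetch_length and the sample words are invented for illustration; a real task would do I/O or computation instead.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_length(word):
    """Stand-in for a slow task (hypothetical); here it only measures a string."""
    return len(word)


words = ["grid", "cloud", "parallel"]  # sample inputs, invented for the demo

# Run the tasks on a small pool of worker threads; map() keeps the input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    lengths = list(pool.map(fetch_length, words))

print(lengths)  # prints [4, 5, 8]
```

Replacing ThreadPoolExecutor with ProcessPoolExecutor from the same module gives the multiprocessing variant with the same interface, which is one way to compare the two approaches in class.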

Python: Stats, strengths and weaknesses

Stats:

  • Tiobe index
  • It is gaining an ever stronger foothold in data science. Examples: PySpark is more popular than the native Scala implementation, most major deep learning libraries are used from Python, etc.
  • Many apostles got bored and moved on to newer languages or to the web.
  • Job market: language popularity means plenty of jobs but a small career edge.

Strengths:

  • Easy
  • Has a Zen poem
  • Everyone uses it
  • You can do anything with Python
  • ..?

Weaknesses:

  • Large code bases are hard to maintain compared with C++/Java. Although it brought us Mercurial!
  • Memory hog
  • Weak support for parallelism in its libraries (less relevant since the advent of distributed computing)
  • ..?

In [2]:
import sys

i = 1
sys.getsizeof(i)  # size of a small int object, in bytes


Out[2]:
28
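The "memory hog" point can be illustrated a bit further. The sketch below is a rough estimate, not a precise measurement: it compares a list of separate int objects with a packed array from the standard library's array module. The variable names and the size n = 1000 are chosen only for the demo, and exact byte counts depend on the Python build.

```python
import sys
from array import array

n = 1000
as_list = list(range(n))          # a list of separate int objects
as_array = array("q", range(n))   # one packed buffer of 64-bit integers

# The list stores pointers; each int object it points to has its own overhead.
list_total = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
array_total = sys.getsizeof(as_array)

print("list of ints :", list_total, "bytes")
print("packed array :", array_total, "bytes")
print("bytes per item in the array:", as_array.itemsize)
```

This is also why the scientific libraries store numbers in packed buffers rather than in plain Python lists.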

Python: Past, present and future

A short history

The first programming languages had a concise syntax and were able to control the computer at a very low level. The best surviving dino from that era is probably C, a language that lets you directly control the memory that the processor is using. In time, as software accumulated and the IT infrastructure evolved, languages became specialized, with languages like C++ and Java controlling the core infrastructure sector while niche languages such as Matlab and R were employed in engineering and statistics, respectively. Meanwhile, languages moved further away from the hardware, being interpreted by an engine that sits between the operating system and the program itself. Languages that are ready to run the moment writing stops are commonly called scripting languages. Another important change occurred in the way data types are managed, with some languages allowing the type of a variable to be checked and changed at runtime. These are called dynamic languages.

Rise of the scripting languages

It is difficult to explain why scripting languages became so ubiquitous. It has to do with lack of time and with safety. In C it is relatively easy to make mistakes that corrupt memory or crash the system, it takes more time to get something done if you are not skilled, and you write ugly code if you are not talented. The general purpose scripting languages that stood out were Perl and Python. Perl came first and had an intentionally obfuscated syntax, because it was primarily addressed to UNIX system programmers, who love syntax puzzles and text obfuscation. Many sequencing libraries were developed in or integrated with Perl, BioPerl being a notable one. R also developed as an open source alternative to SAS, a language used in statistics. By the time Python started to spread, Perl dominated text processing, Matlab dominated engineering, R dominated open source statistical computing and Javascript dominated the web. But Python was addressed to everyone, and in a decade it became the number one scripting language in terms of popularity.

Future

I am not trying to speculate. The point of Python has been made: today older languages are being revamped with Python features, while newer languages are growing out of Python concepts and will ultimately compete for broad audiences. I myself have programmed in 30 or more different languages and kept my sanity, so there is no reason for alarm. Python is probably the best introduction to programming at this moment in time, which is why many universities and schools are making it the main tool in their curricula.

Outside Python

As a data scientist working in biology, or as a biologist working with data science, inviting Python to a party brings along a few friends, which I recommend you get to know in time. Maybe some of them are already familiar? The beginner's mistake is to think that one knows programming after a three day course. The following list should not discourage you: in three days you will already be able to do what programmers spend most of their time on, and maybe free some of them for more important work or put them out of work entirely.

  • R. It is hard to extract your data if you avoid R. Many Python fans complain about R's byzantine syntax, or quote what R experts say, that R is a platform with a language instead of the opposite. This is all true of course, but keep in mind that R, just like Python, has the real programming done in C, and most of the best libraries are also compiled from C. Much of the bioinformatics toolset today is found in R, and Python is a perfect glue for R calls.

  • C. No matter how good computers become, core stuff will always need to run fast and not hog resources. C is the mother of most software, including the interpreters for most scripting languages. The best Python libraries for scientific computing, for example, are mere calls to compiled C code. C is not as hard as it used to be. Notably, Cython is a way to call C and C++ to and from Python. SWIG is another, language agnostic tool that makes it easy to generate bindings for C code.

  • Perl. Perl is viewed as a dying language since much of the community migrated to Python, but there is a lot of bioinformatics code written in Perl. Some still think that Perl's text processing is the best, and Perl is slightly more geekish and more natively integrated with the Unix toolset. It fits you better than Python if you hate the "mainstream".

  • Clouds and the newer langs. High performance, ultra scalable computing is all the rage today, and Python is struggling to keep up with younger languages that were designed with cloud computing, server farms and super clusters in mind. Have a go with some of the new kids on the block, such as Go, Scala, Julia, Clojure, etc.

  • Java, C++, C#. Together these languages account for a lot of bioinformatics software. Many NGS sequencing programs and many data science programs are written in Java. Through Jython one can integrate Java classes with Python modules. Python claims it too, but Java is probably the only platform that can truly claim to come with "batteries included", since most of Python's batteries are in C. C++ was originally an object oriented extension of C, and Python can treat it with the same tools. C# is what Microsoft did when it felt that too many programmers were abandoning Visual C++ for Java. The .NET implementation of Python is called IronPython, in case you wonder.

  • Matlab. Much of the scientific computing software in Python is an open source version of Matlab libraries. Yes, Matlab is that good, if you can afford it. Matlab is still setting the trend in the field of numerical computation. For the poor, Octave is a free and open source alternative. SageMath is another popular collection of open source libraries that includes most of Python's scientific computing libraries.

  • Javascript. While Python dominates the multipurpose scripting languages, one scripting language dominates the browsers and is slowly making its way into server-side scripting too. Javascript, much like R, is viewed with disgust by many programmers due to its semantic imperfections, but one recent statistic claims that 70% of the world's code runs on a Javascript interpreter. I wrote this entire course in the browser using Jupyter without ever using a Python editor, and the only community that is as vibrant as Python's is probably the HTML5/.js community, which embraces the latest open web standards. Some claim operating systems are superfluous, and ChromeOS, built around the browser, is certainly there to prove it. In terms of data visualization and interactivity, .js has no equal. It is a bit too fictional to imagine that .js will improve enough to defeat all languages. More likely, assemblers will make the entire question of languages irrelevant, and people will use whatever comes in handy and translate the code as it suits them.

  • Linux. It would probably be more suitable to write "open source and open hardware" here, but unfortunately Linux is the only OS that qualifies, although both Apple and Microsoft have been "converted" and release more and more core OS functionality as open source. Linux runs the most popular mobile OS (Android) and is native to the most popular PC gaming and home entertainment platform (Steam + SteamOS). Free and open source software such as Python, R, C, Java and Perl feels most at home on Linux. The same can be said about open hardware. The revolution Python created by lowering the bar in programming is now under way in hardware. From Raspberry Pi minicomputers to Arduino microcontrollers, architecture is opening up to the common people, and Python/Linux is their main development environment. It will not be long until researchers sit in three day courses on how to assemble sequencers, PCR machines and mass-spec chambers from multipurpose parts.

  • Google, Wikipedia, Biostar, IRC channels, mailing lists and Stack Overflow. Today you will not be productive if you insist on reading all the documentation, reading the whole book or attending all the courses first. I am not trying to make you lazy, it is important to learn. But the best learning is doing, and the online communities are more than helpful. You should not give up because you cannot do something with Python. Learn to ask!

Class discussion topics:

  • What is a computer?
  • What components are expensive/speedy today?
  • Where are the "weak links" in the way a computer runs? What about a grid, or a cloud?
  • Why "Python"?
  • What are dynamic languages? What are static languages?
  • What are interpreted languages? What are native languages, OSes and programs?
  • What does "compiling" a language do? What about an assembler?
  • How does a program run inside a computer?

In [1]:
i = 1
i


Out[1]:
1

In [2]:
i = 'abc'
i


Out[2]:
'abc'
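The two cells above show the same name holding first a number and then a string, which is what makes Python a dynamic language. A minimal sketch that makes the change of type visible with the built-in type() function (the names first_type and second_type are chosen only for the demo):

```python
i = 1
first_type = type(i)     # int

i = "abc"                # rebinding the same name to a string is allowed
second_type = type(i)    # str

print(first_type.__name__, "->", second_type.__name__)  # prints: int -> str

# Types are still enforced at runtime: mixing them raises an error.
try:
    i + 1
except TypeError as err:
    print("TypeError:", err)
```

In a static language the second assignment would be rejected before the program ever runs.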

MAKE PYTHON WORK AGAIN!

Once you use a programming language past the beginning steps, you will sooner or later need libraries that require different language versions or, even worse, other libraries that in turn depend on different language versions. This is solved by using distributions and virtualization. This may seem a little scary to a beginner, but all programming languages have this problem. Similarly, old programs will not work on new operating systems, so one can install virtualization programs like VirtualBox, which allow you to run, for example, ancient versions of Windows, Linux and Mac OS all from the same OS. Unfortunately, installing software is not always easy, and each language has multiple distributions, each with a whole philosophy about library management. Welcome to the Hell of programming!

How to install Python libs:

  • distros
  • pip and virtualenv
  • distutils, setuptools, wheels and eggs
  • conda
  • installing from source

distros

Scientific Python friendly distributions.

A distribution makes installation easy, especially if you are not on Linux. I feel it would be unfair to recommend one above another. Each distribution has its own way to install or update a package. Some packages may be missing, though, in which case they have to be installed manually.

pip, virtualenv

Python's approach to package management is very ... modular. One normally starts by downloading Python from its main location and installing it. The exact way Python and other packages are installed depends on your OS. Once Python is there, we have a few cross-platform methods for installing packages. These are the more important options:

pip is a package manager that will download and install packages. The basic command is:

pip install SomePackage
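After a pip install it is handy to check from Python whether a package is really there. The sketch below assumes Python 3.8 or newer, where the standard library module importlib.metadata is available; the helper name installed_version is invented for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "SomePackage" is the placeholder name used in the pip command above.
for name in ["pip", "SomePackage"]:
    print(name, "->", installed_version(name))
```

A result of None simply means the package is not installed in the Python that runs the code, which may differ from the one pip installed into.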

virtualenv allows several isolated Python environments, each with its own interpreter and library versions, to coexist on the same computer without conflicts, so that we don't start crying when we need libraries that require different Python versions. I only mention this library here; hopefully it will not be needed during the course.

The package repository site is called PyPI (the Python Package Index).
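To show what an isolated environment looks like on disk, here is a minimal sketch. Note that it does not use virtualenv itself: it swaps in the standard library's venv module (available since Python 3.3), which provides similar isolation. The temporary directory and the name demo-env are invented for the demo.

```python
import tempfile
import venv
from pathlib import Path

# Create a throwaway environment in a temporary directory (hypothetical location).
env_dir = Path(tempfile.mkdtemp()) / "demo-env"

# with_pip=False keeps the sketch quick; set it to True to get pip inside too.
venv.create(str(env_dir), with_pip=False)

# Every environment created this way carries a small config file at its root.
config_file = env_dir / "pyvenv.cfg"
print(config_file.exists())
print(config_file.read_text())
```

From a terminal the same thing is usually done with python -m venv followed by a directory name, and the environment is then activated before pip is used.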

distutils, setuptools, wheels and eggs

This is useful if we want to deploy (distribute) a Python program on another computer. Here is the official link for distributing packages.

conda

In this course we will try to stick to the Anaconda distribution's specific way of managing packages. If a package outside the core set is missing, it is up to you whether you have time to install it. For me, on Linux, most packages were installed with simple commands. For the purpose of the course I am using a core set of the most common libraries. But for presentation purposes I am also using libraries that may be hard to install in certain situations.

Anaconda installation and management

Please install the Anaconda distribution for Python 3.6, available here:

https://www.continuum.io/downloads

Anaconda installs a package manager called conda. Use it to create a microenvironment running Python 3. 'pycourse' is our invented name for this microenvironment, and 'anaconda' is a way of telling conda that we want all the standard packages in the distribution to be available in our environment:

conda create -n pycourse python=3.6 anaconda
source activate pycourse   # on Windows: activate pycourse
source deactivate          # on Windows: deactivate

What if we want only a selection of packages to be made available? Here is an example.

conda list
conda search biopython
conda create --name pycourse biopython scipy

What if we want to install a new package inside a microenvironment?

conda install --name pycourse beautiful-soup

Packages that are not part of standard Anaconda can be installed with pip. Find more here: http://conda.pydata.org/docs/using/pkgs.html

Installing from source

Since many scientific Python packages are not written in pure Python, installing from source often also requires compilers for C/C++.

Download the source archive of the package and decompress it. Open a terminal, navigate inside the source directory of the package (all this is done differently on different platforms) and type:

python setup.py build

The build step is not strictly required; it compiles the package but does not deploy the compiled files to their final locations on your filesystem. To install, type:

python setup.py install

That is it. You will see a lot of seemingly meaningless text scrolling on the screen, and it will eventually end either with an OKAY or with some errors. There is more to it, but not for this course.

Version 2 or 3?

Python users have been oscillating between version 2 and version 3 for many years, because it is difficult to update all libraries to the new language specification. The maintainers of Python decided that the whole community should slowly migrate to Python 3, and we will try to use version 3.

In particular, instructors introducing Python to new programmers may want to consider teaching Python 3 first and then introducing the differences in Python 2 afterwards (if necessary), since Python 3 eliminates many quirks that can unnecessarily trip up beginning programmers trying to learn Python 2.
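Two of the best known differences between the versions can be seen in a few lines. This is a small sketch that assumes it is run with Python 3; the comments describe what Python 2 did instead, and the variable names are chosen only for the demo.

```python
import sys

# This sketch assumes Python 3.
print("Running Python", sys.version_info.major)

# print is a function in Python 3 (it was a statement in Python 2).
print("hello", "world", sep=", ")

# "/" is true division in Python 3; in Python 2, 7 / 2 on integers gave 3.
true_division = 7 / 2
floor_division = 7 // 2
print(true_division, floor_division)  # prints: 3.5 3
```

The silent integer truncation of Python 2 is a typical example of the quirks mentioned above.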

Questions:

  • What is a Python package?
  • What is a Python distribution?
  • Why and when do we need micro-environments?

Task:

  • Install a conda environment containing biopython and scipy and make the install more automated by using a requirements.txt file.
  • Verify your environment by deleting the old test environment and reinstalling it using the requirements.txt file.
  • Install a similar environment with pip/virtualenv and try to switch between the conda and pip environments.
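When switching between environments it helps to ask Python itself which environment it is running in and whether the packages are visible. The following is only a sketch: the helper name missing_modules is invented, and it assumes that biopython is imported under the name Bio and that conda sets the CONDA_DEFAULT_ENV variable when an environment is activated.

```python
import importlib.util
import os
import sys


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported here."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Where is the running interpreter installed? This differs per environment.
print("interpreter prefix:", sys.prefix)
# May print None if no conda environment is active.
print("conda environment :", os.environ.get("CONDA_DEFAULT_ENV"))

# Import names for the packages in the task (biopython is imported as "Bio").
print("missing:", missing_modules(["Bio", "scipy"]))
```

An empty list on the last line suggests that the environment contains what the task asked for.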

Python console

The console is a command line interface directly to the Python interpreter.

https://docs.python.org/2/tutorial/interpreter.html

To open a console, open a terminal in your operating system and type 'python'. The console is useful for interrogating the Python interpreter.


In [ ]:
!python


Python 3.6.0 |Continuum Analytics, Inc.| (default, Dec 23 2016, 12:22:00) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

In [ ]:
import this