Quick and clean: Python for biological data processing

Day 1, 8:00-16:00, Data analysis

The first day is also the most essential in terms of Python programming. It is aimed at custom scientific computing and data mining: you learn how to adapt a method, port one from a different language, or glue a remote call onto it, and how to find and mine information using web services. If you are not well versed in programming, you will also get an idea of how to achieve all of this with a single language.

  • Visualization:
    • Standard plots with matplotlib and seaborn: line, scatter, and bar charts (minimal example after this list)
    • Web publishing with plotly and bokeh: heatmap example
    • Network display with networkx
    • GUI programming with wxPython
    • Web interfaces
  • Scientific computing:
    • Numpy: advanced array operations
    • Scipy introduction: from linear algebra to image analysis
    • SymPy: symbolic math
    • Networks with networkx: centrality computation (sketch after this list)
    • Fitting a curve, cleaning a signal (least-squares sketch after this list).
  • Optimization
    • Least squares
    • Gradient descent
    • Constrained optimization
    • Global optimization
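
To make the starting point concrete, here is a minimal matplotlib/seaborn sketch of the standard line and scatter plots we build on; the data are synthetic and all names and values are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()  # apply seaborn styling on top of matplotlib

# Synthetic signal: a sine wave plus noise
x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(scale=0.1, size=x.size)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.sin(x), label="sin(x)")   # line plot
ax1.legend()
ax2.scatter(x, y, s=10, alpha=0.6)       # scatter plot of the noisy data
fig.tight_layout()
plt.show()
```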
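
The networkx centrality computation looks roughly like this; the built-in karate club graph stands in for a real biological network:

```python
import networkx as nx

G = nx.karate_club_graph()               # small built-in test network
deg = nx.degree_centrality(G)
btw = nx.betweenness_centrality(G)

# The five most central nodes by betweenness
top = sorted(btw, key=btw.get, reverse=True)[:5]
print([(node, round(btw[node], 3)) for node in top])
```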
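
And a least-squares fit with scipy, the workhorse behind both the curve-fitting and optimization topics; the exponential-decay model and the noise level are made up for the sketch:

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(t, a, k):
    """Model to fit: a * exp(-k * t)."""
    return a * np.exp(-k * t)

# Synthetic noisy measurements of a decay with a=2.0, k=1.3
t = np.linspace(0, 5, 50)
y = decay(t, 2.0, 1.3) + np.random.normal(scale=0.05, size=t.size)

params, cov = curve_fit(decay, t, y, p0=(1.0, 1.0))
print("a = %.2f, k = %.2f" % tuple(params))  # should be close to 2.0 and 1.3
```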

Day 2, 8:00-16:00, Biological data science

This day focuses less on programming itself and more on the practice of machine learning, statistical learning, and pattern recognition. It builds on the "science stack" libraries and makes heavy use of scikit-learn, along with some more exotic libraries.

  • Intro to data science
    • Data science vs biology.
    • How to extract information from data.
    • Dataset, model, prediction.
    • Supervised vs Unsupervised ML
    • Linearity, Nonlinearity
    • Data distributions in biology.
  • Clustering:
    • Similarity scores in various biological data
    • Useful methods: k-means, hierarchical, spectral, DBSCAN (sketch after this list)
    • Graph community detection: modularity, random walks, Infomap
  • Regression:
    • Models: OLS, lasso, ridge, Bayesian regression, etc.
    • Model validation and hyperparameter tuning (cross-validation sketch after this list).
  • Classification:
    • Logistic regression
    • SVM and kernel methods
    • PLS, OPLS, OPLS-DA
    • Decision trees and random forests
  • Data integration:
    • Integrative clustering
    • Covariance optimization
    • Bayesian learning
  • Deep learning
    • Essentials, Tensor math
    • One example: Transcriptional regulation
    • Keras, TensorFlow, and PyTorch (minimal Keras model after this list)
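
As a taste of the clustering block, a k-means sketch with scikit-learn on a synthetic data matrix with two planted groups (sizes and parameters are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic (samples x features) matrix with two planted groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)),
               rng.normal(3, 1, (50, 10))])

X = StandardScaler().fit_transform(X)   # standardize features first
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))              # cluster sizes, ideally [50 50]
```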
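
Model validation and hyperparameter search in one small example: ridge regression with a cross-validated choice of the regularization strength (the alpha grid and the synthetic dataset are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for real measurements
X, y = make_regression(n_samples=200, n_features=20, noise=10.0,
                       random_state=0)

search = GridSearchCV(Ridge(),
                      param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
                      cv=5, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```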
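
For deep learning, a minimal Keras model on toy data; the architecture and data are placeholders, not the transcriptional-regulation example itself:

```python
import numpy as np
from tensorflow import keras

# Toy binary-classification data: 256 samples, 40 features
X = np.random.random((256, 40)).astype("float32")
y = (X[:, 0] > 0.5).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(40,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))   # [loss, accuracy]
```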

Day 3, 8:00-16:00, Biological data engineering

By now we have covered the more advanced aspects of Python and explored some of the libraries that make data science work. This day is dedicated to the engineering of computing infrastructure and Python's role in it: what the computing landscape looks like today, how to use Python to organize your work with reproducibility in mind, and how to run it on clouds and GPU machines.

  • Gentle introduction
    • PC, Server, Grid, Cloud, IoT
    • Desktop OS and Python.
    • Using Docker.
    • Using Jupyter and Python inside Docker.
  • DevOps
    • Reproducible research.
    • Continuous integration.
    • How to organize your source code.
    • Using source editors: what matters?
    • Distributed version control using git.
    • Development vs production.
    • When do we need containers? Using Docker.
    • Speed: Profiling, IPython, JIT
    • Robustness: unit testing
    • Documentation: pydoc and Sphinx
  • Workflow management
    • Snakemake tutorial (minimal Snakefile after this list)
    • Other examples: Nextflow, Luigi.
    • Make your own workflow in Python.
  • Distributed computing
    • What is the cloud?
    • Hadoop and Spark
    • Spinning up instances; the need for containers
    • Serverless.
  • Spark
    • MapReduce, Hadoop and Spark.
    • Data lakes: Hive, Pig, YARN.
    • Set up a Spark cluster on AWS.
    • ML with PySpark (k-means sketch after this list)
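
A minimal Snakefile of the kind the tutorial starts from; the file names are hypothetical, and the point is that Snakemake rebuilds only out-of-date targets:

```python
# Snakefile (Snakemake's rule syntax on top of Python)
rule all:
    input:
        "results/summary.txt"

rule summarize:
    input:
        "data/counts.tsv"          # hypothetical input
    output:
        "results/summary.txt"
    shell:
        "wc -l {input} > {output}"
```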
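
And a first taste of ML with PySpark: k-means on a tiny in-memory DataFrame (a local Spark installation is assumed; the numbers are toy values):

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("pyspark-ml-demo").getOrCreate()

# Four toy points forming two obvious clusters
df = spark.createDataFrame(
    [(0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9)], ["x", "y"])

assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
model = KMeans(k=2, seed=1).fit(assembler.transform(df))
print(model.clusterCenters())
spark.stop()
```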

Task day, 8:00-16:00, 'Omics

We set up the problems, describe the tasks, and give you some helper code to start with; you then work on your chosen task in class. I will tend to guide rather than tell. You are encouraged to bring your own task, but if you want high-quality advice it is best to send me a description or contact me in advance! We will also have an individual follow-up at a later date.

  • Sequencing - NGS pipelining:
    • Launch a cloud instance and install the required programs
    • Set up the pipeline
    • Read mapping and IGV inspection
    • Normalizing counts and differential expression
    • Galaxy integration
  • Sequencing - Biopython
    • Build a toy sequence library in standard Python for processing DNA, RNA, and protein data.
    • Implement DNA, RNA, and protein sequences as Python classes (starter class sketched at the end of this section)
    • Add methods for transcription, translation, and regulation.
    • Compute several sequence similarity scores, such as the Hamming distance and mutual information.
    • Add Biopython methods and prefix them with bp
    • Describe your module in a tutorial-like fashion
  • Gene Expression
    • Download a GEO dataset and prepare it
    • Cluster the genes based on their expression
    • Compute a co-expression network (sketch at the end of this section)
    • Compute differential gene expression for a set of samples.
    • Compute functional enrichment of the main clusters.
  • Transcriptomics
    • Extract the promoter regions using Biopython
    • Investigate de novo motifs on clusters of genes using MEME or STEME
    • Use the TRANSFAC database to search for motif occurrences in selected genes
    • Save the found motifs and related data
    • Test which of the motif occurrences in your selected genes are significant
  • Proteomics
    • Compute a protein similarity graph and perform a cluster enrichment study
    • Perform structural alignment and plots with PyMOL
  • Metabolomics
    • Metabolic pathway assembly, enrichment and display
    • Flux balance analysis
  • Dynamic modeling
    • Load a curated SBML model
    • Plot the model
    • Solve the model
    • Peak identification
    • Pathway studies
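
A possible starting point for the toy sequence library; the class and method names are suggestions, not a fixed specification:

```python
class DNA:
    COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def __init__(self, seq):
        self.seq = seq.upper()

    def reverse_complement(self):
        return DNA(self.seq.translate(self.COMPLEMENT)[::-1])

    def transcribe(self):
        """RNA string for the coding strand."""
        return self.seq.replace("T", "U")

    def hamming(self, other):
        """Number of mismatching positions (equal-length sequences)."""
        if len(self.seq) != len(other.seq):
            raise ValueError("sequences must be the same length")
        return sum(a != b for a, b in zip(self.seq, other.seq))

d1, d2 = DNA("ATGGCGT"), DNA("ATGACGT")
print(d1.transcribe(), d1.hamming(d2))   # AUGGCGU 1
```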
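
For the gene-expression task, the co-expression network step could look like this; the expression matrix here is synthetic (in the task it comes from the prepared GEO dataset), and the correlation cutoff is arbitrary:

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)
expr = rng.normal(size=(30, 12))   # toy matrix: 30 genes x 12 samples
corr = np.corrcoef(expr)           # gene-gene Pearson correlations

G = nx.Graph()
genes = ["g%d" % i for i in range(expr.shape[0])]
G.add_nodes_from(genes)
for i in range(len(genes)):
    for j in range(i + 1, len(genes)):
        if abs(corr[i, j]) > 0.7:  # arbitrary cutoff for the sketch
            G.add_edge(genes[i], genes[j], weight=float(corr[i, j]))

print(G.number_of_nodes(), "genes,", G.number_of_edges(), "edges")
```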
