Quick and clean: Python for biological data processing

Day 1, 8:00-16:00, Data analysis

The first day is also the most essential in terms of Python programming. It is aimed at custom scientific computing and data mining: you learn how to adapt a method, port one from a different language, or glue a remote call onto it, and how to find and mine information using web services. If you are not well versed in programming, you will also get an idea of how to achieve all of this with a single language.

  • Visualization:
    • Standard plots with matplotlib and seaborn: line, scatter, and bar charts (minimal example after this list)
    • Web publishing with plotly and bokeh: heatmap example
    • Network display with networkx
    • GUI programming with wxPython
    • Web interfaces
  • Scientific computing:
    • Numpy: advanced array operations
    • Scipy introduction: from linear algebra to image analysis
    • SymPy: symbolic math
    • Networks with networkx: centrality computation (sketch after this list)
    • Fitting a curve, cleaning a signal (least-squares sketch after this list).
  • Optimization
    • Least squares
    • Gradient descent
    • Constrained optimization
    • Global optimization
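
To make the starting point concrete, here is a minimal matplotlib/seaborn sketch of the standard line and scatter plots we build on; the data are synthetic and all names and values are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()  # apply seaborn styling on top of matplotlib

# Synthetic signal: a sine wave plus noise
x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(scale=0.1, size=x.size)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.sin(x), label="sin(x)")   # line plot
ax1.legend()
ax2.scatter(x, y, s=10, alpha=0.6)       # scatter plot of the noisy data
fig.tight_layout()
plt.show()
```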
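
The networkx centrality computation looks roughly like this; the built-in karate club graph stands in for a real biological network:

```python
import networkx as nx

G = nx.karate_club_graph()               # small built-in test network
deg = nx.degree_centrality(G)
btw = nx.betweenness_centrality(G)

# The five most central nodes by betweenness
top = sorted(btw, key=btw.get, reverse=True)[:5]
print([(node, round(btw[node], 3)) for node in top])
```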
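
And a least-squares fit with scipy, the workhorse behind both the curve-fitting and optimization topics; the exponential-decay model and the noise level are made up for the sketch:

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(t, a, k):
    """Model to fit: a * exp(-k * t)."""
    return a * np.exp(-k * t)

# Synthetic noisy measurements of a decay with a=2.0, k=1.3
t = np.linspace(0, 5, 50)
y = decay(t, 2.0, 1.3) + np.random.normal(scale=0.05, size=t.size)

params, cov = curve_fit(decay, t, y, p0=(1.0, 1.0))
print("a = %.2f, k = %.2f" % tuple(params))  # should be close to 2.0 and 1.3
```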

Day 2, 8:00-16:00, Biological data science

This day focuses less on programming itself and more on the practice of machine learning, statistical learning, and pattern recognition. It builds on the "science stack" libraries and makes heavy use of scikit-learn, along with some more exotic libraries.

  • Intro to data science
    • Data science vs biology.
    • How to extract information from data.
    • Dataset, model, prediction.
    • Supervised vs Unsupervised ML
    • Linearity, Nonlinearity
    • Data distributions in biology.
  • Clustering:
    • Similarity scores in various biological data
    • Useful methods: k-means, hierarchical, spectral, DBSCAN (sketch after this list)
    • Graph community detection: modularity, random walks, Infomap
  • Regression:
    • Models: OLS, lasso, ridge, Bayesian regression, etc.
    • Model validation and hyperparameter tuning (cross-validation sketch after this list).
  • Classification:
    • Logistic regression
    • SVM and kernel methods
    • PLS, OPLS, OPLS-DA
    • Decision trees and random forests
  • Data integration:
    • Integrative clustering
    • Covariance optimization
    • Bayesian learning
  • Deep learning
    • Essentials, Tensor math
    • One example: Transcriptional regulation
    • Keras, TensorFlow, and PyTorch (minimal Keras model after this list)
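
As a taste of the clustering block, a k-means sketch with scikit-learn on a synthetic data matrix with two planted groups (sizes and parameters are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic (samples x features) matrix with two planted groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)),
               rng.normal(3, 1, (50, 10))])

X = StandardScaler().fit_transform(X)   # standardize features first
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))              # cluster sizes, ideally [50 50]
```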
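
Model validation and hyperparameter search in one small example: ridge regression with a cross-validated choice of the regularization strength (the alpha grid and the synthetic dataset are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for real measurements
X, y = make_regression(n_samples=200, n_features=20, noise=10.0,
                       random_state=0)

search = GridSearchCV(Ridge(),
                      param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
                      cv=5, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```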
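
For deep learning, a minimal Keras model on toy data; the architecture and data are placeholders, not the transcriptional-regulation example itself:

```python
import numpy as np
from tensorflow import keras

# Toy binary-classification data: 256 samples, 40 features
X = np.random.random((256, 40)).astype("float32")
y = (X[:, 0] > 0.5).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(40,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))   # [loss, accuracy]
```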

Day 3, 8:00-16:00, Biological data engineering

By now we have covered the more advanced aspects of Python and explored some of the libraries that make data science work. This day is dedicated to the engineering of computing infrastructure and Python's role in it: what the computing landscape looks like today, how to use Python to organize your work with reproducibility in mind, and how to run it on clouds and GPU machines.

  • Gentle introduction
    • PC, Server, Grid, Cloud, IoT
    • Desktop OS and Python.
    • Using Docker.
    • Using Jupyter and Python inside Docker.
  • DevOps
    • Reproducible research.
    • Continuous integration.
    • How to organize your source code.
    • Using source editors: what matters?
    • Distributed version control using git.
    • Development vs production.
    • When do we need containers? Using Docker.
    • Speed: Profiling, IPython, JIT
    • Robustness: unit testing
    • Documentation: pydoc and Sphinx
  • Workflow management
    • Snakemake tutorial (minimal Snakefile after this list)
    • Other examples: Nextflow, Luigi.
    • Make your own workflow in Python.
  • Distributed computing
    • What is the cloud?
    • Hadoop and Spark
    • Spinning up instances; the need for containers
    • Serverless.
  • Spark
    • MapReduce, Hadoop and Spark.
    • Data lakes: Hive, Pig, YARN.
    • Set up a Spark cluster on AWS.
    • ML with PySpark (k-means sketch after this list)
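
A minimal Snakefile of the kind the tutorial starts from; the file names are hypothetical, and the point is that Snakemake rebuilds only out-of-date targets:

```python
# Snakefile (Snakemake's rule syntax on top of Python)
rule all:
    input:
        "results/summary.txt"

rule summarize:
    input:
        "data/counts.tsv"          # hypothetical input
    output:
        "results/summary.txt"
    shell:
        "wc -l {input} > {output}"
```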
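
And a first taste of ML with PySpark: k-means on a tiny in-memory DataFrame (a local Spark installation is assumed; the numbers are toy values):

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("pyspark-ml-demo").getOrCreate()

# Four toy points forming two obvious clusters
df = spark.createDataFrame(
    [(0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9)], ["x", "y"])

assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
model = KMeans(k=2, seed=1).fit(assembler.transform(df))
print(model.clusterCenters())
spark.stop()
```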

Task day, 8:00-16:00, 'Omics

We set up the problems, describe the tasks, and give you some helper code to start with; you then work on your chosen task in class. I will tend to guide rather than tell. You are encouraged to bring your own task, but if you want high-quality advice it is best to send me a description or contact me in advance! We will also have an individual follow-up at a later date.

  • Sequencing - NGS pipelining:
    • Launch a cloud instance and install the required programs
    • Set up the pipeline
    • Read mapping and IGV inspection
    • Normalizing counts and differential expression
    • Galaxy integration
  • Sequencing - Biopython
    • Build a toy sequence library in standard Python for processing DNA, RNA, and protein data.
    • Implement DNA, RNA, and protein sequences as Python classes (starter class sketched at the end of this section)
    • Add methods for transcription, translation, and regulation.
    • Compute several sequence similarity scores, such as the Hamming distance and mutual information.
    • Add Biopython methods and prefix them with bp
    • Describe your module in a tutorial-like fashion
  • Gene Expression
    • Download a GEO dataset and prepare it
    • Cluster the genes based on their expression
    • Compute a co-expression network (sketch at the end of this section)
    • Compute differential gene expression for a set of samples.
    • Compute functional enrichment of the main clusters.
  • Transcriptomics
    • Extract the promoter regions using Biopython
    • Investigate de novo motifs on clusters of genes using MEME or STEME
    • Use the TRANSFAC database to search for motif occurrences in selected genes
    • Save the found motifs and related data
    • Test which of the motif occurrences in your selected genes are significant
  • Proteomics
    • Compute a protein similarity graph and perform a cluster enrichment study
    • Perform structural alignment and plots with PyMOL
  • Metabolomics
    • Metabolic pathway assembly, enrichment and display
    • Flux balance analysis
  • Dynamic modeling
    • Load a curated SBML model
    • Plot the model
    • Solve the model
    • Peak identification
    • Pathway studies
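
A possible starting point for the toy sequence library; the class and method names are suggestions, not a fixed specification:

```python
class DNA:
    COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def __init__(self, seq):
        self.seq = seq.upper()

    def reverse_complement(self):
        return DNA(self.seq.translate(self.COMPLEMENT)[::-1])

    def transcribe(self):
        """RNA string for the coding strand."""
        return self.seq.replace("T", "U")

    def hamming(self, other):
        """Number of mismatching positions (equal-length sequences)."""
        if len(self.seq) != len(other.seq):
            raise ValueError("sequences must be the same length")
        return sum(a != b for a, b in zip(self.seq, other.seq))

d1, d2 = DNA("ATGGCGT"), DNA("ATGACGT")
print(d1.transcribe(), d1.hamming(d2))   # AUGGCGU 1
```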
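
For the gene-expression task, the co-expression network step could look like this; the expression matrix here is synthetic (in the task it comes from the prepared GEO dataset), and the correlation cutoff is arbitrary:

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)
expr = rng.normal(size=(30, 12))   # toy matrix: 30 genes x 12 samples
corr = np.corrcoef(expr)           # gene-gene Pearson correlations

G = nx.Graph()
genes = ["g%d" % i for i in range(expr.shape[0])]
G.add_nodes_from(genes)
for i in range(len(genes)):
    for j in range(i + 1, len(genes)):
        if abs(corr[i, j]) > 0.7:  # arbitrary cutoff for the sketch
            G.add_edge(genes[i], genes[j], weight=float(corr[i, j]))

print(G.number_of_nodes(), "genes,", G.number_of_edges(), "edges")
```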
