1. Introduction

The unification of numerical and linear algebra packages under the umbrella of numpy/scipy about a decade ago spawned a slew of scientific packages for Python. Ever since, the Python scientific ecosystem has been evolving rapidly, and new releases often break backwards compatibility. We will use Matplotlib 2.0, Scikit-learn 0.18, and QuTiP 4.1 in this tutorial. We import everything except Seaborn; the reason will become clear in Section 2.


In [ ]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
import qutip
from skimage import io
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from qutip import destroy, basis, steadystate, expect, mcsolve, mesolve, \
    thermal_dm, plot_fock_distribution, matrix_histogram, hinton, tensor
%matplotlib inline
print("Matplotlib:", matplotlib.__version__,
      "\nScikit-learn:", sklearn.__version__,
      "\nQuTiP:", qutip.__version__)

Notice the line that starts with %. This is a 'magic command' specific to Jupyter. It ensures that images will be plotted inline, instead of popping up in a window. You can look at all magic commands by entering %quickref. Some are useful, although most of them are not. The magic commands are not part of Python, so calling them in a script will throw an error. Keep this in mind when you copy code from a notebook.
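
For example, a stand-alone script version of a plotting cell would simply drop the magic line and display or save the figure explicitly (plt.show() and plt.savefig() are covered in the next section). A minimal sketch of what such a script could contain:


In [ ]:
# Contents of an equivalent stand-alone script -- note there is no %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 100)
plt.plot(x, x**2)
plt.show()  # or plt.savefig("figure.pdf")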

2. Plotting

Making nice plots is an involved and painful task. A few paradigms have emerged over the decades, each with its own advantages and disadvantages. We restrict our attention to plotting functions and statistical data. R as a language for statistical processing is popular in no small part due to its plotting module, ggplot2, which implements a school of plotting called the grammar of graphics. This is a declarative approach: subsequent statements added to the plot define the final look. Python's main plotting module, a behemoth called Matplotlib, is object-oriented: the figure is made up of a hierarchy of classes, and the properties of these decide the look of the final image.

Matplotlib is complex, you will often get the impression that it was designed to make plotting simple things difficult, and its default settings do not produce the most beautiful plots on the planet. The module Seaborn wraps around Matplotlib, and for a restricted set of plot types, it makes your task very easy. It also changes the horrific defaults of Matplotlib, so importing it will change the look of all of your plots, whether they were made with Matplotlib calls or with Seaborn. This is why we did not import it in the first cell of this notebook. Seaborn also makes plotting statistical data easy, and it interoperates marvellously with the statistical package Pandas. Python, of course, also has a declarative plotting module: try Altair, and see this write-up on why it is interesting in the first place. Here we will discuss only the object-oriented approach.

2.1 Matplotlib

2.1.1 The basics

This is the bare minimum you need to plot a function:


In [ ]:
x = np.linspace(0, 5, 10)
plt.plot(x, x**2);

We imported the module matplotlib.pyplot as plt, and we call its function plot to plot the square function. You always plot discrete points: x is a numpy array containing ten evenly spaced points between zero and five. On closer inspection, the curve is not smooth: this is because ten points are not enough for the illusion of smoothness. Let us add some more points, labels for the axes, and a title for the figure:


In [ ]:
x = np.linspace(0, 5, 100)
y = x**2
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('The most exciting function ever, full stop.');

The order in which you add the decorations to your figure does not matter. The figure is not actually created until you execute the cell. In fact, executing the cell just triggers a call to the function plt.show(), which instructs Matplotlib to draw the figure and display it. In a Python script, you would always call plt.show() manually. Let us plot the cube function too, and call plt.show() manually:


In [ ]:
x = np.linspace(0, 5, 100)
y1 = x**2
y2 = x**3
plt.plot(x, y1)
plt.plot(x, y2)
plt.xlabel('x')
plt.ylabel('y')
plt.show()

Notice the difference with this case:


In [ ]:
plt.plot(x, y1)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
plt.xlabel('x')
plt.ylabel('y')
plt.plot(x, y2)
plt.show()

After plt.show(), the figure is finalized and subsequent plotting commands start a new one, so for the second figure you must set the axis labels again.

Instead of showing the plot on the screen, you can write it to a file, which will also trigger Matplotlib to draw the figure. If you export it to PDF, it will be as scale-invariant as it can possibly be, so you can readily insert it in your LaTeX manuscripts.


In [ ]:
plt.plot(x, y1)
plt.xlabel('x')
plt.ylabel('y')
plt.savefig("whatever.pdf")
plt.close()

2.1.2 Object-oriented paradigm

The stuff that you see displayed is composed of a hierarchical structure of components. At the top level, it is an instance of the Figure class. This is what plt.plot() creates for you, with all the other underlying structures within; this function exists for your convenience, so that you can avoid dealing with classes if you want a simple plot. The structures in the hierarchy include the area where you draw, which is technically called the Axes class. You may have more than one Axes if you have subplots or embedded plots. An Axes then has x and y axes, which in turn have a scale, ticks, labels, and so on. If you have a single Axes instantiated, as in the examples below, you can access and change most parts of the hierarchy like you did above with the x and y labels and the figure title. If you want to do anything non-trivial, you have to compose the figure and its components yourself. The examples in this section are mainly from this tutorial. Let us instantiate an object of the Figure class, get an area to draw on, and plot the same thing:


In [ ]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, y)
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()

Armed with this knowledge, we can do inserts:


In [ ]:
fig = plt.figure()
axes1 = fig.add_axes([0.1, 0.1, 0.8, 0.8]) # main axes
axes2 = fig.add_axes([0.2, 0.5, 0.4, 0.3]) # insert axes

# main figure
axes1.plot(x, y1, 'r')
axes1.set_xlabel('x')
axes1.set_ylabel('y')
axes1.set_title('Square function in red')

# insert
axes2.plot(x, y2, 'b')
axes2.set_xlabel('x')
axes2.set_ylabel('y')
axes2.set_title('Cube function in blue')
plt.show()

You can also do arbitrary grids of subplots. The function plt.subplots conveniently creates the figure object for you and returns it along with the axes:


In [ ]:
fig, axes = plt.subplots(ncols=2)
y = [y1, y2]
labels = ["Square function", "Cube function"]
colors = ['r', 'b']
for i, ax in enumerate(axes):
    ax.plot(x, y[i], colors[i])
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title(labels[i])  
fig.tight_layout()

Matplotlib handles LaTeX reasonably well: just put things between $ signs. For instance, we can add a fancy legend:


In [ ]:
fig, ax = plt.subplots()
ax.plot(x, y1, label=r"$y = x^2$")
ax.plot(x, y2, label=r"$y = x^3$")
ax.legend(loc=2) # upper left corner
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')
plt.show()

You need the leading r in the strings to avoid some nastiness with backslashes.
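
A quick way to see the nastiness for yourself: without the leading r, Python interprets sequences such as \t itself (here as a tab) before Matplotlib ever sees the string, while the raw string keeps the backslash intact for LaTeX.


In [ ]:
print('$\tau$')    # the \t is swallowed as a tab character
print(r'$\tau$')   # the raw string preserves the backslash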

The rest is all about exploring the parameter space. Here we manually create a grid (this is necessary if we mix 2D, 3D or polar coordinates), and plot a bunch of things that Matplotlib can do. For more examples, refer to the gallery.


In [ ]:
# Some new data will be necessary
n = np.random.randn(100000)
t = np.linspace(0, 2 * np.pi, 100)
X, Y = np.meshgrid(t, t)
Z = (2.7 - 2 * np.cos(Y) * np.cos(X) - 0.7 * np.cos(np.pi - 2*Y)).T

# The actual plot
fig = plt.figure(figsize=(12, 6))
axes = [[],[]]
axes[0].append(fig.add_subplot(2, 4, 1))
axes[0][0].scatter(x, x + 0.25*np.random.randn(len(x)))
axes[0][0].set_title("Scatter")
axes[0].append(fig.add_subplot(2, 4, 2))
axes[0][1].step(x, y1, lw=2)
axes[0][1].set_title("Step")
axes[0].append(fig.add_subplot(2, 4, 3))
axes[0][2].bar(x, y1, align="center", width=0.5, alpha=0.5)
axes[0][2].set_title("Bar")
axes[0].append(fig.add_subplot(2, 4, 4))
axes[0][3].fill_between(x, y1, y2, color="green", alpha=0.5);
axes[0][3].set_title("Fill between");
axes[1].append(fig.add_subplot(2, 4, 5))
axes[1][0].hist(n, bins=100)
axes[1][0].set_title("Histogram")
axes[1][0].set_xlim((min(n), max(n)))
axes[1].append(fig.add_subplot(2, 4, 6))
p = axes[1][1].pcolor(X/(2*np.pi), Y/(2*np.pi), Z, cmap=matplotlib.cm.RdBu, vmin=abs(Z).min(), vmax=abs(Z).max())
axes[1][1].set_title("Color map")
fig.colorbar(p, ax=axes[1][1])
axes[1].append(fig.add_subplot(2, 4, 7, projection='3d'))
axes[1][2].plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=matplotlib.cm.coolwarm, linewidth=0, antialiased=False)
axes[1][2].set_title("Surface plot")
axes[1].append(fig.add_subplot(2, 4, 8, polar=True))
axes[1][3].plot(t, t, color='blue', lw=3);
axes[1][3].set_title("Polar coordinates")
fig.tight_layout()
plt.show()

Exercise 1. Create a three by three grid. Put Lluis Torner in the center. Surround him with aesthetically pleasing functions in the remaining subplots. Hint: io.imread("http://www.icfo.eu/images/static/director_LluisTorner.jpg") will get you a photo of him, and the method imshow will plot it.

2.2 Seaborn and Pandas

Seaborn is primarily meant for statistical plotting, but it also improves the defaults of all Matplotlib figures.

2.2.1 Side effects of importing it

Witness the magic of it:


In [ ]:
plt.plot(x, x**2)
plt.show()
import seaborn as sns
plt.plot(x, x**2)
plt.show()

Yes, importing a package should not have severe side effects. On the other hand, this is Python, not Haskell, so let us rejoice at this sheer opportunism.
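
If you prefer to keep the side effect under explicit control, Seaborn also exposes it as functions; a small sketch (sns.reset_orig() restores the stock Matplotlib defaults, sns.set() reapplies the Seaborn style):


In [ ]:
sns.reset_orig()   # back to the stock Matplotlib look
plt.plot(x, x**2)
plt.show()
sns.set()          # reapply the Seaborn defaults for the rest of the notebook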

2.2.2 Add Pandas

Pandas turns Python into a competitor to R. It lets you perform a wide range of statistical operations, but even more importantly, it makes low-level data processing chores easy. Here we load the standard Iris dataset via Scikit-learn and convert it to a Pandas dataframe, which is the key data structure of the package.


In [ ]:
iris = load_iris()
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])
iris.head()

You can access individual columns by indexing with the name of the column:


In [ ]:
iris["sepal length (cm)"].head()

We will use Seaborn for some basic visualization:


In [ ]:
sns.jointplot(x="sepal length (cm)", y="sepal width (cm)", data=iris, size=5);

Let us define an array with all the names of the features and plot their correlations.


In [ ]:
features = iris.columns.values[:-1]
sns.pairplot(iris, vars=features, hue="target", size=3);

Exercise 2. Plot the histogram of all four features. First, instantiate a Matplotlib figure in a one by four grid, and then pass the matching axes to Seaborn's distplot function that draws the histograms. A figsize=(14, 4) is a recommended parameter to plt.subplots, otherwise the figure will be too squished. Use zip to iterate over the axes and the features simultaneously.


In [ ]:

3. Modelling

Python is a comfortable environment for prototyping ideas, and since it also acts as a glue language, you will probably have most of the scaffolding in Python for your actual calculations too. Many libraries that are primarily meant for prototyping are actually efficient enough to cover decent real-world use cases, and your prototype can be your final model. This is especially the case if your numpy/scipy packages were compiled with an efficient BLAS/LAPACK implementation, which is exactly what Anaconda gives you with MKL built in. For machine learning prototypes, Scikit-learn is great: it covers most of the important algorithms of the non-deep learning type. For quantum physics, QuTiP is highly recommended: it is still actively developed, its efficiency is improving day by day, and it covers a wide range of problems, including functions related to quantum info and open quantum systems.

3.1 Machine learning prototypes

This section is based on the examples in the book Python Machine Learning.

At the end of the day, most machine learning algorithms will output a function that we can use for any data instance. To characterize the learning algorithm, we can plot the decision boundaries that this function produces. The following helper function plots this decision function along the first two dimensions in the data set:


In [ ]:
def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = matplotlib.colors.ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], alpha=0.8, 
                    c=cmap(idx), marker=markers[idx], label=cl)

    # highlight test samples
    if test_idx:
        # mark the test samples
        X_test, y_test = X[test_idx, :], y[test_idx]

        plt.scatter(X_test[:, 0], X_test[:, 1], c='', alpha=1.0,
                    linewidths=1, marker='o', s=55, label='test set')

We reload the Iris data set:


In [ ]:
iris = load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
print('Class labels:', np.unique(y))

To avoid overfitting, we split the data set into a training and a validation part. This is a single static random split, not the repeated random splitting you would do in ten-fold cross-validation.


In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We standardize the distribution of the features of the data set. Some kind of normalization or standardization is usually a good idea. Certain learning models work well with data vectors of norm 1, for instance. Here we choose standardization because the physical size parameters of the iris species actually follow a normal distribution.


In [ ]:
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))

The simplest model we can possibly train is logistic regression, which is essentially a neural network of a single neuron. Even this simple model misclassifies only four instances:


In [ ]:
lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(X_train_std, y_train)
plot_decision_regions(X_combined_std, y_combined,
                      classifier=lr, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

The decision boundary is linear.

Notice the C parameter in the instantiation of the logistic regression class. This is not an ordinary parameter but a hyperparameter: the training algorithm does not optimize over it, so it is our task to find a good value for it. The objective function we are minimizing for logistic regression is

$$J(\mathbf{w}) = \sum_{i=1}^N\left[-y_i \log(\phi(z_i))-(1-y_i)\log(1-\phi(z_i))\right] + \frac{1}{2C}\|\mathbf{w}\|^2,$$

where $z_i = \mathbf{w}^\top \mathbf{x}_i$ and $\phi(z) = 1/(1+e^{-z})$ is the logistic sigmoid. By increasing the regularization strength $\frac{1}{2C}$, that is, by decreasing $C$, we get a sparser model: in our case, it means that fewer features will be factored in. We can plot this as follows:


In [ ]:
weights, params = [], []
for c in np.arange(-5, 5):
    lr = LogisticRegression(C=10.0**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10.0**c)
weights = np.array(weights)
plt.plot(params, weights[:, 0], label='petal length')
plt.plot(params, weights[:, 1], linestyle='--', label='petal width')
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.legend(loc='upper left')
plt.xscale('log')
plt.show()

This regularization term makes machine learning very different from statistics, at least as far as structural risk minimization goes. In general, sparser models have better generalization properties, that is, they are less prone to overfitting. Since there is no explicit way to optimize over the hyperparameter, you typically do something like grid search, as in the sketch below.
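
A minimal sketch of such a grid search with Scikit-learn's GridSearchCV, reusing the standardized training set from above; the five-fold cross-validation and the grid of ten powers of ten are arbitrary choices for illustration:


In [ ]:
from sklearn.model_selection import GridSearchCV

# Search C over the same ten powers of ten as in the plot above
param_grid = {'C': [10.0**c for c in np.arange(-5, 5)]}
grid = GridSearchCV(LogisticRegression(random_state=0), param_grid, cv=5)
grid.fit(X_train_std, y_train)
print("Best C:", grid.best_params_['C'])
print("Test accuracy:", grid.score(X_test_std, y_test))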

Exercise 3. Study the decision boundary for the above ten choices of $C$.


In [ ]:

3.2 Quantum simulations

We take an example from the QuTiP documentation (Section 3.6.5): a harmonic oscillator coupled to a thermal environment, a system that relaxes to a steady state. The initial state is the $|10\rangle$ number state, and the oscillator is weakly coupled to a thermal environment with an average occupation number of $\langle n\rangle = 2$.


In [ ]:
N = 20
a = destroy(N)
H = a.dag() * a
psi0 = basis(N, 10) # initial state
kappa = 0.1 # coupling strength to the environment

Next we define the collapse operators:


In [ ]:
n_th_a = 2 # temperature with average of 2 excitations
rate = kappa * (1 + n_th_a)
c_op_list = [np.sqrt(rate) * a] # decay operators
rate = kappa * n_th_a
c_op_list.append(np.sqrt(rate) * a.dag()) # excitation operators

We calculate the steady state and the particle number in the steady state:


In [ ]:
final_state = steadystate(H, c_op_list)
fexpt = expect(a.dag() * a, final_state)

We calculate the time evolution over a hundred points with two methods: by the Monte Carlo method and by solving the master equation:


In [ ]:
tlist = np.linspace(0, 50, 100)
# monte-carlo
mcdata = mcsolve(H, psi0, tlist, c_op_list, [a.dag() * a], ntraj=100)
# master eq.
medata = mesolve(H, psi0, tlist, c_op_list, [a.dag() * a])

In [ ]:
plt.plot(tlist, mcdata.expect[0], tlist, medata.expect[0], lw=2)
plt.axhline(y=fexpt, color='r', lw=1.5)
plt.ylim([0, 10])
plt.xlabel('Time', fontsize=14)
plt.ylabel('Number of excitations', fontsize=14)
plt.legend(('Monte Carlo', 'Master Equation', 'Steady State'))
plt.title(r'Decay of Fock state $|10\rangle$ in a thermal environment with $\langle n\rangle=2$')
plt.show()

Exercise 4. Improve the Monte Carlo simulation to approximate the master equation more closely. Typing mcsolve? will give you detailed help on the parametrization of the solver.


In [ ]:

QuTiP has built-in functions to work with thermal states. Let us consider a state that is on average occupied by two photons:


In [ ]:
rho_thermal = thermal_dm(N, 2)

In [ ]:
fig, axes = plt.subplots(1, 3, figsize=(12,3))
axes[0].matshow(rho_thermal.data.toarray().real)
axes[0].set_title("Matrix plot")
axes[1].bar(np.arange(0, N)-.5, rho_thermal.diag())
axes[1].set_xlim([-.5, N])
axes[1].set_title("Diagonal")
plot_fock_distribution(rho_thermal, fig=fig, ax=axes[2])
axes[2].set_title("Fock number distribution")
plt.show()

Exercise 5. Create and study the maximally mixed state of dimension $N$. Here are three possible ways to do it:

  1. The Cheap Way: import the function maximally_mixed_dm from QuTiP.
  2. The Way of the Boson: use the previous function thermal_dm and increase the average particle number until you converge to the maximally mixed state.
  3. The Church of Nonlocality Way: trace out half of a maximally entangled state. For qubits, you could create a Bell pair with (tensor(basis(2, 0), basis(2, 0)) + tensor(basis(2, 1), basis(2, 1))).unit() and then trace out either party. If you generalize this to $N$ dimensions, you get the solution. Bonus points are awarded for simulating the Unruh effect.

In [ ]:

4. Writing good code

The ideal scientific code is written in good style, it is correct, and it is fast. Correctness is up to you to verify. Here we will mainly focus on style, with a few words on efficiency. Unfortunately, Jupyter does not give you tools to cultivate good habits in writing Python. For that, you have to go to Spyder (or to another advanced editor).

First and foremost, you should follow the PEP8 recommendations. They are, of course, not easy to keep in mind. Open Spyder and go to Tools->Preferences. Then, under Editor, choose the tab Code Introspection/Analysis. Tick the Real-time code style analysis box. By enabling this, you will see yellow warnings every time you violate PEP8 in a line, and mousing over will tell you what the problem is.
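
To give a flavour of what the analysis flags, here is a small made-up snippet in both styles:


In [ ]:
# Flagged by the style checker: missing whitespace around operators and
# after commas, and two statements squeezed onto one line
import numpy as np
x=np.linspace(0,5,100);y=x**2

# The same code, PEP8-compliant
x = np.linspace(0, 5, 100)
y = x**2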

The next stage is linting. Linting is a form of static code analysis that points out potential problems and errors, and gives you further style advice. It is static in the sense that it does not run your code to find the errors. In Spyder, press F8 to run the analysis. It will give you a numerical score of how good your code is, ranging from $-\infty$ to 10. The problems are sorted into categories. You should pay special attention to errors and warnings.

Exercise 6. Pick a homework solution and start fixing it up. Make five or six improvements. Run Pylint before and after.

If your code is pretty and you are sure that it is correct, then and only then can you start improving its run time. As Donald Knuth said, "premature optimization is the root of all evil", so do not waste any effort on speeding up your code before you know that the results are what you expect (and your code is pretty).

Python has a reputation for being slow, but it is actually not too bad, and the problem can often be circumvented. In fact, Jupyter has a built-in magic command for timing your operations:


In [ ]:
a = [i for i in range(1000)]
b = np.array(a)
%timeit sum(a)
%timeit sum(b)
%timeit np.sum(a)
%timeit np.sum(b)

When you try to improve the speed of your code, you must ensure two things:

  1. Your code remains correct. Write some tests that verify the results; the more tests you have, the better. For non-trivial projects, please make use of Python's excellent unit testing framework (a minimal example follows this list). Unit tests are your best friends, yet scientists seldom use them. In industry, you cannot possibly work on anything without providing unit tests.

  2. You focus on the time-critical part. There is no point in improving the speed of a chunk of code that is executed once and takes a second, as opposed to improving a function that gets called a hundred thousand times, taking a second each time. Finding the time-critical parts is called profiling. Note that profiling is dynamic, that is, you must execute your code.
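
To make the first point concrete, here is a minimal sketch of a test written with Python's built-in unittest module; the test only sanity-checks the toy data from the plotting section, so replace the assertions with checks of your own results:


In [ ]:
import unittest

import numpy as np


class TestSquares(unittest.TestCase):
    """Toy test case: sanity-check the data we plotted earlier."""

    def test_endpoints(self):
        x = np.linspace(0, 5, 100)
        y = x**2
        np.testing.assert_almost_equal(y[0], 0.0)
        np.testing.assert_almost_equal(y[-1], 25.0)


# In a script you would simply call unittest.main(); the arguments below
# keep it from shutting down the notebook kernel.
unittest.main(argv=['ignored'], exit=False)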

Spyder has a built-in profiler. To run it on the current file, press F10. Make sure it is a file that executes fast. Then it gives you a hierarchical call structure telling you what was called and how often, and how long it took.

Exercise 7. Copy this to an empty file, profile it, and improve its run time.


In [ ]:
import numpy as np


def dummy_test_for_slow_function():
    """Super-stupid test for the function"""
    np.testing.assert_almost_equal(slow_function(), 157687.67990470183)


def slow_function():
    """This function could see a fair bit of improvement"""
    total = 0
    for i in range(1000):
        a = np.array([np.sqrt(j) for j in range(1000)])
        b = a/(i+1)
        total += sum(b)
    return total

# We run the test
dummy_test_for_slow_function()

5. Open science

Open science is an umbrella term for a set of initiatives that are meant to liberate science from the shackles of academic publishers and software licences, to provide access to research results to the very taxpayers who paid for them, to encourage networking, and to do credible research with contemporary tools. Open access publishing, academic social networks, open source code repositories, and open access data all belong here, but here we limit ourselves to code and data.

Things you want to achieve by making your code and data available:

  1. Reproducibility of results

    • At all. Most numerical results are barely reproducible, and often they are not reproducible at all, not even by the original authors.

    • With ease. If you have to struggle to compile something for a week just to start a calculation, that is not particularly attractive.

  2. Discoverability.

  3. More citations, which is the ultimate goal of being a scientist today. Points 1) and 2) should yield this.

Stages:

  1. ALWAYS choose a license. Without a license, nobody will know what they can legally do with or expect from your code. Technically speaking, without a license, code dumped online is not open source. When you create a new repository on GitHub, you are automatically offered a handful of licenses. Unfortunately, this is only optional, but please choose one for every repo you create.

    • When in doubt: GNU Public License Version 3 (GPLv3). This is a 'contagious' license: if somebody builds on your code and distributes the result, it must also be released under the GPL. This is a bad choice of license if you want your code to be used by commercial entities; in that case, a BSD or Apache-style license works better. On a related note, text is also always released under some form of a license. For instance, when you upload a manuscript to arXiv, you are forced to choose a license. Open access publishers typically choose a Creative Commons license, which is as author-friendly as it gets. For texts that you publish online that are not manuscripts, you should also consider a Creative Commons license. Data is also typically released under a Creative Commons license.
  2. Provide decent documentation. Raw code without any code comments is not especially helpful. Not for others, and not for your future self.

    • If it is a one-off piece of code that does some calculation, spend an hour writing it up in a notebook, if you developed the code in a notebook-friendly language. If the helper functions are too long, you can collect them in a separate file that you import in the notebook. If a notebook is too much, write a gist that is decently commented. If your code is complex, spans many files, is an entire module or package, or is generally not notebook-friendly, then create a repository and add a readme file that tells what it does and how to use it. Use the Markdown format for the readme to have an easy way of formatting code within the document. The text parts of notebooks also use Markdown.

    • Write documentation for functions that are user-facing. Often a one-liner is enough, but you can make the extra effort of writing up the documentation for all arguments. In Julia, defining the types of the arguments and providing the documentation go hand in hand, establishing a solid best practice. In Python, you can embed tests in the function docstrings, which is insanely handy (see the short sketch after this list).

    • If the thing that you developed is a library, package, or module, your documentation will probably be longer. Awesome tools like Sphinx for Python code and Doxygen for C++ help you write beautiful documentation that yields a PDF manual or HTML pages. There are $N+1$ places to host the result, e.g. ReadTheDocs, which also integrates neatly with GitHub: every time you release a new version, ReadTheDocs will automatically update the documentation to the latest variant. GitHub itself offers a way to host documentation via GitHub Pages: create a branch in your repository called gh-pages, and the HTML files you put there will appear at https://yourusername.github.io/reponame.

  3. Choose a place of sharing. You want to ensure that your work is visible.

    • In the case of code, the go-to place is GitHub, so this is pretty straightforward. You can't get more visible than that, plus GitHub integrates with everything else under the sun, including things we already mentioned (e.g. ReadTheDocs) and things we did not (like continuous integration, project management, code quality, cloud deployment, and a hundred others; check GitHub Integrations). There is one distinct disadvantage that we should keep in mind, though: GitHub is a private entity. They are out to make money, and not to do charity to academics. They are based in the US, with little oversight over how much metadata they harvest about you and how they use it. Furthermore, they can go bankrupt or change their business model, which might render your repos inaccessible. However, they cannot 'steal' your code, since it is released under a license of your choice.

    • In the case of data, the options are more scattered. My recommendation is Zenodo. It is funded by CERN and OpenAIRE, and it is hosted at CERN itself, on the same infrastructure that receives the 25 PB of data the LHC produces every year. The longevity of your data there is ensured, and your metadata stays in Europe.

  4. Get a DOI for code and data. Zenodo provides a DOI for everything you upload there. It also integrates with GitHub, so getting a DOI for code hosted on GitHub is $\epsilon$ effort. If your code is a library, package, or module, perhaps you want to write up a paper on it and get a DOI for that. In my (PW) experience, it is not worth it. Your publication outlets are limited. In physics, you can go for the Journal of Computational Physics or Computer Physics Communications. Both have short review cycles and decent Impact Factors, but they are owned by the Beast (Elsevier). In machine learning, you have two open access outlets that have pretty good indicators, plus they are free to publish in. If you can make it into the Open Source Software track of the Journal of Machine Learning Research, that means a lot. The other option is the Journal of Statistical Software, where it took two years to review a manuscript, and in the year that has passed since the acceptance decision, the page proofs still have not arrived. For mathematical stuff, there is the ACM Transactions on Mathematical Software, which takes about a year to review a paper and is run by the ACM, an organization even more hidebound than the APS. An additional problem with all of these options is that you cannot really modify the library or the documentation you provided in the paper, whereas your library will probably evolve. So you might as well save yourself the trouble, get a DOI from Zenodo, and let the documentation and the links update automatically with each release on GitHub.
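
To make the docstring-test remark under point 2 concrete, here is a minimal doctest sketch with a made-up function; the standard doctest module runs the examples embedded in the docstring and compares the output:


In [ ]:
def square(x):
    """Return the square of x.

    >>> square(3)
    9
    """
    return x**2


import doctest
doctest.testmod()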

Exercise 8. For the next paper you write that includes even a single line of code for numerical or symbolic calculations, make a computational appendix on GitHub as a gist, as a notebook, or as a repository. You might as well include a link here by sending a pull request: since the incoming link count to your code will increase, it will rank higher in search engines, and your work will be a notch easier to discover.