Installation

Installation instructions are hosted by the Jupyter project here. That official documentation is the definitive reference for installing the software, but I will briefly describe how to install the Jupyter Notebook using the Anaconda Python distribution.

Anaconda

Anaconda by Continuum Analytics is a free Python distribution that contains more than 300 of the best Python packages for data analysis and scientific research. Armed with conda, the cross-platform and Python-agnostic binary package manager shipped with Anaconda, the Anaconda distribution makes it easy and quick to install Python packages without worrying about third party library requirements, compiling, or version incompatibilities.

Our own research computing support team at Boston University recommends Anaconda, and documents how to use it, for Python work on the GEO/SCC cluster.

Download

Download the Anaconda distribution from their website:

https://www.continuum.io/downloads

Select the download for your operating system (Windows, Mac, or Linux) and your computer's architecture (32-bit and 64-bit are available, but you most likely want 64-bit). I recommend the Python 3 download, as it is the present and future version of the language, unless you're tied to Python 2 because some third-party library is incompatible with Python 3 (e.g., QGIS). See this page for more information on the history of the versions and the differences between them.

Users on Windows must use the graphical installer, while users on Macs have the option of a graphical or terminal installation. Linux users must use the terminal-based installer.

Installation instructions from Continuum Analytics are provided on their download site and also here: http://docs.continuum.io/anaconda/install

Instructions for the terminal based installers are also included below:

Terminal installation

You must run the Anaconda installer with bash, even if bash is not your usual shell. Navigate to the download and execute the installer as follows:

cd Downloads/
bash Anaconda3-2.3.0-MacOSX-x86_64.sh

or

cd Downloads/
bash Anaconda3-2.3.0-Linux-x86_64.sh

Note that the version numbers in the download filename will change. You should substitute the version number of the installer you downloaded as necessary.

Read and agree to the license terms. Unless you have good reason to do otherwise, it is perfectly okay to use the default installation options.

If you did not allow the installer to prepend the Anaconda installation to your PATH by editing your .bashrc, prepend it manually (substituting your own installation directory):

export PATH=/home/ceholden/anaconda3/bin:$PATH
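
The order of the entries in PATH matters: the shell uses the first directory that contains a matching executable, which is why the Anaconda directory goes in front. The following is a minimal sketch of that lookup using Python's standard library. The helper name resolve is my own, and ~/anaconda3/bin is only the assumed default install location, so adjust it to match your system.

```python
import os
import shutil


def resolve(name, path_entries):
    """Return the file that `name` resolves to for the given PATH entries.

    The first directory in `path_entries` containing an executable called
    `name` wins, which mirrors how the shell searches PATH.
    """
    return shutil.which(name, path=os.pathsep.join(path_entries))


if __name__ == "__main__":
    # Assumed default install location; change this if you installed elsewhere
    anaconda_bin = os.path.expanduser("~/anaconda3/bin")
    current = os.environ.get("PATH", "").split(os.pathsep)
    print("python before prepending:", resolve("python", current))
    print("python after prepending: ", resolve("python", [anaconda_bin] + current))
```

If the second line printed does not point inside your Anaconda directory, the directory you prepended is probably not where Anaconda was installed.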

Test

To test the conda installation, please try to update conda:

  1. Windows
    • Open a command prompt (Run -> cmd)
    • Run conda update conda
  2. Mac OS X
    • Open the Terminal
    • Run conda update conda
  3. Linux
    • Open a terminal. You should know what to do.
    • Run conda update conda
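
If conda update conda fails with a "command not found" style error, it can help to check whether the conda executable is visible at all. This is a small sketch, not an official check: the function name tool_version is hypothetical, and it relies only on conda accepting the --version flag.

```python
import shutil
import subprocess


def tool_version(name):
    """Return the output of `<name> --version`, or None if `name` is not on PATH."""
    executable = shutil.which(name)
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some tools print their version to stderr
        universal_newlines=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(tool_version("conda") or "conda was not found on PATH")
```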

Jupyter

Once Anaconda has been installed and is configured, you can install the Jupyter notebook simply as follows:

conda install jupyter

With Jupyter installed, you can run a notebook session as follows:

jupyter notebook

Read the text that the notebook program prints to the console. The Jupyter program is a web browser based application, so what you're seeing is some information about the web server that Jupyter has launched. You will most likely be assigned port 8888, but it might be different. Your log should look something like this:

[I 10:48:03.225 NotebookApp] Serving notebooks from local directory: /home/ceholden/Downloads
[I 10:48:03.225 NotebookApp] 0 active kernels 
[I 10:48:03.225 NotebookApp] The IPython Notebook is running at: http://localhost:8889/

The URL listed (e.g., http://localhost:8889/) is the URL of the page you want to navigate to using your web browser.
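
The notebook server moves on to another port when its default port, 8888, is already taken, which is why the example log above shows 8889. If you want to see which ports are occupied before launching, a rough check looks like the sketch below. The function name port_in_use is my own, and the check only tells you that something is listening, not what it is.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (8888, 8889):
        print(port, "in use" if port_in_use(port) else "free")
```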

You may wish to change your directory in the terminal before launching jupyter notebook because your current directory controls which notebooks can be opened. You can access any notebook file in or below the directory from which you launched the notebook, but you cannot access notebooks above it.
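
To make the rule concrete, here is a short sketch that tests whether a given notebook file sits somewhere below a launch directory. The function name is_reachable and the example paths are illustrative only, and the check is purely about path layout, not about what the notebook server itself enforces.

```python
from pathlib import Path


def is_reachable(notebook, launch_dir):
    """Return True if `notebook` lies inside `launch_dir` at any depth."""
    notebook = Path(notebook).resolve()
    launch_dir = Path(launch_dir).resolve()
    # `parents` lists every directory above the notebook file
    return launch_dir in notebook.parents


if __name__ == "__main__":
    launch_dir = Path.cwd()  # the directory you would run `jupyter notebook` from
    for candidate in ("analysis.ipynb", "../elsewhere/analysis.ipynb"):
        print(candidate, "->", is_reachable(launch_dir / candidate, launch_dir))
```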

Additional Topic: Jupyter notebooks on the GEO/SCC cluster

Running jupyter notebook from your local machine is fairly trivial. It is also trivial to run the notebook from the GEO/SCC cluster, but there are a few security considerations that make it somewhat harder to run the notebook responsibly from the cluster.

Security

The jupyter notebook works just like a web server in that anyone who knows the IP address and port number can reach it over the internet. What makes it different from normal web servers is that the jupyter notebook has tremendous power and does not limit code execution. A malicious user could, for instance, delete your entire project folder if they had access to your notebook session. Thus, it is important to protect your notebook session with a password and to encrypt the connection to it.

I will not try to include all information relevant to securing the notebook in this tutorial because it is likely to rapidly become out of date and, thus, insecure. Instead, I will link to the official security documentation, and I am happy to help anyone follow along. For information on how to generate a password for your notebooks and an SSL certificate to encrypt your communication, please follow the guide linked below:

http://jupyter-notebook.readthedocs.org/en/latest/public_server.html

Port forwarding

Once you follow the security steps listed above, it is pretty easy to connect to a head node on the cluster running the jupyter notebook session by forwarding the port of the notebook server from the remote host to your local host using SSH. For instance,

ssh -L 8888:localhost:8888 -N ceholden@geo.bu.edu

Users familiar with accessing the GEO/SCC head nodes via Remote Desktop VNC sessions will be used to this procedure. For those unfamiliar, the -L option tells ssh to forward port 8888 on my local machine to port 8888 on the geo.bu.edu server, and the -N option tells ssh not to run any command on the remote side. When I access localhost:8888 in my web browser, the ssh session relays my request to the geo.bu.edu server, which is hosting my jupyter notebook session.
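
The argument to -L has the form local_port:destination_host:destination_port, where the destination host is resolved from the remote machine's point of view. As a sketch of how the pieces fit together, the hypothetical helper below assembles the same command for any user, host, and pair of ports. It only builds the command and does not run it.

```python
def tunnel_command(user, host, local_port=8888, remote_port=8888):
    """Build the argument list for an SSH local port forward."""
    # -L local_port:localhost:remote_port
    #    "localhost" here means the remote host itself
    # -N do not run a remote command, only forward the port
    forward = "{}:localhost:{}".format(local_port, remote_port)
    return ["ssh", "-L", forward, "-N", "{}@{}".format(user, host)]


if __name__ == "__main__":
    print(" ".join(tunnel_command("ceholden", "geo.bu.edu")))
```

To open the tunnel from Python rather than just print the command, the returned list can be passed to subprocess.call. If the notebook on the server ended up on a different port, change remote_port and leave local_port as whatever you want to type into your browser.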

Head node versus compute node

As all users of the GEO/SCC cluster should know, the head nodes are only to be used for lightweight tasks. Intensive computing and data visualization using the jupyter notebook do not qualify. Instead of running the notebook server on the head node, one should use the qsub or qsh system to run the notebook server on a compute node.

Consider the following scenario:

ceholden@geo: > qsh -V -l h_rt=24:00:00 -N jupyter_nb
...
Your job 9992339 ("jupyter_nb") has been submitted
waiting for interactive job to be scheduled ....
...
ceholden@scc-gb08: > jupyter notebook
[I 11:35:11.149 NotebookApp] Serving notebooks from local directory: /usr3/graduate/ceholden
[I 11:35:11.150 NotebookApp] 0 active kernels 
[I 11:35:11.150 NotebookApp] The IPython Notebook is running at: https://[all ip addresses on your system]:8888/

I cannot directly access the compute node I've been assigned, scc-gb08, from my local machine. I have to go through a machine that can access it -- the "head nodes" scc1.bu.edu, scc2.bu.edu, geo.bu.edu, or scc4.bu.edu.

The first step is to ssh into the compute node from the head node to confirm the RSA key fingerprint of the compute node (i.e., confirm that the machine is who it says it is):

The authenticity of host 'scc-* (IP ADDRESS)' can't be established.
RSA key fingerprint is `RSA KEY`.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'scc-*' (RSA) to the list of known hosts.

This only needs to be done once per compute node. When we connect to the compute node in the next step we will not have a chance to manually accept the RSA key, so it is important to do this first.

Next, tunnel through the head node (geo.bu.edu in this example) and into the compute node (scc-gb08.bu.edu in this example) using two SSH commands:

ssh -L 8890:localhost:8890 ceholden@geo.bu.edu ssh -L 8890:localhost:8888 -N ceholden@scc-gb08

The first SSH command is almost identical to the forwarding we would perform if we only needed to access the head node. However, instead of passing the -N option so that no remote command is run, we run a second ssh command on the head node that connects to the compute node.

We are forwarding port 8888 from the compute node to port 8890 on the head node. From there we are forwarding port 8890 from the head node to our local machine. This is commonly referred to as "multiple hops" or "multi-hop". Since port 8888 on the compute node is now port 8890 on our local machine, we can simply access localhost:8890 on our local web browser.
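
The chain of ports can be easier to follow when each hop is written out separately. The sketch below rebuilds the two-hop command from its parts. The function name multi_hop_command and its parameter names are my own, and like the command above it assumes the notebook is listening on port 8888 on the compute node.

```python
def multi_hop_command(user, head_node, compute_node,
                      local_port=8890, head_port=8890, notebook_port=8888):
    """Build a two-hop SSH port forward: local -> head node -> compute node."""
    # Hop 1: local_port on my machine -> head_port on the head node
    first_hop = ["ssh", "-L", f"{local_port}:localhost:{head_port}",
                 f"{user}@{head_node}"]
    # Hop 2 (run on the head node): head_port -> notebook_port on the compute node
    second_hop = ["ssh", "-L", f"{head_port}:localhost:{notebook_port}",
                  "-N", f"{user}@{compute_node}"]
    # Everything after the first destination is executed as the remote command
    return first_hop + second_hop


if __name__ == "__main__":
    print(" ".join(multi_hop_command("ceholden", "geo.bu.edu", "scc-gb08")))
```

The head-node port (8890 here) is arbitrary. It only needs to be free on the head node and to match in both hops.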

See the following StackOverflow question and answers for more information: