The Scientific Python Ecosystem

Python

Python is an interpreted, high-level programming language designed to be easy to read and usable for a wide range of purposes. Its ecosystem includes many libraries with tools for quick and efficient data analysis and visualization. These libraries are like Lego blocks: you can pick and choose the ones you need to build your end product. The Scientific Python Ecosystem is built around key libraries (e.g., NumPy, SciPy, Pandas, Matplotlib) that serve as a basis for most other libraries (e.g., MetPy). In this notebook, we'll briefly touch on several of these foundational libraries of the SciPy Ecosystem.

Getting Python: Anaconda

Anaconda provides distributions of Python and the main third-party packages, either as a full distribution or as a lighter-weight version, "Miniconda". We recommend using Anaconda to build and maintain your Python stack, as it provides command-line tools to download and update Python libraries. You can check it out at https://www.anaconda.com/distribution/.

Jupyter

The Jupyter library provides "literate programming" interfaces for Python and other programming languages. This file is displayed using the Jupyter library, either within Jupyter Notebook or JupyterLab. A notebook combines code, prose, and other content (equations, HTML) in small blocks called cells, producing a seamless document for your analysis or presentation. Working in small blocks also allows for quick prototyping and debugging of code as you write!

The Building Blocks of the SciPy World

While Python is the basis for everything, this figure demonstrates how packages build on top of each other, creating dependencies. Packages are also constantly under development, so this structure is somewhat transient as the SciPy world continues to expand (see Dask as a recent addition to this framework).

Data Analysis/Computation Libraries

NumPy

NumPy is the primary numerical computation library in Python. It works with N-dimensional arrays and matrices and performs basic computations on them.


In [ ]:
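import numpy as np

# Two 1-D arrays of ten integers each (the end value is exclusive)
x = np.arange(1, 11)
y = np.arange(100, 110)

# Mean over all 20 values of both arrays combined
mean_x_y = np.mean([x, y])
print(mean_x_y)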

Pandas

Pandas is an excellent library for handling tabular data and quickly performing data analysis on it. It can read many text file formats.


In [ ]:
import pandas as pd

# Read a tab-separated text file into a DataFrame
df = pd.read_csv('../Pandas/Jan17_CO_ASOS.txt', sep='\t')
# Preview the first five rows
df.head()

xarray

xarray is a Python library meant to handle N-dimensional arrays with metadata (think netCDF files). With the Dask library, it can work with Big Data efficiently in a Python framework.


In [ ]:
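import xarray as xr

# Open a netCDF file as an xarray Dataset
ds = xr.open_dataset('../../data/NARR_19930313_0000.nc')
# Display the dataset's dimensions, coordinates, variables, and attributes
ds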

Dask

Dask is a parallel-computing library for Python. You can use it on your laptop, in a cloud environment, or on a high-performance computer (NCAR's Cheyenne, for example). It evaluates lazily, so computations occur only after you've chained all of your operations together and asked for a result. Additionally, its built-in schedulers distribute the work across your available parallel resources.
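As a minimal sketch of the lazy evaluation described above, the cell below builds a chunked Dask array and chains a few operations before computing anything. The array shape, chunk size, and arithmetic are arbitrary choices for illustration and are not taken from this notebook.

```python
import dask.array as da

# Build a lazy 1000x1000 array of ones, split into 250x250 chunks.
# Nothing is computed at this point (sizes are illustrative only).
lazy_array = da.ones((1000, 1000), chunks=(250, 250))

# Chaining operations only extends the task graph; still no computation.
lazy_mean = (lazy_array * 2 + 1).mean()

# compute() triggers the actual evaluation, chunk by chunk.
result = lazy_mean.compute()
print(result)
```

Until `compute()` is called, `lazy_mean` is only a description of the work to do, which is what lets Dask schedule the chunks across the resources you have.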

SciPy

The SciPy library builds on NumPy with many more advanced mathematical functions, including Fast Fourier Transforms, interpolation methods, and linear algebra operations.
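To make two of these capabilities concrete, here is a small sketch that uses `scipy.fft` to find the dominant frequency of a synthetic signal and `scipy.interpolate` to interpolate between known points. The 5 Hz sine wave and the sample points are made-up inputs chosen only for illustration.

```python
import numpy as np
from scipy import fft, interpolate

# Synthetic input: a 5 Hz sine wave sampled 100 times over one second.
sample_rate = 100
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 5 * t)

# Fast Fourier Transform of a real signal, with the matching frequencies.
spectrum = np.abs(fft.rfft(signal))
freqs = fft.rfftfreq(signal.size, d=1 / sample_rate)
dominant_freq = freqs[np.argmax(spectrum)]
print(dominant_freq)

# Linear interpolation between three known (made-up) points.
interp_func = interpolate.interp1d([0.0, 1.0, 2.0], [0.0, 10.0, 20.0])
value_at_half = float(interp_func(0.5))
print(value_at_half)
```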

Scikit-learn

Scikit-learn is the primary machine learning library for Python. It handles simple tasks like regression and classification as well as more advanced techniques like random forests. It can perform some neural network operations, but for big data implementations, check out the Keras library.
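As a rough illustration of the simple regression case, the sketch below fits a straight line to synthetic data with `LinearRegression`. The data (a perfect line with slope 2 and intercept 1) is invented for the example, so the fit should recover those values closely.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2x + 1 with no noise (illustrative only).
# scikit-learn expects features as a 2-D array of shape (n_samples, n_features).
features = np.arange(10, dtype=float).reshape(-1, 1)
targets = 2.0 * features.ravel() + 1.0

# Fit the model, then inspect the learned slope and intercept.
model = LinearRegression().fit(features, targets)
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)

# Predict the target for a new feature value.
prediction = model.predict(np.array([[20.0]]))[0]
print(prediction)
```

Most scikit-learn estimators follow this same `fit` then `predict` pattern, so swapping in a different model usually changes only the import and the constructor.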

Scikit-image

Scikit-image is an image processing library built on NumPy.

Visualization Libraries

Matplotlib

Matplotlib is one of the core visualization libraries in Python and produces publication-quality figures without much configuration.


In [ ]:
import matplotlib.pyplot as plt

# Plot the x and y arrays created in the NumPy example above
plt.plot(x, y)
plt.title('Demo of Matplotlib')
plt.show()

CartoPy

CartoPy is the primary geographical mapping and visualization library in Python, as support for Basemap has been discontinued. It handles a variety of map projections and the transformations to and from them, so that your data is mapped accurately for your problem.

Atmospheric Science Libraries

MetPy

MetPy is developed at Unidata with support from the user community as a replacement for GEMPAK. Its primary functions are to read in data, perform meteorological calculations on it, and visualize it in a useful way for education and research.

Pint

Pint is a unit-handling library, which MetPy relies upon for its calculations. Pint allows units to be attached to NumPy arrays, enabling unit-aware calculations and easy conversions that reduce unit-based errors.

netcdf4-python

This is another Unidata package that serves as an interface from Python to the netCDF-C library. As a result, netCDF files can easily be read and written in Python.

Siphon

The Siphon library, developed at Unidata, is a remote-access library built for accessing data on THREDDS servers. It also has hooks into the Wyoming, IGRA, and Iowa State upper air databases, the National Data Buoy Center, and the NHC and SPC storm reports.

For more information on the SciPy Ecosystem, check out these links: https://www.scipy.org/about.html and https://scipy-lectures.org/intro/intro.html