Introduction to 'Introduction to Python for Data Science'

Introduction to Python for Data Science is a series of lessons designed to equip you with a basic understanding of the parts of the Python programming language and the associated software libraries that are commonly used by data scientists.

What is Python?

Python, like R and MATLAB, is currently very popular for data science work. Unlike those languages, it is also a powerful general-purpose programming language, so it is well suited to very different activities such as creating dynamic websites and bioinformatics. It is named after Monty Python, a surreal comedy group (see this clip for a classic Monty Python sketch).

Here are some reasons why Python is extensively used by data scientists:

  • Readable and concise: some simple Python programs can be understood by non-programmers
  • Accessible: Python is often taught as a first programming language yet it is also...
  • Powerful: has various features that allow for the development of large programs
  • Well established: created in 1991; its associated data science packages are also fairly mature
  • Consistent and well-designed language: some would argue that this is not the case for R
  • Free and open-source (unlike MATLAB)
  • Many software packages available, including libraries for handling/plotting numerical data and for statistics/machine learning, plus many general purpose and specialist libraries (83269 freely available via the Python Package Index (PyPI) as of 2016-06-27)
  • Can provide accessible interfaces to very fast, efficient software written in compiled languages (like C, C++ and Fortran), which is useful for data science
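
To illustrate the 'readable and concise' point above, here is a minimal sketch of a short program that summarises some sensor readings. The readings and the 10 degree threshold are made-up values used purely for illustration.

```python
# Hypothetical temperature readings in degrees Celsius (made-up data).
readings = [12.5, 8.0, 15.25, 9.5]

# Collect the readings that fall below an arbitrary 10 degree threshold.
cold_readings = []
for reading in readings:
    if reading < 10:
        cold_readings.append(reading)

# Work out the mean of all the readings.
average = sum(readings) / len(readings)

print("Cold readings:", cold_readings)
print("Average reading:", average)
```

Even without knowing Python, it should be possible to guess roughly what each line does.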

Learning outcomes

TODO: INTRO SENTENCE

  • 00 - Introduction
    1. Jupyter and the IPython kernel for exploring code interactively within a Notebook
  • 01 - First steps
    1. Arithmetic, expressions and assignment
    2. NumPy for loading and manipulating n-dimensional arrays of data
    3. Missing data
    4. Plotting

TODO: FINISH THE ABOVE USING PREVIOUS SYLLABUS BELOW

  1. A glorified calculator
    • Arithmetic
    • Built-in types
    • Precision (true and visible, floating point error, rounding)
    • Expressions and assignment statements
    • Jupyter notebook structure, state and control
    • Converting between temperature units
  2. Basics of automation: collections and control
    • lists e.g. sequence of values from a sensor
      • Attributes and key operations
      • Plotting lists
      • Iterating over a list
        • Quantifying and visualising performance
    • dictionaries
      • Attributes and key operations
    • Nested structures (inc to/from JSON for interesting example)
    • Other data structures: sets, tuples (brief)
    • More control
      • conditionals
      • loops (inc. ranges, skipping steps and leaving early)
    • Comprehensions
  3. Parcelling up reusable chunks of code as functions
    • Example and motivation
    • Useful terminology (call, return, argument, parameters, scope)
    • Positional, named and default arguments
    • Mapping over collections
    • Introducing core library functions inc. how to find signatures and docstrings
    • Documenting
    • Testing: assert, py.test, hypothesis
    • Higher-order functions
  4. Parcelling up state information and associated operations as Objects
    • Example and motivation; note that only going to be covered briefly here
    • Class vs object distinction; method vs function distinction
    • Instantiation, retrieving information, setting information
    • References vs copies - source of confusion
    • How to get info on classes and class methods (signatures, docstrings)
  5. Contorting strings of characters
    • Common string methods
    • Basic file IO
    • Basic web API access
  6. Fast and concise numerical computation: arrays (plus plotting)
    • What are the characteristics of the datasets encountered by data scientists?
    • Quantify performance differences between lists and ndarrays
    • More general introduction to ndarrays
      • attributes
      • similarity to data structures in other languages
      • methods
      • instantiating (ones, zeros, empty, arange, linspace)
      • vector functions inc math ops
      • indexing (simple vs fancy), slicing
      • reshaping, stacking, splitting
      • Limited support for heterogeneously-typed rows and labelled data
    • Vectorized code: a different way of solving problems
    • Views vs copies (source of confusion; updating in-place)
    • Missing data
    • File IO
    • Plots
      • line
      • aesthetics
      • scatter
      • histogram
  7. Carrying contextual information around with your arrays: Series and Dataframes
    • Example and motivation
    • Introduction to Dataframes and Series (row and column indexes)
      • Attributes
      • similarity to data structures in other languages
      • reindexing
    • Views vs copies (source of confusion; updating in-place)
    • Groupings
    • Time-series
    • Basic interaction with databases
  8. Talking to other languages
    • rpy2
    • matlab
  9. Running Python elsewhere
    • Jupyter on Iceberg
    • conda
    • Python scripts
    • Setting up and using Python
      • on your own machine
      • on Iceberg
    • Finding and installing packages
  10. Additional exercises

INTEGRATE THROUGHOUT

  • Comments on things that work differently in Python 2
  • Introduce core libraries
    • os.path
    • datetime
    • json
    • sys for command-line arguments
    • Optional: itertools, functools

TODO: EXPLAIN THAT TOP DOWN, NOT BOTTOM UP, TO ENSURE GET TO DOING SOMETHING INTERESTING/USEFUL QUICKLY; MEANS THAT MAY NOT ALWAYS UNDERSTAND EVERY PRESENTED LINE OF CODE AT FIRST

Course format and support

Each of the lessons in this course (including this introduction) is what is called a Jupyter Notebook. These Notebooks contain explanatory text and also provide an environment for interactively viewing, editing and running code, enabling you to learn by doing and experimenting.

You are invited to work through these lessons independently and at your own pace.

If you are at a Code Cafe event (an informal workshop hosted on University of Sheffield premises) then instructors will be on hand to help you. Also, we have an interactive discussion notebook (https://v.etherpad.org/p/code_cafe) where you can ask questions and make comments. TODO: CREATE NEW ETHERPAD

If you are working remotely by yourself then ... TODO: FEEDBACK MECHANISM FOR DISTANCE LEARNERS (SMC, ETHERPAD, MAILING LIST, GOOGLE GROUP, GH ISSUES?). EXPLAIN IN SIMPLE TERMS WHAT PROVIDED.

The first step is to ensure that you are viewing these Notebooks in a way that lets you interact with them (as opposed to viewing a non-editable static snapshot of a Notebook on a site such as GitHub or nbviewer). You have several options for viewing/editing/running Notebooks:

  1. View/edit/run Notebooks using the Sage Math Cloud (SMC) service. If you have been instructed to use SMC then all you need to get started are an SMC account, copies of the course materials in your account and a web browser for accessing SMC. A more detailed guide to getting started with SMC will be provided separately. TODO: ENSURE THIS IS AVAILABLE. MENTION FEATURES/PAYMENT
  2. View/edit/run Notebooks entirely on your own machine. You may wish to start by trying Jupyter using Sage Math Cloud, as then you do not need to spend time installing and configuring software on your own machine. TODO: PROVIDE GUIDE?
  3. TODO: MENTION ICEBERG OR MANAGED DESKTOP HERE OR LATER ON?

Please open this Notebook using SMC or Jupyter running on your own machine (if you have not done so already) before continuing.

TODO: CLEAR ENOUGH?

Jupyter basics

What is a Notebook?

Each Jupyter Notebook is a document comprised of a sequence of cells. A cell can contain formatted text (as this one does) or some lines of runnable code (like the cell below). Code cells can generate output, which here is the single value produced by the last line of code but could be a table or a graph. Try it: Click on the following cell (or use the cursor keys) to highlight it then press Shift-Enter to run it:


In [ ]:
pi = 3.141593
radius = 0.5
area = pi * radius * radius
area

Ignore the details of how this Python code produced a result for now; this is simply a demonstration of how Jupyter Notebooks work.

You can create cells and run code cells in any order you like and the values you create in one cell will be available when you next run a cell, allowing you to interactively explore code/data over time. Run the following cell to see how we still have access to the value associated with area:


In [ ]:
area * 2

You can think of a Notebook as being a little like an Excel spreadsheet containing just a single column, the key differences being that

  • You have to explicitly re-run cells after changing them;
  • The 'formulae' (code) in Notebook cells are never hidden, making them much easier to read.
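
The first of these differences can be sketched in plain Python, with comments standing in for separate cells. The split into 'cells' is purely illustrative, and `stale_area` is a hypothetical name introduced only for this sketch.

```python
# "Cell 1": compute an area from a radius.
pi = 3.141593
radius = 0.5
area = pi * radius * radius

# "Cell 2": change the radius. Unlike a spreadsheet, nothing is recalculated.
radius = 2.0
stale_area = area  # still the area for the old radius of 0.5

# Re-running the calculation from "cell 1" is what brings area up to date.
area = pi * radius * radius

print(stale_area, area)
```

Until the calculation is run again, `area` keeps the value it was given when it was last computed.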

Editing cells

To edit code cells, click inside them or press Enter when they are surrounded by a blue border. Try this: Edit the code cell above, replace 2 with another number then re-run the cell.

You can edit text cells in the same way. Text formatting effects are achieved by writing [Markdown](https://daringfireball.net/projects/markdown/) in these cells rather than just plain text. Try it: double-click within this cell to see how Markdown was used to denote headings, bold text, links...

  • and
  • bulleted
  • lists.

Run the cell to render it as attractive HTML.

See the 'Edit', 'Insert' and 'Cell' menus directly above the Notebook for further ways of manipulating cells. Also, note that there are keyboard shortcuts for most Notebook actions (see the 'Help' menu).

Jupyter vs Python vs IPython

You may be wondering what is happening behind the scenes and how Jupyter relates to Python. In brief:

  • Jupyter is server software that allows anyone with a web browser to connect and view/edit/run Notebooks. The server software could be running on the same machine as the web browser, or on a different machine such as Sage Math Cloud (as might be the case for you) or a supercomputer. Therefore no Python software needs to be installed on the same machine as the web browser if connecting to a remote Jupyter server.
  • Jupyter does not execute the contents of code cells itself; it sends them to some software (called a kernel) that understands a specific programming language. Here we are using the IPython kernel, which allows chunks of Python code to be run interactively over time.

Knowing more about Jupyter and IPython is not important at this stage; however, it is useful to have a little understanding of what these pieces of software are and how they fit together.

A brief note on restarting the kernel associated with a particular Notebook: if you want to forget all the values you have created in memory since first running a code cell in a Notebook then click 'Kernel' -> 'Restart'. This only erases the kernel's working memory; it does not change the code/text in the Notebook's cells.
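
As a rough mental model only (this is not how the IPython kernel is actually implemented), you can picture the kernel as holding a dictionary of names and values that persists between cell executions and is emptied by a restart. The sketch below mimics that behaviour using Python's built-in `exec` and `eval`; `kernel_memory`, `doubled_area` and `area_is_remembered` are hypothetical names introduced for illustration.

```python
# A stand-in for the kernel's working memory (an assumption of this sketch).
kernel_memory = {}

# Running a first "cell" stores the names it creates in that memory.
first_cell = "pi = 3.141593\nradius = 0.5\narea = pi * radius * radius"
exec(first_cell, kernel_memory)

# A later "cell" can use values created earlier.
doubled_area = eval("area * 2", kernel_memory)

# A 'restart' forgets every value; the text of first_cell is untouched.
kernel_memory.clear()
area_is_remembered = "area" in kernel_memory

print(doubled_area, area_is_remembered)
```

After the simulated restart the code in `first_cell` still exists, but `area` is gone until that code is run again.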

Notebook setup cells

Most lessons contain a 'setup' code cell. You need to run this before running any other code cells. You should not edit it, and you do not need to understand its contents. Please run the following setup cell.


In [ ]:
____ = 0
from numpy.testing import assert_equal
from codecs import decode

Course exercises

There are exercises throughout the lessons to help confirm your understanding of different concepts. These are not externally assessed and are entirely for your own benefit / learning. The exercises typically require you to write/alter some code so that a statement is true.

For example, replace '____' in the following cell with a number so that the mathematical expressions before and after the comma (within the brackets) are exactly equal.


In [ ]:
assert_equal(____, 6 + 18)

Next, edit that cell and change the number you entered to a different number, then re-run the cell to see what happens when the expressions on either side of the comma are not equal.
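
Here is a minimal sketch of both outcomes, using different numbers from the exercise so as not to give the answer away. `assert_equal` returns silently when its two arguments match and raises an `AssertionError` when they do not. The `try`/`except` wrapper and the `error_was_raised` name are additions for this illustration; in a Notebook you would simply see the error message below the cell.

```python
from numpy.testing import assert_equal

# Equal values: nothing is printed and no error is raised.
assert_equal(10, 7 + 3)

# Unequal values: an AssertionError is raised, which is caught here
# so that the rest of the code can carry on running.
error_was_raised = False
try:
    assert_equal(11, 7 + 3)
except AssertionError as error:
    error_was_raised = True
    print("The check failed:", error)
```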

Certain exercises come with hints to help you. However, these have been encoded using a simple cipher called ROT13, so you have to explicitly decode them if you think you need them.

For example, say that your hint is the following ROT13-encoded string of characters:

'Gel zhygvcylvat ol guerr orsber lbh qb nalguvat ryfr'

You can decode this hint in a code cell like this:


In [ ]:
decode('Gel zhygvcylvat ol guerr orsber lbh qb nalguvat ryfr', 'rot13')
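
ROT13 replaces each letter with the one 13 places further along the alphabet, so applying it twice returns the original text. The following is a small sketch of that round trip using the standard library's `codecs` module; the message is a made-up example rather than a course hint.

```python
from codecs import decode, encode

message = "Hello"

# Encoding shifts every letter 13 places along the alphabet.
encoded = encode(message, "rot13")

# Decoding shifts by another 13 places, which gets back to the start.
round_trip = decode(encoded, "rot13")

print(encoded, round_trip)
```

Because the alphabet has 26 letters, encoding and decoding with ROT13 are the same operation.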