Introduction to Python and Natural Language Technologies

Lecture 01, Introduction to Python

September 6, 2017

About this part of the course

Goal

  • upper intermediate level Python
  • will cover some advanced concepts
  • focus on string manipulation

Prerequisites

  • intermediate level in at least one object oriented programming language
  • must know: class, instance, method, operator overloading, basic IO handling
  • good to know: static method, property, mutability, garbage collection

Course material

Official GitHub repository

  • the slideshow notebooks are pushed right before each lecture, so you can follow along in your own notebook

Homework

  • one homework for this part
  • released in Week 4
  • deadline by the end of Week 7

Jupyter

  • Jupyter, formerly known as IPython Notebook, is a web application that allows you to create and share documents with live code, equations, visualizations etc.
  • Jupyter notebooks are JSON files with the extension .ipynb
  • can be converted to HTML, PDF, LaTeX etc.
  • can render images, tables, graphs, LaTeX equations

  • content is organized into cells
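
Because a notebook is plain JSON, it can be inspected with the standard json module. The sketch below parses a hand-written, heavily simplified notebook and counts its cells by type. The cell contents are made up for illustration, and real notebooks carry more metadata than shown here.

```python
import json
from collections import Counter

# A heavily simplified notebook, written by hand for illustration only.
notebook_json = """
{
  "nbformat": 4,
  "nbformat_minor": 2,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
    {"cell_type": "code", "metadata": {}, "execution_count": 1,
     "outputs": [], "source": ["print('Hello world')"]},
    {"cell_type": "raw", "metadata": {}, "source": ["raw text"]}
  ]
}
"""

notebook = json.loads(notebook_json)

# Count how many cells of each type the notebook contains.
cell_types = Counter(cell["cell_type"] for cell in notebook["cells"])
print(dict(cell_types))
```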

Cell types

  1. code cell: Python/R/Lua/etc. code
  2. raw cell: raw text
  3. markdown cell: formatted text using Markdown

Code cell


In [1]:
print("Hello world")


Hello world

Only the value of the last expression in a cell is displayed


In [2]:
2 + 3
3 + 4


Out[2]:
7

This can be a tuple of multiple values


In [3]:
2 + 3, 3 + 4, "hello " + "world"


Out[3]:
(5, 7, 'hello world')

Markdown cell

This is in bold

This is in italics

This is
a table

and this is a pretty LaTeX equation:

$$ \mathbf{E}\cdot\mathrm{d}\mathbf{S} = \frac{1}{\varepsilon_0} \iiint_\Omega \rho \,\mathrm{d}V $$

Using Jupyter

Command mode and edit mode

Jupyter has two modes: command mode and edit mode

  1. Command mode: perform non-edit operations on selected cells (can select more than one cell)
    • selected cells are marked blue
  2. Edit mode: edit a single cell
    • the cell being edited is marked green

Switching between modes

  1. Esc: Edit mode -> Command mode
  2. Enter or double click: Command mode -> Edit mode

Running cells

  1. Ctrl + Enter: run cell
  2. Shift + Enter: run cell and select next cell
  3. Alt + Enter: run cell and insert new cell below

Cell magic

Special commands can modify a single cell's behavior, for example


In [4]:
%%time

for x in range(100000):
    pass


CPU times: user 10 ms, sys: 0 ns, total: 10 ms
Wall time: 7.65 ms

In [5]:
%%timeit

x = 2


16.4 ns ± 1.43 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [6]:
%%writefile hello.py

print("Hello world from BME")


Overwriting hello.py

For a complete list of magic commands:


In [7]:
%lsmagic


Out[7]:
Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3  %%ruby  %%script  %%sh  %%svg  %%sx  %%system  %%time  %%timeit  %%writefile

Automagic is ON, % prefix IS NOT needed for line magics.
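
Magic commands only work inside IPython and Jupyter. As a rough sketch of what the %%timeit cell above measures, the standard-library timeit module can be used in any Python script. The loop and repeat counts below are arbitrary choices for illustration, not the values %%timeit would pick.

```python
import timeit

LOOPS = 100000   # how many times the statement runs per measurement
REPEATS = 5      # how many measurements are taken

# Each entry is the total time in seconds for LOOPS executions of the statement.
measurements = timeit.repeat("x = 2", number=LOOPS, repeat=REPEATS)

# Report the fastest measurement, converted to nanoseconds per loop.
best = min(measurements) / LOOPS
print("best of {}: {:.1f} ns per loop".format(REPEATS, best * 1e9))
```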

Course material - Jupyter slides

Jupyter notebooks can be converted to slides and rendered with Reveal.js just like this course material.

This slideshow is a single Jupyter notebook which means:

  • you can view it as a notebook on Github
  • you can run and modify it on your own computer
  • you can render it using Reveal.js
jupyter-nbconvert --to slides 01_Python_introduction.ipynb --reveal-prefix=reveal.js --post serve

More on Jupyter slides:

10 min video on Jupyter slides

  • cells may be skipped during presentations
    • skipped cells contain extra material that will not be covered in the exam
  • all notebooks should run without errors using Kernel -> Restart & Run All
    • code samples that would raise an exception are commented out
  • this live presentation uses the RISE Jupyter extension

Under the hood

  • each notebook is run by its own Kernel (Python interpreter)
    • the kernel can be interrupted or restarted through the Kernel menu
    • always run Kernel -> Restart & Run All before submitting homework to make sure that your notebook behaves as expected
  • all cells share a single namespace
  • cells can be run in arbitrary order; the execution count next to each cell shows the order in which they were actually run

In [8]:
print("this is run first")


this is run first

In [9]:
print("this is run afterwards. Note the execution count on the left.")


this is run afterwards. Note the execution count on the left.

The input and output of code cells can be accessed

Previous output:


In [10]:
42


Out[10]:
42

In [11]:
_


Out[11]:
42

Second-to-last output:


In [12]:
"first"


Out[12]:
'first'

In [13]:
"second"


Out[13]:
'second'

In [14]:
__


Out[14]:
'first'

In [15]:
__


Out[15]:
'second'

Third-to-last output:


In [16]:
___


Out[16]:
'second'

The N-th output can also be accessed as the variable _N, where N is the execution count (for example, _10). It is only defined if the N-th cell had an output.

Here is a way to list all defined outputs (you will understand the code in 3 weeks):


In [17]:
list(filter(lambda x: x.startswith('_') and x[1:].isdigit(), globals()))


Out[17]:
['_2', '_3', '_7', '_10', '_11', '_12', '_13', '_14', '_15', '_16']

Inputs can be accessed similarly

Previous input:


In [18]:
_i


Out[18]:
"list(filter(lambda x: x.startswith('_') and x[1:].isdigit(), globals()))"

N-th input:


In [19]:
_i2


Out[19]:
'2 + 3\n3 + 4'

The Python programming language

History of Python

  • Python started in 1989 as a hobby project of the Dutch programmer Guido van Rossum
  • Python 1.0 in 1994
  • Python 2.0 in 2000
    • cycle-detecting garbage collector
    • Unicode support
  • Python 3.0 in 2008
    • backward incompatible
  • Python2 End-of-Life (EOL) date was postponed from 2015 to 2020

Benevolent Dictator for Life

Guido van Rossum at OSCON 2006. by Doc Searls licensed under CC BY 2.0

Python community and development

  • Python Software Foundation: a nonprofit organization based in Delaware, US
  • development is managed through PEPs (Python Enhancement Proposals)
  • strong community inclusion
  • large standard library
  • very large third-party module repository called PyPI (Python Package Index)
  • pip installer

In [20]:
import antigravity

Python neologisms

  • the Python community has a number of made-up expressions
  • Pythonic: following Python's conventions, Python-like
  • Pythonist or Pythonista: good Python programmer

General properties of Python

Whitespace

  • whitespace indentation instead of curly braces
  • no semicolons needed at the end of statements

In [21]:
n = 12
if n % 2 == 0:
    print("n is even")
else:
    print("n is odd")


n is even

Dynamic typing

  • type checking is performed at run-time as opposed to compile-time (C++)

In [22]:
n = 2
print(type(n))

n = 2.1
print(type(n))

n = "foo"
print(type(n))


<class 'int'>
<class 'float'>
<class 'str'>
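
A consequence of run-time type checking is that a type error only surfaces when the offending line is actually executed. The following sketch illustrates this with a made-up helper function, add_one, which is not part of the lecture material.

```python
def add_one(value):
    # No type is declared for value; whether + works is decided
    # only when this line runs with a concrete object.
    return value + 1

print(add_one(2))      # works for int
print(add_one(2.5))    # works for float

try:
    add_one("foo")     # str + int is not defined
except TypeError as error:
    print("TypeError:", error)
```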

Assignment

Assignment works differently than in many other imperative languages:

  • in C++ i = 2 translates to typed variable named i receives a copy of numeric value 2
  • in Python i = 2 translates to name i receives a reference to object of numeric type of value 2

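The built-in function id returns an object's identity, an integer that is unique among the objects alive at the same time (in CPython it is the object's memory address).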


In [23]:
i = 2
print(id(i))

i = 3
print(id(i))

i = "foo"
print(id(i))

s = i
print(id(s) == id(i))

s += "bar"
print(id(s) == id(i))


140529538636192
140529538636224
140529472500880
True
False
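
The reference semantics become more visible with mutable objects. The sketch below uses lists, which are introduced later in this lecture; the variable names are arbitrary. Two names bound to the same list see each other's changes, while an explicit copy does not.

```python
a = [1, 2]
b = a              # b refers to the same list object as a
b.append(3)        # modifies that one shared object
print(a)           # [1, 2, 3]
print(a is b)      # True

c = list(a)        # an explicit copy creates a new object
c.append(4)
print(a, c)        # [1, 2, 3] [1, 2, 3, 4]
print(a is c)      # False
```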

Simple statements

if, elif, else


In [24]:
#n = int(input())
n = 12

if n < 0:
    print("N is negative")
elif n > 0:
    print("N is positive")
else:
    print("N is neither positive nor negative")


N is positive

Conditional expressions

  • one-line if statements
  • the order of operands is different from C's ?: operator, the C version of abs would look like this
int x = -2;
int abs_x = x >= 0 ? x : -x;
  • should only be used for very short statements

<expr1> if <condition> else <expr2>


In [25]:
n = -2
abs_n = n if n >= 0 else -n
abs_n


Out[25]:
2
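
Since a conditional expression is an expression, it can appear anywhere a value is expected, for example in a return statement. A minimal sketch, with a made-up function name:

```python
def parity(n):
    # <expr1> if <condition> else <expr2>
    return "even" if n % 2 == 0 else "odd"

print(parity(12))   # even
print(parity(7))    # odd
```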

Lists

  • lists are the most frequently used built-in containers
  • basic operations: indexing, length, append, extend
  • lists will be covered in detail next week

In [26]:
l = []  # empty list
l.append(2)
l.append(2)
l.append("foo")

len(l), l


Out[26]:
(3, [2, 2, 'foo'])

In [27]:
l[1] = "bar"
l.extend([-1, True])
len(l), l


Out[27]:
(5, [2, 'bar', 'foo', -1, True])
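
The difference between append and extend is easy to miss: append adds its argument as a single element, while extend adds each element of its argument. The sketch below, with made-up list contents, also shows indexing from the end with a negative index.

```python
letters = ["a", "b", "c"]
print(letters[0])     # a, the first element
print(letters[-1])    # c, negative indices count from the end

appended = ["a", "b", "c"]
appended.append(["d", "e"])    # the whole list becomes ONE new element
print(len(appended), appended)

extended = ["a", "b", "c"]
extended.extend(["d", "e"])    # each element is added separately
print(len(extended), extended)
```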

for, range

Iterating a list


In [28]:
for e in ["foo", "bar"]:
    print(e)


foo
bar

Iterating over a range of integers

The C++ equivalent of the Python loop below:

for (int i=0; i<5; i++)
    cout << i << endl;

By default range starts from 0.


In [29]:
for i in range(5):
    print(i)


0
1
2
3
4

specifying the start of the range:


In [30]:
for i in range(2, 5):
    print(i)


2
3
4

specifying the step. Note that in this case we need to specify all three positional arguments.


In [31]:
for i in range(0, 10, 2):
    print(i)


0
2
4
6
8
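
In Python 3, range produces its values lazily, so printing a range object does not show the numbers. Converting it to a list is a quick way to check what a given combination of start, stop and step yields. The values below are arbitrary examples.

```python
up = list(range(5))
down = list(range(10, 0, -3))   # a negative step counts downwards
empty = list(range(5, 2))       # start >= stop with a positive step

print(up)      # [0, 1, 2, 3, 4]
print(down)    # [10, 7, 4, 1]
print(empty)   # []
```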

while


In [32]:
i = 0
while i < 5:
    print(i)
    i += 1


0
1
2
3
4

There is no do...while loop in Python.
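
A common way to get do...while behavior is an infinite loop whose exit condition is checked at the end of the body. This is one possible idiom, not the only one; the list of values below is made up and stands in for input read from a user.

```python
values = [3, 0, 5]   # made-up data standing in for user input
index = 0

while True:
    value = values[index]   # the body always runs at least once
    index += 1
    print("read", value)
    if value == 0:          # exit condition checked at the end
        break
```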

break and continue

  • break: allows early exit from a loop
  • continue: allows early jump to next iteration

In [33]:
for i in range(10):
    if i % 2 == 0:
        continue
    print(i)


1
3
5
7
9

In [34]:
for i in range(10):
    if i > 4:
        break
    print(i)


0
1
2
3
4

Functions

Defining functions

Functions can be defined using the def keyword:


In [35]:
def foo():
    print("this is a function")
     
foo()


this is a function

Function arguments

  1. positional
  2. named or keyword arguments

keyword arguments must follow positional arguments


In [36]:
def foo(arg1, arg2, arg3):
    print("arg1 ", arg1)
    print("arg2 ", arg2)
    print("arg3 ", arg3)
    
foo(1, 2, 3)


arg1  1
arg2  2
arg3  3

In [37]:
foo(1, arg3=2, arg2=29)


arg1  1
arg2  29
arg3  2
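
The ordering rule is enforced by the parser, while passing the same argument twice is only detected when the call is made. The sketch below demonstrates both; make_triple is a hypothetical helper, and compile is used only so that the invalid call can be shown without breaking the notebook.

```python
def make_triple(arg1, arg2, arg3):
    # Hypothetical helper that returns its arguments as a tuple.
    return arg1, arg2, arg3

print(make_triple(1, arg3=3, arg2=2))   # (1, 2, 3)

# A positional argument after a keyword argument is a syntax error,
# so the call is rejected before anything is executed.
try:
    compile("make_triple(arg1=1, 2, 3)", "<example>", "eval")
except SyntaxError as error:
    print("SyntaxError:", error.msg)

# Giving arg1 both positionally and by keyword fails at call time.
try:
    make_triple(1, 2, arg1=3)
except TypeError as error:
    print("TypeError:", error)
```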

Default arguments

  • arguments can have default values
  • default arguments must follow non-default arguments

In [38]:
def foo(arg1, arg2, arg3=3):
    print("arg1 ", arg1)
    print("arg2 ", arg2)
    print("arg3 ", arg3)
foo(1, 2)


arg1  1
arg2  2
arg3  3

Default arguments need not be specified when calling the function


In [39]:
foo(1, 2)


arg1  1
arg2  2
arg3  3

In [40]:
foo(arg1=1, arg3=33, arg2=222)


arg1  1
arg2  222
arg3  33

If more than one argument has a default value, any of them can be skipped:


In [41]:
def foo(arg1, arg2=2, arg3=3):
    print("arg1 ", arg1)
    print("arg2 ", arg2)
    print("arg3 ", arg3)
    
foo(11, arg3=33)


arg1  11
arg2  2
arg3  33
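
Default values can be overridden either by position or by keyword. The sketch below uses a made-up function, greet, to show both, and uses compile to show that a definition with a non-default argument after a default one is rejected by the parser.

```python
def greet(name, greeting="Hello", punctuation="!"):
    # Hypothetical example function with two default arguments.
    return greeting + ", " + name + punctuation

print(greet("BME"))                    # Hello, BME!
print(greet("BME", "Hi"))              # Hi, BME!
print(greet("BME", punctuation="?"))   # Hello, BME?

# A non-default argument after a default one is a syntax error.
try:
    compile("def broken(a=1, b): pass", "<example>", "exec")
except SyntaxError as error:
    print("SyntaxError:", error.msg)
```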

This mechanism makes functions with a very large number of arguments practical, since callers only specify the ones they want to change. Many libraries have functions with dozens of arguments.

For example, the popular data analysis library pandas provides:

pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)

The return statement

  • functions may return more than one value
    • a tuple of the values is returned
  • without an explicit return statement None is returned
  • an empty return statement returns None

In [42]:
def foo(n):
    if n < 0:
        return "negative"
    if 0 <= n < 10:
        return "positive", n
    return

print(foo(-2))
print(foo(3))
print(foo(12))


negative
('positive', 3)
None
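
When a function returns several values, the caller can either keep the tuple or unpack it into separate names. A short sketch with a made-up function, min_max:

```python
def min_max(values):
    # The two values are packed into a single tuple.
    return min(values), max(values)

result = min_max([3, 1, 4])
print(result)             # (1, 4)

# Unpacking assigns each element of the tuple to its own name.
lowest, highest = min_max([3, 1, 4])
print(lowest, highest)    # 1 4
```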

Zen of Python


In [43]:
import this


The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!