Introduction to Python and Natural Language Technologies

Lecture 01, Introduction to Python

September 6, 2017

About this part of the course

Goal

upper intermediate level Python
will cover some advanced concepts
focus on string manipulation

Prerequisites

intermediate level in at least one object oriented programming language
must know: class, instance, method, operator overloading, basic IO handling
good to know: static method, property, mutability, garbage collection

Course material

Official Github repository

the slideshow notebooks are pushed right before each lecture, so you can follow along in your own notebook

Homework

one homework for this part
released in Week 4
deadline by the end of Week 7

Jupyter

Jupyter, formerly known as IPython Notebook, is a web application that allows you to create and share documents with live code, equations, visualizations etc.
Jupyter notebooks are JSON files with the extension .ipynb
can be converted to HTML, PDF, LaTeX etc.
can render images, tables, graphs, LaTeX equations
content is organized into cells
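
Because a notebook is plain JSON, its structure can be inspected with the standard json module. The following is a minimal sketch: the notebook dictionary is hand-written for illustration and only assumes the common nbformat 4 layout (a top-level "cells" list whose entries carry a "cell_type" and a "source").

```python
import json

# A hand-written, minimal notebook for illustration (assumed nbformat 4 layout).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# A title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('Hello world')"],
        },
    ],
}

# Serialize to the text an .ipynb file would contain, then parse it back.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)

# List the type of every cell in the parsed notebook.
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)  # ['markdown', 'code']
```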

Cell types

code cell: Python/R/Lua/etc. code
raw cell: raw text
markdown cell: formatted text using Markdown

Code cell



In [1]:

    
print("Hello world")









    



Hello world

The last command's output is displayed



In [2]:

    
2 + 3
3 + 4









    Out[2]:





7

This can be a tuple of multiple values



In [3]:

    
2 + 3, 3 + 4, "hello " + "world"









    Out[3]:





(5, 7, 'hello world')

Markdown cell

This is in bold

This is in italics

This	is
a	table

and this is a pretty LaTeX equation:

$$ \mathbf{E}\cdot\mathrm{d}\mathbf{S} = \frac{1}{\varepsilon_0} \iiint_\Omega \rho \,\mathrm{d}V $$

Using Jupyter

Command mode and edit mode

Jupyter has two modes: command mode and edit mode

Command mode: perform non-edit operations on selected cells (can select more than one cell)
- selected cells are marked blue
Edit mode: edit a single cell
- the cell being edited is marked green

Switching between modes

Esc: Edit mode -> Command mode
Enter or double click: Command mode -> Edit mode

Running cells

Ctrl + Enter: run cell
Shift + Enter: run cell and select next cell
Alt + Enter: run cell and insert new cell below

Cell magic

Special commands can modify a single cell's behavior, for example



In [4]:

    
%%time

for x in range(100000):
    pass









    



CPU times: user 10 ms, sys: 0 ns, total: 10 ms
Wall time: 7.65 ms



In [5]:

    
%%timeit

x = 2









    



16.4 ns ± 1.43 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
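
Outside a notebook, a similar measurement can be made with the standard library's timeit module. This is only a rough sketch: the repetition count is an arbitrary choice, and the measured time depends on the machine.

```python
import timeit

# Run the statement many times and measure the total elapsed time in seconds.
number = 100000  # arbitrary repetition count chosen for this example
total = timeit.timeit("x = 2", number=number)

# Average time per execution of the statement.
print(total / number)
```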



In [6]:

    
%%writefile hello.py

print("Hello world from BME")









    



Overwriting hello.py

For a complete list of magic commands:



In [7]:

    
%lsmagic









    Out[7]:





Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3  %%ruby  %%script  %%sh  %%svg  %%sx  %%system  %%time  %%timeit  %%writefile

Automagic is ON, % prefix IS NOT needed for line magics.

Course material - Jupyter slides

Jupyter notebooks can be converted to slides and rendered with Reveal.js just like this course material.

This slideshow is a single Jupyter notebook which means:

you can view it as a notebook on Github
you can run and modify it on your own computer
you can render it using Reveal.js

jupyter-nbconvert --to slides 01_Python_introduction.ipynb --reveal-prefix=reveal.js --post serve

Under the hood

each notebook is run by its own Kernel (Python interpreter)
- the kernel can be interrupted or restarted through the Kernel menu
- always run Kernel -> Restart & Run All before submitting homework to make sure that your notebook behaves as expected
all cells share a single namespace
cells can be run in arbitrary order; the execution count shows the order in which they were actually run
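
The effect of a shared namespace can be imitated in plain Python. This is a simulation for illustration, not how Jupyter is implemented: each string stands for the source of a hypothetical cell, and one dictionary plays the role of the kernel's namespace.

```python
# Each string stands in for the source of one notebook cell (hypothetical cells).
cell_a = "counter = 1"
cell_b = "counter += 1"

# One dictionary plays the role of the kernel's single shared namespace.
namespace = {}

# Run the cells in the order a, b, b: every run sees the earlier results.
for source in (cell_a, cell_b, cell_b):
    exec(source, namespace)

print(namespace["counter"])  # 3
```

Running the second cell twice changes the result, which is why Restart & Run All is a useful check before submitting.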



In [8]:

    
print("this is run first")









    



this is run first



In [9]:

    
print("this is run afterwards. Note the execution count on the left.")

this is run afterwards. Note the execution count on the left.

The input and output of code cells can be accessed

Previous output:



In [10]:

    
42









    Out[10]:





42



In [11]:

    
_









    Out[11]:





42

Second-to-last output:



In [12]:

    
"first"









    Out[12]:





'first'



In [13]:

    
"second"









    Out[13]:





'second'



In [14]:

    
__









    Out[14]:





'first'



In [15]:

    
__









    Out[15]:





'second'

Third-to-last output:



In [16]:

    
___









    Out[16]:





'second'

The N-th output can also be accessed as the variable _<output_count> (for example _2). It is only defined if the N-th cell had an output.

Here is a way to list all defined outputs (you will understand the code in 3 weeks):



In [17]:

    
list(filter(lambda x: x.startswith('_') and x[1:].isdigit(), globals()))









    Out[17]:





['_2', '_3', '_7', '_10', '_11', '_12', '_13', '_14', '_15', '_16']

Inputs can be accessed similarly

Previous input:



In [18]:

    
_i









    Out[18]:





"list(filter(lambda x: x.startswith('_') and x[1:].isdigit(), globals()))"

N-th input:



In [19]:

    
_i2









    Out[19]:





'2 + 3\n3 + 4'

The Python programming language

History of Python

Python started in 1989 as a hobby project of the Dutch programmer Guido van Rossum.
Python 1.0 in 1994
Python 2.0 in 2000
- cycle-detecting garbage collector
- Unicode support
Python 3.0 in 2008
- backward incompatible
Python2 End-of-Life (EOL) date was postponed from 2015 to 2020

Benevolent Dictator for Life

Guido van Rossum at OSCON 2006. by Doc Searls licensed under CC BY 2.0

Python community and development

Python Software Foundation: nonprofit organization based in Delaware, US
managed through PEPs (Python Enhancement Proposal)
strong community inclusion
large standard library
very large third-party module repository called PyPI (Python Package Index)
pip installer



In [20]:

    
import antigravity

Python neologisms

the Python community has a number of made-up expressions
Pythonic: following Python's conventions, Python-like
Pythonist or Pythonista: good Python programmer

General properties of Python

Whitespaces

whitespace indentation instead of curly braces
no semicolons needed at the end of statements



In [21]:

    
n = 12
if n % 2 == 0:
    print("n is even")
else:
    print("n is odd")









    



n is even

Dynamic typing

type checking is performed at run-time as opposed to compile-time (C++)



In [22]:

    
n = 2
print(type(n))

n = 2.1
print(type(n))

n = "foo"
print(type(n))









    



<class 'int'>
<class 'float'>
<class 'str'>
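
A consequence of run-time type checking is that a type error only surfaces when the offending operation is actually executed. A small sketch, where add_one is a hypothetical helper introduced for this example:

```python
def add_one(value):
    # No type is declared for value; the check happens when + is evaluated.
    return value + 1

print(add_one(2))    # 3
print(add_one(2.5))  # 3.5

try:
    add_one("foo")   # adding an int to a str is not defined
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```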

Assignment

assignment differs from other imperative languages:

in C++ i = 2 translates to "the typed variable named i receives a copy of the numeric value 2"
in Python i = 2 translates to "the name i receives a reference to an object of numeric type with value 2"

the built-in function id returns the object's id



In [23]:

    
i = 2
print(id(i))

i = 3
print(id(i))

i = "foo"
print(id(i))

s = i
print(id(s) == id(i))

s += "bar"
print(id(s) == id(i))









    



140529538636192
140529538636224
140529472500880
True
False
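
The cell above uses immutable objects (int, str). With a mutable object such as a list, two names bound to the same object see each other's changes. A short sketch; the is operator compares object identity, like comparing the two id values:

```python
a = [1, 2]
b = a            # b is a second name for the same list object
b.append(3)      # mutates the shared object in place

print(a)         # [1, 2, 3]
print(a is b)    # True

b = b + [4]      # builds a new list and rebinds b; a is untouched
print(a)         # [1, 2, 3]
print(b)         # [1, 2, 3, 4]
print(a is b)    # False
```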

Simple statements

if, elif, else



In [24]:

    
#n = int(input())
n = 12

if n < 0:
    print("N is negative")
elif n > 0:
    print("N is positive")
else:
    print("N is neither positive nor negative")









    



N is positive

Conditional expressions

one-line if statements
the order of operands is different from C's ?: operator; the C version of abs would look like this:

int x = -2;
int abs_x = x >= 0 ? x : -x;

should only be used for very short statements

<expr1> if <condition> else <expr2>



In [25]:

    
n = -2
abs_n = n if n >= 0 else -n
abs_n









    Out[25]:





2

Lists

lists are the most frequently used built-in containers
basic operations: indexing, length, append, extend
lists will be covered in detail next week



In [26]:

    
l = []  # empty list
l.append(2)
l.append(2)
l.append("foo")

len(l), l









    Out[26]:





(3, [2, 2, 'foo'])



In [27]:

    
l[1] = "bar"
l.extend([-1, True])
len(l), l









    Out[27]:





(5, [2, 'bar', 'foo', -1, True])
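
The difference between append and extend is easiest to see when the argument is itself a list. A minimal sketch with illustrative variable names:

```python
appended = [1, 2]
appended.append([3, 4])   # adds the list itself as a single element
print(len(appended), appended)   # 3 [1, 2, [3, 4]]

extended = [1, 2]
extended.extend([3, 4])   # adds each element of the argument
print(len(extended), extended)   # 4 [1, 2, 3, 4]
```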

for, range

Iterating a list



In [28]:

    
for e in ["foo", "bar"]:
    print(e)









    



foo
bar

Iterating over a range of integers

The same in C++:

for (int i=0; i<5; i++)
    cout << i << endl;

By default range starts from 0.



In [29]:

    
for i in range(5):
    print(i)

specifying the start of the range:



In [30]:

    
for i in range(2, 5):
    print(i)

specifying the step. Note that in this case we need to specify all three positional arguments.



In [31]:

    
for i in range(0, 10, 2):
    print(i)
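
To see the values a range produces without writing a loop, it can be converted to a list. The sketch below repeats the three forms used above and adds a negative step as an extra illustration:

```python
only_stop = list(range(5))           # start defaults to 0
start_stop = list(range(2, 5))       # the stop value is excluded
with_step = list(range(0, 10, 2))    # every second integer
countdown = list(range(5, 0, -1))    # a negative step counts down

print(only_stop)    # [0, 1, 2, 3, 4]
print(start_stop)   # [2, 3, 4]
print(with_step)    # [0, 2, 4, 6, 8]
print(countdown)    # [5, 4, 3, 2, 1]
```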

while



In [32]:

    
i = 0
while i < 5:
    print(i)
    i += 1

There is no do...while loop in Python.
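
One common way to get do...while behavior is an infinite loop with the condition checked at the end of the body. A sketch; the starting value 10 is chosen so that the condition i < 5 is false from the start, yet the body still runs once:

```python
i = 10
visited = []

while True:
    visited.append(i)   # the body runs before the condition is tested
    i += 1
    if not i < 5:       # the "while" condition, checked at the end
        break

print(visited)  # [10]
```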

break and continue

break: allows early exit from a loop
continue: allows early jump to next iteration



In [33]:

    
for i in range(10):
    if i % 2 == 0:
        continue
    print(i)



In [34]:

    
for i in range(10):
    if i > 4:
        break
    print(i)

Functions

Defining functions

Functions can be defined using the def keyword:



In [35]:

    
def foo():
    print("this is a function")
     
foo()









    



this is a function

Function arguments

positional
named or keyword arguments

keyword arguments must follow positional arguments



In [36]:

    
def foo(arg1, arg2, arg3):
    print("arg1 ", arg1)
    print("arg2 ", arg2)
    print("arg3 ", arg3)
    
foo(1, 2, 3)









    



arg1  1
arg2  2
arg3  3



In [37]:

    
foo(1, arg3=2, arg2=29)









    



arg1  1
arg2  29
arg3  2
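
The ordering rule is enforced when the code is compiled, before anything runs. The sketch below redefines foo to return its arguments and uses the built-in compile so that the SyntaxError can be caught instead of stopping the example:

```python
def foo(arg1, arg2, arg3):
    return arg1, arg2, arg3

# Valid: positional arguments first, then keyword arguments in any order.
print(foo(1, arg3=3, arg2=2))   # (1, 2, 3)

# Invalid: a positional argument after a keyword argument.
try:
    compile("foo(arg1=1, 2, 3)", "<example>", "eval")
    rejected = False
except SyntaxError:
    rejected = True

print(rejected)   # True
```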

Default arguments

arguments can have default values
default arguments must follow non-default arguments



In [38]:

    
def foo(arg1, arg2, arg3=3):
    print("arg1 ", arg1)
    print("arg2 ", arg2)
    print("arg3 ", arg3)
foo(1, 2)









    



arg1  1
arg2  2
arg3  3

Default arguments need not be specified when calling the function



In [39]:

    
foo(1, 2)









    



arg1  1
arg2  2
arg3  3



In [40]:

    
foo(arg1=1, arg3=33, arg2=222)









    



arg1  1
arg2  222
arg3  33

If more than one argument has a default value, any of them can be skipped:



In [41]:

    
def foo(arg1, arg2=2, arg3=3):
    print("arg1 ", arg1)
    print("arg2 ", arg2)
    print("arg3 ", arg3)
    
foo(11, arg3=33)









    



arg1  11
arg2  2
arg3  33

This mechanism allows functions to have a very large number of arguments.

The popular data analysis library pandas has functions with dozens of arguments, for example:

pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)

The return statement

functions may return more than one value
- a tuple of the values is returned
without an explicit return statement None is returned
an empty return statement returns None



In [42]:

    
def foo(n):
    if n < 0:
        return "negative"
    if 0 <= n < 10:
        return "positive", n
    return

print(foo(-2))
print(foo(3))
print(foo(12))









    



negative
('positive', 3)
None
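
A returned tuple can be unpacked into separate names at the call site. A short sketch, where min_max and no_return are hypothetical functions written for this example:

```python
def min_max(values):
    # Two values separated by a comma are returned as one tuple.
    return min(values), max(values)

result = min_max([3, 1, 4])
print(type(result), result)   # <class 'tuple'> (1, 4)

# Unpack the tuple into two names.
smallest, largest = min_max([3, 1, 4])
print(smallest, largest)      # 1 4

def no_return():
    pass

print(no_return())            # None
```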

Zen of Python



In [43]:

    
import this









    



The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!