Introduction to Python and Natural Language Technologies

Type system and built-in types

Lecture 02

13 September 2017

PEP8, the Python style guide

  • widely accepted style guide for Python
  • PEP8 by Guido himself, 2001

Specifies:

  • indentation
  • line length
  • module imports
  • class names, function names etc.

We shall use PEP8 throughout this course.

Type system

Dynamic type system

  • no need to declare variables
  • the = operator binds a reference to any arbitrary object

In [1]:
i = 2
type(i), id(i)


Out[1]:
(int, 139967066462624)

In [2]:
i = "foo"
type(i), id(i)


Out[2]:
(str, 139967000318952)

Strongly typed

  • most implicit conversions are disallowed.
  • conversions between numeric types are OK:

In [3]:
i = 2
f = 1.2
s = i + f
print(type(i), type(f))
print(type(i + f))
print(s)


<class 'int'> <class 'float'>
<class 'float'>
3.2
  • conversions between numeric types and string are not allowed

In [4]:
# print("I am " + 20 + " years old")

we must explicitely cast it:


In [5]:
print("I am " + str(20) + " years old")


I am 20 years old

Note that many other languages like Javascript allow implicit casting:


In [6]:
%%javascript

element.text("I am " + 20 + " years old")



In [7]:
%%javascript

element.text("1" + 1)



In [8]:
%%javascript

element.text(1 + "1")



In [9]:
%%perl

print "1" + 1


2

Built-in types and operators

boolean operators

  • three boolean operators: and, or and not

In [10]:
x = 2

x < 2 or x >= 2


Out[10]:
True

In [11]:
x > 0 and x < 10


Out[11]:
True

In [12]:
not x < 0


Out[12]:
True

boolean type

  • two boolean values: True and False (must be capitalized)

In [13]:
x = True
type(x)


Out[13]:
bool

In [14]:
True and False


Out[14]:
False

In [15]:
True or False


Out[15]:
True

In [16]:
not True


Out[16]:
False

Numeric types

  • three numeric types: int, float and complex
  • an object's type is derived from its initial value

In [17]:
i = 2
f = 1.2
c = 1+2j

type(i), type(f), type(c)


Out[17]:
(int, float, complex)
  • implicit conversion between numeric types is supported in arithmetic operations
  • the resulting type is the one with less data loss

In [18]:
c2 = i + c
print(c2, type(c2))


(3+2j) <class 'complex'>

Precision and range

  • integers have unlimited precision
  • different from Python2, where integers are usually implemented using C's long type

In [19]:
%%python2

import sys
import math

max_i = sys.maxint
print(max_i, math.log(max_i, 2))


(9223372036854775807, 63.0)
  • in Python2 numbers are automatically converted to Python's long type when they exceed maxint

In [20]:
%%python2
import sys

max_i = sys.maxint
print(type(max_i), type(max_i+1))


(<type 'int'>, <type 'long'>)

Python3


In [21]:
type(2**63 + 1)


Out[21]:
int

float

  • floats are usually implemented using C's double.
  • complex numbers use two floats for their real and imaginary parts
  • check sys.float_info for more information

In [22]:
import sys
sys.float_info


Out[22]:
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

In [23]:
sys.int_info


Out[23]:
sys.int_info(bits_per_digit=30, sizeof_digit=4)

Arithmetic operators

  • addition, subtraction and product work the same as in C/C++

In [24]:
i = 2
f = 4.2
c = 4.1-3j

s1 = i + f
s2 = f - c
s3 = i * c
print(s1, type(s1))
print(s2, type(s2))
print(s3, type(s3))


6.2 <class 'float'>
(0.10000000000000053+3j) <class 'complex'>
(8.2-6j) <class 'complex'>

quotient operator

Python 2 vs. 3 difference

  • in Python3 operator/ computes the float quotient even if the operands are both integers

In [25]:
-3 / 2


Out[25]:
-1.5
  • in Python2 the quotient is integer if both operands are integers
  • the result is the floor of the quotient

In [26]:
%%python2

q = -3 / 2
print(q, type(q))


(-2, <type 'int'>)
  • this is not the case if one or both of the operands are float

In [27]:
%%python2

q = -3.0 / 2
print(q, type(q))


(-1.5, <type 'float'>)

Explicit floor quotient operator


In [28]:
-3.0 // 2


Out[28]:
-2.0

Comparison operators


In [29]:
x = 23
x < 24


Out[29]:
True

In [30]:
x >= 22


Out[30]:
True

operators can be chained


In [31]:
23 < x < 100


Out[31]:
False

In [32]:
23 <= x < 100


Out[32]:
True

Other operators for numeric types

remainder


In [33]:
5 % 3


Out[33]:
2

power


In [34]:
2 ** 3


Out[34]:
8

In [35]:
2 ** 0.5


Out[35]:
1.4142135623730951

absolute value


In [36]:
abs(-2 - 1j)


Out[36]:
2.23606797749979

round


In [37]:
round(2.3456), round(2.3456, 2)


Out[37]:
(2, 2.35)

Explicit conversions between numeric types


In [38]:
float(2)


Out[38]:
2.0

In [39]:
int(2.5)


Out[39]:
2

math and cmath

  • additional operations
  • cmath is for complex numbers

In [40]:
import math

math.log(16), math.log(16, 2), math.exp(2), \
math.exp(math.log(10))


Out[40]:
(2.772588722239781, 4.0, 7.38905609893065, 10.000000000000002)

Mutable vs. immutable types

  • instances of mutable types can be modified in place
  • immutable objects have the same value during their lifetime
  • are numeric types mutable or immutable?

In [41]:
x = 2
old_id = id(x)
x += 1
print(id(x) == old_id)


False

How about booleans?


In [42]:
x = True
y = False
print(id(x) == id(y))
x = False
print(id(x) == id(y))


False
True

There is only one True and one False object.

How about lists?


In [43]:
l1 = [0, 1]
old_id = id(l1)
l1.append(2)
old_id == id(l1)


Out[43]:
True

Lists are mutable.

Sequence types

  • all sequences support the following basic operations
operation behaviour
x in s True if an item of s is equal to x, else False
x not in s False if an item of s is equal to x, else True
s + t the concatenation of s and t
s * n or n * s equivalent to adding s to itself n times
s[i] ith item of s, origin 0
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k
len(s) length of s
min(s) smallest item of s
max(s) largest item of s
s.index(x[, i[, j]]) index of the first occurrence of x in s (at or after index i and before index j)
s.count(x) total number of occurrences of x in s

Table source

list

list

  • mutable sequence type

In [44]:
l = [1, 2, 2, 3]
l


Out[44]:
[1, 2, 2, 3]

In [45]:
l[1]


Out[45]:
2

In [46]:
# l[4]  # raises IndexError

In [47]:
l[-1]


Out[47]:
3

Advanced indexing, ranges


In [48]:
l = []
for i in range(20):
    l.append(2*i + 1)
l[:10]


Out[48]:
[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

In [49]:
l[-4:]


Out[49]:
[33, 35, 37, 39]

In [50]:
i = 2
l[i:i+3]


Out[50]:
[5, 7, 9]

In [51]:
l[2:10:3]


Out[51]:
[5, 11, 17]

Q. What is the output?


In [52]:
l[::-1]


Out[52]:
[39, 37, 35, 33, 31, 29, 27, 25, 23, 21, 19, 17, 15, 13, 11, 9, 7, 5, 3, 1]

Lists are mutable, elements can be added or changed:


In [53]:
l = []
l.append(1)
l.append(2)
l.append(2)
l


Out[53]:
[1, 2, 2]

In [54]:
l[1] = 12

In [55]:
l.extend([3, 4, 5])
len(l), l


Out[55]:
(6, [1, 12, 2, 3, 4, 5])

elements don't need to be of the same type


In [56]:
l = [1, -1, "foo", 2, "bar"]
l


Out[56]:
[1, -1, 'foo', 2, 'bar']

lists can be traversed with for loops


In [57]:
for element in l:
    print(element)


1
-1
foo
2
bar

enumerate

if we need the indices too, the built-in enumerate function iterates over index-element pairs


In [58]:
for i, element in enumerate(l):
    print(i, element)


0 1
1 -1
2 foo
3 2
4 bar

List sorting

  • lists can be sorted using the built-in sorted function

In [59]:
l = [3, -1, 2, 11]

for e in sorted(l):
    print(e)


-1
2
3
11

the sorting key can be specified using the key argument


In [60]:
shopping_list = [
    ["apple", 5],
    ["pear", 2],
    ["milk", 1],
    ["bread", 3],
]

for product in sorted(shopping_list, key=lambda x: -x[1]):
    print(product)


['apple', 5]
['bread', 3]
['pear', 2]
['milk', 1]

tuple

tuple

  • tuple is an immutable sequence
  • it may be empty

In [61]:
t = ()
print(type(t), len(t))

t = ([1, 2, 3], "foo")
type(t), len(t)


<class 'tuple'> 0
Out[61]:
(tuple, 2)

In [62]:
t


Out[62]:
([1, 2, 3], 'foo')

tuples can be indexed the same way as lists


In [63]:
t[1], t[-1]


Out[63]:
('foo', 'foo')

tuples contain immutable references, however, the objects may be mutable


In [64]:
t = ([1, 2, 3], "foo")
# t[0]= "bar"  # this raises a TypeError

In [65]:
for e in t:
    print(id(e))
    
print("\nChanging an element of t[0]\n")
t[0][1] = 11

for e in t:
    print(id(e))

print("\n", t)


139966880293064
139967000318952

Changing an element of t[0]

139966880293064
139967000318952

 ([1, 11, 3], 'foo')

dictionary

dictionary

  • basic and only built-in map type
  • maps keys to values

In [66]:
d = {}  # empty dictionary  same as d = dict()
d["apple"] = 12
d["plum"] = 2
d


Out[66]:
{'apple': 12, 'plum': 2}

equivalent to


In [67]:
d = {"apple": 12, "plum": 2}
d


Out[67]:
{'apple': 12, 'plum': 2}

removing keys


In [68]:
del d["apple"]
d


Out[68]:
{'plum': 2}

iterating dictionaries

  • keys and values can be iterated separately or together
  • iterating keys

In [69]:
d = {"apple": 12, "plum": 2}
for key in d.keys():
    print(key, d[key])


apple 12
plum 2
  • iterating values

In [70]:
for value in d.values():
    print(value)


12
2
  • iterating both

In [71]:
for key, value in d.items():
    print(key, value)


apple 12
plum 2

Under the hood

  • uses hash table (same as C++'s std::unordered_map)
    • constraints on key values: they must be hashable i.e. they cannot be or contain mutable objects
    • keys can be mixed type

In [72]:
d = {}
d[1] = "a"  # numeric types are immutable
d[3+2j] = "b"
d["c"] = 1.0
d


Out[72]:
{1: 'a', (3+2j): 'b', 'c': 1.0}
  • tuples are immutable too

In [73]:
d[("apple", 1)] = -2
d


Out[73]:
{1: 'a', (3+2j): 'b', 'c': 1.0, ('apple', 1): -2}
  • however lists are not

In [74]:
# d[["apple", 1]] = 12  # raises TypeError

Q. Can these be dictionary keys?


In [75]:
key1 = (2, (3, 4))
key2 = (2, [], (3, 4))

# d = {}
# d[key1] = 1
# d[key2] = 2

set

set

  • collection of unique, hashable elements
  • implements basic set operations (intersection, union, difference)

In [76]:
s = set()
s.add(2)
s.add(3)
s.add(2)
s


Out[76]:
{2, 3}

In [77]:
s = {2, 3, 2}
type(s), s


Out[77]:
(set, {2, 3})

deleting elements


In [78]:
s.add(2)
s.remove(2)
# s.remove(2)  # raises KeyError, since we already removed this element
s.discard(2)  # removes if present, does not raise exception

frozenset

  • immutable counterpart of set

In [79]:
fs = frozenset([1, 2])
# fs.add(2)  # raises AttributeError

In [80]:
fs = frozenset([1, 2])
s = {1, 2}

d = dict()
d[fs] = 1
# d[s] = 2  # raises TypeError
d


Out[80]:
{frozenset({1, 2}): 1}

set operations

  • implemented as
    1. methods
    2. overloaded operators

In [81]:
s1 = {1, 2, 3, 4, 5}
s2 = {2, 5, 6, 7}

s1 & s2  # s1.intersection(s2) or s2.intersection(s1)


Out[81]:
{2, 5}

In [82]:
s1 | s2  # s1.union(s2) OR s2.union(s1)


Out[82]:
{1, 2, 3, 4, 5, 6, 7}

In [83]:
s1 - s2, s2 - s1  # s1.difference(s2), s2.difference(s1)


Out[83]:
({1, 3, 4}, {6, 7})
  • these operations return new sets

In [84]:
s3 = s1 & s2
type(s3), id(s3) == id(s1), id(s3) == id(s2)


Out[84]:
(set, False, False)

issubset and issuperset


In [85]:
s1 < s2, s1.issubset(s2)


Out[85]:
(False, False)

In [86]:
{1, 2} < s1


Out[86]:
True

In [87]:
s1.issuperset({1, 6})


Out[87]:
False

useful set properties

  • creating a set is a convenient way of getting the unique elements of a sequence

In [88]:
l = [1, 2, 3, -1, 1, 2, 1, 0]
uniq = set(l)
uniq


Out[88]:
{-1, 0, 1, 2, 3}
  • sets and dictionaries provide O(1) lookup
  • in contrast lists provide O(n) lookup

In [143]:
import random

n = 10000

l = list(range(n))
random.shuffle(l)
s = set(l)
len(s), len(l)


Out[143]:
(10000, 10000)

In [150]:
{2, -1, 12, "aa", "bb"}


Out[150]:
{2, 12, 'bb', 'aa', -1}

In [152]:
{"aa", "cc", 2, "bb"}


Out[152]:
{'aa', 2, 'cc', 'bb'}

In [90]:
%%timeit

2 in l


144 µs ± 4.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [91]:
%%timeit

2 in s


50.8 ns ± 6.08 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Functions with variable number of arguments

  • recall that functions may take positional and keyword arguments

In [92]:
def foo(arg1, arg2, arg3):
    print(arg1, arg2, arg3)
    
foo(1, 2, 3)
foo(1, arg3=2, arg2=3)


1 2 3
1 3 2
  • arguments may have default values starting from the rightmost argument

In [93]:
def foo(arg1, arg2, arg3=12):
    print(arg1, arg2, arg3)
    
foo(-1, -4)


-1 -4 12

args and kwargs

  • both positional and keyword arguments can be captured in arbitrary numbers using the * and ** operators
  • positional arguments are captured in a tuple

In [94]:
def arbitrary_positional_f(*args):
    print(type(args))
    for arg in args:
        print(arg)
        
arbitrary_positional_f(1, 2, -1)
# arbitrary_positional_f(1, 2, arg=-1)  # raises TypeError


<class 'tuple'>
1
2
-1
  • keyword arguments are captured in a dictionary

In [95]:
def arbitrary_keyword_f(**kwargs):
    print(type(kwargs))
    for argname, value in kwargs.items():
        print(argname, value)
        
arbitrary_keyword_f(arg1=1, arg2=12)
# arbitrary_keyword_f(12, arg=12)  # TypeError


<class 'dict'>
arg1 1
arg2 12
  • we usually capture both

In [96]:
def arbitrary_arg_f(*args, **kwargs):
    if args:
        print("Positional arguments")
        for arg in args:
            print(arg)
    else:
        print("No positional arguments")
    if kwargs:
        print("Keyword arguments")
        for argname, value in kwargs.items():
            print(argname, value)
    else:
        print("No keyword arguments")
        
arbitrary_arg_f()
arbitrary_arg_f(12, -2, param1="foo")


No positional arguments
No keyword arguments
Positional arguments
12
-2
Keyword arguments
param1 foo

Mutable default arguments

  • be careful with mutable default arguments

In [97]:
def insert_value(value, l=[]):
    l.append(value)
    print(l)
    
l1 = []
insert_value(12, l1)
l2 = []
insert_value(14, l2)


[12]
[14]

In [98]:
insert_value(-1)


[-1]

In [99]:
insert_value(-3)


[-1, -3]
  • best not to use mutable defaults

One solution is to create a new list inside a function:


In [100]:
def insert_value(value, l=None):
    if l is None:
        l = []
    l.append(value)
    return l

l = insert_value(2)
l


Out[100]:
[2]

In [101]:
insert_value(12)


Out[101]:
[12]

Lambda expressions

Lambda expressions

  • unnamed functions
  • may take parameters
  • can access local scope

In [102]:
l = [-1, 0, -10, 2, 3]

Let's sort this list by absolute value. The built-in sorted's default behavior is not enough right now:


In [103]:
for e in sorted(l):
    print(e)


-10
-1
0
2
3

but we can specify the key to use


In [104]:
for e in sorted(l, key=lambda x : abs(x)):
    print(e)


0
-1
2
3
-10

we can use any callable as the key


In [105]:
for e in sorted(l, key=abs):
    print(e)


0
-1
2
3
-10

Strings

Strings

  • strings are immutable sequences of Unicode code points
  • strings can be constructed in various ways

In [106]:
single = 'abc'
double = "abc"
single == double


Out[106]:
True
  • immutability makes it impossible to change a string (unlike C-style string or C++'s std::string)

In [107]:
s = "abc"

# s[1] = "c"  # TypeError
  • all string operations create new objects

In [108]:
print(id(s))
s += "def"
id(s)


139967073665128
Out[108]:
139966880316808
  • strings support most operations available for lists
    • advanced indexing

In [109]:
s = "abcdefghijkl"
s[::2]


Out[109]:
'acegik'

Character encodings - Unicode

  • Unicode provides a mapping from letters to code points or numbers
character Unicode code point
a U+0061
ő U+0151
ش U+0634
گ U+06AF
¿ U+00BF
ư U+01B0
Ң U+04A2
U+26F5
  • actual text needs to be stored as a byte array/sequence (byte strings)
  • character encoding: code point - byte array correspondence
  • encoding: Unicode code point $\rightarrow$ byte sequence
  • decoding: byte sequence $\rightarrow$ Unicode code point
  • most popular encoding: UTF-8
character Unicode code point UTF-8 byte sequence
a U+0061 61
ő U+0151 C5 91
ش U+0634 D8 B4
گ U+06AF DA AF
¿ U+00BF C2 BF
ư U+01B0 C6 B0
Ң U+04A2 D2 A2
U+26F5 E2 9B B5

Python3 automatically encodes Unicode strings when:

  • writing to file
  • printing
  • any kind of operation that requires byte string conversion

In [110]:
s = "ábc"
print(type(s))

with open("file.txt", "w") as f:
    f.write(s)
    f.write("\n")


<class 'str'>

and automatically decodes byte sequnces when reading from file:


In [111]:
with open("file.txt") as f:
    text = f.read().strip()
    
print(text)
type(text)


ábc
Out[111]:
str

Python 2 does not do this automatically


In [112]:
%%python2

with open("file.txt") as f:
    text = f.read().strip()
    
print(text)
print(text[0], len(text), type(text))


ábc
('\xc3', 4, <type 'str'>)
  • Python 2's str type is a byte string, different from Python 3's str which is a Unicode string
  • Python 2 has a separate unicode type for Unicode strings
  • we need to manually decode the byte string

Manual string decoding in Python 2


In [113]:
%%python2

with open("file.txt") as f:
    text = f.read().decode('utf8').strip()
    
print(type(text), len(text))
print(text[0].encode('utf8'))


(<type 'unicode'>, 3)
á
  • we also need to manually encode the text

In [114]:
%%python2

with open("file.txt") as f:
    text = f.read().decode('utf8').strip()
    
print(text[0].encode('utf8'))
# print(text)  # raise UnicodeEncodeError - can you interpret the error message?
print(text.encode('utf8'))


á
ábc

bytes in Python 3

  • Python 3 has a different type called bytes for byte strings
  • one way to get a byte string is to encode a Unicode string

In [115]:
unicode_string = "ábc"
utf8_string = unicode_string.encode("utf8")
latin2_string = unicode_string.encode("latin2")

type(unicode_string), type(utf8_string), type(latin2_string)


Out[115]:
(str, bytes, bytes)

In [116]:
len(unicode_string), len(utf8_string), len(latin2_string)


Out[116]:
(3, 4, 3)

String operations

  • most sequence operations are supported
  • large variety of basic string manipulation: lower, upper, title

In [117]:
"abC".upper(), "ABC".lower(), "abc".title()


Out[117]:
('ABC', 'abc', 'Abc')

In [118]:
s = "\tabc  \n"
print("<START>" + s + "<STOP>")


<START>	abc  
<STOP>

In [119]:
s.strip()


Out[119]:
'abc'

In [120]:
s.rstrip()


Out[120]:
'\tabc'

In [121]:
s.lstrip()


Out[121]:
'abc  \n'

In [122]:
"abca".strip("a")


Out[122]:
'bc'

since each function returns a new string, they can be chained after another


In [123]:
" abcd abc".strip().rstrip("c").lstrip("ab")


Out[123]:
'cd ab'

Binary predicates (i.e. yes-no questions)


In [124]:
"abc".startswith("ab"), "abc".endswith("cd")


Out[124]:
(True, False)

In [125]:
"abc".istitle(), "Abc".istitle()


Out[125]:
(False, True)

In [126]:
"  \t\n".isspace()


Out[126]:
True

In [127]:
"989".isdigit()


Out[127]:
True

split and join


In [128]:
s = "the quick brown fox jumps over the lazy dog"
words = s.split()
words


Out[128]:
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

In [129]:
s = "R.E.M."
s.split(".")


Out[129]:
['R', 'E', 'M', '']

In [130]:
"-".join(words)


Out[130]:
'the-quick-brown-fox-jumps-over-the-lazy-dog'

use explicit token separators


In [131]:
" <W> ".join(words)


Out[131]:
'the <W> quick <W> brown <W> fox <W> jumps <W> over <W> the <W> lazy <W> dog'

Q. Compute the word frequencies in a Wikipedia article.


In [132]:
import urllib.request
wp_url = "https://en.wikipedia.org/wiki/Budapest"
text = urllib.request.urlopen(wp_url).read()
text = text.decode('utf8')

In [133]:
words = text.split()
len(words), len(set(words))


Out[133]:
(45827, 16575)

In [134]:
word_freq = {}
for word in words:
    if word not in word_freq:
        word_freq[word] = 1
    else:
        word_freq[word] += 1

In [ ]:


In [135]:
for word, freq in sorted(word_freq.items(), key=lambda x: -x[1])[:20]:
    print(word, freq)


<a 1356
the 1209
of 1063
and 708
in 661
<span 594
<div 459
<li 421
class="external 336
text" 326
rel="nofollow" 323
class="reference"><a 314
<li><a 313
</tr> 288
</div> 287
– 262
<td 245
class="Z3988"><span 243
is 240
class="citation 240

various other ways for counting word frequencies here

String formatting

Python features several string formatting options.

str.format

  • non-str objects are automatically cast to str
    • under the hood: the object's __format__ method is called if it exists, otherwise its __str__ is called

In [2]:
name = "John"
age = 25

print("My name is {0} and I'm {1} years old. I turned {1} last December".format(name, age))
print("My name is {} and I'm {} years old.".format(name, age))
# print("My name is {} and I'm {} years old. I turned {} last December".format(name, age))  # raises IndexError
print("My name is {name} and I'm {age} years old. I turned {age} last December".format(
    name=name, age=age))


My name is John and I'm 25 years old. I turned 25 last December
My name is John and I'm 25 years old.
My name is John and I'm 25 years old. I turned 25 last December

% operator

  • note that the arguments need to be parenthesized (make it a tuple)

In [3]:
print("My name is %s and I'm %d years old" % (name, age))


My name is John and I'm 25 years old

string interpolation

  • f-strings were added in Python 3.6 in PEP498

In [7]:
import sys

age2 = 12

if sys.version_info >= (3, 6):
    print(f"My name is {name} and I'm {age} years old {age2}")


My name is John and I'm 25 years old 12

In [ ]: