# Software Engineering for Data Scientists

## *Python Basics*
## DATA 515 A

Today's Objectives

0. Cloning LectureNotes

1. Opening & Navigating the Jupyter Notebook

2. Data type basics

3. Loading data with pandas

4. Cleaning and Manipulating data with pandas

5. Visualizing data with pandas & matplotlib

0. Cloning Lecture Notes

The course materials are maintained on GitHub. The next lecture will discuss GitHub in detail. For now, here are the minimal instructions for getting today's lecture materials.

  1. Open a terminal session
  2. Type 'git clone https://github.com/UWSEDS/LectureNotes.git'
  3. Wait until the download is complete
  4. cd LectureNotes
  5. cd 02_Procedural_Python

1. Opening and Navigating the Jupyter Notebook

We will start today with the interactive environment that we will use throughout the course: the Jupyter Notebook.

We will walk through the following steps together:

  1. Download miniconda (be sure to get Version 3.6) and install it on your system (hopefully you have done this before coming to class)

  2. Use the conda command-line tool to update your package listing and install the Jupyter notebook:

    Update conda's listing of packages for your system:

    $ conda update conda

    Install the Jupyter notebook and all its requirements:

    $ conda install jupyter notebook
  3. Navigate to the directory containing the course material. For example:

    $ cd LectureNotes/02_Procedural_Python

    You should see a number of files in the directory, including these:

    $ ls
  4. Type jupyter notebook in the terminal to start the notebook

    $ jupyter notebook

    If everything has worked correctly, it should automatically launch your default browser.

  5. Click on Lecture-Python-And-Data.ipynb to open the notebook containing the content for this lecture.

With that, you're set up to use the Jupyter notebook!

2. Data Types Basics

2.1 Data type theory

  • Components with the same capabilities are of the same type.
    • For example, the numbers 2 and 200 are both integers.
  • A type can be defined recursively, in terms of other types. For example:
    • A list is a collection of objects that can be indexed by position.
    • A list of integers contains an integer at each position.
  • A type has a set of supported operations. For example:
    • Integers can be added
    • Strings can be concatenated
    • A table can find the name of its columns
      • What type is returned from the operation?
  • In Python, members (components and operations) are indicated by a '.'
    • If a is a list, then a.append(1) adds 1 to the list.
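
The points above can be checked directly in a notebook cell. The following is a minimal sketch using only built-in types; the variable name numbers is illustrative and not part of the lecture notes.

```python
# Values with the same capabilities share a type.
print(type(2) is type(200))           # both are int

# A type determines which operations are supported.
print(2 + 200)                        # integers can be added
print("data" + "science")             # strings can be concatenated

# Members are reached with '.'; append is a method of list.
numbers = [2, 200]
numbers.append(1)
print(numbers)

# The result of an operation has a type too.
print(type(len(numbers)).__name__)
```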

2.2 Primitive types

The primitive types are integers, floats, strings, and booleans.

2.2.1 Integers


In [1]:
# Integer arithmetic
1 + 1


Out[1]:
2

In [2]:
# Integer division versus floating point division
print (6 // 4, 6/ 4)


1 1.5
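
Integer (floor) division rounds toward negative infinity rather than toward zero, which matters for negative operands. A short sketch with arbitrarily chosen values:

```python
# Floor division (//) rounds down; true division (/) always returns a float.
print(6 // 4, 6 / 4)      # 1 1.5
print(-6 // 4)            # -2, because -1.5 rounds down to -2
print(6 % 4)              # 2, the remainder that pairs with //

# divmod returns the quotient and remainder together.
quotient, remainder = divmod(6, 4)
print(quotient, remainder)
```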

2.2.2 Floats


In [3]:
# Have the full set of "calculator functions" but need the numpy package
import numpy as np
print (6.0 * 3, np.sin(2*np.pi))


18.0 -2.4492935982947064e-16
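
The sine of 2π printed above is a tiny nonzero number rather than exactly 0 because floats have finite precision. One common way to handle this is to compare with a tolerance. The sketch below uses the standard-library math module instead of numpy so that it runs without extra packages; the tolerance of 1e-9 is an assumption chosen for illustration.

```python
import math

value = math.sin(2 * math.pi)
print(value)

# Comparing a computed float to an exact value is fragile,
# so compare against zero with an absolute tolerance instead.
print(math.isclose(value, 0.0, abs_tol=1e-9))
```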

In [4]:
# Floats can have a null value called nan, not a number
a = np.nan
3*a


Out[4]:
nan
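
Because nan propagates through arithmetic and never compares equal to anything, including itself, it is usually detected with a dedicated function. A small sketch using math.nan and math.isnan from the standard library; for scalar values, np.nan and np.isnan behave the same way.

```python
import math

a = math.nan

print(3 * a)            # nan propagates through arithmetic
print(a == a)           # False: nan is not equal to itself
print(math.isnan(a))    # True: the reliable way to test for nan
```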

2.2.3 Strings


In [5]:
# Can concatenate, substring, find, count, ...

In [6]:
a = "The lazy"
b = "brown fox"
print ("Concatenation: ", a + b)
print ("First three letters: " + a[0:3])
print ("Index of 'z': " + str(a.find('z')))


Concatenation:  The lazybrown fox
First three letters: The
Index of 'z': 6
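
Note that + joins strings exactly as given, which is why the output above reads "lazybrown" with no space. A brief sketch of a few more string operations on the same values; the name sentence is introduced here for illustration.

```python
a = "The lazy"
b = "brown fox"

sentence = a + " " + b            # add the separator explicitly
print(sentence)

print(sentence.count("o"))        # number of occurrences of 'o'
print(sentence.find("fox"))       # index where 'fox' starts
print(sentence.find("dog"))       # -1 because 'dog' is not present
print(sentence.upper())
```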

2.3 Tuples

A tuple is an ordered sequence of objects. Tuples cannot be changed; they are immutable.


In [7]:
a_tuple = (1, 'ab', (1,2))
a_tuple


Out[7]:
(1, 'ab', (1, 2))

In [8]:
a_tuple[2]


Out[8]:
(1, 2)
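
Immutability can be seen by attempting an item assignment, which raises TypeError. The sketch below reuses a_tuple; the names number, text, pair, and longer are illustrative.

```python
a_tuple = (1, 'ab', (1, 2))

try:
    a_tuple[0] = 99                 # item assignment is not allowed
except TypeError as err:
    print("TypeError:", err)

# Unpacking assigns each position to a name.
number, text, pair = a_tuple
print(number, text, pair)

# "Changing" a tuple means building a new one.
longer = a_tuple + (3,)
print(longer)
```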

2.4 Lists

A list is an ordered sequence of objects that can be changed.


In [9]:
a_list = [1, 'a', [1,2]]

In [10]:
a_list[0]


Out[10]:
1

In [11]:
a_list.append(2)
a_list


Out[11]:
[1, 'a', [1, 2], 2]

In [12]:
a_list


Out[12]:
[1, 'a', [1, 2], 2]

In [13]:
dir(a_list)


Out[13]:
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

In [14]:
help (a_list)


Help on list object:

class list(object)
 |  list() -> new empty list
 |  list(iterable) -> new list initialized from iterable's items
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __le__(self, value, /)
 |      Return self<=value.
 |  
 |  __len__(self, /)
 |      Return len(self).
 |  
 |  __lt__(self, value, /)
 |      Return self<value.
 |  
 |  __mul__(self, value, /)
 |      Return self*value.n
 |  
 |  __ne__(self, value, /)
 |      Return self!=value.
 |  
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  __reversed__(...)
 |      L.__reversed__() -- return a reverse iterator over the list
 |  
 |  __rmul__(self, value, /)
 |      Return self*value.
 |  
 |  __setitem__(self, key, value, /)
 |      Set self[key] to value.
 |  
 |  __sizeof__(...)
 |      L.__sizeof__() -- size of L in memory, in bytes
 |  
 |  append(...)
 |      L.append(object) -> None -- append object to end
 |  
 |  clear(...)
 |      L.clear() -> None -- remove all items from L
 |  
 |  copy(...)
 |      L.copy() -> list -- a shallow copy of L
 |  
 |  count(...)
 |      L.count(value) -> integer -- return number of occurrences of value
 |  
 |  extend(...)
 |      L.extend(iterable) -> None -- extend list by appending elements from the iterable
 |  
 |  index(...)
 |      L.index(value, [start, [stop]]) -> integer -- return first index of value.
 |      Raises ValueError if the value is not present.
 |  
 |  insert(...)
 |      L.insert(index, object) -- insert object before index
 |  
 |  pop(...)
 |      L.pop([index]) -> item -- remove and return item at index (default last).
 |      Raises IndexError if list is empty or index is out of range.
 |  
 |  remove(...)
 |      L.remove(value) -> None -- remove first occurrence of value.
 |      Raises ValueError if the value is not present.
 |  
 |  reverse(...)
 |      L.reverse() -- reverse *IN PLACE*
 |  
 |  sort(...)
 |      L.sort(key=None, reverse=False) -> None -- stable sort *IN PLACE*
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __hash__ = None


In [15]:
a_list.count(1)


Out[15]:
1
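
The help output above lists several methods that modify a list in place and return None. A short sketch of the difference between append and extend, plus pop and sort, using made-up values:

```python
values = [3, 1, 2]

values.append([9, 8])     # adds one element, which is itself a list
print(values)             # [3, 1, 2, [9, 8]]

values.pop()              # removes and returns the last element
values.extend([9, 8])     # adds each element of the iterable
print(values)             # [3, 1, 2, 9, 8]

result = values.sort()    # sorts in place and returns None
print(result, values)     # None [1, 2, 3, 8, 9]
```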

2.5 Dictionaries

A dictionary associates a key with a value. A value can be any object, even another dictionary.


In [16]:
dessert_dict = {}  # Empty dictionary
dessert_dict['Dave'] = "Cake"
dessert_dict["Joe"] = ["Cake", "Pie"]
print (dessert_dict)


{'Dave': 'Cake', 'Joe': ['Cake', 'Pie']}

In [17]:
dessert_dict["Dave"]


Out[17]:
'Cake'

In [18]:
# Assigning to a new key adds an entry; it does not produce an error
dessert_dict["Bernease"] = {}
dessert_dict


Out[18]:
{'Bernease': {}, 'Dave': 'Cake', 'Joe': ['Cake', 'Pie']}

In [19]:
dessert_dict["Bernease"] = {"Favorite": ["sorbet", "cobbler"], "Dislike": "Brownies"}
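
Nested values are reached by chaining lookups. Reading a key that does not exist raises KeyError, while the get method returns a default instead. The sketch below rebuilds dessert_dict in one literal so that it is self-contained; the key "Alice" is hypothetical and deliberately absent.

```python
dessert_dict = {
    "Dave": "Cake",
    "Joe": ["Cake", "Pie"],
    "Bernease": {"Favorite": ["sorbet", "cobbler"], "Dislike": "Brownies"},
}

# Chain keys and indexes to reach nested values.
print(dessert_dict["Bernease"]["Favorite"][0])

# Looking up a missing key with [] raises KeyError.
try:
    dessert_dict["Alice"]
except KeyError as err:
    print("KeyError:", err)

# get returns a default instead of raising.
print(dessert_dict.get("Alice", "unknown"))

# items iterates over key-value pairs.
for name, dessert in dessert_dict.items():
    print(name, "->", dessert)
```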

2.6 A Shakespearean Detour: "What's in a Name?"

Deep vs. Shallow Copies

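A deep copy can be manipulated independently of the original. Plain assignment makes no copy at all: the new name refers to the same data as the original. A shallow copy (copy.copy) sits in between: it creates a new container whose elements are still shared with the original.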


In [20]:
# A first name shell game
first_int = 1
second_int = first_int
second_int += 1
second_int


Out[20]:
2

In [21]:
# What is first_int?
first_int


Out[21]:
1

In [22]:
# A second name shell game
a_list = ['a', 'aa', 'aaa']
b_list = a_list
b_list.append('bb')
b_list


Out[22]:
['a', 'aa', 'aaa', 'bb']

In [23]:
# What is a_list?
a_list


Out[23]:
['a', 'aa', 'aaa', 'bb']

In [24]:
# Create a deep copy
import copy
# The second name shell game again, now with a deep copy
a_list = ['a', 'aa', 'aaa']
b_list = copy.deepcopy(a_list)
b_list.append('bb')
print("b_list = %s" % str(b_list))
print("a_list = %s" % str(a_list))


b_list = ['a', 'aa', 'aaa', 'bb']
a_list = ['a', 'aa', 'aaa']

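Key insight: Deep vs. Shallow Copies

  • A deep copy can be manipulated separately from the original.
  • A shallow copy shares its elements with the original, so changes to mutable elements appear in both.
  • Assignment never copies; it gives the same object a second name. For immutables (int, str, tuple) this behaves like an independent copy, because operations such as += build a new object. For mutables (list, dict), a change made through either name is visible through both.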
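
Python's copy module distinguishes three cases that are easy to conflate: assignment (no copy), copy.copy (shallow copy), and copy.deepcopy (deep copy). The sketch below uses a nested list, chosen for illustration, because the difference between shallow and deep only shows up when the container holds mutable objects.

```python
import copy

original = ['a', ['b', 'c']]

alias = original                  # no copy: two names, one object
shallow = copy.copy(original)     # new outer list, shared inner list
deep = copy.deepcopy(original)    # new outer list and new inner list

original.append('d')              # changes the outer list
original[1].append('e')           # changes the inner list

print(alias)      # ['a', ['b', 'c', 'e'], 'd']
print(shallow)    # ['a', ['b', 'c', 'e']]
print(deep)       # ['a', ['b', 'c']]

# The "is" operator tests whether two names refer to the same object.
print(alias is original, shallow[1] is original[1], deep[1] is original[1])
```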

Name Resolution

The most common errors that you'll see in your Python code are:

  • NameError
  • AttributeError

A common error when using the bash shell is command not found.

Name resolution: Associating a name with code or data.
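
NameError and AttributeError are both failures of name resolution: in the first the name itself cannot be found, and in the second the object exists but has no member with the requested name. A minimal sketch; b_lst is a deliberately misspelled, hypothetical name, and seen is a helper list introduced here to record what was raised.

```python
a_list = [1, 2, 3]
seen = []   # names of the exception types raised below

# NameError: the name itself cannot be resolved.
try:
    print(b_lst)
except NameError as err:
    seen.append(type(err).__name__)
    print("NameError:", err)

# AttributeError: the object exists, but the member after '.' does not.
try:
    a_list.add(4)                 # lists have append, not add
except AttributeError as err:
    seen.append(type(err).__name__)
    print("AttributeError:", err)
```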

Resolving a name in the bash shell is done by searching the directories in the PATH environment variable. The first executable with the name is run.
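
The same PATH search can be observed from Python with shutil.which, which returns the path of the first matching executable or None if there is none. The helper describe and the command name no-such-command-515 are hypothetical, introduced only for this sketch; whether git is found depends on your system.

```python
import shutil

def describe(command):
    """Report how a command name resolves on the current PATH."""
    path = shutil.which(command)
    if path is None:
        return command + ": command not found"
    return command + " -> " + path

print(describe("git"))
print(describe("no-such-command-515"))
```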


In [25]:
# Example 1 of name resolution in python
var = 10
def func(val):
    var = val + 1
    return val

In [26]:
# What is returned?
print("func(2) = %d" % func(2))
# What is var?
print("var = %d" % var)


func(2) = 2
var = 10

In [27]:
# Example 2 of name resolution in python
var = 10
def func(val):
    return val + var

In [28]:
# What is returned?
print("func(2) = %d" % func(2))
# What is var?
print("var = %d" % var)


func(2) = 12
var = 10

Insights on Python name resolution

  • Names are assigned within a context (scope).
  • The context changes with the function and module.
    • Assigning a name in a function creates a new name local to that function.
    • Referencing a name that is not assigned in the function uses the existing name from the enclosing context.
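
The two cases above, plus the global statement that lets a function assign to a module-level name, can be put side by side. This is a sketch; the function names shadow, read_only, and rebind are hypothetical.

```python
var = 10

def shadow(val):
    var = val + 1        # creates a new local name; the module-level var is untouched
    return var

def read_only(val):
    return val + var     # no assignment here, so var resolves to the module-level name

def rebind(val):
    global var           # opt in to assigning the module-level name
    var = val + 1
    return var

first = shadow(2)
print(first, var)        # 3 10

second = read_only(2)
print(second, var)       # 12 10

third = rebind(2)
print(third, var)        # 3 3
```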

2.7 Object Essentials

Objects are a "packaging" of data and code. Almost all Python entities are objects.
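
As a minimal sketch of this packaging idea, the hypothetical class Tally below bundles a dictionary (data) with a method that updates it (code). It is not part of the lecture materials.

```python
class Tally:
    """Hypothetical class used only to illustrate data plus code."""

    def __init__(self):
        self.counts = {}            # data

    def add(self, item):            # code that operates on the data
        self.counts[item] = self.counts.get(item, 0) + 1

tally = Tally()
tally.add("Cake")
tally.add("Cake")
tally.add("Pie")
print(tally.counts)
```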


In [29]:
# A list and a dict are objects.
# dict has been implemented so that you see its values when you type
# the instance name.
# This is done with many python objects, like list.
a_dict = {'a': [1, 2], 'b': [3, 4, 5]}
a_dict


Out[29]:
{'a': [1, 2], 'b': [3, 4, 5]}

In [30]:
# You access the data and methods (code) associated with an object by
# using the "." operator. These are referred to collectively
# as attributes. Methods are followed by parentheses;
# values (properties) are not.
a_dict.keys()


Out[30]:
dict_keys(['a', 'b'])

In [31]:
# You can discover the attributes of an object using "dir"
dir(a_dict)


Out[31]:
['__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'clear',
 'copy',
 'fromkeys',
 'get',
 'items',
 'keys',
 'pop',
 'popitem',
 'setdefault',
 'update',
 'values']
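
Most of the names returned by dir start with underscores and support Python's internal machinery. One way to focus on the everyday interface is to filter them out, as in the sketch below; the names public and attribute are illustrative.

```python
a_dict = {'a': [1, 2], 'b': [3, 4, 5]}

# Names without a leading underscore form the public interface.
public = [name for name in dir(a_dict) if not name.startswith('_')]
print(public)

# getattr looks an attribute up by name; callable tells methods from plain values.
for name in ('keys', '__doc__'):
    attribute = getattr(a_dict, name)
    print(name, callable(attribute))
```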

2.8 Summary


type        description
primitive   int, float, string, bool
tuple       An immutable collection of ordered objects
list        A mutable collection of ordered objects
dictionary  A mutable collection of named objects
object      A packaging of code and data