Introduction to Python

1. Installing Python

2. The Language

Expressions
List, Tuple and Dictionary
Strings
Functions

3. Example: Word Frequency Analysis with Python

Reading text files
Getting and using Python packages: wordcloud
Histograms
Exporting data as text files

1. Installing Python:

Easy way: with a Python distribution such as Anaconda

https://www.continuum.io/downloads
Hard way: compile it yourself from source. It is open source, after all.

[Not covered here; this was the main way in the early days, before 2011 or even 2014.]

Three Python user interfaces

Python shell: python

    [yfeng1@waterfall ~]$ python
    Python 2.7.12 (default, Sep 29 2016, 13:30:34) 
    [GCC 6.2.1 20160916 (Red Hat 6.2.1-2)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

Jupyter Notebook (in a browser, like this)
IDEs: PyCharm, Spyder, etc.

We use Jupyter Notebook here.

Jupyter Notebook is included in the Anaconda distribution.

2. Python the Language

2.1 Expressions

An expression looks like a math formula



In [1]:

    
2 + 3 # Press <Ctrl-Enter> to evaluate a cell









    Out[1]:





5



In [2]:

    
2 + int(3.5 * 4) * float("8")









    Out[2]:





114.0



In [3]:

    
9 // 2 # Press <Ctrl-Enter> to evaluate









    Out[3]:





4
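The division-related operations are easy to mix up. Here is a minimal sketch of how /, // and int() differ in Python 3; the variable names are only illustrative.

```python
# True division always yields a float in Python 3.
true_div = 9 / 2           # 4.5

# Floor division rounds down, toward negative infinity.
floor_div = 9 // 2         # 4
neg_floor_div = -9 // 2    # -5, not -4

# int() truncates toward zero instead of rounding down.
truncated = int(-4.5)      # -4

print(true_div, floor_div, neg_floor_div, truncated)
```

The difference between rounding down and truncating only shows up for negative numbers.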

To use the result of an expression later, we assign it to a variable.

The type of a variable in Python is usually implied by the value assigned to it. (See also duck typing: https://en.wikipedia.org/wiki/Duck_typing)



In [4]:

    
x = 2 + 3



In [5]:

    
x









    Out[5]:





5
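As a small sketch of what an implied type means: the same name can hold values of different types, and type() reports the type of whatever value is currently assigned. The name value is only illustrative, chosen so that x above is left untouched.

```python
value = 2 + 3
type_after_int = type(value)      # int

value = 2 + 3.0
type_after_float = type(value)    # float, because one operand is a float

value = "five"
type_after_str = type(value)      # str

print(type_after_int, type_after_float, type_after_str)
```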

The weirdest expression in Python:



In [6]:

    
print(x)

Q: What happens under the hood?

2.2 List, Tuple and Dictionary

A list is an ordered sequence of items, written as comma-separated expressions inside [].



In [7]:

    
MyListOfNumbers = [1,2,3,4,5,6,7]

A list has a length



In [8]:

    
len(MyListOfNumbers)









    Out[8]:





7

We can loop over items in a list.



In [9]:

    
for num in MyListOfNumbers:
    print(num, end=', ')









    



1, 2, 3, 4, 5, 6, 7,

A tuple is almost a list, defined with () instead of []. () can sometimes be omitted.



In [10]:

    
MyTupleOfNumbers = (1, 2, 3, 4, 5, 6)
MyTupleOfNumbers = 1, 2, 3, 4, 5, 6
for num in MyTupleOfNumbers:
    print(num, end=', ')









    



1, 2, 3, 4, 5, 6,
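Omitting the () is often called packing, and the reverse operation is unpacking. A minimal sketch, with illustrative names:

```python
point = 3, 4             # packing: the same as (3, 4)
px, py = point           # unpacking into two variables

single = (5,)            # a one-element tuple needs the trailing comma
not_a_tuple = (5)        # without the comma this is just the integer 5

px, py = py, px          # swap two variables via packing and unpacking

print(point, single, not_a_tuple, px, py)
```

The comma, not the parentheses, is what makes a tuple.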

But tuples have a twist.

Items in a tuple are immutable;
items in a list can change.

Let's try it out



In [11]:

    
MyListOfNumbers[4] = 99
print(MyListOfNumbers)









    



[1, 2, 3, 4, 99, 6, 7]



In [12]:

    
Tuple[4] = 99









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-059965e7e9cf> in <module>()
----> 1 Tuple[4] = 99

NameError: name 'Tuple' is not defined

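Oops. The variable name was mistyped: Tuple is not defined, so Python raised a NameError before it even attempted the assignment.

With the correct name, MyTupleOfNumbers[4] = 99 raises TypeError: 'tuple' object does not support item assignment.

Tuples are immutable.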
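A minimal sketch of the intended experiment, using a hypothetical tuple named numbers and catching the error so that the cell keeps running:

```python
numbers = (1, 2, 3, 4, 5, 6)

try:
    numbers[4] = 99                 # item assignment is not allowed on a tuple
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# To change an item, make a mutable copy first.
as_list = list(numbers)
as_list[4] = 99
print(numbers, as_list)
```

The original tuple is unchanged; only the list copy holds 99.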

Dictionary

A dictionary records a mapping from keys to values.

Mathematically a dictionary defines a function on a finite, discrete domain.



In [13]:

    
MyDictionary = {}

MyDictionary[9] = 81
MyDictionary[3] = 9



In [14]:

    
print(MyDictionary)









    



{9: 81, 3: 9}

We may write

MyDictionary : {9, 3} => R.

We can loop over items in a dictionary, as well



In [15]:

    
for k, v in MyDictionary.items():
    print('Key', k, ":", 'Value', v, end=' | ')









    



Key 9 : Value 81 | Key 3 : Value 9 |
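To make the function picture concrete, here is a short sketch that lists the domain (the keys) and the image (the values) of a dictionary and tests membership. The name squares is illustrative; it holds the same pairs as MyDictionary.

```python
squares = {9: 81, 3: 9}

domain = sorted(squares.keys())      # the keys: where the mapping is defined
image = sorted(squares.values())     # the values the mapping takes

has_nine = 9 in squares              # membership tests look at the keys
has_ten = 10 in squares

print(domain, image, has_nine, has_ten)
```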

2.3 Strings

We have seen strings a few times.

String literals can be defined with quotation marks, single or double.



In [16]:

    
"the hacker within", 'the hacker within', r'the hacker within', u'the hacker within', b'the hacker within'









    Out[16]:





('the hacker within',
 'the hacker within',
 'the hacker within',
 'the hacker within',
 b'the hacker within')

Q: Mind the tuple
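The prefixes only make a visible difference for some literals. A small sketch, using a literal with a backslash to show what r and b do (in Python 3 the u prefix is accepted for compatibility and changes nothing):

```python
plain = "a\tb"      # \t is one tab character: 3 characters in total
raw = r"a\tb"       # raw string: backslash and t are kept as 2 characters
data = b"a\tb"      # bytes object: 3 bytes, not a str

pair = "x", "y"     # comma-separated expressions form a tuple

print(len(plain), len(raw), len(data), type(data).__name__, type(pair).__name__)
```

The last line is why the output above is shown in parentheses: the five comma-separated literals form one tuple.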

If we assign a string literal to a variable, we get a string variable



In [17]:

    
name = "the hacker within"

Python gives us many ways to manipulate a string.



In [18]:

    
print(name.upper())
print(name.split())
print(name.upper().split())









    



THE HACKER WITHIN
['the', 'hacker', 'within']
['THE', 'HACKER', 'WITHIN']

We can search for a substring within a string.



In [19]:

    
name.find("hack")









    Out[19]:





4



In [20]:

    
name[name.find("hack"):]









    Out[20]:





'hacker within'

Formatting strings with the traditional printf formats



In [21]:

    
foo = "there are %03d numbers" % 3
print(foo)









    



there are 003 numbers
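With more than one value, the right-hand side of % becomes a tuple. A sketch with made-up values, next to the equivalent str.format call:

```python
count = 3
label = "numbers"
share = 37.5   # made-up value for illustration

# %03d: integer zero-padded to width 3; %s: any object as text;
# %.2f: float with 2 decimals; %% : a literal percent sign.
old_style = "there are %03d %s (%.2f%%)" % (count, label, share)

# The same result with str.format.
new_style = "there are {:03d} {} ({:.2f}%)".format(count, label, share)

print(old_style)
print(new_style)
```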

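Conversion between bytes and strings

encode : from string to bytes
decode : from bytes to string

The rule used for the conversion is called an 'encoding'. In Python 3, encode() and decode() default to UTF-8; the default for reading and writing files depends on the system and is usually UTF-8 on Unix.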

Q: What is the default encoding on Windows and OS X?



In [22]:

    
bname = name.encode()
print(bname)









    



b'the hacker within'



In [23]:

    
print(bname.decode())









    



the hacker within

Encodings are important if you work with text beyond English.
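A short sketch of why this matters, using a word with one non-ASCII character. The two encodings are picked only as examples.

```python
word = "naïve"

utf8_bytes = word.encode("utf-8")        # ï takes two bytes in UTF-8
latin1_bytes = word.encode("latin-1")    # ï takes one byte in Latin-1

round_trip = utf8_bytes.decode("utf-8")  # same encoding both ways: text survives
garbled = utf8_bytes.decode("latin-1")   # wrong encoding: no error, but wrong text

print(len(word), len(utf8_bytes), len(latin1_bytes))
print(round_trip, garbled)
```

Decoding with the wrong encoding does not always raise an error, which is what makes such bugs easy to miss.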

2.4 Functions

A Python function is a more compact way to represent a mathematical function than a dictionary.

(Remember dictionaries?)



In [24]:

    
def square_num(num):
    
    return num*num



In [25]:

    
print(square_num(9))
print(square_num(3))

Compare this with our dictionary



In [26]:

    
print(MyDictionary[9])
print(MyDictionary[3])

The domain of a function can be much bigger than that of a dictionary.

A dictionary only remembers what we told it;
a function re-evaluates its body every time it is called.



In [27]:

    
print(square_num(10))
print(MyDictionary[10])









    



100






    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-27-dfeb607970e6> in <module>()
      1 print(square_num(10))
----> 2 print(MyDictionary[10])

KeyError: 10

Oops. We never told MyDictionary about 10.
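When a key may be missing, there are ways to avoid the KeyError. A minimal sketch with an illustrative dictionary named squares; the get method with a default value is used again in the word count example.

```python
squares = {9: 81, 3: 9}

missing = squares.get(10)            # no such key: get returns None
with_default = squares.get(10, 0)    # or a default of our choice

# Teach the dictionary about 10 only if it does not know it yet.
if 10 not in squares:
    squares[10] = 10 * 10

print(missing, with_default, squares[10])
```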

3. A Word Count Example

In this section we will analyze some textual data with Python.

We first obtain the data, with a bash cell.



In [28]:

    
%%bash

curl -so titles.tsv https://raw.githubusercontent.com/thehackerwithin/berkeley/master/code_examples/spring17_survey/session_titles.tsv
head -5 titles.tsv









    



Navigating bash and UNIX
Machine learning with Neural Networks using Keras.io
Git and GitHub
Data Tidying in RR & Python
Ensemble (Machine) Learning with Super Learner and H2O in RR

Reading in a text file is very easy in Python.



In [29]:

    
text = open('titles.tsv').read()

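Q: There is a subtle problem. (Hint: when does the file get closed?)

We usually use a different syntax for reading files: a with block, which closes the file as soon as the block ends.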



In [30]:

    
with open('titles.tsv') as ff:
    text = ff.read()
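To see what the with block does, here is a sketch that checks the closed attribute of the file object. It writes a small scratch file in a temporary directory so that it does not depend on titles.tsv; the file name demo.txt is made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.txt")   # hypothetical scratch file

    with open(path, "w") as ff:
        ff.write("first line\nsecond line\n")

    with open(path) as ff:
        demo_text = ff.read()
        open_inside = not ff.closed           # still open inside the block

    closed_after = ff.closed                  # closed once the block ends

print(open_inside, closed_after)
```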

Let's chop the text into semantic elements: words and lines.



In [31]:

    
words = text.split()
lines = text.split("\n")



In [32]:

    
print(words[::10]) # 1 word every 10









    



['Navigating', 'Keras.io', 'Ensemble', 'RRStudio', 'Python', 'with', 'Learning', 'To', 'For', 'Geospatial', 'Scraping', 'Advanced', 'and', 'GitHub', 'Computer', 'Editors', 'What', 'Teach', 'Cython', 'MocDown', 'Module', 'physics)', 'imagemagick', 'Have', 'Pi', 'Code', 'and']



In [33]:

    
print(lines[::10]) # 1 line every 10









    



['Navigating bash and UNIX', 'The Python Olympics ', 'Build Systems ', 'High Performance Python ', 'Advanced Git and GitHub ', 'Matplotlib and Seaborn ', 'Jekyll ', 'LaTeX ', 'Intro to Git Part II ']

Looks like we read in the file correctly.
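The two ways of splitting behave a little differently, which explains the trailing spaces in the lines above. A sketch on a tiny made-up sample in the same shape as the file:

```python
sample = "Git and GitHub \nLaTeX \n"

sample_words = sample.split()        # splits on any whitespace, drops empty pieces
sample_lines = sample.split("\n")    # splits only on newlines, keeps everything else
clean_lines = sample.splitlines()    # like split("\n"), without the empty last piece
stripped = [line.strip() for line in clean_lines]   # remove trailing spaces

print(sample_words)
print(sample_lines)
print(stripped)
```

The empty string at the end of sample_lines comes from the final newline of the text.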

Let's visualize this data.

We use some external help from a package, wordcloud.

So we will first install the package with pip, the Python package manager.



In [34]:

    
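import subprocess, sys
# pip.main() was removed in pip 10, so run pip as a module of the current interpreter.
subprocess.call([sys.executable, "-m", "pip", "install", "wordcloud"])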









    



Requirement already satisfied: wordcloud in /home/yfeng1/anaconda3/install/lib/python3.5/site-packages






    Out[34]:





0

Oops, I had already installed wordcloud. You may see a different message.



In [35]:

    
from wordcloud import WordCloud

wordcloud = WordCloud(width=800, height=300, prefer_horizontal=1, stopwords=None).generate(text)
wordcloud.to_image()









    Out[35]:

[Image: word cloud of the session titles]

The biggest keyword is Python. Let's get quantitative:

Frequency statistics: How many times does each word occur in the file?

For each word, we need to remember a number (the number of occurrences).

Use a dictionary.
We will examine every word in the file (already split into words).

Use a loop.



In [36]:

    
freq_dict = {}

for word in words:
    freq_dict[word] = freq_dict.get(word, 0) + 1
    
print(freq_dict)









    



{'Pages': 1, 'imagemagick': 1, 'Metaprogramming': 1, 'Do': 1, 'with': 6, 'the': 2, 'Open': 1, 'Hadoop': 1, 'Pi': 1, 'Part': 2, 'Standard': 1, 'Hierarchy': 1, 'C++': 1, 'UNIX': 1, 'Installation': 1, 'GPUs': 2, 'Seaborn': 1, 'Computing': 1, 'Learning': 2, 'Makefiles': 1, 'Engineering': 1, 'So': 1, 'bash': 1, 'RR': 4, 'Tableau': 1, 'Parallel': 1, 'Module': 1, 'NLTK': 1, 'Nuclear': 2, 'Performance': 1, 'Where': 1, 'Packaging': 1, 'Ensemble': 1, '(Machine)': 1, 'RRStudio': 1, 'Bash': 3, 'in': 3, 'The': 4, 'Hacking': 1, 'Teach': 2, 'Documenting': 1, 'Build': 1, 'Survey': 1, 'Git': 7, 'Jekyll': 2, '3': 1, 'Physical': 1, 'Computer': 1, 'for': 2, 'Distribution': 1, 'You': 3, 'Introductory': 1, 'Learn': 2, 'HPC': 1, 'RadWatch': 1, 'Can': 1, 'Threading': 1, 'Navigating': 1, 'Networks': 1, 'Object': 1, 'Pyne': 1, 'Text': 1, 'Kaggle': 1, 'Architectures': 1, 'Matplotlib': 2, 'Programming': 1, 'Wikipedia': 1, 'Webscraping': 1, 'H2O': 1, 'CRAM': 1, 'Cython': 3, 'Visualization': 1, 'Visualizing': 1, 'Parallelization': 2, 'Geospatial': 1, 'IPython': 2, 'Vectorization': 1, 'Super': 1, 'Code': 1, 'Spark': 1, 'Plotting': 2, 'Serpent': 1, 'Free-form': 1, 'To': 2, 'Advanced': 4, 'C/API': 1, 'MocDown': 2, 'Thanksgiving': 1, 'Computational': 1, 'What': 3, 'Microcontrollers': 1, 'GitHub': 3, 'Natural': 1, 'ORIGEN': 1, 'learning': 2, 'Tidying': 1, 'Pandas': 2, 'Intro': 2, 'Tools': 1, 'High': 1, 'Logging': 1, 'Install': 1, 'Github': 2, 'Data': 5, 'Emailing': 1, 'matplotlib': 1, 'Handling': 1, 'using': 1, 'PARCS': 1, '&': 4, 'to': 3, 'Conversion': 1, 'CUDA': 1, 'Competitions': 1, 'scikit-learn': 3, 'A': 1, 'Processing': 1, 'Editors': 1, 'Make': 1, 'Olympics': 2, 'Source': 1, 'Self': 1, 'Systems': 1, 'II': 2, 'Learner': 1, 'Neural': 1, 'physics)': 1, 'Numpy': 1, 'and': 28, 'D3.js': 1, 'hacking': 1, 'Shiny': 1, 'Software': 1, '(without': 1, 'Julia': 2, 'LaTeX': 3, 'Python': 13, 'Overview': 1, 'When': 1, 'Filesystem': 1, 'Language': 1, '3D': 1, 'Keras.io': 1, 'Raspberry': 1, 'Machine': 3, 
'Scraping': 1, 'Testing': 3, 'Orientation': 1, 'Shell': 1, 'Have': 1, 'Want': 1, 'Timeseries': 1, 'For': 1}



In [37]:

    
print(freq_dict['Python'])
print(freq_dict['CUDA'])

Seems to be working. Let's make a function.



In [38]:

    
def freq(items):
    freq_dict = {}
    for word in items:
        freq_dict[word] = freq_dict.get(word, 0) + 1
    return freq_dict

The function freq is a mapping from a list to a dictionary,

where each key of the dictionary (output) is associated with the number of occurrences of that key in the list (input).
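The standard library has a ready-made tool for the same job, collections.Counter. A sketch comparing it with freq on a made-up list:

```python
from collections import Counter

def freq(items):
    # Same logic as the function defined above.
    freq_dict = {}
    for word in items:
        freq_dict[word] = freq_dict.get(word, 0) + 1
    return freq_dict

sample = ["git", "python", "git", "and", "git"]

by_hand = freq(sample)
by_counter = Counter(sample)             # a dict subclass that counts items

same_counts = (by_hand == dict(by_counter))
top = by_counter.most_common(1)          # list of (item, count), largest count first

print(by_hand, same_counts, top)
```

Writing freq by hand is still useful here, because it shows how a dictionary and a loop work together.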



In [39]:

    
freq_dict = freq(words)
freq_freq = freq(freq_dict.values())

Q : what is in freq_freq?



In [40]:

    
print(freq_freq)









    



{1: 111, 2: 22, 3: 11, 4: 4, 5: 1, 6: 1, 7: 1, 28: 1, 13: 1}
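One way to read this output: each key is an occurrence count, and each value is how many distinct words have that count. So 111 words occur once, and one word occurs 28 times. A quick consistency check, copying the printed dictionary into an illustrative variable:

```python
printed_freq_freq = {1: 111, 2: 22, 3: 11, 4: 4, 5: 1, 6: 1, 7: 1, 28: 1, 13: 1}

# Every distinct word falls in exactly one bin, so the values add up
# to the number of distinct words (this should equal len(freq_dict)).
distinct_words = sum(printed_freq_freq.values())

# Count times number-of-words-with-that-count adds up to the total
# number of words (this should equal len(words)).
total_words = sum(count * n for count, n in printed_freq_freq.items())

print(distinct_words, total_words)
```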

Q: Which is the most frequent word?

Answer



In [41]:

    
top_word = ""
top_word_freq = 0
for word, count in freq_dict.items():  # 'count' avoids shadowing the freq() function
    if count > top_word_freq:
        top_word = word
        top_word_freq = count
print('word', top_word, 'freq', top_word_freq)









    



word and freq 28

Using the max function avoids writing an if



In [43]:

    
most = (0, None)
for word, count in freq_dict.items():  # 'count' avoids shadowing the freq() function
    most = max([most, (count, word)])

print(most)









    



(28, 'and')

Final challenge: the one-liner.



In [44]:

    
next(reversed(sorted((freq, word) for word, freq in freq_dict.items())))









    Out[44]:





(28, 'and')
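A possibly simpler alternative is max with a key function, which avoids sorting the whole dictionary. A sketch on a small made-up dictionary:

```python
sample_freq = {"and": 28, "Python": 13, "Git": 7}

# key tells max to compare the items by their count (the second element).
top_word, top_count = max(sample_freq.items(), key=lambda item: item[1])

print(top_word, top_count)
```

The two approaches can differ when several words share the top count: the sorted version breaks the tie by comparing the words, while max returns the first such item it meets.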

Exporting data

The world of Python has its limits.

Sooner or later we need to reach out to other applications.

To do that, we export the data from Python as a text file.



In [45]:

    
def save(filename, freq_dict):
    ff = open(filename, 'w')
    for word, freq in sorted(freq_dict.items()):
        ff.write("%s %s\n" % (word, freq))
    ff.close()



In [46]:

    
def save(filename, freq_dict):
    with open(filename, 'w') as ff:
        for word, freq in sorted(freq_dict.items()):
            ff.write("%s %s\n" % (word, freq))
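To check that the format can be read back, here is a sketch of an inverse function. The name load is hypothetical and not part of the original notebook; it assumes one "key count" pair per line, as written by save.

```python
import os
import tempfile

def save(filename, freq_dict):
    # Same as the function above: one "key count" pair per line.
    with open(filename, 'w') as ff:
        for word, freq in sorted(freq_dict.items()):
            ff.write("%s %s\n" % (word, freq))

def load(filename):
    # Hypothetical inverse of save: split each line at the last space.
    result = {}
    with open(filename) as ff:
        for line in ff:
            word, count = line.rsplit(" ", 1)
            result[word] = int(count)     # int() ignores the trailing newline
    return result

sample = {"and": 28, "Git": 7, "C++": 1}
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")
    save(path, sample)
    restored = load(path)

print(restored == sample)
```

Keys always come back as strings, so a dictionary with integer keys such as freq_freq would need an extra int() on the key.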



In [47]:

    
save("freq_dict_thw.txt", freq_dict)



In [48]:

    
!cat freq_dict_thw.txt









    



& 4
(Machine) 1
(without 1
3 1
3D 1
A 1
Advanced 4
Architectures 1
Bash 3
Build 1
C++ 1
C/API 1
CRAM 1
CUDA 1
Can 1
Code 1
Competitions 1
Computational 1
Computer 1
Computing 1
Conversion 1
Cython 3
D3.js 1
Data 5
Distribution 1
Do 1
Documenting 1
Editors 1
Emailing 1
Engineering 1
Ensemble 1
Filesystem 1
For 1
Free-form 1
GPUs 2
Geospatial 1
Git 7
GitHub 3
Github 2
H2O 1
HPC 1
Hacking 1
Hadoop 1
Handling 1
Have 1
Hierarchy 1
High 1
II 2
IPython 2
Install 1
Installation 1
Intro 2
Introductory 1
Jekyll 2
Julia 2
Kaggle 1
Keras.io 1
LaTeX 3
Language 1
Learn 2
Learner 1
Learning 2
Logging 1
Machine 3
Make 1
Makefiles 1
Matplotlib 2
Metaprogramming 1
Microcontrollers 1
MocDown 2
Module 1
NLTK 1
Natural 1
Navigating 1
Networks 1
Neural 1
Nuclear 2
Numpy 1
ORIGEN 1
Object 1
Olympics 2
Open 1
Orientation 1
Overview 1
PARCS 1
Packaging 1
Pages 1
Pandas 2
Parallel 1
Parallelization 2
Part 2
Performance 1
Physical 1
Pi 1
Plotting 2
Processing 1
Programming 1
Pyne 1
Python 13
RR 4
RRStudio 1
RadWatch 1
Raspberry 1
Scraping 1
Seaborn 1
Self 1
Serpent 1
Shell 1
Shiny 1
So 1
Software 1
Source 1
Spark 1
Standard 1
Super 1
Survey 1
Systems 1
Tableau 1
Teach 2
Testing 3
Text 1
Thanksgiving 1
The 4
Threading 1
Tidying 1
Timeseries 1
To 2
Tools 1
UNIX 1
Vectorization 1
Visualization 1
Visualizing 1
Want 1
Webscraping 1
What 3
When 1
Where 1
Wikipedia 1
You 3
and 28
bash 1
for 2
hacking 1
imagemagick 1
in 3
learning 2
matplotlib 1
physics) 1
scikit-learn 3
the 2
to 3
using 1
with 6



In [49]:

    
save("freq_freq_thw.txt", freq_freq)



In [50]:

    
!cat freq_freq_thw.txt

Reading the file back in with pandas



In [51]:

    
import pandas as pd
dataframe = pd.read_table("freq_freq_thw.txt", sep=' ', header=None, index_col=0)
dataframe



In [52]:

    
%matplotlib inline



In [53]:

    
dataframe.plot(kind='bar')









    Out[53]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fddc9cfebe0>



In [57]:

    
import pandas as pd
dataframe = pd.read_table("freq_dict_thw.txt", sep=' ', header=None, index_col=0)



In [56]:

    
dataframe.plot(kind='bar')









    Out[56]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fddc67522b0>

Well, a busy plot is a busy plot...
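One possible way to make the plot less busy is to keep only the most frequent words before plotting. A sketch in plain Python; the function name top_n and the sample data are made up.

```python
def top_n(freq_dict, n):
    # Sort the (word, count) pairs by count, largest first, and keep n of them.
    ranked = sorted(freq_dict.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

sample_freq = {"and": 28, "Python": 13, "Git": 7, "with": 6, "Data": 5, "UNIX": 1}

top_three = top_n(sample_freq, 3)
print(top_three)
```

The shortened list could then be plotted instead of the full table.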


