Intro to Python!

Stuart Geiger and Yu Feng for The Hacker Within

1. Installing Python

2. The Language

Expressions
List, Tuple and Dictionary
Strings
Functions

3. Example: Word Frequency Analysis with Python

Reading text files
Getting and using Python packages: wordcloud
Histograms
Exporting data as text files

1. Installing Python:

Easy way: use a Python distribution such as Anaconda: https://www.continuum.io/downloads
Hard way: install Python and all dependencies yourself
Super hard way: compile everything from scratch

Three Python user interfaces

Python Shell `python`

```
    [yfeng1@waterfall ~]$ python
    Python 2.7.12 (default, Sep 29 2016, 13:30:34) 
    [GCC 6.2.1 20160916 (Red Hat 6.2.1-2)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 
```

Jupyter Notebook (in a browser, like this)

IDEs: PyCharm, Spyder, etc.

We use Jupyter Notebook here.

Jupyter Notebook is included in the Anaconda distribution.

Expressions



In [1]:

    
2 + 3









    Out[1]:





5



In [2]:

    
2 / 3









    Out[2]:





0.6666666666666666



In [3]:

    
2 * 3









    Out[3]:





6



In [4]:

    
2 ** 3









    Out[4]:





8
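The `2 / 3` cell above returns a float because this notebook runs Python 3; in Python 2 the same expression evaluates to `0`. As a minimal sketch of two related operators that the cells above do not cover, here are floor division (`//`) and remainder (`%`). The variable names are only for illustration.

```python
# True division, floor division and remainder (Python 3 behavior).
true_div = 7 / 2      # always a float in Python 3
floor_div = 7 // 2    # rounds down to the nearest integer
remainder = 7 % 2     # what is left over after floor division

print(true_div, floor_div, remainder)
```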

Variables



In [5]:

    
num = 2 ** 3



In [6]:

    
print(num)



In [7]:

    
num









    Out[7]:





8



In [8]:

    
type(num)









    Out[8]:





int



In [9]:

    
name = "The Hacker Within"



In [10]:

    
type(name)









    Out[10]:





str



In [11]:

    
name + 8









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-63a7ef3a1717> in <module>()
----> 1 name + 8

TypeError: Can't convert 'int' object to str implicitly



In [12]:

    
name + str(8)









    Out[12]:





'The Hacker Within8'
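The TypeError above occurs because Python does not silently convert numbers to text when you use `+`. Besides calling `str()` yourself, one common alternative is `str.format`, which converts each argument for you. A small sketch (the `label` variable is just an illustrative name):

```python
name = "The Hacker Within"
count = 8

# Each {} placeholder is filled with the text form of the matching argument.
label = "{} {}".format(name, count)
print(label)
```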

Lists



In [13]:

    
num_list = [0,1,2,3,4,5,6,7,8]



In [14]:

    
print(num_list)









    



[0, 1, 2, 3, 4, 5, 6, 7, 8]



In [15]:

    
type(num_list)









    Out[15]:





list



In [16]:

    
num_list[3]









    Out[16]:





3



In [17]:

    
num_list[3] = 10



In [18]:

    
print(num_list)









    



[0, 1, 2, 10, 4, 5, 6, 7, 8]

Appending new items to a list



In [19]:

    
num_list.append(3)



In [20]:

    
print(num_list)









    



[0, 1, 2, 10, 4, 5, 6, 7, 8, 3]
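A few other list operations are often useful alongside `append`. The following is a short sketch, not part of the original session; it rebuilds the same list and uses only built-in list features (`len`, slicing, negative indices and `extend`).

```python
num_list = [0, 1, 2, 10, 4, 5, 6, 7, 8]
num_list.append(3)

length = len(num_list)        # number of items in the list
first_three = num_list[:3]    # a new list holding the first three items
last_item = num_list[-1]      # negative indices count from the end

# extend adds every item of another list, one by one.
num_list.extend([11, 12])
print(length, first_three, last_item, num_list)
```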

Loops and iteration



In [21]:

    
for num in num_list:
    print(num)



In [22]:

    
for num in num_list:
    print(num, num * num)



In [23]:

    
num_list.append("LOL")



In [24]:

    
print(num_list)









    



[0, 1, 2, 10, 4, 5, 6, 7, 8, 3, 'LOL']

If / else conditionals



In [25]:

    
for num in num_list:
    if type(num) is int or type(num) is float:
        print(num, num * num)
    else:
        print("ERROR!", num, "is not an int")









    



0 0
1 1
2 4
10 100
4 16
5 25
6 36
7 49
8 64
3 9
ERROR! LOL is not an int
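The check `type(num) is int or type(num) is float` works, but a commonly used alternative is the built-in `isinstance`, which accepts a tuple of types. A hedged sketch of the same loop, collecting the squares in a list (`squares` is a name introduced here for illustration):

```python
num_list = [0, 1, 2, 10, 4, 5, 6, 7, 8, 3, "LOL"]

squares = []
for item in num_list:
    # isinstance accepts a tuple of types and also matches subclasses.
    if isinstance(item, (int, float)):
        squares.append(item * item)
    else:
        print("Skipping", repr(item), "because it is not a number")

print(squares)
```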

Functions



In [26]:

    
def process_list(input_list):
    for num in input_list:
        if type(num) is int or type(num) is float:
            print(num, num * num)
        else:
            print("ERROR!", num, "is not an int")



In [27]:

    
process_list(num_list)









    



0 0
1 1
2 4
10 100
4 16
5 25
6 36
7 49
8 64
3 9
ERROR! LOL is not an int



In [28]:

    
process_list([1,3,4,14,1,9])

Dictionaries

Dictionaries store key: value pairs, so each value can be looked up by its key.



In [29]:

    
yearly_value = {2001: 10, 2002: 14, 2003: 18, 2004: 20}
print(yearly_value)









    



{2001: 10, 2002: 14, 2003: 18, 2004: 20}



In [30]:

    
yearly_value = {}

yearly_value[2001] = 10
yearly_value[2002] = 14
yearly_value[2003] = 18
yearly_value[2004] = 20



In [31]:

    
print(yearly_value)









    



{2001: 10, 2002: 14, 2003: 18, 2004: 20}



In [32]:

    
yearly_value.pop(2001)









    Out[32]:





10



In [33]:

    
yearly_value









    Out[33]:





{2002: 14, 2003: 18, 2004: 20}



In [34]:

    
yearly_value[2001] = 10213

You can iterate through dictionaries too:



In [35]:

    
for key, value in yearly_value.items():
    print(key, value)



In [36]:

    
for key, value in yearly_value.items():
    print(key, value * 1.05)









    



2001 10723.65
2002 14.700000000000001
2003 18.900000000000002
2004 21.0
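Three more dictionary idioms fit naturally here: testing whether a key exists, reading a key with a fallback, and walking the keys in sorted order. This is a minimal sketch using the same data as above; the variable names are illustrative.

```python
yearly_value = {2002: 14, 2003: 18, 2004: 20, 2001: 10213}

# Membership tests look at keys, not values.
has_2001 = 2001 in yearly_value
has_1999 = 1999 in yearly_value

# .get returns a fallback instead of raising KeyError for a missing key.
missing = yearly_value.get(1999, 0)

# sorted() on a dict yields its keys in ascending order.
years = sorted(yearly_value)
print(has_2001, has_1999, missing, years)
```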

Strings

We have seen strings a few times.

String literals can be defined with single or double quotation marks. Triple quotes allow multi-line strings.



In [37]:

    
name = "the hacker within"



In [38]:

    
name_long = """
~*~*~*~*~*~*~*~*~*~*~
THE HACKER WITHIN
~*~*~*~*~*~*~*~*~*~*~
"""



In [39]:

    
print(name)









    



the hacker within



In [40]:

    
print(name_long)









    



~*~*~*~*~*~*~*~*~*~*~
THE HACKER WITHIN
~*~*~*~*~*~*~*~*~*~*~

Strings have many built-in methods:



In [41]:

    
print(name.upper())
print(name.split())
print(name.upper().split())









    



THE HACKER WITHIN
['the', 'hacker', 'within']
['THE', 'HACKER', 'WITHIN']

Strings are sequences, like lists, so substrings can be accessed with slice notation: string[start:end]



In [42]:

    
print(name[4:10])
print(name[4:])
print(name[:4])









    



hacker
hacker within
the
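Slices are half-open: `name[start:end]` includes the character at `start` and stops just before `end`. A short sketch of this, plus negative indices, which count from the end of the string (the variable names are illustrative):

```python
name = "the hacker within"

middle = name[4:10]     # characters at positions 4 through 9
last_word = name[-6:]   # the final six characters
first_word = name[:3]   # everything before position 3

# The length of a slice is end minus start.
print(middle, last_word, first_word, len(middle))
```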



In [43]:

    
count = 0
for character in name:
    print(count, character)
    count = count + 1
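The manual counter above can also be written with the built-in `enumerate`, which yields an index together with each item. A minimal sketch that stores the pairs in a list (`pairs` is a name introduced here) rather than printing every one:

```python
name = "the hacker within"

# enumerate yields (index, item) pairs, so no manual counter is needed.
pairs = []
for index, character in enumerate(name):
    pairs.append((index, character))

print(pairs[:3])
```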



In [44]:

    
print(name.find('hacker'))
print(name[name.find('hacker'):])









    



4
hacker within

Functions



In [45]:

    
def square_num(num):
    return num * num



In [46]:

    
print(square_num(10))
print(square_num(9.1))
print(square_num(square_num(10)))









    



100
82.80999999999999
10000



In [47]:

    
def yearly_adjustment(yearly_dict, adjustment):
    for key, value in yearly_dict.items():
        print(key, value * adjustment)



In [48]:

    
yearly_adjustment(yearly_value, 1.05)









    



2001 10723.65
2002 14.700000000000001
2003 18.900000000000002
2004 21.0

We can expand on this a bit, adding some features:



In [49]:

    
def yearly_adjustment(yearly_dict, adjustment, print_values=False):
    adjusted_dict = {}
    # Loop over the dictionary passed in, not the global yearly_value
    for key, value in yearly_dict.items():
        if print_values:
            print(key, value * adjustment)
        adjusted_dict[key] = value * adjustment

    return adjusted_dict



In [50]:

    
adjusted_yearly = yearly_adjustment(yearly_value, 1.05)



In [51]:

    
adjusted_yearly = yearly_adjustment(yearly_value, 1.05, print_values = True)









    



2001 10723.65
2002 14.700000000000001
2003 18.900000000000002
2004 21.0



In [52]:

    
adjusted_yearly









    Out[52]:





{2001: 10723.65,
 2002: 14.700000000000001,
 2003: 18.900000000000002,
 2004: 21.0}
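When the function only needs to build a new dictionary, a dict comprehension can express the loop in a single line. This is a sketch of that alternative, not a replacement for the version above; `adjust_values` is a hypothetical name, and an integer multiplier is used so the result is easy to check by eye.

```python
def adjust_values(yearly_dict, adjustment):
    """Return a new dict with every value multiplied by adjustment."""
    return {year: value * adjustment for year, value in yearly_dict.items()}

sample = {2002: 14, 2003: 18, 2004: 20}
adjusted = adjust_values(sample, 2)

# The original dictionary is left unchanged.
print(sample)
print(adjusted)
```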

Example: word counts

If you begin a line in a notebook cell with !, the rest of the line is run as a shell command, as if you had typed it in the terminal. We will use this to download a list of previous THW topics with the curl program.



In [53]:

    
!curl -o thw.txt http://stuartgeiger.com/thw.txt

# and that's how it works, that's how you get to curl









    



  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1862  100  1862    0     0  10822      0 --:--:-- --:--:-- --:--:-- 11284



In [54]:

    
with open('thw.txt') as f:
    text = f.read()



In [55]:

    
text









    Out[55]:





'Navigating bash and UNIX\nMachine learning with Neural Networks using Keras.io\nGit and GitHub\nData Tidying in RR & Python\nEnsemble (Machine) Learning with Super Learner and H2O in RR \nRRStudio \nThanksgiving \nMachine learning with scikit-learn\nMatplotlib \nPhysical Computing \nThe Python Olympics \nParallelization in Python \nNatural Language Processing for Python with NLTK \nGit and Github \nGithub Pages and Jekyll \nMachine Learning for Kaggle Competitions with RR \nThe Bash Olympics \nWhat To Learn and Teach \nD3.js \nTableau \nBuild Systems \nCython \nPython For Plotting Timeseries & 3D Data \nmatplotlib \nHandling and Visualizing Geospatial Data \nPython Metaprogramming & Conversion to Python 3 \nJulia \nScraping Wikipedia Data \nPandas \nLaTeX \nHigh Performance Python \nscikit-learn\nscikit-learn\nAdvanced Python \nGPUs and Parallelization \nWebscraping \nFree-form hacking\nPandas \nSpark and Hadoop \nVisualization \nAdvanced Git and GitHub \nIntroductory Git and GitHub \nShiny \nMake \nC++ and Object Orientation \nMicrocontrollers \nJulia \nRR \nComputer Architectures \nTesting \nMatplotlib and Seaborn \nIPython \nAdvanced Git \nText Editors \nParallel Programming \nThe Shell and The Filesystem Hierarchy Standard \nWhat Do You Want To Learn and What Can You Teach \nNuclear Data and Advanced Cython \nORIGEN and Open Source\nCython and the Python C/API \nJekyll \nMocDown and Pyne Install \nMocDown and Python Threading \nNumpy Vectorization and Python Logging \nHPC Module Installation and Plotting Tools \nPARCS and RadWatch (without the physics) \nWhen and Where Survey\nSerpent and LaTeX \nCRAM and imagemagick \nComputational Nuclear Engineering Overview & Bash \nLaTeX \nSo You Have A Software\nPackaging and Distribution \nEmailing with Python \nRaspberry Pi Hacking \nTesting Part II \nTesting \nIPython \nMakefiles \nSelf Documenting Code \nIntro to Git Part II \nIntro to Git \nGPUs and CUDA \nBash  \n'



In [56]:

    
words = text.split()
lines = text.split("\n")



In [57]:

    
lines[0:5]









    Out[57]:





['Navigating bash and UNIX',
 'Machine learning with Neural Networks using Keras.io',
 'Git and GitHub',
 'Data Tidying in RR & Python',
 'Ensemble (Machine) Learning with Super Learner and H2O in RR ']

But there is an error! R always appears as "RR" -- so we will replace "RR" with "R"



In [58]:

    
text.replace("RR", "R")









    Out[58]:





'Navigating bash and UNIX\nMachine learning with Neural Networks using Keras.io\nGit and GitHub\nData Tidying in R & Python\nEnsemble (Machine) Learning with Super Learner and H2O in R \nRStudio \nThanksgiving \nMachine learning with scikit-learn\nMatplotlib \nPhysical Computing \nThe Python Olympics \nParallelization in Python \nNatural Language Processing for Python with NLTK \nGit and Github \nGithub Pages and Jekyll \nMachine Learning for Kaggle Competitions with R \nThe Bash Olympics \nWhat To Learn and Teach \nD3.js \nTableau \nBuild Systems \nCython \nPython For Plotting Timeseries & 3D Data \nmatplotlib \nHandling and Visualizing Geospatial Data \nPython Metaprogramming & Conversion to Python 3 \nJulia \nScraping Wikipedia Data \nPandas \nLaTeX \nHigh Performance Python \nscikit-learn\nscikit-learn\nAdvanced Python \nGPUs and Parallelization \nWebscraping \nFree-form hacking\nPandas \nSpark and Hadoop \nVisualization \nAdvanced Git and GitHub \nIntroductory Git and GitHub \nShiny \nMake \nC++ and Object Orientation \nMicrocontrollers \nJulia \nR \nComputer Architectures \nTesting \nMatplotlib and Seaborn \nIPython \nAdvanced Git \nText Editors \nParallel Programming \nThe Shell and The Filesystem Hierarchy Standard \nWhat Do You Want To Learn and What Can You Teach \nNuclear Data and Advanced Cython \nORIGEN and Open Source\nCython and the Python C/API \nJekyll \nMocDown and Pyne Install \nMocDown and Python Threading \nNumpy Vectorization and Python Logging \nHPC Module Installation and Plotting Tools \nPARCS and RadWatch (without the physics) \nWhen and Where Survey\nSerpent and LaTeX \nCRAM and imagemagick \nComputational Nuclear Engineering Overview & Bash \nLaTeX \nSo You Have A Software\nPackaging and Distribution \nEmailing with Python \nRaspberry Pi Hacking \nTesting Part II \nTesting \nIPython \nMakefiles \nSelf Documenting Code \nIntro to Git Part II \nIntro to Git \nGPUs and CUDA \nBash  \n'



In [59]:

    
text = text.replace("RR", "R")



In [60]:

    
words = text.split()
lines = text.split("\n")



In [61]:

    
lines[0:5]









    Out[61]:





['Navigating bash and UNIX',
 'Machine learning with Neural Networks using Keras.io',
 'Git and GitHub',
 'Data Tidying in R & Python',
 'Ensemble (Machine) Learning with Super Learner and H2O in R ']
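Many of the lines still end with a trailing space, and splitting on "\n" leaves an empty string after the final newline. A minimal sketch of one way to tidy this, using a short stand-in string (`sample_text`, an assumption made so the example runs without the downloaded file):

```python
# A short stand-in for the downloaded text, with the same quirks.
sample_text = "Git and GitHub\nRStudio \nBash  \n"

clean_lines = []
for line in sample_text.split("\n"):
    stripped = line.strip()   # remove leading and trailing whitespace
    if stripped:              # skip lines that are empty after stripping
        clean_lines.append(stripped)

print(clean_lines)
```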

Wordcloud library



In [62]:

    
!pip install wordcloud









    



Collecting wordcloud
Installing collected packages: wordcloud
Successfully installed wordcloud-1.2.1
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.



In [63]:

    
from wordcloud import WordCloud



In [64]:

    
wordcloud = WordCloud()
wordcloud.generate(text)
wordcloud.to_image()









    Out[64]:



In [65]:

    
wordcloud = WordCloud(width=800, height=300, prefer_horizontal=1, stopwords=None)
wordcloud.generate(text)
wordcloud.to_image()









    Out[65]:

Frequency counts



In [66]:

    
freq_dict = {}

for word in words:
    if word in freq_dict:
        freq_dict[word] = freq_dict[word] + 1
    else:
        freq_dict[word] = 1
    
print(freq_dict)









    



{'Cython': 3, 'Visualizing': 1, 'Navigating': 1, 'Bash': 3, 'Tableau': 1, 'RadWatch': 1, 'Neural': 1, 'Github': 2, 'C/API': 1, 'learning': 2, 'D3.js': 1, 'Editors': 1, 'Teach': 2, 'Competitions': 1, 'CUDA': 1, 'Language': 1, 'Webscraping': 1, 'When': 1, 'Orientation': 1, '3D': 1, 'Physical': 1, 'Vectorization': 1, 'Spark': 1, 'Machine': 3, 'Install': 1, 'Distribution': 1, 'UNIX': 1, 'Timeseries': 1, 'Geospatial': 1, 'So': 1, 'LaTeX': 3, 'Super': 1, 'Tidying': 1, 'A': 1, 'Overview': 1, 'R': 4, 'Documenting': 1, 'Computer': 1, 'Do': 1, 'Free-form': 1, 'Shiny': 1, 'Julia': 2, 'Hadoop': 1, 'NLTK': 1, 'Object': 1, 'C++': 1, 'matplotlib': 1, 'ORIGEN': 1, 'Emailing': 1, 'hacking': 1, 'Processing': 1, 'Want': 1, 'Handling': 1, 'Programming': 1, '3': 1, 'Data': 5, 'MocDown': 2, 'Natural': 1, 'GPUs': 2, 'Have': 1, 'Raspberry': 1, 'Learning': 2, 'Where': 1, 'Plotting': 2, '(without': 1, 'II': 2, 'Advanced': 4, 'The': 4, 'Build': 1, 'Tools': 1, 'Networks': 1, 'IPython': 2, 'Part': 2, 'in': 3, 'Metaprogramming': 1, 'Learner': 1, 'Code': 1, 'Filesystem': 1, 'Computational': 1, 'CRAM': 1, 'physics)': 1, 'and': 28, 'Threading': 1, 'Text': 1, 'Makefiles': 1, 'Pyne': 1, 'for': 2, 'Scraping': 1, 'Microcontrollers': 1, 'Engineering': 1, 'Hacking': 1, 'Make': 1, 'Standard': 1, 'For': 1, 'What': 3, 'Intro': 2, 'Self': 1, 'Pi': 1, 'Computing': 1, 'Git': 7, 'Shell': 1, 'Parallelization': 2, 'Architectures': 1, 'Performance': 1, '(Machine)': 1, 'imagemagick': 1, 'Numpy': 1, 'RStudio': 1, 'Olympics': 2, 'Kaggle': 1, 'scikit-learn': 3, 'Matplotlib': 2, 'Pandas': 2, 'Keras.io': 1, 'Python': 13, 'Jekyll': 2, 'Testing': 3, 'PARCS': 1, 'Hierarchy': 1, 'You': 3, 'Software': 1, 'with': 6, 'bash': 1, 'Learn': 2, 'High': 1, 'Survey': 1, 'Thanksgiving': 1, 'Ensemble': 1, '&': 4, 'Open': 1, 'GitHub': 3, 'Conversion': 1, 'Visualization': 1, 'Installation': 1, 'Source': 1, 'Serpent': 1, 'Can': 1, 'To': 2, 'Pages': 1, 'Wikipedia': 1, 'Introductory': 1, 'Parallel': 1, 'Systems': 1, 'Module': 1, 'Nuclear': 
2, 'H2O': 1, 'Seaborn': 1, 'the': 2, 'to': 3, 'HPC': 1, 'using': 1, 'Packaging': 1, 'Logging': 1}

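A more compact way to do this uses the dictionary's get method, which returns a default value (here 0) when the key is not yet in the dictionary: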



In [67]:

    
freq_dict = {}

for word in words:
    freq_dict[word] = freq_dict.get(word, 0) + 1
    
print(freq_dict)









    



{'Cython': 3, 'Visualizing': 1, 'Navigating': 1, 'Bash': 3, 'Tableau': 1, 'RadWatch': 1, 'Neural': 1, 'Github': 2, 'C/API': 1, 'learning': 2, 'D3.js': 1, 'Editors': 1, 'Teach': 2, 'Competitions': 1, 'CUDA': 1, 'Language': 1, 'Webscraping': 1, 'When': 1, 'Orientation': 1, '3D': 1, 'Physical': 1, 'Vectorization': 1, 'Spark': 1, 'Machine': 3, 'Install': 1, 'Distribution': 1, 'UNIX': 1, 'Timeseries': 1, 'Geospatial': 1, 'So': 1, 'LaTeX': 3, 'Super': 1, 'Tidying': 1, 'A': 1, 'Overview': 1, 'R': 4, 'Documenting': 1, 'Computer': 1, 'Do': 1, 'Free-form': 1, 'Shiny': 1, 'Julia': 2, 'Hadoop': 1, 'NLTK': 1, 'Object': 1, 'C++': 1, 'matplotlib': 1, 'ORIGEN': 1, 'Emailing': 1, 'hacking': 1, 'Processing': 1, 'Want': 1, 'Handling': 1, 'Programming': 1, '3': 1, 'Data': 5, 'MocDown': 2, 'Natural': 1, 'GPUs': 2, 'Have': 1, 'Raspberry': 1, 'Learning': 2, 'Where': 1, 'Plotting': 2, '(without': 1, 'II': 2, 'Advanced': 4, 'The': 4, 'Build': 1, 'Tools': 1, 'Networks': 1, 'IPython': 2, 'Part': 2, 'in': 3, 'Metaprogramming': 1, 'Learner': 1, 'Code': 1, 'Filesystem': 1, 'Computational': 1, 'CRAM': 1, 'physics)': 1, 'and': 28, 'Threading': 1, 'Text': 1, 'Makefiles': 1, 'Pyne': 1, 'for': 2, 'Scraping': 1, 'Microcontrollers': 1, 'Engineering': 1, 'Hacking': 1, 'Make': 1, 'Standard': 1, 'For': 1, 'What': 3, 'Intro': 2, 'Self': 1, 'Pi': 1, 'Computing': 1, 'Git': 7, 'Shell': 1, 'Parallelization': 2, 'Architectures': 1, 'Performance': 1, '(Machine)': 1, 'imagemagick': 1, 'Numpy': 1, 'RStudio': 1, 'Olympics': 2, 'Kaggle': 1, 'scikit-learn': 3, 'Matplotlib': 2, 'Pandas': 2, 'Keras.io': 1, 'Python': 13, 'Jekyll': 2, 'Testing': 3, 'PARCS': 1, 'Hierarchy': 1, 'You': 3, 'Software': 1, 'with': 6, 'bash': 1, 'Learn': 2, 'High': 1, 'Survey': 1, 'Thanksgiving': 1, 'Ensemble': 1, '&': 4, 'Open': 1, 'GitHub': 3, 'Conversion': 1, 'Visualization': 1, 'Installation': 1, 'Source': 1, 'Serpent': 1, 'Can': 1, 'To': 2, 'Pages': 1, 'Wikipedia': 1, 'Introductory': 1, 'Parallel': 1, 'Systems': 1, 'Module': 1, 'Nuclear': 
2, 'H2O': 1, 'Seaborn': 1, 'the': 2, 'to': 3, 'HPC': 1, 'using': 1, 'Packaging': 1, 'Logging': 1}
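The standard library also provides `collections.Counter`, a dictionary subclass built for exactly this kind of tally. A brief sketch on a small stand-in word list (`sample_words` is an assumption, used so the result is easy to verify):

```python
from collections import Counter

sample_words = "Git and GitHub and Python and Git".split()

# Counter tallies how many times each item appears.
word_counts = Counter(sample_words)

# most_common(n) returns the n highest counts as (item, count) pairs.
top_two = word_counts.most_common(2)
print(top_two)
```

With the real data, `Counter(words)` would likely give the same counts as `freq_dict` above.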

Outputting to files

Let's start from a loop that prints the values to the screen:



In [68]:

    
for word, freq in sorted(freq_dict.items()):
    line = word + "\t" + str(freq)
    print(line)









    



&	4
(Machine)	1
(without	1
3	1
3D	1
A	1
Advanced	4
Architectures	1
Bash	3
Build	1
C++	1
C/API	1
CRAM	1
CUDA	1
Can	1
Code	1
Competitions	1
Computational	1
Computer	1
Computing	1
Conversion	1
Cython	3
D3.js	1
Data	5
Distribution	1
Do	1
Documenting	1
Editors	1
Emailing	1
Engineering	1
Ensemble	1
Filesystem	1
For	1
Free-form	1
GPUs	2
Geospatial	1
Git	7
GitHub	3
Github	2
H2O	1
HPC	1
Hacking	1
Hadoop	1
Handling	1
Have	1
Hierarchy	1
High	1
II	2
IPython	2
Install	1
Installation	1
Intro	2
Introductory	1
Jekyll	2
Julia	2
Kaggle	1
Keras.io	1
LaTeX	3
Language	1
Learn	2
Learner	1
Learning	2
Logging	1
Machine	3
Make	1
Makefiles	1
Matplotlib	2
Metaprogramming	1
Microcontrollers	1
MocDown	2
Module	1
NLTK	1
Natural	1
Navigating	1
Networks	1
Neural	1
Nuclear	2
Numpy	1
ORIGEN	1
Object	1
Olympics	2
Open	1
Orientation	1
Overview	1
PARCS	1
Packaging	1
Pages	1
Pandas	2
Parallel	1
Parallelization	2
Part	2
Performance	1
Physical	1
Pi	1
Plotting	2
Processing	1
Programming	1
Pyne	1
Python	13
R	4
RStudio	1
RadWatch	1
Raspberry	1
Scraping	1
Seaborn	1
Self	1
Serpent	1
Shell	1
Shiny	1
So	1
Software	1
Source	1
Spark	1
Standard	1
Super	1
Survey	1
Systems	1
Tableau	1
Teach	2
Testing	3
Text	1
Thanksgiving	1
The	4
Threading	1
Tidying	1
Timeseries	1
To	2
Tools	1
UNIX	1
Vectorization	1
Visualization	1
Visualizing	1
Want	1
Webscraping	1
What	3
When	1
Where	1
Wikipedia	1
You	3
and	28
bash	1
for	2
hacking	1
imagemagick	1
in	3
learning	2
matplotlib	1
physics)	1
scikit-learn	3
the	2
to	3
using	1
with	6

Then expand this to write each line to a file object instead:



In [69]:

    
with open("freq_dict_thw.csv", 'w') as f: 
    for word, freq in sorted(freq_dict.items()):
        line = word + ", " + str(freq) + "\n"
        f.write(line)



In [70]:

    
!head -10 freq_dict_thw.csv









    



&, 4
(Machine), 1
(without, 1
3, 1
3D, 1
A, 1
Advanced, 4
Architectures, 1
Bash, 3
Build, 1
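Building each line by hand works for simple words, but it would break if a word ever contained a comma. The standard library's `csv` module handles quoting for you. Here is a hedged sketch that writes to an in-memory buffer rather than a real file and sorts by frequency; `freq_sample` and the header row are assumptions made for illustration.

```python
import csv
import io

freq_sample = {"Git": 7, "Python": 13, "and": 28}

# io.StringIO stands in for a real file so the sketch has no side effects.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["word", "count"])

# Sort by count, largest first, instead of alphabetically.
for word, freq in sorted(freq_sample.items(), key=lambda item: item[1], reverse=True):
    writer.writerow([word, freq])

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the same writer can wrap a file opened with `open("freq_dict_thw.csv", "w", newline="")`.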


