Intro to Python!

Stuart Geiger and Yu Feng for The Hacker Within

Contents

1. Installing Python

2. The Language

  • Expressions

  • List, Tuple and Dictionary

  • Strings

  • Functions

3. Example: Word Frequency Analysis with Python

  • Reading text files

  • Getting and using Python packages : wordcloud

  • Histograms

  • Exporting data as text files

1. Installing Python:

  • Easy way : with a Python distribution, Anaconda: https://www.continuum.io/downloads

  • Hard way : install python and all dependencies yourself

  • Super hard way : compile everything from scratch

Three Python user interfaces

Python shell (started with the python command)

```
    [yfeng1@waterfall ~]$ python
    Python 2.7.12 (default, Sep 29 2016, 13:30:34) 
    [GCC 6.2.1 20160916 (Red Hat 6.2.1-2)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 
```

Jupyter Notebook (in a browser, like this)

IDEs: PyCharm, Spyder, etc.

We use Jupyter Notebook here.

Jupyter Notebook is included in the Anaconda distribution.

Expressions


In [1]:
2 + 3


Out[1]:
5

In [2]:
2 / 3


Out[2]:
0.6666666666666666

In [3]:
2 * 3


Out[3]:
6

In [4]:
2 ** 3


Out[4]:
8

Variables


In [5]:
num = 2 ** 3

In [6]:
print(num)


8

In [7]:
num


Out[7]:
8

In [8]:
type(num)


Out[8]:
int

In [9]:
name = "The Hacker Within"

In [10]:
type(name)


Out[10]:
str

In [11]:
name + 8


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-63a7ef3a1717> in <module>()
----> 1 name + 8

TypeError: Can't convert 'int' object to str implicitly

In [12]:
name + str(8)


Out[12]:
'The Hacker Within8'
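Converting with str() and concatenating works, but string formatting is often easier to read when text and numbers are mixed. A minimal sketch, reusing the name value from above; the variables label and label_f are only illustrative, and the f-string form assumes Python 3.6 or newer:

```python
name = "The Hacker Within"

# str.format() fills each {} placeholder and converts non-strings for us
label = "{}{}".format(name, 8)
print(label)

# f-strings (Python 3.6+) embed the expression directly in the literal
label_f = f"{name}{8}"
print(label_f)
```

Both forms build the same string as name + str(8).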

Lists


In [13]:
num_list = [0,1,2,3,4,5,6,7,8]

In [14]:
print(num_list)


[0, 1, 2, 3, 4, 5, 6, 7, 8]

In [15]:
type(num_list)


Out[15]:
list

In [16]:
num_list[3]


Out[16]:
3

In [17]:
num_list[3] = 10

In [18]:
print(num_list)


[0, 1, 2, 10, 4, 5, 6, 7, 8]

Appending new items to a list


In [19]:
num_list.append(3)

In [20]:
print(num_list)


[0, 1, 2, 10, 4, 5, 6, 7, 8, 3]

Loops and iteration


In [21]:
for num in num_list:
    print(num)


0
1
2
10
4
5
6
7
8
3

In [22]:
for num in num_list:
    print(num, num * num)


0 0
1 1
2 4
10 100
4 16
5 25
6 36
7 49
8 64
3 9

In [23]:
num_list.append("LOL")

In [24]:
print(num_list)


[0, 1, 2, 10, 4, 5, 6, 7, 8, 3, 'LOL']

If / else conditionals


In [25]:
for num in num_list:
    if type(num) is int or type(num) is float:
        print(num, num * num)
    else:
        print("ERROR!", num, "is not an int")


0 0
1 1
2 4
10 100
4 16
5 25
6 36
7 49
8 64
3 9
ERROR! LOL is not an int

Functions


In [26]:
def process_list(input_list):
    for num in input_list:
        if type(num) is int or type(num) is float:
            print(num, num * num)
        else:
            print("ERROR!", num, "is not an int")

In [27]:
process_list(num_list)


0 0
1 1
2 4
10 100
4 16
5 25
6 36
7 49
8 64
3 9
ERROR! LOL is not an int

In [28]:
process_list([1,3,4,14,1,9])


1 1
3 9
4 16
14 196
1 1
9 81
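The check type(num) is int or type(num) is float can also be written with the built-in isinstance, which accepts a tuple of types. A sketch, where is_number and squares_of_numbers are hypothetical helper names; because isinstance(True, int) is true in Python, booleans are excluded explicitly here:

```python
def is_number(value):
    """Return True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def squares_of_numbers(input_list):
    """Collect (num, num * num) pairs, skipping anything non-numeric."""
    result = []
    for num in input_list:
        if is_number(num):
            result.append((num, num * num))
    return result

print(squares_of_numbers([0, 1, 2.5, "LOL"]))
```

Returning a list instead of printing inside the function makes the result reusable in later cells.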

Dictionaries

Dictionaries store key : value pairs.


In [29]:
yearly_value = {2001: 10, 2002: 14, 2003: 18, 2004: 20}
print(yearly_value)


{2001: 10, 2002: 14, 2003: 18, 2004: 20}

In [30]:
yearly_value = {}

yearly_value[2001] = 10
yearly_value[2002] = 14
yearly_value[2003] = 18
yearly_value[2004] = 20

In [31]:
print(yearly_value)


{2001: 10, 2002: 14, 2003: 18, 2004: 20}

In [32]:
yearly_value.pop(2001)


Out[32]:
10

In [33]:
yearly_value


Out[33]:
{2002: 14, 2003: 18, 2004: 20}

In [34]:
yearly_value[2001] = 10213

You can iterate through dictionaries too:


In [35]:
for key, value in yearly_value.items():
    print(key, value)


2001 10213
2002 14
2003 18
2004 20
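The output above lists 2001 first even though that key was added last. The iteration order of a dictionary depends on the Python version: in Python 3.7 and later, dictionaries preserve insertion order, so 2001 would be printed last there. When a particular order matters, one option is to sort explicitly. A sketch using the same values (ordered_keys is an illustrative name):

```python
yearly_value = {2002: 14, 2003: 18, 2004: 20}
yearly_value[2001] = 10213  # re-added last, as in the cells above

# sorted() on items() orders the (key, value) pairs by key
for key, value in sorted(yearly_value.items()):
    print(key, value)

# The same idea, collecting only the keys
ordered_keys = sorted(yearly_value)
```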

In [36]:
for key, value in yearly_value.items():
    print(key, value * 1.05)


2001 10723.65
2002 14.700000000000001
2003 18.900000000000002
2004 21.0
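Values such as 14.700000000000001 appear because most decimal fractions cannot be represented exactly as binary floating-point numbers. The stored values are fine for arithmetic; for display, rounding or a format specification is one way to tidy them up. A sketch, with adjusted and shown as illustrative names:

```python
value = 14
adjusted = value * 1.05

# round() returns a float rounded to the given number of decimal places
rounded = round(adjusted, 2)
print(rounded)

# format() with ".2f" returns a string with exactly two decimals
shown = format(adjusted, ".2f")
print(shown)
```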

Strings

We have seen strings a few times.

String literals can be defined with single or double quotation marks. Triple quotes allow multi-line strings.


In [37]:
name = "the hacker within"

In [38]:
name_long = """
~*~*~*~*~*~*~*~*~*~*~
THE HACKER WITHIN
~*~*~*~*~*~*~*~*~*~*~
"""

In [39]:
print(name)


the hacker within

In [40]:
print(name_long)


~*~*~*~*~*~*~*~*~*~*~
THE HACKER WITHIN
~*~*~*~*~*~*~*~*~*~*~

Strings have many built-in methods:


In [41]:
print(name.upper())
print(name.split())
print(name.upper().split())


THE HACKER WITHIN
['the', 'hacker', 'within']
['THE', 'HACKER', 'WITHIN']

Strings are sequences, like lists, and substrings can be accessed with string[start:end]


In [42]:
print(name[4:10])
print(name[4:])
print(name[:4])


hacker
hacker within
the 
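Slicing works the same way on strings and lists, but a string cannot be changed in place: item assignment raises a TypeError. A sketch of the difference (letters and capitalized are illustrative names):

```python
name = "the hacker within"
letters = list(name)      # a real list of single-character strings

letters[0] = "T"          # lists can be modified in place
print("".join(letters))

try:
    name[0] = "T"         # strings cannot
except TypeError as error:
    print("TypeError:", error)

# Build a new string instead of modifying the old one
capitalized = "T" + name[1:]
print(capitalized)
```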

In [43]:
count = 0
for character in name:
    print(count, character)
    count = count + 1


0 t
1 h
2 e
3  
4 h
5 a
6 c
7 k
8 e
9 r
10  
11 w
12 i
13 t
14 h
15 i
16 n
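The manual counter in the loop above can be replaced by the built-in enumerate, which pairs each item with its index. A sketch that should print the same lines as the cell above (pairs is an illustrative name):

```python
name = "the hacker within"

# enumerate() yields (index, item) pairs, starting from 0 by default
for count, character in enumerate(name):
    print(count, character)

# The pairs can also be collected into a list
pairs = list(enumerate(name))
```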

In [44]:
print(name.find('hacker'))
print(name[name.find('hacker'):])


4
hacker within

Functions


In [45]:
def square_num(num):
    return num * num

In [46]:
print(square_num(10))
print(square_num(9.1))
print(square_num(square_num(10)))


100
82.80999999999999
10000

In [47]:
def yearly_adjustment(yearly_dict, adjustment):
    for key, value in yearly_dict.items():
        print(key, value * adjustment)

In [48]:
yearly_adjustment(yearly_value, 1.05)


2001 10723.65
2002 14.700000000000001
2003 18.900000000000002
2004 21.0

We can expand on this a bit, adding some features:


In [49]:
def yearly_adjustment(yearly_dict, adjustment, print_values = False):
    adjusted_dict = {}
    # Iterate over the dictionary passed in, not the global yearly_value
    for key, value in yearly_dict.items():
        if print_values:
            print(key, value * adjustment)
        adjusted_dict[key] = value * adjustment

    return adjusted_dict

In [50]:
adjusted_yearly = yearly_adjustment(yearly_value, 1.05)

In [51]:
adjusted_yearly = yearly_adjustment(yearly_value, 1.05, print_values = True)


2001 10723.65
2002 14.700000000000001
2003 18.900000000000002
2004 21.0

In [52]:
adjusted_yearly


Out[52]:
{2001: 10723.65,
 2002: 14.700000000000001,
 2003: 18.900000000000002,
 2004: 21.0}
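The loop that fills adjusted_dict can also be written as a dictionary comprehension. A sketch, where adjust_values is a hypothetical name and the factor 2 is chosen only to keep the numbers whole:

```python
def adjust_values(yearly_dict, adjustment):
    """Return a new dict with every value multiplied by adjustment."""
    return {key: value * adjustment for key, value in yearly_dict.items()}

yearly_value = {2001: 10213, 2002: 14, 2003: 18, 2004: 20}
adjusted = adjust_values(yearly_value, 2)
print(adjusted)
print(yearly_value)  # the original dictionary is left unchanged
```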

Example: word counts

If you begin a line in a notebook cell with !, the rest of the line is run as a shell command, as if you had typed it in the terminal. We will use this to download a list of previous THW topics with the curl program.


In [53]:
!curl -o thw.txt http://stuartgeiger.com/thw.txt

# and that's how it works, that's how you get to curl


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1862  100  1862    0     0  10822      0 --:--:-- --:--:-- --:--:-- 11284

In [54]:
with open('thw.txt') as f:
    text = f.read()

In [55]:
text


Out[55]:
'Navigating bash and UNIX\nMachine learning with Neural Networks using Keras.io\nGit and GitHub\nData Tidying in RR & Python\nEnsemble (Machine) Learning with Super Learner and H2O in RR \nRRStudio \nThanksgiving \nMachine learning with scikit-learn\nMatplotlib \nPhysical Computing \nThe Python Olympics \nParallelization in Python \nNatural Language Processing for Python with NLTK \nGit and Github \nGithub Pages and Jekyll \nMachine Learning for Kaggle Competitions with RR \nThe Bash Olympics \nWhat To Learn and Teach \nD3.js \nTableau \nBuild Systems \nCython \nPython For Plotting Timeseries & 3D Data \nmatplotlib \nHandling and Visualizing Geospatial Data \nPython Metaprogramming & Conversion to Python 3 \nJulia \nScraping Wikipedia Data \nPandas \nLaTeX \nHigh Performance Python \nscikit-learn\nscikit-learn\nAdvanced Python \nGPUs and Parallelization \nWebscraping \nFree-form hacking\nPandas \nSpark and Hadoop \nVisualization \nAdvanced Git and GitHub \nIntroductory Git and GitHub \nShiny \nMake \nC++ and Object Orientation \nMicrocontrollers \nJulia \nRR \nComputer Architectures \nTesting \nMatplotlib and Seaborn \nIPython \nAdvanced Git \nText Editors \nParallel Programming \nThe Shell and The Filesystem Hierarchy Standard \nWhat Do You Want To Learn and What Can You Teach \nNuclear Data and Advanced Cython \nORIGEN and Open Source\nCython and the Python C/API \nJekyll \nMocDown and Pyne Install \nMocDown and Python Threading \nNumpy Vectorization and Python Logging \nHPC Module Installation and Plotting Tools \nPARCS and RadWatch (without the physics) \nWhen and Where Survey\nSerpent and LaTeX \nCRAM and imagemagick \nComputational Nuclear Engineering Overview & Bash \nLaTeX \nSo You Have A Software\nPackaging and Distribution \nEmailing with Python \nRaspberry Pi Hacking \nTesting Part II \nTesting \nIPython \nMakefiles \nSelf Documenting Code \nIntro to Git Part II \nIntro to Git \nGPUs and CUDA \nBash  \n'

In [56]:
words = text.split()
lines = text.split("\n")

In [57]:
lines[0:5]


Out[57]:
['Navigating bash and UNIX',
 'Machine learning with Neural Networks using Keras.io',
 'Git and GitHub',
 'Data Tidying in RR & Python',
 'Ensemble (Machine) Learning with Super Learner and H2O in RR ']

But there is an error in the data! R always appears as "RR", so we will replace every "RR" with "R".


In [58]:
text.replace("RR", "R")


Out[58]:
'Navigating bash and UNIX\nMachine learning with Neural Networks using Keras.io\nGit and GitHub\nData Tidying in R & Python\nEnsemble (Machine) Learning with Super Learner and H2O in R \nRStudio \nThanksgiving \nMachine learning with scikit-learn\nMatplotlib \nPhysical Computing \nThe Python Olympics \nParallelization in Python \nNatural Language Processing for Python with NLTK \nGit and Github \nGithub Pages and Jekyll \nMachine Learning for Kaggle Competitions with R \nThe Bash Olympics \nWhat To Learn and Teach \nD3.js \nTableau \nBuild Systems \nCython \nPython For Plotting Timeseries & 3D Data \nmatplotlib \nHandling and Visualizing Geospatial Data \nPython Metaprogramming & Conversion to Python 3 \nJulia \nScraping Wikipedia Data \nPandas \nLaTeX \nHigh Performance Python \nscikit-learn\nscikit-learn\nAdvanced Python \nGPUs and Parallelization \nWebscraping \nFree-form hacking\nPandas \nSpark and Hadoop \nVisualization \nAdvanced Git and GitHub \nIntroductory Git and GitHub \nShiny \nMake \nC++ and Object Orientation \nMicrocontrollers \nJulia \nR \nComputer Architectures \nTesting \nMatplotlib and Seaborn \nIPython \nAdvanced Git \nText Editors \nParallel Programming \nThe Shell and The Filesystem Hierarchy Standard \nWhat Do You Want To Learn and What Can You Teach \nNuclear Data and Advanced Cython \nORIGEN and Open Source\nCython and the Python C/API \nJekyll \nMocDown and Pyne Install \nMocDown and Python Threading \nNumpy Vectorization and Python Logging \nHPC Module Installation and Plotting Tools \nPARCS and RadWatch (without the physics) \nWhen and Where Survey\nSerpent and LaTeX \nCRAM and imagemagick \nComputational Nuclear Engineering Overview & Bash \nLaTeX \nSo You Have A Software\nPackaging and Distribution \nEmailing with Python \nRaspberry Pi Hacking \nTesting Part II \nTesting \nIPython \nMakefiles \nSelf Documenting Code \nIntro to Git Part II \nIntro to Git \nGPUs and CUDA \nBash  \n'

In [59]:
text = text.replace("RR", "R")
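The reassignment is needed because strings are immutable: replace returns a new string and leaves the original untouched. A sketch using one title from the file (title and fixed are illustrative names):

```python
title = "Data Tidying in RR & Python"

title.replace("RR", "R")        # the result is discarded
print(title)                    # still contains "RR"

fixed = title.replace("RR", "R")
print(fixed)
```

Note that replace substitutes every occurrence, including those inside longer words, which is why "RRStudio" becomes "RStudio" as well.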

In [60]:
words = text.split()
lines = text.split("\n")

In [61]:
lines[0:5]


Out[61]:
['Navigating bash and UNIX',
 'Machine learning with Neural Networks using Keras.io',
 'Git and GitHub',
 'Data Tidying in R & Python',
 'Ensemble (Machine) Learning with Super Learner and H2O in R ']
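Because the text ends with a newline, text.split("\n") produces an empty string as its last element, and the trailing spaces on each line are kept. A sketch of one way to get clean lines, using a three-line excerpt in the same shape as thw.txt (sample and clean_lines are illustrative names):

```python
sample = "Git and GitHub\nRStudio \nBash  \n"

raw_lines = sample.split("\n")
print(raw_lines)  # note the trailing spaces and the final empty string

# splitlines() drops the final empty entry; strip() removes stray spaces
clean_lines = [line.strip() for line in sample.splitlines() if line.strip()]
print(clean_lines)
```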

Wordcloud library


In [62]:
!pip install wordcloud


Collecting wordcloud
Installing collected packages: wordcloud
Successfully installed wordcloud-1.2.1
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

In [63]:
from wordcloud import WordCloud

In [64]:
wordcloud = WordCloud()
wordcloud.generate(text)
wordcloud.to_image()


Out[64]:

In [65]:
wordcloud = WordCloud(width=800, height=300, prefer_horizontal=1, stopwords=None)
wordcloud.generate(text)
wordcloud.to_image()


Out[65]:

Frequency counts


In [66]:
freq_dict = {}

for word in words:
    if word in freq_dict:
        freq_dict[word] = freq_dict[word] + 1
    else:
        freq_dict[word] = 1
    
print(freq_dict)


{'Cython': 3, 'Visualizing': 1, 'Navigating': 1, 'Bash': 3, 'Tableau': 1, 'RadWatch': 1, 'Neural': 1, 'Github': 2, 'C/API': 1, 'learning': 2, 'D3.js': 1, 'Editors': 1, 'Teach': 2, 'Competitions': 1, 'CUDA': 1, 'Language': 1, 'Webscraping': 1, 'When': 1, 'Orientation': 1, '3D': 1, 'Physical': 1, 'Vectorization': 1, 'Spark': 1, 'Machine': 3, 'Install': 1, 'Distribution': 1, 'UNIX': 1, 'Timeseries': 1, 'Geospatial': 1, 'So': 1, 'LaTeX': 3, 'Super': 1, 'Tidying': 1, 'A': 1, 'Overview': 1, 'R': 4, 'Documenting': 1, 'Computer': 1, 'Do': 1, 'Free-form': 1, 'Shiny': 1, 'Julia': 2, 'Hadoop': 1, 'NLTK': 1, 'Object': 1, 'C++': 1, 'matplotlib': 1, 'ORIGEN': 1, 'Emailing': 1, 'hacking': 1, 'Processing': 1, 'Want': 1, 'Handling': 1, 'Programming': 1, '3': 1, 'Data': 5, 'MocDown': 2, 'Natural': 1, 'GPUs': 2, 'Have': 1, 'Raspberry': 1, 'Learning': 2, 'Where': 1, 'Plotting': 2, '(without': 1, 'II': 2, 'Advanced': 4, 'The': 4, 'Build': 1, 'Tools': 1, 'Networks': 1, 'IPython': 2, 'Part': 2, 'in': 3, 'Metaprogramming': 1, 'Learner': 1, 'Code': 1, 'Filesystem': 1, 'Computational': 1, 'CRAM': 1, 'physics)': 1, 'and': 28, 'Threading': 1, 'Text': 1, 'Makefiles': 1, 'Pyne': 1, 'for': 2, 'Scraping': 1, 'Microcontrollers': 1, 'Engineering': 1, 'Hacking': 1, 'Make': 1, 'Standard': 1, 'For': 1, 'What': 3, 'Intro': 2, 'Self': 1, 'Pi': 1, 'Computing': 1, 'Git': 7, 'Shell': 1, 'Parallelization': 2, 'Architectures': 1, 'Performance': 1, '(Machine)': 1, 'imagemagick': 1, 'Numpy': 1, 'RStudio': 1, 'Olympics': 2, 'Kaggle': 1, 'scikit-learn': 3, 'Matplotlib': 2, 'Pandas': 2, 'Keras.io': 1, 'Python': 13, 'Jekyll': 2, 'Testing': 3, 'PARCS': 1, 'Hierarchy': 1, 'You': 3, 'Software': 1, 'with': 6, 'bash': 1, 'Learn': 2, 'High': 1, 'Survey': 1, 'Thanksgiving': 1, 'Ensemble': 1, '&': 4, 'Open': 1, 'GitHub': 3, 'Conversion': 1, 'Visualization': 1, 'Installation': 1, 'Source': 1, 'Serpent': 1, 'Can': 1, 'To': 2, 'Pages': 1, 'Wikipedia': 1, 'Introductory': 1, 'Parallel': 1, 'Systems': 1, 'Module': 1, 'Nuclear': 
2, 'H2O': 1, 'Seaborn': 1, 'the': 2, 'to': 3, 'HPC': 1, 'using': 1, 'Packaging': 1, 'Logging': 1}

A more concise way uses dict.get, which returns a default value (here 0) when the key is missing:


In [67]:
freq_dict = {}

for word in words:
    freq_dict[word] = freq_dict.get(word, 0) + 1
    
print(freq_dict)


{'Cython': 3, 'Visualizing': 1, 'Navigating': 1, 'Bash': 3, 'Tableau': 1, 'RadWatch': 1, 'Neural': 1, 'Github': 2, 'C/API': 1, 'learning': 2, 'D3.js': 1, 'Editors': 1, 'Teach': 2, 'Competitions': 1, 'CUDA': 1, 'Language': 1, 'Webscraping': 1, 'When': 1, 'Orientation': 1, '3D': 1, 'Physical': 1, 'Vectorization': 1, 'Spark': 1, 'Machine': 3, 'Install': 1, 'Distribution': 1, 'UNIX': 1, 'Timeseries': 1, 'Geospatial': 1, 'So': 1, 'LaTeX': 3, 'Super': 1, 'Tidying': 1, 'A': 1, 'Overview': 1, 'R': 4, 'Documenting': 1, 'Computer': 1, 'Do': 1, 'Free-form': 1, 'Shiny': 1, 'Julia': 2, 'Hadoop': 1, 'NLTK': 1, 'Object': 1, 'C++': 1, 'matplotlib': 1, 'ORIGEN': 1, 'Emailing': 1, 'hacking': 1, 'Processing': 1, 'Want': 1, 'Handling': 1, 'Programming': 1, '3': 1, 'Data': 5, 'MocDown': 2, 'Natural': 1, 'GPUs': 2, 'Have': 1, 'Raspberry': 1, 'Learning': 2, 'Where': 1, 'Plotting': 2, '(without': 1, 'II': 2, 'Advanced': 4, 'The': 4, 'Build': 1, 'Tools': 1, 'Networks': 1, 'IPython': 2, 'Part': 2, 'in': 3, 'Metaprogramming': 1, 'Learner': 1, 'Code': 1, 'Filesystem': 1, 'Computational': 1, 'CRAM': 1, 'physics)': 1, 'and': 28, 'Threading': 1, 'Text': 1, 'Makefiles': 1, 'Pyne': 1, 'for': 2, 'Scraping': 1, 'Microcontrollers': 1, 'Engineering': 1, 'Hacking': 1, 'Make': 1, 'Standard': 1, 'For': 1, 'What': 3, 'Intro': 2, 'Self': 1, 'Pi': 1, 'Computing': 1, 'Git': 7, 'Shell': 1, 'Parallelization': 2, 'Architectures': 1, 'Performance': 1, '(Machine)': 1, 'imagemagick': 1, 'Numpy': 1, 'RStudio': 1, 'Olympics': 2, 'Kaggle': 1, 'scikit-learn': 3, 'Matplotlib': 2, 'Pandas': 2, 'Keras.io': 1, 'Python': 13, 'Jekyll': 2, 'Testing': 3, 'PARCS': 1, 'Hierarchy': 1, 'You': 3, 'Software': 1, 'with': 6, 'bash': 1, 'Learn': 2, 'High': 1, 'Survey': 1, 'Thanksgiving': 1, 'Ensemble': 1, '&': 4, 'Open': 1, 'GitHub': 3, 'Conversion': 1, 'Visualization': 1, 'Installation': 1, 'Source': 1, 'Serpent': 1, 'Can': 1, 'To': 2, 'Pages': 1, 'Wikipedia': 1, 'Introductory': 1, 'Parallel': 1, 'Systems': 1, 'Module': 1, 'Nuclear': 
2, 'H2O': 1, 'Seaborn': 1, 'the': 2, 'to': 3, 'HPC': 1, 'using': 1, 'Packaging': 1, 'Logging': 1}
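The standard library also offers collections.Counter, a dictionary subclass built for this kind of counting. A sketch on a short made-up word list (not the full file); lowercasing is an optional extra step, shown here because the data contains both "GitHub" and "Github":

```python
from collections import Counter

words = "Git and GitHub Git and Github Git".split()

freq = Counter(words)                 # counts each distinct word
print(freq["Git"], freq["GitHub"], freq["Github"])

# Lowercasing first merges spellings that differ only in case
folded = Counter(word.lower() for word in words)
print(folded.most_common(1))          # the single most frequent word
```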

Outputting to files

Let's start from a loop that prints the values to the screen


In [68]:
for word, freq in sorted(freq_dict.items()):
    line = word + "\t" + str(freq)
    print(line)


&	4
(Machine)	1
(without	1
3	1
3D	1
A	1
Advanced	4
Architectures	1
Bash	3
Build	1
C++	1
C/API	1
CRAM	1
CUDA	1
Can	1
Code	1
Competitions	1
Computational	1
Computer	1
Computing	1
Conversion	1
Cython	3
D3.js	1
Data	5
Distribution	1
Do	1
Documenting	1
Editors	1
Emailing	1
Engineering	1
Ensemble	1
Filesystem	1
For	1
Free-form	1
GPUs	2
Geospatial	1
Git	7
GitHub	3
Github	2
H2O	1
HPC	1
Hacking	1
Hadoop	1
Handling	1
Have	1
Hierarchy	1
High	1
II	2
IPython	2
Install	1
Installation	1
Intro	2
Introductory	1
Jekyll	2
Julia	2
Kaggle	1
Keras.io	1
LaTeX	3
Language	1
Learn	2
Learner	1
Learning	2
Logging	1
Machine	3
Make	1
Makefiles	1
Matplotlib	2
Metaprogramming	1
Microcontrollers	1
MocDown	2
Module	1
NLTK	1
Natural	1
Navigating	1
Networks	1
Neural	1
Nuclear	2
Numpy	1
ORIGEN	1
Object	1
Olympics	2
Open	1
Orientation	1
Overview	1
PARCS	1
Packaging	1
Pages	1
Pandas	2
Parallel	1
Parallelization	2
Part	2
Performance	1
Physical	1
Pi	1
Plotting	2
Processing	1
Programming	1
Pyne	1
Python	13
R	4
RStudio	1
RadWatch	1
Raspberry	1
Scraping	1
Seaborn	1
Self	1
Serpent	1
Shell	1
Shiny	1
So	1
Software	1
Source	1
Spark	1
Standard	1
Super	1
Survey	1
Systems	1
Tableau	1
Teach	2
Testing	3
Text	1
Thanksgiving	1
The	4
Threading	1
Tidying	1
Timeseries	1
To	2
Tools	1
UNIX	1
Vectorization	1
Visualization	1
Visualizing	1
Want	1
Webscraping	1
What	3
When	1
Where	1
Wikipedia	1
You	3
and	28
bash	1
for	2
hacking	1
imagemagick	1
in	3
learning	2
matplotlib	1
physics)	1
scikit-learn	3
the	2
to	3
using	1
with	6
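sorted(freq_dict.items()) orders the pairs alphabetically by word. To list the most frequent words first, sorted accepts a key function and a reverse flag. A sketch on four counts taken from the output above (by_count is an illustrative name; the lambda simply picks the count out of each pair):

```python
freq_dict = {"Python": 13, "and": 28, "Git": 7, "Data": 5}

# key= selects the count from each (word, count) pair;
# reverse=True puts the largest counts first
by_count = sorted(freq_dict.items(), key=lambda pair: pair[1], reverse=True)

for word, freq in by_count:
    print(word + "\t" + str(freq))
```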

Then extend this to write to a file object:


In [69]:
with open("freq_dict_thw.csv", 'w') as f: 
    for word, freq in sorted(freq_dict.items()):
        line = word + ", " + str(freq) + "\n"
        f.write(line)

In [70]:
!head -10 freq_dict_thw.csv


&, 4
(Machine), 1
(without, 1
3, 1
3D, 1
A, 1
Advanced, 4
Architectures, 1
Bash, 3
Build, 1
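Joining fields with ", " by hand works for this data, but a word that itself contains a comma would break the file. The standard csv module handles quoting automatically. A sketch, assuming a header row is wanted; the "Hello, World" entry is made up to show the quoting, and io.StringIO stands in for a real file:

```python
import csv
import io

freq_dict = {"C++": 1, "Git": 7, "Hello, World": 1}  # last key is made up

# io.StringIO stands in for a real file so the sketch leaves nothing on disk;
# with a real file, use open("freq_dict_thw.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["word", "count"])            # header row
for word, freq in sorted(freq_dict.items()):
    writer.writerow([word, freq])

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields
rows = list(csv.reader(io.StringIO(csv_text)))
```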
