Three Python user interfaces
Python shell: python
[yfeng1@waterfall ~]$ python
Python 2.7.12 (default, Sep 29 2016, 13:30:34)
[GCC 6.2.1 20160916 (Red Hat 6.2.1-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
Jupyter Notebook (in a browser, like this)
IDEs: PyCharm, Spyder, etc.
We use Jupyter Notebook here.
Jupyter Notebook is included in the Anaconda distribution.
In [1]:
2 + 3  # Press <Ctrl-Enter> to evaluate a cell
Out[1]:
In [2]:
2 + int(3.5 * 4) * float("8")
Out[2]:
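As a sketch of how the expression above is evaluated, the steps can be taken one at a time. The intermediate names are illustrative only and are not part of the original notebook.

```python
# Evaluate 2 + int(3.5 * 4) * float("8") one step at a time.
product = 3.5 * 4                 # float times int gives a float: 14.0
truncated = int(product)          # int() drops the fractional part: 14
eight = float("8")                # float() parses the string "8": 8.0
result = 2 + truncated * eight    # multiplication binds tighter than addition
print(product, truncated, eight, result)   # 14.0 14 8.0 114.0
```

The final value is a float because one operand of the multiplication is a float.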
In [3]:
9 // 2  # Press <Ctrl-Enter> to evaluate
Out[3]:
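The // operator is floor division. A small comparison with the related operators, assuming Python 3 (which the print(..., end=...) calls later in this notebook require):

```python
# Compare the division-related operators on the same operands.
floor_quotient = 9 // 2      # floor division: 4
true_quotient = 9 / 2        # true division in Python 3: 4.5
remainder = 9 % 2            # remainder: 1
negative_floor = -9 // 2     # floors toward negative infinity: -5
print(floor_quotient, true_quotient, remainder, negative_floor)
```

Under Python 2, such as the 2.7 shell shown at the top, 9 / 2 on two integers would give 4 instead of 4.5.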
To reuse the result of an expression later, we assign it to a variable.
In [4]:
x = 2 + 3
In [5]:
x
Out[5]:
The weirdest expression in Python:
In [6]:
print(x)
Q: What happens under the hood?
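One way to probe this question, assuming Python 3 where print is an ordinary function: the call writes text as a side effect, and the call expression itself evaluates to None. The name returned is illustrative only.

```python
x = 2 + 3
returned = print(x)        # writes "5" to the output as a side effect
print(returned is None)    # True: the call itself evaluates to None
```

This is why a notebook cell containing print(x) shows the text but has no Out[...] entry.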
A list is an ordered sequence of values, written as comma-separated expressions inside [].
In [7]:
MyListOfNumbers = [1,2,3,4,5,6,7]
A list has a length
In [8]:
len(MyListOfNumbers)
Out[8]:
We can loop over items in a list.
In [9]:
for num in MyListOfNumbers:
    print(num, end=', ')
A tuple is almost a list, defined with () instead of []. The () can sometimes be omitted.
In [10]:
MyTupleOfNumbers = (1, 2, 3, 4, 5, 6)
MyTupleOfNumbers = 1, 2, 3, 4, 5, 6
for num in MyTupleOfNumbers:
    print(num, end=', ')
But tuples have a twist.
Let's try it out
In [11]:
MyListOfNumbers[4] = 99
print(MyListOfNumbers)
In [12]:
MyTupleOfNumbers[4] = 99
Oops.
TypeError: 'tuple' object does not support item assignment.
Tuples are immutable.
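A minimal sketch of the same experiment that catches the error instead of stopping the cell. The variable names are illustrative and stand in for MyListOfNumbers and MyTupleOfNumbers.

```python
numbers_list = [1, 2, 3, 4, 5, 6, 7]
numbers_tuple = (1, 2, 3, 4, 5, 6)

numbers_list[4] = 99          # lists are mutable, so this works
try:
    numbers_tuple[4] = 99     # tuples are immutable, so this raises
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# A modified copy can still be built from slices of the original tuple.
patched = numbers_tuple[:4] + (99,) + numbers_tuple[5:]
print(patched)
```

The original tuple is left untouched; patched is a new tuple object.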
In [13]:
MyDictionary = {}
MyDictionary[9] = 81
MyDictionary[3] = 9
In [14]:
print(MyDictionary)
We may write MyDictionary : {9, 3} -> R, a mapping from the set of keys {9, 3} to numbers.
We can loop over the items in a dictionary as well.
In [15]:
for k, v in MyDictionary.items():
    print('Key', k, ":", 'Value', v, end=' | ')
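To connect the loop with the mapping notation, the keys and values can also be inspected separately. This sketch uses a stand-in dictionary named squares with the same contents as MyDictionary.

```python
squares = {9: 81, 3: 9}               # same contents as MyDictionary above

domain = sorted(squares.keys())       # the keys the mapping is defined on
image = sorted(squares.values())      # the values it maps to
print(domain, image)
print(9 in squares, 10 in squares)    # membership tests look at keys
```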
We have seen strings a few times.
String literals can be defined with quotation marks, single or double.
In [16]:
"the hacker within", 'the hacker within', r'the hacker within', u'the hacker within', b'the hacker within'
Out[16]:
Q: Mind the tuple
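A short sketch of what the cell above produces, assuming Python 3: the commas make the whole expression a tuple, and the prefixes decide the type of each element.

```python
literals = ("the hacker within", 'the hacker within', r'the hacker within',
            u'the hacker within', b'the hacker within')
kinds = [type(item).__name__ for item in literals]
print(type(literals).__name__, kinds)

# The r prefix only matters when the literal contains backslashes.
print(len('\n'), len(r'\n'))
```

Only the b prefix yields a bytes object; the other four literals are all str in Python 3.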
If we assign a string literal to a variable, we get a string variable.
In [17]:
name = "the hacker within"
Python gives us many ways to manipulate a string.
In [18]:
print(name.upper())
print(name.split())
print(name.upper().split())
We can look for a substring in a string.
In [19]:
name.find("hack")
Out[19]:
In [20]:
name[name.find("hack"):]
Out[20]:
Formatting strings with the traditional printf formats
In [21]:
foo = "there are %03d numbers" % 3
print(foo)
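For comparison, the same format can be written with str.format or with an f-string. This is a sketch; the variable names are illustrative, and f-strings assume Python 3.6 or later.

```python
count = 3
old_style = "there are %03d numbers" % count
format_method = "there are {:03d} numbers".format(count)
f_string = f"there are {count:03d} numbers"    # requires Python 3.6 or later
print(old_style)
print(old_style == format_method == f_string)
```

In all three forms, 03d means an integer padded with zeros to a width of three.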
Conversion between bytes and strings
Converting a string to bytes is called 'encoding'; the reverse is 'decoding'. The default encoding on Unix is UTF-8.
Q: What is the default encoding on Windows and OS X?
In [22]:
bname = name.encode()
print(bname)
In [23]:
print(bname.decode())
Encodings are important if you work with text beyond English.
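A small sketch of why this matters, using a made-up example word with one non-ASCII character (written with an escape so the cell itself stays plain ASCII):

```python
word = "caf\u00e9"                        # "cafe" with an acute accent on the e
utf8_bytes = word.encode("utf-8")         # the accented letter takes two bytes
latin1_bytes = word.encode("latin-1")     # in Latin-1 it takes one byte
print(len(word), len(utf8_bytes), len(latin1_bytes))
print(utf8_bytes)

# Decoding with the wrong encoding does not fail here, but garbles the text.
garbled = utf8_bytes.decode("latin-1")
print(garbled == word, utf8_bytes.decode("utf-8") == word)
```

The number of characters and the number of bytes agree only for plain ASCII text.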
A function is a more compact representation of a mathematical function.
(Still remember dictionaries?)
In [24]:
def square_num(num):
    return num * num
In [25]:
print(square_num(9))
print(square_num(3))
Compare this with our dictionary
In [26]:
print(MyDictionary[9])
print(MyDictionary[3])
The domain of a function is much bigger than that of a dictionary.
A dictionary only remembers what we told it; a function reevaluates its body every time it is called.
In [27]:
print(square_num(10))
print(MyDictionary[10])
Oops. We never told MyDictionary about 10.
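A sketch of how the missing key can be handled without stopping the cell. The dictionary squares stands in for MyDictionary, and the fallback shown is one possible choice, not the notebook's own approach.

```python
def square_num(num):
    return num * num

squares = {9: 81, 3: 9}      # same contents as MyDictionary above

print(square_num(10))        # the function computes an answer for any number
print(squares.get(10))       # .get returns None instead of raising KeyError
print(squares.get(10, square_num(10)))   # or a fallback value that we supply

try:
    squares[10]
except KeyError as error:
    missing_key = error.args[0]
    print("missing key:", missing_key)
```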
In this section we will analyze some textual data with Python.
We first obtain the data with a bash cell.
In [28]:
%%bash
curl -so titles.tsv https://raw.githubusercontent.com/thehackerwithin/berkeley/master/code_examples/spring17_survey/session_titles.tsv
head -5 titles.tsv
Reading in a text file is very easy in Python.
In [29]:
text = open('titles.tsv').read()
Q: There is a subtle problem.
We usually use a different syntax for reading files.
In [30]:
with open('titles.tsv') as ff:
    text = ff.read()
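The difference between the two styles can be seen on the file object's closed attribute. This sketch writes a small throwaway file (a hypothetical example.txt in a temporary directory) so it does not depend on titles.tsv.

```python
import os
import tempfile

# Write a small throwaway file so the sketch does not depend on titles.tsv.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as ff:
    ff.write("hello\n")

handle = open(path)          # the one-line style never closes the file itself
text = handle.read()
print(handle.closed)         # False until we close it ourselves
handle.close()

with open(path) as ff:       # the with block closes the file on exit
    text = ff.read()
print(ff.closed)             # True
```

With the with statement the file is closed even if an error occurs inside the block.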
Let's chop the text into semantic elements.
In [31]:
words = text.split()
lines = text.split("\n")
In [32]:
print(words[::10]) # 1 word every 10
In [33]:
print(lines[::10]) # 1 line every 10
Looks like we read in the file correctly.
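The two split calls behave differently, which a tiny made-up text shows (the sample string is hypothetical and not taken from titles.tsv):

```python
sample = "Intro to Python\nGit and GitHub\n"   # hypothetical two-line text

by_whitespace = sample.split()       # splits on any run of whitespace
by_newline = sample.split("\n")      # splits on newlines only
by_lines = sample.splitlines()       # line split without a trailing empty string
print(by_whitespace)
print(by_newline)
print(by_lines)
```

Because the text ends with a newline, split("\n") leaves an empty string as its last element, so the lines list above may end with one as well.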
Let's visualize this data.
We use some external help from a package, wordcloud.
So we will first install the package with pip, the Python package manager.
In [34]:
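import subprocess
import sys
# pip.main() was removed in pip 10; invoke pip as a module instead.
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'wordcloud'])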
Out[34]:
Oops, I have already installed wordcloud. You may see a different message.
In [35]:
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=300, prefer_horizontal=1, stopwords=None).generate(text)
wordcloud.to_image()
Out[35]:
The biggest keyword is Python. Let's get quantitative:
For each word, we need to remember a number (the number of occurrences).
Use a dictionary.
We will examine all words in the file (split into words).
Use a loop.
In [36]:
freq_dict = {}
for word in words:
    freq_dict[word] = freq_dict.get(word, 0) + 1
print(freq_dict)
In [37]:
print(freq_dict['Python'])
print(freq_dict['CUDA'])
Seems to be working. Let's make a function.
In [38]:
def freq(items):
    freq_dict = {}
    for word in items:
        freq_dict[word] = freq_dict.get(word, 0) + 1
    return freq_dict
The function freq is a mapping from a list to a dictionary, where each key of the dictionary (output) is associated with the number of occurrences of that key in the list (input).
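The standard library offers the same counting through collections.Counter. The sketch below checks that it agrees with freq on a small made-up list; the sample data is hypothetical.

```python
from collections import Counter

def freq(items):
    freq_dict = {}
    for word in items:
        freq_dict[word] = freq_dict.get(word, 0) + 1
    return freq_dict

sample = ["python", "git", "python", "numpy", "python"]   # hypothetical input
counts = freq(sample)
print(counts)
print(Counter(sample) == counts)   # Counter is a dict subclass with the same counts
```

Writing the loop by hand, as the notebook does, shows what Counter does internally; in everyday code Counter is usually the shorter choice.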
In [39]:
freq_dict = freq(words)
freq_freq = freq(freq_dict.values())
Q: What is in freq_freq?
In [40]:
print(freq_freq)
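A toy run on a made-up word list may help in reading this output. The list is hypothetical and much smaller than the real data.

```python
def freq(items):
    freq_dict = {}
    for word in items:
        freq_dict[word] = freq_dict.get(word, 0) + 1
    return freq_dict

toy_words = ["a", "b", "a", "c", "a", "b", "d"]    # hypothetical word list
toy_freq_dict = freq(toy_words)                    # word -> number of occurrences
toy_freq_freq = freq(toy_freq_dict.values())       # count -> number of words with that count
print(toy_freq_dict)
print(toy_freq_freq)
```

Here two words occur once, one word occurs twice, and one word occurs three times, so the second dictionary is a histogram of the first.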
Q: Which is the most frequent word?
Answer
In [41]:
top_word = ""
top_word_freq = 0
for word, count in freq_dict.items():  # 'count' avoids shadowing the freq function
    if count > top_word_freq:
        top_word = word
        top_word_freq = count
print('word', top_word, 'freq', top_word_freq)
Using the max function avoids writing an if.
In [43]:
most = (0, None)
for word, count in freq_dict.items():  # 'count' avoids shadowing the freq function
    most = max([most, (count, word)])
print(most)
Final challenge: the one-liner.
In [44]:
next(reversed(sorted((freq, word) for word, freq in freq_dict.items())))
Out[44]:
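Two shorter alternatives, sketched on made-up counts: max with a key function, and Counter.most_common. Neither is the notebook's own solution.

```python
from collections import Counter

toy_counts = {"python": 3, "git": 2, "numpy": 1}   # hypothetical counts

best_word = max(toy_counts, key=toy_counts.get)    # key with the largest value
print(best_word, toy_counts[best_word])

top_entry = Counter(toy_counts).most_common(1)     # list holding the top (word, count) pair
print(top_entry)
```

When several words tie for first place, these variants may pick a different word than the sorted one-liner, which breaks ties by comparing the words themselves.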
In [45]:
def save(filename, freq_dict):
    ff = open(filename, 'w')
    for word, freq in sorted(freq_dict.items()):
        ff.write("%s %s\n" % (word, freq))
    ff.close()
In [46]:
def save(filename, freq_dict):
    with open(filename, 'w') as ff:
        for word, freq in sorted(freq_dict.items()):
            ff.write("%s %s\n" % (word, freq))
In [47]:
save("freq_dict_thw.txt", freq_dict)
In [48]:
!cat freq_dict_thw.txt
In [49]:
save("freq_freq_thw.txt", freq_freq)
In [50]:
!cat freq_freq_thw.txt
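To check that the saved format can be read back, here is a sketch of a matching reader. The function load is a hypothetical helper that is not part of the original notebook, and the sample counts and file name are made up.

```python
import os
import tempfile

def save(filename, freq_dict):
    with open(filename, 'w') as ff:
        for word, freq in sorted(freq_dict.items()):
            ff.write("%s %s\n" % (word, freq))

def load(filename):
    # Hypothetical helper, not part of the original notebook.
    loaded = {}
    with open(filename) as ff:
        for line in ff:
            key, count = line.rsplit(" ", 1)   # the count is the last field
            loaded[key] = int(count)
    return loaded

path = os.path.join(tempfile.mkdtemp(), "freq_example.txt")
save(path, {"python": 3, "git": 2})            # hypothetical counts
restored = load(path)
print(restored)
```

Keys always come back as strings, so the integer keys of freq_freq would need an extra int() conversion after loading.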
In [51]:
import pandas as pd
dataframe = pd.read_table("freq_freq_thw.txt", sep=' ', header=None, index_col=0)
dataframe
Out[51]:
In [52]:
%matplotlib inline
In [53]:
dataframe.plot(kind='bar')
Out[53]:
In [57]:
import pandas as pd
dataframe = pd.read_table("freq_dict_thw.txt", sep=' ', header=None, index_col=0)
In [56]:
dataframe.plot(kind='bar')
Out[56]:
Well, a busy plot is a busy plot...
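One way to tame the plot is to keep only the most frequent words before plotting. The sketch below selects the top entries with the standard library; the counts and the cutoff top_n are made up for illustration.

```python
import heapq

toy_counts = {"python": 5, "git": 3, "numpy": 2, "cuda": 1, "r": 1}   # hypothetical counts

top_n = 3   # illustrative cutoff
top_items = heapq.nlargest(top_n, toy_counts.items(), key=lambda item: item[1])
print(top_items)
```

With pandas, a similar selection could be made on the dataframe itself, for example dataframe[1].nlargest(20).plot(kind='bar'), assuming the count column is labeled 1 as it is when the file is read with header=None and index_col=0.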