Hello, and welcome to the Jupyter Notebook for this lesson by Lee Ngo!
If you've gotten this far, that means you've accomplished the following:
If you've gotten this far, we're ready to move on to the next phase!
Seems like a lot, but we'll be able to get through most of this within an hour!
Let's get started with some light Python to warm up. If you're already familiar with the language, great; if not, these warm-up commands will get you going.
We're certainly not going to cover EVERYTHING one could learn in Python - that takes a lifetime. For our purposes, it helps to understand certain key concepts.
Let's start with data types.
There are five common data types in Python:
- int - integer value
- float - decimal value
- bool - True/False
- complex - imaginary
- NoneType - null value
Let's try to identify some data types below. Predict the outputs of the following commands and run them.
In [4]:
type(454)
Out[4]:
In [5]:
type(2.1648)
Out[5]:
In [6]:
type(5 + 6 == 10) # You can put expressions in them as well!
Out[6]:
In [7]:
type(5 + 72j)
Out[7]:
In [8]:
type(None)
Out[8]:
Identifying data types will be helpful later on, as having conflicting types can lead to messy data science.
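To make "conflicting types" a bit more concrete, here's a small sketch. The variable names (battles_text, extra_battles) are made up for illustration and don't come from the dataset; the point is just that a number stored as text won't mix with a real number until you convert it.

```python
# A number stored as text cannot be added to a real number.
battles_text = "38"   # str, e.g. a value read from a text file (made-up example)
extra_battles = 2     # int

try:
    total = battles_text + extra_battles
except TypeError as error:
    # Python refuses to guess whether we meant text or arithmetic.
    total = None
    print("TypeError:", error)

# Converting explicitly resolves the conflict.
total = int(battles_text) + extra_battles
print(total, type(total))

# isinstance() is a handy way to check a type before doing arithmetic.
is_number = isinstance(extra_battles, int)
print(is_number)
```

Checking types up front like this is usually cheaper than untangling a confusing error later on.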
Data can also be stored, arranged, and organized in ways that lend itself to a lot of great analysis. Here are the types of data one might work with here.
Note: the term 'immutable' means that the object cannot be changed in place once it is created; to "change" it, you have to create a new object.
- str - string/varchar immutable value, defined with quotes = 'abc'
- list - collection of elements, defined with brackets = ['a', 'b']
- tuple - immutable list, defined with parentheses = ('a', 'b')
- dict - key-value pairs (insertion order is preserved since Python 3.7), keys are unique and immutable, defined with braces = {'a':1, 'b':2}
- set - unordered collection of unique elements, defined with braces = {'a', 'b'}
For this lesson, we'll mostly be working with strings, lists, and dictionaries. Let's play around with a basic one below.
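Before building the list, here's a quick sketch of what mutable versus immutable looks like in practice. All of the names and values below are made up for illustration.

```python
# Lists are mutable: an item can be replaced in place.
houses_list = ['Stark', 'Lannister']
houses_list[0] = 'Tully'

# Tuples are immutable: the same assignment raises a TypeError.
houses_tuple = ('Stark', 'Lannister')
try:
    houses_tuple[0] = 'Tully'
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Strings are immutable too: .upper() returns a new string
# and leaves the original untouched.
words = 'winter'
shouted = words.upper()

# Dictionaries map unique keys to values; new pairs can be added.
sigils = {'Stark': 'direwolf', 'Lannister': 'lion'}
sigils['Tully'] = 'trout'

# Sets silently drop duplicates.
unique_houses = {'Stark', 'Stark', 'Tully'}

print(houses_list, tuple_changed, words, shouted, sigils, unique_houses)
```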
Create a list below called house and set the values of its items to the following, in this order (include the quotes and separate the values with commas):
In [9]:
house = ['Targaryen','Stark','Lannister','Tyrell','Tully','Aaryn','Martell','Baratheon','Greyjoy']
Let's do some super-basic data exploration. What happens if you type in house[5]?
In [10]:
house[5]
Out[10]:
Yep, you get the sixth item in the list. As in most programming languages, list items are indexed automatically upon creation, starting at 0. This is helpful to know when you're looking for items in a list by position: the nth item is at index n - 1.
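Here's a short sketch of a few other ways to reach into a list by position. The kings list is a made-up example, separate from the house list above.

```python
# A small made-up list for practicing indexing.
kings = ['Robb', 'Joffrey', 'Stannis', 'Renly', 'Balon']

first = kings[0]        # the 1st item lives at index 0
third = kings[2]        # the 3rd item lives at index 3 - 1 = 2
last = kings[-1]        # negative indexes count from the end
first_two = kings[0:2]  # a slice stops just before index 2

# .index() goes the other way: from a value to its position.
position = kings.index('Stannis')

print(first, third, last, first_two, position)
```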
We're going to be working a lot with functions and methods to do some cool things with our data.
First, what's the difference between the two?
A function is a named, reusable block of code that takes inputs (arguments), does something with them, and usually returns a result.
A method is a function that belongs to a class, and it is called on an instance (object) of that class using dot notation.
A class is a user-defined blueprint for objects that defines the attributes and behaviors (variables and methods) every object of that class has.
Don't worry if you're a little confused - we'll learn more through practice.
Another good way to remember: all methods are functions, but not all functions are methods.
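If the definitions feel abstract, this minimal sketch puts a function, a class, and a method side by side. The Battle class, its attributes, and the numbers are hypothetical and only here for illustration.

```python
# A plain function: called by name, with its input passed in.
def house_words(house):
    return house + " remembers."

# A class: a blueprint for objects that share attributes and methods.
class Battle:
    def __init__(self, name, attacker_size, defender_size):
        self.name = name
        self.attacker_size = attacker_size
        self.defender_size = defender_size

    # A method: a function that belongs to the class and is
    # called on an instance using dot notation.
    def size_difference(self):
        return self.attacker_size - self.defender_size

fight = Battle("Made-Up Skirmish", 6000, 4000)

function_result = house_words("The North")  # function call
method_result = fight.size_difference()     # method call on an object
builtin_method = "stark".upper()            # strings have methods too

print(function_result, method_result, builtin_method)
```

Notice that the function stands on its own, while the method only makes sense attached to a Battle object.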
Let's start with some pretty basic functions. What will the following functions return if you run them?
In [11]:
def words_of_stark():
    return "Winter is coming!"
words_of_stark()
Out[11]:
In [12]:
def shipping(x,y):
    return x + " is now romantically involved with " + y + ". Hope they're not related!"
shipping("Jon Snow","Danaerys Targaryen") # Sorry, spoiler alert.
Out[12]:
Well, that was fun, but I don't want to have to re-invent the wheel. Fortunately the Python community has developed a lot of rich libraries full of classes for us to work with so that we don't have to constantly define them ourselves.
We access them by importing.
We're going to be working with two in particular for this lesson:
- Pandas - a data analysis library in Python, completely free to use. (pandas.pydata.org)
- Matplotlib - a 2D plotting library to create some decent visualizations on our data.
We'll be importing both of them and using them throughout the lesson. Below is the code to import. Be sure to run it so that it applies to the subsequent code.
In [13]:
import pandas as pd # We use this shortened syntax to type less later on
import matplotlib.pyplot as plt # Specifically, we're using the pyplot module and again, a shortened syntax
%matplotlib inline
# This handy command above allows us to see our visualizations instantly, if written correctly
Awesome! Now we're ready to start working with the dataset.
I originally found this while searching for fun Game of Thrones-based data, and I found one by Chris Albon, a major contributor to the data science community. I felt it was perfect for teaching some of the basics in Python for data science, especially the core concepts of how to think scientifically, even on make-believe fantasy data.
You can find the original dataset here: https://github.com/chrisalbon/war_of_the_five_kings_dataset
Out of respect to him, I've left it unchanged. You now have a copy of it as well. Use the code below to import it:
In [14]:
raw_dataframe = pd.read_csv("war_of_the_five_kings_dataset.csv")
We've now created an object called raw_dataframe that contains all of the data from the csv file, converted into a Pandas-based dataframe. This will allow us to do a lot of great exploratory things.
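If you're curious what pd.read_csv() does with the text it reads, here's a tiny sketch that feeds it a made-up CSV string instead of a file. The column names echo the dataset, but the rows are invented for illustration.

```python
import io
import pandas as pd

# A made-up, two-row CSV. The last field of the second row is empty.
csv_text = "name,year,attacker_size\nBattle A,298,15000\nBattle B,299,\n"

# io.StringIO lets read_csv treat a string like a file.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
print(toy.shape)  # (rows, columns)
```

The first line becomes the column names, every other line becomes a row, and the empty field shows up as a missing value (NaN).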
Let's take a look at our data by using the .head() method, which allows us to see the top few rows of data according to the number of rows we'd like to see. Run the code below.
In [15]:
raw_dataframe.head(3) # This should show 3 rows, starting at index 0.
Out[15]:
You now can catch a glimpse of what's in this data set! Wait a minute...
What happens to the data after column defender_1? There's an ... ellipsis?
This dataset is actually very wide. Pandas is doing us a favor by hiding some of those columns.
Let's make them visible by changing one of Pandas' display options:
In [16]:
pd.set_option('display.max_columns', None)
This little bit of code sets the display.max_columns option of our Pandas library to None, so there is no limit on the number of columns it shows. Try running the .head() method again to see if it worked.
In [17]:
raw_dataframe.head(3)
Out[17]:
Great! We can now see all of the columns of data! This little bit of code is handy for customizing your libraries in the future.
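If you'd rather not change the setting for the whole notebook, one alternative is a temporary override. This is a minimal sketch using pd.option_context(); the variable names are made up for illustration.

```python
import pandas as pd

# Remember whatever the current setting is.
before = pd.get_option('display.max_columns')

# Inside the "with" block the limit is lifted; afterwards it snaps back.
with pd.option_context('display.max_columns', None):
    inside = pd.get_option('display.max_columns')

after = pd.get_option('display.max_columns')

print(before, inside, after)
```

pd.reset_option('display.max_columns') is another way to go back to the default if you change your mind later.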
Let's do one more key bit of code, drawing back to our very first lesson:
In [18]:
raw_dataframe.info()
Here's another way for us to look at our data and see what types exist within it, listed by column. So far we know that there are 38 rows overall (one per battle), and that some columns are stored as integers while others are stored as floats. These are all default type assignments by Pandas, but as we dive deeper, we'll care a little bit more about which is which.
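One reason a column of whole numbers can end up as floats: missing values. Here's a small sketch on a made-up dataframe (the rows are invented; only the column names echo the dataset).

```python
import pandas as pd

# Made-up rows. One army size is missing on purpose.
toy = pd.DataFrame({
    'battle_number': [1, 2, 3],
    'attacker_size': [15000, None, 6000],
})

# No gaps: Pandas keeps the column as integers.
number_kind = toy['battle_number'].dtype.kind   # 'i' means integer

# A gap becomes NaN, which is a float, so the whole column turns into floats.
size_kind = toy['attacker_size'].dtype.kind     # 'f' means float
missing = toy['attacker_size'].isna().sum()

print(toy.dtypes)
print(number_kind, size_kind, missing)
```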
We've imported our libraries and our dataset, and we have a decent idea as to what's in them. What's next?
The first rule about being a data scientist: it's not about the tools, it's about the questions to answer.
We are scientists first and foremost, thus we must begin all of our exercises with a question we hope the data will answer. In an era where oceans of data are generated, we need both the tools and the people who can use them properly to answer those questions.
For now, we're dealing with a small dataset: a quantitative documentation of the results of the War of the Five Kings, including:
There's a lot more we could cover, but I'll try to focus on one particular King: Robb Stark.
Sure, things didn't exactly work out well for "The Young Wolf" (spoiler, although it's actually mentioned in this dataset), but his end overshadowed what was otherwise an impressive military campaign.
We can answer some pretty simple questions about his performance in this war, such as:
Let's start with just those 3.
First, we have to group the data in such a way that we can analyze it.
Let's make a new dataframe called df, cloned from the existing one with the .copy() method.
In [19]:
df = raw_dataframe.copy()
In [20]:
robb_off = df[df['attacker_king'] == 'Robb Stark']
robb_def = df[df['defender_king'] == 'Robb Stark']
Whoa, that looks a little complex. Let's break it down. We've created two objects: robb_off for whenever Robb Stark attacked (i.e. on the "offensive"), and robb_def for whenever Robb Stark defended.
The inner expression, such as df['attacker_king'] == 'Robb Stark', compares every row's value in that column to the string 'Robb Stark' and produces a True/False value per row. Placing that result inside df[...] keeps only the rows marked True. We do this once for the attacker_king column and once for the defender_king column.
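To see the filtering step by step, here's a sketch on a tiny made-up dataframe. The battle names and 'Another King' are placeholders, not values from the dataset.

```python
import pandas as pd

# Three made-up battles.
toy = pd.DataFrame({
    'name': ['Battle 1', 'Battle 2', 'Battle 3'],
    'attacker_king': ['Robb Stark', 'Another King', 'Robb Stark'],
})

# Step 1: the comparison gives one True/False per row.
mask = toy['attacker_king'] == 'Robb Stark'

# Step 2: using that mask inside toy[...] keeps only the True rows.
robb_rows = toy[mask]

print(mask.tolist())
print(robb_rows)
```

The one-line version in the cell above does both steps at once.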
Feel free to use .head() on each object to see if it worked.
From here, it's as simple as counting the rows for each using the len() function, which gives us the "length" of a dataset according to the number of indexed items (in this case - rows).
Here's the code, setting it equal to a new variable:
In [21]:
robb_total = len(robb_off) + len(robb_def)
robb_total
Out[21]:
In other words, Robb Stark was involved in nearly 2/3 of all the battles fought during the War of the Five Kings.
But how good of a war commander was he?
We can build upon the objects we've already built. We have the object for the number of battles Robb fought as an attacker, so let's create one involving him as an attacker AND a victor, still drawing from the original data source:
In [22]:
robb_off_win = robb_off[robb_off['attacker_outcome'] == 'win']
Using the same strategy as before, we're now looking into the sub-dataframe robb_off for whenever the column attacker_outcome has the value win.
From there, it's a simple call to the len() function.
In [23]:
len(robb_off_win)
Out[23]:
Cool! Robb Stark won 8 of the battles he fought as an attacker. What about all the battles he won, including the ones as a defender?
We apply the same method, but remember - victories are according to the attacker's perspective. We need times when the attacker has lost to add to Robb's scoreboard.
In [24]:
robb_def_win = robb_def[robb_def['attacker_outcome'] == 'loss']
Adding the two counts together gets you the number of overall victories:
In [25]:
len(robb_off_win) + len(robb_def_win)
Out[25]:
.... Wait, only 9? Out of the total number of battles Robb Stark fought, he was successful as an attacker but not great on the defensive. Overall, winning 9 out of 24 battles is really not that impressive.
Perhaps 'The Young Wolf' wasn't as impressive as we thought...
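Here's the whole win-counting recipe in one place, as a sketch on made-up rows ('King B' and the outcomes are invented, so the numbers below say nothing about the real war). The same steps should carry over to df.

```python
import pandas as pd

# Four made-up battles: Robb attacks twice and defends twice.
toy = pd.DataFrame({
    'attacker_king': ['Robb Stark', 'Robb Stark', 'King B', 'King B'],
    'defender_king': ['King B', 'King B', 'Robb Stark', 'Robb Stark'],
    'attacker_outcome': ['win', 'loss', 'loss', 'win'],
})

attacked = toy[toy['attacker_king'] == 'Robb Stark']
defended = toy[toy['defender_king'] == 'Robb Stark']

# As an attacker, a 'win' is a win. As a defender, the attacker's 'loss' is a win.
wins_attacking = len(attacked[attacked['attacker_outcome'] == 'win'])
wins_defending = len(defended[defended['attacker_outcome'] == 'loss'])

total_battles = len(attacked) + len(defended)
win_rate = (wins_attacking + wins_defending) / total_battles

print(wins_attacking, wins_defending, total_battles, win_rate)
```

Turning the counts into a rate makes it easier to compare commanders who fought different numbers of battles.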
Try some other methods in Pandas as well:
- .mean() - gives you the average of some value (in some cases you have to specify the column)
- .median() - returns the median value of an object
- .min() - gives you the lowest value in that array
- .fillna(0.0).astype(int) - replaces missing values with 0 and then converts the values to integers, which is one way to get rid of the float objects in your dataset
- .describe() - gives you an overview of the object's data, such as counts, unique values, and summary statistics
Now that you have a light understanding of how data analysis is done, let's create some visualizations!
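Before we get to the charts, here's a quick sketch of the methods listed above, run on a made-up column of army sizes (the numbers are invented for illustration).

```python
import pandas as pd

# Made-up army sizes, with one missing value.
sizes = pd.Series([1000.0, 6000.0, None, 15000.0], name='attacker_size')

mean_size = sizes.mean()      # missing values are skipped by default
median_size = sizes.median()
smallest = sizes.min()

# Replace the gap with 0, then convert everything to integers.
as_ints = sizes.fillna(0.0).astype(int)

# count, mean, std, min, quartiles, and max in one go.
summary = sizes.describe()

print(mean_size, median_size, smallest)
print(as_ints.tolist())
print(summary)
```

One thing to keep in mind: filling gaps with 0 changes the average, so it's worth deciding whether a missing army size should really count as zero.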
Data visualizations, for which we'll rely a lot on Matplotlib here, allow us to better communicate and understand the information we produce through our analyses.
Let's try to do a few based on the questions we've already resolved so far. Let's create some bar graphs.
First, let's create a new object robb_off_viz that measures what's going on in our robb_off object, using two more tools:
- .groupby() - groups the rows by the unique values in a particular column
- len() - measures each group by its "length," or number of rows
In [26]:
robb_off_viz = robb_off.groupby('attacker_outcome').apply(len)
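As a side note, Pandas has a couple of built-in shortcuts that should give the same per-group row counts as .apply(len). This is a sketch on made-up outcomes, not the real data.

```python
import pandas as pd

# Four made-up outcomes.
toy = pd.DataFrame({'attacker_outcome': ['win', 'win', 'loss', 'win']})

# .size() counts the rows in each group.
by_size = toy.groupby('attacker_outcome').size()

# .value_counts() skips the groupby step entirely.
by_counts = toy['attacker_outcome'].value_counts()

print(by_size)
print(by_counts)
```

Either result can be plotted with .plot(kind='bar') in the same way.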
Now we can create a simple bar graph with the code below, setting the y-axis label with the .set_ylabel() method.
In [27]:
robb_off_viz.plot(kind='bar').set_ylabel('# of Battles')
Out[27]:
Let's compare that with a plot for Robb Stark's defense. Remember, in this graph, Robb is the defender, so his "wins" are in the "loss" column below.
In [28]:
robb_def_viz = robb_def.groupby('attacker_outcome').apply(len)
robb_def_viz.plot(kind='bar').set_ylabel('# of Battles')
Out[28]:
We can interpret this data much more easily now with these visuals in a couple of ways:
Cool. Though looking at just two bars is a little lame. Let's compare some more data!
The code below creates a new object called attacker_win that groups and measures the victories according to the attacker.
Afterwards, the rest is a simple bar plot like before.
In [29]:
attacker_win = df[df['attacker_outcome'] == 'win'].groupby('attacker_1').apply(len)
attacker_win.plot(kind='bar').set_ylabel('# of Victories')
Out[29]:
Well, that's interesting. Turns out that the Greyjoys and the Lannisters won more battles on the attack than the Starks.
Let's move onto another popular form of visualization: scatterplots.
I was told by a good friend that scatterplots are the best way to visualize data.
Let's see if he's right. Let's create one based on the battles that took place in this war, beginning with a research question:
Let's see if we can do all of this in one scatterplot.
Since we now know that Robb won more often as an attacker than defender, let's see if army size is a factor.
First, let's plot the battles where Robb Stark is attacking and make those points red. See below:
In [30]:
x = robb_off['attacker_size']
y = robb_off['defender_size']
plt.scatter(x,y,color='red')
Out[30]:
Now let's do the same thing with the battles where Robb Stark is defending and make that the color blue.
In [31]:
x = robb_def['defender_size']
y = robb_def['attacker_size']
plt.scatter(x,y,color='blue')
Out[31]:
Hm, also interesting, but hard to see how it fits together. We have to run the code in a single Python block to combine them into one cool visual. I've also added some code to draw a "line of equality" to give us a sense of what the Starks were truly up against.
In [32]:
x = robb_off['attacker_size']
y = robb_off['defender_size']
plt.scatter(x,y,color='red')
x = robb_def['defender_size']
y = robb_def['attacker_size']
plt.scatter(x,y,color='blue')
plt.title('When Starks Attack')
plt.xlabel('Stark Army')
plt.ylabel('Defender Army')
plt.xlim(0,21000) # x parameters - aim for homogeneous scales
plt.ylim(0,21000) # y parameters
plt.plot([0,21000],[0,21000],color="k") # line of equality
plt.show()
Check it out! Especially with the "line of equality," we can make the following conclusions:
Can we make conclusions about army size AND victory? Not yet. We'll need to redo the data for that, but you should know everything you need to get started on that visualization.
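If you want a head start, here's one possible sketch of that idea: color each point by the battle's outcome. The rows and the color choices are made up for illustration, and this is just one way to approach it, not the only one.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up battles: army sizes plus the attacker's outcome.
toy = pd.DataFrame({
    'attacker_size': [6000, 2000, 15000, 1000],
    'defender_size': [4000, 10000, 5000, 3000],
    'attacker_outcome': ['win', 'loss', 'win', 'loss'],
})

# Translate each outcome into a color (an arbitrary choice).
colors = toy['attacker_outcome'].map({'win': 'green', 'loss': 'gray'})

fig, ax = plt.subplots()
ax.scatter(toy['attacker_size'], toy['defender_size'], c=colors.tolist())
ax.set_xlabel('Attacker Army')
ax.set_ylabel('Defender Army')
ax.set_title('Army Size and Outcome (made-up data)')
```

In a notebook the figure should appear under the cell on its own; in a plain script you would add plt.show() at the end.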
One last little lesson on visualization is histograms, a popular way to show how often the values of a variable fall into each range. Run the code below to see some of the potential histograms we can create.
In [33]:
df.hist()
Out[33]:
Hm, that's not all that meaningful to us as a group, but we can focus on one key, such as year.
In [34]:
df.hist('year')
Out[34]:
A little more helpful! Here we can see that many of the battles (20) happened in the year 299, with 7 happening the year before and 11 in the year after. (We can use Python to count exactly what these numbers are.)
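Here's a sketch of that counting step. To keep it self-contained, the years column below is rebuilt from the counts quoted above rather than read from the file; on the real data, df['year'].value_counts() should do the same job.

```python
import pandas as pd

# Stand-in for df['year'], rebuilt from the counts quoted in the text.
years = pd.Series([298] * 7 + [299] * 20 + [300] * 11, name='year')

# Count how many times each year appears, then sort by year.
battles_per_year = years.value_counts().sort_index()

print(battles_per_year)
print(battles_per_year.sum())
```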
Let's also check out how big the attacking armies were by exploring attacker_size.
In [43]:
df.hist('attacker_size')
Out[43]:
With one outlier, it looks like most of the armies were less than 30,000 in troop size. Which army was the largest?
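One way to go after that question is with .idxmax() or .nlargest(). This sketch uses made-up rows (the battle names, kings, and sizes are placeholders), so try the same calls on df to get the real answer.

```python
import pandas as pd

# Made-up battles, one with a missing army size.
toy = pd.DataFrame({
    'name': ['Battle A', 'Battle B', 'Battle C'],
    'attacker_king': ['King A', 'King B', 'King A'],
    'attacker_size': [6000.0, None, 20000.0],
})

# .idxmax() returns the row label of the largest value (missing values are skipped).
largest_row = toy.loc[toy['attacker_size'].idxmax()]

# .nlargest() returns the top rows, biggest first.
top_two = toy.nlargest(2, 'attacker_size')

print(largest_row['name'], largest_row['attacker_size'])
print(top_two)
```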
Even with such a small dataset, there are a number of questions you can answer as well as Python commands to explore.
I strongly recommend checking out this documentation here: http://pandas.pydata.org/pandas-docs/stable/
Keep testing this notebook with the commands and see what happens!
Well, kind of. There's a lot more to it than that. Depending on your learning style, there are a lot of directions I can point you in to help you build your skills. Feel free to reach out to me via GitHub or at my personal email: lee-dot-ngo-at-gmail-dot-com. (Please don't be a robot.)
Lee Ngo is a self-described 'Education Technology Community Architect,' and is perpetually passionate about inclusivity, engagement, and empathy in spaces of professional advancement. Lee serves as national data science evangelist for Metis. Previously, Lee served as an evangelist for Galvanize based in Seattle. Previously he worked for UP Global (now Techstars) and founded his own ed-tech company in Pittsburgh, PA. Lee believes in learning by doing, engaging and sharing, and he teaches code through a combination of visual communication, teamwork, and project-oriented learning.
You can email him at lee-dot-ngo-at-gmail-dot-com for any further questions.
Disclaimer: This lesson is entirely open-source, unaffiliated with any other entities, and intended for educational and entertainment purposes. The data used remains unchanged from its initial source out of respect to its author and the inspired material. Please feel free to fork, clone, remake, sample, and enjoy as you please under the MIT License.