Basic Python for Data Science: A Dataset of Ice and Fire

Hello, and welcome to the Jupyter Notebook for this lesson by Lee Ngo!

If you've gotten this far, we're ready to move on to the next phase!

Objectives of this Lesson

  • Learn the basics of the Python language for data science
  • Import popular Python libraries and data files
  • Perform some Exploratory Data Analysis (EDA)
  • Complete some Data Visualization

Seems like a lot, but we'll be able to get through most of this within an hour!

Let's get started with some light Python to warm up. If you're already familiar with the language, great; if not, the commands below will get you going.

Basics of the Python language

We're certainly not going to cover EVERYTHING one could learn in Python - that takes a lifetime. For our purposes, it helps to understand certain key concepts.

Let's start with data types.

Data Types in Python

There are five common data types in Python:

  • int - integer value
  • float - decimal value
  • bool - True/False
  • complex - complex number, with a real and an imaginary part (e.g. 5 + 72j)
  • NoneType - null value

TIME TO CODE!

Let's try to identify some data types below. Predict the outputs of the following commands and run them.


In [4]:
type(454)


Out[4]:
int

In [5]:
type(2.1648)


Out[5]:
float

In [6]:
type(5 + 6 == 10) # You can put expressions in them as well!


Out[6]:
bool

In [7]:
type(5 + 72j)


Out[7]:
complex

In [8]:
type(None)


Out[8]:
NoneType

Identifying data types will be helpful later on, as having conflicting types can lead to messy data science.
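To see why conflicting types matter, here's a minimal sketch of what happens when a number stored as text meets a real number. The variable names and values are made up for illustration and don't come from the dataset.

```python
# Hypothetical values: the same army size stored as text and as a number
size_as_text = "15000"
size_as_int = 15000

# Mixing str and int in arithmetic raises a TypeError
try:
    size_as_text + size_as_int
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Converting explicitly makes the types agree
total = int(size_as_text) + size_as_int

# Mixing int and float is allowed; the result is a float
promoted = size_as_int + 0.5

print(mixed_ok, total, type(promoted))
```

Checking type() before combining values is a cheap way to catch this kind of mismatch early.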

'Arrays' in Python

Data can also be stored, arranged, and organized in ways that lend themselves to a lot of great analysis. Here are the types of collections one might work with here.

Note: the term 'immutable' means that an object's contents cannot be changed in place; to "change" it, you have to create a new object.

  • str - immutable string of characters (like varchar in SQL), defined with quotes = 'abc'
  • list - ordered, mutable collection of elements, defined with brackets = ['a', 'b']
  • tuple - immutable sequence, defined with parentheses = ('a', 'b')
  • dict - key-value pairs, keys are unique and must be immutable (hashable), defined with braces = {'a': 1, 'b': 2}; since Python 3.7, dicts keep their insertion order
  • set - unordered collection of unique elements, defined with braces = {'a', 'b'}
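Here's a small sketch of how these collections behave in practice, especially the difference between mutable and immutable ones. All names and values are invented for the example.

```python
houses_list = ['Stark', 'Lannister']       # list: mutable
houses_list[0] = 'Targaryen'               # changing an item in place works

houses_tuple = ('Stark', 'Lannister')      # tuple: immutable
try:
    houses_tuple[0] = 'Targaryen'          # this raises a TypeError
    tuple_changed = True
except TypeError:
    tuple_changed = False

words = {'Stark': 'Winter is coming'}      # dict: look up values by key
words['Lannister'] = 'Hear me roar'        # adding a new key-value pair

unique_houses = set(['Stark', 'Stark', 'Tully'])  # set: duplicates collapse

print(houses_list, tuple_changed, len(words), len(unique_houses))
```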

For this lesson, we'll mostly be working in strings, lists, and dictionaries. Let's play around with a basic one below.

TIME TO CODE

Create a list below called house and set the value of the items to the following, in this order (include the quotes and separate the values with a comma):

  • 'Targaryen'
  • 'Stark'
  • 'Lannister'
  • 'Tyrell'
  • 'Tully'
  • 'Arryn'
  • 'Martell'
  • 'Baratheon'
  • 'Greyjoy'

In [9]:
house = ['Targaryen','Stark','Lannister','Tyrell','Tully','Arryn','Martell','Baratheon','Greyjoy']

Let's do some super-basic data exploration. What happens if you type in house[5]?


In [10]:
house[5]


Out[10]:
'Arryn'

Yep, you get the sixth item in the list. As in most programming languages, list items are indexed automatically, starting at 0. This will be helpful to know when you're looking for items in a list by position: the nth item lives at index n - 1.
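A few more ways to reach into a list by position, as a quick sketch. The list is re-created here so the example runs on its own.

```python
house = ['Targaryen', 'Stark', 'Lannister', 'Tyrell', 'Tully',
         'Arryn', 'Martell', 'Baratheon', 'Greyjoy']

first = house[0]          # the first item sits at index 0
last = house[-1]          # negative indexes count from the end
first_three = house[:3]   # a slice: the items at indexes 0, 1, and 2
count = len(house)        # number of items in the list

print(first, last, first_three, count)
```

Note that the last valid index is always len(house) - 1; asking for house[9] here would raise an IndexError.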

Functions and Methods - Let's do more with these objects!

We're going to be working a lot with functions and methods to do some cool things with our data.

First, what's the difference between the two?

  • A function is a named, reusable block of code that takes inputs (arguments), does something with them, and can return a result.

  • A method is a function that belongs to a class, and it is called through an instance (object) of that class, e.g. 'abc'.upper().

  • A class is a blueprint for objects: it defines the attributes (data) and methods (behavior) that every object created from it will have.

Don't worry if you're a little confused - we'll learn more through practice.

Another good way to remember: all methods are functions, but not all functions are methods.
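To make the distinction concrete, here's a minimal sketch with one plain function and one method. The House class and the shout function are hypothetical, written just for this illustration.

```python
def shout(text):
    """A plain function: called on its own, with the object passed in."""
    return text.upper() + "!"


class House:
    """A made-up class for this example."""

    def __init__(self, name, words):
        self.name = name
        self.words = words

    def motto(self):
        """A method: defined inside the class, called through an instance."""
        return self.name + ": " + self.words


stark = House("Stark", "Winter is coming")

function_result = shout(stark.words)    # function call: shout(...)
method_result = stark.motto()           # method call: object.method()
builtin_method = "stark".capitalize()   # built-in str objects have methods too

print(function_result, method_result, builtin_method)
```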

TIME TO CODE!

Let's start with some pretty basic functions. What will the following functions return if you run them?


In [11]:
def words_of_stark():
    return "Winter is coming!"

words_of_stark()


Out[11]:
'Winter is coming!'

In [12]:
def shipping(x,y):
    return x + " is now romantically involved with " + y + ". Hope they're not related!"

shipping("Jon Snow","Daenerys Targaryen") # Sorry, spoiler alert.


Out[12]:
"Jon Snow is now romantically involved with Daenerys Targaryen. Hope they're not related!"

Well, that was fun, but I don't want to have to re-invent the wheel. Fortunately, the Python community has developed a lot of rich libraries full of classes and functions for us to work with, so that we don't have to constantly define them ourselves.

We access them by importing.

Importing Libraries

We're going to be working with two in particular for this lesson:

  • Pandas - a data analysis library in Python, completely free to use. (pandas.pydata.org)
  • Matplotlib - a 2D plotting library to create some decent visualizations of our data.

We'll be importing both of them and using them throughout the lesson. Below is the code to import. Be sure to run it so that it applies to the subsequent code.


In [13]:
import pandas as pd # We use this shortened syntax to type less later on
import matplotlib.pyplot as plt # Specifically, we're using the pyplot module and, again, a shortened name
%matplotlib inline
# The magic command above displays our plots right inside the notebook

Awesome! Now we're ready to start working with the dataset.

About This Dataset You're About to Import

I came across this dataset while searching for fun Game of Thrones-based data. It was put together by Chris Albon, a major contributor to the data science community. I felt it was perfect for teaching some of the basics of Python for data science, especially the core concepts of how to think scientifically, even on make-believe fantasy data.

You can find the original dataset here: https://github.com/chrisalbon/war_of_the_five_kings_dataset

Out of respect to him, I've left it unchanged. You now have a copy of it as well. Use the code below to import it:


In [14]:
raw_dataframe = pd.read_csv("war_of_the_five_kings_dataset.csv")

We've now created an object called raw_dataframe that contains all of the data from the csv file, converted into a Pandas-based dataframe. This will allow us to do a lot of great exploratory things.
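If you'd like to see what pd.read_csv does without opening the real file, here's a small sketch that reads a made-up, three-row CSV from a string. The rows and the names "Battle A" and "King X" are invented and are NOT part of the actual dataset.

```python
import io

import pandas as pd

# A made-up miniature CSV (not the real dataset)
csv_text = """name,year,attacker_king,attacker_size
Battle A,298,King X,15000
Battle B,299,King Y,
Battle C,299,King X,6000
"""

# read_csv accepts a file path or any file-like object, such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = toy.shape      # (rows, columns)
columns = list(toy.columns)     # column names come from the header row

print(n_rows, n_cols, columns)
```

The empty attacker_size field in the second row is read in as a missing value (NaN), which is the same thing you'll see in the real data.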

Basic Exploratory Data Analysis

Let's take a look at our data by using the .head() method, which shows the top few rows of data, according to the number of rows we ask for. Run the code below.


In [15]:
raw_dataframe.head(3) # This should show 3 rows, starting at index 0.


Out[15]:
name year battle_number attacker_king defender_king attacker_1 attacker_2 attacker_3 attacker_4 defender_1 ... major_death major_capture attacker_size defender_size attacker_commander defender_commander summer location region note
0 Battle of the Golden Tooth 298 1 Joffrey/Tommen Baratheon Robb Stark Lannister NaN NaN NaN Tully ... 1.0 0.0 15000.0 4000.0 Jaime Lannister Clement Piper, Vance 1.0 Golden Tooth The Westerlands NaN
1 Battle at the Mummer's Ford 298 2 Joffrey/Tommen Baratheon Robb Stark Lannister NaN NaN NaN Baratheon ... 1.0 0.0 NaN 120.0 Gregor Clegane Beric Dondarrion 1.0 Mummer's Ford The Riverlands NaN
2 Battle of Riverrun 298 3 Joffrey/Tommen Baratheon Robb Stark Lannister NaN NaN NaN Tully ... 0.0 1.0 15000.0 10000.0 Jaime Lannister, Andros Brax Edmure Tully, Tytos Blackwood 1.0 Riverrun The Riverlands NaN

3 rows × 25 columns

You can now catch a glimpse of what's in this dataset! Wait a minute... What happens to the data after column defender_1? There's an ... ellipsis? This dataset is actually very wide. Pandas is doing us a favor by hiding some of those columns so the table fits on screen.

Let's make them visible by changing one of the Pandas display options:


In [16]:
pd.set_option('display.max_columns', None)

This little bit of code sets the display.max_columns option of our Pandas library to None, so there is no limit to the number of columns it shows. Try running the .head() method again to see if it worked.


In [17]:
raw_dataframe.head(3)


Out[17]:
name year battle_number attacker_king defender_king attacker_1 attacker_2 attacker_3 attacker_4 defender_1 defender_2 defender_3 defender_4 attacker_outcome battle_type major_death major_capture attacker_size defender_size attacker_commander defender_commander summer location region note
0 Battle of the Golden Tooth 298 1 Joffrey/Tommen Baratheon Robb Stark Lannister NaN NaN NaN Tully NaN NaN NaN win pitched battle 1.0 0.0 15000.0 4000.0 Jaime Lannister Clement Piper, Vance 1.0 Golden Tooth The Westerlands NaN
1 Battle at the Mummer's Ford 298 2 Joffrey/Tommen Baratheon Robb Stark Lannister NaN NaN NaN Baratheon NaN NaN NaN win ambush 1.0 0.0 NaN 120.0 Gregor Clegane Beric Dondarrion 1.0 Mummer's Ford The Riverlands NaN
2 Battle of Riverrun 298 3 Joffrey/Tommen Baratheon Robb Stark Lannister NaN NaN NaN Tully NaN NaN NaN win pitched battle 0.0 1.0 15000.0 10000.0 Jaime Lannister, Andros Brax Edmure Tully, Tytos Blackwood 1.0 Riverrun The Riverlands NaN

Great! We can now see all of the columns of data! This little bit of code is handy for customizing your libraries in the future.
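If you'd rather not change a setting for the whole notebook, Pandas also offers pd.option_context, which applies an option only inside a with block. A minimal sketch:

```python
import pandas as pd

# Whatever the current setting is (it depends on your environment)
before = pd.get_option('display.max_columns')

# Inside the 'with' block, the column limit is lifted
with pd.option_context('display.max_columns', None):
    inside = pd.get_option('display.max_columns')

# Outside the block, the previous setting is back
after = pd.get_option('display.max_columns')

print(before, inside, after)
```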

Let's do one more key bit of code, drawing back to what we learned about data types at the start:


In [18]:
raw_dataframe.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38 entries, 0 to 37
Data columns (total 25 columns):
name                  38 non-null object
year                  38 non-null int64
battle_number         38 non-null int64
attacker_king         36 non-null object
defender_king         35 non-null object
attacker_1            38 non-null object
attacker_2            10 non-null object
attacker_3            3 non-null object
attacker_4            2 non-null object
defender_1            37 non-null object
defender_2            2 non-null object
defender_3            0 non-null float64
defender_4            0 non-null float64
attacker_outcome      37 non-null object
battle_type           37 non-null object
major_death           37 non-null float64
major_capture         37 non-null float64
attacker_size         24 non-null float64
defender_size         19 non-null float64
attacker_commander    37 non-null object
defender_commander    28 non-null object
summer                37 non-null float64
location              37 non-null object
region                38 non-null object
note                  5 non-null object
dtypes: float64(7), int64(2), object(16)
memory usage: 7.5+ KB

Here's another way for us to look at our data and see what types exist within it, column by column. So far we know that there are 38 rows (battles) and 25 columns overall; some columns are stored as integers (int64), others as floats (float64), and the text columns as generic objects. These are all default type assignments by Pandas, but as we dive deeper, we'll care a little bit more about which is which.
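You may wonder why whole-number columns like attacker_size show up as float64. A likely reason is missing values: NaN is a float, so a numeric column with gaps gets stored as floats. Here's a sketch on a made-up three-row dataframe (the values are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [298, 299, 299],                       # no gaps: stays int64
    'attacker_size': [15000, None, 6000],          # one gap: becomes float64
    'attacker_king': ['King X', None, 'King Y'],   # text column
})

year_dtype = str(toy['year'].dtype)
size_dtype = str(toy['attacker_size'].dtype)

# Count the missing values in each column
missing = toy.isna().sum().to_dict()

print(year_dtype, size_dtype, missing)
```

Counting missing values with .isna().sum() gives the same information as the "non-null" counts in .info(), just from the opposite direction.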

SO, WHAT DO YOU WANT TO KNOW?

We've imported our libraries and our dataset, and we have a decent idea as to what's in them. What's next?

The first rule about being a data scientist: it's not about the tools, it's about the questions to answer.

We are scientists first and foremost, thus we must begin all of our exercises with a question we hope the data will answer. In an era where there are oceans of data generated, we then need tools, and people who use them properly, to answer those questions.

For now, we're dealing with a small dataset: a quantitative documentation of the results of the War of the Five Kings, including:

  • Participating Houses
  • Participating "Kings" of Those Houses
  • Participants in each battle, including attackers and defenders
  • Army sizes
  • Outcome of the battle (based on the attacker's perspective)
  • Name of the battle

There's a lot more we could cover, but I'll try to focus on one particular King: Robb Stark.

Analyzing 'The King in the North'

Sure, things didn't exactly work out well for "The Young Wolf" (spoiler, although it's actually mentioned in this dataset), but his end overshadowed what was otherwise an impressive military campaign.

We can answer some pretty simple questions about his performance in this war, such as:

  • How many battles did Robb Stark fight in?
  • How many battles did Robb Stark win as an attacker?
  • How many battles did Robb Stark win overall?

Let's start with just those 3.

TIME TO CODE!

First, we have to group the data in such a way that we can analyze it. Let's make a new dataframe called df, cloned from the existing one with the .copy() method.


In [19]:
df = raw_dataframe.copy()

How many battles did Robb Stark fight in?

A quick glance in the dataset shows that Robb Stark fought in battles as both an attacker and as a defender.

We'll need to create sub-dataframes that isolate the rows for each of those roles:


In [20]:
robb_off = df[df['attacker_king'] == 'Robb Stark']
robb_def = df[df['defender_king'] == 'Robb Stark']

Whoa, that looks a little complex. Let's break it down. We've created two objects: robb_off for whenever Robb Stark attacked (i.e. on the "offensive"), and robb_def for whenever Robb Stark defended.

We're asking Pandas for only the rows of the dataframe where the value in the attacker_king (or defender_king) column equals 'Robb Stark'. The inner expression, df['attacker_king'] == 'Robb Stark', produces a True/False value for every row, and the outer df[...] keeps just the rows marked True.

Feel free to use .head() on each object to see if it worked.
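To see this filtering pattern in slow motion, here's a sketch on a tiny made-up dataframe, with the True/False step pulled out into its own variable. The battle and king names are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Battle A', 'Battle B', 'Battle C'],
    'attacker_king': ['King X', 'King Y', 'King X'],
})

# Step 1: compare every row's value, producing a Series of True/False
mask = toy['attacker_king'] == 'King X'

# Step 2: keep only the rows where the mask is True
king_x_attacks = toy[mask]

print(mask.tolist())
print(king_x_attacks)
```

Notice that the filtered rows keep their original index labels (0 and 2 here), which is handy for tracing them back to the full dataset.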

From here, it's as simple as counting the rows for each using the len() function, which gives us the "length" of a dataset according to the number of indexed items (in this case - rows).

Here's the code, setting it equal to a new variable:


In [21]:
robb_total = len(robb_off) + len(robb_def)
robb_total


Out[21]:
24

In other words, Robb Stark was involved in nearly 2/3 of all the battles fought during the War of the Five Kings.

But how good of a war commander was he?

How many battles did Robb Stark win as an attacker?

We can build upon the objects we've already built. We have the object for the number of battles Robb fought as an attacker, so let's create one involving him as an attacker AND a victor, still drawing from the original data source:


In [22]:
robb_off_win = robb_off[robb_off['attacker_outcome'] == 'win']

Using the same strategy as before, we're now looking into the sub-dataframe robb_off for whenever the key attacker_outcome has a value win.

From there, it's a simple call to the len() function.


In [23]:
len(robb_off_win)


Out[23]:
8

Cool! Robb Stark won 8 of the battles he fought as an attacker. What about all the battles he won, including the ones as a defender?

We apply the same method, but remember - victories are according to the attacker's perspective. We need times when the attacker has lost to add to Robb's scoreboard.


In [24]:
robb_def_win = robb_def[robb_def['attacker_outcome'] == 'loss']

Adding the two counts together gets you the number of overall victories:


In [25]:
len(robb_off_win) + len(robb_def_win)


Out[25]:
9

... Wait, only 9? Out of the total number of battles Robb Stark fought, he was successful as an attacker but not great on the defensive. Overall, winning 9 out of 24 battles is really not that impressive.

Perhaps 'The Young Wolf' wasn't as impressive as we thought...
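The whole win-counting recipe can be sketched end to end on made-up data, finishing with a win rate. The kings and outcomes below are invented; only the column names match the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'attacker_king': ['King X', 'King X', 'King Y', 'King Y', 'King X'],
    'defender_king': ['King Y', 'King Y', 'King X', 'King X', 'King Y'],
    'attacker_outcome': ['win', 'loss', 'loss', 'win', 'win'],
})

king = 'King X'
as_attacker = toy[toy['attacker_king'] == king]
as_defender = toy[toy['defender_king'] == king]

# An attacker wins when attacker_outcome is 'win' ...
off_wins = len(as_attacker[as_attacker['attacker_outcome'] == 'win'])
# ... and a defender wins when the attacker's outcome is 'loss'
def_wins = len(as_defender[as_defender['attacker_outcome'] == 'loss'])

total_battles = len(as_attacker) + len(as_defender)
win_rate = (off_wins + def_wins) / total_battles

print(off_wins, def_wins, total_battles, win_rate)
```

Applied to the counts above, 9 wins out of 24 battles works out to a win rate of 0.375.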

Try answering some more questions:

  • What was the average size of Robb Stark's armies against those defending against him?
  • How did the Lannister/Baratheons fare in the War of the Five Kings?
  • Which king had the highest winning percentages? (Requires some light statistics...)
  • Who was the most effective commander (there are several to choose from)?

Try some other methods as well in Pandas:

  • .mean() - gives you the average of the values (you may have to select a column first, e.g. df['attacker_size'].mean())
  • .median() - returns the median value
  • .min() - gives you the lowest value in that column
  • .fillna(0.0).astype(int) - replaces missing values (NaN) with 0 and converts the floats to integers; keep in mind that the added zeros will affect averages
  • .describe() - gives you summary statistics of the object's data, such as count, mean, standard deviation, min, quartiles, and max for numeric columns

Now that you have a light understanding of how data analysis is done, let's create some visualizations!

Creating Data Visualizations in Python

Data visualizations, for which we'll rely a lot on Matplotlib here, allow us to better communicate and understand the information we uncover through our analyses.

Let's try to do a few based on the questions we've already resolved so far. Let's create some bar graphs.

First, let's create a new object robb_off_viz that measures what's going on in our robb_off object, using two more tools:

  • .groupby() - splits the rows into groups, one for each unique value in a particular column
  • .apply(len) - measures each group by its "length," or number of rows

In [26]:
robb_off_viz = robb_off.groupby('attacker_outcome').apply(len)
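To see what the grouping is doing under the hood, here's a sketch on made-up data that counts the rows per group in two ways: with the built-in .size() shortcut, and by looping over the groups and calling len() on each one by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Battle A', 'Battle B', 'Battle C', 'Battle D'],
    'attacker_outcome': ['win', 'win', 'loss', 'win'],
})

grouped = toy.groupby('attacker_outcome')

# Shortcut: number of rows in each group
by_size = grouped.size()

# The same count spelled out: each group is itself a small dataframe
manual = {outcome: len(group) for outcome, group in grouped}

print(by_size)
print(manual)
```

Either way, the result is one number per outcome, which is exactly the shape a bar graph needs.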

Now, we can create a simple bar graph with the code below, labeling the y-axis with the .set_ylabel() method.


In [27]:
robb_off_viz.plot(kind='bar').set_ylabel('# of Battles')


Out[27]:
<matplotlib.text.Text at 0x9b06390>

Let's compare that with a plot for Robb Stark's defense. Remember, in this graph, Robb is the defender, so his "wins" are in the "loss" column below.


In [28]:
robb_def_viz = robb_def.groupby('attacker_outcome').apply(len)
robb_def_viz.plot(kind='bar').set_ylabel('# of Battles')


Out[28]:
<matplotlib.text.Text at 0x9d5b278>

We can interpret this data much more easily now with these visuals, in a couple of ways:

  • In Robb Stark's battles, the attacking side won far more often than the defending side
  • Robb Stark is about as good at attacking as he is terrible at defending

Cool. Though looking at just two bars is a little lame. Let's compare some more data!

The code below creates a new object called attacker_win that groups and measures the victories according to the attacker. Afterwards, the rest is a simple bar plot like before.


In [29]:
attacker_win = df[df['attacker_outcome'] == 'win'].groupby('attacker_1').apply(len)
attacker_win.plot(kind='bar').set_ylabel('# of Victories')


Out[29]:
<matplotlib.text.Text at 0x9efa400>

Well, that's interesting. Turns out that the Greyjoys and the Lannisters won more battles on the attack than the Starks did.

Let's move onto another popular form of visualization: scatterplots.

Scatterplotting in Python

I was told by a good friend that scatterplots are the best way to visualize data.

Let's see if he's right. Let's create one based on the battles that took place in this war, beginning with two research questions:

  1. Were Robb Stark's armies generally larger than the defenders while attacking?
  2. Were Robb Stark's armies generally smaller than the attackers while defending?

Let's see if we can do all of this in one scatterplot.

Since we now know that Robb won more often as an attacker than defender, let's see if army size is a factor.

First, let's create our first plot. Let's plot the battles where Robb Stark is attacking and make that the color red. See below:


In [30]:
x = robb_off['attacker_size']
y = robb_off['defender_size']
plt.scatter(x,y,color='red')


Out[30]:
<matplotlib.collections.PathCollection at 0xa37a1d0>

Now let's do the same thing with the battles where Robb Stark is defending and make that the color blue.


In [31]:
x = robb_def['defender_size']
y = robb_def['attacker_size']
plt.scatter(x,y,color='blue')


Out[31]:
<matplotlib.collections.PathCollection at 0xaafb390>

Hm, also interesting, but hard to see how it fits together. We have to run the code in a single Python block to make them all into one cool visual. I've also added some code to make a "line of equality" to give us a sense of what the Starks were truly up against.


In [32]:
x = robb_off['attacker_size']
y = robb_off['defender_size']
plt.scatter(x,y,color='red')

x = robb_def['defender_size']
y = robb_def['attacker_size']
plt.scatter(x,y,color='blue')

plt.title('Stark Armies vs. Opposing Armies')
plt.xlabel('Stark Army')
plt.ylabel('Opposing Army')
plt.xlim(0,21000) # x-axis limits - aim for matching scales on both axes
plt.ylim(0,21000) # y-axis limits
plt.plot([0,21000],[0,21000],color="k") # line of equality: both armies the same size

plt.show()


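Check it out! Points above the "line of equality" are battles where the opposing army was larger than the Stark army, and points below it are battles where the Starks had the numbers. With that in mind, we can draw the following conclusion: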

  • The Starks typically faced opponents who had a larger army size than their own.
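The same comparison can be checked with numbers instead of by eye. Here's a sketch on made-up army sizes; the column names stark_size and opponent_size are hypothetical and don't exist in the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'stark_size': [6000, 18000, 1000, 5000],
    'opponent_size': [12000, 6000, 20000, None],
})

# Only battles where both sizes are known can be compared
known = toy.dropna(subset=['stark_size', 'opponent_size'])

# True where the opponent was larger (a point above the line of equality)
outnumbered = known['opponent_size'] > known['stark_size']

# The mean of True/False values is the fraction that are True
share_outnumbered = outnumbered.mean()

print(len(known), share_outnumbered)
```

Dropping the rows with missing sizes first matters: a comparison against NaN is always False, which would quietly undercount the battles.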

Can we make conclusions about army size AND victory? Not yet. We'll need to redo the data for that, but you should know everything you need to get started on that visualization.

Histograms in Python

One last little lesson on visualization is histograms, a popular way to show how often the values of a variable fall into each range. Run the code below to see some of the potential histograms we can create.


In [33]:
df.hist()


Out[33]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000AB15780>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000000AFB1198>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000000B02BC50>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000000B1233C8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000000B223DD8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000000B31C630>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000000B3B6208>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000000AE33A58>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000000AD6CC18>]], dtype=object)

Hm, that's not all that meaningful to us as a group, but we can focus on one key, such as year.


In [34]:
df.hist('year')


Out[34]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000B9ACD68>]], dtype=object)

A little more helpful! Here we can see that most of the battles (20) happened in the year 299, with 7 happening the year before and 11 in the year after. (We can use Python to count exactly what these numbers are.)
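One way to get exact counts is the .value_counts() method. Here's a sketch on a made-up column of years (not the real data):

```python
import pandas as pd

years = pd.Series([298, 299, 299, 300, 299, 300], name='year')

# Count how many times each year appears, then sort by year
battles_per_year = years.value_counts().sort_index()

print(battles_per_year)
```

On the real dataframe, the same idea would be applied to the year column.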

Let's also check out how big the attacking armies were by exploring attacker_size.


In [43]:
df.hist('attacker_size')


Out[43]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000000D7CF278>]], dtype=object)

Apart from one outlier, it looks like most of the armies had fewer than 30,000 troops. Which army was the largest?
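If you want a hint for tracking down an outlier, here's a sketch on made-up data using .idxmax() and .nlargest(). The battle names and sizes are invented, so the real answer is still yours to find.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Battle A', 'Battle B', 'Battle C', 'Battle D'],
    'attacker_size': [15000, None, 100000, 6000],
})

# idxmax returns the index label of the largest value (missing values are skipped)
largest_index = toy['attacker_size'].idxmax()
largest_battle = toy.loc[largest_index]

# nlargest returns the top rows, sorted from largest to smallest
top_two = toy.nlargest(2, 'attacker_size')

print(largest_battle['name'])
print(top_two)
```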

SANDBOX TIME!

Even with such a small dataset, there are a number of questions you can answer as well as Python commands to explore.

I strongly recommend checking out this documentation here: http://pandas.pydata.org/pandas-docs/stable/

Keep testing this notebook with the commands and see what happens!

YOU DID IT! YOU'RE A DATA SCIENTIST!

Well, kind of. There's a lot more to it than that. Depending on your learning style, there are a lot of directions I can point you in to help you build your skills. Feel free to reach out to me via GitHub or at my personal email: lee-dot-ngo-at-gmail-dot-com. (Please don't be a robot.)

About this Course's Author

Lee Ngo is a self-described 'Education Technology Community Architect,' and is perpetually passionate about inclusivity, engagement, and empathy in spaces of professional advancement. Lee serves as national data science evangelist for Metis. Previously, Lee served as an evangelist for Galvanize based in Seattle. Previously he worked for UP Global (now Techstars) and founded his own ed-tech company in Pittsburgh, PA. Lee believes in learning by doing, engaging and sharing, and he teaches code through a combination of visual communication, teamwork, and project-oriented learning.

You can email him at lee-dot-ngo-at-gmail-dot-com for any further questions.

Disclaimer: This lesson is entirely open-source, unaffiliated with any other entities, and intended for educational and entertainment purposes. The data used remains unchanged from its initial source out of respect to its author and the inspired material. Please feel free to fork, clone, remake, sample, and enjoy as you please under the MIT License.

