Maybe you can quickly review these basics first.
Google's Python Class is also a nice resource.
Jupyter Notebook/Lab (together with IPython) and Pandas may be the two most important tools behind Python's rise in data science. Jupyter lets you interactively explore datasets and code; Pandas lets you handle tabular datasets with superb speed and convenience. And they work so well together! In many cases, Jupyter and Pandas are all you need to load, clean, transform, visualize, and understand a dataset.
If you are not familiar with Pandas, you may want to follow their official tutorial called 10 Minutes to pandas now or in the near future.
In [1]:
import pandas as pd
You can check the version of the library. Because pandas is a fast-evolving library, you want to make sure that you have an up-to-date version.
In [2]:
pd.__version__
Out[2]:
You also need matplotlib, which is used by pandas to plot figures. The following is the most common convention for importing the matplotlib library.
In [3]:
import matplotlib.pyplot as plt
Let's check its version too.
In [4]:
import matplotlib
matplotlib.__version__
Out[4]:
Using pandas, you can read tabular data files in many formats and through many protocols. Pandas supports not only flat files such as .csv, but also various other formats including clipboard, Excel, JSON, HTML, Feather, Parquet, SQL, Google BigQuery, and so on. Moreover, you can pass a local file path or a URL. If it's on Amazon S3, just pass a url like s3://path/to/file.csv. If it's on a webpage, then just use https://some/url.csv.
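As a minimal sketch of the point above, the same read_* functions accept file-like objects as well as paths and URLs. The tiny CSV and JSON strings below are made up for illustration and are not part of the course data.

```python
import io

import pandas as pd

# Hypothetical inline data, standing in for a local path or a URL.
csv_text = "X,Y\n1.5,2.5\n3.5,4.5\n"
json_text = '[{"X": 1.5, "Y": 2.5}, {"X": 3.5, "Y": 4.5}]'

# read_csv accepts a path, a URL, or any file-like object.
csv_df = pd.read_csv(io.StringIO(csv_text))

# read_json follows the same pattern for JSON records.
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df.shape, json_df.shape)
print(list(csv_df.columns), list(json_df.columns))
```

Swapping io.StringIO(...) for a file path or URL string should be the only change needed to read real files.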
Let's load a dataset with the locations of the pumps on John Snow's map. You can also download the file to your computer and try to load it using the local path.
In [5]:
pump_df = pd.read_csv('https://raw.githubusercontent.com/yy/dviz-course/master/data/pumps.csv')
df stands for "Data Frame", which is a fundamental data object in Pandas. You can take a look at the dataset by looking at the first few lines.
In [6]:
pump_df.head()
Out[6]:
Q1: Can you print only the first three lines? Refer to the pandas documentation: http://pandas.pydata.org/pandas-docs/stable/index.html
In [7]:
# TODO: write your code here
Out[7]:
You can also sample several rows randomly. If the data is sorted in some way, the first few rows can be misleading, and random sampling gives you a less biased view of the dataset.
In [7]:
# Your code here
Out[7]:
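Here is a small sketch of why sampling helps, using a made-up table (toy_df is hypothetical, not the pump data) that is sorted on purpose. It assumes the DataFrame.sample method with a fixed random_state so the draw is repeatable.

```python
import pandas as pd

# Hypothetical toy table, deliberately sorted by "id".
toy_df = pd.DataFrame({"id": range(10), "value": [i * i for i in range(10)]})

# head() only ever shows the top of a sorted table...
print(toy_df.head())

# ...while sample() draws rows at random (without replacement by default).
# random_state fixes the seed so the same rows come back on every run.
picked = toy_df.sample(n=3, random_state=0)
print(picked)
```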
You can also figure out the number of rows in the dataset by running
In [8]:
len(pump_df)
Out[8]:
Note that df.size does not give you the number of rows. It tells you the number of elements.
In [9]:
pump_df.size
Out[9]:
You can also look at the shape of the dataset and at the names of its columns.
In [10]:
pump_df.shape # 13 rows and 2 columns
Out[10]:
In [11]:
pump_df.columns
Out[11]:
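To make the difference between these attributes concrete, here is a sketch on a small made-up table (toy_df is hypothetical) where the numbers are easy to verify by hand.

```python
import pandas as pd

# Hypothetical table with 3 rows and 2 columns.
toy_df = pd.DataFrame({"X": [1.0, 2.0, 3.0], "Y": [4.0, 5.0, 6.0]})

n_rows, n_cols = toy_df.shape

print(len(toy_df))           # number of rows
print(toy_df.size)           # number of elements = rows * columns
print(toy_df.shape)          # (rows, columns)
print(list(toy_df.columns))  # column names
```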
You can also check out basic descriptive statistics of the whole dataset by using the describe() method.
In [12]:
pump_df.describe()
Out[12]:
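The result of describe() is itself a DataFrame, so individual statistics can be looked up by label. A sketch on a made-up table (toy_df is hypothetical):

```python
import pandas as pd

# Hypothetical numeric table.
toy_df = pd.DataFrame({"X": [1.0, 2.0, 3.0, 4.0], "Y": [10.0, 20.0, 30.0, 40.0]})

summary = toy_df.describe()
print(summary)

# Rows are statistics, columns are the original numeric columns.
print(list(summary.index))
print(summary.loc["mean", "X"])
```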
You can slice the data like a list
In [13]:
pump_df[:2]
Out[13]:
In [14]:
pump_df[-2:]
Out[14]:
In [15]:
pump_df[1:5]
Out[15]:
or filter rows using some conditions.
In [16]:
pump_df[pump_df.X > 13]
Out[16]:
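The filtering step works because the comparison produces a boolean Series that is then used to select rows. A sketch with made-up values (toy_df is hypothetical) shows the intermediate mask and how two conditions might be combined:

```python
import pandas as pd

# Hypothetical table; the default index is 0..4.
toy_df = pd.DataFrame({"X": [5, 12, 14, 9, 20], "Y": [1, 2, 3, 4, 5]})

first_two = toy_df[:2]   # rows 0 and 1
last_two = toy_df[-2:]   # rows 3 and 4

# The comparison yields one True/False value per row.
mask = toy_df.X > 13
large_x = toy_df[mask]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
both = toy_df[(toy_df.X > 13) & (toy_df.Y < 5)]

print(mask.tolist())
print(large_x)
print(both)
```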
Now let's load another CSV file that documents the cholera deaths. The URL is https://raw.githubusercontent.com/yy/dviz-course/master/data/deaths.csv
Q2: load the death dataset as death_df and inspect it.
In [17]:
# TODO: write your code here. You probably want to create multiple cells.
Out[17]:
In [18]:
len(death_df)
Out[18]:
Let's visualize them! Pandas actually provides a nice visualization interface that uses matplotlib under the hood. You can do many basic plots without learning matplotlib. So let's try.
In [19]:
death_df.plot()
Out[19]:
Oh, by the way, depending on your environment, you may not see any plot. If you don't see anything, run the following command.
In [20]:
%matplotlib inline
Commands that start with % are called magic commands, and they are available in IPython and Jupyter. The purpose of this command is to tell IPython / Jupyter to show the plot right here instead of trying to use an external viewer.
Anyway, this doesn't seem like the plot we want. Instead of putting each row as a point in a 2D plane by using the X and Y as the coordinate, it just created a line chart. Let's fix it. Please take a look at the plot method documentation. How should we change the command? Which kind of plot do we want to draw?
Yes, we want to draw a scatter plot using x and y as the Cartesian coordinates.
In [21]:
death_df.plot(x='X', y='Y', kind='scatter', label='Deaths')
Out[21]:
I think I want to reduce the size of the dots and change the color to black. But it is difficult to find how to do that! It is sometimes quite annoying to figure out how to change how the visualization looks, especially when we use matplotlib. Unlike some other advanced tools, matplotlib does not provide a very coherent way to adjust your visualizations. That's one of the reasons why there are lots of visualization libraries that wrap matplotlib. Anyway, this is how you do it.
In [22]:
death_df.plot(x='X', y='Y', kind='scatter', label='Deaths', s=2, c='black')
Out[22]:
Can we visualize both deaths and pumps?
In [23]:
death_df.plot(x='X', y='Y', s=2, c='black', kind='scatter', label='Deaths')
pump_df.plot(x='X', y='Y', kind='scatter', c='red', s=8, label='Pumps')
Out[23]:
Why do we have two separate plots? The reason is that, by default, the plot method creates a new plot. To avoid this, we need to create an Axes and tell plot to use that axes. What is an axes? See this illustration.
A figure can contain multiple axes (link):
and an axes can contain multiple plots (link).
Conveniently, when you call the plot method, it creates an axes and returns it to you.
In [24]:
ax = death_df.plot(x='X', y='Y', s=2, c='black', kind='scatter', label='Deaths')
In [25]:
ax
Out[25]:
Then you can pass this axes object to another plot to put both plots in the same axes.
In [29]:
ax = death_df.plot(x='X', y='Y', s=2, c='black', alpha=0.5, kind='scatter', label='Deaths')
pump_df.plot(x='X', y='Y', kind='scatter', c='red', s=8, label='Pumps', ax=ax)
Out[29]:
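To confirm that both layers really land on the same axes, here is a sketch with two made-up tables (a_df and b_df are hypothetical stand-ins for the deaths and pumps). It checks that the second call returns the very axes object that was passed in.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-ins for the two datasets.
a_df = pd.DataFrame({"X": [1, 2, 3], "Y": [3, 1, 2]})
b_df = pd.DataFrame({"X": [1.5, 2.5], "Y": [2.0, 2.5]})

# The first call creates a new axes and returns it.
ax = a_df.plot(x="X", y="Y", kind="scatter", s=2, c="black", label="A")

# The second call draws into the existing axes instead of making a new one.
returned = b_df.plot(x="X", y="Y", kind="scatter", s=8, c="red", label="B", ax=ax)

print(returned is ax)
print(len(ax.collections))  # one collection per scatter layer
```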
Probably the most explicit (and best) way to create a plot is by calling the plt.subplots() function (see https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.subplots.html). In doing so, you directly obtain the figure object as well as the ax object. Then you can manipulate them directly. plt.plot() or df.plot() is a quick way to create plots, but if you want to produce nice explanatory plots (which may involve multiple panels), use this method!
Now, can you use this method to produce the same plot?
In [30]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# your code here
Out[30]:
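As a sketch of the multi-panel case mentioned above, plt.subplots can return several axes at once, and each one can be handed to a separate plot call. The data and panel titles here are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data.
toy_df = pd.DataFrame({"X": [1, 2, 3, 4], "Y": [2, 4, 3, 5]})

# One row, two columns of panels; axes is an array with one entry per panel.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

toy_df.plot(x="X", y="Y", kind="scatter", ax=axes[0])
axes[0].set_title("scatter")

toy_df.plot(x="X", y="Y", kind="line", ax=axes[1])
axes[1].set_title("line")

# Adjust spacing so that labels of neighboring panels do not overlap.
fig.tight_layout()
```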
In [31]:
from scipy.spatial import Voronoi, voronoi_plot_2d
Take a look at the documentation of Voronoi and voronoi_plot_2d, and then try the following.
Q3: produce a Voronoi diagram that shows the deaths, pumps, and voronoi cells
In [32]:
# you'll need this
points = pump_df.values
points
Out[32]:
In [33]:
# TODO: your code here
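If the SciPy API is unfamiliar, the following sketch shows the general pattern on made-up coordinates (the sites array is hypothetical, not the pump data). It assumes that voronoi_plot_2d accepts an existing axes through its ax argument, so other layers can be drawn on the same axes.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import Voronoi, voronoi_plot_2d

# Hypothetical site coordinates: four corners of a square plus its center.
sites = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [1.0, 1.0]])

# Voronoi expects an (n_points, 2) array for a 2D diagram.
vor = Voronoi(sites)

fig, ax = plt.subplots()

# Draw the cell boundaries into our own axes.
voronoi_plot_2d(vor, ax=ax, show_vertices=False)

# Any further layer can share the same axes.
ax.scatter(sites[:, 0], sites[:, 1], c="red", s=8, label="Sites")
ax.legend()
```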
In [39]:
import matplotlib.pyplot as plt
plt.plot([1,2,3], [4,2,3])
plt.savefig('foo.png')
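The file extension passed to savefig decides the output format. The sketch below saves the same figure as PNG and PDF into a temporary folder; the folder, the dpi value, and bbox_inches="tight" are illustrative choices, not requirements.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 2, 3])

# Hypothetical output location; any writable folder works.
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "foo.png")
pdf_path = os.path.join(out_dir, "foo.pdf")

# dpi sets the raster resolution; bbox_inches="tight" trims extra margins.
fig.savefig(png_path, dpi=150, bbox_inches="tight")
fig.savefig(pdf_path, bbox_inches="tight")

print(os.path.exists(png_path), os.path.exists(pdf_path))
```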
Q4: Save your Voronoi diagram. Make sure that your plot contains the scatterplot of deaths & pumps as well as the Voronoi cells
In [40]:
# TODO: your code here
OK, that was a brief introduction to pandas and some simple visualizations. Now let's talk about the web a little bit.
Many browsers don't allow loading local files due to security concerns. If you work with JavaScript and datasets, this can cause trouble. We can get around it by simply running a local web server with Python (did you know that there is a simple HTTP server module in Python? 😎):
First, open a terminal and move to your working folder with cd <FOLDER_LOCATION>. If you run your webserver here, then this folder becomes the root of the website. Then type
python -m http.server
If successful, you'll see
Serving HTTP on 0.0.0.0 port 8000 …
This means that your computer is now running a webserver that listens on all network interfaces (0.0.0.0) on port 8000. Now you can open a browser and type "localhost:8000" in the address bar to connect to this webserver. On many systems, "0.0.0.0:8000" works too. After connecting, click on the different links. You can also directly access one of the files by typing localhost:8000/NAME_OF_YOUR_FILE.html in the address bar.
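The same server can also be started from Python code, which makes it easy to see the request and response cycle in one place. This is only a sketch: the temporary folder, the index.html content, and the choice of port 0 (let the operating system pick a free port) are assumptions for illustration.

```python
import functools
import http.server
import os
import tempfile
import threading
import urllib.request

# Hypothetical site root containing a single page.
root = tempfile.mkdtemp()
with open(os.path.join(root, "index.html"), "w", encoding="utf-8") as f:
    f.write("<html><body><p>hello</p></body></html>")

# Serve files from `root` rather than from the current working directory.
handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=root)

# Port 0 asks the operating system for any free port.
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
port = server.server_address[1]

# Run the server in a background thread so that we can send a request to it.
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()

# Bypass any system proxy settings for this local request.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(f"http://127.0.0.1:{port}/index.html") as resp:
    status = resp.status
    body = resp.read().decode("utf-8")

server.shutdown()
server.server_close()

print(status)
print(body)
```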
Webpages are written in a standard markup language called HTML (HyperText Markup Language). The basic syntax of HTML consists of elements enclosed within < and > symbols. Browsers such as Firefox and Chrome parse these tags and render the content of a webpage in the designated format.
Here is a list of important tags and their descriptions.
html - Surrounds the entire document.
head - Contains info about the document itself. E.g. the title, any external stylesheets or scripts, etc.
title - Assigns title to page. This title is used while bookmarking.
body - The main part of the document.
h1, h2, h3, ... - Headings (the smaller the number, the larger the heading).
p - Paragraph.
br - Line break.
em - emphasize text.
strong or b - Bold font.
a - Defines a hyperlink and allows you to link out to the other webpages.
img - Place an image.
ul, ol, li - Unordered lists with bullets, ordered lists with numbers and each item in list respectively.
table, th, td, tr - Make a table, specifying contents of each cell.
<!-- ... --> - Comments; they will not be displayed.
span - This will not visibly change anything on the webpage. But it is important while referencing in CSS or JavaScript. It spans a section of text, say, within a paragraph.
div - This will not visibly change anything on the webpage. But it is important while referencing in CSS or JavaScript. It stands for division and allocates a section of a page.
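To see several of these tags working together, the sketch below holds a small page in a Python string and uses the standard library's html.parser to list the opening tags in the order a browser would encounter them. The page content is made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical minimal page using some of the tags listed above.
page = """<!DOCTYPE html>
<html>
  <head>
    <title>Example page</title>
  </head>
  <body>
    <h1>Heading</h1>
    <!-- this comment is not displayed -->
    <p>A paragraph with <em>emphasis</em> and a <a href="https://example.com">link</a>.</p>
    <ul>
      <li>first item</li>
      <li>second item</li>
    </ul>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(page)
print(collector.tags)
```

Saving the page string to a .html file and opening it in a browser shows how each tag is rendered.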
While HTML directly deals with the content and structure, CSS (Cascading Style Sheets) is the primary language that is used for the look and formatting of a web document.
A CSS stylesheet consists of one or more selectors, properties and values. For example:
body {
background-color: white;
color: steelblue;
}
Selectors are the HTML elements to which the specific styles (combination of properties and values) will be applied. In the above example, all text within the body tags will be in steelblue.
There are three ways to include CSS code in HTML. This is called "referencing".
Embed CSS in HTML - You can place the CSS code within style tags inside the head tags. This way you can keep everything within a single HTML file, but it makes the file lengthy.
<head>
<style type="text/css">
.description {
font: 16px "Times New Roman", serif;
}
.viz {
font: 10px sans-serif;
}
</style>
</head>
Reference an external stylesheet from HTML - This is a much cleaner way but results in the creation of another file. To do this, you can copy the CSS code into a text file and save it as a .css file in the same folder as the HTML file. In the document head in the HTML code, you can then do the following:
<head>
<link rel="stylesheet" href="stylesheet.css">
</head>
Attach inline styles - You can also directly attach the styles in-line along with the main HTML code in the body. This makes it easy to customize specific elements but makes the code very messy, because the design and content get mixed up.
<p style="color: green; font-size:36px; font-weight:bold;">
Inline styles can be handy sometimes.
</p>
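Since the rest of this notebook is in Python, here is a sketch that writes a page and an external stylesheet side by side from Python, so the link between the two files is visible in one place. The temporary folder and the file contents are assumptions for illustration; the file name stylesheet.css matches the link tag shown above.

```python
import os
import tempfile

# Hypothetical output folder; in practice this would be your website root.
site_dir = tempfile.mkdtemp()

css = """body {
  background-color: white;
  color: steelblue;
}
"""

# The href is relative, so stylesheet.css must sit next to index.html.
html = """<!DOCTYPE html>
<html>
  <head>
    <title>Styled page</title>
    <link rel="stylesheet" href="stylesheet.css">
  </head>
  <body>
    <p>Styled by the external stylesheet.</p>
    <p style="color: green;">Styled inline.</p>
  </body>
</html>
"""

with open(os.path.join(site_dir, "stylesheet.css"), "w", encoding="utf-8") as f:
    f.write(css)
with open(os.path.join(site_dir, "index.html"), "w", encoding="utf-8") as f:
    f.write(html)

print(sorted(os.listdir(site_dir)))
```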
Q5: Create a simple HTML page that displays the Voronoi diagram that you saved. Feel free to add more plots, explanations, and any styles. Make sure to check you can run the Python webserver and open the HTML file that you created.
Btw, you can also export Jupyter notebook into various formats. Click File -> Export Notebook As and play with it.