Why is visualization important?


There are many reasons why data visualization is important; some of the main ones are:

  1. For better understanding of the data in question.

  2. For sharing the insights with others.

  3. For communicating results effectively, including to non-technical audiences.


For this purpose, Python offers a data visualization package called "Matplotlib".

Let's see some help on matplotlib, but before we proceed with that, we first have to import the package.

Matplotlib contains dozens of sub-packages and functions; the ones used here are some of the most common:

  1. pyplot, a sub-package (module) of matplotlib.
  2. scatter(), a function inside pyplot.

Importing convention:

import matplotlib.pyplot as plt

Note: plt is just an alias for matplotlib.pyplot; don't be confused by it.


Exercise 1


RQ1: What is a characteristic of data visualization?

Ans: Visualization is a very powerful tool for exploring your data and reporting results.


RQ2: What is the conventional way of importing the pyplot sub-package from the matplotlib package?

Ans: import matplotlib.pyplot as plt


RQ3: You are creating a line plot using the following code:

        a = [1, 2, 3, 4]
        b = [3, 9, 2, 6]
        plt.plot(a, b)
        plt.show()

Which two options describe the result of your code?

Ans: a : Horizontal axis, b : Vertical axis.
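
The mapping of arguments to axes can also be checked in code. The sketch below is only an illustration: it reads the data back from the line object that plt.plot() returns, and the names line, x_data and y_data are chosen here for the example.

```python
import matplotlib.pyplot as plt

a = [1, 2, 3, 4]
b = [3, 9, 2, 6]

# plt.plot() returns a list of line objects, one per plotted line
(line,) = plt.plot(a, b)

# The first argument ends up as the horizontal (x) data,
# the second argument as the vertical (y) data
x_data = [int(v) for v in line.get_xdata()]
y_data = [int(v) for v in line.get_ydata()]
print(x_data)
print(y_data)

plt.show()
```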


RQ4: You are modifying the following code that calls the plot() function to create a line plot:

        a = [1, 2, 3, 4]
        b = [3, 9, 2, 6]
        plt.plot(a, b)
        plt.show()

What should you change in the code to create a scatter plot instead of a line plot?

Ans: Change plot() in plt.plot() to scatter().
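
As a quick sketch of that change, here is the same data drawn with scatter(). The variable names points and n_points are introduced only for this example.

```python
import matplotlib.pyplot as plt

a = [1, 2, 3, 4]
b = [3, 9, 2, 6]

# Same arguments as plt.plot(), but each (a, b) pair is drawn
# as a separate dot instead of being joined by a line
points = plt.scatter(a, b)

# The returned collection holds one (x, y) offset per dot
n_points = len(points.get_offsets())
print(n_points)

plt.show()
```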


Go to top: TOC

Lab:


Objective:

  • Experiment with matplotlib package.
  • Create both line plots and scatter plots.

Go to top: TOC

Line Plot 1


General Recipe:

    import matplotlib.pyplot as plt

    plt.plot(< variable@Horizontal_axis >, < variable@Vertical_axis >)    # e.g. plt.plot( x, y )

    plt.show()

Preface:

In the video, you already saw how much the world population has grown over the past years. Will it continue to do so?

The World Bank has estimates of the world population for the years 1950 up to 2100.

  • The years are loaded in your workspace as a list called year.
  • The corresponding populations are loaded as a list called pop.

Instructions:

  • print() the last item from both the year and the pop list to see what the predicted population for the year 2100 is.
  • Before you can start, you should import matplotlib.pyplot as plt.

    • pyplot is a sub-package of matplotlib, hence the dot.
  • Use plt.plot() to build a line plot. year should be mapped on the horizontal axis, pop on the vertical axis.

    • Don't forget to finish off with the show() function to actually display the plot.

Go to top: TOC


In [2]:
# Print the last item from year and pop
# print(year[-1])
# print(pop[-1])


# Import matplotlib.pyplot as plt
# import matplotlib.pyplot as plt

# Make a line plot: year on the x-axis, pop on the y-axis
# plt.plot( year, pop)
# plt.show()
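
The cell above is commented out because year and pop only exist in the course workspace. A minimal runnable sketch, assuming small placeholder lists (the numbers are made up for illustration and are not the World Bank estimates):

```python
import matplotlib.pyplot as plt

# Placeholder values for illustration only (NOT the World Bank estimates)
year = [1950, 2000, 2050, 2100]
pop = [2.5, 6.0, 9.5, 11.0]   # billions, made up for this sketch

# Index -1 selects the last item of a list
print(year[-1])
print(pop[-1])

# year on the horizontal axis, pop on the vertical axis
plt.plot(year, pop)
plt.show()
```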

Line Plot 2


Question: What is the first year in which there will be more than ten billion human beings on this planet?

Ans: By 2060, the world population will rise to approximately 10 billion.
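
Instead of reading the year off the plot, the two lists can also be searched directly. The sketch below assumes placeholder lists (made-up numbers, not the World Bank estimates) and a hypothetical variable first_year:

```python
# Placeholder values for illustration only (NOT the World Bank estimates)
year = [2040, 2050, 2060, 2070]
pop = [9.2, 9.7, 10.1, 10.4]   # billions, made up for this sketch

# zip() pairs each year with its population; next() returns the first
# year whose population exceeds 10 billion, or None if there is none
first_year = next((y for y, p in zip(year, pop) if p > 10), None)
print(first_year)
```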


Go to top: TOC

Line plot 3


Preface:

Now that you've built your first line plot, let's start working on the data that professor Hans Rosling used to build his beautiful bubble chart. It was collected in 2007. Two lists are available for you:

  • life_exp which contains the life expectancy for each country and

  • gdp_cap, which contains the GDP per capita for each country, expressed in US dollars.

GDP stands for Gross Domestic Product. It basically represents the size of the economy of a country. Divide this by the population and you get the GDP per capita.


Instructions:

  • Print the last item from both the list gdp_cap, and the list life_exp; it is information about Zimbabwe.
  • Build a line chart, with gdp_cap on the x-axis, and life_exp on the y-axis.

    • Does it make sense to plot this data on a line plot?
  • Don't forget to finish off with a plt.show() command, to actually display the plot.

Go to top: TOC


In [3]:
# Print the last item of gdp_cap and life_exp
# print( gdp_cap[ -1 ] )
# print( life_exp[ -1 ])

# Make a line plot, gdp_cap on the x-axis, life_exp on the y-axis
# plt.plot( gdp_cap, life_exp )

# Display the plot
# plt.show()

Scatter Plot 1


It's fine to use such tools, but how do we know which one is best suited for which purpose? As a rule of thumb:

  • When we have a time scale along the horizontal axis, we generally prefer a line plot.

  • When we're trying to assess whether there's a correlation between two variables, we go with a scatter plot.

Importing convention:

    import matplotlib.pyplot as plt
    plt.scatter( x, y )
    plt.show()

Preface:

Let's continue with the gdp_cap versus life_exp plot, the GDP and life expectancy data for different countries in 2007. Maybe a scatter plot will be a better alternative?

Instructions:

  • Change the line plot that's coded in the script to a scatter plot.
  • A correlation will become clear when you display the GDP per capita on a logarithmic scale. Add the line plt.xscale('log').
  • Finish off your script with plt.show() to display the plot.

Go to top: TOC


In [4]:
# Change the line plot below to a scatter plot
#plt.scatter(gdp_cap, life_exp)

# Put the x-axis on a logarithmic scale
#plt.xscale('log')

# Show plot
#plt.show()
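
Since gdp_cap and life_exp are only available in the course workspace, here is a small runnable sketch with made-up values (not the 2007 dataset). The variable x_scale is added only to confirm that the axis is logarithmic.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only (not the 2007 dataset)
gdp_cap = [500, 1500, 5000, 20000, 45000]
life_exp = [45, 55, 68, 76, 80]

plt.scatter(gdp_cap, life_exp)

# Put the x-axis on a logarithmic scale
plt.xscale('log')

# Read the scale back from the current axes as a sanity check
x_scale = plt.gca().get_xscale()
print(x_scale)

plt.show()
```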

Scatter Plot 2


Preface:

In the previous exercise, you saw that a higher GDP usually corresponds to a higher life expectancy. In other words, there is a positive correlation.

Do you think there's a relationship between population and life expectancy of a country?


Instructions:

  • Start from scratch: import matplotlib.pyplot as plt.
  • Build a scatter plot, where pop is mapped on the horizontal axis, and life_exp is mapped on the vertical axis.
  • Finish the script with plt.show() to actually display the plot. Do you see a correlation?

Go to top: TOC


In [ ]:
# Import package
# import matplotlib.pyplot as plt

# Build Scatter plot
# plt.scatter( pop, life_exp )

# Show plot
# plt.show()

"""Conclusion: There's no correlation between population
and life expectancy! Which makes perfect sense."""
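
To put a number on the visual impression, a correlation coefficient can be computed alongside the scatter plot. This is a sketch with made-up values (not the 2007 dataset); a result near 0 suggests no linear relationship, while values near 1 or -1 suggest a strong one.

```python
import numpy as np

# Made-up values for illustration only (not the 2007 dataset)
pop = [3, 10, 45, 80, 300, 1100]        # millions
life_exp = [72, 50, 78, 60, 75, 65]     # years

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is the
# Pearson correlation between the two lists
r = np.corrcoef(pop, life_exp)[0, 1]
print(round(r, 2))
```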

Histograms


In descriptive statistics, a histogram builds on simpler precursor methods such as dot plots on a number line. It's a tool to visualize the frequency distribution of a variable.

In particular, histograms are a useful data visualization tool for quantitative data.

Histogram concept:

  • Start off with a number line, with the data points placed on it according to their magnitude.
  • Next, divide the line into equal chunks, called bins.
  • Then count the number of data points that fall in each bin.
  • Finally, we draw a bar for each bin. The height of the bar corresponds to the number of data points that fall in this bin.
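
The steps above can be sketched by hand in plain Python. This is only an illustration of the idea, not how matplotlib implements it; the names n_bins, width and counts are chosen for this example, and the largest value is placed in the last bin.

```python
data = [1, 3, 6, 3, 2, 7, 3, 9, 7, 5, 2, 4]
n_bins = 4

# Equal chunks between the smallest and the largest value
low, high = min(data), max(data)
width = (high - low) / n_bins

# Count how many data points fall in each bin
counts = [0] * n_bins
for value in data:
    # Bin index for this value; the maximum goes into the last bin
    index = min(int((value - low) / width), n_bins - 1)
    counts[index] += 1

# Draw a text "bar" per bin: its length is the count in that bin
for i, count in enumerate(counts):
    left = low + i * width
    print(f"{left:4.1f} to {left + width:4.1f} | {'#' * count}")
```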

Go to top: TOC

Creating Histograms with Matplotlib


Importing convention:

import matplotlib.pyplot as plt

followed by calling the histogram function using plt.

plt.hist(<list variable>, <no. of bins>)

Note: A nice feature of .hist() is that it automatically computes the boundaries for all the bins and counts how many values fall in each one.
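
The computed boundaries and counts can be inspected, because plt.hist() returns them. A short sketch (the names counts, edges and patches are chosen here for the example):

```python
import matplotlib.pyplot as plt

x = [1, 3, 6, 3, 2, 7, 3, 9, 7, 5, 2, 4]

# plt.hist() returns the count per bin, the bin boundaries,
# and the drawn bars
counts, edges, patches = plt.hist(x, bins=4)

print(counts)   # number of values in each of the 4 bins
print(edges)    # 5 boundaries delimit 4 bins

plt.show()
```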

Exercise 2


RQ1: What is a characteristic of a histogram?

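Ans: A histogram visualizes the frequency distribution of quantitative data: the height of each bar is the number of data points that fall in that bin.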


RQ2: You are working with a Python list with 10 different values. You divide the values into 5 equally-sized bins.

How wide will these bins be if the lowest value in your list is 0 and the highest is 20?

Ans: The range of the values is 20, if we divide into 5 bins, then each bin will have a width of 4.

For a visual cue, see the cell below.


In [5]:
x = [0, 0, 0, 0, 0, 0, 0, 0, 0, 20]

import matplotlib.pyplot as plt

plt.hist( x, 5 )

plt.show()
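
The bin width can also be confirmed numerically. The sketch below uses np.histogram(), which computes counts and bin edges without drawing anything; the variable widths is introduced for this example.

```python
import numpy as np

x = [0, 0, 0, 0, 0, 0, 0, 0, 0, 20]

# Same binning idea as plt.hist(x, 5), without the plot
counts, edges = np.histogram(x, bins=5)

# Difference between consecutive edges = width of each bin
widths = np.diff(edges)

print(edges)
print(widths)
print(counts)
```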


RQ3: You write the following code:


    import matplotlib.pyplot as plt
    x = [1, 3, 6, 3, 2, 7, 3, 9, 7, 5, 2, 4]
    plt.hist(x)
    plt.show()

You need to extend the plt.hist() command to specifically set the number of bins to 4. What should you do?

Ans: plt.hist(x, 4)


Go to top: TOC

Lab histograms


Objective:

  • Experiment with histograms.
  • Work with different bins.
  • Work with different datasets.

Lab exercises:

  • Choose the right plot 1.
  • Choose the right plot 2.

Go to top: TOC

Build a histogram 1.


Preface:

life_exp is the list containing data on the life expectancy for different countries in 2007 (preloaded in the DataCamp workspace only!).

To see how life expectancy in different countries is distributed, let's create a histogram of life_exp.

Instructions:

  • Use plt.hist() to create a histogram of the values in life_exp.

    • Do not specify the number of bins; matplotlib will set the number of bins to 10 by default for you.
  • Add plt.show() to actually display the histogram. Can you tell which bin contains the most observations?

# Create histogram of life_exp data
plt.hist(life_exp)

# Display histogram
plt.show()

Go to top: TOC

Build a histogram 2 : bins


  • By default, matplotlib sets the number of bins to 10.

  • The number of bins is important:

    • It lets you zoom in or out of the data.

    • Zooming in (more bins): shows much more detail, but not the bigger picture.

    • Zooming out (fewer bins): shows the bigger picture.

  • To control the number of bins your data is divided into,

      - set the `bins` argument.

Preface:

We'll be making two plots here.

  • Use plt.clf() to clean up the figure so that the second plot starts fresh.

Instructions:

  • Build a histogram of life_exp, with 5 bins.

    • Can you tell which bin contains the most observations?
  • Build another histogram of life_exp, this time with 20 bins.

    • Is this better?

Go to top: TOC


In [1]:
# Build histogram with 5 bins
# Ans: plt.hist(life_exp, bins = 5)
#      The 4th and 5th bins contain the most observations.

# Show and clean up plot
# plt.show()
# plt.clf()

# Build histogram with 20 bins
# Ans: plt.hist( life_exp, bins = 20 )
#      Much better: the 15th bin contains the most values,
#      i.e. most countries have a life expectancy of about 71-73 years.

# Show and clean up again
# plt.show()
# plt.clf()
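
Which bin contains the most observations can also be found in code rather than by eye. A minimal sketch with made-up values (not the 2007 dataset); the variable fullest is introduced for this example.

```python
import numpy as np

# Made-up values for illustration only (not the 2007 dataset)
life_exp = [43, 48, 52, 59, 64, 70, 71, 72, 72, 73, 74, 76, 78, 80, 82]

# Counts per bin and the bin boundaries, without drawing a plot
counts, edges = np.histogram(life_exp, bins=5)

# Position of the bin with the highest count (0-based)
fullest = int(np.argmax(counts))

print(f"Fullest bin: {edges[fullest]:.1f} to {edges[fullest + 1]:.1f} "
      f"({counts[fullest]} values)")
```

Changing bins=5 to bins=20 in this sketch shows how the answer depends on the number of bins.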

Build a histogram 3 : compare


Preface

In the video, you saw population pyramids for the present day and for the future. Because we were using a histogram, it was very easy to make a comparison.

Let's do a similar comparison.

life_exp contains life expectancy data for different countries in 2007. You also have access to a second list now, life_exp1950, containing similar data for 1950. Can you make a histogram for both datasets?

You'll again be making two plots. The plt.show() and plt.clf() commands to render everything nicely are already included. Also matplotlib.pyplot is imported for you, as plt.


Instructions:

  • Build a histogram of life_exp with 15 bins.
  • Build a histogram of life_exp1950, also with 15 bins.

    • Is there a big difference with the histogram for the 2007 data?

Go to top: TOC


In [2]:
# Histogram of life_exp, 15 bins
    #Ans: plt.hist( life_exp, bins = 15)

# Show and clear plot
#plt.show()
#plt.clf()

# Histogram of life_exp1950, 15 bins
    #Ans: plt.hist( life_exp1950, bins = 15)

# Show and clear plot again
#plt.show()
#plt.clf()

"""
Conclusion: Neither one of these histograms is useful to
better understand the life expectancy data.

Why? 
"""


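
The notes leave the "Why?" open. One possible reason, stated here as an assumption, is that two separate plots each pick their own bin boundaries and axis range, which makes them hard to compare. The sketch below uses made-up values (not the course data) and draws both histograms on shared bin edges in a single figure; shared_edges is a name chosen for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up values for illustration only (not the course data)
life_exp1950 = [30, 35, 38, 42, 45, 50, 55, 62, 68, 70]
life_exp = [45, 52, 60, 66, 70, 72, 74, 76, 79, 82]

# Common bin edges spanning both datasets: 16 edges give 15 bins
low = min(min(life_exp1950), min(life_exp))
high = max(max(life_exp1950), max(life_exp))
shared_edges = np.linspace(low, high, 16)

# Passing a sequence to bins uses exactly these boundaries
plt.hist(life_exp1950, bins=shared_edges, alpha=0.5, label='1950')
plt.hist(life_exp, bins=shared_edges, alpha=0.5, label='2007')
plt.legend()
plt.show()
```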

Choose the right plot 1


Scenario:

You're a professor teaching Data Science with Python, and you want to visually assess if the grades on your exam follow a normal distribution. Which plot do you use?

Answer: A histogram is a very good tool for visualizing the frequency distribution of one or more variables, so it is also a good tool for checking visually whether the distribution in question follows a normal (Gaussian) distribution.


Choose the right plot 2

Scenario:

You're a professor in Data Analytics with Python, and you want to visually assess if longer answers on exam questions lead to higher grades. Which plot do you use?

Answer:

Here we are trying to find a visual relationship or correlation between two variables, "longer-answer" and "higher-grades".

A scatter plot is a good visualization tool for this: if the data points are "spread out", there is no relationship, whereas a "linear grouping of data points" means there's some kind of relationship between the variables in question.


Go to top: TOC

Lecture: Customization


Data visualization is:

  • Science and Art.

    • To tell a story with data.
  • We have many options, i.e. can create different types of plots.

    • For each plot, there is a practically unlimited number of customizations.

    • These may include colors, shapes, labels, legend, axes etc.

  • Choice depends on:

    • Data.

    • Story you want to tell.

Exercise 3:


RQ1: You are customizing a plot by labelling its axes. You need to do this by using matplotlib.

Which code should you use?

Ans: xlabel("x-axis title") and ylabel("y-axis title").


RQ2: Which matplotlib function do you use to build a line plot where the area under the graph is colored?

Ans: fill_between()
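
A minimal sketch of fill_between(), assuming the same small lists used in the earlier questions. The third argument 0 means the area is shaded down to y = 0, and alpha is added here only to keep the line visible.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [3, 9, 2, 6]

# Draw the line itself
plt.plot(x, y)

# Shade the region between the line and y = 0
area = plt.fill_between(x, y, 0, alpha=0.3)

plt.show()
```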


RQ3: Typically, you place all customization commands between the plot() call and the show() call, as follows:

import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [4, 5, 6]
plt.plot(x, y)
# customization here
plt.show()

What will happen if you place the customization code after the show() function instead?

import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [4, 5, 6]
plt.plot(x, y)
plt.show()
#customization here

Ans: Let's check it out!


In [4]:
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [4, 5, 6]

plt.plot(x, y)

# customization here
plt.xlabel("var1")
plt.ylabel("var2")

plt.show()



In [5]:
"""It seems that customization should be done between
the plot() and show() calls."""

import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [4, 5, 6]

plt.plot(x, y)

plt.show()

# customization placed after show(): too late for the plot that was just displayed
plt.xlabel("var1")
plt.ylabel("var2")


Out[5]:
<matplotlib.text.Text at 0x1a03ad43908>

Lab : Customization


Objective:

  • Customization of visual data.

Labels:

You're going to work on the scatter plot with world development data: GDP per capita on the x-axis (logarithmic scale), life expectancy on the y-axis. The code for this plot is available in the script.

As a first step, let's add axis labels and a title to the plot. You can do this with the xlabel(), ylabel() and title() functions, available in matplotlib.pyplot. This sub-package is already imported as plt.


Instructions:

  • The strings xlab and ylab are already set for you. Use these variables to set the label of the x- and y-axis.

  • The string title is also coded for you. Use it to add a title to the plot.

  • After these customizations, finish the script with plt.show() to actually display the plot.



In [ ]:
# Basic scatter plot, log scale
# plt.scatter(gdp_cap, life_exp)
# plt.xscale('log') 

# Strings
# xlab = 'GDP per Capita [in USD]'
# ylab = 'Life Expectancy [in years]'
# title = 'World Development in 2007'

# Add axis labels
# plt.xlabel(xlab)
# plt.ylabel(ylab)

# Add title
# plt.title(title)

# After customizing, display the plot
# plt.show()

Ticks:


Ticks are like custom markers! We can control the ticks by specifying two arguments.

Syntax: plt.xticks( <list of tick values>, <list of tick labels (strings)> ), and likewise plt.yticks( ... )

E.g. plt.yticks([0,1,2], ["one","two","three"])

  • The ticks corresponding to the numbers 0, 1, 2 will be labelled "one", "two", "three" respectively.
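
As a runnable sketch of the yticks() example, the code below plots three points and reads the tick labels back. The plotted data and the name tick_labels are assumptions made only for this illustration.

```python
import matplotlib.pyplot as plt

# Any simple plot will do; these points are just for illustration
plt.plot([0, 1, 2], [0, 1, 2])

# Replace the tick values 0, 1, 2 on the y-axis by text labels;
# yticks() returns the tick locations and the label objects
locs, labels = plt.yticks([0, 1, 2], ["one", "two", "three"])

tick_labels = [label.get_text() for label in labels]
print(tick_labels)

plt.show()
```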

Preface:

Let's do a similar thing for the x-axis of your world development chart, with the xticks() function. The tick values 1000, 10000 and 100000 should be replaced by 1k, 10k and 100k. To this end, two lists have already been created for you: tick_val and tick_lab.

Instructions:

  • Use tick_val and tick_lab as inputs to the xticks() function to make the plot more readable.
  • As usual, display the plot with plt.show() after you've added the customizations.

Go to top: TOC


In [ ]:
# Scatter plot
# plt.scatter(gdp_cap, life_exp)

# Previous customizations
# plt.xscale('log') 
# plt.xlabel('GDP per Capita [in USD]')
# plt.ylabel('Life Expectancy [in years]')
# plt.title('World Development in 2007')

# Definition of tick_val and tick_lab
# tick_val = [1000,10000,100000]
# tick_lab = ['1k','10k','100k']

# Adapt the ticks on the x-axis
# plt.xticks(tick_val, tick_lab)

# After customizing, display the plot
# plt.show()

Sizes:


Preface: Our scatter plot at present is just a cloud of blue dots, indistinguishable from each other. We can make the size of each dot correspond to the population of the country.

  • pop is a list of population numbers for each country, expressed in millions.

  • s is the argument of the scatter() function that sets the size of the dots.


Instructions:

  • Run the script to see how the plot changes.
  • Increase the size of the bubbles to emphasize them more.

    • Import numpy package as np.

    • Use np.array() to create a numpy array from the list pop. Call this Numpy array np_pop.

    • Double the values in np_pop by assigning np_pop * 2 to np_pop again.

      • Because np_pop is a numpy array, each array element will be doubled.
  • Change the s argument inside plt.scatter() to be np_pop instead of pop.


Go to top


In [1]:
# Import numpy as np
# import numpy as np

# Store pop as a numpy array: np_pop
# np_pop = np.array(pop)

# Double np_pop
# np_pop *= 2

# Update: set s argument to np_pop
# plt.scatter(gdp_cap, life_exp, s = np_pop)

# Previous customizations
# plt.xscale('log') 
# plt.xlabel('GDP per Capita [in USD]')
# plt.ylabel('Life Expectancy [in years]')
# plt.title('World Development in 2007')
# plt.xticks([1000, 10000, 100000],['1k', '10k', '100k'])

# Display the plot
# plt.show()
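
The reason for converting pop to a numpy array first can be seen in a small sketch. The values below are made up for illustration; doubled_list is added only to show the contrast with a plain list.

```python
import numpy as np

# Made-up populations in millions, for illustration only
pop = [10.0, 50.0, 200.0]

# On a numpy array, * 2 doubles every element
np_pop = np.array(pop)
np_pop = np_pop * 2
print(np_pop)

# On a plain list, * 2 repeats the list instead of doubling the values
doubled_list = pop * 2
print(doubled_list)
```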

Colors:


Preface:

Let's make our plot more colorful.

  • col is a list with a color for each corresponding country, depending on the continent the country is part of.

How did we make the list col?

  • We define a dictionary with the continents as "keys" mapped to colors as "values".
    dict = {
        'Asia':'red',
        'Europe':'green',
        'Africa':'blue',
        'Americas':'yellow',
        'Oceania':'black'
    }
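
How the dictionary turns into the list col is not shown in the notes. One way to do it, sketched under the assumption that there is a list with one continent per country (called cont here, a hypothetical name), is a list comprehension. The dictionary is named continent_colors in this sketch to avoid shadowing Python's built-in dict.

```python
# Continent-to-color mapping from the notes
continent_colors = {
    'Asia': 'red',
    'Europe': 'green',
    'Africa': 'blue',
    'Americas': 'yellow',
    'Oceania': 'black'
}

# Hypothetical list with one continent per country, for illustration only
cont = ['Asia', 'Europe', 'Africa', 'Americas', 'Asia']

# Look up the color for each country's continent
col = [continent_colors[c] for c in cont]
print(col)
```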

Instructions:

  • Add c = col to the arguments of the plt.scatter() function.

  • Change the opacity of the bubbles by setting the alpha argument to 0.8 inside plt.scatter(). Alpha can be set from zero to one, where zero is totally transparent and one is not transparent at all.


Go to top


In [2]:
"""
# Specify c and alpha inside plt.scatter()
plt.scatter(x = gdp_cap, y = life_exp, s = np.array(pop) * 2, c = col, alpha= 0.8)

# Previous customizations
plt.xscale('log') 
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000,10000,100000], ['1k','10k','100k'])

# Show the plot
plt.show()
"""



Additional Customizations:


Preface:

If you have another look at the script, under # Additional Customizations, you'll see that there are two plt.text() functions now. They add the words "India" and "China" in the plot.

Instructions:

  • Add plt.grid(True) after the plt.text() calls so that gridlines are drawn on the plot.

Go to top


In [ ]:
"""# Scatter plot
plt.scatter(x = gdp_cap, y = life_exp, s = np.array(pop) * 2, c = col, alpha = 0.8)

# Previous customizations
plt.xscale('log') 
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000,10000,100000], ['1k','10k','100k'])

# Additional customizations
plt.text(1550, 71, 'India')
plt.text(5700, 80, 'China')

# Add grid() call
plt.grid(True)

# Show the plot
plt.show()
"""

Interpretations:


Preface:

If you have a look at your colorful plot, it's clear that people live longer in countries with a higher GDP per capita.

No high income countries have really short life expectancy, and no low income countries have very long life expectancy.

Still, there is a huge difference in life expectancy between countries on the same income level.

Most people live in middle income countries where difference in lifespan is huge between countries; depending on how income is distributed and how it is used.

What can you say about the plot?