# Why is visualization important?

There are many reasons why data visualization is important; some of the main ones are:

1. For better understanding of the data in question.

2. For sharing the insights with others.

3. For communicating results effectively to non-technical audiences.

For these purposes, Python offers a data visualization package called Matplotlib.

Let's look at Matplotlib in more detail. Before we proceed, we first have to import the package.

Matplotlib comes with dozens of sub-packages and functions; the ones used here are among the most common:

1. `pyplot`, the plotting sub-package.
2. `scatter()`, a plotting function provided by `pyplot`.

Importing convention:

`import matplotlib.pyplot as plt`

Note: `plt` is only an alias for `matplotlib.pyplot`; don't be confused by it.

## Exercise 1

RQ1: What is a characteristic of data visualization?

Ans: Visualization is a very powerful tool for exploring your data and reporting results.

RQ2: What is the conventional way of importing the pyplot sub-package from the matplotlib package?

Ans: `import matplotlib.pyplot as plt`

RQ3: You are creating a line plot using the following code:

``````
a = [1, 2, 3, 4]
b = [3, 9, 2, 6]
plt.plot(a, b)
plt.show()
``````

Which two options describe the result of your code?

Ans: `a` : Horizontal axis, `b` : Vertical axis.

RQ4: You are modifying the following code that calls the plot() function to create a line plot:

``````
a = [1, 2, 3, 4]
b = [3, 9, 2, 6]
plt.plot(a, b)
plt.show()
``````

What should you change in the code to create a scatter plot instead of a line plot?

Ans: Change `plot()` in `plt.plot()` to `scatter()`.
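A minimal sketch of the swap described in RQ4, reusing the lists `a` and `b` from the question. The `matplotlib.use("Agg")` line is an assumption added here so the example also runs without a display; it is not part of the original exercise.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

a = [1, 2, 3, 4]
b = [3, 9, 2, 6]

# plt.plot() joins consecutive points with line segments
lines = plt.plot(a, b)
plt.clf()  # clear the figure before drawing the second version

# plt.scatter() draws one unconnected marker per (a, b) pair
points = plt.scatter(a, b)
plt.show()
```

Both calls take the same two lists in the same order, which is why only the function name has to change.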


## Lab:

Objective:

• Experiment with matplotlib package.
• Create both line plots and scatter plots.


### Line Plot 1

General recipe:

``````
import matplotlib.pyplot as plt

# plt.plot(<values for the horizontal axis>, <values for the vertical axis>)
plt.plot(x, y)

plt.show()
``````

Preface:

In the video, you already saw how much the world population has grown over the past years. Will it continue to do so?

The World Bank has estimates of the world population for the years 1950 up to 2100.

• The years are loaded in your workspace as a list called `year`.
• The corresponding populations are loaded as a list called `pop`.

Instructions:

• `print()` the last item from both the `year` and the `pop` list to see what the predicted population for the year 2100 is.
• Before you can start, you should import `matplotlib.pyplot` as `plt`. `pyplot` is a sub-package of `matplotlib`, hence the dot.
• Use `plt.plot()` to build a line plot. `year` should be mapped on the horizontal axis, `pop` on the vertical axis. Don't forget to finish off with the `show()` function to actually display the plot.


``````

# Print the last item from year and pop
# print(year[-1])
# print(pop[-1])

# Import matplotlib.pyplot as plt
# import matplotlib.pyplot as plt

# Make a line plot: year on the x-axis, pop on the y-axis
# plt.plot( year, pop)
# plt.show()

``````

### Line Plot 2

Question: What is the first year in which there will be more than ten billion human beings on this planet?

Ans: Around 2060, when the projected world population first rises above approximately 10 billion.
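One way to answer this without reading the value off the plot is to scan the two lists in parallel. The sketch below is illustrative only: the helper name `first_year_above` and the short toy lists are assumptions, not the World Bank data used in the exercise.

```python
def first_year_above(years, values, threshold):
    """Return the first year whose value exceeds threshold, or None."""
    for yr, value in zip(years, values):
        if value > threshold:
            return yr
    return None


# Toy placeholder numbers (NOT real population estimates)
toy_years = [1, 2, 3, 4]
toy_values = [5, 8, 12, 15]

print(first_year_above(toy_years, toy_values, 10))  # 3
```

In the exercise, the same call would take `year` and `pop`, with the threshold expressed in whatever unit `pop` uses.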


### Line Plot 3

Preface:

Now that you've built your first line plot, let's start working on the data that Professor Hans Rosling used to build his beautiful bubble chart. It was collected in 2007. Two lists are available for you:

• `life_exp`, which contains the life expectancy for each country, and
• `gdp_cap`, which contains the GDP per capita for each country, expressed in US dollars.

GDP stands for Gross Domestic Product. It basically represents the size of the economy of a country. Divide this by the population and you get the GDP per capita.
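The division described above can be written as a one-line function. This is only a sketch: the function name `gdp_per_capita` and the numbers for the hypothetical country are assumptions made for illustration.

```python
def gdp_per_capita(gdp, population):
    """GDP per capita: total GDP divided by the number of inhabitants."""
    return gdp / population


# Hypothetical country: 1 billion USD of GDP and 50,000 inhabitants
print(gdp_per_capita(1_000_000_000, 50_000))  # 20000.0
```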

Instructions:

• Print the last item from both the list `gdp_cap` and the list `life_exp`; it is information about Zimbabwe.
• Build a line chart, with `gdp_cap` on the x-axis and `life_exp` on the y-axis. Does it make sense to plot this data on a line plot?
• Don't forget to finish off with a `plt.show()` command to actually display the plot.


``````

# Print the last item of gdp_cap and life_exp
# print( gdp_cap[ -1 ] )
# print( life_exp[ -1 ])

# Make a line plot, gdp_cap on the x-axis, life_exp on the y-axis
# plt.plot( gdp_cap, life_exp )

# Display the plot
# plt.show()

``````

### Scatter Plot 1

Both plot types are useful, but how do we know which one is best suited for which purpose? As a rule of thumb:

• When we have a time scale along the horizontal axis, we generally prefer a line plot.
• When we're trying to assess whether there's a correlation between two variables, we go with a scatter plot.

General recipe:

``````
import matplotlib.pyplot as plt

plt.scatter(x, y)
plt.show()
``````

Preface:

Let's continue with the `gdp_cap` versus `life_exp` plot, the GDP and life expectancy data for different countries in 2007. Maybe a scatter plot will be a better alternative?

Instructions:

• Change the line plot that's coded in the script to a scatter plot.
• A correlation will become clear when you display the GDP per capita on a logarithmic scale. Add the line `plt.xscale('log')`.
• Finish off your script with `plt.show()` to display the plot.
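To see why the logarithmic scale helps, note that it places values by their order of magnitude rather than by their raw size. A small sketch using only the standard library; the sample values are illustrative and are not taken from `gdp_cap`.

```python
import math

values = [1_000, 10_000, 100_000]

# On a linear axis, the gap between neighbours grows tenfold at each step
linear_gaps = [b - a for a, b in zip(values, values[1:])]
print(linear_gaps)  # [9000, 90000]

# On a log axis, positions follow log10(value), so the gaps come out equal
log_positions = [math.log10(v) for v in values]
log_gaps = [b - a for a, b in zip(log_positions, log_positions[1:])]
print(log_gaps)
```

This is why points that are squeezed together at the low end of a linear GDP axis get spread out once `plt.xscale('log')` is applied.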


``````

# Change the line plot below to a scatter plot
#plt.scatter(gdp_cap, life_exp)

# Put the x-axis on a logarithmic scale
#plt.xscale('log')

# Show plot
#plt.show()

``````

### Scatter Plot 2

Preface:

In the previous exercise, you saw that a higher GDP usually corresponds to a higher life expectancy. In other words, there is a positive correlation.

Do you think there's a relationship between population and life expectancy of a country?

Instructions:

• Start from scratch: import `matplotlib.pyplot` as `plt`.
• Build a scatter plot, where `pop` is mapped on the horizontal axis, and `life_exp` is mapped on the vertical axis.
• Finish the script with `plt.show()` to actually display the plot. Do you see a correlation?


``````
# Import package
# import matplotlib.pyplot as plt

# Build scatter plot
# plt.scatter(pop, life_exp)

# Show plot
# plt.show()

# Conclusion: there's no correlation between population
# and life expectancy, which makes perfect sense.
``````

### Histograms

In descriptive statistics, a histogram builds on simpler precursor methods such as dot plots on a number line. It is a tool for visualizing the frequency distribution of a variable.

In particular, histograms are a useful visualization tool for quantitative data.

Histogram concept:

• Start off with a number line, with the data points placed along it according to their magnitude.
• Next, divide the line into equal chunks, called bins.
• Count the number of data points that fall in each bin.
• Finally, we draw a bar for each bin. The height of the bar corresponds to the number of data points that fall in this bin.
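The steps above can be sketched in plain Python. This is a minimal illustration rather than how Matplotlib is implemented: the function name `histogram_counts` is hypothetical, and it assumes the data holds at least two distinct values so that the bin width is not zero.

```python
def histogram_counts(data, num_bins):
    """Split the range of data into equal bins and count the values in each."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins  # every bin has the same width
    edges = [lo + i * width for i in range(num_bins + 1)]

    counts = [0] * num_bins
    for value in data:
        index = int((value - lo) / width)
        index = min(index, num_bins - 1)  # the maximum belongs to the last bin
        counts[index] += 1
    return edges, counts


sample = [1, 3, 6, 3, 2, 7, 3, 9, 7, 5, 2, 4]
edges, counts = histogram_counts(sample, 4)
print(edges)   # [1.0, 3.0, 5.0, 7.0, 9.0]
print(counts)  # [3, 4, 2, 3]
```

Each entry in `counts` is the height of one bar; drawing those bars between consecutive `edges` gives the histogram.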


### Creating Histograms with Matplotlib

Importing convention:

`import matplotlib.pyplot as plt`

followed by calling the histogram function through `plt`:

`plt.hist(<list variable>, <no. of bins>)`

Note: A nice feature of `hist()` is that it automatically computes the boundaries of all the bins and counts how many values fall in each one.
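The computed boundaries and counts are also handed back as return values, so they can be inspected directly. A short sketch with a small sample list; the `matplotlib.use("Agg")` line is an assumption added so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

x = [0, 0, 0, 0, 0, 0, 0, 0, 0, 20]

# plt.hist() returns the per-bin counts, the bin edges, and the drawn bars
counts, edges, bars = plt.hist(x, 5)

print(counts)  # number of values in each of the 5 bins
print(edges)   # 6 boundaries, evenly spaced between min(x) and max(x)
```

Note that there is always one more edge than there are bins, because every bin needs a left and a right boundary.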

### Exercise 2

RQ1: What is a characteristic of a histogram?

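Ans: It shows how the values of a variable are distributed, by counting how many data points fall in each bin.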

RQ2: You are working with a Python list with 10 different values. You divide the values into 5 equally-sized bins.

How wide will these bins be if the lowest value in your list is 0 and the highest is 20?

Ans: The range of the values is 20 - 0 = 20. Dividing it into 5 bins gives each bin a width of 20 / 5 = 4.

For a visual cue, see the cell below.

``````
import matplotlib.pyplot as plt

x = [0, 0, 0, 0, 0, 0, 0, 0, 0, 20]

# 5 bins over the range 0 to 20, so each bin is 4 wide
plt.hist(x, 5)

plt.show()
``````

RQ3: You write the following code:

``````
import matplotlib.pyplot as plt
x = [1, 3, 6, 3, 2, 7, 3, 9, 7, 5, 2, 4]
plt.hist(x)
plt.show()
``````

You need to extend the plt.hist() command to specifically set the number of bins to 4. What should you do?

Ans: `plt.hist(x, 4)`, or equivalently `plt.hist(x, bins=4)`


## Lab histograms

Objective:

• Experiment with histograms.
• Work with different bins.
• Work with different datasets.

Lab exercises:

• Choose the right plot 1.
• Choose the right plot 2.


### Build a histogram 1

Preface:

`life_exp` is the list containing data on the life expectancy for different countries in 2007 (available on DataCamp only).

To see how life expectancy in different countries is distributed, let's create a histogram of `life_exp`.

Instructions:

• Use `plt.hist()` to create a histogram of the values in `life_exp`. Do not specify the number of bins; Matplotlib will set the number of bins to 10 by default for you.
• Add `plt.show()` to actually display the histogram. Can you tell which bin contains the most observations?

``````
# Create histogram of life_exp data
plt.hist(life_exp)

# Display histogram
plt.show()
``````


### Build a histogram 2 : bins

• By default, Matplotlib sets the number of bins to 10.
• The number of bins is important, because it lets you zoom in or out of the data:
  • zooming in (more bins) shows much more detail, but not the bigger picture;
  • zooming out (fewer bins) shows the bigger picture.
• To control the number of bins your data is divided into, set the `bins` argument.
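A short sketch of the `bins` argument in use, with a small sample list standing in for `life_exp` (the values are not life expectancy data). The `matplotlib.use("Agg")` line is an assumption so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

x = [1, 3, 6, 3, 2, 7, 3, 9, 7, 5, 2, 4]  # sample values, not life_exp

# Few bins: a coarse summary of the distribution
coarse_counts, coarse_edges, _ = plt.hist(x, bins=3)
plt.clf()  # wipe the figure so the next histogram starts fresh

# More bins: finer detail, but fewer values per bar
fine_counts, fine_edges, _ = plt.hist(x, bins=8)
plt.clf()

print(len(coarse_counts), len(fine_counts))  # 3 8
```

The same 12 values are counted both times; only the number of bars they are spread over changes.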

Preface:

We'll be making two plots here.

• Use `plt.clf()` to clean up after each plot so that the next one starts fresh.

Instructions:

• Build a histogram of `life_exp` with `5` bins. Can you tell which bin contains the most observations?
• Build another histogram of `life_exp`, this time with `20` bins. Is this better?


``````
# Build histogram with 5 bins
# Ans: plt.hist(life_exp, bins=5)
# The 4th and 5th bins contain the most observations.

# Show and clean up plot
# plt.show()
# plt.clf()

# Build histogram with 20 bins
# Ans: plt.hist(life_exp, bins=20)
# Much better: the 15th bin contains the most observations,
# i.e. the most common life expectancy across countries is about 71-73 years.

# Show and clean up again
# plt.show()
# plt.clf()
``````

### Build a histogram 3 : compare

Preface

In the video, you saw population pyramids for the present day and for the future. Because we were using a histogram, it was very easy to make a comparison.

Let's do a similar comparison.

`life_exp` contains life expectancy data for different countries in 2007. You also have access to a second list now, `life_exp1950`, containing similar data for 1950. Can you make a histogram for both datasets?

You'll again be making two plots. The `plt.show()` and `plt.clf()` commands to render everything nicely are already included. Also, `matplotlib.pyplot` is imported for you as `plt`.

Instructions:

• Build a histogram of `life_exp` with 15 bins.
• Build a histogram of `life_exp1950`, also with 15 bins. Is there a big difference with the histogram for the 2007 data?


``````
# Histogram of life_exp, 15 bins
# Ans: plt.hist(life_exp, bins=15)

# Show and clear plot
# plt.show()
# plt.clf()

# Histogram of life_exp1950, 15 bins
# Ans: plt.hist(life_exp1950, bins=15)

# Show and clear plot again
# plt.show()
# plt.clf()

# Conclusion: Neither of these histograms is useful for
# better understanding the life expectancy data.
#
# Why?
``````

### Choose the right plot 1

Scenario:

You're a professor teaching Data Science with Python, and you want to visually assess if the grades on your exam follow a normal distribution. Which plot do you use?

Answer: A histogram. It is a very good tool for visualizing the frequency distribution of one or more variables, so it is also a good way to see whether the distribution in question follows a normal (Gaussian) distribution.

### Choose the right plot 2

Scenario:

You're a professor in Data Analytics with Python, and you want to visually assess if longer answers on exam questions lead to higher grades. Which plot do you use?

Answer: Here we are looking for a visual relationship, or correlation, between two variables: answer length and grade.

A scatter plot is a good tool for this. If the data points are spread out with no pattern, there is no apparent relationship; if they group along a line, there is some kind of relationship between the variables in question.


## Lecture: Customization

Data visualization is:

• Both a science and an art.
• A way to tell a story with data.
• Full of options:
  • You can create many different types of plots.
  • For each plot, there is a practically unlimited number of customizations, such as colors, shapes, labels, legends and axes.
• A matter of choice, which depends on:
  • The data.
  • The story you want to tell.

### Exercise 3:

RQ1: You are customizing a plot by labelling its axes. You need to do this by using matplotlib.

Which code should you use?

Ans: `xlabel("x-axis title")` and `ylabel("y-axis title")`.

RQ2: Which matplotlib function do you use to build a line plot where the area under the graph is colored?

Ans: `fill_between()`
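A minimal sketch of `fill_between()` in use. The data values are made up for illustration, and the `matplotlib.use("Agg")` line is an assumption so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [3, 9, 2, 6]

plt.plot(x, y)  # the line itself

# Shade the region between the line and y = 0
area = plt.fill_between(x, y, 0, alpha=0.3)
plt.show()
```

The third argument is the lower boundary of the shaded region; `0` is also the default, so `plt.fill_between(x, y)` shades the same area.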

RQ3: Typically, you place all customization commands between the plot() call and the show() call, as follows:

``````
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [4, 5, 6]
plt.plot(x, y)
# customization here
plt.show()
``````

What will happen if you place the customization code after the show() function instead?

``````
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [4, 5, 6]
plt.plot(x, y)
plt.show()
# customization here
``````

Ans: Let's check it out!

``````
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [4, 5, 6]

plt.plot(x, y)

# customization here
plt.xlabel("var1")
plt.ylabel("var2")

plt.show()
``````
``````
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [4, 5, 6]

plt.plot(x, y)
plt.show()

# Customization placed after show(): the plot has already been displayed
plt.xlabel("var1")
plt.ylabel("var2")

# It seems that customization should be done between
# the plot() and show() calls.
``````

## Lab : Customization

Objective:

• Customization of visual data.

### Labels:

You're going to work on the scatter plot with world development data: GDP per capita on the x-axis (logarithmic scale), life expectancy on the y-axis. The code for this plot is available in the script.

As a first step, let's add axis labels and a title to the plot. You can do this with the `xlabel()`, `ylabel()` and `title()` functions, available in `matplotlib.pyplot`. This sub-package is already imported as `plt`.

Instructions:

• The strings `xlab` and `ylab` are already set for you. Use these variables to set the label of the x- and y-axis.

• The string `title` is also coded for you. Use it to add a title to the plot.

• After these customizations, finish the script with `plt.show()` to actually display the plot.

``````

# Basic scatter plot, log scale
# plt.scatter(gdp_cap, life_exp)
# plt.xscale('log')

# Strings
# xlab = 'GDP per Capita [in USD]'
# ylab = 'Life Expectancy [in years]'
# title = 'World Development in 2007'

# plt.xlabel(xlab)
# plt.ylabel(ylab)

# plt.title(title)

# After customizing, display the plot
# plt.show()

``````

### Ticks:

Ticks are the markers along an axis. We can control them by specifying two arguments: a list of tick positions and a list of labels to show at those positions.

Syntax: `plt.xticks(<list of positions>, <list of label strings>)`, and likewise for `plt.yticks()`.

E.g. `plt.yticks([0, 1, 2], ["one", "two", "three"])`

• The ticks at positions 0, 1 and 2 will be labeled "one", "two" and "three" respectively.
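The example above can be run as a small sketch that reads the resulting labels back instead of displaying the figure. The plotted data is a placeholder, and the `matplotlib.use("Agg")` line is an assumption so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

plt.plot([0, 1, 2], [0, 1, 2])  # placeholder data

# First list: positions on the axis; second list: text shown at each position
_, labels = plt.yticks([0, 1, 2], ["one", "two", "three"])

# Each returned label is a text object; get_text() gives the string it shows
print([label.get_text() for label in labels])
```

Both lists must have the same length, since each label is paired with the position at the same index.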

Preface:

Let's do a similar thing for the x-axis of your world development chart, with the `xticks()` function. The tick values `1000, 10000 and 100000` should be replaced by `1k, 10k and 100k`. To this end, two lists have already been created for you: `tick_val` and `tick_lab`.

Instructions:

• Use `tick_val` and `tick_lab` as inputs to the `xticks()` function to make the plot more readable.
• As usual, display the plot with `plt.show()` after you've added the customizations.


``````

# Scatter plot
# plt.scatter(gdp_cap, life_exp)

# Previous customizations
# plt.xscale('log')
# plt.xlabel('GDP per Capita [in USD]')
# plt.ylabel('Life Expectancy [in years]')
# plt.title('World Development in 2007')

# Definition of tick_val and tick_lab
# tick_val = [1000,10000,100000]
# tick_lab = ['1k','10k','100k']

# Adapt the ticks on the x-axis
# plt.xticks(tick_val, tick_lab)

# After customizing, display the plot
# plt.show()

``````

### Sizes:

Preface: Our scatter plot at present is just a cloud of blue dots, indistinguishable from each other. We can make the size of each dot correspond to the population of its country.

• `pop` is a list of population numbers for each country, expressed in millions.

• `s` is the argument of the `scatter()` function that sets the marker size.

Instructions:

• Run the script to see how the plot changes.
• Increase the size of the bubbles to emphasize them more:
  • Import the `numpy` package as `np`.
  • Use `np.array()` to create a NumPy array from the list `pop`. Call this NumPy array `np_pop`.
  • Double the values in `np_pop` by assigning `np_pop * 2` to `np_pop` again. Because `np_pop` is a NumPy array, each array element will be doubled.
• Change the `s` argument inside `plt.scatter()` to be `np_pop` instead of `pop`.
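The conversion to a NumPy array matters because `* 2` means something different for a plain list. A small sketch with toy values (not the real `pop` list):

```python
import numpy as np

sizes = [1.5, 20.0, 3.25]  # toy values, not the real pop list

# Multiplying a plain list repeats it instead of scaling its items
print(sizes * 2)  # [1.5, 20.0, 3.25, 1.5, 20.0, 3.25]

# Multiplying a NumPy array scales every element
np_sizes = np.array(sizes)
np_sizes = np_sizes * 2
print(np_sizes.tolist())  # [3.0, 40.0, 6.5]
```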


``````

# Import numpy as np
# import numpy as np

# Store pop as a numpy array: np_pop
# np_pop = np.array(pop)

# Double np_pop
# np_pop *= 2

# Update: set s argument to np_pop
# plt.scatter(gdp_cap, life_exp, s = np_pop)

# Previous customizations
# plt.xscale('log')
# plt.xlabel('GDP per Capita [in USD]')
# plt.ylabel('Life Expectancy [in years]')
# plt.title('World Development in 2007')
# plt.xticks([1000, 10000, 100000],['1k', '10k', '100k'])

# Display the plot
# plt.show()

``````

### Colors:

Preface:

Let's make our plot more colorful.

• `col` is a list with a color for each corresponding country, depending on the continent the country is part of.

How did we make the list `col`?

• We define a dictionary with the continents as keys, mapped to colors as values. It is named `continent_colors` here, because calling it `dict` would shadow Python's built-in `dict` type.

``````
continent_colors = {
    'Asia': 'red',
    'Europe': 'green',
    'Africa': 'blue',
    'Americas': 'yellow',
    'Oceania': 'black'
}
``````
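From such a dictionary, a list like `col` can be built with one lookup per country. The sketch below is an assumption about how this could be done, not the course's actual code: the `continents` list is hypothetical and stands for the continent of each country, in the same order as the data lists.

```python
continent_colors = {
    'Asia': 'red',
    'Europe': 'green',
    'Africa': 'blue',
    'Americas': 'yellow',
    'Oceania': 'black',
}

# Hypothetical per-country continent labels, in the same order as the data lists
continents = ['Asia', 'Europe', 'Africa', 'Asia']

# Look up the color for each country's continent
col = [continent_colors[name] for name in continents]
print(col)  # ['red', 'green', 'blue', 'red']
```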

Instructions:

• Add `c = col` to the arguments of the `plt.scatter()` function.

• Change the opacity of the bubbles by setting the `alpha` argument to `0.8` inside `plt.scatter()`. Alpha can be set from zero to one, where zero is totally transparent and one is not transparent at all.


``````
# Specify c and alpha inside plt.scatter()
# plt.scatter(x=gdp_cap, y=life_exp, s=np.array(pop) * 2, c=col, alpha=0.8)

# Previous customizations
# plt.xscale('log')
# plt.xlabel('GDP per Capita [in USD]')
# plt.ylabel('Life Expectancy [in years]')
# plt.title('World Development in 2007')
# plt.xticks([1000, 10000, 100000], ['1k', '10k', '100k'])

# Show the plot
# plt.show()
``````

Preface:

If you have another look at the script, under # Additional Customizations, you'll see that there are two `plt.text()` functions now. They add the words "India" and "China" in the plot.

Instructions:

• Add `plt.grid(True)` after the `plt.text()` calls so that gridlines are drawn on the plot.


``````
# Scatter plot
# plt.scatter(x=gdp_cap, y=life_exp, s=np.array(pop) * 2, c=col, alpha=0.8)

# Previous customizations
# plt.xscale('log')
# plt.xlabel('GDP per Capita [in USD]')
# plt.ylabel('Life Expectancy [in years]')
# plt.title('World Development in 2007')
# plt.xticks([1000, 10000, 100000], ['1k', '10k', '100k'])

# Additional customizations
# plt.text(1550, 71, 'India')
# plt.text(5700, 80, 'China')

# Add grid lines
# plt.grid(True)

# Show the plot
# plt.show()
``````

### Interpretations:

Preface:

If you have a look at your colorful plot, it's clear that people live longer in countries with a higher GDP per capita.

No high income countries have really short life expectancy, and no low income countries have very long life expectancy.

Still, there is a huge difference in life expectancy between countries on the same income level.

Most people live in middle income countries where difference in lifespan is huge between countries; depending on how income is distributed and how it is used.

What can you say about the plot?