Python 3.6 Jupyter Notebook

Sources of data

Your completion of the notebook exercises will be graded based on your ability to do the following:

Apply: Are you able to execute code (using the supplied examples) that performs the required functionality on supplied or generated data sets?

Evaluate: Are you able to interpret the results and justify your interpretation based on the observed data?

Create: Are you able to produce notebooks that serve as computational records of a session and can be used to share your insights with others?

Notebook objectives

By the end of this notebook you will be expected to:

  • Use "trusted" and "untrusted" data sources to enrich your analysis; and
  • Understand the implications of the five Rs on data quality from external sources.

List of exercises

  • Exercise 1: Enriching analysis with data from "trusted" sources.
  • Exercise 2: Pros and cons of using data from "untrusted" sources.

Notebook introduction

Data collection is expensive and time-consuming, as Arek Stopczynski alluded to in this module's video content. In some cases, you will be lucky enough to have existing datasets available to support your analysis. You may have datasets from previous analyses, access to data providers, or curated datasets from your organization. In many cases, however, you will not have access to the data that you require to support your analysis, and you will have to find alternative mechanisms. The data quality requirements will differ based on the problem you are trying to solve. Taking the hypothetical case of geocoding a location, which was introduced in Module 1, the accuracy of the geocoded location does not need to be exact when you are simply trying to plot the locations of students on a map. Geocoding a location for an automated vehicle to turn off the highway, on the other hand, has an entirely different accuracy requirement.

Note:

Those of you who work in large organizations may be privileged enough to have company data governance and data quality initiatives. These efforts and teams can often add significant value both in terms of supplying company-standard curated data, and making you aware of the internal policies that need to be adhered to.

As a data analyst or data scientist, it is important to be aware of the implications of your decisions. You need to choose the appropriate set of tools and methods to deal with sourcing and supplying data.

Technology has matured in recent years, allowing access to a host of data sources that can be used in your analyses. In many cases you can access free resources, or obtain (at a cost) data that has been curated, is available at lower latency, or comes with a service-level agreement. Some governments have even made datasets publicly available.

You were introduced to OpenPDS in the video content, where the focus shifts from supplying raw data -- which requires the provider to apply security principles before sharing datasets -- to supplying answers to specific questions rather than the data itself. OpenPDS allows users to collect, store, and control access to their data while protecting their privacy. In this way, users retain ownership of their data, as defined by the New Deal on Data.

This notebook demonstrates another example of how to source external data to enrich your analyses. The Python ecosystem contains a rich set of tools and libraries that can help you to exploit the available resources.

This course will not go into detail regarding the various options to source and interact with social data from sources such as Twitter, LinkedIn, Facebook, and Google Plus. However, you should be able to find libraries that will assist you in sourcing and manipulating these sources of data.

Twitter data is a good example because, depending on the options selected by the Twitter user, every tweet contains not just the message or content that most users are aware of, but also a view of the user's network, their home location, the location from which the message was sent, and a number of other features that can be very useful when studying networks around a topic of interest. Professor Alex Pentland pointed out the difference between what you share with the world (how you want to be seen) and what you actually do and believe (what you commit to). Be sure to keep these concepts in mind when you start exploring additional sources of data. Those who are interested in the topic can start exploring the options by visiting the Twitter library on PyPI.
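
Should you wish to experiment, the sketch below shows how such tweet metadata might be accessed using tweepy, one of several third-party Twitter libraries (it is not used elsewhere in this course, and is offered only as an illustration). The credential strings are placeholders for your own Twitter API keys, and the screen name is an arbitrary example.

In [ ]:
# Minimal sketch (assumes the tweepy package is installed and you have valid Twitter API credentials).
import tweepy

# Placeholder credentials -- replace with your own keys.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth)

# Fetch a few recent tweets for a public account and inspect selected metadata fields.
for status in api.user_timeline(screen_name='MIT', count=5):
    print(status.created_at, status.text)
    print('User-declared location:', status.user.location)
    print('Geotag (if shared):', status.coordinates)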

Start with the five Rs introduced in Module 1, and consider the following questions:

  • How accurate does my dataset need to be?
  • How often should the dataset be updated?
  • What happens if the data provider is no longer available?
  • Do I need to adhere to any organizational standards to ensure consistent reporting or integration with other applications?
  • Are there any implications to getting the values wrong?

You may need to start with “untrusted” data sources as a means of validating that your analysis can be executed. Once this is done, you can replace the untrusted components with trusted and curated datasets, as your analysis matures.

Note:
It is strongly recommended that you save and checkpoint after applying significant changes or completing exercises. This allows you to return the notebook to a previous state should you wish to do so. On the Jupyter menu, select "File", then "Save and Checkpoint" from the dropdown menu that appears.

Load libraries and set options


In [ ]:
import pandas as pd
from pandas_datareader import data, wb
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import folium
import geocoder
import wikipedia

# Set plot options.
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10, 8)

1. Source additional data from public sources

This section will provide short examples to demonstrate the use of public data sources in your notebooks.

1.1 World Bank

This example demonstrates how to source data from an external source to enrich your existing analyses. You will need to combine the data sources and add additional features to the example of student locations plotted on the world map in Module 1's Notebook 3.

The specific indicator chosen has little relevance other than to demonstrate the process that you will typically follow in completing your projects. Population counts, from an untrusted source, will be added to your map, and you will use scaling factors combined with the number of students, and population size of the country to demonstrate adding external data with minimal effort.

This example makes use of the pandas-datareader module, which supports remote data access by extracting data from various internet sources into a Pandas DataFrame. The supported sources at the time of writing are listed below, followed by a brief usage sketch:

  • Google Finance
  • Enigma
  • Quandl
  • St. Louis FED (FRED)
  • Kenneth French’s data library
  • World Bank
  • OECD
  • Eurostat
  • Thrift Savings Plan
  • Nasdaq Trader symbol definitions.
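
As a quick illustration of the remote data access interface (not required for the rest of this notebook), the cell below downloads a single series from FRED, which does not require an API key. The series code "GDP" (United States gross domestic product) and the date range are arbitrary choices made purely for demonstration.

In [ ]:
# Illustrative only: fetch the "GDP" series from FRED (St. Louis FED) into a DataFrame.
from pandas_datareader import data as pdr
import datetime

start = datetime.datetime(2008, 1, 1)
end = datetime.datetime(2016, 12, 31)

us_gdp = pdr.DataReader('GDP', 'fred', start, end)
us_gdp.head()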

This example focuses on enriching your student dataset from Module 1, using the World Bank's Development Indicators. In the following sections, you will use the data you saved in a previous exercise, add corresponding indicators for each country in the data, and find the mean location for all observed coordinates per country.

Prepare the student data

In the next code cell, you will load the data from disk, apply the groupby method to group the data by country and, for each group, find the total student count and the average of their GPS coordinates. The final dataset containing the country, student count, and averaged GPS coordinates is saved as a separate DataFrame variable.


In [ ]:
# Load the grouped_geocoded dataset from Module 1.
df1 = pd.read_csv('data/grouped_geocoded.csv', index_col=[0])

# Prepare the student location dataset for use in this example.
# We use the geometrical center by obtaining the mean location for all observed coordinates per country.
df2 = df1.groupby('country').agg({'student_count': [np.sum],
                                  'lat': [np.mean],
                                  'long': [np.mean]})

# Move 'country' from the index back into a column.
df3 = df2.reset_index()

In [ ]:
# Review the data
df3.head()

The column label index has multiple levels. Although this is useful metadata, it would be better to drop multilevel labeling and, instead, rename the columns to capture this information.


In [ ]:
df3.columns = df3.columns.droplevel(1)
df3.rename(columns={'lat': "lat_mean", 
                    'long': "long_mean"}, inplace=True)
df3.head()

Get and prepare the external dataset from the World Bank

Remember you can use "wb.download?" (without the quotation marks) in a separate code cell to get help on the pandas-datareader method for remote data access of the World Bank Indicators. Refer to the pandas-datareader remote data access documentation for more detailed help.


In [ ]:
# After running this cell, you can close the help pane by clicking the close (X) button in its upper right corner.
wb.download?

In [ ]:
# The selected indicator is total population, "SP.POP.TOTL", for the years 2008 to 2016.
wb_indicator = 'SP.POP.TOTL'
start_year = 2008
end_year = 2016

df4 = wb.download(indicator=wb_indicator,
                  country=['all'],
                  start=start_year,
                  end=end_year)

In [ ]:
# Review the data
df4.head()

The dataset contains entries for multiple years. The focus of this example is the entry corresponding to the latest year of data available for each country.


In [ ]:
df5 = df4.reset_index()
idx = df5.groupby(['country'])['year'].transform('max') == df5['year']

You can now extract only the values that correspond to the most recent year available for each country.


In [ ]:
# Create a new DataFrame containing only the rows that correspond to the latest year per country (identified above).
df6 = df5.loc[idx, :]

# Review the data
df6.head()

Now merge your dataset with the World Bank data.


In [ ]:
# Combine the student and population datasets.
df7 = pd.merge(df3, df6, on='country', how='left')

# Rename the columns of our merged dataset and assign to a new variable.
df8 = df7.rename(index=str, columns={'SP.POP.TOTL': 'PopulationTotal_Latest_WB'})

# Drop rows where the population value is missing (NaN).
df8 = df8[~df8.PopulationTotal_Latest_WB.isnull()]

# Reset the index.
df8.reset_index(inplace=True)

In [ ]:
df8.head()

Let's plot the data.

Note: The visualization below is for demonstration purposes only. The scaling factors were chosen arbitrarily so that both the population size and the number of students on this course are visible, per country, on the same map.


In [ ]:
# Plot the combined dataset

# Set map center and zoom level
mapc = [0, 30]
zoom = 2

# Create map object.
map_osm = folium.Map(location=mapc,
                     tiles='Stamen Toner',
                     zoom_start=zoom)

# Plot each of the locations that we geocoded.
for j in range(len(df8)):
    # Plot a blue circle marker for country population.
    folium.CircleMarker([df8.lat_mean[j], df8.long_mean[j]],
                        radius=df8.PopulationTotal_Latest_WB[j]/20000000,
                        popup='Population',
                        color='#3186cc',
                        fill_color='#3186cc',
                        ).add_to(map_osm)
    # Plot a red circle marker for students per country.
    folium.CircleMarker([df8.lat_mean[j], df8.long_mean[j]],
                        radius=df8.student_count[j]/50,
                        popup='Students',
                        color='red',
                        fill_color='red',
                        ).add_to(map_osm)
# Show the map.
map_osm


Exercise 1 Start.

Instructions

  1. Review the available indicators in the World Bank dataset, and select an indicator of your choice (other than the population indicator).
  2. Using a copy of the code (from above) in the cells below, replace the population indicator with your selected indicator. Instead of returning the most recent value for your selected indicator, compute the mean and standard deviation for the years from 2006 to 2016. You will need to use the Pandas groupby().agg() chained methods, together with the following functions from NumPy:
    • np.mean
    • np.std.

You can review the data preparation section for the student data above for an example.

Add comments (lines starting with a "#") giving a brief description of your view on the observed results. Make sure to include, in one or two sentences in each case, the following:

  1. A clear description of why you selected the indicator.
  2. What your expectation was before including the data.
  3. What you think the results may indicate.

Important:

  • Only the external data needs to be prepared. You do not need to prepare the student dataset again. Just use the student data that you prepared above and join this to the new dataset you sourced.
  • Only plot the mean values for your selected indicator (not the standard deviation values).
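
Before you begin, the toy sketch below (using synthetic data and hypothetical column names) illustrates the groupby().agg() pattern with np.mean and np.std, and shows how to flatten the resulting multilevel column labels. It is not a solution to the exercise; adapt the pattern to the indicator data that you download.

In [ ]:
# Toy illustration only: the aggregation pattern on synthetic data with hypothetical column names.
toy = pd.DataFrame({'country': ['A', 'A', 'B', 'B', 'B'],
                    'indicator_value': [1.0, 2.0, 3.0, 5.0, 7.0]})

toy_agg = toy.groupby('country').agg({'indicator_value': [np.mean, np.std]}).reset_index()

# Flatten the multilevel column labels produced by agg().
toy_agg.columns = ['country', 'indicator_mean', 'indicator_std']
toy_agg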

In [ ]:
# Your solution here
# Note: Split your logic across separate cells so that individual steps can be
# re-executed should you need to review them.


Exercise 1 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

1.2 Using Wikipedia as a data source

To demonstrate how quickly data can be sourced from public, "untrusted" data sources, you have been supplied with a number of sample scripts below. While these sources often contain rich datasets that you can acquire with minimal effort, they can be amended by anyone and may not be 100% accurate. In some cases you will have to transform the datasets manually, while in others you might be able to use pre-built libraries; a brief sketch of such a transformation follows the code cells below.

Execute the code cells below before completing Exercise 2.


In [ ]:
# Display MIT page summary from Wikipedia 
print(wikipedia.summary("MIT"))

In [ ]:
# Display a single sentence summary.
wikipedia.summary("MIT", sentences=1)

In [ ]:
# Create a variable, page, that contains the Wikipedia page object.
page = wikipedia.page("List of countries and dependencies by population")

# Display the page title.
page.title

In [ ]:
# Display the page URL. This can be utilised to create links back to descriptions.
page.url
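
The page retrieved above stores the population figures in an HTML table rather than as a ready-to-use dataset, so some manual transformation is needed. One possible approach, sketched below, is to let Pandas parse the tables directly from the page URL. This assumes that an HTML parser such as lxml or html5lib is installed, and note that both the position and the column names of the table of interest may change as the article is edited.

In [ ]:
# Illustrative only: parse the HTML tables on the page into a list of DataFrames.
tables = pd.read_html(page.url)
print('Number of tables found:', len(tables))

# Inspect the first table returned; in practice you would locate and clean the relevant one.
tables[0].head()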


Exercise 2 Start.

Instructions

After executing the cells for the Wikipedia example in Section 1.2, think about the potential implications of using this "public" and, in many cases, "untrusted" data source when doing analysis or creating data products.

Please compile and submit for evaluation a list of three pros and three cons.

Note: Your submission can be a simple markdown list using the syntax provided below.

Add your answer in this markdown cell. The contents of this cell should be replaced with your answer.

Submit as a list:

  • Pros:
    • Description 1
    • Description 2
    • Description 3
  • Cons:
    • Description 1
    • Description 2
    • Description 3


Exercise 2 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

2. Submit your notebook

Please make sure that you:

  • Perform a final "Save and Checkpoint";
  • Download a copy of the notebook in ".ipynb" format to your local machine using "File", "Download as", and "IPython Notebook (.ipynb)"; and
  • Submit a copy of this file to the Online Campus.