Python 3.6 Jupyter Notebook

Data exploration: Test your intuition about BSSIDs

**This notebook contains advanced exercises that are only applicable to students who wish to deepen their understanding and qualify for bonus marks on this course.** You will be able to achieve 100% for this notebook by successfully completing Exercises 1 and 2. An optional, additional exercise can be completed to qualify for bonus marks.

Your completion of the notebook exercises will be graded based on your ability to do the following:

Understand: Do your pseudo-code and comments show evidence that you recall and understand technical concepts?

Apply: Are you able to execute code (using the supplied examples) that performs the required functionality on supplied or generated data sets?

Evaluate: Are you able to interpret the results and justify your interpretation based on the observed data?

Notebook objectives

By the end of this notebook, you will be expected to:

  • Understand and use BSSIDs;
  • Review data using the "info()" and "describe()" functions; and
  • Interpret data using your own behaviors and patterns.

List of exercises

  • Exercise 1: Using Pandas' "describe()" and "info()" methods to review data.
  • Exercise 2: Access point identification.
  • Exercise 3 [Advanced]: Trending access point information.

Notebook introduction

In this notebook, you will examine the WiFi scan data set that you loaded in an exercise in Module 1's Notebook 2. You will also use the public data set from Dartmouth College (Student-Life) that was introduced in Module 1.

Before continuing with this notebook, it is important to understand the definition of a BSSID. A BSSID (basic service set identifier) is the media access control (MAC) address (or physical address) of a wireless access point (WAP). It is generated by combining the 24-bit organizationally unique identifier (OUI), which identifies the manufacturer, with a 24-bit identifier that the manufacturer assigns to the radio chipset in the WAP. In short, every wireless access point has a unique address, which will be utilized in this notebook.
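The split between the OUI and the device-specific portion can be illustrated with a few lines of Python. This is a minimal sketch using a made-up address; the "bssid" value and variable names are illustrative and do not come from the data set.

```python
# A BSSID is a 48-bit MAC address written as six two-digit hex octets.
# The first three octets form the manufacturer's OUI; the last three
# identify the specific radio chipset assigned by that manufacturer.
bssid = '00:1a:2b:3c:4d:5e'  # hypothetical example address

octets = bssid.split(':')
oui = ':'.join(octets[:3])        # manufacturer portion
device_id = ':'.join(octets[3:])  # chipset portion

print('OUI:', oui)              # OUI: 00:1a:2b
print('Device ID:', device_id)  # Device ID: 3c:4d:5e
```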

Typically, in any analysis, you will start with an idea that you need to validate. In the video content of this module, Arek Stopczynski suggests that you test ideas on yourself first, as this is the easiest way to validate your assumptions about the data generated. As a result, you will be able to quickly spot anomalies based on your understanding of your own behavior and patterns. Once you have a functional data set and hypothesis, you should also start to consider cases where the behaviors of others do not necessarily align to your own.

In many cases the data is reviewed manually. When performing an analysis, you need to validate all of your assumptions, and be able to logically describe what you want to do, before selecting a method of execution. In some cases, the functions you utilize may behave in unexpected ways. Therefore, you need to constantly perform checks to ensure that the output values are correct and as expected. Pandas is a widely used and popular library that is actively supported today by a community of dedicated developers and loyal users. However, this does not necessarily extend to other libraries you may come across.

Note:
It is strongly recommended that you save and checkpoint after applying significant changes or completing exercises. This allows you to return the notebook to a previous state should you wish to do so. On the Jupyter menu, select "File", then "Save and Checkpoint" from the dropdown menu that appears.

Load libraries


In [ ]:
from os import path
import pandas as pd
import matplotlib
import matplotlib.pylab as plt
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10, 8)

1. Dataset exploration

Before you proceed with any analysis, preliminary data exploration is required. This helps in understanding what is contained in the data, and includes determining how many records are in the data, the type of variables included, and the coverage of each field (that is, how complete the records are). An initial exploration of the data set not only helps in familiarizing yourself with the data, it also helps in uncovering what general hypotheses the data is likely to support. Additionally, exploration of the data includes using graphical visualization to summarize the main characteristics of the data, and identifying anomalous observations and correlations among variables. You will explore the use of graphical visualization for preliminary analysis in a later module.

1.1 Load data

To start the process, load a single user's data. For this example, you will start with the first user: user00.


In [ ]:
# Load the data for a specific user.
dataset_path = '../data/dartmouth/wifi/'
user00 = pd.read_csv(path.join(dataset_path, 'wifi_u00.csv'))

In [ ]:
# Review the data.
user00.head(5)

1.2 Review data definitions

The table below provides some field definitions for the data set, which can aid you in better understanding the data.

Each row represents a WiFi access point seen by a user’s phone. There are four columns in the provided data set:

Column Description
time Timestamp of the observation (epochtime format).
BSSID Unique ID of WiFi access point (MAC address of the hardware).
freq The frequency on which the access point operates.
level The strength of the signal.

Note:

  • Epochtime format can be parsed with the Pandas to_datetime function, as demonstrated in Section 1 of Module 2's Notebook 2.

The first example will only look at the BSSID, while subsequent examples will also look at the timestamp. Students who have previously worked with BSSIDs will notice the lack of an SSID: the network name. This was removed by the Dartmouth researchers prior to the release of the data set due to institutional security concerns. While it could be argued that this is one of the most useful pieces of information, your analytical tasks in this course do not require this feature.

1.3 Check for missing values

You can use the Pandas "count()" method to provide a quick overview of the entries in each column that contain values (i.e., non-empty). These entries can then be compared with the total number of rows in the data set.


In [ ]:
print('Non-empty records in each column:\n{}'.format(user00.count()))

In [ ]:
print('Total number of rows:\n{}'.format(user00.shape[0]))

Since the columns all contain 446110 records, there are no missing values.
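A complementary check is to count the missing entries per column directly with "isnull().sum()". The sketch below uses a small, invented DataFrame (the column names echo the data set, but the values are made up) so the two views can be compared side by side.

```python
import pandas as pd

# A tiny, invented frame with one deliberately missing BSSID.
df = pd.DataFrame({'BSSID': ['aa:bb:cc:dd:ee:ff', None, '11:22:33:44:55:66'],
                   'level': [-60, -70, -80]})

print(df.count())         # non-empty entries per column (BSSID: 2, level: 3)
print(df.isnull().sum())  # missing entries per column (BSSID: 1, level: 0)
```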


Exercise 1 Start.

Instructions

Apply the Pandas "info()" and "describe()" methods (introduced in Module 1's Notebook 2) to user00's DataFrame.


In [ ]:
# Your answer here. (Pandas info)

In [ ]:
# Your answer here. (Pandas describe)


Exercise 1 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

1.4 Data validation

As you can see from the first few lines of the data, the epochtime format is not very useful when trying to review datetimes. In the cell below, you are going to use the Pandas "to_datetime()" function to convert the epochtime into a format humans can easily understand. By default, "to_datetime()" interprets integer input in units of nanoseconds, whereas this data set records epoch seconds. To handle this, set the optional argument "unit" to seconds to produce the desired output, that is, "unit='s'".
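To see what "unit='s'" does in isolation, here is a minimal sketch with hand-picked epoch values (chosen for illustration; they do not come from the data set):

```python
import pandas as pd

# Epoch second 0 is midnight on 1 January 1970 (UTC); 86400 is one day later.
print(pd.to_datetime(0, unit='s'))      # 1970-01-01 00:00:00
print(pd.to_datetime(86400, unit='s'))  # 1970-01-02 00:00:00
```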


In [ ]:
# Review the contents of the "time" column.
user00.head(3)

In [ ]:
# Transform the "time" column into human-readable format.
user00.loc[:,'time'] = pd.to_datetime(user00.time, unit='s')

In [ ]:
# Review the data.
user00.head(3)

You can use the "print()" command to display the minimum and maximum times in the new data set. Notice that, by adding ".min()" or ".max()" after "user00['time']", Python will apply the method to find these values, and print them in place of {}, which is used for string formatting. The ".format()" method takes as many arguments as there are {} placeholders – in this case, 2.


In [ ]:
# Manual review.
print('Existing times range between: {} and {}'.format(user00['time'].min(), user00['time'].max()))

In [ ]:
# Using Pandas "describe" method.
user00.time.describe()

Next, use the Pandas "value_counts()" method to find the counts of unique values for observed frequencies in the converted data set. The full syntax for the "value_counts" method is:

Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)

where Series is a Pandas series object, that is, a single column from a DataFrame object.


In [ ]:
# Get more inline details on the function "value_counts". 
# Remember to close the dialogue box by clicking the close "x" button in the top right corner.
pd.value_counts?

In [ ]:
# Use the Pandas value_counts method to review observed frequencies.
freq_counts = pd.value_counts(user00['freq'])
freq_counts.head(10)

Review the other columns in the data set, and explore any of the other features, based on the information provided.
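When exploring the other columns, the "normalize=True" argument shown in the "value_counts" signature in Section 1.4 can be useful: it returns proportions instead of raw counts. A minimal sketch on an invented series (the values below are made up, not drawn from user00):

```python
import pandas as pd

# Invented frequency observations for illustration.
freqs = pd.Series([2437, 2437, 5180, 2437])

# normalize=True converts counts into fractions of the total.
print(freqs.value_counts(normalize=True))  # 2437 -> 0.75, 5180 -> 0.25
```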


Exercise 2 Start.

Instructions

Assume, for the sake of this exercise, that user00 refers to a data set created based on your activities.

  1. Use the Pandas "value_counts" method (demonstrated in Section 1.4 with observed frequencies) to review the observations per BSSID ("user00['BSSID']"), and indicate which access point most likely corresponds to your home location.

  2. Provide a justification for your choice of access point.

  3. Briefly provide two instances where your justification in Question 2 would be invalid.

    Hint: Think about the locations where you spend most of your time, and what other kinds of behaviors you would expect in a large-scale experiment.


In [ ]:
# Your answer here.


Exercise 2 End.

Exercise complete:

This is a good time to "Save and Checkpoint".


Exercise 3 [Advanced] Start.
Note:
This activity is for advanced students only and extra credit will be allocated. Students will not be penalized for not completing this activity.

Instructions

  1. Using the "dt.dayofweek" Pandas attribute, find the days of the week with the most and the fewest occurrences of the access point you identified in Exercise 2.1 above. In your answer, provide both the days and the corresponding number of occurrences of the access point on those days.

    Hint: You will need to use the "dt.dayofweek" attribute on a datetime Pandas series object, which has been filtered to only contain instances of the identified access point.

  2. Describe and explain the trend you observe regarding access point occurrences during the week, and whether or not it is similar to the behavior you would have expected.

    Hint: To view the trend, use the Pandas "plot(kind='bar')" method on the series object containing the counts of access point occurrences during the week.


In [ ]:
# Your answer here.


Exercise 3 [Advanced] End.

Exercise complete:

This is a good time to "Save and Checkpoint".

2. Submit your notebook

Please make sure that you:

  • Perform a final "Save and Checkpoint";
  • Download a copy of the notebook in ".ipynb" format to your local machine using "File", "Download as", and "IPython Notebook (.ipynb)"; and
  • Submit a copy of this file to the Online Campus.