Python 3.6 Jupyter Notebook

Introduction to Bandicoot

Your completion of the notebook exercises will be graded based on your ability to do the following:

Understand: Do your pseudo-code and comments show evidence that you recall and understand technical concepts?

Apply: Are you able to execute code (using the supplied examples) that performs the required functionality on supplied or generated data sets?

Evaluate: Are you able to interpret the results and justify your interpretation based on the observed data?

Notebook objectives

By the end of this notebook, you will be expected to:

  • Understand the use of Bandicoot in automating the analysis of mobile phone data records; and
  • Understand data error handling.

List of exercises

  • Exercise 1: Calculating the number of call contacts.
  • Exercise 2: Determining average day and night weekly call activity rates.
  • Exercise 3: Interpreting gender assortativity values.
  • Exercise 4: Handling data errors.

Notebook introduction

This course started by introducing you to tools and techniques that can be applied in analyzing data. This notebook briefly revisits the “Friends and Family” data set (for context purposes), before demonstrating how to generate summary statistics manually, and through Bandicoot. Subsequent sections briefly demonstrate Bandicoot's visualization capabilities, how to use Bandicoot in combination with network and graph content (introduced in Module 4), error handling, and loading files from a directory.

Note:
It is strongly recommended that you save and checkpoint after applying significant changes or completing exercises. This allows you to return the notebook to a previous state should you wish to do so. On the Jupyter menu, select "File", then "Save and Checkpoint" from the dropdown menu that appears.

Load libraries


In [ ]:
import os
import pandas as pd
import bandicoot as bc
import numpy as np
import matplotlib

1. Data set review

Some of the relevant information pertaining to the “Friends and Family” data set is repeated here, as you will be focusing on the content of the data in this module's notebooks.

An experiment was designed in 2011 to study how people make decisions (with emphasis on the social aspects involved) and how people can be empowered to make better decisions using personal and social tools. The data set was collected by Nadav Aharony, Wei Pan, Cory Ip, Inas Khayal, and Alex Pentland. More details about this data set are available through the MIT Reality Commons resource.

The subjects are members of a young-family residential living community adjacent to a major research university in North America. All members of the community are couples, and at least one of the members is affiliated with the university. The community comprises over 400 residents, approximately half of whom have children. A pilot phase of 55 participants was launched in March 2010. In September 2010, phase two of the study expanded to 130 participants – approximately 64 families. Participants were selected from approximately 200 applicants in a way that would achieve a representative sample of the community and its sub-communities (Aharony et al. 2011).

In this module, you will prepare and analyze the data in a social context, using tools and methods introduced in previous modules.

2. Calculating summary statistics

To better understand the process of creating features (referred to as behavioral indicators in Bandicoot), you will start by manually creating a feature. Creating features is a tedious process. Using libraries that are tailored for specific domains (such as Bandicoot) can significantly reduce the time required to generate features that you would use as input in machine-learning algorithms. It is important for you to both understand the process and ensure that the results produced by the libraries are as expected. In other words, you need to make sure that you use the correct function from the library to produce the expected results.

2.1 Average weekly call duration

In this first demonstration, you will manually evaluate the average weekly call duration for a specific user, based on the interaction log file. In Section 2.2, the same analysis is repeated using Bandicoot.

2.1.1 Data preparation

First, review the content of the text file containing the records, using the bash (command line) command "head". This command was demonstrated in earlier notebooks, and it is extremely useful when you need a quick view of a data set without loading it into your analysis environment. Should the contents prove useful, you can then load the file as a DataFrame.


In [ ]:
# Retrieve the first three rows from the "clean_records" data set for the user under review.
!head -n 3 ../data/bandicoot/clean_records/sp10-01-08.csv

The first three lines of the file are displayed. They contain a header row, as well as two data rows.

Next, load the data set using the Pandas "read_csv" function, and use the "datetime" field as the DataFrame index. This example only focuses on calls. You will create a new DataFrame containing only call data by filtering by the type of interaction.


In [ ]:
# Specify the user for review.
user_id = 'sp10-01-08'

# Load the data set and set the index.
interactions = pd.read_csv('../data/bandicoot/clean_records/' + user_id + '.csv')
interactions.set_index(pd.DatetimeIndex(interactions['datetime']), inplace=True)

# Extract the calls. 
calls = interactions[interactions.interaction == 'call'].copy()

# Display the head of the new calls dataframe.
calls.head(3)

The "correspondent_id" field contains the user ID for the other party involved in a specific call interaction. Each "correpondent_id" is encoded in one of two formats:

  1. A hexadecimal string, indicating that the corresponding party did not form part of the study.
  2. A non-hexadecimal string, identifying a party within the study group.

The function provided below, "is_hex", checks whether a string is hexadecimal.


In [ ]:
def is_hex(s):
    '''
    Check if a string is hexadecimal.
    '''
    try:
        int(s, 16)
        return True
    except ValueError:
        return False

Using the function defined above, add a Boolean column indicating whether the value of "correspondent_id" is hexadecimal. This column can then be used to filter out interactions involving parties outside of the study population.


In [ ]:
calls['is_hex_correspondent_id'] = calls.correspondent_id.apply(is_hex)

In [ ]:
calls.head()

2.1.2 Calculating the weekly average call duration

Performing the calculation is a two-step process:

  1. Assign to each call the number of the week in which the interaction occurred, stored in a new "week" column.

    Note: This is sufficient, in this case, because the data range falls within a single year. Otherwise, you would have attributed each call to both the year and the week of the interaction (see the sketch following the binning code below).

  2. Use the Pandas "groupby()" method (demonstrated in Module 2) to bin the data by the week of interaction.

In [ ]:
# Add a field that contains the week number corresponding to a call record. 
calls['week'] = calls.index.week

# Get the mean and population(ddof=0) standard deviation of each grouping.
weekly_averages = calls.groupby('week')['call_duration'].agg([np.mean, lambda x: np.std(x, ddof=0)])

# Give the columns names that are intuitive.
weekly_averages.columns = ['mean_duration', 'std_duration']

# Review the data.
weekly_averages.head()

In [ ]:
# Retrieve the bins (weeks).
list(weekly_averages.index)
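
Note: Grouping on the week number alone is safe here only because the records fall within a single calendar year. As a minimal sketch of the multi-year variant mentioned in the note above (assuming the same "calls" DataFrame), you could bin on a (year, week) pair instead:


In [ ]:
# A hedged sketch for multi-year data: bin on (year, week) so that, for example,
# week 1 of different years is not merged into a single bin.
calls['year_week'] = list(zip(calls.index.year, calls.index.week))
multi_year_averages = calls.groupby('year_week')['call_duration'].agg([np.mean, lambda x: np.std(x, ddof=0)])
multi_year_averages.columns = ['mean_duration', 'std_duration']
multi_year_averages.head()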

Now that you have the weekly averages and standard deviation of the call duration, you can compute the mean weekly call duration and the mean weekly call duration standard deviation.


In [ ]:
print ("The average weekly call duration for the user is {:.3f}, while the average weekly standard deviation is {:.3f}."
       .format(weekly_averages.mean_duration.mean(), weekly_averages.std_duration.mean()))

It is possible to use generic data analysis libraries (such as Pandas) that were introduced to you in earlier modules. However, in the next section of this notebook, you will return to a library briefly introduced to you in Module 2 of this course, namely Bandicoot.


In [ ]:
weekly_averages.describe()

2.2 Using Bandicoot

Bandicoot is an open-source Python toolbox used to analyze mobile phone metadata. You can perform actions – similar to your manual steps – with a single command, using this library.

The manual analysis of data sets can be an extremely tedious and resource-intensive process. Although it is outside of the scope of this course, it is important to start considering memory utilization, reproducibility of results, and the reuse of intermediate steps when working with large data sets. Toolboxes such as Bandicoot are optimized for performance and tailored to mobile phone metadata analysis, meaning that the available functions are specific to the type of data being analyzed.

Please review the Bandicoot reference manual for details on functions, modules, and objects included in Bandicoot. Bandicoot has been preinstalled on your virtual analysis environment. Revisit the Bandicoot quick guide should you wish to set up this library in another environment.
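
Should you wish to install Bandicoot in another environment yourself, it is available from PyPI and can typically be installed with a single command:


In [ ]:
# Install Bandicoot from PyPI (not required in this preconfigured environment).
!pip install bandicoot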

In the following example, you will redo the analysis from the previous section, and work on additional examples of data manipulation using Bandicoot.

Load the data

This example starts by using the Bandicoot "read_csv" function to load the input files. Note that this function expects data in a specific format, and it provides additional information that allows you to better understand your data set.


In [ ]:
B = bc.read_csv(user_id, '../data/bandicoot/clean_records/', '../data/bandicoot/antennas.csv')

Note:

WARNING:root:100.00% of the records are missing a location.

This message indicates that the data set does not include any antenna IDs. This column was removed from the DataFrame in order to preserve user privacy. A research study on the privacy bounds of human mobility indicated that knowing four spatio-temporal points (the approximate places and times of an individual) is enough to re-identify an individual in an anonymized data set in 95% of cases.

2.2.1 Compute the weekly average call duration

In Bandicoot, you can achieve the same result demonstrated earlier with a single method call named "call_duration".


In [ ]:
# Calculate the call_duration summary statistics using Bandicoot.
bc.individual.call_duration(B)

You can see that the results (above) are in line with the manual calculation (which was rounded to three decimals) that you performed earlier. By default, Bandicoot computes indicators on a weekly basis, and returns the average (mean) over all of the weeks available, as well as the standard deviation (std), in a nested dictionary. You can read more about the creation of indicators in the Bandicoot documentation.
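
For example, to extract a single value from this nested dictionary, you can index into it directly. The following is a minimal sketch, assuming the "allweek" and "allday" keys that are present when no split arguments are supplied:


In [ ]:
# Extract just the weekly mean call duration from the nested dictionary.
duration_stats = bc.individual.call_duration(B)
duration_stats['allweek']['allday']['call']['mean']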

To change the default behavior and review the data at a daily resolution, you can pass the "groupby" argument in the method call. Other grouping options include "month", "year", and None. Now, change the grouping resolution to "day" in the following call, and display additional statistics by setting the "summary" argument to "extended".


In [ ]:
bc.individual.call_duration(B, groupby='day', interaction='call', summary='extended')

Note:

You can switch between groupings by day, week, or month with ease. This is one of the advantages referred to earlier: in a manual analysis, you would have had to create these features by hand, or write much more resource-intensive parsing functions to achieve similar results. With Bandicoot, including all options or changing to a new grouping requires minimal changes on your side, and no additional functions need to be created.
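
For instance, switching the same indicator to a monthly grouping requires changing only a single argument:


In [ ]:
# Group the call duration statistics by month instead of by week.
bc.individual.call_duration(B, groupby='month')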


Exercise 1 Start.

Instructions

  1. Compute the average number of call contacts for the data set, B, grouped by:
    1. Month; and
    2. Week.

Hint: You can review the help file for the "number_of_contacts" function to get started.


In [ ]:
# Your code here.


Exercise 1 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

2.2.2 Splitting records

Regardless of the grouping time resolution, it is often useful to stratify the data between weekday and weekend, or between day and night. Bandicoot allows you to achieve this with its Boolean split arguments, "split_week" and "split_day". You can read more about Bandicoot’s "number_of_interactions", and then execute the code below to view the daily number of interactions stratified with "split_day".

Note:

This strategy is employed to generate features to be processed by machine learning algorithms, where the algorithms can identify behavior which is not visible at small scale. In 2015, a study, titled "Predicting Gender from Mobile Phone Metadata" (presented at the Netmob Conference, Cambridge), showed that the most predictive feature for men in a South Asian country is the "percent of calls initiated by the person during weekend nights", while the most predictive feature for men in the European Union is "the maximum text response delay during the week" (Jahani et al. 2015).


In [ ]:
# Use bandicoot to split the records by day.
bc.individual.number_of_interactions(B, groupby='day', split_day=True, interaction='call')

In [ ]:
# Plot the results. The mean is plotted as a barplot, with the std deviation as an error bar.
%matplotlib inline

interactions_split_by_day = bc.individual.number_of_interactions(B, groupby='day', split_day=True, interaction='call')

interactions_split = []
for period, values in interactions_split_by_day['allweek'].items():
    interactions_split.append([period, values['call']['mean'], values['call']['std']])

interactions_split = pd.DataFrame(interactions_split, columns=['period', 'mean', 'std'])
interactions_split[['period', 'mean']].plot(kind='bar', x='period', title='Daily vs nightly interactions',
                                            yerr=interactions_split['std'].values)

The argument "split_day" is now demonstrated (below) to allow you to view all available strata.


In [ ]:
bc.individual.number_of_interactions(B, groupby='day', split_week=True, split_day=True, interaction='call')

Note:

The number of interactions is higher for “day” compared to “night”, as well as for “weekday” compared to “weekend”.

2.2.3 Other indicators

Machine learning algorithms use features for prediction and clustering tasks. Difficulty arises when manually generating these features. However, using custom libraries (such as Bandicoot) to generate them on your behalf can significantly speed up and standardize the process. In earlier modules, you performed manual checks on data quality. Experience will teach you that this step always takes longer than anticipated, and requires significant effort to determine the relevant questions, and then to execute them. Using a standardized library such as Bandicoot saves time in analyzing data sets, and spotting data quality issues, and makes the actions repeatable or comparable with other data sets or analyses.

Two additional features are demonstrated here. You can refer to the Bandicoot reference material for additional available features.

Active days (days with at least one interaction)


In [ ]:
# Active days.
bc.individual.active_days(B)

Note:

Remember that Bandicoot defaults to grouping by week, if the grouping is not explicitly specified.
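
To compute a single figure over the entire observation period instead, you can pass None as the grouping option (one of the grouping parameters mentioned in Section 2.2.1):


In [ ]:
# Compute the number of active days over the whole period, rather than per week.
bc.individual.active_days(B, groupby=None)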

Number of contacts

This number can be interesting, as some research suggests that it is predictable for humans and, in the long run, near constant for any individual.


In [ ]:
# Number of contacts. 
bc.individual.number_of_contacts(B, split_week=True)

Note:

It appears as though there might be a difference between the number of people contacted by phone between the weekend and weekdays.

All available features

Bandicoot currently contains 1442 features. You can obtain a quick overview of the features for this data set using the Bandicoot "utils.all" function. The three categories of indicators are individual, spatial, and network-related features.


In [ ]:
bc.utils.all(B)

Note:

The “reporting” variables allow you to better understand the nature and origin of the data, as well as which computations have been performed (which version of the code, etc.).
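
As a minimal sketch for inspecting these variables, assuming the nested (unflattened) output of "bc.utils.all" exposes them under a "reporting" key:


In [ ]:
# Review the reporting metadata: data paths, code version, record counts, etc.
all_indicators = bc.utils.all(B)
all_indicators['reporting']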


Exercise 2 Start.

Instructions

  1. Using Bandicoot, find the user's activity rate during the week and on weekends. Show your calculations, and express your answer as a percentage using the print function.

    Note: Five days constitute the maximum number of weekdays, and two days are the maximum possible number of weekend days.


In [ ]:
# Your code here.


Exercise 2 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

3. Visualization with Bandicoot

Now that you have more background information on the toolbox and its capabilities, the visualization demonstrated in Module 2 will be repeated. As Yves-Alexandre de Montjoye mentioned in the video content, visualization is a powerful tool: not only for communicating your final results, but also for checking the validity of your data in order to identify errors and outliers. Bandicoot is also powerful for visually identifying useful patterns that are hidden by the aggregation processes applied to the raw data.

Note:
There is a problem with the current version of Jupyter notebooks in which HTML portions are not rendered correctly, so the frame below is not functional at this stage. The code is included for students who wish to utilize it elsewhere, and a static image of the output is included below.

In [ ]:
# Import the relevant libraries.
import os
from IPython.display import IFrame

# Set the path to store the visualization.
viz_path = os.path.dirname(os.path.realpath(__name__)) + '/viz'

# Create the visualization.
bc.visualization.export(B, viz_path)

# Display the visualization in a frame within this notebook.
IFrame("./viz/index.html", "100%", 700)

Image displaying sample output:

Note:

To serve the results in the notebook, "IFrame" is used. You can also serve the results as a web page using tools provided in Bandicoot. This function will not be demonstrated in this course, as the required ports on the AWS virtual analysis environment have not been opened.

You can review the Bandicoot quickstart guide for more details on the "bc.visualization.run(U)" command. You can use this function to serve the visualization as a web page if you choose to install bandicoot on infrastructure where you do have access to the default port (4242). (This port is not open on your AWS virtual analysis environment.)
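
For reference, serving the dashboard is a one-liner on infrastructure where the port is reachable (it will not work in this environment):


In [ ]:
# Serve the interactive visualization on Bandicoot's default port (4242).
# This will not work on the AWS virtual analysis environment, where the port is closed.
bc.visualization.run(B)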

4. Graphs and matrices

This section contains network indicators, a gender assortativity example, and a brief demonstration of how to use Bandicoot to generate input for visualizations, using NetworkX. At the start of the course, Professor Pentland described general patterns in behavior that are observed between individuals. Understanding an individual as a part of a network is an extremely useful way to evaluate how they resemble or do not resemble their friends, as well as the role they play in their network or community.

In the current “Friends and Family” data set, the majority of interactions take place outside of the study population. Performing network calculations on this data set therefore does not make sense, because the data is not representative of the full network of contacts. In a commercial application, you would most likely encounter a similar situation, as there are multiple carriers, each with only a portion of the total market share. The figures differ per country, but typically fall in the range of 10-30% market share for the main (dominant) carriers. A separate, trimmed data set is therefore used to demonstrate this example.

A useful feature of Bandicoot is that, if the input data is properly formatted, it can quickly analyze a user's ego network (the network around an individual focus node). Start by loading the "ego" in question into a Bandicoot object, with the network parameter set to "True". Bandicoot will then attempt to extract all of the ego's interaction data, and perform the network analysis for the data contained in the specified network folder.

4.1 Load the data


In [ ]:
# Specify the network folder containing all the data.
network_folder  = '../data/bandicoot/network_records/'

# Create Bandicoot object.
BN = bc.read_csv(user_id, network_folder, attributes_path='../data/bandicoot/attributes', network=True)

The Bandicoot "read_csv()" function loads the data, provides summary information, and removes the records that are not of interest in the analysis. Typically, performing the data cleansing steps is time-consuming, and prone to error or inconsistencies.

The graph data is stored as an adjacency matrix.

Note:

You will recall adjacency matrices (from Module 4) as a useful mechanism for representing finite graphs. Bandicoot stores the graph information in an adjacency matrix, and the matrix's index (the list of participating user IDs) in a separate object. Once the data has been loaded, you can start exploring the graph.

4.2 Network indicators


In [ ]:
# Index of the adjacency matrix - user_ids participating in the network.
node_labels = bc.network.matrix_index(BN)
node_labels

There are several types of adjacency matrices available in Bandicoot, including the following:

  • bc.network.matrix_directed_weighted(network_user)
  • bc.network.matrix_directed_unweighted(network_user)
  • bc.network.matrix_undirected_weighted(network_user)
  • bc.network.matrix_undirected_unweighted(network_user)

You can review the Bandicoot network documentation for additional information.


In [ ]:
# Directed unweighted matrix.
directed_unweighted = bc.network.matrix_directed_unweighted(BN)
directed_unweighted

In [ ]:
# Undirected weighted matrix.
undirected_weighted = bc.network.matrix_undirected_weighted(BN)
undirected_weighted

4.3 Gender assortativity

This indicator computes the assortativity of nominal attributes, that is, the homophily between the current user and their correspondents for each attribute. For a given attribute (such as gender), it returns a value between 0 (no assortativity) and 1 (all the contacts share the same value), which indicates the fraction of contacts sharing the same attribute value as the user.

Let's demonstrate this by reviewing the gender assortativity.


In [ ]:
bc.network.assortativity_attributes(BN)['gender']


Exercise 3 Start.

Instructions

In the previous example, you obtained a value of 0.714 or 71.4% for gender assortativity. Random behavior would typically deliver a value centered around 50%, if you have enough data points.

Question: Do you think the value of 71.4% is meaningful or relevant?

Your answer should consist of “Yes” or “No”, and a short description of what you think the value obtained means, in terms of the data set.

Your markdown answer here.


Exercise 3 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

4.4 Ego network visualization

You can use the ego network adjacency matrices for further analyses in NetworkX.


In [ ]:
# Load the relevant libraries and set plotting options.
import networkx as nx
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (18,11)

Create directed unweighted and undirected weighted graphs to visualize the network, in order to better understand the user's behavior (as per the examples in Section 1.2 of Module 4’s Notebook 2).


In [ ]:
# Create the graph objects.
G_directed_unweighted = nx.from_numpy_matrix(np.array(directed_unweighted), create_using=nx.DiGraph())
G_undirected_weighted = nx.from_numpy_matrix(np.array(undirected_weighted)) 
node_labels = dict(enumerate(node_labels))

4.4.1 Plot the directed unweighted graph

This can typically be utilized to better understand the flow or spread of information in a network.


In [ ]:
# Plot the graph.
layout = nx.spring_layout(G_directed_unweighted)
nx.draw_networkx(G_directed_unweighted, layout, node_color='blue', alpha=0.4, node_size=2000)
_ = nx.draw_networkx_labels(G_directed_unweighted, layout, node_labels)
_ = nx.draw_networkx_edges(G_directed_unweighted, layout, arrows=True)

4.4.2 Plot the undirected weighted graph

This can typically be utilized to better understand the importance of the various individuals and their interactions in the network.


In [ ]:
# Plot the graph.
layout = nx.spring_layout(G_undirected_weighted)
nx.draw_networkx(G_undirected_weighted, layout, node_color='blue', alpha=0.4, node_size=2000)
_ = nx.draw_networkx_labels(G_undirected_weighted, layout, node_labels)
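
Because this graph is weighted, you may also wish to scale the edge widths by interaction weight so that stronger relationships stand out. The following is a minimal sketch, assuming the "weight" edge attribute that NetworkX sets when a graph is built from a weighted matrix:


In [ ]:
# Redraw the graph with edge widths proportional to the interaction weights.
weights = [d['weight'] for _, _, d in G_undirected_weighted.edges(data=True)]
max_weight = max(weights) if weights else 1
edge_widths = [5.0 * w / max_weight for w in weights]

layout = nx.spring_layout(G_undirected_weighted)
nx.draw_networkx(G_undirected_weighted, layout, node_color='blue', alpha=0.4,
                 node_size=2000, width=edge_widths)
_ = nx.draw_networkx_labels(G_undirected_weighted, layout, node_labels)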

Note:

Can you think of use cases for the various networks introduced in Module 4?

Feel free to discuss these with your fellow students on the forums.

5. Data error handling

This section demonstrates some of Bandicoot’s error handling and reporting strategies, using one of the "faulty" users. Some circumstances may require working with CDRs (call detail records) and collected mobile phone metadata that have been corrupted. The reasons for this can be numerous, but typically include wrong formats, faulty files, empty periods of time, and missing users. Bandicoot will not attempt to correct errors, as this might lead to incorrect analyses. Correctness is key in data science, and Bandicoot will:

  1. Warn you when you attempt to import corrupted data;
  2. Remove faulty records; and
  3. Report on more than 30 variables (such as the number of contacts, types of records, records containing location), warning you of potential issues when exporting indicators.

5.1 Bandicoot CSV import

Importing CSV files with Bandicoot will produce warnings about:

  1. No files containing data being found in the specified path;
  2. The percentage of records missing location information;
  3. The number of antennas missing geotags (provided the antenna file has been loaded);
  4. The fraction of duplicated records; and
  5. The fraction of calls with an overlap bigger than 5 minutes.

In [ ]:
# Set the path and user for demonstration purposes.
antenna_file             = '../data/bandicoot/antennas.csv'
attributes_path          = '../data/bandicoot/attributes/'
records_with_errors_path = '../data/bandicoot/records/'
error_user_id            = 'fa10-01-04'

5.2 Error example


In [ ]:
errors = bc.read_csv(error_user_id, records_with_errors_path)

Review the errors below to quickly get a view of your data set. This example produces warnings in addition to the missing location and antenna warnings explained earlier. The new warnings include:

  1. Missing values of call duration;
  2. Duplicate records; and
  3. Overlapping records.

5.2.1 Rows with missing values

These rows are automatically excluded, and their details can be examined using “errors.ignored_records”.


In [ ]:
errors.ignored_records

5.2.2 Duplicated records

These records are retained by default, but you can change this behavior by adding the parameter “drop_duplicates=True” when loading files.

Warning:

Exercise caution when using this option. The maximum timestamp resolution is one minute, and some of the records that appear to be duplicates may in fact be distinct text messages, or even, although very unlikely, very short calls. As such, it is generally advised that you examine the records before removing them.
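
A minimal pandas sketch (assuming the raw CSV layout shown in Section 2.1.1) for examining apparent duplicates before deciding to drop them:


In [ ]:
# Load the raw records and review the rows that are exact duplicates of another row.
raw = pd.read_csv(records_with_errors_path + error_user_id + '.csv')
duplicates = raw[raw.duplicated(keep=False)]
duplicates.head()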


In [ ]:
errors = bc.read_csv(error_user_id, records_with_errors_path, drop_duplicates=True)


Exercise 4 Start.

Instructions

When working with data of any size or volume, data error handling can be a complex task.

  1. List three important topics to consider when performing data error handling.
  2. Provide a short description of your view of the topics to consider. Your answer should be one or two sentences in length, and can be based on an insight that you reached while completing the course material or from previous experience.

Your markdown answer here.


Exercise 4 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

6. Loading the full data set

In this section, you will load the full “Friends and Family” Reality Commons data set, and compute all of the metrics (briefly introduced in Section 2.2.3) for all of the users. You need to specify a "flat" directory as input, containing one file per user. It is crucial that the record-file naming convention is observed (i.e., the name of each file is the user ID), and that each user's data resides in a separate file.


In [ ]:
# View the files in the directory using the operating system list function.
!ls ../data/bandicoot/clean_records/

6.1 Load the files and create a metric

Review the Bandicoot "utils.all" page for more detail.


In [ ]:
# Load libraries and set path options.
import glob, os
records_path    = '../data/bandicoot/clean_records/'

# Create an empty list and then cycle through each of the available files in the directory to add features.
features = []
for f in glob.glob(records_path + '*.csv'):
    user_id = os.path.basename(f)[:-4]

    try:
        B = bc.read_csv(user_id, records_path, attributes_path=attributes_path, describe=False, warnings=False)
        metrics_dict = bc.utils.all(B, summary='extended', split_day=True, split_week=True)
    except Exception as e:
        metrics_dict = {'name': user_id, 'error': str(e)}

    features.append(metrics_dict)

6.2 Save the interactions in a file for future use

Note: The application of machine learning techniques, using a similar data set, will be explored in the next notebook.


In [ ]:
bc.io.to_csv(features, 'all_features.csv')

Before moving on, take a quick look at the results of the pipeline.

6.2.1 Review the data for the first user

Keep in mind that, in a manual approach, you would likely have to create each of these features by hand. That process entails thinking up candidate features, and reviewing the available literature to identify applicable ones. These features are then used in machine learning techniques (including feature selection) for various use cases.

Note:

The section below will display a large number of features for the first user. You do not need to review them in detail. Here, the intention is to emphasize the ease of creating features, and the advantages of computationally-optimized functions. These are extremely useful when scaling your analyses to large record sets (such as those typically found in the telecommunications industry).

6.2.2 Review the features list


In [ ]:
# Display the number of users in the features list.
len(features)

In [ ]:
# Print the names of the users contained in the features list.
for u in features:
    print(u['name'])

In [ ]:
# Print the various groups of behavioral indicators (and attributes) that are available for each user.
# You will use the first user's data in the feature list for this.
list(features[0].keys())
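
As a quick sanity check, the exported file can be loaded back into Pandas. (The next notebook starts its machine-learning work from a similar file.)


In [ ]:
# Load the exported features back into a DataFrame and preview the result.
all_features_df = pd.read_csv('all_features.csv')
all_features_df.head()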

7. Submit your notebook

Please make sure that you:

  • Perform a final "Save and Checkpoint";
  • Download a copy of the notebook in ".ipynb" format to your local machine using "File", "Download as", and "IPython Notebook (.ipynb)"; and
  • Submit a copy of this file to the Online Campus.

8. References

Aharony, Nadav, Wei Pan, Cory Ip, Inas Khayal, and Alex Pentland. 2011. “Social fMRI: Investigating and shaping social mechanisms in the real world.” Pervasive and Mobile Computing 7:643-659.

Jahani, Eaman, Pal Roe Sundsoy, Johannes Bjelland, Asif Iqbal, Alex Pentland, and Yves-Alexandre de Montjoye. 2015. “Predicting Gender from Mobile Phone Metadata.” Paper presented at the Netmob Conference, Massachusetts Institute of Technology, Cambridge, April 8-10.

de Montjoye, Yves-Alexandre, Luc Rocher, and Alex 'Sandy' Pentland. 2016. “bandicoot: A Python Toolbox for Mobile Phone Metadata.” Journal of Machine Learning Research 17:1-5. Accessed October 30, 2016.


In [ ]: