Understand: Do your pseudo-code and comments show evidence that you recall and understand technical concepts?
Apply: Are you able to execute code (using the supplied examples) that performs the required functionality on supplied or generated data sets?
By the end of this notebook, you will be expected to:
- Understand the importance, challenges, and approaches to personal data protection;
- Identify the dimensions across which big data differs from traditional data sets;
- Understand the concept of unicity;
- Use coarsening to anonymize personal information in data; and
- Understand the limitations of anonymization in the context of big data.
- Exercise 1: Calculate the unicity of a raw data set.
- Exercise 2: Calculate and interpret the unicity of a coarsened data set.
- Exercise 3: Identify limitations of data anonymization in the context of big data, and suggest alternative data-protection mechanisms.
In the video content, Cameron Kerry indicated that the law lags too far behind technology to answer many of the hard questions around data protection. He then went on to elaborate that, in many cases, the question becomes not just what you must do, but rather, what you should do in order to establish and maintain a trust relationship.
Sharing data (collected about individuals) between entities poses a risk to privacy and trust, and is regulated in most parts of the world. The European Union recently passed the General Data Protection Regulation (GDPR), which addresses the treatment of personal information, as well as the rights of the individuals whose information has been collected. Penalties follow a tiered approach, and the most serious infringements can result in fines of up to 4% of annual worldwide turnover or €20 million, whichever is greater. It is often the case that the information to be shared needs to be anonymous. In some cases, ensuring anonymity removes the data from the jurisdiction of certain laws. Applying these laws is a complex task that needs to be carefully managed to ensure compliance. Refer to Stefan Nerinckx’s article on the new EU data protection regime for additional context.
Pseudonymization – the removal of direct identifiers – is the first step in anonymizing data. This is achieved by removing direct identifiers such as names, surnames, social insurance numbers, and phone numbers, or by replacing them with random or hashed (and salted – see the NYC taxi cab example) values.
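The cell below is a minimal sketch of this idea: replacing a direct identifier with a salted hash. The data frame, its column names ("name" and "phone"), and the resulting "user_id" are hypothetical and serve only to illustrate the technique; this is not the data set used later in this notebook.
In [ ]:
# Illustrative sketch: pseudonymization by salted hashing (hypothetical data).
import hashlib
import secrets
import pandas as pd

# A single random salt, generated once and stored separately from the data.
SALT = secrets.token_hex(16)

def pseudonymize(value, salt=SALT):
    '''Replace a direct identifier with a salted SHA-256 hash.'''
    return hashlib.sha256((salt + str(value)).encode('utf-8')).hexdigest()

# Hypothetical records containing direct identifiers.
people = pd.DataFrame({'name': ['Alice', 'Bob'], 'phone': ['555-0101', '555-0199']})

# Replace the name with a pseudonym, then drop the remaining direct identifiers.
people['user_id'] = people['name'].map(pseudonymize)
people = people.drop(columns=['name', 'phone'])
people.head()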
However, cases like that of William Weld show that pseudonymization is not sufficient to prevent the reidentification of individuals in pseudonymized data sets. In the mid-1990s, the Massachusetts Group Insurance Commission (GIC) released hospital data to researchers for the purpose of improving healthcare and controlling costs. At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers (Sweeney 2002).
Note:
Latanya Sweeney was a graduate student at MIT at that stage. She bought the Cambridge voter rolls, linked them to the released GIC data, reidentified Governor Weld's medical records, and sent these to him (Sweeney 2002).
Sweeney (2002) later demonstrated that 87% of Americans can be uniquely identified by their zip code, gender, and birth date.
This value (i.e., the percentage of members of the data set who are unique – and thus identifiable – given a few quasi-identifiers) has been conceptualized as uniqueness.
While the numerous, available sources of data may reveal insights into human behavior, it is important to be sensitive to the legal and ethical considerations when dealing with them. These sources of data include census data, medical records, financial and transaction data, loyalty cards, mobility data, mobile phone data, browsing history and ratings, research-based or observational data, etc.
You can review the seven principles of privacy by design, for more information.
Uniqueness refers to the fraction of unique records in a particular data set (i.e., the fraction of individuals who are uniquely identifiable, given the selected fields).
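Expressed as a formula (the notation below is introduced here only for clarity, and simply restates the definition above):

$$\text{uniqueness} = \frac{\left|\{\text{records with a unique combination of the selected field values}\}\right|}{\left|\text{records in the data set}\right|}$$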
The available fields in your data set can typically contain the following:
Identifiers: Attributes that can be used to explicitly identify individuals. These are typically removed from data sets prior to release.
Quasi-identifiers: A subset of attributes that can uniquely identify most individuals in the data set. They are not unique identifiers themselves, but are sufficiently well-correlated with an individual that they can be combined with other quasi-identifiers to create a unique identifier.
Anonymization has been chosen as a strategy to protect personal privacy. K-anonymity is the measure used for anonymization, and is defined below according to Sweeney (2002).
The k-anonymity of a data set (given one or more fields) is the size of the smallest group in the data set sharing the same value of the given field(s) – that is, the number of persons having identical values of those fields, rendering them indistinguishable from one another (Sweeney 2002).
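The same definition can be written compactly (again, the notation is introduced here only to restate the definition above): if grouping the records on identical values of the chosen field(s) produces groups $g_1, \dots, g_m$, then

$$k = \min_{i=1,\dots,m} |g_i|$$

and a data set is considered k-anonymous (for a given target k) when every such group contains at least k records.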
For k-anonymity, the person anonymizing the data set needs to decide what the quasi-identifiers are, and what a potential attacker could extract from the provided data set.
Generalization and suppression are the core tools used to anonymize data and make a data set k-anonymous (Samarati and Sweeney 1998). The methods employed in this paradigm trade off k-anonymity against the precision of the data. One of the biggest problems is that this optimization is use-case specific and, therefore, depends on the application. Typical methods include the following:
- Generalization (coarsening): replacing values with less precise ones, for example, replacing a full zip code with its district, or a birth date with a birth year.
- Suppression: removing records (or values) that would otherwise fall into groups smaller than the desired k.
These heuristics typically come with trade-offs. Other techniques (such as noise addition and translation) exist, but provide similar results.
Technical examples of such methods are not of central importance in this course; therefore, only the basic components are demonstrated below to illustrate the fundamentals of the elements discussed above.
In [ ]:
import pandas
# Load the data set.
df = pandas.read_csv('privacy/belgium_100k.csv')
df = df.where((pandas.notnull(df)), None)
df['birthday'] = df['birthday'].astype('datetime64[ns]')
df.head()
In order to calculate uniqueness, as defined earlier, you need to define a function that accepts an input data set and a list of features to be used to evaluate the data set. The output indicates the fraction of records in the data set that can be uniquely identified using the provided features.
In [ ]:
# Define function to evaluate the uniqueness of the provided data set.
def uniqueness(dataframe, pseudo):
    # Group records by the provided (quasi-)identifying features.
    groups = list(dataframe.groupby(pseudo).groups.values())
    # Return the fraction of records that are alone in their group.
    return sum(1. for g in groups if len(g) == 1) / len(dataframe)
In [ ]:
print(uniqueness(df, ['zip']))
print(uniqueness(df, ['sex', 'birthday']))
print(uniqueness(df, ['sex', 'birthday', 'zip']))
The results indicate that about 20% of the individuals could potentially be identified using two features ("sex" and "birthday"), and 99% of the population could potentially be reidentified using three features ("sex", "birthday", and "zip").
As per the earlier definition of k-anonymity, you can calculate k-anonymity as the size of the smallest group of records returned, based on the grouping parameters provided. The code cell below defines a function that takes an input data set and a list of features by which to group the records when performing the evaluation. The function returns the minimum count of records grouped by these features.
In [ ]:
# Define function to evaluate the k-anonymity of the provided data set.
def k_anonymity(dataframe, pseudo):
    # Size of the smallest group of records sharing identical values of the provided features.
    return dataframe.groupby(pseudo).count().min().iloc[0]
In [ ]:
print(k_anonymity(df, ['sex', 'birthday', 'zip']))
In this example, you will notice a value of one. This implies that the minimum number of individuals with a unique combination of the provided features is one, and that there is a significant risk of potential attackers being able to reidentify individuals in the data set.
Typically, the goal is to not have any groups smaller than k, where k is defined by your organizational or industry standards. Typical target values range from six to eight.
Note:
You can experiment with different combinations of features, or repeat the test with single features, to review the impact on the produced result.
Example:
print(k_anonymity(df, ['sex']))
In [ ]:
# Use this code cell to review the k-anonymity function with different input parameters.
In this section, you will coarsen the data using a number of different techniques. Note that some granularity (or accuracy) of the data set is sacrificed in order to preserve the privacy of the individuals whose records it contains.
As mentioned before, the district is contained in the first two characters of the zip code. In order to retain the district when coarsening the data, a simple programmatic transformation (shown below) can be applied. After applying this transformation, you can choose to expose the "zip_district" to end users, instead of the more granular "zip".
In [ ]:
# Reduce the zip code to zip district.
df['zip_district'] = [z // 1000 for z in df['zip']]
df[['zip', 'zip_district']].head(3)
In [ ]:
# From birthday to birth year.
df['birth_year'] = df['birthday'].map(lambda d: d.year)
df[['birthday', 'birth_year']].head(3)
In [ ]:
# From birthday to birth decade.
df['birth_decade'] = df['birth_year'] // 10 * 10
df[['birthday', 'birth_year', 'birth_decade']].head()
This refers to the suppression of all groups that are smaller than the desired k. In many cases, you will reach a point where you have to coarsen data to the point of destroying its utility. Removing records can also be problematic, because you may remove the very records of interest to a particular question (such as the 1% of data linked to a particular feature).
In [ ]:
print(k_anonymity(df, ['sex', 'birth_year', 'zip_district']))
grouped = df.groupby(['sex', 'birth_year', 'zip_district'])
# Suppress all groups containing five or fewer records.
df_filtered = grouped.filter(lambda x: len(x) > 5)
print('Reducing size:', len(df), '> ', len(df_filtered))
print('K-anonymity after suppression:', k_anonymity(df_filtered, ['sex', 'birth_year', 'zip_district']))
Big data sets typically differ from traditional data sets in their scale and, more importantly, in the high-dimensional, longitudinal nature of the behavioral records (such as location traces) they contain.
Many of the traditional measures used to define the uniqueness of individuals, and the strategies used to preserve their privacy, are no longer sufficient. Uniqueness is typically applied to fields containing single values; for rich metadata (such as a user's location over a period of time), unicity has been proposed instead (de Montjoye et al. 2013; de Montjoye et al. 2015). Rather than assuming that an attacker knows all of the quasi-identifiers and none of the data, unicity assumes that any datapoint can either be known to the attacker or be useful for research, and focuses on quantifying the amount of information that would be needed to uniquely reidentify people. In many cases, data is poorly anonymized. You also need to consider the richness of big data sources when evaluating articles such as Natasha Singer’s article on identifying famous people.
Given one or more fields, the unicity of a data set at p datapoints is the fraction of records (users) that are uniquely identifiable given p datapoints drawn at random from their trace.
The concept of unicity was originally developed in cryptography, and is based on information theory. Specifically, the unicity distance is a measure of the secrecy of a cryptographic system: it is the minimum amount of ciphertext needed to uniquely determine the key, and therefore indicates how effective the system is against third parties attempting to gain access to protected information.
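As an illustration of the cryptographic notion (a standard result from Shannon's information theory, and not needed for the exercises below), the unicity distance can be approximated as

$$U \approx \frac{H(K)}{D}$$

where $H(K)$ is the entropy of the key space and $D$ is the per-character redundancy of the plaintext language. For a simple substitution cipher over English text, $H(K) = \log_2(26!) \approx 88$ bits and $D \approx 3.2$ bits per character, so roughly 28 characters of ciphertext are enough to determine the key uniquely.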
An algorithm for computing unicity is as follows: for each user, draw p datapoints at random from their trace; check whether any other user in the sample matches all p of those datapoints; if no other user does, the user is counted as unique. Unicity is the fraction of unique users, and the estimate is averaged over several iterations of random draws.
Note:
Unicity is well-suited for big data and its metadata, meaning that it is applicable to features containing numerous values (such as a trace, for example, a history of GPS coordinates).
An implementation of the unicity assessment algorithm is given below.
Note:
You do not need to understand the code in the cells below. It is provided as a sample implementation for advanced students.
In [ ]:
# Functions implementing the unicity assessment algorithm.
def draw_points(user, points):
    '''IN: a Series; int. Draw a random sample of datapoints from a user's trace.'''
    user = user.dropna()
    indices = np.random.choice(len(user), points, replace=False)
    return user.iloc[indices]

def is_unique(user_name, points):
    '''IN: str, int. Check whether no other user matches the drawn datapoints.'''
    drawn_p = draw_points(samples[user_name], points)
    # Compare the drawn points against every other user at the same timestamps.
    for other_user in samples.loc[drawn_p.index].drop(user_name, axis=1).values.T:
        if np.equal(drawn_p.values, other_user).all():
            return False
    return True

def compute_unicity(samples, points):
    '''IN: DataFrame, int. Fraction of users uniquely identified by the drawn points.'''
    unique_count = 0.
    users = samples.columns
    for user_name in users:
        if is_unique(user_name, points):
            unique_count += 1
    return unique_count / len(samples.columns)

def iterate_unicity(samples, points=4, iterations=10):
    '''IN: DataFrame, int, int. Average unicity over several random draws.'''
    unicities = []
    for _ in tqdm(range(iterations)):
        unicities.append(compute_unicity(samples, points))
    return np.mean(unicities)
In [ ]:
# Load required libraries and methods.
import pandas as pd
import numpy as np
from scipy.stats import rv_discrete
from tqdm import tqdm
%pylab inline
In [ ]:
# Load samples of the data set.
samples = pd.read_csv('privacy/mobility_sample_1k.csv', index_col='datetime')
samples.index = samples.index.astype('datetime64[ns]')
samples.head(3)
In this example, you will use one datapoint and three iterations. The result will vary based on the selected sample, but should indicate that about 35% of the individuals in the sample could potentially be identified using a single datapoint.
Note:
Each call to the “compute_unicity” function produces a single estimate based on randomly drawn datapoints. The “iterate_unicity” function averages several of these estimates; for a more robust result, increase the number of iterations.
In [ ]:
# Compute unicity.
iterate_unicity(samples, 1, 3)
de Montjoye et al. (2013) showed that, for 1.5 million people (over the course of fifteen months), four spatio-temporal points (locations and timestamps) are enough to uniquely identify 95% of the users. For another roughly 1 million people (credit card metadata covering three months), de Montjoye et al. (2015) found that unicity reached 90% at four points (shop and day), and 94% at only three points when the approximate price is also known. Such ease of identification means that effectively anonymizing big data about individuals strips it of its utility.
Note: The “Friends and Family” data set individually transformed the location data of each user, which preserved their privacy well, yet rendered the data unusable for the purposes of this notebook. The “StudentLife” data set, on the other hand, left the GPS records intact, which enabled you to use this as input for Module 3’s notebook exercises. This introduces the risk of attacks through reidentifying individuals by reconciling the GPS records with location services such as Foursquare, Twitter, and Facebook.
Exercise 1: Calculate the unicity of the raw data set ("samples"), using four datapoints and five iterations, by calling the "iterate_unicity" function defined above.
In [ ]:
# Your code here.
Exercise complete:
This is a good time to "Save and Checkpoint".
Similarly, here you could try coarsening in order to anonymize the data. However, this approach has been shown to be insufficient to make a data set anonymous.
For more information on coarsening data, read about the implementation and interpretation of results of the "Unique in the Crowd: The privacy bounds of human mobility" study conducted by Yves-Alexandre de Montjoye et al. (2013).
Please review the paper and pay special attention to Figure 4, which demonstrates how the uniqueness of mobility traces (ε) depends on the spatial and temporal resolution of the data. The study found that traces are more unique when coarse on one dimension, and fine along another, than when they are medium-grained along both dimensions. (Unique implies being easier to attack, through the reidentification of individuals.)
The risk of reidentification decreases with the application of these basic techniques; however, the decrease is not fast enough. An alternative solution, for this specific use case, is to merge the antennas into (big) groups of 10, in an attempt to lower the unicity.
Note:
The two code cells below are used to prepare your data set, but do not produce any output. They will generate the input data set required for Exercise 2. The second code cell will also produce a warning, which you can safely ignore.
In [ ]:
# Load antenna data.
antennas = pd.read_csv("privacy/belgium_antennas.csv")
antennas.set_index('ins', inplace=True)
cluster_10 = pd.read_csv('privacy/clusters_10.csv')
cluster_10['ins'] = list(map(int, cluster_10['ins']))
mapping = dict(cluster_10[['ins', 'cluster']].values)
In [ ]:
# Reduce the grain of the data set by mapping each antenna to its cluster of 10.
samples_10 = samples.copy()
# The antenna indices are stored as floats (because of NaN values), so cast to int before the lookup.
samples_10 = samples_10.applymap(
    lambda k: np.nan if np.isnan(k) else mapping[antennas.index[int(k)]])
Calculate the unicity of the coarsened mobility data set ("samples_10") using the same number of datapoints (four) and iterations (five) as in Exercise 1. You need to execute the same function, replacing the input data set, "samples", with the newly created "samples_10" data set.
- What is the difference, if any, between your answer and the answer from the previous exercise?
- How much does it improve anonymity (if at all)?
- Is the loss of spatial resolution worth the computational load and effort?
In [ ]:
# Your code here.
Your markdown answer here.
Exercise complete:
This is a good time to "Save and Checkpoint".
In the context of big data, the existing concepts and methods of privacy preservation are inadequate. Even the basic measure of how unique an individual is within a data set needs to be replaced. Perhaps more importantly, the old measure of privacy (k-anonymity) is unattainable unless the majority of the information has been removed from the data (consider the unusable location data of the “Friends and Family” data set). This leads to the conclusion that:
Anonymity is no longer a solution to the privacy problem, in the context of big data.
The answer lies in the paradigm of data handling, and can only be solved by changes in software architecture. Solutions that provide finely grained access control and remote computation – like openPDS by Yves-Alexandre de Montjoye, or Solid ("social linked data") by Sir Tim Berners-Lee (the inventor of the World Wide Web) – show the way by effectively changing the privacy problem into a security one. You can also review the OPAL project – another initiative related to big data and privacy.
It has been shown that data anonymization is no longer a practical solution in the context of big data.
- In your own words, describe the typical problems experienced with the anonymization approach in the context of big data. Your description should be two or three sentences in length.
- What is the best alternative approach to ensure the privacy of sensitive data in the context of big data?
Your markdown answer here.
Exercise complete:
This is a good time to "Save and Checkpoint".
Arrington, Michael. 2006. “AOL Proudly Releases Massive Amounts of Private Data.” Accessed August 21, 2016. https://techcrunch.com/2006/08/06/aol-proudly-releases-massive-amounts-of-user-search-data/.
de Montjoye, Yves-Alexandre, César A. Hidalgo, Michel Verleysen, and Vincent D. Blondel. 2013. “Unique in the Crowd: The privacy bounds of human mobility.” Scientific Reports 3. doi:10.1038/srep01376.
de Montjoye, Yves-Alexandre, Laura Radaelli, Vivek Kumar Singh, and Alex “Sandy” Pentland. 2015. “Unique in the Shopping Mall: On the Re-identifiability of Credit Card Metadata.” Science 347:536–539. doi:10.1126/science.1256297.
Golle, Philippe. 2006. “Revisiting the Uniqueness of Simple Demographics in the US Population.” Proceedings of the 5th ACM Workshop on Privacy in Electronic Society, Alexandria, Virginia, October 30.
Gymrek, Melissa, Amy L. McGuire, David Golan, Eran Halperin, and Yaniv Erlich. 2013. “Identifying Personal Genomes by Surname Inference.” Science 339: 321–24. doi:10.1126/science.1229566.
Narayanan, Arvind, and Vitaly Shmatikov. 2006. “How To Break Anonymity of the Netflix Prize Dataset.” arXiv [cs.CR]. http://arxiv.org/abs/cs/0610105.
Samarati, Pierangela, and Latanya Sweeney. 1998. “Protecting Privacy When Disclosing Information: K-Anonymity and Its Enforcement through Generalization and Suppression.” Accessed October 14, 2016. http://epic.org/privacy/reidentification/Samarati_Sweeney_paper.pdf.
Sweeney, Latanya. 2002. “K-Anonymity: A Model for Protecting Privacy.” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10:557–70.
In [ ]: