Python 3.6 Jupyter Notebook

Privacy by design: Big data and personal data protection

Your completion of the notebook exercises will be graded based on your ability to do the following:

Understand: Do your pseudo-code and comments show evidence that you recall and understand technical concepts?

Apply: Are you able to execute code (using the supplied examples) that performs the required functionality on supplied or generated data sets?

Notebook objectives:

By the end of this notebook, you will be expected to:

  • Understand the importance, challenges, and approaches to personal data protection;
  • Identify the dimensions across which big data differs from traditional data sets;
  • Understand the concept of unicity;
  • Use coarsening to anonymize personal information in data; and
  • Understand the limitations of anonymization in the context of big data.

List of exercises:

  • Exercise 1: Calculate the unicity of a raw data set.
  • Exercise 2: Calculate and interpret the unicity of a coarsened data set.
  • Exercise 3: Identify limitations of data anonymization in the context of big data, and suggest alternative data-protection mechanisms.

Notebook introduction

In the video content, Cameron Kerry indicated that the law lags too far behind technology to answer many of the hard questions around data protection. He then went on to elaborate that, in many cases, the question becomes not just what you must do, but rather, what you should do in order to establish and maintain a trust relationship.

Sharing data (collected about individuals) between entities poses a risk to privacy and trust, and is regulated in most parts of the world. The European Union recently passed the General Data Protection Regulation (GDPR), which addresses the treatment of personal information, as well as the rights of the individuals whose information has been collected. Penalties are based on a tiered approach, and some infringements can result in fines of up to €20 million, or 4% of annual worldwide turnover, whichever is greater. It is often the case that the information to be shared needs to be anonymous. In some cases, ensuring anonymity removes the data from the jurisdiction of certain laws. The application of the laws is a complex task that needs to be carefully implemented to ensure compliance. Refer to Stefan Nerinckx’s article on the new EU data protection regime for additional context.

Pseudonymization – the removal of direct identifiers – is the first step in anonymizing data. It is achieved by removing direct identifiers such as names, surnames, social insurance numbers, and phone numbers, or by replacing them with random or hashed (and salted – see the NYC taxi cab example) values.
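
As a minimal sketch of the salted-hashing approach (the salt value and the sample phone number below are invented for illustration; in practice a keyed construction such as HMAC, with a secret key, is preferable):


In [ ]:
import hashlib

# Hypothetical sketch: replace a direct identifier with a salted hash.
# The salt must be kept secret; a guessable or missing salt lets an attacker
# rebuild the mapping from known values (as in the NYC taxi cab example).
SALT = 'replace-with-a-long-secret-random-value'

def pseudonymize(value, salt=SALT):
    # Return a salted SHA-256 hash of a direct identifier.
    return hashlib.sha256((salt + str(value)).encode('utf-8')).hexdigest()

pseudonymize('+32 2 555 0100')  # a fictional phone number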

However, cases like William Weld's show that pseudonymization is not sufficient to prevent the reidentification of individuals in pseudonymized data sets. In the mid-1990s, the Massachusetts Group Insurance Commission (GIC) released hospital data to researchers for the purpose of improving healthcare and controlling costs. At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers (Sweeney 2002).

Note:

Latanya Sweeney was a graduate student at MIT at that stage. She bought the Cambridge voter rolls, linked them to the released hospital data, reidentified Governor Weld's medical records, and sent these to him (Sweeney 2002).

Sweeney (2002) later demonstrated that 87% of Americans can be uniquely identified by their zip code, gender, and birth date.

This value (i.e., the percentage of members of the data set who are unique – and thus identifiable – given a handful of quasi-identifiers) has been conceptualized as uniqueness.

While the numerous available sources of data may reveal insights into human behavior, it is important to be sensitive to the legal and ethical considerations when dealing with them. These sources include census data, medical records, financial and transaction data, loyalty cards, mobility data, mobile phone data, browsing history and ratings, and research-based or observational data.

You can review the seven principles of privacy by design, for more information.

Note:
It is strongly recommended that you save and checkpoint after applying significant changes or completing exercises. This allows you to return the notebook to a previous state should you wish to do so. On the Jupyter menu, select "File", then "Save and Checkpoint" from the dropdown menu that appears.

1. Uniqueness and k-anonymity

Uniqueness refers to the fraction of unique records in a particular data set (i.e., the fraction of individuals who are identifiable, given the fields).

The available fields in your data set can typically contain the following:

Identifiers: Attributes that can be used to explicitly identify individuals. These are typically removed from data sets prior to release.

Quasi-identifiers: A subset of attributes that can uniquely identify most individuals in the data set. They are not unique identifiers themselves, but are sufficiently well-correlated with an individual that they can be combined with other quasi-identifiers to create a unique identifier.

Anonymization is a common strategy for protecting personal privacy, and k-anonymity is the measure most often used to quantify it. It is defined below, following Sweeney (2002).

The k-anonymity of a data set (given one or more fields) is the size of the smallest group in the data set sharing the same values of the given field(s) – that is, the number of persons with identical values of those fields, rendering them indistinguishable from one another (Sweeney 2002).

For k-anonymity, the person anonymizing the data set needs to decide what the quasi-identifiers are, and what a potential attacker could extract from the provided data set.
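
To make the definition concrete, here is a minimal, self-contained illustration on a toy table (the values below are invented for this example):


In [ ]:
import pandas as pd

# Toy illustration of k-anonymity (invented values).
toy = pd.DataFrame({'sex':        ['F', 'F', 'M', 'M', 'M'],
                    'birth_year': [1980, 1980, 1980, 1975, 1975]})

# k is the size of the smallest group sharing the same quasi-identifier values.
print(toy.groupby(['sex', 'birth_year']).size().min())

With the quasi-identifiers "sex" and "birth_year", the smallest group – the single ('M', 1980) row – has size one, so this toy table is only 1-anonymous.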

Generalization and suppression are the core tools used to anonymize data, and make a data set k-anonymous (Samarati and Sweeney 1998). The privacy-securing methods employed in this paradigm trade off a higher k-anonymity against the precision of the data. One of the biggest problems is that this optimization is use-case specific and, therefore, depends on the application. Typical methods include the following:

  • Generalization (or coarsening): Reducing the resolution of the data. For example, date of birth -> year of birth -> decade of birth.
  • Suppression: Removing rows (from groups in which k is lower than desired) from the data set.

These heuristics typically come with trade-offs. Other techniques (such as noise addition and translation) exist, but provide similar results; a brief sketch of noise addition follows.
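
For the curious, here is a minimal sketch of noise addition (the ages and the noise scale below are invented for illustration; real deployments calibrate the noise to a formal privacy model, such as differential privacy):


In [ ]:
import numpy as np
import pandas as pd

# Illustrative noise addition (invented values, not course data): perturb a
# numeric quasi-identifier so exact values no longer match auxiliary data,
# at the cost of some precision.
rng = np.random.RandomState(0)
ages = pd.Series([34, 41, 29, 58, 47], name='age')
noisy_ages = ages + rng.laplace(loc=0, scale=2, size=len(ages)).round()
pd.DataFrame({'age': ages, 'noisy_age': noisy_ages})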

Technical examples of such methods are not of central importance in this course; therefore, only the basic components are illustrated below to demonstrate the fundamentals of the elements discussed above.

1.1 Load data set

This example uses a synthetic data set created for 100,000 fictional people from Belgium. The zip codes are random numbers adhering to the same standards observed in Belgium, with the first two characters indicating the district.


In [ ]:
import pandas

# Load the data set.
df = pandas.read_csv('privacy/belgium_100k.csv')
df = df.where(pandas.notnull(df), None)
df['birthday'] = df['birthday'].astype('datetime64[ns]')
df.head()

1.2 Calculate uniqueness

In order to calculate uniqueness, as defined earlier, you need to define a function that accepts an input data set and a list of features with which to evaluate it. The output indicates the fraction of records in the data set that can be uniquely identified using the provided features.


In [ ]:
# Define function to evaluate uniqueness of the provided dataset.
def uniqueness(dataframe, pseudo):
    groups = list(dataframe.groupby(pseudo).groups.values())
    return sum(1. for g in groups if len(g) == 1) / len(dataframe)

In [ ]:
print(uniqueness(df, ['zip']))
print(uniqueness(df, ['sex', 'birthday']))
print(uniqueness(df, ['sex', 'birthday', 'zip']))

The results indicate that about 20% of the individuals could potentially be identified using two features ("sex" and "birthday"), and 99% of the population could potentially be reidentified using three features ("sex", "birthday", and "zip").

1.3 K-anonymity

As per the earlier definition, k-anonymity is the size of the smallest group of records returned, based on the grouping parameters provided. The code cell below defines a function that takes an input data set and a list of features by which to group the records when performing the evaluation. The function returns the minimum size of the groups formed by these features.


In [ ]:
# Define a function to evaluate the k-anonymity of the provided data set.
def k_anonymity(dataframe, pseudo):
    # Size of the smallest group sharing the same values of the given fields.
    return dataframe.groupby(pseudo).size().min()

In [ ]:
print(k_anonymity(df, ['sex', 'birthday', 'zip']))

In this example, you will notice a value of one. This implies that the minimum number of individuals with a unique combination of the provided features is one, and that there is a significant risk of potential attackers being able to reidentify individuals in the data set.

Typically, the goal is to have no groups smaller than k, where k is defined by your organizational or industry standards. Typical target values range from six to eight.

Note:

You can experiment with different combinations of features, or repeat the test with single features, to review the impact on the produced result.

Example: print(k_anonymity(df, ['sex']))


In [ ]:
# Use this code cell to review the k-anonymity function with different input parameters.

1.4 Coarsening of data

In this section, you will coarsen the data using a number of different techniques. Note that granularity or accuracy of the data set is sacrificed in order to preserve the privacy of the individuals in the data set.

1.4.1 Remove the zip code

As mentioned before, the district is contained in the first two characters of the zip code. In order to retain the district when coarsening the data, a simple programmatic transformation (shown below) can be applied. After applying this transformation, you can choose to expose the "zip_district" to end users, instead of the more granular "zip".


In [ ]:
# Reduce the zip code to the zip district (keep the first two digits).
df['zip_district'] = [z // 100 for z in df['zip']]
df[['zip', 'zip_district']].head(3)

1.4.2 Coarsen the data from birthday to birth year

Similar to the previous exercise, you can expose the birth year, instead of the birthday, as demonstrated in the code cell below.


In [ ]:
# From birthday to birth year.
df['birth_year'] = df['birthday'].dt.year
df[['birthday', 'birth_year']].head(3)

1.4.3 Coarsen the data from birthday to birth decade

You can reduce granularity to decade-level instead of year-level, as seen in Section 1.4.2, with the code demonstrated below.


In [ ]:
# From birthday to birth decade.
df['birth_decade'] = df['birth_year'] // 10 * 10
df[['birthday', 'birth_year', 'birth_decade']].head()

1.5 Suppression

This refers to the suppression of all of the groups that are smaller than the desired k. In many cases, you will reach a point where you have to coarsen data to the point of destroying its utility. Removing records can also be problematic because you may remove the very records of interest to a particular question (such as the 1% of the data linked to a particular feature).


In [ ]:
print(k_anonymity(df, ['sex', 'birth_year', 'zip_district']))
grouped = df.groupby(['sex', 'birth_year', 'zip_district'])
df_filtered = grouped.filter(lambda x: len(x) > 5)
print('Reducing size:', len(df), '->', len(df_filtered))
print('K-anonymity after suppression:', k_anonymity(df_filtered, ['sex', 'birth_year', 'zip_district']))

2. Privacy considerations for big data

Big data sets typically differ from traditional data sets in terms of the following:

  • Longitudinality: Data is typically collected for months, years, or even indefinitely. This is in contrast to snapshots or clearly-defined retention periods.
  • Resolution: Datapoints are collected at frequencies down to the single second.
  • Features: Features have unprecedented width and detail for behavioral data, including location and mobility, purchase histories, and more.

Many of the traditional measures used to define the uniqueness of individuals, and the strategies used to preserve users' privacy, are no longer sufficient. Whereas uniqueness applies to fields consisting of single values, unicity has been proposed for richer data (de Montjoye et al. 2013; de Montjoye et al. 2015). Unicity measures the ease of reidentification of individuals in sets of metadata (such as a user's location over a period of time). Instead of assuming that an attacker knows all of the quasi-identifiers and none of the data, unicity assumes that any datapoint can either be known to the attacker or useful for research, and focuses on quantifying the amount of information that would be needed to uniquely reidentify people. In many cases, data is poorly anonymized. You also need to consider the richness of big data sources when evaluating articles, such as Natasha Singer’s article on identifying famous people.

2.1 Unicity of a data set at p datapoints (given one or more fields)

Given one or more fields, the unicity of a data set at p datapoints refers to:

  • The fraction of users who can be uniquely identified by p randomly-chosen points from that field; and
  • The approximate number of datapoints needed to reconcile two data sets.

The concept of unicity was originally developed in cryptography, and is based on information theory. Specifically, the unicity distance is a measure of the secrecy of a cryptographic system, and determines how effective it is against third parties gaining access to protected information.
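
For context, the classical unicity distance can be written down explicitly (a standard information-theoretic result, stated here for illustration):

$$U = \frac{H(K)}{D}$$

where $H(K)$ is the entropy of the key and $D$ is the redundancy of the plaintext per character. For example, with a 128-bit key and English plaintext (redundancy of roughly 3.2 bits per character), $U \approx 128 / 3.2 = 40$: about 40 characters of ciphertext are, in principle, enough to determine the key uniquely.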

An algorithm for computing unicity is shown below. In essence, for each user it draws p random datapoints from that user's trace, and checks whether any other user's trace matches those points exactly; the fraction of users matched by no one else is the unicity estimate.

Note:

Unicity is well-suited for big data and its metadata, meaning that it is applicable to features containing numerous values (such as a trace, for example, a history of GPS coordinates).

An implementation of the unicity assessment algorithm is given below.

Note:

You do not need to understand the code in the cells below. It is provided as a sample implementation for advanced students.


In [ ]:
# Functions implementing the unicity assessment algorithm.
def draw_points(user, points):
    '''Draw `points` random datapoints from a single user's trace.
    IN: a Series; int'''
    user = user.dropna()
    indices = np.random.choice(len(user), points, replace=False)
    return user.iloc[indices]

def is_unique(user_name, points):
    '''Check whether the drawn points match no other user's trace.
    IN: str; int'''
    drawn_p = draw_points(samples[user_name], points)
    # Compare the drawn points against every other user at the same times.
    others = samples.loc[drawn_p.index].drop(user_name, axis=1)
    for other_user in others.to_numpy().T:
        if np.equal(drawn_p.values, other_user).all():
            return False
    return True

def compute_unicity(samples, points):
    '''Estimate the fraction of users uniquely identifiable from `points` datapoints.
    IN: a DataFrame; int'''
    unique_count = 0.0
    for user_name in samples.columns:
        if is_unique(user_name, points):
            unique_count += 1
    return unique_count / len(samples.columns)

def iterate_unicity(samples, points=4, iterations=10):
    '''Average several unicity estimates for a more robust result.
    IN: a DataFrame; int; int'''
    unicities = []
    for _ in tqdm(range(iterations)):
        unicities.append(compute_unicity(samples, points))
    return np.mean(unicities)

2.1.1 Example: Assessing the unicity of a data set

In this example, you will use a synthetic data set that simulates the mobility of 1,000 users. The data set contains mobile phone records at hourly intervals.

Sampling


In [ ]:
# Load required libraries and methods.
import pandas as pd
import numpy as np
from scipy.stats import rv_discrete
from tqdm import tqdm

%matplotlib inline

In [ ]:
# Load samples of the data set.
samples = pd.read_csv('privacy/mobility_sample_1k.csv', index_col='datetime')
samples.index = samples.index.astype('datetime64[ns]')
samples.head(3)

Compute the unicity for a single datapoint (with three iterations)

In this example, you will use one datapoint and three iterations. The result will vary based on the selected sample, but should indicate that about 35% of the individuals in the sample could potentially be identified using a single datapoint.

Note:

A single estimation of unicity is performed using the “compute_unicity” function; for a more robust result, the “iterate_unicity” function averages several such estimations.


In [ ]:
## Compute unicity.
iterate_unicity(samples, 1, 3)

2.2 Unicity levels in big data sets and their consequences

de Montjoye et al. (2013) showed that, for 1.5 million people (over the course of a year), four spatio-temporal points (locations and timestamps) are enough to uniquely identify 95% of the users. Similarly, de Montjoye et al. (2015) showed that, for another 1.1 million people (over three months of credit card records), unicity reached 90% at four points (shop and day), and even 94% at only three points (shop, day, and approximate price). Such ease of identification means that effectively anonymizing the big data of individuals strips it of its utility.

Note: The “Friends and Family” data set individually transformed the location data of each user, which preserved their privacy well, yet rendered the data unusable for the purposes of this notebook. The “StudentLife” data set, on the other hand, left the GPS records intact, which enabled you to use this as input for Module 3’s notebook exercises. This introduces the risk of attacks through reidentifying individuals by reconciling the GPS records with location services such as Foursquare, Twitter, and Facebook.


Exercise 1 Start.

Instructions

Calculate the unicity at four datapoints. Iterate five times for additional accuracy. You can find the syntax in the example calculation above, and change the parameters as required.

Question: Is the resulting unicity value large or small? What does this mean for anonymity?


In [ ]:
# Your code here.


Exercise 1 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

2.3 Coarsening

Similarly, you could try coarsening here in order to anonymize the data. However, this approach has been shown to be insufficient for making a big data set anonymous.

For more information on coarsening data, read about the implementation and interpretation of results of the "Unique in the Crowd: The privacy bounds of human mobility" study conducted by Yves-Alexandre de Montjoye et al. (2013).

Please review the paper and pay special attention to Figure 4, which demonstrates how the uniqueness of mobility traces (ε) depends on the spatial and temporal resolution of the data. The study found that traces are more unique when coarse along one dimension and fine along another than when they are medium-grained along both dimensions. (More unique implies easier to attack, through the reidentification of individuals.)

The risk of reidentification decreases with the application of these basic techniques. However, this decrease is not fast enough. An alternate solution, for this specific use case, is to merge the antennas into (big) groups of 10, in an attempt to lower the unicity.

Note:

The two code cells below are used to prepare your data set, but do not produce any output. They will generate the input data set required for Exercise 2. The second code cell will also produce a warning, which you can safely ignore.


In [ ]:
# Load antenna data.
antennas = pd.read_csv("privacy/belgium_antennas.csv")
antennas.set_index('ins', inplace=True)

# Load the antenna-to-cluster mapping (groups of 10 antennas).
cluster_10 = pd.read_csv('privacy/clusters_10.csv')
cluster_10['ins'] = cluster_10['ins'].astype(int)
mapping = dict(cluster_10[['ins', 'cluster']].values)

In [ ]:
# Reduce the grain of the data set by mapping each antenna to its cluster.
# Requires Numpy version 1.11.
samples_10 = samples.copy()
samples_10 = samples_10.applymap(
    lambda k: np.nan if np.isnan(k) else mapping[antennas.index[int(k)]])


Exercise 2 Start.

Instructions

Calculate the unicity of the coarsened mobility data set (samples_10) with the same number of datapoints (four) and iterations (five) as in Exercise 1. You need to execute the same function, replacing the input data set, "samples", with the newly created "samples_10" data set.

  1. What is the difference between your answer, and the answer provided in the previous exercise, if any?
  2. How much does it improve anonymity (if at all)?
  3. Is the loss of spatial resolution worth the computational load and effort?

In [ ]:
# Your code here.

Your markdown answer here.


Exercise 2 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

2.3 Big data privacy: Conclusion

In the context of big data, the existing concepts and methods of privacy preservation are inadequate. Even the basic measure of how unique an individual is within the data set needs to be replaced. Perhaps more importantly, the old measure of privacy (k-anonymity) is unattainable unless the majority of information has been removed from the data (consider the unusable location data of the “Friends and Family” data set). This leads to the following conclusion:

Anonymity is no longer a solution to the privacy problem, in the context of big data.

The answer lies in the paradigm of data handling, and can only be solved by changes in software architecture. Solutions that provide fine-grained access control and remote computation – like openPDS by Yves-Alexandre de Montjoye, or Solid ("social linked data") by Sir Tim Berners-Lee (the inventor of the World Wide Web) – show the way by effectively turning the privacy problem into a security one. You can also review the OPAL project – another initiative related to big data and privacy.
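
To illustrate the shift from sharing raw data to sharing vetted answers, below is a minimal conceptual sketch (the function, the policy threshold, and the query are invented for illustration; openPDS and Solid provide far richer, real access-control mechanisms):


In [ ]:
# Conceptual sketch (invented API): the raw records stay in the data store,
# and only aggregate answers that pass a policy check are released.
MIN_GROUP_SIZE = 10  # assumed policy threshold, analogous to a target k

def answer_query(dataframe, group_by, min_group_size=MIN_GROUP_SIZE):
    """Return aggregate counts, suppressing groups that are too small."""
    counts = dataframe.groupby(group_by).size()
    return counts[counts >= min_group_size]

# Example: release how many people live in each district -- never the rows.
# answer_query(df, ['zip_district'])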


Exercise 3 Start.

Instructions

It has been shown that data anonymization is no longer a practical solution in the context of big data.

  1. In your own words, describe the typical problems experienced with the anonymization approach in the context of big data. Your description should be two or three sentences in length.
  2. What is the best alternative approach to ensure the privacy of sensitive data in the context of big data?

Your markdown answer here.


Exercise 3 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

3. Submit your notebook

Please make sure that you:

  • Perform a final "Save and Checkpoint";
  • Download a copy of the notebook in ".ipynb" format to your local machine using "File", "Download as", and "IPython Notebook (.ipynb)"; and
  • Submit a copy of this file to the Online Campus.

4. References

Arrington, Michael. 2006. “AOL Proudly Releases Massive Amounts of Private Data.” Accessed August 21, 2016. https://techcrunch.com/2006/08/06/aol-proudly-releases-massive-amounts-of-user-search-data/.

de Montjoye, Yves-Alexandre, César A. Hidalgo, Michel Verleysen, and Vincent D. Blondel. 2013. “Unique in the Crowd: The privacy bounds of human mobility.” Scientific Reports 3. doi:10.1038/srep01376.

de Montjoye, Yves-Alexandre, Laura Radaelli, Vivek Kumar Singh, and Alex “Sandy” Pentland. 2015. “Unique in the Shopping Mall: On the Re-identifiability of Credit Card Metadata.” Science 347:536–539. doi:10.1126/science.1256297.

Golle, Philippe. 2006. “Revisiting the Uniqueness of Simple Demographics in the US Population.” Proceedings of the 5th ACM Workshop on Privacy in Electronic Society, Alexandria, Virginia, October 30.

Gymrek, Melissa, Amy L. McGuire, David Golan, Eran Halperin, and Yaniv Erlich. 2013. “Identifying Personal Genomes by Surname Inference.” Science 339:321–324. doi:10.1126/science.1229566.

Narayanan, Arvind, and Vitaly Shmatikov. 2006. “How To Break Anonymity of the Netflix Prize Dataset.” arXiv [cs.CR]. http://arxiv.org/abs/cs/0610105.

Samarati, Pierangela, and Latanya Sweeney. 1998. “Protecting Privacy When Disclosing Information: K-Anonymity and Its Enforcement through Generalization and Suppression.” Accessed October 14, 2016. http://epic.org/privacy/reidentification/Samarati_Sweeney_paper.pdf.

Sweeney, Latanya. 2002. “K-Anonymity: A Model for Protecting Privacy.” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10:557–570.


In [ ]: