Tutorial: The Basic Tools of Private Deep Learning

Welcome to PySyft's introductory tutorial for privacy-preserving, decentralized deep learning. This series of notebooks is a step-by-step guide to the new tools and techniques required for doing deep learning on secret/private data and models without centralizing them under one authority.

Scope: Note that we won't just be talking about how to decentralize and encrypt data; we'll be addressing how PySyft can be used to help decentralize the entire ecosystem around data, including the databases where data is stored and queried, and the neural models used to extract information from data. As new extensions to PySyft are created, these notebooks will be extended with new tutorials to explain the new functionality.

Authors:

Outline:

  • Part 1: The Basic Tools of Private Deep Learning

Why Take This Tutorial?

1) A Competitive Career Advantage - For the past 20 years, the digital revolution has made data more and more accessible in ever larger quantities as analog processes have become digitized. However, with new regulation such as GDPR, enterprises are under pressure to exercise far less freedom in how they use - and more importantly how they analyze - personal information. Bottom Line: Data Scientists aren't going to have access to as much data with "old school" tools, but by learning the tools of Private Deep Learning, YOU can be ahead of this curve and have a competitive advantage in your career.

2) Entrepreneurial Opportunities - There are a whole host of problems in society that Deep Learning can solve, but many of the most important haven't been explored because they would require access to incredibly sensitive information about people (consider using Deep Learning to help people with mental or relationship issues!). Thus, learning Private Deep Learning unlocks a whole host of new startup opportunities which were not previously available to those without these toolsets.

3) Social Good - Deep Learning can be used to solve a wide variety of problems in the real world, but Deep Learning on personal information is Deep Learning about people, for people. Learning how to do Deep Learning on data you don't own represents more than a career or entrepreneurial opportunity, it is the opportunity to help solve some of the most personal and important problems in people's lives - and to do it at scale.

How do I get extra credit?

... ok ... let's do this!

Part -1: Prerequisites

Part 0: Setup

To begin, you'll need to make sure you have the right things installed. To do so, head on over to PySyft's readme and follow the setup instructions. The TLDR for most folks is:

  • Install Python 3.6 or higher
  • Install PyTorch 1.4
  • pip install syft[udacity]

If any part of this doesn't work for you, first check INSTALLATION.md for help, and then open a GitHub issue or ping the #beginner channel in our Slack: slack.openmined.org


In [ ]:
# Run this cell to see if things work
import sys

import torch
from torch.nn import Parameter
import torch.nn as nn
import torch.nn.functional as F

import syft as sy
hook = sy.TorchHook(torch)  # hook PyTorch to add extra functionality for remote execution

torch.tensor([1,2,3,4,5])

If this cell executed, then you're off to the races! Let's do this!

Part 1: The Basic Tools of Private, Decentralized Data Science

So - the first question you may be asking is - how in the world do we train a model on data we don't have access to?

Well, the answer is surprisingly simple. If you're used to working in PyTorch, then you're used to working with torch.Tensor objects like these!


In [ ]:
x = torch.tensor([1,2,3,4,5])
y = x + x
print(y)

Obviously, using these super fancy (and powerful!) tensors is important, but it also requires you to have the data on your local machine. This is where our journey begins.

Section 1.1 - Sending Tensors to Bob's Machine

Whereas normally we would perform data science / deep learning on the machine which holds the data, now we want to perform this kind of computation on some other machine. More specifically, we can no longer simply assume that the data is on our local machine.

Thus, instead of using Torch tensors, we're now going to work with pointers to tensors. Let me show you what I mean. First, let's create a "pretend" machine owned by a "pretend" person - we'll call him Bob.


In [ ]:
bob = sy.VirtualWorker(hook, id="bob")

Let's say Bob's machine is on another planet - perhaps on Mars! But, at the moment the machine is empty. Let's create some data so that we can send it to Bob and learn about pointers!


In [ ]:
x = torch.tensor([1,2,3,4,5])
y = torch.tensor([1,1,1,1,1])

And now - let's send our tensors to Bob!!


In [ ]:
x_ptr = x.send(bob)
y_ptr = y.send(bob)

In [ ]:
x_ptr

BOOM! Now Bob has two tensors! Don't believe me? Have a look for yourself!


In [ ]:
bob._objects

In [ ]:
z = x_ptr + x_ptr

In [ ]:
z

In [ ]:
bob._objects

Now notice something. When we called x.send(bob), it returned a new object that we called x_ptr. This is our first pointer to a tensor. Pointers to tensors do NOT actually hold data themselves. Instead, they simply contain metadata about a tensor (with data) stored on another machine. The purpose of these pointers is to give us an intuitive API for telling the other machine to compute functions using this tensor. Let's take a look at the metadata that pointers contain.


In [ ]:
x_ptr

Check out that metadata!

There are two main attributes specific to pointers:

  • x_ptr.location : bob, a reference to the machine that the pointer points to
  • x_ptr.id_at_location : <random integer>, the id under which the tensor is stored at location

They are printed in the format <id_at_location>@<location>

There are also other more generic attributes:

  • x_ptr.id : <random integer>, the id of our pointer tensor, allocated randomly
  • x_ptr.owner : "me", the worker which owns the pointer tensor; here it's the local worker, named "me"

In [ ]:
x_ptr.location

In [ ]:
bob

In [ ]:
bob == x_ptr.location

In [ ]:
x_ptr.id_at_location

In [ ]:
x_ptr.owner

You may wonder why the local worker which owns the pointer is also a VirtualWorker, even though we didn't create it. Fun fact: just like we had a VirtualWorker object for Bob, we (by default) always have one for ourselves as well. This worker was created automatically when we called hook = sy.TorchHook(torch), so you don't usually have to create it yourself.


In [ ]:
me = sy.local_worker
me

In [ ]:
me == x_ptr.owner

And finally, just like we can call .send() on a tensor, we can call .get() on a pointer to a tensor to get it back!!!


In [ ]:
x_ptr

In [ ]:
x_ptr.get()

In [ ]:
y_ptr

In [ ]:
y_ptr.get()

In [ ]:
z.get()

In [ ]:
bob._objects

And as you can see... Bob no longer has the tensors!!! They've moved back to our machine!

Section 1.2 - Using Tensor Pointers

So, sending tensors to Bob and getting them back is great, but this is hardly Deep Learning! We want to be able to perform tensor operations on remote tensors. Fortunately, tensor pointers make this quite easy! You can just use pointers like you would normal tensors!


In [ ]:
x = torch.tensor([1,2,3,4,5]).send(bob)
y = torch.tensor([1,1,1,1,1]).send(bob)

In [ ]:
z = x + y

In [ ]:
z

And voilà!

Behind the scenes, something very powerful happened. Instead of the addition being computed locally, a command was serialized and sent to Bob, who performed the computation, created a tensor z on his machine, and then returned a pointer to z back to us!
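
We can check this for ourselves with a quick sanity check, reusing the internal bob._objects store we peeked at earlier (the exact ids will differ on your machine):


In [ ]:
# x, y, and the freshly created z all live on Bob's machine;
# our local z is only a pointer to the remote result
print(bob._objects)   # should now contain three tensors
print(z.location)     # bob - the addition ran remotely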

If we call .get() on the pointer, we will then receive the result back to our machine!


In [ ]:
z.get()

Torch Functions

This API has been extended to all of Torch's operations!!!


In [ ]:
x

In [ ]:
y

In [ ]:
z = torch.add(x,y)
z

In [ ]:
z.get()

Variables (including backpropagation!)


In [ ]:
x = torch.tensor([1,2,3,4,5.], requires_grad=True).send(bob)
y = torch.tensor([1,1,1,1,1.], requires_grad=True).send(bob)

In [ ]:
z = (x + y).sum()

In [ ]:
z.backward()

In [ ]:
x = x.get()

In [ ]:
x

In [ ]:
x.grad

Since z = (x + y).sum(), each element of x contributes exactly once to z, so x.grad comes back as a tensor of ones. As you can see, the API is really quite flexible and capable of performing nearly any operation you would normally perform in Torch on remote data. This lays the groundwork for our more advanced privacy-preserving protocols such as Federated Learning, Secure Multi-Party Computation, and Differential Privacy!
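
To give a taste of where this leads, here is a minimal sketch that computes gradients for a tiny linear model entirely on Bob's machine, using only the operations demonstrated above. The toy data is made up, and we assume matrix multiplication (.mm) is hooked just like the element-wise operations we used earlier:


In [ ]:
# a toy linear "model": the data and the weights both live on Bob's machine
data    = torch.tensor([[1., 2.], [3., 4.]]).send(bob)
weights = torch.tensor([[0.], [0.]], requires_grad=True).send(bob)

pred = data.mm(weights)   # forward pass executes remotely on Bob's machine
loss = pred.sum()         # still just a pointer to a remote scalar
loss.backward()           # backpropagation also happens on Bob's machine

weights = weights.get()   # retrieve the weights (and their gradients)
weights.grad              # gradients were computed entirely on remote data

Note that only the weights (and their gradients) ever travel back to us - the raw data never leaves Bob's machine.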



Congratulations!!! - Time to Join the Community!

Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement toward privacy-preserving, decentralized ownership of AI and the AI supply chain (data), you can do so in the following ways!

Star PySyft on GitHub

The easiest way to help our community is just by starring the GitHub repos! This helps raise awareness of the cool tools we're building.

Join our Slack!

The best way to keep up to date on the latest advancements is to join our community! You can do so by filling out the form at http://slack.openmined.org

Join a Code Project!

The best way to contribute to our community is to become a code contributor! At any time, you can go to the PySyft GitHub Issues page and filter for "Projects". This will show you all the top-level tickets, giving an overview of which projects you can join! If you don't want to join a project but would like to do a bit of coding, you can also look for "one off" mini-projects by searching for GitHub issues marked "good first issue".

If you don't have time to contribute to our codebase, but would still like to lend support, you can also become a Backer on our Open Collective. All donations go toward our web hosting and other community expenses such as hackathons and meetups!

OpenMined's Open Collective Page
