In [ ]:
epochs = 10            # number of epochs for the (plain-text) training phase
n_test_batches = 200   # number of test batches used for the encrypted evaluation
Data is the driver behind Machine Learning. Organizations that create and collect data are able to build and train their own machine learning models, which allows them to offer such models as a service (MLaaS) to outside organizations. This is useful for organizations that might not be able to create these models themselves but would still like to use them to make predictions on their own data.
However, a model hosted in the cloud still presents a privacy/IP issue. In order for external organizations to use it, they must either upload their input data (such as images to be classified) or download the model. Uploading input data can be problematic from a privacy perspective, and downloading the model might not be an option if the organization that created/owns it is worried about losing its IP.
In this context, one potential solution is to encrypt both the model and the data in a way that allows one organization to use a model owned by another without either disclosing their IP to one another. Several encryption schemes exist that allow computation over encrypted data, among which Secure Multi-Party Computation (SMPC), Homomorphic Encryption (FHE/SHE), and Functional Encryption (FE) are the best known. We will focus here on Secure Multi-Party Computation (introduced in detail here in tutorial 5), which consists of private additive sharing. It relies on crypto protocols such as SecureNN and SPDZ, the details of which are given in this excellent blog post.
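To build some intuition for additive sharing, here is a toy sketch of the idea in pure Python (illustrative only, not PySyft's actual implementation; the modulus Q is an arbitrary choice for this example):
In [ ]:
import random

Q = 2**31 - 1  # field size for the shares, chosen here only for illustration

def share(x):
    # Split an integer x into two random-looking additive shares modulo Q
    share_alice = random.randrange(Q)
    share_bob = (x - share_alice) % Q
    return share_alice, share_bob

def reconstruct(share_alice, share_bob):
    return (share_alice + share_bob) % Q

a1, b1 = share(25)
a2, b2 = share(17)
# Each party can add its shares locally; reconstruction yields the true sum
assert reconstruct(a1 + a2, b1 + b2) == 42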
These protocols achieve remarkable performance over encrypted data, and over the past few months we have been working to make them easy to use. Specifically, we're building tools that allow you to use these protocols without having to re-implement them yourself (or even necessarily know the cryptography behind how they work). Let's jump right in.
The exact setting in this tutorial is the following: consider that you are the server and you have some data. First, you define and train a model with this private training data. Then, you get in touch with a client who holds some data of their own and would like to access your model to make some predictions.
You encrypt your model (a neural network), and the client encrypts their data. You both then use these two encrypted assets to classify the data. Finally, the result of the prediction is sent back to the client in an encrypted way so that the server (i.e. you) learns nothing about the client's data (neither the inputs nor the prediction).
Ideally we would additively share the client's input between itself and the server, and vice versa for the model. For the sake of simplicity, the shares will be held by two other workers, alice and bob. If you consider that alice is owned by the client and bob by the server, it's completely equivalent.
The computation is secure in the honest-but-curious adversary model which is standard in many MPC frameworks.
We now have everything we need, let's get started!
In [ ]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
We also need to execute commands specific to importing/starting PySyft. We create a few workers (named client, bob, and alice). Lastly, we define the crypto_provider, which provides all the crypto primitives we may need (see our tutorial on SMPC for more details).
In [ ]:
import syft as sy
hook = sy.TorchHook(torch)  # hook PyTorch so tensors gain PySyft functionality
client = sy.VirtualWorker(hook, id="client")
bob = sy.VirtualWorker(hook, id="bob")
alice = sy.VirtualWorker(hook, id="alice")
crypto_provider = sy.VirtualWorker(hook, id="crypto_provider")  # supplies the crypto primitives
We define the setting of the learning task.
In [ ]:
class Arguments():
    def __init__(self):
        self.batch_size = 64
        self.test_batch_size = 50
        self.epochs = epochs
        self.lr = 0.001
        self.log_interval = 100

args = Arguments()
First, as the server, we load the data that will be used to train the model:
In [ ]:
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=args.batch_size, shuffle=True)
Second, the client has some data and would like to have predictions on it using the server's model. This client encrypts its data by sharing it additively across the two workers alice and bob.
SMPC uses crypto protocols which require working on integers. We leverage here the PySyft tensor abstraction to convert PyTorch float tensors into fixed-precision tensors using .fix_precision(). For example, 0.123 with precision 2 is rounded at the 2nd decimal digit, so the number stored is the integer 12.
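Here is what the fixed-precision conversion looks like on a single value (a minimal sketch; the precision_fractional keyword argument is an assumption about the PySyft API, and the default precision used elsewhere in this tutorial may differ):
In [ ]:
x = torch.tensor([0.123])
x_fp = x.fix_precision(precision_fractional=2)  # stored internally as the integer 12
print(x_fp.float_precision())  # decodes back to approximately 0.12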
In [ ]:
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=False,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=args.test_batch_size, shuffle=True)

private_test_loader = []
for data, target in test_loader:
    private_test_loader.append((
        data.fix_precision().share(alice, bob, crypto_provider=crypto_provider),
        target.fix_precision().share(alice, bob, crypto_provider=crypto_provider)
    ))
We now specify the model, a simple feed-forward neural network:
In [ ]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(784, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = x.view(-1, 784)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return x
The training happens locally on the server, so this is just plain PyTorch:
In [ ]:
def train(args, model, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        output = F.log_softmax(output, dim=1)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * args.batch_size, len(train_loader) * args.batch_size,
                100. * batch_idx / len(train_loader), loss.item()))
In [ ]:
model = Net()
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)

for epoch in range(1, args.epochs + 1):
    train(args, model, train_loader, optimizer, epoch)
We also define a plain-text test function to check the trained model before serving it:
In [ ]:
def test(args, model, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            output = model(data)
            output = F.log_softmax(output, dim=1)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
In [ ]:
test(args, model, test_loader)
Our model is now trained and ready to be provided as a service!
Now, as the server, we send the model to the workers holding the data. Because the model is sensitive information (you've spent time optimizing it!), you don't want to disclose its weights, so you secret share the model just like we did with the dataset earlier.
In [ ]:
model.fix_precision().share(alice, bob, crypto_provider=crypto_provider)
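You can check that the parameters are no longer plain floats (a hypothetical sanity check; the exact printed representation depends on the PySyft version):
In [ ]:
print(model.fc1.weight)  # a fixed-precision tensor of additive shares, not plain-text floats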
This test function performs the encrypted evaluation: the model weights, the data inputs, the predictions, and the targets used for scoring are all encrypted!
However, the syntax is very similar to pure PyTorch testing of a model, isn't it nice?!
The only thing we decrypt from the server side is the final score at the end, to verify that predictions were on average good.
In [ ]:
def test(args, model, test_loader):
    model.eval()
    n_correct_priv = 0
    n_total = 0
    with torch.no_grad():
        for data, target in test_loader[:n_test_batches]:
            output = model(data)
            pred = output.argmax(dim=1)
            n_correct_priv += pred.eq(target.view_as(pred)).sum()
            n_total += args.test_batch_size
    # Decrypt only the final aggregate score
    n_correct = n_correct_priv.copy().get().float_precision().long().item()

    print('Test set: Accuracy: {}/{} ({:.0f}%)'.format(
        n_correct, n_total,
        100. * n_correct / n_total))
In [ ]:
test(args, model, private_test_loader)
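As a side note, the client could also decrypt an individual prediction. Here is a hypothetical snippet reusing the objects defined above (decryption follows the same .copy().get().float_precision() pattern as in the test function):
In [ ]:
data, target = private_test_loader[0]        # one encrypted batch
output = model(data)                         # forward pass computed entirely on secret shares
pred = output.argmax(dim=1)                  # the predicted labels are still encrypted here
print(pred.copy().get().float_precision())   # only the client would run this decryption step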
Et voilà! Here you are: you have learned how to do end-to-end secure predictions. The weights of the server's model have not leaked to the client, and the server has no information about the data input nor the classification output!
Regarding performance, classifying one image takes less than 0.1 second, approximately 33ms on my laptop (2.7 GHz Intel Core i7, 16GB RAM). However, this is using very fast communication (all the workers are on my local machine). Performance will vary depending on how fast the different workers can talk to each other.
You have seen how easy it is to leverage PyTorch and PySyft to perform practical Secure Machine Learning and protect users' data, without having to be a crypto expert!
More on this topic will come soon, including convolutional layers to properly benchmark PySyft performance with respect to other libraries, as well as private encrypted training of neural networks, which is needed when an organisation resorts to external sensitive data to train its own model. Stay tuned!
If you enjoyed this and would like to join the movement toward privacy-preserving, decentralized ownership of AI and the AI supply chain (data), you can do so in the following ways!
The easiest way to help our community is just by starring the repositories! This helps raise awareness of the cool tools we're building.
We made really nice tutorials to get a better understanding of what Federated and Privacy-Preserving Learning should look like and how we are building the bricks for this to happen.
The best way to keep up to date on the latest advancements is to join our community!
The best way to contribute to our community is to become a code contributor! If you want to start "one off" mini-projects, you can go to the PySyft GitHub Issues page and search for issues marked Good First Issue.
If you don't have time to contribute to our codebase, but would still like to lend support, you can also become a Backer on our Open Collective. All donations go toward our web hosting and other community expenses such as hackathons and meetups!