In [ ]:
epochs = 15
In this tutorial you will see how you can leverage PySyft and PyTorch to train a 1-layer GRU model using Federated Learning.
The data used for this project is the SMS Spam Collection Data Set, available on the UCI Machine Learning Repository. The dataset consists of about 5,500 SMS messages, of which around 13% are spam.
The objective here is to simulate two remote machines (that we will call Bob and Anne), where each machine has a similar number of labeled data points (SMS messages labeled as spam or not).
Author: André Macedo Farias. Github: @andrelmfarias | Twitter: @andrelmfarias
I also wrote a blogpost about this tutorial and PySyft, feel free to check it out: Private AI — Federated Learning with PySyft and PyTorch
In [ ]:
import numpy as np
from sklearn.metrics import roc_auc_score
import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader
import warnings
warnings.filterwarnings("ignore")
As we are most interested in the usage of PySyft and Federated Learning, I will skip the text-preprocessing part of the project. If you are interested in how I performed the preprocessing of the raw dataset, you can take a look at the script preprocess.py.
Each data point in the inputs.npy dataset corresponds to an array of 30 tokens obtained from each message (padded on the left or truncated on the right).
The labels.npy dataset has the following unique values: 1 for spam and 0 for non-spam.
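For reference, the padding/truncation scheme described above can be reproduced in a few lines of NumPy. This is only an illustrative sketch (the token ids are made up and the use of 0 as the padding id is an assumption); the actual logic lives in preprocess.py.
In [ ]:
import numpy as np

SEQ_LEN = 30  # fixed length used for every message

def pad_or_truncate(token_ids, seq_len=SEQ_LEN):
    """Left-pad with 0s if the message is short, keep only the first seq_len tokens otherwise."""
    token_ids = token_ids[:seq_len]                  # truncate on the right
    padded = np.zeros(seq_len, dtype=np.int64)       # assumed padding token id: 0
    padded[seq_len - len(token_ids):] = token_ids    # pad on the left
    return padded

print(pad_or_truncate([12, 7, 431]))  # 27 zeros followed by 12, 7, 431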
In [ ]:
inputs = np.load('./data/inputs.npy')
labels = np.load('./data/labels.npy')
In [ ]:
VOCAB_SIZE = int(inputs.max()) + 1
In [ ]:
# Training params
EPOCHS = epochs
CLIP = 5 # gradient clipping - to avoid gradient explosion (frequent in RNNs)
lr = 0.1
BATCH_SIZE = 32
# Model params
EMBEDDING_DIM = 50
HIDDEN_DIM = 10
DROPOUT = 0.2
In this part we are going to split the dataset into training and test sets with an 80/20 ratio. Each of these sets will then be split in half and sent to "Bob's" and "Anne's" machines in order to simulate remote and private data.
Please note that in a real use case such datasets would already be on the remote machines, and the preprocessing would have been performed beforehand on their own devices.
In [ ]:
import syft as sy
In [ ]:
labels = torch.tensor(labels)
inputs = torch.tensor(inputs)
# splitting training and test data
pct_test = 0.2
train_labels = labels[:-int(len(labels)*pct_test)]
train_inputs = inputs[:-int(len(labels)*pct_test)]
test_labels = labels[-int(len(labels)*pct_test):]
test_inputs = inputs[-int(len(labels)*pct_test):]
In [ ]:
# Hook that extends the Pytorch library to enable all computations with pointers of tensors sent to other workers
hook = sy.TorchHook(torch)
# Creating 2 virtual workers
bob = sy.VirtualWorker(hook, id="bob")
anne = sy.VirtualWorker(hook, id="anne")
# threshold indexes for dataset split (one half for Bob, other half for Anne)
train_idx = int(len(train_labels)/2)
test_idx = int(len(test_labels)/2)
# Sending toy datasets to virtual workers
bob_train_dataset = sy.BaseDataset(train_inputs[:train_idx], train_labels[:train_idx]).send(bob)
anne_train_dataset = sy.BaseDataset(train_inputs[train_idx:], train_labels[train_idx:]).send(anne)
bob_test_dataset = sy.BaseDataset(test_inputs[:test_idx], test_labels[:test_idx]).send(bob)
anne_test_dataset = sy.BaseDataset(test_inputs[test_idx:], test_labels[test_idx:]).send(anne)
# Creating federated datasets, an extension of Pytorch TensorDataset class
federated_train_dataset = sy.FederatedDataset([bob_train_dataset, anne_train_dataset])
federated_test_dataset = sy.FederatedDataset([bob_test_dataset, anne_test_dataset])
# Creating federated dataloaders, an extension of Pytorch DataLoader class
federated_train_loader = sy.FederatedDataLoader(federated_train_dataset, shuffle=True, batch_size=BATCH_SIZE)
federated_test_loader = sy.FederatedDataLoader(federated_test_dataset, shuffle=False, batch_size=BATCH_SIZE)
For educational purposes, we built a handcrafted GRU with linear layers; you can check its architecture and code in handcrafted_GRU.py.
As the focus of this notebook is the usage of Federated Learning with PySyft, we will not show the construction of the model here.
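Just to give an idea of what "a GRU built from linear layers" means, here is a rough, simplified sketch of a single GRU cell step. This is not the model used in this notebook (the real one in handcrafted_GRU.py also includes the embedding layer, dropout and an output layer, and its exact structure may differ); it only illustrates the idea.
In [ ]:
import torch
from torch import nn

class SimpleGRUCell(nn.Module):
    """Illustrative GRU cell built only from linear layers (not the tutorial's exact model)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.update_gate = nn.Linear(input_dim + hidden_dim, hidden_dim)  # z_t
        self.reset_gate = nn.Linear(input_dim + hidden_dim, hidden_dim)   # r_t
        self.candidate = nn.Linear(input_dim + hidden_dim, hidden_dim)    # h~_t

    def forward(self, x, h):
        combined = torch.cat([x, h], dim=1)
        z = torch.sigmoid(self.update_gate(combined))
        r = torch.sigmoid(self.reset_gate(combined))
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde  # new hidden state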
In [ ]:
from handcrafted_GRU import GRU
In [ ]:
# Instantiating the model
model = GRU(vocab_size=VOCAB_SIZE, hidden_dim=HIDDEN_DIM, embedding_dim=EMBEDDING_DIM, dropout=DROPOUT)
In [ ]:
# Defining loss and optimizer
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=lr)
For each epoch we are going to compute the training and validation losses, as well as the Area Under the ROC Curve (AUC) score, because the target dataset is unbalanced (only 13% of the labels are positive).
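As a quick illustration of why plain accuracy would be misleading here, a trivial classifier that always predicts "non-spam" already scores high accuracy on a 13%-positive dataset, while its AUC stays at chance level. The numbers below are made up and only mirror the class imbalance; they are not the project's data.
In [ ]:
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels with ~13% positives, mirroring the imbalance of the SMS dataset
y_true = np.array([0] * 87 + [1] * 13)

# A useless "classifier" that always predicts the negative class
y_pred_constant = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred_constant))  # 0.87 - looks good, but is meaningless
print(roc_auc_score(y_true, y_pred_constant))   # 0.5  - chance level, exposing the problem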
In [ ]:
for e in range(EPOCHS):

    ######### Training ##########

    losses = []
    # Batch loop
    for inputs, labels in federated_train_loader:
        # Location of the current batch
        worker = inputs.location
        # Initialize hidden state and send it to the worker
        h = torch.Tensor(np.zeros((BATCH_SIZE, HIDDEN_DIM))).send(worker)
        # Send model to the current worker
        model.send(worker)
        # Setting accumulated gradients to zero before the backward step
        optimizer.zero_grad()
        # Output from the model
        output, _ = model(inputs, h)
        # Calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # Clipping the gradient to avoid explosion
        nn.utils.clip_grad_norm_(model.parameters(), CLIP)
        # Optimization step
        optimizer.step()
        # Get the model back to the local worker
        model.get()
        losses.append(loss.get())

    ######## Evaluation ##########

    # Model in evaluation mode
    model.eval()

    with torch.no_grad():
        test_preds = []
        test_labels_list = []
        eval_losses = []

        for inputs, labels in federated_test_loader:
            # Location of the current batch
            worker = inputs.location
            # Initialize hidden state and send it to the worker
            h = torch.Tensor(np.zeros((BATCH_SIZE, HIDDEN_DIM))).send(worker)
            # Send model to the current worker
            model.send(worker)

            output, _ = model(inputs, h)
            loss = criterion(output.squeeze(), labels.float())
            eval_losses.append(loss.get())
            preds = output.squeeze().get()
            test_preds += list(preds.numpy())
            test_labels_list += list(labels.get().numpy().astype(int))
            # Get the model back to the local worker
            model.get()

        score = roc_auc_score(test_labels_list, test_preds)

    print("Epoch {}/{}...  AUC: {:.3%}...  Training loss: {:.5f}...  Validation loss: {:.5f}".format(
        e + 1, EPOCHS, score, sum(losses) / len(losses), sum(eval_losses) / len(eval_losses)))

    model.train()
Et voilà! You have just trained a model for a real-world application (an SMS spam classifier) using Federated Learning!
You can see that with the PySyft library and its PyTorch extension, you can perform operations on tensor pointers just as you would with the PyTorch API.
Thanks to this, you were able to train a spam detector model without having any access to the remote and private data: for each batch you sent the model to the current remote worker and got it back to the local machine before sending it to the worker of the next batch.
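To make the pointer mechanics concrete, here is a minimal sketch of what happens when you operate on remote tensors with PySyft's virtual workers (reusing the bob worker defined above; the exact printed output may vary with your PySyft version).
In [ ]:
# A local tensor...
x = torch.tensor([1, 2, 3, 4, 5])

# ...sent to Bob: x_ptr is now a pointer to a tensor living on Bob's (virtual) machine
x_ptr = x.send(bob)

# Operations on pointers are executed remotely and return new pointers
y_ptr = x_ptr + x_ptr

# .get() brings the result back to the local machine
y = y_ptr.get()
print(y)  # tensor([ 2,  4,  6,  8, 10])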
You can also notice that this federated training did not harm the performance of the model: both losses decreased at each epoch as expected, and the final AUC score on the test data was above 97.5%.
There is, however, one limitation of this method: by getting the model back we can still gain access to some private information. Let's say Bob had only one SMS on his machine. When we get the model back, we can simply check which embeddings of the model changed and we will know which tokens (words) were in the SMS.
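To see why this works, here is a self-contained toy illustration (not the tutorial's model; the vocabulary, message and learning rate are made up): after one gradient step on a single message, only the embedding rows of the tokens that appeared in that message change, so comparing the weights before and after reveals the message's tokens.
In [ ]:
# Toy embedding standing in for the model's embedding layer
toy_emb = nn.Embedding(10, 4)          # 10-token vocabulary, 4-dim embeddings
before = toy_emb.weight.detach().clone()

message = torch.tensor([[2, 5, 7]])    # the single SMS Bob supposedly holds
toy_opt = optim.SGD(toy_emb.parameters(), lr=0.1)
toy_opt.zero_grad()
toy_emb(message).sum().backward()      # any loss involving the message
toy_opt.step()

changed = (toy_emb.weight.detach() - before).abs().sum(dim=1).nonzero().squeeze()
print(changed)  # tensor([2, 5, 7]) - exactly the tokens of the private message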
In order to address this issue, there are two solutions: Differential Privacy and Secure Multi-Party Computation (SMPC).
Differential Privacy would be used to make sure the model does not leak private information.
SMPC, which is one kind of Encrypted Computation, in turn allows you to send the model privately, so that the remote workers that hold the data cannot see the weights you are using.
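As a taste of the SMPC approach, PySyft (in the 0.1/0.2-era API used in this notebook) lets you additively secret-share a model's weights across workers so that no single worker can read them. The sketch below assumes a third virtual worker acting as the crypto provider; the exact calls may differ depending on your PySyft version.
In [ ]:
# A third virtual worker that supplies the cryptographic primitives needed by SMPC
crypto_provider = sy.VirtualWorker(hook, id="crypto_provider")

# Convert the weights to fixed precision and secret-share them between Bob and Anne:
# each worker only holds a random-looking share, so neither can reconstruct the weights alone
encrypted_model = model.fix_precision().share(bob, anne, crypto_provider=crypto_provider)

# The data would need to be shared the same way before running encrypted inference or training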
Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement toward privacy preserving, decentralized ownership of AI and the AI supply chain (data), you can do so in the following ways!
The easiest way to help our community is just by starring the repositories! This helps raise awareness of the cool tools we're building.
We made really nice tutorials to get a better understanding of what Federated and Privacy-Preserving Learning should look like and how we are building the bricks for this to happen.
The best way to keep up to date on the latest advancements is to join our community!
The best way to contribute to our community is to become a code contributor! If you want to start "one off" mini-projects, you can go to the PySyft GitHub Issues page and search for issues marked Good First Issue.
If you don't have time to contribute to our codebase, but would still like to lend support, you can also become a Backer on our Open Collective. All donations go toward our web hosting and other community expenses such as hackathons and meetups!