Federated Datasets

This example demonstrates how to develop custom datasets that work with federated data loaders. Unfortunately, regular torch datasets do not always work directly with federated data loaders, so the example highlights the differences you need to account for. You can also use Syft's BaseDataset feature, which simplifies creating federated datasets. To demonstrate both approaches, we load the SVHN (Street View House Numbers) dataset and convert it to a federated dataset.

SVHN Dataset: http://ufldl.stanford.edu/housenumbers/

Authored By:

Hrishikesh Kamath - GitHub: @kamathhrishi


In [ ]:
from torch.utils.data import Dataset
import syft as sy  
import torch
import urllib.request  # urlretrieve lives in the urllib.request submodule
from pathlib import Path
import os

Define the workers you want to distribute the data to.


In [ ]:
hook = sy.TorchHook(torch)  # <-- NEW: hook PyTorch, i.e. add extra functionalities to support Federated Learning
bob = sy.VirtualWorker(hook, id="bob")  # <-- NEW: define remote worker bob
alice = sy.VirtualWorker(hook, id="alice")  # <-- NEW: and alice

These are the hyperparameters that would be initialized when training a model. For this tutorial, we only need the batch size. In general, keeping hyperparameters in a separate class or a dictionary is good practice.


In [ ]:
class Arguments():
    def __init__(self):
        self.batch_size = 1
        self.test_batch_size = 1000
        self.seed = 1

In [ ]:
args = Arguments()
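
As an alternative to the plain class above, the same settings could be kept in a dataclass, which gives default values, keyword overrides, and an easy conversion to a dictionary. This is only a minimal sketch: the name `TrainingConfig` is hypothetical and is not part of the tutorial or of Syft.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainingConfig:  # hypothetical name, used only for this illustration
    batch_size: int = 1
    test_batch_size: int = 1000
    seed: int = 1


# Defaults mirror the values used in the Arguments class
config = TrainingConfig()

# Individual values can be overridden by keyword without touching the others
overrides = TrainingConfig(batch_size=64)

print(asdict(config))
print(asdict(overrides))
```

Marking the dataclass as frozen prevents accidental changes to a hyperparameter after the configuration has been created.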

Load Data


In [ ]:
#create a function for checking if the dataset does indeed exist
def dataset_exists():
    return (os.path.isfile('./data/train_32x32.mat') and
            os.path.isfile('./data/test_32x32.mat'))

    
#If the dataset does not exist, then proceed to download the dataset anew
if not dataset_exists():
    Path('./data/').mkdir(parents=True, exist_ok=True)
    #Download both files directly from the URL where they are hosted
    print('Downloading the dataset with urllib to the data directory...')
    url1 = 'http://ufldl.stanford.edu/housenumbers/train_32x32.mat'
    urllib.request.urlretrieve(url1, './data/train_32x32.mat')
    url2 = 'http://ufldl.stanford.edu/housenumbers/test_32x32.mat'
    urllib.request.urlretrieve(url2, './data/test_32x32.mat')
    print("The dataset was successfully downloaded")
else:
    print("Not downloading the dataset because it was already downloaded")

The data required for this tutorial comes from the URLs used in the download cell above. The dataset is in MATLAB format.

This section loads and pre-processes the SVHN dataset and has little to do with creating a federated dataset. You can skip it.


In [ ]:
from scipy.io import loadmat
import matplotlib.pyplot as plt

def load_data(path):
    """ Helper function for loading a MAT-File"""
    data = loadmat(path)
    return data['X'], data['y']

In [ ]:
train_data , train_labels = load_data("data/train_32x32.mat")
test_data , test_labels = load_data("data/test_32x32.mat")

In [ ]:
print(train_data.shape)
print(test_data.shape)

Notice that the NumPy arrays above store the image index as the last axis rather than the first. We therefore transpose them so that the first axis indexes the images, as is usual for image data.


In [ ]:
# Transpose the image arrays
X_train, y_train = train_data.transpose((3,0,1,2)), train_labels[:,0]
X_test, y_test = test_data.transpose((3,0,1,2)), test_labels[:,0]
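
To see what the transpose does without loading the real files, the sketch below applies the same axis order to a small synthetic array. The array contents and the count of five images are made up for illustration; only the axis layout mirrors the cell above.

```python
import numpy as np

# Synthetic stand-in for the "X" array: (height, width, channels, n_images)
raw = np.zeros((32, 32, 3, 5), dtype=np.uint8)
raw[..., 2] = 255  # make the image at index 2 all white so we can track it

# Move the last axis (the image index) to the front
images = raw.transpose((3, 0, 1, 2))
print(images.shape)  # (5, 32, 32, 3)

# Labels arrive as a column vector; [:, 0] flattens them to one label per image
raw_labels = np.arange(5, dtype=np.uint8).reshape(5, 1)
labels = raw_labels[:, 0]
print(labels.shape)  # (5,)
```

After the transpose, `images[2]` is the all-white image, which confirms that indexing along the first axis now selects a whole image.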

Visualize Data


In [ ]:
fig=plt.figure(figsize=(8, 8))
columns = 4
rows = 5
for i in range(1,columns*rows+1):
    fig.add_subplot(rows, columns, i)
    plt.imshow(X_train[i])
plt.show()

Torch Dataset

The SVHN dataset is stored as NumPy arrays, but the data could equally be Python lists or any other datatype that can be converted to torch tensors.


In [ ]:
class SVHNDataset(Dataset):

    def __init__(self,images,labels,transform=None):
        
        """Args:
             
             images (Numpy Array): Image Data
             labels (Numpy Array): Labels corresponding to each image
             transform (Optional): If any torch transform has to be performed on the dataset
             
        """
        
        #<--Attributes self.data and self.targets must be initialized.
        
        #<--Data must be initialized as self.data,self.train_data or self.test_data
        self.data=images
        #<--Targets must be initialized as self.targets,self.test_labels or self.train_labels
        self.targets=labels
        
        #<--The data and target must be converted to torch tensors before it is returned by __getitem__ method
        self.to_torchtensor()
        
        #<--If any transforms have to be performed on the dataset
        self.transform = transform
        
        
    def to_torchtensor(self):
        
        """Convert the NumPy arrays to torch tensors in place."""

        self.data = torch.from_numpy(self.data)
        # Store the converted labels back in self.targets, which __getitem__ reads
        self.targets = torch.from_numpy(self.targets)
    
        
    def __len__(self):
        
        """Required Method
            
           Returns:
        
                Length [int]: Length of Dataset/batches
        
        """
        
        return len(self.data)
    

    def __getitem__(self, idx):
        
        """Required Method
        
           The output of this method must be torch tensors since torch tensors are overloaded 
           with share() method which is used to share data to workers.
        
           Args:
                 
                 idx [integer]: The index of required batch/example
                 
           Returns:
                 
                 Data [Torch Tensor]:     The training examples
                 Target [ Torch Tensor]:  Corresponding labels of training examples 
        
        """
        
        sample=self.data[idx]
        target=self.targets[idx]
                
        if self.transform:
            sample = self.transform(sample)

        return sample,target
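
Before federating a custom dataset, it can help to confirm that `__getitem__` really returns torch tensors and that a regular torch data loader can batch them. The sketch below does this with a small stand-in class and synthetic arrays; `TinyArrayDataset` is a hypothetical name, not the SVHNDataset class itself, and no Syft functionality is involved.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class TinyArrayDataset(Dataset):  # hypothetical stand-in for a custom dataset
    def __init__(self, images, labels):
        # Convert once, so every item returned later is already a tensor
        self.data = torch.from_numpy(images)
        self.targets = torch.from_numpy(labels)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]


# Synthetic data: six blank 32x32 RGB images with labels 0..5
images = np.zeros((6, 32, 32, 3), dtype=np.uint8)
labels = np.arange(6, dtype=np.int64)
dataset = TinyArrayDataset(images, labels)

sample, target = dataset[0]
print(type(sample), type(target))

# A regular (non-federated) loader batches the items as usual
loader = DataLoader(dataset, batch_size=4)
batch_shapes = [tuple(batch.shape) for batch, _ in loader]
print(batch_shapes)
```

If either returned value is a NumPy array or a plain Python number rather than a tensor, the conversion step in the dataset is the first place to look.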

Call the federate method on the torch dataset instance with the workers as arguments, then pass the result to the federated data loader. This distributes the dataset to the given workers and returns the corresponding pointer tensors. The federated train loader can then be used to iterate over the pointer tensors of the examples and labels, just like a regular torch data loader.


In [ ]:
federated_SVHN=SVHNDataset(X_train,y_train).federate((bob, alice))

In [ ]:
federated_train_loader = sy.FederatedDataLoader( # <-- this is now a FederatedDataLoader 
                         federated_SVHN,batch_size=args.batch_size)
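
To build intuition for what distributing a dataset across workers means, the sketch below assigns contiguous ranges of example indices to worker ids in plain Python. It is a conceptual illustration only: the contiguous-chunk strategy is an assumption made for the example, `split_indices` is a hypothetical helper, and none of this should be read as a description of Syft's internals.

```python
import math


def split_indices(n_examples, worker_ids):
    """Assign contiguous index ranges to workers (illustrative only)."""
    # Round up so that every example is assigned to some worker
    chunk = math.ceil(n_examples / len(worker_ids))
    shards = {}
    for i, worker_id in enumerate(worker_ids):
        start = i * chunk
        stop = min(start + chunk, n_examples)
        shards[worker_id] = range(start, stop)
    return shards


shards = split_indices(10, ["bob", "alice"])
for worker_id, indices in shards.items():
    print(worker_id, list(indices))
```

In the real workflow the workers hold the data itself and the loader yields pointer tensors, whereas this sketch only tracks which indices would go where.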

Syft Base Datasets

Syft's BaseDataset is a simplified dataset feature of the Syft library that lets you create a dataset by simply providing the training data and the corresponding labels. It can also be used with federated data loaders. Ensure the inputs to BaseDataset are torch tensors.


In [ ]:
base=sy.BaseDataset(torch.from_numpy(X_train),torch.from_numpy(y_train))

In [ ]:
base_federated=base.federate((bob, alice))

In [ ]:
federated_train_loader = sy.FederatedDataLoader( # <-- this is now a FederatedDataLoader 
                         base_federated,batch_size=args.batch_size)

Well Done!

And voilà! We now know how to create a custom dataset that works with federated data loaders.

Shortcomings of this Example

Federated datasets were developed to let users perform federated learning easily with federated data loaders. If you have ideas for features that would improve your experience, feel free to create a GitHub issue.

Congratulations!!! - Time to Join the Community!

Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement toward privacy preserving, decentralized ownership of AI and the AI supply chain (data), you can do so in the following ways!

Star PySyft on GitHub

The easiest way to help our community is just by starring the repositories! This helps raise awareness of the cool tools we're building.

Pick our tutorials on GitHub!

We made really nice tutorials to get a better understanding of what Federated and Privacy-Preserving Learning should look like and how we are building the bricks for this to happen.

Join our Slack!

The best way to keep up to date on the latest advancements is to join our community!

Join a Code Project!

The best way to contribute to our community is to become a code contributor! If you want to start "one off" mini-projects, you can go to the PySyft GitHub Issues page and search for issues marked Good First Issue.

If you don't have time to contribute to our codebase, but would still like to lend support, you can also become a Backer on our Open Collective. All donations go toward our web hosting and other community expenses such as hackathons and meetups!