In this tutorial you will see how to run a linear regression model on data distributed across a pool of workers, with the computations encrypted using Secure Multi-Party Computation (SMPC). For this demonstration we will use the classical Housing Prices dataset, which is already available in the VirtualGrid set up by the Syft sandbox.
The implementation of the Encrypted Linear Regression algorithm in PySyft is based on Section 2 of this paper by Jonathan Bloom of the Broad Institute of MIT and Harvard.
Author: André Macedo Farias. Github: @andrelmfarias | Twitter: @andrelmfarias
First, let's import PySyft and PyTorch and set up the Syft sandbox, which will create all the objects and tools we need to run our simulation (Virtual Workers, a VirtualGrid with datasets, etc.).
In [ ]:
import warnings
warnings.filterwarnings("ignore")
In [ ]:
import torch
import syft as sy
sy.create_sandbox(globals(), verbose=False)
You can see that we have several workers already set up:
In [ ]:
workers
And each one has a chunk of the Housing Prices dataset:
In [ ]:
for worker in workers:
    print(worker.search(["#housing", "#data"]))
Now that our Syft environment is set up, let's load the data.
Please note that in order to avoid overflow in the SMPC computations performed by the linear model, and to maintain its stability, we need to scale the data in such a way that the magnitude of each coordinate's average lies in the interval [0.1, 10].
Usually this can be done without revealing the data or the averages; you only need a rough idea of the order of magnitude. For example, if one of the coordinates is the surface area of the house in m², you should scale it by dividing by 100, since house surface areas are on the order of 100 m² on average.
After running the model and obtaining the main statistics, we can rescale them back if needed. The same can be done with predictions.
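To make the rescaling step concrete, here is a minimal sketch in plain Python (no PySyft, no encryption). It uses made-up toy numbers and a hypothetical helper, `fit_simple_ols`, for a single-feature regression. If a feature is divided by `x_scale` and the target by `y_scale`, a coefficient fitted on the scaled data is mapped back by multiplying by `y_scale / x_scale`, and the intercept by multiplying by `y_scale`.

```python
import math


def fit_simple_ols(xs, ys):
    """Closed-form OLS for a single feature: returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Made-up toy data (an assumption for illustration only):
# surfaces in m² and prices in arbitrary units.
surfaces = [50.0, 80.0, 120.0, 150.0, 200.0]
prices = [1500.0, 2300.0, 3700.0, 4400.0, 6100.0]

# Scale by the known order of magnitude so the averages land in [0.1, 10].
x_scale, y_scale = 100.0, 1000.0
scaled_x = [x / x_scale for x in surfaces]
scaled_y = [y / y_scale for y in prices]

# Fit on the scaled data, as the encrypted model would see it.
slope_scaled, intercept_scaled = fit_simple_ols(scaled_x, scaled_y)

# Map the statistics back to the original units.
slope = slope_scaled * y_scale / x_scale
intercept = intercept_scaled * y_scale

# Reference fit on the unscaled data, for comparison.
slope_ref, intercept_ref = fit_simple_ols(surfaces, prices)
print(slope, slope_ref)
print(intercept, intercept_ref)
```

Both printed pairs should agree up to floating-point rounding. A prediction made in scaled units is brought back to the original units by multiplying it by `y_scale`.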
In this tutorial I will load the data and scale it following this idea:
In [ ]:
scale_data = torch.Tensor([10., 10., 10., 1., 1., 10., 100., 10., 10., 1000., 10., 1000., 10.])
scale_target = 100.0
housing_data = []
housing_targets = []
for worker in workers:
    housing_data.append(sy.local_worker.request_search(["#housing", "#data"], location=worker)[0] / scale_data.send(worker))
    housing_targets.append(sy.local_worker.request_search(["#housing", "#target"], location=worker)[0] / scale_target)
In order to run the linear regression, we need two more workers: a crypto provider and an honest-but-curious worker. Both are necessary to ensure the security of the SMPC computations when we run the model in a pool with more than 3 workers.
Note: an honest-but-curious worker is a legitimate participant in a communication protocol who will not deviate from the defined protocol but will attempt to learn all possible information from the messages it legitimately receives.
In [ ]:
crypto_prov = sy.VirtualWorker(hook, id="crypto_prov")
hbc_worker = sy.VirtualWorker(hook, id="hbc_worker")
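For intuition about what SMPC protects against an honest-but-curious participant, here is a minimal sketch of additive secret sharing, one common building block of SMPC. It is plain Python and is not PySyft's internal implementation: the modulus `Q` and the helpers `share`, `reconstruct` and `add_shared` are assumptions made for this illustration. Each party holds one random-looking share, and only the sum of all shares reveals the value.

```python
import secrets

Q = 2 ** 62  # modulus of the ring; an arbitrary choice for this sketch


def share(secret, n_parties):
    """Split an integer into n additive shares that sum to it modulo Q."""
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares


def reconstruct(shares):
    """Recover the secret; this requires every share."""
    return sum(shares) % Q


def add_shared(a_shares, b_shares):
    """Each party adds its two local shares; no communication is needed."""
    return [(a + b) % Q for a, b in zip(a_shares, b_shares)]


a_shares = share(25, 3)
b_shares = share(17, 3)

# The parties compute shares of the sum without ever seeing 25 or 17.
total = reconstruct(add_shared(a_shares, b_shares))
print(total)
```

Addition works share by share, as shown. Multiplying two shared values typically needs extra correlated randomness supplied by a separate party, which is the kind of role a crypto provider plays in such protocols.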
Now let's import EncryptedLinearRegression from the linalg module of PySyft:
In [ ]:
from syft.frameworks.torch.linalg import EncryptedLinearRegression
Let's train the model!!
In [ ]:
crypto_lr = EncryptedLinearRegression(crypto_provider=crypto_prov, hbc_worker=hbc_worker)
crypto_lr.fit(housing_data, housing_targets)
We can display the results with the method .summarize():
In [ ]:
crypto_lr.summarize()
We can see that the EncryptedLinearRegression does not only give the coefficients and intercept values, but also their standard errors and the p-values!
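As a reminder of what these extra statistics mean, here is a minimal plaintext sketch of the textbook formulas for a single-feature OLS fit. It is not how EncryptedLinearRegression computes them, and the data and the helper name `simple_ols_with_errors` are made up for illustration. The p-value of a coefficient is obtained by comparing its t-statistic with a Student's t distribution with n - 2 degrees of freedom; that last step is left out here because the Python standard library has no t-distribution CDF.

```python
import math


def simple_ols_with_errors(xs, ys):
    """Single-feature OLS with standard errors and t-statistics."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # Residual variance estimate with n - 2 degrees of freedom.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / (n - 2)

    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + x_mean ** 2 / sxx))

    return {
        "slope": slope,
        "intercept": intercept,
        "se_slope": se_slope,
        "se_intercept": se_intercept,
        "t_slope": slope / se_slope,
        "t_intercept": intercept / se_intercept,
    }


# Made-up toy data (an assumption for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

stats = simple_ols_with_errors(xs, ys)
for name, value in stats.items():
    print("{:<14s}{:>12.4f}".format(name, value))
```

A large absolute t-statistic corresponds to a small p-value, which is why the slope in this toy example would be reported as highly significant while the intercept would not.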
Now, in order to show the effectiveness of the EncryptedLinearRegression, let's compare it with the Linear Regression from other known libraries.
First, let's send the data to the local worker and transform the torch.Tensors into numpy arrays:
In [ ]:
import numpy as np
data_tensors = [x.copy().get() for x in housing_data]
target_tensors = [y.copy().get() for y in housing_targets]
data_np = torch.cat(data_tensors, dim=0).numpy()
target_np = torch.cat(target_tensors, dim=0).numpy()
Now let's compare the results with sklearn's Linear Regression:
In [ ]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(data_np, target_np.squeeze())
Display the results:
In [ ]:
print("=" * 25)
print("Sklearn Linear Regression")
print("=" * 25)
for i, coef in enumerate(lr.coef_, 1):
    print(" coeff{:<3d}".format(i), "{:>14.4f}".format(coef))
print(" intercept:", "{:>12.4f}".format(lr.intercept_))
print("=" * 25)
You can see that the results are pretty much the same! There are some small differences, but they are never higher than 0.2% of the value computed by the sklearn model.
For an encrypted model that can compute linear regression coefficients without ever revealing the data, this is a huge achievement!
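A comparison like this can be checked programmatically rather than by eye. The sketch below uses a hypothetical helper, `max_relative_difference`, and made-up coefficient values; in practice you would plug in `lr.coef_` and the coefficients reported by the encrypted model, however you extract them from its summary.

```python
def max_relative_difference(reference, estimate):
    """Largest |estimate - reference| / |reference| over all coefficients.

    Assumes every reference value is non-zero.
    """
    return max(abs(e - r) / abs(r) for r, e in zip(reference, estimate))


# Made-up values for illustration; these are NOT results from this notebook.
reference_coefs = [0.5000, -1.2000, 3.4000]
encrypted_coefs = [0.5004, -1.2010, 3.3990]

worst = max_relative_difference(reference_coefs, encrypted_coefs)
within_tolerance = worst <= 0.002  # the 0.2% bound mentioned above

print("max relative difference: {:.4%}".format(worst))
print("within 0.2%:", within_tolerance)
```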
We can do the same using the linear regression from the statsmodels API, which also gives us the standard errors and p-values of the coefficients. We can then compare them with the results given by the EncryptedLinearRegression.
In [ ]:
import statsmodels.api as sm
mod = sm.OLS(target_np.squeeze(), sm.add_constant(data_np), hasconst=True)
res = mod.fit()
print(res.summary())
Once again, we can see that all results are pretty much the same!!
And voilà! We were able to train an OLS Regression model on distributed data and without ever seeing it. We were even able to compute standard errors and p-values for each coefficient.
Also, after comparing our results with results given by other known libraries, we were able to validate this approach.
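To build some intuition for why OLS lends itself to distributed data, note that the closed-form solution only needs sums over the rows, so every worker can contribute local aggregates instead of raw rows. The sketch below shows this in plain Python for a single feature, without any encryption. It is a general illustration and not the exact procedure used by EncryptedLinearRegression or by the paper; the chunks and the helpers `local_stats`, `combine` and `solve` are made up for this example.

```python
import math


def local_stats(xs, ys):
    """Aggregates one worker computes on its own chunk: n, Σx, Σy, Σx², Σxy."""
    return (
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(x * y for x, y in zip(xs, ys)),
    )


def combine(stats_list):
    """Element-wise sum of the workers' aggregates."""
    return tuple(sum(values) for values in zip(*stats_list))


def solve(stats):
    """Single-feature OLS from the aggregates: returns (slope, intercept)."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Made-up chunks held by three hypothetical workers.
worker_chunks = [
    ([1.0, 2.0], [2.1, 3.9]),
    ([3.0], [6.2]),
    ([4.0, 5.0], [7.8, 10.1]),
]

# Only the aggregates leave each worker, never the rows themselves.
slope, intercept = solve(combine([local_stats(xs, ys) for xs, ys in worker_chunks]))

# Reference: the same fit on the pooled data.
pooled_x = [x for xs, _ in worker_chunks for x in xs]
pooled_y = [y for _, ys in worker_chunks for y in ys]
slope_pooled, intercept_pooled = solve(local_stats(pooled_x, pooled_y))

print(slope, slope_pooled)
print(intercept, intercept_pooled)
```

In an encrypted setting the aggregates themselves would also be protected, for example by secret sharing, so that no single party sees another worker's sums.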
Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement toward privacy preserving, decentralized ownership of AI and the AI supply chain (data), you can do so in the following ways!
The easiest way to help our community is just by starring the repositories! This helps raise awareness of the cool tools we're building.
We made really nice tutorials to get a better understanding of what Federated and Privacy-Preserving Learning should look like and how we are building the bricks for this to happen.
The best way to keep up to date on the latest advancements is to join our community!
The best way to contribute to our community is to become a code contributor! If you want to start "one off" mini-projects, you can go to the PySyft GitHub Issues page and search for issues marked Good First Issue.
If you don't have time to contribute to our codebase, but would still like to lend support, you can also become a Backer on our Open Collective. All donations go toward our web hosting and other community expenses such as hackathons and meetups!