Federated learning - Train a reinforcement learning agent in a CartPole environment

This tutorial demonstrates training a reinforcement learning agent using federated learning in a CartPole environment. Before running this notebook you will need to install OpenAI Gym.
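
As a rough setup sketch, assuming a pip-based environment (the exact package versions this notebook was written against may differ), the dependencies can be installed from within the notebook:

!pip install gym syft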

To train our agent we use a policy implemented as a simple neural network that maps the CartPole environment's state space to its action space. The policy is trained with federated learning using the PySyft library: the program simulates the policy training happening on a remote machine, represented by the virtual worker Bob.

References: 1. PyTorch Examples

Author: Amit Rastogi Github: @amit-rastogi Twitter: @amitrastogi

Import Dependencies


In [1]:
import torch
from torch import nn, optim
import torch.nn.functional as F
from torch.distributions import Categorical
import gym
import numpy as np
import syft as sy


WARNING:tf_encrypted:Falling back to insecure randomness since the required custom op could not be found for the installed version of TensorFlow (1.13.1). Fix this by compiling custom ops.

Create CartPole environment


In [2]:
env = gym.make('CartPole-v0')
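
As a quick optional check, assuming the standard Gym API, the environment's spaces explain the layer sizes of the policy network defined below: CartPole observations are 4-dimensional and there are 2 discrete actions.

print(env.observation_space.shape)  # (4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space.n)           # 2: push the cart to the left or to the right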

Hook Torch and create a virtual remote worker


In [3]:
hook = sy.TorchHook(torch)
bob = sy.VirtualWorker(hook, id="bob")
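
As a minimal illustration of the pointer pattern used throughout this notebook, assuming the same PySyft release the notebook was written against: a tensor sent to bob is replaced locally by a pointer, operations on the pointer execute on bob, and .get() retrieves the result.

x = torch.tensor([1.0, 2.0]).send(bob)  # x is now a pointer to a tensor stored on bob
y = x + x                               # the addition is executed on bob
print(y.get())                          # tensor([2., 4.]), retrieved back to the local worker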

Implement our neural network policy


In [4]:
class Policy(nn.Module):
    def __init__(self):
        super(Policy, self).__init__()
        self.input = nn.Linear(4, 4)    # 4-dimensional CartPole state -> hidden layer
        self.output = nn.Linear(4, 2)   # hidden layer -> 2 action logits (softmax applied in forward)

        self.episode_log_probs = []     # log probabilities of the actions taken in the current episode
        self.episode_raw_rewards = []   # raw rewards collected in the current episode

    def forward(self, x):
        x = self.input(x)
        x = F.relu(x)
        x = self.output(x)
        x = F.softmax(x, dim=1)
        return x
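
As an illustrative aside (not part of the training code), the policy maps a batch of 4-dimensional CartPole states to a probability distribution over the 2 possible actions:

test_policy = Policy()                  # throwaway instance just to check shapes
probs = test_policy(torch.zeros(1, 4))  # a batch containing a single state
print(probs.shape)                      # torch.Size([1, 2])
print(probs.sum().item())               # ~1.0, since the output is a softmax over the two actions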

In [5]:
policy = Policy()
optimizer = optim.SGD(params=policy.parameters(), lr=0.03)
#discount rate to be used for action score calculation
discount_rate = 0.95

In [6]:
def select_action(state):
    state = torch.from_numpy(state).float().unsqueeze(0)
    #send the environment state to bob
    state = state.send(bob)
    probs = policy(state)
    #we need to get the estimated probabilities back to sample the action, since Categorical
    #does not yet support remote tensor operations
    probs = probs.get()
    m = Categorical(probs)
    action = m.sample()
    policy.episode_log_probs.append(m.log_prob(action))
    #get the state back, since the new state will be sent to bob on the next step
    state.get()
    return action.item()

def discount_and_normalize_rewards():
    discounted_rewards = []
    cumulative_rewards = 0
    
    for reward in policy.episode_raw_rewards[::-1]:
        cumulative_rewards = reward + discount_rate * cumulative_rewards
        discounted_rewards.insert(0, cumulative_rewards)
    
    discounted_rewards = torch.tensor(discounted_rewards)
    discounted_rewards = (discounted_rewards - discounted_rewards.mean())/discounted_rewards.std()
    
    return discounted_rewards

def update_policy():
    policy_loss = []
    discounted_rewards = discount_and_normalize_rewards()
    for log_prob, action_score in zip(policy.episode_log_probs, discounted_rewards):
        policy_loss.append(-log_prob * action_score)
    
    optimizer.zero_grad()
    policy_loss = torch.cat(policy_loss).sum()
    policy_loss.backward()
    optimizer.step()
    del policy.episode_log_probs[:]
    del policy.episode_raw_rewards[:]
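
For intuition, here is a small standalone check of the discounting logic (an illustrative example, not part of the training code): with rewards [1, 1, 1] and discount_rate 0.95, the returns are 1 + 0.95 * (1 + 0.95 * 1) = 2.8525, 1.95 and 1.0, which are then standardized to zero mean and unit standard deviation so that earlier actions receive the largest scores.

rewards = [1.0, 1.0, 1.0]
returns, cumulative = [], 0.0
for r in reversed(rewards):
    cumulative = r + 0.95 * cumulative
    returns.insert(0, cumulative)
print(returns)                                     # [2.8525, 1.95, 1.0]
returns = torch.tensor(returns)
print((returns - returns.mean()) / returns.std())  # standardized returns, largest for the first step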

Train our Policy


In [7]:
total_rewards = []
# send the policy to bob for training
policy.send(bob)
for episode in range(500):
    state = env.reset()
    episode_rewards = 0
    for step in range(1000):
        action = select_action(state)
        state, reward, done, _ = env.step(action)
        #env.render()  #uncomment to render the current environment
        policy.episode_raw_rewards.append(reward)
        episode_rewards += reward
        
        if done:
            break        
    #to keep track of rewards earned in each episode
    total_rewards.append(episode_rewards)
    update_policy()

#cleanup
policy.get()
bob.clear_objects()
print('Average reward: {:.2f}\tMax reward: {:.2f}'.format(np.mean(total_rewards), np.max(total_rewards)))


Average reward: 19.17	Max reward: 83.00

Well Done!

Our agent managed to keep the pole upright for a maximum of 83 consecutive steps, using a very simple neural network policy trained with federated learning via PySyft.

Limitations

In the select_action function we have to get the estimated probabilities back to our local worker to sample the action, since Categorical does not currently support remote tensor operations.
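
A minimal sketch of the workaround, under the same assumptions about the PySyft API as above: the probability tensor has to be materialized locally before Categorical can sample from it.

remote_probs = torch.tensor([[0.6, 0.4]]).send(bob)  # stand-in for probabilities produced by the remote policy
local_probs = remote_probs.get()                     # bring the probabilities back to the local worker
action = Categorical(local_probs).sample()           # Categorical expects a plain local tensor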

Congratulations!!! - Time to Join the Community!

Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement toward privacy preserving, decentralized ownership of AI and the AI supply chain (data), you can do so in the following ways!

Star PySyft on GitHub

The easiest way to help our community is just by starring the repositories! This helps raise awareness of the cool tools we're building.

Pick our tutorials on GitHub!

We made really nice tutorials to get a better understanding of what Federated and Privacy-Preserving Learning should look like and how we are building the bricks for this to happen.

Join our Slack!

The best way to keep up to date on the latest advancements is to join our community!

Join a Code Project!

The best way to contribute to our community is to become a code contributor! If you want to start "one off" mini-projects, you can go to PySyft GitHub Issues page and search for issues marked Good First Issue.

If you don't have time to contribute to our codebase, but would still like to lend support, you can also become a Backer on our Open Collective. All donations go toward our web hosting and other community expenses such as hackathons and meetups!