
[REQUESTING OVERVIEW OF DISTRIBUTED HANDYRL] #211

Open
adypd97 opened this issue Jul 28, 2021 · 4 comments

Comments

adypd97 commented Jul 28, 2021

Hello HandyRL Team!

First off, thanks for making such a useful repository for RL! I love it!

I am trying to understand how the distributed architecture of HandyRL works, but due to the lack of documentation it has so far been difficult to work out how it is implemented.

I'll give an example (following the Large Scale Training document in the repo):
I have 3 VMs running on GCP (1 as the server, i.e. the learner, and 2 others as workers). In config.yaml I entered the learner's external IP (the document says it's valid to use the external IP) in the worker_args parameter on both workers, as per the instructions in the document, and tried to run it. However, nothing seems to happen: in the following output the server appears to just keep sleeping and does nothing.
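For reference, the relevant part of config.yaml on both worker VMs looks roughly like this (the address below is a placeholder, not my real server IP):

    worker_args:
        server_address: '203.0.113.10'  # external IP of the learner VM (placeholder)
        num_parallel: 32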

OUTPUT:

xyz@vm1:~/HandyRL$ python3 main.py --train-server
{'env_args': {'env': 'HungryGeese'}, 'train_args': {'turn_based_training': False, 'observation': False, 'gamma': 0.8, 'forward_steps': 32, 'compress_steps': 4, 'entropy_regularization': 0.002, 'entropy_regularization_decay': 0.3, 'update_episodes': 500, 'batch_size': 400, 'minimum_episodes': 1000, 'maximum_episodes': 200000, 'epochs': -1, 'num_batchers': 7, 'eval_rate': 0.1, 'worker': {'num_parallel': 32}, 'lambda': 0.7, 'max_self_play_epoch': 1000, 'policy_target': 'TD', 'value_target': 'TD', 'eval': {'opponent': ['modelbase'], 'weights_path': 'None'}, 'seed': 0, 'restart_epoch': 0}, 'worker_args': {'server_address': '<EXTERNAL_IP_OF_SERVER_GOES_HERE_FOR_WORKERS>', 'num_parallel': 32}}
Loading environment football failed: No module named 'gfootball'
started batcher 0
started batcher 1
started batcher 2
started batcher 3
started batcher 4
started batcher 5
waiting training
started entry server 9999
started batcher 6
started worker server 9998
started server

I was hoping you could provide some guidance on how I can proceed. In any case, documentation or a brief but complete overview of the distributed architecture would also be appreciated, so that I can debug the problem on my own.

Thank you!

ikki407 commented Jul 29, 2021

Hi @adypd97, thank you for your interest in HandyRL!

First of all, after the training server has launched, you need to run the workers on the worker VMs with python main.py --worker (with the server address written into the worker config, i.e. worker_args). This command connects the workers to the server. Once the server detects the worker connections, the learning process starts.
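Roughly, the launch sequence looks like this (run from the HandyRL directory on each VM):

    # on the learner VM
    python main.py --train-server

    # on each worker VM, after writing the server address into worker_args in config.yaml
    python main.py --worker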

We illustrated an overview of the distributed architecture before, in the Google Research Football competition. I hope this helps you.

Thanks

adypd97 commented Jul 29, 2021

Hi @ikki407!

Thanks for the link to the documentation! Very helpful!

On the main issue: yes, I ran the 2 worker VMs following the steps you mention (and I entered the public IP of the server VM, i.e. the learner, in the worker_args parameter on both workers). After that I got the OUTPUT mentioned in my initial comment. It seems the learner is not able to detect the workers.

As further evidence for that, I added a simple print statement to ./handyrl/train.py in the following function (starting at line 404):

    def run(self):
        print('waiting training')
        while not self.shutdown_flag:
            if len(self.episodes) < self.args['minimum_episodes']:
                print('here')  # <- the print statement I added
                time.sleep(1)
                continue
            if self.steps == 0:
                self.batcher.run()
                print('started training')
            model = self.train()
            self.report_update(model, self.steps)
        print('finished training')

And in the output I get the following:
OUTPUT:

xyz@vm1:~/HandyRL$ python3 main.py --train-server
{'env_args': {'env': 'HungryGeese'}, 'train_args': {'turn_based_training': False, 'observation': False, 'gamma': 0.8, 'forward_steps': 32, 'compress_steps': 4, 'entropy_regularization': 0.002, 'entropy_regularization_decay': 0.3, 'update_episodes': 500, 'batch_size': 400, 'minimum_episodes': 1000, 'maximum_episodes': 200000, 'epochs': -1, 'num_batchers': 7, 'eval_rate': 0.1, 'worker': {'num_parallel': 32}, 'lambda': 0.7, 'max_self_play_epoch': 1000, 'policy_target': 'TD', 'value_target': 'TD', 'eval': {'opponent': ['modelbase'], 'weights_path': 'None'}, 'seed': 0, 'restart_epoch': 0}, 'worker_args': {'server_address': '<EXTERNAL_IP_OF_SERVER_GOES_HERE_FOR_WORKERS>', 'num_parallel': 32}}
Loading environment football failed: No module named 'gfootball'
started batcher 0
started batcher 1
started batcher 2
started batcher 3
started batcher 4
started batcher 5
waiting training
started entry server 9999
started batcher 6
started worker server 9998
started server
here
here
here...

I hope this helps in diagnosing the problem. In any case, thanks once again!

ikki407 commented Jul 29, 2021

From your outputs, it seems that the server is not connecting to the workers.

Next steps to debug...

  • Check whether it runs correctly on your local machine (localhost)
  • The internal IP can be used if your instances are on the same GCP network
  • Use a small config to debug (batch_size=1, minimum_episodes=10, update_episodes=1, ...); see the sketch below
  • Check the GCP network/firewall settings (does ping succeed? is TCP traffic allowed?)
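For example, a debug-sized config and a quick connectivity check could look roughly like this (the IP is a placeholder; 9999 and 9998 are the entry/worker server ports shown in your output, and nc is just one way to test a TCP connection):

    # in config.yaml (train_args): shrink the thresholds so training starts quickly
    batch_size: 1
    minimum_episodes: 10
    update_episodes: 1

    # from a worker VM: check that the learner is reachable over the network
    ping 203.0.113.10          # placeholder for the learner's IP
    nc -vz 203.0.113.10 9999   # entry server port
    nc -vz 203.0.113.10 9998   # worker server port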

ikki407 commented Jul 29, 2021

What does the worker process/VM output look like? If the workers are still running without any errors, there may be some problem I haven't seen before.
