train_baseline might not be working in cleanup environment #144

Open
joonleesky opened this issue Apr 2, 2019 · 13 comments
@joonleesky

All of the agents' rewards saturate to 0 at around ~390 episodes when training with the default configuration of the cleanup environment in train_baseline.py.
All of the agents then keep getting 0 reward until the end of training.

@eugenevinitsky
Owner

Sorry for the difficulties you've been having and thank you for trying out the baselines!
There are three aspects to this. One, 390 episodes is not a lot; if you look at the paper the training curves come from, they take many, many more iterations: https://arxiv.org/abs/1810.08647
Two, some of the default parameters of Ray, such as the way the underlying value functions are implemented, may differ from the paper, so simply taking the default hyperparameters from the paper may not be sufficient.
Three, in my understanding, most of the time the scores in Cleanup actually come out as zero, and a score above zero is the exception rather than the rule. Is this third point correct @natashamjaques ?

@joonleesky
Author

Thank you so much for your kind reply ^_^ !! Still, I have some concerns.
Even though I have trained for around 20,000,000 time steps (20,000 episodes), all of my agents' rewards have stayed at 0 since episode 390, without any change.

For the second part, I will look through some of the default parameters in Ray and will let you know if I find any critical parameters that affect performance.

@eugenevinitsky
Owner

One thing I would suggest trying is disabling the hyperparameters in that file and just using Ray's defaults, with a possible sweep over the learning rate and the training batch size. To my mind those are usually the most critical hyperparameters.
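For concreteness, a sweep along those lines could look something like the rough sketch below. This is not the repo's actual script: the env name "cleanup_env", the stopping criterion, and the candidate values are placeholders, and it assumes the Cleanup environment has already been registered with RLlib the way train_baseline.py does for its own env name.

```python
# Hypothetical sweep over learning rate and train batch size with Ray Tune.
# Assumes the Cleanup env has already been registered under the name "cleanup_env".
import ray
from ray import tune

if __name__ == "__main__":
    ray.init()
    tune.run(
        "A3C",
        stop={"timesteps_total": int(2e8)},  # illustrative stopping point only
        config={
            "env": "cleanup_env",   # placeholder registration name
            "num_workers": 4,
            # The two hyperparameters suggested above; values are illustrative.
            "lr": tune.grid_search([1e-3, 1e-4, 1e-5]),
            "train_batch_size": tune.grid_search([128, 512, 2048]),
        },
    )
```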

@eugenevinitsky
Owner

This isn't super unusual: without some initial luck with agents deleting enough of the initial waste cells, they never learn to get any apples. Increasing the magnitude of the entropy coefficient (making it more negative), which should encourage exploration, may also help with the 0 score.
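If you want to double-check the sign convention before sweeping it, a quick (hypothetical) check is sketched below. It assumes the older agents-style RLlib API where ray.rllib.agents.a3c exposes DEFAULT_CONFIG; verify against your installed version's loss before copying any values.

```python
# Print RLlib's default A3C entropy coefficient so you know which direction
# "more exploration" goes in your installed Ray version. The comment above
# implies that, for the version this repo targets, a more negative coefficient
# means a larger exploration bonus.
from ray.rllib.agents.a3c import DEFAULT_CONFIG

config = DEFAULT_CONFIG.copy()
print("default entropy_coeff:", config["entropy_coeff"])

# Illustrative sweep values only, following the sign convention described above;
# flip the signs if your version's default is positive.
candidate_entropy_coeffs = [-0.01, -0.05, -0.1]
```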

@natashamjaques
Collaborator

Hey, just to chime in: actually, agents collectively scoring 0 reward in Cleanup is very typical. When I was using these environments at DeepMind, it was understood that 0 was the normal score for A3C agents. If you check out the Inequity Aversion paper https://arxiv.org/pdf/1803.08884.pdf, they report an average collective return of near 0 for 5 A3C agents in Cleanup.

I know that my paper reports a higher score; this was pretty atypical and actually just because I did a super extensive hyperparameter sweep for the baseline. But you can consider 0 the expected score.

Agents have difficulty solving this task for several reasons: the partial observability (they can't see the apples appear when they clean the river), the delayed rewards (they have to clean the river, walk to the apple patch, and obtain apples before they can build that association), and the fact that even if one agent learns to clean the river, the other agents will exploit it so much that they consume all the apples before it can harvest anything resulting from its cleaning. So it will eventually un-learn the cleaning behavior.

Hope this helps!

@joonleesky
Author

Thanks for your comments! They help a lot.
I was just a bit puzzled that the causal influence paper's A3C baseline in Cleanup was around 50-100 and the Inequity Aversion paper's baseline seems to be around 10-50, rather than 0 all the way.

I am a beginner in MARL and I'm hoping to try some of my ideas for solving these social dilemmas. The phrase "super extensive hyperparameter sweep" sounds a bit scary to me, since I'm working on my personal computer.

By the way, I loved the way you encoded the intrinsic motivation, and thanks for the open-source reproduction of the environments.

@eugenevinitsky
Owner

Hi, we've found a few bugs that may be contributing to your difficulties reproducing the results, and we'll ping you here as soon as they're resolved; apologies! Additionally, you may want to try focusing on Harvest; I've found it to be less sensitive to hyperparameters.

@joonleesky
Author

joonleesky commented Apr 28, 2019

Actually, I've had some fun experimenting with the Cleanup and Harvest environments.
With the A3C algorithm, I was able to more or less successfully reproduce the results.
Below are the results of my re-implementation of the paper Inequity aversion improves cooperation in intertemporal social dilemmas.

CleanUp - Original Paper: [collective return training curve]

CleanUp - My Experiment: [collective return training curve]

However, one thing that confused me is that the collective returns soared much earlier than reported in the paper.
I assumed that timesteps count single-agent (environment) steps. In this multi-agent setting, are timesteps equal to single-agent timesteps * number of agents, or might this be the result of a few bugs?
Or maybe I'm just good at hyperparameter tuning? LOL

Thank you for your kind replies!!

@eugenevinitsky
Owner

Hi @natashamjaques, I think you might be able to answer this question best?

@eugenevinitsky
Owner

Hi @joonleesky, I'm pretty sure that timesteps is the total number of environment steps. It's perfectly possible that you've just found a better set of hyperparams. Would you mind posting what those hyperparams are so that I can:
(1) investigate the issue
(2) put those hyperparams into the project?
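For a concrete sense of scale under that convention, here is a back-of-the-envelope check. The 1,000-step horizon is inferred from the 20,000,000 steps / 20,000 episodes you quoted, and the 5-agent count comes from the Cleanup setting in the papers cited above, so treat both as assumptions.

```python
# Back-of-the-envelope check of the "timesteps = env steps" convention.
horizon = 1000        # inferred from 20,000,000 steps over 20,000 episodes
num_episodes = 20000
num_agents = 5        # Cleanup setting in the cited papers

env_steps = num_episodes * horizon     # 20,000,000 -> what the timesteps counter reports
agent_steps = env_steps * num_agents   # 100,000,000 -> total per-agent transitions collected
print(env_steps, agent_steps)
```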

@eugenevinitsky
Owner

Hi @joonleesky, I'd still love to know what hyperparams you wound up using!

@joonleesky
Author

Hello @eugenevinitsky , sorry for the late reply! I was in the middle of the semester.
I've tested the inequity aversion model again with 5 random seeds and the same hyperparameters, but the soaring performance was observable in only 1 out of the 5 runs.

Actually, I've had some fun experiments adopting the reward schemes from https://gist.github.com/dfarhi/66ec9d760ae0c49a5c492c9fae93984a and some other methods.

Even though the performance was somewhat unstable, I will commit the training scripts with hyperparameters, the model scripts, and the results soon, to help anyone who needs them.

It was really fun to watch how training progressed, as shown below. Thank you!

[animated GIF of a training rollout]

@LUCKYGT

LUCKYGT commented Nov 15, 2020

Hello @joonleesky , I ran into the same problem as you, but I couldn't find your commits. Could you share a link to your contributions? I'd be very grateful.
