
Bellman Equation is not correct #11

Open
Oshimani opened this issue Jan 18, 2024 · 2 comments

Comments

@Oshimani

Hey man,

I used your train function in my project because of its optimization. It runs the fit function in one batch and accelerates training quite a bit, thanks for that.

The problem is that your Bellman equation is slightly wrong. The Bellman optimality equation states that the best policy is the one that leads to the next state yielding the highest possible expected return.

Check out this blog or their sources: Deeplizard

Basically, instead of adding reward and next_qs[i], you want to add reward and max(next_qs):

next_states = np.array([memory[3] for memory in batch])
next_q_values = self.model.predict(next_states, verbose=0)
# Best attainable value from each next state (per-sample max over the model outputs)
max_next_q_value = np.max(next_q_values, axis=1)

x = []
y = []

for index, (state, action, reward, next_state, done) in enumerate(batch):
    if done:
        # Terminal transition: nothing to bootstrap from
        new_q = reward
    else:
        # Bellman target: immediate reward plus discounted best next value
        new_q = float(reward + self.gamma * max_next_q_value[index])

    x.append(state)
    y.append(new_q)

self.model.fit(np.array(x), np.array(y), batch_size=len(x),
               use_multiprocessing=True, workers=16, verbose=0)

I copy-pasted this from my code, so the variable names are different, but I think you get the point.

https://github.com/nuno-faria/tetris-ai/blob/4d01877100870e2a6a1ef84dc955354e534589ae/dqn_agent.py#L132C64-L132C64

Again thanks for this cool optimization!
Keep up the good work.

@nuno-faria
Owner

Thanks for the tip. I tested it, but the agent was not able to achieve higher scores. I also don't think the change is correct: every state in the batch ends up with the same new_q (the new_q of the best play), so the agent cannot really differentiate between the different plays, since their targets become essentially equal.
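
To illustrate the objection with made-up numbers (a minimal sketch only; the arrays are hypothetical, and the network is assumed to output a single value per state, as in this repo):

import numpy as np

# Hypothetical predicted values for the next states of four sampled transitions
next_qs = np.array([1.2, 0.4, 3.1, 0.9])
rewards = np.array([1.0, 1.0, 2.0, 1.0])
gamma = 0.95

# Current behaviour: each transition bootstraps from its own next state
per_transition_targets = rewards + gamma * next_qs

# Suggested change as applied here: every transition shares the batch-wide maximum,
# so the targets only differ by their immediate rewards
shared_max_targets = rewards + gamma * np.max(next_qs)

print(per_transition_targets)  # [2.14  1.38  4.945 1.855]
print(shared_max_targets)      # [3.945 3.945 4.945 3.945]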

@Oshimani
Author

I also did not achieve significantly different results, so the impact of the change seems to be minor in practice.
The new_q values are not all identical, though; they still differ in their reward component.

The Bellman optimality equation states that for any state-action pair at time t, the expected return is the immediate reward plus the discounted maximum expected return achievable from any possible next state-action pair. That is why max_next_q_value should be used.
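
In symbols, the standard textbook form of the Bellman optimality equation (not copied from either codebase) is:

Q^*(s_t, a_t) = \mathbb{E}\big[\, R_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\big]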
