Ok, since you took the time to respond, I just want to be constructive as well:
So I don't have a big problem with some of the function definitions which can be compact, as the other comment points out.
The reason I don't like this code is that it doesn't comment on any of the critical bits. I don't necessarily care whether you call the input to your matmul 'x' or 'tensor' or 'input' (although consistency is nice).
The thing that would stop me from absorbing and modifying this code is that it does not comment on the bits that are non-obvious to me if I haven't written a Transformer before. For example:
'Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.' - I will have to ask a colleague what that means. Why not write out what the actual issue is instead of mysteriously hinting at some potential problem?
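To make it concrete, here is roughly what that one-liner computes and the kind of comment I would want next to it. This is my own NumPy sketch of the idea, not the repo's code:

    import numpy as np

    def attention_mask(nd, ns):
        # Causal mask for nd query positions attending over ns key positions
        # (ns >= nd when cached keys from earlier steps are prepended).
        # mask[i, j] = 1 iff key j is at or before query i in the sequence,
        # so no position can attend to the future.
        i = np.arange(nd)[:, None]
        j = np.arange(ns)
        return (j <= i + (ns - nd)).astype(np.float32)

    print(attention_mask(3, 5))
    # [[1. 1. 1. 0. 0.]
    #  [1. 1. 1. 1. 0.]
    #  [1. 1. 1. 1. 1.]]

With a comment like that, the band_part equivalence becomes an implementation footnote rather than the only explanation.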
Code like this
"q, k, v = map(split_heads, tf.split(c, 3, axis=2))"
will require me to re-read the paper section and then print out all the tensors to figure out which tensor has which shape at which point. Instead of writing relatively useless line comments like '#Transformer', I would comment every non-trivial shape modification with the current layout and what we are trying to achieve by modifying it.
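Something along these lines, a NumPy sketch of mine with made-up dimension names rather than the actual code, is the level of shape commentary I have in mind:

    import numpy as np

    batch, seq, n_head, d_model = 2, 10, 4, 64  # illustrative sizes

    def split_heads(x):
        # [batch, seq, d_model] -> [batch, n_head, seq, d_model // n_head]
        # so attention can be computed independently for each head.
        b, s, d = x.shape
        return x.reshape(b, s, n_head, d // n_head).transpose(0, 2, 1, 3)

    # c comes from a single projection that produces queries, keys and
    # values at once: [batch, seq, 3 * d_model].
    c = np.random.randn(batch, seq, 3 * d_model)

    # Cut the last axis into three [batch, seq, d_model] tensors, then
    # move each into the per-head layout.
    q, k, v = map(split_heads, np.split(c, 3, axis=2))
    print(q.shape)  # (2, 4, 10, 16)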
The other issue in my original comment was not specifically about that codebase, but I am sure you would admit that the baselines code was pretty much exactly what I was writing about re: ML scripts. That's not to denigrate its incredible usefulness to the community.
Since you mentioned spinning up, I thought I would add a few comments on that as well:
I think the Spinning Up codebase is good at making the code compact, and terrible at making sense of the data flow for beginners. There are a lot of line comments, but they do not actually explain what is conceptually going on; they often just repeat short-hands.
For example, look at the PPO implementation: https://github.com/openai/spinningup/blob/master/spinup/algo...
Here, the function is returning pi, logp, and logp_pi (and v). Do you know how incredibly confusing the distinction between these is for beginners? In particular, there is no explanation of why logp_pi even needs to be stored in the buffer.
We could recompute it from the states and stop the gradient when computing the likelihood ratio. A sensible tutorial-level comment here might be something along the lines of: we compute the log-likelihood in the same forward pass that computes the action, so we can later use it to compute the likelihood ratio; we could also re-compute it later from the buffered states.
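Sketched as code (PyTorch here for brevity and with my own names; the repo itself is TensorFlow), the flow that such a comment would walk through looks roughly like this:

    import torch
    from torch.distributions import Categorical

    policy = torch.nn.Linear(4, 2)  # toy policy: logits over 2 actions

    # Rollout: sample the action and compute its log-likelihood in the
    # same forward pass, then store logp_old in the buffer. Stored as a
    # plain value, it later serves as the constant denominator of the
    # likelihood ratio.
    obs = torch.randn(4)
    dist = Categorical(logits=policy(obs))
    act = dist.sample()
    logp_old = dist.log_prob(act).detach()

    # Update: only the new policy's log-prob carries gradient.
    logp_new = Categorical(logits=policy(obs)).log_prob(act)
    ratio = torch.exp(logp_new - logp_old)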
I will stop here, but I hope my point comes across: whenever I read code from your repos, there are some good parts (conciseness, cute numerical tricks), but there is a general lack of thoughtfulness about what the code is really trying to convey to a reader. It shows in the comments and it shows in the code organisation.
As a final note, I have seen this in many organisations and I do not mean to call you out. There is just this quality degradation that inevitably happens when nobody is incentivised (read: promoted, rewarded) to think about these things for an organisation.
Managers at all levels typically don't, because they don't get close enough to the subtle issues on a day-to-day level. If you are lucky, you get senior individual contributors who still look at code and raise the bar for the entire org. My genuine recommendation to you is to look for that, because a manager won't do it, and more fresh grads can't.
Very reasonable point that it is not clearly explained why you need to store logp_pi in the buffer. But the reason is that it would require additional code complexity to calculate it on the fly later. The likelihood ratio requires the denominator to be on the _old_ policy, so if you wanted to compute it on the fly, you would need to have a second policy in the computation graph to preserve the old policy while you change the current policy. You could not simply do a stop_gradient on the current policy and get the same results.
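To make that concrete with a toy sketch (PyTorch, my own snippet rather than the Spinning Up code): once the parameters have moved, re-deriving the log-prob from the current policy, even with the gradient stopped, no longer gives you pi_old.

    import torch
    from torch.distributions import Categorical

    policy = torch.nn.Linear(4, 2)
    obs, act = torch.randn(4), torch.tensor(0)

    # At collection time, pi_old(a|s) is defined by the parameters as they
    # are right now; storing logp_old freezes that value.
    logp_old = Categorical(logits=policy(obs)).log_prob(act).item()

    # ...one or more PPO gradient steps move the parameters...
    with torch.no_grad():
        for p in policy.parameters():
            p.add_(0.1 * torch.randn_like(p))

    # Recomputing from the updated policy, gradient stopped or not, gives
    # pi_new(a|s) rather than pi_old(a|s); the stored value is the correct
    # denominator.
    logp_recomputed = Categorical(logits=policy(obs)).log_prob(act).item()
    print(logp_old, logp_recomputed)  # differ once the parameters have moved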
My personal feeling is that tutorial-style explanations like this don't fit nicely into code comment flow. As a result, most tutorial-style descriptions went into material on the Spinning Up website rather than into the code. It isn't 100% comprehensive, certainly, but RL has an enormous surface area (there are tons and tons of little details that teaching material could dive into) and I feel pretty good about what we were able to cover. :)
Thank you for responding. Well, my point is that in particular the gradient on the likelihood ratio is what trips people up. They ask questions like 'why is this ratio not always 1?' or similar. This is why I would say explaining what goes where here is critical, i.e. that we save the prior logp_pi (even though we could recompute it) so that it is treated as a constant when computing the ratio/the gradient. That would be, from my perspective, the key pedagogical moment of a PPO tutorial. However, this is purely subjective and I agree that one can feel differently about where to put explanations.