Ok, since you took the time to respond, I just want to be constructive as well:
So I don't have a big problem with some of the function definitions which can be compact, as the other comment points out.
The reason I don't like this code is that it doesn't comment on any of the critical bits. I don't necessarily care whether you call the input to your matmul 'x' or 'tensor' or 'input' (although consistency is nice).
The thing that would stop me from absorbing and modifying this code is that it does not comment on the bits that are non-obvious to me if I haven't written a Transformer before. For example:
'Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.' - I will have to ask a colleague what that means. Why not write out what the actual issue is instead of mysteriously hinting at some potential problem?
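To make it concrete, here is roughly what that one-liner computes and the kind of comment I would want next to it. This is my own NumPy sketch of the idea, not the repo's code:

    import numpy as np

    def attention_mask(nd, ns):
        # Causal mask for nd query positions attending over ns key positions
        # (ns >= nd when cached keys from earlier steps are prepended).
        # mask[i, j] = 1 iff key j is at or before query i in the sequence,
        # so no position can attend to the future.
        i = np.arange(nd)[:, None]
        j = np.arange(ns)
        return (j <= i + (ns - nd)).astype(np.float32)

    print(attention_mask(3, 5))
    # [[1. 1. 1. 0. 0.]
    #  [1. 1. 1. 1. 0.]
    #  [1. 1. 1. 1. 1.]]

With a comment like that, the band_part equivalence becomes an implementation footnote rather than the only explanation.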
Code like this
"q, k, v = map(split_heads, tf.split(c, 3, axis=2))"
will require me to re-read the paper section and then print out all the tensors to figure out which tensor has which shape at which point. Instead of writing relatively useless line comments like '#Transformer', I would comment every non-trivial shape modification with the current layout and what we are trying to achieve by modifying it.
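Something along these lines, a NumPy sketch of mine with made-up dimension names rather than the actual code, is the level of shape commentary I have in mind:

    import numpy as np

    batch, seq, n_head, d_model = 2, 10, 4, 64  # illustrative sizes

    def split_heads(x):
        # [batch, seq, d_model] -> [batch, n_head, seq, d_model // n_head]
        # so attention can be computed independently for each head.
        b, s, d = x.shape
        return x.reshape(b, s, n_head, d // n_head).transpose(0, 2, 1, 3)

    # c comes from a single projection that produces queries, keys and
    # values at once: [batch, seq, 3 * d_model].
    c = np.random.randn(batch, seq, 3 * d_model)

    # Cut the last axis into three [batch, seq, d_model] tensors, then
    # move each into the per-head layout.
    q, k, v = map(split_heads, np.split(c, 3, axis=2))
    print(q.shape)  # (2, 4, 10, 16)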
The other issue in my original comment was not specifically about that codebase, but I am sure you would admit that the baselines code was pretty much exactly what I was writing about re: ML scripts. That's not to denigrate its incredible usefulness to the community.
Since you mentioned spinning up, I thought I would add a few comments on that as well:
I think the Spinning Up codebase is good at making the code compact, and terrible at making sense of the data flow for beginners. There are a lot of line comments, but they do not actually explain what is conceptually going on; they often just repeat short-hands.
For example, look at the PPO implementation: https://github.com/openai/spinningup/blob/master/spinup/algo...
Here, the function is returning pi, logp, and logp_pi (and v). Do you know how incredibly confusing the distinction between these is for beginners? In particular, there is no explanation of why logp_pi even needs to be stored in the buffer.
We could recompute it from the states and stop the gradient when computing the likelihood ratio. A sensible tutorial-level comment here might be something along the lines of: we compute the log-likelihood in the same forward pass that computes the action, so we can later use it to compute the likelihood ratio; we could also re-compute it later from the buffered states.
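Sketched as code (PyTorch here for brevity and with my own names; the repo itself is TensorFlow), the flow that such a comment would walk through looks roughly like this:

    import torch
    from torch.distributions import Categorical

    policy = torch.nn.Linear(4, 2)  # toy policy: logits over 2 actions

    # Rollout: sample the action and compute its log-likelihood in the
    # same forward pass, then store logp_old in the buffer. Stored as a
    # plain value, it later serves as the constant denominator of the
    # likelihood ratio.
    obs = torch.randn(4)
    dist = Categorical(logits=policy(obs))
    act = dist.sample()
    logp_old = dist.log_prob(act).detach()

    # Update: only the new policy's log-prob carries gradient.
    logp_new = Categorical(logits=policy(obs)).log_prob(act)
    ratio = torch.exp(logp_new - logp_old)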
I will stop here, but I hope my point comes across: whenever I read code from your repos, there are some good parts (conciseness, cute numerical tricks), but there is a general lack of thoughtfulness about what the code is really trying to convey to a reader. It shows in the comments and it shows in the code organisation.
As a final note, I have seen this in many organisations and I do not mean to call you out. There is just this quality degradation that inevitably happens when nobody is incentivised (read: promoted, rewarded) to think about these things for an organisation.
Managers at all levels typically don't, because they don't get close enough to the subtle issues on a day-to-day level. If you are lucky, you get senior individual contributors who still look at code and raise the bar for the entire org. My genuine recommendation to you is to look for that, because a manager won't do it, and more fresh grads can't.
Very reasonable point that it is not clearly explained why you need to store logp_pi in the buffer. But the reason is that it would require additional code complexity to calculate it on the fly later. The likelihood ratio requires the denominator to be on the _old_ policy, so if you wanted to compute it on the fly, you would need to have a second policy in the computation graph to preserve the old policy while you change the current policy. You could not simply do a stop_gradient on the current policy and get the same results.
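To make that concrete with a toy sketch (PyTorch, my own snippet rather than the Spinning Up code): once the parameters have moved, re-deriving the log-prob from the current policy, even with the gradient stopped, no longer gives you pi_old.

    import torch
    from torch.distributions import Categorical

    policy = torch.nn.Linear(4, 2)
    obs, act = torch.randn(4), torch.tensor(0)

    # At collection time, pi_old(a|s) is defined by the parameters as they
    # are right now; storing logp_old freezes that value.
    logp_old = Categorical(logits=policy(obs)).log_prob(act).item()

    # ...one or more PPO gradient steps move the parameters...
    with torch.no_grad():
        for p in policy.parameters():
            p.add_(0.1 * torch.randn_like(p))

    # Recomputing from the updated policy, gradient stopped or not, gives
    # pi_new(a|s) rather than pi_old(a|s); the stored value is the correct
    # denominator.
    logp_recomputed = Categorical(logits=policy(obs)).log_prob(act).item()
    print(logp_old, logp_recomputed)  # differ once the parameters have moved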
My personal feeling is that tutorial-style explanations like this don't fit nicely into code comment flow. As a result, most tutorial-style descriptions went into material on the Spinning Up website rather than into the code. It isn't 100% comprehensive, certainly, but RL has an enormous surface area (there are tons and tons of little details that teaching material could dive into) and I feel pretty good about what we were able to cover. :)
Thank you for responding. Well, my point is that in particular the gradient on the likelihood ratio is what trips people up. They ask questions like 'why is this ratio not always 1?' or similar. This is why I would say explaining what goes where here is critical, i.e. that we save the prior logp_pi (even though we could recompute it) so that it is treated as a constant when computing the ratio/the gradient. That would be, from my perspective, the key pedagogical moment of a PPO tutorial. However, this is purely subjective and I agree that one can feel differently about where to put explanations.