Conversation
    self.unroll_with_grad = unroll_with_grad
    self.use_root_inputs_for_after_train_iter = use_root_inputs_for_after_train_iter
    self.async_unroll = async_unroll
    if not isinstance(self._unroll_length, ConstantScheduler):
ConstantScheduler --> should check against a base class, e.g. Scheduler?
We need to check against ConstantScheduler here because a scalar input will be converted to one before this check due to the setter function on line 479.
Any non-constant scheduler should then raise an error if we're doing on-policy or async unroll.
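The interplay described above can be sketched as follows. This is a hypothetical minimal reconstruction of the behavior being discussed, not ALF's actual implementation: the class names `Scheduler`/`ConstantScheduler` follow the thread, while `StepScheduler` and `set_unroll_length` are illustrative stand-ins for the setter on line 479.

```python
class Scheduler:
    """Base class: a callable returning the current scheduled value."""
    def __call__(self):
        raise NotImplementedError

class ConstantScheduler(Scheduler):
    """Wraps a fixed scalar value."""
    def __init__(self, value):
        self._value = value
    def __call__(self):
        return self._value

class StepScheduler(Scheduler):
    """An illustrative non-constant scheduler: steps through a list of values."""
    def __init__(self, values):
        self._values = values
        self._i = 0
    def __call__(self):
        v = self._values[min(self._i, len(self._values) - 1)]
        self._i += 1
        return v

def set_unroll_length(value, async_unroll, on_policy):
    # The setter wraps a plain scalar into a ConstantScheduler first,
    # which is why the later check must compare against ConstantScheduler
    # rather than the Scheduler base class: a scalar would already pass
    # an isinstance(..., Scheduler) check after wrapping.
    scheduler = value if isinstance(value, Scheduler) else ConstantScheduler(value)
    if not isinstance(scheduler, ConstantScheduler) and (async_unroll or on_policy):
        raise ValueError(
            "A non-constant unroll_length scheduler requires synced "
            "off-policy training (no on-policy or async unroll).")
    return scheduler
```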
QuantuMope left a comment:
Hey Haichao, responded to your comment. Let me know if I misunderstood it.
This PR allows for a scheduled unroll length if we are running synced off-policy RL training:

- `async_unroll=False`
- `whole_replay_buffer_training=False`

It also allows for a scheduled value of 0, which in turn skips unrolling to train from the replay buffer.
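A scheduled value of 0 skipping the unroll step might look like the following sketch. All names here (`Trainer`, `unroll`, `train_from_replay_buffer`) are hypothetical stand-ins, not ALF's actual API; the point is only that the scheduler is re-evaluated each iteration and an unroll length of 0 falls through to replay-buffer-only training.

```python
class Trainer:
    def __init__(self, unroll_length_scheduler):
        # unroll_length_scheduler: a callable returning the unroll
        # length for the current iteration (possibly 0)
        self._unroll_length = unroll_length_scheduler
        self.replay_buffer = []
        self.log = []

    def unroll(self, n):
        # stand-in for collecting n environment steps
        self.log.append(("unroll", n))
        return list(range(n))

    def train_from_replay_buffer(self):
        self.log.append(("train", len(self.replay_buffer)))

    def train_iter(self):
        n = int(self._unroll_length())  # re-evaluated every iteration
        if n > 0:
            self.replay_buffer.extend(self.unroll(n))
        # a scheduled value of 0 skips unrolling entirely, so this
        # iteration trains purely from previously collected experience
        self.train_from_replay_buffer()
```

For example, scheduling lengths like [4, 0, 0, ...] would collect experience once and then keep training from the replay buffer, one way to "simulate" different collection/training ratios.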
This allows us to "simulate" very diverse training strategies. E.g.,
Codex cleverly makes a minimal change with full backward compatibility by adding the following code