PPO Continuous Action

PPO code

Discrete Action
- https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py
Continuous Action
- https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_continuous_action.py

Environment

Wrappers
- https://gymnasium.farama.org/api/wrappers/
HalfCheetah-v4
- https://gymnasium.farama.org/environments/mujoco/half_cheetah/

Implementation Details

https://docs.cleanrl.dev/rl-algorithms/ppo/

Calculating approx_kl

http://joschu.net/blog/kl-approx.html
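
The linked post compares three sample-based KL estimators, which it calls k1, k2 and k3. CleanRL's PPO scripts report approx_kl as ((ratio - 1) - logratio).mean(), which matches k3 with logratio = newlogprob - oldlogprob. Below is a minimal sketch in plain Python rather than torch; the function name kl_estimators and the sample log ratios are made up for illustration.

```python
import math

def kl_estimators(logratios):
    """Sample-based estimates of KL(pi_old || pi_new).

    logratios[i] = log pi_new(a_i | s_i) - log pi_old(a_i | s_i),
    with actions a_i sampled from pi_old (as in a PPO rollout).
    """
    n = len(logratios)
    # k1: unbiased but high variance, and individual terms can be negative
    k1 = sum(-lr for lr in logratios) / n
    # k2: biased but low variance, always non-negative
    k2 = sum(0.5 * lr ** 2 for lr in logratios) / n
    # k3: unbiased and non-negative term by term, since exp(x) - 1 - x >= 0
    k3 = sum((math.exp(lr) - 1.0) - lr for lr in logratios) / n
    return k1, k2, k3

# Hypothetical log ratios for four sampled actions
logratios = [0.02, -0.01, 0.03, -0.04]
k1, k2, k3 = kl_estimators(logratios)
print(k1, k2, k3)
```

When the policies are close, k2 and k3 nearly coincide, while k1 can be zero or negative on a small batch even though the true KL is positive.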

Questions

1.

박우성

# Lines 233~247
# bootstrap value if not done
with torch.no_grad():
    next_value = agent.get_value(next_obs).reshape(1, -1)
    advantages = torch.zeros_like(rewards).to(device)
    lastgaelam = 0
    for t in reversed(range(args.num_steps)):
        if t == args.num_steps - 1:
            nextnonterminal = 1.0 - next_done
            nextvalues = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        delta = rewards[t] + args.gamma * nextvalues * nextnonterminal - values[t]
        advantages[t] = lastgaelam = delta + args.gamma * args.gae_lambda * nextnonterminal * lastgaelam
    returns = advantages + values

By what mechanism does the code above compute its result?
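
A minimal sketch of the same recursion for a single environment, in plain Python instead of torch. The loop walks backward through the rollout. At each step it forms the TD error delta_t = r_t + gamma * V(s_{t+1}) * nextnonterminal - V(s_t), then accumulates A_t = delta_t + gamma * lambda * nextnonterminal * A_{t+1}. The function name compute_gae and the toy numbers are hypothetical. The meaning given to dones follows my reading of the CleanRL rollout loop, where dones[step] stores next_done before the environment is stepped.

```python
def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation for one environment.

    dones[t] is 1.0 when the observation at step t is the first of a new
    episode, so dones[t + 1] == 1.0 means the episode ended after step t.
    """
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    lastgaelam = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            # Last rollout step: bootstrap from the value of next_obs
            nextnonterminal = 1.0 - next_done
            nextvalues = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        # One-step TD error; the bootstrap term vanishes at episode ends
        delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
        # Discounted sum of TD errors, reset at episode boundaries
        lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
        advantages[t] = lastgaelam
    # The value target is the advantage plus the baseline it was measured against
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

# Toy rollout: gamma = lambda = 1 makes the arithmetic easy to check by hand
advantages, returns = compute_gae(
    rewards=[1.0, 1.0, 1.0], values=[0.5, 0.5, 0.5], dones=[0.0, 0.0, 0.0],
    next_value=0.5, next_done=0.0, gamma=1.0, gae_lambda=1.0,
)
print(advantages, returns)
```

With gamma = lambda = 1, each return is the sum of remaining rewards plus the bootstrapped next_value, so the first return is 3 + 0.5 = 3.5. Smaller lambda values lean more on the critic's estimates and less on the sampled rewards.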

How is GAE computed and turned into returns, which is later flattened into b_returns for the value loss below?

# Lines 285~298
# Value loss
newvalue = newvalue.view(-1)
if args.clip_vloss:
    v_loss_unclipped = (newvalue - b_returns[mb_inds]) ** 2
    v_clipped = b_values[mb_inds] + torch.clamp(
        newvalue - b_values[mb_inds],
        -args.clip_coef,
        args.clip_coef,
    )
    v_loss_clipped = (v_clipped - b_returns[mb_inds]) ** 2
    v_loss_max = torch.max(v_loss_unclipped, v_loss_clipped)
    v_loss = 0.5 * v_loss_max.mean()
else:
    v_loss = 0.5 * ((newvalue - b_returns[mb_inds]) ** 2).mean()

That is, how are these returns used in the Value loss above?
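
As a partial answer: returns (advantages + values) serves as the regression target for the critic. Once flattened into b_returns, each minibatch compares the freshly predicted newvalue against b_returns[mb_inds]. With clip_vloss enabled, the new prediction is also clamped to within clip_coef of the rollout-time value b_values[mb_inds], and the larger of the two squared errors is kept per sample. A minimal sketch in plain Python, assuming flat lists in place of tensors; the function name value_loss and the numbers are hypothetical.

```python
def value_loss(newvalue, old_values, returns, clip_coef=0.2, clip_vloss=True):
    """Mean value loss over a minibatch, mirroring the snippet above."""
    total = 0.0
    for v_new, v_old, ret in zip(newvalue, old_values, returns):
        unclipped = (v_new - ret) ** 2
        if clip_vloss:
            # Limit how far the new prediction may move from the old one
            diff = max(-clip_coef, min(clip_coef, v_new - v_old))
            v_clipped = v_old + diff
            clipped = (v_clipped - ret) ** 2
            # Pessimistic choice: keep the larger error for this sample
            total += max(unclipped, clipped)
        else:
            total += unclipped
    return 0.5 * total / len(newvalue)

# Sample 1 moved by 0.5 (clamped to 0.2); sample 2 moved by -0.1 (not clamped)
newvalue = [1.0, 2.0]
old_values = [0.5, 2.1]
returns = [1.5, 2.5]
loss_clipped = value_loss(newvalue, old_values, returns, clip_coef=0.2)
loss_plain = value_loss(newvalue, old_values, returns, clip_vloss=False)
print(loss_clipped, loss_plain)
```

For the first sample, the clamped prediction 0.7 is further from the target 1.5 than the raw prediction 1.0, so the clipped error 0.64 replaces 0.25. Because of the max, the clipped loss is never smaller than the plain one.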