- How do PG algorithms maximize the objective function?
- What's the main idea behind policy gradient algorithms?
- Why does the algorithm remain unbiased when introducing a baseline in REINFORCE?
- What broader class of algorithms does REINFORCE belong to?
- How does the critic in AC methods differ from a value function that is used as a baseline in REINFORCE?
- If you had to develop an algorithm for an agent that has to learn to move, would you prefer REINFORCE or AC?
- Could you use an n-step AC algorithm as a REINFORCE algorithm?





















































