Estimating Returns Refresher

This video continues the discussion of multi-agent problems, focusing on actor-critic methods, which sit between policy-based and value-based methods. Policy-based methods approximate a policy directly: the agent takes in observations and outputs actions, which can be either continuous or discrete. Value-based methods, on the other hand, approximate a value function that estimates the expected return from a given state or state-action pair.
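
The sketch below is a minimal illustration of these two approximators living side by side in one agent. It assumes PyTorch, a discrete action space, and a flat observation vector; the class name and sizes are hypothetical, not taken from the video.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Actor: observation -> action logits (the policy-based half).
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )
        # Critic: observation -> scalar state-value estimate (the value-based half).
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor):
        dist = Categorical(logits=self.actor(obs))  # distribution over actions
        value = self.critic(obs).squeeze(-1)        # V(s) estimate
        return dist, value

# Usage: sample an action and read off the critic's value for one observation.
model = ActorCritic(obs_dim=4, n_actions=2)
dist, value = model(torch.randn(4))
action = dist.sample()
```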

Actor-critic methods use both approximations to manage the trade-off between bias and variance. In machine learning, a biased estimator consistently over- or underestimates the target value, while variance measures how much the estimator's outputs fluctuate from sample to sample. In reinforcement learning, bias and variance show up in the return estimate, which can be computed using Monte Carlo returns or Temporal Difference (TD) returns.
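
To make the two estimators concrete, here is a minimal sketch in plain Python with made-up reward and value lists: the Monte Carlo return sums all discounted future rewards in the episode, while the one-step TD target replaces the tail of that sum with the critic's estimate of the next state.

```python
def monte_carlo_returns(rewards, gamma=0.99):
    """Unbiased but high-variance: G_t = r_t + gamma * G_{t+1}."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def td_targets(rewards, values, gamma=0.99):
    """Biased but lower-variance: r_t + gamma * V(s_{t+1}),
    bootstrapping from the critic's value estimates."""
    targets = []
    for t, r in enumerate(rewards):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0  # 0 at terminal
        targets.append(r + gamma * next_v)
    return targets

# Hypothetical episode: values are the critic's estimates V(s_0), ..., V(s_T).
rewards = [1.0, 0.0, 1.0]
values = [0.9, 0.5, 0.8, 0.0]
print(monte_carlo_returns(rewards))  # full-episode returns
print(td_targets(rewards, values))   # one-step bootstrapped targets
```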

The speaker explains the difference between Monte Carlo returns, which are unbiased but have high variance, and TD returns, which introduce bias to reduce variance and speed up training. TD returns "bootstrap": instead of waiting for the full episode, they use the value function's current estimate of the next state in place of the remaining future rewards. The speaker also mentions Generalized Advantage Estimation (GAE) as a method used in actor-critic algorithms and recommends reading the relevant paper for more insights.
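
As a rough sketch (plain Python, hypothetical inputs, following the GAE paper by Schulman et al. rather than anything shown in the video), GAE blends these two extremes: it accumulates one-step TD errors with an exponential weight lambda, where lambda = 0 recovers the one-step TD advantage and lambda = 1 approaches the Monte Carlo-style advantage.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """values has length len(rewards) + 1: the last entry is the
    bootstrap value for the state after the final step."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Recursive accumulation: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

rewards = [1.0, 0.0, 1.0]
values = [0.9, 0.5, 0.8, 0.0]  # last entry bootstraps the final state
print(gae_advantages(rewards, values))
```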
