An e-commerce company is designing a real-time bidding agent that must output a continuous bid price ($0 to $5) for every advertising impression. The state representation contains hundreds of contextual features, and millions of interactions can be logged each day, so the team plans to store experience in a replay buffer and train off-policy. They also want an update rule whose gradient estimates have lower variance than pure Monte-Carlo policy-gradient methods. Which reinforcement-learning algorithm is the most appropriate starting point for these requirements?
Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy actor-critic method designed for continuous action spaces and high-dimensional state inputs. The actor outputs a real-valued action (here, the bid price), while the critic learns an action-value estimate that replaces high-variance Monte-Carlo returns in the policy-gradient update; experience replay and target networks provide stable, sample-efficient learning, as the sketch below illustrates.
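The following minimal PyTorch sketch shows how those pieces fit together for this problem. The layer sizes, learning rates, feature dimension, and reward value are assumptions chosen for illustration only; the scenario does not specify them.

```python
# Minimal DDPG sketch (hypothetical layer sizes and hyper-parameters) for the
# bidding problem: one continuous action in [0, 5], off-policy updates from a replay buffer.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim

STATE_DIM = 200        # stand-in for "hundreds of contextual features" (assumed)
ACTION_DIM = 1         # a single continuous bid price
MAX_BID = 5.0

class Actor(nn.Module):
    """Maps a state to a deterministic bid in (0, MAX_BID)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 256), nn.ReLU(),
            nn.Linear(256, ACTION_DIM), nn.Sigmoid(),   # squash to (0, 1)
        )

    def forward(self, state):
        return self.net(state) * MAX_BID                # rescale to dollars

class Critic(nn.Module):
    """Estimates Q(state, action), the bootstrapped value that replaces Monte-Carlo returns."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

actor, critic = Actor(), Critic()
target_actor, target_critic = Actor(), Critic()
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())
actor_opt = optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = optim.Adam(critic.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=1_000_000)                 # off-policy experience store
GAMMA, TAU, BATCH = 0.99, 0.005, 64

def update():
    """One off-policy gradient step on a random mini-batch from the replay buffer."""
    if len(replay_buffer) < BATCH:
        return
    batch = random.sample(replay_buffer, BATCH)
    s, a, r, s2 = (torch.stack(x) for x in zip(*batch))

    # Critic: regress Q(s, a) toward the bootstrapped TD target.
    with torch.no_grad():
        target_q = r + GAMMA * target_critic(s2, target_actor(s2))
    critic_loss = nn.functional.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e. raise the bids the critic values highly.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft-update the target networks for stability.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - TAU).add_(TAU * p.data)

# One simulated impression: act with exploration noise, log the transition, learn.
state = torch.randn(STATE_DIM)
bid = (actor(state) + 0.1 * torch.randn(ACTION_DIM)).clamp(0.0, MAX_BID).detach()
reward = torch.tensor([0.37])                           # placeholder profit for this impression
next_state = torch.randn(STATE_DIM)
replay_buffer.append((state, bid, reward, next_state))
update()
```

In practice the buffer would be filled from the logged impression stream, and because the update samples old transitions rather than requiring fresh on-policy rollouts, the agent can keep learning from the millions of interactions recorded each day.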
Tabular Q-learning and SARSA(λ) both assume a small, discrete action set. Making the bid usable would require coarse discretization and a state-action table that collapses under the curse of dimensionality (the rough count below illustrates the blow-up), and SARSA(λ) is on-policy in any case, so it cannot reuse the logged transitions in a replay buffer.
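As a back-of-the-envelope illustration (all counts here are assumed, not taken from the scenario), even a modest discretization makes a tabular approach infeasible:

```python
# Hypothetical discretization: 50 bid levels and only 100 of the contextual
# features binned into 10 buckets each already yields an astronomically sparse Q-table.
BID_LEVELS = 50
FEATURES, BUCKETS = 100, 10
table_cells = (BUCKETS ** FEATURES) * BID_LEVELS   # 10^100 states x 50 actions
print(f"{table_cells:.1e} Q-table entries")        # -> 5.0e+101
```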
UCB1 solves a multi-armed bandit problem in which each pull produces an immediate reward and there is no state transition, so it cannot optimise long-term sequences of bids; its selection rule, shown below, depends only on per-arm statistics. Only DDPG satisfies the continuous-action, off-policy and low-variance requirements stated in the scenario.
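For context, the standard UCB1 rule ranks arms purely by each arm's running mean reward $\bar{x}_a$ and pull count $n_a$ after $t$ total pulls, with no notion of state or of future value:

$$ a_t = \arg\max_a \left( \bar{x}_a + \sqrt{\frac{2 \ln t}{n_a}} \right) $$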