Reinforcement Learning Volley

Simulation Architecture

The game runs in a physics simulation at a fixed timestep. On each step, the server advances the world (players and ball), queries the AI policies for actions, and sends the updated state to your browser.
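The step loop described above can be sketched with the standard fixed-timestep accumulator pattern. This is a minimal illustration, not the project's actual code: the `world`, `policy`, and `client` interfaces (`observe`, `step`, `snapshot`, `act`, `send`) are hypothetical names chosen for the example.

```python
import time

DT = 1.0 / 60.0  # fixed physics timestep (60 Hz assumed for illustration)

class GameServer:
    """Minimal fixed-timestep loop: step physics, poll AI actions, broadcast."""

    def __init__(self, world, policy, clients):
        self.world = world      # players + ball
        self.policy = policy    # maps observations to actions
        self.clients = clients  # connected browsers

    def run(self, frames):
        accumulator = 0.0
        previous = time.perf_counter()
        for _ in range(frames):
            now = time.perf_counter()
            accumulator += now - previous
            previous = now
            # Consume elapsed time in fixed increments so the physics
            # stays deterministic regardless of frame rate.
            while accumulator >= DT:
                actions = self.policy.act(self.world.observe())
                self.world.step(actions, DT)
                accumulator -= DT
            self.broadcast()

    def broadcast(self):
        state = self.world.snapshot()
        for client in self.clients:
            client.send(state)
```

Decoupling the render/broadcast rate from the physics rate this way keeps the simulation identical between training and play.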

A layered reward shaping scheme provides spatial guidance, penalizes inefficient jumps, and enforces touch-limit discipline, so the agent receives a learning signal long before a rally ends.
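The three shaping layers can be composed as a single reward function. The weights, terms, and argument names below are hypothetical, chosen only to show how dense spatial guidance, a jump cost, and a touch-limit penalty stack on top of each other.

```python
import math

def shaped_reward(ball_x, player_x, jumped, touches,
                  max_touches=3, w_pos=0.1, jump_cost=0.05, touch_penalty=1.0):
    """Illustrative layered shaping; all weights are assumed values."""
    # Layer 1, spatial guidance: dense reward for staying under the ball,
    # decaying smoothly with horizontal distance.
    r = w_pos * math.exp(-abs(ball_x - player_x))
    # Layer 2, efficiency: small cost for jumping, so the policy only
    # jumps when the expected return justifies it.
    if jumped:
        r -= jump_cost
    # Layer 3, touch-limit discipline: large penalty once the side
    # exceeds its allowed touches.
    if touches > max_touches:
        r -= touch_penalty
    return r
```

Because the spatial term is dense while the terminal win/loss signal is sparse, the gradient has something to follow on every step of a rally.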

Optimization Framework

Training uses Stable Baselines3 and Proximal Policy Optimization (PPO) with large vectorized rollouts. Each mini-era alternates optimization between the two players: one policy stays frozen as a fixed opponent while the other performs thousands of PPO clipped-surrogate updates, each constrained by a monitored KL-divergence threshold that prevents overly large policy shifts. The roles then swap, so the scheme can be read as a block coordinate descent style of self-play.
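The alternation schedule can be sketched independently of the learning library. In this sketch, `train_step` (one clipped-surrogate update of the learner against a frozen opponent) and `approx_kl` (the mean KL of the latest update) are assumed helpers standing in for the Stable Baselines3 machinery; the `target_kl` early stop mirrors the KL monitoring described above.

```python
def alternate_self_play(policies, train_step, approx_kl,
                        eras, updates_per_era, target_kl=0.03):
    """Block-coordinate self-play over two policies (hypothetical helpers).

    train_step(learner, opponent): one PPO clipped-surrogate update,
        with `opponent` held frozen.
    approx_kl(learner): mean KL divergence of the most recent update.
    """
    learner, opponent = 0, 1
    for _ in range(eras):
        for _ in range(updates_per_era):
            train_step(policies[learner], policies[opponent])
            # Monitored KL threshold: cut the era short if the policy
            # has already shifted too far from where it started.
            if approx_kl(policies[learner]) > target_kl:
                break
        learner, opponent = opponent, learner  # swap roles for the next era
    return policies
```

Holding one block of parameters fixed while optimizing the other, then swapping, is exactly the block coordinate descent structure the text refers to.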

Stability comes from Generalized Advantage Estimation (GAE), rotating opponent checkpoints, and scheduled evaluation matches that catch regressions early. In practice, this behaves like a KL-regularized variant of fictitious play: the agent learns against a moving but controlled opponent distribution, resulting in more consistent rally building, timely net pressure, and dependable defensive resets.
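The opponent-rotation and scheduled-evaluation pieces can be sketched as a small checkpoint pool. The class and function names here are illustrative, not the project's API: old snapshots rotate out of a bounded pool, and periodic matches against sampled opponents give an early warning signal when the win rate drops.

```python
import random
from collections import deque

class OpponentPool:
    """Bounded, rotating pool of frozen opponent checkpoints.

    Sampling from recent snapshots rather than only the latest one
    smooths the opponent distribution, in the spirit of fictitious play.
    """

    def __init__(self, max_size=5):
        self.pool = deque(maxlen=max_size)  # oldest checkpoints rotate out

    def add(self, checkpoint):
        self.pool.append(checkpoint)

    def sample(self, rng=random):
        return rng.choice(list(self.pool))

def scheduled_eval(learner, pool, play_match, n_matches=10):
    """Evaluation matches against pooled opponents; `play_match` is an
    assumed helper returning 1 on a learner win, 0 otherwise. A falling
    win rate across scheduled evaluations flags a regression early."""
    wins = sum(play_match(learner, pool.sample()) for _ in range(n_matches))
    return wins / n_matches
```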