Flappybird PPO
Implementation of the PPO reinforcement learning algorithm for Flappy Bird
Technical Details
- Framework: PyTorch
- Environment: Flappybird
- Category: Other

This directory contains an implementation of the Proximal Policy Optimization (PPO) algorithm applied to the Flappy Bird environment.
Overview
This project demonstrates how to train an agent to play Flappy Bird using the PPO algorithm, a state-of-the-art policy gradient method in reinforcement learning. The implementation leverages the flappy_bird_gymnasium environment, which provides a Gym-compatible interface for the classic Flappy Bird game.
Environment
Flappy Bird is a side-scrolling game where the player controls a bird, attempting to fly between columns of green pipes without hitting them. The game mechanics are simple:
- The bird automatically moves forward
- The player can make the bird “flap” to move upward
- Gravity pulls the bird downward
- The goal is to navigate through as many pipes as possible
State Space: The observation space consists of game state information, including:
- Bird’s position and velocity
- Positions of the upcoming pipes
- Distances between the bird and pipe openings
Action Space: The action space is discrete with two possible actions:
- 0: Do nothing (let the bird fall)
- 1: Flap (make the bird move upward)
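As a quick illustration of this interface, here is a minimal random-action rollout sketch. It assumes the flappy_bird_gymnasium package registers the FlappyBird-v0 environment on import (consistent with the env_id used in the configuration below); the actual project code may set up the environment differently.

```python
import gymnasium as gym
import flappy_bird_gymnasium  # importing registers the FlappyBird environments (assumption)

# Minimal random-action rollout illustrating the observation/action interface.
env = gym.make("FlappyBird-v0")
obs, info = env.reset(seed=0)
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # 0 = do nothing, 1 = flap
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
env.close()
print(f"Episode return: {total_reward:.1f}")
```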
Implementation
The implementation uses a PPO agent with:
- Actor-Critic Architecture: Separate networks for policy (actor) and value function (critic)
- Clipped Surrogate Objective: Prevents excessive policy updates
- Entropy Bonus: Encourages exploration
- Generalized Advantage Estimation (GAE): For variance reduction in policy gradient estimation
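The snippet below is a condensed PyTorch sketch of how these pieces typically fit together: separate actor and critic networks, and a loss combining the clipped surrogate objective, a value-regression term, and an entropy bonus. It is not the repository's exact code; in particular, the value-loss coefficient (value_coeff) and the hidden layer sizes are assumptions not listed in the configuration.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Actor(nn.Module):
    """Policy network: maps observations to a categorical action distribution."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )
    def forward(self, obs):
        return Categorical(logits=self.net(obs))

class Critic(nn.Module):
    """Value network: maps observations to a scalar state-value estimate."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )
    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def ppo_loss(actor, critic, obs, actions, old_log_probs, advantages, returns,
             clip_value=0.2, entropy_coeff=0.01, value_coeff=0.5):
    """Clipped surrogate objective with entropy bonus and value loss."""
    dist = actor(obs)
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)            # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_value, 1 + clip_value) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()     # clipped surrogate
    value_loss = (critic(obs) - returns).pow(2).mean()      # value regression
    entropy = dist.entropy().mean()                          # exploration bonus
    return policy_loss + value_coeff * value_loss - entropy_coeff * entropy
```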
Configuration
The implementation uses a Config class with the following key parameters:
- exp_name: "PPO-Flappy" - Name of the experiment
- env_id: "FlappyBird-v0" - Environment ID
- episodes: 10000 - Number of training episodes
- lr: 3e-4 - Learning rate
- gamma: 0.99 - Discount factor
- clip_value: 0.2 - PPO clipping parameter
- PPO_EPOCHS: 4 - Number of optimization epochs per batch
- ENTROPY_COEFF: 0.01 - Coefficient for entropy bonus
- max_steps: 512 - Maximum steps per episode
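For reference, the parameters above could be collected in a dataclass roughly like the following; the actual Config class in the repository may be structured differently (for example, as a plain class or an argparse namespace).

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Field names and values mirror the documented parameters above.
    exp_name: str = "PPO-Flappy"
    env_id: str = "FlappyBird-v0"
    episodes: int = 10_000
    lr: float = 3e-4
    gamma: float = 0.99
    clip_value: float = 0.2
    PPO_EPOCHS: int = 4
    ENTROPY_COEFF: float = 0.01
    max_steps: int = 512

config = Config()
```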
Training Process
The agent is trained through an iterative process:
- Interaction with Environment: The agent collects trajectories by playing the game
- Advantage Calculation: Compute advantages using Generalized Advantage Estimation
- Policy Update: Update policy and value function using the PPO objective
- Repeat: Continue training until the agent achieves satisfactory performance
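The advantage-calculation step can be sketched as follows. The GAE smoothing parameter gae_lambda is an assumed typical value (0.95), since it is not listed in the configuration; the inputs are the per-step rewards, value estimates, and done flags collected during the rollout.

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over one collected trajectory.

    rewards, values, dones: 1-D tensors of length T from the rollout;
    last_value: bootstrap value estimate for the state after the last step.
    """
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]  # no bootstrapping past episode end
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * gae_lambda * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values  # targets for the value function
    return advantages, returns
```

In the outer loop, these advantages (commonly normalized to zero mean and unit variance) and returns feed the PPO objective, which is optimized for PPO_EPOCHS passes over each collected batch before new trajectories are gathered.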
Results
The agent successfully learns to play Flappy Bird, navigating through pipes with increasing proficiency as training progresses. A video of the trained agent’s performance is included (final_FlappyBird-v0.mp4).
Source Code
📁 GitHub Repository: Flappybird PPO
View the complete implementation, training scripts, and documentation on GitHub.