Reinforcement Learning

Abstract:

On this page, we will cover reinforcement learning from A to Z. First, we explore the deep connection between classical mechanics and modern AI's reinforcement learning. We trace the evolution from Newton's laws, originally formulated to describe motion, to the Hamilton-Jacobi formalism, which describes the optimal paths followed in nature. Richard Bellman later extended this formalism, leading to the Hamilton-Jacobi-Bellman equation, a cornerstone of reinforcement learning. By discretizing this equation and introducing probabilistic elements, AI systems can learn optimal strategies. This forms the basis of some of the most advanced AI models, such as DeepSeek-R1, AlphaGo, and ChatGPT.

Next, we cover different methods of implementing the Bellman equation and the pros and cons of each: 1. Dynamic Programming, 2. Monte Carlo Methods, and 3. Temporal-Difference Learning. We also provide code implementations for each method by solving hands-on examples.


From Newton's Laws to Reinforcement Learning

The foundations of classical mechanics, originally formulated to describe physical motion, have surprisingly found their way into the optimization techniques that power modern AI systems. Newton's laws evolved through various mathematical formalisms, including the Hamiltonian and Hamilton-Jacobi frameworks, which describe how systems evolve along optimal paths. This same principle of optimization is central to reinforcement learning, the method behind AI systems like DeepMind's AlphaGo and the DeepSeek-R1 large language model.

In classical mechanics, the Lagrangian is defined as the difference between kinetic and potential energy:

\[ L = T - U \]

From the Lagrangian, the action is obtained by integrating over time:

\[ S(q, t) = \int dt \, L(q, \dot{q}) \]

Nature follows a principle of least action, meaning that physical systems evolve along the path that minimizes the accumulated cost. In reinforcement learning, this concept translates into minimizing the cost of a sequence of decisions.
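
In the discrete-time setting of reinforcement learning, the analogue of this accumulated quantity is the return: the discounted sum of rewards collected along a trajectory,

\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \]

where the discount factor \( \gamma \) (introduced more formally below) weighs future rewards. Maximizing the return plays the same role as minimizing the accumulated action in mechanics.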

The Hamilton-Jacobi equation, a reformulation of Newtonian mechanics, describes how the action evolves over time:

\[ \frac{\partial S}{\partial t} = L - \frac{\partial S}{\partial \vec{q}} \cdot \vec{\dot{q}} \]

By extending this framework to systems with external control inputs, Richard Bellman introduced the Hamilton-Jacobi-Bellman (HJB) equation, which finds applications in optimal control and reinforcement learning. When rewritten in terms of the value function, which represents the optimal cost-to-go, the equation takes the form:

\[ \frac{\partial V}{\partial t} = \min_{u}\left( L - \frac{\partial V}{\partial \vec{q}} \cdot \vec{\dot{q}} \right) \]

Discretizing time, applying a Taylor expansion, and defining the reward \(R\) as the negative of the Lagrangian transform this equation into a recursive form, known as the Bellman equation, which is fundamental to reinforcement learning:

\[ V(s_t) = \max_{\pi} \mathbb{E} \Big[ R + \gamma V^\pi(s_{t+1}) \Big] \]

Here, \( \pi \) represents the agent's policy, which determines actions based on the current state, while \( \gamma \) is a discount factor that weighs future rewards. The agent interacts with the environment, choosing actions to maximize long-term rewards. By iteratively updating the value function using the Bellman equation, reinforcement learning algorithms, such as value iteration, enable AI systems to make optimal decisions in complex environments.
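
To make the discretization step explicit (a sketch following the sign conventions above, not a rigorous derivation): over a small time step \( \Delta t \), the cost-to-go satisfies

\[ V_{\text{cost}}(s_t) = \min_{u} \Big[ L \, \Delta t + V_{\text{cost}}(s_{t+1}) \Big]. \]

Defining the reward as \( R = -L \, \Delta t \) and flipping the sign of the value function, so that it measures accumulated reward rather than accumulated cost, turns the minimization into a maximization,

\[ V(s_t) = \max_{u} \Big[ R + V(s_{t+1}) \Big], \]

and introducing stochastic transitions, a policy \( \pi \), and the discount factor \( \gamma \) recovers the Bellman equation above.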

Let's now write out the expectation value in the Bellman equation explicitly:

\[ V(s) = \max_a \sum_{s'} P(s' | s, a) (R(s', s, a) + \gamma V(s')) \]

where \( P(s' | s, a) \) is the transition probability: the probability of transitioning to state \(s'\) given that the agent is in state \(s\) and takes action \(a\). This function models the dynamics of the environment, dictating how likely it is to move from one state to another under a given action.
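
To make this update concrete, here is a minimal, self-contained sketch of value iteration on a tiny, made-up MDP. The three states, two actions, transition probabilities, and rewards below are purely illustrative and are not tied to any of the examples on this page:

```python
import numpy as np

# A tiny, made-up MDP with 3 states and 2 actions.
# P[s][a] is a list of (probability, next_state, reward) triples,
# i.e. an explicit tabulation of P(s'|s,a) and R(s',s,a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.9, 2, 2.0), (0.1, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # absorbing state
}

gamma = 0.9            # discount factor
V = np.zeros(3)        # arbitrary initial values

# Repeatedly apply the Bellman optimality update until the values stop changing.
for _ in range(1000):
    V_new = np.zeros_like(V)
    for s in P:
        # For each action: sum over s' of P(s'|s,a) * (R(s',s,a) + gamma * V(s'))
        action_values = [
            sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
            for a in P[s]
        ]
        V_new[s] = max(action_values)        # max over actions
    if np.max(np.abs(V_new - V)) < 1e-8:     # converged
        V = V_new
        break
    V = V_new

print("Converged state values:", V)
```

The same loop, with the transition table replaced by the rules of a specific environment, is what the examples below implement.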

Reinforcement Learning Mini-Course

This mini-course is given by Dr. Chris G. Willcocks at the Department of Computer Science, Durham University. Here is the link to the original location.

My approach to learning this mini-course would be as follows: First, I'd watch all the videos over one or two days without pausing—just listening carefully. I'd take a half-hour break between every two videos, aiming to absorb at least 20% of the content in this initial pass. Next, I'd rewatch each video daily, focusing on understanding every detail. I'd ask questions—whether from ChatGPT or anyone else who can help—and go through the provided code in the description of each video, ensuring I grasp its workings. My goal would be to master the entire material, including the code, within two weeks. Afterward, I'd fill the gaps by working on projects and gaining hands-on experience.

Tic-Tac-Toe using Dynamic Programming

Find Complete Code in Google Colab Using This Link

This Python implementation demonstrates how to use value iteration,

\[ V(s) = \max_a \sum_{s'} P(s' | s, a) (R(s', s, a) + \gamma V(s')) \]

a fundamental reinforcement learning algorithm, to train an agent to play the board game Tic-Tac-Toe. The goal is to determine the optimal moves for the X player by learning state values using a Dynamic Programming (DP) approach.

In Tic-Tac-Toe, each action (placing X or O) results in exactly one new state. There is no randomness—if you place an X in an empty space, you know exactly what the board will look like next. That means the transition probability \(P(s' | s, a) = 1\). In a game like poker, betting (an action) may lead to various outcomes depending on the opponent's response, which means the transition is no longer deterministic.

Overview of the Code

The general idea is to enumerate all possible board configurations (states) and give each of them an arbitrary initial value (0). We then use the value iteration equation above to update the value of each configuration until the process converges; this is the training part. To predict the next best move, we use the trained values: given the current configuration, we take the move that leads to the state with the highest value \(V\). A minimal sketch of this loop follows the list of steps below.

The implementation follows these key steps:

  1. Generate all possible Tic-Tac-Toe board states: The generate_all_states() function creates all valid board configurations while ensuring they follow the game's rules.
  2. Check board validity and winning conditions: Functions like is_valid() and check_winner() ensure that a board state is legal and detect winning positions.
  3. Compute rewards: The get_reward() function assigns rewards for terminal states: +1 for a win, -1 for a loss, and 0 for a draw.
  4. Generate next possible states: The get_next_states() function returns all valid board configurations resulting from the next move.
  5. Value Iteration Algorithm: The value_iteration() function implements the equation above, iteratively updating the value function for all states using Bellman's equation to approximate the optimal state values (see the sketch after this list).
  6. Predicting the best move: Once trained, the agent can use get_best_action() to determine the best move for X in a given board state.
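
Below is a minimal, self-contained sketch of the approach just described. The helper implementations are illustrative stand-ins written for this page, not the code from the linked notebook (the names follow the steps above, but the exact signatures there may differ), and is_terminal() is an extra helper introduced here for clarity:

```python
from itertools import product

# The eight winning lines of the 3x3 board (indices 0..8, row-major).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def check_winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def is_valid(board):
    """Counts must match alternating play; any winner must be the player who just moved."""
    x, o = board.count('X'), board.count('O')
    if x not in (o, o + 1):
        return False
    winner = check_winner(board)
    if winner == 'X' and x != o + 1:
        return False
    if winner == 'O' and x != o:
        return False
    return True

def generate_all_states():
    """All legal board configurations, each a 9-tuple of 'X', 'O' or ' '."""
    return [b for b in product(' XO', repeat=9) if is_valid(b)]

def is_terminal(board):
    return check_winner(board) is not None or ' ' not in board

def get_reward(board):
    """+1 if X has won, -1 if O has won, 0 otherwise (draw or non-terminal)."""
    return {'X': 1, 'O': -1, None: 0}[check_winner(board)]

def get_next_states(board):
    """All boards reachable with one legal move by the player whose turn it is."""
    player = 'X' if board.count('X') == board.count('O') else 'O'
    return [board[:i] + (player,) + board[i + 1:]
            for i, cell in enumerate(board) if cell == ' ']

def value_iteration(states, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if is_terminal(s):
                new_v = 0.0   # no moves remain; the reward was collected on entering s
            else:
                # Transitions are deterministic, P(s'|s,a) = 1, so the sum over s'
                # collapses and the update is a max over the reachable next states.
                new_v = max(get_reward(s2) + gamma * V[s2] for s2 in get_next_states(s))
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:     # values have stopped changing
            return V

def get_best_action(board, V, gamma=0.9):
    """Pick the successor board with the highest learned value."""
    return max(get_next_states(board), key=lambda s2: get_reward(s2) + gamma * V[s2])

V = value_iteration(generate_all_states())
print(get_best_action(tuple(' ' * 9), V))   # best opening move for X
```

Note that this literal application of the max-update takes the max on O's turns as well, which implicitly assumes O plays into X's hands; modelling O's optimal play is one of the suggested extensions below.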

Build Your Online Presence:

Post Any of These Extensions on Your GitHub

  • Extending the model to allow for O's optimal play as well.
  • Implementing policy iteration to explicitly model action choices.
  • Using Reinforcement Learning techniques like Q-learning for real-time learning without predefined states.
  • Redefine the agent to play other simple board games.
  • Explain why this implementation does not work for chess.
  • Explore different gamma values and explain the effects of lower gamma values.
  • Come up with faster algorithms for this or other methods.

Frozen Lake Game using Monte Carlo

Find Complete Code in Google Colab Using This Link

This Python implementation demonstrates a Monte Carlo reinforcement learning (RL) approach for solving the Frozen Lake environment from OpenAI Gym.

Overview of the Code

The general idea is to train an agent to navigate the Frozen Lake grid world using the Monte Carlo method of reinforcement learning. The agent learns an optimal policy through episodic updates, using an epsilon-greedy strategy to balance exploration and exploitation. A minimal sketch of this loop follows the list of steps below.

The implementation follows these key steps:

  • Loads the FrozenLake-v1 environment from OpenAI Gym.
  • Uses a Q-table to store state-action values.
  • Runs multiple episodes where the agent explores and learns from experience.
  • Applies Monte Carlo policy evaluation to update Q-values based on episode returns.
  • Implements an epsilon-greedy policy for action selection.
  • Visualizes the agent's actions after training.
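
As a companion to the steps above, here is a minimal, self-contained sketch of first-visit Monte Carlo control with an epsilon-greedy policy on FrozenLake. It is written against the Gymnasium API (the maintained fork of OpenAI Gym, where reset() returns an (observation, info) pair and step() returns five values); the hyperparameters are illustrative, and the linked notebook may differ in these details:

```python
import gymnasium as gym                      # maintained fork of OpenAI Gym
import numpy as np
from collections import defaultdict

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n

Q = np.zeros((n_states, n_actions))          # state-action value table
visit_count = defaultdict(int)               # visits per (state, action), for averaging
gamma, epsilon, n_episodes = 0.99, 0.1, 50000

for _ in range(n_episodes):
    # --- generate one episode with an epsilon-greedy policy ---
    state, _ = env.reset()
    episode = []                             # list of (state, action, reward)
    done = False
    while not done:
        if np.random.rand() < epsilon:
            action = env.action_space.sample()                # explore
        else:
            best = np.flatnonzero(Q[state] == Q[state].max())
            action = int(np.random.choice(best))              # exploit, random tie-break
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        done = terminated or truncated

    # --- first-visit Monte Carlo update ---
    # Walk backwards accumulating the return; overwriting the dict entry as we go
    # leaves, for each (state, action), the return of its FIRST visit in the episode.
    G, first_return = 0.0, {}
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        first_return[(state, action)] = G
    for (state, action), G in first_return.items():
        visit_count[(state, action)] += 1
        # incremental mean of the observed returns
        Q[state, action] += (G - Q[state, action]) / visit_count[(state, action)]

# Greedy policy extracted from the learned Q-table (0=left, 1=down, 2=right, 3=up).
print(np.argmax(Q, axis=1).reshape(4, 4))
```

Switching is_slippery to True makes the transitions stochastic, which is one of the extensions suggested below.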

Build Your Online Presence:

Post Any of These Extensions on Your GitHub

  • Implement other methods (Temporal Difference or Value Iteration) for the same environment.
  • Compare Monte Carlo and Temporal Difference (TD) methods.
  • Modify the environment to include larger grid sizes. Does the current learning approach scale well?
  • Experiment with different values of epsilon, as well as a decaying epsilon. Which strategy leads to the best balance between exploration and exploitation?
  • Apply the implemented method to other environments of OpenAI Gym.
  • Replace the Q-table with a Deep Q-Network (DQN).
  • Modify the environment to is_slippery=True, making movement stochastic.
  • Train an agent in a small FrozenLake (4x4) and transfer knowledge to a larger lake (8x8).
  • Create a custom reward system. For example, give falling into a hole a negative reward. How does the final optimal policy change?

More materials will be added in the future. Please stay tuned!

