Lessons and Concepts from Reinforcement Learning
This is where I note down whatever I have learned, along with concepts I found interesting, while learning RL. It's documented date-wise so I can log my journey of exploring the AI/ML space :)
Sources used for my learning:
- Build a Reasoning Model - Sebastian Raschka
- Hands-On Modern RL - Walking Labs
- State of Reinforcement Learning for LLM Reasoning
31/5/2026:
Why Reinforcement Learning?
Classical machine learning calls for thousands of labelled examples to be fed to a model so it can "learn". Humans do not learn to ride a bicycle by reading a book on how to do it. They learn by hit-and-trial: trying, falling, and getting back up until they eventually find their balance and ride properly.
Similarly, a whole family of machine learning models uses a hit-and-trial method. We humans define what counts as good behaviour (seek) and bad behaviour (avoid) through rewards, and let the machine work out the rest. AlphaGo is the well-known example: trained on human games and then improved through reinforcement learning by playing against itself, it defeated Lee Sedol, one of the world's top professional Go players, in 2016. This hit-and-trial method is precisely what we call Reinforcement Learning today.
For every good trait the model seeks out, it gets "rewarded". The most effective way for it to learn is not to grab the closest reward, but to find the path that accumulates the most reward overall, favouring long-term gain over short-term gain. That is what makes the learned behaviour effective.
Discount Function
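To account for the uncertainty of rewards that lie far in the future, a discount function (more commonly called the discount factor) is applied to the cumulative return. Essentially, a number between 0 and 1 (e.g. 0.97 or 0.1) is raised to a power that increases by 1 at every step, and each reward is multiplied by the result:
$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$
where $G_t$ is the cumulative return function, $r$ is the reward and $\gamma$ is the discount function.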
Because the exponent grows with every step, the further away a reward is, the more heavily it is discounted. This makes the model weigh whether going further for a reward is worth it or whether the effort is moot.
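To see the discount in action, here is a minimal sketch. The reward sequence and the helper name `discounted_return` are my own assumptions for illustration; only the two example values 0.97 and 0.1 come from the notes above.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each scaled by gamma raised to its step index."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total


# Hypothetical reward sequence: nothing for three steps, then a reward of 10.
rewards = [0.0, 0.0, 0.0, 10.0]

patient = discounted_return(rewards, gamma=0.97)   # distant reward keeps most of its value
impatient = discounted_return(rewards, gamma=0.1)  # distant reward is almost worthless

print(f"gamma=0.97 -> {patient:.4f}")
print(f"gamma=0.10 -> {impatient:.4f}")
```

With a value close to 1 the reward three steps away is still worth about 9.13, while with 0.1 it shrinks to about 0.01, so the model would barely bother going for it.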
Policies and Their Types
A function mapping a state to an action is called a Policy ($\pi$). It is essentially the brain of the model. The ideal outcome is the Optimal Policy ($\pi^*$), the one that earns the highest expected return. There are two types of policies:
- Deterministic Policies: Every state maps to exactly one fixed action. This policy is rigid and does not allow the model to explore.
- Stochastic Policies (Probabilistic): Every state maps to a probability distribution over actions, and the action taken is sampled from it. This policy increases exploration within the model.
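The difference between the two types can be sketched with plain dictionaries. The states, actions and probabilities below are made up for illustration and are not from any particular environment.

```python
import random

# Deterministic policy: each (hypothetical) state maps to exactly one action.
deterministic_policy = {"start": "right", "middle": "right"}

# Stochastic policy: each state maps to a probability for every action.
stochastic_policy = {
    "start": {"left": 0.2, "right": 0.8},
    "middle": {"left": 0.5, "right": 0.5},
}


def act_deterministic(state):
    """Always returns the same action for a given state."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Samples an action according to the state's action probabilities."""
    probs = stochastic_policy[state]
    actions = list(probs.keys())
    weights = list(probs.values())
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the run is repeatable
fixed = [act_deterministic("start") for _ in range(5)]
sampled = [act_stochastic("start", rng) for _ in range(1000)]

print(fixed)
print(sampled.count("left"), sampled.count("right"))
```

The deterministic policy never tries "left" from "start", while the stochastic one still tries it roughly a fifth of the time, which is where the exploration comes from.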
2/6/2026:
Note: The action-value function (Q-func), written $Q(s, a)$, represents the total future points expected from doing a particular action $a$ in a particular state $s$.
Markov Decision Process and Bellman Equation
One way of finding the optimal policy is to always pick the action with the maximum expected reward. Each action available in a state has a Q-value (given by the Q-func) that denotes the total points expected from performing that action.
A Markov Decision Process (MDP) assumes that the future depends only on the present state, not on the past (this is known as the Markov property). Just like how we humans take decisive actions and learn later from the consequences, the algorithm decides which action to take and later corrects itself if it picked the wrong move.
We can then use the Bellman Equation, where current Q-value = immediate reward + next step's Q-value (scaled by the discount, if one is used). This means that when the algorithm takes an action and gets rewarded 2 points, it looks at the Q-value of the best succeeding action (say 7 points) and immediately assumes that the action it just took must be worth approximately 2 + 7 = 9 points, correcting its estimate accordingly. As this goes on, the initially unknown values of each step eventually converge to their true Q-func values.
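The 2 + 7 = 9 correction can be sketched as a small update loop. The notes do not name a specific algorithm, so treat this as a Q-learning-style sketch under my own assumptions: the function names, the learning rate `alpha` and the starting guess of 0 are all hypothetical, and the discount is set to 1 so the numbers match the example.

```python
def bellman_target(reward, next_q_values, gamma=1.0):
    """Immediate reward plus the (discounted) best Q-value of the next state."""
    return reward + gamma * max(next_q_values)


def td_update(current_q, target, alpha=0.5):
    """Move the current estimate a fraction alpha of the way toward the target."""
    return current_q + alpha * (target - current_q)


# Numbers from the text: reward of 2 now, best next-step Q-value of 7.
# The second next-step value (3) is a made-up worse alternative.
target = bellman_target(reward=2.0, next_q_values=[7.0, 3.0])

q = 0.0  # the Q-value starts out unknown, so we guess 0
history = []
for _ in range(10):
    q = td_update(q, target)
    history.append(q)

print(target)
print(history)
```

Each pass closes half of the remaining gap, so the estimate climbs from 0 towards 9 instead of jumping there in one go. In a real run the target itself would keep shifting as the next step's Q-value is also being learned.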
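In summary, the algorithm will pick the action with the maximum expected reward in every state, and the yielded optimal policy will look like:
$$\pi^*(s) = \arg\max_a Q^*(s, a)$$
where $\pi^*$ is the optimal policy and $Q^*$ is the true (converged) Q-func.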
This method is exploitative and rigid. It does not allow the model to explore various paths and scenarios and just sticks to a fixed system.
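Picking the maximum-reward action in every state can be sketched in a few lines. The Q-table values here are invented for illustration; in practice they would come from the learning process described above.

```python
# Hypothetical Q-table: Q[state][action] = expected future reward.
Q = {
    "start": {"left": 1.0, "right": 9.0},
    "middle": {"left": 4.0, "right": 2.5},
}


def greedy_policy(q_table):
    """For each state, pick the action with the largest Q-value."""
    return {
        state: max(action_values, key=action_values.get)
        for state, action_values in q_table.items()
    }


policy = greedy_policy(Q)
print(policy)
```

Note how the result is a plain state-to-action lookup: once the table is fixed, the model will never try "left" from "start" again, which is exactly the rigidity mentioned above.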
Probabilistic Approach for Optimal Policy Yield
Another approach is to not look for the maximal reward at each and every action. Instead, the model goes through the actions many times (like walking all of the pathways of a maze over and over) and sees which decisions turn out good and which turn out bad. Over time, the model learns and leans towards the good actions that get rewarded, while the probability of performing bad actions is drastically reduced. Again, to have it yield maximum results at the end (not at each step), we can give the policy parameters and tune them to maximize the return:
$$J(\theta) = \mathbb{E}_{\pi_\theta}[G]$$
$$\theta^* = \arg\max_\theta J(\theta)$$
Don't stress over the math symbols; the $J(\theta)$ function essentially tells us how good the policy is.
- $\theta$ is the parameters (e.g. the weights of a neural network)
- $\pi_\theta$ is the policy with the given parameters.
- $J(\theta)$ is the expected return value on said policy, with $G$ being the cumulative (discounted) return.
The second equation is the optimization objective: it picks the parameters that maximize the total rewards.
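Here is a rough sketch of how the parameters can be nudged so that good actions become more likely. The notes do not name an algorithm, so this uses a REINFORCE-style update on a made-up two-action problem as an assumption; `theta` holds one preference per action, and the step count and learning rate `alpha` are arbitrary choices of mine.

```python
import math
import random


def softmax(theta):
    """Turn raw preferences (the parameters theta) into action probabilities."""
    biggest = max(theta)
    exps = [math.exp(t - biggest) for t in theta]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def train(true_rewards, steps=2000, alpha=0.1, seed=0):
    """Repeatedly sample an action, observe its reward, and adjust theta."""
    rng = random.Random(seed)
    theta = [0.0] * len(true_rewards)  # start with no preference
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        reward = true_rewards[action]
        # For a softmax policy, d/d theta[i] of log pi(action) is
        # (1 if i == action else 0) - probs[i].
        for i in range(len(theta)):
            indicator = 1.0 if i == action else 0.0
            theta[i] += alpha * reward * (indicator - probs[i])
    return theta


# Hypothetical problem: action 0 is "bad" (reward 0), action 1 is "good" (reward 1).
theta = train(true_rewards=[0.0, 1.0])
final_probs = softmax(theta)
print(final_probs)
```

Both actions start at 50/50. Every time the good action is sampled and rewarded, its preference goes up and the other goes down, so the policy gradually leans towards it without ever looking up a per-step maximum.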
Best of Both Worlds
Both approaches listed above have their own pros and cons.
The first method ensures that you always take the action with the maximum estimated reward, but it is rigid and cannot explore for better rewards.
The second method can explore the unknown and scour for even more valuable rewards, but its freedom means it can wander into bad actions and fall into disarray.
So what can we do about it?