Solving an optimal growth model using Q-learning
We employ Q-learning to solve a deterministic optimal growth model. We then compare our results to the value function implied by the value function iteration approach.
Optimal growth model: The basics
Consider an economy populated by identical households. The social planner chooses capital and consumption streams \(\{c_t,k_{t+1}\}_{t=0}^\infty\) to maximize lifetime utility:
\[V(k_0) = \sum_{t=0}^\infty \beta^t u(c_t)\]subject to the resource constraint \(c_t + k_{t+1} = f(k_t) + (1-\delta)k_t\), given an initial value $k_0$. As usual, we assume that the functions $u$ and $f$ are well behaved: $u'>0$, $u''<0$, $f(0)=0$, $f'>0$, and $f''\leq 0$.
The only state variable in period $t$ is $k$. The controls are $c$ and $k'$. Given the resource constraint, it is enough to choose $k'$, provided we make sure that the implied consumption $c$ is nonnegative.
For this problem, the Bellman equation takes the form:
\[\begin{equation} \label{bellman} V(k) = \max_{0\leq k' \leq f(k) + (1-\delta)k}\left\{u\left(f(k) + (1-\delta)k - k'\right) + \beta V(k')\right\} \end{equation}\]In the language of Markov decision problems (MDP):
- The action space $\mathcal{A}$ and state space $\mathcal{X}$ are identical and assumed to be a finite set of positive reals \(\mathcal{A} = \mathcal{X} = \{k_1,k_2,\ldots,k_N\}\).
- The feasible correspondence is \(\Gamma(k) = \left\{a \in \mathcal{A}: 0 \leq a \leq f(k) +(1-\delta)k\right\}\ \forall\ k\in \mathcal{X}\).
- The set of feasible state action pairs is given by \(G = \{(k,a) \in \mathcal{X}\times \mathcal{A}: 0 \leq a \leq f(k) +(1-\delta)k \}\).
- The transition kernel can be defined as \(P(k, a, k') = \mathbb{1}\{k'= a\}\).
- The reward function is implied by the instantaneous utility of consumption: $r(k,a) = u(f(k) + (1-\delta)k - a)$.
- The discount factor is $\beta \in (0,1)$.
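To make these objects concrete, here is a small sketch (our own illustration, not the code used later in the post) that builds the finite reward array $r(k,a)$. Infeasible pairs receive $-\infty$, and because the transition kernel is the indicator $\mathbb{1}\{k'=a\}$ it never needs to be stored explicitly.

```python
import numpy as np

def build_reward_matrix(grid, f, u, δ):
    """Reward array R[i, j] = r(k_i, a_j) for the finite growth MDP.

    Pairs with nonpositive consumption are infeasible and get -inf,
    so they are never selected by a maximizing planner.
    """
    N = len(grid)
    R = np.full((N, N), -np.inf)
    for i, k in enumerate(grid):
        resources = f(k) + (1 - δ) * k      # output plus undepreciated capital
        for j, a in enumerate(grid):
            c = resources - a               # implied consumption
            if c > 0:
                R[i, j] = u(c)
    return R

# Placeholder grid and functional forms, just for illustration
grid = np.linspace(0.5, 5.0, 50)
R = build_reward_matrix(grid,
                        f=lambda k: k**0.33,
                        u=lambda c: (c**(1 - 2.0) - 1) / (1 - 2.0),
                        δ=0.10)
```

In this finite setting, the Bellman operator introduced next reduces to a row-wise maximization: \((Tv)(k_i) = \max_j\left\{R_{ij} + \beta v(k_j)\right\}\).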
Given these primitives, the Bellman operator
\[\begin{aligned} (Tv)(k) &\equiv \max_{a\in \Gamma(k)}\left\{u\left(f(k) + (1-\delta)k - a\right) + \beta \sum_{k'}v(k')P(k,a,k') \right\} \\ &= \max_{0\leq k' \leq f(k) + (1-\delta)k}\left\{u\left(f(k) + (1-\delta)k - k'\right) + \beta v(k')\right\} \end{aligned}\]has the nice property that \(T^nv\rightarrow v^*\) as \(n\rightarrow \infty\) for any $v \in \mathbb{R}^{\mathcal{X}}$. Moreover, $v^*$ is the unique solution of the Bellman equation \eqref{bellman}.
More formally, it turns out that $T$ is a contraction mapping of modulus $\beta$ on $\mathbb{R}^{\mathcal{X}}$, the space of functions mapping $\mathcal{X}$ into the real line, equipped with the sup metric
$$ \lVert v - w \rVert_\infty = \max_{k\in \mathcal{X}}\left|v(k) - w(k)\right| $$Since this space is complete under the sup metric, the contraction mapping theorem implies that $T$ has a unique fixed point \(v^*\) on \(\mathbb{R}^{\mathcal{X}}\), which is also globally stable. The fact that \(v^*\) is the unique solution of the Bellman equation is not hard to check. Sargent & Stachurski (2024) discusses Markov decision problems thoroughly and presents detailed proofs of the aforementioned facts.
Our goal now is to introduce two computational techniques to approximate \(v^*\). The first technique, value function iteration (VFI), is standard and widely used. The second technique, a form of reinforcement learning called Q-learning, is relatively less known.
Value function iteration
This algorithm is a ubiquitous dynamic programming technique. Here, we turn the Bellman equation \eqref{bellman} into an updating rule, as illustrated by the Bellman operator described above.
For the deterministic optimal growth model, the steps of the VFI algorithm are as follows:
- Initialize $v_0$, an initial guess of the value function
- Input a tolerance level $\tau$, a large initial error $\varepsilon$, and a maximum number of iterations $i_{max}$
- Initialize the iteration counter \(i \leftarrow 0\)
- While \(i < i_{max}\) and \(\varepsilon > \tau\):
  - Update the estimate of the value function: \(v_{i+1} \leftarrow Tv_i\)
  - Update the error: \(\varepsilon \leftarrow \lVert v_{i+1} - v_i \rVert_{\infty}\)
  - Move to the next iteration: \(i \leftarrow i + 1\)
- Return the approximated value function \(v^*\)
By tightening the tolerance, we can get arbitrarily close to the value function \(v^*\). The algorithm is also guaranteed to terminate: as argued in the previous section, the Bellman operator $T$ is a contraction, so the iterates converge to its unique fixed point \(v^*\) and the error eventually falls below any positive tolerance.
From now on, we adopt a Cobb-Douglas functional form for $f$ with output-capital elasticity $\alpha$. For the instantaneous utility function $u$, we use a constant relative risk aversion (CRRA) specification with $\sigma \neq 1$; in the limit \(\sigma = 1\), it reduces to \(u(c) = \ln c\).
\[f(k) = k^\alpha,\quad u(c) = \frac{c^{1-\sigma}-1}{1-\sigma}\]The following Python code uses Numba to speed up the computation of \(v^*\).
```python
# --- Packages
import numpy as np
import matplotlib.pyplot as plt
from numba import jit, float64, int64, njit, prange
from numba.experimental import jitclass

#------------------------------------------------------------
# 1. DYNAMIC PROGRAMMING APPROACH
#------------------------------------------------------------

# --- Define RBC class
rbc_data = [
    ('α', float64),        # Production parameter
    ('β', float64),        # Discount factor
    ('δ', float64),        # Depreciation rate
    ('σ', float64),        # Risk aversion parameter
    ('grid', float64[:]),  # Grid (array)
    ('kstar', float64),    # Steady-state capital
    ('kmin', float64),     # Minimum grid value
    ('kmax', float64)      # Maximum grid value
]

@jitclass(rbc_data)
class RBC:
    def __init__(self,
                 α=0.33,
                 β=0.95,
                 δ=0.10,
                 σ=2.00,
                 grid_size=100):
        self.α, self.β, self.δ, self.σ = α, β, δ, σ
        # Set up grid around the steady-state capital stock
        self.kstar = (α / (1 / β - (1 - δ))) ** (1 / (1 - α))
        self.kmin = 0.25 * self.kstar
        self.kmax = 1.75 * self.kstar
        self.grid = np.linspace(self.kmin, self.kmax, grid_size)

    def f(self, k):
        "The production function"
        return k**self.α

    def u(self, c):
        "The utility function"
        if self.σ == 1:
            return np.log(c)
        else:
            return (c**(1 - self.σ) - 1) / (1 - self.σ)

# --- Value of consumption choice c given capital k
@jit
def state_action_value(i, j, v, rbc):
    '''
    Right-hand side of the Bellman equation for given state and action indices.
    * rbc is an instance of the RBC model
    * c is consumption
    * k is current capital
    * k_next is next-period capital
    * v represents a guess of the value function
    '''
    α, β, δ, σ, k_grid = rbc.α, rbc.β, rbc.δ, rbc.σ, rbc.grid
    k, k_next = k_grid[i], k_grid[j]
    c = rbc.f(k) + (1 - δ) * k - k_next
    value = -np.inf          # infeasible choices are never selected
    if c > 0:
        value = rbc.u(c) + β * v[j]
    return value

# --- Bellman operator
@jit(parallel=True)
def T(v, rbc):
    '''
    The Bellman operator.
    * rbc is an instance of the RBC model
    * v is an array representing a guess of the value function
    * v_new is the updated estimate of the value function
    '''
    α, β, δ, σ, k_grid = rbc.α, rbc.β, rbc.δ, rbc.σ, rbc.grid
    v_new = np.empty_like(v)
    for i in prange(len(k_grid)):
        x_tmp = np.array([state_action_value(i, j, v, rbc)
                          for j in np.arange(len(k_grid))])
        v_new[i] = np.max(x_tmp)
    return v_new

# --- Value function iteration
def vfi(rbc, tol=1e-5, max_iter=1e5, print_step=25):
    '''
    Value function iteration
    '''
    v = np.zeros(len(rbc.grid))
    error = tol + 1
    i = 1
    while (error > tol) and (i < max_iter):
        v_new = T(v, rbc)
        error = np.max(np.abs(v_new - v))
        if i % print_step == 0:
            print(f'Error at iteration {i} is {error}')
        v = v_new
        i += 1
    if error < tol:
        print(f'Converged in {i} iterations')
    else:
        print('Failed to converge')
    return v
```
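A minimal usage sketch (not part of the script above; the parameter values are just the class defaults):

```python
rbc = RBC()            # α=0.33, β=0.95, δ=0.10, σ=2, 100 grid points
v_star = vfi(rbc)      # approximate v* on rbc.grid

plt.plot(rbc.grid, v_star)
plt.xlabel('capital $k$')
plt.ylabel('$v^*(k)$')
plt.show()
```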
The implied value function is plotted below.
As expected, the value function is strictly concave and strictly increasing in its argument \(k\).
Q-learning
Q-learning is a reinforcement learning algorithm used to find the optimal action-selection policy for a given finite MDP and the associated optimal value. It is a type of temporal difference learning, meaning it can learn directly from the environment without needing an underlying model of the dynamics. Q-learning updates its estimates based on previously learned estimates, effectively combining Monte Carlo and dynamic programming techniques.
Given our MDP for the deterministic optimal growth model, let us introduce the notion of a Q-value. The Q-value $Q(k,k')$ is the discounted lifetime reward the social planner obtains by taking action $k'$ in state $k$ and behaving optimally thereafter. Because of the law of motion for capital, today's action $k'$ becomes tomorrow's state; Q-learning, however, does not require this in general. Formally:
\[\begin{equation} \label{qvalue} Q(k,k') = u\left(f(k) + (1-\delta)k - k'\right) + \beta \sum_{\tilde{k}}V(\tilde{k})P(k,k',\tilde{k}) \end{equation}\]Similarly to VFI, the goal of Q-learning is to compute the state-action value function \(q^*(k,k')\), which collects the Q-values of the optimal policy. By means of \eqref{bellman} and \eqref{qvalue}, it is clear that:
\[v^*(k) = \max_{0\leq k'\leq f(k) + (1-\delta)k}q^*(k,k')\]indicating that by finding the optimal $q$-values we also learn about the value function.
It turns out that the state-action value function \(q^*(k,k')\) is the unique solution of the Q-value Bellman equation:
\[Q(k,k') = u\left(f(k) + (1-\delta)k - k'\right) + \beta \sum_{\tilde{k}}\max_{a'\in \Gamma(\tilde{k})}Q(\tilde{k},a')P(k,k',\tilde{k})\]A proof of this result is sketched by Sargent & Stachurski (2024).
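Because the transition kernel is degenerate, \(P(k,k',\tilde{k}) = \mathbb{1}\{\tilde{k} = k'\}\), the sum above collapses and the Q-value Bellman equation reduces to
\[Q(k,k') = u\left(f(k) + (1-\delta)k - k'\right) + \beta \max_{a'\in \Gamma(k')}Q(k',a'),\]which is exactly the fixed point that the Q-learning updates below approximate.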
A typical Q-learning algorithm consists of a series of episodes, each comprising many learning steps. For the problem at hand, the algorithm is as follows:
- Input a small exploration probability \(\epsilon\), a tolerance level, and an initial Q-table \(q_0(k,k')\) for all \(k, k' \in \mathcal{X}\)
- In the \(i\)-th episode:
  - Choose a state \(k\) randomly
  - For each step \(t\) of the episode:
    - Choose action \(k'\) using an \(\epsilon\)-greedy rule
    - Take action \(k'\), observe the reward \(u\); the next state equals \(k'\)
    - Compute the temporal difference for the state-action pair \((k,k')\): \(\text{TD} = u + \beta \max_{a}q(k',a) - q(k,k')\)
    - Update the Q-table: \(q(k,k') \leftarrow q(k,k') + \theta_t\,\text{TD}\)
    - Update the state: \(k \leftarrow k'\)
    - Update the error \(\varepsilon \leftarrow \lVert q_{\text{new}} - q_{\text{old}} \rVert_{\infty}\) and stop the episode once \(\varepsilon\) falls below the tolerance or the maximum number of steps is reached
  - Return the Q-table
- Move to the \((i+1)\)-th episode, starting from the Q-table of the \(i\)-th episode
- Iterate until convergence or until the maximum number of episodes is reached.
Two points are of importance here:
- First, using an \(\epsilon\)-greedy rule enables the planner to try all actions in all states repeatedly and, ultimately, to learn which action yields the highest discounted lifetime reward. The social planner thus explores and learns about the environment thanks to the \(\epsilon\)-greedy strategy, also called the behavior policy. Since it is computationally infeasible to visit each state-action pair infinitely often, we instead use a large number of episodes and steps.
- Second, within each episode we use a decaying learning-rate sequence \(\{\theta_t\}\) with \(\theta_t = 1/t\), where \(t\) indexes the learning steps. This ensures that the cumulative sum of the learning rates for each state-action pair \((k, k')\) diverges, while the sum of their squares remains finite.
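For instance, with \(\theta_t = 1/t\) these two requirements hold because
\[\sum_{t=1}^{\infty}\theta_t = \sum_{t=1}^{\infty}\frac{1}{t} = \infty, \qquad \sum_{t=1}^{\infty}\theta_t^2 = \sum_{t=1}^{\infty}\frac{1}{t^2} = \frac{\pi^2}{6} < \infty.\]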
As argued by Watkins & Dayan (1992), Even-Dar & Mansour (2003), and others, these two conditions guarantee that \(\lim_{i\rightarrow \infty} q_i = q^*\), the unique solution to the Q-value Bellman equation.
Note that Q-learning uses a policy different from the target policy to generate behavior. Unlike the behavior policy, the target policy used to update the Q-table is purely greedy, i.e., it selects the action $a$ that maximizes \(q(k',a)\). For this reason, Q-learning is said to be off-policy.
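To make the distinction concrete, here is a small sketch (with hypothetical helper functions, not taken from the code below) that separates the two policies:

```python
import numpy as np

rng = np.random.default_rng(0)

def behavior_policy(qtable, state, ε):
    """ε-greedy behavior policy: generates the actions actually taken."""
    if rng.random() < ε:
        return int(rng.integers(qtable.shape[1]))   # explore
    return int(np.argmax(qtable[state, :]))         # exploit

def td_target(qtable, reward, state_next, β):
    """Greedy target policy: the update always bootstraps on max_a q(k', a)."""
    return reward + β * np.max(qtable[state_next, :])
```

The action fed to the environment comes from `behavior_policy`, while the value plugged into the temporal-difference target comes from the greedy maximum; the two need not coincide, which is precisely what "off-policy" means.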
An excellent introduction to reinforcement learning techniques, including Q-learning, is presented in Sutton & Barto (2020).
The Python code for the Q-learning approach is shown below.
```python
#------------------------------------------------------------
# 2. Q-LEARNING APPROACH
#------------------------------------------------------------

# --- Define QRBC class
qrbc_data = [
    ('α', float64),            # Production parameter
    ('β', float64),            # Discount factor
    ('δ', float64),            # Depreciation rate
    ('σ', float64),            # Risk aversion parameter
    ('grid_size', int64),      # Grid size (number of grid points)
    ('grid', float64[:]),      # Grid (array)
    ('kstar', float64),        # Steady-state capital
    ('kmin', float64),         # Minimum grid value
    ('kmax', float64),         # Maximum grid value
    ('ε', float64),            # Exploration rate
    ('δ_tol', float64),        # Tolerance for convergence
    ('N', int64),              # Number of episodes
    ('qtable', float64[:, :]), # Q-table
]

@jitclass(qrbc_data)
class QRBC:
    def __init__(self,
                 α=0.33,
                 β=0.95,
                 δ=0.10,
                 σ=2.0,
                 grid_size=10,
                 ε=0.1,
                 δ_tol=1e-5,
                 N=20000,
                 seed=None):
        self.α, self.β, self.δ, self.σ = α, β, δ, σ
        self.grid_size = grid_size
        self.ε = ε
        self.δ_tol = δ_tol
        self.N = N
        # Set up capital grid
        self.kstar = (α / (1 / β - (1 - δ))) ** (1 / (1 - α))
        self.kmin = 0.25 * self.kstar
        self.kmax = 1.75 * self.kstar
        self.grid = np.linspace(self.kmin, self.kmax, grid_size)
        # Initialize the Q-table at zero
        self.qtable = np.zeros((grid_size, grid_size))
        if seed is not None:
            np.random.seed(seed)

    def f(self, k):
        '''Define the production function'''
        return k**self.α

    def u(self, c):
        '''Define the utility function'''
        if self.σ == 1:
            return np.log(c)
        else:
            return (c**(1 - self.σ) - 1) / (1 - self.σ)

    def select_state(self):
        '''Draw the initial state of an episode uniformly at random'''
        return np.random.randint(0, len(self.grid))

    def temp_diff(self, state, action, t):
        '''
        Compute the temporal difference for a given state-action pair
        '''
        # Today's capital choice (the action) becomes tomorrow's state
        state_next = np.argmin(np.abs(self.grid - self.grid[action]))
        # Clip consumption at a tiny positive value so infeasible actions
        # receive a very low (but finite) reward
        c = max(self.f(self.grid[state]) + (1 - self.δ) * self.grid[state]
                - self.grid[action], 1e-8)
        TD = (self.u(c) + self.β * np.max(self.qtable[state_next, :])
              - self.qtable[state, action])
        return TD, state_next

    def run_one_step(self, max_times=15000):
        '''Run a single episode of at most max_times learning steps'''
        s = self.select_state()
        for t in range(1, max_times + 1):
            # Choose action with an ε-greedy rule
            if np.random.rand() < self.ε:
                action = np.random.randint(0, self.grid_size)  # Explore
            else:
                action = np.argmax(self.qtable[s, :])          # Exploit
            TD, s_next = self.temp_diff(s, action, t)
            # Update Q-table with decaying learning rate θ_t = 1/t
            qtable_new = self.qtable.copy()
            lr_t = 1 / t
            qtable_new[s, action] = self.qtable[s, action] + lr_t * TD
            # Check convergence
            error = np.abs(qtable_new - self.qtable).max()
            if error < self.δ_tol:
                break
            s, self.qtable = s_next, qtable_new
        return self.qtable

@jit
def run_episodes(qrbc):
    '''
    Run N episodes of Q-learning, reusing the Q-table of the previous episode
    '''
    N = qrbc.N
    for i in range(N):
        if i % (N // 10) == 0:
            print('Progress: EPOCHS =', i)
        new_qtable = qrbc.run_one_step()
    return new_qtable

def vf(qtable):
    '''Compute the value function from the Q-table'''
    v = np.max(qtable, axis=1)
    return v
```
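A minimal usage sketch (our own, not part of the script above; the seed and plotting choices are assumptions):

```python
qrbc = QRBC(grid_size=10, ε=0.1, N=20000, seed=123)
qtable = run_episodes(qrbc)
v_q = vf(qtable)                 # value function implied by the Q-table

rbc = RBC(grid_size=10)          # same grid bounds as QRBC
v_vfi = vfi(rbc)

plt.plot(qrbc.grid, v_q, 'o-', label='Q-learning')
plt.plot(rbc.grid, v_vfi, label='VFI')
plt.xlabel('capital $k$')
plt.legend()
plt.show()
```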
The plot below compares the performance of the VFI and Q-learning techniques. For the latter, we employ \(\epsilon\)-greedy action selection with \(\epsilon = 0.1\), while varying the number of episodes between 1,000 and 100,000.
As the number of episodes increases, the Q-learning algorithm’s results get closer to the VFI solution. With a sufficient number of episodes, the planner effectively learns the optimal Q-values through a balance of exploration and exploitation.
Log preferences
Now, suppose that the social planner derives logarithmic utility from consumption \(u(c) = \ln(c)\). In this case, the coefficient of relative risk aversion is exactly one, i.e., \(\sigma = 1\). Let us also assume that current capital fully depreciates every period, which means that \(\delta = 1\). With these simplifications, an explicit solution for the value function exists and is given by:
\[\begin{equation}\label{log_vf} V(k) = \frac{1}{1-\beta}\left[\ln (1-\alpha \beta) + \frac{\alpha \beta}{1-\alpha \beta}\ln(\alpha \beta)\right] + \frac{\alpha}{1-\alpha \beta} \ln k \end{equation}\]When the number of episodes in the Q-learning method is set to 100,000, both VFI and Q-learning yield value functions that are virtually indistinguishable from the true value function \eqref{log_vf}.
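As a check, the closed form \eqref{log_vf} can be evaluated directly on the capital grid. Here is a small sketch under the assumptions \(\sigma = 1\) and \(\delta = 1\) (the variable names are ours):

```python
α, β = 0.33, 0.95
grid = RBC(α=α, β=β, δ=1.0, σ=1.0).grid

# Constant and slope of the closed-form value function
A = (np.log(1 - α * β) + α * β / (1 - α * β) * np.log(α * β)) / (1 - β)
B = α / (1 - α * β)
v_exact = A + B * np.log(grid)   # true value function on the grid
```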
Acknowledgements
Thanks to Prof. Juan Rubio-Ramírez, Ph.D., for his dissertation workshop on reinforcement learning. This class sparked my interest in Q-learning in the first place.
Sargent & Stachurski’s blog entry on how to tackle the McCall model using Q-learning inspired a lot of the structure of the code above.
QuantEcon is truly a gem for the economics profession.