Stanford Reinforcement Learning Notes

Lecture 1 (intro)

Elements

  • environment
  • agent
  • action
  • observation and reward

history

History $h_t=(a_1,o_1,r_1,…,a_t,o_t,r_t)$

the agent chooses actions based on the history

state

a state is the information assumed to determine what happens next

function of history $s_t=f(h_t)$

The true state of the world, which is used to determine how the world generates the next observation and reward, is often hidden or unknown to the agent

assume the state used by the agent is a sufficient statistic of the history

In practice, we often assume the most recent observation is a sufficient statistic of the history: $s_t=o_t$

state representation has big implications for:

  • computational complexity
  • data required
  • resulting performance

Markov

a state $s_t$ is Markov if and only if: $$ p(s_{t+1}|s_{t},a_{t})=p(s_{t+1}|h_t,a_{t}) $$ i.e., the future is independent of the past given the present

setting the state to be the full history is always Markov: $s_t=h_t$

RL Algorithm Components

an RL algorithm often includes one or more of:

  • Model

    representation of how the world changes in response to agent’s action

  • Policy

    function mapping the agent's states to actions

  • Value function

    expected future rewards from being in a state and/or taking an action when following a particular policy

Model

mathematical models of dynamics and reward, including:

The transition/dynamics model predicts the next agent state: $$ p(s_{t+1}=s'|s_t=s,a_t=a) $$ The reward model predicts the immediate reward: $$ r(s_t=s,a_t=a)=E[r_t|s_t=s,a_t=a] $$

Policy

policy $\pi:S\to A$, mapping from states to actions, determines how the agent chooses actions

  • deterministic: $\pi(s)=a$
  • stochastic: $\pi(a|s)=Pr(a_t=a|s_t=s)$

Value

value function $V^\pi$ expected discounted sum of future rewards under a particular policy $\pi$ $$ V^\pi(s_t=s)=E_\pi[r_t+\gamma r_{t+1}+\gamma^2r_{t+2}+…|s_t=s] $$ discount factor $\gamma$ weighs immediate vs future rewards

can be used to quantify goodness/badness of states and actions and decide how to act by comparing policies

Types of RL agents

  • Model-based
    • Explicit: Model
    • May or may not have a policy and/or value function
  • Model-free
    • Explicit: Value function and/or policy function
    • No model


Lecture 2 (MDP)

Markov Process

  • $S$ is a (finite) set of states ($s\in S$)

  • $P$ is a dynamics/transition model that specifies $P(s_{t+1}=s'|s_t=s)$

  • no reward, no actions

  • if finite number ($N$) of states, can express $P$ as a matrix $$ P=\begin{pmatrix} P(s_1|s_1) & P(s_2|s_1) & \dots & P(s_N|s_1)\\ P(s_1|s_2) & P(s_2|s_2) & \dots & P(s_N|s_2)\\ \vdots & \vdots & \ddots & \vdots\\ P(s_1|s_N) & P(s_2|s_N) & \dots & P(s_N|s_N) \end{pmatrix} $$

A Markov chain is the discrete-time version of a Markov process.

Markov Reward Process (MRP)

  • $S$ is a (finite) set of states ($s\in S$)
  • $P$ is a dynamics/transition model that specifies $P(s_{t+1}=s'|s_t=s)$
  • $R$ is a reward function $R(s_t=s)=E[r_t|s_t=s]$
  • Discount factor $\gamma\in [0,1]$
  • no actions
  • if finite number ($N$) of states, can express $R$ as a vector

horizon

  • number of time steps in each episode
  • can be infinite
  • otherwise it is called a finite Markov reward process

Return $G_t$ (for an MRP)

discounted sum of rewards from time step $t$ to horizon $$ G_t=r_t+\gamma r_{t+1}+\gamma^2r_{t+2}+… $$

State Value Function, $V(s)$ (for an MRP)

  • expected return from starting in state $s$

$$ \begin{split} V(s) & = E[G_t|s_t=s] \\ & = E[r_t+\gamma r_{t+1}+\gamma^2r_{t+2}+\dots|s_t=s] \end{split} $$

  • MRP value function satisfies $$ V(s)=R(s)+\gamma\sum_{s'\in S}P(s'|s)V(s') $$ where $R(s)$ is the immediate reward and the second term is the discounted sum of future rewards

    in matrix form: $$ V=R+\gamma PV $$ which gives the closed-form solution (see the sketch below): $$ V=(I-\gamma P)^{-1}R $$
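
A quick numerical sketch of this closed-form solution (assuming a small MRP given as a NumPy transition matrix `P` and reward vector `R`; the two-state example is made up):

```python
import numpy as np

def mrp_value(P, R, gamma):
    """Analytic MRP value function: V = (I - gamma * P)^{-1} R."""
    N = P.shape[0]
    # solve (I - gamma * P) V = R instead of forming the inverse explicitly
    return np.linalg.solve(np.eye(N) - gamma * P, R)

# hypothetical 2-state MRP
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([1.0, 0.0])
print(mrp_value(P, R, gamma=0.9))
```

Solving the linear system costs roughly $O(N^3)$, which is why iterative methods are preferred for large state spaces.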

Markov Decision Process (MDP)

  • $S$ is a (finite) set of states ($s\in S$)
  • $A$ is a (finite) set of actions $a\in A$
  • $P$ is a dynamics/transition model that specifies $P(s_{t+1}=s'|s_t=s,a_t=a)$
  • $R$ is a reward function $R(s_t=s,a_t=a)=E[r_t|s_t=s,a_t=a]$
  • Discount factor $\gamma\in [0,1]$

MDP is a tuple: $(S,A,P,R,\gamma)$

Policy

  • policy specifies what action to take in each state

    can be deterministic or stochastic

  • for generality, consider it as a conditional distribution

    given a state, specifies a distribution over actions

  • Policy: $\pi(a|s)=P(a_t=a|s_t=s)$

MDP+$\pi(a|s)$=Markov Reward Process

precisely, it is the MRP $(S,R^\pi,P^\pi,\gamma)$, where $$ R^\pi(s)=\sum_{a\in A}\pi(a|s)R(s,a) \newline P^\pi(s'|s)=\sum_{a\in A}\pi(a|s)P(s'|s,a) $$ this implies we can use the same techniques to evaluate the value of a policy for an MDP as we use to compute the value of an MRP, by defining an MRP with $R^{\pi}$ and $P^{\pi}$

control

compute the optimal policy $$ \pi^*(s)=\arg\max_{\pi}V^{\pi}(s) $$ there exists a unique optimal value function

but the optimal policy is not unique

the optimal policy for an MDP in an infinite-horizon problem (the agent acts forever) is:

  • deterministic
  • stationary (does not depend on time step)
  • not necessarily unique

Policy Iteration

computes optimal value and policy

  • set $i= 0$
  • initialize $\pi_0(s)$ randomly for all states $s$
  • while $i==0$ or $||\pi_i-\pi_{i-1}||_1>0$ (L1-norm, measures if the policy changed for any state):
    • $V^{\pi_i}\leftarrow$ MDP V function policy evaluation of $\pi_i$
    • $\pi_{i+1}\leftarrow $ Policy improvement
    • $i=i+1$
State-Action Value Q

State-action value of a policy: $$ Q^{\pi}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)V^{\pi}(s') $$ take action $a$, then follow the policy $\pi$ afterwards

policy improvement
  • compute state-action value of a policy $\pi_i$

    • for $s$ in $S$ and $a$ in $A$: $$ Q^{\pi_i}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)V^{\pi_i}(s') $$
  • compute the new policy $\pi_{i+1}$ for all $s\in S$ (see the sketch below) $$ \pi_{i+1}(s)=\arg\max_a Q^{\pi_i}(s,a),\quad\forall s\in S $$
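
Putting evaluation and improvement together, here is a minimal tabular policy-iteration sketch. It assumes a known model stored as a NumPy array `P` of shape `(S, A, S)` and rewards `R` of shape `(S, A)`; these names and shapes are illustrative choices, not anything fixed by the lecture.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma):
    """Exact evaluation of a deterministic policy pi via V = (I - gamma P_pi)^{-1} R_pi."""
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]          # (S, S) dynamics under pi
    R_pi = R[np.arange(S), pi]          # (S,)  rewards under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def policy_iteration(P, R, gamma):
    S, A = R.shape
    pi = np.zeros(S, dtype=int)          # arbitrary initial deterministic policy
    while True:
        V = policy_evaluation(P, R, pi, gamma)
        Q = R + gamma * P @ V            # (S, A) state-action values Q^{pi_i}
        new_pi = Q.argmax(axis=1)        # greedy policy improvement
        if np.array_equal(new_pi, pi):   # policy unchanged -> converged
            return V, pi
        pi = new_pi
```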

definition

$$ V^{\pi_1}\ge V^{\pi_2}:V^{\pi_1}(s)\ge V^{\pi_2}(s),\forall s\in S $$

Value Iteration

maintain the optimal value of starting in state $s$ if there are a finite number of steps $k$ left in the episode

iterate to consider longer and longer episodes

  • set $k=1$

  • initialize $V_0(s)=0$ all states $s$

  • loop until finite horizon or convergence:

    • for each state $s$ $$ V_{k+1}(s)=\max_a\left[R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)V_k(s')\right] $$

    • view as a Bellman backup on the value function $$ V_{k+1}=BV_k \newline \pi_{k+1}(s)=\arg\max_a\left[R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)V_{k+1}(s')\right] $$
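
A matching sketch of value iteration under the same assumed `P`, `R` arrays, with the Bellman backup written out explicitly:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V_k(s') ]."""
    S, A = R.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V        # Bellman backup for every (s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    pi = Q.argmax(axis=1)            # greedy policy w.r.t. the converged values
    return V_new, pi
```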

Bellman

value function of a policy must satisfy the Bellman equation $$ V^{\pi}(s)=R^{\pi}(s)+\gamma\sum_{s'\in S}P^{\pi}(s'|s)V^{\pi}(s') $$ Bellman backup operator:

  • applied to a value function
  • returns a new value function
  • improves the value if possible

$$ BV(s)=\max_a\left[R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)V(s')\right] $$

  • $BV$ yields a value function over all states $s$

Policy iteration as Bellman Operations

Bellman backup operator $B^{\pi}$ for a particular policy is defined as $$ B^{\pi}V(s)=R^{\pi}(s)+\gamma\sum_{s'\in S}P^{\pi}(s'|s)V(s') $$ policy evaluation amounts to computing the fixed point of $B^{\pi}$

to do policy evaluation, repeatedly apply the operator until $V$ stops changing $$ V^{\pi}=B^{\pi}\dots B^{\pi}V $$

Lecture 3 (model-free evaluation)

Dynamic programming for policy evaluation

given the dynamics model $P$ and reward model $r$, i.e. the model is known

  • initialize $V_0(s)=0$ all states $s$

  • for $k=1$ until convergence

    • for all $s$ in $S$

      $V_{k}^{\pi}(s)=r(s,\pi(s))+\gamma\sum_{s'\in S}P(s'|s,\pi(s))V^{\pi}_{k-1}(s')$

$V_{k}^{\pi}(s)$ is the exact $k$-horizon value of state $s$ under policy $\pi$, and an estimate of the infinite-horizon value of state $s$ under policy $\pi$: $$ V^{\pi}(s)=E_{\pi}[G_t|s_t=s]\approx E_{\pi}[r_t+\gamma V_{k-1}(s_{t+1})|s_t=s] $$
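
A sketch of this iterative evaluation for a deterministic policy, again assuming the model is stored as `P` of shape `(S, A, S)` and `R` of shape `(S, A)`, and the policy as an integer array `pi` of actions:

```python
import numpy as np

def dp_policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterate V_k(s) = r(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V_{k-1}(s')."""
    S = P.shape[0]
    idx = np.arange(S)
    V = np.zeros(S)                      # V_0(s) = 0 for all states
    while True:
        V_new = R[idx, pi] + gamma * (P[idx, pi] @ V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```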

Monte Carlo policy evaluation

does not require MDP dynamics/reward models, so it can be used when the model is unknown $$ V^{\pi}(s)=E_{T\sim \pi}[G_t|s_t=s] $$

  • expectation of trajectories $T$ generated by following $\pi$

  • the idea is Value = mean return

  • does not assume state is Markov

  • can only be applied to episodic MDPs

    • averaging over returns from a complete episode
    • requires each episode to terminate
  • aim: estimate $V^\pi(s)$ given episodes generated under policy $\pi$

    $s_1,a_1,r_1,s_2,a_2,r_2,…$ where the actions are sampled from $\pi$

steps:

  • initialize $N(s)=0,G(s)=0\forall s\in S$
  • loop
    • sample episode $i=s_{i,1},a_{i,1},r_{i,1},s_{i,2},a_{i,2},r_{i,2},…,s_{i,T_i},a_{i,T_i},r_{i,T_i}$
    • define $G_{i,t}=r_{i,t}+\gamma r_{i,t+1}+\gamma^2r_{i,t+2}+\dots+\gamma^{T_i-t}r_{i,T_i}$ as the return from time step $t$ onwards in the $i$-th episode
    • for each state $s$ visited in episode $i$
      • for first time $t$ that state $s$ is visited in episode $i$
        • increment counter of total first visits: $N(s)=N(s)+1$
        • increment total return $G(s)=G(s)+G_{i,t}$
        • update estimate $V^{\pi}(s)=G(s)/N(s)$
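
A sketch of first-visit MC evaluation, assuming episodes are provided as lists of `(state, action, reward)` tuples collected by following $\pi$ (the episode generator itself is not shown):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """First-visit Monte Carlo estimate of V^pi from complete episodes."""
    N = defaultdict(int)        # first-visit counts
    G_sum = defaultdict(float)  # summed returns
    V = {}
    for episode in episodes:                     # episode: [(s, a, r), ...]
        # compute the return G_t for every time step by scanning backwards
        G, returns = 0.0, []
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # record the first visit time of each state
        first_visit = {}
        for t, (s, a, r) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t
        for s, t in first_visit.items():
            N[s] += 1
            G_sum[s] += returns[t]
            V[s] = G_sum[s] / N[s]
    return V
```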

Every-Visit MC

instead of for first time $t$ that state $s$ is visited in episode $i$, update $N(s),G(s)$ and $V^{\pi}(s)$ for every time $t$ that state $s$ is visited in episode $i$

this estimator is biased but is consistent and often has better MSE

Incremental MC

  • sample episode $i=s_{i,1},a_{i,1},r_{i,1},s_{i,2},a_{i,2},r_{i,2},…,s_{i,T_i},a_{i,T_i},r_{i,T_i}$

  • define $G_{i,t}=r_{i,t}+\gamma r_{i,t+1}+\gamma^2r_{i,t+2}+\dots+\gamma^{T_i-t}r_{i,T_i}$ as the return from time step $t$ onwards in the $i$-th episode

  • for state $s$ visited at time step $t$ in episode $i$

    • increment counter of total visits: $N(s)=N(s)+1$

    • update estimate $$ V^{\pi}(s)=V^{\pi}(s)\frac{N(s)-1}{N(s)}+\frac{G_{i,t}}{N(s)}=V^{\pi}(s)+\frac{1}{N(s)}(G_{i,t}-V^{\pi}(s)) $$

Running Mean

estimation is updated as $$ V^{\pi}(s)=V^{\pi}(s)+\alpha(G_{i,t}-V^{\pi}(s)) $$

  • $\alpha=\frac{1}{N(s)}$: identical to every visit MC
  • $\alpha\gt\frac{1}{N(s)}$: forget older data, helpful for non-stationary domains

Temporal Difference (TD)

  • combination of Monte Carlo & dynamic programming methods

  • Model-free

  • Bootstraps and samples

  • can be used in episodic or infinite-horizon non-episodic settings

  • immediately updates estimate of $V$ after each $(s,a,r,s^{’})$ tuple

  • aim: estimate $V^{\pi}(s)$ given episodes generated under policy $\pi$

  • have an estimate of $V^{\pi}$; use it to estimate the expected return $$ V^{\pi}(s)=V^{\pi}(s)+\alpha([r_t+\gamma V^{\pi}(s_{t+1})]-V^{\pi}(s)) $$

TD learning

  • Initialize $V^{\pi}(s)=0\forall s\in S$
  • loop
    • sample tuple $(s_t,a_t,r_t,s_{t+1})$
    • $V^{\pi}(s_t)=V^{\pi}(s_t)+\alpha([r_t+\gamma V^{\pi}(s_{t+1})]-V^{\pi}(s_t))$
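
A sketch of tabular TD(0) over a stream of transition tuples; the `done` flag is an addition here so that terminal transitions do not bootstrap:

```python
from collections import defaultdict

def td0_evaluation(transitions, gamma, alpha):
    """Tabular TD(0): update V after every (s, a, r, s_next) tuple."""
    V = defaultdict(float)                   # V(s) initialized to 0
    for (s, a, r, s_next, done) in transitions:   # action is unused for state values
        target = r if done else r + gamma * V[s_next]   # bootstrap unless terminal
        V[s] += alpha * (target - V[s])
    return V
```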

Lecture 4 (model-free control)

  • on-policy learning

    • direct experience
    • learn to estimate and evaluate a policy from experience obtained from following that policy
  • off-policy learning

    • learn to estimate and evaluate a policy using experience gathered from following a different policy

MC for On Policy Q Evaluation

  • initialize $N(s,a)=0,G(s,a)=0,Q^{\pi}(s,a)=0,\forall s\in S,\forall a\in A$

  • loop

    • using policy $\pi$ sample episode $i=s_{i,1},a_{i,1},r_{i,1},s_{i,2},a_{i,2},r_{i,2},…,s_{i,T_i},a_{i,T_i},r_{i,T_i}$
    • $G_{i,t}=r_{i,t}+\gamma r_{i,t+1}+\gamma^2r_{i,t+2}+\dots+\gamma^{T_i-t}r_{i,T_i}$
    • for each state-action pair $(s,a)$ visited in episode $i$
      • for the first (or every) time $t$ that pair $(s,a)$ is visited in episode $i$
        • $N(s,a)=N(s,a)+1,G(s,a)=G(s,a)+G_{i,t}$
        • update estimate $Q^{\pi}(s,a)=G(s,a)/N(s,a)$
  • given an estimate $Q^{\pi_i}(s,a)\forall s,a$

  • update the new policy $$ \pi_{i+1}(s)=\arg\max_a Q^{\pi_i}(s,a) $$

Monotonic $\epsilon$-greedy policy improvement

for any $\epsilon$-greedy policy $\pi_i$, the $\epsilon$-greedy policy with respect to $Q^{\pi_i}$, $\pi_{i+1}$, is a monotonic improvement: $V^{\pi_{i+1}}\ge V^{\pi_i}$ $$ \begin{split} Q^{\pi_i}(s,\pi_{i+1}(s)) & = \sum_{a\in A}\pi_{i+1}(a|s)Q^{\pi_i}(s,a) \\ & = \frac{\epsilon}{|A|}\sum_{a\in A}Q^{\pi_i}(s,a)+(1-\epsilon)\max_aQ^{\pi_i}(s,a) \end{split} $$

Greedy in the Limit of Infinite Exploration(GLIE)

  • all state-action pairs are visited an infinite number of times $$ \lim_{i\to\infty}N_i(s,a)\to\infty $$

  • behavior policy converges to greedy policy

  • a simple GLIE strategy is $\epsilon$-greedy where $\epsilon$ is reduced to 0 with the following rate $$ \epsilon_i=\frac{1}{i} $$

Monte Carlo Online Control /On Policy Improvement

  • initialize $N(s,a)=0,G(s,a)=0,Q(s,a)=0,\forall s\in S,\forall a\in A$; set $\epsilon=1$, $k=1$
  • $\pi_1= \epsilon\text{-greedy}(Q)$
  • loop
    • sample the $k$-th episode $(s_{k,1},a_{k,1},r_{k,1},s_{k,2},a_{k,2},r_{k,2},…,s_{k,T},a_{k,T},r_{k,T})$
    • $G_{k,t}=r_{k,t}+\gamma r_{k,t+1}+\gamma^2r_{k,t+2}+\dots+\gamma^{T-t}r_{k,T}$
    • for $t=1,…,T$ do
      • if first (or every) visit to $(s,a)$ in episode $k$ then
        • $N(s,a)=N(s,a)+1$
        • $Q(s_t,a_t)=Q(s_t,a_t)+\frac{1}{N(s,a)}(G_{k,t}-Q(s_t,a_t))$
    • $k=k+1,\epsilon=\frac{1}{k}$
    • $\pi_k= \epsilon\text{-greedy}(Q)$
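
A sketch of this GLIE MC control loop, assuming a `sample_episode(policy)` helper (not shown) that rolls out one episode under the given policy and returns a list of `(s, a, r)` tuples:

```python
import random
from collections import defaultdict

def glie_mc_control(sample_episode, n_actions, gamma, num_episodes):
    """GLIE Monte Carlo control with first-visit updates and epsilon = 1/k."""
    Q = defaultdict(float)                     # Q[(s, a)]
    N = defaultdict(int)
    epsilon = 1.0

    def policy(s):                             # epsilon-greedy w.r.t. the current Q
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for k in range(1, num_episodes + 1):
        episode = sample_episode(policy)       # [(s, a, r), ...] under current policy
        G, returns = 0.0, []
        for (s, a, r) in reversed(episode):    # returns computed backwards
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        seen = set()
        for t, (s, a, r) in enumerate(episode):
            if (s, a) in seen:                 # first-visit update only
                continue
            seen.add((s, a))
            N[(s, a)] += 1
            Q[(s, a)] += (returns[t] - Q[(s, a)]) / N[(s, a)]
        epsilon = 1.0 / k                      # GLIE schedule
    return Q
```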

SARSA

  • set initial $\epsilon$-greedy policy $\pi$ randomly, $t$=0, initial state $s_t=s_0$
  • take $a_t\sim\pi(s_t)$ //sample action from policy
  • observe $(r_t,s_{t+1})$
  • loop
    • take action $a_{t+1}\sim\pi(s_{t+1})$
    • observe $(r_{t+1},s_{t+2})$
    • $Q(s_t,a_t)=Q(s_t,a_t)+\alpha(r_t+\gamma Q(s_{t+1},a_{t+1})-Q(s_t,a_t))$
    • $\pi(s_t)=\arg\max_aQ(s_t,a)$ with prob. $1-\epsilon$, else random
    • $t=t+1$
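
A sketch of tabular SARSA against a hypothetical environment interface where `env.reset()` returns a state and `env.step(a)` returns `(s_next, r, done)`; the interface names are assumptions, not a specific library API:

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, gamma, alpha, epsilon, num_episodes):
    """Tabular SARSA with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            # the SARSA target uses the action actually taken next
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```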

Q-Learning

  • initialize $Q(s,a),\forall s\in S,a\in A$, $t=0$, initial state $s_t=s_0$

  • set $\pi_b$ to be $\epsilon$-greedy w.r.t $Q$

  • loop

    • take action $a_{t}\sim\pi_b(s_{t})$
    • observe $(r_{t},s_{t+1})$
    • $Q(s_t,a_t)=Q(s_t,a_t)+\alpha(r_t+\gamma \max_aQ(s_{t+1},a)-Q(s_t,a_t))$
    • $\pi(s_t)=\arg\max_aQ(s_t,a)$ with prob. $1-\epsilon$, else random
    • $t=t+1$
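
Tabular Q-learning differs from SARSA only in the target, which bootstraps from $\max_a Q(s_{t+1},a)$ rather than the action the behavior policy actually takes next (same hypothetical environment interface as above):

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, gamma, alpha, epsilon, num_episodes):
    """Tabular Q-learning with an epsilon-greedy behavior policy (off-policy)."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
            target = r if done else r + gamma * best_next   # greedy bootstrap
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```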

Double Q-learning

  • initialize $Q_1(s,a)$ and $Q_2(s,a),\forall s\in S,a\in A$, $t=0$, initial state $s_t=s_0$

  • loop

    • select $a_t$ using the $\epsilon$-greedy policy $\pi(s_t)=\arg\max_a\left[Q_1(s_t,a)+Q_2(s_t,a)\right]$

    • observe $(r_{t},s_{t+1})$

    • with probability 0.5:

      $Q_1(s_t,a_t)=Q_1(s_t,a_t)+\alpha(r_t+\gamma Q_2(s_{t+1},\arg\max_aQ_1(s_{t+1},a))-Q_1(s_t,a_t))$

    • else:

      $Q_2(s_t,a_t)=Q_2(s_t,a_t)+\alpha(r_t+\gamma Q_1(s_{t+1},\arg\max_aQ_2(s_{t+1},a))-Q_2(s_t,a_t))$

    • $t=t+1$
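
A sketch of double Q-learning with the same hypothetical environment interface; one table picks the argmax action and the other evaluates it, which is what removes the maximization bias:

```python
import random
from collections import defaultdict

def double_q_learning(env, n_actions, gamma, alpha, epsilon, num_episodes):
    """Double Q-learning: one table selects the argmax action, the other evaluates it."""
    Q1, Q2 = defaultdict(float), defaultdict(float)

    def eps_greedy(s):                         # epsilon-greedy w.r.t. Q1 + Q2
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q1[(s, a)] + Q2[(s, a)])

    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)
            if random.random() < 0.5:
                a_star = max(range(n_actions), key=lambda a2: Q1[(s_next, a2)])
                target = r if done else r + gamma * Q2[(s_next, a_star)]
                Q1[(s, a)] += alpha * (target - Q1[(s, a)])
            else:
                a_star = max(range(n_actions), key=lambda a2: Q2[(s_next, a2)])
                target = r if done else r + gamma * Q1[(s_next, a_star)]
                Q2[(s, a)] += alpha * (target - Q2[(s, a)])
            s = s_next
    return Q1, Q2
```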

Lecture 5 (Value Function Approximation)

represent a (state-action/state) value function with a parameterized function instead of a table $$ \hat{V}(s;w) \newline \hat{Q}(s,a;w) $$

motivation

  • don’t want to have to explicitly store or learn for every single state a
    • dynamics or reward model
    • value
    • state-action value
    • policy
  • want a more compact representation that generalizes across states and actions

function approximators

  • linear combinations of features
  • neural networks
  • decision trees
  • nearest neighbors
  • Fourier / wavelet bases

Linear Value Function Approximation

  • represent a value function (or state-action value function) for a particular policy with a weighted linear combination of features $$ \hat{V}(s;w)=\sum_{j=1}^nx_j(s)w_j=x(s)^Tw $$

  • objective function is $$ J(w)=E_{\pi}[(V^{\pi}(s)-\hat{V}(s;w))^2] $$

  • recall weight update is $$ \Delta w=-\frac{1}{2}\alpha \nabla_wJ(w) $$

convergence guarantees

  • the Markov chain defined by an MDP with a particular policy will eventually converge to a probability distribution over states, $d(s)$

  • $d(s)$ is called the stationary distribution over state of $\pi$

  • $\sum_sd(s)=1$

  • $d(s)$ satisfies the following balance equation: $$ d(s')=\sum_{s}\sum_a d(s)\pi(a|s)p(s'|s,a) $$

  • define the mean squared error of a linear value function approximation for a particular policy $\pi$ relative to the true value as $$ MSVE(w)=\sum_{s\in S}d(s)(V^{\pi}(s)-\hat{V}^{\pi}(s;w))^2 $$

Monte Carlo Value Function Approximation

$$ J(w)=E_{\pi}[(G_t-\hat{V}(s;w))^2] \newline \Delta w=\alpha(G_t-x(s_t)^Tw)x(s_t) $$

monte carlo policy evaluation with VFA converges to the weights $w_{MC}$ which has the minimum mean squared error possible: $$ MSVE(w_{MC})=min_w\sum_{s\in S}d(s)(V^{\pi}(s)-\hat{V}^{\pi}(s;w))^2 $$

Batch Monte Carlo Value Function Approximation

let $G(s_i)$ be an unbiased sample of the true expected return $V^{\pi}(s_i)$ $$ \arg\min_w\sum_{i=1}^N(G(s_i)-x(s_i)^Tw)^2 $$ take the derivative and set it to 0: $$ w=(X^TX)^{-1}X^TG $$ where $G$ is the vector of all $N$ returns and $X$ is the matrix of the features of each of the $N$ states, $x(s_i)$
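
A small sketch of the batch solution, assuming `X` is the $N\times n$ matrix of state features and `G` the vector of $N$ returns; `np.linalg.lstsq` solves the same least-squares problem without explicitly inverting $X^TX$:

```python
import numpy as np

def batch_mc_lstsq(X, G):
    """Closed-form batch MC fit: the minimizer of sum_i (G(s_i) - x(s_i)^T w)^2."""
    w, *_ = np.linalg.lstsq(X, G, rcond=None)   # numerically stabler than (X^T X)^{-1} X^T G
    return w
```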

TD value Function Approximation

$$ J(w)=E_{\pi}[(r+\gamma \hat{V}(s';w)-\hat{V}(s;w))^2] \newline \Delta w=\alpha(r+\gamma x(s')^Tw-x(s)^Tw)x(s) $$

TD(0) policy evaluation with VFA converges to the weights $w_{TD}$ which is within a constant factor of the minimum mean squared error possible: $$ MSVE(w_{TD})\le \frac{1}{1-\gamma}min_w\sum_{s\in S}d(s)(V^{\pi}(s)-\hat{V}^{\pi}(s;w))^2 $$

control using Value Function Approximation

  • use value function approximation to represent state-action values: $\hat{Q}^{\pi}(s,a;w)\approx Q^{\pi}(s,a)$

$$ J(w)=E_{\pi}[(Q^{\pi}(s,a)-\hat{Q}(s,a;w))^2] \newline \Delta w=\alpha E[(Q^{\pi}(s,a)-\hat{Q}(s,a;w))\nabla_w\hat{Q}^{\pi}(s,a;w)] $$

  • use features to represent both the state and action $$ x(s,a)=\begin{pmatrix} x_1(s,a)\\ x_2(s,a)\\ \vdots\\ x_n(s,a) \end{pmatrix} $$

  • monte carlo $$ \Delta w=\alpha (G_t-\hat{Q}(s,a;w))\nabla_w\hat{Q}^{\pi}(s,a;w) $$

  • SARSA $$ \Delta w=\alpha (r+\gamma\hat{Q}(s’,a’;w)-\hat{Q}(s,a;w))\nabla_w\hat{Q}^{\pi}(s,a;w) $$

  • Q-learning $$ \Delta w=\alpha (r+\gamma\max_{a'}\hat{Q}(s',a';w)-\hat{Q}(s,a;w))\nabla_w\hat{Q}^{\pi}(s,a;w) $$
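
All three updates share the pattern scalar error × gradient of $\hat{Q}$. A sketch of one Q-learning step with linear features, assuming a feature function `x(s, a)` that returns a NumPy vector (names are illustrative):

```python
import numpy as np

def q_learning_vfa_update(w, x, s, a, r, s_next, actions, gamma, alpha, done=False):
    """One Q-learning step with linear VFA: w += alpha * (target - Q_hat(s,a;w)) * x(s,a)."""
    q_sa = x(s, a) @ w
    if done:
        target = r
    else:
        target = r + gamma * max(x(s_next, a2) @ w for a2 in actions)
    # for linear Q_hat, the gradient w.r.t. w is just the feature vector x(s, a)
    return w + alpha * (target - q_sa) * x(s, a)
```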

Lecture 6 (Deep Q Learning)

  • Linear VFA often works well given the right set of features
  • But it can require carefully hand-designing that feature set
  • An alternative is to use a much richer function approximation class that goes directly from states to values without requiring an explicit specification of features

Deep Q-Networks(DQNs)

  • represent state-action value function by Q-network with weights $w$ $$ \hat{Q}(s,a;w)\approx Q(s,a) $$

Experience Replay

  • to help remove correlations, store dataset (called a replay buffer) $D$ from prior experience
  • to perform experience replay, repeat:
    • $(s,a,r,s')\sim D$: sample an experience tuple from the dataset
    • compute the target value for the sampled $s$: $r+\gamma\max_{a'}\hat{Q}(s',a';w)$
    • use stochastic gradient descent to update the network weights
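
A minimal uniform replay buffer sketch; the Q-network, target computation, and optimizer are deliberately left out:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) tuples sampled uniformly."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest tuples are dropped automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # tuple of lists: states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)
```
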
Prioritized Experience Replay
  • let $i$ be the index of the $i$-th tuple of experience $(s_i,a_i,r_i,s_{i+1})$

  • sample tuples for update using priority function

  • priority of a tuple is proportional to the DQN error $$ p_i=|r+\gamma\max_{a'}\hat{Q}(s',a';w)-\hat{Q}(s,a;w)| $$

  • update $p_i$ every update

  • $p_i$ for new tuples is set to the maximum priority, so new experience is sampled at least once

  • one method (proportional prioritization): $$ P(i)=\frac{p_i^{\alpha}}{\sum_kp_k^{\alpha}} $$

Fixed Q-Targets

  • to help improve stability, fix the target weights used in the target calculation for multiple updates
  • use a different set of weights to compute the target than the set being updated
  • let parameters $w^-$ be the set of weights used in the target, and $w$ be the weights that are being updated
  • slight change to computation of target value:
    • sample $(s,a,r,s’)\sim D$
    • compute the target $r+\gamma\max_{a'}\hat{Q}(s',a';w^-)$
    • update $\Delta w=\alpha (r+\gamma\max_{a'}\hat{Q}(s',a';w^-)-\hat{Q}(s,a;w))\nabla_w\hat{Q}(s,a;w)$
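
A sketch of the fixed-target update using a linear $\hat{Q}$ as a stand-in for the network (so $\nabla_w\hat{Q}$ is just the feature vector); `w_minus` only changes when it is periodically copied from `w` in the (assumed) outer training loop:

```python
import numpy as np

def fixed_target_step(w, w_minus, batch, x, actions, gamma, alpha):
    """One pass over a sampled batch where the bootstrap target uses the frozen weights w_minus."""
    for (s, a, r, s_next, done) in batch:
        if done:
            target = r
        else:
            target = r + gamma * max(x(s_next, a2) @ w_minus for a2 in actions)
        td_error = target - x(s, a) @ w        # frozen weights appear only inside the target
        w = w + alpha * td_error * x(s, a)
    return w

# periodically, in the assumed training loop:
#   if step % sync_every == 0:
#       w_minus = w.copy()
```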

Double DQN

  • current Q-network $w$ is used to select actions
  • older Q-network $w^-$ is used to evaluate actions

Advantage Function

$$ A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s) $$

Lecture 7 (Imitation Learning)

  • Expert provides a set of demonstration trajectories: sequences of states and actions

  • Imitation learning is useful when it is easier for the expert to demonstrate the desired behavior rather than:

    • specifying a reward that would generate such behavior
    • specifying the desired policy directly
  • Input

    • state space, action space
    • transition model $P(s’|s,a)$
    • No reward function $R$
    • set of one or more teacher’s demonstrations $(s_0,a_0,s_1,…)$ (actions drawn from teacher’s policy $\pi^*$ )
  • type

    • Behavioral cloning

      directly learn the teacher’s policy

    • Inverse RL

      recover $R$

    • Apprenticeship learning via Inverse RL

      use $R$ to generate a good policy

Behavioral Cloning

  • formulate problem as a standard machine learning problem:
    • fix a policy class (e.g. neural network, decision tree, etc.)
    • estimate a policy from training examples $(s_0,a_0),(s_1,a_1),\dots$

Behavioral cloning alone is often not a good choice: supervised learning assumes i.i.d. data, but small action errors compound and drive the agent into states the expert never demonstrated (distribution shift), where the cloned policy has no data to fall back on.

Inverse reinforcement learning

  • Goal: given input, infer the reward function R

Linear Feature Reward Inverse RL

  • consider reward is linear over features

    $$R(s)=w^Tx(s)$$

    where $w\in R^n,x:s\to R^n$

  • Goal: identify weight vector $w$ given a set of demonstrations

  • The resulting value function for a policy $\pi$ can be expressed as $$ \begin{split} V^{\pi}&=E[\sum_{t=0}^{\infty}\gamma^tR(s_t)|\pi]\\ &=E[\sum_{t=0}^{\infty}\gamma^tw^Tx(s_t)|\pi]\\ &=w^TE[\sum_{t=0}^{\infty}\gamma^tx(s_t)|\pi]\\ &=w^T\mu(\pi) \end{split} $$ where $\mu(\pi)$ is the vector of discounted weighted frequencies of state features under policy $\pi$
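
A sketch of estimating $\mu(\pi)$ by averaging discounted feature sums over sampled trajectories; trajectories are assumed to be lists of visited states and `x(s)` a feature function returning a NumPy vector:

```python
import numpy as np

def feature_expectations(trajectories, x, gamma):
    """Monte Carlo estimate of mu(pi) = E[ sum_t gamma^t x(s_t) | pi ]."""
    mu = None
    for states in trajectories:
        total = sum(gamma**t * x(s) for t, s in enumerate(states))
        mu = total if mu is None else mu + total
    return mu / len(trajectories)

# with a candidate weight vector w, the policy's value is then approximately w @ mu
```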

Apprenticeship Learning

step

  • assumption:$R(s)=w^Tx(s)$

  • Initialize policy $\pi_0$

  • loop

    • find a reward function such that the teacher maximally outperforms all previous controllers: $$ \arg\max_w\max_{\gamma}\gamma \quad \text{s.t. } w^T\mu(\pi^*)\ge w^T\mu(\pi)+\gamma,\ \forall\pi\in\{\pi_0,\pi_1,\dots,\pi_{i-1}\} $$

    • s.t. $||w||_2\le 1$

    • find optimal control policy $\pi_i$ for the current $w$

    • exit if $\gamma\le\epsilon/2$

summary

  • if the expert policy is suboptimal, then the resulting policy is a mixture of somewhat arbitrary policies which have the expert in their convex hull
  • in practice: pick the best one of this set and pick the corresponding reward function

Lecture 8-10 (Policy Gradient)

Directly parametrize the policy $$ \pi_{\theta}(s,a)=P[a|s;\theta] $$ Goal is to find a policy $\pi$ with the highest value function $V^{\pi}$

  • Advantages
    • better convergence properties
    • effective in high-dimensional or continuous action spaces
    • can learn stochastic policies
  • Disadvantages
    • Typically converge to a local rather than global optimum
    • Evaluating a policy is typically inefficient and high variance