Inverse reinforcement learning
LOLS with classifier-only rollouts is RL (Chang et al., 2015)
Rolling out each action can be expensive:
What if we only roll out one?
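A minimal sketch of the difference, with hypothetical rollout and loss helpers: classic LOLS-style costing rolls out every alternative action at a time step, while a bandit-style variant samples a single action and rolls out only that one.

import random

def cost_all_actions(policy, state, actions, rollout, loss):
    # Classic costing: one roll-out (and one loss evaluation) per action.
    return {a: loss(rollout(policy, state, a)) for a in actions}

def cost_one_action(policy, state, actions, rollout, loss):
    # Bandit-style costing: sample a single action and roll out only that one.
    a = random.choice(actions)
    return a, loss(rollout(policy, state, a))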
Bandit Learning
Structured Contextual Bandit LOLS (Chang et al., 2015)
Sokolov et al. (2016) used bandit feedback for chunking
Learning with non-decomposable loss functions means the loss is only defined for the complete output, not per action
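A sketch of how these pieces fit together (all helpers here, transition, loss, and update, are assumptions, not a specific library API): roll in with the learned policy, deviate with a single sampled action, finish the trajectory, and credit the loss of the complete output to that action, since a non-decomposable loss gives no per-action signal.

import random

def bandit_lols_episode(policy, start_state, horizon, actions, transition, loss, update):
    # One episode of a structured-contextual-bandit / bandit-LOLS-style update (sketch).
    deviate_at = random.randrange(horizon)      # time step at which we explore
    state, trajectory, explored = start_state, [], None

    for t in range(horizon):
        if t == deviate_at:
            action = random.choice(actions)     # explore one sampled action
            explored = action
        else:
            action = policy(state)              # roll in / roll out with the learner
        trajectory.append(action)
        state = transition(state, action)

    # Non-decomposable feedback: only the loss of the complete output is
    # observed, and that single scalar is the cost signal for the explored action.
    update(policy, deviate_at, explored, loss(trajectory))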
UNSEARN (Daumé III, 2009): Predict the structured output so that you can predict the input from it (auto-encoder!)
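A minimal sketch of that objective, with assumed predict_output, predict_input, and reconstruction_loss functions:

def unsearn_loss(x, predict_output, predict_input, reconstruction_loss):
    # UNSEARN as an auto-encoder (sketch): predict a structure for x, then
    # predict x back from that structure; the reconstruction error is the
    # supervision signal, so no gold output is needed.
    y_hat = predict_output(x)        # input -> structured output
    x_hat = predict_input(y_hat)     # structured output -> input
    return reconstruction_loss(x, x_hat)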
# `cg` (cost-breakdown drawing) and `util` (Carousel display) are helper
# modules shipped with this notebook.
paths = [[], [(0, 4), (1, 3)], [(0, 4), (1, 3), (2, 2)], [(0, 4), (1, 3), (2, 2), (3, 1)]]
rows = ['Noun', 'Verb', 'Modal', 'Pronoun', 'NULL']
columns = ['NULL', 'I', 'can', 'fly']
gold_path = [(0, 4), (1, 3), (2, 2), (3, 1)]

cbs = []
# Cost breakdown for the gold path.
cb_gold = cg.draw_cost_breakdown(rows, columns, gold_path)
cbs.append(cb_gold)

# Cost breakdown for a wrong prediction.
wrong_path = [(0, 4), (1, 2)]
cb_wrong = cg.draw_cost_breakdown(rows, columns, wrong_path)
cbs.append(cb_wrong)

# Roll in along the gold path, explore each possible action at step 2,
# and draw the cost of the resulting roll-out.
for i in range(4):
    p = gold_path.copy()
    p[2] = (gold_path[2][0], i)
    if p == gold_path:
        cost = 0
    elif i == 1:
        cost = 2
        p[3] = (3, 0)
    else:
        cost = 1
    cbs.append(cg.draw_cost_breakdown(rows, columns, p, cost, p[3],
                                      roll_in_cell=p[1], roll_out_cell=(3, 0),
                                      explore_cell=p[2]))

util.Carousel(cbs)
Generate useful negative samples around the expert; related to adversarial training (Ho and Ermon, 2016)
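One way to picture "negative samples around the expert" (a sketch only, not the actual algorithm of Ho and Ermon, 2016): perturb expert trajectories slightly, and let a discriminator-style classifier separate expert behaviour from the perturbed behaviour.

import random

def negatives_around_expert(expert_trajectory, actions, n_samples=5):
    # Sketch: build negatives "around the expert" by swapping one action of
    # the expert trajectory for a random alternative; a discriminator trained
    # on expert vs. these negatives can then act as a learned reward signal.
    negatives = []
    for _ in range(n_samples):
        traj = list(expert_trajectory)
        t = random.randrange(len(traj))
        traj[t] = random.choice([a for a in actions if a != traj[t]])
        negatives.append(traj)
    return negatives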
If the optimal action is difficult to predict, the coach teaches a good one that is easier (He et al., 2012)
Expert: $\alpha^{\star}= \mathop{\arg \min}_{\alpha \in {\cal A}} L(S_t(\alpha,\pi^{\star}),\mathbf{y})$
Coach: $\alpha^{\dagger}= \mathop{\arg \min}_{\alpha \in {\cal A}} \lbrace L(S_t(\alpha,\pi^{\star}),\mathbf{y}) - \lambda \cdot f(\alpha, \mathbf{x}, S_t)\rbrace$
i.e. $\alpha^{\dagger}=\mathop{\arg \min}_{\alpha \in {\cal A}} \lbrace loss(\alpha) - \lambda \cdot classifier(\alpha) \rbrace$
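A direct transcription of the two rules as a sketch, assuming per-action loss and classifier_score helpers (hypothetical names):

def expert_action(actions, loss):
    # alpha* = argmin_a L(S_t(a, pi*), y): the action with the lowest loss.
    return min(actions, key=loss)

def coach_action(actions, loss, classifier_score, lam=1.0):
    # alpha-dagger = argmin_a { loss(a) - lambda * classifier_score(a) }:
    # among low-loss actions, prefer one the current classifier already
    # scores highly, i.e. one that is easier to learn (He et al., 2012).
    return min(actions, key=lambda a: loss(a) - lam * classifier_score(a))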
Speed-up action costing in LOLS (Vieira and Eisner, 2016):
If one action changes, don't re-assess the cost of the whole trajectory (under assumptions); propagate the change!
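A heavily simplified sketch of the idea, assuming the trajectory cost is cached as per-step costs and that a single changed action only affects a known set of steps; the actual method propagates changes through the underlying dynamic program.

def full_trajectory_cost(step_costs):
    # Naive baseline: re-sum the cost of the whole trajectory.
    return sum(step_costs)

def propagate_change(total, step_costs, affected_steps, recompute_step_cost):
    # Change propagation (sketch): when one action changes, re-evaluate only
    # the affected steps and patch the cached total, instead of re-costing
    # the entire trajectory.
    for t in affected_steps:
        total -= step_costs[t]
        step_costs[t] = recompute_step_cost(t)
        total += step_costs[t]
    return total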
They face similar problems: