%load_ext autoreload
%autoreload 2
import il_tutorial.cost_graphs as cg
import il_tutorial.util as util
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
Andreas Vlachos
a.vlachos@sheffield.ac.uk
Department of Computer Science
University of Sheffield
Based on the EACL 2017 tutorial
with Gerasimos Lampouras and
Sebastian Riedel
Input: a sentence $\mathbf{x}=[x_1...x_N]$
Output: a sequence of labels $\mathbf{y}=[y_{1}\ldots y_{N}] \in {\cal Y}^N$
Syntactic parsing, but also semantic parsing, semantic role labeling, question answering over knowledge bases, etc.
Input: a sentence $\mathbf{x}=[x_1...x_N]$
Output: a meaning representation graph $\mathbf{G}=(V,E) \in {\cal G_{\mathbf{x}}}$
Natural language generation (NLG), but also summarization, decoding in machine translation, etc.
Input: a meaning representation
Output: $\mathbf{w}=[w_1 \ldots w_N],\; w_n \in {\cal V}\cup \{END\},\; w_N=END$
We assume gold standard output for training,
but we train a classifier to predict the actions that construct the output.
Meta-learning: learn a better model (≈ policy) by generating better training data from expert demonstrations.
Incremental modeling, a.k.a:
A model (e.g. conditional random fields) that scores complete outputs (e.g. label sequences):
$$\mathbf{\hat y} =\hat y_{1}\ldots \hat y_{N} = \mathop{\arg \max}_{\mathbf{y} \in {\cal Y}^N} f(y_{1}\ldots y_{N}, \mathbf{x})$$
A classifier predicting one label at a time, given the previous ones:
\begin{align} \hat y_1 &=\mathop{\arg \max}_{y \in {\cal Y}} f(y, \mathbf{x}),\\ \hat y_2 &=\mathop{\arg \max}_{y \in {\cal Y}} f(y, \mathbf{x}, \hat y_1),\\ &\;\;\vdots\\ \hat y_N &=\mathop{\arg \max}_{y \in {\cal Y}} f(y, \mathbf{x}, \hat y_{1} \ldots \hat y_{N-1}) \end{align}
Improve incremental modeling to:
Meta-learning: use our favourite classifier and features, but generate better (non-i.i.d.) training data
To apply IL we need:
The actions $\cal A$ the classifier $f$ can predict, and their effect on the state, which tracks the prediction so far: $S_{t+1}=S_1(\alpha_1\ldots\alpha_t)$
\begin{align} & \textbf{Input:} \; sentence \; \mathbf{x}\\ & state \; S_1=initialize(\mathbf{x}); timestep \; t = 1\\ & \mathbf{while}\; S_t \; \text{not final}\; \mathbf{do}\\ & \quad action \; \alpha_t = \mathop{\arg \max}_{\alpha \in {\cal A}} f(\alpha, \mathbf{x}, S_t)\\ & \quad S_{t+1}=S_t(\alpha_t); t=t+1\\ \end{align}
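A minimal Python sketch of this loop; the State class (with is_final/apply), the scoring function f and the action set are hypothetical stand-ins, not part of il_tutorial:
def greedy_predict(x, f, initialize, actions):
    # greedy incremental prediction: repeatedly pick the highest-scoring
    # action under the classifier f and apply it to the state
    state = initialize(x)                              # S_1
    while not state.is_final():
        best_action = max(actions, key=lambda a: f(a, x, state))
        state = state.apply(best_action)               # S_{t+1} = S_t(alpha_t)
    return state                                       # final state encodes the prediction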
What are good actions in incremental structured prediction?
Those that reach $S_{final} = S_1(\alpha_1\ldots\alpha_T)$ with low task loss:
$$loss = L(S_{final}, \mathbf{y}) \geq 0$$
How many incorrect PoS tags are due to $\alpha_6$ being NNP? 0
How many $FP+FN$ due to $\alpha_6$ being B-PER?
It depends on what $\alpha_7$ is!
The $FP+FN$ loss is non-decomposable with respect to the transition system.
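To make this concrete, here is a small sketch (my own helper, not part of il_tutorial) that computes span-level $FP+FN$ from BIO tags; the same wrong B-PER decision costs differently depending on the tag chosen next:
def bio_spans(tags):
    # extract (start, end, type) entity spans from a BIO tag sequence;
    # a sentinel 'O' at the end closes any span still open
    spans, start, etype = [], None, None
    for i, t in enumerate(list(tags) + ['O']):
        inside = start is not None and t == 'I-' + etype
        if start is not None and not inside:
            spans.append((start, i, etype))
            start, etype = None, None
        # B- opens a span; a stray I- with no open span also opens one (lenient BIO)
        if t.startswith('B-') or (t.startswith('I-') and start is None):
            start, etype = i, t[2:]
    return set(spans)

def fp_fn(predicted, gold):
    p, g = bio_spans(predicted), bio_spans(gold)
    return len(p - g) + len(g - p)    # false positives + false negatives

gold = ['O', 'B-PER', 'I-PER']
fp_fn(['B-PER', 'I-PER', 'I-PER'], gold)   # 2: the gold span is lost and a wrong one added
fp_fn(['B-PER', 'B-PER', 'I-PER'], gold)   # 1: only a spurious one-token span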
Returns the best action at the current state by looking at the gold standard, assuming future actions are also optimal:
$$\alpha^{\star}=\pi^{\star}(S_t, \mathbf{y}) = \mathop{\arg \min}_{\alpha \in {\cal A}} L(S_t(\alpha,\pi^{\star}),\mathbf{y})$$
In pairs: What action should $\pi^{\star}$ return?
Takes previous actions into account (dynamic vs static)
Finding the optimal action can be expensive, but we can learn with sub-optimal experts.
Task loss: Hamming loss: number of incorrectly predicted tags
Transition system: Tag each token left-to-right
Expert policy: Return the next tag from the gold standard
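Both pieces are trivial for tagging; a minimal sketch, assuming the state simply records the tags predicted so far in a hypothetical predicted_tags attribute:
def hamming_loss(final_state, gold_tags):
    # task loss: number of incorrectly predicted tags
    return sum(p != g for p, g in zip(final_state.predicted_tags, gold_tags))

def expert_policy(state, gold_tags):
    # static expert for left-to-right tagging: the optimal next action
    # is the gold tag of the first token not yet tagged
    return gold_tags[len(state.predicted_tags)]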
# partial paths through the tagging search space for the sentence "I can fly"
paths = [[], [(0, 4), (1, 3)], [(0, 4), (1, 3), (2, 2)], [(0, 4), (1, 3), (2, 2), (3, 1)]]
rows = ['Noun', 'Verb', 'Modal', 'Pronoun', 'NULL']
columns = ['NULL', 'I', 'can', 'fly']
# draw the cost breakdown for each partial path and browse them as a carousel
cbs = []
for path in paths:
    cbs.append(cg.draw_cost_breakdown(rows, columns, path))
util.Carousel(cbs)
# the complete gold-standard path: each token paired with its correct tag
gold_path = [(0, 4), (1, 3), (2, 2), (3, 1)]
cb_gold = cg.draw_cost_breakdown(rows, columns, gold_path)
cb_gold
\begin{align} & \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; \text{expert}\; \pi^{\star}, \; \text{classifier} \; H\\ & \text{set training examples}\; \cal E = \emptyset\\ & \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\ & \quad \text{generate expert trajectory} \; \alpha_1^{\star}\dots \alpha_T^{\star} = \pi^{\star}(\mathbf{x},\mathbf{y})\\ & \quad \mathbf{for} \; \alpha^{\star}_t \in \alpha_1^{\star}\dots \alpha_T^{\star} \; \mathbf{do}\\ & \quad \quad \text{extract features}\; \mathit{feat}=\phi(\mathbf{x},S_{t-1}) \\ & \quad \quad \cal E = \cal E \cup (\mathit{feat},\alpha^{\star}_t)\\ & \text{learn} \; H\; \text{from}\; \cal E\\ \end{align}
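A minimal sketch of exact imitation, assuming the hypothetical State/expert_policy/phi interface from the earlier sketches and a scikit-learn-style classifier:
def exact_imitation(train_data, expert_policy, phi, initialize, classifier):
    # follow the expert's trajectory on every sentence, record
    # (features, expert action) pairs, then train a standard classifier
    features, labels = [], []
    for x, y in train_data:
        state = initialize(x)                      # S_1
        while not state.is_final():
            best = expert_policy(state, y)         # alpha*_t
            features.append(phi(x, state))
            labels.append(best)
            state = state.apply(best)              # stay on the expert trajectory
    classifier.fit(features, labels)               # learn H from E
    return classifier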
# a path that deviates from the gold standard at the third step
wrong_path = [(0, 4), (1, 3), (2, 1)]
cb_wrong = cg.draw_cost_breakdown(rows, columns, wrong_path)
util.Carousel([cb_gold, cb_wrong])
We had seen:
but not:
Define a rollin policy that sometimes uses the expert $\pi^{\star}$ and other times the classifier $H$:
$$\pi^{in} = \beta\pi^{\star} + (1-\beta)H$$
\begin{align} & \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; \text{expert}\; \pi^{\star}, \; \text{classifier} \; H\\ & \text{set training examples}\; \cal E = \emptyset ,\; \color{red}{\pi^{\star}\; \mathrm{probability}\; \beta=1}\\ & \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\ & \quad \color{red}{\text{set rollin policy} \; \pi^{in} = \beta\pi^{\star} + (1-\beta)H}\\ & \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\ & \quad \quad \color{red}{\text{generate trajectory} \; \hat \alpha_1\dots\hat \alpha_T = \pi^{in}(\mathbf{x},\mathbf{y})}\\ & \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\ & \quad \quad \quad \color{red}{\text{ask expert for best action}\; \alpha^{\star} = \pi^{\star}(\mathbf{x},S_{t-1})} \\ & \quad \quad \quad \text{extract features} \; \mathit{feat}=\phi(\mathbf{x},S_{t-1}) \\ & \quad \quad \quad \cal E = \cal E \cup (\mathit{feat},\alpha^{\star})\\ & \quad \text{learn}\; H \; \text{from}\; \cal E\\ & \quad \color{red}{\text{decrease} \; \beta}\\ \end{align}
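A minimal sketch of this training loop, again over the hypothetical interface used above and a scikit-learn-style classifier on the feature vectors returned by phi; the exponential decay of beta is just one common choice:
import random

def dagger(train_data, expert_policy, phi, initialize, classifier,
           iterations=5, decay=0.7):
    # roll in with a mixture of the expert and the learned classifier,
    # label every visited state with the expert's action, and retrain
    features, labels = [], []
    beta = 1.0                                     # start by trusting the expert fully
    for _ in range(iterations):
        for x, y in train_data:
            state = initialize(x)
            while not state.is_final():
                expert_action = expert_policy(state, y)         # alpha*
                features.append(phi(x, state))                  # always labelled by the expert
                labels.append(expert_action)
                if random.random() < beta:                      # rollin: expert ...
                    action = expert_action
                else:                                           # ... or current classifier
                    action = classifier.predict([phi(x, state)])[0]
                state = state.apply(action)
        classifier.fit(features, labels)           # learn H from the accumulated examples
        beta *= decay                              # gradually trust the classifier more
    return classifier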
Proposed by Ross et al. (2011), motivated by robotics
Rollins help recover from previous mistakes. But how do we learn the future impact of a mistake?
Rollout: try each available action and see what happens when the future actions are taken by mixing the classifier and the expert
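A sketch of rollout-based cost estimation under the same hypothetical interface: each action's cost is the task loss of the final state reached by completing the prediction with a beta-mixture of expert and classifier.
import random

def rollout_costs(x, y, state, actions, classifier, expert_policy, phi,
                  task_loss, beta_out=0.5):
    # try every available action at this state, roll out to a final state
    # with a mixture of expert and classifier, and record the task loss
    costs = {}
    for action in actions:
        s = state.apply(action)
        while not s.is_final():
            if random.random() < beta_out:
                a = expert_policy(s, y)                         # expert rollout step
            else:
                a = classifier.predict([phi(x, s)])[0]          # learned rollout step
            s = s.apply(a)
        costs[action] = task_loss(s, y)       # e.g. Hamming loss, or FP+FN for NER
    best = min(costs.values())
    # per-action regrets, ready for a cost-sensitive classification example
    return {a: c - best for a, c in costs.items()}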
Remember this?
# partial transition sequences for the dependency parsing example
dep_rows = ['SHIFT', 'REDUCE', 'ARC-L', 'ARC-R', 'NULL']
dep_columns = ['S0', 'S1', 'S2', 'S3']
dep_cbs = []
dep_paths = [[], [(0, 4), (1, 0)], [(0, 4), (1, 0), (2, 0)], [(0, 4), (1, 0), (2, 0), (3, 2)]]
# draw the cost breakdown for each partial transition sequence and browse them
for path in dep_paths:
    dep_cbs.append(cg.draw_cost_breakdown(dep_rows, dep_columns, path))
util.Carousel(dep_cbs)
Stack = [ROOT, had, effect]
Buffer = [on, financial, markets, .]
Features based on the words/PoS in stack and buffer:
wordS1=effect, wordB1=on, wordS2=had, posS1=NOUN, etc.
Features based on the dependencies so far:
depS1=dobj, depLeftChildS1=amod, depRightChildS1=NULL, etc.
Features based on previous transitions:
$r_{t-1}=\text{Right-Arc}(dobj)$, etc.
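A sketch of how such feature templates could be assembled, assuming a hypothetical parser state with .stack and .buffer lists of tokens (each with word/pos/dep attributes) and a .last_transition field; this is not the il_tutorial implementation:
def parser_features(state):
    # collect the templates above into a feature dictionary
    feats = {}
    if state.stack:
        s1 = state.stack[-1]                         # top of the stack
        feats['wordS1'], feats['posS1'] = s1.word, s1.pos
        feats['depS1'] = s1.dep or 'NULL'
    if len(state.stack) > 1:
        feats['wordS2'] = state.stack[-2].word
    if state.buffer:
        feats['wordB1'] = state.buffer[0].word       # next word in the buffer
    feats['prevTransition'] = state.last_transition or 'NULL'
    return feats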
SHIFT would have been the best action if everything so far had been correct (loss = 0), but not any more.
\begin{align} & \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; \text{expert}\; \pi^{\star}, \; \text{classifier} \; H\\ & \text{set training examples}\; \cal E = \emptyset ,\; \color{red}{\pi^{\star}\; \mathrm{probability}\; \beta=1}\\ & \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\ & \quad \color{red}{\text{set rollin policy} \; \pi^{in} = \beta\pi^{\star} + (1-\beta)H}\\ & \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\ & \quad \quad \color{red}{\text{generate trajectory} \; \hat \alpha_1\dots\hat \alpha_T = \pi^{in}(\mathbf{x},\mathbf{y})}\\ & \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\ & \quad \quad \quad \color{red}{\text{ask expert for best action}\; \alpha^{\star} = \pi^{\star}(\mathbf{x},S_{t-1})} \\ & \quad \quad \quad \text{extract features} \; \mathit{feat}=\phi(\mathbf{x},S_{t-1}) \\ & \quad \quad \quad \cal E = \cal E \cup (\mathit{feat},\alpha^{\star})\\ & \quad \text{learn}\; H \; \text{from}\; \cal E\\ & \quad \color{red}{\text{decrease} \; \beta}\\ \end{align}
Just like the oracle, but it takes the previous actions into account (dynamic oracle)
Recurrent Neural Network training
(Ranzato et al., 2016)
And some of my own:
Imitation learning:
...for your hard work during the module and your feedback!
Please give more feedback on the course in the forms provided by the department!
Future students will be grateful!