Imitation Learning
Meta-learning for action-based models:
improve the classifier by generating training data from demonstrations
For each training instance:
Or perform rollouts to cost the actions:
Dependency parsing: represent the syntax of a sentence as directed labeled arcs between words.
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; \text{expert}\; \pi^{\star}, \; \text{classifier} \; H\\
& \text{set training examples}\; \cal E = \emptyset ,\; \color{red}{\pi^{\star}\; \mathrm{probability}\; \beta=1}\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \color{red}{\text{set rollin policy} \; \pi^{in} = \beta\pi^{\star} + (1-\beta)H}\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \color{red}{\text{generate trajectory} \; \hat \alpha_1\dots\hat \alpha_T = \pi^{in}(\mathbf{x},\mathbf{y})}\\
& \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\
& \quad \quad \quad \color{red}{\text{ask expert for } \underline{\text{a set of best actions}}\; \{\alpha_{1}^{\star}\dots\alpha_{k}^{\star}\} = \pi^{\star}(\mathbf{x},S_{t-1})} \\
& \quad \quad \quad \text{extract features} \; \mathit{feat}=\phi(\mathbf{x},S_{t-1}) \\
& \quad \quad \quad \cal E = \cal E \cup (\mathit{feat},\alpha^{\star})\\
& \quad \text{learn}\; H \; \text{from}\; \cal E\\
& \quad \color{red}{\text{decrease} \; \beta}\\
\end{align}
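A minimal Python sketch of this loop, under the assumption that the roll-in mixture is applied stochastically per action; `initial_state`, `extract_features`, and the fit/predict-style classifier interface are hypothetical helpers, not from the original:

```python
import random

def imitation_learning(train_data, expert_policy, classifier,
                       n_iterations=10, beta_decay=0.9):
    """DAgger-style training loop (a sketch; helper names are illustrative)."""
    examples = []   # aggregated training set E
    beta = 1.0      # probability of querying the expert during roll-in

    for _ in range(n_iterations):
        for x, y in train_data:
            state = initial_state(x)               # assumed helper
            while not state.is_terminal():
                # Roll-in: mix expert and learned classifier
                if random.random() < beta:
                    action = expert_policy(x, y, state)
                else:
                    action = classifier.predict(extract_features(x, state))
                # Query the expert for the best action at this state
                best_action = expert_policy(x, y, state)
                examples.append((extract_features(x, state), best_action))
                state = state.apply(action)
        # Learn a new classifier from all collected examples
        feats, labels = zip(*examples)
        classifier.fit(feats, labels)
        beta *= beta_decay  # decrease reliance on the expert
    return classifier
```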
To apply Imitation Learning to any task, we need to define:
We can assume any transition-based system (e.g. Arc-Eager).
State: arcs, stack, and buffer.
Action space:
Shift, Reduce, Arc-Left, and Arc-Right.
The length of the transition sequence is variable.
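A sketch of such a state and the four Arc-Eager transitions, assuming words are indexed by position with 0 as a synthetic root (illustrative, not a full parser):

```python
class ParserState:
    """Arc-Eager parser state: arcs, stack, and buffer (a sketch)."""
    def __init__(self, n_words):
        self.stack = [0]                       # 0 = synthetic root
        self.buffer = list(range(1, n_words))  # remaining word indices
        self.arcs = {}                         # dependent -> (head, label)

    def shift(self):
        self.stack.append(self.buffer.pop(0))

    def reduce(self):
        self.stack.pop()                       # stack top must already have a head

    def arc_left(self, label):
        dep = self.stack.pop()                 # head is the buffer front
        self.arcs[dep] = (self.buffer[0], label)

    def arc_right(self, label):
        dep = self.buffer.pop(0)               # head is the stack top
        self.arcs[dep] = (self.stack[-1], label)
        self.stack.append(dep)

    def is_terminal(self):
        return not self.buffer                 # hence a variable-length sequence
```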
Hamming loss: given predicted arcs, how many parents and labels were incorrectly predicted?
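A possible implementation, assuming predicted and gold arcs are given as dictionaries mapping each dependent to its (head, label) pair:

```python
def hamming_loss(predicted_arcs, gold_arcs):
    """Count tokens whose predicted head or label differs from the gold standard.
    Both arguments map a dependent index to a (head, label) pair (an assumed format)."""
    return sum(1 for dep, gold in gold_arcs.items()
               if predicted_arcs.get(dep) != gold)
```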
Returns the best action at the current state by looking at the gold standard assuming future actions are also optimal:
$$\alpha^{\star}=\pi^{\star}(S_t, \mathbf{y}) = \mathop{\arg \min}_{\alpha \in {\cal A}} L(S_t(\alpha,\pi^{\star}),\mathbf{y})$$
We can derive a static transition sequence from the initial to the terminal state using the gold standard.
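For Arc-Eager, such a static expert can be read off the gold tree directly; the sketch below assumes the `ParserState` above and a hypothetical `gold_heads` map from each word to its gold parent:

```python
def static_expert(state, gold_heads):
    """Static Arc-Eager expert: returns the action implied by the gold tree,
    assuming all previous actions were also optimal (a sketch)."""
    s = state.stack[-1] if state.stack else None
    b = state.buffer[0] if state.buffer else None
    if s is not None and b is not None and gold_heads.get(s) == b:
        return "arc_left"
    if s is not None and b is not None and gold_heads.get(b) == s:
        return "arc_right"
    # Reduce once the stack top has its head and no remaining dependents in the buffer
    if s in state.arcs and all(gold_heads.get(w) != s for w in state.buffer):
        return "reduce"
    return "shift"
```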
Static expert does not take previous actions into account.
Static expert has not encountered this state before.
A static expert may be sufficient for tasks where we do not care whether the previous actions were optimal.
Let's assume that we roll in using the classifier.
Static expert policy cannot recover from errors in the rollin.
Stack: 'her'
Buffer: 'a', 'letter', '.'
Two possible actions: Reduce 'her' / Shift 'a'
The static expert policy chooses arbitrarily.
The dynamic expert determines the best action by considering the previous actions.
Reachable terminal state:
For each possible action at a time-step:
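A rollout-based dynamic expert following the arg-min definition above might look like this; `legal_actions` and `apply_action` are assumed helpers, and `hamming_loss` is the sketch from earlier:

```python
import copy

def dynamic_expert(state, gold_arcs, expert_rollout_policy):
    """Dynamic expert via rollouts (a sketch): for every legal action, apply it,
    roll out to a terminal state with the expert, and keep the lowest-loss action."""
    best_action, best_loss = None, float("inf")
    for action in legal_actions(state):            # assumed helper
        rollout_state = copy.deepcopy(state)
        rollout_state = apply_action(rollout_state, action)  # assumed helper
        # Complete the trajectory with the expert policy
        while not rollout_state.is_terminal():
            rollout_state = apply_action(rollout_state,
                                         expert_rollout_policy(rollout_state))
        loss = hamming_loss(rollout_state.arcs, gold_arcs)
        if loss < best_loss:
            best_action, best_loss = action, loss
    return best_action
```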
Instead of a full expert rollout, we may use heuristics!