Applying Imitation Learning to Natural Language Generation

[Lampouras and Vlachos 2016](https://aclweb.org/anthology/C/C16/C16-1105.pdf)

Natural Language Generation (concept-to-text)

The natural language processing task of generating text from a meaning representation.

\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \color{blue}{\text{type = "hotel"}}\\ & \color{green}{\text{count = "182"}}\\ & \color{red}{\text{dogs_allowed = dont_care}} \end{align}

There are 182 hotels if you do not care whether dogs are allowed.

Transition system?

State: meaning representation, incomplete sentence.

Action space:

  • What to say? Content prediction actions $a^c$, and
  • How to say it? Word prediction actions $a^w$ (sketched below).
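
A minimal sketch of this state and action space, assuming a flat attribute-value MR; the class and function names below are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

# Illustrative state for the NLG transition system: the meaning representation
# plus the partial sentence generated so far.
@dataclass
class State:
    mr: dict                                        # e.g. {"name": "X-name-1", "eattype": "restaurant"}
    expressed: list = field(default_factory=list)   # content actions taken so far (attributes)
    words: list = field(default_factory=list)       # word actions taken so far (the partial sentence)

def content_actions(state: State) -> list:
    """What to say next: attributes of the MR not yet expressed."""
    return [attr for attr in state.mr if attr not in state.expressed]

def word_actions(state: State, vocab: set) -> list:
    """How to say it: candidate words, plus EoS to finish the current attribute."""
    return sorted(vocab) + ["EoS"]
```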

...



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \text{name = X-name-1}\\ & \text{serves = food}\\ & \text{eattype = restaurant}\\ & \text{near = riverside}\\ & \text{area = X-area-1, X-area-2} \end{align}

Content: name



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \color{red}{\textbf{name}} \text{ = X-name-1}\\ & \text{serves = food}\\ & \text{eattype = restaurant}\\ & \text{near = riverside}\\ & \text{area = X-area-1, X-area-2} \end{align}

Content: eattype



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \color{red}{\textbf{name}}\text{ = X-name-1}\\ & \text{serves = food}\\ & \color{red}{\textbf{eattype}}\text{ = restaurant}\\ & \text{near = riverside}\\ & \text{area = X-area-1, X-area-2} \end{align}

and so on



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \color{red}{\textbf{name}}\text{ = X-name-1}\\ & \text{serves = food}\\ & \color{red}{\textbf{eattype}}\text{ = restaurant}\\ & \color{red}{\textbf{near}}\text{ = riverside}\\ & \color{red}{\textbf{area}}\text{ = X-area-1, X-area-2} \end{align}

Word: X-name-1



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \color{red}{\textbf{name = X-name-1}}\\ & \text{serves = food}\\ & \color{red}{\textbf{eattype}}\text{ = restaurant}\\ & \color{red}{\textbf{near}}\text{ = riverside}\\ & \color{red}{\textbf{area}}\text{ = X-area-1, X-area-2} \end{align}

Word: EoS



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \color{red}{\textbf{name = X-name-1}}\\ & \text{serves = food}\\ & \color{red}{\textbf{eattype}}\text{ = restaurant}\\ & \color{red}{\textbf{near}}\text{ = riverside}\\ & \color{red}{\textbf{area}}\text{ = X-area-1, X-area-2} \end{align}

Word: is



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \color{red}{\textbf{name = X-name-1}}\\ & \text{serves = food}\\ & \color{red}{\textbf{eattype}}\text{ = restaurant}\\ & \color{red}{\textbf{near}}\text{ = riverside}\\ & \color{red}{\textbf{area}}\text{ = X-area-1, X-area-2} \end{align}

Word: a



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \color{red}{\textbf{name = X-name-1}}\\ & \text{serves = food}\\ & \color{red}{\textbf{eattype}}\text{ = restaurant}\\ & \color{red}{\textbf{near}}\text{ = riverside}\\ & \color{red}{\textbf{area}}\text{ = X-area-1, X-area-2} \end{align}

Word: restaurant



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \color{red}{\textbf{name = X-name-1}}\\ & \text{serves = food}\\ & \color{red}{\textbf{eattype = restaurant}}\\ & \color{red}{\textbf{near}}\text{ = riverside}\\ & \color{red}{\textbf{area}}\text{ = X-area-1, X-area-2} \end{align}

Word: EoS



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \color{red}{\textbf{name = X-name-1}}\\ & \text{serves = food}\\ & \color{red}{\textbf{eattype = restaurant}}\\ & \color{red}{\textbf{near}}\text{ = riverside}\\ & \color{red}{\textbf{area}}\text{ = X-area-1, X-area-2} \end{align}

and so on



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \color{red}{\textbf{name = X-name-1}}\\ & \text{serves = food}\\ & \color{red}{\textbf{eattype = restaurant}}\\ & \color{red}{\textbf{near = riverside}}\\ & \color{red}{\textbf{area = X-area-1, X-area-2}} \end{align}

Word actions are chosen based on content actions.

  • Content actions precede word actions.
  • The length of the transition sequence is variable.

  • Instead, alternate between content and word actions?

Different classifiers are required for content and word actions.

  • Actually, a different word classifier per content action.
  • Doesn't affect imitation learning!


Action space:
  • Content actions are limited to the attributes in the MR.
  • Word actions are limited to words observed in the gold standard (see the sketch below).
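
One way to organise this is sketched below; `CostSensitiveClassifier` is a hypothetical placeholder for whatever cost-sensitive learner is used.

```python
from collections import defaultdict

class CostSensitiveClassifier:
    """Placeholder for a cost-sensitive classifier: fit on (features, action costs)
    pairs, predict the cheapest action for a feature vector."""
    def fit(self, features, costs): ...
    def predict(self, features): ...

# One classifier chooses the next content action; each content action (attribute)
# gets its own word classifier, created lazily the first time it is needed.
content_classifier = CostSensitiveClassifier()
word_classifiers = defaultdict(CostSensitiveClassifier)

def predict_word(attribute, features):
    # Which word classifier is consulted depends on the chosen content action.
    return word_classifiers[attribute].predict(features)
```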

Loss function?

BLEU: the percentage of predicted n-grams present in the gold standard,
i.e. $L = 1 - \mathrm{BLEU}(s_{final}, \mathbf{y})$

This is the same metric used to evaluate NLG systems.
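
A rough sketch of this loss; a simple modified n-gram precision with a brevity penalty stands in for the exact BLEU variant used in the paper.

```python
import math
from collections import Counter

def ngram_precision(pred, gold, max_n=4):
    """Fraction of predicted n-grams (n = 1..max_n) that appear in the gold standard."""
    matched = total = 0
    for n in range(1, max_n + 1):
        pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
        gold_ngrams = Counter(tuple(gold[i:i + n]) for i in range(len(gold) - n + 1))
        matched += sum(min(count, gold_ngrams[g]) for g, count in pred_ngrams.items())
        total += sum(pred_ngrams.values())
    return matched / total if total else 0.0

def loss(final_sentence, gold_sentence):
    """L = 1 - BLEU(s_final, y); BLEU is approximated here by n-gram precision
    with a brevity penalty for predictions shorter than the gold standard."""
    pred, gold = final_sentence.split(), gold_sentence.split()
    if not pred:
        return 1.0
    brevity = math.exp(1 - len(gold) / len(pred)) if len(pred) < len(gold) else 1.0
    return 1.0 - brevity * ngram_precision(pred, gold)
```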

Content actions are ignored by the loss function.

  • They are costed indirectly, through their impact on word actions.

No explicit supervision on which attribute each word is aligned to.

  • Evaluate how good the complete final sentence is.

The loss function also penalizes undesirable behaviour:

  • Repeating the same word,
  • predicting content not present in the MR,
  • etc.

Expert policy?

For word actions:

  • Choose the word that makes the sequence best match the gold standard.
  • Only consider words that appear in the gold standard.
  • Best match = minimizing the loss function.

Gold standard: "X-name-1 is at the side of the river"
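
A greedy sketch of this word-action expert; the paper's expert may compute the best match more carefully, and ties (e.g. for the very first word) would be resolved by the alignments discussed below.

```python
def expert_word_action(partial_sentence, gold_sentence, loss_fn):
    """Greedy word-action expert: consider only words from the gold standard (plus
    EoS) and pick the one that minimizes the loss of the resulting sequence."""
    candidates = set(gold_sentence.split()) | {"EoS"}
    def cost(word):
        extended = partial_sentence if word == "EoS" else f"{partial_sentence} {word}".strip()
        return loss_fn(extended, gold_sentence)
    return min(candidates, key=cost)

# e.g. with the loss sketch above:
# expert_word_action("X-name-1", "X-name-1 is at the side of the river", loss) -> "is"
```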

For content actions:

  • which content actions to include,
  • and in what order.

Need alignments between the MR and the words in the gold standard.

  • The loss requires no explicit supervision on alignments.
  • We can employ heuristics on the gold standard (see the sketch below)!
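
For illustration only, a naive alignment heuristic (an assumption, not necessarily the one used in the paper): align each attribute to the gold-standard words that match its value string.

```python
def heuristic_alignment(mr, gold_sentence):
    """Naive heuristic: align each attribute to the positions of gold-standard
    words whose surface form matches the attribute's value."""
    gold = gold_sentence.split()
    return {attr: [i for i, word in enumerate(gold)
                   if word.lower() == str(value).lower()]
            for attr, value in mr.items()}

# e.g. heuristic_alignment({"name": "X-name-1"}, "X-name-1 is at the side of the river")
# returns {"name": [0]}
```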

\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \color{blue}{\text{type = "hotel"}}\\ & \color{green}{\text{count = "182"}}\\ & \color{red}{\text{dogs_allowed = dont_care}} \end{align}

There are 182 hotels if you do not care whether dogs are allowed.

Suboptimal expert policy

The expert policy depends on naive heuristics.

  • mistakes happen...

\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \color{blue}{\text{type = "hotel"}}\\ & \color{green}{\text{count = "182"}}\\ & \color{red}{\text{dogs_allowed = false}} \end{align}

There are 182 hotels if you are traveling without animals.

Dealing with suboptimality

Imitating the suboptimal expert policy is not enough.

  • We should not rely heavily on the expert policy during rollin/rollouts.
  • e.g. DAgger uses the expert exclusively to cost actions.


We need to learn a classifier that is better than the suboptimal expert.
  • Use rollouts that mix the classifier and the expert policy.
  • We will use LOLS here.

Locally Optimal Learning to Search

\begin{align} & \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; expert\; \pi^{\star}, \; loss \; function \; L\\ & \text{set} \; training\; examples\; \cal E = \emptyset\\ & \color{red}{\text{initialize a classifier } H_{0}}\\ & \mathbf{for}\; i = 0 \;\mathbf{to} \; N\; \mathbf{do}\\ & \quad \color{red}{\text{set} \; rollin \; policy \; \pi^{in} = H_i}\\ & \quad \color{red}{\text{set} \; rollout \; policy \; \pi^{out} = mix(H_i,\pi^{\star})}\\ & \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\ & \quad \quad \text{rollin to predict} \; \hat \alpha_1\dots\hat \alpha_T = \pi^{in}(\mathbf{x},\mathbf{y})\\ & \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\ & \quad \quad \quad \text{rollout to obtain costs}\; c \; \text{for all possible actions using}\; L\; \\ & \quad \quad \quad \text{extract features}\; f=\phi(\mathbf{x},S_{t-1}) \\ & \quad \quad \quad \cal E = \cal E \cup (f,c)\\ & \quad \color{red}{\text{learn classifier} \; H_{i+1} \; \text{from}\; \cal E}\\ & \color{red}{H = \; \mathbf{avg}\{H_{0}\mathbf{\dots}H_{N}\}}\\ \end{align}
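
A compact Python sketch of this loop; every helper passed in (rollin, possible actions, rollout costing, feature extraction, classifier training, averaging, mixing) is a hypothetical interface, not code from the paper.

```python
def lols(train_data, expert, loss_fn, n_iterations,
         initial_classifier, rollin, possible_actions, rollout_cost,
         extract_features, train_classifier, average, mix):
    """Sketch of LOLS: rollin with the current classifier, rollout with a mixture
    of classifier and expert to obtain a cost for every possible action."""
    examples = []                                    # accumulated (features, costs) pairs
    policies = [initial_classifier()]                # H_0
    for i in range(n_iterations):
        H = policies[-1]
        rollin_policy = H                            # rollin with the current classifier
        rollout_policy = mix(H, expert, i)           # rollout mixes classifier and expert
        for x, y in train_data:
            states = rollin(rollin_policy, x)        # states visited along \hat a_1 ... \hat a_T
            for state in states:
                costs = {a: rollout_cost(rollout_policy, state, a, y, loss_fn)
                         for a in possible_actions(state)}
                examples.append((extract_features(x, state), costs))
        policies.append(train_classifier(examples))  # H_{i+1}
    return average(policies)                         # H = avg{H_0 ... H_N}
```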

Performing the rollin



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \text{name = X-name-1}\\ & \text{eattype = restaurant}\\ & \text{food = chinese} \end{align}

Performing the rollin



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \text{name = X-name-1}\\ & \text{eattype = restaurant}\\ & \text{food = chinese} \end{align}

Explore actions



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \text{name = X-name-1}\\ & \text{eattype = restaurant}\\ & \text{food = chinese} \end{align}

Explore actions



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \text{name = X-name-1}\\ & \text{eattype = restaurant}\\ & \text{food = chinese} \end{align}

Performing the rollout



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \text{name = X-name-1}\\ & \text{eattype = restaurant}\\ & \text{food = chinese} \end{align}

Performing the rollout



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \text{name = X-name-1}\\ & \text{eattype = restaurant}\\ & \text{food = chinese} \end{align}

Performing the rollout



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \text{name = X-name-1}\\ & \text{eattype = restaurant}\\ & \text{food = chinese} \end{align}

Performing the rollout



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \text{name = X-name-1}\\ & \text{eattype = restaurant}\\ & \text{food = chinese} \end{align}
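
The rollout step used above could look like the sketch below; `apply_action`, `is_final` and `sentence_of` are assumed transition-system helpers.

```python
def rollout_cost(rollout_policy, state, action, gold, loss_fn, max_steps=100):
    """Cost of taking `action` in `state`: apply it, let the rollout policy complete
    the sequence, then cost the action by the loss of the finished sentence."""
    s = apply_action(state, action)                  # assumed helper: apply one action
    for _ in range(max_steps):
        if is_final(s):                              # assumed helper: sequence finished?
            break
        s = apply_action(s, rollout_policy(s))
    return loss_fn(sentence_of(s), gold)             # assumed helper: words of the state
```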

LOLS modifications

Exponential decay schedule
Introduced in SEARN (Daumé III et al., 2009)

$\beta = (1 - p)^i$

  • Gradually move away from the (suboptimal) expert (sketched below).
  • The costs are computed under the classifiers' own predictions, teaching them to predict actions jointly.
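
A small sketch of the mixture, assuming $\beta = (1 - p)^i$ is the probability of consulting the expert at iteration $i$ and $p$ is a decay rate; the exact parameterisation is an assumption.

```python
import random

def mix(classifier, expert, iteration, p=0.3):
    """Stochastic mixture policy: consult the expert with probability (1 - p)^iteration,
    otherwise use the current classifier; the expert's influence decays exponentially."""
    beta = (1 - p) ** iteration
    def policy(state):
        return expert(state) if random.random() < beta else classifier(state)
    return policy
```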

LOLS uses the same policy throughout each rollout.

  • But it may use a different policy for each action examined at the same time-step.
  • Here, we also use the same policy for all actions examined at the same time-step.

Rollin noise

Rollin with the classifiers:

  • exposes them to their own actions,
  • so that they learn to recover from errors.


But it may introduce noise into the feature vectors of later time-steps.

  • Especially for tasks that are context-dependent.

Sequence correction



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \text{name = X-name-1}\\ & \text{dogs_allowed = yes} \end{align}

Sequence correction



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \text{name = X-name-1}\\ & \text{dogs_allowed = yes} \end{align}

Sequence correction



\begin{align} & \text{Predicate: INFORM}\\ & \text{______________________}\\ & \text{name = X-name-1}\\ & \text{dogs_allowed = yes} \end{align}

After a suboptimal action, we apply sequence correction:

  • Correct all already explored time-steps using $\pi^{\star}$.





After a suboptimal action, we apply sequence correction:

  • Correct all already explored time-steps using $\pi^{\star}$.
  • And re-rollin the rest of the sequence using $H$.




After a suboptimal action, we apply sequence correction:

  • Correct all already explored time-steps using $\pi^{\star}$.
  • And re-rollin the rest of the sequence using $H$.
  • i.e. expose classifier to its actions again.

If suboptimal actions are encountered later in the new sequence, sequence correction may be performed again.

Before applying sequence correction, allow at most $E$ actions after the suboptimal one (see the sketch below).

  • i.e. give the classifiers a chance to learn how to recover from mistakes.
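
A sketch of rollin with sequence correction; `apply_action` and `is_final` are assumed transition-system helpers, and a suboptimal action is detected here simply as one the expert disagrees with.

```python
def rollin_with_correction(classifier, expert, initial_state, E, max_steps=100):
    """Rollin with the classifier; once a suboptimal action is taken, allow at most E
    further actions, then correct the explored prefix with the expert policy and
    re-rollin with the classifier (exposing it to its own actions again)."""
    state, explored, since_mistake = initial_state, 0, None
    for _ in range(max_steps):
        if is_final(state):
            break
        action = classifier(state)
        if since_mistake is None and action != expert(state):
            since_mistake = 0                        # first suboptimal action encountered
        elif since_mistake is not None:
            since_mistake += 1
        state, explored = apply_action(state, action), explored + 1
        if since_mistake is not None and since_mistake >= E:
            # Sequence correction: redo all already-explored time-steps with the
            # expert policy, then continue (re-rollin) with the classifier.
            state, since_mistake = initial_state, None
            for _ in range(explored):
                state = apply_action(state, expert(state))
    return state
```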

NLG dev results

[Lampouras and Vlachos 2016](https://aclweb.org/anthology/C/C16/C16-1105.pdf)



Comparison against state of the art

[Lampouras and Vlachos 2016](https://aclweb.org/anthology/C/C16/C16-1105.pdf)