Semantic parsers map natural language to meaning representations.
We consider a dependency tree as input.
State: nodes, arcs, $\sigma$ and $\beta$ stacks.
The length of the transition sequence is variable.
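As a minimal illustration of such a state, a sketch in Python (the class and field names are illustrative, not taken from a specific parser implementation):

```python
from dataclasses import dataclass, field

@dataclass
class ParserState:
    """Transition-system state: the graph built so far plus the two work lists."""
    nodes: list = field(default_factory=list)   # concept nodes created so far
    arcs: list = field(default_factory=list)    # (head, label, dependent) triples
    sigma: list = field(default_factory=list)   # sigma: nodes still to be processed
    beta: list = field(default_factory=list)    # beta: the second work list of the transition system

    def is_final(self) -> bool:
        # Parsing stops when nothing is left to process, which is why the
        # length of the transition sequence varies from sentence to sentence.
        return not self.sigma and not self.beta
```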
Smatch (Cai and Knight, 2013)
Even naive Smatch employs heuristics: finding the best variable mapping between the two graphs is intractable, so it is approximated by greedy hill-climbing with random restarts.
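For one fixed variable mapping, Smatch reduces to an F-score over matching triples; the full metric repeats this while hill-climbing over mappings. A sketch of the scoring step (the mapping search itself is omitted, and the names are illustrative):

```python
def smatch_f1(pred_triples, gold_triples):
    """F1 over instance/relation triples for one fixed variable mapping.
    Full Smatch maximises this over mappings via greedy hill-climbing
    with random restarts (Cai and Knight, 2013)."""
    matched = len(set(pred_triples) & set(gold_triples))
    precision = matched / len(pred_triples) if pred_triples else 0.0
    recall = matched / len(gold_triples) if gold_triples else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```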
The expert policy is a set of heuristic rules based on alignments between nodes in the dependency tree and the AMR graph.
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)\dots(\mathbf{x}^M,\mathbf{y}^M)\}, \; \text{expert} \; \pi^{\star}, \; \text{loss function} \; L, \\
& \quad\quad\quad \text{learning rate} \; p \\
& \text{set training examples} \; \mathcal{E} = \emptyset \\
& \mathbf{while} \; \text{termination condition not reached} \; \mathbf{do} \quad \text{(iteration } i\text{)} \\
& \quad \color{red}{\beta = (1 - p)^i} \\
& \quad \color{red}{\text{set rollin/out policy} \; \pi^{in/out} = (1-\beta) H + \beta \pi^{\star}} \\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do} \\
& \quad\quad \text{rollin to predict} \; \hat\alpha_1 \dots \hat\alpha_T = \pi^{in/out}(\mathbf{x},\mathbf{y}) \\
& \quad\quad \mathbf{for} \; \hat\alpha_t \in \hat\alpha_1 \dots \hat\alpha_T \; \mathbf{do} \\
& \quad\quad\quad \mathbf{for} \; \alpha \in \mathcal{A} \; \mathbf{do} \\
& \quad\quad\quad\quad \text{rollout} \; S_{final} = \pi^{in/out}(S_{t-1}, \alpha, \mathbf{x}) \\
& \quad\quad\quad\quad \text{cost} \; c_{\alpha} = L(S_{final}, \mathbf{y}) \\
& \quad\quad\quad \text{extract} \; features = \phi(\mathbf{x}, S_{t-1}) \\
& \quad\quad\quad \mathcal{E} = \mathcal{E} \cup \{(features, \mathbf{c})\} \\
& \quad \text{learn classifier} \; H \; \text{from} \; \mathcal{E}
\end{align}
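A compact Python sketch of this training loop. The transition-system interface `env` (with `initial`, `actions`, `step`, `is_final`), the `expert` policy, feature function `phi`, loss `loss` (e.g. 1 − Smatch) and cost-sensitive classifier `clf` are all assumed interfaces, not the API of any particular parser:

```python
import random

def train(D_train, env, expert, loss, phi, clf, p=0.1, iterations=10):
    """DAgger/LOLS-style training with cost-sensitive classification examples."""
    examples = []                                  # the aggregated example set E
    for i in range(iterations):
        beta = (1 - p) ** i                        # probability of following the expert

        def policy(state, x, y):
            # step-level mixture of the expert and the current classifier
            if random.random() < beta:
                return expert(state, y)
            return clf.predict(phi(x, state))

        def rollout(state, x, y):
            # complete the transition sequence with the mixed policy
            while not env.is_final(state):
                state = env.step(state, policy(state, x, y))
            return state

        for x, y in D_train:
            # roll-in: record the states visited by the mixed policy
            state, visited = env.initial(x), []
            while not env.is_final(state):
                visited.append(state)
                state = env.step(state, policy(state, x, y))
            # one-step deviations: roll out every alternative action
            for s in visited:
                costs = {a: loss(rollout(env.step(s, a), x, y), y)
                         for a in env.actions(s)}
                examples.append((phi(x, s), costs))  # cost-sensitive example
        clf.fit(examples)                            # learn classifier H from E
    return clf
```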
Rollouts are expensive: 50-200 time-steps per sentence, with $10^3$ to $10^4$ candidate actions at each step.
Partial exploration is used by SCB-LOLS ([Chang et al., 2015](https://arxiv.org/pdf/1502.02206.pdf)).
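One plausible reading of partial exploration, sketched as choosing a single roll-in time-step per sentence to deviate from (the helper name is illustrative):

```python
import random

def choose_deviation_step(visited_states):
    """Partial exploration (sketch): rather than rolling out alternative
    actions at every one of the 50-200 roll-in time-steps, sample one
    time-step uniformly and only evaluate the deviations there."""
    return random.choice(visited_states)
```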
Targeted exploration is used by [Goodman et al. 2016](http://aclweb.org/anthology/P16-1001).
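A hedged sketch of one way to realise targeted exploration: roll out only for the actions the current classifier already scores highly, always keeping the expert's action. The function and the cutoff `k` are illustrative, not the exact criterion of Goodman et al.:

```python
def targeted_actions(scores, expert_action, k=5):
    """Targeted exploration (sketch): instead of rolling out all 10^3-10^4
    actions, keep only the k highest-scoring ones under the current
    classifier, plus the expert's action."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    if expert_action not in top_k:
        top_k.append(expert_action)
    return top_k
```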
V-DAgger and SEARN mix the learned and expert policies at the level of individual time-steps during roll-out.
This step-level stochasticity causes high variance in the training signal.
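For concreteness, a sketch of the step-level mixture (names are illustrative); the independent coin flip at every one of the 50-200 steps of a roll-out is what makes the resulting costs noisy:

```python
import random

def mixed_policy_step(beta, expert_action, learned_action):
    # Step-level mixing: flip a coin independently at *every* time-step.
    # Two roll-outs from the same state can therefore follow very different
    # action sequences, which makes the estimated costs high-variance.
    return expert_action if random.random() < beta else learned_action
```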
Introduced by Vlachos and Craven (2011).
Introduced by Khardon and Wachman (2007).
Reduce training noise by ignoring noisy training instances: during training, if the classifier makes more than $\alpha$ mistakes on a training instance, that instance is treated as noise and dropped from subsequent training (the $\alpha$-bound).
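A minimal sketch of this filter, assuming the surrounding training loop maintains a per-instance `mistake_counts` tally (all names here are illustrative):

```python
def alpha_bound_filter(examples, mistake_counts, alpha=3):
    """alpha-bound (Khardon and Wachman, 2007): drop training instances that
    the classifier has misclassified more than alpha times so far, treating
    them as noise rather than useful signal."""
    return [ex for i, ex in enumerate(examples) if mistake_counts[i] <= alpha]
```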