Enhanced LSTM for Natural Language Inference
https://arxiv.org/pdf/1609.06038v3.pdf
Related Work
- Enhancing sequential inference models based on chain networks
- Further, considering recursive architectures to encode syntactic parsing information
Hybrid Neural Inference Models
Major components
- input encoding, local inference modeling, inference composition
- ESIM (sequential NLI model), tree-LSTM (incorporating syntactic parsing information)
Notation
- Two sentences: $a = (a_1, ..., a_{l_a})$ and $b = (b_1, ..., b_{l_b})$
- Embeddings are $l$-dimensional vectors: $a_i, b_j \in \mathbb{R}^l$
- $\bar{a}_i$: generated by the BiLSTM at time $i$ over the input sequence $a$
Goal
- Predict a label $y$ that indicates the logical relationship between $a$ and $b$
Input Encoding
- Use a BiLSTM to encode the input premise and hypothesis
- The hidden states produced by the two LSTMs at each time step are concatenated to represent that time step and its context
- Encode the syntactic parse trees of the premise and hypothesis with a tree-LSTM
- Each tree node is deployed with a tree-LSTM memory block
- At each node, an input vector $x_t$ and the hidden vectors of its two child nodes ($h^L_{t-1}$ and $h^R_{t-1}$) are taken as input to compute the current node's hidden vector $h_t$
- Detailed computation (a sketch of this node update follows the list):
    - $h_t = \text{TrLSTM}(x_t, h^L_{t-1}, h^R_{t-1})$
    - $h_t = o_t \odot \tanh(c_t)$
    - $o_t = \sigma(W_o x_t + U^L_o h^L_{t-1} + U^R_o h^R_{t-1})$
    - $c_t = f^L_t \odot c^L_{t-1} + f^R_t \odot c^R_{t-1} + i_t \odot u_t$
    - $f^L_t = \sigma(W_f x_t + U^{LL}_f h^L_{t-1} + U^{LR}_f h^R_{t-1})$
    - $f^R_t = \sigma(W_f x_t + U^{RL}_f h^L_{t-1} + U^{RR}_f h^R_{t-1})$
    - $i_t = \sigma(W_i x_t + U^L_i h^L_{t-1} + U^R_i h^R_{t-1})$
    - $u_t = \tanh(W_c x_t + U^L_c h^L_{t-1} + U^R_c h^R_{t-1})$
    - All $W \in \mathbb{R}^{d \times l}$ and $U \in \mathbb{R}^{d \times d}$ are weight matrices to be learned
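A minimal NumPy sketch of this tree-LSTM node update, with toy dimensions and random weights (illustrative only, not the authors' implementation):

```python
import numpy as np

d, l = 4, 3                      # hidden size d and input size l (toy values)
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Weight matrices W ∈ R^{d×l} and U ∈ R^{d×d}, as in the formulas above.
W = {g: rng.normal(size=(d, l)) for g in ("o", "f", "i", "c")}
U = {g: rng.normal(size=(d, d))
     for g in ("oL", "oR", "fLL", "fLR", "fRL", "fRR", "iL", "iR", "cL", "cR")}

def tr_lstm_node(x_t, hL, cL, hR, cR):
    """One tree-LSTM node: combine the two children's (h, c) states with input x_t."""
    o_t = sigmoid(W["o"] @ x_t + U["oL"] @ hL + U["oR"] @ hR)
    fL  = sigmoid(W["f"] @ x_t + U["fLL"] @ hL + U["fLR"] @ hR)
    fR  = sigmoid(W["f"] @ x_t + U["fRL"] @ hL + U["fRR"] @ hR)
    i_t = sigmoid(W["i"] @ x_t + U["iL"] @ hL + U["iR"] @ hR)
    u_t = np.tanh(W["c"] @ x_t + U["cL"] @ hL + U["cR"] @ hR)
    c_t = fL * cL + fR * cR + i_t * u_t
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Usage: one node with a random input vector and random child states.
h_t, c_t = tr_lstm_node(rng.normal(size=l), *(rng.normal(size=d) for _ in range(4)))
```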
Local Inference Modeling
Locality of inference
- Employ some forms of hard or soft alignment to associate the relevant subcomponents between a premise and a hypothesis
- Argue for leveraging attention over the bidirectional sequential encoding of the input
- A soft alignment layer computes the attention weights as the similarity of a hidden state tuple $<\bar a_i, \bar b_j>$ between a premise and a hypothesis with $e_{ij} = \bar{a}_i^T \bar{b}_j$
- use bidirectional LSTM and tree-LSTM to encode the premise and hypothesis
- In sequential inference model, use BiLSTM
Local inference collected over sequences
- Local inference is determined by the attention weight $e_{ij}$, which is used to obtain the local relevance between the premise and the hypothesis
- The content in $\{\bar b_j\}^{l_b}_{j=1}$ that is relevant to $\bar a_i$ will be selected and represented as $\tilde a_i$ (a sketch of this soft alignment follows the equations below):
- $\tilde a_i = \sum\limits_{j=1}^{l_b} \frac{\exp(e_{ij})}{\sum^{l_b}_{k=1}\exp(e_{ik})} \bar b_j, \quad \forall i \in [1,...,l_a]$
- $\tilde b_j = \sum\limits_{i=1}^{l_a} \frac{\exp(e_{ij})}{\sum^{l_a}_{k=1}\exp(e_{kj})} \bar a_i, \quad \forall j \in [1,...,l_b]$
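A minimal NumPy sketch of this soft alignment, assuming the premise and hypothesis encoder states are stacked as matrices `a_bar` ($l_a \times d$) and `b_bar` ($l_b \times d$); shapes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
l_a, l_b, d = 5, 7, 4
a_bar = rng.normal(size=(l_a, d))    # premise encoder states
b_bar = rng.normal(size=(l_b, d))    # hypothesis encoder states

# Attention weights e_ij = ā_i^T b̄_j for every premise/hypothesis position pair
e = a_bar @ b_bar.T                  # shape (l_a, l_b)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    x = np.exp(x)
    return x / x.sum(axis=axis, keepdims=True)

# ã_i: weighted sum of b̄ (softmax over j); b̃_j: weighted sum of ā (softmax over i)
a_tilde = softmax(e, axis=1) @ b_bar     # shape (l_a, d)
b_tilde = softmax(e, axis=0).T @ a_bar   # shape (l_b, d)
```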
Local inference collected over parse trees
- Compute the difference and the element-wise product for the tuple $<\bar a, \tilde a>$ as well as for $<\bar b, \tilde b>$
- The difference and element-wise product are then concatenated with the original vectors:
- $m_a = [\bar a; \tilde a; \bar a - \tilde a; \bar a \odot \tilde a]$
- $m_b = [\bar b; \tilde b; \bar b - \tilde b; \bar b \odot \tilde b]$
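The enhancement step is a plain concatenation along the feature dimension; a small NumPy sketch with toy shapes and a hypothetical `enhance` helper:

```python
import numpy as np

def enhance(x_bar, x_tilde):
    """Build m = [x̄; x̃; x̄ - x̃; x̄ ⊙ x̃] along the feature dimension."""
    return np.concatenate([x_bar, x_tilde, x_bar - x_tilde, x_bar * x_tilde], axis=-1)

# Toy shapes (sequence length, hidden size d); in ESIM these come from the alignment step above.
rng = np.random.default_rng(0)
a_bar, a_tilde = rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
m_a = enhance(a_bar, a_tilde)   # shape (5, 16): 4d features per position
```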
Inference Composition
- Explore a composition layer to compose the enhanced local inference information $m_a$ and $m_b$
The composition layer
- In the sequential inference model, use a BiLSTM to compose local inference information sequentially
- The BiLSTM formulas are used here to capture the local inference information $m_a$ and $m_b$ and their context for inference composition
- In the tree composition, the following tree node update composes the local inference:
- $v_{a,t} = \text{TrLSTM}(F(m_{a,t}), h^L_{t-1}, h^R_{t-1})$
- $v_{b,t} = \text{TrLSTM}(F(m_{b,t}), h^L_{t-1}, h^R_{t-1})$
- $F$ is a 1-layer feedforward neural network with ReLU activation; it is also applied to the input of the BiLSTM in the sequential inference composition (see the sketch below)
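A minimal sketch of the projection $F$ (1-layer feedforward network with ReLU) applied to each enhanced vector before it enters the composition BiLSTM or tree-LSTM; the output dimension, weights, and sizes here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                # hidden size (illustrative)
W_F = rng.normal(size=(d, 4 * d))    # projects the 4d-dimensional m_{a,t} back to d dimensions (assumed)
b_F = np.zeros(d)

def F(m_t):
    """1-layer feedforward network with ReLU, applied to each m_{a,t} / m_{b,t}."""
    return np.maximum(0.0, W_F @ m_t + b_F)

# F(m_{a,t}) is then fed to the composition BiLSTM (sequential model) or to the
# tree-LSTM node update tr_lstm_node(...) sketched earlier (tree model).
x_t = F(rng.normal(size=4 * d))      # composition-layer input at position/node t
```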
Pooling
- Convert the resulting vectors obtained above into a fixed-length vector with pooling and feed it to the final classifier to determine the overall inference relationship
- Compute both average and max pooling, and concatenate all these vectors to form the final fixed-length vector $v$:
- $v_{a,ave} = \sum\limits_{i=1}^{l_a} \frac{v_{a,i}}{l_a}, \quad v_{a,max} = \max\limits_{i=1}^{l_a} v_{a,i}$
- $v_{b,ave} = \sum\limits_{j=1}^{l_b} \frac{v_{b,j}}{l_b}, \quad v_{b,max} = \max\limits_{j=1}^{l_b} v_{b,j}$
- $v = [v_{a,ave}; v_{a,max}; v_{b,ave}; v_{b,max}]$
- Put $v$ into a final multilayer perceptron (MLP) classifier
- Use multi-class cross-entropy loss
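A combined NumPy sketch of the pooling, a small MLP classifier (one tanh hidden layer with a softmax output), and the multi-class cross-entropy loss; all shapes, weights, and the gold label are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
l_a, l_b, d, n_classes = 5, 7, 4, 3
v_a = rng.normal(size=(l_a, d))      # composed vectors for the premise
v_b = rng.normal(size=(l_b, d))      # composed vectors for the hypothesis

# Average and max pooling over each sequence, concatenated into the fixed-length v
v = np.concatenate([v_a.mean(axis=0), v_a.max(axis=0),
                    v_b.mean(axis=0), v_b.max(axis=0)])      # shape (4d,)

# Toy MLP classifier: one tanh hidden layer, softmax output
W1, b1 = rng.normal(size=(d, 4 * d)), np.zeros(d)
W2, b2 = rng.normal(size=(n_classes, d)), np.zeros(n_classes)
logits = W2 @ np.tanh(W1 @ v + b1) + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Multi-class cross-entropy for a single example with gold label index y
y = 1
loss = -np.log(probs[y])
```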