Attention is a mechanism that learns to align input features with a given output element, and then combines the input features according to the learnt alignment to produce that output.
In a general seq2seq task:
$X = [x_1, x_2, \dots, x_n]$ →
encoder → $[h_1, h_2, \dots, h_n]$ →
encoded context ($c$) →
decoder → $[s_1, s_2, \dots, s_m]$ →
$Y = [y_1, y_2, \dots, y_m]$
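As a rough illustration of these shapes, here is a minimal NumPy sketch; the sizes, the toy tanh "encoder", and the choice of the last encoder state as the fixed context are all assumptions for illustration only:

```python
import numpy as np

# Shape sketch of the pipeline above (illustrative values only):
# n = input length, m = output length, d = hidden size.
n, m, d = 5, 3, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))               # embedded inputs x_1 ... x_n
H = np.tanh(X @ rng.normal(size=(d, d)))  # toy "encoder": hidden states h_1 ... h_n
c = H[-1]                                 # classic fixed context: last encoder state
S = rng.normal(size=(m, d))               # decoder hidden states s_1 ... s_m
print(H.shape, c.shape, S.shape)          # (5, 8) (8,) (3, 8)
```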
An attention model ($W_{attention}$) can be applied to the encoded input sequence to build an encoded context that represents the sequence as a whole, rather than relying on the RNN-processed state alone (which only carries information from the time steps up to the current word).
To produce a given output $y_t$, it takes all the encoded hidden states $[h_1, h_2, \dots, h_n]$ (from an RNN, an LSTM, or perhaps a bi-directional variant) together with the previous decoder hidden state $s_{t-1}$, and produces a set of weights $A_t = [a_{1t}, a_{2t}, \dots, a_{nt}]$, one for each input time step with respect to the output time step $t$.
Where:
Decoder hidden state: $s_t = f(s_{t-1}, y_{t-1}, c_t)$;
Alignment weight: $a_{it} = \frac{\exp(W_{attention}(h_i, s_{t-1}))}{\sum_{i'=1}^{n} \exp(W_{attention}(h_{i'}, s_{t-1}))}$.
The hidden context for $y_t$ can then be calculated as:
$c_t = \sum_{i=1}^{n} a_{it} h_i$.
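Putting the two formulas together, here is a minimal NumPy sketch of a single attention step; the linear scoring function (a dot product of a weight vector with the concatenation $[h_i; s_{t-1}]$) is an assumed stand-in for the learned $W_{attention}$ above:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # subtract max for numerical stability
    return np.exp(z) / np.exp(z).sum()

def attention_context(H, s_prev, W):
    """One attention step for output time step t.

    H: (n, d) encoder hidden states h_1 ... h_n
    s_prev: (d,) previous decoder hidden state s_{t-1}
    W: (2d,) parameters of a simple linear score, an assumed stand-in
       for the learned W_attention in the text.
    """
    # score(h_i, s_{t-1}): dot product of W with the concatenation [h_i; s_{t-1}]
    scores = np.array([W @ np.concatenate([h_i, s_prev]) for h_i in H])
    a = softmax(scores)                   # alignment weights a_{it}, sum to 1
    c = a @ H                             # context c_t = sum_i a_{it} * h_i
    return a, c

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))               # h_1 ... h_5
s_prev = rng.normal(size=8)               # s_{t-1}
W = rng.normal(size=16)                   # 2 * d = 16
a, c = attention_context(H, s_prev, W)
print(a.round(3), a.sum(), c.shape)       # weights, 1.0, (8,)
```

The softmax guarantees the weights are positive and sum to 1, so $c_t$ is a convex combination of the encoder hidden states, weighted by how well each input step aligns with the current output step.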