Self Attention

Sun, Aug 4, 2024
3-minute read

Basic concept for the self-attention mechanism.

Prerequisite

Introduction

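Self-attention is designed for inputs such as a sentence or an audio signal, that is, information that can be represented as a sequence of vectors. The self-attention formula describes how to compute the relevance of each token to the other tokens in the sequence, and how to use those relevance scores to update each token's representation.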

How does self-attention work?

Q: Query (what information do I need?)
K: Key (what information can I provide?)
V: Value (the actual information I hold)

In self-attention, the queries, keys, and values are all derived from the same input sequence, so the sequence attends to itself.

Self-Attention Mechanism

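Complete mathematical computation

Generate the Query, Key, and Value matrices
Compute the attention scores
Apply softmax normalization
Take the weighted sum to obtain the output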

Formula

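The main formulas of self-attention are as follows:

Query, Key, and Value matrices:
- First, generate the Query ($ Q $), Key ($ K $), and Value ($ V $) matrices from the input matrix $ X $:
  $$ Q = XW_Q , \quad K = XW_K , \quad V = XW_V $$
  where $ W_Q $, $ W_K $, and $ W_V $ are learned weight matrices.
Attention Scores:
- Compute the dot product of Query and Key, scaled by $\sqrt{d_k}$, to obtain the attention scores:
  $$ \text{Attention Scores} = \frac{QK^T}{\sqrt{d_k}} $$
  where $ d_k $ is the dimension of the Key vectors. Dividing by $\sqrt{d_k}$ is a scaling factor that keeps the dot products from growing too large.
Softmax:
- Apply softmax normalization to each row of the attention scores to obtain the attention weights:
  $$ \text{Attention Weights} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) $$
Weighted sum:
- Use the attention weights to take a weighted sum of the Values, giving the output matrix:
  $$ \text{Output} = \text{Attention Weights} \cdot V $$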
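The role of the $\sqrt{d_k}$ scaling factor can be checked numerically. The sketch below is only an illustration under an assumption not stated in the post: query and key entries are drawn as independent standard normal values. The helper name `score_std` and the sample sizes are hypothetical. Under that assumption the raw dot products have a standard deviation of roughly $\sqrt{d_k}$, while the scaled ones stay near 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_std(d_k, n=10000):
    """Return the std of raw and scaled dot products q.k for random q, k.

    Assumes unit-variance, independent entries (an illustrative assumption).
    """
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    raw = np.sum(q * k, axis=-1)      # one dot product per row
    scaled = raw / np.sqrt(d_k)       # the scaling used in the formula
    return raw.std(), scaled.std()

for d_k in (4, 64, 512):
    raw_std, scaled_std = score_std(d_k)
    print(d_k, round(float(raw_std), 2), round(float(scaled_std), 2))
```

Without the scaling, larger key dimensions push the scores further apart, and softmax then puts almost all of its weight on a single position.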

Combining all the steps, the self-attention computation can be written as:

$$ \text{SelfAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \cdot V $$
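The combined formula can be sketched as a single NumPy function that starts from the input $ X $ and the three weight matrices. This is a minimal sketch, not a reference implementation: the function name `self_attention`, the sizes `T`, `D`, `d_k`, and the random weights (standing in for learned ones) are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax; subtracting the row maximum avoids overflow
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one sequence X of shape (T, D)."""
    Q = X @ W_Q                          # (T, d_k)
    K = X @ W_K                          # (T, d_k)
    V = X @ W_V                          # (T, d_k)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T, T) attention scores
    weights = softmax(scores)            # each row sums to 1
    return weights @ V, weights

# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(0)
T, D, d_k = 4, 8, 6
X = rng.standard_normal((T, D))
W_Q = rng.standard_normal((D, d_k))
W_K = rng.standard_normal((D, d_k))
W_V = rng.standard_normal((D, d_k))

output, weights = self_attention(X, W_Q, W_K, W_V)
print(output.shape)            # (4, 6)
print(weights.sum(axis=-1))    # every row of the weights sums to 1
```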

Program

Multiply the query matrix by the transposed key matrix to get the scores
Apply softmax to the score matrix to get the attention probabilities
Multiply the attention probability matrix by the value matrix

Implementation

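import numpy as np

# Convert each row of scores into a probability distribution
def soft_max(z):
    # Subtract the row maximum before exponentiating for numerical stability
    t = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return t / np.sum(t, axis=-1, keepdims=True)

# attention demo
print("attention demo")
Query = np.array([
    [1,0,2],
    [2,2,2],
    [2,1,3]
])

Key = np.array([
    [0,1,1],
    [4,4,0],
    [2,3,1]
])

Value = np.array([
    [1,2,3],
    [2,8,0],
    [2,6,3]
])

# Scaled dot-product scores: Q K^T / sqrt(d_k), as in the formula
d_k = Key.shape[-1]
scores = Query @ Key.T / np.sqrt(d_k)
print(scores)
# Row-wise softmax turns the scores into attention weights
weights = soft_max(scores)
print(weights)
# Weighted sum of the value vectors
out = weights @ Value
print(out)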
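Softmax exponentiates its inputs, so large scores can overflow. A common remedy is to subtract the row maximum before exponentiating, which leaves the result mathematically unchanged. The sketch below compares the two variants; the function names `softmax_naive` and `softmax_stable` and the test inputs are hypothetical and chosen only to show the effect.

```python
import numpy as np

def softmax_naive(z):
    # Direct definition: exp can overflow to inf for large inputs
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def softmax_stable(z):
    # Shifting by the row maximum keeps every exponent <= 0
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = np.array([[1000.0, 1001.0, 1002.0]])

# On moderate inputs both variants agree
print(np.allclose(softmax_naive(small), softmax_stable(small)))

# On large inputs the naive variant yields nan (inf / inf)
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(large))

# The stable variant gives the same weights as for the shifted small input
print(softmax_stable(large))
```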

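Multi-Head Self-Attention Mechanism

Multi-head self-attention is an important component of the Transformer model and gives it strong feature-extraction ability on sequence data. The following describes how single-head self-attention is extended to multi-head self-attention.

Steps of multi-head self-attention

Multi-head self-attention does not split the input directly. Instead, it computes separate queries (Query), keys (Key), and values (Value) through multiple linear projections, and the projected query, key, and value matrices are assigned to different heads for the attention computation. The details are as follows.

Computation process of multi-head self-attention

Whole input:
- The input $ X $ is treated as a whole. For a sequence of length $ T $ with feature dimension $ D $, the input tensor has shape $ (T, D) $.
Linear projection:
- The input $ X $ is passed through three linear projections to obtain the query, key, and value matrices. Each head has its own query, key, and value weight matrices, which transform the original input into that head's queries, keys, and values.
- For the $ i $-th head, this can be written as:
  $$ Q_i = XW_Q^i, \quad K_i = XW_K^i, \quad V_i = XW_V^i $$
- Here $ W_Q^i $, $ W_K^i $, and $ W_V^i $ are the query, key, and value weight matrices of the $ i $-th head.
Splitting into heads:
- In practice, the per-head projections are usually computed as one projection of width $ D $ whose result is then split into heads. With $ h $ heads, each head has dimension $ \frac{D}{h} $.
- In this step, each query, key, and value matrix is reorganized so that every head has its own set of queries, keys, and values.
- For example, if $ X $ has shape $ (T, D) $, the projected matrix is reshaped to $ (T, h, \frac{D}{h}) $, where $ \frac{D}{h} $ is the feature dimension of each head.
Self-attention per head:
- For each head, compute the self-attention output:
  $$ \text{Attention}_i = \text{softmax}\left(\frac{Q_iK_i^T}{\sqrt{d_k}}\right)V_i $$
- Here $ d_k $ is the key dimension of each head, i.e. $ \frac{D}{h} $.
Concatenating the head outputs:
- The self-attention outputs of all heads are concatenated into a new matrix of shape $ (T, D) $.
- This step merges every head's output back into the original dimension, ready for the output projection and later processing.
Output projection:
- Finally, a linear layer projects the concatenated head outputs to produce the final output.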
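The steps above (project, split into heads, attend per head, concatenate, project the output) can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the function name `multi_head_self_attention`, the output matrix name `W_O`, the sizes, and the random weights standing in for learned ones are all hypothetical.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax over the last axis, shifted for numerical stability
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head self-attention for one sequence X of shape (T, D)."""
    T, D = X.shape
    d_head = D // h

    def split_heads(M):
        # (T, D) -> (T, h, d_head) -> (h, T, d_head)
        return M.reshape(T, h, d_head).transpose(1, 0, 2)

    # Linear projections of the whole input, then split into heads
    Q = split_heads(X @ W_Q)
    K = split_heads(X @ W_K)
    V = split_heads(X @ W_V)

    # Per-head scaled dot-product attention, batched over the head axis
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, T, T)
    heads = softmax(scores) @ V                            # (h, T, d_head)

    # Concatenate heads back to (T, D), then apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(T, D)
    return concat @ W_O

# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(0)
T, D, h = 5, 16, 4
X = rng.standard_normal((T, D))
W_Q, W_K, W_V, W_O = (rng.standard_normal((D, D)) for _ in range(4))

out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)   # (5, 16)
```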

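Summary

The core of multi-head self-attention is the linear projection of the input, not a direct split of the input. This lets the model learn different feature representations in different heads, which makes the whole model more expressive and flexible. The data each head processes is projected from the input rather than cut directly from the input data.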
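The difference between projecting and splitting can be made concrete. The sketch below, with hypothetical sizes and a random stand-in for the query weight matrix, checks that splitting one $ D $-wide projection into heads gives the same result as giving each head its own weight matrix (a column block of the large one). Splitting the raw input, by contrast, would let each head see only $ \frac{D}{h} $ of the input features.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, h = 5, 16, 4            # hypothetical sizes
d_head = D // h

X = rng.standard_normal((T, D))
W_Q = rng.standard_normal((D, D))   # random stand-in for learned weights

# One large projection, then split into heads: (T, D) -> (T, h, d_head)
Q_split = (X @ W_Q).reshape(T, h, d_head)

# Per-head weight matrices are column blocks of the large matrix
per_head_ok = []
for i in range(h):
    W_Q_i = W_Q[:, i * d_head:(i + 1) * d_head]   # (D, d_head)
    Q_i = X @ W_Q_i                                # uses all D input features
    per_head_ok.append(bool(np.allclose(Q_i, Q_split[:, i, :])))

print(per_head_ok)

# Splitting the raw input instead gives each head only d_head input features
X_split = X.reshape(T, h, d_head)
print(X_split[:, 0, :].shape)   # (5, 4)
```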

Implementation

import numpy as np

# Convert each row of scores into a probability distribution
def soft_max(z):
    # Subtract the row maximum before exponentiating for numerical stability
    z_max = np.max(z, axis=-1, keepdims=True)
    t = np.exp(z - z_max)
    return t / np.sum(t, axis=-1, keepdims=True)

# attention for Encoder
print("attention for Encoder")

values_length = 3
num_attention_heads = 8
hidden_size = 768
attention_head_size = hidden_size // num_attention_heads

# Random stand-ins for the already-projected Q, K, V matrices, shape (T, D)
Query = np.random.rand(values_length, hidden_size)
Key = np.random.rand(values_length, hidden_size)
Value = np.random.rand(values_length, hidden_size)

# Split the feature dimension into heads: (T, D) -> (T, h, D/h)
Query = np.reshape(Query, [values_length, num_attention_heads, attention_head_size])
Key = np.reshape(Key, [values_length, num_attention_heads, attention_head_size])
Value = np.reshape(Value, [values_length, num_attention_heads, attention_head_size])
print(np.shape(Query))

# Move the head axis first: (T, h, D/h) -> (h, T, D/h)
Query = np.transpose(Query, [1, 0, 2])
Key = np.transpose(Key, [1, 0, 2])
Value = np.transpose(Value, [1, 0, 2])
print(np.shape(Query))

# Scaled dot-product scores per head: (h, T, T)
scores = Query @ np.transpose(Key, [0, 2, 1]) / np.sqrt(attention_head_size)
print(np.shape(np.transpose(Key, [0, 2, 1])))
print(np.shape(scores))
scores = soft_max(scores)
print(np.shape(scores))
# Weighted sum of the values per head: (h, T, D/h)
out = scores @ Value
print(np.shape(out))
# Concatenate the heads: (h, T, D/h) -> (T, h, D/h) -> (T, D)
out = np.transpose(out, [1, 0, 2])
print(np.shape(out))
out = np.reshape(out, [values_length, hidden_size])

# Output projection matrix (random stand-in for the learned weights)
W_O = np.random.rand(hidden_size, hidden_size)
out = out @ W_O
print(np.shape(out))

The main difference between self-attention and attention

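Attention builds connections between two different sequences, such as the source and target languages in translation. Example: when producing the translation of "cat", the model attends to "cat" in the English source sentence.
Self-Attention builds connections within a single sequence, such as between the different words of one sentence. Example: in "The cat sat on the mat.", when processing "sat", the model attends to other words in the same sentence, such as "cat".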
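In code, the distinction comes down to where the queries, keys, and values originate. The sketch below is a simplified illustration: it omits the learned projections and uses random vectors as stand-ins for token embeddings, and the names `attention`, `source`, and `target` are hypothetical.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax, shifted for numerical stability
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the weights."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
source = rng.standard_normal((6, d))   # e.g. 6 tokens of a source sentence
target = rng.standard_normal((4, d))   # e.g. 4 tokens of a target sentence

# Self-attention: queries, keys, and values all come from one sequence
_, self_weights = attention(source, source, source)

# Attention between two sequences: queries from the target,
# keys and values from the source
_, cross_weights = attention(target, source, source)

print(self_weights.shape)    # (6, 6): each source token attends to source tokens
print(cross_weights.shape)   # (4, 6): each target token attends to source tokens
```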
