Overview

๋ชจ๋ธ ๊ตฌํ˜„์„ ๊ธฐ๊ฐ€ ๋ง‰ํžˆ๊ฒŒ ์ •๋ฆฌํ•ด๋‘” ํฌ์ŠคํŒ…์ด ์žˆ์–ด ์ด๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ ๋ชจ๋ธ ๊ตฌํ˜„ ๋Šฅ๋ ฅ๋„ ๊ธฐ๋ฅผ ๊ฒธ ๋ฒˆ์—ญ๊ณผ ์‹œํ–‰ ์ฐฉ์˜ค๋“ฑ์„ ์ •๋ฆฌํ•˜๊ณ ์ž ํ•œ๋‹ค.

๋ณธ ํฌ์ŠคํŒ…์—์„œ๋Š” Multi-Headed Attention (MHA) ๋ฅผ ๊ตฌํ˜„ํ•  ์˜ˆ์ •์ด๋‹ค. ํด๋ก  ์ฝ”๋”ฉ ํ•˜๊ธฐ๋ณด๋‹ค๋Š” from scratch๋กœ ํ•˜๋‚˜ํ•˜๋‚˜ ๊ตฌํ˜„ํ•˜๊ณ  ๋œฏ์–ด๋ณธ๋‹ค. ๋ณธ ํฌ์ŠคํŒ…์˜ ๋ชฉ์ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  1. Implement Multi-Headed Attention (MHA) in PyTorch, following the reference post.
  2. Document the trial and error encountered along the way and how each issue was resolved.
  3. Explain everything in as much detail as possible, so that readers outside the field can follow.

Multi-Headed Attention (MHA)

This tutorial is written with reference to the paper "Attention Is All You Need". If you have the time, other write-ups on the topic are also worth consulting.

We implement the MHA described below and provide code that also applies it to a simple Auto Regression task.

<aside> 💡 Auto Regression is a task in which the model is given the sentence "나는 사과를 좋아해" ("I like apples") as input and is trained to output that same sentence, predicting each word from the words that come before it.

Through this kind of training, the Transformer learns the relationship between each pair of words (more precisely, between each pair of embeddings). A small sketch of how such an input/target pair is built follows this note.

</aside>
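To make the input/target relationship concrete, here is a minimal sketch of how such an autoregressive pair can be built. The toy vocabulary, the `<bos>` token, and the variable names are illustrative assumptions, not code from the referenced posts.

```python
import torch

# Toy word-level vocabulary (illustrative only).
vocab = {"<bos>": 0, "나는": 1, "사과를": 2, "좋아해": 3}
sentence = ["나는", "사과를", "좋아해"]

# The model reads the sentence prefixed with <bos>; the target is the same
# sequence shifted one step to the left, so every position has to predict
# the *next* word from the words it has already seen.
token_ids = [vocab["<bos>"]] + [vocab[w] for w in sentence]
inputs = torch.tensor(token_ids[:-1])   # <bos> 나는 사과를
targets = torch.tensor(token_ids[1:])   # 나는 사과를 좋아해

print(inputs)   # tensor([0, 1, 2])
print(targets)  # tensor([1, 2, 3])
```

During training, a causal mask keeps each position from attending to later words, which is what forces the model to learn these word-to-word relationships.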

Reference posts:

  - Tutorial 5: Transformers and Multi-Head Attention — PyTorch Lightning 2.1.4 documentation
  - labml.ai Annotated PyTorch Paper Implementations

MHA + Transformer Implementation
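Before building everything by hand, it helps to see what the finished module should do. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` purely as a shape and interface reference; it is not the from-scratch implementation this post develops, and the sizes (512-dimensional embeddings, 8 heads) are simply the defaults from the paper.

```python
import torch
import torch.nn as nn

# Built-in reference module: 8 heads over 512-dimensional embeddings.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 512)        # (batch, sequence length, embedding dim)
out, attn_weights = mha(x, x, x)   # self-attention: query = key = value = x

print(out.shape)           # torch.Size([2, 10, 512]) - same shape as the input
print(attn_weights.shape)  # torch.Size([2, 10, 10])  - one weight per word pair
```

The from-scratch version should reproduce exactly this behaviour: it mixes information across positions while keeping the tensor shape unchanged.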

Fixed Positional Encodings

Positional encoding
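This subsection covers the fixed (sinusoidal) encodings defined in "Attention Is All You Need": PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). As a rough sketch of that formula (the function name and example sizes are mine, not taken from the referenced implementations):

```python
import math
import torch

def fixed_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # 1 / 10000^(2i / d_model) for every even dimension index 2i
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions get sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions get cosine
    return pe  # (max_len, d_model), added to the word embeddings

pe = fixed_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # torch.Size([50, 512])
```

Because these values depend only on the position, they are computed once and added to the embeddings rather than learned.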