12.1. Processing Text Data


From the example above, three things stand out.

  1. The input is too large for a fully connected (FC) layer to be practical.

  2. Inputs can differ in length, which also makes an FC layer hard to apply.

    → This suggests that parameters should be shared, as in a CNN.

  3. Text is ambiguous. (e.g. whether "it" refers to the restaurant or to vegetarian depends on the context.)

    → This shows that a language model must also learn some kind of connection between words.
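To make the first two points concrete, here is a rough parameter count in Python. The sizes used (40 tokens, 1024-dimensional embeddings) and the function names are illustrative assumptions, not values taken from the example.

```python
def fc_weight_count(num_tokens, embed_dim):
    """Weights in one FC layer that maps the flattened input to an
    output of the same size (biases ignored)."""
    flat = num_tokens * embed_dim
    return flat * flat


def shared_weight_count(embed_dim):
    """Weights in one embed_dim x embed_dim linear map that is applied
    to every token separately (biases ignored)."""
    return embed_dim * embed_dim


# Hypothetical sizes, chosen only for illustration.
print(fc_weight_count(40, 1024))    # depends on the sequence length
print(shared_weight_count(1024))    # independent of the sequence length
```

The FC count grows with the square of the sequence length and is tied to one fixed length, whereas the shared map has the same size for any number of tokens.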

12.2. Dot-Product Self-Attention

The previous section argued that a model for processing text should have the following properties.

  1. Parameter sharing.
  2. The ability to represent connections between words.

The Transformer obtains both properties at once by using dot-product self-attention.

The self-attention function $\bold{sa}[\bullet]$ takes $N$ embeddings $\bold{x}_1, \ldots, \bold{x}_N$ as input and returns the same number of vectors. The first step is to compute a value for each input.

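As a sketch of this step, assume each value is an affine function of its input, with the same parameters used at every position $m$. The symbols $\bold{v}_m$, $\bold{\Omega}_v$ and $\bold{\beta}_v$ are assumptions introduced here for illustration:

$$
\bold{v}_m = \bold{\beta}_v + \bold{\Omega}_v \bold{x}_m, \qquad m = 1, \ldots, N
$$

Because $\bold{\Omega}_v$ and $\bold{\beta}_v$ do not depend on $m$, the number of parameters stays fixed however many inputs there are.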

The $n$-th row of the self-attention output is then computed as a weighted sum of these values.

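Written as a formula, with $\bold{v}_m$ denoting the value computed from $\bold{x}_m$ (a symbol assumed here), the $n$-th output would take the form of a weighted sum over all $N$ values:

$$
\bold{sa}_n[\bold{x}_1, \ldots, \bold{x}_N] = \sum_{m=1}^{N} a[\bold{x}_m, \bold{x}_n] \, \bold{v}_m
$$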

The scalar $a[\bold{x}_m, \bold{x}_n]$ is called the attention. It indicates how much the $n$-th input attends to the $m$-th input.
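The pieces above can be put together in a short NumPy sketch. It assumes that the values, and also the queries and keys used to form the attention, are affine functions of the inputs with parameters shared across positions, and that the attention is a softmax over unscaled dot products. The function and parameter names (`self_attention`, `omega_v`, `beta_v`, and so on) are hypothetical, and embeddings are stored as the rows of `X`.

```python
import numpy as np


def softmax_rows(scores):
    """Softmax along axis 1, so every row is non-negative and sums to one."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def self_attention(X, omega_v, beta_v, omega_q, beta_q, omega_k, beta_k):
    """Dot-product self-attention sketch.

    X has shape (N, D): one D-dimensional embedding per row.
    Returns the (N, D) outputs and the (N, N) attention matrix A,
    where A[n, m] plays the role of a[x_m, x_n].
    """
    # Assumption: the same affine maps are applied to every row
    # (this is the parameter sharing).
    V = X @ omega_v.T + beta_v   # values
    Q = X @ omega_q.T + beta_q   # queries
    K = X @ omega_k.T + beta_k   # keys

    # Assumption: attention is a softmax over dot products q_n . k_m.
    A = softmax_rows(Q @ K.T)

    # Row n of the output is sum_m A[n, m] * V[m].
    return A @ V, A


# Usage with random, purely illustrative inputs and parameters.
rng = np.random.default_rng(0)
N, D = 4, 3
X = rng.normal(size=(N, D))
params = [rng.normal(size=shape) for shape in [(D, D), (D,)] * 3]
out, A = self_attention(X, *params)
```

The same `params` work unchanged for an input with a different number of rows, and `A` gives one learned weight for every pair of inputs, which covers both properties listed at the start of this section.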
