Take Home

  1. ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์€ ํŒŒ๋ผ๋ฏธํ„ฐ, $\phi$ ์— ๋Œ€์‘ํ•˜๋Š” loss function, $L[\phi]$ ๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค. Gradient Descent ๋Š” ํ˜„์žฌ ํŒŒ๋ผ๋ฏธํ„ฐ์—์„œ ๊ณ„์‚ฐ๋˜๋Š” loss์˜ (ํ•ด๋‹น ์ง€์ ์—์„œ์˜ uphill) gradient๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ์ด์˜ ๋ฐ˜๋Œ€๋ฐฉํ–ฅ์ธ downhill (gradient์— $\times -1$ ์„ ๊ณฑํ•˜๋ฉด ๋จ.) ๋ฐฉํ–ฅ์œผ๋กœ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์—…๋ฐ์ดํŠธ ํ•œ๋‹ค.
  2. non-linear function ์— ๋Œ€ํ•œ loss๋Š” non-convex์ผ ํ™•๋ฅ ์ด ์•„์ฃผ์•„์ฃผ ๋†’๋‹ค. ๋”ฐ๋ผ์„œ local-minima๋‚˜ saddle points๋ฅผ ํฌํ•จํ•  ์ˆ˜ ์žˆ๋‹ค. Stochastic Gradient Descent (SGD) ๋Š” ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ์–ด๋Š ์ •๋„ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋‹ค.
  3. SGD๋Š” ํ•™์Šต ๋ฐ์ดํ„ฐ์—์„œ ์ค‘๋ณต์„ ํ—ˆ์šฉํ•˜์ง€ ์•Š๊ฒŒ ๋ช‡๋ช‡ examples ๋ฅผ ์ƒ˜ํ”Œ๋งํ•œ๋‹ค. ์ด๋“ค์„ batch ํ˜น์€ minibatch ๋ผ๊ณ  ํ•œ๋‹ค. ์ด batch์— ๋Œ€ํ•˜์—ฌ loss์™€ gradient๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ์ ‘๊ทผ์€ gradient์— noise๋ฅผ ๋”ํ•œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๊ณ  ์ด๋Ÿฌํ•œ ๊ณผ์ •์—์„œ ์•ž์„  local minima, saddle points๋ฅผ ํ”ผํ•˜๋„๋ก ํ•˜๋ฉฐ ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ๋” ์ž˜ ์ผ๋ฐ˜ํ™”ํ•œ๋‹ค.
  4. ๋งˆ์ง€๋ง‰์œผ๋กœ ์ด๋Ÿฌํ•œ SGD ์•Œ๊ณ ๋ฆฌ์ฆ˜์— momentum term ์„ ์ถ”๊ฐ€ํ•˜๋Š” ๊ฒƒ์ด ํ•™์Šต์„ ํšจ๊ณผ์ ์œผ๋กœ ๋„์šธ ์ˆ˜ ์žˆ์Œ์„ ๋ณด์•˜๊ณ , Vanilla momentum, Nesterov Accelerated momentum, Adaptive Momentum Estimate (Adam) ๊นŒ์ง€ ์‚ดํŽด๋ณด์•˜๋‹ค.

๋“ค์–ด๊ฐ€๊ธฐ ์•ž์„œ

Chapter 3, 4 ์—์„œ๋Š” SNN, DNN์— ๋Œ€ํ•˜์—ฌ ๊ณต๋ถ€ํ•˜์˜€๋‹ค. ์ด๋Ÿฌํ•œ ๋ชจ๋ธ๋“ค์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์ด ์–ด๋–ค ํ•จ์ˆ˜๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ์—ฌ๋Ÿฌ ๊ฐœ์˜ piecewise line function๋“ค๋กœ ํ‘œํ˜„๋œ๋‹ค. Chapter 5์—์„œ๋Š” loss ์— ๋Œ€ํ•˜์—ฌ ๊ณต๋ถ€ํ•˜์˜€๋‹ค. ์ด๋Š” ํ•™์Šต ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•˜์—ฌ ground truth (GT = ์ •๋‹ต) ๊ณผ ๋ชจ๋ธ์˜ prediction ๊ฐ„์˜ mismatch ๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํ•˜๋‚˜์˜ ์Šค์นผ๋ผ ๊ฐ’์ด๋‹ค.

loss๋Š” ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ์— ๋”ฐ๋ผ ๊ฒฐ์ •๋˜๋Š” ๊ฐ’์ด๊ณ , ๋ณธ ์ฑ•ํ„ฐ์—์„œ๋Š” loss๊ฐ€ ์ตœ์†Œ๊ฐ’์„ ๊ฐ–๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ โ€œ์–ด๋–ป๊ฒŒ ์ฐพ๋Š”์ง€โ€ ์— ๋Œ€ํ•˜์—ฌ ๊ณต๋ถ€๋ฅผ ํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•์„ learning, training ํ˜น์€ fitting ์ด๋ผ๊ณ  ํ•œ๋‹ค.

๋จผ์ € ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๊ฐ’์„ ์ดˆ๊ธฐํ™”ํ•˜๊ณ  ํฌ๊ฒŒ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‘ ๊ฐœ์˜ ์Šคํ…์„ ๋”ฐ๋ฅธ๋‹ค.

  1. ํŒŒ๋ผ๋ฏธํ„ฐ์— ๋Œ€ํ•œ loss์˜ derivative(==๋ฏธ๋ถ„) (gradient) ๋ฅผ ๊ตฌํ•œ๋‹ค.
  2. ์•ž์„œ ๊ตฌํ•œ gradient ์— ๋Œ€ํ•˜์—ฌ loss๊ฐ€ ์ž‘์•„์ง€๋„๋ก ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์กฐ์ •ํ•œ๋‹ค.

์—ฌ๋Ÿฌ ๋ฐ˜๋ณต ๋’ค์— loss function์˜ ์ „๋ฐ˜์ ์ธ minimum์— ๋„๋‹ฌํ•˜๊ธฐ๋ฅผ ๊ธฐ๋„ํ•œ๋‹ค. (fitting์€ ์ƒค๋จธ๋‹ˆ์ฆ˜์˜ ์˜์—ญ..)

6.1 Gradient Descent

optimization algorithm ์˜ ์ตœ์ข… ๋ชฉํ‘œ๋Š” ๋ฐ”๋กœ loss๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ, $\hat{\phi}$๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด๋‹ค.

$$ \hat{\phi} = \operatorname*{argmin}_{\phi}\bigg[L[\phi]\bigg] \qquad \text{(Eq. 6.1)} $$
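As a one-parameter illustration (a toy example, not from the book): if $L[\phi] = (\phi - 3)^2$, then

$$ \hat{\phi} = \operatorname*{argmin}_{\phi}\bigg[(\phi - 3)^2\bigg] = 3, $$

since the loss is zero at $\phi = 3$ and positive everywhere else. For neural networks there is no such closed-form answer, which is why we turn to iterative algorithms.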

๋‹ค์–‘ํ•œ optimization algorithm์ด ์กด์žฌํ•˜์ง€๋งŒ ๋ณดํ†ต neural network ๋ฅผ ํ•™์Šตํ•˜๋Š” ์ผ๋ฐ˜์ ์ธ ๋ฐฉ๋ฒ•์€ ๋จผ์ € ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๊ฐ’์„ ํœด๋ฆฌ์Šคํ‹ฑ (ํœด๋ฆฌ์Šคํ‹ฑ์€ ๋ง์ด ์ข‹์•„ ํœด๋ฆฌ์Šคํ‹ฑ์ด์ง€ ๊ทธ๋ƒฅ ๊ฐ์œผ๋กœ ๋•Œ๋ ค๋ฐ•๋Š” ๊ฒƒ.) (initialization ์— ๋Œ€ํ•ด์„œ๋Š” ํ›„์ˆ ํ•  ์˜ˆ์ •) ํ•˜๊ฒŒ initialization ํ•˜๊ณ  loss๊ฐ€ ์ค„์–ด๋“œ๋Š” ์ผ๋ จ์˜ ๋ฐฉ๋ฒ•์„ ๋ฐ˜๋ณตํ•˜์—ฌ (iterative) optimization ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

์ด๋Ÿฌํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘์—์„œ ๊ฐ€์žฅ ์‹ฌํ”Œํ•œ ๋ฐฉ๋ฒ•์€ gradient descent ์ด๋‹ค. ์ด๋Š” ๋จผ์ € ํŒŒ๋ผ๋ฏธํ„ฐ, $\phi = [\phi_0, \phi_1, ..., \phi_N]^T$ ๋กœ ์ดˆ๊ธฐํ™” ํ•˜๊ณ  ์•„๋ž˜ ๋‘ step์„ ๋ฐ˜๋ณตํ•œ๋‹ค.

Step 1. $\phi$์— ๋Œ€ํ•œ loss์˜ gradient ๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.