Take Home

  1. Understand how the backpropagation algorithm works.

  2. Parameter initialization: to keep training stable, the parameters are initialized so that the activations produced at every layer follow similar distributions.

    Fig 7.7. Weight Initialization. With D_h = 100 in every layer, a) shows the magnitude of the activations at each layer during the forward pass, and b) shows the magnitude of the gradients during the backward pass. When the weight variance at each layer is 2/D_h = 0.02, the activations and gradients stay stable through both the forward and backward passes. A variance smaller than 0.02 can lead to vanishing gradients, and one larger than 0.02 to exploding gradients.

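The following is a minimal NumPy sketch (not from the book) of the experiment behind Fig 7.7: a deep stack of width-100 layers with ReLU activations, zero biases, and zero-mean Gaussian weights. With variance 2/D_h = 0.02 the activation magnitude stays roughly constant from layer to layer, while smaller or larger variances make it collapse or explode. The function name, layer count, and seed are illustrative choices.

```python
import numpy as np

def activation_scale_per_layer(weight_var, D_h=100, n_layers=50, seed=0):
    """Push a random input through n_layers ReLU layers with zero biases and
    zero-mean Gaussian weights of the given variance; return the mean absolute
    activation at each layer (the quantity plotted in Fig 7.7a)."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(D_h)
    scales = []
    for _ in range(n_layers):
        Omega = rng.normal(0.0, np.sqrt(weight_var), size=(D_h, D_h))
        h = np.maximum(0.0, Omega @ h)            # ReLU activation, bias = 0
        scales.append(np.abs(h).mean())
    return scales

for var in (0.01, 0.02, 0.04):                    # 2/D_h = 0.02 is the stable choice
    s = activation_scale_per_layer(var)
    print(f"var={var:.2f}  layer 1: {s[0]:.3e}  layer 50: {s[-1]:.3e}")
```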

Preliminaries

Abstract

์ด์ „์— Chatper 6 ์—์„œ iterative optimization algorithms ๋“ค์„ ๊ณต๋ถ€ํ•˜์˜€๋‹ค. ์ด๋“ค์€ ์–ด๋–ค function์˜ ์ตœ์†Œ๊ฐ’์„ ์ฐพ๋Š” ๋ฒ”์šฉ์ ์ธ ์ ‘๊ทผ๋ฒ•๋“ค์ด๋‹ค. Neutral network์˜ ๊ด€์ ์—์„œ๋Š” input์ด ์ฃผ์–ด์กŒ์„ ๋•Œ ์ •ํ™•ํ•œ output์„ ์˜ˆ์ธกํ•˜๋„๋ก loss๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ฐพ๋Š” ๊ฒƒ์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ ๋ฐฉ๋ฒ•์€ ์ดˆ๊ธฐ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋žœ๋คํ•˜๊ฒŒ ์„ค์ •ํ•˜๊ณ  loss๋ฅผ ์ตœ์†Œํ™” ํ•˜๋Š” loss์˜ ํ˜„์žฌ ํŒŒ๋ผ๋ฏธํ„ฐ์— ๋Œ€ํ•œ ๋ฏธ๋ถ„๊ฐ’, ์ฆ‰ gradient๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.

๋ณธ ์ฑ•ํ„ฐ์—์„œ๋Š” ํฌ๊ฒŒ ๋‘ ๊ฐ€์ง€ ์ด์Šˆ๋ฅผ ์ค‘์ ์ ์œผ๋กœ ๋‹ค๋ฃฌ๋‹ค.

  1. How, then, can the gradient be computed "efficiently"?
  2. How should the parameters be initialized?

7.1 Problem Definitions

๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‰ด๋Ÿด๋„ท, $\bold{f[x, \phi]}$, input, $\bold{x}$, ํŒŒ๋ผ๋ฏธํ„ฐ, $\phi$์™€ 3๊ฐœ์˜ hidden layers,$\bold{h_1, h_2, h_3}$ ๋ฅผ ์ƒ๊ฐํ•ด๋ณด์ž.

$$
\begin{aligned}
\bold{h_1} &= \bold{a}[\bold{\beta_0} + \bold{\Omega_0 x}] \\
\bold{h_2} &= \bold{a}[\bold{\beta_1} + \bold{\Omega_1 h_1}] \\
\bold{h_3} &= \bold{a}[\bold{\beta_2} + \bold{\Omega_2 h_2}] \\
\bold{f[x, \phi]} &= \bold{\beta_3} + \bold{\Omega_3 h_3}
\end{aligned}
$$

The activation function $\bold{a}[\bullet]$ is applied element-wise. The parameters are $\phi = \{\bold{\beta_k, \Omega_k}\}_{k=0}^{3}$, where $\bold{\beta_k}$ is a bias vector and $\bold{\Omega_k}$ is a weight matrix. The model is drawn in Fig 7.1 below.

Fig 7.1 Backpropagation forward pass.

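As a hedged illustration of the forward pass in Fig 7.1, here is a small NumPy sketch assuming a ReLU activation for $\bold{a}[\bullet]$ and toy dimensions; `forward`, `betas`, and `Omegas` are illustrative names mirroring the symbols above.

```python
import numpy as np

def forward(x, betas, Omegas):
    """Forward pass of the three-hidden-layer network above:
    h_k = a[beta_{k-1} + Omega_{k-1} h_{k-1}],  f[x, phi] = beta_3 + Omega_3 h_3."""
    h = x
    for beta, Omega in zip(betas[:-1], Omegas[:-1]):
        h = np.maximum(0.0, beta + Omega @ h)   # a[.] taken to be ReLU, element-wise
    return betas[-1] + Omegas[-1] @ h           # linear output layer

# Toy dimensions: x in R^4, three hidden layers of width 10, scalar output.
rng = np.random.default_rng(0)
dims = [4, 10, 10, 10, 1]
betas = [np.zeros(d_out) for d_out in dims[1:]]
Omegas = [rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))
          for d_in, d_out in zip(dims[:-1], dims[1:])]
print(forward(rng.standard_normal(dims[0]), betas, Omegas))
```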

Given the label $y_i$ and the prediction $\bold{f[x_i, \phi]}$, the loss for the $i$-th example can be computed as the squared distance $l_i = (\bold{f}[\bold{x_i}, \phi] - y_i)^2$, and the total loss is then computed as follows.

$$
L[\phi] = \sum_{i=1}^{I} l_i = \sum_{i=1}^{I} \left( \bold{f}[\bold{x_i}, \phi] - y_i \right)^2
$$
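A short sketch of this total least-squares loss, reusing the `forward` function from the snippet above; `X`, `y`, and `total_loss` are illustrative names.

```python
def total_loss(X, y, betas, Omegas):
    """L[phi] = sum_i (f[x_i, phi] - y_i)^2, using forward() from the sketch above."""
    return sum((forward(x_i, betas, Omegas).item() - y_i) ** 2
               for x_i, y_i in zip(X, y))

X = rng.standard_normal((8, dims[0]))   # a toy mini-batch of 8 examples
y = rng.standard_normal(8)
print(total_loss(X, y, betas, Omegas))
```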

๋ณธ ์ฑ•ํ„ฐ์—์„œ๋Š” optimization algorithm์œผ๋กœ Stochastic Gradient Descent (SGD)๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.