2024-02-01 Study session


The previous three chapters covered linear regression, shallow networks, and deep networks. Each chapter examined a family of functions that maps inputs to outputs, where the particular member of the family is determined by the parameters $\phi$. Training these models means finding the parameters $\phi$ that give the best possible "input → output" mapping for the problem we want to solve. This chapter defines what the "best possible" mapping means.

This definition requires training data $\{ \bold{x}_i, \bold{y}_i \}$ consisting of input/output pairs. A loss function (or cost function) $L[\phi]$ returns a single value describing how much the model predictions $\bold{f}[\bold{x}_i, \phi]$ differ from the corresponding ground-truth outputs $\bold{y}_i$. During training, we look for the parameters $\phi$ that minimize the loss, so that the training inputs are mapped to their outputs as closely as possible. Chapter 2 introduced the MSE loss; here we examine why it was an appropriate choice.

This chapter explains why MSE was used for real-valued outputs ($y \in \R$) and provides a framework for designing loss functions for other types of prediction.

Take Home

  1. ๋ชจ๋ธ์ด ์ž…๋ ฅ, $\bold{x}$๋ฅผ ๋ฐ›์•„์„œ ์ถœ๋ ฅ, $\bold{y}$๋ฅผ ์ง์ ‘ ๊ณ„์‚ฐํ•˜๋Š” ๊ด€์ ์—์„œ ์ถœ๋ ฅ์— ๋Œ€ํ•œ โ€œํ™•๋ฅ  ๋ถ„ํฌโ€๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๊ด€์ ์œผ๋กœ ์˜ฎ๊ธด๋‹ค. ์ •๋ฆฌ ํ•˜์ž๋ฉด, ์ถœ๋ ฅ space์—์„œ ์ •์˜๋œ ํ™•๋ฅ  ๋ถ„ํฌ, $Pr(\bold{y|\theta})$ ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ, $\theta$๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ชจ๋ธ, $\theta = f[\bold{x, \phi}]$ ๋ฅผ ์„ค๊ณ„ํ•œ๋‹ค.

  2. Because the output is interpreted as a probability distribution, training the model is naturally framed as maximizing the likelihood. By convention, however, a loss is minimized, so we take the logarithm, multiply by $-1$, and train the model by minimizing the negative log-likelihood.

    Minimizing a loss matches the usual convention, and working with the log-likelihood (a sum of logs instead of a product of probabilities) is also numerically more accurate on computers with finite precision.

  3. Following this line of reasoning, the MSE loss is derived as a special case of the negative log-likelihood criterion, and the derivation shows mathematically that it assumes a constant variance. A model whose variance is constant is called homoscedastic, and a model whose variance varies with the input is called heteroscedastic.

  4. We study a framework for designing loss functions for a variety of output types and tasks.

    1. Choose a probability distribution suited to the output space.
    2. Decide which of the distribution's parameters $\theta$ to predict, and design a model $\bold{f}[\bold{x}, \phi]$ that predicts them.
    3. Apply the negative log-likelihood.
    4. At inference, return either the maximum of the predicted distribution or the distribution itself.
  5. Finally, we show mathematically that the cross-entropy loss and the negative log-likelihood criterion are essentially equivalent.
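
Two of the points above can be checked numerically. The following is a minimal sketch, not taken from the chapter: the likelihood values, targets, predicted means, and the variance `sigma2` are all made-up numbers, and the helper names `gaussian_nll` and `sum_squared_error` are hypothetical. The first half shows why a sum of logs is preferred over a raw product of probabilities on finite-precision hardware; the second half shows that, under an assumed constant variance, the Gaussian negative log-likelihood is the squared error rescaled plus a constant.

```python
import math
import random

random.seed(0)

# Hypothetical per-example likelihoods Pr(y_i | x_i): 1000 small positive numbers.
likelihoods = [random.uniform(1e-4, 1e-2) for _ in range(1000)]

# The raw likelihood is a product; here it underflows to 0.0 in double precision.
product = 1.0
for p in likelihoods:
    product *= p

# The negative log-likelihood is a sum of logs and stays representable.
nll = -sum(math.log(p) for p in likelihoods)


def gaussian_nll(ys, mus, sigma2):
    """Negative log-likelihood of ys under N(mu_i, sigma2), with one shared sigma2."""
    return sum(
        0.5 * math.log(2.0 * math.pi * sigma2) + (y - mu) ** 2 / (2.0 * sigma2)
        for y, mu in zip(ys, mus)
    )


def sum_squared_error(ys, mus):
    """Sum of squared differences between targets and predicted means."""
    return sum((y - mu) ** 2 for y, mu in zip(ys, mus))


# Made-up targets and predicted means, and an assumed constant variance.
ys = [1.0, 2.0, 3.5]
mus = [0.8, 2.4, 3.0]
sigma2 = 0.5

# With constant variance, the NLL is the squared error rescaled plus a constant,
# so both are minimized by the same predicted means.
constant = len(ys) * 0.5 * math.log(2.0 * math.pi * sigma2)
rescaled_sse = sum_squared_error(ys, mus) / (2.0 * sigma2)

print(product, nll)
print(gaussian_nll(ys, mus, sigma2), constant + rescaled_sse)
```

The sketch uses the sum of squared errors; the mean squared error differs only by the factor $1/N$, which does not change the minimizing parameters.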

5.1 Maximum Likelihood

Section 5.1 covers the basic steps for designing a loss function. Consider a model $\bold{f}[\bold{x}, \phi]$. So far, the model has "directly" computed a prediction $\bold{y}$ from an input $\bold{x}$. From now on, we shift perspective and view the model as computing a conditional probability distribution $Pr(\bold{y}|\bold{x})$ over the possible outputs $\bold{y}$ given the input $\bold{x}$.

5.1.1 Computing a distribution over outputs

This perspective raises the question, "How exactly does a model compute a probability distribution?" The answer is simple:

  1. Choose a parametric distribution $Pr(\bold{y}|\theta)$.
  2. Have the model predict the parameters $\theta$.

For example, suppose the output we want to predict is a real number ($y \in \R$). Here we can choose the univariate normal distribution, the single-variable Gaussian taught in high school (step 1). This distribution is written $\mathcal{N}(\mu, \sigma^2)$. The model then only needs to predict its mean $\mu$ and variance $\sigma^2$ (step 2).
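
The two steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the chapter's own code: the model `f` is a hypothetical straight line with made-up parameters `phi`, and for simplicity it predicts only the mean while the variance `sigma2` is held at an assumed fixed value.

```python
import math


def normal_pdf(y, mu, sigma2):
    """Density of the univariate normal N(mu, sigma2) evaluated at y (step 1)."""
    return math.exp(-((y - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def f(x, phi):
    """Hypothetical model: a line with phi = (intercept, slope) that outputs mu (step 2)."""
    intercept, slope = phi
    return intercept + slope * x


phi = (0.5, 2.0)  # made-up parameters
sigma2 = 1.0      # assumed fixed variance; the model could predict this as well
x = 1.0

# The model output is the distribution parameter, not the prediction itself.
mu = f(x, phi)

# Pr(y | x) can now be evaluated for any candidate output y.
for y in (1.5, 2.5, 3.5):
    print(y, normal_pdf(y, mu, sigma2))
```

The density is highest at `y = mu` and falls off symmetrically on either side, so the model's output fully determines how plausible each candidate `y` is for the given `x`.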

5.1.2 Maximum Likelihood Criterion