
[Paper Abstract 💚] Attention is All you Need: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanis.. 2024. 1. 24.
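The attention mechanism the abstract refers to can be captured in a few lines. Below is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V as defined in the paper; the PyTorch framing and the tensor shapes are my own illustrative assumptions, not something from the original post.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V  (Vaswani et al., 2017)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)            # attention weights sum to 1 over the keys
    return weights @ V                             # weighted sum of the values

# Illustrative shapes (assumed): batch of 2, sequence length 4, dimension 8.
Q = torch.randn(2, 4, 8)
K = torch.randn(2, 4, 8)
V = torch.randn(2, 4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([2, 4, 8])
```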
[boostcourse][Everything About Natural Language Processing] NLP Application Areas and Trends: Through boostcourse, I started taking KAIST Professor Jaegul Choo's course "์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์˜ ๋ชจ๋“  ๊ฒƒ" (Everything About Natural Language Processing). The field I have recently started studying in my lab is language models, and since I first needed to study deep learning and NLP before getting into language models, I found this free course while searching. The first lecture, "NLP Application Areas and Trends," covered not only what NLP is but also the fields where the technology is applied and the related academic societies, which gave me strong motivation for the studies ahead. In particular, the text-mining section introduced computational social science (social science based on big-data analysis), which felt like a great fit for me, so I have high expectations for the field I will be studying. The lecture slides are not available, but a summary of the content is shared alongside, which is helpful to refer to when reviewing.. 2024. 1. 24.
[pytorch zero to all lecture notes] Lecture 3: Gradient Descent. Lecture topic: gradient descent. Lecture goal: learn about the gradient descent algorithm and the process and method of finding the value of W that minimizes the loss, and discuss how to update the weight and compute the gradient using the update equation. - 🎯 The goal of training or learning in machine learning is to find the optimal value of W that minimizes the loss function. - 🧭 The gradient descent algorithm provides a systematic way to identify the optimal value of W by iteratively updating the w.. 2024. 1. 14.
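Since the preview cuts off right at the update rule, here is a minimal sketch of what that iterative update looks like in code: the weight is repeatedly moved against the gradient of the loss, w ← w − α·∂loss/∂w. The toy data y = 2x, the learning rate 0.01, and the hand-derived gradient 2x(xw − y) follow the style of this lecture series, but they are assumptions on my part rather than a transcript of the lecture code.

```python
# Toy data (assumed) for y = 2x; w should converge toward 2.0.
x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]

w = 1.0  # initial guess for the weight

def forward(x):
    return x * w  # linear model: y_hat = x * w

def loss(x, y):
    return (forward(x) - y) ** 2  # squared error for a single sample

def gradient(x, y):
    # d(loss)/dw = 2x(xw - y), derived by hand from the squared error
    return 2 * x * (x * w - y)

for epoch in range(10):
    for x_val, y_val in zip(x_data, y_data):
        w -= 0.01 * gradient(x_val, y_val)  # update rule: w = w - lr * grad
    print(f"epoch {epoch}: w = {w:.4f}, loss = {loss(1.0, 2.0):.6f}")
```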
[pytorch zero to all lecture notes] Lecture 2: Linear Model. As the capstone topic narrowed down to building a search engine with an LLM, our PyTorch study started over the winter break. I would like to share the notes I took while working through the pytorch zero to all lectures that my professor shared. The mathematical details and underlying principles are only summarized briefly, and the notes focus on the big picture of each lecture, so for those who, like me, started PyTorch from zero, reading through them once before starting a PyTorch study should be helpful. I plan to post everything up to Lecture 13 and to write up my notes whenever we continue with further study sessions. Lecture goal: introduce the concept of a linear model in PyTorch, explain how linear models are used in supervised machine learning, and walk through the model training and evaluation process. It also covers computing the loss and using mean squared error (MSE) to.. 2024. 1. 7.
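To make the linear-model-plus-MSE idea concrete, here is a minimal sketch in plain Python: a one-parameter model y_hat = x * w evaluated by its mean squared error over a grid of candidate weights, the kind of "sweep w and inspect the loss" exercise this lecture builds on. The toy data y = 2x and the particular grid of weights are assumptions for illustration.

```python
# Toy data (assumed) following y = 2x.
x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]

def forward(x, w):
    return x * w  # linear model without a bias term: y_hat = x * w

def mse(w):
    # Mean squared error of the model with weight w over the whole dataset
    errors = [(forward(x, w) - y) ** 2 for x, y in zip(x_data, y_data)]
    return sum(errors) / len(errors)

# Evaluate the loss at several candidate weights; the minimum sits at w = 2.0.
for w in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(f"w = {w:.1f}  MSE = {mse(w):.4f}")
```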