๊ฐ•์˜๋ฆฌ๋ทฐ๐Ÿ–ฅ๏ธ

[boostcourse][All About Natural Language Processing] Traditional Natural Language Processing Techniques

by hyerong 2024. 1. 24.

Chapter 1, Lecture 2: Introduction to traditional natural language processing techniques

Keywords: BOW, one-hot vector, Naive Bayes classifier, sentence classification

 

Bag-Of-Words (BoW)

  • A way of representing text data numerically that ignores word order and looks only at how often each word occurs (frequency).
  • To represent words as vectors, first store the words used in the given sentences in a vocabulary (key-value form, no duplicates).
  • Each stored word is a unique categorical variable, so it can be represented as a vector with one-hot encoding
    -> the given sentence can therefore be expressed numerically as the sum of its words' one-hot vectors.

 

๋ฌธ์žฅ์„ ๊ตฌ์„ฑํ•˜๊ณ  ์žˆ๋Š” ๋‹จ์–ด๋“ค์„ ๊ฐ€๋ฐฉ์— ์ˆœ์ฐจ์ ์œผ๋กœ ์ •๋ฆฌ. 

Each word in the bag is converted to numbers through its one-hot vector, and the given sentence is represented as the sum of those vectors.
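The steps above can be sketched in a few lines of Python. This is only a minimal illustration: the two example sentences, the whitespace tokenization, and the helper names (`build_vocab`, `one_hot`, `bag_of_words`) are assumptions made for this sketch and do not come from the lecture.

```python
def build_vocab(sentences):
    """Map each unique word to an index, in order of first appearance (no duplicates)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


def bag_of_words(sentence, vocab):
    """Represent a sentence as the element-wise sum of its words' one-hot vectors."""
    total = [0] * len(vocab)
    for word in sentence.split():
        for i, value in enumerate(one_hot(word, vocab)):
            total[i] += value
    return total


# Hypothetical example sentences, chosen only for illustration.
sentences = [
    "John really really loves this movie",
    "Jane really likes this song",
]
vocab = build_vocab(sentences)
bow = bag_of_words(sentences[0], vocab)
```

Because the vectors are summed, a repeated word such as "really" simply gets a count of 2, and the order in which the words appeared is lost.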

 

 

Naive Bayes Classifier for Document Classification

๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ๋ถ„๋ฅ˜๊ธฐ๋Š” ์ธ๊ณต ์‹ ๊ฒฝ๋ง ์•Œ๊ณ ๋ฆฌ์ฆ˜์—๋Š” ์†ํ•˜์ง€ ์•Š์ง€๋งŒ, ๋จธ์‹ ๋Ÿฌ๋‹์˜ ์ฃผ์š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋ถ„๋ฅ˜.

๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์šฐ์„  ๋ฒ ์ด์ฆˆ์˜ ์ •๋ฆฌ(Bayes' theorem)๋ฅผ ์ดํ•ดํ•ด์•ผํ•จ. 

๋ฒ ์ด์ฆˆ ์ •๋ฆฌ๋Š” ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•. 

 

Example:

Data     | Doc (d) | Document (words, w)                                    | Class (c)
---------|---------|--------------------------------------------------------|----------
Training | 1       | Image recognition used convolutional neural networks   | CV
         | 2       | Transformers can be used for image classification task | CV
         | 3       | Language modeling uses transformer                     | NLP
         | 4       | Document classification task is language task          | NLP
Test     | 5       | Classification task uses transformer                   | ?

 

Using the training data, we want to classify the test data (sentence 5) into one of the two classes, CV or NLP.

This can be worked out easily by counting how often each word of sentence 5 appears in the training sentences 1 to 4 of each class and turning those counts into conditional probabilities.
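The calculation can be sketched in Python on the five sentences from the table. This is a minimal sketch under stated assumptions, not the lecture's own code: tokens are lowercased and split on whitespace with no stemming (so "used" does not match "uses" and "Transformers" does not match "transformer"), probabilities are plain relative frequencies, and the helper names (`tokenize`, `fit`, `score`) are made up for illustration.

```python
from collections import Counter
from fractions import Fraction

# Training and test sentences copied from the table above.
train = [
    ("Image recognition used convolutional neural networks", "CV"),
    ("Transformers can be used for image classification task", "CV"),
    ("Language modeling uses transformer", "NLP"),
    ("Document classification task is language task", "NLP"),
]
test_doc = "Classification task uses transformer"


def tokenize(text):
    # Assumption: lowercase + whitespace split, no stemming or lemmatization.
    return text.lower().split()


def fit(docs):
    """Count documents per class and word occurrences per class."""
    doc_counts = Counter()
    word_counts = {}
    for text, label in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return doc_counts, word_counts


def score(text, label, doc_counts, word_counts):
    """Return P(c) * product of P(w | c), using relative frequencies."""
    result = Fraction(doc_counts[label], sum(doc_counts.values()))  # prior P(c)
    total_words = sum(word_counts[label].values())
    for word in tokenize(text):
        # A Counter returns 0 for unseen words, so the whole product becomes 0.
        result *= Fraction(word_counts[label][word], total_words)
    return result


doc_counts, word_counts = fit(train)
scores = {c: score(test_doc, c, doc_counts, word_counts) for c in doc_counts}
prediction = max(scores, key=scores.get)
```

Under these assumptions the CV class has 14 training words and the NLP class has 10. The NLP score is 1/2 × 1/10 × 2/10 × 1/10 × 1/10 = 1/10000, while the CV score is 0 because "uses" and "transformer" never occur verbatim in the CV sentences, so the sketch picks NLP.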

 

The weakness of this approach is that if even one word of the sentence being classified never appears in a class's training data, the product of all the word probabilities for that class becomes 0, no matter how often the sentence's other words appear. (This way of estimating the parameters is derived from maximum likelihood estimation (MLE).)
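Written out, the MLE estimates are simple relative frequencies. The symbols below are assumptions introduced for this sketch: N is the number of training documents, N_c the number of those in class c, and count(w, c) the number of times word w occurs in the training documents of class c.

```latex
\hat{P}(c) = \frac{N_c}{N}, \qquad
\hat{P}(w \mid c) = \frac{\operatorname{count}(w, c)}{\sum_{w'} \operatorname{count}(w', c)}

\hat{P}(c) \prod_{i=1}^{n} \hat{P}(w_i \mid c) = 0
\quad \text{whenever } \operatorname{count}(w_j, c) = 0 \text{ for some } j
```

A single zero count in the numerator is enough to wipe out the whole product, which is exactly the weakness described above.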