bcaitech1/p4-dkt-ollehdkt

Boost Camp - AI Tech

Stage 4 - Deep Knowledge Tracing

2021.05.24 ~ 2021.06.15

Task: predict whether a user who has solved a set of problems answers the last problem correctly

This Git repo documents the process and results of the Boost Camp P stage 4 competition. Due to the competition rules, some content has been modified or removed.



🏁 Final Score

Team Rank : 7 , AUROC : 0.8362, Accuracy : 0.7527



📋 Table of contents

  1. EDA
  2. Feature Engineering
  3. Data Augmentation
  4. Model
  5. Cross Validation strategy
  6. Miscellaneous


๐ŸŒOlleh Team

๊น€์ข…ํ˜ธ Project Branch Github Badge Blog Badge

๋ฐ•์ƒ๊ธฐ Project Branch Github Badge Blog Badge

์ž„๋„ํ›ˆ Project Branch Github Badge Blog Badge

์ง€์ •์žฌ Project Branch Github Badge Blog Badge

ํ™์ฑ„์› Project Branch Github Badge Blog Badge

์Šคํ›„ํŽ˜์—˜๋ ˆ๋‚˜

Project Branch๋Š” DKT ๋Œ€ํšŒ์—์„œ ์‚ฌ์šฉํ•œ ํŒ€์› ๋ณ„ Branch์ž…๋‹ˆ๋‹ค. ํŒ€์›์˜ ์ž์„ธํ•œ ์ •๋ณด๋ฅผ ์›ํ•˜์‹œ๋Š” ๊ฒฝ์šฐ Project Branch๋กœ ํ™•์ธ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค




📖 Overall project process

image1




💡 Key strategies

➡ Use of education-domain knowledge

➡ User split augmentation

➡ Model experiments that take the private leaderboard into account

➡ Two track (task cross-reference)




๐Ÿƒโ€โ™€๏ธ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•œ ๊ณ ๊ตฐ๋ถ„ํˆฌํ•œ ์—ฌ์ •

1. EDA (Exploratory Data Analysis)

โžก ๋‹ค์–‘ํ•œ EDA๋ฅผ ํ†ตํ•ด Feature engineering๊ณผ validation ์ „๋žต์„ ์„ธ์šฐ๋Š”๋ฐ ํ™œ์šฉ

image


2. Feature Engineering

➡ Features based on data analysis

  ✳ Relationship between User ID, assessmentItemID, testId, KnowledgeTag, Timestamp and answerCode

  ✳ Mean, variance, skew, cumulative sum, and cumulative mean of answerCode for each value

  ✳ Summary statistics of each value

➡ Features based on educational theory

  ✳ Discrimination index of assessmentItemID, testId, and KnowledgeTag

  ✳ Discrimination index: (correct answers in the upper group - correct answers in the lower group) / (total test takers / 2)

➡ ELO rating

  ✳ A per-user rank score updated according to whether each answer is correct

  ✳ The rank score rises and falls according to problem difficulty

➡ 47 features generated in total

➡ Feature engineering details

image
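The discrimination index formula listed above can be sketched in a few lines of plain Python. This is only an illustration, not the team's implementation: the function name, the `(user_id, item_id, answer_code)` record layout, and the way the upper and lower groups are formed (ranking users by overall accuracy and splitting them in half) are assumptions, since the README does not say how the groups were chosen.

```python
from collections import defaultdict


def discrimination_index(records, item_id):
    """Discrimination index of one item.

    records: iterable of (user_id, item_id, answer_code), answer_code in {0, 1}.
    Assumption: users who attempted the item are ranked by their overall
    accuracy and split into an upper half and a lower half.
    """
    records = list(records)

    # Overall accuracy of every user across all items.
    correct = defaultdict(int)
    total = defaultdict(int)
    for user, _, answer in records:
        correct[user] += answer
        total[user] += 1
    accuracy = {user: correct[user] / total[user] for user in total}

    # Each user's (last recorded) answer to the target item.
    answers = {user: answer for user, item, answer in records if item == item_id}
    if len(answers) < 2:
        return 0.0

    ranked = sorted(answers, key=lambda user: accuracy[user], reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]

    upper_correct = sum(answers[user] for user in upper)
    lower_correct = sum(answers[user] for user in lower)
    # (upper correct - lower correct) / (total test takers / 2)
    return (upper_correct - lower_correct) / (len(ranked) / 2)
```

With an odd number of test takers the middle user falls in neither group, while the denominator still uses the full count, as in the formula. The same function can be applied to testId or KnowledgeTag by passing that column as the item identifier.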


3. Data augmentation

➡ Sliding Window (Stride = 10, 20, ..., 128)

➡ User month split (each user's records grouped by month)

➡ User testID grade split (each user's records grouped by test sheet)
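As a rough illustration of the sliding-window augmentation, the sketch below cuts one user's interaction sequence into overlapping windows. The function name and the `window_size` parameter are hypothetical; the README only lists the stride values, so the window length and the handling of the leftover tail are assumptions.

```python
def sliding_window_augment(sequence, window_size, stride):
    """Split one user's interaction sequence into overlapping windows.

    Sequences no longer than window_size are returned unchanged as a single
    window. Assumption: when the stride does not land exactly on the final
    window, an extra window aligned to the end of the sequence is added so
    that the most recent interactions are not lost.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if len(sequence) <= window_size:
        return [list(sequence)]

    windows = []
    last_start = len(sequence) - window_size
    for start in range(0, last_start + 1, stride):
        windows.append(list(sequence[start:start + window_size]))
    if last_start % stride != 0:
        windows.append(list(sequence[last_start:]))
    return windows
```

A smaller stride yields more, heavily overlapping training samples per user; a stride equal to the window size yields non-overlapping chunks.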


4. Model

➡ Decision-tree models: LGBM, XGBoost, CatBoost

➡ NN models: LSTM, LSTM with Attention, BERT, SAINT, GPT-2, LastQuery_pre/post

image4


5. Cross validation strategy

A large shake-up in the previous stage cost us a big drop in score, so we paid closer attention to the validation strategy.

➡ UserID split

  ✳ k-fold split by userID

➡ Validation by grade

  ✳ Extract each user's representative grade and run K-fold so that every fold keeps the grade proportions

  ✳ e.g. A030071005: in testId and assessmentItemID, the first three characters give the grade

  ✳ Details

We found that a user's grade is not always fixed (e.g. userID 315 solves problems from grades 3, 4, and 7), which makes it hard to assign a single grade to a user. To resolve this, we set each user's representative grade to the grade that appears most often in that user's records.

Checking the data distribution by representative grade showed that the Train set and the Test set have similar distributions.

image5
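The representative-grade idea and the grade-aware user split can be sketched as follows. This is a simplified stand-in, not the team's code: the function names and the `(user_id, test_id)` record layout are hypothetical, the grade is assumed to be the two digits after the leading letter (so `A030071005` gives grade 3), and users are dealt round-robin per grade instead of using a library splitter.

```python
from collections import Counter, defaultdict


def grade_of(test_id):
    """Assumed encoding: the two digits after the leading letter are the grade."""
    return int(test_id[1:3])


def representative_grades(records):
    """Map each user to the grade that appears most often in their records.

    records: iterable of (user_id, test_id) pairs. On a tie, the grade seen
    first wins.
    """
    counts = defaultdict(Counter)
    for user, test_id in records:
        counts[user][grade_of(test_id)] += 1
    return {user: c.most_common(1)[0][0] for user, c in counts.items()}


def grade_stratified_folds(user_grades, n_splits):
    """Assign users to folds so that every fold has a similar grade mix.

    Users are ordered by grade and dealt round-robin across the folds. All
    rows of a user stay in one fold, so this is also a userID split.
    """
    by_grade = defaultdict(list)
    for user in sorted(user_grades):
        by_grade[user_grades[user]].append(user)

    folds = [[] for _ in range(n_splits)]
    position = 0
    for grade in sorted(by_grade):
        for user in by_grade[grade]:
            folds[position % n_splits].append(user)
            position += 1
    return folds
```

Each returned fold is a list of user IDs; the rows of those users form the validation set for that fold and the remaining users form the training set.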


6. Miscellaneous

➡ Hyperparameter tuning - Optuna

➡ PCA

  ✳ Ran principal component analysis with 40 features as input

  ✳ PCA results

image10

At least 20 principal components were needed to explain 94% or more of the variance.

Training and running inference with 20 principal components gave a validation AUC of 0.8136.
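The component count can be read off the cumulative explained variance. The helper below is a minimal sketch of that step only; the function name is hypothetical, and the input is assumed to be the per-component variances produced by whatever PCA implementation was used (for example the `explained_variance_` attribute of scikit-learn's `PCA`).

```python
from itertools import accumulate


def components_for_variance(explained_variance, threshold):
    """Smallest number of leading components whose cumulative share of the
    total variance reaches `threshold` (e.g. 0.94)."""
    total = sum(explained_variance)
    ordered = sorted(explained_variance, reverse=True)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total >= threshold:
            return k
    return len(ordered)
```

Calling it with the 40 component variances and a threshold of 0.94 would return the number of components to keep.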

➡ Ensemble

  ✳ Soft voting

Sum the probabilities that the classifiers assign to each label, average them, and select the label with the highest average probability as the final voting result.

image8

Because the evaluation metric of the DKT competition is AUC, submissions are probabilities rather than class labels, so we only went as far as averaging the per-model prediction values. The resulting PB LB score was lower than that of the single model.

We attributed the drop to the models producing divergent predictions and concluded that OOF stacking should be tried.


✳ Hard voting

Majority vote: the prediction is derived from the class label predicted by the largest number of models.

Because the evaluation metric of the DKT competition is AUC, submissions are probabilities rather than class labels, so we submitted the average of the prediction values of the models that predicted the majority class label. As with soft voting, the PB LB score was lower than that of the single model.
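A minimal sketch of this majority-then-average variant is shown below. The function name, the 0.5 threshold, and the tie rule (average all models when the vote is split evenly) are assumptions; the README does not say how ties were handled.

```python
def majority_average(model_predictions, threshold=0.5):
    """For each sample, average only the predictions of the models that
    voted for the majority class.

    Assumption: on a tie, the predictions of all models are averaged.
    """
    blended = []
    for sample in zip(*model_predictions):
        up = [p for p in sample if p >= threshold]
        down = [p for p in sample if p < threshold]
        if len(up) > len(down):
            winners = up
        elif len(down) > len(up):
            winners = down
        else:
            winners = list(sample)
        blended.append(sum(winners) / len(winners))
    return blended
```

Compared with plain averaging, this pushes each blended probability further toward the side most models agree on.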


✳ oof_stacking

The soft voting result above showed that the predictions of the NN-based models and the tree-based models differ. We understood OOF stacking, which ensembles through a meta model, to be effective when predictions differ like this, so we set out to apply it.

image8

image8
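The core of OOF stacking is that every training sample is predicted by a base model that did not see it, and those out-of-fold predictions become the input features of the meta model. The sketch below shows only that step in plain Python. The function names and the `fit(features, labels)` callable interface are hypothetical, and the folds are assigned by row index for brevity, whereas this project would split by userID.

```python
def out_of_fold_predictions(fit, features, labels, n_splits):
    """Out-of-fold predictions of one base model.

    fit(train_features, train_labels) must return a function that maps one
    feature row to a probability. Every sample is predicted by a model that
    never saw it during training.
    """
    n = len(features)
    oof = [None] * n
    for fold in range(n_splits):
        valid_idx = [i for i in range(n) if i % n_splits == fold]
        train_idx = [i for i in range(n) if i % n_splits != fold]
        predict = fit([features[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        for i in valid_idx:
            oof[i] = predict(features[i])
    return oof


def stack_features(base_fits, features, labels, n_splits):
    """One row per sample, one column per base model: input for a meta model."""
    columns = [out_of_fold_predictions(fit, features, labels, n_splits)
               for fit in base_fits]
    return [list(row) for row in zip(*columns)]
```

A meta model (for example a logistic regression) is then trained on the rows returned by `stack_features` together with the original labels, and at inference time it receives the base models' test predictions in the same column order.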


✳ Priority Max Ensemble

Among the top four predictions, we gave priority to the one with the highest accuracy and ensembled by taking the max of the four predictions. We chose this method expecting it to raise AUC while preserving accuracy.

import pandas as pd

# pd_list: list of four DataFrames, each with a 'prediction' column of
# probabilities (assumed to be loaded beforehand).
# pd_list[1] is the prediction with the highest accuracy ("prediction 1").
BEST = 1

rows = []
for i in range(len(pd_list[0])):
    preds = [df['prediction'][i] for df in pd_list[:4]]

    # Split the four predictions at the 0.5 threshold.
    up = [p for p in preds if p >= 0.5]
    down = [p for p in preds if p < 0.5]

    if up and down:
        # The models disagree on the class: keep the group that contains
        # prediction 1 and take the max within that group to raise AUC.
        m = max(up) if preds[BEST] >= 0.5 else max(down)
    else:
        # All four fall on the same side: take the max of all predictions.
        m = max(preds)

    rows.append(m)

new_df = pd.DataFrame({'prediction': rows})


7. Web serving

Built "Mini 수능" (Mini CSAT), a simple SPA web app that uses the models created above

Additional technologies used: Flask, HTML, CSS, JavaScript

7.1 Web serving architecture diagram

image8

7.2 Demo

image8




Reference

Deep Knowledge Tracing

BERT

Bayesian Optimization

Saint+

EGNET+KT1


About

p4-dkt-ollehdkt created by GitHub Classroom
