Korean version of GoEmotions Dataset 😍😒😱

monologg/GoEmotions-Korean

GoEmotions-Korean

The GoEmotions dataset translated into Korean, then used to fine-tune KoELECTRA

Updates

June 19, 2020 - With Transformers v2.9.1, special tokens such as [NAME] and [RELIGION] that were added during training were not applied when the model was loaded again in the pipeline. This issue was resolved in Transformers v2.11.0.

Feb 9, 2021 - Retrained with KoELECTRA-v1 and KoELECTRA-v3 on Transformers v3.5.1 and uploaded the new models.

GoEmotions

A dataset of 58,000 Reddit comments labeled with 28 emotions

  • admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, neutral

Requirements

  • torch==1.7.1
  • transformers==3.5.1
  • googletrans==2.4.1
  • attrdict==2.0.1
$ pip3 install -r requirements.txt

Translated Data

🚨 Because the data comes from Reddit comments, the quality of the translated output is poor. 🚨

  • The Korean data was generated with pygoogletrans.
    • pygoogletrans v2.4.1 has not been published to PyPI, so installing the library directly from its repository is recommended (this is specified in requirements.txt).
  • A 1.5-second interval was placed between API calls.
    • Because a single request accepts at most 5,000 characters, sentences were joined with \r\n and sent as one input.
  • Zero-width spaces (U+200B) inside a sentence caused the translation to fail, so they were removed.
  • The translated data is already in the data directory. To run the translation yourself, run the commands below.
$ bash download_original_data.sh
$ pip3 install git+git://github.com/ssut/py-googletrans
$ python3 tranlate_data.py
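
The batching described above can be sketched roughly as follows. This is not the code from tranlate_data.py; the function names are made up for illustration, and `translate_fn` is a hypothetical callable (for example a thin wrapper around a translation client) that is assumed to preserve the `\r\n` line breaks in its output.

```python
import time

ZERO_WIDTH_SPACE = "\u200b"
SEPARATOR = "\r\n"
MAX_CHARS = 5000  # per-request character limit quoted in the text


def clean(sentence):
    """Remove zero-width spaces, which reportedly make the translation fail."""
    return sentence.replace(ZERO_WIDTH_SPACE, "")


def make_batches(sentences, max_chars=MAX_CHARS, sep=SEPARATOR):
    """Greedily join cleaned sentences with `sep`, keeping each batch within max_chars.

    A single sentence longer than max_chars still becomes its own batch.
    """
    batches, current, length = [], [], 0
    for sentence in map(clean, sentences):
        # Cost of appending: the sentence, plus a separator if the batch is non-empty.
        added = len(sentence) + (len(sep) if current else 0)
        if current and length + added > max_chars:
            batches.append(sep.join(current))
            current, length = [], 0
            added = len(sentence)
        current.append(sentence)
        length += added
    if current:
        batches.append(sep.join(current))
    return batches


def translate_all(sentences, translate_fn, delay=1.5):
    """Translate batch by batch, sleeping `delay` seconds between calls.

    translate_fn is a hypothetical str -> str callable (an assumption, not a real API).
    """
    results = []
    for i, batch in enumerate(make_batches(sentences)):
        if i:
            time.sleep(delay)
        results.extend(translate_fn(batch).split(SEPARATOR))
    return results
```

Passing the translator in as a callable keeps the batching logic testable without network access.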

Tokenizer

  • 데이터셋에 [NAME], [RELIGION]의 Special Token이 μ‘΄μž¬ν•˜μ—¬, 이λ₯Ό vocab.txt의 [unused0]와 [unused1]에 각각 ν• λ‹Ήν•˜μ˜€μŠ΅λ‹ˆλ‹€.

Train & Evaluation

  • Multi-label classification with a sigmoid applied to the outputs (the threshold is set to 0.3)
    • See ElectraForMultiLabelClassification in model.py
  • To change the configuration, edit the json files in the config directory.
$ python3 run_goemotions.py --config_file koelectra-base.json
$ python3 run_goemotions.py --config_file koelectra-small.json
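
The sigmoid-plus-threshold decision can be sketched in plain Python as below. This is only an illustration of the idea, not the code in model.py. The function names and example logits are made up, and the strict `>` comparison is an assumption, since the text does not say whether a score exactly at the threshold counts.

```python
import math


def sigmoid(x):
    """Map a logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, label_names, threshold=0.3):
    """Return (label, probability) pairs whose probability exceeds the threshold.

    Each label is decided independently, so zero, one, or several labels may be returned.
    """
    probs = [sigmoid(z) for z in logits]
    return [(name, p) for name, p in zip(label_names, probs) if p > threshold]


# Made-up logits for three of the 28 labels.
print(predict_labels([2.0, -3.0, -0.5], ["joy", "anger", "neutral"]))
```

With these logits, joy (about 0.88) and neutral (about 0.38) pass the 0.3 threshold, while anger (about 0.05) does not.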

Results

Results are measured by Macro F1 (best result for each model).

| Macro F1 (%)       | Dev   | Test  |
| ------------------ | ----- | ----- |
| KoELECTRA-small-v1 | 39.99 | 41.02 |
| KoELECTRA-base-v1  | 42.18 | 44.03 |
| KoELECTRA-small-v3 | 40.27 | 40.85 |
| KoELECTRA-base-v3  | 42.85 | 42.28 |
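
For reference, Macro F1 in a multi-label setting is the unweighted mean of the per-label F1 scores. The sketch below is a minimal pure-Python version and is not necessarily how the repository computes it. Scoring a label with no true or predicted positives as 0.0 is an assumption of this example.

```python
def macro_f1(y_true, y_pred):
    """y_true and y_pred are lists of equal-length 0/1 rows, one column per label."""
    num_labels = len(y_true[0])
    scores = []
    for j in range(num_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # Assumption: a label with no positives at all contributes 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels


# Toy example with three samples and two labels.
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1]]
print(round(macro_f1(y_true, y_pred), 4))  # 0.8333
```

In the toy example the first label has F1 = 2/3 (one hit, one miss) and the second has F1 = 1, so the macro average is 5/6.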

Pipeline

  • A new MultiLabelPipeline class was written to support inference for multi-label classification.
  • The models were uploaded to Huggingface s3.
    • monologg/koelectra-small-v1-goemotions
    • monologg/koelectra-base-v1-goemotions
    • monologg/koelectra-small-v3-goemotions
    • monologg/koelectra-base-v3-goemotions
from multilabel_pipeline import MultiLabelPipeline
from transformers import ElectraTokenizer
from model import ElectraForMultiLabelClassification
from pprint import pprint


tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-goemotions")
model = ElectraForMultiLabelClassification.from_pretrained("monologg/koelectra-base-v3-goemotions")

goemotions = MultiLabelPipeline(
    model=model,
    tokenizer=tokenizer,
    threshold=0.3
)

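texts = [
    "전혀 재미 있지 않습니다 ...",  # "It's not fun at all ..."
    "나는 “지금 가장 큰 두려움은 내 상자 안에 사는 것” 이라고 말했다.",  # I said, "My biggest fear right now is living in my box."
    "곱창... 한시간반 기다릴 맛은 아님!",  # "Gopchang... not worth a 90-minute wait!"
    "애정하는 공간을 애정하는 사람들로 채울때",  # "When you fill a beloved space with beloved people"
    "너무 좋아",  # "I like it so much"
    "딥러닝을 짝사랑중인 학생입니다!",  # "I'm a student with a one-sided crush on deep learning!"
    "마음이 급해진다.",  # "I'm starting to feel rushed."
    "아니 진짜 다들 미쳤나봨ㅋㅋㅋ",  # "Seriously, everyone must have gone crazy lol"
    "개노잼"  # slang: "Totally boring"
]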

pprint(goemotions(texts))

# Output
[{'labels': ['disapproval'], 'scores': [0.97151965]},
 {'labels': ['fear'], 'scores': [0.9519822]},
 {'labels': ['disapproval', 'neutral'], 'scores': [0.452921, 0.5345312]},
 {'labels': ['love'], 'scores': [0.8750478]},
 {'labels': ['admiration'], 'scores': [0.93127275]},
 {'labels': ['love'], 'scores': [0.9093589]},
 {'labels': ['nervousness', 'neutral'], 'scores': [0.76960915, 0.33462417]},
 {'labels': ['disapproval'], 'scores': [0.95657086]},
 {'labels': ['annoyance', 'disgust'], 'scores': [0.39240348, 0.7896941]}]
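
When a single label per text is needed, the dictionaries shown above can be reduced to their highest-scoring entry. The helper below is a hypothetical convenience function, not part of the repository. It assumes each result has parallel `labels` and `scores` lists, as in the output above, and returns `None` if both are empty.

```python
def top_label(result):
    """Return the (label, score) pair with the highest score, or None if there are no labels."""
    pairs = list(zip(result["labels"], result["scores"]))
    return max(pairs, key=lambda pair: pair[1]) if pairs else None


# One entry copied from the output above.
sample = {'labels': ['disapproval', 'neutral'], 'scores': [0.452921, 0.5345312]}
print(top_label(sample))  # ('neutral', 0.5345312)
```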
