NLP) Metrics - Writing
Outline of Metrics used in NLP
This post is an outline of the metrics used in NLP, so each entry gives only the name, the category, some relation to other metrics, and a short description. I will write a separate post about each metric and link them from this post.
* Two Terms of Evaluation
- Intrinsic Evaluation: Evaluation by humans. It is often treated as the most reliable form of evaluation. However, the results vary depending on who does the scoring, because there is no clear grading standard, especially in NLP. Furthermore, it costs a lot of time and money.
- Extrinsic Evaluation: Evaluation by a score metric such as BLEU or ROUGE. Since it runs as a program, mass evaluation and automation are possible, with advantages in cost and time. However, there is no guarantee that the metric scores agree with, or even approximate, human judgments.
In this first sense, the two terms differ in how the model is evaluated: by humans or by a program. The metrics I introduce below are the automatic ones used in NLP tasks.
The two words are also used with a different meaning. When evaluating the models inside an application, intrinsic evaluation focuses on intermediary objectives, that is, the performance of an NLP component on a defined subtask, while extrinsic evaluation focuses on the final objective, that is, the performance of the component within the complete application. For example, a text summarization application may consist of a word embedding model and a summarization model; under intrinsic evaluation, each model in the application is evaluated separately. In yet another usage, what matters is whether a comparison with other models is included: intrinsic evaluation evaluates the model independently, whereas extrinsic evaluation consists of a performance comparison with other models.
[1] Confusion Matrix
1. Accuracy
2. Precision
3. Recall
4. F1-Score
5. AUC
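All four of the first metrics fall out of the four cells of a binary confusion matrix, so a tiny sketch makes the relations concrete (the counts below are made-up toy values):

```python
# Accuracy, precision, recall, and F1 from the four cells of a binary
# confusion matrix. The counts are invented toy values.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + fp + fn + tn)         # all correct / all cases
precision = tp / (tp + fp)                          # predicted positives that are real
recall    = tp / (tp + fn)                          # real positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R

print(f"acc={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

AUC is the odd one out: it needs the model's scores rather than hard predictions, because it sweeps the decision threshold and measures the area under the resulting ROC curve.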
[2] Statistics
1. MRR
2. MAP
3. RMSE
4. MAPE
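To show how these behave, here is a small sketch computing MRR on ranked retrieval results and RMSE/MAPE on regression-style outputs (all inputs are invented toy data; MAP is omitted because it needs the full ranked list per query):

```python
import math

# MRR: mean of 1/rank of the first relevant result per query.
# The 1-based ranks below are invented toy values for three queries.
first_relevant_ranks = [1, 3, 2]
mrr = sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# RMSE and MAPE compare numeric predictions against true values (toy data).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.5, 2.0, 8.0]
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
mape = 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"MRR={mrr:.3f} RMSE={rmse:.3f} MAPE={mape:.1f}%")
```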
[3] BLEU
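BLEU scores the n-gram precision of a candidate sentence against one or more references, with a brevity penalty for short outputs. A minimal sketch using NLTK's implementation (the sentences are invented examples, and I assume NLTK is installed):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One tokenized reference and one tokenized candidate (invented examples).
reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# Default BLEU-4; smoothing keeps short sentences from scoring zero.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```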
[4] Alternatives to BLEU
1. METEOR
2. PPL
3. STM
4. RIBES
5. MEWR
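Among these, perplexity (PPL) is easy to illustrate: it is the exponentiated average negative log-likelihood a language model assigns to a sequence, so lower is better. A toy sketch with invented per-token probabilities:

```python
import math

# Invented probabilities that some language model assigned to each token.
token_probs = [0.2, 0.1, 0.4, 0.25]

# Perplexity = exp of the average negative log-likelihood.
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(nll)
print(f"PPL = {ppl:.2f}")
```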
[5] ROUGE
1. ROUGE-N
2. ROUGE-L
3. ROUGE-W
4. ROUGE-S
5. ROUGE-SU
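At its core, ROUGE-N is n-gram recall of the reference summary against the candidate. A minimal ROUGE-N sketch in plain Python (the sentences are invented; real packages add stemming and the L/W/S/SU variants):

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    """Sketch of ROUGE-N recall: overlapping n-grams / reference n-grams."""
    ref_ngrams = Counter(zip(*[reference[i:] for i in range(n)]))
    cand_ngrams = Counter(zip(*[candidate[i:] for i in range(n)]))
    overlap = sum((ref_ngrams & cand_ngrams).values())
    return overlap / max(sum(ref_ngrams.values()), 1)

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(f"ROUGE-1 = {rouge_n(reference, candidate, n=1):.3f}")  # 5/6
```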
[6] Alternatives to ROUGE
1. ROUGE-WE
2. ParaEval
3. ROUGE 2.0
[7] RDASS
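As I understand the RDASS (Reference and Document Aware Semantic Score) proposal, it embeds the predicted summary, the reference summary, and the source document with a sentence encoder, then averages the prediction-reference and prediction-document cosine similarities. A hedged sketch of that idea (the sentence-transformers package and the model name are my assumptions, not the encoder from the original work):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence encoder works for the sketch; this model name is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rdass(prediction, reference, document):
    # Average of prediction-reference and prediction-document similarity.
    p, r, d = model.encode([prediction, reference, document])
    return (cos(p, r) + cos(p, d)) / 2

print(rdass("toy predicted summary",
            "toy reference summary",
            "toy source document text"))
```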