
ROUGE scores are not well calculated #4

@ghaddarAbs

Description

Hi,

Thanks for sharing the code and models of your great paper.

I think the ROUGE scores for the text summarization task in your paper are miscalculated.
The bug lies in these lines:

preds = ["\\n \\n ".join(nltk.sent_tokenize(pred)) if len(nltk.sent_tokenize(pred))> 1 else pred+"\\n" for pred in preds]
labels = ["\\n \\n ".join(nltk.sent_tokenize(label)) if len(nltk.sent_tokenize(label))> 1 else label+"\\n" for label in labels]

Let me explain. First, the rouge_score package (which is embedded in the HF datasets library) doesn't work on Arabic text. Here is a simple example where the reference and the prediction are exactly the same:

from rouge_score import rouge_scorer

gold = "اختر العمر المستقبلي."
pred = "اختر العمر المستقبلي."
rouge_types = ["rouge1", "rouge2", "rougeL"]
scorer = rouge_scorer.RougeScorer(rouge_types=rouge_types, use_stemmer=False)
score = scorer.score(gold, pred)
print({key: value.fmeasure * 100 for key, value in score.items()}) # {'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0}

This happens because the default tokenizer of the Google rouge wrapper deletes all non-ASCII-alphanumeric characters (see comment 2 for a solution).
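To see why, here is a minimal sketch of that filtering. It assumes the default tokenizer lowercases the text and keeps only runs of ASCII letters and digits; default_like_tokenize below is a hypothetical approximation for illustration, not the library's actual function:

import re

def default_like_tokenize(text):
    # Hypothetical approximation of rouge_score's default tokenizer:
    # lowercase, then keep only runs of ASCII letters and digits.
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

print(default_like_tokenize("police kill the gunman"))
# ['police', 'kill', 'the', 'gunman']

print(default_like_tokenize("اختر العمر المستقبلي."))
# [] -> every Arabic character is dropped, so the scorer compares two empty token lists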

However, rouge works well on English:

from rouge_score import rouge_scorer

gold = "police kill the gunman"
pred = "police kill the gunman"
rouge_types = ["rouge1", "rouge2", "rougeL"]
scorer = rouge_scorer.RougeScorer(rouge_types=rouge_types, use_stemmer=False)
score = scorer.score(gold, pred)
print({key: value.fmeasure * 100 for key, value in score.items()}) #{'rouge1': 100.0, 'rouge2': 100.0, 'rougeL': 100.0}

gold = "police kill the gunman"
pred = "police killed the gunman"
score = scorer.score(gold, pred)
print({key: value.fmeasure * 100 for key, value in score.items()}) #{'rouge1': 75.0, 'rouge2': 33.33333333333333, 'rougeL': 75.0}

In your code, you commented out these 2 lines because they give scores around 1%-2% (I will explain why later):

# preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
# labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

and replaced them with these 2 lines:

preds = ["\\n \\n ".join(nltk.sent_tokenize(pred)) if len(nltk.sent_tokenize(pred))> 1 else pred+"\\n" for pred in preds]
labels = ["\\n \\n ".join(nltk.sent_tokenize(label)) if len(nltk.sent_tokenize(label))> 1 else label+"\\n" for label in labels]

What this actually measures is the overlap of the literal \\n \\n separators between the reference and the prediction, i.e. roughly the ratio of their sentence counts.
Here is a simple example where the gold reference has 2 sentences and the prediction has 4 sentences:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

gold = "اختر العمر المستقبلي. كن عفويا."
pred = "ابحث عن العمر الذي تريد أن تكونه في المستقبل. تحدث عن نفسك الحالية. فكر في قيمك. فكر في الأشياء التي تجيدها."

# No line break between sentences
score = scorer.score(gold, pred)
print({key: value.fmeasure * 100 for key, value in score.items()}) # {'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0}

# Correct way to add line breaks between sentences
gold_2 = gold.replace(". ", ".\n")
pred_2 = pred.replace(". ", ".\n")
score = scorer.score(gold_2, pred_2)
print({key: value.fmeasure * 100 for key, value in score.items()}) # {'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0}

# Adding line breaks between sentences with your method
gold_3 = gold.replace(". ", ".\\n \\n ")
pred_3 = pred.replace(". ", ".\\n \\n ")
score = scorer.score(gold_3, pred_3)
print({key: value.fmeasure * 100 for key, value in score.items()}) # {'rouge1': 50.0, 'rouge2': 33.333333333333336, 'rougeL': 50.0}

print(gold_3)  # اختر العمر المستقبلي.\n \n كن عفويا.
print(pred_3)  # ابحث عن العمر الذي تريد أن تكونه في المستقبل.\n \n تحدث عن نفسك الحالية.\n \n فكر في قيمك.\n \n فكر في الأشياء التي تجيدها.

As you can see, in the <gold_3, pred_3> example the ROUGE score of 50 appears only because the prediction contains 4 sentences (3 literal \\n \\n separators) while the reference contains 2 (1 separator). The 33% comes from those separators as well: the reference has 1 and the prediction has 3, which gives roughly 1/3.
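To make the arithmetic concrete, here is a sketch of what the scorer actually sees after this joining scheme, again using the hypothetical default_like_tokenize approximation from above: only the literal n characters of the escaped \\n survive tokenization.

import re

def default_like_tokenize(text):
    # Hypothetical approximation of the default tokenizer (see the sketch above).
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

gold_3 = "اختر العمر المستقبلي.\\n \\n كن عفويا."
pred_3 = ("ابحث عن العمر الذي تريد أن تكونه في المستقبل.\\n \\n تحدث عن نفسك الحالية.\\n \\n "
          "فكر في قيمك.\\n \\n فكر في الأشياء التي تجيدها.")

print(default_like_tokenize(gold_3))  # ['n', 'n']                      -> 2 sentences, 1 separator
print(default_like_tokenize(pred_3))  # ['n', 'n', 'n', 'n', 'n', 'n']  -> 4 sentences, 3 separators

# ROUGE-1: 2 matching "n" unigrams -> precision 2/6, recall 2/2, F1 = 0.5  (the reported 50.0)
# ROUGE-2: 1 matching ("n", "n") bigram out of 5 in the prediction
#          -> precision 1/5, recall 1/1, F1 = 1/3                          (the reported 33.33)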

In fact, I re-ran the text summarization experiments internally using your models and mine, and found that the results are comparable to your paper when using your method. On the other hand, when adding \n correctly, the scores are between 1% and 2%. The ROUGE scores are non-zero only when the reference and the prediction contain the same English words, which happens rarely.

In fact, the results in Tables 7 and B2 essentially measure count(sent_num_ref) / count(sent_num_pred).
Don't get me wrong, your models are good, they just need to be evaluated correctly (see comment 2).
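Independently of whatever comment 2 proposes, here is one possible workaround sketch for the Arabic examples above. It assumes a recent rouge_score version whose RougeScorer accepts a tokenizer argument (an object exposing a tokenize(text) method); not all versions do, and WhitespaceTokenizer below is a hypothetical stand-in rather than part of the library:

from rouge_score import rouge_scorer

class WhitespaceTokenizer:
    # Hypothetical tokenizer: split on whitespace so Arabic tokens are kept.
    def tokenize(self, text):
        return text.split()

gold = "اختر العمر المستقبلي."
pred = "اختر العمر المستقبلي."
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=False,
                                  tokenizer=WhitespaceTokenizer())
score = scorer.score(gold, pred)
print({key: value.fmeasure * 100 for key, value in score.items()})
# Identical Arabic reference and prediction should now score 100.0 instead of 0.0.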

It would be great if you could fix your code and adjust the numbers in Tables 7 and B2 in your paper.

Thanks
