Parsing the xP3 dataset

Q1: I am trying to extract the Arabic instructions from the xP3 dataset, and I want to put them in the format: “Instruction”, “Input”, and “Output”. Currently, the data is in this format: “inputs” and “targets”.

I found that the instructions sometimes are in the last part of the “inputs” and preceded by \n, and sometimes without any delimiter. In other cases, the instructions are at the beginning of the “inputs”, etc.

Here is an example where the instruction is at the end of the input, but without any delimiter to recognize it.
File: xp3_GEM_xlsum_arabic_train_xp3longrest.jsonl
{"inputs":"...\nووسط هذه القلة يقف أيضا شقيقيها الفنان فيصل لعيبي، الذي أثر كثيرا في تطورها الفني وباتت تشكل معه ثنائيا فنيا مميزا، يجعل من أعمالهما في حوار دائم، فتحمل كثيرا من الوشائج والتشابهات الأسلوبية والشكلية لكنها تفترق في التوجه. ففي الوقت الذي يسعى فيصل إلى تأصيل فنه في قلب منجز الرسم العراقي بالتركيز على الخصوصية العراقية واللمسة المحلية والنهل من التراث الفني الرافديني في مراحله المختلفة وعكسه بلغة فنية حداثوية معاصرة، تسعى عفيفة إلى تمييز نفسها عنه بالتحليق في فضاء إنساني عام، مبعدة لوحاتها عن أي ... Write the rest of the article:","targets":"حوارا مع نظرتها ويركز عليها ليكتشف أنها تنظر في مكان آخر أو ربما في ماضٍ بعيد. \n\nوتعنى عفيفة باختيار

Q2: I found many incomplete inputs and outputs, ex: having this string:
"... Continue the article for another 4000 characters max:","targets":"."}
What should we do in such cases?
Thanks
Hamdy Mubarak

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Parsing the xP3 dataset #23

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Parsing the xP3 dataset #23

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions