Support to Annotate Arabic #774
Hi there, I'm working in Arabic NLP and am very interested in helping this tool support Arabic. I believe that if this happens, the tool will get many citations, as research on Arabic NLP is flourishing these days and has become important. Regarding supporting the ASCII transliteration (Buckwalter) instead of the actual Arabic glyphs, I believe this is not a good choice. As you know, the transliteration is hard to read, especially for someone working on the annotation task. The optimal choice is to support RTL with Arabic glyphs. Please feel free to contact me, as I'd be happy to be involved. Fahd |
Hi @fsalotaibi, Thanks for your interest in brat! We're happy to welcome any contribution to Arabic support in brat, and would much appreciate your help on this feature. For rendering the actual glyphs in brat, as a first step, we would need to know how to create an SVG document with Arabic that renders correctly in at least some major browser. If you can look into this, it would be very helpful if you could try exporting an SVG with Arabic from brat (from Data->Visualization->SVG) and see if you can edit it to render correctly. |
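As a starting point for the SVG experiment suggested above, here is a minimal sketch that writes a standalone SVG with a single right-to-left text element. It is not taken from brat's export; the function name `rtl_svg`, the canvas size, and the font size are assumptions for illustration. It relies only on the standard SVG `direction`, `unicode-bidi`, and `text-anchor` properties, and whether a given browser renders the result correctly still has to be checked by hand.

```python
from xml.sax.saxutils import escape


def rtl_svg(text, width=400, height=60):
    """Return a minimal SVG document with one right-to-left text element.

    Hypothetical helper for experimenting; not part of brat.
    """
    # With direction="rtl", text-anchor="start" aligns the right-hand side
    # of the text with x, so x is placed near the right edge of the canvas.
    x = width - 10
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<text x="{x}" y="{height // 2}" direction="rtl" '
        f'unicode-bidi="embed" text-anchor="start" font-size="24">'
        f"{escape(text)}</text></svg>"
    )


svg = rtl_svg("كربون-12")
# Save `svg` to a .svg file and open it in a browser to inspect the rendering.
```

Comparing this hand-written file against what brat exports should show which attributes or coordinates differ.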
Yeah, Firefox might do the right thing when rendering HTML. However, note that we're laying out each word separately by drawing it onto the SVG canvas; so we do not have access to Firefox's heuristics. We need to know the order of the spans. So in the case of
and it should also be the order in which the elements are set in SVG (for copy/paste purposes); but the coordinates on the screen (and ultimately the visual effect) needs to be (seen from left to right):
|
@amadanmath : do I understand correctly that this last issue you mention is that it would be necessary to reverse the RTL order for parts of the document that do not use Arabic glyphs? If we were to assume that there are no such strings (i.e. everything is RTL) or that the text input has already reversed these appropriately, would this substantially ease the task? |
Yes, I suppose that's what I'm saying. Note that for the copy-paste to work properly you'd need to make sure that only the coordinates are reshuffled, but the order in which they're put into SVG is not. I believe a good algorithm might be: lay all chunks out as they appear (showing them RTL); then find sequences of LTR chunks in the same row, and recalculate their coordinates so that they appear in the reverse order, without changing anything else. Obviously if there's no RTL text, the task is easier. Still not easy, since we have a bunch of places where the assumption is LTR. Also, I'm still not 100% convinced I'd know how to tell LTR chunks from RTL ones. It may never happen, I don't know, but say you have "كربون-12" ("carbon-12"), and someone annotates "بون-1" ("bon-1"). You can see that it visually becomes a discontinuous span (but it is not discontinuous byte-wise). The chunk is "كربون-12", but it's neither LTR nor RTL - it's hybrid. |
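The algorithm described in this comment can be sketched roughly as follows. This is a simplified illustration in Python, not brat's client code: the chunk dictionaries (`text`, `row`, `x`, `width`) and both function names are hypothetical, and treating digits as left-to-right is a simplifying assumption rather than the full Unicode bidirectional algorithm. Only coordinates are recalculated; the order of the chunks in the list (and so in the SVG) is left untouched.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}        # strong right-to-left characters
LTR_CLASSES = {"L", "EN", "AN"}  # letters plus digits (simplifying assumption)


def chunk_direction(text):
    """Classify a chunk as 'rtl', 'ltr', 'mixed' or 'neutral'."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    has_rtl = bool(classes & RTL_CLASSES)
    has_ltr = bool(classes & LTR_CLASSES)
    if has_rtl and has_ltr:
        return "mixed"
    if has_rtl:
        return "rtl"
    if has_ltr:
        return "ltr"
    return "neutral"


def reverse_ltr_runs(chunks):
    """Recompute x for runs of LTR chunks in a row that was laid out RTL.

    `chunks` is in logical (document) order; each has 'text', 'row',
    'x' (left edge) and 'width'. The list order is never changed.
    """
    out = [dict(c) for c in chunks]
    i = 0
    while i < len(out):
        if chunk_direction(out[i]["text"]) != "ltr":
            i += 1
            continue
        # Extend the run while the next chunk is LTR and in the same row.
        j = i
        while (j + 1 < len(out)
               and out[j + 1]["row"] == out[i]["row"]
               and chunk_direction(out[j + 1]["text"]) == "ltr"):
            j += 1
        run = chunks[i:j + 1]                   # original coordinates
        cursor = min(c["x"] for c in run)       # left edge of the whole run
        for k, c in enumerate(run):
            out[i + k]["x"] = cursor
            if k + 1 < len(run):
                # Keep the original gap between neighbouring chunks.
                gap = c["x"] - (run[k + 1]["x"] + run[k + 1]["width"])
                cursor += c["width"] + gap
        i = j + 1
    return out


print(chunk_direction("كربون-12"))  # the hybrid chunk from the comment
```

A chunk classified as `mixed` is exactly the hybrid case raised above; this sketch leaves such chunks where the RTL layout put them and does not try to split them.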
Even though I don't really know about the client, the example you give sounds like it would take a lot of work to do right. If we want to get all that for the first iteration of Arabic support, I'm guessing it might be a while. Could the tool still be useful for annotating Arabic if we were to assume that everything is RTL? This would get cases like |
"It may never happen, I don't know, but say you have "كربون-12" ("carbon-12"), and someone annotates "بون-1" ("bon-1"). You can see that it visually becomes a discontinuous span (but it is not discontinuous byte-wise). The chunk is "كربون-12", but it's neither LTR nor RTL - it's hybrid." What researchers do when they want to annotate a piece of Arabic text is to tokenize it first as a preprocessing step. So it is not brat's duty to take care of proper tokenization. I believe no one will try to annotate something like ["كربون-12" ("carbon-12"), and someone annotates "بون-1" ("bon-1")]. The word order of mixed Arabic and English text is handled perfectly by Microsoft software such as Word. We could take inspiration from the same algorithm. But I'll give you a simple statistic that may convince you: So the mix of RTL and LTR would not be that serious for now (although it would be very powerful to support it), as the total number of English words is very small. I'm in doubt about the numbers and symbols. ** If you take the option of assuming everything is RTL, I'm happy to test it and give you feedback on the pros and cons. Fahd |
@fsalotaibi : thank you for the information and statistics! I believe it should make the initial implementation much easier if we can make the assumptions that 1) the text is pre-tokenized 2) everything is RTL. @amadanmath : what would implementing this require on the server side? |
I hope I'm not being annoying, but is there any news about supporting Arabic? I'm involved with others in building an Arabic NE corpus, and we are planning to start annotating in two weeks' time. I really support using this nice tool because of its functionality. Team members are still waiting for it as well. I'm afraid time will be the issue. I believe supporting an RTL language like this would be very good for the tool's reputation. ** I tried to modify the code, but I got stuck trying to understand how the glyph positions are calculated so that I could switch them to RTL. Can anyone point me to the right piece of code so I can try? |
@fsalotaibi : not annoying at all, thanks for reminding us! We have a few other features prioritized right now, but if you're willing to have a look at the code, we'd be happy to help. @amadanmath : could you provide some pointers on what would need to be changed to make this happen? |
@spyysalo: Thank you, I'm trying my best to understand how this could happen. It seems brat is a big project to understand in a short time. I only have two weeks before the annotation project starts, and I still support this tool within my team. @amadanmath : I worked on a prototype to illustrate what is needed to support Arabic:
The desired and proper way is shown in the following prototype: As you can see:
This is what we need for this stage. I'm not sure how difficult this work is. As I said earlier, I'm very happy to evaluate this work while the support is being implemented. I'm really excited about getting this tool to support Arabic. I believe this will open many doors for other researchers. |
@fsalotaibi : thank you for your efforts on this! I'm afraid I can't help myself on the technical aspects as I don't know the relevant part of the client code, but hopefully @amadanmath can. I agree this would be a valuable feature to have. For ease of reference, I'm placing your screenshots inline here (click on "GitHub Flavored Markdown" in the comment form for syntax):
|
Okay, some quick pointers: If you look at In it you will find the variable As the first step, these procedures would need to be reversed; if RTL language is rendered, start with the right edge (leaving the space for the sentence number), decrease I don't know what There's a bunch of things I am skipping over here, as the visualisation part is quite complex. |
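The first step described here (start at the right edge, leave room for the sentence number, and decrease the horizontal position as chunks are placed) might look something like the sketch below. It is written in Python for illustration and does not use brat's actual variables or functions; `layout_rtl`, `margin`, `gap`, and `row_height` are all hypothetical names and values.

```python
def layout_rtl(widths, canvas_width, margin=40, gap=4, row_height=20):
    """Place chunks right to left and wrap to a new row when one does not fit.

    `margin` is the space reserved at the right edge for the sentence
    number (an assumption). Returns (x, y) of each chunk's top-left
    corner, in the same order as `widths`.
    """
    row_start = canvas_width - margin
    right = row_start  # right boundary available to the next chunk
    y = 0
    positions = []
    for w in widths:
        # Wrap if the chunk would cross the left edge, unless the row is
        # still empty (an over-wide chunk is placed anyway).
        if right - w < 0 and right != row_start:
            y += row_height
            right = row_start
        x = right - w
        positions.append((x, y))
        right = x - gap  # move leftwards for the next chunk
    return positions


print(layout_rtl([30, 30, 30], canvas_width=100, margin=20, gap=5))
```

This only covers the text chunks; as the comment notes, the real visualisation has many more places that assume LTR, such as label and arc placement.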
Hi @fsalotaibi : I chanced on https://www.odesk.com/o/jobs/job/Modifying-Javascript-canvas-GUI_~~fb065ce0129fa79c/, which suggests that you found a way to implement Arabic support. Great! Would you be prepared to consider contributing the implementation of this feature back to brat, so that others in the user community could also benefit from it? |
I had no idea that Unicode had LTR and RTL control characters, so I will leave this link here for future reference even though using them is discouraged: http://www.w3.org/International/questions/qa-bidi-controls
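For reference, a small sketch of what these control characters look like in practice. The code points are standard Unicode; the helper names `embed_rtl` and `strip_bidi_controls` are made up for this example. One thing worth keeping in mind for annotation: the controls are invisible but still count as characters, so adding them to a text shifts character offsets.

```python
import re
import unicodedata

RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# Directional marks, embeddings/overrides and isolates.
BIDI_CONTROLS = re.compile("[\u200E\u200F\u202A-\u202E\u2066-\u2069]")


def embed_rtl(text):
    """Wrap text in an explicit right-to-left embedding."""
    return f"{RLE}{text}{PDF}"


def strip_bidi_controls(text):
    """Remove invisible bidi control characters, e.g. before computing offsets."""
    return BIDI_CONTROLS.sub("", text)


wrapped = embed_rtl("brat 1.3")
print(unicodedata.name(wrapped[0]))
print(len("brat 1.3"), len(wrapped))
```

Stripping the controls again recovers the original string, which is one way to keep stored offsets independent of how the text was marked up for display.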
Hello all. |
No explicit support has been implemented, but from some recent discussion on the mailing list it appears that it is possible to use brat to annotate Arabic using recent versions of Firefox. |
Thanks @spyysalo |
As relevant as this is, I don't see it happening before v1.4. |
As discussed on the list recently, there has been some success annotating Arabic on recent versions of Firefox. We might wish to document the conditions for making this work. |
That was a very long time. We successfully managed to apply right-to-left (RTL) layout in brat. Please see as an example of Arabic (RTL) text: The modification is part of our project and it is still not released to the public. Meanwhile, anyone who wants to use brat on our server, please don't hesitate to contact me on fahd_alotaibi(AT)hotmail.com; we may be able to give you access to use it online to tag Arabic text. ** Please use either Google Chrome or Firefox to get the correct rendering result. (Internet Explorer is not supported) |
@fsalotaibi Do you have any plans to release it to the public? I have some Arabic text to annotate, and currently Excel is used. |
Thanks very much fsalotaibi and icycandy |
We have added experimental support for right-to-left to WebAnno now. To this end, I have patched the brat JavaScript files that we use in WebAnno to support an LTR and an RTL mode. The changes are all conspicuously marked and should be reasonably easy to transfer back into brat. In particular, the changes do
Some functionalities may not have been fixed for RTL because we don't use them in WebAnno. Also, there are some known issues, e.g.:
Anybody interested in integrating this back into brat? |
@reckart: Cool! We are certainly interested. @amadanmath: When you have the time, could you have a look at putting this into a branch? |
Hi, just wondered if there's been any activity or timeline for inclusion of RTL-enabled brat? |
@amadanmath : could you please have a look at #774 (comment) and #1150? |
Sorry it took me forever to address this; WebAnno changes backported to brat. Thank you, @reckart. It is committed to the branch You will need to include the following in the
|
Seems it bugs a bit on mixed directionality text -- try selecting half of the abbreviation and half of the neighbouring Arabic word:
|
Fabulous! Well, yes, mixed tokens are a known issue in our code. I hope that sharing the code between brat and WebAnno increases the chance that somebody picks up the baton and addresses the remaining issues and that both projects can profit from this. See also: webanno/webanno#49 |
I finally found some non-trivially annotated RTL data (in Hebrew) which shows that the RTL layout doesn't push out the labels sufficiently. This needs some improvement. Cf. webanno/webanno#273 |
@amadanmath if you have any hot pointers where to look regarding fixing the "pushing", would be great! |
Looks like a general layout problem with wide labels, not limited to the RTL layout or RTL glyphs. |
Managed to fix the layout issue ;) webanno/webanno#273 |
You might find this also interesting: webanno/webanno#265 (comment) |
Merged into master branch now. |
I should mention that there have been more improvements to RTL mode in WebAnno, also some issues still open to be resolved: https://github.com/webanno/webanno/issues?utf8=✓&q=is%3Aissue%20label%3ARTL |
This issue is to discuss the current shortcomings regarding Arabic script and how/if they can be resolved given our current architecture.
Emad Mohamed mentioned on the corpora mailing list that they can use the ASCII Buckwalter encoding for Arabic but that it is sub-optimal. We really need a native speaker to help out with this, but at least from what I could read at CPAN it looks like a dreadful hack to get Arabic into ASCII.
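To give an idea of what the Buckwalter encoding looks like, here is a small sketch that converts a Buckwalter string back to Arabic script. The table below is deliberately only a partial subset of the scheme, enough for the two example words, and the function name is an assumption; a real converter would need the full table.

```python
# Partial Buckwalter-to-Arabic table (subset for illustration only).
BUCKWALTER_SUBSET = {
    "A": "\u0627",  # alif
    "b": "\u0628",  # ba
    "t": "\u062A",  # ta
    "r": "\u0631",  # ra
    "s": "\u0633",  # sin
    "$": "\u0634",  # shin
    "E": "\u0639",  # ain
    "k": "\u0643",  # kaf
    "l": "\u0644",  # lam
    "m": "\u0645",  # mim
    "n": "\u0646",  # nun
    "w": "\u0648",  # waw
    "y": "\u064A",  # ya
}


def buckwalter_to_arabic(text):
    """Map each Buckwalter symbol to its Arabic letter; unknown symbols pass through."""
    return "".join(BUCKWALTER_SUBSET.get(ch, ch) for ch in text)


print(buckwalter_to_arabic("ktAb"))  # "book"
print(buckwalter_to_arabic("$ms"))   # "sun"
```

Symbols such as `$` standing in for letters are part of why annotators find the transliterated form hard to read.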
According to @amadanmath, the following should be an issue:
But it appears that at least Firefox renders both the same and handles the English portion correctly.
From talking to one of the attendees at EACL 2012 tokenisation may also become an issue. For this we could use a similar approach as we have already done for Japanese and incorporate a morphological analyser to find the start and end of the "tokens".
Here is one I found after some minor Googling: