Description
We need some methods/scripts to evaluate parsing performance. We probably want to do two things: a) replicate previous work that uses parseval so that we can easily compare against previously reported results (see Table 3 in http://www.cc.gatech.edu/~jeisenst/papers/ji-acl-2014.pdf), and b) implement a more appropriate metric based on precision/recall of relations between spans, not just precision/recall of (labeled or unlabeled) spans as in parseval. See the discussion from @sagae below.
- The metrics should report unlabeled and labeled performance
- The metrics should use the 18 coarse relations from Carlson et al.'s (2001) "Building a Discourse-tagged Corpus in the Framework of Rhetorical Structure Theory." (see the label-collapsing sketch after this list)
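
A minimal sketch of how fine-grained labels could be collapsed to coarse classes before scoring. The `COARSE_MAP` entries below are illustrative guesses covering only the labels in @sagae's example, not the actual Carlson et al. (2001) class inventory, which should be taken from the annotation manual:

```python
# Illustrative only: collapse fine-grained RST relation labels to coarse classes.
# The real mapping should come from Carlson et al. (2001); entries here are
# assumptions made for the sake of the example.
COARSE_MAP = {
    "elaboration-additional": "Elaboration",
    "elaboration-object-attribute-embedded": "Elaboration",
    "explanation-argumentative": "Explanation",
    "attribution": "Attribution",
    "attribution-embedded": "Attribution",
    "consequence-s": "Cause",        # assumed class; check the manual
    "purpose": "Enablement",         # assumed class; check the manual
    "example": "Elaboration",        # assumed class; check the manual
}

def coarsen(label):
    """Map a fine-grained relation label to its coarse class (fallback: the label itself)."""
    return COARSE_MAP.get(label.lower(), label)
```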
Discussion from @sagae
Looking at Fig. 1 in http://www.isi.edu/~marcu/papers/sigdialbook2002.pdf, there are nine rhetorical relations, represented by the labeled directed arcs (same-unit is just a side effect of the annotation, and not a discourse relation). We really should be looking at precision and recall of the relations represented in these labeled arcs. So we would be looking for:
16 <- 17-26 : example
17-21 <- 22-26 : elaboration-additional
17-18 <- 19-21 : explanation-argumentative
22-25 <- 26 : consequence-s
17 <- 18 : attribution
19-20 <- 21 : attribution
19 <- 20 : elaboration-object-attribute-embedded
22 <- 23 : attribution-embedded
24 <- 25 : purpose
and precision and recall would be computed in the usual way, and successful identification of a relation requires the correct spans, the correct direction of the arrow, and the correct label. The list doesn't include 22-23 <- 24-25 : same-unit, but the parser does need to get this right to form the 22-25 span, so it's taken into account implicitly, which I think is the right way.
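
A minimal sketch of what such a scorer could look like, assuming each tree has already been decomposed into directed, labeled relations of the form `(dependent_span, head_span, label)`, e.g. `((17, 26), (16, 16), "example")` for `16 <- 17-26 : example`. The function name and the tuple encoding are assumptions for illustration, not an existing API in this repo:

```python
def relation_prf(gold_relations, pred_relations, labeled=True):
    """Precision/recall/F1 over sets of (dependent_span, head_span, label) tuples.

    A predicted relation counts as correct only if both spans, the direction
    (which span points at which), and -- when labeled=True -- the relation
    label all match the gold standard.
    """
    if not labeled:
        # Drop the label to score unlabeled span attachment only.
        gold_relations = {(dep, head) for dep, head, _ in gold_relations}
        pred_relations = {(dep, head) for dep, head, _ in pred_relations}
    else:
        gold_relations = set(gold_relations)
        pred_relations = set(pred_relations)

    correct = len(gold_relations & pred_relations)
    precision = correct / len(pred_relations) if pred_relations else 0.0
    recall = correct / len(gold_relations) if gold_relations else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example using a few relations from the figure above (spans are EDU index pairs):
gold = {
    ((17, 26), (16, 16), "example"),
    ((22, 26), (17, 21), "elaboration-additional"),
    ((18, 18), (17, 17), "attribution"),
}
pred = {
    ((17, 26), (16, 16), "example"),
    ((22, 26), (17, 21), "elaboration-general"),  # wrong label, right spans
    ((18, 18), (17, 17), "attribution"),
}
print(relation_prf(gold, pred, labeled=True))   # label mismatch is penalized
print(relation_prf(gold, pred, labeled=False))  # spans and direction only
```

Because same-unit arcs are not scored directly, they only matter insofar as getting them wrong produces the wrong spans elsewhere, which matches the behavior described above.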