
PetRobustness

PeterAdolphs edited this page Aug 11, 2010 · 9 revisions

Some notes on how to make parsing with Pet more robust.


Unknown words

For both methods, it is unclear which REL the unknown word receives, and whether it is treated as a word or a lexeme. In general, once unknown word handling has been applied, the surface string cannot be regenerated from the MRS unless it has been stamped into CARG, as is done for proper names. Such information should be included in the MRS somehow; this can be achieved in user-fns.lsp.

As for PET, it will be possible to construct the PRED string of an unknown word with chart mapping. However, inflectional analysis can return many possible stems for a given word, and there is no way to pick the correct one reliably by looking at this one form alone. Running inflectional analysis before lexical instantiation in order to provide a stem in token mapping therefore does not seem to be a sufficiently general strategy. The surface form would have to be used instead to build the PRED value, leading to separate relations for each form of a lexeme.
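To illustrate the consequence of building PRED from the surface form, here is a minimal sketch in Python. The naming scheme (`_<form>_<pos>_unknown_rel`) and the function `pred_from_surface` are hypothetical and are not taken from PET or any grammar; the made-up verb forms are only there to show that each inflected form ends up with its own relation.

```python
def pred_from_surface(surface: str, pos: str) -> str:
    """Build a predicate name directly from the surface form.

    The naming scheme is a hypothetical illustration, not PET's actual output.
    """
    return f"_{surface.lower()}_{pos}_unknown_rel"


# Three inflected forms of one (invented) unknown verb.
forms = ["blorfs", "blorfed", "blorfing"]
preds = [pred_from_surface(form, "v") for form in forms]

# Without a reliable stem, every form yields a distinct relation.
distinct_relations = len(set(preds))
```

With a reliable stem, all three forms would share one relation; with surface forms only, `distinct_relations` equals the number of forms.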

Stephan points out that partial lexical gaps (e.g. a word that is a verb but is only encoded as a noun) are not captured by these mechanisms, but this will be solved when chart mapping is merged into main.

External POS tagger

cheap -default-les

Needs a mapping in the grammar. See also PetInput.

Lexical type prediction

By Yi.

cheap -predict-les

The maximum entropy model has to be created with a script by Yi, which needs a treebanked profile as input. The lexical types that may be predicted should be listed in the grammar, e.g. in pred-lex.tdl. At least 2000 sentences are needed.

Supertagger

A supertagger, similar in nature to those used for CCG and based on CRFs, is currently under construction.

Chart mapping / Reg-ex token handling

Chart mapping is a mechanism for rule-based manipulation of token feature structures, usually with the aim of adapting the preprocessed input to the expectations of the grammar. Chart mapping has been part of the main branch since August 2010.

Grammar Internal Solutions

  • Roots (in english.set)
    • choose the robust root for greater coverage (in english.set)
  • Robustness rules/Mal rules

Pet settings

How many items of a corpus get parsed depends heavily on the many parameters passed to the parser. Some of these settings restrict the search space (for example, by reducing the maximum number of edges). The constraint that Dan currently uses is the mem option in cheap (although you should halve it, for some strange reason!). This seems to be the most reasonable setting.

See also PetParameters.

  • always use packing
  • recommend -memlimit (amount/2) rather than -limit (edges)
  • -timeout=1 (second) can also be useful
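The recommendations above can be collected into a small helper that assembles a cheap command line. This is a sketch under assumptions: the spelling `-packing`, the `-memlimit=N` form, the unit (megabytes), and the grammar file name `english.grm` are not confirmed by this page, so check them against PetParameters before use. The function name `cheap_args` is hypothetical.

```python
from typing import List


def cheap_args(grammar: str, available_mb: int, timeout_s: int = 1) -> List[str]:
    """Assemble a cheap invocation following the recommendations above.

    Flag spellings and the megabyte unit are assumptions; see PetParameters.
    """
    # The page recommends passing half of the intended memory amount.
    memlimit = available_mb // 2
    return [
        "cheap",
        "-packing",                # assumed spelling of the packing option
        f"-memlimit={memlimit}",   # halved, as recommended
        f"-timeout={timeout_s}",   # seconds
        grammar,                   # hypothetical grammar file name
    ]


args = cheap_args("english.grm", available_mb=1024)
command_line = " ".join(args)
```

The resulting list can be passed to `subprocess.run` once the flag spellings have been verified for the installed PET version.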

More Robust Partial Parsing

Yi created a two-phase parsing algorithm: if the deep grammar does not succeed, a CFG backbone is used to still obtain a reasonable parse. This still has to be integrated into the main branch.
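The control flow of such a fallback can be sketched as follows. This is not Yi's implementation, only a minimal illustration of the two-phase idea; `two_phase_parse` and the toy parsers `toy_deep` and `toy_cfg` are hypothetical stand-ins for the deep grammar and the CFG backbone.

```python
from typing import Callable, List, Tuple

# A parser maps a sentence to a (possibly empty) list of analyses.
Parser = Callable[[str], List[str]]


def two_phase_parse(sentence: str, deep_parse: Parser, cfg_parse: Parser) -> Tuple[str, List[str]]:
    """Try the deep grammar first; fall back to the CFG backbone if it fails."""
    analyses = deep_parse(sentence)
    if analyses:
        return "deep", analyses
    # Phase two: the deep grammar found nothing, so use the backbone.
    return "cfg", cfg_parse(sentence)


def toy_deep(sentence: str) -> List[str]:
    # Toy stand-in: pretend the deep grammar only covers short sentences.
    return [f"deep({sentence})"] if len(sentence.split()) <= 3 else []


def toy_cfg(sentence: str) -> List[str]:
    # Toy stand-in: the backbone always returns some analysis.
    return [f"cfg({sentence})"]


covered = two_phase_parse("dogs bark", toy_deep, toy_cfg)
fallback = two_phase_parse("dogs bark at the moon", toy_deep, toy_cfg)
```

Returning the phase label alongside the analyses lets downstream code distinguish full deep parses from backbone parses.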
