
PetInput

StephanOepen edited this page Jan 25, 2009 · 53 revisions


Overview

This page discusses the input formats available to the PET parser cheap. The order of presentation largely reflects the historical order of PET development, but also corresponds to increasing complexity (and, thus, control over system behavior).

Punctuation

Punctuation characters, as specified in the settings file, are ignored by PET (removed from the input chart) for plain textual input.

Here is an example of the punctuation characters found in pet/japanese.set:

  punctuation-characters := "\"!&'()*+,-−./;<=>?@[\]^_`{|}~。?…., ○●◎*".

Note that punctuation-characters are defined separately for the LKB (typically in lkb/globals.lsp).

Punctuation characters are not removed in the other input modes (YY mode, PET Input Chart, or MAF). Rather, in these modes they should be removed (or otherwise treated, as appropriate) by the preprocessor that creates the YY/PIC/MAF token lattice.
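As an illustration of what such a preprocessor might do, here is a minimal sketch (not part of PET; the helper name and the reduced punctuation set are purely illustrative) that drops punctuation-only tokens before a YY/PIC lattice is built:

```python
# Illustrative sketch of a preprocessing step for YY/PIC/MAF input:
# tokens consisting solely of punctuation characters are dropped before
# the token lattice is handed to cheap. A subset of the characters from
# the japanese.set example above is used here.
PUNCTUATION = set("\"!&'()*+,-./;<=>?@[\\]^_`{|}~")

def strip_punctuation_tokens(tokens):
    """Remove tokens made up entirely of punctuation characters."""
    return [t for t in tokens if not all(ch in PUNCTUATION for ch in t)]

tokens = ["Tokenization", ",", "a", "non-trivial", "exercise", "."]
print(strip_punctuation_tokens(tokens))
# ['Tokenization', 'a', 'non-trivial', 'exercise']
```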

Line-Oriented Input

YY Input Mode

YY input mode (activated by the -yy option) facilitates parsing from a partial (lexical) chart, i.e. it assumes that tokenization (and other text-level pre-processing) has been performed outside of cheap. YY input mode supports token-level ambiguity, multi-word tokens, some control over what PET should do for morphological analysis, the use of POS tags on input tokens to enable (better) unknown-word handling, and generally feeding a word graph (as obtained, for example, from a speech recognizer) into the parser.

There are at least three existing descriptions of YY input mode that we should merge into one for this page. Here is mine (oe, 06/10/04).

In this example, the tokens are shown on separate lines for clarity. In the actual input given to PET, all YY tokens of one utterance must appear on a single line (terminated by a newline), as each line of input is processed as a separate utterance.

  (42, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323)
  (43, 1, 2, <12:12>, 1, ",", 0, "null", "," 1.0000)
  (44, 2, 3, <14:14>, 1, "a", 0, "null", "DT" 1.0000)
  (45, 3, 4, <16:26>, 1, "non-trivial", 0, "null", "JJ" 1.0000)
  (46, 4, 5, <28:35>, 1, "exercise", 0, "null", "NN" 0.9887 "VB" 0.0113)
  (47, 5, 6, <36:36>, 1, ",", 0, "null", "," 1.0000)
  (48, 6, 7, <38:43>, 1, "bazed", 0, "null", "VBD" 0.5975 "VBN" 0.4025)
  (49, 7, 8, <45:57>, 1, "[email protected]", 0, "null", "NN" 0.7342 "JJ" 0.2096)
  (50, 8, 9, <58:58>, 1, ".", 0, "null", "." 1.0000)

This can be fed into PET as follows:

cat corpus.yy | cheap -yy -packing -verbose=4 -default-les english.grm 

where -yy (or -tok=yy) turns on the partial-chart input mode, -packing enables ambiguity packing (which we virtually always use nowadays), -verbose=4 requests some verbosity of output, and -default-les enables the use of default lexical entries for unknown tokens.

Each token in the above example has the following format:

  • (id, start, end, <from:to>, path+, form [surface], ipos, irule+, {pos p}+)

i.e. each token has a unique identifier and a start and end vertex; the <from:to> component gives the character span in the original input. We will ignore the path component for our purposes (membership in one or more paths through a word lattice). Form is what we use for lexical look-up; surface provides the original surface form in case some normalization was already applied (e.g. "EmailErsatz" for "[email protected]"). We will also ignore ipos and irule, since we leave morphological analysis to PET. The rest is a sequence of tag plus probability pairs. We used to include the probabilities in parse ranking; these days they are simply ignored. It is legitimate to have multiple tokens for one position, or tokens spanning multiple positions.
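The token layout above can be made concrete with a small sketch that renders one YY token; the helper function is illustrative only (not part of PET), and it hard-codes a single path, ipos 0, and a "null" irule, as in the example:

```python
# Illustrative sketch: emitting YY tokens (one utterance per line).
# The field layout follows the example tokens above; the helper name is
# not part of any PET API. path, ipos, and irule are fixed to the values
# used in the example (1, 0, "null").
def yy_token(tid, start, end, cfrom, cto, form, tags):
    """Render one YY token; `tags` is a list of (pos, probability) pairs."""
    pos = " ".join('"%s" %.4f' % (t, p) for t, p in tags)
    return '(%d, %d, %d, <%d:%d>, 1, "%s", 0, "null", %s)' % (
        tid, start, end, cfrom, cto, form, pos)

# All tokens of an utterance go on one line, separated by whitespace.
line = " ".join([
    yy_token(42, 0, 1, 0, 11, "Tokenization", [("NNP", 0.7677), ("NN", 0.2323)]),
    yy_token(43, 1, 2, 12, 12, ",", [(",", 1.0)]),
])
print(line)
```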

Guessing Unknown-Word Lexical Types from the Input POS

If you look at pet/english.set in the ERG distribution, you will find some settings that determine the treatment of unknown words:

posmapping :=
  UpperAndLowerCase $genericname
  UpperAndLowerCaseInitial $genericname
  JJ $generic_adj
  JJR $generic_adj_compar
  JJS $generic_adj_superl
  NN $generic_mass_count_noun
  NNS $generic_pl_noun
  NNPS $generic_pl_noun
  NNP $genericname
  FW $generic_mass_noun
  RB $generic_adverb
  VB $generic_trans_verb_bse
  VBD $generic_trans_verb_past
  VBG $generic_trans_verb_prp
  VBN $generic_trans_verb_psp
  VBP $generic_trans_verb_presn3sg
  VBZ $generic_trans_verb_pres3sg
.

This mapping determines what happens for unknown words, i.e. tokens whose form is not found in the native lexicon. The top part of the mapping (which is commented out in the current release version) is for PTB tags, the lower part for CLAWS tags.

I suspect both the mapping and the constraints on generic entries will need some fine-tuning. Consider our initial example: FAQ is not in the ERG lexicon. RASP (wrongly, I think) tags it as a proper noun, so we use the $genericname lexical entry. When Dan did these generic entries, we did not have a tagger (i.e. we always threw in all of them), hence he made these entries fairly constrained with respect to their combinatorics: in this case, $genericname does not allow combination with a specifier, hence the above still fails to parse. Changing its tag to NN or NN1, we get nine readings, the first of which looks plausible.
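The lookup logic behind posmapping can be paraphrased as: consult the native lexicon first, and only fall back to the tag-to-generic-entry mapping for unknown forms. A sketch, using a hand-picked subset of the mapping above (function and dictionary names are illustrative, not PET internals):

```python
# Illustrative paraphrase of unknown-word handling via posmapping: native
# lexicon entries win; otherwise each input POS tag selects a generic
# lexical entry. Only a subset of the posmapping above is reproduced here.
POSMAPPING = {
    "JJ": "$generic_adj",
    "NN": "$generic_mass_count_noun",
    "NNP": "$genericname",
    "VBD": "$generic_trans_verb_past",
    "VBN": "$generic_trans_verb_psp",
}

def lexical_entries(form, lexicon, tags):
    """Native lexicon first; else map the input POS tags to generic entries."""
    if form in lexicon:
        return lexicon[form]
    return [POSMAPPING[t] for t in tags if t in POSMAPPING]

# "bazed" (from the YY example above) is unknown and tagged VBD/VBN:
print(lexical_entries("bazed", {}, ["VBD", "VBN"]))
# ['$generic_trans_verb_past', '$generic_trans_verb_psp']
```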

As of December 2006, the following lines need to be added to the gen-lex.tdl file in the ERG grammar, and the grammar then recompiled (the generic_mass_count_noun entry is missing):

  generic_mass_count_noun := n_-_mc-unk_le &
    [ STEM < *top* > ].

Pet Input Chart (XML Input)

XML input mode is very similar to YY input mode. It allows you to specify either simple tokens that get analysed internally by cheap, or to put all kinds of preprocessing information cheap can handle directly into the input, namely POS tags, morphology, lexicon lookup, and multi-component entries.

It extends YY mode in that it allows structured input tokens, providing a means to encode, say, named entities built from base tokens. It also allows you to specify modifications to the feature structures (coming from lexicon entries).

It is activated with -tok=pic_counts and can be used in combination with -default-les to trigger unknown-word handling via POS tags, as in YY mode.

Examples

A typical way of calling it, with XML input and the best-ranked RMRS (XML) output, would be:

cat input.xml | cheap -tok=pic_counts -default-les -packing -mrs=rmrx -results=1 grammar.grm

A simple example input is given below:

<?xml version="1.0" encoding="utf-8" standalone="no" ?>
<!DOCTYPE pet-input-chart
 SYSTEM "/usr/local/lib/xml/pic.dtd">
<pet-input-chart>
<!-- This FAQ is short -->
  <w id="W1" cstart="1" cend="5">
    <surface>This</surface>
    <pos tag="DD1" prio = "1.0" />
  </w>
  <w id="W2" cstart="7" cend="9">
    <surface>FAQ</surface>
    <pos tag="NP1" prio = "1.0" />
  </w>
  <w id="W2a" cstart="7" cend="9" constant="yes">
    <surface>FAQ</surface>
    <typeinfo id="n_-_pn_le" baseform="no" prio="1.0">
      <stem>$genericname</stem>
      <fsmod path="SYNSEM.LKEYS.KEYREL.CARG" value="F.A.Q."/>
      </typeinfo>
  </w>
  <w id="W3" cstart="11" cend="12">
    <surface>is</surface>
    <pos tag="BE" prio = "1.0" />
  </w>
  <w id="W4" cstart="14" cend="18">
    <surface>short</surface>
    <pos tag="JJ" prio = "1.0" />
  </w>
 </pet-input-chart>

[note: the two empty lines at the end of the input file appear necessary when piping data into PET using the above command]

The input is broken up into tokens <w>...</w>, which must have unique ids. Each token gives its start (cstart) and end (cend) character positions (inclusive). It can also include a pos element, giving the tag and a confidence (priority).

It also allows more detailed specifications (named entities, modified feature structures, ...).

You can only enter a single pet-input-chart in a stream, and it must start with the xml declaration and finish with at least two empty lines. Alternatively, you can give the name of a file consisting of a single pet-input-chart, or a list of such filenames, one on each line.
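Since the PIC format is plain XML, a chart can be assembled with any XML library. A minimal sketch using the Python standard library (the helper name is illustrative; element and attribute names follow the example above):

```python
# Illustrative sketch: building a minimal pet-input-chart with the Python
# standard library. Element and attribute names (w, surface, pos, cstart,
# cend, prio) follow the PIC format shown above.
import xml.etree.ElementTree as ET

def make_w(parent, wid, cstart, cend, surface, tag, prio="1.0"):
    """Append one <w> token, with its surface string and a POS reading."""
    w = ET.SubElement(parent, "w", id=wid, cstart=str(cstart), cend=str(cend))
    ET.SubElement(w, "surface").text = surface
    ET.SubElement(w, "pos", tag=tag, prio=prio)
    return w

chart = ET.Element("pet-input-chart")
make_w(chart, "W1", 1, 5, "This", "DD1")
make_w(chart, "W2", 7, 9, "FAQ", "NP1")
print(ET.tostring(chart, encoding="unicode"))
```

Remember that the XML declaration still has to be prepended (and the trailing empty lines appended) before piping the result into cheap.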

The example given below illustrates most of the available features. Tokens W0 and W1 are not analysed at all by cheap because their (boolean) constant attribute is yes. The default value of this attribute is no, which means that the token W3 will be analysed by all of the activated preprocessing modules in cheap.

<?xml version="1.0" encoding="ISO-8859-1" standalone="no" ?>
<!DOCTYPE pet-input-chart
  SYSTEM "/path/to/src/pet/doc/pic.dtd">
<pet-input-chart>
  <w id="W0" cstart="1" cend="3" constant="yes">
    <surface>Kim</surface>
  </w>
  <w id="W1" cstart="5" cend="9" constant="yes">
    <surface>Novak</surface>
  </w>
  <ne id="NE0" prio="1.0">
    <ref dtr="W0"/>
    <ref dtr="W1"/>
    <pos tag="PN" prio="1.0"/>
    <typeinfo id="TNE0" baseform="no">
      <stem>$generic_name</stem>
      <fsmod path="SYNSEM.LOCAL.HEAD.FORM" value="Kim Novak"/>
    </typeinfo>
  </ne>
  <w id="W2" cstart="11" cend="16" constant="yes">
    <surface>sleeps</surface>
    <pos tag="VVFIN" prio="7.80000e-1"/>
    <pos tag="NN" prio="2.30000e-2"/>
    <typeinfo id="W1A1">
      <stem>sleep</stem>
      <infl name="$third_sg_fin_verb_infl_rule"/>
    </typeinfo>
    <typeinfo id="W1A2">
      <stem>sleep</stem>
      <infl name="$plur_noun_infl_rule"/>
    </typeinfo>
  </w>
  <w id="W3" cstart="18" cend="22">
    <surface>badly</surface>
    <pos tag="ADV" prio="1.00000e+1"/>
  </w>
</pet-input-chart>

Token NE0 is an example of a complex token referencing a sequence of two base tokens. Its typeinfo directly gives the HPSG type name whose feature structure should be used as the lexical item in cheap. While in YY mode this was triggered by a leading special character, in XML the attribute baseform decides whether the string enclosed by the <stem> tag is to be interpreted as a lexical base form or as a type name. The default value of baseform is yes. In this token, the surface string is unified into the feature structure under the path SYNSEM.LOCAL.HEAD.FORM, which is specified with the <fsmod> tag. The value of an <fsmod> may be an arbitrary string; cheap will add a dynamic symbol if the string is not a known type or symbol name.

Every <typeinfo> tag potentially generates a lexical item (if it leads to a valid lexical feature structure). Thus, there will be two readings for the token W2 (sleeps), while internal analysis of its surface form has been inhibited. This need not necessarily be so: it is possible to provide external analyses and have a <w> token also analysed internally, if the constant flag is omitted or set to no.

The XML tag <surface> encloses the surface string; the <pos> and <path> tags are analogous to YY mode; multiple <infl> rules in a <typeinfo> are considered in order, from first to last.

XML input mode can be used in two different ways, either by specifying a file name containing the XML data (preferably with correct XML header and DTD or DTD URL specification) or by giving the XML data directly.

If the XML data is put directly on standard input, it must start with a valid XML header <?xml version="1.0" ... ?> with no leading whitespace, because recognition of the header triggers the reading of XML from standard input. The end of the data is marked by an empty line (two consecutive newline characters); therefore, the data itself, including any inline DTD, may not contain empty lines.
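The framing rules above (header first with no leading whitespace, no empty lines in the body, a terminating empty line) can be captured in a small helper; this is a hedged sketch, not part of PET, and the function name is purely illustrative:

```python
# Illustrative sketch of framing PIC data for cheap's standard input:
# the XML header must come first with no leading whitespace, the body may
# not contain empty lines, and a trailing empty line marks the end of data.
def frame_for_stdin(xml_text):
    """Normalize PIC XML for piping into cheap on standard input."""
    lines = [ln for ln in xml_text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("<?xml"):
        raise ValueError("stream must begin with the XML header")
    return "\n".join(lines) + "\n\n"
```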

PIC (pet-input-chart) DTD

This is the pic.dtd from the Heart of Gold (see HeartofgoldTop).

<!ELEMENT pet-input-chart ( w | ne )* >
  <!-- base input token -->
  <!ELEMENT w ( surface, path*, pos*, typeinfo* ) >
  <!ATTLIST w         id ID      #REQUIRED
                  cstart NMTOKEN #REQUIRED
                    cend NMTOKEN #REQUIRED
                    prio CDATA   #IMPLIED
                constant (yes | no) "no" >
  <!-- constant "yes" means: do not analyse, i.e., if the tag contains
       no typeinfo, no lexical item will be built for the token -->
 
  <!-- The surface string -->
  <!ELEMENT surface ( #PCDATA ) >

  <!-- numbers that encode valid paths through the input graph (optional) -->
  <!ELEMENT path EMPTY >
  <!ATTLIST path     num NMTOKEN #REQUIRED >
 
  <!-- every typeinfo generates a lexical token -->
  <!ELEMENT typeinfo ( stem, infl*, fsmod* ) >
  <!ATTLIST typeinfo   id ID     #REQUIRED
                     prio CDATA  #IMPLIED
                 baseform (yes | no) "yes" >
  <!-- Baseform yes: lexical base form; no: type name -->

  <!-- lexical base form or type name -->
  <!ELEMENT stem ( #PCDATA ) >

  <!-- type name of an inflection rule-->
  <!ELEMENT infl  EMPTY >
  <!ATTLIST infl    name CDATA   #REQUIRED >

  <!-- put type value under path into the lexical feature structure -->
  <!ELEMENT fsmod  EMPTY >
  <!ATTLIST fsmod   path CDATA   #REQUIRED
                   value CDATA   #REQUIRED >

  <!-- part-of-speech tags with priorities -->
  <!ELEMENT pos  EMPTY >
  <!ATTLIST pos      tag CDATA   #REQUIRED
                    prio CDATA   #IMPLIED >

  <!-- structured input items, mostly to encode named entities -->
  <!ELEMENT ne  ( ref+, pos*, typeinfo+ )  >
  <!ATTLIST ne        id ID      #REQUIRED
                    prio CDATA   #IMPLIED >
 
  <!-- reference to a base token -->
  <!ELEMENT ref  EMPTY >
  <!ATTLIST ref      dtr IDREF   #REQUIRED >

Encoding issues

By default, the XML parser used by cheap (Xerces) can handle iso-8859-1 and utf-8. For other encodings, such as euc-jp, you need an XML parser linked against the ICU libraries.

For Debian and derivatives this means:

sudo apt-get install libxercesicu25 icu

rather than:

sudo apt-get install libxerces25

SMAF

See SmafTop.
