PetInput
This page discusses the input formats available to the PET parser cheap. The order of presentation largely reflects the historical order of PET development, but also corresponds to increasing complexity (and, thus, control over system behavior).
Punctuation characters, as specified in the settings file, are ignored by PET (removed from the input chart) for pure textual input.
Here is an example of the punctuation characters found in pet/japanese.set:
punctuation-characters := "\"!&'()*+,-−./;<=>?@[\]^_`{|}~。?…., ○●◎*".
Note that punctuation-characters are defined separately for the LKB (typically in lkb/globals.lsp).
Punctuation characters are not removed in the other input modes (YY mode, PET Input Chart, or MAF). Rather, in these modes they should be removed (or otherwise treated, as appropriate) by the preprocessor that creates the YY/PIC/MAF token lattice.
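In this default textual mode, cheap reads one utterance per line from standard input. As a minimal sketch (borrowing the grammar file name from the YY example further down):
echo "Kim sleeps." | cheap english.grm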
YY input mode (activated by the -yy option) facilitates parsing from a partial (lexical) chart, i.e. it assumes that tokenization (and other text-level pre-processing) has been performed outside of cheap. YY input mode supports token-level ambiguity, multi-word tokens, some control over what PET should do for morphological analysis, the use of POS tags on input tokens to enable (better) unknown word handling, and generally feeding a word graph (as obtained, for example, from a speech recognizer) into the parser.
There are at least three existing descriptions of YY input mode that we should merge into one for this page. Here is mine (oe, 06/10/04).
In this example, the words are shown on separate lines for clarity. In the actual input given to PET, all YY tokens must appear on a single line (terminated by a newline), as each line of input is processed as a separate utterance.
(42, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323)
(43, 1, 2, <12:12>, 1, ",", 0, "null", "," 1.0000)
(44, 2, 3, <14:14>, 1, "a", 0, "null", "DT" 1.0000)
(45, 3, 4, <16:26>, 1, "non-trivial", 0, "null", "JJ" 1.0000)
(46, 4, 5, <28:35>, 1, "exercise", 0, "null", "NN" 0.9887 "VB" 0.0113)
(47, 5, 6, <36:36>, 1, ",", 0, "null", "," 1.0000)
(48, 6, 7, <38:43>, 1, "bazed", 0, "null", "VBD" 0.5975 "VBN" 0.4025)
(49, 7, 8, <45:57>, 1, "[email protected]", 0, "null", "NN" 0.7342 "JJ" 0.2096)
(50, 8, 9, <58:58>, 1, ".", 0, "null", "." 1.0000)
This can be fed into PET as follows:
cat corpus.yy | cheap -yy -packing -verbose=4 -default-les english.grm
where -yy (or -tok=yy) turns on the partial-chart input mode; -packing is virtually always used nowadays; -verbose=4 requests some verbosity of output; and -default-les enables the use of default lexical entries for unknown tokens.
Each token in the above example has the following format:
- (id, start, end, path+, form surface, ipos, irule+, {pos p}+)
i.e. each token has a unique identifier plus start and end vertices. We will ignore the path component for our purposes (it encodes membership in one or more paths through a word lattice). Form is what we use for lexical look-up (surface provides the original surface form in case some normalization was already applied, e.g. "EmailErsatz" for "[email protected]"). We will ignore ipos and irule, in turn, since we leave morphological analysis to PET. The rest is a sequence of tag plus probability pairs. We used to include the probabilities in parse ranking; these days they are simply ignored. It is legitimate to have multiple tokens for one position, or tokens spanning multiple positions.
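To illustrate the last point, here is a hypothetical fragment (identifiers, character spans, and probabilities made up) for the input we like New York, where New York appears both as two separate base tokens and as one multi-word token spanning two vertices; as before, the tokens are shown on separate lines for readability only:
(1, 0, 1, <0:1>, 1, "we", 0, "null", "PRP" 1.0000)
(2, 1, 2, <3:6>, 1, "like", 0, "null", "VBP" 1.0000)
(3, 2, 3, <8:10>, 1, "New", 0, "null", "NNP" 1.0000)
(4, 3, 4, <12:15>, 1, "York", 0, "null", "NNP" 1.0000)
(5, 2, 4, <8:15>, 1, "New York", 0, "null", "NNP" 1.0000)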
If you look at pet/english.set in the ERG distribution, you will find some settings that determine the treatment of unknown words:
posmapping :=
UpperAndLowerCase $genericname
UpperAndLowerCaseInitial $genericname
JJ $generic_adj
JJR $generic_adj_compar
JJS $generic_adj_superl
NN $generic_mass_count_noun
NNS $generic_pl_noun
NNPS $generic_pl_noun
NNP $genericname
FW $generic_mass_noun
RB $generic_adverb
VB $generic_trans_verb_bse
VBD $generic_trans_verb_past
VBG $generic_trans_verb_prp
VBN $generic_trans_verb_psp
VBP $generic_trans_verb_presn3sg
VBZ $generic_trans_verb_pres3sg
.
This mapping determines what happens for unknown words, i.e. tokens whose form is not found in the native lexicon. The top part of the mapping (which is commented out in the current release version) is for PTB tags, the lower part for CLAWS tags.
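For example, with the mapping above and -default-les active, the unknown token "bazed" from the earlier YY example, tagged VBD, would be parsed with the $generic_trans_verb_past entry.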
I suspect both the mapping and the constraints on generic entries will need some fine-tuning. Consider our initial example: FAQ is not in the ERG lexicon. RASP (wrongly, I think) tags it as a proper noun, thus we use the $genericname lexical entry. When Dan did these generic entries, we did not have a tagger (i.e. we always threw in all of them), hence he made these entries fairly constrained with respect to their combinatorics: in this case, $genericname does not allow combination with a specifier, hence the above still fails to parse. Changing its tag to NN or NN1, we get nine readings, the first of which looks plausible.
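In YY input, that retagging simply amounts to a token along these lines (vertices and character span hypothetical):
(1, 1, 2, <5:7>, 1, "FAQ", 0, "null", "NN" 1.0000)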
As of December 2006, the following lines need to be added to the gen-lex.tdl file in the ERG grammar, and the grammar must then be recompiled (the generic_mass_count_noun entry is missing):
generic_mass_count_noun := n_-_mc-unk_le &
[ STEM < *top* > ].
XML input mode is very similar to YY input mode. It allows you either to specify only simple tokens that are analysed internally by cheap, or to put all kinds of pre-processing information that cheap can handle directly into the input, namely POS tags, morphology, lexicon lookup, and multi-component entries.
It extends YY mode in that it allows structured input tokens, providing a means to encode, say, named entities built from base tokens. It also allows specifying modifications to the feature structures coming from lexicon entries.
It is activated with -tok=pic_counts and can be used in combination with -default-les to trigger generic entries for unknown words based on POS tags, as in YY mode.
A typical way of calling it, with XML input and only the best-ranked result output as RMRS in XML, would be:
cat input.xml | cheap -tok=pic_counts -default-les -packing -mrs=rmrx -results=1 grammar.grm
A simple example input is given below:
<?xml version="1.0" encoding="utf-8" standalone="no" ?>
<!DOCTYPE pet-input-chart
SYSTEM "/usr/local/lib/xml/pic.dtd">
<pet-input-chart>
<!-- This FAQ is short -->
<w id="W1" cstart="1" cend="5">
<surface>This</surface>
<pos tag="DD1" prio = "1.0" />
</w>
<w id="W2" cstart="7" cend="9">
<surface>FAQ</surface>
<pos tag="NP1" prio = "1.0" />
</w>
<w id="W2" cstart="7" cend="9" constant="yes">
<surface>FAQ</surface>
<typeinfo id="n_-_pn_le" baseform="no" prio="1.0">
<stem>$genericname</stem>
<fsmod path="SYNSEM.LKEYS.KEYREL.CARG" value="F.A.Q."/>
</typeinfo>
</w>
<w id="W3" cstart="11" cend="12">
<surface>is</surface>
<pos tag="BE" prio = "1.0" />
</w>
<w id="W4" cstart="14" cend="18">
<surface>short</surface>
<pos tag="JJ" prio = "1.0" />
</w>
</pet-input-chart>
[Note: the two empty lines at the end of the input file appear to be necessary when piping data into PET using the above command.]
The input is broken up into tokens <w>...</w>, which must have unique ids. Each token gives its start (cstart) and end (cend) character positions (inclusive). It can also include a pos element, with a tag and a confidence (priority).
It also allows more detailed specifications (named entities, modified feature structures, ...).
You can only enter a single pet-input-chart in a stream, and it must start with the xml declaration and finish with at least two empty lines. Alternatively, you can give the name of a file consisting of a single pet-input-chart, or a list of such filenames, one on each line.
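For example (a sketch, assuming input.xml holds a single pet-input-chart), the file name can be passed on standard input in place of the XML data itself:
echo "input.xml" | cheap -tok=pic_counts -default-les -packing english.grm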
The example given below illustrates most of the available features. Tokens W0 and W1 are not analysed at all by cheap because the (boolean) constant attribute is yes.
The default value of this attribute is no, which means that token W3 will be analysed by all of the activated preprocessing modules in cheap.
<?xml version="1.0" encoding="ISO-8859-1" standalone="no" ?>
<!DOCTYPE pet-input-chart
SYSTEM "/path/to/src/pet/doc/pic.dtd">
<pet-input-chart>
<w id="W0" cstart="1" cend="3" constant="yes">
<surface>Kim</surface>
</w>
<w id="W1" cstart="5" cend="9" constant="yes">
<surface>Novak</surface>
</w>
<ne id="NE0" prio="1.0">
<ref dtr="W0">
<ref dtr="W1">
<pos tag="PN" prio="1.0">
<typeinfo id="TNE0" baseform="no">
<stem>$generic_name</stem>
<fsmod path="SYNSEM.LOCAL.HEAD.FORM" value="Kim Novak"/>
</typeinfo>
</ne>
<w id="W2" cstart="11" cend="16" constant="yes">
<surface>sleeps</surface>
<pos tag="VVFIN" prio="7.80000e-1"/>
<pos tag="NN" prio="2.30000e-2"/>
<typeinfo id="W1A1">
<stem>sleep</stem>
<infl name="$third_sg_fin_verb_infl_rule"/>
</typeinfo>
<typeinfo id="W1A2">
<stem>sleep</stem>
<infl name="$plur_noun_infl_rule"/>
</typeinfo>
</w>
<w id="W3" cstart="18" cend="22">
<surface>badly</surface>
<pos tag="ADV" prio="1.00000e+1"/>
</w>
</pet-input-chart>
Token NE0 is an example of a complex token referencing a sequence of two base tokens. Its typeinfo directly gives the name of the HPSG type whose feature structure should be used as the lexical item in cheap. While in YY mode this is triggered by a leading special character, in XML the attribute baseform decides whether the string enclosed by the <stem> tag is to be interpreted as a lexical base form or as a type name. The default value of baseform is yes. In this token, the surface string is unified into the feature structure under the path SYNSEM.LOCAL.HEAD.FORM, as specified with the <fsmod> tag. The value of an <fsmod> may be an arbitrary string; cheap will add a dynamic symbol if the string is not a known type or symbol name.
Every <typeinfo> tag potentially generates a lexical item (if it leads to a valid lexical feature structure). Thus, there will be two readings for the token W2 (sleeps), whereas internal analysis of the surface form has been inhibited. This need not necessarily be so: it is possible to provide external analyses and have a <w> token also analysed internally, if the constant flag is omitted or set to no.
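For example, the following token (hypothetical, but reusing the inflection rule name from the example above) supplies an external analysis and, because constant is omitted, will additionally be analysed by cheap itself:
<w id="W4" cstart="24" cend="29">
  <surface>snores</surface>
  <pos tag="VVFIN" prio="1.0"/>
  <typeinfo id="W4A1">
    <stem>snore</stem>
    <infl name="$third_sg_fin_verb_infl_rule"/>
  </typeinfo>
</w>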
The XML tag <surface> encloses the surface string; the <pos> and <path> tags are analogous to YY mode; multiple <infl> rules in a <typeinfo> are considered in order, from first to last.
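So in a <typeinfo> like the following (rule names hypothetical), $first_infl_rule is considered before $second_infl_rule:
<typeinfo id="T0">
  <stem>blog</stem>
  <infl name="$first_infl_rule"/>
  <infl name="$second_infl_rule"/>
</typeinfo>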
XML input mode can be used in two different ways: either by specifying the name of a file containing the XML data (preferably with a correct XML header and a DTD or DTD URL specification), or by giving the XML data directly.
If the XML data is put directly on the standard input, it must start with a valid XML header <?xml version="1.0" ... ?> with no leading whitespace, because recognition of the header is what triggers reading XML from standard input. The end of the data is marked by an empty line (two consecutive newline characters); therefore the data itself, including any inline DTD, may not contain empty lines.
This is the pic.dtd from the [wiki:HeartofgoldTop Heart of Gold].
<!ELEMENT pet-input-chart ( w | ne )* >
<!-- base input token -->
<!ELEMENT w ( surface, path*, pos*, typeinfo* ) >
<!ATTLIST w id ID #REQUIRED
cstart NMTOKEN #REQUIRED
cend NMTOKEN #REQUIRED
prio CDATA #IMPLIED
constant (yes | no) "no" >
<!-- constant "yes" means: do not analyse, i.e., if the tag contains
no typeinfo, no lexical item will be built for the token -->
<!-- The surface string -->
<!ELEMENT surface ( #PCDATA ) >
<!-- numbers that encode valid paths through the input graph (optional) -->
<!ELEMENT path EMPTY >
<!ATTLIST path num NMTOKEN #REQUIRED >
<!-- every typeinfo generates a lexical token -->
<!ELEMENT typeinfo ( stem, infl*, fsmod* ) >
<!ATTLIST typeinfo id ID #REQUIRED
prio CDATA #IMPLIED
baseform (yes | no) "yes" >
<!-- Baseform yes: lexical base form; no: type name -->
<!-- lexical base form or type name -->
<!ELEMENT stem ( #PCDATA ) >
<!-- type name of an inflection rule-->
<!ELEMENT infl EMPTY >
<!ATTLIST infl name CDATA #REQUIRED >
<!-- put type value under path into the lexical feature structure -->
<!ELEMENT fsmod EMPTY >
<!ATTLIST fsmod path CDATA #REQUIRED
value CDATA #REQUIRED >
<!-- part-of-speech tags with priorities -->
<!ELEMENT pos EMPTY >
<!ATTLIST pos tag CDATA #REQUIRED
prio CDATA #IMPLIED >
<!-- structured input items, mostly to encode named entities -->
<!ELEMENT ne ( ref+, pos*, typeinfo+ ) >
<!ATTLIST ne id ID #REQUIRED
prio CDATA #IMPLIED >
<!-- reference to a base token -->
<!ELEMENT ref EMPTY >
<!ATTLIST ref dtr IDREF #REQUIRED >
By default, the XML parser used with cheap (libxerces) can handle iso-8859-1 and utf-8. To get other encodings, such as euc-jp, you need to link the XML parser against the ICU libraries.
For Debian and derivatives this means:
sudo apt-get install libxercesicu25 icu
rather than:
sudo apt-get install libxerces25
See SmafTop.