Problems with parsing quotes #2

@NikhilPr95

There are four types of problems that come up when parsing quotes:

1. It very frequently divides the quote and the rest of the sentence into two separate sentences.

E.g. - "So what?" said Harry.

Here ' "So what?" ' and ' said Harry. ' are parsed as two separate sentences, rather than one.

2. As in the first case, it divides a quote and the rest of the sentence into two sentences, but here the first word after the quote is a character identified by a character id.

E.g. "What is?" George demanded.

This is parsed as two sentences, ' "What is?" ' and ' George demanded. '

3. It concatenates two separate quotes that belong in different sentences into the same sentence.

E.g. "How are you?" "I'm fine, thank you", he replied.

Here ' "How are you?" ' is a separate sentence, but it is treated as part of the second sentence.

4. It takes the opening quotation mark ' " ' of a dialogue and treats it as the last token of the previous sentence.

E.g. There was a big blue shape in the sky. " What is it? " Asked Beth.

It parses these two individual sentences as ' There was a big blue shape in the sky. " ' and
' What is it ? " Asked Beth. '

However, the 'in quotes' value for 'What' here is 'true', which makes these cases easy to detect.

I found these errors and corrected them by hard-coding fixes in my own program:

- For 1 and 2: checking whether the first word of a sentence is either in lower case or a character, and concatenating the sentences accordingly.
- For 3: checking for every instance of consecutive quotes and splitting the sentence there.
- For 4: checking whether the first word of a sentence is 'in quotes', the word before it (the last token of the previous sentence) is a double quote, and the word before that is a period, and correcting accordingly.
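For anyone hitting the same problem, here is a minimal Python sketch of the three corrections described above. It is not the code from my program or from this repository. The `Token` class and its `in_quote` and `character_id` fields are hypothetical stand-ins for the corresponding columns of the tokens file, and the extra check that the previous sentence ends in a quotation mark before merging is an assumption added to avoid merging unrelated sentences.

```python
from dataclasses import dataclass
from typing import List, Optional

QUOTE = '"'


@dataclass
class Token:
    text: str
    in_quote: bool = False               # hypothetical 'in quotes' flag
    character_id: Optional[int] = None   # hypothetical character id, None if absent


Sentence = List[Token]


def move_stray_opening_quotes(sentences: List[Sentence]) -> List[Sentence]:
    """Stray opening quote: '... sky . "' followed by an in-quote first word."""
    fixed = [list(s) for s in sentences]
    for prev, cur in zip(fixed, fixed[1:]):
        if (cur and cur[0].in_quote and len(prev) >= 2
                and prev[-1].text == QUOTE and prev[-2].text == "."):
            cur.insert(0, prev.pop())  # hand the quote mark to the next sentence
    return fixed


def merge_split_attributions(sentences: List[Sentence]) -> List[Sentence]:
    """Split quote: re-attach 'said Harry .' or 'George demanded .' to the quote."""
    merged: List[Sentence] = []
    for sent in sentences:
        first = sent[0] if sent else None
        continues = first is not None and (
            first.text.islower() or first.character_id is not None
        )
        # Assumption: only merge when the previous sentence ends in a quote mark.
        if merged and continues and merged[-1] and merged[-1][-1].text == QUOTE:
            merged[-1] = merged[-1] + list(sent)
        else:
            merged.append(list(sent))
    return merged


def split_consecutive_quotes(sentences: List[Sentence]) -> List[Sentence]:
    """Concatenated quotes: split where a closing quote is directly followed by an opening one."""
    result: List[Sentence] = []
    for sent in sentences:
        current: Sentence = []
        inside = False  # True while between an opening and a closing quote mark
        for tok in sent:
            if tok.text == QUOTE:
                if not inside and current and current[-1].text == QUOTE:
                    result.append(current)
                    current = []
                inside = not inside
            current.append(tok)
        if current:
            result.append(current)
    return result


def repair(sentences: List[Sentence]) -> List[Sentence]:
    """Apply the three corrections in sequence."""
    step1 = move_stray_opening_quotes(sentences)
    step2 = merge_split_attributions(step1)
    return split_consecutive_quotes(step2)


# Usage: the first example from the list of problems.
example = [
    [Token(w) for w in ['"', "So", "what", "?", '"']],
    [Token(w) for w in ["said", "Harry", "."]],
]
for sent in repair(example):
    print(" ".join(tok.text for tok in sent))
```

Note that this only repairs the sentence boundaries in the token list. It does not touch the dependency trees, which is the part I could not fix on my side.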

I was pleased with the results UNTIL I realised that the parser which constructs dependency trees does so on the original 'wrong' sentences and not on mine.
This left me trying to use the actual MaltParser for these affected sentences, but I found that its parses are not exactly the same as yours. I assume that your code does not use the MaltParser directly and uses extra information as well.

I would really like this fixed, as I am otherwise using only the tokens document that I got from implementing your code, and this complicates things a lot.

If you could tell me a quick fix for this, it would be appreciated as well. In the meantime, I'll try to see if it is possible for me to make the necessary changes in your code myself.

P. S.

I am very grateful for this repository, without which a project I am working on, analyzing novels, would have been much, much more difficult. Thanks!
