Remove fullstop automatically when it finds the same description #10

Open
3 of 4 tasks
ivanhercaz opened this issue Jan 15, 2019 · 3 comments
Labels
enhancement New feature or request

Comments

@ivanhercaz
Owner

ivanhercaz commented Jan 15, 2019

The idea is to avoid clicking again and again to remove the full stop from a description identical to one whose removal has already been approved. For example:

  • "Grade II listed building in Powys." [action: Remove full stop]
    The script could save time if the action were instead:
  • "Grade II listed building in Powys." [action: Remove identical full stops automatically]
    This means that if the script finds "Grade II listed building in Powys." again, it won't ask which action to perform (nor offer to quit or skip); the full stop would be removed immediately.

This new action would save time and make the script more efficient, but it shouldn't be used for every description, because storing descriptions that never recur would load the system for nothing. It should be reserved for specific kinds of descriptions, like the one mentioned.

How it would work

The script would have a new action: Remove identical full stops automatically. This action adds the description to a CSV file, created beforehand with a single column named sentence. Before adding it, the script has to check whether the description is already in the CSV: if it is, it isn't added again; if it isn't, it is added.

The descriptions saved in this CSV would be used on subsequent runs: the script would read the file, which stores both the previously marked descriptions and the new ones, to find matches and remove their full stops automatically.
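
The check-then-add behaviour described above could be sketched roughly as follows. This is only an illustration, not CanaryBot's actual code: the function names `load_sentences` and `add_sentence` and the file path passed to them are hypothetical; only the single column named sentence comes from the proposal.

```python
import csv
from pathlib import Path

FIELD = "sentence"  # the single column named in the proposal


def load_sentences(path):
    """Return the set of descriptions already stored in the CSV."""
    path = Path(path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as f:
        return {row[FIELD] for row in csv.DictReader(f) if row.get(FIELD)}


def add_sentence(path, description):
    """Append the description unless it is already stored.

    Returns True if the description was added, False if it was a duplicate.
    """
    path = Path(path)
    if description in load_sentences(path):
        return False
    # Write the header only when the file is new or empty.
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[FIELD])
        if is_new:
            writer.writeheader()
        writer.writerow({FIELD: description})
    return True
```

On a later run, `load_sentences` would give the set of descriptions whose full stop can be removed without asking the operator.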

Tasks

  • Develop the action and what it requires to work.
  • Add a system to avoid adding duplicate descriptions to the CSV.
  • Make it easy to generate the CSV file, in line with what was discussed in "Make easier to reuse the code" #5.
  • Test several times.
@ivanhercaz
Owner Author

But... how useful is it to find and delete only the exact description? It might be more effective if the system adds to the CSV just the last word with its full stop. With that in mind, this behaviour could be folded into the first option ("Remove full stop"), which avoids creating another option. With the current approach, the script would handle the following description in these steps:

Grade II listed building in Newport City. Located approximately 40 metres SW of Pound-wern Cottage. Bridge carries footpath connecting Ridgeway with the canal towpath.

  1. Remove duplicates automatically.
  2. Delete the full stop in the current item and add the description to the CSV file.
  3. If it finds the exact description again, it will delete its full stop.

This way, the script won't remove the full stop if the description differs in any way from the one stored. But if we save only the last word rather than the whole description, the script would:

  1. Remove full stop.
  2. Delete the full stop in the current item and add the last word, towpath, to the CSV file.
  3. If the script finds any other description that ends with towpath., it will delete the full stop.

This saves time because:

  1. We don't have to think about more options (remove, checklist, edit, skip). The script would add the last word as part of the basic instruction to remove the full stop.
  2. The script would become more "intelligent" the more we use it, because it would accumulate all the last words whose full stops we consider should be removed. In addition, the script will skip, as it does now, the kinds of words/full stops listed in the exceptions list.

Thus, over time it would work with less and less human intervention.
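
The last-word variant could look something like the sketch below. It is a rough illustration under assumptions: `last_word`, `learn_ending` and `strip_full_stop_if_known` are hypothetical names, the stored endings are kept in a plain in-memory set rather than the CSV, and "last word" is taken to mean the final whitespace-separated token.

```python
def last_word(description):
    """Return the final whitespace-separated token, e.g. 'towpath.'."""
    tokens = description.split()
    return tokens[-1] if tokens else ""


def learn_ending(description, known_endings):
    """Record the last word (with its full stop) after the operator approves removal."""
    word = last_word(description)
    if word.endswith("."):
        known_endings.add(word)


def strip_full_stop_if_known(description, known_endings):
    """Remove the trailing full stop when the last word has been approved before."""
    word = last_word(description)
    if word.endswith(".") and word in known_endings:
        return description.rstrip()[:-1]
    return description  # unknown ending: leave it for the operator to decide


known = set()
learn_ending(
    "Bridge carries footpath connecting Ridgeway with the canal towpath.", known
)
cleaned = strip_full_stop_if_known("Footbridge over the canal towpath.", known)
untouched = strip_full_stop_if_known("Church on St Mary's Rd.", known)
```

Here `cleaned` loses its full stop because it ends with the learned word, while `untouched` is returned unchanged because "Rd." was never approved.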

Of course, there are other things to keep in mind:

  • Should the last words be organized into separate files by language (last_words_en.csv, last_words_es.csv)? Or should all the words go in a single multilingual file?
  • ...

@davidabian, I know this issue is very long, especially this last comment, but I would like to know your opinion on reworking the system to save time (and make CanaryBot more "intelligent" 🐦 ). Thanks in advance!

@davidabian

This is more of a linguistic issue, and I can't predict what the results will be in languages I don't know well; I guess in some languages this will cause too many false positives to be viable, while in Spanish or English it can work (but only if the bot is operated carefully, since a single mistake by the bot operator could spread to several unrelated entities).

@ivanhercaz
Owner Author

@davidabian, at present every full stop has to be confirmed before it is removed from a description. The criterion for removing it is being sure that it isn't part of an abbreviation or something else that needs the full stop. In addition, there are the exceptions lists, with which the script skips descriptions that match the pattern of any exception.

What kind of human mistakes do you mean? Marking a full stop for removal when it is part of an abbreviation, so that the script would then remove it automatically elsewhere? If that is the kind of mistake you mean, then of course the operator needs to be sure of what they are doing.

As for languages, I follow the same rule in all of them: is the full stop necessary or not? Of course, if the operator is in doubt in any of the languages the script works with, there is the option "Add description to checklist". The operator can then review the checklist and ask on Wikidata which option is best or, if the full stop is part of an abbreviation, add another regex to the exception list.
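
To make the order of checks concrete, here is a minimal sketch of how exceptions could take priority over automatic removal. The two regexes are made-up examples, not the patterns in the script's real exception list, and `is_exception` and `decide` are hypothetical names.

```python
import re

# Hypothetical exception patterns: a few abbreviations and a trailing initial.
EXCEPTION_PATTERNS = [
    re.compile(r"\b(?:etc|Inc|Ltd|Co)\.$"),
    re.compile(r"\b[A-Z]\.$"),
]


def is_exception(description, patterns=EXCEPTION_PATTERNS):
    """True if the description matches any exception pattern."""
    return any(p.search(description) for p in patterns)


def decide(description, known_endings, patterns=EXCEPTION_PATTERNS):
    """Return 'skip', 'remove' or 'ask' for a description."""
    text = description.rstrip()
    if not text.endswith("."):
        return "skip"      # nothing to remove
    if is_exception(text, patterns):
        return "skip"      # exceptions win, even over a learned ending
    if text.split()[-1] in known_endings:
        return "remove"    # ending approved earlier by the operator
    return "ask"           # fall back to the interactive prompt
```

With this ordering, an ending marked for removal by mistake (for example "Inc.") would still be skipped as long as a matching exception exists.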
