Detect, preserve encoding when revising files #62

Open · wants to merge 4 commits into main

Conversation

falquaddoomi (Collaborator):

This PR addresses issue #61, in which a user reported errors when reading GBK-encoded files, e.g. files containing Chinese characters. To address the issue, I first run chardet.detect() on each input file, then use the resulting encoding when reading and writing the file. The PR includes a test verifying that a few GBK-encoded characters make it through the revision process.

This PR introduces a dependency on chardet in order to detect the encodings of input files.

Closes #61.
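As a rough sketch of the approach (illustrative only, not the PR's exact diff; it assumes input_filepath is a pathlib.Path, as in the changed code below, and revise_file is a placeholder name):

import chardet

def revise_file(input_filepath, output_filepath):
    # detect the input encoding once, then reuse it for both reading and writing
    src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]
    with open(input_filepath, "r", encoding=src_encoding) as infile, open(
        output_filepath, "w", encoding=src_encoding
    ) as outfile:
        outfile.write(infile.read())  # stand-in for the actual revision step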

with open(input_filepath, "r") as infile, open(output_filepath, "w") as outfile:
# detect the input file encoding using chardet
# maintain that encoding when reading and writing files
src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]
falquaddoomi (Collaborator, Author), Oct 9, 2024:

FYI, I'm currently looking into how much of each file we need to read. Reading the entire thing is the safest choice, but chardet might be able to accurately detect the encoding with less data.

Alternatively, I may switch to using UniversalDetector (https://chardet.readthedocs.io/en/latest/usage.html#advanced-usage); I'll have to experiment with the confidence level we should use before stopping. (FWIW, this would all be a lot easier to decide if we had the failing markdown files, so hopefully we'll get those soon.)
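For reference, incremental detection with UniversalDetector would look roughly like this (a sketch following chardet's documented usage; the chunk size is an arbitrary choice):

from chardet.universaldetector import UniversalDetector

def detect_encoding(path, chunk_size=4096):
    detector = UniversalDetector()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:  # stop once the detector is confident enough
                break
    detector.close()
    return detector.result["encoding"]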

Reply:

OK, I know. I have revised line 283:

with open(input_filepath, "r",encoding='utf-8') as infile, open(output_filepath, "w",encoding='utf-8') as outfile:

d33bs (Collaborator) left a comment:

Nice job! This looks like a good change to address the reported issue. I left a few comments with various considerations. Additionally, you might consider pulling in the most recent changes from main, which I believe would allow this PR to show passing tests prior to a merge.

setup.py Outdated
@@ -27,6 +27,7 @@
    install_requires=[
        "openai==0.28",
        "pyyaml",
        "chardet==5.2.0",
Collaborator:

This is a completely optional comment, mentioned mostly to share information. I was experimenting with chardet recently and in one case decided to move to charset_normalizer (specifically, the from_bytes method). It could be worth considering here or in the future, especially when it comes to speed or accuracy, depending on your opinions too.

I took the chance to test the example files provided and found that the two libraries differ in both runtime and detected encoding for the 01.abstract.md file: https://colab.research.google.com/drive/1uNrf9obcpzQK8XHJ3KY4l9zquHo2Xpju?authuser=0#scrollTo=rnZGgBZWumqv. My understanding is that chardet detected ISO-8859-9 (which seems more related to Turkish) while charset_normalizer detected CP949 (which seems more related to Korean). The detection did come at an additional time cost for charset_normalizer, and I'm unsure how it might perform with larger datasets.
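For reference, the charset_normalizer equivalent of chardet.detect() would look roughly like this (a sketch based on its from_bytes API; the utf-8 fallback is an arbitrary choice for when no match is found):

from charset_normalizer import from_bytes

raw = input_filepath.read_bytes()
best_match = from_bytes(raw).best()  # returns None if no plausible encoding is found
src_encoding = best_match.encoding if best_match is not None else "utf-8"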

falquaddoomi (Collaborator, Author):

Interesting; frankly, I'm not very familiar with character-set detection libraries; I just went with chardet because it was the one I'd heard of before. charset_normalizer looks great; while I'm mostly concerned with accuracy, since the files in question are typically small, it's neat that you can get both with that library.

Thanks for going through the trouble of doing a comparison, too! I'm not too concerned about sub-second differences between these libraries' performance, since later steps of the pipeline that are also run on a per-file basis take much longer, but it's interesting to know all the same.

On a side note, I should flesh out my encoding tests a bit more, too, since it seems there's troubling variability between libraries and we definitely don't want to choose the wrong encoding! We could punt this to the user: for example, we could have them specify their encoding as an env var and, if it's missing, attempt auto-detection. At least they'll have the opportunity to correct a mistaken encoding detection if they know what their desired encoding is. Actually, I think I'm going to add that anyway, since no auto-detection will be 100% accurate.
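A minimal sketch of that env-var override with auto-detection as a fallback (the variable name AI_EDITOR_SRC_ENCODING is hypothetical, not necessarily what the PR ends up using):

import os
import chardet

# hypothetical variable name; the real name would be chosen in the PR
src_encoding = os.environ.get("AI_EDITOR_SRC_ENCODING")
if not src_encoding:
    src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]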

Collaborator:

Sounds like a great plan!

"model",
[
RandomManuscriptRevisionModel(),
# GPT3CompletionModel(None, None),
Collaborator:

Consider removing this comment if it is no longer needed.

falquaddoomi (Collaborator, Author):

FWIW, this is cruft from copying an existing test, but you're right, it shouldn't be here or elsewhere.

    Tests that the editor can revise a manuscript that contains GBK-encoded
    characters, and can detect those characters encoded in UTF-8 in the output.
    """
    print(f"\n{str(tmp_path)}\n")
Collaborator:

Consider removing this print statement, which appears to show the tmp_path location for a file.

falquaddoomi (Collaborator, Author):

IMHO this one is actually useful: tmp_path is a pytest fixture that creates a randomly-named folder and resolves to a Path object; it's not something the caller provides, so printing it isn't redundant. It's helpful to be able to see where the files are located so you can manually inspect them, and the print statement's output is only shown if the test fails.
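As a small illustration of that behavior (a generic pytest example, not this project's test):

def test_example(tmp_path):
    # pytest creates a unique temporary directory and passes it in as a pathlib.Path;
    # the print output is captured and only shown when the test fails
    print(f"\n{tmp_path}\n")
    output_file = tmp_path / "output.md"
    output_file.write_text("revised text", encoding="utf-8")
    assert output_file.exists()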

Collaborator:

Completely fair! I trust your judgement on this.

        # GPT3CompletionModel(None, None),
    ],
)
def test_revise_gbk_encoded_manuscript(tmp_path, model):
Collaborator:

Consider adding type hints for the parameters in this test.
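For example, the signature might become something like the following (ManuscriptRevisionModel is an assumed name for the models' common type, which isn't visible in this excerpt):

from pathlib import Path

def test_revise_gbk_encoded_manuscript(
    tmp_path: Path, model: "ManuscriptRevisionModel"
) -> None:
    ...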

Comment on lines 1220 to 1221
    output_folder = tmp_path
    assert output_folder.exists()
Collaborator:

Consider avoiding the creation of a new label for tmp_path and instead use it directly to help increase understandability.

falquaddoomi (Collaborator, Author), Nov 7, 2024:

This is kind of a lame excuse, but I was following the pattern of the tests already in the suite. Anyway, you're right, they could be tightened up. I'll follow your suggestion here and for other tests I create, and perhaps rewrite the old tests in a PR focused on just that. I'll make an issue to revise the tests to be clearer and more succinct.

# maintain that encoding when reading and writing files
src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]

print("Detected encoding:", src_encoding, flush=True)
Collaborator:

Consider adding formal logging to the project (here or perhaps in later work).
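For instance, the print above could become a standard-library logging call (a minimal sketch; the project doesn't currently configure logging):

import logging

logger = logging.getLogger(__name__)

logger.info("Detected encoding: %s", src_encoding)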

falquaddoomi (Collaborator, Author):

Good call, but I think that should be done in a PR focused specifically on that, since print statements like this one are prevalent in the library. I'll create an issue to capture it.

with open(input_filepath, "r") as infile, open(output_filepath, "w") as outfile:
# detect the input file encoding using chardet
# maintain that encoding when reading and writing files
src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]
Collaborator:

I'm unsure how large the files passed in here might be. Would it make sense to consider reading only a portion of the file if it were very large to help conserve time?

falquaddoomi (Collaborator, Author):

I thought about that (see #62 (comment)), but these files are rarely larger than a few kilobytes, since they're the text of sections of a paper, not binary files. I concluded that the (small, granted) extra engineering effort wasn't worth a difference of milliseconds, especially when it could decrease the accuracy of the encoding detection. Also, the revision process itself takes on the order of seconds to complete since it relies on an external API, so improvements here would IMHO go unnoticed.

I'm of course willing to revise my opinion if we really do have large files that need to be detected or if the difference would be significant. I think your suggestion of using charset_normalizer would also improve both speed and accuracy, so perhaps it'll be enough to switch to it.

…nv vars to specify src/dest encoding manually. Other minor touchups.
Successfully merging this pull request may close issue #61 (error:code).

3 participants