Detect, preserve encoding when revising files #62

Open · wants to merge 4 commits into main

Conversation

falquaddoomi (Collaborator):

This PR addresses issue #61, in which a user reported errors when reading GBK-encoded files, e.g. files containing Chinese characters. To address the issue, I first run chardet.detect() on each input file, then use the resulting encoding when reading and writing the file. The PR includes a test verifying that a few GBK-encoded characters make it through the revision process.

This PR introduces a dependency on chardet in order to detect the encodings of input files.

Closes #61.
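As a rough sketch of the approach (illustrative only, not the PR's exact diff; it assumes input_filepath is a pathlib.Path, as in the changed code below, and revise_file is a placeholder name):

import chardet

def revise_file(input_filepath, output_filepath):
    # detect the input encoding once, then reuse it for both reading and writing
    src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]
    with open(input_filepath, "r", encoding=src_encoding) as infile, open(
        output_filepath, "w", encoding=src_encoding
    ) as outfile:
        outfile.write(infile.read())  # stand-in for the actual revision step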

with open(input_filepath, "r") as infile, open(output_filepath, "w") as outfile:
# detect the input file encoding using chardet
# maintain that encoding when reading and writing files
src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]
falquaddoomi (Collaborator, Author), Oct 9, 2024:

FYI, I'm currently looking into how much of each file we need to read. Reading the entire thing is the safest choice, but chardet might be able to accurately detect the encoding with less data.

Alternatively, I may switch to using UniversalDetector (https://chardet.readthedocs.io/en/latest/usage.html#advanced-usage); I'll have to experiment with the confidence level we should use before stopping. (FWIW, this would all be a lot easier to decide if we had the failing markdown files, so hopefully we'll get those soon.)
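For reference, incremental detection with UniversalDetector would look roughly like this (a sketch following chardet's documented usage; the chunk size is an arbitrary choice):

from chardet.universaldetector import UniversalDetector

def detect_encoding(path, chunk_size=4096):
    detector = UniversalDetector()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:  # stop once the detector is confident enough
                break
    detector.close()
    return detector.result["encoding"]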

Reply:

OK, I know. I have revised line 283:

with open(input_filepath, "r",encoding='utf-8') as infile, open(output_filepath, "w",encoding='utf-8') as outfile:

d33bs (Collaborator) left a comment:

Nice job! This looks like a good change to address the reported issue. I left a few comments with various considerations. Additionally, you might consider pulling in the most recent changes from main, which I believe would allow this PR to show passing tests prior to a merge.

setup.py Outdated
@@ -27,6 +27,7 @@
    install_requires=[
        "openai==0.28",
        "pyyaml",
        "chardet==5.2.0",
Collaborator:

This is a completely optional comment, mentioned mostly to share information. I was experimenting with chardet recently and in one case decided to move to charset_normalizer (specifically, the from_bytes method). It could be worth considering here or in the future, especially when it comes to speed or accuracy, depending on your opinions too.

I took the chance to test the example files provided and found that the two libraries differ in both runtime and detected encoding for the 01.abstract.md file: https://colab.research.google.com/drive/1uNrf9obcpzQK8XHJ3KY4l9zquHo2Xpju?authuser=0#scrollTo=rnZGgBZWumqv. My understanding is that chardet detected ISO-8859-9 (which seems more related to Turkish) while charset_normalizer detected CP949 (which seems more related to Korean). The detection did come at an additional time cost for charset_normalizer, and I'm unsure how it might perform with larger datasets.
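For reference, the charset_normalizer equivalent of chardet.detect() would look roughly like this (a sketch based on its from_bytes API; the utf-8 fallback is an arbitrary choice for when no match is found):

from charset_normalizer import from_bytes

raw = input_filepath.read_bytes()
best_match = from_bytes(raw).best()  # returns None if no plausible encoding is found
src_encoding = best_match.encoding if best_match is not None else "utf-8"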

falquaddoomi (Collaborator, Author):

Interesting; frankly, I'm not very familiar with character-set detection libraries; I just went with chardet because it was the one I'd heard of before. charset_normalizer looks great; while I'm mostly concerned with accuracy, since the files in question are typically small, it's neat that you can get both with that library.

Thanks for going through the trouble of doing a comparison, too! I'm not too concerned about sub-second differences between these libraries' performance, since later steps of the pipeline that are also run on a per-file basis take much longer, but it's interesting to know all the same.

On a side note, I should flesh out my encoding tests a bit more, too, since it seems there's troubling variability between libraries and we definitely don't want to choose the wrong encoding! We could punt this to the user: for example, we could have them specify their encoding as an env var and, if it's missing, attempt auto-detection. At least they'll have the opportunity to correct a mistaken encoding detection if they know what their desired encoding is. Actually, I think I'm going to add that anyway, since no auto-detection will be 100% accurate.
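A minimal sketch of that env-var override with auto-detection as a fallback (the variable name AI_EDITOR_SRC_ENCODING is hypothetical, not necessarily what the PR ends up using):

import os
import chardet

# hypothetical variable name; the real name would be chosen in the PR
src_encoding = os.environ.get("AI_EDITOR_SRC_ENCODING")
if not src_encoding:
    src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]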

Collaborator:

Sounds like a great plan!

"model",
[
RandomManuscriptRevisionModel(),
# GPT3CompletionModel(None, None),
Collaborator:

Consider removing this comment if it is no longer needed.

falquaddoomi (Collaborator, Author):

FWIW, this is cruft from copying an existing test, but you're right, it shouldn't be here or elsewhere.

    Tests that the editor can revise a manuscript that contains GBK-encoded
    characters, and can detect those characters encoded in UTF-8 in the output.
    """
    print(f"\n{str(tmp_path)}\n")
Collaborator:

Consider removing this print statement, which appears to show the tmp_path location for a file.

falquaddoomi (Collaborator, Author):

IMHO this one is actually useful: tmp_path is a pytest fixture that creates a randomly-named folder and resolves to a Path object; it's not something the caller provides, so printing it isn't redundant. It's helpful to be able to see where the files are located so you can manually inspect them, and the print statement's output is only shown if the test fails.
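As a small illustration of that behavior (a generic pytest example, not this project's test):

def test_example(tmp_path):
    # pytest creates a unique temporary directory and passes it in as a pathlib.Path;
    # the print output is captured and only shown when the test fails
    print(f"\n{tmp_path}\n")
    output_file = tmp_path / "output.md"
    output_file.write_text("revised text", encoding="utf-8")
    assert output_file.exists()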

Collaborator:

Completely fair! I trust your judgement on this.

        # GPT3CompletionModel(None, None),
    ],
)
def test_revise_gbk_encoded_manuscript(tmp_path, model):
Collaborator:

Consider adding type hints for the parameters in this test.
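For example, the signature might become something like the following (ManuscriptRevisionModel is an assumed name for the models' common type, which isn't visible in this excerpt):

from pathlib import Path

def test_revise_gbk_encoded_manuscript(
    tmp_path: Path, model: "ManuscriptRevisionModel"
) -> None:
    ...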

Comment on lines 1220 to 1221
    output_folder = tmp_path
    assert output_folder.exists()
Collaborator:

Consider avoiding the creation of a new label for tmp_path and instead use it directly to help increase understandability.

falquaddoomi (Collaborator, Author), Nov 7, 2024:

This is kind of a lame excuse, but I was following the pattern of the tests already in the suite. Anyway, you're right, they could be tightened up. I'll follow your suggestion here and for other tests I create, and perhaps rewrite the old tests in a PR focused on just that. I'll make an issue to revise the tests to be clearer and more succinct.

# maintain that encoding when reading and writing files
src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]

print("Detected encoding:", src_encoding, flush=True)
Collaborator:

Consider adding formal logging to the project (here or perhaps in later work).
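For instance, the print above could become a standard-library logging call (a minimal sketch; the project doesn't currently configure logging):

import logging

logger = logging.getLogger(__name__)

logger.info("Detected encoding: %s", src_encoding)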

falquaddoomi (Collaborator, Author):

Good call, but I think that should be done in a PR focused specifically on that, since print statements like this one are prevalent in the library. I'll create an issue to capture it.

with open(input_filepath, "r") as infile, open(output_filepath, "w") as outfile:
# detect the input file encoding using chardet
# maintain that encoding when reading and writing files
src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]
Collaborator:

I'm unsure how large the files passed in here might be. Would it make sense to consider reading only a portion of the file if it were very large to help conserve time?

falquaddoomi (Collaborator, Author):

I thought about that (see #62 (comment)), but these files are rarely larger than a few kilobytes, since they're the text of sections of a paper, not binary files. I concluded that the (small, granted) extra engineering effort wasn't worth a difference of milliseconds, especially when it could decrease the accuracy of the encoding detection. Also, the revision process itself takes on the order of seconds to complete since it relies on an external API, so improvements here would IMHO go unnoticed.

I'm of course willing to revise my opinion if we really do have large files that need to be detected or if the difference would be significant. I think your suggestion of using charset_normalizer would also improve both speed and accuracy, so perhaps it'll be enough to switch to it.

…nv vars to specify src/dest encoding manually. Other minor touchups.
Successfully merging this pull request may close issue #61 (error:code).

3 participants