Newly merged code shows as Chinese text #59

rosco-pc opened this issue Jun 11, 2020 · 5 comments

@rosco-pc
Contributor

Community libraries that were merged in May show up as Chinese text in GitHub's code view.
[screenshot: github_issue]

Looking at the raw file, it looks OK (apart from not having the Propeller font selected).
[screenshot: github_issue_raw]

I assume this is related to propTool saving data in UTF-16.

@PropGit
Collaborator

PropGit commented Jun 11, 2020

Yes, you're right, it's related to Propeller Tool saving data as UTF-16. That's because my judgement call 15 years ago was wrong: I predicted that the world would settle on UTF-16, or at least support it naturally, as already seemed to be happening with simple tools like Windows Notepad.

GitHub doesn't seem to process UTF-16 in their quick view, but does seem to in their RAW view. Also, it's difficult (in my experience) to get Git to play nicely with UTF-16; I haven't found a way to tell Git that all .spin files are always text files that may be encoded in either ANSI or UTF-16. Unless we can find a proven-to-work-in-all-cases solution, I think the best option (since I'm building newer Propeller Tool versions right now to support P2) is to add support for storing files in either ANSI or UTF-8 format (as needed) by default, and for automatically opening UTF-16 files and converting them to UTF-8 (with a prompt on saving).
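A minimal sketch of the "ANSI or UTF-8, as needed" save rule described above. It assumes "ANSI" means Windows-1252 (`cp1252`); the function name `choose_save_encoding` is hypothetical and is not taken from Propeller Tool.

```python
def choose_save_encoding(text: str) -> str:
    """Return 'cp1252' if every character fits in ANSI, otherwise 'utf-8'.

    Assumption: "ANSI" is Windows-1252. Plain ASCII source always fits,
    while box-drawing characters such as those used for circuit
    diagrams do not, so they force UTF-8.
    """
    try:
        text.encode("cp1252")
        return "cp1252"
    except UnicodeEncodeError:
        return "utf-8"


def encode_for_save(text: str) -> bytes:
    """Encode the text with whichever encoding the rule above selects."""
    return text.encode(choose_save_encoding(text))
```

With this rule, a file only becomes UTF-8 once it contains a character that ANSI cannot represent, which mirrors how .spin files are described as staying ANSI unless they contain a non-ANSI character.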

It would mean that future .spin files may not be understandable by old Propeller Tools, and everyone would need to update if they want to continue with Propeller Tool.

What do you think?

@rosco-pc
Contributor Author

rosco-pc commented Jun 11, 2020

My thoughts in a nutshell.
WRT the repo:
Git has been able to handle/convert UTF-16 files since the UTF-16 commit in February this year:
https://www.git-scm.com/docs/gitattributes#_working_tree_encoding

You can add a .gitattributes file in the root of the propeller repo and have it handle the conversion automatically, as with line endings. I'm not sure whether GitHub already supports this.

I have not tried propTool (I cannot install it on the only Windows machine I have, which is work-supplied), but do you save P2 files as .spin2 and P1 files as .spin?
If that is the case, and GitHub supports the new working-tree-encoding attribute, the automatic conversion should make it possible to configure .spin files to be handled as UTF-16 files and still display properly.

WRT propTool:

  • UTF-8 for P2 files and UTF-16 for P1 files IF the new gitattribute is supported
  • Otherwise, a new file format preference that allows ASCII/UTF-8/UTF-16 output to be specified. At least one setting, but maybe separate settings for P1 and P2 output
  • Support for reading any of the formats, with conversion on saving depending on the file format preference

EDIT: the rationale for UTF-8 is to keep supporting the Propeller font, which is used in a lot of files for simple circuit/signal drawings.

@PropGit
Collaborator

PropGit commented Jun 15, 2020

Thank you, @rosco-pc. I'm considering this and will experiment. I've used the working-tree-encoding feature and found that it didn't solve the problem in the way I expected; specifically (if I'm remembering right), it treats all files that match the expression (i.e. *.spin) as UTF-16 that needs to be encoded internally as UTF-8 and re-converted to UTF-16 upon checkout, but that damages ANSI-encoded .spin files (as a .spin file is ANSI unless it contains a non-ANSI character).
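To illustrate the kind of damage meant here, this small sketch reads ANSI/ASCII bytes as if they were UTF-16LE. It is only a demonstration of the misinterpretation, not a trace of what Git or GitHub actually does internally; the sample line `PUB main` is made up for the example.

```python
# Eight ASCII bytes, one per character.
ansi_bytes = "PUB main".encode("ascii")

# Decoding the same bytes as UTF-16LE pairs them up into four code units.
misread = ansi_bytes.decode("utf-16-le")

# Code points of the resulting characters.
code_points = [hex(ord(ch)) for ch in misread]

# Which of them fall in the CJK Unified Ideographs block (U+4E00..U+9FFF).
is_cjk = [0x4E00 <= ord(ch) <= 0x9FFF for ch in misread]

print(code_points)  # ['0x5550', '0x2042', '0x616d', '0x6e69']
print(is_cjk)       # [True, False, True, True]
```

Two Latin letters packed into one 16-bit unit mostly land in the CJK range, which is consistent with source code showing up as "Chinese text" when the encoding is guessed wrong.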

I think what you are saying either already acknowledges that, or accommodates it, by suggesting that all .spin entries in this repo be converted to UTF-16 for current and future use, along with adding the working-tree-encoding attributes to the repo (and testing them with GitHub), which would make it a smooth operation for GitHub viewing and Propeller Tool use.

Ideally, this would be assisted by a custom smudge filter (I've never written one) that would do the conversion automatically to ensure UTF-16 .spin files are input. Actually... that may be the best solution overall... a custom smudge filter that understands that .spin files could be either ANSI or UTF-16 and it detects and converts as necessary in both directions. This would make it seamlessly handle the situation and could even be made to detect UTF-8 .spin and .spin2 files as well.
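A rough sketch of what the two conversion directions of such a filter could look like. Assumptions, none of which come from this thread: "ANSI" is treated as Windows-1252, the repository side is always UTF-8 without a BOM, the checkout side prefers ANSI and falls back to UTF-16LE with a BOM, and the function names are hypothetical. A real filter would read stdin and write stdout, and would need testing against actual Propeller Tool files.

```python
import codecs


def detect_encoding(data: bytes) -> str:
    """Guess the encoding of a .spin/.spin2 file from its bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # this codec strips the UTF-8 BOM
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"      # this codec reads the BOM to pick the byte order
    try:
        data.decode("utf-8")
        return "utf-8"       # also covers plain ASCII
    except UnicodeDecodeError:
        return "cp1252"      # assumption: "ANSI" means Windows-1252


def clean(data: bytes) -> bytes:
    """Working tree -> repository: normalise to UTF-8 without a BOM."""
    return data.decode(detect_encoding(data)).encode("utf-8")


def smudge(data: bytes) -> bytes:
    """Repository -> working tree: ANSI when possible, else UTF-16LE + BOM."""
    text = data.decode("utf-8")
    try:
        return text.encode("cp1252")
    except UnicodeEncodeError:
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

In Git terms, the "clean" side runs when a file is staged and the "smudge" side runs on checkout; both are configured per filter name (`filter.<name>.clean` and `filter.<name>.smudge`) and attached to a pattern with `filter=<name>` in .gitattributes. Note that BOM-less UTF-16 is not detected by this sketch.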

@canadajones

I'm no expert on git or anything like it, but would it be possible to detect the byte order mark? ANSI/ASCII have no BOM, UTF-16 has 0xFEFF and UTF-8 has the sequence 0xEF, 0xBB, 0xBF.
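A minimal sketch of that BOM check, using the BOM constants from Python's standard `codecs` module. The function name `sniff_bom` is made up for the example. Note that the UTF-16 BOM U+FEFF appears on disk as the bytes FF FE (little-endian) or FE FF (big-endian).

```python
import codecs
from typing import Optional

# BOM byte sequences for the encodings discussed in this thread.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A `None` result only means no BOM was found; the file could be ANSI/ASCII, or a UTF file saved without a BOM, so a fallback check would still be needed.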

@rosco-pc
Contributor Author

rosco-pc commented Jul 20, 2020

I'm far from a git expert as well; I use it for work, and now and then I still need to start over with a fresh checkout because I messed up :P.

UTF files can be stored with or without a BOM. UTF-16 files without a BOM will be treated as binary files by git, though.
iconv / libiconv can be used to check the BOM (see the UTF-16 commit link above), but I'm not sure how to integrate that with GitHub (I would use a hook in git; maybe a webhook on GitHub?).

However, I'm not sure this is needed, as the git attribute discussed above seems to do the right thing:

*.spin working-tree-encoding=UTF-16

I tested it with a test repo: https://github.com/rosco-pc/test-utf-handling.git, which keeps the files in the right format and does not display them as something else.
I still need to test us-ascii, though (the file I'm using contains chars with codes >128, so it was automatically stored as UTF-8 by the editor I used, geany).
Edit: an ASCII file has now been added; in my case it is extracted as a UTF-8 file on both Windows and Unix.

Checking out the repo keeps the format, although checking out on Windows seems to add a BOM to the non-BOM UTF-16 file.

original.zip
original_files.txt

Edit 2: Hmm, downloading the zip file shows all files as UTF-16LE + BOM. But I do not see any corruption, and the file still compiles with openSpin.
