Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove PDFs #74

Closed
wants to merge 2 commits into from
Closed

Remove PDFs #74

wants to merge 2 commits into from

Conversation

rettinghaus
Copy link
Member

This PR removes the duplicated PDFs from the MEI3 and MEI4 collections, leaving those in the legacy folder.
closes #69

@craigsapp
Copy link
Member

It might be useful to pull out the PDFs into a directory outside of the MEI version directories, because the PDFs are not specific to any given MEI version and applied to all MEI versions. Also when studying an encoding in MEI 4, the user will still need to refer to the PDF, and they will be confused if it is only in MEI 3 folder.

Complications in this idea would be if the MEI encodings refer to different PDFs when changing MEI versions. A solution would be to add a note to the header of the encoding that specifies the URL of the PDF that is being referenced for graphical music.

@craigsapp
Copy link
Member

Also note that when you delete the PDFs from MEI3, they are still in the repositories history (so no reduction in the size of the repository other than preventing additional copies of the PDFs in later version directories of the complete examples).

@th-we
Copy link
Member

th-we commented Oct 4, 2022

Also note that when you delete the PDFs from MEI3, they are still in the repositories history (so no reduction in the size of the repository other than preventing additional copies of the PDFs in later version directories of the complete examples).

Git recognizes if the files are identical and will only store them once in the repo, if I'm not totally mistaken. They will be copied to the working tree multiple times, though. I'd say there is no harm in leaving the duplicates in there. On the contrary, it helps greatly when working with the samples.

@bwbohl
Copy link
Member

bwbohl commented Oct 9, 2022

I'm +1 for having the files that stood model for the encoding nearby/available. Probably moving them to a separate directory as proposed by @craigsapp and consequently deduplicating them would be a sound solution. Ideally, this should be accompanied by referencing them from the encoding. Maybe just a simple <sourceDesc>/<source> pointing to the file, either relatively or to the github/blob?

Copy link
Member

@bwbohl bwbohl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As emerged from the discussion, moving the legacy folder PDFs to a separate folder and referencing them from the files would be really good.

@rettinghaus
Copy link
Member Author

In most of the files a URL to the original source is present. IMSLP links need to be updated, but we could then delete those PDFs here completely. Anyway, they would stay in the git history.

@th-we
Copy link
Member

th-we commented Apr 28, 2023

A personal plea: Please do not delete. When working with the samples, it is extremely helpful being able to have the files close to each other to quickly check files. I now can quickly open the MEI and PDF files side by side from the file manager. What takes a split second now may then take minutes because the link has to be identified in the MEI and the file would have to be downloaded. And links may break in the future. The repo size will not shrink anyway when deleting the PDFs as they stay in the history.

@ahankinson
Copy link
Member

Why are they in a folder called "legacy"? Can't they just be in a folder called "PDF"?

@rettinghaus
Copy link
Member Author

It seems virtually impossible to make people happy with this.

@rettinghaus rettinghaus closed this May 2, 2023
@ahankinson
Copy link
Member

I think it was fine — they weren’t being deleted, and they were being consolidated. I was only asking because legacy implies that they are no longer used, but as many people pointed out they liked having the original next to the encoding. So a folder called “pdf” would be clear and also mark that they were still useful and correspond to the encodings.

@musicEnfanthen
Copy link
Member

It seems virtually impossible to make people happy with this.

Community starts with a C like compromise ;)

Sorry for being late, but I am still trying to understand the initial intention of the removal. Yes, some of the PDFs are duplicated, but as @th-we pointed out, there can be some reasons to have the PDF documents directly with the encodings. On the other hand, moving all the PDFs to a separate folder (and so deduplicating them) does not provide any improvement in repo size. Quite the opposite: Although Git, as mentioned above, stores identical files only once in the git tree, it will move the internal references of the git tree by creating a new snapshot of the tree in the .git/objects folder (plus another entry for the commit itself). This is quite nicely explained here: https://how-to.dev/how-git-stores-data

As you can see in the screenshots from a clean test repo below, the duplicated identical test.txt file is referenced by the same hash cd/f4... in the git tree from folder 1 and folder 2. Moving it to PDF (or another) folder, will create a new git object 56/9f... that now references cd/f4.... At the same time, the previous tree structure c4/fa... representing the test file in folder1 and folder2 is preserved.

Saying this, I think the usability of the repo structure is more important than Git's whatever internal structuring and organization, so I'm not at all against moving or changing it in principle. I'm just trying to see the added value.

Screenshot 2023-05-03 112240
Screenshot 2023-05-03 112302

@craigsapp
Copy link
Member

In case you are not aware: you should use git mv rather than mv when moving files to different locations in a repository (on the command-line). This will prevent a full copy of the file (which is relative large for PDF files) from being stored in the history of the repository (I didn't check how they were moved so this may have been done, but this is a useful thing to know).

@craigsapp
Copy link
Member

Also, I fail to see the point of moving PDFs of the source editions into any thing such as a legacy folder.

What is the purpose of the sample encodings of music? The example encodings are based on the graphical music in these files, so disassociating them, in particular by placing them into any such legacy folder is highly counterproductive. It is also somewhat counterproductive to be moving older encodings into a legacy folder since their URLs then change.

I would propose this structure:

full-music-encoding-examples folder:

  • pdf
  • mei
  • legacy

The pdf directory contains the PDFs for the sample encodings.

The mei directory contains the MEI files for the most recent version of MEI

The legacy directory contains the MEI files for previous versions of MEI, such as:

legacy folder:

  • mei-3
  • mei-4
  • mei-5

If people want to see the source edition for a legacy encoding, then they would look in the pdf directory of the parent directory.

When a new version of MEI comes out and there is a translation script to update MEI files from the previous version, then the files in mei would be moved to legacy/mei-* (where * is the last version of mei).

It is not sufficient to say "look at the rendering of the file in verovio" because many of the files will not render in verovio (unless that has been fixed in the files). Also the original encodings had lots of errors (in particular related to tuplets), and the original encodings were not checked by output to any resulting notation in any renderer (such as verovio which was not available when the encodings were created).

@th-we
Copy link
Member

th-we commented May 3, 2023

you should use git mv rather than mv when moving files to different locations in a repository (on the command-line). This will prevent a full copy of the file (which is relative large for PDF files) from being stored in the history of the repository

git mv is not required. mv works just fine. (Even cp will not lead to duplicates in the repo database, just in the working tree.)

Git has a rename command git mv, but that is just for convenience. The effect is indistinguishable from removing the file and adding another with different name and the same content.

https://archive.kernel.org/oldwiki/git.wiki.kernel.org/index.php/GitFaq.html#Why_does_Git_not_.22track.22_renames.3F

@craigsapp
Copy link
Member

git mv is not required. mv works just fine. (Even cp will not lead to duplicates in the repo database, just in the working tree.)

OK. I checked with a test repo and it is behaving as you describe (only checking mv). Presumably git is storing content based on checksums. And with my test, it is also recognizing the system mv as a rename when doing a commit.

Click to view test
$ mkdir repo

$ cd repo/

$ du -sh .
  0B	.

$ git init
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint: 
hint: 	git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint: 	git branch -m <name>
Initialized empty Git repository in /private/tmp/repo/.git/

$ du -sh .
 76K	.

$ mv ../test.pdf .

$ du -sh .
124K	.

$ git add test.pdf

$ du -sh .
164K	.

$ git commit -am "initial commit"
[master (root-commit) c3d092b] initial commit
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 test.pdf

$ du -sh .
188K	.

$ mv test.pdf test2.pdf

$ git status
On branch master
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	deleted:    test.pdf

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	test2.pdf

no changes added to commit (use "git add" and/or "git commit -a")

$ git add test2.pdf

$ du -sh .
188K	.

$ git status
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   test2.pdf

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	deleted:    test.pdf


$ git commit -am "second commit"
[master 1479930] second commit
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename test.pdf => test2.pdf (100%)

$ du -sh .
196K	.

$ git status
On branch master
nothing to commit, working tree clean

One problem with mv versus git mv is that git status reports it as a delete and a new file:

$ git status
On branch master
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	deleted:    test.pdf

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	test2.pdf

And you have to run git add test2.pdf before committing. At the commit stage, git is identifying that the delete/new file is actually a rename:

$ git commit -am "second commit"
[master 1479930] second commit
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename test.pdf => test2.pdf (100%)

If you use git mv, then git is aware that the change is a rename before a commit is done (and git add step is not needed):

`$ git mv test2.pdf test3.pdf

$ git status
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        renamed:    test2.pdf -> test3.pdf

In this case git manages the move rather than the operating system, so it is aware of what is happening rather than deducing it later.

@th-we
Copy link
Member

th-we commented May 3, 2023

I'm sorry for starting to derail this thread with the mv discussion – let's stick to the topic!

@craigsapp
Copy link
Member

It is useful info to know, though :-)

@lpugin
Copy link
Member

lpugin commented May 5, 2023

@craigsapp proposal looks good to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

Deduplicate PDFs
7 participants