Remove PDFs #74

rettinghaus · 2022-10-04T09:40:30Z

This PR removes the duplicated PDFs from the MEI3 and MEI4 collections, leaving those in the legacy folder.
closes #69

craigsapp · 2022-10-04T13:57:09Z

It might be useful to pull out the PDFs into a directory outside of the MEI version directories, because the PDFs are not specific to any given MEI version and applied to all MEI versions. Also when studying an encoding in MEI 4, the user will still need to refer to the PDF, and they will be confused if it is only in MEI 3 folder.

Complications in this idea would be if the MEI encodings refer to different PDFs when changing MEI versions. A solution would be to add a note to the header of the encoding that specifies the URL of the PDF that is being referenced for graphical music.

craigsapp · 2022-10-04T13:58:57Z

Also note that when you delete the PDFs from MEI3, they are still in the repositories history (so no reduction in the size of the repository other than preventing additional copies of the PDFs in later version directories of the complete examples).

th-we · 2022-10-04T18:08:26Z

Also note that when you delete the PDFs from MEI3, they are still in the repositories history (so no reduction in the size of the repository other than preventing additional copies of the PDFs in later version directories of the complete examples).

Git recognizes if the files are identical and will only store them once in the repo, if I'm not totally mistaken. They will be copied to the working tree multiple times, though. I'd say there is no harm in leaving the duplicates in there. On the contrary, it helps greatly when working with the samples.

bwbohl · 2022-10-09T22:01:44Z

I'm +1 for having the files that stood model for the encoding nearby/available. Probably moving them to a separate directory as proposed by @craigsapp and consequently deduplicating them would be a sound solution. Ideally, this should be accompanied by referencing them from the encoding. Maybe just a simple <sourceDesc>/<source> pointing to the file, either relatively or to the github/blob?

bwbohl

As emerged from the discussion, moving the legacy folder PDFs to a separate folder and referencing them from the files would be really good.

rettinghaus · 2023-04-28T13:22:05Z

In most of the files a URL to the original source is present. IMSLP links need to be updated, but we could then delete those PDFs here completely. Anyway, they would stay in the git history.

th-we · 2023-04-28T13:45:46Z

A personal plea: Please do not delete. When working with the samples, it is extremely helpful being able to have the files close to each other to quickly check files. I now can quickly open the MEI and PDF files side by side from the file manager. What takes a split second now may then take minutes because the link has to be identified in the MEI and the file would have to be downloaded. And links may break in the future. The repo size will not shrink anyway when deleting the PDFs as they stay in the history.

ahankinson · 2023-05-01T06:34:34Z

Why are they in a folder called "legacy"? Can't they just be in a folder called "PDF"?

rettinghaus · 2023-05-02T09:33:27Z

It seems virtually impossible to make people happy with this.

ahankinson · 2023-05-02T10:10:37Z

I think it was fine — they weren’t being deleted, and they were being consolidated. I was only asking because legacy implies that they are no longer used, but as many people pointed out they liked having the original next to the encoding. So a folder called “pdf” would be clear and also mark that they were still useful and correspond to the encodings.

musicEnfanthen · 2023-05-03T09:53:21Z

It seems virtually impossible to make people happy with this.

Community starts with a C like compromise ;)

Sorry for being late, but I am still trying to understand the initial intention of the removal. Yes, some of the PDFs are duplicated, but as @th-we pointed out, there can be some reasons to have the PDF documents directly with the encodings. On the other hand, moving all the PDFs to a separate folder (and so deduplicating them) does not provide any improvement in repo size. Quite the opposite: Although Git, as mentioned above, stores identical files only once in the git tree, it will move the internal references of the git tree by creating a new snapshot of the tree in the .git/objects folder (plus another entry for the commit itself). This is quite nicely explained here: https://how-to.dev/how-git-stores-data

As you can see in the screenshots from a clean test repo below, the duplicated identical test.txt file is referenced by the same hash cd/f4... in the git tree from folder 1 and folder 2. Moving it to PDF (or another) folder, will create a new git object 56/9f... that now references cd/f4.... At the same time, the previous tree structure c4/fa... representing the test file in folder1 and folder2 is preserved.

Saying this, I think the usability of the repo structure is more important than Git's whatever internal structuring and organization, so I'm not at all against moving or changing it in principle. I'm just trying to see the added value.

craigsapp · 2023-05-03T11:01:57Z

In case you are not aware: you should use git mv rather than mv when moving files to different locations in a repository (on the command-line). This will prevent a full copy of the file (which is relative large for PDF files) from being stored in the history of the repository (I didn't check how they were moved so this may have been done, but this is a useful thing to know).

craigsapp · 2023-05-03T12:00:05Z

Also, I fail to see the point of moving PDFs of the source editions into any thing such as a legacy folder.

What is the purpose of the sample encodings of music? The example encodings are based on the graphical music in these files, so disassociating them, in particular by placing them into any such legacy folder is highly counterproductive. It is also somewhat counterproductive to be moving older encodings into a legacy folder since their URLs then change.

I would propose this structure:

full-music-encoding-examples folder:

pdf
mei
legacy

The pdf directory contains the PDFs for the sample encodings.

The mei directory contains the MEI files for the most recent version of MEI

The legacy directory contains the MEI files for previous versions of MEI, such as:

legacy folder:

mei-3
mei-4
mei-5

If people want to see the source edition for a legacy encoding, then they would look in the pdf directory of the parent directory.

When a new version of MEI comes out and there is a translation script to update MEI files from the previous version, then the files in mei would be moved to legacy/mei-* (where * is the last version of mei).

It is not sufficient to say "look at the rendering of the file in verovio" because many of the files will not render in verovio (unless that has been fixed in the files). Also the original encodings had lots of errors (in particular related to tuplets), and the original encodings were not checked by output to any resulting notation in any renderer (such as verovio which was not available when the encodings were created).

th-we · 2023-05-03T12:17:19Z

you should use git mv rather than mv when moving files to different locations in a repository (on the command-line). This will prevent a full copy of the file (which is relative large for PDF files) from being stored in the history of the repository

git mv is not required. mv works just fine. (Even cp will not lead to duplicates in the repo database, just in the working tree.)

Git has a rename command git mv, but that is just for convenience. The effect is indistinguishable from removing the file and adding another with different name and the same content.

https://archive.kernel.org/oldwiki/git.wiki.kernel.org/index.php/GitFaq.html#Why_does_Git_not_.22track.22_renames.3F

craigsapp · 2023-05-03T13:08:17Z

git mv is not required. mv works just fine. (Even cp will not lead to duplicates in the repo database, just in the working tree.)

OK. I checked with a test repo and it is behaving as you describe (only checking mv). Presumably git is storing content based on checksums. And with my test, it is also recognizing the system mv as a rename when doing a commit.

Click to view test

$ mkdir repo

$ cd repo/

$ du -sh .
  0B	.

$ git init
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint: 
hint: 	git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint: 	git branch -m <name>
Initialized empty Git repository in /private/tmp/repo/.git/

$ du -sh .
 76K	.

$ mv ../test.pdf .

$ du -sh .
124K	.

$ git add test.pdf

$ du -sh .
164K	.

$ git commit -am "initial commit"
[master (root-commit) c3d092b] initial commit
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 test.pdf

$ du -sh .
188K	.

$ mv test.pdf test2.pdf

$ git status
On branch master
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	deleted:    test.pdf

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	test2.pdf

no changes added to commit (use "git add" and/or "git commit -a")

$ git add test2.pdf

$ du -sh .
188K	.

$ git status
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   test2.pdf

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	deleted:    test.pdf


$ git commit -am "second commit"
[master 1479930] second commit
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename test.pdf => test2.pdf (100%)

$ du -sh .
196K	.

$ git status
On branch master
nothing to commit, working tree clean

One problem with mv versus git mv is that git status reports it as a delete and a new file:

$ git status
On branch master
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	deleted:    test.pdf

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	test2.pdf

And you have to run git add test2.pdf before committing. At the commit stage, git is identifying that the delete/new file is actually a rename:

$ git commit -am "second commit"
[master 1479930] second commit
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename test.pdf => test2.pdf (100%)

If you use git mv, then git is aware that the change is a rename before a commit is done (and git add step is not needed):

`$ git mv test2.pdf test3.pdf

$ git status
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        renamed:    test2.pdf -> test3.pdf

In this case git manages the move rather than the operating system, so it is aware of what is happening rather than deducing it later.

th-we · 2023-05-03T13:45:19Z

I'm sorry for starting to derail this thread with the mv discussion – let's stick to the topic!

craigsapp · 2023-05-03T13:46:55Z

It is useful info to know, though :-)

lpugin · 2023-05-05T05:30:55Z

@craigsapp proposal looks good to me

rettinghaus requested review from bwbohl and musicEnfanthen October 4, 2022 09:40

bwbohl requested changes Apr 27, 2023

View reviewed changes

rettinghaus added 2 commits April 28, 2023 14:39

Remove PDFs

10cd136

deduplicate pdfs in one folder

1e6bd04

rettinghaus force-pushed the pdf branch from 469978f to 1e6bd04 Compare April 28, 2023 13:58

rettinghaus closed this May 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove PDFs #74

Remove PDFs #74

rettinghaus commented Oct 4, 2022

craigsapp commented Oct 4, 2022

craigsapp commented Oct 4, 2022

th-we commented Oct 4, 2022 •

edited

Loading

bwbohl commented Oct 9, 2022 •

edited

Loading

bwbohl left a comment

rettinghaus commented Apr 28, 2023

th-we commented Apr 28, 2023 •

edited

Loading

ahankinson commented May 1, 2023

rettinghaus commented May 2, 2023

ahankinson commented May 2, 2023

musicEnfanthen commented May 3, 2023

craigsapp commented May 3, 2023

craigsapp commented May 3, 2023

th-we commented May 3, 2023

craigsapp commented May 3, 2023

th-we commented May 3, 2023

craigsapp commented May 3, 2023

lpugin commented May 5, 2023

Remove PDFs #74

Remove PDFs #74

Conversation

rettinghaus commented Oct 4, 2022

craigsapp commented Oct 4, 2022

craigsapp commented Oct 4, 2022

th-we commented Oct 4, 2022 • edited Loading

bwbohl commented Oct 9, 2022 • edited Loading

bwbohl left a comment

Choose a reason for hiding this comment

rettinghaus commented Apr 28, 2023

th-we commented Apr 28, 2023 • edited Loading

ahankinson commented May 1, 2023

rettinghaus commented May 2, 2023

ahankinson commented May 2, 2023

musicEnfanthen commented May 3, 2023

craigsapp commented May 3, 2023

craigsapp commented May 3, 2023

th-we commented May 3, 2023

craigsapp commented May 3, 2023

th-we commented May 3, 2023

craigsapp commented May 3, 2023

lpugin commented May 5, 2023

th-we commented Oct 4, 2022 •

edited

Loading

bwbohl commented Oct 9, 2022 •

edited

Loading

th-we commented Apr 28, 2023 •

edited

Loading