GPT4All-v3.3.0: LocalDocs: localdocs_v2 database: local analysis ref. chunks #2988
Replies: 8 comments 2 replies
-
And a graph about the localdocs_v1 database: there either is no field for the number of words in a chunk/snippet, or I haven't found it, so the length of each snippet in characters was used instead. The discrepancy between the peak value (the number of snippets 501 to 600 characters long) and every other bin is enormous.
-
I suppose I'm not reading it correctly: the maximum snippet length in characters is larger than 512 (the value specified in the UI). Is that really the case (and if so, why), or must it be read another way? And what does a length of 1 signify? Is it really a single character, and if so which one, and what is the purpose of a snippet only 1 character long? There are a handful of such records.
-
There has to be a reason why these 1-character snippets exist; maybe they are reusable markers to be placed within other snippets, or control markers within the database. Otherwise I, for one, can't fathom their purpose, although they all come from actual files.
-
A PDF (ofc) file with a nicely formatted analysis of two big files.
-
Version 2 of the PDF (ofc) file with a nicely formatted analysis of two big files, with the images below included in it.
-
I've been doing some research into the structure and contents of my localdocs_v2 database, and found some things that are interesting (for me, the curious common user):
(leaving aside the fact that the value of the "file" field in the "chunks" table is a string repeated over and over again, inflating the size of the database, instead of an integer ID referencing that file)
Program: DB Browser for SQLite, https://sqlitebrowser.org/
Database: localdocs_v2.db
(open Read-only)
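The parenthetical above, about the repeated file strings, can be illustrated with a small sketch of the usual normalization: store each path once in a separate table and reference it by integer ID. The table and column names here are assumptions loosely based on the fields visible in localdocs_v2.db, not GPT4All's actual schema.

```python
import sqlite3

# Hypothetical normalized layout: paths live once in "files",
# and "chunks" stores only an integer foreign key.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
CREATE TABLE chunks (document_id INTEGER,
                     file_id INTEGER REFERENCES files(id),
                     chunk_text TEXT, words INTEGER);
""")

def add_chunk(path, document_id, text, words):
    # Insert the path once; reuse its integer ID for every later chunk.
    con.execute("INSERT OR IGNORE INTO files(path) VALUES (?)", (path,))
    (file_id,) = con.execute("SELECT id FROM files WHERE path = ?",
                             (path,)).fetchone()
    con.execute("INSERT INTO chunks VALUES (?, ?, ?, ?)",
                (document_id, file_id, text, words))

for i in range(100):
    add_chunk("/docs/big_report.pdf", 1, f"chunk {i}", 2)

# The long path is stored once, not 100 times.
(n_paths,) = con.execute("SELECT COUNT(*) FROM files").fetchone()
print(n_paths)  # 1
```

With many chunks per document, this keeps each path to one row regardless of how many chunks reference it.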
The graph shows the distribution of the number of chunks per file; an OX value for a given "document_id" means that file appears that many times in the "chunks" table, with the peak at a document ID appearing ~125 times.
Smaller numbers of chunks obviously come from smaller files, and vice versa.
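The same chunks-per-document counts can be pulled straight out of the database with a GROUP BY; here is a minimal sketch, assuming a "chunks" table with a "document_id" column as described above (the in-memory sample data is made up for illustration; point the connection at your own localdocs_v2.db instead).

```python
import sqlite3
from collections import Counter

con = sqlite3.connect(":memory:")  # use the path to localdocs_v2.db instead
con.execute("CREATE TABLE chunks (document_id INTEGER, chunk_text TEXT)")
con.executemany("INSERT INTO chunks VALUES (?, ?)",
                [(1, "a")] * 125 + [(2, "b")] * 3 + [(3, "c")] * 40)

# One row per document_id with its chunk count.
per_doc = con.execute(
    "SELECT document_id, COUNT(*) AS n FROM chunks GROUP BY document_id"
).fetchall()

# Histogram of those counts: how many documents have n chunks each.
hist = Counter(n for _, n in per_doc)
print(sorted(hist.items()))
```

The `hist` counter is exactly the data behind the graph: the x axis is the chunk count, the y axis is how many documents have that count.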
SELECT COUNT(words) FROM chunks
SELECT AVG(words) FROM chunks
SELECT COUNT(DISTINCT words) FROM chunks
SELECT MAX(words) FROM chunks
SELECT COUNT(words) FROM chunks WHERE words=256
SELECT MIN(words) FROM chunks
SELECT COUNT(words) FROM chunks WHERE words=1
SELECT COUNT(words) FROM chunks WHERE words BETWEEN 65 AND 85
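All of the word-count statistics above can be gathered in a single SELECT instead of one query each; a sketch, again assuming a "chunks" table with a "words" column (the sample rows are invented so the snippet is self-contained):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use the path to localdocs_v2.db instead
con.execute("CREATE TABLE chunks (words INTEGER)")
con.executemany("INSERT INTO chunks VALUES (?)", [(1,), (70,), (256,), (256,)])

# In SQLite a comparison yields 0 or 1, so SUM(condition) counts rows.
row = con.execute("""
    SELECT COUNT(words),
           AVG(words),
           COUNT(DISTINCT words),
           MAX(words),
           MIN(words),
           SUM(words = 256),            -- chunks exactly 256 words long
           SUM(words = 1),              -- the puzzling 1-word chunks
           SUM(words BETWEEN 65 AND 85)
    FROM chunks
""").fetchone()
total, avg, distinct, wmax, wmin, n256, n1, n65_85 = row
print(total, wmax, wmin, n256, n1)  # 4 256 1 2 1
```

One pass over the table instead of eight separate scans.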
(the number of tokens, a field in the same table, is 0 everywhere; it may be reserved for future use)
The average length, in characters, of the values in the "chunk_text" field (those are words, not tokens): 466.x
SELECT AVG(length(chunk_text)) FROM chunks
with the maximum, SELECT MAX(length(chunk_text)) FROM chunks: 665 characters
and the minimum, SELECT MIN(length(chunk_text)) FROM chunks: 1 character
There are 297 distinct document IDs:
SELECT COUNT(DISTINCT document_id) FROM chunks
and 295 distinct files along with their paths, all repeated as strings over and over, letting the database grow:
SELECT COUNT(DISTINCT file) FROM chunks
-- why the difference? why not?
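One way to investigate the 297-vs-295 discrepancy is to list the file paths that appear under more than one document_id. A sketch under the same schema assumption (a "chunks" table with "document_id" and "file" columns), with invented sample rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use the path to localdocs_v2.db instead
con.execute("CREATE TABLE chunks (document_id INTEGER, file TEXT)")
con.executemany("INSERT INTO chunks VALUES (?, ?)", [
    (1, "/docs/a.pdf"), (2, "/docs/b.pdf"),
    (3, "/docs/a.pdf"),  # same path indexed twice under a new ID
])

# Paths associated with more than one document_id.
dupes = con.execute("""
    SELECT file, COUNT(DISTINCT document_id) AS ids
    FROM chunks
    GROUP BY file
    HAVING ids > 1
""").fetchall()
print(dupes)  # [('/docs/a.pdf', 2)]
```

If this returns rows on the real database, the two extra document IDs would be files that were (re)indexed more than once; if it returns nothing, the difference must come from elsewhere.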
As there is no way to build more relevant and generic graphs, such as the number of words in a chunk vs. the number of words in its source document (there are no document_filesize or document_numberofwords fields anywhere in the database), graphs that would reveal the chunking method, these crude ones referring to one particular user's LocalDocs will have to do.