GPT4All-v3.3.0: LocalDocs: localdocs_v2 database: local analysis ref. chunks #2988
Replies: 8 comments 2 replies
-
And a graph about the localdocs_v1 database: there either is no field for the number of words in a chunk/snippet, or I haven't found it, so the length of each snippet in characters was used instead. The discrepancy between the peak value (the number of snippets 501 to 600 characters long) and every other bin is enormous.
-
I suppose I'm not reading it correctly: the maximum snippet length in characters is larger than 512 (the value specified in the UI). Is that really the case (and if so, why), or must it be read another way? And what does a length of 1 signify? Is it really a single character, and if so which one, and what is the purpose of a snippet only 1 character long? There are a handful of such records.
-
There has to be a reason why these 1-character snippets exist; maybe they are reusable markers to be placed within other snippets, or control markers within the database. Otherwise I, for one, can't fathom their purpose, although they all come from actual files.
-
A PDF (ofc) file with a nicely formatted analysis of two big files.
-
Version 2 of the PDF (ofc) file with a nicely formatted analysis of two big files, with the images below included in it.
-
I've been doing some research into the structure and contents of my localdocs_v2 database, and found some things that are interesting (for me, the curious common user):
(leaving aside the fact that the value of the "file" field in the "chunks" table is a string repeated over and over again, inflating the size of the database, instead of an integer ID referencing that file)
Program: DB Browser for SQLite, https://sqlitebrowser.org/
Database: localdocs_v2.db
(open Read-only)
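The parenthetical above, about the repeated file strings, can be illustrated with a small sketch of the usual normalization: store each path once in a separate table and reference it by integer ID. The table and column names here are assumptions loosely based on the fields visible in localdocs_v2.db, not GPT4All's actual schema.

```python
import sqlite3

# Hypothetical normalized layout: paths live once in "files",
# and "chunks" stores only an integer foreign key.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
CREATE TABLE chunks (document_id INTEGER,
                     file_id INTEGER REFERENCES files(id),
                     chunk_text TEXT, words INTEGER);
""")

def add_chunk(path, document_id, text, words):
    # Insert the path once; reuse its integer ID for every later chunk.
    con.execute("INSERT OR IGNORE INTO files(path) VALUES (?)", (path,))
    (file_id,) = con.execute("SELECT id FROM files WHERE path = ?",
                             (path,)).fetchone()
    con.execute("INSERT INTO chunks VALUES (?, ?, ?, ?)",
                (document_id, file_id, text, words))

for i in range(100):
    add_chunk("/docs/big_report.pdf", 1, f"chunk {i}", 2)

# The long path is stored once, not 100 times.
(n_paths,) = con.execute("SELECT COUNT(*) FROM files").fetchone()
print(n_paths)  # 1
```

With many chunks per document, this keeps each path to one row regardless of how many chunks reference it.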
The graph shows the distribution of the number of chunks per file; an OX value for a given "document_id" means that file appears that many times in the "chunks" table, with the peak at a document ID appearing ~125 times.
Smaller numbers of chunks obviously come from smaller files, and vice versa.
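The same chunks-per-document counts can be pulled straight out of the database with a GROUP BY; here is a minimal sketch, assuming a "chunks" table with a "document_id" column as described above (the in-memory sample data is made up for illustration; point the connection at your own localdocs_v2.db instead).

```python
import sqlite3
from collections import Counter

con = sqlite3.connect(":memory:")  # use the path to localdocs_v2.db instead
con.execute("CREATE TABLE chunks (document_id INTEGER, chunk_text TEXT)")
con.executemany("INSERT INTO chunks VALUES (?, ?)",
                [(1, "a")] * 125 + [(2, "b")] * 3 + [(3, "c")] * 40)

# One row per document_id with its chunk count.
per_doc = con.execute(
    "SELECT document_id, COUNT(*) AS n FROM chunks GROUP BY document_id"
).fetchall()

# Histogram of those counts: how many documents have n chunks each.
hist = Counter(n for _, n in per_doc)
print(sorted(hist.items()))
```

The `hist` counter is exactly the data behind the graph: the x axis is the chunk count, the y axis is how many documents have that count.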
SELECT COUNT(words) FROM chunks
SELECT AVG(words) FROM chunks
SELECT COUNT(DISTINCT words) FROM chunks
SELECT MAX(words) FROM chunks
SELECT COUNT(words) FROM chunks WHERE words=256
SELECT MIN(words) FROM chunks
SELECT COUNT(words) FROM chunks WHERE words=1
SELECT COUNT(words) FROM chunks WHERE words BETWEEN 65 AND 85
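All of the word-count statistics above can be gathered in a single SELECT instead of one query each; a sketch, again assuming a "chunks" table with a "words" column (the sample rows are invented so the snippet is self-contained):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use the path to localdocs_v2.db instead
con.execute("CREATE TABLE chunks (words INTEGER)")
con.executemany("INSERT INTO chunks VALUES (?)", [(1,), (70,), (256,), (256,)])

# In SQLite a comparison yields 0 or 1, so SUM(condition) counts rows.
row = con.execute("""
    SELECT COUNT(words),
           AVG(words),
           COUNT(DISTINCT words),
           MAX(words),
           MIN(words),
           SUM(words = 256),            -- chunks exactly 256 words long
           SUM(words = 1),              -- the puzzling 1-word chunks
           SUM(words BETWEEN 65 AND 85)
    FROM chunks
""").fetchone()
total, avg, distinct, wmax, wmin, n256, n1, n65_85 = row
print(total, wmax, wmin, n256, n1)  # 4 256 1 2 1
```

One pass over the table instead of eight separate scans.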
(the number of tokens, a field in the same table, is 0 everywhere; it may be reserved for future use)
The average length, in characters, of the values in the "chunk_text" field (those are words, not tokens): 466.x
SELECT AVG(length(chunk_text)) FROM chunks
with the maximum, SELECT MAX(length(chunk_text)) FROM chunks: 665 characters
and the minimum, SELECT MIN(length(chunk_text)) FROM chunks: 1 character
There are 297 distinct document IDs:
SELECT COUNT(DISTINCT document_id) FROM chunks
and 295 distinct files along with their paths, all repeated as strings over and over, letting the database grow:
SELECT COUNT(DISTINCT file) FROM chunks
-- why the difference? why not?
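One way to investigate the 297-vs-295 discrepancy is to list the file paths that appear under more than one document_id. A sketch under the same schema assumption (a "chunks" table with "document_id" and "file" columns), with invented sample rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use the path to localdocs_v2.db instead
con.execute("CREATE TABLE chunks (document_id INTEGER, file TEXT)")
con.executemany("INSERT INTO chunks VALUES (?, ?)", [
    (1, "/docs/a.pdf"), (2, "/docs/b.pdf"),
    (3, "/docs/a.pdf"),  # same path indexed twice under a new ID
])

# Paths associated with more than one document_id.
dupes = con.execute("""
    SELECT file, COUNT(DISTINCT document_id) AS ids
    FROM chunks
    GROUP BY file
    HAVING ids > 1
""").fetchall()
print(dupes)  # [('/docs/a.pdf', 2)]
```

If this returns rows on the real database, the two extra document IDs would be files that were (re)indexed more than once; if it returns nothing, the difference must come from elsewhere.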
As there is no way to build more relevant and generic graphs, such as the number of words in a chunk vs. the number of words in its source document (there are no document_filesize or document_numberofwords fields anywhere in the database), graphs that would reveal the chunking method, these crude ones referring to one particular user's LocalDocs will have to do.