feat: move to use maven-index repository with sharding support #57
Conversation
- Removed GroupID and ArtifactID from the Index struct, replaced with Packaging.
- Updated the Build function to derive GroupID and ArtifactID from the directory path.
- Enhanced file path creation to convert GroupID into directory names for JSON output.
- Adjusted JSON field names for the Version struct for better clarity.

- Introduced the indexDir flag for specifying the index repository directory.
- Updated the Build function to accept indexDir instead of cacheDir.
- Improved the SHA1 fetching logic to return a string instead of a byte slice.
- Enhanced error handling and logging for SHA1 validation.
- Adjusted JSON serialization for the Version struct to use a string for SHA1.

- Eliminated the dbc field from the Crawler struct as it is no longer needed.
- Updated the NewCrawler function to reflect the removal of the database connection.

- Enhanced the crawlSHA1 function to check for previously saved versions before fetching new SHA1s.
- Introduced a new savedVersions function to read existing versions from a JSON file.
- Added an Exists function in fileutil to check for file existence before attempting to open it (see the sketch below).
- Updated the JSON writing logic to use the complete file path for better clarity and maintainability.
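A minimal sketch of what such an `Exists` helper typically looks like in Go; the actual signature and error handling in `pkg/fileutil` may differ:

```go
package fileutil

import "os"

// Exists reports whether a file exists at the given path.
// Errors other than "does not exist" are returned to the caller.
func Exists(path string) (bool, error) {
	_, err := os.Stat(path)
	if err == nil {
		return true, nil
	}
	if os.IsNotExist(err) {
		return false, nil
	}
	return false, err
}
```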
- Upgraded the Go version from 1.22.7 to 1.23.0 and set the toolchain to go1.24.2.
- Updated the golang.org/x/sync dependency to version 0.13.0 for improved concurrency handling.
- Refactored the crawl function to use a simpler limit parameter instead of a weighted semaphore (see the sketch below).
- Simplified the index structure to include only SHA1, streamlining the data model.
- Improved error handling and logging in the crawlSHA1 function for better traceability.
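For context, `golang.org/x/sync`'s errgroup exposes exactly this kind of plain limit via `SetLimit`. The sketch below is illustrative only; `fetchSHA1`, `crawlAll`, and the surrounding names are hypothetical, not the PR's actual code:

```go
package crawler

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// fetchSHA1 is a hypothetical stand-in for the per-artifact work.
func fetchSHA1(ctx context.Context, gav string) error { return nil }

// crawlAll keeps at most `limit` goroutines in flight, replacing manual
// bookkeeping with a weighted semaphore.
func crawlAll(ctx context.Context, gavs []string, limit int) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(limit) // Go() blocks once `limit` goroutines are active
	for _, gav := range gavs {
		g.Go(func() error { // Go 1.22+ scopes `gav` per iteration
			return fetchSHA1(ctx, gav)
		})
	}
	return g.Wait()
}
```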
- Introduced the shardCount parameter to enable sharding of artifact records during processing.
- Updated the Crawler and Aggregator structures to handle sharding logic.
- Enhanced the crawling pipeline to include Lister, Fetcher, and Aggregator components for improved data handling.
- Implemented error handling and logging improvements throughout the crawling process.
- Adjusted the file writing logic to support sharded output files for better organization and performance.

…d output
- Added logging for every 100,000 processed artifacts to track progress.
- Updated the Aggregator to flush records every 100,000 processed instead of 10,000 for improved performance.
- Implemented a hierarchical directory structure for sharded output files, enhancing organization (see the sketch below).
- Created subdirectories based on the shard index to better manage output files.
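Purely as an illustration of the hierarchical layout idea (the PR's actual naming and nesting scheme may differ), a shard index can be mapped to a nested path so that no single directory accumulates too many files:

```go
package crawler

import (
	"fmt"
	"path/filepath"
)

// shardFilePath nests shard files in subdirectories keyed by shard index.
// The layout here is an assumption, e.g. shard 300 -> <indexDir>/01/0300.tsv.
func shardFilePath(indexDir string, shard int) string {
	sub := fmt.Sprintf("%02x", shard/256) // one subdirectory per 256 shards
	return filepath.Join(indexDir, sub, fmt.Sprintf("%04d.tsv", shard))
}
```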
- Increased maxResults from 5000 to 10000 to enhance performance during artifact fetching.
- Removed the randomSleep function to streamline the HTTP request process, reducing unnecessary delays.
- Cleaned up imports by removing unused packages, improving code clarity and maintainability.

… performance
- Updated maxResults from 5000 to 10000 to enhance performance during artifact fetching.

- Replaced the retryablehttp.Client with a new GCS client for improved artifact handling (see the sketch below).
- Updated the Lister and Fetcher components to use the GCS client for fetching SHA1 files.
- Introduced parallel processing for artifact listing to enhance performance.
- Cleaned up the code by removing unused functions and optimizing the structure for better maintainability.
- Enhanced logging to provide clearer insights during the crawling process.
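The PR appears to implement its own GCS client over HTTP (hence the `maxResults` tuning elsewhere in this thread). For orientation, the equivalent listing with the official `cloud.google.com/go/storage` library looks roughly like this; the `maven-central` bucket name and `maven2/` prefix are assumptions about the public Maven Central mirror:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()
	// The mirror bucket is public, so no credentials are required.
	client, err := storage.NewClient(ctx, option.WithoutAuthentication())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// List objects (e.g. *.jar.sha1 files) under one group/artifact prefix.
	it := client.Bucket("maven-central").Objects(ctx, &storage.Query{
		Prefix: "maven2/abbot/abbot/",
	})
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(attrs.Name)
	}
}
```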
- Removed the wrongSHA1s slice and replaced it with an atomic error count for better performance and thread safety (see the sketch below).
- Updated the Fetcher to increment the error count when a SHA1 cannot be fetched.
- Adjusted logging to report the error count instead of listing wrong SHA1s, streamlining the output during the crawl process.
- Modified the GCS client to return "N/A" for invalid SHA1 formats, improving error handling.
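The counter change is the standard `sync/atomic` pattern; a minimal sketch with illustrative names:

```go
package crawler

import (
	"log"
	"sync/atomic"
)

// missingSHA1 is safe to increment concurrently from many Fetcher goroutines.
var missingSHA1 atomic.Int64

func recordFetchFailure() {
	missingSHA1.Add(1) // replaces appending to a shared wrongSHA1s slice
}

func logResult() {
	log.Printf("artifacts missing SHA1: %d", missingSHA1.Load())
}
```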
- Introduced a new limit parameter for the Lister to optimize the parallelism of the listing process based on the number of stored GAVs.
- Adjusted the limit for the Fetcher to ensure it operates within the new constraints, improving overall performance during the crawl process.
- Enhanced comments to clarify the logic behind the limit adjustments for better maintainability.
As we discussed before, GCS is missing a lot of artifacts that exist in Maven Central.
- Introduced a new hash package to encapsulate hashing logic for GroupID, ArtifactID, and Version (see the sketch below).
- Updated the Builder to process TSV files and handle version mismatches more effectively.
- Improved logging to provide better insights during the index building process.
- Removed unused hash functions from the Crawler, streamlining the codebase.
- Enhanced error handling for SHA1 decoding and record processing in the Builder.
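A speculative sketch of what such `GA`/`GAV` helpers could look like (the real `pkg/hash` functions may use a different hash function or encoding):

```go
package hash

import "hash/fnv"

// GA hashes a group/artifact pair, e.g. for assigning records to shards.
func GA(groupID, artifactID string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(groupID + ":" + artifactID))
	return h.Sum64()
}

// GAV hashes group/artifact/version, e.g. for deduplicating stored records.
func GAV(groupID, artifactID, version string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(groupID + ":" + artifactID + ":" + version))
	return h.Sum64()
}
```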
- Introduced a new script `compare_sqlite.sh` for comparing two SQLite databases.
- The script checks for the existence of the database files and compares the record counts in the artifacts table.
- It exports `group_id` and `artifact_id` for both databases, allowing for a detailed comparison of records.
- Added functionality to display records present in one database but not the other, enhancing data validation capabilities.
Pull Request Overview
This PR refactors the database building process to use the aquasecurity/maven-index repository with added sharding functionality for improved file distribution.
- Removed obsolete types and updated index-related types.
- Added new hash functions and file utilities, including an Exists method.
- Updated crawler, builder, and CLI modules to support the new maven-index approach and sharding.
Reviewed Changes
Copilot reviewed 10 out of 22 changed files in this pull request and generated 2 comments.
| File | Description |
| --- | --- |
| pkg/types/types.go | Removed the old Version type to streamline index definitions. |
| pkg/hash/hash.go | Added new GA and GAV functions for sharding and deduplication. |
| pkg/fileutil/file.go | Updated the WriteJSON signature and added a new Exists function. |
| pkg/crawler/types.go | Updated types and JSON tags for index and GCS responses. |
| pkg/crawler/gcs.go | Added comprehensive GCS client functionality, including pagination. |
| pkg/crawler/crawler_test.go | Adjusted tests for crawler functionality with updated file paths and names. |
| pkg/builder/builder.go | Revised builder logic to process TSV files and improved logging. |
| cmd/trivy-java-db/main.go | Introduced new CLI flags for the index directory and shard count. |
Files not reviewed (12)
- go.mod: Language not supported
- misc/compare_sqlite.sh: Language not supported
- pkg/crawler/testdata/happy/abbot-with-db.json.golden: Language not supported
- pkg/crawler/testdata/happy/abbot-with-index.tsv.golden: Language not supported
- pkg/crawler/testdata/happy/abbot.json.golden: Language not supported
- pkg/crawler/testdata/happy/abbot.tsv.golden: Language not supported
- pkg/crawler/testdata/happy/abbot_abbot.html: Language not supported
- pkg/crawler/testdata/happy/abbot_abbot_0.12.3.html: Language not supported
- pkg/crawler/testdata/happy/abbot_abbot_0.13.0.html: Language not supported
- pkg/crawler/testdata/happy/abbot_abbot_1.4.0.html: Language not supported
- pkg/crawler/testdata/happy/index.html: Language not supported
- pkg/crawler/testdata/happy/maven-metadata.xml: Language not supported
Comments suppressed due to low confidence (1)
pkg/crawler/types.go:27
- [nitpick] The JSON tag "1" is non-descriptive; consider using a more meaningful tag name to improve clarity and maintainability.
```go
SHA1 string `json:"1"`
```
NOTE: To view a TSV file in the GitHub UI, the file size needs to be less than 512 KB, which means the shard count must be greater than 8192.
- Removed the check for the HTTP 404 status code in the GCS getObject method, simplifying error handling.
- This change streamlines the code by eliminating redundant error messages for non-existent objects.

- Changed the variable name from `indexDir` to `centralIndexDir` for clarity.
- Updated the `b.Build` method to use the new `centralIndexDir` variable, ensuring the correct path is used during the database build process.
@knqyf263 I left a comment.
What about the Makefile and workflows?
I think we need to update them too.
- Changed the log message to report the count of artifacts missing SHA1 instead of errors during the crawl process.
- This adjustment improves clarity in the logging output, providing better insight into the crawl results.
- Changed the error assignment in the loadExistingIndexes method from a declaration to an assignment for consistency.
- Added a note in the `processPrefix` method to clarify that the return value 'processed' is preserved for logging purposes.
- Updated the Go version from 1.23.0 to 1.24.
- Removed several indirect dependencies related to Google Cloud.
- Added new indirect dependencies, including `github.com/google/go-cmp` and `github.com/kr/pretty`.
- Updated the `golang.org/x/mod` version from 0.17.0 to 0.18.0.
- Adjusted `go.sum` to reflect the dependency changes.

- Deleted the `Metadata` and `Versioning` structs from `types.go` as they are no longer used in the codebase.
- This change simplifies the code and reduces clutter, improving maintainability.

- Reduced the maximum results from 10,000 to 5,000 in the JARSHA1Files method to comply with the GCS API limit.
- This ensures that requests do not exceed the maximum allowed results, preventing potential errors during item listing.

…update SHA1 handling
- Introduced a new `index.Reader` to handle CSV reading with tab delimiters, improving performance and code clarity (see the sketch below).
- Updated the `Build` method in the `builder` package to use the new `index.NewReader` for reading records.
- Modified SHA1 handling in the `crawler` package to use `index.NotAvailable` instead of the string "N/A", ensuring consistency across the codebase.
- Removed unused CSV reading logic from the `loadExistingIndexes` method, streamlining the code.
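`index.NewReader` presumably wraps the standard `encoding/csv` reader with a tab delimiter; a sketch under that assumption, with the `NotAvailable` constant named in the commit above:

```go
package index

import (
	"encoding/csv"
	"io"
)

// NotAvailable marks artifacts whose SHA1 could not be fetched.
const NotAvailable = "N/A"

// NewReader returns a CSV reader configured for tab-separated values.
func NewReader(r io.Reader) *csv.Reader {
	cr := csv.NewReader(r)
	cr.Comma = '\t'         // TSV instead of comma-separated
	cr.FieldsPerRecord = -1 // don't enforce a fixed column count
	cr.ReuseRecord = true   // reduce allocations when streaming large files
	return cr
}
```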
pkg/crawler/crawler.go (Outdated)
```go
groupID, artifactID, versionDir := record[0], record[1], record[2]

// Hash GAV and add to map
gavHash := hash.GAV(groupID, artifactID, versionDir)
```
Do we also need to check `version` here?
IIUC the assumption is: if a dir was saved (in the TSV file), then the file contains all versions from that dir.
But it seems there may be a case where we save only one version and then hit an error (e.g. internet access dropped).
This is a very rare case, though, so we can live with it.
On the other hand, trivy-java-db contains a lot of artifacts, so we might never notice it.
You can simulate this by saving only `1.4.0` in the test:

trivy-java-db/pkg/crawler/crawler_test.go, line 117 in 35bcd9a:

```go
indexContent := "abbot\tabbot\t1.4.0\t-\t0000000000000000000000000000000000000000\nabbot\tabbot\t1.4.0\t1.4.0-lite\t0000000000000000000000000000000000000000\n"
```
> IIUC the assumption is: if a dir was saved (in the TSV file), then the file contains all versions from that dir.
For example, if a new version is released, I believe we still need to obtain the SHA-1 hash for that new version, even if the artifact is already listed in the TSV. Am I overlooking something?
I meant the case when a version dir contains more than one artifact, e.g. the `1.4.0` version dir contains:
- 1.4.0
- 1.4.0-lite

We will use the same hash for both versions. But if the TSV file contains only `1.4.0`, we will not add `1.4.0-lite`, because `storedGAVs` already contains the hash for this artifact.
Changing the topic slightly, while looking through the contents of the index, I realized that I had probably been mistaken. The string appended after the version is likely what's generally referred to as a "classifier." In other words, if the classifier is null, it would be something like 1.4.0, and if the classifier is "lite", it would be 1.4.0-lite.
Considering that, I believe your point is that we should also take the classifier into account when making comparisons, right? The answer to that is YES — you are correct. I will update the code to refer to it as a classifier first, and then change the hash function.
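In hash terms, the fix amounts to feeding the classifier into the dedup key. A sketch only; the actual change landed in 751fe62 and may differ:

```go
package hash

import "hash/fnv"

// GAV includes the classifier, so version "1.4.0" with classifier "" and
// version "1.4.0" with classifier "lite" no longer collide in storedGAVs.
func GAV(groupID, artifactID, version, classifier string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(groupID + ":" + artifactID + ":" + version + ":" + classifier))
	return h.Sum64()
}
```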
Updated: 751fe62
BTW, if I understand correctly, "1.4.0-lite" has version "1.4.0" and the classifier is "lite", so I suppose it would be more accurate to detect the package as version 1.4.0. What do you think?
Ideally, the classifier should be added to the Java database, and Trivy should include the classifier information in the package data, but that would require some changes. So I’d like to discuss a temporary solution for now.
I think it's about the same.
But I think we don't need to separate version and classifier now:
- Users also don't know what is correct (I mean version or version + classifier).
- Developers can use both logics when adding a classifier.
- maven/purl/etc. don't have information about classifiers, so it can be confusing.
- No users asked about it, so I think versions with classifiers are rare.
Actually, PURL has a `classifier` qualifier:

pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1?type=zip&classifier=dist

https://github.com/package-url/purl-spec/blob/main/PURL-TYPES.rst#maven
I need to double-check my knowledge more often 😅
You are right.
Purl supports classifiers.
Maven also supports them - https://maven.apache.org/plugins/maven-deploy-plugin/examples/deploying-with-classifiers.html
So, after double-checking the information about classifiers, I think we can add them.
But I suggest asking the community about it.
It might be unnecessary work.
One more small thought: after adding classifiers, users may ask us to add all artifacts with classifiers (so we would need to add artifacts with `source`, `test`, etc. classifiers).
- Modified the `Record` struct in the `crawler` package to replace `VersionDir` with `Classifier`, allowing for better classification of artifacts.
- Updated the `Build` method in the `builder` package to handle records with classifiers, ensuring proper indexing and storage.
- Adjusted the `GAV` hashing function to include classifiers for deduplication, enhancing the accuracy of artifact identification.
- Updated test cases and golden files to reflect changes in the index format, ensuring consistency across the codebase.
pkg/types/types.go (Outdated)
```diff
@@ -14,11 +14,7 @@ type Index struct {
 	GroupID    string
 	ArtifactID string
 	Version    string
 	Classifier string
```
You don't use this field to build the DB.

Line 126 in 793673c:

```go
index.GroupID, index.ArtifactID, index.Version, index.SHA1, index.ArchiveType)
```
Added a comment: 72bbd7b
We must not forget that until we add this, the database will not include new artifacts with classifiers.
```go
	t.Fatalf("failed to create index dir: %v", err)
}
// The digest is intentionally set to all zeros to check that the record is skipped and not updated.
indexContent := "abbot\tabbot\t1.4.0\t\t0000000000000000000000000000000000000000\nabbot\tabbot\t1.4.0\tlite\t0000000000000000000000000000000000000000\n"
```
Let's keep only `1.4.0` here.
This is required to check that we handle the case correctly where the TSV file already contains an index and we get another index with the same version but a different classifier from GCS.

Suggested change:

```diff
-indexContent := "abbot\tabbot\t1.4.0\t\t0000000000000000000000000000000000000000\nabbot\tabbot\t1.4.0\tlite\t0000000000000000000000000000000000000000\n"
+indexContent := "abbot\tabbot\t1.4.0\t\t0000000000000000000000000000000000000000\n"
```
One more comment, @knqyf263.
I'll update it later.
Co-authored-by: DmitriyLewen <[email protected]>
Co-authored-by: DmitriyLewen <[email protected]>
- Updated the `Classifier` field in the `Index` struct within `types.go` to include a TODO comment indicating that it is not currently used but retained for future use.
- Reduced the maximum results from 10,000 to 5,000 in the TopLevelPrefixes method to comply with the GCS API limit.
- This ensures that requests do not exceed the maximum allowed results, preventing potential errors during item listing.

- Changed the comment from "Close all writers and files when done" to "Flush all writers and close all files when done" for clarity.
- Removed the redundant final flush of CSV writers and buffered writers, as it is no longer necessary.
LGTM
Did you collect artifacts after these changes?
Yes, I did, but I found another problem. While I see a lot of artifacts under https://repo.maven.apache.org/maven2/data … Do you know anything?

I've changed it to skip …

I didn't know anything about the … But it looks like GCS contains duplicates in this directory: …

But …

Yes, I also observed it, which is really weird...
Overview
This PR changes the database building process to utilize the aquasecurity/maven-index repository instead of creating a database directly. It also adds sharding functionality to distribute files evenly across the filesystem, preventing performance issues caused by too many files in a repository.

Description

The main changes in this PR:

- … (`aquasecurity/maven-index`)
- `--index-dir`: Specify the directory where Maven index data is stored
- `--shards`: Configure the number of shards (default: 256)

This approach solves two issues: …
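For reference, a hypothetical invocation with the new flags; the `crawl` and `build` subcommand names follow the existing CLI and may differ:

```sh
# Crawl Maven Central and write sharded TSV indexes into the index repository.
trivy-java-db crawl --index-dir ./maven-index --shards 256

# Build the SQLite database from the sharded index repository.
trivy-java-db build --index-dir ./maven-index
```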
Test run
https://github.com/aquasecurity/maven-index/actions/runs/14766996699