Skip to content

feat(crawler): crawl maven central index #58

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 24 commits into from
Closed

Conversation

knqyf263
Copy link
Collaborator

@knqyf263 knqyf263 commented May 1, 2025

Overview

This PR enhances the trivy-java-db crawler by introducing a source type abstraction that supports multiple artifact sources. It adds support for Maven Central Index as an additional source to complement the existing GCS-based crawling.

Description

Maven artifact crawling is a critical part of the trivy-java-db functionality. Currently, we rely solely on Google Cloud Storage (GCS) which mirrors Maven Central. However, our analysis has identified gaps in the coverage:

Therefore, we cannot rely on a single source. To improve coverage, we use both GCS and Maven Central Index. While the combination of GCS and Central Index still won't cover 100% of Maven Central artifacts, this approach improves our coverage for sure. The remaining limitations are accepted as constraints in OSS.

Implementation

This PR introduces a pluggable source architecture that allows crawling from both GCS and Maven Central Index:

  1. Added a SourceType abstraction to support different artifact sources
  2. Refactored the crawler to use a common interface for different sources
  3. Implemented initial structure for Maven Central Index support
  4. Added command-line flag (--source) to specify the source type (gcs, central)

Related PRs

@knqyf263 knqyf263 requested a review from Copilot May 1, 2025 07:15
@knqyf263 knqyf263 self-assigned this May 1, 2025
@knqyf263 knqyf263 changed the title feat(crawler): crawl central index feat(crawler): crawl maven central index May 1, 2025
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a pluggable source architecture for crawling Maven artifacts by adding support for both GCS and Maven Central Index. Key changes include:

  • Defining a new Source interface and Record type to standardize artifact retrieval.
  • Refactoring the existing GCS crawler components and removing legacy types to support the new design.
  • Implementing a new Maven Central Index source and updating the crawler pipeline and CLI flags accordingly.

Reviewed Changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/crawler/types/types.go Introduces the new Source interface and Record struct to abstract artifact retrieval.
pkg/crawler/types.go Removes the legacy types previously used for GCS-specific crawling.
pkg/crawler/gcs/lister.go Implements the parallel GCS lister with concurrent prefix processing.
pkg/crawler/gcs/gcs.go Refactors the GCS client into a more generic Client with updated methods and baseURL fallback.
pkg/crawler/gcs/fetcher.go Provides the logic for fetching SHA1 hashes from GCS while handling errors appropriately.
pkg/crawler/crawler_test.go Updates test cases to refer to the new gcs.ListResponse type and related changes.
pkg/crawler/crawler.go Refactors the crawling pipeline to support multiple source types via the new Source interface.
pkg/crawler/central/central.go Adds support for processing the Maven Central Index and integrating it into the crawling pipeline.
cmd/trivy-java-db/main.go Adds a new command-line flag to select the artifact source type and updates corresponding parameters.
Files not reviewed (1)
  • go.mod: Language not supported
Comments suppressed due to low confidence (1)

pkg/crawler/gcs/gcs.go:56

  • [nitpick] Add a brief comment explaining how cmp.Or selects the baseURL fallback to aid future maintainability.
baseURL:    cmp.Or(baseURL, gcsURL),

limit: opt.Limit,
shardCount: opt.Shard,
storedGAVs: make(map[uint64]struct{}),
mutex: sync.Mutex{},
sourceType: cmp.Or(opt.SourceType, SourceTypeGCS), // Default to GCS if not specified
Copy link
Preview

Copilot AI May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider replacing cmp.Or with an explicit defaulting logic (e.g., an if-check) for SourceType to improve clarity.

Suggested change
sourceType: cmp.Or(opt.SourceType, SourceTypeGCS), // Default to GCS if not specified
sourceType: func() types.SourceType {
if opt.SourceType == "" {
return SourceTypeGCS // Default to GCS if not specified
}
return opt.SourceType
}(),

Copilot uses AI. Check for mistakes.

GroupID: groupID,
ArtifactID: artifactID,
VersionDir: versionDir,
Version: "-", // Assume that version is the same as versionDir in Central Index
Copy link
Preview

Copilot AI May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Document the rationale behind defaulting Version to '-' more explicitly or consider deriving it from available metadata if appropriate.

Suggested change
Version: "-", // Assume that version is the same as versionDir in Central Index
Version: deriveVersion(versionDir), // Derive version from versionDir or fallback to "-"

Copilot uses AI. Check for mistakes.

@knqyf263
Copy link
Collaborator Author

It seems that I didn’t fully understand the contents of the central index. I thought this artifact didn’t exist in the index and we needed to combine GCS, but it turns out that was because I was filtering by packaging type "jar". Upon closer inspection of the index, I found that the GAV does exist, but the packaging is set to "bundle". A JAR is still generated even when the packaging is "bundle".

In other words, although we went around in circles, by examining the index, we now see that there’s a good chance we can accurately retrieve artifact information from Maven Central.

grep "org.wso2.carbon.identity.entitlement.ui" central-index-gavc.tsv | head -n 5
org.wso2.carbon.identity.framework	org.wso2.carbon.identity.entitlement.ui	5.9.0	bundle		da4520d9be9a4e693a6993ce3c80447090327b08
org.wso2.carbon.identity.framework	org.wso2.carbon.identity.entitlement.ui	5.8.155	bundle		46dc746ae0381a43111ad9b1409bfb480bd33a2d
org.wso2.carbon.identity.framework	org.wso2.carbon.identity.entitlement.ui	5.8.154	bundle		f80c898c35cc4988178bf8f063015eab82f9eddd
org.wso2.carbon.identity.framework	org.wso2.carbon.identity.entitlement.ui	5.8.153	bundle		962b0e42861af8cf188174a70f96b78626cdee09
org.wso2.carbon.identity.framework	org.wso2.carbon.identity.entitlement.ui	5.8.152	bundle		542ee64e3e7f14e8ab78082649724cdcdddea5a6
<packaging>bundle</packaging>

https://repo.maven.apache.org/maven2/org/wso2/carbon/identity/framework/org.wso2.carbon.identity.entitlement.ui/5.25.44/org.wso2.carbon.identity.entitlement.ui-5.25.44.pom

knqyf263 added 5 commits May 15, 2025 15:16
- Added a new `sourceType` flag to specify the source type (GCS or Central) for the artifact crawling process.
- Updated the `Crawler` to handle different source types, allowing for more flexible artifact retrieval.
- Refactored the `crawlWithPipeline` method to create sources based on the specified type, enhancing the modularity of the code.
- Introduced a new `central` package to manage Maven Central Index interactions, improving the overall architecture.
- Enhanced the GCS client and related components to support the new structure, ensuring compatibility with existing functionality.
- Added cache directory support in the Crawler for storing the Maven Central Index.
- Updated the New function in the central package to accept a cache directory parameter and ensure its existence.
- Implemented downloadIndex method to handle downloading and caching of the Maven Central Index.
- Improved error handling for index file retrieval and processing, ensuring robustness in the crawling process.
- Updated the Crawler struct to rename `storedGAVs` to `storedGAVCs`, reflecting the change to store GroupID, ArtifactID, Version, and Classifier hashes.
- Adjusted the New function in the Crawler to initialize `storedGAVCs` accordingly.
- Modified the central package to utilize `storedGAVCs` for consistent artifact processing.
- Enhanced comments to clarify the purpose of the storedGAVCs map in preventing duplicate processing.
- Improved error logging in the Builder's Build method to include additional context (GroupID, ArtifactID, Version, Classifier) when SHA1 decoding fails.
- Added validation for SHA1 hashes in the Crawler's Read method to ensure they are correctly formatted and of the expected length, enhancing data integrity checks.
@knqyf263
Copy link
Collaborator Author

knqyf263 commented May 15, 2025

So, if I understand correctly, obtaining the SHA1 can be done using only the Central Index, and GCS is not necessary. The Central Index is updated once a week, but I believe this delay is acceptable. Delays of over a week can occur, but I think that’s still acceptable for an OSS project.

I tried updating the maven-index using only the Central Index. And build the database from this index.

$ trivy-java-db build --index-dir /path/to/maven-index/

The database generated using this maven-index showed no significant differences from our current one. Some artifacts (example) were missing, but they’re artifacts that don’t contain any deployable JARs (their classifier is like javadoc or test), so it’s correct that they were omitted

@DmitriyLewen Could you double-check as well to ensure I haven’t overlooked anything?

@DmitriyLewen
Copy link
Collaborator

Before i start checking:
We discussed using index archives, but saw that the update periods are unstable - #52 (comment)
Is it stable now? Or are these just errors with dates?

@knqyf263
Copy link
Collaborator Author

There may have been a mistake in the comparison method. I'm currently double-checking it, so please wait a moment.

Also, regarding the update frequency, it appears that updates are currently being made at least once every two weeks.

@knqyf263
Copy link
Collaborator Author

I was mistaken. It turns out that the Central Index indeed does not include the SHA1 digest for some artifact JARs.

For example, this sha1 is missing.

grep "ai.ancf.lmos-router" central-index-gavc.tsv | grep benchmarks
ai.ancf.lmos-router	benchmarks	0.2.0		pom.sha512	
ai.ancf.lmos-router	benchmarks	0.2.0	sources	jar.sha512	
ai.ancf.lmos-router	benchmarks	0.2.0	sources	jar.sha256	
ai.ancf.lmos-router	benchmarks	0.2.0	sources	jar.asc.sha512	
ai.ancf.lmos-router	benchmarks	0.2.0	sources	jar.asc.sha256	
ai.ancf.lmos-router	benchmarks	0.2.0	sources	jar	94d3d33aafc21b11386131ab469e4ef0c58c16eb
ai.ancf.lmos-router	benchmarks	0.2.0	javadoc	jar.sha512	
ai.ancf.lmos-router	benchmarks	0.2.0	javadoc	jar.sha256	
ai.ancf.lmos-router	benchmarks	0.2.0	javadoc	jar.asc.sha512	
ai.ancf.lmos-router	benchmarks	0.2.0	javadoc	jar.asc.sha256	
ai.ancf.lmos-router	benchmarks	0.2.0	javadoc	jar	53886a59b8ed977c1088698468b09b1c02c38b8c
ai.ancf.lmos-router	benchmarks	0.28.0		pom.sha512	
ai.ancf.lmos-router	benchmarks	0.28.0	sources	jar.sha512	
ai.ancf.lmos-router	benchmarks	0.28.0	sources	jar.sha256	
ai.ancf.lmos-router	benchmarks	0.28.0	sources	jar.asc.sha512	
ai.ancf.lmos-router	benchmarks	0.28.0	sources	jar.asc.sha256	
ai.ancf.lmos-router	benchmarks	0.28.0	sources	jar	3a6c48b0296576e0a945c7d89f33ed9897cd196f
ai.ancf.lmos-router	benchmarks	0.28.0	javadoc	jar.sha512	
ai.ancf.lmos-router	benchmarks	0.28.0	javadoc	jar.sha256	
ai.ancf.lmos-router	benchmarks	0.28.0	javadoc	jar.asc.sha512	
ai.ancf.lmos-router	benchmarks	0.28.0	javadoc	jar.asc.sha256	
ai.ancf.lmos-router	benchmarks	0.28.0	javadoc	jar	ab2fa7e53dc7127d79595612c4ea25ad745d01fd

@DmitriyLewen
Copy link
Collaborator

0.28.0 artifacts are 2 months old (according to the dates from Maven Central)...
https://repo.maven.apache.org/maven2/ai/ancf/lmos-router/benchmarks/0.28.0/

But i have sad news:
GSC also doesn't contain these artifacts:

➜ curl "https://storage.googleapis.com/storage/v1/b/maven-central/o?prefix=maven2/ai/ancf/&maxResults=5000"
{
  "kind": "storage#objects"
}

knqyf263 added 5 commits May 16, 2025 12:30
- Removed redundant SHA1 handling code from the GCS client.
- Introduced a new `sha1` package to encapsulate SHA1 parsing logic.
- Enhanced the `FetchSHA1` method to utilize the new `sha1.Parse` function for improved clarity and maintainability.
- Updated dependencies in go.mod and go.sum for consistency.
- Added functionality to fetch missing SHA1s directly from Maven Central for records without classifiers.
- Introduced a new map to track missing SHA1s and updated the Read method to handle records accordingly.
- Enhanced error handling and logging during the fetching process, including status code checks and logging of fetched counts.
- Improved the overall structure of the Source type to accommodate the new fetching logic, ensuring robust artifact processing.
- Added a condition to skip the "maven2/data/" directory during the listing process in the GCS lister.
- This change prevents unnecessary processing of a directory that is not present in Maven Central, improving efficiency in the crawling operation.
- Introduced a new `maven` package with a `ValidateClassifier` function to encapsulate classifier validation logic.
- Updated the `read` method in the `Source` type to utilize the new validation function, improving clarity and maintainability.
- Enhanced the `fetch` method in the `Fetcher` to include version validation alongside groupID and artifactID checks, ensuring robust artifact processing.
- Updated the `read` method in the `Source` type to improve artifact type validation.
- Removed the check for `fileExtension` being "jar" from the initial condition, allowing for more flexible handling of artifact types.
- Added a separate check for `fileExtension` to ensure only records with the "jar" extension are processed, enhancing clarity and maintainability of the code.
@knqyf263
Copy link
Collaborator Author

I have modified the code to query Maven Central only when the SHA1 is not found in the Central Index. This significantly reduces the number of HTTP requests. Additionally, it has improved the coverage. When comparing the current database using this script, I found only 46 differences in GAVs.

Furthermore, it seems that none of these 46 cases are valid GAVs. Details are as follows:

  1. GAs that have already been deleted
  1. GAVs that have already been deleted
  1. JARs stored under different versions
  1. Empty directory

@knqyf263
Copy link
Collaborator Author

Very bad news... the index seems to contain wrong values 😭

net.itarray	automotion	1.4.1		jar	aa50e73ec13a2140d990e4684ff0822b84caa962
net.itarray	automotion	1.4.1	sources	jar	a114f4a08a9fdf6535923dcbc43685cc427e3a7a
net.itarray	automotion	1.4.1	javadoc	jar	92046e9b8e69e86e7c1c37210211401ce0bf763a

The index holds aa50e73ec13a2140d990e4684ff0822b84caa962, but it should be 8293e469853b8fbff6ab842cbd07400c4831b94b
https://repo.maven.apache.org/maven2/net/itarray/automotion/1.4.1/automotion-1.4.1.jar.sha1

@DmitriyLewen
Copy link
Collaborator

this is very bad 😭
If we could consider the absence of some artifacts acceptable for OSS (as we said about GCS), then incorrect digests are definitely unacceptable...

@DmitriyLewen
Copy link
Collaborator

DmitriyLewen commented May 19, 2025

We can try to combinate 3 sources:

  1. We will take artifacts from indexes and GCS
  2. take digests for these artifacts from GCS and maven central (if GCS doesn't have this artifact)

But i am still not sure about indexes.
We don't know what other errors might be there...

knqyf263 added 2 commits May 19, 2025 16:14
…ndexes

- Eliminated the redundant check for classifier being "-" in the loadExistingIndexes method.
- This change simplifies the code by directly assigning an empty string to classifier when it is not provided, enhancing clarity and maintainability.
…tral Index

- Updated the `Source` type to enhance the handling of GAVC records and SHA1 fetching.
- Removed reliance on unreliable SHA1 values from the central index, implementing a two-step process to ensure accurate SHA1 retrieval directly from Maven Central.
- Introduced a new `records` slice to store all GAVC records, improving clarity and maintainability.
- Enhanced logging to reflect the number of records processed and fetched, ensuring better tracking of operations.
- Refined the `fetchSHA1s` method to process records more efficiently, improving overall performance.
@knqyf263
Copy link
Collaborator Author

I’m currently planning this as a two-phase process.

  1. Crawl GCS and collect as many SHA1 hashes as possible
  2. Read the Central Index and produce only a list of GAVC tuples (group ID + artifact ID + version + classifier). At that stage, we don’t use any SHA1 values, since they can’t be fully trusted. Then remove from this list any GAVCs for which a SHA1 has already been retrieved from GCS, and for the remaining GAVCs we construct the corresponding Maven Central URLs and fetch their SHA1 hashes over HTTP.

Although dropping the index-based SHA1 lookups increases the number of requests sent to Maven Central, it saves requests when building the list, and—because it’s a differential update—already-stored artifacts are skipped.

I’ve implemented this strategy, so I’ll run it again on GitHub Actions.

knqyf263 added 2 commits May 20, 2025 11:21
- Added a case to handle HTTP 403 Forbidden status in the fetchSHA1s method.
- This change improves error handling by explicitly logging when access to a resource is forbidden, enhancing clarity in the fetching process.
- Improved error handling in the fetchSHA1s method by returning errors instead of skipping them on HTTP request failures.
- Added logging for the number of fetched SHA1s every 10,000 records processed, providing better visibility into the fetching process.
- Enhanced logging for failed SHA1 content reads, including the URL and error details, to aid in debugging and monitoring.
knqyf263 added 4 commits May 20, 2025 11:30
- Updated the fetchSHA1s method to return errors instead of skipping them on HTTP request failures, enhancing error visibility.
- Added logging for the number of errors encountered during SHA1 fetching, providing better insights into the fetching process.
- Enhanced logging in the crawlWithPipeline method to report the number of errors encountered during artifact processing.
- Enhanced the logic in the parseItemName function to better handle classifier extraction from filenames.
- Updated comments to clarify the handling of special cases where classifiers may not follow the expected format.
- This change improves the robustness of classifier parsing, ensuring that any remaining string is treated as a classifier when necessary.
- Introduced a new test file for the GCS fetcher to validate the parseItemName function.
- Added multiple test cases covering various scenarios, including normal cases, edge cases, and invalid inputs.
- This addition enhances test coverage and ensures the correctness of the classifier parsing logic.
- Introduced a new function `containsControlChar` to check for control characters in strings.
- Updated the `read` method to skip records with control characters in `GroupID`, `ArtifactID`, `Version`, or `Classifier`.
- Added unit tests for `containsControlChar` to ensure correct functionality across various input scenarios.
@knqyf263
Copy link
Collaborator Author

I want to believe this is some kind of bad joke, but I’ve found a case where the SHA1 values differ between Maven Central and GCS.

# Maven Central
$ curl https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/3.1.3/onfido-api-client-3.1.3.jar.sha1
941c5572bc8fbe9203745c9eca4a0fa4d4ee0fbb%                                                                                                  

# GCS
$ curl "https://storage.googleapis.com/download/storage/v1/b/maven-central/o/maven2%2Fcom%2Fonfido%2Fapi%2Fclient%2Fonfido-api-client%2F3.1.3%2Fonfido-api-client-3.1.3.jar.sha1?generation=1617659797337662&alt=media"
2b0f02371aa9be4644ebbf65186ab6b4956b498f

It appears that the actual JAR files are different.

# Maven Central
$ curl -s https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/3.1.3/onfido-api-client-3.1.3.jar | sha1sum
941c5572bc8fbe9203745c9eca4a0fa4d4ee0fbb  -

# GCS
curl -s "https://storage.googleapis.com/download/storage/v1/b/maven-central/o/maven2%2Fcom%2Fonfido%2Fapi%2Fclient%2Fonfido-api-client%2F3.1.3%2Fonfido-api-client-3.1.3.jar?generation=1617659792993892&alt=media" | sha1sum
2b0f02371aa9be4644ebbf65186ab6b4956b498f  -

This is just my guess, but considering that the last update on Maven Central was April 7, 2021, and the last update on GCS was April 5, 2021, it’s possible that the JAR file on Maven Central was updated after it had already been synchronized to GCS. That update might not have been propagated to GCS, leaving the older artifact there.

    {
      "kind": "storage#object",
      "id": "maven-central/maven2/com/onfido/api/client/onfido-api-client/3.1.3/onfido-api-client-3.1.3.jar/1617659792993892",
      "selfLink": "https://www.googleapis.com/storage/v1/b/maven-central/o/maven2%2Fcom%2Fonfido%2Fapi%2Fclient%2Fonfido-api-client%2F3.1.3%2Fonfido-api-client-3.1.3.jar",
      "mediaLink": "https://storage.googleapis.com/download/storage/v1/b/maven-central/o/maven2%2Fcom%2Fonfido%2Fapi%2Fclient%2Fonfido-api-client%2F3.1.3%2Fonfido-api-client-3.1.3.jar?generation=1617659792993892&alt=media",
      "name": "maven2/com/onfido/api/client/onfido-api-client/3.1.3/onfido-api-client-3.1.3.jar",
      "bucket": "maven-central",
      "generation": "1617659792993892",
      "metageneration": "1",
      "contentType": "application/java-archive",
      "storageClass": "STANDARD",
      "size": "33",
      "md5Hash": "jMpWQ/+3GCzKnC4VS0TLXg==",
      "contentLanguage": "en",
      "crc32c": "52QnRw==",
      "etag": "COTc0KqM6O8CEAE=",
      "timeCreated": "2021-04-05T21:56:33.026Z",
      "updated": "2021-04-05T21:56:33.026Z",
      "timeStorageClassUpdated": "2021-04-05T21:56:33.026Z",
      "timeFinalized": "2021-04-05T21:56:33.026Z",
      "metadata": {
        "checksum-sha1": "2b0f02371aa9be4644ebbf65186ab6b4956b498f",
        "last-modified-epoch": "1617650729000",
        "checksum-md5": "8cca5643ffb7182cca9c2e154b44cb5e",
        "last-modified": "Mon, 05 Apr 2021 19:25:29 GMT"
      }
    },

@knqyf263
Copy link
Collaborator Author

knqyf263 commented May 20, 2025

It looks like there are about 100 such cases, assuming my comparison script is correct (I'll double-check how many of these cases there actually are). Out of 16 million artifacts, I believe having around 100 cases is within an acceptable range. I think it's fine to ignore them, but if they are particularly important, we could also consider fixing them manually.

- Removed the `records` slice from the `Source` type to streamline record handling.
- Implemented a channel-based pipeline for processing records, enhancing concurrency and clarity.
- Introduced `readFromIndex`, `parseRecords`, and `fetchSHA1s` methods to separate concerns and improve maintainability.
- Enhanced error handling and logging throughout the pipeline stages for better visibility.
- Updated the `validateRecord` method to encapsulate record validation logic, ensuring cleaner code.
@knqyf263
Copy link
Collaborator Author

If my script works correctly, there are only 113 cases where SHA1 is wrong in GCS. I would accept these edge cases as I saw many wrong values in the current database.

[SHA1 diff: SHA1 mismatch] GAVs with different SHA1:
com.onfido.api.client|onfido-api-client|4.0.0	SQLite SHA1: b9797437bf7b564b9b70227ea48e3a295704ece3	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/4.0.0/
de.knightsoft-net|mt-bean-validators|0.22.2	SQLite SHA1: 30df790d183f3376ad38130988d16bfb882f6bad	TSV SHA1: b0a86483d70eb2922a256e0563110f03b949dcf5 @https://repo.maven.apache.org/maven2/de/knightsoft-net/mt-bean-validators/0.22.2/
com.tobedevoured.command|spring|0.3.2	SQLite SHA1: a159f2da20f64564cbf2a91a3e8c13d7a173844c	TSV SHA1: 912fd9c8d977421b9ed0448ca59ab8eb2fb7983e @https://repo.maven.apache.org/maven2/com/tobedevoured/command/spring/0.3.2/
com.onfido.api.client|onfido-api-client|3.1.4	SQLite SHA1: 4e0c83e92eacef88640cdad2c38e9aff7e972f14	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/3.1.4/
com.17jee|e-core-template-freemarker|3.0.1.1	SQLite SHA1: 7eaeb96adef187ef7d1016e573092169957651d5	TSV SHA1: ddda3636db969962edae594ffc0008acb4e2c7b1 @https://repo.maven.apache.org/maven2/com/17jee/e-core-template-freemarker/3.0.1.1/
com.onfido.api.client|onfido-api-client|0.3.3	SQLite SHA1: b6d3f3fca67e34e6f04def0442e0bef6cbc5dae1	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/0.3.3/
de.knightsoft-net|gwt-mt-widgets|0.22.2	SQLite SHA1: 7fb541f8eac7b717608fac86450542e6e235b242	TSV SHA1: e267ef2160e012b664d6f3d5db25e4348df939b7 @https://repo.maven.apache.org/maven2/de/knightsoft-net/gwt-mt-widgets/0.22.2/
com.onfido.api.client|onfido-api-client|1.3.0	SQLite SHA1: 220ea7203349616a6cc9d257a7e658c8ef0821dc	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/1.3.0/
com.onfido.api.client|onfido-api-client|5.3.1	SQLite SHA1: 418314bce104ace9ad2001924e155c234cb7d2c2	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/5.3.1/
com.onfido.api.client|onfido-api-client|5.5.0	SQLite SHA1: 76ce90918154892fe44a2b200c6b2b8d5f3ee030	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/5.5.0/
de.knightsoft-net|gwt-bean-validators-spring-gwtp|0.22.2	SQLite SHA1: 075cc6a35b0542400dc95076d78383f31f87ebf8	TSV SHA1: aeb2d6ea968ef21e3a022eeedaf74684c00463cc @https://repo.maven.apache.org/maven2/de/knightsoft-net/gwt-bean-validators-spring-gwtp/0.22.2/
com.onfido.api.client|onfido-api-client|7.3.0	SQLite SHA1: c4dfb62eb73b499991ad7b86280c5f520210b569	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/7.3.0/
com.onfido.api.client|onfido-api-client|0.4.0	SQLite SHA1: ae3b921a60b62f230879bc680716de225242ebd3	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/0.4.0/
com.onfido.api.client|onfido-api-client|0.3.2	SQLite SHA1: 8b51345fe6335e55680cbc4dab000e00741739a9	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/0.3.2/
com.smartdevicelink|sdl_java_ee|4.12.2	SQLite SHA1: 8ee44885441ecad96700d680b9dd984f36524fd5	TSV SHA1: 21a7e60bb893d167f9b9a0f95323d32eade07190 @https://repo.maven.apache.org/maven2/com/smartdevicelink/sdl_java_ee/4.12.2/
com.indix.api|indix-api-java|3.0.0	SQLite SHA1: fb5cdf064a215d38db1f4445f13e4cdf72466036	TSV SHA1: bce804b40aa001d12855fc90516e122a6dbaa883 @https://repo.maven.apache.org/maven2/com/indix/api/indix-api-java/3.0.0/
com.onfido.api.client|onfido-api-client|3.1.0	SQLite SHA1: 90c98cead3faea2c527014fc7ec12f6230ef0e5d	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/3.1.0/
org.sonatype.install4j|install4j-slf4j|1.0.8	SQLite SHA1: 1afe175f56ae49daff24d5a339533360f5b7aaf8	TSV SHA1: 49dc96882a139124309791ab80dcd587f3122550 @https://repo.maven.apache.org/maven2/org/sonatype/install4j/install4j-slf4j/1.0.8/
com.smartdevicelink|sdl_java_ee|5.0.0	SQLite SHA1: 02fb47876da3d52c1f2f6b69f989b50b866b0688	TSV SHA1: 057b0ab05a3c5e65d1b71777de415a518a1da0e7 @https://repo.maven.apache.org/maven2/com/smartdevicelink/sdl_java_ee/5.0.0/
com.17jee|e-log-ui|3.0.1.1	SQLite SHA1: 8f7eaa96c02396c96a77c912d7c335dac77b4bb0	TSV SHA1: 4d57476d4eedec9356d4213f3e60ecfcca9533e6 @https://repo.maven.apache.org/maven2/com/17jee/e-log-ui/3.0.1.1/
... and more
Total SHA1 diff: 113

@DmitriyLewen
Copy link
Collaborator

DmitriyLewen commented May 21, 2025

This is just my guess, but considering that the last update on Maven Central was April 7, 2021, and the last update on GCS was April 5, 2021, it’s possible that the JAR file on Maven Central was updated after it had already been synchronized to GCS. That update might not have been propagated to GCS, leaving the older artifact there.

Most likely, you are right.
But then that means maven central is not immutable...

But in any case, I agree with you. 113 artifacts out of 16 million is an acceptable deviation.

@DmitriyLewen
Copy link
Collaborator

@knqyf263 I see you're still working on this PR.

Let me know when I can start reviewing it.

@knqyf263
Copy link
Collaborator Author

Yes, I hope the job will complete successfully, and this PR will be ready.

@knqyf263
Copy link
Collaborator Author

The job was successful. Also, I've checked maven-index and confirmed that all records except for a few edge cases in the current database are present in the TSV files.

One point to note is that when inserting records into the database, a different GAV with the same SHA1 digest might be inserted. Therefore, I compared the database with the TSV files, rather than comparing databases.

@DmitriyLewen You can start reviewing the PR. I didn't mark this as ready for review since this PR depends on #57.

groupPath := strings.ReplaceAll(rec.GroupID, ".", "/")

// Build jar name: artifactId-version[-classifier].jar
jarName := rec.ArtifactID + "-" + rec.Version + ".jar"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

classifiers are missed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a reason why I didn't add a classifier here, but I can't recall it...
Fixed in 69a2117

knqyf263 and others added 4 commits May 22, 2025 15:53
- Removed the `records` parameter from the `read` method call in the `Source` type.
- This change streamlines the method signature and improves clarity in the codebase.
- Updated the jar name construction logic to include the classifier if it is present.
- This change ensures that the jar name follows the format: artifactId-version[-classifier].jar, improving clarity and correctness in artifact naming.
@knqyf263 knqyf263 requested a review from DmitriyLewen May 22, 2025 14:12
Copy link
Collaborator

@DmitriyLewen DmitriyLewen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

But let's run Update Maven Index again to make sure the changes with classifiers didn't break anything.

@knqyf263
Copy link
Collaborator Author

Thanks for reviewing. Triggered.
https://github.com/aquasecurity/maven-index/actions/runs/15192742680/job/42729417839

@knqyf263
Copy link
Collaborator Author

I confirmed that the database could be built with the latest index.

@knqyf263
Copy link
Collaborator Author

I also tested that a number of detected components were the same between the current and the new DB.

$ trivy image jenkins:2.60.3 -q --format cyclonedx --cache-backend memory | jq '.components | length'
456
$ cp ~/Library/Caches/trivy-java-db/db/trivy-java.db ~/Library/Caches/trivy/java-db/
$ trivy image jenkins:2.60.3 -q --format cyclonedx --cache-backend memory | jq '.components | length'
456

@knqyf263 knqyf263 closed this May 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants