feat(crawler): crawl maven central index #58

knqyf263 · 2025-05-01T07:15:38Z

Overview

This PR enhances the trivy-java-db crawler by introducing a source type abstraction that supports multiple artifact sources. It adds support for Maven Central Index as an additional source to complement the existing GCS-based crawling.

Description

Maven artifact crawling is a critical part of the trivy-java-db functionality. Currently, we rely solely on Google Cloud Storage (GCS) which mirrors Maven Central. However, our analysis has identified gaps in the coverage:

- Approximately 20,000-30,000 artifacts (group-id + artifact-id combinations) are missing from GCS
  - When including different versions, the number of missing artifacts is significantly higher
- Conversely, Maven Central Index (which should be updated weekly) sometimes has long periods without updates
- Some SHA1 digests exist in GCS but are missing from Maven Central Index
  - e.g.
    - https://repo.maven.apache.org/maven2/activemq/activemq-ra/1.3/ (only rars are found in Maven Central Index)
    - https://repo.maven.apache.org/maven2/org/bsc/langgraph4j/langgraph4j-springai-agent-handoff/1.5.11/
- Some SHA1 digests exist in Maven Central Index but are missing from GCS
  - e.g. https://repo.maven.apache.org/maven2/io/streamnative/pulsar/handlers/kafka-client-api/3.2.2.1/
  - https://storage.googleapis.com/storage/v1/b/maven-central/o?prefix=maven2/io/streamnative/pulsar/handlers/kafka-client-api/3.2.2.1

Therefore, we cannot rely on a single source. To improve coverage, we use both GCS and Maven Central Index. The combination still won't cover 100% of Maven Central artifacts, but it covers more than either source alone. We accept the remaining gaps as a constraint of an OSS project.

Implementation

This PR introduces a pluggable source architecture that allows crawling from both GCS and Maven Central Index:

- Added a SourceType abstraction to support different artifact sources
- Refactored the crawler to use a common interface for different sources
- Implemented initial structure for Maven Central Index support
- Added command-line flag (--source) to specify the source type (gcs, central)
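
As a rough illustration of the pluggable design listed above, the sketch below shows one way a source abstraction could look. The `Record` fields, the `Read` signature, `staticSource`, and `newSource` are hypothetical and chosen for the example; only the `gcs` and `central` values come from the `--source` flag described here.

```go
package main

import (
	"context"
	"fmt"
)

// SourceType names an artifact source; the values mirror the --source flag.
type SourceType string

const (
	SourceTypeGCS     SourceType = "gcs"
	SourceTypeCentral SourceType = "central"
)

// Record is a hypothetical, minimal artifact record.
type Record struct {
	GroupID, ArtifactID, Version, Classifier, SHA1 string
}

// Source is a hypothetical common interface that every artifact source implements.
type Source interface {
	Read(ctx context.Context, out chan<- Record) error
}

// staticSource is a stand-in implementation that emits a fixed list of records.
type staticSource struct{ records []Record }

func (s staticSource) Read(ctx context.Context, out chan<- Record) error {
	for _, r := range s.records {
		select {
		case out <- r:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

// newSource picks an implementation for the requested type and rejects unknown types.
func newSource(t SourceType) (Source, error) {
	switch t {
	case SourceTypeGCS, SourceTypeCentral:
		// Both types map to the stand-in here; a real crawler would build distinct sources.
		return staticSource{records: []Record{{GroupID: "org.example", ArtifactID: "lib", Version: "1.0"}}}, nil
	default:
		return nil, fmt.Errorf("unknown source type: %q", t)
	}
}

func main() {
	src, err := newSource(SourceTypeCentral)
	if err != nil {
		panic(err)
	}
	out := make(chan Record, 1) // buffered so the single record fits without a reader
	if err := src.Read(context.Background(), out); err != nil {
		panic(err)
	}
	close(out)
	for r := range out {
		fmt.Printf("%s:%s:%s\n", r.GroupID, r.ArtifactID, r.Version)
	}
}
```

The point of the interface is that the crawler pipeline only depends on `Source`, so adding another source should not require touching the pipeline itself.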

Related PRs

feat: move to use maven-index repository with sharding support #57
- This PR is based on the maven-index branch

Copilot

Pull Request Overview

This PR introduces a pluggable source architecture for crawling Maven artifacts by adding support for both GCS and Maven Central Index. Key changes include:

Defining a new Source interface and Record type to standardize artifact retrieval.
Refactoring the existing GCS crawler components and removing legacy types to support the new design.
Implementing a new Maven Central Index source and updating the crawler pipeline and CLI flags accordingly.

Reviewed Changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.

File	Description
pkg/crawler/types/types.go	Introduces the new Source interface and Record struct to abstract artifact retrieval.
pkg/crawler/types.go	Removes the legacy types previously used for GCS-specific crawling.
pkg/crawler/gcs/lister.go	Implements the parallel GCS lister with concurrent prefix processing.
pkg/crawler/gcs/gcs.go	Refactors the GCS client into a more generic Client with updated methods and baseURL fallback.
pkg/crawler/gcs/fetcher.go	Provides the logic for fetching SHA1 hashes from GCS while handling errors appropriately.
pkg/crawler/crawler_test.go	Updates test cases to refer to the new gcs.ListResponse type and related changes.
pkg/crawler/crawler.go	Refactors the crawling pipeline to support multiple source types via the new Source interface.
pkg/crawler/central/central.go	Adds support for processing the Maven Central Index and integrating it into the crawling pipeline.
cmd/trivy-java-db/main.go	Adds a new command-line flag to select the artifact source type and updates corresponding parameters.

Files not reviewed (1)

go.mod: Language not supported

Comments suppressed due to low confidence (1)

pkg/crawler/gcs/gcs.go:56

[nitpick] Add a brief comment explaining how cmp.Or selects the baseURL fallback to aid future maintainability.

baseURL:    cmp.Or(baseURL, gcsURL),
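
For readers unfamiliar with it: `cmp.Or` (standard library, Go 1.22+) returns the first argument that is not the zero value, which for strings means the first non-empty one. A small sketch of the fallback next to the equivalent if-check follows; `defaultURL`, `pickBaseURL`, and `pickBaseURLExplicit` are hypothetical names used only for this example.

```go
package main

import (
	"cmp"
	"fmt"
)

// defaultURL is a hypothetical stand-in for the package-level default.
const defaultURL = "https://storage.googleapis.com/"

// pickBaseURL returns baseURL unless it is empty, in which case the default is used.
func pickBaseURL(baseURL string) string {
	// cmp.Or returns the first argument that is not the zero value ("" for strings).
	return cmp.Or(baseURL, defaultURL)
}

// pickBaseURLExplicit is the same logic written as an explicit if-check.
func pickBaseURLExplicit(baseURL string) string {
	if baseURL == "" {
		return defaultURL
	}
	return baseURL
}

func main() {
	fmt.Println(pickBaseURL(""))
	fmt.Println(pickBaseURL("http://localhost:8080/"))
}
```

Both forms behave the same for string-like types, so the choice between them is about readability rather than behavior.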

Copilot · 2025-05-01T07:16:31Z

pkg/crawler/crawler.go

 		limit:      opt.Limit,
 		shardCount: opt.Shard,
 		storedGAVs: make(map[uint64]struct{}),
 		mutex:      sync.Mutex{},
+		sourceType: cmp.Or(opt.SourceType, SourceTypeGCS), // Default to GCS if not specified


[nitpick] Consider replacing cmp.Or with an explicit defaulting logic (e.g., an if-check) for SourceType to improve clarity.

Suggested change

-		sourceType: cmp.Or(opt.SourceType, SourceTypeGCS), // Default to GCS if not specified
+		sourceType: func() types.SourceType {
+			if opt.SourceType == "" {
+				return SourceTypeGCS // Default to GCS if not specified
+			}
+			return opt.SourceType
+		}(),

Copilot · 2025-05-01T07:16:31Z

pkg/crawler/central/central.go

+			GroupID:    groupID,
+			ArtifactID: artifactID,
+			VersionDir: versionDir,
+			Version:    "-", // Assume that version is the same as versionDir in Central Index


[nitpick] Document the rationale behind defaulting Version to '-' more explicitly or consider deriving it from available metadata if appropriate.

Suggested change

-			Version:    "-", // Assume that version is the same as versionDir in Central Index
+			Version:    deriveVersion(versionDir), // Derive version from versionDir or fallback to "-"

knqyf263 · 2025-05-15T11:05:32Z

It seems that I didn’t fully understand the contents of the central index. I thought this artifact didn’t exist in the index and we needed to combine GCS, but it turns out that was because I was filtering by packaging type "jar". Upon closer inspection of the index, I found that the GAV does exist, but the packaging is set to "bundle". A JAR is still generated even when the packaging is "bundle".

In other words, although we went around in circles, by examining the index, we now see that there’s a good chance we can accurately retrieve artifact information from Maven Central.

grep "org.wso2.carbon.identity.entitlement.ui" central-index-gavc.tsv | head -n 5
org.wso2.carbon.identity.framework	org.wso2.carbon.identity.entitlement.ui	5.9.0	bundle		da4520d9be9a4e693a6993ce3c80447090327b08
org.wso2.carbon.identity.framework	org.wso2.carbon.identity.entitlement.ui	5.8.155	bundle		46dc746ae0381a43111ad9b1409bfb480bd33a2d
org.wso2.carbon.identity.framework	org.wso2.carbon.identity.entitlement.ui	5.8.154	bundle		f80c898c35cc4988178bf8f063015eab82f9eddd
org.wso2.carbon.identity.framework	org.wso2.carbon.identity.entitlement.ui	5.8.153	bundle		962b0e42861af8cf188174a70f96b78626cdee09
org.wso2.carbon.identity.framework	org.wso2.carbon.identity.entitlement.ui	5.8.152	bundle		542ee64e3e7f14e8ab78082649724cdcdddea5a6

<packaging>bundle</packaging>

https://repo.maven.apache.org/maven2/org/wso2/carbon/identity/framework/org.wso2.carbon.identity.entitlement.ui/5.25.44/org.wso2.carbon.identity.entitlement.ui-5.25.44.pom

- Added a new `sourceType` flag to specify the source type (GCS or Central) for the artifact crawling process.
- Updated the `Crawler` to handle different source types, allowing for more flexible artifact retrieval.
- Refactored the `crawlWithPipeline` method to create sources based on the specified type, enhancing the modularity of the code.
- Introduced a new `central` package to manage Maven Central Index interactions, improving the overall architecture.
- Enhanced the GCS client and related components to support the new structure, ensuring compatibility with existing functionality.

- Added cache directory support in the Crawler for storing the Maven Central Index.
- Updated the New function in the central package to accept a cache directory parameter and ensure its existence.
- Implemented downloadIndex method to handle downloading and caching of the Maven Central Index.
- Improved error handling for index file retrieval and processing, ensuring robustness in the crawling process.

- Updated the Crawler struct to rename `storedGAVs` to `storedGAVCs`, reflecting the change to store GroupID, ArtifactID, Version, and Classifier hashes.
- Adjusted the New function in the Crawler to initialize `storedGAVCs` accordingly.
- Modified the central package to utilize `storedGAVCs` for consistent artifact processing.
- Enhanced comments to clarify the purpose of the storedGAVCs map in preventing duplicate processing.

- Improved error logging in the Builder's Build method to include additional context (GroupID, ArtifactID, Version, Classifier) when SHA1 decoding fails.
- Added validation for SHA1 hashes in the Crawler's Read method to ensure they are correctly formatted and of the expected length, enhancing data integrity checks.
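
A minimal sketch of the kind of SHA1 validation described in the last commit message, assuming digests arrive as hex strings. `parseSHA1` is a hypothetical helper written for this example and is not the function used in the PR.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// parseSHA1 decodes a hex-encoded SHA-1 digest and checks that it is exactly 20 bytes.
func parseSHA1(s string) ([]byte, error) {
	b, err := hex.DecodeString(s)
	if err != nil {
		return nil, fmt.Errorf("invalid hex in SHA1 %q: %w", s, err)
	}
	if len(b) != sha1.Size {
		return nil, fmt.Errorf("unexpected SHA1 length: got %d bytes, want %d", len(b), sha1.Size)
	}
	return b, nil
}

func main() {
	for _, s := range []string{
		"da4520d9be9a4e693a6993ce3c80447090327b08", // well-formed 40-character digest
		"da4520d9",     // valid hex but too short
		"not-a-digest", // not hex at all
	} {
		_, err := parseSHA1(s)
		fmt.Printf("%q valid=%t\n", s, err == nil)
	}
}
```

Checking both the hex encoding and the decoded length catches truncated values as well as stray text in the digest column.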

knqyf263 · 2025-05-15T12:47:44Z

So, if I understand correctly, obtaining the SHA1 can be done using only the Central Index, and GCS is not necessary. The Central Index is updated once a week, but I believe this delay is acceptable. Delays of over a week can occur, but I think that’s still acceptable for an OSS project.

I tried updating the maven-index using only the Central Index, and then built the database from this index.

$ trivy-java-db build --index-dir /path/to/maven-index/

The database generated using this maven-index showed no significant differences from our current one. Some artifacts (example) were missing, but they’re artifacts that don’t contain any deployable JARs (their classifier is like javadoc or test), so it’s correct that they were omitted.

@DmitriyLewen Could you double-check as well to ensure I haven’t overlooked anything?

DmitriyLewen · 2025-05-15T12:58:25Z

Before I start checking:
We discussed using index archives, but saw that the update periods are unstable - #52 (comment)
Is it stable now? Or are these just errors with dates?

knqyf263 · 2025-05-15T13:19:41Z

There may have been a mistake in the comparison method. I'm currently double-checking it, so please wait a moment.

Also, regarding the update frequency, it appears that updates are currently being made at least once every two weeks.

knqyf263 · 2025-05-16T06:38:12Z

I was mistaken. It turns out that the Central Index indeed does not include the SHA1 digest for some artifact JARs.

For example, this sha1 is missing.

grep "ai.ancf.lmos-router" central-index-gavc.tsv | grep benchmarks
ai.ancf.lmos-router	benchmarks	0.2.0		pom.sha512	
ai.ancf.lmos-router	benchmarks	0.2.0	sources	jar.sha512	
ai.ancf.lmos-router	benchmarks	0.2.0	sources	jar.sha256	
ai.ancf.lmos-router	benchmarks	0.2.0	sources	jar.asc.sha512	
ai.ancf.lmos-router	benchmarks	0.2.0	sources	jar.asc.sha256	
ai.ancf.lmos-router	benchmarks	0.2.0	sources	jar	94d3d33aafc21b11386131ab469e4ef0c58c16eb
ai.ancf.lmos-router	benchmarks	0.2.0	javadoc	jar.sha512	
ai.ancf.lmos-router	benchmarks	0.2.0	javadoc	jar.sha256	
ai.ancf.lmos-router	benchmarks	0.2.0	javadoc	jar.asc.sha512	
ai.ancf.lmos-router	benchmarks	0.2.0	javadoc	jar.asc.sha256	
ai.ancf.lmos-router	benchmarks	0.2.0	javadoc	jar	53886a59b8ed977c1088698468b09b1c02c38b8c
ai.ancf.lmos-router	benchmarks	0.28.0		pom.sha512	
ai.ancf.lmos-router	benchmarks	0.28.0	sources	jar.sha512	
ai.ancf.lmos-router	benchmarks	0.28.0	sources	jar.sha256	
ai.ancf.lmos-router	benchmarks	0.28.0	sources	jar.asc.sha512	
ai.ancf.lmos-router	benchmarks	0.28.0	sources	jar.asc.sha256	
ai.ancf.lmos-router	benchmarks	0.28.0	sources	jar	3a6c48b0296576e0a945c7d89f33ed9897cd196f
ai.ancf.lmos-router	benchmarks	0.28.0	javadoc	jar.sha512	
ai.ancf.lmos-router	benchmarks	0.28.0	javadoc	jar.sha256	
ai.ancf.lmos-router	benchmarks	0.28.0	javadoc	jar.asc.sha512	
ai.ancf.lmos-router	benchmarks	0.28.0	javadoc	jar.asc.sha256	
ai.ancf.lmos-router	benchmarks	0.28.0	javadoc	jar	ab2fa7e53dc7127d79595612c4ea25ad745d01fd

DmitriyLewen · 2025-05-16T06:45:22Z

0.28.0 artifacts are 2 months old (according to the dates from Maven Central)...
https://repo.maven.apache.org/maven2/ai/ancf/lmos-router/benchmarks/0.28.0/

But I have sad news:
GCS also doesn't contain these artifacts:

➜ curl "https://storage.googleapis.com/storage/v1/b/maven-central/o?prefix=maven2/ai/ancf/&maxResults=5000"
{
  "kind": "storage#objects"
}

- Removed redundant SHA1 handling code from the GCS client.
- Introduced a new `sha1` package to encapsulate SHA1 parsing logic.
- Enhanced the `FetchSHA1` method to utilize the new `sha1.Parse` function for improved clarity and maintainability.
- Updated dependencies in go.mod and go.sum for consistency.

- Added functionality to fetch missing SHA1s directly from Maven Central for records without classifiers.
- Introduced a new map to track missing SHA1s and updated the Read method to handle records accordingly.
- Enhanced error handling and logging during the fetching process, including status code checks and logging of fetched counts.
- Improved the overall structure of the Source type to accommodate the new fetching logic, ensuring robust artifact processing.

- Added a condition to skip the "maven2/data/" directory during the listing process in the GCS lister.
- This change prevents unnecessary processing of a directory that is not present in Maven Central, improving efficiency in the crawling operation.

- Introduced a new `maven` package with a `ValidateClassifier` function to encapsulate classifier validation logic.
- Updated the `read` method in the `Source` type to utilize the new validation function, improving clarity and maintainability.
- Enhanced the `fetch` method in the `Fetcher` to include version validation alongside groupID and artifactID checks, ensuring robust artifact processing.

- Updated the `read` method in the `Source` type to improve artifact type validation.
- Removed the check for `fileExtension` being "jar" from the initial condition, allowing for more flexible handling of artifact types.
- Added a separate check for `fileExtension` to ensure only records with the "jar" extension are processed, enhancing clarity and maintainability of the code.

knqyf263 · 2025-05-19T08:51:44Z

I have modified the code to query Maven Central only when the SHA1 is not found in the Central Index. This significantly reduces the number of HTTP requests. Additionally, it has improved the coverage. When comparing the current database using this script, I found only 46 differences in GAVs.

Furthermore, it seems that none of these 46 cases are valid GAVs. Details are as follows:

GAs that have already been deleted

GAVs that have already been deleted

JARs stored under different versions

https://repo.maven.apache.org/maven2/org/bouncycastle/bcprov-debug-jdk14/1.78/ (bcprov-debug-jdk14-1.78.1.jar is stored)
https://repo.maven.apache.org/maven2/com/sun/org/apache/jaxp-ri/1.4/ (1.4.1 and 1.4.2 are located)

Empty directory

https://repo.maven.apache.org/maven2/com/sun/jsftemplating/jsftemplating-dt/1.2.5/ (1.2.5 is actually stored under 1.2)

knqyf263 · 2025-05-19T10:08:11Z

Very bad news... the index seems to contain wrong values 😭

net.itarray	automotion	1.4.1		jar	aa50e73ec13a2140d990e4684ff0822b84caa962
net.itarray	automotion	1.4.1	sources	jar	a114f4a08a9fdf6535923dcbc43685cc427e3a7a
net.itarray	automotion	1.4.1	javadoc	jar	92046e9b8e69e86e7c1c37210211401ce0bf763a

The index holds aa50e73ec13a2140d990e4684ff0822b84caa962, but it should be 8293e469853b8fbff6ab842cbd07400c4831b94b
https://repo.maven.apache.org/maven2/net/itarray/automotion/1.4.1/automotion-1.4.1.jar.sha1

DmitriyLewen · 2025-05-19T10:17:16Z

this is very bad 😭
If we could consider the absence of some artifacts acceptable for OSS (as we said about GCS), then incorrect digests are definitely unacceptable...

DmitriyLewen · 2025-05-19T10:19:31Z

We can try to combine 3 sources:

- Take the list of artifacts from the indexes and GCS
- Take digests for these artifacts from GCS, or from Maven Central if GCS doesn't have the artifact

But I am still not sure about the indexes.
We don't know what other errors might be there...

…ndexes
- Eliminated the redundant check for classifier being "-" in the loadExistingIndexes method.
- This change simplifies the code by directly assigning an empty string to classifier when it is not provided, enhancing clarity and maintainability.

…tral Index
- Updated the `Source` type to enhance the handling of GAVC records and SHA1 fetching.
- Removed reliance on unreliable SHA1 values from the central index, implementing a two-step process to ensure accurate SHA1 retrieval directly from Maven Central.
- Introduced a new `records` slice to store all GAVC records, improving clarity and maintainability.
- Enhanced logging to reflect the number of records processed and fetched, ensuring better tracking of operations.
- Refined the `fetchSHA1s` method to process records more efficiently, improving overall performance.

knqyf263 · 2025-05-19T12:41:45Z

I’m currently planning this as a two-phase process.

1. Crawl GCS and collect as many SHA1 hashes as possible.
2. Read the Central Index and produce only a list of GAVC tuples (group ID + artifact ID + version + classifier). At that stage, we don’t use any SHA1 values, since they can’t be fully trusted. Then remove from this list any GAVCs for which a SHA1 has already been retrieved from GCS, and for the remaining GAVCs we construct the corresponding Maven Central URLs and fetch their SHA1 hashes over HTTP.

Although dropping the index-based SHA1 lookups increases the number of requests sent to Maven Central, it saves requests when building the list, and—because it’s a differential update—already-stored artifacts are skipped.
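
The filtering step of the two-phase plan can be sketched as a set difference: keep only the index entries that have no SHA1 from the GCS crawl. The `GAVC` struct and the `pending` helper below are hypothetical, and the sample data is placeholder data rather than real artifacts or checksums.

```go
package main

import "fmt"

// GAVC identifies one artifact file: group ID, artifact ID, version, classifier.
type GAVC struct {
	GroupID, ArtifactID, Version, Classifier string
}

// pending returns the index entries that have no SHA1 yet, preserving index order.
// known maps a GAVC to the SHA1 already collected in phase one.
func pending(index []GAVC, known map[GAVC]string) []GAVC {
	var out []GAVC
	for _, g := range index {
		if _, ok := known[g]; !ok {
			out = append(out, g)
		}
	}
	return out
}

func main() {
	// Placeholder data; the digest is a dummy value, not a real checksum.
	fromGCS := map[GAVC]string{
		{GroupID: "org.example", ArtifactID: "lib", Version: "1.0"}: "0000000000000000000000000000000000000000",
	}
	index := []GAVC{
		{GroupID: "org.example", ArtifactID: "lib", Version: "1.0"},
		{GroupID: "org.example", ArtifactID: "lib", Version: "1.1"},
		{GroupID: "org.example", ArtifactID: "lib", Version: "1.1", Classifier: "sources"},
	}
	// Only the entries missing from phase one need an HTTP request.
	for _, g := range pending(index, fromGCS) {
		fmt.Printf("fetch from Maven Central: %+v\n", g)
	}
}
```

Because the struct is comparable, it can be used directly as a map key; a crawler working at the scale discussed here might prefer hashed keys to save memory.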

I’ve implemented this strategy, so I’ll run it again on GitHub Actions.

- Added a case to handle HTTP 403 Forbidden status in the fetchSHA1s method.
- This change improves error handling by explicitly logging when access to a resource is forbidden, enhancing clarity in the fetching process.

- Improved error handling in the fetchSHA1s method by returning errors instead of skipping them on HTTP request failures.
- Added logging for the number of fetched SHA1s every 10,000 records processed, providing better visibility into the fetching process.
- Enhanced logging for failed SHA1 content reads, including the URL and error details, to aid in debugging and monitoring.

- Updated the fetchSHA1s method to return errors instead of skipping them on HTTP request failures, enhancing error visibility.
- Added logging for the number of errors encountered during SHA1 fetching, providing better insights into the fetching process.
- Enhanced logging in the crawlWithPipeline method to report the number of errors encountered during artifact processing.
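
The status handling described in these commit messages could look roughly like the sketch below. `classifyStatus` is a hypothetical helper, and treating 404 the same way as 403 (skip rather than fail) is an assumption made for the example, not something the commits state.

```go
package main

import (
	"fmt"
	"net/http"
)

// classifyStatus maps the HTTP status of a .sha1 request to an action.
// fetch is true only when the response body should be read and parsed.
func classifyStatus(code int) (fetch bool, err error) {
	switch code {
	case http.StatusOK:
		return true, nil
	case http.StatusNotFound, http.StatusForbidden:
		// Assumption: a missing or forbidden checksum is skipped (and logged by the caller).
		return false, nil
	default:
		// Anything else is surfaced so it is counted as an error instead of silently dropped.
		return false, fmt.Errorf("unexpected status %d (%s)", code, http.StatusText(code))
	}
}

func main() {
	for _, code := range []int{200, 403, 404, 500} {
		fetch, err := classifyStatus(code)
		fmt.Printf("status=%d fetch=%t err=%v\n", code, fetch, err)
	}
}
```

Keeping the decision in a pure function makes it easy to unit-test without a network connection.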

- Enhanced the logic in the parseItemName function to better handle classifier extraction from filenames.
- Updated comments to clarify the handling of special cases where classifiers may not follow the expected format.
- This change improves the robustness of classifier parsing, ensuring that any remaining string is treated as a classifier when necessary.

- Introduced a new test file for the GCS fetcher to validate the parseItemName function.
- Added multiple test cases covering various scenarios, including normal cases, edge cases, and invalid inputs.
- This addition enhances test coverage and ensures the correctness of the classifier parsing logic.

- Introduced a new function `containsControlChar` to check for control characters in strings.
- Updated the `read` method to skip records with control characters in `GroupID`, `ArtifactID`, `Version`, or `Classifier`.
- Added unit tests for `containsControlChar` to ensure correct functionality across various input scenarios.
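
One plausible way to write such a check is shown below. The function name comes from the commit message above, but the body is an assumption: it relies on `unicode.IsControl` and `strings.ContainsFunc` (Go 1.21+), and the PR's actual implementation may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// containsControlChar reports whether s contains any Unicode control character,
// such as NUL, tab, or newline.
func containsControlChar(s string) bool {
	return strings.ContainsFunc(s, unicode.IsControl)
}

func main() {
	for _, s := range []string{"org.example", "bad\x00group", "tab\there"} {
		fmt.Printf("%q -> %t\n", s, containsControlChar(s))
	}
}
```

A record would be skipped if this returns true for any of its GroupID, ArtifactID, Version, or Classifier fields.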

knqyf263 · 2025-05-20T17:28:28Z

I want to believe this is some kind of bad joke, but I’ve found a case where the SHA1 values differ between Maven Central and GCS.

# Maven Central
$ curl https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/3.1.3/onfido-api-client-3.1.3.jar.sha1
941c5572bc8fbe9203745c9eca4a0fa4d4ee0fbb

# GCS
$ curl "https://storage.googleapis.com/download/storage/v1/b/maven-central/o/maven2%2Fcom%2Fonfido%2Fapi%2Fclient%2Fonfido-api-client%2F3.1.3%2Fonfido-api-client-3.1.3.jar.sha1?generation=1617659797337662&alt=media"
2b0f02371aa9be4644ebbf65186ab6b4956b498f

It appears that the actual JAR files are different.

# Maven Central
$ curl -s https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/3.1.3/onfido-api-client-3.1.3.jar | sha1sum
941c5572bc8fbe9203745c9eca4a0fa4d4ee0fbb  -

# GCS
$ curl -s "https://storage.googleapis.com/download/storage/v1/b/maven-central/o/maven2%2Fcom%2Fonfido%2Fapi%2Fclient%2Fonfido-api-client%2F3.1.3%2Fonfido-api-client-3.1.3.jar?generation=1617659792993892&alt=media" | sha1sum
2b0f02371aa9be4644ebbf65186ab6b4956b498f  -

This is just my guess, but considering that the last update on Maven Central was April 7, 2021, and the last update on GCS was April 5, 2021, it’s possible that the JAR file on Maven Central was updated after it had already been synchronized to GCS. That update might not have been propagated to GCS, leaving the older artifact there.

    {
      "kind": "storage#object",
      "id": "maven-central/maven2/com/onfido/api/client/onfido-api-client/3.1.3/onfido-api-client-3.1.3.jar/1617659792993892",
      "selfLink": "https://www.googleapis.com/storage/v1/b/maven-central/o/maven2%2Fcom%2Fonfido%2Fapi%2Fclient%2Fonfido-api-client%2F3.1.3%2Fonfido-api-client-3.1.3.jar",
      "mediaLink": "https://storage.googleapis.com/download/storage/v1/b/maven-central/o/maven2%2Fcom%2Fonfido%2Fapi%2Fclient%2Fonfido-api-client%2F3.1.3%2Fonfido-api-client-3.1.3.jar?generation=1617659792993892&alt=media",
      "name": "maven2/com/onfido/api/client/onfido-api-client/3.1.3/onfido-api-client-3.1.3.jar",
      "bucket": "maven-central",
      "generation": "1617659792993892",
      "metageneration": "1",
      "contentType": "application/java-archive",
      "storageClass": "STANDARD",
      "size": "33",
      "md5Hash": "jMpWQ/+3GCzKnC4VS0TLXg==",
      "contentLanguage": "en",
      "crc32c": "52QnRw==",
      "etag": "COTc0KqM6O8CEAE=",
      "timeCreated": "2021-04-05T21:56:33.026Z",
      "updated": "2021-04-05T21:56:33.026Z",
      "timeStorageClassUpdated": "2021-04-05T21:56:33.026Z",
      "timeFinalized": "2021-04-05T21:56:33.026Z",
      "metadata": {
        "checksum-sha1": "2b0f02371aa9be4644ebbf65186ab6b4956b498f",
        "last-modified-epoch": "1617650729000",
        "checksum-md5": "8cca5643ffb7182cca9c2e154b44cb5e",
        "last-modified": "Mon, 05 Apr 2021 19:25:29 GMT"
      }
    },
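
The `curl ... | sha1sum` comparison above can be expressed in Go as a small sketch. `sha1Hex` and `matchesPublished` are hypothetical helpers; taking only the first whitespace-separated field of the `.sha1` file is an assumption, made so that files carrying a trailing file name or newline still compare correctly.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"strings"
)

// sha1Hex returns the lowercase hex SHA-1 of data, matching the first column of sha1sum.
func sha1Hex(data []byte) string {
	sum := sha1.Sum(data)
	return hex.EncodeToString(sum[:])
}

// matchesPublished reports whether data hashes to the digest in a .sha1 file body.
// Only the first whitespace-separated field of the published text is compared.
func matchesPublished(data []byte, published string) bool {
	fields := strings.Fields(published)
	if len(fields) == 0 {
		return false
	}
	return strings.EqualFold(fields[0], sha1Hex(data))
}

func main() {
	content := []byte("abc") // stand-in for downloaded JAR bytes
	fmt.Println(sha1Hex(content))
	fmt.Println(matchesPublished(content, "a9993e364706816aba3e25717850c26c9cd0d89d\n"))
}
```

Running the same check against both mirrors would show whether the published digest and the stored file agree within each source, which is a separate question from whether the two sources agree with each other.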

knqyf263 · 2025-05-20T17:36:43Z

It looks like there are about 100 such cases, assuming my comparison script is correct (I'll double-check how many of these cases there actually are). Out of 16 million artifacts, I believe having around 100 cases is within an acceptable range. I think it's fine to ignore them, but if they are particularly important, we could also consider fixing them manually.

- Removed the `records` slice from the `Source` type to streamline record handling.
- Implemented a channel-based pipeline for processing records, enhancing concurrency and clarity.
- Introduced `readFromIndex`, `parseRecords`, and `fetchSHA1s` methods to separate concerns and improve maintainability.
- Enhanced error handling and logging throughout the pipeline stages for better visibility.
- Updated the `validateRecord` method to encapsulate record validation logic, ensuring cleaner code.

knqyf263 · 2025-05-21T06:00:59Z

If my script works correctly, there are only 113 cases where SHA1 is wrong in GCS. I would accept these edge cases as I saw many wrong values in the current database.

[SHA1 diff: SHA1 mismatch] GAVs with different SHA1:
com.onfido.api.client|onfido-api-client|4.0.0	SQLite SHA1: b9797437bf7b564b9b70227ea48e3a295704ece3	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/4.0.0/
de.knightsoft-net|mt-bean-validators|0.22.2	SQLite SHA1: 30df790d183f3376ad38130988d16bfb882f6bad	TSV SHA1: b0a86483d70eb2922a256e0563110f03b949dcf5 @https://repo.maven.apache.org/maven2/de/knightsoft-net/mt-bean-validators/0.22.2/
com.tobedevoured.command|spring|0.3.2	SQLite SHA1: a159f2da20f64564cbf2a91a3e8c13d7a173844c	TSV SHA1: 912fd9c8d977421b9ed0448ca59ab8eb2fb7983e @https://repo.maven.apache.org/maven2/com/tobedevoured/command/spring/0.3.2/
com.onfido.api.client|onfido-api-client|3.1.4	SQLite SHA1: 4e0c83e92eacef88640cdad2c38e9aff7e972f14	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/3.1.4/
com.17jee|e-core-template-freemarker|3.0.1.1	SQLite SHA1: 7eaeb96adef187ef7d1016e573092169957651d5	TSV SHA1: ddda3636db969962edae594ffc0008acb4e2c7b1 @https://repo.maven.apache.org/maven2/com/17jee/e-core-template-freemarker/3.0.1.1/
com.onfido.api.client|onfido-api-client|0.3.3	SQLite SHA1: b6d3f3fca67e34e6f04def0442e0bef6cbc5dae1	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/0.3.3/
de.knightsoft-net|gwt-mt-widgets|0.22.2	SQLite SHA1: 7fb541f8eac7b717608fac86450542e6e235b242	TSV SHA1: e267ef2160e012b664d6f3d5db25e4348df939b7 @https://repo.maven.apache.org/maven2/de/knightsoft-net/gwt-mt-widgets/0.22.2/
com.onfido.api.client|onfido-api-client|1.3.0	SQLite SHA1: 220ea7203349616a6cc9d257a7e658c8ef0821dc	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/1.3.0/
com.onfido.api.client|onfido-api-client|5.3.1	SQLite SHA1: 418314bce104ace9ad2001924e155c234cb7d2c2	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/5.3.1/
com.onfido.api.client|onfido-api-client|5.5.0	SQLite SHA1: 76ce90918154892fe44a2b200c6b2b8d5f3ee030	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/5.5.0/
de.knightsoft-net|gwt-bean-validators-spring-gwtp|0.22.2	SQLite SHA1: 075cc6a35b0542400dc95076d78383f31f87ebf8	TSV SHA1: aeb2d6ea968ef21e3a022eeedaf74684c00463cc @https://repo.maven.apache.org/maven2/de/knightsoft-net/gwt-bean-validators-spring-gwtp/0.22.2/
com.onfido.api.client|onfido-api-client|7.3.0	SQLite SHA1: c4dfb62eb73b499991ad7b86280c5f520210b569	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/7.3.0/
com.onfido.api.client|onfido-api-client|0.4.0	SQLite SHA1: ae3b921a60b62f230879bc680716de225242ebd3	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/0.4.0/
com.onfido.api.client|onfido-api-client|0.3.2	SQLite SHA1: 8b51345fe6335e55680cbc4dab000e00741739a9	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/0.3.2/
com.smartdevicelink|sdl_java_ee|4.12.2	SQLite SHA1: 8ee44885441ecad96700d680b9dd984f36524fd5	TSV SHA1: 21a7e60bb893d167f9b9a0f95323d32eade07190 @https://repo.maven.apache.org/maven2/com/smartdevicelink/sdl_java_ee/4.12.2/
com.indix.api|indix-api-java|3.0.0	SQLite SHA1: fb5cdf064a215d38db1f4445f13e4cdf72466036	TSV SHA1: bce804b40aa001d12855fc90516e122a6dbaa883 @https://repo.maven.apache.org/maven2/com/indix/api/indix-api-java/3.0.0/
com.onfido.api.client|onfido-api-client|3.1.0	SQLite SHA1: 90c98cead3faea2c527014fc7ec12f6230ef0e5d	TSV SHA1: 2b0f02371aa9be4644ebbf65186ab6b4956b498f @https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/3.1.0/
org.sonatype.install4j|install4j-slf4j|1.0.8	SQLite SHA1: 1afe175f56ae49daff24d5a339533360f5b7aaf8	TSV SHA1: 49dc96882a139124309791ab80dcd587f3122550 @https://repo.maven.apache.org/maven2/org/sonatype/install4j/install4j-slf4j/1.0.8/
com.smartdevicelink|sdl_java_ee|5.0.0	SQLite SHA1: 02fb47876da3d52c1f2f6b69f989b50b866b0688	TSV SHA1: 057b0ab05a3c5e65d1b71777de415a518a1da0e7 @https://repo.maven.apache.org/maven2/com/smartdevicelink/sdl_java_ee/5.0.0/
com.17jee|e-log-ui|3.0.1.1	SQLite SHA1: 8f7eaa96c02396c96a77c912d7c335dac77b4bb0	TSV SHA1: 4d57476d4eedec9356d4213f3e60ecfcca9533e6 @https://repo.maven.apache.org/maven2/com/17jee/e-log-ui/3.0.1.1/
... and more
Total SHA1 diff: 113

DmitriyLewen · 2025-05-21T10:09:31Z

> This is just my guess, but considering that the last update on Maven Central was April 7, 2021, and the last update on GCS was April 5, 2021, it’s possible that the JAR file on Maven Central was updated after it had already been synchronized to GCS. That update might not have been propagated to GCS, leaving the older artifact there.

Most likely, you are right.
But then that means maven central is not immutable...

But in any case, I agree with you. 113 artifacts out of 16 million is an acceptable deviation.

This reverts commit f727716.

DmitriyLewen · 2025-05-21T11:24:49Z

@knqyf263 I see you're still working on this PR.

Let me know when I can start reviewing it.

knqyf263 · 2025-05-21T11:27:13Z

Yes, I hope the job will complete successfully, and this PR will be ready.

knqyf263 · 2025-05-21T16:34:58Z

The job was successful. Also, I've checked maven-index and confirmed that all records except for a few edge cases in the current database are present in the TSV files.

One point to note is that when inserting records into the database, a different GAV with the same SHA1 digest might be inserted. Therefore, I compared the database with the TSV files, rather than comparing databases.

@DmitriyLewen You can start reviewing the PR. I didn't mark this as ready for review since this PR depends on #57.

pkg/crawler/central/central.go

DmitriyLewen · 2025-05-22T11:25:07Z

pkg/crawler/central/central.go

+			groupPath := strings.ReplaceAll(rec.GroupID, ".", "/")
+
+			// Build jar name: artifactId-version[-classifier].jar
+			jarName := rec.ArtifactID + "-" + rec.Version + ".jar"


The classifier is missing here.

There was a reason why I didn't add a classifier here, but I can't recall it...
Fixed in 69a2117

pkg/crawler/gcs/gcs.go

pkg/crawler/central/central.go

Co-authored-by: DmitriyLewen <[email protected]>

- Removed the `records` parameter from the `read` method call in the `Source` type.
- This change streamlines the method signature and improves clarity in the codebase.

- Updated the jar name construction logic to include the classifier if it is present.
- This change ensures that the jar name follows the format: artifactId-version[-classifier].jar, improving clarity and correctness in artifact naming.
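
A sketch of the classifier-aware name construction, extended to the full `.sha1` URL on Maven Central. `sha1URL` and `baseURL` are hypothetical names for this example, and the sketch assumes the version directory is identical to the version string.

```go
package main

import (
	"fmt"
	"strings"
)

// baseURL is the Maven Central repository root used in the URLs quoted in this thread.
const baseURL = "https://repo.maven.apache.org/maven2/"

// sha1URL builds the URL of the .sha1 file for a JAR.
// The jar name follows artifactId-version[-classifier].jar.
func sha1URL(groupID, artifactID, version, classifier string) string {
	groupPath := strings.ReplaceAll(groupID, ".", "/")
	jarName := artifactID + "-" + version
	if classifier != "" {
		jarName += "-" + classifier
	}
	jarName += ".jar"
	// Assumption: the version directory equals the version string.
	return baseURL + groupPath + "/" + artifactID + "/" + version + "/" + jarName + ".sha1"
}

func main() {
	fmt.Println(sha1URL("net.itarray", "automotion", "1.4.1", ""))
	fmt.Println(sha1URL("net.itarray", "automotion", "1.4.1", "sources"))
}
```

The first call reproduces the `automotion-1.4.1.jar.sha1` URL quoted earlier in the conversation; the second shows where the classifier lands in the file name.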

DmitriyLewen

LGTM.

But let's run Update Maven Index again to make sure the changes with classifiers didn't break anything.

knqyf263 · 2025-05-22T17:14:30Z

Thanks for reviewing. Triggered.
https://github.com/aquasecurity/maven-index/actions/runs/15192742680/job/42729417839

knqyf263 · 2025-05-23T06:18:17Z

I confirmed that the database could be built with the latest index.

knqyf263 · 2025-05-23T08:09:08Z

I also confirmed that the number of detected components is the same with the current DB and the new DB.

$ trivy image jenkins:2.60.3 -q --format cyclonedx --cache-backend memory | jq '.components | length'
456
$ cp ~/Library/Caches/trivy-java-db/db/trivy-java.db ~/Library/Caches/trivy/java-db/
$ trivy image jenkins:2.60.3 -q --format cyclonedx --cache-backend memory | jq '.components | length'
456

knqyf263 requested a review from Copilot May 1, 2025 07:15

knqyf263 self-assigned this May 1, 2025

knqyf263 changed the title ~~feat(crawler): crawl central index~~ feat(crawler): crawl maven central index May 1, 2025

Copilot AI reviewed May 1, 2025

knqyf263 added 5 commits May 15, 2025 15:16

feat: add support for Central Index

8001400

knqyf263 force-pushed the central-index branch from d7779cd to 1ac7a8d Compare May 15, 2025 12:09

knqyf263 added 5 commits May 16, 2025 12:30

knqyf263 added 2 commits May 19, 2025 16:14

knqyf263 added 2 commits May 20, 2025 11:21

fix(crawler): handle HTTP 403 status in SHA1 fetching

6940992

- Added a case to handle HTTP 403 Forbidden status in the fetchSHA1s method.
- This change improves error handling by explicitly logging when access to a resource is forbidden, enhancing clarity in the fetching process.
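
A rough sketch of what such a status switch might look like is shown below. It is not the actual `fetchSHA1s` code: the `handleStatus` helper, its return values, and the treatment of 404 and of unexpected statuses are assumptions made for illustration. Only the explicit, logged 403 case is taken from the commit message.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// handleStatus is a hypothetical helper that decides what a SHA1 fetcher
// might do with a response status. proceed is true when the body should be
// read; err is non-nil for statuses that are not handled explicitly.
func handleStatus(url string, status int) (proceed bool, err error) {
	switch status {
	case http.StatusOK:
		return true, nil
	case http.StatusNotFound:
		// Assumption: a missing file is skipped rather than treated as fatal.
		log.Printf("not found, skipping: %s", url)
		return false, nil
	case http.StatusForbidden:
		// The case described in the commit: log explicitly and move on.
		log.Printf("access forbidden, skipping: %s", url)
		return false, nil
	default:
		return false, fmt.Errorf("unexpected status %d for %s", status, url)
	}
}

func main() {
	for _, status := range []int{http.StatusOK, http.StatusForbidden, http.StatusBadGateway} {
		proceed, err := handleStatus("https://example.invalid/a.jar.sha1", status)
		fmt.Println(status, proceed, err)
	}
}
```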

knqyf263 added 4 commits May 20, 2025 11:30

Revert "fix(crawler): refactor record processing with pipeline design"

641213a

This reverts commit f727716.

DmitriyLewen reviewed May 22, 2025

knqyf263 and others added 4 commits May 22, 2025 15:53

Update pkg/crawler/gcs/gcs.go

c476991

Co-authored-by: DmitriyLewen <[email protected]>

Update pkg/crawler/central/central.go

65eb965

Co-authored-by: DmitriyLewen <[email protected]>

fix(crawler): simplify read method in central index processing

065d84a

- Removed the `records` parameter from the `read` method call in the `Source` type.
- This change streamlines the method signature and improves clarity in the codebase.

knqyf263 requested a review from DmitriyLewen May 22, 2025 14:12

DmitriyLewen approved these changes May 22, 2025

knqyf263 closed this May 29, 2025

-		sourceType: cmp.Or(opt.SourceType, SourceTypeGCS), // Default to GCS if not specified
+		sourceType: func() types.SourceType {
+			if opt.SourceType == "" {
+				return SourceTypeGCS // Default to GCS if not specified
+			}
+			return opt.SourceType
+		}(),
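
For a string-based type, the two sides of the suggestion above should behave identically, because `cmp.Or` (Go 1.22 and later) returns its first non-zero argument. The sketch below compares them under stated assumptions: `SourceType` is modelled as a plain string type, the `"gcs"` and `"central"` values are placeholders, and both helper functions are hypothetical.

```go
package main

import (
	"cmp"
	"fmt"
)

// SourceType is modelled as a string type; the constant values are placeholders.
type SourceType string

const (
	SourceTypeGCS     SourceType = "gcs"
	SourceTypeCentral SourceType = "central"
)

// defaultWithCmpOr mirrors the removed line: cmp.Or returns the first
// argument that is not the zero value, so "" falls through to GCS.
func defaultWithCmpOr(s SourceType) SourceType {
	return cmp.Or(s, SourceTypeGCS)
}

// defaultWithClosure mirrors the suggested replacement.
func defaultWithClosure(s SourceType) SourceType {
	return func() SourceType {
		if s == "" {
			return SourceTypeGCS // Default to GCS if not specified
		}
		return s
	}()
}

func main() {
	for _, s := range []SourceType{"", SourceTypeGCS, SourceTypeCentral} {
		fmt.Printf("%q: cmp.Or=%q closure=%q\n", s, defaultWithCmpOr(s), defaultWithClosure(s))
	}
}
```

Since the results match for every input, the choice between the two forms is mainly one of readability and of the minimum Go version the module targets.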

-	Version: "-", // Assume that version is the same as versionDir in Central Index
+	Version: deriveVersion(versionDir), // Derive version from versionDir or fallback to "-"
