-
Notifications
You must be signed in to change notification settings - Fork 20
feat(crawler): crawl maven central index #58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a pluggable source architecture for crawling Maven artifacts by adding support for both GCS and Maven Central Index. Key changes include:
- Defining a new Source interface and Record type to standardize artifact retrieval.
- Refactoring the existing GCS crawler components and removing legacy types to support the new design.
- Implementing a new Maven Central Index source and updating the crawler pipeline and CLI flags accordingly.
Reviewed Changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
pkg/crawler/types/types.go | Introduces the new Source interface and Record struct to abstract artifact retrieval. |
pkg/crawler/types.go | Removes the legacy types previously used for GCS-specific crawling. |
pkg/crawler/gcs/lister.go | Implements the parallel GCS lister with concurrent prefix processing. |
pkg/crawler/gcs/gcs.go | Refactors the GCS client into a more generic Client with updated methods and baseURL fallback. |
pkg/crawler/gcs/fetcher.go | Provides the logic for fetching SHA1 hashes from GCS while handling errors appropriately. |
pkg/crawler/crawler_test.go | Updates test cases to refer to the new gcs.ListResponse type and related changes. |
pkg/crawler/crawler.go | Refactors the crawling pipeline to support multiple source types via the new Source interface. |
pkg/crawler/central/central.go | Adds support for processing the Maven Central Index and integrating it into the crawling pipeline. |
cmd/trivy-java-db/main.go | Adds a new command-line flag to select the artifact source type and updates corresponding parameters. |
Files not reviewed (1)
- go.mod: Language not supported
Comments suppressed due to low confidence (1)
pkg/crawler/gcs/gcs.go:56
- [nitpick] Add a brief comment explaining how cmp.Or selects the baseURL fallback to aid future maintainability.
baseURL: cmp.Or(baseURL, gcsURL),
pkg/crawler/crawler.go
Outdated
limit: opt.Limit, | ||
shardCount: opt.Shard, | ||
storedGAVs: make(map[uint64]struct{}), | ||
mutex: sync.Mutex{}, | ||
sourceType: cmp.Or(opt.SourceType, SourceTypeGCS), // Default to GCS if not specified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Consider replacing cmp.Or with an explicit defaulting logic (e.g., an if-check) for SourceType to improve clarity.
sourceType: cmp.Or(opt.SourceType, SourceTypeGCS), // Default to GCS if not specified | |
sourceType: func() types.SourceType { | |
if opt.SourceType == "" { | |
return SourceTypeGCS // Default to GCS if not specified | |
} | |
return opt.SourceType | |
}(), |
Copilot uses AI. Check for mistakes.
pkg/crawler/central/central.go
Outdated
GroupID: groupID, | ||
ArtifactID: artifactID, | ||
VersionDir: versionDir, | ||
Version: "-", // Assume that version is the same as versionDir in Central Index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Document the rationale behind defaulting Version to '-' more explicitly or consider deriving it from available metadata if appropriate.
Version: "-", // Assume that version is the same as versionDir in Central Index | |
Version: deriveVersion(versionDir), // Derive version from versionDir or fallback to "-" |
Copilot uses AI. Check for mistakes.
It seems that I didn’t fully understand the contents of the central index. I thought this artifact didn’t exist in the index and we needed to combine GCS, but it turns out that was because I was filtering by packaging type "jar". Upon closer inspection of the index, I found that the GAV does exist, but the packaging is set to "bundle". A JAR is still generated even when the packaging is "bundle". In other words, although we went around in circles, by examining the index, we now see that there’s a good chance we can accurately retrieve artifact information from Maven Central.
|
- Added a new `sourceType` flag to specify the source type (GCS or Central) for the artifact crawling process. - Updated the `Crawler` to handle different source types, allowing for more flexible artifact retrieval. - Refactored the `crawlWithPipeline` method to create sources based on the specified type, enhancing the modularity of the code. - Introduced a new `central` package to manage Maven Central Index interactions, improving the overall architecture. - Enhanced the GCS client and related components to support the new structure, ensuring compatibility with existing functionality.
- Added cache directory support in the Crawler for storing the Maven Central Index. - Updated the New function in the central package to accept a cache directory parameter and ensure its existence. - Implemented downloadIndex method to handle downloading and caching of the Maven Central Index. - Improved error handling for index file retrieval and processing, ensuring robustness in the crawling process.
- Updated the Crawler struct to rename `storedGAVs` to `storedGAVCs`, reflecting the change to store GroupID, ArtifactID, Version, and Classifier hashes. - Adjusted the New function in the Crawler to initialize `storedGAVCs` accordingly. - Modified the central package to utilize `storedGAVCs` for consistent artifact processing. - Enhanced comments to clarify the purpose of the storedGAVCs map in preventing duplicate processing.
- Improved error logging in the Builder's Build method to include additional context (GroupID, ArtifactID, Version, Classifier) when SHA1 decoding fails. - Added validation for SHA1 hashes in the Crawler's Read method to ensure they are correctly formatted and of the expected length, enhancing data integrity checks.
So, if I understand correctly, obtaining the SHA1 can be done using only the Central Index, and GCS is not necessary. The Central Index is updated once a week, but I believe this delay is acceptable. Delays of over a week can occur, but I think that’s still acceptable for an OSS project. I tried updating the maven-index using only the Central Index. And build the database from this index.
The database generated using this maven-index showed no significant differences from our current one. Some artifacts (example) were missing, but they’re artifacts that don’t contain any deployable JARs (their classifier is like javadoc or test), so it’s correct that they were omitted @DmitriyLewen Could you double-check as well to ensure I haven’t overlooked anything? |
Before i start checking: |
There may have been a mistake in the comparison method. I'm currently double-checking it, so please wait a moment. Also, regarding the update frequency, it appears that updates are currently being made at least once every two weeks. |
I was mistaken. It turns out that the Central Index indeed does not include the SHA1 digest for some artifact JARs. For example, this sha1 is missing.
|
0.28.0 artifacts are 2 months old (according to the dates from Maven Central)... But i have sad news:
|
- Removed redundant SHA1 handling code from the GCS client. - Introduced a new `sha1` package to encapsulate SHA1 parsing logic. - Enhanced the `FetchSHA1` method to utilize the new `sha1.Parse` function for improved clarity and maintainability. - Updated dependencies in go.mod and go.sum for consistency.
- Added functionality to fetch missing SHA1s directly from Maven Central for records without classifiers. - Introduced a new map to track missing SHA1s and updated the Read method to handle records accordingly. - Enhanced error handling and logging during the fetching process, including status code checks and logging of fetched counts. - Improved the overall structure of the Source type to accommodate the new fetching logic, ensuring robust artifact processing.
- Added a condition to skip the "maven2/data/" directory during the listing process in the GCS lister. - This change prevents unnecessary processing of a directory that is not present in Maven Central, improving efficiency in the crawling operation.
- Introduced a new `maven` package with a `ValidateClassifier` function to encapsulate classifier validation logic. - Updated the `read` method in the `Source` type to utilize the new validation function, improving clarity and maintainability. - Enhanced the `fetch` method in the `Fetcher` to include version validation alongside groupID and artifactID checks, ensuring robust artifact processing.
- Updated the `read` method in the `Source` type to improve artifact type validation. - Removed the check for `fileExtension` being "jar" from the initial condition, allowing for more flexible handling of artifact types. - Added a separate check for `fileExtension` to ensure only records with the "jar" extension are processed, enhancing clarity and maintainability of the code.
I have modified the code to query Maven Central only when the SHA1 is not found in the Central Index. This significantly reduces the number of HTTP requests. Additionally, it has improved the coverage. When comparing the current database using this script, I found only 46 differences in GAVs. Furthermore, it seems that none of these 46 cases are valid GAVs. Details are as follows:
|
Very bad news... the index seems to contain wrong values 😭
The index holds aa50e73ec13a2140d990e4684ff0822b84caa962, but it should be 8293e469853b8fbff6ab842cbd07400c4831b94b |
this is very bad 😭 |
We can try to combinate 3 sources:
But i am still not sure about indexes. |
…ndexes - Eliminated the redundant check for classifier being "-" in the loadExistingIndexes method. - This change simplifies the code by directly assigning an empty string to classifier when it is not provided, enhancing clarity and maintainability.
…tral Index - Updated the `Source` type to enhance the handling of GAVC records and SHA1 fetching. - Removed reliance on unreliable SHA1 values from the central index, implementing a two-step process to ensure accurate SHA1 retrieval directly from Maven Central. - Introduced a new `records` slice to store all GAVC records, improving clarity and maintainability. - Enhanced logging to reflect the number of records processed and fetched, ensuring better tracking of operations. - Refined the `fetchSHA1s` method to process records more efficiently, improving overall performance.
I’m currently planning this as a two-phase process.
Although dropping the index-based SHA1 lookups increases the number of requests sent to Maven Central, it saves requests when building the list, and—because it’s a differential update—already-stored artifacts are skipped. I’ve implemented this strategy, so I’ll run it again on GitHub Actions. |
- Added a case to handle HTTP 403 Forbidden status in the fetchSHA1s method. - This change improves error handling by explicitly logging when access to a resource is forbidden, enhancing clarity in the fetching process.
- Improved error handling in the fetchSHA1s method by returning errors instead of skipping them on HTTP request failures. - Added logging for the number of fetched SHA1s every 10,000 records processed, providing better visibility into the fetching process. - Enhanced logging for failed SHA1 content reads, including the URL and error details, to aid in debugging and monitoring.
- Updated the fetchSHA1s method to return errors instead of skipping them on HTTP request failures, enhancing error visibility. - Added logging for the number of errors encountered during SHA1 fetching, providing better insights into the fetching process. - Enhanced logging in the crawlWithPipeline method to report the number of errors encountered during artifact processing.
- Enhanced the logic in the parseItemName function to better handle classifier extraction from filenames. - Updated comments to clarify the handling of special cases where classifiers may not follow the expected format. - This change improves the robustness of classifier parsing, ensuring that any remaining string is treated as a classifier when necessary.
- Introduced a new test file for the GCS fetcher to validate the parseItemName function. - Added multiple test cases covering various scenarios, including normal cases, edge cases, and invalid inputs. - This addition enhances test coverage and ensures the correctness of the classifier parsing logic.
- Introduced a new function `containsControlChar` to check for control characters in strings. - Updated the `read` method to skip records with control characters in `GroupID`, `ArtifactID`, `Version`, or `Classifier`. - Added unit tests for `containsControlChar` to ensure correct functionality across various input scenarios.
I want to believe this is some kind of bad joke, but I’ve found a case where the SHA1 values differ between Maven Central and GCS. # Maven Central
$ curl https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/3.1.3/onfido-api-client-3.1.3.jar.sha1
941c5572bc8fbe9203745c9eca4a0fa4d4ee0fbb%
# GCS
$ curl "https://storage.googleapis.com/download/storage/v1/b/maven-central/o/maven2%2Fcom%2Fonfido%2Fapi%2Fclient%2Fonfido-api-client%2F3.1.3%2Fonfido-api-client-3.1.3.jar.sha1?generation=1617659797337662&alt=media"
2b0f02371aa9be4644ebbf65186ab6b4956b498f It appears that the actual JAR files are different. # Maven Central
$ curl -s https://repo.maven.apache.org/maven2/com/onfido/api/client/onfido-api-client/3.1.3/onfido-api-client-3.1.3.jar | sha1sum
941c5572bc8fbe9203745c9eca4a0fa4d4ee0fbb -
# GCS
curl -s "https://storage.googleapis.com/download/storage/v1/b/maven-central/o/maven2%2Fcom%2Fonfido%2Fapi%2Fclient%2Fonfido-api-client%2F3.1.3%2Fonfido-api-client-3.1.3.jar?generation=1617659792993892&alt=media" | sha1sum
2b0f02371aa9be4644ebbf65186ab6b4956b498f - This is just my guess, but considering that the last update on Maven Central was April 7, 2021, and the last update on GCS was April 5, 2021, it’s possible that the JAR file on Maven Central was updated after it had already been synchronized to GCS. That update might not have been propagated to GCS, leaving the older artifact there. {
"kind": "storage#object",
"id": "maven-central/maven2/com/onfido/api/client/onfido-api-client/3.1.3/onfido-api-client-3.1.3.jar/1617659792993892",
"selfLink": "https://www.googleapis.com/storage/v1/b/maven-central/o/maven2%2Fcom%2Fonfido%2Fapi%2Fclient%2Fonfido-api-client%2F3.1.3%2Fonfido-api-client-3.1.3.jar",
"mediaLink": "https://storage.googleapis.com/download/storage/v1/b/maven-central/o/maven2%2Fcom%2Fonfido%2Fapi%2Fclient%2Fonfido-api-client%2F3.1.3%2Fonfido-api-client-3.1.3.jar?generation=1617659792993892&alt=media",
"name": "maven2/com/onfido/api/client/onfido-api-client/3.1.3/onfido-api-client-3.1.3.jar",
"bucket": "maven-central",
"generation": "1617659792993892",
"metageneration": "1",
"contentType": "application/java-archive",
"storageClass": "STANDARD",
"size": "33",
"md5Hash": "jMpWQ/+3GCzKnC4VS0TLXg==",
"contentLanguage": "en",
"crc32c": "52QnRw==",
"etag": "COTc0KqM6O8CEAE=",
"timeCreated": "2021-04-05T21:56:33.026Z",
"updated": "2021-04-05T21:56:33.026Z",
"timeStorageClassUpdated": "2021-04-05T21:56:33.026Z",
"timeFinalized": "2021-04-05T21:56:33.026Z",
"metadata": {
"checksum-sha1": "2b0f02371aa9be4644ebbf65186ab6b4956b498f",
"last-modified-epoch": "1617650729000",
"checksum-md5": "8cca5643ffb7182cca9c2e154b44cb5e",
"last-modified": "Mon, 05 Apr 2021 19:25:29 GMT"
}
}, |
It looks like there are about 100 such cases, assuming my comparison script is correct (I'll double-check how many of these cases there actually are). Out of 16 million artifacts, I believe having around 100 cases is within an acceptable range. I think it's fine to ignore them, but if they are particularly important, we could also consider fixing them manually. |
- Removed the `records` slice from the `Source` type to streamline record handling. - Implemented a channel-based pipeline for processing records, enhancing concurrency and clarity. - Introduced `readFromIndex`, `parseRecords`, and `fetchSHA1s` methods to separate concerns and improve maintainability. - Enhanced error handling and logging throughout the pipeline stages for better visibility. - Updated the `validateRecord` method to encapsulate record validation logic, ensuring cleaner code.
If my script works correctly, there are only 113 cases where SHA1 is wrong in GCS. I would accept these edge cases as I saw many wrong values in the current database.
|
Most likely, you are right. But in any case, I agree with you. 113 artifacts out of 16 million is an acceptable deviation. |
This reverts commit f727716.
@knqyf263 I see you're still working on this PR. Let me know when I can start reviewing it. |
Yes, I hope the job will complete successfully, and this PR will be ready. |
The job was successful. Also, I've checked maven-index and confirmed that all records except for a few edge cases in the current database are present in the TSV files. One point to note is that when inserting records into the database, a different GAV with the same SHA1 digest might be inserted. Therefore, I compared the database with the TSV files, rather than comparing databases. @DmitriyLewen You can start reviewing the PR. I didn't mark this as ready for review since this PR depends on #57. |
pkg/crawler/central/central.go
Outdated
groupPath := strings.ReplaceAll(rec.GroupID, ".", "/") | ||
|
||
// Build jar name: artifactId-version[-classifier].jar | ||
jarName := rec.ArtifactID + "-" + rec.Version + ".jar" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
classifiers are missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a reason why I didn't add a classifier here, but I can't recall it...
Fixed in 69a2117
Co-authored-by: DmitriyLewen <[email protected]>
Co-authored-by: DmitriyLewen <[email protected]>
- Removed the `records` parameter from the `read` method call in the `Source` type. - This change streamlines the method signature and improves clarity in the codebase.
- Updated the jar name construction logic to include the classifier if it is present. - This change ensures that the jar name follows the format: artifactId-version[-classifier].jar, improving clarity and correctness in artifact naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
But let's run Update Maven Index again to make sure the changes with classifiers didn't break anything.
Thanks for reviewing. Triggered. |
I confirmed that the database could be built with the latest index. |
I also tested that a number of detected components were the same between the current and the new DB.
|
Overview
This PR enhances the trivy-java-db crawler by introducing a source type abstraction that supports multiple artifact sources. It adds support for Maven Central Index as an additional source to complement the existing GCS-based crawling.
Description
Maven artifact crawling is a critical part of the trivy-java-db functionality. Currently, we rely solely on Google Cloud Storage (GCS) which mirrors Maven Central. However, our analysis has identified gaps in the coverage:
Therefore, we cannot rely on a single source. To improve coverage, we use both GCS and Maven Central Index. While the combination of GCS and Central Index still won't cover 100% of Maven Central artifacts, this approach improves our coverage for sure. The remaining limitations are accepted as constraints in OSS.
Implementation
This PR introduces a pluggable source architecture that allows crawling from both GCS and Maven Central Index:
SourceType
abstraction to support different artifact sources--source
) to specify the source type (gcs, central)Related PRs