feat: move to use maven-index repository with sharding support #57
Conversation
- Removed GroupID and ArtifactID from the Index struct, replaced with Packaging.
- Updated the Build function to derive GroupID and ArtifactID from the directory path.
- Enhanced file path creation to convert GroupID into directory names for JSON output.
- Adjusted JSON field names for the Version struct for better clarity.

- Introduced the indexDir flag for specifying the index repository directory.
- Updated the Build function to accept indexDir instead of cacheDir.
- Improved the SHA1 fetching logic to return a string instead of a byte slice.
- Enhanced error handling and logging for SHA1 validation.
- Adjusted JSON serialization for the Version struct to use a string for SHA1.

- Eliminated the dbc field from the Crawler struct as it is no longer needed.
- Updated the NewCrawler function to reflect the removal of the database connection.

- Enhanced the crawlSHA1 function to check for previously saved versions before fetching new SHA1s.
- Introduced a new savedVersions function to read existing versions from a JSON file.
- Added an Exists function in fileutil to check for file existence before attempting to open it (see the sketch below).
- Updated the JSON writing logic to use the complete file path for better clarity and maintainability.
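A minimal sketch of what such an `Exists` helper typically looks like in Go; the actual signature and error handling in `pkg/fileutil` may differ:

```go
package fileutil

import "os"

// Exists reports whether a file exists at the given path.
// Errors other than "does not exist" are returned to the caller.
func Exists(path string) (bool, error) {
	_, err := os.Stat(path)
	if err == nil {
		return true, nil
	}
	if os.IsNotExist(err) {
		return false, nil
	}
	return false, err
}
```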
- Upgraded the Go version from 1.22.7 to 1.23.0 and set the toolchain to go1.24.2.
- Updated the golang.org/x/sync dependency to version 0.13.0 for improved concurrency handling.
- Refactored the crawl function to use a simpler limit parameter instead of a weighted semaphore (see the sketch below).
- Simplified the index structure to include only SHA1, streamlining the data model.
- Improved error handling and logging in the crawlSHA1 function for better traceability.
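For context, `golang.org/x/sync`'s errgroup exposes exactly this kind of plain limit via `SetLimit`. The sketch below is illustrative only; `fetchSHA1`, `crawlAll`, and the surrounding names are hypothetical, not the PR's actual code:

```go
package crawler

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// fetchSHA1 is a hypothetical stand-in for the per-artifact work.
func fetchSHA1(ctx context.Context, gav string) error { return nil }

// crawlAll keeps at most `limit` goroutines in flight, replacing manual
// bookkeeping with a weighted semaphore.
func crawlAll(ctx context.Context, gavs []string, limit int) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(limit) // Go() blocks once `limit` goroutines are active
	for _, gav := range gavs {
		g.Go(func() error { // Go 1.22+ scopes `gav` per iteration
			return fetchSHA1(ctx, gav)
		})
	}
	return g.Wait()
}
```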
- Introduced the shardCount parameter to enable sharding of artifact records during processing.
- Updated the Crawler and Aggregator structures to handle sharding logic.
- Enhanced the crawling pipeline to include Lister, Fetcher, and Aggregator components for improved data handling.
- Implemented error handling and logging improvements throughout the crawling process.
- Adjusted the file writing logic to support sharded output files for better organization and performance.

…d output
- Added logging for every 100,000 processed artifacts to track progress.
- Updated the Aggregator to flush records every 100,000 processed instead of 10,000 for improved performance.
- Implemented a hierarchical directory structure for sharded output files, enhancing organization (see the sketch below).
- Created subdirectories based on the shard index to better manage output files.
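Purely as an illustration of the hierarchical layout idea (the PR's actual naming and nesting scheme may differ), a shard index can be mapped to a nested path so that no single directory accumulates too many files:

```go
package crawler

import (
	"fmt"
	"path/filepath"
)

// shardFilePath nests shard files in subdirectories keyed by shard index.
// The layout here is an assumption, e.g. shard 300 -> <indexDir>/01/0300.tsv.
func shardFilePath(indexDir string, shard int) string {
	sub := fmt.Sprintf("%02x", shard/256) // one subdirectory per 256 shards
	return filepath.Join(indexDir, sub, fmt.Sprintf("%04d.tsv", shard))
}
```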
- Increased maxResults from 5000 to 10000 to enhance performance during artifact fetching.
- Removed the randomSleep function to streamline the HTTP request process, reducing unnecessary delays.
- Cleaned up imports by removing unused packages, improving code clarity and maintainability.

… performance
- Updated maxResults from 5000 to 10000 to enhance performance during artifact fetching.

- Replaced the retryablehttp.Client with a new GCS client for improved artifact handling (see the sketch below).
- Updated the Lister and Fetcher components to use the GCS client for fetching SHA1 files.
- Introduced parallel processing for artifact listing to enhance performance.
- Cleaned up the code by removing unused functions and optimizing the structure for better maintainability.
- Enhanced logging to provide clearer insights during the crawling process.
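The PR appears to implement its own GCS client over HTTP (hence the `maxResults` tuning elsewhere in this thread). For orientation, the equivalent listing with the official `cloud.google.com/go/storage` library looks roughly like this; the `maven-central` bucket name and `maven2/` prefix are assumptions about the public Maven Central mirror:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()
	// The mirror bucket is public, so no credentials are required.
	client, err := storage.NewClient(ctx, option.WithoutAuthentication())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// List objects (e.g. *.jar.sha1 files) under one group/artifact prefix.
	it := client.Bucket("maven-central").Objects(ctx, &storage.Query{
		Prefix: "maven2/abbot/abbot/",
	})
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(attrs.Name)
	}
}
```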
- Removed the wrongSHA1s slice and replaced it with an atomic error count for better performance and thread safety (see the sketch below).
- Updated the Fetcher to increment the error count when a SHA1 cannot be fetched.
- Adjusted logging to report the error count instead of listing wrong SHA1s, streamlining the output during the crawl process.
- Modified the GCS client to return "N/A" for invalid SHA1 formats, improving error handling.
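The counter change is the standard `sync/atomic` pattern; a minimal sketch with illustrative names:

```go
package crawler

import (
	"log"
	"sync/atomic"
)

// missingSHA1 is safe to increment concurrently from many Fetcher goroutines.
var missingSHA1 atomic.Int64

func recordFetchFailure() {
	missingSHA1.Add(1) // replaces appending to a shared wrongSHA1s slice
}

func logResult() {
	log.Printf("artifacts missing SHA1: %d", missingSHA1.Load())
}
```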
- Introduced a new limit parameter for the Lister to optimize the parallelism of the listing process based on the number of stored GAVs.
- Adjusted the limit for the Fetcher to ensure it operates within the new constraints, improving overall performance during the crawl process.
- Enhanced comments to clarify the logic behind the limit adjustments for better maintainability.
As we discussed before, GCS is missing a lot of artifacts that exist in Maven Central.
- Introduced a new hash package to encapsulate hashing logic for GroupID, ArtifactID, and Version (see the sketch below).
- Updated the Builder to process TSV files and handle version mismatches more effectively.
- Improved logging to provide better insights during the index building process.
- Removed unused hash functions from the Crawler, streamlining the codebase.
- Enhanced error handling for SHA1 decoding and record processing in the Builder.
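A speculative sketch of what such `GA`/`GAV` helpers could look like (the real `pkg/hash` functions may use a different hash function or encoding):

```go
package hash

import "hash/fnv"

// GA hashes a group/artifact pair, e.g. for assigning records to shards.
func GA(groupID, artifactID string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(groupID + ":" + artifactID))
	return h.Sum64()
}

// GAV hashes group/artifact/version, e.g. for deduplicating stored records.
func GAV(groupID, artifactID, version string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(groupID + ":" + artifactID + ":" + version))
	return h.Sum64()
}
```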
- Introduced a new script `compare_sqlite.sh` for comparing two SQLite databases.
- The script checks for the existence of the database files and compares the record counts in the artifacts table.
- It exports `group_id` and `artifact_id` for both databases, allowing for a detailed comparison of records.
- Added functionality to display records present in one database but not the other, enhancing data validation capabilities.
Pull Request Overview
This PR refactors the database building process to use the aquasecurity/maven-index repository with added sharding functionality for improved file distribution.
- Removed obsolete types and updated index-related types.
- Added new hash functions and file utilities, including an Exists method.
- Updated crawler, builder, and CLI modules to support the new maven-index approach and sharding.
Reviewed Changes
Copilot reviewed 10 out of 22 changed files in this pull request and generated 2 comments.
| File | Description |
| --- | --- |
| pkg/types/types.go | Removed the old Version type to streamline index definitions. |
| pkg/hash/hash.go | Added new GA and GAV functions for sharding and deduplication. |
| pkg/fileutil/file.go | Updated the WriteJSON signature and added a new Exists function. |
| pkg/crawler/types.go | Updated types and JSON tags for index and GCS responses. |
| pkg/crawler/gcs.go | Added comprehensive GCS client functionality, including pagination. |
| pkg/crawler/crawler_test.go | Adjusted tests for crawler functionality with updated file paths and names. |
| pkg/builder/builder.go | Revised builder logic to process TSV files and improved logging. |
| cmd/trivy-java-db/main.go | Introduced new CLI flags for the index directory and shard count. |
Files not reviewed (12)
- go.mod: Language not supported
- misc/compare_sqlite.sh: Language not supported
- pkg/crawler/testdata/happy/abbot-with-db.json.golden: Language not supported
- pkg/crawler/testdata/happy/abbot-with-index.tsv.golden: Language not supported
- pkg/crawler/testdata/happy/abbot.json.golden: Language not supported
- pkg/crawler/testdata/happy/abbot.tsv.golden: Language not supported
- pkg/crawler/testdata/happy/abbot_abbot.html: Language not supported
- pkg/crawler/testdata/happy/abbot_abbot_0.12.3.html: Language not supported
- pkg/crawler/testdata/happy/abbot_abbot_0.13.0.html: Language not supported
- pkg/crawler/testdata/happy/abbot_abbot_1.4.0.html: Language not supported
- pkg/crawler/testdata/happy/index.html: Language not supported
- pkg/crawler/testdata/happy/maven-metadata.xml: Language not supported
Comments suppressed due to low confidence (1)
pkg/crawler/types.go:27
- [nitpick] The JSON tag "1" is non-descriptive; consider using a more meaningful tag name to improve clarity and maintainability.
```go
SHA1 string `json:"1"`
```
NOTE: To view a TSV file in the GitHub UI, the file size needs to be less than 512 KB, which means the shard count must be greater than 8192.
- Removed the check for the HTTP 404 status code in the GCS getObject method, simplifying error handling.
- This change streamlines the code by eliminating redundant error messages for non-existent objects.

- Changed the variable name from `indexDir` to `centralIndexDir` for clarity.
- Updated the `b.Build` method to use the new `centralIndexDir` variable, ensuring the correct path is used during the database build process.
@knqyf263 I left a comment.
What about the Makefile and workflows?
I think we need to update them too.
- Changed the log message to report the count of artifacts missing SHA1 instead of errors during the crawl process.
- This adjustment improves clarity in the logging output, providing better insight into the crawl results.
- Changed the error assignment in the loadExistingIndexes method from a declaration to an assignment for consistency.
- Added a note in the `processPrefix` method to clarify that the return value 'processed' is preserved for logging purposes.
- Updated the Go version from 1.23.0 to 1.24.
- Removed several indirect dependencies related to Google Cloud.
- Added new indirect dependencies, including `github.com/google/go-cmp` and `github.com/kr/pretty`.
- Updated the `golang.org/x/mod` version from 0.17.0 to 0.18.0.
- Adjusted `go.sum` to reflect the dependency changes.

- Deleted the `Metadata` and `Versioning` structs from `types.go` as they are no longer used in the codebase.
- This change simplifies the code and reduces clutter, improving maintainability.

- Reduced the maximum results from 10,000 to 5,000 in the JARSHA1Files method to comply with the GCS API limit.
- This ensures that requests do not exceed the maximum allowed results, preventing potential errors during item listing.

…update SHA1 handling
- Introduced a new `index.Reader` to handle CSV reading with tab delimiters, improving performance and code clarity (see the sketch below).
- Updated the `Build` method in the `builder` package to use the new `index.NewReader` for reading records.
- Modified SHA1 handling in the `crawler` package to use `index.NotAvailable` instead of the string "N/A", ensuring consistency across the codebase.
- Removed unused CSV reading logic from the `loadExistingIndexes` method, streamlining the code.
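`index.NewReader` presumably wraps the standard `encoding/csv` reader with a tab delimiter; a sketch under that assumption, with the `NotAvailable` constant named in the commit above:

```go
package index

import (
	"encoding/csv"
	"io"
)

// NotAvailable marks artifacts whose SHA1 could not be fetched.
const NotAvailable = "N/A"

// NewReader returns a CSV reader configured for tab-separated values.
func NewReader(r io.Reader) *csv.Reader {
	cr := csv.NewReader(r)
	cr.Comma = '\t'         // TSV instead of comma-separated
	cr.FieldsPerRecord = -1 // don't enforce a fixed column count
	cr.ReuseRecord = true   // reduce allocations when streaming large files
	return cr
}
```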
pkg/crawler/crawler.go (Outdated)
```go
groupID, artifactID, versionDir := record[0], record[1], record[2]

// Hash GAV and add to map
gavHash := hash.GAV(groupID, artifactID, versionDir)
```
Do we also need to check `version` here?
IIUC the assumption is: if a dir was saved (in the TSV file), then the file contains all versions from that dir.
But it seems there may be a case where we save only one version and then hit an error (e.g. internet access dropped).
This is a very rare case, though, so we can live with it.
On the other hand, trivy-java-db contains a lot of artifacts, so we might never notice it.
You can simulate this by saving only `1.4.0` in the test:

trivy-java-db/pkg/crawler/crawler_test.go, line 117 in 35bcd9a:

```go
indexContent := "abbot\tabbot\t1.4.0\t-\t0000000000000000000000000000000000000000\nabbot\tabbot\t1.4.0\t1.4.0-lite\t0000000000000000000000000000000000000000\n"
```
> IIUC the assumption is: if a dir was saved (in the TSV file), then the file contains all versions from that dir.
For example, if a new version is released, I believe we still need to obtain the SHA-1 hash for that new version, even if the artifact is already listed in the TSV. Am I overlooking something?
I meant the case when a version dir contains more than one artifact, e.g. the `1.4.0` version dir contains:
- 1.4.0
- 1.4.0-lite

We will use the same hash for both versions. But if the TSV file contains only `1.4.0`, we will not add `1.4.0-lite`, because `storedGAVs` already contains the hash for this artifact.
Changing the topic slightly, while looking through the contents of the index, I realized that I had probably been mistaken. The string appended after the version is likely what's generally referred to as a "classifier." In other words, if the classifier is null, it would be something like 1.4.0, and if the classifier is "lite", it would be 1.4.0-lite.
Considering that, I believe your point is that we should also take the classifier into account when making comparisons, right? The answer to that is YES — you are correct. I will update the code to refer to it as a classifier first, and then change the hash function.
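In hash terms, the fix amounts to feeding the classifier into the dedup key. A sketch only; the actual change landed in 751fe62 and may differ:

```go
package hash

import "hash/fnv"

// GAV includes the classifier, so version "1.4.0" with classifier "" and
// version "1.4.0" with classifier "lite" no longer collide in storedGAVs.
func GAV(groupID, artifactID, version, classifier string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(groupID + ":" + artifactID + ":" + version + ":" + classifier))
	return h.Sum64()
}
```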
Updated: 751fe62
BTW, if I understand correctly, "1.4.0-lite" has version "1.4.0" and the classifier is "lite", so I suppose it would be more accurate to detect the package as version 1.4.0. What do you think?
Ideally, the classifier should be added to the Java database, and Trivy should include the classifier information in the package data, but that would require some changes. So I’d like to discuss a temporary solution for now.
I think it's about the same.
But I think we don't need to separate version and classifier now:
- Users also don't know what is correct (I mean version or version + classifier).
- Developers can use both logics when adding a classifier.
- maven/purl/etc. don't have information about classifiers, so it can be confusing.
- No users asked about it, so I think versions with classifiers are rare.
Actually, PURL has a `classifier` qualifier:

pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1?type=zip&classifier=dist

https://github.com/package-url/purl-spec/blob/main/PURL-TYPES.rst#maven
I need to double-check my knowledge more often 😅
You are right.
Purl supports classifiers.
Maven also supports them - https://maven.apache.org/plugins/maven-deploy-plugin/examples/deploying-with-classifiers.html
So, after double-checking the information about classifiers, I think we can add them.
But I suggest asking the community about it.
It might be unnecessary work.
One more small thought: after adding classifiers, users may ask us to add all artifacts with classifiers (so we would need to add artifacts with `source`, `test`, etc. classifiers).
- Modified the `Record` struct in the `crawler` package to replace `VersionDir` with `Classifier`, allowing for better classification of artifacts.
- Updated the `Build` method in the `builder` package to handle records with classifiers, ensuring proper indexing and storage.
- Adjusted the `GAV` hashing function to include classifiers for deduplication, enhancing the accuracy of artifact identification.
- Updated test cases and golden files to reflect changes in the index format, ensuring consistency across the codebase.
pkg/types/types.go (Outdated)
```diff
@@ -14,11 +14,7 @@ type Index struct {
 	GroupID    string
 	ArtifactID string
 	Version    string
 	Classifier string
```
You don't use this field to build the DB.

Line 126 in 793673c:

```go
index.GroupID, index.ArtifactID, index.Version, index.SHA1, index.ArchiveType)
```
Added a comment: 72bbd7b
We must not forget that until we add this, the database will not include new artifacts with classifiers.
```go
	t.Fatalf("failed to create index dir: %v", err)
}
// The digest is intentionally set to all zeros to check that the record is skipped and not updated.
indexContent := "abbot\tabbot\t1.4.0\t\t0000000000000000000000000000000000000000\nabbot\tabbot\t1.4.0\tlite\t0000000000000000000000000000000000000000\n"
```
Let's keep only `1.4.0` here.
This is required to check that we handle the case correctly where the TSV file already contains an index and we get another index with the same version but a different classifier from GCS.

Suggested change:

```diff
-indexContent := "abbot\tabbot\t1.4.0\t\t0000000000000000000000000000000000000000\nabbot\tabbot\t1.4.0\tlite\t0000000000000000000000000000000000000000\n"
+indexContent := "abbot\tabbot\t1.4.0\t\t0000000000000000000000000000000000000000\n"
```
One more comment, @knqyf263.
I'll update it later.
Co-authored-by: DmitriyLewen <[email protected]>
Co-authored-by: DmitriyLewen <[email protected]>
- Updated the `Classifier` field in the `Index` struct within `types.go` to include a TODO comment indicating that it is not currently used but retained for future use.
- Reduced the maximum results from 10,000 to 5,000 in the TopLevelPrefixes method to comply with the GCS API limit.
- This ensures that requests do not exceed the maximum allowed results, preventing potential errors during item listing.

- Changed the comment from "Close all writers and files when done" to "Flush all writers and close all files when done" for clarity.
- Removed the redundant final flush of CSV writers and buffered writers, as it is no longer necessary.
LGTM
Did you collect artifacts after these changes?
Yes, I did, but I found another problem. While I see a lot of artifacts under https://repo.maven.apache.org/maven2/data … Do you know anything?

I've changed it to skip …

I didn't know anything about the … But it looks like GCS contains duplicates in this directory: …

But …

Yes, I also observed it, which is really weird...
Overview
This PR changes the database building process to utilize the aquasecurity/maven-index repository instead of creating a database directly. It also adds sharding functionality to distribute files evenly across the filesystem, preventing performance issues caused by too many files in a repository.

Description

The main changes in this PR:

- … (`aquasecurity/maven-index`)
- `--index-dir`: Specify the directory where Maven index data is stored
- `--shards`: Configure the number of shards (default: 256)

This approach solves two issues: …
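For reference, a hypothetical invocation with the new flags; the `crawl` and `build` subcommand names follow the existing CLI and may differ:

```sh
# Crawl Maven Central and write sharded TSV indexes into the index repository.
trivy-java-db crawl --index-dir ./maven-index --shards 256

# Build the SQLite database from the sharded index repository.
trivy-java-db build --index-dir ./maven-index
```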
Test run
https://github.com/aquasecurity/maven-index/actions/runs/14766996699