feat: use state machine to process package stats #1808
This PR refactors the package stats feature to use AWS Step Functions instead of a single Lambda function. The change addresses a fundamental scalability challenge: fetching download statistics from NPM's API for thousands of packages is too slow and resource-intensive for a single Lambda execution.
Why This Change
The package stats feature provides users with NPM download counts on package pages, helping them make informed decisions about which packages to adopt based on community usage. However, the original implementation using a single Lambda function couldn't scale to handle large package catalogs efficiently. Fetching statistics for each package sequentially would take hours and risk Lambda timeouts, while fetching them all in parallel would overwhelm NPM's API and exhaust Lambda memory.
The solution is a distributed map-reduce architecture using Step Functions. By splitting the work into chunks and processing them in parallel with controlled concurrency, we can complete updates within a reasonable timeframe while respecting NPM's rate limits. The state machine provides built-in retry logic, error handling, and visibility into which stage of processing fails.
State Machine Architecture
The workflow follows a three-phase map-reduce pattern:
The Chunker reads the catalog to determine which packages need statistics and divides them into manageable groups. The Processor functions run in parallel, each fetching statistics from NPM for its assigned packages and writing intermediate results to S3. The Aggregator collects these intermediate results and produces the final `stats.json` file that the frontend consumes. The map state is configured with a maximum concurrency of 10 to avoid overwhelming the NPM API, and each processor includes retry logic with a 5-minute backoff to handle transient issues.
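As a rough illustration only, the wiring could look something like the CDK sketch below. The construct and property names, the `$.chunks` items path, the retry attempt count, and the `chunkerFn`/`processorFn`/`aggregatorFn` handles are assumptions made for the example, not taken verbatim from this change.

```ts
import { Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

export interface PackageStatsProps {
  readonly chunkerFn: lambda.IFunction;
  readonly processorFn: lambda.IFunction;
  readonly aggregatorFn: lambda.IFunction;
}

export class PackageStatsStateMachine extends Construct {
  public readonly stateMachine: sfn.StateMachine;

  constructor(scope: Construct, id: string, props: PackageStatsProps) {
    super(scope, id);

    // Phase 1: split the catalog into chunks of packages to look up.
    const chunk = new tasks.LambdaInvoke(this, 'Chunk', {
      lambdaFunction: props.chunkerFn,
      payloadResponseOnly: true,
    });

    // Phase 2: fetch NPM download counts for one chunk; intermediate results go to S3.
    const processOne = new tasks.LambdaInvoke(this, 'Process', {
      lambdaFunction: props.processorFn,
      payloadResponseOnly: true,
    });
    // Retry transient NPM/API failures with a 5-minute backoff.
    processOne.addRetry({
      errors: ['States.ALL'],
      interval: Duration.minutes(5),
      maxAttempts: 3, // assumed retry count
    });

    // Fan out over the chunks, capped at 10 concurrent processors.
    const processChunks = new sfn.Map(this, 'ProcessChunks', {
      maxConcurrency: 10,
      itemsPath: sfn.JsonPath.stringAt('$.chunks'), // assumed chunker output shape
    });
    processChunks.iterator(processOne);

    // Phase 3: merge the intermediate S3 objects into the final stats.json.
    const aggregate = new tasks.LambdaInvoke(this, 'Aggregate', {
      lambdaFunction: props.aggregatorFn,
      payloadResponseOnly: true,
    });

    this.stateMachine = new sfn.StateMachine(this, 'StateMachine', {
      definition: chunk.next(processChunks).next(aggregate),
    });
  }
}
```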
What Changed
The implementation replaces the single Lambda function with a Step Functions state machine orchestrating three specialized Lambda functions. The CloudWatch alarm now monitors state machine failures instead of Lambda errors, providing better visibility into which stage fails. The operator runbook has been significantly expanded with narrative explanations of why the feature exists, how it is architected, detailed investigation steps for each failure scenario, and a diagram of the state machine. The backend dashboard now links to the state machine and to all three Lambda functions for easier troubleshooting.
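For context, pointing the alarm at the state machine rather than at the individual functions might look roughly like the following sketch; the alarm id, threshold, and period are assumptions for illustration.

```ts
import { Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';

// stateMachine: the sfn.StateMachine built in the sketch above.
declare const stateMachine: sfn.StateMachine;

// Alarm on failed executions so operators start from the state machine
// (and drill down to the failing stage) instead of a single Lambda's
// error metric.
new cloudwatch.Alarm(stateMachine, 'PackageStatsFailureAlarm', {
  metric: stateMachine.metricFailed({ period: Duration.minutes(5) }),
  threshold: 1,
  evaluationPeriods: 1,
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
});
```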
Comprehensive unit tests cover all three Lambda functions, and a new integration test validates the complete end-to-end workflow execution. All existing tests pass, and snapshot tests have been updated to reflect the new infrastructure.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license