Remove outdated Crypto logic and improve ETL configuration #138

Open
wants to merge 23 commits into
base: main

Contributor

@TDeSain TDeSain commented Mar 3, 2025

Description

This pull request introduces several significant changes to improve the application's maintainability, performance, and configurations:

  • Crypto Removal:

    • Removed Crypto-based locking logic from query and querySync endpoints.
    • Simplified PicSureService.java to process queries without key validation.
    • Updated AbstractProcessor and configurations by eliminating unused Crypto-related code.
  • ETL Enhancements:

    • Increased cache size for improved performance.
    • Refactored CSVLoader to utilize Spring Boot features like ApplicationRunner.
    • Introduced configurable properties (rollupEnabled, hpdsDirectory) for better maintainability.
    • Changed default hpdsDirectory to an absolute path (/opt/local/hpds/).
    • Optimized ETL logic and improved CSV parsing buffer size.
  • Dockerfile Update:

    • Upgraded base image to OpenJDK 21.
    • Removed unused dependencies and simplified build stages for Docker.
  • General Improvements:

    • Added Jakarta Annotation API dependency to ensure compatibility with Jakarta EE specifications.
    • Refactored directory properties for consistency and robustness.

Checklist

  • Code changes have been reviewed and tested.
  • Related documentation has been updated where necessary.

TDeSain added 9 commits March 2, 2025 14:06
- Added performance benchmarking tests for encryption (AES).
- Modularized encryption logic with togglable `encryption.enabled`.
- Refactored CSVLoader for Spring Boot compatibility.
- Added configuration options via `application.properties`.
- Improved logging and error handling across components.
This commit adds the Jakarta Annotation API dependency (version 2.1.1)
to `common/pom.xml`. This ensures support for annotations provided
by the Jakarta EE ecosystem.
- Switch to single-stage build using openjdk:11-jre-slim.
- Remove multi-stage Maven build and related dependencies.
- Add required JARs and scripts directly.
- Simplify ENTRYPOINT for cleaner execution.
- Upgrade Docker base image to OpenJDK 21 and adjust labels.
- Remove `UpdateClinicalVariableCounts` dependency from Dockerfile.
- Update CSVLoader to reference `allConcepts.csv` instead of `FHS_allConcepts.csv`.
Change default hpdsDirectory from "./" to "/opt/local/hpds/".
Provides a more robust default configuration for the application.
Still allows overriding via application.properties if needed.
- Updated cache size from 16 to 2048 in LoadingStore.
- Refactored CSVLoader to use ApplicationRunner and `@Value` properties.
- Replaced NO_ROLLUP flag with configurable rollupEnabled property.
- Simplified and optimized ETL loading and processing logic.
- Adjusted buffer size for CSV parsing to improve performance.
- Updated application properties to streamline configuration setup.
Changed the `etl.hpds.directory` property to use the absolute path
`/opt/local/hpds` instead of `./` to avoid relative path issues.
Add a trailing slash to `etl.hpds.directory` for consistency.
AbstractProcessor
- Eliminated Crypto key checks in `query` and `querySync` endpoints.
- Switched hpdsDataDirectory to use Field injection instead of Method Parameter injection.

PicSureService.java
- Simplified logic by directly processing queries without key validation.

Configurations
- Added application-local-dev.properties for local configurations.
- Refactored `AbstractProcessor` to remove unused Crypto-related code.
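
The relative-path and trailing-slash concerns in the commit messages above can be sidestepped by resolving file names against a `Path` instead of concatenating strings. The following is a minimal sketch under stated assumptions: `HpdsPaths` is a hypothetical helper that is not part of this PR, and only the default `/opt/local/hpds/` and the file name `allConcepts.csv` come from the commits.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not from the PR: resolves files under the HPDS directory.
final class HpdsPaths {
    // Default taken from the commit message; callers may override it.
    static final String DEFAULT_HPDS_DIRECTORY = "/opt/local/hpds/";

    private HpdsPaths() {}

    // Falls back to the default when no directory is configured.
    static Path directory(String configured) {
        String dir = (configured == null || configured.trim().isEmpty())
                ? DEFAULT_HPDS_DIRECTORY
                : configured;
        // toAbsolutePath() pins a relative value such as "./" to the working
        // directory; normalize() then collapses "." and ".." segments.
        return Paths.get(dir).toAbsolutePath().normalize();
    }

    // resolve() supplies the separator itself, so a trailing slash is optional.
    static Path file(String configured, String fileName) {
        return directory(configured).resolve(fileName);
    }
}
```

With this shape, `HpdsPaths.file("/opt/local/hpds", "allConcepts.csv")` and `HpdsPaths.file("/opt/local/hpds/", "allConcepts.csv")` resolve to the same path, so the property value does not need a particular slash convention.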
@TDeSain TDeSain self-assigned this Mar 3, 2025
@TDeSain TDeSain added the enhancement New feature or request label Mar 3, 2025
TDeSain added 3 commits March 2, 2025 20:53
Replace `CSVLoader` with `CSVLoaderNewSearch` in test setup.
The dumpStats() method call was removed to clean up the code.
It was not serving any functional purpose in the current context.
Split CSVLoaderService into its own file for better modularity
and maintainability. Removed the duplicate implementation
from CSVLoaderNewSearch.
@TDeSain TDeSain force-pushed the encryption-optional branch from 5cf238a to 805d2e0 Compare March 3, 2025 03:20
Cleaned up unused imports in CSVLoaderService and CSVLoaderNewSearch
to improve code readability and maintainability.
Contributor Author

I think integration testing is a big no-no here; this should just be unit testing. Integration testing should be done in dedicated environments like nhanes-dev, bdc-dev, etc. It is too heavy, and keeping data sets in a Java project is expensive from an operational standpoint.

Contributor

Ideally, yes. Given the existing complexity of HPDS, though, the integration tests are incredibly valuable for development.

Contributor Author

@TDeSain TDeSain left a comment

Developer's Initial review

Contributor Author

@TDeSain TDeSain Jun 6, 2025

I imagine this could break some environments. The easiest solution is to create a Dockerfile explicitly for a project and store it with the project assets.

  public LoadingCache<String, PhenoCube> store = CacheBuilder.newBuilder()
- .maximumSize(16)
+ .maximumSize(2048)
Contributor Author

Any thoughts on increasing the size of the cache here? 16 seems extremely low and is probably creating a lot of eviction overhead. There should probably be a configuration property to override the default value.
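
As a rough illustration of making the cache size overridable, here is a JDK-only sketch that uses `LinkedHashMap` as a stand-in for the Guava cache in `LoadingStore`. The class name `BoundedStore` and the property name `etl.cache.size` are assumptions made for this example; the PR does not define either. Unlike Guava's cache, this stand-in is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// JDK-only stand-in for the Guava cache; names here are hypothetical.
final class BoundedStore<K, V> extends LinkedHashMap<K, V> {
    // Hypothetical property name; the PR does not define one.
    static final String SIZE_PROPERTY = "etl.cache.size";
    // Default mirrors the new value in the diff.
    static final int DEFAULT_SIZE = 2048;

    private final int maximumSize;
    private final BiConsumer<K, V> onRemoval;

    BoundedStore(int maximumSize, BiConsumer<K, V> onRemoval) {
        super(16, 0.75f, true); // accessOrder=true yields least-recently-used order
        this.maximumSize = maximumSize;
        this.onRemoval = onRemoval;
    }

    // Reads the size from a system property, falling back to the default.
    static <K, V> BoundedStore<K, V> fromSystemProperty(BiConsumer<K, V> onRemoval) {
        return new BoundedStore<>(Integer.getInteger(SIZE_PROPERTY, DEFAULT_SIZE), onRemoval);
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        boolean evict = size() > maximumSize;
        if (evict) {
            // Plays the role of the removal listener, e.g. writing the entry to disk.
            onRemoval.accept(eldest.getKey(), eldest.getValue());
        }
        return evict;
    }
}
```

Starting the JVM with `-Detl.cache.size=4096` would then change the bound without a code change. In the Spring-based loader the same idea would more naturally be an `@Value` property with a default.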

.removalListener(new RemovalListener<String, PhenoCube>() {

@Override
public void onRemoval(RemovalNotification<String, PhenoCube> arg0) {
log.info("removing " + arg0.getKey());
//log.debug("Cache removal and writing to disk: " + arg0.getKey());
Contributor Author

This just spams the log. It's pretty useless information unless you want to know whether I/O and the cache are working...

Contributor Author

Will look at the runtime flags being set. Shouldn't need debug all the time. Implementation issue not code.
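
One way to keep per-key eviction messages available without flooding normal runs is to emit them at debug level and let the runtime log level decide. The sketch below uses `java.util.logging` purely as a stand-in for the project's logger, so `FINE` plays the role of debug. The class `EvictionLog` and both message texts are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch only: java.util.logging stands in for the project's logging API.
final class EvictionLog {
    private static final Logger LOG = Logger.getLogger(EvictionLog.class.getName());

    private EvictionLog() {}

    // Per-key messages go out at FINE (debug), so an INFO-level run stays quiet.
    static void evicted(String key) {
        LOG.log(Level.FINE, "Cache removal and writing to disk: {0}", key);
    }

    // A single INFO line can summarise a batch instead of one line per key.
    static void summary(int evictions) {
        LOG.log(Level.INFO, "Evicted {0} entries from cache", evictions);
    }
}
```

With the logger at `INFO`, only `summary` produces output; lowering the level to `FINE` for a troubleshooting session brings the per-key lines back without touching the code.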

Contributor Author

CSVLoader is now a SpringBootApplication. Was pretty easy to look at other code for reference. Methods are ported to the service class.

Contributor Author

Just trying out some benchmarking.

@@ -53,10 +53,9 @@ public class AbstractProcessor {
@Autowired
public AbstractProcessor(
PhenotypeMetaStore phenotypeMetaStore,
- GenomicProcessor genomicProcessor, @Value("${HPDS_DATA_DIRECTORY:/opt/local/hpds/}") String hpdsDataDirectory
+ GenomicProcessor genomicProcessor
Contributor Author

Probably should figure out why I moved injection from the constructor to a field. Constructor would be preferred.
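
For reference, the usual argument for constructor injection is that the dependency can be `final`, is validated once, and can be supplied directly in a unit test without a Spring context. The plain-Java sketch below illustrates that; `ProcessorSketch` is a hypothetical class, and the Spring annotation appears only in a comment because Spring is not assumed to be on the classpath here.

```java
import java.util.Objects;

// Plain-Java sketch; in the PR the real class is a Spring-managed bean.
final class ProcessorSketch {
    // Default mirrors the value shown in the diff.
    static final String DEFAULT_HPDS_DATA_DIRECTORY = "/opt/local/hpds/";

    // The field can be final only because the value arrives via the constructor.
    private final String hpdsDataDirectory;

    // Under Spring, this parameter would carry
    // @Value("${HPDS_DATA_DIRECTORY:/opt/local/hpds/}").
    ProcessorSketch(String hpdsDataDirectory) {
        this.hpdsDataDirectory =
                Objects.requireNonNull(hpdsDataDirectory, "hpdsDataDirectory");
    }

    // Convenience constructor for callers that want the default.
    ProcessorSketch() {
        this(DEFAULT_HPDS_DATA_DIRECTORY);
    }

    String hpdsDataDirectory() {
        return hpdsDataDirectory;
    }
}
```

A test can then call `new ProcessorSketch("/tmp/hpds/")` directly, which is harder to do cleanly when the value is injected into a private field.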

Contributor Author

Used this for some local testing.

@@ -349,15 +343,11 @@ public ResponseEntity queryFormat(@RequestBody QueryRequest resultRequest) {

@PostMapping(value = "/query/sync", produces = MediaType.TEXT_PLAIN_VALUE)
public ResponseEntity querySync(@RequestBody QueryRequest resultRequest) {
if (Crypto.hasKey(Crypto.DEFAULT_KEY_NAME)) {
Contributor Author

What is the relationship between the Crypto Class having keys and the resource being locked?

TDeSain and others added 3 commits June 9, 2025 09:04
Introduce CSVLoaderServiceTest to validate ETL process with test data.
Refactor LoadingStore to use consistent logging levels (info/debug)
and improve log clarity. Add test CSV file for automated testing.
# Conflicts:
#	docker/pic-sure-hpds-etl/Dockerfile
#	etl/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/etl/LoadingStore.java
#	etl/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/etl/phenotype/csv/CSVLoaderNewSearch.java
#	service/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/service/PicSureService.java
#	service/src/test/java/edu/harvard/hms/dbmi/avillach/hpds/test/util/BuildIntegrationTestEnvironment.java
Contributor Author

Make a new class and leave main class as is?

Contributor Author

Talked with George; I will fold the changes from CSVNewLoader into CSVLoaderService.

TDeSain added 2 commits June 9, 2025 12:57
- Renamed CSVLoaderService to LoaderService for clarity.
- Introduced a new Loader class as the entry point for Spring Boot.
- Added dumpStats method in LoadingStore for detailed ETL statistics.
- Improved CSV processing with CSVConfig integration.
- Consolidated redundant logic and cleaned up component wiring.
- Update `Crypto.hasKey` to respect `ENCRYPTION_ENABLED` flag.
- Add `@PostConstruct` to initialize `DO_VARNAME_ROLLUP` properly.
- Remove redundant cache invalidation in `LoaderService`.
@TDeSain TDeSain force-pushed the encryption-optional branch from 84cf2e5 to 5ad5876 Compare June 9, 2025 17:40
Contributor Author

Just whitespace. Not sure why it's getting picked up; it looks like main and this branch both use tabs.

- These do not exist in main and not required for this work
@@ -156,6 +156,35 @@ public void dumpStats() {
}
}

public void dumpStats(String hpdsDirectory) {
Contributor Author

Overloading this to get some stats without having a hardcoded path.
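
The overload pattern described here can be sketched as follows. This shows only the shape of the change, not the actual statistics logic: `StatsSketch`, the file name `example-metadata.bin`, and the default directory constant are all hypothetical, and the methods return the resolved `Path` purely so the example can be checked (the real `dumpStats` methods are `void`).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the overload pattern only; the statistics logic is omitted.
final class StatsSketch {
    // Assumed default, standing in for whatever path was previously hardcoded.
    static final String DEFAULT_DIRECTORY = "/opt/local/hpds/";
    // Hypothetical file name used only for illustration.
    static final String STATS_SOURCE = "example-metadata.bin";

    // The existing entry point keeps working by delegating to the overload.
    Path dumpStats() {
        return dumpStats(DEFAULT_DIRECTORY);
    }

    // The overload lets the caller choose the directory.
    Path dumpStats(String hpdsDirectory) {
        Path source = Paths.get(hpdsDirectory).resolve(STATS_SOURCE);
        System.out.println("Would read statistics from " + source);
        return source;
    }
}
```

Having the no-argument method delegate is one option for keeping a single code path; whether the PR does this is not visible in the diff fragment.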

basePackages = "edu.harvard.hms.dbmi.avillach.hpds",
includeFilters = @ComponentScan.Filter(type = FilterType.ASSIGNABLE_TYPE, classes = Crypto.class)
)
public class Loader {
Contributor Author

Decided to leave the CSVLoaderNewSearch.java alone and utilize a new class to do this work.

Contributor Author

Whitespace only. Bah

@@ -21,6 +21,11 @@
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</dependency>
<dependency>
<groupId>jakarta.annotation</groupId>
Contributor Author

Adding this for the `@PostConstruct` annotation, to work around some static fields + Spring.


private static final Logger LOGGER = LoggerFactory.getLogger(Crypto.class);
private static final HashMap<String, byte[]> keys = new HashMap<>();

@Value("${encryption.enabled:true}")
Contributor Author

Using a non-static field plus `@PostConstruct` to populate the static field. The static methods are the reason; changing them is out of scope for this PR.

Contributor Author

static byte[] encryptData for example.
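
The workaround discussed in these comments exists because Spring's `@Value` does not populate static fields, while static methods can only read static state. A plain-Java sketch of the bridge is below. `CryptoFlagSketch` and its method names are hypothetical, and the annotations are shown as comments because neither Spring nor the Jakarta Annotation API is assumed on the classpath; a setter stands in for property injection.

```java
// Plain-Java sketch of bridging an injected instance value to a static flag.
final class CryptoFlagSketch {
    // Static flag read by static methods; defaults to enabled.
    private static volatile boolean ENCRYPTION_ENABLED = true;

    // @Value("${encryption.enabled:true}") would populate this instance field.
    private boolean encryptionEnabled = true;

    // Stands in for Spring's property injection in this sketch.
    void setEncryptionEnabled(boolean encryptionEnabled) {
        this.encryptionEnabled = encryptionEnabled;
    }

    // @PostConstruct would make the container call this after injection.
    void init() {
        ENCRYPTION_ENABLED = encryptionEnabled;
    }

    // Static callers see the configured value without a bean reference.
    static boolean isEncryptionEnabled() {
        return ENCRYPTION_ENABLED;
    }
}
```

The trade-off is that the static flag holds its default until the bean has been initialised, so any static call made before that point sees `true` regardless of configuration.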

Contributor Author

Sample dataset to test LoaderService. Small extract of nhanes allConcepts.

Labels
enhancement New feature or request
3 participants