Remove outdated Crypto logic and improve ETL configuration #138
base: main
- Added performance benchmarking tests for encryption (AES).
- Modularized encryption logic with togglable `encryption.enabled`.
- Refactored CSVLoader for Spring Boot compatibility.
- Added configuration options via `application.properties`.
- Improved logging and error handling across components.
This commit adds the Jakarta Annotation API dependency (version 2.1.1) to `common/pom.xml`. This ensures support for annotations provided by the Jakarta EE ecosystem.
- Switch to single-stage build using openjdk:11-jre-slim.
- Remove multi-stage Maven build and related dependencies.
- Add required JARs and scripts directly.
- Simplify ENTRYPOINT for cleaner execution.

- Upgrade Docker base image to OpenJDK 21 and adjust labels.
- Remove `UpdateClinicalVariableCounts` dependency from Dockerfile.
- Update CSVLoader to reference `allConcepts.csv` instead of `FHS_allConcepts.csv`.
Change default hpdsDirectory from "./" to "/opt/local/hpds/". Provides a more robust default configuration for the application. Still allows overriding via application.properties if needed.
- Updated cache size from 16 to 2048 in LoadingStore.
- Refactored CSVLoader to use ApplicationRunner and `@Value` properties.
- Replaced NO_ROLLUP flag with configurable rollupEnabled property.
- Simplified and optimized ETL loading and processing logic.
- Adjusted buffer size for CSV parsing to improve performance.
- Updated application properties to streamline configuration setup.
Changed the `etl.hpds.directory` property to use the absolute path `/opt/local/hpds` instead of `./` to avoid relative path issues.
Add a trailing slash to `etl.hpds.directory` for consistency.
AbstractProcessor
- Eliminated Crypto key checks in `query` and `querySync` endpoints.
- Switched hpdsDataDirectory to use Field injection instead of Method Parameter injection.
PicSureService.java
- Simplified logic by directly processing queries without key validation.
Configurations
- Added application-local-dev.properties for local configurations.
- Refactored `AbstractProcessor` to remove unused Crypto-related code.
Replace `CSVLoader` with `CSVLoaderNewSearch` in test setup.
The dumpStats() method call was removed to clean up the code. It was not serving any functional purpose in the current context.
Split CSVLoaderService into its own file for better modularity and maintainability. Removed the duplicate implementation from CSVLoaderNewSearch.
5cf238a to 805d2e0
Cleaned up unused imports in CSVLoaderService and CSVLoaderNewSearch to improve code readability and maintainability.
I think integration testing is a big no-no here; this should just be unit testing. Integration testing should be done in dedicated environments like nhanes-dev, bdc-dev, etc. It is too heavy, and keeping data sets in a Java project is expensive from an operational standpoint.
Ideally, yes. Given the existing complexity of HPDS, the integration tests are incredibly valuable for development.
Developer's Initial review
docker/pic-sure-hpds-etl/Dockerfile
Outdated
I imagine this could break some environments. The easiest solution is to create a Dockerfile explicitly for each project and store it with the project's assets.
  public LoadingCache<String, PhenoCube> store = CacheBuilder.newBuilder()
-         .maximumSize(16)
+         .maximumSize(2048)
Any thoughts on increasing the size of the cache here? 16 seems extremely low and is probably creating a lot of overhead. The size should probably be configurable, so the default value can be overridden.
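One way to make the limit overridable is to read it from a property and fall back to the default. The sketch below is not the PR's code: it uses a JDK `LinkedHashMap` as a stand-in for Guava's `CacheBuilder`, and the property name `etl.cache.max-size` and the class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

class CacheSizeSketch {
    // Default taken from this PR (16 was the previous value).
    static final int DEFAULT_MAX_SIZE = 2048;
    // Hypothetical property name; the PR does not define one.
    static final String MAX_SIZE_PROPERTY = "etl.cache.max-size";

    /** Returns the configured cache size, or the default when the property is absent or empty. */
    static int resolveMaxSize(Properties props) {
        String raw = props.getProperty(MAX_SIZE_PROPERTY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_MAX_SIZE;
        }
        int size = Integer.parseInt(raw.trim());
        if (size <= 0) {
            throw new IllegalArgumentException(MAX_SIZE_PROPERTY + " must be positive: " + size);
        }
        return size;
    }

    /** Size-bounded LRU map; records evicted keys the way a removal listener would see them. */
    static final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxSize;
        final List<K> evicted = new ArrayList<>();

        BoundedCache(int maxSize) {
            super(16, 0.75f, true); // access order, so the eldest entry is the least recently used
            this.maxSize = maxSize;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            boolean evict = size() > maxSize;
            if (evict) {
                evicted.add(eldest.getKey());
            }
            return evict;
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(MAX_SIZE_PROPERTY, "2");
        BoundedCache<String, Integer> cache = new BoundedCache<>(resolveMaxSize(props));
        cache.put("a", 1);
        cache.put("b", 2);
        cache.put("c", 3); // exceeds the limit and evicts the least recently used key
        System.out.println("evicted keys: " + cache.evicted);
    }
}
```

With Guava, the resolved value would simply be passed to `maximumSize(...)` in place of the literal.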
.removalListener(new RemovalListener<String, PhenoCube>() {

    @Override
    public void onRemoval(RemovalNotification<String, PhenoCube> arg0) {
        log.info("removing " + arg0.getKey());
        //log.debug("Cache removal and writing to disk: " + arg0.getKey());
This just spams the log. It's pretty useless knowledge unless you want to know whether I/O and the cache are working...
I will look at the runtime flags being set. Debug logging shouldn't need to be on all the time. This is an implementation issue, not a code issue.
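To illustrate the point about log levels: if the removal message is emitted at debug level, it only appears when the runtime configuration asks for it. This is a minimal sketch using `java.util.logging` (where `FINE` plays the role of debug) rather than the project's own logging setup; the logger name and class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class RemovalLogSketch {
    // Hypothetical logger name; the PR's LoadingStore has its own logger.
    static final Logger LOG = Logger.getLogger("etl.sketch.RemovalLogSketch");

    /** Collects published messages so the effect of the level is observable. */
    static final class CapturingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            messages.add(record.getMessage());
        }

        @Override
        public void flush() {
        }

        @Override
        public void close() {
        }
    }

    /** Called once per evicted key; the message is built only if FINE is enabled. */
    static void onRemoval(String key) {
        LOG.fine(() -> "Cache removal and writing to disk: " + key);
    }

    /** Simulates evictions at the given logger level and returns the messages that got through. */
    static List<String> run(Level level, String... keys) {
        CapturingHandler handler = new CapturingHandler();
        LOG.setUseParentHandlers(false);
        LOG.addHandler(handler);
        LOG.setLevel(level);
        try {
            for (String key : keys) {
                onRemoval(key);
            }
        } finally {
            LOG.removeHandler(handler);
        }
        return handler.messages;
    }

    public static void main(String[] args) {
        System.out.println("INFO level: " + run(Level.INFO, "a", "b").size() + " messages");
        System.out.println("FINE level: " + run(Level.FINE, "a", "b").size() + " messages");
    }
}
```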
CSVLoader is now a SpringBootApplication. It was pretty easy to use other code as a reference. Its methods have been ported to the service class.
etl/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/etl/phenotype/csv/CSVLoaderService.java
Outdated
Just trying out some benchmarking.
@@ -53,10 +53,9 @@ public class AbstractProcessor {
  @Autowired
  public AbstractProcessor(
      PhenotypeMetaStore phenotypeMetaStore,
-     GenomicProcessor genomicProcessor, @Value("${HPDS_DATA_DIRECTORY:/opt/local/hpds/}") String hpdsDataDirectory
+     GenomicProcessor genomicProcessor
I should probably figure out why I moved the injection from the constructor to a field. Constructor injection would be preferred.
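For reference, the usual argument for constructor injection is that the dependency becomes a `final` field that must be supplied at construction time, which also makes the class easy to build in a plain unit test. The sketch below is framework-free and uses a hypothetical class name; in Spring the constructor parameter would carry the `@Value("${HPDS_DATA_DIRECTORY:/opt/local/hpds/}")` annotation shown in the diff.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Objects;

class ProcessorSketch {
    // Default shown in the diff above.
    static final String DEFAULT_DATA_DIRECTORY = "/opt/local/hpds/";

    // final: the directory cannot be left unset or reassigned after construction.
    private final Path dataDirectory;

    /** In Spring, the parameter would be annotated with @Value(...) instead of being passed by hand. */
    ProcessorSketch(String hpdsDataDirectory) {
        this.dataDirectory = Paths.get(Objects.requireNonNull(hpdsDataDirectory, "hpdsDataDirectory"));
    }

    /** Resolves a file name against the injected data directory. */
    Path resolve(String fileName) {
        return dataDirectory.resolve(fileName);
    }

    public static void main(String[] args) {
        // No container needed: a test can pass any directory it likes.
        ProcessorSketch processor = new ProcessorSketch(DEFAULT_DATA_DIRECTORY);
        System.out.println(processor.resolve("example.bin")); // example.bin is a placeholder name
    }
}
```

With field injection, the same value is only populated after the container has constructed the object, so the field cannot be `final` and a test has to set it reflectively or through a setter.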
Used this for some local testing.
@@ -349,15 +343,11 @@ public ResponseEntity queryFormat(@RequestBody QueryRequest resultRequest) {

@PostMapping(value = "/query/sync", produces = MediaType.TEXT_PLAIN_VALUE)
public ResponseEntity querySync(@RequestBody QueryRequest resultRequest) {
    if (Crypto.hasKey(Crypto.DEFAULT_KEY_NAME)) {
What is the relationship between the Crypto Class having keys and the resource being locked?
processing/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/processing/AbstractProcessor.java
Outdated
service/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/service/PicSureService.java
Outdated
Introduce CSVLoaderServiceTest to validate ETL process with test data. Refactor LoggingStore to use consistent logging levels (info/debug) and improve log clarity. Add test CSV file for automated testing.
# Conflicts:
#   docker/pic-sure-hpds-etl/Dockerfile
#   etl/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/etl/LoadingStore.java
#   etl/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/etl/phenotype/csv/CSVLoaderNewSearch.java
#   service/src/main/java/edu/harvard/hms/dbmi/avillach/hpds/service/PicSureService.java
#   service/src/test/java/edu/harvard/hms/dbmi/avillach/hpds/test/util/BuildIntegrationTestEnvironment.java
Make a new class and leave main class as is?
I talked with George, and I will fold the changes from CSVNewLoader into CSVLoaderService.
- Renamed CSVLoaderService to LoaderService for clarity.
- Introduced a new Loader class as the entry point for Spring Boot.
- Added dumpStats method in LoadingStore for detailed ETL statistics.
- Improved CSV processing with CSVConfig integration.
- Consolidated redundant logic and cleaned up component wiring.

- Update `Crypto.hasKey` to respect `ENCRYPTION_ENABLED` flag.
- Add `@PostConstruct` to initialize `DO_VARNAME_ROLLUP` properly.
- Remove redundant cache invalidation in `LoaderService`.
84cf2e5 to 5ad5876
Just whitespace. Not sure why it's getting picked up; it looks like main and this branch both use tabs.
- These do not exist in main and are not required for this work.
@@ -156,6 +156,35 @@ public void dumpStats() {
        }
    }

    public void dumpStats(String hpdsDirectory) {
Overloading this to get some stats without having a hardcoded path.
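The overload pattern described here can be sketched as follows: the no-argument method keeps working for existing callers and delegates to the new one. This is an illustration only, not the PR's `LoadingStore`; the file name `stats-input.bin` and the class name are hypothetical, and the methods return a `Path` (the real ones are `void`) purely so the behaviour is observable.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class DumpStatsSketch {
    // Default directory used elsewhere in this PR's configuration.
    static final String DEFAULT_DIRECTORY = "/opt/local/hpds/";
    // Hypothetical file name, used only for illustration.
    static final String STATS_FILE = "stats-input.bin";

    /** Original entry point: unchanged for existing callers, now a thin delegate. */
    Path dumpStats() {
        return dumpStats(DEFAULT_DIRECTORY);
    }

    /** Overload: same logic, but the caller chooses the directory. */
    Path dumpStats(String hpdsDirectory) {
        Path source = Paths.get(hpdsDirectory).resolve(STATS_FILE);
        System.out.println("Reading stats from " + source);
        return source;
    }

    public static void main(String[] args) {
        DumpStatsSketch store = new DumpStatsSketch();
        store.dumpStats();                 // uses the default directory
        store.dumpStats("/tmp/hpds-test"); // e.g. a directory supplied by a test
    }
}
```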
    basePackages = "edu.harvard.hms.dbmi.avillach.hpds",
    includeFilters = @ComponentScan.Filter(type = FilterType.ASSIGNABLE_TYPE, classes = Crypto.class)
)
public class Loader {
Decided to leave the CSVLoaderNewSearch.java alone and utilize a new class to do this work.
Whitespace only. Bah.
@@ -21,6 +21,11 @@
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
</dependency>
<dependency>
    <groupId>jakarta.annotation</groupId>
Adding this for the `@PostConstruct` annotation, to work around some static fields in combination with Spring.
private static final Logger LOGGER = LoggerFactory.getLogger(Crypto.class);
private static final HashMap<String, byte[]> keys = new HashMap<>();

@Value("${encryption.enabled:true}")
Using a non-static field plus `@PostConstruct` to set the static field. The static methods are the reason. Changing them is not being pulled into scope.
static byte[] encryptData for example.
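The workaround discussed in this thread can be sketched without Spring on the classpath. Spring does not inject `@Value` into static fields, so an instance field receives the property and an init hook copies it into the static flag that the static methods read. In the sketch below the annotations appear only as comments, the setter stands in for the container's injection, and the class and method names are hypothetical.

```java
class EncryptionFlagSketch {
    // Static flag consulted by static helpers (the review mentions static methods such as encryptData).
    private static volatile boolean ENCRYPTION_ENABLED = true;

    // In Spring: @Value("${encryption.enabled:true}")
    private boolean encryptionEnabled = true;

    /** Stand-in for the container injecting the property value into the instance field. */
    void setEncryptionEnabled(boolean enabled) {
        this.encryptionEnabled = enabled;
    }

    // In Spring: @PostConstruct (jakarta.annotation.PostConstruct), invoked after injection.
    void init() {
        ENCRYPTION_ENABLED = encryptionEnabled;
    }

    /** Static accessor, usable from static code paths that cannot see the bean instance. */
    static boolean isEncryptionEnabled() {
        return ENCRYPTION_ENABLED;
    }

    public static void main(String[] args) {
        EncryptionFlagSketch bean = new EncryptionFlagSketch();
        bean.setEncryptionEnabled(false); // what the container would do from application.properties
        bean.init();                      // what the container would call after injection
        System.out.println("encryption enabled: " + isEncryptionEnabled());
    }
}
```

One caveat with this pattern: until the bean has been constructed and initialized, static callers see the hard-coded default rather than the configured value.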
Sample dataset to test LoaderService. Small extract of nhanes allConcepts.
Description
This pull request introduces several significant changes to improve the application's maintainability, performance, and configurations:
Crypto Removal:
- Eliminated Crypto key checks in the `query` and `querySync` endpoints.
- Simplified `PicSureService.java` to process queries without key validation.
- Cleaned up `AbstractProcessor` and configurations by eliminating unused Crypto-related code.
ETL Enhancements:
- Refactored `CSVLoader` to utilize Spring Boot features like `ApplicationRunner`.
- Added configurable properties (`rollupEnabled`, `hpdsDirectory`) for better maintainability.
- Changed the default `hpdsDirectory` to an absolute path (`/opt/local/hpds/`).
Dockerfile Update:
General Improvements:
Checklist