A Python script extracts NHANES questionnaire variable names and descriptions from an online source:
- Fetches webpage content using `requests`.
- Parses HTML to locate the relevant table with `BeautifulSoup`.
- Extracts variable names and descriptions and saves them for reference.
- The extracted data is used to map question codes to their definitions in dbt models.
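The extraction step above can be sketched as follows. This is a minimal illustration, not the project's actual script: the codebook URL, the assumption that the first `<table>` on the page is the variable table, and the output filename are all placeholders.

```python
# Sketch of the NHANES metadata extraction step.
# The table layout and file names below are assumptions for illustration.
import csv

import requests
from bs4 import BeautifulSoup


def parse_variable_table(html):
    """Return (variable, description) pairs from the first HTML table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # assumes the first table is the codebook
    pairs = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:  # header rows use <th>, so they yield no <td> cells
            pairs.append((cells[0], cells[1]))
    return pairs


def fetch_and_save(url, path="questionnaire_metadata.csv"):
    """Download a codebook page and save variable/description pairs to CSV."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["variable", "description"])
        writer.writerows(parse_variable_table(resp.text))
```

The saved CSV can then be loaded as a dbt seed so the question-code-to-definition mapping is available inside the models.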
- Loads raw questionnaire data from the source database using `dbt seed`.
- Integrates metadata by mapping question codes to definitions.
- Ensures each record contains:
- Respondent ID (unique identifier)
- Question Code (survey question identifier)
- Response (raw data)
- Question Type (categorization)
- Definition (human-readable description)
Enhances the dataset by:
- Adding short definitions: assigns human-friendly labels to question codes (e.g., `HIQ031C` → `is_covered_by_medigap`).
- Standardizing responses: maps raw response codes to meaningful values, such as 1 → Yes, 2 → No, 7/9 → Refused.
- Unmapped responses default to No or Unknown.
- Produces a structured dataset with cleaned responses and easy-to-understand column names.
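The standardization described above amounts to two lookup tables: one from question codes to short column names, and one from raw response codes to readable values. A minimal Python sketch (the mapping entries beyond `HIQ031C` and the `Unknown` default are illustrative assumptions):

```python
# Illustrative mapping tables; only HIQ031C appears in the source text,
# the rest of the behavior (e.g., the "Unknown" default) is an assumption.
SHORT_DEFINITIONS = {
    "HIQ031C": "is_covered_by_medigap",
}

RESPONSE_MAP = {
    "1": "Yes",
    "2": "No",
    "7": "Refused",
    "9": "Refused",
}


def standardize_response(raw, default="Unknown"):
    """Map a raw NHANES response code to a human-readable value."""
    return RESPONSE_MAP.get(str(raw).strip(), default)
```

In the actual pipeline this mapping lives in the dbt models (e.g., as a seed table joined in SQL), but the logic is the same.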
- Merges demographic data with health insurance responses.
- Aggregates insurance coverage statistics by gender.
- Uses SQL calculations to count respondents with and without coverage.
- Uses DuckDB for ad hoc testing of dbt models before deployment.
- Ensures:
- All question codes are correctly mapped.
- Extracted metadata aligns with dbt outputs.
- Data integrity and consistency checks pass.
- Reference Data: Extracted questionnaire metadata.
- Processed dbt Models: Cleaned and structured data ready for analysis.
- Aggregated Reports: Insights into health insurance coverage trends.
- The merged dataset is loaded.
- Only numerical columns are retained.
- Missing values are imputed using the median.
- The dataset is split into:
- Features (X)
- Target variable (y) – indicating diabetes presence.
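The preprocessing steps above can be sketched with pandas. The target column name `has_diabetes` is a hypothetical placeholder; the source does not name the actual column:

```python
# Sketch of the preprocessing step: keep numeric columns, impute with the
# median, and split features from the target. 'has_diabetes' is a
# hypothetical column name, not the project's actual one.
import numpy as np
import pandas as pd


def prepare_features(df, target="has_diabetes"):
    numeric = df.select_dtypes(include=[np.number])  # retain numeric columns only
    numeric = numeric.fillna(numeric.median())       # median imputation per column
    X = numeric.drop(columns=[target])
    y = numeric[target]
    return X, y
```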
- 5-fold Stratified K-Fold cross-validation ensures class balance across folds.
- For each fold:
- The dataset is split into training and testing sets.
- SMOTE is applied to balance class distribution in the training set.
- StandardScaler is used for feature scaling.
- A Random Forest model selects the top 15 most important features.
- These features are used for training and testing.
- A BaggingClassifier with a Random Forest base model is trained.
- Predictions are made on the test set.
- Evaluation metrics include:
- Accuracy score
- Macro-averaged F1 score
- Classification report
Our diabetes prediction model was evaluated using 5-fold cross-validation, leveraging a Random Forest classifier for feature selection and classification.
The most influential features across all folds include:
- Glycohemoglobin (HbA1c) – most important predictor
- Glucose (Refrigerated Serum, mmol/L & mg/dL) – second most important
- Osmolality (mmol/kg)
- Albumin-Creatinine Ratio (mg/g)
- Triglycerides (Refrigerated, mmol/L & mg/dL)
- Basophils Count (per 1000 cells/µL)
- Blood Urea Nitrogen (BUN, mg/dL & mmol/L)
- Albumin (g/dL, g/L, and Urine mg/L)
- Incomplete OGTT Comment Code
| Fold | Accuracy | Macro F1-score | Precision (Diabetic Class) | Recall (Diabetic Class) |
|---|---|---|---|---|
| 1 | 95.52% | 0.8319 | 0.60 | 0.81 |
| 2 | 94.75% | 0.8124 | 0.55 | 0.81 |
| 3 | 94.75% | 0.8052 | 0.55 | 0.76 |
| 4 | 96.69% | - | - | - |
| 5 | Pending | Pending | Pending | Pending |
- The model performs well in distinguishing diabetic vs. non-diabetic cases, achieving an average accuracy of ~95%.
- The recall for the diabetic class (~76-81%) suggests the model effectively identifies diabetic cases but still misses some.
- The precision for the diabetic class (55-60%) indicates some false positives.
- Feature importance analysis consistently ranks glycohemoglobin and glucose levels as the strongest predictors.

