A Python script extracts NHANES questionnaire variable names and descriptions from an online source:
- Fetches webpage content using `requests`.
- Parses HTML to locate the relevant table with `BeautifulSoup`.
- Extracts variable names and descriptions and saves them for reference.
- The extracted data is used to map question codes to their definitions in dbt models.
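The extraction step above can be sketched as follows. This is a minimal illustration, not the project's actual script: the codebook URL, the assumption that the first `<table>` on the page is the variable table, and the output filename are all placeholders.

```python
# Sketch of the NHANES metadata extraction step.
# The table layout and file names below are assumptions for illustration.
import csv

import requests
from bs4 import BeautifulSoup


def parse_variable_table(html):
    """Return (variable, description) pairs from the first HTML table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # assumes the first table is the codebook
    pairs = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:  # header rows use <th>, so they yield no <td> cells
            pairs.append((cells[0], cells[1]))
    return pairs


def fetch_and_save(url, path="questionnaire_metadata.csv"):
    """Download a codebook page and save variable/description pairs to CSV."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["variable", "description"])
        writer.writerows(parse_variable_table(resp.text))
```

The saved CSV can then be loaded as a dbt seed so the question-code-to-definition mapping is available inside the models.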
- Loads raw questionnaire data from the source database using `dbt seed`.
- Integrates metadata by mapping question codes to definitions.
- Ensures each record contains:
- Respondent ID (unique identifier)
- Question Code (survey question identifier)
- Response (raw data)
- Question Type (categorization)
- Definition (human-readable description)
Enhances the dataset by:
- Adding short definitions: assigns human-friendly labels to question codes (e.g., `HIQ031C` → `is_covered_by_medigap`).
- Standardizing responses: maps raw response codes to meaningful values, such as 1 → Yes, 2 → No, 7/9 → Refused.
- Unmapped responses default to No or Unknown.
- Produces a structured dataset with cleaned responses and easy-to-understand column names.
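The standardization described above amounts to two lookup tables: one from question codes to short column names, and one from raw response codes to readable values. A minimal Python sketch (the mapping entries beyond `HIQ031C` and the `Unknown` default are illustrative assumptions):

```python
# Illustrative mapping tables; only HIQ031C appears in the source text,
# the rest of the behavior (e.g., the "Unknown" default) is an assumption.
SHORT_DEFINITIONS = {
    "HIQ031C": "is_covered_by_medigap",
}

RESPONSE_MAP = {
    "1": "Yes",
    "2": "No",
    "7": "Refused",
    "9": "Refused",
}


def standardize_response(raw, default="Unknown"):
    """Map a raw NHANES response code to a human-readable value."""
    return RESPONSE_MAP.get(str(raw).strip(), default)
```

In the actual pipeline this mapping lives in the dbt models (e.g., as a seed table joined in SQL), but the logic is the same.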
- Merges demographic data with health insurance responses.
- Aggregates insurance coverage statistics by gender.
- Uses SQL calculations to count respondents with and without coverage.
- Uses DuckDB for ad hoc testing of dbt models before deployment.
- Ensures:
- All question codes are correctly mapped.
- Extracted metadata aligns with dbt outputs.
- Data integrity and consistency checks pass.
- Reference Data: Extracted questionnaire metadata.
- Processed dbt Models: Cleaned and structured data ready for analysis.
- Aggregated Reports: Insights into health insurance coverage trends.
- The merged dataset is loaded.
- Only numerical columns are retained.
- Missing values are imputed using the median.
- The dataset is split into:
- Features (X)
- Target variable (y) – indicating diabetes presence.
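The preprocessing steps above can be sketched with pandas. The target column name `has_diabetes` is a hypothetical placeholder; the source does not name the actual column:

```python
# Sketch of the preprocessing step: keep numeric columns, impute with the
# median, and split features from the target. 'has_diabetes' is a
# hypothetical column name, not the project's actual one.
import numpy as np
import pandas as pd


def prepare_features(df, target="has_diabetes"):
    numeric = df.select_dtypes(include=[np.number])  # retain numeric columns only
    numeric = numeric.fillna(numeric.median())       # median imputation per column
    X = numeric.drop(columns=[target])
    y = numeric[target]
    return X, y
```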
- 5-fold Stratified K-Fold cross-validation ensures class balance across folds.
- For each fold:
- The dataset is split into training and testing sets.
- SMOTE is applied to balance class distribution in the training set.
- StandardScaler is used for feature scaling.
- A Random Forest model selects the top 15 most important features.
- These features are used for training and testing.
- A BaggingClassifier with a Random Forest base model is trained.
- Predictions are made on the test set.
- Evaluation metrics include:
- Accuracy score
- Macro-averaged F1 score
- Classification report
Our diabetes prediction model was evaluated using 5-fold cross-validation, leveraging a Random Forest classifier for feature selection and classification.
The most influential features across all folds include:
- Glycohemoglobin (HbA1c) – most important predictor
- Glucose (Refrigerated Serum, mmol/L & mg/dL) – second most important
- Osmolality (mmol/kg)
- Albumin-Creatinine Ratio (mg/g)
- Triglycerides (Refrigerated, mmol/L & mg/dL)
- Basophils Count (per 1000 cells/µL)
- Blood Urea Nitrogen (BUN, mg/dL & mmol/L)
- Albumin (g/dL, g/L, and Urine mg/L)
- Incomplete OGTT Comment Code
| Fold | Accuracy | Macro F1-score | Precision (Diabetic Class) | Recall (Diabetic Class) |
|---|---|---|---|---|
| 1 | 95.52% | 0.8319 | 0.60 | 0.81 |
| 2 | 94.75% | 0.8124 | 0.55 | 0.81 |
| 3 | 94.75% | 0.8052 | 0.55 | 0.76 |
| 4 | 96.69% | - | - | - |
| 5 | Pending | Pending | Pending | Pending |
- The model performs well in distinguishing diabetic vs. non-diabetic cases, achieving an average accuracy of ~95%.
- The recall for the diabetic class (~76-81%) suggests the model effectively identifies diabetic cases but still misses some.
- The precision for the diabetic class (55-60%) indicates some false positives.
- Feature importance analysis consistently ranks glycohemoglobin and glucose levels as the strongest predictors.

