feat: add OpenDataCapture to ReproSchema converter#108
feat: add OpenDataCapture to ReproSchema converter#108
Conversation
- Add odc_mappings.py with type mapping functions - Add odc2reproschema.py converter module - Add 'odc2reproschema' CLI command - Add tests for type mappings and CLI Supports conversion of: - Simple fields (string, number, boolean, date) - Likert scales (expanded to multiple items) - Record arrays (prefixed sub-items) - Multiple choice fields
for more information, see https://pre-commit.ci
Summary of ChangesHello @yibeichan, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the data interoperability capabilities of the system by introducing a robust converter for OpenDataCapture (ODC) instruments to the ReproSchema format. This new feature streamlines the process of integrating ODC-based data collection forms into ReproSchema workflows, ensuring that complex field structures like Likert scales and record arrays are accurately translated. The addition of a dedicated CLI command simplifies execution for users, making the conversion process accessible and efficient. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a new feature to convert OpenDataCapture (ODC) form instruments into ReproSchema format, along with a CLI command and corresponding tests. The implementation covers various field types, including simple fields, likert scales, record arrays, and multiple-choice fields. The code is well-structured and includes appropriate error handling for file operations and JSON parsing. The type mapping logic is comprehensive, and the CLI integration is seamless. Overall, this is a valuable addition to the project.
However, I've identified a few areas for improvement:
- Type Conversion Robustness: The
process_optionsfunction inodc2reproschema.pymight silently fail to convert values to their declared XSD type, which could lead to unexpected data types in the output. It would be more robust to explicitly handle such cases. - Mapping Fallback Logic: In
odc_mappings.py, the fallback logic forget_odc_input_typewhen a specific variant is not found could be made more explicit to ensure deterministic behavior. - Test Coverage for Content: The CLI tests currently verify the existence of generated files but do not assert their content. Adding assertions for the content of the converted ReproSchema files would significantly improve test robustness.
I am having trouble creating individual review comments. Click here to see my feedback.
reproschema/tests/test_odc2reproschema.py (93-104)
The current CLI tests only verify the existence of the generated files and directories. To ensure the conversion logic is correct, it would be beneficial to also assert the content of these files. For example, load the generated test_instrument_schema and item1 JSON-LD files and check if key fields like prefLabel, question, inputType, and responseOptions match the expected values based on the test_data fixture.
This would provide much stronger validation of the converter's functionality.
reproschema/odc2reproschema.py (66-75)
The try-except ValueError: pass block for type conversion in process_options can lead to silent failures where a value intended to be an integer or float remains a string if conversion fails. This might result in type mismatches in the generated ReproSchema, which expects strict XSD types.
Consider logging a warning when a value cannot be converted to its declared type, or explicitly ensuring that processed_val is always a string if the conversion fails, to maintain consistency and provide better debugging information.
processed_val = val
if "integer" in value_type:
try:
processed_val = int(val)
except ValueError:
logger.warning("Could not convert value '%s' to integer for type '%s'. Keeping as string.", val, value_type)
processed_val = str(val)
elif "decimal" in value_type or "float" in value_type:
try:
processed_val = float(val)
except ValueError:
logger.warning("Could not convert value '%s' to float/decimal for type '%s'. Keeping as string.", val, value_type)
processed_val = str(val)reproschema/odc_mappings.py (80-81)
The fallback if variants: return list(variants.values())[0] is non-deterministic as dictionary iteration order is not guaranteed across Python versions (though it is insertion-ordered in 3.7+). Relying on the 'first' item might lead to unexpected behavior if the dictionary structure or Python version changes. It would be more robust to define an explicit default for each kind if no specific variant or 'None' default is found, or return a generic 'text' type if no better match is available.
# If no default, return a generic 'text' type as a safe fallback
if variants:
logger.debug("No specific variant or 'None' default found for kind '%s', variant '%s'. Returning generic 'text'.", kind, variant)
return "text"
There was a problem hiding this comment.
Pull request overview
Adds an OpenDataCapture (ODC) → ReproSchema conversion path to the reproschema toolkit, including mapping utilities, a converter module, a new CLI command, and initial tests.
Changes:
- Added ODC type/input mapping helpers (
odc_mappings.py). - Added an
odc2reproschemaconverter module that generates activity + protocol outputs. - Added
odc2reproschemaCLI command and a basic CLI+mapping test.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
reproschema/odc_mappings.py |
Defines ODC→ReproSchema type/input mapping utilities used by the converter. |
reproschema/odc2reproschema.py |
Implements the ODC JSON conversion into ReproSchema activity/protocol output structure. |
reproschema/cli.py |
Registers a new odc2reproschema click command to invoke the converter. |
reproschema/tests/test_odc2reproschema.py |
Adds tests for mapping helpers and a minimal end-to-end CLI conversion run. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| name = instrument_name or internal.get("name") or input_path.stem | ||
| title = details.get("title") or name | ||
| description = details.get("description") or "" | ||
| version = str(internal.get("edition", "1")) | ||
|
|
There was a problem hiding this comment.
Instrument names are used as directory names and schema ids without normalization. Other converters/protocol creation normalize names (e.g. replace spaces with underscores), and create_protocol_schema will also sanitize the protocol name—leading to mismatched activity vs protocol paths if instrument_name contains spaces. Normalize/sanitize name once (strip + replace spaces, and potentially other filesystem-unsafe chars) and use that consistently for both activity and protocol generation.
| # Handle multiple choice | ||
| if is_multiple_choice(kind, variant): | ||
| item["responseOptions"]["multipleChoice"] = True | ||
|
|
There was a problem hiding this comment.
New behavior: multipleChoice is set when is_multiple_choice(...) returns true, but there’s no test asserting this flag is written for ODC set fields. Add coverage for a set field with options to ensure multipleChoice: true and the generated choices are correct.
reproschema/odc_mappings.py
Outdated
| return False | ||
|
|
||
| kind = kind.lower().strip() | ||
| variant = str(variant).lower().strip() if variant else "" |
There was a problem hiding this comment.
In is_multiple_choice, the local variant variable is assigned but never used, which will trigger flake8 F841 in this repo. Remove the variant normalization line or incorporate variant into the logic.
| variant = str(variant).lower().strip() if variant else "" |
reproschema/odc2reproschema.py
Outdated
| import json | ||
| import logging | ||
| from pathlib import Path | ||
| from typing import Any, Dict, List, Optional, Tuple |
There was a problem hiding this comment.
Tuple is imported but never used in this module; with the repo’s flake8 pre-commit hook this will fail CI (F401). Remove the unused import or use it.
| from typing import Any, Dict, List, Optional, Tuple | |
| from typing import Any, Dict, List, Optional |
| from .convertutils import ( | ||
| create_activity_schema, | ||
| create_protocol_schema, | ||
| parse_html, |
There was a problem hiding this comment.
parse_html is imported from convertutils but never used, which will fail flake8 (F401). Either remove the import or use parse_html when populating question/description fields (consistent with the other converters).
| parse_html, |
reproschema/odc2reproschema.py
Outdated
| item["question"] = {"en": str(label)} | ||
|
|
||
| # Add description/hint | ||
| description = field_data.get("description") | ||
| if description: | ||
| item["description"] = {"en": str(description)} |
There was a problem hiding this comment.
label is coerced with str(...) when building question. Elsewhere in the codebase (e.g. NBDC/REDCap converters), labels/descriptions are passed through parse_html to preserve text extraction and language tags. Consider using parse_html(label) or {"en": str(label)} here so ODC HTML/multilingual content isn’t lost.
| item["question"] = {"en": str(label)} | |
| # Add description/hint | |
| description = field_data.get("description") | |
| if description: | |
| item["description"] = {"en": str(description)} | |
| parsed_label = parse_html(label) | |
| item["question"] = parsed_label or {"en": str(label)} | |
| # Add description/hint | |
| description = field_data.get("description") | |
| if description: | |
| parsed_description = parse_html(description) | |
| item["description"] = parsed_description or {"en": str(description)} |
reproschema/odc2reproschema.py
Outdated
| # Add description/hint | ||
| description = field_data.get("description") | ||
| if description: | ||
| item["description"] = {"en": str(description)} |
There was a problem hiding this comment.
description is currently stored as plain str(...). For consistency with other converters and to handle HTML/language-tagged content, use parse_html(description) (with a plain-text fallback) when populating the ReproSchema description field.
| item["description"] = {"en": str(description)} | |
| try: | |
| parsed_description = parse_html(description) | |
| except Exception: | |
| parsed_description = None | |
| item["description"] = parsed_description or {"en": str(description)} |
| if kind == "record-array": | ||
| return process_record_array(name, field_data) | ||
| if kind == "number-record" and variant == "likert": |
There was a problem hiding this comment.
New behavior: record-array fields are expanded via process_record_array(...), but the test suite added in this PR doesn’t cover this conversion path. Add a unit/CLI test fixture that includes a record-array field and asserts the expected prefixed sub-items are generated.
| from .odc_mappings import ( | ||
| ODC_COLUMN_MAP, | ||
| get_odc_input_type, | ||
| get_odc_value_type, | ||
| is_multiple_choice, | ||
| ) |
There was a problem hiding this comment.
ODC_COLUMN_MAP is imported but never used, which will fail flake8 (F401). Remove the import, or apply it when translating ODC keys to internal ReproSchema fields.
reproschema/odc2reproschema.py
Outdated
| kind = field_data.get("kind", "string") | ||
| variant = field_data.get("variant") | ||
|
|
||
| # Handle special complex fields | ||
| if kind == "record-array": |
There was a problem hiding this comment.
kind/variant are compared for special cases (record-array, likert, dynamic) before any normalization. If the input uses different casing/whitespace (e.g. "Record-Array"), these branches won’t run and the converter will produce incorrect output. Normalize kind/variant (lower/strip) once up front and use the normalized values for all comparisons and mapping calls.
- Add sample_odc_instrument.json fixture covering all ODC field types - Add odc_form_schema.json (ODC JSON schema for reference) - Add TestComprehensiveInstrument class with: - test_comprehensive_instrument_conversion: verifies all 15 field types - test_activity_schema_order_and_properties: verifies order/addProperties All 5 tests passing
for more information, see https://pre-commit.ci
Changes: - Remove unused imports: Tuple and ODC_COLUMN_MAP (F401) - Remove unused variable in is_multiple_choice() (F841) - Add logging for type conversion failures in process_options() - Fix non-deterministic fallback in get_odc_input_type() to return 'text' - Use parse_html() for labels and descriptions in process_field() - Normalize instrument name with strip() and replace spaces with underscores - Validate content is dict before iterating, raise ValueError if not - Normalize kind/variant at top of process_field() with lower/strip Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new feature to convert OpenDataCapture (ODC) instruments to the ReproSchema format. The changes are well-structured, with separate modules for the core conversion logic, type mappings, and a new CLI command. The implementation covers various ODC field types, including simple fields, Likert scales, and record arrays, and handles them robustly. The addition of comprehensive unit and integration tests is excellent and ensures the converter's correctness and reliability. I have a few suggestions for minor refactoring and code cleanup to improve maintainability and clarity, but overall this is a high-quality contribution.
| if "integer" in value_type: | ||
| try: | ||
| processed_val = int(val) | ||
| except ValueError: | ||
| logger.warning( | ||
| "Could not convert option value '%s' to integer, " | ||
| "using string fallback", | ||
| str(val), | ||
| ) | ||
| processed_val = str(val) | ||
| elif "decimal" in value_type or "float" in value_type: | ||
| try: | ||
| processed_val = float(val) | ||
| except ValueError: | ||
| logger.warning( | ||
| "Could not convert option value '%s' to float, " | ||
| "using string fallback", | ||
| str(val), | ||
| ) | ||
| processed_val = str(val) |
There was a problem hiding this comment.
There is significant code duplication in the try-except blocks for converting option values to int and float. This logic can be refactored into a private helper function to improve code clarity and maintainability. For example:
def _try_convert(value, target_type, type_name):
try:
return target_type(value)
except ValueError:
logger.warning(
"Could not convert option value '%s' to %s, "
"using string fallback",
str(value),
type_name,
)
return str(value)
# ... in process_options ...
processed_val = val
if "integer" in value_type:
processed_val = _try_convert(val, int, "integer")
elif "decimal" in value_type or "float" in value_type:
processed_val = _try_convert(val, float, "float")| item.setdefault("additionalNotesObj", []).append( | ||
| { | ||
| "source": "odc", | ||
| "column": "record-array-parent", | ||
| "value": name, | ||
| } | ||
| ) |
There was a problem hiding this comment.
The use of item.setdefault("additionalNotesObj", []) is redundant here. The process_field function ensures that any returned item will have an additionalNotesObj key with a list value. You can simplify this to a direct key access.
| item.setdefault("additionalNotesObj", []).append( | |
| { | |
| "source": "odc", | |
| "column": "record-array-parent", | |
| "value": name, | |
| } | |
| ) | |
| item["additionalNotesObj"].append( | |
| { | |
| "source": "odc", | |
| "column": "record-array-parent", | |
| "value": name, | |
| } | |
| ) |
| ODC_COLUMN_MAP = { | ||
| "label": "question", | ||
| "description": "description", | ||
| "instructions": "instruction", | ||
| } |
| return "xsd:string" | ||
|
|
||
|
|
||
| def is_multiple_choice(kind: str, variant: str = None) -> bool: |
There was a problem hiding this comment.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| item["question"] = {"en": str(parse_html(label))} | ||
|
|
||
| # Add description/hint | ||
| description = field_data.get("description") | ||
| if description: | ||
| item["description"] = {"en": str(parse_html(description))} |
There was a problem hiding this comment.
parse_html can return None if the input is invalid or empty. The current code wraps it with str() which will turn None into "None". After fixing the parse_html usage issue, add a check to ensure parse_html didn't return None before assigning to item["description"].
| item["question"] = {"en": str(parse_html(label))} | |
| # Add description/hint | |
| description = field_data.get("description") | |
| if description: | |
| item["description"] = {"en": str(parse_html(description))} | |
| parsed_label = parse_html(label) | |
| if parsed_label is not None: | |
| item["question"] = {"en": str(parsed_label)} | |
| # Add description/hint | |
| description = field_data.get("description") | |
| if description: | |
| parsed_description = parse_html(description) | |
| if parsed_description is not None: | |
| item["description"] = {"en": str(parsed_description)} |
| ) | ||
| processed_val = str(val) | ||
|
|
||
| choices.append({"name": {"en": str(label)}, "value": processed_val}) |
There was a problem hiding this comment.
For consistency with how question labels are handled throughout the codebase, option labels should use parse_html instead of str(). ODC option labels might contain HTML markup that should be properly parsed. This ensures all user-facing text is processed consistently.
| choices.append({"name": {"en": str(label)}, "value": processed_val}) | |
| choices.append({"name": {"en": parse_html(str(label))}, "value": processed_val}) |
| raise ValueError( | ||
| f"ODC instrument 'content' must be a dictionary, " | ||
| f"got {type(content).__name__}" | ||
| ) |
There was a problem hiding this comment.
Consider adding a warning or error if content is empty. An ODC instrument with no fields is unusual and might indicate a malformed input file. Adding a check like if not content: logger.warning("ODC instrument has no content fields") would help users identify potential issues with their input data.
| ) | |
| ) | |
| if not content: | |
| logger.warning("ODC instrument '%s' has no content fields", name) |
|
|
||
| Returns: | ||
| List of ReproSchema item dictionaries | ||
| """ |
There was a problem hiding this comment.
Add validation to check if field_data is a dictionary before processing. If field_data is None or not a dict, the code will fail with an unclear error. Add a check at the start: if not isinstance(field_data, dict): logger.warning("Skipping field '%s': invalid field data", name); return []
| """ | |
| """ | |
| if not isinstance(field_data, dict): | |
| logger.warning("Skipping field '%s': invalid field data", name) | |
| return [] |
| item["question"] = {"en": str(parse_html(label))} | ||
|
|
||
| # Add description/hint | ||
| description = field_data.get("description") | ||
| if description: | ||
| item["description"] = {"en": str(parse_html(description))} |
There was a problem hiding this comment.
Incorrect usage of parse_html. The function already returns a dictionary like {"en": "text"}, so wrapping it in str() and then putting it in a dict creates {"en": "{'en': 'text'}"} instead of the correct {"en": "text"}. Remove str() wrapper and assign parse_html result directly.
| item["question"] = {"en": str(parse_html(label))} | |
| # Add description/hint | |
| description = field_data.get("description") | |
| if description: | |
| item["description"] = {"en": str(parse_html(description))} | |
| item["question"] = parse_html(label) | |
| # Add description/hint | |
| description = field_data.get("description") | |
| if description: | |
| item["description"] = parse_html(description) |
| row_label = row_data.get("label", row_name) | ||
|
|
||
| item = { | ||
| "category": "reproschema:Item", | ||
| "id": prefixed_name, | ||
| "prefLabel": {"en": prefixed_name}, | ||
| "question": {"en": str(row_label)}, |
There was a problem hiding this comment.
For consistency with process_field and other converters, use parse_html for the row_label instead of directly wrapping with str(). Also check for and handle the optional "description" field in row_data, using parse_html. This ensures HTML content is properly parsed and the data structure is consistent.
| # ODC field properties to internal names | ||
| ODC_COLUMN_MAP = { | ||
| "label": "question", | ||
| "description": "description", | ||
| "instructions": "instruction", | ||
| } |
There was a problem hiding this comment.
ODC_COLUMN_MAP is defined but never used in the codebase. If this mapping is intended for future use or documentation, add a comment explaining its purpose. Otherwise, consider removing it to reduce code clutter.
| item["question"] = {"en": str(parse_html(label))} | ||
|
|
||
| # Add description/hint | ||
| description = field_data.get("description") | ||
| if description: | ||
| item["description"] = {"en": str(parse_html(description))} |
There was a problem hiding this comment.
Incorrect usage of parse_html. The function already returns a dictionary like {"en": "text"}, so wrapping it in str() and then putting it in a dict creates {"en": "{'en': 'text'}"} instead of the correct {"en": "text"}. Remove str() wrapper and assign parse_html result directly.
| item["question"] = {"en": str(parse_html(label))} | |
| # Add description/hint | |
| description = field_data.get("description") | |
| if description: | |
| item["description"] = {"en": str(parse_html(description))} | |
| item["question"] = parse_html(label) | |
| # Add description/hint | |
| description = field_data.get("description") | |
| if description: | |
| item["description"] = parse_html(description) |
| item["question"] = {"en": str(parse_html(label))} | ||
|
|
||
| # Add description/hint | ||
| description = field_data.get("description") | ||
| if description: | ||
| item["description"] = {"en": str(parse_html(description))} |
There was a problem hiding this comment.
parse_html can return None if the input is invalid or empty. The current code wraps it with str() which will turn None into "None". After fixing the parse_html usage issue, add a check to ensure parse_html didn't return None before assigning to item["question"].
| item["question"] = {"en": str(parse_html(label))} | |
| # Add description/hint | |
| description = field_data.get("description") | |
| if description: | |
| item["description"] = {"en": str(parse_html(description))} | |
| parsed_label = parse_html(label) | |
| if parsed_label is not None: | |
| item["question"] = {"en": str(parsed_label)} | |
| # Add description/hint | |
| description = field_data.get("description") | |
| if description: | |
| parsed_description = parse_html(description) | |
| if parsed_description is not None: | |
| item["description"] = {"en": str(parsed_description)} |
- Add Features section highlighting ODC converter - Add Usage section with CLI command examples - Document output structure with visual tree - Add ODC Field Type Support table - Add Special Handling notes (Likert, record arrays, multiple choice) - Add example showing input/output JSON transformation Addresses PR #108 review feedback for documentation
- Clarify that --instrument-name defaults to internal.name from ODC file - Make optional status more explicit
Supports conversion of: