Add TrueQuery QA Generation #10559

Open
askliar wants to merge 33 commits into pyg-team:master from askliar:master

@askliar askliar commented Dec 15, 2025

This PR introduces the TrueQuery QA generation procedure, an LLM-powered pipeline for generating high-quality question-answer pairs from text documents. The system is designed to create training data for retrieval-augmented generation (RAG) systems and question-answering models.

Note: The original author of TrueQuery is Vibhor Agrawal of NVIDIA.

@askliar askliar requested a review from puririshi98 as a code owner December 15, 2025 18:39
@askliar askliar changed the title WIP: Add TrueQuery QA Generation Draft: Add TrueQuery QA Generation Dec 15, 2025
@puririshi98
Contributor

First step: address whatever the CI linter complains about: https://results.pre-commit.ci/run/github/106024057/1765827433.VRbWJ22KTYipZNbGq422sQ
Then update the changelog:
https://github.com/askliar/pytorch_geometric/blob/master/CHANGELOG.md

@puririshi98
Contributor

Upon quick review, no glaring issues. Looking forward to the final restructuring before reviewing more deeply and involving Vibhor for review.
One thing I would ask: when the final restructuring is done, please attach a log of a successful run of the full example in the latest PyG container: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pyg/tags

codecov bot commented Dec 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.91%. Comparing base (c211214) to head (cb696d4).
⚠️ Report is 179 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10559      +/-   ##
==========================================
- Coverage   86.11%   81.91%   -4.20%     
==========================================
  Files         496      511      +15     
  Lines       33655    37778    +4123     
==========================================
+ Hits        28981    30945    +1964     
- Misses       4674     6833    +2159     
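
The percentages in the table follow directly from the hit and line counts. A quick check in Python, using only the numbers reported above (the variable and function names are illustrative, not part of any Codecov tooling):

```python
# Numbers copied from the Codecov table above.
base_hits, base_lines = 28981, 33655
head_hits, head_lines = 30945, 37778


def coverage(hits: int, lines: int) -> float:
    """Line coverage as a percentage, rounded to two decimals."""
    return round(100 * hits / lines, 2)


base_cov = coverage(base_hits, base_lines)
head_cov = coverage(head_hits, head_lines)
delta = round(head_cov - base_cov, 2)

print(f"base {base_cov}%  head {head_cov}%  delta {delta:+.2f}%")
# prints: base 86.11%  head 81.91%  delta -4.20%
```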

@askliar askliar changed the title Draft: Add TrueQuery QA Generation [WIP] Add TrueQuery QA Generation Dec 16, 2025
@askliar askliar marked this pull request as draft December 16, 2025 09:27
Contributor

@puririshi98 puririshi98 left a comment

Overall LGTM.
Two things:

  1. Update CHANGELOG.md.
  2. Please attach a log of a successful run of the full example in the latest PyG container + pip install vllm: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pyg/tags

@puririshi98 puririshi98 marked this pull request as ready for review February 5, 2026 23:03
@puririshi98
Contributor

@askliar please also add an update to examples/llm/README.md with enough detail for smooth user bring-up.

Andrii and others added 6 commits February 13, 2026 16:02
…Generator

- Implemented automatic dataset download if the input directory does not exist.
- Added validation to ensure at least one data source is specified in the configuration.
- Enhanced QAGenerator initialization to read additional parameters from the configuration file.
- Updated LLMClient to support new parameters for maximum batched tokens.
- Introduced new YAML configuration files for different backend setups (nim and vllm).
- Minor adjustments in txt2kg_rag.py to ensure proper exit behavior.
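
The "at least one data source" validation in the list above could be sketched as follows. This is an assumption-laden illustration: the key names `input_dir` and `dataset_name` are hypothetical, since the PR's actual configuration schema is not shown in this conversation.

```python
from typing import Any, Mapping

# Hypothetical key names; the real config schema may differ.
SOURCE_KEYS = ("input_dir", "dataset_name")


def validate_data_sources(config: Mapping[str, Any]) -> list[str]:
    """Return the source keys that are set, or raise if none is."""
    present = [key for key in SOURCE_KEYS if config.get(key)]
    if not present:
        raise ValueError(
            "Configuration must specify at least one data source: "
            + ", ".join(SOURCE_KEYS)
        )
    return present
```

Failing early with a message that names the accepted keys keeps a misconfigured run from reaching the LLM backend at all.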
…umentation

- Fixed FAISS L2→cosine similarity conversion bug in validate_answer_spans_hybrid()
  that caused 0% validation pass rate (all QA pairs incorrectly rejected)
- Fixed AttributeError in LLMClient.cleanup() caused by undefined is_local attribute;
  now correctly uses the existing backend attribute
- Fixed JSON parsing errors by proactively stripping markdown code blocks from LLM
  responses before parsing, with json_repair as fallback
- Added json_repair and langgraph to the rag extras in pyproject.toml
- Added comprehensive README for the txt2qa synthetic QA generation pipeline,
  covering vLLM and NIM backends, quickstart, config reference, troubleshooting,
  and performance benchmarks
- Added success log examples for vLLM and NIM backends in examples/llm/logs/
- Added DEPLOYMENT_SUMMARY.md documenting bugs fixed and test results
- Updated vLLM and NIM configuration files
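
The FAISS fix listed above concerns converting L2 distances to cosine similarity. The PR's code is not shown here, so the following is only a sketch of the identity such a conversion relies on: for unit-length vectors, ||a - b||^2 = 2 - 2 cos(a, b), hence cos = 1 - d^2 / 2. FAISS flat L2 indexes report squared distances, so the value should not be squared again. The function name is hypothetical, and plain Python stands in for FAISS.

```python
import math


def squared_l2_to_cosine(sq_dist: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    Only valid if both vectors were L2-normalized before indexing.
    """
    return 1.0 - sq_dist / 2.0


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 0.0, 1.0])

# What a squared-L2 index would report for this pair.
sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))

cos_direct = sum(x * y for x, y in zip(a, b))
cos_from_l2 = squared_l2_to_cosine(sq_dist)
assert math.isclose(cos_direct, cos_from_l2)
```

If the embeddings are not normalized, this identity does not hold and a similarity threshold applied to the result would be meaningless.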

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
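
The JSON-parsing fix in the commit above (stripping Markdown code blocks before parsing, with json_repair as a fallback) might look roughly like this. It is a minimal sketch, not the PR's implementation: `parse_llm_json` is a hypothetical name, and `json_repair` is treated as an optional third-party dependency.

```python
import json
import re

try:
    import json_repair  # optional third-party package
except ImportError:
    json_repair = None

# Built programmatically so this example contains no literal fence marker.
TICKS = "`" * 3
_FENCED = re.compile(
    rf"^\s*{TICKS}[\w-]*[ \t]*\n(.*?)\n?[ \t]*{TICKS}\s*$", re.DOTALL
)


def parse_llm_json(text: str):
    """Parse JSON from an LLM reply that may be wrapped in a Markdown fence."""
    match = _FENCED.match(text)
    payload = match.group(1) if match else text
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        if json_repair is None:
            raise  # no repair library available; surface the original error
        return json_repair.loads(payload)


reply = f'{TICKS}json\n{{"question": "Q", "answer": "A"}}\n{TICKS}'
parsed = parse_llm_json(reply)
```

Stripping the fence first means well-formed replies never touch the repair path, which keeps the fallback for genuinely malformed output.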
@askliar askliar requested a review from rusty1s as a code owner February 17, 2026 19:32
pre-commit-ci bot and others added 5 commits February 17, 2026 19:33
- Add noqa: E402 to late imports required before vLLM initialization
- Fix E251 keyword arg formatting by extracting short local variables
- Fix E501 long lines across both files by splitting strings and logger calls
- Replace Unicode dashes/comparison chars with ASCII equivalents

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Keep noqa: E402 for late imports required before vLLM init
- Fix broken syntax from pre-commit.ci auto-formatter (E501 split)
- Keep our correct long-line fixes; adopt pre-commit.ci's unused-var removal

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@askliar askliar marked this pull request as draft February 17, 2026 23:14
root and others added 4 commits February 18, 2026 01:03
…rag.py

- Replace fabricated vllm_success.log and nim_success.log with actual
  pipeline run output (real timestamps, model names, progress lines)
- Revert examples/llm/txt2kg_rag.py to pyg-team/master (remove stray exit(0))
- Collapse CHANGELOG Unreleased section to a single Added line for
  txt2qa.py (pyg-team#10559) and add PR links to all four Fixed items
- Shorten examples/llm/README.md: update txt2qa table row to one sentence
  and remove the 260-line TXT2QA guide section
- Replace 318-line DEPLOYMENT_SUMMARY.md with a compact ~45-line reference
  (overview, quick-start commands, critical config flags, output note)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@askliar askliar changed the title [WIP] Add TrueQuery QA Generation Add TrueQuery QA Generation Feb 18, 2026
@askliar askliar marked this pull request as ready for review February 18, 2026 09:06
Contributor

@puririshi98 puririshi98 left a comment

Overall LGTM. One thing that would make this really solid is adding a minimal unit test to test/llm/test_txt2qa.py that generates a single query locally, just to make sure this keeps working in the future. After these changes I'm happy to merge once Vibhor reviews.

2026-02-18 00:23:18,251 - __main__ - INFO - Final aggregated metrics across all chunks: {'average_question_length': 202.625, 'average_answer_length': 413.25, 'complexity_distribution': {5: 13, 4: 3}, 'total_pairs': 16, 'query_type_distribution': {'multi_hop': 8, 'structural': 8}, 'reasoning_type_distribution': {'causal': 9, 'factual': 7}, 'multi_hop_metrics': {'average_hop_count': 3.0, 'hop_count_distribution': {3.0: 6, 2.0: 1, 4.0: 1}}}
2026-02-18 00:23:18,254 - __main__ - INFO - Wrote 5 QA pairs to /home/scratch.askliar_ent/pytorch_geometric/techqa_output_nim/all_qa_pairs_batch_0.jsonl
2026-02-18 00:23:18,261 - __main__ - INFO - Done from 0 to 5
2026-02-18 00:23:18,261 - __main__ - INFO - Process 5 to 10
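
The aggregated metrics line in the log above can be sanity-checked mechanically. The sketch below parses the dictionary out of that line with `ast.literal_eval` and confirms that each distribution sums to `total_pairs` and that the reported average hop count matches the hop distribution. It assumes the log format shown here and is not part of the PR.

```python
import ast

# Log line copied from the excerpt above, split across literals for readability.
log_line = (
    "2026-02-18 00:23:18,251 - __main__ - INFO - Final aggregated metrics "
    "across all chunks: {'average_question_length': 202.625, "
    "'average_answer_length': 413.25, "
    "'complexity_distribution': {5: 13, 4: 3}, 'total_pairs': 16, "
    "'query_type_distribution': {'multi_hop': 8, 'structural': 8}, "
    "'reasoning_type_distribution': {'causal': 9, 'factual': 7}, "
    "'multi_hop_metrics': {'average_hop_count': 3.0, "
    "'hop_count_distribution': {3.0: 6, 2.0: 1, 4.0: 1}}}"
)

# Everything after the fixed prefix is a Python dict literal.
metrics = ast.literal_eval(log_line.split("across all chunks: ", 1)[1])

total = metrics["total_pairs"]
for key in (
    "complexity_distribution",
    "query_type_distribution",
    "reasoning_type_distribution",
):
    assert sum(metrics[key].values()) == total

# Recompute the average hop count from the distribution.
hops = metrics["multi_hop_metrics"]["hop_count_distribution"]
n_multi_hop = sum(hops.values())
avg_hops = sum(h * n for h, n in hops.items()) / n_multi_hop
```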
Contributor

No need to include the logs in the PR itself; just add them to the top PR description for logging purposes. We don't need to fill the examples repo with extra files.

Author

Yeah, I will remove them once the PR is approved. Did not want to add it as a file separately.

Andrii added 2 commits February 20, 2026 18:59
…T_SUMMARY.md, and enhance README.md with updated TXT2QA quick start instructions. Introduce tests for txt2qa functionality in test_txt2qa.py.
@askliar askliar requested a review from wsad1 as a code owner February 20, 2026 18:22
Contributor

@puririshi98 puririshi98 left a comment

LGTM, just waiting on Vibhor's stamp, but please move the files to the PR description instead of the actual diff.

@askliar
Author

askliar commented Feb 20, 2026

vllm_success.log
nim_success.log

Done @puririshi98

@puririshi98
Contributor

@askliar please also fix the pre-commit failures; it won't let us merge otherwise.
