Conversation
for more information, see https://pre-commit.ci
First step: address whatever complaints the CI linter has: https://results.pre-commit.ci/run/github/106024057/1765827433.VRbWJ22KTYipZNbGq422sQ
Upon quick review, no glaring issues. Looking forward to the final restructuring before reviewing deeper and also involving Vibhor for review.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           master   #10559      +/-   ##
==========================================
- Coverage   86.11%   81.91%    -4.20%
==========================================
  Files         496      511       +15
  Lines       33655    37778     +4123
==========================================
+ Hits        28981    30945     +1964
- Misses       4674     6833     +2159
==========================================

☔ View full report in Codecov by Sentry.
puririshi98
left a comment
overall lgtm.
2 things:
- update CHANGELOG.md
- please attach a log with a successful run of the full example in the latest pyg container + pip install vllm: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pyg/tags
@askliar plz also add an update to examples/llm/README.md with enough detail for smooth user bring-up.
…Generator
- Implemented automatic dataset download if the input directory does not exist.
- Added validation to ensure at least one data source is specified in the configuration.
- Enhanced QAGenerator initialization to read additional parameters from the configuration file.
- Updated LLMClient to support new parameters for maximum batched tokens.
- Introduced new YAML configuration files for different backend setups (nim and vllm).
- Minor adjustments in txt2kg_rag.py to ensure proper exit behavior.
…umentation
- Fixed FAISS L2→cosine similarity conversion bug in validate_answer_spans_hybrid() that caused a 0% validation pass rate (all QA pairs incorrectly rejected)
- Fixed AttributeError in LLMClient.cleanup() caused by an undefined is_local attribute; now correctly uses the existing backend attribute
- Fixed JSON parsing errors by proactively stripping markdown code blocks from LLM responses before parsing, with json_repair as a fallback
- Added json_repair and langgraph to the rag extras in pyproject.toml
- Added a comprehensive README for the txt2qa synthetic QA generation pipeline, covering vLLM and NIM backends, quickstart, config reference, troubleshooting, and performance benchmarks
- Added success log examples for vLLM and NIM backends in examples/llm/logs/
- Added DEPLOYMENT_SUMMARY.md documenting bugs fixed and test results
- Updated vLLM and NIM configuration files

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
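The FAISS fix above isn't shown in this thread, but the underlying identity is worth spelling out. As a minimal sketch (assuming unit-normalized embeddings and FAISS's `IndexFlatL2`, which reports *squared* L2 distances): for unit vectors, `||a - b||^2 = 2 - 2*cos(a, b)`, so similarity must be recovered as `1 - d^2 / 2`, not by treating the distance as a similarity directly.

```python
import math


def l2sq_to_cosine(d_sq: float) -> float:
    """Convert a squared L2 distance between two unit-norm vectors to
    cosine similarity.  For unit vectors ||a - b||^2 = 2 - 2*cos(a, b),
    hence cos = 1 - d^2 / 2.  (FAISS IndexFlatL2 returns squared L2.)
    """
    return 1.0 - d_sq / 2.0


def cosine(a, b):
    # Plain-Python reference cosine similarity, to check the identity.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


# Identical unit vectors: distance 0 -> similarity 1.
assert l2sq_to_cosine(0.0) == 1.0
# Orthogonal unit vectors, e.g. (1, 0) and (0, 1): d^2 = 2 -> similarity 0.
assert l2sq_to_cosine(2.0) == 0.0
```

Getting this conversion wrong plausibly explains the 0% validation pass rate: a raw L2 distance used as a similarity is near 0 for the *best* matches, so every QA pair falls below the threshold.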
for more information, see https://pre-commit.ci
- Add noqa: E402 to late imports required before vLLM initialization
- Fix E251 keyword arg formatting by extracting short local variables
- Fix E501 long lines across both files by splitting strings and logger calls
- Replace Unicode dashes/comparison chars with ASCII equivalents

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Keep noqa: E402 for late imports required before vLLM init
- Fix broken syntax from the pre-commit.ci auto-formatter (E501 split)
- Keep our correct long-line fixes; adopt pre-commit.ci's unused-var removal

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
for more information, see https://pre-commit.ci
…rag.py
- Replace fabricated vllm_success.log and nim_success.log with actual pipeline run output (real timestamps, model names, progress lines)
- Revert examples/llm/txt2kg_rag.py to pyg-team/master (remove stray exit(0))
- Collapse the CHANGELOG Unreleased section to a single Added line for txt2qa.py (pyg-team#10559) and add PR links to all four Fixed items
- Shorten examples/llm/README.md: update the txt2qa table row to one sentence and remove the 260-line TXT2QA guide section
- Replace the 318-line DEPLOYMENT_SUMMARY.md with a compact ~45-line reference (overview, quick-start commands, critical config flags, output note)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
puririshi98
left a comment
overall lgtm. One thing that would make this really solid: add a minimal unit test to test/llm/test_txt2qa.py that generates a single query locally, just to make sure this stays working for posterity. After these changes I'm happy to merge once Vibhor reviews.
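For reference, a minimal sketch of what such a test could look like. The actual txt2qa API isn't shown in this thread, so `parse_qa_response` below is a hypothetical stand-in for the pipeline's response parsing (including the markdown-fence stripping mentioned in the commits above); the real test should import and exercise the actual pipeline entry point instead.

```python
# Hypothetical sketch only: `parse_qa_response` stands in for whatever
# helper txt2qa uses to turn a raw LLM reply into QA pairs.
import json
import re


def parse_qa_response(raw: str):
    """Strip an optional ```json ... ``` fence, then parse the JSON body."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    body = match.group(1) if match else raw.strip()
    return json.loads(body)


def test_single_query_roundtrip():
    # A canned reply mimicking an LLM that wraps its JSON in a code fence.
    reply = ('```json\n'
             '[{"question": "What is PyG?", "answer": "A GNN library."}]\n'
             '```')
    pairs = parse_qa_response(reply)
    assert len(pairs) == 1
    assert pairs[0]["question"].endswith("?")
    assert pairs[0]["answer"]


test_single_query_roundtrip()
```

Swapping the canned reply for a single locally generated completion (as suggested) would cover the end-to-end path without needing a GPU in CI.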
examples/llm/logs/nim_success.log
2026-02-18 00:23:18,251 - __main__ - INFO - Final aggregated metrics across all chunks: {'average_question_length': 202.625, 'average_answer_length': 413.25, 'complexity_distribution': {5: 13, 4: 3}, 'total_pairs': 16, 'query_type_distribution': {'multi_hop': 8, 'structural': 8}, 'reasoning_type_distribution': {'causal': 9, 'factual': 7}, 'multi_hop_metrics': {'average_hop_count': 3.0, 'hop_count_distribution': {3.0: 6, 2.0: 1, 4.0: 1}}}
2026-02-18 00:23:18,254 - __main__ - INFO - Wrote 5 QA pairs to /home/scratch.askliar_ent/pytorch_geometric/techqa_output_nim/all_qa_pairs_batch_0.jsonl
2026-02-18 00:23:18,261 - __main__ - INFO - Done from 0 to 5
2026-02-18 00:23:18,261 - __main__ - INFO - Process 5 to 10
No need to include the logs in the PR itself; just add them to the top PR description for logging purposes. We don't need to fill the examples repo with extra files.
Yeah, I will remove them once the PR is approved. I did not want to add them as separate files.
…T_SUMMARY.md, and enhance README.md with updated TXT2QA quick start instructions. Introduce tests for txt2qa functionality in test_txt2qa.py.
@askliar plz also fix the pre-commit, it won't let us merge otherwise.
This PR introduces a QA generation procedure: a comprehensive LLM-powered pipeline for generating high-quality question-answer pairs from text documents. The system is designed to create training data for retrieval-augmented generation (RAG) systems and question-answering models.
Note: The original author of TrueQuery is Vibhor Agrawal of NVIDIA.