Commit ebd0e13

Add support for page and block level ignores (#33)
1 parent b4cab5a commit ebd0e13

12 files changed: +469 -29 lines changed

CHANGELOG.rst

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,8 @@ Changelog
44
0.5.0
55
-----
66

7+
- Add :ref:`block_level_ignore` and :ref:`page_level_ignore`
8+
`#33 <https://github.com/jdillard/sphinx-llms-txt/pull/33>`_
79
- Add :confval:`llms_txt_full_size_policy` configuration option to control behavior when :confval:`llms_txt_full_max_size` is exceeded.
810
`#35 <https://github.com/jdillard/sphinx-llms-txt/pull/35>`_
911

docs/source/advanced-configuration.rst

Lines changed: 68 additions & 0 deletions
@@ -124,6 +124,13 @@ This ensures that paths in your custom directives are properly resolved in the g
124124
Excluding Content
125125
^^^^^^^^^^^^^^^^^
126126

127+
There are several ways to exclude content from the generated ``llms-full.txt`` file:
128+
129+
.. _global_exclusion:
130+
131+
Global Page Exclusion
132+
~~~~~~~~~~~~~~~~~~~~~~
133+
127134
You can exclude specific pages from being included in the generated files:
128135

129136
.. code-block:: python
@@ -135,6 +142,67 @@ You can exclude specific pages from being included in the generated files:
135142
]
136143
137144
This is useful for excluding auto-generated pages, indexes, or content that isn't relevant for LLM consumption.
145+
It can also be used to reduce the size of llms-full.txt.
146+
147+
.. _page_level_ignore:
148+
149+
Page-Level Ignore Metadata
150+
~~~~~~~~~~~~~~~~~~~~~~~~~~~
151+
152+
You can exclude individual pages by adding metadata at the top of any reStructuredText file:
153+
154+
.. code-block:: restructuredtext
155+
156+
:llms-txt-ignore: true
157+
158+
Page Title
159+
==========
160+
161+
This entire page will be excluded from llms-full.txt
162+
163+
When this metadata is present, the entire page is skipped during processing.
164+
165+
.. _block_level_ignore:
166+
167+
Block-Level Ignore Directives
168+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
169+
170+
You can exclude specific sections within a page using ignore directives:
171+
172+
.. code-block:: restructuredtext
173+
174+
Page Title
175+
==========
176+
177+
This content will be included in llms-full.txt.
178+
179+
.. llms-txt-ignore-start
180+
181+
This content will be excluded from llms-full.txt.
182+
183+
Section To Ignore
184+
-----------------
185+
186+
This entire section and any nested content will be ignored.
187+
188+
.. code-block:: python
189+
190+
# This code block will also be ignored
191+
def ignored_function():
192+
pass
193+
194+
.. llms-txt-ignore-end
195+
196+
This content will be included again.
197+
198+
Block-level ignores can be useful for:
199+
200+
- Removing internal notes or TODOs
201+
- Hiding implementation details while keeping user-facing documentation
202+
203+
.. note::
204+
- Multiple ignore blocks can be used within the same file
205+
- Ignore directives work with any indentation level
138206

139207
.. _including_code_files:
140208

docs/source/configuration-values.rst

Lines changed: 1 addition & 1 deletion
@@ -89,7 +89,7 @@ Project Configuration Values
8989

9090
- **Type**: list of strings
9191
- **Default**: ``[]``
92-
- **Description**: A list of pages to ignore.
92+
- **Description**: A list of pages to ignore using glob patterns.
9393
See :ref:`excluding_content`.
9494

9595
.. versionadded:: 0.2.1
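
The description above says pages are matched with glob patterns. As a rough illustration only, the sketch below uses Python's standard `fnmatch` module; the helper name `is_excluded` is hypothetical, and the extension's own matcher (`_match_exclude_pattern`) is not shown in this diff and may behave differently.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def is_excluded(docname: str, patterns: Iterable[str]) -> bool:
    """Return True if docname matches any glob pattern.

    Hypothetical helper for illustration; not the extension's real matcher.
    """
    return any(fnmatchcase(docname, pattern) for pattern in patterns)


# Example patterns a project might list (assumed values, not from the diff)
patterns = ["api/*", "search"]
excluded = [name for name in ["index", "api/module", "search"]
            if is_excluded(name, patterns)]
```

With `fnmatch`-style matching, `*` also matches path separators, so `api/*` would cover nested docnames as well.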

docs/source/getting-started.rst

Lines changed: 11 additions & 23 deletions
@@ -1,11 +1,6 @@
11
Getting Started
22
===============
33

4-
Demo
5-
----
6-
7-
You can see this Sphinx project's `llms.txt`_ and `llms-full.txt`_ files as a simple example.
8-
94
Installation
105
------------
116

@@ -15,6 +10,12 @@ Directly install via ``pip`` by using:
1510
1611
pip install sphinx-llms-txt
1712
13+
Or with ``conda`` via ``conda-forge``:
14+
15+
.. code::
16+
17+
conda install -c conda-forge sphinx-llms-txt
18+
1819
Usage
1920
-----
2021

@@ -26,25 +27,12 @@ Add the extension to your Sphinx configuration (``conf.py``):
2627
'sphinx_llms_txt',
2728
]
2829
29-
Once added, the extension will automatically generate the LLMs.txt files during the build process.
30-
31-
See :doc:`advanced-configuration` for more information about how to use **sphinx-llms-txt**.
32-
33-
How It Works
34-
------------
30+
After the HTML build finishes, **sphinx-llms-txt** logs the location of the generated files::
3531

36-
During the Sphinx build process:
32+
sphinx-llms-txt: Created /path/to/_build/html/llms-full.txt with 45 sources and 6879 lines
33+
sphinx-llms-txt: created /path/to/_build/html/llms.txt
3734

38-
1. **Content Collection**: Scans all of your documentation's ``_source`` pages and collects their content
39-
2. **Directive Processing**: Resolves ``include`` directives by automatically incorporating their content
40-
3. **Path Resolution**: Transforms relative paths in directives to full paths
41-
4. **Output Generation**: Creates two optional files:
4235

43-
- ``llms.txt``: A concise summary of your documentation, in Markdown
44-
- ``llms-full.txt``: A comprehensive version with all documentation content, in reStructuredText
36+
.. tip:: Make sure to confirm the accuracy of the output files after installs and upgrades.
4537

46-
5. **Content Filtering**: Allows you to exclude specific pages from the generated files
47-
48-
49-
.. _llms.txt: https://sphinx-llms-txt.readthedocs.io/en/latest/llms.txt
50-
.. _llms-full.txt: https://sphinx-llms-txt.readthedocs.io/en/latest/llms-full.txt
38+
See :doc:`advanced-configuration` for more information about how to use **sphinx-llms-txt**.

docs/source/index.rst

Lines changed: 21 additions & 0 deletions
@@ -5,6 +5,25 @@ A `Sphinx`_ extension that generates a summary ``llms.txt`` file, written in Mar
55

66
|PyPI version| |Conda Version| |Downloads| |Parallel Safe| |GitHub Stars|
77

8+
Demo
9+
----
10+
11+
You can see this Sphinx project's `llms.txt`_ and `llms-full.txt`_ files as a simple example.
12+
13+
Highlights
14+
----------
15+
16+
1. **Content Collection**: Quickly gathers content from _sources, without needing a separate build
17+
2. **Directive Processing**: Resolves ``include`` directives by automatically incorporating their content
18+
3. **Path Resolution**: Transforms relative paths in directives to full paths
19+
4. **Output Generation**: Creates two optional files:
20+
21+
- ``llms.txt``: A concise summary of your documentation, in Markdown
22+
- ``llms-full.txt``: A comprehensive version with all documentation content, in reStructuredText
23+
24+
5. **Content Filtering**: Allows you to exclude specific pages or sections
25+
6. **Source Code**: Allows you to include specific source code files
26+
827
.. toctree::
928
:maxdepth: 2
1029

@@ -15,6 +34,8 @@ A `Sphinx`_ extension that generates a summary ``llms.txt`` file, written in Mar
1534
changelog
1635

1736

37+
.. _llms.txt: https://sphinx-llms-txt.readthedocs.io/en/latest/llms.txt
38+
.. _llms-full.txt: https://sphinx-llms-txt.readthedocs.io/en/latest/llms-full.txt
1839
.. _Sphinx: http://sphinx-doc.org/
1940

2041
.. |PyPI version| image:: https://img.shields.io/pypi/v/sphinx-llms-txt.svg

sphinx_llms_txt/__init__.py

Lines changed: 7 additions & 0 deletions
@@ -33,6 +33,13 @@ def doctree_resolved(app: Sphinx, doctree, docname: str):
     """Called when a docname has been resolved to a document."""
     global _root_first_paragraph

+    # Check for llms-txt-ignore metadata at the page level
+    if hasattr(app.env, "metadata") and docname in app.env.metadata:
+        metadata = app.env.metadata[docname]
+        if metadata.get("llms-txt-ignore", "").lower() in ("true", "1", "yes"):
+            _manager.mark_page_ignored(docname)
+            return
+
     # Extract title from the document
     title = None
     # findall() returns a generator, convert to list to check if it has elements
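
The page-level check in this hunk can be tried outside Sphinx. The sketch below is a minimal standalone version, assuming a plain dict stands in for `app.env.metadata[docname]`; the function name `is_page_ignored` is hypothetical and not part of the commit.

```python
def is_page_ignored(metadata: dict) -> bool:
    """Return True when the llms-txt-ignore field holds an accepted value.

    Mirrors the comparison in the hunk: the value is lowercased and checked
    against "true", "1", and "yes". A plain dict is an assumed stand-in for
    the per-document metadata Sphinx collects from field lists.
    """
    value = metadata.get("llms-txt-ignore", "")
    return str(value).lower() in ("true", "1", "yes")


# A page with ":llms-txt-ignore: true" at the top
ignored = is_page_ignored({"llms-txt-ignore": "True"})
# A page without the field
kept = not is_page_ignored({})
```

Values outside that set, such as `false` or `on`, leave the page included.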

sphinx_llms_txt/manager.py

Lines changed: 45 additions & 4 deletions
@@ -5,7 +5,7 @@
 import glob
 import subprocess
 from pathlib import Path
-from typing import Any, Dict, List, Optional, Tuple
+from typing import Any, Dict, List, Optional, Tuple, Union

 from sphinx.application import Sphinx
 from sphinx.environment import BuildEnvironment
@@ -129,6 +129,7 @@ def __init__(self):
         self.srcdir: Optional[str] = None
         self.outdir: Optional[str] = None
         self.app: Optional[Sphinx] = None
+        self.ignored_pages: set = set()

     def set_master_doc(self, master_doc: str):
         """Set the master document name."""
@@ -144,6 +145,27 @@ def update_page_title(self, docname: str, title: str):
         """Update the title for a page."""
         self.collector.update_page_title(docname, title)

+    def mark_page_ignored(self, docname: str):
+        """Mark a page as ignored due to llms-txt-ignore metadata."""
+        self.ignored_pages.add(docname)
+
+    def _filter_ignored_pages(
+        self, page_order: Union[List[str], List[Tuple[str, str]]]
+    ) -> Union[List[str], List[Tuple[str, str]]]:
+        """Filter out ignored pages from page_order."""
+        filtered_pages = []
+        for item in page_order:
+            # Handle both old format (str) and new format (tuple)
+            if isinstance(item, tuple):
+                docname, _ = item
+            else:
+                docname = item
+
+            if docname not in self.ignored_pages:
+                filtered_pages.append(item)
+
+        return filtered_pages
+
     def set_config(self, config: Dict[str, Any]):
         """Set configuration options."""
         self.config = config
@@ -286,6 +308,11 @@ def combine_sources(self, outdir: str, srcdir: str):
286308
should_abort_early = size_policy_action in ["skip", "note"]
287309

288310
for docname, _ in page_order:
311+
# Skip pages marked as ignored
312+
if docname in self.ignored_pages:
313+
logger.debug(f"sphinx-llms-txt: Skipping ignored page: {docname}")
314+
continue
315+
289316
if docname in docname_to_file:
290317
file_path = docname_to_file[docname]
291318
content, line_count = self._read_source_file(file_path, docname)
@@ -383,6 +410,13 @@ def combine_sources(self, outdir: str, srcdir: str):
383410
if docname is None:
384411
continue
385412

413+
# Skip pages marked as ignored
414+
if docname in self.ignored_pages:
415+
logger.debug(
416+
f"sphinx-llms-txt: Skipping ignored remaining file: {docname}"
417+
)
418+
continue
419+
386420
# Skip excluded docnames
387421
if exclude_patterns and any(
388422
self.collector._match_exclude_pattern(docname, pattern)
@@ -468,8 +502,11 @@ def combine_sources(self, outdir: str, srcdir: str):
468502
logger.info(f"sphinx-llms-txt: Skipping {filename} generation")
469503
# Log summary information if requested
470504
if self.config.get("llms_txt_file"):
505+
filtered_page_order = self._filter_ignored_pages(page_order)
471506
self.writer.write_verbose_info_to_file(
472-
page_order, self.collector.page_titles, total_line_count
507+
filtered_page_order,
508+
self.collector.page_titles,
509+
total_line_count,
473510
)
474511
return
475512
elif action == "note":
@@ -478,8 +515,11 @@ def combine_sources(self, outdir: str, srcdir: str):
478515

479516
# Log summary information if requested
480517
if self.config.get("llms_txt_file"):
518+
filtered_page_order = self._filter_ignored_pages(page_order)
481519
self.writer.write_verbose_info_to_file(
482-
page_order, self.collector.page_titles, total_line_count
520+
filtered_page_order,
521+
self.collector.page_titles,
522+
total_line_count,
483523
)
484524
return
485525
elif action == "keep":
@@ -496,8 +536,9 @@ def combine_sources(self, outdir: str, srcdir: str):
496536

497537
# Log summary information if requested
498538
if success and self.config.get("llms_txt_file"):
539+
filtered_page_order = self._filter_ignored_pages(page_order)
499540
self.writer.write_verbose_info_to_file(
500-
page_order, self.collector.page_titles, total_line_count
541+
filtered_page_order, self.collector.page_titles, total_line_count
501542
)
502543

503544
def _read_source_file(self, file_path: Path, docname: str) -> Tuple[str, int]:
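
For reference, the filtering that `_filter_ignored_pages` performs can be exercised on its own. This is a sketch under stated assumptions: `filter_ignored_pages` is a hypothetical free function rather than the manager method, and the second tuple element is a placeholder string because its meaning is not shown in this diff.

```python
from typing import List, Set, Tuple, Union

PageItem = Union[str, Tuple[str, str]]


def filter_ignored_pages(
    page_order: List[PageItem], ignored: Set[str]
) -> List[PageItem]:
    """Drop entries whose docname is in the ignored set, preserving order."""
    kept: List[PageItem] = []
    for item in page_order:
        # Entries may be bare docnames or (docname, extra) tuples
        docname = item[0] if isinstance(item, tuple) else item
        if docname not in ignored:
            kept.append(item)
    return kept


# Placeholder second elements; only the docname matters for filtering
page_order = [("index", "x"), ("page1", "x"), ("page_ignored_metadata", "x")]
result = filter_ignored_pages(page_order, {"page_ignored_metadata"})
```

Because the original items are appended unchanged, the caller gets back the same shape it passed in, whichever of the two formats is in use.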

sphinx_llms_txt/processor.py

Lines changed: 37 additions & 1 deletion
@@ -44,7 +44,10 @@ def process_content(self, content: str, source_path: Path) -> str:
         Returns:
             Processed content with directives properly resolved
         """
-        # First process include directives
+        # First process llms-txt-ignore blocks
+        content = self._process_ignore_blocks(content)
+
+        # Then process include directives
         content = self._process_includes(content, source_path)

         # Then process path directives (image, figure, etc.)
@@ -337,3 +340,36 @@ def replace_include(match):
         # Replace all includes with their content
         processed_content = include_pattern.sub(replace_include, content)
         return processed_content
+
+    def _process_ignore_blocks(self, content: str) -> str:
+        """Process llms-txt-ignore-start/end blocks by removing their content.
+
+        Args:
+            content: The source content to process
+
+        Returns:
+            Processed content with ignore blocks removed
+        """
+        # Process ignore blocks iteratively to handle nested cases correctly
+        while True:
+            # Pattern to match ignore blocks - handles whitespace and indentation
+            ignore_pattern = re.compile(
+                r"^\s*\.\.\s+llms-txt-ignore-start\s*\n"  # Start directive line
+                r"(.*?)"  # Content to ignore (non-greedy)
+                r"^\s*\.\.\s+llms-txt-ignore-end\s*$",  # End directive line
+                re.MULTILINE | re.DOTALL,
+            )
+
+            # Find and remove one ignore block at a time
+            match = ignore_pattern.search(content)
+            if not match:
+                break
+
+            # Remove the matched block
+            content = content[: match.start()] + content[match.end() :]
+
+        # Clean up any extra blank lines that might be left
+        # Replace multiple consecutive newlines with at most 2 newlines
+        processed_content = re.sub(r"\n\n\n+", "\n\n", content)
+
+        return processed_content
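
To see what the block-level pattern removes, the following sketch reuses the same regular expression in a standalone function. The name `strip_ignore_blocks` and the module-level `IGNORE_BLOCK` constant are illustrative assumptions; the only intended difference from the method above is that the pattern is compiled once instead of on every loop iteration.

```python
import re

# Same pattern as in the hunk, compiled once at module level
IGNORE_BLOCK = re.compile(
    r"^\s*\.\.\s+llms-txt-ignore-start\s*\n"  # start directive line
    r"(.*?)"  # content to drop (non-greedy)
    r"^\s*\.\.\s+llms-txt-ignore-end\s*$",  # end directive line
    re.MULTILINE | re.DOTALL,
)


def strip_ignore_blocks(content: str) -> str:
    """Remove every start/end block, then collapse runs of blank lines."""
    while True:
        match = IGNORE_BLOCK.search(content)
        if not match:
            break
        # Cut out one matched block and search again
        content = content[: match.start()] + content[match.end():]
    # Leave at most one blank line where a block was removed
    return re.sub(r"\n\n\n+", "\n\n", content)


sample = (
    "Intro\n\n"
    ".. llms-txt-ignore-start\n\n"
    "secret\n\n"
    ".. llms-txt-ignore-end\n\n"
    "Outro\n"
)
cleaned = strip_ignore_blocks(sample)
```

One point worth noting: because the match is non-greedy, a start marker is paired with the first end marker that follows it. If blocks are nested, the inner end marker closes the outer block and the outer end marker appears to be left behind in the output, so nesting is probably best avoided.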

tests/roots/basic/index.rst

Lines changed: 2 additions & 0 deletions
@@ -8,6 +8,8 @@ Welcome to Test Project's documentation!
88
page1
99
page2
1010
page_with_include
11+
page_ignored_metadata
12+
page_with_ignore_blocks
1113

1214
Indices and tables
1315
==================
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
:llms-txt-ignore: true
2+
3+
Page Ignored by Metadata
4+
========================
5+
6+
This page should not appear in llms-full.txt because of the metadata directive.
7+
8+
Section 1
9+
---------
10+
11+
This content should be completely ignored.
12+
13+
Section 2
14+
---------
15+
16+
This content should also be ignored.
