Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-generate markdown from source documentation #2730

Open
3 tasks
Blargian opened this issue Oct 25, 2024 · 10 comments
Open
3 tasks

Auto-generate markdown from source documentation #2730

Blargian opened this issue Oct 25, 2024 · 10 comments
Assignees

Comments

@Blargian
Copy link
Collaborator

Blargian commented Oct 25, 2024

PRs 70289 and 2714 introduced changes to auto generate markdown documentation from source for format settings and core settings,

We would like to do the same for:

  • global server settings
  • system tables
  • functions
@Blargian
Copy link
Collaborator Author

@rschu1ze I would value your input on how we will approach auto generating markdown from source for functions in particular. Global server settings and system tables appear to be manageable using the approach so far used for format and core settings, but markdown generation for functions is not as straight forward.

Starting with the obvious, most functions are missing C++ source documentation like this one has:

https://github.com/ClickHouse/ClickHouse/blob/aaca3b61567e9f22c69c441aef629f39b3d8b406/src/Functions/sin.cpp#L17-L23

Ideally we automate as much of updating documentation in source as possible... the thought of working through all the functions again one by one to update documentation in source is not one I am particularly fond of (neither for you as reviewer I suspect). It seems to me that unless we standardise the current function docs to a high enough level this in itself might be a challenge. Take for instance arithmetic-functions - some have only a syntax section, some have syntax and examples, and more recently updated ones have syntax, arguments, returned value, examples. In other places we have heading 'parameters' instead of 'arguments' etc. I think that scripting something to modify C++ source documentation from the markdown will be tricky if there is too little uniformity in the structure of each markdown page.

Once that is done it seems plausible to script something to update C++ source similar to what was done in Alexey's PR, maybe in batches or per category of functions on the docs page, and once all source files have documentation we can extend functionality of FunctionFactory to generate markdown from .cpp.

Does that approach sound reasonable?

@rschu1ze
Copy link
Member

rschu1ze commented Oct 28, 2024

@Blargian I was afraid that day would come :-)

Let's start with server settings docs. I agree that these are manageable with the approach in #2714. There are two hurdles:

  1. The existing public docs (server settings and the internal docs in src/Core/ServerSettings.cpp may not be in sync. So Step 1 would be to go through each setting individually and make sure the more correct/sophisticated/verbose version appears in ServerSettings.cpp.

  2. For the normal and session settings, corresponding files Core/Settings.cpp and Core/FormatFactorySettings.h contain all relevant settings. As a result, system table system.settings shows all normal settings (and I guess it contains the format settings as well ... didn't check but it also doesn't matter for the sake of the argument). The same is not true for ServerSettings.cpp, i.e. there are server settings outside that file. An example for which I am guilty of are the query cache settings. They are documented publicly here but they are not part of ServerSettings.cpp or system.server_settings. The main reason for this is that server settings can be nested, e.g. for query_cache settings you could have this XML documentation:

<query_cache>
    <max_size_in_bytes>1073741824</max_size_in_bytes>
    <max_entries>1024</max_entries>
    <max_entry_size_in_bytes>1048576</max_entry_size_in_bytes>
    <max_entry_size_in_rows>30000000</max_entry_size_in_rows>
</query_cache>

ClickHouse ships with a template configuration file (here) ... as you can see, the majority of server settings is nested. (Note that even the template configuration file contains only a subset of all server settings).

There is no principal reason why the nesting itself could not be represented in ServerSettings.cpp. The reason it is not done is that nesting can come with additional constraints, depending on the setting. E.g. in the query cache example above, each sub-tag (e.g. <max_size_in_bytes>) may occur at most once below <query_cache>. Such constraints can be different for other nested structures, for example the logging configuration (public docs) may contain this XML:

<levels>
    <logger>
        <name>ContextAccess (default)</name>
        <level>none</level>
    </logger>
    <logger>
        <name>DatabaseOrdinary (test)</name>
        <level>none</level>
    </logger>
    [...]
</levels>

Note how <logger> occurs arbitrarily many times below <levels>. Such constraints are too complex to express in ServerSettings.cpp.

Long story short: The best we can do is to auto-generate the settings in ServerSettings.cpp (which still contains 155 settings as of now) and ignore everything else.

(thoughts about system table and function docs in the next comments).

@rschu1ze
Copy link
Member

rschu1ze commented Oct 28, 2024

Before discussing system table and function docs:

The public docs contain a page on restricting query complexity via settings. The content of this page largely overlaps with the (auto-generated) setting docs.

It would be cool to consolidate both and

  1. move the former into the latter (i.e. Settings.cpp), or extend the latter
  2. change the former to be more of a guide that merely links into the settings docs.

Related to that, there is this weird doc page for which we should probably apply the same steps.

@rschu1ze
Copy link
Member

Aaaaand it doesn't stop there ...

The publicly documented merge tree settings are at the moment also not auto-generated. They are conceptually similar to normal settings (no nesting), so this will be straightforward:

  1. Consolidate the public docs and MergeTreeSettings.cpp (code).

  2. Auto-generate docs using the same approach as Autogenerate settings #2714.

@rschu1ze
Copy link
Member

rschu1ze commented Oct 28, 2024

About auto-generating docs for system views:

Instead of a single doc page, each system table has its own page (see here) - 100 in total if I counted correctly (*).

Each system table is created ("attached") at startup, this happens in this file. The file contains table-level comments, e.g. This table contains a single row with a single dummy UInt8 column containing the value 0. Used when the table is not specified explicitly, for example in queries like SELECT 1. This is what is shown by

SELECT database, name, comment FROM system.tables WHERE database = 'system'

I'd say, the first step is to make sure that the internal comment string is in-sync with the publicy documented per-table comment string (e.g. here for system.users).

Later, we can auto-generate the public table comment string from the internal table comment string.

The next thing to consider are, for each system table, the docs of the resepective column names, their data types, and their per-column comment string. Every system table is implemented by its own C++ file, e.g. this file for sytem.users. This is what is shown by

SELECT name, type, comment FROM system.columns WHERE database = 'system' AND table = 'users'

Some of the existing system table docs have additional sections like See also and Examples, e.g. system.backup_log. We better don't auto-generate these sections because of two reasons 1: they typically make heavy use of markdown, e.g. to create markdown tables and link other pages. Including additional formatting in C++ is rather ugly. 2. We'd need to expose these fields from the database via standard system views. This either means to add new SEE_ALSO and EXAMPLES columns to system.tables (which would be awful) or encoding/squeezing more information into the existing field system.tables.COMMENT and parsing the individual fields back similar to https://github.com/ClickHouse/clickhouse-docs/pull/2714/files#diff-0ed5a36db0e1c468253084936c7502d77f11a46a0c3b64038d09b1ffd676dcf5R54 (and that would be ugly as well). 3. the number of system where we need to maintain "See also" and "Examples" sections manually a lot smaller than the number of functions that have similar sections. I'd say the extra overhead is okay.

The good thing about (*) above is that we don't need to do a big bang PR for system table docs. We can iterate table-by-table, check the rendered docs and improve incrementally.

@rschu1ze
Copy link
Member

rschu1ze commented Oct 28, 2024

Function docs will be the most challenging.

Newly added functions are nowadays required to have in-source docs (example). Unfortunately, the majority of functions still only comes with public docs, so ... as usual ..., the first step would be to synchronize the public docs back into the in-source docs (*). This wil be a lot of "fun", I promise!

In-source docs are specified for each function in the form of a FunctionDocumentation object (header, source). The exact fields of this class (e.g. description, syntax, arguments) and the formatting they expect were invented ca. a year ago and they are not set in stone. I.e. if we think we should change then, we can do so. System table system.functions gobbles everything up (code):

SELECT * FROM system.functions

The public function docs are grouped into categories, e.g. "Arithmetic", "Arrays", "arrayJoin", .... Keeping this grouping when we auto-generate docs makes sense, IMHO, otherwise newbies will have a much harder time to find the correct function of 1000+ functions. We should use the categories field in FunctionDocumentation, respectively in system.functions to group functions. There are currently three problems:

  1. More than one category can be defined per function (i.e. FunctionDocumentation::Category is defined as std::set). For simplicity, we should allow only a single category.
  2. Tons of functions don't set the category in their in-source docs. That should be easy to fix.
  3. There is no one-to-one map between categories (e.g. "UUID") and public function documentation groups (e.g. "Functions for Working with UUIDs" (page). We should put this mapping as a comment into "src/Common/FunctionDocumentation.h".

Within groups (in the public docs), the functions are loosely sorted by descending popularity / ascending obscurity. See e.g. the string functions for an example. To maintain this sorting in auto-generated docs, we'd need a relative order between functions. E.g., with the previous string function example, function 'empty' could have order = 1, 'notEmpty' could have order = 2, 'left' could have order = 5, etc. The order could be made a new field within FunctionDocumentation. BUT: As you can imagine, maintaining the existing order within a group when a new function is added or deleted will be a nightmare. For example, adding a function "in the middle" means all subsequent functions will need to have their order updated. I am therefore willing to lift the popularity sorting and to sort all functions in auto-generated docs alphabetically. That should be okay assuming that groups contain sufficiently few functions each.

The fields of FunctionDocumentation may contain markdown, e.g.

/// Example: src/Functions/FunctionsHashingMisc.cpp

    factory.registerFunction<FunctionHalfMD5>(FunctionDocumentation{
        .description = R"(
[Interprets](../..//sql-reference/functions/type-conversion-functions.md/#type_conversion_functions-reinterpretAsString) all the input
parameters as strings and calculates the MD5 hash value for each of them. Then combines hashes, takes the first 8 bytes of the hash of the
resulting string, and interprets them as [UInt64](../../../sql-reference/data-types/int-uint.md) in big-endian byte order. The function is
relatively slow (5 million short strings per second per processor core).

Consider using the [sipHash64](../../sql-reference/functions/hash-functions.md/#hash_functions-siphash64) function instead.
                       )",
        .syntax = "SELECT halfMD5(par1,par2,...,parN);",
        .arguments
        = {{"par1,par2,...,parN",
            R"(
The function takes a variable number of input parameters. Arguments can be any of the supported data types. For some data types calculated
value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed
Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
                       )"}},
        .returned_value = "The computed half MD5 hash of the given input params returned as a "
                          "[UInt64](../../../sql-reference/data-types/int-uint.md) in big-endian byte order.",
        .examples
        = {{"",
            "SELECT HEX(halfMD5('abc', 'cde', 'fgh'));",
            R"(
┌─hex(halfMD5('abc', 'cde', 'fgh'))─┐
│ 2C9506B7374CFAF4                  │
└───────────────────────────────────┘
            )"}}});

One could argue if there are more C++-ish ways to encode things like text formatting, links, tables (or if that should even be possible at all). In the end, I think embedded markdown is the least bad option and after all, markdown is designed to be human-readable.

The previous example and almost all other places that define in-source docs uses designated initializer syntax. It is compact but with longer strings it becomes hard to read and edit. It is one of the reasons why people hate writing in-source docs. My proposal is that you use this as a template instead:

{
    FunctionDocumentation::Description description = "";
    FunctionDocumentation::Syntax syntax = "";
    FunctionDocumentation::Argument argument1 = {"", ""};
    FunctionDocumentation::Argument argument2 = {"", ""};
    FunctionDocumentation::Arguments arguments = {argument1, argument2};
    FunctionDocumentation::ReturnedValue returned_value = "";
    FunctionDocumentation::Example example1 = {"", "", ""};
    FunctionDocumentation::Example example2 = {"", "", ""};
    FunctionDocumentation::Examples examples = {example1, example2};
    FunctionDocumentation::Categories categories = {};
    FunctionDocumentation documentation = {description, syntax, arguments, returned_value, examples, categories};

    factory.registerFunction<FunctionHalfMD5>(documentation);
}

Since some of the fields will contain linebreaks, we can make the template even more readable by using raw strings:

    {
        FunctionDocumentation::Description description = R"(
)";
        FunctionDocumentation::Syntax syntax = R"(
)";
        FunctionDocumentation::Argument argument1 = {R"(
)", R"(
)"};
        FunctionDocumentation::Argument argument2 = {R"(
)", R"(
)"};
        FunctionDocumentation::Arguments arguments = {argument1, argument2};
        FunctionDocumentation::ReturnedValue returned_value = R"(
)";
        FunctionDocumentation::Example example1 = {"",
R"(
)", R"(
)"
        };
        FunctionDocumentation::Example example2 = {"",
R"(
)", R"(
)"
        };
        FunctionDocumentation::Examples examples = {example1, example2};
        FunctionDocumentation::Categories categories = {};
        FunctionDocumentation documentation = {description, syntax, arguments, returned_value, examples, categories};

        factory.registerFunction<FunctionHalfMD5>(documentation);
    }

Note: It makes sense to split the Arguments and ReturnedValue types (see "src/Common/FunctionDocumentation.h") into sub-fields name, type, description, respectively name, type. Having to specify the type separately makes it harder to forget it.

So with all the fields filled out for the example, we'll get:

    {
        FunctionDocumentation::Description description = R"(
[Interprets](../..//sql-reference/functions/type-conversion-functions.md/#type_conversion_functions-reinterpretAsString) all the input parameters as strings and calculates the MD5 hash value for each of them. Then combines hashes, takes the first 8 bytes of the hash of the resulting string, and interprets them as [UInt64](../../../sql-reference/data-types/int-uint.md) in big-endian byte order. The function is relatively slow (5 million short strings per second per
> processor core).

Consider using the [sipHash64](../../sql-reference/functions/hash-functions.md/#hash_functions-siphash64) function instead.
)";
        FunctionDocumentation::Syntax syntax = R"(
SELECT halfMD5(par1,par2,...,parN);
)";
        FunctionDocumentation::Argument argument1 = {R"(
par1,par2,...,parN
)", R"(
The function takes a variable number of input parameters. Arguments can be any of the supported data types. For some data types calculated value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
)"};
        FunctionDocumentation::Arguments arguments = {argument1};
        FunctionDocumentation::ReturnedValue returned_value = R"(
The computed half MD5 hash of the given input params returned as a [UInt64](../../../sql-reference/data-types/int-uint.md) in big-endian byte order.
)";
        FunctionDocumentation::Example example1 = {"",
R"(
SELECT HEX(halfMD5('abc', 'cde', 'fgh'));
)", R"(
┌─hex(halfMD5('abc', 'cde', 'fgh'))─┐
│ 2C9506B7374CFAF4                  │
└───────────────────────────────────┘
)"
        };
        FunctionDocumentation::Examples examples = {example1};
        FunctionDocumentation::Categories categories = {"hash"};
        FunctionDocumentation documentation = {description, syntax, arguments, returned_value, examples, categories};

        factory.registerFunction<FunctionHalfMD5>(documentation);
    }
}

Note how all useful text is neatly left-aligned and uses no linebreaks. Pretty readable, if you ask me.

To sum up, I'd propose to proceed group-by-group, take care of the things I mentioned above, and then make adjustments as you go since there are probably plenty of things I forgot.

@rschu1ze
Copy link
Member

rschu1ze commented Oct 28, 2024

@Blargian And to reply to your original thoughts:

Ideally we automate as much of updating documentation in source as possible... the thought of working through all the functions again one by one to update documentation in source is not one I am particularly fond of (neither for you as reviewer I suspect). It seems to me that unless we standardise the current function docs to a high enough level this in itself might be a challenge. Take for instance arithmetic-functions - some have only a syntax section, some have syntax and examples, and more recently updated ones have syntax, arguments, returned value, examples. In other places we have heading 'parameters' instead of 'arguments' etc. I think that scripting something to modify C++ source documentation from the markdown will be tricky if there is too little uniformity in the structure of each markdown page.

For any given function, the in-source docs should ideally be the more verbose/exhaustive/complete version of sections "syntax", "arguments", "examples" etc. of the in-source docs and the public docs. The idea is to make this step as mechanical as possible (it will still need to be done by hand). There is no need to come up with new sections "syntax", "arguments", "examples" if neither the in-source nor the public docs contain it.

Once that is done it seems plausible to script something to update C++ source similar to what was done in Alexey's ClickHouse/ClickHouse#70289, maybe in batches or per category of functions on the docs page, and once all source files have documentation we can extend functionality of FunctionFactory to generate markdown from .cpp.

Yes. Let's do it category-by-category. The day when we can finally delete the public function docs will be so glorious.

@rschu1ze
Copy link
Member

@justindeguzman FYI ^^

@alexey-milovidov
Copy link
Member

It is ok to include Markdown directly in .cpp - while it is ugly, we have no better way to do it.
We should include the most detailed documentation in .cpp, because it will be the single source of this documentation.

@rschu1ze
Copy link
Member

rschu1ze commented Oct 29, 2024

Agree about Markdown.

Re "most detailed documentation": This is relevant for

  • system tables: Maybe we can squeeze sections like "Examples" and "See also" (as Markdown) into the Comment field of each system table (I actually see no alternative to that). Advantages: Thematically, examples and references are "comments". 2. During parsing, we can easily split these sections again. 3. Field comment would become quite long but only very few people will look at field system.columns.comment and bother.
  • functions: Here, we could add extra fields to system.function though as far as I see, the existing fields cover all relevant structures already. Anyways, function docs will require some experimentation due to their complexity so we may need to come up with some other approaches if needed.

@Blargian Blargian self-assigned this Nov 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants