Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-generate settings and functions markdown documentation from source documentation #2730

Open
4 tasks
Blargian opened this issue Oct 25, 2024 · 17 comments
Open
4 tasks
Assignees
Labels

Comments

@Blargian
Copy link
Member

Blargian commented Oct 25, 2024

PRs 70289 and 2714 introduced changes to auto generate markdown documentation from source for format settings and core settings,

We would like to do the same for:

  • Global Server Settings
  • Core Settings
  • Functions
  • MergeTree Settings
@Blargian
Copy link
Member Author

@rschu1ze I would value your input on how we will approach auto generating markdown from source for functions in particular. Global server settings and system tables appear to be manageable using the approach so far used for format and core settings, but markdown generation for functions is not as straight forward.

Starting with the obvious, most functions are missing C++ source documentation like this one has:

https://github.com/ClickHouse/ClickHouse/blob/aaca3b61567e9f22c69c441aef629f39b3d8b406/src/Functions/sin.cpp#L17-L23

Ideally we automate as much of updating documentation in source as possible... the thought of working through all the functions again one by one to update documentation in source is not one I am particularly fond of (neither for you as reviewer I suspect). It seems to me that unless we standardise the current function docs to a high enough level this in itself might be a challenge. Take for instance arithmetic-functions - some have only a syntax section, some have syntax and examples, and more recently updated ones have syntax, arguments, returned value, examples. In other places we have heading 'parameters' instead of 'arguments' etc. I think that scripting something to modify C++ source documentation from the markdown will be tricky if there is too little uniformity in the structure of each markdown page.

Once that is done it seems plausible to script something to update C++ source similar to what was done in Alexey's PR, maybe in batches or per category of functions on the docs page, and once all source files have documentation we can extend functionality of FunctionFactory to generate markdown from .cpp.

Does that approach sound reasonable?

@rschu1ze
Copy link
Member

rschu1ze commented Oct 28, 2024

@Blargian I was afraid that day would come :-)

Let's start with server settings docs. I agree that these are manageable with the approach in #2714. There are two hurdles:

  1. The existing public docs (server settings and the internal docs in src/Core/ServerSettings.cpp may not be in sync. So Step 1 would be to go through each setting individually and make sure the more correct/sophisticated/verbose version appears in ServerSettings.cpp.

  2. For the normal and session settings, corresponding files Core/Settings.cpp and Core/FormatFactorySettings.h contain all relevant settings. As a result, system table system.settings shows all normal settings (and I guess it contains the format settings as well ... didn't check but it also doesn't matter for the sake of the argument). The same is not true for ServerSettings.cpp, i.e. there are server settings outside that file. An example for which I am guilty of are the query cache settings. They are documented publicly here but they are not part of ServerSettings.cpp or system.server_settings. The main reason for this is that server settings can be nested, e.g. for query_cache settings you could have this XML documentation:

<query_cache>
    <max_size_in_bytes>1073741824</max_size_in_bytes>
    <max_entries>1024</max_entries>
    <max_entry_size_in_bytes>1048576</max_entry_size_in_bytes>
    <max_entry_size_in_rows>30000000</max_entry_size_in_rows>
</query_cache>

ClickHouse ships with a template configuration file (here) ... as you can see, the majority of server settings is nested. (Note that even the template configuration file contains only a subset of all server settings).

There is no principal reason why the nesting itself could not be represented in ServerSettings.cpp. The reason it is not done is that nesting can come with additional constraints, depending on the setting. E.g. in the query cache example above, each sub-tag (e.g. <max_size_in_bytes>) may occur at most once below <query_cache>. Such constraints can be different for other nested structures, for example the logging configuration (public docs) may contain this XML:

<levels>
    <logger>
        <name>ContextAccess (default)</name>
        <level>none</level>
    </logger>
    <logger>
        <name>DatabaseOrdinary (test)</name>
        <level>none</level>
    </logger>
    [...]
</levels>

Note how <logger> occurs arbitrarily many times below <levels>. Such constraints are too complex to express in ServerSettings.cpp.

Long story short: The best we can do is to auto-generate the settings in ServerSettings.cpp (which still contains 155 settings as of now) and ignore everything else.

(thoughts about system table and function docs in the next comments).

@rschu1ze
Copy link
Member

rschu1ze commented Oct 28, 2024

Before discussing system table and function docs:

The public docs contain a page on restricting query complexity via settings. The content of this page largely overlaps with the (auto-generated) setting docs.

It would be cool to consolidate both and

  1. move the former into the latter (i.e. Settings.cpp), or extend the latter
  2. change the former to be more of a guide that merely links into the settings docs.

Related to that, there is this weird doc page for which we should probably apply the same steps.

@rschu1ze
Copy link
Member

Aaaaand it doesn't stop there ...

The publicly documented merge tree settings are at the moment also not auto-generated. They are conceptually similar to normal settings (no nesting), so this will be straightforward:

  1. Consolidate the public docs and MergeTreeSettings.cpp (code).

  2. Auto-generate docs using the same approach as Autogenerate settings #2714.

@rschu1ze
Copy link
Member

rschu1ze commented Oct 28, 2024

About auto-generating docs for system views:

Instead of a single doc page, each system table has its own page (see here) - 100 in total if I counted correctly (*).

Each system table is created ("attached") at startup, this happens in this file. The file contains table-level comments, e.g. This table contains a single row with a single dummy UInt8 column containing the value 0. Used when the table is not specified explicitly, for example in queries like SELECT 1. This is what is shown by

SELECT database, name, comment FROM system.tables WHERE database = 'system'

I'd say, the first step is to make sure that the internal comment string is in-sync with the publicy documented per-table comment string (e.g. here for system.users).

Later, we can auto-generate the public table comment string from the internal table comment string.

The next thing to consider are, for each system table, the docs of the resepective column names, their data types, and their per-column comment string. Every system table is implemented by its own C++ file, e.g. this file for sytem.users. This is what is shown by

SELECT name, type, comment FROM system.columns WHERE database = 'system' AND table = 'users'

Some of the existing system table docs have additional sections like See also and Examples, e.g. system.backup_log. We better don't auto-generate these sections because of two reasons 1: they typically make heavy use of markdown, e.g. to create markdown tables and link other pages. Including additional formatting in C++ is rather ugly. 2. We'd need to expose these fields from the database via standard system views. This either means to add new SEE_ALSO and EXAMPLES columns to system.tables (which would be awful) or encoding/squeezing more information into the existing field system.tables.COMMENT and parsing the individual fields back similar to https://github.com/ClickHouse/clickhouse-docs/pull/2714/files#diff-0ed5a36db0e1c468253084936c7502d77f11a46a0c3b64038d09b1ffd676dcf5R54 (and that would be ugly as well). 3. the number of system where we need to maintain "See also" and "Examples" sections manually a lot smaller than the number of functions that have similar sections. I'd say the extra overhead is okay.

The good thing about (*) above is that we don't need to do a big bang PR for system table docs. We can iterate table-by-table, check the rendered docs and improve incrementally.

EDIT: #706 is also relevant.

@rschu1ze
Copy link
Member

rschu1ze commented Oct 28, 2024

Function docs will be the most challenging.

Newly added functions are nowadays required to have in-source docs (example). Unfortunately, the majority of functions still only comes with public docs, so ... as usual ..., the first step would be to synchronize the public docs back into the in-source docs (*). This wil be a lot of "fun", I promise!

In-source docs are specified for each function in the form of a FunctionDocumentation object (header, source). The exact fields of this class (e.g. description, syntax, arguments) and the formatting they expect were invented ca. a year ago and they are not set in stone. I.e. if we think we should change then, we can do so. System table system.functions gobbles everything up (code):

SELECT * FROM system.functions

The public function docs are grouped into categories, e.g. "Arithmetic", "Arrays", "arrayJoin", .... Keeping this grouping when we auto-generate docs makes sense, IMHO, otherwise newbies will have a much harder time to find the correct function of 1000+ functions. We should use the categories field in FunctionDocumentation, respectively in system.functions to group functions. There are currently three problems:

  1. More than one category can be defined per function (i.e. FunctionDocumentation::Category is defined as std::set). For simplicity, we should allow only a single category.
  2. Tons of functions don't set the category in their in-source docs. That should be easy to fix.
  3. There is no one-to-one map between categories (e.g. "UUID") and public function documentation groups (e.g. "Functions for Working with UUIDs" (page). We should put this mapping as a comment into "src/Common/FunctionDocumentation.h".

Within groups (in the public docs), the functions are loosely sorted by descending popularity / ascending obscurity. See e.g. the string functions for an example. To maintain this sorting in auto-generated docs, we'd need a relative order between functions. E.g., with the previous string function example, function 'empty' could have order = 1, 'notEmpty' could have order = 2, 'left' could have order = 5, etc. The order could be made a new field within FunctionDocumentation. BUT: As you can imagine, maintaining the existing order within a group when a new function is added or deleted will be a nightmare. For example, adding a function "in the middle" means all subsequent functions will need to have their order updated. I am therefore willing to lift the popularity sorting and to sort all functions in auto-generated docs alphabetically. That should be okay assuming that groups contain sufficiently few functions each.

The fields of FunctionDocumentation may contain markdown, e.g.

/// Example: src/Functions/FunctionsHashingMisc.cpp

    factory.registerFunction<FunctionHalfMD5>(FunctionDocumentation{
        .description = R"(
[Interprets](../..//sql-reference/functions/type-conversion-functions.md/#type_conversion_functions-reinterpretAsString) all the input
parameters as strings and calculates the MD5 hash value for each of them. Then combines hashes, takes the first 8 bytes of the hash of the
resulting string, and interprets them as [UInt64](../../../sql-reference/data-types/int-uint.md) in big-endian byte order. The function is
relatively slow (5 million short strings per second per processor core).

Consider using the [sipHash64](../../sql-reference/functions/hash-functions.md/#hash_functions-siphash64) function instead.
                       )",
        .syntax = "SELECT halfMD5(par1,par2,...,parN);",
        .arguments
        = {{"par1,par2,...,parN",
            R"(
The function takes a variable number of input parameters. Arguments can be any of the supported data types. For some data types calculated
value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed
Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
                       )"}},
        .returned_value = "The computed half MD5 hash of the given input params returned as a "
                          "[UInt64](../../../sql-reference/data-types/int-uint.md) in big-endian byte order.",
        .examples
        = {{"",
            "SELECT HEX(halfMD5('abc', 'cde', 'fgh'));",
            R"(
┌─hex(halfMD5('abc', 'cde', 'fgh'))─┐
│ 2C9506B7374CFAF4                  │
└───────────────────────────────────┘
            )"}}});

One could argue if there are more C++-ish ways to encode things like text formatting, links, tables (or if that should even be possible at all). In the end, I think embedded markdown is the least bad option and after all, markdown is designed to be human-readable.

The previous example and almost all other places that define in-source docs uses designated initializer syntax. It is compact but with longer strings it becomes hard to read and edit. It is one of the reasons why people hate writing in-source docs. My proposal is that you use this as a template instead:

{
    FunctionDocumentation::Description description = "";
    FunctionDocumentation::Syntax syntax = "";
    FunctionDocumentation::Argument argument1 = {"", ""};
    FunctionDocumentation::Argument argument2 = {"", ""};
    FunctionDocumentation::Arguments arguments = {argument1, argument2};
    FunctionDocumentation::ReturnedValue returned_value = "";
    FunctionDocumentation::Example example1 = {"", "", ""};
    FunctionDocumentation::Example example2 = {"", "", ""};
    FunctionDocumentation::Examples examples = {example1, example2};
    FunctionDocumentation::Categories categories = {};
    FunctionDocumentation documentation = {description, syntax, arguments, returned_value, examples, categories};

    factory.registerFunction<FunctionHalfMD5>(documentation);
}

Since some of the fields will contain linebreaks, we can make the template even more readable by using raw strings:

    {
        FunctionDocumentation::Description description = R"(
)";
        FunctionDocumentation::Syntax syntax = R"(
)";
        FunctionDocumentation::Argument argument1 = {R"(
)", R"(
)"};
        FunctionDocumentation::Argument argument2 = {R"(
)", R"(
)"};
        FunctionDocumentation::Arguments arguments = {argument1, argument2};
        FunctionDocumentation::ReturnedValue returned_value = R"(
)";
        FunctionDocumentation::Example example1 = {"",
R"(
)", R"(
)"
        };
        FunctionDocumentation::Example example2 = {"",
R"(
)", R"(
)"
        };
        FunctionDocumentation::Examples examples = {example1, example2};
        FunctionDocumentation::Categories categories = {};
        FunctionDocumentation documentation = {description, syntax, arguments, returned_value, examples, categories};

        factory.registerFunction<FunctionHalfMD5>(documentation);
    }

Note: It makes sense to split the Arguments and ReturnedValue types (see "src/Common/FunctionDocumentation.h") into sub-fields name, type, description, respectively name, type. Having to specify the type separately makes it harder to forget it.

So with all the fields filled out for the example, we'll get:

    {
        FunctionDocumentation::Description description = R"(
[Interprets](../..//sql-reference/functions/type-conversion-functions.md/#type_conversion_functions-reinterpretAsString) all the input parameters as strings and calculates the MD5 hash value for each of them. Then combines hashes, takes the first 8 bytes of the hash of the resulting string, and interprets them as [UInt64](../../../sql-reference/data-types/int-uint.md) in big-endian byte order. The function is relatively slow (5 million short strings per second per
> processor core).

Consider using the [sipHash64](../../sql-reference/functions/hash-functions.md/#hash_functions-siphash64) function instead.
)";
        FunctionDocumentation::Syntax syntax = R"(
SELECT halfMD5(par1,par2,...,parN);
)";
        FunctionDocumentation::Argument argument1 = {R"(
par1,par2,...,parN
)", R"(
The function takes a variable number of input parameters. Arguments can be any of the supported data types. For some data types calculated value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
)"};
        FunctionDocumentation::Arguments arguments = {argument1};
        FunctionDocumentation::ReturnedValue returned_value = R"(
The computed half MD5 hash of the given input params returned as a [UInt64](../../../sql-reference/data-types/int-uint.md) in big-endian byte order.
)";
        FunctionDocumentation::Example example1 = {"",
R"(
SELECT HEX(halfMD5('abc', 'cde', 'fgh'));
)", R"(
┌─hex(halfMD5('abc', 'cde', 'fgh'))─┐
│ 2C9506B7374CFAF4                  │
└───────────────────────────────────┘
)"
        };
        FunctionDocumentation::Examples examples = {example1};
        FunctionDocumentation::Categories categories = {"hash"};
        FunctionDocumentation documentation = {description, syntax, arguments, returned_value, examples, categories};

        factory.registerFunction<FunctionHalfMD5>(documentation);
    }
}

Note how all useful text is neatly left-aligned and uses no linebreaks. Pretty readable, if you ask me.

To sum up, I'd propose to proceed group-by-group, take care of the things I mentioned above, and then make adjustments as you go since there are probably plenty of things I forgot.

@rschu1ze
Copy link
Member

rschu1ze commented Oct 28, 2024

@Blargian And to reply to your original thoughts:

Ideally we automate as much of updating documentation in source as possible... the thought of working through all the functions again one by one to update documentation in source is not one I am particularly fond of (neither for you as reviewer I suspect). It seems to me that unless we standardise the current function docs to a high enough level this in itself might be a challenge. Take for instance arithmetic-functions - some have only a syntax section, some have syntax and examples, and more recently updated ones have syntax, arguments, returned value, examples. In other places we have heading 'parameters' instead of 'arguments' etc. I think that scripting something to modify C++ source documentation from the markdown will be tricky if there is too little uniformity in the structure of each markdown page.

For any given function, the in-source docs should ideally be the more verbose/exhaustive/complete version of sections "syntax", "arguments", "examples" etc. of the in-source docs and the public docs. The idea is to make this step as mechanical as possible (it will still need to be done by hand). There is no need to come up with new sections "syntax", "arguments", "examples" if neither the in-source nor the public docs contain it.

Once that is done it seems plausible to script something to update C++ source similar to what was done in Alexey's ClickHouse/ClickHouse#70289, maybe in batches or per category of functions on the docs page, and once all source files have documentation we can extend functionality of FunctionFactory to generate markdown from .cpp.

Yes. Let's do it category-by-category. The day when we can finally delete the public function docs will be so glorious.

@rschu1ze
Copy link
Member

@justindeguzman FYI ^^

@alexey-milovidov
Copy link
Member

It is ok to include Markdown directly in .cpp - while it is ugly, we have no better way to do it.
We should include the most detailed documentation in .cpp, because it will be the single source of this documentation.

@rschu1ze
Copy link
Member

rschu1ze commented Oct 29, 2024

Agree about Markdown.

Re "most detailed documentation": This is relevant for

  • system tables: Maybe we can squeeze sections like "Examples" and "See also" (as Markdown) into the Comment field of each system table (I actually see no alternative to that). Advantages: Thematically, examples and references are "comments". 2. During parsing, we can easily split these sections again. 3. Field comment would become quite long but only very few people will look at field system.columns.comment and bother.
  • functions: Here, we could add extra fields to system.function though as far as I see, the existing fields cover all relevant structures already. Anyways, function docs will require some experimentation due to their complexity so we may need to come up with some other approaches if needed.

@Blargian Blargian self-assigned this Nov 1, 2024
@Blargian
Copy link
Member Author

Blargian commented Nov 15, 2024

Agree about Markdown.

Re "most detailed documentation": This is relevant for

  • system tables: Maybe we can squeeze sections like "Examples" and "See also" (as Markdown) into the Comment field of each system table (I actually see no alternative to that). Advantages: Thematically, examples and references are "comments". 2. During parsing, we can easily split these sections again. 3. Field comment would become quite long but only very few people will look at field system.columns.comment and bother.
  • functions: Here, we could add extra fields to system.function though as far as I see, the existing fields cover all relevant structures already. Anyways, function docs will require some experimentation due to their complexity so we may need to come up with some other approaches if needed.

@rschu1ze regarding system tables - possibly a silly question, i'm not sure if it's feasible, but could we not maybe create a new system.docs.settings table which would have the setting name and a field for the example in markdown? The markdown in system.settings is a real eye-sore now that we have put markdown in the description field:

image

At least that way we keep system.settings respectable looking and 'hide' out the ugly bit.

@rschu1ze
Copy link
Member

rschu1ze commented Nov 15, 2024

Agree that is an eye-sore. But don't worry too much:

  1. users can always do a SELECT name, value, changed on system.settings instead of a SELECT *. Alternatively, a different output format also helps, e.g. SELECT * FROM system.settings FORMAT Vertical.
  2. many other system tables are so wide that their output doesn't fit on the screen, and this is because of fields other then documentation fields. I'll paste a screenshot of SELECT * FROM system.tables below.

It is not a difficult thing to add new "technical" system tables which are only useful in the context of generating docs from the sources. I'd say we can decide if we like to do that after all docs are auto-generated, depending how painful the system tables are to look at then.

grafik

@alexey-milovidov
Copy link
Member

I'm ok with it not fitting on the screen. It is alright. The database is for having data, and when you do SELECT * it is expected that a lot of unnecessary data will return.

I'm ok with keeping everything in the description field without introducing more structure for the examples.

@rschu1ze
Copy link
Member

@max-vostrikov

@gingerwizard gingerwizard changed the title Auto-generate markdown from source documentation Auto-generate settings and functions markdown documentation from source documentation Dec 17, 2024
@Blargian
Copy link
Member Author

Blargian commented Dec 24, 2024

@rschu1ze I'd like to propose something. Instead of adding markdown documentation directly into the setting description, let's add a new column called documentation or something similar of type JSON. In it we will store our documentation data associated with each setting such as:

  • description
  • notes (with type of note: warning, information etc, and the text for the note)
  • possible_values
  • examples (which could have any number of children each with keys example text, code block)
  • see_also (containing links to other relevant functions)
    etc.

The reason I propose this is that the current docs are incomplete as is. We can copy across the more 'complete' (based on length at least) version of the documentation from the .md to the source, but I foresee having to make many changes in future to correct broken links, fix typos, change examples etc. Having everything in one text field is going to make it really hard to maintain, not to mention it already makes the source look horrible.

Taking as an arbitrary example, this setting, from the already autogenerated format settings:

image

Purely for argument sake, suppose we decide that notes are best displayed after possible values, we would need to go through every description in the format settings and rearrange the text. If we have the information separate we only need to rearrange how we render functions in one place and it will be done for every function. If a page gets moved and a link is no longer valid, then we just need to update the array for key documentation.see_also for example.

Building on Alexey's idea to have documentation accessible from ClickHouse even without internet documentation, I think it could be cool if at a later point we introduced functionality for displaying the documentation beautifully directly within ClickHouse. I am picturing being able to write something like:

SELECT DOCUMENTATION(input_format_defaults_for_omitted_fields) 

and have the documentation for that setting be displayed in the console, similar to man in bash.
Having the information structured like that will make it easier to work with for that purpose as well I would imagine, as opposed to one lump of text containing ":::note"s and "sql"s. It's better if we leave that to the front-end and store purely content.

@alexey-milovidov
Copy link
Member

This could be too much of a structure. The first priority is to make inline docs for other parts of the code, e.g., functions, aggregate functions, table functions, formats, and system tables, and have 100% completeness of the published docs. Clarity and details will be a 2nd priority.

Actually I'm not opposed to your idea, if you can pull it over entirely by yourself.

@rschu1ze
Copy link
Member

@Blargian There is a misunderstanding.

Having everything in one text field is going to make it really hard to maintain, not to mention it already makes the source look horrible.

Agree.

If we have the information separate we only need to rearrange how we render functions in one place and it will be done for every function.

Agree.

The in-source docs are already stored separately.

Regarding markdown vs. JSON: My main concern is that we should make it developers as easy as possible to write in-source docs. Markdown is the least bad choice - it's human-readable while JSON is designed for machines. I don't mind too much which data type system.functions use for docs (String for markdown, JSON) but for simplicity, String/markdown seems the better choice too.

Regarding priorities: Synchronizing public and in-source docs is a tedious task. The main problem will be that in-source and public docs will exist side-by-side for a while. This introduces a risk of new inconsistencies in already synchronized docs (for a function, a setting, a system table, etc.). To avoid this, I suggest to go iteratively and delete the public version immediately once we are happy with the in-source version. My proposal is to start with something simple (e.g. system table docs), then add/change the scripts to auto-generate public from embedded docs, then check docs for each system table one-by-one and delete the public docs as you go.

Side note 1: I heard @max-vostrikov also works on this. Please synchronize with him to avoid double work.

Side note 2: ClickHouse/ClickHouse#73989

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants