Auto-generate settings and functions markdown documentation from source documentation #2730
Comments
@rschu1ze I would value your input on how we will approach auto-generating markdown from source for functions in particular. Global server settings and system tables appear to be manageable using the approach so far used for format and core settings, but markdown generation for functions is not as straightforward. Starting with the obvious, most functions are missing C++ source documentation like this one has: Ideally we automate as much of updating documentation in source as possible... the thought of working through all the functions again one by one to update documentation in source is not one I am particularly fond of (nor, I suspect, are you as reviewer). It seems to me that unless we standardise the current function docs to a high enough level this in itself might be a challenge. Take for instance arithmetic-functions: some have only a syntax section, some have syntax and examples, and more recently updated ones have syntax, arguments, returned value, examples. In other places we have the heading 'parameters' instead of 'arguments' etc. I think that scripting something to modify C++ source documentation from the markdown will be tricky if there is too little uniformity in the structure of each markdown page. Once that is done it seems plausible to script something to update C++ source similar to what was done in Alexey's PR, maybe in batches or per category of functions on the docs page, and once all source files have documentation we can extend functionality of Does that approach sound reasonable? |
@Blargian I was afraid that day would come :-) Let's start with server settings docs. I agree that these are manageable with the approach in #2714. There are two hurdles:
<query_cache>
<max_size_in_bytes>1073741824</max_size_in_bytes>
<max_entries>1024</max_entries>
<max_entry_size_in_bytes>1048576</max_entry_size_in_bytes>
<max_entry_size_in_rows>30000000</max_entry_size_in_rows>
</query_cache> ClickHouse ships with a template configuration file (here) ... as you can see, the majority of server settings are nested. (Note that even the template configuration file contains only a subset of all server settings.) There is no fundamental reason why the nesting itself could not be represented in ServerSettings.cpp. The reason it is not done is that nesting can come with additional constraints, depending on the setting. E.g. in the query cache example above, each sub-tag (e.g. <levels>
<logger>
<name>ContextAccess (default)</name>
<level>none</level>
</logger>
<logger>
<name>DatabaseOrdinary (test)</name>
<level>none</level>
</logger>
[...]
</levels> Note how Long story short: The best we can do is to auto-generate the settings in ServerSettings.cpp (which still contains 155 settings as of now) and ignore everything else. (thoughts about system table and function docs in the next comments). |
Before discussing system table and function docs: The public docs contain a page on restricting query complexity via settings. The content of this page largely overlaps with the (auto-generated) setting docs. It would be cool to consolidate both and
Related to that, there is this weird doc page for which we should probably apply the same steps. |
Aaaaand it doesn't stop there ... The publicly documented merge tree settings are at the moment also not auto-generated. They are conceptually similar to normal settings (no nesting), so this will be straightforward:
|
About auto-generating docs for system views: Instead of a single doc page, each system table has its own page (see here) - 100 in total if I counted correctly (*). Each system table is created ("attached") at startup; this happens in this file. The file contains table-level comments, e.g. SELECT database, name, comment FROM system.tables WHERE database = 'system' I'd say, the first step is to make sure that the internal comment string is in sync with the publicly documented per-table comment string (e.g. here for Later, we can auto-generate the public table comment string from the internal table comment string. The next things to consider are, for each system table, the docs of the respective column names, their data types, and their per-column comment string. Every system table is implemented by its own C++ file, e.g. this file for SELECT name, type, comment FROM system.columns WHERE database = 'system' AND table = 'users' Some of the existing system table docs have additional sections like The good thing about (*) above is that we don't need to do a big bang PR for system table docs. We can iterate table-by-table, check the rendered docs and improve incrementally. EDIT: #706 is also relevant. |
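A minimal sketch of the per-column step described above, assuming the rows returned by the system.columns query have already been loaded into memory. ColumnDoc, escapeCell, and renderColumnsMarkdown are hypothetical names, and the markdown table layout is an assumption, not necessarily the layout the existing system table pages use.

```cpp
#include <string>
#include <vector>

// Hypothetical row shape: mirrors the name, type, comment columns
// selected from system.columns in the query above.
struct ColumnDoc
{
    std::string name;
    std::string type;
    std::string comment;
};

// Escape '|' and flatten newlines so a comment cannot break the table layout.
std::string escapeCell(const std::string & text)
{
    std::string out;
    for (char c : text)
    {
        if (c == '|')
            out += "\\|";
        else if (c == '\n')
            out += ' ';
        else
            out += c;
    }
    return out;
}

// Render one markdown table row per column (assumed layout).
std::string renderColumnsMarkdown(const std::vector<ColumnDoc> & columns)
{
    std::string md = "| Column | Type | Description |\n|---|---|---|\n";
    for (const auto & col : columns)
        md += "| `" + col.name + "` | `" + col.type + "` | " + escapeCell(col.comment) + " |\n";
    return md;
}
```

Because the renderer only depends on the three queried columns, it could be run table-by-table, which fits the incremental approach suggested above.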
Function docs will be the most challenging. Newly added functions are nowadays required to have in-source docs (example). Unfortunately, the majority of functions still only come with public docs, so ... as usual ..., the first step would be to synchronize the public docs back into the in-source docs (*). This will be a lot of "fun", I promise! In-source docs are specified for each function in the form of a SELECT * FROM system.functions The public function docs are grouped into categories, e.g. "Arithmetic", "Arrays", "arrayJoin", .... Keeping this grouping when we auto-generate docs makes sense, IMHO, otherwise newbies will have a much harder time finding the correct function among 1000+ functions. We should use the
Within groups (in the public docs), the functions are loosely sorted by descending popularity / ascending obscurity. See e.g. the string functions for an example. To maintain this sorting in auto-generated docs, we'd need a relative order between functions. E.g., with the previous string function example, function 'empty' could have order = 1, 'notEmpty' could have order = 2, 'left' could have order = 5, etc. The order could be made a new field within The fields of /// Example: src/Functions/FunctionsHashingMisc.cpp
factory.registerFunction<FunctionHalfMD5>(FunctionDocumentation{
.description = R"(
[Interprets](../..//sql-reference/functions/type-conversion-functions.md/#type_conversion_functions-reinterpretAsString) all the input
parameters as strings and calculates the MD5 hash value for each of them. Then combines hashes, takes the first 8 bytes of the hash of the
resulting string, and interprets them as [UInt64](../../../sql-reference/data-types/int-uint.md) in big-endian byte order. The function is
relatively slow (5 million short strings per second per processor core).
Consider using the [sipHash64](../../sql-reference/functions/hash-functions.md/#hash_functions-siphash64) function instead.
)",
.syntax = "SELECT halfMD5(par1,par2,...,parN);",
.arguments
= {{"par1,par2,...,parN",
R"(
The function takes a variable number of input parameters. Arguments can be any of the supported data types. For some data types calculated
value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed
Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
)"}},
.returned_value = "The computed half MD5 hash of the given input params returned as a "
"[UInt64](../../../sql-reference/data-types/int-uint.md) in big-endian byte order.",
.examples
= {{"",
"SELECT HEX(halfMD5('abc', 'cde', 'fgh'));",
R"(
┌─hex(halfMD5('abc', 'cde', 'fgh'))─┐
│ 2C9506B7374CFAF4 │
└───────────────────────────────────┘
)"}}}); One could argue whether there are more C++-ish ways to encode things like text formatting, links, tables (or if that should even be possible at all). In the end, I think embedded markdown is the least bad option and after all, markdown is designed to be human-readable. The previous example and almost all other places that define in-source docs use designated initializer syntax. It is compact but with longer strings it becomes hard to read and edit. It is one of the reasons why people hate writing in-source docs. My proposal is that you use this as a template instead: {
FunctionDocumentation::Description description = "";
FunctionDocumentation::Syntax syntax = "";
FunctionDocumentation::Argument argument1 = {"", ""};
FunctionDocumentation::Argument argument2 = {"", ""};
FunctionDocumentation::Arguments arguments = {argument1, argument2};
FunctionDocumentation::ReturnedValue returned_value = "";
FunctionDocumentation::Example example1 = {"", "", ""};
FunctionDocumentation::Example example2 = {"", "", ""};
FunctionDocumentation::Examples examples = {example1, example2};
FunctionDocumentation::Categories categories = {};
FunctionDocumentation documentation = {description, syntax, arguments, returned_value, examples, categories};
factory.registerFunction<FunctionHalfMD5>(documentation);
} Since some of the fields will contain linebreaks, we can make the template even more readable by using raw strings: {
FunctionDocumentation::Description description = R"(
)";
FunctionDocumentation::Syntax syntax = R"(
)";
FunctionDocumentation::Argument argument1 = {R"(
)", R"(
)"};
FunctionDocumentation::Argument argument2 = {R"(
)", R"(
)"};
FunctionDocumentation::Arguments arguments = {argument1, argument2};
FunctionDocumentation::ReturnedValue returned_value = R"(
)";
FunctionDocumentation::Example example1 = {"",
R"(
)", R"(
)"
};
FunctionDocumentation::Example example2 = {"",
R"(
)", R"(
)"
};
FunctionDocumentation::Examples examples = {example1, example2};
FunctionDocumentation::Categories categories = {};
FunctionDocumentation documentation = {description, syntax, arguments, returned_value, examples, categories};
factory.registerFunction<FunctionHalfMD5>(documentation);
} Note: It makes sense to split the So with all the fields filled out for the example, we'll get: {
FunctionDocumentation::Description description = R"(
[Interprets](../..//sql-reference/functions/type-conversion-functions.md/#type_conversion_functions-reinterpretAsString) all the input parameters as strings and calculates the MD5 hash value for each of them. Then combines hashes, takes the first 8 bytes of the hash of the resulting string, and interprets them as [UInt64](../../../sql-reference/data-types/int-uint.md) in big-endian byte order. The function is relatively slow (5 million short strings per second per processor core).
Consider using the [sipHash64](../../sql-reference/functions/hash-functions.md/#hash_functions-siphash64) function instead.
)";
FunctionDocumentation::Syntax syntax = R"(
SELECT halfMD5(par1,par2,...,parN);
)";
FunctionDocumentation::Argument argument1 = {R"(
par1,par2,...,parN
)", R"(
The function takes a variable number of input parameters. Arguments can be any of the supported data types. For some data types calculated value of hash function may be the same for the same values even if types of arguments differ (integers of different size, named and unnamed Tuple with the same data, Map and the corresponding Array(Tuple(key, value)) type with the same data).
)"};
FunctionDocumentation::Arguments arguments = {argument1};
FunctionDocumentation::ReturnedValue returned_value = R"(
The computed half MD5 hash of the given input params returned as a [UInt64](../../../sql-reference/data-types/int-uint.md) in big-endian byte order.
)";
FunctionDocumentation::Example example1 = {"",
R"(
SELECT HEX(halfMD5('abc', 'cde', 'fgh'));
)", R"(
┌─hex(halfMD5('abc', 'cde', 'fgh'))─┐
│ 2C9506B7374CFAF4 │
└───────────────────────────────────┘
)"
};
FunctionDocumentation::Examples examples = {example1};
FunctionDocumentation::Categories categories = {"hash"};
FunctionDocumentation documentation = {description, syntax, arguments, returned_value, examples, categories};
factory.registerFunction<FunctionHalfMD5>(documentation);
} Note how all useful text is neatly left-aligned and uses no linebreaks. Pretty readable, if you ask me. To sum up, I'd propose to proceed group-by-group, take care of the things I mentioned above, and then make adjustments as you go since there are probably plenty of things I forgot. |
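The relative-order idea from this comment can be sketched as follows, assuming each function carries a category and an integer order. FunctionEntry, its `order` field, and sortForDocs are hypothetical: the `order` field is only proposed above and is not claimed to exist in FunctionDocumentation today.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical, simplified view of one documented function.
struct FunctionEntry
{
    std::string name;
    std::string category;
    int order; // proposed relative order within the category (assumption)
};

// Sort by category first, then by the proposed order, then by name as a tie-breaker,
// so the generated page keeps the "popular first" layout within each group.
std::vector<FunctionEntry> sortForDocs(std::vector<FunctionEntry> functions)
{
    std::stable_sort(functions.begin(), functions.end(),
        [](const FunctionEntry & a, const FunctionEntry & b)
        {
            return std::tie(a.category, a.order, a.name) < std::tie(b.category, b.order, b.name);
        });
    return functions;
}
```

Leaving gaps between order values (1, 2, 5, ...), as in the string function example, would let new functions be slotted in without renumbering their neighbours.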
@Blargian And to reply to your original thoughts:
For any given function, the in-source docs should ideally be the more verbose/exhaustive/complete version of sections "syntax", "arguments", "examples" etc. of the in-source docs and the public docs. The idea is to make this step as mechanical as possible (it will still need to be done by hand). There is no need to come up with new sections "syntax", "arguments", "examples" if neither the in-source nor the public docs contain them.
Yes. Let's do it category-by-category. The day when we can finally delete the public function docs will be so glorious. |
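As a rough illustration of the "take the more complete version per section" rule, here is a sketch that uses text length as a stand-in for completeness. That proxy is an assumption (a human would still need to review the result), and Sections and mergeSections are hypothetical names.

```cpp
#include <map>
#include <string>

// Section name ("syntax", "arguments", "examples", ...) mapped to its text.
using Sections = std::map<std::string, std::string>;

// For each section, keep whichever side has the longer text.
// Sections missing from both inputs are not invented.
Sections mergeSections(const Sections & in_source, const Sections & public_docs)
{
    Sections merged = in_source;
    for (const auto & entry : public_docs)
    {
        auto it = merged.find(entry.first);
        if (it == merged.end() || entry.second.size() > it->second.size())
            merged[entry.first] = entry.second;
    }
    return merged;
}
```

A helper like this could at most pre-fill a draft; deciding which wording is actually better remains manual work.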
@justindeguzman FYI ^^ |
It is ok to include Markdown directly in .cpp - while it is ugly, we have no better way to do it. |
Agree about Markdown. Re "most detailed documentation": This is relevant for
|
@rschu1ze regarding system tables - possibly a silly question, I'm not sure if it's feasible, but could we not maybe create a new At least that way we keep |
Agree that is an eye-sore. But don't worry too much:
It is not a difficult thing to add new "technical" system tables which are only useful in the context of generating docs from the sources. I'd say we can decide if we like to do that after all docs are auto-generated, depending how painful the system tables are to look at then. |
I'm ok with it not fitting on the screen. It is alright. The database is for having data, and when you do I'm ok with keeping everything in the |
@rschu1ze I'd like to propose something. Instead of adding markdown documentation directly into the setting description, let's add a new column called
The reason I propose this is that the current docs are incomplete as is. We can copy across the more 'complete' (based on length at least) version of the documentation from the .md to the source, but I foresee having to make many changes in future to correct broken links, fix typos, change examples etc. Having everything in one text field is going to make it really hard to maintain, not to mention it already makes the source look horrible. Taking as an arbitrary example, this setting, from the already autogenerated format settings: Purely for argument's sake, suppose we decide that notes are best displayed after possible values, we would need to go through every description in the format settings and rearrange the text. If we have the information separate we only need to rearrange how we render functions in one place and it will be done for every function. If a page gets moved and a link is no longer valid, then we just need to update the array for key Building on Alexey's idea to have documentation accessible from ClickHouse even without internet documentation, I think it could be cool if at a later point we introduced functionality for displaying the documentation beautifully directly within ClickHouse. I am picturing being able to write something like: SELECT DOCUMENTATION(input_format_defaults_for_omitted_fields) and have the documentation for that setting be displayed in the console, similar to |
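To make the "rearrange in one place" argument concrete, here is a small sketch in which the parts of a setting's documentation are stored separately and the section order is decided only by the renderer. SettingDoc, Section, renderBullets, and renderSetting are hypothetical names, and the field set is an assumption based on the parts mentioned above (description, possible values, notes).

```cpp
#include <string>
#include <vector>

// Hypothetical structured documentation for one setting.
struct SettingDoc
{
    std::string name;
    std::string description;
    std::vector<std::string> possible_values;
    std::vector<std::string> notes;
};

enum class Section { Description, PossibleValues, Notes };

// Render a titled bullet list; empty lists produce no output.
std::string renderBullets(const std::string & title, const std::vector<std::string> & items)
{
    if (items.empty())
        return "";
    std::string md = title + "\n\n";
    for (const auto & item : items)
        md += "- " + item + "\n";
    return md + "\n";
}

// The layout argument is the single place that decides section order.
std::string renderSetting(const SettingDoc & doc, const std::vector<Section> & layout)
{
    std::string md = "## " + doc.name + "\n\n";
    for (Section section : layout)
    {
        switch (section)
        {
            case Section::Description: md += doc.description + "\n\n"; break;
            case Section::PossibleValues: md += renderBullets("Possible values:", doc.possible_values); break;
            case Section::Notes: md += renderBullets("Notes:", doc.notes); break;
        }
    }
    return md;
}
```

Moving notes before or after possible values then means changing the layout passed to renderSetting once, instead of editing every description string.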
This could be too much of a structure. The first priority is to make inline docs for other parts of the code, e.g., functions, aggregate functions, table functions, formats, and system tables, and have 100% completeness of the published docs. Clarity and details will be a 2nd priority. Actually I'm not opposed to your idea, if you can pull it off entirely by yourself. |
@Blargian There is a misunderstanding.
Agree.
Agree. The in-source docs are already stored separately.
Regarding markdown vs. JSON: My main concern is that we should make it as easy as possible for developers to write in-source docs. Markdown is the least bad choice - it's human-readable while JSON is designed for machines. I don't mind too much which data type Regarding priorities: Synchronizing public and in-source docs is a tedious task. The main problem will be that in-source and public docs will exist side-by-side for a while. This introduces a risk of new inconsistencies in already synchronized docs (for a function, a setting, a system table, etc.). To avoid this, I suggest proceeding iteratively and deleting the public version immediately once we are happy with the in-source version. My proposal is to start with something simple (e.g. system table docs), then add/change the scripts to auto-generate public from embedded docs, then check docs for each system table one-by-one and delete the public docs as you go. Side note 1: I heard @max-vostrikov also works on this. Please synchronize with him to avoid double work. Side note 2: ClickHouse/ClickHouse#73989 |
PRs 70289 and 2714 introduced changes to auto-generate markdown documentation from source for format settings and core settings.
We would like to do the same for: