[Feature] Codebook Generation from Documentation #1017

DerAndereJohannes · 2024-07-28T00:45:11Z

Description

This PR aims at creating a general Codebook that has all variables that can be generated from the NeuroKit package. The idea of this feature stems from discussion #1012 from @HeatherUrry.

The addition of this feature would allow users to more easily see what variables are extractable from NeuroKit and is useful in other software programs.

Note

This pull request should be considered more as a draft as some polishing would still be required.
For example:

Not all variables are likely included yet
The field descriptions are mainly all just copied and pasted from what was already there, maybe we can update them more ? (maybe out of scope of this PR?)
The Codebook page should probably have a better description
Currently, the new directive does not support following new lines if we are still looking at the same field, so lines can get very long. (this will most likely fail the style checks) EDIT: I fixed this as I found out that directives also use the same python new line character (\)
We could still change column structure of the codebook. The current structure is: Field Name, Field Description, Field Category and Source File Name.

Feedback and change requests are all welcome. Let me know if I should also share a copy of the documentation _build.

Proposed Changes

I added an additional Sphinx directive that takes in information in the Returns section of the python source code documentation and creates a csv file. At the same time, the directive also places a formatted version of the information in the return section, such that this information does not have to be written twice. E.g.,
The line ECG_Raw|The raw ECG Data. appends ECG_Raw,The raw ECG Data.,Electrocardiogram,ecg_process.py to the csv file and adds a formatted string * ``ECG_Raw``: The raw ECG Data. to the documentation page. Note that if the description contains a comma, the entire field is encapsulated in quotation marks i.e., ECG_foobar,"foo, bar",... to keep csv formatting intact.

The CSV from the codebook webpage like in this example:

Here is a sample codebook: neurokit_codebook.csv

I am not sure if it is possible to add the table to the Codebook page from within the documentation generation process and so therefore I added some javascript to dynamically load the csv file into the codebook page. Please let me know if this is too much.

Checklist

I have read the CONTRIBUTING file.
My PR is targeted at the dev branch (and not towards the master branch).
I ran the CODE CHECKS on the files I added or modified and fixed the errors.
I have added the newly added features to News.rst (if applicable)

- now handles quoted fields

- added column "Field Category" - added column "Source File Name"

DominiqueMakowski · 2024-08-05T14:56:23Z

Thanks for this huge endeavour @DerAndereJohannes it is quite impressive.
Given the automation that you implemented it should be relatively painless to maintain and update is that correct?

HeatherUrry · 2024-08-05T17:18:44Z

WOW. This is awesome, @DerAndereJohannes. Thank you! What can I do to help?

DerAndereJohannes · 2024-08-07T10:28:14Z

In reality, my pull request is actually quite small (from a code perspective)

Given the automation that you implemented it should be relatively painless to maintain and update is that correct?

There are really only 2 files I really mainly contributed with this change which are:

csv_codebook_directive.py: Which gets called when the directive is seen on a documentation string. It should hopefully be simple enough to understand what it is doing.
codebook.rst: Which builds the codebook page and adds some javascript to be able to display the codebook based on the new csv document.

Given that these files are quite simple and the main software that is doing the work is actually the sphinx documentation build system, I would argue that this new feature is easy to maintain.

Where all my line changes come from are simply using these directives in their correct places, replacing Returns blocks for this new system e.g.,

where the new directive converts the Returns lines with the directive and where the it then saves these to a csv on build and also writes to the documentation page as if nothing had changed. All styles are kept.

What can I do to help?

One way would be perhaps to create a better description for the codebook on the specific codebook site. As you can see in the image in my initial post, I wrote an extremely short text description for each of the sections (Codebook and Codebook Table). Although it should probably remain concise, I may have been a bit too concise? :p

Honestly, other than that, what I did was simply use the existing descriptions that were already in the documentation. There are some parts of the documentation that are not that consistent with each other and would probably require quite a bit of refactor e.g., I had to add the HRV non-linear features into the return statement here since they were all present in the description of the function, but the return block simply stated Contains non-linear HRV metrics.. My change here does mean that the data is technically duplicated (in the description and results), but I felt not that confident in rewriting these docstrings to an acceptable standard for this pull request.

However, any changes that could be done to help improve the codebook feature I am unsure if it is relevant for this PR since we would end up rewriting the majority of the docs in this PR if that were the case. I am sure the maintainers could come up with a good solution.

Thanks!

danibene

Thank you for doing all of this, it's great!

I left a couple of comments about (lack of) changes I wasn't sure were intentional. To respond to some of your points from the description:

Not all variables are likely included yet

The field descriptions are mainly all just copied and pasted from what was already there, maybe we can update them more ? (maybe out of scope of this PR?)

The Codebook page should probably have a better description

The field descriptions are mainly all just copied and pasted from what was already there, maybe we can update them more ? (maybe out of scope of this PR?)

I think it is fine to update these in separate PRs, especially given multiple people interested in contributing to this feature 😊 For example, I imagine one next step could be to make a list of the functions that need to be checked and compare these variables to the code book content (some of this checking could probably be automated).

The line ECG_Raw|The raw ECG Data. appends ECG_Raw,The raw ECG Data.,Electrocardiogram,ecg_process.py to the csv file and adds a formatted string * ``ECG_Raw``: The raw ECG Data. to the documentation page. Note that if the description contains a comma, the entire field is encapsulated in quotation marks i.e., ECG_foobar,"foo, bar",... to keep csv formatting intact.

Do you think a different separator (e.g. tabs instead of commas) would be more robust to formatting mistakes in the docstrings?

I am not sure if it is possible to add the table to the Codebook page from within the documentation generation process and so therefore I added some javascript to dynamically load the csv file into the codebook page. Please let me know if this is too much.

I don’t see why the JavaScript would be problematic (but @DominiqueMakowski feel free to chime in), though if we want to avoid it for whatever reason I think it should be possible to add the table from within the documentation generation process if before sphinx is run you update a markdown/rst file with the table. I’ve done something similar here in case helpful: https://github.com/miso-sound/miso-sound-annotate/blob/cc36d781327d7d6dec425a8178714bab40a64fa3/docs/readme/gen_summary_table.py

danibene · 2024-08-05T19:01:09Z

neurokit2/ecg/ecg_intervalrelated.py

-        * ``ECG_Rate_Mean``: the mean heart rate.
+        .. codebookadd::
+            ECG_Rate_Mean|The mean heart rate.
+
        * ``ECG_HRV``: the different heart rate variability metrices.


Was this line intentionally left as is?

Yes this was done intentionally, because I was thinking about what should be added to the codebook and I wasn't sure, if adding ECG_HRV which is basically an entry holding entries that are already accounted for the the HRV section. However, I can also see an argument for adding these "meta keys" to the codebook too. What do you think?

I think I would at least have a way to filter out "meta-keys" within the code book generation process, rather than formatting them differently in the docstring, to avoid inconsistent formatting in the documentation. It could also be helpful to have these variables saved somewhere if we wanted to programmatically validate the output of the codebook later on.

We can probably solve this using a third column which could be empty for normal variables and for "meta" keys, we could add the reference to the docs location which would give the link to the documentation that shows you what the key contains. How does that sound?

e.g., for ECG_HRV, it would look like this in codebookadd: ECG_HRV|the different heart rate variability metrics|<path to function page> where the 3rd column adds a :doc: in rst which links to the page. All other variables in the directive would not have to be changed

neurokit2/eda/eda_intervalrelated.py

neurokit2/eog/eog_intervalrelated.py

neurokit2/hrv/hrv_nonlinear.py

neurokit2/rsp/rsp_intervalrelated.py

danibene · 2024-08-07T13:42:25Z

Maybe I'm dreaming here, but I was also wondering if we could do something similar to make the "report" functionality easier to develop and maintain: #785

I think it would be helpful if we could document each method only in the docstrings of the associated function, and the extract that description in a structured format so that it can be included in the generated report without having to manually add that description in a separate file.

DerAndereJohannes · 2024-08-09T08:18:03Z

Thanks for all of your comments danibene!

For example, I imagine one next step could be to make a list of the functions that need to be checked and compare these variables to the code book content (some of this checking could probably be automated).

Do you mean a way to double check that all the variables emitted from the functions are in fact the variables that are stated in the codebook?

Do you think a different separator (e.g. tabs instead of commas) would be more robust to formatting mistakes in the docstrings?

Tabs can for sure work too and could maybe make it easier to read. However, the csv python library already automatically accounts for additional commas using the quotation marks around the field so it should be fine too. It is true, that no one will probably ever use a tab in a docstring though. Should I switch it?

I’ve done something similar here in case helpful: https://github.com/miso-sound/miso-sound-annotate/blob/cc36d781327d7d6dec425a8178714bab40a64fa3/docs/readme/gen_summary_table.py

I see you created a script and then had it run throughout your CI/CD Pipeline! This is of course also something we can do to avoid the additional javascript. With this pull request, I was aiming for just using sphinx for everything. But if I am allowed to do something similar to what you did, then I can convert it to that for the doc builder.

Maybe I'm dreaming here, but I was also wondering if we could do something similar to make the "report" functionality easier to develop and maintain: #785

This is definitely a possibility. I would imagine this looking like what you did in your gen_summary_table where we could use a no-op decorator like @gen_report to the report function and then in the CI/CD make the decorator act as a macro and replace it with a function that is defined using the docstring information. The only argument against this for me would be function testing (However, there is probably a way around this) and the idea for code generation during CI/CD can be dangerous if not looked at very carefully (e.g., like the xz attack). I would not mind creating this feature too if it is wanted

DominiqueMakowski · 2024-08-09T09:09:44Z

I'm happy to merge it when @danibene gives the green light. I have no opinion regarding the javascript so it's up to what you guys feel is the most maintainable approach 🤷

danibene · 2024-08-09T12:19:12Z

Thank you for your responses @DerAndereJohannes !

For example, I imagine one next step could be to make a list of the functions that need to be checked and compare these variables to the code book content (some of this checking could probably be automated).

Do you mean a way to double check that all the variables emitted from the functions are in fact the variables that are stated in the codebook?

Yes, exactly. It doesn’t have to be addressed in this PR if it’s a significant effort, but it might be worth considering when thinking about how to parse “meta-keys.”

Do you think a different separator (e.g. tabs instead of commas) would be more robust to formatting mistakes in the docstrings?

Tabs can for sure work too and could maybe make it easier to read. However, the csv python library already automatically accounts for additional commas using the quotation marks around the field so it should be fine too. It is true, that no one will probably ever use a tab in a docstring though. Should I switch it?

If you think the current version with the CSVs will work just as well, I’m fine with keeping it as is.

I see you created a script and then had it run throughout your CI/CD Pipeline! This is of course also something we can do to avoid the additional javascript. With this pull request, I was aiming for just using sphinx for everything. But if I am allowed to do something similar to what you did, then I can convert it to that for the doc builder.

I think both would be possible. I also don't have a strong opinion about which would be more maintainable - I do see the advantages of using sphinx for everything. Feel free to keep the current version if you'd like.

Maybe I'm dreaming here, but I was also wondering if we could do something similar to make the "report" functionality easier to develop and maintain: #785

This is definitely a possibility. I would imagine this looking like what you did in your gen_summary_table where we could use a no-op decorator like @gen_report to the report function and then in the CI/CD make the decorator act as a macro and replace it with a function that is defined using the docstring information. The only argument against this for me would be function testing (However, there is probably a way around this) and the idea for code generation during CI/CD can be dangerous if not looked at very carefully (e.g., like the xz attack). I would not mind creating this feature too if it is wanted

Do you think we could achieve this without code generation, maybe by loading the text from a file, similar to how example datasets are currently handled? (https://github.com/neuropsychology/NeuroKit/blob/master/neurokit2/data/data.py#L176-L180)

DerAndereJohannes · 2024-08-10T17:10:06Z

Thank you for the comments.

For example, I imagine one next step could be to make a list of the functions that need to be checked and compare these variables to the code book content (some of this checking could probably be automated).

Do you mean a way to double check that all the variables emitted from the functions are in fact the variables that are stated in the codebook?

Yes, exactly. It doesn’t have to be addressed in this PR if it’s a significant effort, but it might be worth considering when thinking about how to parse “meta-keys.”

I think this should be pretty easy to do even with a pytest which would require it to be conform for everyone before they merge a new PR which adds a key. I would be happy to implement this in a different PR.

If you think the current version with the CSVs will work just as well, I’m fine with keeping it as is.

You might be right that using a tab might be a lot simpler as CSV is not really a well defined format and a tab should be far rarer than commas in the descriptions. I think I will change this now.

Do you think we could achieve this without code generation, maybe by loading the text from a file, similar to how example datasets are currently handled?

I think I need to understand the report function a bit more before I can make an answer. Especially how loading the text from a file would be different than from implementing the text directly in the report function.

With all these good ideas and a strive for making the documentation more maintainable, I think it would also make sense to create a page that describes the automated processes and all the directives I have been writing (and the additional stuff to come).

DominiqueMakowski · 2024-08-13T07:22:38Z

I'll merge to keep the momentum going.
Again, thanks for that impressive work. No doubt it will be really useful to users :)

DerAndereJohannes added 15 commits July 27, 2024 14:47

Added codebookadd directive to sphinx documentation

93a26cf

added initial codebook page

834a95d

altered codebook.rst csv parsing

2298813

- now handles quoted fields

Change the directive to use | as delimiter

70e48cb

updated the *_process files with the new codebook directive

685768f

added codebookadd directive to the eda components

dfa7c65

added codebookadd directive to the eda components

373d14b

added codebookadd directive to the emg components

fff66d7

added codebookadd directive to the eog components

4e8fc20

added codebookadd directive to the ppg components

0013624

added codebookadd directive to the rsp components

77d826c

added codebookadd directive to the hrv components

240815e

Fixed codebookadd directive in the eda components

e67787e

updated codebook directive with two new columns

7ca5c57

- added column "Field Category" - added column "Source File Name"

added header to the codebook table

f051655

pull-request-size bot added the size/XL label Jul 28, 2024

adjusted csv_codebook_directive to handle newlines

086336e

DerAndereJohannes force-pushed the feature-codebook-docs branch from f1a7e91 to 1dd2e64 Compare July 28, 2024 11:19

added newline separators to all codebookadd directives

b37098f

DerAndereJohannes force-pushed the feature-codebook-docs branch from 1dd2e64 to b37098f Compare July 28, 2024 11:23

DominiqueMakowski requested a review from danibene August 5, 2024 14:55

danibene reviewed Aug 7, 2024

View reviewed changes

DominiqueMakowski merged commit b99036c into neuropsychology:dev Aug 13, 2024
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Codebook Generation from Documentation #1017

[Feature] Codebook Generation from Documentation #1017

DerAndereJohannes commented Jul 28, 2024 •

edited

Loading

DominiqueMakowski commented Aug 5, 2024

HeatherUrry commented Aug 5, 2024

DerAndereJohannes commented Aug 7, 2024 •

edited

Loading

danibene left a comment

danibene Aug 5, 2024

DerAndereJohannes Aug 9, 2024

danibene Aug 9, 2024 •

edited

Loading

DerAndereJohannes Aug 10, 2024

danibene commented Aug 7, 2024

DerAndereJohannes commented Aug 9, 2024

DominiqueMakowski commented Aug 9, 2024

danibene commented Aug 9, 2024

DerAndereJohannes commented Aug 10, 2024

DominiqueMakowski commented Aug 13, 2024

[Feature] Codebook Generation from Documentation #1017

[Feature] Codebook Generation from Documentation #1017

Conversation

DerAndereJohannes commented Jul 28, 2024 • edited Loading

Description

Note

Proposed Changes

Checklist

DominiqueMakowski commented Aug 5, 2024

HeatherUrry commented Aug 5, 2024

DerAndereJohannes commented Aug 7, 2024 • edited Loading

danibene left a comment

Choose a reason for hiding this comment

danibene Aug 5, 2024

Choose a reason for hiding this comment

DerAndereJohannes Aug 9, 2024

Choose a reason for hiding this comment

danibene Aug 9, 2024 • edited Loading

Choose a reason for hiding this comment

DerAndereJohannes Aug 10, 2024

Choose a reason for hiding this comment

danibene commented Aug 7, 2024

DerAndereJohannes commented Aug 9, 2024

DominiqueMakowski commented Aug 9, 2024

danibene commented Aug 9, 2024

DerAndereJohannes commented Aug 10, 2024

DominiqueMakowski commented Aug 13, 2024

DerAndereJohannes commented Jul 28, 2024 •

edited

Loading

DerAndereJohannes commented Aug 7, 2024 •

edited

Loading

danibene Aug 9, 2024 •

edited

Loading