Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Codebook Generation from Documentation #1017

Merged

Conversation

DerAndereJohannes
Copy link
Contributor

@DerAndereJohannes DerAndereJohannes commented Jul 28, 2024

Description

This PR aims at creating a general Codebook that has all variables that can be generated from the NeuroKit package. The idea of this feature stems from discussion #1012 from @HeatherUrry.

The addition of this feature would allow users to more easily see what variables are extractable from NeuroKit and is useful in other software programs.

Note

This pull request should be considered more as a draft as some polishing would still be required.
For example:

  • Not all variables are likely included yet
  • The field descriptions are mainly all just copied and pasted from what was already there, maybe we can update them more ? (maybe out of scope of this PR?)
  • The Codebook page should probably have a better description
  • Currently, the new directive does not support following new lines if we are still looking at the same field, so lines can get very long. (this will most likely fail the style checks) EDIT: I fixed this as I found out that directives also use the same python new line character (\)
  • We could still change column structure of the codebook. The current structure is: Field Name, Field Description, Field Category and Source File Name.

Feedback and change requests are all welcome. Let me know if I should also share a copy of the documentation _build.

Proposed Changes

I added an additional Sphinx directive that takes in information in the Returns section of the python source code documentation and creates a csv file. At the same time, the directive also places a formatted version of the information in the return section, such that this information does not have to be written twice. E.g.,
The line ECG_Raw|The raw ECG Data. appends ECG_Raw,The raw ECG Data.,Electrocardiogram,ecg_process.py to the csv file and adds a formatted string * ``ECG_Raw``: The raw ECG Data. to the documentation page. Note that if the description contains a comma, the entire field is encapsulated in quotation marks i.e., ECG_foobar,"foo, bar",... to keep csv formatting intact.

The CSV from the codebook webpage like in this example:
codebook-page-visual

Here is a sample codebook: neurokit_codebook.csv

I am not sure if it is possible to add the table to the Codebook page from within the documentation generation process and so therefore I added some javascript to dynamically load the csv file into the codebook page. Please let me know if this is too much.

Checklist

  • I have read the CONTRIBUTING file.
  • My PR is targeted at the dev branch (and not towards the master branch).
  • I ran the CODE CHECKS on the files I added or modified and fixed the errors.
  • I have added the newly added features to News.rst (if applicable)

@DominiqueMakowski
Copy link
Member

Thanks for this huge endeavour @DerAndereJohannes it is quite impressive.
Given the automation that you implemented it should be relatively painless to maintain and update is that correct?

@HeatherUrry
Copy link

WOW. This is awesome, @DerAndereJohannes. Thank you! What can I do to help?

@DerAndereJohannes
Copy link
Contributor Author

DerAndereJohannes commented Aug 7, 2024

In reality, my pull request is actually quite small (from a code perspective)

Given the automation that you implemented it should be relatively painless to maintain and update is that correct?

There are really only 2 files I really mainly contributed with this change which are:

  • csv_codebook_directive.py: Which gets called when the directive is seen on a documentation string. It should hopefully be simple enough to understand what it is doing.
  • codebook.rst: Which builds the codebook page and adds some javascript to be able to display the codebook based on the new csv document.

Given that these files are quite simple and the main software that is doing the work is actually the sphinx documentation build system, I would argue that this new feature is easy to maintain.

Where all my line changes come from are simply using these directives in their correct places, replacing Returns blocks for this new system e.g.,
image

where the new directive converts the Returns lines with the directive and where the it then saves these to a csv on build and also writes to the documentation page as if nothing had changed. All styles are kept.

What can I do to help?

One way would be perhaps to create a better description for the codebook on the specific codebook site. As you can see in the image in my initial post, I wrote an extremely short text description for each of the sections (Codebook and Codebook Table). Although it should probably remain concise, I may have been a bit too concise? :p

Honestly, other than that, what I did was simply use the existing descriptions that were already in the documentation. There are some parts of the documentation that are not that consistent with each other and would probably require quite a bit of refactor e.g., I had to add the HRV non-linear features into the return statement here since they were all present in the description of the function, but the return block simply stated Contains non-linear HRV metrics.. My change here does mean that the data is technically duplicated (in the description and results), but I felt not that confident in rewriting these docstrings to an acceptable standard for this pull request.

However, any changes that could be done to help improve the codebook feature I am unsure if it is relevant for this PR since we would end up rewriting the majority of the docs in this PR if that were the case. I am sure the maintainers could come up with a good solution.

Thanks!

Copy link
Collaborator

@danibene danibene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing all of this, it's great!

I left a couple of comments about (lack of) changes I wasn't sure were intentional. To respond to some of your points from the description:

  • Not all variables are likely included yet

  • The field descriptions are mainly all just copied and pasted from what was already there, maybe we can update them more ? (maybe out of scope of this PR?)

  • The Codebook page should probably have a better description

  • The field descriptions are mainly all just copied and pasted from what was already there, maybe we can update them more ? (maybe out of scope of this PR?)

I think it is fine to update these in separate PRs, especially given multiple people interested in contributing to this feature 😊 For example, I imagine one next step could be to make a list of the functions that need to be checked and compare these variables to the code book content (some of this checking could probably be automated).

The line ECG_Raw|The raw ECG Data. appends ECG_Raw,The raw ECG Data.,Electrocardiogram,ecg_process.py to the csv file and adds a formatted string * ``ECG_Raw``: The raw ECG Data. to the documentation page. Note that if the description contains a comma, the entire field is encapsulated in quotation marks i.e., ECG_foobar,"foo, bar",... to keep csv formatting intact.

Do you think a different separator (e.g. tabs instead of commas) would be more robust to formatting mistakes in the docstrings?

I am not sure if it is possible to add the table to the Codebook page from within the documentation generation process and so therefore I added some javascript to dynamically load the csv file into the codebook page. Please let me know if this is too much.

I don’t see why the JavaScript would be problematic (but @DominiqueMakowski feel free to chime in), though if we want to avoid it for whatever reason I think it should be possible to add the table from within the documentation generation process if before sphinx is run you update a markdown/rst file with the table. I’ve done something similar here in case helpful: https://github.com/miso-sound/miso-sound-annotate/blob/cc36d781327d7d6dec425a8178714bab40a64fa3/docs/readme/gen_summary_table.py

* ``ECG_Rate_Mean``: the mean heart rate.
.. codebookadd::
ECG_Rate_Mean|The mean heart rate.

* ``ECG_HRV``: the different heart rate variability metrices.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this line intentionally left as is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this was done intentionally, because I was thinking about what should be added to the codebook and I wasn't sure, if adding ECG_HRV which is basically an entry holding entries that are already accounted for the the HRV section. However, I can also see an argument for adding these "meta keys" to the codebook too. What do you think?

Copy link
Collaborator

@danibene danibene Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would at least have a way to filter out "meta-keys" within the code book generation process, rather than formatting them differently in the docstring, to avoid inconsistent formatting in the documentation. It could also be helpful to have these variables saved somewhere if we wanted to programmatically validate the output of the codebook later on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably solve this using a third column which could be empty for normal variables and for "meta" keys, we could add the reference to the docs location which would give the link to the documentation that shows you what the key contains. How does that sound?

e.g., for ECG_HRV, it would look like this in codebookadd: ECG_HRV|the different heart rate variability metrics|<path to function page> where the 3rd column adds a :doc: in rst which links to the page. All other variables in the directive would not have to be changed

neurokit2/eda/eda_intervalrelated.py Show resolved Hide resolved
neurokit2/eog/eog_intervalrelated.py Show resolved Hide resolved
neurokit2/hrv/hrv_nonlinear.py Show resolved Hide resolved
neurokit2/rsp/rsp_intervalrelated.py Show resolved Hide resolved
@danibene
Copy link
Collaborator

danibene commented Aug 7, 2024

Maybe I'm dreaming here, but I was also wondering if we could do something similar to make the "report" functionality easier to develop and maintain: #785

I think it would be helpful if we could document each method only in the docstrings of the associated function, and the extract that description in a structured format so that it can be included in the generated report without having to manually add that description in a separate file.

@DerAndereJohannes
Copy link
Contributor Author

Thanks for all of your comments danibene!

For example, I imagine one next step could be to make a list of the functions that need to be checked and compare these variables to the code book content (some of this checking could probably be automated).

Do you mean a way to double check that all the variables emitted from the functions are in fact the variables that are stated in the codebook?

Do you think a different separator (e.g. tabs instead of commas) would be more robust to formatting mistakes in the docstrings?

Tabs can for sure work too and could maybe make it easier to read. However, the csv python library already automatically accounts for additional commas using the quotation marks around the field so it should be fine too. It is true, that no one will probably ever use a tab in a docstring though. Should I switch it?

I’ve done something similar here in case helpful: https://github.com/miso-sound/miso-sound-annotate/blob/cc36d781327d7d6dec425a8178714bab40a64fa3/docs/readme/gen_summary_table.py

I see you created a script and then had it run throughout your CI/CD Pipeline! This is of course also something we can do to avoid the additional javascript. With this pull request, I was aiming for just using sphinx for everything. But if I am allowed to do something similar to what you did, then I can convert it to that for the doc builder.

Maybe I'm dreaming here, but I was also wondering if we could do something similar to make the "report" functionality easier to develop and maintain: #785

This is definitely a possibility. I would imagine this looking like what you did in your gen_summary_table where we could use a no-op decorator like @gen_report to the report function and then in the CI/CD make the decorator act as a macro and replace it with a function that is defined using the docstring information. The only argument against this for me would be function testing (However, there is probably a way around this) and the idea for code generation during CI/CD can be dangerous if not looked at very carefully (e.g., like the xz attack). I would not mind creating this feature too if it is wanted

@DominiqueMakowski
Copy link
Member

I'm happy to merge it when @danibene gives the green light. I have no opinion regarding the javascript so it's up to what you guys feel is the most maintainable approach 🤷

@danibene
Copy link
Collaborator

danibene commented Aug 9, 2024

Thank you for your responses @DerAndereJohannes !

For example, I imagine one next step could be to make a list of the functions that need to be checked and compare these variables to the code book content (some of this checking could probably be automated).

Do you mean a way to double check that all the variables emitted from the functions are in fact the variables that are stated in the codebook?

Yes, exactly. It doesn’t have to be addressed in this PR if it’s a significant effort, but it might be worth considering when thinking about how to parse “meta-keys.”

Do you think a different separator (e.g. tabs instead of commas) would be more robust to formatting mistakes in the docstrings?

Tabs can for sure work too and could maybe make it easier to read. However, the csv python library already automatically accounts for additional commas using the quotation marks around the field so it should be fine too. It is true, that no one will probably ever use a tab in a docstring though. Should I switch it?

If you think the current version with the CSVs will work just as well, I’m fine with keeping it as is.

I see you created a script and then had it run throughout your CI/CD Pipeline! This is of course also something we can do to avoid the additional javascript. With this pull request, I was aiming for just using sphinx for everything. But if I am allowed to do something similar to what you did, then I can convert it to that for the doc builder.

I think both would be possible. I also don't have a strong opinion about which would be more maintainable - I do see the advantages of using sphinx for everything. Feel free to keep the current version if you'd like.

Maybe I'm dreaming here, but I was also wondering if we could do something similar to make the "report" functionality easier to develop and maintain: #785

This is definitely a possibility. I would imagine this looking like what you did in your gen_summary_table where we could use a no-op decorator like @gen_report to the report function and then in the CI/CD make the decorator act as a macro and replace it with a function that is defined using the docstring information. The only argument against this for me would be function testing (However, there is probably a way around this) and the idea for code generation during CI/CD can be dangerous if not looked at very carefully (e.g., like the xz attack). I would not mind creating this feature too if it is wanted

Do you think we could achieve this without code generation, maybe by loading the text from a file, similar to how example datasets are currently handled? (https://github.com/neuropsychology/NeuroKit/blob/master/neurokit2/data/data.py#L176-L180)

@DerAndereJohannes
Copy link
Contributor Author

Thank you for the comments.

For example, I imagine one next step could be to make a list of the functions that need to be checked and compare these variables to the code book content (some of this checking could probably be automated).

Do you mean a way to double check that all the variables emitted from the functions are in fact the variables that are stated in the codebook?

Yes, exactly. It doesn’t have to be addressed in this PR if it’s a significant effort, but it might be worth considering when thinking about how to parse “meta-keys.”

I think this should be pretty easy to do even with a pytest which would require it to be conform for everyone before they merge a new PR which adds a key. I would be happy to implement this in a different PR.

If you think the current version with the CSVs will work just as well, I’m fine with keeping it as is.

You might be right that using a tab might be a lot simpler as CSV is not really a well defined format and a tab should be far rarer than commas in the descriptions. I think I will change this now.

Do you think we could achieve this without code generation, maybe by loading the text from a file, similar to how example datasets are currently handled?

I think I need to understand the report function a bit more before I can make an answer. Especially how loading the text from a file would be different than from implementing the text directly in the report function.

With all these good ideas and a strive for making the documentation more maintainable, I think it would also make sense to create a page that describes the automated processes and all the directives I have been writing (and the additional stuff to come).

@DominiqueMakowski
Copy link
Member

I'll merge to keep the momentum going.
Again, thanks for that impressive work. No doubt it will be really useful to users :)

@DominiqueMakowski DominiqueMakowski merged commit b99036c into neuropsychology:dev Aug 13, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants