[Feature] Codebook Generation from Documentation #1017
- now handles quoted fields
- added column "Field Category"
- added column "Source File Name"
f1a7e91 to 1dd2e64
1dd2e64 to b37098f
Thanks for this huge endeavour, @DerAndereJohannes. It is quite impressive.
WOW. This is awesome, @DerAndereJohannes. Thank you! What can I do to help?
In reality, my pull request is quite small from a code perspective.
There are really only two files that I mainly contributed with this change, which are:
Given that these files are quite simple and the software doing the actual work is the Sphinx documentation build system, I would argue that this new feature is easy to maintain. Most of my line changes simply come from using the directive in the correct places: the Returns blocks are replaced with the new directive, which saves the entries to a csv on build and also writes them to the documentation page as if nothing had changed. All styles are kept.
One way to help would perhaps be to write a better description for the codebook on the codebook page itself. As you can see in the image in my initial post, I wrote an extremely short text description for each of the sections (Codebook and Codebook Table). Although it should probably remain concise, I may have been a bit too concise? :p Honestly, other than that, I simply used the descriptions that were already in the documentation. Some parts of the documentation are not that consistent with each other and would probably require quite a bit of refactoring, e.g., I had to add the HRV non-linear features to the return statement here since they were all present in the description of the function, but the return block simply stated
However, I am unsure whether changes that would improve the codebook feature are relevant for this PR, since we would end up rewriting the majority of the docs in this PR if that were the case. I am sure the maintainers could come up with a good solution. Thanks!
Thank you for doing all of this, it's great!
I left a couple of comments about (lack of) changes I wasn't sure were intentional. To respond to some of your points from the description:
Not all variables are likely included yet
The field descriptions are mainly all just copied and pasted from what was already there, maybe we can update them more ? (maybe out of scope of this PR?)
The Codebook page should probably have a better description
The field descriptions are mainly all just copied and pasted from what was already there, maybe we can update them more ? (maybe out of scope of this PR?)
I think it is fine to update these in separate PRs, especially given multiple people interested in contributing to this feature 😊 For example, I imagine one next step could be to make a list of the functions that need to be checked and compare these variables to the code book content (some of this checking could probably be automated).
The line
ECG_Raw|The raw ECG Data.
appends
ECG_Raw,The raw ECG Data.,Electrocardiogram,ecg_process.py
to the csv file and adds a formatted string
* ``ECG_Raw``: The raw ECG Data.
to the documentation page. Note that if the description contains a comma, the entire field is encapsulated in quotation marks, i.e.,
ECG_foobar,"foo, bar",...
to keep csv formatting intact.
Do you think a different separator (e.g. tabs instead of commas) would be more robust to formatting mistakes in the docstrings?
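To make the quoted behaviour concrete, here is a minimal sketch of how one `Name|Description` line could be turned into both outputs. The helper name `codebook_line_to_outputs` and its arguments are hypothetical and are not taken from the PR's directive code; the only library behaviour relied on is the default minimal quoting of Python's `csv.writer`.

```python
import csv
import io


def codebook_line_to_outputs(line, category, source_file):
    """Split a 'Name|Description' line into a CSV row and an RST bullet.

    Hypothetical helper for illustration; not the PR's actual directive code.
    """
    name, description = (part.strip() for part in line.split("|", 1))

    buffer = io.StringIO()
    # csv.writer quotes a field only when it contains the delimiter,
    # a quote character, or a line break (QUOTE_MINIMAL is the default).
    csv.writer(buffer, lineterminator="\n").writerow(
        [name, description, category, source_file]
    )
    csv_row = buffer.getvalue().rstrip("\n")

    rst_bullet = f"* ``{name}``: {description}"
    return csv_row, rst_bullet


row, bullet = codebook_line_to_outputs(
    "ECG_Raw|The raw ECG Data.", "Electrocardiogram", "ecg_process.py"
)
print(row)     # ECG_Raw,The raw ECG Data.,Electrocardiogram,ecg_process.py
print(bullet)  # * ``ECG_Raw``: The raw ECG Data.
```

A description containing a comma, such as `ECG_foobar|foo, bar`, would come out with that field wrapped in quotation marks.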
I am not sure if it is possible to add the table to the Codebook page from within the documentation generation process and so therefore I added some javascript to dynamically load the csv file into the codebook page. Please let me know if this is too much.
I don’t see why the JavaScript would be problematic (but @DominiqueMakowski feel free to chime in), though if we want to avoid it for whatever reason I think it should be possible to add the table from within the documentation generation process if before sphinx is run you update a markdown/rst file with the table. I’ve done something similar here in case helpful: https://github.com/miso-sound/miso-sound-annotate/blob/cc36d781327d7d6dec425a8178714bab40a64fa3/docs/readme/gen_summary_table.py
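As a rough illustration of the non-JavaScript route, the sketch below builds the text of an RST page that embeds a CSV file through the docutils `csv-table` directive. The function name, the page title, and the file name are assumptions for illustration, and the sketch assumes the CSV already exists when Sphinx parses the page, which would need checking because the PR generates the CSV during the build itself.

```python
def build_codebook_table_rst(csv_filename, title="Codebook Table"):
    """Return RST text that renders a CSV file as a table.

    Hypothetical helper: a pre-build script could write this text to an
    .rst file before Sphinx is run.
    """
    lines = [
        title,
        "=" * len(title),  # RST section underline must match the title length
        "",
        f".. csv-table:: {title}",
        # :file: is resolved relative to the document containing the directive
        f"   :file: {csv_filename}",
        "   :header-rows: 1",
        "",
    ]
    return "\n".join(lines)


print(build_codebook_table_rst("neurokit_codebook.csv"))
```

Whether this is more maintainable than loading the CSV in the browser depends mostly on build ordering, so it is offered only as one option.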
* ``ECG_Rate_Mean``: the mean heart rate.
.. codebookadd::
ECG_Rate_Mean|The mean heart rate.

* ``ECG_HRV``: the different heart rate variability metrices.
Was this line intentionally left as is?
Yes, this was done intentionally. I was thinking about what should be added to the codebook, and I wasn't sure about adding ECG_HRV, which is basically an entry holding entries that are already accounted for in the HRV section. However, I can also see an argument for adding these "meta keys" to the codebook too. What do you think?
I think I would at least have a way to filter out "meta-keys" within the code book generation process, rather than formatting them differently in the docstring, to avoid inconsistent formatting in the documentation. It could also be helpful to have these variables saved somewhere if we wanted to programmatically validate the output of the codebook later on.
We can probably solve this using a third column which could be empty for normal variables and for "meta" keys, we could add the reference to the docs location which would give the link to the documentation that shows you what the key contains. How does that sound?
e.g., for ECG_HRV, it would look like this in codebookadd: ECG_HRV|the different heart rate variability metrics|<path to function page>
where the 3rd column adds a :doc: in rst which links to the page. All other variables in the directive would not have to be changed
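A small sketch of how the optional third column might be handled when formatting the documentation bullet. The function name and the example path `/functions/hrv` are hypothetical; only the `:doc:` role itself is standard Sphinx.

```python
def format_codebook_entry(line):
    """Format 'Name|Description[|doc path]' as an RST bullet.

    Hypothetical helper: the optional third field is the proposed
    "meta key" link and is rendered with the Sphinx :doc: role.
    """
    parts = [part.strip() for part in line.split("|")]
    name, description = parts[0], parts[1]
    # An absent or empty third field means an ordinary variable.
    doc_target = parts[2] if len(parts) > 2 and parts[2] else None

    bullet = f"* ``{name}``: {description}"
    if doc_target is not None:
        bullet += f" (see :doc:`{doc_target}`)"
    return bullet


print(format_codebook_entry("ECG_Rate_Mean|The mean heart rate."))
print(format_codebook_entry(
    "ECG_HRV|the different heart rate variability metrics|/functions/hrv"
))
```

Because two-field lines are formatted exactly as before, existing entries would not need to change under this scheme.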
Maybe I'm dreaming here, but I was also wondering if we could do something similar to make the "report" functionality easier to develop and maintain: #785. I think it would be helpful if we could document each method only in the docstring of the associated function, and then extract that description in a structured format so that it can be included in the generated report without having to manually add that description in a separate file.
Thanks for all of your comments danibene!
Do you mean a way to double check that all the variables emitted from the functions are in fact the variables that are stated in the codebook?
Tabs can for sure work too and could maybe make it easier to read. However, the csv python library already automatically accounts for additional commas using the quotation marks around the field so it should be fine too. It is true, that no one will probably ever use a tab in a docstring though. Should I switch it?
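For comparison, here is a quick sketch of the same row written with a comma and with a tab as the delimiter, using only Python's standard `csv` module. The row contents reuse the `ECG_foobar` example from the description; the helper `round_trip` is illustrative only.

```python
import csv
import io


def round_trip(row, delimiter):
    """Write one row with the given delimiter, then parse it back."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter, lineterminator="\n").writerow(row)
    text = buffer.getvalue()
    parsed = next(csv.reader(io.StringIO(text), delimiter=delimiter))
    return text, parsed


row = ["ECG_foobar", "foo, bar", "Electrocardiogram", "ecg_process.py"]

comma_text, comma_parsed = round_trip(row, ",")
tab_text, tab_parsed = round_trip(row, "\t")

print(repr(comma_text))  # the comma-containing field is wrapped in quotes
print(repr(tab_text))    # no quoting needed; fields are separated by tabs
```

Both variants parse back to the original row, so the choice is mainly about how readable the raw file is and how other tools will consume it.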
I see you created a script and then had it run throughout your CI/CD Pipeline! This is of course also something we can do to avoid the additional javascript. With this pull request, I was aiming for just using sphinx for everything. But if I am allowed to do something similar to what you did, then I can convert it to that for the doc builder.
This is definitely a possibility. I would imagine this looking like what you did in your gen_summary_table, where we could add a no-op decorator like @gen_report to the report function and then, in the CI/CD, make the decorator act as a macro and replace it with a function that is defined using the docstring information. The only arguments against this for me would be function testing (however, there is probably a way around this) and that code generation during CI/CD can be dangerous if not looked at very carefully (e.g., like the xz attack). I would not mind creating this feature too if it is wanted.
I'm happy to merge it when @danibene gives the green light. I have no opinion regarding the javascript so it's up to what you guys feel is the most maintainable approach 🤷
Thank you for your responses, @DerAndereJohannes!
Yes, exactly. It doesn’t have to be addressed in this PR if it’s a significant effort, but it might be worth considering when thinking about how to parse “meta-keys.”
If you think the current version with the CSVs will work just as well, I’m fine with keeping it as is.
I think both would be possible. I also don't have a strong opinion about which would be more maintainable - I do see the advantages of using sphinx for everything. Feel free to keep the current version if you'd like.
Do you think we could achieve this without code generation, maybe by loading the text from a file, similar to how example datasets are currently handled? (https://github.com/neuropsychology/NeuroKit/blob/master/neurokit2/data/data.py#L176-L180)
Thank you for the comments.
I think this should be pretty easy to do, even with a pytest test that every PR adding a key would have to pass before being merged. I would be happy to implement this in a different PR.
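A rough sketch of what such a check could look like. The helper name and the hard-coded values are stand-ins for illustration; a real test would take the column names from a processing function's output and the documented names from the generated codebook file.

```python
def find_undocumented_columns(emitted_columns, codebook_names):
    """Return emitted column names that have no codebook entry, sorted."""
    return sorted(set(emitted_columns) - set(codebook_names))


def test_all_columns_are_in_codebook():
    # Stand-in values: a real test would take the columns from a processing
    # function's output and the names from the generated codebook file.
    emitted = ["ECG_Raw", "ECG_Rate_Mean"]
    codebook = {"ECG_Raw", "ECG_Rate_Mean", "ECG_HRV"}
    assert find_undocumented_columns(emitted, codebook) == []
```

pytest collects functions named `test_*` automatically, so a failing run would list exactly the keys that are missing from the codebook.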
You might be right that using a tab might be a lot simpler as CSV is not really a well defined format and a tab should be far rarer than commas in the descriptions. I think I will change this now.
I think I need to understand the report function a bit more before I can give an answer, especially how loading the text from a file would differ from implementing the text directly in the report function. With all these good ideas and the push to make the documentation more maintainable, I think it would also make sense to create a page that describes the automated processes and all the directives I have been writing (and the additional stuff to come).
I'll merge to keep the momentum going.
Description
This PR aims at creating a general Codebook that has all variables that can be generated from the NeuroKit package. The idea of this feature stems from discussion #1012 from @HeatherUrry. The addition of this feature would allow users to more easily see what variables are extractable from NeuroKit and is useful in other software programs.
Note
This pull request should be considered more as a draft as some polishing would still be required.
For example:
Feedback and change requests are all welcome. Let me know if I should also share a copy of the documentation _build.
Proposed Changes
I added an additional Sphinx directive that takes in information in the Returns section of the python source code documentation and creates a csv file. At the same time, the directive also places a formatted version of the information in the return section, such that this information does not have to be written twice. E.g.,
The line
ECG_Raw|The raw ECG Data.
appends
ECG_Raw,The raw ECG Data.,Electrocardiogram,ecg_process.py
to the csv file and adds a formatted string
* ``ECG_Raw``: The raw ECG Data.
to the documentation page. Note that if the description contains a comma, the entire field is encapsulated in quotation marks, i.e.,
ECG_foobar,"foo, bar",...
to keep csv formatting intact.
The CSV from the codebook webpage like in this example:
Here is a sample codebook: neurokit_codebook.csv
I am not sure if it is possible to add the table to the Codebook page from within the documentation generation process and so therefore I added some javascript to dynamically load the csv file into the codebook page. Please let me know if this is too much.
Checklist