Implement translations infrastructure #61380

Open
wants to merge 5 commits into
base: main
from

@goanpeca goanpeca commented Apr 30, 2025

Hello team!

This PR is a proposal for adding the translations infrastructure to the pandas web page.

Following the discussion in #56301, we (a group of folks working on the Scientific Python grant) have been working to set up infrastructure and translate the contents of the pandas web site. As of this moment, we have 100% translations for the pandas website into Spanish and Brazilian Portuguese, with other languages available for translation (depending on volunteer translators).

To build, the command remains the same:

python pandas_web.py pandas/content --target-path build

If you want to check out other related work, please take a look at scipy/scipy.org#617

You can read more about how the translation process works at https://scientific-python-translations.github.io/docs/

What does this PR do?

Supersedes #61220

Demo

pandas


cc @mroeschke @datapythonista

@datapythonista
Member

/preview

Contributor

Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/61380/

@datapythonista
Member

Thanks @goanpeca for this. Do you mind adding some more context here? I can't see in https://github.com/Scientific-Python-Translations/pandas-translations much information, like what languages are available, or how to fix a bad translation, which would be useful to know.

Also, in the docs generated from this PR I can't see any language dropdown or anything different from our current docs. What are we expecting?

@melissawm
Contributor

Hi @datapythonista - this is a follow up to #61220, a proof-of-concept CI job to build the website with translations that don't live in this repo. This PR and #61220 are meant to work together and I'm happy to incorporate one into the other once we agree on the general direction and workflow for this.

Let us know if we can answer any other questions. Unfortunately I'm not sure how to get the preview for the other PR, I relied on building locally to test that things were working.

@datapythonista
Member

Sorry, I missed #61220 and the issue discussion.

I don't fully understand what you're doing here, but below I describe how to add translations without adding too much complexity to this repo, which I don't think any core dev would be on board with.

  1. You decide on how to generate translations and manage it independently from this repo, and end up with a structure like this with the translated documents:
+ es/
  - index.md
  + about/
    - team.md
    - ...
  - ...
+ pt/
  - index.md
  + about/
    - team.md
    - ...
  - ...
  2. In our CI, before calling pandas_web.py, you download this directory structure to the web/ directory. No other changes are needed; this will create all the translated pages.
  3. We add a dropdown with the languages to the website (you can add the language list to web/pandas/config.yml).

I think this makes everyone's life easy, and we get the expected result.
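
The workflow in the steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `merge_translations` and its arguments are hypothetical names, not part of pandas_web.py, and the translations are assumed to have already been downloaded and unpacked into a directory with one subdirectory per language (es/, pt/, ...).

```python
import shutil
from pathlib import Path


def merge_translations(translations_dir: Path, web_dir: Path) -> list[str]:
    """Copy each language directory (es/, pt/, ...) into the website sources.

    Hypothetical helper: returns the copied language codes, sorted by name.
    """
    languages = []
    for lang_dir in sorted(translations_dir.iterdir()):
        if not lang_dir.is_dir():
            continue  # ignore stray files next to the language directories
        # dirs_exist_ok lets a rerun overwrite an earlier copy (Python 3.8+)
        shutil.copytree(lang_dir, web_dir / lang_dir.name, dirs_exist_ok=True)
        languages.append(lang_dir.name)
    return languages
```

The returned list is what a language dropdown would need; where exactly it is stored (for example in the website config file) is left open here.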

@melissawm
Contributor

Thanks @datapythonista !

Can you clarify what is missing from #61220 to match your description? That is pretty much what is done in that PR. Maybe this is confusing because we chose to do it in two parts exactly because we wanted to decouple the reorganization of the repo + switcher (in #61220) from the actual translations (this PR).

Happy to follow up with any feedback in the other PR as well. Cheers!

@datapythonista
Member

In #61220 you are moving all the current website pages, that should be undone. You are adding the translated pages to this repo, we don't want it. You are making changes to pandas_web.py, this is not needed based on what I described above.

The only changes in a PR to this pandas repo should be adding a CI step as per step 2, and editing the website template with the language dropdown as per step 3.

@melissawm
Contributor

I see! I will rework what I have there to match your proposal. Thanks!

@goanpeca goanpeca force-pushed the translations branch 3 times, most recently from cd72e5c to 2b85ad4 Compare May 8, 2025 01:47
Member

@datapythonista datapythonista left a comment

Great improvement to the PR, this makes a lot of sense to me.

I'd personally still simplify things more here in two ways. And feel free to disagree, as it's something opinionated.

First, if I understand correctly, you download the translations of the web, and the percentage of the translated content, and then you check for each language if it's translated enough to be published. Personally, I think you should better take care of this logic in your repo when generating the tar, not here. First to simplify the code here, and second to avoid downloading translations that are not going to be used.

Just as a suggestion, I wouldn't use this approach, even in your repo. Imagine you publish translations that are at least 90%, and we have Spanish at 100%. Then I add new content that is 11% of the website. And automatically the Spanish translations that are already indexed by search engines, in user bookmarks, in links in blog posts... are deleted from our website. Not great in my opinion, much better to simply get the new content in English and hope it will eventually be translated.
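
To make the concern concrete, here is a minimal sketch of a publication rule that never unpublishes a language once it is live. The function name, the 90% default and the `already_published` set are assumptions for illustration only; nothing here reflects how the translations repo actually decides what to ship.

```python
def languages_to_publish(
    completion: dict[str, float],
    already_published: set[str],
    threshold: float = 90.0,
) -> set[str]:
    """Return the languages to publish.

    A language is published when it reaches the threshold, and it stays
    published afterwards even if new English content lowers its percentage.
    """
    newly_ready = {lang for lang, pct in completion.items() if pct >= threshold}
    return newly_ready | already_published
```

With the numbers from the example above, Spanish dropping from 100% to 89% would still be returned as long as "es" is in `already_published`, and the untranslated pages would simply show the English text.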

Another thing I would do is to extract the tar file as it is downloaded. So the tar file is passed to gzip/tarfile in memory, with the io module, not as a path on disk. With this you can get all the code here in a single short function. Or we could even create a GitHub action in your repo with this, as it's generic, and just use it here. So only the CI step would live in this PR.
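
A minimal sketch of the in-memory approach, using only the standard library. The function names are hypothetical, the URL is whatever the translations repo publishes, and the archive is assumed to be a gzip-compressed tar from a trusted source.

```python
import io
import tarfile
import urllib.request
from pathlib import Path


def extract_tar_gz(data: bytes, target_dir: Path) -> None:
    """Unpack a gzip-compressed tar archive held in memory into target_dir."""
    # The bytes are wrapped in a file-like object, so no .tar.gz touches disk.
    # Assumes a trusted archive; on Python 3.12+ consider filter="data".
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        archive.extractall(target_dir)


def download_and_extract(url: str, target_dir: Path) -> None:
    """Fetch the archive at url and unpack it straight from memory."""
    with urllib.request.urlopen(url) as response:
        extract_tar_gz(response.read(), target_dir)
```

Splitting the download from the extraction keeps the extraction step testable without network access.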

Finally, we already have a configuration file for the website. We could save the URL of the translations tar there. Also good in the script, just a question of preference. Or if you go for the GitHub action approach, it could simply be a parameter in the CI step. Then you would need another one for the target dir in this repo.

In any case, the approach here is also very reasonable, all above are suggestions that personally I think would make things simpler.

@goanpeca goanpeca force-pushed the translations branch 3 times, most recently from 4a6532b to 04f9259 Compare May 12, 2025 02:38
@goanpeca goanpeca marked this pull request as ready for review May 12, 2025 02:47
@goanpeca
Author

goanpeca commented May 12, 2025

> Great improvement to the PR, this makes a lot of sense to me.

Thanks for the review @datapythonista.

> First, if I understand correctly, you download the translations of the web, and the percentage of the translated content, and then you check for each language if it's translated enough to be published. Personally, I think you should better take care of this logic in your repo when generating the tar, not here. First to simplify the code here, and second to avoid downloading translations that are not going to be used.

Fixed!

> Just as a suggestion, I wouldn't use this approach, even in your repo. Imagine you publish translations that are at least 90%, and we have Spanish at 100%. Then I add new content that is 11% of the website. And automatically the Spanish translations that are already indexed by search engines, in user bookmarks, in links in blog posts... are deleted from our website. Not great in my opinion, much better to simply get the new content in English and hope it will eventually be translated.

This is also fixed!

> Another thing I would do is to extract the tar file as it is downloaded. So the tar file is passed to gzip/tarfile in memory, with the io module, not as a path on disk. With this you can get all the code here in a single short function. Or we could even create a GitHub action in your repo with this, as it's generic, and just use it here. So only the CI step would live in this PR.

Did not follow this one as I think things are now simpler and in a single script.

> Finally, we already have a configuration file for the website. We could save the URL of the translations tar there. Also good in the script, just a question of preference. Or if you go for the GitHub action approach, it could simply be a parameter in the CI step. Then you would need another one for the target dir in this repo.

Added the information to the config file as requested and updated the scripts to handle site generation for languages. Moved all logic to the existing script.

> In any case, the approach here is also very reasonable, all above are suggestions that personally I think would make things simpler.

Please let me know what you think about the current changes.


This PR now supersedes #61220

Thanks @melissawm!

@goanpeca goanpeca requested a review from datapythonista May 12, 2025 02:50
@goanpeca goanpeca changed the title from "Update CI to include Translations from Scientific Python Repo" to "Implement translations infrastructure" May 12, 2025
@goanpeca goanpeca force-pushed the translations branch 3 times, most recently from 09aaca5 to 7e68731 Compare May 12, 2025 03:04
@datapythonista
Member

/preview

Contributor

Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/61380/

@datapythonista
Member

Thanks @goanpeca, a couple of comments.

Feels like we are repeating in this PR logic that we already have implemented. If a markdown file is added to the web directory before the rendering of the website, it will be rendered in the same way as the existing pages. There is no need to make changes to pandas_web.py for the translations, just copy the content of the translations tar in the web directory, and all the translated content will be in the rendered web.

In the preview links to translations are broken. This is likely to be caused by assuming the website is always going to be hosted at the root of the domain, not a subdirectory. This can't be assumed.
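
One way to avoid the root-of-domain assumption is to build every link from a configurable base path. The sketch below is illustrative only: `language_url` is a hypothetical helper, and it assumes English lives at the base path while each translation lives under a subdirectory named after its language code.

```python
def language_url(base_url: str, lang: str, page: str, default_lang: str = "en") -> str:
    """Build a link to `page` in `lang`, relative to wherever the site is hosted."""
    base = base_url.rstrip("/")  # "/" becomes "", "/preview/x/" becomes "/preview/x"
    prefix = "" if lang == default_lang else f"/{lang}"
    return f"{base}{prefix}/{page.lstrip('/')}"
```

For a site served from "/" this yields "/es/index.html", and for the preview path of this PR it yields "/preview/pandas-dev/pandas/61380/es/index.html", so the same template works in both places.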

Finally, I think it'd be good to handle the downloading and uncompressing of the translation file in a GitHub action. Or less neat, in the step on the CI config, but out of pandas_web.py. This will keep things simple in pandas_web.py, and it'll be easier to maintain.

@goanpeca
Author

goanpeca commented May 12, 2025

Hi again, thanks for the comments @datapythonista !

> There is no need to make changes to pandas_web.py for the translations, just copy the content of the translations tar in the web directory, and all the translated content will be in the rendered web.

  1. Yes, there is a need, as the navbar is different for each language and it comes from the config file, which contains translatable content. So each language has its own config and is treated as a separate site/build.

> In the preview links to translations are broken. This is likely to be caused by assuming the website is always going to be hosted at the root of the domain, not a subdirectory. This can't be assumed.

  2. Fixed!

> Finally, I think it'd be good to handle the downloading and uncompressing of the translation file in a GitHub action. Or less neat, in the step on the CI config, but out of pandas_web.py. This will keep things simple in pandas_web.py, and it'll be easier to maintain.

  3. AFAICT it is not possible (with the way the website is currently built) to make the translations work without changes to the pandas_web.py script. I tried my best to keep the changes to that file to a minimum, and having the downloads there "is needed" to know which languages are available and process things accordingly, as pointed out in 1.

If this request stems from not wanting to have this happen locally, I could add a flag to process translations, and only handle English otherwise.

# Download, extract and copy translations
python pandas_web.py pandas/content --target-path build --translations

# Only process English, with no download, extraction or copy of translations
python pandas_web.py pandas/content --target-path build
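
For illustration, an opt-in flag like the one proposed above could be wired up with argparse roughly as follows. This is a sketch under assumptions, not the actual argument handling in pandas_web.py: the argument names simply mirror the two commands shown, and `parse_args` is a hypothetical helper.

```python
import argparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the command line of a hypothetical website build script."""
    parser = argparse.ArgumentParser(description="Website builder (illustrative).")
    parser.add_argument("source_path", help="directory with the website sources")
    parser.add_argument("--target-path", default="build", help="output directory")
    # store_true makes the flag opt-in: absent means an English-only build
    parser.add_argument(
        "--translations",
        action="store_true",
        help="also download, extract and build the translated sites",
    )
    return parser.parse_args(argv)
```

Because the flag defaults to False, a local build without it would never touch the network.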
