Applying automatic sanitization to uploaded logs #326

Jany26 · 2026-01-19T14:38:25Z

This change required two separate versions of the regular expressions that look for people's full names. One version matches escape sequences for Unicode characters; the other uses \p{Lu} and \p{L} (Unicode letter classes provided by the third-party regex module) and is applied directly to the strings themselves, before json.dumps() is called.
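
To illustrate why two pattern variants are needed, here is a minimal sketch; it is not the code from this PR. The names `ESCAPED_NAME` and `DIRECT_NAME`, the simplified "two words" name pattern, and the sample name are assumptions made for the example. With its default `ensure_ascii=True`, json.dumps() turns non-ASCII letters into `\uXXXX` sequences, so a pattern written for real letters no longer matches the dumped text. The stdlib fallback is there only so the sketch runs without the regex module installed.

```python
import json
import re

# In ASCII-escaped JSON, a "letter" is either a plain ASCII letter or a \uXXXX escape.
_ESC_LETTER = r"(?:[A-Za-z]|\\u[0-9a-fA-F]{4})"
ESCAPED_NAME = re.compile(rf"{_ESC_LETTER}+ {_ESC_LETTER}+")

try:
    import regex  # third-party module; supports \p{...} Unicode properties

    DIRECT_NAME = regex.compile(r"\p{Lu}\p{L}+ \p{Lu}\p{L}+")
except ImportError:
    # Stdlib approximation for this sketch only: [^\W\d_] matches any Unicode
    # letter, but `re` cannot require an uppercase initial the way \p{Lu} does.
    DIRECT_NAME = re.compile(r"[^\W\d_]+ [^\W\d_]+")

# Python resolves these escapes, so `name` holds the real characters "Jiří Dvořák".
name = "Ji\u0159\u00ed Dvo\u0159\u00e1k"
# json.dumps() re-escapes them; strip the surrounding JSON quotes.
escaped_body = json.dumps(name)[1:-1]

# Direct pattern: matches the raw string, not the escaped one.
print(bool(DIRECT_NAME.fullmatch(name)), bool(DIRECT_NAME.fullmatch(escaped_body)))    # True False
# Escaped pattern: matches the escaped string, not the raw one.
print(bool(ESCAPED_NAME.fullmatch(escaped_body)), bool(ESCAPED_NAME.fullmatch(name)))  # True False
```

Applying the direct variant before json.dumps() keeps the patterns readable; the escaped variant covers text that has already been serialized.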

gemini-code-assist · 2026-01-19T14:38:52Z

Summary of Changes

Hello @Jany26, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces automatic sanitization of uploaded log files. It centralizes the regular expression definitions and uses two sets of patterns to handle both ASCII-escaped and direct Unicode characters in sensitive fields. The new sanitization pipeline is integrated into the backend's data ingestion process, so that data is scrubbed of personal information before it is stored.

Highlights

Dual Regex Strategy for Sanitization: Implemented two distinct sets of regular expressions for identifying personal data like full names: one for handling ASCII-escaped Unicode sequences and another for direct Unicode characters, leveraging the advanced capabilities of the regex module with properties like \p{L} and \p{Lu}.
Centralized Regex Definitions: All regular expressions pertinent to data sanitization have been consolidated into a new dedicated file, backend/src/regexes_sanitization.py, enhancing organization, maintainability, and reusability across the codebase.
Integrated Automatic Sanitization Pipeline: A new sanitize_uploaded_log function has been introduced and integrated into the API's data storage process. This ensures that all incoming log data, including content, snippets, and metadata, is automatically sanitized before being stored, bolstering data privacy.
Unicode-Aware Sanitization Logic: The core sanitize_string function now includes an ascii_only flag, enabling it to dynamically select the appropriate set of regexes based on the input string's encoding, thus supporting comprehensive sanitization for both escaped and direct Unicode content.
New Dependency for Advanced Regex: The python3-regex package has been added to the Dockerfile, providing the necessary support for the advanced Unicode regular expression features utilized in the new sanitization logic.
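
The flag-based selection and the upload pipeline described in the highlights could look roughly like the sketch below. Only the names `sanitize_string`, `sanitize_uploaded_log`, and `ascii_only` come from this PR; the signatures, the two pattern lists, the `Signed-off-by: ` anchor, the placeholder strings, and the recursive walk are assumptions for illustration, and the stdlib `re` module stands in for the regex module.

```python
import json
import re
from typing import Any

# Hypothetical pattern sets; the real ones live in backend/src/regexes_sanitization.py.
_EMAIL = (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[email]")
_ESC = r"(?:[A-Za-z]|\\u[0-9a-fA-F]{4})"  # ASCII letter or \uXXXX escape
_UNI = r"[^\W\d_]"                        # any Unicode letter (stdlib stand-in for \p{L})

REGEXES_ASCII = [_EMAIL, (re.compile(rf"(?<=Signed-off-by: ){_ESC}+ {_ESC}+"), "[name]")]
REGEXES_UNICODE = [_EMAIL, (re.compile(rf"(?<=Signed-off-by: ){_UNI}+ {_UNI}+"), "[name]")]


def sanitize_string(text: str, ascii_only: bool = False) -> str:
    """Apply every pattern from the set selected by the ascii_only flag."""
    regexes = REGEXES_ASCII if ascii_only else REGEXES_UNICODE
    for pattern, replacement in regexes:
        text = pattern.sub(replacement, text)
    return text


def sanitize_uploaded_log(data: Any) -> Any:
    """Recursively sanitize all strings in a nested upload payload."""
    if isinstance(data, str):
        return sanitize_string(data, ascii_only=False)
    if isinstance(data, dict):
        return {key: sanitize_uploaded_log(value) for key, value in data.items()}
    if isinstance(data, list):
        return [sanitize_uploaded_log(item) for item in data]
    return data  # numbers, booleans, None pass through unchanged


# Made-up payload; the string holds the real characters "Jiří Dvořák".
line = "Signed-off-by: Ji\u0159\u00ed Dvo\u0159\u00e1k <jiri@example.com>"
upload = {"content": line, "snippets": [{"text": "mail bob@example.org"}], "size": 3}

clean = sanitize_uploaded_log(upload)
print(clean["content"])  # Signed-off-by: [name] <[email]>

# The same line after json.dumps() needs the ASCII-escaped pattern set.
print(sanitize_string(json.dumps(line), ascii_only=True))  # "Signed-off-by: [name] <[email]>"
```

Sanitizing the Python objects before serialization means only the Unicode set is needed on the upload path; the `ascii_only=True` branch remains for callers that already hold escaped text.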

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces automatic sanitization for uploaded logs, a valuable enhancement for user privacy. The refactoring to centralize regular expressions and create a sanitization pipeline is well-executed. My review includes suggestions to correct a new Unicode-aware regex, improve code clarity by addressing a leftover TODO and inconsistent parameter naming, and enhance maintainability by refactoring duplicated logic. Overall, this is a strong and important contribution.

backend/src/regexes_sanitization.py

backend/src/sanitization.py

backend/src/spells.py

Signed-off-by: Jan Matufka <[email protected]>

gemini-code-assist bot reviewed Jan 19, 2026

Jany26 force-pushed the auto-sanitization branch 2 times, most recently from e1a22df to 6b3544c · January 19, 2026 15:01

Jany26 marked this pull request as draft · January 20, 2026 09:09

Jany26 force-pushed the auto-sanitization branch 2 times, most recently from bbd0148 to 4808ebd · January 20, 2026 16:15

Applying automatic sanitization to uploaded logs

21bc666

Signed-off-by: Jan Matufka <[email protected]>

Jany26 force-pushed the auto-sanitization branch from 4808ebd to 21bc666 · January 20, 2026 16:17
