@Jany26 commented Jan 19, 2026

This change required two separate versions of the regular expressions that look for people's full names. One version matches escaped sequences for Unicode characters. The other uses \p{Lu} and \p{L} (Unicode letter properties supported by the third-party regex module) and is applied directly to the strings themselves, before json.dumps() is called.
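To illustrate why two versions are needed, here is a minimal sketch, not the PR's actual code. By default json.dumps() escapes non-ASCII characters as \uXXXX sequences, so a name pattern written for the dumped text has to match those sequences, while a \p{Lu}/\p{L} pattern only works on the original string. The sample name and both patterns are assumptions made up for the example, and the regex import is optional so the snippet still runs where the third-party module is not installed.

```python
import json
import re

try:
    import regex  # third-party module (not stdlib); supports \p{L} and \p{Lu}
except ImportError:
    regex = None

name = "Ji\u0159\u00ed Nov\u00e1k"  # "Jiří Novák", a made-up sample name

# json.dumps() defaults to ensure_ascii=True, so non-ASCII letters are escaped.
dumped = json.dumps({"author": name})
print(dumped)  # {"author": "Ji\u0159\u00ed Nov\u00e1k"}

# Escaped variant (assumed shape): a "letter" is an ASCII letter or a \uXXXX
# escape. Any escape is accepted as a capital here, which is an approximation.
ESC_UPPER = r"(?:[A-Z]|\\u[0-9a-fA-F]{4})"
ESC_LETTER = r"(?:[A-Za-z]|\\u[0-9a-fA-F]{4})"
ESC_WORD = ESC_UPPER + ESC_LETTER + "+"
ESCAPED_FULLNAME = re.compile(ESC_WORD + "(?: " + ESC_WORD + ")+")

escaped_match = ESCAPED_FULLNAME.search(dumped)  # finds the escaped name
raw_match = ESCAPED_FULLNAME.search(name)        # None: the raw string has no escapes

# Direct variant (assumed shape): Unicode properties, applied before json.dumps().
if regex is not None:
    UNICODE_FULLNAME = regex.compile(r"\p{Lu}\p{L}+(?: \p{Lu}\p{L}+)+")
    assert UNICODE_FULLNAME.search(name).group() == name
```

The escaped pattern finds nothing in the raw string because "ř" is neither an ASCII letter nor an escape sequence, which is the gap the second, Unicode-aware version closes.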

@gemini-code-assist
Summary of Changes

Hello @Jany26, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds automatic sanitization of uploaded log files. It centralizes the regular expression definitions and uses a dual strategy to handle both ASCII-escaped and direct Unicode characters in sensitive fields. The sanitization pipeline is integrated into the backend's data ingestion process, so data is scrubbed of personal information before it is stored.

Highlights

  • Dual Regex Strategy for Sanitization: Implemented two distinct sets of regular expressions for identifying personal data like full names: one for handling ASCII-escaped Unicode sequences and another for direct Unicode characters, leveraging the advanced capabilities of the regex module with properties like \p{L} and \p{Lu}.
  • Centralized Regex Definitions: All regular expressions pertinent to data sanitization have been consolidated into a new dedicated file, backend/src/regexes_sanitization.py, enhancing organization, maintainability, and reusability across the codebase.
  • Integrated Automatic Sanitization Pipeline: A new sanitize_uploaded_log function has been introduced and integrated into the API's data storage process. This ensures that all incoming log data, including content, snippets, and metadata, is automatically sanitized before being stored, bolstering data privacy.
  • Unicode-Aware Sanitization Logic: The core sanitize_string function now includes an ascii_only flag, enabling it to dynamically select the appropriate set of regexes based on the input string's encoding, thus supporting comprehensive sanitization for both escaped and direct Unicode content.
  • New Dependency for Advanced Regex: The python3-regex package has been added to the Dockerfile, providing the necessary support for the advanced Unicode regular expression features utilized in the new sanitization logic.
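
The highlights above can be sketched as follows. This is a minimal sketch under stated assumptions, not the PR's implementation: only the names sanitize_string, sanitize_uploaded_log, and ascii_only come from the summary, while the signatures, the REGEXES table, the PLACEHOLDER token, and the payload shape are hypothetical. To keep it runnable on the standard library alone, \p{Lu} and \p{L} are approximated with character classes built from Latin code points; the PR itself relies on the regex module for these.

```python
import re

PLACEHOLDER = "<REDACTED>"  # hypothetical replacement token

# Stand-ins for \p{Lu} and \p{L}, limited to Latin code points so the sketch
# needs no third-party module (an assumption, not the PR's regexes).
_UPPER = "".join(chr(c) for c in range(0x250) if chr(c).isupper())
_LETTER = "".join(chr(c) for c in range(0x250) if chr(c).isalpha())
_ESC = r"\\u[0-9a-fA-F]{4}"  # one \uXXXX escape sequence

_ESC_WORD = "(?:[A-Z]|" + _ESC + ")(?:[A-Za-z]|" + _ESC + ")+"
_UNI_WORD = "[" + _UPPER + "][" + _LETTER + "]+"

REGEXES = {
    # ascii_only=True: non-ASCII letters appear as \uXXXX escape sequences.
    True: [re.compile(_ESC_WORD + "(?: " + _ESC_WORD + ")+")],
    # ascii_only=False: non-ASCII letters appear directly in the string.
    False: [re.compile(_UNI_WORD + "(?: " + _UNI_WORD + ")+")],
}


def sanitize_string(text, ascii_only=False):
    """Replace every match of the selected regex set with the placeholder."""
    for pattern in REGEXES[ascii_only]:
        text = pattern.sub(PLACEHOLDER, text)
    return text


def sanitize_uploaded_log(payload):
    """Recursively sanitize all string values of a decoded upload."""
    if isinstance(payload, str):
        # Strings are handled before serialization, so use the direct set.
        return sanitize_string(payload, ascii_only=False)
    if isinstance(payload, list):
        return [sanitize_uploaded_log(item) for item in payload]
    if isinstance(payload, dict):
        return {key: sanitize_uploaded_log(value) for key, value in payload.items()}
    return payload  # numbers, booleans, None are left untouched


upload = {
    "content": "Signed-off-by: Ji\u0159\u00ed Nov\u00e1k",
    "snippets": [{"text": "packaged by Ji\u0159\u00ed Nov\u00e1k"}],
    "line": 12,
}
clean = sanitize_uploaded_log(upload)
```

A pattern of this shape also matches any two adjacent capitalized words, so a real deployment would presumably anchor it to specific fields or prefixes; the summary does not say how the PR handles that.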

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces automatic sanitization for uploaded logs, a valuable enhancement for user privacy. The refactoring to centralize regular expressions and create a sanitization pipeline is well-executed. My review includes suggestions to correct a new Unicode-aware regex, improve code clarity by addressing a leftover TODO and inconsistent parameter naming, and enhance maintainability by refactoring duplicated logic. Overall, this is a strong and important contribution.

@Jany26 force-pushed the auto-sanitization branch 2 times, most recently from e1a22df to 6b3544c, January 19, 2026 15:01
@Jany26 marked this pull request as draft, January 20, 2026 09:09
@Jany26 force-pushed the auto-sanitization branch 2 times, most recently from bbd0148 to 4808ebd, January 20, 2026 16:15
@Jany26 force-pushed the auto-sanitization branch from 4808ebd to 21bc666, January 20, 2026 16:17