Adds initial WARP documentation

paulcruse3 · paulcruse3 · commit 4d52958ead7f · 2025-08-29T21:18:19.000-05:00
Adds a WARP.md file to provide guidance for developers working with the doubletake library, covering project overview, architecture, package structure, development commands, key files, PII patterns, CI/CD pipeline, and common development patterns.

Improves PII replacement in data walker

Refactors the string replacement logic in the data walker to handle known and extra patterns separately, allowing for more flexible PII masking.
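
A minimal sketch of the two-pass flow described above, assuming simplified stand-in regexes and a caller-supplied `fake` function. The names `replace_known`, `replace_extras`, `replace_string`, and `KNOWN_PATTERNS` are hypothetical and are not the library's API; the real logic lives in the private `DataWalker` methods shown in the diff.

```python
import re
from typing import Callable, Dict, Iterable, Optional, Sequence

# Hypothetical stand-ins for PatternManager.patterns (simplified regexes).
KNOWN_PATTERNS: Dict[str, str] = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}

Faker = Callable[[Optional[str]], str]


def replace_known(item: str, allowed: Iterable[str], fake: Faker) -> str:
    """Pass 1: mask the first known pattern that matches, skipping allowed keys."""
    allowed_keys = set(allowed)
    for key, pattern in KNOWN_PATTERNS.items():
        if key in allowed_keys:
            continue
        if re.search(pattern, item, flags=re.IGNORECASE):
            # Returns after the first matching pattern, like the diff does.
            return re.sub(pattern, fake(key), item, flags=re.IGNORECASE)
    return item


def replace_extras(item: str, extras: Sequence[str], fake: Faker) -> str:
    """Pass 2: mask the first caller-supplied extra pattern that matches."""
    for pattern in extras:
        if re.search(pattern, item, flags=re.IGNORECASE):
            return re.sub(pattern, fake(None), item, flags=re.IGNORECASE)
    return item


def replace_string(
    item: str,
    fake: Faker,
    allowed: Iterable[str] = (),
    extras: Sequence[str] = (),
) -> str:
    """Run the known-pattern pass, then the extra-pattern pass."""
    item = replace_known(item, allowed, fake)
    return replace_extras(item, extras, fake)


def tag(key: Optional[str]) -> str:
    """Illustrative faker: returns a visible placeholder instead of fake data."""
    return f"<{key or 'extra'}>"


print(replace_string("user@example.com", tag))
print(replace_string("user@example.com", tag, allowed=["email"]))
print(replace_string("CUSTOM-ID-98765", tag, extras=[r"CUSTOM-ID-\d+"]))
```

Because each pass returns as soon as one pattern matches, a string containing two different known PII types has only the first matching type masked in this sketch; the early `return` in the diff suggests the same applies to the walker.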

Adds comprehensive unit tests to validate the string replacement functionality, covering various scenarios including email, phone, SSN, credit card, extra patterns, allowed patterns, and mixed-case PII.
diff --git a/WARP.md b/WARP.md
@@ -0,0 +1,136 @@
+# WARP.md
+
+This file provides guidance to WARP (warp.dev) when working with code in this repository.
+
+## Project Overview
+
+**doubletake** is a Python library for intelligent PII (Personally Identifiable Information) detection and replacement. It provides high-performance processing of complex nested data structures with multiple replacement strategies.
+
+### Core Architecture
+
+The library uses a **dual-strategy architecture** that trades off performance against flexibility:
+
+1. **JSONGrepper**: High-performance JSON serialization + regex replacement for simple use cases
+2. **DataWalker**: Recursive tree traversal with full context for advanced features (callbacks, fake data, path targeting)
+
+**Strategy Selection Logic**: The main `DoubleTake` class automatically chooses the appropriate processor:
+- Uses `JSONGrepper` when only basic pattern replacement is needed (default settings)
+- Switches to `DataWalker` when advanced features are enabled (`use_faker=True`, custom `callback`, etc.)
+
+### Package Structure
+
+```
+doubletake/
+├── __init__.py              # Main DoubleTake class with auto-strategy selection
+├── searcher/
+│   ├── json_grepper.py      # Fast JSON-based PII replacement
+│   └── data_walker.py       # Flexible recursive data traversal
+├── types/
+│   └── settings.py          # TypedDict configuration schema
+└── utils/
+    ├── config_validator.py   # Settings validation
+    ├── data_faker.py         # Realistic fake data generation
+    └── pattern_manager.py    # Centralized regex pattern management
+```
+
+**Key Design Patterns**:
+- **Strategy Pattern**: Automatic selection between JSONGrepper/DataWalker
+- **Manager Pattern**: PatternManager centralizes all PII regex patterns
+- **Breadcrumb Navigation**: DataWalker tracks path through nested structures
+- **TypedDict Configuration**: Strongly typed settings with validation
+
+## Development Commands
+
+### Environment Setup
+```bash
+# Install dependencies (development)
+pipenv install --dev
+
+# Install production dependencies only
+pipenv install
+
+# Alternative package managers
+pip install doubletake
+poetry add doubletake
+```
+
+### Testing
+```bash
+# Run all tests (unittest discovery)
+pipenv run test
+
+# Run with coverage reporting
+pipenv run coverage
+
+# Run specific test file
+python -m unittest tests/unit/test_init.py
+
+# Quick test (appears to be specific GRPC test)
+pipenv run qt
+```
+
+### Code Quality
+```bash
+# Lint code (pylint must score a perfect 10)
+pipenv run lint
+
+# Type checking
+pipenv run mypy
+
+# Type check with HTML report
+pipenv run mypy-report
+```
+
+### Key Files for Development
+
+**Core Logic**:
+- `doubletake/__init__.py`: Main class with strategy selection logic
+- `doubletake/searcher/json_grepper.py`: JSON-based fast processing (~100 lines)
+- `doubletake/searcher/data_walker.py`: Tree traversal with context (~180 lines)
+
+**Configuration**:
+- `doubletake/types/settings.py`: TypedDict schema for all configuration options
+- `doubletake/utils/pattern_manager.py`: Built-in PII patterns (email, phone, SSN, etc.)
+
+**Testing Strategy**:
+- Unit tests in `tests/unit/` mirror the package structure
+- Mock data in `tests/mocks/test_data.py`
+- Tests cover both searcher strategies and all utility modules
+
+## Built-in PII Patterns
+
+The `PatternManager` defines these standard patterns:
+- `email`: Email addresses (standard regex)
+- `phone`: US phone number formats
+- `ssn`: Social Security Numbers (XXX-XX-XXXX)
+- `credit_card`: Credit card numbers
+- `ip_address`: IPv4 addresses
+- `url`: HTTP/HTTPS URLs
+
+## CI/CD Pipeline
+
+Uses **CircleCI** with comprehensive testing:
+- **Lint**: pylint with HTML reports (`pipenv run lint`)
+- **Type Check**: mypy with HTML reports (`pipenv run mypy-report`)
+- **Unit Tests**: pytest with coverage reporting + **SonarCloud integration**
+- **PyPI Publishing**: Automated on git tags
+
+**Coverage Requirements**: The project maintains high test coverage with detailed reporting.
+
+## Common Development Patterns
+
+### Adding New PII Patterns
+1. Add regex pattern to `PatternManager.patterns` dictionary
+2. Update `DataFaker` to generate appropriate fake data
+3. Add corresponding tests in `tests/unit/utils/test_pattern_manager.py`
+
+### Performance Optimization
+- For large datasets: Ensure `JSONGrepper` path is used (avoid `use_faker`, `callback`)
+- For complex logic: Use `DataWalker` with custom callbacks
+- Memory: `JSONGrepper` serializes the entire structure into a single JSON string, so memory use grows with the size of the input
+
+### Testing New Features
+- Follow the existing pattern: unit tests in `tests/unit/` mirroring package structure
+- Use mock data from `tests/mocks/test_data.py`
+- Test both processing strategies if applicable
+- Maintain coverage standards for CI/CD pipeline
diff --git a/doubletake/searcher/data_walker.py b/doubletake/searcher/data_walker.py
@@ -166,7 +166,14 @@ def __replace_value(
     def __replace_string_value(self, item) -> Union[str, None]:
         if not isinstance(item, str):
             return None
-        for pattern_key, pattern_value in PatternManager().patterns.items():
+        item = self.__replace_known_patterns_in_string(item)
+        item = self.__replace_extra_patterns_in_string(item)
+        return item
+
+    def __replace_known_patterns_in_string(self, item: str) -> str:
+        for pattern_key, pattern_value in self.__pattern_manager.patterns.items():
+            if pattern_key in self.__allowed:
+                continue
             match = re.search(pattern_value, item)
             if match:
                 return re.sub(
@@ -177,3 +184,16 @@ def __replace_string_value(self, item) -> Union[str, None]:
                     flags=re.IGNORECASE
                 )
         return item
+
+    def __replace_extra_patterns_in_string(self, item: str) -> str:
+        for pattern in self.__pattern_manager.extras:
+            match = re.search(pattern, item, flags=re.IGNORECASE)
+            if match:
+                return re.sub(
+                    pattern,
+                    self.__data_faker.get_fake_data(None),
+                    item,
+                    count=0,
+                    flags=re.IGNORECASE
+                )
+        return item
diff --git a/tests/unit/searcher/test_data_walker.py b/tests/unit/searcher/test_data_walker.py
@@ -507,3 +507,176 @@ def test_walk_and_replace_known_paths_triggers_replacement(self) -> None:
         if result["normal_field"] != original_normal:  # type: ignore
             # If it was replaced, it was due to PII pattern matching, not known paths
             pass  # This is acceptable
+
+    # Tests for __replace_string_value method through public interface
+    def test_replace_string_value_with_email(self) -> None:
+        """Test __replace_string_value with email string through walk_and_replace."""
+        test_email = "user@example.com"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_email)
+
+        # Should return a replaced string (different from original)
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_email)
+
+    def test_replace_string_value_with_phone(self) -> None:
+        """Test __replace_string_value with phone number string."""
+        test_phone = "555-123-4567"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_phone)
+
+        # Should return a replaced string
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_phone)
+
+    def test_replace_string_value_with_ssn(self) -> None:
+        """Test __replace_string_value with SSN string."""
+        test_ssn = "123-45-6789"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_ssn)
+
+        # Should return a replaced string
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_ssn)
+
+    def test_replace_string_value_with_credit_card(self) -> None:
+        """Test __replace_string_value with credit card number."""
+        test_cc = "4532-1234-5678-9012"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_cc)
+
+        # Should return a replaced string
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_cc)
+
+    def test_replace_string_value_with_no_pii(self) -> None:
+        """Test __replace_string_value with string containing no PII."""
+        test_string = "just a normal string with no sensitive data"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_string)
+
+        # Should return the original string unchanged
+        self.assertEqual(result, test_string)
+
+    def test_replace_string_value_with_extra_patterns(self) -> None:
+        """Test __replace_string_value with extra regex patterns."""
+        test_string = "USER123456"
+
+        # Add extra pattern to match USER followed by digits
+        walker = DataWalker(extras=[r'USER\d+'])  # type: ignore
+        result = walker.walk_and_replace(test_string)
+
+        # Should return a replaced string due to extra pattern
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_string)
+
+    def test_replace_string_value_with_multiple_patterns(self) -> None:
+        """Test __replace_string_value with string containing multiple PII patterns."""
+        test_string = "Contact: john@example.com or call 555-123-4567"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_string)
+
+        # Should return a replaced string (first match should trigger replacement)
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_string)
+
+    def test_replace_string_value_with_allowed_patterns(self) -> None:
+        """Test __replace_string_value respects allowed patterns."""
+        test_email = "user@example.com"
+
+        # Create walker with email in allowed list
+        walker = DataWalker(allowed=['email'])  # type: ignore
+        result = walker.walk_and_replace(test_email)
+
+        # Email should remain unchanged (in allowed list)
+        self.assertEqual(result, test_email)
+
+    def test_replace_string_value_with_non_string_input(self) -> None:
+        """Test __replace_string_value with non-string inputs returns None."""
+        walker = DataWalker()
+
+        # Test various non-string types
+        self.assertIsNone(walker.walk_and_replace(123))
+        self.assertIsNone(walker.walk_and_replace(True))
+        self.assertIsNone(walker.walk_and_replace(None))
+        self.assertIsNone(walker.walk_and_replace(45.67))
+        self.assertIsNone(walker.walk_and_replace(["list", "items"]))
+
+    def test_replace_string_value_empty_string(self) -> None:
+        """Test __replace_string_value with empty string."""
+        test_string = ""
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_string)
+
+        # Should return empty string unchanged
+        self.assertEqual(result, "")
+
+    def test_replace_string_value_whitespace_only(self) -> None:
+        """Test __replace_string_value with whitespace-only string."""
+        test_string = "   \t\n   "
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_string)
+
+        # Should return whitespace string unchanged (no PII patterns)
+        self.assertEqual(result, test_string)
+
+    @patch('doubletake.searcher.data_walker.DataFaker')
+    def test_replace_string_value_uses_data_faker(self, mock_data_faker_class) -> None:
+        """Test __replace_string_value uses DataFaker for replacements."""
+        mock_data_faker = Mock()
+        mock_data_faker.get_fake_data.return_value = "FAKE_EMAIL"
+        mock_data_faker_class.return_value = mock_data_faker
+
+        test_email = "test@example.com"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_email)
+
+        # DataFaker should have been called
+        mock_data_faker.get_fake_data.assert_called()
+        # Result should be the fake data
+        self.assertEqual(result, "FAKE_EMAIL")
+
+    def test_replace_string_value_with_mixed_case_pii(self) -> None:
+        """Test __replace_string_value handles mixed case PII patterns."""
+        test_email = "User@EXAMPLE.COM"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_email)
+
+        # Should handle case-insensitive matching and replace
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_email)
+
+    def test_replace_string_value_with_extra_pattern_only(self) -> None:
+        """Test __replace_string_value with string that only matches extra patterns."""
+        test_string = "CUSTOM-ID-98765"
+
+        # Add extra pattern that doesn't match standard PII
+        walker = DataWalker(extras=[r'CUSTOM-ID-\d+'])  # type: ignore
+        result = walker.walk_and_replace(test_string)
+
+        # Should be replaced due to extra pattern
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_string)
+
+    def test_replace_string_value_processes_known_patterns_first(self) -> None:
+        """Test that __replace_string_value processes known patterns before extra patterns."""
+        # Use an email that would match both known email pattern and a custom extra pattern
+        test_string = "admin@company.com"
+
+        # Create a walker with an extra pattern that would also match
+        walker = DataWalker(extras=[r'admin@.*'])  # type: ignore
+        result = walker.walk_and_replace(test_string)
+
+        # Should be replaced (either by known email pattern or a custom extra pattern)
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_string)