
Commit 4d52958

Adds initial WARP documentation
Adds a WARP.md file to provide guidance for developers working with the doubletake library, covering project overview, architecture, package structure, development commands, key files, PII patterns, CI/CD pipeline, and common development patterns.

Improves PII replacement in data walker

Refactors the string replacement logic in the data walker to handle known and extra patterns separately, allowing for more flexible PII masking. Adds comprehensive unit tests to validate the string replacement functionality, covering various scenarios including email, phone, SSN, credit card, extra patterns, allowed patterns, and mixed-case PII.
1 parent 7e3cd5a commit 4d52958

3 files changed: +330 -1 lines changed

WARP.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
# WARP.md

This file provides guidance to WARP (warp.dev) when working with code in this repository.

## Project Overview

**doubletake** is a Python library for intelligent PII (Personally Identifiable Information) detection and replacement. It provides high-performance processing of complex nested data structures with multiple replacement strategies.

### Core Architecture

The library uses a **dual-strategy architecture** for optimal performance vs. flexibility:

1. **JSONGrepper**: High-performance JSON serialization + regex replacement for simple use cases
2. **DataWalker**: Recursive tree traversal with full context for advanced features (callbacks, fake data, path targeting)

**Strategy Selection Logic**: The main `DoubleTake` class automatically chooses the appropriate processor:
- Uses `JSONGrepper` when only basic pattern replacement is needed (default settings)
- Switches to `DataWalker` when advanced features are enabled (`use_faker=True`, custom `callback`, etc.)
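
The selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's actual code: `choose_processor` and the plain-dict `settings` argument are hypothetical, and only the `use_faker` and `callback` options named in this file are checked.

```python
from typing import Any, Dict


def choose_processor(settings: Dict[str, Any]) -> str:
    """Return the name of the processor a settings dict would select.

    Hypothetical helper that mirrors the rule stated in this file; the
    real DoubleTake class may consider more options than these two.
    """
    needs_context = bool(settings.get("use_faker")) or settings.get("callback") is not None
    return "DataWalker" if needs_context else "JSONGrepper"


print(choose_processor({}))                   # JSONGrepper
print(choose_processor({"use_faker": True}))  # DataWalker
```

Keeping the decision in one small function makes it easy to unit test which settings push processing onto the slower, context-aware path.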

### Package Structure

```
doubletake/
├── __init__.py              # Main DoubleTake class with auto-strategy selection
├── searcher/
│   ├── json_grepper.py      # Fast JSON-based PII replacement
│   └── data_walker.py       # Flexible recursive data traversal
├── types/
│   └── settings.py          # TypedDict configuration schema
└── utils/
    ├── config_validator.py  # Settings validation
    ├── data_faker.py        # Realistic fake data generation
    └── pattern_manager.py   # Centralized regex pattern management
```

**Key Design Patterns**:
- **Strategy Pattern**: Automatic selection between JSONGrepper/DataWalker
- **Manager Pattern**: PatternManager centralizes all PII regex patterns
- **Breadcrumb Navigation**: DataWalker tracks path through nested structures
- **TypedDict Configuration**: Strongly typed settings with validation
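
To make the breadcrumb idea concrete, here is a small self-contained sketch of a recursive walk that passes each string leaf to a callback together with the path that led to it. The names `walk`, `mask`, and `Breadcrumb` are illustrative assumptions and are not taken from `data_walker.py`.

```python
from typing import Any, Callable, List, Tuple, Union

Breadcrumb = Tuple[Union[str, int], ...]


def walk(node: Any, visit: Callable[[Breadcrumb, str], str], path: Breadcrumb = ()) -> Any:
    """Rebuild `node`, handing every string leaf and its path to `visit`."""
    if isinstance(node, dict):
        return {key: walk(value, visit, path + (key,)) for key, value in node.items()}
    if isinstance(node, list):
        return [walk(value, visit, path + (index,)) for index, value in enumerate(node)]
    if isinstance(node, str):
        return visit(path, node)
    return node  # numbers, booleans and None pass through untouched


seen: List[Breadcrumb] = []


def mask(path: Breadcrumb, value: str) -> str:
    """Record the breadcrumb and mask only leaves stored under an 'email' key."""
    seen.append(path)
    return "***" if path and path[-1] == "email" else value


data = {"user": {"email": "a@b.example", "tags": ["x"]}}
result = walk(data, mask)
```

Because the callback receives the full path, decisions such as path targeting can depend on where a value lives and not only on what it looks like.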

## Development Commands

### Environment Setup
```bash
# Install dependencies (development)
pipenv install --dev

# Install production dependencies only
pipenv install

# Alternative package managers
pip install doubletake
poetry add doubletake
```

### Testing
```bash
# Run all tests (unittest discovery)
pipenv run test

# Run with coverage reporting
pipenv run coverage

# Run specific test file
python -m unittest tests/unit/test_init.py

# Quick test (appears to be specific GRPC test)
pipenv run qt
```

### Code Quality
```bash
# Lint code (requires score >= 10)
pipenv run lint

# Type checking
pipenv run mypy

# Type check with HTML report
pipenv run mypy-report
```

### Key Files for Development

**Core Logic**:
- `doubletake/__init__.py`: Main class with strategy selection logic
- `doubletake/searcher/json_grepper.py`: JSON-based fast processing (~100 lines)
- `doubletake/searcher/data_walker.py`: Tree traversal with context (~180 lines)

**Configuration**:
- `doubletake/types/settings.py`: TypedDict schema for all configuration options
- `doubletake/utils/pattern_manager.py`: Built-in PII patterns (email, phone, SSN, etc.)

**Testing Strategy**:
- Unit tests in `tests/unit/` mirror the package structure
- Mock data in `tests/mocks/test_data.py`
- Tests cover both searcher strategies and all utility modules

## Built-in PII Patterns

The `PatternManager` defines these standard patterns:
- `email`: Email addresses (standard regex)
- `phone`: US phone number formats
- `ssn`: Social Security Numbers (XXX-XX-XXXX)
- `credit_card`: Credit card numbers
- `ip_address`: IPv4 addresses
- `url`: HTTP/HTTPS URLs
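
The actual expressions live in `pattern_manager.py` and are not shown in this commit. As a rough illustration of what a name-to-regex mapping of this kind looks like, the sketch below uses deliberately simplified regexes for two of the listed keys; `EXAMPLE_PATTERNS` and `find_pii` are hypothetical names, and the real patterns may well differ.

```python
import re
from typing import Dict, List

# Illustrative regexes only; the real PatternManager expressions may differ.
EXAMPLE_PATTERNS: Dict[str, str] = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "ip_address": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
}


def find_pii(text: str) -> Dict[str, List[str]]:
    """Map each pattern name to the substrings it matches in `text`."""
    return {name: re.findall(regex, text) for name, regex in EXAMPLE_PATTERNS.items()}


hits = find_pii("SSN 123-45-6789 seen from 10.0.0.1")
```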

## CI/CD Pipeline

Uses **CircleCI** with comprehensive testing:
- **Lint**: pylint with HTML reports (`pipenv run lint`)
- **Type Check**: mypy with HTML reports (`pipenv run mypy-report`)
- **Unit Tests**: pytest with coverage reporting + **SonarCloud integration**
- **PyPI Publishing**: Automated on git tags

**Coverage Requirements**: The project maintains high test coverage with detailed reporting.

## Common Development Patterns

### Adding New PII Patterns
1. Add regex pattern to `PatternManager.patterns` dictionary
2. Update `DataFaker` to generate appropriate fake data
3. Add corresponding tests in `tests/unit/utils/test_pattern_manager.py`
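
Steps 1 and 3 can be sketched roughly as below. The `patterns` dictionary here is a stand-in for `PatternManager.patterns`, the `zip_code` pattern is a hypothetical addition, and step 2 (the `DataFaker` update) is omitted because its interface is not shown in this file.

```python
import re
import unittest

# Stand-in for PatternManager.patterns; the real dictionary has more entries.
patterns = {"ssn": r"\b\d{3}-\d{2}-\d{4}\b"}

# Step 1: register the new pattern (a hypothetical US ZIP code pattern).
patterns["zip_code"] = r"\b\d{5}(?:-\d{4})?\b"


class TestZipCodePattern(unittest.TestCase):
    """Step 3: cover matching and non-matching inputs for the new pattern."""

    def test_matches_zip_and_zip_plus_four(self) -> None:
        self.assertIsNotNone(re.search(patterns["zip_code"], "Portland, OR 97201"))
        self.assertIsNotNone(re.search(patterns["zip_code"], "97201-1234"))

    def test_ignores_short_numbers(self) -> None:
        self.assertIsNone(re.search(patterns["zip_code"], "room 972"))
```

Testing a negative case alongside the positive ones helps catch patterns that are too greedy before they start masking unrelated numbers.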

### Performance Optimization
- For large datasets: Ensure `JSONGrepper` path is used (avoid `use_faker`, `callback`)
- For complex logic: Use `DataWalker` with custom callbacks
- Memory efficiency: `JSONGrepper` processes entire structure as single JSON string
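
The single-JSON-string approach can be illustrated with the standard library alone. This is a sketch of the general technique, not the contents of `json_grepper.py`; `grep_replace` and `SSN_REGEX` are assumed names and the regex is simplified.

```python
import json
import re
from typing import Any

SSN_REGEX = r"\d{3}-\d{2}-\d{4}"  # illustrative pattern, not the library's


def grep_replace(data: Any, regex: str = SSN_REGEX, replacement: str = "***") -> Any:
    """Serialize once, substitute across the whole string, then parse back."""
    serialized = json.dumps(data)
    masked = re.sub(regex, replacement, serialized)
    return json.loads(masked)


record = {"users": [{"name": "Ann", "ssn": "123-45-6789"}], "note": "id 987-65-4321"}
cleaned = grep_replace(record)
```

One pass over one string avoids per-node Python overhead, but the regex sees keys and values alike and has no knowledge of where a match sits in the structure. The replacement text must also keep the serialized string valid JSON.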

### Testing New Features
- Follow the existing pattern: unit tests in `tests/unit/` mirroring package structure
- Use mock data from `tests/mocks/test_data.py`
- Test both processing strategies if applicable
- Maintain coverage standards for CI/CD pipeline

doubletake/searcher/data_walker.py

Lines changed: 21 additions & 1 deletion
@@ -166,7 +166,14 @@ def __replace_value(
     def __replace_string_value(self, item) -> Union[str, None]:
         if not isinstance(item, str):
             return None
-        for pattern_key, pattern_value in PatternManager().patterns.items():
+        item = self.__replace_known_patterns_in_string(item)
+        item = self.__replace_extra_patterns_in_string(item)
+        return item
+
+    def __replace_known_patterns_in_string(self, item: str) -> str:
+        for pattern_key, pattern_value in self.__pattern_manager.patterns.items():
+            if pattern_key in self.__allowed:
+                continue
             match = re.search(pattern_value, item)
             if match:
                 return re.sub(
@@ -177,3 +184,16 @@ def __replace_string_value(self, item) -> Union[str, None]:
                     flags=re.IGNORECASE
                 )
         return item
+
+    def __replace_extra_patterns_in_string(self, item: str) -> str:
+        for pattern in self.__pattern_manager.extras:
+            match = re.search(pattern, item)
+            if match:
+                return re.sub(
+                    pattern,
+                    self.__data_faker.get_fake_data(None),
+                    item,
+                    count=0,
+                    flags=re.IGNORECASE
+                )
+        return item
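
The two-phase flow in the diff above (known patterns first, skipping keys in the allowed list, then extra patterns) can be sketched in self-contained form. All names and regexes below are stand-ins, with fixed placeholder strings in place of `DataFaker` output. Note one deliberate difference: judging by the visible hunks, the committed helpers return as soon as the first matching pattern has been substituted, whereas this sketch keeps looping so that every matching pattern is applied. It is a design variation for comparison, not the committed behaviour.

```python
import re
from typing import Dict, Sequence

# Simplified stand-ins for the built-in patterns.
KNOWN: Dict[str, str] = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "ssn": r"\d{3}-\d{2}-\d{4}",
}


def replace_known(item: str, allowed: Sequence[str] = ()) -> str:
    """Apply every known pattern whose key is not in `allowed`."""
    for key, regex in KNOWN.items():
        if key in allowed:
            continue
        item = re.sub(regex, f"<{key}>", item, flags=re.IGNORECASE)
    return item


def replace_extras(item: str, extras: Sequence[str] = ()) -> str:
    """Apply caller-supplied regexes after the known patterns."""
    for regex in extras:
        item = re.sub(regex, "<extra>", item, flags=re.IGNORECASE)
    return item


def replace_string(item: str, allowed: Sequence[str] = (), extras: Sequence[str] = ()) -> str:
    """Known patterns first, then extras, matching the order used in the diff."""
    return replace_extras(replace_known(item, allowed), extras)
```

With this variant, a string such as `"mail ann@example.com ssn 123-45-6789"` has both values masked in one call, while passing `allowed=["email"]` leaves the address untouched.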
Lines changed: 173 additions & 0 deletions
@@ -507,3 +507,176 @@ def test_walk_and_replace_known_paths_triggers_replacement(self) -> None:
         if result["normal_field"] != original_normal:  # type: ignore
             # If it was replaced, it was due to PII pattern matching, not known paths
             pass  # This is acceptable
+
+    # Tests for __replace_string_value method through public interface
+    def test_replace_string_value_with_email(self) -> None:
+        """Test __replace_string_value with email string through walk_and_replace."""
+        test_email = "[email protected]"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_email)
+
+        # Should return a replaced string (different from original)
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_email)
+
+    def test_replace_string_value_with_phone(self) -> None:
+        """Test __replace_string_value with phone number string."""
+        test_phone = "555-123-4567"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_phone)
+
+        # Should return a replaced string
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_phone)
+
+    def test_replace_string_value_with_ssn(self) -> None:
+        """Test __replace_string_value with SSN string."""
+        test_ssn = "123-45-6789"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_ssn)
+
+        # Should return a replaced string
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_ssn)
+
+    def test_replace_string_value_with_credit_card(self) -> None:
+        """Test __replace_string_value with credit card number."""
+        test_cc = "4532-1234-5678-9012"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_cc)
+
+        # Should return a replaced string
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_cc)
+
+    def test_replace_string_value_with_no_pii(self) -> None:
+        """Test __replace_string_value with string containing no PII."""
+        test_string = "just a normal string with no sensitive data"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_string)
+
+        # Should return the original string unchanged
+        self.assertEqual(result, test_string)
+
+    def test_replace_string_value_with_extra_patterns(self) -> None:
+        """Test __replace_string_value with extra regex patterns."""
+        test_string = "USER123456"
+
+        # Add extra pattern to match USER followed by digits
+        walker = DataWalker(extras=[r'USER\d+'])  # type: ignore
+        result = walker.walk_and_replace(test_string)
+
+        # Should return a replaced string due to extra pattern
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_string)
+
+    def test_replace_string_value_with_multiple_patterns(self) -> None:
+        """Test __replace_string_value with string containing multiple PII patterns."""
+        test_string = "Contact: [email protected] or call 555-123-4567"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_string)
+
+        # Should return a replaced string (first match should trigger replacement)
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_string)
+
+    def test_replace_string_value_with_allowed_patterns(self) -> None:
+        """Test __replace_string_value respects allowed patterns."""
+        test_email = "[email protected]"
+
+        # Create walker with email in allowed list
+        walker = DataWalker(allowed=['email'])  # type: ignore
+        result = walker.walk_and_replace(test_email)
+
+        # Email should remain unchanged (in allowed list)
+        self.assertEqual(result, test_email)
+
+    def test_replace_string_value_with_non_string_input(self) -> None:
+        """Test __replace_string_value with non-string inputs returns None."""
+        walker = DataWalker()
+
+        # Test various non-string types
+        self.assertIsNone(walker.walk_and_replace(123))
+        self.assertIsNone(walker.walk_and_replace(True))
+        self.assertIsNone(walker.walk_and_replace(None))
+        self.assertIsNone(walker.walk_and_replace(45.67))
+        self.assertIsNone(walker.walk_and_replace(["list", "items"]))
+
+    def test_replace_string_value_empty_string(self) -> None:
+        """Test __replace_string_value with empty string."""
+        test_string = ""
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_string)
+
+        # Should return empty string unchanged
+        self.assertEqual(result, "")
+
+    def test_replace_string_value_whitespace_only(self) -> None:
+        """Test __replace_string_value with whitespace-only string."""
+        test_string = " \t\n "
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_string)
+
+        # Should return whitespace string unchanged (no PII patterns)
+        self.assertEqual(result, test_string)
+
+    @patch('doubletake.searcher.data_walker.DataFaker')
+    def test_replace_string_value_uses_data_faker(self, mock_data_faker_class) -> None:
+        """Test __replace_string_value uses DataFaker for replacements."""
+        mock_data_faker = Mock()
+        mock_data_faker.get_fake_data.return_value = "FAKE_EMAIL"
+        mock_data_faker_class.return_value = mock_data_faker
+
+        test_email = "[email protected]"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_email)
+
+        # DataFaker should have been called
+        mock_data_faker.get_fake_data.assert_called()
+        # Result should be the fake data
+        self.assertEqual(result, "FAKE_EMAIL")
+
+    def test_replace_string_value_with_mixed_case_pii(self) -> None:
+        """Test __replace_string_value handles mixed case PII patterns."""
+        test_email = "[email protected]"
+
+        walker = DataWalker()
+        result = walker.walk_and_replace(test_email)
+
+        # Should handle case-insensitive matching and replace
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_email)
+
+    def test_replace_string_value_with_extra_pattern_only(self) -> None:
+        """Test __replace_string_value with string that only matches extra patterns."""
+        test_string = "CUSTOM-ID-98765"
+
+        # Add extra pattern that doesn't match standard PII
+        walker = DataWalker(extras=[r'CUSTOM-ID-\d+'])  # type: ignore
+        result = walker.walk_and_replace(test_string)
+
+        # Should be replaced due to extra pattern
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_string)
+
+    def test_replace_string_value_processes_known_patterns_first(self) -> None:
+        """Test that __replace_string_value processes known patterns before extra patterns."""
+        # Use an email that would match both known email pattern and a custom extra pattern
+        test_string = "[email protected]"
+
+        # Create a walker with an extra pattern that would also match
+        walker = DataWalker(extras=[r'admin@.*'])  # type: ignore
+        result = walker.walk_and_replace(test_string)
+
+        # Should be replaced (either by known email pattern or a custom extra pattern)
+        self.assertIsInstance(result, str)
+        self.assertNotEqual(result, test_string)
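
These tests reach `__replace_string_value` through `walk_and_replace` because Python name-mangles double-underscore methods, so the plain name is not visible outside the class. The toy class below (not the real `DataWalker`) shows the mechanism, including the mangled name a test could use if it needed direct access.

```python
class Walker:
    """Toy class used only to demonstrate name mangling."""

    def walk_and_replace(self, item):
        # Inside the class body the double-underscore name resolves normally.
        return self.__replace_string_value(item)

    def __replace_string_value(self, item):
        return item.upper() if isinstance(item, str) else None


walker = Walker()
public_result = walker.walk_and_replace("abc")
# Outside the class the method is only reachable under its mangled name.
mangled_result = walker._Walker__replace_string_value("abc")
has_plain_name = hasattr(walker, "__replace_string_value")
```

Going through the public method keeps the tests independent of the private helper's name, which matters here because this commit splits that helper into two.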
