Intelligent PII Detection and Replacement for Python
doubletake is a powerful, flexible library for automatically detecting and replacing Personally Identifiable Information (PII) in your data structures. Whether you're anonymizing datasets for testing, protecting sensitive information in logs, or ensuring GDPR compliance, doubletake makes it effortless.
- π High Performance: Choose between fast JSON-based processing or flexible tree traversal
- π― Smart Detection: Built-in patterns for emails, phones, SSNs, credit cards, IPs, and URLs
- π§ Highly Configurable: Custom patterns, callbacks, and replacement strategies
- π Realistic Fake Data: Generate believable replacements using the Faker library
- π³ Deep Traversal: Handle complex nested data structures automatically
- β‘ Zero Dependencies: Lightweight with minimal external requirements
- π‘οΈ Type Safe: Full TypeScript-style type hints for better development experience
- π Path Targeting: Precisely target specific data paths for replacement
- π Safe Values: Protect specific values from being replaced
- π Idempotent Operations: Safely re-process data without double-masking and keep data relationships intact after masking
The Problem: You have sensitive data in complex structures that needs to be anonymized for testing, logging, or compliance, but existing solutions are either too rigid, too slow, or don't handle your specific use cases.
The Solution: doubletake provides intelligent PII detection with multiple processing strategies, letting you choose the perfect balance of performance and flexibility for your needs.
pip install doubletake
# or
pipenv install doubletake
# or
poetry add doubletakefrom doubletake import DoubleTake
# Initialize with default settings
db = DoubleTake()
# Your data with PII
data = [
{
"user_id": 12345,
"name": "John Doe",
"email": "[email protected]",
"phone": "555-123-4567",
"ssn": "123-45-6789"
},
{
"customer": {
"contact": "[email protected]",
"billing": {
"card": "4532-1234-5678-9012",
"address": "123 Main St"
}
}
}
]
# Replace PII automatically
masked_data = db.mask_data(data)
print(masked_data)
# Output:
# [
# {
# "user_id": 12345,
# "name": "John Doe",
# "email": "****@******.***",
# "phone": "***-***-****",
# "ssn": "***-**-****"
# },
# ...
# ]from doubletake import DoubleTake
# Generate realistic fake data instead of asterisks
db = DoubleTake(use_faker=True)
masked_data = db.mask_data(data)
# Emails become: [email protected]
# Phones become: +1-555-234-5678
# SSNs become: 987-65-4321# The callback receives:
# meta_match: MetaMatch (fields: pattern, value, replacement, breadcrumbs)
# faker: Faker instance (for generating fake data)
# value: The matched value being replaced
def custom_replacer(meta_match, faker, value):
"""Custom replacement with full context"""
# meta_match.pattern: the pattern key (e.g. 'email', 'ssn', etc.)
# meta_match.value: the regex pattern or extra pattern string
# meta_match.replacement: the default replacement value
# meta_match.breadcrumbs: set of path keys (for path-aware logic)
if meta_match.pattern == 'email':
return "***REDACTED_EMAIL***"
if meta_match.pattern == 'ssn':
return "XXX-XX-XXXX"
if meta_match.pattern == 'city':
# Use Faker to generate a fake city name
return faker.city()
if 'secret' in value:
return "***CLASSIFIED***"
return meta_match.replacement
db = DoubleTake(callback=custom_replacer)# Only replace certain types, allow others through
db = DoubleTake(
allowed=['email'], # Don't replace emails
extras={
'customer_id': r'CUST-\d+',
'reference': r'REF-[A-Z]{3}-\d{4}'
} # Custom patterns as a dict
)# Only replace PII at specific data paths
db = DoubleTake(
known_paths=[
'customer.email',
'billing.ssn',
'contacts.emergency.phone'
]
)# Protect specific values from being replaced
db = DoubleTake(
safe_values=[
'[email protected]', # Corporate email to keep
'[email protected]', # Support contact
'555-000-0000', # Test phone number
'N/A' # Placeholder values
]
)
# These values will never be replaced, even if they match PII patterns
data = {
"primary_email": "[email protected]", # β Stays unchanged
"user_email": "[email protected]", # β Gets replaced
"phone": "555-000-0000", # β Stays unchanged
"mobile": "555-123-4567" # β Gets replaced
}# Safely re-process data without double-masking
db = DoubleTake(
idempotent=True, # Prevents replacing already masked data
replace_with='*' # Use consistent masking character
)
# First processing
data = {"email": "[email protected]"}
masked_once = db.mask_data([data])
# Result: {"email": "****@******.***"}
# Second processing (safe!)
masked_twice = db.mask_data(masked_once)
# Result: {"email": "****@******.***"} β Same result, no double-maskingπ‘ Data Consistency with Faker: When using
idempotent=Truewithuse_faker=True, the same original value will always generate the same fake replacement across your entire dataset. This ensures data relationships remain intact after masking.
# Consistent faker replacements across multiple datasets
db = DoubleTake(use_faker=True, idempotent=True)
# User profile data
profile_data = {
"user_id": 12345,
"email": "[email protected]",
"department": "Engineering"
}
# Notification log data
notification_data = {
"timestamp": "2023-10-15",
"recipient": "[email protected]", # Same email as profile
"message": "Welcome to the team!"
}
# Both datasets masked separately
masked_profile = db.mask_data([profile_data])[0]
masked_notifications = db.mask_data([notification_data])[0]
print(masked_profile["email"]) # [email protected]
print(masked_notifications["recipient"]) # [email protected] β Same fake email!
# Data relationships preserved - you can still join/correlate the datasets
assert masked_profile["email"] == masked_notifications["recipient"] # β
Truedoubletake offers three complementary processing strategies:
- Best for: Large datasets, simple replacement needs
- Speed: β‘ Fastest option
- Method: JSON serialization + regex replacement
- Trade-offs: Less flexibility, no custom callbacks
# Automatically chosen when no custom logic needed
db = DoubleTake() # Uses JSONGrepper internally- Best for: Simple string processing, single-level data structures
- Speed: π° Moderate performance for straightforward replacements
- Method: Direct string pattern matching and replacement
- Features: Basic pattern detection, simple replacements, lightweight processing
- Trade-offs: No deep traversal, limited to string-to-string operations
# Automatically chosen when using advanced features
db = DoubleTake(use_faker=True) # Uses StringReplacer for simple string input
# example simple string input
# ['some log with your phone: 111-333-444', 'some log with your ssn: 123-456-7890']- Best for: Complex logic, custom callbacks, path targeting
- Speed: π’ Slower but more capable
- Method: Recursive tree traversal
- Features: Full context, breadcrumbs, custom callbacks
# Automatically chosen when using advanced features
db = DoubleTake(use_faker=True) # Uses DataWalker
db = DoubleTake(callback=my_func) # Uses DataWalker| Pattern | Description | Example |
|---|---|---|
email |
Email addresses | [email protected] |
phone |
Phone numbers (US formats) | 555-123-4567, (555) 123-4567 |
ssn |
Social Security Numbers | 123-45-6789, 123456789 |
credit_card |
Credit card numbers | 4532-1234-5678-9012 |
ip_address |
IPv4 addresses | 192.168.1.1 |
url |
HTTP/HTTPS URLs | https://example.com/path |
db = DoubleTake(
use_faker=False, # Use fake data vs asterisks
callback=None, # Custom replacement function
allowed=[], # Pattern types to skip
extras={}, # Additional regex patterns as a dict
safe_values=[], # Values to protect from replacement
idempotent=False, # Prevent double-masking operations
known_paths=[], # Specific paths to target
replace_with='*', # Character for replacements
maintain_length=False # Preserve original string length
)# Sanitize API responses for logging
api_response = {
"status": "success",
"data": {
"users": [
{"id": 1, "email": "[email protected]", "role": "admin"},
{"id": 2, "email": "[email protected]", "role": "user"}
]
},
"metadata": {"request_ip": "203.0.113.42"}
}
db = DoubleTake()
safe_response = db.mask_data([api_response])[0]
# Safe to log without exposing PII# Anonymize database exports for development
db_records = [
{"patient_id": "PT001", "ssn": "123-45-6789", "email": "[email protected]"},
{"patient_id": "PT002", "ssn": "987-65-4321", "email": "[email protected]"}
]
db = DoubleTake(
use_faker=True,
allowed=[], # Replace all PII types
)
anonymized_records = db.mask_data(db_records)
# Safe for development environments# Remove secrets from config files
config = {
"database": {
"host": "db.company.com",
"admin_email": "[email protected]"
},
"api_keys": {
"stripe": "sk_live_abcd1234...",
"support_email": "[email protected]"
}
}
db = DoubleTake(known_paths=['database.admin_email', 'api_keys.support_email'])
sanitized_config = db.mask_data([config])[0]# Sanitize logs while preserving important contact info
logs = [
"Please contact our support team at [email protected] or call +1-555-SUPPORT",
"User [email protected] reported an issue. Forward to [email protected]",
"Error: Invalid email [email protected] blocked by system"
]
db = DoubleTake(
safe_values=['[email protected]'], # Keep official support email visible
extras={'phone': r'\+1-555-SUPPORT'} # Keep support phone pattern
)
sanitized_logs = db.mask_data(logs)
# Result preserves support contacts but masks personal info# Different masking strategies for different environments
def create_masker_for_env(environment: str):
if environment == 'production':
# Strictest masking for production logs
return DoubleTake(
idempotent=True, # Safe re-processing
safe_values=[], # No exceptions
allowed=[] # Mask everything
)
elif environment == 'staging':
# Moderate masking, keep some test data
return DoubleTake(
safe_values=[
'[email protected]',
'[email protected]',
'555-000-0000'
],
idempotent=True
)
else: # development
# Minimal masking for debugging
return DoubleTake(
allowed=['email'], # Keep emails for debugging
safe_values=['[email protected]'],
idempotent=True
)
# Usage
prod_masker = create_masker_for_env('production')
staging_masker = create_masker_for_env('staging')
dev_masker = create_masker_for_env('development')# Process large datasets consistently across multiple runs
data_batches = [
[{"user": "[email protected]", "id": 1}],
[{"user": "[email protected]", "id": 2}],
[{"user": "[email protected]", "id": 3}] # Same email appears again
]
db = DoubleTake(
use_faker=True, # Consistent fake data
idempotent=True, # Safe for re-processing
safe_values=['[email protected]'] # Keep specific user visible
)
# Process each batch - [email protected] stays consistent
for batch in data_batches:
processed = db.mask_data(batch)
print(processed)doubletake includes comprehensive tests with 100% coverage:
# Run tests
pipenv run test
# Run with coverage
pipenv run pytest --cov=doubletake tests/Performance Benchmarks (10,000 records):
- JSONGrepper: ~0.1s (simple patterns)
- StringReplacer: ~0.2s (with fake data generation)
- DataWalker: ~0.3s (with fake data generation)
We welcome contributions! Please see our contributing guidelines for details.
# Development setup
git clone https://github.com/paulcruse3/doubletake.git
cd doubletake
pipenv install --dev
pipenv run testMIT License - see LICENSE file for details.
- Documentation (coming soon)
- Issues
- Changelog
- Security Policy
Made with β€οΈ for data privacy and security