Data Privacy by Design: Building Compliance Into Your Product
Most organizations treat data privacy as a retrofit — something bolted on after the product ships, often in response to a regulatory inquiry or a breach. This approach is expensive, fragile, and increasingly untenable. The European Data Protection Board's 2025 enforcement report showed that 68% of GDPR penalties exceeding €1 million cited "failure to implement appropriate technical and organizational measures" — the exact language of Article 25, which mandates data protection by design and by default.
The concept isn't new. Ann Cavoukian introduced Privacy by Design in the 1990s, and GDPR codified it into law in 2018. Yet eight years later, most engineering teams still treat privacy as a compliance team's problem, not an architectural concern. The result: PII scattered across logging pipelines, analytics databases storing raw user data indefinitely, microservices passing unmasked personal data through message queues, and shadow copies of production data sitting in staging environments with relaxed access controls.
Building privacy into your product from the start isn't just about avoiding fines — though those are substantial (Meta's €1.2 billion penalty, Amazon's €746 million fine, and TikTok's €345 million fine demonstrate the scale). It's about reducing the engineering burden of compliance. Teams that bake privacy into their architecture spend less time scrambling to respond to Data Subject Access Requests, less time remediating audit findings, and less time explaining to regulators why PII ended up somewhere it shouldn't be.
What Article 25 Actually Requires

GDPR Article 25 imposes two distinct obligations: data protection by design and data protection by default.
By design means implementing appropriate technical and organizational measures — like pseudonymization and data minimization — at the time of determining the means for processing and at the time of processing itself. This isn't aspirational guidance. It's a legal requirement with its own enforcement track.
By default means that, out of the box, your product processes only the personal data necessary for each specific purpose. Users shouldn't need to navigate settings to limit data collection — the minimum should be the starting point.
In practice, this translates to concrete engineering decisions:
- Data minimization at the schema level: Don't collect fields you don't need. If your signup form asks for a phone number but your product never calls or texts users, remove the field.
- Purpose limitation in your data model: Tag data with its processing purpose and enforce those boundaries in code.
- Storage limitation by default: Set TTLs on personal data. If you don't have a retention policy, you're implicitly retaining everything forever — which violates Article 5(1)(e).
- Pseudonymization as an architectural pattern: Separate identifying data from behavioral data wherever possible.
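The last point can be made concrete with a keyed hash: derive a stable pseudonym from the user ID before it enters analytics or event pipelines. The sketch below is illustrative, not from a specific library — the `PSEUDONYM_KEY` is assumed to live in a secrets manager accessible only to the identity service:

```python
import hashlib
import hmac

# Assumption: in production this key comes from a secrets manager and is
# readable only by the identity service; rotating it breaks linkage to
# previously issued pseudonyms.
PSEUDONYM_KEY = b"replace-with-a-key-from-your-secrets-manager"


def pseudonymize_user_id(user_id: str) -> str:
    """Derive a stable, non-reversible pseudonym for analytics use.

    An HMAC (rather than a plain hash) means nobody without the key can
    confirm a guess by hashing a known user ID themselves.
    """
    return hmac.new(PSEUDONYM_KEY, user_id.encode(), hashlib.sha256).hexdigest()
```

Because the same input always yields the same pseudonym, analytics can still count distinct users and follow sessions without ever holding the real identifier.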
Data Minimization: Collect Less, Risk Less

Data minimization is the highest-leverage privacy engineering principle. Every field you don't collect is a field that can't be breached, can't be subject to a DSAR, and can't create a compliance gap.
Audit your data collection points
Walk through every form, API endpoint, and data pipeline in your product. For each field of personal data, answer:
1. Why do we collect this? (Map to a specific, documented purpose)
2. Could we achieve the same purpose with less data? (e.g., zip code instead of full address for geographic analytics)
3. Could we achieve the same purpose with pseudonymized or aggregated data?
4. When should this data be deleted?
If you can't answer question 1, you shouldn't be collecting the field.
Implement minimization in your API layer
Enforce data minimization at the API boundary, not just in the UI:
```python
from typing import Optional

from pydantic import BaseModel, Field


class UserRegistration(BaseModel):
    """Only collect what's required for account creation."""
    email: str
    password: str
    # Display name is optional — don't force collection
    display_name: Optional[str] = None

    # These fields are explicitly NOT included:
    # - phone_number (not needed for core product)
    # - date_of_birth (not needed unless age verification required)
    # - full_address (not needed for a SaaS product)


class UserAnalyticsEvent(BaseModel):
    """Pseudonymized event — no raw PII in analytics pipeline."""
    user_hash: str = Field(description="SHA-256 of user ID, not the ID itself")
    event_type: str
    timestamp: float
    # IP address is truncated to /24 for geolocation, then discarded
    geo_country: Optional[str] = None
    geo_region: Optional[str] = None
```
This approach prevents PII from leaking into systems where it doesn't belong. Your analytics pipeline doesn't need email addresses. Your logging system doesn't need full request bodies containing user data.
PII-Aware Logging and Observability

Application logs are one of the most common places PII accumulates undetected. A debug log statement that dumps a request payload, an error handler that includes user context, a query log that captures parameter values — these all create uncontrolled PII stores that typically have no retention policy and broad access.
Implement structured logging with PII redaction
Build PII filtering into your logging infrastructure, not into individual log statements:
```python
import logging
import re


class PIIRedactingFilter(logging.Filter):
    """Redact common PII patterns before log output."""

    PATTERNS = [
        (re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'), '[EMAIL_REDACTED]'),
        (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[SSN_REDACTED]'),
        (re.compile(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'), '[CC_REDACTED]'),
        (re.compile(r'\b\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b'), '[PHONE_REDACTED]'),
    ]

    def filter(self, record):
        if isinstance(record.msg, str):
            for pattern, replacement in self.PATTERNS:
                record.msg = pattern.sub(replacement, record.msg)
        return True


# Apply globally
logger = logging.getLogger()
logger.addFilter(PIIRedactingFilter())
```

This is a safety net, not a replacement for disciplined logging. The goal is defense in depth: developers should avoid logging PII intentionally, and the filter catches what slips through.
Set aggressive retention on log stores
Application logs containing potential PII should have a maximum retention of 30-90 days. Configure this at the infrastructure level:
- Elasticsearch/OpenSearch: Set Index Lifecycle Management (ILM) policies to delete indices after the retention period
- CloudWatch Logs: Set retention policies per log group
- S3 log archives: Apply lifecycle rules to transition to Glacier and then delete
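For the S3 case, the lifecycle rules above map directly onto a bucket lifecycle configuration. A sketch, assuming logs are written under a `logs/` prefix (adjust the prefix and day counts to match your retention policy):

```json
{
  "Rules": [
    {
      "ID": "expire-application-logs",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 90 }
    }
  ]
}
```

Because the deletion happens at the infrastructure level, it keeps working even when nobody remembers the retention policy exists.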
Privacy-Preserving Database Design

Your database schema encodes privacy decisions. A well-designed schema makes compliance easier; a poorly designed one makes it nearly impossible.
Separate identity from activity
Store personally identifiable information in a dedicated, access-controlled identity store. Reference it by opaque identifiers elsewhere:
```sql
-- Identity store (restricted access, encrypted, audit-logged)
CREATE TABLE user_identities (
    user_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) NOT NULL UNIQUE,
    full_name VARCHAR(255),
    phone VARCHAR(50),
    created_at TIMESTAMPTZ DEFAULT now(),
    deletion_requested_at TIMESTAMPTZ,
    data_retention_until TIMESTAMPTZ
);

-- Activity store (broader access for analytics/product teams)
-- Contains NO direct PII — only the opaque user_id
CREATE TABLE user_events (
    event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL,   -- opaque reference, not a natural key
    event_type VARCHAR(100) NOT NULL,
    event_data JSONB,        -- must NOT contain PII; enforce via application layer
    created_at TIMESTAMPTZ DEFAULT now()
);

-- When a user exercises their right to erasure (Article 17):
-- 1. Delete from user_identities
-- 2. user_events become pseudonymous (user_id no longer resolves to a person)
```
This architecture means that deleting a user's identity record effectively pseudonymizes all their activity data, making Article 17 (right to erasure) compliance straightforward without losing aggregate analytics value.
Implement row-level security for PII access
PostgreSQL's row-level security (RLS) can enforce access controls at the database level:
```sql
ALTER TABLE user_identities ENABLE ROW LEVEL SECURITY;

-- Only the identity_service role can access PII
CREATE POLICY pii_access ON user_identities
    FOR ALL TO identity_service
    USING (true);

-- Analytics role cannot see the identity table at all
-- (no policy = no access when RLS is enabled)
```
Automated PII Discovery and Monitoring
You cannot protect what you don't know exists. Even with the best architectural intentions, PII drifts into unexpected places: a developer adds a user_email column to a caching table for debugging, a CSV export with customer names lands in a shared S3 bucket, a third-party SDK starts logging device identifiers.
Scan continuously, not just during audits
One-time PII discovery during an annual audit is insufficient. You need continuous monitoring:
1. Schema monitoring: Alert when new database columns matching PII naming patterns are created
2. Content scanning: Periodically scan data stores for PII patterns in actual values — not just column names
3. File system scanning: Monitor cloud storage, shared drives, and file servers for documents containing PII
4. Pipeline validation: Scan data flowing through ETL pipelines, message queues, and API responses
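The first of these checks needs nothing more than the database's own catalog. A minimal sketch against SQLite for illustration — the PII name patterns are assumptions to extend for your domain, and the same query shape works against `information_schema.columns` in PostgreSQL or MySQL:

```python
import re
import sqlite3

# Assumed naming patterns that suggest a column holds PII
PII_COLUMN_PATTERNS = re.compile(
    r"(email|phone|ssn|dob|birth|address|full_name|first_name|last_name)",
    re.IGNORECASE,
)


def find_suspect_columns(conn: sqlite3.Connection) -> list:
    """Return (table, column) pairs whose names match PII patterns."""
    suspects = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        for row in conn.execute(f"PRAGMA table_info({table})"):
            column = row[1]  # row = (cid, name, type, notnull, dflt_value, pk)
            if PII_COLUMN_PATTERNS.search(column):
                suspects.append((table, column))
    return suspects
```

Run it nightly and alert on any match that isn't already in your PII data map — that diff is exactly the "PII drift" described above.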
A tool like PrivaSift automates this across your file systems, databases, and cloud storage — detecting email addresses, SSNs, credit card numbers, phone numbers, and other PII patterns without requiring manual review. Integrating it into your CI/CD pipeline catches PII in test fixtures and seed data before it reaches production:
```yaml
# .github/workflows/pii-scan.yml
name: PII Detection
on:
  pull_request:
    paths:
      - 'migrations/**'
      - 'seeds/**'
      - 'fixtures/**'
      - 'data/**'

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Scan for PII in data files
        run: |
          privasift scan ./migrations ./seeds ./fixtures ./data \
            --format json \
            --fail-on-detection \
            --sensitivity confidential
      - name: Upload scan results
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: pii-scan-results
          path: privasift-report.json
```
Build a PII data map
Maintain a living document that records where every category of PII lives, who can access it, and what the retention policy is. Update it automatically as scans discover changes. This becomes your foundation for DSAR responses, breach impact assessments, and regulatory inquiries.
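The shape of such a map can stay simple — one record per PII category and location. A hypothetical entry (the field names are illustrative, not a standard schema):

```yaml
- category: email_address
  store: postgres/user_identities
  columns: [email]
  access_roles: [identity_service]
  retention: until account deletion + 30 days
  legal_basis: contract (Article 6(1)(b))
  last_verified_by_scan: 2025-06-01
```

Keeping the map in version control alongside your schema migrations makes every new PII location a reviewable change rather than a silent addition.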
Handling Data Subject Rights Programmatically
GDPR grants data subjects specific rights (Articles 15-22), and your product needs to fulfill them within tight deadlines — one month for most requests, extendable by up to two further months for complex cases. Building these capabilities into your architecture from the start is dramatically easier than retrofitting them.
Right to access (Article 15)
Build a data export endpoint that aggregates all personal data from every system:
```python
from datetime import datetime, timezone


def handle_dsar_export(user_id: str) -> dict:
    """Collect all personal data for a DSAR response."""
    export = {
        "identity": get_user_identity(user_id),
        "activity_history": get_user_events(user_id),
        "support_tickets": get_user_tickets(user_id),
        "consent_records": get_consent_history(user_id),
        "data_processing_info": {
            "purposes": get_processing_purposes(user_id),
            "recipients": get_data_recipients(user_id),
            "retention_periods": get_retention_policies(),
            "source": get_data_source(user_id),
        },
        "export_generated_at": datetime.now(timezone.utc).isoformat(),
    }
    return export
```
Without a comprehensive PII inventory, you can't know which systems to query. This is where automated PII discovery pays for itself — it ensures your DSAR response includes data from every source, not just the ones you remembered.
Right to erasure (Article 17)
Implement cascading deletion that hits every data store:
```python
def handle_erasure_request(user_id: str) -> dict:
    """Execute right-to-erasure across all systems."""
    results = {}
    # 1. Delete from primary identity store
    results["identity"] = delete_user_identity(user_id)
    # 2. Delete from support system
    results["support"] = delete_user_tickets(user_id)
    # 3. Anonymize analytics (retain aggregates, remove identity link)
    results["analytics"] = anonymize_user_events(user_id)
    # 4. Remove from email marketing
    results["marketing"] = remove_from_mailing_lists(user_id)
    # 5. Notify processors (Article 17(2))
    results["processor_notifications"] = notify_processors_of_erasure(user_id)
    # 6. Log the erasure (for accountability — Article 5(2))
    log_erasure_action(user_id, results)
    return results
```
Key implementation detail: you must also notify any third-party processors you've shared the data with. Maintain a registry of processors per data category so this notification can be automated.
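Such a registry can start as a plain mapping from data category to downstream processors — a minimal sketch with hypothetical processor names:

```python
# Assumption: hypothetical processors, keyed by the data categories shared
PROCESSOR_REGISTRY = {
    "email_address": ["mail-delivery-vendor", "crm-platform"],
    "usage_events": ["analytics-vendor"],
    "support_tickets": ["helpdesk-platform"],
}


def processors_to_notify(categories: list) -> set:
    """Resolve which processors must receive an Article 17(2) notification."""
    notify = set()
    for category in categories:
        notify.update(PROCESSOR_REGISTRY.get(category, []))
    return notify
```

The erasure handler above would call this with the categories held for the user, then fan out notifications and record each acknowledgment for the accountability log.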
Embedding Privacy Into Your SDLC
Privacy by design fails if it lives only in architecture documents. It needs to be part of your daily engineering workflow.
Privacy threat modeling
Add privacy considerations to your threat modeling process. For every new feature or data flow, ask:
- What personal data does this feature process?
- Is all of it necessary? (data minimization)
- Where will it be stored, and for how long? (storage limitation)
- Who will have access? (access control)
- How will it be deleted when no longer needed? (retention)
- Does this require a DPIA? (high-risk processing check)
Code review checklist for privacy
Add these items to your PR review process:
- [ ] No new PII fields added without documented purpose and retention policy
- [ ] No raw PII in log statements
- [ ] No PII passed to analytics or third-party SDKs without pseudonymization
- [ ] New database columns containing PII have appropriate access controls
- [ ] API responses don't expose more PII than the client needs
- [ ] Test fixtures use synthetic data, not production PII
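For the last item, synthetic fixtures are easy to generate from the standard library alone — a minimal sketch (libraries such as Faker offer richer fakes, but even this removes any temptation to copy production rows):

```python
import random
import string
import uuid


def synthetic_user(seed=None) -> dict:
    """Generate a fake user record that cannot collide with real PII."""
    rng = random.Random(seed)
    handle = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {
        "user_id": str(uuid.UUID(int=rng.getrandbits(128))),
        # RFC 2606 reserves example.com, so these can never be real addresses
        "email": f"{handle}@example.com",
        "display_name": handle.capitalize(),
    }
```

Passing a fixed seed makes fixtures reproducible across test runs, which keeps snapshot tests stable.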
Privacy-focused testing
Include privacy in your automated test suite:
```python
import json


def test_analytics_events_contain_no_pii():
    """Verify analytics pipeline doesn't leak PII."""
    user = create_test_user(email="test@example.com", name="Jane Doe")
    perform_user_action(user, "page_view")

    events = get_analytics_events(user_id=user.id)
    for event in events:
        event_str = json.dumps(event)
        assert "test@example.com" not in event_str
        assert "Jane Doe" not in event_str
        assert user.id not in event_str  # should use hashed ID


def test_user_deletion_removes_all_pii():
    """Verify right-to-erasure removes PII from all stores."""
    user = create_test_user()
    handle_erasure_request(user.id)

    assert get_user_identity(user.id) is None
    assert get_user_tickets(user.id) == []
    assert count_raw_pii_references(user.email) == 0
```
Frequently Asked Questions
What's the difference between "privacy by design" and "privacy by default"?
Privacy by design means building data protection into the technical architecture and business processes from the outset — choosing pseudonymization, implementing data minimization, enforcing retention at the schema level. Privacy by default means that without any user action, the most privacy-protective settings apply. For example, a social media profile should be private by default, not public. Marketing communications should be opt-in, not opt-out. Data sharing with third parties should require explicit activation. Both are legally required under GDPR Article 25, and regulators assess them independently. The Spanish DPA fined CaixaBank €6 million in 2021 partly because the bank's default settings didn't adequately limit data processing to what was strictly necessary.
Can we retroactively apply privacy by design to an existing product?
Yes, but it's significantly more expensive than building it in from the start. Begin with a comprehensive PII audit — scan every data store, log pipeline, and third-party integration to understand where personal data currently lives. Prioritize remediation by risk: start with unencrypted PII, data without retention policies, and systems with excessive access. Implement the identity-separation pattern (dedicated PII store with opaque IDs elsewhere) incrementally — you don't need to rewrite your entire database schema at once. Automated PII scanning tools like PrivaSift make the discovery phase faster and more reliable than manual audits, which typically miss 30-40% of PII locations according to IAPP research.
How do we measure whether our privacy-by-design implementation is effective?
Track concrete metrics: DSAR response time (target under 72 hours, against the legal limit of one month), PII sprawl score (number of data stores containing PII — lower is better), retention compliance rate (percentage of data categories with enforced TTLs), data breach blast radius (how many records and PII categories would be exposed in a worst-case breach of any single system), and privacy debt (number of known data processing activities without documented legal basis or retention policy). Review these metrics quarterly with your DPO and engineering leadership. The goal isn't perfection — it's measurable improvement and demonstrable accountability, which is exactly what regulators look for under Article 5(2).
What technical controls are most critical for GDPR Article 25 compliance?
The most impactful technical controls, in order of priority: (1) Encryption at rest and in transit for all personal data — this is table stakes and the most commonly cited deficiency in enforcement actions. (2) Access control and least-privilege — implement role-based access with specific policies for PII stores; use database-level RLS where possible. (3) Automated data retention — TTLs and deletion jobs that enforce your retention schedule without human intervention. (4) PII detection and monitoring — continuous scanning to catch PII drift into unexpected locations. (5) Pseudonymization — separate identity from behavioral data so that a breach of one system doesn't expose full personal profiles. (6) Audit logging — immutable records of who accessed what PII, when, and why.
Does privacy by design apply to AI and machine learning systems?
Absolutely, and this is an area of increasing regulatory scrutiny. The EU AI Act (effective August 2025) adds requirements on top of GDPR for AI systems processing personal data. Key considerations: training data must be collected with an appropriate legal basis and minimized to what's necessary. Model outputs that constitute personal data (e.g., profiling, scoring) require transparency under Article 22. Feature engineering should use pseudonymized or aggregated data wherever possible — you rarely need raw PII to train an effective model. Inference logs containing personal data need the same retention and access controls as any other PII store. The Italian DPA's temporary ban on ChatGPT in 2023, later lifted after OpenAI implemented privacy measures, demonstrated that regulators will act against AI systems that don't embed privacy into their design.
Start Scanning for PII Today
PrivaSift automatically detects PII across your files, databases, and cloud storage — helping you stay GDPR and CCPA compliant without the manual work.
[Try PrivaSift Free →](https://privasift.com)