# How to Create a PII Classification Framework with JSON-Based Rules
Every organization handling personal data faces the same fundamental challenge: you can't protect what you can't classify. With GDPR enforcement actions surpassing €4.5 billion in cumulative fines since 2018 and the California Privacy Protection Agency ramping up CCPA audits in 2026, the cost of misclassifying — or entirely missing — personally identifiable information has never been higher.
Yet most engineering teams still rely on ad-hoc regex patterns scattered across codebases, or worse, manual data inventories maintained in spreadsheets that go stale the moment they're saved. The gap between what compliance officers need (a living, auditable classification system) and what developers actually build (a handful of grep commands) is where breaches happen.
A JSON-based PII classification framework solves this by turning your data classification logic into structured, version-controlled, machine-readable rules that both humans and automated scanners can consume. In this tutorial, we'll walk through building one from scratch — covering rule design, sensitivity tiers, regex and NLP pattern matching, integration with scanning pipelines, and how to keep the framework aligned with evolving regulations like GDPR, CCPA, and Brazil's LGPD.
## Why JSON for PII Classification Rules

Before writing a single rule, it's worth understanding why JSON is the right format for this job compared to alternatives like YAML, XML, or hardcoded logic.
Machine-readable and human-editable. JSON strikes the balance between being parseable by every programming language on Earth and being readable enough for a DPO or compliance officer to review during an audit. When Ireland's Data Protection Commission asks how you classify health data, you can point to a specific rule file — not a Python function buried three abstractions deep.
Version-controllable. Storing classification rules as JSON files in Git means every change is tracked, attributed, and reversible. This is critical for demonstrating compliance under GDPR Article 5(2), the accountability principle, which requires you to demonstrate that you have appropriate measures in place.
Portable. The same JSON ruleset can drive a Python scanner, a Go microservice, a JavaScript front-end validator, or a third-party tool like PrivaSift. You define the rules once and consume them everywhere.
Testable. JSON rules can be validated against schemas, unit-tested with sample data, and diffed in pull requests — none of which is practical with rules embedded in application code.
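As a minimal sketch of what "testable" means in practice, here is a stdlib-only sanity check over a ruleset (the `jsonschema` package would be the more complete option; the function name and error messages are illustrative):

```python
# Minimal rule validation: check required fields and tier range.
# REQUIRED_FIELDS mirrors the "required" list in the schema below.
REQUIRED_FIELDS = {"rule_id", "category", "pii_type", "sensitivity_tier",
                   "regulations", "detection", "action"}


def validate_rule(rule: dict) -> list[str]:
    """Return a list of problems; an empty list means the rule passes."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - rule.keys())]
    tier = rule.get("sensitivity_tier")
    if not isinstance(tier, int) or not 1 <= tier <= 4:
        problems.append(f"sensitivity_tier out of range: {tier!r}")
    return problems
```

A check like this runs in CI against every pull request that touches the ruleset, so a malformed rule never reaches production scanners.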
## Designing the Rule Schema

A well-designed schema is the foundation. Each rule should capture five things: what to detect, how sensitive it is, which regulations care about it, how to detect it, and what to do when it's found. Here's a base schema:
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "rule_id": {
      "type": "string",
      "description": "Unique identifier, e.g. PII-EU-001"
    },
    "category": {
      "type": "string",
      "enum": ["direct_identifier", "quasi_identifier", "sensitive", "financial", "biometric", "genetic", "health"]
    },
    "pii_type": {
      "type": "string",
      "description": "Human-readable label, e.g. 'Email Address'"
    },
    "sensitivity_tier": {
      "type": "integer",
      "minimum": 1,
      "maximum": 4
    },
    "regulations": {
      "type": "array",
      "items": {
        "type": "string",
        "enum": ["GDPR", "CCPA", "LGPD", "HIPAA", "PCI-DSS"]
      }
    },
    "detection": {
      "type": "object",
      "properties": {
        "method": { "type": "string", "enum": ["regex", "nlp", "dictionary", "checksum", "composite"] },
        "patterns": { "type": "array", "items": { "type": "string" } },
        "confidence_threshold": { "type": "number", "minimum": 0, "maximum": 1 },
        "context_keywords": { "type": "array", "items": { "type": "string" } }
      }
    },
    "action": {
      "type": "object",
      "properties": {
        "on_detect": { "type": "string", "enum": ["flag", "redact", "encrypt", "quarantine", "alert"] },
        "notify": { "type": "array", "items": { "type": "string" } },
        "retention_days": { "type": "integer" }
      }
    }
  },
  "required": ["rule_id", "category", "pii_type", "sensitivity_tier", "regulations", "detection", "action"]
}
```
This schema is intentionally opinionated. The sensitivity_tier field (1 = public, 4 = special category data under GDPR Article 9) drives downstream decisions about encryption, access control, and breach notification timelines. The detection.method field supports multiple strategies because no single technique catches everything — email addresses yield to regex, but detecting someone's religious beliefs in free text requires NLP.
## Building Your First Rules: Practical Examples

Let's build out a real ruleset covering the most common PII types. Save this as pii_rules.json:
```json
{
  "version": "2.1.0",
  "last_updated": "2026-04-01",
  "rules": [
    {
      "rule_id": "PII-DIRECT-001",
      "category": "direct_identifier",
      "pii_type": "Email Address",
      "sensitivity_tier": 2,
      "regulations": ["GDPR", "CCPA", "LGPD"],
      "detection": {
        "method": "regex",
        "patterns": [
          "[a-zA-Z0-9._%+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}"
        ],
        "confidence_threshold": 0.95,
        "context_keywords": ["email", "e-mail", "contact", "address"]
      },
      "action": {
        "on_detect": "flag",
        "notify": ["dpo@company.com"],
        "retention_days": 730
      }
    },
    {
      "rule_id": "PII-DIRECT-002",
      "category": "direct_identifier",
      "pii_type": "Phone Number (International)",
      "sensitivity_tier": 2,
      "regulations": ["GDPR", "CCPA"],
      "detection": {
        "method": "regex",
        "patterns": [
          "\\+?[1-9]\\d{1,14}",
          "\\(?\\d{3}\\)?[\\s.\\-]?\\d{3}[\\s.\\-]?\\d{4}"
        ],
        "confidence_threshold": 0.80,
        "context_keywords": ["phone", "mobile", "tel", "call", "sms"]
      },
      "action": {
        "on_detect": "flag",
        "notify": ["dpo@company.com"],
        "retention_days": 730
      }
    },
    {
      "rule_id": "PII-FINANCIAL-001",
      "category": "financial",
      "pii_type": "Credit Card Number",
      "sensitivity_tier": 3,
      "regulations": ["GDPR", "CCPA", "PCI-DSS"],
      "detection": {
        "method": "composite",
        "patterns": [
          "\\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|6(?:011|5[0-9]{2})[0-9]{12})\\b"
        ],
        "confidence_threshold": 0.90,
        "context_keywords": ["card", "credit", "payment", "visa", "mastercard"],
        "checksum": "luhn"
      },
      "action": {
        "on_detect": "redact",
        "notify": ["dpo@company.com", "security@company.com"],
        "retention_days": 90
      }
    },
    {
      "rule_id": "PII-SENSITIVE-001",
      "category": "health",
      "pii_type": "Medical Condition",
      "sensitivity_tier": 4,
      "regulations": ["GDPR", "HIPAA"],
      "detection": {
        "method": "nlp",
        "patterns": [],
        "confidence_threshold": 0.70,
        "context_keywords": ["diagnosis", "patient", "treatment", "symptom", "prescription", "medical", "condition", "disease"],
        "dictionary": "medical_conditions_icd10.json"
      },
      "action": {
        "on_detect": "quarantine",
        "notify": ["dpo@company.com", "legal@company.com"],
        "retention_days": 30
      }
    }
  ]
}
```
Notice how the credit card rule uses "method": "composite" — it combines regex with the Luhn checksum algorithm to eliminate false positives. A 16-digit number that fails Luhn validation isn't a real card number and shouldn't trigger alerts. This kind of layered detection is what separates a production-grade framework from a toy.
## Implementing the Rule Engine in Python

Rules are only useful if something executes them. Here's a minimal but functional rule engine:
```python
import json
import re
from dataclasses import dataclass


@dataclass
class Detection:
    rule_id: str
    pii_type: str
    value: str
    confidence: float
    sensitivity_tier: int
    action: str


def load_rules(path: str) -> list[dict]:
    with open(path) as f:
        data = json.load(f)
    return data["rules"]


def luhn_check(number: str) -> bool:
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0


def scan_text(text: str, rules: list[dict]) -> list[Detection]:
    detections = []
    text_lower = text.lower()

    for rule in rules:
        detection = rule["detection"]
        method = detection["method"]

        # Context boost: if context keywords appear near a match,
        # increase confidence
        context_keywords = detection.get("context_keywords", [])
        context_present = any(kw in text_lower for kw in context_keywords)
        context_boost = 0.1 if context_present else 0.0

        if method in ("regex", "composite"):
            for pattern in detection["patterns"]:
                for match in re.finditer(pattern, text):
                    value = match.group()
                    confidence = min(
                        detection["confidence_threshold"] + context_boost,
                        1.0
                    )

                    # Apply checksum validation for composite rules
                    if detection.get("checksum") == "luhn":
                        if not luhn_check(value):
                            continue

                    if confidence >= detection["confidence_threshold"]:
                        detections.append(Detection(
                            rule_id=rule["rule_id"],
                            pii_type=rule["pii_type"],
                            value=value,
                            confidence=confidence,
                            sensitivity_tier=rule["sensitivity_tier"],
                            action=rule["action"]["on_detect"]
                        ))

    return detections


# Usage
rules = load_rules("pii_rules.json")
text = """
Please contact john.doe@example.com or call +1-555-867-5309
for payment issues. Card on file: 4532015112830366
"""
results = scan_text(text, rules)
for d in results:
    print(f"[{d.action.upper()}] {d.pii_type}: {d.value} "
          f"(confidence: {d.confidence}, tier: {d.sensitivity_tier})")
```

This engine handles regex and composite methods. For NLP-based rules (like medical condition detection), you'd integrate a library like spaCy or a specialized NER model and route those rules to a separate handler based on the method field. The JSON framework stays the same — only the engine's dispatch logic changes.
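As a sketch of that dispatch idea, a handler for the dictionary method might look like this (the function name and signature are illustrative, not part of the engine above; a real handler would load its term list from the file named in the rule's dictionary field):

```python
import re


def scan_with_dictionary(text: str, terms: list[str],
                         context_keywords: list[str],
                         base_confidence: float) -> list[tuple[str, float]]:
    """Match dictionary terms as whole words, boosting confidence
    when any context keyword appears anywhere in the text."""
    text_lower = text.lower()
    boost = 0.1 if any(kw in text_lower for kw in context_keywords) else 0.0
    hits = []
    for term in terms:
        # Whole-word match so "flu" doesn't fire inside "fluent"
        if re.search(rf"\b{re.escape(term.lower())}\b", text_lower):
            hits.append((term, min(base_confidence + boost, 1.0)))
    return hits
```

The engine's dispatch then becomes a simple branch on the method field: regex and composite rules go to the pattern loop shown above, dictionary and nlp rules go to handlers like this one.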
## Sensitivity Tiers and Automated Response
The sensitivity_tier field isn't just metadata — it should drive automated responses across your infrastructure. Here's a practical tier mapping:
| Tier | Classification | Examples | Auto-Response | Breach Notification |
|------|----------------|----------|---------------|---------------------|
| 1 | Public | Company name, job title | Log only | Not required |
| 2 | Internal | Email, phone, IP address | Flag + review | 72 hours (GDPR Art. 33) |
| 3 | Confidential | SSN, credit card, passport | Redact + encrypt | 72 hours + data subject notification |
| 4 | Restricted | Health data, biometrics, racial/ethnic origin | Quarantine + legal review | Immediate escalation |
Under GDPR, Tier 4 data corresponds to Article 9 "special categories" — processing these without explicit consent or another lawful basis under Article 9(2) can result in fines up to €20 million or 4% of annual global turnover. The €1.2 billion fine against Meta in 2023 for data transfers underscored that regulators are willing to apply maximum penalties.
In your JSON rules, map tiers to automated workflows:
```json
{
  "tier_policies": {
    "1": { "log": true, "alert": false, "encrypt": false, "quarantine": false },
    "2": { "log": true, "alert": true, "encrypt": false, "quarantine": false },
    "3": { "log": true, "alert": true, "encrypt": true, "quarantine": false },
    "4": { "log": true, "alert": true, "encrypt": true, "quarantine": true }
  }
}
```
This policy file lives alongside your rules and gets consumed by whatever data pipeline, API gateway, or storage layer needs to enforce it.
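A consumer might reduce that policy file to a list of enforcement actions per detection. Here is a minimal sketch (the policies are inlined as a dict for self-containment; in practice you would json.load the file above):

```python
# Mirrors the tier_policies object from the JSON policy file.
TIER_POLICIES = {
    "1": {"log": True, "alert": False, "encrypt": False, "quarantine": False},
    "2": {"log": True, "alert": True, "encrypt": False, "quarantine": False},
    "3": {"log": True, "alert": True, "encrypt": True, "quarantine": False},
    "4": {"log": True, "alert": True, "encrypt": True, "quarantine": True},
}


def actions_for(tier: int) -> list[str]:
    """Return the enforcement actions a detection at this tier triggers."""
    policy = TIER_POLICIES[str(tier)]
    return [action for action, enabled in policy.items() if enabled]
```

A Tier 4 detection would then fan out to logging, alerting, encryption, and quarantine, while a Tier 1 hit is logged and nothing more.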
## Integrating with CI/CD and Data Pipelines
A classification framework only delivers value if it runs continuously. Here are three integration patterns:
1. Pre-commit hooks. Run the scanner against staged files before code is committed. This catches developers accidentally hardcoding PII in config files, test fixtures, or log statements — a problem that the GitGuardian 2025 report found affects 1 in 10 repositories.
```bash
#!/bin/bash
# .git/hooks/pre-commit

staged_files=$(git diff --cached --name-only --diff-filter=ACM)
for file in $staged_files; do
    results=$(python pii_scanner.py --rules pii_rules.json --file "$file")
    if [ -n "$results" ]; then
        echo "PII detected in $file:"
        echo "$results"
        exit 1
    fi
done
```

2. CI pipeline stage. Add a scanning step to your CI/CD pipeline that fails the build if Tier 3+ PII is found in any source file, migration, or fixture.
3. Data pipeline integration. For ETL and streaming pipelines, run the scanner as a transformation step. Kafka consumers, Airflow tasks, or dbt models can invoke the rule engine and apply redaction or encryption before data lands in a warehouse.
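As a sketch of that transformation step, a redaction pass might mask matched values before the record lands downstream (the regex is inlined for self-containment; a real pipeline task would pull patterns from pii_rules.json):

```python
import re


def redact(text: str, patterns: list[str], mask: str = "[REDACTED]") -> str:
    """Replace every match of the given PII patterns with a mask token."""
    for pattern in patterns:
        text = re.sub(pattern, mask, text)
    return text


record = "Contact john.doe@example.com about invoice 42."
clean = redact(record, [r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}"])
# clean == "Contact [REDACTED] about invoice 42."
```

The same function works as a Kafka consumer transform, an Airflow task, or a pre-load hook, because it depends only on the rule patterns, not on the pipeline framework.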
The key advantage of JSON-based rules here is that your security team can update classification logic without touching application code. A new rule gets merged via PR, reviewed by the DPO, and automatically picked up by every integration point.
## Keeping Rules Current with Regulatory Changes
Regulations evolve. CCPA was amended by CPRA. GDPR guidance gets updated through European Data Protection Board opinions. New laws like the EU AI Act introduce classification requirements for AI training data. Your framework needs a maintenance process.
Quarterly rule reviews. Schedule a review every 90 days where legal, security, and engineering jointly audit the ruleset. Use the last_updated field in your JSON to track staleness.
Regulatory change feeds. Subscribe to updates from the EDPB, California Privacy Protection Agency, and relevant national DPAs. When new guidance drops, create a ticket to evaluate rule impacts.
Rule versioning. The version field in your ruleset follows semver: major bumps for breaking changes (new required fields), minor for new rules, patch for pattern refinements. This lets downstream consumers pin to compatible versions.
Testing against labeled datasets. Maintain a set of sample documents with known PII and run your rules against them in CI. Track precision and recall over time — if a rule update drops recall below your threshold, the build fails.
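The precision/recall gate can be as simple as a set comparison in CI. Here is a minimal sketch with illustrative sample values (a real harness would build the detected set by running scan_text over the labeled corpus):

```python
def precision_recall(detected: set[str], labeled: set[str]) -> tuple[float, float]:
    """Compare detected PII values against hand-labeled ground truth."""
    true_positives = len(detected & labeled)
    precision = true_positives / len(detected) if detected else 1.0
    recall = true_positives / len(labeled) if labeled else 1.0
    return precision, recall


# Fail the build if a rule change drops recall below the threshold
p, r = precision_recall({"a@x.com", "12345"}, {"a@x.com", "b@y.com"})
assert r >= 0.5, f"recall {r:.2f} below threshold"
```

Tracking both numbers matters: a regex made stricter may raise precision while silently dropping recall, and only the labeled dataset will reveal it.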
## Frequently Asked Questions
### How many rules does a typical PII classification framework need?
A production framework typically contains 30–60 rules covering the PII types relevant to your jurisdictions and data processing activities. Start with the 10–15 most common types (email, phone, SSN/national ID, credit card, name, address, date of birth, IP address, cookie identifiers, health data, biometric data) and expand based on your Record of Processing Activities under GDPR Article 30. Don't try to build 200 rules on day one — start focused and iterate.
### What's the difference between regex-based and NLP-based PII detection?
Regex works well for structured PII with predictable formats: email addresses, phone numbers, credit card numbers, Social Security numbers. It's fast, deterministic, and easy to test. NLP-based detection is necessary for unstructured PII embedded in free text — names, medical conditions, religious beliefs, or political opinions. NLP methods use named entity recognition (NER) models and are probabilistic, which is why the confidence_threshold field matters. In practice, you need both: regex for the 70% of PII that has a pattern, NLP for the 30% that doesn't.
### How do I handle PII that spans multiple fields (quasi-identifiers)?
A single field like zip code or gender isn't PII on its own, but combinations can re-identify individuals. Research by Latanya Sweeney demonstrated that 87% of the U.S. population can be uniquely identified by zip code + date of birth + gender. Your framework should include composite detection rules that evaluate field combinations. Add a "linked_fields" property to rules that only trigger when multiple quasi-identifiers appear together in the same record or document. This is particularly important for GDPR compliance, where the regulation defines personal data broadly as any information that can identify a person "directly or indirectly."
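As an illustrative sketch, such a rule might look like the following (linked_fields and all the values shown are hypothetical extensions, not part of the schema defined earlier in this tutorial):

```json
{
  "rule_id": "PII-QUASI-001",
  "category": "quasi_identifier",
  "pii_type": "ZIP + DOB + Gender Combination",
  "sensitivity_tier": 3,
  "regulations": ["GDPR", "CCPA"],
  "detection": {
    "method": "composite",
    "patterns": [],
    "confidence_threshold": 0.85,
    "context_keywords": ["zip", "postal", "dob", "birth", "gender"],
    "linked_fields": ["zip_code", "date_of_birth", "gender"]
  },
  "action": {
    "on_detect": "flag",
    "notify": ["dpo@company.com"],
    "retention_days": 365
  }
}
```

The engine would fire this rule only when all the listed fields co-occur in one record, rather than when any single field matches.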
### Can I use this framework for CCPA's "sale of personal information" requirements?
Yes. CCPA (as amended by CPRA) requires businesses to track which categories of personal information are sold or shared for cross-context behavioral advertising. Add a "ccpa_category" field to your rules mapping each PII type to one of CCPA's statutory categories (identifiers, commercial information, biometric information, internet activity, geolocation, etc.). This mapping feeds directly into the disclosures required in your privacy policy and the metrics you report in response to consumer "Do Not Sell" requests.
### How does a JSON-based framework compare to using a dedicated tool like PrivaSift?
A custom JSON framework gives you full control over rule logic and integrates tightly with your specific infrastructure. However, building and maintaining the detection engine, NLP models, database connectors, cloud storage scanners, and reporting dashboards is significant engineering work. Tools like PrivaSift provide all of this out of the box — pre-built rule libraries covering 100+ PII types across multiple jurisdictions, automated scanning of files, databases, and cloud storage, plus compliance reporting. Many teams use both: PrivaSift for broad automated coverage, and custom JSON rules for organization-specific data types or business logic that no off-the-shelf tool would know about.
## Start Scanning for PII Today
PrivaSift automatically detects PII across your files, databases, and cloud storage — helping you stay GDPR and CCPA compliant without the manual work.
[Try PrivaSift Free →](https://privasift.com)