Developer’s Guide: Creating PII Classification Rules with YAML
Every data breach involving personally identifiable information carries a price tag, and it's getting steeper. In 2024, the average cost of a data breach reached $4.88 million globally, according to IBM's annual Cost of a Data Breach Report. For organizations handling European or Californian consumer data, the regulatory consequences compound that figure: GDPR fines can reach €20 million or 4% of global annual turnover, whichever is higher, while CCPA violations carry penalties of up to $7,500 per intentional violation. When a single database scan reveals thousands of unprotected records, the math becomes existential.
The challenge most engineering teams face isn't awareness — it's implementation. Security engineers know PII needs to be classified and protected, but the default approach of hard-coding detection patterns into application logic creates brittle systems that break whenever regulations evolve or new data types emerge. A Social Security number pattern written into a Python script six months ago doesn't help when your company expands to Germany and suddenly needs to detect Personalausweis numbers, Steueridentifikationsnummern, and IBAN formats that weren't on anyone's radar.
This is where declarative PII classification rules shine. By defining detection patterns in YAML — a human-readable configuration format already familiar to most DevOps and platform teams — you decouple your classification logic from your application code. Rules become versionable, auditable, reviewable by compliance officers who don't write code, and deployable without recompiling anything. This guide walks you through building a production-grade YAML-based PII classification system from scratch.
Why YAML for PII Classification Rules

YAML (YAML Ain't Markup Language) has become the lingua franca of infrastructure-as-code, from Kubernetes manifests to CI/CD pipelines. Adopting it for PII classification rules brings several concrete advantages over alternatives like JSON, XML, or embedded regex patterns.
First, readability for non-developers. Your DPO or compliance officer needs to review and approve classification rules. YAML's indentation-based structure and support for comments make rules self-documenting:
```yaml
# German Tax ID: required for GDPR Art. 9 processing
- name: de_tax_id
```
Second, version control integration. YAML files diff cleanly in Git, making it trivial to track who changed which rule, when, and why. This gives you an audit trail that supports the records of processing activities required by Article 30 of GDPR.
Third, separation of concerns. Your scanning engine stays generic. It reads YAML, compiles patterns, and applies them. When France's CNIL issues new guidance on what constitutes PII in AI training data, you update a YAML file — not your core detection engine.
Anatomy of a PII Classification Rule

A well-designed classification rule needs more than a regex pattern. Here's the complete schema for a production-ready rule definition:
```yaml
rules:
  - id: "pii_email_address"
    name: email_address
    display_name: "Email Address"
    description: "Detects standard email address formats"
    category: contact_info
    severity: medium

    # Detection configuration
    detection:
      type: regex
      # Note: the TLD class is [A-Za-z]; a "|" inside brackets would match a literal pipe
      pattern: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
      case_sensitive: false

    # Contextual boosters: increase confidence when nearby keywords exist
    context:
      keywords: ["email", "e-mail", "contact", "address", "mailto"]
      proximity_chars: 50
      boost: 0.15

    # Validation layer: reduce false positives
    validation:
      - type: checksum
        algorithm: none
      - type: deny_list
        values: ["example@example.com", "noreply@localhost"]

    # Compliance mapping
    regulations:
      gdpr:
        lawful_basis_required: true
        article: "Art. 4(1)"
        data_category: "personal_data"
      ccpa:
        covered: true
        category: "identifiers"

    # Operational metadata
    enabled: true
    version: "1.2.0"
    last_updated: "2026-03-15"
    author: "security-team"
```
Each field serves a specific purpose in the detection pipeline. The severity field drives alerting thresholds. The context block reduces false positives by checking whether email-related keywords appear near the match. The regulations mapping lets your compliance dashboard automatically flag which legal frameworks apply to each detection.
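To make the role of the `detection`, `context`, and `validation` blocks concrete, here is a minimal Python sketch of how a scanning engine might apply the email rule. It is an illustration under stated assumptions, not any particular tool's implementation: the rule is written as a plain dict (as it might look after YAML parsing), the function name `scan` is hypothetical, and `BASE_CONFIDENCE` is an assumed starting score, since the schema above does not define one.

```python
import re

# Hypothetical in-memory form of the email rule after YAML parsing.
EMAIL_RULE = {
    "id": "pii_email_address",
    "pattern": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "case_sensitive": False,
    "context": {
        "keywords": ["email", "e-mail", "contact", "address", "mailto"],
        "proximity_chars": 50,
        "boost": 0.15,
    },
    "deny_list": ["example@example.com", "noreply@localhost"],
}

BASE_CONFIDENCE = 0.70  # assumption: the schema does not specify a base score


def scan(text, rule, base=BASE_CONFIDENCE):
    """Return (match, confidence) pairs for one rule applied to one text."""
    flags = 0 if rule["case_sensitive"] else re.IGNORECASE
    regex = re.compile(rule["pattern"], flags)
    ctx = rule["context"]
    findings = []
    for m in regex.finditer(text):
        value = m.group(0)
        # Validation layer: drop known placeholder addresses.
        if value.lower() in rule["deny_list"]:
            continue
        # Context layer: look for keywords around the match, excluding
        # the match itself so "contact@corp.io" cannot boost itself.
        n = ctx["proximity_chars"]
        before = text[max(0, m.start() - n):m.start()]
        after = text[m.end():m.end() + n]
        window = (before + " " + after).lower()
        boosted = any(k in window for k in ctx["keywords"])
        confidence = min(1.0, base + ctx["boost"]) if boosted else base
        findings.append((value, round(confidence, 2)))
    return findings
```

Calling `scan("Contact email: jane.doe@corp.io", EMAIL_RULE)` yields a boosted finding because "email" appears within 50 characters of the match, while a bare address with no nearby keywords stays at the base score.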
Building Rules for Common PII Types

Let's build out a practical rule set covering the PII categories that trigger the most regulatory scrutiny. These patterns have been tested against real-world datasets and tuned to balance recall (catching real PII) against precision (avoiding false positives).
Government Identifiers
Government IDs carry the highest severity because their exposure directly enables identity theft. The 2024 National Public Data breach exposed 2.9 billion records including Social Security numbers, resulting in a class-action lawsuit and regulatory investigations across multiple jurisdictions.
```yaml
rules:
  - id: "pii_ssn_us"
    name: us_ssn
    display_name: "US Social Security Number"
    category: government_id
    severity: critical
    detection:
      type: regex
      # Matches XXX-XX-XXXX with or without dashes
      pattern: '\b(?!000|666|9\d{2})\d{3}[-\s]?(?!00)\d{2}[-\s]?(?!0000)\d{4}\b'
      case_sensitive: false
    context:
      keywords: ["ssn", "social security", "social sec", "tax id"]
      proximity_chars: 100
      boost: 0.25
    validation:
      - type: luhn
        enabled: false  # SSNs don't use Luhn
      - type: range_check
        min_area_number: 1
        max_area_number: 899
    regulations:
      ccpa:
        covered: true
        category: "government_identifiers"
        requires_opt_out: true

  - id: "pii_nino_uk"
    name: uk_national_insurance
    display_name: "UK National Insurance Number"
    category: government_id
    severity: critical
    detection:
      type: regex
      pattern: '\b[A-CEGHJ-PR-TW-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b'
      case_sensitive: false
    context:
      keywords: ["national insurance", "NI number", "NINO"]
      proximity_chars: 80
      boost: 0.20
```
Financial Data
PCI DSS adds another compliance layer on top of GDPR and CCPA for financial PII. Credit card numbers are particularly well-suited to validation because they follow the Luhn algorithm:
```yaml
- id: "pii_credit_card"
  name: credit_card_number
  display_name: "Credit Card Number"
  category: financial
  severity: critical
  detection:
    type: regex
    # Visa, Mastercard (51-55), and Discover are 16 digits; American Express (34/37) is 15
    pattern: '\b(?:4\d{15}|5[1-5]\d{14}|3[47]\d{13}|6(?:011|5\d{2})\d{12})\b'
    case_sensitive: false
  validation:
    - type: luhn
      enabled: true  # Eliminates ~90% of false positives
    - type: deny_list
      values: ["4111111111111111", "5500000000000004"]  # Common test numbers
  regulations:
    gdpr:
      lawful_basis_required: true
      data_category: "financial_data"
    ccpa:
      covered: true
      category: "financial_information"
    pci_dss:
      covered: true
      requirement: "3.4"
```
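The `luhn` validation step referenced in the rule can be sketched in a few lines of Python. This is the standard Luhn checksum; the function names `luhn_valid` and `validate_card`, and the way the deny list is combined with the checksum, are illustrative assumptions rather than part of any specific scanner.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


# Test numbers taken from the deny_list in the rule above.
DENY_LIST = {"4111111111111111", "5500000000000004"}


def validate_card(candidate: str) -> bool:
    """Keep a regex match only if it passes Luhn and is not a known test number."""
    digits = "".join(c for c in candidate if c.isdigit())
    return luhn_valid(digits) and digits not in DENY_LIST
```

Note that the common test numbers are themselves Luhn-valid, which is why the rule pairs the checksum with a deny list rather than relying on the checksum alone.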
Health Information
Under GDPR Article 9, health data is a "special category": processing it is prohibited unless one of the Article 9(2) conditions, such as explicit consent, applies. HIPAA adds US-specific requirements. Detecting health PII often requires Named Entity Recognition (NER) in addition to pattern matching:
```yaml
- id: "pii_health_id_us"
  name: us_health_plan_id
  display_name: "US Health Plan Beneficiary Number"
  category: health
  severity: critical
  detection:
    type: composite
    methods:
      - type: regex
        pattern: '\b[A-Z]{3}\d{9}\b'
      - type: ner
        model: "health_entity_v2"
        entity_types: ["HEALTH_PLAN_ID", "MRN"]
    combine: "any"
  regulations:
    hipaa:
      covered: true
      identifier_type: "health_plan_beneficiary"
    gdpr:
      article: "Art. 9"
      data_category: "health_data"
      special_category: true
```
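One way to read `combine: "any"` is as a union over the results of each detection method. The sketch below shows that interpretation in Python. Everything here is hypothetical: `keyword_ner_stub` is a trivial stand-in for a real NER model (it is not `health_entity_v2`), and the `"all"` mode is an assumed counterpart that the rule above does not mention.

```python
import re
from typing import Callable, List

Detector = Callable[[str], List[str]]


def regex_detector(pattern: str) -> Detector:
    """Wrap a regex so it behaves like any other detector."""
    compiled = re.compile(pattern)
    return lambda text: [m.group(0) for m in compiled.finditer(text)]


def keyword_ner_stub(text: str) -> List[str]:
    """Placeholder for an NER model: returns the token following 'MRN'."""
    return re.findall(r"\bMRN[:\s]+(\w+)", text)


def composite(detectors: List[Detector], combine: str = "any") -> Detector:
    """Combine detectors: 'any' is a union, 'all' (assumed) an intersection."""
    def run(text: str) -> List[str]:
        results = [set(d(text)) for d in detectors]
        if combine == "any":
            merged = set().union(*results)
        elif combine == "all":
            merged = set.intersection(*results) if results else set()
        else:
            raise ValueError(f"unknown combine mode: {combine}")
        return sorted(merged)
    return run
```

With `combine="any"`, a value found by either the regex or the NER stand-in is reported, which favors recall; an intersection mode would favor precision instead.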
Advanced Techniques: Context-Aware Detection

Raw pattern matching generates too many false positives for production use. The string "123-45-6789" might be an SSN — or a product SKU, a phone extension, or a reference number. Context-aware detection dramatically improves accuracy.
Column-Name Heuristics
When scanning structured data sources (databases, CSVs, spreadsheets), column names provide strong signals:
```yaml
column_hints:
  - id: "hint_ssn_column"
    applies_to: ["pii_ssn_us"]
    patterns:
      - '(?i)^(ssn|social_?sec|tax_?id|tin)$'
      - '(?i)social.*security'
      - '(?i)national.*id'
    confidence_override: 0.95  # Near-certain when column name matches

  - id: "hint_email_column"
    applies_to: ["pii_email_address"]
    patterns:
      - '(?i)^(email|e_?mail|user_?email|contact_?email)$'
    confidence_override: 0.99
```
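A scanner could consume these hints roughly as follows. This is a sketch under assumptions: the hints are shown as Python dicts (as they might look after parsing), `column_confidence` is a hypothetical helper name, and "override" is taken to mean that the hint's value replaces whatever score the content match produced.

```python
import re

# Hypothetical parsed form of the column_hints block above.
COLUMN_HINTS = [
    {
        "applies_to": ["pii_ssn_us"],
        "patterns": [
            r"(?i)^(ssn|social_?sec|tax_?id|tin)$",
            r"(?i)social.*security",
            r"(?i)national.*id",
        ],
        "confidence_override": 0.95,
    },
    {
        "applies_to": ["pii_email_address"],
        "patterns": [r"(?i)^(email|e_?mail|user_?email|contact_?email)$"],
        "confidence_override": 0.99,
    },
]


def column_confidence(column_name: str, rule_id: str, default: float) -> float:
    """Return the override if a hint for this rule matches the column name."""
    for hint in COLUMN_HINTS:
        if rule_id not in hint["applies_to"]:
            continue
        if any(re.search(p, column_name) for p in hint["patterns"]):
            return hint["confidence_override"]
    return default
```

Unanchored patterns such as `national.*id` are deliberately loose, so a column like `international_id` would also trigger the SSN hint; anchoring with `^` and `$` is the safer choice when column naming is predictable.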
Negative Context (False Positive Suppression)
Equally important is recognizing when a pattern match is almost certainly not PII:
```yaml
suppression_rules:
  - id: "suppress_ssn_in_code"
    applies_to: ["pii_ssn_us"]
    conditions:
      - type: file_extension
        values: [".py", ".js", ".ts", ".java", ".go"]
      - type: nearby_keywords
        keywords: ["test", "mock", "fake", "example", "placeholder"]
        proximity_chars: 30
    action: reduce_confidence
    reduction: 0.60

  - id: "suppress_cc_test_numbers"
    applies_to: ["pii_credit_card"]
    conditions:
      - type: value_match
        pattern: '^4111|^5500|^3782'  # Known test prefixes
      - type: nearby_keywords
        keywords: ["test", "sandbox", "stripe_test"]
    action: suppress
```
This layered approach — pattern detection, context boosting, validation, and suppression — is how tools like PrivaSift achieve detection accuracy above 95% while keeping false positive rates below 2%.
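The suppression step can be sketched as a function that adjusts a finding's confidence after detection. Two assumptions are baked in and should be checked against your engine's actual semantics: all listed conditions must hold for the rule to fire, and `reduction` is subtracted from the confidence rather than multiplied. The helper name `apply_suppression` is hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical parsed form of the "suppress_ssn_in_code" rule.
SUPPRESS_SSN_IN_CODE = {
    "applies_to": ["pii_ssn_us"],
    "file_extensions": [".py", ".js", ".ts", ".java", ".go"],
    "keywords": ["test", "mock", "fake", "example", "placeholder"],
    "proximity_chars": 30,
    "reduction": 0.60,
}


def apply_suppression(rule_id, confidence, path, text, start, end,
                      supp=SUPPRESS_SSN_IN_CODE):
    """Lower confidence when every suppression condition matches (assumed AND)."""
    if rule_id not in supp["applies_to"]:
        return confidence
    if PurePosixPath(path).suffix not in supp["file_extensions"]:
        return confidence
    n = supp["proximity_chars"]
    # Text within n characters on either side of the match span.
    window = (text[max(0, start - n):start] + " " + text[end:end + n]).lower()
    if not any(k in window for k in supp["keywords"]):
        return confidence
    # Assumption: "reduction" is subtractive, floored at zero.
    return max(0.0, round(confidence - supp["reduction"], 2))
```

With the default `confidence_threshold` of 0.75 used later in this guide, a 0.90 finding in a test fixture would drop to 0.30 and fall below the reporting threshold, while the same value in a CSV export would be reported unchanged.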
Organizing Rules at Scale
A single YAML file works for a proof of concept, but production deployments serving multiple teams, regions, and compliance frameworks need structure:
```
pii-rules/
├── base/
│   ├── contact_info.yaml          # Email, phone, address
│   ├── government_ids.yaml        # SSN, passport, national IDs
│   ├── financial.yaml             # Credit cards, bank accounts
│   └── health.yaml                # Medical record numbers, conditions
├── regions/
│   ├── eu/
│   │   ├── germany.yaml           # Personalausweis, Steuer-ID
│   │   ├── france.yaml            # INSEE, carte vitale
│   │   └── _defaults.yaml         # GDPR-wide settings
│   ├── us/
│   │   ├── california.yaml        # CCPA-specific extensions
│   │   └── hipaa.yaml             # Health-specific rules
│   └── uk/
│       └── post_brexit.yaml       # UK GDPR divergences
├── overrides/
│   ├── team_analytics.yaml        # Team-specific tuning
│   └── legacy_systems.yaml        # Looser matching for old formats
├── suppressions/
│   └── known_false_positives.yaml
└── config.yaml                    # Global settings, inheritance
```
The config.yaml file controls rule inheritance and precedence:
```yaml
config:
  rule_precedence:
    - suppressions/*   # Highest priority, always applied
    - overrides/*      # Team-specific tuning
    - regions/eu/*     # Regional specialization
    - base/*           # Foundation rules (lowest priority)

  defaults:
    confidence_threshold: 0.75
    max_false_positive_rate: 0.02
    scan_timeout_seconds: 300

  ci_integration:
    fail_on_severity: critical
    report_format: sarif
    notify_channel: "#security-alerts"
```
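One plausible way to implement `rule_precedence` is to rank each rule file by the first glob it matches, then merge files from lowest to highest priority so that later settings win. The sketch below assumes exactly that, plus a shallow per-rule merge; the function names and the shape of the `files` mapping (path to `{rule_id: settings}`) are hypothetical.

```python
from fnmatch import fnmatch

# Globs copied from config.yaml, highest priority first.
RULE_PRECEDENCE = ["suppressions/*", "overrides/*", "regions/eu/*", "base/*"]


def priority(path, precedence=RULE_PRECEDENCE):
    """Rank of the first matching glob; unmatched files rank lowest."""
    for rank, glob in enumerate(precedence):
        if fnmatch(path, glob):
            return rank
    return len(precedence)


def merge_rules(files):
    """Merge {path: {rule_id: settings}} so higher-priority files override."""
    merged = {}
    # Apply lowest priority first; later updates overwrite earlier ones.
    for path in sorted(files, key=priority, reverse=True):
        for rule_id, settings in files[path].items():
            merged.setdefault(rule_id, {}).update(settings)
    return merged
```

For example, if `base/government_ids.yaml` marks `pii_ssn_us` as enabled and `overrides/legacy_systems.yaml` sets `enabled: false`, the merged rule keeps the base severity but ends up disabled.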
Integrating Rules into Your CI/CD Pipeline
Classification rules without enforcement are documentation. Wire your YAML rules into your deployment pipeline so PII exposures are caught before they reach production:
```yaml
# .github/workflows/pii-scan.yaml
name: PII Classification Scan

on:
  pull_request:
    paths:
      - 'src/**'
      - 'migrations/**'
      - 'data/**'

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run PII scan with custom rules
        run: |
          privasift scan \
            --rules-dir ./pii-rules \
            --severity-threshold medium \
            --format sarif \
            --output pii-report.sarif \
            ./src ./migrations

      - name: Upload SARIF report
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: pii-report.sarif

      - name: Fail on critical findings
        run: |
          CRITICAL=$(jq '[.runs[].results[] | select(.level == "error")] | length' pii-report.sarif)
          if [ "$CRITICAL" -gt 0 ]; then
            echo "::error::Found $CRITICAL critical PII exposures"
            exit 1
          fi
```
This ensures that every pull request touching source code, database migrations, or data files is automatically scanned against your latest classification rules. The SARIF format integrates natively with GitHub's Security tab, giving your team a centralized view of PII findings alongside other code security alerts.
Testing and Validating Your Rules
PII classification rules are code and deserve the same testing rigor. Build a test suite with known positive and negative samples:
```yaml
# tests/test_ssn_rule.yaml
test_suite:
  rule_id: "pii_ssn_us"

  true_positives:
    - input: "My SSN is 123-45-6789"
      expected_match: "123-45-6789"
      expected_confidence: ">0.90"
    - input: "social security number 234 56 7890"
      expected_match: "234 56 7890"
      expected_confidence: ">0.85"
    - input: "TaxID: 345678901"
      expected_match: "345678901"
      expected_confidence: ">0.75"

  true_negatives:
    - input: "Order #123-45-6789 shipped today"
      context: "ecommerce_log"
      reason: "Order number format overlap"
    - input: "Phone: 123-456-7890"
      reason: "Phone number, not SSN (10 digits)"
    - input: "The test SSN is 000-00-0000"
      reason: "Invalid SSN range (000 area number)"

  edge_cases:
    - input: "SSN: 078-05-1120"
      expected: "match"
      note: "Woolworth wallet card SSN: historically issued but now flagged"
```
Run these tests in CI alongside your unit tests. Track precision and recall metrics over time. A rule that drops below 90% precision (too many false positives) or 85% recall (missing real PII) needs tuning.
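Precision and recall for a rule can be computed directly from such samples. The sketch below evaluates only the raw SSN regex from this guide, with no context boosting or suppression, against the true positives and true negatives listed in the test file; the `evaluate` helper and the tuple format of `SAMPLES` are illustrative assumptions.

```python
import re

# The SSN pattern from the pii_ssn_us rule.
SSN = re.compile(
    r"\b(?!000|666|9\d{2})\d{3}[-\s]?(?!00)\d{2}[-\s]?(?!0000)\d{4}\b"
)

# (input, expected match) pairs; None marks a true negative.
SAMPLES = [
    ("My SSN is 123-45-6789", "123-45-6789"),
    ("social security number 234 56 7890", "234 56 7890"),
    ("TaxID: 345678901", "345678901"),
    ("Order #123-45-6789 shipped today", None),
    ("Phone: 123-456-7890", None),
    ("The test SSN is 000-00-0000", None),
]


def evaluate(regex, samples):
    """Return (precision, recall) for a regex over labeled samples."""
    tp = fp = fn = 0
    for text, expected in samples:
        m = regex.search(text)
        found = m.group(0) if m else None
        if expected is None:
            if found is not None:
                fp += 1  # matched something that is not PII
        elif found == expected:
            tp += 1
        else:
            fn += 1  # missed, or matched the wrong span
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall
```

On these six samples the bare regex finds every real SSN but also flags the order number, so recall is perfect while precision is not. That gap is exactly what the context and suppression layers are meant to close.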
FAQ
How often should PII classification rules be updated?
Review rules quarterly at minimum, and immediately after any of these events: a new regulation takes effect (like the EU AI Act's data governance requirements), your organization expands into a new jurisdiction, a breach post-mortem reveals a missed PII type, or a significant number of false positives are reported. Treat your rule repository like a living document — assign an owner (typically the DPO or a security engineer) and include rule review in your compliance calendar. Version every change and maintain a CHANGELOG so auditors can see the evolution.
Can YAML rules handle unstructured data like PDFs and images?
Pattern-based YAML rules work directly on extracted text, so they apply to any format once text extraction is complete. For PDFs, your pipeline would use a tool like Apache Tika or pdftotext to extract content before applying rules. For images, OCR (optical character recognition) converts visual text to strings that regex rules can match. The YAML rules themselves remain format-agnostic — the extraction layer is what adapts. For truly unstructured data like free-form medical notes, supplement regex rules with NER-based detection (shown in the composite detection example above) for higher accuracy.
How do I handle multi-language PII detection?
Create region-specific rule files that account for local formats and terminology. A German phone number (+49 30 12345678) has a different structure than a US one ((555) 123-4567). Address patterns vary dramatically across countries. The hierarchical file structure recommended in this guide — with regions/ directories inheriting from base/ rules — handles this naturally. For non-Latin scripts, ensure your regex engine supports Unicode character classes (\p{Han} for Chinese characters, \p{Cyrillic} for Russian, etc.) and test extensively with real-world samples from each locale.
What's the performance impact of running hundreds of YAML rules?
Compiled regex patterns are fast — a modern scanning engine can evaluate 200+ patterns against a megabyte of text in under 100 milliseconds. The key optimization is compiling patterns once at startup rather than per-scan. PrivaSift pre-compiles all YAML rules into an optimized finite automaton, which means adding more rules has minimal incremental cost. For very large datasets (terabytes), the bottleneck is I/O, not pattern matching. Parallelize your scans across files and use streaming reads rather than loading entire files into memory.
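The compile-once, stream-the-input advice can be sketched as follows. This is a generic illustration using Python's `re` module, not a description of how PrivaSift works internally; the `Scanner` class and its method names are hypothetical.

```python
import re

# Patterns taken from the rules earlier in this guide.
PATTERNS = {
    "pii_ssn_us": r"\b(?!000|666|9\d{2})\d{3}[-\s]?(?!00)\d{2}[-\s]?(?!0000)\d{4}\b",
    "pii_email_address": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
}


class Scanner:
    def __init__(self, patterns):
        # Compile every pattern once, at construction time.
        self._compiled = {rid: re.compile(p) for rid, p in patterns.items()}

    def scan_lines(self, lines):
        """Yield (line_number, rule_id, match) from any iterable of lines.

        Passing an open file object streams it line by line instead of
        loading the whole file into memory.
        """
        for lineno, line in enumerate(lines, start=1):
            for rule_id, regex in self._compiled.items():
                for m in regex.finditer(line):
                    yield lineno, rule_id, m.group(0)
```

A single `Scanner` instance can then be reused across files, for example `for hit in scanner.scan_lines(open(path, encoding="utf-8")): ...`, and instances can be shared across worker processes to parallelize large scans.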
How do YAML-based rules compare to ML-based PII detection?
They're complementary, not competing. Regex-based YAML rules excel at structured, well-defined PII types with known formats (SSNs, credit cards, phone numbers) — they're fast, deterministic, explainable, and easy to audit. ML-based detection is better for fuzzy, context-dependent PII like names, addresses in free text, or medical conditions embedded in clinical notes. A production system should use both: YAML rules as the primary detection layer for structured patterns, with ML models handling unstructured entity recognition. The composite detection type shown earlier in this guide lets you define exactly this kind of hybrid approach within a single rule definition.
Start Scanning for PII Today
PrivaSift automatically detects PII across your files, databases, and cloud storage — helping you stay GDPR and CCPA compliant without the manual work.
[Try PrivaSift Free →](https://privasift.com)