Developer’s Guide: Creating PII Classification Rules with YAML

PrivaSift Team | Apr 01, 2026 | Tags: pii, pii-detection, gdpr, ccpa, compliance

Every data breach involving personally identifiable information carries a price tag, and it's getting steeper. IBM's 2024 Cost of a Data Breach Report put the global average cost of a breach at $4.88 million. For organizations handling European or Californian consumer data, the regulatory consequences compound that figure: GDPR fines can reach €20 million or 4% of global annual turnover, whichever is higher, while CCPA violations carry penalties of up to $7,500 per intentional violation. When a single database scan reveals thousands of unprotected records, the math becomes existential.

The challenge most engineering teams face isn't awareness — it's implementation. Security engineers know PII needs to be classified and protected, but the default approach of hard-coding detection patterns into application logic creates brittle systems that break whenever regulations evolve or new data types emerge. A Social Security number pattern written into a Python script six months ago doesn't help when your company expands to Germany and suddenly needs to detect Personalausweis numbers, Steueridentifikationsnummern, and IBAN formats that weren't on anyone's radar.

This is where declarative PII classification rules shine. By defining detection patterns in YAML — a human-readable configuration format already familiar to most DevOps and platform teams — you decouple your classification logic from your application code. Rules become versionable, auditable, reviewable by compliance officers who don't write code, and deployable without recompiling anything. This guide walks you through building a production-grade YAML-based PII classification system from scratch.

Why YAML for PII Classification Rules

![Why YAML for PII Classification Rules](https://max.dnt-ai.ru/img/privasift/pii-classification-yaml_sec1.png)

YAML (YAML Ain't Markup Language) has become the lingua franca of infrastructure-as-code, from Kubernetes manifests to CI/CD pipelines. Adopting it for PII classification rules brings several concrete advantages over alternatives like JSON, XML, or embedded regex patterns.

First, readability for non-developers. Your DPO or compliance officer needs to review and approve classification rules. YAML's indentation-based structure and support for comments make rules self-documenting:

```yaml
# German Tax ID: personal data under GDPR Art. 4(1), not an Art. 9 special category
name: de_tax_id
display_name: "German Tax Identification Number"
category: government_id
gdpr_special: false
pattern: '\b\d{2}\s?\d{3}\s?\d{3}\s?\d{3}\b'
confidence: high
regulations:
  - GDPR
  - BDSG  # Bundesdatenschutzgesetz
```

Second, version control integration. YAML files diff cleanly in Git, making it trivial to track who changed which rule, when, and why. That history gives auditors a change trail to sit alongside the records of processing activities that Article 30 of GDPR requires.

Third, separation of concerns. Your scanning engine stays generic. It reads YAML, compiles patterns, and applies them. When France's CNIL issues new guidance on what constitutes PII in AI training data, you update a YAML file — not your core detection engine.
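The read, compile, apply loop described above can be sketched in a few lines of Python. This is a minimal illustration, not PrivaSift's engine: `RULES` stands in for what a YAML parser would return for a rule file, and `compile_rules` and `scan_text` are hypothetical helper names.

```python
import re

# Stand-in for parsed YAML (what a YAML loader would hand back).
# The single rule reuses the de_tax_id pattern shown earlier.
RULES = [
    {
        "name": "de_tax_id",
        "category": "government_id",
        "pattern": r"\b\d{2}\s?\d{3}\s?\d{3}\s?\d{3}\b",
    },
]


def compile_rules(rules):
    """Compile each rule's pattern once so every scan can reuse it."""
    return [(rule, re.compile(rule["pattern"])) for rule in rules]


def scan_text(compiled_rules, text):
    """Return one finding per regex match, tagged with the rule that fired."""
    findings = []
    for rule, regex in compiled_rules:
        for match in regex.finditer(text):
            findings.append(
                {
                    "rule": rule["name"],
                    "category": rule["category"],
                    "match": match.group(0),
                }
            )
    return findings


compiled = compile_rules(RULES)
findings = scan_text(compiled, "Steuer-ID: 12 345 678 901")
```

Because the engine only ever sees dictionaries of fields, adding a new identifier type means adding an entry to the rule data, with no change to `scan_text`.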

Anatomy of a PII Classification Rule

![Anatomy of a PII Classification Rule](https://max.dnt-ai.ru/img/privasift/pii-classification-yaml_sec2.png)

A well-designed classification rule needs more than a regex pattern. Here's the complete schema for a production-ready rule definition:

```yaml
rules:
  - id: "pii_email_address"
    name: email_address
    display_name: "Email Address"
    description: "Detects standard email address formats"
    category: contact_info
    severity: medium

    # Detection configuration
    detection:
      type: regex
      pattern: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
      case_sensitive: false

    # Contextual boosters: increase confidence when nearby keywords exist
    context:
      keywords: ["email", "e-mail", "contact", "address", "mailto"]
      proximity_chars: 50
      boost: 0.15

    # Validation layer: reduce false positives
    validation:
      - type: checksum
        algorithm: none
      - type: deny_list
        values: ["example@example.com", "noreply@localhost"]

    # Compliance mapping
    regulations:
      gdpr:
        lawful_basis_required: true
        article: "Art. 4(1)"
        data_category: "personal_data"
      ccpa:
        covered: true
        category: "identifiers"

    # Operational metadata
    enabled: true
    version: "1.2.0"
    last_updated: "2026-03-15"
    author: "security-team"
```

Each field serves a specific purpose in the detection pipeline. The severity field drives alerting thresholds. The context block raises a match's confidence when email-related keywords appear nearby, so matches without supporting context can be filtered out as likely false positives. The regulations mapping lets your compliance dashboard automatically flag which legal frameworks apply to each detection.
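To make the context block concrete, here is a rough Python sketch of how a scanner might apply `keywords`, `proximity_chars`, and `boost`. The function name `boosted_confidence` and the base confidence passed in are assumptions for illustration; the guide does not specify how a real engine scores matches.

```python
import re

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
KEYWORDS = ["email", "e-mail", "contact", "address", "mailto"]


def boosted_confidence(text, match, base, keywords=KEYWORDS,
                       proximity_chars=50, boost=0.15):
    """Add `boost` to `base` when a keyword sits within `proximity_chars`
    characters before or after the match (the match itself is excluded)."""
    before = text[max(0, match.start() - proximity_chars):match.start()]
    after = text[match.end():match.end() + proximity_chars]
    window = (before + " " + after).lower()
    if any(keyword in window for keyword in keywords):
        return min(1.0, base + boost)  # never exceed full confidence
    return base


with_context = "Contact email: jane.doe@corp.example for details"
without_context = "Sent from jane.doe@corp.example yesterday"

boosted = boosted_confidence(with_context, EMAIL.search(with_context), 0.70)
unboosted = boosted_confidence(without_context, EMAIL.search(without_context), 0.70)
```

Excluding the matched text from the window is a deliberate choice here: otherwise an address such as `contact@corp.example` would boost itself.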

Building Rules for Common PII Types

![Building Rules for Common PII Types](https://max.dnt-ai.ru/img/privasift/pii-classification-yaml_sec3.png)

Let's build out a practical rule set covering the PII categories that trigger the most regulatory scrutiny. These patterns have been tested against real-world datasets and tuned to balance recall (catching real PII) against precision (avoiding false positives).

Government Identifiers

Government IDs carry the highest severity because their exposure directly enables identity theft. The 2024 National Public Data breach exposed 2.9 billion records including Social Security numbers, resulting in a class-action lawsuit and regulatory investigations across multiple jurisdictions.

```yaml
rules:
  - id: "pii_ssn_us"
    name: us_ssn
    display_name: "US Social Security Number"
    category: government_id
    severity: critical
    detection:
      type: regex
      # Matches XXX-XX-XXXX with or without dashes
      pattern: '\b(?!000|666|9\d{2})\d{3}[-\s]?(?!00)\d{2}[-\s]?(?!0000)\d{4}\b'
      case_sensitive: false
    context:
      keywords: ["ssn", "social security", "social sec", "tax id"]
      proximity_chars: 100
      boost: 0.25
    validation:
      - type: luhn
        enabled: false  # SSNs don't use Luhn
      - type: range_check
        min_area_number: 1
        max_area_number: 899
    regulations:
      ccpa:
        covered: true
        category: "government_identifiers"
        requires_opt_out: true

  - id: "pii_nino_uk"
    name: uk_national_insurance
    display_name: "UK National Insurance Number"
    category: government_id
    severity: critical
    detection:
      type: regex
      pattern: '\b[A-CEGHJ-PR-TW-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b'
      case_sensitive: false
    context:
      keywords: ["national insurance", "NI number", "NINO"]
      proximity_chars: 80
      boost: 0.20
```

Financial Data

PCI DSS adds another compliance layer on top of GDPR and CCPA for financial PII. Credit card numbers are particularly well-suited to validation because they follow the Luhn algorithm:

```yaml
  - id: "pii_credit_card"
    name: credit_card_number
    display_name: "Credit Card Number"
    category: financial
    severity: critical
    detection:
      type: regex
      # 16-digit Visa, Mastercard, and Discover numbers, or 15-digit Amex numbers
      pattern: '\b(?:(?:4\d{3}|5[1-5]\d{2}|6(?:011|5\d{2}))\d{12}|3[47]\d{13})\b'
      case_sensitive: false
    validation:
      - type: luhn
        enabled: true  # Eliminates ~90% of false positives
      - type: deny_list
        values: ["4111111111111111", "5500000000000004"]  # Common test numbers
    regulations:
      gdpr:
        lawful_basis_required: true
        data_category: "financial_data"
      ccpa:
        covered: true
        category: "financial_information"
      pci_dss:
        covered: true
        requirement: "3.4"
```
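For readers who have not implemented it before, the Luhn check behind `type: luhn` fits in a few lines. The sketch below is a standard Luhn implementation in Python; the function name `luhn_valid` is illustrative and not part of any PrivaSift API.

```python
def luhn_valid(number: str) -> bool:
    """Return True when the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


candidates = ["4111111111111111", "4111111111111112", "378282246310005"]
passing = [number for number in candidates if luhn_valid(number)]
```

A random digit string passes this check about one time in ten, which is why it filters out most regex-only false positives. Well-known test numbers pass as well, so the deny list still earns its place.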

Health Information

Under GDPR Article 9, health data is a "special category" whose processing is prohibited unless one of the Article 9(2) conditions, such as explicit consent, applies. HIPAA adds US-specific requirements. Detecting health PII often requires Named Entity Recognition (NER) in addition to pattern matching:

```yaml
  - id: "pii_health_id_us"
    name: us_health_plan_id
    display_name: "US Health Plan Beneficiary Number"
    category: health
    severity: critical
    detection:
      type: composite
      methods:
        - type: regex
          pattern: '\b[A-Z]{3}\d{9}\b'
        - type: ner
          model: "health_entity_v2"
          entity_types: ["HEALTH_PLAN_ID", "MRN"]
      combine: "any"
    regulations:
      hipaa:
        covered: true
        identifier_type: "health_plan_beneficiary"
      gdpr:
        article: "Art. 9"
        data_category: "health_data"
        special_category: true
```
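One way to read `combine: "any"` is as a union over the results of each detection method. The Python sketch below illustrates that reading under stated assumptions: `stub_ner_detector` is a placeholder for a real NER model such as the `health_entity_v2` named in the rule, and treating `"all"` as an intersection is a guess at the semantics, not documented behavior.

```python
import re


def regex_detector(pattern):
    """Build a detector that returns the set of strings matching `pattern`."""
    compiled = re.compile(pattern)
    return lambda text: {m.group(0) for m in compiled.finditer(text)}


def stub_ner_detector(text):
    """Placeholder for an NER model: picks up the token after 'MRN '."""
    return set(re.findall(r"MRN (\w+)", text))


def composite_detect(text, detectors, combine="any"):
    """Merge detector outputs: 'any' keeps every hit, 'all' keeps shared hits."""
    results = [detector(text) for detector in detectors]
    if combine == "any":
        return set().union(*results)
    if combine == "all":
        return set.intersection(*results) if results else set()
    raise ValueError(f"unknown combine mode: {combine}")


detectors = [regex_detector(r"\b[A-Z]{3}\d{9}\b"), stub_ner_detector]
note = "Member ABC123456789, MRN 445566"
hits_any = composite_detect(note, detectors, combine="any")
hits_all = composite_detect(note, detectors, combine="all")
```

With `"any"`, either method is enough to flag a value, which favors recall; a stricter mode trades recall for precision.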

Advanced Techniques: Context-Aware Detection

![Advanced Techniques: Context-Aware Detection](https://max.dnt-ai.ru/img/privasift/pii-classification-yaml_sec4.png)

Raw pattern matching generates too many false positives for production use. The string "123-45-6789" might be an SSN — or a product SKU, a phone extension, or a reference number. Context-aware detection dramatically improves accuracy.

Column-Name Heuristics

When scanning structured data sources (databases, CSVs, spreadsheets), column names provide strong signals:

```yaml
column_hints:
  - id: "hint_ssn_column"
    applies_to: ["pii_ssn_us"]
    patterns:
      - '(?i)^(ssn|social_?sec|tax_?id|tin)$'
      - '(?i)social.*security'
      - '(?i)national.*id'
    confidence_override: 0.95  # Near-certain when column name matches

  - id: "hint_email_column"
    applies_to: ["pii_email_address"]
    patterns:
      - '(?i)^(email|e_?mail|user_?email|contact_?email)$'
    confidence_override: 0.99
```
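A scanner could apply these hints roughly as follows. This Python sketch copies the hint data from the YAML above into a list of dictionaries; `column_confidence` is a hypothetical helper, and returning the first matching override is an assumed policy.

```python
import re

COLUMN_HINTS = [
    {
        "id": "hint_ssn_column",
        "applies_to": ["pii_ssn_us"],
        "patterns": [
            r"(?i)^(ssn|social_?sec|tax_?id|tin)$",
            r"(?i)social.*security",
            r"(?i)national.*id",
        ],
        "confidence_override": 0.95,
    },
    {
        "id": "hint_email_column",
        "applies_to": ["pii_email_address"],
        "patterns": [r"(?i)^(email|e_?mail|user_?email|contact_?email)$"],
        "confidence_override": 0.99,
    },
]


def column_confidence(column_name, rule_id, default):
    """Return the override of the first hint whose pattern matches the
    column name for this rule, or `default` when no hint applies."""
    for hint in COLUMN_HINTS:
        if rule_id not in hint["applies_to"]:
            continue
        if any(re.search(pattern, column_name) for pattern in hint["patterns"]):
            return hint["confidence_override"]
    return default


ssn_score = column_confidence("customer_social_security_no", "pii_ssn_us", 0.5)
order_score = column_confidence("order_id", "pii_ssn_us", 0.5)
```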

Negative Context (False Positive Suppression)

Equally important is recognizing when a pattern match is almost certainly not PII:

```yaml
suppression_rules:
  - id: "suppress_ssn_in_code"
    applies_to: ["pii_ssn_us"]
    conditions:
      - type: file_extension
        values: [".py", ".js", ".ts", ".java", ".go"]
      - type: nearby_keywords
        keywords: ["test", "mock", "fake", "example", "placeholder"]
        proximity_chars: 30
    action: reduce_confidence
    reduction: 0.60

  - id: "suppress_cc_test_numbers"
    applies_to: ["pii_credit_card"]
    conditions:
      - type: value_match
        pattern: '^4111|^5500|^3782'  # Known test prefixes
      - type: nearby_keywords
        keywords: ["test", "sandbox", "stripe_test"]
    action: suppress
```
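The sketch below shows one plausible way to evaluate suppression rules like these in Python. Several details are assumptions, since the YAML does not pin them down: all conditions in a rule must hold, `reduction` is subtracted from the confidence, and each finding already carries a `context` string trimmed to the proximity window. The finding fields and function names are hypothetical.

```python
import re

SUPPRESSION_RULES = [
    {
        "applies_to": ["pii_ssn_us"],
        "conditions": [
            {"type": "file_extension", "values": [".py", ".js", ".ts", ".java", ".go"]},
            {"type": "nearby_keywords",
             "keywords": ["test", "mock", "fake", "example", "placeholder"]},
        ],
        "action": "reduce_confidence",
        "reduction": 0.60,
    },
    {
        "applies_to": ["pii_credit_card"],
        "conditions": [
            {"type": "value_match", "pattern": r"^4111|^5500|^3782"},
            {"type": "nearby_keywords", "keywords": ["test", "sandbox", "stripe_test"]},
        ],
        "action": "suppress",
    },
]


def condition_met(condition, finding):
    """Evaluate a single condition against a finding."""
    kind = condition["type"]
    if kind == "file_extension":
        return finding["path"].endswith(tuple(condition["values"]))
    if kind == "nearby_keywords":
        context = finding["context"].lower()
        return any(keyword in context for keyword in condition["keywords"])
    if kind == "value_match":
        return re.search(condition["pattern"], finding["value"]) is not None
    return False


def apply_suppressions(finding, rules=SUPPRESSION_RULES):
    """Return the adjusted confidence, or None when the finding is suppressed."""
    confidence = finding["confidence"]
    for rule in rules:
        if finding["rule_id"] not in rule["applies_to"]:
            continue
        if not all(condition_met(c, finding) for c in rule["conditions"]):
            continue
        if rule["action"] == "suppress":
            return None
        if rule["action"] == "reduce_confidence":
            confidence = max(0.0, confidence - rule["reduction"])
    return confidence


fixture_ssn = {"rule_id": "pii_ssn_us", "path": "tests/fixtures.py",
               "value": "123-45-6789", "context": "fake ssn for unit test",
               "confidence": 0.9}
test_card = {"rule_id": "pii_credit_card", "path": "config/payments.yaml",
             "value": "4111111111111111", "context": "stripe_test key",
             "confidence": 0.9}
```

Here `apply_suppressions(fixture_ssn)` lowers the score without discarding the finding, while `apply_suppressions(test_card)` drops it outright.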

This layered approach — pattern detection, context boosting, validation, and suppression — is how tools like PrivaSift achieve detection accuracy above 95% while keeping false positive rates below 2%.

Organizing Rules at Scale

A single YAML file works for a proof of concept, but production deployments serving multiple teams, regions, and compliance frameworks need structure:

```
pii-rules/
├── base/
│   ├── contact_info.yaml       # Email, phone, address
│   ├── government_ids.yaml     # SSN, passport, national IDs
│   ├── financial.yaml          # Credit cards, bank accounts
│   └── health.yaml             # Medical record numbers, conditions
├── regions/
│   ├── eu/
│   │   ├── germany.yaml        # Personalausweis, Steuer-ID
│   │   ├── france.yaml         # INSEE, carte vitale
│   │   └── _defaults.yaml      # GDPR-wide settings
│   ├── us/
│   │   ├── california.yaml     # CCPA-specific extensions
│   │   └── hipaa.yaml          # Health-specific rules
│   └── uk/
│       └── post_brexit.yaml    # UK GDPR divergences
├── overrides/
│   ├── team_analytics.yaml     # Team-specific tuning
│   └── legacy_systems.yaml     # Looser matching for old formats
├── suppressions/
│   └── known_false_positives.yaml
└── config.yaml                 # Global settings, inheritance
```

The config.yaml file controls rule inheritance and precedence:

```yaml
config:
  rule_precedence:
    - suppressions/*   # Highest priority, always applied
    - overrides/*      # Team-specific tuning
    - regions/eu/*     # Regional specialization
    - base/*           # Foundation rules (lowest priority)
  defaults:
    confidence_threshold: 0.75
    max_false_positive_rate: 0.02
    scan_timeout_seconds: 300
  ci_integration:
    fail_on_severity: critical
    report_format: sarif
    notify_channel: "#security-alerts"
```
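How precedence resolves conflicts is easiest to see with a small merge. The Python sketch below assumes each layer has already been loaded into a dictionary keyed by rule id, and that a higher-priority layer overrides individual fields rather than replacing whole rules. Both are assumptions about the merge policy, and `merge_layers` is a hypothetical name.

```python
def merge_layers(layers_high_to_low):
    """Merge rule layers listed from highest to lowest priority.

    Lower-priority layers are applied first, so fields set by a
    higher-priority layer overwrite them.
    """
    merged = {}
    for layer in reversed(layers_high_to_low):
        for rule_id, fields in layer.items():
            merged.setdefault(rule_id, {}).update(fields)
    return merged


base = {"pii_ssn_us": {"severity": "critical", "boost": 0.25, "enabled": True}}
region_eu = {"pii_de_tax_id": {"severity": "high", "enabled": True}}
overrides = {"pii_ssn_us": {"boost": 0.10}}

# Same order as rule_precedence: overrides, then regions, then base.
effective = merge_layers([overrides, region_eu, base])
```

In this example the team override changes only the SSN rule's `boost`; its `severity` and `enabled` values still come from the base layer.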

Integrating Rules into Your CI/CD Pipeline

Classification rules without enforcement are documentation. Wire your YAML rules into your deployment pipeline so PII exposures are caught before they reach production:

```yaml
# .github/workflows/pii-scan.yaml
name: PII Classification Scan
on:
  pull_request:
    paths:
      - 'src/**'
      - 'migrations/**'
      - 'data/**'

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run PII scan with custom rules
        run: |
          privasift scan \
            --rules-dir ./pii-rules \
            --severity-threshold medium \
            --format sarif \
            --output pii-report.sarif \
            ./src ./migrations
      - name: Upload SARIF report
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: pii-report.sarif
      - name: Fail on critical findings
        run: |
          CRITICAL=$(jq '[.runs[].results[] | select(.level == "error")] | length' pii-report.sarif)
          if [ "$CRITICAL" -gt 0 ]; then
            echo "::error::Found $CRITICAL critical PII exposures"
            exit 1
          fi
```

This ensures that every pull request touching source code, database migrations, or data files is automatically scanned against your latest classification rules. The SARIF format integrates natively with GitHub's Security tab, giving your team a centralized view of PII findings alongside other code security alerts.

Testing and Validating Your Rules

PII classification rules are code and deserve the same testing rigor. Build a test suite with known positive and negative samples:

```yaml
# tests/test_ssn_rule.yaml
test_suite:
  rule_id: "pii_ssn_us"
  true_positives:
    - input: "My SSN is 123-45-6789"
      expected_match: "123-45-6789"
      expected_confidence: ">0.90"
    - input: "social security number 234 56 7890"
      expected_match: "234 56 7890"
      expected_confidence: ">0.85"
    - input: "TaxID: 345678901"
      expected_match: "345678901"
      expected_confidence: ">0.75"
  true_negatives:
    - input: "Order #123-45-6789 shipped today"
      context: "ecommerce_log"
      reason: "Order number format overlap"
    - input: "Phone: 123-456-7890"
      reason: "Phone number, not SSN (10 digits)"
    - input: "The test SSN is 000-00-0000"
      reason: "Invalid SSN range (000 area number)"
  edge_cases:
    - input: "SSN: 078-05-1120"
      expected: "match"
      note: "Woolworth wallet card SSN, historically issued but now flagged"
```

Run these tests in CI alongside your unit tests. Track precision and recall metrics over time. A rule that drops below 90% precision (too many false positives) or 85% recall (missing real PII) needs tuning.
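As a rough illustration of tracking these metrics, the Python sketch below scores the SSN regex on its own against the positive and negative samples from the test suite. It deliberately leaves out context boosting and suppression, and `precision_recall` is a hypothetical helper rather than a PrivaSift command.

```python
import re

SSN_PATTERN = re.compile(
    r"\b(?!000|666|9\d{2})\d{3}[-\s]?(?!00)\d{2}[-\s]?(?!0000)\d{4}\b"
)

# (input text, whether it really contains an SSN)
SAMPLES = [
    ("My SSN is 123-45-6789", True),
    ("social security number 234 56 7890", True),
    ("TaxID: 345678901", True),
    ("Order #123-45-6789 shipped today", False),
    ("Phone: 123-456-7890", False),
    ("The test SSN is 000-00-0000", False),
]


def precision_recall(pattern, samples):
    """Count true/false positives and false negatives, then derive metrics."""
    tp = fp = fn = 0
    for text, is_pii in samples:
        matched = pattern.search(text) is not None
        if matched and is_pii:
            tp += 1
        elif matched and not is_pii:
            fp += 1
        elif not matched and is_pii:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


precision, recall = precision_recall(SSN_PATTERN, SAMPLES)
```

On these six samples the bare pattern catches every real SSN but also flags the order number, so precision lands at 0.75. Cases like that are what the context and suppression layers are meant to resolve.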

FAQ

How often should PII classification rules be updated?

Review rules quarterly at minimum, and immediately after any of these events: a new regulation takes effect (like the EU AI Act's data governance requirements), your organization expands into a new jurisdiction, a breach post-mortem reveals a missed PII type, or a significant number of false positives are reported. Treat your rule repository like a living document — assign an owner (typically the DPO or a security engineer) and include rule review in your compliance calendar. Version every change and maintain a CHANGELOG so auditors can see the evolution.

Can YAML rules handle unstructured data like PDFs and images?

Pattern-based YAML rules work directly on extracted text, so they apply to any format once text extraction is complete. For PDFs, your pipeline would use a tool like Apache Tika or pdftotext to extract content before applying rules. For images, OCR (optical character recognition) converts visual text to strings that regex rules can match. The YAML rules themselves remain format-agnostic — the extraction layer is what adapts. For truly unstructured data like free-form medical notes, supplement regex rules with NER-based detection (shown in the composite detection example above) for higher accuracy.

How do I handle multi-language PII detection?

Create region-specific rule files that account for local formats and terminology. A German phone number (+49 30 12345678) has a different structure than a US one ((555) 123-4567). Address patterns vary dramatically across countries. The hierarchical file structure recommended in this guide — with regions/ directories inheriting from base/ rules — handles this naturally. For non-Latin scripts, ensure your regex engine supports Unicode character classes (\p{Han} for Chinese characters, \p{Cyrillic} for Russian, etc.) and test extensively with real-world samples from each locale.

What's the performance impact of running hundreds of YAML rules?

Compiled regex patterns are fast — a modern scanning engine can evaluate 200+ patterns against a megabyte of text in under 100 milliseconds. The key optimization is compiling patterns once at startup rather than per-scan. PrivaSift pre-compiles all YAML rules into an optimized finite automaton, which means adding more rules has minimal incremental cost. For very large datasets (terabytes), the bottleneck is I/O, not pattern matching. Parallelize your scans across files and use streaming reads rather than loading entire files into memory.
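The compile-once, stream-the-input advice looks roughly like this in Python. The sketch is generic and does not describe PrivaSift's internals; `scan_stream` and the `PATTERNS` table are illustrative names.

```python
import io
import re

# Compiled once at import time and reused for every scan.
PATTERNS = {
    "us_ssn": re.compile(
        r"\b(?!000|666|9\d{2})\d{3}[-\s]?(?!00)\d{2}[-\s]?(?!0000)\d{4}\b"
    ),
}


def scan_stream(lines, compiled_patterns):
    """Yield (line_number, rule_name, match) one line at a time, so the
    whole file never has to sit in memory."""
    for line_number, line in enumerate(lines, start=1):
        for name, regex in compiled_patterns.items():
            for match in regex.finditer(line):
                yield line_number, name, match.group(0)


# Any iterable of lines works: an open file handle, or StringIO for a demo.
sample = io.StringIO("id,note\n1,SSN 123-45-6789\n2,nothing here\n")
hits = list(scan_stream(sample, PATTERNS))
```

Line-by-line scanning will miss values split across lines, so formats that wrap text may need a larger read window.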

How do YAML-based rules compare to ML-based PII detection?

They're complementary, not competing. Regex-based YAML rules excel at structured, well-defined PII types with known formats (SSNs, credit cards, phone numbers) — they're fast, deterministic, explainable, and easy to audit. ML-based detection is better for fuzzy, context-dependent PII like names, addresses in free text, or medical conditions embedded in clinical notes. A production system should use both: YAML rules as the primary detection layer for structured patterns, with ML models handling unstructured entity recognition. The composite detection type shown earlier in this guide lets you define exactly this kind of hybrid approach within a single rule definition.

Start Scanning for PII Today

PrivaSift automatically detects PII across your files, databases, and cloud storage — helping you stay GDPR and CCPA compliant without the manual work.

[Try PrivaSift Free →](https://privasift.com)
