Understanding the Differences Between PHI, PII, and PCI Data for Security Engineers
If you've ever sat in a compliance review and heard someone use PHI, PII, and PCI interchangeably, you're not alone — and you're right to wince. These three categories of sensitive data are governed by entirely different regulations, carry distinct breach penalties, and demand separate technical controls. Conflating them isn't just sloppy; it's a fast track to regulatory fines that now routinely exceed eight figures.
The stakes have never been higher. By 2024, cumulative GDPR fines since the regulation took effect exceeded €4.5 billion, with Meta's €1.2 billion penalty in 2023 setting the record. Individual HIPAA settlements in the US have reached $4.75 million, while PCI DSS non-compliance costs merchants between $5,000 and $100,000 per month in penalties from payment brands. For security engineers and compliance officers, understanding what each data type actually includes, and where the types overlap, is not academic. It's operational.
This guide breaks down PHI, PII, and PCI data with the precision that technical teams need: concrete definitions, regulatory mappings, real detection patterns, and actionable classification strategies you can implement this quarter.
What Is PII? The Broadest Category You're Probably Under-Detecting

Personally Identifiable Information (PII) is any data that can identify a specific individual, either on its own (direct identifiers) or when combined with other data points (indirect identifiers). It is the broadest of the three categories and sits at the center of both GDPR and CCPA enforcement.
Direct PII includes:
- Full name, email address, phone number
- Social Security Number (SSN), passport number, driver's license
- Biometric data (fingerprints, facial recognition templates)
Indirect PII includes:
- IP addresses (explicitly classified as PII under GDPR)
- Date of birth, ZIP code, gender
- Job title combined with employer name
- Device IDs, cookie identifiers, advertising IDs
Key regulations: GDPR (EU), CCPA/CPRA (California), LGPD (Brazil), PIPEDA (Canada), POPIA (South Africa).
What Is PHI? PII's Healthcare-Specific Subset

Protected Health Information (PHI) is defined by the US Health Insurance Portability and Accountability Act (HIPAA) as any individually identifiable health information created, received, maintained, or transmitted by a covered entity or business associate. Think of PHI as PII plus a healthcare context.
The 18 HIPAA identifiers that make health data PHI:
1. Names
2. Geographic data smaller than a state
3. Dates (except year) related to an individual
4. Phone numbers
5. Fax numbers
6. Email addresses
7. SSNs
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers
13. Device identifiers and serial numbers
14. Web URLs
15. IP addresses
16. Biometric identifiers
17. Full-face photographs
18. Any other unique identifying number or code
The crucial distinction: A diagnosis code (ICD-10) sitting in a table by itself is not PHI. The moment it's linked to any of those 18 identifiers, it becomes PHI. This means your data pipelines might be generating PHI without your team realizing it — every JOIN operation that connects clinical data to a patient identifier creates a PHI obligation.
Real-world example: In 2023, HCA Healthcare disclosed a breach affecting 11 million patients. The compromised data included names, addresses, dates of birth, and appointment dates — standard PII fields that became PHI because they existed in a healthcare treatment context. The resulting class-action settlements and OCR investigations cost hundreds of millions.
What Is PCI Data? Cardholder Data Under PCI DSS

PCI data refers to cardholder data (CHD) and sensitive authentication data (SAD) as defined by the Payment Card Industry Data Security Standard (PCI DSS). Unlike GDPR or HIPAA, PCI DSS is not a government regulation — it's an industry standard enforced contractually by payment card brands (Visa, Mastercard, Amex, Discover).
Cardholder Data (CHD):
- Primary Account Number (PAN): the card number, usually 15 or 16 digits, though some cards use as few as 13 or as many as 19
- Cardholder name
- Expiration date
- Service code
Sensitive Authentication Data (SAD):
- Full magnetic stripe data (track data)
- CAV2/CVC2/CVV2/CID (the 3-4 digit security code)
- PIN and PIN block
PCI DSS v4.0 replaced v3.2.1 as the required version on March 31, 2024, and its remaining future-dated requirements became mandatory on March 31, 2025. It introduced stricter requirements around targeted risk analysis and stronger authentication controls. Non-compliant merchants face fines of $5,000–$100,000 per month, increased transaction fees, and potential revocation of card processing privileges.
Where PHI, PII, and PCI Overlap — and Why It Matters

These categories are not mutually exclusive. A single database row can simultaneously contain all three types:
| Field | PII | PHI | PCI |
|-------|-----|-----|-----|
| Patient name | ✓ | ✓ | — |
| Date of birth | ✓ | ✓ | — |
| SSN | ✓ | ✓ | — |
| Diagnosis code (linked to patient) | — | ✓ | — |
| Credit card number | ✓ | — | ✓ |
| Insurance claim with payment info | ✓ | ✓ | ✓ |
The compliance multiplier effect: When a healthcare billing system stores patient names alongside credit card numbers for copay processing, a single breach triggers three parallel obligations — HIPAA breach notification (60-day window to HHS), GDPR notification (72-hour window if EU residents are affected), and PCI DSS incident response with forensic investigation by a PCI Forensic Investigator (PFI).
This is why classification must happen at the field level, not the system level. Labeling a database as "HIPAA-compliant" tells you nothing about whether it also contains PCI data that requires a completely separate set of controls.
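To make the field-level point concrete, here is a minimal sketch that tags each field and derives a record's obligations from the union of its tags. The field names and the `FIELD_CATEGORIES` mapping are hypothetical and simply mirror the overlap table; a real inventory would come from your own schema.

```python
# Hypothetical field-to-category tags, mirroring the overlap table.
FIELD_CATEGORIES = {
    "patient_name": {"PII", "PHI"},
    "diagnosis_code": {"PHI"},
    "card_number": {"PII", "PCI"},
    "order_id": set(),
}


def categories_for(fields):
    """Return every regulatory category touched by the given fields."""
    result = set()
    for field in fields:
        # Unknown fields contribute nothing here; a stricter policy could flag them.
        result |= FIELD_CATEGORIES.get(field, set())
    return result


clinical_only = categories_for(["patient_name", "diagnosis_code"])
billing_row = categories_for(["patient_name", "diagnosis_code", "card_number"])
```

Adding the payment column is what pulls PCI into scope for `billing_row`, which a system-level "HIPAA" label would not reveal.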
Building a Detection Pipeline: Pattern Matching, NER, and Context Analysis
Effective data classification requires layered detection. Here's a practical architecture that security engineers can implement:
Layer 1: Regex Pattern Matching (High-Speed, High-Volume)
Start with deterministic patterns for structured data:
```python
import re

# Deterministic patterns for structured identifiers.
DETECTION_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "credit_card": r"\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|6(?:011|5[0-9]{2})[0-9]{12})\b",
    # [A-Za-z], not [A-Z|a-z]: inside a character class "|" is a literal pipe.
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "phone_us": r"\b(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b",
    # No capturing group, so re.findall returns whole matches.
    "iban": r"\b[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}[A-Z0-9]{0,16}\b",
    "ip_address": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
}


def scan_text(text: str) -> dict:
    """Return match counts and candidate categories for each pattern found."""
    findings = {}
    for label, pattern in DETECTION_PATTERNS.items():
        matches = re.findall(pattern, text)
        if matches:
            findings[label] = {
                "count": len(matches),
                "category": classify_regulation(label),
            }
    return findings


def classify_regulation(label: str) -> list:
    """Map a detector label to the regulatory categories it can fall under."""
    mapping = {
        "ssn": ["PII", "PHI"],
        "credit_card": ["PII", "PCI"],
        "email": ["PII", "PHI"],
        "phone_us": ["PII", "PHI"],
        "iban": ["PII"],
        "ip_address": ["PII", "PHI"],
    }
    return mapping.get(label, ["PII"])
```
Layer 2: Luhn Validation for PCI Data
Raw regex matches produce false positives. For credit card numbers, always apply the Luhn algorithm:
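```python
def luhn_check(card_number: str) -> bool:
    """Return True if the digits in card_number pass the Luhn checksum."""
    digits = [int(d) for d in card_number if d.isdigit()]
    if not digits:
        return False  # an empty input would otherwise pass with checksum 0
    checksum = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```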
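As a rough illustration of how Layers 1 and 2 fit together, the sketch below filters regex candidates through a Luhn check. It is self-contained, so it repeats a compact Luhn function, and the loose `CARD_CANDIDATE` pattern and the `find_card_numbers` helper are assumptions made for brevity rather than part of the pipeline above.

```python
import re

# Assumed, deliberately loose candidate pattern: any run of 13 to 16 digits.
CARD_CANDIDATE = re.compile(r"\b\d{13,16}\b")


def luhn_check(card_number: str) -> bool:
    """Return True if the digits pass the Luhn checksum."""
    digits = [int(d) for d in card_number if d.isdigit()]
    if not digits:
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0


def find_card_numbers(text: str) -> list:
    """Keep only the regex candidates that also pass Luhn."""
    return [m for m in CARD_CANDIDATE.findall(text) if luhn_check(m)]


# 4111111111111111 is a widely used Visa test number; the second value differs by one digit.
sample = "charge 4111111111111111 failed, ref 4111111111111112"
hits = find_card_numbers(sample)
```

Only the first number survives the filter, which is the false-positive reduction Layer 2 is meant to provide.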
Layer 3: Contextual Classification
A name next to a diagnosis code is PHI. The same name next to an order number is PII. Context-aware classification looks at surrounding fields, table names, and data source metadata to assign the correct regulatory category. This is where tools like PrivaSift add significant value — automated context analysis across your entire data estate eliminates the guesswork.
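One simple way to approximate this kind of context check is to look for healthcare hints in the table name and sibling field names. The sketch below assumes a hypothetical `HEALTH_HINTS` keyword list and a hypothetical `classify_name_field` helper; production classifiers typically rely on much richer signals.

```python
# Hypothetical keyword hints; tune these to your own schemas.
HEALTH_HINTS = ("diagnosis", "icd", "patient", "prescription", "claim")


def classify_name_field(table_name: str, sibling_fields: list) -> list:
    """Classify a detected name using its table and neighbouring fields."""
    context = [table_name.lower()] + [f.lower() for f in sibling_fields]
    in_health_context = any(
        hint in item for item in context for hint in HEALTH_HINTS
    )
    # A name is always PII; a healthcare context adds PHI.
    return ["PII", "PHI"] if in_health_context else ["PII"]


retail = classify_name_field("orders", ["order_number", "full_name"])
clinical = classify_name_field("encounters", ["full_name", "diagnosis_code"])
```

Keyword matching like this is cheap but brittle, so it is best treated as a first pass that routes ambiguous fields to review.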
Practical Compliance Checklist: Mapping Data Types to Controls
Once you've classified your data, map each type to its required controls:
For PII (GDPR/CCPA):
- [ ] Maintain a Record of Processing Activities (ROPA) — GDPR Article 30
- [ ] Implement data subject access request (DSAR) workflows with 30-day response SLA
- [ ] Apply pseudonymization or encryption at rest and in transit
- [ ] Conduct Data Protection Impact Assessments (DPIAs) for high-risk processing
- [ ] Ensure lawful basis documentation for each processing activity
For PHI (HIPAA):
- [ ] Execute Business Associate Agreements (BAAs) with all vendors touching PHI
- [ ] Implement access controls with unique user IDs and audit logging
- [ ] Encrypt PHI at rest (AES-256) and in transit (TLS 1.2+)
- [ ] Maintain 6-year retention of HIPAA audit logs
- [ ] Conduct annual HIPAA Security Risk Assessment
For PCI (PCI DSS):
- [ ] Never store SAD post-authorization — validate with quarterly scans
- [ ] Render PAN unreadable wherever stored (tokenization, truncation, hashing, or encryption)
- [ ] Segment cardholder data environment (CDE) from general network
- [ ] Implement multi-factor authentication for all access to the CDE
- [ ] Run quarterly ASV scans and annual penetration tests
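Of the PAN protection options in the checklist, truncation is the easiest to sketch. The hypothetical `truncate_pan` helper below keeps the first six and last four digits, a commonly cited ceiling that you should confirm against the current PCI DSS text and your card brands' rules.

```python
def truncate_pan(pan: str) -> str:
    """Keep the first six and last four digits and mask everything between."""
    digits = "".join(ch for ch in pan if ch.isdigit())
    if not 13 <= len(digits) <= 19:
        raise ValueError("not a plausible PAN length")
    masked_middle = "*" * (len(digits) - 10)
    return digits[:6] + masked_middle + digits[-4:]


masked = truncate_pan("4111 1111 1111 1111")
```

Truncation is one-way, so it suits logs and receipts; systems that must charge the card again would need tokenization or encryption instead.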
Common Misclassification Pitfalls and How to Avoid Them
Pitfall 1: Treating log files as non-sensitive. Application logs frequently capture email addresses in error messages, IP addresses in access logs, and occasionally full request bodies that include credit card numbers or SSNs. In 2022, Morgan Stanley was fined $35 million by the SEC partly due to PII found on decommissioned hardware that hadn't been properly wiped — data that included information from log archives.
Pitfall 2: Ignoring PII in unstructured data. Contracts, support tickets, PDFs, and chat transcripts are rich sources of unclassified PII. A customer support Slack channel might contain screenshots of IDs, medical documents, and payment details shared by customers. These require the same protections as structured database fields.
Pitfall 3: Assuming anonymized data stays anonymous. Researchers have repeatedly demonstrated re-identification of "anonymized" datasets. The 2006 Netflix Prize dataset was de-anonymized by cross-referencing with public IMDb reviews. True anonymization under GDPR requires that the de-identification be irreversible, not merely that re-identification be difficult. Techniques such as k-anonymity, l-diversity, and differential privacy should be treated as baseline safeguards, not nice-to-haves.
Pitfall 4: Misclassifying data in transit. A webhook payload containing a customer's name and diagnosis code is PHI while it traverses your API gateway, not just when it lands in your database. Transit-stage data must be classified and protected at every hop.
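Pitfall 3's re-identification risk can be measured before a dataset is released. The sketch below computes k for a set of quasi-identifiers, that is, the size of the smallest group of rows sharing the same values. The `k_anonymity` helper and the sample records are illustrative assumptions, not real data.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Illustrative records only.
records = [
    {"zip": "02139", "birth_year": 1985, "gender": "F", "diagnosis": "J45"},
    {"zip": "02139", "birth_year": 1985, "gender": "F", "diagnosis": "E11"},
    {"zip": "94110", "birth_year": 1972, "gender": "M", "diagnosis": "I10"},
]
k = k_anonymity(records, ["zip", "birth_year", "gender"])
```

A result of 1 means at least one person is unique on those fields alone, so the dataset needs generalization or suppression before it can be called anonymized.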
FAQ
What is the difference between PII and PHI in simple terms?
PII is any information that identifies a person — names, email addresses, Social Security Numbers, IP addresses. PHI is a subset of PII that specifically relates to healthcare: it's identifiable information that is created or received by a healthcare provider, health plan, or healthcare clearinghouse, and relates to a person's past, present, or future health condition, treatment, or payment for healthcare. All PHI contains PII, but not all PII is PHI. The regulatory distinction matters because PHI triggers HIPAA obligations (including mandatory breach notification to HHS within 60 days), while PII falls under broader frameworks like GDPR and CCPA.
Can a single data field be classified as PII, PHI, and PCI simultaneously?
Not a single field in isolation, but a single record absolutely can. Consider a hospital billing system where one row contains a patient's name (PII + PHI), their diagnosis (PHI), and their credit card number used for a copay (PII + PCI). The record as a whole falls under HIPAA, GDPR/CCPA, and PCI DSS simultaneously. This is why field-level classification is essential — system-level labels like "this is a HIPAA database" fail to capture PCI obligations on payment columns within the same table.
How often should we scan our systems for unclassified sensitive data?
At minimum, quarterly — aligned with PCI DSS ASV scan requirements. However, best practice for organizations handling all three data types is continuous scanning triggered by data pipeline changes, new deployments, and schema migrations. Every new microservice, every new logging configuration, and every new third-party integration is an opportunity for unclassified sensitive data to appear in unexpected locations. Automated tools that integrate into your CI/CD pipeline catch these issues before they reach production.
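A pipeline-triggered scan can start as a small script that a CI job runs on each change. The sketch below is an assumption about how such a check might look: `scan_tree`, the SSN-only pattern, and the file suffixes are all hypothetical choices.

```python
import re
from pathlib import Path

# Single illustrative detector; a real scan would use the full pattern set.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scan_tree(root, suffixes=(".log", ".txt", ".sql")):
    """Return (path, line number) pairs for lines that look like they contain an SSN."""
    findings = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(errors="ignore")
            for lineno, line in enumerate(text.splitlines(), 1):
                if SSN.search(line):
                    findings.append((str(path), lineno))
    return findings
```

A CI step could call `scan_tree(".")` and fail the build when the returned list is non-empty.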
Is an IP address considered PII under GDPR?
Yes. The Court of Justice of the European Union (CJEU) ruled in Breyer v. Germany (Case C-582/14) that dynamic IP addresses constitute personal data when the data controller has legal means to identify the individual — which ISPs and law enforcement cooperation make possible. This means web server access logs, CDN logs, API gateway logs, and firewall logs all contain GDPR-regulated PII. Many engineering teams overlook this, leaving vast quantities of unprotected personal data in their infrastructure logging pipelines.
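One common mitigation is to truncate IP addresses before log lines are stored. The sketch below zeroes the last octet of IPv4 addresses using Python's standard `ipaddress` module; `scrub_line` and the /24 choice are assumptions, and whether truncation is sufficient for your lawful basis is a question for counsel.

```python
import ipaddress
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def _mask_ip(match):
    """Replace a matched IPv4 address with its /24 network address."""
    try:
        network = ipaddress.ip_network(match.group(0) + "/24", strict=False)
    except ValueError:
        return match.group(0)  # not a valid address, e.g. 999.1.1.1
    return str(network.network_address)


def scrub_line(line: str) -> str:
    """Truncate every valid IPv4 address found in a log line."""
    return IPV4.sub(_mask_ip, line)


scrubbed = scrub_line('203.0.113.57 - - "GET /health"')
```

Truncated addresses still support coarse analytics such as geographic or network-level aggregation while no longer pointing at a single host.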
What happens if we suffer a breach involving multiple data types?
You face parallel regulatory obligations with different timelines, notification requirements, and enforcement bodies. For PHI: notify HHS within 60 days and affected individuals without unreasonable delay (HIPAA Breach Notification Rule). For PII under GDPR: notify the supervisory authority within 72 hours and affected individuals if there's high risk (Articles 33–34). For PCI data: notify your acquiring bank immediately, engage a PCI Forensic Investigator, and potentially notify affected cardholders. You may also face state-level breach notification laws in the US (all 50 states have them), each with their own timelines. The operational complexity of a multi-type breach is precisely why proactive classification and segmentation — knowing exactly what data you have and where — is far cheaper than reactive incident response.
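The fixed windows mentioned above are easy to compute at the moment of discovery. The sketch below covers only the 72-hour GDPR and 60-day HIPAA windows; `notification_deadlines` is a hypothetical helper, and PCI and state-law timelines are omitted because they depend on contracts and jurisdiction.

```python
from datetime import datetime, timedelta, timezone


def notification_deadlines(discovered_at: datetime) -> dict:
    """Return the latest notification times implied by the fixed regulatory windows."""
    return {
        "GDPR supervisory authority": discovered_at + timedelta(hours=72),
        "HIPAA HHS": discovered_at + timedelta(days=60),
    }


discovered = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
deadlines = notification_deadlines(discovered)
```

These are outer limits, not targets: both regimes expect notification without undue or unreasonable delay.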
Start Scanning for PII Today
PrivaSift automatically detects PII across your files, databases, and cloud storage — helping you stay GDPR and CCPA compliant without the manual work.
[Try PrivaSift Free →](https://privasift.com)