Top 10 Types of PII Your Database Is Storing Without You Knowing

PrivaSift Team | Apr 01, 2026 | Tags: pii, data-privacy, compliance, pii-detection, data-breach

Every database accumulates personal data over time — and most of it was never intentionally collected. Developers add fields for debugging, customer support logs grow unchecked, and third-party integrations dump raw payloads into staging tables that never get cleaned up. The result: your production database is almost certainly storing personally identifiable information (PII) you didn't plan for, don't need, and are legally required to protect.

This isn't a theoretical problem. In 2023, the average cost of a data breach reached $4.45 million according to IBM's Cost of a Data Breach Report. Regulators under GDPR have issued fines exceeding €4 billion since 2018, and CCPA enforcement actions are accelerating. Meta alone was fined €1.2 billion in May 2023 for improper handling of personal data transfers. The common thread in these cases isn't sophisticated hacking — it's organizations that didn't know what personal data they were holding in the first place.

If you're a CTO, DPO, or security engineer responsible for compliance, the first step is understanding what's actually in your data stores. This article walks through the ten most common types of PII that silently accumulate in databases — and how to find them before an auditor or attacker does.

1. Email Addresses Embedded in Free-Text Fields

![1. Email Addresses Embedded in Free-Text Fields](https://max.dnt-ai.ru/img/privasift/top-10-types-pii-database-storing-without-knowing_sec1.png)

Email addresses are the most pervasive form of PII, and they show up far beyond your users.email column. They leak into:

Log tables — error messages containing "Failed to send notification to john.doe@company.com"
Support ticket bodies — customers paste their email into message fields
JSON blobs — API request/response payloads stored for debugging
Comments and notes — internal CRM notes like "Follow up with sarah@client.org"

A simple regex scan won't catch everything. Email addresses appear inside URLs, embedded in XML/JSON strings, and concatenated into error codes. You need contextual detection that understands the surrounding data structure.

```sql
-- Quick check: find email-like patterns in your logs table
SELECT id, message
FROM application_logs
WHERE message ~ '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
LIMIT 100;
```

This query will likely return more results than you expect. The real question is: do you have a retention policy for these tables?
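The SQL check above only sees flat text. For stored JSON payloads, a rough Python sketch can walk each document and report where email-like strings sit. The function name `find_emails_in_json` and the sample payload are illustrative assumptions, not part of any particular tool.

```python
import json
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def find_emails_in_json(raw: str) -> list[tuple[str, str]]:
    """Return (path, email) pairs for every email-like string in a JSON document."""
    hits = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}.{key}")
        elif isinstance(node, list):
            for index, value in enumerate(node):
                walk(value, f"{path}[{index}]")
        elif isinstance(node, str):
            for match in EMAIL_RE.findall(node):
                hits.append((path, match))

    try:
        walk(json.loads(raw), "$")
    except json.JSONDecodeError:
        # Not valid JSON: fall back to scanning the raw text.
        hits.extend(("$raw", m) for m in EMAIL_RE.findall(raw))
    return hits


# Hypothetical payload of the kind that ends up in a debug table.
payload = '{"user": {"contact": "john.doe@company.com"}, "errors": ["Failed to notify sarah@client.org"]}'
print(find_emails_in_json(payload))
```

Reporting the path alongside the value makes it easier to trace a hit back to the integration that wrote it.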

2. IP Addresses and Geolocation Data

![2. IP Addresses and Geolocation Data](https://max.dnt-ai.ru/img/privasift/top-10-types-pii-database-storing-without-knowing_sec2.png)

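Under GDPR, IP addresses are listed among the online identifiers that can make a person identifiable (Recital 30), and EU regulators and courts generally treat them as personal data. Yet most web applications store them liberally: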

Access logs stored indefinitely in database tables
Rate-limiting tables that track IPs alongside user sessions
Analytics events with full IP addresses instead of anonymized versions
Fraud detection systems that correlate IPs with user identities

The problem compounds when IP addresses are stored alongside timestamps and user IDs — creating a detailed behavioral profile that's clearly personal data under any regulatory framework.

What to do about it

Truncate or hash IP addresses when full precision isn't needed. For IPv4, zeroing the last octet (192.168.1.0) provides reasonable anonymization while preserving network-level analytics. For IPv6, truncate to the first 48 bits. If you need full IPs for security purposes, isolate them in a dedicated table with strict access controls and a 90-day retention policy.
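As a minimal sketch of the truncation described above, the following uses Python's standard `ipaddress` module to zero everything past the first 24 bits of an IPv4 address or the first 48 bits of an IPv6 address. The function name `truncate_ip` is an assumption made for illustration.

```python
import ipaddress


def truncate_ip(raw: str) -> str:
    """Zero the host bits: keep /24 for IPv4 and /48 for IPv6."""
    address = ipaddress.ip_address(raw)
    prefix = 24 if address.version == 4 else 48
    # strict=False lets us build the network from a host address.
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)


print(truncate_ip("192.168.1.77"))                  # 192.168.1.0
print(truncate_ip("2001:db8:abcd:12:34:56:78:9a"))  # 2001:db8:abcd::
```

Applying this at write time, before the value reaches the database, avoids having to clean up full addresses later.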

3. Phone Numbers in Unexpected Formats

![3. Phone Numbers in Unexpected Formats](https://max.dnt-ai.ru/img/privasift/top-10-types-pii-database-storing-without-knowing_sec3.png)

Phone numbers hide in databases in dozens of formats: +1-555-123-4567, (555) 123-4567, 5551234567, +44 20 7946 0958. They appear in:

Shipping address fields — customers add phone numbers to address line 2
Stored HTTP request headers — some mobile carrier gateways have been observed injecting the subscriber's phone number (MSISDN) into headers that applications then log
Webhook payloads — payment processors send phone data in transaction metadata
CSV imports — bulk uploads that include phone columns mapped to generic varchar fields

Standard PII detection tools that only match one or two phone formats will miss international numbers, extensions, and numbers embedded in longer strings. Effective scanning requires country-aware pattern matching that handles E.164 format, local conventions, and partial numbers.

```python
# Phone numbers appear in many formats; a single regex won't cut it
import re

phone_patterns = [
    r'\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}',  # US/CA
    r'\+44\s?\d{2,4}\s?\d{3,4}\s?\d{3,4}',               # UK
    r'\+49\s?\d{2,4}\s?\d{4,8}',                         # DE
    r'\+\d{1,3}[-.\s]?\d{4,14}',                         # Generic intl
]


def scan_column_for_phones(values: list[str]) -> list[dict]:
    """Return one hit per value, recording the first pattern that matched."""
    hits = []
    for i, val in enumerate(values):
        for pattern in phone_patterns:
            if re.search(pattern, str(val)):
                hits.append({"row": i, "value": val, "pattern": pattern})
                break  # one hit per row is enough
    return hits
```

4. Government Identifiers: SSNs, Tax IDs, and Passport Numbers

![4. Government Identifiers: SSNs, Tax IDs, and Passport Numbers](https://max.dnt-ai.ru/img/privasift/top-10-types-pii-database-storing-without-knowing_sec4.png)

Social Security Numbers, national insurance numbers, tax identifiers, and passport numbers are high-sensitivity PII — and they end up in databases more often than most organizations realize.

Common hiding places:

File attachment metadata — PDF forms with SSN fields uploaded to document management systems
Customer onboarding tables — KYC processes that store raw identity documents
Legacy migration artifacts — old systems that stored SSNs as primary keys (yes, this was common)
Audit trail tables — change logs that capture before/after values including SSN fields

A leaked SSN can cause real, lasting harm to individuals. Under CCPA, SSNs are considered "sensitive personal information" requiring additional protections. Under GDPR, national identification numbers fall under Article 87 with member states setting specific processing conditions.

Action item: Run a targeted scan across all varchar and text columns for patterns matching \d{3}-\d{2}-\d{4} (US SSN), [12]\s?\d{2}\s?\d{2}\s?(?:\d{2}|2[AB])\s?\d{3}\s?\d{3}\s?\d{2} (French NIR: a sex digit, 12 further characters, and a 2-digit key, with 2A/2B allowed for Corsican departments), and equivalent patterns for your operating jurisdictions. Then verify whether any matches correspond to actual identifiers or are false positives.
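To illustrate the false-positive step for US SSNs, here is a hedged sketch that applies the basic structural rules after the regex match: area numbers 000, 666, and 900-999 are not issued as SSNs, and neither are group 00 or serial 0000. The function name `plausible_ssns` is hypothetical, and passing these rules still does not prove a value is a real SSN.

```python
import re

SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def plausible_ssns(text: str) -> list[str]:
    """Return SSN-shaped strings that pass basic structural rules."""
    results = []
    for match in SSN_RE.finditer(text):
        area, group, serial = match.groups()
        # Areas 000, 666 and 900-999 are never issued as SSNs.
        if area in ("000", "666") or area.startswith("9"):
            continue
        # Group 00 and serial 0000 are never issued either.
        if group == "00" or serial == "0000":
            continue
        results.append(match.group(0))
    return results


sample = "order 123-45-6789, ref 000-12-3456, part 912-34-5678"
print(plausible_ssns(sample))  # ['123-45-6789']
```

Matches that survive this filter still need human review, since part numbers and order references can share the same shape.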

5. Financial Data: Credit Card Numbers and Bank Accounts

PCI DSS compliance gets a lot of attention for payment processing systems, but credit card numbers (PANs) leak into non-PCI systems through:

Customer support tickets — "My card ending in 4242 was charged twice, the full number is 4242424242424242"
Application logs — payment gateway request/response logs that include full card data
Email archives — stored in database-backed email systems
Test data in production — developers who used real card numbers in test fixtures that made it to prod

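The Luhn checksum is a useful first filter for candidate card numbers. A random digit string passes it only about one time in ten, so most order IDs and timestamps drop out before any further checks: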

```python
def luhn_check(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13 or len(digits) > 19:
        return False
    checksum = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```

If your database stores any field that passes a Luhn check and matches a known BIN (Bank Identification Number) prefix, you are likely holding cardholder data outside your PCI DSS scope. That is a compliance problem, not just a privacy concern.
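Combining the two checks might look like the sketch below. It is self-contained, so it carries its own Luhn helper, and its prefix table covers only a few major brands as an illustration (Visa starting with 4, American Express with 34 or 37, Mastercard with 51-55 or 2221-2720). The names `find_card_numbers` and `card_brand` are assumptions, not an existing API.

```python
import re
from typing import Optional

# Runs of 13-19 digits, optionally separated by single spaces or hyphens.
CANDIDATE_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits: str) -> bool:
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def card_brand(digits: str) -> Optional[str]:
    """Illustrative prefix check; a real BIN table is far larger."""
    if digits.startswith("4"):
        return "visa"
    if digits[:2] in ("34", "37"):
        return "amex"
    if 51 <= int(digits[:2]) <= 55 or 2221 <= int(digits[:4]) <= 2720:
        return "mastercard"
    return None


def find_card_numbers(text: str) -> list[dict]:
    hits = []
    for match in CANDIDATE_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group(0))
        brand = card_brand(digits)
        if brand and luhn_ok(digits):
            # Keep only a masked form so the scan report is not itself a leak.
            masked = "*" * (len(digits) - 4) + digits[-4:]
            hits.append({"brand": brand, "masked": masked})
    return hits


print(find_card_numbers("the full number is 4242424242424242"))
```

Masking the hit in the report matters: a scan log that stores full card numbers recreates the problem it was meant to find.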

6. Biometric and Health-Related Data

This category catches most organizations off guard. You don't need to be a healthcare company to store health data. Consider:

HR systems — sick leave reasons, disability accommodations, health insurance details
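Profile photos — facial images count as biometric data under GDPR once they are processed to uniquely identify a person (Recital 51), and face-geometry scans derived from them fall under BIPA (Illinois)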
Wellness program data — step counts, heart rate, sleep data from corporate wellness integrations
Customer support transcripts — "I need to cancel because I was diagnosed with..."

Under GDPR Article 9, health data is a "special category" that may only be processed under one of the Article 9(2) conditions, such as explicit consent. Under HIPAA, Protected Health Information (PHI) held by covered entities and their business associates has strict handling requirements. Illinois' Biometric Information Privacy Act (BIPA) has generated over $1 billion in settlements since 2020 for improper handling of biometric data.

The challenge with health data is that it's often unstructured — embedded in free-text fields where automated detection requires natural language processing, not just pattern matching.

7. Device Fingerprints and Tracking Identifiers

Modern applications collect device-level identifiers that, combined with other data, constitute personal data:

IDFA/GAID — mobile advertising identifiers stored in analytics tables
Browser fingerprints — canvas hashes, WebGL renderer strings, installed font lists
MAC addresses — captured by network-level logging or IoT integrations
Cookie values — session and tracking cookies stored server-side

The EU's ePrivacy Directive and CJEU case law make clear that any identifier capable of singling out an individual is personal data. California's CCPA explicitly includes "unique personal identifiers" and "internet or other electronic network activity information" in its definition of personal information.

Check your analytics and event-tracking tables. If you're storing raw device identifiers alongside behavioral data, you're building a personal data asset that requires full GDPR/CCPA compliance treatment.
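A first pass over an events table could count values that look like device identifiers. The sketch below assumes advertising IDs appear in UUID form (both IDFA and GAID use it) and that MAC addresses use colon or hyphen separators. The function name `classify_device_ids` is hypothetical. The UUID pattern also matches ordinary UUID primary keys, so hits need review.

```python
import re

UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
MAC_RE = re.compile(r"\b(?:[0-9a-fA-F]{2}[:-]){5}[0-9a-fA-F]{2}\b")


def classify_device_ids(values: list[str]) -> dict[str, int]:
    """Count how many values contain a UUID-like or MAC-like token."""
    counts = {"uuid_like": 0, "mac_like": 0}
    for value in values:
        text = str(value)
        if UUID_RE.search(text):
            counts["uuid_like"] += 1
        if MAC_RE.search(text):
            counts["mac_like"] += 1
    return counts


sample_values = [
    "idfa=6D92078A-8246-4BA4-AE5B-76104861E7DC",
    "00:1A:2B:3C:4D:5E",
    "page_view",
]
print(classify_device_ids(sample_values))  # {'uuid_like': 1, 'mac_like': 1}
```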

8. Location Data Beyond IP Geolocation

GPS coordinates, Wi-Fi access point data, and cell tower triangulation data are all increasingly stored in application databases:

Check-in features — latitude/longitude pairs stored with timestamps
Delivery and logistics tables — driver and customer GPS tracks
Photo metadata — EXIF data from user-uploaded images containing GPS coordinates
Bluetooth beacon interactions — retail and venue analytics

Location data is particularly sensitive because it can reveal a person's home address, workplace, medical visits, religious practice, and political activities. Under GDPR, precise location data is considered high-risk and typically requires a Data Protection Impact Assessment (DPIA).

Quick audit step: Search your database for any float or decimal columns with values in the range of valid coordinates (latitude: -90 to 90, longitude: -180 to 180) that are paired together. Also scan text columns for patterns like "lat": or "longitude": in JSON strings.
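The audit step above can be sketched in Python once column samples have been pulled out of the database. Both helper names are assumptions for illustration, and the heuristic is deliberately loose: it only checks range and fractional precision, so prices or percentages in paired columns can also trigger it.

```python
import re

COORD_KEY_RE = re.compile(
    r'"(lat|latitude|lng|lon|longitude)"\s*:\s*(-?\d{1,3}(?:\.\d+)?)',
    re.IGNORECASE,
)


def looks_like_coordinates(pairs: list[tuple[float, float]]) -> bool:
    """True when every (lat, lon) pair is in range and at least one value is non-integer."""
    if not pairs:
        return False
    in_range = all(-90 <= lat <= 90 and -180 <= lon <= 180 for lat, lon in pairs)
    has_precision = any(lat != int(lat) or lon != int(lon) for lat, lon in pairs)
    return in_range and has_precision


def find_coordinate_keys(text: str) -> list[tuple[str, float]]:
    """Find coordinate-style keys with numeric values inside JSON text."""
    return [(key.lower(), float(value)) for key, value in COORD_KEY_RE.findall(text)]


print(looks_like_coordinates([(52.5200, 13.4050), (48.8566, 2.3522)]))  # True
print(find_coordinate_keys('{"lat": 52.52, "longitude": -0.1278}'))
```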

9. Names and Identifiers in Metadata and System Tables

Names and personal identifiers hide in places developers rarely think to check:

Database audit trails — created_by and modified_by columns containing usernames or full names
File path strings — /home/john.smith/uploads/report.pdf stored in document management tables
Git-style revision metadata — author names and emails in version-controlled content tables
Temporary tables — staging tables from ETL jobs that were never dropped
Backup manifests — table metadata containing references to named individuals

These "system-level" references to individuals are still personal data under GDPR's broad definition. They must be included in data subject access requests (DSARs) and must be deletable to comply with the right to erasure.

```bash
# Find columns likely containing names across all tables (PostgreSQL)
psql -d yourdb -c "
SELECT table_name, column_name
FROM information_schema.columns
WHERE column_name ILIKE ANY(ARRAY[
    '%name%', '%author%', '%creator%', '%owner%',
    '%user%', '%person%', '%contact%', '%employee%'
])
AND table_schema = 'public'
ORDER BY table_name;
"
```
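Column names will not reveal names buried in values, such as the file path example above. A small sketch for that case extracts the account name from home-directory paths. The function name `usernames_in_paths` is hypothetical, and the pattern assumes Linux, macOS, and Windows default home locations.

```python
import re

# Matches /home/<user>, /Users/<user>, and C:\Users\<user> style prefixes.
HOME_PATH_RE = re.compile(r"(?:/home/|/Users/|[A-Za-z]:\\Users\\)([^/\\\s]+)")


def usernames_in_paths(paths: list[str]) -> set[str]:
    """Collect account names that appear as home-directory components."""
    found = set()
    for path in paths:
        found.update(HOME_PATH_RE.findall(str(path)))
    return found


sample_paths = [
    "/home/john.smith/uploads/report.pdf",
    r"C:\Users\asmith\Desktop\notes.txt",
    "/var/log/app.log",
]
print(usernames_in_paths(sample_paths))
```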

10. Derived and Inferred PII

This is the category that catches sophisticated organizations off guard. Data that doesn't look like PII on its own becomes PII when combined:

Behavioral profiles — purchase history + browsing patterns + demographic segments
Risk scores — credit risk, fraud risk, or churn probability scores linked to individuals
Recommendation engine data — preference models that encode personal characteristics
Anonymized datasets — research consistently shows that "anonymized" datasets can be re-identified with as few as 3-4 data points (a 2019 Nature Communications study demonstrated 99.98% re-identification accuracy with 15 demographic attributes)

Under GDPR, the Article 29 Working Party (now EDPB) has made clear that pseudonymized data is still personal data. If you can link a record back to an individual — even indirectly — it's PII and must be treated accordingly.

This is where automated scanning tools become essential. Manual audits can catch obvious PII, but detecting re-identification risk in combined datasets requires systematic analysis that scales with your data.
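One simple way to put a number on re-identification risk is a k-anonymity check: group records by their quasi-identifier values and look at the smallest group. The sketch below is a minimal illustration with made-up rows and a hypothetical function name, not a full risk model. A result of 1 means at least one person is unique on those attributes.

```python
from collections import Counter


def smallest_group_size(rows: list[dict], quasi_identifiers: list[str]) -> int:
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(row.get(col) for col in quasi_identifiers) for row in rows
    )
    return min(groups.values()) if groups else 0


# Made-up sample rows for illustration only.
rows = [
    {"zip": "60601", "birth_year": 1985, "gender": "F"},
    {"zip": "60601", "birth_year": 1985, "gender": "F"},
    {"zip": "60601", "birth_year": 1990, "gender": "M"},
]
print(smallest_group_size(rows, ["zip", "birth_year", "gender"]))  # 1
print(smallest_group_size(rows, ["zip"]))                          # 3
```

Adding attributes can only keep k the same or shrink it, which is why datasets that look safe column by column become identifying in combination.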

---

How to Find Hidden PII: A Practical Approach

Scanning for hidden PII requires a layered strategy:

1. Schema analysis: identify columns by name and data type that likely contain PII
2. Pattern matching: scan content for known PII formats (emails, SSNs, card numbers, phone numbers)
3. Statistical analysis: flag columns with high cardinality and low repetition (characteristic of personal identifiers)
4. Contextual detection: use NLP to identify PII in unstructured text fields
5. Cross-reference analysis: identify combinations of quasi-identifiers that create re-identification risk

Running this manually across a production database with hundreds of tables is impractical. Automated PII scanning tools can classify every column in your database in minutes, flagging high-confidence matches and surfacing suspicious patterns for human review.
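The statistical-analysis layer is the easiest to prototype. As a rough sketch, the code below computes the share of distinct non-null values per column and flags columns above a threshold. The 0.9 threshold and both function names are assumptions chosen for illustration, and a real scanner would work on sampled rows, not full columns.

```python
def uniqueness_ratio(values: list) -> float:
    """Share of distinct values among non-null entries (0.0 for an empty column)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return len(set(non_null)) / len(non_null)


def flag_identifier_like_columns(columns: dict[str, list], threshold: float = 0.9) -> list[str]:
    """Return column names whose values are almost all distinct."""
    return [
        name for name, values in columns.items()
        if uniqueness_ratio(values) >= threshold
    ]


# Made-up column samples for illustration only.
columns = {
    "email": ["a@x.io", "b@x.io", "c@x.io"],
    "country": ["US", "US", "DE"],
    "notes": [None, None],
}
print(flag_identifier_like_columns(columns))  # ['email']
```

High uniqueness alone also flags harmless surrogate keys, so this layer is best used to rank columns for the pattern and contextual checks.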

---

FAQ

How often should I scan my database for PII?

At minimum, run a comprehensive PII scan quarterly and after any significant schema change or data migration. Organizations with high data velocity — SaaS platforms, e-commerce, fintech — should run weekly or continuous scans. GDPR's Article 5(1)(e) requires that personal data is "kept in a form which permits identification of data subjects for no longer than is necessary," which implies ongoing monitoring, not one-time audits.

Is pseudonymized data still considered PII?

Yes. Under GDPR, pseudonymized data is explicitly still personal data (Recital 26). The test is whether the data can be attributed to a person "by the use of additional information." If your organization holds the key to reverse the pseudonymization, or if re-identification is reasonably likely using other available data, the data requires full GDPR compliance treatment. Only truly anonymous data, where the anonymization is irreversible and re-identification is practically impossible, falls outside GDPR scope.

What's the difference between PII under GDPR vs. CCPA?

GDPR uses the term "personal data" and defines it broadly as any information relating to an identified or identifiable natural person. CCPA uses "personal information" and defines it as information that identifies, relates to, or could reasonably be linked to a particular consumer or household. The key practical differences: CCPA explicitly includes household-level data (not just individual), covers data sold or shared for cross-context behavioral advertising, and includes specific categories like geolocation, biometric, and internet activity data. Both are broad enough that most data about people qualifies — but their enforcement mechanisms, consumer rights, and exemptions differ significantly.

Can I use regex alone to detect all PII in my database?

No. Regex is effective for structured PII with predictable formats — email addresses, SSNs, credit card numbers, and phone numbers. But it fails for unstructured PII like names in free text, health information in support tickets, or behavioral data that becomes PII in aggregate. A comprehensive PII detection strategy combines regex-based pattern matching with named entity recognition (NER), data type heuristics, and contextual analysis. Regex is a starting point, not a complete solution.

What should I do when I find unexpected PII in my database?

Follow this immediate response process: (1) Document what you found — data type, location, volume, and how long it's been stored. (2) Assess the legal basis — do you have a lawful reason to process this data under GDPR Article 6 or a valid business purpose under CCPA? (3) If no legal basis exists, plan deletion with your DPO and engineering team. (4) Update your Record of Processing Activities (ROPA). (5) Evaluate whether a data breach notification is required — if the PII was accessible to unauthorized parties, you may have a 72-hour notification obligation under GDPR Article 33. (6) Implement controls to prevent re-accumulation: column-level encryption, data masking, or automated PII detection in your CI/CD pipeline.

---

Start Scanning for PII Today

PrivaSift automatically detects PII across your files, databases, and cloud storage — helping you stay GDPR and CCPA compliant without the manual work.

[Try PrivaSift Free →](https://privasift.com)
