Top 10 Types of PII Your Database Is Storing Without You Knowing
Every database accumulates personal data over time — and most of it was never intentionally collected. Developers add fields for debugging, customer support logs grow unchecked, and third-party integrations dump raw payloads into staging tables that never get cleaned up. The result: your production database is almost certainly storing personally identifiable information (PII) you didn't plan for, don't need, and are legally required to protect.
This isn't a theoretical problem. In 2023, the average cost of a data breach reached $4.45 million according to IBM's Cost of a Data Breach Report. Regulators under GDPR have issued fines exceeding €4 billion since 2018, and CCPA enforcement actions are accelerating. Meta alone was fined €1.2 billion in May 2023 for improper handling of personal data transfers. The common thread in these cases isn't sophisticated hacking — it's organizations that didn't know what personal data they were holding in the first place.
If you're a CTO, DPO, or security engineer responsible for compliance, the first step is understanding what's actually in your data stores. This article walks through the ten most common types of PII that silently accumulate in databases — and how to find them before an auditor or attacker does.
1. Email Addresses Embedded in Free-Text Fields

Email addresses are the most pervasive form of PII, and they show up far beyond your users.email column. They leak into:
- Log tables: error messages containing "Failed to send notification to john.doe@company.com"
- Support ticket bodies: customers paste their email into message fields
- JSON blobs: API request/response payloads stored for debugging
- Comments and notes: internal CRM notes like "Follow up with sarah@client.org"
```sql
-- Quick check: find email-like patterns in your logs table
SELECT id, message
FROM application_logs
WHERE message ~ '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
LIMIT 100;
```
This query will likely return more results than you expect. The real question is: do you have a retention policy for these tables?
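Once you know email addresses are in free-text columns, one option is to mask them before the text is stored or exported. The following is a minimal sketch that reuses the same permissive pattern as the SQL check above; the function name `mask_emails` and the `[email]` placeholder are illustrative assumptions, not part of any particular tool.

```python
import re

# Same permissive pattern as the SQL check: it favors recall over strict validity.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def mask_emails(text: str, placeholder: str = "[email]") -> tuple[str, int]:
    """Replace email-like substrings and return (masked_text, match_count)."""
    return EMAIL_RE.subn(placeholder, text)


masked, count = mask_emails("Failed to send notification to john.doe@company.com")
print(masked, count)
```

Returning the match count alongside the masked text makes it easy to report how many addresses a table contained without keeping the addresses themselves.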
2. IP Addresses and Geolocation Data

Under GDPR, IP addresses are named as online identifiers that can make a person identifiable (Recital 30), so in most contexts they count as personal data. Yet most web applications store them liberally:
- Access logs stored indefinitely in database tables
- Rate-limiting tables that track IPs alongside user sessions
- Analytics events with full IP addresses instead of anonymized versions
- Fraud detection systems that correlate IPs with user identities
What to do about it
Truncate or hash IP addresses when full precision isn't needed. For IPv4, zeroing the last octet (192.168.1.77 becomes 192.168.1.0) reduces identifiability while preserving network-level analytics. For IPv6, keep only the first 48 bits. If you need full IPs for security purposes, isolate them in a dedicated table with strict access controls and a 90-day retention policy.
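The truncation described above can be sketched with Python's standard `ipaddress` module. The function name `truncate_ip` is an illustrative assumption; the /24 and /48 prefix lengths follow the guidance in this section.

```python
import ipaddress


def truncate_ip(raw: str) -> str:
    """Zero the host bits: keep a /24 for IPv4 and a /48 for IPv6."""
    addr = ipaddress.ip_address(raw)
    prefix = 24 if addr.version == 4 else 48
    # strict=False lets us pass a host address and get its enclosing network
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)


print(truncate_ip("192.168.1.77"))
print(truncate_ip("2001:db8:abcd:12:34:56:78:9a"))
```

Applying this at write time, before the value reaches the database, avoids having to clean up full addresses later.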
3. Phone Numbers in Unexpected Formats

Phone numbers hide in databases in dozens of formats: +1-555-123-4567, (555) 123-4567, 5551234567, +44 20 7946 0958. They appear in:
- Shipping address fields: customers add phone numbers to address line 2
- User-agent strings: some mobile browsers include the device's phone number
- Webhook payloads: payment processors send phone data in transaction metadata
- CSV imports: bulk uploads that include phone columns mapped to generic varchar fields
```python
# Phone numbers appear in many formats; a single regex won't cut it
import re

phone_patterns = [
    r'\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}',  # US/CA
    r'\+44\s?\d{2,4}\s?\d{3,4}\s?\d{3,4}',               # UK
    r'\+49\s?\d{2,4}\s?\d{4,8}',                         # DE
    r'\+\d{1,3}[-.\s]?\d{4,14}',                         # Generic intl
]

def scan_column_for_phones(values: list[str]) -> list[dict]:
    """Return one hit per value, recording the first pattern that matched."""
    hits = []
    for i, val in enumerate(values):
        for pattern in phone_patterns:
            if re.search(pattern, str(val)):
                hits.append({"row": i, "value": val, "pattern": pattern})
                break
    return hits
```
4. Government Identifiers: SSNs, Tax IDs, and Passport Numbers

Social Security Numbers, national insurance numbers, tax identifiers, and passport numbers are high-sensitivity PII — and they end up in databases more often than most organizations realize.
Common hiding places:
- File attachment metadata — PDF forms with SSN fields uploaded to document management systems
- Customer onboarding tables — KYC processes that store raw identity documents
- Legacy migration artifacts — old systems that stored SSNs as primary keys (yes, this was common)
- Audit trail tables — change logs that capture before/after values including SSN fields
Action item: Run a targeted scan across all varchar and text columns for patterns matching \d{3}-\d{2}-\d{4} (US SSN), \d\s\d{2}\s\d{2}\s\d{2}\s\d{3}\s\d{3}\s\d{2} (French NIR: 13 digits plus a 2-digit key), and equivalent patterns for your operating jurisdictions. Then verify whether any matches correspond to actual identifiers or are false positives.
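A first pass at separating real SSNs from false positives can use the published structural rules: the area number is never 000, 666, or 900-999, the group number is never 00, and the serial number is never 0000. The sketch below applies those rules to the US pattern from the action item; the function name `plausible_ssns` is an illustrative assumption, and a match still needs human review.

```python
import re

SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def plausible_ssns(text: str) -> list[str]:
    """Return SSN-shaped substrings that pass basic structural checks."""
    hits = []
    for match in SSN_RE.finditer(text):
        area, group, serial = match.groups()
        # Area numbers 000, 666 and 900-999 are never issued
        if area in ("000", "666") or area.startswith("9"):
            continue
        # Group 00 and serial 0000 are never issued
        if group == "00" or serial == "0000":
            continue
        hits.append(match.group(0))
    return hits


print(plausible_ssns("ref 123-45-6789, ticket 000-12-3456, code 912-34-5678"))
```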
5. Financial Data: Credit Card Numbers and Bank Accounts
PCI DSS compliance gets a lot of attention for payment processing systems, but credit card numbers (PANs) leak into non-PCI systems through:
- Customer support tickets — "My card ending in 4242 was charged twice, the full number is 4242424242424242"
- Application logs — payment gateway request/response logs that include full card data
- Email archives — stored in database-backed email systems
- Test data in production — developers who used real card numbers in test fixtures that made it to prod
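```python
def luhn_check(number: str) -> bool:
    """Validate a candidate card number with the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    # Card numbers (PANs) are 13 to 19 digits long
    if len(digits) < 13 or len(digits) > 19:
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        # Double every second digit, counting from the right
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```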
If a stored value passes a Luhn check and starts with a known BIN (Bank Identification Number) prefix, treat it as a probable card number. The system holding it may now fall within PCI DSS scope, which makes this a compliance problem and not just a privacy concern.
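To apply this to free-text fields such as support tickets, the scan has to pull digit runs out of surrounding prose first. The sketch below combines a candidate regex, a Luhn check, and a rough network guess from a few well-known prefixes (Visa 4, Mastercard 51-55 and 2221-2720, American Express 34 and 37). The function names are illustrative assumptions, and the prefix list is deliberately partial, not a full BIN table.

```python
import re
from typing import Optional

# 13 to 19 digits, optionally separated by single spaces or hyphens
CANDIDATE_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits: str) -> bool:
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0


def card_network(digits: str) -> Optional[str]:
    """Rough network guess from a few well-known prefixes (partial list)."""
    if digits[0] == "4":
        return "visa"
    if 51 <= int(digits[:2]) <= 55 or 2221 <= int(digits[:4]) <= 2720:
        return "mastercard"
    if digits[:2] in ("34", "37"):
        return "amex"
    return None


def find_card_numbers(text: str) -> list[dict]:
    """Report probable card numbers without echoing the full number."""
    hits = []
    for match in CANDIDATE_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group(0))
        network = card_network(digits)
        if network and luhn_ok(digits):
            hits.append({"network": network, "last4": digits[-4:], "span": match.span()})
    return hits


ticket = "My card ending in 4242 was charged twice, the full number is 4242424242424242"
print(find_card_numbers(ticket))
```

The hits carry only the last four digits and the character span, so the scan report itself does not become another place where full card numbers are stored.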
6. Biometric and Health-Related Data
This category catches most organizations off guard. You don't need to be a healthcare company to store health data. Consider:
- HR systems — sick leave reasons, disability accommodations, health insurance details
- Profile photos — facial images are biometric data under GDPR and BIPA (Illinois)
- Wellness program data — step counts, heart rate, sleep data from corporate wellness integrations
- Customer support transcripts — "I need to cancel because I was diagnosed with..."
The challenge with health data is that it's often unstructured — embedded in free-text fields where automated detection requires natural language processing, not just pattern matching.
7. Device Fingerprints and Tracking Identifiers
Modern applications collect device-level identifiers that, combined with other data, constitute personal data:
- IDFA/GAID — mobile advertising identifiers stored in analytics tables
- Browser fingerprints — canvas hashes, WebGL renderer strings, installed font lists
- MAC addresses — captured by network-level logging or IoT integrations
- Cookie values — session and tracking cookies stored server-side
Check your analytics and event-tracking tables. If you're storing raw device identifiers alongside behavioral data, you're building a personal data asset that requires full GDPR/CCPA compliance treatment.
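A quick way to spot such columns is to sample their values and test the shape. The sketch below flags columns whose samples mostly look like MAC addresses or UUIDs (the format used by IDFA and GAID). The names `PATTERNS` and `classify_identifier_column` and the 0.8 threshold are illustrative assumptions; a UUID-shaped column is not necessarily an advertising ID, so treat a hit as a prompt for review.

```python
import re
from typing import Optional

PATTERNS = {
    "mac_address": re.compile(r"^(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}$"),
    "uuid_like": re.compile(r"^[0-9A-Fa-f]{8}-(?:[0-9A-Fa-f]{4}-){3}[0-9A-Fa-f]{12}$"),
}


def classify_identifier_column(samples: list[str], threshold: float = 0.8) -> Optional[str]:
    """Return a label if at least `threshold` of non-empty samples match a pattern."""
    non_empty = [s.strip() for s in samples if s and s.strip()]
    if not non_empty:
        return None
    for label, pattern in PATTERNS.items():
        matches = sum(1 for s in non_empty if pattern.match(s))
        if matches / len(non_empty) >= threshold:
            return label
    return None


print(classify_identifier_column(["00:1A:2B:3C:4D:5E", "00-1a-2b-3c-4d-5f"]))
```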
8. Location Data Beyond IP Geolocation
GPS coordinates, Wi-Fi access point data, and cell tower triangulation data are all increasingly stored in application databases:
- Check-in features — latitude/longitude pairs stored with timestamps
- Delivery and logistics tables — driver and customer GPS tracks
- Photo metadata — EXIF data from user-uploaded images containing GPS coordinates
- Bluetooth beacon interactions — retail and venue analytics
Quick audit step: Search your database for any float or decimal columns with values in the range of valid coordinates (latitude: -90 to 90, longitude: -180 to 180) that are paired together. Also scan text columns for patterns like "lat": or "longitude": in JSON strings.
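For the JSON part of that audit, a plain substring search finds the keys but not whether the values are valid coordinates. The sketch below parses each payload and collects latitude/longitude pairs that sit in the same object and fall inside the valid ranges. The key names in `LAT_KEYS` and `LON_KEYS` are assumptions about common naming; adjust them to what your payloads actually use.

```python
import json

LAT_KEYS = ("lat", "latitude")
LON_KEYS = ("lon", "lng", "long", "longitude")


def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _first_present(obj: dict, keys: tuple):
    for key in keys:
        if key in obj:
            return obj[key]
    return None


def find_coordinates(payload: str) -> list[tuple[float, float]]:
    """Walk a JSON document and collect valid (lat, lon) pairs."""
    try:
        doc = json.loads(payload)
    except (TypeError, ValueError):
        return []  # not JSON: nothing to report
    found = []
    stack = [doc]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            lowered = {str(k).lower(): v for k, v in node.items()}
            lat = _first_present(lowered, LAT_KEYS)
            lon = _first_present(lowered, LON_KEYS)
            if _is_number(lat) and _is_number(lon) and -90 <= lat <= 90 and -180 <= lon <= 180:
                found.append((float(lat), float(lon)))
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return found


print(find_coordinates('{"user": 7, "location": {"lat": 48.8566, "lng": 2.3522}}'))
```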
9. Names and Identifiers in Metadata and System Tables
Names and personal identifiers hide in places developers rarely think to check:
- Database audit trails: created_by and modified_by columns containing usernames or full names
- File path strings: /home/john.smith/uploads/report.pdf stored in document management tables
- Git-style revision metadata: author names and emails in version-controlled content tables
- Temporary tables: staging tables from ETL jobs that were never dropped
- Backup manifests: table metadata containing references to named individuals
```bash
# Find columns likely containing names across all tables (PostgreSQL)
psql -d yourdb -c "
  SELECT table_name, column_name
  FROM information_schema.columns
  WHERE column_name ILIKE ANY(ARRAY[
    '%name%', '%author%', '%creator%', '%owner%',
    '%user%', '%person%', '%contact%', '%employee%'
  ])
  AND table_schema = 'public'
  ORDER BY table_name;
"
```
10. Derived and Inferred PII
This is the category that catches sophisticated organizations off guard. Data that doesn't look like PII on its own becomes PII when combined:
- Behavioral profiles: purchase history + browsing patterns + demographic segments
- Risk scores: credit risk, fraud risk, or churn probability scores linked to individuals
- Recommendation engine data: preference models that encode personal characteristics
- Anonymized datasets: research consistently shows that "anonymized" datasets can be re-identified with as few as 3-4 data points (a 2019 Nature Communications study estimated that 99.98% of Americans could be correctly re-identified in any dataset using 15 demographic attributes)
This is where automated scanning tools become essential. Manual audits can catch obvious PII, but detecting re-identification risk in combined datasets requires systematic analysis that scales with your data.
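One simple way to quantify that risk is a k-anonymity style check: group rows by a set of quasi-identifiers and look at the smallest group. A group of size 1 means that combination of values singles out one person. The sketch below assumes rows are dictionaries; the function names and the example columns are illustrative.

```python
from collections import Counter


def _group_sizes(rows: list[dict], quasi_identifiers: list[str]) -> Counter:
    return Counter(tuple(row.get(q) for q in quasi_identifiers) for row in rows)


def smallest_group_size(rows: list[dict], quasi_identifiers: list[str]) -> int:
    """Return k: the size of the smallest group sharing the same values."""
    groups = _group_sizes(rows, quasi_identifiers)
    return min(groups.values()) if groups else 0


def unique_fraction(rows: list[dict], quasi_identifiers: list[str]) -> float:
    """Fraction of rows whose quasi-identifier combination is unique."""
    if not rows:
        return 0.0
    groups = _group_sizes(rows, quasi_identifiers)
    return sum(1 for size in groups.values() if size == 1) / len(rows)


rows = [
    {"zip": "60601", "birth_year": 1985, "gender": "F"},
    {"zip": "60601", "birth_year": 1985, "gender": "F"},
    {"zip": "60614", "birth_year": 1990, "gender": "M"},
]
print(smallest_group_size(rows, ["zip", "birth_year", "gender"]))
print(unique_fraction(rows, ["zip", "birth_year", "gender"]))
```

Running the check with progressively larger sets of columns shows which combination is the one that starts isolating individuals.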
---
How to Find Hidden PII: A Practical Approach
Scanning for hidden PII requires a layered strategy:
1. Schema analysis: identify columns by name and data type that likely contain PII
2. Pattern matching: scan content for known PII formats (emails, SSNs, card numbers, phone numbers)
3. Statistical analysis: flag columns with high cardinality and low repetition (characteristic of personal identifiers)
4. Contextual detection: use NLP to identify PII in unstructured text fields
5. Cross-reference analysis: identify combinations of quasi-identifiers that create re-identification risk
Running this manually across a production database with hundreds of tables is impractical. Automated PII scanning tools can classify every column in your database in minutes, flagging high-confidence matches and surfacing suspicious patterns for human review.
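As an illustration of the statistical-analysis step, the sketch below computes a uniqueness ratio per column from sampled values and flags columns where nearly every value is distinct. The function names and the `min_ratio` and `min_rows` thresholds are assumptions to tune against your own data; values are assumed to be hashable.

```python
def uniqueness_ratio(values: list) -> float:
    """Distinct non-null values divided by total non-null values."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return len(set(non_null)) / len(non_null)


def flag_identifier_like(
    columns: dict[str, list], min_ratio: float = 0.9, min_rows: int = 20
) -> list[str]:
    """Return column names whose sampled values are almost all distinct."""
    flagged = []
    for name, values in columns.items():
        non_null_count = sum(v is not None for v in values)
        # Skip tiny samples, where a high ratio is not meaningful
        if non_null_count >= min_rows and uniqueness_ratio(values) >= min_ratio:
            flagged.append(name)
    return flagged


sample = {
    "email": [f"user{i}@example.com" for i in range(50)],
    "country": ["US", "DE"] * 25,
}
print(flag_identifier_like(sample))
```

High cardinality alone does not prove a column holds PII (timestamps and order IDs also score high), so this works best as a filter that narrows where the pattern-matching and NLP steps need to look.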
---
FAQ
How often should I scan my database for PII?
At minimum, run a comprehensive PII scan quarterly and after any significant schema change or data migration. Organizations with high data velocity — SaaS platforms, e-commerce, fintech — should run weekly or continuous scans. GDPR's Article 5(1)(e) requires that personal data is "kept in a form which permits identification of data subjects for no longer than is necessary," which implies ongoing monitoring, not one-time audits.
Is pseudonymized data still considered PII?
Yes. Under GDPR, pseudonymized data is explicitly still personal data (Recital 26). The test is whether the data can be attributed to a person "by the use of additional information." If your organization holds the key to reverse the pseudonymization, or if re-identification is technically feasible, the data requires full GDPR compliance treatment. Only truly anonymized data, where the anonymization is irreversible and re-identification is not reasonably likely, falls outside GDPR scope.
What's the difference between PII under GDPR vs. CCPA?
GDPR uses the term "personal data" and defines it broadly as any information relating to an identified or identifiable natural person. CCPA uses "personal information" and defines it as information that identifies, relates to, or could reasonably be linked to a particular consumer or household. The key practical differences: CCPA explicitly includes household-level data (not just individual), covers data sold or shared for cross-context behavioral advertising, and includes specific categories like geolocation, biometric, and internet activity data. Both are broad enough that most data about people qualifies — but their enforcement mechanisms, consumer rights, and exemptions differ significantly.
Can I use regex alone to detect all PII in my database?
No. Regex is effective for structured PII with predictable formats — email addresses, SSNs, credit card numbers, and phone numbers. But it fails for unstructured PII like names in free text, health information in support tickets, or behavioral data that becomes PII in aggregate. A comprehensive PII detection strategy combines regex-based pattern matching with named entity recognition (NER), data type heuristics, and contextual analysis. Regex is a starting point, not a complete solution.
What should I do when I find unexpected PII in my database?
Follow this immediate response process: (1) Document what you found — data type, location, volume, and how long it's been stored. (2) Assess the legal basis — do you have a lawful reason to process this data under GDPR Article 6 or a valid business purpose under CCPA? (3) If no legal basis exists, plan deletion with your DPO and engineering team. (4) Update your Record of Processing Activities (ROPA). (5) Evaluate whether a data breach notification is required — if the PII was accessible to unauthorized parties, you may have a 72-hour notification obligation under GDPR Article 33. (6) Implement controls to prevent re-accumulation: column-level encryption, data masking, or automated PII detection in your CI/CD pipeline.
---
Start Scanning for PII Today
PrivaSift automatically detects PII across your files, databases, and cloud storage — helping you stay GDPR and CCPA compliant without the manual work.
[Try PrivaSift Free →](https://privasift.com)