How to Build a Custom PII Detection Solution in Java

PrivaSift Team · Apr 01, 2026 · pii-detection, pii, gdpr, ccpa, compliance


Every organization handling personal data faces the same uncomfortable question: do you actually know where all your PII lives? For companies operating under GDPR and CCPA, the answer to that question can mean the difference between business as usual and a seven-figure fine. In 2025 alone, European data protection authorities issued over €2.1 billion in GDPR penalties, with several landmark cases stemming directly from organizations that failed to identify and protect personally identifiable information stored across their systems.

For engineering teams, the challenge is deceptively complex. PII doesn't sit neatly in a single database column labeled "personal_data." It hides in log files, free-text fields, email threads, CSV exports, and legacy systems that predate your compliance program. Off-the-shelf solutions can help, but many CTOs and security engineers find themselves needing something tailored — a detection engine that understands their specific data formats, their domain terminology, and their infrastructure.

Building a custom PII detection solution in Java gives you that control. Java remains the dominant language in enterprise backends, and its ecosystem — from Apache OpenNLP to robust regex libraries — provides everything you need to construct a detection pipeline that fits your architecture. This tutorial walks you through the key components, from pattern matching to NER-based detection, so you can build a scanner that actually works for your data.

Understanding What Counts as PII Under GDPR and CCPA

![Understanding What Counts as PII Under GDPR and CCPA](https://max.dnt-ai.ru/img/privasift/custom-pii-detector-java_sec1.png)

Before writing a single line of code, you need to define what your scanner is looking for. GDPR and CCPA define personal data differently, and your detection solution must account for both if you serve users in the EU and California.

Under GDPR (Article 4), personal data means any information relating to an identified or identifiable natural person. This includes obvious identifiers like names and email addresses, but also extends to IP addresses, cookie identifiers, and even location data.

Under CCPA (§1798.140), personal information is broadly defined as information that identifies, relates to, or could reasonably be linked to a particular consumer or household. This includes biometric data, browsing history, and geolocation — categories that many scanning tools miss entirely.

At a minimum, your Java PII detector should handle these categories:

  • Direct identifiers: full names, email addresses, phone numbers, Social Security numbers, passport numbers
  • Financial data: credit card numbers, IBANs, bank account numbers
  • Digital identifiers: IP addresses (v4 and v6), MAC addresses, device IDs
  • Location data: physical addresses, GPS coordinates, ZIP/postal codes
  • Health and biometric data: medical record numbers, biometric templates (if stored as text)

Map each category to the specific regulation it falls under. This mapping becomes critical when you generate compliance reports downstream.
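
One way to carry that mapping is directly on the `PiiCategory` enum itself (a sketch, abridged to a few categories; the `Regulation` enum and the specific assignments shown are illustrative, not a legal determination):

```java
import java.util.EnumSet;
import java.util.Set;

enum Regulation { GDPR, CCPA }

enum PiiCategory {
    // Assignments below are illustrative examples, not legal advice
    EMAIL(EnumSet.of(Regulation.GDPR, Regulation.CCPA)),
    SSN(EnumSet.of(Regulation.CCPA)),
    IPV4(EnumSet.of(Regulation.GDPR, Regulation.CCPA)),
    BIOMETRIC(EnumSet.of(Regulation.GDPR, Regulation.CCPA));

    private final Set<Regulation> regulations;

    PiiCategory(Set<Regulation> regulations) {
        this.regulations = regulations;
    }

    public Set<Regulation> regulations() {
        return regulations;
    }
}
```

With the regulations attached to the category, a compliance reporter can group findings by regulation without a separate lookup table.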

Setting Up the Java Project Structure

![Setting Up the Java Project Structure](https://max.dnt-ai.ru/img/privasift/custom-pii-detector-java_sec2.png)

Start with a clean Maven project and pull in the dependencies you'll need. The core detection engine requires a regex library (built into Java), a natural language processing library for entity recognition, and a structured output format for results.

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.opennlp</groupId>
        <artifactId>opennlp-tools</artifactId>
        <version>2.3.1</version>
    </dependency>
    <dependency>
        <groupId>com.googlecode.libphonenumber</groupId>
        <artifactId>libphonenumber</artifactId>
        <version>8.13.27</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.16.1</version>
    </dependency>
</dependencies>
```

Organize your project around a pipeline pattern:

```
src/main/java/com/yourcompany/piidetector/
├── PiiScanner.java            // Orchestrator
├── detectors/
│   ├── PiiDetector.java       // Interface
│   ├── RegexDetector.java     // Pattern-based detection
│   ├── NerDetector.java       // NLP-based detection
│   └── ChecksumDetector.java  // Luhn, IBAN validation
├── models/
│   ├── PiiMatch.java          // Detection result
│   └── PiiCategory.java       // Enum of PII types
└── reporters/
    └── ComplianceReporter.java
```

This separation matters. Regex detectors are fast but produce false positives. NER detectors handle context but are slower. By composing them behind a common interface, you can tune the pipeline per data source.

Building the Regex Detection Layer

![Building the Regex Detection Layer](https://max.dnt-ai.ru/img/privasift/custom-pii-detector-java_sec3.png)

Regex-based detection is the foundation of any PII scanner. It's fast, deterministic, and handles structured PII formats — credit card numbers, SSNs, email addresses — with high accuracy.

Define your detector interface first:

```java
public interface PiiDetector {
    List<PiiMatch> detect(String text);
    String getDetectorName();
}
```

Then implement the regex layer:

```java
public class RegexDetector implements PiiDetector {

    private static final Map<PiiCategory, Pattern> PATTERNS = Map.of(
        PiiCategory.EMAIL, Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"),
        PiiCategory.SSN, Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b"),
        PiiCategory.CREDIT_CARD, Pattern.compile("\\b(?:\\d[ -]*?){13,19}\\b"),
        PiiCategory.IPV4, Pattern.compile("\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b"),
        PiiCategory.PHONE_US, Pattern.compile("\\b(?:\\+1[-.]?)?\\(?\\d{3}\\)?[-.]?\\d{3}[-.]?\\d{4}\\b"),
        PiiCategory.IBAN, Pattern.compile("\\b[A-Z]{2}\\d{2}[A-Z0-9]{4}\\d{7}([A-Z0-9]?){0,16}\\b")
    );

    @Override
    public List<PiiMatch> detect(String text) {
        List<PiiMatch> matches = new ArrayList<>();
        for (var entry : PATTERNS.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            while (matcher.find()) {
                matches.add(new PiiMatch(
                    entry.getKey(), matcher.group(), matcher.start(), matcher.end(),
                    "regex", calculateConfidence(entry.getKey(), matcher.group())
                ));
            }
        }
        return matches;
    }

    @Override
    public String getDetectorName() {
        return "regex";
    }

    private double calculateConfidence(PiiCategory category, String match) {
        // passesLuhnCheck and validateIbanChecksum live in ChecksumDetector
        return switch (category) {
            case EMAIL -> match.contains(".") && match.contains("@") ? 0.95 : 0.60;
            case CREDIT_CARD -> passesLuhnCheck(match) ? 0.92 : 0.30;
            case SSN -> 0.85;
            case IBAN -> validateIbanChecksum(match) ? 0.95 : 0.20;
            default -> 0.70;
        };
    }
}
```

The critical insight here is the confidence score. A 16-digit number that passes the Luhn algorithm is almost certainly a credit card number. A 16-digit number that fails Luhn is probably an order ID or timestamp. Without checksum validation, your scanner will drown you in false positives — and false positives erode trust in the tool faster than anything else.
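
The Luhn algorithm itself is only a few lines of Java. Here is one possible implementation of the `passesLuhnCheck` helper referenced above (a sketch: strip the separators the CREDIT_CARD pattern allows, then fold the digits from right to left, doubling every second one):

```java
public class LuhnCheck {

    // Returns true when the digit string satisfies the Luhn checksum.
    // Spaces and dashes (allowed by the CREDIT_CARD regex) are stripped first.
    public static boolean passesLuhnCheck(String candidate) {
        String digits = candidate.replaceAll("[ -]", "");
        if (digits.length() < 13 || digits.length() > 19) return false;

        int sum = 0;
        boolean doubleIt = false; // doubling starts at the second digit from the right
        for (int i = digits.length() - 1; i >= 0; i--) {
            char c = digits.charAt(i);
            if (!Character.isDigit(c)) return false;
            int d = c - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) d -= 9; // same as summing the two digits of the product
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

The check is cheap enough to run on every candidate match, so there is no reason to skip it.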

Adding NLP-Based Named Entity Recognition

![Adding NLP-Based Named Entity Recognition](https://max.dnt-ai.ru/img/privasift/custom-pii-detector-java_sec4.png)

Regex handles structured PII well, but it cannot detect names, addresses, or organization references embedded in free text. For that, you need Named Entity Recognition (NER).

Apache OpenNLP provides pre-trained models for person names, locations, and organizations. Here's how to wire it into your pipeline:

```java
public class NerDetector implements PiiDetector {

    private final TokenNameFinderModel personModel;
    private final TokenNameFinderModel locationModel;
    private final TokenizerModel tokenizerModel;

    public NerDetector() throws IOException {
        this.personModel = new TokenNameFinderModel(
            getClass().getResourceAsStream("/models/en-ner-person.bin"));
        this.locationModel = new TokenNameFinderModel(
            getClass().getResourceAsStream("/models/en-ner-location.bin"));
        this.tokenizerModel = new TokenizerModel(
            getClass().getResourceAsStream("/models/en-token.bin"));
    }

    @Override
    public List<PiiMatch> detect(String text) {
        List<PiiMatch> matches = new ArrayList<>();
        Tokenizer tokenizer = new TokenizerME(tokenizerModel);
        String[] tokens = tokenizer.tokenize(text);

        // Detect person names. Note: OpenNLP Span start/end are token indices,
        // not character offsets — convert if downstream code expects offsets.
        NameFinderME personFinder = new NameFinderME(personModel);
        Span[] personSpans = personFinder.find(tokens);
        double[] probs = personFinder.probs(personSpans);

        for (int i = 0; i < personSpans.length; i++) {
            String name = String.join(" ", Arrays.copyOfRange(
                tokens, personSpans[i].getStart(), personSpans[i].getEnd()));
            matches.add(new PiiMatch(
                PiiCategory.PERSON_NAME, name,
                personSpans[i].getStart(), personSpans[i].getEnd(),
                "ner-opennlp", probs[i]
            ));
        }

        // Repeat for locations
        NameFinderME locationFinder = new NameFinderME(locationModel);
        Span[] locationSpans = locationFinder.find(tokens);
        double[] locationProbs = locationFinder.probs(locationSpans);

        for (int i = 0; i < locationSpans.length; i++) {
            String location = String.join(" ", Arrays.copyOfRange(
                tokens, locationSpans[i].getStart(), locationSpans[i].getEnd()));
            matches.add(new PiiMatch(
                PiiCategory.LOCATION, location,
                locationSpans[i].getStart(), locationSpans[i].getEnd(),
                "ner-opennlp", locationProbs[i]
            ));
        }

        return matches;
    }

    @Override
    public String getDetectorName() {
        return "ner-opennlp";
    }
}
```

A practical tip: NER models are not perfect. The default OpenNLP English person model achieves roughly 80–85% F1 on standard benchmarks. For production use, consider fine-tuning the model on your domain data — customer support tickets use different language patterns than medical records, and a model trained on newswire text may miss both. Alternatively, if you need higher accuracy, you can integrate with a cloud NLP API (Google Cloud DLP, AWS Comprehend) as a secondary detector, though this introduces latency and data-residency concerns that may conflict with your compliance goals.

Orchestrating the Detection Pipeline

With both detection layers built, the orchestrator composes them and handles deduplication, confidence thresholding, and output formatting:

```java
public class PiiScanner {

    private final List<PiiDetector> detectors;
    private final double confidenceThreshold;

    public PiiScanner(double confidenceThreshold) {
        this.confidenceThreshold = confidenceThreshold;
        this.detectors = new ArrayList<>();
    }

    public PiiScanner addDetector(PiiDetector detector) {
        this.detectors.add(detector);
        return this;
    }

    public ScanResult scan(String text, String sourceId) {
        List<PiiMatch> allMatches = new ArrayList<>();

        for (PiiDetector detector : detectors) {
            allMatches.addAll(detector.detect(text));
        }

        // Filter by confidence threshold
        List<PiiMatch> filtered = allMatches.stream()
            .filter(m -> m.confidence() >= confidenceThreshold)
            .collect(Collectors.toList());

        // Deduplicate overlapping matches, keeping highest confidence
        List<PiiMatch> deduplicated = deduplicateOverlaps(filtered);

        return new ScanResult(sourceId, deduplicated, Instant.now(), text.length());
    }

    private List<PiiMatch> deduplicateOverlaps(List<PiiMatch> matches) {
        matches.sort(Comparator.comparingInt(PiiMatch::start));
        List<PiiMatch> result = new ArrayList<>();
        PiiMatch current = null;

        for (PiiMatch match : matches) {
            if (current == null || match.start() >= current.end()) {
                current = match;
                result.add(current);
            } else if (match.confidence() > current.confidence()) {
                result.remove(result.size() - 1);
                current = match;
                result.add(current);
            }
        }
        return result;
    }
}
```

Usage is straightforward:

```java
PiiScanner scanner = new PiiScanner(0.70)
    .addDetector(new RegexDetector())
    .addDetector(new NerDetector());

ScanResult result = scanner.scan(
    "Contact John Smith at john.smith@example.com or 555-123-4567. " +
    "His SSN is 123-45-6789 and card number is 4532015112830366.",
    "customer_support_ticket_4821"
);

result.matches().forEach(m ->
    System.out.printf("[%s] '%s' (confidence: %.2f, detector: %s)%n",
        m.category(), m.value(), m.confidence(), m.detector())
);
```

Set your confidence threshold based on your risk tolerance. For GDPR Article 33, which requires breach notification within 72 hours, it's better to over-detect (threshold 0.60) and triage manually than to miss genuine PII (threshold 0.95) and discover it during an audit.

Scanning Data Sources at Scale

A detection engine is only useful if it can reach your data. In enterprise environments, PII lives across databases, file storage, message queues, and APIs. Here's a pattern for scanning a relational database:

```java
public class DatabaseScanner {

    private final PiiScanner scanner;
    private final DataSource dataSource;

    public DatabaseScanner(PiiScanner scanner, DataSource dataSource) {
        this.scanner = scanner;
        this.dataSource = dataSource;
    }

    public List<ScanResult> scanTable(String tableName, List<String> columns)
            throws SQLException {
        List<ScanResult> results = new ArrayList<>();

        try (Connection conn = dataSource.getConnection();
             Statement stmt = conn.createStatement()) {

            // Alias each cast column so rs.getString(col) still resolves
            String columnList = String.join(", ", columns.stream()
                .map(c -> "CAST(" + c + " AS VARCHAR(4000)) AS " + c)
                .toList());

            // Sample rows for large tables instead of a full scan
            ResultSet rs = stmt.executeQuery(
                "SELECT " + columnList + " FROM " + tableName
                + " TABLESAMPLE SYSTEM(10)"); // 10% sample

            while (rs.next()) {
                for (String col : columns) {
                    String value = rs.getString(col);
                    if (value != null && !value.isBlank()) {
                        String sourceId = tableName + "." + col;
                        ScanResult result = scanner.scan(value, sourceId);
                        if (!result.matches().isEmpty()) {
                            results.add(result);
                        }
                    }
                }
            }
        }
        return results;
    }
}
```

For large datasets, sampling is essential. Scanning every row of a 500-million-record table is neither practical nor necessary for a data inventory. A 10% statistical sample, combined with metadata analysis (column names like first_name, ssn, email are strong signals on their own), gives you coverage without the compute cost.
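
The metadata signal can be sketched as a simple name-based classifier that runs before any row is read (the patterns and category labels here are illustrative):

```java
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

public class ColumnNameHeuristic {

    // Column-name patterns that strongly suggest a PII category.
    // These four are examples; a real list would cover many more.
    private static final Map<String, Pattern> NAME_HINTS = Map.of(
        "EMAIL",       Pattern.compile("(?i).*e[_-]?mail.*"),
        "SSN",         Pattern.compile("(?i).*(ssn|social[_-]?sec).*"),
        "PERSON_NAME", Pattern.compile("(?i).*(first|last|full)[_-]?name.*"),
        "PHONE",       Pattern.compile("(?i).*(phone|mobile|tel).*")
    );

    // Returns the suspected PII category for a column name, if any.
    public static Optional<String> suspectedCategory(String columnName) {
        return NAME_HINTS.entrySet().stream()
            .filter(e -> e.getValue().matcher(columnName).matches())
            .map(Map.Entry::getKey)
            .findFirst();
    }
}
```

Columns flagged by name can be scanned with a larger sample (or in full), while unflagged columns keep the cheap 10% sample.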

For file-based scanning (CSV exports, log files, document stores), use Java's Files.walk() with a thread pool:

```java
ExecutorService executor = Executors.newFixedThreadPool(
    Runtime.getRuntime().availableProcessors());

List<Future<ScanResult>> futures = Files.walk(Path.of("/data/exports"))
    .filter(p -> p.toString().endsWith(".csv") || p.toString().endsWith(".log"))
    .map(path -> executor.submit(() -> {
        String content = Files.readString(path);
        return scanner.scan(content, path.toString());
    }))
    .toList();
```

Testing, Tuning, and Reducing False Positives

The difference between a PII scanner that gets used and one that gets ignored is its false positive rate. A scanner that flags every 9-digit number as an SSN will be disabled within a week.

Build a test suite with known PII samples and known non-PII samples:

```java
@Test
void shouldDetectValidSSN() {
    ScanResult result = scanner.scan("SSN: 123-45-6789", "test");
    assertThat(result.matches())
        .extracting(PiiMatch::category)
        .contains(PiiCategory.SSN);
}

@Test
void shouldNotFlagOrderIdAsSSN() {
    ScanResult result = scanner.scan("Order #123-45-6789-A placed", "test");
    assertThat(result.matches())
        .extracting(PiiMatch::category)
        .doesNotContain(PiiCategory.SSN);
}

@Test
void shouldDetectCreditCardWithLuhn() {
    // Valid Luhn: 4532015112830366
    ScanResult result = scanner.scan("Card: 4532015112830366", "test");
    assertThat(result.matches())
        .anyMatch(m -> m.category() == PiiCategory.CREDIT_CARD
            && m.confidence() > 0.90);
}
```

Key tuning strategies:

1. Context windows: Check the 10 characters before and after a match. If an SSN-shaped number follows "Order #" or "Invoice:", downgrade its confidence.
2. Allowlists: Maintain a list of known non-PII values (test data, system-generated IDs) and exclude them.
3. Multi-signal confirmation: If a regex detects an email address and the NER model detects a person name in the same paragraph, boost confidence for both.
4. Track precision and recall: Log every detection, have humans review a weekly sample, and feed corrections back into your confidence scoring.
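
The context-window strategy fits in a few lines. In this sketch, the window size, the prefix list, and the downgrade factor are all illustrative starting points to tune against your own data:

```java
import java.util.List;

public class ContextWindow {

    // Prefixes that suggest an SSN-shaped match is actually an internal ID.
    private static final List<String> DOWNGRADE_PREFIXES =
        List.of("order #", "invoice:", "ref:");
    private static final int WINDOW = 10;

    // Returns an adjusted confidence for a match starting at 'start' in text.
    public static double adjust(String text, int start, double confidence) {
        String before = text.substring(Math.max(0, start - WINDOW), start)
            .toLowerCase();
        for (String prefix : DOWNGRADE_PREFIXES) {
            if (before.contains(prefix)) {
                return confidence * 0.3; // strong signal this is not PII
            }
        }
        return confidence;
    }
}
```

The orchestrator can apply this as a post-processing pass over every match before the confidence threshold is evaluated.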

According to a 2024 IAPP survey, organizations that implemented automated PII discovery reduced their data breach response time by 48% and their DSAR (Data Subject Access Request) fulfillment time from an average of 22 days to 6 days. The investment in tuning pays compounding dividends.

FAQ

How accurate is regex-based PII detection compared to machine learning approaches?

Regex-based detection achieves near-perfect precision for structured PII formats — email addresses, credit card numbers, SSNs — where the format is well-defined. For these categories, you can expect 90–98% precision with proper checksum validation (Luhn for credit cards, modular arithmetic for IBANs). However, regex cannot detect unstructured PII like person names, physical addresses written in free text, or contextual identifiers. Machine learning approaches like NER models typically achieve 80–90% F1 scores on entity recognition tasks but require trained models and more compute resources. The best approach is a layered pipeline: use regex for structured formats and NLP for unstructured text, then combine results with confidence scoring.
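
The IBAN checksum mentioned above is the ISO 7064 mod-97 scheme: move the first four characters to the end, expand each letter to a two-digit number (A=10 through Z=35), and check that the resulting integer mod 97 equals 1. A sketch of the `validateIbanChecksum` helper:

```java
import java.math.BigInteger;

public class IbanCheck {

    // ISO 7064 mod-97 validation of an IBAN's check digits.
    public static boolean validateIbanChecksum(String iban) {
        String normalized = iban.replaceAll("\\s", "").toUpperCase();
        if (normalized.length() < 15 || normalized.length() > 34) return false;

        // Move the country code and check digits to the end
        String rearranged = normalized.substring(4) + normalized.substring(0, 4);

        StringBuilder numeric = new StringBuilder();
        for (char c : rearranged.toCharArray()) {
            if (Character.isDigit(c)) {
                numeric.append(c);
            } else if (c >= 'A' && c <= 'Z') {
                numeric.append(c - 'A' + 10); // A=10 ... Z=35
            } else {
                return false;
            }
        }
        // BigInteger handles the up-to-68-digit value comfortably
        return new BigInteger(numeric.toString())
            .mod(BigInteger.valueOf(97)).intValue() == 1;
    }
}
```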

What are the GDPR penalties for failing to identify PII in your systems?

GDPR penalties are tiered. Under Article 83(4), failures related to data processing obligations — which includes not maintaining an accurate record of processing activities or failing to identify where personal data is stored — can result in fines of up to €10 million or 2% of annual global turnover, whichever is higher. Under Article 83(5), violations of data processing principles or data subject rights can reach €20 million or 4% of turnover. In practice, regulators have issued substantial fines specifically for inadequate data inventories: the Italian DPA fined a telecom company €27.8 million in 2020 partly because the company could not account for where customer PII was stored and processed. Under CCPA, the California Attorney General can impose fines of $2,500 per unintentional violation and $7,500 per intentional violation — amounts that scale rapidly when thousands of consumer records are involved.

Can I use this approach to scan cloud storage like S3 or Google Cloud Storage?

Yes. The detection engine itself is data-source agnostic — it operates on strings. To scan cloud storage, you need a connector layer that streams objects from S3, GCS, or Azure Blob Storage and feeds the content to your scanner. For AWS S3, use the AWS SDK for Java v2 (S3Client) to list and read objects. For large files, stream them in chunks rather than loading the entire object into memory. Be mindful of data residency: if your GDPR compliance requires that EU citizen data stays within the EU, ensure your scanner runs in the same region as the storage bucket. Also consider using cloud-native services like AWS Macie or Google Cloud DLP as a complement — they offer built-in PII detection that runs within the cloud provider's infrastructure, avoiding cross-region data transfer.

How do I handle PII detection in multiple languages?

Multilingual PII detection is significantly more challenging than English-only scanning. Regex patterns for structured data (emails, credit cards, phone numbers) are largely language-independent, but you'll need locale-specific patterns for national IDs (e.g., German Personalausweis numbers, French INSEE numbers, Brazilian CPF numbers). For NER-based detection, you need language-specific models — OpenNLP provides models for German, Dutch, Spanish, and Portuguese, among others. Detect the language of input text first using a library like Apache Tika's language detector, then route to the appropriate NER model. For names, be aware that transliteration and diacritics create additional complexity: "José García" and "Jose Garcia" should both trigger detection. Consider using ICU4J for Unicode normalization before matching.
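
For the normalization step, the JDK's built-in `java.text.Normalizer` already covers the common case of stripping diacritics; ICU4J adds more transforms, but this sketch needs no extra dependency:

```java
import java.text.Normalizer;

public class NameNormalizer {

    // Decompose accented characters (NFD), then drop the combining marks,
    // so "José García" and "Jose Garcia" compare equal before matching.
    public static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }
}
```

Running allowlist lookups and name matching on the normalized form keeps one canonical representation per name.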

Should I build a custom PII scanner or use an existing tool?

Building a custom scanner makes sense when you have unique data formats, domain-specific PII categories (e.g., internal employee IDs that qualify as personal data under GDPR), strict data residency requirements that prevent using cloud APIs, or the need to integrate detection directly into your CI/CD pipeline or data processing framework. However, building and maintaining a production-grade PII scanner is a significant engineering investment — you need to handle dozens of PII categories across multiple locales, continuously tune for false positives, keep up with evolving regulations, and cover every data source in your infrastructure. For most organizations, the practical path is to use a proven tool like PrivaSift that handles the detection, classification, and compliance reporting out of the box, and reserve custom development for the edge cases specific to your domain.

Start Scanning for PII Today

PrivaSift automatically detects PII across your files, databases, and cloud storage — helping you stay GDPR and CCPA compliant without the manual work.

[Try PrivaSift Free →](https://privasift.com)
