How to Scan AWS S3 Buckets for PII

PrivaSift Team · Apr 01, 2026 · pii, pii-detection, security, compliance, data-privacy

How to Scan AWS S3 Buckets for PII: A Complete Guide for Security Teams

Your S3 buckets almost certainly contain personal data you don't know about. Customer CSV exports uploaded by the sales team. Debug logs with email addresses dumped during an incident. Database backups containing full user records sitting in a bucket that was "temporary" three years ago. According to Laminar's 2024 State of Public Cloud Data Security report, 52% of cloud data stores contain sensitive data, and 20% of S3 buckets with PII have overly permissive access controls.

The regulatory stakes have never been higher. In 2025, GDPR enforcement actions exceeded €2.1 billion in total fines, while the California Privacy Protection Agency ramped up CCPA audits targeting data inventory gaps. When the Irish DPC fined Meta €1.2 billion for insufficient data transfer controls, the investigation hinged on the company's inability to demonstrate exactly where personal data lived and how it moved. S3 buckets — the default dumping ground for unstructured data in AWS environments — are consistently the blind spot that auditors and regulators probe first.

If you're a CTO, DPO, or security engineer responsible for data compliance, you need a systematic approach to finding PII in S3 before a regulator, a breach, or an auditor finds it for you. This guide covers the complete workflow: from understanding what PII typically hides in S3 buckets, to building automated scanning pipelines that keep you compliant continuously.

Why S3 Buckets Are a PII Blind Spot

![Why S3 Buckets Are a PII Blind Spot](https://max.dnt-ai.ru/img/privasift/scan-aws-s3-buckets-for-pii_sec1.png)

Amazon S3 is the backbone of cloud storage for most AWS organizations. It's cheap, scalable, and frictionless — which is exactly the problem. Any developer, analyst, or automated process can create a bucket and dump data into it with minimal oversight.

Common scenarios where PII accumulates in S3 undetected:

  • ETL pipeline outputs: Data engineering jobs extract user records from production databases and land them in S3 as Parquet or CSV files — often without masking PII fields.
  • Application log archives: CloudWatch log groups exported to S3 frequently contain IP addresses, email addresses, user IDs, and sometimes full request/response bodies with PII.
  • Database backups: Automated RDS snapshots exported to S3 contain every row of every table — including all personal data.
  • File uploads: User-facing applications storing documents (ID scans, contracts, resumes) in S3 buckets.
  • Analytics exports: Business intelligence teams exporting customer segments, cohort data, and marketing lists.
  • Machine learning datasets: Training data derived from production user records, often stored without access controls matching the sensitivity of the data.

The average enterprise has hundreds to thousands of S3 buckets, and most teams discover buckets they didn't know existed the first time they run a complete inventory. Without active scanning, PII in these buckets is invisible to your compliance program.

What Counts as PII Under GDPR and CCPA

![What Counts as PII Under GDPR and CCPA](https://max.dnt-ai.ru/img/privasift/scan-aws-s3-buckets-for-pii_sec2.png)

Before scanning, you need to know what you're scanning for. GDPR and CCPA define personal data broadly, but with important differences.

Under GDPR (Article 4), personal data is any information relating to an identified or identifiable natural person. This includes:

  • Direct identifiers: name, email, phone number, government ID (SSN, passport), credit card number
  • Indirect identifiers: IP address, cookie ID, device fingerprint, employee ID
  • Special category data (Article 9): health records, biometric data, racial/ethnic origin, political opinions, religious beliefs, sexual orientation, trade union membership, genetic data

Under CCPA (§1798.140), personal information is information that identifies, relates to, or could reasonably be linked to a consumer or household. This extends to:

  • Geolocation data, browsing history, purchase history
  • Professional or employment-related information
  • Inferences drawn from other data to create a consumer profile

For S3 scanning, focus detection on these high-priority patterns:

| PII Type | Detection Pattern | Risk Level |
|---|---|---|
| Email addresses | RFC 5322 pattern matching | High |
| Social Security Numbers | `\d{3}-\d{2}-\d{4}` with validation | Critical |
| Credit card numbers | Luhn algorithm validation | Critical |
| Phone numbers | International format detection | High |
| IP addresses | IPv4/IPv6 patterns in non-infrastructure files | Medium |
| Names + addresses | NLP entity recognition | High |
| Passport/driver's license | Country-specific format patterns | Critical |
| Dates of birth | Date patterns in PII context | Medium |
| IBAN/bank account numbers | Country-prefix + checksum validation | Critical |
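The Luhn check referenced in the table can be sketched in a few lines of Python. A minimal version — real validators would also check issuer prefixes and per-network lengths:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:  # shortest real card numbers are 13 digits
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if the result > 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```

Running a candidate match through this check before reporting it cuts false positives dramatically, since most random 16-digit strings fail the checksum.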

Step 1: Inventory Your S3 Buckets and Access Patterns

![Step 1: Inventory Your S3 Buckets and Access Patterns](https://max.dnt-ai.ru/img/privasift/scan-aws-s3-buckets-for-pii_sec3.png)

You can't scan what you can't see. Start by building a complete inventory of every S3 bucket in your AWS organization.

List all buckets across all accounts

If you're running AWS Organizations, you need visibility across every member account — not just the management account.

```bash
# List all S3 buckets in the current account with creation dates and regions
aws s3api list-buckets --query 'Buckets[*].[Name,CreationDate]' --output table

# For each bucket, check the region and public access settings
for bucket in $(aws s3api list-buckets --query 'Buckets[*].Name' --output text); do
  region=$(aws s3api get-bucket-location --bucket "$bucket" \
    --query 'LocationConstraint' --output text 2>/dev/null)
  public=$(aws s3api get-public-access-block --bucket "$bucket" \
    --query 'PublicAccessBlockConfiguration' --output json 2>/dev/null)
  echo "$bucket | Region: $region | Public access block: $public"
done
```

Identify high-risk buckets

Prioritize scanning based on risk indicators:

1. Buckets without encryption: `aws s3api get-bucket-encryption --bucket BUCKET_NAME` — if this returns an error, encryption isn't configured.
2. Buckets with public access: Check for `BlockPublicAcls: false` or bucket policies allowing `s3:GetObject` to `"Principal": "*"`.
3. Buckets with broad IAM access: Review bucket policies and IAM policies granting `s3:GetObject` or `s3:*` to large groups.
4. Buckets with no lifecycle rules: Data with no expiration policy accumulates indefinitely.
5. Large buckets with mixed content types: Buckets storing both application assets and data exports.

Tag and classify buckets

Establish a tagging convention for data sensitivity:

```bash
aws s3api put-bucket-tagging --bucket my-data-bucket --tagging \
  'TagSet=[{Key=DataClassification,Value=Confidential},{Key=PIIScanStatus,Value=Pending},{Key=DataOwner,Value=engineering}]'
```

Step 2: Configure AWS-Native PII Detection with Amazon Macie

![Step 2: Configure AWS-Native PII Detection with Amazon Macie](https://max.dnt-ai.ru/img/privasift/scan-aws-s3-buckets-for-pii_sec4.png)

Amazon Macie is AWS's managed service for discovering sensitive data in S3. It uses machine learning and pattern matching to identify PII, financial data, and credentials.

Enable Macie and create a discovery job

```bash
# Enable Macie in the current region
aws macie2 enable-macie

# Create a classification job for specific buckets
aws macie2 create-classification-job \
  --job-type ONE_TIME \
  --name "pii-scan-q1-2026" \
  --s3-job-definition '{
    "bucketDefinitions": [{
      "accountId": "123456789012",
      "buckets": ["customer-data-prod", "analytics-exports", "app-logs-archive"]
    }],
    "scoping": {
      "includes": {
        "and": [{
          "simpleScopeTerm": {
            "comparator": "STARTS_WITH",
            "key": "OBJECT_KEY",
            "values": ["exports/", "data/", "logs/"]
          }
        }]
      }
    }
  }' \
  --description "Quarterly PII scan of high-risk S3 buckets"
```

Macie's limitations

Macie is a useful starting point but has real constraints:

  • Cost: Macie charges per GB scanned. For organizations with terabytes of S3 data, costs can escalate to thousands per month for comprehensive scans.
  • File format support: Macie handles common formats (CSV, JSON, Parquet, text, Excel) but struggles with PDFs, images with embedded text, and custom binary formats.
  • Detection accuracy: Macie's managed data identifiers cast a wide net, which means high false-positive rates for certain PII types — especially names and addresses that overlap with non-PII data.
  • No remediation workflow: Macie identifies PII but doesn't help you quarantine, mask, or delete it. You need separate tooling for response.
For comprehensive PII scanning that covers diverse file types, integrates with remediation workflows, and works across cloud providers — not just AWS — dedicated tools like PrivaSift fill the gaps that Macie leaves.

Step 3: Build a Custom PII Scanning Pipeline

For teams that need more control, deeper file format support, or cross-cloud scanning, building a custom pipeline on top of a PII detection engine gives you flexibility.

Architecture overview

```
S3 Event Notification → SQS Queue → Lambda/ECS Scanner → Results to DynamoDB → Alerts via SNS
```

Scan S3 objects with Python

```python
import boto3
import re
import json
from typing import Generator

s3 = boto3.client("s3")

PII_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:4\d{12}(?:\d{3})?|5[1-5]\d{14}|3[47]\d{13})\b"),
    "phone_us": re.compile(r"\b(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}(?:[A-Z0-9]?){0,16}\b"),
}

def scan_s3_object(bucket: str, key: str) -> dict:
    """Download and scan a single S3 object for PII patterns."""
    response = s3.get_object(Bucket=bucket, Key=key)
    content_type = response["ContentType"]
    size = response["ContentLength"]

    # Skip objects larger than 100 MB or known binary types
    if size > 100 * 1024 * 1024:
        return {"bucket": bucket, "key": key, "skipped": "too_large"}
    if content_type in ("application/octet-stream", "image/png", "image/jpeg"):
        return {"bucket": bucket, "key": key, "skipped": "binary"}

    body = response["Body"].read().decode("utf-8", errors="ignore")
    findings = {}

    for pii_type, pattern in PII_PATTERNS.items():
        matches = pattern.findall(body)
        if matches:
            findings[pii_type] = {
                "count": len(matches),
                "samples": list(set(matches[:3]))  # keep up to 3 unique samples
            }

    return {
        "bucket": bucket,
        "key": key,
        "size_bytes": size,
        "content_type": content_type,
        "pii_detected": bool(findings),
        "findings": findings,
    }

def scan_bucket(bucket: str, prefix: str = "") -> Generator:
    """Iterate through all objects in a bucket and scan each."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield scan_s3_object(bucket, obj["Key"])
```

Trigger scans automatically on new uploads

Configure S3 event notifications to scan every new object as it lands:

```python
import time

# Lambda handler triggered by S3 PutObject events
def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        result = scan_s3_object(bucket, key)

        if result.get("pii_detected"):
            # Store the finding in DynamoDB
            dynamodb = boto3.resource("dynamodb")
            table = dynamodb.Table("pii-scan-findings")
            table.put_item(Item={
                "bucket_key": f"{bucket}/{key}",
                "scan_timestamp": int(time.time()),
                **result
            })

            # Alert via SNS for critical PII types
            critical_types = {"ssn", "credit_card", "iban"}
            if critical_types & set(result["findings"].keys()):
                sns = boto3.client("sns")
                sns.publish(
                    TopicArn="arn:aws:sns:eu-west-1:123456789012:pii-alerts",
                    Subject=f"Critical PII detected in s3://{bucket}/{key}",
                    Message=json.dumps(result, indent=2)
                )
```
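For the handler to fire, the bucket needs an event notification wired to the function. A minimal sketch — the account ID, function ARN, and bucket name are placeholders:

```json
{
  "LambdaFunctionConfigurations": [{
    "Id": "pii-scan-on-upload",
    "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:s3-pii-scanner",
    "Events": ["s3:ObjectCreated:*"]
  }]
}
```

Apply it with `aws s3api put-bucket-notification-configuration --bucket my-data-bucket --notification-configuration file://notification.json`, and remember to grant S3 permission to invoke the function (`aws lambda add-permission`) first.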

Step 4: Handle Common S3 File Formats

S3 buckets hold data in dozens of formats. Your scanner needs to handle each appropriately.

CSV and TSV files

The most common format for data exports. Parse with Python's csv module and scan cell-by-cell rather than treating the entire file as a text blob — this lets you identify which columns contain PII.

```python
import csv
import io

def scan_csv_from_s3(bucket: str, key: str) -> dict:
    response = s3.get_object(Bucket=bucket, Key=key)
    content = response["Body"].read().decode("utf-8", errors="ignore")
    reader = csv.DictReader(io.StringIO(content))

    column_pii = {}  # Track PII types found per column
    for row in reader:
        for col_name, value in row.items():
            if not value:
                continue
            for pii_type, pattern in PII_PATTERNS.items():
                if pattern.search(value):
                    column_pii.setdefault(col_name, set()).add(pii_type)

    return {
        "bucket": bucket,
        "key": key,
        "format": "csv",
        "columns_with_pii": {k: list(v) for k, v in column_pii.items()}
    }
```

Parquet files

Common in data engineering pipelines. Use pyarrow to read Parquet metadata and sample data without loading the entire file:

```python
import pyarrow.parquet as pq
from pyarrow import fs

def scan_parquet_from_s3(bucket: str, key: str, sample_rows: int = 1000):
    # Stream only the first batch of rows instead of loading the whole file
    s3fs, path = fs.FileSystem.from_uri(f"s3://{bucket}/{key}")
    with s3fs.open_input_file(path) as f:
        pf = pq.ParquetFile(f)
        batch = next(pf.iter_batches(batch_size=sample_rows))
        df = batch.to_pandas()

    column_pii = {}
    for col in df.columns:
        col_str = df[col].astype(str)
        for pii_type, pattern in PII_PATTERNS.items():
            if col_str.str.contains(pattern, na=False).any():
                column_pii.setdefault(col, []).append(pii_type)

    return column_pii
```

JSON and JSONL files

API response logs, event streams, and application data often land in S3 as JSON. Flatten nested structures before scanning:

```python
import json

def flatten_json(obj, prefix=""):
    """Recursively flatten a nested JSON object."""
    items = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            items.update(flatten_json(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            items.update(flatten_json(v, f"{prefix}[{i}]."))
    else:
        items[prefix.rstrip(".")] = str(obj)
    return items
```
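Combining a flattener of this shape with the Step 3 patterns gives a per-record JSONL scan. A self-contained sketch — the flattener and a two-pattern subset of `PII_PATTERNS` are repeated here so the block stands alone:

```python
import json
import re

# Two-pattern subset of the PII_PATTERNS dict from Step 3
PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def flatten_json(obj, prefix=""):
    """Recursively flatten a nested JSON object into dotted-path keys."""
    items = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            items.update(flatten_json(v, f"{prefix}{k}."))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            items.update(flatten_json(v, f"{prefix}[{i}]."))
    else:
        items[prefix.rstrip(".")] = str(obj)
    return items

def scan_jsonl_line(line: str) -> dict:
    """Scan one JSONL record; map each field path to the PII types it contains."""
    findings = {}
    for path, value in flatten_json(json.loads(line)).items():
        for pii_type, pattern in PATTERNS.items():
            if pattern.search(value):
                findings.setdefault(path, []).append(pii_type)
    return findings
```

Reporting the dotted field path (e.g. `user.email`) rather than just "PII found" makes remediation far easier, because it tells the data owner exactly which field to mask at the source.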

Compressed archives

S3 frequently stores .gz, .zip, and .tar.gz files. Always decompress before scanning — PII in compressed files is just as regulated as PII in plaintext.
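A sketch of that decompression step, covering the three archive formats named above before handing text to the scanner (encodings and nested archives are handled naively here):

```python
import gzip
import io
import tarfile
import zipfile

def extract_texts(raw: bytes, key: str):
    """Yield (member_name, text) pairs for common compressed S3 object types."""
    if key.endswith((".tar.gz", ".tgz")):
        with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as tf:
            for member in tf.getmembers():
                if member.isfile():
                    text = tf.extractfile(member).read().decode("utf-8", errors="ignore")
                    yield member.name, text
    elif key.endswith(".gz"):
        yield key[:-3], gzip.decompress(raw).decode("utf-8", errors="ignore")
    elif key.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(raw)) as zf:
            for name in zf.namelist():
                yield name, zf.read(name).decode("utf-8", errors="ignore")
    else:
        # Not compressed: scan the object as-is
        yield key, raw.decode("utf-8", errors="ignore")
```

Note the `.tar.gz` check must come before the plain `.gz` check, or tarballs would be gunzipped but never unpacked. Cap the decompressed size in production to avoid zip-bomb style resource exhaustion.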

Step 5: Classify Findings and Prioritize Remediation

Scanning without a remediation workflow just generates a list of problems. You need a triage process.

Severity classification

| Severity | Criteria | Response SLA |
|---|---|---|
| Critical | SSN, credit card, or government ID in a publicly accessible bucket | 4 hours |
| High | PII in unencrypted bucket, or PII accessible by overly broad IAM roles | 24 hours |
| Medium | PII in encrypted bucket but without documented purpose or retention | 1 week |
| Low | Email addresses or IP addresses in internal analytics data with proper access controls | Next quarterly review |
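The table can be encoded as a small triage function. A sketch — the bucket-posture flags (`bucket_public`, `bucket_encrypted`, `broad_iam_access`) are hypothetical fields assumed to be joined onto each finding from the Step 1 inventory:

```python
CRITICAL_TYPES = {"ssn", "credit_card", "iban", "passport"}

def classify_severity(finding: dict) -> str:
    """Map a scan finding plus bucket posture to a severity tier."""
    pii_types = set(finding.get("findings", {}))
    if not pii_types:
        return "none"
    if pii_types & CRITICAL_TYPES and finding.get("bucket_public", False):
        return "critical"
    if not finding.get("bucket_encrypted", True) or finding.get("broad_iam_access", False):
        return "high"
    if pii_types - {"email", "ip_address"}:
        return "medium"
    return "low"
```

Encoding the policy as code means the same tiers drive alert routing, dashboards, and SLA timers, instead of living only in a wiki table.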

Remediation actions

For each finding, choose the appropriate response:

1. Delete: Data has no business purpose and no legal retention requirement. Remove it.
2. Mask/anonymize: Data is needed for analytics but doesn't require real PII. Replace SSNs, emails, and names with synthetic values.
3. Encrypt: Data has a valid purpose but is stored unencrypted. Apply SSE-S3 or SSE-KMS encryption.
4. Restrict access: Data is properly stored but accessible too broadly. Tighten bucket policies and IAM roles.
5. Document: Data is properly handled but not reflected in your data inventory. Update your RoPA.
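For the mask/anonymize action, a simple regex-substitution pass is often enough for flat-text exports. A sketch — the replacement values are arbitrary placeholders, and real anonymization for analytics usually needs consistent tokenization rather than a single fixed value:

```python
import re

# Ordered masking rules: (pattern, replacement)
MASKS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "XXX-XX-XXXX"),  # SSNs
    # Keep the domain so aggregate analytics still work, drop the local part
    (re.compile(r"[a-zA-Z0-9_.+-]+@([a-zA-Z0-9-]+\.[a-zA-Z]{2,})"), r"user@\1"),
]

def mask_text(text: str) -> str:
    """Replace detected PII values with synthetic placeholders."""
    for pattern, replacement in MASKS:
        text = pattern.sub(replacement, text)
    return text
```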

Automate quarantine for critical findings

```python
def quarantine_object(bucket: str, key: str, finding: dict):
    """Move a PII-containing object to a quarantine bucket with restricted access."""
    quarantine_bucket = f"{bucket}-quarantine"
    s3.copy_object(
        CopySource={"Bucket": bucket, "Key": key},
        Bucket=quarantine_bucket,
        Key=f"quarantine/{key}",
        ServerSideEncryption="aws:kms",
        Metadata={
            "quarantine-reason": "pii-detected",
            "pii-types": ",".join(finding["findings"].keys()),
            "original-bucket": bucket
        },
        MetadataDirective="REPLACE"
    )
    s3.delete_object(Bucket=bucket, Key=key)
```

Step 6: Automate Continuous PII Monitoring

One-time scans find today's PII. Continuous monitoring catches tomorrow's.

Schedule recurring scans with EventBridge

```json
{
  "ScheduleExpression": "rate(7 days)",
  "Target": {
    "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:s3-pii-scanner",
    "Input": "{\"buckets\": [\"customer-data-prod\", \"analytics-exports\", \"app-logs-archive\"]}"
  }
}
```

Integrate with your SIEM

Send PII findings to your security information and event management platform for centralized alerting and incident tracking:

```python
# Send findings to CloudWatch Logs for SIEM ingestion. In Lambda, anything
# written through the standard logger lands in CloudWatch Logs automatically,
# where a subscription filter can forward it to your SIEM.
import json
import logging

logger = logging.getLogger("pii-scanner")
logger.setLevel(logging.INFO)

def report_finding(finding: dict):
    logger.info(json.dumps({
        "event_type": "pii_detection",
        "severity": classify_severity(finding),
        "bucket": finding["bucket"],
        "object_key": finding["key"],
        "pii_types": list(finding["findings"].keys()),
        "pii_counts": {k: v["count"] for k, v in finding["findings"].items()}
    }))
```

Track compliance metrics over time

Monitor these KPIs monthly:

  • PII exposure rate: Percentage of scanned objects containing PII
  • Mean time to remediate (MTTR): Average time from PII detection to resolution
  • Unclassified bucket rate: Percentage of buckets without a DataClassification tag
  • Scan coverage: Percentage of total S3 storage scanned in the last 30 days
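The first of these KPIs falls straight out of the Step 3 scan results. A minimal sketch, assuming each result dict has the `pii_detected` and `skipped` fields produced by `scan_s3_object`:

```python
def exposure_rate(results: list) -> float:
    """PII exposure rate: share of successfully scanned objects containing PII."""
    scanned = [r for r in results if "skipped" not in r]
    if not scanned:
        return 0.0
    return sum(1 for r in scanned if r.get("pii_detected")) / len(scanned)
```

Track the rate per bucket as well as globally; a bucket whose exposure rate jumps between scans usually signals a new pipeline writing unmasked data.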

Frequently Asked Questions

How much does it cost to scan S3 buckets for PII?

Costs vary significantly by approach. Amazon Macie charges $1.00 per GB for the first 50,000 GB scanned per month, dropping to $0.25/GB at higher volumes — meaning a 10 TB scan costs roughly $10,000 with Macie alone. Custom pipelines using Lambda and S3 GET requests are cheaper on compute (Lambda free tier covers ~1 million invocations/month) but require engineering time to build and maintain. The hidden cost is data transfer: scanning objects requires reading them, so factor in S3 GET request costs ($0.0004 per 1,000 requests) and potential cross-region transfer fees. For most organizations, a hybrid approach works best — use Macie for initial discovery, then a purpose-built tool like PrivaSift for ongoing monitoring with broader file format support and lower per-GB costs.

Can AWS Macie detect all types of PII in S3?

No. Macie covers common PII types well — email addresses, credit card numbers, AWS credentials, SSNs, and several country-specific identifiers. However, it has blind spots. Macie struggles with: PII embedded in PDFs and images (no OCR capability), PII in custom or proprietary file formats, context-dependent PII where a value is only personal data in combination with other fields (e.g., a zip code that becomes identifying when combined with age and gender), and PII in languages that use non-Latin scripts. Macie also doesn't detect PII patterns specific to many non-US jurisdictions without custom data identifiers. For comprehensive coverage, supplement Macie with content-level scanning tools that support NLP-based entity recognition, OCR, and broader international PII pattern libraries.

What happens if we find PII in a publicly accessible S3 bucket?

Treat it as a potential data breach. Under GDPR Article 33, you must notify your supervisory authority within 72 hours of becoming aware of a breach that poses a risk to data subjects' rights. Under CCPA, notification requirements vary but generally require prompt disclosure to affected consumers. Immediate steps: (1) Block public access to the bucket immediately using aws s3api put-public-access-block. (2) Assess the scope — what data was exposed, for how long, and how many data subjects are affected. Check S3 server access logs and CloudTrail to determine if the data was actually accessed. (3) Engage your incident response team and legal counsel. (4) Document everything for regulatory notification. (5) Remediate — move PII to an encrypted, access-controlled location. The 2023 enforcement action by the FTC against Drizly (an Uber subsidiary) following an S3 data exposure resulted in a 20-year compliance order, demonstrating that regulators take cloud storage misconfigurations seriously.

How do we prevent PII from landing in S3 in the first place?

Prevention is always cheaper than detection. Implement these controls: (1) S3 bucket policies that require encryption (aws:SecureTransport condition) and deny uploads without server-side encryption. (2) AWS SCPs (Service Control Policies) that prevent creating public buckets or disabling default encryption across your organization. (3) CI/CD scanning — scan test fixtures, seed data, and configuration files for PII before they reach any environment. (4) PII redaction in ETL pipelines — mask or hash PII fields at the extraction stage, not after data lands in S3. (5) S3 Object Lock on quarantine buckets to prevent premature deletion of flagged data during investigation. (6) Developer training — engineers should know that dumping a production database export into S3 "temporarily" creates a compliance liability the moment it lands.
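Control (1) can be expressed as a bucket policy that denies uploads lacking server-side encryption. A sketch — the bucket name is a placeholder, and note that S3 now applies SSE-S3 by default to new objects, so this policy mainly matters where you require SSE-KMS specifically:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyUnencryptedUploads",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::my-data-bucket/*",
    "Condition": {
      "StringNotEquals": {
        "s3:x-amz-server-side-encryption": ["AES256", "aws:kms"]
      }
    }
  }]
}
```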

Should we scan all S3 buckets or just the ones we think contain PII?

Scan all of them. The buckets you "think" don't contain PII are exactly where unreported PII accumulates — debug log archives, temporary export buckets, build artifact storage, ML training datasets. A pragmatic phased approach: scan all buckets in a lightweight pass (sampling the first 1,000 objects per bucket) to identify which buckets warrant deep scanning, then run comprehensive scans on flagged buckets. Tag every bucket with its scan status and last scan date. The goal is 100% scan coverage within your first quarter, with continuous monitoring going forward. Data protection authorities expect you to know where all personal data resides — "we didn't think that bucket had PII" is not a defense that holds up in enforcement proceedings.
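The phased approach described above reduces to a simple triage rule once the sampling pass has run. A sketch — the 1% threshold is an arbitrary starting point, not a recommendation:

```python
def needs_deep_scan(sample_results: list, threshold: float = 0.01) -> bool:
    """Flag a bucket for a comprehensive scan if more than `threshold` of
    its sampled objects contained PII in the lightweight first pass."""
    if not sample_results:
        return False
    hits = sum(1 for r in sample_results if r.get("pii_detected"))
    return hits / len(sample_results) > threshold
```

Even a single hit in a small sample is usually worth investigating, so many teams set the threshold to zero for buckets that are supposed to be PII-free.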

Start Scanning for PII Today

PrivaSift automatically detects PII across your files, databases, and cloud storage — helping you stay GDPR and CCPA compliant without the manual work.

[Try PrivaSift Free →](https://privasift.com)

Scan your data for PII — free, no setup required

Try PrivaSift