> ## Documentation Index > Fetch the complete documentation index at: https://docs.unsiloed.ai/docs/llms.txt > Use this file to discover all available pages before exploring further. # Redact Sensitive Data From a Document > Find personal data in a PDF with extraction, locate every occurrence with parsing, and black it out to produce a clean redacted PDF. Before a document leaves your hands, the sensitive parts often have to go: names, account numbers, email addresses, phone numbers. Doing it by hand is slow, and a single missed occurrence leaks the very data you meant to protect. Worse, the obvious shortcut, drawing a black box over the text in a PDF viewer, doesn't actually remove anything. The text still sits underneath, ready to be copied straight back out. Governments and law firms have leaked data exactly this way. This recipe redacts properly. It uses extraction to work out what the sensitive values are, parsing to find every place they appear on the page, and then removes them, producing a `redacted.pdf` you can safely share. You don't have to know the values in advance; the extractor reads them off the document for you. An account statement before and after redaction, with the name, account number, email, and phone number blacked out

An account statement before and after redaction, with the name, account number, email, and phone number blacked out

This recipe chains two endpoints covered on their own in the [Extract quickstart](/document-processing/extraction/quickstart) and the [parsing guide](/document-processing/parsing/parsing). Read those first if you want the detail on either call in isolation. ## What We'll Build A script that: 1. Sends a document to `/v2/extract` with a schema describing the sensitive fields, and reads back their values. 2. Sends the same document to `/parse` to get every word on the page with its bounding box. 3. Matches each sensitive value against the parsed words to find every occurrence, not just the first. 4. Blacks out each match and writes a `redacted.pdf` with the underlying text removed. We'll run it against a one-page account statement that repeats the same personal details several times. Grab the full script from the dropdown below if you'd rather skip the walkthrough. Set `UNSILOED_API_KEY` in your environment and save the document as `document.pdf` in the same directory before running. ```python redact.py theme={null} import json import os import re import time import fitz # PyMuPDF import requests API_KEY = os.environ["UNSILOED_API_KEY"] BASE_URL = "https://prod.visionapi.unsiloed.ai" # The kinds of information we want gone. Each description tells the extractor # what to look for, so adapt these fields to your own document. SCHEMA = { "type": "object", "properties": { "full_name": {"type": "string", "description": "The full name of the account holder"}, "account_number": {"type": "string", "description": "The bank account number"}, "email": {"type": "string", "description": "The account holder's email address"}, "phone": {"type": "string", "description": "The account holder's phone number"}, }, "required": ["full_name", "account_number", "email", "phone"], "additionalProperties": False, } def wait_for(url): for _ in range(90): # roughly 6 minutes at 4 seconds per poll job = requests.get(url, headers={"api-key": API_KEY}).json() status = job.get("status") if status in ("completed", "review", "Succeeded"): return job if status in ("failed", "Failed", "Cancelled"): # The extractor reports its reason in "error"; the parser uses "message". raise RuntimeError(job.get("error") or job.get("message") or f"job {status}") time.sleep(4) raise TimeoutError(f"Job at {url} did not finish in time") # Step 1: ask the extractor what the sensitive values are. with open("document.pdf", "rb") as f: resp = requests.post( f"{BASE_URL}/v2/extract", headers={"api-key": API_KEY}, files={"pdf_file": ("document.pdf", f, "application/pdf")}, data={"schema_data": json.dumps(SCHEMA), "model": "gamma"}, ) resp.raise_for_status() job = resp.json() result = wait_for(f"{BASE_URL}/extract/{job['job_id']}")["result"] sensitive = [] for field in SCHEMA["properties"]: value = result[field]["value"] if value is None: print(f"Field '{field}' not found in the document, skipping.") continue sensitive.append(value) print("Will redact:", sensitive) # Step 2: parse the document to get every word with a page-relative box. with open("document.pdf", "rb") as f: resp = requests.post( f"{BASE_URL}/parse", headers={"api-key": API_KEY}, files={"file": ("document.pdf", f, "application/pdf")}, ) resp.raise_for_status() job = resp.json() parsed = wait_for(f"{BASE_URL}/parse/{job['job_id']}") def normalize(text): return re.sub(r"[^a-z0-9]", "", str(text).lower()) words = [] for chunk in parsed["chunks"]: for seg in chunk["segments"]: sx, sy = seg["bbox"]["left"], seg["bbox"]["top"] for word in seg.get("ocr") or []: norm = normalize(word["text"]) if not norm: continue b = word["bbox"] words.append({ "norm": norm, "page": seg["page_number"], "page_width": seg["page_width"], "x": sx + b["left"], "width": b["width"], # y + height lands on the text baseline reliably, even when the # reported height is not, so we anchor redaction boxes to it. "baseline": sy + b["top"] + b["height"], }) # With no words there is nothing to match, and the script would "succeed" # by writing an unredacted copy. Fail loudly instead. if not words: raise RuntimeError("Parse returned no OCR words; refusing to write an unredacted PDF") # Step 3: find every occurrence of each value by normalized concatenation, so a # value matches however OCR split it (hyphens, "@") or wrapped across lines. def find_runs(value): target = normalize(value) runs = [] if len(target) < 3: return runs for i in range(len(words)): acc = "" for j in range(i, min(i + 10, len(words))): if words[j]["page"] != words[i]["page"]: break acc += words[j]["norm"] if acc == target: runs.append(words[i:j + 1]) break if not target.startswith(acc): break return runs # Step 4: turn each run into one box per line, anchored to the baseline. BAND_ABOVE, BAND_BELOW, PAD = 19, 6, 2 def line_boxes(run): lines, current = [], [run[0]] for word in run[1:]: same_line = abs(word["baseline"] - current[-1]["baseline"]) <= 12 if same_line and word["x"] >= current[-1]["x"] - 2: current.append(word) else: lines.append(current) current = [word] lines.append(current) boxes = [] for line in lines: baseline = max(w["baseline"] for w in line) boxes.append(( min(w["x"] for w in line) - PAD, baseline - BAND_ABOVE, max(w["x"] + w["width"] for w in line) + PAD, baseline + BAND_BELOW, )) return boxes doc = fitz.open("document.pdf") boxes_by_page = {} for value in sensitive: for run in find_runs(value): page_index = run[0]["page"] - 1 # Parse boxes are in render pixels; scale them back to PDF points. scale = run[0]["page_width"] / doc[page_index].rect.width for x0, y0, x1, y1 in line_boxes(run): rect = fitz.Rect(x0 / scale, y0 / scale, x1 / scale, y1 / scale) boxes_by_page.setdefault(page_index, []).append(rect) # apply_redactions deletes any text that touches the redaction rect, even # by a fraction of a point, so we delete with a rect shrunk clear of the # neighboring lines, then draw the full-height box as plain ink afterwards. for page_index, rects in boxes_by_page.items(): page = doc[page_index] for rect in rects: page.add_redact_annot(rect + (0, 2, 0, -2), fill=(0, 0, 0)) page.apply_redactions() for rect in rects: page.draw_rect(rect, fill=(0, 0, 0), color=None) doc.save("redacted.pdf") print("Wrote redacted.pdf") ``` Save this as `redact.mjs` or set `"type": "module"` in your `package.json`. Requires Node.js 18 or newer for the global `fetch`, `FormData`, and `Blob`. ```javascript redact.mjs theme={null} import { readFileSync, writeFileSync } from "node:fs"; import { createCanvas } from "@napi-rs/canvas"; import { PDFDocument } from "pdf-lib"; import * as pdfjsLib from "pdfjs-dist/legacy/build/pdf.mjs"; const API_KEY = process.env.UNSILOED_API_KEY; const BASE_URL = "https://prod.visionapi.unsiloed.ai"; // The kinds of information we want gone. Each description tells the extractor // what to look for, so adapt these fields to your own document. const SCHEMA = { type: "object", properties: { full_name: { type: "string", description: "The full name of the account holder" }, account_number: { type: "string", description: "The bank account number" }, email: { type: "string", description: "The account holder's email address" }, phone: { type: "string", description: "The account holder's phone number" }, }, required: ["full_name", "account_number", "email", "phone"], additionalProperties: false, }; async function waitFor(url) { for (let i = 0; i < 90; i++) { // roughly 6 minutes at 4 seconds per poll const job = await (await fetch(url, { headers: { "api-key": API_KEY } })).json(); if (["completed", "review", "Succeeded"].includes(job.status)) return job; // The extractor reports its reason in "error"; the parser uses "message". if (["failed", "Failed", "Cancelled"].includes(job.status)) throw new Error(job.error || job.message || `job ${job.status}`); await new Promise((r) => setTimeout(r, 4000)); } throw new Error(`Job at ${url} did not finish in time`); } const bytes = readFileSync("document.pdf"); // Step 1: ask the extractor what the sensitive values are. const extractForm = new FormData(); extractForm.append("pdf_file", new Blob([bytes], { type: "application/pdf" }), "document.pdf"); extractForm.append("schema_data", JSON.stringify(SCHEMA)); extractForm.append("model", "gamma"); const extractRes = await fetch(`${BASE_URL}/v2/extract`, { method: "POST", headers: { "api-key": API_KEY }, body: extractForm }); if (!extractRes.ok) throw new Error(`extract submit failed: HTTP ${extractRes.status} ${await extractRes.text()}`); const extractJob = await extractRes.json(); const result = (await waitFor(`${BASE_URL}/extract/${extractJob.job_id}`)).result; const sensitive = []; for (const field of Object.keys(SCHEMA.properties)) { const value = result[field].value; if (value == null) { console.log(`Field '${field}' not found in the document, skipping.`); continue; } sensitive.push(value); } console.log("Will redact:", sensitive); // Step 2: parse to get every word with a page-relative box. const parseForm = new FormData(); parseForm.append("file", new Blob([bytes], { type: "application/pdf" }), "document.pdf"); const parseRes = await fetch(`${BASE_URL}/parse`, { method: "POST", headers: { "api-key": API_KEY }, body: parseForm }); if (!parseRes.ok) throw new Error(`parse submit failed: HTTP ${parseRes.status} ${await parseRes.text()}`); const parseJob = await parseRes.json(); const parsed = await waitFor(`${BASE_URL}/parse/${parseJob.job_id}`); const normalize = (text) => String(text).toLowerCase().replace(/[^a-z0-9]/g, ""); const words = []; for (const chunk of parsed.chunks) { for (const seg of chunk.segments) { const { left: sx, top: sy } = seg.bbox; for (const word of seg.ocr || []) { const norm = normalize(word.text); if (!norm) continue; const b = word.bbox; words.push({ norm, page: seg.page_number, pageWidth: seg.page_width, x: sx + b.left, width: b.width, // y + height lands on the baseline reliably; the height alone does not. baseline: sy + b.top + b.height, }); } } } // With no words there is nothing to match, and the script would "succeed" // by writing an unredacted copy. Fail loudly instead. if (words.length === 0) throw new Error("Parse returned no OCR words; refusing to write an unredacted PDF"); // Step 3: find every occurrence by normalized concatenation, so a value matches // however OCR split it (hyphens, "@") or wrapped across lines. function findRuns(value) { const target = normalize(value); const runs = []; if (target.length < 3) return runs; for (let i = 0; i < words.length; i++) { let acc = ""; for (let j = i; j < words.length && words[j].page === words[i].page && j - i < 10; j++) { acc += words[j].norm; if (acc === target) { runs.push(words.slice(i, j + 1)); break; } if (!target.startsWith(acc)) break; } } return runs; } // Step 4: turn each run into one box per line, anchored to the baseline. const BAND_ABOVE = 19, BAND_BELOW = 6, PAD = 2; function lineBoxes(run) { const lines = []; let current = [run[0]]; for (const word of run.slice(1)) { const sameLine = Math.abs(word.baseline - current[current.length - 1].baseline) <= 12; if (sameLine && word.x >= current[current.length - 1].x - 2) current.push(word); else { lines.push(current); current = [word]; } } lines.push(current); return lines.map((line) => { const baseline = Math.max(...line.map((w) => w.baseline)); return [ Math.min(...line.map((w) => w.x)) - PAD, baseline - BAND_ABOVE, Math.max(...line.map((w) => w.x + w.width)) + PAD, baseline + BAND_BELOW, ]; }); } const boxesByPage = new Map(); for (const value of sensitive) { for (const run of findRuns(value)) { const page = run[0].page; if (!boxesByPage.has(page)) boxesByPage.set(page, []); boxesByPage.get(page).push(...lineBoxes(run)); } } // Step 5: render each page to an image, paint the boxes, and rebuild the PDF. // Flattening to an image removes the text for good; a drawn box alone would not. const pageWidths = new Map(words.map((w) => [w.page, w.pageWidth])); const fonts = "./node_modules/pdfjs-dist/standard_fonts/"; const pdf = await pdfjsLib.getDocument({ data: new Uint8Array(bytes), standardFontDataUrl: fonts }).promise; const out = await PDFDocument.create(); for (let pageNo = 1; pageNo <= pdf.numPages; pageNo++) { const page = await pdf.getPage(pageNo); const base = page.getViewport({ scale: 1.0 }); // page size in PDF points // Render at whatever scale matches parse's pixel grid for this page. const scale = (pageWidths.get(pageNo) ?? base.width * 2) / base.width; const viewport = page.getViewport({ scale }); const canvas = createCanvas(viewport.width, viewport.height); const ctx = canvas.getContext("2d"); await page.render({ canvasContext: ctx, viewport, canvas }).promise; ctx.fillStyle = "#000"; for (const [x0, y0, x1, y1] of boxesByPage.get(pageNo) || []) { ctx.fillRect(x0, y0, x1 - x0, y1 - y0); } const png = await out.embedPng(canvas.toBuffer("image/png")); const outPage = out.addPage([base.width, base.height]); outPage.drawImage(png, { x: 0, y: 0, width: base.width, height: base.height }); } writeFileSync("redacted.pdf", await out.save()); console.log("Wrote redacted.pdf"); ``` ## Step 1: Set Up Your Environment Before writing any code, we need three things: an API key, a document to redact, and a few libraries for the chosen language. ### 1.1 Get an Unsiloed AI API Key To get API access, [sign up on Unsiloed AI](https://cal.com/aman-mishra-p0ry57/15min). Export your key as an environment variable so it stays out of source control: ```bash theme={null} export UNSILOED_API_KEY="your-api-key" ``` ### 1.2 Pick a Document The walkthrough assumes a PDF saved as `document.pdf` in your working directory. To follow along with the exact output shown below, download our [sample account statement](https://raw.githubusercontent.com/Unsiloed-AI/cookbook/9c80a90e0315a33c9b8a68d8b3355199771b598f/sample-documents/sample-statement.pdf): a one-page letter that repeats the account holder's name, account number, and email so we can prove every occurrence gets removed, not just the first. ### 1.3 Install Dependencies The API calls are plain HTTP, but writing the redacted file means rendering the PDF, so each language needs a couple of libraries. You need Python 3.8 or newer. Install `requests` for the API calls and `PyMuPDF` for the redaction: ```bash theme={null} pip install requests PyMuPDF ``` You need Node.js 18 or newer for the global `fetch`, `FormData`, and `Blob`. Install the PDF rendering and writing libraries: ```bash theme={null} npm install pdfjs-dist @napi-rs/canvas pdf-lib ``` `pdfjs-dist` renders each page, `@napi-rs/canvas` gives it a canvas to draw on (with prebuilt binaries, so there are no system dependencies), and `pdf-lib` assembles the redacted pages back into a PDF. ## Step 2: Find the Sensitive Values With Extraction We start with extraction rather than a list of search terms because we don't want to hard-code the data we're protecting. A schema describes the kinds of information that are sensitive, and the extractor reads the actual values off the document. Point it at a different statement and it finds that person's details instead. ### 2.1 Describe the Sensitive Fields Each field is a name and a description. The description tells the extractor what to look for, so the clearer it is, the more reliable the result. We're after four pieces of personal data. Create a file called `redact.py` with the configuration and schema: ```python redact.py theme={null} import json import os import re import time import fitz # PyMuPDF import requests API_KEY = os.environ["UNSILOED_API_KEY"] BASE_URL = "https://prod.visionapi.unsiloed.ai" SCHEMA = { "type": "object", "properties": { "full_name": {"type": "string", "description": "The full name of the account holder"}, "account_number": {"type": "string", "description": "The bank account number"}, "email": {"type": "string", "description": "The account holder's email address"}, "phone": {"type": "string", "description": "The account holder's phone number"}, }, "required": ["full_name", "account_number", "email", "phone"], "additionalProperties": False, } ``` Create a file called `redact.mjs` with the configuration and schema: ```javascript redact.mjs theme={null} import { readFileSync, writeFileSync } from "node:fs"; import { createCanvas } from "@napi-rs/canvas"; import { PDFDocument } from "pdf-lib"; import * as pdfjsLib from "pdfjs-dist/legacy/build/pdf.mjs"; const API_KEY = process.env.UNSILOED_API_KEY; const BASE_URL = "https://prod.visionapi.unsiloed.ai"; const SCHEMA = { type: "object", properties: { full_name: { type: "string", description: "The full name of the account holder" }, account_number: { type: "string", description: "The bank account number" }, email: { type: "string", description: "The account holder's email address" }, phone: { type: "string", description: "The account holder's phone number" }, }, required: ["full_name", "account_number", "email", "phone"], additionalProperties: false, }; ``` ### 2.2 Submit the Document and Read the Values Both endpoints in this recipe run asynchronously: we submit a job, get a `job_id`, and poll until it's done. Since we do that twice, it's worth a small helper. The extractor accepts `completed` as its done state; the parser uses `Succeeded`, so the helper checks for both. The failure side is just as inconsistent: a job can end up `failed`, `Failed`, or `Cancelled`, and the parser reports its reason in a `message` field where the extractor uses `error`. The helper covers all of those too, because otherwise a cancelled job would poll until the timeout and surface as a misleading "did not finish in time". Add the polling helper and the extract call: ```python redact.py theme={null} def wait_for(url): for _ in range(90): # roughly 6 minutes at 4 seconds per poll job = requests.get(url, headers={"api-key": API_KEY}).json() status = job.get("status") if status in ("completed", "review", "Succeeded"): return job if status in ("failed", "Failed", "Cancelled"): # The extractor reports its reason in "error"; the parser uses "message". raise RuntimeError(job.get("error") or job.get("message") or f"job {status}") time.sleep(4) raise TimeoutError(f"Job at {url} did not finish in time") with open("document.pdf", "rb") as f: resp = requests.post( f"{BASE_URL}/v2/extract", headers={"api-key": API_KEY}, files={"pdf_file": ("document.pdf", f, "application/pdf")}, data={"schema_data": json.dumps(SCHEMA), "model": "gamma"}, ) resp.raise_for_status() job = resp.json() result = wait_for(f"{BASE_URL}/extract/{job['job_id']}")["result"] sensitive = [] for field in SCHEMA["properties"]: value = result[field]["value"] if value is None: print(f"Field '{field}' not found in the document, skipping.") continue sensitive.append(value) print("Will redact:", sensitive) ``` Add the polling helper and the extract call: ```javascript redact.mjs theme={null} async function waitFor(url) { for (let i = 0; i < 90; i++) { // roughly 6 minutes at 4 seconds per poll const job = await (await fetch(url, { headers: { "api-key": API_KEY } })).json(); if (["completed", "review", "Succeeded"].includes(job.status)) return job; // The extractor reports its reason in "error"; the parser uses "message". if (["failed", "Failed", "Cancelled"].includes(job.status)) throw new Error(job.error || job.message || `job ${job.status}`); await new Promise((r) => setTimeout(r, 4000)); } throw new Error(`Job at ${url} did not finish in time`); } const bytes = readFileSync("document.pdf"); const extractForm = new FormData(); extractForm.append("pdf_file", new Blob([bytes], { type: "application/pdf" }), "document.pdf"); extractForm.append("schema_data", JSON.stringify(SCHEMA)); extractForm.append("model", "gamma"); const extractRes = await fetch(`${BASE_URL}/v2/extract`, { method: "POST", headers: { "api-key": API_KEY }, body: extractForm }); if (!extractRes.ok) throw new Error(`extract submit failed: HTTP ${extractRes.status} ${await extractRes.text()}`); const extractJob = await extractRes.json(); const result = (await waitFor(`${BASE_URL}/extract/${extractJob.job_id}`)).result; const sensitive = []; for (const field of Object.keys(SCHEMA.properties)) { const value = result[field].value; if (value == null) { console.log(`Field '${field}' not found in the document, skipping.`); continue; } sensitive.push(value); } console.log("Will redact:", sensitive); ``` Each entry in the extractor's `result` is an object with a `value`, so we pull out the values into a flat `sensitive` list. For the sample document, that prints: ```text theme={null} Will redact: ['Jonathan Hale', '4471-8829-0007', 'jhale@example.com', '(415) 555-0142'] ``` These are the strings we now need to find and remove wherever they appear. Leave the extractor's `detect_pii` parameter at its default of `false` for this recipe. That flag exists to block extraction from documents that contain PII: with it enabled, the endpoint returns HTTP 200 with `job_id: null` and `status: "pii_blocked"` instead of starting a job, and the polling step fails. Extracting PII is the point of this recipe, so the gate has to stay off. ## Step 3: Locate Every Occurrence With Parsing Extraction told us what is sensitive. It does not tell us where each value sits on the page, or how many times it appears. For that we parse the document. Parsing returns the page broken into segments, and within each segment an `ocr` array giving every word along with its bounding box. That word-level detail is what lets us draw a box around each occurrence. ### 3.1 Parse the Document Send the same file to `/parse`. Note the form field is `file` here, not `pdf_file`: the parser and the extractor use different field names. We leave `ocr_strategy` at its default `auto_detection`: it returns word-level `ocr` boxes for digital and scanned PDFs alike, and in our testing `force_ocr` degrades the geometry to line-level boxes, which makes redaction miss values. Add the parse call: ```python redact.py theme={null} with open("document.pdf", "rb") as f: resp = requests.post( f"{BASE_URL}/parse", headers={"api-key": API_KEY}, files={"file": ("document.pdf", f, "application/pdf")}, ) resp.raise_for_status() job = resp.json() parsed = wait_for(f"{BASE_URL}/parse/{job['job_id']}") ``` Add the parse call: ```javascript redact.mjs theme={null} const parseForm = new FormData(); parseForm.append("file", new Blob([bytes], { type: "application/pdf" }), "document.pdf"); const parseRes = await fetch(`${BASE_URL}/parse`, { method: "POST", headers: { "api-key": API_KEY }, body: parseForm }); if (!parseRes.ok) throw new Error(`parse submit failed: HTTP ${parseRes.status} ${await parseRes.text()}`); const parseJob = await parseRes.json(); const parsed = await waitFor(`${BASE_URL}/parse/${parseJob.job_id}`); ``` ### 3.2 Build a Flat List of Words A word's `ocr` box is given relative to its segment, so its position on the page is the segment's top-left corner plus the word's own offset. We flatten every word into one list, recording its normalized text, its page, and a box. Two details matter for drawing clean boxes later: * **The reported word height is unreliable.** Some boxes come back only a couple of pixels tall. The bottom edge, `top + height`, sits on the text baseline consistently, so we store that baseline and reconstruct a full-height box from it in Step 4. * **Parse boxes are in render pixels,** measured against the rendered page that `page_width` and `page_height` describe. We keep each word's `page_width` so we can compute the exact pixels-to-points ratio when we redact. Add the normalizer and the word-collection loop: ```python redact.py theme={null} def normalize(text): return re.sub(r"[^a-z0-9]", "", str(text).lower()) words = [] for chunk in parsed["chunks"]: for seg in chunk["segments"]: sx, sy = seg["bbox"]["left"], seg["bbox"]["top"] for word in seg.get("ocr") or []: norm = normalize(word["text"]) if not norm: continue b = word["bbox"] words.append({ "norm": norm, "page": seg["page_number"], "page_width": seg["page_width"], "x": sx + b["left"], "width": b["width"], "baseline": sy + b["top"] + b["height"], }) # With no words there is nothing to match, and the script would "succeed" # by writing an unredacted copy. Fail loudly instead. if not words: raise RuntimeError("Parse returned no OCR words; refusing to write an unredacted PDF") ``` Add the normalizer and the word-collection loop: ```javascript redact.mjs theme={null} const normalize = (text) => String(text).toLowerCase().replace(/[^a-z0-9]/g, ""); const words = []; for (const chunk of parsed.chunks) { for (const seg of chunk.segments) { const { left: sx, top: sy } = seg.bbox; for (const word of seg.ocr || []) { const norm = normalize(word.text); if (!norm) continue; const b = word.bbox; words.push({ norm, page: seg.page_number, pageWidth: seg.page_width, x: sx + b.left, width: b.width, baseline: sy + b.top + b.height, }); } } } // With no words there is nothing to match, and the script would "succeed" // by writing an unredacted copy. Fail loudly instead. if (words.length === 0) throw new Error("Parse returned no OCR words; refusing to write an unredacted PDF"); ``` We normalize text down to lowercase letters and digits, dropping spaces and punctuation. That's what makes the matching in the next step robust. The guard at the end matters: every later step quietly does nothing when `words` is empty, so without it the script would still write a `redacted.pdf` with nothing redacted. A redaction script should fail loudly rather than produce a clean-looking file that leaks everything. ## Step 4: Match Values to Their Locations Now we find each sensitive value in the word list. A naive equality check breaks down quickly: `4471-8829-0007` might arrive as one OCR token or several, an email keeps its `@`, and a phone number can wrap across a line break. So instead of matching word by word, we match by **normalized concatenation**. Starting at each word, we glue the normalized words together one at a time until the running string equals the normalized target, and stop early the moment it can no longer lead to a match. ### 4.1 Find Every Run of Words That Spells a Value Add the matcher: ```python redact.py theme={null} def find_runs(value): target = normalize(value) runs = [] if len(target) < 3: return runs for i in range(len(words)): acc = "" for j in range(i, min(i + 10, len(words))): if words[j]["page"] != words[i]["page"]: break acc += words[j]["norm"] if acc == target: runs.append(words[i:j + 1]) break if not target.startswith(acc): break return runs ``` Add the matcher: ```javascript redact.mjs theme={null} function findRuns(value) { const target = normalize(value); const runs = []; if (target.length < 3) return runs; for (let i = 0; i < words.length; i++) { let acc = ""; for (let j = i; j < words.length && words[j].page === words[i].page && j - i < 10; j++) { acc += words[j].norm; if (acc === target) { runs.push(words.slice(i, j + 1)); break; } if (!target.startsWith(acc)) break; } } return runs; } ``` Each value can return several runs, one per occurrence. We skip targets shorter than three characters so a stray initial can't trigger a flood of matches. ## Step 5: Black Out Every Match A run is a list of words. To redact it we need rectangles, and we need them to behave at two awkward edges: a value can wrap across a line, and individual word heights are unreliable. We handle both by grouping each run into lines and anchoring every box to the baseline we stored earlier. ### 5.1 Turn a Run Into One Box per Line We walk the run's words, starting a new line whenever the baseline jumps or the text steps back to the left margin. For each line we take the horizontal extent from the words and a fixed band above the baseline (plus a little below for descenders). Add the box builder: ```python redact.py theme={null} BAND_ABOVE, BAND_BELOW, PAD = 19, 6, 2 def line_boxes(run): lines, current = [], [run[0]] for word in run[1:]: same_line = abs(word["baseline"] - current[-1]["baseline"]) <= 12 if same_line and word["x"] >= current[-1]["x"] - 2: current.append(word) else: lines.append(current) current = [word] lines.append(current) boxes = [] for line in lines: baseline = max(w["baseline"] for w in line) boxes.append(( min(w["x"] for w in line) - PAD, baseline - BAND_ABOVE, max(w["x"] + w["width"] for w in line) + PAD, baseline + BAND_BELOW, )) return boxes ``` Add the box builder: ```javascript redact.mjs theme={null} const BAND_ABOVE = 19, BAND_BELOW = 6, PAD = 2; function lineBoxes(run) { const lines = []; let current = [run[0]]; for (const word of run.slice(1)) { const sameLine = Math.abs(word.baseline - current[current.length - 1].baseline) <= 12; if (sameLine && word.x >= current[current.length - 1].x - 2) current.push(word); else { lines.push(current); current = [word]; } } lines.push(current); return lines.map((line) => { const baseline = Math.max(...line.map((w) => w.baseline)); return [ Math.min(...line.map((w) => w.x)) - PAD, baseline - BAND_ABOVE, Math.max(...line.map((w) => w.x + w.width)) + PAD, baseline + BAND_BELOW, ]; }); } ``` ### 5.2 Remove the Text This is where the two languages diverge, because covering text and removing it are not the same thing. A black rectangle painted over a PDF leaves the original characters in the file, selectable and searchable underneath. To redact for real you either delete the content or destroy it. * **Python** uses PyMuPDF's redaction annotations. `apply_redactions` deletes the underlying text and image data inside each box, then fills it black. The output stays a normal PDF, and everything outside the boxes is left untouched and still selectable. * **JavaScript** renders each page to an image, paints the boxes onto the pixels, and rebuilds the PDF from those images. Once a page is an image, there is no text layer left to recover. Add the redaction loop and save. We scale each box from parse's render pixels back to PDF points using the ratio of `page_width` to the page's width in points. Deletion and drawing use different rectangles on purpose: `apply_redactions` removes any text that touches the redaction rect, even by a fraction of a point, so we shrink the deletion rect two points clear of the neighboring lines (otherwise a box can silently delete words from the line above or below), then draw the full-height black box as plain ink: ```python redact.py theme={null} doc = fitz.open("document.pdf") boxes_by_page = {} for value in sensitive: for run in find_runs(value): page_index = run[0]["page"] - 1 scale = run[0]["page_width"] / doc[page_index].rect.width for x0, y0, x1, y1 in line_boxes(run): rect = fitz.Rect(x0 / scale, y0 / scale, x1 / scale, y1 / scale) boxes_by_page.setdefault(page_index, []).append(rect) for page_index, rects in boxes_by_page.items(): page = doc[page_index] for rect in rects: page.add_redact_annot(rect + (0, 2, 0, -2), fill=(0, 0, 0)) page.apply_redactions() for rect in rects: page.draw_rect(rect, fill=(0, 0, 0), color=None) doc.save("redacted.pdf") print("Wrote redacted.pdf") ``` Run it: ```bash theme={null} python redact.py ``` Group the boxes by page, then render, paint, and reassemble. We derive each page's render scale from the `page_width` that parse reported, so the canvas matches parse's pixel grid exactly. When placing each image we use the page's size in points, so the rebuilt page keeps its original dimensions: ```javascript redact.mjs theme={null} const boxesByPage = new Map(); for (const value of sensitive) { for (const run of findRuns(value)) { const page = run[0].page; if (!boxesByPage.has(page)) boxesByPage.set(page, []); boxesByPage.get(page).push(...lineBoxes(run)); } } const pageWidths = new Map(words.map((w) => [w.page, w.pageWidth])); const fonts = "./node_modules/pdfjs-dist/standard_fonts/"; const pdf = await pdfjsLib.getDocument({ data: new Uint8Array(bytes), standardFontDataUrl: fonts }).promise; const out = await PDFDocument.create(); for (let pageNo = 1; pageNo <= pdf.numPages; pageNo++) { const page = await pdf.getPage(pageNo); const base = page.getViewport({ scale: 1.0 }); const scale = (pageWidths.get(pageNo) ?? base.width * 2) / base.width; const viewport = page.getViewport({ scale }); const canvas = createCanvas(viewport.width, viewport.height); const ctx = canvas.getContext("2d"); await page.render({ canvasContext: ctx, viewport, canvas }).promise; ctx.fillStyle = "#000"; for (const [x0, y0, x1, y1] of boxesByPage.get(pageNo) || []) { ctx.fillRect(x0, y0, x1 - x0, y1 - y0); } const png = await out.embedPng(canvas.toBuffer("image/png")); const outPage = out.addPage([base.width, base.height]); outPage.drawImage(png, { x: 0, y: 0, width: base.width, height: base.height }); } writeFileSync("redacted.pdf", await out.save()); console.log("Wrote redacted.pdf"); ``` Run it: ```bash theme={null} node redact.mjs ``` ## Step 6: Confirm the Text Is Really Gone The point of redaction is that the data can't be recovered, so it's worth checking rather than trusting the visual. Open `redacted.pdf` and try to select the blacked-out text: nothing should be there. You can confirm the same thing in code by pulling the text layer back out of the result. With Python and PyMuPDF: ```python theme={null} import fitz text = " ".join(page.get_text() for page in fitz.open("redacted.pdf")) for value in ["Jonathan Hale", "4471-8829-0007", "jhale@example.com"]: print(value, "->", "REMOVED" if value not in text else "STILL PRESENT") ``` For the sample document this prints `REMOVED` for every value, while the parts we didn't target, like the closing balance and the dates, are still present and selectable: ```text theme={null} Jonathan Hale -> REMOVED 4471-8829-0007 -> REMOVED jhale@example.com -> REMOVED ``` The JavaScript output is image-only, so it has no text layer at all; selecting anywhere on the page returns nothing. ## Where to Take This Next The schema is the lever here. Add fields for any other data you need gone, such as a date of birth, a postal address, or a national ID number, and the same pipeline finds and removes it. A few directions to take it further: * **Redact signatures.** Parsing labels handwritten signatures with a `Signature` segment type, but only when you submit the parse job with `layout_analysis=advanced_layout_detection`; the default layout analysis we use in this recipe never returns it. With that parameter set, you can black out each signature segment's box directly, without an extraction step, since the segment already carries its location. * **Handle scanned documents.** Because the locations come from the parser's OCR rather than the PDF text layer, the same script works on scans and photos, not just digital PDFs. * **Review before you ship.** For a human-in-the-loop workflow, draw the boxes in a bright color first and have a reviewer confirm them, then switch to black once the set is approved. How schemas drive extraction, including nested objects and arrays. The full parse response, segment types, and word-level OCR boxes. Every segment type the parser can return, including `Signature`. Browse the full request and response specs for `/v2/extract` and `/parse`.