Skip to main content
Before a document leaves your hands, the sensitive parts often have to go: names, account numbers, email addresses, phone numbers. Doing it by hand is slow, and a single missed occurrence leaks the very data you meant to protect. Worse, the obvious shortcut, drawing a black box over the text in a PDF viewer, doesn’t actually remove anything. The text still sits underneath, ready to be copied straight back out. Governments and law firms have leaked data exactly this way. This recipe redacts properly. It uses extraction to work out what the sensitive values are, parsing to find every place they appear on the page, and then removes them, producing a redacted.pdf you can safely share. You don’t have to know the values in advance; the extractor reads them off the document for you. An account statement before and after redaction, with the name, account number, email, and phone number blacked out
This recipe chains two endpoints covered on their own in the Extract quickstart and the parsing guide. Read those first if you want the detail on either call in isolation.

What We’ll Build

A script that:
  1. Sends a document to /v2/extract with a schema describing the sensitive fields, and reads back their values.
  2. Sends the same document to /parse to get every word on the page with its bounding box.
  3. Matches each sensitive value against the parsed words to find every occurrence, not just the first.
  4. Blacks out each match and writes a redacted.pdf with the underlying text removed.
We’ll run it against a one-page account statement that repeats the same personal details several times. Grab the full script from the dropdown below if you’d rather skip the walkthrough.
Set UNSILOED_API_KEY in your environment and save the document as document.pdf in the same directory before running.
redact.py
import json
import os
import re
import time

import fitz  # PyMuPDF
import requests

API_KEY = os.environ["UNSILOED_API_KEY"]
BASE_URL = "https://prod.visionapi.unsiloed.ai"

# The kinds of information we want gone. Each description tells the extractor
# what to look for, so adapt these fields to your own document.
SCHEMA = {
    "type": "object",
    "properties": {
        "full_name": {"type": "string", "description": "The full name of the account holder"},
        "account_number": {"type": "string", "description": "The bank account number"},
        "email": {"type": "string", "description": "The account holder's email address"},
        "phone": {"type": "string", "description": "The account holder's phone number"},
    },
    "required": ["full_name", "account_number", "email", "phone"],
    "additionalProperties": False,
}


def wait_for(url):
    for _ in range(90):  # roughly 6 minutes at 4 seconds per poll
        job = requests.get(url, headers={"api-key": API_KEY}).json()
        status = job.get("status")
        if status in ("completed", "review", "Succeeded"):
            return job
        if status in ("failed", "Failed", "Cancelled"):
            # The extractor reports its reason in "error"; the parser uses "message".
            raise RuntimeError(job.get("error") or job.get("message") or f"job {status}")
        time.sleep(4)
    raise TimeoutError(f"Job at {url} did not finish in time")


# Step 1: ask the extractor what the sensitive values are.
with open("document.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/v2/extract",
        headers={"api-key": API_KEY},
        files={"pdf_file": ("document.pdf", f, "application/pdf")},
        data={"schema_data": json.dumps(SCHEMA), "model": "gamma"},
    )
resp.raise_for_status()
job = resp.json()
result = wait_for(f"{BASE_URL}/extract/{job['job_id']}")["result"]
sensitive = []
for field in SCHEMA["properties"]:
    value = result[field]["value"]
    if value is None:
        print(f"Field '{field}' not found in the document, skipping.")
        continue
    sensitive.append(value)
print("Will redact:", sensitive)

# Step 2: parse the document to get every word with a page-relative box.
with open("document.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/parse",
        headers={"api-key": API_KEY},
        files={"file": ("document.pdf", f, "application/pdf")},
    )
resp.raise_for_status()
job = resp.json()
parsed = wait_for(f"{BASE_URL}/parse/{job['job_id']}")


def normalize(text):
    return re.sub(r"[^a-z0-9]", "", str(text).lower())


words = []
for chunk in parsed["chunks"]:
    for seg in chunk["segments"]:
        sx, sy = seg["bbox"]["left"], seg["bbox"]["top"]
        for word in seg.get("ocr") or []:
            norm = normalize(word["text"])
            if not norm:
                continue
            b = word["bbox"]
            words.append({
                "norm": norm,
                "page": seg["page_number"],
                "page_width": seg["page_width"],
                "x": sx + b["left"],
                "width": b["width"],
                # y + height lands on the text baseline reliably, even when the
                # reported height is not, so we anchor redaction boxes to it.
                "baseline": sy + b["top"] + b["height"],
            })

# With no words there is nothing to match, and the script would "succeed"
# by writing an unredacted copy. Fail loudly instead.
if not words:
    raise RuntimeError("Parse returned no OCR words; refusing to write an unredacted PDF")


# Step 3: find every occurrence of each value by normalized concatenation, so a
# value matches however OCR split it (hyphens, "@") or wrapped across lines.
def find_runs(value):
    target = normalize(value)
    runs = []
    if len(target) < 3:
        return runs
    for i in range(len(words)):
        acc = ""
        for j in range(i, min(i + 10, len(words))):
            if words[j]["page"] != words[i]["page"]:
                break
            acc += words[j]["norm"]
            if acc == target:
                runs.append(words[i:j + 1])
                break
            if not target.startswith(acc):
                break
    return runs


# Step 4: turn each run into one box per line, anchored to the baseline.
BAND_ABOVE, BAND_BELOW, PAD = 19, 6, 2


def line_boxes(run):
    lines, current = [], [run[0]]
    for word in run[1:]:
        same_line = abs(word["baseline"] - current[-1]["baseline"]) <= 12
        if same_line and word["x"] >= current[-1]["x"] - 2:
            current.append(word)
        else:
            lines.append(current)
            current = [word]
    lines.append(current)
    boxes = []
    for line in lines:
        baseline = max(w["baseline"] for w in line)
        boxes.append((
            min(w["x"] for w in line) - PAD,
            baseline - BAND_ABOVE,
            max(w["x"] + w["width"] for w in line) + PAD,
            baseline + BAND_BELOW,
        ))
    return boxes


doc = fitz.open("document.pdf")
boxes_by_page = {}
for value in sensitive:
    for run in find_runs(value):
        page_index = run[0]["page"] - 1
        # Parse boxes are in render pixels; scale them back to PDF points.
        scale = run[0]["page_width"] / doc[page_index].rect.width
        for x0, y0, x1, y1 in line_boxes(run):
            rect = fitz.Rect(x0 / scale, y0 / scale, x1 / scale, y1 / scale)
            boxes_by_page.setdefault(page_index, []).append(rect)

# apply_redactions deletes any text that touches the redaction rect, even
# by a fraction of a point, so we delete with a rect shrunk clear of the
# neighboring lines, then draw the full-height box as plain ink afterwards.
for page_index, rects in boxes_by_page.items():
    page = doc[page_index]
    for rect in rects:
        page.add_redact_annot(rect + (0, 2, 0, -2), fill=(0, 0, 0))
    page.apply_redactions()
    for rect in rects:
        page.draw_rect(rect, fill=(0, 0, 0), color=None)
doc.save("redacted.pdf")
print("Wrote redacted.pdf")

Step 1: Set Up Your Environment

Before writing any code, we need three things: an API key, a document to redact, and a few libraries for the chosen language.

1.1 Get an Unsiloed AI API Key

To get API access, sign up on Unsiloed AI. Export your key as an environment variable so it stays out of source control:
export UNSILOED_API_KEY="your-api-key"

1.2 Pick a Document

The walkthrough assumes a PDF saved as document.pdf in your working directory. To follow along with the exact output shown below, download our sample account statement: a one-page letter that repeats the account holder’s name, account number, and email so we can prove every occurrence gets removed, not just the first.

1.3 Install Dependencies

The API calls are plain HTTP, but writing the redacted file means rendering the PDF, so each language needs a couple of libraries.
You need Python 3.8 or newer. Install requests for the API calls and PyMuPDF for the redaction:
pip install requests PyMuPDF

Step 2: Find the Sensitive Values With Extraction

We start with extraction rather than a list of search terms because we don’t want to hard-code the data we’re protecting. A schema describes the kinds of information that are sensitive, and the extractor reads the actual values off the document. Point it at a different statement and it finds that person’s details instead.

2.1 Describe the Sensitive Fields

Each field is a name and a description. The description tells the extractor what to look for, so the clearer it is, the more reliable the result. We’re after four pieces of personal data.
Create a file called redact.py with the configuration and schema:
redact.py
import json
import os
import re
import time

import fitz  # PyMuPDF
import requests

API_KEY = os.environ["UNSILOED_API_KEY"]
BASE_URL = "https://prod.visionapi.unsiloed.ai"

SCHEMA = {
    "type": "object",
    "properties": {
        "full_name": {"type": "string", "description": "The full name of the account holder"},
        "account_number": {"type": "string", "description": "The bank account number"},
        "email": {"type": "string", "description": "The account holder's email address"},
        "phone": {"type": "string", "description": "The account holder's phone number"},
    },
    "required": ["full_name", "account_number", "email", "phone"],
    "additionalProperties": False,
}

2.2 Submit the Document and Read the Values

Both endpoints in this recipe run asynchronously: we submit a job, get a job_id, and poll until it’s done. Since we do that twice, it’s worth a small helper. The extractor accepts completed as its done state; the parser uses Succeeded, so the helper checks for both. The failure side is just as inconsistent: a job can end up failed, Failed, or Cancelled, and the parser reports its reason in a message field where the extractor uses error. The helper covers all of those too, because otherwise a cancelled job would poll until the timeout and surface as a misleading “did not finish in time”.
Add the polling helper and the extract call:
redact.py
def wait_for(url):
    for _ in range(90):  # roughly 6 minutes at 4 seconds per poll
        job = requests.get(url, headers={"api-key": API_KEY}).json()
        status = job.get("status")
        if status in ("completed", "review", "Succeeded"):
            return job
        if status in ("failed", "Failed", "Cancelled"):
            # The extractor reports its reason in "error"; the parser uses "message".
            raise RuntimeError(job.get("error") or job.get("message") or f"job {status}")
        time.sleep(4)
    raise TimeoutError(f"Job at {url} did not finish in time")


with open("document.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/v2/extract",
        headers={"api-key": API_KEY},
        files={"pdf_file": ("document.pdf", f, "application/pdf")},
        data={"schema_data": json.dumps(SCHEMA), "model": "gamma"},
    )
resp.raise_for_status()
job = resp.json()
result = wait_for(f"{BASE_URL}/extract/{job['job_id']}")["result"]
sensitive = []
for field in SCHEMA["properties"]:
    value = result[field]["value"]
    if value is None:
        print(f"Field '{field}' not found in the document, skipping.")
        continue
    sensitive.append(value)
print("Will redact:", sensitive)
Each entry in the extractor’s result is an object with a value, so we pull out the values into a flat sensitive list. For the sample document, that prints:
Will redact: ['Jonathan Hale', '4471-8829-0007', 'jhale@example.com', '(415) 555-0142']
These are the strings we now need to find and remove wherever they appear.
Leave the extractor’s detect_pii parameter at its default of false for this recipe. That flag exists to block extraction from documents that contain PII: with it enabled, the endpoint returns HTTP 200 with job_id: null and status: "pii_blocked" instead of starting a job, and the polling step fails. Extracting PII is the point of this recipe, so the gate has to stay off.

Step 3: Locate Every Occurrence With Parsing

Extraction told us what is sensitive. It does not tell us where each value sits on the page, or how many times it appears. For that we parse the document. Parsing returns the page broken into segments, and within each segment an ocr array giving every word along with its bounding box. That word-level detail is what lets us draw a box around each occurrence.

3.1 Parse the Document

Send the same file to /parse. Note the form field is file here, not pdf_file: the parser and the extractor use different field names. We leave ocr_strategy at its default auto_detection: it returns word-level ocr boxes for digital and scanned PDFs alike, and in our testing force_ocr degrades the geometry to line-level boxes, which makes redaction miss values.
Add the parse call:
redact.py
with open("document.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/parse",
        headers={"api-key": API_KEY},
        files={"file": ("document.pdf", f, "application/pdf")},
    )
resp.raise_for_status()
job = resp.json()
parsed = wait_for(f"{BASE_URL}/parse/{job['job_id']}")

3.2 Build a Flat List of Words

A word’s ocr box is given relative to its segment, so its position on the page is the segment’s top-left corner plus the word’s own offset. We flatten every word into one list, recording its normalized text, its page, and a box. Two details matter for drawing clean boxes later:
  • The reported word height is unreliable. Some boxes come back only a couple of pixels tall. The bottom edge, top + height, sits on the text baseline consistently, so we store that baseline and reconstruct a full-height box from it in Step 4.
  • Parse boxes are in render pixels, measured against the rendered page that page_width and page_height describe. We keep each word’s page_width so we can compute the exact pixels-to-points ratio when we redact.
Add the normalizer and the word-collection loop:
redact.py
def normalize(text):
    return re.sub(r"[^a-z0-9]", "", str(text).lower())


words = []
for chunk in parsed["chunks"]:
    for seg in chunk["segments"]:
        sx, sy = seg["bbox"]["left"], seg["bbox"]["top"]
        for word in seg.get("ocr") or []:
            norm = normalize(word["text"])
            if not norm:
                continue
            b = word["bbox"]
            words.append({
                "norm": norm,
                "page": seg["page_number"],
                "page_width": seg["page_width"],
                "x": sx + b["left"],
                "width": b["width"],
                "baseline": sy + b["top"] + b["height"],
            })

# With no words there is nothing to match, and the script would "succeed"
# by writing an unredacted copy. Fail loudly instead.
if not words:
    raise RuntimeError("Parse returned no OCR words; refusing to write an unredacted PDF")
We normalize text down to lowercase letters and digits, dropping spaces and punctuation. That’s what makes the matching in the next step robust. The guard at the end matters: every later step quietly does nothing when words is empty, so without it the script would still write a redacted.pdf with nothing redacted. A redaction script should fail loudly rather than produce a clean-looking file that leaks everything.

Step 4: Match Values to Their Locations

Now we find each sensitive value in the word list. A naive equality check breaks down quickly: 4471-8829-0007 might arrive as one OCR token or several, an email keeps its @, and a phone number can wrap across a line break. So instead of matching word by word, we match by normalized concatenation. Starting at each word, we glue the normalized words together one at a time until the running string equals the normalized target, and stop early the moment it can no longer lead to a match.

4.1 Find Every Run of Words That Spells a Value

Add the matcher:
redact.py
def find_runs(value):
    target = normalize(value)
    runs = []
    if len(target) < 3:
        return runs
    for i in range(len(words)):
        acc = ""
        for j in range(i, min(i + 10, len(words))):
            if words[j]["page"] != words[i]["page"]:
                break
            acc += words[j]["norm"]
            if acc == target:
                runs.append(words[i:j + 1])
                break
            if not target.startswith(acc):
                break
    return runs
Each value can return several runs, one per occurrence. We skip targets shorter than three characters so a stray initial can’t trigger a flood of matches.

Step 5: Black Out Every Match

A run is a list of words. To redact it we need rectangles, and we need them to behave at two awkward edges: a value can wrap across a line, and individual word heights are unreliable. We handle both by grouping each run into lines and anchoring every box to the baseline we stored earlier.

5.1 Turn a Run Into One Box per Line

We walk the run’s words, starting a new line whenever the baseline jumps or the text steps back to the left margin. For each line we take the horizontal extent from the words and a fixed band above the baseline (plus a little below for descenders).
Add the box builder:
redact.py
BAND_ABOVE, BAND_BELOW, PAD = 19, 6, 2


def line_boxes(run):
    lines, current = [], [run[0]]
    for word in run[1:]:
        same_line = abs(word["baseline"] - current[-1]["baseline"]) <= 12
        if same_line and word["x"] >= current[-1]["x"] - 2:
            current.append(word)
        else:
            lines.append(current)
            current = [word]
    lines.append(current)
    boxes = []
    for line in lines:
        baseline = max(w["baseline"] for w in line)
        boxes.append((
            min(w["x"] for w in line) - PAD,
            baseline - BAND_ABOVE,
            max(w["x"] + w["width"] for w in line) + PAD,
            baseline + BAND_BELOW,
        ))
    return boxes

5.2 Remove the Text

This is where the two languages diverge, because covering text and removing it are not the same thing. A black rectangle painted over a PDF leaves the original characters in the file, selectable and searchable underneath. To redact for real you either delete the content or destroy it.
  • Python uses PyMuPDF’s redaction annotations. apply_redactions deletes the underlying text and image data inside each box, then fills it black. The output stays a normal PDF, and everything outside the boxes is left untouched and still selectable.
  • JavaScript renders each page to an image, paints the boxes onto the pixels, and rebuilds the PDF from those images. Once a page is an image, there is no text layer left to recover.
Add the redaction loop and save. We scale each box from parse’s render pixels back to PDF points using the ratio of page_width to the page’s width in points. Deletion and drawing use different rectangles on purpose: apply_redactions removes any text that touches the redaction rect, even by a fraction of a point, so we shrink the deletion rect two points clear of the neighboring lines (otherwise a box can silently delete words from the line above or below), then draw the full-height black box as plain ink:
redact.py
doc = fitz.open("document.pdf")
boxes_by_page = {}
for value in sensitive:
    for run in find_runs(value):
        page_index = run[0]["page"] - 1
        scale = run[0]["page_width"] / doc[page_index].rect.width
        for x0, y0, x1, y1 in line_boxes(run):
            rect = fitz.Rect(x0 / scale, y0 / scale, x1 / scale, y1 / scale)
            boxes_by_page.setdefault(page_index, []).append(rect)

for page_index, rects in boxes_by_page.items():
    page = doc[page_index]
    for rect in rects:
        page.add_redact_annot(rect + (0, 2, 0, -2), fill=(0, 0, 0))
    page.apply_redactions()
    for rect in rects:
        page.draw_rect(rect, fill=(0, 0, 0), color=None)
doc.save("redacted.pdf")
print("Wrote redacted.pdf")
Run it:
python redact.py

Step 6: Confirm the Text Is Really Gone

The point of redaction is that the data can’t be recovered, so it’s worth checking rather than trusting the visual. Open redacted.pdf and try to select the blacked-out text: nothing should be there. You can confirm the same thing in code by pulling the text layer back out of the result. With Python and PyMuPDF:
import fitz

text = " ".join(page.get_text() for page in fitz.open("redacted.pdf"))
for value in ["Jonathan Hale", "4471-8829-0007", "jhale@example.com"]:
    print(value, "->", "REMOVED" if value not in text else "STILL PRESENT")
For the sample document this prints REMOVED for every value, while the parts we didn’t target, like the closing balance and the dates, are still present and selectable:
Jonathan Hale -> REMOVED
4471-8829-0007 -> REMOVED
jhale@example.com -> REMOVED
The JavaScript output is image-only, so it has no text layer at all; selecting anywhere on the page returns nothing.

Where to Take This Next

The schema is the lever here. Add fields for any other data you need gone, such as a date of birth, a postal address, or a national ID number, and the same pipeline finds and removes it. A few directions to take it further:
  • Redact signatures. Parsing labels handwritten signatures with a Signature segment type, but only when you submit the parse job with layout_analysis=advanced_layout_detection; the default layout analysis we use in this recipe never returns it. With that parameter set, you can black out each signature segment’s box directly, without an extraction step, since the segment already carries its location.
  • Handle scanned documents. Because the locations come from the parser’s OCR rather than the PDF text layer, the same script works on scans and photos, not just digital PDFs.
  • Review before you ship. For a human-in-the-loop workflow, draw the boxes in a bright color first and have a reviewer confirm them, then switch to black once the set is approved.

Extraction

How schemas drive extraction, including nested objects and arrays.

Parsing

The full parse response, segment types, and word-level OCR boxes.

Element Types

Every segment type the parser can return, including Signature.

API Reference

Browse the full request and response specs for /v2/extract and /parse.