> ## Documentation Index
> Fetch the complete documentation index at: https://docs.unsiloed.ai/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Redact Sensitive Data From a Document

> Find personal data in a PDF with extraction, locate every occurrence with parsing, and black it out to produce a clean redacted PDF.

Before a document leaves your hands, the sensitive parts often have to go: names, account numbers, email addresses, phone numbers. Doing it by hand is slow, and a single missed occurrence leaks the very data you meant to protect. Worse, the obvious shortcut, drawing a black box over the text in a PDF viewer, doesn't actually remove anything. The text still sits underneath, ready to be copied straight back out. Governments and law firms have leaked data exactly this way.

This recipe redacts properly. It uses extraction to work out what the sensitive values are, parsing to find every place they appear on the page, and then removes them, producing a `redacted.pdf` you can safely share. You don't have to know the values in advance; the extractor reads them off the document for you.

<img src="https://mintcdn.com/unsiloed/8foOMTYS3RsIAKAD/images/redaction-before-after.png?fit=max&auto=format&n=8foOMTYS3RsIAKAD&q=85&s=75a13a496ad72bdce947b41c716b1bb2" alt="An account statement before and after redaction, with the name, account number, email, and phone number blacked out" width="1505" height="1638" data-path="images/redaction-before-after.png" />

<Note>
  This recipe chains two endpoints covered on their own in the [Extract quickstart](/document-processing/extraction/quickstart) and the [parsing guide](/document-processing/parsing/parsing). Read those first if you want the detail on either call in isolation.
</Note>

## What We'll Build

A script that:

1. Sends a document to `/v2/extract` with a schema describing the sensitive fields, and reads back their values.
2. Sends the same document to `/parse` to get every word on the page with its bounding box.
3. Matches each sensitive value against the parsed words to find every occurrence, not just the first.
4. Blacks out each match and writes a `redacted.pdf` with the underlying text removed.

We'll run it against a one-page account statement that repeats the same personal details several times. Grab the full script from the dropdown below if you'd rather skip the walkthrough.

<Accordion title="Show the Full Script">
  Set `UNSILOED_API_KEY` in your environment and save the document as `document.pdf` in the same directory before running.

  <Tabs>
    <Tab title="Python">
      ```python redact.py theme={null}
      import json
      import os
      import re
      import time

      import fitz  # PyMuPDF
      import requests

      API_KEY = os.environ["UNSILOED_API_KEY"]
      BASE_URL = "https://prod.visionapi.unsiloed.ai"

      # The kinds of information we want gone. Each description tells the extractor
      # what to look for, so adapt these fields to your own document.
      SCHEMA = {
          "type": "object",
          "properties": {
              "full_name": {"type": "string", "description": "The full name of the account holder"},
              "account_number": {"type": "string", "description": "The bank account number"},
              "email": {"type": "string", "description": "The account holder's email address"},
              "phone": {"type": "string", "description": "The account holder's phone number"},
          },
          "required": ["full_name", "account_number", "email", "phone"],
          "additionalProperties": False,
      }


      def wait_for(url):
          for _ in range(90):  # roughly 6 minutes at 4 seconds per poll
              job = requests.get(url, headers={"api-key": API_KEY}).json()
              status = job.get("status")
              if status in ("completed", "review", "Succeeded"):
                  return job
              if status in ("failed", "Failed", "Cancelled"):
                  # The extractor reports its reason in "error"; the parser uses "message".
                  raise RuntimeError(job.get("error") or job.get("message") or f"job {status}")
              time.sleep(4)
          raise TimeoutError(f"Job at {url} did not finish in time")


      # Step 1: ask the extractor what the sensitive values are.
      with open("document.pdf", "rb") as f:
          resp = requests.post(
              f"{BASE_URL}/v2/extract",
              headers={"api-key": API_KEY},
              files={"pdf_file": ("document.pdf", f, "application/pdf")},
              data={"schema_data": json.dumps(SCHEMA), "model": "gamma"},
          )
      resp.raise_for_status()
      job = resp.json()
      result = wait_for(f"{BASE_URL}/extract/{job['job_id']}")["result"]
      sensitive = []
      for field in SCHEMA["properties"]:
          value = result[field]["value"]
          if value is None:
              print(f"Field '{field}' not found in the document, skipping.")
              continue
          sensitive.append(value)
      print("Will redact:", sensitive)

      # Step 2: parse the document to get every word with a page-relative box.
      with open("document.pdf", "rb") as f:
          resp = requests.post(
              f"{BASE_URL}/parse",
              headers={"api-key": API_KEY},
              files={"file": ("document.pdf", f, "application/pdf")},
          )
      resp.raise_for_status()
      job = resp.json()
      parsed = wait_for(f"{BASE_URL}/parse/{job['job_id']}")


      def normalize(text):
          return re.sub(r"[^a-z0-9]", "", str(text).lower())


      words = []
      for chunk in parsed["chunks"]:
          for seg in chunk["segments"]:
              sx, sy = seg["bbox"]["left"], seg["bbox"]["top"]
              for word in seg.get("ocr") or []:
                  norm = normalize(word["text"])
                  if not norm:
                      continue
                  b = word["bbox"]
                  words.append({
                      "norm": norm,
                      "page": seg["page_number"],
                      "page_width": seg["page_width"],
                      "x": sx + b["left"],
                      "width": b["width"],
                      # y + height lands on the text baseline reliably, even when the
                      # reported height is not, so we anchor redaction boxes to it.
                      "baseline": sy + b["top"] + b["height"],
                  })

      # With no words there is nothing to match, and the script would "succeed"
      # by writing an unredacted copy. Fail loudly instead.
      if not words:
          raise RuntimeError("Parse returned no OCR words; refusing to write an unredacted PDF")


      # Step 3: find every occurrence of each value by normalized concatenation, so a
      # value matches however OCR split it (hyphens, "@") or wrapped across lines.
      def find_runs(value):
          target = normalize(value)
          runs = []
          if len(target) < 3:
              return runs
          for i in range(len(words)):
              acc = ""
              for j in range(i, min(i + 10, len(words))):
                  if words[j]["page"] != words[i]["page"]:
                      break
                  acc += words[j]["norm"]
                  if acc == target:
                      runs.append(words[i:j + 1])
                      break
                  if not target.startswith(acc):
                      break
          return runs


      # Step 4: turn each run into one box per line, anchored to the baseline.
      BAND_ABOVE, BAND_BELOW, PAD = 19, 6, 2


      def line_boxes(run):
          lines, current = [], [run[0]]
          for word in run[1:]:
              same_line = abs(word["baseline"] - current[-1]["baseline"]) <= 12
              if same_line and word["x"] >= current[-1]["x"] - 2:
                  current.append(word)
              else:
                  lines.append(current)
                  current = [word]
          lines.append(current)
          boxes = []
          for line in lines:
              baseline = max(w["baseline"] for w in line)
              boxes.append((
                  min(w["x"] for w in line) - PAD,
                  baseline - BAND_ABOVE,
                  max(w["x"] + w["width"] for w in line) + PAD,
                  baseline + BAND_BELOW,
              ))
          return boxes


      doc = fitz.open("document.pdf")
      boxes_by_page = {}
      for value in sensitive:
          for run in find_runs(value):
              page_index = run[0]["page"] - 1
              # Parse boxes are in render pixels; scale them back to PDF points.
              scale = run[0]["page_width"] / doc[page_index].rect.width
              for x0, y0, x1, y1 in line_boxes(run):
                  rect = fitz.Rect(x0 / scale, y0 / scale, x1 / scale, y1 / scale)
                  boxes_by_page.setdefault(page_index, []).append(rect)

      # apply_redactions deletes any text that touches the redaction rect, even
      # by a fraction of a point, so we delete with a rect shrunk clear of the
      # neighboring lines, then draw the full-height box as plain ink afterwards.
      for page_index, rects in boxes_by_page.items():
          page = doc[page_index]
          for rect in rects:
              page.add_redact_annot(rect + (0, 2, 0, -2), fill=(0, 0, 0))
          page.apply_redactions()
          for rect in rects:
              page.draw_rect(rect, fill=(0, 0, 0), color=None)
      doc.save("redacted.pdf")
      print("Wrote redacted.pdf")
      ```
    </Tab>

    <Tab title="JavaScript">
      Save this as `redact.mjs` or set `"type": "module"` in your `package.json`. Requires Node.js 18 or newer for the global `fetch`, `FormData`, and `Blob`.

      ```javascript redact.mjs theme={null}
      import { readFileSync, writeFileSync } from "node:fs";
      import { createCanvas } from "@napi-rs/canvas";
      import { PDFDocument } from "pdf-lib";
      import * as pdfjsLib from "pdfjs-dist/legacy/build/pdf.mjs";

      const API_KEY = process.env.UNSILOED_API_KEY;
      const BASE_URL = "https://prod.visionapi.unsiloed.ai";

      // The kinds of information we want gone. Each description tells the extractor
      // what to look for, so adapt these fields to your own document.
      const SCHEMA = {
        type: "object",
        properties: {
          full_name:      { type: "string", description: "The full name of the account holder" },
          account_number: { type: "string", description: "The bank account number" },
          email:          { type: "string", description: "The account holder's email address" },
          phone:          { type: "string", description: "The account holder's phone number" },
        },
        required: ["full_name", "account_number", "email", "phone"],
        additionalProperties: false,
      };

      async function waitFor(url) {
        for (let i = 0; i < 90; i++) { // roughly 6 minutes at 4 seconds per poll
          const job = await (await fetch(url, { headers: { "api-key": API_KEY } })).json();
          if (["completed", "review", "Succeeded"].includes(job.status)) return job;
          // The extractor reports its reason in "error"; the parser uses "message".
          if (["failed", "Failed", "Cancelled"].includes(job.status)) throw new Error(job.error || job.message || `job ${job.status}`);
          await new Promise((r) => setTimeout(r, 4000));
        }
        throw new Error(`Job at ${url} did not finish in time`);
      }

      const bytes = readFileSync("document.pdf");

      // Step 1: ask the extractor what the sensitive values are.
      const extractForm = new FormData();
      extractForm.append("pdf_file", new Blob([bytes], { type: "application/pdf" }), "document.pdf");
      extractForm.append("schema_data", JSON.stringify(SCHEMA));
      extractForm.append("model", "gamma");
      const extractRes = await fetch(`${BASE_URL}/v2/extract`, { method: "POST", headers: { "api-key": API_KEY }, body: extractForm });
      if (!extractRes.ok) throw new Error(`extract submit failed: HTTP ${extractRes.status} ${await extractRes.text()}`);
      const extractJob = await extractRes.json();
      const result = (await waitFor(`${BASE_URL}/extract/${extractJob.job_id}`)).result;
      const sensitive = [];
      for (const field of Object.keys(SCHEMA.properties)) {
        const value = result[field].value;
        if (value == null) {
          console.log(`Field '${field}' not found in the document, skipping.`);
          continue;
        }
        sensitive.push(value);
      }
      console.log("Will redact:", sensitive);

      // Step 2: parse to get every word with a page-relative box.
      const parseForm = new FormData();
      parseForm.append("file", new Blob([bytes], { type: "application/pdf" }), "document.pdf");
      const parseRes = await fetch(`${BASE_URL}/parse`, { method: "POST", headers: { "api-key": API_KEY }, body: parseForm });
      if (!parseRes.ok) throw new Error(`parse submit failed: HTTP ${parseRes.status} ${await parseRes.text()}`);
      const parseJob = await parseRes.json();
      const parsed = await waitFor(`${BASE_URL}/parse/${parseJob.job_id}`);

      const normalize = (text) => String(text).toLowerCase().replace(/[^a-z0-9]/g, "");
      const words = [];
      for (const chunk of parsed.chunks) {
        for (const seg of chunk.segments) {
          const { left: sx, top: sy } = seg.bbox;
          for (const word of seg.ocr || []) {
            const norm = normalize(word.text);
            if (!norm) continue;
            const b = word.bbox;
            words.push({
              norm,
              page: seg.page_number,
              pageWidth: seg.page_width,
              x: sx + b.left,
              width: b.width,
              // y + height lands on the baseline reliably; the height alone does not.
              baseline: sy + b.top + b.height,
            });
          }
        }
      }

      // With no words there is nothing to match, and the script would "succeed"
      // by writing an unredacted copy. Fail loudly instead.
      if (words.length === 0) throw new Error("Parse returned no OCR words; refusing to write an unredacted PDF");

      // Step 3: find every occurrence by normalized concatenation, so a value matches
      // however OCR split it (hyphens, "@") or wrapped across lines.
      function findRuns(value) {
        const target = normalize(value);
        const runs = [];
        if (target.length < 3) return runs;
        for (let i = 0; i < words.length; i++) {
          let acc = "";
          for (let j = i; j < words.length && words[j].page === words[i].page && j - i < 10; j++) {
            acc += words[j].norm;
            if (acc === target) { runs.push(words.slice(i, j + 1)); break; }
            if (!target.startsWith(acc)) break;
          }
        }
        return runs;
      }

      // Step 4: turn each run into one box per line, anchored to the baseline.
      const BAND_ABOVE = 19, BAND_BELOW = 6, PAD = 2;
      function lineBoxes(run) {
        const lines = [];
        let current = [run[0]];
        for (const word of run.slice(1)) {
          const sameLine = Math.abs(word.baseline - current[current.length - 1].baseline) <= 12;
          if (sameLine && word.x >= current[current.length - 1].x - 2) current.push(word);
          else { lines.push(current); current = [word]; }
        }
        lines.push(current);
        return lines.map((line) => {
          const baseline = Math.max(...line.map((w) => w.baseline));
          return [
            Math.min(...line.map((w) => w.x)) - PAD,
            baseline - BAND_ABOVE,
            Math.max(...line.map((w) => w.x + w.width)) + PAD,
            baseline + BAND_BELOW,
          ];
        });
      }

      const boxesByPage = new Map();
      for (const value of sensitive) {
        for (const run of findRuns(value)) {
          const page = run[0].page;
          if (!boxesByPage.has(page)) boxesByPage.set(page, []);
          boxesByPage.get(page).push(...lineBoxes(run));
        }
      }

      // Step 5: render each page to an image, paint the boxes, and rebuild the PDF.
      // Flattening to an image removes the text for good; a drawn box alone would not.
      const pageWidths = new Map(words.map((w) => [w.page, w.pageWidth]));
      const fonts = "./node_modules/pdfjs-dist/standard_fonts/";
      const pdf = await pdfjsLib.getDocument({ data: new Uint8Array(bytes), standardFontDataUrl: fonts }).promise;
      const out = await PDFDocument.create();
      for (let pageNo = 1; pageNo <= pdf.numPages; pageNo++) {
        const page = await pdf.getPage(pageNo);
        const base = page.getViewport({ scale: 1.0 }); // page size in PDF points
        // Render at whatever scale matches parse's pixel grid for this page.
        const scale = (pageWidths.get(pageNo) ?? base.width * 2) / base.width;
        const viewport = page.getViewport({ scale });
        const canvas = createCanvas(viewport.width, viewport.height);
        const ctx = canvas.getContext("2d");
        await page.render({ canvasContext: ctx, viewport, canvas }).promise;
        ctx.fillStyle = "#000";
        for (const [x0, y0, x1, y1] of boxesByPage.get(pageNo) || []) {
          ctx.fillRect(x0, y0, x1 - x0, y1 - y0);
        }
        const png = await out.embedPng(canvas.toBuffer("image/png"));
        const outPage = out.addPage([base.width, base.height]);
        outPage.drawImage(png, { x: 0, y: 0, width: base.width, height: base.height });
      }
      writeFileSync("redacted.pdf", await out.save());
      console.log("Wrote redacted.pdf");
      ```
    </Tab>
  </Tabs>
</Accordion>

## Step 1: Set Up Your Environment

Before writing any code, we need three things: an API key, a document to redact, and a few libraries for the chosen language.

### 1.1 Get an Unsiloed AI API Key

To get API access, [sign up on Unsiloed AI](https://cal.com/aman-mishra-p0ry57/15min). Export your key as an environment variable so it stays out of source control:

```bash theme={null}
export UNSILOED_API_KEY="your-api-key"
```

### 1.2 Pick a Document

The walkthrough assumes a PDF saved as `document.pdf` in your working directory. To follow along with the exact output shown below, download our [sample account statement](https://raw.githubusercontent.com/Unsiloed-AI/cookbook/9c80a90e0315a33c9b8a68d8b3355199771b598f/sample-documents/sample-statement.pdf): a one-page letter that repeats the account holder's name, account number, and email so we can prove every occurrence gets removed, not just the first.

### 1.3 Install Dependencies

The API calls are plain HTTP, but writing the redacted file means rendering the PDF, so each language needs a couple of libraries.

<Tabs>
  <Tab title="Python">
    You need Python 3.8 or newer. Install `requests` for the API calls and `PyMuPDF` for the redaction:

    ```bash theme={null}
    pip install requests PyMuPDF
    ```
  </Tab>

  <Tab title="JavaScript">
    You need Node.js 18 or newer for the global `fetch`, `FormData`, and `Blob`. Install the PDF rendering and writing libraries:

    ```bash theme={null}
    npm install pdfjs-dist @napi-rs/canvas pdf-lib
    ```

    `pdfjs-dist` renders each page, `@napi-rs/canvas` gives it a canvas to draw on (with prebuilt binaries, so there are no system dependencies), and `pdf-lib` assembles the redacted pages back into a PDF.
  </Tab>
</Tabs>

## Step 2: Find the Sensitive Values With Extraction

We start with extraction rather than a list of search terms because we don't want to hard-code the data we're protecting. A schema describes the kinds of information that are sensitive, and the extractor reads the actual values off the document. Point it at a different statement and it finds that person's details instead.

### 2.1 Describe the Sensitive Fields

Each field is a name and a description. The description tells the extractor what to look for, so the clearer it is, the more reliable the result. We're after four pieces of personal data.

<Tabs>
  <Tab title="Python">
    Create a file called `redact.py` with the configuration and schema:

    ```python redact.py theme={null}
    import json
    import os
    import re
    import time

    import fitz  # PyMuPDF
    import requests

    API_KEY = os.environ["UNSILOED_API_KEY"]
    BASE_URL = "https://prod.visionapi.unsiloed.ai"

    SCHEMA = {
        "type": "object",
        "properties": {
            "full_name": {"type": "string", "description": "The full name of the account holder"},
            "account_number": {"type": "string", "description": "The bank account number"},
            "email": {"type": "string", "description": "The account holder's email address"},
            "phone": {"type": "string", "description": "The account holder's phone number"},
        },
        "required": ["full_name", "account_number", "email", "phone"],
        "additionalProperties": False,
    }
    ```
  </Tab>

  <Tab title="JavaScript">
    Create a file called `redact.mjs` with the configuration and schema:

    ```javascript redact.mjs theme={null}
    import { readFileSync, writeFileSync } from "node:fs";
    import { createCanvas } from "@napi-rs/canvas";
    import { PDFDocument } from "pdf-lib";
    import * as pdfjsLib from "pdfjs-dist/legacy/build/pdf.mjs";

    const API_KEY = process.env.UNSILOED_API_KEY;
    const BASE_URL = "https://prod.visionapi.unsiloed.ai";

    const SCHEMA = {
      type: "object",
      properties: {
        full_name:      { type: "string", description: "The full name of the account holder" },
        account_number: { type: "string", description: "The bank account number" },
        email:          { type: "string", description: "The account holder's email address" },
        phone:          { type: "string", description: "The account holder's phone number" },
      },
      required: ["full_name", "account_number", "email", "phone"],
      additionalProperties: false,
    };
    ```
  </Tab>
</Tabs>

### 2.2 Submit the Document and Read the Values

Both endpoints in this recipe run asynchronously: we submit a job, get a `job_id`, and poll until it's done. Since we do that twice, it's worth a small helper. The extractor accepts `completed` as its done state; the parser uses `Succeeded`, so the helper checks for both. The failure side is just as inconsistent: a job can end up `failed`, `Failed`, or `Cancelled`, and the parser reports its reason in a `message` field where the extractor uses `error`. The helper covers all of those too, because otherwise a cancelled job would poll until the timeout and surface as a misleading "did not finish in time".

<Tabs>
  <Tab title="Python">
    Add the polling helper and the extract call:

    ```python redact.py theme={null}
    def wait_for(url):
        for _ in range(90):  # roughly 6 minutes at 4 seconds per poll
            job = requests.get(url, headers={"api-key": API_KEY}).json()
            status = job.get("status")
            if status in ("completed", "review", "Succeeded"):
                return job
            if status in ("failed", "Failed", "Cancelled"):
                # The extractor reports its reason in "error"; the parser uses "message".
                raise RuntimeError(job.get("error") or job.get("message") or f"job {status}")
            time.sleep(4)
        raise TimeoutError(f"Job at {url} did not finish in time")


    with open("document.pdf", "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/v2/extract",
            headers={"api-key": API_KEY},
            files={"pdf_file": ("document.pdf", f, "application/pdf")},
            data={"schema_data": json.dumps(SCHEMA), "model": "gamma"},
        )
    resp.raise_for_status()
    job = resp.json()
    result = wait_for(f"{BASE_URL}/extract/{job['job_id']}")["result"]
    sensitive = []
    for field in SCHEMA["properties"]:
        value = result[field]["value"]
        if value is None:
            print(f"Field '{field}' not found in the document, skipping.")
            continue
        sensitive.append(value)
    print("Will redact:", sensitive)
    ```
  </Tab>

  <Tab title="JavaScript">
    Add the polling helper and the extract call:

    ```javascript redact.mjs theme={null}
    async function waitFor(url) {
      for (let i = 0; i < 90; i++) { // roughly 6 minutes at 4 seconds per poll
        const job = await (await fetch(url, { headers: { "api-key": API_KEY } })).json();
        if (["completed", "review", "Succeeded"].includes(job.status)) return job;
        // The extractor reports its reason in "error"; the parser uses "message".
        if (["failed", "Failed", "Cancelled"].includes(job.status)) throw new Error(job.error || job.message || `job ${job.status}`);
        await new Promise((r) => setTimeout(r, 4000));
      }
      throw new Error(`Job at ${url} did not finish in time`);
    }

    const bytes = readFileSync("document.pdf");

    const extractForm = new FormData();
    extractForm.append("pdf_file", new Blob([bytes], { type: "application/pdf" }), "document.pdf");
    extractForm.append("schema_data", JSON.stringify(SCHEMA));
    extractForm.append("model", "gamma");
    const extractRes = await fetch(`${BASE_URL}/v2/extract`, { method: "POST", headers: { "api-key": API_KEY }, body: extractForm });
    if (!extractRes.ok) throw new Error(`extract submit failed: HTTP ${extractRes.status} ${await extractRes.text()}`);
    const extractJob = await extractRes.json();
    const result = (await waitFor(`${BASE_URL}/extract/${extractJob.job_id}`)).result;
    const sensitive = [];
    for (const field of Object.keys(SCHEMA.properties)) {
      const value = result[field].value;
      if (value == null) {
        console.log(`Field '${field}' not found in the document, skipping.`);
        continue;
      }
      sensitive.push(value);
    }
    console.log("Will redact:", sensitive);
    ```
  </Tab>
</Tabs>

Each entry in the extractor's `result` is an object with a `value`, so we pull out the values into a flat `sensitive` list. For the sample document, that prints:

```text theme={null}
Will redact: ['Jonathan Hale', '4471-8829-0007', 'jhale@example.com', '(415) 555-0142']
```

These are the strings we now need to find and remove wherever they appear.

<Warning>
  Leave the extractor's `detect_pii` parameter at its default of `false` for this recipe. That flag exists to block extraction from documents that contain PII: with it enabled, the endpoint returns HTTP 200 with `job_id: null` and `status: "pii_blocked"` instead of starting a job, and the polling step fails. Extracting PII is the point of this recipe, so the gate has to stay off.
</Warning>

## Step 3: Locate Every Occurrence With Parsing

Extraction told us what is sensitive. It does not tell us where each value sits on the page, or how many times it appears. For that we parse the document. Parsing returns the page broken into segments, and within each segment an `ocr` array giving every word along with its bounding box. That word-level detail is what lets us draw a box around each occurrence.

### 3.1 Parse the Document

Send the same file to `/parse`. Note the form field is `file` here, not `pdf_file`: the parser and the extractor use different field names. We leave `ocr_strategy` at its default `auto_detection`: it returns word-level `ocr` boxes for digital and scanned PDFs alike, and in our testing `force_ocr` degrades the geometry to line-level boxes, which makes redaction miss values.

<Tabs>
  <Tab title="Python">
    Add the parse call:

    ```python redact.py theme={null}
    with open("document.pdf", "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/parse",
            headers={"api-key": API_KEY},
            files={"file": ("document.pdf", f, "application/pdf")},
        )
    resp.raise_for_status()
    job = resp.json()
    parsed = wait_for(f"{BASE_URL}/parse/{job['job_id']}")
    ```
  </Tab>

  <Tab title="JavaScript">
    Add the parse call:

    ```javascript redact.mjs theme={null}
    const parseForm = new FormData();
    parseForm.append("file", new Blob([bytes], { type: "application/pdf" }), "document.pdf");
    const parseRes = await fetch(`${BASE_URL}/parse`, { method: "POST", headers: { "api-key": API_KEY }, body: parseForm });
    if (!parseRes.ok) throw new Error(`parse submit failed: HTTP ${parseRes.status} ${await parseRes.text()}`);
    const parseJob = await parseRes.json();
    const parsed = await waitFor(`${BASE_URL}/parse/${parseJob.job_id}`);
    ```
  </Tab>
</Tabs>

### 3.2 Build a Flat List of Words

A word's `ocr` box is given relative to its segment, so its position on the page is the segment's top-left corner plus the word's own offset. We flatten every word into one list, recording its normalized text, its page, and a box.

Two details matter for drawing clean boxes later:

* **The reported word height is unreliable.** Some boxes come back only a couple of pixels tall. The bottom edge, `top + height`, sits on the text baseline consistently, so we store that baseline and reconstruct a full-height box from it in Step 4.
* **Parse boxes are in render pixels,** measured against the rendered page that `page_width` and `page_height` describe. We keep each word's `page_width` so we can compute the exact pixels-to-points ratio when we redact.

<Tabs>
  <Tab title="Python">
    Add the normalizer and the word-collection loop:

    ```python redact.py theme={null}
    def normalize(text):
        return re.sub(r"[^a-z0-9]", "", str(text).lower())


    words = []
    for chunk in parsed["chunks"]:
        for seg in chunk["segments"]:
            sx, sy = seg["bbox"]["left"], seg["bbox"]["top"]
            for word in seg.get("ocr") or []:
                norm = normalize(word["text"])
                if not norm:
                    continue
                b = word["bbox"]
                words.append({
                    "norm": norm,
                    "page": seg["page_number"],
                    "page_width": seg["page_width"],
                    "x": sx + b["left"],
                    "width": b["width"],
                    "baseline": sy + b["top"] + b["height"],
                })

    # With no words there is nothing to match, and the script would "succeed"
    # by writing an unredacted copy. Fail loudly instead.
    if not words:
        raise RuntimeError("Parse returned no OCR words; refusing to write an unredacted PDF")
    ```
  </Tab>

  <Tab title="JavaScript">
    Add the normalizer and the word-collection loop:

    ```javascript redact.mjs theme={null}
    const normalize = (text) => String(text).toLowerCase().replace(/[^a-z0-9]/g, "");
    const words = [];
    for (const chunk of parsed.chunks) {
      for (const seg of chunk.segments) {
        const { left: sx, top: sy } = seg.bbox;
        for (const word of seg.ocr || []) {
          const norm = normalize(word.text);
          if (!norm) continue;
          const b = word.bbox;
          words.push({
            norm,
            page: seg.page_number,
            pageWidth: seg.page_width,
            x: sx + b.left,
            width: b.width,
            baseline: sy + b.top + b.height,
          });
        }
      }
    }

    // With no words there is nothing to match, and the script would "succeed"
    // by writing an unredacted copy. Fail loudly instead.
    if (words.length === 0) throw new Error("Parse returned no OCR words; refusing to write an unredacted PDF");
    ```
  </Tab>
</Tabs>

We normalize text down to lowercase letters and digits, dropping spaces and punctuation. That's what makes the matching in the next step robust.

The guard at the end matters: every later step quietly does nothing when `words` is empty, so without it the script would still write a `redacted.pdf` with nothing redacted. A redaction script should fail loudly rather than produce a clean-looking file that leaks everything.

## Step 4: Match Values to Their Locations

Now we find each sensitive value in the word list. A naive equality check breaks down quickly: `4471-8829-0007` might arrive as one OCR token or several, an email keeps its `@`, and a phone number can wrap across a line break. So instead of matching word by word, we match by **normalized concatenation**. Starting at each word, we glue the normalized words together one at a time until the running string equals the normalized target, and stop early the moment it can no longer lead to a match.

### 4.1 Find Every Run of Words That Spells a Value

<Tabs>
  <Tab title="Python">
    Add the matcher:

    ```python redact.py theme={null}
    def find_runs(value):
        target = normalize(value)
        runs = []
        if len(target) < 3:
            return runs
        for i in range(len(words)):
            acc = ""
            for j in range(i, min(i + 10, len(words))):
                if words[j]["page"] != words[i]["page"]:
                    break
                acc += words[j]["norm"]
                if acc == target:
                    runs.append(words[i:j + 1])
                    break
                if not target.startswith(acc):
                    break
        return runs
    ```
  </Tab>

  <Tab title="JavaScript">
    Add the matcher:

    ```javascript redact.mjs theme={null}
    function findRuns(value) {
      const target = normalize(value);
      const runs = [];
      if (target.length < 3) return runs;
      for (let i = 0; i < words.length; i++) {
        let acc = "";
        for (let j = i; j < words.length && words[j].page === words[i].page && j - i < 10; j++) {
          acc += words[j].norm;
          if (acc === target) { runs.push(words.slice(i, j + 1)); break; }
          if (!target.startsWith(acc)) break;
        }
      }
      return runs;
    }
    ```
  </Tab>
</Tabs>

Each value can return several runs, one per occurrence. We skip targets shorter than three characters so a stray initial can't trigger a flood of matches.

## Step 5: Black Out Every Match

A run is a list of words. To redact it we need rectangles, and we need them to behave at two awkward edges: a value can wrap across a line, and individual word heights are unreliable. We handle both by grouping each run into lines and anchoring every box to the baseline we stored earlier.

### 5.1 Turn a Run Into One Box per Line

We walk the run's words, starting a new line whenever the baseline jumps or the text steps back to the left margin. For each line we take the horizontal extent from the words and a fixed band above the baseline (plus a little below for descenders).

<Tabs>
  <Tab title="Python">
    Add the box builder:

    ```python redact.py theme={null}
    BAND_ABOVE, BAND_BELOW, PAD = 19, 6, 2


    def line_boxes(run):
        lines, current = [], [run[0]]
        for word in run[1:]:
            same_line = abs(word["baseline"] - current[-1]["baseline"]) <= 12
            if same_line and word["x"] >= current[-1]["x"] - 2:
                current.append(word)
            else:
                lines.append(current)
                current = [word]
        lines.append(current)
        boxes = []
        for line in lines:
            baseline = max(w["baseline"] for w in line)
            boxes.append((
                min(w["x"] for w in line) - PAD,
                baseline - BAND_ABOVE,
                max(w["x"] + w["width"] for w in line) + PAD,
                baseline + BAND_BELOW,
            ))
        return boxes
    ```
  </Tab>

  <Tab title="JavaScript">
    Add the box builder:

    ```javascript redact.mjs theme={null}
    const BAND_ABOVE = 19, BAND_BELOW = 6, PAD = 2;
    function lineBoxes(run) {
      const lines = [];
      let current = [run[0]];
      for (const word of run.slice(1)) {
        const sameLine = Math.abs(word.baseline - current[current.length - 1].baseline) <= 12;
        if (sameLine && word.x >= current[current.length - 1].x - 2) current.push(word);
        else { lines.push(current); current = [word]; }
      }
      lines.push(current);
      return lines.map((line) => {
        const baseline = Math.max(...line.map((w) => w.baseline));
        return [
          Math.min(...line.map((w) => w.x)) - PAD,
          baseline - BAND_ABOVE,
          Math.max(...line.map((w) => w.x + w.width)) + PAD,
          baseline + BAND_BELOW,
        ];
      });
    }
    ```
  </Tab>
</Tabs>

### 5.2 Remove the Text

This is where the two languages diverge, because covering text and removing it are not the same thing. A black rectangle painted over a PDF leaves the original characters in the file, selectable and searchable underneath. To redact for real you either delete the content or destroy it.

* **Python** uses PyMuPDF's redaction annotations. `apply_redactions` deletes the underlying text and image data inside each box, then fills it black. The output stays a normal PDF, and everything outside the boxes is left untouched and still selectable.
* **JavaScript** renders each page to an image, paints the boxes onto the pixels, and rebuilds the PDF from those images. Once a page is an image, there is no text layer left to recover.

<Tabs>
  <Tab title="Python">
    Add the redaction loop and save. We scale each box from parse's render pixels back to PDF points using the ratio of `page_width` to the page's width in points. Deletion and drawing use different rectangles on purpose: `apply_redactions` removes any text that touches the redaction rect, even by a fraction of a point, so we shrink the deletion rect two points clear of the neighboring lines (otherwise a box can silently delete words from the line above or below), then draw the full-height black box as plain ink:

    ```python redact.py theme={null}
    doc = fitz.open("document.pdf")
    boxes_by_page = {}
    for value in sensitive:
        for run in find_runs(value):
            page_index = run[0]["page"] - 1
            scale = run[0]["page_width"] / doc[page_index].rect.width
            for x0, y0, x1, y1 in line_boxes(run):
                rect = fitz.Rect(x0 / scale, y0 / scale, x1 / scale, y1 / scale)
                boxes_by_page.setdefault(page_index, []).append(rect)

    for page_index, rects in boxes_by_page.items():
        page = doc[page_index]
        for rect in rects:
            page.add_redact_annot(rect + (0, 2, 0, -2), fill=(0, 0, 0))
        page.apply_redactions()
        for rect in rects:
            page.draw_rect(rect, fill=(0, 0, 0), color=None)
    doc.save("redacted.pdf")
    print("Wrote redacted.pdf")
    ```

    Run it:

    ```bash theme={null}
    python redact.py
    ```
  </Tab>

  <Tab title="JavaScript">
    Group the boxes by page, then render, paint, and reassemble. We derive each page's render scale from the `page_width` that parse reported, so the canvas matches parse's pixel grid exactly. When placing each image we use the page's size in points, so the rebuilt page keeps its original dimensions:

    ```javascript redact.mjs theme={null}
    const boxesByPage = new Map();
    for (const value of sensitive) {
      for (const run of findRuns(value)) {
        const page = run[0].page;
        if (!boxesByPage.has(page)) boxesByPage.set(page, []);
        boxesByPage.get(page).push(...lineBoxes(run));
      }
    }

    const pageWidths = new Map(words.map((w) => [w.page, w.pageWidth]));
    const fonts = "./node_modules/pdfjs-dist/standard_fonts/";
    const pdf = await pdfjsLib.getDocument({ data: new Uint8Array(bytes), standardFontDataUrl: fonts }).promise;
    const out = await PDFDocument.create();
    for (let pageNo = 1; pageNo <= pdf.numPages; pageNo++) {
      const page = await pdf.getPage(pageNo);
      const base = page.getViewport({ scale: 1.0 });
      const scale = (pageWidths.get(pageNo) ?? base.width * 2) / base.width;
      const viewport = page.getViewport({ scale });
      const canvas = createCanvas(viewport.width, viewport.height);
      const ctx = canvas.getContext("2d");
      await page.render({ canvasContext: ctx, viewport, canvas }).promise;
      ctx.fillStyle = "#000";
      for (const [x0, y0, x1, y1] of boxesByPage.get(pageNo) || []) {
        ctx.fillRect(x0, y0, x1 - x0, y1 - y0);
      }
      const png = await out.embedPng(canvas.toBuffer("image/png"));
      const outPage = out.addPage([base.width, base.height]);
      outPage.drawImage(png, { x: 0, y: 0, width: base.width, height: base.height });
    }
    writeFileSync("redacted.pdf", await out.save());
    console.log("Wrote redacted.pdf");
    ```

    Run it:

    ```bash theme={null}
    node redact.mjs
    ```
  </Tab>
</Tabs>

## Step 6: Confirm the Text Is Really Gone

The point of redaction is that the data can't be recovered, so it's worth checking rather than trusting the visual. Open `redacted.pdf` and try to select the blacked-out text: nothing should be there. You can confirm the same thing in code by pulling the text layer back out of the result.

With Python and PyMuPDF:

```python theme={null}
import fitz

text = " ".join(page.get_text() for page in fitz.open("redacted.pdf"))
for value in ["Jonathan Hale", "4471-8829-0007", "jhale@example.com"]:
    print(value, "->", "REMOVED" if value not in text else "STILL PRESENT")
```

For the sample document this prints `REMOVED` for every value, while the parts we didn't target, like the closing balance and the dates, are still present and selectable:

```text theme={null}
Jonathan Hale -> REMOVED
4471-8829-0007 -> REMOVED
jhale@example.com -> REMOVED
```

The JavaScript output is image-only, so it has no text layer at all; selecting anywhere on the page returns nothing.

## Where to Take This Next

The schema is the lever here. Add fields for any other data you need gone, such as a date of birth, a postal address, or a national ID number, and the same pipeline finds and removes it.

A few directions to take it further:

* **Redact signatures.** Parsing labels handwritten signatures with a `Signature` segment type, but only when you submit the parse job with `layout_analysis=advanced_layout_detection`; the default layout analysis we use in this recipe never returns it. With that parameter set, you can black out each signature segment's box directly, without an extraction step, since the segment already carries its location.
* **Handle scanned documents.** Because the locations come from the parser's OCR rather than the PDF text layer, the same script works on scans and photos, not just digital PDFs.
* **Review before you ship.** For a human-in-the-loop workflow, draw the boxes in a bright color first and have a reviewer confirm them, then switch to black once the set is approved.

<CardGroup cols={2}>
  <Card title="Extraction" icon="brackets-curly" href="/document-processing/extraction/extraction">
    How schemas drive extraction, including nested objects and arrays.
  </Card>

  <Card title="Parsing" icon="file-lines" href="/document-processing/parsing/parsing">
    The full parse response, segment types, and word-level OCR boxes.
  </Card>

  <Card title="Element Types" icon="shapes" href="/document-processing/parsing/element-types">
    Every segment type the parser can return, including `Signature`.
  </Card>

  <Card title="API Reference" icon="code" href="/api-reference/extraction/extract-data">
    Browse the full request and response specs for `/v2/extract` and `/parse`.
  </Card>
</CardGroup>
