# Sort and Extract a Mixed Document Pile

> Split a PDF that holds several different documents into one file per type, then run a type-specific extraction schema over each.

A scanned batch or a shared inbox rarely holds one clean document. More often it holds an invoice, a receipt, and a purchase order merged into a single PDF, in no particular order. You can't run one extraction schema over the whole thing, because the fields you want from an invoice aren't the fields you want from a receipt.

This recipe handles that in two stages. First we split the pile into one file per document type, which also tells us what each file is. Then we run a different extraction schema over each file, chosen by its type. The result is a tidy object keyed by document type, with the right fields pulled from each.

<Note>
  This recipe chains two endpoints covered on their own in the [Splitting quickstart](/document-processing/splitting/quickstart) and the [Extract quickstart](/document-processing/extraction/quickstart). Read those first if you want the detail on either call in isolation.
</Note>

## What We'll Build

A script that:

1. Sends a multi-document PDF to `/splitter` along with the document types we expect to find.
2. Gets back one PDF per type, each labelled with its category and hosted at a URL.
3. Picks a matching extraction schema for each file and sends it to `/v2/extract` by URL.
4. Collects the extracted fields into a single object, keyed by document type.

We'll run it against a three-page sample that holds an invoice, a receipt, and a purchase order. Grab the full script from the dropdown below if you'd rather skip the walkthrough.

<Accordion title="Show the Full Script">
  Set `UNSILOED_API_KEY` in your environment and save the document pile as `pile.pdf` in the same directory before running.

  <Tabs>
    <Tab title="Python">
      ```python sort_and_extract.py theme={null}
      import json
      import os
      import time
      import requests

      API_KEY = os.environ["UNSILOED_API_KEY"]
      BASE_URL = "https://prod.visionapi.unsiloed.ai"

      # The document types we expect in the pile. The description guides the split.
      CATEGORIES = [
          {"name": "Invoice", "description": "A bill issued by a seller requesting payment for goods or services"},
          {"name": "Receipt", "description": "Proof of a completed purchase from a store or point of sale"},
          {"name": "Purchase Order", "description": "A buyer-issued document authorizing a purchase from a vendor"},
      ]

      # One extraction schema per type. The keys match the category names above.
      SCHEMAS = {
          "Invoice": {
              "type": "object",
              "properties": {
                  "vendor_name": {"type": "string", "description": "Company issuing the invoice (the seller)"},
                  "invoice_number": {"type": "string", "description": "Invoice identifier"},
                  "total_due": {"type": "number", "description": "Total amount due in US dollars"},
              },
              "required": ["vendor_name", "invoice_number", "total_due"],
              "additionalProperties": False,
          },
          "Receipt": {
              "type": "object",
              "properties": {
                  "store_name": {"type": "string", "description": "Name of the store"},
                  "transaction_id": {"type": "string", "description": "Transaction or receipt ID"},
                  "total": {"type": "number", "description": "Total amount paid in US dollars"},
              },
              "required": ["store_name", "transaction_id", "total"],
              "additionalProperties": False,
          },
          "Purchase Order": {
              "type": "object",
              "properties": {
                  "po_number": {"type": "string", "description": "Purchase order number"},
                  "vendor_name": {"type": "string", "description": "Vendor the order is placed with"},
                  "required_by": {"type": "string", "description": "Date the order is required by"},
              },
              "required": ["po_number", "vendor_name", "required_by"],
              "additionalProperties": False,
          },
      }

      def wait_for(url):
          """Poll a job URL until it finishes, then return the completed response."""
          for _ in range(90):  # roughly 6 minutes at 4 seconds per poll
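              resp = requests.get(url, headers={"api-key": API_KEY})
              resp.raise_for_status()  # surface auth or routing errors instead of polling until timeout
              job = resp.json()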
              if job.get("status") in ("completed", "review"):
                  return job
              if job.get("status") == "failed":
                  raise RuntimeError(job.get("error", "job failed"))
              time.sleep(4)
          raise TimeoutError(f"Job at {url} did not finish in time")

      # Stage 1: split the pile into one file per document type.
      with open("pile.pdf", "rb") as f:
          resp = requests.post(
              f"{BASE_URL}/splitter",
              headers={"api-key": API_KEY},
              files={"file": ("pile.pdf", f, "application/pdf")},
              data={"categories": json.dumps(CATEGORIES)},
          )
      resp.raise_for_status()
      job = resp.json()
      sections = wait_for(f"{BASE_URL}/splitter/{job['job_id']}")["result"]["files"]
      print(f"Split into {len(sections)} documents: {[s['name'] for s in sections]}")

      # Stage 2: extract type-specific fields from each section, by URL.
      extract_jobs = {}
      for section in sections:
          category = os.path.splitext(section["name"])[0]
          schema = SCHEMAS.get(category)
          if schema is None:
              print(f"No schema for '{category}', skipping.")
              continue
          resp = requests.post(
              f"{BASE_URL}/v2/extract",
              headers={"api-key": API_KEY},
              data={
                  "file_url": section["full_path"],
                  "schema_data": json.dumps(schema),
                  "model": "gamma",
              },
          )
          resp.raise_for_status()
          extract_jobs[category] = resp.json()["job_id"]

      results = {}
      for category, job_id in extract_jobs.items():
          done = wait_for(f"{BASE_URL}/extract/{job_id}")
          results[category] = {field: cell["value"] for field, cell in done["result"].items()}

      print(json.dumps(results, indent=2))
      ```
    </Tab>

    <Tab title="JavaScript">
      Save this as `sort_and_extract.mjs` or set `"type": "module"` in your `package.json`. Requires Node.js 18 or newer for the global `fetch`, `FormData`, and `Blob`.

      ```javascript sort_and_extract.mjs theme={null}
      import fs from "node:fs";

      const API_KEY = process.env.UNSILOED_API_KEY;
      const BASE_URL = "https://prod.visionapi.unsiloed.ai";

      // The document types we expect in the pile. The description guides the split.
      const CATEGORIES = [
        { name: "Invoice", description: "A bill issued by a seller requesting payment for goods or services" },
        { name: "Receipt", description: "Proof of a completed purchase from a store or point of sale" },
        { name: "Purchase Order", description: "A buyer-issued document authorizing a purchase from a vendor" },
      ];

      // One extraction schema per type. The keys match the category names above.
      const SCHEMAS = {
        Invoice: {
          type: "object",
          properties: {
            vendor_name:    { type: "string", description: "Company issuing the invoice (the seller)" },
            invoice_number: { type: "string", description: "Invoice identifier" },
            total_due:      { type: "number", description: "Total amount due in US dollars" },
          },
          required: ["vendor_name", "invoice_number", "total_due"],
          additionalProperties: false,
        },
        Receipt: {
          type: "object",
          properties: {
            store_name:     { type: "string", description: "Name of the store" },
            transaction_id: { type: "string", description: "Transaction or receipt ID" },
            total:          { type: "number", description: "Total amount paid in US dollars" },
          },
          required: ["store_name", "transaction_id", "total"],
          additionalProperties: false,
        },
        "Purchase Order": {
          type: "object",
          properties: {
            po_number:   { type: "string", description: "Purchase order number" },
            vendor_name: { type: "string", description: "Vendor the order is placed with" },
            required_by: { type: "string", description: "Date the order is required by" },
          },
          required: ["po_number", "vendor_name", "required_by"],
          additionalProperties: false,
        },
      };

      async function waitFor(url) {
        // Poll a job URL until it finishes, then return the completed response.
        for (let i = 0; i < 90; i++) { // roughly 6 minutes at 4 seconds per poll
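          const res = await fetch(url, { headers: { "api-key": API_KEY } });
          // Surface auth or routing errors instead of polling until timeout.
          if (!res.ok) throw new Error(`poll failed: HTTP ${res.status} ${await res.text()}`);
          const job = await res.json();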
          if (["completed", "review"].includes(job.status)) return job;
          if (job.status === "failed") throw new Error(job.error || "job failed");
          await new Promise((r) => setTimeout(r, 4000));
        }
        throw new Error(`Job at ${url} did not finish in time`);
      }

      // Stage 1: split the pile into one file per document type.
      const splitForm = new FormData();
      splitForm.append("file", new Blob([fs.readFileSync("pile.pdf")], { type: "application/pdf" }), "pile.pdf");
      splitForm.append("categories", JSON.stringify(CATEGORIES));
      const splitRes = await fetch(`${BASE_URL}/splitter`, {
        method: "POST",
        headers: { "api-key": API_KEY },
        body: splitForm,
      });
      if (!splitRes.ok) throw new Error(`split submit failed: HTTP ${splitRes.status} ${await splitRes.text()}`);
      const splitJob = await splitRes.json();
      const sections = (await waitFor(`${BASE_URL}/splitter/${splitJob.job_id}`)).result.files;
      console.log(`Split into ${sections.length} documents: ${sections.map((s) => s.name).join(", ")}`);

      // Stage 2: extract type-specific fields from each section, by URL.
      const extractJobs = {};
      for (const section of sections) {
        const category = section.name.replace(/\.pdf$/, "");
        const schema = SCHEMAS[category];
        if (!schema) {
          console.log(`No schema for '${category}', skipping.`);
          continue;
        }
        const extractForm = new FormData();
        extractForm.append("file_url", section.full_path);
        extractForm.append("schema_data", JSON.stringify(schema));
        extractForm.append("model", "gamma");
        const extractRes = await fetch(`${BASE_URL}/v2/extract`, {
          method: "POST",
          headers: { "api-key": API_KEY },
          body: extractForm,
        });
        if (!extractRes.ok) throw new Error(`extract submit failed: HTTP ${extractRes.status} ${await extractRes.text()}`);
        extractJobs[category] = (await extractRes.json()).job_id;
      }

      const results = {};
      for (const [category, jobId] of Object.entries(extractJobs)) {
        const done = await waitFor(`${BASE_URL}/extract/${jobId}`);
        results[category] = Object.fromEntries(
          Object.entries(done.result).map(([field, cell]) => [field, cell.value]),
        );
      }

      console.log(JSON.stringify(results, null, 2));
      ```
    </Tab>
  </Tabs>
</Accordion>

## Step 1: Set Up Your Environment

Before writing any code, we need three things: an API key, a document pile, and the runtime for our chosen language.

### 1.1 Get an Unsiloed AI API Key

To get API access, [sign up on Unsiloed AI](https://cal.com/aman-mishra-p0ry57/15min). Export your key as an environment variable so it stays out of source control:

```bash theme={null}
export UNSILOED_API_KEY="your-api-key"
```

### 1.2 Pick a Document Pile

The walkthrough assumes a PDF saved as `pile.pdf` in your working directory. To follow along with the exact output shown below, download our [sample document pile](https://raw.githubusercontent.com/Unsiloed-AI/cookbook/9c80a90e0315a33c9b8a68d8b3355199771b598f/sample-documents/sample-split.pdf): a three-page PDF that holds an invoice, a store receipt, and a purchase order, one per page. The three documents are unrelated and in mixed order, which is exactly the case this recipe is built for.

### 1.3 Install Dependencies

<Tabs>
  <Tab title="Python">
    You need Python 3.8 or newer. Install the `requests` package:

    ```bash theme={null}
    pip install requests
    ```
  </Tab>

  <Tab title="JavaScript">
    You need Node.js 18 or newer for the global `fetch`, `FormData`, and `Blob`. No external packages needed.
  </Tab>
</Tabs>

## Step 2: Split the Pile by Document Type

The splitter takes the whole PDF plus a list of the document types we expect, and returns one new PDF per type. Each returned file is labelled with the category it matched and hosted at a URL we can hand straight to the extractor in the next step, so there's no need to save anything to disk in between.

### 2.1 Describe the Document Types

A category is a `name` and a `description`. The description does the work: it tells the splitter how to recognize each type, so the clearer it is, the more reliable the split. We're looking for three types.

<Tabs>
  <Tab title="Python">
    Create a file called `sort_and_extract.py` with the configuration and categories:

    ```python sort_and_extract.py theme={null}
    import json
    import os
    import time
    import requests

    API_KEY = os.environ["UNSILOED_API_KEY"]
    BASE_URL = "https://prod.visionapi.unsiloed.ai"

    CATEGORIES = [
        {"name": "Invoice", "description": "A bill issued by a seller requesting payment for goods or services"},
        {"name": "Receipt", "description": "Proof of a completed purchase from a store or point of sale"},
        {"name": "Purchase Order", "description": "A buyer-issued document authorizing a purchase from a vendor"},
    ]
    ```
  </Tab>

  <Tab title="JavaScript">
    Create a file called `sort_and_extract.mjs` with the configuration and categories:

    ```javascript sort_and_extract.mjs theme={null}
    import fs from "node:fs";

    const API_KEY = process.env.UNSILOED_API_KEY;
    const BASE_URL = "https://prod.visionapi.unsiloed.ai";

    const CATEGORIES = [
      { name: "Invoice", description: "A bill issued by a seller requesting payment for goods or services" },
      { name: "Receipt", description: "Proof of a completed purchase from a store or point of sale" },
      { name: "Purchase Order", description: "A buyer-issued document authorizing a purchase from a vendor" },
    ];
    ```
  </Tab>
</Tabs>
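
Because the category names later double as lookup keys for the extraction schemas, a quick local check before submitting can catch an empty description or a duplicated name early. The sketch below is a hypothetical helper (`check_categories` is not part of the API), and the rules it enforces are our own assumptions about what makes a usable category list, not documented constraints of `/splitter`.

```python
def check_categories(categories):
    """Return a list of problems found in a splitter category list (empty if none)."""
    problems = []
    seen = set()
    for i, cat in enumerate(categories):
        name = str(cat.get("name", "")).strip()
        description = str(cat.get("description", "")).strip()
        if not name:
            problems.append(f"category {i} has no name")
        elif name in seen:
            # Duplicate names would collide when used as schema lookup keys.
            problems.append(f"duplicate category name: {name}")
        seen.add(name)
        if not description:
            # The description is what guides the split, so flag blanks.
            problems.append(f"category {name or i} has no description")
    return problems
```

Calling `check_categories(CATEGORIES)` on the list above returns an empty list, so the script can proceed to the split.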

### 2.2 Submit the Pile and Wait for the Split

Both endpoints in this recipe run asynchronously: we submit a job, get a `job_id`, and poll until it's done. Since we poll once for the split and once per extract job, it's worth a small helper. Then we post the pile to `/splitter` with the categories as a JSON string under the `categories` field.

<Tabs>
  <Tab title="Python">
    Add the polling helper and the split call:

    ```python sort_and_extract.py theme={null}
    def wait_for(url):
        for _ in range(90):  # roughly 6 minutes at 4 seconds per poll
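            resp = requests.get(url, headers={"api-key": API_KEY})
            resp.raise_for_status()  # surface auth or routing errors instead of polling until timeout
            job = resp.json()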
            if job.get("status") in ("completed", "review"):
                return job
            if job.get("status") == "failed":
                raise RuntimeError(job.get("error", "job failed"))
            time.sleep(4)
        raise TimeoutError(f"Job at {url} did not finish in time")

    with open("pile.pdf", "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/splitter",
            headers={"api-key": API_KEY},
            files={"file": ("pile.pdf", f, "application/pdf")},
            data={"categories": json.dumps(CATEGORIES)},
        )
    resp.raise_for_status()
    job = resp.json()
    sections = wait_for(f"{BASE_URL}/splitter/{job['job_id']}")["result"]["files"]
    print(f"Split into {len(sections)} documents: {[s['name'] for s in sections]}")
    ```

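    Note that the form field for the upload is `file` here, not `pdf_file`. The splitter and the extractor use different field names for file uploads, which is easy to miss because this recipe only ever sends the extractor a `file_url`.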
  </Tab>

  <Tab title="JavaScript">
    Add the polling helper and the split call:

    ```javascript sort_and_extract.mjs theme={null}
    async function waitFor(url) {
      for (let i = 0; i < 90; i++) { // roughly 6 minutes at 4 seconds per poll
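        const res = await fetch(url, { headers: { "api-key": API_KEY } });
        // Surface auth or routing errors instead of polling until timeout.
        if (!res.ok) throw new Error(`poll failed: HTTP ${res.status} ${await res.text()}`);
        const job = await res.json();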
        if (["completed", "review"].includes(job.status)) return job;
        if (job.status === "failed") throw new Error(job.error || "job failed");
        await new Promise((r) => setTimeout(r, 4000));
      }
      throw new Error(`Job at ${url} did not finish in time`);
    }

    const splitForm = new FormData();
    splitForm.append("file", new Blob([fs.readFileSync("pile.pdf")], { type: "application/pdf" }), "pile.pdf");
    splitForm.append("categories", JSON.stringify(CATEGORIES));
    const splitRes = await fetch(`${BASE_URL}/splitter`, {
      method: "POST",
      headers: { "api-key": API_KEY },
      body: splitForm,
    });
    if (!splitRes.ok) throw new Error(`split submit failed: HTTP ${splitRes.status} ${await splitRes.text()}`);
    const splitJob = await splitRes.json();
    const sections = (await waitFor(`${BASE_URL}/splitter/${splitJob.job_id}`)).result.files;
    console.log(`Split into ${sections.length} documents: ${sections.map((s) => s.name).join(", ")}`);
    ```

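    Note that the form field for the upload is `file` here, not `pdf_file`. The splitter and the extractor use different field names for file uploads, which is easy to miss because this recipe only ever sends the extractor a `file_url`.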
  </Tab>
</Tabs>
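
If you want to exercise the polling logic without hitting the API, one option is to pass in the function that fetches the job. The variant below is a sketch of that idea, not a replacement for the helper above: `wait_for_job`, `fetch_job`, and the `sleep` parameter are names introduced here for illustration, while the status values mirror the ones the recipe already checks.

```python
import time


def wait_for_job(fetch_job, max_polls=90, interval=4.0, sleep=time.sleep):
    """Call fetch_job() until the job reaches a terminal status, then return it."""
    for _ in range(max_polls):
        job = fetch_job()
        status = job.get("status")
        if status in ("completed", "review"):
            return job
        if status == "failed":
            raise RuntimeError(job.get("error", "job failed"))
        sleep(interval)  # wait before asking again
    raise TimeoutError(f"job did not finish after {max_polls} polls")
```

In the script you would call it as `wait_for_job(lambda: requests.get(url, headers={"api-key": API_KEY}).json())`. In a test you can hand it a function that returns canned dictionaries and a no-op `sleep`.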

Each entry in `sections` describes one split-out document. The two fields we use are `name` (the category it matched, with a `.pdf` suffix, such as `Invoice.pdf`) and `full_path` (a URL to the new single-document PDF). Those URLs are [signed S3 links that expire roughly an hour after the response](/document-processing/splitting/response-format#file-object), so we use them right away in the next step rather than storing them. If a link does lapse, re-issue `GET /splitter/{job_id}` to get fresh ones.
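
If you do need to hold a link for a while, it can help to know when it lapses. The sketch below assumes the links are standard AWS Signature Version 4 presigned URLs, which carry `X-Amz-Date` and `X-Amz-Expires` query parameters. The docs here only say they are signed S3 links, so treat that as an assumption; the helper name `signed_url_expiry` is ours, and it returns `None` when the parameters are absent.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def signed_url_expiry(url):
    """Return the UTC expiry of a SigV4-style presigned URL, or None if unknown."""
    params = parse_qs(urlparse(url).query)
    try:
        # X-Amz-Date is the signing time, e.g. 20260101T120000Z.
        issued = datetime.strptime(params["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
        # X-Amz-Expires is the link lifetime in seconds.
        lifetime = int(params["X-Amz-Expires"][0])
    except (KeyError, ValueError):
        return None  # not a SigV4-style link, or the parameters are malformed
    return issued.replace(tzinfo=timezone.utc) + timedelta(seconds=lifetime)
```

Comparing the returned value with `datetime.now(timezone.utc)` tells you whether to re-fetch the split job before submitting an extract.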

## Step 3: Extract the Right Fields From Each Document

Now we loop over the split files. For each one, we look up the schema that matches its type and send the file's URL to `/v2/extract`. Passing `file_url` means the extractor pulls the document straight from the split result, so we never download and re-upload it.

### 3.1 Define a Schema per Type

Each document type wants different fields. We keep one schema per category in a dictionary, keyed by the same names we gave the splitter, so we can look up the right schema by the file's category.

<Tabs>
  <Tab title="Python">
    Add the schema map below the categories:

    ```python sort_and_extract.py theme={null}
    SCHEMAS = {
        "Invoice": {
            "type": "object",
            "properties": {
                "vendor_name": {"type": "string", "description": "Company issuing the invoice (the seller)"},
                "invoice_number": {"type": "string", "description": "Invoice identifier"},
                "total_due": {"type": "number", "description": "Total amount due in US dollars"},
            },
            "required": ["vendor_name", "invoice_number", "total_due"],
            "additionalProperties": False,
        },
        "Receipt": {
            "type": "object",
            "properties": {
                "store_name": {"type": "string", "description": "Name of the store"},
                "transaction_id": {"type": "string", "description": "Transaction or receipt ID"},
                "total": {"type": "number", "description": "Total amount paid in US dollars"},
            },
            "required": ["store_name", "transaction_id", "total"],
            "additionalProperties": False,
        },
        "Purchase Order": {
            "type": "object",
            "properties": {
                "po_number": {"type": "string", "description": "Purchase order number"},
                "vendor_name": {"type": "string", "description": "Vendor the order is placed with"},
                "required_by": {"type": "string", "description": "Date the order is required by"},
            },
            "required": ["po_number", "vendor_name", "required_by"],
            "additionalProperties": False,
        },
    }
    ```
  </Tab>

  <Tab title="JavaScript">
    Add the schema map below the categories:

    ```javascript sort_and_extract.mjs theme={null}
    const SCHEMAS = {
      Invoice: {
        type: "object",
        properties: {
          vendor_name:    { type: "string", description: "Company issuing the invoice (the seller)" },
          invoice_number: { type: "string", description: "Invoice identifier" },
          total_due:      { type: "number", description: "Total amount due in US dollars" },
        },
        required: ["vendor_name", "invoice_number", "total_due"],
        additionalProperties: false,
      },
      Receipt: {
        type: "object",
        properties: {
          store_name:     { type: "string", description: "Name of the store" },
          transaction_id: { type: "string", description: "Transaction or receipt ID" },
          total:          { type: "number", description: "Total amount paid in US dollars" },
        },
        required: ["store_name", "transaction_id", "total"],
        additionalProperties: false,
      },
      "Purchase Order": {
        type: "object",
        properties: {
          po_number:   { type: "string", description: "Purchase order number" },
          vendor_name: { type: "string", description: "Vendor the order is placed with" },
          required_by: { type: "string", description: "Date the order is required by" },
        },
        required: ["po_number", "vendor_name", "required_by"],
        additionalProperties: false,
      },
    };
    ```
  </Tab>
</Tabs>
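
The lookup only works if the schema keys match the category names exactly, including case and spacing. A small check like the one below can flag a mismatch before any job is submitted. It is a sketch: `schema_gaps` is a hypothetical helper name, and it compares nothing beyond the names.

```python
def schema_gaps(categories, schemas):
    """Compare splitter category names with the keys of the schema map."""
    names = {c["name"] for c in categories}
    keys = set(schemas)
    return {
        # Categories the splitter may return that we could not extract from.
        "missing_schema": sorted(names - keys),
        # Schemas that no category will ever select, often a typo.
        "unused_schema": sorted(keys - names),
    }
```

For the `CATEGORIES` and `SCHEMAS` defined in this recipe, both lists come back empty.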

### 3.2 Extract Each Section by URL

Loop over the split files, look up the schema for each category, and submit an extract job for each one. We strip the `.pdf` suffix from the file name to get the category key, and skip any type we don't have a schema for. We also submit every job before polling any of them, for two reasons. The extractor fetches each `file_url` at submission time, so the signed links are consumed while they're fresh instead of waiting behind another job's polling loop. And the jobs run in parallel on the server rather than one at a time.

<Tabs>
  <Tab title="Python">
    Add the extraction loop:

    ```python sort_and_extract.py theme={null}
    extract_jobs = {}
    for section in sections:
        category = os.path.splitext(section["name"])[0]
        schema = SCHEMAS.get(category)
        if schema is None:
            print(f"No schema for '{category}', skipping.")
            continue
        resp = requests.post(
            f"{BASE_URL}/v2/extract",
            headers={"api-key": API_KEY},
            data={
                "file_url": section["full_path"],
                "schema_data": json.dumps(schema),
                "model": "gamma",
            },
        )
        resp.raise_for_status()
        extract_jobs[category] = resp.json()["job_id"]

    results = {}
    for category, job_id in extract_jobs.items():
        done = wait_for(f"{BASE_URL}/extract/{job_id}")
        results[category] = {field: cell["value"] for field, cell in done["result"].items()}
    ```

    Because we extract by URL, there's no file upload here, so everything goes in the `data` form fields instead of `files`.
  </Tab>

  <Tab title="JavaScript">
    Add the extraction loop:

    ```javascript sort_and_extract.mjs theme={null}
    const extractJobs = {};
    for (const section of sections) {
      const category = section.name.replace(/\.pdf$/, "");
      const schema = SCHEMAS[category];
      if (!schema) {
        console.log(`No schema for '${category}', skipping.`);
        continue;
      }
      const extractForm = new FormData();
      extractForm.append("file_url", section.full_path);
      extractForm.append("schema_data", JSON.stringify(schema));
      extractForm.append("model", "gamma");
      const extractRes = await fetch(`${BASE_URL}/v2/extract`, {
        method: "POST",
        headers: { "api-key": API_KEY },
        body: extractForm,
      });
      if (!extractRes.ok) throw new Error(`extract submit failed: HTTP ${extractRes.status} ${await extractRes.text()}`);
      extractJobs[category] = (await extractRes.json()).job_id;
    }

    const results = {};
    for (const [category, jobId] of Object.entries(extractJobs)) {
      const done = await waitFor(`${BASE_URL}/extract/${jobId}`);
      results[category] = Object.fromEntries(
        Object.entries(done.result).map(([field, cell]) => [field, cell.value]),
      );
    }
    ```
  </Tab>
</Tabs>
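
The loop relies on two small transformations: turning a file name into a category key, and reducing each extracted field to its `value`. Pulled out as functions, they can be checked in isolation. This is a sketch with hypothetical helper names, and `flatten_result` is deliberately more tolerant than the loop above: it returns `None` for a field with no `value` instead of raising. The sample data in the usage lines is made up for illustration.

```python
import os


def category_of(section_name):
    """Drop the final extension: 'Purchase Order.pdf' becomes 'Purchase Order'."""
    return os.path.splitext(section_name)[0]


def flatten_result(result):
    """Reduce {field: {"value": ...}} to {field: value}."""
    return {
        field: (cell.get("value") if isinstance(cell, dict) else cell)
        for field, cell in result.items()
    }


# Illustrative input only; a real response comes from the extract job.
sample = {"store_name": {"value": "Cooper's Office Supply"}, "total": {"value": 87.61}}
print(category_of("Purchase Order.pdf"), flatten_result(sample))
```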

## Step 4: Print the Combined Result

The `results` object now holds the extracted fields for every document in the pile, keyed by type. Print it as formatted JSON.

<Tabs>
  <Tab title="Python">
    Add the final line and run the script:

    ```python sort_and_extract.py theme={null}
    print(json.dumps(results, indent=2))
    ```

    ```bash theme={null}
    python sort_and_extract.py
    ```
  </Tab>

  <Tab title="JavaScript">
    Add the final line and run the script:

    ```javascript sort_and_extract.mjs theme={null}
    console.log(JSON.stringify(results, null, 2));
    ```

    ```bash theme={null}
    node sort_and_extract.mjs
    ```
  </Tab>
</Tabs>

### Sample Output

Run against the [sample pile](https://raw.githubusercontent.com/Unsiloed-AI/cookbook/9c80a90e0315a33c9b8a68d8b3355199771b598f/sample-documents/sample-split.pdf), the script splits the three pages into three labelled documents, then extracts the fields each type calls for:

```text theme={null}
Split into 3 documents: ['Invoice.pdf', 'Receipt.pdf', 'Purchase Order.pdf']
```

```json theme={null}
{
  "Invoice": {
    "vendor_name": "Greenfield Print & Bindery",
    "invoice_number": "GPB-20418",
    "total_due": 2150.0
  },
  "Receipt": {
    "store_name": "Cooper's Office Supply",
    "transaction_id": "7741-208-4490",
    "total": 87.61
  },
  "Purchase Order": {
    "po_number": "PO-2026-0813",
    "vendor_name": "Westbrook Furniture Co.",
    "required_by": "June 5, 2026"
  }
}
```

One PDF went in; three correctly typed records came out, each carrying only the fields that make sense for its document type. A receipt never gets asked for an invoice number, and the purchase order's vendor is pulled from the right place on the page.
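
Before handing these records to another system, you may want to confirm that each one has the fields its schema marks as required and that the basic types line up. The sketch below does a minimal check against the `required` and `type` entries of the schemas in this recipe. `record_problems` is a hypothetical helper, it only understands `string` and `number`, and it is not a full JSON Schema validator.

```python
def record_problems(record, schema):
    """Return a list of mismatches between one extracted record and its schema."""
    problems = []
    for field in schema.get("required", []):
        if record.get(field) is None:
            problems.append(f"missing required field: {field}")

    type_checks = {"string": str, "number": (int, float)}
    for field, spec in schema.get("properties", {}).items():
        value = record.get(field)
        expected = type_checks.get(spec.get("type"))
        if value is None or expected is None:
            continue  # already reported above, or a type this sketch ignores
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{field} should be {spec['type']}, got {type(value).__name__}"
            )
    return problems
```

Running `record_problems(results[category], SCHEMAS[category])` for each category in the sample output returns an empty list every time.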

## Where to Take This Next

The pile in this recipe had one page per document, but the splitter handles multi-page documents too, grouping consecutive pages that belong together. To process a folder of mixed PDFs, wrap the whole script in a loop over your files and merge each run's `results` into one collection.
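
The folder loop could look something like the sketch below. It assumes you have wrapped the body of the script in a function that takes a PDF path and returns that pile's `results` object. That function is passed in as `process_pile`, and both it and `process_folder` are hypothetical names. Keying the merged collection by file name is one possible design; it keeps records from different piles from overwriting each other.

```python
from pathlib import Path


def process_folder(folder, process_pile):
    """Run process_pile on every PDF in folder.

    Returns (combined, errors): results keyed by file name, and error
    messages for the files that failed.
    """
    combined = {}
    errors = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            combined[pdf.name] = process_pile(pdf)
        except Exception as exc:
            # One bad pile should not stop the rest of the batch.
            errors[pdf.name] = str(exc)
    return combined, errors
```

Collecting failures separately lets you retry only the files that need it once the batch finishes.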

<CardGroup cols={2}>
  <Card title="Splitting" icon="scissors" href="/document-processing/splitting/splitting">
    How the splitter decides document boundaries, and the full response reference.
  </Card>

  <Card title="Classification" icon="tags" href="/document-processing/classification/classification">
    Label a single document by type without splitting it.
  </Card>

  <Card title="Extraction Schemas" icon="brackets-curly" href="/document-processing/extraction/schemas">
    Build richer per-type schemas, including nested objects and line-item tables.
  </Card>

  <Card title="API Reference" icon="code" href="/api-reference/splitting/split-document">
    Browse the full request and response specs for `/splitter` and `/v2/extract`.
  </Card>
</CardGroup>
