Overview

This API uses vision models to extract structured data from PDF documents. It combines computer vision with natural language processing to understand document layouts, identify key information, and extract data according to custom JSON schemas.

Key Features

Structured Data Extraction

Extract specific fields and data points from documents using custom JSON schemas

Bounding Box Detection

Identify and locate visual elements with precise coordinate information

Multi-format Support

Process 20+ document modalities (PDFs, PPTs, spreadsheets, images, and more) with image-based rendering, OCR, and structure-preserving parsing.

Async Processing

Handle large documents with background processing and job management

Getting Started with the /extract Endpoint

Here are the steps to get started:
  1. Get an API key: Sign up on Unsiloed AI to get API access.
  2. Define a JSON schema: Specify the fields you want to extract. On the Unsiloed AI Platform, you can also generate a JSON schema directly in the UI and export it as an endpoint for calling /extract.

Defining Extraction Schemas

Unsiloed AI uses JSON Schema to define what data should be extracted from a document. You describe the structure you want, and the extraction engine returns structured JSON with citations, bounding boxes, and confidence scores for each field.
Important: When defining a schema, keep field descriptions as detailed and specific as possible. Clear, pointed descriptions help the model correctly locate and extract the intended information, especially in complex or ambiguous documents.
Schemas are strict by default, making outputs deterministic and production-safe.

Schema Rules - Detailed Guide

All extraction schemas must follow the JSON Schema specification, with strict constraints that keep outputs deterministic and production-safe.

Core Requirements

1. Root Object: Every schema must declare "type": "object" at the root. Arrays and primitives are not allowed at the root level.
{
  "type": "object",
  "properties": {
    // Define your fields here
  },
  "required": [...],
  "additionalProperties": false
}
2. Properties: Define all fields you want to extract under the "properties" key. Each field must specify a "type" and should include a clear "description".
{
  "type": "object",
  "properties": {
    "field_name_1": {
      "type": "string",
      "description": "Clear description of what to extract"
    },
    "field_name_2": {
      "type": "number",
      "description": "Description with units or context"
    }
  },
  "required": [...],
  "additionalProperties": false
}
3. Required Fields: Specify mandatory fields using the "required" array. Field names must exactly match those defined in "properties".
{
  "type": "object",
  "properties": {
    "mandatory_field": { "type": "string", "description": "This field is required" },
    "another_required_field": { "type": "string", "description": "This is also required" }
  },
  "required": ["mandatory_field", "another_required_field"],
  "additionalProperties": false
}
4. Additional Properties: Always set "additionalProperties": false at every object level so that only the specified fields appear in the output.
{
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "field_name": { "type": "string" }
        },
        "required": [...],
        "additionalProperties": false  // Required in array items
      }
    }
  },
  "required": [...],
  "additionalProperties": false  // Required at root level
}
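
The four rules above can be checked locally before a schema is submitted. The following is a minimal sketch in plain Python. The function name check_strict_schema is hypothetical and not part of the Unsiloed SDK, and it covers only the rules listed in this section, not full JSON Schema validity.

```python
def check_strict_schema(schema, path="root"):
    """Return a list of violations of the four core rules (empty if none).

    Hypothetical helper for illustration; not part of the Unsiloed SDK.
    """
    problems = []
    node_type = schema.get("type")

    # Rule 1 (root must be an object) and rule 2 (every field needs a type).
    if node_type is None:
        problems.append(f'{path}: missing "type"')
    elif path == "root" and node_type != "object":
        problems.append(f'{path}: root must have "type": "object"')

    if node_type == "object":
        properties = schema.get("properties", {})
        # Rule 4: every object level must forbid extra keys.
        if schema.get("additionalProperties") is not False:
            problems.append(f'{path}: "additionalProperties" must be false')
        # Rule 3: every required name must be a defined property.
        for name in schema.get("required", []):
            if name not in properties:
                problems.append(
                    f'{path}: required field "{name}" is not in "properties"'
                )
        # Recurse so nested fields are checked as well.
        for name, child in properties.items():
            problems.extend(check_strict_schema(child, f"{path}.{name}"))
    elif node_type == "array":
        items = schema.get("items")
        if isinstance(items, dict):
            problems.extend(check_strict_schema(items, f"{path}[]"))
        else:
            problems.append(f'{path}: array is missing "items"')

    return problems


# Example: the nested object forgets "additionalProperties".
draft = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"field_name": {"type": "string"}},
            },
        }
    },
    "additionalProperties": False,
}
for problem in check_strict_schema(draft):
    print(problem)
```

An empty list means none of the four rules were violated; the server may still apply checks that this sketch does not model.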

Supported Types

String - For text, dates, IDs, names, addresses, and any textual data
{
  "field_name": {
    "type": "string",
    "description": "Description of the text field"
  }
}
Number - For integers and decimals like prices, quantities, counts, and measurements
{
  "field_name": {
    "type": "number",
    "description": "Description with units (e.g., USD, kg)"
  }
}
Boolean - For true/false values such as status flags and yes/no fields
{
  "field_name": {
    "type": "boolean",
    "description": "Description of the boolean condition"
  }
}
Array - For repeating items like line items or lists. Must include "items" to define the structure of the array elements.
{
  "field_name": {
    "type": "array",
    "description": "Description of the array items",
    "items": {
      "type": "object",
      "properties": {
        "item_field1": { "type": "string", "description": "..." },
        "item_field2": { "type": "string", "description": "..." }
      },
      "required": [...],
      "additionalProperties": false
    }
  }
}

Building Schemas

Primitive Types

For simple fields, set "type" to "string", "number", or "boolean". These are the building blocks of your schema.
{
  "type": "object",
  "properties": {
    "invoice_number": {
      "type": "string",
      "description": "Invoice number"
    },
    "total_amount": {
      "type": "number",
      "description": "Total amount in USD"
    },
    "is_paid": {
      "type": "boolean",
      "description": "Payment status"
    }
  },
  "required": ["invoice_number", "total_amount", "is_paid"],
  "additionalProperties": false
}

Arrays of Objects

Use arrays when you have repeating data like line items, transactions, or people.
{
  "type": "object",
  "properties": {
    "line_items": {
      "type": "array",
      "description": "Invoice line items",
      "items": {
        "type": "object",
        "properties": {
          "description": {
            "type": "string",
            "description": "Item description"
          },
          "quantity": {
            "type": "number",
            "description": "Quantity"
          },
          "price": {
            "type": "number",
            "description": "Unit price"
          }
        },
        "required": ["description", "quantity", "price"],
        "additionalProperties": false
      }
    }
  },
  "additionalProperties": false
}

Nested Arrays

You can nest arrays inside objects within arrays for complex hierarchical data.
{
  "type": "object",
  "properties": {
    "orders": {
      "type": "array",
      "description": "Customer orders",
      "items": {
        "type": "object",
        "properties": {
          "order_id": {
            "type": "string",
            "description": "Order ID"
          },
          "shipments": {
            "type": "array",
            "description": "Shipments for this order",
            "items": {
              "type": "object",
              "properties": {
                "tracking_number": {
                  "type": "string",
                  "description": "Tracking number"
                },
                "carrier": {
                  "type": "string",
                  "description": "Shipping carrier"
                }
              },
              "required": ["tracking_number", "carrier"],
              "additionalProperties": false
            }
          }
        },
        "required": ["order_id", "shipments"],
        "additionalProperties": false
      }
    }
  },
  "additionalProperties": false
}
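
Deeply nested schemas are easy to get wrong by hand, especially the "additionalProperties": false requirement at every object level. One possible approach is to assemble them from small builder functions. The helpers below (field, object_schema, array_of) are hypothetical names for illustration and are not part of the Unsiloed SDK; the sketch rebuilds the nested orders schema shown above.

```python
def field(type_name, description):
    """Build a primitive field ("string", "number", or "boolean")."""
    return {"type": type_name, "description": description}


def object_schema(properties, required=()):
    """Build an object level that always sets "additionalProperties": false."""
    unknown = [name for name in required if name not in properties]
    if unknown:
        raise ValueError(f"required names not in properties: {unknown}")
    schema = {"type": "object", "properties": dict(properties)}
    if required:
        schema["required"] = list(required)
    schema["additionalProperties"] = False
    return schema


def array_of(item_schema, description):
    """Build an array field whose elements follow item_schema."""
    return {"type": "array", "description": description, "items": item_schema}


shipment = object_schema(
    {
        "tracking_number": field("string", "Tracking number"),
        "carrier": field("string", "Shipping carrier"),
    },
    required=["tracking_number", "carrier"],
)
order = object_schema(
    {
        "order_id": field("string", "Order ID"),
        "shipments": array_of(shipment, "Shipments for this order"),
    },
    required=["order_id", "shipments"],
)
orders_schema = object_schema({"orders": array_of(order, "Customer orders")})
```

Because object_schema rejects required names that are not defined in properties, a mismatch surfaces when the schema is built rather than when the request is sent.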
A common schema for extracting data from US invoices.
{
  "type": "object",
  "properties": {
    "document_type": {
      "type": "string",
      "description": "Type of document (invoice, receipt, etc.)"
    },
    "invoice_header": {
      "type": "object",
      "description": "Invoice header details",
      "properties": {
        "invoice_number": {
          "type": "string",
          "description": "Invoice number"
        },
        "invoice_date": {
          "type": "string",
          "description": "Invoice issue date"
        },
        "due_date": {
          "type": "string",
          "description": "Payment due date"
        }
      },
      "required": ["invoice_number", "invoice_date"],
      "additionalProperties": false
    },
    "vendor": {
      "type": "object",
      "description": "Vendor information",
      "properties": {
        "vendor_name": {
          "type": "string",
          "description": "Legal business name"
        },
        "vendor_address": {
          "type": "string",
          "description": "Business address"
        },
        "vendor_email": {
          "type": "string",
          "description": "Accounts receivable contact email"
        }
      },
      "required": ["vendor_name"],
      "additionalProperties": false
    },
    "line_items": {
      "type": "array",
      "description": "List of billed items",
      "items": {
        "type": "object",
        "properties": {
          "description": {
            "type": "string",
            "description": "Item or service description"
          },
          "quantity": {
            "type": "number",
            "description": "Quantity billed"
          },
          "unit_price": {
            "type": "number",
            "description": "Price per unit in USD"
          },
          "line_total": {
            "type": "number",
            "description": "Total cost for this line item"
          }
        },
        "required": ["description", "quantity", "unit_price"],
        "additionalProperties": false
      }
    },
    "invoice_totals": {
      "type": "object",
      "description": "Invoice totals",
      "properties": {
        "subtotal": {
          "type": "number",
          "description": "Subtotal before tax"
        },
        "sales_tax": {
          "type": "number",
          "description": "Sales tax amount"
        },
        "total_amount_due": {
          "type": "number",
          "description": "Final amount due in USD"
        }
      },
      "required": ["total_amount_due"],
      "additionalProperties": false
    }
  },
  "required": ["document_type", "invoice_header", "invoice_totals"],
  "additionalProperties": false
}
A schema for extracting governance and ownership information from US SEC filings.
{
  "type": "object",
  "properties": {
    "board_of_directors": {
      "type": "array",
      "description": "Board of Directors",
      "items": {
        "type": "object",
        "properties": {
          "director_name": {
            "type": "string",
            "description": "Full name of board member"
          }
        },
        "required": ["director_name"],
        "additionalProperties": false
      }
    },
    "major_shareholders": {
      "type": "array",
      "description": "Major shareholders and ownership",
      "items": {
        "type": "object",
        "properties": {
          "shareholder_name": {
            "type": "string",
            "description": "Name of shareholder"
          },
          "ownership_percentage": {
            "type": "string",
            "description": "Percentage ownership"
          }
        },
        "required": ["shareholder_name", "ownership_percentage"],
        "additionalProperties": false
      }
    }
  },
  "required": ["board_of_directors", "major_shareholders"],
  "additionalProperties": false
}

How Extraction Works

Once you submit a schema:
  1. Unsiloed locates each field in the document
  2. Extracts values that best match the schema
  3. Returns structured JSON with:
    • Field-level confidence scores
    • Word-level citations
    • Bounding boxes mapped to the original document
This makes every extracted value auditable and traceable.

Best Practices

  • Keep field names simple and descriptive
  • Use nested objects to reflect document structure
  • Avoid free-form schemas; strict schemas produce better results
  • Prefer arrays for repeated sections (line items, directors, transactions)

Basic Usage

Simple Data Extraction

from unsiloed_sdk import UnsiloedClient

# Define extraction schema using JSON Schema format
schema = {
    "type": "object",
    "properties": {
        "title": {
            "type": "string",
            "description": "Document title"
        },
        "date": {
            "type": "string",
            "description": "Document date"
        }
    },
    "required": ["title", "date"],
    "additionalProperties": False
}

with UnsiloedClient(api_key="your-api-key") as client:
    # Extract and wait for completion
    result = client.extract_and_wait(
        file="document.pdf",
        schema=schema
    )

    # Access extracted data with confidence scores
    print(f"Title: {result.result['title']['value']}")
    print(f"Confidence: {result.result['title']['score']:.2%}")

Advanced Features

Asynchronous Processing

For large documents or batch processing, use the async endpoints:
from unsiloed_sdk import UnsiloedClient

# `schema` is the dict defined in the Simple Data Extraction example
with UnsiloedClient(api_key="your-api-key") as client:
    # Option 1: Start job and poll manually
    response = client.extract(
        file="document.pdf",
        schema=schema
    )
    job_id = response.job_id
    print(f"Job ID: {job_id}")

    # Check job status later
    result = client.get_extract_result(job_id)
    print(f"Status: {result.status}")

    # Option 2: Let SDK handle polling automatically
    result = client.extract_and_wait(
        file="document.pdf",
        schema=schema,
        poll_interval=5,  # Check every 5 seconds
    )
    print("Extraction complete!")
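
For Option 1, the manual polling step can be wrapped in a small loop. The sketch below is SDK-independent: wait_for_job is a hypothetical helper that takes any zero-argument callable returning an object with a status attribute. It assumes the "completed" and "failed" status values from the response documentation and treats every other status as still in progress; the default timeout is an arbitrary choice.

```python
import time


def wait_for_job(fetch_result, poll_interval=5.0, timeout=600.0):
    """Call fetch_result() until the job completes, fails, or times out.

    Hypothetical helper; fetch_result must return an object with `.status`.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_result()
        if result.status == "completed":
            return result
        if result.status == "failed":
            raise RuntimeError("extraction job failed")
        # Any other status is assumed to mean the job is still running.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job not finished after {timeout} seconds")
        time.sleep(poll_interval)
```

With the client from the example above, this could be called as wait_for_job(lambda: client.get_extract_result(job_id)). When no custom behavior is needed, extract_and_wait already covers this case.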

Response Format

Extraction Results Structure

The extraction API returns a complete job response with metadata and extracted results:
{
  "job_id": "4943f2a3-7c99-46b9-90e8-c1c4b748a9bb",
  "status": "completed",
  "file_name": "invoice.pdf",
  "created_at": "2026-01-05T15:00:10.836401+00:00",
  "updated_at": "2026-01-05T15:00:53.123541+00:00",
  "result": {
    "invoice_number": {
      "value": "INV-2024-00542",
      "score": 0.9995421238765432,
      "page_no": 1,
      "bboxes": [
        {
          "bbox": [450, 120, 580, 145],
          "type": "segment",
          "confidence": 0.8654321,
          "page_width": 1191.0,
          "page_height": 1684.0
        },
        {
          "bbox": [450, 120, 580, 145],
          "text": "INV-2024-00542",
          "type": "ocr",
          "confidence": null
        }
      ]
    },
    "invoice_date": {
      "value": "January 15, 2024",
      "score": 0.9992341234567891,
      "page_no": 1,
      "bboxes": [
        {
          "bbox": [450, 155, 580, 180],
          "type": "segment",
          "confidence": 0.8923456,
          "page_width": 1191.0,
          "page_height": 1684.0
        },
        {
          "bbox": [450, 155, 580, 180],
          "text": "January 15, 2024",
          "type": "ocr",
          "confidence": null
        }
      ]
    },
    "vendor_name": {
      "value": "Acme Corporation",
      "score": 0.9998765432109876,
      "page_no": 1,
      "bboxes": [
        {
          "bbox": [85, 95, 250, 125],
          "type": "segment",
          "confidence": 0.9123456,
          "page_width": 1191.0,
          "page_height": 1684.0
        },
        {
          "bbox": [85, 95, 250, 125],
          "text": "Acme Corporation",
          "type": "ocr",
          "confidence": null
        }
      ]
    },
    "total_amount": {
      "value": 2547.50,
      "score": 0.9999876543210987,
      "page_no": 1,
      "bboxes": [
        {
          "bbox": [480, 625, 580, 655],
          "type": "segment",
          "confidence": 0.9534567,
          "page_width": 1191.0,
          "page_height": 1684.0
        },
        {
          "bbox": [480, 625, 580, 655],
          "text": "2547.50",
          "type": "ocr",
          "confidence": null
        }
      ]
    }
  }
}

Response Fields

Top-Level Fields

  • job_id: Unique identifier for the extraction job
  • status: Job status (completed, processing, failed, etc.)
  • file_name: Name of the processed file
  • created_at: Timestamp when the job was created
  • updated_at: Timestamp of the last status update
  • result: Object containing all extracted fields

Extracted Field Structure

Each field in the result object contains:
  • value: The extracted data (type matches your schema: string, number, boolean, object, or array)
  • score: Confidence score between 0 and 1 (higher is better)
  • page_no: Page number where the data was found (1-indexed)
  • bboxes: Array of bounding box objects with location information
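
A common use of the score field is to route low-confidence values to manual review. The sketch below assumes a flat result object like the sample response, where each top-level field carries value and score. The helper name split_by_confidence and the 0.9 threshold are illustrative assumptions, not SDK features or recommended settings.

```python
def split_by_confidence(result, threshold=0.9):
    """Separate extracted fields into accepted values and names to review.

    Hypothetical helper; `result` is the "result" object of a job response.
    """
    accepted = {}
    needs_review = []
    for name, extracted in result.items():
        score = extracted.get("score")
        if score is not None and score >= threshold:
            accepted[name] = extracted["value"]
        else:
            # A missing or low score is treated as requiring review.
            needs_review.append(name)
    return accepted, needs_review
```

For example, accepted, needs_review = split_by_confidence(response["result"]) on a parsed JSON job response yields a plain name-to-value dict plus a list of field names to check by hand.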

Bounding Box Structure

Each bounding box in the bboxes array includes:
  • bbox: Pixel coordinates [left, top, right, bottom]
    • left, top: Top-left corner coordinates
    • right, bottom: Bottom-right corner coordinates
  • type: Either "segment" (document region) or "ocr" (word-level text)
  • confidence: Segment confidence score (for segment type) or null (for OCR type)
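  • page_width: Page width in pixels, for coordinate reference (in the example above, present on segment-type bboxes)
  • page_height: Page height in pixels, for coordinate reference (in the example above, present on segment-type bboxes)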
  • text: Extracted text (only present for OCR-type bboxes)
Each extracted field typically has two bounding boxes: one for the document segment containing the data, and one for the precise OCR text location. This dual-level citation allows you to trace extractions back to both the visual region and the exact words in the document.
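
To draw a bounding box on a page image rendered at a different resolution, the pixel coordinates can be converted to fractions of the page size. The sketch below assumes the bbox layout described above ([left, top, right, bottom] with page_width and page_height in the same pixel units). Both helper names are hypothetical and not part of the SDK.

```python
def normalized_bbox(box):
    """Return (left, top, right, bottom) as fractions of the page size.

    Returns None when the box carries no page dimensions, as with the
    OCR-type boxes in the sample response.
    """
    width = box.get("page_width")
    height = box.get("page_height")
    if not width or not height:
        return None
    left, top, right, bottom = box["bbox"]
    return (left / width, top / height, right / width, bottom / height)


def boxes_of_type(extracted_field, box_type):
    """Select the "segment" or "ocr" boxes of one extracted field."""
    return [b for b in extracted_field["bboxes"] if b["type"] == box_type]
```

Multiplying the normalized values by the width and height of the rendered image gives the rectangle to draw.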

Error Handling

When a schema fails validation, the API returns an error response that names the offending field and lists suggested fixes:
{
  "error": "Schema validation failed",
  "details": {
    "field": "shareholding pattern",
    "message": "Invalid JSON Schema format - missing required properties"
  },
  "suggestions": [
    "Ensure all objects have 'type' property",
    "Add 'properties' for object types",
    "Include 'items' for array types",
    "Set 'additionalProperties' to false"
  ]
}
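
An error payload of this shape can be turned into a readable message for logs or user feedback. The following is a minimal sketch that assumes the keys shown in the example above (error, details, suggestions). The function name format_schema_error is hypothetical, and how the SDK surfaces this payload is not covered here.

```python
def format_schema_error(payload):
    """Render a schema validation error payload as a multi-line string.

    Hypothetical helper; `payload` is the parsed JSON error response.
    """
    lines = [payload.get("error", "Unknown error")]
    details = payload.get("details") or {}
    if details:
        lines.append(f'  field: {details.get("field", "?")}')
        lines.append(f'  message: {details.get("message", "?")}')
    for suggestion in payload.get("suggestions", []):
        lines.append(f"  suggestion: {suggestion}")
    return "\n".join(lines)
```

Missing keys fall back to placeholders, so the helper does not raise on a partial payload.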