Overview

Unsiloed AI provides a powerful API for processing unstructured documents. You can:
  • Parse documents into structured Markdown and JSON
  • Extract data using custom schemas
  • Classify documents by type
  • Split multi-document files into separate documents

Prerequisites

Before you begin, you’ll need:
  1. An Unsiloed AI account and API key
  2. A document to process (PDF, DOCX, PPTX, image, etc.)
  3. Python 3.7+ or Node.js 14+ (optional, for SDK usage)

Step 1: Get Your API Key

To get API access, sign up on Unsiloed AI. We’ll get you set up with an API key and help you get started.
Keep your API key secure and never commit it to version control. Use environment variables to store it.
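One way to follow this advice in Python is to read the key from an environment variable at startup. The sketch below assumes a variable named UNSILOED_API_KEY; that name is chosen for illustration and is not something the SDK is known to require.

```python
import os


def load_api_key(var_name: str = "UNSILOED_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The default variable name is an assumption made for this sketch;
    use whatever name your deployment defines.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

The returned string can then be passed as the api_key argument when constructing UnsiloedClient, so the key never appears in source files.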

Step 2: Install the SDK (Optional)

We provide official SDKs for Python and JavaScript/TypeScript. You can also use the REST API directly.
pip install unsiloed-sdk

Step 3: Parse Your First Document

The Python example below parses a document, waits for the job to finish, and prints a preview of each chunk:
from unsiloed_sdk import UnsiloedClient

# Initialize the client
with UnsiloedClient(api_key="your-api-key") as client:
    # Parse a document and wait for results
    result = client.parse_and_wait(file="document.pdf")
    
    # Access the parsed content
    print(f"Total chunks: {result.total_chunks}")
    
    # Print the first 100 characters of each chunk's embed text
    for chunk in result.chunks:
        print(f"\n--- {chunk['embed'][:100]} ---")

Step 4: Extract Structured Data

To extract specific fields from your document, define a JSON schema:
from unsiloed_sdk import UnsiloedClient

# Define extraction schema using JSON Schema format
schema = {
    "type": "object",
    "properties": {
        "title": {
            "type": "string",
            "description": "Document title"
        },
        "date": {
            "type": "string",
            "description": "Document date"
        }
    },
    "required": ["title", "date"],
    "additionalProperties": False
}

with UnsiloedClient(api_key="your-api-key") as client:
    # Extract and wait for completion
    result = client.extract_and_wait(
        file="document.pdf",
        schema=schema
    )

    # Access extracted data with confidence scores
    print(f"Title: {result.result['title']['value']}")
    print(f"Confidence: {result.result['title']['score']:.2%}")
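Downstream code often wants plain values rather than value/score pairs, plus a check that every required field came back. The helpers below are a minimal sketch, assuming result.result is a dictionary shaped the way the example above reads it (each field holding "value" and "score"). The helper names and the sample data are made up for illustration.

```python
from typing import Any, Dict, List


def plain_values(extracted: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Map each field name to its extracted value, dropping scores."""
    return {name: field.get("value") for name, field in extracted.items()}


def missing_required(schema: Dict[str, Any],
                     extracted: Dict[str, Dict[str, Any]]) -> List[str]:
    """List required schema fields that are absent or have a None value."""
    values = plain_values(extracted)
    return [name for name in schema.get("required", [])
            if values.get(name) is None]


# Illustrative data only, not real API output
example_schema = {"type": "object", "required": ["title", "date"]}
example_extracted = {"title": {"value": "Example title", "score": 0.97}}

print(plain_values(example_extracted))
print(missing_required(example_schema, example_extracted))
```

With the SDK, the same helpers would be called as missing_required(schema, result.result) after extract_and_wait returns.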

Understanding the Response

Parsing Response

The parsing API returns structured chunks with markdown, segments, and metadata:
{
  "job_id": "1699d429-9c2e-464e-b311-d4b68a8444b8",
  "status": "Succeeded",
  "file_name": "document.pdf",
  "total_chunks": 3,
  "page_count": 1,
  "created_at": "2026-01-05T15:06:27.966175Z",
  "started_at": "2026-01-05T15:06:28.130578Z",
  "finished_at": "2026-01-05T15:06:36.009842Z",
  "chunks": [
    {
      "chunk_id": "6b2eca3a-d14f-4164-ba9a-0a3a58fcaf45",
      "chunk_length": 118,
      "embed": "# Document Title\n\nThis is the parsed content...",
      "segments": [
        {
          "segment_id": "c60d89b1-373e-428d-9950-544e7c903b61",
          "segment_type": "Text",
          "markdown": "Document content here...",
          "html": "<p>Document content here...</p>",
          "bbox": {
            "left": 34.47,
            "top": 30.99,
            "width": 118.26,
            "height": 29.03
          },
          "page_number": 1,
          "page_width": 595.0,
          "page_height": 842.0,
          "confidence": 0.98
        }
      ]
    }
  ],
  "pdf_url": "https://s3.us-east-1.amazonaws.com/...",
  "metadata": {
    "keep_segment_types": "all"
  }
}
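To show how this structure can be traversed, the sketch below walks every segment in a parse response and scales its bounding box to the 0-1 range. It works on a plain dictionary trimmed from the sample above, and it assumes that bbox values and page_width/page_height share the same units. The function names are hypothetical helpers, not part of the SDK.

```python
from typing import Any, Dict, Iterator


def iter_segments(response: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every segment in a parse response, chunk by chunk."""
    for chunk in response.get("chunks", []):
        for segment in chunk.get("segments", []):
            yield segment


def normalized_bbox(segment: Dict[str, Any]) -> Dict[str, float]:
    """Scale a segment's bbox to the 0-1 range using its page size."""
    box = segment["bbox"]
    page_w, page_h = segment["page_width"], segment["page_height"]
    return {
        "left": box["left"] / page_w,
        "top": box["top"] / page_h,
        "width": box["width"] / page_w,
        "height": box["height"] / page_h,
    }


# Trimmed copy of the sample parsing response
sample = {
    "chunks": [{
        "segments": [{
            "segment_type": "Text",
            "bbox": {"left": 34.47, "top": 30.99,
                     "width": 118.26, "height": 29.03},
            "page_number": 1,
            "page_width": 595.0,
            "page_height": 842.0,
        }]
    }]
}

for seg in iter_segments(sample):
    print(seg["segment_type"], seg["page_number"], normalized_bbox(seg))
```

Normalized boxes are convenient when the page is later rendered at a different resolution, since they can be multiplied by any target width and height.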

Extraction Response

The extraction API returns extracted fields with confidence scores and bounding boxes:
{
  "job_id": "4943f2a3-7c99-46b9-90e8-c1c4b748a9bb",
  "status": "completed",
  "file_name": "invoice.pdf",
  "created_at": "2026-01-05T15:00:10.836401+00:00",
  "updated_at": "2026-01-05T15:00:53.123541+00:00",
  "result": {
    "invoice_number": {
      "value": "25G1TIZT00000999",
      "score": 0.8686260225487502,
      "page_no": 1,
      "bboxes": [
        {
          "bbox": [65, 251, 1136, 431],
          "type": "segment",
          "confidence": 0.7901723,
          "page_width": 1191.0,
          "page_height": 1684.0
        },
        {
          "bbox": [216, 385, 413, 400],
          "text": "25G1TIZT00000999",
          "type": "ocr",
          "confidence": null
        }
      ]
    },
    "invoice_date": {
      "value": "10/05/2025",
      "score": 0.9999994661137259,
      "page_no": 1,
      "bboxes": [
        {
          "bbox": [232, 407, 344, 422],
          "text": "10/05/2025",
          "type": "ocr",
          "confidence": null
        }
      ]
    },
    "total_amount": {
      "value": 346.5,
      "score": 0.9999993593737946,
      "page_no": 1,
      "bboxes": [
        {
          "bbox": [221, 892, 275, 894],
          "text": "346.5",
          "type": "ocr",
          "confidence": null
        }
      ]
    }
  }
}
Key Features:
  • Parsing: Returns chunks with markdown, HTML, segments, and layout information
  • Extraction: Returns structured fields with confidence scores and precise bounding boxes
  • Bounding boxes: Pixel-level coordinates for locating data in the original document
  • Confidence scores: Model confidence (0-1) for each extracted field
  • Page references: Page numbers where each field was found
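Confidence scores lend themselves to a simple review gate: accept fields above a threshold and flag the rest for a human. The sketch below applies that idea to the "result" object from the sample extraction response. The 0.9 threshold and the function name are arbitrary choices for illustration.

```python
from typing import Any, Dict, List


def fields_needing_review(result: Dict[str, Dict[str, Any]],
                          threshold: float = 0.9) -> List[str]:
    """Return names of extracted fields whose score is below threshold."""
    flagged = []
    for name, field in result.items():
        score = field.get("score")
        # Treat a missing score as unverified rather than trusted
        if score is None or score < threshold:
            flagged.append(name)
    return flagged


# Values copied from the sample extraction response (bboxes omitted)
sample_result = {
    "invoice_number": {"value": "25G1TIZT00000999",
                       "score": 0.8686260225487502, "page_no": 1},
    "invoice_date": {"value": "10/05/2025",
                     "score": 0.9999994661137259, "page_no": 1},
    "total_amount": {"value": 346.5,
                     "score": 0.9999993593737946, "page_no": 1},
}

print(fields_needing_review(sample_result))  # ['invoice_number']
```

A suitable threshold depends on the document type and the cost of an error, so it is worth tuning against a labeled sample before relying on it.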

Next Steps

Common Use Cases

Extract structured data from invoices with citations and confidence scores for validation workflows.
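# Assumes `client` is an open UnsiloedClient, as in Step 3
# Schema written in the same JSON Schema format as Step 4
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {"type": "array"}
    }
}

result = client.extract_and_wait(
    file="invoice.pdf",
    schema=schema
)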
Parse legal documents while preserving structure, then extract key clauses and dates.
# Assumes `client` is an open UnsiloedClient, as in Step 3
# First parse to get structured content
parse_result = client.parse_and_wait(file="contract.pdf")

# Then extract specific clauses (same JSON Schema format as Step 4)
schema = {
    "type": "object",
    "properties": {
        "parties": {"type": "array"},
        "effective_date": {"type": "string"},
        "termination_clause": {"type": "string"}
    }
}

extract_result = client.extract_and_wait(
    file="contract.pdf",
    schema=schema
)
Classify documents to route them to appropriate processing pipelines.
result = client.classify_and_wait(
    file="document.pdf",
    classes=["invoice", "receipt", "contract", "form"]
)

print(f"Document type: {result.classification}")
print(f"Confidence: {result.confidence}")
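The routing step itself can be as small as a lookup table. The sketch below is one possible shape for it, assuming the classification is a string label and the confidence is a float between 0 and 1. The handler functions, the 0.8 cutoff, and the manual-review fallback are all hypothetical.

```python
from typing import Callable, Dict


def handle_invoice(path: str) -> str:
    return f"invoice pipeline <- {path}"


def handle_contract(path: str) -> str:
    return f"contract pipeline <- {path}"


def handle_manual_review(path: str) -> str:
    return f"manual review <- {path}"


# Map class labels to the pipeline that should process them
ROUTES: Dict[str, Callable[[str], str]] = {
    "invoice": handle_invoice,
    "contract": handle_contract,
}


def route(path: str, label: str, confidence: float,
          min_confidence: float = 0.8) -> str:
    """Send a document to its pipeline, or to review when unsure."""
    handler = ROUTES.get(label)
    if handler is None or confidence < min_confidence:
        # Unknown label or weak prediction: let a person decide
        return handle_manual_review(path)
    return handler(path)
```

With the classification example above, the call would look like route("document.pdf", result.classification, result.confidence).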

API Base URL

All API requests should be made to:
https://prod.visionapi.unsiloed.ai
Authenticate every request by sending your API key in the api-key header.
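For callers who skip the SDK, the sketch below shows how the base URL and the api-key header fit together using only the Python standard library. It builds a request without sending it. Endpoint paths are not listed on this page, so the path is left to the caller, and "/example-endpoint" in the usage line is a placeholder rather than a real route.

```python
import urllib.request

BASE_URL = "https://prod.visionapi.unsiloed.ai"


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a request carrying the api-key header.

    `path` must come from the API reference; this sketch does not
    assume any particular endpoint.
    """
    url = BASE_URL + "/" + path.lstrip("/")
    return urllib.request.Request(url, headers={"api-key": api_key})


# Placeholder path for illustration only
req = build_request("/example-endpoint", "your-api-key")
print(req.full_url)
```

Passing the resulting object to urllib.request.urlopen would send it. HTTP header names are case-insensitive, so the capitalization urllib applies internally does not affect authentication.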

Need Help?