Overview

The Vision API parses documents by analyzing their structure with vision language models and splitting the content into chunks. The parser identifies document elements such as text blocks, tables, images, headers, and footers, preserves the reading order, and extracts the content of each element.
Parsing is available through the /parse endpoint, which combines structure detection, content extraction, and segmentation into chunks.

Key Features

Element Detection

Identify and classify document elements using advanced AI models

Content Extraction

Extract text, tables, and images with appropriate processing methods

Reading Order

Maintain proper document flow and reading sequence

Multi-Modal Processing

Handle text, images, tables, and formulas with specialized extractors

Document Element Types

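The parser identifies the following element types. In API responses, the type appears in each segment's segment_type field. The example response later in this guide spells Section-header as SectionHeader, so the hyphenated names listed here may be written without hyphens in responses.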

Text Elements

  • Text: Regular paragraph text
  • Title: Document and section titles
  • Section-header: Section headings
  • Page-header: Header content
  • Page-footer: Footer content
  • Caption: Image and table captions
  • Footnote: Footnote references and content
  • List-item: Bulleted and numbered lists

Visual Elements

  • Table: Structured tabular data
  • Picture: Images, charts, and diagrams
  • Formula: Mathematical equations and expressions
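
To see which element types a parsed document actually contains, it can help to tally segments by type. The sketch below assumes chunks shaped like the response example in this guide (a list of dicts with a segments list and a segment_type per segment). The helper names count_segment_types and segments_of_type are hypothetical and not part of the SDK.

```python
from collections import Counter


def count_segment_types(chunks):
    """Count how many segments of each segment_type appear across all chunks."""
    counts = Counter()
    for chunk in chunks:
        for segment in chunk.get("segments", []):
            counts[segment.get("segment_type", "Unknown")] += 1
    return counts


def segments_of_type(chunks, wanted):
    """Return every segment whose segment_type is in the `wanted` set."""
    return [
        segment
        for chunk in chunks
        for segment in chunk.get("segments", [])
        if segment.get("segment_type") in wanted
    ]


# Hand-written sample that mirrors the shape of the documented response.
sample_chunks = [
    {
        "segments": [
            {"segment_type": "Picture"},
            {"segment_type": "SectionHeader", "content": "Tax Invoice on behalf of -"},
            {"segment_type": "Table"},
        ]
    }
]

type_counts = count_segment_types(sample_chunks)
visual_segments = segments_of_type(sample_chunks, {"Table", "Picture", "Formula"})
print(dict(type_counts))
print(len(visual_segments))
```

With a real result, the same helpers could be called on result.chunks, assuming the chunks are dicts as in the Quick Start code.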

Quick Start

Using Local Files

from unsiloed_sdk import UnsiloedClient
from unsiloed_sdk.exceptions import APIError

with UnsiloedClient(api_key="your-api-key") as client:
    try:
        # Parse and wait for completion automatically
        result = client.parse_and_wait(
            file="document.pdf",
            merge_tables=True
        )

        print(f"Total chunks: {result.total_chunks}")
        print(f"Status: {result.status}")

        # Process chunks
        for chunk in result.chunks:
            for segment in chunk['segments']:
                print(f"Type: {segment['segment_type']}")
                content = segment.get('content') or ''
                print(f"Content: {content[:100]}...")
    except APIError as e:
        print(f"API Error: {e}")
        print(f"Status code: {e.status_code}")
        print(f"Response data: {e.response_data}")

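# Or start the job and fetch the result yourself
with UnsiloedClient(api_key="your-api-key") as client:
    # Start the parse job and keep its ID
    response = client.parse(file="document.pdf")
    job_id = response.job_id

    # Fetch the result later. Check the status first, because the job
    # may still be running when this call is made.
    result = client.get_parse_result(job_id)
    print(f"Status: {result.status}")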
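
When a job is started without waiting, the result has to be fetched repeatedly until the job reaches a final status. The following is a minimal polling sketch, not the SDK's own mechanism. It assumes that the fetch function returns an object with a status attribute and that Succeeded and Failed are the only terminal statuses, which the documentation does not confirm. The names wait_for_job, FakeResult, and fake_fetch are hypothetical, and the demo uses a fake fetch function so it runs without network access.

```python
import time

# Assumption: these are the terminal statuses; the API may define others.
TERMINAL_STATUSES = {"Succeeded", "Failed"}


def wait_for_job(fetch_result, job_id, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call fetch_result(job_id) until the status is terminal or the timeout is hit.

    The timeout counts time spent sleeping between polls, not wall-clock time.
    """
    waited = 0.0
    while True:
        result = fetch_result(job_id)
        if result.status in TERMINAL_STATUSES:
            return result
        if waited >= timeout:
            raise TimeoutError(f"Job {job_id} still {result.status!r} after {timeout}s")
        sleep(interval)
        waited += interval


class FakeResult:
    """Stand-in for a parse result; only the status attribute is modelled."""

    def __init__(self, status):
        self.status = status


_statuses = iter(["Processing", "Processing", "Succeeded"])


def fake_fetch(job_id):
    """Pretend to fetch a job, reporting Processing twice and then Succeeded."""
    return FakeResult(next(_statuses))


final = wait_for_job(fake_fetch, "demo-job", interval=0.0)
print(final.status)
```

With the real client, the call would look like wait_for_job(client.get_parse_result, job_id), assuming get_parse_result returns an object that exposes status in the same way as the parse_and_wait result.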

Using Presigned URLs

Instead of uploading files directly, you can provide a presigned URL using the url parameter. This is ideal for documents already stored in cloud storage (S3, GCS, Azure Blob, etc.).

from unsiloed_sdk import UnsiloedClient
from unsiloed_sdk.exceptions import APIError

# Using presigned URL instead of file upload
document_url = "https://lyltzyvtloozzovxrupp.supabase.co/storage/v1/object/public/pdfs/0589f42e-0684-434c-999a-beedcf34c04a/Invoice_bc11de5d-e45a-46a8-892d-32f8794ab72d.pdf"

with UnsiloedClient(api_key="your-api-key") as client:
    try:
        # Parse document from URL and wait for results
        print("Parsing document from URL...")
        
        result = client.parse_and_wait(url=document_url)
        
        # Access the parsed content
        print(f"\nTotal chunks: {result.total_chunks}")
        
        # Get the embed content
        for chunk in result.chunks:
            print(f"\n--- {chunk['embed'][:100]} ---")
    except APIError as e:
        print(f"API Error: {e}")
        print(f"Status code: {e.status_code}")
        print(f"Response data: {e.response_data}")

For detailed API parameters and configuration options, see the Parse API Reference.

Response Format

Parsed content is organized into chunks with segments containing rich metadata, bounding boxes, and extracted content:

{
  "job_id": "1699d429-9c2e-464e-b311-d4b68a8444b8",
  "status": "Succeeded",
  "file_name": "document.pdf",
  "total_chunks": 3,
  "page_count": 1,
  "created_at": "2026-01-05T15:06:27.966175Z",
  "started_at": "2026-01-05T15:06:28.130578Z",
  "finished_at": "2026-01-05T15:06:36.009842Z",
  "pdf_url": "https://s3.us-east-1.amazonaws.com/...",
  "chunks": [
    {
      "chunk_id": "6b2eca3a-d14f-4164-ba9a-0a3a58fcaf45",
      "chunk_length": 118,
      "embed": "## Tax Invoice on behalf of -\n\nThe parsed content combined from all segments...",
      "segments": [
        {
          "segment_id": "c60d89b1-373e-428d-9950-544e7c903b61",
          "segment_type": "Picture",
          "markdown": "1. DESCRIPTION:\nThe image displays a logo...",
          "html": "<p>The image displays a logo...</p>",
          "image": "https://s3.us-east-1.amazonaws.com/chunker-bucket/...",
          "page_number": 1,
          "page_width": 595.0,
          "page_height": 842.0,
          "confidence": 0.9805617332458496,
          "bbox": {
            "left": 34.476383209228516,
            "top": 30.995285034179688,
            "width": 118.25996398925781,
            "height": 29.03614044189453
          },
          "ocr": []
        },
        {
          "segment_id": "16d38aae-9d38-4a5e-8e78-febb2f206e3d",
          "segment_type": "SectionHeader",
          "content": "Tax Invoice on behalf of -",
          "markdown": "## Tax Invoice on behalf of -",
          "html": "<h2>Tax Invoice on behalf of -</h2>",
          "page_number": 1,
          "page_width": 595.0,
          "page_height": 842.0,
          "confidence": 0.5002586245536804,
          "bbox": {
            "left": 33.08928680419922,
            "top": 103.89154052734375,
            "width": 144.82037353515625,
            "height": 11.463272094726562
          },
          "ocr": [
            {
              "bbox": {
                "left": 1.1465377807617188,
                "top": 3.37347412109375,
                "width": 19.81999969482422,
                "height": 5.54998779296875
              },
              "text": "Tax",
              "confidence": null,
              "color": {
                "r": 0,
                "g": 0,
                "b": 0,
                "hex": "#000000"
              }
            }
          ]
        },
        {
          "segment_id": "4f4b54bc-793e-49cc-b0a3-113bbb5484be",
          "segment_type": "Table",
          "markdown": "| Particulars | Gross value | Discount | Net value |\n| :--- | :--- | :--- | :--- |\n| 1 x Dal Tadka | 170 | 0 | 170 |",
          "html": "<table>...</table>",
          "image": "https://s3.us-east-1.amazonaws.com/chunker-bucket/...",
          "page_number": 1,
          "confidence": 0.9682685732841492,
          "bbox": {
            "left": 32.230628967285156,
            "top": 299.7548522949219,
            "width": 539.4254760742188,
            "height": 106.92483520507812
          }
        }
      ]
    }
  ],
  "metadata": {
    "keep_segment_types": "all"
  },
  "credit_used": 1,
  "merge_tables": false
}

Response Fields

Top-Level Fields

  • job_id: Unique identifier for the parsing job
  • status: Job status (Succeeded, Failed, Processing, etc.)
  • file_name: Name of the uploaded file
  • total_chunks: Number of content chunks generated
  • page_count: Total number of pages in the document
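  • pdf_url: Temporary S3 URL to access the processed PDF
  • created_at / started_at / finished_at: UTC timestamps (ISO 8601) for when the job was created, started, and finished
  • metadata: Additional processing metadata and settings
  • credit_used: Number of credits consumed by the job
  • merge_tables: Whether table merging was enabled for the job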
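
The example response also carries started_at and finished_at timestamps, which can be used to measure how long a job took. The sketch below assumes these values are ISO 8601 strings ending in Z, as in the example. The helper names parse_timestamp and processing_seconds are hypothetical.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    # datetime.fromisoformat only accepts 'Z' from Python 3.11 onward,
    # so replace it with an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def processing_seconds(job):
    """Return the seconds elapsed between started_at and finished_at."""
    started = parse_timestamp(job["started_at"])
    finished = parse_timestamp(job["finished_at"])
    return (finished - started).total_seconds()


# Timestamps copied from the example response in this guide.
job = {
    "started_at": "2026-01-05T15:06:28.130578Z",
    "finished_at": "2026-01-05T15:06:36.009842Z",
}

duration = processing_seconds(job)
print(f"Processing took {duration:.3f} s")
```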

Chunk Fields

  • chunk_id: Unique identifier for each chunk
  • chunk_length: Character length of the chunk
  • embed: Combined markdown content from all segments in the chunk (ideal for RAG/embeddings)
  • segments: Array of document elements within the chunk
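
Because embed holds the combined content of a chunk, one way to prepare data for a RAG pipeline is to turn each chunk into a record with an ID, the text to embed, and some metadata. The sketch below assumes a response dict shaped like the example in this guide. The record layout and the name build_embedding_records are illustrative choices, not something the API prescribes.

```python
def build_embedding_records(result):
    """Turn parsed chunks into id/text/metadata records for a vector store."""
    records = []
    for chunk in result.get("chunks", []):
        text = (chunk.get("embed") or "").strip()
        if not text:
            continue  # skip chunks with nothing to embed
        # Collect the distinct pages that the chunk's segments appear on.
        pages = sorted(
            {s["page_number"] for s in chunk.get("segments", []) if "page_number" in s}
        )
        records.append(
            {
                "id": chunk["chunk_id"],
                "text": text,
                "metadata": {"file_name": result.get("file_name"), "pages": pages},
            }
        )
    return records


# Hand-written sample that mirrors the shape of the documented response.
sample_result = {
    "file_name": "document.pdf",
    "chunks": [
        {
            "chunk_id": "chunk-1",
            "embed": "## Tax Invoice on behalf of -",
            "segments": [{"page_number": 1}, {"page_number": 1}],
        },
        {"chunk_id": "chunk-2", "embed": "", "segments": []},
    ],
}

records = build_embedding_records(sample_result)
print(records)
```

Keeping the page numbers in the metadata makes it possible to point a user back to the source page when a chunk is retrieved.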

Segment Fields

  • segment_id: Unique identifier for each segment
  • segment_type: Element classification (Text, Title, Table, Picture, SectionHeader, etc.)
  • content: Plain text content (present for text segments; the Picture and Table segments in the example above omit it)
  • markdown: Markdown-formatted content
  • html: HTML-formatted content
  • image: S3 URL for visual elements (Picture, Table)
  • bbox: Bounding box with left, top, width, height coordinates
  • page_number: Page where the segment appears
  • page_width / page_height: Page dimensions for coordinate reference
  • confidence: Model confidence score (0 to 1) for element detection
  • ocr: Array of word-level OCR results, each with text, a bounding box, a confidence value (may be null), and color
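
Two common follow-up steps are dropping low-confidence segments and converting bounding boxes into page-relative fractions, for example to draw overlays on a rendered page at any resolution. The sketch below assumes that bbox uses the same units as page_width and page_height with the origin at the top-left corner, which the example values suggest but the documentation does not state. The helper names and the 0.8 threshold are hypothetical.

```python
def normalized_bbox(segment):
    """Return (x0, y0, x1, y1) as fractions of the page, or None without page size."""
    width = segment.get("page_width")
    height = segment.get("page_height")
    if not width or not height:
        return None  # some segments in the example omit the page dimensions
    box = segment["bbox"]
    return (
        box["left"] / width,
        box["top"] / height,
        (box["left"] + box["width"]) / width,
        (box["top"] + box["height"]) / height,
    )


def confident_segments(segments, threshold=0.8):
    """Keep segments whose confidence meets the threshold (missing counts as 0)."""
    return [s for s in segments if (s.get("confidence") or 0.0) >= threshold]


# Values rounded from the example response in this guide.
sample_segments = [
    {
        "segment_type": "Picture",
        "confidence": 0.98,
        "page_width": 595.0,
        "page_height": 842.0,
        "bbox": {"left": 34.48, "top": 31.0, "width": 118.26, "height": 29.04},
    },
    {
        "segment_type": "SectionHeader",
        "confidence": 0.50,
        "page_width": 595.0,
        "page_height": 842.0,
        "bbox": {"left": 33.09, "top": 103.89, "width": 144.82, "height": 11.46},
    },
]

kept = confident_segments(sample_segments)
box = normalized_bbox(kept[0])
print(len(kept), box)
```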

Use Cases

  • Document Q&A: Extract structured content for RAG applications
  • Content Migration: Parse and convert documents to markdown or HTML
  • Data Extraction: Identify and extract specific document sections
  • Archive Processing: Batch process large document collections
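
For the content migration case, one possible approach is to join the markdown of every segment into a single document while leaving out unwanted element types. The sketch below assumes segments are returned in reading order and that each segment has a markdown or content field, as in the response example. The function name document_markdown and the skip_types parameter are hypothetical.

```python
def document_markdown(chunks, skip_types=frozenset()):
    """Join segment markdown into one document, skipping the given segment types."""
    parts = []
    for chunk in chunks:
        for segment in chunk.get("segments", []):
            if segment.get("segment_type") in skip_types:
                continue
            # Prefer markdown and fall back to plain content when it is missing.
            text = segment.get("markdown") or segment.get("content") or ""
            if text.strip():
                parts.append(text.strip())
    return "\n\n".join(parts)


# Hand-written sample that mirrors the shape of the documented response.
sample_chunks = [
    {
        "segments": [
            {"segment_type": "Picture", "markdown": "The image displays a logo..."},
            {"segment_type": "SectionHeader", "markdown": "## Tax Invoice on behalf of -"},
            {"segment_type": "Table", "markdown": "| Particulars | Net value |\n| :--- | :--- |\n| 1 x Dal Tadka | 170 |"},
        ]
    }
]

markdown_doc = document_markdown(sample_chunks, skip_types={"Picture"})
print(markdown_doc)
```

When no filtering is needed, joining the embed field of each chunk is a simpler alternative, since it already combines the content of all segments in the chunk.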