Parse Document

Overview

The Parse Document endpoint accepts a PDF, DOCX, or PPTX document and splits it into chunks using one of seven chunking strategies. It is designed for document analysis, RAG applications, and content organization workflows.

This endpoint returns a job ID for asynchronous processing. Use the job status endpoints to monitor progress and retrieve results when processing is complete.

Request

document_file

file

required

Document file to process. Supported formats: PDF, DOCX, PPTX

strategy

string

default:"semantic"

Chunking strategy to use. Options: "semantic", "fixed", "page", "paragraph", "heading", "zchunk", "hybrid"

chunk_size

integer

default:"1000"

Target chunk size in characters. Documented for the "fixed" strategy; the zChunk example under Advanced Chunking Options also sets it.

overlap

integer

default:"100"

Number of characters shared between consecutive chunks. Documented for the "fixed" strategy; the zChunk example under Advanced Chunking Options also sets it.

X-API-Key

string

required

API key for authentication
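
As a minimal sketch of how the form fields above fit together, the helper below assembles and checks them on the client before an upload. The function name build_form_fields and the rule that overlap must be smaller than chunk_size are assumptions for illustration, not documented server behavior.

```python
# Strategy names as listed for the `strategy` field.
STRATEGIES = {"semantic", "fixed", "page", "paragraph", "heading", "zchunk", "hybrid"}


def build_form_fields(strategy="semantic", chunk_size=1000, overlap=100):
    """Return the form fields for a chunking request (hypothetical helper)."""
    if strategy not in STRATEGIES:
        raise ValueError(
            f"Unknown strategy {strategy!r}; expected one of {sorted(STRATEGIES)}"
        )
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive number of characters")
    # Assumption: an overlap as large as the chunk itself is rejected locally.
    # The API may apply different rules.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    # Multipart form values travel as strings, so convert the integers here.
    return {
        "strategy": strategy,
        "chunk_size": str(chunk_size),
        "overlap": str(overlap),
    }


if __name__ == "__main__":
    print(build_form_fields("fixed", chunk_size=500, overlap=50))
```

Checking the strategy name locally avoids spending a request on a value the endpoint would reject as an invalid strategy.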

Response

job_id

string

Unique identifier for the processing job

status

string

Initial job status (typically "queued")

message

string

Status message about the job creation

quota_remaining

number

Remaining API quota after this request

Chunking Strategies

Semantic Chunking (Default)

Uses an AI model to group related content into the same chunk. Best for preserving context and meaning within each chunk.

Fixed Size Chunking

Splits text into fixed-size chunks with configurable overlap. Useful for consistent chunk sizes.
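
To make the roles of chunk_size and overlap concrete, here is a local sketch of fixed-size chunking with overlap. It illustrates the general technique only; how the service computes its chunks is not documented here, and the function name fixed_chunks is hypothetical.

```python
def fixed_chunks(text, chunk_size=1000, overlap=100):
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window start advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for piece in fixed_chunks("abcdefghij", chunk_size=4, overlap=1):
        print(piece)
```

With chunk_size=4 and overlap=1, each window starts three characters after the previous one, so the last character of one chunk is the first character of the next.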

Page-Based Chunking

Splits the document at page boundaries. Supported for PDF only; DOCX and PPTX files fall back to paragraph chunking.

Paragraph Chunking

Splits text based on paragraph boundaries. Good for maintaining natural text flow.

Heading Chunking

Splits text based on detected headings and section structure. Ideal for structured documents.

zChunk Semantic Chunking

Semantic chunking that uses LLaMA token log-probabilities (logprobs) to decide where chunk boundaries fall. Provides high-quality semantic boundaries.

Hybrid Chunking

Multi-modal processing with table extraction, image analysis, and enhanced metadata. Most comprehensive option.

curl -X 'POST' \
  'https://visionapi.unsiloed.ai/chunking' \
  -H 'accept: application/json' \
  -H 'X-API-Key: your-api-key' \
  -H 'Content-Type: multipart/form-data' \
  -F 'document_file=@document.pdf;type=application/pdf' \
  -F 'strategy=semantic' \
  -F 'chunk_size=1000' \
  -F 'overlap=100'

{
  "job_id": "b2094b38-e432-44b6-a5d0-67bed07d5de1",
  "status": "queued",
  "message": "Document chunking started",
  "quota_remaining": 48954
}
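
The same request can be sketched in Python with the requests library. The endpoint URL, header, and field names come from the curl command above; the function name submit_document, the session parameter, and the 60-second timeout are assumptions added for illustration.

```python
import os

API_URL = "https://visionapi.unsiloed.ai/chunking"


def submit_document(path, api_key, strategy="semantic", chunk_size=1000,
                    overlap=100, session=None):
    """Upload a document for chunking and return the parsed JSON response."""
    if session is None:
        # Imported lazily so a caller-supplied session object is sufficient.
        import requests
        session = requests
    with open(path, "rb") as fh:
        response = session.post(
            API_URL,
            headers={"X-API-Key": api_key, "accept": "application/json"},
            # requests builds the multipart Content-Type and boundary itself.
            files={"document_file": (os.path.basename(path), fh)},
            data={"strategy": strategy, "chunk_size": chunk_size, "overlap": overlap},
            timeout=60,  # assumed client-side timeout, not an API requirement
        )
    response.raise_for_status()
    return response.json()
```

A call such as submit_document("document.pdf", "your-api-key") would return a dictionary shaped like the response above, from which job_id can be read for polling.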

Retrieving Results

After the job is created, use the job status and results endpoints to monitor progress and retrieve the parsed chunks:

import time

import requests

BASE_URL = "https://visionapi.unsiloed.ai"


def get_chunking_results(job_id, api_key, poll_interval=5, max_wait=600):
    """Poll a job until it finishes, then return its results as a dict."""
    headers = {"X-API-Key": api_key}
    status_url = f"{BASE_URL}/jobs/{job_id}"
    # max_wait is a client-side limit so the loop cannot run forever
    deadline = time.monotonic() + max_wait

    while time.monotonic() < deadline:
        response = requests.get(status_url, headers=headers, timeout=30)

        if response.status_code == 200:
            status_data = response.json()
            # The sample responses use lowercase statuses ("queued", "completed"),
            # so compare case-insensitively
            status = str(status_data.get("status", "")).lower()
            print(f"Job status: {status}")

            if status == "completed":
                # Fetch the parsed chunks
                results_url = f"{BASE_URL}/jobs/{job_id}/result"
                results_response = requests.get(results_url, headers=headers, timeout=30)
                if results_response.status_code == 200:
                    return results_response.json()
                raise RuntimeError(f"Failed to get results: {results_response.text}")

            if status == "failed":
                raise RuntimeError(f"Job failed: {status_data.get('error', 'Unknown error')}")

        time.sleep(poll_interval)  # Wait before the next status check

    raise TimeoutError(f"Job {job_id} did not finish within {max_wait} seconds")


# Usage
job_id = "your-job-id"
results = get_chunking_results(job_id, "your-api-key")

Expected Results Structure

When the job completes, the results contain the parsed chunks and metadata:

{
  "job_id": "f7e8d9c2-4a5b-6c7d-8e9f-0a1b2c3d4e5f",
  "status": "completed",
  "results": {
    "file_type": "pdf",
    "strategy": "semantic",
    "total_chunks": 15,
    "avg_chunk_size": 847.3,
    "chunks": [
      {
        "chunk_id": 1,
        "text": "Introduction\n\nThis document provides an overview of our quarterly financial performance...",
        "page_number": 1,
        "bbox": {
          "x1": 72,
          "y1": 100,
          "x2": 540,
          "y2": 300
        },
        "confidence": 0.95,
        "element_types": ["Title", "Text"],
        "word_count": 156,
        "char_count": 892
      },
      {
        "chunk_id": 2,
        "text": "Financial Highlights\n\n• Revenue increased by 15% year-over-year\n• Net income grew by 22%...",
        "page_number": 1,
        "bbox": {
          "x1": 72,
          "y1": 320,
          "x2": 540,
          "y2": 480
        },
        "confidence": 0.92,
        "element_types": ["Title", "List"],
        "word_count": 89,
        "char_count": 534
      }
    ],
    "image_dimensions": [
      {
        "page": 1,
        "width": 612,
        "height": 792
      }
    ]
  }
}
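
As a sketch of how a client might consume this structure, the helpers below filter chunks by confidence and scale a chunk's bbox to the 0-1 range using the matching entry in image_dimensions. The field names come from the sample above; the function names, the 0.9 threshold, and the assumption that bbox values use the same units as image_dimensions are illustrative.

```python
def confident_chunks(payload, min_confidence=0.9):
    """Return the chunks whose confidence is at least min_confidence."""
    chunks = payload["results"]["chunks"]
    return [c for c in chunks if c.get("confidence", 0.0) >= min_confidence]


def normalized_bbox(chunk, payload):
    """Scale a chunk's bbox to the 0-1 range using its page's dimensions."""
    # Index page sizes by page number for direct lookup.
    pages = {d["page"]: d for d in payload["results"]["image_dimensions"]}
    page = pages[chunk["page_number"]]
    box = chunk["bbox"]
    # Assumption: bbox coordinates and page dimensions share the same units.
    return {
        "x1": box["x1"] / page["width"],
        "y1": box["y1"] / page["height"],
        "x2": box["x2"] / page["width"],
        "y2": box["y2"] / page["height"],
    }
```

Normalized coordinates are convenient when highlights are drawn on page renderings of a different resolution than the one reported in image_dimensions.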

Advanced Chunking Options

zChunk Semantic Chunking

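To use zChunk, set the strategy form field to "zchunk". The example below also sets a larger chunk_size and overlap:

# Form fields sent alongside document_file in the multipart request
data = {
    "strategy": "zchunk",
    "chunk_size": 2000,
    "overlap": 200
}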

zChunk uses LLaMA logprobs for intelligent boundary detection, providing high-quality semantic chunks.

Error Handling

Common Error Scenarios

Unsupported File Type: Only PDF, DOCX, and PPTX files are supported
File Size Limits: Large files may exceed processing limits
Quota Exceeded: API usage limits reached
Invalid Strategy: Unsupported chunking strategy specified
Processing Timeout: Document too complex or large to process
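
The first two scenarios can be caught before a request is sent. The sketch below checks the file extension against the supported formats and, optionally, the file size. The function name check_upload is hypothetical, and max_bytes is a placeholder: this page does not state a size limit, so pass the value from your plan if you know it.

```python
import os

# Formats listed as supported by the endpoint.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx"}


def check_upload(path, max_bytes=None):
    """Validate a file locally and return its lowercase extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type {ext!r}; use PDF, DOCX, or PPTX")
    # The size check only runs when the caller supplies a limit.
    if max_bytes is not None and os.path.getsize(path) > max_bytes:
        raise ValueError(f"{path} is larger than {max_bytes} bytes")
    return ext
```

A local check like this only saves a round trip; the server remains the authority on what it accepts.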

Best Practices

File Quality: Use high-quality, text-based documents for better results
Strategy Selection: Choose the appropriate strategy based on your use case:
- Semantic: Best for maintaining context and meaning
- Fixed: Good for consistent chunk sizes
- Hybrid: Most comprehensive for complex documents
Monitor Jobs: Always check job status before assuming completion
Handle Failures: Implement retry logic for failed jobs
Quota Management: Monitor your API quota usage
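
One way to implement the retry advice is exponential backoff around whatever function submits a document and waits for its result. In this sketch, run_job is any zero-argument callable supplied by the caller (for example, a wrapper around get_chunking_results); the name retry_job, the attempt count, and the delays are assumptions. It retries on every exception, which may be broader than you want for errors such as an unsupported file type.

```python
import time


def retry_job(run_job, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call run_job() until it succeeds or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_job()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Double the wait after each failure: 2s, 4s, 8s, ...
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

The sleep parameter exists so the delays can be replaced in tests; production callers can leave it at its default.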

Rate Limits

Concurrent Jobs: Limited number of active processing jobs per API key
File Size: Individual files may have size restrictions
Processing Time: Complex documents may take several minutes to process
Daily Quota: API usage limits based on your subscription plan

Check your API plan for specific limits and quotas.
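
Because active jobs per API key are limited, a client that processes many files may want to cap how many run at once. The sketch below does this with a thread pool. The name process_all, the process_one callable, and the default of 2 are assumptions; substitute the concurrency limit from your plan.

```python
from concurrent.futures import ThreadPoolExecutor


def process_all(paths, process_one, max_concurrent=2):
    """Run process_one(path) for every path, at most max_concurrent at a time.

    Results are returned in the same order as paths.
    """
    # The pool size is the cap: no more than max_concurrent calls run in parallel.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(process_one, paths))
```

Here process_one would typically submit one document and poll until its job finishes, so each worker thread holds at most one active job.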

Use Cases

RAG Applications

Use semantic or hybrid chunking to create meaningful chunks for retrieval-augmented generation systems.
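
As a rough sketch of the step between parsing and retrieval, the function below turns a results payload into records that could be embedded and stored. The chunk fields follow the Expected Results Structure section; the record layout and the name to_retrieval_records are assumptions and should be adapted to your vector store.

```python
def to_retrieval_records(payload, source_name):
    """Convert a results payload into one record per chunk for indexing."""
    records = []
    for chunk in payload["results"]["chunks"]:
        records.append({
            # Combine job and chunk IDs so records stay unique across documents.
            "id": f"{payload['job_id']}:{chunk['chunk_id']}",
            "text": chunk["text"],
            "metadata": {
                "source": source_name,
                "page_number": chunk.get("page_number"),
                "element_types": chunk.get("element_types", []),
            },
        })
    return records
```

Keeping page_number in the metadata lets a retrieval system cite the page a passage came from.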

Document Analysis

Parse documents into structured sections for content analysis and information extraction.

Content Organization

Automatically organize long documents into logical sections based on content structure.

Search Indexing

Create searchable chunks optimized for document search and retrieval systems.
