Overview
The Parse Document endpoint processes PDF, DOCX, and PPTX documents, breaking them into meaningful sections using one of seven chunking strategies. It is designed for document analysis, RAG (retrieval-augmented generation) applications, and content organization workflows.
This endpoint returns a job ID for asynchronous processing. Use the job status endpoints to monitor progress and retrieve results when processing is complete.
Request
document_file: Document file to process. Supported formats: PDF, DOCX, PPTX
strategy: Chunking strategy to use. Options: "semantic" (default), "fixed", "page", "paragraph", "heading", "zchunk", "hybrid"
chunk_size: Size of chunks in characters (used by the "fixed" strategy)
overlap: Overlap in characters between consecutive chunks (used by the "fixed" strategy)
X-API-Key (header): API key for authentication
Response
job_id: Unique identifier for the processing job
status: Initial job status (typically "queued")
message: Status message about the job creation
quota_remaining: Remaining API quota after this request
Chunking Strategies
Semantic Chunking (Default)
Uses advanced AI to intelligently group related content together. Best for maintaining context and meaning.
Fixed Size Chunking
Splits text into fixed-size chunks with configurable overlap. Useful for consistent chunk sizes.
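To illustrate how size and overlap interact (a sketch of the general technique, not the service's internal implementation), a fixed-size splitter advances by chunk_size minus overlap characters per window:
def fixed_size_chunks(text, chunk_size=1000, overlap=100):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# With chunk_size=1000 and overlap=100, the second chunk starts at character 900.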
Page-Based Chunking
Splits documents by pages (PDF only). For other formats, falls back to paragraph chunking.
Paragraph Chunking
Splits text based on paragraph boundaries. Good for maintaining natural text flow.
Heading Chunking
Splits text based on detected headings and section structure. Ideal for structured documents.
zChunk Semantic Chunking
Advanced semantic chunking using LLaMA logprobs for boundary detection. Provides high-quality semantic boundaries.
Hybrid Chunking
Multi-modal processing with table extraction, image analysis, and enhanced metadata. Most comprehensive option.
curl -X 'POST' \
'https://visionapi.unsiloed.ai/chunking' \
-H 'accept: application/json' \
-H 'X-API-Key: your-api-key' \
-H 'Content-Type: multipart/form-data' \
-F 'document_file=@document.pdf;type=application/pdf' \
-F 'strategy=semantic' \
-F 'chunk_size=1000' \
-F 'overlap=100'
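The same request can be made from Python. A minimal sketch using requests (the endpoint, form fields, and header mirror the curl call above; the submit_document helper itself is our own illustration):
import requests

def submit_document(file_path, api_key, strategy="semantic",
                    chunk_size=1000, overlap=100,
                    mime_type="application/pdf"):
    """Submit a document for chunking and return the job ID."""
    with open(file_path, "rb") as f:
        response = requests.post(
            "https://visionapi.unsiloed.ai/chunking",
            headers={"X-API-Key": api_key},
            # Adjust mime_type for DOCX or PPTX uploads
            files={"document_file": (file_path, f, mime_type)},
            data={"strategy": strategy, "chunk_size": chunk_size, "overlap": overlap},
        )
    response.raise_for_status()
    return response.json()["job_id"]

job_id = submit_document("document.pdf", "your-api-key")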
Success Response
{
  "job_id": "b2094b38-e432-44b6-a5d0-67bed07d5de1",
  "status": "queued",
  "message": "Document chunking started",
  "quota_remaining": 48954
}
Retrieving Results
After the job is created, use the job status and results endpoints to monitor progress and retrieve the parsed chunks:
import requests
import time

def get_chunking_results(job_id, api_key):
    """Monitor a chunking job and retrieve results when it completes."""
    headers = {"X-API-Key": api_key}
    status_url = f"https://visionapi.unsiloed.ai/jobs/{job_id}"

    # Poll for completion
    while True:
        response = requests.get(status_url, headers=headers)
        if response.status_code == 200:
            status_data = response.json()
            print(f"Job Status: {status_data['status']}")

            # Status casing varies between examples, so compare case-insensitively
            if status_data['status'].upper() == 'COMPLETED':
                # Fetch the parsed chunks from the results endpoint
                results_url = f"https://visionapi.unsiloed.ai/jobs/{job_id}/result"
                results_response = requests.get(results_url, headers=headers)
                if results_response.status_code == 200:
                    return results_response.json()
                raise Exception(f"Failed to get results: {results_response.text}")
            elif status_data['status'].upper() == 'FAILED':
                raise Exception(f"Job failed: {status_data.get('error', 'Unknown error')}")
        else:
            raise Exception(f"Status check failed: {response.status_code} {response.text}")

        time.sleep(5)  # Check every 5 seconds

# Usage
job_id = "your-job-id"
results = get_chunking_results(job_id, "your-api-key")
Expected Results Structure
When the job completes, the results contain the parsed chunks and metadata:
{
  "job_id": "f7e8d9c2-4a5b-6c7d-8e9f-0a1b2c3d4e5f",
  "status": "completed",
  "results": {
    "file_type": "pdf",
    "strategy": "semantic",
    "total_chunks": 15,
    "avg_chunk_size": 847.3,
    "chunks": [
      {
        "chunk_id": 1,
        "text": "Introduction\n\nThis document provides an overview of our quarterly financial performance...",
        "page_number": 1,
        "bbox": {
          "x1": 72,
          "y1": 100,
          "x2": 540,
          "y2": 300
        },
        "confidence": 0.95,
        "element_types": ["Title", "Text"],
        "word_count": 156,
        "char_count": 892
      },
      {
        "chunk_id": 2,
        "text": "Financial Highlights\n\n• Revenue increased by 15% year-over-year\n• Net income grew by 22%...",
        "page_number": 1,
        "bbox": {
          "x1": 72,
          "y1": 320,
          "x2": 540,
          "y2": 480
        },
        "confidence": 0.92,
        "element_types": ["Title", "List"],
        "word_count": 89,
        "char_count": 534
      }
    ],
    "image_dimensions": [
      {
        "page": 1,
        "width": 612,
        "height": 792
      }
    ]
  }
}
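The results payload is plain JSON, so chunks can be filtered directly. As an example, a hypothetical helper that keeps only high-confidence chunk texts for downstream indexing (field names taken from the structure above):
def high_confidence_texts(results, min_confidence=0.9):
    """Return the text of chunks whose confidence meets the threshold."""
    return [
        chunk["text"]
        for chunk in results["results"]["chunks"]
        if chunk.get("confidence", 0) >= min_confidence
    ]

texts = high_confidence_texts(results)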
Advanced Chunking Options
zChunk Semantic Chunking
For advanced semantic boundary detection:
data = {
    "strategy": "zchunk",
    "chunk_size": 2000,
    "overlap": 200
}
zChunk uses LLaMA logprobs for intelligent boundary detection, providing high-quality semantic chunks.
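For example, reusing the hypothetical submit_document helper from the request example above:
job_id = submit_document("document.pdf", "your-api-key",
                         strategy="zchunk", chunk_size=2000, overlap=200)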
Error Handling
Common Error Scenarios
Unsupported File Type: Only PDF, DOCX, and PPTX files are supported
File Size Limits: Large files may exceed processing limits
Quota Exceeded: API usage limits reached
Invalid Strategy: Unsupported chunking strategy specified
Processing Timeout: Document too complex or large to process
Best Practices
File Quality: Use high-quality, text-based documents for better results
Strategy Selection: Choose the appropriate strategy for your use case:
  Semantic: best for maintaining context and meaning
  Fixed: good for consistent chunk sizes
  Hybrid: most comprehensive for complex documents
Monitor Jobs: Always check job status before assuming completion
Handle Failures: Implement retry logic for failed jobs (see the sketch after this list)
Quota Management: Monitor your API quota usage
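A minimal retry sketch combining the hypothetical submit_document helper with the polling function above; it resubmits the document when a job fails, which assumes starting a fresh job is the right recovery for your workload:
def chunk_with_retries(file_path, api_key, max_attempts=3, **params):
    """Retry the full submit-and-poll cycle when a job fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            job_id = submit_document(file_path, api_key, **params)
            return get_chunking_results(job_id, api_key)
        except Exception as exc:
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed, retrying: {exc}")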
Rate Limits
Concurrent Jobs: Limited number of active processing jobs per API key
File Size: Individual files may have size restrictions
Processing Time: Complex documents may take several minutes to process
Daily Quota: API usage limits based on your subscription plan
Check your API plan for specific limits and quotas.
Use Cases
RAG Applications
Use semantic or hybrid chunking to create meaningful chunks for retrieval-augmented generation systems.
Document Analysis
Parse documents into structured sections for content analysis and information extraction.
Content Organization
Automatically organize long documents into logical sections based on content structure.
Search Indexing
Create searchable chunks optimized for document search and retrieval systems.