POST /parse
curl -X 'POST' \
  'https://prod.visionapi.unsiloed.ai/parse' \
  -H 'accept: application/json' \
  -H 'api-key: your-api-key' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@document.pdf;type=application/pdf' \
  -F 'use_high_resolution=true' \
  -F 'segmentation_method=smart_layout_detection' \
  -F 'ocr_mode=auto_ocr' \
  -F 'ocr_engine=UnsiloedHawk' \
  -F 'validate_table_segments=false' \
  -F 'merge_tables=true' \
  -F 'keep_segment_types=all' \
  -F 'validate_segments=["Table","Picture","Formula"]' \
  -F 'segment_analysis={"Table":{"html":"LLM","markdown":"LLM","extended_context":true,"crop_image":"All","model_id":"us_table_v2"}}'

# Alternative: Use presigned URL instead of file upload
# Replace the file parameter with url parameter:
# -F 'url=https://your-bucket.s3.amazonaws.com/document.pdf?signature=...' \
{
  "job_id": "e77a5c42-4dc1-44d0-a30e-ed191e8a8908",
  "status": "Starting",
  "file_name": "document.pdf",
  "created_at": "2025-07-18T10:42:10.545832520Z",
  "message": "Task created successfully. Use GET /parse/{job_id} to check status and retrieve results.",
  "credit_used": 5,
  "quota_remaining": 23695,
  "merge_tables": true
}

Overview

The Parse Document endpoint processes PDF, image (PNG, JPEG, TIFF), and Office (PPT, PPTX, DOC, DOCX, XLS, XLSX) documents, breaking them into meaningful sections with detailed analysis including text extraction, image recognition, table parsing, and OCR data. You can provide documents either by direct file upload or by presigned URL. This endpoint supports advanced customization options for fine-tuning parsing behavior to match your specific use case.
This endpoint returns a job ID for asynchronous processing. Use the Get Parse Job Status endpoint (GET /parse/{job_id}) to check status and retrieve results when processing is complete.
Processing large files or running many requests in parallel? The Presigned Upload endpoint (POST /v2/parse/upload) supports faster uploads, larger file sizes, and higher throughput.

Request

You must provide either the file or the url parameter, but not both.
file
file
Document file to process. Supported formats: PDF, images (PNG, JPEG, TIFF), and office documents (PPT, PPTX, DOC, DOCX, XLS, XLSX). Required if url is not provided.
url
string
Presigned or public URL of the document to fetch and process. Required if file is not provided.
use_high_resolution
boolean
Use high-resolution images for cropping and post-processing. Latency penalty: ~2–3 seconds per page. Defaults to true.
segmentation_method
string
Document segmentation strategy:
  • "smart_layout_detection" (default): Detects layout elements (tables, pictures, formulas, etc.) using bounding boxes.
  • "page_by_page": Treats each page as a single segment; faster for simple documents.
ocr_mode
string
OCR processing mode:
  • "auto_ocr" (default): Applies OCR only where needed.
  • "force_ocr": Applies OCR to all content regardless of existing text layer.
ocr_engine
string
OCR engine to use for text recognition:
  • "UnsiloedBeta" (default): Handles rotated/warped text and irregular bounding boxes.
  • "UnsiloedHawk": Higher accuracy for complex layouts and mixed content.
  • "UnsiloedStorm": Enterprise-grade accuracy optimized for 50+ languages.
merge_tables
boolean
Merge tables that span multiple pages into a single unified structure. Defaults to false.
validate_table_segments
boolean
Legacy: validate table segment classifications using VLM. Prefer validate_segments: ["Table"] instead. Defaults to false.
validate_segments
string
JSON array string of segment types to validate with VLM. Example: ["Table", "Formula", "Picture"]. Defaults to [].
keep_segment_types
string
Filter output to include only the specified segment types (comma-separated). Example: "table,picture". Use "all" to include everything. Defaults to "all".
Available segment types:
  • table: Tabular data segments
  • picture: Image and graphic segments
  • formula: Mathematical equations
  • text: Regular text content
  • sectionheader: Section headers
  • title: Document titles
  • listitem: List items
  • caption: Image captions
  • footnote: Footnotes
  • pageheader: Page headers
  • pagefooter: Page footers
xml_citation
boolean
Extract and hyperlink bibliography citations in the markdown output. PDFs only. Defaults to false.
output_fields
string
JSON object controlling which output fields are included in the response. Example: {"html": false, "markdown": true, "ocr": false}. All fields default to true. Available fields: html, markdown, ocr, image, llm, content, bbox, confidence, embed.
segment_analysis
string
JSON object controlling HTML/Markdown generation strategy and AI model per segment type. Example: {"Table": {"html": "LLM", "markdown": "LLM", "model_id": "us_table_v2"}}.
segment_processing
string
Alias for segment_analysis (Core Parser name). If both are provided, this takes precedence.
page_range
string
Page range to process. Formats: "1-5", "2,4,6", "[1,3,5]". Defaults to all pages.
segment_type_naming
string
Segment type naming convention. "Unsiloed" (default) — e.g., PageHeader, ListItem, Picture. "Other" — alternative names e.g., Header, List Item, Figure.
detect_checkboxes
boolean
Detect checkboxes in the document. Defaults to false.
extract_charts
boolean
Extract structured data from charts and graphs. Defaults to false.
extract_colors
boolean
Transfer text color from the PDF text layer to OCR results. Defaults to false.
Attach hyperlink URLs from PDF annotations to OCR results. Defaults to false.
error_handling
string
Error handling strategy for non-critical processing errors. "Continue" (default) — proceed despite errors (e.g., LLM refusals). "Fail" — stop and fail the task on any error.
expires_in
integer
Seconds until the task and its output are deleted. Defaults to the plan expiration time.
chunk_processing
string
JSON object for chunk processing configuration.
llm_processing
string
JSON object for LLM processing configuration.

Configuration Best Practices

Click to expand each scenario below to view detailed configuration settings, recommendations, and trade-offs for your specific use case.
Use this configuration when accuracy is critical and processing time is less important.
Configuration:
{
  "use_high_resolution": true,
  "segmentation_method": "smart_layout_detection",
  "ocr_mode": "force_ocr",
  "merge_tables": true,
  "validate_segments": ["Table", "Picture", "Formula"],
  "segment_analysis": {
    "Table": {
      "html": "LLM",
      "markdown": "LLM",
      "extended_context": true,
      "crop_image": "All",
      "model_id": "us_table_v2"
    }
  }
}
When to Use:
  • Legal documents requiring precise text extraction
  • Financial statements with complex tables
  • Archival documents with low-quality scans
  • Documents where accuracy is more important than speed
Trade-offs:
  • Latency: +2-3 seconds per page for high resolution
  • Latency: +1-2 seconds per page for segment validation
Use this configuration when speed is prioritized over maximum accuracy.
Configuration:
{
  "use_high_resolution": false,
  "segmentation_method": "page_by_page",
  "ocr_mode": "auto_ocr",
  "merge_tables": false
}
When to Use:
  • High-volume document processing
  • Real-time applications requiring quick results
  • Documents with simple layouts
  • Pre-screened high-quality digital documents
Benefits:
  • Fastest processing time
  • Lower cost per document
  • Suitable for batch processing large volumes
Extract only tables and charts from financial reports and statements.
Configuration:
{
  "merge_tables": true,
  "keep_segment_types": "table,picture",
  "validate_segments": ["Table", "Picture"],
  "segmentation_method": "smart_layout_detection",
  "ocr_mode": "auto_ocr",
  "segment_analysis": {
    "Table": {
      "html": "LLM",
      "markdown": "LLM",
      "model_id": "us_table_v2"
    }
  }
}
When to Use:
  • Balance sheets and P&L statements
  • Quarterly/annual financial reports
  • Investment reports with charts
  • Documents where only structured data matters
Benefits:
  • Reduced response size (text content filtered out)
  • Focus on data-rich content
  • Merged multi-page tables for complete datasets
Extract only tabular data with minimal response size for maximum efficiency.
Configuration:
{
  "merge_tables": true,
  "keep_segment_types": "table",
  "validate_segments": ["Table"],
  "output_fields": {
    "html": true,
    "markdown": true,
    "ocr": false,
    "image": false,
    "content": true,
    "bbox": false,
    "confidence": false
  }
}
When to Use:
  • Extracting data from invoices
  • Processing structured forms
  • Database population from documents
  • CSV/Excel export workflows
Benefits:
  • Minimal response payload
  • Faster data transfer
  • Easy integration with data pipelines
Extract content with structured citations from research papers and academic documents.
Configuration:
{
  "use_high_resolution": true,
  "segmentation_method": "smart_layout_detection",
  "ocr_mode": "auto_ocr",
  "xml_citation": true
}
When to Use:
  • Research papers with bibliographies
  • Academic articles with citations
  • Scientific documents
  • Literature reviews
Benefits:
  • Automatic citation extraction and linking
  • Structured bibliography metadata
  • In-text citation hyperlinks in markdown
  • Preserves academic document structure
Citation extraction is only available for PDF documents.
Optimize for scanned documents and images with poor text quality.
Configuration:
{
  "use_high_resolution": true,
  "ocr_mode": "force_ocr",
  "segmentation_method": "smart_layout_detection"
}
When to Use:
  • Scanned paper documents
  • Low-quality photocopies
  • Historical documents
  • Image-based PDFs
Benefits:
  • Maximum OCR coverage
  • Better text extraction from poor quality sources
  • Higher accuracy for challenging documents
Optimize response size and performance by selectively including only the fields you need.
For Minimal Response Size:
{
  "output_fields": {
    "html": false,
    "markdown": false,
    "ocr": false,
    "image": false,
    "llm": false,
    "content": true,
    "bbox": false,
    "confidence": false,
    "embed": true
  }
}
For Text-Only Processing:
{
  "output_fields": {
    "html": false,
    "markdown": true,
    "ocr": false,
    "image": false,
    "llm": false,
    "content": true,
    "bbox": true,
    "confidence": false,
    "embed": true
  }
}
For Full Analysis (Default):
Omit output_fields or set all fields to true to include all available data.
Benefits:
  • Reduced response size and bandwidth usage
  • Faster processing and data transfer
  • Cost optimization for high-volume processing

Parameter Details

File Input Options

The API supports two methods for providing the document to process:
  1. Direct File Upload (file parameter): Upload the document file directly as multipart/form-data
  2. Presigned URL (url parameter): Provide a publicly accessible URL or presigned URL to the document
Important Notes:
  • You must provide either file or url, but not both
  • When using url, the document will be downloaded from the provided URL before processing
  • Presigned URLs are ideal for documents already stored in cloud storage (S3, GCS, Azure Blob, etc.)
  • The URL must be publicly accessible or include necessary authentication parameters (e.g., S3 presigned URLs with signatures)
  • Supported formats are the same for both methods: PDF, images (PNG, JPEG, TIFF), and office documents (PPT, PPTX, DOC, DOCX, XLS, XLSX)
Use Cases for Presigned URLs:
  • Documents already stored in cloud storage
  • Avoiding duplicate file uploads
  • Integration with existing document management systems
  • Processing large files without upload overhead
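Submitting by URL is the same multipart request with the `file` part swapped for a `url` field. A minimal sketch of assembling that payload (the bucket URL is a placeholder, and the actual HTTP call is shown commented out so the either/or rule stays visible in code):

```python
def build_parse_payload(presigned_url, **options):
    """Form fields for POST /parse when fetching the document by URL."""
    if "file" in options:
        # The API rejects requests that carry both file and url.
        raise ValueError("Provide either file or url, not both")
    payload = {"url": presigned_url}
    payload.update(options)  # e.g. ocr_mode, segmentation_method, ...
    return payload

payload = build_parse_payload(
    "https://your-bucket.s3.amazonaws.com/document.pdf?signature=...",
    ocr_mode="auto_ocr",
    segmentation_method="smart_layout_detection",
)

# To submit (requires a valid API key and the requests library):
# import requests
# resp = requests.post(
#     "https://prod.visionapi.unsiloed.ai/parse",
#     headers={"api-key": "your-api-key"},
#     data=payload,  # form fields only; no files= argument when using url
# )
```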

Segmentation Method

The segmentation_method parameter controls how the document is analyzed and segmented:
  • "smart_layout_detection" (default): Analyzes pages for layout elements (e.g., Table, Picture, Formula, etc.) using bounding boxes. Provides fine-grained segmentation and better chunking for complex documents.
  • "page_by_page": Treats each page as a single segment. Faster processing, ideal for simple documents without complex layouts.

OCR Mode

The ocr_mode parameter controls optical character recognition processing:
  • "auto_ocr" (default): Intelligently determines when OCR is needed based on the document content. Balances accuracy and performance.
  • "force_ocr": Applies OCR to all content regardless of existing text layer. Use this for scanned documents or when maximum text extraction is required.

Table Merging

The merge_tables parameter enables merging of tables that span multiple pages.
How It Works:
  • Analyzes consecutive table segments across pages
  • Identifies tables with matching column headers
  • Merges them into a single unified table structure
  • Preserves table formatting and data integrity
When to Use:
  • Multi-Page Financial Statements: Consolidate P&L statements or balance sheets spanning multiple pages
  • Large Data Tables: Merge inventory lists, transaction records, or data sets split across pages
  • Reports with Continuation Tables: Automatically combine tables marked with “continued on next page”
Example:
{
  "merge_tables": true
}
Benefits:
  • Simplified Data Processing: Work with complete tables instead of fragments
  • Better Context: Maintain full table context for analysis and extraction
  • Reduced Post-Processing: Eliminates need for manual table stitching
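The header-matching rule above can be illustrated client-side. This is a sketch of the behavior, not the server implementation: consecutive tables whose header rows match are concatenated into one table.

```python
def merge_page_tables(tables):
    """Merge consecutive tables whose header rows match.

    tables: list of tables, each a list of rows with the header row first.
    Mirrors, in spirit, what merge_tables=true does server-side.
    """
    merged = []
    for table in tables:
        if merged and merged[-1][0] == table[0]:
            merged[-1].extend(table[1:])  # same header: append data rows only
        else:
            merged.append(list(table))    # new header: start a new table
    return merged

page1 = [["Item", "Amount"], ["Rent", "1200"]]
page2 = [["Item", "Amount"], ["Utilities", "300"]]
assert merge_page_tables([page1, page2]) == [
    [["Item", "Amount"], ["Rent", "1200"], ["Utilities", "300"]]
]
```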

Citation Extraction (Research Papers)

This is a specialized feature for academic and research PDF documents. Not needed for general document parsing.
The xml_citation parameter enables automatic extraction and linking of citations from research papers, academic articles, and scientific documents.
How It Works:
  • Extracts structured bibliography from the document
  • Identifies in-text citation references (e.g., “Chen et al., 2021”)
  • Hyperlinks citations in the markdown output to their bibliography entries
  • Returns structured citation metadata in the response
Example:
{
  "xml_citation": true
}
Response Metadata: When enabled, the response includes a metadata field with structured citation data:
{
  "metadata": {
    "citations": [
      {
        "id": 1,
        "title": "Deep Learning for NLP",
        "authors": ["John Smith", "Jane Doe"],
        "year": "2021",
        "journal": "Nature",
        "volume": "15",
        "pages": "123-145",
        "doi": "10.1000/example"
      }
    ],
    "document_metadata": {
      "title": "Document Title",
      "authors": ["Author Name"]
    }
  }
}
Markdown Enhancement: In-text citations are automatically hyperlinked:
  • Original: "As shown by Chen et al. (2021)..."
  • Enhanced: "As shown by [Chen et al. (2021)](#ref-5)..."
Only available for PDF documents. This parameter is ignored for other file types (images, Office documents).
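Assuming the metadata shape shown above, the citation entries can be rendered into a readable reference list with a few lines of post-processing. A sketch only; which fields are present may vary per document:

```python
def format_citation(c):
    """Render one citation dict from metadata["citations"] as a string."""
    parts = [
        ", ".join(c.get("authors", [])),
        f'({c["year"]})' if c.get("year") else "",
        c.get("title", ""),
        c.get("journal", ""),
    ]
    # Skip whichever fields the extractor could not recover.
    return ". ".join(p for p in parts if p)

citation = {
    "id": 1,
    "title": "Deep Learning for NLP",
    "authors": ["John Smith", "Jane Doe"],
    "year": "2021",
    "journal": "Nature",
}
print(format_citation(citation))
# John Smith, Jane Doe. (2021). Deep Learning for NLP. Nature
```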

Content Type Filtering

The keep_segment_types parameter allows you to filter the output to include only specific segment types, reducing response size and focusing on relevant content.
How It Works:
  • Accepts a comma-separated list of segment types (case-insensitive)
  • Filters segments after processing is complete
  • Removes chunks that have no segments after filtering
Available Options:
  • "all" (default): Include all segment types
  • "table": Only table segments
  • "picture": Only image/graphic segments
  • "table,picture": Tables and pictures only
  • "table,formula": Tables and formulas only
  • Custom combinations using any segment type
Supported Segment Types:
  • table, picture, formula, text, sectionheader, title, listitem, caption, footnote, pageheader, pagefooter
Example Usage:
{
  "keep_segment_types": "table,picture"
}
Use Cases:
  • Tables Only: Extract only tabular data from financial documents
  • Pictures Only: Extract charts, graphs, and diagrams for visual analysis
  • Tables + Pictures: Get structured data and visualizations, skip text content
  • Custom Combinations: Mix any segment types based on your needs
Benefits:
  • Reduced Response Size: Filter out unwanted content before receiving results
  • Faster Processing: Less data to transfer and parse
  • Focused Extraction: Get only the content types you need
  • Cost Optimization: Smaller responses reduce bandwidth usage
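If you need to post-filter an already-retrieved result, the behavior described above (case-insensitive matching, then dropping chunks left empty) can be mirrored locally. A sketch assuming only the documented chunks/segments/segment_type fields:

```python
def filter_segments(chunks, keep="all"):
    """Keep only segments whose type appears in the comma-separated keep list."""
    if keep.strip().lower() == "all":
        return chunks
    wanted = {t.strip().lower() for t in keep.split(",")}
    filtered = []
    for chunk in chunks:
        segs = [s for s in chunk["segments"]
                if s["segment_type"].lower() in wanted]
        if segs:  # drop chunks that have no segments after filtering
            filtered.append({**chunk, "segments": segs})
    return filtered

chunks = [
    {"segments": [{"segment_type": "Table"}, {"segment_type": "Text"}]},
    {"segments": [{"segment_type": "Text"}]},
]
result = filter_segments(chunks, "table,picture")
# one chunk remains, containing only the Table segment
```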

Output Fields Configuration

The output_fields parameter allows you to control which fields are included in the API response. This is useful for reducing response size, improving performance, and optimizing bandwidth usage when you don’t need all available data.
Available Fields:
  • html (default: true): Include HTML representation of segments
  • markdown (default: true): Include Markdown representation of segments
  • ocr (default: true): Include OCR results with bounding boxes and confidence scores
  • image (default: true): Include cropped segment images (base64 encoded)
  • llm (default: true): Include LLM-generated content and descriptions
  • content (default: true): Include text content of segments
  • bbox (default: true): Include bounding box coordinates
  • confidence (default: true): Include confidence scores for segments
  • embed (default: true): Include embed text in chunk responses
Usage: Set fields to false to exclude them from the response. Fields not specified default to true for backward compatibility.
Example Configuration:
{
  "html": false,
  "markdown": true,
  "ocr": false,
  "image": false,
  "llm": false,
  "content": true,
  "bbox": true,
  "confidence": false,
  "embed": true
}
Benefits:
  • Reduced Response Size: Excluding large fields like image and html can significantly reduce payload size
  • Faster Processing: Less data to serialize and transfer
  • Cost Optimization: Smaller responses reduce bandwidth costs
  • Selective Data: Only retrieve the fields you need for your use case
When to Use:
  • Minimal Response: Set most fields to false when you only need basic content
  • Text-Only Processing: Exclude image, ocr, and llm when processing text content
  • Embedding Generation: Include only content and embed when generating embeddings
  • Full Analysis: Keep all fields enabled (default) for comprehensive document analysis
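Because output_fields is passed as a JSON string in the form data, a small helper that starts from the all-true defaults and flips off only what you name keeps requests tidy. A sketch; the field list comes from the documentation above:

```python
import json

ALL_FIELDS = ["html", "markdown", "ocr", "image", "llm",
              "content", "bbox", "confidence", "embed"]

def output_fields_json(exclude=()):
    """JSON string for the output_fields form parameter.

    Every field defaults to true; those listed in exclude become false.
    """
    unknown = set(exclude) - set(ALL_FIELDS)
    if unknown:
        raise ValueError(f"Unknown output fields: {sorted(unknown)}")
    return json.dumps({f: f not in exclude for f in ALL_FIELDS})

# Text-only processing, matching the best-practice example above:
fields = output_fields_json(exclude=["html", "ocr", "image", "llm", "confidence"])
# pass as a form field: data={"output_fields": fields}
```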

Segment Analysis Configuration

The segment_analysis parameter allows you to customize how different segment types are processed, including HTML/Markdown generation strategies and which field should populate the content field.
Available Segment Types: You can configure processing for any of the following segment types:
  • Table: Tabular data segments
  • Picture: Image and graphic segments
  • Formula: Mathematical equations
  • Title: Document titles
  • SectionHeader: Section headers
  • Text: Regular text content
  • ListItem: List items
  • Caption: Image captions
  • Footnote: Footnotes
  • PageHeader: Page headers
  • PageFooter: Page footers
  • Page: Full page segments
Configuration Options: For each segment type, you can specify:
  • html: Generation strategy for HTML representation
    • "Auto" (default): Automatically determine the best method
    • "LLM": Use LLM to generate HTML
  • markdown: Generation strategy for Markdown representation
    • "Auto" (default): Automatically determine the best method
    • "LLM": Use LLM to generate Markdown
  • content_source: Defines which field should populate the content field in the response
    • "OCR" (default): Use OCR text for content
    • "HTML": Use HTML representation as content
    • "Markdown": Use Markdown representation as content
  • model_id (Table segments only): Specifies which AI model to use for table processing
    • "us_table_v1": Standard table processing model
    • "us_table_v2": Enhanced table processing model with improved accuracy
Example Configuration:
{
  "Table": {
    "html": "LLM",
    "markdown": "LLM",
    "content_source": "HTML",
    "model_id": "us_table_v2"
  },
  "Picture": {
    "html": "LLM",
    "markdown": "LLM",
    "content_source": "Markdown"
  }
}
How content_source Works: The content_source parameter determines which field’s value will be used to populate the content field in the segment response:
  • When content_source is set to "HTML", the content field will contain the HTML representation, and the separate html and markdown fields will be empty
  • When content_source is set to "Markdown", the content field will contain the Markdown representation, and the separate html and markdown fields will be empty
  • When content_source is set to "OCR" (default), the content field contains OCR text, and html and markdown fields are populated separately
Use Cases:
  • HTML as Content: Set content_source: "HTML" for Table segments when you want HTML-formatted table data directly in the content field
  • Markdown as Content: Set content_source: "Markdown" for Picture segments when you want Markdown-formatted descriptions in the content field
  • LLM-Enhanced Output: Use "LLM" for both html and markdown generation strategies to get AI-enhanced representations in those fields
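Like output_fields, segment_analysis travels as a serialized JSON string inside the multipart form. A minimal sketch of building the example configuration above and attaching it to a request (the HTTP call is commented out and requires a valid API key):

```python
import json

segment_analysis = {
    "Table": {
        "html": "LLM",
        "markdown": "LLM",
        "content_source": "HTML",
        "model_id": "us_table_v2",
    },
    "Picture": {
        "html": "LLM",
        "markdown": "LLM",
        "content_source": "Markdown",
    },
}

# The form field must carry the serialized JSON object, not a Python dict:
form_data = {"segment_analysis": json.dumps(segment_analysis)}

# import requests
# resp = requests.post(
#     "https://prod.visionapi.unsiloed.ai/parse",
#     headers={"api-key": "your-api-key"},
#     files={"file": open("document.pdf", "rb")},
#     data=form_data,
# )
```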

Response

job_id
string
required
Job identifier — pass this to GET /parse/{job_id} to poll for results.
status
string
required
Initial job status. Always "Starting" on creation.
file_name
string
required
Name of the uploaded file, or "unknown" when a URL was provided.
created_at
string
required
ISO 8601 timestamp when the job was created.
message
string
required
Human-readable status message with a polling hint.
credit_used
integer
required
Number of pages deducted from your quota for this job.
quota_remaining
integer
required
Remaining page quota after this job was deducted.
merge_tables
boolean
required
Whether table merging is enabled for this job (reflects the submitted merge_tables value).

Document Analysis Features

The parsing endpoint provides comprehensive document analysis including:

Text Extraction

Extracts text content with high accuracy, preserving formatting and structure.

Image Recognition

Identifies and analyzes images within documents, providing descriptions and metadata.

Table Parsing

Extracts tabular data with proper structure and formatting.

OCR Processing

Performs optical character recognition on text elements with confidence scores.

Section Detection

Automatically identifies different document sections like headers, body text, and captions.

Bounding Box Information

Provides precise coordinates for all extracted elements.

Advanced Content Processing

  • LLM-Enhanced Analysis: Uses language models for better content understanding
  • Multi-Format Output: Generates HTML, Markdown, and plain text versions
  • Context-Aware Processing: Maintains document context across segments
  • Intelligent Chunking: Creates semantically meaningful document chunks

Retrieving Results

After the job is created, use the GET /parse/{job_id} endpoint to check status and retrieve results:
cURL
curl -X 'GET' \
  'https://prod.visionapi.unsiloed.ai/parse/{job_id}' \
  -H 'accept: application/json' \
  -H 'api-key: your-api-key'
Python
import requests
import time

def get_parse_results(job_id, api_key, poll_interval=5, timeout=600):
    """Monitor job and retrieve results when complete."""

    headers = {"api-key": api_key}
    status_url = f"https://prod.visionapi.unsiloed.ai/parse/{job_id}"

    # Poll for completion, bounded so a stuck job cannot loop forever
    deadline = time.time() + timeout
    while time.time() < deadline:
        response = requests.get(status_url, headers=headers)
        response.raise_for_status()  # surface auth/quota errors instead of silently retrying

        status_data = response.json()
        print(f"Job Status: {status_data['status']}")

        if status_data['status'] == 'Succeeded':
            return status_data  # Results are included in the same response

        if status_data['status'] == 'Failed':
            raise Exception(f"Job failed: {status_data.get('message', 'Unknown error')}")

        time.sleep(poll_interval)  # Check every few seconds

    raise TimeoutError(f"Job {job_id} did not finish within {timeout} seconds")

# Usage
job_id = "e77a5c42-4dc1-44d0-a30e-ed191e8a8908"
results = get_parse_results(job_id, "your-api-key")

Expected Results Structure

When the job completes successfully, the response contains comprehensive document analysis with enhanced processing:
{
  "job_id": "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9",
  "status": "Succeeded",
  "created_at": "2025-10-22T06:51:16.870302Z",
  "started_at": "2025-10-22T06:51:16.966136Z",
  "finished_at": "2025-10-22T06:57:19.821541Z",
  "total_chunks": 25,
  "chunks": [
    {
      "segments": [
        {
          "segment_type": "Title",
          "content": "Disinvestment of IFCI's entire stake in Assets Care & Reconstruction Enterprise Ltd (ACRE)",
          "image": null,
          "page_number": 1,
          "segment_id": "cc5f8dff-31be-4ccf-885d-4f9062fcee17",
          "confidence": 0.90187776,
          "page_width": 1191.0,
          "page_height": 1684.0,
          "html": "<h1>Disinvestment of IFCI's entire stake in Assets Care & Reconstruction Enterprise Ltd (ACRE)</h1>",
          "markdown": "# Disinvestment of IFCI's entire stake in Assets Care & Reconstruction Enterprise Ltd (ACRE)",
          "bbox": {
            "left": 72.92226,
            "top": 62.030334,
            "width": 230.36308,
            "height": 55.395317
          },
          "ocr": [
            {
              "bbox": {
                "left": 63.753525,
                "top": 5.395447,
                "width": 164.45312,
                "height": 42.757812
              },
              "text": "Disinvestment",
              "confidence": 0.9999992
            }
          ]
        },
        {
          "segment_type": "Text",
          "content": "Background and context information about the disinvestment process...",
          "image": null,
          "page_number": 1,
          "segment_id": "9d60e48b-77ba-4a23-a0ac-95ee13c615ec",
          "confidence": 0.88558982,
          "page_width": 1191.0,
          "page_height": 1684.0,
          "html": "<p>Background and context information about the disinvestment process...</p>",
          "markdown": "Background and context information about the disinvestment process...",
          "bbox": {
            "left": 486.9685,
            "top": 139.61847,
            "width": 241.29932,
            "height": 48.451706
          },
          "ocr": [
            {
              "bbox": {
                "left": 50.9729,
                "top": 3.4557495,
                "width": 46.046875,
                "height": 19.734375
              },
              "text": "Background",
              "confidence": 0.99999654
            }
          ]
        }
      ]
    }
  ]
}
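Given the structure above, a small walker that flattens chunks into (segment_type, content) pairs is often all a downstream pipeline needs. A sketch assuming only the documented chunks/segments/segment_type/content fields:

```python
def iter_segments(result):
    """Yield (segment_type, content) for every segment in the job result."""
    for chunk in result.get("chunks", []):
        for seg in chunk.get("segments", []):
            yield seg["segment_type"], seg["content"]

# Miniature result mirroring the documented response shape:
result = {
    "status": "Succeeded",
    "chunks": [
        {"segments": [
            {"segment_type": "Title", "content": "Disinvestment of IFCI..."},
            {"segment_type": "Text", "content": "Background and context..."},
        ]}
    ],
}
for seg_type, content in iter_segments(result):
    print(f"{seg_type}: {content[:40]}")
```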

Segment Types

The parsing API identifies and processes different types of document segments with enhanced processing:

Picture

Images and graphics within the document, including logos, charts, and illustrations. Enhanced with LLM-based description generation.

SectionHeader

Document headers and titles that define section boundaries. Processed with semantic understanding.

Text

Regular text content including paragraphs, sentences, and individual text elements. Enhanced with context-aware processing.

Table

Tabular data with structured rows and columns. Enhanced with LLM-based formatting and extended context options. You can configure the table processing model using model_id in the segment_analysis parameter:
  • us_table_v1: Standard table processing model
  • us_table_v2: Enhanced table processing model with improved accuracy

Caption

Text captions associated with images or figures. Processed with relationship awareness.

Formula

Mathematical equations and expressions. Enhanced with specialized formula processing.

Title

Document titles and main headings. Processed with enhanced formatting.

Footnote

Document footnotes and references. Processed with context linking.

ListItem

Bulleted and numbered list items. Processed with structure preservation.

Each segment includes detailed metadata such as confidence scores, bounding boxes, OCR data, and formatted output in both HTML and Markdown with LLM enhancement.

Error Handling

Common Error Scenarios

  1. Invalid API Key: Authentication failed
  2. File Too Large: File exceeds size limits
  3. Invalid Configuration: Malformed processing parameters
  4. Server Error: Internal processing error
  5. Processing Timeout: Task took too long to complete
  6. Missing File or URL: Neither file nor url parameter provided
  7. Both File and URL Provided: Cannot provide both file and url simultaneously
  8. Invalid URL: URL is not accessible or malformed
  9. URL Download Failed: Unable to download document from provided URL
  10. Insufficient Quota (402): Not enough page credits remaining.
  11. Usage Limit Exceeded (429): Billing usage cap reached. Returns plain text: Usage limit exceeded. No Retry-After header.
  12. Rate Limit Exceeded (429): Org exceeded 60 requests / 60s sliding window. Returns plain text: Rate limit exceeded. A Retry-After header may be present depending on the infrastructure layer (Envoy/Istio), but is not set by the application.
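A small retry policy can encode the distinctions above: 402 is terminal until credits are topped up, the billing-cap 429 is terminal for the period, and only the sliding-window 429 is worth backing off and retrying. This is a sketch; the plain-text bodies it matches come from the list above, and treating 5xx as retryable is an assumption:

```python
def retry_delay(status_code, body, attempt, base=2.0, cap=60.0):
    """Return seconds to wait before retrying, or None if not retryable.

    402 (insufficient quota) and the billing-cap 429 are terminal;
    the sliding-window 429 ("Rate limit exceeded") gets exponential backoff.
    """
    if status_code == 402:
        return None                              # needs more page credits
    if status_code == 429:
        if "Usage limit exceeded" in body:
            return None                          # billing cap: retrying won't help
        return min(cap, base * (2 ** attempt))   # rate limit: back off and retry
    if 500 <= status_code < 600:
        return min(cap, base * (2 ** attempt))   # assume transient server error
    return None

assert retry_delay(402, "", 0) is None
assert retry_delay(429, "Usage limit exceeded", 0) is None
assert retry_delay(429, "Rate limit exceeded", 2) == 8.0
```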

Authorizations

Authorization
string
header
required

API key for authentication. Use 'Bearer <your_api_key>'

Body

multipart/form-data

Multipart form data. Provide either file (binary upload) or url (presigned/public URL), not both.

Request body for POST /parse (multipart/form-data).

file
file

Document file to process. Required if url is not provided. Supported formats: PDF, PNG, JPEG, TIFF, PPT, PPTX, DOC, DOCX, XLS, XLSX.

chunk_processing
string | null

JSON object for chunk processing configuration.

detect_checkboxes
boolean | null
default:false

Detect checkboxes in the document. Defaults to false.

error_handling
string | null
default:Continue

Error handling strategy for non-critical processing errors. Continue (default) — proceed despite errors (e.g., LLM refusals). Fail — stop and fail the task on any error.

expires_in
integer<int32> | null

Seconds until the task and its output are deleted. Defaults to the plan expiration time.

extract_charts
boolean | null
default:false

Extract structured data from charts and graphs. Defaults to false.

extract_colors
boolean | null
default:false

Transfer text color from the PDF text layer to OCR results. Defaults to false.

Attach hyperlink URLs from PDF annotations to OCR results. Defaults to false.

keep_segment_types
string | null
default:all

Filter output to include only the specified segment types (comma-separated). Example: "table,picture". Use "all" to include everything. Defaults to "all".

llm_processing
string | null

JSON object for LLM processing configuration.

merge_tables
boolean | null
default:false

Merge tables that span multiple pages into a single unified structure. Defaults to false.

ocr_engine
string | null
default:UnsiloedBeta

OCR engine to use for text recognition. UnsiloedBeta (default) — handles rotated/warped text and irregular bounding boxes. UnsiloedHawk — higher accuracy, complex layouts. UnsiloedStorm — enterprise-grade accuracy, optimized for 50+ languages.

ocr_mode
string | null
default:auto_ocr

OCR processing mode. auto_ocr (default) — applies OCR only where needed. force_ocr — applies OCR to all content regardless of existing text layer.

output_fields
string | null

JSON object controlling which output fields are included in the response. Example: {"html": false, "markdown": true, "ocr": false}. All fields default to true.

page_range
string | null

Page range to process. Formats: "1-5", "2,4,6", "[1,3,5]". Defaults to all pages.

segment_analysis
string | null

JSON object controlling HTML/Markdown generation strategy and AI model per segment type. Example: {"Table": {"html": "LLM", "markdown": "LLM", "model_id": "us_table_v2"}}.

segment_processing
string | null

Alias for segment_analysis (Core Parser name). If both are provided, this takes precedence.

segment_type_naming
string | null
default:Unsiloed

Segment type naming convention. Unsiloed (default) — e.g., PageHeader, ListItem, Picture. Other — alternative names e.g., Header, List Item, Figure.

segmentation_method
string | null
default:smart_layout_detection

Document segmentation strategy. smart_layout_detection (default) — detects layout elements (tables, pictures, formulas, etc.) using bounding boxes. page_by_page — treats each page as a single segment; faster for simple documents.

url
string | null

Presigned or public URL of the document to fetch and process. Required if file is not provided.

use_high_resolution
boolean | null
default:true

Use high-resolution images for cropping and post-processing. Latency penalty: ~2–3 s per page. Defaults to true.

validate_segments
string | null

JSON array string of segment types to validate with VLM. Example: ["Table", "Formula", "Picture"]. Defaults to [].

validate_table_segments
boolean | null
default:false

Legacy: validate table segment classifications using VLM. Prefer validate_segments: ["Table"] instead. Defaults to false.

xml_citation
boolean | null
default:false

Extract and hyperlink bibliography citations in the markdown output. PDFs only. Defaults to false.

Response

Job created — poll with GET /parse/{job_id} to retrieve results.

Response body for a successful POST /parse call.

created_at
string
required

ISO 8601 timestamp when the job was created.

credit_used
integer<int32>
required

Number of pages deducted from your quota for this job.

file_name
string
required

Name of the uploaded file or "unknown" when a URL was provided.

job_id
string
required

Job identifier — pass this to GET /parse/{job_id} to poll for results.

merge_tables
boolean
required

Whether table merging is enabled for this job (reflects the submitted merge_tables value).

message
string
required

Human-readable status message with a polling hint.

quota_remaining
integer<int64>
required

Remaining page quota after this job was deducted.

status
string
required

Initial job status. Always "Starting" on creation.