Overview
The Parse Document endpoint processes PDFs, images (PNG, JPEG, TIFF), and office documents (PPT, DOCX, XLSX) and breaks them into meaningful sections with detailed analysis, including text extraction, image recognition, table parsing, and OCR data. You can provide documents either by direct file upload or by presigned URL. The endpoint supports customization options for tuning the parsing behavior to your use case.
This endpoint returns a job ID for asynchronous processing. Use the GET PARSE JOB STATUS endpoint to check status and retrieve results when processing is complete.
Request
You must provide either the file or the url parameter. Both cannot be provided simultaneously.
Document file to process. Supported formats: PDF, images (PNG, JPEG, TIFF), and office documents (PPT, PPTX, DOC, DOCX, XLS, XLSX). Required if url is not provided.
Presigned URL of the document to process. The URL must be publicly accessible or a valid presigned URL from cloud storage (S3, GCS, Azure Blob, etc.). Supported formats: PDF, images (PNG, JPEG, TIFF), and office documents (PPT, PPTX, DOC, DOCX, XLS, XLSX). Required if file is not provided.
Whether to use high-resolution images for cropping and post-processing (latency penalty: ~2-3 seconds per page). Default: false
Document segmentation strategy:
- "smart_layout_detection" (default): Analyzes pages for layout elements using bounding boxes
- "page_by_page": Treats each page as a single segment
OCR processing strategy:
- "auto_ocr" (default): Automatically determines when to use OCR
- "full_ocr": Processes all text elements with OCR
OCR engine selection:
- "UnsiloedHawk" (default): Higher accuracy, better for complex layouts
- "UnsiloedStorm": Fast processing, optimized for general documents
Whether to merge tables that span across multiple pages into a single unified table structure. When enabled, consecutive table segments with matching headers will be consolidated. Default: false
Filter output to include only specific segment types. Accepts a comma-separated list of segment types or “all” to include everything. Examples: "table", "picture", "table,picture", "table,formula". Default: “all”
Available segment types:
- table: Tabular data segments
- picture: Image and graphic segments
- formula: Mathematical equations
- text: Regular text content
- sectionheader: Section headers
- title: Document titles
- listitem: List items
- caption: Image captions
- footnote: Footnotes
- pageheader: Page headers
- pagefooter: Page footers
JSON configuration object to control which fields are included in the response. By default, all fields are included. Set fields to false to exclude them and reduce response size. Example: {"html": false, "markdown": true, "ocr": false}. Available fields: html, markdown, ocr, image, llm, content, bbox, confidence, embed.
JSON configuration object to customize how different segment types are processed. Allows you to control HTML/Markdown generation strategies, specify which field should populate the content field for each segment type, and configure the AI model for table processing. Example: {"Table": {"html": "LLM", "markdown": "LLM", "content_source": "HTML", "model_id": "us_table_v2"}}.
Parameter Details
File Input Options
The API supports two methods for providing the document to process:
- Direct File Upload (file parameter): Upload the document file directly as multipart/form-data
- Presigned URL (url parameter): Provide a publicly accessible URL or presigned URL to the document
- You must provide either file or url, but not both
- When using url, the document will be downloaded from the provided URL before processing
- Presigned URLs are ideal for documents already stored in cloud storage (S3, GCS, Azure Blob, etc.)
- The URL must be publicly accessible or include necessary authentication parameters (e.g., S3 presigned URLs with signatures)
- Supported formats are the same for both methods: PDF, images (PNG, JPEG, TIFF), and office documents (PPT, PPTX, DOC, DOCX, XLS, XLSX)
Typical uses for presigned URLs:
- Documents already stored in cloud storage
- Avoiding duplicate file uploads
- Integration with existing document management systems
- Processing large files without upload overhead
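The either-or rule for file and url can be checked on the client before any request is sent. The following is a minimal Python sketch under stated assumptions: the helper name choose_document_input is hypothetical and not part of any official client, and only the file and url parameter names come from this page.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def choose_document_input(file_path: Optional[str] = None,
                          url: Optional[str] = None) -> dict:
    """Return the single input field to send: {'file': Path} or {'url': str}.

    Hypothetical helper: it mirrors the documented rule that exactly one of
    `file` or `url` must be provided.
    """
    if file_path and url:
        raise ValueError("Provide either file or url, not both")
    if not file_path and not url:
        raise ValueError("Provide either file or url")
    if url:
        parsed = urlparse(url)
        # A presigned URL keeps its signature in the query string, so the
        # URL is passed through unchanged after a basic shape check.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Malformed URL: {url!r}")
        return {"url": url}
    return {"file": Path(file_path)}
```

Failing early on the client avoids a round trip that would otherwise end in one of the file/url errors listed under Error Handling.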
Segmentation Method
The segmentation_method parameter controls how the document is analyzed and segmented:
- "smart_layout_detection" (default): Analyzes pages for layout elements (e.g., Table, Picture, Formula, etc.) using bounding boxes. Provides fine-grained segmentation and better chunking for complex documents.
- "page_by_page": Treats each page as a single segment. Faster processing, ideal for simple documents without complex layouts.
OCR Mode
The ocr_mode parameter controls optical character recognition processing:
- "auto_ocr" (default): Intelligently determines when OCR is needed based on the document content. Balances accuracy and performance.
- "full_ocr": Applies OCR to all text elements in the document. Use this for scanned documents or when maximum text extraction is required.
OCR Engine Selection
Select the OCR engine for text recognition:
- “UnsiloedStorm” (default): Fast processing, optimized for general documents
- “UnsiloedHawk”: Higher accuracy, better for complex layouts and multilingual content
UnsiloedStorm:
- Fast but less accurate
- Faster processing time
- Good for standard documents
- Optimized for English text
- Lower resource usage
UnsiloedHawk:
- Slow but more accurate
- Higher accuracy
- Better handling of complex layouts
- Superior multilingual support
- Longer processing time
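The option values for segmentation_method, ocr_mode, and the OCR engine are fixed strings, so a typo can be caught before the request is sent. The sketch below is an illustration only: the validate_options helper is hypothetical, and the parameter name ocr_engine is an assumption, since this page lists the engine values but not the name of the parameter that carries them.

```python
ALLOWED_VALUES = {
    "segmentation_method": {"smart_layout_detection", "page_by_page"},
    "ocr_mode": {"auto_ocr", "full_ocr"},
    # Assumption: the engine parameter is called "ocr_engine"; this page
    # lists the engine values but does not show the parameter name.
    "ocr_engine": {"UnsiloedHawk", "UnsiloedStorm"},
}


def validate_options(options: dict) -> dict:
    """Reject values for known options that this page does not list."""
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(
                f"{name}={value!r} is not one of {sorted(allowed)}"
            )
    return options
```

Options that are not in the table are passed through untouched, so the check does not block parameters it knows nothing about.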
Table Merging
The merge_tables parameter enables intelligent merging of tables that span across multiple pages:
How It Works:
- Analyzes consecutive table segments across pages
- Identifies tables with matching column headers
- Merges them into a single unified table structure
- Preserves table formatting and data integrity
Use Cases:
- Multi-Page Financial Statements: Consolidate P&L statements or balance sheets spanning multiple pages
- Large Data Tables: Merge inventory lists, transaction records, or data sets split across pages
- Reports with Continuation Tables: Automatically combine tables marked with “continued on next page”
Benefits:
- Simplified Data Processing: Work with complete tables instead of fragments
- Better Context: Maintain full table context for analysis and extraction
- Reduced Post-Processing: Eliminates need for manual table stitching
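The merging steps listed under How It Works can be illustrated with a short Python sketch. It is not the service's implementation: the table dictionaries (page, headers, rows) are a hypothetical shape chosen for the example, and "matching" is simplified to exact header equality on directly consecutive pages.

```python
from typing import Dict, List


def merge_consecutive_tables(tables: List[Dict]) -> List[Dict]:
    """Merge neighbouring tables whose column headers match.

    Each table is a hypothetical dict: {"page": int, "headers": [...],
    "rows": [[...], ...]}. Tables are assumed to be in document order.
    """
    merged: List[Dict] = []
    for table in tables:
        previous = merged[-1] if merged else None
        if (
            previous is not None
            and previous["headers"] == table["headers"]
            and table["page"] == previous["last_page"] + 1
        ):
            # Same headers on the next page: treat it as a continuation.
            previous["rows"].extend(table["rows"])
            previous["last_page"] = table["page"]
        else:
            merged.append({
                "headers": list(table["headers"]),
                "rows": [list(row) for row in table["rows"]],
                "first_page": table["page"],
                "last_page": table["page"],
            })
    return merged
```

Copying headers and rows into the merged entry keeps the input list unchanged, which makes the function safe to call on data that is also used elsewhere.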
Content Type Filtering
The keep_segment_types parameter allows you to filter the output to include only specific segment types, reducing response size and focusing on relevant content:
How It Works:
- Accepts a comma-separated list of segment types (case-insensitive)
- Filters segments after processing is complete
- Removes chunks that have no segments after filtering
Examples:
- "all" (default): Include all segment types
- "table": Only table segments
- "picture": Only image/graphic segments
- "table,picture": Tables and pictures only
- "table,formula": Tables and formulas only
- Custom combinations using any segment type
Available segment types: table, picture, formula, text, sectionheader, title, listitem, caption, footnote, pageheader, pagefooter
Use Cases:
- Tables Only: Extract only tabular data from financial documents
- Pictures Only: Extract charts, graphs, and diagrams for visual analysis
- Tables + Pictures: Get structured data and visualizations, skip text content
- Custom Combinations: Mix any segment types based on your needs
Benefits:
- Reduced Response Size: Filter out unwanted content before receiving results
- Faster Processing: Less data to transfer and parse
- Focused Extraction: Get only the content types you need
- Cost Optimization: Smaller responses reduce bandwidth usage
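The filtering behaviour described under How It Works (case-insensitive matching, filtering after processing, dropping chunks left empty) can be mimicked in a few lines of Python. This is a sketch of the described behaviour, not the server code; the chunk shape and the segment_type key are assumptions made for the example.

```python
from typing import Dict, List


def filter_chunks(chunks: List[Dict], keep_segment_types: str = "all") -> List[Dict]:
    """Keep only the requested segment types and drop chunks left empty.

    Assumed shapes (hypothetical): each chunk is {"segments": [...]} and
    each segment carries its type under "segment_type".
    """
    wanted = {t.strip().lower() for t in keep_segment_types.split(",") if t.strip()}
    if not wanted or "all" in wanted:
        return chunks
    filtered = []
    for chunk in chunks:
        kept = [
            segment for segment in chunk["segments"]
            if segment["segment_type"].lower() in wanted
        ]
        if kept:  # chunks with no remaining segments are removed
            filtered.append({**chunk, "segments": kept})
    return filtered
```

The same function can be applied to an unfiltered response on the client when different consumers need different segment types from one parse job.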
Output Fields Configuration
The output_fields parameter allows you to control which fields are included in the API response. This is useful for reducing response size, improving performance, and optimizing bandwidth usage when you don’t need all available data.
Available Fields:
- html (default: true): Include HTML representation of segments
- markdown (default: true): Include Markdown representation of segments
- ocr (default: true): Include OCR results with bounding boxes and confidence scores
- image (default: true): Include cropped segment images (base64 encoded)
- llm (default: true): Include LLM-generated content and descriptions
- content (default: true): Include text content of segments
- bbox (default: true): Include bounding box coordinates
- confidence (default: true): Include confidence scores for segments
- embed (default: true): Include embed text in chunk responses
Set fields to false to exclude them from the response. Fields not specified default to true for backward compatibility.
Example Configuration:
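A configuration can also be built programmatically. The Python sketch below assumes the value is sent as a JSON string in the multipart form; the build_output_fields helper is hypothetical, and the field names are the ones listed under Available Fields.

```python
import json

# Field names as listed under Available Fields; all default to true.
ALL_FIELDS = ("html", "markdown", "ocr", "image", "llm",
              "content", "bbox", "confidence", "embed")


def build_output_fields(exclude=()) -> str:
    """Serialize an output_fields value that switches off the given fields."""
    unknown = set(exclude) - set(ALL_FIELDS)
    if unknown:
        raise ValueError(f"Unknown output fields: {sorted(unknown)}")
    # Only excluded fields are listed; unspecified fields default to true.
    return json.dumps({name: False for name in ALL_FIELDS if name in exclude})


print(build_output_fields(exclude=("image", "html")))
# prints {"html": false, "image": false}
```

Listing only the excluded fields keeps the payload short and relies on the documented default of true for everything else.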
Benefits:
- Reduced Response Size: Excluding large fields like image and html can significantly reduce payload size
- Faster Processing: Less data to serialize and transfer
- Cost Optimization: Smaller responses reduce bandwidth costs
- Selective Data: Only retrieve the fields you need for your use case
Use Cases:
- Minimal Response: Set most fields to false when you only need basic content
- Text-Only Processing: Exclude image, ocr, and llm when processing text content
- Embedding Generation: Include only content and embed when generating embeddings
- Full Analysis: Keep all fields enabled (default) for comprehensive document analysis
Segment Analysis Configuration
The segment_analysis parameter allows you to customize how different segment types are processed, including HTML/Markdown generation strategies and which field should populate the content field.
Available Segment Types:
You can configure processing for any of the following segment types:
- Table: Tabular data segments
- Picture: Image and graphic segments
- Formula: Mathematical equations
- Title: Document titles
- SectionHeader: Section headers
- Text: Regular text content
- ListItem: List items
- Caption: Image captions
- Footnote: Footnotes
- PageHeader: Page headers
- PageFooter: Page footers
- Page: Full page segments
html: Generation strategy for HTML representation
- "Auto" (default): Automatically determine the best method
- "LLM": Use LLM to generate HTML
markdown: Generation strategy for Markdown representation
- "Auto" (default): Automatically determine the best method
- "LLM": Use LLM to generate Markdown
content_source: Defines which field should populate the content field in the response
- "OCR" (default): Use OCR text for content
- "HTML": Use HTML representation as content
- "Markdown": Use Markdown representation as content
model_id (Table segments only): Specifies which AI model to use for table processing
- "us_table_v1": Standard table processing model
- "us_table_v2": Enhanced table processing model with improved accuracy
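The options above combine into one entry per segment type. The following Python sketch builds such an entry and rejects values this page does not list; the segment_config helper is hypothetical, and the restriction of model_id to Table segments follows the "Table segments only" note.

```python
import json

# Option values as listed on this page.
STRATEGIES = {"Auto", "LLM"}
CONTENT_SOURCES = {"OCR", "HTML", "Markdown"}
TABLE_MODELS = {"us_table_v1", "us_table_v2"}


def segment_config(segment_type, html="Auto", markdown="Auto",
                   content_source="OCR", model_id=None) -> dict:
    """Build the segment_analysis entry for one segment type."""
    if html not in STRATEGIES or markdown not in STRATEGIES:
        raise ValueError("html and markdown must be 'Auto' or 'LLM'")
    if content_source not in CONTENT_SOURCES:
        raise ValueError(f"Unsupported content_source: {content_source!r}")
    entry = {"html": html, "markdown": markdown,
             "content_source": content_source}
    if model_id is not None:
        # model_id is documented for Table segments only.
        if segment_type != "Table":
            raise ValueError("model_id applies to Table segments only")
        if model_id not in TABLE_MODELS:
            raise ValueError(f"Unknown table model: {model_id!r}")
        entry["model_id"] = model_id
    return {segment_type: entry}


config = segment_config("Table", html="LLM", markdown="LLM",
                        content_source="HTML", model_id="us_table_v2")
print(json.dumps(config))
```

The printed value matches the example given in the parameter description: {"Table": {"html": "LLM", "markdown": "LLM", "content_source": "HTML", "model_id": "us_table_v2"}}.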
How content_source Works:
The content_source parameter determines which field’s value will be used to populate the content field in the segment response:
- When content_source is set to "HTML", the content field will contain the HTML representation, and the separate html and markdown fields will be empty
- When content_source is set to "Markdown", the content field will contain the Markdown representation, and the separate html and markdown fields will be empty
- When content_source is set to "OCR" (default), the content field contains OCR text, and html and markdown fields are populated separately
- When content_source is set to "LLM", the content field contains LLM-generated content
- HTML as Content: Set content_source: "HTML" for Table segments when you want HTML-formatted table data directly in the content field
- Markdown as Content: Set content_source: "Markdown" for Picture segments when you want Markdown-formatted descriptions in the content field
- LLM-Enhanced Content: Use "LLM" for both html/markdown generation strategies and set content_source: "LLM" to get AI-enhanced content in the content field
Response
Unique identifier for the parsing job
Initial job status (typically “Starting”)
Name of the uploaded file
Timestamp when the job was created
Status message about the job creation
Remaining page quota for the API key
Whether table merging is enabled for this job
Segment types filter applied to this job (stored in metadata)
Document Analysis Features
The parsing endpoint provides comprehensive document analysis including:
Text Extraction
Extracts text content with high accuracy, preserving formatting and structure.
Image Recognition
Identifies and analyzes images within documents, providing descriptions and metadata.
Table Parsing
Extracts tabular data with proper structure and formatting.
OCR Processing
Performs optical character recognition on text elements with confidence scores.
Section Detection
Automatically identifies different document sections like headers, body text, and captions.
Bounding Box Information
Provides precise coordinates for all extracted elements.
Advanced Content Processing
- LLM-Enhanced Analysis: Uses language models for better content understanding
- Multi-Format Output: Generates HTML, Markdown, and plain text versions
- Context-Aware Processing: Maintains document context across segments
- Intelligent Chunking: Creates semantically meaningful document chunks
Retrieving Results
After the job is created, use the GET /parse/ endpoint to check status and retrieve results.
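Because processing is asynchronous, a client typically polls the status endpoint until the job finishes. The sketch below shows one way to structure that loop in Python without tying it to a specific HTTP library: fetch_status is a caller-supplied function that performs the status request and returns the decoded JSON. The status key and the terminal state names "Succeeded" and "Failed" are assumptions; this page only documents the initial "Starting" status.

```python
import time
from typing import Callable, Dict


def wait_for_job(fetch_status: Callable[[], Dict],
                 interval_s: float = 2.0,
                 timeout_s: float = 300.0,
                 done_states=("Succeeded", "Failed")) -> Dict:
    """Poll until the job reaches a terminal state or the timeout expires.

    `fetch_status` performs the status request and returns the decoded
    JSON body. The "status" key and the terminal state names are
    assumptions, not values confirmed by this page.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        if job.get("status") in done_states:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("Parse job did not finish in time")
        time.sleep(interval_s)
```

Passing the request in as a function keeps the loop easy to test with canned responses and lets the caller decide how authentication and the job ID are supplied.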
Expected Results Structure
When the job completes successfully, the response contains comprehensive document analysis with enhanced processing.
Segment Types
The parsing API identifies and processes different types of document segments with enhanced processing:
Picture
Images and graphics within the document, including logos, charts, and illustrations. Enhanced with LLM-based description generation.
SectionHeader
Document headers and titles that define section boundaries. Processed with semantic understanding.
Text
Regular text content including paragraphs, sentences, and individual text elements. Enhanced with context-aware processing.
Table
Tabular data with structured rows and columns. Enhanced with LLM-based formatting and extended context options. You can configure the table processing model using model_id in the segment_analysis parameter:
- us_table_v1: Standard table processing model
- us_table_v2: Enhanced table processing model with improved accuracy
Caption
Text captions associated with images or figures. Processed with relationship awareness.
Formula
Mathematical equations and expressions. Enhanced with specialized formula processing.
Title
Document titles and main headings. Processed with enhanced formatting.
Footnote
Document footnotes and references. Processed with context linking.
ListItem
Bulleted and numbered list items. Processed with structure preservation.
Each segment includes detailed metadata such as confidence scores, bounding boxes, OCR data, and formatted output in both HTML and Markdown with LLM enhancement.
Configuration Best Practices
For High-Accuracy Processing
Use this configuration when accuracy is critical:
For Fast Processing
Use this configuration when speed is prioritized:
For Financial Documents (Tables + Charts)
Extract only tables and charts from financial reports:
For Data Extraction Only (Tables)
Extract only tabular data with minimal response size:
OCR Engine Selection Guide
Choose the appropriate OCR engine based on your document characteristics:
Use UnsiloedStorm when:
- Processing standard business documents
- Speed is prioritized over accuracy
- Working with primarily English text
- Processing large volumes of documents
- Resource efficiency is important
- You can tolerate some noise in the output for faster processing
Use UnsiloedHawk when:
- High accuracy is critical
- Working with multilingual documents
- Processing complex layouts
- Quality over speed is preferred
- You need clean, accurate text extraction
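The four scenarios under Configuration Best Practices can be captured as reusable presets. The Python sketch below is illustrative only: the preset contents are assembled from the option descriptions on this page rather than copied from an official configuration, the parameter name ocr_engine is an assumption, and form values are written as strings on the assumption that they are sent as multipart form fields.

```python
import json

# Illustrative presets assembled from the option descriptions on this page.
# Assumption: the engine parameter is named "ocr_engine".
PRESETS = {
    "high_accuracy": {
        "segmentation_method": "smart_layout_detection",
        "ocr_mode": "full_ocr",
        "ocr_engine": "UnsiloedHawk",
    },
    "fast": {
        "segmentation_method": "page_by_page",
        "ocr_mode": "auto_ocr",
        "ocr_engine": "UnsiloedStorm",
    },
    "financial": {
        "merge_tables": "true",
        "keep_segment_types": "table,picture",
    },
    "tables_only": {
        "keep_segment_types": "table",
        "output_fields": json.dumps(
            {"image": False, "llm": False, "ocr": False}
        ),
    },
}


def form_fields(preset: str, **overrides: str) -> dict:
    """Return a copy of a preset with caller overrides applied last."""
    fields = dict(PRESETS[preset])
    fields.update(overrides)
    return fields
```

For example, form_fields("financial", ocr_engine="UnsiloedHawk") starts from the table-and-chart preset and adds the higher-accuracy engine without modifying the shared preset.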
Output Fields Optimization (Optional)
Optimize response size and performance by selectively including only the fields you need. For minimal response size, set the fields you do not need to false. For full analysis, omit output_fields or set all fields to true to include all available data.
Error Handling
Common Error Scenarios
- Invalid API Key: Authentication failed
- File Too Large: File exceeds size limits
- Invalid Configuration: Malformed processing parameters
- Server Error: Internal processing error
- Processing Timeout: Task took too long to complete
- Missing File or URL: Neither file nor url parameter provided
- Both File and URL Provided: Cannot provide both file and url simultaneously
- Invalid URL: URL is not accessible or malformed
- URL Download Failed: Unable to download document from provided URL
Authorizations
Body
multipart/form-data
Supported file types: PDFs, Images (PNG, JPEG, TIFF, BMP) and Office Documents (DOCX, XLSX, PPTX)
Whether to use high-resolution images for cropping and post-processing (default: false)
Document segmentation strategy
Available options: smart_layout_detection, page_by_page
OCR processing strategy
Available options: auto_ocr, full_ocr
OCR engine selection: 'UnsiloedHawk' (higher accuracy) or 'UnsiloedStorm' (faster processing)
Available options: UnsiloedHawk, UnsiloedStorm
Response
200 - application/json
Successful response
Unique identifier for the parsing job
Initial job status (typically 'Starting')
Name of the uploaded file
Timestamp when the job was created
Status message about the job creation
Remaining page quota for the API key
Whether table merging is enabled
