Overview
The Parse Document endpoint processes PDF documents and breaks them into meaningful sections with detailed analysis including text extraction, image recognition, table parsing, and OCR data. This endpoint supports advanced customization options for fine-tuning the parsing behavior to match your specific use cases.
This endpoint returns a task ID for asynchronous processing. Use the GET /parse/ endpoint to check status and retrieve results when processing is complete.
Request
PDF file to process.
Whether to use high-resolution images for cropping and post-processing. (Latency penalty: ~7 seconds per page)
chunk_processing: Advanced chunking configuration options
segment_processing: Segment-specific processing configuration for different document elements
error_handling: Error handling strategy, “Continue” or “Fail” (default: “Continue”)
segmentation_strategy: Document segmentation strategy, “LayoutAnalysis” or “Page”
llm_processing: LLM processing configuration for enhanced content analysis
ocr_strategy: OCR processing strategy, “All” or “Auto” (default: “All”)
ocr_engine: OCR engine selection, “UnsiloedStorm” or “UnsiloedHawk” (default: “UnsiloedStorm”)
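As a rough illustration of how these options might be packed into a multipart/form-data body, the sketch below flattens a dict of options into string form fields. The option names come from this page; sending nested configuration objects as JSON strings is an assumption, and the name of the file field is not covered here, so it is left out.

```python
import json

# Option names are taken from this page.
NESTED_OPTIONS = ("chunk_processing", "segment_processing", "llm_processing")
STRING_OPTIONS = ("error_handling", "segmentation_strategy", "ocr_strategy", "ocr_engine")


def build_form_fields(options):
    """Turn a dict of parse options into flat form fields (name -> str)."""
    fields = {}
    for name, value in options.items():
        if name in NESTED_OPTIONS:
            # Assumption: nested configuration objects are sent as JSON strings.
            fields[name] = json.dumps(value, sort_keys=True)
        elif name in STRING_OPTIONS:
            fields[name] = str(value)
        else:
            raise ValueError(f"unknown option: {name}")
    return fields


fields = build_form_fields({
    "segmentation_strategy": "LayoutAnalysis",
    "ocr_engine": "UnsiloedHawk",
    "chunk_processing": {"target_length": 512},
})
print(fields)
```

The resulting dict could be passed as the form data of an HTTP client alongside the uploaded PDF.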
Advanced Configuration Options
Chunk Processing Configuration
The chunk_processing parameter allows fine-tuning of how document content is chunked:
- ignore_headers_and_footers (boolean, optional): Whether to exclude headers and footers from chunking
- target_length (number, optional): Target chunk length in tokens (default: 512)
- tokenizer (object, optional): Tokenization strategy configuration with Enum values: “Word”, “Cl100kBase”, “XlmRobertaBase”, “BertBaseUncased”
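A minimal sketch of assembling a chunk_processing value with basic validation. The field names and tokenizer values are the ones listed above; the exact shape of the tokenizer object is not spelled out on this page, so wrapping the name under an "Enum" key is an assumption.

```python
TOKENIZERS = {"Word", "Cl100kBase", "XlmRobertaBase", "BertBaseUncased"}


def make_chunk_processing(target_length=512, ignore_headers_and_footers=None, tokenizer=None):
    """Build a chunk_processing dict, rejecting values not listed on this page."""
    if not isinstance(target_length, int) or target_length <= 0:
        raise ValueError("target_length must be a positive integer")
    config = {"target_length": target_length}
    if ignore_headers_and_footers is not None:
        config["ignore_headers_and_footers"] = bool(ignore_headers_and_footers)
    if tokenizer is not None:
        if tokenizer not in TOKENIZERS:
            raise ValueError(f"unsupported tokenizer: {tokenizer}")
        # Assumption: the tokenizer object wraps the name under an "Enum" key.
        config["tokenizer"] = {"Enum": tokenizer}
    return config


chunk_config = make_chunk_processing(
    target_length=256,
    ignore_headers_and_footers=True,
    tokenizer="Cl100kBase",
)
print(chunk_config)
```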
Segment Processing Configuration
The segment_processing parameter allows customization of how different document segments are processed. Each segment type can be configured individually:
- html (string, optional): HTML generation method: “LLM”, “Auto”
- markdown (string, optional): Markdown generation method: “LLM”, “Auto”
- extended_context (boolean, optional): Use the full page image as context for LLM generation
- crop_image: Controls whether to crop the file’s images to the segment’s bounding box. The cropped image will be stored in the segment’s image field. Use “All” to always crop, or “Auto” to crop only when needed for post-processing.
Available Segment Types
- Table: Tabular data processing
- Picture: Image and graphic processing
- Formula: Mathematical expression processing
- Text: Regular text content processing
- SectionHeader: Section heading processing
- Title: Document title processing
- Caption: Image/table caption processing
- Footnote: Footnote processing
- ListItem: List item processing
- Page: Page-level processing
- PageFooter: Footer processing
- PageHeader: Header processing
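To make the per-segment configuration concrete, here is a hedged sketch that validates a segment_processing mapping against the segment types and setting values listed above. Keying the mapping by segment type name is an assumption about the request shape, and make_segment_processing is a hypothetical helper, not part of the API.

```python
SEGMENT_TYPES = {
    "Table", "Picture", "Formula", "Text", "SectionHeader", "Title",
    "Caption", "Footnote", "ListItem", "Page", "PageFooter", "PageHeader",
}
GENERATION_METHODS = {"LLM", "Auto"}
CROP_MODES = {"All", "Auto"}


def make_segment_processing(per_type):
    """Validate a {segment type: settings} mapping and return a copy of it."""
    validated = {}
    for segment_type, settings in per_type.items():
        if segment_type not in SEGMENT_TYPES:
            raise ValueError(f"unknown segment type: {segment_type}")
        for key, value in settings.items():
            if key in ("html", "markdown"):
                if value not in GENERATION_METHODS:
                    raise ValueError(f"{key} must be 'LLM' or 'Auto'")
            elif key == "extended_context":
                if not isinstance(value, bool):
                    raise ValueError("extended_context must be a boolean")
            elif key == "crop_image":
                if value not in CROP_MODES:
                    raise ValueError("crop_image must be 'All' or 'Auto'")
            else:
                raise ValueError(f"unknown setting: {key}")
        validated[segment_type] = dict(settings)
    return validated


segment_config = make_segment_processing({
    "Table": {"html": "LLM", "markdown": "LLM", "extended_context": True},
    "Picture": {"crop_image": "All"},
})
print(segment_config)
```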
LLM Processing Configuration
The llm_processing parameter configures language model processing:
- fallback_strategy (string, optional): Fallback strategy when LLM processing fails: “None”
- max_completion_tokens (number, optional): Maximum tokens for LLM completion (default: 1000)
- model_id (string, optional): LLM model identifier to use for processing
- temperature (number, optional): Temperature setting for LLM processing (0.0 to 1.0)
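The following sketch builds an llm_processing value and enforces the documented temperature range. Only fields that are explicitly set are included; the model identifier in the example is a placeholder, not a real model name, and make_llm_processing is a hypothetical helper.

```python
def make_llm_processing(model_id=None, temperature=None,
                        max_completion_tokens=None, fallback_strategy=None):
    """Build an llm_processing dict from the fields described above."""
    config = {}
    if model_id is not None:
        config["model_id"] = model_id
    if temperature is not None:
        # Range taken from this page: 0.0 to 1.0 inclusive.
        if not 0.0 <= temperature <= 1.0:
            raise ValueError("temperature must be between 0.0 and 1.0")
        config["temperature"] = temperature
    if max_completion_tokens is not None:
        if max_completion_tokens <= 0:
            raise ValueError("max_completion_tokens must be positive")
        config["max_completion_tokens"] = max_completion_tokens
    if fallback_strategy is not None:
        config["fallback_strategy"] = fallback_strategy
    return config


llm_config = make_llm_processing(
    model_id="example-model",  # placeholder identifier, an assumption
    temperature=0.2,
    max_completion_tokens=1000,
    fallback_strategy="None",
)
print(llm_config)
```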
Error Handling Strategies
error_handling (string, optional): How to handle processing errors:
- “Continue”: Continue processing and report errors
- “Fail”: Stop processing on first error
Segmentation Strategy Options
segmentation_strategy (string, optional): Document segmentation approach:
- “LayoutAnalysis”: Analyzes pages for layout elements (e.g., Table, Picture, Formula, etc.) using bounding boxes. Provides fine-grained segmentation and better chunking.
- “Page”: Treats each page as a single segment. Faster processing, but without layout element detection and only simple chunking.
OCR Strategy Options
ocr_strategy (string, optional): OCR processing strategy:
- “All”: Process all text elements with OCR
- “Auto”: Automatically determine when to use OCR
OCR Engine Options
ocr_engine (string, optional): Select the OCR engine for text recognition:
- “UnsiloedStorm” (default): Fast processing, optimized for general documents
- “UnsiloedHawk”: Higher accuracy, better for complex layouts and multilingual content
UnsiloedStorm (fast but noisy):
- Faster processing time
- Good for standard documents
- Optimized for English text
- Lower resource usage
UnsiloedHawk (slow but accurate):
- Higher accuracy
- Better handling of complex layouts
- Superior multilingual support
- Longer processing time
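The string-valued options above (error_handling, segmentation_strategy, ocr_strategy, ocr_engine) each accept a small fixed set of values. A minimal sketch of client-side validation that also fills in the defaults stated on this page; no default is stated for segmentation_strategy, so none is applied. The helper name resolve_string_options is hypothetical.

```python
ALLOWED_VALUES = {
    "error_handling": ("Continue", "Fail"),
    "segmentation_strategy": ("LayoutAnalysis", "Page"),
    "ocr_strategy": ("All", "Auto"),
    "ocr_engine": ("UnsiloedStorm", "UnsiloedHawk"),
}

# Defaults as stated on this page.
DEFAULTS = {
    "error_handling": "Continue",
    "ocr_strategy": "All",
    "ocr_engine": "UnsiloedStorm",
}


def resolve_string_options(**overrides):
    """Merge caller overrides onto the documented defaults after validating them."""
    resolved = dict(DEFAULTS)
    for name, value in overrides.items():
        if name not in ALLOWED_VALUES:
            raise ValueError(f"unknown option: {name}")
        if value not in ALLOWED_VALUES[name]:
            raise ValueError(f"{name} must be one of {ALLOWED_VALUES[name]}")
        resolved[name] = value
    return resolved


options = resolve_string_options(segmentation_strategy="Page", ocr_engine="UnsiloedHawk")
print(options)
```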
Response
Unique identifier for the processing task
Initial task status (typically “Starting”)
Name of the uploaded file
Timestamp when the task was created
Status message about the task creation
Complete configuration used for processing
Timestamp when the task expires (if expiration is set)
URL for accessing task status and results
Document Analysis Features
The parsing endpoint provides comprehensive document analysis including:
- Text Extraction: Extracts text content with high accuracy, preserving formatting and structure.
- Image Recognition: Identifies and analyzes images within documents, providing descriptions and metadata.
- Table Parsing: Extracts tabular data with proper structure and formatting.
- OCR Processing: Performs optical character recognition on text elements with confidence scores.
- Section Detection: Automatically identifies different document sections like headers, body text, and captions.
- Bounding Box Information: Provides precise coordinates for all extracted elements.
Advanced Content Processing
- LLM-Enhanced Analysis: Uses language models for better content understanding
- Multi-Format Output: Generates HTML, Markdown, and plain text versions
- Context-Aware Processing: Maintains document context across segments
- Intelligent Chunking: Creates semantically meaningful document chunks
Retrieving Results
After the task is created, use the GET /parse/ endpoint to check status and retrieve results.
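A hedged sketch of the polling loop this implies. The HTTP call itself is left to a caller-supplied function (fetch_status, a hypothetical name) because the base URL and authentication header are not given here. "Starting" is the initial status named on this page; the terminal status names "Succeeded" and "Failed" and the "status" field name are assumptions.

```python
import time

# "Starting" appears on this page; the terminal status names are assumptions.
TERMINAL_STATUSES = {"Succeeded", "Failed"}


def wait_for_result(fetch_status, task_id, interval_seconds=2.0,
                    max_attempts=30, sleep=time.sleep):
    """Call fetch_status(task_id) until a terminal status appears or attempts run out."""
    for _ in range(max_attempts):
        task = fetch_status(task_id)
        if task.get("status") in TERMINAL_STATUSES:
            return task
        sleep(interval_seconds)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} checks")


# Stand-in for a real GET /parse/ request so the sketch runs offline.
_fake_responses = iter([
    {"status": "Starting"},
    {"status": "Starting"},
    {"status": "Succeeded"},
])
result = wait_for_result(
    lambda task_id: next(_fake_responses),
    "example-task-id",
    sleep=lambda seconds: None,
)
print(result["status"])
```

In real use, fetch_status would wrap an authenticated GET request and return the decoded JSON body.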
Expected Results Structure
When the task completes successfully, the response contains comprehensive document analysis with enhanced processing.
Segment Types
The parsing API identifies and processes different types of document segments with enhanced processing:
- Picture: Images and graphics within the document, including logos, charts, and illustrations. Enhanced with LLM-based description generation.
- SectionHeader: Document headers and titles that define section boundaries. Processed with semantic understanding.
- Text: Regular text content including paragraphs, sentences, and individual text elements. Enhanced with context-aware processing.
- Table: Tabular data with structured rows and columns. Enhanced with LLM-based formatting and extended context options.
- Caption: Text captions associated with images or figures. Processed with relationship awareness.
- Formula: Mathematical equations and expressions. Enhanced with specialized formula processing.
- Title: Document titles and main headings. Processed with enhanced formatting.
- Footnote: Document footnotes and references. Processed with context linking.
- ListItem: Bulleted and numbered list items. Processed with structure preservation.
Each segment includes detailed metadata such as confidence scores, bounding boxes, OCR data, and formatted output in both HTML and Markdown with LLM enhancement.
Configuration Best Practices
For High-Accuracy Processing
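One possible accuracy-oriented combination, assembled only from option names and values described on this page. Whether this is the combination the service itself recommends is an assumption; treat it as a starting point rather than a prescribed setup.

```python
# Every name and value below appears on this page; the combination is an assumption.
high_accuracy_options = {
    "segmentation_strategy": "LayoutAnalysis",  # fine-grained segmentation
    "ocr_strategy": "All",                      # run OCR on all text elements
    "ocr_engine": "UnsiloedHawk",               # higher accuracy, longer processing time
    "segment_processing": {
        "Table": {"html": "LLM", "markdown": "LLM", "extended_context": True},
    },
}
print(high_accuracy_options)
```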
OCR Engine Selection Guide
Choose the appropriate OCR engine based on your document characteristics.
Use UnsiloedStorm when:
- Processing standard business documents
- Speed is prioritized over accuracy
- Working with primarily English text
- Processing large volumes of documents
- Resource efficiency is important
- You can tolerate some noise in the output for faster processing
Use UnsiloedHawk when:
- High accuracy is critical
- Working with multilingual documents
- Quality over speed is preferred
- You need clean, accurate text extraction
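The selection guidance can be condensed into a small helper. This is a sketch of one reasonable reading of the criteria, not an official decision rule; the function and parameter names are hypothetical.

```python
def choose_ocr_engine(multilingual=False, complex_layout=False, accuracy_critical=False):
    """Pick an ocr_engine value from coarse document characteristics."""
    if multilingual or complex_layout or accuracy_critical:
        # Hawk is described as more accurate and better with complex or multilingual content.
        return "UnsiloedHawk"
    # Storm is the documented default and favors speed and lower resource usage.
    return "UnsiloedStorm"


print(choose_ocr_engine())
print(choose_ocr_engine(multilingual=True))
```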
Error Handling
Common Error Scenarios
- Invalid API Key: Authentication failed
- File Too Large: File exceeds size limits
- Invalid Configuration: Malformed processing parameters
- Server Error: Internal processing error
- Processing Timeout: Task took too long to complete
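A client typically treats these scenarios differently: some are worth retrying, others require fixing the request first. The sketch below encodes one such split. The scenario names come from the list above, but the retry classification is an assumption, as is the helper name should_retry; how a scenario is reported in the actual response is not covered here.

```python
# Scenario names come from the list above; the classification is an assumption.
RETRYABLE_SCENARIOS = {"Server Error", "Processing Timeout"}
NON_RETRYABLE_SCENARIOS = {"Invalid API Key", "File Too Large", "Invalid Configuration"}


def should_retry(scenario):
    """Return True when resubmitting the same request might plausibly succeed."""
    if scenario in RETRYABLE_SCENARIOS:
        return True
    if scenario in NON_RETRYABLE_SCENARIOS:
        return False
    raise ValueError(f"unknown error scenario: {scenario}")


print(should_retry("Server Error"))
print(should_retry("Invalid API Key"))
```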
Authorizations
Body
multipart/form-data
PDF file to process
Whether to use high-resolution images for cropping and post-processing
Document segmentation strategy: 'LayoutAnalysis' or 'Page'
OCR engine selection: 'UnsiloedStorm', 'UnsiloedHawk' (default: 'UnsiloedStorm')
Available options: UnsiloedStorm, UnsiloedHawk
OCR processing strategy: 'All', 'Auto' (default: 'All')
Available options: All, Auto