{ "job_id": "e77a5c42-4dc1-44d0-a30e-ed191e8a8908", "status": "Starting", "file_name": "document.pdf", "created_at": "2025-07-18T10:42:10.545832520Z", "message": "Job created successfully. Use GET /parse/{job_id} to check status and retrieve results.", "quota_remaining": 23700, "merge_tables": false }
Parse Document
Parse and segment PDFs, images, and Office files into meaningful sections using advanced AI models with flexible customization options.
The Parse Document endpoint processes PDFs, images (PNG, JPEG, TIFF), and Office files (PPT, DOCX, XLSX) and breaks them into meaningful sections, with detailed analysis that includes text extraction, image recognition, table parsing, and OCR data. You can provide a document either by direct file upload or by presigned URL. The endpoint supports customization options for tuning the parsing behavior to your use case.
This endpoint returns a job ID for asynchronous processing. Use the Get Parse Job Status endpoint (GET /parse/{job_id}) to check the status and retrieve results once processing is complete.
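Because processing is asynchronous, a client usually submits the document and then polls for completion. The following is a minimal polling sketch rather than an official client: `wait_for_job`, the `fetch_status` callable, and the terminal status names `Succeeded` and `Failed` are assumptions for illustration (this page only shows the initial `Starting` status). In a real integration, `fetch_status` would wrap a call to GET /parse/{job_id}.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal status names; only "Starting" is shown in this documentation.
TERMINAL_STATUSES = {"Succeeded", "Failed"}


def wait_for_job(
    fetch_status: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_status(job_id) until a terminal status is seen or attempts run out."""
    for _ in range(max_attempts):
        job = fetch_status(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        sleep(interval_s)  # wait before asking again
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")


# Demo with canned responses standing in for GET /parse/{job_id}.
_fake = iter([{"status": "Starting"}, {"status": "Succeeded", "job_id": "demo"}])
final = wait_for_job(lambda _job_id: next(_fake), "demo", sleep=lambda _s: None)
print(final["status"])
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays; a production client would also want backoff and error handling for failed HTTP calls.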
Presigned URL of the document to process. The URL must be publicly accessible or a valid presigned URL from cloud storage (S3, GCS, Azure Blob, etc.). Supported formats: PDF, images (PNG, JPEG, TIFF), and office documents (PPT, PPTX, DOC, DOCX, XLS, XLSX). Required if file is not provided.
Whether to merge tables that span multiple pages into a single unified table structure. When enabled, consecutive table segments with matching headers are consolidated. Default: false
Whether to use our Vision Language Model (Judge) to validate and correct table segment classifications. When enabled, segments that may have been misclassified (e.g., tables detected as text or pictures) are validated and corrected. This adds 4-5 s of latency per page.
Filter output to include only specific segment types. Accepts a comma-separated list of segment types or "all" to include everything. Examples: "table", "picture", "table,picture", "table,formula". Default: "all".
JSON configuration object to control which fields are included in the response. By default, all fields are included. Set fields to false to exclude them and reduce response size. Example: {"html": false, "markdown": true, "ocr": false}. Available fields: html, markdown, ocr, image, llm, content, bbox, confidence, embed.
JSON configuration object to customize how different segment types are processed. Allows you to control HTML/Markdown generation strategies, specify which field should populate the content field for each segment type, and configure the AI model for table processing. Example: {"Table": {"html": "LLM", "markdown": "LLM", "content_source": "HTML", "model_id": "us_table_v2"}}.
The segmentation_method parameter controls how the document is analyzed and segmented:
"smart_layout_detection" (default): Analyzes pages for layout elements (e.g., Table, Picture, Formula, etc.) using bounding boxes. Provides fine-grained segmentation and better chunking for complex documents.
"page_by_page": Treats each page as a single segment. Faster processing, ideal for simple documents without complex layouts.
The keep_segment_types parameter filters the output to include only specific segment types, reducing response size and focusing on relevant content.
How It Works:
Accepts a comma-separated list of segment types (case-insensitive)
Filters segments after processing is complete
Removes chunks that have no segments after filtering
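The three filtering rules above can be modeled client-side. This is only a sketch of the described behavior, not the service's implementation: the response shape (a list of chunks, each with a `segments` list whose items carry a `segment_type` key) and the helper names are assumptions for illustration.

```python
from typing import Any, Dict, List, Optional, Set


def parse_keep_types(value: str = "all") -> Optional[Set[str]]:
    """Return lower-cased type names, or None when everything should be kept."""
    names = {part.strip().lower() for part in value.split(",") if part.strip()}
    # Assumption: an empty value is treated the same as "all".
    if not names or "all" in names:
        return None
    return names


def filter_chunks(chunks: List[Dict[str, Any]], keep: str = "all") -> List[Dict[str, Any]]:
    """Keep only matching segments and drop chunks left with no segments."""
    wanted = parse_keep_types(keep)
    if wanted is None:
        return chunks
    filtered = []
    for chunk in chunks:
        segments = [s for s in chunk["segments"] if s["segment_type"].lower() in wanted]
        if segments:  # chunks with no remaining segments are removed
            filtered.append({**chunk, "segments": segments})
    return filtered


chunks = [
    {"chunk_id": 1, "segments": [{"segment_type": "Table"}, {"segment_type": "Text"}]},
    {"chunk_id": 2, "segments": [{"segment_type": "Text"}]},
]
print(filter_chunks(chunks, "TABLE, picture"))
```

With the sample data, only the first chunk survives, and it retains only its Table segment.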
The output_fields parameter controls which fields are included in the API response. This is useful for reducing response size, improving performance, and optimizing bandwidth usage when you don’t need all available data.
Available Fields:
html (default: true): Include HTML representation of segments
markdown (default: true): Include Markdown representation of segments
ocr (default: true): Include OCR results with bounding boxes and confidence scores
image (default: true): Include cropped segment images (base64 encoded)
llm (default: true): Include LLM-generated content and descriptions
content (default: true): Include text content of segments
bbox (default: true): Include bounding box coordinates
confidence (default: true): Include confidence scores for segments
embed (default: true): Include embed text in chunk responses
Usage:
Set fields to false to exclude them from the response. Fields not specified default to true for backward compatibility.
Example Configuration: {"html": false, "markdown": true, "ocr": false}
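As a rough illustration of the defaulting rule, the sketch below builds an output_fields JSON string that lists only the fields to exclude, and shows how omitted fields resolve to true. The helper names `build_output_fields` and `resolve_output_fields` are hypothetical; only the field names come from this page.

```python
import json
from typing import Dict, Iterable

# Field names as listed in this documentation.
OUTPUT_FIELDS = ("html", "markdown", "ocr", "image", "llm",
                 "content", "bbox", "confidence", "embed")


def build_output_fields(exclude: Iterable[str]) -> str:
    """Serialize a config that turns off the given fields and leaves the rest at their default."""
    excluded = set(exclude)
    unknown = excluded - set(OUTPUT_FIELDS)
    if unknown:
        raise ValueError(f"unknown output fields: {sorted(unknown)}")
    return json.dumps({name: False for name in OUTPUT_FIELDS if name in excluded})


def resolve_output_fields(config: Dict[str, bool]) -> Dict[str, bool]:
    """Apply the documented rule: any field not specified defaults to true."""
    return {name: config.get(name, True) for name in OUTPUT_FIELDS}


payload = build_output_fields(["html", "ocr"])
print(payload)
print(resolve_output_fields(json.loads(payload)))
```

Sending only the excluded fields keeps the payload short while relying on the documented default for everything else.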
The segment_analysis parameter customizes how different segment types are processed, including HTML/Markdown generation strategies and which field populates the content field.
Available Segment Types:
You can configure processing for any of the following segment types:
Table: Tabular data segments
Picture: Image and graphic segments
Formula: Mathematical equations
Title: Document titles
SectionHeader: Section headers
Text: Regular text content
ListItem: List items
Caption: Image captions
Footnote: Footnotes
PageHeader: Page headers
PageFooter: Page footers
Page: Full page segments
Configuration Options:
For each segment type, you can specify:
html: Generation strategy for HTML representation
  "Auto" (default): Automatically determine the best method
  "LLM": Use LLM to generate HTML
markdown: Generation strategy for Markdown representation
  "Auto" (default): Automatically determine the best method
  "LLM": Use LLM to generate Markdown
content_source: Defines which field should populate the content field in the response
  "OCR" (default): Use OCR text for content
  "HTML": Use HTML representation as content
  "Markdown": Use Markdown representation as content
model_id (Table segments only): Specifies which AI model to use for table processing
  "us_table_v1": Standard table processing model
  "us_table_v2": Enhanced table processing model with improved accuracy
How content_source Works:
The content_source parameter determines which field populates the content field in the segment response:
When content_source is set to "HTML", the content field will contain the HTML representation, and the separate html and markdown fields will be empty
When content_source is set to "Markdown", the content field will contain the Markdown representation, and the separate html and markdown fields will be empty
When content_source is set to "OCR" (default), the content field contains OCR text, and html and markdown fields are populated separately
When content_source is set to "LLM", the content field contains LLM-generated content
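The rules for "OCR", "HTML", and "Markdown" can be summarized in a small model. This is an illustrative sketch of the behavior described above, not the service's code: the function name is hypothetical, and representing "empty" as an empty string is an assumption. The "LLM" case is left out because this page does not say what happens to the html and markdown fields in that mode.

```python
from typing import Dict


def apply_content_source(ocr_text: str, html: str, markdown: str,
                         content_source: str = "OCR") -> Dict[str, str]:
    """Model how content, html, and markdown are populated for a segment."""
    if content_source == "HTML":
        # HTML moves into content; the separate fields are emptied.
        return {"content": html, "html": "", "markdown": ""}
    if content_source == "Markdown":
        # Markdown moves into content; the separate fields are emptied.
        return {"content": markdown, "html": "", "markdown": ""}
    if content_source == "OCR":
        # Default: OCR text in content, html and markdown populated separately.
        return {"content": ocr_text, "html": html, "markdown": markdown}
    raise ValueError(f"unsupported content_source in this sketch: {content_source}")


segment = apply_content_source(
    ocr_text="Q1 100",
    html="<table><tr><td>Q1</td><td>100</td></tr></table>",
    markdown="| Q1 | 100 |",
    content_source="HTML",
)
print(segment["content"])
```

The practical consequence is that a client reading the html field should fall back to content when content_source is set to "HTML" for that segment type.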
Use Cases:
HTML as Content: Set content_source: "HTML" for Table segments when you want HTML-formatted table data directly in the content field
Markdown as Content: Set content_source: "Markdown" for Picture segments when you want Markdown-formatted descriptions in the content field
LLM-Enhanced Content: Use "LLM" for both html/markdown generation strategies and set content_source: "LLM" to get AI-enhanced content in the content field
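A configuration for these use cases can be checked before it is sent. The sketch below validates a segment_analysis object against the option values listed on this page and serializes it to JSON. The function name and the decision to reject unknown keys are assumptions; the service may be more lenient.

```python
import json
from typing import Any, Dict

SEGMENT_TYPES = {"Table", "Picture", "Formula", "Title", "SectionHeader", "Text",
                 "ListItem", "Caption", "Footnote", "PageHeader", "PageFooter", "Page"}
STRATEGIES = {"Auto", "LLM"}
CONTENT_SOURCES = {"OCR", "HTML", "Markdown", "LLM"}
TABLE_MODELS = {"us_table_v1", "us_table_v2"}


def serialize_segment_analysis(config: Dict[str, Dict[str, Any]]) -> str:
    """Validate each segment type's options, then return the JSON string."""
    for seg_type, options in config.items():
        if seg_type not in SEGMENT_TYPES:
            raise ValueError(f"unknown segment type: {seg_type}")
        for key, value in options.items():
            if key in ("html", "markdown"):
                allowed = STRATEGIES
            elif key == "content_source":
                allowed = CONTENT_SOURCES
            elif key == "model_id":
                if seg_type != "Table":
                    raise ValueError("model_id applies to Table segments only")
                allowed = TABLE_MODELS
            else:
                raise ValueError(f"unknown option: {key}")
            if value not in allowed:
                raise ValueError(f"invalid value for {seg_type}.{key}: {value}")
    return json.dumps(config)


# HTML-as-content for tables, Markdown-as-content for pictures.
config = {
    "Table": {"html": "LLM", "markdown": "LLM",
              "content_source": "HTML", "model_id": "us_table_v2"},
    "Picture": {"markdown": "LLM", "content_source": "Markdown"},
}
print(serialize_segment_analysis(config))
```

Validating locally gives a faster failure than waiting for the job to be rejected or to silently ignore an option.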
Tabular data with structured rows and columns. Enhanced with LLM-based formatting and extended context options. You can configure the table processing model using model_id in the segment_analysis parameter:
us_table_v1: Standard table processing model
us_table_v2: Enhanced table processing model with improved accuracy
Bulleted and numbered list items. Processed with structure preservation.
Each segment includes detailed metadata such as confidence scores, bounding boxes, OCR data, and formatted output in both HTML and Markdown with LLM enhancement.
Filter output to include only specific segment types. Accepts comma-separated list (e.g., 'table', 'picture', 'table,picture') or 'all' for everything (default: 'all')