# Parse Document

> Parse and segment PDFs, images, and Office files into meaningful sections using advanced AI models with flexible customization options.

## Overview

The Parse Document endpoint processes PDFs, images (PNG, JPEG, TIFF), and Office files (PPT, PPTX, DOC, DOCX, XLS, XLSX) and breaks them into meaningful sections, with text extraction, image recognition, table parsing, and OCR data for each section. You can provide a document either by direct file upload or by URL (presigned or public). Customization options let you tune the parsing behavior to your use case.

You submit the document and its configuration in a single request, then poll for the result:

1. **POST** to `/parse` with your file and configuration: the API uploads the document and creates a parse job.
2. The job is automatically enqueued for processing.
3. **Poll** `GET /parse/{job_id}` to track progress and retrieve results.
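
The polling step above can be sketched as a small loop. This is a minimal sketch, not an official client: `fetch_status` is a hypothetical callable that wraps `GET /parse/{job_id}` and returns the decoded JSON body, and the terminal status names are passed in by the caller because this page only documents the initial `"Starting"` status.

```python
import time
from typing import Callable, Iterable


def wait_for_job(
    fetch_status: Callable[[], dict],
    terminal_statuses: Iterable[str],
    interval_s: float = 2.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reports a terminal status or attempts run out."""
    # The set of terminal statuses is an assumption supplied by the caller.
    terminal = set(terminal_statuses)
    for _ in range(max_attempts):
        job = fetch_status()
        if job.get("status") in terminal:
            return job
        sleep(interval_s)
    raise TimeoutError(f"job still running after {max_attempts} polls")
```

Injecting `sleep` keeps the loop easy to test and lets you swap in a backoff strategy later.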

<Tip>
  Processing large files or running many requests in parallel? The [Presigned Upload endpoint](/api-reference/parser/parse-document-v2) (`POST /v2/parse/upload`) decouples document delivery from job creation for faster uploads, larger file sizes, and higher throughput.
</Tip>

## Request

<Note>
  You must provide either `file` or `url`. If both are provided, `file` takes precedence.
</Note>

<ParamField body="file" type="file">
  Document file to process. Supported formats: PDF, images (PNG, JPEG, TIFF), and office documents (PPT, PPTX, DOC, DOCX, XLS, XLSX). Required if `url` is not provided.
</ParamField>

<ParamField body="url" type="string">
  Presigned or public URL of the document to fetch and process. Required if `file` is not provided.
</ParamField>

<ParamField body="use_high_resolution" type="boolean">
  Use high-resolution images for cropping and post-processing. Improves OCR accuracy on low-quality scans by enhancing clarity and contrast. Latency penalty: \~2–3 seconds per page. Defaults to `false`.
</ParamField>

<ParamField body="layout_analysis" type="string">
  How the system analyzes and segments document structure.

  * `"smart_layout_detection"` **(default)**: Intelligently identifies document structure, headers, sections, and content relationships across the entire document using bounding boxes.
  * `"page_by_page"`: Analyzes each page independently as a single segment. Faster for simple documents.
  * `"advanced_layout_detection"`: Uses a vision-language model for exhaustive page segmentation. Detects 14 element types (Caption, Footnote, Formula, ListItem, PageFooter, PageHeader, Picture, SectionHeader, Table, Text, Title, KeyValuePair, Signature, Seal). Best for visually complex or unusual layouts.
</ParamField>

<ParamField body="ocr_strategy" type="string">
  Controls whether OCR runs only where it is needed or on the entire document.

  * `"auto_detection"` **(default)**: Detects low-quality PDFs, scanned documents, and images, then applies OCR only where needed.
  * `"force_ocr"`: Runs OCR on the entire document regardless of quality.
</ParamField>

<ParamField body="ocr_engine" type="string">
  OCR engine to use for text recognition:

  * `"UnsiloedHawk"` **(default)**: Higher accuracy for complex layouts and mixed content. Unrecognized values also fall back to this engine.
  * `"UnsiloedBeta"`: Handles rotated/warped text and irregular bounding boxes.
  * `"UnsiloedStorm"`: Enterprise-grade accuracy optimized for 50+ languages.
</ParamField>

<ParamField body="agentic_ocr" type="string">
  Per-segment OCR enhancement: re-runs a dedicated agentic OCR model on each detected segment after layout detection for higher accuracy. Omit or leave empty to disable.

  * `"standard"`: Good balance of speed and accuracy.
  * `"advanced"`: Higher quality, best for complex layouts, rotated text, and mixed-language content.
</ParamField>

<ParamField body="extract_strikethrough" type="boolean">
  Detect and preserve strikethrough formatting in HTML and Markdown output. Defaults to `false`.
</ParamField>

<ParamField body="merge_tables" type="boolean">
  Detect and combine table segments across page breaks, reconstructing complete table structure by matching headers and columns. Defaults to `false`.
</ParamField>

<ParamField body="merge_batch_size" type="integer">
  Maximum number of tables per merge group when `merge_tables` is enabled. Groups larger than this are split. Defaults to `20`.
</ParamField>

<ParamField body="enhance_reading_order" type="boolean">
  Fix the reading order of detected segments. Defaults to `false`.
</ParamField>

<ParamField body="detect_pii" type="boolean">
  Run a PII detection pass before parsing. If PII is found at or above `pii_block_severity`, the task is rejected and no parsing occurs. Defaults to `false`.
</ParamField>

<ParamField body="pii_block_severity" type="string">
  Severity threshold at which the task is rejected when `detect_pii` is enabled: `any` (default) blocks on any PII found; `low` blocks on quasi-identifiers (names, dates, locations) or higher; `medium` blocks on contact PII (email, phone) or higher; `high` blocks only on direct identifiers (SSN, passport, credit card). Ignored if `detect_pii` is false.
</ParamField>
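
The threshold semantics of `pii_block_severity` can be illustrated with a short sketch. It is only a model of the documented rule, not the service's detector: the finding labels `low`, `medium`, and `high` are assumed tags for quasi-identifiers, contact PII, and direct identifiers respectively.

```python
# Thresholds ordered by how severe a finding must be before the task is rejected.
THRESHOLD_RANK = {"any": 0, "low": 1, "medium": 2, "high": 3}

# Assumed labels for findings: low = quasi-identifiers, medium = contact PII,
# high = direct identifiers.
FINDING_RANK = {"low": 1, "medium": 2, "high": 3}


def would_block(finding_severities, pii_block_severity="any"):
    """Return True if any finding is at or above the configured threshold."""
    threshold = THRESHOLD_RANK[pii_block_severity]
    return any(FINDING_RANK[s] >= threshold for s in finding_severities)
```

For example, a document containing only names (`low`) is rejected under `any` or `low`, but passes under `medium` or `high`.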

<ParamField body="pii_engine" type="string">
  PII detection engine: `standard` (default) or `advanced` (higher precision, additional processing cost). Ignored if `detect_pii` is false.
</ParamField>

<ParamField body="validate_segments" type="string">
  JSON array string of segment types to validate and correct using a Vision Language Model, fixing misclassified segments. Example: `["Table", "Formula", "Picture"]`. Defaults to `["Table", "Picture"]`; an empty or unparseable value also falls back to that default, so Table and Picture validation runs even when this field is omitted.
</ParamField>
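
The fallback behavior of `validate_segments` can be mirrored client-side to predict which segment types will be validated. This is a sketch of the documented rule only; treating an empty JSON array or a non-array JSON value as "empty or unparseable" is an assumption.

```python
import json

DEFAULT_VALIDATE_SEGMENTS = ["Table", "Picture"]


def resolve_validate_segments(raw):
    """Return the segment types that would be validated for a raw field value."""
    if not raw:
        # Omitted or empty string: the documented default applies.
        return list(DEFAULT_VALIDATE_SEGMENTS)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Unparseable value: the documented default applies.
        return list(DEFAULT_VALIDATE_SEGMENTS)
    if not isinstance(parsed, list) or not parsed:
        # Assumption: non-array values and [] are treated like an empty value.
        return list(DEFAULT_VALIDATE_SEGMENTS)
    return [str(item) for item in parsed]
```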

<ParamField body="validate_table_segments" type="boolean">
  Legacy parameter that validates table segment classifications using a Vision Language Model. Prefer `validate_segments: ["Table"]` instead. Defaults to `false`.
</ParamField>

<ParamField body="segment_filter" type="string">
  Choose which types of content to include in the parsed output. Comma-separated segment types, or `"all"` to include everything. Defaults to `"all"`.

  Available segment types:

  * `table`: Tabular data segments
  * `picture`: Image and graphic segments
  * `formula`: Mathematical equations
  * `text`: Regular text content
  * `sectionheader`: Section headers
  * `title`: Document titles
  * `listitem`: List items
  * `caption`: Image captions
  * `footnote`: Footnotes
  * `pageheader`: Page headers
  * `pagefooter`: Page footers
  * `keyvaluepair`: Key-value pairs (advanced layout detection)
  * `signature`: Signatures (advanced layout detection)
  * `seal`: Seals and stamps (advanced layout detection)
  * `page`: Full-page segments

  Examples: `"table"`, `"table,picture"`, `"table,formula"`, `"picture,formula"`.
</ParamField>

<ParamField body="xml_citation" type="boolean">
  Extract and hyperlink bibliography citations in the markdown output. PDFs only. Defaults to `false`.
</ParamField>

<ParamField body="output_fields" type="string">
  JSON object controlling which fields are included in the response. Set fields to `false` to exclude them and reduce response size. All fields default to `true`. Ignored when `response_profile` is `slim` or `full` (the profile wins).

  Available fields:

  * `html`: HTML representation of segments
  * `markdown`: Markdown representation of segments
  * `ocr`: Raw OCR text data with bounding boxes and confidence scores
  * `image`: Cropped segment images (base64 encoded)
  * `content`: Text content of segments
  * `bbox`: Bounding box coordinates
  * `confidence`: Confidence scores for segments
  * `embed`: Vector embeddings / embed text
  * `chart_data`: Extracted chart data for Picture segments identified as charts

  Example: `{"html": true, "markdown": true, "ocr": false, "image": false}`.
</ParamField>

<ParamField body="response_profile" type="string">
  Response shape selector: `slim`, `full`, or `custom`. Omit to return the full shape.

  * `"slim"`: Returns only the essentials per chunk — `embed`, `bbox`, `page_number`, `segment_id`, `segment_type`, and HTML for tables / Markdown for everything else. Drops `content`, `image`, `ocr`, `confidence`, `chart_data`, `page_height`, `page_width`. Best for embedding-only workflows where you want the smallest payload.
  * `"full"`: Every field returned (equivalent to omitting this param).
  * `"custom"`: Honor `output_fields` verbatim.

  When both `response_profile` and `output_fields` are provided, the profile wins — `output_fields` is only consulted for `custom` or when the profile is omitted. Applies to inline JSON responses only; `GET /parse/{job_id}?output_file=true` returns a presigned URL to the stored full-shape output file.
</ParamField>
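
The precedence between `response_profile` and `output_fields` can be expressed as a small resolver. This is a simplified sketch of the rule described above: it returns one flat flag per field, so it does not capture that `slim` returns HTML only for tables and Markdown for everything else.

```python
OUTPUT_FIELD_NAMES = (
    "html", "markdown", "ocr", "image", "content",
    "bbox", "confidence", "embed", "chart_data",
)

# Fields the slim profile drops, restricted to those that output_fields can toggle.
SLIM_DROPPED = {"content", "image", "ocr", "confidence", "chart_data"}


def effective_output_fields(response_profile=None, output_fields=None):
    """Resolve which output fields are included, letting the profile win."""
    if response_profile == "slim":
        return {name: name not in SLIM_DROPPED for name in OUTPUT_FIELD_NAMES}
    if response_profile == "full":
        return {name: True for name in OUTPUT_FIELD_NAMES}
    # "custom" or omitted: output_fields is honored; unspecified fields stay true.
    flags = {name: True for name in OUTPUT_FIELD_NAMES}
    for name, value in (output_fields or {}).items():
        if name in flags:
            flags[name] = bool(value)
    return flags
```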

<ParamField body="segment_analysis" type="string">
  JSON object controlling HTML/Markdown generation strategy and AI model per segment type. Configure how different segment types are processed, including table processing models, image description models, and formula processing.

  Example:

  ```json theme={null}
  {
    "Table": {"html": "VLM", "markdown": "VLM", "model_id": "us_table_v2"},
    "Picture": {"html": "VLM", "markdown": "VLM", "model_id": "nova"},
    "Formula": {"html": "Auto", "markdown": "VLM", "model_id": "nova"}
  }
  ```

  Options per segment type:

  * `html`: `"VLM"` or `"Auto"`
  * `markdown`: `"VLM"` or `"Auto"`
  * `model_id` (Table): `"astra"`, `"us_table_v1"`, `"us_table_v2"`
  * `model_id` (Picture/Formula): `"nova"`, `"luna"`, `"sol"`
  * `use_table_ocr` (Table only): Advanced OCR optimized for tabular data. Better handles bordered cells, gridlines, and complex table layouts.
  * `vlm`: Custom prompt for the VLM model. Use this to give the model specific instructions for extracting or describing these segment types.
  * `translation`: Optional per-segment translation, e.g. `{"provider": "Auto", "target_language": "en"}`. `provider` is `"Auto"` for fast machine translation or `"VLM"`/`"LLM"` for model-based translation; `target_language` is an ISO 639-1 code, or `"auto"` to auto-detect the source and translate to English. Optional `model_id` and `prompt` apply to model-based translation.
</ParamField>

<ParamField body="segment_processing" type="string">
  Alias for `segment_analysis`. If both are provided, `segment_processing` takes precedence.
</ParamField>

<ParamField body="page_range" type="string">
  Specify which pages to process. Formats: `"1-5"`, `"2,4,6"`, `"[1,3,5]"`. Defaults to all pages.
</ParamField>
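
To check a `page_range` value before sending it, the three documented formats can be expanded locally. This is a hedged sketch: it assumes ranges are inclusive, and it happens to accept mixed forms such as `"1-3,5"`, which this page does not document.

```python
def parse_page_range(spec):
    """Expand "1-5", "2,4,6", or "[1,3,5]" into a list of page numbers."""
    text = spec.strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    pages = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            start_text, end_text = token.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {token}")
            # Assumption: both ends of the range are included.
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(token))
    return pages
```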

<ParamField body="segment_type_naming" type="string">
  Segment type naming convention. `"Unsiloed"` (default) uses names like `PageHeader`, `ListItem`, `Picture`. `"Other"` uses alternative names like `Header`, `List Item`, `Figure`.
</ParamField>

<ParamField body="extract_colors" type="boolean">
  Transfer text color from the PDF text layer to OCR results. Defaults to `false`.
</ParamField>

<ParamField body="extract_links" type="boolean">
  Attach hyperlink URLs from PDF annotations to OCR results. Defaults to `false`.
</ParamField>

<ParamField body="export_format" type="string">
  JSON array string of export formats to generate after processing, e.g. `["docx"]`. When set, the pipeline generates the requested export files after parsing completes. The exported files are available as presigned URLs in the `exports` field of the response. Supported values: `"docx"`, `"markdown"`, `"json"`.

  <Note>
    This is a multipart form field, so the value must be a JSON-encoded string (`["docx"]`), not a repeated field. Passing a bare value like `docx` will fail to parse and silently skip the export.
  </Note>
</ParamField>
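
Because every option travels as a multipart form field, structured values have to be JSON-encoded before they are sent. The helper below is a minimal sketch of that encoding step; `encode_form_fields` is a hypothetical name, and the lowercase `true`/`false` spelling for booleans follows the cURL example on this page.

```python
import json


def encode_form_fields(options):
    """Turn Python option values into multipart form field strings."""
    fields = {}
    for name, value in options.items():
        if value is None:
            continue  # omit unset options so server defaults apply
        if isinstance(value, bool):
            fields[name] = "true" if value else "false"
        elif isinstance(value, (list, dict)):
            # Arrays and objects must be JSON strings, e.g. ["docx"].
            fields[name] = json.dumps(value)
        else:
            fields[name] = str(value)
    return fields
```

Passing `{"export_format": ["docx"]}` through this helper yields the string `["docx"]`, which avoids the silent skip that a bare `docx` value would cause.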

<ParamField body="error_handling" type="string">
  Error handling strategy for non-critical processing errors. `"Continue"` (default) proceeds despite errors (e.g., LLM refusals on individual segments). `"Fail"` stops and fails the task on any error.
</ParamField>

<ParamField body="expires_in" type="integer">
  Reserved field. Persisted in the task configuration but currently has no effect on retention for `POST /parse` — the task is not auto-deleted. To get a presigned-upload TTL, use [`POST /v2/parse/upload`](/api-reference/parser/parse-document-v2) instead, where `expires_in` controls the upload URL's validity.
</ParamField>

<ParamField body="chunk_processing" type="string">
  JSON object for chunk processing configuration.
</ParamField>

<ParamField body="llm_processing" type="string">
  JSON object for LLM processing configuration.
</ParamField>

## Configuration Best Practices

<Note>
  Click to expand each scenario below to view detailed configuration settings, recommendations, and trade-offs for your specific use case.
</Note>

<AccordionGroup>
  <Accordion title="High-Accuracy Processing">
    Use this configuration when accuracy is critical and processing time is less important.

    **Configuration:**

    ```json theme={null}
    {
      "use_high_resolution": true,
      "layout_analysis": "smart_layout_detection",
      "ocr_strategy": "force_ocr",
      "merge_tables": true,
      "validate_segments": ["Table", "Picture", "Formula"],
      "segment_analysis": {
        "Table": {
          "html": "VLM",
          "markdown": "VLM",
          "extended_context": true,
          "crop_image": "All",
          "model_id": "us_table_v2"
        }
      }
    }
    ```

    **When to Use:**

    * Legal documents requiring precise text extraction
    * Financial statements with complex tables
    * Archival documents with low-quality scans
    * Documents where accuracy is more important than speed

    **Trade-offs:**

    * **Latency**: +2-3 seconds per page for high resolution
    * **Latency**: +1-2 seconds per page for segment validation
  </Accordion>

  <Accordion title="Fast Processing">
    Use this configuration when speed is prioritized over maximum accuracy.

    **Configuration:**

    ```json theme={null}
    {
      "use_high_resolution": false,
      "layout_analysis": "page_by_page",
      "ocr_strategy": "auto_detection",
      "merge_tables": false
    }
    ```

    **When to Use:**

    * High-volume document processing
    * Real-time applications requiring quick results
    * Documents with simple layouts
    * Pre-screened high-quality digital documents

    **Benefits:**

    * Fastest processing time
    * Lower cost per document
    * Suitable for batch processing large volumes
  </Accordion>

  <Accordion title="Financial Documents (Tables + Charts)">
    Extract only tables and charts from financial reports and statements.

    **Configuration:**

    ```json theme={null}
    {
      "merge_tables": true,
      "segment_filter": "table,picture",
      "validate_segments": ["Table", "Picture"],
      "layout_analysis": "smart_layout_detection",
      "ocr_strategy": "auto_detection",
      "segment_analysis": {
        "Table": {
          "html": "VLM",
          "markdown": "VLM",
          "model_id": "us_table_v2"
        }
      }
    }
    ```

    **When to Use:**

    * Balance sheets and P\&L statements
    * Quarterly/annual financial reports
    * Investment reports with charts
    * Documents where only structured data matters

    **Benefits:**

    * Reduced response size (text content filtered out)
    * Focus on data-rich content
    * Merged multi-page tables for complete datasets
  </Accordion>

  <Accordion title="Data Extraction Only (Tables)">
    Extract only tabular data with minimal response size for maximum efficiency.

    **Configuration:**

    ```json theme={null}
    {
      "merge_tables": true,
      "segment_filter": "table",
      "validate_segments": ["Table"],
      "output_fields": {
        "html": true,
        "markdown": true,
        "ocr": false,
        "image": false,
        "content": true,
        "bbox": false,
        "confidence": false
      }
    }
    ```

    **When to Use:**

    * Extracting data from invoices
    * Processing structured forms
    * Database population from documents
    * CSV/Excel export workflows

    **Benefits:**

    * Minimal response payload
    * Faster data transfer
    * Easy integration with data pipelines
  </Accordion>

  <Accordion title="Academic/Research Documents">
    Extract content with structured citations from research papers and academic documents.

    **Configuration:**

    ```json theme={null}
    {
      "use_high_resolution": true,
      "layout_analysis": "smart_layout_detection",
      "ocr_strategy": "auto_detection",
      "xml_citation": true
    }
    ```

    **When to Use:**

    * Research papers with bibliographies
    * Academic articles with citations
    * Scientific documents
    * Literature reviews

    **Benefits:**

    * Automatic citation extraction and linking
    * Structured bibliography metadata
    * In-text citation hyperlinks in markdown
    * Preserves academic document structure

    <Note>
      Citation extraction is only available for PDF documents.
    </Note>
  </Accordion>

  <Accordion title="Scanned Documents">
    Optimize for scanned documents and images with poor text quality.

    **Configuration:**

    ```json theme={null}
    {
      "use_high_resolution": true,
      "ocr_strategy": "force_ocr",
      "layout_analysis": "smart_layout_detection"
    }
    ```

    **When to Use:**

    * Scanned paper documents
    * Low-quality photocopies
    * Historical documents
    * Image-based PDFs

    **Benefits:**

    * Maximum OCR coverage
    * Better text extraction from poor quality sources
    * Higher accuracy for challenging documents
  </Accordion>

  <Accordion title="Output Fields Optimization">
    Optimize response size and performance by selectively including only the fields you need.

    **For Minimal Response Size:**

    ```json theme={null}
    {
      "output_fields": {
        "html": false,
        "markdown": false,
        "ocr": false,
        "image": false,
        "content": true,
        "bbox": false,
        "confidence": false,
        "embed": true
      }
    }
    ```

    **For Text-Only Processing:**

    ```json theme={null}
    {
      "output_fields": {
        "html": false,
        "markdown": true,
        "ocr": false,
        "image": false,
        "content": true,
        "bbox": true,
        "confidence": false,
        "embed": true
      }
    }
    ```

    **For Full Analysis (Default):**

    Omit `output_fields` or set all fields to `true` to include all available data.

    **Benefits:**

    * Reduced response size and bandwidth usage
    * Faster processing and data transfer
    * Cost optimization for high-volume processing
  </Accordion>
</AccordionGroup>

## Parameter Details

### File Input Options

The API supports two methods for providing the document to process:

1. **Direct File Upload** (`file` parameter): Upload the document file directly as multipart/form-data
2. **Presigned URL** (`url` parameter): Provide a publicly accessible URL or presigned URL to the document

**Important Notes:**

* You must provide **either** `file` or `url`. If both are provided, `file` takes precedence
* When using `url`, the document will be downloaded from the provided URL before processing
* Presigned URLs are ideal for documents already stored in cloud storage (S3, GCS, Azure Blob, etc.)
* The URL must be publicly accessible or include necessary authentication parameters (e.g., S3 presigned URLs with signatures)
* Supported formats are the same for both methods: PDF, images (PNG, JPEG, TIFF), and office documents (PPT, PPTX, DOC, DOCX, XLS, XLSX)
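
A client can make the choice between the two input methods explicit before building the request. This is a small sketch with a hypothetical helper name; it prefers the file when both are given, matching the precedence stated in the Request section.

```python
def choose_source(file_path=None, url=None):
    """Pick which input field to send: "file" wins when both are given."""
    if file_path:
        return ("file", file_path)
    if url:
        return ("url", url)
    raise ValueError("provide either file_path or url")
```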

**Use Cases for Presigned URLs:**

* Documents already stored in cloud storage
* Avoiding duplicate file uploads
* Integration with existing document management systems
* Processing large files without upload overhead

### Segmentation Method

The `layout_analysis` parameter controls how the document is analyzed and segmented:

* **`"smart_layout_detection"`** (default): Analyzes pages for layout elements (e.g., Table, Picture, Formula, etc.) using bounding boxes. Provides fine-grained segmentation and better chunking for complex documents.

* **`"page_by_page"`**: Treats each page as a single segment. Faster processing, ideal for simple documents without complex layouts.

* **`"advanced_layout_detection"`**: Uses a vision-language model to exhaustively segment each page into 14 element types (including `KeyValuePair`, `Signature`, and `Seal` in addition to the standard set). Recommended for documents with dense, non-standard, or visually complex layouts where VGT-based detection misses regions.

### Agentic OCR

The `agentic_ocr` parameter enables per-segment OCR enhancement after layout detection, yielding higher accuracy on small text, stylized fonts, and mathematical formulas.

**Values:**

* `"standard"`: Fast, good for most documents.
* `"advanced"`: Higher quality, better for complex layouts, rotated or irregular text, and multilingual content.

### OCR Mode

The `ocr_strategy` parameter controls optical character recognition processing:

* **`"auto_detection"`** (default): Intelligently determines when OCR is needed based on the document content. Balances accuracy and performance.

* **`"force_ocr"`**: Applies OCR to all content regardless of existing text layer. Use this for scanned documents or when maximum text extraction is required.

### Table Merging

The `merge_tables` parameter enables merging of tables that span across multiple pages:

**How It Works:**

* Analyzes consecutive table segments across pages
* Identifies tables with matching column headers
* Merges them into a single unified table structure
* Preserves table formatting and data integrity
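
The steps above can be illustrated with a simplified model. This is not the service's merging algorithm: it assumes each table is a dict with hypothetical `page`, `header`, and `rows` keys, treats "matching" as exact header equality, and caps each merge group at `merge_batch_size` tables.

```python
def merge_consecutive_tables(tables, merge_batch_size=20):
    """Merge runs of consecutive tables that share the same column headers."""
    groups = []
    for table in tables:
        last = groups[-1] if groups else None
        if (
            last is not None
            and last["header"] == table["header"]
            and last["parts"] < merge_batch_size
        ):
            # Same columns as the previous table: append its rows.
            last["rows"].extend(table["rows"])
            last["pages"].append(table["page"])
            last["parts"] += 1
        else:
            # Different columns, or the group is full: start a new group.
            groups.append({
                "header": list(table["header"]),
                "rows": list(table["rows"]),
                "pages": [table["page"]],
                "parts": 1,
            })
    return groups
```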

**When to Use:**

* **Multi-Page Financial Statements**: Consolidate P\&L statements or balance sheets spanning multiple pages
* **Large Data Tables**: Merge inventory lists, transaction records, or data sets split across pages
* **Reports with Continuation Tables**: Automatically combine tables marked with "continued on next page"

**Example:**

```json theme={null}
{
  "merge_tables": true
}
```

**Benefits:**

* **Simplified Data Processing**: Work with complete tables instead of fragments
* **Better Context**: Maintain full table context for analysis and extraction
* **Reduced Post-Processing**: Eliminates need for manual table stitching

### Citation Extraction (Research Papers)

<Note>
  This is a specialized feature for academic and research PDF documents. Not needed for general document parsing.
</Note>

The `xml_citation` parameter enables automatic extraction and linking of citations from research papers, academic articles, and scientific documents.

**How It Works:**

* Extracts structured bibliography from the document
* Identifies in-text citation references (e.g., "Chen et al., 2021")
* Hyperlinks citations in the markdown output to their bibliography entries
* Returns structured citation metadata in the response

**Example:**

```json theme={null}
{
  "xml_citation": true
}
```

**Response Metadata:**

When enabled, the response includes a `metadata` field with structured citation data:

```json theme={null}
{
  "metadata": {
    "citations": [
      {
        "id": 1,
        "title": "Deep Learning for NLP",
        "authors": ["John Smith", "Jane Doe"],
        "year": "2021",
        "journal": "Nature",
        "volume": "15",
        "pages": "123-145",
        "doi": "10.1000/example"
      }
    ],
    "document_metadata": {
      "title": "Document Title",
      "authors": ["Author Name"]
    }
  }
}
```

**Markdown Enhancement:**

In-text citations are automatically hyperlinked:

* Original: `"As shown by Chen et al. (2021)..."`
* Enhanced: `"As shown by [Chen et al. (2021)](#ref-5)..."`

<Warning>
  Only available for PDF documents. This parameter is ignored for other file types (images, Office documents).
</Warning>

### Content Type Filtering

The `segment_filter` parameter allows you to filter the output to include only specific segment types, reducing response size and focusing on relevant content:

**How It Works:**

* Accepts a comma-separated list of segment types (case-insensitive)
* Filters segments after processing is complete
* Removes chunks that have no segments after filtering
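
The filtering rules above can be mimicked locally, for example to narrow an already-downloaded result. This is a sketch under assumptions: it treats each chunk as a dict with a `segments` list whose items carry a `segment_type` string, and it treats an empty filter like `"all"`.

```python
def apply_segment_filter(chunks, segment_filter="all"):
    """Keep only the requested segment types and drop chunks left empty."""
    wanted = {t.strip().lower() for t in segment_filter.split(",") if t.strip()}
    if not wanted or "all" in wanted:
        return chunks
    filtered = []
    for chunk in chunks:
        kept = [
            segment for segment in chunk["segments"]
            if segment["segment_type"].lower() in wanted  # case-insensitive match
        ]
        if kept:
            filtered.append({**chunk, "segments": kept})
    return filtered
```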

**Available Options:**

* `"all"` (default): Include all segment types
* `"table"`: Only table segments
* `"picture"`: Only image/graphic segments
* `"table,picture"`: Tables and pictures only
* `"table,formula"`: Tables and formulas only
* Custom combinations using any segment type

**Supported Segment Types:**

* `table`, `picture`, `formula`, `text`, `sectionheader`, `title`, `listitem`, `caption`, `footnote`, `pageheader`, `pagefooter`, `page`
* `keyvaluepair`, `signature`, `seal` (advanced layout detection)

**Example Usage:**

```json theme={null}
{
  "segment_filter": "table,picture"
}
```

**Use Cases:**

* **Tables Only**: Extract only tabular data from financial documents
* **Pictures Only**: Extract charts, graphs, and diagrams for visual analysis
* **Tables + Pictures**: Get structured data and visualizations, skip text content
* **Custom Combinations**: Mix any segment types based on your needs

**Benefits:**

* **Reduced Response Size**: Filter out unwanted content before receiving results
* **Faster Processing**: Less data to transfer and parse
* **Focused Extraction**: Get only the content types you need
* **Cost Optimization**: Smaller responses reduce bandwidth usage

### Output Fields Configuration

The `output_fields` parameter allows you to control which fields are included in the API response. This is useful for reducing response size, improving performance, and optimizing bandwidth usage when you don't need all available data.

**Available Fields:**

* **`html`** (default: `true`): Include HTML representation of segments
* **`markdown`** (default: `true`): Include Markdown representation of segments
* **`ocr`** (default: `true`): Include OCR results with bounding boxes and confidence scores
* **`image`** (default: `true`): Include cropped segment images (base64 encoded)
* **`content`** (default: `true`): Include text content of segments
* **`bbox`** (default: `true`): Include bounding box coordinates
* **`confidence`** (default: `true`): Include confidence scores for segments
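* **`embed`** (default: `true`): Include embed text in chunk responses
* **`chart_data`** (default: `true`): Include extracted chart data for Picture segments identified as charts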

**Usage:**

Set fields to `false` to exclude them from the response. Fields not specified default to `true` for backward compatibility.

**Example Configuration:**

```json theme={null}
{
  "html": false,
  "markdown": true,
  "ocr": false,
  "image": false,
  "content": true,
  "bbox": true,
  "confidence": false,
  "embed": true
}
```

**Benefits:**

* **Reduced Response Size**: Excluding large fields like `image` and `html` can significantly reduce payload size
* **Faster Processing**: Less data to serialize and transfer
* **Cost Optimization**: Smaller responses reduce bandwidth costs
* **Selective Data**: Only retrieve the fields you need for your use case

**When to Use:**

* **Minimal Response**: Set most fields to `false` when you only need basic content
* **Text-Only Processing**: Exclude `image` and `ocr` when processing text content
* **Embedding Generation**: Include only `content` and `embed` when generating embeddings
* **Full Analysis**: Keep all fields enabled (default) for comprehensive document analysis

### Segment Analysis Configuration

The `segment_analysis` parameter allows you to customize how different segment types are processed, including HTML/Markdown generation strategies and which field should populate the `content` field.

**Available Segment Types:**

You can configure processing for any of the following segment types:

* `Table`: Tabular data segments
* `Picture`: Image and graphic segments
* `Formula`: Mathematical equations
* `Title`: Document titles
* `SectionHeader`: Section headers
* `Text`: Regular text content
* `ListItem`: List items
* `Caption`: Image captions
* `Footnote`: Footnotes
* `PageHeader`: Page headers
* `PageFooter`: Page footers
* `Page`: Full page segments

**Configuration Options:**

For each segment type, you can specify:

* **`html`**: Generation strategy for HTML representation
  * `"Auto"` (default): Automatically determine the best method
  * `"VLM"`: Use VLM to generate HTML
* **`markdown`**: Generation strategy for Markdown representation
  * `"Auto"` (default): Automatically determine the best method
  * `"VLM"`: Use VLM to generate Markdown
* **`content_source`**: Defines which field should populate the `content` field in the response
  * `"OCR"` (default): Use OCR text for content
  * `"HTML"`: Use HTML representation as content
  * `"Markdown"`: Use Markdown representation as content
  * `"VLM"` (alias `"LLM"`): Use the VLM-generated representation as content
* **`model_id`**: Specifies which AI model processes the segment
  * Table: `"astra"`, `"us_table_v1"` (standard table processing model), or `"us_table_v2"` (enhanced table processing model with improved accuracy)
  * Picture/Formula: `"nova"`, `"luna"`, or `"sol"`
* **`vlm`**: Custom prompt for the VLM model. Use this to give the model specific instructions for extracting or describing these segment types.
* **`translation`**: Optional per-segment translation configuration:
  * `provider`: `"Auto"` (fast machine translation) or `"VLM"`/`"LLM"` (model-based translation)
  * `target_language`: ISO 639-1 code (e.g. `"en"`, `"es"`, `"fr"`, `"ko"`), or `"auto"` to auto-detect the source language and translate to English
  * `model_id` (optional): model for VLM/LLM translation; defaults to the provider default
  * `prompt` (optional): custom instructions appended to the translation system prompt

**Example Configuration:**

```json theme={null}
{
  "Table": {
    "html": "VLM",
    "markdown": "VLM",
    "content_source": "HTML",
    "model_id": "us_table_v2",
    "vlm": "Preserve all merged cells. Use empty strings for missing values."
  },
  "Picture": {
    "html": "VLM",
    "markdown": "VLM",
    "content_source": "Markdown",
    "vlm": "Focus on chart axes, legend labels, and key data trends."
  }
}
```
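
Since `segment_analysis` is sent as a JSON string, a typo in an option value is easy to miss. The check below is a hypothetical client-side sketch that compares `html`, `markdown`, and `content_source` against the values listed above; it assumes values are case-sensitive and leaves every other key unchecked.

```python
ALLOWED_VALUES = {
    "html": {"Auto", "VLM"},
    "markdown": {"Auto", "VLM"},
    "content_source": {"OCR", "HTML", "Markdown", "VLM", "LLM"},
}


def check_segment_analysis(config):
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    for segment_type, options in config.items():
        for key, allowed in ALLOWED_VALUES.items():
            if key in options and options[key] not in allowed:
                problems.append(
                    f"{segment_type}.{key}: {options[key]!r} is not one of {sorted(allowed)}"
                )
    return problems
```

Run the check first, then serialize the config with `json.dumps` before adding it to the form.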

**How `content_source` Works:**

The `content_source` parameter determines which field's value will be used to populate the `content` field in the segment response:

* When `content_source` is set to `"HTML"`, the `content` field will contain the HTML representation, and the separate `html` and `markdown` fields will be empty
* When `content_source` is set to `"Markdown"`, the `content` field will contain the Markdown representation, and the separate `html` and `markdown` fields will be empty
* When `content_source` is set to `"OCR"` (default), the `content` field contains OCR text, and `html` and `markdown` fields are populated separately

**Use Cases:**

* **HTML as Content**: Set `content_source: "HTML"` for Table segments when you want HTML-formatted table data directly in the `content` field
* **Markdown as Content**: Set `content_source: "Markdown"` for Picture segments when you want Markdown-formatted descriptions in the `content` field
* **VLM-Enhanced Output**: Use `"VLM"` for both `html` and `markdown` generation strategies to get AI-enhanced representations in those fields

## Response

<ResponseField name="job_id" type="string" required>
  Job identifier. Pass this to `GET /parse/{job_id}` to poll for results.
</ResponseField>

<ResponseField name="status" type="string" required>
  Initial job status. Always `"Starting"` on creation.
</ResponseField>

<ResponseField name="file_name" type="string" required>
  Name of the uploaded file. For URL submissions this is the last path segment of the URL, or `"unknown"` when no usable segment exists.
</ResponseField>

<ResponseField name="created_at" type="string" required>
  ISO 8601 timestamp when the job was created.
</ResponseField>

<ResponseField name="message" type="string" required>
  Human-readable status message with a polling hint.
</ResponseField>

<ResponseField name="credit_used" type="integer" required>
  Number of pages deducted from your quota for this job.
</ResponseField>

<ResponseField name="quota_remaining" type="integer" required>
  Remaining page quota after this job was deducted.
</ResponseField>

<ResponseField name="merge_tables" type="boolean" required>
  Whether table merging is enabled for this job (reflects the submitted `merge_tables` value).
</ResponseField>
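
Because every field above is marked required, a client can validate the creation response before it starts polling. The following is a minimal sketch; the `ParseJob` class and its `from_response` method are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ParseJob:
    """Typed view of the POST /parse creation response (illustrative only)."""
    job_id: str
    status: str
    file_name: str
    created_at: str
    message: str
    credit_used: int
    quota_remaining: int
    merge_tables: bool

    @classmethod
    def from_response(cls, payload: dict) -> "ParseJob":
        names = [f.name for f in fields(cls)]
        missing = [n for n in names if n not in payload]
        if missing:
            raise ValueError("response is missing fields: " + ", ".join(missing))
        # Ignore extra keys so that newly added response fields do not break parsing.
        return cls(**{n: payload[n] for n in names})
```

With the `requests` example in this page, usage would look like `job = ParseJob.from_response(response.json())`, after which `job.job_id` can be passed to the polling call.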

## Document Analysis Features

The parsing endpoint provides comprehensive document analysis including:

### Text Extraction

Extracts text content with high accuracy, preserving formatting and structure.

### Image Recognition

Identifies and analyzes images within documents, providing descriptions and metadata.

### Table Parsing

Extracts tabular data with proper structure and formatting.

### OCR Processing

Performs optical character recognition on text elements with confidence scores.

### Section Detection

Automatically identifies different document sections like headers, body text, and captions.

### Bounding Box Information

Provides precise coordinates for all extracted elements.

### Advanced Content Processing

* **VLM-Enhanced Analysis**: Uses vision-language models for better content understanding
* **Multi-Format Output**: Generates HTML, Markdown, and plain text versions
* **Context-Aware Processing**: Maintains document context across segments
* **Intelligent Chunking**: Creates semantically meaningful document chunks

<RequestExample>
  ```bash cURL theme={null}
  curl -X 'POST' \
    'https://prod.visionapi.unsiloed.ai/parse' \
    -H 'accept: application/json' \
    -H 'api-key: your-api-key' \
    -H 'Content-Type: multipart/form-data' \
    -F 'file=@document.pdf;type=application/pdf' \
    -F 'use_high_resolution=true' \
    -F 'layout_analysis=smart_layout_detection' \
    -F 'ocr_strategy=auto_detection' \
    -F 'ocr_engine=UnsiloedHawk' \
    -F 'extract_strikethrough=false' \
    -F 'merge_tables=true' \
    -F 'enhance_reading_order=false' \
    -F 'segment_filter=all' \
    -F 'validate_segments=["Table","Picture","Formula"]' \
    -F 'export_format=["docx"]' \
    -F 'segment_analysis={"Table":{"html":"VLM","markdown":"VLM","extended_context":true,"crop_image":"All","model_id":"us_table_v2"}}'

  # Alternative: Use presigned URL instead of file upload
  # Replace the file parameter with url parameter:
  # -F 'url=https://your-bucket.s3.amazonaws.com/document.pdf?signature=...' \
  ```

  ```python Python theme={null}
  import requests
  import json

  url = "https://prod.visionapi.unsiloed.ai/parse"
  headers = {
      "accept": "application/json",
      "api-key": "your-api-key"
  }

  # Basic file upload with all options
  data = {
      "use_high_resolution": "true",
      "layout_analysis": "smart_layout_detection",
      "ocr_strategy": "auto_detection",
      "ocr_engine": "UnsiloedHawk",
      "extract_strikethrough": "false",
      "merge_tables": "true",
      "enhance_reading_order": "false",
      "segment_filter": "all",
      # JSON-valued fields are sent as JSON strings in the multipart form
      "validate_segments": json.dumps(["Table", "Picture", "Formula"]),
      "export_format": json.dumps(["docx"]),
      "segment_analysis": json.dumps({
          "Table": {
              "html": "VLM",
              "markdown": "VLM",
              "extended_context": True,
              "crop_image": "All",
              "model_id": "us_table_v2"
          }
      })
  }

  # Open the file in a context manager so it is closed even if the request fails
  with open("document.pdf", "rb") as f:
      files = {"file": ("document.pdf", f, "application/pdf")}
      response = requests.post(url, headers=headers, files=files, data=data)

  if response.status_code == 200:
      result = response.json()
      print(f"Job ID: {result['job_id']}")
      print(f"Status: {result['status']}")
      print(f"File Name: {result['file_name']}")
      print(f"Credit Used: {result['credit_used']}")
      print(f"Message: {result['message']}")
  else:
      print("Error:", response.status_code, response.text)

  # ========== Alternative: Use Presigned URL ==========
  # Instead of uploading a file, you can provide a presigned URL
  # Remove the 'files' parameter and add 'url' to data:
  # data = {
  #     "url": "https://your-bucket.s3.amazonaws.com/document.pdf?signature=...",
  #     "use_high_resolution": "true",
  #     "layout_analysis": "smart_layout_detection",
  #     "ocr_strategy": "auto_detection",
  #     "merge_tables": "true",
  #     "segment_filter": "all"
  # }
  # response = requests.post(url, headers=headers, data=data)

  # ========== Use Case: Extract Tables Only ==========
  # For financial reports or documents where you only need tables:
  # data = {
  #     "merge_tables": "true",
  #     "segment_filter": "table",
  #     "validate_segments": json.dumps(["Table"]),
  #     "layout_analysis": "smart_layout_detection",
  #     "ocr_strategy": "auto_detection"
  # }

  # ========== Use Case: Citation Extraction (Academic Papers) ==========
  # For research papers, enable citation extraction:
  # data = {
  #     "use_high_resolution": "true",
  #     "layout_analysis": "smart_layout_detection",
  #     "ocr_strategy": "auto_detection",
  #     "xml_citation": "true"
  # }

  # ========== Use Case: Advanced Segment Analysis ==========
  # Configure how different segment types are processed:
  # segment_analysis_config = {
  #     "Table": {
  #         "html": "VLM",
  #         "markdown": "VLM",
  #         "content_source": "HTML",
  #         "model_id": "us_table_v2"
  #     },
  #     "Picture": {
  #         "html": "VLM",
  #         "markdown": "VLM",
  #         "content_source": "Markdown"
  #     }
  # }
  # data = {
  #     "use_high_resolution": "true",
  #     "layout_analysis": "smart_layout_detection",
  #     "ocr_strategy": "auto_detection",
  #     "merge_tables": "true",
  #     "segment_filter": "table,picture",
  #     "segment_analysis": json.dumps(segment_analysis_config)
  # }
  ```

  ```javascript JavaScript theme={null}
  const formData = new FormData();

  // Basic file upload
  const fileInput = document.querySelector('input[type="file"]');
  if (fileInput.files[0]) {
    formData.append('file', fileInput.files[0]);
  }

  // Add configuration parameters
  formData.append('use_high_resolution', 'true');
  formData.append('layout_analysis', 'smart_layout_detection');
  formData.append('ocr_strategy', 'auto_detection');
  formData.append('ocr_engine', 'UnsiloedHawk');
  formData.append('extract_strikethrough', 'false');
  formData.append('merge_tables', 'true');
  formData.append('enhance_reading_order', 'false');
  formData.append('segment_filter', 'all');
  formData.append('validate_segments', JSON.stringify(["Table", "Picture", "Formula"]));
  formData.append('export_format', JSON.stringify(["docx"]));
  formData.append('segment_analysis', JSON.stringify({
    Table: {
      html: 'VLM',
      markdown: 'VLM',
      extended_context: true,
      crop_image: 'All',
      model_id: 'us_table_v2'
    }
  }));

  const response = await fetch('https://prod.visionapi.unsiloed.ai/parse', {
    method: 'POST',
    headers: {
      'accept': 'application/json',
      'api-key': 'your-api-key'
    },
    body: formData
  });

  if (response.ok) {
    const result = await response.json();
    console.log(`Job ID: ${result.job_id}`);
    console.log(`Status: ${result.status}`);
    console.log(`File Name: ${result.file_name}`);
    console.log(`Credit Used: ${result.credit_used}`);
    console.log(`Message: ${result.message}`);
  } else {
    console.error('Parsing failed:', response.status, await response.text());
  }

  // ========== Alternative: Use Presigned URL ==========
  // Instead of file upload, use a presigned URL:
  // formData.append('url', 'https://your-bucket.s3.amazonaws.com/document.pdf?signature=...');
  // Remove the file upload lines above and use the url parameter instead

  // ========== Use Case: Extract Tables Only ==========
  // For financial documents where you only need tables:
  // formData.append('merge_tables', 'true');
  // formData.append('segment_filter', 'table');
  // formData.append('layout_analysis', 'smart_layout_detection');
  // formData.append('ocr_strategy', 'auto_detection');

  // ========== Use Case: Citation Extraction (Academic Papers) ==========
  // For research papers, enable citation extraction:
  // formData.append('use_high_resolution', 'true');
  // formData.append('layout_analysis', 'smart_layout_detection');
  // formData.append('ocr_strategy', 'auto_detection');
  // formData.append('xml_citation', 'true');
  ```
</RequestExample>

<ResponseExample>
  ```json Success Response theme={null}
  {
    "job_id": "e77a5c42-4dc1-44d0-a30e-ed191e8a8908",
    "status": "Starting",
    "file_name": "document.pdf",
    "created_at": "2025-07-18T10:42:10.545832520Z",
    "message": "Task created successfully. Use GET /parse/{job_id} to check status and retrieve results.",
    "credit_used": 5,
    "quota_remaining": 23695,
    "merge_tables": true
  }
  ```

  ```text Error Response - Missing Input theme={null}
  Either file or url must be provided
  ```

  ```json Error Response - File Too Large theme={null}
  {
    "error": "file_too_large",
    "message": "File size exceeds the configured limit (default 500MB for uploads)",
    "file_size_bytes": 614400000,
    "max_file_size_bytes": 524288000
  }
  ```

  ```text 400 - Bad Request theme={null}
  Bad request: missing file/url, unsupported file type, or invalid parameters.
  ```

  ```json 402 - Insufficient Quota theme={null}
  {
    "message": "Insufficient quota",
    "status": "INSUFFICIENT_QUOTA",
    "quota_data": {
      "remaining": 0
    }
  }
  ```

  ```text 429 - Usage Limit Exceeded theme={null}
  Usage limit exceeded
  ```

  ```json 429 - Rate Limit Exceeded theme={null}
  {
    "error": "rate_limit_exceeded",
    "message": "Rate limit of 10 requests per second exceeded. Retry after 1s.",
    "retry_after": 1
  }
  ```

  ```text 500 - Internal Server Error theme={null}
  Internal server error.
  ```

  ```text 503 - Service Unavailable theme={null}
  Service unavailable: job queue is at capacity. Retry after the duration indicated in the Retry-After header.
  ```
</ResponseExample>
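
The error responses above mix plain-text and JSON bodies, so a client cannot call `response.json()` unconditionally. The following is a minimal sketch of one way to normalize both shapes; `describe_error` is a hypothetical helper name, and the output dictionary layout is an assumption made for illustration.

```python
import json


def describe_error(status_code: int, body: str) -> dict:
    """Normalize an error body, which may be JSON or plain text, into a dict."""
    try:
        parsed = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        parsed = None
    if isinstance(parsed, dict):
        # JSON error: keep its keys and attach the HTTP status code.
        return {"status_code": status_code, **parsed}
    # Plain-text error: expose the text under a `message` key.
    return {"status_code": status_code, "message": body.strip()}
```

With `requests`, this could be called as `describe_error(response.status_code, response.text)` whenever the status code is not `200`.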

## Retrieving Results

After the job is created, poll the `GET /parse/{job_id}` endpoint to check status and retrieve results:

```bash cURL theme={null}
curl -X 'GET' \
  'https://prod.visionapi.unsiloed.ai/parse/{job_id}' \
  -H 'accept: application/json' \
  -H 'api-key: your-api-key'
```

```python Python theme={null}
import requests
import time

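def get_parse_results(job_id, api_key, poll_interval=5, max_wait=1800):
    """Poll the job until it succeeds or fails, then return the final response.

    max_wait is an arbitrary client-side cap in seconds; adjust it to your documents.
    """
    headers = {"api-key": api_key}
    status_url = f"https://prod.visionapi.unsiloed.ai/parse/{job_id}"
    deadline = time.monotonic() + max_wait

    # Poll for completion, giving up once max_wait seconds have elapsed
    while time.monotonic() < deadline:
        response = requests.get(status_url, headers=headers, timeout=30)

        if response.status_code == 200:
            status_data = response.json()
            print(f"Job Status: {status_data['status']}")

            if status_data['status'] == 'Succeeded':
                return status_data  # Results are included in the same response

            if status_data['status'] == 'Failed':
                raise Exception(f"Job failed: {status_data.get('message', 'Unknown error')}")
        else:
            # Non-200 responses (e.g. 429 or 503) may be transient; report and keep polling
            print(f"Status check returned {response.status_code}; retrying")

        time.sleep(poll_interval)  # Wait before checking again

    raise TimeoutError(f"Job {job_id} did not finish within {max_wait} seconds")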

# Usage
job_id = "e77a5c42-4dc1-44d0-a30e-ed191e8a8908"
results = get_parse_results(job_id, "your-api-key")
```

## Expected Results Structure

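When the job completes successfully, the response includes the job timestamps, a `total_chunks` count, and a `chunks` array. Each chunk holds a list of `segments` with their content, HTML and Markdown output, bounding box, and OCR data. The example below is abbreviated to a single chunk: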

```json theme={null}
{
  "job_id": "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9",
  "status": "Succeeded",
  "created_at": "2025-10-22T06:51:16.870302Z",
  "started_at": "2025-10-22T06:51:16.966136Z",
  "finished_at": "2025-10-22T06:57:19.821541Z",
  "total_chunks": 25,
  "chunks": [
    {
      "segments": [
        {
          "segment_type": "Title",
          "content": "Disinvestment of IFCI's entire stake in Assets Care & Reconstruction Enterprise Ltd (ACRE)",
          "image": null,
          "page_number": 1,
          "segment_id": "cc5f8dff-31be-4ccf-885d-4f9062fcee17",
          "confidence": 0.90187776,
          "page_width": 1191.0,
          "page_height": 1684.0,
          "html": "<h1>Disinvestment of IFCI's entire stake in Assets Care & Reconstruction Enterprise Ltd (ACRE)</h1>",
          "markdown": "# Disinvestment of IFCI's entire stake in Assets Care & Reconstruction Enterprise Ltd (ACRE)",
          "bbox": {
            "left": 72.92226,
            "top": 62.030334,
            "width": 230.36308,
            "height": 55.395317
          },
          "ocr": [
            {
              "bbox": {
                "left": 63.753525,
                "top": 5.395447,
                "width": 164.45312,
                "height": 42.757812
              },
              "text": "Disinvestment",
              "confidence": 0.9999992
            }
          ]
        },
        {
          "segment_type": "Text",
          "content": "Background and context information about the disinvestment process...",
          "image": null,
          "page_number": 1,
          "segment_id": "9d60e48b-77ba-4a23-a0ac-95ee13c615ec",
          "confidence": 0.88558982,
          "page_width": 1191.0,
          "page_height": 1684.0,
          "html": "<p>Background and context information about the disinvestment process...</p>",
          "markdown": "Background and context information about the disinvestment process...",
          "bbox": {
            "left": 486.9685,
            "top": 139.61847,
            "width": 241.29932,
            "height": 48.451706
          },
          "ocr": [
            {
              "bbox": {
                "left": 50.9729,
                "top": 3.4557495,
                "width": 46.046875,
                "height": 19.734375
              },
              "text": "Background",
              "confidence": 0.99999654
            }
          ]
        }
      ]
    }
  ]
}
```
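
To work with this structure, a client typically flattens the chunks into segments and converts bounding boxes to page-relative coordinates. The following is a minimal sketch; `iter_segments` and `normalized_bbox` are hypothetical helper names, and it assumes that a segment's `bbox` is expressed in the same units as `page_width` and `page_height`, which the sample values suggest but this page does not state.

```python
from typing import Iterator


def iter_segments(result: dict) -> Iterator[dict]:
    """Yield every segment from every chunk, in response order."""
    for chunk in result.get("chunks", []):
        yield from chunk.get("segments", [])


def normalized_bbox(segment: dict) -> dict:
    """Scale a segment's bbox to the 0-1 range using its page dimensions.

    Assumes bbox and page dimensions share the same units.
    """
    box = segment["bbox"]
    width, height = segment["page_width"], segment["page_height"]
    return {
        "left": box["left"] / width,
        "top": box["top"] / height,
        "width": box["width"] / width,
        "height": box["height"] / height,
    }
```

For example, `[normalized_bbox(s) for s in iter_segments(results)]` produces one page-relative box per segment, which is convenient when overlaying results on rendered pages of a different resolution.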

## Segment Types

The parsing API assigns each segment a `segment_type`. The types include:

### Picture

Images and graphics within the document, including logos, charts, and illustrations. Enhanced with VLM-based description generation.

### SectionHeader

Document headers and titles that define section boundaries. Processed with semantic understanding.

### Text

Regular text content including paragraphs, sentences, and individual text elements. Enhanced with context-aware processing.

### Table

Tabular data with structured rows and columns. Enhanced with VLM-based formatting and extended context options. You can configure the table processing model using `model_id` in the `segment_analysis` parameter:

* **`us_table_v1`**: Standard table processing model
* **`us_table_v2`**: Enhanced table processing model with improved accuracy

### Caption

Text captions associated with images or figures. Processed with relationship awareness.

### Formula

Mathematical equations and expressions. Enhanced with specialized formula processing.

### Title

Document titles and main headings. Processed with enhanced formatting.

### Footnote

Document footnotes and references. Processed with context linking.

### ListItem

Bulleted and numbered list items. Processed with structure preservation.

Each segment includes detailed metadata such as confidence scores, bounding boxes, OCR data, and formatted output in both HTML and Markdown with VLM enhancement.
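
Since each segment carries its `segment_type`, results can be bucketed by type on the client. The following is a minimal sketch that works on the response structure shown earlier; `group_by_type` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_type(result: dict) -> dict:
    """Map each segment_type to the list of segments of that type."""
    grouped = defaultdict(list)
    for chunk in result.get("chunks", []):
        for segment in chunk.get("segments", []):
            grouped[segment["segment_type"]].append(segment)
    return dict(grouped)
```

For instance, `group_by_type(results).get("Table", [])` returns only the table segments, which is useful when the job was submitted with `segment_filter` set to `all` but downstream code needs one type at a time.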

## Error Handling

### Common Error Scenarios

1. **Invalid API Key**: Authentication failed
2. **File Too Large**: File exceeds size limits
3. **Invalid Configuration**: Malformed processing parameters
4. **Server Error**: Internal processing error
5. **Processing Timeout**: Task took too long to complete
6. **Missing File or URL**: Neither `file` nor `url` parameter provided
7. **Both File and URL Provided**: Cannot provide both `file` and `url` simultaneously
8. **Invalid URL**: URL is not accessible or malformed
9. **URL Download Failed**: Unable to download document from provided URL
10. **Insufficient Quota** (`402`): Not enough page credits remaining.
11. **Usage Limit Exceeded** (`429`): Billing usage cap reached. Returns plain text: `Usage limit exceeded`. No `Retry-After` header.
12. **Rate Limit Exceeded** (`429`): The organization exceeded its per-second request budget (default 10 requests per second, configurable per organization). Returns JSON `{"error": "rate_limit_exceeded", "message": ..., "retry_after": 1}` with a `Retry-After: 1` header.
13. **Internal Server Error** (`500`): An unexpected error occurred during processing.
14. **Service Unavailable** (`503`): Job queue is at capacity. Retry after the duration indicated in the `Retry-After` header.
15. **Forbidden** (`403`): Access has been revoked.
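
The `429` and `503` entries above differ in whether a retry is worthwhile, and the `Retry-After` header is the distinguishing signal. The following is a minimal sketch of a retry decision based on that header; `retry_delay_seconds` is a hypothetical helper name, and it assumes the header value is a number of seconds (a non-numeric value is treated as "do not retry").

```python
from typing import Mapping, Optional


def retry_delay_seconds(status_code: int, headers: Mapping[str, str]) -> Optional[float]:
    """Return how long to wait before retrying, or None if a retry is not advised."""
    if status_code not in (429, 503):
        return None
    # Header names are case-insensitive, so normalize them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("retry-after")
    if value is None:
        # For example, the billing-cap 429, which carries no Retry-After header.
        return None
    try:
        return max(0.0, float(value))
    except ValueError:
        # Assumption of this sketch: only numeric values are handled.
        return None
```

A caller would sleep for the returned number of seconds and resubmit, or surface the error to the user when the function returns `None`.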


## OpenAPI

````yaml api-reference/parser/openapi-v1.json POST /parse
openapi: 3.1.0
info:
  title: Unsiloed Parser API — v1
  description: >-
    The original document parsing API. Accepts multipart file uploads and
    URL-based processing. These endpoints have no version prefix in their URLs
    and are stable indefinitely.
  contact:
    name: Unsiloed
    url: https://unsiloed.ai
    email: hello@unsiloed.com
  license:
    name: ''
  version: 1.0.0
servers:
  - url: https://prod.visionapi.unsiloed.ai
    description: Production
security: []
tags:
  - name: Authentication
    description: API key management endpoints
  - name: Health
    description: Endpoint for checking the health of the service.
  - name: Parse (Vision-API Compatible)
    description: >-
      Vision-API compatible endpoints for parsing - accepts multipart form data
      with Vision-API parameter names
paths:
  /parse:
    post:
      tags:
        - Parse (Vision-API Compatible)
      summary: POST /parse
      description: >-
        Create a document processing task. Accepts two request shapes at the
        same path,

        chosen by `Content-Type`:

        - `multipart/form-data` — binary file upload (`file` field) or a `url`
        field.

        - `application/json` (or `application/x-www-form-urlencoded`) — JSON
        body with a
          required `url` field; the `file` field is not applicable.

        Every other configuration field is accepted under both content types and
        behaves

        identically.
      operationId: create_parse_task
      requestBody:
        description: >-
          Provide either `file` (binary upload, multipart only) or `url`
          (presigned/public URL, both content types), not both. JSON callers
          send all fields as native JSON values; multipart callers send each
          field as a form part. The `file` field is multipart-only.
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/ParseCreateRequest'
          multipart/form-data:
            schema:
              $ref: '#/components/schemas/ParseCreateRequest'
        required: true
      responses:
        '200':
          description: Job created — poll with GET /parse/{job_id} to retrieve results.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ParseCreateResponse'
        '400':
          description: >-
            Bad request — missing file/url, unsupported file type, or invalid
            parameters.
          content:
            text/plain:
              schema:
                type: string
        '401':
          description: Unauthorized
        '402':
          description: Insufficient quota — not enough page credits remaining.
          content:
            text/plain:
              schema:
                type: string
        '403':
          description: Forbidden — access has been revoked.
        '429':
          description: >-
            Usage limit exceeded (billing cap) or rate limit hit (60 req/60 s
            sliding window).
          content:
            text/plain:
              schema:
                type: string
        '500':
          description: Internal server error.
          content:
            text/plain:
              schema:
                type: string
        '503':
          description: >-
            Service unavailable — job queue is at capacity. Retry after the
            duration indicated in the `Retry-After` header.
      security:
        - api_key: []
components:
  schemas:
    ParseCreateRequest:
      type: object
      description: >-
        Request body for `POST /parse` (multipart/form-data).


        Provide either `file` (binary upload) or `url` (presigned/public URL) —
        not both.
      required:
        - file
      properties:
        agentic_ocr:
          type:
            - string
            - 'null'
          description: >-
            Enable per-segment agentic OCR for higher accuracy. Pass
            `"standard"` or `"advanced"`.
        chunk_processing:
          type:
            - string
            - 'null'
          description: JSON object for chunk processing configuration.
        detect_pii:
          type:
            - boolean
            - 'null'
          description: >-
            Run a PII pre-check before parsing. When enabled, the document is
            scanned

            for personally identifiable information before any extraction work
            happens.

            If PII is found at or above `pii_block_severity`, the task is
            rejected and

            no parsing occurs (the job ends in a failed state with a PII
            reason).

            Defaults to `false`.
          default: false
        enhance_reading_order:
          type:
            - boolean
            - 'null'
          description: Fix the reading order of detected segments. Defaults to `false`.
          default: false
        error_handling:
          type:
            - string
            - 'null'
          description: |-
            Error handling strategy for non-critical processing errors.
            `Continue` (default) — proceed despite errors (e.g., LLM refusals).
            `Fail` — stop and fail the task on any error.
          default: Continue
        expires_in:
          type:
            - integer
            - 'null'
          format: int32
          description: >-
            Reserved field. Persisted in the task configuration but currently
            has no

            effect on retention for this endpoint — `POST /parse` (multipart and
            JSON/Form)

            does not set the task's `expires_at` column, and the cleanup job
            only deletes

            `AwaitingUpload` rows past their `expires_at`. To get a
            presigned-upload TTL,

            use `POST /v2/parse/upload` instead, where `expires_in` controls the
            upload

            URL's validity.
        export_format:
          type:
            - array
            - 'null'
          items:
            $ref: '#/components/schemas/ExportFormat'
          description: >-
            Export format(s) to generate after processing.

            When set, the pipeline generates the requested export files after
            parsing completes.

            The exported files are available as presigned URLs in the `exports`
            field of the response.

            Supported: `["docx", "markdown", "json"]`.
          example:
            - docx
            - markdown
            - json
        extract_colors:
          type:
            - boolean
            - 'null'
          description: >-
            Transfer text color from the PDF text layer to OCR results. Defaults
            to `false`.
          default: false
        extract_links:
          type:
            - boolean
            - 'null'
          description: >-
            Attach hyperlink URLs from PDF annotations to OCR results. Defaults
            to `false`.
          default: false
        extract_strikethrough:
          type:
            - boolean
            - 'null'
          description: >-
            Preserve strikethrough formatting in HTML/Markdown output. Defaults
            to `false`.
          default: false
        file:
          type: string
          format: binary
          description: >-
            Document file to process. Required if `url` is not provided.

            Supported formats: PDF, PNG, JPEG, TIFF, PPT, PPTX, DOC, DOCX, XLS,
            XLSX.
        layout_analysis:
          type:
            - string
            - 'null'
          description: >-
            Layout analysis strategy.

            `smart_layout_detection` (default) — detects layout elements using
            bounding boxes.

            `page_by_page` — treats each page as a single segment; faster for
            simple documents.

            `advanced_layout_detection` — higher-accuracy layout detection for
            complex pages

            (multi-column layouts, dense tables/figures); slower than
            `smart_layout_detection`.
          default: smart_layout_detection
        llm_processing:
          type:
            - string
            - 'null'
          description: JSON object for LLM processing configuration.
        merge_batch_size:
          type:
            - integer
            - 'null'
          format: int32
          description: >-
            Maximum number of tables per merge group when `merge_tables` is
            enabled.

            Groups larger than this are split into separate merges. Defaults to
            `20`.
          default: 20
        merge_tables:
          type:
            - boolean
            - 'null'
          description: >-
            Merge tables that span multiple pages into a single unified
            structure. Defaults to `false`.
          default: false
        ocr_engine:
          type:
            - string
            - 'null'
          description: >-
            OCR engine to use for text recognition.

            `UnsiloedBeta` (default) — handles irregular bounding boxes,
            rotated/warped text.

            `UnsiloedHawk` — higher accuracy, better for complex layouts.

            `UnsiloedStorm` — enterprise-grade accuracy, optimized for 50+
            languages.
          default: UnsiloedBeta
        ocr_strategy:
          type:
            - string
            - 'null'
          description: >-
            OCR strategy.

            `auto_detection` (default) — applies OCR only where needed.

            `force_ocr` — applies OCR to all content regardless of existing text
            layer.
          default: auto_detection
        output_fields:
          type:
            - string
            - 'null'
          description: >-
            JSON object filtering which fields appear on each segment / chunk.

            Each key defaults to `true`; set a key to `false` to drop the field.

            Keys: `bbox`, `chart_data`, `confidence`, `content`, `embed`,
            `html`,

            `image`, `markdown`, `ocr`. Example: `{"html": false, "ocr":
            false}`.

            Ignored when `response_profile` is `slim` or `full`.
        page_range:
          type:
            - string
            - 'null'
          description: >-
            Page range to process. Formats: `"1-5"`, `"2,4,6"`, `"[1,3,5]"`.
            Defaults to all pages.
        pii_block_severity:
          type:
            - string
            - 'null'
          description: >-
            Severity threshold at which a detected PII finding blocks the task.

            Ignored when `detect_pii` is `false`. Findings strictly below the
            threshold

            are allowed through; findings at or above it reject the task.

            - `any` (default) — block on any detection, regardless of severity.

            - `low` — block on low, medium, or high severity findings.

            - `medium` — block on medium or high severity findings.

            - `high` — block only on high severity findings.
          default: any
        pii_engine:
          type:
            - string
            - 'null'
          description: >-
            PII detector engine to use when `detect_pii` is `true`. Ignored
            otherwise.

            - `standard` (default) — fast pattern-based detector; low latency,
              well-suited to bulk pre-screening.
            - `advanced` — model-based detector; slower but catches contextual
              cases that pattern matching misses (e.g. handwritten names,
              partially redacted IDs, document-style references to a person).
          default: standard
        response_profile:
          type:
            - string
            - 'null'
          description: >-
            Response shape selector: `slim`, `full`, or `custom`.

            - `slim`: chunk `embed` + bbox + page_number + segment_id +
              segment_type + HTML for tables / Markdown for everything else.
              Drops `content`, `image`, `ocr`, `confidence`, `chart_data`,
              `page_height`, `page_width`.
            - `full`: every field returned (equivalent to omitting this param).

            - `custom`: honor `output_fields` verbatim.


            Precedence: when both `response_profile` and `output_fields` are

            provided, the profile wins (`output_fields` only matters for
            `custom`

            or when the profile is omitted).


            Applies to inline JSON responses only — `GET
            /parse/{job_id}?output_file=true`

            returns a presigned URL to the stored full-shape output file.
          example: slim
        segment_analysis:
          type:
            - string
            - 'null'
          description: >-
            JSON object controlling HTML/Markdown generation strategy and AI
            model per segment type.

            Example: `{"Table": {"html": "LLM", "markdown": "LLM", "model_id":
            "us_table_v2"}}`.
        segment_filter:
          type:
            - string
            - 'null'
          description: >-
            Content filter: comma-separated segment types to keep.

            Example: `"table,picture"`. Use `"all"` to include everything.
            Defaults to `"all"`.
          default: all
        segment_processing:
          type:
            - string
            - 'null'
          description: >-
            Alias for `segment_analysis` (Core Parser name). If both are
            provided, this takes precedence.
        segment_type_naming:
          type:
            - string
            - 'null'
          description: |-
            Segment type naming convention.
            `Unsiloed` (default) — e.g., `PageHeader`, `ListItem`, `Picture`.
            `Other` — alternative names e.g., `Header`, `List Item`, `Figure`.
          default: Unsiloed
        url:
          type:
            - string
            - 'null'
          description: |-
            Presigned or public URL of the document to fetch and process.
            Required if `file` is not provided.
        use_high_resolution:
          type:
            - boolean
            - 'null'
          description: |-
            Use high-resolution images for cropping and post-processing.
            Latency penalty: ~2–3 s per page. Defaults to `true`.
          default: true
        validate_segments:
          type:
            - string
            - 'null'
          description: |-
            JSON array string of segment types to validate with VLM.
            Example: `["Table", "Formula", "Picture"]`. Defaults to `[]`.
        validate_table_segments:
          type:
            - boolean
            - 'null'
          description: |-
            Legacy: validate table segment classifications using VLM.
            Prefer `validate_segments: ["Table"]` instead. Defaults to `false`.
          default: false
        xml_citation:
          type:
            - boolean
            - 'null'
          description: >-
            Extract and hyperlink bibliography citations in the Markdown output.
            PDFs only.

            Defaults to `false`.
          default: false
    ParseCreateResponse:
      type: object
      description: Response body for a successful `POST /parse` call.
      required:
        - job_id
        - status
        - file_name
        - created_at
        - message
        - credit_used
        - quota_remaining
        - merge_tables
      properties:
        created_at:
          type: string
          description: ISO 8601 timestamp when the job was created.
        credit_used:
          type: integer
          format: int32
          description: Number of pages deducted from your quota for this job.
        file_name:
          type: string
          description: Name of the uploaded file, or `"unknown"` when a URL was provided.
        job_id:
          type: string
          description: >-
            Job identifier — pass this to `GET /parse/{job_id}` to poll for
            results.
        merge_tables:
          type: boolean
          description: >-
            Whether table merging is enabled for this job (reflects the
            submitted `merge_tables` value).
        message:
          type: string
          description: Human-readable status message with a polling hint.
        quota_remaining:
          type: integer
          format: int64
          description: Page quota remaining after this job's pages were deducted.
        status:
          type: string
          description: Initial job status. Always `"Starting"` on creation.
    ExportFormat:
      type: string
      description: >-
        File format for exporting parsed results. When specified in a parse
        request, the pipeline generates the requested export file after
        processing completes. The exported file is available via the `exports`
        field in the task response.
      enum:
        - docx
        - markdown
        - json
  securitySchemes:
    api_key:
      type: http
      scheme: bearer
      description: API key for authentication. Send it in the `Authorization` header as `Bearer <your_api_key>`.

````
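
Several request fields in the schema above are typed as strings but carry structured values: `segment_filter` is a comma-separated list, and `validate_segments` is a JSON array serialized to a string. The sketch below shows one way to encode those fields and to check a `POST /parse` response against the `required` list of `ParseCreateResponse`. The helper names `build_parse_form` and `missing_response_fields` are hypothetical, and the lowercase `"true"`/`"false"` encoding for booleans is an assumption about how form values are parsed, not something the schema states.

```python
import json

# Fields the schema lists as required on ParseCreateResponse.
REQUIRED_RESPONSE_FIELDS = (
    "job_id", "status", "file_name", "created_at",
    "message", "credit_used", "quota_remaining", "merge_tables",
)


def build_parse_form(url, segment_filter=None, validate_segments=None,
                     xml_citation=None):
    """Encode parse options as form fields (hypothetical helper).

    Field names come from the schema; every value is sent as a string.
    """
    form = {"url": url}
    if segment_filter is not None:
        # Comma-separated segment types, e.g. "table,picture".
        form["segment_filter"] = ",".join(segment_filter)
    if validate_segments is not None:
        # JSON array serialized to a string, e.g. '["Table", "Formula"]'.
        form["validate_segments"] = json.dumps(validate_segments)
    if xml_citation is not None:
        # Assumption: booleans are accepted as lowercase "true"/"false".
        form["xml_citation"] = "true" if xml_citation else "false"
    return form


def missing_response_fields(body):
    """Return the required ParseCreateResponse fields absent from `body`."""
    return [name for name in REQUIRED_RESPONSE_FIELDS if name not in body]


form = build_parse_form(
    "https://example.com/report.pdf",  # placeholder URL
    segment_filter=["table", "picture"],
    validate_segments=["Table", "Formula"],
    xml_citation=False,
)
print(form)
```

The resulting dictionary can be passed as the form body of the `POST /parse` request by whichever HTTP client you use, alongside the `Authorization: Bearer <your_api_key>` header. Running `missing_response_fields` on the decoded response before reading `job_id` is a cheap guard against error payloads being treated as successful job creations.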