# Get Parse Result

> Check the status and retrieve results of parsing jobs

## Overview

The Get Parse Result endpoint returns the current status of a parsing job and, once processing has finished, the complete results. It is specific to the parsing API and returns document analysis including text extraction, image recognition, table parsing, and OCR data.

<Note>
  Parsing jobs are processed asynchronously. Poll this endpoint until the job status is `Succeeded`, then read the results from the same response.
</Note>

## Parameters

<ParamField path="job_id" type="string" required>
  Job ID returned by `POST /parse`.
</ParamField>

<ParamField query="base64_urls" type="boolean">
  Return segment images as base64-encoded data URIs instead of S3 presigned URLs. Defaults to `false`.
</ParamField>

<ParamField query="include_chunks" type="boolean">
  Include the `chunks` array in the response. Defaults to `true`.
</ParamField>

<ParamField query="output_file" type="boolean">
  Return a presigned S3 URL to the raw output JSON file instead of inlining the full response body. Defaults to `false`.
</ParamField>

<ParamField query="include_url" type="boolean">
  Opt in to receiving file URLs (`pdf_url`, `file_url`, `output_file_url`, segment `image`, `configuration.input_file_url`) in the response. Defaults to `false`, in which case these fields are returned as `null` so the response (and any log of it) does not expose the storage bucket, region, or path. `exports` URLs are always returned regardless of this setting. Can also be set via the `include-url` header.
</ParamField>

<ParamField query="segment_filter" type="string">
  Comma-separated segment types to keep in the response (alias: `keep_segment_types`). Omit to include every type.
</ParamField>

<ParamField query="merge_tables" type="boolean">
  Return the cached cross-page merged result for jobs that were submitted with `merge_tables`. No merging happens at read time; the job's own setting takes precedence, so passing this for a job created without `merge_tables` has no effect. Defaults to `false`.
</ParamField>
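
The query flags above can be combined in one request. The following is a minimal sketch of building them in Python; `build_parse_query` is a hypothetical helper, not part of any SDK, and it assumes booleans are sent as lowercase `true`/`false`, following the `include_url=true` form used on this page.

```python
def build_parse_query(include_url=False, include_chunks=True,
                      base64_urls=False, output_file=False,
                      segment_filter=None):
    """Build query parameters for GET /parse/{job_id}.

    Only non-default values are emitted, so an empty dict means
    "use the documented defaults".
    """
    params = {}
    if include_url:
        params["include_url"] = "true"
    if not include_chunks:
        params["include_chunks"] = "false"
    if base64_urls:
        params["base64_urls"] = "true"
    if output_file:
        params["output_file"] = "true"
    if segment_filter:
        # Documented as a comma-separated list of segment types.
        params["segment_filter"] = ",".join(segment_filter)
    return params


params = build_parse_query(include_url=True, segment_filter=["Table", "Picture"])
print(params)
```

The resulting dictionary can be passed to `requests.get(url, headers=headers, params=params)`.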

## Response

<ResponseField name="job_id" type="string">
  Job identifier.
</ResponseField>

<ResponseField name="status" type="string">
  Current job status: `Starting`, `Queued`, `Processing`, `Succeeded`, `Failed`, or `Cancelled`. Jobs created through the [v2 presigned-upload flow](/api-reference/parser/parse-document-v2) can also report `AwaitingUpload` before the file is uploaded.
</ResponseField>

<ResponseField name="created_at" type="string">
  ISO 8601 timestamp when the job was created.
</ResponseField>

<ResponseField name="metadata" type="object">
  Citation or job metadata. Populated when `xml_citation` is enabled or from the job record.
</ResponseField>

<ResponseField name="started_at" type="string">
  ISO 8601 timestamp when processing started. `null` until a worker picks the job up (`Starting` and `Queued`).
</ResponseField>

<ResponseField name="finished_at" type="string">
  ISO 8601 timestamp when processing completed. Present when status is `Succeeded` or `Failed`.
</ResponseField>

<ResponseField name="total_chunks" type="integer">
  Total number of document chunks. `0` until status is `Succeeded`.
</ResponseField>

<ResponseField name="page_count" type="integer">
  Number of pages in the document. `0` until processing begins; populated while status is `Processing`.
</ResponseField>

<ResponseField name="chunks" type="array">
  Array of document chunks with segments and extracted content. Empty until status is `Succeeded` (chunk content is only downloaded for succeeded jobs).
</ResponseField>

<ResponseField name="pdf_url" type="string">
  Presigned S3 URL to the generated PDF. Returned as `null` unless `include_url=true` is set.
</ResponseField>

<ResponseField name="exports" type="object">
  Presigned download URLs for exported file formats. Only present when `export_format` was specified in the parse request and the export has completed. Keys are format names (e.g. `"docx"`), values are presigned S3 URLs valid for 1 hour. If export failed, contains `{"docx_error": "..."}` instead. Always returned with URLs, regardless of `include_url`.
</ResponseField>

<ResponseField name="file_name" type="string">
  Original file name from the job record.
</ResponseField>

<ResponseField name="file_type" type="string">
  MIME type of the uploaded file.
</ResponseField>

<ResponseField name="file_url" type="string">
  S3 URL of the original uploaded file. Returned as `null` unless `include_url=true` is set.
</ResponseField>

<ResponseField name="credit_used" type="integer">
  Credits used for this job.
</ResponseField>

<ResponseField name="message" type="string">
  Status detail message, always present (e.g. "Task queued"). Carries the failure reason when status is `Failed`.
</ResponseField>

<ResponseField name="configuration" type="object">
  Configuration used for this job, mirroring the parameters submitted at creation time.
</ResponseField>

<ResponseField name="merge_tables" type="boolean">
  Whether table merging was enabled for this job.
</ResponseField>

<Note>
  Two flags change the response shape for succeeded jobs: with `output_file=true` the parsed content is replaced by a `result` object carrying `output_file_url` (a presigned URL to the raw output JSON) instead of inline `chunks`; and a `merge_tables` job whose merged result is already available may return the content nested under an `output` key with a `task_id` instead of the standard top-level fields.
</Note>
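
Because the response shape can vary, client code may want to check where the parsed content lives before reading it. The sketch below covers the three shapes described in the note. `locate_parsed_content` is a hypothetical helper, and since the internal layout of the `output` object is not documented here, it is returned untouched rather than unpacked.

```python
def locate_parsed_content(job):
    """Return (kind, value) describing where the parsed content lives.

    kind is one of:
      "file"   - value is the URL found at result.output_file_url
      "merged" - value is the object nested under the "output" key
      "inline" - value is the top-level chunks list (possibly empty)
    """
    result = job.get("result")
    if isinstance(result, dict) and result.get("output_file_url"):
        return "file", result["output_file_url"]

    output = job.get("output")
    if isinstance(output, dict):
        # Layout of this object is an unknown here; hand it back as-is.
        return "merged", output

    return "inline", job.get("chunks") or []
```

Note that `output_file_url` is among the fields nulled when `include_url` is `false`, so a request with `output_file=true` most likely needs `include_url=true` as well for the `"file"` branch to be reached.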

<RequestExample>
  ```bash cURL theme={null}
  curl -X 'GET' \
    'https://prod.visionapi.unsiloed.ai/parse/04a7a6d8-5ef7-465a-b22a-8a98e7104dd9' \
    -H 'accept: application/json' \
    -H 'api-key: your-api-key'
  ```

  ```python Python theme={null}
  import requests

  # Statuses that mean the job has not finished yet
  IN_PROGRESS = ('AwaitingUpload', 'Starting', 'Queued', 'Processing')

  def get_parse_job_status(job_id, api_key):
      """Fetch a parsing job once; return the job dict, or None on failure."""

      url = f"https://prod.visionapi.unsiloed.ai/parse/{job_id}"
      headers = {"api-key": api_key}

      response = requests.get(url, headers=headers, timeout=30)

      if response.status_code != 200:
          print("Error:", response.status_code, response.text)
          return None

      job = response.json()
      print(f"Job Status: {job['status']}")
      print(f"Created: {job['created_at']}")

      if job['status'] == 'Succeeded':
          print("Job completed successfully!")
          print(f"Total chunks: {job['total_chunks']}")
          return job
      if job['status'] == 'Failed':
          print(f"Job failed: {job.get('message', 'Unknown error')}")
          return None
      if job['status'] in IN_PROGRESS:
          print("Job is still in progress...")
          return job

      # Cancelled, or a status this client does not recognize
      print(f"Job is not running: {job['status']}")
      return job

  # Usage
  job_id = "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9"
  result = get_parse_job_status(job_id, "your-api-key")
  ```

  ```javascript JavaScript theme={null}
  async function getParseJobStatus(jobId, apiKey) {
    const response = await fetch(`https://prod.visionapi.unsiloed.ai/parse/${jobId}`, {
      headers: {
        'accept': 'application/json',
        'api-key': apiKey
      }
    });

    if (response.ok) {
      const job = await response.json();
      console.log('Job Status:', job.status);
      console.log('Created:', job.created_at);
      
      switch (job.status) {
        case 'Succeeded':
          console.log('Job completed successfully!');
          console.log('Total chunks:', job.total_chunks);
          return job;
        case 'Failed':
          console.log('Job failed:', job.message || 'Unknown error');
          return null;
        case 'AwaitingUpload':
        case 'Starting':
        case 'Queued':
        case 'Processing':
          console.log('Job is still in progress...');
          return job;
        case 'Cancelled':
          console.log('Job was cancelled.');
          return job;
        default:
          console.log('Unknown status:', job.status);
          return job;
      }
    } else {
      console.error('Failed to get job status:', await response.text());
      return null;
    }
  }

  // Usage
  const jobId = '04a7a6d8-5ef7-465a-b22a-8a98e7104dd9';
  const result = await getParseJobStatus(jobId, 'your-api-key');
  ```
</RequestExample>

<ResponseExample>
  ```json Starting Job theme={null}
  {
    "job_id": "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9",
    "status": "Starting",
    "created_at": "2025-10-22T06:51:16.870302Z",
    "metadata": {}
  }
  ```

  ```json Processing Job theme={null}
  {
    "job_id": "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9",
    "status": "Processing",
    "created_at": "2025-10-22T06:51:16.870302Z",
    "started_at": "2025-10-22T06:51:16.966136Z",
    "metadata": {}
  }
  ```

  ```json Succeeded Job theme={null}
  {
    "job_id": "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9",
    "status": "Succeeded",
    "created_at": "2025-10-22T06:51:16.870302Z",
    "started_at": "2025-10-22T06:51:16.966136Z",
    "finished_at": "2025-10-22T06:57:19.821541Z",
    "metadata": {},
    "total_chunks": 25,
    "page_count": 10,
    "pdf_url": "https://s3.us-east-1.amazonaws.com/unsiloed-bucket/output.pdf?...",
    "exports": {
      "docx": "https://s3.us-east-1.amazonaws.com/unsiloed-bucket/output.docx?..."
    },
    "file_name": "document.pdf",
    "file_type": "application/pdf",
    "file_url": "https://s3.us-east-1.amazonaws.com/unsiloed-bucket/input/document.pdf",
    "credit_used": 10,
    "merge_tables": false,
    "configuration": {
      "high_resolution": false,
      "ocr_strategy": "auto_detection",
      "ocr_engine": "UnsiloedHawk",
      "layout_analysis": "smart_layout_detection",
      "segment_type_naming": "Unsiloed",
      "error_handling": "Continue",
      "agentic_ocr": null,
      "merge_tables": false,
      "extract_colors": false,
      "extract_links": false,
      "extract_strikethrough": false,
      "xml_citation": false,
      "chunk_processing": {},
      "segment_analysis": {}
    },
    "chunks": [
      {
        "chunk_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "chunk_length": 128,
        "embed": "eternal logo the image displays a logo...",
        "segments": [
          {
            "segment_type": "Picture",
            "content": "eternal",
            "image": "https://s3.us-east-1.amazonaws.com/unsiloed-bucket/...",
            "page_number": 1,
            "segment_id": "1c60ecbd-b6da-493c-b3f6-6849337a981f",
            "confidence": 0.6062846,
            "page_width": 1191.0,
            "page_height": 1684.0,
            "html": "<p>The image displays a logo...</p>",
            "markdown": "Eternal Logo\n\nThe image displays a logo...",
            "bbox": {
              "left": 72.92226,
              "top": 62.030334,
              "width": 230.36308,
              "height": 55.395317
            },
            "ocr": [
              {
                "bbox": {
                  "left": 63.753525,
                  "top": 5.395447,
                  "width": 164.45312,
                  "height": 42.757812
                },
                "text": "eternal",
                "confidence": 0.9999992
              }
            ]
          }
        ]
      }
    ]
  }
  ```

  ```json Failed Job theme={null}
  {
    "job_id": "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9",
    "status": "Failed",
    "created_at": "2025-10-22T06:51:16.870302Z",
    "started_at": "2025-10-22T06:51:16.966136Z",
    "finished_at": "2025-10-22T06:51:25.123456Z",
    "metadata": {},
    "message": "Failed to process document: File appears to be corrupted"
  }
  ```

  ```json Error Response - Unauthorized theme={null}
  "Unauthorized"
  ```

  ```text Error Response - Forbidden theme={null}
  You don't have permission to access this task
  ```

  ```json Error Response - Task Not Found theme={null}
  {
    "error": {
      "code": "task_not_found",
      "message": "Task 04a7a6d8-5ef7-465a-b22a-8a98e7104dd9 not found",
      "details": {
        "task_id": "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9"
      }
    }
  }
  ```

  ```json Error Response - Server Error theme={null}
  {
    "error": {
      "code": "internal_error",
      "message": "Unable to load this parse task right now. Please try again shortly."
    }
  }
  ```
</ResponseExample>

## Job Status Values

<AccordionGroup>
  <Accordion title="Starting">
    Job has been created and is waiting to be processed. This is the initial status when a parsing job is first created.
  </Accordion>

  <Accordion title="AwaitingUpload">
    Job was created through the v2 presigned-upload flow and is waiting for the file to be uploaded. It transitions to `Queued` once the upload completes.
  </Accordion>

  <Accordion title="Queued">
    Job is waiting to be picked up by a worker.
  </Accordion>

  <Accordion title="Processing">
    Job is currently being processed. This includes PDF parsing, text extraction, image analysis, table detection, and OCR processing.
  </Accordion>

  <Accordion title="Succeeded">
    Job has completed successfully. The response includes the complete analysis results with all extracted data, images, and metadata.
  </Accordion>

  <Accordion title="Failed">
    Job failed during processing. Check the `message` field for details about what went wrong.
  </Accordion>

  <Accordion title="Cancelled">
    Job was cancelled before processing completed.
  </Accordion>
</AccordionGroup>
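
For polling code, the statuses above fall into two groups: those where the job can still change state and those where it cannot. A minimal sketch of that split follows. The helper name is hypothetical, and treating `Cancelled` as terminal is an assumption based on its description rather than a documented guarantee.

```python
TERMINAL_STATUSES = {"Succeeded", "Failed", "Cancelled"}
IN_PROGRESS_STATUSES = {"AwaitingUpload", "Starting", "Queued", "Processing"}


def should_keep_polling(status):
    """Return True while the job can still change state.

    Raises ValueError for a status this page does not list, so a new
    server-side status is surfaced instead of being polled forever.
    """
    if status in TERMINAL_STATUSES:
        return False
    if status in IN_PROGRESS_STATUSES:
        return True
    raise ValueError(f"Unrecognized job status: {status!r}")
```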

## Polling Strategy

For long-running parsing jobs, implement a polling strategy to check status periodically:

```python theme={null}
import requests
import time

def poll_parse_job(job_id, api_key, max_wait_time=300, poll_interval=5):
    """Poll a parsing job until completion or timeout"""
    
    start_time = time.time()
    headers = {"api-key": api_key}
    
    while time.time() - start_time < max_wait_time:
        response = requests.get(
            f"https://prod.visionapi.unsiloed.ai/parse/{job_id}",
            headers=headers,
            timeout=30,
        )

        if response.status_code == 200:
            job = response.json()
            status = job['status']

            if status == 'Succeeded':
                return job
            if status in ('Failed', 'Cancelled'):
                # Terminal states: further polling will not change the outcome
                raise Exception(f"Job {status.lower()}: {job.get('message', 'Unknown error')}")
            if status in ('AwaitingUpload', 'Starting', 'Queued', 'Processing'):
                print(f"Job status: {status} - waiting...")
            else:
                print(f"Unknown status: {status}")
        elif response.status_code in (401, 403, 404):
            # Auth, permission, and not-found errors will not resolve by retrying
            raise Exception(f"Cannot poll job: {response.status_code} {response.text}")
        else:
            print(f"Error checking status: {response.status_code}")

        time.sleep(poll_interval)

    raise Exception("Job polling timed out")

# Usage
try:
    result = poll_parse_job("04a7a6d8-5ef7-465a-b22a-8a98e7104dd9", "your-api-key")
    print("Job completed successfully!")
    print(f"Total chunks: {result['total_chunks']}")
except Exception as e:
    print(f"Error: {e}")
```

## Segment Types

When a job succeeds, each chunk in the response contains one or more segments, and every segment is labeled with one of the following types:

### Title

Top-level document titles, distinct from section headers.

### SectionHeader

Document headers and titles that define section boundaries.

### Text

Regular text content including paragraphs, sentences, and individual text elements.

### ListItem

Individual items within ordered or unordered lists.

### Table

Tabular data with structured rows and columns.

### Picture

Images and graphics within the document, including logos, charts, and illustrations.

### Caption

Text captions associated with images or figures.

### Formula

Mathematical or chemical formulas detected within the document.

### Footnote

Footnote text appearing at the bottom of a page.

### PageHeader

Recurring header content appearing at the top of pages.

### PageFooter

Recurring footer content appearing at the bottom of pages.

### Page

A full-page segment when the document is processed without fine-grained layout analysis.

Each segment can include the following fields (some appear only for certain segment types or job settings):

* **segment\_type**: Type of content detected
* **content**: Extracted text content
* **image**: URL to extracted image (if applicable)
* **page\_number**: Page where the segment appears
* **confidence**: Confidence score for the extraction
* **bbox**: Precise coordinates of the segment
* **html**: HTML-formatted content
* **markdown**: Markdown-formatted content
* **ocr**: Detailed OCR data with individual text elements
* **chart\_data**: Structured chart data (chart type, series, labels, legend) for Picture segments identified as charts, when chart extraction is enabled
* **cell\_references**: Spreadsheet cell-range references (`{sheet, address, ref}`) for Excel segments
* **references**: Research-paper citations attached to the segment, when `xml_citation` is enabled
* **merged\_page\_bboxes**: Per-page bounding boxes for tables merged across page breaks, when `merge_tables` is enabled
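
As an illustration of walking this structure, the sketch below filters segments by `segment_type`. `segments_of_type` is a hypothetical helper, and `sample_job` is a trimmed stand-in for a real response, not actual API output.

```python
def segments_of_type(job, segment_type):
    """Yield every segment of the given type from a job's chunks."""
    for chunk in job.get("chunks") or []:
        for segment in chunk.get("segments", []):
            if segment.get("segment_type") == segment_type:
                yield segment


# Trimmed stand-in for a succeeded job response (illustrative values only)
sample_job = {
    "status": "Succeeded",
    "chunks": [
        {"segments": [
            {"segment_type": "Picture", "page_number": 1, "markdown": "Eternal Logo"},
            {"segment_type": "Text", "page_number": 1, "markdown": "Intro paragraph"},
        ]},
        {"segments": [
            {"segment_type": "Table", "page_number": 2, "markdown": "| a | b |"},
        ]},
    ],
}

for seg in segments_of_type(sample_job, "Table"):
    print(seg["page_number"], seg["markdown"])
```

The same filtering can be done server-side with the `segment_filter` query parameter; doing it client-side is useful when one response feeds several consumers.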

## Error Handling

### Common Error Scenarios

1. **Job Not Found**: Invalid or expired job ID returns a `404` response.
2. **Unauthorized**: Missing or invalid API key returns a `401` response.
3. **Forbidden**: Valid API key but no permission to access this task returns a `403` response.
4. **Rate Limiting**: This GET endpoint is not rate limited by the application; only the submit endpoints are. Poll responsibly regardless.
5. **Client-Side Polling Timeout**: The job did not complete within the time your polling logic allows. This is not a server-returned error; implement a reasonable client-side timeout and handle it gracefully.
6. **Server Error**: Internal processing error returns a `500` response.
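
The error examples on this page come in three body shapes: a JSON object with `error.message`, a bare JSON string, and plain text. A defensive client can normalize them as sketched below; `extract_error_message` is a hypothetical helper, and the shapes it handles are taken only from those examples.

```python
import json


def extract_error_message(body_text):
    """Pull a human-readable message out of an error response body."""
    try:
        payload = json.loads(body_text)
    except ValueError:
        # Not JSON at all: plain text, as in the 403 example
        return body_text.strip()

    if isinstance(payload, dict):
        error = payload.get("error")
        if isinstance(error, dict) and "message" in error:
            return str(error["message"])
        return body_text.strip()

    if isinstance(payload, str):
        # Bare JSON string, as in the 401 example
        return payload

    return body_text.strip()
```

With `requests`, this would be called as `extract_error_message(response.text)` for any non-200 response.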

### Best Practices

* **Polling Frequency**: Check status every 5-10 seconds for long-running jobs
* **Timeout Handling**: Implement reasonable timeouts to prevent infinite polling
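* **Error Recovery**: Retry transient `500` responses while polling; a job in the `Failed` state will not recover by polling, so read its `message` and resubmit the document if appropriate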
* **API Key Security**: Keep your API key secure and never expose it in client-side code
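
One way to combine the polling-frequency and timeout advice is to precompute the sleep intervals so they stay within the recommended range and never exceed an overall budget. The generator below is a sketch under those assumptions; `poll_delays` and its default values are illustrative choices, not API requirements.

```python
def poll_delays(max_wait=300.0, first=5.0, cap=10.0, factor=1.5):
    """Yield sleep intervals that start at `first`, grow by `factor`
    up to `cap`, and never add up to more than `max_wait` seconds."""
    elapsed = 0.0
    delay = first
    while elapsed < max_wait:
        # Shorten the final step so the total never overshoots max_wait
        step = min(delay, max_wait - elapsed)
        yield step
        elapsed += step
        delay = min(delay * factor, cap)


for delay in poll_delays(max_wait=30):
    print(delay)
```

A polling loop can iterate over `poll_delays(...)`, check the job once per iteration, and call `time.sleep(delay)` between checks; when the generator is exhausted, the client-side timeout has been reached.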

## Rate Limits

* **Concurrent Jobs**: Limited number of active parsing jobs per API key
* **Request Frequency**: Avoid excessive polling (recommended: 5-10 second intervals)

Check your API plan for specific limits and quotas.


## OpenAPI

````yaml api-reference/parser/openapi-v1.json GET /parse/{job_id}
openapi: 3.1.0
info:
  title: Unsiloed Parser API — v1
  description: >-
    The original document parsing API. Accepts multipart file uploads and
    URL-based processing. These endpoints have no version prefix in their URLs
    and are stable indefinitely.
  contact:
    name: Unsiloed
    url: https://unsiloed.ai
    email: hello@unsiloed.com
  license:
    name: ''
  version: 1.0.0
servers:
  - url: https://prod.visionapi.unsiloed.ai
    description: Production
security: []
tags:
  - name: Authentication
    description: API key management endpoints
  - name: Health
    description: Endpoint for checking the health of the service.
  - name: Parse (Vision-API Compatible)
    description: >-
      Vision-API compatible endpoints for parsing - accepts multipart form data
      with Vision-API parameter names
paths:
  /parse/{job_id}:
    get:
      tags:
        - Parse (Vision-API Compatible)
      summary: GET /parse/{job_id}
      description: |-
        Get the status and results of a processing task.
        Returns task information in Vision-API compatible format.
      operationId: get_parse_task
      parameters:
        - name: job_id
          in: path
          description: Job ID returned by `POST /parse`.
          required: true
          schema:
            type: string
        - name: include_chunks
          in: query
          description: Include the `chunks` array in the response. Defaults to `true`.
          required: false
          schema:
            type: boolean
        - name: base64_urls
          in: query
          description: >-
            Return segment images as base64-encoded data URIs instead of S3
            presigned URLs.

            Defaults to `false`.
          required: false
          schema:
            type: boolean
        - name: output_file
          in: query
          description: >-
            Return a presigned S3 URL to the raw output JSON file instead of
            inlining the

            full response body. Also accepted as the `output-file` request
            header.

            Defaults to `false`.
          required: false
          schema:
            type: boolean
        - name: enhanced_table
          in: query
          description: >-
            Apply enhanced table post-processing when assembling the response —
            improves

            cell-merge accuracy and structure recovery for complex tables, at
            the cost

            of extra latency. Also accepted as the `enhanced-table` request
            header.

            Defaults to `false`.
          required: false
          schema:
            type: boolean
        - name: merge_tables
          in: query
          description: >-
            Apply the cross-page table-merge post-processing pass when
            assembling the

            response. Has no effect unless the job was parsed with
            `merge_tables=true`

            (the merge work runs at parse time). Defaults to `false`.
          required: false
          schema:
            type: boolean
        - name: include_url
          in: query
          description: >-
            Include file URLs (`pdf_url`, `file_url`, segment images, exports,

            `configuration.input_file_url`) in the response. When `false`
            (default),

            every URL-bearing field is nulled so the response — and any log of
            it —

            does not leak the storage bucket/region/path. Also accepted as the

            `include-url` request header.
          required: false
          schema:
            type: boolean
      responses:
        '200':
          description: >-
            Job status and results. Output fields (`chunks`, `total_chunks`,
            `page_count`, `pdf_url`) are present only when status is
            `Succeeded`.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ParseGetResponse'
        '401':
          description: Unauthorized
        '403':
          description: Forbidden — you don't have permission to access this task.
        '404':
          description: Job not found.
          content:
            text/plain:
              schema:
                type: string
        '429':
          description: Rate limit exceeded.
        '500':
          description: Internal server error.
          content:
            text/plain:
              schema:
                type: string
      security:
        - api_key: []
components:
  schemas:
    ParseGetResponse:
      type: object
      description: >-
        Response body for `GET /parse/{job_id}`.


        Fields marked as optional appear only when the job has reached the
        relevant status.
      required:
        - job_id
        - status
        - created_at
        - metadata
      properties:
        chunks:
          type:
            - array
            - 'null'
          items:
            $ref: '#/components/schemas/Chunk'
          description: >-
            Array of document chunks with segments and extracted content.
            Present when status is `Succeeded`.
        configuration:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/Configuration'
              description: >-
                Configuration used for this job (mirrors the parameters
                submitted at creation time).

                The effective `merge_tables` value lives at
                `configuration.merge_tables`.
        created_at:
          type: string
          description: ISO 8601 timestamp when the job was created.
        credit_used:
          type:
            - integer
            - 'null'
          format: int64
          description: Credits used for this job.
        exports:
          type:
            - object
            - 'null'
          description: >-
            Presigned download URLs for exported file formats.

            Only present when `export_format` was specified in the parse request
            and the export has completed.

            Keys are format names (e.g. `"docx"`), values are presigned S3 URLs
            valid for 1 hour.

            Example: `{"docx": "https://s3.amazonaws.com/..."}`.

            If export failed, contains `{"docx_error": "..."}` instead.
          additionalProperties:
            type: string
          propertyNames:
            type: string
        file_name:
          type:
            - string
            - 'null'
          description: Original file name from the job record.
        file_type:
          type:
            - string
            - 'null'
          description: MIME type of the uploaded file.
        file_url:
          type:
            - string
            - 'null'
          description: S3 URL of the original uploaded file.
        finished_at:
          type:
            - string
            - 'null'
          description: >-
            ISO 8601 timestamp when processing completed. Present when status is
            `Succeeded` or `Failed`.
        job_id:
          type: string
          description: Job identifier.
        message:
          type:
            - string
            - 'null'
          description: Error or status detail message. Present when status is `Failed`.
        metadata:
          type: object
          description: >-
            Citation or job metadata. Populated when `xml_citation` is enabled
            or from the job record.
        page_count:
          type:
            - integer
            - 'null'
          format: int64
          description: Number of pages in the document. Present when status is `Succeeded`.
        pdf_url:
          type:
            - string
            - 'null'
          description: >-
            Presigned S3 URL to the generated PDF. Present when status is
            `Succeeded`.
        started_at:
          type:
            - string
            - 'null'
          description: >-
            ISO 8601 timestamp when processing started. Present when status is
            not `Starting`.
        status:
          type: string
          description: >-
            Current job status: `Starting`, `Queued`, `Processing`, `Succeeded`,
            `Failed`, or `Cancelled`.
        total_chunks:
          type:
            - integer
            - 'null'
          format: int64
          description: Total number of document chunks. Present when status is `Succeeded`.
    Chunk:
      type: object
      required:
        - chunk_length
        - segments
      properties:
        chunk_id:
          type: string
          description: The unique identifier for the chunk.
        chunk_length:
          type: integer
          format: int32
          description: >-
            The total number of tokens in the chunk. Calculated by the
            `tokenizer`.
          minimum: 0
        embed:
          type:
            - string
            - 'null'
          description: >-
            Suggested text to be embedded for the chunk. This text is generated
            by combining the embed content

            from each segment according to the configured embed sources (HTML,
            Markdown, LLM, or Content).

            Can be configured using `embed_sources` in the `SegmentProcessing`
            configuration.
        segments:
          type: array
          items:
            $ref: '#/components/schemas/Segment'
          description: >-
            Collection of document segments that form this chunk.

            When `target_chunk_length` > 0, contains the maximum number of
            segments

            that fit within that length (segments remain intact).

            Otherwise, contains exactly one segment.
    Configuration:
      type: object
      required:
        - chunk_processing
        - high_resolution
        - ocr_strategy
        - ocr_engine
        - segment_analysis
        - layout_analysis
        - segment_type_naming
        - error_handling
      properties:
        agentic_ocr:
          type:
            - string
            - 'null'
          description: >-
            Agentic OCR engine to use. None = disabled. Some("standard") or
            Some("advanced").
        chunk_processing:
          $ref: '#/components/schemas/ChunkProcessing'
        detect_pii:
          type: boolean
          description: >-
            When true, run a PII detection pass before parsing. If PII is found
            at

            or above `pii_block_severity`, the task is rejected and no parsing
            occurs.
        enhance_reading_order:
          type: boolean
          description: >-
            Fix reading order of segments after smart layout detection. Default:
            false.
        error_handling:
          $ref: '#/components/schemas/ErrorHandlingStrategy'
        excel_config:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/ExcelConfiguration'
              description: >-
                Excel-specific config, set only for tasks created via `POST
                /parse/excel`.

                `None` for every PDF/image/office task. Forwarded to the Excel
                pipeline service.
        expires_in:
          type:
            - integer
            - 'null'
          format: int32
          description: >-
            The number of seconds until task is deleted.

            Expired tasks can **not** be updated, polled or accessed via web
            interface.
        export_format:
          type:
            - array
            - 'null'
          items:
            $ref: '#/components/schemas/ExportFormat'
          description: >-
            Export format(s) requested for this task. When present, the pipeline
            generates

            export files after processing. Currently supported: `["docx"]`.
        extract_colors:
          type: boolean
          description: >-
            Whether to extract and apply color information from PDF text layer
            to OCR results.

            When enabled, text color (RGB/hex) from the PDF is transferred to
            matching OCR words.

            Default: false
        extract_links:
          type: boolean
          description: >-
            Whether to extract and apply hyperlink annotations from PDF to OCR
            results.

            When enabled, link URLs from PDF annotations are attached to
            matching OCR words.

            Default: false
        extract_strikethrough:
          type: boolean
          description: >-
            Preserve strikethrough formatting in HTML/Markdown output. Default:
            false.
        high_resolution:
          type: boolean
          description: >-
            Whether to use high-resolution images for cropping and
            post-processing.
        input_file_url:
          type:
            - string
            - 'null'
          description: The presigned URL of the input file.
        json_schema:
          deprecated: true
        layout_analysis:
          $ref: '#/components/schemas/SegmentationStrategy'
        merge_batch_size:
          type: integer
          format: int32
          description: >-
            Maximum number of tables allowed in a single merge group.

            Groups larger than this are split into sub-groups of this size.

            Defaults to 20. Helps avoid Gemini rate limits on very large
            documents.
        merge_tables:
          type: boolean
          description: Whether to merge adjacent tables using a VLM.
        model:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/Model'
        ocr_engine:
          $ref: '#/components/schemas/OcrEngine'
        ocr_strategy:
          $ref: '#/components/schemas/OcrStrategy'
        output_fields:
          $ref: '#/components/schemas/OutputFieldsConfig'
          description: >-
            Configuration for controlling which output fields are included in
            responses.

            Allows selective inclusion/exclusion of HTML, Markdown, OCR, images,
            etc.

            Omitted from the echoed configuration when it equals the all-true
            default, so that a `slim` response

            does not carry a misleading "everything on" block.
        page_numbers:
          type:
            - array
            - 'null'
          items:
            type: integer
            format: int32
            minimum: 0
          description: |-
            Explicit mapping of array indices to actual page numbers (1-based).
            Used when parsing selective pages (e.g., [4, 5, 8, 9, 10, 11, 12]).
            Populated from the page_range parameter.
        page_range:
          type:
            - string
            - 'null'
          description: >-
            Optional page range to process (e.g., "1-5,8,10-12"). If null, all
            pages are processed.
        pii_block_severity:
          $ref: '#/components/schemas/PiiBlockSeverity'
          description: >-
            Severity threshold at which to block parsing. Ignored if
            `detect_pii` is false.
        pii_engine:
          $ref: '#/components/schemas/PiiEngine'
          description: Which PII detector engine to use. Ignored if `detect_pii` is false.
        response_profile:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/ResponseProfile'
              description: >-
                Optional high-level response shape selector.

                When omitted or `null` (default), the legacy behavior applies
                and the response is driven entirely by `output_fields`.

                `slim` returns a minimal payload; `full` forces every field on;
                `custom` honors `output_fields` verbatim.
        segment_analysis:
          $ref: '#/components/schemas/SegmentProcessing'
        segment_filter:
          type: string
          description: |-
            Segment filter: comma-separated segment types to keep, or "all".
            Replaces the old metadata-based keep_segment_types storage.
        segment_type_naming:
          $ref: '#/components/schemas/SegmentTypeNamingConvention'
          description: Controls the naming convention for segment types in the output
        target_chunk_length:
          type:
            - integer
            - 'null'
          format: int32
          description: >-
            The target number of words in each chunk. If 0, each chunk will
            contain a single segment.
          deprecated: true
          minimum: 0
        validate_segments:
          type: array
          items:
            $ref: '#/components/schemas/SegmentType'
        xml_citation:
          type: boolean
          description: Whether to extract structured citation data
    Segment:
      type: object
      required:
        - content
        - html
        - markdown
        - page_height
        - page_width
        - page_number
        - segment_id
        - segment_type
      properties:
        bbox:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/BoundingBox'
        cell_references:
          type:
            - array
            - 'null'
          items: {}
          description: >-
            Spreadsheet cell-range references ({ sheet, address, ref }) for
            Excel

            segments. Distinct from `references` (GROBID citations) so the two

            concepts never collide in the output or on the frontend.
        chart_data:
          description: >-
            Extracted chart data for Picture segments identified as charts.

            Contains structured data including chart_type, data series, labels,
            and legend.
        confidence:
          type:
            - number
            - 'null'
          format: float
          description: Confidence score of the layout analysis model
        content:
          type: string
          description: >-
            Text content of the segment. By default, this contains OCR-extracted
            text.

            The source can be configured via the `content_source` field in the
            segment processing configuration

            to use the HTML, Markdown, or VLM representation instead.
        html:
          type: string
          description: HTML representation of the segment.
        image:
          type:
            - string
            - 'null'
          description: Presigned URL to the image of the segment.
        markdown:
          type: string
          description: Markdown representation of the segment.
        merged_page_bboxes:
          type:
            - array
            - 'null'
          items:
            $ref: '#/components/schemas/PageBBox'
          description: >-
            Per-page bbox info for all pages of a merged multi-page table.

            Only present on merged table segments (segment_id ends with
            "_merged").
        ocr:
          type:
            - array
            - 'null'
          items:
            $ref: '#/components/schemas/OCRResult'
          description: OCR results for the segment.
        page_height:
          type: number
          format: float
          description: Height of the page containing the segment.
        page_number:
          type: integer
          format: int32
          description: Page number of the segment.
          minimum: 0
        page_width:
          type: number
          format: float
          description: Width of the page containing the segment.
        references:
          type:
            - array
            - 'null'
          items: {}
          description: GROBID research-paper citations attached to this segment.
        segment_id:
          type: string
          description: Unique identifier for the segment.
        segment_type:
          $ref: '#/components/schemas/SegmentType'
        segment_type_display:
          type:
            - string
            - 'null'
          description: >-
            Display name for the segment type (used for serialization based on
            naming convention)
    ChunkProcessing:
      type: object
      description: Controls the settings for chunking and for the post-processing of each chunk.
      properties:
        ignore_headers_and_footers:
          type: boolean
          description: Whether to ignore headers and footers in the chunking process.
          default: false
        target_length:
          type: integer
          format: int32
          description: >-
            The target number of words in each chunk. If 0, each chunk will
            contain a single segment.
          default: 512
          minimum: 0
        tokenizer:
          oneOf:
            - $ref: '#/components/schemas/TokenizerType'
              description: The tokenizer to use for the chunking process.
          default: Word
    ErrorHandlingStrategy:
      type: string
      description: >-
        Controls how errors are handled during processing:

        - `Fail`: Stops processing and fails the task when any error occurs

        - `Continue`: Attempts to continue processing despite non-critical
        errors (e.g., LLM refusals)
      enum:
        - Fail
        - Continue
    ExcelConfiguration:
      type: object
      description: >-
        Excel-specific parsing configuration, forwarded verbatim to the Excel
        pipeline

        service (`services/Excel_pipeline` `POST /parse`). These are the ONLY
        knobs the

        Excel pipeline honors — none of the PDF parsing fields (OCR, layout,
        segments)

        apply. Defaults mirror the service's own `SpreadsheetConfig` defaults so
        an

        omitted value behaves identically to not setting it. Set on a task via
        the

        dedicated `POST /parse/excel` endpoint; `null` for every non-Excel task.
      properties:
        cell_colors:
          type: boolean
          description: >-
            Include cell background/font colors (only when `cell_metadata` is
            true). Default true.
        cell_metadata:
          type: boolean
          description: >-
            Include cell color, formula, and dropdown metadata in the output.
            Default false.
        dropdowns:
          type: boolean
          description: >-
            Include data-validation dropdown options (only when
            `cell_metadata` is true). Default true.
        exclude_hidden:
          type: boolean
          description: Drop hidden sheets/rows/cols/styling from the output. Default false.
        exclude_hidden_cols:
          type: boolean
          description: >-
            When excluding hidden content, also drop hidden columns. Default
            true.
        exclude_hidden_rows:
          type: boolean
          description: When excluding hidden content, also drop hidden rows. Default true.
        exclude_hidden_sheets:
          type: boolean
          description: >-
            When excluding hidden content, also drop hidden sheets. Default
            true.
        exclude_images:
          type: boolean
          description: >-
            When excluding hidden content, also drop embedded/pasted images.
            Default false.
        exclude_styling:
          type: boolean
          description: When excluding hidden content, also drop styling. Default true.
        formulas:
          type: boolean
          description: Include cell formulas (only when `cell_metadata` is true). Default true.
        max_rows_per_segment:
          type: integer
          format: int32
          description: Max rows per split segment when `split_large_tables` is true. Default 50.
        split_large_tables:
          type: boolean
          description: Split large tables into smaller segments. Default true.
        table_clustering:
          type: string
          description: 'Table clustering effort: `accurate` (default), `fast`, or `off`.'
    ExportFormat:
      type: string
      description: >-
        File format for exporting parsed results. When specified in a parse
        request,

        the pipeline generates the requested export file after processing
        completes.

        The exported file is available via the `exports` field in the task
        response.
      enum:
        - docx
        - markdown
        - json
    SegmentationStrategy:
      type: string
      description: >-
        Controls the segmentation strategy:

        - `LayoutAnalysis` (wire value `smart_layout_detection`): Analyzes pages
        for layout
          elements (e.g., `Table`, `Picture`, `Formula`, etc.) using bounding boxes. Provides
          fine-grained segmentation and better chunking. (Latency penalty: ~TBD seconds per page).
        - `Page` (wire value `page_by_page`): Treats each page as a single
        segment. Faster
          processing, but without layout element detection and only simple chunking.
        - `AdvancedLayoutAnalysis` (wire value `advanced_layout_detection`):
        Higher-accuracy
          layout detection for complex pages (multi-column layouts, dense tables/figures);
          slower than `LayoutAnalysis`.
      enum:
        - smart_layout_detection
        - page_by_page
        - advanced_layout_detection
    Model:
      type: string
      enum:
        - Fast
        - HighQuality
      deprecated: true
    OcrEngine:
      type: string
      description: >-
        Controls which OCR engine to use for text recognition:

        - `UnsiloedStorm`: Enterprise-grade accuracy, optimized for 50+
        languages

        - `UnsiloedHawk`: Higher accuracy, better for complex layouts [DEFAULT]

        - `UnsiloedBeta`: Handles irregular bounding boxes, rotated/warped text
      enum:
        - UnsiloedStorm
        - UnsiloedHawk
        - UnsiloedBeta
    OcrStrategy:
      type: string
      description: >-
        Controls the Optical Character Recognition (OCR) strategy.

        - `All` (wire value `force_ocr`): Processes all pages with OCR.
        (Latency penalty: ~0.5 seconds per page)

        - `Auto` (wire value `auto_detection`): Selectively applies OCR only to
        pages with missing or low-quality text. When a text layer is present,
        the bounding boxes from the text layer are used.
      enum:
        - force_ocr
        - auto_detection
    OutputFieldsConfig:
      type: object
      description: >-
        Per-field include/exclude filter for the response. Each field defaults
        to

        `true`; set a field to `false` to drop it. Honored when
        `response_profile`

        is omitted or set to `custom`. Ignored when `response_profile` is `slim`

        or `full` (the profile wins).
      properties:
        bbox:
          type: boolean
          description: Include bounding box information in segment responses
        chart_data:
          type: boolean
          description: >-
            Include extracted chart data for Picture segments identified as
            charts
        confidence:
          type: boolean
          description: Include confidence scores in segment responses
        content:
          type: boolean
          description: Include text content in segment responses
        embed:
          type: boolean
          description: Include embed text in chunk responses
        html:
          type: boolean
          description: Include HTML representation in segment responses
        image:
          type: boolean
          description: Include images (cropped segment images) in segment responses
        markdown:
          type: boolean
          description: Include Markdown representation in segment responses
        ocr:
          type: boolean
          description: Include OCR results in segment responses
    PiiBlockSeverity:
      type: string
      description: >-
        Severity threshold at which the task is rejected for PII.

        - `Any`: block on any PII found (strictest).

        - `Low`: block on quasi-identifiers (names, dates, locations) or higher.

        - `Medium`: block on contact PII (email, phone) or higher.

        - `High`: block only on direct identifiers (SSN, passport, credit card,
        etc.).
      enum:
        - any
        - low
        - medium
        - high
    PiiEngine:
      type: string
      description: >-
        Engine used by the PII detector service.

        - `Standard`: regex + spaCy NER.

        - `Advanced`: VLM-based judgment via OpenAI-compatible API. Higher
        precision, costs per call.
      enum:
        - standard
        - advanced
    ResponseProfile:
      type: string
      description: |-
        High-level response shape selector. Optional. When set to `slim` or
        `full`, it overrides `OutputFieldsConfig` (the profile wins).

        - `Slim`: chunk `embed` + bbox + `segment_id` + `segment_type` +
          `page_number` + one representation per segment (HTML for `Table`,
          Markdown otherwise). Drops `content`, `image`, `ocr`, `confidence`,
          `chart_data`, `page_height`, `page_width`, and null `references`.
        - `Full`: every field returned. `output_fields` is ignored.
        - `Custom`: honor the caller's `OutputFieldsConfig` verbatim.
      enum:
        - slim
        - full
        - custom
      example: slim
    SegmentProcessing:
      type: object
      properties:
        Caption:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/AutoGenerationConfig'
        Footnote:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/AutoGenerationConfig'
        Formula:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/LlmGenerationConfig'
        ListItem:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/AutoGenerationConfig'
        Page:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/LlmGenerationConfig'
        PageFooter:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/AutoGenerationConfig'
        PageHeader:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/AutoGenerationConfig'
        Picture:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/PictureGenerationConfig'
        SectionHeader:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/AutoGenerationConfig'
        Table:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/LlmGenerationConfig'
        Text:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/AutoGenerationConfig'
        Title:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/AutoGenerationConfig'
    SegmentTypeNamingConvention:
      type: string
      description: >-
        Controls the naming convention for segment types in the output:

        - `Unsiloed`: Uses Unsiloed naming convention (e.g., `PageHeader`,
        `PageFooter`, `SectionHeader`, `ListItem`, `Picture`)

        - `Other`: Uses alternative naming convention (e.g., `Header`, `Footer`,
        `Section Header`, `List Item`, `Figure`)
      enum:
        - Unsiloed
        - Other
    SegmentType:
      type: string
      description: |-
        All the possible types for a segment.
        Note: Different configurations will produce different types.
        Please refer to the documentation for more information.
      enum:
        - Caption
        - Footnote
        - Formula
        - KeyValuePair
        - ListItem
        - Page
        - PageFooter
        - PageHeader
        - Picture
        - Seal
        - SectionHeader
        - Signature
        - Table
        - Text
        - Title
    BoundingBox:
      type: object
      description: >-
        Bounding box for an item. It is used for chunks, segments and OCR
        results.
      required:
        - left
        - top
        - width
        - height
      properties:
        height:
          type: number
          format: float
          description: The height of the bounding box.
        left:
          type: number
          format: float
          description: The left coordinate of the bounding box.
        top:
          type: number
          format: float
          description: The top coordinate of the bounding box.
        width:
          type: number
          format: float
          description: The width of the bounding box.
    PageBBox:
      type: object
      description: Bbox and page info for one page of a merged multi-page table.
      required:
        - page_number
        - page_height
        - page_width
        - bbox
      properties:
        bbox:
          $ref: '#/components/schemas/BoundingBox'
        page_height:
          type: number
          format: float
        page_number:
          type: integer
          format: int32
          minimum: 0
        page_width:
          type: number
          format: float
    OCRResult:
      type: object
      description: OCR results for a segment
      required:
        - bbox
        - text
      properties:
        bbox:
          $ref: '#/components/schemas/BoundingBox'
        color:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/TextColor'
              description: Optional text color (RGB) provided by OCR engine
        confidence:
          type:
            - number
            - 'null'
          format: float
          description: The confidence score of the recognized text.
        link:
          type:
            - string
            - 'null'
          description: >-
            Optional link URL associated with this OCR result (from PDF
            annotations)
        text:
          type: string
          description: The recognized text of the OCR result.
    TokenizerType:
      oneOf:
        - type: object
          description: Use one of the predefined tokenizer types
          required:
            - Enum
          properties:
            Enum:
              $ref: '#/components/schemas/Tokenizer'
              description: Use one of the predefined tokenizer types
        - type: object
          description: Use a custom tokenizer by specifying its model ID
          required:
            - String
          properties:
            String:
              type: string
              description: Use a custom tokenizer by specifying its model ID
      description: |-
        Specifies which tokenizer to use for the chunking process.

        This type supports two ways of specifying a tokenizer:
        1. Using a predefined tokenizer from the `Tokenizer` enum
        2. Using a custom tokenizer by providing its model ID as a string
    AutoGenerationConfig:
      type: object
      properties:
        content_source:
          oneOf:
            - $ref: '#/components/schemas/ContentSource'
              description: Controls which source is used to populate the `content` field
          default: OCR
        crop_image:
          oneOf:
            - $ref: '#/components/schemas/CroppingStrategy'
          default: Auto
        embed_sources:
          type: array
          items:
            $ref: '#/components/schemas/EmbedSource'
          default:
            - Markdown
        extended_context:
          type: boolean
          description: Use the full page image as context for VLM generation
          default: false
        html:
          oneOf:
            - $ref: '#/components/schemas/GenerationStrategy'
          default: Auto
        markdown:
          oneOf:
            - $ref: '#/components/schemas/GenerationStrategy'
          default: Auto
        translation:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/TranslationConfig'
              description: Optional translation configuration for this segment type
        vlm:
          type:
            - string
            - 'null'
    LlmGenerationConfig:
      type: object
      properties:
        content_source:
          oneOf:
            - $ref: '#/components/schemas/ContentSource'
              description: >-
                Controls which source is used to populate the `content` field
                (default: OCR)
          default: OCR
        crop_image:
          oneOf:
            - $ref: '#/components/schemas/CroppingStrategy'
          default: Auto
        embed_sources:
          type: array
          items:
            $ref: '#/components/schemas/EmbedSource'
          default:
            - Markdown
        extended_context:
          type: boolean
          description: Use the full page image as context for VLM generation
          default: false
        html:
          oneOf:
            - $ref: '#/components/schemas/GenerationStrategy'
          default: VLM
        markdown:
          oneOf:
            - $ref: '#/components/schemas/GenerationStrategy'
          default: VLM
        model_id:
          type:
            - string
            - 'null'
          description: Model ID for VLM generation
        translation:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/TranslationConfig'
              description: Optional translation configuration for this segment type
        use_table_ocr:
          type: boolean
          description: >-
            Use dedicated OCR for table processing to provide structured context
            to VLM (default: false)
          default: false
        vlm:
          type:
            - string
            - 'null'
          description: Prompt for the VLM model
    PictureGenerationConfig:
      type: object
      properties:
        content_source:
          oneOf:
            - $ref: '#/components/schemas/ContentSource'
              description: >-
                Controls which source is used to populate the `content` field
                (default: OCR)
          default: OCR
        crop_image:
          oneOf:
            - $ref: '#/components/schemas/PictureCroppingStrategy'
          default: All
        embed_sources:
          type: array
          items:
            $ref: '#/components/schemas/EmbedSource'
          default:
            - Markdown
        extended_context:
          type: boolean
          description: Use the full page image as context for VLM generation
          default: false
        html:
          oneOf:
            - $ref: '#/components/schemas/GenerationStrategy'
          default: Auto
        markdown:
          oneOf:
            - $ref: '#/components/schemas/GenerationStrategy'
          default: Auto
        model_id:
          type:
            - string
            - 'null'
          description: Model ID for VLM generation
        transcribe_text:
          type: boolean
          description: >-
            When true, text-bearing pictures (forms, handwritten pages, scanned

            documents, screenshots of text) are transcribed verbatim by the VLM

            instead of described/summarized. Only takes effect for segments that

            actually contain OCR text — true figures (photos, charts, diagrams
            with

            little text) still get the standard description. Default: false.
          default: false
        translation:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/TranslationConfig'
              description: Optional translation configuration for this segment type
        vlm:
          type:
            - string
            - 'null'
          description: Prompt for the VLM model
    TextColor:
      type: object
      description: Text color in RGB format
      required:
        - r
        - g
        - b
        - hex
      properties:
        b:
          type: integer
          format: int32
        g:
          type: integer
          format: int32
        hex:
          type: string
        r:
          type: integer
          format: int32
    Tokenizer:
      type: string
      description: Common tokenizers used for text processing.
      enum:
        - Word
        - Cl100kBase
        - XlmRobertaBase
        - BertBaseUncased
    ContentSource:
      type: string
      description: >-
        Controls which source is used to populate the `content` field in the
        segment output
      enum:
        - OCR
        - HTML
        - Markdown
        - VLM
    CroppingStrategy:
      type: string
      description: >-
        - `All` crops all images and includes URLs in response

        - `Auto` crops images if needed for VLM processing and includes URLs in
        response

        - `None` crops images if needed for VLM processing but does NOT include
        URLs in response
      enum:
        - All
        - Auto
        - None
    EmbedSource:
      type: string
      enum:
        - HTML
        - Markdown
        - VLM
        - Content
    GenerationStrategy:
      type: string
      enum:
        - VLM
        - Auto
    TranslationConfig:
      type: object
      required:
        - target_language
        - provider
      properties:
        model_id:
          type:
            - string
            - 'null'
          description: >-
            Optional model ID for VLM/LLM provider. If not specified, uses the
            default model.
        prompt:
          type:
            - string
            - 'null'
          description: >-
            Optional custom prompt for VLM/LLM provider. Appended to the system
            prompt

            to give the model additional translation instructions.
        provider:
          $ref: '#/components/schemas/TranslationProvider'
          description: >-
            Translation provider: "Auto" (or legacy "Google") for fast machine
            translation,

            "VLM" or "LLM" for LLM/VLM-based translation.
        target_language:
          type: string
          description: |-
            Target language code (ISO 639-1, e.g. "en", "es", "fr", "ko").
            Use "auto" to auto-detect source language and translate to English.
    PictureCroppingStrategy:
      type: string
      description: >-
        Controls the cropping strategy for an item (e.g. segment, chunk, etc.)

        - `All` crops all images and includes URLs in response

        - `Auto` crops images if needed for VLM processing and includes URLs in
        response

        - `None` crops images if needed for VLM processing but does NOT include
        URLs in response
      enum:
        - All
        - Auto
        - None
    TranslationProvider:
      type: string
      enum:
        - Auto
        - VLM
  securitySchemes:
    api_key:
      type: http
      scheme: bearer
      description: API key for authentication. Send it in the `Authorization` header as 'Bearer <your_api_key>'.

````
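
The `ResponseProfile` and `OutputFieldsConfig` descriptions in the spec above define a precedence rule: `slim` and `full` override `output_fields`, while an omitted profile or `custom` honors it. The following is a minimal client-side sketch of that rule in Python, useful for predicting which fields a response should carry. The helper names `effective_output_fields` and `slim_representation` are hypothetical and are not part of the API, and the `slim` mapping only covers the dropped fields that also appear in `OutputFieldsConfig`. The server's actual behavior may differ in details not stated in the spec.

```python
from typing import Optional

# Field names taken from the OutputFieldsConfig schema.
OUTPUT_FIELDS = (
    "bbox", "chart_data", "confidence", "content", "embed",
    "html", "image", "markdown", "ocr",
)

# Fields the `slim` description lists as dropped, restricted to those that
# are also OutputFieldsConfig flags (page_height/page_width are not flags).
SLIM_DROPPED = {"content", "image", "ocr", "confidence", "chart_data"}


def effective_output_fields(
    response_profile: Optional[str],
    output_fields: Optional[dict] = None,
) -> dict:
    """Hypothetical helper: return the flags that end up in effect.

    Mirrors the documented precedence only; it is not server code.
    """
    requested = output_fields or {}
    if response_profile in (None, "custom"):
        # Each field defaults to true; a caller sets false to drop it.
        return {name: bool(requested.get(name, True)) for name in OUTPUT_FIELDS}
    if response_profile == "full":
        # Every field returned; output_fields is ignored.
        return {name: True for name in OUTPUT_FIELDS}
    if response_profile == "slim":
        # The profile wins over output_fields.
        return {name: name not in SLIM_DROPPED for name in OUTPUT_FIELDS}
    raise ValueError(f"unknown response_profile: {response_profile!r}")


def slim_representation(segment_type: str) -> str:
    """Under `slim`, one representation per segment: HTML for Table,
    Markdown otherwise (as stated in the ResponseProfile description)."""
    return "html" if segment_type == "Table" else "markdown"


if __name__ == "__main__":
    # `slim` ignores the caller's request to keep OCR results.
    print(effective_output_fields("slim", {"ocr": True})["ocr"])
    # With no profile, output_fields is honored.
    print(effective_output_fields(None, {"html": False})["html"])
    print(slim_representation("Table"))
```

Keeping the rule in one small function makes it easy to adjust if the observed responses show the server treating a field differently from this reading of the spec.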