> ## Documentation Index > Fetch the complete documentation index at: https://docs.unsiloed.ai/llms.txt > Use this file to discover all available pages before exploring further. # Get Parse Result > Check the status and retrieve results of parsing jobs ## Overview The Get Parse Job Status endpoint allows you to check the current status of parsing jobs and retrieve the complete results when processing is complete. This endpoint is specifically designed for the parsing API and returns comprehensive document analysis including text extraction, image recognition, table parsing, and OCR data. Parsing jobs are processed asynchronously. Use this endpoint to poll for completion and retrieve results when the job status is "Succeeded". ## Parameters Job ID returned by `POST /parse`. Return segment images as base64-encoded data URIs instead of S3 presigned URLs. Defaults to `false`. Include the `chunks` array in the response. Defaults to `true`. Return a presigned S3 URL to the raw output JSON file instead of inlining the full response body. Defaults to `false`. Opt in to receiving file URLs (`pdf_url`, `file_url`, `output_file_url`, segment `image`, `configuration.input_file_url`) in the response. Defaults to `false`, in which case these fields are returned as `null` so the response (and any log of it) does not expose the storage bucket, region, or path. `exports` URLs are always returned regardless of this setting. Can also be set via the `include-url` header. Comma-separated segment types to keep in the response (alias: `keep_segment_types`). Omit to include every type. Return the cached cross-page merged result for jobs that were submitted with `merge_tables`. No merging happens at read time; the job's own setting takes precedence, so passing this for a job created without `merge_tables` has no effect. Defaults to `false`. ## Response Job identifier. Current job status: `Starting`, `Queued`, `Processing`, `Succeeded`, `Failed`, or `Cancelled`. Jobs created through the [v2 presigned-upload flow](/api-reference/parser/parse-document-v2) can also report `AwaitingUpload` before the file is uploaded. ISO 8601 timestamp when the job was created. Citation or job metadata. Populated when `xml_citation` is enabled or from the job record. ISO 8601 timestamp when processing started. `null` until a worker picks the job up (`Starting` and `Queued`). ISO 8601 timestamp when processing completed. Present when status is `Succeeded` or `Failed`. Total number of document chunks. `0` until status is `Succeeded`. Number of pages in the document. `0` until processing begins; populated while status is `Processing`. Array of document chunks with segments and extracted content. Empty until status is `Succeeded` (chunk content is only downloaded for succeeded jobs). Presigned S3 URL to the generated PDF. Returned as `null` unless `include_url=true` is set. Presigned download URLs for exported file formats. Only present when `export_format` was specified in the parse request and the export has completed. Keys are format names (e.g. `"docx"`), values are presigned S3 URLs valid for 1 hour. If export failed, contains `{"docx_error": "..."}` instead. Always returned with URLs, regardless of `include_url`. Original file name from the job record. MIME type of the uploaded file. S3 URL of the original uploaded file. Returned as `null` unless `include_url=true` is set. Credits used for this job. Status detail message, always present (e.g. "Task queued"). Carries the failure reason when status is `Failed`. Configuration used for this job, mirroring the parameters submitted at creation time. Whether table merging was enabled for this job. Two flags change the response shape for succeeded jobs: with `output_file=true` the parsed content is replaced by a `result` object carrying `output_file_url` (a presigned URL to the raw output JSON) instead of inline `chunks`; and a `merge_tables` job whose merged result is already available may return the content nested under an `output` key with a `task_id` instead of the standard top-level fields. ```bash cURL theme={null} curl -X 'GET' \ 'https://prod.visionapi.unsiloed.ai/parse/04a7a6d8-5ef7-465a-b22a-8a98e7104dd9' \ -H 'accept: application/json' \ -H 'api-key: your-api-key' ``` ```python Python theme={null} import requests import time def get_parse_job_status(job_id, api_key): """Get the status of a parsing job""" url = f"https://prod.visionapi.unsiloed.ai/parse/{job_id}" headers = {"api-key": api_key} response = requests.get(url, headers=headers) if response.status_code == 200: job = response.json() print(f"Job Status: {job['status']}") print(f"Created: {job['created_at']}") if job['status'] == 'Succeeded': print("Job completed successfully!") print(f"Total chunks: {job['total_chunks']}") return job elif job['status'] == 'Failed': print(f"Job failed: {job.get('message', 'Unknown error')}") return None elif job['status'] in ['Starting', 'Processing']: print("Job is currently being processed...") return job else: print("Error:", response.status_code, response.text) return None # Usage job_id = "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9" result = get_parse_job_status(job_id, "your-api-key") ``` ```javascript JavaScript theme={null} async function getParseJobStatus(jobId, apiKey) { const response = await fetch(`https://prod.visionapi.unsiloed.ai/parse/${jobId}`, { headers: { 'accept': 'application/json', 'api-key': apiKey } }); if (response.ok) { const job = await response.json(); console.log('Job Status:', job.status); console.log('Created:', job.created_at); switch (job.status) { case 'Succeeded': console.log('Job completed successfully!'); console.log('Total chunks:', job.total_chunks); return job; case 'Failed': console.log('Job failed:', job.message || 'Unknown error'); return null; case 'Starting': case 'Processing': console.log('Job is currently being processed...'); return job; default: console.log('Unknown status:', job.status); return job; } } else { console.error('Failed to get job status:', await response.text()); return null; } } // Usage const jobId = '04a7a6d8-5ef7-465a-b22a-8a98e7104dd9'; const result = await getParseJobStatus(jobId, 'your-api-key'); ``` ```json Starting Job theme={null} { "job_id": "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9", "status": "Starting", "created_at": "2025-10-22T06:51:16.870302Z", "metadata": {} } ``` ```json Processing Job theme={null} { "job_id": "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9", "status": "Processing", "created_at": "2025-10-22T06:51:16.870302Z", "started_at": "2025-10-22T06:51:16.966136Z", "metadata": {} } ``` ```json Succeeded Job theme={null} { "job_id": "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9", "status": "Succeeded", "created_at": "2025-10-22T06:51:16.870302Z", "started_at": "2025-10-22T06:51:16.966136Z", "finished_at": "2025-10-22T06:57:19.821541Z", "metadata": {}, "total_chunks": 25, "page_count": 10, "pdf_url": "https://s3.us-east-1.amazonaws.com/unsiloed-bucket/output.pdf?...", "exports": { "docx": "https://s3.us-east-1.amazonaws.com/unsiloed-bucket/output.docx?..." }, "file_name": "document.pdf", "file_type": "application/pdf", "file_url": "https://s3.us-east-1.amazonaws.com/unsiloed-bucket/input/document.pdf", "credit_used": 10, "merge_tables": false, "configuration": { "high_resolution": false, "ocr_strategy": "auto_detection", "ocr_engine": "UnsiloedHawk", "layout_analysis": "smart_layout_detection", "segment_type_naming": "Unsiloed", "error_handling": "Continue", "agentic_ocr": null, "merge_tables": false, "extract_colors": false, "extract_links": false, "extract_strikethrough": false, "xml_citation": false, "chunk_processing": {}, "segment_analysis": {} }, "chunks": [ { "chunk_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "chunk_length": 128, "embed": "eternal logo the image displays a logo...", "segments": [ { "segment_type": "Picture", "content": "eternal", "image": "https://s3.us-east-1.amazonaws.com/unsiloed-bucket/...", "page_number": 1, "segment_id": "1c60ecbd-b6da-493c-b3f6-6849337a981f", "confidence": 0.6062846, "page_width": 1191.0, "page_height": 1684.0, "html": "

The image displays a logo...

", "markdown": "Eternal Logo\n\nThe image displays a logo...", "bbox": { "left": 72.92226, "top": 62.030334, "width": 230.36308, "height": 55.395317 }, "ocr": [ { "bbox": { "left": 63.753525, "top": 5.395447, "width": 164.45312, "height": 42.757812 }, "text": "eternal", "confidence": 0.9999992 } ] } ] } ] } ``` ```json Failed Job theme={null} { "job_id": "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9", "status": "Failed", "created_at": "2025-10-22T06:51:16.870302Z", "started_at": "2025-10-22T06:51:16.966136Z", "finished_at": "2025-10-22T06:51:25.123456Z", "metadata": {}, "message": "Failed to process document: File appears to be corrupted" } ``` ```json Error Response - Unauthorized theme={null} "Unauthorized" ``` ```text Error Response - Forbidden theme={null} You don't have permission to access this task ``` ```json Error Response - Task Not Found theme={null} { "error": { "code": "task_not_found", "message": "Task 04a7a6d8-5ef7-465a-b22a-8a98e7104dd9 not found", "details": { "task_id": "04a7a6d8-5ef7-465a-b22a-8a98e7104dd9" } } } ``` ```json Error Response - Server Error theme={null} { "error": { "code": "internal_error", "message": "Unable to load this parse task right now. Please try again shortly." } } ``` ## Job Status Values Job has been created and is waiting to be processed. This is the initial status when a parsing job is first created. Job was created through the v2 presigned-upload flow and is waiting for the file to be uploaded. It transitions to Queued once the upload completes. Job is waiting to be picked up by a worker. Job is currently being processed. This includes PDF parsing, text extraction, image analysis, table detection, and OCR processing. Job has completed successfully. The response includes the complete analysis results with all extracted data, images, and metadata. Job failed during processing. Check the message field for details about what went wrong. Job was cancelled before processing completed. ## Polling Strategy For long-running parsing jobs, implement a polling strategy to check status periodically: ```python theme={null} import requests import time def poll_parse_job(job_id, api_key, max_wait_time=300, poll_interval=5): """Poll a parsing job until completion or timeout""" start_time = time.time() headers = {"api-key": api_key} while time.time() - start_time < max_wait_time: response = requests.get( f"https://prod.visionapi.unsiloed.ai/parse/{job_id}", headers=headers ) if response.status_code == 200: job = response.json() if job['status'] == 'Succeeded': return job elif job['status'] == 'Failed': raise Exception(f"Job failed: {job.get('message', 'Unknown error')}") elif job['status'] in ['Starting', 'Processing']: print(f"Job status: {job['status']} - waiting...") time.sleep(poll_interval) else: print(f"Unknown status: {job['status']}") time.sleep(poll_interval) else: print(f"Error checking status: {response.status_code}") time.sleep(poll_interval) raise Exception("Job polling timed out") # Usage try: result = poll_parse_job("04a7a6d8-5ef7-465a-b22a-8a98e7104dd9", "your-api-key") print("Job completed successfully!") print(f"Total chunks: {result['total_chunks']}") except Exception as e: print(f"Error: {e}") ``` ## Segment Types When a job succeeds, the response includes detailed analysis of different document segments: ### Title Top-level document titles, distinct from section headers. ### SectionHeader Document headers and titles that define section boundaries. ### Text Regular text content including paragraphs, sentences, and individual text elements. ### ListItem Individual items within ordered or unordered lists. ### Table Tabular data with structured rows and columns. ### Picture Images and graphics within the document, including logos, charts, and illustrations. ### Caption Text captions associated with images or figures. ### Formula Mathematical or chemical formulas detected within the document. ### Footnote Footnote text appearing at the bottom of a page. ### PageHeader Recurring header content appearing at the top of pages. ### PageFooter Recurring footer content appearing at the bottom of pages. ### Page A full-page segment when the document is processed without fine-grained layout analysis. Each segment includes: * **segment\_type**: Type of content detected * **content**: Extracted text content * **image**: URL to extracted image (if applicable) * **page\_number**: Page where the segment appears * **confidence**: Confidence score for the extraction * **bbox**: Precise coordinates of the segment * **html**: HTML-formatted content * **markdown**: Markdown-formatted content * **ocr**: Detailed OCR data with individual text elements * **chart\_data**: Structured chart data (chart type, series, labels, legend) for Picture segments identified as charts, when chart extraction is enabled * **cell\_references**: Spreadsheet cell-range references (`{sheet, address, ref}`) for Excel segments * **references**: Research-paper citations attached to the segment, when `xml_citation` is enabled * **merged\_page\_bboxes**: Per-page bounding boxes for tables merged across page breaks, when `merge_tables` is enabled ## Error Handling ### Common Error Scenarios 1. **Job Not Found**: Invalid or expired job ID returns a `404` response. 2. **Unauthorized**: Missing or invalid API key returns a `401` response. 3. **Forbidden**: Valid API key but no permission to access this task returns a `403` response. 4. **Rate Limiting**: This GET endpoint is not rate limited by the application; only the submit endpoints are. Poll responsibly regardless. 5. **Client-Side Polling Timeout**: The job did not complete within the time your polling logic allows. This is not a server-returned error; implement a reasonable client-side timeout and handle it gracefully. 6. **Server Error**: Internal processing error returns a `500` response. ### Best Practices * **Polling Frequency**: Check status every 5-10 seconds for long-running jobs * **Timeout Handling**: Implement reasonable timeouts to prevent infinite polling * **Error Recovery**: Handle failed jobs gracefully with retry logic * **API Key Security**: Keep your API key secure and never expose it in client-side code ## Rate Limits * **Concurrent Jobs**: Limited number of active parsing jobs per API key * **Request Frequency**: Avoid excessive polling (recommended: 5-10 second intervals) Check your API plan for specific limits and quotas. ## OpenAPI ````yaml api-reference/parser/openapi-v1.json GET /parse/{job_id} openapi: 3.1.0 info: title: Unsiloed Parser API — v1 description: >- The original document parsing API. Accepts multipart file uploads and URL-based processing. These endpoints have no version prefix in their URLs and are stable indefinitely. contact: name: Unsiloed url: https://unsiloed.ai email: hello@unsiloed.com license: name: '' version: 1.0.0 servers: - url: https://prod.visionapi.unsiloed.ai description: Production security: [] tags: - name: Authentication description: API key management endpoints - name: Health description: Endpoint for checking the health of the service. - name: Parse (Vision-API Compatible) description: >- Vision-API compatible endpoints for parsing - accepts multipart form data with Vision-API parameter names paths: /parse/{job_id}: get: tags: - Parse (Vision-API Compatible) summary: GET /parse/{job_id} description: |- Get the status and results of a processing task. Returns task information in Vision-API compatible format. operationId: get_parse_task parameters: - name: job_id in: path description: Job ID returned by `POST /parse`. required: true schema: type: string - name: include_chunks in: query description: Include the `chunks` array in the response. Defaults to `true`. required: false schema: type: boolean - name: base64_urls in: query description: >- Return segment images as base64-encoded data URIs instead of S3 presigned URLs. Defaults to `false`. required: false schema: type: boolean - name: output_file in: query description: >- Return a presigned S3 URL to the raw output JSON file instead of inlining the full response body. Also accepted as the `output-file` request header. Defaults to `false`. required: false schema: type: boolean - name: enhanced_table in: query description: >- Apply enhanced table post-processing when assembling the response — improves cell-merge accuracy and structure recovery for complex tables, at the cost of extra latency. Also accepted as the `enhanced-table` request header. Defaults to `false`. required: false schema: type: boolean - name: merge_tables in: query description: >- Apply the cross-page table-merge post-processing pass when assembling the response. Has no effect unless the job was parsed with `merge_tables=true` (the merge work runs at parse time). Defaults to `false`. required: false schema: type: boolean - name: include_url in: query description: >- Include file URLs (`pdf_url`, `file_url`, segment images, exports, `configuration.input_file_url`) in the response. When `false` (default), every URL-bearing field is nulled so the response — and any log of it — does not leak the storage bucket/region/path. Also accepted as the `include-url` request header. required: false schema: type: boolean responses: '200': description: >- Job status and results. Output fields (`chunks`, `total_chunks`, `page_count`, `pdf_url`) are present only when status is `Succeeded`. content: application/json: schema: $ref: '#/components/schemas/ParseGetResponse' '401': description: Unauthorized '403': description: Forbidden — you don't have permission to access this task. '404': description: Job not found. content: text/plain: schema: type: string '429': description: Rate limit exceeded. '500': description: Internal server error. content: text/plain: schema: type: string security: - api_key: [] components: schemas: ParseGetResponse: type: object description: >- Response body for `GET /parse/{job_id}`. Fields marked as optional appear only when the job has reached the relevant status. required: - job_id - status - created_at - metadata properties: chunks: type: - array - 'null' items: $ref: '#/components/schemas/Chunk' description: >- Array of document chunks with segments and extracted content. Present when status is `Succeeded`. configuration: oneOf: - type: 'null' - $ref: '#/components/schemas/Configuration' description: >- Configuration used for this job (mirrors the parameters submitted at creation time). The effective `merge_tables` value lives at `configuration.merge_tables`. created_at: type: string description: ISO 8601 timestamp when the job was created. credit_used: type: - integer - 'null' format: int64 description: Credits used for this job. exports: type: - object - 'null' description: >- Presigned download URLs for exported file formats. Only present when `export_format` was specified in the parse request and the export has completed. Keys are format names (e.g. `"docx"`), values are presigned S3 URLs valid for 1 hour. Example: `{"docx": "https://s3.amazonaws.com/..."}`. If export failed, contains `{"docx_error": "..."}` instead. additionalProperties: type: string propertyNames: type: string file_name: type: - string - 'null' description: Original file name from the job record. file_type: type: - string - 'null' description: MIME type of the uploaded file. file_url: type: - string - 'null' description: S3 URL of the original uploaded file. finished_at: type: - string - 'null' description: >- ISO 8601 timestamp when processing completed. Present when status is `Succeeded` or `Failed`. job_id: type: string description: Job identifier. message: type: - string - 'null' description: Error or status detail message. Present when status is `Failed`. metadata: type: object description: >- Citation or job metadata. Populated when `xml_citation` is enabled or from the job record. page_count: type: - integer - 'null' format: int64 description: Number of pages in the document. Present when status is `Succeeded`. pdf_url: type: - string - 'null' description: >- Presigned S3 URL to the generated PDF. Present when status is `Succeeded`. started_at: type: - string - 'null' description: >- ISO 8601 timestamp when processing started. Present when status is not `Starting`. status: type: string description: >- Current job status: `Starting`, `Processing`, `Succeeded`, `Failed`, or `Cancelled`. total_chunks: type: - integer - 'null' format: int64 description: Total number of document chunks. Present when status is `Succeeded`. Chunk: type: object required: - chunk_length - segments properties: chunk_id: type: string description: The unique identifier for the chunk. chunk_length: type: integer format: int32 description: >- The total number of tokens in the chunk. Calculated by the `tokenizer`. minimum: 0 embed: type: - string - 'null' description: >- Suggested text to be embedded for the chunk. This text is generated by combining the embed content from each segment according to the configured embed sources (HTML, Markdown, LLM, or Content). Can be configured using `embed_sources` in the `SegmentProcessing` configuration. segments: type: array items: $ref: '#/components/schemas/Segment' description: >- Collection of document segments that form this chunk. When `target_chunk_length` > 0, contains the maximum number of segments that fit within that length (segments remain intact). Otherwise, contains exactly one segment. Configuration: type: object required: - chunk_processing - high_resolution - ocr_strategy - ocr_engine - segment_analysis - layout_analysis - segment_type_naming - error_handling properties: agentic_ocr: type: - string - 'null' description: >- Agentic OCR engine to use. None = disabled. Some("standard") or Some("advanced"). chunk_processing: $ref: '#/components/schemas/ChunkProcessing' detect_pii: type: boolean description: >- When true, run a PII detection pass before parsing. If PII is found at or above `pii_block_severity`, the task is rejected and no parsing occurs. enhance_reading_order: type: boolean description: >- Fix reading order of segments after smart layout detection. Default: false. error_handling: $ref: '#/components/schemas/ErrorHandlingStrategy' excel_config: oneOf: - type: 'null' - $ref: '#/components/schemas/ExcelConfiguration' description: >- Excel-specific config, set only for tasks created via `POST /parse/excel`. `None` for every PDF/image/office task. Forwarded to the Excel pipeline service. expires_in: type: - integer - 'null' format: int32 description: >- The number of seconds until task is deleted. Expired tasks can **not** be updated, polled or accessed via web interface. export_format: type: - array - 'null' items: $ref: '#/components/schemas/ExportFormat' description: >- Export format(s) requested for this task. When present, the pipeline generates export files after processing. Currently supported: `["docx"]`. extract_colors: type: boolean description: >- Whether to extract and apply color information from PDF text layer to OCR results. When enabled, text color (RGB/hex) from the PDF is transferred to matching OCR words. Default: false extract_links: type: boolean description: >- Whether to extract and apply hyperlink annotations from PDF to OCR results. When enabled, link URLs from PDF annotations are attached to matching OCR words. Default: false extract_strikethrough: type: boolean description: >- Preserve strikethrough formatting in HTML/Markdown output. Default: false. high_resolution: type: boolean description: >- Whether to use high-resolution images for cropping and post-processing. input_file_url: type: - string - 'null' description: The presigned URL of the input file. json_schema: deprecated: true layout_analysis: $ref: '#/components/schemas/SegmentationStrategy' merge_batch_size: type: integer format: int32 description: >- Maximum number of tables allowed in a single merge group. Groups larger than this are split into sub-groups of this size. Defaults to 20. Helps avoid Gemini rate limits on very large documents. merge_tables: type: boolean description: Whether to merge adjacent tables using VLM model: oneOf: - type: 'null' - $ref: '#/components/schemas/Model' ocr_engine: $ref: '#/components/schemas/OcrEngine' ocr_strategy: $ref: '#/components/schemas/OcrStrategy' output_fields: $ref: '#/components/schemas/OutputFieldsConfig' description: >- Configuration for controlling which output fields are included in responses. Allows selective inclusion/exclusion of HTML, Markdown, OCR, images, etc. Skipped from the echo when it's the all-true default so a Slim response doesn't carry a misleading "everything on" block. page_numbers: type: - array - 'null' items: type: integer format: int32 minimum: 0 description: |- Explicit mapping of array indices to actual page numbers (1-based). Used when parsing selective pages (e.g., [4, 5, 8, 9, 10, 11, 12]). Populated from the page_range parameter. page_range: type: - string - 'null' description: >- Optional page range to process (e.g., "1-5,8,10-12"). If None, all pages are processed. pii_block_severity: $ref: '#/components/schemas/PiiBlockSeverity' description: >- Severity threshold at which to block parsing. Ignored if `detect_pii` is false. pii_engine: $ref: '#/components/schemas/PiiEngine' description: Which PII detector engine to use. Ignored if `detect_pii` is false. response_profile: oneOf: - type: 'null' - $ref: '#/components/schemas/ResponseProfile' description: >- Optional high-level response shape selector. `None` (default) preserves the legacy behavior driven entirely by `output_fields`. `Some(Slim)` returns a minimal payload; `Some(Full)` forces every field on; `Some(Custom)` honors `output_fields` verbatim. segment_analysis: $ref: '#/components/schemas/SegmentProcessing' segment_filter: type: string description: |- Segment filter: comma-separated segment types to keep, or "all". Replaces the old metadata-based keep_segment_types storage. segment_type_naming: $ref: '#/components/schemas/SegmentTypeNamingConvention' description: Controls the naming convention for segment types in the output target_chunk_length: type: - integer - 'null' format: int32 description: >- The target number of words in each chunk. If 0, each chunk will contain a single segment. deprecated: true minimum: 0 validate_segments: type: array items: $ref: '#/components/schemas/SegmentType' xml_citation: type: boolean description: Whether to extract structured citation data Segment: type: object required: - content - html - markdown - page_height - page_width - page_number - segment_id - segment_type properties: bbox: oneOf: - type: 'null' - $ref: '#/components/schemas/BoundingBox' cell_references: type: - array - 'null' items: {} description: >- Spreadsheet cell-range references ({ sheet, address, ref }) for Excel segments. Distinct from `references` (GROBID citations) so the two concepts never collide in the output or on the frontend. chart_data: description: >- Extracted chart data for Picture segments identified as charts. Contains structured data including chart_type, data series, labels, and legend. confidence: type: - number - 'null' format: float description: Confidence score of the layout analysis model content: type: string description: >- Text content of the segment. By default, this contains OCR-extracted text. The source can be configured via the `content_source` field in segment processing configuration to use HTML, Markdown, or LLM representation instead. html: type: string description: HTML representation of the segment. image: type: - string - 'null' description: Presigned URL to the image of the segment. markdown: type: string description: Markdown representation of the segment. merged_page_bboxes: type: - array - 'null' items: $ref: '#/components/schemas/PageBBox' description: >- Per-page bbox info for all pages of a merged multi-page table. Only present on merged table segments (segment_id ends with "_merged"). ocr: type: - array - 'null' items: $ref: '#/components/schemas/OCRResult' description: OCR results for the segment. page_height: type: number format: float description: Height of the page containing the segment. page_number: type: integer format: int32 description: Page number of the segment. minimum: 0 page_width: type: number format: float description: Width of the page containing the segment. references: type: - array - 'null' items: {} description: GROBID research-paper citations attached to this segment. segment_id: type: string description: Unique identifier for the segment. segment_type: $ref: '#/components/schemas/SegmentType' segment_type_display: type: - string - 'null' description: >- Display name for the segment type (used for serialization based on naming convention) ChunkProcessing: type: object description: Controls the setting for the chunking and post-processing of each chunk. properties: ignore_headers_and_footers: type: boolean description: Whether to ignore headers and footers in the chunking process. default: false target_length: type: integer format: int32 description: >- The target number of words in each chunk. If 0, each chunk will contain a single segment. default: 512 minimum: 0 tokenizer: oneOf: - $ref: '#/components/schemas/TokenizerType' description: The tokenizer to use for the chunking process. default: Word ErrorHandlingStrategy: type: string description: >- Controls how errors are handled during processing: - `Fail`: Stops processing and fails the task when any error occurs - `Continue`: Attempts to continue processing despite non-critical errors (eg. LLM refusals etc.) enum: - Fail - Continue ExcelConfiguration: type: object description: >- Excel-specific parsing configuration, forwarded verbatim to the Excel pipeline service (`services/Excel_pipeline` `POST /parse`). These are the ONLY knobs the Excel pipeline honors — none of the PDF parsing fields (OCR, layout, segments) apply. Defaults mirror the service's own `SpreadsheetConfig` defaults so an omitted value behaves identically to not setting it. Set on a task via the dedicated `POST /parse/excel` endpoint; `None` for every non-Excel task. properties: cell_colors: type: boolean description: >- Include cell background/font colors (only when `cell_metadata`). Default true. cell_metadata: type: boolean description: >- Include cell color, formula, and dropdown metadata in the output. Default false. dropdowns: type: boolean description: >- Include data-validation dropdown options (only when `cell_metadata`). Default true. exclude_hidden: type: boolean description: Drop hidden sheets/rows/cols/styling from the output. Default false. exclude_hidden_cols: type: boolean description: >- When excluding hidden content, also drop hidden columns. Default true. exclude_hidden_rows: type: boolean description: When excluding hidden content, also drop hidden rows. Default true. exclude_hidden_sheets: type: boolean description: >- When excluding hidden content, also drop hidden sheets. Default true. exclude_images: type: boolean description: >- When excluding hidden content, also drop embedded/pasted images. Default false. exclude_styling: type: boolean description: When excluding hidden content, also drop styling. Default true. formulas: type: boolean description: Include cell formulas (only when `cell_metadata`). Default true. max_rows_per_segment: type: integer format: int32 description: Max rows per split segment when `split_large_tables`. Default 50. split_large_tables: type: boolean description: Split large tables into smaller segments. Default true. table_clustering: type: string description: 'Table clustering effort: `accurate` (default), `fast`, or `off`.' ExportFormat: type: string description: >- File format for exporting parsed results. When specified in a parse request, the pipeline generates the requested export file after processing completes. The exported file is available via the `exports` field in the task response. enum: - docx - markdown - json SegmentationStrategy: type: string description: >- Controls the segmentation strategy: - `LayoutAnalysis` (wire value `smart_layout_detection`): Analyzes pages for layout elements (e.g., `Table`, `Picture`, `Formula`, etc.) using bounding boxes. Provides fine-grained segmentation and better chunking. (Latency penalty: ~TBD seconds per page). - `Page` (wire value `page_by_page`): Treats each page as a single segment. Faster processing, but without layout element detection and only simple chunking. - `AdvancedLayoutAnalysis` (wire value `advanced_layout_detection`): Higher-accuracy layout detection for complex pages (multi-column layouts, dense tables/figures); slower than `LayoutAnalysis`. enum: - smart_layout_detection - page_by_page - advanced_layout_detection Model: type: string enum: - Fast - HighQuality deprecated: true OcrEngine: type: string description: >- Controls which OCR engine to use for text recognition: - `UnsiloedStorm`: Enterprise-grade accuracy, optimized for 50+ languages - `UnsiloedHawk`: Higher accuracy, better for complex layouts [DEFAULT] - `UnsiloedBeta`: Handles irregular bounding boxes, rotated/warped text enum: - UnsiloedStorm - UnsiloedHawk - UnsiloedBeta OcrStrategy: type: string description: >- Controls the Optical Character Recognition (OCR) strategy. - `All`: Processes all pages with OCR. (Latency penalty: ~0.5 seconds per page) - `Auto`: Selectively applies OCR only to pages with missing or low-quality text. When text layer is present the bounding boxes from the text layer are used. enum: - force_ocr - auto_detection OutputFieldsConfig: type: object description: >- Per-field include/exclude filter for the response. Each field defaults to `true`; set a field to `false` to drop it. Honored when `response_profile` is omitted or set to `custom`. Ignored when `response_profile` is `slim` or `full` (the profile wins). properties: bbox: type: boolean description: Include bounding box information in segment responses chart_data: type: boolean description: >- Include extracted chart data for Picture segments identified as charts confidence: type: boolean description: Include confidence scores in segment responses content: type: boolean description: Include text content in segment responses embed: type: boolean description: Include embed text in chunk responses html: type: boolean description: Include HTML representation in segment responses image: type: boolean description: Include images (cropped segment images) in segment responses markdown: type: boolean description: Include Markdown representation in segment responses ocr: type: boolean description: Include OCR results in segment responses PiiBlockSeverity: type: string description: >- Severity threshold at which the task is rejected for PII. - `Any`: block on any PII found (strictest). - `Low`: block on quasi-identifiers (names, dates, locations) or higher. - `Medium`: block on contact PII (email, phone) or higher. - `High`: block only on direct identifiers (SSN, passport, credit card, etc.). enum: - any - low - medium - high PiiEngine: type: string description: >- Engine used by the PII detector service. - `Standard`: regex + spaCy NER. - `Advanced`: VLM-based judgment via OpenAI-compatible API. Higher precision, costs per call. enum: - standard - advanced ResponseProfile: type: string description: |- High-level response shape selector. Optional. When set, it overrides `OutputFieldsConfig` (the profile wins). - `Slim`: chunk `embed` + bbox + `segment_id` + `segment_type` + `page_number` + one representation per segment (HTML for `Table`, Markdown otherwise). Drops `content`, `image`, `ocr`, `confidence`, `chart_data`, `page_height`, `page_width`, and null `references`. - `Full`: every field returned. `output_fields` is ignored. - `Custom`: honor the caller's `OutputFieldsConfig` verbatim. enum: - slim - full - custom example: slim SegmentProcessing: type: object properties: Caption: oneOf: - type: 'null' - $ref: '#/components/schemas/AutoGenerationConfig' Footnote: oneOf: - type: 'null' - $ref: '#/components/schemas/AutoGenerationConfig' Formula: oneOf: - type: 'null' - $ref: '#/components/schemas/LlmGenerationConfig' ListItem: oneOf: - type: 'null' - $ref: '#/components/schemas/AutoGenerationConfig' Page: oneOf: - type: 'null' - $ref: '#/components/schemas/LlmGenerationConfig' PageFooter: oneOf: - type: 'null' - $ref: '#/components/schemas/AutoGenerationConfig' PageHeader: oneOf: - type: 'null' - $ref: '#/components/schemas/AutoGenerationConfig' Picture: oneOf: - type: 'null' - $ref: '#/components/schemas/PictureGenerationConfig' SectionHeader: oneOf: - type: 'null' - $ref: '#/components/schemas/AutoGenerationConfig' Table: oneOf: - type: 'null' - $ref: '#/components/schemas/LlmGenerationConfig' Text: oneOf: - type: 'null' - $ref: '#/components/schemas/AutoGenerationConfig' Title: oneOf: - type: 'null' - $ref: '#/components/schemas/AutoGenerationConfig' SegmentTypeNamingConvention: type: string description: >- Controls the naming convention for segment types in the output: - `Unsiloed`: Uses Unsiloed naming convention (e.g., `PageHeader`, `PageFooter`, `SectionHeader`, `ListItem`, `Picture`) - `Other`: Uses alternative naming convention (e.g., `Header`, `Footer`, `Section Header`, `List Item`, `Figure`) enum: - Unsiloed - Other SegmentType: type: string description: |- All the possible types for a segment. Note: Different configurations will produce different types. Please refer to the documentation for more information. enum: - Caption - Footnote - Formula - KeyValuePair - ListItem - Page - PageFooter - PageHeader - Picture - Seal - SectionHeader - Signature - Table - Text - Title BoundingBox: type: object description: >- Bounding box for an item. It is used for chunks, segments and OCR results. required: - left - top - width - height properties: height: type: number format: float description: The height of the bounding box. left: type: number format: float description: The left coordinate of the bounding box. top: type: number format: float description: The top coordinate of the bounding box. width: type: number format: float description: The width of the bounding box. PageBBox: type: object description: Bbox and page info for one page of a merged multi-page table. required: - page_number - page_height - page_width - bbox properties: bbox: $ref: '#/components/schemas/BoundingBox' page_height: type: number format: float page_number: type: integer format: int32 minimum: 0 page_width: type: number format: float OCRResult: type: object description: OCR results for a segment required: - bbox - text properties: bbox: $ref: '#/components/schemas/BoundingBox' color: oneOf: - type: 'null' - $ref: '#/components/schemas/TextColor' description: Optional text color (RGB) provided by OCR engine confidence: type: - number - 'null' format: float description: The confidence score of the recognized text. link: type: - string - 'null' description: >- Optional link URL associated with this OCR result (from PDF annotations) text: type: string description: The recognized text of the OCR result. TokenizerType: oneOf: - type: object description: Use one of the predefined tokenizer types required: - Enum properties: Enum: $ref: '#/components/schemas/Tokenizer' description: Use one of the predefined tokenizer types - type: object description: Use a custom tokenizer by specifying its model ID required: - String properties: String: type: string description: Use a custom tokenizer by specifying its model ID description: |- Specifies which tokenizer to use for the chunking process. This type supports two ways of specifying a tokenizer: 1. Using a predefined tokenizer from the `Tokenizer` enum 2. Using a custom tokenizer by providing its model ID as a string AutoGenerationConfig: type: object properties: content_source: oneOf: - $ref: '#/components/schemas/ContentSource' description: Controls which source is used to populate the `content` field default: OCR crop_image: oneOf: - $ref: '#/components/schemas/CroppingStrategy' default: Auto embed_sources: type: array items: $ref: '#/components/schemas/EmbedSource' default: '[Markdown]' extended_context: type: boolean description: Use the full page image as context for VLM generation default: false html: oneOf: - $ref: '#/components/schemas/GenerationStrategy' default: Auto markdown: oneOf: - $ref: '#/components/schemas/GenerationStrategy' default: Auto translation: oneOf: - type: 'null' - $ref: '#/components/schemas/TranslationConfig' description: Optional translation configuration for this segment type vlm: type: - string - 'null' LlmGenerationConfig: type: object properties: content_source: oneOf: - $ref: '#/components/schemas/ContentSource' description: >- Controls which source is used to populate the `content` field (default: OCR) default: OCR crop_image: oneOf: - $ref: '#/components/schemas/CroppingStrategy' default: Auto embed_sources: type: array items: $ref: '#/components/schemas/EmbedSource' default: '[Markdown]' extended_context: type: boolean description: Use the full page image as context for VLM generation default: false html: oneOf: - $ref: '#/components/schemas/GenerationStrategy' default: VLM markdown: oneOf: - $ref: '#/components/schemas/GenerationStrategy' default: VLM model_id: type: - string - 'null' description: Model ID for VLM generation translation: oneOf: - type: 'null' - $ref: '#/components/schemas/TranslationConfig' description: Optional translation configuration for this segment type use_table_ocr: type: boolean description: >- Use dedicated OCR for table processing to provide structured context to VLM (default: false) default: false vlm: type: - string - 'null' description: Prompt for the VLM model PictureGenerationConfig: type: object properties: content_source: oneOf: - $ref: '#/components/schemas/ContentSource' description: >- Controls which source is used to populate the `content` field (default: OCR) default: OCR crop_image: oneOf: - $ref: '#/components/schemas/PictureCroppingStrategy' default: All embed_sources: type: array items: $ref: '#/components/schemas/EmbedSource' default: '[Markdown]' extended_context: type: boolean description: Use the full page image as context for VLM generation default: false html: oneOf: - $ref: '#/components/schemas/GenerationStrategy' default: Auto markdown: oneOf: - $ref: '#/components/schemas/GenerationStrategy' default: Auto model_id: type: - string - 'null' description: Model ID for VLM generation transcribe_text: type: boolean description: >- When true, text-bearing pictures (forms, handwritten pages, scanned documents, screenshots of text) are transcribed verbatim by the VLM instead of described/summarized. Only takes effect for segments that actually contain OCR text — true figures (photos, charts, diagrams with little text) still get the standard description. Default: false. default: false translation: oneOf: - type: 'null' - $ref: '#/components/schemas/TranslationConfig' description: Optional translation configuration for this segment type vlm: type: - string - 'null' description: Prompt for the VLM model TextColor: type: object description: Text color in RGB format required: - r - g - b - hex properties: b: type: integer format: int32 g: type: integer format: int32 hex: type: string r: type: integer format: int32 Tokenizer: type: string description: Common tokenizers used for text processing. enum: - Word - Cl100kBase - XlmRobertaBase - BertBaseUncased ContentSource: type: string description: >- Controls which source is used to populate the `content` field in the segment output enum: - OCR - HTML - Markdown - VLM CroppingStrategy: type: string description: >- - `All` crops all images and includes URLs in response - `Auto` crops images if needed for VLM processing and includes URLs in response - `None` crops images if needed for VLM processing but does NOT include URLs in response enum: - All - Auto - None EmbedSource: type: string enum: - HTML - Markdown - VLM - Content GenerationStrategy: type: string enum: - VLM - Auto TranslationConfig: type: object required: - target_language - provider properties: model_id: type: - string - 'null' description: >- Optional model ID for VLM/LLM provider. If not specified, uses the default model. prompt: type: - string - 'null' description: >- Optional custom prompt for VLM/LLM provider. Appended to the system prompt to give the model additional translation instructions. provider: $ref: '#/components/schemas/TranslationProvider' description: >- Translation provider: "Auto" (or legacy "Google") for fast machine translation, "VLM" or "LLM" for LLM/VLM-based translation. target_language: type: string description: |- Target language code (ISO 639-1, e.g. "en", "es", "fr", "ko"). Use "auto" to auto-detect source language and translate to English. PictureCroppingStrategy: type: string description: >- Controls the cropping strategy for an item (e.g. segment, chunk, etc.) - `All` crops all images and includes URLs in response - `Auto` crops images if needed for VLM processing and includes URLs in response - `None` crops images if needed for VLM processing but does NOT include URLs in response enum: - All - Auto - None TranslationProvider: type: string enum: - Auto - VLM securitySchemes: api_key: type: http scheme: bearer description: API key for authentication. Use 'Bearer ' ````