# Parse Document (v3)

> Async PDF parsing that returns markdown. Supports inline upload, public URL, presigned upload for large files, and archive-based batches, all under one endpoint with per-key isolation.

## Overview

The v3 endpoint parses a PDF (or an archive of PDFs) and returns **markdown text** for each page. Compared to v1/v2 it is intentionally simpler: no layout, OCR-engine, or segment-analysis knobs, and no segment tree in the response. You submit a PDF and get markdown back. The pipeline picks the best model per page internally.

The endpoint is **async**: every submission returns a `job_id` that you poll until the job reaches a terminal state.

**Endpoint base URL:** `https://prod.visionapi.unsiloed.ai/v3/parse`

Use this endpoint when you want clean markdown out of a PDF without configuring layout or OCR settings. For fine-grained control over segment types, bounding boxes, and per-segment processing, use the [v2 Parse Document](/api-reference/parser/parse-document-v2) endpoint instead.

The v3 surface has four routes:

| Route | Use it for |
| --- | --- |
| `POST /v3/parse` | Submit a single PDF using one of three body shapes (multipart, JSON URL, JSON file\_id) |
| `POST /v3/parse/upload` | Mint a presigned PUT URL for PDFs larger than the inline cap |
| `POST /v3/parse/batch` | Submit a tar/tar.gz/zip archive of PDFs in one job |
| `GET /v3/parse/{job_id}` | Poll status and retrieve the inline markdown result |

## Authentication

Every request requires an `X-API-Key` header. Keys are **personal**, **rate-limited per key** (100 requests/day, 2 RPS), and **isolated**: you can only see your own jobs.

v3 API keys are issued on request and are separate from v1/v2 keys.
To get one, email **[aman@unsiloed.ai](mailto:aman@unsiloed.ai)** (or open an issue at [github.com/Unsiloed-AI/unsiloed-olmocr-benchmark](https://github.com/Unsiloed-AI/unsiloed-olmocr-benchmark/issues/new)) with a one-line note about what you're evaluating. Typical turnaround is same-day.

## Guarantees

| Property | What it means for you |
| --- | --- |
| **Per-key isolation** | Polling another user's `job_id` returns `404`, as does trying to re-parse another user's `file_id`. |
| **24-hour retention** | Every job artifact (your uploaded PDF, status, result, container logs) is deleted automatically 24 hours after the job is created. Pull your results within that window. |
| **No scoring on our end** | The API returns markdown only. If you want to reproduce a benchmark number, run the unmodified upstream scorer against the markdown locally. |

***

## POST /v3/parse — Submit a single PDF

`POST /v3/parse` accepts **three** body shapes (auto-detected from the `Content-Type` header). All three submit the same async job and return the same response.

### Body shape 1 — Inline multipart upload

For small PDFs (up to \~3 MB raw). Single HTTP call.

* `file` (multipart form field, required): PDF binary, sent as `multipart/form-data`. Capped at \~3 MB raw (≈ 4 MB after base64 encoding inside API Gateway). For larger files use body shape 2 or 3.
* `pages` (query parameter on the request URL, optional): Restricts OCR to a subset of pages.
  * `"1-5"`: pages 1 through 5
  * `"1,3,5"`: specific pages
  * omitted: all pages

```bash
curl -X POST \
  -H "X-API-Key: your-api-key" \
  -F file=@input.pdf \
  https://prod.visionapi.unsiloed.ai/v3/parse
```

### Body shape 2 — JSON with caller-hosted URL

For PDFs up to 50 MB that you already host (S3 public-read, S3 presigned, GitHub release asset, your own web server, etc.). Single HTTP call.
* `url` (string, required): Publicly fetchable `https://` URL **or** `s3://bucket/key` reference to the PDF. We fetch it. URLs pointing at private IPs, link-local addresses, or AWS instance metadata are rejected by an SSRF guard.

```bash
curl -X POST \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/your-paper.pdf"}' \
  https://prod.visionapi.unsiloed.ai/v3/parse
```

### Body shape 3 — JSON with `file_id` from a presigned upload

For PDFs up to 50 MB that you do **not** want to host publicly. First call `POST /v3/parse/upload` (below) to get a presigned `upload_url` and `file_id`, PUT your PDF to the URL, then submit the parse using the `file_id`.

* `file_id` (string, required): The `file_id` returned by `POST /v3/parse/upload`, submitted after you finish the PUT. Acts as the `job_id` for subsequent polling.

```bash
curl -X POST \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"file_id":"your-file-id"}' \
  https://prod.visionapi.unsiloed.ai/v3/parse
```

You must use **the same API key** that minted the `file_id` via `POST /v3/parse/upload`. Cross-key submissions return `404` (hiding existence); this is how per-key isolation is enforced.

### Response (any body shape)

* `job_id` (string): Job identifier (32-character UUID hex). Pass to `GET /v3/parse/{job_id}` to poll. When you used body shape 3, this equals the `file_id` you submitted.
* `status` (string): Always `"queued"` on submission. Subsequent values: `"running"` → `"done"` or `"failed"`.
* `created_at` (string): ISO 8601 timestamp when the job was created.

```json
{
  "job_id": "5eb3493042f84af5860531df5b18c56b",
  "status": "queued",
  "created_at": "2026-05-11T18:47:26Z"
}
```

***

## POST /v3/parse/upload — Presigned upload URL

Returns a presigned S3 `PUT` URL so you can upload a PDF directly (bypassing the API Gateway request size cap). Use this for the 3-call flow of body shape 3 above.

**No request body required**: just an empty POST with the auth header.
```bash
curl -X POST \
  -H "X-API-Key: your-api-key" \
  https://prod.visionapi.unsiloed.ai/v3/parse/upload
```

### Response

* `file_id` (string): Opaque identifier. After you PUT the PDF to `upload_url`, pass this back as `{"file_id": "..."}` to `POST /v3/parse` to start parsing.
* `upload_url` (string): Presigned S3 `PUT` URL. 1-hour expiry from issuance. **Send the PDF body directly to this URL with HTTP method `PUT` and `Content-Type: application/pdf`.** The transfer bypasses our API Gateway entirely.
* `upload_method` (string): Always `"PUT"`.
* `upload_content_type` (string): Always `"application/pdf"`. Your `PUT` must set the same `Content-Type` header.
* `max_bytes` (integer): Maximum PDF size accepted by the pipeline after upload. Currently `52428800` (50 MB).
* `expires_in` (integer): Seconds until the `upload_url` expires (3600).

```json
{
  "file_id": "527f4097f3d1...",
  "upload_url": "https://...s3.amazonaws.com/...?X-Amz-Signature=...",
  "upload_method": "PUT",
  "upload_content_type": "application/pdf",
  "max_bytes": 52428800,
  "expires_in": 3600,
  "next": "POST /v3/parse with body {\"file_id\": \"527f4097f3d1...\"}"
}
```

### Full 3-call flow

```bash
export API=https://prod.visionapi.unsiloed.ai/v3/parse
export KEY=your-api-key

# 1. Mint a presigned URL
UP=$(curl -s -X POST -H "X-API-Key: $KEY" "$API/upload")
FILE_ID=$(jq -r .file_id <<< "$UP")
UPLOAD_URL=$(jq -r .upload_url <<< "$UP")

# 2. PUT the PDF directly to S3 (no auth header needed; the URL is presigned)
curl -X PUT \
  -H "Content-Type: application/pdf" \
  --upload-file big.pdf \
  "$UPLOAD_URL"

# 3. Start the parse
curl -X POST \
  -H "X-API-Key: $KEY" \
  -H "Content-Type: application/json" \
  -d "{\"file_id\":\"$FILE_ID\"}" \
  "$API"
```

***

## POST /v3/parse/batch — Archive of PDFs

Process many PDFs in one job. You host an archive of PDFs; we fetch it and process every PDF inside.

* `url` (string): Public `https://` URL or `s3://` reference to a `.tar`, `.tar.gz`/`.tgz`, or `.zip` archive of PDFs. The archive format is **auto-detected by content sniffing** the first bytes, not by file extension. Non-PDF files inside the archive are skipped silently.
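Before a batch can be submitted, the archive has to exist somewhere fetchable. The following is a minimal sketch of packing a local directory of PDFs into a `.tar.gz`, using only the Python standard library. The function name `pack_pdfs` and the directory layout are assumptions for illustration and are not part of the API; uploading the resulting file to a host the service can reach is left to you.

```python
import tarfile
from pathlib import Path


def pack_pdfs(src_dir: str, archive_path: str) -> list[str]:
    """Write every *.pdf under src_dir into a gzip-compressed tar archive.

    Returns the archive-relative member names that were added.
    (Hypothetical helper; not part of the Unsiloed API.)
    """
    root = Path(src_dir)
    added = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for pdf in sorted(root.rglob("*.pdf")):
            # Store paths relative to src_dir so the archive holds no absolute paths.
            arcname = pdf.relative_to(root).as_posix()
            tar.add(str(pdf), arcname=arcname)
            added.append(arcname)
    return added
```

Only files ending in `.pdf` are included, since the service skips anything else in the archive. The member names are the relative paths you would expect to see again in the `pdf` field of each batch result entry.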
```bash
curl -X POST \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/my-bench.tar.gz"}' \
  https://prod.visionapi.unsiloed.ai/v3/parse/batch
```

Submission response is the same shape as `POST /v3/parse`:

```json
{ "job_id": "...", "status": "queued", "created_at": "..." }
```

The completion response uses `documents[]` instead of `pages[]` (one entry per PDF in the archive); see "Response — batch done" below.

***

## GET /v3/parse/{job_id} — Poll status + retrieve result

```bash
curl -H "X-API-Key: your-api-key" \
  https://prod.visionapi.unsiloed.ai/v3/parse/your-job-id
```

### Query parameters

When set to `"markdown"` **and** `status` is `"done"`, returns concatenated page markdown as `Content-Type: text/markdown; charset=utf-8` instead of a JSON envelope. Useful for `curl ... | tee out.md`. Ignored while the job is queued/running/failed.

### Response — while running

```json
{
  "job_id": "5eb3493042f84af5860531df5b18c56b",
  "status": "running",
  "created_at": "...",
  "started_at": "...",
  "progress": { "page": 2, "of": 5 },
  "phase": "ocr"
}
```

### Response — single-PDF done

* `status` (string): `"done"` for a completed single-PDF job.
* `page_count` (integer): Number of pages in the PDF (after applying the `pages` selector, if any).
* `pages` (array): Per-page markdown. Each entry has `page` (1-indexed integer) and `markdown` (string). Page order is ascending.

```json
{
  "job_id": "5eb3493042f84af5860531df5b18c56b",
  "status": "done",
  "file_name": "input.pdf",
  "created_at": "...",
  "started_at": "...",
  "finished_at": "...",
  "page_count": 3,
  "pages": [
    { "page": 1, "markdown": "..." },
    { "page": 2, "markdown": "..." },
    { "page": 3, "markdown": "..." }
  ]
}
```

### Response — batch done

* `documents` (array): One entry per PDF found in the archive. Each entry has `pdf` (relative path inside the archive), `page_count`, and `pages[]` (same shape as single-PDF). If a particular PDF failed, the entry has an `error` field instead of `pages`.
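Because a failed PDF carries `error` instead of `pages`, client code has to branch per entry rather than assume every document parsed. A minimal sketch of that branching follows. The function name `split_batch_result` is hypothetical, and joining pages with a blank line is an assumption made here, not something the API prescribes.

```python
def split_batch_result(result: dict) -> tuple[dict[str, str], dict[str, str]]:
    """Split a completed batch result into parsed markdown and per-PDF errors.

    Returns (markdown_by_pdf, error_by_pdf), both keyed by the entry's `pdf` path.
    (Hypothetical helper; not part of the Unsiloed API.)
    """
    markdown_by_pdf: dict[str, str] = {}
    error_by_pdf: dict[str, str] = {}
    for doc in result.get("documents", []):
        if "error" in doc:
            # Failed entries have `error` and no `pages`.
            error_by_pdf[doc["pdf"]] = doc["error"]
            continue
        pages = sorted(doc["pages"], key=lambda p: p["page"])
        # Assumption: separate pages with one blank line when concatenating.
        markdown_by_pdf[doc["pdf"]] = "\n\n".join(p["markdown"] for p in pages)
    return markdown_by_pdf, error_by_pdf
```

Keeping errors in their own mapping makes it easy to resubmit only the PDFs that failed.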
```json
{
  "job_id": "...",
  "status": "done",
  "source_url": "https://example.com/my-bench.tar.gz",
  "pdf_count": 3,
  "documents": [
    { "pdf": "doc1.pdf", "page_count": 1, "pages": [{ "page": 1, "markdown": "..." }] },
    { "pdf": "doc2.pdf", "page_count": 2, "pages": [{ "page": 1, "markdown": "..." }, { "page": 2, "markdown": "..." }] },
    { "pdf": "broken.pdf", "error": "PdfStreamError: Stream has ended unexpectedly" }
  ]
}
```

If the JSON would exceed API Gateway's 10 MB response cap, the response is `{ "job_id", "status": "done", "result_url" }` instead. Fetch `result_url` (presigned S3 GET) to download the same JSON. The schema of the downloaded JSON is identical to the inline shape, so clients can use one code path for both.

### Response — failed

```json
{
  "job_id": "...",
  "status": "failed",
  "created_at": "...",
  "error": "PDF exceeds size limit (50 MB)"
}
```

### Polling example

```python
import time

import requests

API = "https://prod.visionapi.unsiloed.ai/v3/parse"
KEY = "your-api-key"


def wait_for(job_id):
    """Poll every 5 seconds until the job reaches a terminal state."""
    while True:
        r = requests.get(f"{API}/{job_id}", headers={"X-API-Key": KEY})
        r.raise_for_status()
        body = r.json()
        status = body["status"]
        if status in ("done", "failed"):
            return body
        time.sleep(5)
```

***

## Error responses

| Status | Body / Header | When | What to do |
| --- | --- | --- | --- |
| `401` | `{"message": "Unauthorized"}` | Missing or invalid `X-API-Key` | Use a personal key; request one if you don't have it yet |
| `403` | `{"message": "Forbidden"}` | Key not yet propagated through API Gateway edges (within 30–60s of issuance) | Retry after a minute |
| `404` | `{"error": "Job not found"}` | Job doesn't exist, **or** the job belongs to a different API key | Confirm you're using the same key that submitted the job; otherwise check the `job_id` |
| `404` | `{"error": "no upload found for file_id=..."}` | The `file_id` you sent doesn't have an uploaded PDF behind it, **or** it belongs to a different key | Make sure you completed the `PUT` step from `/upload`, and that you're using the same key |
| `413` | `{"message": "Request Too Long"}` | Multipart body too big for API Gateway's request cap | Switch to body shape 2 (JSON URL) or body shape 3 (presigned upload) |
| `429` | `{"message": "Limit Exceeded"}` | Per-key quota (100 requests/day) or rate limit (2 RPS / 2 burst) exceeded | Slow down and retry; quota resets daily |
| `400` | `{"error": "invalid url: ..."}` | URL validation failed (wrong scheme, IP-literal, link-local, RFC1918, AWS metadata, etc.) | Use an `https://` or `s3://` URL pointing at a public/presigned object |
| `400` | `{"error": "multipart parse failed: ..."}` | Malformed multipart body | Verify your client sets `Content-Type: multipart/form-data; boundary=...` correctly |
| job `failed` (200 body) | `{"status": "failed", "error": "..."}` | Container hit a runtime error (file isn't a real PDF, archive contains no PDFs, fetch URL timed out, etc.) | Read the `error` field; fix and resubmit |

## Code examples

```bash cURL — Body shape 1 (multipart)
curl -X POST \
  -H "X-API-Key: your-api-key" \
  -F file=@input.pdf \
  "https://prod.visionapi.unsiloed.ai/v3/parse"
```

```bash cURL — Body shape 2 (caller-hosted URL)
curl -X POST \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/your-paper.pdf"}' \
  "https://prod.visionapi.unsiloed.ai/v3/parse"
```

```bash cURL — Body shape 3 (3-call upload flow)
KEY=your-api-key
API=https://prod.visionapi.unsiloed.ai/v3/parse

UP=$(curl -s -X POST -H "X-API-Key: $KEY" "$API/upload")
FILE_ID=$(jq -r .file_id <<< "$UP")
UPLOAD_URL=$(jq -r .upload_url <<< "$UP")

curl -X PUT \
  -H "Content-Type: application/pdf" \
  --upload-file big.pdf \
  "$UPLOAD_URL"

curl -X POST \
  -H "X-API-Key: $KEY" \
  -H "Content-Type: application/json" \
  -d "{\"file_id\":\"$FILE_ID\"}" \
  "$API"
```

```bash cURL — Batch
curl -X POST \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/my-bench.tar.gz"}' \
  "https://prod.visionapi.unsiloed.ai/v3/parse/batch"
```

```python Python
import os
import time

import requests

API = "https://prod.visionapi.unsiloed.ai/v3/parse"
KEY = os.environ["UNSILOED_API_KEY"]
HEADERS = {"X-API-Key": KEY}


# Body shape 1: small PDF (≤ 3 MB) via multipart
def submit_small(pdf_path: str) -> str:
    with open(pdf_path, "rb") as f:
        r = requests.post(
            API,
            headers=HEADERS,
            files={"file": (os.path.basename(pdf_path), f, "application/pdf")},
        )
    r.raise_for_status()
    return r.json()["job_id"]


# Body shape 2: caller-hosted URL
def submit_url(url: str) -> str:
    r = requests.post(
        API,
        headers={**HEADERS, "Content-Type": "application/json"},
        json={"url": url},
    )
    r.raise_for_status()
    return r.json()["job_id"]


# Body shape 3: large PDF via presigned upload
def submit_large(pdf_path: str) -> str:
    # Mint the presigned URL and fail early if the key is rejected.
    up_resp = requests.post(f"{API}/upload", headers=HEADERS)
    up_resp.raise_for_status()
    up = up_resp.json()
    file_id, upload_url = up["file_id"], up["upload_url"]
    # Upload straight to S3; the Content-Type must match the presigned URL.
    with open(pdf_path, "rb") as f:
        requests.put(
            upload_url,
            headers={"Content-Type": "application/pdf"},
            data=f,
        ).raise_for_status()
    r = requests.post(
        API,
        headers={**HEADERS, "Content-Type": "application/json"},
        json={"file_id": file_id},
    )
    r.raise_for_status()
    return r.json()["job_id"]


# Poll until done or failed
def wait(job_id: str) -> dict:
    while True:
        r = requests.get(f"{API}/{job_id}", headers=HEADERS)
        r.raise_for_status()
        body = r.json()
        if body["status"] in ("done", "failed"):
            return body
        time.sleep(5)


job = submit_small("input.pdf")
result = wait(job)
if result["status"] == "done":
    for p in result["pages"]:
        print(f"--- page {p['page']} ---")
        print(p["markdown"])
else:
    print("FAILED:", result.get("error"))
```

```javascript JavaScript
const API = "https://prod.visionapi.unsiloed.ai/v3/parse";
const KEY = process.env.UNSILOED_API_KEY;

// Body shape 1: small PDF via multipart
async function submitSmall(file) {
  const form = new FormData();
  form.append("file", file);
  const r = await fetch(API, {
    method: "POST",
    headers: { "X-API-Key": KEY },
    body: form
  });
  if (!r.ok) throw new Error(`submit failed: ${r.status}`);
  return (await r.json()).job_id;
}

// Body shape 2: caller-hosted URL
async function submitUrl(url) {
  const r = await fetch(API, {
    method: "POST",
    headers: { "X-API-Key": KEY, "Content-Type": "application/json" },
    body: JSON.stringify({ url })
  });
  if (!r.ok) throw new Error(`submit failed: ${r.status}`);
  return (await r.json()).job_id;
}

// Body shape 3: presigned upload for large files
async function submitLarge(file) {
  const upResp = await fetch(`${API}/upload`, {
    method: "POST",
    headers: { "X-API-Key": KEY }
  });
  if (!upResp.ok) throw new Error(`upload mint failed: ${upResp.status}`);
  const up = await upResp.json();
  const put = await fetch(up.upload_url, {
    method: "PUT",
    headers: { "Content-Type": "application/pdf" },
    body: file
  });
  if (!put.ok) throw new Error(`upload failed: ${put.status}`);
  const r = await fetch(API, {
    method: "POST",
    headers: { "X-API-Key": KEY, "Content-Type": "application/json" },
    body: JSON.stringify({ file_id: up.file_id })
  });
  if (!r.ok) throw new Error(`submit failed: ${r.status}`);
  return (await r.json()).job_id;
}

// Poll every 5 seconds until the job is done or failed
async function wait(jobId) {
  while (true) {
    const r = await fetch(`${API}/${jobId}`, { headers: { "X-API-Key": KEY } });
    if (!r.ok) throw new Error(`poll failed: ${r.status}`);
    const body = await r.json();
    if (body.status === "done" || body.status === "failed") return body;
    await new Promise(res => setTimeout(res, 5000));
  }
}
```

```json Submission (201)
{
  "job_id": "5eb3493042f84af5860531df5b18c56b",
  "status": "queued",
  "created_at": "2026-05-11T18:47:26Z"
}
```

```json Presigned upload (POST /v3/parse/upload, 201)
{
  "file_id": "527f4097f3d1...",
  "upload_url": "https://...s3.amazonaws.com/...?X-Amz-Signature=...",
  "upload_method": "PUT",
  "upload_content_type": "application/pdf",
  "max_bytes": 52428800,
  "expires_in": 3600
}
```

```json Single-PDF done (GET /v3/parse/{job_id})
{
  "job_id": "5eb3493042f84af5860531df5b18c56b",
  "status": "done",
  "file_name": "input.pdf",
  "started_at": "2026-05-11T18:47:36Z",
  "finished_at": "2026-05-11T18:47:53Z",
  "page_count": 1,
  "pages": [
    { "page": 1, "markdown": "Example 4.28. Let us use Proposition 4.26 ..." }
  ]
}
```

```json Batch done (GET /v3/parse/{job_id})
{
  "job_id": "...",
  "status": "done",
  "source_url": "https://example.com/my-bench.tar.gz",
  "pdf_count": 3,
  "documents": [
    { "pdf": "doc1.pdf", "page_count": 1, "pages": [{ "page": 1, "markdown": "..." }] },
    { "pdf": "doc2.pdf", "page_count": 2, "pages": [{ "page": 1, "markdown": "..." }, { "page": 2, "markdown": "..." }] }
  ]
}
```

```json Failed (GET /v3/parse/{job_id})
{
  "job_id": "...",
  "status": "failed",
  "created_at": "...",
  "error": "PDF exceeds size limit (50 MB)"
}
```

## See also

* **[Open-source benchmark harness + client](https://github.com/Unsiloed-AI/unsiloed-olmocr-benchmark)**: reproduces our published olmOCR-Bench numbers across vendors and includes a thin client (`clients/bench_via_api.py`) that calls this endpoint, collects the returned markdown, and scores it with the unmodified upstream scorer.
* **[v1 Parse Document](/api-reference/parser/parse-document)**: segmented response with bounding boxes, OCR data, and per-segment processing knobs.
* **[v2 Parse Document (Presigned Upload)](/api-reference/parser/parse-document-v2)**: segmented response variant with presigned upload for larger files and higher throughput.

## OpenAPI

````yaml api-reference/parser/openapi-v3.json POST /v3/parse
openapi: 3.1.0
info:
  title: Unsiloed Parser API — v3
  description: >-
    Async PDF parsing that returns markdown. Supports inline multipart upload,
    caller-hosted URL, presigned upload for large files, and archive-based
    batches — all under one endpoint with per-key isolation.
  contact:
    name: Unsiloed
    url: https://unsiloed.ai
    email: hello@unsiloed.com
  license:
    name: ''
  version: 3.0.0
servers:
  - url: https://prod.visionapi.unsiloed.ai
    description: Production
security: []
tags:
  - name: Parse v3
    description: Async markdown-only parsing with per-key isolation
paths:
  /v3/parse:
    post:
      tags:
        - Parse v3
      summary: POST /v3/parse
      description: >-
        Submit a single PDF for async parsing. Accepts three body shapes
        (auto-detected from Content-Type): multipart inline upload, JSON with a
        caller-hosted URL, or JSON with a `file_id` from a presigned upload.
        Returns a `job_id` you poll via GET /v3/parse/{job_id}.
      operationId: submit_parse_v3
      parameters:
        - name: pages
          in: query
          required: false
          description: >-
            Optional page selector. Examples: `1-5`, `1,3,5`. Omitted means all
            pages.
          schema:
            type: string
      requestBody:
        content:
          multipart/form-data:
            schema:
              $ref: '#/components/schemas/MultipartParseRequest'
          application/json:
            schema:
              oneOf:
                - $ref: '#/components/schemas/UrlParseRequest'
                - $ref: '#/components/schemas/FileIdParseRequest'
        required: true
      responses:
        '201':
          description: Job accepted and queued
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/JobSubmission'
        '400':
          description: Invalid request body, malformed multipart, or rejected URL
        '401':
          description: Unauthorized — missing or invalid X-API-Key
        '403':
          description: Forbidden — key not yet propagated
        '413':
          description: Request body too large — use body shape 2 or 3
        '429':
          description: Per-key quota or rate limit exceeded
      security:
        - api_key: []
components:
  schemas:
    MultipartParseRequest:
      type: object
      description: Body shape 1 — multipart inline upload for small PDFs (≤ ~3 MB raw).
      required:
        - file
      properties:
        file:
          type: string
          format: binary
          description: PDF binary. Capped at ~3 MB raw.
    UrlParseRequest:
      type: object
      description: Body shape 2 — caller-hosted URL (up to 50 MB).
      required:
        - url
      properties:
        url:
          type: string
          description: >-
            Publicly fetchable https:// URL or s3://bucket/key reference. URLs
            pointing at private IPs, link-local, or AWS instance metadata are
            rejected by an SSRF guard.
    FileIdParseRequest:
      type: object
      description: Body shape 3 — submit a previously presigned-uploaded PDF by file_id.
      required:
        - file_id
      properties:
        file_id:
          type: string
          description: >-
            The file_id returned by POST /v3/parse/upload after the PUT
            completes. Must be used with the same API key that minted it.
    JobSubmission:
      type: object
      description: Async job acknowledgment.
      required:
        - job_id
        - status
        - created_at
      properties:
        job_id:
          type: string
          description: 32-character UUID hex. Pass to GET /v3/parse/{job_id} to poll.
        status:
          type: string
          description: Always `queued` on submission.
          enum:
            - queued
        created_at:
          type: string
          description: ISO 8601 timestamp when the job was created.
  securitySchemes:
    api_key:
      type: apiKey
      in: header
      name: X-API-Key
      description: Personal API key. Issued separately from v1/v2 keys.
````
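One detail the schema above does not cover is the oversized-result fallback described under "Response — batch done", where a finished job returns `result_url` instead of inline JSON. The sketch below shows one way a client might fold both cases into a single code path. The names `resolve_result` and `_http_get_json` are hypothetical, and fetching without an `X-API-Key` header is an assumption based on the URL being described as presigned.

```python
import json
from typing import Callable
from urllib.request import urlopen


def _http_get_json(url: str) -> dict:
    # Assumption: a presigned URL needs no X-API-Key header, only a plain GET.
    with urlopen(url, timeout=60) as resp:
        return json.load(resp)


def resolve_result(
    body: dict,
    fetch_json: Callable[[str], dict] = _http_get_json,
) -> dict:
    """Return the full result JSON for a polled job.

    If the job is done and only `result_url` was returned, download the JSON
    from that URL; otherwise return the polled body unchanged.
    (Hypothetical helper; not part of the Unsiloed API.)
    """
    if body.get("status") == "done" and "result_url" in body:
        return fetch_json(body["result_url"])
    return body
```

Passing the fetcher as a parameter keeps the function easy to test without network access. In normal use you would call `resolve_result(wait(job_id))` and then read `pages` or `documents` as usual.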