> ## Documentation Index
> Fetch the complete documentation index at: https://docs.unsiloed.ai/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Getting Started With Extract

> Submit a document and a JSON schema to /v2/extract and read back typed fields with confidence scores.

<Note>
  Extraction pulls typed fields out of a document against a JSON schema we define, returning each leaf tagged with its own confidence score. For raw Markdown or just a category label instead, see the [Parse quickstart](/quickstart) or the [Classification quickstart](/document-processing/classification/quickstart).
</Note>

By the end, we'll have a script that gives an invoice PDF and a schema to `/v2/extract`, waits for the job to finish, and writes the matched fields back as a clean JSON object with per-field confidence scores. Grab the full script from the dropdown below if you'd rather skip the walkthrough.

<Accordion title="Show the Full Script">
  Set `UNSILOED_API_KEY` in your environment and save the document you want to extract from as `document.pdf` in the same directory before running.

  <Tabs>
    <Tab title="Python">
      ```python extract_document.py theme={null}
      import json
      import os
      import time
      import requests

      API_KEY = os.environ["UNSILOED_API_KEY"]
      BASE_URL = "https://prod.visionapi.unsiloed.ai"

      schema = {
          "type": "object",
          "properties": {
              "vendor_name": {
                  "type": "string",
                  "description": "Name of the company issuing the invoice (the seller)",
              },
              "invoice_number": {
                  "type": "string",
                  "description": "Unique invoice identifier shown on the document",
              },
              "issue_date": {
                  "type": "string",
                  "description": "Date the invoice was issued",
              },
              "total_due": {
                  "type": "number",
                  "description": "Final total amount due in US dollars, including tax",
              },
              "line_items": {
                  "type": "array",
                  "description": "One row per line item in the invoice table",
                  "items": {
                      "type": "object",
                      "properties": {
                          "description": {"type": "string", "description": "Description of the product or service"},
                          "quantity":    {"type": "number", "description": "Quantity of the item ordered"},
                          "unit_price":  {"type": "number", "description": "Price per unit in US dollars"},
                          "subtotal":    {"type": "number", "description": "Line subtotal in US dollars (quantity x unit_price)"},
                      },
                      "required": ["description", "quantity", "unit_price", "subtotal"],
                      "additionalProperties": False,
                  },
              },
          },
          "required": ["vendor_name", "invoice_number", "issue_date", "total_due", "line_items"],
          "additionalProperties": False,
      }

      with open("document.pdf", "rb") as f:
          response = requests.post(
              f"{BASE_URL}/v2/extract",
              headers={"api-key": API_KEY},
              files={"pdf_file": ("document.pdf", f, "application/pdf")},
              data={"schema_data": json.dumps(schema)},
          )
      response.raise_for_status()

      job_id = response.json()["job_id"]
      print(f"Job submitted: {job_id}")

      max_attempts = 60  # roughly 5 minutes at 5 seconds per poll
      attempts = 0
      while True:
          result = requests.get(
              f"{BASE_URL}/extract/{job_id}",
              headers={"api-key": API_KEY},
          ).json()
          print(f"Status: {result['status']}")
          if result["status"] == "completed":
              break
          if result["status"] == "failed":
              raise RuntimeError(result.get("error", "extract job failed"))
          attempts += 1
          if attempts >= max_attempts:
              raise TimeoutError("Extract job did not finish within 5 minutes")
          time.sleep(5)

      with open("result.json", "w") as f:
          json.dump(result, f, indent=2)

      print(f"Saved extracted fields to result.json")
      ```
    </Tab>

    <Tab title="JavaScript">
      Save this as `script.mjs` or set `"type": "module"` in your `package.json`. Requires Node.js 18 or newer for the global `fetch`, `FormData`, and `Blob`.

      ```javascript script.mjs theme={null}
      import fs from "node:fs";

      const API_KEY = process.env.UNSILOED_API_KEY;
      const BASE_URL = "https://prod.visionapi.unsiloed.ai";

      const schema = {
        type: "object",
        properties: {
          vendor_name:    { type: "string", description: "Name of the company issuing the invoice (the seller)" },
          invoice_number: { type: "string", description: "Unique invoice identifier shown on the document" },
          issue_date:     { type: "string", description: "Date the invoice was issued" },
          total_due:      { type: "number", description: "Final total amount due in US dollars, including tax" },
          line_items: {
            type: "array",
            description: "One row per line item in the invoice table",
            items: {
              type: "object",
              properties: {
                description: { type: "string", description: "Description of the product or service" },
                quantity:    { type: "number", description: "Quantity of the item ordered" },
                unit_price:  { type: "number", description: "Price per unit in US dollars" },
                subtotal:    { type: "number", description: "Line subtotal in US dollars (quantity x unit_price)" },
              },
              required: ["description", "quantity", "unit_price", "subtotal"],
              additionalProperties: false,
            },
          },
        },
        required: ["vendor_name", "invoice_number", "issue_date", "total_due", "line_items"],
        additionalProperties: false,
      };

      const form = new FormData();
      form.append("pdf_file", new Blob([fs.readFileSync("document.pdf")]), "document.pdf");
      form.append("schema_data", JSON.stringify(schema));

      const response = await fetch(`${BASE_URL}/v2/extract`, {
        method: "POST",
        headers: { "api-key": API_KEY },
        body: form,
      });
      if (!response.ok) throw new Error(`${response.status}: ${await response.text()}`);

      const { job_id } = await response.json();
      console.log(`Job submitted: ${job_id}`);

      const maxAttempts = 60; // roughly 5 minutes at 5 seconds per poll
      let attempts = 0;
      let result;
      while (true) {
        const res = await fetch(`${BASE_URL}/extract/${job_id}`, {
          headers: { "api-key": API_KEY },
        });
        result = await res.json();
        console.log(`Status: ${result.status}`);
        if (result.status === "completed") break;
        if (result.status === "failed") throw new Error(result.error || "extract job failed");
        if (++attempts >= maxAttempts) throw new Error("Extract job did not finish within 5 minutes");
        await new Promise((r) => setTimeout(r, 5000));
      }

      fs.writeFileSync("result.json", JSON.stringify(result, null, 2));
      console.log("Saved extracted fields to result.json");
      ```
    </Tab>

    <Tab title="cURL">
      ```bash theme={null}
      # Write the schema to a file so we can pass it cleanly:
      cat > schema.json <<'EOF'
      {
        "type": "object",
        "properties": {
          "vendor_name":    { "type": "string", "description": "Name of the company issuing the invoice (the seller)" },
          "invoice_number": { "type": "string", "description": "Unique invoice identifier shown on the document" },
          "issue_date":     { "type": "string", "description": "Date the invoice was issued" },
          "total_due":      { "type": "number", "description": "Final total amount due in US dollars, including tax" },
          "line_items": {
            "type": "array",
            "description": "One row per line item in the invoice table",
            "items": {
              "type": "object",
              "properties": {
                "description": { "type": "string", "description": "Description of the product or service" },
                "quantity":    { "type": "number", "description": "Quantity of the item ordered" },
                "unit_price":  { "type": "number", "description": "Price per unit in US dollars" },
                "subtotal":    { "type": "number", "description": "Line subtotal in US dollars (quantity x unit_price)" }
              },
              "required": ["description", "quantity", "unit_price", "subtotal"],
              "additionalProperties": false
            }
          }
        },
        "required": ["vendor_name", "invoice_number", "issue_date", "total_due", "line_items"],
        "additionalProperties": false
      }
      EOF

      # Submit the document with the schema and capture the job_id:
      resp=$(curl -sX POST "https://prod.visionapi.unsiloed.ai/v2/extract" \
        -H "api-key: $UNSILOED_API_KEY" \
        -F "pdf_file=@document.pdf" \
        -F "schema_data=$(cat schema.json)")
      JOB_ID=$(echo "$resp" | grep -o '"job_id":"[^"]*"' | cut -d'"' -f4)
      echo "Job submitted: $JOB_ID"

      # Poll until the job finishes, with a 5-minute timeout:
      attempts=0
      max_attempts=60
      while true; do
        resp=$(curl -sX GET "https://prod.visionapi.unsiloed.ai/extract/$JOB_ID" \
          -H "api-key: $UNSILOED_API_KEY")
        status=$(echo "$resp" | grep -o '"status":"[^"]*"' | head -1 | cut -d'"' -f4)
        echo "Status: $status"
        [ "$status" = "completed" ] && break
        [ "$status" = "failed" ] && { echo "Job failed"; exit 1; }
        attempts=$((attempts + 1))
        [ "$attempts" -ge "$max_attempts" ] && { echo "Extract job did not finish within 5 minutes"; exit 1; }
        sleep 5
      done

      # Save the full response to disk:
      echo "$resp" > result.json
      ```
    </Tab>
  </Tabs>
</Accordion>

## Step 1: Set Up Your Environment

Before writing any code, we need three things: an API key, a document, and the runtime for our chosen language.

### 1.1 Get an Unsiloed AI API Key

To get API access, [sign up on Unsiloed AI](https://cal.com/aman-mishra-p0ry57/15min). Export your key as an environment variable named `UNSILOED_API_KEY` so it stays out of source control:

```bash theme={null}
export UNSILOED_API_KEY="your-api-key"
```

### 1.2 Pick a Document to Extract Fields From

The `/v2/extract` endpoint supports PDF, DOCX, PPTX, JPG, PNG, and other formats. The walkthrough below assumes a PDF saved as `document.pdf` in your working directory. To use a different format, update the filename in the snippets to match your file.

If you don't have a document handy, download our [sample invoice PDF](https://raw.githubusercontent.com/Unsiloed-AI/cookbook/c585446e46e4be2790c6c29fe2a7a3a1b346191d/sample-documents/sample-extract.pdf) (a one-page invoice from Northwind Office Supplies with five line items) and save it as `document.pdf`. The schema in this guide targets the vendor, invoice number, issue date, total, and the line item table on that invoice.

### 1.3 Install Dependencies

<Tabs>
  <Tab title="Python">
    You need Python 3.8 or newer. Install the `requests` package:

    ```bash theme={null}
    pip install requests
    ```
  </Tab>

  <Tab title="JavaScript">
    You need Node.js 18 or newer for the global `fetch`, `FormData`, and `Blob`. No external packages needed.
  </Tab>

  <Tab title="cURL">
    You need cURL, which is preinstalled on macOS and most Linux distributions. No external packages needed.
  </Tab>
</Tabs>

## Step 2: Submit a Document With a Schema

The request bundles two fields: `pdf_file` for the document and `schema_data` for the JSON schema as a string. The schema is the interesting half. Anything we describe there, from a single total to a nested array of line items, comes back typed and scored in the exact shape we asked for. The endpoint returns a `job_id` we can poll. All requests go to `https://prod.visionapi.unsiloed.ai` with the API key in the `api-key` header.

### 2.1 Set Up the Script

<Tabs>
  <Tab title="Python">
    Create a file called `extract_document.py` and start with the imports and configuration:

    ```python extract_document.py theme={null}
    import json
    import os
    import time
    import requests

    API_KEY = os.environ["UNSILOED_API_KEY"]
    BASE_URL = "https://prod.visionapi.unsiloed.ai"
    ```

    `API_KEY` reads your key from the environment so it doesn't get hard-coded into the file, and `BASE_URL` points at the Unsiloed AI production endpoint. Both appear in every request below.
  </Tab>

  <Tab title="JavaScript">
    Create a file called `script.mjs` and start with the imports and configuration:

    ```javascript script.mjs theme={null}
    import fs from "node:fs";

    const API_KEY = process.env.UNSILOED_API_KEY;
    const BASE_URL = "https://prod.visionapi.unsiloed.ai";
    ```

    `API_KEY` reads your key from the environment so it doesn't get hard-coded into the file, and `BASE_URL` points at the Unsiloed AI production endpoint. Both appear in every request below.
  </Tab>

  <Tab title="cURL">
    cURL doesn't need a setup step. Each command below inlines the API key and base URL directly.
  </Tab>
</Tabs>

### 2.2 Define the Schema

The schema tells the API which fields to pull out and what shape they should take. The clearer the `description` on each field, the better the model locates and types each value. For our sample invoice, we want the vendor name, the invoice number, the issue date, the total due, and the five rows of the line item table.

<Tabs>
  <Tab title="Python">
    Continue the file by defining the schema as a Python dict:

    ```python extract_document.py theme={null}
    schema = {
        "type": "object",
        "properties": {
            "vendor_name": {
                "type": "string",
                "description": "Name of the company issuing the invoice (the seller)",
            },
            "invoice_number": {
                "type": "string",
                "description": "Unique invoice identifier shown on the document",
            },
            "issue_date": {
                "type": "string",
                "description": "Date the invoice was issued",
            },
            "total_due": {
                "type": "number",
                "description": "Final total amount due in US dollars, including tax",
            },
            "line_items": {
                "type": "array",
                "description": "One row per line item in the invoice table",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string", "description": "Description of the product or service"},
                        "quantity":    {"type": "number", "description": "Quantity of the item ordered"},
                        "unit_price":  {"type": "number", "description": "Price per unit in US dollars"},
                        "subtotal":    {"type": "number", "description": "Line subtotal in US dollars (quantity x unit_price)"},
                    },
                    "required": ["description", "quantity", "unit_price", "subtotal"],
                    "additionalProperties": False,
                },
            },
        },
        "required": ["vendor_name", "invoice_number", "issue_date", "total_due", "line_items"],
        "additionalProperties": False,
    }
    ```

    The schema is plain JSON Schema with strict-mode rules: `additionalProperties: false` at every object level (which prevents the model from inventing fields we didn't ask for), and a `required` list naming the must-have fields. See the [Schemas](/document-processing/extraction/schemas) reference for the full ruleset.
  </Tab>

  <Tab title="JavaScript">
    Continue the file by defining the schema as a JavaScript object:

    ```javascript script.mjs theme={null}
    const schema = {
      type: "object",
      properties: {
        vendor_name:    { type: "string", description: "Name of the company issuing the invoice (the seller)" },
        invoice_number: { type: "string", description: "Unique invoice identifier shown on the document" },
        issue_date:     { type: "string", description: "Date the invoice was issued" },
        total_due:      { type: "number", description: "Final total amount due in US dollars, including tax" },
        line_items: {
          type: "array",
          description: "One row per line item in the invoice table",
          items: {
            type: "object",
            properties: {
              description: { type: "string", description: "Description of the product or service" },
              quantity:    { type: "number", description: "Quantity of the item ordered" },
              unit_price:  { type: "number", description: "Price per unit in US dollars" },
              subtotal:    { type: "number", description: "Line subtotal in US dollars (quantity x unit_price)" },
            },
            required: ["description", "quantity", "unit_price", "subtotal"],
            additionalProperties: false,
          },
        },
      },
      required: ["vendor_name", "invoice_number", "issue_date", "total_due", "line_items"],
      additionalProperties: false,
    };
    ```

    The schema is plain JSON Schema with strict-mode rules: `additionalProperties: false` at every object level (which prevents the model from inventing fields we didn't ask for), and a `required` list naming the must-have fields. See the [Schemas](/document-processing/extraction/schemas) reference for the full ruleset.
  </Tab>

  <Tab title="cURL">
    Write the schema to a file to pass it to `curl` as the value of the `schema_data` form field:

    ```bash theme={null}
    cat > schema.json <<'EOF'
    {
      "type": "object",
      "properties": {
        "vendor_name":    { "type": "string", "description": "Name of the company issuing the invoice (the seller)" },
        "invoice_number": { "type": "string", "description": "Unique invoice identifier shown on the document" },
        "issue_date":     { "type": "string", "description": "Date the invoice was issued" },
        "total_due":      { "type": "number", "description": "Final total amount due in US dollars, including tax" },
        "line_items": {
          "type": "array",
          "description": "One row per line item in the invoice table",
          "items": {
            "type": "object",
            "properties": {
              "description": { "type": "string", "description": "Description of the product or service" },
              "quantity":    { "type": "number", "description": "Quantity of the item ordered" },
              "unit_price":  { "type": "number", "description": "Price per unit in US dollars" },
              "subtotal":    { "type": "number", "description": "Line subtotal in US dollars (quantity x unit_price)" }
            },
            "required": ["description", "quantity", "unit_price", "subtotal"],
            "additionalProperties": false
          }
        }
      },
      "required": ["vendor_name", "invoice_number", "issue_date", "total_due", "line_items"],
      "additionalProperties": false
    }
    EOF
    ```

    The schema follows JSON Schema with strict-mode rules: `additionalProperties: false` at every object level, and a `required` list naming the must-have fields. See the [Schemas](/document-processing/extraction/schemas) reference for the full ruleset.
  </Tab>
</Tabs>

### 2.3 Upload the Document

Send the file and the schema as a multipart upload to `/v2/extract`. The endpoint expects the document under the form field name `pdf_file` and the schema under `schema_data` (as a JSON string).

<Tabs>
  <Tab title="Python">
    Next, upload the document and the schema together:

    ```python extract_document.py theme={null}
    with open("document.pdf", "rb") as f:
        response = requests.post(
            f"{BASE_URL}/v2/extract",
            headers={"api-key": API_KEY},
            files={"pdf_file": ("document.pdf", f, "application/pdf")},
            data={"schema_data": json.dumps(schema)},
        )
    response.raise_for_status()
    ```

    The `raise_for_status()` call throws an `HTTPError` on any non-2xx response, so we don't need to check `.status_code` ourselves. The `json.dumps(schema)` call serializes the dict because the endpoint expects `schema_data` as a string, not a nested form field.
  </Tab>

  <Tab title="JavaScript">
    Next, upload the document and the schema together:

    ```javascript script.mjs theme={null}
    const form = new FormData();
    form.append("pdf_file", new Blob([fs.readFileSync("document.pdf")]), "document.pdf");
    form.append("schema_data", JSON.stringify(schema));

    const response = await fetch(`${BASE_URL}/v2/extract`, {
      method: "POST",
      headers: { "api-key": API_KEY },
      body: form,
    });
    if (!response.ok) throw new Error(`${response.status}: ${await response.text()}`);
    ```

    `fetch` doesn't throw on non-2xx responses by default, so we check `response.ok` and raise the error ourselves. The `JSON.stringify(schema)` call serializes the schema because the endpoint expects `schema_data` as a string, not a nested form field.
  </Tab>

  <Tab title="cURL">
    Run:

    ```bash theme={null}
    curl -X POST "https://prod.visionapi.unsiloed.ai/v2/extract" \
      -H "api-key: $UNSILOED_API_KEY" \
      -F "pdf_file=@document.pdf" \
      -F "schema_data=$(cat schema.json)"
    ```

    The response prints to stdout. We need the `job_id` field for the next step.
  </Tab>
</Tabs>

### 2.4 Capture the Job ID

<Tabs>
  <Tab title="Python">
    Then read and print the `job_id`:

    ```python extract_document.py theme={null}
    job_id = response.json()["job_id"]
    print(f"Job submitted: {job_id}")
    ```

    Run the script:

    ```bash theme={null}
    python extract_document.py
    ```

    The output should be a single line like `Job submitted: a90e48c4-f564-435e-9bf2-ab6eb5a0376d`.
  </Tab>

  <Tab title="JavaScript">
    Then read and log the `job_id`:

    ```javascript script.mjs theme={null}
    const { job_id } = await response.json();
    console.log(`Job submitted: ${job_id}`);
    ```

    Run the script:

    ```bash theme={null}
    node script.mjs
    ```

    The output should be a single line like `Job submitted: a90e48c4-f564-435e-9bf2-ab6eb5a0376d`.
  </Tab>

  <Tab title="cURL">
    The response body from the POST above looks like:

    ```json theme={null}
    {
      "job_id": "a90e48c4-f564-435e-9bf2-ab6eb5a0376d",
      "status": "queued",
      "message": "PDF citation processing started",
      "quota_remaining": 7705
    }
    ```

    Copy the `job_id` value; you'll paste it into the polling command in the next step.
  </Tab>
</Tabs>

## Step 3: Poll for Results

The job runs asynchronously. We GET `/extract/{job_id}` repeatedly until the status is `completed`, then save the extracted fields to disk.

A status of `completed` means the result is ready; `failed` means the job errored; any other value (`queued`, `processing`, and so on) means the job is still running.

### 3.1 Write the Polling Loop

<Tabs>
  <Tab title="Python">
    Next, drop in a polling loop. The `max_attempts` cap stops the loop if the job hangs:

    ```python extract_document.py theme={null}
    max_attempts = 60  # roughly 5 minutes at 5 seconds per poll
    attempts = 0
    while True:
        result = requests.get(
            f"{BASE_URL}/extract/{job_id}",
            headers={"api-key": API_KEY},
        ).json()
        print(f"Status: {result['status']}")
        if result["status"] == "completed":
            break
        if result["status"] == "failed":
            raise RuntimeError(result.get("error", "extract job failed"))
        attempts += 1
        if attempts >= max_attempts:
            raise TimeoutError("Extract job did not finish within 5 minutes")
        time.sleep(5)
    ```
  </Tab>

  <Tab title="JavaScript">
    Next, drop in a polling loop. The `maxAttempts` cap stops the loop if the job hangs:

    ```javascript script.mjs theme={null}
    const maxAttempts = 60; // roughly 5 minutes at 5 seconds per poll
    let attempts = 0;
    let result;
    while (true) {
      const res = await fetch(`${BASE_URL}/extract/${job_id}`, {
        headers: { "api-key": API_KEY },
      });
      result = await res.json();
      console.log(`Status: ${result.status}`);
      if (result.status === "completed") break;
      if (result.status === "failed") throw new Error(result.error || "extract job failed");
      if (++attempts >= maxAttempts) throw new Error("Extract job did not finish within 5 minutes");
      await new Promise((r) => setTimeout(r, 5000));
    }
    ```
  </Tab>

  <Tab title="cURL">
    Replace `JOB_ID` below with the value you captured from Step 2.4, then run this loop. It polls every 5 seconds and gives up after 5 minutes if the job hasn't completed:

    ```bash theme={null}
    JOB_ID="paste-job-id-here"
    attempts=0
    max_attempts=60  # roughly 5 minutes at 5 seconds per poll

    while true; do
      resp=$(curl -sX GET "https://prod.visionapi.unsiloed.ai/extract/$JOB_ID" \
        -H "api-key: $UNSILOED_API_KEY")
      status=$(echo "$resp" | grep -o '"status":"[^"]*"' | head -1 | cut -d'"' -f4)
      echo "Status: $status"
      [ "$status" = "completed" ] && break
      [ "$status" = "failed" ] && { echo "Job failed"; exit 1; }
      attempts=$((attempts + 1))
      [ "$attempts" -ge "$max_attempts" ] && { echo "Extract job did not finish within 5 minutes"; exit 1; }
      sleep 5
    done
    ```

    The loop keeps the latest response body in `$resp` for the next step.
  </Tab>
</Tabs>

### 3.2 Save the Extracted Fields

Finally, persist the result to disk. The response is already structured JSON, so we write it straight to `result.json`.

<Tabs>
  <Tab title="Python">
    Finally, write the result to disk:

    ```python extract_document.py theme={null}
    with open("result.json", "w") as f:
        json.dump(result, f, indent=2)

    print(f"Saved extracted fields to result.json")
    ```

    Run the script:

    ```bash theme={null}
    python extract_document.py
    ```

    You should see a few `Status: processing` lines, then `Status: completed`, then the summary line. The `result.json` file appears in the working directory.
  </Tab>

  <Tab title="JavaScript">
    Finally, write the result to disk:

    ```javascript script.mjs theme={null}
    fs.writeFileSync("result.json", JSON.stringify(result, null, 2));
    console.log("Saved extracted fields to result.json");
    ```

    Run the script:

    ```bash theme={null}
    node script.mjs
    ```

    You should see a few `Status: processing` lines, then `Status: completed`, then the summary line. The `result.json` file appears in the working directory.
  </Tab>

  <Tab title="cURL">
    The polling loop in Step 3.1 left the full response in `$resp`. Write it to disk:

    ```bash theme={null}
    echo "$resp" > result.json
    ```

    The `result.json` file now holds the full response. Extracted fields sit under `result.{field_name}`, each with a `value` and a `score`.
  </Tab>
</Tabs>

## Error Responses

Failures fall into two buckets: HTTP errors raised before the job is queued, and a `failed` status on a job that started but couldn't complete.

### HTTP Errors

The `/v2/extract` endpoint returns JSON bodies on HTTP errors, with a single `detail` field describing the problem. The common cases:

* **`401 Unauthorized`:** body is `{"detail": "Invalid API key"}`. The `api-key` header is missing or wrong.
* **`400 Bad Request` (missing file):** body is `{"detail": "Either pdf_file or file_url must be provided"}`. The `pdf_file` form field is missing.
* **`400 Bad Request` (bad JSON):** body is `{"detail": "schema_data must be valid JSON"}`. The `schema_data` field isn't parseable JSON.
* **`400 Bad Request` (unreadable PDF):** body starts with `{"detail": "Error extracting data with citations: Failed to get PDF page count..."}`. The uploaded file isn't a valid PDF.
* **`422 Unprocessable Entity`:** body lists the missing or malformed form fields. Usually thrown when `schema_data` is absent.
* **`404 Not Found`:** body is `{"detail": "Job <id> not found"}`. The `job_id` you polled doesn't exist.

### Failed Jobs

A job that was accepted but couldn't be processed comes back with `status: "failed"` on a subsequent poll. The response shape mirrors a completed one, with an `error` field describing what went wrong:

```json theme={null}
{
  "job_id": "7b31a7d7-e810-4a0b-931e-fbed0879bab2",
  "status": "failed",
  "file_name": "document.pdf",
  "error": "Failed to extract structured data from document"
}
```

## Response Shape

A completed response contains job metadata plus a `result` object with one entry per top-level field in your schema. Each entry has the extracted `value` and a `score` between 0 and 1. For arrays of objects, the array itself has a `score`, and every property inside each row carries its own score as well.

```json theme={null}
{
  "job_id": "cec5dcb5-53c6-47d5-afe7-28b2182171fb",
  "status": "completed",
  "file_name": "document.pdf",
  "file_url": "https://example-bucket.s3.amazonaws.com/...",
  "created_at": "2026-05-27T08:36:58.336535+00:00",
  "updated_at": "2026-05-27T08:37:19.436604+00:00",
  "metadata": {
    "page_count": 1,
    "order": ["vendor_name", "invoice_number", "issue_date", "total_due", "line_items"],
    "schema": { "...": "..." }
  },
  "result": {
    "vendor_name":    { "value": "Northwind Office Supplies", "score": 0.96 },
    "invoice_number": { "value": "INV-2026-00487",            "score": 0.97 },
    "issue_date":     { "value": "April 14, 2026",            "score": 0.98 },
    "total_due":      { "value": 3705.1,                      "score": 0.96 },
    "line_items": {
      "score": 0.98,
      "value": [
        {
          "description": { "value": "Ergonomic Mesh Office Chair", "score": 0.95, "citation": null },
          "quantity":    { "value": 4,                              "score": 0.96, "citation": null },
          "unit_price":  { "value": 289.0,                          "score": 0.97, "citation": null },
          "subtotal":    { "value": 1156.0,                         "score": 0.98, "citation": null }
        },
        "...four more rows..."
      ]
    }
  }
}
```

The fields you'll actually use depend on what you're building. They fall into three broad categories:

**For typed values and validation:**

* **`result.{field_name}.value`:** the extracted data, typed to match your schema (`string`, `number`, `boolean`, `object`, or `array`)
* **`result.{field_name}.score`:** confidence score between 0 and 1, higher is better. Use it to flag uncertain values for human review.
* **`result.{array_field}.value[].{property}.citation`:** reserved slot for source citations on array rows; `null` for now

**For schema and ordering:**

* **`metadata.schema`:** an echo of the schema you submitted, useful for round-tripping or auditing
* **`metadata.order`:** the original order of top-level fields in your schema, since JSON objects don't preserve insertion order across all clients
* **`metadata.page_count`:** number of pages in the uploaded document

**For job and audit tracking:**

* **`job_id`:** unique identifier for the extraction job
* **`status`:** `completed`, `failed`, or an in-progress value (`queued`, `processing`)
* **`file_name`:** name of the uploaded file
* **`file_url`:** temporary signed S3 URL to the uploaded file
* **`created_at`, `updated_at`:** ISO 8601 timestamps for submission and the most recent status change

### Sample Output

Running the script against the [sample invoice](https://raw.githubusercontent.com/Unsiloed-AI/cookbook/c585446e46e4be2790c6c29fe2a7a3a1b346191d/sample-documents/sample-extract.pdf) writes the JSON above to `result.json`. Every field comes back with its own confidence score, so flagging uncertain values becomes a per-field check rather than a re-read of the document. The fields extracted from the sample:

| Field            | Extracted value           | Confidence |
| ---------------- | ------------------------- | ---------- |
| `vendor_name`    | Northwind Office Supplies | 96%        |
| `invoice_number` | INV-2026-00487            | 97%        |
| `issue_date`     | April 14, 2026            | 98%        |
| `total_due`      | 3705.10                   | 96%        |
| `line_items`     | 5 rows                    | 98%        |

And the five rows of `line_items`, each a structured object in its own right:

| Description                             | Quantity | Unit Price | Subtotal |
| --------------------------------------- | -------- | ---------- | -------- |
| Ergonomic Mesh Office Chair             | 4        | 289.00     | 1,156.00 |
| Adjustable Standing Desk (60" x 30")    | 2        | 549.00     | 1,098.00 |
| LED Desk Lamp with USB Charging         | 6        | 42.50      | 255.00   |
| Acoustic Panel, 24" Hexagon (Pack of 4) | 3        | 78.00      | 234.00   |
| Wireless Mechanical Keyboard            | 5        | 129.99     | 649.95   |

Every cell in the table has its own score in the underlying JSON, so downstream code can flag individual uncertain values without rejecting the whole row.

## Next Steps

<Note>
  For more on extraction, including schema rules, supported types, and the full response reference, see the [Extract overview](/document-processing/extraction/extraction).
</Note>

<CardGroup cols={2}>
  <Card title="Schemas" icon="brackets-curly" href="/document-processing/extraction/schemas">
    JSON Schema rules, supported types, and worked examples for invoices and SEC filings.
  </Card>

  <Card title="Response Format" icon="square-list" href="/document-processing/extraction/response-format">
    The canonical extraction response with a field-by-field reference.
  </Card>

  <Card title="API Reference" icon="code" href="/api-reference/extraction/extract-data">
    Browse the full request and response specs for `/v2/extract`.
  </Card>

  <Card title="FAQ" icon="circle-question" href="/faq/general">
    Check limits, supported formats, and answers to common questions.
  </Card>
</CardGroup>
