# Extract Data

> Extract structured data from PDF documents using custom schemas

## Overview

The `/v2/extract` endpoint extracts structured data from PDF documents according to a JSON schema you supply. It supports four model tiers, optional bounding box citations, and page selection for large documents.

<Note>
  The endpoint returns a job ID for asynchronous processing. Use the job management endpoints to check status and retrieve results.
</Note>

## Request

<ParamField body="pdf_file" type="file">
  The PDF file to process for data extraction. Maximum file size: 500MB. Either `pdf_file` or `file_url` must be provided.
</ParamField>

<ParamField body="file_url" type="string">
  URL to a PDF file to process. Either `pdf_file` or `file_url` must be provided.
</ParamField>

<ParamField body="schema_data" type="string" required>
  JSON schema defining the structure and fields to extract from the document. Must be a valid JSON Schema format with type definitions, properties, and required fields.
</ParamField>

<ParamField body="model" type="string" default="gamma">
  Model tier to use for extraction: `alpha` (fast), `beta` (balanced), `gamma` (thorough), or `delta` (advanced).

  **Recommended: `gamma`** (default).
</ParamField>

<ParamField body="enable_citations" type="boolean" default="false">
  Return bounding box coordinates for extracted values. When enabled, each extracted field includes a `citation` object with precise location data in the source document.
</ParamField>

<ParamField body="schema_name" type="string">
  Optional name identifying the schema, stored with the job for traceability.
</ParamField>

<ParamField body="detect_pii" type="boolean" default="false">
  Run a pre-flight PII check before extraction. If PII is found at or above `pii_block_severity`, no extraction job is created and the endpoint returns HTTP 200 with `"status": "pii_blocked"` and `"blocked": true` in the body, so check those fields before polling.
</ParamField>

<ParamField body="pii_engine" type="string" default="standard">
  PII detection engine used when `detect_pii` is enabled: `standard` (default) or `advanced` (higher precision).
</ParamField>

<ParamField body="pii_block_severity" type="string" default="high">
  Severity threshold that blocks extraction when `detect_pii` is enabled: `any`, `low`, `medium`, or `high`.
</ParamField>
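
The parameters above can be assembled into a multipart form without uploading a local file. The sketch below is a minimal illustration, not an official client: `build_extract_fields` is a hypothetical helper name, and it assumes boolean fields are sent as the strings `"true"` / `"false"`, as the commented-out `enable_citations` line in the Python request example does.

```python
import json


def build_extract_fields(schema, file_url=None, model="gamma",
                         enable_citations=False, detect_pii=False,
                         pii_engine="standard", pii_block_severity="high"):
    """Build the text parts of a multipart/form-data body for POST /v2/extract.

    Hypothetical helper: field names come from the parameter list above.
    """
    fields = {
        "schema_data": json.dumps(schema),
        "model": model,
        # Assumption: booleans are sent as lowercase strings in the form body
        "enable_citations": "true" if enable_citations else "false",
        "detect_pii": "true" if detect_pii else "false",
    }
    if file_url is not None:
        fields["file_url"] = file_url
    if detect_pii:
        # The PII options only matter when the pre-flight check is enabled
        fields["pii_engine"] = pii_engine
        fields["pii_block_severity"] = pii_block_severity
    # (None, value) tuples make `requests` encode each entry as a plain
    # multipart part with no filename
    return {name: (None, value) for name, value in fields.items()}
```

The returned mapping can be passed as `files=` to `requests.post(url, headers=headers, files=fields)`. To upload a local PDF instead of using `file_url`, add a `pdf_file` entry such as `("document.pdf", file_handle, "application/pdf")` to the same mapping.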

## Response

<ResponseField name="job_id" type="string">
  Unique identifier for the extraction job
</ResponseField>

<ResponseField name="status" type="string">
  Initial job status (typically "queued")
</ResponseField>

<ResponseField name="message" type="string">
  Descriptive message about the job creation
</ResponseField>

<ResponseField name="quota_remaining" type="number">
  Number of credits remaining in your quota
</ResponseField>

### PII-Blocked Response

When `detect_pii` is enabled and PII at or above the threshold is found, the endpoint returns HTTP 200 with no job created. Check `status` and `blocked` to tell this apart from a normal submission. The `reason` is `"pii_detected"` when `pii_block_severity` is `any`, or `"pii_detected:<severity>"` for the other thresholds:

```json theme={null}
{
  "job_id": null,
  "status": "pii_blocked",
  "blocked": true,
  "reason": "pii_detected:high",
  "severity": "high",
  "engine": "standard",
  "ocr_used": true,
  "total_findings": 4,
  "by_category": {"financial": 2, "contact": 2},
  "by_type": {"CREDIT_CARD": 2, "EMAIL_ADDRESS": 2},
  "summary": null,
  "spans": []
}
```
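
Because a blocked submission also returns HTTP 200, client code has to inspect the body before polling. A minimal sketch of that check follows; `classify_submission` is a hypothetical helper name, and it relies only on the `status`, `blocked`, `reason`, and `job_id` fields shown above.

```python
def classify_submission(body):
    """Classify a 200 response body from POST /v2/extract.

    Returns ("blocked", reason) when the PII pre-flight check stopped the
    job, otherwise ("submitted", job_id).
    """
    if body.get("blocked") is True or body.get("status") == "pii_blocked":
        return ("blocked", body.get("reason"))
    return ("submitted", body["job_id"])


blocked_body = {
    "job_id": None,
    "status": "pii_blocked",
    "blocked": True,
    "reason": "pii_detected:high",
}
outcome, detail = classify_submission(blocked_body)
if outcome == "blocked":
    print(f"No job created: {detail}")
else:
    print(f"Poll job {detail}")
```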

## Extraction Results Format

Once the job completes, the results contain one entry per schema field. Each entry holds the extracted value, its confidence scores, and an optional citation:

<ResponseField name="[field_name]" type="object">
  Each extracted field returns an object with the following structure:
</ResponseField>

<ResponseField name="[field_name].value" type="string|number|array|object">
  The extracted value matching the schema type
</ResponseField>

<ResponseField name="[field_name].score" type="object">
  Confidence scores for the field:

  <Expandable title="score_structure">
    <ResponseField name="grounding_score" type="number">
      Confidence (0-1) that the value was located in the document. `0.0` when citations are disabled.
    </ResponseField>

    <ResponseField name="extraction_score" type="number | null">
      Confidence (0-1) in the extracted value itself.
    </ResponseField>
  </Expandable>
</ResponseField>

<ResponseField name="[field_name].citation" type="object | null">
  Where the value was found. `null` when citations are disabled or the value could not be grounded:

  <Expandable title="citation_structure">
    <ResponseField name="bbox" type="array">
      Bounding box `[left, top, right, bottom]` in PDF point space (origin: top-left).
    </ResponseField>

    <ResponseField name="page" type="number">
      Page number where the value was found (1-indexed).
    </ResponseField>

    <ResponseField name="page_width" type="number">
      Page width in PDF points.
    </ResponseField>

    <ResponseField name="page_height" type="number">
      Page height in PDF points.
    </ResponseField>
  </Expandable>
</ResponseField>

### Example Extraction Results

For a simple financial document schema:

```json Schema theme={null}
{
  "type": "object",
  "properties": {
    "Individuals": {
      "type": "string",
      "description": "Percentage Holding"
    },
    "LIC of India": {
      "type": "string",
      "description": "No of Shares Held"
    },
    "United bank of india": {
      "type": "string",
      "description": "No of shares held by United bank of india"
    }
  },
  "required": [
    "Individuals",
    "LIC of India",
    "United bank of india"
  ],
  "additionalProperties": false
}
```

```json Extraction Results theme={null}
{
  "Individuals": {
    "value": "10.57",
    "score": {
      "grounding_score": 0.97,
      "extraction_score": 0.99
    },
    "citation": {
      "bbox": [79, 381, 524, 565],
      "page": 2,
      "page_width": 612,
      "page_height": 792
    }
  },
  "LIC of India": {
    "value": "1515000",
    "score": {
      "grounding_score": 0.98,
      "extraction_score": 0.99
    },
    "citation": {
      "bbox": [79, 381, 524, 565],
      "page": 2,
      "page_width": 612,
      "page_height": 792
    }
  },
  "United bank of india": {
    "value": "500000",
    "score": {
      "grounding_score": 0.99,
      "extraction_score": 0.99
    },
    "citation": {
      "bbox": [79, 381, 524, 565],
      "page": 2,
      "page_width": 612,
      "page_height": 792
    }
  }
}
```
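
One way to use the scores is to route uncertain fields to manual review. The sketch below assumes the nested result shape shown above; `low_confidence_fields` is a hypothetical helper, and the `0.9` threshold is an arbitrary illustration rather than a recommended value.

```python
def low_confidence_fields(results, threshold=0.9):
    """Return the names of fields whose extraction_score is missing or below threshold."""
    flagged = []
    for name, field in results.items():
        score = field.get("score")
        if not isinstance(score, dict):
            # Not the nested shape described above; leave it to the caller
            continue
        extraction = score.get("extraction_score")
        if extraction is None or extraction < threshold:
            flagged.append(name)
    return flagged


results = {
    "Individuals": {
        "value": "10.57",
        "score": {"grounding_score": 0.97, "extraction_score": 0.99},
    },
    "LIC of India": {
        "value": "1515000",
        "score": {"grounding_score": 0.40, "extraction_score": 0.62},
    },
}
print(low_confidence_fields(results))
```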

<RequestExample>
  ```bash cURL theme={null}
  curl -X POST "https://prod.visionapi.unsiloed.ai/v2/extract" \
    -H "accept: application/json" \
    -H "api-key: your-api-key" \
    -H "Content-Type: multipart/form-data" \
    -F "pdf_file=@document.pdf;type=application/pdf" \
    -F "schema_data={\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\",\"description\":\"Document title\"},\"date\":{\"type\":\"string\",\"description\":\"Document date\"}},\"required\":[\"title\",\"date\"],\"additionalProperties\":false}"
  ```

  ```python Python theme={null}
  import requests
  import json

  url = "https://prod.visionapi.unsiloed.ai/v2/extract"
  headers = {
      "accept": "application/json",
      "api-key": "your-api-key"
  }

  # Define extraction schema using JSON Schema format
  schema = {
      "type": "object",
      "properties": {
          "title": {
              "type": "string",
              "description": "Document title"
          },
          "date": {
              "type": "string",
              "description": "Document date"
          }
      },
      "required": ["title", "date"],
      "additionalProperties": False
  }

  data = {
      "schema_data": json.dumps(schema)
  }

  # Optional: Enable citations
  # data["enable_citations"] = "true"

  # Optional: Select model tier (default: gamma)
  # data["model"] = "gamma"

  # Open the file in a context manager so it is closed even if the request raises
  with open("document.pdf", "rb") as pdf:
      files = {"pdf_file": ("document.pdf", pdf, "application/pdf")}
      response = requests.post(url, headers=headers, files=files, data=data)

  if response.status_code == 200:
      result = response.json()
      print(f"Extraction job created: {result['job_id']}")
      print(f"Status: {result['status']}")
      print(f"Quota remaining: {result['quota_remaining']}")
  else:
      print("Error:", response.status_code, response.text)
  ```

  ```javascript JavaScript theme={null}
  const formData = new FormData();
  formData.append('pdf_file', fileInput.files[0]);

  const schema = {
    type: "object",
    properties: {
      title: {
        type: "string",
        description: "Document title"
      },
      date: {
        type: "string",
        description: "Document date"
      }
    },
    required: ["title", "date"],
    additionalProperties: false
  };

  formData.append('schema_data', JSON.stringify(schema));

  // Optional: Enable citations
  // formData.append('enable_citations', 'true');

  // Optional: Select model tier (default: gamma)
  // formData.append('model', 'gamma');

  const response = await fetch('https://prod.visionapi.unsiloed.ai/v2/extract', {
    method: 'POST',
    headers: {
      'accept': 'application/json',
      'api-key': 'your-api-key'
    },
    body: formData
  });

  if (response.ok) {
    const result = await response.json();
    console.log(`Extraction job created: ${result.job_id}`);
    console.log(`Status: ${result.status}`);
    console.log(`Quota remaining: ${result.quota_remaining}`);

    // Poll for job completion using GET /extract/{job_id}

  } else {
    console.error('Extraction failed:', response.status, await response.text());
  }
  ```
</RequestExample>

<ResponseExample>
  ```json Success Response theme={null}
  {
    "job_id": "945b4578-691f-4c74-8184-dde654093b11",
    "status": "queued",
    "message": "PDF citation processing started",
    "quota_remaining": 48988
  }
  ```

  ```json Error Response theme={null}
  {
    "detail": "Either pdf_file or file_url must be provided"
  }
  ```
</ResponseExample>

## Citations

The `enable_citations` parameter controls whether bounding box coordinates are returned with extracted data. Citations provide references back to the source document, allowing you to trace where each extracted value was found.

### With Citations Enabled

When `enable_citations` is set to `true`, each extracted field includes a `citation` object with precise location data (or `null` when the value could not be grounded):

```json theme={null}
{
  "invoice_number": {
    "value": "INV-2025-001",
    "score": {
      "grounding_score": 0.97,
      "extraction_score": 0.99
    },
    "citation": {
      "bbox": [139, 209, 280, 222],
      "page": 1,
      "page_width": 595,
      "page_height": 842
    }
  }
}
```

**Bbox coordinate system:**

* `bbox`: `[left, top, right, bottom]` in PDF point space (origin: top-left)
* Standard A4 page = 595 x 842 points
* `page_width` / `page_height` included for scaling to any display size
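
To draw a highlight on a rendered page image, the bounding box can be scaled from PDF points to display pixels using the page dimensions in the citation. A minimal sketch, assuming the rendered image uses the same top-left origin; `scale_bbox` is a hypothetical helper name.

```python
def scale_bbox(citation, display_width, display_height):
    """Scale a citation bbox from PDF points to display pixels.

    Returns [left, top, right, bottom] in the display's coordinate space.
    """
    scale_x = display_width / citation["page_width"]
    scale_y = display_height / citation["page_height"]
    left, top, right, bottom = citation["bbox"]
    return [left * scale_x, top * scale_y, right * scale_x, bottom * scale_y]


citation = {
    "bbox": [139, 209, 280, 222],
    "page": 1,
    "page_width": 595,
    "page_height": 842,
}
# An A4 page rendered at twice its point size (1190 x 1684 pixels)
print(scale_bbox(citation, 1190, 1684))  # [278.0, 418.0, 560.0, 444.0]
```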

### Without Citations (Default)

When `enable_citations` is `false` (default), no grounding pass runs. On the default `gamma` tier each field still uses the nested shape, with `grounding_score: 0.0` and `citation: null`. On the `alpha`, `beta`, and `delta` tiers the response uses a legacy flat shape instead: `{"value": ..., "score": <number>}` with no `citation` key. The nested `gamma` shape looks like this:

```json theme={null}
{
  "invoice_number": {
    "value": "INV-2025-001",
    "score": {
      "grounding_score": 0.0,
      "extraction_score": 0.97
    },
    "citation": null
  }
}
```

<Note>
  Set `enable_citations` to `true` when you need to trace extracted values back to their exact location in the document, such as for UI highlighting or audit trails.
</Note>
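
Since the field shape depends on the tier when citations are off, client code that supports several tiers may want to normalize both shapes before use. The sketch below is one possible approach; `normalize_field` is a hypothetical helper, and treating the flat `score` as an extraction confidence is an assumption that this page does not confirm.

```python
def normalize_field(field):
    """Convert a nested or legacy flat field into one common dictionary."""
    score = field.get("score")
    if isinstance(score, dict):
        # Nested shape: scores and citation are already separated
        return {
            "value": field.get("value"),
            "extraction_score": score.get("extraction_score"),
            "grounding_score": score.get("grounding_score"),
            "citation": field.get("citation"),
        }
    # Legacy flat shape: a bare number and no citation key.
    # Assumption: the bare number is comparable to extraction_score.
    return {
        "value": field.get("value"),
        "extraction_score": score,
        "grounding_score": None,
        "citation": None,
    }


nested = {
    "value": "INV-2025-001",
    "score": {"grounding_score": 0.0, "extraction_score": 0.97},
    "citation": None,
}
flat = {"value": "INV-2025-001", "score": 0.97}
print(normalize_field(nested)["extraction_score"], normalize_field(flat)["extraction_score"])
```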

## JSON Schema Definition

The `schema_data` parameter must be a valid JSON Schema that defines the structure of data to extract. All schemas must follow the JSON Schema specification with proper type definitions, properties, and constraints.

### Basic Schema Structure

All extraction schemas must include:

* `type`: "object" (root level)
* `properties`: Object defining the fields to extract
* `required`: Array of required field names
* `additionalProperties`: Set to `false` for strict validation
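
The four requirements above can be checked locally before submitting a job. This is a minimal sketch covering only the root-level rules listed here, not a full JSON Schema validator; `check_extraction_schema` is a hypothetical helper name.

```python
def check_extraction_schema(schema):
    """Return a list of problems with the root level of an extraction schema."""
    problems = []
    if schema.get("type") != "object":
        problems.append('root "type" must be "object"')

    properties = schema.get("properties")
    if not isinstance(properties, dict):
        problems.append('"properties" must be an object')
        properties = {}

    required = schema.get("required")
    if not isinstance(required, list):
        problems.append('"required" must be an array of field names')
    else:
        for name in required:
            if name not in properties:
                problems.append(f'required field "{name}" is not defined in "properties"')

    if schema.get("additionalProperties") is not False:
        problems.append('"additionalProperties" should be false')
    return problems


schema = {
    "type": "object",
    "properties": {"title": {"type": "string", "description": "Document title"}},
    "required": ["title", "date"],
}
for problem in check_extraction_schema(schema):
    print(problem)
```

An empty list means the root level satisfies the rules above. Nested objects and array items are not inspected by this sketch.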

### Financial Document Schema Example

This example demonstrates extracting shareholding patterns and board information from financial documents:

```json theme={null}
{
  "type": "object",
  "properties": {
    "Individuals": {
      "type": "string",
      "description": "Percentage Holding"
    },
    "LIC of India": {
      "type": "number",
      "description": "No of Shares Held"
    },
    "board of directors": {
      "type": "array",
      "description": "list of names of board of directors",
      "items": {
        "type": "object",
        "required": [
          "names of board of directors"
        ],
        "properties": {
          "names of board of directors": {
            "type": "string",
            "description": "names of all the members of board of directors of ACRE"
          }
        },
        "additionalProperties": false
      }
    },
    "shareholding pattern": {
      "type": "array",
      "description": "shareholding pattern",
      "items": {
        "type": "object",
        "required": [
          "name of shareholders",
          "number of shares held"
        ],
        "properties": {
          "name of shareholders": {
            "type": "string",
            "description": "name of the shareholders in ACRE Table"
          },
          "number of shares held": {
            "type": "string",
            "description": "numbers of shares held by shareholders in ACRE Table"
          }
        },
        "additionalProperties": false
      }
    }
  },
  "required": [
    "Individuals",
    "LIC of India",
    "board of directors",
    "shareholding pattern"
  ],
  "additionalProperties": false
}
```

### Advanced Financial Schema Example

This example shows a more complex schema for extracting detailed shareholding information:

```json theme={null}
{
  "type": "object",
  "properties": {
    "shares held by Punjab National bank": {
      "type": "string",
      "description": "shares held by Punjab National bank"
    },
    "shares held by IFCI": {
      "type": "string",
      "description": "shares held by IFCI"
    },
    "shareholding pattern": {
      "type": "object",
      "description": "shareholding pattern",
      "properties": {
        "Percentage holding": {
          "type": "array",
          "description": "percentage holding of shareholders in ACRE",
          "items": {
            "type": "string",
            "description": "percentage holding of shareholders in ACRE"
          }
        },
        "Name of shareholders": {
          "type": "array",
          "description": "Names of shareholders in ACRE",
          "items": {
            "type": "string",
            "description": "Names of shareholders in ACRE"
          }
        }
      },
      "required": ["Percentage holding", "Name of shareholders"],
      "additionalProperties": false
    },
    "names of board of directors": {
      "type": "array",
      "description": "list of names of members of board of directors in ACRE",
      "items": {
        "type": "object",
        "properties": {
          "names of board of directors": {
            "type": "string",
            "description": "list of names of members of board of directors in ACRE"
          }
        },
        "required": ["names of board of directors"],
        "additionalProperties": false
      }
    }
  },
  "required": [
    "shares held by Punjab National bank", 
    "shares held by IFCI", 
    "shareholding pattern", 
    "names of board of directors"
  ],
  "additionalProperties": false
}
```

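### Citation Extraction Schema

This example extracts bibliographic metadata from academic papers. It is unrelated to the `enable_citations` parameter: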

```json theme={null}
{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Document title or paper title"
    },
    "authors": {
      "type": "array",
      "description": "List of author names",
      "items": {
        "type": "string"
      }
    },
    "publication_date": {
      "type": "string",
      "description": "Publication date in YYYY-MM-DD format"
    },
    "journal_name": {
      "type": "string",
      "description": "Name of journal or publication venue"
    },
    "doi": {
      "type": "string",
      "description": "Digital Object Identifier"
    },
    "abstract": {
      "type": "string",
      "description": "Document abstract or summary"
    },
    "keywords": {
      "type": "array",
      "description": "Key terms and subject keywords",
      "items": {
        "type": "string"
      }
    },
    "references": {
      "type": "array",
      "description": "List of cited references",
      "items": {
        "type": "string"
      }
    }
  },
  "required": ["title", "authors"],
  "additionalProperties": false
}
```

### Legal Document Schema

```json theme={null}
{
  "type": "object",
  "properties": {
    "document_type": {
      "type": "string",
      "description": "Type of legal document (contract, agreement, etc.)"
    },
    "parties": {
      "type": "array",
      "description": "Names of parties involved",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "description": "Party name"
          },
          "role": {
            "type": "string",
            "description": "Party role (e.g., buyer, seller, contractor)"
          }
        },
        "required": ["name", "role"],
        "additionalProperties": false
      }
    },
    "effective_date": {
      "type": "string",
      "description": "Document effective date"
    },
    "key_terms": {
      "type": "array",
      "description": "Important terms and conditions",
      "items": {
        "type": "string"
      }
    },
    "obligations": {
      "type": "array",
      "description": "Key obligations and responsibilities",
      "items": {
        "type": "object",
        "properties": {
          "party": {
            "type": "string",
            "description": "Party responsible for the obligation"
          },
          "obligation": {
            "type": "string",
            "description": "Description of the obligation"
          }
        },
        "required": ["party", "obligation"],
        "additionalProperties": false
      }
    }
  },
  "required": ["document_type", "parties", "effective_date"],
  "additionalProperties": false
}
```

## JSON Schema Field Types

<ResponseField name="string" type="string">
  Text content, single values. Use for names, descriptions, dates as text.
</ResponseField>

<ResponseField name="number" type="number">
  Numeric values, including decimals. Use for amounts, percentages, monetary values.
</ResponseField>

<ResponseField name="integer" type="integer">
  Whole numbers only. Use for counts, IDs, years.
</ResponseField>

<ResponseField name="boolean" type="boolean">
  True/false values. Use for yes/no questions, flags.
</ResponseField>

<ResponseField name="array" type="array">
  Lists of items. Must include `items` property defining the type of array elements.
</ResponseField>

<ResponseField name="object" type="object">
  Structured data with nested fields. Must include `properties` defining nested structure.
</ResponseField>

<ResponseField name="null" type="null">
  Null values. Can be combined with other types using array notation: `["string", "null"]`
</ResponseField>
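
As an illustration of the array notation for `null`, the sketch below builds a schema in which one field may come back empty. The field names and the `nullable` helper are hypothetical, and how the endpoint fills a nullable field for a value that is absent from the document is not specified on this page.

```python
import json


def nullable(json_type, description):
    """Return a field definition that accepts json_type or null."""
    return {"type": [json_type, "null"], "description": description}


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice number"},
        "purchase_order": nullable("string", "Purchase order number, if present"),
    },
    "required": ["invoice_number", "purchase_order"],
    "additionalProperties": False,
}

# Serialize for the schema_data form field
schema_data = json.dumps(schema)
print(schema_data)
```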

## Job Management Integration

After creating an extraction job, you can poll for completion using the job status endpoints:

```python theme={null}
import requests
import time

# After creating the extraction job, you receive a job_id
job_id = "945b4578-691f-4c74-8184-dde654093b11"

headers = {
    "accept": "application/json",
    "api-key": "your-api-key"
}

# Poll for job completion
while True:
    response = requests.get(
        f"https://prod.visionapi.unsiloed.ai/extract/{job_id}",
        headers=headers
    )

    if response.status_code == 200:
        result = response.json()
        status = result.get("status", "").lower()
        print(f"Job status: {status}")

        if status == "completed":
            print("Extraction completed!")
            print("Extracted data:", result.get("result"))
            break
        elif status == "failed":
            print(f"Job failed: {result.get('error', 'Unknown error')}")
            break
    else:
        print(f"Error checking status: {response.status_code}")
        break

    time.sleep(5)  # Wait 5 seconds before checking again
```
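
The loop above runs until the job reaches a terminal status. A bounded variant with an overall timeout might look like the sketch below. `poll_until_done` and its parameters are hypothetical; `fetch_status` stands for any callable that performs the status request and returns the parsed JSON body, and the 300-second default is an arbitrary choice.

```python
import time


def poll_until_done(fetch_status, timeout=300.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the job completes or fails, or until timeout.

    Returns the final response body, or raises TimeoutError.
    """
    deadline = clock() + timeout
    while True:
        body = fetch_status()
        status = str(body.get("status", "")).lower()
        if status in ("completed", "failed"):
            return body
        # Give up if the next wait would run past the deadline
        if clock() + interval > deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        sleep(interval)


# Demonstration with canned responses instead of real HTTP calls
responses = iter([
    {"status": "queued"},
    {"status": "queued"},
    {"status": "completed", "result": {"title": {"value": "Annual Report"}}},
])
final = poll_until_done(lambda: next(responses), sleep=lambda seconds: None)
print(final["status"])
```

In real use, `fetch_status` would wrap the `requests.get(...)` call from the loop above and return `response.json()`.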

## Advanced Schema Patterns

### Nested Object Structures

For complex documents with hierarchical data:

```json theme={null}
{
  "type": "object",
  "properties": {
    "company_info": {
      "type": "object",
      "description": "Company identification and basic information",
      "properties": {
        "name": {
          "type": "string",
          "description": "Full company name"
        },
        "ticker": {
          "type": "string",
          "description": "Stock ticker symbol"
        },
        "sector": {
          "type": "string",
          "description": "Business sector"
        }
      },
      "required": ["name"],
      "additionalProperties": false
    },
    "financial_data": {
      "type": "object",
      "description": "Financial metrics and performance data",
      "properties": {
        "revenue": {
          "type": "number",
          "description": "Total revenue"
        },
        "profit_margin": {
          "type": "number",
          "description": "Profit margin percentage"
        }
      },
      "required": ["revenue"],
      "additionalProperties": false
    }
  },
  "required": ["company_info", "financial_data"],
  "additionalProperties": false
}
```

### Array of Complex Objects

For extracting lists of structured data:

```json theme={null}
{
  "type": "object",
  "properties": {
    "transactions": {
      "type": "array",
      "description": "List of financial transactions",
      "items": {
        "type": "object",
        "properties": {
          "date": {
            "type": "string",
            "description": "Transaction date"
          },
          "amount": {
            "type": "number",
            "description": "Transaction amount"
          },
          "description": {
            "type": "string",
            "description": "Transaction description"
          },
          "category": {
            "type": "string",
            "description": "Transaction category"
          }
        },
        "required": ["date", "amount", "description"],
        "additionalProperties": false
      }
    }
  },
  "required": ["transactions"],
  "additionalProperties": false
}
```

## Error Handling

<ResponseField name="400" type="Bad Request">
  Invalid JSON schema format or missing required parameters
</ResponseField>

<ResponseField name="401" type="Unauthorized">
  Invalid or missing API key
</ResponseField>

<ResponseField name="402" type="Payment Required">
  Insufficient or expired credits
</ResponseField>

<ResponseField name="413" type="Payload Too Large">
  File size exceeds the 500MB limit
</ResponseField>

<ResponseField name="422" type="Unprocessable Entity">
  Invalid file format, malformed JSON schema, or processing error
</ResponseField>

<ResponseField name="429" type="Too Many Requests">
  Rate limit exceeded
</ResponseField>

<ResponseField name="500" type="Internal Server Error">
  Server error during processing
</ResponseField>
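
Client code often separates errors worth retrying from errors that need a changed request. The sketch below treats `429` and `500` as retryable and the rest as permanent; that split is a common convention assumed here, not a policy stated on this page, and both helper names and the delay values are hypothetical.

```python
# Assumption: rate limits and server errors are transient, the others are not
RETRYABLE_STATUS_CODES = {429, 500}


def should_retry(status_code):
    """Return True when resubmitting the same request may succeed later."""
    return status_code in RETRYABLE_STATUS_CODES


def backoff_delays(attempts, base=1.0, cap=30.0):
    """Return exponential backoff delays in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(attempts)]


for code in (400, 401, 402, 413, 422, 429, 500):
    action = "retry later" if should_retry(code) else "fix the request"
    print(code, action)
```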


## OpenAPI

````yaml api-reference/openapi.json POST /v2/extract
openapi: 3.1.0
info:
  title: Unsiloed AI Document Processing API
  description: >-
    A comprehensive API for document processing, extraction, and analysis using
    AI-powered tools
  license:
    name: MIT
  version: 1.0.0
servers:
  - url: https://prod.visionapi.unsiloed.ai
    description: Production server
security:
  - apiKeyAuth: []
paths:
  /v2/extract:
    post:
      summary: Extract Data
      description: >-
        Extract structured data from PDF documents using custom schemas with
        vision-language models. Supports multiple model tiers, optional bounding
        box citations, and intelligent page selection for large documents.
      requestBody:
        required: true
        content:
          multipart/form-data:
            schema:
              type: object
              properties:
                pdf_file:
                  type: string
                  format: binary
                  description: >-
                    The PDF file to process for data extraction. Maximum file
                    size: 500MB. Either pdf_file or file_url must be provided.
                file_url:
                  type: string
                  description: >-
                    URL to a PDF file to process. Either pdf_file or file_url
                    must be provided.
                schema_data:
                  type: string
                  description: >-
                    JSON schema defining the structure and fields to extract
                    from the document. Example:
                    {"type":"object","properties":{"invoice_number":{"type":"string","description":"The
                    invoice
                    number"}},"required":["invoice_number"],"additionalProperties":false}
                model:
                  type: string
                  description: >-
                    Model tier to use for extraction. Options: alpha, beta,
                    gamma (default, recommended), delta
                  default: gamma
                  enum:
                    - alpha
                    - beta
                    - gamma
                    - delta
                enable_citations:
                  type: boolean
                  description: Return bounding box coordinates for extracted values
                  default: false
                schema_name:
                  anyOf:
                    - type: string
                    - type: 'null'
                  title: Schema Name
                  description: >-
                    Optional name identifying the schema, stored with the job
                    for traceability.
                detect_pii:
                  type: boolean
                  title: Detect Pii
                  description: >-
                    Run a pre-flight PII check before extraction. If PII is
                    found at or above pii_block_severity, no extraction job is
                    created and the endpoint returns HTTP 200 with "status":
                    "pii_blocked" and "blocked": true in the body.
                  default: false
                pii_engine:
                  type: string
                  title: Pii Engine
                  description: >-
                    PII detection engine: 'standard' (default) or 'advanced'
                    (higher precision).
                  default: standard
                  enum:
                    - standard
                    - advanced
                pii_block_severity:
                  type: string
                  title: Pii Block Severity
                  description: >-
                    Severity threshold that blocks extraction when detect_pii is
                    enabled: 'any', 'low', 'medium', or 'high'.
                  default: high
                  enum:
                    - any
                    - low
                    - medium
                    - high
              required:
                - schema_data
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                type: object
                properties:
                  job_id:
                    type: string
                    description: Unique identifier for the extraction job
                  status:
                    type: string
                    description: Initial job status (typically 'queued')
                  message:
                    type: string
                    description: Descriptive message about the job creation
                  quota_remaining:
                    type: number
                    description: Number of Credits remaining in your quota
      security:
        - apiKeyAuth: []
components:
  securitySchemes:
    apiKeyAuth:
      type: apiKey
      in: header
      name: api-key

````