Extract Data

Overview

The Extract Data endpoint processes PDF documents and extracts structured information based on custom schemas. This is ideal for extracting specific data points, citations, references, and structured content from documents using AI-powered analysis.

The endpoint returns a job ID for asynchronous processing. Use the job management endpoints to check status and retrieve results.

Request

pdf_file

file

required

The PDF file to process for data extraction. Maximum file size: 100MB

schema_data

string

required

JSON schema defining the structure and fields to extract from the document. Must be a valid JSON Schema format with type definitions, properties, and required fields.

citation_config

string

JSON object controlling citation detail level. Configure which citation references are returned with extracted data. Properties:

semantic_citations (boolean): Enable segment-level citations. Default: false
ocr_citations (boolean): Enable word-level citations with OCR coordinates. Default: false

Example:

{
  "semantic_citations": true,
  "ocr_citations": true
}

When to use:

Set both to true for segment-level and word-level citations
Set semantic_citations: true, ocr_citations: false for segment-level citations only
Set both to false (default) when citations aren’t needed
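Because citation_config is typed as a string, the JSON object has to be serialized before it is sent as a form field. A minimal Python sketch of that step, using only the standard json module; the helper name build_citation_config is illustrative and not part of the API:

```python
import json


def build_citation_config(semantic: bool = False, ocr: bool = False) -> str:
    """Serialize citation settings into the JSON string sent as citation_config."""
    return json.dumps({"semantic_citations": semantic, "ocr_citations": ocr})


# Segment-level citations only
print(build_citation_config(semantic=True))
# {"semantic_citations": true, "ocr_citations": false}
```

Calling the helper with no arguments reproduces the documented defaults (both options false).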

Response

job_id

string

Unique identifier for the extraction job

status

string

Initial job status (typically “queued”)

message

string

Descriptive message about the job creation

quota_remaining

number

Number of Credits remaining in your quota

Extraction Results Format

Once the job is completed, the results will contain the extracted data with additional metadata:

[field_name]

object

Each extracted field returns an object with the following structure:

[field_name].value

string|number|array|object

The extracted value matching the schema type

[field_name].score

number

Confidence score between 0 and 1 indicating extraction accuracy

[field_name].bboxes

array

Array of bounding box coordinates where the data was found in the document. Only included when semantic_citations or ocr_citations is enabled in citation_config.

[field_name].bboxes[].bbox

array

Bounding box coordinates [x1, y1, x2, y2] in pixels. Only included when citations are enabled.

[field_name].page_no

number

Page number where the data was extracted (1-indexed). Only included when citations are enabled via citation_config.

min_confidence_score

number

Minimum confidence score across all extracted fields
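UI highlighting code often wants an origin plus width and height rather than two corner points. The sketch below converts a bbox of the form [x1, y1, x2, y2] into that shape. It assumes (x1, y1) and (x2, y2) are opposite corners with x2 >= x1 and y2 >= y1, which this page does not state explicitly; Rect and bbox_to_rect are hypothetical names.

```python
from typing import NamedTuple, Sequence


class Rect(NamedTuple):
    x: float
    y: float
    width: float
    height: float


def bbox_to_rect(bbox: Sequence[float]) -> Rect:
    """Convert [x1, y1, x2, y2] pixel coordinates into origin plus size."""
    x1, y1, x2, y2 = bbox
    return Rect(x=x1, y=y1, width=x2 - x1, height=y2 - y1)


print(bbox_to_rect([79, 381, 524, 565]))
# Rect(x=79, y=381, width=445, height=184)
```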

Example Extraction Results

For a simple financial document schema:

Schema

{
  "type": "object",
  "properties": {
    "Individuals": {
      "type": "string",
      "description": "Percentage Holding"
    },
    "LIC of India": {
      "type": "string",
      "description": "No of Shares Held"
    },
    "United bank of india": {
      "type": "string",
      "description": "No of shares held by United bank of india"
    }
  },
  "required": [
    "Individuals",
    "LIC of India",
    "United bank of india"
  ],
  "additionalProperties": false
}

Extraction Results

{
  "Individuals": {
    "score": 0.9998314521743098,
    "value": "10.57",
    "bboxes": [
      {
        "bbox": [
          79,
          381,
          524,
          565
        ]
      }
    ],
    "page_no": 2
  },
  "LIC of India": {
    "score": 0.9999889986487799,
    "value": "1515000",
    "bboxes": [
      {
        "bbox": [
          79,
          381,
          524,
          565
        ]
      }
    ],
    "page_no": 2
  },
  "United bank of india": {
    "score": 0.999984548437705,
    "value": "500000",
    "bboxes": [
      {
        "bbox": [
          79,
          381,
          524,
          565
        ]
      }
    ],
    "page_no": 2
  },
  "min_confidence_score": 0.9998314521743098
}
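One way to consume results like these is to walk the field objects and flag anything below a confidence threshold for manual review. The following is a sketch, not an official client: the function name and the threshold value are arbitrary choices, and it assumes every extracted field is an object with a score key as described above.

```python
from typing import Any, Dict, List


def low_confidence_fields(results: Dict[str, Any], threshold: float) -> List[str]:
    """Return names of extracted fields whose score falls below the threshold."""
    flagged = []
    for name, field in results.items():
        # min_confidence_score is a top-level number, not a field object
        if name == "min_confidence_score" or not isinstance(field, dict):
            continue
        if field.get("score", 0.0) < threshold:
            flagged.append(name)
    return flagged


# Trimmed copy of the example results above (citation fields omitted)
results = {
    "Individuals": {"score": 0.9998314521743098, "value": "10.57"},
    "LIC of India": {"score": 0.9999889986487799, "value": "1515000"},
    "United bank of india": {"score": 0.999984548437705, "value": "500000"},
    "min_confidence_score": 0.9998314521743098,
}

print(low_confidence_fields(results, threshold=0.9999))
# ['Individuals']
```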

curl -X POST "https://prod.visionapi.unsiloed.ai/extract" \
  -H "accept: application/json" \
  -H "api-key: your-api-key" \
  -H "Content-Type: multipart/form-data" \
  -F "pdf_file=@your-document.pdf;type=application/pdf" \
  -F "schema_data={\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\",\"description\":\"Document title\"},\"date\":{\"type\":\"string\",\"description\":\"Document date\"}},\"required\":[\"title\",\"date\"],\"additionalProperties\":false}"

{
  "job_id": "945b4578-691f-4c74-8184-dde654093b11",
  "status": "queued",
  "message": "PDF citation processing started",
  "quota_remaining": 48988
}
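The same request can be sketched in Python with the requests library, which the polling example on this page also uses. The URL, header names, and form field names are taken from the curl example above; the function names, the 60-second timeout, and the file path passed in are assumptions for illustration. requests generates the multipart Content-Type header (including the boundary) itself, so it is not set manually here.

```python
import json
import os
from typing import Any, Dict, Optional


def build_form_fields(schema: Dict[str, Any],
                      citation_config: Optional[Dict[str, bool]] = None) -> Dict[str, str]:
    """Serialize the schema (and optional citation settings) into string form fields."""
    fields = {"schema_data": json.dumps(schema)}
    if citation_config is not None:
        fields["citation_config"] = json.dumps(citation_config)
    return fields


def submit_extraction(pdf_path: str, api_key: str, schema: Dict[str, Any],
                      citation_config: Optional[Dict[str, bool]] = None) -> Dict[str, Any]:
    """POST the PDF and schema to /extract and return the parsed job response."""
    import requests  # third-party; imported here so build_form_fields works without it

    with open(pdf_path, "rb") as pdf:
        response = requests.post(
            "https://prod.visionapi.unsiloed.ai/extract",
            headers={"accept": "application/json", "api-key": api_key},
            files={"pdf_file": (os.path.basename(pdf_path), pdf, "application/pdf")},
            data=build_form_fields(schema, citation_config),
            timeout=60,  # arbitrary value; adjust for large uploads
        )
    response.raise_for_status()
    return response.json()


# Example call (requires a real file and API key):
# job = submit_extraction("your-document.pdf", "your-api-key", schema)
# print(job["job_id"], job["status"])
```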

Citation Configuration

The citation_config parameter controls the level of citation detail returned with extracted data. Citations provide references back to the source document, allowing you to trace where each extracted value was found.

Configuration Options

semantic_citations

boolean

default: false

Enabled (true):

Returns segment-level citations with bounding boxes
Provides contextual text spans for each extracted field
Useful for tracing data back to document sections

Disabled (false):

No segment-level citations included
Smaller response payload

ocr_citations

boolean

default: false

Enabled (true):

Returns word-level citations with precise coordinates
Includes OCR-level bounding boxes for individual words
Enables character-level text mapping

Disabled (false):

No word-level citations included
Faster processing and smaller response size

Use Cases

Segment-Level and Word-Level Citations

{
  "semantic_citations": true,
  "ocr_citations": true
}

Use when you need both segment-level context and word-level precision for detailed auditing or UI highlighting.

Segment-Level Citations Only

{
  "semantic_citations": true,
  "ocr_citations": false
}

Use when you need to trace data back to document sections without word-level detail. Balanced performance with moderate response size.

No Citations (Values Only)

{
  "semantic_citations": false,
  "ocr_citations": false
}

Use when you only need extracted values without citation references. Fastest processing with minimal response payload.

When both semantic_citations and ocr_citations are set to false (default), the response will only contain value and score for each field. The bboxes and page_no fields will be omitted.
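Since bboxes and page_no may be absent, client code is safer treating them as optional. A small sketch of that pattern follows; field_citations is a hypothetical helper and the score values in the sample data are made up.

```python
from typing import Any, Dict, List, Optional, Tuple


def field_citations(field: Dict[str, Any]) -> Tuple[Optional[int], List[List[float]]]:
    """Return (page_no, bbox lists); (None, []) when citations are absent."""
    page_no = field.get("page_no")
    boxes = [entry["bbox"] for entry in field.get("bboxes", []) if "bbox" in entry]
    return page_no, boxes


# Illustrative field objects, with and without citation data
with_citations = {"score": 0.99, "value": "10.57",
                  "bboxes": [{"bbox": [79, 381, 524, 565]}], "page_no": 2}
without_citations = {"score": 0.99, "value": "10.57"}

print(field_citations(with_citations))     # (2, [[79, 381, 524, 565]])
print(field_citations(without_citations))  # (None, [])
```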

JSON Schema Definition

The schema_data parameter must be a valid JSON Schema that defines the structure of data to extract. All schemas must follow the JSON Schema specification with proper type definitions, properties, and constraints.

Basic Schema Structure

All extraction schemas must include:

type: “object” (root level)
properties: Object defining the fields to extract
required: Array of required field names
additionalProperties: Set to false for strict validation
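These four requirements can be checked locally before uploading. The sketch below is a client-side pre-flight check only; it does not reproduce the validation the server performs, and check_root_schema is a hypothetical name.

```python
from typing import Any, Dict, List


def check_root_schema(schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems with the root-level schema; empty if none are found."""
    problems = []
    if schema.get("type") != "object":
        problems.append('root "type" must be "object"')
    properties = schema.get("properties")
    if not isinstance(properties, dict):
        problems.append('"properties" must be an object')
        properties = {}
    required = schema.get("required")
    if not isinstance(required, list):
        problems.append('"required" must be an array of field names')
    else:
        missing = [name for name in required if name not in properties]
        if missing:
            problems.append(f'"required" names not in "properties": {missing}')
    if schema.get("additionalProperties") is not False:
        problems.append('"additionalProperties" should be false')
    return problems


good = {
    "type": "object",
    "properties": {"title": {"type": "string", "description": "Document title"}},
    "required": ["title"],
    "additionalProperties": False,
}
print(check_root_schema(good))     # []
print(len(check_root_schema({})))  # 4
```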

Financial Document Schema Example

This example demonstrates extracting shareholding patterns and board information from financial documents:

{
  "type": "object",
  "properties": {
    "Individuals": {
      "type": "string",
      "description": "Percentage Holding"
    },
    "LIC of India": {
      "type": "number",
      "description": "No of Shares Held"
    },
    "board of directors": {
      "type": "array",
      "description": "list of names of board of directors",
      "items": {
        "type": "object",
        "required": [
          "names of board of directors"
        ],
        "properties": {
          "names of board of directors": {
            "type": "string",
            "description": "names of all the members of board of directors of ACRE"
          }
        },
        "additionalProperties": false
      }
    },
    "shareholding pattern": {
      "type": "array",
      "description": "shareholding pattern",
      "items": {
        "type": "object",
        "required": [
          "name of shareholders",
          "number of shares held"
        ],
        "properties": {
          "name of shareholders": {
            "type": "string",
            "description": "name of the shareholders in ACRE Table"
          },
          "number of shares held": {
            "type": "string",
            "description": "numbers of shares held by shareholders in ACRE Table"
          }
        },
        "additionalProperties": false
      }
    }
  },
  "required": [
    "Individuals",
    "LIC of India",
    "board of directors",
    "shareholding pattern"
  ],
  "additionalProperties": false
}

Advanced Financial Schema Example

This example shows a more complex schema for extracting detailed shareholding information:

{
  "type": "object",
  "properties": {
    "shares held by Punjab National bank": {
      "type": "string",
      "description": "shares held by Punjab National bank"
    },
    "shares held by IFCI": {
      "type": "string",
      "description": "shares held by IFCI"
    },
    "shareholding pattern": {
      "type": "object",
      "description": "shareholding pattern",
      "properties": {
        "Percentage holding": {
          "type": "array",
          "description": "percentage holding of shareholders in ACRE",
          "items": {
            "type": "string",
            "description": "percentage holding of shareholders in ACRE"
          }
        },
        "Name of shareholders": {
          "type": "array",
          "description": "Names of shareholders in ACRE",
          "items": {
            "type": "string",
            "description": "Names of shareholders in ACRE"
          }
        }
      },
      "required": ["Percentage holding", "Name of shareholders"],
      "additionalProperties": false
    },
    "names of board of directors": {
      "type": "array",
      "description": "list of names of members of board of directors in ACRE",
      "items": {
        "type": "object",
        "properties": {
          "names of board of directors": {
            "type": "string",
            "description": "list of names of members of board of directors in ACRE"
          }
        },
        "required": ["names of board of directors"],
        "additionalProperties": false
      }
    }
  },
  "required": [
    "shares held by Punjab National bank", 
    "shares held by IFCI", 
    "shareholding pattern", 
    "names of board of directors"
  ],
  "additionalProperties": false
}

Citation Extraction Schema

{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Document title or paper title"
    },
    "authors": {
      "type": "array",
      "description": "List of author names",
      "items": {
        "type": "string"
      }
    },
    "publication_date": {
      "type": "string",
      "description": "Publication date in YYYY-MM-DD format"
    },
    "journal_name": {
      "type": "string",
      "description": "Name of journal or publication venue"
    },
    "doi": {
      "type": "string",
      "description": "Digital Object Identifier"
    },
    "abstract": {
      "type": "string",
      "description": "Document abstract or summary"
    },
    "keywords": {
      "type": "array",
      "description": "Key terms and subject keywords",
      "items": {
        "type": "string"
      }
    },
    "references": {
      "type": "array",
      "description": "List of cited references",
      "items": {
        "type": "string"
      }
    }
  },
  "required": ["title", "authors"],
  "additionalProperties": false
}

Legal Document Schema

{
  "type": "object",
  "properties": {
    "document_type": {
      "type": "string",
      "description": "Type of legal document (contract, agreement, etc.)"
    },
    "parties": {
      "type": "array",
      "description": "Names of parties involved",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "description": "Party name"
          },
          "role": {
            "type": "string",
            "description": "Party role (e.g., buyer, seller, contractor)"
          }
        },
        "required": ["name", "role"],
        "additionalProperties": false
      }
    },
    "effective_date": {
      "type": "string",
      "description": "Document effective date"
    },
    "key_terms": {
      "type": "array",
      "description": "Important terms and conditions",
      "items": {
        "type": "string"
      }
    },
    "obligations": {
      "type": "array",
      "description": "Key obligations and responsibilities",
      "items": {
        "type": "object",
        "properties": {
          "party": {
            "type": "string",
            "description": "Party responsible for the obligation"
          },
          "obligation": {
            "type": "string",
            "description": "Description of the obligation"
          }
        },
        "required": ["party", "obligation"],
        "additionalProperties": false
      }
    }
  },
  "required": ["document_type", "parties", "effective_date"],
  "additionalProperties": false
}

JSON Schema Field Types

string

Text content, single values. Use for names, descriptions, dates as text.

number

Numeric values, amounts, quantities. Use for counts, percentages, monetary values.

integer

Whole numbers only. Use for counts, IDs, years.

boolean

True/false values. Use for yes/no questions, flags.

array

Lists of items. Must include items property defining the type of array elements.

object

Structured data with nested fields. Must include properties defining nested structure.

null

Null values. Can be combined with other types using array notation: ["string", "null"]
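To illustrate how these type names relate to values in a parsed JSON response, here is a rough Python check that also handles the array notation for nullable fields. It is a simplification under stated assumptions: the mapping and the matches_type name are illustrative, and unlike JSON Schema it treats a float such as 1.0 as not matching "integer".

```python
from typing import Any, List, Union

_PYTHON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
    "null": type(None),
}


def matches_type(value: Any, schema_type: Union[str, List[str]]) -> bool:
    """Check a parsed JSON value against a type name or a list of type names."""
    names = [schema_type] if isinstance(schema_type, str) else schema_type
    for name in names:
        # bool is a subclass of int in Python, so exclude it from numeric types
        if name in ("number", "integer") and isinstance(value, bool):
            continue
        if isinstance(value, _PYTHON_TYPES[name]):
            return True
    return False


print(matches_type("10.57", "string"))         # True
print(matches_type(None, ["string", "null"]))  # True
print(matches_type(True, "number"))            # False
print(matches_type(1.5, "integer"))            # False
```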

Job Management Integration

After creating an extraction job, you can poll for completion using the job status endpoints:

import requests
import time

# After creating the extraction job, you receive a job_id
job_id = "945b4578-691f-4c74-8184-dde654093b11"

headers = {
    "accept": "application/json",
    "api-key": "your-api-key"
}

# Poll for job completion
while True:
    response = requests.get(
        f"https://prod.visionapi.unsiloed.ai/jobs/{job_id}",
        headers=headers,
        timeout=30,  # avoid hanging indefinitely on a stalled connection
    )

    if response.status_code == 200:
        result = response.json()
        print(f"Job status: {result['status']}")

        if result['status'] == 'Succeeded':
            print("Extraction completed!")
            print("Extracted data:", result['result'])
            break
        elif result['status'] == 'Failed':
            print(f"Job failed: {result.get('error', 'Unknown error')}")
            break
    else:
        print(f"Error checking status: {response.status_code}")
        break

    time.sleep(5)  # Wait 5 seconds before checking again
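The loop above polls forever if a job never reaches a terminal status. A variant with an overall time limit might look like the sketch below. The status strings come from the example above; wait_for_job, its parameters, and the 600-second default are assumptions, and the status lookup is passed in as a function so the loop can be exercised without network access.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(fetch_status: Callable[[], Dict[str, Any]],
                 interval: float = 5.0,
                 max_wait: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Call fetch_status until the job reaches a terminal status or max_wait elapses."""
    waited = 0.0
    while True:
        result = fetch_status()
        if result.get("status") in ("Succeeded", "Failed"):
            return result
        if waited >= max_wait:
            raise TimeoutError(f"job still {result.get('status')!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval


# Simulated status sequence standing in for real HTTP calls
statuses = iter(["queued", "queued", "Succeeded"])
final = wait_for_job(lambda: {"status": next(statuses)}, sleep=lambda seconds: None)
print(final["status"])  # Succeeded
```

In real use, fetch_status would wrap the requests.get call from the loop above and return response.json().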

Advanced Schema Patterns

Nested Object Structures

For complex documents with hierarchical data:

{
  "type": "object",
  "properties": {
    "company_info": {
      "type": "object",
      "description": "Company identification and basic information",
      "properties": {
        "name": {
          "type": "string",
          "description": "Full company name"
        },
        "ticker": {
          "type": "string",
          "description": "Stock ticker symbol"
        },
        "sector": {
          "type": "string",
          "description": "Business sector"
        }
      },
      "required": ["name"],
      "additionalProperties": false
    },
    "financial_data": {
      "type": "object",
      "description": "Financial metrics and performance data",
      "properties": {
        "revenue": {
          "type": "number",
          "description": "Total revenue"
        },
        "profit_margin": {
          "type": "number",
          "description": "Profit margin percentage"
        }
      },
      "required": ["revenue"],
      "additionalProperties": false
    }
  },
  "required": ["company_info", "financial_data"],
  "additionalProperties": false
}

Array of Complex Objects

For extracting lists of structured data:

{
  "type": "object",
  "properties": {
    "transactions": {
      "type": "array",
      "description": "List of financial transactions",
      "items": {
        "type": "object",
        "properties": {
          "date": {
            "type": "string",
            "description": "Transaction date"
          },
          "amount": {
            "type": "number",
            "description": "Transaction amount"
          },
          "description": {
            "type": "string",
            "description": "Transaction description"
          },
          "category": {
            "type": "string",
            "description": "Transaction category"
          }
        },
        "required": ["date", "amount", "description"],
        "additionalProperties": false
      }
    }
  },
  "required": ["transactions"],
  "additionalProperties": false
}
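Schemas like this repeat the same boilerplate at every level, so it can help to assemble them programmatically. The helpers below (field, strict_object, array_of) are hypothetical conveniences, not part of any SDK; the sketch rebuilds the transactions schema shown above and prints it as JSON ready to pass as schema_data.

```python
import json
from typing import Any, Dict, List


def field(type_name: str, description: str) -> Dict[str, Any]:
    """A leaf field with a type and description."""
    return {"type": type_name, "description": description}


def strict_object(properties: Dict[str, Any], required: List[str]) -> Dict[str, Any]:
    """An object schema with additionalProperties disabled."""
    return {"type": "object", "properties": properties,
            "required": required, "additionalProperties": False}


def array_of(items: Dict[str, Any], description: str) -> Dict[str, Any]:
    """An array schema whose elements follow the given items schema."""
    return {"type": "array", "description": description, "items": items}


transaction = strict_object(
    {
        "date": field("string", "Transaction date"),
        "amount": field("number", "Transaction amount"),
        "description": field("string", "Transaction description"),
        "category": field("string", "Transaction category"),
    },
    required=["date", "amount", "description"],
)
schema = strict_object(
    {"transactions": array_of(transaction, "List of financial transactions")},
    required=["transactions"],
)

print(json.dumps(schema, indent=2))
```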

Error Handling

400

Bad Request

Invalid JSON schema format or missing required parameters

401

Unauthorized

Invalid or missing API key

413

Payload Too Large

File size exceeds 100MB limit

422

Unprocessable Entity

Invalid file format, malformed JSON schema, or processing error

429

Too Many Requests

Rate limit exceeded or quota exhausted

500

Internal Server Error

Server error during processing
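A client can turn these status codes into readable messages and a retry decision. In the sketch below the messages are copied from the list above, while the choice to treat 429 and 500 as retryable is an assumption rather than documented behavior; a 429 caused by an exhausted quota, for instance, will not succeed on retry.

```python
from typing import Dict

ERROR_MEANINGS: Dict[int, str] = {
    400: "Invalid JSON schema format or missing required parameters",
    401: "Invalid or missing API key",
    413: "File size exceeds 100MB limit",
    422: "Invalid file format, malformed JSON schema, or processing error",
    429: "Rate limit exceeded or quota exhausted",
    500: "Server error during processing",
}

# Assumption: only rate limiting and server errors are worth retrying
RETRYABLE_STATUSES = {429, 500}


def explain_status(status_code: int) -> str:
    """Map a status code to the description listed above."""
    return ERROR_MEANINGS.get(status_code, f"Unexpected status code {status_code}")


def should_retry(status_code: int) -> bool:
    """Return True for status codes this sketch treats as transient."""
    return status_code in RETRYABLE_STATUSES


print(explain_status(401))  # Invalid or missing API key
print(should_retry(429))    # True
print(should_retry(400))    # False
```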

Authorizations

api-key

string

header

required

Body

multipart/form-data

pdf_file

file

required

The PDF file to process for data extraction. Maximum file size: 100MB

schema_data

string

required

JSON schema defining the structure and fields to extract from the document. Example: {"type":"object","properties":{"invoice_number":{"type":"string","description":"The invoice number"}},"required":["invoice_number"],"additionalProperties":false}

Response

200 - application/json

Successful response

