Overview
The /v2/extract endpoint extracts structured data from PDF documents. It supports optional bounding box citations and handles large documents efficiently.
The endpoint returns a job ID for asynchronous processing. Use the job management endpoints to check status and retrieve results.
Request
pdf_file: The PDF file to process for data extraction. Maximum file size: 100MB. Either pdf_file or file_url must be provided.
file_url: URL to a PDF file to process. Either pdf_file or file_url must be provided.
schema_data: JSON Schema defining the structure and fields to extract from the document. Must be a valid JSON Schema with type definitions, properties, and required fields.
model: Model tier to use for extraction. Available tiers: alpha, beta, gamma, delta. Default and recommended: gamma, which gives the best balance of accuracy and speed.
enable_citations: Return bounding box coordinates for extracted values. When enabled, each extracted field includes bboxes with precise location data in the source document.
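As a minimal sketch of how these parameters fit together, the helper below assembles the form fields for a request without sending it. The function name build_extract_form is hypothetical, and encoding enable_citations as the lowercase strings "true" and "false" is an assumption about how the form value is parsed, not something this page confirms.

```python
import json


def build_extract_form(schema, file_url=None, model="gamma", enable_citations=False):
    """Assemble form fields for POST /v2/extract (hypothetical helper)."""
    allowed_models = {"alpha", "beta", "gamma", "delta"}
    if model not in allowed_models:
        raise ValueError(f"model must be one of {sorted(allowed_models)}")

    form = {
        # The schema travels as a JSON string in the schema_data field.
        "schema_data": json.dumps(schema),
        "model": model,
        # Assumption: booleans are sent as lowercase strings in form data.
        "enable_citations": "true" if enable_citations else "false",
    }
    # file_url is optional here because the caller may upload pdf_file instead.
    if file_url is not None:
        form["file_url"] = file_url
    return form


schema = {
    "type": "object",
    "properties": {"title": {"type": "string", "description": "Document title"}},
    "required": ["title"],
    "additionalProperties": False,
}
form = build_extract_form(schema, file_url="https://example.com/report.pdf")
print(form["model"], form["enable_citations"])
```

The resulting dictionary could be passed as the data argument of requests.post, with the PDF attached through the files argument when uploading pdf_file rather than supplying file_url.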
Response
job_id: Unique identifier for the extraction job
status: Initial job status (typically "queued")
message: Descriptive message about the job creation
quota_remaining: Number of credits remaining in your quota
Once the job is completed, the results contain the extracted data with additional metadata.
Each extracted field returns an object with the following structure:
[field_name].value (string|number|array|object): The extracted value, matching the schema type.
[field_name].score: Confidence score between 0 and 1 indicating extraction accuracy.
[field_name].bboxes: Array of bounding box entries showing where the data was found in the document. Only included when enable_citations is set to True.
[field_name].bboxes[].bbox: Bounding box coordinates [left, top, right, bottom] in PDF point space. Only included when citations are enabled.
[field_name].page_no: Page number where the data was extracted (1-indexed).
min_confidence_score: Minimum confidence score across all extracted fields.
For a simple financial document schema:
{
  "type": "object",
  "properties": {
    "Individuals": {
      "type": "string",
      "description": "Percentage Holding"
    },
    "LIC of India": {
      "type": "string",
      "description": "No of Shares Held"
    },
    "United bank of india": {
      "type": "string",
      "description": "No of shares held by United bank of india"
    }
  },
  "required": [
    "Individuals",
    "LIC of India",
    "United bank of india"
  ],
  "additionalProperties": false
}
With citations enabled, the completed job returns a result like this:
{
  "Individuals": {
    "score": 0.9998314521743098,
    "value": "10.57",
    "bboxes": [
      {
        "bbox": [79, 381, 524, 565]
      }
    ],
    "page_no": 2
  },
  "LIC of India": {
    "score": 0.9999889986487799,
    "value": "1515000",
    "bboxes": [
      {
        "bbox": [79, 381, 524, 565]
      }
    ],
    "page_no": 2
  },
  "United bank of india": {
    "score": 0.999984548437705,
    "value": "500000",
    "bboxes": [
      {
        "bbox": [79, 381, 524, 565]
      }
    ],
    "page_no": 2
  },
  "min_confidence_score": 0.9998314521743098
}
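Because every field carries its own score, a result can be screened for values that deserve manual review. The sketch below is one way to do that; the function name low_confidence_fields and the 0.9 threshold are illustrative assumptions, and the scores in the sample data are made up for the example.

```python
def low_confidence_fields(result, threshold=0.9):
    """Return the names of extracted fields whose score is below threshold."""
    flagged = []
    for name, field in result.items():
        # Skip top-level metadata such as min_confidence_score (a bare number).
        if not isinstance(field, dict) or "score" not in field:
            continue
        if field["score"] < threshold:
            flagged.append(name)
    return flagged


# Illustrative result; the scores here are invented for the example.
result = {
    "Individuals": {"score": 0.9998, "value": "10.57", "page_no": 2},
    "LIC of India": {"score": 0.62, "value": "1515000", "page_no": 2},
    "min_confidence_score": 0.62,
}
print(low_confidence_fields(result))  # ['LIC of India']
```

A quicker first check is to compare min_confidence_score against the threshold and only walk the individual fields when it falls below.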
curl -X POST "https://prod.visionapi.unsiloed.ai/v2/extract" \
  -H "accept: application/json" \
  -H "api-key: your-api-key" \
  -H "Content-Type: multipart/form-data" \
  -F "pdf_file=@document.pdf;type=application/pdf" \
  -F "schema_data={\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\",\"description\":\"Document title\"},\"date\":{\"type\":\"string\",\"description\":\"Document date\"}},\"required\":[\"title\",\"date\"],\"additionalProperties\":false}"
Success Response
{
  "job_id": "945b4578-691f-4c74-8184-dde654093b11",
  "status": "queued",
  "message": "PDF citation processing started",
  "quota_remaining": 48988
}
Citations
The enable_citations parameter controls whether bounding box coordinates are returned with extracted data. Citations provide references back to the source document, allowing you to trace where each extracted value was found.
With Citations Enabled
When enable_citations is set to True, each extracted field includes bboxes with precise location data:
{
  "invoice_number": {
    "value": "INV-2025-001",
    "page_no": 1,
    "score": 0.97,
    "bboxes": [
      {
        "bbox": [139, 209, 280, 222],
        "text": "INV-2025-001",
        "confidence": 0.95,
        "page_width": 595.0,
        "page_height": 842.0
      }
    ]
  }
}
Bbox coordinate system:
bbox: [left, top, right, bottom] in PDF point space (origin: top-left)
Standard A4 page = 595 x 842 points
page_width / page_height included for scaling to any display size
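The scaling mentioned above can be sketched as follows, assuming the display surface also uses a top-left origin so no axis flip is needed. The function name scale_bbox and the 1190 x 1684 pixel target are illustrative choices, not part of the API.

```python
def scale_bbox(bbox, page_width, page_height, display_width, display_height):
    """Map [left, top, right, bottom] from PDF points to display pixels."""
    sx = display_width / page_width    # horizontal scale factor
    sy = display_height / page_height  # vertical scale factor
    left, top, right, bottom = bbox
    return [left * sx, top * sy, right * sx, bottom * sy]


# One bboxes entry as returned with citations enabled.
citation = {"bbox": [139, 209, 280, 222], "page_width": 595.0, "page_height": 842.0}

# Render the A4 page at exactly twice its point size.
pixels = scale_bbox(
    citation["bbox"],
    citation["page_width"],
    citation["page_height"],
    display_width=1190,
    display_height=1684,
)
print(pixels)  # [278.0, 418.0, 560.0, 444.0]
```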
Without Citations (Default)
When enable_citations is False (default), the response contains value, score, and page_no for each field without bounding box data:
{
  "invoice_number": {
    "value": "INV-2025-001",
    "page_no": 1,
    "score": 0.97
  }
}
Set enable_citations to True when you need to trace extracted values back to their exact location in the document, such as for UI highlighting or audit trails.
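For UI highlighting, it can help to group the returned boxes by page so each page is drawn once. The following is a rough sketch under the response shapes shown above; the function name highlights_by_page is hypothetical. It returns an empty mapping when the job ran without citations.

```python
from collections import defaultdict


def highlights_by_page(result):
    """Group (field name, bbox) pairs by the page they appear on."""
    pages = defaultdict(list)
    for name, field in result.items():
        # Skip top-level metadata such as min_confidence_score.
        if not isinstance(field, dict):
            continue
        # bboxes is absent when citations are disabled.
        for box in field.get("bboxes", []):
            pages[field.get("page_no")].append((name, box["bbox"]))
    return dict(pages)


with_citations = {
    "invoice_number": {
        "value": "INV-2025-001",
        "page_no": 1,
        "score": 0.97,
        "bboxes": [{"bbox": [139, 209, 280, 222]}],
    }
}
without_citations = {
    "invoice_number": {"value": "INV-2025-001", "page_no": 1, "score": 0.97}
}

print(highlights_by_page(with_citations))
print(highlights_by_page(without_citations))
```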
JSON Schema Definition
The schema_data parameter must be a valid JSON Schema that defines the structure of data to extract. All schemas must follow the JSON Schema specification with proper type definitions, properties, and constraints.
Basic Schema Structure
All extraction schemas must include:
type: "object" (root level)
properties: Object defining the fields to extract
required: Array of required field names
additionalProperties: Set to false for strict validation
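The four requirements above can be checked locally before submitting a job. This is only a pre-flight sketch with a hypothetical function name, check_extraction_schema; it does not reproduce whatever validation the endpoint itself performs.

```python
def check_extraction_schema(schema):
    """Return a list of problems found in the root of an extraction schema."""
    problems = []
    if schema.get("type") != "object":
        problems.append('root "type" must be "object"')

    properties = schema.get("properties")
    if not isinstance(properties, dict):
        problems.append('"properties" must be an object')
        properties = {}

    required = schema.get("required")
    if not isinstance(required, list):
        problems.append('"required" must be an array of field names')
    else:
        for name in required:
            if name not in properties:
                problems.append(f'required field "{name}" is not in "properties"')

    if schema.get("additionalProperties") is not False:
        problems.append('"additionalProperties" should be false')
    return problems


good = {
    "type": "object",
    "properties": {"title": {"type": "string", "description": "Document title"}},
    "required": ["title"],
    "additionalProperties": False,
}
bad = {
    "type": "object",
    "properties": {"title": {"type": "string"}},
    "required": ["title", "date"],
}
print(check_extraction_schema(good))  # []
print(check_extraction_schema(bad))
```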
Financial Document Schema Example
This example demonstrates extracting shareholding patterns and board information from financial documents:
{
  "type": "object",
  "properties": {
    "Individuals": {
      "type": "string",
      "description": "Percentage Holding"
    },
    "LIC of India": {
      "type": "number",
      "description": "No of Shares Held"
    },
    "board of directors": {
      "type": "array",
      "description": "list of names of board of directors",
      "items": {
        "type": "object",
        "required": [
          "names of board of directors"
        ],
        "properties": {
          "names of board of directors": {
            "type": "string",
            "description": "names of all the members of board of directors of ACRE"
          }
        },
        "additionalProperties": false
      }
    },
    "shareholding pattern": {
      "type": "array",
      "description": "shareholding pattern",
      "items": {
        "type": "object",
        "required": [
          "name of shareholders",
          "number of shares held"
        ],
        "properties": {
          "name of shareholders": {
            "type": "string",
            "description": "name of the shareholders in ACRE Table"
          },
          "number of shares held": {
            "type": "string",
            "description": "numbers of shares held by shareholders in ACRE Table"
          }
        },
        "additionalProperties": false
      }
    }
  },
  "required": [
    "Individuals",
    "LIC of India",
    "board of directors",
    "shareholding pattern"
  ],
  "additionalProperties": false
}
Advanced Financial Schema Example
This example shows a more complex schema for extracting detailed shareholding information:
{
  "type": "object",
  "properties": {
    "shares held by Punjab National bank": {
      "type": "string",
      "description": "shares held by Punjab National bank"
    },
    "shares held by IFCI": {
      "type": "string",
      "description": "shares held by IFCI"
    },
    "shareholding pattern": {
      "type": "object",
      "description": "shareholding pattern",
      "properties": {
        "Percentage holding": {
          "type": "array",
          "description": "percentage holding of shareholders in ACRE",
          "items": {
            "type": "string",
            "description": "percentage holding of shareholders in ACRE"
          }
        },
        "Name of shareholders": {
          "type": "array",
          "description": "Names of shareholders in ACRE",
          "items": {
            "type": "string",
            "description": "Names of shareholders in ACRE"
          }
        }
      },
      "required": ["Percentage holding", "Name of shareholders"],
      "additionalProperties": false
    },
    "names of board of directors": {
      "type": "array",
      "description": "list of names of members of board of directors in ACRE",
      "items": {
        "type": "object",
        "properties": {
          "names of board of directors": {
            "type": "string",
            "description": "list of names of members of board of directors in ACRE"
          }
        },
        "required": ["names of board of directors"],
        "additionalProperties": false
      }
    }
  },
  "required": [
    "shares held by Punjab National bank",
    "shares held by IFCI",
    "shareholding pattern",
    "names of board of directors"
  ],
  "additionalProperties": false
}
Research Paper Schema
{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Document title or paper title"
    },
    "authors": {
      "type": "array",
      "description": "List of author names",
      "items": {
        "type": "string"
      }
    },
    "publication_date": {
      "type": "string",
      "description": "Publication date in YYYY-MM-DD format"
    },
    "journal_name": {
      "type": "string",
      "description": "Name of journal or publication venue"
    },
    "doi": {
      "type": "string",
      "description": "Digital Object Identifier"
    },
    "abstract": {
      "type": "string",
      "description": "Document abstract or summary"
    },
    "keywords": {
      "type": "array",
      "description": "Key terms and subject keywords",
      "items": {
        "type": "string"
      }
    },
    "references": {
      "type": "array",
      "description": "List of cited references",
      "items": {
        "type": "string"
      }
    }
  },
  "required": ["title", "authors"],
  "additionalProperties": false
}
Legal Document Schema
{
  "type": "object",
  "properties": {
    "document_type": {
      "type": "string",
      "description": "Type of legal document (contract, agreement, etc.)"
    },
    "parties": {
      "type": "array",
      "description": "Names of parties involved",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "description": "Party name"
          },
          "role": {
            "type": "string",
            "description": "Party role (e.g., buyer, seller, contractor)"
          }
        },
        "required": ["name", "role"],
        "additionalProperties": false
      }
    },
    "effective_date": {
      "type": "string",
      "description": "Document effective date"
    },
    "key_terms": {
      "type": "array",
      "description": "Important terms and conditions",
      "items": {
        "type": "string"
      }
    },
    "obligations": {
      "type": "array",
      "description": "Key obligations and responsibilities",
      "items": {
        "type": "object",
        "properties": {
          "party": {
            "type": "string",
            "description": "Party responsible for the obligation"
          },
          "obligation": {
            "type": "string",
            "description": "Description of the obligation"
          }
        },
        "required": ["party", "obligation"],
        "additionalProperties": false
      }
    }
  },
  "required": ["document_type", "parties", "effective_date"],
  "additionalProperties": false
}
JSON Schema Field Types
string: Text content, single values. Use for names, descriptions, dates as text.
number: Numeric values, amounts, quantities. Use for percentages, monetary values, measurements.
integer: Whole numbers only. Use for counts, IDs, years.
boolean: True/false values. Use for yes/no questions, flags.
array: Lists of items. Must include an items property defining the type of the array elements.
object: Structured data with nested fields. Must include properties defining the nested structure.
null: Null values. Can be combined with other types using array notation: ["string", "null"]
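To illustrate the array notation for nullable types, here is a small sketch that builds a scalar property definition. The helper name scalar_field is hypothetical, and it deliberately leaves out array and object, which additionally need items or properties.

```python
import json


def scalar_field(json_type, description, nullable=False):
    """Build a schema property for a scalar type, optionally allowing null."""
    scalar_types = {"string", "number", "integer", "boolean"}
    if json_type not in scalar_types:
        raise ValueError(f"json_type must be one of {sorted(scalar_types)}")
    # Array notation combines the base type with "null".
    type_value = [json_type, "null"] if nullable else json_type
    return {"type": type_value, "description": description}


doi = scalar_field("string", "Digital Object Identifier", nullable=True)
print(json.dumps(doi))
# {"type": ["string", "null"], "description": "Digital Object Identifier"}
```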
Job Management Integration
After creating an extraction job, you can poll for completion using the job status endpoints:
import time

import requests

# After creating the extraction job, you receive a job_id
job_id = "945b4578-691f-4c74-8184-dde654093b11"
headers = {
    "accept": "application/json",
    "api-key": "your-api-key",
}

# Poll for job completion
while True:
    response = requests.get(
        f"https://prod.visionapi.unsiloed.ai/extract/{job_id}",
        headers=headers,
        timeout=30,  # Avoid hanging forever on a stalled connection
    )
    if response.status_code == 200:
        result = response.json()
        status = result.get("status", "").lower()
        print(f"Job status: {status}")
        if status == "completed":
            print("Extraction completed!")
            print("Extracted data:", result.get("result"))
            break
        elif status == "failed":
            print(f"Job failed: {result.get('error', 'Unknown error')}")
            break
    else:
        print(f"Error checking status: {response.status_code}")
        break
    time.sleep(5)  # Wait 5 seconds before checking again
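The loop above polls indefinitely while the job is neither completed nor failed. A variation with an overall deadline and a growing delay might look like the sketch below. The name wait_for_job, the default timings, and the injected fetch_status callable (which stands in for the requests.get call and returns the parsed JSON body) are all assumptions made for illustration.

```python
import time


def wait_for_job(fetch_status, timeout_s=300, initial_delay_s=2.0,
                 max_delay_s=30.0, sleep=time.sleep):
    """Poll fetch_status() until the job completes, fails, or time runs out."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        result = fetch_status()
        status = str(result.get("status", "")).lower()
        if status in ("completed", "failed"):
            return result
        # Give up if the next wait would run past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s} seconds")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # Exponential backoff, capped


# Simulated status responses in place of real HTTP calls.
responses = iter([
    {"status": "queued"},
    {"status": "queued"},
    {"status": "completed", "result": {"title": {"value": "Demo"}}},
])
final = wait_for_job(lambda: next(responses), sleep=lambda seconds: None)
print(final["status"])  # completed
```

Passing the fetch function in keeps the waiting logic separate from the HTTP client, which also makes it easy to exercise without network access.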
Advanced Schema Patterns
Nested Object Structures
For complex documents with hierarchical data:
{
  "type": "object",
  "properties": {
    "company_info": {
      "type": "object",
      "description": "Company identification and basic information",
      "properties": {
        "name": {
          "type": "string",
          "description": "Full company name"
        },
        "ticker": {
          "type": "string",
          "description": "Stock ticker symbol"
        },
        "sector": {
          "type": "string",
          "description": "Business sector"
        }
      },
      "required": ["name"],
      "additionalProperties": false
    },
    "financial_data": {
      "type": "object",
      "description": "Financial metrics and performance data",
      "properties": {
        "revenue": {
          "type": "number",
          "description": "Total revenue"
        },
        "profit_margin": {
          "type": "number",
          "description": "Profit margin percentage"
        }
      },
      "required": ["revenue"],
      "additionalProperties": false
    }
  },
  "required": ["company_info", "financial_data"],
  "additionalProperties": false
}
Array of Complex Objects
For extracting lists of structured data:
{
  "type": "object",
  "properties": {
    "transactions": {
      "type": "array",
      "description": "List of financial transactions",
      "items": {
        "type": "object",
        "properties": {
          "date": {
            "type": "string",
            "description": "Transaction date"
          },
          "amount": {
            "type": "number",
            "description": "Transaction amount"
          },
          "description": {
            "type": "string",
            "description": "Transaction description"
          },
          "category": {
            "type": "string",
            "description": "Transaction category"
          }
        },
        "required": ["date", "amount", "description"],
        "additionalProperties": false
      }
    }
  },
  "required": ["transactions"],
  "additionalProperties": false
}
Error Handling
Invalid JSON schema format or missing required parameters
Invalid or missing API key
File size exceeds 100MB limit
Invalid file format, malformed JSON schema, or processing error
Rate limit exceeded or quota exhausted
Server error during processing
schema_data: JSON Schema defining the structure and fields to extract from the document. Example: {"type":"object","properties":{"invoice_number":{"type":"string","description":"The invoice number"}},"required":["invoice_number"],"additionalProperties":false}
model (enum<string>, default: gamma): Model tier to use for extraction. Available options: alpha, beta, gamma (recommended), delta.