Extract structured data from PDF documents using custom schemas for citation and data extraction
The Extract Data endpoint processes PDF documents and extracts structured information based on custom schemas. This is ideal for extracting specific data points, citations, references, and structured content from documents using AI-powered analysis.
The endpoint returns a job ID for asynchronous processing. Use the job management endpoints to check status and retrieve results.
The PDF file to process for data extraction. Maximum file size: 100MB
JSON schema defining the structure and fields to extract from the document. Must be a valid JSON Schema format with type definitions, properties, and required fields.
API key for authentication
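As a sketch of how these parameters fit together, the helper below assembles the pieces of a multipart request. The endpoint URL and the `x-api-key` header name are placeholder assumptions, not documented values; substitute the ones from your account dashboard.

```python
import json

# Hypothetical endpoint and header name -- assumptions, not confirmed by this page.
EXTRACT_URL = "https://api.example.com/v1/extract"

def build_extraction_request(pdf_path: str, schema: dict, api_key: str) -> dict:
    """Assemble a multipart/form-data request for the Extract Data endpoint:
    the PDF file (max 100MB), the schema as a JSON string, and the API key."""
    return {
        "url": EXTRACT_URL,
        "headers": {"x-api-key": api_key},           # header name assumed
        "files": {"file": open(pdf_path, "rb")},     # PDF to process
        "data": {"schema_data": json.dumps(schema)}, # schema must be JSON-encoded
    }
```

The returned dict maps directly onto the keyword arguments of an HTTP client such as `requests.post`.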
Unique identifier for the extraction job
Initial job status (typically "queued")
Descriptive message about the job creation
Number of API calls remaining in your quota
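A job-creation response might therefore look like the following. The `job_id`, `status`, and `message` field names are inferred from the descriptions above; only `quota_remaining` is confirmed elsewhere on this page, and all values are illustrative:

```json
{
  "job_id": "7f3a9c2e-1b4d-4e8f-a6c5-0d9e8b7a6f51",
  "status": "queued",
  "message": "Extraction job created successfully",
  "quota_remaining": 487
}
```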
Once the job is completed, the results will contain the extracted data with additional metadata:
Each extracted field returns an object with the following structure:
The extracted value matching the schema type
Confidence score between 0 and 1 indicating extraction accuracy
Array of bounding box coordinates where the data was found in the document
Bounding box coordinates [x1, y1, x2, y2] in pixels
Page number where the data was extracted (1-indexed)
Minimum confidence score across all extracted fields
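Putting those fields together, a completed job's output for a single extracted field might look like this. The `results`, `locations`, and `min_confidence` wrapper keys are illustrative assumptions based on the field descriptions above:

```json
{
  "results": {
    "company_name": {
      "value": "Acme Corp",
      "confidence": 0.97,
      "locations": [
        { "bbox": [112, 540, 388, 562], "page": 1 }
      ]
    }
  },
  "min_confidence": 0.97
}
```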
For a simple financial document schema:
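A minimal sketch of such a schema might look like this; the field names are examples, not prescribed by the API:

```json
{
  "type": "object",
  "properties": {
    "company_name": { "type": "string", "description": "Legal name of the company" },
    "total_revenue": { "type": "number", "description": "Total revenue for the reporting period" },
    "fiscal_year": { "type": "integer", "description": "Fiscal year of the report" }
  },
  "required": ["company_name"],
  "additionalProperties": false
}
```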
Process multiple PDF files simultaneously for efficient batch data extraction.
Array of PDF files to process for batch extraction
JSON schema defining the structure and fields to extract (applied to all files)
Number of files to process concurrently in the batch
API key for authentication
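A batch request can be sketched the same way as a single-file request, with one schema applied to every file. The batch endpoint URL and the `concurrency` parameter name are assumptions to be checked against your API reference:

```python
import json

# Hypothetical batch endpoint and parameter names -- assumptions, not documented values.
BATCH_URL = "https://api.example.com/v1/extract/batch"

def build_batch_request(pdf_paths, schema, api_key, concurrency=4):
    """Assemble a multipart batch request: several PDFs, one shared schema,
    and a concurrency setting for how many files are processed at once."""
    return {
        "url": BATCH_URL,
        "headers": {"x-api-key": api_key},  # header name assumed
        "files": [("files", open(p, "rb")) for p in pdf_paths],
        "data": {
            "schema_data": json.dumps(schema),  # applied to all files
            "concurrency": concurrency,         # parameter name assumed
        },
    }
```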
The `schema_data` parameter must be a valid JSON Schema that defines the structure of data to extract. All schemas must follow the JSON Schema specification with proper type definitions, properties, and constraints.
All extraction schemas must include:

- `type`: "object" (root level)
- `properties`: Object defining the fields to extract
- `required`: Array of required field names
- `additionalProperties`: Set to `false` for strict validation

This example demonstrates extracting shareholding patterns and board information from financial documents:
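A sketch of such a schema might look like the following; the field names are illustrative, not prescribed by the API:

```json
{
  "type": "object",
  "properties": {
    "promoter_shareholding_percent": {
      "type": "number",
      "description": "Percentage of shares held by promoters"
    },
    "board_members": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Names of the members of the board of directors"
    }
  },
  "required": ["promoter_shareholding_percent"],
  "additionalProperties": false
}
```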
This example shows a more complex schema for extracting detailed shareholding information:
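One way to model this is an array of objects, so each shareholder is extracted as a structured record. The field names below are illustrative assumptions:

```json
{
  "type": "object",
  "properties": {
    "shareholders": {
      "type": "array",
      "description": "Individual shareholders and their stakes",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string", "description": "Shareholder name" },
          "shares_held": { "type": "integer", "description": "Number of shares held" },
          "percent": { "type": "number", "description": "Percentage of total shares" }
        },
        "required": ["name"],
        "additionalProperties": false
      }
    }
  },
  "required": ["shareholders"],
  "additionalProperties": false
}
```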
Text content, single values. Use for names, descriptions, dates as text.
Numeric values, amounts, quantities. Use for counts, percentages, monetary values.
Whole numbers only. Use for counts, IDs, years.
True/false values. Use for yes/no questions, flags.
Lists of items. Must include the `items` property defining the type of array elements.
Structured data with nested fields. Must include `properties` defining the nested structure.
Null values. Can be combined with other types using array notation: ["string", "null"]
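For example, an optional field that may be absent from a document can be declared nullable like this (field name illustrative):

```json
{
  "type": "object",
  "properties": {
    "middle_name": {
      "type": ["string", "null"],
      "description": "Middle name if present, otherwise null"
    }
  },
  "required": [],
  "additionalProperties": false
}
```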
After creating an extraction job, use the job management endpoints to monitor progress:
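A simple polling loop can be sketched as follows. The status values ("queued", "processing", "completed", "failed") are assumptions modeled on the job fields described above; `fetch_status` is a caller-supplied callable that hits the job-status endpoint and returns the parsed JSON:

```python
import time

def wait_for_job(job_id, fetch_status, poll_interval=5.0, timeout=600.0):
    """Poll a job until it reaches a terminal state or the timeout expires.

    `fetch_status(job_id)` must return the job as a dict with a "status"
    key -- the terminal status names here are assumptions, not documented.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status(job_id)
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(poll_interval)  # avoid hammering the API between checks
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

Injecting `fetch_status` keeps the loop independent of any particular HTTP client and easy to test.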
For complex documents with hierarchical data:
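A nested-object schema along these lines captures hierarchy directly; the structure and field names below are illustrative:

```json
{
  "type": "object",
  "properties": {
    "company": {
      "type": "object",
      "description": "Issuing company details",
      "properties": {
        "name": { "type": "string", "description": "Company name" },
        "registered_address": {
          "type": "object",
          "properties": {
            "city": { "type": "string" },
            "country": { "type": "string" }
          },
          "required": [],
          "additionalProperties": false
        }
      },
      "required": ["name"],
      "additionalProperties": false
    }
  },
  "required": ["company"],
  "additionalProperties": false
}
```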
For extracting lists of structured data:
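An array whose `items` is an object works well here, for instance for invoice line items (field names illustrative):

```json
{
  "type": "object",
  "properties": {
    "line_items": {
      "type": "array",
      "description": "Itemized charges on the invoice",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string", "description": "What was billed" },
          "quantity": { "type": "integer", "description": "Units billed" },
          "unit_price": { "type": "number", "description": "Price per unit" }
        },
        "required": ["description"],
        "additionalProperties": false
      }
    }
  },
  "required": ["line_items"],
  "additionalProperties": false
}
```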
Invalid JSON schema format or missing required parameters
Invalid or missing API key
File size exceeds 100MB limit
Invalid file format, malformed JSON schema, or processing error
Rate limit exceeded or quota exhausted
Server error during processing
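Assuming the conventional HTTP status codes for each case above (only 422 for invalid schemas is named explicitly on this page), a client-side handler might look like this. The retry policy is a suggestion, not part of the API:

```python
def handle_error(status_code: int, retry_after: float = 30.0):
    """Map assumed HTTP status codes for the documented error cases to an
    action: return a retry instruction for transient failures, raise for
    caller errors. Codes other than 422 are conventional guesses."""
    if status_code in (429, 500):  # rate limit / server error: transient
        return ("retry", retry_after)
    if status_code == 401:
        raise PermissionError("invalid or missing API key")
    if status_code == 413:
        raise ValueError("file exceeds the 100MB limit")
    if status_code in (400, 422):
        raise ValueError("invalid schema or malformed request")
    raise RuntimeError(f"unexpected status {status_code}")
```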
Schema Design: Use specific, descriptive field names and descriptions. Clear descriptions improve extraction accuracy significantly.
Required Fields: Only mark fields as required if they are essential. Optional fields allow for more flexible extraction.
Field Types: Choose appropriate types: use `number` for numeric data, `string` for text, `array` for lists, and `object` for nested structures.
Array Items: Always define the `items` property for arrays to specify the structure of array elements.
Batch Processing: Use batch extraction for multiple similar documents to improve efficiency and reduce API calls.
Schema Validation: Ensure your JSON schema is valid. Invalid schemas will result in 422 errors.
Processing Time: Complex schemas and large documents take longer to process. Monitor job status regularly.
Quota Management: Check `quota_remaining` in responses to avoid hitting limits during batch operations.
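A minimal guard for batch runs, built on the documented `quota_remaining` field, might look like this:

```python
def assert_quota(response_json: dict, needed: int) -> None:
    """Stop a batch run before the quota is exhausted.

    `response_json` is any parsed API response; `quota_remaining` is the
    field documented on this page. Raises if fewer calls remain than the
    batch still needs.
    """
    remaining = response_json.get("quota_remaining", 0)
    if remaining < needed:
        raise RuntimeError(
            f"only {remaining} calls left, {needed} needed; pausing batch"
        )
```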
Rate limits are enforced per API key and reset on a rolling-window basis. Monitor your quota usage through the `quota_remaining` field in responses.