Overview
The Document Splitting feature analyzes PDF pages, classifies them into predefined categories, and creates separate PDF files for each category. This is ideal for processing mixed document batches like scanned files containing invoices, contracts, and reports.
Splitting jobs are processed asynchronously. Submit a splitting job and poll the status endpoint to retrieve results when complete.
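The submit-then-poll pattern can be sketched as a small helper. This is a minimal sketch, not part of the SDK: `wait_for_job` and its `fetch_status` argument are hypothetical names, and the status strings are the ones listed under Response Fields. In practice `fetch_status` could wrap a call such as `client.get_split_result(job_id)` and return its `status`.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], str],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status or the timeout expires.

    fetch_status is a hypothetical callable that returns the current
    status string ("processing", "completed", or "failed").
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        sleep(interval)
```

Passing `sleep` as a parameter keeps the helper easy to test; the interval and timeout defaults are arbitrary and should be tuned to your workload.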
How It Works
The splitting pipeline processes each PDF in five steps:
- Analyze Each Page: Extract text and visual features from every page
- Classify Content: Categorize pages based on document type and content
- Generate Confidence Scores: Attach a confidence score between 0 and 1 to each classification
- Create Separate Files: Split the original PDF into category-specific documents
- Package Results: Deliver all split documents in a convenient ZIP file
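The service's internals are not documented here, so the following is only a conceptual sketch of the grouping step: given a per-page prediction (category plus confidence), collect page numbers per category and average the confidence. `PagePrediction` and `group_pages` are hypothetical names, and averaging is an assumption about how a per-file score could be derived, not a description of how the API computes `confidence_score`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PagePrediction:
    page: int          # 1-based page number in the source PDF
    category: str      # predicted category name, e.g. "Invoice"
    confidence: float  # classifier confidence in the range 0-1


def group_pages(predictions):
    """Group page numbers by predicted category and average their confidence."""
    pages = defaultdict(list)
    scores = defaultdict(list)
    for p in predictions:
        pages[p.category].append(p.page)
        scores[p.category].append(p.confidence)
    return {
        category: {
            "pages": sorted(pages[category]),
            "confidence": sum(scores[category]) / len(scores[category]),
        }
        for category in pages
    }
```

Each key of the returned dictionary corresponds to one output file in this simplified model, with the pages that would be written into it.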
Supported Categories
Categories are user-defined and passed with each job. Typical examples include:
- Business Documents: Invoices, receipts, purchase orders, contracts
- Financial Documents: Bank statements, financial reports, tax forms
- Legal Documents: Contracts, agreements, legal notices, compliance forms
- Healthcare Documents: Medical records, insurance forms, lab reports
- HR Documents: Resumes, employment forms, payroll documents
- Academic Documents: Research papers, reports, transcripts
API Usage
Submit Split Job
from unsiloed_sdk import UnsiloedClient, Category

# Define categories with optional descriptions for better accuracy
categories = [
    Category(name="Invoice", description="Financial invoices with itemized charges"),
    Category(name="Receipt", description="Purchase receipts"),
    Category(name="Contract"),  # Description is optional
]

with UnsiloedClient(api_key="your-api-key") as client:
    # Split and wait for completion
    result = client.split_and_wait(
        file="mixed_documents.pdf",
        categories=categories,
    )

    # Check if split was successful
    if result.result['success']:
        print(f"✓ {result.result['message']}")

        # Access the generated split files
        for file_info in result.result['files']:
            print(f"\nFile: {file_info['name']}")
            print(f"  Confidence: {file_info['confidence_score']:.2%}")
            print(f"  Download: {file_info['full_path']}")
    else:
        print(f"Split failed: {result.result['message']}")
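The example above only prints each download URL. As a minimal sketch of fetching the files themselves, the helper below saves every entry to a local directory using only the standard library. `download_split_files` is a hypothetical name, and the sketch assumes each `full_path` URL can be fetched with a plain unauthenticated GET, which the source does not confirm.

```python
import shutil
import urllib.request
from pathlib import Path


def download_split_files(files, dest_dir="split_output"):
    """Download each split file into dest_dir and return the saved paths.

    files is a list of dicts with "name" and "full_path" keys, as found
    in result["files"]. Assumes the URLs need no authentication.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for info in files:
        # Path(...).name drops any directory parts from the reported name.
        target = dest / Path(info["name"]).name
        with urllib.request.urlopen(info["full_path"]) as resp, open(target, "wb") as out:
            shutil.copyfileobj(resp, out)
        saved.append(target)
    return saved
```

With the SDK example, this could be called as `download_split_files(result.result['files'])` after a successful split.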
Check Split Job Status
from unsiloed_sdk import UnsiloedClient


def check_split_status(job_id: str, api_key: str):
    with UnsiloedClient(api_key=api_key) as client:
        # Get split job result
        job = client.get_split_result(job_id)

        print(f"Status: {job.status}")
        print(f"Progress: {job.progress}")

        if job.status == "completed" and job.result:
            if job.result['success']:
                files = job.result['files']
                print(f"\n{job.result['message']}")
                print(f"Documents split into {len(files)} files")

                for file_info in files:
                    print(f"\n{file_info['name']}")
                    print(f"  Confidence: {file_info['confidence_score']:.2%}")
                    print(f"  Download: {file_info['full_path']}")

        return job
Job Creation Response
{
  "job_id": "c8a86841-beb1-4d00-ac4f-2f9fb9de9d5a",
  "status": "processing",
  "message": "Split job started",
  "quota_remaining": 450
}
Job Status Response (Completed)
{
  "job_id": "c8a86841-beb1-4d00-ac4f-2f9fb9de9d5a",
  "status": "completed",
  "progress": "Starting document processing...",
  "file_url": "https://lyltzyvtloozzovxrupp.supabase.co/storage/v1/object/public/job-files-bucket/...",
  "file_name": "mixed_documents.pdf",
  "parameters": {
    "classes": ["Invoice", "Receipt", "Contract"],
    "category_descriptions": {
      "Invoice": "Financial invoices with itemized charges",
      "Receipt": "Purchase receipts"
    }
  },
  "result": {
    "success": true,
    "message": "Successfully split PDF into 3 files",
    "files": [
      {
        "name": "Invoice.pdf",
        "path": "Invoice.pdf",
        "type": "file",
        "fileId": "580e091d-c354-4558-8318-89e600346691",
        "full_path": "https://lyltzyvtloozzovxrupp.supabase.co/storage/v1/object/public/files/...",
        "confidence_score": 0.999147126422347
      }
    ]
  },
  "error": null,
  "quota_remaining": 450
}
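To show how a status payload like the one above can be consumed without the SDK, here is a minimal sketch that parses the JSON and lists the split files. `summarize_job` is a hypothetical helper; it relies only on fields shown in the sample response and raises if the job is not completed or the split did not succeed.

```python
import json


def summarize_job(payload: str):
    """Return (name, confidence_score) pairs for a completed split job.

    Raises RuntimeError if the job is not completed or the split failed.
    """
    job = json.loads(payload)
    if job.get("status") != "completed":
        raise RuntimeError(f"job not completed: status={job.get('status')!r}")
    result = job.get("result") or {}
    if not result.get("success"):
        raise RuntimeError(result.get("message") or job.get("error") or "split failed")
    return [(f["name"], f["confidence_score"]) for f in result.get("files", [])]
```

Checking `status` before reading `result` mirrors the SDK example, which also guards on both the job status and the `success` flag.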
Response Fields
Top-Level Fields:
job_id (string): Unique identifier for the split job
status (string): Job status, one of “processing”, “completed”, or “failed”
progress (string): Current processing status message
file_url (string): URL to the original uploaded file
file_name (string): Name of the original file
parameters (object): Job parameters including classes and category descriptions
result (object): Outcome of the split operation; see Result Object Fields
error (string|null): Error message if job failed
quota_remaining (number|null): Remaining API quota
Result Object Fields:
success (boolean): Whether the split operation succeeded
message (string): Success/failure message
files (array): Array of generated split files
File Object Fields:
name (string): Name of the split file (category-based)
path (string): Relative path to the file
type (string): File type (always “file”)
fileId (string): Unique identifier for the file
full_path (string): Full download URL for the split file
confidence_score (number): Confidence score for the classification (0-1)
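Because `confidence_score` ranges from 0 to 1, one possible use is to route uncertain splits to manual review. A minimal sketch, assuming a hypothetical helper `partition_by_confidence` and an arbitrary threshold of 0.9 (the API does not prescribe one):

```python
def partition_by_confidence(files, threshold=0.9):
    """Split file objects into (accepted, needs_review) by confidence_score.

    files is a list of dicts shaped like the File Object described above.
    The 0.9 default is an illustrative assumption, not an API recommendation.
    """
    accepted = [f for f in files if f["confidence_score"] >= threshold]
    needs_review = [f for f in files if f["confidence_score"] < threshold]
    return accepted, needs_review
```

A suitable threshold depends on how costly a misclassified document is for your workflow, so it is worth calibrating against a sample of your own files.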