Overview

The Document Splitting feature analyzes PDF pages, classifies them into predefined categories, and creates separate PDF files for each category. This is ideal for processing mixed document batches like scanned files containing invoices, contracts, and reports.
Splitting jobs are processed asynchronously. Submit a splitting job and poll the status endpoint to retrieve results when complete.
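
The submit-then-poll pattern can be sketched as a small helper. This is only an illustration, not part of the SDK: `poll_until_done`, its parameters, and the idea of passing in a `fetch_status` callable are assumptions made for this example. It treats each job as a plain dict shaped like the JSON status response; only the status values "processing", "completed", and "failed" come from this page.

```python
import time
from typing import Any, Callable, Dict

# Status values documented for split jobs; anything else keeps polling.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status repeatedly until the job reaches a terminal status.

    fetch_status is a hypothetical callable that returns the job as a dict,
    for example a thin wrapper around whatever call retrieves the job status.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job still {job.get('status')!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

The `sleep` parameter exists only so the wait can be replaced in tests; in normal use the default `time.sleep` is enough. If the job object exposes `status` as an attribute rather than a dict key, the status lookup needs adapting.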

How It Works

The document splitting system uses AI to:
  1. Analyze Each Page: Extract text and visual features from every page
  2. Classify Content: Categorize pages based on document type and content
  3. Generate Confidence Scores: Report a confidence score (0-1) for each classification
  4. Create Separate Files: Split the original PDF into category-specific documents
  5. Package Results: Deliver all split documents in a convenient ZIP file
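
To make steps 2 to 4 concrete, here is a minimal sketch of how per-page classifications could be grouped into one page set per category. It is not the service's implementation: the function name, the input shape (one `(category, confidence)` pair per page), and averaging page confidences into a single per-file score are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def group_pages_by_category(
    page_labels: List[Tuple[str, float]],
) -> Dict[str, dict]:
    """Group 0-based page indices by predicted category.

    page_labels holds one (category, confidence) pair per page, in page order.
    The per-category confidence is the mean of its pages' confidences, which
    is an assumption; the real aggregation is not documented here.
    """
    pages = defaultdict(list)
    scores = defaultdict(list)
    for page_index, (category, confidence) in enumerate(page_labels):
        pages[category].append(page_index)
        scores[category].append(confidence)
    return {
        category: {
            "pages": pages[category],
            "confidence": mean(scores[category]),
        }
        for category in pages
    }


if __name__ == "__main__":
    labels = [("Invoice", 0.9), ("Invoice", 0.7), ("Contract", 0.8)]
    for category, info in group_pages_by_category(labels).items():
        print(category, info["pages"], round(info["confidence"], 2))
```

Each resulting page set would then be written out as its own PDF, which corresponds to the "Create Separate Files" step.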

Supported Categories

Categories are defined per request by name, with an optional description. Typical examples include:
  • Business Documents: Invoices, receipts, purchase orders, contracts
  • Financial Documents: Bank statements, financial reports, tax forms
  • Legal Documents: Contracts, agreements, legal notices, compliance forms
  • Healthcare Documents: Medical records, insurance forms, lab reports
  • HR Documents: Resumes, employment forms, payroll documents
  • Academic Documents: Research papers, reports, transcripts

API Usage

Submit Split Job

from unsiloed_sdk import UnsiloedClient, Category

# Define categories with optional descriptions for better accuracy
categories = [
    Category(name="Invoice", description="Financial invoices with itemized charges"),
    Category(name="Receipt", description="Purchase receipts"),
    Category(name="Contract")  # Description is optional
]

with UnsiloedClient(api_key="your-api-key") as client:
    # Split and wait for completion
    result = client.split_and_wait(
        file="mixed_documents.pdf",
        categories=categories
    )

    # Check if split was successful
    if result.result['success']:
        print(f"✓ {result.result['message']}")

        # Access the generated split files
        for file_info in result.result['files']:
            print(f"\nFile: {file_info['name']}")
            print(f"  Confidence: {file_info['confidence_score']:.2%}")
            print(f"  Download: {file_info['full_path']}")
    else:
        print(f"Split failed: {result.result['message']}")

Check Split Job Status

from unsiloed_sdk import UnsiloedClient

def check_split_status(job_id: str, api_key: str):
    with UnsiloedClient(api_key=api_key) as client:
        # Get split job result
        job = client.get_split_result(job_id)

        print(f"Status: {job.status}")
        print(f"Progress: {job.progress}")

        if job.status == "completed" and job.result:
            if job.result['success']:
                files = job.result['files']
                print(f"\n{job.result['message']}")
                print(f"Documents split into {len(files)} files")

                for file_info in files:
                    print(f"\n{file_info['name']}")
                    print(f"  Confidence: {file_info['confidence_score']:.2%}")
                    print(f"  Download: {file_info['full_path']}")
            else:
                # Job finished, but the split itself did not succeed
                print(f"Split failed: {job.result['message']}")
        
        return job
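
Once a job has completed, the split files can be fetched from their `full_path` URLs. The sketch below uses only the Python standard library and assumes those URLs can be downloaded directly without extra authentication headers, which this page does not confirm. `download_split_files` is a hypothetical helper, not an SDK function.

```python
import urllib.request
from pathlib import Path
from typing import Dict, List


def download_split_files(files: List[Dict], dest_dir: str) -> List[Path]:
    """Save each entry's full_path URL under dest_dir, named after the entry.

    files is the list found under result['files'] in a completed job.
    Assumes each URL is directly fetchable (no auth headers required).
    """
    target = Path(dest_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for file_info in files:
        # Keep only the final path component of the reported name
        out_path = target / Path(file_info["name"]).name
        with urllib.request.urlopen(file_info["full_path"]) as response:
            out_path.write_bytes(response.read())
        saved.append(out_path)
    return saved
```

A call such as `download_split_files(job.result['files'], "split_output")` would write one PDF per category into `split_output/`.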

Response Format

Job Creation Response

{
  "job_id": "c8a86841-beb1-4d00-ac4f-2f9fb9de9d5a",
  "status": "processing",
  "message": "Split job started",
  "quota_remaining": 450
}

Job Status Response (Completed)

{
  "job_id": "c8a86841-beb1-4d00-ac4f-2f9fb9de9d5a",
  "status": "completed",
  "progress": "Starting document processing...",
  "file_url": "https://lyltzyvtloozzovxrupp.supabase.co/storage/v1/object/public/job-files-bucket/...",
  "file_name": "mixed_documents.pdf",
  "parameters": {
    "classes": ["Invoice", "Receipt", "Contract"],
    "category_descriptions": {
      "Invoice": "Financial invoices with itemized charges",
      "Receipt": "Purchase receipts"
    }
  },
  "result": {
    "success": true,
    "message": "Successfully split PDF into 3 files",
    "files": [
      {
        "name": "Invoice.pdf",
        "path": "Invoice.pdf",
        "type": "file",
        "fileId": "580e091d-c354-4558-8318-89e600346691",
        "full_path": "https://lyltzyvtloozzovxrupp.supabase.co/storage/v1/object/public/files/...",
        "confidence_score": 0.999147126422347
      }
    ]
  },
  "error": null,
  "quota_remaining": 450
}
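
A status response like the one above can be reduced to a one-line summary by branching on `status` first and on `result.success` second. The following is a hedged sketch that works on the raw JSON parsed into a dict; `summarize_job` and its output strings are invented for this example.

```python
from typing import Any, Dict


def summarize_job(job: Dict[str, Any]) -> str:
    """Return a short human-readable summary of a split job status response."""
    status = job.get("status")
    if status == "processing":
        return f"In progress: {job.get('progress') or 'no progress message'}"
    if status == "failed":
        return f"Failed: {job.get('error') or 'no error message'}"
    if status == "completed":
        # A completed job can still report an unsuccessful split
        result = job.get("result") or {}
        if result.get("success"):
            return f"Done: {len(result.get('files', []))} file(s)"
        return f"Completed without success: {result.get('message', 'no message')}"
    return f"Unknown status: {status!r}"
```

Checking `status` and `success` separately matters because the two fields are reported independently in the response.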

Response Fields

Top-Level Fields:
  • job_id (string): Unique identifier for the split job
  • status (string): Job status, one of “processing”, “completed”, or “failed”
  • progress (string): Current processing status message
  • file_url (string): URL to the original uploaded file
  • file_name (string): Name of the original file
  • parameters (object): Job parameters including classes and category descriptions
  • error (string|null): Error message if job failed
  • quota_remaining (number|null): Remaining API quota

Result Object Fields:
  • success (boolean): Whether the split operation succeeded
  • message (string): Success/failure message
  • files (array): Array of generated split files

File Object Fields:
  • name (string): Name of the split file (category-based)
  • path (string): Relative path to the file
  • type (string): File type (always “file”)
  • fileId (string): Unique identifier for the file
  • full_path (string): Full download URL for the split file
  • confidence_score (number): Confidence score for the classification (0-1)
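
Because `confidence_score` is a number between 0 and 1, it can be used to route uncertain splits to manual review. A minimal sketch, assuming file objects shaped like the ones described above; the helper name and the 0.9 default threshold are arbitrary choices for illustration, not recommended values.

```python
from typing import Dict, List, Tuple


def partition_by_confidence(
    files: List[Dict], threshold: float = 0.9
) -> Tuple[List[Dict], List[Dict]]:
    """Split file objects into (accepted, needs_review) by confidence_score."""
    accepted, needs_review = [], []
    for file_info in files:
        if file_info["confidence_score"] >= threshold:
            accepted.append(file_info)
        else:
            needs_review.append(file_info)
    return accepted, needs_review


if __name__ == "__main__":
    sample = [
        {"name": "Invoice.pdf", "confidence_score": 0.999},
        {"name": "Receipt.pdf", "confidence_score": 0.62},
    ]
    ok, review = partition_by_confidence(sample)
    print([f["name"] for f in ok], [f["name"] for f in review])
```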