Splitting

Overview

The Document Splitting feature analyzes PDF pages, classifies them into predefined categories, and creates separate PDF files for each category. This is ideal for processing mixed document batches like scanned files containing invoices, contracts, and reports.

The splitting process returns a ZIP file containing the split documents, with classification and confidence score data provided in response headers.

How It Works

Our document splitting system uses advanced AI to:

Analyze Each Page: Extract text and visual features from every page
Classify Content: Categorize pages based on document type and content
Generate Confidence Scores: Provide accuracy metrics for each classification
Create Separate Files: Split the original PDF into category-specific documents
Package Results: Deliver all split documents in a convenient ZIP file

Supported Categories

You can split documents into various categories:

Business Documents: Invoices, receipts, purchase orders, contracts
Financial Documents: Bank statements, financial reports, tax forms
Legal Documents: Contracts, agreements, legal notices, compliance forms
Healthcare Documents: Medical records, insurance forms, lab reports
HR Documents: Resumes, employment forms, payroll documents
Academic Documents: Research papers, reports, transcripts

API Usage

Basic Request

curl -X POST "https://visionapi.unsiloed.ai/splitter/split-pdf?classes=invoice,contract,report" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@mixed_documents.pdf" \
  --output split_documents.zip

With Category Descriptions

import requests
import json

url = "https://visionapi.unsiloed.ai/splitter/split-pdf"

# Define categories with descriptions for better accuracy
categories = {
    "invoice": "Business invoices with itemized charges and payment terms",
    "contract": "Legal agreements, terms of service, and binding documents", 
    "report": "Analytical reports, summaries, and data presentations"
}

files = {"file": ("mixed_documents.pdf", open("mixed_documents.pdf", "rb"), "application/pdf")}
params = {"classes": "invoice,contract,report"}
data = {"categories": json.dumps(categories)}

response = requests.post(url, files=files, params=params, data=data)

if response.status_code == 200:
    with open("split_documents.zip", "wb") as f:
        f.write(response.content)
    
    # Get classification results
    classifications = json.loads(response.headers.get("X-Classifications", "{}"))
    confidence_scores = json.loads(response.headers.get("X-Confidence-Scores", "{}"))
    
    print("Split completed successfully!")
    for page, category in classifications.items():
        confidence = confidence_scores.get(page, 0)
        print(f"Page {page}: {category} (confidence: {confidence:.2f})")

files["file"][1].close()

Best Practices

Category Descriptions: Always provide detailed category descriptions. This can improve classification accuracy by 20-30%.

File Quality: Ensure PDFs contain readable text. Scanned documents should be OCR-processed first for better results.

Category Selection: Use 3-7 categories for optimal accuracy. Too many categories can reduce precision.

File Size: Large files (over 50 pages) may timeout. Consider pre-processing very large documents.

Text Quality: The service relies on text extraction. Poor quality scans or image-only PDFs may not classify accurately.

Accuracy and Confidence

Classification accuracy varies by document type and quality:

High Accuracy (85-95%): Documents with distinct visual layouts and clear textual content
Medium Accuracy (70-85%): Mixed quality documents or similar document types
Lower Accuracy (under 70%): Poor quality scans or very similar document categories

Confidence Score Interpretation

0.9-1.0: Very high confidence - Classification is very likely correct
0.7-0.9: High confidence - Classification is likely correct
0.5-0.7: Medium confidence - May benefit from human review
0.3-0.5: Low confidence - Human review recommended
0.0-0.3: Very low confidence - Classification may be incorrect

Integration Examples

Workflow Integration

async function processMixedDocuments(files, categories) {
  const results = [];
  
  for (const file of files) {
    const formData = new FormData();
    formData.append('file', file);
    formData.append('categories', JSON.stringify(categories));
    
    const params = new URLSearchParams({
      classes: Object.keys(categories).join(',')
    });
    
    try {
      const response = await fetch(`https://visionapi.unsiloed.ai/splitter/split-pdf?${params}`, {
        method: 'POST',
        body: formData
      });
      
      if (response.ok) {
        const zipBlob = await response.blob();
        const classifications = JSON.parse(response.headers.get('X-Classifications') || '{}');
        const confidenceScores = JSON.parse(response.headers.get('X-Confidence-Scores') || '{}');
        
        results.push({
          filename: file.name,
          success: true,
          zipBlob,
          classifications,
          confidenceScores
        });
      } else {
        results.push({
          filename: file.name,
          success: false,
          error: await response.text()
        });
      }
    } catch (error) {
      results.push({
        filename: file.name,
        success: false,
        error: error.message
      });
    }
  }
  
  return results;
}

Getting Started

Ready to start splitting your documents?

API Reference: For detailed API documentation, see our Splitting API Reference
Quick Start: Check out our quickstart guide for setup instructions
Try It Out: Test the API in our playground

Common Use Cases

Invoice Processing

Split mixed business documents to separate invoices for automated processing:

categories = {
    "invoice": "Business invoices with line items, totals, and vendor information",
    "receipt": "Purchase receipts and payment confirmations",
    "other": "Any other business documents"
}

Legal Document Organization

Organize legal document batches by document type:

categories = {
    "contract": "Legal contracts and binding agreements",
    "notice": "Legal notices and communications", 
    "form": "Legal forms and applications",
    "correspondence": "Letters and email communications"
}

Healthcare Records Management

Split patient files into organized categories:

categories = {
    "lab_report": "Laboratory test results and medical reports",
    "insurance": "Insurance forms and coverage documents",
    "prescription": "Prescription records and medication information",
    "visit_notes": "Doctor visit notes and medical consultations"
}

Core Endpoints

Job Management

Overview

How It Works

Supported Categories

API Usage

Basic Request

With Category Descriptions

Best Practices

Accuracy and Confidence

Confidence Score Interpretation

Integration Examples

Workflow Integration

Getting Started

Common Use Cases

Invoice Processing

Legal Document Organization

Healthcare Records Management

Core Endpoints

Job Management

​Overview

​How It Works

​Supported Categories

​API Usage

​Basic Request

​With Category Descriptions

​Best Practices

​Accuracy and Confidence

​Confidence Score Interpretation

​Integration Examples

​Workflow Integration

​Getting Started

​Common Use Cases

​Invoice Processing

​Legal Document Organization

​Healthcare Records Management

Overview

How It Works

Supported Categories

API Usage

Basic Request

With Category Descriptions

Best Practices

Accuracy and Confidence

Confidence Score Interpretation

Integration Examples

Workflow Integration

Getting Started

Common Use Cases

Invoice Processing

Legal Document Organization

Healthcare Records Management