Overview

The Document Splitting feature analyzes PDF pages, classifies them into predefined categories, and creates separate PDF files for each category. This is ideal for processing mixed document batches like scanned files containing invoices, contracts, and reports.

The splitting process returns a ZIP file containing the split documents, with classification and confidence score data provided in response headers.

How It Works

Our document splitting system uses advanced AI to:

  1. Analyze Each Page: Extract text and visual features from every page
  2. Classify Content: Categorize pages based on document type and content
  3. Generate Confidence Scores: Provide accuracy metrics for each classification
  4. Create Separate Files: Split the original PDF into category-specific documents
  5. Package Results: Deliver all split documents in a convenient ZIP file

Supported Categories

You can split documents into various categories:

  • Business Documents: Invoices, receipts, purchase orders, contracts
  • Financial Documents: Bank statements, financial reports, tax forms
  • Legal Documents: Contracts, agreements, legal notices, compliance forms
  • Healthcare Documents: Medical records, insurance forms, lab reports
  • HR Documents: Resumes, employment forms, payroll documents
  • Academic Documents: Research papers, reports, transcripts

API Usage

Basic Request

curl -X POST "https://visionapi.unsiloed.ai/splitter/split-pdf?classes=invoice,contract,report" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@mixed_documents.pdf" \
  --output split_documents.zip

With Category Descriptions

import requests
import json

url = "https://visionapi.unsiloed.ai/splitter/split-pdf"

# Define categories with descriptions for better accuracy
categories = {
    "invoice": "Business invoices with itemized charges and payment terms",
    "contract": "Legal agreements, terms of service, and binding documents", 
    "report": "Analytical reports, summaries, and data presentations"
}

files = {"file": ("mixed_documents.pdf", open("mixed_documents.pdf", "rb"), "application/pdf")}
params = {"classes": "invoice,contract,report"}
data = {"categories": json.dumps(categories)}

response = requests.post(url, files=files, params=params, data=data)

if response.status_code == 200:
    with open("split_documents.zip", "wb") as f:
        f.write(response.content)
    
    # Get classification results
    classifications = json.loads(response.headers.get("X-Classifications", "{}"))
    confidence_scores = json.loads(response.headers.get("X-Confidence-Scores", "{}"))
    
    print("Split completed successfully!")
    for page, category in classifications.items():
        confidence = confidence_scores.get(page, 0)
        print(f"Page {page}: {category} (confidence: {confidence:.2f})")

files["file"][1].close()

Best Practices

Category Descriptions: Always provide detailed category descriptions. This can improve classification accuracy by 20-30%.

File Quality: Ensure PDFs contain readable text. Scanned documents should be OCR-processed first for better results.

Category Selection: Use 3-7 categories for optimal accuracy. Too many categories can reduce precision.

File Size: Large files (over 50 pages) may timeout. Consider pre-processing very large documents.

Text Quality: The service relies on text extraction. Poor quality scans or image-only PDFs may not classify accurately.

Accuracy and Confidence

Classification accuracy varies by document type and quality:

  • High Accuracy (85-95%): Documents with distinct visual layouts and clear textual content
  • Medium Accuracy (70-85%): Mixed quality documents or similar document types
  • Lower Accuracy (under 70%): Poor quality scans or very similar document categories

Confidence Score Interpretation

  • 0.9-1.0: Very high confidence - Classification is very likely correct
  • 0.7-0.9: High confidence - Classification is likely correct
  • 0.5-0.7: Medium confidence - May benefit from human review
  • 0.3-0.5: Low confidence - Human review recommended
  • 0.0-0.3: Very low confidence - Classification may be incorrect

Integration Examples

Workflow Integration

async function processMixedDocuments(files, categories) {
  const results = [];
  
  for (const file of files) {
    const formData = new FormData();
    formData.append('file', file);
    formData.append('categories', JSON.stringify(categories));
    
    const params = new URLSearchParams({
      classes: Object.keys(categories).join(',')
    });
    
    try {
      const response = await fetch(`https://visionapi.unsiloed.ai/splitter/split-pdf?${params}`, {
        method: 'POST',
        body: formData
      });
      
      if (response.ok) {
        const zipBlob = await response.blob();
        const classifications = JSON.parse(response.headers.get('X-Classifications') || '{}');
        const confidenceScores = JSON.parse(response.headers.get('X-Confidence-Scores') || '{}');
        
        results.push({
          filename: file.name,
          success: true,
          zipBlob,
          classifications,
          confidenceScores
        });
      } else {
        results.push({
          filename: file.name,
          success: false,
          error: await response.text()
        });
      }
    } catch (error) {
      results.push({
        filename: file.name,
        success: false,
        error: error.message
      });
    }
  }
  
  return results;
}

Getting Started

Ready to start splitting your documents?

Common Use Cases

Invoice Processing

Split mixed business documents to separate invoices for automated processing:

categories = {
    "invoice": "Business invoices with line items, totals, and vendor information",
    "receipt": "Purchase receipts and payment confirmations",
    "other": "Any other business documents"
}

Organize legal document batches by document type:

categories = {
    "contract": "Legal contracts and binding agreements",
    "notice": "Legal notices and communications", 
    "form": "Legal forms and applications",
    "correspondence": "Letters and email communications"
}

Healthcare Records Management

Split patient files into organized categories:

categories = {
    "lab_report": "Laboratory test results and medical reports",
    "insurance": "Insurance forms and coverage documents",
    "prescription": "Prescription records and medication information",
    "visit_notes": "Doctor visit notes and medical consultations"
}