Split Document

Overview

The Split Document endpoint analyzes PDF pages, classifies them into predefined categories, and creates separate PDF files for each category. This is ideal for processing mixed document batches like scanned files containing invoices, contracts, and reports.

The endpoint returns a ZIP file containing the split documents, with classification and confidence score data provided in response headers.

Request

file

required

The PDF file to split. Maximum file size: 100 MB.

classes

string

required

Comma-separated list of classification categories (e.g., "invoice,contract,report"). The request example passes it as a URL query parameter.
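The two parameters above can be checked on the client before anything is uploaded. The following is a minimal sketch, not part of the API: the helper name `build_split_params` is hypothetical, and reading "100MB" as 100 × 1024 × 1024 bytes is an assumption.

```python
MAX_FILE_BYTES = 100 * 1024 * 1024  # assumption: the 100 MB limit counted in MiB


def build_split_params(classes, file_size_bytes):
    """Validate the inputs and return the query parameters for a split request.

    Hypothetical helper: it only prepares data and never contacts the API.
    """
    # Drop surrounding whitespace and empty entries.
    cleaned = [name.strip() for name in classes if name.strip()]
    if not cleaned:
        raise ValueError("at least one class is required")
    # A comma inside a name would corrupt the comma-separated list.
    if any("," in name for name in cleaned):
        raise ValueError("class names must not contain commas")
    if file_size_bytes > MAX_FILE_BYTES:
        raise ValueError(
            f"file is {file_size_bytes} bytes; the limit is {MAX_FILE_BYTES}"
        )
    return {"classes": ",".join(cleaned)}
```

A caller would typically pass `os.path.getsize(path)` as `file_size_bytes` and hand the returned dictionary to an HTTP client as query parameters.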

Response

The endpoint returns a ZIP file containing the split PDF documents, with additional metadata in response headers.

Response Headers

Content-Type

string

application/zip

Content-Disposition

string

attachment; filename=classified_pdfs.zip

X-Classifications

string

JSON string containing page-to-category mappings

X-Confidence-Scores

string

JSON string containing confidence scores for each page classification
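Because both custom headers carry JSON as a string, they need to be decoded before use. The sketch below groups pages by category. It assumes the JSON objects are keyed by page number as strings, as in the response example on this page; the function name `pages_by_category` is hypothetical.

```python
import json
from collections import defaultdict


def pages_by_category(headers):
    """Group (page, confidence) pairs by category from the response headers."""
    classifications = json.loads(headers["X-Classifications"])
    scores = json.loads(headers["X-Confidence-Scores"])
    grouped = defaultdict(list)
    for page, category in classifications.items():
        # Assumption: both headers use the same page keys.
        grouped[category].append((int(page), scores.get(page)))
    # Sort each category's pages by page number.
    return {category: sorted(pages) for category, pages in grouped.items()}
```

With the header values from the response example, this yields `{"invoice": [(1, 0.95), (2, 0.87)], "contract": [(3, 0.92)], "report": [(4, 0.78)]}`.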

ZIP Contents

The ZIP file contains separate PDF files for each category found in the document:

invoice.pdf - Pages classified as invoices
contract.pdf - Pages classified as contracts
report.pdf - Pages classified as reports
Any other category found in the document follows the same pattern.
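The archive can be read in memory without writing it to disk. This is a minimal sketch using Python's standard `zipfile` module; the function name `read_split_pdfs` is hypothetical, and it assumes each entry is named `<category>.pdf` as in the list above.

```python
import io
import zipfile


def read_split_pdfs(zip_bytes):
    """Return a dict mapping each category name to the bytes of its PDF."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".pdf"):
                # Strip any folder prefix and the ".pdf" suffix.
                category = name.rsplit("/", 1)[-1][:-4]
                result[category] = archive.read(name)
    return result
```

Passing the raw response body to `read_split_pdfs` gives one entry per category, which can then be saved or forwarded individually.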

Request Examples

# classes is passed in the query string; the categories form field carries
# a JSON object that maps each category to a description.
curl -X POST "https://visionapi.unsiloed.ai/splitter/split-pdf?classes=invoice,contract,report" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@mixed_documents.pdf" \
  -F 'categories={"invoice":"Business invoices with itemized charges","contract":"Legal agreements and binding documents"}' \
  --output split_documents.zip

Response Examples

HTTP/1.1 200 OK
Content-Type: application/zip
Content-Disposition: attachment; filename=classified_pdfs.zip
Content-Length: 2048576
X-Classifications: {"1": "invoice", "2": "invoice", "3": "contract", "4": "report"}
X-Confidence-Scores: {"1": 0.95, "2": 0.87, "3": 0.92, "4": 0.78}

[ZIP file containing invoice.pdf, contract.pdf, and report.pdf]
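The confidence scores in a response like the one above can be used to route uncertain pages to manual review. The sketch below is one possible approach: the function name `low_confidence_pages` and the 0.8 threshold are illustrative assumptions, not values defined by the API.

```python
import json


def low_confidence_pages(classifications_header, scores_header, threshold=0.8):
    """Return (page, category, score) tuples whose score is below the threshold."""
    classifications = json.loads(classifications_header)
    scores = json.loads(scores_header)
    flagged = [
        (int(page), classifications.get(page), score)
        for page, score in scores.items()
        if score < threshold  # threshold is an arbitrary illustrative cut-off
    ]
    return sorted(flagged)
```

Applied to the two header values shown above, only page 4 (`report`, 0.78) falls below the 0.8 cut-off.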

Best Practices

Category Descriptions: Always provide detailed category descriptions, as the categories form field does in the request example. This can improve classification accuracy by 20-30%.

File Quality: Ensure PDFs contain readable text. Scanned documents should be OCR-processed first for better results.

Category Selection: Use 3-7 categories for optimal accuracy. Too many categories can reduce precision.

File Size: Large files (>50 pages) may time out. Consider pre-processing very large documents.

Text Quality: The service relies on text extraction. Poor quality scans or image-only PDFs may not classify accurately.
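One way to pre-process a very large document is to cut it into chunks of at most 50 pages and submit each chunk as its own request. The sketch below only computes the page ranges. The function name `page_ranges` is hypothetical, the actual PDF cutting needs a PDF library and is not shown, and chunking is an assumption about what suitable pre-processing looks like, since a logical document may straddle a chunk boundary.

```python
def page_ranges(total_pages, max_pages=50):
    """Return 1-based inclusive (start, end) ranges of at most max_pages pages."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]
```

For a 120-page file, `page_ranges(120)` returns `[(1, 50), (51, 100), (101, 120)]`. Page numbers in each chunk's response headers would then be relative to that chunk, so the start offset has to be added back when merging results.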

Supported Document Types

The splitting system works best with:

Business Documents: Invoices, receipts, purchase orders, contracts
Financial Documents: Bank statements, financial reports, tax forms
Legal Documents: Contracts, agreements, legal notices, compliance forms
Healthcare Documents: Medical records, insurance forms, lab reports
HR Documents: Resumes, employment forms, payroll documents
Academic Documents: Research papers, reports, transcripts

Classification accuracy varies by document type and quality. Documents with distinct visual layouts and clear textual content typically achieve 85-95% accuracy.

Getting Started

Ready to start splitting your documents? Check out our quickstart guide or try the API in our playground.
