Overview
The Document Splitting feature analyzes PDF pages, classifies them into predefined categories, and creates separate PDF files for each category. This is ideal for processing mixed document batches like scanned files containing invoices, contracts, and reports.
The splitting process returns a ZIP file containing the split documents, with classification and confidence score data provided in response headers.
How It Works
Our document splitting system uses advanced AI to:
- Analyze Each Page: Extract text and visual features from every page
- Classify Content: Categorize pages based on document type and content
- Generate Confidence Scores: Provide accuracy metrics for each classification
- Create Separate Files: Split the original PDF into category-specific documents
- Package Results: Deliver all split documents in a convenient ZIP file
Supported Categories
You can split documents into various categories:
- Business Documents: Invoices, receipts, purchase orders, contracts
- Financial Documents: Bank statements, financial reports, tax forms
- Legal Documents: Contracts, agreements, legal notices, compliance forms
- Healthcare Documents: Medical records, insurance forms, lab reports
- HR Documents: Resumes, employment forms, payroll documents
- Academic Documents: Research papers, reports, transcripts
API Usage
Basic Request
curl -X POST "https://visionapi.unsiloed.ai/splitter/split-pdf?classes=invoice,contract,report" \
-H "Content-Type: multipart/form-data" \
-F "file=@mixed_documents.pdf" \
--output split_documents.zip
With Category Descriptions
import requests
import json
url = "https://visionapi.unsiloed.ai/splitter/split-pdf"
# Define categories with descriptions for better accuracy
categories = {
"invoice": "Business invoices with itemized charges and payment terms",
"contract": "Legal agreements, terms of service, and binding documents",
"report": "Analytical reports, summaries, and data presentations"
}
files = {"file": ("mixed_documents.pdf", open("mixed_documents.pdf", "rb"), "application/pdf")}
params = {"classes": "invoice,contract,report"}
data = {"categories": json.dumps(categories)}
response = requests.post(url, files=files, params=params, data=data)
if response.status_code == 200:
with open("split_documents.zip", "wb") as f:
f.write(response.content)
# Get classification results
classifications = json.loads(response.headers.get("X-Classifications", "{}"))
confidence_scores = json.loads(response.headers.get("X-Confidence-Scores", "{}"))
print("Split completed successfully!")
for page, category in classifications.items():
confidence = confidence_scores.get(page, 0)
print(f"Page {page}: {category} (confidence: {confidence:.2f})")
files["file"][1].close()
Best Practices
Category Descriptions: Always provide detailed category descriptions. This can improve classification accuracy by 20-30%.
File Quality: Ensure PDFs contain readable text. Scanned documents should be OCR-processed first for better results.
Category Selection: Use 3-7 categories for optimal accuracy. Too many categories can reduce precision.
File Size: Large files (over 50 pages) may timeout. Consider pre-processing very large documents.
Text Quality: The service relies on text extraction. Poor quality scans or image-only PDFs may not classify accurately.
Accuracy and Confidence
Classification accuracy varies by document type and quality:
- High Accuracy (85-95%): Documents with distinct visual layouts and clear textual content
- Medium Accuracy (70-85%): Mixed quality documents or similar document types
- Lower Accuracy (under 70%): Poor quality scans or very similar document categories
Confidence Score Interpretation
- 0.9-1.0: Very high confidence - Classification is very likely correct
- 0.7-0.9: High confidence - Classification is likely correct
- 0.5-0.7: Medium confidence - May benefit from human review
- 0.3-0.5: Low confidence - Human review recommended
- 0.0-0.3: Very low confidence - Classification may be incorrect
Integration Examples
Workflow Integration
async function processMixedDocuments(files, categories) {
const results = [];
for (const file of files) {
const formData = new FormData();
formData.append('file', file);
formData.append('categories', JSON.stringify(categories));
const params = new URLSearchParams({
classes: Object.keys(categories).join(',')
});
try {
const response = await fetch(`https://visionapi.unsiloed.ai/splitter/split-pdf?${params}`, {
method: 'POST',
body: formData
});
if (response.ok) {
const zipBlob = await response.blob();
const classifications = JSON.parse(response.headers.get('X-Classifications') || '{}');
const confidenceScores = JSON.parse(response.headers.get('X-Confidence-Scores') || '{}');
results.push({
filename: file.name,
success: true,
zipBlob,
classifications,
confidenceScores
});
} else {
results.push({
filename: file.name,
success: false,
error: await response.text()
});
}
} catch (error) {
results.push({
filename: file.name,
success: false,
error: error.message
});
}
}
return results;
}
Getting Started
Ready to start splitting your documents?
Common Use Cases
Invoice Processing
Split mixed business documents to separate invoices for automated processing:
categories = {
"invoice": "Business invoices with line items, totals, and vendor information",
"receipt": "Purchase receipts and payment confirmations",
"other": "Any other business documents"
}
Legal Document Organization
Organize legal document batches by document type:
categories = {
"contract": "Legal contracts and binding agreements",
"notice": "Legal notices and communications",
"form": "Legal forms and applications",
"correspondence": "Letters and email communications"
}
Healthcare Records Management
Split patient files into organized categories:
categories = {
"lab_report": "Laboratory test results and medical reports",
"insurance": "Insurance forms and coverage documents",
"prescription": "Prescription records and medication information",
"visit_notes": "Doctor visit notes and medical consultations"
}
Responses are generated using AI and may contain mistakes.