Split PDF documents by classifying pages into different categories
The Split Document endpoint analyzes PDF pages, classifies them into predefined categories, and creates separate PDF files for each category. This is ideal for processing mixed document batches like scanned files containing invoices, contracts, and reports.
The endpoint returns a ZIP file containing the split documents, with classification and confidence score data provided in response headers.
The PDF file to split. Maximum file size: 100MB
Comma-separated list of classification categories (e.g., "invoice,contract,report")
JSON string with detailed category descriptions for better classification accuracy
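The text parameters above could be assembled as in the following sketch. The field names `categories` and `category_descriptions` are placeholders, since this page does not show the actual parameter names, and the JSON object keyed by category name is an assumed shape for the descriptions. Check the API reference for the real names before using it.

```python
import json


def build_split_fields(categories, descriptions=None):
    """Build the text form fields for a split request.

    NOTE: the field names "categories" and "category_descriptions" are
    hypothetical placeholders, not confirmed parameter names.
    """
    cleaned = [c.strip() for c in categories if c.strip()]
    if not cleaned:
        raise ValueError("at least one category is required")
    # A comma separates categories, so a comma inside a name would be ambiguous.
    for name in cleaned:
        if "," in name:
            raise ValueError(f"category name contains a comma: {name!r}")

    fields = {"categories": ",".join(cleaned)}
    if descriptions:
        # Assumed shape: a JSON object mapping category name -> description.
        fields["category_descriptions"] = json.dumps(descriptions)
    return fields


fields = build_split_fields(
    ["invoice", "contract", "report"],
    {"invoice": "Bills with line items, totals, and payment terms"},
)
print(fields["categories"])  # invoice,contract,report
```

These fields would presumably be sent alongside the PDF file in the same upload request.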
The endpoint returns a ZIP file containing the split PDF documents, with additional metadata in response headers.
Content-Type: application/zip
Content-Disposition: attachment; filename=classified_pdfs.zip
JSON string containing page-to-category mappings
JSON string containing confidence scores for each page classification
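Handling such a response might look like the sketch below, which decodes the two JSON headers and lists the PDFs in the ZIP body. The names of the metadata headers are not shown on this page, so the function takes them as arguments. The `X-Example-*` names and the JSON shapes in the sample are invented for illustration only.

```python
import io
import json
import zipfile


def read_split_response(headers, body, mapping_header, confidence_header):
    """Decode the two JSON headers and list the PDFs in the ZIP body.

    `mapping_header` and `confidence_header` are supplied by the caller
    because the real header names are not shown on this page.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    mapping = json.loads(lowered[mapping_header.lower()])
    confidence = json.loads(lowered[confidence_header.lower()])

    with zipfile.ZipFile(io.BytesIO(body)) as archive:
        pdf_names = [n for n in archive.namelist() if n.lower().endswith(".pdf")]
    return mapping, confidence, pdf_names


# Stand-in response built locally; header names and JSON shapes are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("invoice.pdf", b"%PDF-1.4 placeholder")
sample_headers = {
    "X-Example-Classification": json.dumps({"1": "invoice"}),
    "X-Example-Confidence": json.dumps({"1": 0.93}),
}
mapping, confidence, pdf_names = read_split_response(
    sample_headers,
    buffer.getvalue(),
    "X-Example-Classification",
    "X-Example-Confidence",
)
print(pdf_names)  # ['invoice.pdf']
```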
The ZIP file contains separate PDF files for each category found in the document:
invoice.pdf - Pages classified as invoices
contract.pdf - Pages classified as contracts
report.pdf - Pages classified as reports
Category Descriptions: Always provide detailed category descriptions. This can improve classification accuracy by 20-30%.
File Quality: Ensure PDFs contain readable text. Scanned documents should be OCR-processed first for better results.
Category Selection: Use 3-7 categories for optimal accuracy. Too many categories can reduce precision.
File Size: Large files (>50 pages) may time out. Consider pre-processing very large documents.
Text Quality: The service relies on text extraction. Poor quality scans or image-only PDFs may not classify accurately.
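The numeric guidance above (100MB upload limit, 3-7 categories, more than 50 pages) can be checked client-side before uploading. The sketch below is one way to do that. It assumes "100MB" means 100 MiB, and it expects the caller to supply the page count, for example from a PDF library. The function name is illustrative.

```python
MAX_FILE_BYTES = 100 * 1024 * 1024  # assumes "100MB" means 100 MiB
LARGE_PAGE_COUNT = 50  # documents above this size may time out


def preflight_warnings(file_size_bytes, page_count, categories):
    """Return human-readable warnings; an empty list means no concerns."""
    warnings = []
    if file_size_bytes > MAX_FILE_BYTES:
        warnings.append("file exceeds the 100MB upload limit")
    if page_count > LARGE_PAGE_COUNT:
        warnings.append("more than 50 pages: the request may time out")
    if not 3 <= len(categories) <= 7:
        warnings.append("3-7 categories are recommended for best accuracy")
    return warnings


print(preflight_warnings(2_000_000, 12, ["invoice", "contract", "report"]))  # []
print(len(preflight_warnings(2_000_000, 80, ["invoice", "contract"])))  # 2
```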
The splitting system works best with:
Classification accuracy varies by document type and quality. Documents with distinct visual layouts and clear textual content typically achieve 85-95% accuracy.
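Because accuracy varies, it may be worth sending low-confidence pages to manual review. The sketch below assumes the confidence data decodes to a mapping of page number (as a string) to a score between 0 and 1, which this page does not confirm. The 0.85 threshold is an arbitrary starting point, not a documented value.

```python
def pages_needing_review(confidence_by_page, threshold=0.85):
    """Return page keys whose confidence is below `threshold`, in page order.

    Assumes a mapping of page number (as a string) to a 0-1 score; the
    service's actual confidence format may differ.
    """
    flagged = [page for page, score in confidence_by_page.items() if score < threshold]
    # Sort numerically so that page "10" comes after page "2".
    return sorted(flagged, key=int)


print(pages_needing_review({"1": 0.97, "2": 0.61, "10": 0.80, "3": 0.90}))
# ['2', '10']
```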
Ready to start splitting your documents? Check out our quickstart guide or try the API in our playground.