This recipe chains two endpoints covered on their own in the Splitting quickstart and the Extract quickstart. Read those first if you want the detail on either call in isolation.
What We’ll Build
A script that:
- Sends a multi-document PDF to /splitter along with the document types we expect to find.
- Gets back one PDF per type, each labelled with its category and hosted at a URL.
- Picks a matching extraction schema for each file and sends it to /v2/extract by URL.
- Collects the extracted fields into a single object, keyed by document type.
Show the Full Script
Set UNSILOED_API_KEY in your environment and save the document pile as pile.pdf in the same directory as the script (sort_and_extract.py in the Python version) before running.
Step 1: Set Up Your Environment
Before writing any code, we need three things: an API key, a document pile, and the runtime for our chosen language.
1.1 Get an Unsiloed AI API Key
To get API access, sign up on Unsiloed AI. Export your key as the UNSILOED_API_KEY environment variable so it stays out of source control.
1.2 Pick a Document Pile
The walkthrough assumes a PDF saved as pile.pdf in your working directory. To follow along with the exact output shown below, download our sample document pile: a three-page PDF that holds an invoice, a store receipt, and a purchase order, one per page. The three documents are unrelated and in mixed order, which is exactly the case this recipe is built for.
1.3 Install Dependencies
For the Python version, you need Python 3.8 or newer. Install the requests package, for example with pip install requests.
Step 2: Split the Pile by Document Type
The splitter takes the whole PDF plus a list of the document types we expect, and returns one new PDF per type. Each returned file is labelled with the category it matched and hosted at a URL we can hand straight to the extractor in the next step, so there’s no need to save anything to disk in between.
2.1 Describe the Document Types
A category is a name and a description. The description does the work: it tells the splitter how to recognize each type, so the clearer it is, the more reliable the split. We’re looking for three types.
Create a file called sort_and_extract.py to hold the configuration and categories.
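A minimal sketch of what that configuration might look like. Only UNSILOED_API_KEY, pile.pdf, and the Invoice category name come from this recipe; the base URL is a placeholder, the api-key header name is an assumption, and the other category names and all description wording are illustrative.

```python
import json
import os

# Read the key exported in Step 1.1; this is None if the variable is not set.
API_KEY = os.environ.get("UNSILOED_API_KEY")

# Placeholder, not the real host: replace it with the base URL from the API reference.
BASE_URL = "https://api.example.invalid"

# Assumed header name; check the API reference for the actual auth scheme.
HEADERS = {"api-key": API_KEY or ""}

PILE_PATH = "pile.pdf"

# Each category is a name plus a description. The descriptions carry the
# signal the splitter uses, so they name concrete, visible features.
# "Receipt" and "PurchaseOrder" are illustrative names, as is all wording below.
CATEGORIES = [
    {
        "name": "Invoice",
        "description": "A bill from a supplier requesting payment, with an "
                       "invoice number, line items, and an amount due.",
    },
    {
        "name": "Receipt",
        "description": "Proof of a completed retail purchase, listing items "
                       "bought, the amount paid, and the payment method.",
    },
    {
        "name": "PurchaseOrder",
        "description": "A buyer's order to a vendor, with a PO number, the "
                       "items requested, and delivery details.",
    },
]

# The splitter takes the categories as a JSON string in a form field.
CATEGORIES_FIELD = json.dumps(CATEGORIES)
```

Keeping the category names short and free of punctuation makes them easier to reuse later as lookup keys for the per-type schemas.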
2.2 Submit the Pile and Wait for the Split
Both endpoints in this recipe run asynchronously: we submit a job, get a job_id, and poll until it’s done. Since we do that twice, it’s worth a small helper. Then we post the pile to /splitter with the categories as a JSON string under the categories field.
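The submit-then-poll pattern described above could be sketched as a helper like the one below. It takes a callable that fetches the job's current state, so it works for either endpoint. The status field name and the "completed" and "failed" values are assumptions, not confirmed API behaviour; adjust them to match the actual responses.

```python
import time

# Assumed status values; check the API reference for the real ones.
DONE_STATUSES = {"completed"}
FAILED_STATUSES = {"failed"}


def poll_until_done(fetch_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call fetch_status() repeatedly until the job finishes.

    fetch_status: zero-argument callable returning the job as a dict.
    Returns the final job dict, raises RuntimeError if the job fails,
    and raises TimeoutError if it is still running after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        status = str(job.get("status", "")).lower()
        if status in DONE_STATUSES:
            return job
        if status in FAILED_STATUSES:
            raise RuntimeError(f"job failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval)
```

In the script, fetch_status would wrap the GET request for the job, for example a lambda that calls requests.get on /splitter/{job_id} and returns the parsed JSON. Injecting the callable keeps the helper independent of which endpoint is being polled.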
Add the polling helper and the split call to sort_and_extract.py. Note that the form field is file here, not pdf_file: the splitter and the extractor use different field names.
Each entry in sections describes one split-out document. The two fields we use are name (the category it matched, with a .pdf suffix, such as Invoice.pdf) and full_path (a URL to the new single-document PDF). Those URLs are signed S3 links that expire roughly an hour after the response, so we use them right away in the next step rather than storing them. If a link does lapse, re-issue GET /splitter/{job_id} to get fresh ones.
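As a rough sketch of the split call and of reading its result: the function below posts the pile under the file form field and returns the job_id, and a second function pulls name and full_path out of each section. The post argument stands in for requests.post, the base URL and headers are whatever your configuration defines, and the assumption that sections sits at the top level of the finished job is ours, not the API reference's.

```python
import json


def submit_split(post, base_url, headers, pdf_path, categories):
    """POST the pile to /splitter and return the new job's job_id.

    `post` is expected to behave like requests.post.
    """
    with open(pdf_path, "rb") as pdf:
        response = post(
            f"{base_url}/splitter",
            headers=headers,
            files={"file": pdf},  # the splitter's field is `file`, not `pdf_file`
            data={"categories": json.dumps(categories)},
        )
    response.raise_for_status()
    return response.json()["job_id"]


def section_links(split_result):
    """Return (name, full_path) pairs, one per split-out document.

    Assumes `sections` is a top-level list in the finished job's JSON.
    """
    return [(s["name"], s["full_path"]) for s in split_result["sections"]]
```

A list of pairs is used rather than a dictionary so that two documents of the same type in one pile do not overwrite each other.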
Step 3: Extract the Right Fields From Each Document
Now we loop over the split files. For each one, we look up the schema that matches its type and send the file’s URL to /v2/extract. Passing file_url means the extractor pulls the document straight from the split result, so we never download and re-upload it.
3.1 Define a Schema per Type
Each document type wants different fields. We keep one schema per category in a dictionary, keyed by the same names we gave the splitter, so we can look up the right schema by the file’s category.
Add the schema map below the categories in sort_and_extract.py.
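One possible shape for that schema map is sketched below. It assumes the extractor accepts JSON-Schema-style objects, and every field name is illustrative; the only constraint taken from this recipe is that the dictionary keys match the category names given to the splitter (here Invoice plus the hypothetical Receipt and PurchaseOrder).

```python
def text_field(description):
    """Small helper: a string-typed field with a description."""
    return {"type": "string", "description": description}


def number_field(description):
    """Small helper: a number-typed field with a description."""
    return {"type": "number", "description": description}


# One schema per category, keyed by the category names used for the split.
# All field names below are illustrative, not prescribed by the API.
SCHEMAS = {
    "Invoice": {
        "type": "object",
        "properties": {
            "invoice_number": text_field("Identifier printed on the invoice"),
            "vendor_name": text_field("Company that issued the invoice"),
            "total_due": number_field("Total amount owed"),
        },
        "required": ["invoice_number", "total_due"],
    },
    "Receipt": {
        "type": "object",
        "properties": {
            "store_name": text_field("Name of the store"),
            "purchase_date": text_field("Date of the purchase"),
            "total_paid": number_field("Total amount paid"),
        },
        "required": ["total_paid"],
    },
    "PurchaseOrder": {
        "type": "object",
        "properties": {
            "po_number": text_field("Purchase order number"),
            "buyer_name": text_field("Company placing the order"),
            "order_total": number_field("Total value of the order"),
        },
        "required": ["po_number"],
    },
}
```

Because the keys mirror the category names, looking up a schema is a plain dictionary access once the .pdf suffix is removed from a split file's name.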
3.2 Extract Each Section by URL
Loop over the split files, look up the schema for each category, and submit an extract job for each one. We strip the .pdf suffix from the file name to get the category key, and skip any type we don’t have a schema for. Submitting every job before polling any of them matters here: the extractor fetches each file_url at submission time, so the signed links get consumed while they’re fresh instead of waiting behind another job’s polling loop, and the jobs run in parallel on the server rather than one at a time.
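The submit-everything-first ordering described above might be sketched as follows. The submit and wait arguments are hypothetical callables standing in for the HTTP calls: submit posts one job to /v2/extract and returns its job_id, and wait polls a job_id until it finishes. The schema_data form field name is an assumption; only file_url is named in this recipe.

```python
import json


def category_key(file_name):
    """Turn a split file name such as 'Invoice.pdf' into 'Invoice'."""
    if file_name.lower().endswith(".pdf"):
        return file_name[:-4]
    return file_name


def extract_form(file_url, schema):
    """Build the form fields for an extract-by-URL request.

    `schema_data` is an assumed field name; confirm it in the API reference.
    """
    return {"file_url": file_url, "schema_data": json.dumps(schema)}


def submit_extractions(sections, schemas, submit):
    """Submit one job per section that has a schema; return {category: job_id}."""
    jobs = {}
    for section in sections:
        key = category_key(section["name"])
        schema = schemas.get(key)
        if schema is None:
            continue  # no schema for this type, so skip it
        jobs[key] = submit(section["full_path"], schema)
    return jobs


def collect_results(jobs, wait):
    """Poll each submitted job in turn; return {category: extracted fields}."""
    return {key: wait(job_id) for key, job_id in jobs.items()}
```

Splitting submission and collection into two passes is what keeps the signed links fresh: every file_url is handed over before any polling loop starts. Note that this sketch keys jobs by category, so a pile with two documents of the same type would need a list per key instead.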
Add the extraction loop to sort_and_extract.py. Because we extract by URL, there’s no file upload here, so everything goes in the data form fields instead of files.
Step 4: Print the Combined Result
The results object now holds the extracted fields for every document in the pile, keyed by type. Print it as formatted JSON.
Add the final line to sort_and_extract.py and run the script.
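Printing the combined object could be as small as the sketch below. The format_results name is ours, and results is assumed to be the dictionary built in Step 3, keyed by document type.

```python
import json


def format_results(results):
    """Render the combined results as indented JSON, keeping non-ASCII text readable."""
    return json.dumps(results, indent=2, ensure_ascii=False)


def print_results(results):
    """Print the combined results to standard output."""
    print(format_results(results))
```

Calling print_results(results) as the script's last line is enough; ensure_ascii=False is a choice made here so that names and currency symbols print as written rather than as escape sequences.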
Sample Output
Run against the sample pile, the script splits the three pages into three labelled documents, then extracts the fields each type calls for.
Where to Take This Next
The pile in this recipe had one page per document, but the splitter handles multi-page documents too, grouping consecutive pages that belong together. To process a folder of mixed PDFs, wrap the whole script in a loop over your files and merge each run’s results into one collection.
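The folder loop suggested above could look something like this sketch. It assumes the steps of this recipe have been wrapped in a single function, here the hypothetical run_pipeline, which takes a PDF path and returns that file's results dictionary.

```python
from pathlib import Path


def process_files(pdf_paths, run_pipeline):
    """Run the split-and-extract pipeline on each PDF and merge the results.

    Returns {file name: results for that file}. A file that fails is
    recorded with an "error" entry so one bad PDF does not stop the batch.
    """
    collection = {}
    for pdf_path in pdf_paths:
        name = Path(pdf_path).name
        try:
            collection[name] = run_pipeline(pdf_path)
        except Exception as exc:  # keep going; note the failure for later review
            collection[name] = {"error": str(exc)}
    return collection
```

A call such as process_files(sorted(Path("inbox").glob("*.pdf")), run_pipeline) would cover a whole folder, where inbox is a placeholder directory name. Keying the collection by file name keeps each pile's results separate, since two piles can both contain an invoice.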
Splitting
How the splitter decides document boundaries, and the full response reference.
Classification
Label a single document by type without splitting it.
Extraction Schemas
Build richer per-type schemas, including nested objects and line-item tables.
API Reference
Browse the full request and response specs for /splitter and /v2/extract.
