Parse and chunk PDF documents into semantic sections using advanced AI-powered strategies with comprehensive customization options
chunk_processing
parameter allows fine-tuning of how document content is chunked:
ignore_headers_and_footers
(boolean, optional): Whether to exclude headers and footers from chunking.
target_length
(number, optional): Target chunk length in tokens (default: 512)
tokenizer
(object, optional): Tokenization strategy configuration with enum values: “Word”, “Cl100kBase”, “XlmRobertaBase”, “BertBaseUncased”
segment_processing
parameter allows customization of how different document segments are processed. Each segment type can be configured individually:
html
(string, optional): HTML generation method: “LLM”, “Auto”
markdown
(string, optional): Markdown generation method: “LLM”, “Auto”
extended_context
(boolean, optional): Use the full page image as context for LLM generation
crop_image
Controls whether to crop the file’s images to the segment’s bounding box. The cropped image will be stored in the segment’s image field. Use All to always crop, or Auto to only crop when needed for post-processing
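As a sketch, the per-segment options above might be combined like this. The segment type names (“Table”, “Text”) are illustrative assumptions, not taken from this page; the option names (html, markdown, extended_context, crop_image) are the documented ones:

```python
# Hypothetical segment_processing configuration: each segment type gets
# its own generation settings. Segment type keys are assumptions.
segment_processing = {
    "Table": {
        "html": "LLM",             # generate HTML for tables with an LLM
        "markdown": "LLM",         # generate Markdown with an LLM
        "extended_context": True,  # pass the full page image as LLM context
        "crop_image": "All",       # always crop to the segment's bounding box
    },
    "Text": {
        "markdown": "Auto",        # heuristic generation for plain text
        "crop_image": "Auto",      # crop only when post-processing needs it
    },
}
```

Segment types that are omitted would fall back to whatever defaults the service applies.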
llm_processing
parameter configures language model processing:
fallback_strategy
(string, optional): Fallback strategy when LLM processing fails: “None”, …
max_completion_tokens
(number, optional): Maximum tokens for LLM completion (default: 1000)
model_id
(string, optional): LLM model identifier to use for processing
temperature
(number, optional): Temperature setting for LLM processing (0.0 to 1.0)
error_handling
(string, optional): How to handle processing errors:
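A minimal client-side sanity check mirroring the documented constraints on llm_processing. The model identifier below is an illustrative assumption, not a value from this page:

```python
# Validate an llm_processing configuration against the documented bounds:
# temperature in [0.0, 1.0], max_completion_tokens positive.
def validate_llm_processing(cfg: dict) -> dict:
    temperature = cfg.get("temperature", 0.0)
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    if cfg.get("max_completion_tokens", 1000) <= 0:
        raise ValueError("max_completion_tokens must be positive")
    return cfg

llm_processing = validate_llm_processing({
    "fallback_strategy": "None",    # no fallback if LLM processing fails
    "max_completion_tokens": 1000,  # documented default
    "model_id": "gpt-4o-mini",      # illustrative identifier (assumption)
    "temperature": 0.0,             # deterministic output
})
```

Checking these bounds before sending a request surfaces configuration mistakes locally instead of as API errors.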
segmentation_strategy
(string, optional): Document segmentation approach:
ocr_strategy
(string, optional): OCR processing strategy:
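Putting the pieces together, a hedged sketch of a request payload using the parameters above. The enum values for segmentation_strategy, ocr_strategy, and error_handling, and the exact shape of the tokenizer object, are assumptions, since this page does not list them:

```python
# Hypothetical request payload; keys follow the parameters documented above.
payload = {
    "chunk_processing": {
        "ignore_headers_and_footers": True,  # drop headers/footers from chunks
        "target_length": 512,                # documented default, in tokens
        "tokenizer": {"Enum": "Cl100kBase"}, # object shape is an assumption
    },
    "segmentation_strategy": "LayoutAnalysis",  # illustrative value (assumption)
    "ocr_strategy": "Auto",                     # illustrative value (assumption)
    "error_handling": "Continue",               # illustrative value (assumption)
}
```

Any parameter left out is optional, so a minimal request could send only the fields that differ from the defaults.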
Successful response
The response is of type object.