Extract Contents
Extract text content from book files (EPUB, PDF) as structured sections. The task automatically uses AI to identify chapter boundaries in large documents.
How It Works
For EPUB files, the task parses the document structure to extract individual sections with their titles and content. When the EPUB includes structural metadata, each section is automatically classified with a section type (e.g., “titlepage”, “dedication”, “chapter”, “epilogue”, “glossary”). Sections are also marked as front matter, body, or back matter based on their type. Both the section type and the front, body, or back matter status are displayed in the contents viewer.
For PDF files, which are supported only when PDF file support is enabled, AI is used to detect chapter boundaries within the continuous text.
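The EPUB section typing described above can be pictured as a small data structure. The following is only an illustrative sketch: the `Section` class, the `matter_for` function, the field names, and the grouping of types into front and back matter are all assumptions made for this example, not the task's actual schema or rules.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical groupings for illustration; the task's real mapping is not documented here.
FRONT_MATTER_TYPES = {"titlepage", "dedication"}
BACK_MATTER_TYPES = {"glossary"}


@dataclass
class Section:
    title: str
    content: str
    # None stands for an EPUB section that carries no structural metadata.
    section_type: Optional[str] = None


def matter_for(section: Section) -> Optional[str]:
    """Return "front", "body", or "back" based on the section type.

    Returns None when the section has no type, since there is nothing to derive it from.
    Treating every other typed section as body is an assumption of this sketch.
    """
    if section.section_type is None:
        return None
    if section.section_type in FRONT_MATTER_TYPES:
        return "front"
    if section.section_type in BACK_MATTER_TYPES:
        return "back"
    return "body"


sections = [
    Section("Title Page", "...", "titlepage"),
    Section("Chapter 1", "...", "chapter"),
    Section("Glossary", "...", "glossary"),
    Section("Untitled", "..."),
]
statuses = [matter_for(s) for s in sections]
```

In this sketch an untyped section gets no matter status at all, which mirrors the statement that the status is derived from the section type.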
Classify Sections with AI
Some EPUB files don’t include the structural metadata needed to automatically identify section types. When Classify Sections with AI is enabled, AI determines what each section is — whether it’s a chapter, dedication, epilogue, acknowledgements, and so on. Sections that are already identified from the file’s metadata are left as-is.
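The "fill in only what is missing" behaviour can be sketched as follows. This is a minimal illustration under stated assumptions: `classify_missing`, the dictionary keys, and the stand-in `fake_classifier` are hypothetical names invented here, and the real task's AI call is not shown or known.

```python
from typing import Callable, Dict, List, Optional

SectionDict = Dict[str, Optional[str]]


def classify_missing(
    sections: List[SectionDict],
    classify: Callable[[SectionDict], str],
) -> List[SectionDict]:
    """Return new section dicts, calling `classify` only where no type is set."""
    result = []
    for section in sections:
        if section.get("section_type") is None:
            # No type from the file's metadata: ask the classifier.
            result.append({**section, "section_type": classify(section)})
        else:
            # Already identified from metadata: left as-is.
            result.append(dict(section))
    return result


def fake_classifier(section: SectionDict) -> str:
    # Stand-in for the AI step; a real classifier would inspect the content.
    return "dedication"


sections = [
    {"title": "Chapter 1", "section_type": "chapter"},
    {"title": "For Anna", "section_type": None},
]
classified = classify_missing(sections, fake_classifier)
```

Returning new dictionaries instead of mutating the input keeps the original extraction result available for comparison, which is a design choice of this sketch only.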
When to Use
Use this task early in a pipeline to convert book files into the StructuredText format required by most AI analysis tasks. Enable Classify Sections with AI when you want your sections to be automatically labelled by type, especially for files that don’t already include that information.
Reference
Input: Book file (EPUB or PDF) to extract content from.
Output: Sections with titles and text content.
Default: false. Cost: +2 credits. Enable PDF file support. Without this, only EPUB files are supported.
Default: false. Cost: +1 credit. Use AI to classify sections by type (e.g., chapter, dedication, epilogue) when the file lacks structural metadata. Sections already classified via EPUB metadata are left untouched.