tabula

package module
v1.6.6
Published: Feb 4, 2026 License: MIT Imports: 20 Imported by: 0

README

Tabula

A Go text extraction library with a fluent API, designed for RAG (Retrieval-Augmented Generation) workflows. Supports PDF, DOCX, ODT, XLSX, PPTX, HTML, and EPUB files with automatic OCR for scanned documents.

Features

  • Fluent API - Chain methods for clean, readable code
  • Multi-Format Support - PDF (.pdf), Word (.docx), OpenDocument (.odt), Excel (.xlsx), PowerPoint (.pptx), HTML (.html, .htm), and EPUB (.epub) files
  • Layout Analysis - Detect headings, paragraphs, lists, and tables
  • Header/Footer Detection - Automatically identify and exclude repeating content
  • HTML Navigation Filtering - Remove headers, footers, nav, and sidebars from web pages with configurable exclusion modes
  • RAG-Ready Chunking - Semantic document chunking with metadata
  • Markdown Export - Convert extracted content to markdown
  • PDF 1.0-1.7 Support - Including modern XRef streams (PDF 1.5+)
  • Optional OCR - Text extraction from scanned PDFs via Tesseract (build with -tags ocr)

Installation

go get github.com/tsawler/tabula

By default, Tabula builds as pure Go with no CGO dependencies. This handles PDF, DOCX, ODT, XLSX, PPTX, HTML, and EPUB files with native text.

Optional: OCR Support for Scanned PDFs

To enable OCR for scanned PDFs (pages with no native text), build with the ocr tag:

go build -tags ocr

This requires Tesseract OCR to be installed:

macOS (Homebrew):

brew install tesseract
# Optional: additional language packs
brew install tesseract-lang

Ubuntu/Debian:

apt-get install tesseract-ocr libtesseract-dev libleptonica-dev
# Optional: additional language packs
apt-get install tesseract-ocr-fra tesseract-ocr-deu  # French, German, etc.

CGO Flags (macOS Apple Silicon only)

On Apple Silicon Macs (M1/M2/M3/M4), when building with -tags ocr, you must set CGO flags for the compiler to find Tesseract:

export CGO_CPPFLAGS="-I/opt/homebrew/include"
export CGO_LDFLAGS="-L/opt/homebrew/lib"

Add these to your ~/.zshrc for persistence. These flags are needed both when building tabula and when building any project that uses tabula as a dependency.

Platform              CGO flags needed?
Ubuntu/Debian         No
macOS Intel           No
macOS Apple Silicon   Yes (with -tags ocr only)

Quick Start

Extract Text
package main

import (
    "fmt"
    "log"

    "github.com/tsawler/tabula"
)

func main() {
    // Works with PDF, DOCX, ODT, XLSX, PPTX, HTML, and EPUB files
    text, warnings, err := tabula.Open("document.pdf").Text()
    // text, warnings, err := tabula.Open("document.docx").Text()
    // text, warnings, err := tabula.Open("document.odt").Text()
    // text, warnings, err := tabula.Open("spreadsheet.xlsx").Text()
    // text, warnings, err := tabula.Open("presentation.pptx").Text()
    // text, warnings, err := tabula.Open("page.html").Text()
    // text, warnings, err := tabula.Open("book.epub").Text()
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(text)

    for _, w := range warnings {
        fmt.Println("Warning:", w.Message)
    }
}
Extract with Options
// PDF with all options
text, warnings, err := tabula.Open("document.pdf").
    Pages(1, 2, 3).              // Specific pages (PDF only)
    ExcludeHeadersAndFooters().  // Remove headers/footers
    JoinParagraphs().            // Join text into paragraphs (PDF only)
    Text()

// DOCX/ODT with header/footer exclusion
text, warnings, err := tabula.Open("document.docx").
    ExcludeHeadersAndFooters().  // Remove headers/footers
    Text()

// XLSX (each sheet extracted as tab-separated values)
text, warnings, err := tabula.Open("spreadsheet.xlsx").Text()

// PPTX (each slide extracted with title and content)
text, warnings, err := tabula.Open("presentation.pptx").
    ExcludeHeadersAndFooters().  // Remove slide footers and numbers
    Text()

// HTML (extracts text from headings, paragraphs, lists, tables)
// Use navigation exclusion to remove headers, footers, and nav elements
text, warnings, err := tabula.Open("page.html").Text()

// HTML from URL with navigation filtering
import "github.com/tsawler/tabula/htmldoc"
resp, _ := http.Get("https://example.com")
reader, _ := htmldoc.OpenReader(resp.Body)
opts := htmldoc.ExtractOptions{NavigationExclusion: htmldoc.NavigationExclusionStandard}
text, _ := reader.TextWithOptions(opts)

// EPUB (e-books, supports EPUB 2 and EPUB 3)
text, warnings, err := tabula.Open("book.epub").Text()
Extract as Markdown
// PDF with header/footer exclusion
markdown, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    ToMarkdown()

// DOCX with header/footer exclusion (preserves headings, lists, tables)
markdown, warnings, err := tabula.Open("document.docx").
    ExcludeHeadersAndFooters().
    ToMarkdown()

// ODT with header/footer exclusion (preserves headings, lists, tables)
markdown, warnings, err := tabula.Open("document.odt").
    ExcludeHeadersAndFooters().
    ToMarkdown()

// XLSX (each sheet as a markdown table)
markdown, warnings, err := tabula.Open("spreadsheet.xlsx").ToMarkdown()

// PPTX (each slide with title as heading, content, and tables)
markdown, warnings, err := tabula.Open("presentation.pptx").
    ExcludeHeadersAndFooters().
    ToMarkdown()

// HTML (preserves headings, lists, tables, code blocks)
// Use navigation exclusion to remove headers, footers, and nav elements
markdown, warnings, err := tabula.Open("page.html").ToMarkdown()

// HTML with aggressive navigation filtering
import "github.com/tsawler/tabula/htmldoc"
reader, _ := htmldoc.OpenReader(resp.Body)
opts := htmldoc.ExtractOptions{NavigationExclusion: htmldoc.NavigationExclusionAggressive}
markdown, _ := reader.MarkdownWithOptions(opts)

// EPUB (preserves chapter structure, headings, lists)
markdown, warnings, err := tabula.Open("book.epub").ToMarkdown()
RAG Chunking
package main

import (
    "fmt"
    "log"

    "github.com/tsawler/tabula"
)

func main() {
    // Works with PDF, DOCX, ODT, XLSX, PPTX, HTML, and EPUB
    chunks, warnings, err := tabula.Open("document.pdf").Chunks()
    // chunks, warnings, err := tabula.Open("document.docx").Chunks()
    // chunks, warnings, err := tabula.Open("document.odt").Chunks()
    // chunks, warnings, err := tabula.Open("spreadsheet.xlsx").Chunks()
    // chunks, warnings, err := tabula.Open("presentation.pptx").Chunks()
    // chunks, warnings, err := tabula.Open("page.html").Chunks()
    // chunks, warnings, err := tabula.Open("book.epub").Chunks()
    if err != nil {
        log.Fatal(err)
    }

    for i, chunk := range chunks.Chunks {
        fmt.Printf("Chunk %d: %s (p.%d-%d, ~%d tokens)\n",
            i+1,
            chunk.Metadata.SectionTitle,
            chunk.Metadata.PageStart,
            chunk.Metadata.PageEnd,
            chunk.Metadata.EstimatedTokens)
        fmt.Println(chunk.Text)
        fmt.Println("---")
    }

    // Warnings are non-fatal issues
    for _, w := range warnings {
        fmt.Println("Warning:", w.Message)
    }
}
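
The EstimatedTokens values printed above are heuristic estimates. A common rule of thumb for English text is roughly four characters per token; a minimal sketch of that heuristic (tabula's exact estimator may differ):

```go
package main

import "fmt"

// estimateTokens applies the common ~4-characters-per-token heuristic
// for English text. This is an illustrative estimate of the same kind
// as EstimatedTokens, not necessarily tabula's implementation.
func estimateTokens(text string) int {
	return len(text) / 4
}

func main() {
	fmt.Println(estimateTokens("The quick brown fox jumps over the lazy dog.")) // 11
}
```
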
Chunks as Markdown (for Vector DBs)
chunks, _, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    Chunks()
if err != nil {
    log.Fatal(err)
}

// Get each chunk as separate markdown strings
mdChunks := chunks.ToMarkdownChunks()

for i, md := range mdChunks {
    // Store each chunk in your vector database
    embedding := embedModel.Embed(md)
    vectorDB.Store(chunks.Chunks[i].ID, embedding, md)
}

API Reference

Opening Documents
// From file path (format auto-detected by extension)
ext := tabula.Open("document.pdf")
ext := tabula.Open("document.docx")
ext := tabula.Open("document.odt")
ext := tabula.Open("spreadsheet.xlsx")
ext := tabula.Open("presentation.pptx")
ext := tabula.Open("page.html")
ext := tabula.Open("book.epub")

// From existing PDF reader (PDF only)
r, _ := reader.Open("document.pdf")
ext := tabula.FromReader(r)

// From HTML string (useful for web scraping)
html := `<html><body><h1>Hello</h1><p>World</p></body></html>`
ext := tabula.FromHTMLString(html)

// From HTML io.Reader (useful for HTTP responses)
resp, _ := http.Get("https://example.com/page")
defer resp.Body.Close()
ext := tabula.FromHTMLReader(resp.Body)
Fluent Options
Method Description Formats
Pages(1, 2, 3) Extract specific pages (1-indexed) PDF
PageRange(1, 10) Extract page range (inclusive) PDF
ExcludeHeaders() Exclude detected headers PDF, DOCX, ODT, XLSX, PPTX, EPUB
ExcludeFooters() Exclude detected footers PDF, DOCX, ODT, XLSX, PPTX, EPUB
ExcludeHeadersAndFooters() Exclude both PDF, DOCX, ODT, XLSX, PPTX, EPUB
JoinParagraphs() Join text fragments into paragraphs PDF
ByColumn() Process multi-column layouts column by column PDF
PreserveLayout() Maintain spatial positioning PDF

Note: HTML files are single-page documents, so page selection options don't apply. For HTML navigation/header/footer removal, use the htmldoc package directly with NavigationExclusionMode options (see below).

Terminal Operations
Method Returns Description Formats
Text() string Plain text content PDF, DOCX, ODT, XLSX, PPTX, HTML, EPUB
ToMarkdown() string Markdown-formatted content PDF, DOCX, ODT, XLSX, PPTX, HTML, EPUB
ToMarkdownWithOptions(opts) string Markdown with custom options PDF, DOCX, ODT, XLSX, PPTX, HTML, EPUB
Document() *model.Document Full document structure PDF, DOCX, ODT, XLSX, PPTX, HTML, EPUB
Chunks() *rag.ChunkCollection Semantic chunks for RAG PDF, DOCX, ODT, XLSX, PPTX, HTML, EPUB
ChunksWithConfig(config, sizeConfig) *rag.ChunkCollection Chunks with custom sizing PDF, DOCX, ODT, XLSX, PPTX, HTML, EPUB
PageCount() int Number of pages/sheets/slides/chapters PDF, DOCX, ODT, XLSX, PPTX, HTML, EPUB
Fragments() []text.TextFragment Raw text fragments with positions PDF
Lines() []layout.Line Detected text lines PDF
Paragraphs() []layout.Paragraph Detected paragraphs PDF
Headings() []layout.Heading Detected headings (H1-H6) PDF
Lists() []layout.List Detected lists PDF
Blocks() []layout.Block Text blocks PDF
Elements() []layout.LayoutElement All elements in reading order PDF
Analyze() *layout.AnalysisResult Complete layout analysis PDF

Note on PDF-only methods: The methods marked "PDF" in the tables above (Pages, PageRange, JoinParagraphs, ByColumn, PreserveLayout, Fragments, Lines, Paragraphs, Headings, Lists, Blocks, Elements, Analyze) exist because PDFs lack semantic structure - they store raw text fragments at arbitrary positions, requiring layout analysis to reconstruct document structure. DOCX, ODT, XLSX, PPTX, HTML, and EPUB files already contain explicit semantic markup, so these detection methods aren't needed. Use Document() to access the semantic structure for all formats.

Note on XLSX: For Excel files, each sheet becomes a page, and the sheet data is represented as a table element. PageCount() returns the number of sheets. Text() returns tab-separated values, while ToMarkdown() formats each sheet as a markdown table.
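
Since Text() emits each sheet as tab-separated values, the output can be fed straight into a TSV parser. A stdlib sketch (the sample data here is hypothetical, not actual tabula output):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseSheet turns tab-separated sheet text back into rows;
// encoding/csv handles TSV once Comma is set to '\t'.
func parseSheet(tsv string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(tsv))
	r.Comma = '\t'
	return r.ReadAll()
}

func main() {
	// Hypothetical Text() output for a small one-sheet workbook.
	rows, err := parseSheet("Name\tQty\nApples\t3\nPears\t5\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(rows), rows[1][0]) // 3 Apples
}
```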

Note on PPTX: For PowerPoint files, each slide becomes a page. PageCount() returns the number of slides. Slide titles are extracted as headings, bullet points as lists, and tables are preserved. Use ExcludeHeadersAndFooters() to remove slide footers, dates, and slide numbers.

Note on HTML: For HTML files, the entire document is treated as a single page. PageCount() returns 1. Semantic elements are preserved: headings (<h1>-<h6>), paragraphs (<p>), lists (<ul>, <ol>), tables (<table> with colspan/rowspan), code blocks (<pre>, <code>), and blockquotes (<blockquote>). Metadata is extracted from <title> and <meta> tags. For navigation/header/footer removal, use the htmldoc package with NavigationExclusionMode (see HTML Navigation Filtering section below).

Note on EPUB: For EPUB files (both EPUB 2 and EPUB 3), each chapter (spine item) becomes a page. PageCount() returns the number of chapters. Dublin Core metadata is extracted (title, author, language, identifier/ISBN, etc.). The table of contents is parsed from NCX (EPUB 2) or nav document (EPUB 3). DRM-protected EPUBs are rejected with an error. Content is extracted using the HTML parser, preserving headings, paragraphs, lists, and tables.
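
Under the hood, an EPUB is a ZIP container: META-INF/container.xml points at the OPF package document, whose spine lists the chapters. A stdlib sketch that builds and inspects a minimal EPUB-shaped archive (illustrative of the container layout only, not tabula's parser):

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// buildMiniEpub assembles an EPUB-shaped ZIP in memory. Real EPUBs contain
// META-INF/container.xml pointing at the OPF package document, whose <spine>
// lists the chapters that become "pages" during extraction.
func buildMiniEpub() []byte {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	container, _ := w.Create("META-INF/container.xml")
	io.WriteString(container, `<container><rootfiles><rootfile full-path="OEBPS/content.opf"/></rootfiles></container>`)
	opf, _ := w.Create("OEBPS/content.opf")
	io.WriteString(opf, `<package><spine><itemref idref="ch1"/></spine></package>`)
	w.Close()
	return buf.Bytes()
}

// entryNames lists the archive's files, the first step any EPUB reader takes.
func entryNames(data []byte) ([]string, error) {
	r, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var names []string
	for _, f := range r.File {
		names = append(names, f.Name)
	}
	return names, nil
}

func main() {
	names, err := entryNames(buildMiniEpub())
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // [META-INF/container.xml OEBPS/content.opf]
}
```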

Inspection Methods (non-terminal, PDF only)
ext := tabula.Open("document.pdf")
defer ext.Close()

isCharLevel, _ := ext.IsCharacterLevel()  // Detect character-level PDFs
isMultiCol, _ := ext.IsMultiColumn()      // Detect multi-column layouts
pageCount, _ := ext.PageCount()           // Get page count (works with all supported formats)
HTML Navigation Filtering

When processing HTML content (especially web pages), use the htmldoc package directly to filter out navigation, headers, footers, and sidebars:

import "github.com/tsawler/tabula/htmldoc"

// From HTTP response
resp, _ := http.Get("https://example.com/article")
defer resp.Body.Close()
reader, _ := htmldoc.OpenReader(resp.Body)

// Choose exclusion mode
opts := htmldoc.ExtractOptions{
    NavigationExclusion: htmldoc.NavigationExclusionStandard,
}

// Extract clean text or markdown
text, _ := reader.TextWithOptions(opts)
markdown, _ := reader.MarkdownWithOptions(opts)

Navigation Exclusion Modes:

Mode Description
NavigationExclusionNone Include all content without filtering
NavigationExclusionExplicit Skip only semantic HTML5 elements: <nav>, <aside>, and ARIA roles. <header>/<footer> only when top-level
NavigationExclusionStandard Explicit + class/id pattern matching (nav, navbar, menu, footer, sidebar, etc.)
NavigationExclusionAggressive Standard + link-density heuristics (excludes sections with >60% link text)
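
The link-density heuristic behind NavigationExclusionAggressive can be sketched as a simple ratio; the calculation below is illustrative, not tabula's implementation:

```go
package main

import "fmt"

// linkDensity returns the fraction of a section's characters that sit
// inside links. Aggressive navigation filters drop sections above a
// threshold (the table above cites >60% link text).
func linkDensity(linkChars, totalChars int) float64 {
	if totalChars == 0 {
		return 0
	}
	return float64(linkChars) / float64(totalChars)
}

func main() {
	// A nav bar where 42 of 50 characters are link text: excluded.
	fmt.Println(linkDensity(42, 50) > 0.6) // true
	// An article paragraph with 30 of 400 characters in links: kept.
	fmt.Println(linkDensity(30, 400) > 0.6) // false
}
```
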
OCR Fallback for Scanned PDFs

When built with -tags ocr, Tabula automatically uses OCR when a PDF page contains no native text (i.e., scanned documents). This happens transparently—no code changes needed:

// Works automatically on scanned PDFs (when built with -tags ocr)
text, warnings, err := tabula.Open("scanned-document.pdf").Text()
if err != nil {
    log.Fatal(err)
}

// Check if OCR was used
for _, w := range warnings {
    if w.Code == tabula.WarningOCRFallback {
        fmt.Println("Note: OCR was used for some pages")
    }
}

Build requirements:

# Without OCR - scanned pages return empty text
go build

# With OCR - scanned pages use Tesseract
go build -tags ocr

How it works:

  1. For each page, Tabula first attempts native PDF text extraction
  2. If a page has no text fragments (common in scanned PDFs), it extracts embedded images
  3. Images are converted to PNG and processed through Tesseract OCR
  4. A WarningOCRFallback is added to indicate which pages used OCR

Without the ocr build tag: OCR is disabled and ocr.New() returns ocr.ErrOCRNotEnabled. Scanned PDF pages will return empty text.

Supported image formats in PDFs:

  • CCITT Group 3/4 fax (common in scanned documents)
  • DCT (JPEG)
  • Grayscale, RGB, and CMYK color spaces
  • 1-bit, 4-bit, and 8-bit depths

RAG Integration

Chunk Filtering
chunks, _, _ := tabula.Open("doc.pdf").Chunks()

// Filter by content type
tablesOnly := chunks.FilterWithTables()
listsOnly := chunks.FilterWithLists()

// Filter by location
section := chunks.FilterBySection("Introduction")
page5 := chunks.FilterByPage(5)
pages1to10 := chunks.FilterByPageRange(1, 10)

// Filter by size
smallChunks := chunks.FilterByMaxTokens(500)
largeChunks := chunks.FilterByMinTokens(100)

// Search
matches := chunks.Search("keyword")

// Chain filters
result := chunks.
    FilterBySection("Methods").
    FilterByMinTokens(100).
    Search("algorithm")
Markdown Options
import "github.com/tsawler/tabula/rag"

// Options supported by all formats (PDF, DOCX, ODT, XLSX, PPTX, HTML, EPUB)
opts := rag.MarkdownOptions{
    IncludeMetadata:        true,   // YAML front matter with document metadata
    IncludeTableOfContents: true,   // Generated TOC from headings
    HeadingLevelOffset:     0,      // Adjust heading levels (1 makes H1 -> H2)
    MaxHeadingLevel:        6,      // Cap heading depth
}

// Works with all formats
markdown, _, _ := tabula.Open("doc.pdf").ToMarkdownWithOptions(opts)
markdown, _, _ := tabula.Open("doc.docx").ToMarkdownWithOptions(opts)
markdown, _, _ := tabula.Open("doc.odt").ToMarkdownWithOptions(opts)
markdown, _, _ := tabula.Open("spreadsheet.xlsx").ToMarkdownWithOptions(opts)
markdown, _, _ := tabula.Open("presentation.pptx").ToMarkdownWithOptions(opts)
markdown, _, _ := tabula.Open("page.html").ToMarkdownWithOptions(opts)
markdown, _, _ := tabula.Open("book.epub").ToMarkdownWithOptions(opts)

// PDF-only options (used via RAG chunking pipeline)
pdfOpts := rag.MarkdownOptions{
    IncludeMetadata:        true,
    IncludeChunkSeparators: true,   // --- between chunks (PDF only)
    IncludePageNumbers:     true,   // Page references (PDF only)
    IncludeChunkIDs:        true,   // HTML comments with chunk IDs (PDF only)
}

// Or use preset for RAG
opts := rag.RAGOptimizedMarkdownOptions()
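
To see what HeadingLevelOffset and MaxHeadingLevel do, here is a minimal sketch of the transformation (a hypothetical helper, not tabula's code): each markdown heading is deepened by the offset and capped at the maximum level.

```go
package main

import (
	"fmt"
	"strings"
)

// shiftHeadings deepens each markdown heading by offset, capping the
// result at maxLevel. Illustrates the effect of HeadingLevelOffset and
// MaxHeadingLevel; the library applies this internally.
func shiftHeadings(md string, offset, maxLevel int) string {
	lines := strings.Split(md, "\n")
	for i, line := range lines {
		trimmed := strings.TrimLeft(line, "#")
		level := len(line) - len(trimmed)
		// Only lines like "## Title" (hashes then a space) are headings.
		if level > 0 && strings.HasPrefix(trimmed, " ") {
			newLevel := level + offset
			if newLevel > maxLevel {
				newLevel = maxLevel
			}
			lines[i] = strings.Repeat("#", newLevel) + trimmed
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	fmt.Println(shiftHeadings("# Title\n## Section", 1, 6))
}
```
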
Custom Chunk Sizing
import "github.com/tsawler/tabula/rag"

config := rag.ChunkerConfig{
    TargetChunkSize: 500,   // Target characters per chunk
    MaxChunkSize:    1000,  // Maximum characters
    MinChunkSize:    100,   // Minimum characters
    OverlapSize:     50,    // Overlap between chunks
}
sizeConfig := rag.DefaultSizeConfig()

chunks, _, _ := tabula.Open("doc.pdf").ChunksWithConfig(config, sizeConfig)
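
The sizing parameters interact like this: consecutive chunks advance by TargetChunkSize minus OverlapSize, so neighbors share OverlapSize characters of context. Tabula's chunker is semantic (section-aware), but a naive character-level sketch shows the arithmetic:

```go
package main

import "fmt"

// chunkWithOverlap splits text into chunks of up to target characters,
// where consecutive chunks share overlap characters. A naive sketch of
// the sizing parameters only, not tabula's semantic chunker.
func chunkWithOverlap(text string, target, overlap int) []string {
	step := target - overlap
	if step <= 0 {
		step = target // degenerate config: fall back to no overlap
	}
	var chunks []string
	for start := 0; start < len(text); start += step {
		end := start + target
		if end > len(text) {
			end = len(text)
		}
		chunks = append(chunks, text[start:end])
		if end == len(text) {
			break
		}
	}
	return chunks
}

func main() {
	fmt.Println(chunkWithOverlap("abcdefghij", 4, 2)) // [abcd cdef efgh ghij]
}
```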

Working with Results

Chunk Metadata
for _, chunk := range chunks.Chunks {
    fmt.Println("ID:", chunk.ID)
    fmt.Println("Section:", chunk.Metadata.SectionTitle)
    fmt.Println("Pages:", chunk.Metadata.PageStart, "-", chunk.Metadata.PageEnd)
    fmt.Println("Words:", chunk.Metadata.WordCount)
    fmt.Println("Tokens:", chunk.Metadata.EstimatedTokens)
    fmt.Println("Has Table:", chunk.Metadata.HasTable)
    fmt.Println("Has List:", chunk.Metadata.HasList)
}
Collection Statistics
stats := chunks.Statistics()
fmt.Println("Total chunks:", stats.TotalChunks)
fmt.Println("Total words:", stats.TotalWords)
fmt.Println("Average tokens:", stats.AvgTokens)
fmt.Println("Chunks with tables:", stats.ChunksWithTables)

Warnings

The library returns warnings for non-fatal issues:

text, warnings, err := tabula.Open("document.pdf").Text()
if err != nil {
    log.Fatal(err)  // Fatal error
}

for _, w := range warnings {
    log.Println("Warning:", w.Message)  // Non-fatal issues
}

// Format all warnings
formatted := tabula.FormatWarnings(warnings)

Common warnings:

  • "Detected messy/display-oriented PDF traits" - PDF may have unusual text layout
  • "Used OCR fallback (scanned content)" - Page contained only images; text extracted via OCR
  • High fragmentation warnings - Text is split into many small fragments

Error Handling Helpers

// Panic on error (for scripts/tests)
text := tabula.MustText(tabula.Open("doc.pdf").Text())
count := tabula.Must(tabula.Open("doc.pdf").PageCount())

Testing

# Run tests without OCR (pure Go)
go test ./...

# Run tests with OCR enabled (requires Tesseract)
go test -tags ocr ./...

# Or using Task
task test       # without OCR
task test-ocr   # with OCR (handles CGO flags automatically)

Note: On Apple Silicon Macs with OCR enabled, ensure you've set the CGO flags described in the Installation section, or use task test-ocr which sets them automatically.

License

MIT License

Documentation

Overview

Package tabula provides a fluent API for extracting text, tables, and other content from PDF, DOCX, ODT, XLSX, PPTX, HTML, and EPUB files.

Basic usage:

text, warnings, err := tabula.Open("document.pdf").Text()
if err != nil {
    // handle error
}
if len(warnings) > 0 {
    log.Println("Warnings:", tabula.FormatWarnings(warnings))
}

DOCX files work the same way:

text, warnings, err := tabula.Open("document.docx").Text()

With options:

text, _, err := tabula.Open("report.pdf").
    Pages(1, 2, 3).
    ExcludeHeaders().
    ExcludeFooters().
    Text()

HTML content can be parsed from a string (useful for web scraping):

text, _, err := tabula.FromHTMLString(htmlContent).Text()

For advanced use cases, the lower-level reader package is also available.

Example (ChunkFiltering)
package main

import (
	"github.com/tsawler/tabula"
)

func main() {
	chunks, _, _ := tabula.Open("doc.pdf").Chunks()

	// Filter by content type
	tablesOnly := chunks.FilterWithTables()
	listsOnly := chunks.FilterWithLists()
	_ = tablesOnly
	_ = listsOnly

	// Filter by location
	section := chunks.FilterBySection("Introduction")
	page5 := chunks.FilterByPage(5)
	pages1to10 := chunks.FilterByPageRange(1, 10)
	_ = section
	_ = page5
	_ = pages1to10

	// Filter by size
	smallChunks := chunks.FilterByMaxTokens(500)
	largeChunks := chunks.FilterByMinTokens(100)
	_ = smallChunks
	_ = largeChunks

	// Search
	matches := chunks.Search("keyword")
	_ = matches

	// Chain filters
	result := chunks.
		FilterBySection("Methods").
		FilterByMinTokens(100).
		Search("algorithm")
	_ = result
}
Example (ChunkMetadata)
package main

import (
	"fmt"

	"github.com/tsawler/tabula"
)

func main() {
	chunks, _, _ := tabula.Open("doc.pdf").Chunks()

	for _, chunk := range chunks.Chunks {
		fmt.Println("ID:", chunk.ID)
		fmt.Println("Section:", chunk.Metadata.SectionTitle)
		fmt.Println("Pages:", chunk.Metadata.PageStart, "-", chunk.Metadata.PageEnd)
		fmt.Println("Words:", chunk.Metadata.WordCount)
		fmt.Println("Tokens:", chunk.Metadata.EstimatedTokens)
		fmt.Println("Has Table:", chunk.Metadata.HasTable)
		fmt.Println("Has List:", chunk.Metadata.HasList)
	}
}
Example (ChunksAsMarkdown)
package main

import (
	"log"

	"github.com/tsawler/tabula"
)

func main() {
	chunks, _, err := tabula.Open("document.pdf").
		ExcludeHeadersAndFooters().
		Chunks()
	if err != nil {
		log.Fatal(err)
	}

	// Get each chunk as separate markdown strings
	mdChunks := chunks.ToMarkdownChunks()

	for i, md := range mdChunks {
		// Example: store each chunk in your vector database
		_ = chunks.Chunks[i].ID
		_ = md
	}
}
Example (CollectionStatistics)
package main

import (
	"fmt"

	"github.com/tsawler/tabula"
)

func main() {
	chunks, _, _ := tabula.Open("doc.pdf").Chunks()

	stats := chunks.Statistics()
	fmt.Println("Total chunks:", stats.TotalChunks)
	fmt.Println("Total words:", stats.TotalWords)
	fmt.Println("Average tokens:", stats.AvgTokens)
	fmt.Println("Chunks with tables:", stats.ChunksWithTables)
}
Example (CustomChunkSizing)
package main

import (
	"github.com/tsawler/tabula"
	"github.com/tsawler/tabula/rag"
)

func main() {
	config := rag.ChunkerConfig{
		TargetChunkSize: 500,  // Target characters per chunk
		MaxChunkSize:    1000, // Maximum characters
		MinChunkSize:    100,  // Minimum characters
		OverlapSize:     50,   // Overlap between chunks
	}
	sizeConfig := rag.DefaultSizeConfig()

	chunks, _, _ := tabula.Open("doc.pdf").ChunksWithConfig(config, sizeConfig)
	_ = chunks
}
Example (ErrorHandling)
package main

import (
	"github.com/tsawler/tabula"
)

func main() {
	// Panic on error (for scripts/tests)
	text := tabula.MustText(tabula.Open("doc.pdf").Text())
	count := tabula.Must(tabula.Open("doc.pdf").PageCount())
	_ = text
	_ = count
}
Example (ExtractMarkdown)
package main

import (
	"github.com/tsawler/tabula"
)

func main() {
	// PDF with header/footer exclusion
	markdown, warnings, err := tabula.Open("document.pdf").
		ExcludeHeadersAndFooters().
		ToMarkdown()
	_ = markdown
	_ = warnings
	_ = err

	// DOCX (preserves headings, lists, tables)
	markdown, warnings, err = tabula.Open("document.docx").ToMarkdown()
	_ = markdown
	_ = warnings
	_ = err
}
Example (ExtractText)
package main

import (
	"fmt"
	"log"

	"github.com/tsawler/tabula"
)

func main() {
	// Works with PDF, DOCX, ODT, XLSX, PPTX, HTML, and EPUB files
	text, warnings, err := tabula.Open("document.pdf").Text()
	// text, warnings, err := tabula.Open("document.docx").Text()
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(text)

	for _, w := range warnings {
		fmt.Println("Warning:", w.Message)
	}
}
Example (ExtractWithOptions)
package main

import (
	"github.com/tsawler/tabula"
)

func main() {
	text, warnings, err := tabula.Open("document.pdf").
		Pages(1, 2, 3).             // Specific pages (PDF only)
		ExcludeHeadersAndFooters(). // Remove repeating headers/footers (PDF only)
		JoinParagraphs().           // Join text into paragraphs (PDF only)
		Text()
	_ = text
	_ = warnings
	_ = err
}
Example (InspectionMethods)
package main

import (
	"github.com/tsawler/tabula"
)

func main() {
	ext := tabula.Open("document.pdf")
	defer ext.Close()

	isCharLevel, _ := ext.IsCharacterLevel() // Detect character-level PDFs
	isMultiCol, _ := ext.IsMultiColumn()     // Detect multi-column layouts
	pageCount, _ := ext.PageCount()          // Get page count (works with all supported formats)
	_ = isCharLevel
	_ = isMultiCol
	_ = pageCount
}
Example (MarkdownOptions)
package main

import (
	"github.com/tsawler/tabula"
	"github.com/tsawler/tabula/rag"
)

func main() {
	opts := rag.MarkdownOptions{
		IncludeMetadata:        true, // YAML front matter
		IncludeTableOfContents: true, // Generated TOC
		IncludeChunkSeparators: true, // --- between chunks
		IncludePageNumbers:     true, // Page references
		IncludeChunkIDs:        true, // HTML comments with chunk IDs
	}

	markdown, _, _ := tabula.Open("doc.pdf").ToMarkdownWithOptions(opts)
	_ = markdown

	// Or use preset for RAG
	opts = rag.RAGOptimizedMarkdownOptions()
	_ = opts
}
Example (OpenDocuments)
package main

import (
	"github.com/tsawler/tabula"
	"github.com/tsawler/tabula/reader"
)

func main() {
	// From file path (format auto-detected by extension)
	ext := tabula.Open("document.pdf")
	_ = ext
	ext = tabula.Open("document.docx")
	_ = ext

	// From existing PDF reader (PDF only)
	r, _ := reader.Open("document.pdf")
	ext = tabula.FromReader(r)
	_ = ext
}
Example (RagChunking)
package main

import (
	"fmt"
	"log"

	"github.com/tsawler/tabula"
)

func main() {
	// Works with PDF, DOCX, ODT, XLSX, PPTX, HTML, and EPUB
	chunks, warnings, err := tabula.Open("document.pdf").Chunks()
	// chunks, warnings, err := tabula.Open("document.docx").Chunks()
	if err != nil {
		log.Fatal(err)
	}

	for i, chunk := range chunks.Chunks {
		fmt.Printf("Chunk %d: %s (p.%d-%d, ~%d tokens)\n",
			i+1,
			chunk.Metadata.SectionTitle,
			chunk.Metadata.PageStart,
			chunk.Metadata.PageEnd,
			chunk.Metadata.EstimatedTokens)
		fmt.Println(chunk.Text)
		fmt.Println("---")
	}

	// Warnings are non-fatal issues
	for _, w := range warnings {
		fmt.Println("Warning:", w.Message)
	}
}
Example (Warnings)
package main

import (
	"log"

	"github.com/tsawler/tabula"
)

func main() {
	text, warnings, err := tabula.Open("document.pdf").Text()
	if err != nil {
		log.Fatal(err) // Fatal error
	}
	_ = text

	for _, w := range warnings {
		log.Println("Warning:", w.Message) // Non-fatal issues
	}

	// Format all warnings
	formatted := tabula.FormatWarnings(warnings)
	_ = formatted
}

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

func FormatWarnings

func FormatWarnings(warnings []Warning) string

FormatWarnings returns a human-readable string of all warnings. It returns an empty string if there are no warnings.

func Must

func Must[T any](val T, err error) T

Must is a helper that wraps a call to a function returning (T, error) and panics if the error is non-nil. It is intended for use in scripts or tests where error handling would be cumbersome.

Example:

count := tabula.Must(tabula.Open("document.pdf").PageCount())

func MustText

func MustText[T any](val T, _ []Warning, err error) T

MustText is a helper that wraps a call to Text() or Fragments() and panics if the error is non-nil. It discards warnings and returns just the value. It is intended for use in scripts or tests where error handling would be cumbersome.

Example:

text := tabula.MustText(tabula.Open("document.pdf").Text())

Types

type ExtractOptions

type ExtractOptions struct {
	// contains filtered or unexported fields
}

ExtractOptions holds configuration for text extraction.

type Extractor

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor provides a fluent interface for extracting content from PDFs, DOCX, ODT, XLSX, PPTX, HTML, and EPUB files. Each configuration method returns a new Extractor instance, making it safe for concurrent use and allowing method chaining.
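
The copy-on-configure pattern described above can be sketched with a value-receiver builder: each option method mutates and returns a copy, leaving the shared base untouched (illustrative only; these are not the Extractor's actual fields):

```go
package main

import "fmt"

// pdfOptions is a hypothetical stand-in for the Extractor's configuration.
type pdfOptions struct {
	pages          []int
	excludeHeaders bool
}

// Value receivers mean each call works on a copy, so returning the
// modified copy leaves the original untouched.
func (o pdfOptions) Pages(p ...int) pdfOptions     { o.pages = p; return o }
func (o pdfOptions) ExcludeHeaders() pdfOptions    { o.excludeHeaders = true; return o }

func main() {
	base := pdfOptions{}
	a := base.Pages(1, 2)       // branch one: page selection
	b := base.ExcludeHeaders()  // branch two: header exclusion
	fmt.Println(len(base.pages), len(a.pages), b.excludeHeaders) // 0 2 true
}
```

Because configuration never mutates shared state, several goroutines can branch from one base extractor without synchronization.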

func FromHTMLReader added in v1.4.0

func FromHTMLReader(r io.Reader) *Extractor

FromHTMLReader creates an Extractor from an io.Reader containing HTML content. This is useful when you have HTML content that was fetched from a remote source (e.g., via HTTP) and want to extract text or convert it to markdown without saving it to a file first.

Example:

resp, err := http.Get("https://example.com/page")
if err != nil {
    // handle error
}
defer resp.Body.Close()
text, warnings, err := tabula.FromHTMLReader(resp.Body).Text()

func FromHTMLString added in v1.4.0

func FromHTMLString(html string) *Extractor

FromHTMLString creates an Extractor from a string containing HTML content. This is useful when you have HTML content as a string (e.g., fetched from a web API or embedded in your application) and want to extract text or convert it to markdown.

Example:

html := `<html><body><h1>Hello</h1><p>World</p></body></html>`
text, warnings, err := tabula.FromHTMLString(html).Text()

For web scraping:

resp, err := http.Get("https://example.com/page")
if err != nil {
    // handle error
}
body, _ := io.ReadAll(resp.Body)
resp.Body.Close()
text, _, _ := tabula.FromHTMLString(string(body)).Text()

func FromReader

func FromReader(r *reader.Reader) *Extractor

FromReader creates an Extractor from an already-opened PDF reader.Reader. This is useful when you need more control over the PDF reader lifecycle. Note: The caller is responsible for closing the reader. For DOCX and other non-PDF formats, use Open() instead, which handles format detection automatically.

Example:

r, err := reader.Open("document.pdf")
if err != nil {
    // handle error
}
defer r.Close()
text, warnings, err := tabula.FromReader(r).Text()

func Open

func Open(filename string) *Extractor

Open opens a document file and returns an Extractor for fluent configuration. The file format is automatically detected based on the file extension. The returned Extractor must be closed when done, either explicitly via Close() or implicitly when calling a terminal operation like Text().

Supported formats:

  • PDF (.pdf)
  • DOCX (.docx)
  • ODT (.odt)
  • XLSX (.xlsx)
  • PPTX (.pptx)
  • HTML (.html, .htm)
  • EPUB (.epub)

Example:

text, warnings, err := tabula.Open("document.pdf").Text()
text, warnings, err := tabula.Open("document.docx").Text()

func (*Extractor) Analyze

func (e *Extractor) Analyze() (*layout.AnalysisResult, error)

Analyze performs complete layout analysis and returns all detected elements. This is the most comprehensive extraction method, combining columns, lines, paragraphs, headings, lists, and reading order into a unified result. This is a terminal operation that closes the underlying reader.

Example:

result, err := tabula.Open("document.pdf").Pages(1).Analyze()
for _, elem := range result.Elements {
    fmt.Printf("[%s] %s\n", elem.Type, elem.Text)
}

func (*Extractor) Blocks

func (e *Extractor) Blocks() ([]layout.Block, error)

Blocks extracts and returns detected text blocks from the document. Blocks are spatially grouped regions of text, useful for understanding document layout structure. This is a terminal operation that closes the underlying reader.

Example:

blocks, err := tabula.Open("document.pdf").Blocks()
for _, block := range blocks {
    fmt.Printf("Block at (%.1f, %.1f): %s\n", block.BBox.X, block.BBox.Y, block.GetText())
}

func (*Extractor) ByColumn

func (e *Extractor) ByColumn() *Extractor

ByColumn configures the extractor to process text column by column in reading order, rather than line by line across the full page width. This is useful for multi-column documents like newspapers or academic papers.

Example:

text, _, err := tabula.Open("newspaper.pdf").ByColumn().Text()

func (*Extractor) Chunks

func (e *Extractor) Chunks() (*rag.ChunkCollection, []Warning, error)

Chunks extracts content and returns semantic chunks for RAG workflows. This method combines document extraction with RAG chunking in a single call. This is a terminal operation that closes the underlying reader.

Example:

chunks, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    Chunks()
if err != nil {
    log.Fatal(err)
}
for _, chunk := range chunks.Chunks {
    fmt.Printf("[%s] %s\n", chunk.Metadata.SectionTitle, chunk.Text[:50])
}

func (*Extractor) ChunksWithConfig

func (e *Extractor) ChunksWithConfig(config rag.ChunkerConfig, sizeConfig rag.SizeConfig) (*rag.ChunkCollection, []Warning, error)

ChunksWithConfig extracts content and returns semantic chunks using custom configuration. This allows fine-tuning of chunk sizes, overlap, and other parameters. This is a terminal operation that closes the underlying reader.

Example:

config := rag.ChunkerConfig{
    TargetChunkSize: 500,
    MaxChunkSize:    1000,
    OverlapSize:     50,
}
sizeConfig := rag.DefaultSizeConfig()
chunks, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    ChunksWithConfig(config, sizeConfig)

func (*Extractor) Close

func (e *Extractor) Close() error

Close releases resources associated with the Extractor. It is safe to call Close multiple times.
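Close has no example in this listing; the following is a minimal sketch of explicit cleanup for the case where no terminal operation closes the reader implicitly:

```go
ext := tabula.Open("document.pdf")
defer ext.Close() // safe even if a terminal operation has already closed the reader

count, err := ext.PageCount() // non-terminal: the reader stays open
if err != nil {
    log.Fatal(err)
}
fmt.Println("pages:", count)
```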

func (*Extractor) Document

func (e *Extractor) Document() (*model.Document, []Warning, error)

Document extracts content and returns a model.Document structure suitable for RAG chunking and other document processing workflows. This is a terminal operation that closes the underlying reader.

Example:

doc, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    Document()
doc, warnings, err := tabula.Open("document.docx").Document()
if err != nil {
    log.Fatal(err)
}
// Use doc for chunking or other processing

func (*Extractor) Elements

func (e *Extractor) Elements() ([]layout.LayoutElement, error)

Elements extracts and returns all detected elements in reading order. Elements include paragraphs, headings, and lists, unified into a single ordered list. This is useful for document reconstruction or RAG workflows. This is a terminal operation that closes the underlying reader.

Example:

elements, err := tabula.Open("document.pdf").Elements()
for _, elem := range elements {
    fmt.Printf("[%s] %s\n", elem.Type, elem.Text)
}

func (*Extractor) ExcludeFooters

func (e *Extractor) ExcludeFooters() *Extractor

ExcludeFooters configures the extractor to exclude detected footers.

Example:

text, _, err := tabula.Open("doc.pdf").ExcludeFooters().Text()

func (*Extractor) ExcludeHeaders

func (e *Extractor) ExcludeHeaders() *Extractor

ExcludeHeaders configures the extractor to exclude detected headers.

Example:

text, _, err := tabula.Open("doc.pdf").ExcludeHeaders().Text()

func (*Extractor) ExcludeHeadersAndFooters

func (e *Extractor) ExcludeHeadersAndFooters() *Extractor

ExcludeHeadersAndFooters configures the extractor to exclude both detected headers and footers. This is a convenience method equivalent to calling ExcludeHeaders().ExcludeFooters().

Example:

text, _, err := tabula.Open("doc.pdf").ExcludeHeadersAndFooters().Text()

func (*Extractor) Fragments

func (e *Extractor) Fragments() ([]text.TextFragment, []Warning, error)

Fragments extracts and returns text fragments with position information. This is a terminal operation that closes the underlying reader.

Returns the fragments, any warnings encountered during processing, and an error if extraction failed.

Example:

fragments, warnings, err := tabula.Open("document.pdf").Pages(1).Fragments()

func (*Extractor) Headings

func (e *Extractor) Headings() ([]layout.Heading, error)

Headings extracts and returns detected headings (H1-H6) from the document. This is a terminal operation that closes the underlying reader.

Example:

headings, err := tabula.Open("document.pdf").Headings()
for _, h := range headings {
    fmt.Printf("[%s] %s\n", h.Level, h.Text)
}

func (*Extractor) IsCharacterLevel

func (e *Extractor) IsCharacterLevel() (bool, error)

IsCharacterLevel checks if the first page of the PDF uses character-level text fragments (one character per fragment). This requires special handling for proper text extraction. Note: This reads page 1 to make the determination. The reader remains open.

Example:

ext := tabula.Open("document.pdf")
defer ext.Close()
isCharLevel, err := ext.IsCharacterLevel()

func (*Extractor) IsMultiColumn

func (e *Extractor) IsMultiColumn() (bool, error)

IsMultiColumn checks if the first page of the PDF appears to have a multi-column layout. Note: This reads page 1 to make the determination. The reader remains open.

Example:

ext := tabula.Open("newspaper.pdf")
defer ext.Close()
multiCol, err := ext.IsMultiColumn()

func (*Extractor) JoinParagraphs

func (e *Extractor) JoinParagraphs() *Extractor

JoinParagraphs configures the extractor to join lines within paragraphs using spaces instead of newlines. This produces cleaner text output where paragraph breaks are preserved but soft line breaks within paragraphs are removed.

Example:

text, _, err := tabula.Open("doc.pdf").JoinParagraphs().Text()
text, _, err := tabula.Open("doc.pdf").ExcludeHeadersAndFooters().JoinParagraphs().Text()

func (*Extractor) Lines

func (e *Extractor) Lines() ([]layout.Line, error)

Lines extracts and returns detected text lines with position and alignment info. This is a terminal operation that closes the underlying reader.

Example:

lines, err := tabula.Open("document.pdf").Lines()
for _, line := range lines {
    fmt.Printf("%s (align: %s)\n", line.Text, line.Alignment)
}

func (*Extractor) Lists

func (e *Extractor) Lists() ([]layout.List, error)

Lists extracts and returns detected lists (bulleted, numbered, etc.) from the document. This is a terminal operation that closes the underlying reader.

Example:

lists, err := tabula.Open("document.pdf").Lists()
for _, list := range lists {
    fmt.Printf("List type: %s, items: %d\n", list.Type, len(list.Items))
}

func (*Extractor) PageCount

func (e *Extractor) PageCount() (int, error)

PageCount returns the total number of pages in the document. For DOCX files, this returns 1 (the entire document is treated as a single page). Note: This does NOT close the reader, allowing further operations.

Example:

ext := tabula.Open("document.pdf")
defer ext.Close()
count, err := ext.PageCount()

func (*Extractor) PageRange

func (e *Extractor) PageRange(start, end int) *Extractor

PageRange specifies a range of pages to extract (1-indexed, inclusive).

Example:

text, _, err := tabula.Open("doc.pdf").PageRange(5, 10).Text()

func (*Extractor) Pages

func (e *Extractor) Pages(pages ...int) *Extractor

Pages specifies which pages to extract from (1-indexed). Multiple calls are cumulative.

Example:

text, _, err := tabula.Open("doc.pdf").Pages(1, 3, 5).Text()
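Because calls are cumulative, the same selection can also be built up incrementally; a sketch equivalent to the example above:

```go
// Each Pages call adds to the page set: selects pages 1, 3, and 5.
text, _, err := tabula.Open("doc.pdf").Pages(1).Pages(3, 5).Text()
```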

func (*Extractor) Paragraphs

func (e *Extractor) Paragraphs() ([]layout.Paragraph, error)

Paragraphs extracts and returns detected paragraphs with style information. This uses reading order detection to handle multi-column layouts correctly. This is a terminal operation that closes the underlying reader.

Example:

paragraphs, err := tabula.Open("document.pdf").
    ExcludeHeaders().
    ExcludeFooters().
    Paragraphs()
for _, para := range paragraphs {
    fmt.Printf("[%s] %s\n", para.Style, para.Text)
}

func (*Extractor) PreserveLayout

func (e *Extractor) PreserveLayout() *Extractor

PreserveLayout maintains spatial positioning by inserting spaces to approximate the visual layout of the original document.

Example:

text, _, err := tabula.Open("form.pdf").PreserveLayout().Text()

func (*Extractor) ReadingOrder

func (e *Extractor) ReadingOrder() (*layout.ReadingOrderResult, error)

ReadingOrder extracts and returns detailed reading order analysis. This includes column detection, section boundaries, and proper text ordering for multi-column documents. This is a terminal operation that closes the underlying reader.

Example:

ro, err := tabula.Open("newspaper.pdf").Pages(1).ReadingOrder()
fmt.Printf("Columns: %d\n", ro.ColumnCount)
for _, section := range ro.Sections {
    fmt.Printf("Section: %s\n", section.Type)
}

func (*Extractor) Text

func (e *Extractor) Text() (string, []Warning, error)

Text extracts and returns the text content from the configured pages. This is a terminal operation that closes the underlying reader.

Returns the extracted text, any warnings encountered during processing, and an error if extraction failed. Warnings indicate non-fatal issues (e.g., messy PDF detected) where extraction succeeded but results may be imperfect.

Example:

text, warnings, err := tabula.Open("document.pdf").Text()
text, warnings, err := tabula.Open("document.docx").Text()
if len(warnings) > 0 {
    log.Println("Warnings:", tabula.FormatWarnings(warnings))
}

func (*Extractor) ToMarkdown

func (e *Extractor) ToMarkdown() (string, []Warning, error)

ToMarkdown extracts content and returns it as a markdown-formatted string. This preserves document structure including headings, paragraphs, and lists. This is a terminal operation that closes the underlying reader.

Returns the markdown text, any warnings encountered during processing, and an error if extraction failed.

Example:

md, warnings, err := tabula.Open("document.pdf").
    ExcludeHeadersAndFooters().
    ToMarkdown()

func (*Extractor) ToMarkdownWithOptions

func (e *Extractor) ToMarkdownWithOptions(opts rag.MarkdownOptions) (string, []Warning, error)

ToMarkdownWithOptions extracts content and returns it as markdown with custom options. This is a terminal operation that closes the underlying reader.

Supported options for all formats:

  • IncludeMetadata: adds YAML front matter with document metadata
  • IncludeTableOfContents: generates a table of contents from headings
  • HeadingLevelOffset: adjusts heading levels (e.g., 1 makes H1 -> H2)
  • MaxHeadingLevel: caps heading depth (default: 6)

PDF-only options (used via RAG chunking):

  • IncludeChunkSeparators: adds horizontal rules between chunks
  • IncludePageNumbers: adds page references
  • IncludeChunkIDs: adds chunk IDs as HTML comments

Example:

opts := rag.MarkdownOptions{
    IncludeTableOfContents: true,
    IncludeMetadata:        true,
}
md, warnings, err := tabula.Open("document.pdf").ToMarkdownWithOptions(opts)
md, warnings, err := tabula.Open("document.docx").ToMarkdownWithOptions(opts)

type Warning

type Warning struct {
	Code    WarningCode
	Message string
}

Warning represents a non-fatal issue encountered during PDF processing. Unlike errors, warnings indicate that extraction succeeded but the results may be imperfect or require attention.

func (Warning) String

func (w Warning) String() string

String returns the warning message.

type WarningCode

type WarningCode int

WarningCode identifies the type of warning encountered during PDF processing.

const (
	// WarningMessyPDF indicates the PDF exhibits traits of being "messy" or
	// display-oriented (e.g., generated by Word, Quartz, or highly fragmented).
	// Text extraction may still succeed but results might have ordering issues.
	WarningMessyPDF WarningCode = iota

	// WarningOCRFallback indicates that OCR was used to extract text from
	// a scanned page that contained no native PDF text. This typically means
	// the page contains only images (e.g., a scanned document).
	WarningOCRFallback
)
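Using the constants above, warnings can be dispatched by code after extraction; a sketch (WarningOCRFallback can only occur in builds with the ocr tag):

```go
text, warnings, err := tabula.Open("scan.pdf").Text()
if err != nil {
    log.Fatal(err)
}
for _, w := range warnings {
    switch w.Code {
    case tabula.WarningMessyPDF:
        log.Println("messy PDF detected:", w.Message)
    case tabula.WarningOCRFallback:
        log.Println("OCR fallback used:", w.Message)
    }
}
_ = text
```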

Directories

Path              Synopsis
contentstream     Package contentstream provides parsing of PDF content streams.
core              Package core provides low-level PDF parsing primitives and object types.
docx              Package docx provides DOCX (Office Open XML) document parsing.
epubdoc           Package epubdoc provides EPUB document parsing.
font              Package font provides PDF font handling including Type1, TrueType, and CID fonts.
format            Package format provides file format detection for the tabula library.
graphicsstate     Package graphicsstate provides PDF graphics state management.
htmldoc           Package htmldoc provides HTML document parsing.
internal/filters  Package filters provides PDF stream decompression filters.
layout            Package layout provides document layout analysis for extracting semantic structure from PDF pages.
model             Package model provides the intermediate representation (IR) for extracted document content.
ocr               Package ocr provides OCR (Optical Character Recognition) capabilities for extracting text from images in scanned PDFs.
odt               Package odt provides ODT (OpenDocument Text) document parsing.
pages             Package pages provides PDF page tree traversal and page access.
pptx              Package pptx provides PPTX (Office Open XML Presentation) document parsing.
rag               Package rag provides semantic chunking for RAG (Retrieval-Augmented Generation) workflows.
reader            Package reader provides high-level PDF file reading and object resolution.
resolver          Package resolver provides PDF indirect reference resolution.
tables            Package tables provides table detection and extraction from PDF pages.
text              Package text provides text extraction from PDF content streams.
xlsx              Package xlsx provides XLSX (Office Open XML Spreadsheet) document parsing.
