reader package
v1.6.6
Published: Feb 4, 2026 | License: MIT | Imports: 13 | Imported by: 0

Documentation

Overview

Package reader provides high-level PDF file reading and object resolution.

This package orchestrates the lower-level core package to provide a convenient API for reading PDF files and extracting content.

Opening PDF Files

Use Open to open a PDF file for reading:

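// The variable is named r so that it does not shadow the reader package.
r, err := reader.Open("document.pdf")
if err != nil {
    log.Fatal(err)
}
defer r.Close() // release the underlying file when done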

Or use NewReader with an existing *os.File.

Document Information

The Reader provides access to document structure:

  • Version() - PDF version (e.g., 1.7)
  • PageCount() - number of pages
  • GetCatalog() - document catalog dictionary
  • GetInfo() - document info dictionary (metadata)
  • Trailer() - trailer dictionary

Page Access

Access pages by index (0-based):

page, err := r.GetPage(0) // first page; r is a *Reader from Open or NewReader
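Because GetPage takes a 0-based index while people usually refer to pages by 1-based number, a small conversion step can help at the boundary of a program. The following is a minimal sketch, not part of this package: pageIndex is a hypothetical helper, and count is assumed to be the value returned by PageCount.

```go
package main

import "fmt"

// pageIndex converts a 1-based page number (as a person would type it)
// into the 0-based index that GetPage expects. count is assumed to be
// the value returned by PageCount. This helper is hypothetical and is
// not part of the reader package.
func pageIndex(pageNumber, count int) (int, error) {
	if pageNumber < 1 || pageNumber > count {
		return 0, fmt.Errorf("page %d out of range 1..%d", pageNumber, count)
	}
	return pageNumber - 1, nil
}

func main() {
	idx, err := pageIndex(1, 12)
	fmt.Println(idx, err) // prints: 0 <nil>
}
```

The returned index can then be passed to GetPage; an out-of-range request is rejected before any object is loaded.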

Object Resolution

The Reader resolves indirect object references:

  • GetObject(objNum) - load object by number
  • ResolveReference(ref) - resolve an IndirectRef
  • Resolve(obj) - resolve if indirect, otherwise return as-is
  • ResolveDeep(obj) - recursively resolve all references
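The difference between resolving one reference and resolving recursively can be illustrated with a toy object model. The sketch below does not use the core package: ref, dict, and store are stand-in types invented for the example, and the depth budget is an assumption about how a caller might guard against reference cycles, not a description of how ResolveDeep is implemented.

```go
package main

import (
	"errors"
	"fmt"
)

// Toy stand-ins for PDF objects. These are NOT the core package's types.
type ref int             // an indirect reference to object number N
type dict map[string]any // a dictionary object

// store maps object numbers to objects, playing the role of the file.
type store map[int]any

var errTooDeep = errors.New("reference nesting too deep")

// resolve follows a single reference; anything else is returned as-is.
func (s store) resolve(obj any) (any, error) {
	r, ok := obj.(ref)
	if !ok {
		return obj, nil
	}
	target, found := s[int(r)]
	if !found {
		return nil, fmt.Errorf("object %d not found", int(r))
	}
	return target, nil
}

// resolveDeep resolves obj and every value nested inside dictionaries.
// The depth budget stops reference cycles from recursing forever.
func (s store) resolveDeep(obj any, depth int) (any, error) {
	if depth <= 0 {
		return nil, errTooDeep
	}
	switch v := obj.(type) {
	case ref:
		target, err := s.resolve(v)
		if err != nil {
			return nil, err
		}
		return s.resolveDeep(target, depth-1)
	case dict:
		out := make(dict, len(v))
		for key, val := range v {
			resolved, err := s.resolveDeep(val, depth-1)
			if err != nil {
				return nil, err
			}
			out[key] = resolved
		}
		return out, nil
	default:
		return obj, nil
	}
}

func main() {
	s := store{
		1: dict{"Type": "Catalog", "Pages": ref(2)},
		2: dict{"Type": "Pages", "Count": 3},
	}
	shallow, err := s.resolve(ref(1))
	if err != nil {
		fmt.Println("resolve:", err)
		return
	}
	_, stillRef := shallow.(dict)["Pages"].(ref)
	fmt.Println("Pages is still a reference:", stillRef)

	deep, err := s.resolveDeep(ref(1), 16)
	if err != nil {
		fmt.Println("resolveDeep:", err)
		return
	}
	fmt.Println("Count:", deep.(dict)["Pages"].(dict)["Count"])
}
```

In the toy model, the shallow result still contains a reference under "Pages", while the deep result has the referenced dictionary inlined. Real PDF object graphs often contain cycles (for example, a page pointing back to its parent), so recursive resolution is best applied to subtrees known to be acyclic.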

Text Extraction

Convenience methods for text extraction:

  • ExtractText(page) - extract text as a string
  • ExtractTextFragments(page) - extract positioned text fragments
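Combining PageCount, GetPage, and ExtractText gives whole-document text extraction. The sketch below depends only on those three documented method signatures through a local generic interface, so it compiles without importing this package; the type parameter P stands in for *pages.Page. The names textSource, extractAll, and fakeDoc are hypothetical, and joining pages with a form feed is just one possible convention.

```go
package main

import (
	"fmt"
	"strings"
)

// textSource lists the documented Reader methods this sketch relies on.
// P stands in for *pages.Page so the example needs no external import.
type textSource[P any] interface {
	PageCount() (int, error)
	GetPage(index int) (P, error)
	ExtractText(page P) (string, error)
}

// extractAll returns the text of every page, separated by form feeds.
func extractAll[P any](src textSource[P]) (string, error) {
	n, err := src.PageCount()
	if err != nil {
		return "", fmt.Errorf("page count: %w", err)
	}
	var parts []string
	for i := 0; i < n; i++ {
		page, err := src.GetPage(i)
		if err != nil {
			return "", fmt.Errorf("get page %d: %w", i, err)
		}
		txt, err := src.ExtractText(page)
		if err != nil {
			return "", fmt.Errorf("extract page %d: %w", i, err)
		}
		parts = append(parts, txt)
	}
	return strings.Join(parts, "\f"), nil
}

// fakeDoc is a test double: each string is one page's text, and a
// "page" is simply its index.
type fakeDoc []string

func (d fakeDoc) PageCount() (int, error) { return len(d), nil }

func (d fakeDoc) GetPage(i int) (int, error) {
	if i < 0 || i >= len(d) {
		return 0, fmt.Errorf("no page %d", i)
	}
	return i, nil
}

func (d fakeDoc) ExtractText(page int) (string, error) { return d[page], nil }

func main() {
	out, err := extractAll[int](fakeDoc{"first page", "second page"})
	fmt.Printf("%q %v\n", out, err)
}
```

With the real package, the call would presumably look like extractAll[*pages.Page](r) for a *Reader r, since the documented signatures match the interface.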

Object Caching

The Reader caches loaded objects for efficiency. Use ClearCache() to free memory when processing large PDFs.
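One way to bound memory during a long run is to check CacheSize periodically and call ClearCache once it passes a threshold. The sketch below is an illustration under assumptions: objectCache mirrors the two documented method signatures, while trimCache, fakeCache, and the limit of 1000 objects are hypothetical choices made for the example.

```go
package main

import "fmt"

// objectCache lists the two documented cache methods used below.
type objectCache interface {
	CacheSize() int
	ClearCache()
}

// trimCache clears the cache once it holds more than limit objects and
// reports whether it did so. The limit is an application-level choice.
func trimCache(c objectCache, limit int) bool {
	if c.CacheSize() <= limit {
		return false
	}
	c.ClearCache()
	return true
}

// fakeCache is a stand-in that only counts objects.
type fakeCache struct{ n int }

func (f *fakeCache) CacheSize() int { return f.n }
func (f *fakeCache) ClearCache()    { f.n = 0 }

func main() {
	c := &fakeCache{}
	for page := 0; page < 10; page++ {
		c.n += 300 // pretend each page loads 300 objects
		if trimCache(c, 1000) {
			fmt.Println("cleared after page", page)
		}
	}
}
```

Clearing trades memory for repeated reads: objects shared across pages, such as fonts, will have to be loaded again after each clear.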

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type PDFVersion

type PDFVersion struct {
	Major int
	Minor int
}

PDFVersion represents a PDF version as separate major and minor numbers.

func (PDFVersion) String

func (v PDFVersion) String() string

String returns the version as a string (e.g., "1.7").
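As an illustration of the format described above, the sketch below mirrors the documented struct locally and formats it as "Major.Minor". The String body is an assumption consistent with the "1.7" example, not the package's source, and atLeast is a hypothetical helper for feature checks such as "object streams need PDF 1.5 or later".

```go
package main

import "fmt"

// PDFVersion mirrors the documented struct so the sketch is self-contained.
type PDFVersion struct {
	Major int
	Minor int
}

// String formats the version as "Major.Minor". This body is an assumption
// consistent with the documented example "1.7".
func (v PDFVersion) String() string {
	return fmt.Sprintf("%d.%d", v.Major, v.Minor)
}

// atLeast reports whether v is the given version or newer.
// It is a hypothetical helper, not part of the package.
func atLeast(v PDFVersion, major, minor int) bool {
	if v.Major != major {
		return v.Major > major
	}
	return v.Minor >= minor
}

func main() {
	v := PDFVersion{Major: 1, Minor: 7}
	fmt.Println(v, atLeast(v, 1, 5)) // prints: 1.7 true
}
```

Comparing the numeric fields avoids the pitfalls of comparing version strings, where "1.10" would sort before "1.7".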

type PageImage added in v1.6.0

type PageImage struct {
	Name             string // XObject name (e.g., "Im1")
	Width            int
	Height           int
	ColorSpace       string // DeviceGray, DeviceRGB, DeviceCMYK, etc.
	BitsPerComponent int
	Data             []byte // Decoded pixel data
	Filter           string // Original filter (for format detection)
}

PageImage represents an extracted image from a PDF page.

func (*PageImage) ToPNG added in v1.6.0

func (img *PageImage) ToPNG() ([]byte, error)

ToPNG converts the decoded pixel data to PNG format. This is suitable for use with OCR engines like Tesseract.
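To show what such a conversion involves, here is a minimal sketch built on the standard image and image/png packages. It is not this package's implementation: pageImage mirrors only some of the documented fields, encodePNG is a hypothetical function, and it assumes 8 bits per component with row-major pixel data and no padding between rows. Only DeviceGray and DeviceRGB are handled.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/png"
)

// pageImage mirrors the documented PageImage fields used by this sketch.
type pageImage struct {
	Width, Height    int
	ColorSpace       string
	BitsPerComponent int
	Data             []byte
}

// encodePNG handles only 8-bit DeviceGray and DeviceRGB, and assumes Data
// is row-major with no padding between rows (an assumption of the sketch).
func encodePNG(img pageImage) ([]byte, error) {
	if img.BitsPerComponent != 8 {
		return nil, fmt.Errorf("unsupported bits per component: %d", img.BitsPerComponent)
	}
	if img.Width <= 0 || img.Height <= 0 {
		return nil, fmt.Errorf("invalid size %dx%d", img.Width, img.Height)
	}
	pixels := img.Width * img.Height
	bounds := image.Rect(0, 0, img.Width, img.Height)

	var out image.Image
	switch img.ColorSpace {
	case "DeviceGray":
		if len(img.Data) < pixels {
			return nil, fmt.Errorf("need %d bytes, have %d", pixels, len(img.Data))
		}
		gray := image.NewGray(bounds)
		copy(gray.Pix, img.Data[:pixels]) // one byte per pixel
		out = gray
	case "DeviceRGB":
		if len(img.Data) < pixels*3 {
			return nil, fmt.Errorf("need %d bytes, have %d", pixels*3, len(img.Data))
		}
		rgba := image.NewRGBA(bounds)
		for i := 0; i < pixels; i++ {
			rgba.Pix[4*i+0] = img.Data[3*i+0] // red
			rgba.Pix[4*i+1] = img.Data[3*i+1] // green
			rgba.Pix[4*i+2] = img.Data[3*i+2] // blue
			rgba.Pix[4*i+3] = 0xFF            // fully opaque
		}
		out = rgba
	default:
		return nil, fmt.Errorf("unsupported color space %q", img.ColorSpace)
	}

	var buf bytes.Buffer
	if err := png.Encode(&buf, out); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := encodePNG(pageImage{
		Width: 2, Height: 1,
		ColorSpace:       "DeviceGray",
		BitsPerComponent: 8,
		Data:             []byte{0x00, 0xFF}, // one black and one white pixel
	})
	fmt.Println(len(data) > 8, err)
}
```

Other color spaces (DeviceCMYK, indexed palettes) and bit depths below 8 would need an explicit conversion step before encoding.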

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader represents a PDF file reader.

func NewReader

func NewReader(file *os.File) (*Reader, error)

NewReader creates a new PDF reader for the given file.

func Open

func Open(filename string) (*Reader, error)

Open opens the PDF file at filename and returns a Reader.

func (*Reader) CacheSize

func (r *Reader) CacheSize() int

CacheSize returns the number of cached objects.

func (*Reader) ClearCache

func (r *Reader) ClearCache()

ClearCache clears the object cache and the object stream cache. This is useful for freeing memory when processing large PDFs.

func (*Reader) Close

func (r *Reader) Close() error

Close closes the PDF file.

func (*Reader) ExtractPageImages added in v1.6.0

func (r *Reader) ExtractPageImages(page *pages.Page) ([]PageImage, error)

ExtractPageImages extracts all image XObjects from a page. It returns a slice of PageImage containing decoded image data.

func (*Reader) ExtractText

func (r *Reader) ExtractText(page *pages.Page) (string, error)

ExtractText extracts the text from a page and returns it as a string. This is a convenience method for simple text extraction.

func (*Reader) ExtractTextFragments

func (r *Reader) ExtractTextFragments(page *pages.Page) ([]text.TextFragment, error)

ExtractTextFragments extracts text fragments from a page. This is a convenience method that handles content stream decoding and font registration.

func (*Reader) FileSize

func (r *Reader) FileSize() int64

FileSize returns the size of the PDF file in bytes.

func (*Reader) GetCatalog

func (r *Reader) GetCatalog() (core.Dict, error)

GetCatalog returns the document catalog (root object).

func (*Reader) GetInfo

func (r *Reader) GetInfo() (core.Dict, error)

GetInfo returns the document info dictionary (metadata).

func (*Reader) GetObject

func (r *Reader) GetObject(objNum int) (core.Object, error)

GetObject loads an object by its object number. It uses caching to avoid re-reading objects, and it supports both uncompressed objects and objects stored in object streams (PDF 1.5+).

func (*Reader) GetPage

func (r *Reader) GetPage(index int) (*pages.Page, error)

GetPage returns the page at the given index (0-based).

func (*Reader) NumObjects

func (r *Reader) NumObjects() int

NumObjects returns the total number of objects in the PDF.

func (*Reader) ObjectStreamCacheSize

func (r *Reader) ObjectStreamCacheSize() int

ObjectStreamCacheSize returns the number of cached object streams.

func (*Reader) PageCount

func (r *Reader) PageCount() (int, error)

PageCount returns the number of pages in the PDF.

func (*Reader) Resolve

func (r *Reader) Resolve(obj core.Object) (core.Object, error)

Resolve resolves an object if it is an indirect reference; otherwise it returns the object as-is. It implements the pages.ObjectResolver interface.

func (*Reader) ResolveDeep

func (r *Reader) ResolveDeep(obj core.Object) (core.Object, error)

ResolveDeep recursively resolves all indirect references in an object. It implements the pages.ObjectResolver interface.

func (*Reader) ResolveReference

func (r *Reader) ResolveReference(ref core.IndirectRef) (core.Object, error)

ResolveReference resolves an indirect reference.

func (*Reader) Trailer

func (r *Reader) Trailer() core.Dict

Trailer returns the trailer dictionary.

func (*Reader) Version

func (r *Reader) Version() PDFVersion

Version returns the PDF version.

func (*Reader) XRefTable

func (r *Reader) XRefTable() *core.XRefTable

XRefTable returns the cross-reference table. It is exposed for debugging and inspection.
