Documentation
Overview
Package reader provides high-level PDF file reading and object resolution.
This package orchestrates the lower-level core package to provide a convenient API for reading PDF files and extracting content.
Opening PDF Files
Use Open to open a PDF file for reading:
reader, err := reader.Open("document.pdf")
if err != nil {
	log.Fatal(err)
}
defer reader.Close()
Or use NewReader with an existing *os.File.
Document Information
The Reader provides access to document structure:
- Version() - PDF version (e.g., 1.7)
- PageCount() - number of pages
- GetCatalog() - document catalog dictionary
- GetInfo() - document info dictionary (metadata)
- Trailer() - trailer dictionary
Page Access
Access pages by index (0-based):
page, err := reader.GetPage(0) // First page
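Because GetPage takes a 0-based index while PDF viewers show 1-based page numbers, an off-by-one is easy to introduce. The following is a minimal sketch of a bounds-checked conversion. The helper `pageByNumber`, the interface `pageSource`, and the in-memory `fakeDoc` are hypothetical names introduced here for illustration; only the `PageCount` and `GetPage` method shapes come from the index, with the type parameter `P` standing in for `*pages.Page` so the sketch compiles on its own.

```go
package main

import "fmt"

// pageSource mirrors two Reader methods listed in the index:
// PageCount() (int, error) and GetPage(index int) (*pages.Page, error).
// P stands in for *pages.Page (an assumption made for self-containment).
type pageSource[P any] interface {
	PageCount() (int, error)
	GetPage(index int) (P, error)
}

// pageByNumber is a hypothetical helper, not part of the package.
// It maps a 1-based page number to the 0-based index GetPage expects.
func pageByNumber[P any](src pageSource[P], number int) (P, error) {
	var zero P
	count, err := src.PageCount()
	if err != nil {
		return zero, fmt.Errorf("counting pages: %w", err)
	}
	if number < 1 || number > count {
		return zero, fmt.Errorf("page %d out of range 1..%d", number, count)
	}
	return src.GetPage(number - 1)
}

// fakeDoc is an in-memory stand-in used only to exercise the helper.
type fakeDoc []string

func (d fakeDoc) PageCount() (int, error) { return len(d), nil }

func (d fakeDoc) GetPage(i int) (string, error) {
	if i < 0 || i >= len(d) {
		return "", fmt.Errorf("index %d out of range", i)
	}
	return d[i], nil
}

func main() {
	doc := fakeDoc{"cover", "contents", "chapter 1"}
	first, _ := pageByNumber[string](doc, 1)
	last, _ := pageByNumber[string](doc, 3)
	_, err := pageByNumber[string](doc, 4)
	fmt.Println(first, "|", last, "|", err)
}
```

With a real Reader, the same helper would presumably be instantiated with `*pages.Page` as the type argument.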
Object Resolution
The Reader resolves indirect object references:
- GetObject(objNum) - load object by number
- ResolveReference(ref) - resolve an IndirectRef
- Resolve(obj) - resolve if indirect, otherwise return as-is
- ResolveDeep(obj) - recursively resolve all references
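The difference between resolving one level and resolving recursively can be illustrated with a toy object model. This is a sketch under stated assumptions, not the package's implementation: the types `object`, `ref`, `dict`, `array`, and `store` are invented for the example and are unrelated to the core package's types, and the depth limit is one possible guard against reference cycles.

```go
package main

import "fmt"

// Toy object model (hypothetical; NOT the core package's types).
type object any
type ref int // stands in for an indirect reference
type dict map[string]object
type array []object

// store maps object numbers to objects, as a parsed file might.
type store map[ref]object

// resolve follows a chain of references until it reaches a direct
// object. A non-reference is returned unchanged.
func (s store) resolve(o object) (object, error) {
	for hops := 0; ; hops++ {
		r, ok := o.(ref)
		if !ok {
			return o, nil
		}
		if hops > len(s) {
			return nil, fmt.Errorf("reference cycle at object %d", r)
		}
		next, found := s[r]
		if !found {
			return nil, fmt.Errorf("object %d not found", r)
		}
		o = next
	}
}

// resolveDeep resolves o and then every reference nested inside
// dictionaries and arrays, giving up once depth is exhausted.
func (s store) resolveDeep(o object, depth int) (object, error) {
	if depth < 0 {
		return nil, fmt.Errorf("maximum depth exceeded")
	}
	o, err := s.resolve(o)
	if err != nil {
		return nil, err
	}
	switch v := o.(type) {
	case dict:
		out := make(dict, len(v))
		for k, child := range v {
			if out[k], err = s.resolveDeep(child, depth-1); err != nil {
				return nil, err
			}
		}
		return out, nil
	case array:
		out := make(array, len(v))
		for i, child := range v {
			if out[i], err = s.resolveDeep(child, depth-1); err != nil {
				return nil, err
			}
		}
		return out, nil
	default:
		return v, nil
	}
}

func main() {
	s := store{
		1: dict{"Kids": array{ref(2)}, "Count": 1},
		2: dict{"Type": "Page", "Contents": ref(3)},
		3: "stream data",
	}
	shallow, _ := s.resolve(ref(1))
	deep, _ := s.resolveDeep(ref(1), 10)
	// After a shallow resolve, nested entries are still references.
	fmt.Println(shallow.(dict)["Kids"].(array)[0])
	// After a deep resolve, nested references have been replaced.
	fmt.Println(deep.(dict)["Kids"].(array)[0].(dict)["Contents"])
}
```

Real PDF object graphs contain back-references (for example, a page pointing to its parent), so any recursive resolver needs some cycle guard; how the Reader handles this is not stated in the documentation above.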
Text Extraction
Convenience methods for text extraction:
- ExtractText(page) - extract text as a string
- ExtractTextFragments(page) - extract positioned text fragments
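A common use of ExtractText is to concatenate the text of every page. The sketch below shows one way to do that. The names `textSource`, `extractAll`, and `fakeDoc` are hypothetical; the interface copies the `PageCount`, `GetPage`, and `ExtractText` method shapes from the index, with the type parameter `P` standing in for `*pages.Page` so the example runs without the package. Joining pages with a newline is an arbitrary choice for the illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// textSource mirrors three Reader methods from the index.
// P stands in for *pages.Page (an assumption for self-containment).
type textSource[P any] interface {
	PageCount() (int, error)
	GetPage(index int) (P, error)
	ExtractText(page P) (string, error)
}

// extractAll is a hypothetical helper that extracts text from every
// page in order and joins the results with newlines.
func extractAll[P any](src textSource[P]) (string, error) {
	count, err := src.PageCount()
	if err != nil {
		return "", fmt.Errorf("counting pages: %w", err)
	}
	var sb strings.Builder
	for i := 0; i < count; i++ {
		page, err := src.GetPage(i)
		if err != nil {
			return "", fmt.Errorf("loading page index %d: %w", i, err)
		}
		txt, err := src.ExtractText(page)
		if err != nil {
			return "", fmt.Errorf("extracting page index %d: %w", i, err)
		}
		if i > 0 {
			sb.WriteString("\n")
		}
		sb.WriteString(txt)
	}
	return sb.String(), nil
}

// fakeDoc is an in-memory stand-in; its "pages" are just indices.
type fakeDoc []string

func (d fakeDoc) PageCount() (int, error)           { return len(d), nil }
func (d fakeDoc) GetPage(i int) (int, error)        { return i, nil }
func (d fakeDoc) ExtractText(p int) (string, error) { return d[p], nil }

func main() {
	text, err := extractAll[int](fakeDoc{"first page", "second page"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", text)
}
```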
Object Caching
The Reader caches loaded objects for efficiency. Use ClearCache() to free memory when processing large PDFs.
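One way to keep memory bounded in a long-running loop is to clear the cache whenever it grows past a threshold. The sketch below assumes that `CacheSize()` reports the number of cached objects, which the documentation does not state explicitly. `objectCache`, `trimCache`, and `fakeCache` are hypothetical names; only the `CacheSize` and `ClearCache` method shapes come from the index.

```go
package main

import "fmt"

// objectCache lists the two cache-related Reader methods from the index.
type objectCache interface {
	CacheSize() int
	ClearCache()
}

// trimCache is a hypothetical helper: it clears the cache when its
// size exceeds limit and reports whether it did so.
func trimCache(c objectCache, limit int) bool {
	if c.CacheSize() <= limit {
		return false
	}
	c.ClearCache()
	return true
}

// fakeCache is a counter-based stand-in used only for the demo.
type fakeCache struct{ n int }

func (f *fakeCache) CacheSize() int { return f.n }
func (f *fakeCache) ClearCache()    { f.n = 0 }

func main() {
	c := &fakeCache{}
	cleared := 0
	for page := 0; page < 10; page++ {
		c.n += 300 // pretend each page loads 300 objects
		if trimCache(c, 1000) {
			cleared++
		}
	}
	fmt.Println(cleared, c.CacheSize()) // prints "2 600"
}
```

Clearing trades memory for repeated parsing, so a suitable threshold depends on the document and is best chosen by measurement.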
Index
- type PDFVersion
- type PageImage
- type Reader
- func (r *Reader) CacheSize() int
- func (r *Reader) ClearCache()
- func (r *Reader) Close() error
- func (r *Reader) ExtractPageImages(page *pages.Page) ([]PageImage, error)
- func (r *Reader) ExtractText(page *pages.Page) (string, error)
- func (r *Reader) ExtractTextFragments(page *pages.Page) ([]text.TextFragment, error)
- func (r *Reader) FileSize() int64
- func (r *Reader) GetCatalog() (core.Dict, error)
- func (r *Reader) GetInfo() (core.Dict, error)
- func (r *Reader) GetObject(objNum int) (core.Object, error)
- func (r *Reader) GetPage(index int) (*pages.Page, error)
- func (r *Reader) NumObjects() int
- func (r *Reader) ObjectStreamCacheSize() int
- func (r *Reader) PageCount() (int, error)
- func (r *Reader) Resolve(obj core.Object) (core.Object, error)
- func (r *Reader) ResolveDeep(obj core.Object) (core.Object, error)
- func (r *Reader) ResolveReference(ref core.IndirectRef) (core.Object, error)
- func (r *Reader) Trailer() core.Dict
- func (r *Reader) Version() PDFVersion
- func (r *Reader) XRefTable() *core.XRefTable
Types
type PDFVersion
PDFVersion represents a PDF version.
func (PDFVersion) String
func (v PDFVersion) String() string
String returns the version as a string (e.g., "1.7").
type PageImage (added in v1.6.0)
type PageImage struct {
	Name             string // XObject name (e.g., "Im1")
	Width            int
	Height           int
	ColorSpace       string // DeviceGray, DeviceRGB, DeviceCMYK, etc.
	BitsPerComponent int
	Data             []byte // Decoded pixel data
	Filter           string // Original filter (for format detection)
}
PageImage represents an extracted image from a PDF page.
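As a sanity check on an extracted image, the expected byte length of raw sample data can be computed from the struct fields. The sketch below assumes that Data holds unencoded samples laid out as in PDF image streams, where each row is padded to a whole number of bytes; this may not hold for images whose Filter indicates an encoded format such as JPEG. The struct is copied locally so the example is self-contained, and `components` and `expectedSize` are hypothetical helpers covering only the three device color spaces named in the field comment.

```go
package main

import "fmt"

// PageImage is a local copy of the documented struct so that the
// sketch compiles without the package.
type PageImage struct {
	Name             string
	Width            int
	Height           int
	ColorSpace       string
	BitsPerComponent int
	Data             []byte
	Filter           string
}

// components returns the number of color components per pixel for
// the device color spaces; other color spaces are not handled here.
func components(colorSpace string) (int, bool) {
	switch colorSpace {
	case "DeviceGray":
		return 1, true
	case "DeviceRGB":
		return 3, true
	case "DeviceCMYK":
		return 4, true
	}
	return 0, false
}

// expectedSize computes the byte length of raw sample data, assuming
// each row is padded to a byte boundary.
func expectedSize(img PageImage) (int, error) {
	n, ok := components(img.ColorSpace)
	if !ok {
		return 0, fmt.Errorf("unsupported color space %q", img.ColorSpace)
	}
	bitsPerRow := img.Width * n * img.BitsPerComponent
	bytesPerRow := (bitsPerRow + 7) / 8 // round up to whole bytes
	return bytesPerRow * img.Height, nil
}

func main() {
	rgb := PageImage{Width: 10, Height: 4, ColorSpace: "DeviceRGB", BitsPerComponent: 8}
	mask := PageImage{Width: 10, Height: 4, ColorSpace: "DeviceGray", BitsPerComponent: 1}
	a, _ := expectedSize(rgb)
	b, _ := expectedSize(mask)
	fmt.Println(a, b) // prints "120 8"
}
```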
type Reader
type Reader struct {
	// contains filtered or unexported fields
}
Reader represents a PDF file reader.
func (*Reader) ClearCache
func (r *Reader) ClearCache()
ClearCache clears the object cache and the object stream cache. It is useful for freeing memory when processing large PDFs.
func (*Reader) ExtractPageImages (added in v1.6.0)
func (r *Reader) ExtractPageImages(page *pages.Page) ([]PageImage, error)
ExtractPageImages extracts all image XObjects from a page. It returns a slice of PageImage containing decoded image data.
func (*Reader) ExtractText
func (r *Reader) ExtractText(page *pages.Page) (string, error)
ExtractText extracts text from a page and returns it as a string. This is a convenience method for simple text extraction.
func (*Reader) ExtractTextFragments
func (r *Reader) ExtractTextFragments(page *pages.Page) ([]text.TextFragment, error)
ExtractTextFragments extracts text fragments from a page. This is a convenience method that handles content stream decoding and font registration.
func (*Reader) GetCatalog
func (r *Reader) GetCatalog() (core.Dict, error)
GetCatalog returns the document catalog (root object).
func (*Reader) GetObject
func (r *Reader) GetObject(objNum int) (core.Object, error)
GetObject loads an object by its number. It uses caching to avoid re-reading objects and supports both uncompressed objects and objects stored in object streams (PDF 1.5+).
func (*Reader) NumObjects
func (r *Reader) NumObjects() int
NumObjects returns the total number of objects in the PDF.
func (*Reader) ObjectStreamCacheSize
func (r *Reader) ObjectStreamCacheSize() int
ObjectStreamCacheSize returns the number of cached object streams.
func (*Reader) Resolve
func (r *Reader) Resolve(obj core.Object) (core.Object, error)
Resolve resolves an object if it is an indirect reference; otherwise it returns the object as-is. It implements the pages.ObjectResolver interface.
func (*Reader) ResolveDeep
func (r *Reader) ResolveDeep(obj core.Object) (core.Object, error)
ResolveDeep recursively resolves all indirect references in an object. It implements the pages.ObjectResolver interface.
func (*Reader) ResolveReference
func (r *Reader) ResolveReference(ref core.IndirectRef) (core.Object, error)
ResolveReference resolves an indirect reference.