charset

package
v0.7.0
Published: Jan 13, 2026 License: MIT Imports: 9 Imported by: 0

Documentation

Overview

Package charset provides character encoding utilities for GEDCOM files.

This package handles UTF-8 validation and Byte Order Mark (BOM) removal when parsing GEDCOM files. It checks that GEDCOM data is properly encoded and reports encoding errors with their line and column positions.

Constants

This section is empty.

Variables

This section is empty.

Functions

func IsCombiningDiacritical added in v0.5.0

func IsCombiningDiacritical(b byte) bool

IsCombiningDiacritical reports whether the given byte falls in the ANSEL combining diacritical range (0xE0-0xFE). In ANSEL encoding, these marks precede and modify the character that follows them.

Note: Not all bytes in the 0xE0-0xFE range are defined in ANSEL (0xFC and 0xFD are undefined), but this function returns true for the entire range to allow the decoder to handle undefined codes gracefully.
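The range check described above is small enough to sketch directly. The function below is a local stand-in (the name isCombiningDiacritical is an assumption, not the package's source) that mirrors the documented behavior, including returning true for the undefined bytes 0xFC and 0xFD.

```go
package main

import "fmt"

// isCombiningDiacritical is a hypothetical stand-in for the documented
// function: it reports true for every byte in the 0xE0-0xFE range,
// whether or not ANSEL defines that code.
func isCombiningDiacritical(b byte) bool {
	return b >= 0xE0 && b <= 0xFE
}

func main() {
	// 'A' is a base character, 0xE2 is a defined mark, 0xFC is undefined
	// but still inside the range, and 0xFF is outside it.
	for _, b := range []byte{'A', 0xE2, 0xFC, 0xFF} {
		fmt.Printf("0x%02X combining=%v\n", b, isCombiningDiacritical(b))
	}
}
```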

func NewReader

func NewReader(r io.Reader) io.Reader

NewReader wraps an io.Reader to provide encoding detection and UTF-8 validation. It first checks for a BOM (Byte Order Mark), then looks for a CHAR tag in the GEDCOM header to determine the encoding. The input is converted to UTF-8 and validated.

Supported encodings:

  • UTF-16 LE (BOM: 0xFF 0xFE) - Converted to UTF-8
  • UTF-16 BE (BOM: 0xFE 0xFF) - Converted to UTF-8
  • UTF-8 (BOM: 0xEF 0xBB 0xBF) - BOM removed, validated
  • ANSEL (CHAR tag: ANSEL) - Converted to UTF-8, validated
  • No BOM or CHAR tag - Assumed UTF-8, validated
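To illustrate what "Converted to UTF-8" involves for the UTF-16 entries in the list above, here is a minimal sketch using only the standard library. The helper names (decodeUTF16, decodeUTF16LE, decodeUTF16BE) are hypothetical, the BOM is assumed to be already removed, and the package's own converter may work differently.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16 converts UTF-16 bytes (BOM already removed) to a UTF-8 string.
// order selects the byte order; a trailing odd byte is dropped in this sketch.
func decodeUTF16(b []byte, order binary.ByteOrder) string {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, order.Uint16(b[i:i+2]))
	}
	// utf16.Decode combines surrogate pairs into single code points.
	return string(utf16.Decode(units))
}

// decodeUTF16LE and decodeUTF16BE are hypothetical convenience wrappers.
func decodeUTF16LE(b []byte) string { return decodeUTF16(b, binary.LittleEndian) }
func decodeUTF16BE(b []byte) string { return decodeUTF16(b, binary.BigEndian) }

func main() {
	le := []byte{0x30, 0x00, 0x20, 0x00, 0x48, 0x00, 0xE9, 0x00} // "0 Hé" in UTF-16 LE
	fmt.Println(decodeUTF16LE(le))
}
```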

func NewReaderWithEncoding added in v0.5.0

func NewReaderWithEncoding(r io.Reader, enc Encoding) io.Reader

NewReaderWithEncoding wraps a reader with the specified encoding converter. It converts the input from the given encoding to UTF-8 and validates the result.

Supported encodings:

  • EncodingANSEL: ANSEL to UTF-8 conversion, then validation
  • EncodingLATIN1: ISO-8859-1 to UTF-8 conversion, then validation
  • EncodingUTF16LE: UTF-16 LE to UTF-8 conversion, then validation
  • EncodingUTF16BE: UTF-16 BE to UTF-8 conversion, then validation
  • EncodingUTF8, EncodingASCII, EncodingUnknown: UTF-8 validation only
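As an illustration of the ISO-8859-1 entry above: Latin-1 is the simplest of these conversions, because each input byte maps to the Unicode code point with the same value. The sketch below uses a hypothetical helper, latin1ToUTF8, and is not taken from the package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// latin1ToUTF8 converts ISO-8859-1 bytes to a UTF-8 string. Every Latin-1
// byte value equals its Unicode code point, so a direct widening suffices.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	in := []byte{'J', 'o', 's', 0xE9} // "José" in ISO-8859-1
	out := latin1ToUTF8(in)
	// The converted string is then checked, mirroring "conversion, then validation".
	fmt.Println(out, utf8.ValidString(out))
}
```

Bytes below 0x80 pass through unchanged, while bytes from 0x80 to 0xFF each become a two-byte UTF-8 sequence.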

func ValidateBytes

func ValidateBytes(b []byte) bool

ValidateBytes checks if a byte slice is valid UTF-8.

func ValidateString

func ValidateString(s string) bool

ValidateString checks if a string is valid UTF-8.
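Both validation functions have direct standard-library counterparts in unicode/utf8. The sketch below uses local stand-ins (validateBytes and validateString are hypothetical names) and assumes, without confirmation from the source, that the package's checks are equivalent to utf8.Valid and utf8.ValidString.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validateBytes and validateString are hypothetical stand-ins built on the
// standard library's UTF-8 checks.
func validateBytes(b []byte) bool  { return utf8.Valid(b) }
func validateString(s string) bool { return utf8.ValidString(s) }

func main() {
	// A correctly encoded GEDCOM line is accepted.
	fmt.Println(validateString("1 NAME José /García/"))
	// A lone Latin-1 byte 0xE9 is a truncated multi-byte sequence in UTF-8,
	// so it is rejected.
	fmt.Println(validateBytes([]byte{'J', 'o', 's', 0xE9}))
}
```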

Types

type Encoding added in v0.5.0

type Encoding int

Encoding represents the character encoding of a GEDCOM file.

const (
	// EncodingUnknown indicates that no encoding was detected (no BOM or CHAR tag).
	EncodingUnknown Encoding = iota
	// EncodingUTF8 indicates UTF-8 encoding (BOM: 0xEF 0xBB 0xBF).
	EncodingUTF8
	// EncodingUTF16LE indicates UTF-16 Little Endian (BOM: 0xFF 0xFE).
	EncodingUTF16LE
	// EncodingUTF16BE indicates UTF-16 Big Endian (BOM: 0xFE 0xFF).
	EncodingUTF16BE
	// EncodingANSEL indicates ANSEL (ANSI Z39.47) encoding.
	EncodingANSEL
	// EncodingASCII indicates ASCII encoding.
	EncodingASCII
	// EncodingLATIN1 indicates ISO-8859-1 (Latin-1) encoding.
	EncodingLATIN1
)

func DetectBOM added in v0.5.0

func DetectBOM(r io.Reader) (io.Reader, Encoding, error)

DetectBOM reads the first few bytes from r to detect a Byte Order Mark (BOM). It returns a new reader that yields the original data with any BOM removed, the detected encoding, and any error encountered.

BOM detection:

  • UTF-16 LE: 0xFF 0xFE
  • UTF-16 BE: 0xFE 0xFF
  • UTF-8: 0xEF 0xBB 0xBF

If no BOM is detected, the encoding is EncodingUnknown and all bytes are preserved.
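The detection rules above can be sketched with bufio.Reader, which allows peeking at bytes without consuming them. Everything below is a hypothetical stand-in (the encoding type, its constants, detectBOM, and stripBOM are local names), so treat it as one possible approach rather than the package's implementation.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// encoding is a local stand-in for the package's Encoding type.
type encoding int

const (
	encUnknown encoding = iota
	encUTF8
	encUTF16LE
	encUTF16BE
)

// boms lists the three byte order marks from the documentation.
var boms = []struct {
	mark []byte
	enc  encoding
}{
	{[]byte{0xEF, 0xBB, 0xBF}, encUTF8},
	{[]byte{0xFF, 0xFE}, encUTF16LE},
	{[]byte{0xFE, 0xFF}, encUTF16BE},
}

// detectBOM peeks at up to three bytes, discards a recognized BOM, and
// returns a reader positioned at the first byte after it.
func detectBOM(r io.Reader) (io.Reader, encoding, error) {
	br := bufio.NewReader(r)
	head, err := br.Peek(3)
	if err != nil && err != io.EOF { // io.EOF only means fewer than 3 bytes exist
		return nil, encUnknown, err
	}
	for _, b := range boms {
		if bytes.HasPrefix(head, b.mark) {
			if _, err := br.Discard(len(b.mark)); err != nil {
				return nil, encUnknown, err
			}
			return br, b.enc, nil
		}
	}
	return br, encUnknown, nil // no BOM: all bytes are preserved
}

// stripBOM is a string-based wrapper used only for demonstration.
func stripBOM(input string) (string, encoding, error) {
	r, enc, err := detectBOM(strings.NewReader(input))
	if err != nil {
		return "", enc, err
	}
	rest, err := io.ReadAll(r)
	return string(rest), enc, err
}

func main() {
	rest, enc, err := stripBOM("\xEF\xBB\xBF0 HEAD\n")
	fmt.Printf("%q %d %v\n", rest, enc, err)
}
```

Peeking rather than reading keeps the no-BOM case simple: nothing has been consumed, so the same buffered reader can be returned unchanged.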

func DetectEncodingFromHeader added in v0.5.0

func DetectEncodingFromHeader(r io.Reader) (io.Reader, Encoding, error)

DetectEncodingFromHeader peeks at the GEDCOM header to find the CHAR tag. It returns a new reader with all bytes preserved, the detected encoding, and any error encountered.

If the CHAR tag is not found within the first headerPeekSize bytes, EncodingUnknown is returned and the caller should assume UTF-8.

Note: This function reads the entire remaining content to avoid issues with multi-byte UTF-8 sequences being split at arbitrary boundaries.
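To make the CHAR tag lookup concrete, the sketch below scans header lines for a level-1 CHAR line such as "1 CHAR ANSEL". The function charTagValue is hypothetical: it works on a string instead of peeking at a fixed number of bytes, returns the raw tag value instead of an Encoding, and its rule of stopping at the next level-0 record is an assumption.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// charTagValue scans GEDCOM header lines and returns the upper-cased value
// of the level-1 CHAR tag, or "" if the header record ends without one.
func charTagValue(header string) string {
	sc := bufio.NewScanner(strings.NewReader(header))
	for sc.Scan() {
		// strings.Fields also trims a trailing '\r' from CRLF files.
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		// A level-0 line other than HEAD starts the next record.
		if fields[0] == "0" && fields[1] != "HEAD" {
			break
		}
		if fields[0] == "1" && fields[1] == "CHAR" && len(fields) >= 3 {
			return strings.ToUpper(fields[2])
		}
	}
	return ""
}

func main() {
	header := "0 HEAD\n1 SOUR EXAMPLE\n1 CHAR ANSEL\n0 @I1@ INDI\n"
	fmt.Println(charTagValue(header))
}
```

An empty result corresponds to the EncodingUnknown case, where the caller falls back to UTF-8.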

type ErrInvalidANSEL added in v0.5.0

type ErrInvalidANSEL struct {
	Line   int
	Column int
	Byte   byte
}

ErrInvalidANSEL is returned when an invalid ANSEL byte sequence is encountered.

func (*ErrInvalidANSEL) Error added in v0.5.0

func (e *ErrInvalidANSEL) Error() string
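Because Error is declared on the pointer receiver, callers that want the structured fields would match on the pointer type when using errors.As. The sketch below shows that pattern with a local stand-in type that has the same fields; the message format, the wrap helper, and badByteOf are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// invalidANSELError is a local stand-in with the same fields as ErrInvalidANSEL.
type invalidANSELError struct {
	Line   int
	Column int
	Byte   byte
}

// Error uses a hypothetical message format; the package's wording may differ.
func (e *invalidANSELError) Error() string {
	return fmt.Sprintf("invalid ANSEL byte 0x%02X at line %d, column %d", e.Byte, e.Line, e.Column)
}

// wrap simulates a caller adding context with %w.
func wrap(err error) error {
	return fmt.Errorf("decode failed: %w", err)
}

// badByteOf unwraps err and returns the offending byte, if err carries one.
func badByteOf(err error) (byte, bool) {
	var target *invalidANSELError // pointer type, matching the pointer receiver
	if errors.As(err, &target) {
		return target.Byte, true
	}
	return 0, false
}

func main() {
	err := wrap(&invalidANSELError{Line: 12, Column: 7, Byte: 0xFD})
	if b, ok := badByteOf(err); ok {
		fmt.Printf("offending byte: 0x%02X\n", b)
	}
}
```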

type ErrInvalidUTF8

type ErrInvalidUTF8 struct {
	Line   int
	Column int
}

ErrInvalidUTF8 is returned when invalid UTF-8 sequences are encountered.

func (*ErrInvalidUTF8) Error

func (e *ErrInvalidUTF8) Error() string
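As a rough illustration of how Line and Column values of this kind can be derived, the sketch below walks the input with utf8.DecodeRune and stops at the first invalid byte. The stand-in type, the message format, and the choice of 1-based positions counted in characters are all assumptions; the package may count differently.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// invalidUTF8Error is a local stand-in with the same fields as ErrInvalidUTF8.
type invalidUTF8Error struct {
	Line   int
	Column int
}

// Error uses a hypothetical message format.
func (e *invalidUTF8Error) Error() string {
	return fmt.Sprintf("invalid UTF-8 at line %d, column %d", e.Line, e.Column)
}

// firstInvalidUTF8 returns nil for valid input, or an error holding the
// 1-based line and column (in characters) of the first invalid byte.
func firstInvalidUTF8(data []byte) error {
	line, col := 1, 1
	for i := 0; i < len(data); {
		r, size := utf8.DecodeRune(data[i:])
		// RuneError with size 1 marks an invalid byte; a correctly encoded
		// U+FFFD has size 3 and is accepted.
		if r == utf8.RuneError && size == 1 {
			return &invalidUTF8Error{Line: line, Column: col}
		}
		if r == '\n' {
			line, col = line+1, 1
		} else {
			col++
		}
		i += size
	}
	return nil
}

func main() {
	// The second line contains a Latin-1 'é' (0xE9), which is not valid UTF-8.
	err := firstInvalidUTF8([]byte("0 HEAD\n1 NAME Jos\xE9\n"))
	fmt.Println(err)
}
```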
