charset

package

Version: v0.7.0 (latest) | Published: Jan 13, 2026 | License: MIT | Imports: 9 | Imported by: 0

Repository

github.com/cacack/gedcom-go

Documentation ¶

Overview ¶

Package charset provides character encoding utilities for GEDCOM files.

This package handles encoding detection, UTF-8 validation, and Byte Order Mark (BOM) removal for GEDCOM file parsing. It converts UTF-16, ANSEL, and ISO-8859-1 input to UTF-8, and it reports encoding problems through error types that carry the line and column of the offending data.

Index ¶

func IsCombiningDiacritical(b byte) bool
func NewReader(r io.Reader) io.Reader
func NewReaderWithEncoding(r io.Reader, enc Encoding) io.Reader
func ValidateBytes(b []byte) bool
func ValidateString(s string) bool
type Encoding
- func DetectBOM(r io.Reader) (io.Reader, Encoding, error)
- func DetectEncodingFromHeader(r io.Reader) (io.Reader, Encoding, error)
type ErrInvalidANSEL
- func (e *ErrInvalidANSEL) Error() string
type ErrInvalidUTF8
- func (e *ErrInvalidUTF8) Error() string

Functions ¶

func IsCombiningDiacritical ¶ added in v0.5.0

func IsCombiningDiacritical(b byte) bool

IsCombiningDiacritical returns true if the given byte is an ANSEL combining diacritical mark (0xE0-0xFE range). These marks modify the character that follows them in ANSEL encoding.

Note: Not all bytes in the 0xE0-0xFE range are defined in ANSEL (0xFC and 0xFD are undefined), but this function returns true for the entire range to allow the decoder to handle undefined codes gracefully.
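The range check described above can be sketched with the standard library alone. The names `isCombiningSketch` and `groupANSEL` are hypothetical stand-ins written for illustration; they are not the package's source. The grouping helper only shows why the range matters: in ANSEL a combining mark precedes the base character it modifies, so a decoder has to hold marks until the base byte arrives.

```go
package main

import "fmt"

// isCombiningSketch mirrors the documented range check (0xE0-0xFE).
// Hypothetical stand-in, not the package's implementation.
func isCombiningSketch(b byte) bool {
	return b >= 0xE0 && b <= 0xFE
}

// groupANSEL pairs each run of combining marks with the base byte that
// follows it. Trailing marks with no base byte form their own group.
func groupANSEL(data []byte) [][]byte {
	var groups [][]byte
	start := 0
	for i, b := range data {
		if isCombiningSketch(b) {
			continue // mark precedes its base character; keep accumulating
		}
		groups = append(groups, data[start:i+1])
		start = i + 1
	}
	if start < len(data) {
		groups = append(groups, data[start:])
	}
	return groups
}

func main() {
	// 0xE2 is the ANSEL acute accent; it comes before the 'e' it modifies.
	for _, g := range groupANSEL([]byte{'R', 0xE2, 'e', 'n'}) {
		fmt.Printf("% X\n", g)
	}
}
```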

func NewReader ¶

func NewReader(r io.Reader) io.Reader

NewReader wraps an io.Reader to provide encoding detection and UTF-8 validation. It first checks for a BOM (Byte Order Mark), then looks for a CHAR tag in the GEDCOM header to determine the encoding. The input is converted to UTF-8 and validated.

Supported encodings:

UTF-16 LE (BOM: 0xFF 0xFE) - Converted to UTF-8
UTF-16 BE (BOM: 0xFE 0xFF) - Converted to UTF-8
UTF-8 (BOM: 0xEF 0xBB 0xBF) - BOM removed, validated
ANSEL (CHAR tag: ANSEL) - Converted to UTF-8, validated
No BOM or CHAR tag - Assumed UTF-8, validated

func NewReaderWithEncoding ¶ added in v0.5.0

func NewReaderWithEncoding(r io.Reader, enc Encoding) io.Reader

NewReaderWithEncoding wraps a reader with the specified encoding converter. It converts the input from the given encoding to UTF-8 and validates the result.

Supported encodings:

EncodingANSEL: ANSEL to UTF-8 conversion, then validation
EncodingLATIN1: ISO-8859-1 to UTF-8 conversion, then validation
EncodingUTF16LE: UTF-16 LE to UTF-8 conversion, then validation
EncodingUTF16BE: UTF-16 BE to UTF-8 conversion, then validation
EncodingUTF8, EncodingASCII, EncodingUnknown: UTF-8 validation only
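The ISO-8859-1 case is the simplest conversion in the list above, because every Latin-1 byte value equals its Unicode code point. The following is a sketch of that idea only; `latin1ToUTF8` is a hypothetical helper and is not claimed to match the package's code.

```go
package main

import "fmt"

// latin1ToUTF8 converts ISO-8859-1 bytes to a UTF-8 string. Each Latin-1
// byte maps directly to the code point with the same value (U+0000-U+00FF).
// Hypothetical helper for illustration.
func latin1ToUTF8(data []byte) string {
	runes := make([]rune, len(data))
	for i, b := range data {
		runes[i] = rune(b)
	}
	return string(runes) // string() encodes the runes as UTF-8
}

func main() {
	// 0xFC is "u with diaeresis" in Latin-1.
	fmt.Println(latin1ToUTF8([]byte{'M', 0xFC, 'l', 'l', 'e', 'r'}))
}
```

Bytes at or above 0x80 grow to two bytes each in UTF-8, so the output can be longer than the input.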

func ValidateBytes ¶

func ValidateBytes(b []byte) bool

ValidateBytes checks if a byte slice is valid UTF-8.

func ValidateString ¶

func ValidateString(s string) bool

ValidateString checks if a string is valid UTF-8.
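For reference, the standard library offers equivalent checks in `unicode/utf8`. Whether ValidateBytes and ValidateString delegate to these functions is an assumption; the wrappers below (`validBytesSketch`, `validStringSketch`) are hypothetical and only show what "valid UTF-8" means in practice.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validBytesSketch and validStringSketch are hypothetical wrappers around
// the standard library; the package may or may not work this way.
func validBytesSketch(b []byte) bool   { return utf8.Valid(b) }
func validStringSketch(s string) bool { return utf8.ValidString(s) }

func main() {
	good := []byte("M\u00fcller")                 // UTF-8 encoded
	bad := []byte{'M', 0xFC, 'l', 'l', 'e', 'r'} // Latin-1 bytes, not valid UTF-8
	fmt.Println(validBytesSketch(good), validBytesSketch(bad), validStringSketch("0 HEAD"))
}
```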

Types ¶

type Encoding ¶ added in v0.5.0

type Encoding int

Encoding represents the character encoding of a GEDCOM file.

const (
	// EncodingUnknown indicates that no encoding was detected (no BOM and no CHAR tag).
	EncodingUnknown Encoding = iota
	// EncodingUTF8 indicates UTF-8 encoding (BOM: 0xEF 0xBB 0xBF).
	EncodingUTF8
	// EncodingUTF16LE indicates UTF-16 Little Endian (BOM: 0xFF 0xFE).
	EncodingUTF16LE
	// EncodingUTF16BE indicates UTF-16 Big Endian (BOM: 0xFE 0xFF).
	EncodingUTF16BE
	// EncodingANSEL indicates ANSEL (ANSI Z39.47) encoding.
	EncodingANSEL
	// EncodingASCII indicates ASCII encoding.
	EncodingASCII
	// EncodingLATIN1 indicates ISO-8859-1 (Latin-1) encoding.
	EncodingLATIN1
)

func DetectBOM ¶ added in v0.5.0

func DetectBOM(r io.Reader) (io.Reader, Encoding, error)

DetectBOM reads the first few bytes from r to detect a Byte Order Mark (BOM). It returns a new reader that yields the original data with any BOM removed, the detected encoding, and any error encountered.

BOM detection:

UTF-16 LE: 0xFF 0xFE
UTF-16 BE: 0xFE 0xFF
UTF-8: 0xEF 0xBB 0xBF

If no BOM is detected, the encoding is EncodingUnknown and all bytes are preserved.

func DetectEncodingFromHeader ¶ added in v0.5.0

func DetectEncodingFromHeader(r io.Reader) (io.Reader, Encoding, error)

DetectEncodingFromHeader peeks at the GEDCOM header to find the CHAR tag. It returns a new reader with all bytes preserved, the detected encoding, and any error encountered.

If the CHAR tag is not found within the first headerPeekSize bytes, EncodingUnknown is returned and the caller should assume UTF-8.

Note: This function reads the entire remaining content to avoid issues with multi-byte UTF-8 sequences being split at arbitrary boundaries.
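As a rough illustration of the header scan, the sketch below looks for a level-1 `CHAR` line in a buffer of header bytes. The helper `charFromHeader` is hypothetical: it takes the already-peeked bytes, returns the raw tag value as an upper-case string, and assumes LF or CRLF line endings. The package's actual parsing rules and peek size are not documented here.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// charFromHeader scans header bytes line by line for "1 CHAR <value>" and
// returns the value in upper case, or "" if no CHAR line is present.
// Hypothetical helper for illustration.
func charFromHeader(head []byte) string {
	sc := bufio.NewScanner(bytes.NewReader(head))
	for sc.Scan() { // the default split function also strips a trailing \r
		fields := strings.Fields(sc.Text())
		if len(fields) >= 3 && fields[0] == "1" && fields[1] == "CHAR" {
			return strings.ToUpper(fields[2])
		}
	}
	return ""
}

func main() {
	head := []byte("0 HEAD\n1 SOUR EXAMPLE\n1 CHAR ANSEL\n0 TRLR\n")
	fmt.Println(charFromHeader(head))
}
```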

type ErrInvalidANSEL ¶ added in v0.5.0

type ErrInvalidANSEL struct {
	Line   int
	Column int
	Byte   byte
}

ErrInvalidANSEL is returned when an invalid ANSEL byte sequence is encountered.

func (*ErrInvalidANSEL) Error ¶ added in v0.5.0

func (e *ErrInvalidANSEL) Error() string

type ErrInvalidUTF8 ¶

type ErrInvalidUTF8 struct {
	Line   int
	Column int
}

ErrInvalidUTF8 is returned when invalid UTF-8 sequences are encountered.

func (*ErrInvalidUTF8) Error ¶

func (e *ErrInvalidUTF8) Error() string
