Documentation ¶
Overview ¶
Package charset provides character encoding utilities for GEDCOM files.
This package handles encoding detection, conversion to UTF-8, UTF-8 validation, and Byte Order Mark (BOM) removal for GEDCOM file parsing. It ensures that GEDCOM data is properly encoded and provides detailed error reporting for encoding issues.
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func IsCombiningDiacritical ¶ added in v0.5.0
IsCombiningDiacritical reports whether the given byte falls in the ANSEL combining diacritical range (0xE0-0xFE). In ANSEL encoding, these marks precede the character they modify.
Note: Not all bytes in the 0xE0-0xFE range are defined in ANSEL (0xFC and 0xFD are undefined), but this function returns true for the entire range to allow the decoder to handle undefined codes gracefully.
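The range check described above can be sketched in a few lines. The name isCombining is a hypothetical stand-in, not the package's function; it only illustrates that the whole 0xE0-0xFE range, including the undefined 0xFC and 0xFD, is accepted.

```go
package main

import "fmt"

// isCombining is a hypothetical stand-in for the documented range check.
// It reports whether b lies in 0xE0-0xFE, including the undefined
// codes 0xFC and 0xFD.
func isCombining(b byte) bool {
	return b >= 0xE0 && b <= 0xFE
}

func main() {
	for _, b := range []byte{0x41, 0xE2, 0xFC, 0xFF} {
		fmt.Printf("0x%02X combining=%v\n", b, isCombining(b))
	}
}
```

Accepting the full range keeps the check to a single comparison pair and leaves the decision about undefined codes to the decoder.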
func NewReader ¶
NewReader wraps an io.Reader to provide encoding detection and UTF-8 validation. It first checks for a BOM (Byte Order Mark), then looks for a CHAR tag in the GEDCOM header to determine the encoding. The input is converted to UTF-8 and validated.
Supported encodings:
- UTF-16 LE (BOM: 0xFF 0xFE) - Converted to UTF-8
- UTF-16 BE (BOM: 0xFE 0xFF) - Converted to UTF-8
- UTF-8 (BOM: 0xEF 0xBB 0xBF) - BOM removed, validated
- ANSEL (CHAR tag: ANSEL) - Converted to UTF-8, validated
- No BOM or CHAR tag - Assumed UTF-8, validated
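To illustrate the UTF-16 rows in the list above, here is a minimal sketch of BOM-driven UTF-16 to UTF-8 conversion using only the standard library. The helper decodeUTF16 is hypothetical and is not how NewReader is necessarily implemented; in particular, it works on a whole byte slice rather than a stream.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16 is a hypothetical helper: it strips a leading UTF-16 BOM,
// picks the byte order from it, and returns the text as a UTF-8 Go string.
// Input without a UTF-16 BOM is rejected rather than guessed at.
func decodeUTF16(data []byte) (string, error) {
	if len(data) < 2 {
		return "", fmt.Errorf("input too short for a UTF-16 BOM")
	}
	var order binary.ByteOrder
	switch {
	case data[0] == 0xFF && data[1] == 0xFE:
		order = binary.LittleEndian
	case data[0] == 0xFE && data[1] == 0xFF:
		order = binary.BigEndian
	default:
		return "", fmt.Errorf("no UTF-16 BOM found")
	}
	body := data[2:]
	if len(body)%2 != 0 {
		return "", fmt.Errorf("odd number of bytes after BOM")
	}
	units := make([]uint16, len(body)/2)
	for i := range units {
		units[i] = order.Uint16(body[2*i:])
	}
	// utf16.Decode pairs surrogates and substitutes U+FFFD for invalid ones.
	return string(utf16.Decode(units)), nil
}

func main() {
	le := []byte{0xFF, 0xFE, '0', 0x00, ' ', 0x00, 'H', 0x00, 'E', 0x00, 'A', 0x00, 'D', 0x00}
	s, err := decodeUTF16(le)
	fmt.Printf("%q %v\n", s, err)
}
```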
func NewReaderWithEncoding ¶ added in v0.5.0
NewReaderWithEncoding wraps a reader with the specified encoding converter. It converts the input from the given encoding to UTF-8 and validates the result.
Supported encodings:
- EncodingANSEL: ANSEL to UTF-8 conversion, then validation
- EncodingLATIN1: ISO-8859-1 to UTF-8 conversion, then validation
- EncodingUTF16LE: UTF-16 LE to UTF-8 conversion, then validation
- EncodingUTF16BE: UTF-16 BE to UTF-8 conversion, then validation
- EncodingUTF8, EncodingASCII, EncodingUnknown: UTF-8 validation only
func ValidateBytes ¶
ValidateBytes checks if a byte slice is valid UTF-8.
func ValidateString ¶
ValidateString checks if a string is valid UTF-8.
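The signatures and return values of the two validation functions are not shown here, so the following is only a sketch of how UTF-8 validation with a reported position might look using the standard unicode/utf8 package. The helper firstInvalidUTF8 is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidUTF8 is a hypothetical helper that returns the byte offset of
// the first invalid UTF-8 sequence in data, or -1 if data is valid.
func firstInvalidUTF8(data []byte) int {
	for i := 0; i < len(data); {
		r, size := utf8.DecodeRune(data[i:])
		// DecodeRune signals a bad sequence with (RuneError, 1); a literal
		// U+FFFD in the input decodes with size 3 and is valid.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalidUTF8([]byte("0 HEAD")))
	fmt.Println(firstInvalidUTF8([]byte{'a', 0xFF, 'b'}))
}
```

When only a yes/no answer is needed, utf8.Valid and utf8.ValidString give the same verdict without the offset.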
Types ¶
type Encoding ¶ added in v0.5.0
type Encoding int
Encoding represents the character encoding of a GEDCOM file.
const (
	// EncodingUnknown indicates no BOM was detected.
	EncodingUnknown Encoding = iota
	// EncodingUTF8 indicates UTF-8 encoding (BOM: 0xEF 0xBB 0xBF).
	EncodingUTF8
	// EncodingUTF16LE indicates UTF-16 Little Endian (BOM: 0xFF 0xFE).
	EncodingUTF16LE
	// EncodingUTF16BE indicates UTF-16 Big Endian (BOM: 0xFE 0xFF).
	EncodingUTF16BE
	// EncodingANSEL indicates ANSEL (ANSI Z39.47) encoding.
	EncodingANSEL
	// EncodingASCII indicates ASCII encoding.
	EncodingASCII
	// EncodingLATIN1 indicates ISO-8859-1 (Latin-1) encoding.
	EncodingLATIN1
)
func DetectBOM ¶ added in v0.5.0
DetectBOM reads the first few bytes from r to detect a Byte Order Mark (BOM). It returns a new reader that yields the original data with the BOM, if present, consumed; the detected encoding; and any error encountered.
BOM detection:
- UTF-16 LE: 0xFF 0xFE
- UTF-16 BE: 0xFE 0xFF
- UTF-8: 0xEF 0xBB 0xBF
If no BOM is detected, the encoding is EncodingUnknown and all bytes are preserved.
func DetectEncodingFromHeader ¶ added in v0.5.0
DetectEncodingFromHeader peeks at the GEDCOM header to find the CHAR tag. It returns a new reader with all bytes preserved, the detected encoding, and any error encountered.
If the CHAR tag is not found within the first headerPeekSize bytes, EncodingUnknown is returned and the caller should assume UTF-8.
Note: This function reads the entire remaining content to avoid issues with multi-byte UTF-8 sequences being split at arbitrary boundaries.
type ErrInvalidANSEL ¶ added in v0.5.0
ErrInvalidANSEL is returned when an invalid ANSEL byte sequence is encountered.
func (*ErrInvalidANSEL) Error ¶ added in v0.5.0
func (e *ErrInvalidANSEL) Error() string
type ErrInvalidUTF8 ¶
ErrInvalidUTF8 is returned when invalid UTF-8 sequences are encountered.
func (*ErrInvalidUTF8) Error ¶
func (e *ErrInvalidUTF8) Error() string