lsh

package module
v1.0.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Feb 12, 2026 License: Apache-2.0 Imports: 19 Imported by: 0

README

🚀 High-Performance MinHash LSH for Go

A production-grade Locality Sensitive Hashing (LSH) implementation engineered for high-throughput deduplication, fraud detection, and similarity search. This library focuses on extreme CPU efficiency and zero-allocation hot paths. ✨ Key Features

⚡ Odd-Coefficient Optimization: Replaces expensive modulo arithmetic with implicit 264 overflow. By enforcing odd coefficients, we guarantee a mathematically perfect permutation with zero CPU overhead.

♻️ Zero-Allocation Design: Utilizes sync.Pool for signature buffers and optimized slicing to eliminate GC pressure during heavy ingestion.

🛡️ Hot-Bucket Protection: Includes a "Peek & Stub" strategy for Aerospike/Database layers to prevent OOM crashes when encountering "hot" tokens (millions of matches).

🧩 Scalable Architecture: Decoupled storage interfaces supporting In-Memory and Aerospike (CDT-based) backends out of the box.

🧪 Mathematically Sound: Implements 3-char shingling and MinHash signatures with configurable P=1−(1−sr)b probability curves.

📦 Installation

go get github.com/FrogoAI/lsh

🚀 Quick Start

package main

import (
	"context"
	
	"github.com/FrogoAI/lsh"
	"github.com/FrogoAI/lsh/repositories/memory"
)

func main() {
	// 1. Setup Config
	cfg := &lsh.Config{
		Bands:            20,
		Rows:             5,
		ShingleSize:      3,
		JaccardThreshold: 0.6,
		Seed:             13374269,
	}

	// 2. Initialize Service
	repo := memory.New()
	service := lsh.NewSimilarityService(repo, cfg)

	// 3. Upsert a record
	// Returns a new unique record id, or the existing ID if it's a duplicate.
	id, _ := service.Upsert(context.Background(), "users", "John Doe")
}

📐 Tuning the LSH Curve

The probability of two items sharing a bucket depends on their Jaccard similarity s, the number of bands b, and rows per band r.

Higher Recall (Catch more): Increase Bands.

Higher Precision (Fewer false positives): Increase Rows.

🔬 Performance Optimizations The "Odd Coefficient" Trick

Standard MinHash often uses (ax+b)(modP). We utilize the CPU's natural behavior with 264 overflow. Because we ensure a is always odd, it remains coprime to 264, satisfying the requirement for a linear congruential generator to produce a full-period permutation. Aerospike CDT Implementation

The Aerospike repository uses Map CDTs (Collection Data Types). Instead of fetching a whole list of IDs, we use MapSizeOp to check the density of a bucket server-side. If a bucket is too "hot," we skip it without ever transferring the data over the wire.

Documentation

Index

Constants

View Source
const (
	EnvPrefix = "LSH"
)

Variables

View Source
var (
	ErrSignatureTooShort    = errors.New("signature too short")
	ErrShingleResultIsEmpty = errors.New("error shingle result is empty")
)

Functions

func EstimateJaccard added in v1.0.0

func EstimateJaccard(sig1, sig2 []uint64) float64

func GetTinyID

func GetTinyID() ([]byte, error)

Types

type Config

type Config struct {
	Bands              int     `env:"_BANDS" envDefault:"40"`
	Rows               int     `env:"_ROWS" envDefault:"5"`
	ShingleSize        int     `env:"_SHINGLE_SIZE" envDefault:"3"`
	JaccardThreshold   float64 `env:"_JAC_THRESHOLD" envDefault:"0.6"`
	MaxBucketSize      int     `env:"_MAX_BUCKET_SIZE" envDefault:"200"`
	MaxTotalCandidates int     `env:"_MAX_TOTAL_CANDIDATES" envDefault:"100"`
	Seed               int64   `env:"_SEED" envDefault:"13374269"`
}

func GetLSHConfigFromEnv

func GetLSHConfigFromEnv() (*Config, error)

func (*Config) HashVersion

func (c *Config) HashVersion(group string) (string, error)

type Hasher

type Hasher struct {
	// contains filtered or unexported fields
}

func NewHasher

func NewHasher(bands, rows int, seed int64) *Hasher

func (*Hasher) ComputeBands

func (h *Hasher) ComputeBands(signature []uint64) ([]string, error)

func (*Hasher) ComputeSignature

func (h *Hasher) ComputeSignature(tokens set.GenericDataSet[string], sig []uint64)

type SimilarityService

type SimilarityService struct {
	// contains filtered or unexported fields
}

func NewSimilarityService

func NewSimilarityService(repo repositories.Storage, config *Config) *SimilarityService

func (*SimilarityService) CalculateJaccardOptimized

func (s *SimilarityService) CalculateJaccardOptimized(sourceSet set.GenericDataSet[string], targetStr string) float64

func (*SimilarityService) GetNewID

func (s *SimilarityService) GetNewID() (string, error)

func (*SimilarityService) Shingle

func (s *SimilarityService) Shingle(input string) set.GenericDataSet[string]

func (*SimilarityService) Upsert

func (s *SimilarityService) Upsert(ctx context.Context, group, input string) (string, error)

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL