Documentation ¶
Overview ¶
Package graph provides a computational graph abstraction.
Index ¶
- Variables
- func ParallelForward[T tensor.Numeric](ctx context.Context, g *Graph[T], inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
- type BufferArena
- type BufferLayout
- type Builder
- type CUDAGraphExecutor
- type EmbeddedFrozenProvider
- type ExecutionPlan
- func (p *ExecutionPlan[T]) BufferLayout() *BufferLayout
- func (p *ExecutionPlan[T]) EnsureSlotsGPU(gpuSlotCache map[int]*tensor.TensorNumeric[T])
- func (p *ExecutionPlan[T]) FrozenSlots() []FrozenSlot[T]
- func (p *ExecutionPlan[T]) HasPreallocatedBuffers() bool
- func (p *ExecutionPlan[T]) InputSlots() []int
- func (p *ExecutionPlan[T]) InstructionCount() int
- func (p *ExecutionPlan[T]) InstructionOpName(i int) string
- func (p *ExecutionPlan[T]) InstructionOutputIdx(i int) int
- func (p *ExecutionPlan[T]) Instructions() []InstructionMeta
- func (p *ExecutionPlan[T]) OutputSlot() int
- func (p *ExecutionPlan[T]) OutputTensor() *tensor.TensorNumeric[T]
- func (p *ExecutionPlan[T]) PreallocateBuffers()
- func (p *ExecutionPlan[T]) PrepareSlots(inputs ...*tensor.TensorNumeric[T]) error
- func (p *ExecutionPlan[T]) Run(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
- func (p *ExecutionPlan[T]) RunInstructionRange(ctx context.Context, start, end int) error
- func (p *ExecutionPlan[T]) RunInstructions(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
- func (p *ExecutionPlan[T]) ScratchSlot(idx int) *tensor.TensorNumeric[T]
- func (p *ExecutionPlan[T]) SetMegakernelFn(...)
- func (p *ExecutionPlan[T]) SetScratchSlot(idx int, t *tensor.TensorNumeric[T])
- func (p *ExecutionPlan[T]) SlotShapes() [][]int
- type FrozenSlot
- type Graph
- func (g *Graph[T]) Backward(ctx context.Context, mode types.BackwardMode, ...) error
- func (g *Graph[T]) ClearMemo()
- func (g *Graph[T]) Compile(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*ExecutionPlan[T], error)
- func (g *Graph[T]) CompileTraced(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*ExecutionPlan[T], error)
- func (g *Graph[T]) ConstantTensors() []*tensor.TensorNumeric[T]
- func (g *Graph[T]) Dependencies(n Node[T]) []Node[T]
- func (g *Graph[T]) EngineProxy() *compute.EngineProxy[T]
- func (g *Graph[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
- func (g *Graph[T]) GetAllNodes() []Node[T]
- func (g *Graph[T]) GetDependencies() map[Node[T]][]Node[T]
- func (g *Graph[T]) GetNodeMetadata(n Node[T]) map[string]interface{}
- func (g *Graph[T]) GetTopologicalOrder() ([]Node[T], error)
- func (g *Graph[T]) Inputs() []Node[T]
- func (g *Graph[T]) Nodes() []Node[T]
- func (g *Graph[T]) Output() Node[T]
- func (g *Graph[T]) Parameters() []*Parameter[T]
- func (g *Graph[T]) SetEngineProxy(proxy *compute.EngineProxy[T])
- func (g *Graph[T]) WithParallel(enabled bool)
- func (g *Graph[T]) WithPool(pool TensorReleaser[T])
- type Instruction
- type InstructionMeta
- type NoParameters
- type Node
- type Parameter
- type TensorReleaser
- type Transposer
Constants ¶
This section is empty.
Variables ¶
var ErrInvalidInputCount = errors.New("invalid number of inputs")
ErrInvalidInputCount is returned when the number of inputs to a node is incorrect.
Functions ¶
func ParallelForward ¶ added in v1.1.0
func ParallelForward[T tensor.Numeric](ctx context.Context, g *Graph[T], inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
ParallelForward executes the forward pass with dependency-aware parallelism. Independent nodes are dispatched concurrently to a goroutine pool. The result is identical to that of a sequential Forward call.
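The idea of dependency-aware parallelism can be illustrated with a minimal, self-contained sketch. It does not use this package: the node type, parallelEval, and sequentialEval are hypothetical stand-ins, nodes carry plain integers instead of tensors, and one goroutine is started per node rather than using a pool.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for a graph node: its output is its own
// bias plus the sum of its dependencies' outputs.
type node struct {
	deps []int // indices of the nodes this node depends on
	bias int
}

// parallelEval starts one goroutine per node. Each goroutine blocks until all
// of its dependencies have published their results, so independent nodes run
// concurrently. The graph must be acyclic.
func parallelEval(nodes []node) []int {
	out := make([]int, len(nodes))
	done := make([]chan struct{}, len(nodes))
	for i := range done {
		done[i] = make(chan struct{})
	}
	var wg sync.WaitGroup
	for i := range nodes {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sum := nodes[i].bias
			for _, d := range nodes[i].deps {
				<-done[d] // wait for the dependency's result
				sum += out[d]
			}
			out[i] = sum
			close(done[i]) // publish the result to consumers
		}(i)
	}
	wg.Wait()
	return out
}

// sequentialEval evaluates the nodes one by one; it expects the slice to be
// in topological order.
func sequentialEval(nodes []node) []int {
	out := make([]int, len(nodes))
	for i, n := range nodes {
		out[i] = n.bias
		for _, d := range n.deps {
			out[i] += out[d]
		}
	}
	return out
}

func main() {
	// Diamond-shaped graph: 0 -> {1, 2} -> 3. Nodes 1 and 2 are independent.
	nodes := []node{
		{bias: 1},
		{deps: []int{0}, bias: 2},
		{deps: []int{0}, bias: 3},
		{deps: []int{1, 2}},
	}
	fmt.Println(parallelEval(nodes))
	fmt.Println(sequentialEval(nodes))
}
```

Both calls produce the same values, which is the property the documentation states for ParallelForward and Forward.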
Types ¶
type BufferArena ¶ added in v1.1.0
BufferArena pre-allocates tensor buffers for use by an ExecutionPlan. All buffers are created once and reused across Run() calls. Frozen slots (parameters, constants) are not zeroed on Reset.
func NewBufferArena ¶ added in v1.1.0
func NewBufferArena[T tensor.Numeric](shapes [][]int) *BufferArena[T]
NewBufferArena pre-allocates one tensor per shape.
func (*BufferArena[T]) Get ¶ added in v1.1.0
func (a *BufferArena[T]) Get(idx int) *tensor.TensorNumeric[T]
Get returns the pre-allocated buffer at index idx.
func (*BufferArena[T]) Len ¶ added in v1.1.0
func (a *BufferArena[T]) Len() int
Len returns the number of buffer slots.
func (*BufferArena[T]) Reset ¶ added in v1.1.0
func (a *BufferArena[T]) Reset()
Reset zeros all non-frozen buffer data for the next execution step.
func (*BufferArena[T]) Set ¶ added in v1.1.0
func (a *BufferArena[T]) Set(idx int, t *tensor.TensorNumeric[T], freeze bool)
Set replaces the buffer at idx with the given tensor and optionally marks it as frozen (skipped during Reset).
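The interaction between Set, frozen slots, and Reset can be sketched with a simplified arena. This is an illustration only: the arena type below is hypothetical and stores float32 slices rather than tensors.

```go
package main

import "fmt"

// arena is a hypothetical simplification of BufferArena that stores plain
// float32 slices instead of tensors.
type arena struct {
	bufs   [][]float32
	frozen []bool
}

// newArena allocates one zeroed buffer per shape.
func newArena(shapes [][]int) *arena {
	a := &arena{
		bufs:   make([][]float32, len(shapes)),
		frozen: make([]bool, len(shapes)),
	}
	for i, shape := range shapes {
		n := 1
		for _, d := range shape {
			n *= d
		}
		a.bufs[i] = make([]float32, n)
	}
	return a
}

// set replaces the buffer at idx and records whether it is frozen.
func (a *arena) set(idx int, buf []float32, freeze bool) {
	a.bufs[idx] = buf
	a.frozen[idx] = freeze
}

// reset zeros every buffer that is not frozen.
func (a *arena) reset() {
	for i, b := range a.bufs {
		if a.frozen[i] {
			continue
		}
		for j := range b {
			b[j] = 0
		}
	}
}

func main() {
	a := newArena([][]int{{2, 2}, {3}})
	a.bufs[0][1] = 5                   // intermediate result from a previous step
	a.set(1, []float32{1, 2, 3}, true) // weights: frozen, must survive reset
	a.reset()
	fmt.Println(a.bufs[0], a.bufs[1])
}
```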
type BufferLayout ¶ added in v1.1.0
type BufferLayout struct {
// Offsets[i] is the element offset of slot i into the contiguous buffer.
// A value of -1 means the slot is not part of the contiguous buffer
// (e.g. frozen or input slots that are managed externally).
Offsets []int
// Sizes[i] is the element count for slot i (product of its shape dims).
// Zero for slots without a known shape.
Sizes []int
// TotalElements is the sum of all slot sizes that are part of the
// contiguous buffer.
TotalElements int
}
BufferLayout describes a contiguous pre-allocated buffer with fixed offsets for each slot in an ExecutionPlan. CUDA graph capture requires that device memory addresses remain stable across runs; this layout ensures every intermediate tensor occupies the same offset in every execution.
func ComputeBufferLayout ¶ added in v1.1.0
func ComputeBufferLayout(slotShapes [][]int, frozenIdx []int, inputIdx []int) BufferLayout
ComputeBufferLayout computes element offsets for each slot based on the slot shapes from compilation. Frozen and input slots are excluded (offset -1) because they are managed externally: model weights stay constant, while inputs change between runs.
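A possible way to compute such a layout is sketched below. It is not the package's implementation: layout and computeLayout are hypothetical names, and the sketch assumes that slots with a nil shape are also left out of the contiguous buffer and that Sizes is still filled in for excluded slots.

```go
package main

import "fmt"

// layout mirrors the fields of BufferLayout for illustration.
type layout struct {
	Offsets       []int
	Sizes         []int
	TotalElements int
}

// computeLayout packs every internally managed slot back to back.
// Assumption: frozen slots, input slots, and slots without a known shape all
// receive offset -1.
func computeLayout(slotShapes [][]int, frozenIdx, inputIdx []int) layout {
	external := make(map[int]bool, len(frozenIdx)+len(inputIdx))
	for _, i := range frozenIdx {
		external[i] = true
	}
	for _, i := range inputIdx {
		external[i] = true
	}

	l := layout{
		Offsets: make([]int, len(slotShapes)),
		Sizes:   make([]int, len(slotShapes)),
	}
	for i, shape := range slotShapes {
		if shape != nil {
			size := 1
			for _, d := range shape {
				size *= d
			}
			l.Sizes[i] = size
		}
		if external[i] || shape == nil {
			l.Offsets[i] = -1
			continue
		}
		l.Offsets[i] = l.TotalElements
		l.TotalElements += l.Sizes[i]
	}
	return l
}

func main() {
	shapes := [][]int{{2, 3}, {4}, {2, 2}, nil, {5}}
	l := computeLayout(shapes, []int{0}, []int{1})
	fmt.Println(l.Offsets, l.Sizes, l.TotalElements)
}
```

Because the offsets depend only on the compiled shapes, every run places a given intermediate at the same position in the buffer.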
type Builder ¶
Builder provides a fluent API for constructing a computation graph.
func NewBuilder ¶
NewBuilder creates a new graph builder.
func (*Builder[T]) Parameters ¶
Parameters returns all the trainable parameters in the graph.
type CUDAGraphExecutor ¶ added in v1.1.0
CUDAGraphExecutor captures and replays a CUDA graph for an ExecutionPlan. It splits the plan into three regions:
- Pre-capture: instructions that trigger D2H copies or have dynamic state
- Capture region: GPU-only, position-independent instructions
- Post-capture: any trailing non-capturable instructions
During replay, the pre-capture and post-capture regions run normally, while the capture region is replayed from the captured graph with near-zero launch overhead.
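The three-region split can be sketched as a scan over the instruction list. This is a simplified illustration, not the executor's actual logic: the nonCapturable set, the "Sample" operation name, and splitRegions are assumptions, and the sketch treats the first contiguous run of capturable instructions as the capture region.

```go
package main

import "fmt"

// nonCapturable is a hypothetical set of operations assumed to need host
// interaction (for example a device-to-host copy) and therefore to be
// unsuitable for graph capture.
var nonCapturable = map[string]bool{
	"EmbeddingLookup": true,
	"Sample":          true,
}

// splitRegions returns the half-open instruction range [start, end) of the
// capture region. Instructions before start form the pre-capture region and
// instructions from end onward form the post-capture region.
func splitRegions(ops []string) (start, end int) {
	for start < len(ops) && nonCapturable[ops[start]] {
		start++
	}
	end = start
	for end < len(ops) && !nonCapturable[ops[end]] {
		end++
	}
	return start, end
}

func main() {
	ops := []string{"EmbeddingLookup", "RMSNorm", "MatMul", "Add", "Sample"}
	start, end := splitRegions(ops)
	fmt.Println("pre-capture: ", ops[:start])
	fmt.Println("capture:     ", ops[start:end])
	fmt.Println("post-capture:", ops[end:])
}
```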
func NewCUDAGraphExecutor ¶ added in v1.1.0
func NewCUDAGraphExecutor[T tensor.Numeric](plan *ExecutionPlan[T], streamPtr unsafe.Pointer, warmups int, onCaptured func()) *CUDAGraphExecutor[T]
NewCUDAGraphExecutor creates a graph executor for the given plan. The optional onCaptured callback is invoked after a successful capture, allowing the caller to protect arena allocations from being reclaimed.
func (*CUDAGraphExecutor[T]) Destroy ¶ added in v1.1.0
func (g *CUDAGraphExecutor[T]) Destroy()
Destroy releases the CUDA graph resources.
func (*CUDAGraphExecutor[T]) Run ¶ added in v1.1.0
func (g *CUDAGraphExecutor[T]) Run(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
Run executes the plan, using graph capture/replay when available.
type EmbeddedFrozenProvider ¶ added in v1.1.0
type EmbeddedFrozenProvider[T tensor.Numeric] interface { EmbeddedFrozen() []*tensor.TensorNumeric[T] }
EmbeddedFrozenProvider is implemented by nodes that carry frozen data internally (e.g. Gather with embedded weights). Compile detects this interface and creates synthetic frozen slots so the megakernel emitter can reference the data via frozen_%d pointers.
type ExecutionPlan ¶ added in v1.1.0
ExecutionPlan is a compiled, flat instruction sequence that replaces the interpreted node-by-node Forward() loop. Node outputs are stored in an indexed slot array instead of a map, eliminating map lookups.
func (*ExecutionPlan[T]) BufferLayout ¶ added in v1.1.0
func (p *ExecutionPlan[T]) BufferLayout() *BufferLayout
BufferLayout returns the computed buffer layout, or nil if buffers have not been pre-allocated.
func (*ExecutionPlan[T]) EnsureSlotsGPU ¶ added in v1.1.0
func (p *ExecutionPlan[T]) EnsureSlotsGPU(gpuSlotCache map[int]*tensor.TensorNumeric[T])
EnsureSlotsGPU uploads any CPU-resident scratch slot tensors to GPU. If a pre-allocated GPU tensor exists for the slot (from a previous capture), the CPU data is copied into it to preserve device addresses for CUDA graph replay. Otherwise a new GPU tensor is allocated and stored in gpuSlotCache for reuse.
This is called after pre-capture instructions run (e.g. EmbeddingLookup with quantized embedding tables that produce CPU tensors) to ensure the capture region sees only GPU-resident data.
func (*ExecutionPlan[T]) FrozenSlots ¶ added in v1.1.0
func (p *ExecutionPlan[T]) FrozenSlots() []FrozenSlot[T]
FrozenSlots returns the frozen (constant/parameter) slots and their data.
func (*ExecutionPlan[T]) HasPreallocatedBuffers ¶ added in v1.1.0
func (p *ExecutionPlan[T]) HasPreallocatedBuffers() bool
HasPreallocatedBuffers reports whether buffers have been pre-allocated.
func (*ExecutionPlan[T]) InputSlots ¶ added in v1.1.0
func (p *ExecutionPlan[T]) InputSlots() []int
InputSlots returns the slot indices that receive graph inputs.
func (*ExecutionPlan[T]) InstructionCount ¶ added in v1.1.0
func (p *ExecutionPlan[T]) InstructionCount() int
InstructionCount returns the number of instructions in the plan.
func (*ExecutionPlan[T]) InstructionOpName ¶ added in v1.1.0
func (p *ExecutionPlan[T]) InstructionOpName(i int) string
InstructionOpName returns the operation name of instruction at index i.
func (*ExecutionPlan[T]) InstructionOutputIdx ¶ added in v1.1.0
func (p *ExecutionPlan[T]) InstructionOutputIdx(i int) int
InstructionOutputIdx returns the output slot index of instruction at index i.
func (*ExecutionPlan[T]) Instructions ¶ added in v1.1.0
func (p *ExecutionPlan[T]) Instructions() []InstructionMeta
Instructions returns exported metadata for each compute instruction in the plan. The order matches the execution order.
func (*ExecutionPlan[T]) OutputSlot ¶ added in v1.1.0
func (p *ExecutionPlan[T]) OutputSlot() int
OutputSlot returns the slot index that holds the final output.
func (*ExecutionPlan[T]) OutputTensor ¶ added in v1.1.0
func (p *ExecutionPlan[T]) OutputTensor() *tensor.TensorNumeric[T]
OutputTensor returns the tensor currently in the output slot. Used by CUDAGraphExecutor to read the result after graph replay.
func (*ExecutionPlan[T]) PreallocateBuffers ¶ added in v1.1.0
func (p *ExecutionPlan[T]) PreallocateBuffers()
PreallocateBuffers creates pre-allocated tensors for all intermediate slots in the execution plan based on the slot shapes determined during compilation. After calling this method, RunInstructions will copy each Forward() result into the pre-allocated buffer, keeping memory addresses stable across runs.
Frozen and input slots are excluded since they are managed externally.
func (*ExecutionPlan[T]) PrepareSlots ¶ added in v1.1.0
func (p *ExecutionPlan[T]) PrepareSlots(inputs ...*tensor.TensorNumeric[T]) error
PrepareSlots initializes the scratch slot array and populates input slots. Must be called before RunInstructionRange.
func (*ExecutionPlan[T]) Run ¶ added in v1.1.0
func (p *ExecutionPlan[T]) Run(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
Run executes the compiled plan. It sets input tensors into the slot array, executes each instruction in sequence, and returns the output.
Run is not safe for concurrent use. The generator calls Run() sequentially, once per token.
func (*ExecutionPlan[T]) RunInstructionRange ¶ added in v1.1.0
func (p *ExecutionPlan[T]) RunInstructionRange(ctx context.Context, start, end int) error
RunInstructionRange executes instructions [start, end) using the shared slot array. The caller must have already populated input slots. This is used by CUDAGraphExecutor to split execution into capturable and non-capturable regions.
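The slot-array model behind PrepareSlots and RunInstructionRange can be illustrated with a toy plan. Everything below is a hypothetical stand-in (instr, plan, newDemoPlan, prepareSlots, runRange), with float64 values in place of tensors. It shows that running [0, 1) and then [1, 2) against the shared slots gives the same result as running the whole range at once.

```go
package main

import "fmt"

// instr is a hypothetical stand-in for Instruction: it reads its inputs from
// slots and writes its result to one output slot.
type instr struct {
	op  string
	in  []int
	out int
	fn  func(args []float64) float64
}

type plan struct {
	instrs     []instr
	slots      []float64
	inputSlots []int
	outputSlot int
}

// newDemoPlan builds a plan computing (a + b) * (a + b).
func newDemoPlan() *plan {
	return &plan{
		instrs: []instr{
			{op: "Add", in: []int{0, 1}, out: 2, fn: func(a []float64) float64 { return a[0] + a[1] }},
			{op: "Mul", in: []int{2, 2}, out: 3, fn: func(a []float64) float64 { return a[0] * a[1] }},
		},
		slots:      make([]float64, 4),
		inputSlots: []int{0, 1},
		outputSlot: 3,
	}
}

// prepareSlots copies the inputs into their slots.
func (p *plan) prepareSlots(inputs ...float64) error {
	if len(inputs) != len(p.inputSlots) {
		return fmt.Errorf("expected %d inputs, got %d", len(p.inputSlots), len(inputs))
	}
	for i, s := range p.inputSlots {
		p.slots[s] = inputs[i]
	}
	return nil
}

// runRange executes instructions [start, end) against the shared slot array.
func (p *plan) runRange(start, end int) error {
	if start < 0 || end > len(p.instrs) || start > end {
		return fmt.Errorf("invalid range [%d, %d)", start, end)
	}
	for _, ins := range p.instrs[start:end] {
		args := make([]float64, len(ins.in))
		for i, s := range ins.in {
			args[i] = p.slots[s]
		}
		p.slots[ins.out] = ins.fn(args)
	}
	return nil
}

func main() {
	p := newDemoPlan()
	if err := p.prepareSlots(2, 3); err != nil {
		panic(err)
	}
	_ = p.runRange(0, 1) // first region
	_ = p.runRange(1, 2) // second region
	fmt.Println(p.slots[p.outputSlot])
}
```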
func (*ExecutionPlan[T]) RunInstructions ¶ added in v1.1.0
func (p *ExecutionPlan[T]) RunInstructions(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
RunInstructions executes the instruction loop directly, bypassing the megakernel/graph capture hook. Used by CUDAGraphExecutor during warmup and capture phases.
func (*ExecutionPlan[T]) ScratchSlot ¶ added in v1.1.0
func (p *ExecutionPlan[T]) ScratchSlot(idx int) *tensor.TensorNumeric[T]
ScratchSlot returns the tensor at the given scratch slot index, or nil.
func (*ExecutionPlan[T]) SetMegakernelFn ¶ added in v1.1.0
func (p *ExecutionPlan[T]) SetMegakernelFn(fn func(context.Context, []*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error))
SetMegakernelFn sets an optional megakernel function that, when set, replaces the per-instruction execution loop in Run(). This allows a fused kernel to transparently handle the entire plan execution.
func (*ExecutionPlan[T]) SetScratchSlot ¶ added in v1.1.0
func (p *ExecutionPlan[T]) SetScratchSlot(idx int, t *tensor.TensorNumeric[T])
SetScratchSlot sets the tensor at the given scratch slot index.
func (*ExecutionPlan[T]) SlotShapes ¶ added in v1.1.0
func (p *ExecutionPlan[T]) SlotShapes() [][]int
SlotShapes returns the shape of each slot as determined during compilation. Nil entries indicate slots that were not populated during the warmup pass.
type FrozenSlot ¶ added in v1.1.0
type FrozenSlot[T tensor.Numeric] struct {
SlotIdx int
Data *tensor.TensorNumeric[T]
}
FrozenSlot describes a slot that holds frozen (constant) data such as model weights. The Data field holds the tensor from the warmup pass.
type Graph ¶
Graph represents a computation graph with a defined execution order.
func FoldConstantTransposes ¶ added in v1.1.0
FoldConstantTransposes removes Transpose nodes whose sole input is a constant (Parameter/Constant node). The transpose is pre-applied to the constant data and all consumers of the Transpose node are rewired to use the pre-transposed constant directly.
If the graph has no foldable transposes, the original graph is returned. The original graph should not be used after this call if a new graph is returned.
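The folding idea can be shown on a toy node list. This sketch is not the package's algorithm: gnode, transpose2D, and foldConstTransposes are hypothetical, constants are 2-D float64 matrices, and the Transpose node is replaced in place instead of rewiring consumers into a new graph.

```go
package main

import "fmt"

// gnode is a hypothetical graph node. inputs holds indices of producer nodes;
// data is only set for "Const" nodes.
type gnode struct {
	op     string
	inputs []int
	data   [][]float64
}

// transpose2D returns a new matrix that is the transpose of m.
func transpose2D(m [][]float64) [][]float64 {
	if len(m) == 0 {
		return nil
	}
	t := make([][]float64, len(m[0]))
	for j := range t {
		t[j] = make([]float64, len(m))
		for i := range m {
			t[j][i] = m[i][j]
		}
	}
	return t
}

// foldConstTransposes turns every Transpose whose sole input is a Const into
// a Const holding the pre-transposed data, and returns how many were folded.
// Consumers keep referring to the same node index, so they now read the
// pre-transposed constant directly.
func foldConstTransposes(nodes []gnode) int {
	folded := 0
	for i := range nodes {
		n := &nodes[i]
		if n.op != "Transpose" || len(n.inputs) != 1 {
			continue
		}
		src := nodes[n.inputs[0]]
		if src.op != "Const" {
			continue
		}
		*n = gnode{op: "Const", data: transpose2D(src.data)}
		folded++
	}
	return folded
}

func main() {
	nodes := []gnode{
		{op: "Const", data: [][]float64{{1, 2, 3}, {4, 5, 6}}},
		{op: "Transpose", inputs: []int{0}},
		{op: "Input"},
		{op: "MatMul", inputs: []int{2, 1}},
	}
	fmt.Println(foldConstTransposes(nodes), nodes[1].op, nodes[1].data)
}
```

The transpose cost is paid once at fold time instead of on every forward pass.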
func (*Graph[T]) Backward ¶
func (g *Graph[T]) Backward(ctx context.Context, mode types.BackwardMode, initialGradient *tensor.TensorNumeric[T]) error
Backward executes the backward pass of the entire graph. It is safe for concurrent use; callers will be serialized.
func (*Graph[T]) ClearMemo ¶ added in v1.1.0
func (g *Graph[T]) ClearMemo()
ClearMemo releases intermediate tensors from the last forward pass. Call this after Backward to free GPU device memory between training steps. Input tensors and parameter values are not released.
func (*Graph[T]) Compile ¶ added in v1.1.0
func (g *Graph[T]) Compile(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*ExecutionPlan[T], error)
Compile pre-compiles the graph into a flat ExecutionPlan. It runs one Forward() pass to determine tensor shapes, then assigns buffer indices and creates instruction kernels for each node.
func (*Graph[T]) CompileTraced ¶ added in v1.1.0
func (g *Graph[T]) CompileTraced(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*ExecutionPlan[T], error)
CompileTraced produces a primitive-op ExecutionPlan by tracing through the graph's Forward pass with the EngineProxy recording every engine call. Unlike Compile (which creates one instruction per graph node), CompileTraced decomposes composite nodes into their constituent engine calls, enabling the megakernel emitter to see only primitive operations.
func (*Graph[T]) ConstantTensors ¶ added in v1.1.0
func (g *Graph[T]) ConstantTensors() []*tensor.TensorNumeric[T]
ConstantTensors returns all constant/parameter weight tensors in the graph. Includes tensors from Parameter/Constant nodes, tensors embedded in nodes that implement EmbeddedFrozenProvider (e.g. LM head, gather), and all Parameter values from every node (e.g. attention and FFN weights). Call after graph construction to collect tensors for GPU pre-upload.
func (*Graph[T]) Dependencies ¶ added in v1.1.0
func (g *Graph[T]) Dependencies(n Node[T]) []Node[T]
Dependencies returns the dependencies of a given node.
func (*Graph[T]) EngineProxy ¶ added in v1.1.0
func (g *Graph[T]) EngineProxy() *compute.EngineProxy[T]
EngineProxy returns the EngineProxy if one was set, or nil.
func (*Graph[T]) Forward ¶
func (g *Graph[T]) Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
Forward executes the forward pass of the entire graph. It is safe for concurrent use; callers will be serialized. When parallel mode is enabled via WithParallel(true), independent nodes are executed concurrently using a goroutine pool.
func (*Graph[T]) GetAllNodes ¶ added in v1.1.0
func (g *Graph[T]) GetAllNodes() []Node[T]
GetAllNodes returns all nodes in the graph in their current order.
func (*Graph[T]) GetDependencies ¶ added in v1.1.0
func (g *Graph[T]) GetDependencies() map[Node[T]][]Node[T]
GetDependencies returns the dependency map for all nodes in the graph.
func (*Graph[T]) GetNodeMetadata ¶ added in v1.1.0
func (g *Graph[T]) GetNodeMetadata(n Node[T]) map[string]interface{}
GetNodeMetadata returns metadata for a specific node, including its type, attributes, and shape.
func (*Graph[T]) GetTopologicalOrder ¶ added in v1.1.0
func (g *Graph[T]) GetTopologicalOrder() ([]Node[T], error)
GetTopologicalOrder returns the nodes in topological order for execution.
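A topological order can be derived from a dependency list with Kahn's algorithm, sketched below. This is a generic illustration, not the package's implementation; topoOrder is a hypothetical name, nodes are integer indices, and reporting a cycle as the error is an assumption about why an ordering might fail.

```go
package main

import (
	"errors"
	"fmt"
)

// topoOrder returns an order in which every node appears after all of its
// dependencies. deps[i] lists the nodes that node i depends on.
func topoOrder(deps [][]int) ([]int, error) {
	n := len(deps)
	indegree := make([]int, n)    // number of unmet dependencies per node
	consumers := make([][]int, n) // reverse edges: who consumes node d
	for i, ds := range deps {
		indegree[i] = len(ds)
		for _, d := range ds {
			consumers[d] = append(consumers[d], i)
		}
	}

	queue := []int{}
	for i := 0; i < n; i++ {
		if indegree[i] == 0 {
			queue = append(queue, i)
		}
	}

	order := make([]int, 0, n)
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		order = append(order, cur)
		for _, c := range consumers[cur] {
			indegree[c]--
			if indegree[c] == 0 {
				queue = append(queue, c)
			}
		}
	}
	if len(order) != n {
		return nil, errors.New("graph contains a cycle")
	}
	return order, nil
}

func main() {
	// Node 0 is the output; it depends on 1 and 2, which both depend on 3.
	order, err := topoOrder([][]int{{1, 2}, {3}, {3}, {}})
	fmt.Println(order, err)
}
```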
func (*Graph[T]) Parameters ¶
func (g *Graph[T]) Parameters() []*Parameter[T]
Parameters returns all the trainable parameters in the graph.
func (*Graph[T]) SetEngineProxy ¶ added in v1.1.0
func (g *Graph[T]) SetEngineProxy(proxy *compute.EngineProxy[T])
SetEngineProxy stores a reference to the EngineProxy used by this graph's layers.
func (*Graph[T]) WithParallel ¶ added in v1.1.0
func (g *Graph[T]) WithParallel(enabled bool)
WithParallel enables or disables parallel execution of independent nodes. When enabled, Forward delegates to ParallelForward for concurrent execution. Default is false (sequential) for backward compatibility.
func (*Graph[T]) WithPool ¶ added in v1.1.0
func (g *Graph[T]) WithPool(pool TensorReleaser[T])
WithPool sets a tensor pool for intermediate buffer reuse during Forward. When set, the executor releases intermediate tensors back to the pool as soon as all their consumers have executed.
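Releasing an intermediate as soon as its last consumer has run amounts to reference counting over the dependency lists. The sketch below illustrates this with hypothetical names (releaser, recordingPool, runWithPool) and node indices in place of tensors. It assumes the nodes are already in topological order and that the final output is never released.

```go
package main

import "fmt"

// releaser is a hypothetical stand-in for TensorReleaser that takes a node
// index instead of a tensor.
type releaser interface {
	Release(node int)
}

// recordingPool remembers the order in which outputs were released.
type recordingPool struct{ released []int }

func (p *recordingPool) Release(node int) { p.released = append(p.released, node) }

// runWithPool walks the nodes in order. deps[i] lists the nodes whose outputs
// node i consumes. An output is handed back to the pool once its remaining
// consumer count drops to zero.
func runWithPool(deps [][]int, output int, pool releaser) {
	remaining := make([]int, len(deps))
	for _, ds := range deps {
		for _, d := range ds {
			remaining[d]++
		}
	}
	for i := range deps {
		// Node i would execute here, reading the outputs listed in deps[i].
		for _, d := range deps[i] {
			remaining[d]--
			if remaining[d] == 0 && d != output {
				pool.Release(d)
			}
		}
	}
}

func main() {
	// Diamond: 0 -> {1, 2} -> 3, with node 3 as the graph output.
	pool := &recordingPool{}
	runWithPool([][]int{{}, {0}, {0}, {1, 2}}, 3, pool)
	fmt.Println(pool.released)
}
```

In the diamond example, node 0's output stays alive until both node 1 and node 2 have run, then becomes available for reuse.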
type Instruction ¶ added in v1.1.0
type Instruction[T tensor.Numeric] struct {
Forward func(ctx context.Context, inputs []*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
InputIdx []int // indices into the slot array
OutputIdx int // index into the slot array
OpName string // for error reporting
ExtraArgs map[string]any // optional extra arguments (e.g. layer index for KV cache ops)
}
Instruction is a single pre-resolved operation in a compiled execution plan. It holds a direct function that calls node.Forward() with pre-computed buffer indices, eliminating dependency map lookups and memo operations.
type InstructionMeta ¶ added in v1.1.0
type InstructionMeta struct {
OpName string // operation type (e.g. "Add", "MatMulNBits", "RMSNorm")
InputIdx []int // slot indices for inputs
OutputIdx int // slot index for the output
ExtraArgs map[string]any // optional extra arguments (e.g. layer index for KV cache ops)
}
InstructionMeta is the exported metadata for a single compiled instruction. It contains everything needed by a code generator without exposing the Forward() closure.
type NoParameters ¶
NoParameters is a utility type for nodes that have no trainable parameters.
func (*NoParameters[T]) Parameters ¶
func (n *NoParameters[T]) Parameters() []*Parameter[T]
Parameters returns an empty slice of parameters.
type Node ¶
type Node[T tensor.Numeric] interface {
// OpType returns the operation type of the node, e.g., "ReLU", "Dense".
OpType() string
// Attributes returns a map of the node's non-tensor attributes.
Attributes() map[string]interface{}
// Forward computes the output of the node given the inputs.
Forward(ctx context.Context, inputs ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error)
// Backward computes the gradients of the node with respect to its inputs.
Backward(ctx context.Context, mode types.BackwardMode, outputGradient *tensor.TensorNumeric[T], inputs ...*tensor.TensorNumeric[T]) ([]*tensor.TensorNumeric[T], error)
// Parameters returns the trainable parameters of the node.
Parameters() []*Parameter[T]
// OutputShape returns the shape of the output tensor.
OutputShape() []int
}
Node represents a node in the computation graph.
type Parameter ¶
type Parameter[T tensor.Numeric] struct {
Name string
Value *tensor.TensorNumeric[T]
Gradient *tensor.TensorNumeric[T]
}
Parameter represents a trainable parameter in the graph.
func NewParameter ¶
func NewParameter[T tensor.Numeric](name string, value *tensor.TensorNumeric[T], newTensorFn func([]int, []T) (*tensor.TensorNumeric[T], error)) (*Parameter[T], error)
NewParameter creates a new parameter.
func (*Parameter[T]) AddGradient ¶
func (p *Parameter[T]) AddGradient(grad *tensor.TensorNumeric[T]) error
AddGradient adds the given gradient to the parameter's gradient.
func (*Parameter[T]) ClearGradient ¶
func (p *Parameter[T]) ClearGradient()
ClearGradient resets the parameter's gradient to zero.
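The accumulate-then-clear cycle implied by AddGradient and ClearGradient can be sketched with a simplified parameter. The param type and its methods are hypothetical stand-ins that use float64 slices, and the length check is an assumption about when accumulation would fail.

```go
package main

import "fmt"

// param is a hypothetical simplification of Parameter using float64 slices.
type param struct {
	Name     string
	Value    []float64
	Gradient []float64
}

// newParam creates a parameter whose gradient starts at zero.
func newParam(name string, value []float64) *param {
	return &param{Name: name, Value: value, Gradient: make([]float64, len(value))}
}

// addGradient accumulates grad into the stored gradient element by element.
func (p *param) addGradient(grad []float64) error {
	if len(grad) != len(p.Gradient) {
		return fmt.Errorf("gradient length %d does not match parameter length %d", len(grad), len(p.Gradient))
	}
	for i, g := range grad {
		p.Gradient[i] += g
	}
	return nil
}

// clearGradient resets the stored gradient to zero.
func (p *param) clearGradient() {
	for i := range p.Gradient {
		p.Gradient[i] = 0
	}
}

func main() {
	w := newParam("w", []float64{0.5, -0.5})
	_ = w.addGradient([]float64{1, 2}) // contribution from one consumer
	_ = w.addGradient([]float64{1, 2}) // contribution from another consumer
	fmt.Println(w.Gradient)
	w.clearGradient() // typically done before the next training step
	fmt.Println(w.Gradient)
}
```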
type TensorReleaser ¶ added in v1.1.0
type TensorReleaser[T tensor.Numeric] interface { Release(t *tensor.TensorNumeric[T]) }
TensorReleaser can release tensors back to a pool for reuse.
type Transposer ¶ added in v1.1.0
type Transposer[T tensor.Numeric] interface { Transpose(ctx context.Context, a *tensor.TensorNumeric[T], axes []int, dst ...*tensor.TensorNumeric[T]) (*tensor.TensorNumeric[T], error) }
Transposer is the minimal interface needed by FoldConstantTransposes to pre-apply transpose operations on constant tensors. The signature matches compute.Engine[T].Transpose (variadic dst for buffer reuse).