Skip to main content
Version: 0.2.3

Custom Pipeline Processors

The document processing pipeline transforms documents before they're indexed. This guide explains how to create and register custom processors by modifying the Sercha source code.

Overview

When documents are synced, they pass through a pipeline of processors:

Document → [Processor 1] → [Processor 2] → ... → Chunks → Index

The default pipeline includes a chunker that splits documents into searchable chunks. You can add custom processors for tasks like:

  • Content transformation (cleaning, formatting)
  • Metadata extraction
  • Language detection
  • Custom chunking strategies

Architecture

The pipeline system consists of three components:

ComponentDescriptionSource
Port InterfaceThe contract processors must implementinternal/core/ports/driven/postprocessor.go
RegistryMaps processor names to buildersinternal/postprocessors/registry.go
PipelineChains processors and executes theminternal/postprocessors/pipeline.go

The PostProcessor Interface

Custom processors must implement the PostProcessor interface:

// internal/core/ports/driven/postprocessor.go

type PostProcessor interface {
// Name returns the processor name for logging and configuration.
Name() string

// Process takes a document and returns chunks.
// If the processor modifies chunks (e.g., stemming), it receives and returns chunks.
// If the processor creates chunks (e.g., chunker), it receives nil and returns new chunks.
Process(ctx context.Context, doc *domain.Document, chunks []domain.Chunk) ([]domain.Chunk, error)
}

See the full interface: postprocessor.go

Creating a Custom Processor

1. Implement the Interface

Create a new package for your processor in internal/postprocessors/:

// internal/postprocessors/myprocessor/processor.go
package myprocessor

import (
"context"
"github.com/custodia-labs/sercha-cli/internal/core/domain"
)

type Processor struct {
// Your configuration fields
}

func New(opts ...Option) *Processor {
return &Processor{
// Initialize with defaults or options
}
}

func (p *Processor) Name() string {
return "myprocessor"
}

func (p *Processor) Process(
ctx context.Context,
doc *domain.Document,
chunks []domain.Chunk,
) ([]domain.Chunk, error) {
// Transform chunks or create new ones
// Return modified chunks
return chunks, nil
}

Reference implementation: internal/postprocessors/chunker/processor.go

2. Create a Builder Function

Add a builder function that creates your processor from configuration:

// internal/postprocessors/defaults.go

func buildMyProcessor(cfg map[string]any) (driven.PostProcessor, error) {
var opts []myprocessor.Option

// Parse config values
if val := getIntFromConfig(cfg, "my_setting"); val > 0 {
opts = append(opts, myprocessor.WithMySetting(val))
}

return myprocessor.New(opts...), nil
}

Reference: internal/postprocessors/defaults.go

3. Register the Processor

Add your processor to the registry in RegisterDefaults:

// internal/postprocessors/defaults.go

func RegisterDefaults(r *Registry) {
r.Register("chunker", buildChunker)
r.Register("myprocessor", buildMyProcessor) // Add this line
}

The registry maps the processor name (used in config) to its builder function.

Configuration

Once registered, enable your processor in ~/.sercha/config.toml:

[pipeline]
processors = ["myprocessor", "chunker"]

[pipeline.myprocessor]
my_setting = 100
another_option = "value"

[pipeline.chunker]
chunk_size = 1000
overlap = 200

Processors run in the order listed. For most use cases, the chunker should be last.

Pipeline Execution

The pipeline is constructed in cmd/sercha/main.go:

// Create PostProcessor pipeline from configuration
pipelineCfg := settingsSvc.GetPipelineConfig()
processorRegistry := postprocessors.NewRegistry()
postprocessors.RegisterDefaults(processorRegistry)

pipeline := postprocessors.NewPipeline()
for _, name := range pipelineCfg.Processors {
cfg := pipelineCfg.GetProcessorConfig(name)
processor, err := processorRegistry.Build(name, cfg)
if err != nil {
log.Printf("failed to build processor %s: %v", name, err)
return 1
}
pipeline.Add(processor)
}

Key Source Files

FilePurpose
internal/core/ports/driven/postprocessor.goPort interface definition
internal/postprocessors/registry.goProcessor registry
internal/postprocessors/defaults.goDefault registration and builders
internal/postprocessors/pipeline.goPipeline implementation
internal/postprocessors/chunker/Reference processor implementation
cmd/sercha/main.goPipeline wiring

Default Chunker Configuration

The built-in chunker processor accepts these configuration options:

OptionTypeDefaultDescription
chunk_sizeint1000Characters per chunk
overlapint200Overlapping characters between chunks
[pipeline.chunker]
chunk_size = 1000
overlap = 200