Version: 0.2.3

Custom Pipeline Processors

The document processing pipeline transforms documents before they're indexed. This guide explains how to create and register custom processors by modifying the Sercha source code.

Overview

When documents are synced, they pass through a pipeline of processors:

Document → [Processor 1] → [Processor 2] → ... → Chunks → Index

The default pipeline includes a chunker that splits documents into searchable chunks. You can add custom processors for tasks like:

Content transformation (cleaning, formatting)
Metadata extraction
Language detection
Custom chunking strategies

Architecture

The pipeline system consists of three components:

Component	Description	Source
Port Interface	The contract processors must implement	`internal/core/ports/driven/postprocessor.go`
Registry	Maps processor names to builders	`internal/postprocessors/registry.go`
Pipeline	Chains processors and executes them	`internal/postprocessors/pipeline.go`

The PostProcessor Interface

Custom processors must implement the PostProcessor interface:

// internal/core/ports/driven/postprocessor.go

type PostProcessor interface {
    // Name returns the processor name for logging and configuration.
    Name() string

    // Process takes a document and returns chunks.
    // If the processor modifies chunks (e.g., stemming), it receives and returns chunks.
    // If the processor creates chunks (e.g., chunker), it receives nil and returns new chunks.
    Process(ctx context.Context, doc *domain.Document, chunks []domain.Chunk) ([]domain.Chunk, error)
}

See the full interface in internal/core/ports/driven/postprocessor.go.

Creating a Custom Processor

1. Implement the Interface

Create a new package for your processor in internal/postprocessors/:

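// internal/postprocessors/myprocessor/processor.go
package myprocessor

import (
    "context"

    "github.com/custodia-labs/sercha-cli/internal/core/domain"
)

// Option configures a Processor at construction time.
type Option func(*Processor)

type Processor struct {
    // Your configuration fields
    mySetting int // example field, set through WithMySetting
}

// WithMySetting sets the example field; the builder in step 2 calls it.
func WithMySetting(v int) Option {
    return func(p *Processor) { p.mySetting = v }
}

// New creates a Processor with defaults, then applies the given options.
func New(opts ...Option) *Processor {
    p := &Processor{}
    for _, opt := range opts {
        opt(p)
    }
    return p
}

// Name returns the processor name for logging and configuration.
func (p *Processor) Name() string {
    return "myprocessor"
}

// Process receives the document and the chunks produced so far.
func (p *Processor) Process(
    ctx context.Context,
    doc *domain.Document,
    chunks []domain.Chunk,
) ([]domain.Chunk, error) {
    // Transform chunks or create new ones, then return the result
    return chunks, nil
}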

Reference implementation: internal/postprocessors/chunker/processor.go
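
As a concrete illustration of the skeleton above, here is a minimal sketch of a processor that cleans up chunk text. The Document and Chunk types and their Content field are stand-ins invented for this example so that it compiles on its own; the real domain types may have different fields, and WhitespaceNormalizer is a hypothetical name, not part of Sercha.

```go
package main

import (
	"context"
	"strings"
)

// Document and Chunk are simplified stand-ins for the domain types.
// The Content field is an assumption made for this sketch.
type Document struct {
	Content string
}

type Chunk struct {
	Content string
}

// WhitespaceNormalizer collapses runs of whitespace inside each chunk.
type WhitespaceNormalizer struct{}

// Name returns the name this processor would be registered under.
func (WhitespaceNormalizer) Name() string { return "whitespace" }

// collapseWhitespace trims s and replaces every run of whitespace with one space.
func collapseWhitespace(s string) string {
	// strings.Fields splits on any whitespace and drops empty fields.
	return strings.Join(strings.Fields(s), " ")
}

// Process returns cleaned copies of the chunks and leaves the input slice untouched.
func (WhitespaceNormalizer) Process(ctx context.Context, doc *Document, chunks []Chunk) ([]Chunk, error) {
	out := make([]Chunk, 0, len(chunks))
	for _, c := range chunks {
		// Stop early if the caller cancelled the operation.
		if err := ctx.Err(); err != nil {
			return nil, err
		}
		c.Content = collapseWhitespace(c.Content)
		out = append(out, c)
	}
	return out, nil
}
```

Because this sketch only rewrites chunks it is handed, it returns an empty result when called with nil chunks; a processor meant to run before any chunks exist would need to work on the document instead.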

2. Create a Builder Function

Add a builder function that creates your processor from configuration:

// internal/postprocessors/defaults.go

func buildMyProcessor(cfg map[string]any) (driven.PostProcessor, error) {
    var opts []myprocessor.Option

    // Parse config values
    if val := getIntFromConfig(cfg, "my_setting"); val > 0 {
        opts = append(opts, myprocessor.WithMySetting(val))
    }

    return myprocessor.New(opts...), nil
}

Reference: internal/postprocessors/defaults.go
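
The builder above relies on a helper that reads an integer out of the untyped config map. The following is a hypothetical version of such a helper, written only to show what the lookup has to cope with; the actual getIntFromConfig in defaults.go may behave differently. It assumes that integers can arrive as int, int64, or float64 depending on how the config was decoded.

```go
package main

// getIntFromConfig returns cfg[key] as an int. It returns 0 when the key is
// missing or does not hold a whole number, which lets callers treat 0 as
// "not set". This is an illustrative sketch, not the helper from defaults.go.
func getIntFromConfig(cfg map[string]any, key string) int {
	switch v := cfg[key].(type) {
	case int:
		return v
	case int64: // many TOML decoders produce int64 for integers
		return int(v)
	case float64: // JSON-style decoders produce float64 for all numbers
		if v == float64(int(v)) {
			return int(v)
		}
	}
	return 0
}
```

Returning 0 for anything unusable pairs with the `val > 0` check in the builder: a missing or malformed setting leaves the processor's default in place.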

3. Register the Processor

Add your processor to the registry in RegisterDefaults:

// internal/postprocessors/defaults.go

func RegisterDefaults(r *Registry) {
    r.Register("chunker", buildChunker)
    r.Register("myprocessor", buildMyProcessor)  // Add this line
}

The registry maps the processor name (used in config) to its builder function.
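
To make the name-to-builder mapping concrete, here is a minimal sketch of how such a registry could work. It is not the code from registry.go: the Builder type name, the trimmed PostProcessor interface, the error message, and stubProcessor are all assumptions made so the example is self-contained. Only the Register and Build method names come from the snippets in this guide.

```go
package main

import (
	"fmt"
	"sort"
)

// PostProcessor is a trimmed stand-in for the port interface.
type PostProcessor interface {
	Name() string
}

// Builder creates a processor from its config table.
type Builder func(cfg map[string]any) (PostProcessor, error)

// Registry maps processor names to builders.
type Registry struct {
	builders map[string]Builder
}

func NewRegistry() *Registry {
	return &Registry{builders: make(map[string]Builder)}
}

// Register associates name with a builder, replacing any earlier entry.
func (r *Registry) Register(name string, b Builder) {
	r.builders[name] = b
}

// Build looks up the builder for name and invokes it with cfg.
func (r *Registry) Build(name string, cfg map[string]any) (PostProcessor, error) {
	b, ok := r.builders[name]
	if !ok {
		return nil, fmt.Errorf("unknown processor %q (registered: %v)", name, r.names())
	}
	return b(cfg)
}

// names returns the registered names sorted, for stable error messages.
func (r *Registry) names() []string {
	out := make([]string, 0, len(r.builders))
	for n := range r.builders {
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

// stubProcessor is a placeholder implementation used to exercise the registry.
type stubProcessor string

func (s stubProcessor) Name() string { return string(s) }
```

In this sketch, a name listed in the config but never registered surfaces as an error from Build, which is the failure the wiring code would log.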

Configuration

Once registered, enable your processor in ~/.sercha/config.toml:

[pipeline]
processors = ["myprocessor", "chunker"]

[pipeline.myprocessor]
my_setting = 100
another_option = "value"

[pipeline.chunker]
chunk_size = 1000
overlap = 200

Processors run in the order listed. For most use cases, the chunker should be last.
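
For readers who want to see how the TOML above might look once loaded, here is a hypothetical in-memory form. The PipelineConfig struct, its Settings field, and exampleConfig are invented for this sketch; only the Processors field and the GetProcessorConfig method name appear in the Sercha snippets in this guide, and their real definitions may differ. Integers are written as int64 on the assumption that the TOML decoder produces that type.

```go
package main

// PipelineConfig is a hypothetical in-memory form of the [pipeline] table.
type PipelineConfig struct {
	Processors []string                  // value of `processors`, in run order
	Settings   map[string]map[string]any // one entry per [pipeline.<name>] table
}

// GetProcessorConfig returns the settings for name, or an empty map when the
// config has no [pipeline.<name>] table, so builders can fall back to defaults.
func (c PipelineConfig) GetProcessorConfig(name string) map[string]any {
	if cfg, ok := c.Settings[name]; ok {
		return cfg
	}
	return map[string]any{}
}

// exampleConfig mirrors the TOML shown above.
func exampleConfig() PipelineConfig {
	return PipelineConfig{
		Processors: []string{"myprocessor", "chunker"},
		Settings: map[string]map[string]any{
			"myprocessor": {"my_setting": int64(100), "another_option": "value"},
			"chunker":     {"chunk_size": int64(1000), "overlap": int64(200)},
		},
	}
}
```

Keeping the run order in a slice and the settings in a separate map means that reordering processors never touches their individual tables.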

Pipeline Execution

The pipeline is constructed in cmd/sercha/main.go:

// Create PostProcessor pipeline from configuration
pipelineCfg := settingsSvc.GetPipelineConfig()
processorRegistry := postprocessors.NewRegistry()
postprocessors.RegisterDefaults(processorRegistry)

pipeline := postprocessors.NewPipeline()
for _, name := range pipelineCfg.Processors {
    cfg := pipelineCfg.GetProcessorConfig(name)
    processor, err := processorRegistry.Build(name, cfg)
    if err != nil {
        log.Printf("failed to build processor %s: %v", name, err)
        return 1
    }
    pipeline.Add(processor)
}

Key Source Files

File	Purpose
`internal/core/ports/driven/postprocessor.go`	Port interface definition
`internal/postprocessors/registry.go`	Processor registry
`internal/postprocessors/defaults.go`	Default registration and builders
`internal/postprocessors/pipeline.go`	Pipeline implementation
`internal/postprocessors/chunker/`	Reference processor implementation
`cmd/sercha/main.go`	Pipeline wiring

Default Chunker Configuration

The built-in chunker processor accepts these configuration options:

Option	Type	Default	Description
`chunk_size`	int	1000	Characters per chunk
`overlap`	int	200	Overlapping characters between chunks

[pipeline.chunker]
chunk_size = 1000
overlap = 200
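
To show how chunk_size and overlap interact, here is a minimal sketch of fixed-size chunking with overlap. It is not the built-in chunker: the function name chunkText is hypothetical, the validation rules are an assumption, and the real implementation may, for example, prefer word or sentence boundaries. The sketch counts characters as runes rather than bytes.

```go
package main

import "errors"

// chunkText splits text into windows of at most size characters, where each
// window starts size-overlap characters after the previous one.
func chunkText(text string, size, overlap int) ([]string, error) {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil, errors.New("need size > 0 and 0 <= overlap < size")
	}
	runes := []rune(text) // count characters, not bytes
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the final window reached the end of the text
		}
	}
	return chunks, nil
}
```

Under this scheme the defaults advance 800 characters per chunk, so a 2,000-character document would yield three chunks starting at offsets 0, 800, and 1600. An overlap equal to or larger than the chunk size is rejected because the window would never move forward.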
