Version: 0.2.3

Normalisers

Normalisers are the content transformation layer of Sercha. They convert raw bytes from connectors into structured documents ready for embedding and search.

What Normalisers Do

When a connector produces a raw document, the normaliser registry selects the appropriate normaliser based on MIME type and connector type. The normaliser then extracts text content and metadata, producing a document suitable for the embedding pipeline.

Core Responsibilities

| Responsibility | Description |
| --- | --- |
| Content extraction | Extract readable text from various file formats |
| Metadata preservation | Retain useful metadata from the original content |
| Format handling | Parse format-specific structures (PDF pages, HTML elements, Markdown sections) |
| Text cleaning | Remove formatting artifacts and non-searchable content |

Normaliser Registry

The normaliser registry maintains all registered normalisers and dispatches documents to the appropriate handler. Selection follows a priority-based system that allows specialised normalisers to take precedence over generic ones.

Selection Process

When a document arrives, the registry:

  1. Filters normalisers that support the document's MIME type
  2. Further filters by connector type if applicable
  3. Sorts remaining candidates by priority (highest first)
  4. Selects the highest priority normaliser
  5. Invokes normalisation
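
The selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not Sercha's actual code: the `Normaliser` class, its field names, the `select_normaliser` function, and the `filesystem` connector name are all hypothetical, and exact-match MIME comparison is assumed for simplicity.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Normaliser:
    """Hypothetical normaliser descriptor; field names are assumptions."""
    name: str
    mime_types: FrozenSet[str]
    priority: int
    # An empty set means "not tied to any connector" in this sketch.
    connector_types: FrozenSet[str] = frozenset()


def select_normaliser(
    registry: List[Normaliser], mime_type: str, connector_type: str
) -> Optional[Normaliser]:
    # Step 1: keep normalisers that support the document's MIME type.
    candidates = [n for n in registry if mime_type in n.mime_types]
    # Step 2: drop normalisers bound to a different connector.
    candidates = [
        n for n in candidates
        if not n.connector_types or connector_type in n.connector_types
    ]
    # Steps 3-4: sort by priority, highest first, and take the first one.
    candidates.sort(key=lambda n: n.priority, reverse=True)
    return candidates[0] if candidates else None


registry = [
    Normaliser("plain-text", frozenset({"text/plain", "text/markdown"}), 5),
    Normaliser("markdown", frozenset({"text/markdown"}), 50),
]

chosen = select_normaliser(registry, "text/markdown", "filesystem")
print(chosen.name if chosen else "no match")  # prints "markdown"
```

Returning `None` when nothing matches leaves the caller to decide how to treat an unmatched document (step 5, invoking the normaliser, is left out of the sketch).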

Priority System

Normalisers declare a priority value that determines selection order when multiple normalisers match a document. Higher values indicate higher priority.

| Priority Range | Purpose | Examples |
| --- | --- | --- |
| 90-100 | Connector-specific normalisers | GitHub issues, GitHub pull requests |
| 50-89 | Format-specific normalisers | Markdown, HTML, PDF |
| 1-9 | Fallback normalisers | Plain text catchall |

This system ensures that a GitHub issue (with its custom MIME type) is handled by the GitHub-specific normaliser rather than a generic JSON parser, whilst allowing plain text to act as a universal fallback.

Connector-Specific Normalisers

Some normalisers are designed for content from specific connectors. These normalisers declare which connector types they support, allowing them to handle custom content formats that only appear from certain sources.

| Normaliser | Connector | MIME Type | Purpose |
| --- | --- | --- | --- |
| GitHub Issue | GitHub | application/vnd.github.issue+json | Formats issue threads with comments |
| GitHub Pull Request | GitHub | application/vnd.github.pull+json | Formats PR threads with review comments |

Connector-specific normalisers occupy the highest priority range (90-100), so they take precedence over format-specific and fallback normalisers for their specialised content types.
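
As a rough illustration of how a connector declaration and a high priority combine, the sketch below registers two handlers for the GitHub issue MIME type. Everything apart from the MIME type itself is assumed for the example: the `Entry` tuple, the `pick` function, the `generic-json` handler, the `filesystem` connector name, and the priority value 95 (chosen only because it sits inside the documented 90-100 range).

```python
from typing import NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    name: str
    mime_type: str
    priority: int
    connectors: Tuple[str, ...]  # empty tuple: not tied to a connector


ISSUE_MIME = "application/vnd.github.issue+json"

entries = [
    # Assumed priority inside the documented 90-100 range.
    Entry("github-issue", ISSUE_MIME, 95, ("github",)),
    # Hypothetical generic handler registered for the same MIME type.
    Entry("generic-json", ISSUE_MIME, 50, ()),
]


def pick(mime_type: str, connector: str) -> Optional[str]:
    # Keep entries for this MIME type that accept the given connector.
    matches = [
        e for e in entries
        if e.mime_type == mime_type
        and (not e.connectors or connector in e.connectors)
    ]
    if not matches:
        return None
    # The highest declared priority wins.
    return max(matches, key=lambda e: e.priority).name


print(pick(ISSUE_MIME, "github"))      # prints "github-issue"
print(pick(ISSUE_MIME, "filesystem"))  # prints "generic-json"
```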

Built-in Normalisers

Sercha includes normalisers for common document formats:

| Format | MIME Types | Priority | Description |
| --- | --- | --- | --- |
| PDF | application/pdf | 70 | Extracts text from PDF documents |
| Markdown | text/markdown | 50 | Preserves structure whilst extracting content |
| HTML | text/html | 50 | Strips tags, extracts text content |
| Email | message/rfc822 | 50 | Parses email headers and body |
| Calendar | text/calendar | 50 | Extracts event details from ICS files |
| Office Documents | application/vnd.openxmlformats-* | 50 | Extracts text from DOCX and similar formats |
| Plain Text | text/* | 5 | Fallback for any text content |

Plain Text Fallback

The plain text normaliser acts as a universal fallback for text content. It matches a broad range of MIME types and has the lowest priority, ensuring it only handles documents when no specialised normaliser is available.

Supported fallback types include:

  • Programming languages (text/x-go, text/x-python, text/typescript, etc.)
  • Configuration files (text/yaml, text/toml, application/json)
  • General text (text/plain)
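
One way such broad matching could work is wildcard comparison against a short pattern list. The sketch below is an assumption about the mechanism, not a description of Sercha's implementation: `FALLBACK_PATTERNS` and `is_fallback_type` are hypothetical names, and `application/json` is listed explicitly because it appears in the list above but does not match the `text/*` pattern.

```python
from fnmatch import fnmatchcase

# Assumed pattern list: "text/*" comes from the built-in normalisers table,
# and application/json is added because the list above names it.
FALLBACK_PATTERNS = ("text/*", "application/json")


def is_fallback_type(mime_type: str) -> bool:
    # Drop parameters such as "; charset=utf-8" and normalise case first.
    base = mime_type.split(";", 1)[0].strip().lower()
    return any(fnmatchcase(base, pattern) for pattern in FALLBACK_PATTERNS)


for mime in ("text/x-go", "text/yaml", "application/json", "image/png"):
    print(mime, is_fallback_type(mime))
```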

Normalisation Output

Each normaliser produces a result containing:

| Field | Description |
| --- | --- |
| Content | Extracted text ready for embedding |
| Metadata | Preserved metadata from the original document |
| Title | Document title if available |

The output document retains the original URI, source ID, and other identifying information from the raw document. Only the content and metadata are transformed.
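
To make the shape of this output concrete, here is a minimal sketch. The type names, field names, and the `normalise_plain_text` function are hypothetical, and using the first non-empty line as the title is an assumption made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RawDocument:
    """Hypothetical raw document as produced by a connector."""
    uri: str
    source_id: str
    mime_type: str
    data: bytes


@dataclass
class NormalisedDocument:
    # Identifying fields, copied unchanged from the raw document.
    uri: str
    source_id: str
    # Fields produced by the normaliser.
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)
    title: Optional[str] = None


def normalise_plain_text(raw: RawDocument) -> NormalisedDocument:
    text = raw.data.decode("utf-8", errors="replace")
    lines = text.strip().splitlines()
    # Assumption: treat the first non-empty line as the title.
    first_line = lines[0] if lines else ""
    return NormalisedDocument(
        uri=raw.uri,
        source_id=raw.source_id,
        content=text,
        metadata={"mime_type": raw.mime_type},
        title=first_line or None,
    )


raw = RawDocument(
    "file:///notes/todo.txt", "src-1", "text/plain", b"Shopping list\nmilk\neggs\n"
)
doc = normalise_plain_text(raw)
print(doc.uri, doc.source_id, doc.title)
```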

Processing Pipeline

Normalisation is one stage in the broader sync pipeline.

After normalisation, documents proceed through chunking (splitting large documents into smaller pieces) and embedding (converting text to vectors) before being stored for search.
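
The downstream stages can be pictured with a toy example. This is only a sketch of the general idea: fixed-size word windows stand in for whatever chunking strategy Sercha actually uses, `toy_embed` is a placeholder for a real embedding model, and all function names are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def chunk_words(text: str, max_words: int) -> List[str]:
    # Split normalised content into consecutive windows of at most max_words.
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def toy_embed(chunk: str) -> Vector:
    # Placeholder for a real model: character count and word count.
    return [float(len(chunk)), float(len(chunk.split()))]


def run_pipeline(
    content: str, embed: Callable[[str], Vector], max_words: int = 4
) -> List[Tuple[str, Vector]]:
    # Chunk first, then embed each chunk; the pairs are what would be stored.
    return [(piece, embed(piece)) for piece in chunk_words(content, max_words)]


stored = run_pipeline("one two three four five six", toy_embed)
for piece, vector in stored:
    print(repr(piece), vector)
```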

Error Handling

| Error Type | Handling |
| --- | --- |
| No matching normaliser | Falls back to the plain text normaliser if the document has a text MIME type |
| Parse error | Returns an error; the document is skipped |
| Empty content | Document is stored with empty content |
| Encoding error | Attempts UTF-8 conversion |

Documents that fail normalisation are logged and skipped. The sync continues with remaining documents.
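
A log-and-skip loop of this kind might look like the sketch below. The `NormalisationError` exception, the `normalise_all` function, and the sample `strict_parse` normaliser are hypothetical names introduced for illustration, not part of Sercha's API.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("sync")


class NormalisationError(Exception):
    """Hypothetical error raised when a document cannot be parsed."""


def normalise_all(
    docs: Iterable[Tuple[str, bytes]],
    normalise: Callable[[bytes], str],
) -> List[Tuple[str, str]]:
    results = []
    for uri, data in docs:
        try:
            results.append((uri, normalise(data)))
        except NormalisationError as exc:
            # Log the failure, skip this document, and carry on with the rest.
            logger.warning("skipping %s: %s", uri, exc)
    return results


def strict_parse(data: bytes) -> str:
    # Toy rule for the example: a leading NUL byte counts as unparseable.
    if data.startswith(b"\x00"):
        raise NormalisationError("unparseable content")
    return data.decode("utf-8", errors="replace")


docs = [("a.txt", b"hello"), ("b.bin", b"\x00\x01"), ("c.txt", b"")]
results = normalise_all(docs, strict_parse)
print(results)  # prints [('a.txt', 'hello'), ('c.txt', '')]
```

In this sketch the empty document is kept with empty content, matching the table above, while the unparseable one is dropped without interrupting the loop.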

Limitations

| Limitation | Description |
| --- | --- |
| Binary content | Images, audio, and video are not processed |
| Encrypted files | Password-protected documents cannot be normalised |
| Very large files | Memory constraints apply to PDF and Office documents |
| Complex layouts | Multi-column PDFs may have extraction issues |
