Version: 0.2.3

GitHub Connector

The GitHub connector indexes content from GitHub repositories accessible to the authenticated user. This includes repositories the user owns, repositories where the user is a collaborator, and repositories in organisations where the user is a member.

Content Types

The connector can index four types of content from each repository:

| Content Type | Description |
| --- | --- |
| Files | Source code and documentation files from the repository |
| Issues | Issue threads including comments |
| Pull Requests | Pull request threads including comments |
| Wikis | Wiki pages if the repository has a wiki enabled |

By default, all content types are indexed. Use the content_types configuration option to limit indexing to specific types.

Capabilities

| Capability | Supported | Notes |
| --- | --- | --- |
| Full sync | Yes | Indexes all content from all accessible repositories |
| Incremental sync | Yes | Uses cursors to track changes per repository |
| Watch mode | No | Webhook integration not available in CLI |
| Hierarchy | Yes | Files preserve directory structure |
| Binary content | No | Text files only |
| Validation | Yes | Verifies credentials before sync |

Authentication

The GitHub connector requires authentication to access repositories. Two methods are supported: Personal Access Tokens (PAT) and OAuth.

| Method | Best For | Setup Complexity |
| --- | --- | --- |
| Personal Access Token | Individual users, quick setup | Low |
| OAuth | Shared configurations, automatic token refresh | Medium |

Both authentication methods provide 5,000 API requests per hour. Unauthenticated requests (60 per hour) are not supported.
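As a rough way to confirm that a token receives the authenticated quota, the sketch below queries GitHub's `GET /rate_limit` endpoint and reads the `core` resource. The function names are hypothetical and are not part of the connector; only the endpoint and the response shape come from GitHub's REST API.

```python
import json
import urllib.request

RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def build_request(token: str) -> urllib.request.Request:
    """Build an authenticated request for the rate limit endpoint."""
    return urllib.request.Request(
        RATE_LIMIT_URL,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def core_limit(payload: dict) -> tuple[int, int]:
    """Return (limit, remaining) for the core REST API quota."""
    core = payload["resources"]["core"]
    return core["limit"], core["remaining"]


def fetch_core_limit(token: str) -> tuple[int, int]:
    """Perform the network call; requires a valid token."""
    with urllib.request.urlopen(build_request(token), timeout=10) as resp:
        return core_limit(json.load(resp))
```

With a valid token, `fetch_core_limit(token)` should report a limit of 5,000 for a standard user account.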

Setting Up a Personal Access Token

Personal access tokens provide a straightforward way to authenticate. GitHub offers two token types: fine-grained (recommended) and classic.

Fine-grained tokens offer granular permission control and can be scoped to specific repositories.

Creating a Fine-Grained Token

  1. Navigate to github.com/settings/personal-access-tokens
  2. Click Generate new token
  3. Enter a descriptive name for the token
  4. Set an expiration period (GitHub recommends setting an expiry)
  5. Under Repository access, select which repositories the token can access:
    • All repositories for full access
    • Only select repositories to limit scope
  6. Under Permissions, expand Repository permissions and set the following:

| Permission | Access Level | Purpose |
| --- | --- | --- |
| Actions | Read | Access workflow information |
| Code | Read | Read repository files |
| Discussions | Read | Access discussion threads |
| Issues | Read | Read issues and comments |
| Merge queues | Read | Access merge queue status |
| Metadata | Read | Required for basic repository access |
| Pages | Read | Access GitHub Pages content |
| Pull requests | Read | Read pull requests and comments |

  7. Click Generate token
  8. Copy the token immediately (it will not be shown again)

Minimal Permissions

For basic file indexing only, you can use a smaller permission set: Metadata (Read) and Code (Read). Add other permissions based on which content types you want to index.

Classic Tokens

Classic tokens use broader scope-based permissions. They are simpler to configure but offer less granular control.

Creating a Classic Token

  1. Navigate to github.com/settings/tokens
  2. Click Generate new token (classic)
  3. Enter a descriptive note for the token
  4. Set an expiration period
  5. Select the repo scope (provides full access to repositories)
  6. Click Generate token
  7. Copy the token immediately

| Scope | Purpose |
| --- | --- |
| repo | Full control of private repositories (includes read access to code, issues, PRs) |

Token Expiry

GitHub automatically removes personal access tokens that have not been used for one year. Set a reasonable expiry and regenerate tokens as needed.

Authentication References

The setup guides above are based on GitHub's official documentation - both are up to date as of December 2025:

Configuration

Configuration options are specified when creating a source:

| Option | Description | Default |
| --- | --- | --- |
| content_types | Comma-separated list of content to index | All types |
| file_patterns | Comma-separated glob patterns for file filtering | All files |

Content Types

Valid values for content_types:

| Value | Description |
| --- | --- |
| files | Repository source files |
| issues | Issue threads |
| prs | Pull request threads |
| wikis | Wiki pages |
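To illustrate how a comma-separated content_types value could be interpreted, here is a minimal sketch. The function name is hypothetical, and the whitespace trimming, case folding, and rejection of unknown values are assumptions rather than documented connector behaviour.

```python
from typing import Optional

VALID_CONTENT_TYPES = ("files", "issues", "prs", "wikis")


def parse_content_types(value: Optional[str]) -> list[str]:
    """Turn a comma-separated option value into a list of content types."""
    # An unset or empty option falls back to the documented default: all types.
    if value is None or not value.strip():
        return list(VALID_CONTENT_TYPES)

    selected: list[str] = []
    for item in value.split(","):
        name = item.strip().lower()  # assumption: tolerate spaces and case
        if not name:
            continue
        if name not in VALID_CONTENT_TYPES:
            raise ValueError(f"unknown content type: {name!r}")
        if name not in selected:  # drop duplicates, keep first-seen order
            selected.append(name)
    return selected
```

For example, `parse_content_types("issues, prs")` yields only the issue and pull request types, whilst an unset option selects all four.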

File Patterns

The file_patterns option accepts glob patterns to filter which files are indexed. Multiple patterns can be specified separated by commas.

| Example Pattern | Matches |
| --- | --- |
| *.go | All Go files |
| *.md | All Markdown files |
| src/**/*.ts | TypeScript files in src directory |
| *.go,*.md | Go and Markdown files |

When no patterns are specified, all text files are indexed.
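The sketch below shows one plausible reading of these patterns. The matching rules are assumptions, not the connector's documented semantics: `*` stays within a single path segment, `**/` spans zero or more directories, and a pattern without a slash is compared against the file name only. The function names are hypothetical.

```python
import re
from posixpath import basename


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:[^/]+/)*")  # zero or more directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")  # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # anything within one segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")  # exactly one non-slash character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(path: str, file_patterns: str) -> bool:
    """Return True if path matches any of the comma-separated patterns."""
    patterns = [p.strip() for p in file_patterns.split(",") if p.strip()]
    if not patterns:
        return True  # no patterns configured: every file qualifies
    for pattern in patterns:
        # Patterns without a slash are matched against the file name only.
        target = path if "/" in pattern else basename(path)
        if glob_to_regex(pattern).fullmatch(target):
            return True
    return False
```

Under these assumptions, `*.go` matches Go files at any depth, whilst `src/**/*.ts` only matches TypeScript files under `src`.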

Repository Discovery

The connector automatically discovers all repositories accessible to the authenticated user. No explicit repository configuration is required.

Accessible repositories include:

  • Repositories owned by the user
  • Repositories where the user is a collaborator
  • Repositories in organisations where the user is a member

Archived and disabled repositories are excluded from sync.
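Discovery of this kind can be approximated with GitHub's `GET /user/repos` endpoint, whose `affiliation` parameter accepts `owner`, `collaborator`, and `organization_member`. The sketch below builds such a URL and filters out archived and disabled repositories using the `archived` and `disabled` fields of each repository object. Whether the connector uses this exact endpoint is an assumption, and the function names are hypothetical.

```python
from urllib.parse import urlencode


def list_repos_url(page: int = 1) -> str:
    """Build a paginated URL listing repositories for the authenticated user."""
    query = urlencode(
        {
            "affiliation": "owner,collaborator,organization_member",
            "per_page": 100,
            "page": page,
        }
    )
    return f"https://api.github.com/user/repos?{query}"


def syncable(repos: list[dict]) -> list[dict]:
    """Keep only repositories that are neither archived nor disabled."""
    return [
        repo
        for repo in repos
        if not repo.get("archived", False) and not repo.get("disabled", False)
    ]
```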

Document Structure

URI Patterns

Documents are identified by URIs following these patterns:

| Content Type | URI Pattern | Example |
| --- | --- | --- |
| Files | github://{owner}/{repo}/blob/{path} | github://acme/api/blob/src/main.go |
| Issues | github://{owner}/{repo}/issues/{number} | github://acme/api/issues/42 |
| Pull Requests | github://{owner}/{repo}/pull/{number} | github://acme/api/pull/123 |
| Wiki Pages | github://{owner}/{repo}/wiki/{page} | github://acme/api/wiki/Home |
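The patterns above can be expressed as a small helper. This is an illustrative sketch only: the function name and the `kind` keys are hypothetical, and no escaping of special characters is attempted.

```python
# Path segment used for each kind of document, taken from the table above.
SEGMENTS = {
    "file": "blob",
    "issue": "issues",
    "pull": "pull",
    "wiki": "wiki",
}


def document_uri(kind: str, owner: str, repo: str, ident: object) -> str:
    """Build a github:// URI; ident is a path, a number, or a page name."""
    return f"github://{owner}/{repo}/{SEGMENTS[kind]}/{ident}"
```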

MIME Types

The connector assigns custom MIME types to distinguish GitHub content:

| Content Type | MIME Type |
| --- | --- |
| Files | Detected from file extension |
| Issues | application/vnd.github.issue+json |
| Pull Requests | application/vnd.github.pull+json |
| Wiki Pages | text/markdown |
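A minimal sketch of this assignment follows. It uses Python's standard `mimetypes` module for extension-based detection, which may differ from the connector's own detection table, and the `text/plain` fallback for unknown extensions is an assumption. The function name is hypothetical.

```python
import mimetypes
from typing import Optional

# Fixed MIME types for non-file content, taken from the table above.
FIXED_MIME_TYPES = {
    "issues": "application/vnd.github.issue+json",
    "prs": "application/vnd.github.pull+json",
    "wikis": "text/markdown",
}


def mime_type_for(content_type: str, path: Optional[str] = None) -> str:
    """Return the MIME type for a document of the given content type."""
    if content_type == "files":
        guessed, _encoding = mimetypes.guess_type(path or "")
        return guessed or "text/plain"  # assumed fallback for unknown extensions
    return FIXED_MIME_TYPES[content_type]
```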

Metadata

Documents include metadata appropriate to their content type:

| Field | Description | Available For |
| --- | --- | --- |
| owner | Repository owner | All |
| repo | Repository name | All |
| path | File path within repository | Files |
| number | Issue or PR number | Issues, PRs |
| state | Open or closed | Issues, PRs |
| labels | Applied labels | Issues, PRs |
| author | Content author | Issues, PRs, Wiki |
| created_at | Creation timestamp | All |
| updated_at | Last update timestamp | All |

Rate Limiting

The connector implements a dual-strategy approach to rate limiting:

Proactive Throttling

A token bucket algorithm limits requests to approximately 1.2 requests per second, which works out to about 4,320 requests per hour. This stays under the 5,000 requests per hour limit whilst keeping throughput close to the maximum.
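For readers unfamiliar with the technique, the sketch below is a generic token bucket, not the connector's actual implementation. The rate of 1.2 tokens per second comes from the text (1.2 × 3,600 = 4,320 requests per hour); the burst capacity and the injectable clock are assumptions made for illustration and testing.

```python
import time
from typing import Callable


class TokenBucket:
    """Generic token bucket: tokens refill at `rate` per second up to `capacity`."""

    def __init__(
        self,
        rate: float = 1.2,
        capacity: float = 1.0,  # assumed burst size
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        """Add tokens for the time elapsed since the last refill."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self) -> bool:
        """Consume one token if available; return False otherwise."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token becomes available."""
        self._refill()
        return max(0.0, (1 - self.tokens) / self.rate)
```

A caller would typically sleep for `wait_time()` whenever `try_acquire()` returns False, then try again before issuing the request.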

Reactive Handling

The connector monitors rate limit headers in API responses:

| Header | Purpose |
| --- | --- |
| X-RateLimit-Remaining | Remaining requests in current window |
| X-RateLimit-Reset | Unix timestamp when limits reset |

When limits are exhausted, the connector waits until the reset time before continuing.
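The wait described above can be computed from the two headers. This is a sketch under simple assumptions: headers arrive as a plain dictionary with the exact names shown, and a missing header is treated as "not exhausted". The function name is hypothetical.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait before the next request, in seconds."""
    current = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0  # quota still available, no need to wait
    reset_at = int(headers.get("X-RateLimit-Reset", "0"))
    # Never return a negative wait if the reset time has already passed.
    return max(0.0, float(reset_at) - current)
```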

Secondary Rate Limits

GitHub's secondary rate limits (abuse detection) are handled with exponential backoff. If a secondary limit is triggered, the connector backs off and retries.
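A backoff schedule of this kind might look like the sketch below. GitHub's guidance is to honour a `Retry-After` header when one is present and otherwise to wait at least a minute, increasing the delay exponentially on repeated failures. The base delay, the cap, and the function name here are assumptions, not the connector's actual values.

```python
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 60.0,   # assumed initial delay in seconds
    cap: float = 3600.0,  # assumed upper bound in seconds
) -> float:
    """Return the delay before retry number `attempt` (starting at 0)."""
    if retry_after is not None:
        return float(retry_after)  # the server's instruction takes priority
    return min(cap, base * 2 ** attempt)
```

With these defaults, the delays run 60, 120, 240 seconds and so on until the cap is reached.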

Sync Behaviour

Full Sync

Full sync retrieves all content from all accessible repositories. For each repository:

  1. Repository tree is fetched using the recursive Trees API
  2. Blob content is retrieved for each file matching configured patterns
  3. Issues and pull requests are fetched with their comments
  4. Wiki pages are retrieved if the repository has a wiki
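The first two steps can be sketched as follows. GitHub's Trees API (`GET /repos/{owner}/{repo}/git/trees/{tree_sha}?recursive=1`) returns a `tree` array whose entries carry `path`, `type`, `sha`, and, for blobs, `size`. The helper names, the 1 MiB reading of the size limit, and the `keep` callback are assumptions for illustration.

```python
from typing import Callable

MAX_FILE_SIZE = 1024 * 1024  # assumed interpretation of the 1 MB limit


def tree_url(owner: str, repo: str, ref: str) -> str:
    """URL for a recursive tree listing at a branch name or SHA."""
    return f"https://api.github.com/repos/{owner}/{repo}/git/trees/{ref}?recursive=1"


def select_blobs(
    tree_response: dict,
    keep: Callable[[str], bool] = lambda path: True,
) -> list[tuple[str, str]]:
    """Return (path, sha) pairs for files that should be fetched."""
    blobs = []
    for entry in tree_response.get("tree", []):
        if entry.get("type") != "blob":
            continue  # skip directories and submodules
        if entry.get("size", 0) > MAX_FILE_SIZE:
            continue  # skip files above the size limit
        if keep(entry["path"]):
            blobs.append((entry["path"], entry["sha"]))
    return blobs
```

The response also includes a `truncated` flag, which GitHub sets when a tree is too large to return in one call; a complete implementation would need to handle that case.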

Incremental Sync

Incremental sync uses cursors to track state for each repository. The cursor stores:

| Component | Purpose |
| --- | --- |
| Tree SHA | Detects file changes by comparing against current HEAD |
| Issues timestamp | Filters issues updated since last sync |
| PRs timestamp | Filters pull requests updated since last sync |
| Wiki SHA | Tracks wiki repository changes |

Each repository maintains independent cursor state, enabling partial syncs to resume from where they left off.
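One way to picture the per-repository cursor is as a small record keyed by repository name. The field names and the JSON encoding below are hypothetical; the connector's actual storage format is not documented here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RepoCursor:
    """Sync state for one repository; None means 'never synced'."""

    tree_sha: Optional[str] = None
    issues_since: Optional[str] = None  # ISO 8601 timestamp
    prs_since: Optional[str] = None     # ISO 8601 timestamp
    wiki_sha: Optional[str] = None


def dump_cursors(cursors: dict[str, RepoCursor]) -> str:
    """Serialise all cursors, keyed by 'owner/repo'."""
    return json.dumps({name: asdict(c) for name, c in cursors.items()}, sort_keys=True)


def load_cursors(raw: str) -> dict[str, RepoCursor]:
    """Restore cursors from their serialised form."""
    return {name: RepoCursor(**fields) for name, fields in json.loads(raw).items()}
```

Because each repository has its own entry, a sync interrupted partway through can keep the cursors of repositories that already finished and resume with the rest.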

Change Detection

| Content Type | Detection Method |
| --- | --- |
| Files | Compare tree SHA against stored value |
| Issues | Filter by updated_at since last sync |
| Pull Requests | Filter by updated_at since last sync |
| Wiki | Compare wiki commit SHA against stored value |
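The two detection methods reduce to a SHA comparison and a timestamp filter, sketched below. The function names are hypothetical, and treating items as changed only when `updated_at` is strictly later than the stored timestamp is an assumption.

```python
from datetime import datetime
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse a GitHub-style ISO 8601 timestamp such as 2025-01-01T00:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def sha_changed(stored_sha: Optional[str], current_sha: str) -> bool:
    """Files and wikis: a differing (or missing) stored SHA means a re-sync."""
    return stored_sha != current_sha


def updated_since(items: list[dict], since: Optional[str]) -> list[dict]:
    """Issues and pull requests: keep items updated after the stored timestamp."""
    if since is None:
        return list(items)  # first sync: everything is new
    cutoff = parse_timestamp(since)
    return [item for item in items if parse_timestamp(item["updated_at"]) > cutoff]
```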

Error Handling

The connector distinguishes between recoverable and fatal errors:

| Error Type | Handling |
| --- | --- |
| Rate limit | Wait for reset, then continue |
| Network timeout | Retry with exponential backoff |
| Authentication failure | Report immediately, stop sync |
| Permission denied | Log warning, skip repository |
| Not found | Skip resource, continue sync |
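The table can be read as a mapping from failure to action. The sketch below shows one possible classification. The HTTP status codes chosen for each error type (401, 403, 404, 429) are assumptions about how the errors surface, and the names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    WAIT_FOR_RESET = "wait for reset, then continue"
    RETRY_WITH_BACKOFF = "retry with exponential backoff"
    STOP_SYNC = "report immediately, stop sync"
    SKIP_REPOSITORY = "log warning, skip repository"
    SKIP_RESOURCE = "skip resource, continue sync"


def classify(
    status: Optional[int],
    rate_limit_remaining: Optional[int] = None,
    timed_out: bool = False,
) -> Optional[Action]:
    """Map a failed request to an action; None means no rule applies."""
    if timed_out:
        return Action.RETRY_WITH_BACKOFF
    if status == 401:
        return Action.STOP_SYNC
    # A 403 or 429 with no remaining quota is treated as a rate limit.
    if status in (403, 429) and rate_limit_remaining == 0:
        return Action.WAIT_FOR_RESET
    if status == 403:
        return Action.SKIP_REPOSITORY
    if status == 404:
        return Action.SKIP_RESOURCE
    return None
```

Checking the remaining quota before treating a 403 as a permission problem matters in this sketch, because the same status code can signal either condition.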

Limitations

| Limitation | Description |
| --- | --- |
| Binary files | Not indexed (text content only) |
| File size | Maximum 1 MB per file (GitHub API constraint) |
| Watch mode | Not supported (no webhook integration) |
| Private repositories | Requires appropriate token scopes |
