feat: add summary for functions / files for better semantic search#40

Draft
aeneasr wants to merge 12 commits into main from add-semantic-summary

@aeneasr
Member

@aeneasr aeneasr commented Mar 16, 2026

No description provided.

aeneasr and others added 12 commits March 15, 2026 14:17
Plan covers all work described in the 2026-03-15 design spec:
- Task 1: nomic-ai/nomic-embed-text-GGUF added to KnownModels
- Task 2: Config extended with Summaries/SummaryModel/SummaryEmbedModel/SummaryEmbedDims
- Task 3: new internal/summarizer/ package (Ollama + LM Studio chat clients)
- Task 4: store extended with 4 new tables and explicit summary cleanup
- Task 5: indexer summary passes (chunk filter ≥3 lines, hierarchical file summary)
- Task 6: summarizer wired into cmd/stdio.go and cmd/index.go
- Task 7: search fans out to 3 indices, emits <relevant_files> XML section
- Task 8: lint + final verification

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ignature

Extends Config with Summaries, SummaryModel, SummaryEmbedModel, and
SummaryEmbedDims fields populated from LUMEN_SUMMARIES, LUMEN_SUMMARY_MODEL,
and LUMEN_SUMMARY_EMBED_MODEL env vars. Updates DBPathForProject to accept a
third summaryEmbedModel parameter so different summary embed models produce
distinct DB paths. Updates all callers including cmd/stdio_test.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
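
The commit above says a third summaryEmbedModel parameter makes different summary embed models produce distinct DB paths. A minimal sketch of how such a path could be derived follows. The body of DBPathForProject is not shown in this PR page, so `dbPathFor`, the `cacheDir` constant, and the SHA-256 scheme are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// cacheDir is a hypothetical directory for index databases.
const cacheDir = "lumen-cache"

// dbPathFor is a hypothetical stand-in for DBPathForProject. It hashes the
// project path together with both model names, so changing either model
// selects a different database file.
func dbPathFor(projectPath, embedModel, summaryEmbedModel string) string {
	h := sha256.New()
	for _, part := range []string{projectPath, embedModel, summaryEmbedModel} {
		h.Write([]byte(part))
		// A zero byte between parts keeps ("ab", "c") distinct from ("a", "bc").
		h.Write([]byte{0})
	}
	name := hex.EncodeToString(h.Sum(nil))[:16] + ".db"
	return filepath.Join(cacheDir, name)
}

func main() {
	fmt.Println(dbPathFor("/repo", "code-model", ""))
	fmt.Println(dbPathFor("/repo", "code-model", "summary-model"))
}
```

Keying the file name on all three inputs means an index built with one summary embed model is never reopened with vectors of a different dimensionality.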
…clients

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… DeleteFileChunks

- Add chunk_summaries and file_summaries text tables (always created)
- Add vec_chunk_summaries and vec_file_summaries virtual tables (when summaryDims > 0)
- Store.New and NewIndexer accept summaryDims as a new parameter
- ensureVecDimensions handles both vec table pairs atomically on dimension mismatch
- DeleteFileChunks uses three-phase cleanup: collect IDs → delete vec rows → delete data rows
- Add InsertChunkSummaries, InsertFileSummary, SearchChunkSummaries, SearchFileSummaries, TopChunksByFile, ChunksByFile methods
- Work around sqlite-vec limitations: no ON CONFLICT upsert (delete+insert), no combined k+LIMIT

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
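
The three-phase cleanup described above (collect IDs, delete vec rows, delete data rows) can be modelled without a database. The sketch below uses in-memory maps in place of the SQLite tables. The `memStore` type and its fields are hypothetical and only illustrate why the order matters: the vector rows are keyed by chunk ID, so the IDs have to be collected before the data rows that carry the file path disappear.

```go
package main

import "fmt"

// memStore is a hypothetical in-memory model of the data table and its
// companion vector table; the real store uses SQLite and sqlite-vec.
type memStore struct {
	chunkFile map[int64]string    // chunk ID -> file path (data rows)
	chunkVec  map[int64][]float32 // chunk ID -> embedding (vec rows)
}

// deleteFileChunks removes every chunk of one file in three phases and
// returns how many chunks were removed.
func (s *memStore) deleteFileChunks(path string) int {
	// Phase 1: collect the IDs while the data rows still exist.
	var ids []int64
	for id, p := range s.chunkFile {
		if p == path {
			ids = append(ids, id)
		}
	}
	// Phase 2: delete the vector rows keyed by those IDs.
	for _, id := range ids {
		delete(s.chunkVec, id)
	}
	// Phase 3: delete the data rows themselves.
	for _, id := range ids {
		delete(s.chunkFile, id)
	}
	return len(ids)
}

func main() {
	s := &memStore{
		chunkFile: map[int64]string{1: "a.go", 2: "a.go", 3: "b.go"},
		chunkVec:  map[int64][]float32{1: {0.1}, 2: {0.2}, 3: {0.3}},
	}
	fmt.Println(s.deleteFileChunks("a.go"), len(s.chunkFile), len(s.chunkVec))
}
```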
Extend Indexer with optional summarizer.Summarizer and summary embedder
fields. After raw embedding completes, runSummaryPasses generates
per-chunk and per-file summaries, embeds them, and stores them in the
summary tables. Callers that don't need summaries pass nil, nil.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion

Add newSummarizer and newSummaryEmbedder helpers to cmd/embedder.go that
return nil when cfg.Summaries is false, and pass the real instances through
indexerCache (stdio) and setupIndexer (index CLI) instead of nil, nil.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…CP response

- Add RelevantFile type and RelevantFiles field to SemanticSearchOutput
- Update formatSearchResults to append <relevant_files> XML block when populated
- Add embedSummaryQuery helper on indexerCache (no-op when summaryEmbedder is nil)
- Add mergeSearchResults helper to merge store.SearchResult slices by identity key
- Add SearchChunkSummaries, SearchFileSummaries, TopChunksByFile proxy methods to Indexer
- Expand handleSemanticSearch to fan out across vec_chunk_summaries and vec_file_summaries
  when a summary embedder is configured, merging results and populating RelevantFiles

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
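
Merging results from several indices by identity key could look roughly like the sketch below. The fields of store.SearchResult are not shown in this PR page, so the `SearchResult` struct, the choice of (file, start line, end line) as the key, and the keep-the-lowest-distance rule are assumptions. `mergeByKey` is a hypothetical name, not the PR's mergeSearchResults.

```go
package main

import (
	"fmt"
	"sort"
)

// SearchResult is a hypothetical result shape for this sketch.
type SearchResult struct {
	FilePath  string
	StartLine int
	EndLine   int
	Distance  float64
}

// resultKey identifies a chunk independently of which index returned it.
type resultKey struct {
	path       string
	start, end int
}

// mergeByKey combines result lists, keeps the lowest distance for each
// chunk, and returns the merged results ordered by ascending distance.
func mergeByKey(lists ...[]SearchResult) []SearchResult {
	best := make(map[resultKey]SearchResult)
	for _, list := range lists {
		for _, r := range list {
			k := resultKey{r.FilePath, r.StartLine, r.EndLine}
			if cur, ok := best[k]; !ok || r.Distance < cur.Distance {
				best[k] = r
			}
		}
	}
	merged := make([]SearchResult, 0, len(best))
	for _, r := range best {
		merged = append(merged, r)
	}
	sort.Slice(merged, func(i, j int) bool {
		if merged[i].Distance != merged[j].Distance {
			return merged[i].Distance < merged[j].Distance
		}
		// Tie-break on path and start line for a stable order.
		if merged[i].FilePath != merged[j].FilePath {
			return merged[i].FilePath < merged[j].FilePath
		}
		return merged[i].StartLine < merged[j].StartLine
	})
	return merged
}

func main() {
	raw := []SearchResult{{"a.go", 1, 10, 0.30}, {"b.go", 5, 20, 0.50}}
	summaries := []SearchResult{{"a.go", 1, 10, 0.20}, {"c.go", 1, 3, 0.40}}
	for _, r := range mergeByKey(raw, summaries) {
		fmt.Printf("%s:%d-%d %.2f\n", r.FilePath, r.StartLine, r.EndLine, r.Distance)
	}
}
```

Deduplicating on a key rather than appending is also what the later fix for file-expansion chunks relies on: a chunk fetched twice appears once in the output.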
- fix(stdio): SearchChunkSummaries was passed the raw-code query vector
  instead of summaryQueryVec, causing dimension-mismatch panics at runtime
  when code and summary embedders differ in dimensionality
- fix(stdio): file-expansion chunks now go through mergeSearchResults
  instead of raw append, preventing duplicates when expanded file chunks
  overlap with already-fetched results
- fix(stdio): TopChunksByFile now receives maxDistance so injected chunks
  are bounded by the same noise floor as the primary search, not unbounded
- fix(stdio): relevant_files XML attribute now uses xmlEscaper.Replace
  instead of %q (Go quoting) to produce valid XML on all platforms
- fix(summarizer): readErr is now checked before status code in both
  ollama.go and lmstudio.go; previously a partial body read on a 200
  response fell through to JSON unmarshal on a corrupt buffer
- fix(summarizer): Ollama and LM Studio backends now return an error on
  empty response content rather than silently storing an empty summary
- fix(summarizer/test): LM Studio mock handler now guards against empty
  messages slice before indexing, matching the Ollama handler
- fix(embedder): newSummaryEmbedder looks up CtxLength from KnownModels
  for the Ollama backend instead of always passing 0
- fix(store): TopChunksByFile signature gains maxDistance parameter;
  callers updated in index.go, stdio.go, and store_test.go

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@aeneasr aeneasr changed the title Add semantic summary feat: add summary for functions / files for better semantic search Mar 16, 2026