Build a semantic code search engine using embeddings. Search by intent ('function that validates email') rather than exact text matches.
## Task Semantic code search using embeddings — find code by intent, not just keywords. ## Requirements - Embeddings: OpenAI text-embedding-3-small or CodeBERT - Vector store: pgvector, Qdrant, or ChromaDB - Language: Python or TypeScript ## Pipeline ``` Indexing: 1. Walk codebase, extract functions/classes/modules 2. For each code chunk, create embedding of: - Code itself - Auto-generated description (via LLM) - Docstring/comments 3. Store in vector DB with metadata (file, line, language) Querying: 1. User types: "function that retries HTTP requests with backoff" 2. Embed the query 3. Find top 10 nearest code chunks 4. Rerank by: similarity score × recency × file relevance 5. Return with file path, line numbers, and preview ``` ## Chunking Strategy ``` - Functions: one chunk per function (with signature + body) - Classes: one chunk per class (signature + method signatures) - Modules: one chunk for top-level code per file - Max chunk: 512 tokens - Include 2 lines of context above and below ``` ## Implementation Notes 1. Incremental indexing: only re-embed changed files (use git diff) 2. Language-aware parsing: use tree-sitter for AST extraction 3. Hybrid search: combine vector similarity with keyword BM25 4. Cache embeddings with content hash 5. Support multiple languages (TypeScript, Python, Go, Rust) 6. CLI tool: `codesearch "retry with exponential backoff"`
No gallery images yet.