Prompt:
Build me a personal knowledge base with RAG (retrieval-augmented generation).
Ingestion — I send a URL or file and the system saves it. It should handle:
Web articles
YouTube videos (transcripts)
Tweets/X posts
PDFs
Plain text or notes
Source type detection: Determine the type from the URL pattern or file extension. Classify as: article, video, pdf, text, tweet, or other.
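A minimal sketch of the detection step in TypeScript; the URL patterns and extensions below are illustrative assumptions, not a complete list:

```typescript
type SourceType = "article" | "video" | "pdf" | "text" | "tweet" | "other";

// Classify by URL pattern or file extension. The regexes here are
// assumptions for illustration; extend them to cover the domains you need.
function detectSourceType(input: string): SourceType {
  const s = input.toLowerCase();
  if (/https?:\/\/(www\.)?(twitter|x)\.com\//.test(s)) return "tweet";
  if (/https?:\/\/(www\.)?(youtube\.com\/watch|youtu\.be\/)/.test(s)) return "video";
  if (s.endsWith(".pdf")) return "pdf";
  if (/\.(txt|md|markdown)$/.test(s)) return "text";
  if (/^https?:\/\//.test(s)) return "article";
  return "other";
}
```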
Content extraction with fallback chain — this is important because no single extractor works for every site:
1. For Twitter/X URLs:
a. Try FxTwitter API (api.fxtwitter.com) — free, no auth needed
b. Fall back to X API direct lookup
c. Fall back to web scraping
2. For YouTube URLs:
a. Pull transcript via YouTube transcript API or yt-dlp
3. For all other URLs (articles, blogs, etc.):
a. Try a clean text extractor (like Mozilla Readability or similar)
b. Fall back to Firecrawl or Apify for sites that block simple extraction
c. Fall back to a headless browser (Playwright/Puppeteer) for JavaScript-heavy pages
d. Last resort: raw HTTP fetch + HTML tag stripping
Retry once on transient errors (ECONNRESET, ETIMEDOUT, DNS failures) with a 2-second delay.
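A sketch of how the fallback chain and retry rule could be orchestrated. The concrete extractors (FxTwitter, Readability, Firecrawl, Playwright, etc.) are left as placeholder functions rather than real API calls:

```typescript
// Each extractor takes a URL and resolves to extracted text, or throws.
type Extractor = (url: string) => Promise<string>;

// Node network error codes treated as transient.
const TRANSIENT = new Set(["ECONNRESET", "ETIMEDOUT", "ENOTFOUND", "EAI_AGAIN"]);

async function withRetry(fn: () => Promise<string>): Promise<string> {
  try {
    return await fn();
  } catch (err: any) {
    if (TRANSIENT.has(err?.code)) {
      await new Promise((r) => setTimeout(r, 2000)); // one retry after 2 s
      return fn();
    }
    throw err;
  }
}

// Walk the chain in order; the first extractor that succeeds wins.
async function extractWithFallbacks(url: string, chain: Extractor[]): Promise<string> {
  let lastError: unknown;
  for (const extractor of chain) {
    try {
      return await withRetry(() => extractor(url));
    } catch (err) {
      lastError = err; // move on to the next extractor
    }
  }
  throw new Error(`All extractors failed for ${url}: ${String(lastError)}`);
}
```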
Content quality validation — reject bad extractions:
Minimum 20 characters of content.
For articles/non-tweets: at least 15% of non-empty lines must be longer than 80 characters (to detect prose vs. navigation menus).
Total content must be at least 500 characters for non-tweet types.
Detect error pages by looking for 2+ signals: "access denied", "captcha", "please enable javascript", "cloudflare", "404", "sign in", "blocked", "rate limit".
Maximum content length: 200,000 characters (truncate beyond this).
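The validation rules above translate fairly directly into a single check; the thresholds come from the spec, everything else is an assumption:

```typescript
const ERROR_SIGNALS = [
  "access denied", "captcha", "please enable javascript", "cloudflare",
  "404", "sign in", "blocked", "rate limit",
];

// Returns the cleaned content, or null if the extraction looks bad.
function validateContent(content: string, sourceType: string): string | null {
  const text = content.trim();
  if (text.length < 20) return null;

  if (sourceType !== "tweet") {
    if (text.length < 500) return null;
    const lines = text.split("\n").map((l) => l.trim()).filter(Boolean);
    const longLines = lines.filter((l) => l.length > 80).length;
    if (lines.length > 0 && longLines / lines.length < 0.15) return null; // nav/menu scraps
  }

  const lower = text.toLowerCase();
  const hits = ERROR_SIGNALS.filter((s) => lower.includes(s)).length;
  if (hits >= 2) return null; // likely an error or block page

  return text.length > 200_000 ? text.slice(0, 200_000) : text; // truncate oversized content
}
```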
Deduplication — two layers:
1. URL-based: Normalize URLs before comparing — strip tracking params (utm_source, utm_medium, utm_campaign, fbclid, igshid, ref, s, t), remove www., normalize twitter.com to x.com, remove trailing slashes and fragments.
2. Content-hash: SHA-256 hash of the cleaned content. Store as a UNIQUE column — reject if the same hash already exists.
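A sketch of both layers, assuming Node's built-in crypto and URL modules:

```typescript
import { createHash } from "node:crypto";

const TRACKING_PARAMS = [
  "utm_source", "utm_medium", "utm_campaign", "fbclid", "igshid", "ref", "s", "t",
];

// Layer 1: canonical URL for comparison.
function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  u.hash = ""; // drop fragments
  TRACKING_PARAMS.forEach((p) => u.searchParams.delete(p));
  u.hostname = u.hostname.replace(/^www\./, "").replace(/^twitter\.com$/, "x.com");
  u.pathname = u.pathname.replace(/\/+$/, "") || "/"; // drop trailing slashes
  return u.toString();
}

// Layer 2: SHA-256 of the cleaned content; stored in a UNIQUE column so
// duplicate content is rejected at insert time.
function contentHash(content: string): string {
  return createHash("sha256").update(content.trim()).digest("hex");
}
```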
Chunking:
Chunk size: 800 characters per chunk.
Overlap: 200 characters between chunks.
Minimum chunk size: 100 characters (append tiny remainders to the last chunk).
Split on sentence boundaries where possible.
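One possible greedy implementation of these chunking rules; the sentence-splitting regex is a deliberate simplification:

```typescript
// Pack sentences up to ~800 characters, carry the last ~200 characters
// forward as overlap, and fold any remainder under 100 characters into
// the previous chunk.
function chunkText(text: string, size = 800, overlap = 200, minSize = 100): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+(\s+|$)|[^.!?]+$/g) ?? [text];
  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    if (current.length > 0 && current.length + sentence.length > size) {
      chunks.push(current.trim());
      current = current.slice(-overlap); // overlap tail of the previous chunk
    }
    current += sentence;
  }

  const last = current.trim();
  if (last.length > 0) {
    if (last.length < minSize && chunks.length > 0) {
      chunks[chunks.length - 1] += " " + last; // append tiny remainder
    } else {
      chunks.push(last);
    }
  }
  return chunks;
}
```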
Embedding generation:
Use Google gemini-embedding-001 (768 dimensions, free) as the primary model, with OpenAI text-embedding-3-small (1536 dimensions) as the fallback.
Max input: 8000 characters per chunk.
Process in batches of 10 chunks with 200ms delay between batches.
Retry failed embeddings 3 times with exponential backoff (1s, 2s, 4s).
Cache embeddings with an LRU cache (1000 entries).
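A sketch of the batching, retry, and caching behavior; embedFn is a placeholder for whichever provider client (Gemini, or the OpenAI fallback) is actually in use:

```typescript
type EmbedFn = (text: string) => Promise<number[]>;

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));
const cache = new Map<string, number[]>(); // Map insertion order doubles as LRU order
const CACHE_MAX = 1000;

async function embedWithRetry(embedFn: EmbedFn, text: string): Promise<number[]> {
  const key = text.slice(0, 8000); // respect the 8000-character input cap
  const hit = cache.get(key);
  if (hit) {
    cache.delete(key);
    cache.set(key, hit); // refresh recency
    return hit;
  }
  for (let attempt = 0; ; attempt++) {
    try {
      const vector = await embedFn(key);
      cache.set(key, vector);
      if (cache.size > CACHE_MAX) cache.delete(cache.keys().next().value!); // evict oldest
      return vector;
    } catch (err) {
      if (attempt >= 3) throw err; // give up after 3 retries
      await sleep(1000 * 2 ** attempt); // 1s, 2s, 4s backoff
    }
  }
}

async function embedAll(embedFn: EmbedFn, chunks: string[]): Promise<number[][]> {
  const out: number[][] = [];
  for (let i = 0; i < chunks.length; i += 10) {
    const batch = chunks.slice(i, i + 10);
    out.push(...(await Promise.all(batch.map((c) => embedWithRetry(embedFn, c)))));
    if (i + 10 < chunks.length) await sleep(200); // pacing between batches
  }
  return out;
}
```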
Storage — SQLite with two tables:
sources: id, url, title, source_type, summary, raw_content, content_hash (UNIQUE), tags (JSON array), created_at, updated_at
chunks: id, source_id (FK), chunk_index, content, embedding (BLOB), embedding_dim, embedding_provider, embedding_model, created_at
Index on chunks(source_id), sources(source_type), sources(content_hash).
Enable WAL mode and foreign keys with CASCADE deletes.
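A possible schema and setup, assuming the better-sqlite3 driver (any SQLite client with pragma support would do):

```typescript
import Database from "better-sqlite3"; // assumed driver

const db = new Database("knowledge.db");
db.pragma("journal_mode = WAL");
db.pragma("foreign_keys = ON");

db.exec(`
  CREATE TABLE IF NOT EXISTS sources (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    url           TEXT,
    title         TEXT,
    source_type   TEXT,
    summary       TEXT,
    raw_content   TEXT,
    content_hash  TEXT UNIQUE,
    tags          TEXT,            -- JSON array stored as text
    created_at    TEXT DEFAULT (datetime('now')),
    updated_at    TEXT DEFAULT (datetime('now'))
  );

  CREATE TABLE IF NOT EXISTS chunks (
    id                 INTEGER PRIMARY KEY AUTOINCREMENT,
    source_id          INTEGER NOT NULL REFERENCES sources(id) ON DELETE CASCADE,
    chunk_index        INTEGER NOT NULL,
    content            TEXT NOT NULL,
    embedding          BLOB,
    embedding_dim      INTEGER,
    embedding_provider TEXT,
    embedding_model    TEXT,
    created_at         TEXT DEFAULT (datetime('now'))
  );

  CREATE INDEX IF NOT EXISTS idx_chunks_source_id ON chunks(source_id);
  CREATE INDEX IF NOT EXISTS idx_sources_type     ON sources(source_type);
  CREATE INDEX IF NOT EXISTS idx_sources_hash     ON sources(content_hash);
`);
```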
Concurrency protection: Use a lock file to prevent simultaneous ingestion runs. Check if lock is stale (PID dead or file older than 15 minutes).
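A sketch of the lock-file check, assuming the lock lives at ingest.lock and stores the owning PID:

```typescript
import { existsSync, readFileSync, statSync, writeFileSync, unlinkSync } from "node:fs";

const LOCK_PATH = "ingest.lock"; // assumed location
const STALE_MS = 15 * 60 * 1000;

function pidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: existence check only
    return true;
  } catch {
    return false;
  }
}

// Acquire the ingestion lock, stealing it if the previous holder is dead
// or the lock file is older than 15 minutes.
function acquireLock(): boolean {
  if (existsSync(LOCK_PATH)) {
    const pid = parseInt(readFileSync(LOCK_PATH, "utf8"), 10);
    const age = Date.now() - statSync(LOCK_PATH).mtimeMs;
    if (pidAlive(pid) && age < STALE_MS) return false; // another run is active
    unlinkSync(LOCK_PATH); // stale lock: remove it
  }
  writeFileSync(LOCK_PATH, String(process.pid));
  return true;
}

function releaseLock(): void {
  if (existsSync(LOCK_PATH)) unlinkSync(LOCK_PATH);
}
```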
Retrieval — When I ask a question:
1. Embed my query using the same embedding provider.
2. Cosine similarity search over all stored chunks. Return top 10.
3. Deduplicate results: keep only the best chunk per source.
4. Sanitize content in results (max 2500 characters per excerpt).
5. Pass top chunks to an LLM: "Answer using only the provided context. Cite which sources you drew from."
6. Return the answer with source references.
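A sketch of steps 2–4 (similarity search, per-source deduplication, excerpt sanitization), assuming chunk embeddings have already been decoded from their BLOBs into number arrays:

```typescript
interface StoredChunk { sourceId: number; content: string; embedding: number[]; }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Rank all chunks by cosine similarity to the query embedding, take the
// top 10, then keep only the best-scoring chunk per source.
function retrieve(queryEmbedding: number[], chunks: StoredChunk[]) {
  const ranked = chunks
    .map((c) => ({ ...c, score: cosine(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 10);

  const bestPerSource = new Map<number, (typeof ranked)[number]>();
  for (const r of ranked) {
    if (!bestPerSource.has(r.sourceId)) bestPerSource.set(r.sourceId, r);
  }
  return [...bestPerSource.values()].map((r) => ({
    sourceId: r.sourceId,
    score: r.score,
    excerpt: r.content.slice(0, 2500), // cap excerpt length
  }));
}
```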