Free LLMs.
Powering AI Agents
Powering AI Projects
npmai gives you 12 open-source large language models, a production RAG pipeline, and persistent memory — all with a single pip install. No API keys, no signup, no local GPU required.
Powered projects
Every project is open source, live in production, and built entirely on free infrastructure. Each one imports from npmai.
Retrieval-Augmented Generation has emerged as the dominant paradigm for grounding large language model outputs in external knowledge (Lewis et al., 2020). In the canonical RAG pipeline, a query is embedded and used to retrieve the top-K most similar chunks from a vector database, which are then concatenated into the LLM context alongside the query. Despite the sophistication of recent advances — including self-reflective retrieval (Asai et al., 2024), adaptive strategy selection (Jeong et al., 2024), and uncertainty-triggered retrieval (Jiang et al., 2023) — a fundamental assumption has remained unexamined: the value of K is fixed at design time by the developer and never adjusted at inference.
This paper argues that fixed-K retrieval is not merely suboptimal but systematically harmful across three distinct failure modes, and proposes LARA as a principled replacement.
Fixed-K retrieval fails in three qualitatively distinct ways:
When the corpus size is at or near K, the system is forced to include semantically irrelevant chunks. On a 5-chunk corpus with k=4, the LLM receives 80% of all available content regardless of relevance, which invites hallucinations from contradictory context.
On large corpora (500+ chunks), k=4 retrieves less than 1% of available content. Critical supporting information is never retrieved, producing incomplete or incorrect answers that appear confident.
Liu et al. (2024) document a U-shaped performance degradation in LLMs: information positioned in the middle of a long context window is disproportionately ignored. Large K values systematically trigger this effect.
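The coverage figures quoted for the first two failure modes follow from simple arithmetic. A minimal sketch, where `coverage` is a hypothetical helper name introduced only for illustration:

```python
def coverage(k: int, corpus_size: int) -> float:
    """Fraction of the corpus handed to the LLM under fixed-K retrieval."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # A retriever cannot return more chunks than the corpus contains.
    return min(k, corpus_size) / corpus_size


small = coverage(4, 5)    # 5-chunk corpus, k=4
large = coverage(4, 500)  # 500-chunk corpus, k=4
print(f"{small:.0%} of a 5-chunk corpus, {large:.1%} of a 500-chunk corpus")
```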
Current adaptive RAG systems address whether or when to retrieve — not how many chunks to retrieve. Adaptive-RAG (Jeong et al., 2024) selects among retrieval strategies but applies a fixed K within each strategy. Self-RAG (Asai et al., 2024) determines whether retrieval is needed at all, but retrieves a fixed K when it does. FLARE (Jiang et al., 2023) triggers retrieval on uncertainty but uses fixed K. RankRAG (Yu et al., 2024) fine-tunes the LLM for ranking but does not address the count problem.
LARA processes each query through five sequential phases, described in order below.
[Figure: the complete LARA pipeline.]
The corpus is chunked (1,000 characters, 200-character overlap) and encoded using a bi-encoder (BGE-small-en-v1.5). Embeddings are stored in a FAISS index (IVF or HNSW). At query time, the query is embedded and used to retrieve approximately 200 candidates via approximate nearest neighbor (ANN) search.
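The chunking step can be sketched as a character-level sliding window. The sizes come from the text; the function name `chunk_text` and the handling of the final short chunk are assumptions, not the library's actual implementation:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` characters sharing `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts this many characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(2000))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

With a 2,000-character input, consecutive chunks share their last and first 200 characters, and the final chunk is shorter than the rest.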
ANN indexing avoids the O(N) cost of running a cross-encoder over every corpus chunk: candidate search is sublinear in N (approximately O(log N) for graph-based indexes such as HNSW), and the cross-encoder then scores only the ~200 candidates. The ~200 candidate pool provides a sufficiently broad recall base while remaining computationally tractable for the cross-encoder reranking phase.
The 200 candidates are passed to a cross-encoder reranker (cross-encoder/ms-marco-MiniLM-L-6-v2), which computes a fine-grained relevance score S(c) ∈ [0, 1] for each candidate chunk c relative to the query. A threshold filter is applied:
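The filter itself can be reconstructed from the surrounding definitions. The symbol C_200 for the candidate pool is notation assumed here, and the inclusive comparison is inferred from the statement that chunks scoring below θ are treated as noise:

```latex
\[
\text{Dynamic\_K} = \{\, c \in C_{200} \;:\; S(c) \ge \theta \,\}
\]
```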
The threshold θ = 0.3 was determined empirically: chunks scoring below this threshold were observed to consistently introduce contradictory or semantically irrelevant content. The critical insight is that Dynamic_K is an emergent property of relevance, not a developer hyperparameter. A narrow query against a niche corpus produces a small, high-precision Dynamic_K. A broad query against a rich corpus produces a larger Dynamic_K. The system adapts automatically without configuration.
Dynamic_K is sorted in descending order of S(c) to ensure the highest-quality chunks are preserved when the latency governor requires reduction.
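The quality gate and the sort can be sketched together. This assumes the reranker scores are already available as `(chunk, score)` pairs normalized to [0, 1]; `select_dynamic_k` is a hypothetical name, and no real reranker is called:

```python
def select_dynamic_k(scored_chunks, theta: float = 0.3):
    """Keep chunks with score >= theta, ordered from highest to lowest score."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= theta]
    # Descending order lets a later truncation drop the weakest chunks first.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


candidates = [("a", 0.12), ("b", 0.91), ("c", 0.30), ("d", 0.55), ("e", 0.05)]
dynamic_k = select_dynamic_k(candidates)
print(dynamic_k)
```

Here the number of surviving chunks (three of five) is decided by the scores rather than by a preset K.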
Real-world deployments operate under latency constraints. A developer deploying LARA specifies L_afford — the total allowable end-to-end latency in seconds. After reranking, the system measures the time consumed by phases 1–3:
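The text states only that the elapsed time is measured at this point. One plausible reading, writing T_{1-3} (a symbol assumed here) for the time consumed by phases 1–3, is that the remainder becomes the budget L_budget used in the steps that follow:

```latex
\[
L_{\text{budget}} = L_{\text{afford}} - T_{1\text{--}3}
\]
```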
The latency required to process the current Dynamic_K through the LLM is estimated as:
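Consistent with the expression n·Lat_chunk used in the proof further on, the estimate can be written as:

```latex
\[
\text{Total\_Latency} = |\text{Dynamic\_K}| \times \text{Lat}_{\text{chunk}}
\]
```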
Where Lat_chunk is the average time to process one chunk through the LLM (a developer-provided constant in the current implementation, ideally calibrated at runtime). If Total_Latency exceeds L_budget, the governor computes the minimum number of chunks to remove:
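The governor's quantities can be reconstructed from the way the proof uses them (Exceeded is zero when the budget is met, and a ceiling is applied to the reduction). The exact form below is a reconstruction under those assumptions:

```latex
\begin{align*}
\text{Exceeded} &= \max\bigl(0,\ \text{Total\_Latency} - L_{\text{budget}}\bigr) \\
\text{Reduce\_Count} &= \left\lceil \frac{\text{Exceeded}}{\text{Lat}_{\text{chunk}}} \right\rceil \\
\text{Send\_List} &= \text{the first } |\text{Dynamic\_K}| - \text{Reduce\_Count} \text{ chunks of Dynamic\_K}
\end{align*}
```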
Theorem (latency bound). For all inputs with Lat_chunk > 0 and L_budget ≥ 0, |Send_List| × Lat_chunk ≤ L_budget.
Proof. We consider two cases.
Case 1: Total_Latency ≤ L_budget. Then Exceeded = 0, Reduce_Count = 0, and Send_List = Dynamic_K. The latency of Send_List is Total_Latency ≤ L_budget. ✓
Case 2: Total_Latency > L_budget. Let n = |Dynamic_K|. We have Exceeded = n·Lat_chunk − L_budget. The ceiling function guarantees Reduce_Count ≥ Exceeded / Lat_chunk, so the number of remaining chunks satisfies:
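Substituting the expression for Exceeded and using |Send_List| = n − Reduce_Count, the bound works out as follows:

```latex
\begin{align*}
|\text{Send\_List}| = n - \text{Reduce\_Count}
  &\le n - \frac{\text{Exceeded}}{\text{Lat}_{\text{chunk}}} \\
  &= n - \frac{n \cdot \text{Lat}_{\text{chunk}} - L_{\text{budget}}}{\text{Lat}_{\text{chunk}}} \\
  &= \frac{L_{\text{budget}}}{\text{Lat}_{\text{chunk}}}
\end{align*}
```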
Therefore |Send_List| × Lat_chunk ≤ L_budget. ✓
Since the ceiling ensures integer reduction and Dynamic_K is sorted by descending score, removal always eliminates the lowest-ranked chunks, preserving semantic quality. □
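The governor can be sketched in a few lines. The function name and signature are hypothetical. The clamp of the reduction to the list length is an addition not stated in the text, included so that a budget too small for even one chunk yields an empty list:

```python
import math


def latency_governor(dynamic_k: list, lat_chunk: float, l_budget: float) -> list:
    """Trim a score-sorted chunk list so its estimated latency fits the budget."""
    if lat_chunk <= 0:
        raise ValueError("lat_chunk must be positive")
    n = len(dynamic_k)
    total_latency = n * lat_chunk
    exceeded = max(0.0, total_latency - l_budget)
    # Assumption: never remove more chunks than exist.
    reduce_count = min(n, math.ceil(exceeded / lat_chunk))
    # The list is sorted by descending score, so slicing drops the weakest chunks.
    return dynamic_k[: n - reduce_count]


chunks = [f"chunk{i}" for i in range(10)]
send_list = latency_governor(chunks, lat_chunk=0.25, l_budget=1.0)
print(len(send_list), len(send_list) * 0.25 <= 1.0)
```

With ten chunks at 0.25 s each and a 1.0 s budget, six chunks are removed and the four highest-ranked remain.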
Directly concatenating all chunks in Send_List risks triggering the Middle Context Loss effect (Liu et al., 2024). LARA instead processes Send_List in a sliding window of 3 chunks with iterative answer refinement:
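The refinement can be written as a recurrence over batches. The symbols q (the query), B_i (the i-th batch), and the empty initial answer are notation assumed here; only Answer_i appears in the text:

```latex
\begin{align*}
B_i &= \text{chunks } 3(i-1)+1,\ \dots,\ \min(3i, N) \text{ of Send\_List} \\
\text{Answer}_0 &= \emptyset \\
\text{Answer}_i &= \text{LLM}\bigl(q,\ \text{Answer}_{i-1},\ B_i\bigr), \qquad i = 1, \dots, \lceil N/3 \rceil
\end{align*}
```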
The running answer Answer_i carries forward synthesized knowledge from all previous batches. Each LLM call processes at most 3 chunks simultaneously, ensuring that no chunk is buried in a middle position of an excessively long context. Processing N chunks in batches of 3 requires ⌈N/3⌉ LLM calls, compared to N calls for one-chunk-at-a-time refinement: a reduction of up to two-thirds (about 67%) in total LLM API calls, reached when N is a multiple of 3.
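A minimal sketch of the batch loop follows. The `llm` argument is a stand-in callable with an assumed `(query, previous_answer, batch)` signature, not the npmai API, and `echo_llm` is a toy stub used only to make the example runnable:

```python
from typing import Callable, Sequence


def refine_in_batches(
    query: str,
    send_list: Sequence[str],
    llm: Callable[[str, str, Sequence[str]], str],
    batch_size: int = 3,
) -> tuple[str, int]:
    """Feed chunks to the LLM in fixed-size batches, carrying the answer forward."""
    answer = ""
    calls = 0
    for start in range(0, len(send_list), batch_size):
        batch = send_list[start:start + batch_size]
        answer = llm(query, answer, batch)  # refine using the running answer
        calls += 1
    return answer, calls


def echo_llm(query: str, previous: str, batch: Sequence[str]) -> str:
    # Toy stub: appends the batch to the running answer instead of calling a model.
    return previous + "".join(batch)


answer, calls = refine_in_batches("q", list("abcdefg"), echo_llm)
print(answer, calls)
```

Seven chunks are handled in three calls, matching ⌈7/3⌉.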
| System | Adapts K count | Latency constraint | Addresses Mid-Context Loss | Training-free |
|---|---|---|---|---|
| Standard RAG | No | No | No | Yes |
| Adaptive-RAG | No | No | No | No |
| Self-RAG | No | No | No | No |
| FLARE | No | No | No | Yes |
| RankRAG | Partial | No | No | No |
| LARA (ours) | Yes | Yes (proven) | Yes | Yes |
Evaluation is planned on three datasets: NaturalQuestions (NQ) for single-hop factual retrieval, HotpotQA for multi-hop reasoning across documents, and a custom NPMAI corpus of 50 AI/ML technical documents representing the target deployment environment.
| Dataset | Domain | Task type | Metrics |
|---|---|---|---|
| NaturalQuestions | Open domain | Single-hop factual | Exact Match, F1 |
| HotpotQA | Multi-document | Multi-hop reasoning | EM, F1, Supporting Fact F1 |
| NPMAI Corpus | AI/ML technical | Domain QA | EM, F1, Context Noise Rate |
Baselines: Standard RAG k=4, Standard RAG k=10, Naive Two-Stage RAG (bi-encoder only, no cross-encoder), and LARA full pipeline. Context Noise Rate is defined as the proportion of retrieved chunks with cross-encoder score < 0.3 — a direct measure of noise in the retrieved context. Full benchmark results are pending.
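Context Noise Rate, as defined above, reduces to a single ratio. A minimal sketch, assuming reranker scores in [0, 1]; returning 0.0 for an empty retrieval is a convention chosen here, not one stated in the text:

```python
def context_noise_rate(scores: list[float], theta: float = 0.3) -> float:
    """Proportion of retrieved chunks whose cross-encoder score is below theta."""
    if not scores:
        return 0.0  # assumed convention: nothing retrieved means no noise
    noisy = sum(1 for s in scores if s < theta)
    return noisy / len(scores)


print(context_noise_rate([0.9, 0.5, 0.2, 0.1]))
```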
Candidate pool size. The ~200 ANN candidate pool is currently a fixed constant. For very small corpora (<200 chunks), the pool simply contains every chunk. For extremely large corpora, a proportional candidate pool (e.g., max(200, N × 0.05)) may improve recall. Ablation studies are planned.
Lat_chunk calibration. The current implementation uses a developer-provided Lat_chunk constant (default 0.1s). In practice, inference latency varies by model, server load, and chunk token length. A calibration step measuring average latency over a sample of chunks before inference would improve the governor's accuracy. This is a known limitation.
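The calibration step described here could look like the following sketch. `process_chunk` is a hypothetical callable standing in for one LLM call on one chunk, and the sleep-based stub only simulates latency:

```python
import time
from statistics import mean
from typing import Callable, Sequence


def calibrate_lat_chunk(
    process_chunk: Callable[[str], object],
    sample_chunks: Sequence[str],
) -> float:
    """Estimate Lat_chunk as the mean wall-clock time per sampled chunk."""
    if not sample_chunks:
        raise ValueError("need at least one sample chunk")
    durations = []
    for chunk in sample_chunks:
        start = time.perf_counter()
        process_chunk(chunk)
        durations.append(time.perf_counter() - start)
    return mean(durations)


# Stub that simulates roughly 10 ms of model latency per chunk.
lat_chunk = calibrate_lat_chunk(lambda chunk: time.sleep(0.01), ["a", "b", "c"])
print(f"estimated Lat_chunk: {lat_chunk:.3f}s")
```

A mean over a small sample does not capture variation with chunk token length or server load, so the estimate remains approximate.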
Threshold sensitivity. The θ = 0.3 threshold was empirically determined on the NPMAI corpus. Cross-domain generalizability requires ablation across θ ∈ {0.1, 0.2, 0.3, 0.4, 0.5}.
We presented LARA, a five-phase RAG architecture that addresses the fixed-K problem through a principled quality gate, enforces developer-specified latency budgets with a formal guarantee, and is designed to mitigate Middle Context Loss through sliding-window batch refinement. LARA is training-free, requires only standard open-source components, and has been deployed in production serving 888,000+ package installations. Full benchmark evaluation on NQ and HotpotQA is forthcoming.
System architecture
Four-layer cloud architecture running entirely on free-tier infrastructure. Every layer is connected — a request flows from SDK → Render gateway → HuggingFace models → Supabase storage.
Changelog
Release history for the npmai ecosystem.
The builder
Sonu Kumar
14-year-old · Software Developer · AI Developer · Web Developer · Cloud Developer · DevOps · TEDx speaker · Founder, NPMAI Ecosystem · 100% scholarship at Allen Career Institute, Deesha Delphi Public School Kota
Background
Self-taught from rural Bihar with no CS background in the family. Built the entire NPMAI ecosystem on free cloud infrastructure — Render, HuggingFace Spaces, Supabase, Netlify — serving nearly a million package installations. Challenged the state Chief Minister at age 11, gave a TEDx talk at 13, and is currently studying for JEE at Allen Career Institute, Kota on a full scholarship. Supported by Lokesh Chaudhary (CS & Robotics Teacher, DDPS Kota).
Mission
"Promoting Individual Journalism to every nation village so that democratic values of a nation can be strengthen so we can achieve Representative Ideal Democracy."
