Open Source Research & Development

Free LLMs.
Powering AI Agents
Powering AI Projects

npmai gives you 12 open-source large language models, a production RAG pipeline, and persistent memory — all with a single pip install. No API keys, no signup, no local GPU required.

GitHub ↗
$ pip install npmai
888K+ PyPI downloads
12 LLM models
10+ Live projects
$0 Monthly cost
python
from npmai import Ollama

llm = Ollama(
    model="llama3.2",
    temperature=0.5
)
response = llm.invoke("Explain NPMAI in one sentence.")
print(response)
Research Paper · NPMAI Ecosystem
LARA: Latency-Aware Rerank-then-Allocate Architecture for Resource-Constrained Retrieval-Augmented Generation
A five-phase adaptive RAG pipeline that replaces fixed-K retrieval with a quality-threshold dynamic system and a formally proven latency governor.
Sonu Kumar · NPMAI Ecosystem · 2025 · cs.IR · cs.CL · Training-free · Production deployed

Abstract

Retrieval-Augmented Generation (RAG) systems typically rely on a fixed retrieval parameter K, determined at design time and never adjusted at inference. We argue that this fixed-K assumption fails systematically: it injects noise into small corpora, causes recall loss in large corpora, and ignores the latency constraints of real-world deployments. We propose LARA (Latency-Aware Rerank-then-Allocate), a five-phase architecture that (1-2) uses ANN indexing for approximately O(log N) candidate capture, (3) applies a cross-encoder quality gate with a score threshold to produce a dynamic K, (4) enforces a developer-specified latency budget through a reduction formula with a proven bound, and (5) processes the final send list through sliding-window batch refinement, which is designed to avoid the Middle Context Loss problem documented by Liu et al. (2024). LARA is training-free, requires only a standard cross-encoder, and has been deployed in production in a package with 888,000+ installations.
1. Introduction

Retrieval-Augmented Generation has emerged as the dominant paradigm for grounding large language model outputs in external knowledge (Lewis et al., 2020). In the canonical RAG pipeline, a query is embedded and used to retrieve the top-K most similar chunks from a vector database, which are then concatenated into the LLM context alongside the query. Despite the sophistication of recent advances — including self-reflective retrieval (Asai et al., 2024), adaptive strategy selection (Jeong et al., 2024), and uncertainty-triggered retrieval (Jiang et al., 2023) — a fundamental assumption has remained unexamined: the value of K is fixed at design time by the developer and never adjusted at inference.

This paper argues that fixed-K retrieval is not merely suboptimal but systematically harmful across three distinct failure modes, and proposes LARA as a principled replacement.

2. Problem Statement: The Fixed-K Assumption
2.1 Three Failure Modes

Fixed-K retrieval fails in three qualitatively distinct ways:

Noise injection

When the corpus size is close to K (or smaller), the system is forced to include semantically irrelevant chunks. On a 5-chunk corpus with k=4, the LLM receives 80% of all available content regardless of relevance, which invites hallucinations from contradictory context.

Context overflow

On large corpora (500+ chunks), k=4 retrieves less than 1% of available content. Critical supporting information is never retrieved, producing incomplete or incorrect answers that appear confident.

Middle context loss

Liu et al. (2024) document a U-shaped performance degradation in LLMs: information positioned in the middle of a long context window is disproportionately ignored. Large K values systematically trigger this effect.
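The arithmetic behind the first two failure modes can be checked with a short sketch. The helper name retrieved_fraction is hypothetical and only illustrates how the share of the corpus placed in context changes with corpus size when k stays fixed.

```python
def retrieved_fraction(corpus_size: int, k: int) -> float:
    """Fraction of the corpus that a fixed-K retriever places in the context."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # A retriever cannot return more chunks than the corpus contains.
    return min(k, corpus_size) / corpus_size


for n in (5, 500, 5000):
    print(f"N={n:>5}  k=4  ->  {retrieved_fraction(n, 4):.2%} of corpus retrieved")
```

With k=4 the same setting covers most of a 5-chunk corpus and well under 1% of a 500-chunk corpus, which is the mismatch the section describes.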

2.2 Existing Approaches Do Not Solve Fixed-K

Current adaptive RAG systems address whether or when to retrieve — not how many chunks to retrieve. Adaptive-RAG (Jeong et al., 2024) selects among retrieval strategies but applies a fixed K within each strategy. Self-RAG (Asai et al., 2024) determines whether retrieval is needed at all, but retrieves a fixed K when it does. FLARE (Jiang et al., 2023) triggers retrieval on uncertainty but uses fixed K. RankRAG (Yu et al., 2024) fine-tunes the LLM for ranking but does not address the count problem.

3. The LARA Architecture

LARA processes each query through five sequential phases. The diagram below shows the complete pipeline:

Figure 1: LARA five-phase pipeline
Phase 1 & 2, ANN candidate capture: Corpus → Bi-encoder → IVF/HNSW → ~200 candidates, O(log N)
Phase 3, Quality gate: Cross-encoder → S ≥ 0.3 filter → Dynamic K emerges
Phase 4, Latency governor: T_rerank measured → L_budget computed → chunks trimmed
Phase 5, Batch refinement: Batches of 3 → running answer → final response
3.1 Phases 1 & 2 — ANN Candidate Capture

The corpus is chunked (1,000 characters, 200-character overlap) and encoded using a bi-encoder (BGE-small-en-v1.5). Embeddings are stored in a FAISS index using IVF/HNSW clustering. At query time, the query is embedded and used to retrieve approximately 200 candidates via approximate nearest neighbor search.

The use of ANN indexing avoids the O(N) cost of running a cross-encoder over all corpus chunks. Candidate lookup is approximately O(log N) for graph-based indexes such as HNSW; for IVF indexes the cost depends on the number of clusters probed. The ~200 candidate pool provides a sufficiently broad recall base while remaining computationally tractable for the cross-encoder reranking phase.
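The chunking step described above (1,000 characters with a 200-character overlap) can be sketched in plain Python. The function chunk_text is a hypothetical stand-in for illustration; the production pipeline may use a library text splitter with different boundary handling.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # each new chunk starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(2600))
pieces = chunk_text(sample)
print(len(pieces), [len(p) for p in pieces])
```

Each chunk repeats the last 200 characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.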

3.2 Phase 3 — The Quality Gate: Dynamic K

The 200 candidates are passed to a cross-encoder reranker (cross-encoder/ms-marco-MiniLM-L-6-v2), which computes a fine-grained relevance score S(c) ∈ [0, 1] for each candidate chunk c relative to the query. A threshold filter is applied:

Dynamic_K = { c ∈ Candidates | S(c) ≥ 0.3 }

The threshold θ = 0.3 was determined empirically: chunks scoring below this threshold were observed to consistently introduce contradictory or semantically irrelevant content. The critical insight is that Dynamic_K is an emergent property of relevance, not a developer hyperparameter. A narrow query against a niche corpus produces a small, high-precision Dynamic_K. A broad query against a rich corpus produces a larger Dynamic_K. The system adapts automatically without configuration.

Dynamic_K is sorted in descending order of S(c) to ensure the highest-quality chunks are preserved when the latency governor requires reduction.
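A minimal sketch of the quality gate, assuming the reranker scores have already been computed and lie in [0, 1]. The function name quality_gate and the toy scores are illustrative, not part of the npmai API.

```python
from typing import Sequence


def quality_gate(
    candidates: Sequence[str],
    scores: Sequence[float],
    theta: float = 0.3,
) -> list[tuple[str, float]]:
    """Keep candidates with score >= theta, sorted by descending score."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    kept = [(c, s) for c, s in zip(candidates, scores) if s >= theta]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


dynamic_k = quality_gate(["a", "b", "c", "d"], [0.10, 0.82, 0.30, 0.55])
print(dynamic_k)
```

The size of the returned list is the dynamic K: it depends only on how many candidates clear the threshold, not on a preset count.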

3.3 Phase 4 — The Latency Governor (with formal proof)

Real-world deployments operate under latency constraints. A developer deploying LARA specifies L_afford — the total allowable end-to-end latency in seconds. After reranking, the system measures the time consumed by phases 1–3:

L_budget = L_afford − T_rerank

The latency required to process the current Dynamic_K through the LLM is estimated as:

Total_Latency = |Dynamic_K| × Lat_chunk

Where Lat_chunk is the average time to process one chunk through the LLM (calibrated at runtime). If Total_Latency exceeds L_budget, the governor computes the minimum number of chunks to remove:

Exceeded = max(0, Total_Latency − L_budget)
Reduce_Count = ⌈ Exceeded / Lat_chunk ⌉
Send_List = Dynamic_K[: |Dynamic_K| − Reduce_Count]
Theorem (Latency Guarantee)

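For all inputs with Lat_chunk > 0 and L_budget ≥ 0, |Send_List| × Lat_chunk ≤ L_budget. If L_budget < 0 (reranking alone has exceeded L_afford), no chunk count can satisfy the bound, and Send_List should be empty.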

Proof. We consider two cases.

Case 1: Total_Latency ≤ L_budget. Then Exceeded = 0, Reduce_Count = 0, and Send_List = Dynamic_K. The latency of Send_List is Total_Latency ≤ L_budget. ✓

Case 2: Total_Latency > L_budget. Let n = |Dynamic_K|. We have Exceeded = n·Lat_chunk − L_budget. The ceiling function guarantees Reduce_Count ≥ Exceeded / Lat_chunk, so the number of remaining chunks satisfies:

|Send_List| = n − Reduce_Count ≤ n − Exceeded/Lat_chunk = n − (n − L_budget/Lat_chunk) = L_budget/Lat_chunk

Therefore |Send_List| × Lat_chunk ≤ L_budget. ✓

Since the ceiling ensures integer reduction and Dynamic_K is sorted by descending score, removal always eliminates the lowest-ranked chunks, preserving semantic quality. □
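The governor formulas can be sketched as a small function. This is an illustration under stated assumptions rather than the deployed code: the name latency_governor is hypothetical, and the explicit handling of an exhausted budget (returning an empty list) is an added guard that the formulas above do not spell out. The example uses values that are exact in binary floating point so the bound can be checked directly.

```python
import math
from typing import Sequence


def latency_governor(
    dynamic_k: Sequence[str],
    l_afford: float,
    t_rerank: float,
    lat_chunk: float,
) -> list[str]:
    """Trim the score-sorted chunk list so the estimated latency fits the budget."""
    if lat_chunk <= 0:
        raise ValueError("lat_chunk must be positive")
    l_budget = l_afford - t_rerank
    if l_budget <= 0:
        return []  # assumption: nothing can be sent once the budget is spent
    total_latency = len(dynamic_k) * lat_chunk
    exceeded = max(0.0, total_latency - l_budget)
    reduce_count = math.ceil(exceeded / lat_chunk)
    keep = max(0, len(dynamic_k) - reduce_count)
    # dynamic_k is sorted best-first, so slicing drops the lowest-ranked chunks.
    return list(dynamic_k[:keep])


chunks = [f"c{i}" for i in range(10)]
send_list = latency_governor(chunks, l_afford=3.0, t_rerank=1.0, lat_chunk=0.5)
print(send_list)
```

Here L_budget is 2.0 s, ten chunks would take an estimated 5.0 s, so six are removed and the remaining four fit the budget exactly. The guarantee concerns the estimated latency; actual latency still depends on how well Lat_chunk is calibrated.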

3.4 Phase 5 — Sliding Window Batch Refinement

Directly concatenating all chunks in Send_List risks triggering the Middle Context Loss effect (Liu et al., 2024). LARA instead processes Send_List in a sliding window of 3 chunks with iterative answer refinement:

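answer = ""  # Answer_0: no answer exists before the first batch
for i in range(0, len(Send_List), 3):
    batch = Send_List[i : i + 3]
    context = "\n---\n".join(doc.page_content for doc in batch)
    # answer holds Answer_{i-1} on entry and Answer_i after the call
    answer = LLM(prompt=f"Text: {context}\nExisting Answer: {answer}\nQuestion: {query}")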

The running answer Answer_i carries forward synthesized knowledge from all previous batches. Each LLM call processes at most 3 chunks simultaneously, ensuring that no chunk is buried in a middle position of an excessively long context. Processing N chunks in batches of 3 requires ⌈N/3⌉ LLM calls, compared to N calls for one-chunk-at-a-time refinement, a reduction of about two-thirds (roughly 67%) in total LLM calls.
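The call count can be checked with a stand-in model. In this sketch, fake_llm is a hypothetical stub that records prompts instead of calling a real LLM, and refine_in_batches is an illustrative name; only the batching and answer carry-over mirror the description above.

```python
import math
from typing import Callable, Sequence


def refine_in_batches(
    chunks: Sequence[str],
    query: str,
    llm: Callable[[str], str],
    batch_size: int = 3,
) -> tuple[str, int]:
    """Feed chunks to the LLM in batches, carrying the running answer forward."""
    answer = ""
    calls = 0
    for i in range(0, len(chunks), batch_size):
        context = "\n---\n".join(chunks[i:i + batch_size])
        prompt = (
            f"Text:\n{context}\n\n"
            f"Existing Answer: {answer}\n\n"
            f"Question: {query}\n\nAnswer:"
        )
        answer = llm(prompt)
        calls += 1
    return answer, calls


seen_prompts: list[str] = []


def fake_llm(prompt: str) -> str:
    """Stub model: remembers each prompt and returns a numbered draft."""
    seen_prompts.append(prompt)
    return f"draft-{len(seen_prompts)}"


answer, calls = refine_in_batches([f"chunk {i}" for i in range(7)], "What is LARA?", fake_llm)
print(answer, calls, math.ceil(7 / 3))
```

Seven chunks produce three calls, and each prompt after the first contains the draft returned by the previous call.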

4. Comparison with Existing Systems
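| System | Adapts K count | Latency constraint | Addresses Mid-Context Loss | Training-free |
|---|---|---|---|---|
| Standard RAG | No | No | No | Yes |
| Adaptive-RAG | No | No | No | No |
| Self-RAG | No | No | No | No |
| FLARE | No | No | No | Yes |
| RankRAG | Partial | No | No | No |
| LARA (ours) | Yes | Yes (proven) | Yes | Yes |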
5. Implementation
python — full LARA pipeline
import math
import time

from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from sentence_transformers import CrossEncoder
from npmai import Ollama

# ── Setup ──────────────────────────────────────────
emb = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
    query_instruction="Represent this sentence for searching relevant passages: ",
)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
llm = Ollama(model="llama3.2", temperature=0.5)


def lara_retrieve(query, vectordb, L_afford=10.0, lat_chunk=0.1):
    # Phase 1 & 2: ANN candidate capture
    # (lookup cost depends on the FAISS index type behind vectordb)
    t_start = time.time()
    candidates = vectordb.similarity_search(query, k=200)
    if not candidates:
        return "No relevant information found."

    # Phase 3: quality gate (Dynamic K).
    # Assumes reranker scores are comparable to the 0.3 threshold; if the
    # checkpoint returns raw logits, map them to [0, 1] with a sigmoid first.
    pairs = [(query, doc.page_content) for doc in candidates]
    scores = reranker.predict(pairs)
    scored = [(doc, s) for doc, s in zip(candidates, scores) if s >= 0.3]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    dynamic_k = [doc for doc, _ in scored]

    # Phase 4: latency governor
    t_rerank = time.time() - t_start
    l_budget = L_afford - t_rerank
    total_lat = len(dynamic_k) * lat_chunk
    exceeded = max(0.0, total_lat - l_budget)
    reduce_count = math.ceil(exceeded / lat_chunk)
    # Clamp at zero so an exhausted budget yields an empty send list.
    send_list = dynamic_k[: max(0, len(dynamic_k) - reduce_count)]

    # Phase 5: sliding window batch refinement
    running_answer = ""  # plain string, so the prompt never shows list brackets
    for i in range(0, len(send_list), 3):
        batch = send_list[i : i + 3]
        context = "\n---\n".join(d.page_content for d in batch)
        prompt = (
            f"Text:\n{context}\n\n"
            f"Existing Answer: {running_answer}\n\n"
            f"Question: {query}\n\nAnswer:"
        )
        running_answer = llm.invoke(prompt)

    return running_answer or "No relevant information found."
6. Experimental Setup

Evaluation is planned on three datasets: NaturalQuestions (NQ) for single-hop factual retrieval, HotpotQA for multi-hop reasoning across documents, and a custom NPMAI corpus of 50 AI/ML technical documents representing the target deployment environment.

| Dataset | Domain | Task type | Metrics |
|---|---|---|---|
| NaturalQuestions | Open domain | Single-hop factual | Exact Match, F1 |
| HotpotQA | Multi-document | Multi-hop reasoning | EM, F1, Supporting Fact F1 |
| NPMAI Corpus | AI/ML technical | Domain QA | EM, F1, Context Noise Rate |

Baselines: Standard RAG k=4, Standard RAG k=10, Naive Two-Stage RAG (bi-encoder only, no cross-encoder), and LARA full pipeline. Context Noise Rate is defined as the proportion of retrieved chunks with cross-encoder score < 0.3 — a direct measure of noise in the retrieved context. Full benchmark results are pending.
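Context Noise Rate as defined above reduces to a one-line computation over reranker scores. The sketch below assumes scores in [0, 1]; returning 0.0 for an empty context is an illustrative convention, not something the definition specifies.

```python
from typing import Sequence


def context_noise_rate(scores: Sequence[float], theta: float = 0.3) -> float:
    """Proportion of retrieved chunks whose cross-encoder score is below theta."""
    if not scores:
        return 0.0  # assumption: an empty context contains no noise
    noisy = sum(1 for s in scores if s < theta)
    return noisy / len(scores)


print(context_noise_rate([0.9, 0.1, 0.25, 0.6]))
```

A fixed-K baseline is scored on whatever it retrieves, so the metric directly exposes how many of its K chunks would have been rejected by the quality gate.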

7. Limitations

Candidate pool size. The ~200 ANN candidate pool is currently a fixed constant. For very small corpora (<200 chunks), the pool simply contains the whole corpus. For extremely large corpora, a proportional candidate pool (e.g., max(200, N × 0.05)) may improve recall. Ablation studies are planned.
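One possible form of a proportional pool is sketched below. It takes the larger of the fixed pool and a 5% share of the corpus (a plain min would never exceed 200) and caps the result at the corpus size. The function name and this exact rule are assumptions about the intent, pending the planned ablations.

```python
def candidate_pool_size(n_chunks: int, base: int = 200, fraction: float = 0.05) -> int:
    """Candidate pool that grows with the corpus but never exceeds it."""
    if n_chunks < 0:
        raise ValueError("n_chunks must be non-negative")
    proportional = int(n_chunks * fraction)  # rounded down to a whole chunk count
    return min(n_chunks, max(base, proportional))


for n in (50, 1_000, 100_000):
    print(n, candidate_pool_size(n))
```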

Lat_chunk calibration. The current implementation uses a developer-provided Lat_chunk constant (default 0.1s). In practice, inference latency varies by model, server load, and chunk token length. A calibration step measuring average latency over a sample of chunks before inference would improve the governor's accuracy. This is a known limitation.
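The calibration step suggested here could look like the following sketch. The names calibrate_lat_chunk and process_chunk are hypothetical; process_chunk stands for whatever call sends a single chunk through the LLM, and the injectable clock exists only so the example can be checked without real timing.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def calibrate_lat_chunk(
    process_chunk: Callable[[str], object],
    sample_chunks: Sequence[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Average wall-clock seconds needed to process one chunk from the sample."""
    if not sample_chunks:
        raise ValueError("need at least one sample chunk")
    durations = []
    for chunk in sample_chunks:
        start = clock()
        process_chunk(chunk)
        durations.append(clock() - start)
    return mean(durations)


# Deterministic demo: a fake clock that advances 0.5 s, then 0.25 s.
ticks = iter([0.0, 0.5, 1.0, 1.25])
estimate = calibrate_lat_chunk(lambda chunk: None, ["a", "b"], clock=lambda: next(ticks))
print(estimate)
```

The resulting estimate would replace the fixed 0.1 s default, and could be refreshed periodically if server load changes.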

Threshold sensitivity. The θ = 0.3 threshold was empirically determined on the NPMAI corpus. Cross-domain generalizability requires ablation across θ ∈ {0.1, 0.2, 0.3, 0.4, 0.5}.

8. Conclusion

We presented LARA, a five-phase RAG architecture that addresses the fixed-K problem through a principled quality gate, enforces developer-specified latency budgets with formal mathematical guarantees, and eliminates Middle Context Loss through sliding-window batch refinement. LARA is training-free, requires only standard open-source components, and has been deployed in production serving 888,000+ package installations. Full benchmark evaluation on NQ and HotpotQA is forthcoming.

References
Asai, A., et al. (2024). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. ICLR 2024.
Guu, K., et al. (2020). Retrieval augmented language model pre-training. ICML 2020.
Jeong, S., et al. (2024). Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. NAACL 2024.
Jiang, Z., et al. (2023). Active retrieval augmented generation. EMNLP 2023.
Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS 2020.
Liu, N. F., et al. (2024). Lost in the middle: How language models use long contexts. TACL 2024.
Yu, W., et al. (2024). RankRAG: Unifying context ranking with retrieval-augmented generation. NeurIPS 2024.

System architecture

Four-layer cloud architecture running entirely on free-tier infrastructure. Every layer is connected — a request flows from SDK → Render gateway → HuggingFace models → Supabase storage.

Applications (user-facing layer)
  NPM AutoCode AI · NPM Debater AI · NPM Rag AI · NPM Legal AI · NPM Journalist · NPM YouTube
↓ import from npmai
npmai SDK (pip install npmai)
  Ollama class · Memory class · Rag class · LangChain-compatible · Dual-gateway failover
↓ HTTP POST → npmai-api.onrender.com
Render gateway (FastAPI · load balancer · schema validation)
  Primary API · Model-in-use tracker · Pydantic schema validation · Concurrency management
↓ failover to HuggingFace Spaces
HuggingFace Spaces (10 dedicated model endpoints)
  Llama 3.2 /llm · Qwen 2.5 Coder /qwen · Mistral 7b /llm · CodeLLaMA /codellama · Gemma 2 /gemma · + 5 more models
  RAG ingestion /ingestion · OCR · Whisper · FAISS
↕ FAISS index read/write
Supabase (persistent vector and video storage)
  NPMRagWebVectorDB bucket · NPMSMAVIDEODB bucket · Public access (no key) · Private (secret_key)
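The failover path between gateways can be illustrated generically. This sketch is not the npmai client: invoke_with_failover and the two stand-in gateway functions are hypothetical, and real gateways would be HTTP calls rather than local functions.

```python
from typing import Callable, Sequence


def invoke_with_failover(prompt: str, gateways: Sequence[Callable[[str], str]]) -> str:
    """Try each gateway in order and return the first successful response."""
    last_error = None
    for call in gateways:
        try:
            return call(prompt)
        except Exception as exc:  # any failure moves on to the next gateway
            last_error = exc
    raise RuntimeError("all gateways failed") from last_error


def primary(prompt: str) -> str:
    """Stand-in for the primary gateway, simulated as being down."""
    raise ConnectionError("primary gateway unavailable")


def fallback(prompt: str) -> str:
    """Stand-in for the secondary gateway."""
    return f"ok:{prompt}"


result = invoke_with_failover("hi", [primary, fallback])
print(result)
```

Ordering the list puts the preferred gateway first; the caller only sees an error when every gateway in the list has failed.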

Changelog

Release history for the npmai ecosystem.

The builder

Sonu Kumar

14-year-old · Software Developer · AI Developer · Web Developer · Cloud Developer · DevOps · TEDx speaker · Founder, NPMAI Ecosystem · 100% scholarship at Allen Career Institute, Deesha Delphi Public School Kota

888K+ PyPI downloads
431K+ Facebook followers
TEDx Speaker at age 13
100% Allen scholarship

Background

Self-taught from rural Bihar with no CS background in the family. Built the entire NPMAI ecosystem on free cloud infrastructure — Render, HuggingFace Spaces, Supabase, Netlify — serving nearly a million package installations. Challenged the state Chief Minister at age 11, gave a TEDx talk at 13, and is currently studying for JEE at Allen Career Institute, Kota on a full scholarship. Supported by Lokesh Chaudhary (CS & Robotics Teacher, DDPS Kota).

Mission

"Promoting Individual Journalism to every nation village so that democratic values of a nation can be strengthen so we can achieve Representative Ideal Democracy."
