Open Source Research & Development

Free LLMs.
Powering AI Agents
Powering AI Projects

npmai gives you 12 open-source large language models, a production RAG pipeline, and persistent memory — all with a single pip install. No API keys, no signup, no local GPU required.

$ pip install npmai

888K+

PyPI downloads

12

LLM models

10+

Live projects

$0

Monthly cost

Available models: usage example

python

from npmai import Ollama

llm = Ollama(
    model="llama3.2",
    temperature=0.5
)

response = llm.invoke("Explain NPMAI in one sentence.")
print(response)
      

Research Paper · NPMAI Ecosystem

LARA: Latency-Aware Rerank-then-Allocate Architecture for Resource-Constrained Retrieval-Augmented Generation

A five-phase adaptive RAG pipeline that replaces fixed-K retrieval with a quality-threshold dynamic system and a formally proven latency governor.

Sonu Kumar · NPMAI Ecosystem · 2025 · cs.IR · cs.CL · Training-free · Production deployed

Abstract. Retrieval-Augmented Generation (RAG) systems typically rely on a fixed retrieval parameter K, determined at design time and never adjusted at inference. We argue that this fixed-K assumption fails systematically: it injects noise into small corpora, causes recall loss in large corpora, and ignores the latency constraints of real-world deployments. We propose LARA (Latency-Aware Rerank-then-Allocate), a five-phase architecture that (1) uses ANN indexing for O(log N) candidate capture (Phases 1 and 2), (2) applies a cross-encoder quality gate with a score threshold to produce a dynamic K (Phase 3), (3) enforces a developer-specified latency budget through a mathematically proven reduction formula (Phase 4), and (4) processes the final send list through sliding-window batch refinement (Phase 5), eliminating the Middle Context Loss problem documented by Liu et al. (2024). LARA is training-free, requires only a standard cross-encoder, and has been deployed in production serving 888,000+ installations.

1. Introduction

Retrieval-Augmented Generation has emerged as the dominant paradigm for grounding large language model outputs in external knowledge (Lewis et al., 2020). In the canonical RAG pipeline, a query is embedded and used to retrieve the top-K most similar chunks from a vector database, which are then concatenated into the LLM context alongside the query. Despite the sophistication of recent advances — including self-reflective retrieval (Asai et al., 2024), adaptive strategy selection (Jeong et al., 2024), and uncertainty-triggered retrieval (Jiang et al., 2023) — a fundamental assumption has remained unexamined: the value of K is fixed at design time by the developer and never adjusted at inference.

This paper argues that fixed-K retrieval is not merely suboptimal but systematically harmful across three distinct failure modes, and proposes LARA as a principled replacement.

2. Problem Statement: The Fixed-K Assumption

2.1 Three Failure Modes

Fixed-K retrieval fails in three qualitatively distinct ways:

Noise injection

When the corpus is only slightly larger than K (or smaller), the system is forced to include semantically irrelevant chunks. On a 5-chunk corpus with k=4, the LLM receives 80% of all available content regardless of relevance, which can produce hallucinations from contradictory context.

Recall loss

On large corpora (500+ chunks), k=4 retrieves less than 1% of available content. Critical supporting information is never retrieved, producing incomplete or incorrect answers that appear confident.

Middle context loss

Liu et al. (2024) document a U-shaped performance degradation in LLMs: information positioned in the middle of a long context window is disproportionately ignored. Large K values systematically trigger this effect.
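
The arithmetic behind the first two failure modes can be checked directly. The sketch below is illustrative only: `fixed_k_coverage` is a hypothetical helper name introduced here, not part of npmai or of the LARA implementation.

```python
def fixed_k_coverage(corpus_size: int, k: int) -> float:
    """Fraction of the corpus that a fixed-K retriever places in the LLM context."""
    if corpus_size <= 0 or k <= 0:
        raise ValueError("corpus_size and k must be positive")
    # A retriever cannot return more chunks than the corpus holds.
    return min(k, corpus_size) / corpus_size


small = fixed_k_coverage(corpus_size=5, k=4)    # tiny corpus: most chunks are sent
large = fixed_k_coverage(corpus_size=500, k=4)  # large corpus: almost nothing is sent
print(f"5-chunk corpus:   {small:.1%} of chunks retrieved")
print(f"500-chunk corpus: {large:.1%} of chunks retrieved")
```

The same K therefore means very different things depending on corpus size, which is the motivation for letting K emerge from relevance scores.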

2.2 Existing Approaches Do Not Solve Fixed-K

Current adaptive RAG systems address whether or when to retrieve — not how many chunks to retrieve. Adaptive-RAG (Jeong et al., 2024) selects among retrieval strategies but applies a fixed K within each strategy. Self-RAG (Asai et al., 2024) determines whether retrieval is needed at all, but retrieves a fixed K when it does. FLARE (Jiang et al., 2023) triggers retrieval on uncertainty but uses fixed K. RankRAG (Yu et al., 2024) fine-tunes the LLM for ranking but does not address the count problem.

3. The LARA Architecture

LARA processes each query through five sequential phases. The diagram below shows the complete pipeline:

Figure 1. LARA five-phase pipeline

Phases 1 & 2, ANN candidate capture: Corpus → Bi-encoder → IVF/HNSW → ~200 candidates, O(log N)

Phase 3, Quality gate: Cross-encoder → S ≥ 0.3 filter → Dynamic K emerges

Phase 4, Latency governor: T_rerank measured → L_budget computed → chunks trimmed

Phase 5, Batch refinement: Batches of 3 → running answer → final response

3.1 Phases 1 & 2 — ANN Candidate Capture

The corpus is chunked (1,000 characters, 200-character overlap) and encoded using a bi-encoder (BGE-small-en-v1.5). Embeddings are stored in a FAISS index using IVF/HNSW clustering. At query time, the query is embedded and used to retrieve approximately 200 candidates via approximate nearest neighbor search.

The use of ANN indexing reduces retrieval complexity from O(N), the cost of running a cross-encoder over all corpus chunks, to roughly O(log N) for graph-based indexes such as HNSW (for IVF, search cost instead depends on the number of clusters probed). The ~200-candidate pool provides a sufficiently broad recall base while remaining computationally tractable for the cross-encoder reranking phase.
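
The chunking step described at the start of this subsection (1,000 characters with a 200-character overlap) can be sketched in plain Python. This is a minimal illustration, assuming simple character-offset splitting; the deployed pipeline may use a library text splitter instead, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new chunk starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


document = "".join(chr(97 + i % 26) for i in range(2600))
pieces = chunk_text(document)
print(len(pieces), [len(p) for p in pieces])
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still seen whole by at least one chunk.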

3.2 Phase 3 — The Quality Gate: Dynamic K

The 200 candidates are passed to a cross-encoder reranker (cross-encoder/ms-marco-MiniLM-L-6-v2), which computes a fine-grained relevance score S(c) ∈ [0, 1] for each candidate chunk c relative to the query. A threshold filter is applied:

Dynamic_K = { c ∈ Candidates | S(c) ≥ 0.3 }

The threshold θ = 0.3 was determined empirically: chunks scoring below this threshold were observed to consistently introduce contradictory or semantically irrelevant content. The critical insight is that Dynamic_K is an emergent property of relevance, not a developer hyperparameter. A narrow query against a niche corpus produces a small, high-precision Dynamic_K. A broad query against a rich corpus produces a larger Dynamic_K. The system adapts automatically without configuration.

Dynamic_K is sorted in descending order of S(c) to ensure the highest-quality chunks are preserved when the latency governor requires reduction.
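
The gate itself is a filter followed by a sort. The sketch below shows that logic on precomputed scores, assuming the scores are already in [0, 1]; `quality_gate` is a hypothetical helper name and the example scores are made up for illustration.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def quality_gate(candidates: Sequence[T], scores: Sequence[float], theta: float = 0.3) -> list[T]:
    """Keep candidates scoring at least theta, ordered best first (Dynamic_K)."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    kept = [(score, cand) for cand, score in zip(candidates, scores) if score >= theta]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # descending S(c)
    return [cand for _, cand in kept]


# Illustrative scores only: a narrow query keeps few chunks, a broad one keeps more.
narrow = quality_gate(["a", "b", "c", "d"], [0.91, 0.12, 0.05, 0.44])
broad = quality_gate(["a", "b", "c", "d"], [0.81, 0.62, 0.35, 0.44])
print(len(narrow), narrow)
print(len(broad), broad)
```

No K is passed anywhere: the length of the returned list is whatever the relevance scores produce.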

3.3 Phase 4 — The Latency Governor (with formal proof)

Real-world deployments operate under latency constraints. A developer deploying LARA specifies L_afford, the total allowable end-to-end latency in seconds. After reranking, the system measures T_rerank, the time consumed by Phases 1–3, and computes the remaining budget:

L_budget = L_afford − T_rerank

The latency required to process the current Dynamic_K through the LLM is estimated as:

Total_Latency = |Dynamic_K| × Lat_chunk

where Lat_chunk is the average time to process one chunk through the LLM (a developer-provided constant in the current implementation, default 0.1 s; see Section 7). If Total_Latency exceeds L_budget, the governor computes the minimum number of chunks to remove:

Exceeded = max(0, Total_Latency − L_budget)

Reduce_Count = ⌈ Exceeded / Lat_chunk ⌉

Send_List = Dynamic_K[: |Dynamic_K| − Reduce_Count]

Theorem (Latency Guarantee)

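For all inputs with Lat_chunk > 0 and L_budget ≥ 0, |Send_List| × Lat_chunk ≤ L_budget. (When L_budget < 0, that is, when reranking alone exceeds L_afford, no Send_List can satisfy the bound.)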

Proof. We consider two cases.

Case 1: Total_Latency ≤ L_budget. Then Exceeded = 0, Reduce_Count = 0, and Send_List = Dynamic_K. The latency of Send_List is Total_Latency ≤ L_budget. ✓

Case 2: Total_Latency > L_budget. Let n = |Dynamic_K|. We have Exceeded = n·Lat_chunk − L_budget. The ceiling function guarantees Reduce_Count ≥ Exceeded / Lat_chunk, so the number of remaining chunks satisfies:

|Send_List| = n − Reduce_Count ≤ n − Exceeded/Lat_chunk = n − (n − L_budget/Lat_chunk) = L_budget/Lat_chunk

Therefore |Send_List| × Lat_chunk ≤ L_budget. ✓

Since the ceiling ensures integer reduction and Dynamic_K is sorted by descending score, removal always eliminates the lowest-ranked chunks, preserving semantic quality. □
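
The governor and its guarantee can be exercised in isolation. The following is a minimal sketch of the formulas above, not the deployed code: `latency_governor` is a hypothetical name, and the clamp to zero kept chunks is an added assumption for the case where Reduce_Count exceeds |Dynamic_K|.

```python
import math
from typing import Sequence, TypeVar

T = TypeVar("T")


def latency_governor(dynamic_k: Sequence[T], l_afford: float, t_rerank: float, lat_chunk: float) -> list[T]:
    """Trim the lowest-ranked chunks so the estimated LLM time fits the remaining budget."""
    if lat_chunk <= 0:
        raise ValueError("lat_chunk must be positive")
    l_budget = l_afford - t_rerank
    total_latency = len(dynamic_k) * lat_chunk
    exceeded = max(0.0, total_latency - l_budget)
    reduce_count = math.ceil(exceeded / lat_chunk)
    keep = max(0, len(dynamic_k) - reduce_count)  # assumption: never slice with a negative bound
    return list(dynamic_k[:keep])                 # dynamic_k is sorted best first


# Worked example: 40 chunks at 0.5 s each need 20 s, but only 10 - 2 = 8 s remain.
chunks = list(range(40))
send_list = latency_governor(chunks, l_afford=10.0, t_rerank=2.0, lat_chunk=0.5)
print(len(send_list), len(send_list) * 0.5)
```

In the worked example Exceeded is 12 s, so Reduce_Count is 24 and 16 chunks remain, which cost exactly the 8 s budget.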

3.4 Phase 5 — Sliding Window Batch Refinement

Directly concatenating all chunks in Send_List risks triggering the Middle Context Loss effect (Liu et al., 2024). LARA instead processes Send_List in a sliding window of 3 chunks with iterative answer refinement:

Answer = ""
for i in range(0, len(Send_List), 3):
    batch = Send_List[i : i + 3]
    context = "\n---\n".join(doc.page_content for doc in batch)
    # Each call sees at most 3 chunks plus the answer built so far.
    Answer = LLM(prompt=f"Text: {context}\nExisting Answer: {Answer}\nQuestion: {query}")

The running answer carries forward synthesized knowledge from all previous batches. Each LLM call processes at most 3 chunks simultaneously, ensuring that no chunk is buried in a middle position of an excessively long context. Processing N chunks in batches of 3 requires ⌈N/3⌉ LLM calls, compared to N calls for one-chunk-at-a-time refinement, a reduction of about two-thirds (67%) for large N.
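
The call-count claim can be verified with a stand-in model. In the sketch below, `refine_in_batches` and `fake_llm` are hypothetical names; `fake_llm` only records prompts and does no real inference, so this demonstrates the control flow rather than answer quality.

```python
import math
from typing import Callable, Sequence


def refine_in_batches(chunks: Sequence[str], query: str,
                      llm: Callable[[str], str], batch_size: int = 3) -> tuple[str, int]:
    """Feed chunks to the LLM in batches, carrying the running answer forward."""
    answer, calls = "", 0
    for i in range(0, len(chunks), batch_size):
        context = "\n---\n".join(chunks[i:i + batch_size])
        prompt = f"Text: {context}\nExisting Answer: {answer}\nQuestion: {query}"
        answer = llm(prompt)
        calls += 1
    return answer, calls


seen_prompts: list[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: remembers the prompt and returns a placeholder answer."""
    seen_prompts.append(prompt)
    return f"draft {len(seen_prompts)}"


chunks = [f"chunk {n}" for n in range(10)]
final_answer, calls = refine_in_batches(chunks, "What is LARA?", fake_llm)
print(calls, math.ceil(len(chunks) / 3), final_answer)
```

Each prompt after the first contains the previous draft, which is how knowledge from earlier batches reaches later calls.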

4. Comparison with Existing Systems

System	Adapts K count	Latency constraint	Addresses Mid-Context Loss	Training-free
Standard RAG	No	No	No	Yes
Adaptive-RAG	No	No	No	No
Self-RAG	No	No	No	No
FLARE	No	No	No	Yes
RankRAG	Partial	No	No	No
LARA (ours)	Yes	Yes (proven)	Yes	Yes

5. Implementation

python — full LARA pipeline

import math
import time

from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS  # vectordb is expected to be a FAISS store
from sentence_transformers import CrossEncoder
from npmai import Ollama

# Setup
emb = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
    query_instruction="Represent this sentence for searching relevant passages: "
)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
llm = Ollama(model="llama3.2", temperature=0.5)

def lara_retrieve(query, vectordb, L_afford=10.0, lat_chunk=0.1):
    """Run LARA Phases 1-5 for one query against a vector store built with `emb`."""
    if lat_chunk <= 0:
        raise ValueError("lat_chunk must be positive")

    # Phases 1 & 2: candidate capture (search cost depends on the FAISS index type)
    t_start = time.perf_counter()
    candidates = vectordb.similarity_search(query, k=200)
    if not candidates:
        return "No relevant information found."

    # Phase 3: quality gate, Dynamic K
    # The 0.3 threshold assumes scores in [0, 1]. If your installed
    # sentence-transformers version returns raw logits for this model,
    # map them through a sigmoid before filtering.
    pairs = [(query, doc.page_content) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    dynamic_k = [doc for score, doc in ranked if score >= 0.3]

    # Phase 4: latency governor (trims the lowest-ranked chunks first)
    t_rerank = time.perf_counter() - t_start
    l_budget = L_afford - t_rerank
    exceeded = max(0.0, len(dynamic_k) * lat_chunk - l_budget)
    reduce_count = math.ceil(exceeded / lat_chunk)
    send_list = dynamic_k[: max(0, len(dynamic_k) - reduce_count)]

    # Phase 5: sliding window batch refinement
    running_answer = ""  # plain string, so the prompt never contains a list repr
    for i in range(0, len(send_list), 3):
        batch = send_list[i : i + 3]
        context = "\n---\n".join(d.page_content for d in batch)
        prompt = (
            f"Text:\n{context}\n\n"
            f"Existing Answer: {running_answer}\n\n"
            f"Question: {query}\n\nAnswer:"
        )
        running_answer = llm.invoke(prompt)

    return running_answer or "No relevant information found."
    

6. Experimental Setup

Evaluation is planned on three datasets: NaturalQuestions (NQ) for single-hop factual retrieval, HotpotQA for multi-hop reasoning across documents, and a custom NPMAI corpus of 50 AI/ML technical documents representing the target deployment environment.

Dataset	Domain	Task type	Metrics
NaturalQuestions	Open domain	Single-hop factual	Exact Match, F1
HotpotQA	Multi-document	Multi-hop reasoning	EM, F1, Supporting Fact F1
NPMAI Corpus	AI/ML technical	Domain QA	EM, F1, Context Noise Rate

Baselines: Standard RAG k=4, Standard RAG k=10, Naive Two-Stage RAG (bi-encoder only, no cross-encoder), and LARA full pipeline. Context Noise Rate is defined as the proportion of retrieved chunks with cross-encoder score < 0.3 — a direct measure of noise in the retrieved context. Full benchmark results are pending.
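
Context Noise Rate as defined above is straightforward to compute from reranker scores. The sketch below is a minimal illustration with made-up scores; `context_noise_rate` is a hypothetical name, and returning 0.0 for an empty context is an assumption, since the definition does not cover that case.

```python
from typing import Sequence


def context_noise_rate(scores: Sequence[float], theta: float = 0.3) -> float:
    """Proportion of retrieved chunks whose cross-encoder score is below theta."""
    if not scores:
        return 0.0  # assumption: an empty context contains no noise
    noisy = sum(1 for s in scores if s < theta)
    return noisy / len(scores)


# Illustrative scores for a fixed k=4 retrieval: two of the four chunks are noise.
print(context_noise_rate([0.90, 0.10, 0.25, 0.60]))
```

By construction, any context that has passed the θ = 0.3 quality gate has a Context Noise Rate of zero, so this metric mainly characterizes the fixed-K baselines.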

7. Limitations

Candidate pool size. The ~200 ANN candidate pool is currently a fixed constant. For small corpora (<200 chunks), every chunk simply becomes a candidate. For extremely large corpora, a proportional candidate pool (e.g., max(200, 0.05 × N)) may improve recall. Ablation studies are planned.

Lat_chunk calibration. The current implementation uses a developer-provided Lat_chunk constant (default 0.1s). In practice, inference latency varies by model, server load, and chunk token length. A calibration step measuring average latency over a sample of chunks before inference would improve the governor's accuracy. This is a known limitation.
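
One possible shape for such a calibration step is sketched below. It is an assumption about how calibration could be done, not the current implementation: `calibrate_lat_chunk` and `process_chunk` are hypothetical names, and the clock is injectable so the function can be tested without real inference.

```python
import time
from typing import Callable, Sequence


def calibrate_lat_chunk(sample_chunks: Sequence[str],
                        process_chunk: Callable[[str], object],
                        clock: Callable[[], float] = time.perf_counter) -> float:
    """Estimate Lat_chunk as the mean wall-clock time to process one sample chunk."""
    if not sample_chunks:
        raise ValueError("need at least one sample chunk")
    start = clock()
    for chunk in sample_chunks:
        process_chunk(chunk)  # e.g. one LLM call per chunk
    return (clock() - start) / len(sample_chunks)


# Stand-in workload: upper-casing a string takes the place of an LLM call here.
estimate = calibrate_lat_chunk(["alpha", "beta", "gamma"], str.upper)
print(f"estimated Lat_chunk: {estimate:.6f} s")
```

The estimate could then be passed as the Lat_chunk value instead of the 0.1 s default. Because Phase 5 sends chunks in batches of 3, a calibration that times whole batches and divides by 3 may track real latency more closely.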

Threshold sensitivity. The θ = 0.3 threshold was empirically determined on the NPMAI corpus. Cross-domain generalizability requires ablation across θ ∈ {0.1, 0.2, 0.3, 0.4, 0.5}.
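
A first step toward that ablation is to see how the size of Dynamic_K responds to θ. The sketch below counts surviving chunks per threshold on made-up scores; `sweep_thresholds` is a hypothetical name, and the sweep measures only context size, not answer quality.

```python
from typing import Iterable, Sequence


def sweep_thresholds(scores: Sequence[float],
                     thetas: Iterable[float] = (0.1, 0.2, 0.3, 0.4, 0.5)) -> dict[float, int]:
    """Map each candidate threshold to the number of chunks that would pass the gate."""
    return {theta: sum(1 for s in scores if s >= theta) for theta in thetas}


# Illustrative reranker scores for one query.
example_scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.95]
for theta, size in sweep_thresholds(example_scores).items():
    print(f"theta={theta:.1f} -> |Dynamic_K| = {size}")
```

A full ablation would pair each θ with the EM and F1 metrics from Section 6 rather than with chunk counts alone.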

8. Conclusion

We presented LARA, a five-phase RAG architecture that addresses the fixed-K problem through a principled quality gate, enforces developer-specified latency budgets with formal mathematical guarantees, and eliminates Middle Context Loss through sliding-window batch refinement. LARA is training-free, requires only standard open-source components, and has been deployed in production serving 888,000+ package installations. Full benchmark evaluation on NQ and HotpotQA is forthcoming.

References

Asai, A., et al. (2024). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. ICLR 2024.

Guu, K., et al. (2020). Retrieval augmented language model pre-training. ICML 2020.

Jeong, S., et al. (2024). Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. NAACL 2024.

Jiang, Z., et al. (2023). Active retrieval augmented generation. EMNLP 2023.

Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS 2020.

Liu, N. F., et al. (2024). Lost in the middle: How language models use long contexts. TACL 2024.

Yu, Y., et al. (2024). RankRAG: Unifying context ranking with retrieval-augmented generation in LLMs. NeurIPS 2024.

System architecture

Four-layer cloud architecture running entirely on free-tier infrastructure. Every layer is connected — a request flows from SDK → Render gateway → HuggingFace models → Supabase storage.

Applications (user-facing layer): NPM AutoCode AI; NPM Debater AI; NPM Rag AI; NPM Legal AI; NPM Journalist; NPM YouTube

↓ import from npmai

npmai SDK (pip install npmai): Ollama class; Memory class; Rag class; LangChain-compatible; Dual-gateway failover

↓ HTTP POST → npmai-api.onrender.com

Render gateway (FastAPI · load balancer · schema validation): Primary API; Model-in-use tracker; Pydantic schema validation; Concurrency management

↓ failover to HuggingFace Spaces

HuggingFace Spaces (10 dedicated model endpoints): Llama 3.2 /llm; Qwen 2.5 Coder /qwen; Mistral 7b /llm; CodeLLaMA /codellama; Gemma 2 /gemma; + 5 more models; RAG ingestion /ingestion; OCR · Whisper · FAISS

↕ FAISS index read/write

Supabase (persistent vector and video storage): NPMRagWebVectorDB bucket; NPMSMAVIDEODB bucket; Public access (no key); Private (secret_key)

Changelog

Release history for the npmai ecosystem.

The builder

Sonu Kumar

14-year-old · Software Developer · AI Developer · Web Developer · Cloud Developer · DevOps · TEDx speaker · Founder, NPMAI Ecosystem · 100% scholarship at Allen Career Institute, Deesha Delphi Public School Kota

888K+

PyPI downloads

431K+

Facebook followers

TEDx

Speaker at age 13

100%

Allen scholarship

Background

Self-taught from rural Bihar with no CS background in the family. Built the entire NPMAI ecosystem on free cloud infrastructure — Render, HuggingFace Spaces, Supabase, Netlify — serving nearly a million package installations. Challenged the state Chief Minister at age 11, gave a TEDx talk at 13, and is currently studying for JEE at Allen Career Institute, Kota on a full scholarship. Supported by Lokesh Chaudhary (CS & Robotics Teacher, DDPS Kota).

Mission

"Promoting Individual Journalism to every nation village so that democratic values of a nation can be strengthen so we can achieve Representative Ideal Democracy."
