From Question to Citation-Backed Answer: Inside a Scientific RAG Pipeline

When you type a question into Transparent Lab and press enter, you get back a cited answer in a few seconds. It references specific passages from papers you uploaded. It shows which figure came from which study. The sources are ones you chose.

That experience is the point of the whole product. But what actually happens in those few seconds is worth understanding — not because you need to know it to use the tool, but because understanding the architecture is what separates a grounded trust in the system from a blind one.

This post walks through the pipeline from query to answer. We’ll keep the technical depth at “curious researcher” level: enough to understand why each step exists, without requiring a background in machine learning to follow along.

Summary

Transparent Lab uses Retrieval-Augmented Generation (RAG) — it retrieves evidence from your library before generating a response, rather than generating from memory
Your query goes through several processing steps before any retrieval happens: rewriting for clarity, scoring for complexity, and extracting key entities
Retrieval combines two complementary methods: semantic similarity (meaning-based) and keyword matching, blended at a 70/30 ratio
The system understands paper structure — it retrieves differently depending on whether your question is about methods, results, or background
A final evaluation layer checks the answer for consistency with sources before it reaches you
Every step is designed around one constraint: the system can only cite what you gave it

The Core Architecture: Retrieval First, Generation Second

Most AI assistants are built on a large language model trained on enormous amounts of text, and they generate answers from the statistical patterns that training produced. The model “knows” things because it has compressed millions of documents into its parameters. When you ask a question, it retrieves nothing; it generates based on what it learned to predict.

This architecture has an obvious failure mode for research: the model may confidently produce a plausible-sounding answer that doesn’t correspond to any real paper. It’s not lying, exactly. It’s pattern-matching. But a pattern-matched citation that doesn’t exist is worthless — or worse, actively misleading if you trust it.

Transparent Lab uses a fundamentally different approach called Retrieval-Augmented Generation (RAG). The short version: before generating anything, the system retrieves actual text passages from your uploaded papers. The language model then writes an answer grounded in those passages, and cites them directly. Generation is constrained by retrieval.

The practical consequence: the system cannot cite a paper you didn’t upload, because it has no access to anything outside your library. That’s not a limitation — it’s the architecture working as intended.

Here’s what that pipeline actually looks like, step by step.

Step 1: Processing Your Query

Your question doesn’t go directly into a search. It first goes through a processing layer that does several things.

Query rewriting. Natural language questions are often ambiguous or underspecified. “What did Smith find about BRCA1?” is harder to retrieve on than “What are the experimental findings on BRCA1’s role in DNA double-strand break repair?” The system rewrites your query to be more precise — expanding abbreviations, resolving pronouns from conversation context, and making implicit scientific concepts explicit.

Complexity scoring. Not all questions are equal. “What is the mechanism?” is a simple single-hop question. “How do the findings in the Smith paper about BRCA1 connect to the PARP inhibitor results in the Jones and Kim papers, and what does that imply about synthetic lethality?” requires reasoning across multiple documents and concepts. The system scores complexity on several weighted indicators and uses that score to decide what retrieval strategy to deploy. Simple questions get a fast path; complex ones get a more thorough one.
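
For readers who like to see the idea in code, here is a minimal sketch of weighted complexity scoring. The indicator names, regular expressions, weights, and routing threshold are all illustrative assumptions, not the values Transparent Lab uses.

```python
import re

# Hypothetical indicators and integer weights, chosen only for illustration.
WEIGHTS = {"multi_document": 4, "relational": 3, "multi_clause": 2, "long_query": 1}

def complexity_score(query):
    """Return a score between 0.0 (simple) and 1.0 (complex)."""
    q = query.lower()
    hits = {
        # Mentions several sources at once.
        "multi_document": bool(re.search(r"\b(papers|studies|across)\b", q)),
        # Asks how things connect, compare, or what they imply.
        "relational": bool(re.search(r"\b(connect|compare|imply|relate)", q)),
        # Several clauses joined by commas or "and".
        "multi_clause": q.count(",") + len(re.findall(r"\band\b", q)) >= 2,
        # Longer than 20 words.
        "long_query": len(q.split()) > 20,
    }
    earned = sum(WEIGHTS[name] for name, hit in hits.items() if hit)
    return earned / sum(WEIGHTS.values())

def choose_strategy(query, threshold=0.5):
    """Route simple questions to a fast path and complex ones to a thorough path."""
    return "thorough" if complexity_score(query) >= threshold else "fast"
```

With these made-up weights, “What is the mechanism?” triggers no indicators and takes the fast path, while the multi-paper BRCA1 question above triggers all four.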

Entity extraction. The system identifies key scientific entities in your query — gene names, protein names, compound identifiers, methodological terms. These get enriched from authoritative databases (UniProt for genes and proteins, PubChem for compounds, MeSH for medical terminology) before retrieval begins. This matters because scientific nomenclature is inconsistent: BRCA1, BRCA-1, breast cancer type 1 susceptibility protein, and the UniProt entry P38398 all refer to the same thing. Entity enrichment ensures that retrieval connects these variants.
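
The effect of enrichment can be sketched with a small lookup table. The hard-coded ALIASES dictionary below is a hypothetical stand-in for what a real system would fetch from UniProt, PubChem, or MeSH, and the substring matching is deliberately naive.

```python
# Hypothetical alias table standing in for lookups against external databases.
ALIASES = {
    "BRCA1": [
        "BRCA1",
        "BRCA-1",
        "breast cancer type 1 susceptibility protein",
        "P38398",
    ],
}

# Map every lower-cased variant back to its canonical name.
INDEX = {
    variant.lower(): canonical
    for canonical, variants in ALIASES.items()
    for variant in variants
}

def expand_entities(query):
    """Return {canonical name: all known variants} for entities found in the query."""
    q = query.lower()
    found = {}
    for variant, canonical in INDEX.items():
        if variant in q:
            found[canonical] = ALIASES[canonical]
    return found
```

A query that mentions only “BRCA-1” comes back with every variant attached, so retrieval can match passages that use any of them.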

Mode parsing. If you’ve prefaced your query with @deepdive or @short, the system adjusts its behavior accordingly before retrieval begins.
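
Mode parsing is the simplest of these steps. A minimal sketch, assuming the tag must be the first token of the query and that untagged queries fall into a default mode (called "standard" here, a name invented for this example):

```python
KNOWN_MODES = {"@deepdive": "deepdive", "@short": "short"}

def parse_mode(raw_query):
    """Split a leading mode tag from the query text. Returns (mode, query)."""
    stripped = raw_query.strip()
    first, _, rest = stripped.partition(" ")
    mode = KNOWN_MODES.get(first.lower())
    if mode is None:
        # No recognized tag: keep the full text and use the assumed default mode.
        return "standard", stripped
    return mode, rest.strip()
```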

Step 2: Turning Text Into Vectors (and Why That’s Necessary)

To retrieve relevant passages from your papers, the system needs a way to measure similarity between your query and thousands of stored text chunks. Simple keyword matching is insufficient — “cardiac hypertrophy” and “enlarged heart muscle” share no words, but they refer to the same phenomenon.

The solution is to convert text into numerical representations called embeddings — vectors in a high-dimensional space where semantic similarity translates to proximity. Two passages that discuss the same concept will have similar vectors, even if they use different words.

Transparent Lab uses a model called BAAI/bge-large-en-v1.5, which produces 1024-dimensional embeddings specifically optimized for dense retrieval tasks. Your query is converted to a vector; so is every chunk of every paper in your library. Retrieval becomes a geometric operation: find the chunks whose vectors are closest to your query vector.
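
To make “closest” concrete, here is a small sketch using cosine similarity, a common choice of distance for embeddings. Whether Transparent Lab uses cosine similarity specifically is an assumption, and the three-dimensional vectors are invented toy values standing in for real 1024-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_chunks(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine_similarity(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]

# Toy vectors, invented for illustration only.
chunks = {
    "cardiac hypertrophy": [0.9, 0.1, 0.0],
    "enlarged heart muscle": [0.8, 0.2, 0.1],
    "soil microbiome": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]
top_two = nearest_chunks(query, chunks, k=2)
```

In this toy setup the two heart-related chunks rank above the unrelated one even though their labels share no words.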

A useful analogy for biologists: this is conceptually similar to how t-SNE or UMAP projections cluster cells with similar transcriptional profiles in a 2D space. The underlying math is different, but the principle is the same: high-dimensional similarity represented geometrically.

Your query embeddings are also cached. If you rephrase and re-ask a similar question, the system doesn’t recompute from scratch.
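
One simple way to build such a cache, sketched here with Python's standard functools.lru_cache, is to key it on a normalized form of the query text. The embed function below is a hypothetical stand-in for the real embedding model, and the normalization rule is an assumption.

```python
from functools import lru_cache

calls = {"count": 0}

def embed(text):
    """Hypothetical stand-in for the embedding model; counts how often it runs."""
    calls["count"] += 1
    return tuple(float(ord(ch)) for ch in text[:4])

def normalize(query):
    """Lower-case and collapse whitespace so trivial variations share a cache key."""
    return " ".join(query.lower().split())

@lru_cache(maxsize=1024)
def _embed_normalized(normalized):
    return embed(normalized)

def cached_embedding(query):
    """Return the embedding, computing it only on the first request for this text."""
    return _embed_normalized(normalize(query))
```

A cache keyed on text like this only helps when the normalized wording matches. Recognizing a genuinely rephrased question would need a similarity-based lookup, which this sketch does not attempt.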

Step 3: Hybrid Retrieval — Two Methods, One Query

Vector search alone has a blind spot: it’s good at finding semantically similar passages, but can miss exact matches for specific technical terms, gene names, or numerical values. A query about “the p53 R248W mutation” might retrieve passages about p53 biology generally, while missing a passage that mentions the specific mutation by name without surrounding context.

Transparent Lab addresses this with hybrid retrieval: combining vector similarity search with BM25, a classical information retrieval method that scores documents by how often the query terms appear, weighted by how rare each term is across the collection and normalized for document length. BM25 and its close variants underlie many traditional full-text search engines. It’s precise on specific terminology, where vector search is strong on concept.

The two scores are blended at a 70/30 ratio — 70% vector similarity, 30% BM25 — for each retrieved chunk. This weighting reflects a deliberate tradeoff: for scientific writing, conceptual relevance matters more than keyword overlap, but exact terminology is important enough to influence ranking.
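
The blend itself is a weighted sum. The sketch below assumes both score sets are first rescaled to a 0 to 1 range with min-max normalization, since raw BM25 scores and vector similarities live on different scales. That normalization step and the example scores are assumptions; only the 70/30 weights come from the description above.

```python
def min_max(scores):
    """Rescale raw scores to [0, 1] so the two methods are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {cid: 1.0 for cid in scores}
    return {cid: (s - lo) / (hi - lo) for cid, s in scores.items()}

def hybrid_rank(vector_scores, bm25_scores, w_vector=0.7, w_bm25=0.3):
    """Blend normalized scores per chunk and return (chunk_id, score), best first."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    blended = {
        cid: w_vector * v.get(cid, 0.0) + w_bm25 * b.get(cid, 0.0)
        for cid in set(v) | set(b)
    }
    return sorted(blended.items(), key=lambda item: item[1], reverse=True)

# Invented scores: chunk B is slightly less similar than A but names the exact term.
vector_scores = {"A": 0.82, "B": 0.80, "C": 0.40}
bm25_scores = {"A": 1.0, "B": 9.0, "C": 2.0}
ranking = hybrid_rank(vector_scores, bm25_scores)
```

Here chunk B overtakes chunk A: its vector score is nearly as high, and the strong keyword match tips the balance.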

This runs as a single database query, not two sequential searches, so the hybrid approach avoids the latency of a second round trip.

Step 4: Understanding Paper Structure

A paper is not a uniform block of text. The methods section, the results section, and the introduction serve different purposes and contain different kinds of information. A question about how an experiment was performed is best answered by the Methods section. A question about what was observed belongs in Results. Background context lives in the Introduction.

Transparent Lab’s retrieval system is structure-aware. When your papers are processed on upload, the text is chunked according to document structure — sections are identified and tagged. During retrieval, the system applies query-adaptive section boosting: if your question is phrased as “how was X measured?” it preferentially retrieves from Methods sections, with a 40% scoring boost for that section type. Questions about findings boost Results. Questions asking for background boost Introductions.
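
A minimal sketch of section boosting, reading the “40% scoring boost” as multiplying a matching chunk's score by 1.4. The keyword rules used to guess intent are crude placeholders invented for this example; a production system would likely classify intent with a model.

```python
SECTION_FOR_INTENT = {
    "methods": "Methods",
    "findings": "Results",
    "background": "Introduction",
}

def detect_intent(query):
    """Guess what kind of section the question is about (placeholder rules)."""
    q = query.lower()
    if q.startswith("how was") or "measured" in q or "protocol" in q:
        return "methods"
    if "find" in q or "observed" in q or "result" in q:
        return "findings"
    if "background" in q or "why is" in q:
        return "background"
    return None

def boosted_score(base_score, chunk_section, query, boost=0.40):
    """Multiply the score by (1 + boost) when the chunk's section matches the intent."""
    intent = detect_intent(query)
    if intent is not None and SECTION_FOR_INTENT[intent] == chunk_section:
        return base_score * (1 + boost)
    return base_score
```

For the question “How was expression measured?”, a Methods chunk scoring 0.5 rises to about 0.7, while a Results chunk with the same base score stays at 0.5.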

This means you’re more likely to get the passage that actually answers your question, rather than the passage that happens to mention the same words.

Chunks are also scored for quality — coherence, citation density, section type — and that quality score contributes 10% of the final retrieval ranking. A fragmented chunk pulled from a table header ranks below a coherent explanatory paragraph, even if both match your query.
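
One plausible reading of that 10% contribution is a weighted sum of relevance and quality. The formula, the way the three sub-scores are averaged, and the example numbers are all assumptions made for illustration.

```python
def quality_score(coherence, citation_density, section_weight):
    """Average three hypothetical sub-scores, each already scaled to [0, 1]."""
    return (coherence + citation_density + section_weight) / 3

def final_score(relevance, quality, quality_weight=0.10):
    """Combine relevance (90%) and quality (10%) into one ranking score."""
    return (1 - quality_weight) * relevance + quality_weight * quality

# Invented example: a table-header fragment versus a coherent paragraph.
table_header = final_score(relevance=0.80, quality=0.2)
paragraph = final_score(relevance=0.78, quality=0.9)
```

In this example the paragraph ends up ranked above the table header despite its slightly lower relevance score.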

Step 5: Context Engineering — Making the Most of What’s Retrieved

The language model that generates your answer has a finite context window — a limit on how much text it can process at once. With a large library and a broad question, you can retrieve far more relevant chunks than fit in that window. Something has to give.

Rather than simply truncating retrieved chunks, the system applies contextual compression: each retrieved passage is processed to identify which sentences are most directly relevant to your specific query. Irrelevant sentences are removed. This typically allows the system to fit roughly twice as many source passages into the context window — more evidence, better answers.
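
The shape of contextual compression can be sketched with a much cruder relevance test than a real system would use. Here a sentence is kept if it shares at least one content word with the query; the stopword list and the word-overlap rule are stand-ins, not Transparent Lab's method.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "was",
             "what", "how", "does", "on", "for"}

def content_terms(text):
    """Lower-cased alphanumeric tokens with common function words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def compress(passage, query, min_overlap=1):
    """Keep only sentences sharing at least min_overlap content terms with the query."""
    q_terms = content_terms(query)
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    kept = [s for s in sentences if len(content_terms(s) & q_terms) >= min_overlap]
    return " ".join(kept)

passage = ("BRCA1 is recruited to double-strand breaks. "
           "Samples were stored overnight at 4 degrees. "
           "Loss of BRCA1 impairs homologous recombination.")
compressed = compress(passage, "What is the role of BRCA1 in repair?")
```

The storage detail is dropped and the two BRCA1 sentences survive, which frees room in the context window for other sources.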

For complex multi-part questions (those that scored high on complexity in Step 1), the system applies query decomposition: breaking your question into 2-4 atomic sub-questions, retrieving separately for each, then merging and deduplicating the results before generating. The sub-questions are checked for completeness and coherence before retrieval runs.
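
The merge-and-deduplicate step can be sketched as follows. The decomposition itself is left out, since it would be done by a language model. Keeping each chunk's best score across sub-questions is an assumed merge rule, and the chunk ids and scores are invented.

```python
def merge_results(results_per_subquestion, top_k=5):
    """Merge ranked (chunk_id, score) lists, keeping each chunk's best score."""
    best = {}
    for results in results_per_subquestion:
        for chunk_id, score in results:
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Invented retrieval results for two sub-questions; chunk "2.1" appears in both.
sub_question_1 = [("1.2", 0.9), ("2.1", 0.6)]
sub_question_2 = [("2.1", 0.8), ("3.4", 0.7)]
merged = merge_results([sub_question_1, sub_question_2])
```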

For questions that require following conceptual chains — “how does BRCA1 dysfunction relate to sensitivity to PARP inhibitors?” — multi-hop retrieval activates: the system extracts concepts from the initial retrieval, then follows those concepts across additional retrieval hops. This can surface relevant passages that wouldn’t have appeared in the first search because they don’t match your original query terms, but do match the concepts that first retrieval revealed.
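
A toy version of multi-hop retrieval shows how a second hop reaches passages the original query would miss. The three-chunk corpus, the fixed concept vocabulary, and the substring matching are all invented for this sketch; a real system would extract concepts with a model and retrieve with the hybrid search described earlier.

```python
CORPUS = {
    "1.1": "BRCA1 dysfunction impairs homologous recombination repair.",
    "2.1": "Cells deficient in homologous recombination depend on PARP for survival.",
    "3.1": "Soil bacteria fix nitrogen in root nodules.",
}
# Hypothetical concept vocabulary used for both extraction and matching.
CONCEPTS = ["brca1", "homologous recombination", "parp", "nitrogen"]

def concepts_in(text):
    """Return the known concepts mentioned in a piece of text."""
    lowered = text.lower()
    return {c for c in CONCEPTS if c in lowered}

def retrieve(terms, exclude=()):
    """Return ids of chunks that mention any of the terms, skipping excluded ids."""
    return [cid for cid, text in CORPUS.items()
            if cid not in exclude and concepts_in(text) & set(terms)]

def multi_hop(query, max_hops=2):
    """Retrieve, then follow newly surfaced concepts for up to max_hops rounds."""
    seen_terms, found = set(), []
    frontier = concepts_in(query)
    for _ in range(max_hops):
        if not frontier:
            break
        hits = retrieve(frontier, exclude=found)
        seen_terms |= frontier
        found.extend(hits)
        surfaced = set()
        for cid in hits:
            surfaced |= concepts_in(CORPUS[cid])
        # Only concepts not searched before drive the next hop.
        frontier = surfaced - seen_terms
    return found
```

Asking “What happens when BRCA1 is lost?” finds chunk 1.1 on the first hop. That chunk introduces homologous recombination, which leads the second hop to chunk 2.1, a passage that never mentions BRCA1.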

The system also maintains conversation memory: recent exchanges are kept in working memory, and entity facts from the conversation are tracked. If you asked about BRCA1 two questions ago and now ask “what does the second paper say about its expression in breast tissue?”, the system knows what “its” refers to.

Step 6: Generation With Constraints

With retrieved passages in context, the language model generates your answer. But not without constraints.

The citation system requires every claim in the response to link to a specific retrieved chunk. Citations such as [1.1] and [2.3] refer to document 1, chunk 1 and document 2, chunk 3 respectively, each a direct pointer to the source passage. The model is explicitly constrained to cite only from the passages it was given; it cannot introduce claims from outside the retrieved context.
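
Because the citation format is regular, it is easy to check mechanically. The sketch below parses [document.chunk] markers and reports any that do not point at a chunk that was actually retrieved. The function names are invented for this example, and whether Transparent Lab runs a check of exactly this kind is an assumption.

```python
import re

CITATION = re.compile(r"\[(\d+)\.(\d+)\]")

def extract_citations(answer):
    """Return (document, chunk) pairs in order of appearance."""
    return [(int(doc), int(chunk)) for doc, chunk in CITATION.findall(answer)]

def invalid_citations(answer, retrieved):
    """Return cited pairs that are not in the set of retrieved (document, chunk) pairs."""
    return [pair for pair in extract_citations(answer) if pair not in retrieved]

answer = "BRCA1 loss impairs repair [1.1]. PARP inhibition is lethal in these cells [2.3]."
retrieved = {(1, 1), (2, 3)}
problems = invalid_citations(answer, retrieved)
```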

If your query falls into specific methodological territory — statistics, study design, protocol reporting, evidence synthesis — the system injects domain-specific guidance from a library of 93 pre-written context fragments. A question about a regression analysis activates fragments about statistical interpretation; a question about a clinical trial activates fragments about evidence quality and sample size considerations. This improves the analytical depth of responses for technically specific queries.

For @deepdive queries, the system runs enhanced analysis in the same generation step: classifying the relevance of each retrieved chunk (Chain-of-Note), assessing evidence quality for the sources cited (study type, methodological rigor, sample size), analyzing consensus and conflicts across documents, and identifying what your library covers versus what’s absent.

Step 7: The Evaluation Layer

The generated answer goes through a final check before it reaches you.

An evaluation layer — itself an AI model — assesses the response on several dimensions: accuracy relative to the retrieved sources, citation quality, completeness, and whether the answer avoids a specific failure mode worth calling out: sycophancy. When a researcher poses a hypothesis in their question (“Does this suggest that pathway X is responsible for Y?”), there’s a structural pressure on the model to agree — because the question itself frames the answer. The evaluation layer specifically checks for this and flags responses that confirm hypotheses without evidence from the retrieved passages.

If the evaluation finds a problem, the system attempts one refinement pass, then re-evaluates. There’s a hard limit of two iterations — the system doesn’t loop indefinitely, and it doesn’t silently degrade into a worse answer if refinement fails.
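
The bounded loop can be sketched like this, reading “two iterations” as at most two evaluations with one refinement between them. The evaluate and refine callables are hypothetical stand-ins for model calls. Returning a "flagged" status when the second evaluation still fails is an assumption about how failure is surfaced.

```python
def answer_with_evaluation(draft, evaluate, refine, max_iterations=2):
    """Evaluate a draft, refining between evaluations, up to max_iterations evaluations.

    evaluate(answer) returns a list of problems (empty means the answer passes).
    refine(answer, problems) returns a revised answer.
    Returns (answer, status, evaluations_run).
    """
    answer = draft
    for iteration in range(1, max_iterations + 1):
        problems = evaluate(answer)
        if not problems:
            return answer, "passed", iteration
        if iteration < max_iterations:
            answer = refine(answer, problems)
    # Still failing after the last allowed evaluation: report it rather than loop.
    return answer, "flagged", max_iterations

def check_citation(answer):
    """Toy evaluator: the answer must contain the citation [1.1]."""
    return [] if "[1.1]" in answer else ["missing citation"]

def add_citation(answer, problems):
    """Toy refiner: append the missing citation."""
    return answer + " [1.1]"

result = answer_with_evaluation("BRCA1 loss impairs repair.", check_citation, add_citation)
```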

If retrieval came up empty or the query is about something not covered in your library, the system says so. It doesn’t confabulate coverage it doesn’t have.

What This Architecture Means in Practice

The steps above add latency — typically 1-3 seconds for a standard query, up to 5-6 seconds for a @deepdive with multi-hop retrieval. That’s a deliberate tradeoff. The alternative — generating immediately from memory — is faster but removes the citation grounding entirely.

It also means the system’s quality is bounded by what you’ve uploaded. A question about a paper you haven’t added to your library will return an honest “I don’t have coverage of that.” This is correct behavior, not a failure. The constraint is the feature.

The architecture also means there are no background training processes adjusting the system’s behavior based on your queries. Your library is indexed, not learned from. The separation between your content and the model’s behavior is structural, not a matter of policy.

Why We’re Explaining This

Most AI products are black boxes. You get an answer and you’re asked to trust it. For a consumer use case, that might be acceptable. For a researcher whose citations support published work, it isn’t.

We think the right response to that isn’t “trust us, it’s accurate.” It’s to make the pipeline understandable enough that you can form a grounded judgment about when to rely on it, when to verify, and what the system’s actual limitations are. A cited passage from a paper you uploaded is verifiable in a way a generated sentence is not. That’s not incidental to the design — it’s the whole point.

If any part of this description raised more questions than it answered, we’re happy to go deeper. The architecture documents exist for a reason.

Want to see the pipeline in action with your own papers? Request early access — we’re currently working with a small group of researchers to refine the experience.