Adding A Memory to AstroLlama
How RAG Is Implemented in AstroLlama (and Why It Matters)
As we've seen in previous posts, when you create a local AI Assistant using a large language model (LLM),
there are two primary components: the inference engine (in our case
the open-source llama.cpp program) and the open-source model file, which is like
"software" being run on the llama.cpp "hardware".
We can swap out different open-source models (the "software") to choose one that
works well for our intended use and for the hardware available to run it. I'm still swapping
models to see which one runs best - I'll keep looking for one that works well and comes from
an ethical source.
But once a model is trained, it becomes a snapshot in time - no new information can be added.
To increase the data available to answer the user's questions, we need to add some external
features.
One of these is Retrieval-Augmented Generation, or RAG. It is one of the most practical upgrades
you can make to an AI assistant: it essentially gives the model a search engine over data you
curate. Without RAG, your model can only answer from its pretraining knowledge or
whatever is in the active chat context window.
In AstroLlama, RAG is designed to stay simple, local-first, and reliable. This post walks
through how it works.
## RAG in Plain Language
At a high level, RAG has two phases:
1. Indexing phase
You ingest source documents, split them into chunks, and store them in a database. We do this
with Python scripts named ingest.py and webingest.py, which load data into the vector database
from local files or from websites.
2. Query phase
When a user asks a question, the client retrieves the most relevant chunks of text and injects
them into the conversation so the LLM can read the results and include the information in its
response.
## AstroLlama RAG Architecture
AstroLlama’s RAG workflow is centered around a Retriever service plus two ingestion scripts:
- Local document ingestion for files on disk
- Web crawling ingestion for dynamic websites
I used the ChromaDB vector database library in the Python programs to store the data.
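As a rough sketch of how the two phases map onto ChromaDB (the collection name, path, and sample text below are illustrative assumptions, not the exact code in ingest.py):

```python
import chromadb

# Indexing phase: store text chunks in a persistent local collection.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("astro_docs")

chunks = ["M31, the Andromeda Galaxy, is the nearest large spiral...",
          "A Dobsonian telescope uses a simple alt-azimuth mount..."]
collection.add(
    documents=chunks,
    ids=[f"doc-{i}" for i in range(len(chunks))],
    metadatas=[{"source": "example.txt"}] * len(chunks),
)

# Query phase: pull back the chunks most relevant to a question.
results = collection.query(query_texts=["What mount does a Dobsonian use?"], n_results=2)
for doc in results["documents"][0]:
    print(doc)
```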
## Data Sources and Ingestion Paths
AstroLlama supports two ingestion routes.
### 1) Local file ingestion
The local ingestion script processes supported file types:
- txt
- md
- csv
- pdf
- docx
For PDF files, extraction looks for embedded text first, with optional Optical Character Recognition (OCR) for
image-bearing pages using the Tesseract open-source OCR software. This is needed because, while many
PDF files contain the text of the document (for example, if you save to PDF from a word processor),
many others are image-based scans. For those we need OCR to "read" the characters in the image and
output them as text. There is also a column-aware OCR mode to improve extraction for multi-column layouts
like newsletters and journals.
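A rough sketch of the text-first approach with an OCR fallback is below; it uses pypdf, pdf2image, and pytesseract as stand-ins and omits the column-aware mode, so the real ingest.py code will differ:

```python
from pathlib import Path

from pypdf import PdfReader              # embedded-text extraction
from pdf2image import convert_from_path  # render PDF pages to images (requires poppler)
import pytesseract                       # Python bindings for the Tesseract OCR engine

def extract_pdf_text(path: Path) -> str:
    """Prefer embedded text; fall back to OCR for pages with none (e.g. scanned images)."""
    reader = PdfReader(str(path))
    pages = []
    for i, page in enumerate(reader.pages):
        text = (page.extract_text() or "").strip()
        if not text:
            # Image-based page: render it and let Tesseract read the pixels.
            image = convert_from_path(str(path), first_page=i + 1, last_page=i + 1)[0]
            text = pytesseract.image_to_string(image)
        pages.append(text)
    return "\n".join(pages)
```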
### 2) Web ingestion with Crawl4AI
The web ingestion script uses the Crawl4AI library to crawl websites and convert pages into clean
text before sending the text to the RAG database. Web data is noisy: pages carry headers,
footers, navigation menus, and other boilerplate that needs to be removed to get clean text.
The web ingestion script will also collect linked PDF and DOCX files and ingest them into the same collection
(reusing code from ingest.py), which is useful for observatory handbooks, newsletters, and archive-style
sites. This approach tries to keep the "noise" being ingested to a minimum, since noise can feed
confusing information to the LLM.
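Here is a minimal sketch of the Crawl4AI side, assuming the default Markdown output is what gets chunked and stored; webingest.py's actual crawl options and post-processing may differ:

```python
import asyncio

from crawl4ai import AsyncWebCrawler  # Crawl4AI's asynchronous crawler

async def fetch_page_text(url: str) -> str:
    """Fetch one page and return Crawl4AI's cleaned Markdown rendering of it."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return str(result.markdown)

if __name__ == "__main__":
    text = asyncio.run(fetch_page_text("https://example.com/observing-guide"))
    print(text[:500])  # this text would then be chunked and added to the vector database
```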
## Query-Time Retrieval and Prompt Injection
During a chat session, the AstroLlama client looks for special text from the LLM asking it to retrieve
information from the RAG database. You can think of this as the AI saying "I have no
information, I'll google it!" The AstroLlama client runs the RAG search and injects the results
into the conversation it sends back to the LLM. The LLM can then provide an answer based
on the information retrieved. All of a sudden AstroLlama can read books on the fly and provide
answers!
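Mechanically, that loop might look something like the sketch below. The [SEARCH: ...] marker, the collection name, and the message format are illustrative assumptions; AstroLlama's actual retrieval protocol may differ:

```python
import re
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("astro_docs")

# Hypothetical marker the model emits when it wants a lookup, e.g. "[SEARCH: dobsonian collimation]"
SEARCH_MARKER = re.compile(r"\[SEARCH:\s*(.+?)\]")

def maybe_inject_rag(llm_reply: str, messages: list[dict]) -> bool:
    """If the reply asks for a search, append retrieved chunks so the LLM can answer from them."""
    match = SEARCH_MARKER.search(llm_reply)
    if not match:
        return False  # a normal reply, nothing to retrieve
    query = match.group(1)
    results = collection.query(query_texts=[query], n_results=3)
    context = "\n\n".join(results["documents"][0])
    messages.append({
        "role": "user",
        "content": f"Retrieved reference material:\n{context}\n\nUse it to answer the original question.",
    })
    return True  # caller sends the updated messages back to the LLM for a final answer
```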
## How RAG Coexists with Tool Use
AstroLlama supports both RAG and MCP (Model Context Protocol) tool calling. Where RAG is like a Google search, MCP allows the LLM to request
that a specific tool be run dynamically - for example, a get_weather tool might go out to the internet and download
a weather report from Open-Meteo. More on tools in another article.
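In the meantime, for a flavour of what such a tool can look like on the client side, here is a hypothetical get_weather function against the public Open-Meteo API (an illustration only, not AstroLlama's actual MCP tool):

```python
import requests

def get_weather(latitude: float, longitude: float) -> dict:
    """Hypothetical tool: fetch current conditions from the public Open-Meteo API."""
    response = requests.get(
        "https://api.open-meteo.com/v1/forecast",
        params={"latitude": latitude, "longitude": longitude, "current_weather": "true"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["current_weather"]

# e.g. get_weather(49.9, -97.1) might return {"temperature": -12.3, "windspeed": 18.4, ...}
```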
The AstroLlama client sets a policy that prioritizes:
- direct model knowledge first
- then retrieved data via RAG (go read some books)
- tool calls only when the user's request is explicit (e.g. "generate an AAVSO finder chart for
this variable star")
## Limits and Future Improvements
Like most first-generation RAG systems, AstroLlama currently uses simple character chunking.
That is robust and easy, but there are lots of ways to improve on the process.
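Simple character chunking can be just a few lines; the chunk size and overlap below are arbitrary example values, not AstroLlama's settings:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks with a little overlap between them."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```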
Potential upgrades include:
- smarter indexing based on headings/sections
- reranking retrieved text before injection into the conversation
- source citation formatting in final answers
- text deduplication and freshness policies
## Closing
RAG is one way we can change the data an AI Assistant has access to, improving the quality of the answers
we get and reducing hallucinations - after all, if the AI doesn't know the answer, it makes one up! In
the coming weeks we'll look at a few other wrinkles that can be added to a locally hosted
AI Assistant for Astronomy to improve how it works. Stay tuned!
The code for AstroLlama has moved to https://github.com/gordtulloch/AstroLlama if you'd like to
download and play with it. I'm making it more "turnkey", so if you're less technically inclined,
stay tuned for an easy-to-download installation. Thanks for reading!

