Adding A Memory to AstroLlama
How RAG Is Implemented in AstroLlama (and Why It Matters)
As we've seen in previous posts, when you create a local AI Assistant using a large language model (LLM),
there are two primary components: the inference engine (in our case
the open-source llama.cpp program) and the open-source model file, which is like
"software" being run on the llama.cpp "hardware".
We can swap out different open-source models (the "software") to choose one that
works well for our intended use and for the hardware available to run it. I'm still swapping
models to see which one runs best - I'll keep looking for one that works well and comes from
an ethical source.
But once a model is trained, it becomes a snapshot in time - no new information can be added.
To increase the data available to answer the user's questions, we need to add some external
features.
One of these is Retrieval-Augmented Generation, or RAG. It is one of the most practical upgrades
you can make to an AI assistant: it essentially gives the model a search engine over data you
curate. Without RAG, your model can only answer from its pretraining knowledge or
whatever is in the active chat context window.
In AstroLlama, RAG is designed to stay simple, local-first, and reliable. This post walks
through how it works.
## RAG in Plain Language
At a high level, RAG has two phases:
1. Indexing phase
You ingest source documents, split them into chunks, and store them in a database. We do this
with Python scripts named ingest.py and webingest.py, which load data into the vector database
from local files or from websites.
2. Query phase
When a user asks a question, the client retrieves the most relevant chunks of text and injects
them into the conversation so the LLM can read the results and include the information in its
response.
## AstroLlama RAG Architecture
AstroLlama’s RAG workflow is centered around a Retriever service plus two ingestion scripts:
- Local document ingestion for files on disk
- Web crawling ingestion for dynamic websites
I used the ChromaDB vector database library in the Python programs to store the data.
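As a rough sketch of how the two phases map onto ChromaDB (the collection name, path, and sample text below are illustrative assumptions, not the exact code in ingest.py):

```python
import chromadb

# Indexing phase: store text chunks in a persistent local collection.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("astro_docs")

chunks = ["M31, the Andromeda Galaxy, is the nearest large spiral...",
          "A Dobsonian telescope uses a simple alt-azimuth mount..."]
collection.add(
    documents=chunks,
    ids=[f"doc-{i}" for i in range(len(chunks))],
    metadatas=[{"source": "example.txt"}] * len(chunks),
)

# Query phase: pull back the chunks most relevant to a question.
results = collection.query(query_texts=["What mount does a Dobsonian use?"], n_results=2)
for doc in results["documents"][0]:
    print(doc)
```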
## Data Sources and Ingestion Paths
AstroLlama supports two ingestion routes.
### 1) Local file ingestion
The local ingestion script processes supported file types:
- txt
- md
- csv
- pdf
- docx
For PDF files, extraction looks for embedded text first, with optional Optical Character Recognition (OCR) for
image-bearing pages using the Tesseract open-source OCR software. This is needed because, while many
PDF files contain the text of the document (for example, if you save to PDF from a word processor),
many others are image-based scans. For those we need OCR to "read" the characters in the image and
output them as text. There is also a column-aware OCR mode to improve extraction for multi-column layouts
like newsletters and journals.
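A rough sketch of the text-first approach with an OCR fallback is below; it uses pypdf, pdf2image, and pytesseract as stand-ins and omits the column-aware mode, so the real ingest.py code will differ:

```python
from pathlib import Path

from pypdf import PdfReader              # embedded-text extraction
from pdf2image import convert_from_path  # render PDF pages to images (requires poppler)
import pytesseract                       # Python bindings for the Tesseract OCR engine

def extract_pdf_text(path: Path) -> str:
    """Prefer embedded text; fall back to OCR for pages with none (e.g. scanned images)."""
    reader = PdfReader(str(path))
    pages = []
    for i, page in enumerate(reader.pages):
        text = (page.extract_text() or "").strip()
        if not text:
            # Image-based page: render it and let Tesseract read the pixels.
            image = convert_from_path(str(path), first_page=i + 1, last_page=i + 1)[0]
            text = pytesseract.image_to_string(image)
        pages.append(text)
    return "\n".join(pages)
```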
### 2) Web ingestion with Crawl4AI
The web ingestion script uses the Crawl4AI library to crawl websites and convert pages into clean
text before sending the text to the RAG database. Web data is noisy: pages carry headers,
footers, navigation menus, and other boilerplate that needs to be removed to get clean text.
The web ingestion script will also collect linked PDF and DOCX files and ingest them into the same collection
(reusing code from ingest.py), which is useful for observatory handbooks, newsletters, and archive-style
sites. This approach tries to keep the "noise" being ingested to a minimum, since noise can feed
confusing information to the LLM.
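Here is a minimal sketch of the Crawl4AI side, assuming the default Markdown output is what gets chunked and stored; webingest.py's actual crawl options and post-processing may differ:

```python
import asyncio

from crawl4ai import AsyncWebCrawler  # Crawl4AI's asynchronous crawler

async def fetch_page_text(url: str) -> str:
    """Fetch one page and return Crawl4AI's cleaned Markdown rendering of it."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return str(result.markdown)

if __name__ == "__main__":
    text = asyncio.run(fetch_page_text("https://example.com/observing-guide"))
    print(text[:500])  # this text would then be chunked and added to the vector database
```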
## Query-Time Retrieval and Prompt Injection
During a chat session, the AstroLlama client looks for special text from the LLM asking it to retrieve
information from the RAG database. You can think of this as the AI saying "I have no
information, I'll google it!" The AstroLlama client runs the RAG search and injects the results
into the conversation it sends back to the LLM. The LLM can then provide an answer based
on the information retrieved. All of a sudden AstroLlama can read books on the fly and provide
answers!
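Mechanically, that loop might look something like the sketch below. The [SEARCH: ...] marker, the collection name, and the message format are illustrative assumptions; AstroLlama's actual retrieval protocol may differ:

```python
import re
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("astro_docs")

# Hypothetical marker the model emits when it wants a lookup, e.g. "[SEARCH: dobsonian collimation]"
SEARCH_MARKER = re.compile(r"\[SEARCH:\s*(.+?)\]")

def maybe_inject_rag(llm_reply: str, messages: list[dict]) -> bool:
    """If the reply asks for a search, append retrieved chunks so the LLM can answer from them."""
    match = SEARCH_MARKER.search(llm_reply)
    if not match:
        return False  # a normal reply, nothing to retrieve
    query = match.group(1)
    results = collection.query(query_texts=[query], n_results=3)
    context = "\n\n".join(results["documents"][0])
    messages.append({
        "role": "user",
        "content": f"Retrieved reference material:\n{context}\n\nUse it to answer the original question.",
    })
    return True  # caller sends the updated messages back to the LLM for a final answer
```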
## How RAG Coexists with Tool Use
AstroLlama supports both RAG and MCP (Model Context Protocol) tool calling. Where RAG is like a Google search, MCP allows the LLM to request
that a specific tool be run dynamically - for example, a get_weather tool might go out to the internet and download
a weather report from Open-Meteo. More on tools in another article.
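In the meantime, for a flavour of what such a tool can look like on the client side, here is a hypothetical get_weather function against the public Open-Meteo API (an illustration only, not AstroLlama's actual MCP tool):

```python
import requests

def get_weather(latitude: float, longitude: float) -> dict:
    """Hypothetical tool: fetch current conditions from the public Open-Meteo API."""
    response = requests.get(
        "https://api.open-meteo.com/v1/forecast",
        params={"latitude": latitude, "longitude": longitude, "current_weather": "true"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["current_weather"]

# e.g. get_weather(49.9, -97.1) might return {"temperature": -12.3, "windspeed": 18.4, ...}
```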
The AstroLlama client sets a policy that prioritizes:
- direct model knowledge first
- then retrieved data via RAG (go read some books)
- tool calls only when the user's request is explicit (e.g. "generate an AAVSO finder chart for
this variable star")
## Limits and Future Improvements
Like most first-generation RAG systems, AstroLlama currently uses simple character chunking.
That is robust and easy, but there are lots of ways to improve on the process.
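Simple character chunking can be just a few lines; the chunk size and overlap below are arbitrary example values, not AstroLlama's settings:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks with a little overlap between them."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```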
Potential upgrades include:
- smarter indexing based on headings/sections
- reranking retrieved text before injection into the conversation
- source citation formatting in final answers
- text deduplication and freshness policies
## Closing
RAG is one way we can change the data an AI Assistant has access to, improving the quality of the answers
we get and reducing hallucinations - after all, if the AI doesn't know the answer, it makes one up! In
the coming weeks we'll look at a few other wrinkles that can be added to a locally hosted
AI Assistant for Astronomy to improve how it works. Stay tuned!
The code for AstroLlama has moved to https://github.com/gordtulloch/AstroLlama if you'd like to
download and play with it. I'm making it more "turnkey", so if you're less technically inclined,
stay tuned for an easy-to-download installation. Thanks for reading!

