I built a local AI assistant in Crystal that runs on 4GB RAM

Hey Crystal community,

I want to share a project I have been working on called speak. It is a local AI assistant that runs entirely on your computer. No cloud. No subscription. Your data never leaves your machine.

Why I built it

I was tired of paying for ChatGPT and Claude. I was also tired of AI services that forget who I am every time I close the tab. I wanted something that runs on my laptop (not a $10,000 server). Something that remembers things about me across sessions. Something that can read my files. Something that can search the web when I need current info. Something that costs nothing after setup. Something that keeps my conversations private.

So I built speak in Crystal.

What it does

When you open speak, you get a chat interface. You can type normally. Tell it your name once, and it remembers forever. Ask it to read your config.json, and it shows you the content. Ask it to search for something, and it fetches results from the web. Have a long conversation, and memory usage stays flat because the KV cache lives on disk, not in RAM.

Here is an example session.

You type: Hello, who are you?

speak: I am speak, a local AI assistant running on your computer. I can read files, search the web, and remember things about you.

You type: My name is Sarah and I love Python.

speak: I have remembered that your name is Sarah and you love Python.

You type: Read my config.json

speak: (shows the content of config.json)

You type: Search for Python 3.13 features

speak: (shows search results from DuckDuckGo)

You type: What do you know about me?

speak: Your name is Sarah and you love Python.

The agent loop

speak now has an agent loop. It can call tools multiple times to complete a task. For example, if you ask it to read a file and then search based on what it finds, it will do both steps automatically. It plans, executes, observes the result, and decides what to do next. Then it calls the finish tool when it has the final answer.
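For readers who want to see the shape of such a loop, here is a minimal sketch of the plan, execute, observe cycle. It is not taken from speak's source: the names Action, Tool, Planner, and run_agent are hypothetical, and the planner is a plain Proc standing in for the model call.

```crystal
# Hypothetical types for illustration; none of these names come from speak.
record Action, tool : String, input : String

alias Tool = Proc(String, String)
alias Planner = Proc(String, Array(String), Action)

# Repeats plan -> execute -> observe until the planner picks the "finish"
# tool or the step budget runs out.
def run_agent(task : String, tools : Hash(String, Tool), planner : Planner, max_steps : Int32 = 8) : String
  observations = [] of String
  max_steps.times do
    action = planner.call(task, observations)           # plan
    return action.input if action.tool == "finish"      # final answer
    tool = tools[action.tool]?
    result = tool ? tool.call(action.input) : "unknown tool: #{action.tool}" # execute
    observations << "#{action.tool}(#{action.input}) => #{result}"           # observe
  end
  "stopped after #{max_steps} steps without a final answer"
end

# A stub tool and a scripted planner, in place of real file access and a model.
tools = {"read_file" => ->(path : String) { "contents of #{path}" }}
planner = ->(task : String, seen : Array(String)) {
  seen.empty? ? Action.new("read_file", "config.json") : Action.new("finish", "done: #{seen.last}")
}

puts run_agent("Read my config.json", tools, planner)
```

The step budget is an assumption of this sketch; it keeps a confused planner from looping forever.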

Technical details

speak uses the Nanbeige 4.1 3B model. It runs on CPU only. No GPU required. The entire model at Q4_K_M quantization is 2.5GB. With memory mapping, the model stays mostly on disk. RAM usage is around 500MB to 1GB.

The KV cache is saved to SSD, not RAM. This means you can have a conversation with 10,000 turns and memory usage stays flat. No slowdown. No crash.

speak automatically detects your hardware. It reads total RAM, available RAM, CPU cores, and AVX2 support. It then configures itself to run optimally on your machine. If you have 4GB RAM, it uses the smaller Q2_K model and a smaller context window. If you have 16GB RAM, it uses the larger Q6_K model and a larger context window. You can also override everything by editing config.json.
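The selection logic described above could look roughly like this. It is a sketch under assumptions, not speak's actual code: it reads /proc/meminfo, so it is Linux-only, and the helper names, the RAM thresholds, and the context sizes are all made up for illustration. Only the quantization names come from the post.

```crystal
# Hypothetical profile type; the context sizes below are illustrative guesses.
record Profile, quant : String, context : Int32

# Finds a line such as "MemTotal:  4028440 kB" in /proc/meminfo text and
# returns the value in MiB, or nil if the key is missing.
def meminfo_mib(meminfo : String, key : String) : Int64?
  meminfo.each_line do |line|
    if line.starts_with?("#{key}:")
      kb = line.split[1]?.try(&.to_i64?)
      return kb // 1024 if kb
    end
  end
  nil
end

# Maps total RAM to a quantization level. The thresholds are assumptions,
# set a little below 4 GiB and 16 GiB because the kernel reports less than
# the installed amount.
def pick_profile(total_mib : Int64) : Profile
  if total_mib <= 5000
    Profile.new("Q2_K", 2048)
  elsif total_mib >= 15_000
    Profile.new("Q6_K", 8192)
  else
    Profile.new("Q4_K_M", 4096)
  end
end

if File.exists?("/proc/meminfo")
  if total = meminfo_mib(File.read("/proc/meminfo"), "MemTotal")
    profile = pick_profile(total)
    puts "cores=#{System.cpu_count} ram=#{total}MiB quant=#{profile.quant} ctx=#{profile.context}"
  end
end
```

Available RAM can be read the same way with the "MemAvailable" key. AVX2 detection is left out of this sketch.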

The tool system includes file reading, web search, and memory. The agent loop handles multi-step tasks. The disk cache persists across sessions, so you can close speak, open it again, and continue the same conversation without reprocessing anything.
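To give an idea of how a memory tool can survive restarts, here is a minimal sketch that keeps remembered facts in a JSON file. This is an assumption about one possible design, not a description of how speak stores its memory; the MemoryStore class and its file format are hypothetical.

```crystal
require "json"

# Hypothetical memory tool: a list of facts persisted as a JSON array.
class MemoryStore
  getter facts : Array(String)

  # Loads existing facts from disk, or starts empty on first run.
  def initialize(@path : String)
    @facts = File.exists?(@path) ? Array(String).from_json(File.read(@path)) : [] of String
  end

  # Adds a fact and rewrites the file, skipping exact duplicates.
  def remember(fact : String) : Nil
    return if @facts.includes?(fact)
    @facts << fact
    File.write(@path, @facts.to_pretty_json)
  end
end
```

Creating a second MemoryStore on the same path after a restart returns the facts written by the first one, which is the behavior shown in the example session.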

Why Crystal

I chose Crystal because it is fast, memory efficient, and compiles to a single binary. The syntax is clean and readable. The performance is close to C but without the pain. There is already a great binding called llama.cr that wraps llama.cpp. This gave me a solid foundation to build on.

I also wanted to prove that Crystal is a good language for AI inference. Most people use Python for this. Python is slow and memory heavy. Crystal is lean. speak uses less than 2GB of RAM total. A Python equivalent would use 6GB or more.

The codebase is 1,689 lines of Crystal code. No bloat. No unnecessary abstractions. Just what is needed.

How to try it

Clone the repo. Run shards install. Build with crystal build src/speak.cr --release -o speak. Then run ./speak. The first run will detect your hardware, create a config file, and download the model (2.5GB). After that, it just works.

The installer supports resumable downloads. If your connection drops, just run ./speak again and it picks up where it left off.
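Resumable downloads are usually built on the HTTP Range header. The sketch below shows that general technique with Crystal's standard HTTP::Client; it is not speak's installer code, and the function names are hypothetical.

```crystal
require "http/client"

# Size of the partial file already on disk, or 0 if nothing was downloaded yet.
def resume_offset(partial_path : String) : Int64
  File.exists?(partial_path) ? File.size(partial_path) : 0_i64
end

# Asks the server for the bytes from `offset` onward; no header on a fresh start.
def range_headers(offset : Int64) : HTTP::Headers
  headers = HTTP::Headers.new
  headers["Range"] = "bytes=#{offset}-" if offset > 0
  headers
end

# Streams the body to disk. Assumes the URL answers directly: HTTP::Client
# does not follow redirects on its own.
def download(url : String, partial_path : String) : Nil
  offset = resume_offset(partial_path)
  HTTP::Client.get(url, headers: range_headers(offset)) do |response|
    raise "download failed: HTTP #{response.status_code}" unless response.success?
    # 206 means the server honored the range, so append.
    # 200 means it is sending the whole file again, so start over.
    mode = response.status_code == 206 ? "ab" : "wb"
    File.open(partial_path, mode) do |file|
      IO.copy(response.body_io, file)
    end
  end
end
```

Calling download again after an interruption continues from the current file size, as long as the server supports range requests.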

You can find the code at github.com/zendrx/speak

What is next

I want to add more tools to the agent loop. File writing, terminal commands, and calendar integration are on the list. I also want to improve the web search with better result parsing. And I want to add a proper TUI with a status bar and command history.

But even as it is, speak is already useful. I use it every day.

Conclusion

speak is not trying to beat ChatGPT. It is trying to be different. Private. Local. Low RAM. Persistent memory. No subscription. It is for people who want an AI that remembers them and respects their privacy, all on hardware they already own.

Try it. Break it. Tell me what you think.

GitHub: github.com/zendrx/speak

Thanks for sharing this.
It is good to see llama.cr used in a real project. I have just changed the llama.cr version format from 9330 to 0.9330.0, because shards does not seem to recognize integer-only versions.
I will try speak after I get home.

Thank you for taking a look. And thank you for maintaining llama.cr. Without your bindings, this project would not exist.

I noticed the version format change. I will need to match the right llama.cpp build number.

Let me know what you think of speak. I am especially curious how the disk KV cache performs with your bindings. It uses Llama::State heavily.

Hey @zendrx

It’s great stuff!

Just in case you are not aware, I built a Crystal ML/LLM library that supports CPU, Metal, and CUDA kernels, and it also includes ML libraries to build your own inference engine. It already has Qwen 3.5/3.6 and Nomics BERT MoE implemented.

You can find it here: skuznetsov/cogni-ml on GitHub (Crystal ML library: Autograd, Tensors, Neural Networks, Optimizers)

Hey @ComputerMage

Thank you for taking a look at speak and for the kind words.

I had not seen cogni-ml before. I just looked at the repository. It is impressive. Writing a full ML library with Metal and CUDA support in Crystal is a massive achievement. The fact that you have Qwen 3.5 running natively is incredible.

For speak v1.0, I will stick with llama.cr. It is simple and does exactly what I need for running a 3B model.

But I will definitely be studying your code. I want to understand how you implemented the GPU kernels and the inference pipeline. Maybe for a future version of speak, or a different project, I can build on top of what you have done.

Thank you for sharing your work. It is inspiring to see what is possible in Crystal.

zendrx

I just ran a bug detection scan on llama.cr with ChatGPT Pro and fixed all the memory management bugs it found. Unfortunately, it also found a struct layout mistake, but I fixed that as well.
Please let me know if you find anything else. Thank you.

Hey @zendrx

I have quite a lot of experience (more than a year of day-to-day work) with SWE agents, including my custom LLM protocol for much better control of LLM models, so cogni-ml is a fruit of that collaboration. I know a lot about ML, but not the kernels, so I just asked the agents themselves to help me implement them.
And I found that Crystal is quite ideal for agentic work: fewer errors and fewer tokens used, due to greater information richness per token and static analysis during compilation.

Thank you for the update on llama.cr.

The memory management fixes and struct layout correction are good news for speak. I rely heavily on Llama::State for the disk KV cache and tokenization. If those areas were affected, I am glad they are now stable.

I will update to the latest llama.cr and test speak again.

Thank you for maintaining llama.cr and for keeping me informed.