I am working on a local RAG system.
Are there any bindings or libraries for working with smaller LLMs directly, at least on CPU? I need them for embedding text, comparing (rerank), and summarization/generation.
I tried many things.
Ollama's REST API: no rerank; it's built upon llama.cpp.
Tried llama.cpp's llama-server: embedding/pooling doesn't work.
Tried OpenVINO: hit a wall with pipeline setup; it needs Python scripts, etc.
Tried ONNX, including an ONNX attempt by kojix for Crystal, but again I hit a wall at tokenizing.
As I see it, tokenizer support is crucial.
Some very low-level ML libraries seem to be available now, but
are there any higher-level libraries available in Crystal?
Features:
Loading/unloading a model; sending text to embed and getting back a single embedded vector (mean, last, cls…); rerank; and generate.
Either as bindings to C/C++ or as a separate fast, small REST API HTTP server. (A rough sketch of the interface I have in mind follows below.)
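Something like this hypothetical Crystal interface is what I mean. None of these names exist in any shard; they only make the feature list above concrete:

```crystal
# Hypothetical interface -- no such shard exists; this just pins down the features above.
module LocalLLM
  enum Pooling
    Mean
    Cls
    Last
  end

  abstract class Engine
    # Load a GGUF/ONNX model from disk.
    abstract def load(path : String) : Nil
    # Free the model and its buffers.
    abstract def unload : Nil
    # Embed text and return one pooled vector.
    abstract def embed(text : String, pooling : Pooling) : Array(Float32)
    # Score query/document pairs for reranking (higher = more relevant).
    abstract def rerank(query : String, documents : Array(String)) : Array(Float64)
    # Generate/summarize from a prompt.
    abstract def generate(prompt : String, max_tokens : Int32) : String
  end
end
```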
Any info, directions, or experience of yours (especially positive) would be very welcome.
Thank you in advance.
I’m not a professional engineer, and I don’t know much about LLMs or math. So I might not be more knowledgeable than you. Still, I think it’s important to narrow down the issue to get better answers.
onnxruntime.cr and llama.cr are mostly built through Vibe Coding. I tried them because I like writing C bindings with FFI in Ruby, though I'm not very experienced with C.
I do understand that in Crystal, memory issues like leaks or segmentation faults can easily happen if you don’t handle ownership properly, especially when using finalize. I’m not sure llama.cr or onnxruntime.cr fully solve that.
That said, it sounds like the problem is happening earlier.
First, what model are you trying to run?
Is it custom, or a pre-trained model from Hugging Face?
If it’s a pre-trained Transformer, ollama and llama.cpp should work.
If it’s not a Transformer, getting it to run will be much harder—and might not be worth the effort.
If the model should work with ollama but doesn’t, then it’s likely not a Crystal issue. You should first make sure the model runs correctly.
It also sounds like you’re stuck on tokenization.
If that’s the case, I made tiktoken-cr. It might help—though no guarantees.
The Ollama REST API looks simple enough for any language to interact with: send the request and get the response one token at a time (just build up the response string as the pieces arrive), or pass stream: false to get only the final response.
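For example, the non-streaming case is just one POST with Crystal's standard library. This is only a sketch, assuming Ollama on its default port 11434 and a model you have already pulled (the model name and prompt are placeholders):

```crystal
require "http/client"
require "json"

# Ask Ollama for a single, non-streamed completion.
body = {
  "model"  => "llama3",          # placeholder: any model you have pulled
  "prompt" => "Summarize: ...",
  "stream" => false,             # false => one final JSON object instead of a token stream
}.to_json

response = HTTP::Client.post(
  "http://localhost:11434/api/generate",
  headers: HTTP::Headers{"Content-Type" => "application/json"},
  body: body
)

# With stream: false, the whole generated text comes back in "response".
puts JSON.parse(response.body)["response"]
```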
Thanks for the answers. I've seen your work and tried it. I didn't get far, I must say; I got stuck on the tokenizer. I am only prototyping for now, since I am looking for a stable platform to develop on.
That's why I am asking whether others have had any success with similar projects.
Otherwise I use HF models, ONNX or GGUF (for llama.cpp, Ollama). I went with llama.cpp because Ollama didn't implement rerank.
What I found was an even more unstable dev platform. For example, embedding (pooling) doesn't work as of now on simple models like all-miniLM or qwen3emb, all GGUF.
I have issues like this: Does --embedding, --pooling even work? · ggml-org/llama.cpp · Discussion #14627 · GitHub
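For reference, this is roughly the call I am trying to get working against llama-server, started with the flags from that discussion (--embedding, --pooling). Just a sketch; the port and the OpenAI-style endpoint are llama-server defaults as far as I know:

```crystal
require "http/client"
require "json"

# Sketch: query a locally running llama-server (default port 8080), started with
# something like: llama-server -m model.gguf --embedding --pooling mean
body = {
  "model" => "model.gguf",     # placeholder; the server uses whatever model it was started with
  "input" => "text to embed",
}.to_json

response = HTTP::Client.post(
  "http://localhost:8080/v1/embeddings",
  headers: HTTP::Headers{"Content-Type" => "application/json"},
  body: body
)

parsed = JSON.parse(response.body)
# If pooling works, one vector per input should come back under data[0].embedding.
vector = parsed["data"][0]["embedding"].as_a
puts vector.size
```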
ONNX seems more stable; it has releases and MS backing. But with ONNX one needs tokenizers, since ONNX Runtime only serves the raw model by itself.
I even tried OpenVINO on bare metal, but hit a wall when I got to the pipelines, which involve Python and setting up tokenizers.
In Ruby, ankane made onnxruntime-ruby and tokenizers too, and with those (plus the informers gem) I could get some work done.
It’s quite possible that you’re the first person in the world seriously trying to build a RAG (Retrieval-Augmented Generation) system in the Crystal programming language. (I only just learned the term “RAG” myself.) At the very least, you’re probably among the first five.
That’s what it means to use Crystal. You can’t expect high-quality libraries—most of the time, you’ll have to build things yourself. I enjoy that and make little side projects using Vibe Coding, but I wouldn’t recommend it for professional work.
You seem to understand concepts like reranking and pooling—things I'm not familiar with. You also know model names like all-miniLM and qwen3emb, all in GGUF. So I figured maybe you're well-versed in Python. ONNX is developed by Microsoft, but in the LLM world, it doesn't have the best reputation. Yet you're intentionally choosing to use it, which made me think you might be closer to academia than industry. But at the same time, you're using Ruby and Crystal—that's quite rare. So I'm not sure what your background is. Maybe you're a kindred spirit—someone who just enjoys doing things their own way?
Internally, llama.cpp uses a tensor library called GGML, which is optimized for operations on up to 4D tensors. GGML defines computations as graphs and executes them efficiently using SIMD and other low-level techniques. Because of this, if you want to extract intermediate results—like the output of an attention layer—you usually need to explicitly insert hooks. Compared to Python, it’s surprisingly less suited for inspecting or modifying internals.
If you’re comfortable with C or C++, then in the AI ecosystem, calling the tokenizer is actually one of the easier parts. Like with tiktoken-cr, you can get by just writing some bindings.
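For example, a minimal binding skeleton in Crystal looks like this. Everything here is made up ("mytok" and its functions are hypothetical); it only shows the shape of an FFI binding, not a real tokenizer API:

```crystal
# Hypothetical C tokenizer library -- "mytok" and every function below are
# stand-ins for whatever tokenizer you actually link against.
@[Link("mytok")]
lib LibMyTok
  type Tokenizer = Void*

  fun mytok_new(vocab_path : LibC::Char*) : Tokenizer
  fun mytok_free(tok : Tokenizer)
  # Writes up to `max` token ids into `buf`, returns how many were written.
  fun mytok_encode(tok : Tokenizer, text : LibC::Char*, buf : Int32*, max : Int32) : Int32
end

tok = LibMyTok.mytok_new("vocab.json")
ids = Array(Int32).new(512, 0)
n   = LibMyTok.mytok_encode(tok, "hello world", ids.to_unsafe, ids.size)
puts ids[0, n].inspect
LibMyTok.mytok_free(tok)
```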
It looks like llama.cpp is already addressing the issue.
Development on llama.cpp is very active, which makes it challenging to keep bindings up to date—especially for someone who relies mostly on prompting AI. It would be great if someone with a solid background in AI could provide a well-structured binding for it.
Two years ago, I created a Ruby binding for the Rust-based AI library candle, not through Vibe Coding, but the old-fashioned way—by copy-pasting into ChatGPT. The project was later transferred to assaydepot, and development is still ongoing.
If anyone is interested in taking over as the project owner, I’m open to transferring ownership of llama.cr as well.
Yes, I am just looking at the changes and battling through the code with the help of AI, to see whether the fault was in the C++ part or in ggml.c.
Will test these changes, of course. This is a core feature.
From the perspective of any Crystal bindings, I don't think the underlying library code in llama.cpp is mature enough. IDK. It's changing so fast.
Dealing with an HTTP API is much easier and ultimately more scalable.
I am new to AI, and only do the more high-level stuff. The nitty-gritty is > me, so if !c++ do crystal. :-) end.