AI library for Crystal

Okay, got it. When you said large language models only generate text, does that mean some other models can do it?

By “can do it”, do you mean whether other models can run commands themselves?

If that’s what you meant: no, AI models really only operate on numbers, vectors, and matrices. The “model” that we’re talking about is a purely mathematical model. With language models, there are words (or parts of words called “tokens”) associated with those numbers so it can generate text for you. But that’s really where it stops.

What you’re trying to do requires a higher-level abstraction on top of the model itself. An analogy might be: let’s say you want a cappuccino. The AI model is just the espresso beans. The beans are a crucial component, but you have to make the rest of the cappuccino yourself.

Okay, I basically got the point with your example. Thank you, I like how you explain :blush:

Very helpful everyone, thanks a lot!

So basically, I just need to establish a context for the AI and implement the process to run the code.

Is it normal that when I try to load my model, the app always tries to find it in a very strange path? It looks like a macOS-style path:

zohran@alienware-m17-r3 ~/Downloads/Test $ crystal Test.cr 
Unhandled exception: Error opening file with mode 'r': '/Users/zohran/models/dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf': No such file or directory (File::NotFoundError)
  from /usr/lib64/crystal/crystal/system/unix/file.cr:12:7 in 'open'
  from /usr/lib64/crystal/file.cr:175:10 in 'new'
  from /usr/lib64/crystal/file.cr:741:12 in 'read_meta_data'
  from lib/llamero/src/models/meta_data/meta_data_reader.cr:34:5 in 'initialize'
  from lib/llamero/src/models/meta_data/meta_data_reader.cr:32:3 in 'new'
  from lib/llamero/src/models/base_model.cr:118:25 in 'initialize:model_name'
  from lib/llamero/src/models/base_model.cr:84:3 in 'new:model_name'
  from Test.cr:3:9 in '__crystal_main'
  from /usr/lib64/crystal/crystal/main.cr:118:5 in 'main_user_code'
  from /usr/lib64/crystal/crystal/main.cr:104:7 in 'main'
  from /usr/lib64/crystal/crystal/main.cr:130:3 in 'main'
  from /usr/lib64/libc.so.6 in '??'
  from /usr/lib64/libc.so.6 in '__libc_start_main'
  from /home/zohran/.cache/crystal/crystal-run-Test.tmp in '_start'
  from ???

Yes, that’s correct: it looks for models relative to your system user instead of your project's source. There’s an ivar you can overload to change that if you need to, but I don’t recommend putting models in your project source like that. Since they are so large, it makes more sense to have one folder that any app can access to run the model.

So which ivar should I set to do that? Could I just see how to do that, please?

What does that mean?

He said he changed the title from IA to AI.

But he forgot to modify the same part in the post content.


Before I explain how to do the configuration, I’m going to write some introductory docs here for you (and everyone). It’s important to understand these basic concepts because Llamero introduces a slightly new variation on how to write applications that let your AI models interact with the user and system.

Getting Started With Llamero

Llamero is built to act as an interface between a Large Language Model (LLM) and a locally running application. This interface makes for a pleasant, simple way to build apps that utilize locally running models, taking advantage of your own hardware for offline and unlimited use.

In particular, it is designed to take advantage of Apple M-series hardware, but it is not limited to Apple hardware.

By utilizing Llamero, a developer can write applications that interact with an LLM and allow the LLM to control the flow of the application. This includes developing agents and agentic workflows.

Llamero is currently developed to be single threaded and does not manage memory for the models being used. This means you must use one model at a time, and choosing a model that is larger than your system's memory can cause hard crashes.

Key Concepts

  • Models
  • Prompts
  • Grammars
  • Tokens (not documented here, I’ll write more about this later)

About Models

Llamero is currently only designed to work with LLMs, specifically chat LLMs. Local LLMs vary in size and therefore in capabilities. When we work with LLMs, the number of parameters is a good indicator of how capable a model is in its ability to use logic and reasoning.

Models are typically sized with a parameter count like 7b or 22b. This means the model has 7 billion or 22 billion parameters, respectively. The general rule of thumb for open source models is that smaller models use less memory and process faster but have very limited logic and reasoning. The larger the number of parameters, the more capable a model becomes.

Llamero provides a Llamero::BaseModel class that comes with basic default configuration that should allow for quick agent development, and will let you use multiple models throughout your application.

All models have a limited context window. This is very important, as it represents the total length of text a model can keep track of in the conversation. A prompt can exceed the context window, but the model's responses will start veering off topic and become unreliable once it is exceeded.

The default path where models are looked for is the ~/models folder, but if you need to change this you can set

property model_root_path : Path

on your model class to the full path of the folder containing your model file.
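
As a sketch, this mirrors the model_root_path property above and the constructor call used later in this thread (the shared folder path here is just an illustration):

require "llamero"

# Point Llamero at a shared models folder instead of the default ~/models
model = Llamero::BaseModel.new(
  model_root_path: Path.new("/opt/shared-models"), # hypothetical shared folder
  model_name: "meta-llama-3-8b-instruct-Q6_K.gguf"
)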

You can check the src/models/base_model.cr file for all of the default properties; their descriptions explain how and what they do. Some are specific to AI models, not just run-time configuration. Things like top_k_sampling and repeat_penalty are examples of AI fine-tuning properties that you can customize, while model_root_path is an example of run-time configuration.

About Prompts

Prompts are a simple collection of interactions that are provided to the model. Every LLM uses some form of a system prompt, which is meant to provide explicit instructions and rules that set the model's perspective before it receives further instructions.

The user role is the text provided by your application for the model to infer from.

The assistant role is the model's response.

You can create a prompt chain, which is the series of back-and-forth messages in the chat, typically starting with the system prompt.
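
For example, a prompt chain might look like this (a sketch using the BasePrompt and PromptMessage API from the grammar example later in this post; the message content is made up, and the role strings follow the user/assistant roles described above):

prompt_chain = Llamero::BasePrompt.new(
  system_prompt: "You are a helpful assistant.",
  messages: [
    Llamero::PromptMessage.new(role: "user", content: "What is Crystal?"),
    Llamero::PromptMessage.new(role: "assistant", content: "Crystal is a compiled language with Ruby-inspired syntax."),
    Llamero::PromptMessage.new(role: "user", content: "Show me a hello world."),
  ]
)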

About Grammars

This is the part that is most unique to Llamero. Grammars are a convenient way to get LLMs to respond in a consistent, parsable format, and Llamero handles parsing that response for you. Grammars both influence the output from the model and hold the response from the model. By using a grammar, you don't have to provide instructions within your prompt describing your expected response and then parse it yourself.

Let’s say we have some text that we want to extract a customer's information from. This is unstructured text and we want the model to perform the extraction for us.

When using LLMs from any other service, your prompt will look something like this:

You are a client services professional and a new customer has emailed you inquiring about our company's services. Here is the customer's email. I want you to extract the customer's first name, last name, and email address. Your response must be JSON using this template: { "first_name": "", "last_name": "", "email_address": "" }

#{Insert some unstructured user input here}

You would then need to parse this into a class, hope that it parses correctly, and write the retry logic to re-query the model if parsing fails.
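
Roughly, that manual approach looks something like this (a sketch only: the model's reply is hard-coded here, and in a real app you would re-query the model inside the rescue):

require "json"

class CustomerInfo
  include JSON::Serializable
  property first_name : String
  property last_name : String
  property email_address : String
end

# Pretend this came back from the model
raw_reply = %({"first_name": "Jane", "last_name": "Doe", "email_address": "jane@example.com"})

customer = begin
  CustomerInfo.from_json(raw_reply)
rescue ex : JSON::ParseException | JSON::SerializableError
  # The model didn't respect the template; re-query it and try again here
  nil
end

puts customer.try &.first_name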

When using Llamero models with grammars, this is all done for you.
This entire portion of the prompt is eliminated:

Your response must be JSON using this template: { "first_name": "", "last_name": "", "email_address": "" }

This also eliminates the need to retry if parsing fails, because Llamero handles that for you.

Grammars are a super power, and they influence the output from the model.

# A simple example from the above scenario
class NewCustomerStructuredResponse < Llamero::BaseGrammar
  property first_name : String = ""
  property last_name : String = ""
  property email_address : String = ""
end

# Property names influence the model's output, play around with an example like this
class ExampleOfInfluenceStructuredResponse < Llamero::BaseGrammar
  property a_random_number_between_zero_and_ten : Int32 = 0
  property a_random_number_greater_than_one_hundred : Int32 = 0
end

Now if you want to have your model write and execute commands on your local machine:

class ExecutableCommandStructuredResponse < Llamero::BaseGrammar
  property system_command_to_execute : String
end

base_prompt = Llamero::BasePrompt.new(
      system_prompt: "You are a Senior Ruby on Rails developer who is working on an application. Your job is to write the system command that is necessary to perform the next step",
      messages: [
        Llamero::PromptMessage.new(role: "user", content: "Create a new scaffold for a Customer that has the following properties: first_name, last_name, email_address"),
      ]
    )

base_model = Llamero::BaseModel.new(model_name: "meta-llama-3-8b-instruct-Q6_K.gguf")

# the `ai_response` is now our structured response
ai_response = base_model.chat(base_prompt, ExecutableCommandStructuredResponse.from_json(%({})))

# Let's execute the command we have from the model
`#{ai_response.system_command_to_execute}`

There’s enough to be dangerous and write something cool! I’ll continue adding docs to the repo, especially for Cursor. I’ve found it to be incredibly helpful.


Actually it’s weird: if you run it locally, even when you just ask a simple question like what time it is, it answers wrong.

Think of the AI model like it’s a service that runs on your computer, but is isolated. It does not perform any actions, it just talks. It’ll make stuff up because its job is to answer. The goal for Llamero was to create a way to talk to the model and have the responses be in a format your app can use to perform tasks like getting the time, checking folders, going to the web for things, etc.

The model is just a brain. It’s our job to write the “body” that makes the brain capable of performing tasks.

Thanks a lot for your guide.

So before I start to teach the model how to run any command, I'm actually just trying to understand how to start a chat with the AI, like a ChatGPT assistant.

So I did that:

require "llamero"

model = Llamero::BaseModel.new(model_root_path: Path.new("/home/zohran/Downloads/Test/"), model_name: "dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf")

answer = model.quick_chat([{ role: "user", content: "Hi !" }])
puts answer

Why don't I see any answer, and why can't I write anything?

I’m not sure how to teach an AI to run commands. The method I can think of is:

  1. Give the AI an instruction: I need to do something, please provide the code between <code> </code>.
  2. Use regular expressions to match the code enclosed within <code></code>.
  3. Check the correctness of the code and execute it (a rough sketch of these steps follows this list).
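
A rough sketch of steps 1–3 (the AI's reply is hard-coded here, and the correctness check is left as a comment):

reply = "Sure, here you go: <code>ls -la</code>" # pretend this came from the AI

if match = reply.match(/<code>(.*?)<\/code>/m)
  code = match[1].strip
  puts "Extracted: #{code}"
  # Check the correctness of the code here before executing it
  # output = `#{code}`
end
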
If you only want to integrate AI into your application without making adjustments to the AI internally, there are many ready-made AI service providers available, or you can use the open-source ollama.

I once wrote a program that simply converts ollama's interface to OpenAI's interface and serves it. It implements a basic command-line program that reads and outputs a complete response in one large chunk, and also reads and outputs streaming, multi-section content.

Here is my code repository, hope this helps you.

I did some testing with Mixtral and I was getting the same issue, which is weird because I use that same model on my work laptop and it doesn’t freeze up like that.

Anyway, I redid the test with Llama3 8b and it worked fine. Llama is pretty… strange. It tends to follow directions 60-75% of the time, but definitely not well.

Relatively speaking, ChatGPT and Claude are models (and Mixtures of Experts) that have several hundred billion to over 1 trillion parameters. So you will not get the same kind of experience with an un-fine-tuned local model as you do with GPT-4o or Claude.

However, I do plan to create a way to train LoRA filters as part of Llamero. Based on research I’ve been finding (watch https://www.youtube.com/watch?v=IeLxmeHdHWg for a great presentation on this), this is how you’ll get a local model to be as accurate and effective as the larger hosted models. Right now there isn’t an easy way to create the fine-tuning filters, so once I get that done everything will really come together!

Also, @Sunrise, you may find this interesting: I wrote a quick example to show how you can get a local AI model to execute commands on your local computer.


I want to emphasize something important here: Llamero is to AI models as a db adapter is to a database.

I have another shard I plan to release that will be an agent development framework, which will make working with various models a lot easier by creating common abstractions for standard workflows. Basically, a LangChain for Crystal and local models.

Definitely still experiment with Llamero and let me know any issues you run into. I appreciate your attempts so far. This is the wild wild west of tech right now, and fortune favors the bold!


So I'm happy that issue isn't related just to me x)

So should I stick with the version you specify in your repository, or can I migrate to the latest one?

Basically, if I want to get the AI's answers one by one without any shard, how are we supposed to proceed normally?

So I will try again with your example now. I will come back to you

So I still have the same problem, just hanging after that:

zohran@alienware-m17-r3 ~/Downloads/Llama Test $ crystal test.cr 
2024-07-18T13:37:39.867706Z  ERROR - An error was encountered: Capacity too big
2024-07-18T13:37:39.867717Z  ERROR - An error was encountered: Capacity too big
2024-07-18T13:37:39.867925Z   INFO - Interacting with the model...
2024-07-18T13:37:39.869880Z   INFO - The AI is now processing... please wait

I think I found part of the problem. When I started to inspect the log out of curiosity, I noticed your shard passes the wrong number of threads on my computer:

[1721309859] Log start
[1721309859] Cmd: /usr/local/bin/llamacpp -m "/home/zohran/Downloads/Llama Test/dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf" --grammar-file tmp/grammar_file.gbnf --n-predict 512 --threads 18 --ctx-size 2048 --temp 0.9 --top-k 80 --repeat-penalty 1.1 --prompt "
You are a Ruby on Rails expert, your are building an application using the rails CLI commands. Your response 

My computer can use a maximum of 16 …

So I did a test with the latest version. It looks like it works better …

zohran@alienware-m17-r3 ~/Downloads/Llama Test $ crystal test.cr 
2024-07-18T13:46:48.226821Z  ERROR - An error was encountered: Capacity too big
2024-07-18T13:46:48.226832Z  ERROR - An error was encountered: Capacity too big
2024-07-18T13:46:48.227003Z   INFO - Interacting with the model...
2024-07-18T13:46:48.231427Z   INFO - The AI is now processing... please wait
2024-07-18T13:47:10.648968Z   INFO - The process completed with a status of: Process::Status[0]
2024-07-18T13:47:10.648988Z   INFO - Recieved a successful response from the model
#<StructuredResponse:0x7f68d5fe4380 @command_to_run="">
2024-07-18T13:47:10.649155Z   INFO - We have recieved the the output from the AI, and parsed into the response
sh: -c: line 2: syntax error: unexpected end of file

But how should I proceed to get the answer from the AI? Should I parse the output of the process? Is there any way in your shard to launch the process in quiet mode?

Can you share the code snippet you used?

The quick_chat method is just for getting back the raw string output from the process. In my example I used the chat method, which works quietly and handles parsing the response.
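
Concretely, a quick sketch reusing the names from my earlier example in this thread (only the calls already shown above):

# quick_chat: returns the raw string output from the llama.cpp process
raw = model.quick_chat([{ role: "user", content: "Hi!" }])
puts raw

# chat: runs quietly and parses the response into the grammar class
response = base_model.chat(base_prompt, ExecutableCommandStructuredResponse.from_json(%({})))
puts response.system_command_to_execute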

And nice catch with the number of threads. I haven't found a way to detect how many GPU threads a processor has available yet to make this dynamic. I think that's going to have to come once I bind the Metal library directly in Crystal instead of wrapping llama.cpp.
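
(As an aside, for the CPU-side --threads value specifically, Crystal's stdlib can report the number of logical CPUs, though that says nothing about GPU threads:)

puts System.cpu_count # e.g. 16 on the machine from the log above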