I used the one you posted as a screenshot in your previous post.
For the number of threads, on Linux you can get it by just running this command:
nproc
So crimson, just to let you know: basically, what I would like to do (if I can get an example) is send a question to the AI by setting a prompt (which I would like to load from a file), then get the answer and save it in a file, or in a variable inside my Crystal code, and then close. That’s basically what I want to do.
I want that variable or that file to contain just the answer, nothing else.
I don’t know if you get my point?
Okay I had assumed you were using Apple hardware. The way llama.cpp works is very dependent on your hardware, and I haven’t had a chance to thoroughly test on non-Apple hardware.
nproc
lets you know the number of CPU cores available for threads, but you will want to set the threads to the number of threads available for your GPU. Also, depending on the amount of VRAM you have, the model will only get partially loaded into memory and processing gets shared between CPU and GPU. There is a big performance gap between GPU and CPU, so if you’re using CPU only I would recommend using a smaller model like Llama 3 8B.
You can definitely do what you’re trying to do; I think my last code example illustrates that. You’ll just need to read from a file and create the prompt you want, then read from the response and save it to your file. I do this now when creating an OpenAPI spec. I can probably update my example later today or tomorrow morning for you. Do you have an example of a command you want it to run? I can aim to demo that.
I’ll try to see if I can figure out why the Mixtral model is working on my work computer but not my personal computer (both Apple M-series) and see if I can offer some settings to help.
I found how to do it, ah ah. I just passed a context file and redirected the answer to another file, in completely quiet mode, to get the answer without any other text:
./llama-cli -m ../dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf --chat-template vicuna-orca --n-predict 512 --threads 16 --ctx-size 2048 --temp 0.9 --top-k 80 --repeat-penalty 1.1 --no-display-prompt -f context.txt > result.txt
Context file:
The user interact with the assistant named epsilon. If the user ask for the time, you must just answer COMMAND:time --show, nothing else. When you answer, you must never continue any conversation.
### User: what's time is it ?
File with result:
COMMAND:time --show
So basically I just need to use Process.run and that’s it…
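Something like this rough sketch should do it in Crystal (I’m reusing the flags and file names from the command above, so adjust them to your setup; untested):

answer_io = IO::Memory.new

status = Process.run(
  "./llama-cli",
  args: [
    "-m", "../dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf",
    "--chat-template", "vicuna-orca",
    "--n-predict", "512",
    "--threads", "16",
    "--ctx-size", "2048",
    "--temp", "0.9",
    "--top-k", "80",
    "--repeat-penalty", "1.1",
    "--no-display-prompt",
    "-f", "context.txt"
  ],
  output: answer_io,               # capture stdout instead of letting it print
  error: Process::Redirect::Close
)

if status.success?
  answer = answer_io.to_s.strip    # only the model's answer, e.g. "COMMAND:time --show"
  File.write("result.txt", answer)
end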
It’s crazy, but I’m actually thinking that soon most of the Linux distribution components we know will probably become obsolete with AI. Like the init system, the shell… etc.
I have one question. Why does the model take quite a bit of time to answer on my laptop? When I test Llama online it’s very quick, and my laptop is quite powerful…
Oh, I usually try to avoid making negative comments, but I can’t hold back on this topic. In AI programming, I think the most important libraries are those for matrix computation. In Python, this is NumPy. The Ruby community has also struggled a lot in this area.
Ruby has Numo::NArray, and Cumo, which works on GPUs. There used to be a library called NMatrix, but in the end, Numo became the de-facto standard library. In Ruby, ankane and yoshoku have created many AI-related libraries, which are built using C library bindings and Numo.
In Crystal, developing matrix computation libraries is even rarer than in Ruby. Still, there is num.cr.
Maintaining foundational libraries like NumPy or Numo is very difficult. Languages other than Python and C++ have not been very successful in doing this. Even popular languages like Rust seem to face challenges. (It may be hard for statically typed languages to handle matrix computations when the dimensions and sizes of the matrices are not known. rust-ndarray)
Therefore, making bindings for libraries like ONNXRuntime or connecting to API servers provided by tools like Ollama seems to be the practical approach.
However, this is not interesting at all! Of course, wrapping llama.cpp or re-implementing it in Crystal would be much more amazing. But, oh, it would feel as challenging as climbing Everest. With my skills, not only is climbing Everest impossible, but even climbing the hills around me is difficult. Still, I understand how tough it is…
(This text was written with the help of ChatGPT)
Hi @crimson-knight, is it possible to use AMD ROCm (GitHub - rocm-arch/rocm-arch: A collection of Arch Linux PKGBUILDs for the ROCm platform) to speed up the process of talking with the AI?
It is so slow when I run the following code on my Arch Linux machine, with an AMD 7840HS (780M integrated GPU) and 64 GB of memory. It uses all 16 cores of my CPU (temperature up to 86°C), while the GPU stays idle, and only after waiting a long time can I get the answer.
require "llamero"
model = Llamero::BaseModel.new(
  model_root_path: Path.new("/home/zw963/models/"),
  model_name: "Meta-Llama-3-8B-Instruct-Q5_K_M.gguf"
)
puts model.quick_chat([{role: "user", content: "Hi!"}])
╰─ $ ./ai_example
2024-07-19T08:54:48.274588Z INFO - Interacting with the model...
2024-07-19T08:54:48.285018Z INFO - The AI is now processing... please wait
I think I will try to find another AI optimized for Linux. I thought Llama was more for Linux… because we lose a lot of performance.
That one is really impressive: https://pi.ai/desktop
Do you know how I can simply send a request to that AI with an HTTP client in Crystal, and get the response? I am just extremely bad with networking.
Actually, calling HTTP APIs is not very difficult. You can use Crystal’s standard library, but the standard library is often not user-friendly, so I prefer Crest. I have made some command-line tools to call APIs. Basically, you open the API reference page and write the query in Crest as described there. (I couldn’t find the API page for pi.ai right away. Do I need to register for the wishlist?)
I have created several command-line tools that call HTTP APIs. These may not be very useful, but here are the links:
Based on the following Ruby code, it should be easy to write Crystal code to call Ollama’s API.
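For example, a minimal sketch using only the standard library could look like this (it assumes Ollama is running locally on its default port 11434 and that the llama3 model has already been pulled):

require "http/client"
require "json"

# Ask Ollama's /api/generate endpoint for a single, non-streamed completion.
body = {
  model:  "llama3",
  prompt: "Hi!",
  stream: false
}.to_json

response = HTTP::Client.post(
  "http://localhost:11434/api/generate",
  headers: HTTP::Headers{"Content-Type" => "application/json"},
  body: body
)

# The answer text is in the "response" field of the returned JSON.
puts JSON.parse(response.body)["response"].as_s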
This kind of work is not very difficult. What is really important for the Crystal community is developing basic libraries for matrix calculations. Another important task is creating binding libraries for C and C++ libraries. I would greatly appreciate such efforts.
(This text was written with the help of ChatGPT)
This is the documentation. If I could just have one example…
I think I found the documentation:
https://docs.aveva.com/bundle/pi-web-api-reference/page/help.html
So I finally found why it was lagging. You must set the number of threads to the number of CPU cores of ONE of the processors.
For example, in my case, if I look at /proc/cpuinfo, I should use the "cpu cores" value reported for a single processor entry.
In my case it’s 8, not 16.
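If you want to read that value from Crystal instead of checking by hand, a small sketch like this should work on Linux (it just parses the "cpu cores" field of /proc/cpuinfo):

# Physical core count of one CPU, as reported by /proc/cpuinfo.
cores_line = File.read_lines("/proc/cpuinfo").find { |line| line.starts_with?("cpu cores") }
physical_cores = cores_line ? cores_line.split(':').last.strip.to_i : 1
puts physical_cores # => 8 on my machine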
Now the answer is really fast.
So I will go back to llama now!
Look at that post, crimson-knight, for the explanation:
https://www.reddit.com/r/LocalLLaMA/comments/190v426/llamacpp_cpu_optimization/
Over the past few weeks, there’s been a sensation about programming by AI agents. I, too, am amazed by the power of this technology.
Today, I asked Cline (claude-sonnet) to implement either NumPy or Numo::NArray (for those who aren’t familiar, this is the standard library in Ruby equivalent to NumPy) using the Crystal language, and the result was quite impressive.
In just another two years, regardless of how obscure the language may be, we may soon be able to simply request an implementation of a matrix computation library like NumPy.
Before long, I believe we will discover new technological gaps or areas where human expertise is indispensable, but for now, I am simply in awe of this tremendous technological revolution.
(Translated from Japanese by ChatGPT)
I realized something extremely obvious: using a matrix library in Crystal doesn’t provide any performance boost. In Python or Ruby, using NumPy or Numo::NArray naturally results in a speedup, so this feels strange (though it’s not strange at all). It seems that in a language like Crystal, matrix computation libraries exist primarily for readability and ease of writing rather than for performance improvements.
This is because the parts of NumPy like linalg are written in C translated from Fortran code that has been continually improved for over 45 years (BLAS was released in 1979), rather than the textbook implementations that LLMs have no issues regurgitating.
It might sound counterintuitive that Python and Ruby are faster than Crystal, but this is because the fast parts are written in neither Python nor Ruby (nor in this case, C) to begin with. You are better off using something like GitHub - konovod/linalg: Linear algebra library based on LAPACK instead.
I’m not aiming to solve a real-world problem with this, so I prefer a simple, textbook-style library implemented in Crystal rather than a LAPACK binding.
But apart from that, I feel that things that seemed impossible a few months ago—like AI creating a faster BLAS in 20 minutes or LLVM making a huge performance leap—are now quite possible in the future. No one knows what the future holds, though.