Would anyone like to help create an opensource AI training dataset for Crystal?

I think we need to dog-pile on creating training data for Crystal, mostly focused on the stdlib.

I think if we dog-pile on it, we can get an enormous amount of training data. In a couple of weeks my sister will be done with chemo, and I should have a fair amount of extra time to focus on this. Would anyone be interested in joining me so we can make Crystal the preferred/top language for AI-driven development?

I submitted a CFP to Helvetic Ruby about “ADD - Agent Driven Development”, which is basically what you can see in the Devin announcement. This just feels like the natural direction development is going to move in. I don’t know if my talk has been approved yet, but hopefully it gets accepted! :slight_smile:

We’re going to be enhanced and augmented by AI, and if we work together as a community it’ll be to all of our benefit.

I’ll make a repo and get started before the end of the month, probably record a YouTube video, and put together a checklist so we can easily split up who does which parts.

If you’re interested, please let me know here or send me a message on Discord! :grin:


I was literally looking into this last weekend. I was trying to figure out how to fine-tune the codellama model on the Crystal stdlib. They released one tuned for Python, which makes sense since it’s created by AI/ML engineers working in Python, but I wanted to do the same for Crystal.

Exceeeeeeept I don’t know how to fine-tune a model. I’ve only been looking at AI stuff for the past couple of months, so I haven’t reached the part where I know what I’m doing yet. One thing I did try was throwing the entire Crystal stdlib at the codellama:34b-code model as context (kinda-sorta emulating OpenAI’s “custom GPTs” feature, for a sufficiently generous interpretation of “emulating”) and saving that, but it didn’t have the effect I’d hoped for: it never wrote code that would work, even code that only used the Crystal stdlib I’d already shown it. So maybe the model needs to be tuned earlier in the pipeline?
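
If anyone wants to reproduce that experiment, the context-stuffing step is roughly this kind of thing; the checkout path and output filename below are just placeholders:

```crystal
# Concatenate the stdlib sources into one giant context/prompt file.
# The checkout path is a placeholder -- point it at your own clone of crystal-lang/crystal.
stdlib_path = "/path/to/crystal/src"

context = String.build do |io|
  Dir.glob("#{stdlib_path}/**/*.cr").each do |file|
    io << "# ---- " << file << " ----\n"
    io << File.read(file) << "\n"
  end
end

File.write("stdlib_context.txt", context)

# Even a rough chars/4 estimate makes it obvious this blows way past the model's context window.
puts "#{context.size} characters, roughly #{context.size // 4} tokens"
```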

I’ve been pretty aggressively focused on practical uses of AI in tech for the last year (see my last post from April last year about fine-tuning ChatGPT), but I’m still just getting into the nitty-gritty.

However, I’ve managed to accomplish a few cool things with a wrapper around llama.cpp:

  1. Using grammar files, configurable per interaction. These are basically unused by the community at large at the moment, but they let you interact with an LLM in natural language and get back a correctly formatted JSON response without having to say “Your response must be valid JSON using the following response template” or some variation of that prompt (see the sketch after this list). This is very new and very underutilized, but going forward I think it’s going to be a major boon because of how many context-window tokens it saves.

  2. Using LoRA adapters (“lora” filters), configurable per interaction. These adapters are what fine-tuning a specific model produces, and several of them can be combined to change how a model analyzes and responds to your interactions. This is ultimately what we would be creating. They are tightly coupled to a particular base model, so keeping the dataset available alongside the model that was used to create them is critical.

  3. Switching between models at will. Ollama currently does this, and I believe it supports LoRA adapters as well, though I think that support is limited.
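
To make item 1 concrete, here’s a minimal sketch of what driving llama.cpp with a grammar file from Crystal could look like. The grammar is a toy, and the binary path, model file, and the commented-out LoRA adapter are placeholders for whatever you have locally:

```crystal
# A toy GBNF grammar that forces the model to emit a single-pair, JSON-ish object.
grammar = <<-GBNF
root  ::= "{ " pair " }"
pair  ::= key ": " value
key   ::= ["] [a-z_]+ ["]
value ::= ["] [^"]* ["] | [0-9]+
GBNF

File.write("crystal_qa.gbnf", grammar)

Process.run(
  "./main",                                         # llama.cpp's CLI binary (newer builds call it llama-cli)
  [
    "-m", "models/codellama-34b-code.Q4_K_M.gguf",  # whatever GGUF you have locally
    "--grammar-file", "crystal_qa.gbnf",            # item 1: constrain the output format
    # "--lora", "adapters/crystal-stdlib-lora.bin", # item 2: apply a (hypothetical) LoRA adapter
    "-p", "Describe Crystal's Array#map as JSON.",
    "-n", "256"
  ],
  output: STDOUT,
  error: STDERR
)
```

Since the grammar file and the adapter are just arguments, they can be swapped on every call, which is what I mean by “configurable per interaction”.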

Ollama is a handy tool for experimenting, and it serves quantized models that are much smaller, so if you can run the 34b codellama model now, I think you’ll be able to run the 70b model through llama.cpp and/or Ollama, since it’s quantized down to a much smaller file (and runs natively in C++ instead of through Python).
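
As a quick illustration of the model-switching point (and of just trying the 70b), here’s roughly what that looks like through Ollama from Crystal; the model tags are placeholders for whatever you have pulled:

```crystal
prompt = "Write a Crystal method that reverses each word in a string."

# Model tags are placeholders -- run `ollama list` to see what you actually have.
["codellama:34b-code", "codellama:70b-code"].each do |model|
  puts "==== #{model} ===="
  Process.run("ollama", ["run", model, prompt], output: STDOUT, error: STDERR)
end
```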

What we’ll need is training data that explains how the language works, demonstrates it with clear examples, and builds on top of those examples. The stdlib is a pretty good start, including the code examples it comes with, but we’ll definitely want to create more.
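
To give a feel for what I mean, here’s one possible shape for a training record, built from the same prose-plus-example pairing the stdlib docs already use; the field names and the JSONL output file are just placeholders:

```crystal
require "json"

# A runnable snippet, modeled on the kind of example the stdlib docs ship with, becomes the "response".
example_code = <<-CRYSTAL
line = "1,2,3"
numbers = line.split(',').map(&.to_i)
numbers.sum # => 6
CRYSTAL

sample = {
  "instruction" => "Parse a comma-separated line into integers using only the Crystal standard library.",
  "response"    => example_code
}

# Append one JSON object per line (JSONL), a common format for fine-tuning tooling.
File.open("crystal_dataset.jsonl", "a") { |f| f.puts(sample.to_json) }
```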