Training ChatGPT on Crystal's standard lib and syntax

Has anyone tried creating a way to use all of the std lib documentation and train OpenAI/ChatGPT on Crystal’s standard library?

I know that when I ask for examples in Crystal I currently get a lot of mixed ruby syntax in there, so why not take the thousands of lines of code and comments and turn it into good training material for ChatGPT?

I’m just thinking about how we can lean into the AI revolution that is happening right now and augmenting developers around the world.

Being a “younger” and smaller language compared to others, we have the major advantage of a well-documented std lib that’s readily available and strongly versioned. This is a huge strength because we can more easily teach AI models how to best write this language. Has anyone tried it yet?

I’m going to experiment with it. Part of Amber 2.0’s philosophy is that I want to lean into AI tooling that augments developer productivity, and the standard lib is critical to getting ideas ramped up and prototyped/outlined quickly. Having good documentation that can also be used to create training material for version releases is going to be incredibly important in the future, and I want to see this community get established quickly and early.

4 Likes

Yes, it would be wonderful if that could happen.

I recently learned that you can do fine tuning of models using OpenAI’s API. However, the training data would need to be in the format shown below.

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...

I don’t know how to generate such training data from the documentation. How can we easily generate training data from documents? Or maybe ChatGPT can be used for that as well. (Prompt: Please read the given Crystal documentation and generate a collection of possible questions and answers from its content)
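As a rough sketch of that idea — assuming a hypothetical flattened input file `docs.json` with `doc` and `example` fields (the real `crystal docs` JSON output is nested differently, so it would need adapting) — the conversion to OpenAI’s JSONL format could look like:

```crystal
require "json"

# Sketch: turn documentation entries into OpenAI-style fine-tuning JSONL.
# Assumes each entry looks like {"doc": "...", "example": "..."}.
entries = JSON.parse(File.read("docs.json")).as_a

File.open("training.jsonl", "w") do |out|
  entries.each do |entry|
    pair = {
      "prompt"     => "Write Crystal code: #{entry["doc"].as_s}",
      "completion" => entry["example"].as_s,
    }
    out.puts pair.to_json # one JSON object per line
  end
end
```

The interesting part is really the left side of that pipeline — deciding what makes a good prompt out of a doc comment — which is where your ChatGPT-generated Q&A idea could come in.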

@kojix2 I’ve got that info as well. I was thinking of starting and experimenting with Amber, since we have a lot of help docs that could more easily be turned into the “prompt” and “completion” JSON. I also think GPT-4 could fairly easily be used to take explanation/comment text and turn it into prompt/completion text as well.

I need to experiment with it some more, probably this weekend depending on the weather and my wife’s demands for yard projects.

But big picture and long term, I think we should either add something to Amber or make a PR on the crystal-lang repo to expand the docs command so it can output docs in a JSON/CSV format suitable for training models, including but not limited to ChatGPT. Other models require a format more like:

[
  {
    "method": "get",
    "example": "response = HTTP::Client.get(\"https://api.example.com/products/42\")"
  },
  {
    "method": "get",
    "example": "response = HTTP::Client.get(\"https://api.example.com/comments?post_id=7\")"
  },
  {
    "method": "get",
    "example": "response = HTTP::Client.get(\"https://api.example.com/search?q=crystal+language\")"
  }
]

I generated this example by giving ChatGPT the HTTP::Client - Crystal 1.7.3 link and asking it to provide examples of training and validation data for training itself.

This doesn’t line up with the example format provided by OpenAI in their docs, but their docs also say the CLI tool can convert common training data formats into what it needs. I want to play around with this more; I can’t wait to figure this out.

I am not a professional engineer, but I have been enjoying generating simple Crystal language code using recent AI assistants like “Cline” and “Claude sonnet.” These AI tools enable me to casually create web applications or Android applications, providing both personal enjoyment and educational benefits. However, due to security concerns, my use of these tools remains strictly personal.

The macro feature of the Crystal language is not fully documented officially. To gain a deeper understanding of macros, one often needs to directly refer to the source code of the standard library. The reason macros are not extensively documented could be intentional, as they might currently lack stability or there may be an effort to discourage their misuse. Nonetheless, uploading Crystal’s official documentation and standard library into an AI service like Perplexity—where uploaded files can be searched and utilized—could potentially help facilitate understanding and practical usage of macros.

On another note, while the Crystal community typically avoids overly flashy promotions, someone with strong communication skills could create and share engaging content such as a “Build an app in 10 minutes” video using the Crystal language and the Kemal web framework on platforms like YouTube (though I myself am not suited for such a task).

(Translation from Japanese using ChatGPT)

1 Like

The following post was recently discussed in Ruby.
Ruby’s Renaissance in the AI Era by Yacine Petitprez

I don’t think this is entirely true. The reference book has Macros - Crystal, which is the high-level intro to macros, and the macro API is found under Crystal::Macros - Crystal 1.15.0. Is there a specific part that you think is missing/lacking?

What I had in mind here is around method_missing.
I’m certain this will intentionally remain undocumented in the future, as its use is generally discouraged, but when you look at actual usage examples in the standard library, you find that it’s capable of richer applications than you’d expect. I suspect there are probably several other cases like this as well.

It’s documented within the Hooks - Crystal section. But yes, my understanding is that its usage is somewhat discouraged, as there are likely better ways to handle it. As of now, though, it’s not explicitly deprecated or anything.

All it actually says on that page is that there is method_missing. You need to actually look at the code to see how the arguments and keyword arguments are handled. And my understanding is that it is not recommended.

:thinking: I’m not sure I follow. method_missing is a macro that you implement yourself that only has a single Crystal::Macros::Call parameter. What you do with it is entirely up to you. What code are you having to look at? Or do you mean like if a shard is using it there’s no way to see what they’re doing with it?
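For illustration, a minimal sketch (close to the example in the reference book) of what that single Call node gives you:

```crystal
class Ghost
  # The macro body becomes the body of every otherwise-undefined method;
  # `call` is a Crystal::Macros::Call AST node, inspected at compile time.
  macro method_missing(call)
    print "Got ", {{call.name.id.stringify}}, " with ", {{call.args.size}}, " arguments", '\n'
  end
end

Ghost.new.foo       # Got foo with 0 arguments
Ghost.new.bar(1, 2) # Got bar with 2 arguments
```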

1 Like

Yes, you can use Crystal::Macros::Call to handle method_missing more finely, but I couldn’t have understood that from the description on the specification page above.

I actually tried it and wrote a blog post. It didn’t take 10 minutes—it took me a whole day. But it works. If you’re into Crystal and web development, you can probably do it much better and faster.

4 Likes

I think I understand what @kojix2 meant, at least a little bit.

When you start using macros, the problems you run into are much harder than what the docs say.
For example, when is a macro defined, and when is it invoked as a Crystal app is compiled and run? If you pass an argument to a macro, when and how is that argument evaluated? There are many questions like that.
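As a concrete illustration of the timing question — a macro’s argument is an AST node pasted at compile time, not a runtime value:

```crystal
macro log_and_eval(expr)
  # Expanded at compile time: `expr` is an AST node, so `stringify`
  # produces the expression's source text as a string literal.
  puts "evaluating: " + {{expr.stringify}}
  puts {{expr}} # the expression itself is only evaluated at run time
end

log_and_eval(1 + 2)
# At run time this prints:
#   evaluating: 1 + 2
#   3
```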

I’ve been wondering about this. There’s a project called Unsloth: https://docs.unsloth.ai/. These guys offer the means to fine-tune pretty much any LLM in an affordable way.

This means we could just fine tune it to write crystal code in a modern, idiomatic and efficient way. The only thing we would require is good code examples.

That said, I’ve been using LLMs to generate a lot of crystal code and it’s not terrible, IMHO. On the other hand, I am sure that if anyone here looks at it they’re gonna roll over and die. ;D

Figured I would leave an update on this. Since I first tried this, a lot of the world has changed, especially since April 2023. We’re almost three full years later, and model capabilities have more than tripled in that time. Because of that radical increase, models have not had to specifically learn Crystal as precisely as originally thought. They have also probably just improved how they’re being trained. So I do think the depth of knowledge about Crystal itself has increased, and not necessarily from any extra efforts on our part.

That being said, I currently use Claude Opus 4.5 and I’m having a radically improved experience, which leads me to believe that we are going to see a general increase in all coding agents’ accessibility over the next six months. Generally, what happens is that Opus is one of the frontier models that leads the way and sets a standard, and within three or four months of its release, other models start catching up.

That being said, I still don’t have a good formal training data set organized in a way that has been meaningfully successful fine-tuning a local model. Instead, I have found a considerable amount of progress by doing prompt engineering and agent orchestration.

We have spoken a lot about prompt engineering, but very little is actually being talked about when it comes to agent orchestration. I think that is a big miss, but it is the next step, and it’s frankly going to be the step of 2026 that starts spreading things like wildfire.

The reason for this is simple. By the time you’re trying to orchestrate multiple agents, you have had your assistant and coding agent escape the chat window. And what that means is that it no longer requires you to sit there and proactively or interactively work with it in order for it to accomplish tasks.

For example, I have set up a process where I can upload voice memos and recordings to a folder on my iCloud Drive. Whisper then runs and transcribes the audio file into a raw transcript. Claude then runs, and an intake agent reads that transcript, does a bunch of executive tasking to break it up into different topics, summarizes it against the current projects I have going, and performs other significant and very helpful organizational steps to consistently and reliably maintain a knowledge base. It’s then capable of seeing that a task was requested of another agent we have available, and it will start that agent on the task it needs to accomplish.

I no longer have to be physically present at my desk to get Claude to begin working on a project. And frankly, it’s quite wonderful. If any of my agents get stuck and they need help, they can actually use a communication agent that will then give me a call and talk to me about the issue and relay that back to the agent working on my local device, who then hopefully can get unstuck and continue on.

These days, I work almost exclusively in Crystal, especially whenever I’m using my own coding agent. However, I do like to periodically branch out to other languages to get a better idea of how effective other, more popular languages with more content out there are. And I have to say that I’m quite impressed with how effective these other languages are. If they have good tooling and a lot of content out there, it’s kind of dangerous, actually.

The quality of the content in the world that the models are getting trained on is very important. I do consistently find that content around languages like Rust, Go, or C is significantly higher quality, and models can tackle better and harder concepts in them than in JavaScript. I find that JavaScript has been flooded by YouTube bros peddling starting an agency, and it’s reflected in the quality of the code that models generally write.

So, I think that leads us down a path where we can all individually write anything that we want as long as we understand that what we write and publish out there is going to eventually end up in these models’ memories and influencing what it is that they do.

If you set up an agent with access to memory like I have, you can very easily build long, deep preferences that the agent is capable of following nearly indefinitely.

I think the biggest thing that we need to solve is how we distribute the libraries that we write to enable people and their coding assistants from the start. I just don’t know what that tooling looks like at this moment.

I think we are missing the basic things like the /llms.txt file or an official MCP server for documentation.

I think having these things first could be really beneficial for enabling a better agentic developing experience for Crystal.

For example: https://bun.sh/llms.txt
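A hypothetical /llms.txt for Crystal, following the convention the Bun example uses (an H1 title, a one-line blockquote summary, then sections of annotated links):

```markdown
# Crystal

> Crystal is a compiled, statically type-checked programming language with Ruby-inspired syntax.

## Docs

- [Language Reference](https://crystal-lang.org/reference/): syntax, semantics, and the macro system
- [Standard Library API](https://crystal-lang.org/api/): generated documentation for the std lib
```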

4 Likes

This technique is new to me. The trick of creating a list that links official documentation URLs and attaching it to the repository is quite simple and robust.
It may be a short-lived “hack” that will remain effective for a few months but will be obsolete next year.
But I would like to try it out.

I ended up making a fork of shards so I could add some commands related to SOC 2 and ISO 27001 security management, and I added the ability to install skills/subagents/MCPs for your library from shards itself.

This should make it easier to create libraries that people can use together with their coding assistants.

I’m still new to the community, but if we had good standardized documentation even though we are decentralized, it could enable some interesting projects. Maybe we could utilize the crystal docs command and have a site index the default docs folder? I could foresee something like what elixir/beam has with https://hex.pm/packages/hexdocs_mcp being very handy.

I suspect that crystaldoc.info would be a good candidate since it mostly tries this for documentation anyway, but bandwidth might end up being a concern if bots start being able to pull tons of files for documentation.

I did something like this and included shards-alpha being able to run MCP servers for the various libraries and their documentation. This could then be circulated with all shards, so anyone who installs a lib also gets the tooling and docs intended for the matching version.

This just feels like the future of how working with AI coding assistants and their tooling should be.