Training ChatGPT on Crystal's standard lib and syntax

Has anyone tried creating a way to use all of the std lib documentation and train OpenAI/ChatGPT on Crystal’s standard library?

I know that when I ask for examples in Crystal I currently get a lot of mixed Ruby syntax in there, so why not take the thousands of lines of code and comments and turn them into good training material for ChatGPT?

I’m just thinking about how we can lean into the AI revolution that is happening right now, augmenting developers around the world.

Being a “younger” and smaller language compared to others, we have the major advantage of a well-documented std lib that’s readily available and strongly versioned. This is a huge strength because we can more easily teach AI models how to write this language well. Has anyone tried it yet?

I’m going to experiment with it. Part of Amber 2.0’s philosophy is that I want to lean into AI tooling that augments developer productivity, and the standard lib is critical to getting ideas ramped up and prototyped/outlined quickly. Having good documentation that can also be used to create training material for version releases is going to be incredibly important in the future, and I want to see this community get established quickly and early.


Yes, it would be wonderful if that could happen.

I recently learned that you can fine-tune models using OpenAI’s API. However, the training data needs to be in the format shown below.

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...

I don’t know how to generate such training data from the documentation. How can we easily generate training data from documents? Or maybe ChatGPT can be used for that as well. (Prompt: Please read the given Crystal documentation and generate a collection of possible questions and answers from its content)
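One possible starting point is a small script that turns documentation entries into prompt/completion lines. This is only a sketch: the `doc_entries` data is made up for illustration, and the prompt phrasing is an assumption, not something from OpenAI's docs.

```python
import json

# Hypothetical (name, doc comment, example) entries -- in practice these
# would be extracted from the generated standard-library documentation.
doc_entries = [
    {
        "name": "String#upcase",
        "doc": "Returns a new String with each lowercase letter replaced "
               "with its uppercase counterpart.",
        "example": "\"hello\".upcase # => \"HELLO\"",
    },
    {
        "name": "Array#sum",
        "doc": "Returns the sum of all the elements in the array.",
        "example": "[1, 2, 3].sum # => 6",
    },
]

def to_jsonl(entries):
    """Turn doc entries into OpenAI-style prompt/completion JSONL lines."""
    lines = []
    for entry in entries:
        # Phrase the doc comment as a question, and the code example as
        # the ideal completion (with a leading space, which OpenAI's
        # data-preparation guidance suggested for completions).
        prompt = f"Write a Crystal example for {entry['name']}: {entry['doc']}"
        completion = " " + entry["example"]
        lines.append(json.dumps({"prompt": prompt, "completion": completion}))
    return "\n".join(lines)

print(to_jsonl(doc_entries))
```

The hard part is still getting `doc_entries` out of the docs in the first place; once you have structured entries, the JSONL step is trivial.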

@kojix2 I’ve got that info as well. I was thinking of starting by experimenting with Amber, since we have a lot of help docs that could more easily be turned into the “prompt” and “completion” JSON. I also think GPT-4 could fairly easily be used to take explanation/comment text and turn it into prompt/completion pairs.

I need to experiment with it some more, probably this weekend depending on the weather and my wife’s demands for yard projects.

But big picture and long-term, I think it would be worth either adding something to Amber or making a PR on the crystal-lang repo to expand the docs command to output docs in a JSON/CSV format suitable for training models, including but not limited to ChatGPT. Other models require a format more like:

[
  {
    "method": "get",
    "example": "response = HTTP::Client.get(\"https://api.example.com/products/42\")"
  },
  {
    "method": "get",
    "example": "response = HTTP::Client.get(\"https://api.example.com/comments?post_id=7\")"
  },
  {
    "method": "get",
    "example": "response = HTTP::Client.get(\"https://api.example.com/search?q=crystal+language\")"
  }
]

I generated this example by giving ChatGPT the link to HTTP::Client - Crystal 1.7.3 and asking it to provide examples of training and validation data for training itself.

This doesn’t match the example format provided by OpenAI in their docs, but their docs also say the CLI tool can convert common training data formats into what it needs. I want to play around with this more; I can’t wait to figure this out.
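In the meantime, converting the method/example list above into prompt/completion JSONL by hand is only a few lines. A sketch, where the sample records and the prompt wording are assumptions for illustration:

```python
import json

# Method/example records in the list format above (hypothetical sample data).
records = [
    {"method": "get",
     "example": 'response = HTTP::Client.get("https://api.example.com/products/42")'},
    {"method": "get",
     "example": 'response = HTTP::Client.get("https://api.example.com/search?q=crystal+language")'},
]

def to_prompt_completion(recs):
    """Convert method/example records into OpenAI prompt/completion JSONL."""
    out = []
    for rec in recs:
        out.append(json.dumps({
            # The prompt wording here is invented; any consistent phrasing works.
            "prompt": f"Show a Crystal HTTP::Client.{rec['method']} example",
            "completion": " " + rec["example"],
        }))
    return "\n".join(out)

print(to_prompt_completion(records))
```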