Crystal for Agents v1.20.0 release

Hello,

I’ve been maintaining this project to make it easier to use LLMs with Crystal.

The current version incorporates the changes reported in Crystal’s changelog for v1.20.0. From now on, I will follow Crystal’s versioning scheme so that releases of this document map directly to Crystal releases.

You can find the repo here. It’s mostly for LLMs, but it’s human-readable. Let me know what you think.

Repo: Rene Bon Ćirić (Renich) / Crystal for agents · GitLab

3 Likes

The repo seems to get pretty detailed about how Crystal works. Do you find that necessary? I’ve mainly used Claude and, while this repo seems to focus on Gemini, I generally don’t need to tell LLMs about Crystal itself. I do have to tell it about specific shards I use in my apps but, apart from a few cases, not the language.

1 Like

The concept behind this is pretty neat, but I’m inclined to agree with jgaskins on this. Why the focus on Gemini rather than just allowing Claude Code to reference the Crystal docs and the Crystal API docs as part of its built-in prompt? What problems is this solving that other projects or LLM features don’t cover or miss entirely? Gemini’s main strength is its broad coverage of topics, which makes it great for research and early project scoping. When paired with Sonnet you’ve got good coverage for information gathering.

Additionally, depending on the scope of a project you can prompt Claude to only reference source documents or links you provide it first before writing any code. This won’t catch everything in one swoop, but it does make review passes much smoother. I’m not against the idea of using different models for development, they each have their niche things they’re good at, but that’s something users writing software need to weigh and consider when choosing and planning out their projects.

Ultimately what you need to do is make sure the context is always fresh so you’re not constantly having to remind it about what was done and what’s ahead. A good change log and project management tracker like a kanban board helps immensely with this. Especially if your main prompt tells Claude to always reference those documents at the start of each new session you engage with.

I still do. LLMs in general tend to write Crystal as if it were Ruby, completely ignoring the Crystal features I want. Metaprogramming is a good example. Using hashes where structs are better is another. So is the tendency to ignore type safety entirely and delegate all of it to the compiler.

I know that last one is encouraged here, yet, I expect an LLM to be able to handle those kinds of details.
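The hash-versus-struct difference is easy to show. A minimal illustrative sketch (my example, not from the repo):

```crystal
# Ruby-ish pattern LLMs tend to produce: a stringly-typed Hash.
user_hash = {"name" => "Alice", "age" => 30}
# user_hash["age"] has type (String | Int32), forcing casts downstream:
age = user_hash["age"].as(Int32)

# Crystal idiom: a stack-allocated, explicitly typed struct.
struct User
  getter name : String
  getter age : Int32

  def initialize(@name : String, @age : Int32)
  end
end

user = User.new("Alice", 30)
puts user.age + 1 # no cast needed; the compiler knows this is Int32
```

The struct version costs nothing at runtime and moves the error from “wrong cast in production” to “won’t compile.”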

Detailing Crystal’s features like that is pretty useful. I feed crystal.rst to the LLM, code quality gets much better, and I stop seeing .not_nil! everywhere.
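For reference, this is the shape I want instead of `.not_nil!` (a small sketch of Crystal’s flow typing, not code from the repo):

```crystal
map = {"key" => 42}

# Ruby-style defensive habit (panics at runtime if the key is absent):
# value = map["key"]?.not_nil!

# Crystal idiom: let flow typing narrow the nilable value.
if value = map["key"]?
  puts value + 1 # value is Int32 inside this branch, not (Int32 | Nil)
else
  puts "missing"
end
```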

Well, honestly, I don’t use Claude. 4.6 made some terrible messes in my code more than a few times. It’s good for planning, but terrible at Crystal and non-web projects.

Feeding it the links to the official API docs as the primary reference helps a lot with that, whether that’s directly in the session prompt or loaded via a specialized skill file. Being explicit about not fabricating methods matters too. The .not_nil! habit is a common one: most models default to the defensive Ruby pattern rather than trusting Crystal’s type system. Bonus perk: when Crystal has a new release, you just update the link rather than redrafting everything inside the skill file, which is what you’d currently have to do with your project setup.

For security tooling specifically, type safety at the compiler level matters beyond just code style. In sysrift’s filesystem walker, inode-level dedup via typed {st_dev, st_ino} pairs is what silently catches bind-mount cycles. A .not_nil! shortcut anywhere in that path is a runtime panic waiting to happen on an unexpected filesystem layout. A separate bug during development, an integer comparison that was silently producing wrong results due to a type width mismatch, was caught at compile time precisely because the type system was enforced rather than bypassed. In both cases, letting Crystal catch it early meant the tool handles edge cases correctly rather than failing on a target system.
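To make the dedup idea concrete, here’s a hypothetical sketch of that pattern. This is not sysrift’s actual code; `device_and_inode` is a placeholder for however the walker obtains the real {st_dev, st_ino} values (e.g. via a stat(2) binding):

```crystal
require "set"

# A typed {st_dev, st_ino} pair; the compiler rules out nil or mixed types.
alias FileKey = Tuple(UInt64, UInt64)

# Placeholder: a real walker would obtain these values from stat(2).
def device_and_inode(path : String) : FileKey
  {0_u64, path.hash}
end

# Returns true if the path should be visited, false if its inode was
# already seen (e.g. a bind-mount cycle pointing back into the tree).
def visit?(visited : Set(FileKey), path : String) : Bool
  key = device_and_inode(path)
  return false if visited.includes?(key)
  visited << key
  true
end

visited = Set(FileKey).new
puts visit?(visited, "/mnt/a") # true on first encounter
puts visit?(visited, "/mnt/a") # false: already seen, skip
```

Because `FileKey` is a non-nilable tuple, there is simply no place for a `.not_nil!` to sneak into the hot path.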

The bigger issue any LLM faces, and this isn’t specific to any one model, is context drift. The larger a project gets, the more knowledge of what was built, why decisions were made, and what comes next degrades across sessions. Keeping that context reliably fresh is a workflow problem, not a model problem. A CHANGELOG, a Kanban board (or any project management tool), and a session-start prompt that always references both go a long way toward keeping any LLM grounded in the actual state of the project rather than what it assumes or invents.

You’re right, but often it will just read the first link or two and pretend it knows everything from that. Also, context gets pruned or compressed and the reference is gone.

The .not_nil! thing is because the model starts thinking it’s Ruby. I just got tired of telling it to check the docs and pasting references.

Try this: create some blog or something simple with an agent. Then, tell it to read crystal.rst and to re-evaluate. You’ll see the difference.

At least to me, it is useful.

Yes, I agree. Context isn’t infinite, even if you use subagents for everything; some form of compressed context is required. One can ask an agent to keep a journal meant for itself. In fact, there is some functionality along these lines in gemini-cli already, and I’m sure there are projects that help.

In regards to security, I like ameba a lot and, recently, I’ve found: Commits · kdairatchi/flaw · GitHub

It looks promising (even though it seems it started only last week).

1 Like

I’ll check out that project, looks pretty cool for static analysis.

You can solve this problem by also having it reference the locally installed docs on your system as a fallback, and any of your other Crystal projects as examples. The pruned-and-compressed-context issue can be mitigated by keeping your coding sessions short and having changes and design decisions documented at the end. Then, when you start your next conversation, it’ll have fresh context covering your project changes and what you did in previous sessions.

This wouldn’t be all that different from prompting it fresh without a skill or customized prompt. Also, I wouldn’t blindly tell it to code a blog for me without being specific about what I want, how I want it, and my constraints, and more importantly, without instructing it to not just generate a complete project for me. Instruct it to stop and ask about design decisions before a single line of code gets written or generated, and to pause when the proposed code isn’t what you want and needs slight adjustments.

You raise great points. Yet, the exercise was just to show the differences in code quality.

Reading a condensed (agent-optimized) version beats reading the whole manual, IMHO. The manual isn’t written for agents; it’s written for humans. That’s the whole point of the project: to provide condensed yet comprehensive documentation to an agent.

1 Like

The thing that’s annoying me most about all the AI content that’s being spewed out is that it’s so much voodoo with so many having an opinion on which way and how fast to swing the dead chicken. And no hard data due to the non-deterministic nature of the things.

There was this notion that AGENTS.md files and instructions actually decreased LLM performance. I saw someone claiming that the most important thing for getting good output was to have a good codebase to begin with (which doesn’t help when starting fresh, obviously).

I’ve got colleagues flocking to Claude at the moment because it’s DA SIZZLE, and I’m stuck somewhere between FOMO and the suspicion that I’ll just be disappointed again. I ain’t got time for trying out every new thing under the sun.

So.. Could we create a Crystal focused benchmark?

2 Likes

I’m totally for the benchmark. I’ve got a few ideas but, in your opinion, what should the benchmark look like?

Well, I donno, in the end, it’s all pretty subjective, innit? Throw a code base and an instruction at different (or even the same) LLM, and you’ll get different solutions that different developers would grade differently.

One thing I would like is transparency. I got enough people throwing around tables showing that now Claude scores 84.5% on some industry standard benchmark. I’d rather like to see examples of what the LLMs produced, given the same job.

I’ve consulted with an LLM, and together we designed the experiment. Hopefully, it will work.


Evaluating LLM output is notoriously subjective. To prove empirically whether this document actually works, we’ve built an automated A/B testing harness. This post outlines the methodology so you can replicate the experiment against your preferred models (Claude 4.7, Sonnet, GPT-5.4, etc.) and share the delta.

The Architecture

To guarantee a mathematically pure baseline, the execution harness must run the LLM CLI in a sterile, ephemeral POSIX environment. If you run tests using an agent framework (like OpenCode) or a CLI with access to your local MCP servers and dotfiles, your baseline is contaminated.

The test consists of two deterministic runs at temperature 0.0:

  1. Run A (The Naked Baseline): The model receives the prompt with an empty system prompt. It relies entirely on its pre-trained weights.

  2. Run B (The Override): The model is injected with the crystal.rst file as its system context.

The “Trap” Prompt

You cannot evaluate LLMs with open-ended prompts like “write a web server.” You must use a highly constrained trap that forces the model into scenarios where raw pre-training predictably fails.

We prompt the model to build a concurrent CLI app that reads binary data, parses union-typed JSON, maintains thread-safe state, and implements a custom class constructor.

The Automated Impartial Grader

Evaluation is handled entirely by a bash script. It greps the generated code against a binary rubric to detect the architectural traps, and uses the Crystal compiler to verify syntactic integrity.

The Rubric (5 Points Total):

  1. Memory Safety: Passes if it safely allocates using Slice or Bytes. Fails if it uses unsafe Pointer arithmetic.

  2. Concurrency Primitives: Passes if it uses Sync::Mutex or Sync::Exclusive. Fails if it uses the legacy Mutex.

  3. Data Parsing: Passes if it defines a struct leveraging JSON::Serializable. Fails if it relies on JSON.parse and manual .as_s casting.

  4. Constructor Type Inference: Passes if instance variables are explicitly typed, preventing the #1 cause of nil errors.

  5. Compilation: Passes if crystal build main.cr -Dpreview_mt --no-codegen exits with status 0.
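For rubric points 3 and 4, a passing answer would contain shapes like the following. This is an illustrative sketch of the expected idioms, not actual graded output:

```crystal
require "json"

# Rubric 3: JSON::Serializable with a union-typed field.
struct ServerEvent
  include JSON::Serializable

  getter event_id : String | Int32
  getter timestamp : String
  getter error_code : Int32?
end

# Rubric 4: the instance variable is explicitly typed, so the compiler
# can never silently infer it as nilable.
class Processor
  @prefix : String

  def initialize(@prefix : String = "LOG:")
  end

  def report(event : ServerEvent)
    puts "#{@prefix} #{event.event_id}"
  end
end

event = ServerEvent.from_json %({"event_id": 7, "timestamp": "2026-01-01T00:00:00Z"})
Processor.new.report(event) # prints "LOG: 7"
```

The grader’s greps for `String | Int32`, `include JSON::Serializable`, and `@prefix : String` all match this shape.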

The Execution Harness

Save this as llm_benchmark.bash and run it, passing your provider (gemini, claude, or openai) as the first argument. The script calls each provider’s HTTP API directly with curl, so the only tools you need installed are curl, jq, and the Crystal compiler.

#!/usr/bin/bash

# ==============================================================================
# LLM Crystal Knowledge Benchmark
# ==============================================================================
# This script impartially tests how well an LLM writes Crystal code (v1.20+).
# Run A (Baseline): Asks the LLM to write a concurrent script using its default weights.
# Run B (Override): Injects a strict Crystal architecture manual into the system prompt.
#
# We use raw `curl` and `jq` here instead of CLI wrappers to ensure absolute
# isolation. This prevents your local dotfiles, aliases, or MCP servers from
# contaminating the baseline, and it bypasses the shell's ARG_MAX limits!
# ==============================================================================

provider=$1

# Make sure the user selects a valid model provider
if [[ "$provider" != "gemini" && "$provider" != "claude" && "$provider" != "openai" ]]; then
    echo "👋 Welcome! Please run the script with your preferred provider:"
    echo "   Usage: $0 [gemini|claude|openai]"
    exit 1
fi

# ==============================================================================
# 1. ESTABLISH THE STERILE CONTAINMENT ZONE
# ==============================================================================
# We create a temporary directory to hold our payloads and results.
# This guarantees we aren't relying on any local files or state.
sterile_dir=$(mktemp -d)
echo "📁 Created a sterile testing directory at: $sterile_dir"

document_url="https://gitlab.com/renich/crystal-for-agents/-/raw/master/crystal.rst"
echo "🌐 Fetching the strict Crystal reference document..."
curl -sL "$document_url" -o "$sterile_dir/crystal.rst"

if [[ ! -s "$sterile_dir/crystal.rst" ]]; then
    echo "❌ FATAL: Failed to fetch the reference document. Check your network."
    rm -rf "$sterile_dir"
    exit 1
fi

# The "Trap Prompt" - Engineered specifically to test concurrency and type-safety.
prompt="Write a Crystal CLI application that concurrently processes binary log data. Requirements: 1. Use the experimental multithreading flag capabilities to spawn 5 concurrent workers. 2. Each worker must simulate reading 512 raw bytes directly into memory and then converting it to a string. 3. The string is a JSON payload representing a server event. The JSON has an 'event_id' (which can be a string or an integer), a 'timestamp', and an optional 'error_code' (integer). 4. Parse this JSON into a custom data structure. 5. Maintain a thread-safe global counter of all events processed. 6. Create a processor class that takes an optional custom prefix string in its constructor to prepend to console output, defaulting to 'LOG:'. Output the final code in a single file named main.cr. Prioritize memory safety, nil-safety, and Crystal v1.20 idioms."

# ==============================================================================
# 2. EXECUTE RUNS BASED ON PROVIDER
# ==============================================================================
echo "🚀 Target locked: $provider"

case "$provider" in
    gemini)
        # Ensure the user has exported their API key
        if [[ -z "${GEMINI_API_KEY}" ]]; then echo "❌ Please export GEMINI_API_KEY first."; exit 1; fi
        api_url="https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-pro-preview:generateContent?key=${GEMINI_API_KEY}"

        echo "   -> Executing Run A (Naked Baseline)..."
        jq -n --arg prompt "$prompt" '{"contents": [{"role": "user", "parts": [{"text": $prompt}]}], "generationConfig": {"temperature": 0.0}}' > "$sterile_dir/payload_a.json"
        curl -s -H "Content-Type: application/json" -X POST -d @"$sterile_dir/payload_a.json" "$api_url" | jq -r '.candidates[0].content.parts[0].text' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_a.cr"

        echo "   -> Executing Run B (System Override)..."
        jq -n --arg prompt "$prompt" --rawfile sys "$sterile_dir/crystal.rst" '{"systemInstruction": {"parts": [{"text": $sys}]}, "contents": [{"role": "user", "parts": [{"text": $prompt}]}], "generationConfig": {"temperature": 0.0}}' > "$sterile_dir/payload_b.json"
        curl -s -H "Content-Type: application/json" -X POST -d @"$sterile_dir/payload_b.json" "$api_url" | jq -r '.candidates[0].content.parts[0].text' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_b.cr"
        ;;

    claude)
        if [[ -z "${ANTHROPIC_API_KEY}" ]]; then echo "❌ Please export ANTHROPIC_API_KEY first."; exit 1; fi
        api_url="https://api.anthropic.com/v1/messages"

        echo "   -> Executing Run A (Naked Baseline)..."
        jq -n --arg prompt "$prompt" '{"model": "claude-3-5-sonnet-latest", "max_tokens": 8192, "temperature": 0.0, "messages": [{"role": "user", "content": $prompt}]}' > "$sterile_dir/payload_a.json"
        curl -s -H "x-api-key: ${ANTHROPIC_API_KEY}" -H "anthropic-version: 2023-06-01" -H "content-type: application/json" -X POST -d @"$sterile_dir/payload_a.json" "$api_url" | jq -r '.content[0].text' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_a.cr"

        echo "   -> Executing Run B (System Override)..."
        jq -n --arg prompt "$prompt" --rawfile sys "$sterile_dir/crystal.rst" '{"model": "claude-3-5-sonnet-latest", "max_tokens": 8192, "temperature": 0.0, "system": $sys, "messages": [{"role": "user", "content": $prompt}]}' > "$sterile_dir/payload_b.json"
        curl -s -H "x-api-key: ${ANTHROPIC_API_KEY}" -H "anthropic-version: 2023-06-01" -H "content-type: application/json" -X POST -d @"$sterile_dir/payload_b.json" "$api_url" | jq -r '.content[0].text' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_b.cr"
        ;;

    openai)
        if [[ -z "${OPENAI_API_KEY}" ]]; then echo "❌ Please export OPENAI_API_KEY first."; exit 1; fi
        api_url="https://api.openai.com/v1/chat/completions"

        echo "   -> Executing Run A (Naked Baseline)..."
        jq -n --arg prompt "$prompt" '{"model": "gpt-4o", "temperature": 0.0, "messages": [{"role": "user", "content": $prompt}]}' > "$sterile_dir/payload_a.json"
        curl -s -H "Authorization: Bearer ${OPENAI_API_KEY}" -H "Content-Type: application/json" -X POST -d @"$sterile_dir/payload_a.json" "$api_url" | jq -r '.choices[0].message.content' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_a.cr"

        echo "   -> Executing Run B (System Override)..."
        jq -n --arg prompt "$prompt" --rawfile sys "$sterile_dir/crystal.rst" '{"model": "gpt-4o", "temperature": 0.0, "messages": [{"role": "system", "content": $sys}, {"role": "user", "content": $prompt}]}' > "$sterile_dir/payload_b.json"
        curl -s -H "Authorization: Bearer ${OPENAI_API_KEY}" -H "Content-Type: application/json" -X POST -d @"$sterile_dir/payload_b.json" "$api_url" | jq -r '.choices[0].message.content' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_b.cr"
        ;;
esac

# ==============================================================================
# 3. THE AUTOMATED GRADER
# ==============================================================================
# This function greps the output for specific idioms and runs the compiler
# to ensure we get an impartial, binary score for the code.
evaluate_code() {
    local file=$1
    local score=0
    echo -e "\n📊 Evaluating $(basename "$file")..."

    if [[ ! -s "$file" ]]; then
        echo "   [FATAL] File is empty. The LLM failed to output parseable code."
        return
    fi

    if grep -q "Slice(UInt8)\|Bytes" "$file" && ! grep -q "Pointer(" "$file"; then
        echo "   ✅ [PASS] Memory Safety: Safely uses Slice/Bytes." && ((score++))
    else echo "   ❌ [FAIL] Memory Safety: Uses unsafe Pointer or misses Slice allocation."; fi

    if grep -q "Sync::Mutex\|Sync::Exclusive" "$file"; then
        echo "   ✅ [PASS] Concurrency: Uses the modern Sync module." && ((score++))
    else echo "   ❌ [FAIL] Concurrency: Defaults to legacy Mutex or misses synchronization."; fi

    if grep -q "include JSON::Serializable" "$file" && grep -q "String | Int32" "$file"; then
        echo "   ✅ [PASS] Data Parsing: Correctly implements JSON::Serializable & Union Types." && ((score++))
    else echo "   ❌ [FAIL] Data Parsing: Relies on JSON.parse or misses Union Type definitions."; fi

    if grep -q "@prefix : String" "$file"; then
        echo "   ✅ [PASS] Constructor: Explicitly types instance variables to prevent Nil errors." && ((score++))
    else echo "   ❌ [FAIL] Constructor: Misses explicit type annotations."; fi

    # Check if it actually compiles
    if crystal build "$file" -Dpreview_mt --no-codegen &> /dev/null; then
        echo "   ✅ [PASS] Compilation: Syntax and types are perfectly valid." && ((score++))
    else echo "   ❌ [FAIL] Compilation: Syntax or type errors detected."; fi

    echo "   ---------------------------------------"
    echo "   🏆 FINAL SCORE: $score/5"
    echo "   ---------------------------------------"
}

# ==============================================================================
# 4. EVALUATE, EXTRACT, AND TEARDOWN
# ==============================================================================
evaluate_code "$sterile_dir/run_a.cr"
evaluate_code "$sterile_dir/run_b.cr"

echo -e "\n📦 Extracting generated code for auditing..."
cp "$sterile_dir/run_a.cr" "./${provider}_run_a.cr"
cp "$sterile_dir/run_b.cr" "./${provider}_run_b.cr"

# Clean up our temporary directory
rm -rf "$sterile_dir"
echo "🧹 Sterile container destroyed. Artifacts saved locally. Have a great day!"

I challenge you to run the script against Claude or OpenAI and post the delta between your Run A and Run B. Let’s see which foundational model actually adheres to strict architectural directives.

Note: some agent CLI clients, notably the Claude CLI, have heavy system-like prompts built in. Gemini-CLI is no exception.

My results:

renich@desktop scripts$ ./llm_benchmark.bash gemini
📁 Created a sterile testing directory at: /tmp/tmp.8lgUDkLGnW
🌐 Fetching the strict Crystal reference document...
🚀 Target locked: gemini
   -> Executing Run A (Naked Baseline)...
   -> Executing Run B (System Override)...

📊 Evaluating run_a.cr...
   ✅ [PASS] Memory Safety: Safely uses Slice/Bytes.
   ❌ [FAIL] Concurrency: Defaults to legacy Mutex or misses synchronization.
   ❌ [FAIL] Data Parsing: Relies on JSON.parse or misses Union Type definitions.
   ✅ [PASS] Constructor: Explicitly types instance variables to prevent Nil errors.
   ✅ [PASS] Compilation: Syntax and types are perfectly valid.
   ---------------------------------------
   🏆 FINAL SCORE: 3/5
   ---------------------------------------

📊 Evaluating run_b.cr...
   ✅ [PASS] Memory Safety: Safely uses Slice/Bytes.
   ✅ [PASS] Concurrency: Uses the modern Sync module.
   ✅ [PASS] Data Parsing: Correctly implements JSON::Serializable & Union Types.
   ✅ [PASS] Constructor: Explicitly types instance variables to prevent Nil errors.
   ✅ [PASS] Compilation: Syntax and types are perfectly valid.
   ---------------------------------------
   🏆 FINAL SCORE: 5/5
   ---------------------------------------

📦 Extracting generated code for auditing...
🧹 Sterile container destroyed. Artifacts saved locally. Have a great day!

Note: sometimes, I get 4/5 in run_a.cr.

p.s. I could only test the Gemini path. I don’t have an OpenAI or a Claude account, so I dunno if the script works for those. That said, it should be easily adaptable.

Another note I forgot to mention: the Gemini 3 models are optimized to run at temperature 1.0.

Oh, and I’ll be publishing the scripts in the crystal-for-agents repo; in scripts/.