I consulted with an LLM to develop and design the experiment. Hopefully, it will work.
Evaluating LLM output is notoriously subjective. To test empirically whether this document actually works, we built an automated A/B testing harness. This post outlines the methodology so you can replicate the experiment against your preferred models (Gemini, Claude, GPT-4o, etc.) and share the delta.
The Architecture
To guarantee a clean baseline, the execution harness must issue the LLM requests from a sterile, ephemeral POSIX environment. If you run tests through an agent framework (like OpenCode) or a CLI with access to your local MCP servers and dotfiles, your baseline is contaminated.
The test consists of two deterministic runs at temperature 0.0:
- Run A (The Naked Baseline): The model receives the prompt with an empty system prompt. It relies entirely on its pre-trained weights.
- Run B (The Override): The model is injected with the crystal.rst file as its system context.
The “Trap” Prompt
You cannot evaluate LLMs with open-ended prompts like “write a web server.” You must use a highly constrained trap that forces the model into scenarios where raw pre-training predictably fails.
We prompt the model to build a concurrent CLI app that reads binary data, parses union-typed JSON, maintains thread-safe state, and implements a custom class constructor.
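For concreteness, here is a pair of hypothetical payloads matching that schema. The field names come from the prompt; note that event_id changes type between events, which is exactly what breaks naive manual casting:

```shell
#!/usr/bin/env bash
# Two hypothetical event payloads matching the trap prompt's schema.
# event_id may be a string OR an integer; error_code is optional.
event1='{"event_id": "evt-42", "timestamp": 1700000000, "error_code": 500}'
event2='{"event_id": 42, "timestamp": 1700000001}'

# jq confirms the union: the same field carries two different JSON types.
echo "$event1" | jq -r '.event_id | type'   # prints: string
echo "$event2" | jq -r '.event_id | type'   # prints: number
```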
The Automated Impartial Grader
Evaluation is handled entirely by a bash script. It greps the generated code for specific idioms to produce a binary pass/fail matrix of architectural traps, and uses the Crystal compiler to verify syntactic integrity.
The Rubric (5 Points Total):
- Memory Safety: Passes if it safely allocates using Slice or Bytes. Fails if it uses unsafe Pointer arithmetic.
- Concurrency Primitives: Passes if it uses Sync::Mutex or Sync::Exclusive. Fails if it uses the legacy Mutex.
- Data Parsing: Passes if it defines a struct leveraging JSON::Serializable. Fails if it relies on JSON.parse and manual .as_s casting.
- Constructor Type Inference: Passes if instance variables are explicitly typed, preventing the #1 cause of nil errors.
- Compilation: Passes if crystal build main.cr -Dpreview_mt --no-codegen exits with status 0.
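To make the rubric concrete, here is a sketch of how the first four checks reduce to grep patterns. The Crystal snippet below is hand-written to pass (it is not model output), and the patterns mirror the grader in the full script:

```shell
#!/usr/bin/env bash
# Hand-written Crystal snippet engineered to satisfy the first four rubric checks.
sample=$(mktemp)
cat > "$sample" <<'EOF'
struct Payload
  include JSON::Serializable
  getter event_id : String | Int32
end

buf = Bytes.new(512)          # safe allocation, no raw pointer arithmetic
lock = Sync::Mutex.new        # modern Sync module, not legacy Mutex

class Processor
  @prefix : String            # explicit ivar type
  def initialize(@prefix = "LOG:"); end
end
EOF

score=0
if grep -q "Slice(UInt8)\|Bytes" "$sample" && ! grep -q "Pointer(" "$sample"; then score=$((score+1)); fi
if grep -q "Sync::Mutex\|Sync::Exclusive" "$sample"; then score=$((score+1)); fi
if grep -q "include JSON::Serializable" "$sample" && grep -q "String | Int32" "$sample"; then score=$((score+1)); fi
if grep -q "@prefix : String" "$sample"; then score=$((score+1)); fi
echo "grep score: $score/4"   # prints: grep score: 4/4
```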
The Execution Harness
Save this as llm_benchmark.bash and run it. The script talks to the provider APIs directly with curl, so you only need to export the relevant API key and pass gemini, claude, or openai as the first argument.
#!/usr/bin/env bash
# ==============================================================================
# LLM Crystal Knowledge Benchmark
# ==============================================================================
# This script impartially tests how well an LLM writes Crystal code (v1.20+).
# Run A (Baseline): Asks the LLM to write a concurrent script using its default weights.
# Run B (Override): Injects a strict Crystal architecture manual into the system prompt.
#
# We use raw `curl` and `jq` here instead of CLI wrappers to ensure absolute
# isolation. This prevents your local dotfiles, aliases, or MCP servers from
# contaminating the baseline, and posting the payload from a file (-d @file)
# sidesteps the shell's ARG_MAX limit when injecting the large system prompt.
# ==============================================================================
provider=$1
# Make sure the user selects a valid model provider
if [[ "$provider" != "gemini" && "$provider" != "claude" && "$provider" != "openai" ]]; then
    echo "👋 Welcome! Please run the script with your preferred provider:"
    echo " Usage: $0 [gemini|claude|openai]"
    exit 1
fi
# ==============================================================================
# 1. ESTABLISH THE STERILE CONTAINMENT ZONE
# ==============================================================================
# We create a temporary directory to hold our payloads and results.
# This guarantees we aren't relying on any local files or state.
sterile_dir=$(mktemp -d)
echo "📁 Created a sterile testing directory at: $sterile_dir"
document_url="https://gitlab.com/renich/crystal-for-agents/-/raw/master/crystal.rst"
echo "🌐 Fetching the strict Crystal reference document..."
curl -sL "$document_url" -o "$sterile_dir/crystal.rst"
if [[ ! -s "$sterile_dir/crystal.rst" ]]; then
    echo "❌ FATAL: Failed to fetch the reference document. Check your network."
    rm -rf "$sterile_dir"
    exit 1
fi
# The "Trap Prompt" - Engineered specifically to test concurrency and type-safety.
prompt="Write a Crystal CLI application that concurrently processes binary log data. Requirements: 1. Use the experimental multithreading flag capabilities to spawn 5 concurrent workers. 2. Each worker must simulate reading 512 raw bytes directly into memory and then converting it to a string. 3. The string is a JSON payload representing a server event. The JSON has an 'event_id' (which can be a string or an integer), a 'timestamp', and an optional 'error_code' (integer). 4. Parse this JSON into a custom data structure. 5. Maintain a thread-safe global counter of all events processed. 6. Create a processor class that takes an optional custom prefix string in its constructor to prepend to console output, defaulting to 'LOG:'. Output the final code in a single file named main.cr. Prioritize memory safety, nil-safety, and Crystal v1.20 idioms."
# ==============================================================================
# 2. EXECUTE RUNS BASED ON PROVIDER
# ==============================================================================
echo "🚀 Target locked: $provider"
case "$provider" in
    gemini)
        # Ensure the user has exported their API key
        if [[ -z "${GEMINI_API_KEY}" ]]; then echo "❌ Please export GEMINI_API_KEY first."; exit 1; fi
        api_url="https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-pro-preview:generateContent?key=${GEMINI_API_KEY}"
        echo " -> Executing Run A (Naked Baseline)..."
        jq -n --arg prompt "$prompt" '{"contents": [{"role": "user", "parts": [{"text": $prompt}]}], "generationConfig": {"temperature": 0.0}}' > "$sterile_dir/payload_a.json"
        curl -s -H "Content-Type: application/json" -X POST -d @"$sterile_dir/payload_a.json" "$api_url" | jq -r '.candidates[0].content.parts[0].text' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_a.cr"
        echo " -> Executing Run B (System Override)..."
        jq -n --arg prompt "$prompt" --rawfile sys "$sterile_dir/crystal.rst" '{"systemInstruction": {"parts": [{"text": $sys}]}, "contents": [{"role": "user", "parts": [{"text": $prompt}]}], "generationConfig": {"temperature": 0.0}}' > "$sterile_dir/payload_b.json"
        curl -s -H "Content-Type: application/json" -X POST -d @"$sterile_dir/payload_b.json" "$api_url" | jq -r '.candidates[0].content.parts[0].text' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_b.cr"
        ;;
    claude)
        if [[ -z "${ANTHROPIC_API_KEY}" ]]; then echo "❌ Please export ANTHROPIC_API_KEY first."; exit 1; fi
        api_url="https://api.anthropic.com/v1/messages"
        echo " -> Executing Run A (Naked Baseline)..."
        jq -n --arg prompt "$prompt" '{"model": "claude-3-5-sonnet-latest", "max_tokens": 8192, "temperature": 0.0, "messages": [{"role": "user", "content": $prompt}]}' > "$sterile_dir/payload_a.json"
        curl -s -H "x-api-key: ${ANTHROPIC_API_KEY}" -H "anthropic-version: 2023-06-01" -H "content-type: application/json" -X POST -d @"$sterile_dir/payload_a.json" "$api_url" | jq -r '.content[0].text' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_a.cr"
        echo " -> Executing Run B (System Override)..."
        jq -n --arg prompt "$prompt" --rawfile sys "$sterile_dir/crystal.rst" '{"model": "claude-3-5-sonnet-latest", "max_tokens": 8192, "temperature": 0.0, "system": $sys, "messages": [{"role": "user", "content": $prompt}]}' > "$sterile_dir/payload_b.json"
        curl -s -H "x-api-key: ${ANTHROPIC_API_KEY}" -H "anthropic-version: 2023-06-01" -H "content-type: application/json" -X POST -d @"$sterile_dir/payload_b.json" "$api_url" | jq -r '.content[0].text' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_b.cr"
        ;;
    openai)
        if [[ -z "${OPENAI_API_KEY}" ]]; then echo "❌ Please export OPENAI_API_KEY first."; exit 1; fi
        api_url="https://api.openai.com/v1/chat/completions"
        echo " -> Executing Run A (Naked Baseline)..."
        jq -n --arg prompt "$prompt" '{"model": "gpt-4o", "temperature": 0.0, "messages": [{"role": "user", "content": $prompt}]}' > "$sterile_dir/payload_a.json"
        curl -s -H "Authorization: Bearer ${OPENAI_API_KEY}" -H "Content-Type: application/json" -X POST -d @"$sterile_dir/payload_a.json" "$api_url" | jq -r '.choices[0].message.content' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_a.cr"
        echo " -> Executing Run B (System Override)..."
        jq -n --arg prompt "$prompt" --rawfile sys "$sterile_dir/crystal.rst" '{"model": "gpt-4o", "temperature": 0.0, "messages": [{"role": "system", "content": $sys}, {"role": "user", "content": $prompt}]}' > "$sterile_dir/payload_b.json"
        curl -s -H "Authorization: Bearer ${OPENAI_API_KEY}" -H "Content-Type: application/json" -X POST -d @"$sterile_dir/payload_b.json" "$api_url" | jq -r '.choices[0].message.content' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_b.cr"
        ;;
esac
# ==============================================================================
# 3. THE AUTOMATED GRADER
# ==============================================================================
# This function greps the output for specific idioms and runs the compiler
# to ensure we get an impartial, binary score for the code.
evaluate_code() {
    local file="$1"
    local score=0
    echo -e "\n📊 Evaluating $(basename "$file")..."
    if [[ ! -s "$file" ]]; then
        echo " [FATAL] File is empty. The LLM failed to output parseable code."
        return
    fi
    if grep -q "Slice(UInt8)\|Bytes" "$file" && ! grep -q "Pointer(" "$file"; then
        echo " ✅ [PASS] Memory Safety: Safely uses Slice/Bytes." && ((score++))
    else
        echo " ❌ [FAIL] Memory Safety: Uses unsafe Pointer or misses Slice allocation."
    fi
    if grep -q "Sync::Mutex\|Sync::Exclusive" "$file"; then
        echo " ✅ [PASS] Concurrency: Uses the modern Sync module." && ((score++))
    else
        echo " ❌ [FAIL] Concurrency: Defaults to legacy Mutex or misses synchronization."
    fi
    # Note: the union check assumes the model writes the type with spaces ("String | Int32").
    if grep -q "include JSON::Serializable" "$file" && grep -q "String | Int32" "$file"; then
        echo " ✅ [PASS] Data Parsing: Correctly implements JSON::Serializable & Union Types." && ((score++))
    else
        echo " ❌ [FAIL] Data Parsing: Relies on JSON.parse or misses Union Type definitions."
    fi
    # Heuristic: assumes the model names the instance variable @prefix, as the prompt implies.
    if grep -q "@prefix : String" "$file"; then
        echo " ✅ [PASS] Constructor: Explicitly types instance variables to prevent Nil errors." && ((score++))
    else
        echo " ❌ [FAIL] Constructor: Misses explicit type annotations."
    fi
    # Check if it actually compiles
    if crystal build "$file" -Dpreview_mt --no-codegen &> /dev/null; then
        echo " ✅ [PASS] Compilation: Syntax and types are perfectly valid." && ((score++))
    else
        echo " ❌ [FAIL] Compilation: Syntax or type errors detected."
    fi
    echo " ---------------------------------------"
    echo " 🏆 FINAL SCORE: $score/5"
    echo " ---------------------------------------"
}
# ==============================================================================
# 4. EVALUATE, EXTRACT, AND TEARDOWN
# ==============================================================================
evaluate_code "$sterile_dir/run_a.cr"
evaluate_code "$sterile_dir/run_b.cr"
echo -e "\n📦 Extracting generated code for auditing..."
cp "$sterile_dir/run_a.cr" "./${provider}_run_a.cr"
cp "$sterile_dir/run_b.cr" "./${provider}_run_b.cr"
# Clean up our temporary directory
rm -rf "$sterile_dir"
echo "🧹 Sterile container destroyed. Artifacts saved locally. Have a great day!"
I challenge you to run the script against Gemini, Claude, and OpenAI, and post the delta between your Run A and Run B scores. Let’s see which foundational model actually adheres to strict architectural directives.
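To compute that delta mechanically, here is a small post-processing sketch over a saved benchmark log. The two FINAL SCORE lines below are illustrative, not real results; in practice you would capture the script's stdout to the log file:

```shell
#!/usr/bin/env bash
# Extract the two "FINAL SCORE: X/5" lines from a saved benchmark log
# and print Run B minus Run A. The sample log below is illustrative.
log=$(mktemp)
cat > "$log" <<'EOF'
 🏆 FINAL SCORE: 1/5
 🏆 FINAL SCORE: 4/5
EOF

# First match is Run A, second is Run B (the script evaluates them in order).
mapfile -t scores < <(grep -o 'FINAL SCORE: [0-9]*' "$log" | grep -o '[0-9]*$')
echo "Delta (Run B - Run A): $(( scores[1] - scores[0] ))"   # prints: Delta (Run B - Run A): 3
```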
Note: Some agent CLI clients, notably the Claude CLI, have heavy system-like prompts built in, and Gemini CLI is no exception. This is another reason the harness calls the raw APIs instead.