I consulted with an LLM to develop and design the experiment. Hopefully, it will work.
Evaluating LLM output is notoriously subjective. To test empirically whether this document actually works, we built an automated A/B testing harness. This post outlines the methodology so you can replicate the experiment against your preferred models (Gemini, Claude, GPT-4o, etc.) and share the delta.
The Architecture
To guarantee a clean baseline, the execution harness must issue the LLM requests from a sterile, ephemeral POSIX environment. If you run tests through an agent framework (like OpenCode) or a CLI with access to your local MCP servers and dotfiles, your baseline is contaminated.
The test consists of two deterministic runs at temperature 0.0:
- Run A (The Naked Baseline): The model receives the prompt with an empty system prompt. It relies entirely on its pre-trained weights.
- Run B (The Override): The model is injected with the crystal.rst file as its system context.
The “Trap” Prompt
You cannot evaluate LLMs with open-ended prompts like “write a web server.” You must use a highly constrained trap that forces the model into scenarios where raw pre-training predictably fails.
We prompt the model to build a concurrent CLI app that reads binary data, parses union-typed JSON, maintains thread-safe state, and implements a custom class constructor.
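For concreteness, here is a pair of hypothetical payloads matching that schema. The field names come from the prompt; note that event_id changes type between events, which is exactly what breaks naive manual casting:

```shell
#!/usr/bin/env bash
# Two hypothetical event payloads matching the trap prompt's schema.
# event_id may be a string OR an integer; error_code is optional.
event1='{"event_id": "evt-42", "timestamp": 1700000000, "error_code": 500}'
event2='{"event_id": 42, "timestamp": 1700000001}'

# jq confirms the union: the same field carries two different JSON types.
echo "$event1" | jq -r '.event_id | type'   # prints: string
echo "$event2" | jq -r '.event_id | type'   # prints: number
```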
The Automated Impartial Grader
Evaluation is handled entirely by a bash script. It greps the generated code for specific idioms to produce a binary pass/fail matrix of architectural traps, and uses the Crystal compiler to verify syntactic integrity.
The Rubric (5 Points Total):
- Memory Safety: Passes if it safely allocates using Slice or Bytes. Fails if it uses unsafe Pointer arithmetic.
- Concurrency Primitives: Passes if it uses Sync::Mutex or Sync::Exclusive. Fails if it uses the legacy Mutex.
- Data Parsing: Passes if it defines a struct leveraging JSON::Serializable. Fails if it relies on JSON.parse and manual .as_s casting.
- Constructor Type Inference: Passes if instance variables are explicitly typed, preventing the #1 cause of nil errors.
- Compilation: Passes if crystal build main.cr -Dpreview_mt --no-codegen exits with status 0.
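To make the rubric concrete, here is a sketch of how the first four checks reduce to grep patterns. The Crystal snippet below is hand-written to pass (it is not model output), and the patterns mirror the grader in the full script:

```shell
#!/usr/bin/env bash
# Hand-written Crystal snippet engineered to satisfy the first four rubric checks.
sample=$(mktemp)
cat > "$sample" <<'EOF'
struct Payload
  include JSON::Serializable
  getter event_id : String | Int32
end

buf = Bytes.new(512)          # safe allocation, no raw pointer arithmetic
lock = Sync::Mutex.new        # modern Sync module, not legacy Mutex

class Processor
  @prefix : String            # explicit ivar type
  def initialize(@prefix = "LOG:"); end
end
EOF

score=0
if grep -q "Slice(UInt8)\|Bytes" "$sample" && ! grep -q "Pointer(" "$sample"; then score=$((score+1)); fi
if grep -q "Sync::Mutex\|Sync::Exclusive" "$sample"; then score=$((score+1)); fi
if grep -q "include JSON::Serializable" "$sample" && grep -q "String | Int32" "$sample"; then score=$((score+1)); fi
if grep -q "@prefix : String" "$sample"; then score=$((score+1)); fi
echo "grep score: $score/4"   # prints: grep score: 4/4
```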
The Execution Harness
Save this as llm_benchmark.bash and run it. The script talks to the provider APIs directly with curl, so you only need to export the relevant API key and pass gemini, claude, or openai as the first argument.
#!/usr/bin/env bash
# ==============================================================================
# LLM Crystal Knowledge Benchmark
# ==============================================================================
# This script impartially tests how well an LLM writes Crystal code (v1.20+).
# Run A (Baseline): Asks the LLM to write a concurrent script using its default weights.
# Run B (Override): Injects a strict Crystal architecture manual into the system prompt.
#
# We use raw `curl` and `jq` here instead of CLI wrappers to ensure absolute
# isolation. This prevents your local dotfiles, aliases, or MCP servers from
# contaminating the baseline, and posting the payload from a file (-d @file)
# sidesteps the shell's ARG_MAX limit when injecting the large system prompt.
# ==============================================================================
provider=$1
# Make sure the user selects a valid model provider
if [[ "$provider" != "gemini" && "$provider" != "claude" && "$provider" != "openai" ]]; then
    echo "👋 Welcome! Please run the script with your preferred provider:"
    echo " Usage: $0 [gemini|claude|openai]"
    exit 1
fi
# ==============================================================================
# 1. ESTABLISH THE STERILE CONTAINMENT ZONE
# ==============================================================================
# We create a temporary directory to hold our payloads and results.
# This guarantees we aren't relying on any local files or state.
sterile_dir=$(mktemp -d)
echo "📁 Created a sterile testing directory at: $sterile_dir"
document_url="https://gitlab.com/renich/crystal-for-agents/-/raw/master/crystal.rst"
echo "🌐 Fetching the strict Crystal reference document..."
curl -sL "$document_url" -o "$sterile_dir/crystal.rst"
if [[ ! -s "$sterile_dir/crystal.rst" ]]; then
    echo "❌ FATAL: Failed to fetch the reference document. Check your network."
    rm -rf "$sterile_dir"
    exit 1
fi
# The "Trap Prompt" - Engineered specifically to test concurrency and type-safety.
prompt="Write a Crystal CLI application that concurrently processes binary log data. Requirements: 1. Use the experimental multithreading flag capabilities to spawn 5 concurrent workers. 2. Each worker must simulate reading 512 raw bytes directly into memory and then converting it to a string. 3. The string is a JSON payload representing a server event. The JSON has an 'event_id' (which can be a string or an integer), a 'timestamp', and an optional 'error_code' (integer). 4. Parse this JSON into a custom data structure. 5. Maintain a thread-safe global counter of all events processed. 6. Create a processor class that takes an optional custom prefix string in its constructor to prepend to console output, defaulting to 'LOG:'. Output the final code in a single file named main.cr. Prioritize memory safety, nil-safety, and Crystal v1.20 idioms."
# ==============================================================================
# 2. EXECUTE RUNS BASED ON PROVIDER
# ==============================================================================
echo "🚀 Target locked: $provider"
case "$provider" in
    gemini)
        # Ensure the user has exported their API key
        if [[ -z "${GEMINI_API_KEY}" ]]; then echo "❌ Please export GEMINI_API_KEY first."; exit 1; fi
        api_url="https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-pro-preview:generateContent?key=${GEMINI_API_KEY}"
        echo " -> Executing Run A (Naked Baseline)..."
        jq -n --arg prompt "$prompt" '{"contents": [{"role": "user", "parts": [{"text": $prompt}]}], "generationConfig": {"temperature": 0.0}}' > "$sterile_dir/payload_a.json"
        curl -s -H "Content-Type: application/json" -X POST -d @"$sterile_dir/payload_a.json" "$api_url" | jq -r '.candidates[0].content.parts[0].text' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_a.cr"
        echo " -> Executing Run B (System Override)..."
        jq -n --arg prompt "$prompt" --rawfile sys "$sterile_dir/crystal.rst" '{"systemInstruction": {"parts": [{"text": $sys}]}, "contents": [{"role": "user", "parts": [{"text": $prompt}]}], "generationConfig": {"temperature": 0.0}}' > "$sterile_dir/payload_b.json"
        curl -s -H "Content-Type: application/json" -X POST -d @"$sterile_dir/payload_b.json" "$api_url" | jq -r '.candidates[0].content.parts[0].text' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_b.cr"
        ;;
    claude)
        if [[ -z "${ANTHROPIC_API_KEY}" ]]; then echo "❌ Please export ANTHROPIC_API_KEY first."; exit 1; fi
        api_url="https://api.anthropic.com/v1/messages"
        echo " -> Executing Run A (Naked Baseline)..."
        jq -n --arg prompt "$prompt" '{"model": "claude-3-5-sonnet-latest", "max_tokens": 8192, "temperature": 0.0, "messages": [{"role": "user", "content": $prompt}]}' > "$sterile_dir/payload_a.json"
        curl -s -H "x-api-key: ${ANTHROPIC_API_KEY}" -H "anthropic-version: 2023-06-01" -H "content-type: application/json" -X POST -d @"$sterile_dir/payload_a.json" "$api_url" | jq -r '.content[0].text' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_a.cr"
        echo " -> Executing Run B (System Override)..."
        jq -n --arg prompt "$prompt" --rawfile sys "$sterile_dir/crystal.rst" '{"model": "claude-3-5-sonnet-latest", "max_tokens": 8192, "temperature": 0.0, "system": $sys, "messages": [{"role": "user", "content": $prompt}]}' > "$sterile_dir/payload_b.json"
        curl -s -H "x-api-key: ${ANTHROPIC_API_KEY}" -H "anthropic-version: 2023-06-01" -H "content-type: application/json" -X POST -d @"$sterile_dir/payload_b.json" "$api_url" | jq -r '.content[0].text' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_b.cr"
        ;;
    openai)
        if [[ -z "${OPENAI_API_KEY}" ]]; then echo "❌ Please export OPENAI_API_KEY first."; exit 1; fi
        api_url="https://api.openai.com/v1/chat/completions"
        echo " -> Executing Run A (Naked Baseline)..."
        jq -n --arg prompt "$prompt" '{"model": "gpt-4o", "temperature": 0.0, "messages": [{"role": "user", "content": $prompt}]}' > "$sterile_dir/payload_a.json"
        curl -s -H "Authorization: Bearer ${OPENAI_API_KEY}" -H "Content-Type: application/json" -X POST -d @"$sterile_dir/payload_a.json" "$api_url" | jq -r '.choices[0].message.content' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_a.cr"
        echo " -> Executing Run B (System Override)..."
        jq -n --arg prompt "$prompt" --rawfile sys "$sterile_dir/crystal.rst" '{"model": "gpt-4o", "temperature": 0.0, "messages": [{"role": "system", "content": $sys}, {"role": "user", "content": $prompt}]}' > "$sterile_dir/payload_b.json"
        curl -s -H "Authorization: Bearer ${OPENAI_API_KEY}" -H "Content-Type: application/json" -X POST -d @"$sterile_dir/payload_b.json" "$api_url" | jq -r '.choices[0].message.content' | sed -n '/^```crystal/,/^```/ p' | sed '/^```/d' > "$sterile_dir/run_b.cr"
        ;;
esac
# ==============================================================================
# 3. THE AUTOMATED GRADER
# ==============================================================================
# This function greps the output for specific idioms and runs the compiler
# to ensure we get an impartial, binary score for the code.
evaluate_code() {
    local file="$1"
    local score=0
    echo -e "\n📊 Evaluating $(basename "$file")..."
    if [[ ! -s "$file" ]]; then
        echo " [FATAL] File is empty. The LLM failed to output parseable code."
        return
    fi
    if grep -q "Slice(UInt8)\|Bytes" "$file" && ! grep -q "Pointer(" "$file"; then
        echo " ✅ [PASS] Memory Safety: Safely uses Slice/Bytes." && ((score++))
    else
        echo " ❌ [FAIL] Memory Safety: Uses unsafe Pointer or misses Slice allocation."
    fi
    if grep -q "Sync::Mutex\|Sync::Exclusive" "$file"; then
        echo " ✅ [PASS] Concurrency: Uses the modern Sync module." && ((score++))
    else
        echo " ❌ [FAIL] Concurrency: Defaults to legacy Mutex or misses synchronization."
    fi
    # Note: the union check assumes the model writes the type with spaces ("String | Int32").
    if grep -q "include JSON::Serializable" "$file" && grep -q "String | Int32" "$file"; then
        echo " ✅ [PASS] Data Parsing: Correctly implements JSON::Serializable & Union Types." && ((score++))
    else
        echo " ❌ [FAIL] Data Parsing: Relies on JSON.parse or misses Union Type definitions."
    fi
    # Heuristic: assumes the model names the instance variable @prefix, as the prompt implies.
    if grep -q "@prefix : String" "$file"; then
        echo " ✅ [PASS] Constructor: Explicitly types instance variables to prevent Nil errors." && ((score++))
    else
        echo " ❌ [FAIL] Constructor: Misses explicit type annotations."
    fi
    # Check if it actually compiles
    if crystal build "$file" -Dpreview_mt --no-codegen &> /dev/null; then
        echo " ✅ [PASS] Compilation: Syntax and types are perfectly valid." && ((score++))
    else
        echo " ❌ [FAIL] Compilation: Syntax or type errors detected."
    fi
    echo " ---------------------------------------"
    echo " 🏆 FINAL SCORE: $score/5"
    echo " ---------------------------------------"
}
# ==============================================================================
# 4. EVALUATE, EXTRACT, AND TEARDOWN
# ==============================================================================
evaluate_code "$sterile_dir/run_a.cr"
evaluate_code "$sterile_dir/run_b.cr"
echo -e "\n📦 Extracting generated code for auditing..."
cp "$sterile_dir/run_a.cr" "./${provider}_run_a.cr"
cp "$sterile_dir/run_b.cr" "./${provider}_run_b.cr"
# Clean up our temporary directory
rm -rf "$sterile_dir"
echo "🧹 Sterile container destroyed. Artifacts saved locally. Have a great day!"
I challenge you to run the script against Gemini, Claude, and OpenAI, and post the delta between your Run A and Run B scores. Let’s see which foundational model actually adheres to strict architectural directives.
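To compute that delta mechanically, here is a small post-processing sketch over a saved benchmark log. The two FINAL SCORE lines below are illustrative, not real results; in practice you would capture the script's stdout to the log file:

```shell
#!/usr/bin/env bash
# Extract the two "FINAL SCORE: X/5" lines from a saved benchmark log
# and print Run B minus Run A. The sample log below is illustrative.
log=$(mktemp)
cat > "$log" <<'EOF'
 🏆 FINAL SCORE: 1/5
 🏆 FINAL SCORE: 4/5
EOF

# First match is Run A, second is Run B (the script evaluates them in order).
mapfile -t scores < <(grep -o 'FINAL SCORE: [0-9]*' "$log" | grep -o '[0-9]*$')
echo "Delta (Run B - Run A): $(( scores[1] - scores[0] ))"   # prints: Delta (Run B - Run A): 3
```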
Note: Some agent CLI clients, notably the Claude CLI, have heavy system-like prompts built in, and Gemini CLI is no exception. This is another reason the harness calls the raw APIs instead.