Document under review: _posts/2022-12-12-chatgpt.md
Date: 2026-02-24

A. Key Early Evaluations of LLM Coding Ability

Pre-ChatGPT baseline: Codex and HumanEval (July 2021)

OpenAI’s Codex paper introduced HumanEval — 164 hand-crafted Python programming problems evaluated via functional correctness (pass@k). Codex solved 28.8% of problems with a single sample per problem (pass@1). This established the paradigm: LLMs could produce plausible code, but single-shot correctness was low.

  • https://arxiv.org/abs/2107.03374
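
The pass@k metric itself is simple to state: draw k samples per problem and count the problem as solved if any sample passes the tests. The paper estimates it from n ≥ k samples per problem, c of which pass, using the unbiased estimator pass@k = 1 - C(n-c, k)/C(n, k). A minimal Rust sketch of that estimator (the sample counts below are illustrative, not the paper's):

```rust
/// Unbiased pass@k estimator from the Codex paper: given `n` samples for one
/// problem, `c` of which pass the tests, estimate the probability that at
/// least one of `k` randomly drawn samples is correct.
/// pass@k = 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    assert!(c <= n && k <= n);
    if n - c < k {
        return 1.0; // every size-k draw must contain a passing sample
    }
    let mut all_fail = 1.0;
    for i in 0..k {
        all_fail *= (n - c - i) as f64 / (n - i) as f64;
    }
    1.0 - all_fail
}

fn main() {
    // Hypothetical problem: 200 samples drawn, 58 of them passing.
    println!("pass@1  = {:.3}", pass_at_k(200, 58, 1));  // 0.290
    println!("pass@10 = {:.3}", pass_at_k(200, 58, 10)); // ~0.970
}
```

Benchmark-level pass@k is then the mean of this estimate across the 164 problems.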

EvalPlus / HumanEval+ (May 2023)

“Is Your Code Generated by ChatGPT Really Correct?” extended HumanEval’s test cases by 80x and found that pass rates dropped by up to 19.3-28.9%. Code that appeared correct on the original sparse tests was frequently wrong on edge cases, directly demonstrating the problem the blog post identified: code that looks right but doesn’t work.

  • https://arxiv.org/abs/2305.01210
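
A hypothetical illustration of the failure mode (not an actual HumanEval task): a deliberately naive solution that satisfies a sparse original test but fails once denser, EvalPlus-style edge-case tests are added.

```rust
/// Toy task: return the mean of a list of integers, truncated toward zero.
/// This plausible-looking solution passes the sparse test below.
fn mean(values: &[i32]) -> i32 {
    let sum: i32 = values.iter().sum();
    sum / values.len() as i32
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn original_sparse_test() {
        assert_eq!(mean(&[1, 2, 3]), 2); // passes: the code "looks right"
    }

    #[test]
    fn evalplus_style_edge_cases() {
        // Panics: division by zero on the empty slice.
        assert_eq!(mean(&[]), 0);
        // Panics in debug builds: the i32 sum overflows before dividing.
        assert_eq!(mean(&[i32::MAX, i32::MAX]), i32::MAX);
    }
}
```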

Purdue Stack Overflow Study (2023)

The study analysed 517 Stack Overflow questions and found that 52% of ChatGPT’s programming answers were incorrect and 77% were verbose. Despite the high error rate, a third of participants were fooled by the responses, preferring them because they were comprehensive, well-articulated, and polite: the “plausible-sounding” failure mode the blog post describes.

  • https://www.theregister.com/2023/08/07/chatgpt_stack_overflow_ai/

Advent of Code and informal tests (Dec 2022 - Jan 2023)

ChatGPT struggled to get past approximately day 5 of the 25 Advent of Code 2022 puzzles. Simon Willison documented learning Rust with ChatGPT in December 2022, finding it useful as a teaching assistant but noting hallucinated commands.

  • https://www.themotte.org/post/797/chatgpt-vs-advent-of-code
  • https://simonwillison.net/2022/Dec/5/rust-chatgpt-copilot/

Training cutoff effects

On LeetCode, ChatGPT achieved roughly 66-69% functional correctness on established problems, but performance dropped up to 5x on problems introduced after January 2022 — the training data cutoff. This directly explains the blog post’s observation of mixed old/new asm! syntax: the new-style macro was stabilised in Rust 1.59 (February 2022), right at the cutoff boundary.

  • https://dl.acm.org/doi/10.1145/3643674
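
The two syntaxes are different enough that blending them is conspicuous. A sketch of the same RISC-V CSR read in both styles (compiles only for a riscv target; the pre-1.59 form is shown as a comment because llvm_asm! has since been removed from the language):

```rust
use core::arch::asm;

// Pre-1.59 inline assembly: the unstable llvm_asm! macro with GCC-style
// positional constraint strings. No longer compiles on modern Rust; shown
// only to illustrate what the model's training data also contained:
//
//     llvm_asm!("csrr $0, mcause" : "=r"(value) ::: "volatile");
//
// Rust 1.59+ (February 2022): the stabilised asm! macro with explicit
// operand specifiers.
fn read_mcause() -> usize {
    let value: usize;
    unsafe {
        asm!("csrr {}, mcause", out(reg) value);
    }
    value
}
```

A model whose training data ends near February 2022 would have seen years of the old form and comparatively little of the new one, the condition under which the two get blended.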

SWE-bench (Oct 2023 onward)

SWE-bench shifted evaluation from isolated function synthesis to real-world GitHub issue resolution. Early models solved under 5% of problems. By late 2024, Claude 3.5 Sonnet reached ~49% on the verified subset.

  • https://github.com/SWE-bench/SWE-bench
  • https://epoch.ai/benchmarks/swe-bench-verified

B. How “Hallucination” in Code Was Characterised

The early period used several overlapping terms:

  • Hallucination (the dominant term, inherited from NLP): inventing nonexistent APIs, fabricating features, mixing syntax versions.
  • Confabulation (Beren, March 2023): from neuropsychology, constructing plausible narratives to fill memory gaps without awareness; argued to be more accurate than “hallucination”.
  • Bullshit in the Frankfurt sense (Harford, March 2023; formalised by Hicks et al., 2024): LLMs are indifferent to truth, designed to produce text that looks truth-apt rather than to convey accurate information.
  • Stochastic parrot (Bender et al., 2021; widely invoked in 2023): sophisticated pattern matchers producing statistically likely text without understanding meaning.
  • Verbalist (the blog post, Dec 2022): “deals in words alone without considering the facts or ideas behind them”; arguably the most mechanistically precise description of the specific failure mode.

The blog post’s “verbalist” framing is distinctive. It captures the observation that ChatGPT operates on linguistic patterns rather than engineering first principles: it drills down to the correct technology area and then generates plausible-sounding but nonexistent solutions.

Later formalised taxonomy (2024)

Research classified code hallucinations into three major categories with eight subcategories: Task Requirement Conflicts, Factual Knowledge Conflicts (including API/library knowledge conflicts — exactly what the blog post demonstrates), and Project Context Conflicts.

  • https://www.beren.io/2023-03-19-LLMs-confabulate-not-hallucinate/
  • https://link.springer.com/article/10.1007/s10676-024-09775-5
  • https://timharford.com/2023/03/why-chatbots-are-bound-to-spout-bullshit/
  • https://arxiv.org/abs/2409.20550

C. LLM Strengths and Weaknesses for Coding (Dec 2022 - Mid 2023)

Where LLMs excelled

  • Boilerplate code generation in popular languages (Python, JavaScript, Java)
  • Code explanation and documentation
  • Common patterns with abundant training data (sorting, web requests, CRUD operations)
  • Acting as a teaching assistant for learning new languages
  • Refactoring well-established patterns

Where LLMs failed

  • Niche/rare constructs: Copilot achieved 57.7% accuracy for Java but only 29.7% for C. Less-common language features fared worse. Training data frequency directly predicted accuracy.
  • Systems-level and low-level programming: Inline assembly, register-level operations, hardware-specific constraints had minimal training data representation.
  • Post-training-cutoff knowledge: The new-style Rust asm! macro was at the very edge of GPT-3.5’s knowledge, explaining the exact mixture of old and new syntax the blog post observed.
  • Multi-step reasoning: The asm! problem requires understanding processor architecture, instruction encoding, Rust macro hygiene, and the const-evaluation pipeline simultaneously (see the sketch after this list).
  • Distinguishing deprecated from current: The model mixed obsolete GCC-style syntax with the new asm! API, a signature failure mode when training data contains both old and new documentation.
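
A sketch of why those constraints interact (my reconstruction of the task shape, not the blog post's code): the RISC-V csrr instruction encodes the CSR number as an immediate, so a generic Rust wrapper has to route it through a `const` operand, pulling the const-evaluation pipeline into what looks like a simple register read.

```rust
use core::arch::asm;

/// Read an arbitrary RISC-V CSR (riscv targets only). The CSR number is part
/// of the instruction encoding, so it must be known at compile time: a const
/// generic flows into a `const` asm! operand and is substituted into the
/// instruction template as a literal. No runtime register can carry it.
#[inline(always)]
pub unsafe fn read_csr<const CSR: u32>() -> usize {
    let value: usize;
    asm!("csrr {out}, {csr}", out = out(reg) value, csr = const CSR);
    value
}

// Usage: the CSR number is fixed at compile time.
// let mcause = unsafe { read_csr::<0x342>() }; // 0x342 = mcause
```

Getting this right requires knowing the instruction format (architecture), that csrr takes an immediate (encoding), that asm! has const operands (macro API), and that const generics can reach them (const evaluation): the combination the list above describes.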

The key structural insight: LLM performance correlates with training data density. The “long tail” of niche, systems-level, or recently changed features generates confidently wrong output because the model has some relevant tokens to pattern-match against, but not enough to produce correct results.

Sources:

  • https://dl.acm.org/doi/10.1109/ICSE48619.2023.00181
  • https://dl.acm.org/doi/10.1145/3715108

D. Evaluations on Embedded, Systems, and Assembly Tasks

Exploring LLMs for Embedded System Development (Pratheek et al., July 2023)

The most systematic study from the early period. N=450 experiments across GPT-3.5, GPT-4, and PaLM 2:

  • GPT-4 produced functional I2C interfaces 66% of the time in 50 trials
  • Generated register-level drivers, LoRa communication code, and power optimisations
  • “Even when these tools fail to produce working code, they consistently generate helpful reasoning about embedded design tasks” — echoing the “verbalist” observation

  • https://arxiv.org/abs/2307.03817
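
For scale, the task class is small but unforgiving. A sketch of the shape of a “functional I2C interface”, written against the embedded-hal 1.x I2c trait; the sensor, its address, and its register map are hypothetical, not from the study:

```rust
use embedded_hal::i2c::I2c;

/// Hypothetical temperature sensor: write a register pointer, then read two
/// big-endian bytes back. Register-level drivers follow this
/// write-then-read pattern.
pub struct TempSensor<B> {
    bus: B,
    addr: u8, // 7-bit device address
}

impl<B: I2c> TempSensor<B> {
    pub fn new(bus: B, addr: u8) -> Self {
        Self { bus, addr }
    }

    /// Read the raw 16-bit temperature register (register 0x00 assumed).
    pub fn read_raw(&mut self) -> Result<u16, B::Error> {
        let mut buf = [0u8; 2];
        self.bus.write_read(self.addr, &[0x00], &mut buf)?;
        Ok(u16::from_be_bytes(buf))
    }
}
```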

ChatGPT as Assembly Language Interpreter (2023)

GPT-4 was “highly accurate in general” at interpreting (not generating) assembly instructions across x86, x86-64, ARM, and AArch64, suggesting the models are stronger at comprehension than generation for low-level code.

  • https://dl.acm.org/doi/10.5555/3715622.3715633

Rust for Embedded Systems (Balasubramaniam et al., Nov 2023)

The study documented the challenges that make embedded Rust particularly difficult for LLMs: hardware abstraction layers, unsafe blocks, FFI boundaries, and register-level operations.

  • https://arxiv.org/abs/2311.05063
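
Each of those constructs is small but easy to get subtly wrong. A representative sketch of the register-level pattern (the UART address is a typical memory-mapped value, assumed here, not taken from the paper):

```rust
use core::ptr;

// Hypothetical memory-mapped UART transmit register. The failure modes live
// exactly here: a missing `volatile`, a wrong register width, or an
// unjustified `unsafe` block.
const UART0_TX: *mut u32 = 0x1000_0000 as *mut u32;

/// Safety: the caller must guarantee this address maps a writable UART TX
/// register on the running hardware.
unsafe fn uart_putc(byte: u8) {
    // write_volatile prevents the compiler from eliding or reordering the
    // store, which it may do for an ordinary write through a pointer.
    ptr::write_volatile(UART0_TX, byte as u32);
}
```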

Timeline context

The blog post (December 2022) predates the first formal academic studies of LLMs for embedded systems by 6-7 months. It is one of the earliest documented tests of ChatGPT on embedded/systems-level code generation.

E. Evolution to 2025-2026

Niche/embedded improvement — but still limited

GPT-4 and subsequent models show real improvement on embedded tasks. However, AI code generation in embedded systems still faces fundamental constraints: lack of understanding of timing, cache, memory layout; hallucinated APIs and system calls; non-deterministic or energy-hungry generated code. The core problem the blog post identified — plausible fabrication in niche domains — persists, though the surface area of “niche” has shrunk.

  • https://www.gocodeo.com/post/ai-code-generation-in-embedded-systems-constraints-and-solutions

Rust-specific progress

  • Strand-Rust-Coder-v1 achieved a +14% improvement through training on 191K synthetic Rust examples
  • RUG achieves 71.37% code coverage for Rust unit test generation
  • C-to-safe-Rust transpilation: frontier models achieve only 13-15% one-shot success, rising to 32-37% with repair loops (sketched after this list)
  • Rust’s ownership model, lifetimes, and unsafe boundaries remain difficult for LLMs
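
A sketch of the repair-loop structure behind the 32-37% figure (the function names are placeholders standing in for model and compiler invocations, not the cited papers' APIs):

```rust
/// Generate-compile-repair loop for C-to-Rust transpilation: translate once,
/// then feed compiler diagnostics back to the model until the candidate
/// compiles or the budget runs out.
fn transpile_with_repair(c_source: &str, max_rounds: usize) -> Option<String> {
    let mut candidate = llm_translate(c_source);
    for _ in 0..max_rounds {
        match try_compile(&candidate) {
            Ok(()) => return Some(candidate), // compiles; hand off to tests
            // Feed the compiler diagnostics back to the model and retry.
            Err(diagnostics) => candidate = llm_repair(&candidate, &diagnostics),
        }
    }
    None // budget exhausted; this is where one-shot failures stay failures
}

// Placeholder signatures so the sketch type-checks:
fn llm_translate(_c_source: &str) -> String { unimplemented!() }
fn llm_repair(_rust: &str, _diagnostics: &str) -> String { unimplemented!() }
fn try_compile(_rust: &str) -> Result<(), String> { unimplemented!() }
```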

Sources:

  • https://huggingface.co/blog/Fortytwo-Network/strand-rust-coder-tech-report
  • https://taesoo.kim/pubs/2025/cheng:rug.pdf
  • https://arxiv.org/html/2504.15254v1

Assembly and low-level code

  • Nova (a specialised assembly LLM) outperforms general models on binary decompilation by up to 146.54%
  • LLM-based assembly optimisation with PPO achieves a 1.47x speedup over gcc -O3, but removing the baseline assembly input causes correctness to collapse to 0.0%
  • LLMs can improve existing low-level code but struggle to generate it from scratch

Sources:

  • https://arxiv.org/html/2311.13721v3
  • https://arxiv.org/html/2505.11480v1

F. Blog Post in Context

The post captured phenomena the field took 6-18 months to formalise:

Blog post observation (Dec 2022) → formal characterisation (2023-2024):

  • Mixes obsolete and new syntax → training data temporal contamination; deprecated API hallucination
  • Invents nonexistent features (immediate keyword, asm::immediate, constify!) → Factual Knowledge Conflicts > API/Library Knowledge hallucination
  • Plausible-sounding but fabricated solutions → Frankfurt “bullshit” framework; confabulation
  • “Verbalist”: explains a bogus solution convincingly → stochastic parrot; indifference to truth
  • Can’t solve from first principles → lack of formal reasoning; pattern matching vs constraint satisfaction
  • Works on popular patterns, fails on niche → training data frequency bias; long-tail performance degradation

The test case (Rust asm! macro for RISC-V CSR instructions) sits at the exact intersection of properties most likely to produce LLM failure: niche domain, recent API change, hardware-software boundary, formal constraint satisfaction required. The colour-coded red/green/orange annotation is a manual version of the hallucination taxonomy formalised in 2024.

Summary

The blog post is an early, practically grounded evaluation that independently identified the core failure modes of LLM code generation before the field had vocabulary for them. The “verbalist” framing is a genuine contribution to the discourse. The main gaps are matters of retrospective context: why the training cutoff explains the specific syntax mixing observed, and where ChatGPT does perform well (common patterns in popular languages). These observations emerged later; they were not omissions at the time of writing.