Introducing dotLLM - Building an LLM Inference Engine in C#
If you’ve been building .NET applications and wanted to run LLMs locally, your options have been… limited. You could wrap llama.cpp through LLamaSharp, deal with ONNX Runtime, or orchestrate calls to external Python services. None of these is fully satisfying if what you want is to run LLM inference purely in .NET.
So I built my own.
Over the past two months I’ve been building dotLLM - a ground-up, high-performance LLM inference engine written natively in C#/.NET 10. Not a wrapper. Not bindings. A full implementation: GGUF model loading, tokenization, attention, sampling, SIMD-optimized CPU inference, CUDA GPU acceleration, an OpenAI-compatible API server, and a built-in chat UI. This week I’ve published the first preview release (v0.1.0-preview.2).
This post covers three things: what dotLLM is, why I built it, and how AI-assisted development - Claude Code, Codex, and Gemini working together in a structured workflow - made it possible for a single developer to ship this in about two months.
What is dotLLM?
The short version
dotLLM is a native C#/.NET 10 LLM inference engine. It runs transformer-based models - Llama, Mistral, Phi, Qwen (and more in the future) - from GGUF files, with SIMD-optimized CPU and CUDA GPU backends, or a hybrid mode. It exposes an OpenAI-compatible API and ships as a CLI tool and as NuGet packages you can embed in your own applications.
The key word is native. All orchestration, model loading, tokenization, sampling, scheduling, and CPU compute are implemented in pure C#. The only native code is a thin CUDA library for GPU kernels, loaded as PTX through the CUDA Driver API via P/Invoke.
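To make that concrete, here is a minimal sketch of what talking to the CUDA Driver API directly from C# can look like. The entry points (`cuInit`, `cuModuleLoadData`, `cuModuleGetFunction`) are real Driver API functions, but the surrounding structure is illustrative - it is not dotLLM's actual interop layer:

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative sketch: P/Invoke the CUDA Driver API directly,
// so PTX kernels can be loaded without a fat native wrapper library.
internal static partial class CudaDriver
{
    [LibraryImport("nvcuda")]
    internal static partial int cuInit(uint flags);

    [LibraryImport("nvcuda")]
    internal static partial int cuModuleLoadData(out IntPtr module, byte[] ptxImage);

    [LibraryImport("nvcuda", StringMarshalling = StringMarshalling.Utf8)]
    internal static partial int cuModuleGetFunction(out IntPtr func, IntPtr module, string name);
}
```

A caller would `cuInit(0)`, load the null-terminated PTX bytes with `cuModuleLoadData`, resolve a kernel with `cuModuleGetFunction`, and launch it via `cuLaunchKernel` - GPU memory stays on the native side as opaque handles the whole time.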
Architecture
dotLLM is organized as a layered architecture where each layer depends only on the layers below:
```mermaid
graph TD
    subgraph "User-Facing"
        CLI["DotLLM.Cli<br/><small>CLI tool</small>"]
        ChatUI["Built-in Chat UI<br/><small>Browser-based</small>"]
        Server["DotLLM.Server<br/><small>OpenAI-compatible API</small>"]
    end
    subgraph "Engine"
        Eng["DotLLM.Engine<br/><small>KV-cache, scheduler, samplers,<br/>constraints, speculative decoding</small>"]
    end
    subgraph "Model Layer"
        Models["DotLLM.Models<br/><small>GGUF loader, Llama, Mistral,<br/>Phi, Qwen, DeepSeek</small>"]
        Tokenizers["DotLLM.Tokenizers<br/><small>BPE, SentencePiece,<br/>Jinja2 chat templates</small>"]
    end
    subgraph "Compute"
        CPU["DotLLM.Cpu<br/><small>SIMD / AVX2 / AVX-512</small>"]
        CUDA["DotLLM.Cuda<br/><small>PTX kernels, cuBLAS</small>"]
    end
    Core["DotLLM.Core<br/><small>ITensor, IBackend, IModel,<br/>ISamplerStep, IDecodingConstraint</small>"]
    CLI --> Server
    ChatUI --> Server
    Server --> Eng
    CLI --> Eng
    Eng --> Models
    Eng --> Tokenizers
    Models --> CPU
    Models --> CUDA
    CPU --> Core
    CUDA --> Core
    Models --> Core
    Tokenizers --> Core
    Eng --> Core
```
Each project ships as a separate NuGet package. `DotLLM.Core` defines all abstractions (`ITensor`, `IBackend`, `IModel`, `ISamplerStep`, etc.) while concrete implementations live in their respective projects. You pull in only what you need.
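To illustrate the idea, here is a hypothetical custom sampler built against the `ISamplerStep` abstraction. The interface name comes from the project, but its exact shape below is my assumption, not the shipped API:

```csharp
// Hypothetical sketch - the real ISamplerStep signature may differ.
// Assumed contract: mutate the logits in place before token selection.
public interface ISamplerStep
{
    void Apply(Span<float> logits);
}

// A simple min-p filter as an example of a pluggable sampling step.
public sealed class MinPSampler : ISamplerStep
{
    private readonly float _minP;
    public MinPSampler(float minP) => _minP = minP;

    public void Apply(Span<float> logits)
    {
        // Keep only tokens whose probability is at least minP times the
        // top token's probability. In log space (softmax normalization
        // cancels): logit_i >= logit_max + ln(minP).
        float max = float.NegativeInfinity;
        foreach (float l in logits)
            if (l > max) max = l;

        float threshold = max + MathF.Log(_minP);
        for (int i = 0; i < logits.Length; i++)
            if (logits[i] < threshold)
                logits[i] = float.NegativeInfinity;  // mask the token out
    }
}
```

Because implementations live in their own packages, a third-party sampler like this could ship as its own NuGet package depending only on `DotLLM.Core`.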
Key features
Performance:
- **Zero-alloc inference** - all tensor data uses `NativeMemory.AlignedAlloc` (64-byte aligned). No managed heap allocations on the hot path - or at least a best effort to avoid them - so (almost) no allocs and no GC triggered.
- **SIMD vectorization** - `TensorPrimitives` for standard operations, hand-tuned `System.Runtime.Intrinsics` for quantized matmul, RMSNorm, RoPE, and softmax. AVX2 and AVX-512 with scalar fallbacks.
- **Memory-mapped model loading** - GGUF files loaded via `MemoryMappedFile`. OS demand-paging means multi-GB models load in milliseconds.
- **Quantized inference** - FP16, Q8_0, Q4_K_M, and other GGUF formats with fused scale-int dot-product kernels operating directly on quantized blocks.
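To show what "fused scale-int dot product" means, here is a simplified scalar sketch assuming the standard GGUF Q8_0 layout (blocks of 32 int8 weights sharing one scale). The real kernels are SIMD-vectorized and operate on both sides quantized; this only illustrates the structure:

```csharp
// Simplified scalar sketch of a fused Q8_0 dot product.
// GGUF Q8_0: blocks of 32 int8 values with one shared scale each.
const int BlockSize = 32;

static float DotQ8(ReadOnlySpan<sbyte> quants,   // quantized weights
                   ReadOnlySpan<float> scales,   // one scale per block
                   ReadOnlySpan<float> x)        // activations
{
    float sum = 0f;
    for (int b = 0; b < scales.Length; b++)
    {
        // Accumulate within the block, then apply the block scale once -
        // there is no separate per-weight dequantization pass.
        float blockSum = 0f;
        int off = b * BlockSize;
        for (int i = 0; i < BlockSize; i++)
            blockSum += quants[off + i] * x[off + i];
        sum += scales[b] * blockSum;
    }
    return sum;
}
```

The fusion matters because it keeps the inner loop working on the compact quantized data, trading one multiply per block for 32 per-weight dequantization multiplies.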
Serving:
- **OpenAI-compatible API** - `/v1/chat/completions`, `/v1/completions`, tool calling, structured output, streaming SSE via ASP.NET.
- **Speculative decoding** - draft-verify-accept loop with KV-cache rollback for higher throughput.
- **Structured output** - FSM/PDA-based constrained decoding guaranteeing valid JSON, JSON Schema, regex, and GBNF grammars.
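The draft-verify-accept loop is simple enough to sketch. The sketch below shows a greedy variant; `DraftModel`, `TargetModel`, and `Kv` are illustrative names, not dotLLM's actual API:

```csharp
// Hedged sketch of a greedy draft-verify-accept loop.
List<int> SpeculativeDecode(List<int> prompt, int k)
{
    var tokens = new List<int>(prompt);
    while (!Stop(tokens))
    {
        // 1. Draft: the small model cheaply proposes k tokens.
        List<int> draft = DraftModel.Greedy(tokens, k);

        // 2. Verify: one batched forward pass of the target model scores
        //    all k draft positions at once, yielding k+1 greedy choices.
        List<int> verified = TargetModel.GreedyBatch(tokens, draft);

        // 3. Accept the longest matching prefix, plus the target's own
        //    token at the first mismatch (or a bonus token on full match).
        int n = 0;
        while (n < k && draft[n] == verified[n]) n++;
        for (int i = 0; i <= n; i++)
            tokens.Add(verified[i]);

        // 4. Roll the KV-cache back to the accepted length so rejected
        //    draft entries don't pollute future steps.
        Kv.Rollback(tokens.Count);
    }
    return tokens;
}
```

The payoff: when the draft model guesses well, the target model produces several tokens per forward pass instead of one, without changing the (greedy) output.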
Extensibility:
- **Pluggable backends** - `IBackend` interface with separate packages (CPU, CUDA, future ROCm).
- **LoRA adapters (planned)** - runtime loading, no weight merging, concurrent multi-adapter serving.
- **Diagnostic hooks (planned)** - zero-cost `IInferenceHook` for activation capture, logit lens, SAE integration.
- **OpenTelemetry (planned)** - `System.Diagnostics.Metrics` + `Activity` for throughput, latency, and per-request tracing.
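Since the diagnostic hooks are still planned, here is only a hypothetical sketch of what an `IInferenceHook` might look like - the interface shape is my guess, not a shipped API:

```csharp
// Hypothetical sketch - not the actual dotLLM interface.
public interface IInferenceHook
{
    // Called after each transformer layer with the residual-stream
    // activations for the current token position.
    void OnLayerActivations(int layer, ReadOnlySpan<float> hidden);
}

public sealed class LogitLensHook : IInferenceHook
{
    public void OnLayerActivations(int layer, ReadOnlySpan<float> hidden)
    {
        // A logit-lens style probe would project `hidden` through the
        // unembedding matrix here, to see which token the model
        // "believes in" at this depth of the network.
    }
}
```

"Zero-cost" here presumably means the hook dispatch compiles away entirely when no hook is registered, so production inference pays nothing for the capability.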
Here’s a minimal streaming generation example:
```csharp
using var gguf = GgufFile.Open(modelPath);
var config = GgufModelConfigExtractor.Extract(gguf.Metadata);
using var model = TransformerModel.LoadFromGguf(gguf, config);
var tokenizer = GgufBpeTokenizerFactory.Load(gguf.Metadata);
var generator = new TextGenerator(model, tokenizer);

var options = new InferenceOptions
{
    SamplerSteps =
    [
        new TemperatureSampler(0.8f),
        new TopKSampler(40),
        new TopPSampler(0.95f)
    ],
    StopConditions = [new EosStopCondition(tokenizer.EosTokenId)],
    MaxTokens = 128
};

await foreach (var token in generator.GenerateStreamingTokensAsync(prompt, options))
    Console.Write(token.Text);
```
And here’s what the CLI looks like for a quick generation run with SmolLM-135M:
```text
── dotllm | Llama 30L/576H | Q8_0 | 16 threads | greedy ──────────────────
The capital of France is Paris. Paris is a city of romance and culture,
╭──────────────────────────────────────────────────────────────────────────╮
│  Generation Complete                                     163.27 tok/s    │
│                                                                          │
│  Prefill      12.3 ms     6 tokens    487.80 tok/s                       │
│  Decode       91.8 ms    15 tokens    163.40 tok/s                       │
│  Sampling      0.1 ms    15 tokens                                       │
│  Total       104.2 ms    21 tokens    201.54 tok/s                       │
│  Load        456.7 ms                                                    │
│                                                                          │
│  Weights     136.73 MiB (memory-mapped)                                  │
│  KV Cache    158.20 MiB (192 slots)                                      │
╰──────────────────────────────────────────────────────────────────────────╯
```
There is also a built-in `serve` command that hosts a simple chat UI for testing and research.
Performance reality check
Let’s be honest about where dotLLM stands today. On CPU decode - the metric that matters most for interactive chat - dotLLM is approaching llama.cpp parity on larger models. Decode is memory-bandwidth-bound, and C# with SIMD intrinsics can saturate the memory bus just as well as C. Prefill is a different story: dotLLM is still 2-3x slower because llama.cpp has years of hand-tuned GEMM kernels. This is a preview release and performance is actively improving - but if you need maximum throughput today, llama.cpp is faster. The goal is to close that gap over time.
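The memory-bandwidth argument is easy to sanity-check with a back-of-the-envelope calculation: in single-stream decode, every generated token has to stream (roughly) all model weights through memory once, so the ceiling is bandwidth divided by weight size. The numbers below are illustrative, not measured dotLLM results:

```csharp
// Back-of-the-envelope decode ceiling: each decoded token reads
// approximately the whole weight file from memory once.
double weightsGiB = 4.1;        // e.g. a ~7B model at Q4_K_M (illustrative)
double bandwidthGiBs = 60.0;    // typical dual-channel DDR5 (illustrative)

double ceilingTokS = bandwidthGiBs / weightsGiB;   // ≈ 14.6 tok/s
Console.WriteLine($"Decode upper bound: {ceilingTokS:F1} tok/s");
```

No amount of compute optimization pushes decode past that bound, which is why a SIMD-capable managed language can match C there. Prefill, by contrast, is compute-bound GEMM, and that is where hand-tuned kernels still win.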
Why build this?
A few reasons, starting with the most important:
Understanding by building. I’ve been writing about LLM internals on this blog - temperature and sampling, logprobs. At some point, writing about softmax and top-k sampling makes you want to implement the whole pipeline. Building dotLLM was the logical next step: implementing the entire inference pipeline from GGUF parsing through attention to token generation, seeing every allocation and every SIMD instruction up close. And it will generate A LOT of other blog posts…😇
Proving the platform. There’s a persistent assumption that systems-level performance work requires C, C++, or Rust. Twenty years of .NET performance work has taught me that’s not always true. C# with NativeMemory, System.Runtime.Intrinsics, MemoryMappedFile, and Span<T> gives you genuine control over memory and compute. dotLLM is a proof point.
Seeing how far AI-assisted development can go. This was also, explicitly, an experiment. Not vibe coding - not a “prompt and pray” loop. Structured, documented, reviewed AI-assisted development where a human makes the architectural decisions and AI handles implementation within well-defined boundaries. I wanted to find out what a solo developer can realistically build in one to two months with this approach. The answer surprised me.
Creating a platform for research and experimentation. Building from the ground up with this goal in mind means the architecture is open to experimentation - adding new features, deeper diagnostics, and interpretability tools into the inference pipeline. All outside the gold-standard HuggingFace monopoly.
The .NET ecosystem gap is real. If you’re building .NET applications and want local LLM inference, your choices are wrappers (LLamaSharp wrapping llama.cpp), limited runtimes (ONNX Runtime with restricted model support), or orchestration layers (Semantic Kernel, which is about chaining calls, not running inference). Enterprise .NET shops that want to run models in production without Python or C++ dependencies have no native option. dotLLM fills that gap, though it will take a LOT of time before it can be treated as a serious replacement.
dotLLM is not meant to replace llama.cpp or vLLM in production - at least not yet. It’s built for .NET developers who want native inference without leaving their ecosystem, for researchers and experimenters who want to explore LLM internals from C#, and for everyone who wants to understand how to build an LLM inference engine from scratch.
How AI built an AI engine
Let me be upfront: nearly every commit in dotLLM’s git history has a Co-Authored-By: Claude Opus 4.6 line. This project would not exist in its current form without AI assistance. But the story of how that assistance was structured is more interesting than the fact that it was used.
The development methodology
dotLLM was built over ~60 implementation steps organized into 7 phases, documented in a detailed ROADMAP.md. Each step was a discrete unit of work with clear scope and acceptance criteria:
- Phase 1 - End-to-end single-token generation (GGUF loader, dequantization, CPU ops, tokenizer, attention, KV-cache, sampling)
- Phase 2 - Practical local inference (Q4_K_M, chat templates, streaming, multi-threading, additional architectures)
- Phase 3 - CPU performance (tiled attention, SIMD tuning, NUMA awareness, operator fusion)
- Phase 4 - GPU acceleration (CUDA backend, hybrid CPU/GPU, KV-cache quantization)
- Phase 5 - Constrained decoding and API (JSON/schema/regex/grammar, tool calling, server, chat UI, prompt caching)
- Phase 6 - Improved serving (warm-up, Native AOT, paged KV-cache, speculative decoding)
- Phase 7 - Diagnostics and interpretability (logprobs, hooks, logit lens - in progress)
Every step started as a GitHub issue. Every issue lived on a branch named issue/{number}-{short-description}. Every PR closed its issue and updated the roadmap. This was relentlessly boring discipline, and it was the single most important factor in the project’s success.
After the initial release, I ran a series of “Waves” - systematic quality passes across the entire codebase:
- Wave 1 (P0): Security and crash fixes - path traversal, CUDA shared memory guards, hybrid GPU edge cases
- Wave 2 (P1): Quick correctness and consistency fixes
- Wave 3: Presentation cleanup - remove dead code, label stubs, fix samples
- Wave 4: Server hardening - request validation, LINQ removal from hot paths
- Wave 5 was skipped - it was earmarked for batch-serving improvements that depend on Phase 9 (continuous batching), which isn’t implemented yet
- Wave 6: CUDA kernel rewrite - tiled softmax, vectorization, grid-stride loops
- Wave 7: CPU performance - TopK sampler optimization, AVX2 gap filling, schema cache tuning
ROADMAP.md and CLAUDE.md - the highest-ROI investments
If there’s one takeaway from this experiment, it’s this: the time you spend writing structured documentation for AI is not overhead - it IS the development methodology.
**ROADMAP.md** was the backbone. Each of the ~60 steps had a feature name, a description, key files to modify, and dependencies on other steps. This gave both me and the AI a shared understanding of what to build next, in what order, and why. Without it, AI would be coding in circles - solving the wrong problem, building features in the wrong order, missing dependencies.
The roadmap also forced me to think through the architecture upfront. When you have to write “Step 31: CUDA backend - PTX kernels loaded via CUDA Driver API, no native shared library, cuBLAS HGEMM for prefill, custom quantized GEMV for decode” before writing any code, you’ve already made the hard decisions.
**CLAUDE.md** was the project’s “constitution” - 180+ lines defining how AI should work in this codebase. Here are some actual rules from it:
**Native .NET first** - All orchestration, model loading, tokenization, sampling, scheduling, CPU compute in pure C#.
**Unmanaged memory for tensors** - `NativeMemory.AlignedAlloc` (64-byte). Zero GC allocations on inference hot path.
**Hybrid GPU architecture** - Thin native C/CUDA lib via `[LibraryImport]`. GPU memory as opaque `IntPtr` - tensor data never crosses P/Invoke boundary.
And specific coding rules:
**NEVER** allocate managed arrays for tensor data. Use `NativeMemory.AlignedAlloc` (64-byte for AVX-512, 32-byte for AVX2).
SIMD: Foundation is `System.Numerics.Tensors.TensorPrimitives` for standard ops. Hot inner loops: `System.Runtime.Intrinsics` - prefer cross-platform `Vector128<T>`/`Vector256<T>`, use platform-specific only when measurably faster. **Always provide a scalar fallback.**
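In practice that rule produces code shaped like the following sketch - an illustrative example of the convention, not an actual dotLLM kernel:

```csharp
using System.Runtime.Intrinsics;

// Illustrative sketch of the CLAUDE.md convention: cross-platform
// Vector256 body with a scalar fallback/tail, so the same method
// runs correctly on any hardware.
static float SumSquares(ReadOnlySpan<float> x)
{
    int i = 0;
    float sum = 0f;

    if (Vector256.IsHardwareAccelerated)
    {
        var acc = Vector256<float>.Zero;
        int width = Vector256<float>.Count;          // 8 floats on AVX2
        for (; i <= x.Length - width; i += width)
        {
            var v = Vector256.Create(x.Slice(i, width));
            acc += v * v;                             // fused into the accumulator
        }
        sum = Vector256.Sum(acc);
    }

    // Scalar fallback and remainder tail - always present.
    for (; i < x.Length; i++)
        sum += x[i] * x[i];

    return sum;
}
```

The cross-platform `Vector256` API lets the JIT pick the best instruction encoding per machine, while the trailing scalar loop handles both non-SIMD hardware and the leftover elements.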
Beyond CLAUDE.md, there were 22 detailed design documents in /docs/ - one for each major subsystem: ARCHITECTURE, QUANTIZATION, ATTENTION, CUDA, KV_CACHE, CONSTRAINED_DECODING, TOOL_CALLING, SCHEDULING, SPECULATIVE, and more. The rule was simple: AI reads the relevant spec before touching a module.
This documentation-first approach had a compound effect. Every implementation step could reference the roadmap for scope, the design docs for architecture, and CLAUDE.md for coding conventions. The AI wasn’t guessing - it was implementing within well-defined constraints.
Claude Code as implementation partner
Claude Code with Opus 4.6 (1M context) was the primary implementation tool from the project start. The workflow was built around six custom Claude Code skills that automated the development lifecycle:
- `/plan-step` - looks up a given roadmap step in `ROADMAP.md` plus relevant docs from `/docs/`, enters plan mode, and produces a step-by-step implementation plan for my approval before any code is written.
- `/create-pr` - commits remaining changes, pushes the branch, and creates a PR with a detailed description following project conventions.
- `/apply-pr-comments` - reads review comments from Codex, Gemini, or human reviewers, analyzes them, and enters plan mode so I can approve the fixes before any code changes.
- `/finish-pr-comments` - after fixes are applied and tested, commits the changes, pushes to the PR branch, and replies to each reviewer comment with what was fixed and the commit hash.
- `/merge-pr` - squash-merges the PR into main, deletes the remote branch, and switches to an updated local main.
- `/plan-issue` - similar to `/plan-step`, but for the less frequent case where we start from an issue rather than a roadmap step.
There was also a GitHub Actions workflow that responds to @claude mentions in PRs and issues, enabling asynchronous interaction.
The typical flow for a feature looked like this: I’d run `/plan-step` (or `/plan-issue`), review and adjust the plan, then let Claude implement it step by step while I reviewed each change. The key insight is that planning was always a separate, human-approved step before implementation began.
Codex and Gemini as PR reviewers
Every PR was also reviewed by Codex and Gemini (2.5 Pro), triggered manually via `@codex` and `@gemini` mentions in PR comments. Each bot produces prioritized findings - P1 (high-priority bugs, orange badge) and P2 (improvements, yellow badge).
The `@gemini` PR comments are powered by a custom Python bot (`.github/scripts/gemini_bot.py`) with retry logic and configurable thinking budgets. A separate GEMINI.md file defined its review persona:
# GEMINI.md - dotLLM Project Mandates
This file defines the foundational mandates for Gemini CLI's operation within the **dotLLM** repository. These instructions take absolute precedence over general defaults.
## Core Directives
1. **Follow CLAUDE.md:** Treat `CLAUDE.md` as the primary source of truth for architectural patterns, memory management, and coding style.
...
The review findings were not cosmetic. They caught genuinely critical bugs that could have shipped:
- **KV-cache quantization (PR #75)**: Codex caught a ring-buffer indexing bug (window reads used linear indexing instead of ring indices, producing garbage after wrap-around), a pinned buffer scope issue (pointers from `fixed` blocks used after the scope exited), and a shared-state race condition (per-layer eviction progress stored in a shared counter) - all P1.
- **JSON Schema constrained decoding (PR #79)**: found a cache key collision (string substates not included in the hash, collapsing distinct parser states) and a unicode escape flag preservation bug (`\u` parsing wiped the key-string flag) - both P1.
- **Wave 6 CUDA kernel rewrite (PR #114)**: Gemini identified thread underutilization in GEMV kernels and uncoalesced memory reads - architectural issues that Codex’s code-level analysis didn’t surface.
- **First public release (PR #120)**: Gemini reviewed the CI/CD pipeline, build determinism settings, AOT dependency handling, and documentation completeness.
Once Codex and Gemini leave their comments, the remaining skills close the loop. `/apply-pr-comments` reads all review comments on the current PR, analyzes them, and enters plan mode so I can approve which fixes to make before any code changes. This prevents blindly applying every suggestion, keeping human judgment on what’s worth addressing versus deferring.
After the fixes are implemented and tested, `/finish-pr-comments` commits the changes, pushes to the PR branch, and has the Claude Code bot (dotllm-claude-code-bot) reply to each reviewer comment with what was fixed and the corresponding commit hash.
This creates a fully traceable chain: Codex/Gemini finds a bug -> Claude fixes it -> the reply references the exact commit. The PR thread becomes a complete audit trail.
Finally, `/merge-pr` squash-merges the PR into main, deletes the remote branch, and checks out an updated local main. The entire cycle - from roadmap step to merged PR - typically took one Claude Code session.
Lessons learned - a candid assessment
What worked brilliantly:
- `ROADMAP.md` and `CLAUDE.md` were the highest-ROI time investments. Every hour spent writing structured documentation saved many hours of correcting misdirected AI implementation. The documentation is the development methodology - it’s not overhead, it’s the thing that makes AI-assisted development work at all. Without the roadmap, the AI has no sense of direction. Without `CLAUDE.md`, it has no sense of style.
- Two-role separation: implementation vs. review. Claude wrote the code. Codex and Gemini reviewed it independently. The AI that writes the code should not be the only AI that reviews it. Different models have genuinely different blind spots - Codex and Gemini almost never flagged the same issues.
- No need for a Ralph loop or YOLO mode - a human in the loop with well-scoped permissions works.
- AI is a powerful exploration and research machine - it can effectively “transpile” ideas out of the C/C++ world of llama.cpp and vLLM into C#.
Where AI struggled:
- Fundamental architectural decisions. Which attention strategy to use, how to structure the GPU interop layer, whether to use PagedAttention kernels or staging-buffer gather - these required human judgment informed by reading llama.cpp, vLLM, and the research literature. AI can implement an architecture; it’s less reliable at choosing between architectures.
- Performance intuition. AI follows rules (“don’t allocate on the hot path”) but doesn’t develop intuition for when code is slow. It won’t notice that a particular loop pattern thrashes L2 cache, or that a seemingly innocent LINQ expression is allocating closures in a tight loop. Well, at least not always.
- Keeping it within the guardrails. No matter how many times and in how many places I asked it not to create compound tool calls (`cd D:\github\ai\dotLLM && git ...`), it eventually ignored the rule, breaking out of the permissions model.
Surprising insights:
- Codex and Gemini findings were genuinely critical. Ring-buffer bugs, use-after-free on pinned buffers, race conditions - these are the kind of bugs that ship to production and cause mysterious crashes.
- The two review AIs were truly complementary. Codex excelled at code-level bugs (off-by-one, scope issues, cache key collisions). Gemini excelled at architecture-level concerns (thread utilization, memory access patterns, CI/CD correctness). I don’t think either alone would have been as effective.
- Writing CLAUDE.md paid for itself within the first week. Once the rules were written, every implementation session started from a consistent baseline. No re-explaining conventions. No correcting the same mistakes. The upfront investment was maybe 4 hours; the cumulative time saved was enormous.
- Register pressure and hardware limits. Step 26 (outer-product tiled matmul) hit a wall: the 4x3 AVX2 tile needs 23 YMM registers but only 16 are available, causing spills that negated the performance gain. AI didn’t anticipate this at first glance, but it was genuinely impressive to watch it discover the problem by inspecting the JIT output of the generated code and iterating through many attempts. It required a combined understanding of RyuJIT’s register allocator behavior and .NET tooling, and it was great to observe!
- It struggled for a few hours, literally, to support a quantization format - and it did it!
What’s next
dotLLM is at v0.1.0-preview.2 - explicitly a preview. The foundations are solid, but there’s a long road ahead:
- Phase 7 (in progress): Diagnostic hooks, logit lens, Sparse Autoencoders (SAE) integration, LoRA adapters
- Phase 8 (planned): MLA attention (DeepSeek), SmolLM3, Gemma 4, Mixture of Experts
- Phase 9 (planned): Production serving - continuous batching, prefix sharing, advanced scheduling. This will be fun🔥
The project is GPLv3 and contributions are welcome. The codebase has 22 design docs, a detailed roadmap, and a CLAUDE.md that makes it easy for both human and AI contributors to get oriented quickly.
- GitHub: github.com/kkokosa/dotLLM
- Website: dotllm.dev
- NuGet packages: `DotLLM.Engine`, `DotLLM.Cpu`, `DotLLM.Cuda`, `DotLLM.Server`, and more
- Discussions: GitHub Discussions
Closing
Two things are true at once. First: .NET can do native, systems-level AI work. Zero-GC inference, SIMD-vectorized kernels, memory-mapped model loading, paged KV-cache, speculative decoding - all in C#. The platform is more capable than many give it credit for.
Second: a solo developer can build something of this scope in two months with AI assistance - but only with relentless structure. The roadmap, the design docs, the CLAUDE.md constitution, the dual-review workflow - take any of these away and the productivity collapses. AI amplifies discipline; it doesn’t replace it.
If you’re a .NET developer curious about LLM inference, or a researcher who wants to explore model internals from C#, give dotLLM a try. File issues. Break things. Tell me what’s missing.