<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://kokosa.dev/feed.xml" rel="self" type="application/atom+xml"/><link href="https://kokosa.dev/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-14T10:56:55+02:00</updated><id>https://kokosa.dev/feed.xml</id><title type="html">Konrad ‘Dev Nerd’ Kokosa</title><subtitle>Blog of Konrad Kokosa. </subtitle><entry><title type="html">Introducing dotLLM - Building an LLM Inference Engine in C#</title><link href="https://kokosa.dev/blog/2026/dotllm/" rel="alternate" type="text/html" title="Introducing dotLLM - Building an LLM Inference Engine in C#"/><published>2026-04-14T10:00:00+02:00</published><updated>2026-04-14T10:00:00+02:00</updated><id>https://kokosa.dev/blog/2026/dotllm</id><content type="html" xml:base="https://kokosa.dev/blog/2026/dotllm/"><![CDATA[<figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/dotllm-hero-480.webp 480w,/assets/img/dotllm-hero-800.webp 800w,/assets/img/dotllm-hero-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/dotllm-hero.gif" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>If you’ve been building .NET applications and wanted to run LLMs locally, your options have been… limited. You could wrap <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a> through <a href="https://github.com/SciSharp/LLamaSharp">LLamaSharp</a>, deal with <a href="https://onnxruntime.ai/docs/get-started/with-csharp.html">ONNX Runtime</a>, or orchestrate calls to external Python services. 
None of these is fully satisfying if you want to run LLM inference purely in .NET.</p> <p>So I built my own.</p> <p>Over the past two months I’ve been building <a href="https://dotllm.dev/">dotLLM</a> - a ground-up, high-performance LLM inference engine written natively in C#/.NET 10. Not a wrapper. Not bindings. A full implementation: GGUF model loading, tokenization, attention, sampling, SIMD-optimized CPU inference, CUDA GPU acceleration, an OpenAI-compatible API server, and a built-in chat UI. This week I published the <a href="https://github.com/kkokosa/dotLLM/releases">first preview release</a> (v0.1.0-preview.2).</p> <p>This post covers three things: <strong>what</strong> dotLLM is, <strong>why</strong> I built it, and <strong>how</strong> AI-assisted development - Claude Code, Codex, and Gemini working together in a structured workflow - made it possible for a single developer to ship this in about two months.</p> <h2 id="what-is-dotllm">What is dotLLM?</h2> <h3 id="the-short-version">The short version</h3> <p>dotLLM is a native C#/.NET 10 LLM inference engine. It runs transformer-based models - Llama, Mistral, Phi, Qwen (with more to come) - from GGUF files, on a SIMD-optimized CPU backend, a CUDA GPU backend, or in hybrid CPU/GPU offloading mode. It exposes an OpenAI-compatible API and ships both as a CLI tool and as NuGet packages you can embed in your own applications.</p> <p>The key word is <strong>native</strong>. All orchestration, model loading, tokenization, sampling, scheduling, and CPU compute are implemented in pure C#. The only native code is a thin CUDA library for GPU kernels, loaded via PTX through the CUDA Driver API (P/Invoked).</p> <h3 id="architecture">Architecture</h3> <p>dotLLM is organized as a layered architecture where each layer depends only on the layers below:</p> <pre><code class="language-mermaid">graph TD
    subgraph "User-Facing"
        CLI["DotLLM.Cli&lt;br/&gt;&lt;small&gt;CLI tool&lt;/small&gt;"]
        ChatUI["Built-in Chat UI&lt;br/&gt;&lt;small&gt;Browser-based&lt;/small&gt;"]
        Server["DotLLM.Server&lt;br/&gt;&lt;small&gt;OpenAI-compatible API&lt;/small&gt;"]
    end

    subgraph "Engine"
        Eng["DotLLM.Engine&lt;br/&gt;&lt;small&gt;KV-cache, scheduler, samplers,&lt;br/&gt;constraints, speculative decoding&lt;/small&gt;"]
    end

    subgraph "Model Layer"
        Models["DotLLM.Models&lt;br/&gt;&lt;small&gt;GGUF loader, Llama, Mistral,&lt;br/&gt;Phi, Qwen, DeepSeek&lt;/small&gt;"]
        Tokenizers["DotLLM.Tokenizers&lt;br/&gt;&lt;small&gt;BPE, SentencePiece,&lt;br/&gt;Jinja2 chat templates&lt;/small&gt;"]
    end

    subgraph "Compute"
        CPU["DotLLM.Cpu&lt;br/&gt;&lt;small&gt;SIMD / AVX2 / AVX-512&lt;/small&gt;"]
        CUDA["DotLLM.Cuda&lt;br/&gt;&lt;small&gt;PTX kernels, cuBLAS&lt;/small&gt;"]
    end

    Core["DotLLM.Core&lt;br/&gt;&lt;small&gt;ITensor, IBackend, IModel,&lt;br/&gt;ISamplerStep, IDecodingConstraint&lt;/small&gt;"]

    CLI --&gt; Server
    ChatUI --&gt; Server
    Server --&gt; Eng
    CLI --&gt; Eng
    Eng --&gt; Models
    Eng --&gt; Tokenizers
    Models --&gt; CPU
    Models --&gt; CUDA
    CPU --&gt; Core
    CUDA --&gt; Core
    Models --&gt; Core
    Tokenizers --&gt; Core
    Eng --&gt; Core
</code></pre> <p>Each project ships as a separate NuGet package. <code class="language-plaintext highlighter-rouge">DotLLM.Core</code> defines all abstractions (<code class="language-plaintext highlighter-rouge">ITensor</code>, <code class="language-plaintext highlighter-rouge">IBackend</code>, <code class="language-plaintext highlighter-rouge">IModel</code>, <code class="language-plaintext highlighter-rouge">ISamplerStep</code>, etc.) while concrete implementations live in their respective projects. You pull in only what you need.</p> <h3 id="key-features">Key features</h3> <p><strong>Performance:</strong></p> <ul> <li><strong>Zero-alloc inference</strong> - all tensor data uses <code class="language-plaintext highlighter-rouge">NativeMemory.AlignedAlloc</code> (64-byte aligned). No managed heap allocations on the hot path, at least as a best effort so far: (almost) no allocations, no GC triggered.</li> <li><strong>SIMD vectorization for CPU backend</strong> - <code class="language-plaintext highlighter-rouge">TensorPrimitives</code> for standard operations, hand-tuned <code class="language-plaintext highlighter-rouge">System.Runtime.Intrinsics</code> for quantized matmul, RMSNorm, RoPE, and softmax. AVX2 and AVX-512 with scalar fallbacks.</li> <li><strong>CUDA GPU backend</strong> - PTX kernels loaded via the CUDA Driver API with cuBLAS HGEMM for prefill, custom quantized GEMV for decode, and an FP16 activation pipeline. Supports full GPU inference, hybrid CPU/GPU layer offloading, and KV-cache quantization.</li> <li><strong>Memory-mapped model loading</strong> - GGUF files loaded via <code class="language-plaintext highlighter-rouge">MemoryMappedFile</code>. 
OS demand-paging means multi-GB models load in milliseconds.</li> <li><strong>Quantized inference</strong> - FP16, Q8_0, Q4_K_M and other GGUF formats with fused scale-int dot-product kernels operating directly on quantized blocks.</li> </ul> <p><strong>Serving:</strong></p> <ul> <li><strong>OpenAI-compatible API</strong> - <code class="language-plaintext highlighter-rouge">/v1/chat/completions</code>, <code class="language-plaintext highlighter-rouge">/v1/completions</code>, tool calling, structured output, streaming SSE via ASP.NET.</li> <li><strong>Speculative decoding</strong> - draft-verify-accept loop with KV-cache rollback for higher throughput.</li> <li><strong>Structured output</strong> - FSM/PDA-based constrained decoding guaranteeing valid JSON, JSON Schema, regex, and GBNF grammar.</li> </ul> <p><strong>Extensibility:</strong></p> <ul> <li><strong>Pluggable backends</strong> - <code class="language-plaintext highlighter-rouge">IBackend</code> interface with separate packages (CPU, CUDA, future ROCm).</li> <li>(planned) <strong>LoRA adapters</strong> - runtime loading, no weight merging, concurrent multi-adapter serving.</li> <li>(planned) <strong>Diagnostic hooks</strong> - zero-cost <code class="language-plaintext highlighter-rouge">IInferenceHook</code> for activation capture, logit lens, SAE integration.</li> <li>(planned) <strong>OpenTelemetry</strong> - <code class="language-plaintext highlighter-rouge">System.Diagnostics.Metrics</code> + <code class="language-plaintext highlighter-rouge">Activity</code> for throughput, latency, and per-request tracing.</li> </ul> <p>Here’s a minimal streaming generation example:</p> <div class="language-csharp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="nn">var</span> <span class="n">gguf</span> <span class="p">=</span> <span class="n">GgufFile</span><span class="p">.</span><span class="nf">Open</span><span class="p">(</span><span 
class="n">modelPath</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">config</span> <span class="p">=</span> <span class="n">GgufModelConfigExtractor</span><span class="p">.</span><span class="nf">Extract</span><span class="p">(</span><span class="n">gguf</span><span class="p">.</span><span class="n">Metadata</span><span class="p">);</span>
<span class="k">using</span> <span class="nn">var</span> <span class="n">model</span> <span class="p">=</span> <span class="n">TransformerModel</span><span class="p">.</span><span class="nf">LoadFromGguf</span><span class="p">(</span><span class="n">gguf</span><span class="p">,</span> <span class="n">config</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">tokenizer</span> <span class="p">=</span> <span class="n">GgufBpeTokenizerFactory</span><span class="p">.</span><span class="nf">Load</span><span class="p">(</span><span class="n">gguf</span><span class="p">.</span><span class="n">Metadata</span><span class="p">);</span>

<span class="kt">var</span> <span class="n">generator</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">TextGenerator</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">options</span> <span class="p">=</span> <span class="k">new</span> <span class="n">InferenceOptions</span>
<span class="p">{</span>
    <span class="n">SamplerSteps</span> <span class="p">=</span>
    <span class="p">[</span>
        <span class="k">new</span> <span class="nf">TemperatureSampler</span><span class="p">(</span><span class="m">0.8f</span><span class="p">),</span>
        <span class="k">new</span> <span class="nf">TopKSampler</span><span class="p">(</span><span class="m">40</span><span class="p">),</span>
        <span class="k">new</span> <span class="nf">TopPSampler</span><span class="p">(</span><span class="m">0.95f</span><span class="p">)</span>
    <span class="p">],</span>
    <span class="n">StopConditions</span> <span class="p">=</span> <span class="p">[</span><span class="k">new</span> <span class="nf">EosStopCondition</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">EosTokenId</span><span class="p">)],</span>
    <span class="n">MaxTokens</span> <span class="p">=</span> <span class="m">128</span>
<span class="p">};</span>

<span class="k">await</span> <span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">token</span> <span class="k">in</span> <span class="n">generator</span><span class="p">.</span><span class="nf">GenerateStreamingTokensAsync</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">options</span><span class="p">))</span>
    <span class="n">Console</span><span class="p">.</span><span class="nf">Write</span><span class="p">(</span><span class="n">token</span><span class="p">.</span><span class="n">Text</span><span class="p">);</span>
</code></pre></div></div> <p>And here’s what the CLI looks like for a quick generation run with SmolLM-135M:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- dotllm | Llama 30L/576H | Q8_0 | 16 threads | greedy ──────────────────
The capital of France is Paris. Paris is a city of romance and culture,

╭──────────────────────────────────────────────────────────────────────────╮
│  Generation Complete                                      163.27 tok/s  │
│                                                                         │
│  Prefill            12.3 ms       6 tokens       487.80 tok/s           │
│  Decode             91.8 ms      15 tokens       163.40 tok/s           │
│  Sampling            0.1 ms      15 tokens                              │
│  Total             104.2 ms      21 tokens       201.54 tok/s           │
│  Load              456.7 ms                                             │
│                                                                         │
│  Weights         136.73 MiB      (memory-mapped)                        │
│  KV Cache        158.20 MiB      (192 slots)                            │
╰──────────────────────────────────────────────────────────────────────────╯
</code></pre></div></div> <p>There is also a built-in <code class="language-plaintext highlighter-rouge">serve</code> command that hosts a simple chat UI for testing and research:</p> <video autoplay="" loop="" muted="" playsinline="" class="img-fluid rounded z-depth-1" style="width: 100%;"> <source src="/assets/img/dotllm-chat-ui.mp4" type="video/mp4"/> </video> <h3 style="margin-top: 2rem;">Performance reality check</h3> <p>Let’s be honest about where dotLLM stands today. While performance has been a key design consideration from day one (zero-alloc hot paths, SIMD kernels, memory-mapped loading), the primary focus so far has been closing the feature set: getting the full inference pipeline, constrained decoding, tool calling, speculative decoding, and the API server working correctly. Dedicated performance polishing is coming. Here are CPU benchmarks against llama.cpp (AMD Ryzen 9 5950X, 16 threads, same models and quantizations):</p> <p><strong>Decode throughput (tok/s)</strong> - the metric that matters most for interactive chat:</p> <table> <thead> <tr> <th>Model</th> <th>Quant</th> <th style="text-align: right">dotLLM</th> <th style="text-align: right">llama.cpp</th> <th style="text-align: right">Ratio</th> </tr> </thead> <tbody> <tr> <td>SmolLM-135M</td> <td>Q4_K_M</td> <td style="text-align: right">279.1</td> <td style="text-align: right">334.7</td> <td style="text-align: right">0.83x</td> </tr> <tr> <td>SmolLM-135M</td> <td>Q8_0</td> <td style="text-align: right">197.7</td> <td style="text-align: right">255.9</td> <td style="text-align: right">0.77x</td> </tr> <tr> <td>Llama 3.2 1B</td> <td>Q4_K_M</td> <td style="text-align: right">32.4</td> <td style="text-align: right">48.9</td> <td style="text-align: right">0.66x</td> </tr> <tr> <td>Llama 3.2 1B</td> <td>Q8_0</td> <td style="text-align: right">25.0</td> <td style="text-align: right">31.0</td> <td style="text-align: right">0.81x</td> </tr> <tr> <td>Llama 3.2 3B</td> <td>Q4_K_M</td> <td style="text-align: 
right">15.4</td> <td style="text-align: right">19.6</td> <td style="text-align: right">0.79x</td> </tr> <tr> <td>Llama 3.2 3B</td> <td>Q8_0</td> <td style="text-align: right">9.9</td> <td style="text-align: right">11.2</td> <td style="text-align: right">0.88x</td> </tr> </tbody> </table> <p>On decode, dotLLM reaches <strong>66-88% of llama.cpp</strong> throughput. Decode is largely memory-bandwidth-bound, and C# with SIMD intrinsics can get reasonably close to saturating the memory bus.</p> <p><strong>Prefill is a different story</strong> - dotLLM is roughly 2-5x slower than llama.cpp across the board. Prefill is compute-bound, and llama.cpp has years of hand-tuned GEMM kernels. We hit a specific wall here (outer-product tiled matmul vs. RyuJIT register pressure) which I’ll describe in the lessons learned section below.</p> <p>The CUDA backend is functional but still early - it currently underperforms CPU on small models due to launch overhead, and the kernel tuning work is ongoing.</p> <p>This is a preview release. If you need maximum throughput today, llama.cpp is faster. The gap will narrow.</p> <h2 id="why-build-this">Why build this?</h2> <p>If out-of-the-box performance is not easily achieved, why build this at all? A few reasons, starting with the most important:</p> <p><strong>Understanding by building.</strong> I’ve been writing about LLM internals on this blog - <a href="/blog/2026/temperature/">temperature and sampling</a>, <a href="/blog/2026/logprobs/">logprobs</a>, learning a lot on my own. At some point, it makes you want to implement the whole pipeline. Building dotLLM was the logical next step to accelerate learning: implementing the entire inference pipeline from GGUF parsing through attention to token generation, seeing every allocation and every SIMD instruction up close. And it will generate A LOT of other blog posts…😇</p> <p><strong>Seeing how far AI-assisted development can go.</strong> This was also, explicitly, an experiment. 
<strong>Not vibe coding</strong> - not a “prompt and pray” loop. Structured, documented, reviewed AI-assisted development where a human makes the architectural decisions and AI handles implementation within well-defined boundaries. I wanted to find out what a solo developer can realistically build in one to two months with this approach. The answer surprised me.</p> <p><strong>Creating a platform for research and experimentation.</strong> Building from the ground up with this goal in mind means the architecture is open to experimentation - adding new features, deeper diagnostics, and interpretability tools into the inference pipeline. All outside the gold-standard HuggingFace monopoly. In .NET 😍</p> <p><strong>Proving the platform.</strong> There’s a persistent assumption that systems-level performance work requires C, C++, or Rust. Twenty years of .NET performance work has taught me that’s not always true. C# with <code class="language-plaintext highlighter-rouge">NativeMemory</code>, <code class="language-plaintext highlighter-rouge">System.Runtime.Intrinsics</code>, <code class="language-plaintext highlighter-rouge">MemoryMappedFile</code>, and <code class="language-plaintext highlighter-rouge">Span&lt;T&gt;</code> gives you genuine control over memory and compute. dotLLM is a proof point.</p> <p><strong>The .NET ecosystem gap is real.</strong> If you’re building .NET applications and want local LLM inference, your choices are wrappers (LLamaSharp wrapping llama.cpp), limited runtimes (ONNX Runtime with restricted model support), or orchestration layers (Semantic Kernel, which is about chaining calls, not running inference). Enterprise .NET shops that want to run models in production without Python or C++ dependencies have no native option. dotLLM fills that gap, but it will take a LOT of time before it can be treated as a serious replacement.</p> <blockquote class="block-info"> <p>dotLLM is not meant to replace llama.cpp or vLLM in production - at least not yet. 
It’s built for .NET developers who want native inference without leaving their ecosystem, and for researchers and experimenters who want to explore LLM internals from C#. And for everyone who wants to understand how to build an LLM inference engine from scratch.</p> </blockquote> <h2 id="how-ai-built-an-ai-engine">How AI built an AI engine</h2> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/dotllm-git-log-480.webp 480w,/assets/img/dotllm-git-log-800.webp 800w,/assets/img/dotllm-git-log-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/dotllm-git-log.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>Let me be upfront: nearly every commit in dotLLM’s git history has a <code class="language-plaintext highlighter-rouge">Co-Authored-By: Claude Opus 4.6</code> line. This project would not exist in its current form without AI assistance. But the story of <em>how</em> that assistance was structured is more interesting than the fact that it was used.</p> <h3 id="the-development-methodology">The development methodology</h3> <p>dotLLM was built over ~60 implementation steps organized into 7 phases, documented in a detailed <a href="https://github.com/kkokosa/dotLLM/blob/main/docs/ROADMAP.md"><code class="language-plaintext highlighter-rouge">ROADMAP.md</code></a>. 
Each phase was a discrete unit of work with clear scope and acceptance criteria:</p> <ul> <li><strong>Phase 1</strong> - End-to-end single-token generation (GGUF loader, dequantization, CPU ops, tokenizer, attention, KV-cache, sampling)</li> <li><strong>Phase 2</strong> - Practical local inference (Q4_K_M, chat templates, streaming, multi-threading, additional architectures)</li> <li><strong>Phase 3</strong> - CPU performance (tiled attention, SIMD tuning, NUMA awareness, operator fusion)</li> <li><strong>Phase 4</strong> - GPU acceleration (CUDA backend, hybrid CPU/GPU, KV-cache quantization)</li> <li><strong>Phase 5</strong> - Constrained decoding and API (JSON/schema/regex/grammar, tool calling, server, chat UI, prompt caching)</li> <li><strong>Phase 6</strong> - Improved serving (warm-up, Native AOT, paged KV-cache, speculative decoding)</li> <li><strong>Phase 7</strong> - Diagnostics and interpretability (logprobs, hooks, logit lens - in progress)</li> </ul> <p>Every step started as a GitHub issue. Every issue lived on a branch named <code class="language-plaintext highlighter-rouge">issue/{number}-{short-description}</code>. Every PR closed its issue and updated the roadmap. 
This was relentlessly boring discipline, and it was the single most important factor in the project’s success.</p> <p>After the initial release, just before making the repository public, I also ran a series of “Waves” - systematic quality passes across the entire codebase:</p> <ul> <li><strong>Wave 1</strong> (P0): Security and crash fixes - path traversal, CUDA shared memory guards, hybrid GPU edge cases</li> <li><strong>Wave 2</strong> (P1): Quick correctness and consistency fixes</li> <li><strong>Wave 3</strong>: Presentation cleanup - remove dead code, label stubs, fix samples</li> <li><strong>Wave 4</strong>: Server hardening - request validation, LINQ removal from hot paths</li> <li><strong>Wave 5</strong> was skipped - it was earmarked for batch-serving improvements that depend on Phase 9 (continuous batching), which isn’t implemented yet</li> <li><strong>Wave 6</strong>: CUDA kernel rewrite - tiled softmax, vectorization, grid-stride loops</li> <li><strong>Wave 7</strong>: CPU performance - TopK sampler optimization, AVX2 gap filling, schema cache tuning</li> </ul> <p>Those “waves” are a bunch of findings that come from in-depth reviews from other models, grouped into GitHub issues.</p> <h3 id="roadmapmd-and-claudemd---the-highest-roi-investments"><code class="language-plaintext highlighter-rouge">ROADMAP.md</code> and <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> - the highest-ROI investments</h3> <p>If there’s one takeaway from this experiment, it’s this: <strong>the time you spend writing structured documentation for AI is not overhead - it IS the development methodology.</strong></p> <p><code class="language-plaintext highlighter-rouge">ROADMAP.md</code> was the backbone. Each of the ~60 steps had a feature name, a description, key files to modify, and dependencies on other steps. This gave both me and the AI a shared understanding of what to build next, in what order, and why. 
Without it, AI would be coding in circles - solving the wrong problem, building features in the wrong order, missing dependencies.</p> <p>The roadmap also forced me to think through the architecture upfront. When you have to discuss things like <em>“Step 31: CUDA backend - PTX kernels loaded via CUDA Driver API, no native shared library, cuBLAS HGEMM for prefill, custom quantized GEMV for decode”</em> before writing any code, you’ve already made the hard decisions. And learnt a lot.</p> <p><code class="language-plaintext highlighter-rouge">CLAUDE.md</code> was the project’s “constitution” - 180+ lines defining how AI should work in this codebase. Here are some actual rules from it:</p> <div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gs">**Native .NET first**</span> - All orchestration, model loading, tokenization, sampling, scheduling, CPU compute in pure C#.

<span class="gs">**Unmanaged memory for tensors**</span> - <span class="sb">`NativeMemory.AlignedAlloc`</span> (64-byte). Zero GC allocations on inference hot path.

<span class="gs">**Hybrid GPU architecture**</span> - Thin native C/CUDA lib via <span class="sb">`[LibraryImport]`</span>. GPU memory as opaque <span class="sb">`IntPtr`</span> - tensor data never crosses P/Invoke boundary.
</code></pre></div></div> <p>And specific coding rules:</p> <div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gs">**NEVER**</span> allocate managed arrays for tensor data. Use <span class="sb">`NativeMemory.AlignedAlloc`</span> (64-byte for AVX-512, 32-byte for AVX2).

SIMD: Foundation is <span class="sb">`System.Numerics.Tensors.TensorPrimitives`</span> for standard ops. Hot inner loops: <span class="sb">`System.Runtime.Intrinsics`</span> - prefer cross-platform <span class="sb">`Vector128&lt;T&gt;`</span>/<span class="sb">`Vector256&lt;T&gt;`</span>, use platform-specific only when measurably faster. <span class="ge">**</span>Always provide scalar fallback.
</code></pre></div></div> <p>Beyond <code class="language-plaintext highlighter-rouge">CLAUDE.md</code>, there were <strong>22 detailed design documents</strong> in <code class="language-plaintext highlighter-rouge">/docs/</code> - one for each major subsystem like <code class="language-plaintext highlighter-rouge">ARCHITECTURE</code>, <code class="language-plaintext highlighter-rouge">QUANTIZATION</code>, <code class="language-plaintext highlighter-rouge">ATTENTION</code>, <code class="language-plaintext highlighter-rouge">CUDA</code>, and more. The rule was simple: AI reads the relevant spec before touching a module.</p> <p>This <strong>documentation-first</strong> approach had a compound effect. Every implementation step could reference the roadmap for scope, the design docs for architecture, and <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> for coding conventions. The AI wasn’t guessing - it was implementing within well-defined constraints.</p> <h3 id="claude-code-as-implementation-partner">Claude Code as implementation partner</h3> <p><a href="https://claude.ai/code">Claude Code</a> with Opus 4.6 (1M context) was the primary implementation tool from the project start. 
The workflow was built around six custom <a href="https://docs.anthropic.com/en/docs/claude-code/skills">Claude Code skills</a> that automated the development lifecycle:</p> <ul> <li><strong><code class="language-plaintext highlighter-rouge">/plan-step</code></strong> - looks up a given roadmap step in <code class="language-plaintext highlighter-rouge">ROADMAP.md</code> plus relevant docs from <code class="language-plaintext highlighter-rouge">/docs/</code>, enters plan mode, and produces a step-by-step implementation plan for my approval before any code is written.</li> <li><strong><code class="language-plaintext highlighter-rouge">/create-pr</code></strong> - commits remaining changes, pushes the branch, and creates a PR with a detailed description following project conventions.</li> <li><strong><code class="language-plaintext highlighter-rouge">/apply-pr-comments</code></strong> - reads review comments from Codex, Gemini, or human reviewers, analyzes them, and enters plan mode so I can approve the fixes before any code changes.</li> <li><strong><code class="language-plaintext highlighter-rouge">/finish-pr-comments</code></strong> - after fixes are applied and tested, commits the changes, pushes to the PR branch, and replies to each reviewer comment with what was fixed and the commit hash.</li> <li><strong><code class="language-plaintext highlighter-rouge">/merge-pr</code></strong> - squash-merges the PR into main, deletes the remote branch, and switches to an updated local main.</li> <li><strong><code class="language-plaintext highlighter-rouge">/plan-issue</code></strong> - similar to <code class="language-plaintext highlighter-rouge">plan-step</code>, but for the less frequent case where we start from an issue rather than a roadmap step.</li> </ul> <p>There was also a GitHub Actions workflow that responded to <code class="language-plaintext highlighter-rouge">@claude</code> mentions in PRs and issues, enabling asynchronous interaction.</p> <p>The typical flow for a feature 
looked like this: I’d run <code class="language-plaintext highlighter-rouge">/plan-step</code> (or <code class="language-plaintext highlighter-rouge">/plan-issue</code>), review and adjust the plan, then let Claude implement it step by step while I reviewed each change. The key insight is that <strong>planning was always a separate, human-approved step</strong> before implementation began. Most of the work and brainstorming happened there.</p> <h3 id="codex-and-gemini-as-pr-reviewers">Codex and Gemini as PR reviewers</h3> <p>Every PR was also reviewed by <a href="https://chatgpt.com/codex">Codex</a> and <a href="https://ai.google.dev/">Gemini</a> triggered manually via mentions of <code class="language-plaintext highlighter-rouge">@codex</code> and <code class="language-plaintext highlighter-rouge">@gemini</code> in PR comments.</p> <p>Gemini was powered by a custom Python bot (<a href="https://github.com/kkokosa/dotLLM/blob/main/.github/scripts/gemini_bot.py"><code class="language-plaintext highlighter-rouge">.github/scripts/gemini_bot.py</code></a>) with retry logic and configurable thinking budgets. A separate <a href="https://github.com/kkokosa/dotLLM/blob/main/GEMINI.md">GEMINI.md</a> file defined its review persona.</p> <p>The review findings were not cosmetic. They caught genuinely critical bugs that could have shipped. 
A few examples:</p> <ul> <li><strong>KV-cache quantization</strong> (PR #75): Codex caught a <strong>ring-buffer indexing bug</strong> (window reads used linear indexing instead of ring indices, producing garbage after wrap-around), a <strong>pinned buffer scope issue</strong> (pointers from <code class="language-plaintext highlighter-rouge">fixed</code> blocks used after the scope exited), and a <strong>shared-state race condition</strong> (per-layer eviction progress stored in a shared counter)</li> <li><strong>JSON Schema constrained decoding</strong> (PR #79): reviewers found a <strong>cache key collision</strong> (string substates not included in the hash, collapsing distinct parser states) and a <strong>unicode escape flag preservation bug</strong> (<code class="language-plaintext highlighter-rouge">\u</code> parsing wiped the key-string flag)</li> <li><strong>Wave 6 CUDA kernel rewrite</strong> (PR #114): Gemini identified thread underutilization in GEMV kernels and uncoalesced memory reads - architectural issues that Codex’s code-level analysis didn’t surface</li> </ul> <p>Once Codex and Gemini leave their comments, the remaining skills close the loop. <code class="language-plaintext highlighter-rouge">/apply-pr-comments</code> reads all review comments on the current PR, analyzes them, and enters plan mode, so I can approve which fixes to make before any code changes. 
This prevents suggestions from being applied blindly, without human judgment on what’s worth addressing versus deferring.</p> <p>After the fixes are implemented and tested, the straightforward <code class="language-plaintext highlighter-rouge">/finish-pr-comments</code> commits the changes, pushes to the PR branch, and has the Claude Code bot (<code class="language-plaintext highlighter-rouge">dotllm-claude-code-bot</code>) reply to each reviewer comment with what was fixed and the corresponding commit hash.</p> <p>This creates a <strong>fully traceable chain</strong>: Codex/Gemini finds a bug -&gt; Claude fixes it -&gt; the reply references the exact commit. The PR thread becomes a complete audit trail.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/dotllm-codex-review-480.webp 480w,/assets/img/dotllm-codex-review-800.webp 800w,/assets/img/dotllm-codex-review-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/dotllm-codex-review.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>Finally, <code class="language-plaintext highlighter-rouge">/merge-pr</code> squash-merges the PR into main, deletes the remote branch, and checks out an updated local main. The entire cycle - from roadmap step to merged PR - typically took one Claude Code session.</p> <h3 id="lessons-learned">Lessons learned</h3> <p><strong>What worked brilliantly:</strong></p> <ul> <li><strong>The implementation is correct.</strong> This is the most obvious finding, and so the easiest to overlook. But here it is - this setup really worked! Step by step, we progressed with the implementation, making it more and more sophisticated. And the moment I first saw an answer from a real model felt like magic! 
Reaching 80-90% of llama.cpp CPU decode performance is a thing, too.</li> <li><strong><code class="language-plaintext highlighter-rouge">ROADMAP.md</code> was the highest-ROI time investment.</strong> Planning structured documentation saved many hours of correcting misdirected AI implementation. The documentation <em>is</em> the development methodology - it’s not overhead, it’s the thing that makes AI-assisted development work at all. Without the roadmap, the AI has no sense of direction.</li> <li><strong>Two-role separation: implementation vs. review.</strong> Claude wrote the code. Codex and Gemini reviewed it independently. The AI that writes the code should not be the only AI that reviews it. Different models have genuinely different blind spots - Codex and Gemini almost never flagged the same issues.</li> <li><strong>No <a href="https://github.com/anthropics/claude-code/blob/main/plugins/ralph-wiggum/README.md">Ralph Wiggum loop</a> or other YOLO mode.</strong> I really wanted to <em>drive</em> the work, not <em>fire &amp; forget</em> it to burn tokens. It worked out perfectly for me. A single task took anywhere from several minutes up to 30-40, so you could return to it from time to time, during breaks while waiting for something else. Great background work.</li> </ul> <p><strong>Where AI struggled:</strong></p> <ul> <li><strong>Fundamental architectural decisions.</strong> Which attention strategy to use, how to structure the GPU interop layer, whether to use PagedAttention kernels or staging-buffer gather - these required human judgment informed by comparing to llama.cpp, vLLM, and the research literature. AI can implement an architecture, but it’s less reliable at choosing between architectures.</li> <li><strong>The prefill throughput gap.</strong> This is the biggest remaining performance wall - but to be fair, it’s not really an AI failure per se. It might be AI struggling to work around the platform’s out-of-the-box behaviours. 
The 2-5x prefill gap you saw in the benchmarks comes down to one unfinished roadmap step: Step 26, the outer-product tiled matmul. The current batched GEMM (Step 11) uses an inner-product formulation - it tiles over M but still computes one output element at a time. It delivered a 2.2x prefill speedup when implemented. But the “outer-product” approach would share one activation load across all weight-row dot products: a 4x3 AVX2 tile with 12 YMM accumulators. Problem: due to AI that tile needs ~23 YMM registers, but AVX2 only has 16. RyuJIT spills to stack, and the spills eat the reuse benefit the tile was designed to get. AI was actually <em>very good</em> at experimenting here - it iterated through multiple tile sizes, inspected the JIT disassembly for spills, and methodically narrowed down the root cause. The options now seem to be: AVX-512 (32 ZMM registers, fits comfortably), a native C microkernel via <code class="language-plaintext highlighter-rouge">[LibraryImport]</code> (what llamafile/tinyBLAS does), or a smaller tile with partial reuse. I’ve left this for a much more manual, still AI-assisted, research phase - it needs careful benchmarking and disassembly analysis rather than the “plan step, implement, make PR” cadence that worked for the rest of the project.</li> <li><strong>Keeping it within the guardrails.</strong> Small but very annoying. No matter how many times, and in how many places, I ask it to avoid compound tool calls (esp. <code class="language-plaintext highlighter-rouge">cd D:\github\ai\dotLLM &amp;&amp; git ...</code>), it eventually ignores the request, breaking the permissions. It breaks the flow: instead of working autonomously, it gets stuck waiting for permission for tools that should otherwise be accepted (like allowed GitHub read-only commands etc.). Now I think I need a pre-tool hook that checks that💡</li> <li><strong>Claude Code got really stuck 2-3 times.</strong> There were a few times when implementing a step took a lot of struggle. 
Implementing, retrying, failing, in a loop that took literally hours (!). But even more surprising is the fact that he eventually did it. With some of my help and brainstorming, but it was mostly his async work trying to figure it out.</li> </ul> <h2 id="whats-next">What’s next</h2> <p>dotLLM is at v0.1.0-preview.2 - explicitly a preview. The foundations are solid, but there’s a long road ahead:</p> <ul> <li><strong>Phase 7</strong> (in progress): Diagnostic hooks, logit lens, Sparse Autoencoders (SAE) integration, LoRA adapters</li> <li><strong>Phase 8</strong> (planned): MLA attention (DeepSeek), SmolLM3, Gemma 4, Mixture of Experts</li> <li><strong>Phase 9</strong> (planned): Production serving - continuous batching, prefix sharing, advanced scheduling. This will be fun🔥</li> </ul> <p>After those Phases I will consider feature set (mostly) done and start to focus on performance more and more.</p> <p>BTW, the project is <a href="https://github.com/kkokosa/dotLLM/blob/main/LICENSE">GPLv3</a> and contributions are welcome! There is still a lot to do. The codebase has 22 design docs, a detailed roadmap, and a CLAUDE.md that makes it easy for both human and AI contributors to get oriented quickly.</p> <ul> <li><strong>GitHub</strong>: <a href="https://github.com/kkokosa/dotLLM">github.com/kkokosa/dotLLM</a></li> <li><strong>Website</strong>: <a href="https://dotllm.dev/">dotllm.dev</a></li> <li><strong>NuGet packages</strong>: <code class="language-plaintext highlighter-rouge">DotLLM.Engine</code>, <code class="language-plaintext highlighter-rouge">DotLLM.Cpu</code>, <code class="language-plaintext highlighter-rouge">DotLLM.Cuda</code>, <code class="language-plaintext highlighter-rouge">DotLLM.Server</code>, and more</li> <li><strong>Discussions</strong>: <a href="https://github.com/kkokosa/dotLLM/discussions">GitHub Discussions</a></li> </ul> <h2 id="closing">Closing</h2> <p>Two things are true at once. First: .NET can do native, systems-level AI work. 
Zero-GC inference, SIMD-vectorized kernels, memory-mapped model loading, paged KV-cache, speculative decoding - all in C#. The platform is more capable than many give it credit for.</p> <p>Second: a solo developer can build something of this scope in two months with AI assistance - but only with relentless structure. The roadmap, the design docs, the CLAUDE.md constitution, the dual-review workflow - take any of these away and the productivity collapses. AI amplifies discipline; it doesn’t replace it.</p> <p>If you’re a .NET developer curious about LLM inference, or a researcher who wants to explore model internals from C#, <a href="https://github.com/kkokosa/dotLLM">give dotLLM a try</a>. File issues. Break things. Tell me what’s missing.</p>]]></content><author><name></name></author><category term="llm"/><category term="llm"/><category term="architecture"/><category term="tools"/><summary type="html"><![CDATA[How I built a ground-up LLM inference engine in .NET 10, and what I learned about AI-assisted (not vibe-coded) development along the way.]]></summary></entry><entry><title type="html">Visualizing logprobs from OpenAI responses</title><link href="https://kokosa.dev/blog/2026/logprobs/" rel="alternate" type="text/html" title="Visualizing logprobs from OpenAI responses"/><published>2026-02-03T09:00:00+01:00</published><updated>2026-02-03T09:00:00+01:00</updated><id>https://kokosa.dev/blog/2026/logprobs</id><content type="html" xml:base="https://kokosa.dev/blog/2026/logprobs/"><![CDATA[<p>In the <a href="/blog/2026/temperature/">previous post</a> we explored what <strong>logits</strong>, <strong>logprobs</strong> and <strong>temperature</strong> are. We learned that LLMs are actually outputting a probability distribution for every single token they generate.</p> <p>But looking at raw numbers and graphs is boring. 
Let’s see how we can use this data to visualize the model’s “confidence” and at least try to spot places where it is “unsure”.</p> <h3 id="getting-logprobs-from-api">Getting Logprobs from API</h3> <p>If you are using the OpenAI API, you need to set two parameters: <code class="language-plaintext highlighter-rouge">logprobs</code> to <code class="language-plaintext highlighter-rouge">true</code> and <code class="language-plaintext highlighter-rouge">top_logprobs</code> to some integer $k$ (e.g. <code class="language-plaintext highlighter-rouge">2</code> or <code class="language-plaintext highlighter-rouge">5</code>) to see the top-$k$ alternatives. The information is provided for <strong>all</strong> tokens in the response, as the model saw them at the moment each token was generated.</p> <p>Here is what a raw sample HTTP request could look like for the question <em>“Who is the President of Poland?”</em>:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl https://api.openai.com/v1/chat/completions <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
  <span class="nt">-H</span> <span class="s2">"Authorization: Bearer </span><span class="nv">$OPENAI_API_KEY</span><span class="s2">"</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{
    "model": "gpt-4o",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who is the president of Poland?"
      }
    ],
    "logprobs": true,
    "top_logprobs": 2
  }'</span>
</code></pre></div></div> <p>And here is a relevant snippet of the JSON response we get back. Notice the <code class="language-plaintext highlighter-rouge">logprobs</code> field inside <code class="language-plaintext highlighter-rouge">choices</code>:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"chatcmpl-123..."</span><span class="p">,</span><span class="w">
  </span><span class="nl">"object"</span><span class="p">:</span><span class="w"> </span><span class="s2">"chat.completion"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"created"</span><span class="p">:</span><span class="w"> </span><span class="mi">1677652288</span><span class="p">,</span><span class="w">
  </span><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gpt-4o-2024-08-06"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"choices"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"index"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
      </span><span class="nl">"message"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"assistant"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Andrzej Duda"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"refusal"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
        </span><span class="nl">"annotations"</span><span class="p">:</span><span class="w"> </span><span class="p">[]</span><span class="w">        
      </span><span class="p">},</span><span class="w">
      </span><span class="nl">"logprobs"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
          </span><span class="p">{</span><span class="w">
            </span><span class="nl">"token"</span><span class="p">:</span><span class="w"> </span><span class="s2">"And"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"logprob"</span><span class="p">:</span><span class="w"> </span><span class="mf">-0.000035</span><span class="p">,</span><span class="w">
            </span><span class="nl">"top_logprobs"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> 
              </span><span class="p">{</span><span class="w"> </span><span class="nl">"token"</span><span class="p">:</span><span class="w"> </span><span class="s2">"And"</span><span class="p">,</span><span class="w"> </span><span class="nl">"logprob"</span><span class="p">:</span><span class="w"> </span><span class="mf">-0.000035</span><span class="w"> </span><span class="p">},</span><span class="w">
              </span><span class="p">{</span><span class="w"> </span><span class="nl">"token"</span><span class="p">:</span><span class="w"> </span><span class="s2">"The"</span><span class="p">,</span><span class="nl">"logprob"</span><span class="p">:</span><span class="w"> </span><span class="mf">-11.4</span><span class="w"> </span><span class="p">}</span><span class="w">
            </span><span class="p">]</span><span class="w">
          </span><span class="p">},</span><span class="w">
          </span><span class="p">{</span><span class="w">
            </span><span class="nl">"token"</span><span class="p">:</span><span class="w"> </span><span class="s2">"rzej"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"logprob"</span><span class="p">:</span><span class="w"> </span><span class="mf">-0.00021</span><span class="p">,</span><span class="w">
            </span><span class="nl">"top_logprobs"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
              </span><span class="p">{</span><span class="w"> </span><span class="nl">"token"</span><span class="p">:</span><span class="w"> </span><span class="s2">"rzej"</span><span class="p">,</span><span class="w"> </span><span class="nl">"logprob"</span><span class="p">:</span><span class="w"> </span><span class="mf">-0.00021</span><span class="w"> </span><span class="p">},</span><span class="w">
              </span><span class="p">{</span><span class="w"> </span><span class="nl">"token"</span><span class="p">:</span><span class="w"> </span><span class="s2">"rew"</span><span class="p">,</span><span class="w">  </span><span class="nl">"logprob"</span><span class="p">:</span><span class="w"> </span><span class="mf">-9.1</span><span class="w"> </span><span class="p">}</span><span class="w">
            </span><span class="p">]</span><span class="w">
          </span><span class="p">},</span><span class="w">
          </span><span class="p">{</span><span class="w">
            </span><span class="nl">"token"</span><span class="p">:</span><span class="w"> </span><span class="s2">" D"</span><span class="p">,</span><span class="w"> </span><span class="nl">"logprob"</span><span class="p">:</span><span class="w"> </span><span class="mf">-0.004</span><span class="p">,</span><span class="w">
            </span><span class="nl">"top_logprobs"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
              </span><span class="p">{</span><span class="w"> </span><span class="nl">"token"</span><span class="p">:</span><span class="w"> </span><span class="s2">" D"</span><span class="p">,</span><span class="w"> </span><span class="nl">"logprob"</span><span class="p">:</span><span class="w"> </span><span class="mf">-0.004</span><span class="w"> </span><span class="p">},</span><span class="w">
              </span><span class="p">{</span><span class="w"> </span><span class="nl">"token"</span><span class="p">:</span><span class="w"> </span><span class="s2">" Du"</span><span class="p">,</span><span class="w"> </span><span class="nl">"logprob"</span><span class="p">:</span><span class="w"> </span><span class="mf">-6.2</span><span class="w"> </span><span class="p">}</span><span class="w">
            </span><span class="p">]</span><span class="w">
          </span><span class="p">},</span><span class="w">
          </span><span class="p">{</span><span class="w">
            </span><span class="nl">"token"</span><span class="p">:</span><span class="w"> </span><span class="s2">"uda"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"logprob"</span><span class="p">:</span><span class="w"> </span><span class="mf">-0.0012</span><span class="p">,</span><span class="w">
            </span><span class="nl">"top_logprobs"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
              </span><span class="p">{</span><span class="w"> </span><span class="nl">"token"</span><span class="p">:</span><span class="w"> </span><span class="s2">"uda"</span><span class="p">,</span><span class="w"> </span><span class="nl">"logprob"</span><span class="p">:</span><span class="w"> </span><span class="mf">-0.0012</span><span class="w"> </span><span class="p">},</span><span class="w">
              </span><span class="p">{</span><span class="w"> </span><span class="nl">"token"</span><span class="p">:</span><span class="w"> </span><span class="s2">"uda"</span><span class="p">,</span><span class="w"> </span><span class="nl">"logprob"</span><span class="p">:</span><span class="w"> </span><span class="mf">-7.5</span><span class="w"> </span><span class="p">}</span><span class="w">
            </span><span class="p">]</span><span class="w">
          </span><span class="p">}</span><span class="w">
        </span><span class="p">]</span><span class="w">
      </span><span class="p">},</span><span class="w">
      </span><span class="nl">"finish_reason"</span><span class="p">:</span><span class="w"> </span><span class="s2">"stop"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">],</span><span class="w">
  </span><span class="err">...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>The API returns <strong>logprobs</strong> (logarithmic probabilities), which are literally the natural logarithm of the probability $ p $. To get the actual probability percentage, we just need to calculate the exponential:</p> \[p = e^{\text{logprob}} \times 100\%\] <p>For example, a logprob of <code class="language-plaintext highlighter-rouge">-0.004</code> corresponds to $ e^{-0.004} \approx 0.996 $, or <strong>99.6%</strong>, while <code class="language-plaintext highlighter-rouge">-6.2</code> corresponds to $ e^{-6.2} \approx 0.002 $, which is as small as <strong>0.2%</strong>.</p> <p>Let’s visualize the probabilities for the “Andrzej Duda” tokens. In this case, the model is nearly 100% certain, and the second-best alternatives have almost zero probability.</p> <pre><code class="language-plotly">{
  "data": [
    {
      "x": ["And", "rzej", " D", "uda"],
      "y": [99.99, 99.98, 99.60, 99.88],
      "type": "bar",
      "marker": { "color": "#2ca02c" },
      "name": "Top Choice (%)"
    },
    {
      "x": ["And", "rzej", " D", "uda"],
      "y": [0.001, 0.01, 0.20, 0.05],
      "type": "bar",
      "marker": { "color": "#d62728" },
      "name": "Second Choice (%)"
    }
  ],
  "layout": {
    "title": "Top vs Second Token Probability",
    "barmode": "group",
    "yaxis": { "title": "Probability (%)", "range": [0, 110] },
    "paper_bgcolor": "rgba(0,0,0,0)",
    "plot_bgcolor": "rgba(0,0,0,0)"
  },
  "config": {
    "displayModeBar": false,
    "responsive": true
  }
}
</code></pre> <p>The model is very <strong>confident</strong> about the “Andrzej Duda” tokens in the response!</p> <blockquote class="block-info"> <p>I specifically chose this example to illustrate an important point: the model’s <strong>certainty</strong> about a given fact does not mean that it is a <strong>correct</strong> answer or <strong>not a hallucination</strong>. After the recent elections in 2025, Karol Nawrocki is the President of Poland. However, based on the data available to the model (and no RAG for up-to-date info), the model is certain that Andrzej Duda is still the President.</p> </blockquote> <h3 id="visualizing-token-clues">Visualizing “Token Clues”</h3> <p>Nevertheless, it may be interesting to see how confident the model is about what <em>it thinks is true</em>. We could visualize some “token clues” to get a grasp of its internal thinking and spot some interesting tokens.</p> <p>I’ve implemented a simple <a href="https://github.com/kkokosa/devblog-code/tree/main/logprobs">Logprobs app</a> in C# to play with this. Here are the three main cues I found useful:</p> <h4 id="1-low-confidence-clue-uncertainty">1. Low Confidence Clue (“uncertainty”)</h4> <p>If the probability of the chosen token is low (e.g., $&lt; 10\%$ ), the pick was something of a lucky draw - apparently, due to the temperature, a less probable token was selected. Either way, the result is low confidence in what the model has selected. Our first condition can be:</p> \[p(token) &lt; 0.1\] <h4 id="2-ambiguity-close-alternatives">2. Ambiguity (close alternatives)</h4> <p>Sometimes the model may be fairly confident, but there are other options that are just as good. I would call it <em>ambiguous</em> if the model can choose from two or more tokens with similar probabilities. Our simplified condition can be:</p> \[p(top\_token) - p(second\_token) &lt; 0.15\] <h4 id="3-not-top-choice-sampling-effect">3. 
Not Top Choice (sampling effect)</h4> <p>If you use <code class="language-plaintext highlighter-rouge">temperature &gt; 0</code>, the model might sample a token that <strong>wasn’t</strong> the most probable one. This is a clear signal that the output is purely a result of sampling randomness (“creativity”). The condition would be:</p> \[token_{chosen} \neq token_{top}\] <h3 id="putting-it-all-together">Putting it all together</h3> <p>By highlighting text with these cues, we can get an “X-Ray” view of the generation.</p> <ul> <li><strong>Green</strong>: confidence &gt; 90%</li> <li><strong>Lime</strong>: confidence &gt; 70%</li> <li><strong>Yellow</strong>: confidence &gt; 50%</li> <li><strong>Orange</strong>: confidence &gt; 30%</li> <li><strong>Red</strong>: Low confidence (&lt;30%)</li> <li><strong>Annotations</strong>: <code class="language-plaintext highlighter-rouge">🔀</code> as Ambiguous (point 2.), <code class="language-plaintext highlighter-rouge">🎯</code> as Not Top (point 3.)</li> </ul> <p>Imagine asking the model: <em>“Who is the president of Poland?”</em></p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/logprobs_query01-480.webp 480w,/assets/img/logprobs_query01-800.webp 800w,/assets/img/logprobs_query01-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/logprobs_query01.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>Here, even a factual answer starts with some ambiguity because the model has multiple ways to phrase its knowledge cutoff disclaimer. 
Also, we can see the sentence starts as <em>“As of October”…</em> but with 75% probability it could have been <em>“As of my”…</em>.</p> <p>The tool visualizes the top-$k$ tokens, so the beginning of the sentence looks like this in detail:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/logprobs_query02-480.webp 480w,/assets/img/logprobs_query02-800.webp 800w,/assets/img/logprobs_query02-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/logprobs_query02.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>Of course, we can expect some ambiguity when asking certain questions, like <em>“Tabs or spaces? Single word answer.”</em>:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/logprobs_query03-480.webp 480w,/assets/img/logprobs_query03-800.webp 800w,/assets/img/logprobs_query03-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/logprobs_query03.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>Clearly it prefers spaces…😇</p> <p>Or we can observe the <a href="https://x.com/infobeautiful/status/1783132545887953369">famous non-randomness</a> of random numbers:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/logprobs_query04-480.webp 480w,/assets/img/logprobs_query04-800.webp 800w,/assets/img/logprobs_query04-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/logprobs_query04.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>Or some interesting “biases”:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/logprobs_query05-480.webp 
480w,/assets/img/logprobs_query05-800.webp 800w,/assets/img/logprobs_query05-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/logprobs_query05.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>Clearly, for more complex queries the answer will have many more clues like this, but it is still interesting to observe:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/logprobs_query06-480.webp 480w,/assets/img/logprobs_query06-800.webp 800w,/assets/img/logprobs_query06-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/logprobs_query06.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p><em>Note: Here, for example, we can observe that with a 33% chance it was going to use Markdown bold <code class="language-plaintext highlighter-rouge">**</code> when listing cat names.</em></p> <h3 id="a-note-on-hugging-face">A Note on Hugging Face</h3> <p>If you are running open-weights models locally (or in the cloud) using the <strong>Hugging Face</strong> <code class="language-plaintext highlighter-rouge">transformers</code> library, you can access the same information.</p> <p>When calling <code class="language-plaintext highlighter-rouge">generate()</code>, simply set <code class="language-plaintext highlighter-rouge">return_dict_in_generate=True</code> and <code class="language-plaintext highlighter-rouge">output_scores=True</code>. The returned object will contain <code class="language-plaintext highlighter-rouge">scores</code>, which are the <strong>logits</strong> for each generated token (after any generation-time processing, such as temperature scaling). 
You can then apply <code class="language-plaintext highlighter-rouge">softmax</code> to convert them into probabilities.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="nf">generate</span><span class="p">(</span>
    <span class="n">inputs</span><span class="p">,</span> 
    <span class="n">return_dict_in_generate</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> 
    <span class="n">output_scores</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
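# A hedged sketch, not part of the original snippet: a plain-Python softmax
# (the helper name is an assumption) that turns one step's scores into probabilities.
# On tensors, torch.softmax(step, dim=-1) does the same job.
import math

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Possible usage for the first sequence in the batch (assumes PyTorch tensors):
# probs_per_token = [softmax(step[0].tolist()) for step in outputs.scores]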
<span class="c1"># outputs.scores contains logits for each generated token
</span></code></pre></div></div> <h3 id="conclusion">Conclusion</h3> <p>Logprobs give us a peek under the hood. They turn a “black box” text generator into a slightly more transparent probabilistic machine. They won’t help you fight hallucinations (as we saw with the Andrzej Duda example), but at least they can give some clues about what the model thinks!</p>]]></content><author><name></name></author><category term="llm"/><category term="llm"/><category term="basics"/><category term="tools"/><category term="experiments"/><summary type="html"><![CDATA[Let's visualize how confident an LLM is about its own thoughts.]]></summary></entry><entry><title type="html">Logits, logprobs, and temperature</title><link href="https://kokosa.dev/blog/2026/temperature/" rel="alternate" type="text/html" title="Logits, logprobs, and temperature"/><published>2026-01-26T15:00:00+01:00</published><updated>2026-01-26T15:00:00+01:00</updated><id>https://kokosa.dev/blog/2026/temperature</id><content type="html" xml:base="https://kokosa.dev/blog/2026/temperature/"><![CDATA[<h3 id="vocabulary">Vocabulary</h3> <p>LLMs are smart “next token predictors”. Token by token, they generate responses, based on the probabilities of what token may come next, out of all tokens in the model’s <strong>vocabulary</strong>.</p> <p>For example, <code class="language-plaintext highlighter-rouge">gpt-4</code> has a vocabulary of 100278 tokens, each with its own ID (position) and value. We can see the vocabulary as a very high-dimensional vector:</p> \[\mathbf{v} \in \mathbb{R}^{100278} = \overbrace{ [ \underbrace{\text{!}}_{id_0}, \underbrace{\text{'}}_{id_1}, \underbrace{\text{#}}_{id_2}, \cdots, \underbrace{\text{A}}_{id_{32}}, \underbrace{\text{B}}_{id_{33}}, \underbrace{\text{C}}_{id_{34}}, \cdots, \underbrace{\text{ webinars}}_{id_{100275}}, \underbrace{\text{gard}}_{id_{100276}}, \underbrace{\text{гӡ}}_{id_{100277}} ] }^{\text{100278 dimensions}}\] <p>Tokens have very different values. 
There are obvious ones like single letters, digits or special characters (as we see <code class="language-plaintext highlighter-rouge">!</code>, <code class="language-plaintext highlighter-rouge">#</code>, <code class="language-plaintext highlighter-rouge">A</code>, or <code class="language-plaintext highlighter-rouge">B</code> above), whole words (like <code class="language-plaintext highlighter-rouge">cat</code> or <code class="language-plaintext highlighter-rouge">window</code>) or parts (like <code class="language-plaintext highlighter-rouge">ing</code>, <code class="language-plaintext highlighter-rouge">urity</code>). Tokens are also case sensitive, and many of them have a version with a leading space. Thus, we can have four different tokens like <code class="language-plaintext highlighter-rouge">window</code>, <code class="language-plaintext highlighter-rouge">Window</code>, <code class="language-plaintext highlighter-rouge">_window</code> and <code class="language-plaintext highlighter-rouge">_Window</code> (I’ve used _ to denote whitespace for clarity).</p> <p>There are also some surprises hiding there. For example, the longest token in <code class="language-plaintext highlighter-rouge">gpt-4</code> represents 128 empty space (<code class="language-plaintext highlighter-rouge"> </code>) characters. And there are many similar, like with 96, 88 or 80 <code class="language-plaintext highlighter-rouge">*</code> stars (yes, coming from training on source code and other formatted texts). And the longest “word” is <code class="language-plaintext highlighter-rouge">.translatesAutoresizingMaskIntoConstraints</code>. 
You can use a tool like <a href="https://tiktokenizer.vercel.app/">https://tiktokenizer.vercel.app</a> to play with tokens online.</p> <h3 id="logits">Logits</h3> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/temperature_tokens.svg" sizes="95vw"/> <img src="/assets/img/temperature_tokens.svg" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>So, when at some point in time our LLM generated <code class="language-plaintext highlighter-rouge">A cat climbed the bookshelf and knocked over a vase. She looked</code>, it calculated the “odds” of every next possible token from its vocabulary. Some are obvious nonsense in such a context. We don’t expect tokens like <code class="language-plaintext highlighter-rouge">{</code>, <code class="language-plaintext highlighter-rouge">*******</code> or <code class="language-plaintext highlighter-rouge">.translatesAutoresizingMaskIntoConstraints</code> to be selected here. But some are much more “probable”, like <code class="language-plaintext highlighter-rouge">_guilty</code> or <code class="language-plaintext highlighter-rouge">_down</code> (note that most probably we need whitespace after the previous word).</p> <p>The output layer of LLM is producing a score for each token in the model’s vocabulary - so called <strong>logits</strong> (“log-odds,” the <strong>logarithm</strong> of the odds of each token). Let’s denote them as $l_i$ (a logit for a token $i$). They are raw, not normalized numerical predictions to represent model’s “confidence” about each token being the next in the sequence. 
And they are just numbers ranging from $-\infty$ to $+\infty$.</p> <p>At the last stage, the LLM converts them to a probability distribution - where all values sum to 1.0 and have values between 0.0 and 1.0 - with the help of the Softmax function, which looks like this:</p> \[p(x_i) = \text{softmax}(x_i) = \frac{e^{l_i}}{\sum_{j=1}^{n} e^{l_j}}\] <blockquote class="block-info"> <p>Note: Typically logits are also normalized by subtracting their maximum value, to make calculations more stable.</p> </blockquote> <p>The idea behind the Softmax function is simple - it uses the exponential function $e^x$ to produce a non-negative value for each logit, from which we can then calculate its ratio against the sum of all <em>exponentiated</em> logits.</p> <h3 id="temperature">Temperature</h3> <p>But here comes the simple trick called <strong>temperature</strong>. What if we <em>scaled</em> each logit before exponentiation, by some value $T$:</p> \[p(x_i) = \text{softmax}(x_i) = \frac{e^{ {l_i}/T}}{\sum_{j=1}^{n} e^{ {l_j}/T}}\] <p>Such scaling gives a nice consequence, visible on the graph below:</p> <ul> <li>for $T &gt; 1$ the curve is flattened, making differences between each exponentiated logit (relative probability) smaller. The bigger $T$, the smaller the differences between resulting probabilities</li> <li>for $T &lt; 1$ the curve becomes steeper, making bigger differences. The smaller $T$, the bigger the differences between the most probable logits (right side of the graph) vs the least probable ones (left side of the graph)</li> </ul> <pre><code class="language-plotly">{
  "data": [
    {
      "x": [-3, -2.5, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3],
      "y": [0.049, 0.082, 0.135, 0.223, 0.368, 0.606, 1, 1.648, 2.718, 4.481, 7.389, 12.182, 20.085],
      "type": "scatter",
      "mode": "lines+markers",
      "name": "e^x (T=1.0)"
    },
    {
      "x": [-3, -2.5, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3],
      "y": [0.002, 0.007, 0.018, 0.050, 0.135, 0.368, 1, 2.718, 7.389, 20.086, 54.598, 148.413, 403.429],
      "type": "scatter",
      "mode": "lines+markers",
      "name": "e^(x/0.5) (T=0.5)"
    },
    {
      "x": [-3, -2.5, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3],
      "y": [0.223, 0.287, 0.368, 0.472, 0.607, 0.779, 1, 1.284, 1.649, 2.117, 2.718, 3.490, 4.482],
      "type": "scatter",
      "mode": "lines+markers",
      "name": "e^(x/2.0) (T=2.0)"
    }
  ],
  "layout": {
    "title": "Effect of Temperature on Exponential Growth",
    "paper_bgcolor": "rgba(0,0,0,0)",
    "plot_bgcolor": "rgba(0,0,0,0)",
    "xaxis": {
      "title": "Logit value (x)"
    },
    "yaxis": {
      "title": "Exponential value (e^(x/T))",
      "range": [0, 20]
    }
  },
  "config": {
    "displayModeBar": false,
    "responsive": true
  }
}
</code></pre> <p>Let’s see how these temperatures affect the final probability distribution for a specific set of logit values like $2$, $1$, $-0.5$ and $-3$:</p> <ul> <li>$T=1.0$ - the standard Softmax distribution. We can treat it as our baseline</li> <li>$T=0.5$ (Low Temperature) - the probabilities become “sharper”. The highest logit (of value $2$) becomes significantly more probable (from ~$68\% \rightarrow 88\%$), while the lower ones are “suppressed”</li> <li>$T=2.0$ (High Temperature) - the probabilities become “flatter”. The gap between the most and least probable tokens becomes smaller, making the distribution more uniform</li> </ul> <pre><code class="language-plotly">{
  "data": [
    {
      "x": ["Token A (2.0)", "Token B (1.0)", "Token C (-0.5)", "Token D (-3.0)"],
      "y": [0.8756, 0.1185, 0.0059, 0.0000],
      "name": "T=0.5 (Sharper)",
      "type": "bar"
    },
    {
      "x": ["Token A (2.0)", "Token B (1.0)", "Token C (-0.5)", "Token D (-3.0)"],
      "y": [0.6865, 0.2525, 0.0564, 0.0046],
      "name": "T=1.0 (Standard)",
      "type": "bar"
    },
    {
      "x": ["Token A (2.0)", "Token B (1.0)", "Token C (-0.5)", "Token D (-3.0)"],
      "y": [0.5062, 0.3071, 0.1451, 0.0415],
      "name": "T=2.0 (Flatter)",
      "type": "bar"
    }
  ],
  "layout": {
    "title": "Probability Distribution vs Temperature",
    "paper_bgcolor": "rgba(0,0,0,0)",
    "plot_bgcolor": "rgba(0,0,0,0)",
    "barmode": "group",
    "xaxis": {
      "title": "Logits"
    },
    "yaxis": {
      "title": "Probability",
      "range": [0, 1]
    }
  },
  "config": {
    "displayModeBar": false,
    "responsive": true
  }
}
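</code></pre> <p>The numbers on this graph can be reproduced with a few lines of code. Below is a minimal Python sketch, not taken from any particular inference engine: the function name <code class="language-plaintext highlighter-rouge">softmax_with_temperature</code> is just a name picked for this illustration. It divides each logit by $T$, subtracts the maximum for numerical stability and normalizes, and it assumes $T$ is greater than zero:</p> <pre><code class="language-python">import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities, dividing each logit by T first (T must be positive)."""
    scaled = [logit / temperature for logit in logits]
    # Subtracting the maximum does not change the result, it only avoids overflow.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -0.5, -3.0]  # the four example logits used in the text
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.4f}" for p in probs))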
</code></pre> <p>What about $T=0.0$? Mathematically, we can’t divide by zero. So, in practice, LLMs treat this as a limit - as the temperature approaches zero, the distribution becomes infinitely sharp. At $0.0$ the model simply picks the token with the highest logit and gives it 100% probability, turning the “soft” max into a “hard” max (also called $argmax$ function).</p> <h3 id="logprobs">Logprobs</h3> <p>What we’ve seen so far is how the model calculates probabilities internally. But some models/providers allow us to observe the output of such probability calculations in the form of so-called <strong>logprobs</strong> (Log Probabilities, let’s denote them as $lp$). It’s possible via API call options like <code class="language-plaintext highlighter-rouge">IncludeLogProbabilities</code> or <code class="language-plaintext highlighter-rouge">Logprobs</code> (we will play with them more in the upcoming blog post - <em>Update: it’s <a href="/blog/2026/logprobs/">already published</a></em>).</p> <p>Logprobs are natural logarithms of the probability ($\ln(p)$). They are already normalized by the model’s Softmax and temperature internally - in other words, the Softmax and temperature have already happened <em>inside</em> the model when we receive them from the API. They take values from $-\infty$ to $0$ (always negative or zero), as they represent probabilities from the range $[0.0, 1.0]$. To calculate probabilities from them we just need to apply $e^x$ to them (the inverse of the natural logarithm). For example:</p> \[{lp}=0.0 \text{ so } p = e^{0.0} = 1.0 \text{ (100%)}\] \[{lp}=-0.69 \text{ so } p = e^{-0.69} \approx 0.5 \text{ (50%)}\] \[{lp}=-4.6 \text{ so } p = e^{-4.6} \approx 0.01 \text{ (1%)}\] <p>Let’s return to our <code class="language-plaintext highlighter-rouge">A cat climbed the bookshelf and knocked over a vase. She looked</code> example. When we ask for logprobs we will get them for all tokens in the sequence (including the last one, not yet shown here). 
We can treat logprobs as yet another high-dimensional vector of logprob values for each token in the model’s vocabulary.</p> <p>Here’s the real example of logprobs returned for <code class="language-plaintext highlighter-rouge">gpt-4</code> model and the token after <code class="language-plaintext highlighter-rouge">looked</code> (13th index in sequence):</p> \[\mathbf{L}_{13} \in \mathbb{R}^{100278} = \begin{pmatrix} {lp}_1 \\ {lp}_2 \\ {lp}_3 \\ \vdots \\ {lp}_{519} \\ {lp}_{520} \\ {lp}_{521} \\ \vdots \\ {lp}_{1203} \\ \vdots \\ {lp}_{1523} \\ \vdots \end{pmatrix} = \begin{pmatrix} -56.0 \\ -80.2 \\ -67.1 \\ \vdots \\ -81.2 \\ -1.11 \\ -91.2 \\ \vdots \\ -5.61 \\ \vdots \\ -4.11 \\ \vdots \end{pmatrix}\] <p>By mapping positions/ids of tokens around the highest values, we will see that they correspond to the following tokens:</p> \[[ \cdots, \underbrace{\text{ant}}_{id_{519}}, \underbrace{\text{ at}}_{id_{520}}, \underbrace{\text{ase}}_{id_{521}} \cdots, \underbrace{\text{ back}}_{id_{1203}}, \cdots, \underbrace{\text{ down}}_{id_{1523}}, \cdots, \underbrace{\text{ around}}_{id_{2212}}, \cdots, \underbrace{\text{ proud}}_{id_{12691}}, \cdots, \underbrace{\text{ surprised}}_{id_{14792}}, \cdots, \underbrace{\text{ guilty}}_{id_{16390}}, \cdots \underbrace{\text{ innocent}}_{id_{25226}}, \cdots ]\] <p>Not all tokens were presented in the above $\mathbf{L}$ but we see an example for tokens $519$ (<code class="language-plaintext highlighter-rouge">ant</code>, very low value), $520$ (<code class="language-plaintext highlighter-rouge">_at</code>, pretty high value) and $521$ (<code class="language-plaintext highlighter-rouge">ase</code>, again with very low value).</p> <p>And here’s the graph presenting values for the 8 highest-valued logprobs:</p> <pre><code class="language-plotly">{
  "data": [
    {
      "x": [
        "innocent",
        "at",
        "proud",
        "surprised",
        "down",
        "around",
        "guilty",
        "back"
      ],
      "y": [
        -0.61,
        -1.11,
        -2.98,
        -3.86,
        -4.11,
        -4.48,
        -5.23,
        -5.61
      ],
      "name": "Logprob",
      "type": "bar"
    }
  ],
  "layout": {
    "title": "Logprobs",
    "paper_bgcolor": "rgba(0,0,0,0)",
    "plot_bgcolor": "rgba(0,0,0,0)"
  },
  "config": {
    "displayModeBar": false,
    "responsive": true
  }
}
</code></pre> <p>By applying $p_i = e^{{lp}_i}$ (and, for this graph, rescaling the 8 values so that they sum to 1.0) we get the corresponding probabilities:</p> <pre><code class="language-plotly">{
  "data": [
    {
      "x": [
        "innocent",
        "at",
        "proud",
        "surprised",
        "down",
        "around",
        "guilty",
        "back"
      ],
      "y": [
        0.5536,
        0.3358,
        0.0517,
        0.0215,
        0.0167,
        0.0115,
        0.0055,
        0.0037
      ],
      "name": "Probability",
      "type": "bar"
    }
  ],
  "layout": {
    "title": "Probabilities (T=1.0)",
    "paper_bgcolor": "rgba(0,0,0,0)",
    "plot_bgcolor": "rgba(0,0,0,0)",
    "xaxis": {
      "automargin": true
    },
    "yaxis": {
      "automargin": true
    }
  },
  "config": {
    "displayModeBar": false,
    "responsive": true
  }
}
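</code></pre> <p>As a quick sanity check, the conversion can be sketched in Python. The logprobs are copied from the logprobs graph earlier; the variable names are made up for this illustration, and rescaling over just these 8 tokens is an assumption that ignores the rest of the vocabulary:</p> <pre><code class="language-python">import math

# Top-8 logprobs for the token after "looked", copied from the logprobs graph.
logprobs = {
    "innocent": -0.61, "at": -1.11, "proud": -2.98, "surprised": -3.86,
    "down": -4.11, "around": -4.48, "guilty": -5.23, "back": -5.61,
}

# p = e^lp turns each logprob back into a probability.
raw = {token: math.exp(lp) for token, lp in logprobs.items()}

# The eight raw values cover only part of the distribution (the rest of the vocabulary is missing).
covered = sum(raw.values())

# Dividing by the covered mass rescales the eight values so they sum to 1.0.
rescaled = {token: p / covered for token, p in raw.items()}

for token in logprobs:
    print(f"{token:10s} raw={raw[token]:.4f} rescaled={rescaled[token]:.4f}")
print(f"covered probability mass: {covered:.4f}")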
</code></pre> <blockquote class="block-info"> <p>Note: Logprobs returned by the API are truncated to the top $n$ values, so the raw probabilities calculated from them won’t sum to 1.0</p> </blockquote> <p>With the help of logprobs we can nicely visualize how temperature influences the resulting probabilities for our sentence:</p> <pre><code class="language-plotly">{
  "data": [
    {
      "x": [
        "innocent",
        "at",
        "proud",
        "surprised",
        "down",
        "around",
        "guilty",
        "back"
      ],
      "y": [
        0.3596,
        0.28,
        0.1099,
        0.0708,
        0.0625,
        0.0519,
        0.0357,
        0.0295
      ],
      "name": "T=2.0",
      "type": "bar"
    },
    {
      "x": [
        "innocent",
        "at",
        "proud",
        "surprised",
        "down",
        "around",
        "guilty",
        "back"
      ],
      "y": [
        0.5536,
        0.3358,
        0.0517,
        0.0215,
        0.0167,
        0.0115,
        0.0055,
        0.0037
      ],
      "name": "T=1.0",
      "type": "bar"
    },
    {
      "x": [
        "innocent",
        "at",
        "proud",
        "surprised",
        "down",
        "around",
        "guilty",
        "back"
      ],
      "y": [
        0.7248,
        0.2667,
        0.0063,
        0.0011,
        0.0007,
        0.0003,
        0.0001,
        0.0
      ],
      "name": "T=0.5",
      "type": "bar"
    },
    {
      "x": [
        "innocent",
        "at",
        "proud",
        "surprised",
        "down",
        "around",
        "guilty",
        "back"
      ],
      "y": [
        0.9933,
        0.0067,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0,
        0.0
      ],
      "name": "T=0.1",
      "type": "bar"
    }
  ],
  "layout": {
    "title": "Temperature Comparison (T=2.0 vs T=1.0 vs T=0.5 vs T=0.1)",
    "paper_bgcolor": "rgba(0,0,0,0)",
    "plot_bgcolor": "rgba(0,0,0,0)",
    "xaxis": {
      "automargin": true
    },
    "yaxis": {
      "automargin": true
    },
    "barmode": "group"
  },
  "config": {
    "displayModeBar": false,
    "responsive": true
  }
}
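</code></pre> <p>Such a comparison can be approximated directly from logprobs. Here is a minimal Python sketch, assuming the logprobs were returned at $T=1.0$ and that we only care about these 8 tokens. Dividing logprobs by $T$ and normalizing again gives the same result as dividing the logits, because the constant that separates a logit from its logprob cancels out. The helper name <code class="language-plaintext highlighter-rouge">apply_temperature</code> is hypothetical:</p> <pre><code class="language-python">import math

tokens = ["innocent", "at", "proud", "surprised", "down", "around", "guilty", "back"]
logprobs = [-0.61, -1.11, -2.98, -3.86, -4.11, -4.48, -5.23, -5.61]

def apply_temperature(logprobs, temperature):
    """Re-run softmax over logprobs divided by T (T must be positive)."""
    scaled = [lp / temperature for lp in logprobs]
    # Subtract the maximum before exponentiating to keep the numbers stable.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

for t in (2.0, 1.0, 0.5, 0.1):
    probs = apply_temperature(logprobs, t)
    row = ", ".join(f"{tok}={p:.4f}" for tok, p in zip(tokens, probs))
    print(f"T={t}: {row}")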
</code></pre> <p>We clearly see, in the real-world example, that:</p> <ul> <li>by setting a high temperature like $T = 2.0$ we make many tokens more probable. It could be interpreted as the model being “more creative” (or less constrained)</li> <li>by setting lower temperatures like $T = 0.1$ we definitely “boost” the most probable tokens at the expense of the remaining tokens. The model becomes more predictable (“less creative”, if you wish)</li> </ul> <p>That’s it for now! In the next blog post we will dive into visualizing logprobs in a nicer way. We also have yet to cover sampling methods like “top-k” and “top-p”, but that’s also a topic for another day.</p>]]></content><author><name></name></author><category term="llm"/><category term="llm"/><category term="hyperparameter"/><category term="basics"/><summary type="html"><![CDATA[Deep dive into the famous temperature hyperparameter in LLMs.]]></summary></entry><entry><title type="html">Simple CLI REPL for Model Context Protocol (MCP)</title><link href="https://kokosa.dev/blog/2026/repl-mcp/" rel="alternate" type="text/html" title="Simple CLI REPL for Model Context Protocol (MCP)"/><published>2026-01-20T15:00:00+01:00</published><updated>2026-01-20T15:00:00+01:00</updated><id>https://kokosa.dev/blog/2026/repl-mcp</id><content type="html" xml:base="https://kokosa.dev/blog/2026/repl-mcp/"><![CDATA[<figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/repl-mcp-demo-480.webp 480w,/assets/img/repl-mcp-demo-800.webp 800w,/assets/img/repl-mcp-demo-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/repl-mcp-demo.gif" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>For many months I’ve been working with the <strong>Model Context Protocol (MCP)</strong> and I found myself frustrated with the tools available for easily accessing and testing MCP servers. 
It is not even about building your own servers, but just connecting to existing ones and looking around at what’s possible. I couldn’t find anything that quite suited my needs - something lightweight, simple to use and CLI-based. And I’m not a big fan of the standard <a href="https://modelcontextprotocol.io/docs/tools/inspector">MCP Inspector</a>.</p> <p>So, I <strong>decided to build my own</strong>: <a href="https://github.com/kkokosa/repl-mcp">https://github.com/kkokosa/repl-mcp</a> - a simple, interactive CLI tool for testing MCP servers.</p> <p>The idea was to have a lightweight, REPL-style (Read-Eval-Print Loop) environment where I could quickly poke at tools, prompts, and resources without any overhead.</p> <p>Key Features:</p> <ul> <li><strong>Interactive REPL</strong> - command history and tab autocompletion</li> <li><strong>Inspect Everything</strong> - list and inspect available tools, prompts, and resources</li> <li><strong>Call Tools</strong> - execute tools with either JSON or interactive parameter input</li> <li><strong>Transport Support</strong> - works with both Stdio and HTTP transports</li> <li><strong>Syntax Highlighting</strong> - clean, readable terminal output</li> </ul> <h3 id="examples">Examples</h3> <h4 id="simple-local-server-via-stdio---filesystem">Simple local server via <code class="language-plaintext highlighter-rouge">stdio</code> - filesystem</h4> <p>You can try it out with the <a href="https://www.npmjs.com/package/@modelcontextprotocol/server-filesystem"><code class="language-plaintext highlighter-rouge">server-filesystem</code></a> MCP server. 
In this case we choose <code class="language-plaintext highlighter-rouge">--command</code> (or <code class="language-plaintext highlighter-rouge">-c</code>) option to run the server on our own (remember to <code class="language-plaintext highlighter-rouge">npm install</code> it beforehand):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>repl-mcp <span class="nt">-c</span> <span class="s2">"npx -y @modelcontextprotocol/server-filesystem /tmp"</span>
</code></pre></div></div> <h4 id="remote-server-with-api-token---github">Remote server with API token - GitHub</h4> <p>If we want to access a remote MCP server, we can use the <code class="language-plaintext highlighter-rouge">--url</code>/<code class="language-plaintext highlighter-rouge">-u</code> option, but <strong>bear in mind authorization</strong> - remote servers (hopefully!) require some kind of it.</p> <p>Some of them may accept an API token, as in the case of the <a href="https://github.com/github/github-mcp-server">official GitHub MCP Server</a>, which we provide as a <a href="https://swagger.io/docs/specification/v3_0/authentication/bearer-authentication/">Bearer token</a> (and use HTTP transport):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>repl-mcp <span class="nt">-u</span> https://api.githubcopilot.com/mcp/ <span class="nt">-t</span> http <span class="nt">-H</span> Authorization:<span class="s2">"Bearer github_pat_..."</span>
</code></pre></div></div> <h4 id="remote-servier-with-oauth---atlassian">Remote server with OAuth - Atlassian</h4> <p>Many MCP servers rely on <a href="https://auth0.com/docs/get-started/authentication-and-authorization-flow/authorization-code-flow">OAuth 2.x flow</a> to authorize the user, which is more tricky. In such a case we need to support the full OAuth flow, which requires exposing a public endpoint to the Internet (point 8 in the link), to log in to the given service and let the callback return to us. That could be the responsibility of my <code class="language-plaintext highlighter-rouge">repl-mcp</code> but there’s already a handy <a href="https://www.npmjs.com/package/mcp-remote"><code class="language-plaintext highlighter-rouge">mcp-remote</code></a> proxy MCP server.</p> <p>Let’s see it in action while trying to use <a href="https://www.atlassian.com/platform/remote-mcp-server">Atlassian Remote MCP Server</a>. It is hosted at <code class="language-plaintext highlighter-rouge">https://mcp.atlassian.com/v1/mcp</code> and requires OAuth 2.1 authorization flow. “Supported clients” like ChatGPT, Claude or Google Gemini handle authorization flow out of the box. But as we can <a href="https://support.atlassian.com/atlassian-rovo-mcp-server/docs/getting-started-with-the-atlassian-remote-mcp-server/">read in the docs</a>, their MCP server: <em>“also supports any <strong>local MCP-compatible</strong> client that can run on <code class="language-plaintext highlighter-rouge">localhost</code> and connect to the server via the <code class="language-plaintext highlighter-rouge">mcp-remote</code> proxy”</em>.</p> <p>This is exactly what we will do! Here are the steps:</p> <ol> <li>run <code class="language-plaintext highlighter-rouge">mcp-remote</code> proxy: <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>npx mcp-remote https://mcp.atlassian.com/v1/mcp
</code></pre></div> </div> </li> <li>see what port it is using, like <code class="language-plaintext highlighter-rouge">Using existing client port: 3736</code></li> <li>run <a href="https://ngrok.com/">ngrok</a> or similar tool to expose the port: <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ngrok.exe http --domain some-words-here.ngrok-free.app http://localhost:3736
</code></pre></div> </div> </li> <li>also, we need to configure Atlassian to accept/trust our callback domain - in <a href="https://admin.atlassian.com/">Administration panel</a>, we need to add our ngrok-based domain to <em>Apps -&gt; AI settings -&gt; Rovo MCP server</em> like <code class="language-plaintext highlighter-rouge">https://some-words-here.ngrok-free.app/**</code></li> <li>go to the link printed by <code class="language-plaintext highlighter-rouge">mcp-remote</code>, something like: <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
Please authorize this client by visiting:
https://mcp.atlassian.com/v1/authorize?response_type=code&amp;client_id=Y...
...
</code></pre></div> </div> </li> <li>After logging in/authorizing, the callback will be sent (via ngrok) to the <code class="language-plaintext highlighter-rouge">mcp-remote</code> proxy and the received token will be stored. Close <code class="language-plaintext highlighter-rouge">mcp-remote</code>. <strong>We won’t need to repeat it every time.</strong></li> <li>Run the proxy under <code class="language-plaintext highlighter-rouge">repl-mcp</code> and enjoy your Atlassian access! <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>repl-mcp <span class="nt">-c</span> <span class="s2">"npx mcp-remote https://mcp.atlassian.com/v1/mcp"</span>
</code></pre></div> </div> </li> </ol>]]></content><author><name></name></author><category term="tools"/><category term="mcp"/><summary type="html"><![CDATA[I've decided to build my own lightweight, REPL-style (Read-Eval-Print Loop) MCP testing tool.]]></summary></entry></feed>