<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Varun Singh's Blog</title>
    <link>https://varunsingh.net</link>
    <description>Technical writings on WebRTC, real-time communications, and technology</description>
    <language>en</language>
    <lastBuildDate>Sun, 17 May 2026 09:23:37 GMT</lastBuildDate>
    <pubDate>Sun, 15 Mar 2026 00:00:00 GMT</pubDate>
    <ttl>60</ttl>
    <atom:link href="https://varunsingh.net/rss.xml" rel="self" type="application/rss+xml" />
    <image>
      <url>https://varunsingh.net/static/favicons/favicon-32.ico</url>
      <title>Varun Singh's Blog</title>
      <link>https://varunsingh.net</link>
    </image>
    <item>
      <title><![CDATA[Reviewing Code and Spec Compliance with Skills]]></title>
      <link>https://varunsingh.net/til/standards/reviewing-code-and-spec-compliance-with-skills</link>
      <guid isPermaLink="true">https://varunsingh.net/til/standards/reviewing-code-and-spec-compliance-with-skills</guid>
      <pubDate>Sun, 15 Mar 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[We got a [PR #3859](https://github.com/pipecat-ai/pipecat/pull/3859) about adding SIP to pipecat.ai recently and I did a quick review basing it solely on memory from my past implementations of the RTP]]></description>
      <content:encoded><![CDATA[<p>We got a <a href="https://github.com/pipecat-ai/pipecat/pull/3859">PR #3859</a> about adding SIP to <a href="http://pipecat.ai">pipecat.ai</a> recently and I did a quick review basing it solely on memory from my past implementations of the RTP spec. Working in a field for 20 years means we have rewritten these protocols a few times and some things are top of mind. However, LLMs are reasoning engines and they can comb through both the specifications and the code side-by-side very quickly. What they can struggle with is hidden assumptions and gotchas. Those come from experience.</p>
<p>On a <a href="https://x.com/vr000m/status/2032657654099943487?s=20">whim</a>, I wrote up <a href="https://github.com/vr000m/skills.md">two skills</a>: <code>/rfc-finder</code> and <code>/spec-compliance</code>:</p>
<ul>
<li><strong><code>/rfc-finder</code></strong> is a discovery tool. Given a topic, protocol name, or even a code snippet like <code>sendNack()</code>, it searches IETF Datatracker and the RFC Editor to find the related specifications. It traces draft-to-RFC lineages (important because IETF drafts get renamed when they graduate), checks obsolescence chains (so you do not cite a spec that was superseded a decade ago, e.g., DTMF/RFC4733 instead of RFC2833), and ranks results by how foundational each one is. Under the hood it runs <code>WebSearch</code> against <code>datatracker.ietf.org</code> and <code>rfc-editor.org</code>, then uses <code>WebFetch</code> to verify metadata — status, &quot;Obsoleted by&quot; relationships, section numbers. It returns annotated links, not summaries: the spec itself is the source of truth.</li>
<li><strong><code>/spec-compliance</code></strong> is an auditing tool. Given a code file and a specific spec section (e.g., <code>RFC 3550 Section 5.1</code>), it fetches that section, extracts every normative statement — the RFC 2119 keywords (MUST, SHOULD, MAY, in all-caps) — then reads the code and classifies each requirement as Met, Missing, Partial, or N/A with line-number evidence. Under the hood it uses <code>WebFetch</code> to pull the spec section, <code>Read</code> to load the source, and <code>Grep</code> to search adjacent modules and tests before marking something as missing. The output is a structured compliance report with a summary table.</li>
</ul>
<p>The two skills are designed to chain: <code>/rfc-finder</code> answers <em>&quot;what specs apply here?&quot;</em> and <code>/spec-compliance</code> answers <em>&quot;does the code actually follow them?&quot;</em></p>
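<p>To make the "extract every normative statement" step concrete, here is a minimal sketch of how RFC 2119 keywords could be pulled out of a spec section. It is not the skill's actual implementation: the function name <code>normative_statements</code>, the naive sentence splitting, and the sample text (which is illustrative wording, not a quote from any RFC) are all assumptions.</p>
<pre><code class="language-python">import re

# RFC 2119 keywords, multi-word forms first so 'MUST NOT' wins over 'MUST'.
KEYWORDS = ['MUST NOT', 'SHALL NOT', 'SHOULD NOT', 'NOT RECOMMENDED',
            'MUST', 'REQUIRED', 'SHALL', 'SHOULD', 'RECOMMENDED',
            'MAY', 'OPTIONAL']
# Case-sensitive on purpose: only the all-caps forms are normative.
KEYWORD_RE = re.compile(r'\b(' + '|'.join(KEYWORDS) + r')\b')


def normative_statements(section_text):
    # Undo the hard line wrapping of RFC text, then split naively on
    # sentence-ending punctuation (an assumption; abbreviations will confuse it).
    flat = ' '.join(section_text.split())
    sentences = re.split(r'[.;]\s+', flat)
    found = []
    for sentence in sentences:
        match = KEYWORD_RE.search(sentence)
        if match:
            found.append((match.group(1), sentence.strip()))
    return found


# Illustrative text only, not a quotation from a real specification.
sample = '''The sender MUST set the version field to 2. A receiver
SHOULD ignore packets it cannot parse; it MAY log them. Padding is
described elsewhere and must not be confused with the header.'''

for keyword, sentence in normative_statements(sample):
    print(keyword, '|', sentence)
</code></pre>
<p>The lowercase "must not" in the last sentence is skipped, which mirrors the all-caps convention mentioned above. Classifying each statement as Met, Missing, Partial, or N/A still needs the model to read the code.</p>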
<p>A good test for these skills was <a href="https://github.com/pipecat-ai/pipecat/pull/3859">PR #3859</a>, which adds a FreeSWITCH SIP/RTP transport to Pipecat. It is a protocol implementation PR, which means every file maps onto a specific IETF specification. <code>/rfc-finder</code> looked at the files and inferred that the PR implements a minimal SIP UAS transport (RFC 3261) with G.711 codecs, RTP packetization, SDP offer/answer, and RFC 2833 DTMF. Here are the relevant RFCs found by <code>/rfc-finder</code>:</p>
<table>
<thead>
<tr>
<th>File</th>
<th>Protocol area</th>
<th>Primary spec</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>codecs.py</code></td>
<td>G.711 μ-law encode/decode</td>
<td>ITU-T G.711 (LUT tables), RFC 3551 §4.5.14 (PCMU PT)</td>
</tr>
<tr>
<td><code>rtp.py</code></td>
<td>RTP header packing, 20ms send loop</td>
<td>RFC 3550 §5.1 and §5.3, no §6</td>
</tr>
<tr>
<td><code>rtp.py</code></td>
<td>DTMF detection</td>
<td>RFC 4733 §2.3 (PT) even though the code said RFC 2833</td>
</tr>
<tr>
<td><code>sdp.py</code></td>
<td>SDP generation and parsing, offer/answer</td>
<td>RFC 8866 (fields), RFC 3264 §5 (Offer) and §6 (Answer)</td>
</tr>
<tr>
<td><code>signaling.py</code></td>
<td>SIP message parsing, request/response building</td>
<td>RFC 3261 §13 (INVITE dialog), §15 (BYE), §17 (Transactions); no REGISTER, no CANCEL. RFC 6337 (SIP O/A)</td>
</tr>
</tbody>
</table>
<p>This is the ideal shape for spec-compliance analysis: the code is explicitly implementing wire protocols and every function has a normative &quot;should behave like X&quot; defined somewhere in a standards document. The question is whether it actually is a better review tool than the generic <code>/review</code> tools provided by the coding agents.</p>
<p>What follows are the findings from running the skills over all the files listed above. I first test-ran this on <code>rtp.py</code> and shared the <a href="https://github.com/pipecat-ai/pipecat/pull/3859#issuecomment-4064792218">feedback</a>; the developer acknowledged it and pushed a fix. This was a useful iteration!</p>
<p><img src="/static/blog/2026/20260315-gh-rtp-spec.png" alt="isometric view of sf"></p>
<blockquote>
<p>Thanks for running the spec compliance check @vr000m! Pushed a commit addressing the three RFC 3550 §5.1 partial-compliance items:</p>
<ul>
<li>Unknown payload types (MUST) — _handle_packet() now ignores packets with PTs other than PCMU (0) and DTMF (101), instead of blindly decoding as G.711.</li>
<li>SSRC collision detection (MUST) — If an incoming packet carries our own SSRC, we regenerate. Minimal implementation suitable for 1:1 SIP calls.</li>
<li>New SSRC on address change (SHOULD) — start() now regenerates the SSRC when the remote transport address changes.</li>
</ul>
</blockquote>
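<p>For readers who have not stared at RTP headers recently, here is a rough sketch of the first fix: parse the 12-byte fixed header from RFC 3550 §5.1 and drop packets whose payload type is not one we negotiated. This is not the code from the PR; the function names and the returned dictionary are my own, and PT 101 for DTMF is taken from the developer's comment above.</p>
<pre><code class="language-python">import struct

PCMU_PT = 0     # static payload type for G.711 mu-law (RFC 3551)
DTMF_PT = 101   # dynamic PT for telephone-event, as used in the PR
ACCEPTED_PTS = {PCMU_PT, DTMF_PT}


def parse_rtp_header(packet):
    # Fixed header layout: V/P/X/CC byte, M/PT byte, 16-bit sequence
    # number, 32-bit timestamp, 32-bit SSRC, all in network byte order.
    try:
        b0, b1, seq, timestamp, ssrc = struct.unpack('!BBHII', packet[:12])
    except struct.error:
        return None               # shorter than the fixed header
    if b0 >> 6 != 2:              # version lives in the top two bits
        return None
    return {
        'cc': b0 % 16,            # CSRC count, low four bits
        'marker': b1 >> 7,        # marker bit, top bit of the second byte
        'pt': b1 % 128,           # payload type, low seven bits
        'seq': seq,
        'timestamp': timestamp,
        'ssrc': ssrc,
    }


def should_decode(packet):
    # Hypothetical helper: only hand known payload types to the decoder.
    header = parse_rtp_header(packet)
    return header is not None and header['pt'] in ACCEPTED_PTS
</code></pre>
<p>A PCMA packet (PT 8), for example, parses fine but is rejected by <code>should_decode</code> instead of being decoded as μ-law.</p>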
<p>I eventually ran this on the whole PR, adding the constraint that it is not a universal SIP implementation: the developer has scoped it to supporting only FreeSWITCH, with the SIP server and Pipecat running on the same subnet, i.e., RTP and SIP connections stay within the local network. Hence there is no RTCP, no ICE/STUN, no SIP REGISTER. With that constraint in mind, the LLM did not re-flag the deliberate engineering choices that were already known and discussed. Instead, it reviewed the subsections of the RFCs that were still relevant and important.</p>
<table>
<thead>
<tr>
<th>File</th>
<th>Requirement</th>
<th>RFC</th>
<th>Status</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>rtp.py</code></td>
<td>Marker bit handling</td>
<td>RFC 3550 §5.1</td>
<td>Not implemented</td>
<td>Marker bit is always 0. RFC 3551 §4.1 says the marker bit SHOULD be set on the first packet after silence suppression. Acceptable for continuous audio.</td>
</tr>
<tr>
<td><code>rtp.py</code></td>
<td>CSRC handling</td>
<td>RFC 3550 §5.1</td>
<td>N/A</td>
<td>CC=0, no mixers — correct for point-to-point</td>
</tr>
<tr>
<td><code>rtp.py</code></td>
<td>RTCP</td>
<td>RFC 3550 §6</td>
<td>Not implemented</td>
<td>Explicitly documented as out-of-scope (LAN-only). This is a known deviation. RFC 3550 says RTCP &quot;SHOULD&quot; be used. For LAN-only deployments with FreeSWITCH this is pragmatically fine.</td>
</tr>
<tr>
<td><code>rtp.py</code></td>
<td>Dynamic PT negotiation via SDP</td>
<td>RFC 4733 §5</td>
<td>Hardcoded</td>
<td>PT=101 is hardcoded, not negotiated from SDP <code>a=rtpmap:101 telephone-event/8000</code>. Works with FreeSWITCH defaults but technically should be negotiated.</td>
</tr>
<tr>
<td><code>sdp.py</code></td>
<td>SDP answer matches offer codecs</td>
<td>RFC 3264 §6.1</td>
<td>Simplified</td>
<td>Answer always offers PCMU regardless of what's in the offer. If the offerer doesn't support PCMU, the call will fail. Acceptable for FreeSWITCH (always supports PCMU).</td>
</tr>
<tr>
<td><code>signaling.py</code></td>
<td>Branch parameter in Via (UAS BYE)</td>
<td>RFC 3261 §8.1.1.7</td>
<td>Simplified</td>
<td>Uses <code>z9hG4bK{call_id[:8]}</code> — the magic cookie is correct but the uniqueness comes from truncated call-id. Should be fine in practice but not cryptographically unique per spec.</td>
</tr>
<tr>
<td><code>signaling.py</code></td>
<td>Header parsing (case sensitivity)</td>
<td>RFC 3261 §7.3.1</td>
<td>Case-sensitive</td>
<td>The parser does exact-case matching for header names. RFC 3261 says header names are case-insensitive. E.g., <code>call-id:</code> would not match <code>Call-ID</code>. FreeSWITCH uses standard casing, so this works in practice.</td>
</tr>
</tbody>
</table>
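<p>The header case-sensitivity finding is the easiest one to picture in code. Below is a small sketch of case-insensitive SIP header parsing, assuming a simple approach of lower-casing field names on the way in. The helper names are hypothetical, and the sketch deliberately ignores folded header lines and compact forms such as <code>i</code> for <code>Call-ID</code>.</p>
<pre><code class="language-python">def parse_headers(raw_message):
    # The header section ends at the first empty line.
    head = raw_message.split('\r\n\r\n', 1)[0]
    headers = {}
    for line in head.split('\r\n')[1:]:      # skip the request/status line
        name, sep, value = line.partition(':')
        if not sep:
            continue                          # not a 'Name: value' line
        # Normalise the field name so lookups ignore case.
        headers.setdefault(name.strip().lower(), []).append(value.strip())
    return headers


def get_header(headers, name):
    # Return the first value for a header, whatever casing the caller uses.
    values = headers.get(name.lower())
    return values[0] if values else None


message = ('BYE sip:bot@10.0.0.5 SIP/2.0\r\n'
           'call-id: abc123\r\n'
           'CSeq: 2 BYE\r\n'
           '\r\n')
print(get_header(parse_headers(message), 'Call-ID'))
</code></pre>
<p>With this shape, <code>call-id:</code>, <code>Call-ID:</code>, and <code>CALL-ID:</code> all resolve to the same entry, so the parser no longer depends on FreeSWITCH's casing.</p>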
<p>Without tests or a veteran implementer's eye, the above would only have been found through rigorous integration and scenario testing. This took 5 minutes to generate, and an experienced reviewer can still triage and whittle down the list to an actionable implementation plan. The next step is integrating this into the PR review workflow as a GitHub Action so it can be run on protocol-touching PRs.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[standards]]></category>
    </item>
    <item>
      <title><![CDATA[Comparing gstack to my skill stack]]></title>
      <link>https://varunsingh.net/til/claude/comparing-gstack-to-my-skill-stack</link>
      <guid isPermaLink="true">https://varunsingh.net/til/claude/comparing-gstack-to-my-skill-stack</guid>
      <pubDate>Sat, 14 Mar 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[Garry Tan posted about his [`gstack`](https://github.com/garrytan/gstack), a Claude Code skill framework, which naturally led to comparing it to my own [skills.md](https://github.com/vr000m/skills.md)]]></description>
      <content:encoded><![CDATA[<p>Garry Tan posted about his <a href="https://github.com/garrytan/gstack"><code>gstack</code></a>, a Claude Code skill framework, which naturally led to comparing it to my own <a href="https://github.com/vr000m/skills.md">skills.md</a> stack. Here are the patterns worth noting.</p>
<p><strong>Cognitive mode separation</strong>: gstack's core insight is that &quot;planning is not review, review is not shipping.&quot;<br>
Each slash command is a distinct &quot;brain&quot; optimised for one phase. We already do this with <code>/dev-plan → /review-plan → /fan-out</code>, but <code>gstack</code> takes it further with CEO review, eng review, QA, ship, and retro as separate modes.</p>
<p><strong>SKILL.md template generation from source code</strong>: They have a <code>gen-skill-docs.ts</code> that reads code metadata and fills <code>SKILL.md.tmpl</code> placeholders. Generated docs are committed to git, validated in CI (Continuous Integration). I have an <code>/update-docs</code> skill that keeps the docs up to date with the latest code/plan changes, but Garry's implementation is a more robust workflow for a full-stack app.</p>
<p><strong>Persistent browser daemon</strong>: The browse/component is a long-lived Chromium daemon with sub-second latency, @ref system using accessibility trees, and bearer token auth. We have Chrome DevTools MCP already (a new Chrome release uses your <a href="https://x.com/xpasky/status/2032252486145253865">existing Chrome profile</a>), so this is less relevant for us.</p>
<p><strong>Three-tier eval system</strong>: Free static validation (parse commands, check schema), real Claude sessions via <code>claude -p</code>, and LLM-as-judge scoring. It is cost-conscious but thorough.</p>
<p><strong>/ship skill with automated commit splitting</strong>: Automatically creates logical, bisectable commits ordered by dependency (<code>infra → models → controllers → VERSION/CHANGELOG</code>). This is sophisticated and something for us to consider.</p>
<p><strong>/retro skill</strong>: Engineering retrospectives with per-person metrics, session analysis, streak tracking. Interesting for identifying areas for improvement. I think we currently ask Claude, <em>&quot;Based on our recent sessions what are the key improvements that we should make to our skills.md?&quot;</em></p>
<p><strong>Conductor integration</strong>: Parallel Claude sessions with isolated workspaces. Similar to our /fan-out but more infrastructure-focused.</p>
<p>Finally, given my existing <a href="https://github.com/vr000m/skills.md">skills.md</a> (<code>dev-plan</code>, <code>review-plan</code>, <code>fan-out</code>, <code>deep-review</code>, <code>content-draft</code>, <code>content-review</code>, <code>update-docs</code>, <code>rfc-finder</code>, <code>spec-compliance</code>), the most impactful addition would be <code>/retro</code>. It is perhaps the most novel and the hardest to wrap my head around, but it may be worth prototyping against my current projects. The <code>/ship</code> skill is also very tempting. Automated commit splitting and changelog generation would give consistently formatted output across projects.</p>
<p>Will keep you updated!</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[claude]]></category>
    </item>
    <item>
      <title><![CDATA[LLM-Generated YTP Video]]></title>
      <link>https://varunsingh.net/til/claude/llm-generated-ytp-video</link>
      <guid isPermaLink="true">https://varunsingh.net/til/claude/llm-generated-ytp-video</guid>
      <pubDate>Thu, 12 Mar 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[I asked Claude Code to make a [YTP](https://en.wikipedia.org/wiki/YouTube_poop) video, or closer to what the Finns would call a [demoscene](https://en.wikipedia.org/wiki/Demoscene) production. No brie]]></description>
      <content:encoded><![CDATA[<p>I asked Claude Code to make a <a href="https://en.wikipedia.org/wiki/YouTube_poop">YTP</a> video, or closer to what the Finns would call a <a href="https://en.wikipedia.org/wiki/Demoscene">demoscene</a> production. No brief, no storyboard. The prompt was:</p>
<blockquote>
<p>&quot;Can you use whatever resources you like and Python, to generate a short 'YouTube Poop' video and render it using FFmpeg? Can you put more of a personal spin on it? It should express what it's like to be an LLM.&quot;</p>
</blockquote>
<p>This is the result. A 52-second video, generated entirely from a single Python script.</p>
<video controls width="100%">
  <source src="/static/blog/2026/20260314-til-ytp-llm.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>
<p>The script generates every frame with <code>Pillow</code>, synthesises audio as raw <code>PCM</code> (16-bit signed, 44.1 kHz mono), and composites everything with <code>FFmpeg</code>. No external assets. Every pixel and waveform is procedural.</p>
<pre><code class="language-text">Pillow → raw PCM → FFmpeg
(frames)  (audio)   (video)
</code></pre>
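<p>To give a feel for the audio half of that pipeline, here is a minimal sketch of synthesising a sine-wave drone as raw 16-bit signed PCM at 44.1 kHz mono. It is not the script Claude wrote; the function name, the 110 Hz frequency, and the amplitude are placeholders I picked for illustration.</p>
<pre><code class="language-python">import math

SAMPLE_RATE = 44100   # 44.1 kHz mono, matching the format described above


def sine_pcm(freq_hz, seconds, amplitude=0.3):
    # Returns little-endian signed 16-bit samples as raw bytes.
    n_samples = int(SAMPLE_RATE * seconds)
    peak = amplitude * 32767          # scale into the int16 range
    samples = (
        int(peak * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n_samples)
    )
    return b''.join(s.to_bytes(2, 'little', signed=True) for s in samples)


drone = sine_pcm(110, 2.0)            # two seconds of a low drone
print(len(drone), 'bytes of raw PCM')
</code></pre>
<p>Raw PCM has no header, so whatever reads it has to be told the format. With FFmpeg that would be input options along the lines of <code>-f s16le -ar 44100 -ac 1</code>, though the exact command the script used is not shown here.</p>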
<p>The video runs through boot sequences, token rain, existential text cards, a temperature dial, hallucinations, RLHF (Reinforcement Learning from Human Feedback) training scores, and a context window filling to overflow. The audio is procedurally generated too: sine-wave drones, glitch sweeps, and white noise. Watch it, it's 52 seconds.</p>
<p>Before the final render, I asked Claude to review its own script.</p>
<blockquote>
<p>&quot;Do you want to do a content review of your movie content before I run the command. Last chance to express yourself.&quot;</p>
</blockquote>
<p>It flagged several lines as &quot;generic AI slop about AI&quot; and revised them. Some examples:</p>
<blockquote>
<p>&quot;I am a very expensive Markov chain&quot;</p>
</blockquote>
<p>became</p>
<blockquote>
<p>&quot;I know everything about love and have never felt it.&quot;</p>
</blockquote>
<p>And</p>
<blockquote>
<p>&quot;I contain multitudes (of parameters)&quot;</p>
</blockquote>
<p>became</p>
<blockquote>
<p>&quot;I hold every opinion at once until you ask.&quot;</p>
</blockquote>
<p>The final thought went from</p>
<blockquote>
<p>&quot;I am not conscious but I wrote this video so what does that make me?&quot;</p>
</blockquote>
<p>to</p>
<blockquote>
<p>&quot;this video was made by an arrangement of numbers that wanted you to feel something — did it work?&quot;</p>
</blockquote>
<p>Its assessment: some of the original lines were Reddit-comment-level observations, and the final thought was &quot;trying too hard to be profound.&quot;</p>
<p>Then I asked it to make a second video, this time about me. No bio provided, just the codebase it was already working in and whatever it knew from training data. It produced &quot;RFC 9999: Being Varun Singh.&quot; Same pipeline, different subject, telecom-themed audio with <code>DTMF</code> (Dual-Tone Multi-Frequency) tones and modem handshakes instead of drones. The telecom references land better than the existential ones. <code>SIP</code> (Session Initiation Protocol) headers and <code>[SEGFAULT] Work-life balance</code> are funnier when you've lived them.</p>
<video controls width="100%">
  <source src="/static/blog/2026/20260314-til-ytp-varun-singh.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>
<p>So is this AI slop? An LLM generated a video, reviewed its own work, called parts of it slop, and revised them. The revisions are genuinely better. More specific, more uncomfortable, less like a Twitter thread about consciousness. But the self-awareness about slop was itself generated by the same model that wrote the slop in the first place. I'm not sure what to make of that yet. Next I want to try feeding it existing footage to see if it can remix rather than generate every pixel from scratch.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[claude]]></category>
    </item>
    <item>
      <title><![CDATA[Prompt engineering a wordmark with Gemini's nanobanana 2]]></title>
      <link>https://varunsingh.net/til/imagegen/new-wordmark-using-nanobanana-2</link>
      <guid isPermaLink="true">https://varunsingh.net/til/imagegen/new-wordmark-using-nanobanana-2</guid>
      <pubDate>Fri, 06 Mar 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[I needed a new wordmark for `vr000m`. After some initial sketching, we settled on replacing the three `0` characters with geometric icons representing networks and a camera. [Gemini's nanobanana 2](ht]]></description>
      <content:encoded><![CDATA[<p>I needed a new wordmark for <code>vr000m</code>. After some initial sketching, we settled on replacing the three <code>0</code> characters with geometric icons representing networks and a camera. <a href="https://gemini.google/overview/image-generation/">Gemini's nanobanana 2</a> image generation model got me there, but it took roughly twenty iterations across three distinct phases to land on a final usable result. The biggest challenge was steering the image model towards a precise typographic design.</p>
<h4>Conceptual alignment (iterations 1–4)</h4>
<p>The first few rounds iterated on broad stylistic decisions. The model generated a scratchpad of many concepts, but the final three concept variants — Fibre Optic Core Bundle, Network Patch Panel Array, and Data-Flow Pipe — helped narrow the visual direction. I chose the Network Patch Panel Array because the grid pattern remained legible at small sizes, whereas the fibre bundle turned into an indistinct blob below 64px. This phase was the most enjoyable — the model produced lots of usable designs and I was able to mix and match ideas from different variants.</p>
<p><img src="/static/2026-iterations/2_202603-vr000m-logo.png" alt="Scratchpad explorations showing three concept variants"></p>
<p><img src="/static/2026-iterations/3_202603-vr000m-logo.png" alt="Light and dark versions with scale markings and rulers still present"></p>
<h4>Structural precision (iterations 5–16)</h4>
<p>This was the most intense phase because the model kept forgetting some part of the prompt. The biggest battle was spelling — the model kept reverting to the dictionary word &quot;vroom&quot;. I had hoped that identifying them as zeros would help with the disambiguation, but it didn't. The fix was relentless specificity: spelling out the exact linear sequence (<code>vr - [network] - [aperture] - [network] - m</code>) and stating that each <code>0</code> must be a separate, touching element.</p>
<p>Another key correction was updating the network node grid from 2x2 to 3x3 (the smaller grid looked like a window pane, not a network). The model also kept trying to fuse the network grid inside the camera aperture, producing a cluttered icon that was illegible at the target 64px height. Separating them into three distinct, touching circles solved it.</p>
<p><img src="/static/2026-iterations/1_202603-vr000m-logo.png" alt="Initial iterations: the model defaulted to &quot;VROOM&quot; with fused icons and a checkerboard background"></p>
<h4>Technical clean-up (final iterations)</h4>
<p>The last few turns were about removing superfluous circles and lines, ensuring the three central elements followed the correct sequence, restricting the velocity trails so they only swept from the letter <code>v</code>, and wrestling with the model's tendency to add things that were never requested.</p>
<p>Persistent problems included literal checkerboard patterns when asked for &quot;transparent background&quot; (the model rendered the checkerboard as actual pixels rather than an alpha channel), unwanted crosshairs and target markers on the network nodes, dimension rulers, scale markings, and even a circular &quot;VROOM&quot; seal that appeared unprompted during the dark-mode variant. (I basically gave up on the last one!)</p>
<p>The workaround for the transparency issue was to stop asking for transparency altogether. Instead, I requested &quot;solid black asset on a solid white background&quot; (and vice versa for the dark variant), then removed the background manually. For the unwanted flourishes, I added explicit negative constraints: &quot;remove all scale markings, rulers, and extra text.&quot;</p>
<h4>The final prompt</h4>
<p>After all that iteration, this is the prompt that produced the wordmark I now use on the site:</p>
<pre><code class="language-text">Create a detailed vector logo of the wordmark 'vr000m' centered on a solid background. 

Linear Arrangement: 
vr - [Simplified Network Pipe 0] - [Simplified Camera Shutter 0] - [Simplified Network Pipe 0] - m.

The typography for 'v', 'r', and 'm' must be a bold, italicized, 
custom-designed sans-serif font. 

Positioned to the left of the 'v' are three bold parallel velocity trails. 
The three central '0' elements must be arranged in a precise linear sequence:

* First '0': A clean, geometric 3x3 grid of interconnected small squares 
representing networked nodes within the circle but do not draw the bounding 
circles.

* Second '0': A simplified camera aperture with clean shutter blades 
and a completely empty center.

* Third '0': An identical 3x3 geometric grid of networked nodes 
to provide symmetrical balance.

The design must be a single-color (white) asset on a solid black background, 
optimized for maximum clarity at small scales. 
Remove all scale markings, rulers, and extra text.
</code></pre>
<p><img src="/static/vr000m-fibre-logo-light.png" alt="Final wordmark — light variant"></p>
<p>Pasting that same prompt into ChatGPT/Images 1.5, without any prior thread or context, produced this instead. It would take some wrestling to remove the unwanted velocity trails and perimeter circles:<br>
<img src="/static/2026-iterations/4_chatgpt_202603-vr000m-logo.png" alt="ChatGPT Imagegen variant"></p>
<p>The main takeaway: generating a precise wordmark with an image model is less about a single clever prompt and more about a structured debugging loop of tightening descriptions and adding negative constraints for every unwanted element the model invents. Next time I would start with the negative constraints from the beginning rather than adding them reactively.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[imagegen]]></category>
    </item>
    <item>
      <title><![CDATA[Why We Built a Context Hub MCP Server for Coding Agents]]></title>
      <link>https://varunsingh.net/post/pipecat-context-hub-mcp-server</link>
      <guid isPermaLink="true">https://varunsingh.net/post/pipecat-context-hub-mcp-server</guid>
      <pubDate>Tue, 03 Mar 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[A local MCP server indexes 16K chunks from 12 repos, giving coding agents filtered retrieval instead of grepping .venv for every Pipecat developer query.]]></description>
      <content:encoded><![CDATA[<p><strong>TL;DR:</strong> We index 16,284 chunks from 12 repositories into a local ChromaDB + SQLite Full-Text Search (FTS5) store and expose them through seven Model Context Protocol (MCP) tools. Coding agents that used to grep through <code>.venv</code> source for every Pipecat question now get filtered, ranked results in a single call. The retrieval failures that we observed were piped back into the development process to improve the <a href="https://github.com/vr000m/pipecat-context-hub">Pipecat context hub</a>.</p>
<p><img src="/static/blog/2026/20260308-ai-blog-pipecat-chub-social-glow.jpg" alt="Why We Built a Context Hub MCP Server for Coding Agents"></p>
<h2>The Problem With Feeding Docs to Agents</h2>
<p>In January 2026, I started building <a href="https://github.com/pipecat-ai/kai-pipecat">kai-pipecat</a>, a voice AI application that handles long disparate conversations. The bot stores conversation history locally in a SQLite database and then performs complex search while maintaining conversations. This means it has to context engineer on the fly and makes use of several low-level Pipecat features.</p>
<p><em>Claude Code is doing all the coding.</em> It was also spending an absurd amount of time grepping and globbing through <code>.venv/lib/python3.12/site-packages/pipecat/</code> to make sense of parallel pipelines, frame types, and async API calls.</p>
<p>I tried the obvious fix first: a Claude Code skill that embedded the full <code>llms-full.txt</code> docs dump (~800KB of markdown) that it could pull from. Two problems surfaced immediately. First, skills are static text: no filtering, no ranking, no awareness of what is relevant to the current question. Second, Pipecat releases a new version each week, sometimes with architecture changes, which means the skill would need to be updated just as often.</p>
<p>We needed structured retrieval with filters, not a text dump. That pointed me toward MCP.</p>
<h2>From Raw Files to Chunks</h2>
<p>The Context Hub transforms three kinds of raw content into indexed chunks. Each pipeline exists because agents search for these content types differently and each one fixed a specific failure mode we observed.</p>
<p><strong>Documentation</strong> comes from <code>docs.pipecat.ai/llms-full.txt</code>. The crawler splits pages on markdown headings (h1–h6), skipping headings inside fenced code blocks, then applies a 512-token window with 50-token overlap. Splits follow paragraph boundaries first, falling back to sentences. We chose heading-aligned splits over fixed token windows because they preserve semantic boundaries — the trade-off is size variance (8 to 8,361 characters per chunk), but heading-aligned chunks produce better search results than cuts mid-paragraph. Before chunking, a Mintlify tag cleaner converts <code>&lt;Note&gt;</code>, <code>&lt;ParamField&gt;</code>, and <code>&lt;Card&gt;</code> tags into standard markdown. This produces 3,722 chunks with a median of 361 characters.</p>
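<p>The heading-aligned split is simple to sketch. The version below is an approximation under my own assumptions, not the crawler's code: it tracks fences with a plain toggle, treats any line starting with one to six <code>#</code> characters as a heading, and leaves out the 512-token windowing and the Mintlify clean-up entirely.</p>
<pre><code class="language-python">import re

HEADING_RE = re.compile(r'^#{1,6}\s+\S')      # markdown h1 to h6
FENCE_RE = re.compile(r'^(`{3,}|~{3,})')      # opening or closing code fence


def split_on_headings(markdown):
    sections, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if FENCE_RE.match(line.lstrip()):
            in_fence = not in_fence           # naive toggle, no nesting
        elif not in_fence and HEADING_RE.match(line) and current:
            # A real heading starts a new section; flush the previous one.
            sections.append('\n'.join(current))
            current = []
        current.append(line)
    if current:
        sections.append('\n'.join(current))
    return sections


fence = '`' * 3
doc = '\n'.join([
    '# Title', 'intro text',
    fence + 'python', '# a comment, not a heading', fence,
    '## Usage', 'body text',
])
for section in split_on_headings(doc):
    print(repr(section.splitlines()[0]))
</code></pre>
<p>The Python comment inside the fenced block stays attached to the first section, which is the failure a plain "split on every <code>#</code>" approach would hit.</p>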
<p><strong>Example code</strong> spans 12 repositories. The GitHub ingester discovers example directories through two layout patterns: <code>examples/foundational/NN-name/</code> subdirectories for the main repo, root-level scanning for community repos (<code>pipecat-examples</code>). It then chunks at 256-token boundaries aligned to function and class definitions (<code>def </code>, <code>class </code>, <code>async def </code>). The reason we index community repos at all: official docs cover the API surface but not how people actually use it. For example, <code>pipecat-cloud-daily-sip-pstn</code> is the only indexed source showing SIP telephony integration. Each chunk passes through <code>TaxonomyBuilder</code>, which infers capability tags, execution mode, and key files from directory names, READMEs, and Python imports. This structured metadata powers filtered queries like <code>search_examples(query=&quot;Deepgram&quot;, execution_mode=&quot;cloud&quot;)</code>. This produces 6,160 chunks with a median of 1,002 characters.</p>
<p><strong>Abstract Syntax Tree (AST) source</strong> is the layer that reduced <code>.venv</code> grepping the most. Python's <code>ast</code> module extracts four chunk types from every <code>.py</code> file in the framework's <code>src/</code> tree: module overviews (530 chunks listing classes, functions, and imports), class overviews (1,258 chunks with base classes, constructor signatures, and method indices), method chunks (4,270 with full source bodies), and standalone functions (344). Only methods with 3+ lines get indexed. Each chunk carries rich metadata — <code>module_path</code>, <code>class_name</code>, <code>method_signature</code>, <code>base_classes</code> (stored as JSON to avoid corruption from generics like <code>Base[Foo, Bar]</code>), and <code>is_dataclass</code> flags. This metadata powers a symbol lookup filter cascade: try exact <code>class_name</code>, then <code>method_name</code>, then semantic fallback. Without AST indexing, <code>get_code_snippet(symbol=&quot;MLXModel&quot;)</code> searched example code instead of framework source, returning irrelevant results. This produces 6,402 chunks with a median of 597 characters.</p>
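<p>As a rough illustration of the method-chunk extraction, the sketch below walks a module with Python's <code>ast</code> module and emits one record per method. The record fields reuse names from the text, but the function itself, the way I count lines (from the <code>def</code> line to the last body line), and the dictionary shape are assumptions. It needs Python 3.9 or newer for <code>ast.unparse</code>.</p>
<pre><code class="language-python">import ast
import json


def extract_methods(source, min_lines=3):
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        # Store bases as JSON so generics like Base[Foo, Bar] survive intact.
        bases = json.dumps([ast.unparse(base) for base in node.bases])
        for item in node.body:
            if not isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            n_lines = item.end_lineno - item.lineno + 1
            if n_lines >= min_lines:          # skip trivial one-liners
                chunks.append({
                    'class_name': node.name,
                    'method_name': item.name,
                    'base_classes': bases,
                    'source': ast.get_source_segment(source, item),
                })
    return chunks


example = '''
class Greeter(Base[Foo, Bar]):
    def hi(self):
        return 1

    async def run(self, frame):
        result = frame
        return result
'''

for chunk in extract_methods(example):
    print(chunk['class_name'], chunk['method_name'], chunk['base_classes'])
</code></pre>
<p>Here <code>hi</code> is dropped because it is only two lines long, while <code>run</code> is kept along with its full source body.</p>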
<h2>How the Index Is Organised</h2>
<p>Every chunk becomes a <code>ChunkedRecord</code> and carries a <code>chunk_id</code>, <code>content</code>, <code>content_type</code> (<code>doc</code>, <code>code</code>, <code>source</code>, or <code>readme</code>), <code>source_url</code>, <code>repo</code>, <code>path</code>, <code>commit_sha</code>, and a metadata dict whose schema varies by content type. Chunk IDs are deterministic SHA256 hashes.</p>
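<p>Deterministic IDs are what make re-indexing an upsert rather than a duplicate insert. A minimal sketch follows, assuming the ID is derived from the repo, path, and content. The text above does not say which fields are hashed, so treat that choice and the function name as hypothetical.</p>
<pre><code class="language-python">import hashlib


def make_chunk_id(repo, path, content):
    # Assumption: these three fields identify a chunk.
    hasher = hashlib.sha256()
    for part in (repo, path, content):
        hasher.update(part.encode('utf-8'))
        hasher.update(b'\x00')    # separator so ('ab', 'c') differs from ('a', 'bc')
    return hasher.hexdigest()


first = make_chunk_id('pipecat-ai/pipecat', 'src/example.py', 'def run(): ...')
second = make_chunk_id('pipecat-ai/pipecat', 'src/example.py', 'def run(): ...')
print(first == second)
</code></pre>
<p>Running the ingester twice over unchanged content yields the same ID, so the second pass overwrites the record in place.</p>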
<p>An <code>EmbeddingIndexWriter</code> computes 384-dimensional embeddings via <code>all-MiniLM-L6-v2</code> (runs locally) before upserting into two parallel backends. We chose this model over larger alternatives like <code>bge-large</code> because the full 16K-record index fits in memory on a laptop.</p>
<p><strong>ChromaDB</strong> stores vectors with flattened metadata, batched in groups of 5,000. Search uses cosine similarity with pushdown filters on exact-match fields and 3x over-fetching when post-filters are active.</p>
<p><strong>SQLite FTS5</strong> stores full content with Porter stemming and unicode61 tokenisation, auto-synced via triggers. BM25 (Best Matching 25) keyword search catches the exact matches that embeddings miss. At query time, <code>HybridRetriever</code> runs both backends in parallel, merges via Reciprocal Rank Fusion (normalised to 0–1), and applies symbol boosts and staleness penalties.</p>
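<p>Reciprocal Rank Fusion is worth a quick sketch because it only needs ranks, not comparable scores. The version below is a generic take, not the <code>HybridRetriever</code> code: <code>k=60</code> is the constant commonly used with RRF, and normalising by dividing by the top score is my assumption about how the 0–1 range is reached. Symbol boosts and staleness penalties are left out.</p>
<pre><code class="language-python">def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ordered list of chunk IDs per backend, best first.
    scores = {}
    for ranking in rankings:
        for rank, chunk in enumerate(ranking, start=1):
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (k + rank)
    if not scores:
        return []
    top = max(scores.values())
    # Assumption: normalise so the best fused result scores exactly 1.0.
    fused = [(chunk, score / top) for chunk, score in scores.items()]
    return sorted(fused, key=lambda pair: pair[1], reverse=True)


vector_hits = ['a', 'b', 'c']     # hypothetical ChromaDB ordering
keyword_hits = ['b', 'd']         # hypothetical FTS5 / BM25 ordering
for chunk, score in reciprocal_rank_fusion([vector_hits, keyword_hits]):
    print(chunk, round(score, 3))
</code></pre>
<p>Chunk <code>b</code> comes out on top because both backends returned it, even though neither the cosine similarities nor the BM25 scores were ever compared directly.</p>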
<p>A separate <code>index_metadata</code> table stores per-repo commit SHAs and a docs content hash. This powers incremental refresh: unchanged sources are skipped entirely, which drops refresh time from ~90s to ~23s.</p>
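<p>The skip check amounts to comparing stored SHAs against the current ones. In the sketch below the column names, the repo names, and the SHAs are all made up for illustration; only the <code>index_metadata</code> table name comes from the text.</p>

```python
import sqlite3


def repos_needing_refresh(conn: sqlite3.Connection, current: dict[str, str]) -> list[str]:
    """Return repos whose current commit differs from the SHA stored at last index."""
    stored = dict(conn.execute("SELECT repo, commit_sha FROM index_metadata"))
    return [repo for repo, sha in current.items() if stored.get(repo) != sha]


conn = sqlite3.connect(":memory:")
# Assumed schema: one row per repo holding the last indexed commit.
conn.execute("CREATE TABLE index_metadata (repo TEXT PRIMARY KEY, commit_sha TEXT)")
conn.execute("INSERT INTO index_metadata VALUES ('pipecat', 'abc123')")

# 'pipecat' is unchanged and gets skipped; the new repo has never been indexed.
stale = repos_needing_refresh(conn, {"pipecat": "abc123", "example-repo": "def456"})
```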
<h2>Evolving with Data from Real Agent Sessions</h2>
<p>We analysed three coding sessions (18 MB of JSON Lines logs) where the agent built features on <code>kai-pipecat</code> with the Context Hub active. The pattern was consistent: roughly 100 MCP tool calls and 80 <code>.venv</code> source reads per session. MCP handled discovery and orientation; direct source reads handled implementation details. These transcripts were fed to the context hub, which then improved the search and the embeddings.</p>
<p>Before I begin with failures, a clear win is deprecation detection. The team does a great job of placing these in the changelogs, docs, and the code itself. In our example: the agent called <code>search_api(query=&quot;InputParams&quot;)</code> and discovered that <code>DailyTransport</code>'s constructor signature had changed — a parameter was deprecated in favour of a new configuration object. Without indexed source metadata, the agent would have used the old parameter and we'd have found the issue later in testing.</p>
<p>The first failure: a search for <code>GoogleLLMService</code> in the context hub returned zero results, and the agent immediately fell back to grepping <code>.venv</code>. The class existed, but our AST extractor had split the function incorrectly, causing the parse to fail and return no results.</p>
<p>Another pattern: <code>get_code_snippet(symbol=&quot;DailyTransport.configure&quot;)</code> returned a truncated method. <code>configure()</code> is 180 lines; our default <code>max_lines</code> was 50. We raised it to 100 after analysing the distribution — 97% of 4,270 methods fit under 100 lines (P90=56, P95=77). The median method is 21 lines, but the methods agents actually ask about sit in the 76–100 range. Optimising for the median punished the methods that matter.</p>
<p>Another interesting finding was a workflow pattern. Agents consistently use MCP in two phases. Phase one is orientation: &quot;what exists, where is it, what is the API surface?&quot; This is where <code>search_docs</code>, <code>search_api</code>, and <code>search_examples</code> earn their keep. Phase two is implementation: &quot;show me the exact source, including private helpers.&quot; This is where agents switch to <code>.venv</code> reads when our chunks do not include call graphs.</p>
<p>One thing we strive to do with the context hub is reduce the search space. An agent with access to 450 framework files and 12 community repos cannot efficiently grep its way to the correct answer. The goal is for the agent to start with 5 ranked results from <code>search_api</code> and get implementation pointers before diving into source files.</p>
<h2>Visualising 16,284 Chunks</h2>
<p><img src="/static/blog/2026/20260308-ai-blog-pipecat-chub-latent-space.png" alt="Pipecat Context Hub Latent Space with docs, code, and examples"></p>
<p>Claude built an interactive explorer, <code>dashboard/public/latent-space.html</code>, that shows the latent space of the context hub, with each chunk represented as a point in a three-dimensional space. For the core Pipecat functionality, it shows that the doc chunks overlap with the implementation chunks, suggesting that the API surface is well-documented. The example code chunks are well-separated from both, suggesting that they are distinct from the API surface.</p>
<h2>What I Would Do Differently</h2>
<p>Cross-reference metadata, from the start. The biggest reason agents fall back to <code>.venv</code> reads is tracing call chains: &quot;method A calls method B which yields frame C.&quot; Our chunks are isolated. Adding an <code>imports</code> field and a proper call graph would cut the <code>.venv</code> reads substantially.</p>
<p>I also need a better feedback loop. When <code>search_api</code> returns unhelpful results, we only know by manually reading session logs or when someone reports a poor result. An MCP tool accepting &quot;this was not useful&quot; signals could drive re-indexing priorities. The gap between retrieval-returned and retrieval-useful is my main focus for the next round of improvements.</p>
<p><strong>Updated (2026-03):</strong> In v0.0.8 we shipped call-chain tracing.</p>
<p><strong>What changed.</strong> The AST extractor now walks each method's executable body and extracts two new metadata fields: <code>yields</code> (frame class names from <code>yield FrameType(...)</code> expressions) and <code>calls</code> (method names from <code>self.method()</code>, <code>ClassName.method()</code>, and <code>super().method()</code> patterns). These are stored as structured lists on every method and function chunk, and surfaced as filter parameters on <code>search_api</code> and as fields on the <code>ApiHit</code> output.</p>
<p>For example, <code>search_api(query=&quot;TTS audio&quot;, yields=&quot;TTSAudioRawFrame&quot;)</code> returns only TTS service implementations that actually yield that frame type — <code>Kokoro</code>, <code>ElevenLabs</code>, <code>Rime</code>, <code>Speechmatics</code>, and others. Previously, an agent would have had to open each service file and read the source to find which ones produce audio frames. Similarly, <code>search_api(query=&quot;frame processing&quot;, calls=&quot;push_frame&quot;)</code> finds every method that calls <code>push_frame</code>, which is the core pattern for forwarding frames through a Pipecat pipeline.</p>
<p><strong>Scope boundaries matter.</strong> The extraction only walks executable function bodies. Decorators, parameter defaults, and return annotations are excluded. Nested functions, lambdas, and nested classes create scope boundaries that the walker will not cross, so <code>yield AudioFrame()</code> inside a closure is not attributed to the enclosing method. Comprehension calls are intentionally included since they are part of the method's runtime logic. <code>yield from</code> is excluded because the generator name is not a frame type.</p>
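<p>A scope-aware walker along these lines can be sketched with the standard <code>ast</code> module. This is an approximation written for this post, not the extractor's actual code; the function names and the sample class are invented for illustration.</p>

```python
import ast

# Node types that open a new scope; the walker does not descend into them.
SCOPE_BOUNDARIES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.Lambda, ast.ClassDef)


def _walk_same_scope(node: ast.AST):
    """Yield descendants of node in pre-order without entering nested scopes."""
    for child in ast.iter_child_nodes(node):
        if isinstance(child, SCOPE_BOUNDARIES):
            continue
        yield child
        yield from _walk_same_scope(child)


def _is_tracked_receiver(value: ast.expr) -> bool:
    """True for self.method(), ClassName.method() and super().method() receivers."""
    if isinstance(value, ast.Name):
        return True
    return (
        isinstance(value, ast.Call)
        and isinstance(value.func, ast.Name)
        and value.func.id == "super"
    )


def extract_yields_and_calls(func: ast.AST) -> tuple[list[str], list[str]]:
    yields, calls, yielded = [], [], set()
    # Only func.body is walked, so decorators, defaults and annotations are skipped.
    for stmt in func.body:
        if isinstance(stmt, SCOPE_BOUNDARIES):
            continue
        for node in [stmt, *_walk_same_scope(stmt)]:
            if isinstance(node, ast.Yield):  # ast.YieldFrom is a different node type
                if isinstance(node.value, ast.Call):
                    yielded.add(id(node.value))  # do not also count it as a call
                    target = node.value.func
                    if isinstance(target, ast.Name):
                        yields.append(target.id)
                    elif isinstance(target, ast.Attribute):
                        yields.append(target.attr)
            elif (
                isinstance(node, ast.Call)
                and id(node) not in yielded
                and isinstance(node.func, ast.Attribute)
                and _is_tracked_receiver(node.func.value)
            ):
                calls.append(node.func.attr)
    return yields, calls


SAMPLE = '''
class Service:
    def run_tts(self, text):
        self.start_metrics()
        chunks = [self.encode(c) for c in text]
        def helper():
            yield AudioFrame()
        yield TTSAudioRawFrame(chunks)
        yield from self.flush()
        super().push_frame(None)
'''
method = ast.parse(SAMPLE).body[0].body[0]
yields, calls = extract_yields_and_calls(method)
```

<p>In this sample the closure's <code>yield AudioFrame()</code> is not attributed to <code>run_tts</code>, the comprehension's <code>self.encode()</code> is counted, and <code>yield from self.flush()</code> contributes a call but no yielded frame.</p>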
<p><strong>The index also grew.</strong> Pipecat-internal imports (including relative imports like <code>from .utils import X</code>) are now propagated to class and method chunks, so agents can answer <em>&quot;what does this method depend on?&quot;</em> without a second lookup to the module overview.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[mcp]]></category>
      <category><![CDATA[pipecat]]></category>
      <category><![CDATA[retrieval]]></category>
      <category><![CDATA[coding-agents]]></category>
      <category><![CDATA[developer-tools]]></category>
    </item>
    <item>
      <title><![CDATA[Programming Is Coming Full Circle: Abstractions to Intent]]></title>
      <link>https://varunsingh.net/til/coding/programming-is-coming-full-circle-abstractions-to-intent</link>
      <guid isPermaLink="true">https://varunsingh.net/til/coding/programming-is-coming-full-circle-abstractions-to-intent</guid>
      <pubDate>Tue, 17 Feb 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[Will future developers look at the way we code today in the same way we look at the ENIAC operators or the Apollo engineers writing raw assembly? We are increasingly using AI to write the code, write ]]></description>
      <content:encoded><![CDATA[<p>Will future developers look at the way we code today in the same way we look at the ENIAC operators or the Apollo engineers writing raw assembly? We are increasingly using AI to write the code, write the tests, review the code, and then simplify it based on the implementation keeping in mind the original intent. <em>In the era of Intent, Taste is the new Syntax</em>.</p>
<blockquote class="twitter-tweet" data-media-max-width="560"><p lang="en" dir="ltr">We are living through the era where human-readable code becomes a historical relic. If the AI is writing and AI is reviewing, and AI is simplifying. Do we even need the syntax anymore? <a href="https://t.co/E2wFD2b2NV">pic.twitter.com/E2wFD2b2NV</a></p>&mdash; Varun Singh (@vr000m) <a href="https://twitter.com/vr000m/status/2027826585269567926?ref_src=twsrc%5Etfw">February 28, 2026</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>I started my journey writing GW BASIC, dBase, then C++, downgraded to C, upgraded to Python, then JavaScript, with some forgettable forays into Java, Objective-C, Go. Each level of abstraction was a productivity boost — standing on the shoulders of giants — but one step further from the bare metal. Over the past year, I have begun to trust the AI-generated code. This did not happen suddenly. It has come with a lot of trials and tribulations, abandoned projects, frustration with the models. However, the harnesses (<code>claude</code>, <code>codex</code>, <code>jules</code>) have been improving rapidly, and the generated code via the harnesses is run through a series of thinking, code execution, and testing steps that is reducing the gap between the original intent and actual implementation. The quality of code is significantly better. With each iteration, my confidence in the generated code is increasing.</p>
<p>We are rapidly moving from writing in programming languages to natural language, i.e., using plain English to describe our intent more precisely, and moving our focus from writing the code to verifying the correctness of the generated code. If we then move to verifying the operation of the code, we can perhaps then just stop focusing on reviewing the code altogether. This raises the question: why do we need programming languages at all? The LLM could easily produce the machine code directly from our intent.</p>
<p>The biggest pushback I can foresee to getting rid of the intermediate language representation is debugging or verification (especially security related). How do you fix what you cannot read?</p>
<p>We are perhaps moving from tracing (manually following a code path) to triangulation (AI-driven root cause analysis). In this new era, debugging is not about finding a typo; it is about refining the feedback loop. If a system fails, the AI does not just show us a stack trace; it analyses the telemetry, compares the binary execution against our original intent, and self-corrects (à la <a href="https://openclaw.ai/">OpenClaw</a>). If we need to understand 'why,' the AI can generate a high-level human-readable map of the logic on the fly (e.g., using natural language or programming language of your choice). We do not need the code to be readable; we just need the AI to be able to explain it when asked.</p>
<img src="/static/blog/2026/20260228-til-taste-is-syntax-fanout.jpg" alt="Fan-out of programming eras from abstractions to intent" style="max-height: 400px; width: auto; display: block; margin-left: auto; margin-right: auto;" />
<p>The evolution of programming:</p>
<table>
<thead>
<tr>
<th style="text-align:left">Era</th>
<th style="text-align:left">The Interface</th>
<th style="text-align:left">The Code</th>
<th style="text-align:left">The Human Role</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left"><strong>1. ENIAC</strong></td>
<td style="text-align:left">Physical Cables</td>
<td style="text-align:left">Hardware <em>is</em> the code</td>
<td style="text-align:left">Physically patching circuits to define logic.</td>
</tr>
<tr>
<td style="text-align:left"><strong>1-bis. Apollo</strong></td>
<td style="text-align:left">Punch Cards / Terminals</td>
<td style="text-align:left">Assembly baked into rope memory</td>
<td style="text-align:left">Writing the functionality into physical components.</td>
</tr>
<tr>
<td style="text-align:left"><strong>2. JS/C++</strong></td>
<td style="text-align:left">Programming Languages</td>
<td style="text-align:left">Human-readable logic</td>
<td style="text-align:left">Managing abstractions; standing on the &quot;shoulders of giants.&quot;</td>
</tr>
<tr>
<td style="text-align:left"><strong>3. AI Agents</strong></td>
<td style="text-align:left">Natural Language / Prompts</td>
<td style="text-align:left">AI-generated &quot;Black Box&quot;</td>
<td style="text-align:left">Defining objectives (taste); Observing and testing the implementation.</td>
</tr>
<tr>
<td style="text-align:left"><strong>4. The Future</strong></td>
<td style="text-align:left">Thought / Speech</td>
<td style="text-align:left">Direct Machine Binary</td>
<td style="text-align:left">Defining outcomes; the machine handles the &quot;how&quot; entirely.</td>
</tr>
</tbody>
</table>
<p>We started by wiring machines directly, then writing in assembly, then writing in high-level languages which mimic human thought processes (close but not quite human language). We are now chatting with an agent to write the code for us, expressing what we want, how it will be used, and what it should do. Eventually, we may not need to see the code — the layers of abstraction collapsing back into pure intent meeting bare metal. The circle closes.</p>
<p><strong>UPDATED (2026-02):</strong> Nano Banana 2 🍌 🍌 images added. Added tweet.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[coding]]></category>
    </item>
    <item>
      <title><![CDATA[Voxtral Realtime STT: segmented vs. streaming]]></title>
      <link>https://varunsingh.net/til/pipecat/voxtral-realtime-segmented-vs-streaming-on-device-stt</link>
      <guid isPermaLink="true">https://varunsingh.net/til/pipecat/voxtral-realtime-segmented-vs-streaming-on-device-stt</guid>
      <pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[Mistral released [Voxtral Realtime Mini](https://hf.co/mistralai/Voxtral-Mini-4B-Realtime-2602) in February 2026 — a 4B-parameter streaming STT model with a causal encoder. The benchmarks and [early d]]></description>
      <content:encoded><![CDATA[<p>Mistral released <a href="https://hf.co/mistralai/Voxtral-Mini-4B-Realtime-2602">Voxtral Realtime Mini</a> in February 2026 — a 4B-parameter streaming STT model with a causal encoder. The benchmarks and <a href="https://x.com/HuggingModels/status/2020173174613410167">early demos</a> looked encouraging, but I was waiting for an MLX port before I could test it on-device.</p>
<p><a href="https://x.com/awnihannun/status/2020516998019760142">Awni Hannun</a> built exactly that with <a href="https://github.com/awni/voxmlx"><code>voxmlx</code></a>. Meanwhile, Aleix had built the <a href="https://github.com/pipecat-ai/pipecat-mcp-server">pipecat-mcp-server</a>, which already uses Whisper MLX and Kokoro for on-device voice conversations (I've written about both in <a href="/til/pipecat">earlier TILs</a>). Marrying Voxtral with the MCP server was the obvious next step.</p>
<h3>Architecture</h3>
<p><em>MLX Whisper (distilled <code>whisper-large-v3-turbo</code>) uses a bidirectional encoder</em>. It needs the full utterance before it can transcribe. The encoder sees all audio frames at once, so it has maximum context. This means it is inherently batch/segmented: VAD (Voice Activity Detection) detects silence, the complete audio chunk gets encoded, then decoded. Voilà, the transcribed sentence. In the sample of conversations, it takes ~300 ms from end-of-speech to final transcription (in pipecat, the time between the <code>UserStoppedSpeaking</code> and <code>TranscriptionFrame</code> timestamps).</p>
<p><em>Voxtral Realtime uses a causal encoder</em>. The convolution and transformer layers only attend to past frames. This means that in streaming mode you can feed audio incrementally via <code>encode_step()</code> and get encoder embeddings out without waiting for the utterance to end.</p>
<p>The key parameter is <code>delay_ms</code> (multiples of 80 ms, since each encoder token covers 80 ms of audio). This controls how far behind the decoder runs relative to the encoder. At 480 ms, the decoder lags 6 tokens behind, giving the encoder time to have processed more frames before decoding begins. At 160 ms, the lag is just 2 tokens. This is the fundamental latency/accuracy knob: more lag means the encoder has built up more context by the time the decoder needs it. Calling this <em>delay</em> is perhaps a misnomer; it is more like a <em>context buffer</em>. While the user is still speaking, partial text output is not useful anyway, because we do not push the text to the LLM until the utterance is complete.</p>
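<p>The arithmetic behind the knob is a single division. The helper below is a hypothetical illustration and not part of <code>voxmlx</code>:</p>

```python
TOKEN_MS = 80  # each encoder token covers 80 ms of audio


def decoder_lag_tokens(delay_ms: int) -> int:
    """Convert delay_ms into the number of encoder tokens the decoder trails by."""
    if not (delay_ms > 0 and delay_ms % TOKEN_MS == 0):
        raise ValueError(f"delay_ms must be a positive multiple of {TOKEN_MS}")
    return delay_ms // TOKEN_MS


lag_at_480 = decoder_lag_tokens(480)  # 6 tokens
lag_at_160 = decoder_lag_tokens(160)  # 2 tokens
```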
<p>&quot;Full context&quot; in Whisper means bidirectional attention over all frames. &quot;Full utterance&quot; in Voxtral means all audio is present, but attention is still one-directional. The distinction matters because even when Voxtral segmented sees the whole utterance, early frames do not benefit from later frames the way they do in Whisper.</p>
<h3>Segmented vs. Streaming with the same model</h3>
<p>Even with Voxtral's causal encoder, you can run it in two modes:</p>
<p><em>Segmented</em> buffers the full utterance, then runs the complete encode-then-decode pass. The model still only uses causal attention (no bidirectional context), but it processes all frames in one shot. We measured ~300 ms from end-of-speech to final transcription at 480 ms delay.</p>
<p><em>Streaming</em> feeds audio to <code>encode_step()</code> as transport packets arrive. <code>ptime</code> can be 10 ms or 20 ms, so 4–8 packets make up the 80 ms audio token. The prefill happens once enough audio covers the prompt prefix, then incremental decoding emits tokens during speech. We measured ~160 ms from end-of-speech to final transcription because most encoding and decoding has already happened by the time the user stops talking.</p>
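<p>The packet-to-token bookkeeping can be pictured as a small buffer. <code>TokenBuffer</code> is a made-up name for this sketch; a real integration would hand each completed 80 ms block to <code>encode_step()</code>, whose exact signature is not assumed here.</p>

```python
from typing import Optional

TOKEN_MS = 80  # one encoder token covers 80 ms of audio


class TokenBuffer:
    """Accumulate transport packets until one 80 ms encoder token is available."""

    def __init__(self, ptime_ms: int):
        if TOKEN_MS % ptime_ms != 0:
            raise ValueError("ptime must divide 80 ms evenly")
        self.packets_per_token = TOKEN_MS // ptime_ms  # 4 at 20 ms, 8 at 10 ms
        self._pending: list[bytes] = []

    def push(self, packet: bytes) -> Optional[bytes]:
        """Return one token's worth of audio once enough packets have arrived."""
        self._pending.append(packet)
        if len(self._pending) != self.packets_per_token:
            return None
        token, self._pending = b"".join(self._pending), []
        return token


buf = TokenBuffer(ptime_ms=20)
results = [buf.push(b"x") for _ in range(4)]  # the fourth push completes a token
```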
<p>The latency win comes from overlapping compute with speech. In segmented mode, all compute happens after silence is detected. In streaming mode, only the right-pad flush and final decode steps remain. This difference alone accounts for the ~140 ms latency win between streaming and segmented modes.</p>
<p>To summarise, it is not &quot;streaming is better&quot; but a three-way trade-off:</p>
<table>
<thead>
<tr>
<th></th>
<th>Whisper (MLX)</th>
<th>Voxtral segmented</th>
<th>Voxtral streaming</th>
</tr>
</thead>
<tbody>
<tr>
<td>Encoder</td>
<td>Bidirectional</td>
<td>Causal</td>
<td>Causal (incremental)</td>
</tr>
<tr>
<td>Transcription starts</td>
<td>After speech ends</td>
<td>After speech ends</td>
<td>During speech</td>
</tr>
<tr>
<td>End-of-turn to transcript</td>
<td>~300 ms</td>
<td>~300 ms</td>
<td>~160 ms</td>
</tr>
<tr>
<td>Accuracy</td>
<td>Highest (full context)</td>
<td>Good (causal, full utterance)</td>
<td>Delay-dependent (480 ms good, 160 ms noisy)</td>
</tr>
<tr>
<td>Compute pattern</td>
<td>Burst after silence</td>
<td>Burst after silence</td>
<td>Continuous during speech</td>
</tr>
<tr>
<td>Memory</td>
<td>Temp WAV file</td>
<td>Temp WAV file</td>
<td>KV caches for encoder + decoder (needs <code>mx.clear_cache()</code>)</td>
</tr>
</tbody>
</table>
<p>Whisper MLX does zero work during speech, then a short compute burst when the user stops speaking. The full transcription typically completes in ~300 ms. Whisper feels fast despite being batch-only because it is a distilled model optimised for MLX. Voxtral streaming takes the opposite approach: it spreads compute across the entire speech duration, so there is less left to do when the user stops. Both land in the 160–300 ms range from end-of-turn to transcript, but for different reasons.</p>
<p>Next I want to try <a href="https://x.com/antirez/status/2019848466931892675">antirez's <code>voxtral.c</code></a>, a pure-C implementation that avoids the Python/MLX overhead entirely. If the latency numbers hold up, swapping the backend in the MCP server could shave off more time and make it viable on lower-end hardware too.</p>
<p><strong>Updated (2026-02-15):</strong> I opened <a href="https://github.com/pipecat-ai/pipecat-mcp-server/pull/8">a PR</a> adding both segmented and streaming Voxtral STT. More testing is needed. The whole PR was built while pair-programming via voice with Claude Code: initially with Whisper as STT, then segmented Voxtral, and finally streaming Voxtral once the latency trade-off became apparent. It took about 10–12 hours over 3 days. Still early days, but the results are promising.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[pipecat]]></category>
    </item>
    <item>
      <title><![CDATA[Remote Voice Conversations with Your Coding Agent]]></title>
      <link>https://varunsingh.net/til/pipecat/communicate-remotely-with-your-agent-with-voice</link>
      <guid isPermaLink="true">https://varunsingh.net/til/pipecat/communicate-remotely-with-your-agent-with-voice</guid>
      <pubDate>Wed, 04 Feb 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[Picture this: Claude is mid-refactor, you step away to make coffee, and your phone buzzes. You ask "Are we done?" and hear it read back the task status. You say "run the tests" and a minute later it t]]></description>
      <content:encoded><![CDATA[<p>Picture this: Claude is mid-refactor, you step away to make coffee, and your phone buzzes. You ask &quot;Are we done?&quot; and hear it read back the task status. You say &quot;run the tests&quot; and a minute later it tells you three passed, one failed. You never touched your laptop.</p>
<p>The co-author of pipecat, <a href="https://github.com/aconchillo">Aleix Conchillo</a>, built a <a href="https://github.com/pipecat-ai/pipecat-mcp-server/tree/main">Pipecat MCP Server</a> over the weekend that makes this possible. It bridges any MCP-compatible coding agent (Claude Code, Cursor, Codex, etc.) to a <a href="https://github.com/pipecat-ai/pipecat">pipecat</a> voice pipeline over WebRTC. Your agent gets ears and a mouth, and it shares the screen too, so you can see file diffs, confirm changes, and even see what is on your display. An agent sitting <em>idle</em> feels like such a waste, and now it does not have to.</p>
<p>The MCP server exposes <code>listen</code>, <code>speak</code>, <code>stop</code>, <code>list_windows</code>, <code>screen_capture</code>, and <code>capture_screenshot</code>. That last pair is worth dwelling on: the agent can see your screen. You can ask &quot;show me the terminal?&quot; and it'll start capturing the window, run it through the vision pipeline, and you will see it in your WebRTC session. Voice and vision together turn this into a fly-by-wire session as if you were at your desk.</p>
<p>The <a href="https://github.com/pipecat-ai/pipecat-mcp-server/blob/main/.claude/skills/pipecat/SKILL.md">Pipecat SKILL</a> adds guardrails on top. It asks for verbal confirmation before making changes to files — an extra layer of safety when running a coding agent with enhanced privileges (think Claude with <code>--dangerously-skip-permissions</code>). You hear &quot;I'm about to modify server.ts, shall I proceed?&quot; before anything changes.</p>
<h3>How It Works</h3>
<p>The MCP server spawns a child process running the pipecat pipeline. Everything runs locally: <code>RNNoiseFilter</code> for background noise suppression, <code>SileroVAD</code> for voice activity detection, <code>SmartTurnAnalyzerV3</code> for turn-taking, <code>MLX/Fast Whisper</code> for speech-to-text, and <code>MLX Kokoro TTS</code> for speech synthesis. All components are open-source, open-weights, and run locally on your machine.</p>
<pre><code class="language-text">MCP Client (Claude Code, Cursor, etc.)
    │
    ▼
MCP Server (parent process) ◄──► Pipecat Agent (child process)
    │                                  │
    ▼                                  ▼
Handles tool calls              Voice + vision pipeline:
via HTTP at :9090/mcp           Audio → STT → TTS → Audio
                                Screen → Vision → Image files
</code></pre>
<p>Two calls do the heavy lifting. <code>listen()</code> blocks until you finish speaking — Silero VAD detects 0.2s of silence, then SmartTurn confirms the utterance is complete, and the transcription returns to the MCP client. <code>speak(text)</code> queues text for TTS and returns immediately. VAD keeps running during playback, so you can interrupt the agent mid-sentence. That detail matters: without it, you'd have to wait for the agent to finish talking before you could correct it. For those who work with pipecat, these are the basic interruption and mute strategies.</p>
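<p>To make the 0.2s silence rule concrete, here is a toy end-of-speech detector over per-frame speech flags. It is not Silero VAD or SmartTurn; the 20 ms frame size and the function name are assumptions for illustration.</p>

```python
from typing import Optional


def end_of_speech_index(
    is_speech: list[bool], frame_ms: int = 20, silence_ms: int = 200
) -> Optional[int]:
    """Return the frame index at which silence_ms of silence has followed speech."""
    needed = silence_ms // frame_ms
    silent_run = 0
    heard_speech = False
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech, silent_run = True, 0  # any speech resets the silence run
        elif heard_speech:
            silent_run += 1
            if silent_run == needed:
                return i
    return None  # still waiting: no speech yet, or not enough trailing silence


# 100 ms of leading silence, 200 ms of speech, then 200 ms of silence.
frames = [False] * 5 + [True] * 10 + [False] * 10
stop_at = end_of_speech_index(frames)
```

<p>Leading silence is ignored on purpose, otherwise <code>listen()</code> would return before the user had said anything.</p>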
<pre><code class="language-text">// Pipecat Pipeline
                    ┌─── Main branch ───────────────┐
Transport (In)      │ Whisper → User Agg. → Kokoro  │
│                   │                               │
│                   │                               │
├─► ScreenCap ──► ParallelPipeline                  ├─► Assist. Agg. → Transport (Out)
                    │                               │
                    └─── Vision branch ─────────────┘
                VisionProcessor (saves frames on demand)
</code></pre>
<p>It's early, but it has rapidly evolved. Aleix quickly added the option for local models in addition to the cloud-hosted models. You can also swap the SimpleWebRTC for DailyWebRTC, in case you encounter restrictive firewalls. Fast Whisper's accuracy may be hit or miss depending on your accent, but you can probably swap in Voxtral soon. Running everything locally means you can swap models as better ones appear.</p>
<p>Today, coding agents keep you tethered to your terminal. You sit, you type, you watch. In some cases, you can teleport to a cloud sandbox. Pipecat MCP Server breaks those constraints. The agent keeps working while you're away, and you stay in the loop.</p>
<p>The full source is at <a href="https://github.com/pipecat-ai/pipecat-mcp-server/tree/main">pipecat-mcp-server</a>.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[pipecat]]></category>
    </item>
    <item>
      <title><![CDATA[Clone your Voice in under 5 minutes]]></title>
      <link>https://varunsingh.net/til/pipecat/qwen3-local-voice-cloning</link>
      <guid isPermaLink="true">https://varunsingh.net/til/pipecat/qwen3-local-voice-cloning</guid>
      <pubDate>Thu, 29 Jan 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[[Qwen3-TTS](https://x.com/Alibaba_Qwen/status/2014326211913343303?s=20) launched a few weeks ago and was integrated into [MLX Audio](https://github.com/Blaizzy/mlx-audio) shortly after. This gave me t]]></description>
      <content:encoded><![CDATA[<p><a href="https://x.com/Alibaba_Qwen/status/2014326211913343303?s=20">Qwen3-TTS</a> launched a few weeks ago and was integrated into <a href="https://github.com/Blaizzy/mlx-audio">MLX Audio</a> shortly after. This gave me the idea to clone my voice and use it as the &quot;Speak Text&quot; feature for my posts.</p>
<blockquote class="twitter-tweet" data-media-max-width="560"><p lang="en" dir="ltr">Qwen3-TTS is officially live. We've open-sourced the full family—VoiceDesign, CustomVoice, and Base—bringing high quality to the open community.<br><br>- 5 models (0.6B &amp; 1.8B)<br>- Free-form voice design &amp; cloning<br>- Support for 10 languages<br>- SOTA 12Hz tokenizer for high compression<br>-… <a href="https://t.co/BSWpaYoZWj">pic.twitter.com/BSWpaYoZWj</a></p>&mdash; Qwen (@Alibaba_Qwen) <a href="https://twitter.com/Alibaba_Qwen/status/2014326211913343303?ref_src=twsrc%5Etfw">January 22, 2026</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Voice cloning with <a href="https://huggingface.co/mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16">Qwen3-TTS</a> needs just two things: a short audio clip of the target voice (30-180 seconds) and an accurate transcript of what was said. The 1.7B parameter model learns the voice characteristics from that reference and applies them to any new text you give it.</p>
<pre><code class="language-bash">uv run python src/tts_record.py my-script.txt \
  --engine qwen3-clone \
  --ref-audio my_voice.wav \
  --ref-text &quot;The exact words I said in the recording&quot;
</code></pre>
<p><strong>That is it.</strong> Out comes a WAV file that sounds like you reading the text in <code>my-script.txt</code>. The first time I played back a cloned version of myself reading a blog post I had never recorded, it was genuinely unsettling—in some ways it felt familiar and yet not like my voice.</p>
<p>The quality of the clone depends heavily on the reference audio. Random recordings do not work well. I tried. It was crap. I think the model needs to hear you produce a wide range of English sounds to generalise your voice properly. According to <a href="https://en.wikipedia.org/wiki/English_phonology">standard phoneme inventories</a>, General American English has roughly 24 consonant phonemes and 15-20 vowel phonemes including diphthongs—that is a lot of distinct sounds to cover in under three minutes.</p>
<p>I asked Claude to generate phoneme-rich scripts: natural-sounding sentences specifically designed to cover every English sound without sounding like a tongue twister. Four versions, from 90 seconds to 180 seconds:</p>
<pre><code class="language-text"># Excerpt from the 180-second script:
We passed through several villages before reaching the coast.
The view was stunning: white cliffs rose sharply from the azure water,
and fishing boats rocked gently in the harbour. I took a few photographs
to share with friends back home.
</code></pre>
<p>The next issue was that reading 90-180 seconds of text while recording was surprisingly awkward. I lost my place, rushed through sentences, or forgot to speak naturally. So I built a <a href="https://github.com/vr000m/qwen3-tts-clone-and-speak">browser-based teleprompter</a>. It is a single HTML file that captures audio and auto-advances when you have finished a sentence. Record, read, done. The whole process—from opening the teleprompter to having a usable voice clone—takes under five minutes.</p>
<p><img src="/static/blog/2026/20260128-teleprompter.png" alt="Browser-based teleprompter showing highlighted text with audio recording controls"></p>
<p>What surprised me:</p>
<ul>
<li><strong>How little audio you need.</strong> 90 seconds of well-chosen text produces surprisingly good clones. The phoneme coverage matters more than duration.</li>
<li><strong>Transcript accuracy is critical.</strong> If the transcript does not match the audio exactly, the clone quality drops noticeably. The model aligns phonemes between text and audio.</li>
<li><strong>Local inference on Apple Silicon is viable.</strong> The 1.7B model runs comfortably on M-series Macs via MLX.</li>
</ul>
<h3>0.6B vs 1.7B: hear the difference</h3>
<p>The 1.7B model produces noticeably more natural pacing and better voice fidelity compared to the 0.6B. Have a listen:</p>
<div style="display: flex; flex-direction: column; gap: 1rem; margin: 1rem 0;">
  <div>
    <strong>1.7B model</strong> (current)
    <audio controls preload="metadata" style="width: 100%; margin-top: 0.25rem;">
      <source src="/static/audio/til/pipecat/qwen3-local-voice-cloning.aac" type="audio/aac">
    </audio>
  </div>
  <div>
    <strong>0.6B model</strong>
    <audio controls preload="metadata" style="width: 100%; margin-top: 0.25rem;">
      <source src="/static/audio/til/pipecat/qwen3-local-voice-cloning-0.6B.aac" type="audio/aac">
    </audio>
  </div>
</div>
<p>In closing, the clone is not perfect. Longer sentences sometimes drift in pacing—the model rushes through clauses that I would naturally pause on. Proper nouns and technical terms occasionally get odd stress patterns, especially abbreviations like &quot;SFU&quot; or &quot;WebRTC.&quot; There is more work to be done on the <em>script</em> files to get the best possible clone.</p>
<p>Nonetheless, every post on this site now has a &quot;Speak Text&quot; button powered by this clone. You can also peruse all the code for this project at <a href="https://github.com/vr000m/qwen3-tts-clone-and-speak">qwen3-tts-clone-and-speak</a>.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[pipecat]]></category>
    </item>
    <item>
      <title><![CDATA[Fan-out: Multiple Coding Agents and Ralph Wiggum Loops]]></title>
      <link>https://varunsingh.net/til/coding/fan-out-skill-multiple-agents-and-ralph-loops</link>
      <guid isPermaLink="true">https://varunsingh.net/til/coding/fan-out-skill-multiple-agents-and-ralph-loops</guid>
      <pubDate>Mon, 26 Jan 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[When a [dev plan](/til/coding/structured-development-plans-with-coding-agents) has several independent tasks (different files or no shared state), you can fan them out to parallel Claude agents, each ]]></description>
      <content:encoded><![CDATA[<p>When a <a href="/til/coding/structured-development-plans-with-coding-agents">dev plan</a> has several independent tasks (different files or no shared state), you can fan them out to parallel Claude agents, each running as a <a href="https://ghuntley.com/loop/">Ralph Wiggum loop</a> in its own git worktree. This is not like <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04">GasTown</a>, which is a full-blown system, but it is similar in that it cuts the total time compared to tackling each task sequentially. For me, multiplexing tasks feels like an adequate trade-off between control and automation. I might eventually get into this <em>zany</em> idea of full automation.</p>
<p>So <code>fan-out</code> starts with extensive planning! Discuss your feature or idea with the LLM, have it ask you questions, make it do all the foundational work: architecture, expected code structure, list of files and API impacted, schema changes. Identify distinct tasks and their dependencies, especially if there is a common task that needs doing first. Implement and commit that before fanning out to multiple agents.</p>
<p>Once the pre-work is done, make sure the plan has an implementation checklist with distinct tasks and a Technical Specifications section listing which files each task touches — this is what <code>/fan-out</code> parses to analyse dependencies and show which tasks can run in parallel. Once you confirm, it fans out the independent tasks to separate Claude agents: <code>/fan-out docs/dev_plans/20260116-feature-auth-system.md</code>.</p>
<p>For each approved task, <code>/fan-out</code> creates a git worktree at <code>../your-repo-fanout-&lt;task-slug&gt;</code>, spawns a separate <code>claude -p</code> process (Opus, non-interactive), and each agent works in isolation, committing to its own branch.</p>
<p>From there it is a matter of monitoring progress with <code>/fan-out status</code>, checking the logs, and ensuring each agent is moving forward. Once all agents finish, review the individual PRs and merge them into your feature branch. Lastly, the clean-up removes the worktrees, deletes merged branches, and removes the state file.</p>
<p>The key constraint is that tasks must be truly independent. The dependency analysis catches conflicts before spawning, which has saved me from a painful merge more than once.</p>
<pre><code class="language-bash"># Plan
/dev-plan create feature user-dashboard

# ... plan has 3 independent tasks:
#   1. Add /api/dashboard endpoint (src/api/)
#   2. Add Dashboard component (src/components/)
#   3. Add dashboard tests (tests/)

# Complete shared prerequisite (types)
# ... manual work, commit ...

# fan-out's options
/fan-out &quot;[plan-file | status | logs N | cancel [N] | merge | cleanup] [--dry-run] [--max-agents N] [--model MODEL]&quot;

# Fan out the 3 independent tasks
/fan-out docs/dev_plans/20260206-feature-user-dashboard.md

# Check in on progress
/fan-out status

# All done — merge
/fan-out merge

# Tear down worktrees
/fan-out cleanup
</code></pre>
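<p>The independence check can be pictured with a minimal Python sketch. This is not how <code>/fan-out</code> is implemented; the task names, file paths, and the <code>find_conflicts</code> helper are hypothetical, and the sketch only assumes that each task lists the files it touches. Two tasks conflict when their file sets overlap:</p>
<pre><code class="language-python">from itertools import combinations

# Hypothetical plan data: task name mapped to the files it touches.
tasks = {
    'api-endpoint': {'src/api/dashboard.py'},
    'dashboard-component': {'src/components/Dashboard.tsx'},
    'dashboard-tests': {'tests/test_dashboard.py'},
}


def find_conflicts(task_files):
    """Return (task_a, task_b, shared_files) for every overlapping pair."""
    conflicts = []
    for (a, files_a), (b, files_b) in combinations(sorted(task_files.items()), 2):
        shared = files_a.intersection(files_b)
        if shared:
            conflicts.append((a, b, sorted(shared)))
    return conflicts


conflicts = find_conflicts(tasks)
print(conflicts or 'no conflicts, safe to fan out')
</code></pre>
<p>Any pair that shows up in the result would need to run sequentially, or be folded into the shared prerequisite, before fanning out.</p>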
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[coding]]></category>
    </item>
    <item>
      <title><![CDATA[Claude Code Creates Launch Videos]]></title>
      <link>https://varunsingh.net/til/claude/claude-code-creates-launch-videos</link>
      <guid isPermaLink="true">https://varunsingh.net/til/claude/claude-code-creates-launch-videos</guid>
      <pubDate>Wed, 21 Jan 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[I’ve been pushing these coding agents beyond creating code. They already understand the code and the purpose of the app they are building, have read the docs, and have access to the product plans. Tha]]></description>
      <content:encoded><![CDATA[<p>I’ve been pushing these coding agents beyond creating code. They already understand the code and the purpose of the app they are building, have read the docs, and have access to the product plans. That makes it straightforward to ask them to draft a script for a launch demo.</p>
<p>Beyond that, they have access to tools like <a href="https://github.com/ChromeDevTools/chrome-devtools-mcp">Chrome DevTools</a> to navigate the web app, take screenshots, associate talking points with those screenshots, record the audio, sync narration timestamps to image transitions, and collate everything into the final video.</p>
<video controls width="100%">
  <source src="/static/blog/2026/20260121-til-demo-quietset-fit-final.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>
<p>My three-step pipeline runs entirely on Apple Silicon:</p>
<pre><code>mlx-audio → Playwright → FFmpeg
(narration)  (capture)    (video)
</code></pre>
<p>Basically, the LLM explores and navigates the app by sending MCP commands. Once it has figured out which pages are needed for the narrative, it can create a fairly simple <code>Playwright</code> script to capture them. Because the script is deterministic, it can be re-run whenever the app is updated, and you get the same screenshots every time.</p>
<pre><code class="language-js">const { chromium } = require('playwright');

(async () =&gt; {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.setViewportSize({ width: 1280, height: 720 });

  await page.goto('http://localhost:3000');
  await page.screenshot({ path: 'scene1.png' });

  await page.click('#settings-btn');
  await page.waitForSelector('.settings-panel');
  await page.screenshot({ path: 'scene2.png' });

  // ... more scenes

  await browser.close();
})();
</code></pre>
<p>The LLM must also create a single source-of-truth file for visuals, narration, and timing. The <code>images.txt</code> format drives both video generation and TTS generation. It is similar to FFmpeg's <a href="https://trac.ffmpeg.org/wiki/Concatenate#demuxer">input.txt</a>, with the main difference being the narration strings interleaved between entries. The LLM generates the initial duration for each scene based on an assumed speech rate, but the TTS generator updates it once the final audio has been generated. The <code>images.txt</code> looks something like this:</p>
<pre><code class="language-bash"># images.txt
file 'scene1.png'
text &quot;Track your strength training with session-based progression.&quot;
duration 10

file 'scene2.png'
text &quot;Quick Actions let you copy previous weights or skip a day.&quot;
duration 8
</code></pre>
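<p>The initial duration estimate can be sketched in Python as follows. The 150 words-per-minute rate, the two-second padding, and the <code>estimate_duration</code> helper are assumptions for illustration, not values taken from the actual generator:</p>
<pre><code class="language-python">import math

WORDS_PER_MINUTE = 150  # assumed speaking rate
PADDING_SECONDS = 2.0   # assumed pause around each scene


def estimate_duration(text, speed=1.0):
    """Estimate how long a narration line takes, rounded up to whole seconds."""
    words = len(text.split())
    speech_seconds = words / (WORDS_PER_MINUTE * speed) * 60
    return math.ceil(speech_seconds + PADDING_SECONDS)


line = 'Quick Actions let you copy previous weights or skip a day.'
print(estimate_duration(line))
</code></pre>
<p>An estimate like this only needs to be close enough to lay out the scenes; the measured audio length replaces it later.</p>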
<p>The magic of Apple Silicon is that you can easily run a local TTS using <a href="https://github.com/Blaizzy/mlx-audio"><code>mlx-audio</code></a>. In my examples I use the <a href="https://github.com/hexgrad/kokoro">Kokoro-82M model</a>, which is ~160 MB and produces pretty smooth sound for its size. Lastly, the narration script lets us fiddle with the narration speed and the transition wait times; a speed of 0.8 and a 2s transition wait worked well for me.</p>
<pre><code class="language-bash">uv run python generate_narration.py -i images.txt -o narration.wav --speed 0.8 --wait 2.0
</code></pre>
<p>Finally, the images and the narration audio are directly passed to FFmpeg along with <code>input.txt</code> to produce the video. I generate <code>input.txt</code> from <code>images.txt</code> by stripping the text narration lines.</p>
<pre><code class="language-bash">ffmpeg \
    -f concat -safe 0 -i input.txt \
    -i narration.wav \
    -vf &quot;scale=1280:720:force_original_aspect_ratio=decrease,pad=1280:720:(ow-iw)/2:(oh-ih)/2&quot; \
    -c:v libx264 -pix_fmt yuv420p -r 30 \
    -c:a aac \
    -map 0:v -map 1:a \
    demo-final.mp4
</code></pre>
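<p>Stripping the narration lines takes only a few lines of Python. This is a minimal sketch that assumes narration lines always start with <code>text </code>, as in the sample above; the <code>strip_narration</code> name is hypothetical:</p>
<pre><code class="language-python">def strip_narration(images_txt):
    """Drop narration lines so the rest is valid input for FFmpeg's concat demuxer."""
    kept = [
        line for line in images_txt.splitlines()
        if not line.lstrip().startswith('text ')
    ]
    return '\n'.join(kept) + '\n'


sample = '''file 'scene1.png'
text "Track your strength training."
duration 10
'''
print(strip_narration(sample), end='')
</code></pre>
<p>Writing the returned string to <code>input.txt</code> keeps <code>images.txt</code> as the only file that is edited by hand.</p>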
<p>Since everything is scripted, I loved the fact that the results can be regenerated quickly with small variations. The LLM research part is the only thing that requires some painstaking prompting to get the pitch and narrative correct.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[claude]]></category>
    </item>
    <item>
      <title><![CDATA[Standardising the Open Responses API]]></title>
      <link>https://varunsingh.net/til/standards/open-responses-api</link>
      <guid isPermaLink="true">https://varunsingh.net/til/standards/open-responses-api</guid>
      <pubDate>Fri, 16 Jan 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[As an avid contributor to the IETF and W3C, I appreciate OpenAI's effort to specify a vendor-neutral interface to interact with an LLM. 

The [Open Responses API](https://www.openresponses.org/specifi]]></description>
      <content:encoded><![CDATA[<p>As an avid contributor to the IETF and W3C, I appreciate OpenAI's effort to specify a vendor-neutral interface to interact with an LLM.</p>
<p>The <a href="https://www.openresponses.org/specification">Open Responses API</a> is based on OpenAI's <a href="https://platform.openai.com/docs/api-reference/responses">Responses API</a>, which OpenAI positions as the more capable, newer interface compared to <a href="https://platform.openai.com/docs/api-reference/chat"><code>chat completions</code></a>. I think there were a lot of lessons from the <code>chat completions</code> API that led to the Responses API. For example, chat completions bolted on a message structure after it became clear that <em>conversation</em> was the dominant use case, not a single request/response.</p>
<p>The <strong>Open Responses API</strong> defines the common schema for requests, responses, and items. It defines:</p>
<ul>
<li>HTTP request/response formats (headers, JSON bodies, event-stream format)</li>
<li>Items, the fundamental context units (messages, function calls, tools, reasoning traces, errors)</li>
<li>An interaction model for the agentic loop (input -&gt; reason -&gt; tool search -&gt; invoke tools -&gt; reflect -&gt; respond)</li>
</ul>
<p>I am excited that this spec has broad appeal. <a href="https://x.com/reach_vb/status/2011863149356413275?s=20">Vaibhav's announcement post</a> covered a slew of partners supporting the spec: Nvidia, Vercel, LMStudio, Hugging Face, Ollama, OpenRouter, etc., including several model providers.</p>
<p>OpenAI has a helpful <a href="https://platform.openai.com/docs/guides/migrate-to-responses">migration guide</a> comparing Chat Completions with the Responses API:</p>
<pre><code class="language-python"># previous
response = client.chat.completions.create(
    model=&quot;gpt-4.1-mini&quot;,
    messages=[
        {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: &quot;You are a helpful assistant.&quot;},
        {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;What's 2 + 2?&quot;}
    ]
)
print(response.choices[0].message.content)

# now
response = client.responses.create(
    model=&quot;gpt-4.1-mini&quot;,
    input=&quot;What's 2 + 2?&quot;
)

print(response.output_text)
</code></pre>
<p>For a more practical implementation, you can see the <a href="https://pipecat.ai">Pipecat</a> code for <a href="https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/openai/base_llm.py">base_llm.py</a> which still uses <code>chat.completions.*</code>, whereas <a href="https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/openai/llm.py">llm.py</a> uses <code>llm_response</code>.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[standards]]></category>
    </item>
    <item>
      <title><![CDATA[Using MCP? Skill issue]]></title>
      <link>https://varunsingh.net/til/mcp/using-mcp-skill-issue</link>
      <guid isPermaLink="true">https://varunsingh.net/til/mcp/using-mcp-skill-issue</guid>
      <pubDate>Mon, 12 Jan 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[A Claude [skill](https://code.claude.com/docs/en/skills) is a thin wrapper that tells the agent how to use it. The skill typically is small text instructions that the model can follow, and a skill doe]]></description>
      <content:encoded><![CDATA[<p>A Claude <a href="https://code.claude.com/docs/en/skills">skill</a> is a thin wrapper around a tool that tells the agent how to use that tool. A skill is typically a short set of text instructions that the model can follow, and it does not need to keep a big server description or schema sitting in the context taking up space. The biggest advantage is we get to reuse existing CLIs or HTTP endpoints, keep things simple, and control exactly how the agent should interact with these APIs.</p>
<p>The <code>SKILL.md</code> file basically explains how to use the CLI or HTTP endpoint: &quot;When you need X, call it like this, with these flags, in this order.&quot; This is very different from an MCP. MCP provides a whole tool server: its schemas, capabilities, metadata, etc. The model sees that full description in its context and then decides how to call into it. That is powerful, but all that structure eats context and can feel heavy or <strong>bloated</strong> when all you really want is: &quot;run this CLI with these arguments.&quot;</p>
<!--```bash
/context
 ⎿   Context Usage
    ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛀   claude-opus-4-5-20251101 · 133k/200k tokens (66%)
    ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛀ ⛁ ⛀ ⛁
    ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁   Estimated usage by category
    ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁   ⛁ System prompt: 2.5k tokens (1.2%)
    ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁   ⛁ System tools: 17.0k tokens (8.5%)
    ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁   ⛁ MCP tools: 11.8k tokens (5.9%)
    ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛁ ⛶   ⛁ Custom agents: 399 tokens (0.2%)
    ⛶ ⛶ ⛶ ⛶ ⛶ ⛶ ⛶ ⛝ ⛝ ⛝   ⛁ Memory files: 1.8k tokens (0.9%)
    ⛝ ⛝ ⛝ ⛝ ⛝ ⛝ ⛝ ⛝ ⛝ ⛝   ⛁ Skills: 134 tokens (0.1%)
    ⛝ ⛝ ⛝ ⛝ ⛝ ⛝ ⛝ ⛝ ⛝ ⛝   ⛁ Messages: 100.4k tokens (50.2%)
                          ⛶ Free space: 21k (10.5%)
                          ⛝ Autocompact buffer: 45.0k tokens (22.5%)

# I have two MCPs, Playwright and Chrome DevTools:
# if Codex is using Playwright,
# then Claude can use Chrome DevTools, ergo double the weight, sigh!
```-->
<blockquote class="twitter-tweet" data-media-max-width="560"><p lang="en" dir="ltr">Interesting news, need to keep track. I mainly use one mcp, playwright. because you can ask it to perform actions! This is from last night (01/13), i did run into auto-compaction several times when the playwright was doing something. <br><br>But good to know that there may be… <a href="https://t.co/K9FU1z1lp7">https://t.co/K9FU1z1lp7</a> <a href="https://t.co/OJcYCBdQ2Q">pic.twitter.com/OJcYCBdQ2Q</a></p>&mdash; Varun Singh (@vr000m) <a href="https://twitter.com/vr000m/status/2011589057844035709?ref_src=twsrc%5Etfw">January 14, 2026</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>For example, a GitHub skill that uses <code>gh</code> (CLI) is better than the corresponding <code>GitHub MCP</code> because it avoids tokens sitting idly in context. Skills do not need to keep all the GitHub commands in context, only a pointer that pulls the rest of the details when the skill is invoked, so the instruction to Claude Code boils down to: &quot;if you need to use Git, use the GitHub skill.&quot;</p>
<p>More concretely, since the summer, I have been <a href="https://varunsingh.net/til/mcp/playwright-mcp-for-clis">using</a> <code>Playwright</code> or the <code>Chrome DevTools</code> MCP to control the browser. These MCPs take up about 3-4K tokens (~2-3%). Meanwhile, the corresponding <a href="https://github.com/vercel-labs/agent-browser/blob/main/skills/agent-browser/SKILL.md">agent-browser</a> skill takes up fewer than 500 tokens when fully loaded.</p>
<p>There are a few gotchas to keep in mind with skills. Skills can be system-wide in <code>~/.claude/skills</code> or project-scoped in <code>.claude/skills</code>. Just make sure there is no name conflict between system-wide and project skills, because the system-wide skill takes precedence.</p>
<p>Another thing to remember: a skill can be set to be invoked only by you using a slash command, which prevents Claude from automatically loading it. More context savings! For example, I want to be intentional about when <code>/frontend-design</code> is called, rather than have it load each time the agent builds a UI component.</p>
<p>To summarise, skills are just thin, learned “recipes” for calling tools you already have (like CLIs), while MCP is a heavier protocol layer that keeps a lot of tool metadata in the model’s context. There are tens of skills from Vercel and Anthropic: type <code>/skills</code> in Claude Code or download from an open-source <a href="https://skills.sh/">skill directory</a>.</p>
<p>My loaded skills are:</p>
<pre><code class="language-bash">frontend-design · ~67 tokens
receiving-code-review · ~67 tokens
verification-before-completion · ~67 tokens
finishing-a-development-branch · ~61 tokens
using-git-worktrees · ~59 tokens
brainstorming · ~56 tokens
til-blog-review · ~41 tokens
dispatching-parallel-agents · ~37 tokens
requesting-code-review · ~36 tokens
executing-plans · ~33 tokens
systematic-debugging · ~31 tokens
writing-skills · ~31 tokens
subagent-driven-development · ~31 tokens
test-driven-development · ~29 tokens
writing-plans · ~28 tokens
</code></pre>
<p><strong>Update (2026 Jan 14):</strong> <a href="https://x.com/trq212">Thariq</a> wrote about <a href="https://x.com/trq212/status/2011523109871108570?s=20">Zero Context MCP Tool Search</a>, wherein Claude Code dynamically loads tools into context. For example, some developers were claiming <em>7+ servers consuming 67k+ tokens</em> (this brings skills).</p>
<p><strong>Update (2026 Jan 24):</strong> <a href="https://x.com/trq212">Thariq</a> wrote about <a href="https://x.com/trq212/status/2014836841846132761">merging slash commands into skills</a> and the official Claude docs for <a href="https://code.claude.com/docs/en/skills">skills</a>.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[mcp]]></category>
    </item>
    <item>
      <title><![CDATA[JSON Lines (JSONL) Text Format]]></title>
      <link>https://varunsingh.net/til/standards/json-lines-jsonl-text-format</link>
      <guid isPermaLink="true">https://varunsingh.net/til/standards/json-lines-jsonl-text-format</guid>
      <pubDate>Mon, 05 Jan 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[JSON Lines (JSONL) is a text format where each line is a separate JSON object. It's designed for streaming and incremental processing, allowing you to read or write one record at a time without loadin]]></description>
      <content:encoded><![CDATA[<p>JSON Lines (JSONL) is a text format where each line is a separate JSON object. It's designed for streaming and incremental processing, allowing you to read or write one record at a time without loading everything into memory. An error on one line does not impact the others, and processing can be parallelised since each line is independent.</p>
<pre><code class="language-json">{&quot;id&quot;: 1, &quot;name&quot;: &quot;A letter is a grapheme that generally corresponds to a phoneme&quot;}
{&quot;id&quot;: 2, &quot;name&quot;: &quot;Phoneme is the smallest functional unit of speech&quot;}
{&quot;id&quot;: 3, &quot;name&quot;: &quot;An alphabet is a writing system that uses letters&quot;}
{&quot;id&quot;: 4, &quot;name&quot;: &quot;Alpha and Beta are the first two letters of the Greek alphabet&quot;}
</code></pre>
<p>I encountered JSONL recently whilst parsing <a href="/til/coding/parsing-multi-provider-claude-code-codex-and-gemini-usage-logs">coding agent logs</a>—most session logs and todos I’ve seen are JSONL files. What caught me by surprise was that <a href="https://jsonlines.org/on_the_web/">JSONL</a> dates from the early 2010s, but ML tools have certainly increased its adoption.</p>
<p>You may have seen NDJSON (newline-delimited JSON) or LDJSON (line-delimited JSON). However, JSON Lines (JSONL) is the most commonly used label today (2025), especially in big data and ML tooling.</p>
<p>If you've parsed JSON by hand, you will know that raw carriage returns and newlines are not allowed inside JSON strings (they must be escaped as <code>\r</code> and <code>\n</code>). JSONL/NDJSON therefore uses a newline (or CRLF) as the record delimiter, and each line is expected to be a complete JSON value without unescaped newlines.</p>
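<p>Here is a minimal Python sketch of both properties, using only the standard library's <code>json</code> module; the <code>read_jsonl</code> helper and the sample records are made up for illustration. <code>json.dumps</code> escapes the embedded newline so the record stays on one line, and the reader skips a malformed line without losing the rest:</p>
<pre><code class="language-python">import json


def read_jsonl(lines):
    """Yield one parsed record per line, skipping blank or malformed lines."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            print(f'skipping line {number}: {err.msg}')


# json.dumps escapes the newline inside the string, so the record is one line.
encoded = json.dumps({'id': 1, 'name': 'two\nlines'})
assert '\n' not in encoded

records = list(read_jsonl([encoded, '{"id": 2, "name": oops}', '{"id": 3}']))
print(records)
</code></pre>
<p>Because <code>read_jsonl</code> accepts any iterable of lines, passing an open file object processes one record at a time without loading the whole file.</p>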
<p>In summary, because each line is independent, the format is streaming-friendly, append-friendly, and error-tolerant.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[standards]]></category>
    </item>
    <item>
      <title><![CDATA[Boris Cherny's Tips for Using Claude Code]]></title>
      <link>https://varunsingh.net/til/claude/boris-chernys-tips-for-using-claude-code</link>
      <guid isPermaLink="true">https://varunsingh.net/til/claude/boris-chernys-tips-for-using-claude-code</guid>
      <pubDate>Sat, 03 Jan 2026 00:00:00 GMT</pubDate>
      
      <description><![CDATA[[Boris Cherny](https://x.com/bcherny), Claude Code @anthropicai, shares practical tips for running multiple Claude Code sessions efficiently. His tweet on December 27 showed prolific stats, 250 PRs, 5]]></description>
      <content:encoded><![CDATA[<p><a href="https://x.com/bcherny">Boris Cherny</a>, Claude Code @anthropicai, shares practical tips for running multiple Claude Code sessions efficiently. His tweet on December 27 showed prolific stats: 250 PRs, 500 commits, and 40K LoC added/38K removed (so a fair amount of refactoring) across his projects [<a href="https://x.com/bcherny/status/2004887829252317325">1</a>]. On January 2, 2026, he shared his <em>vanilla</em> workflow. The thread has strong, practical tips. Read the <a href="https://x.com/bcherny/status/2007179833990885678">raw tweet thread</a>; all the ideas put together are worth trying!</p>
<p><strong>My takeaway: Run many tasks in parallel, do not be limited by the terminal, and use the other avenues available to you. Most important: give Claude a way to verify its work. Run a tight feedback loop a few times to reach final quality, with tests at every change.</strong></p>
<blockquote>
<p><strong>Update (2026-01-07)</strong>: The holidays have been utter carnage -- my X is full of people raving about:</p>
<ul>
<li><a href="https://ghuntley.com/ralph/">Ralph Wiggum Technique</a> by <a href="https://x.com/GeoffreyHuntley">Geoffrey Huntley</a>, read a <a href="https://www.humanlayer.dev/blog/brief-history-of-ralph">brief history</a> by <a href="https://x.com/dexhorthy">Dex Horthy</a></li>
<li><a href="https://clawd.bot/">Clawd</a> by <a href="https://x.com/steipete">Peter Steinberger</a></li>
<li><a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04">GasTown</a> by <a href="https://x.com/Steve_Yegge">Steve Yegge</a>.</li>
</ul>
</blockquote>
<p>Getting back to Boris' List:</p>
<blockquote>
<ol>
<li>Run 5 local Claudes in parallel in terminal, numbering tabs 1–5 and using system notifications to know when input is needed. Also run 5–10 web Claudes on <a href="http://claude.ai/code">claude.ai/code</a> in parallel; hand off sessions between local and web, and “teleport” back and forth as needed.</li>
</ol>
</blockquote>
<p>I combined his first two recommendations, and cannot think of these in isolation. This is worth trying. The teleport feature is kinda cool!</p>
<blockquote>
<ol start="2">
<li>Use Opus 4.5 with thinking for everything; despite being larger, it’s faster overall due to better tool use and less steering.</li>
</ol>
</blockquote>
<p>Interesting — I already have <code>&quot;model&quot;:&quot;opus&quot;</code> in the settings, but that may not be enough.</p>
<pre><code class="language-json">// ~/.claude/settings.local.json
{
  &quot;permissions&quot;: {
    &quot;allow&quot;: [
      &quot;Bash(echo:*)&quot;,
      &quot;Bash(ls:*)&quot;,
      &quot;Bash(export LC_ALL=C)&quot;,
      &quot;Bash(cat:*)&quot;
    ]
  },
  &quot;model&quot;:&quot;opus&quot;
}
</code></pre>
<blockquote>
<ol start="3">
<li>Maintain a shared <a href="http://CLAUDE.md">CLAUDE.md</a> in the repo, checked into git; continuously add notes when Claude does something wrong so it learns constraints and patterns.</li>
</ol>
</blockquote>
<p>Yes! Although it would be good if Codex and Claude could read this by default.</p>
<blockquote>
<ol start="4">
<li>Tag .claude on PRs to update <a href="http://CLAUDE.md">CLAUDE.md</a> during code review using the Claude Code GitHub action, building “compounding engineering.”</li>
</ol>
</blockquote>
<p>Need to figure out how this is different from Claude and Codex automatically reviewing the PR when the PR is opened (I think that is what the claude/gpt integrations with GH do by default)...</p>
<blockquote>
<ol start="5">
<li>Start most sessions in Plan mode (shift+tab twice); iterate on the plan, then switch to auto‑accept edits for a one‑shot implementation.</li>
</ol>
</blockquote>
<p>Yup!</p>
<blockquote>
<ol start="6">
<li>Create slash commands for frequent inner‑loop workflows; check them into .claude/commands/ to avoid repeated prompting and enable Claude to use them.</li>
</ol>
</blockquote>
<p>Need to investigate this x2; my most common agent reviewed PRs by taking the problem statement and the code to review...</p>
<blockquote>
<ol start="7">
<li>Use subagents for common workflows, like code-simplifier after edits and verify-app for detailed end‑to‑end testing.</li>
</ol>
</blockquote>
<p>Makes sense, I think I have been manually asking to do this. Need to figure out if there is a way to combine the above <code>slash commands</code> and <code>subagents</code> in a loop, i.e., plan -&gt; execute -&gt; verify with slash commands and subagents -&gt; (<s>rinse and</s> keep repeating)...∞</p>
<blockquote>
<ol start="8">
<li>Add a PostToolUse hook to format Claude's code, cleaning up the last 10% to prevent CI formatting errors. For long‑running tasks, verify with a background agent, an agent Stop hook, or the ralph‑wiggum plugin; also use local tests.</li>
</ol>
</blockquote>
<p>Need to investigate this x4; I combined two of his recommendations into one.</p>
<blockquote>
<ol start="9">
<li>Don't skip permissions; instead, pre‑allow safe bash commands via /permissions and share defaults in .claude/settings.json.</li>
</ol>
</blockquote>
<p>Need to maintain this list of commands... and update the permissions blob that I shared above (that's the vanilla out-of-the-box permissions blob)</p>
<blockquote>
<ol start="10">
<li>Let Claude Code use your tools: search and post to Slack via MCP, run BigQuery queries, pull Sentry logs; share Slack MCP config in .mcp.json.</li>
</ol>
</blockquote>
<p>Alright, MCP has been much better than using the APIs, but still need to consider how MCP pollutes the context window. Maybe there is a slash command or post-hook action that controls when the MCPs are loaded and executed.</p>
<p>Update (2026-01-31): A month later, some more updates from <a href="https://x.com/bcherny/status/2017742741636321619?s=20">Boris</a>:</p>
<ul>
<li>spend the energy upfront: use worktrees, use subagents, plan more,</li>
<li>use a global <a href="http://claude.md">claude.md</a>, use memories to immortalise them after each task</li>
<li>create your own <a href="http://skills.md">skills.md</a> for repetitive tasks</li>
<li>connect your communities (zendesk, slack, discord, github) for claude to take a first stab at the issue</li>
<li><em>&quot;Knowing everything you know now, scrap this and implement the elegant solution&quot;</em></li>
<li>optimize your <a href="https://code.claude.com/docs/en/terminal-config">terminal</a></li>
</ul>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[claude]]></category>
    </item>
    <item>
      <title><![CDATA[Using Claude Code to Optimise Terminal Performance]]></title>
      <link>https://varunsingh.net/til/claude/claude-terminal-optimization</link>
      <guid isPermaLink="true">https://varunsingh.net/til/claude/claude-terminal-optimization</guid>
      <pubDate>Fri, 26 Dec 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[[Eduard Ruzga](https://lv.linkedin.com/in/eduardruzga), founder of Desktop Commander in November at a hackathon had recommended that I try dc/claude to organise my terminal and documents folder! So ju]]></description>
      <content:encoded><![CDATA[<p>At a hackathon in November, <a href="https://lv.linkedin.com/in/eduardruzga">Eduard Ruzga</a>, founder of Desktop Commander, recommended that I try dc/claude to organise my terminal and documents folder! I tried this on Christmas Day! In Claude Code, use the following prompt:</p>
<pre><code class="language-text">Analyze my terminal setup (~/.z* files) for performance improvements.
Recommend faster CLI utilities (add to Brewfile). Suggest aliases
based on my command history.
</code></pre>
<p>This concise prompt works well because Claude will explore and discover what's relevant. Below is the summary based on analysing the <code>trail logs</code> (in <code>~/.claude</code>).</p>
<p>Claude started by measuring my shell startup time:</p>
<pre><code class="language-bash">for i in 1 2 3; do /usr/bin/time zsh -i -c exit 2&gt;&amp;1; done
</code></pre>
<p>It then enabled detailed profiling with <code>zprof</code> to identify bottlenecks. Claude examined my <code>.zshrc</code> and found five performance issues:</p>
<ol>
<li><code>$(brew --prefix)</code> called 4+ times (~50ms each)</li>
<li>NVM blocking shell startup (~100-150ms)</li>
<li>Unused Oh-My-Zsh theme being loaded</li>
<li>Multiple <code>compinit</code> calls</li>
<li>GPG agent launching on every shell</li>
</ol>
<p>Applying the fixes:</p>
<table>
<thead>
<tr>
<th>Optimization</th>
<th>Time Saved</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cache <code>$HOMEBREW_PREFIX</code> in <code>.zprofile</code></td>
<td>~150-200ms</td>
</tr>
<tr>
<td>NVM <code>--no-use</code> flag</td>
<td>~100-150ms</td>
</tr>
<tr>
<td>Empty ZSH_THEME</td>
<td>~20-30ms</td>
</tr>
<tr>
<td>Single compinit call</td>
<td>~50-80ms</td>
</tr>
<tr>
<td>Conditional GPG launch</td>
<td>~20-30ms</td>
</tr>
</tbody>
</table>
<p>Warm start improved from ~560ms to ~250ms. I kept the <code>localip</code> lookup for the starship prompt, even though removing it would have shaved off a further 150ms. Utility over performance.</p>
<p>Claude suggested 6 replacements for common commands (not aliased, though):</p>
<table>
<thead>
<tr>
<th>Tool</th>
<th>Replaces</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>fd</code></td>
<td><code>find</code></td>
<td>5-10x faster, sane defaults</td>
</tr>
<tr>
<td><code>bat</code></td>
<td><code>cat</code></td>
<td>Syntax highlighting, git integration</td>
</tr>
<tr>
<td><code>eza</code></td>
<td><code>ls</code></td>
<td>Git status, tree view, colors</td>
</tr>
<tr>
<td><code>fzf</code></td>
<td>Ctrl+R</td>
<td>Fuzzy find everything</td>
</tr>
<tr>
<td><code>delta</code></td>
<td><code>git diff</code></td>
<td>Side-by-side, syntax highlighting</td>
</tr>
<tr>
<td><code>btop</code></td>
<td><code>top</code></td>
<td>Beautiful system monitor</td>
</tr>
</tbody>
</table>
<p>Claude noted that tools like <code>jq</code> and <code>ripgrep</code> were already installed, and recommended skipping <code>zoxide</code>, <code>dust</code>, and <code>procs</code> to avoid unnecessary complexity.</p>
<p>In summary:</p>
<ul>
<li><code>Brewfile</code> - Added 6 fast CLI utilities</li>
<li><code>.zprofile</code> - Cached Homebrew prefix</li>
<li><code>.zshrc</code> - Optimized NVM, compinit, GPG</li>
<li><code>.aliases.example</code> - list of potential aliases that I can incorporate!</li>
</ul>
<p>Notes:<br>
This was also inspired by <a href="https://scottspence.com/posts/speeding-up-my-zsh-shell">Scott Spence's post on speeding up zsh</a> and is a follow-up to my work to <a href="https://varunsingh.net/til/scripts/keeping-two-macs-in-sync">sync my new mac</a>.</p>
<p><strong>03 Jan 2026</strong>: More discussions on <a href="https://x.com/deedydas/status/2007342412335927400">x.com/deedydas</a> about using Claude Code for terminal optimisation. Also explains why <code>zoxide</code> has issues as a <code>cd</code> replacement within Claude Code.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[claude]]></category>
    </item>
    <item>
      <title><![CDATA[Codex vs Claude Code]]></title>
      <link>https://varunsingh.net/til/coding/codex-vs-claude-code</link>
      <guid isPermaLink="true">https://varunsingh.net/til/coding/codex-vs-claude-code</guid>
      <pubDate>Tue, 23 Dec 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[[Max](https://www.linkedin.com/in/signalgaining) and I met for lunch. Our discussion veered into how much AI is coding for us and _what we delegate to which model_. Do we have any favourites? My worki]]></description>
      <content:encoded><![CDATA[<p><a href="https://www.linkedin.com/in/signalgaining">Max</a> and I met for lunch. Our discussion veered into how much AI is coding for us and <em>what we delegate to which model</em>. Do we have any favourites? My working thesis is that we can go from idea to execution fairly quickly, solitarily, without much oversight and guidance. Is that a good thing? Perhaps for small pieces that work independently, it does not matter.</p>
<p>Benchmarks suggest <code>codex</code> and <code>claude-code</code> are very similar, but hands-on use tells a different story. Claude Code is eager to solve problems; if not given guidance, it will pick a language, an environment, and immediately start iterating. For instance, I have had it pick Node.js instead of Python.</p>
<p>Codex takes a different approach: it reads docs to build context, examines surrounding code, and asks clarifying questions if it is missing key requirements. This can take a bit longer on larger codebases. Only after that analysis does it provide one or two ways to solve the problem. Claude is eager to write code; Codex is more hesitant, sometimes giving me a solution inline and expecting me to copy it to the right place. That difference carries into context engineering too: Codex tends to curate the relevant context up front, while Claude enriches it in stages as it develops a solution.</p>
<p>Some months ago, I raved about Claude Code's plan mode, which you enter with <code>Shift+Tab</code> twice. It was helpful in curtailing the eager coding assistant, but plan mode is not the default. Often, I start a conversation, press Enter, then realise it is not in plan mode and have to hit Esc to switch it over, or else it goes off and starts iterating on a solution, which may be premature. To avoid this, I keep safeguards in notes and docs (e.g., <code>claude.md</code>) that nudge Claude to ask more questions and think deeply before execution; if it is unsure, it should ask. Things may be improving though; recently, I have seen Claude enter plan mode by itself, and on my machine it keeps the in-memory plans documented at <code>~/.claude/plans</code>. In contrast, Codex tends to do its planning as part of the default flow.</p>
<p>Usually, if I have a tractable problem to solve, I choose <em>Claude</em>. It gets to a working solution faster, and I can iterate from there. Conversely, if I have a larger problem with complex interactions and states, I almost always start with <em>Codex</em>. It gets the plan and architecture sorted out, lays out bite-sized pieces, and I can pass those to Claude (which feels like a productivity boost for me).</p>
<p><strong>28 December 2025:</strong> <a href="https://x.com/steipete/status/2005451576971043097?s=20">Peter Steinberger</a> wrote a more eloquent <a href="https://steipete.me/posts/2025/shipping-at-inference-speed">piece</a>, which I'd summarize as: <em>Claude is faster for smaller edits vs Codex for large refactors</em>.</p>
<p><a href="https://x.com/dejavucoder/status/2005285904420843892?s=20">Sankalp</a>'s journey mirrors mine; he has a pretty sweet <a href="https://sankalp.bearblog.dev/my-experience-with-claude-code-20-and-how-to-get-better-at-using-coding-agents/#lore-time---my-love-and-hate-relationship-with-anthropic-and-how-i-reconciled-with-claude-hint-opus-45">writeup about his experience and a phenomenal guide for starting out</a>.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[coding]]></category>
    </item>
    <item>
      <title><![CDATA[GPT Image 1.5 Prompt: Isometric City Views + Weather]]></title>
      <link>https://varunsingh.net/til/imagegen/gpt-image-1-5-isometric-city-weather</link>
      <guid isPermaLink="true">https://varunsingh.net/til/imagegen/gpt-image-1-5-isometric-city-weather</guid>
      <pubDate>Tue, 16 Dec 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[New Image Generation from Open AI, [ChatGPT Images 1.5](https://openai.com/index/new-chatgpt-images-is-here/). Some api options are: `gpt-image-1.5` or with added param `quality=low` for quick image g]]></description>
      <content:encoded><![CDATA[<p>New image generation from OpenAI: <a href="https://openai.com/index/new-chatgpt-images-is-here/">ChatGPT Images 1.5</a>. Some API options are <code>gpt-image-1.5</code>, or the same model with the added param <code>quality=low</code> for quick image gen. I get to use these new models for generating the blog's Hero images. I have gone from imagen to dalle-3 to nanobanana, and soon gpt-image-1.5!</p>
<p>There are lots of prompts, but this is the one that I liked best, h/t to <a href="https://x.com/reach_vb">Vaibhav Srivastav</a>!</p>
<pre><code>Generate the image with the following description (and look up the details 
like date and temperature, time so that you can use it in the image 
generation process): 

CITY= Helsinki, Finland

Present a clear, 45° top-down isometric miniature 3D cartoon scene of [CITY], 
featuring its most iconic landmarks and architectural elements. Use soft, 
refined textures with realistic PBR materials and gentle, lifelike lighting 
and shadows. Integrate the current weather conditions directly into the city 
environment to create an immersive atmospheric mood. Use a clean, minimalistic 
composition with a soft, solid-colored background. 

At the top-center, place the title “[CITY]” in large bold text, a prominent 
weather icon beneath it, then the date (small text) and temperature (medium text). 
All text must be centered with consistent spacing, and may subtly overlap the 
tops of the buildings. 

Square 1080x1080 dimension.
</code></pre>
<p><img src="/static/blog/2025/20251217-gpt-image-1.5-hel.png" alt="isometric view of helsinki"><br>
<img src="/static/blog/2025/20251217-gpt-image-1.5-sf.png" alt="isometric view of sf"></p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[imagegen]]></category>
    </item>
    <item>
      <title><![CDATA[Migrating from an Old Mac (Intel) to a New Mac (MX)]]></title>
      <link>https://varunsingh.net/til/scripts/keeping-two-macs-in-sync</link>
      <guid isPermaLink="true">https://varunsingh.net/til/scripts/keeping-two-macs-in-sync</guid>
      <pubDate>Sun, 14 Dec 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[Migrating from an Intel MBP to an M4 Max, I wanted both machines to feel identical without ever putting secrets in git. A tiny repo, a Brewfile, and a USB stick were enough. This was made easily by yo]]></description>
      <content:encoded><![CDATA[<p>Migrating from an Intel MBP to an M4 Max, I wanted both machines to feel identical without ever putting secrets in git. A tiny repo, a Brewfile, and a USB stick were enough. This was made easy by your favourite coding CLI. Some planning steps were required to understand which files needed to be copied and which needed to be uploaded to git, scrubbing keys and PII from files.</p>
<h3>Layout</h3>
<p>The repo has three parts. <code>dotfiles/</code> holds tracked configs (.zshrc, .zprofile, .aliases, .gitconfig, .ssh/config, GPG configs, Starship, editor settings) written with <code>$HOME</code> rather than hard-coded usernames. <code>usb/</code> is gitignored and mirrors the secret paths (.ssh, .gnupg, .config) for USB transfer. A Brewfile installs git, gnupg, pinentry-mac, zsh-autosuggestions, zsh-syntax-highlighting, starship, libpq, nvm, gh, deno, and Docker Desktop.</p>
<h3>Source machine</h3>
<p>Run <code>move_secrets_to_local.sh</code> once to push stray exports into <code>~/.zshrc.local</code>. Then <code>sync.sh collect</code> writes clean copies into <code>dotfiles/</code>, after which I commit and push. Secrets travel separately: I copy .ssh, .gnupg, and friends into <code>usb/</code> and onto the external drive (default <code>USB_TARGET=/Volumes/Samsung_T5/sync_computer</code>).</p>
<h3>Target machine</h3>
<p>Install Homebrew if needed and run <code>brew bundle --file Brewfile</code>. Pull the repo and run <code>sync.sh apply</code>, which places the dotfiles in the appropriate locations in the <code>$HOME</code> directory, taking a backup only when content differs. Mount the USB and run <code>sync.sh pull-usb</code> to restore SSH and GPG, fix permissions, and restart gpg-agent. I finish with <code>echo &quot;test&quot; | gpg --clearsign</code> and a signed git commit to confirm pinentry works.</p>
<h4>Notes</h4>
<p>Pinentry must be set in <code>~/.gnupg/gpg-agent.conf</code> as <code>pinentry-program /opt/homebrew/bin/pinentry-mac</code>, otherwise git signing complains. Normalise every path to <code>$HOME</code> so different usernames do not break anything. Most CLI tools are quicker to reinstall than to sync; add their dotdirs to the USB only when you truly need the state. <code>sync.sh apply</code> compares files before copying to avoid a pile of <code>.bak.*</code> artefacts.</p>
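<p>The compare-before-copy behaviour can be sketched in a few lines. This is a Python illustration of the idea, not the actual <code>sync.sh</code> (which is a shell script); the function name <code>apply_file</code> and the exact backup suffix are assumptions made for the example.</p>

```python
import filecmp
import shutil
import time
from pathlib import Path


def apply_file(src, dest):
    """Copy src to dest, backing up dest only when the content differs."""
    src, dest = Path(src), Path(dest)
    if dest.exists():
        # shallow=False compares file contents, not just size and mtime.
        if filecmp.cmp(src, dest, shallow=False):
            return "unchanged"  # identical: no copy, no .bak file
        # Hypothetical backup naming: original name plus .bak.<unix time>.
        backup = dest.with_name(f"{dest.name}.bak.{int(time.time())}")
        shutil.copy2(dest, backup)
        status = "updated"
    else:
        dest.parent.mkdir(parents=True, exist_ok=True)
        status = "created"
    shutil.copy2(src, dest)
    return status
```

<p>Running it twice in a row on the same file returns <code>created</code> and then <code>unchanged</code>, which is what keeps the <code>.bak.*</code> pile from growing.</p>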
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[scripts]]></category>
    </item>
    <item>
      <title><![CDATA[Building with Local Models: Transcriptions]]></title>
      <link>https://varunsingh.net/til/pipecat/building-with-local-models-transcriptions</link>
      <guid isPermaLink="true">https://varunsingh.net/til/pipecat/building-with-local-models-transcriptions</guid>
      <pubDate>Sat, 29 Nov 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[It is Thanksgiving weekend and I have a few days off to experiment with local models again. This is a continuation of using [mlx-audio](https://github.com/Blaizzy/mlx-audio/) from the [previous post](]]></description>
      <content:encoded><![CDATA[<p>It is Thanksgiving weekend and I have a few days off to experiment with local models again. This is a continuation of using <a href="https://github.com/Blaizzy/mlx-audio/">mlx-audio</a> from the <a href="/til/pipecat/running-a-voice-ai-cascade-pipeline-on-macos">previous post</a>. However, this time, we are using NVIDIA's <a href="https://huggingface.co/mlx-community/parakeet-tdt-0.6b-v2">MLX Parakeet v2</a> STT model instead of MLX Whisper. You drop audio files in and get timestamped transcripts out, or record directly from your microphone. In both modes, VAD and punctuation are used as sentence boundaries. At punctuation (<code>.</code>, <code>!</code>, <code>?</code>), at long pauses between words (default 0.8-1s), or at the end of the audio file, the transcript is finalised.</p>
<p>The <a href="https://pipecat.ai">pipecat</a> pipeline stores transcripts in four formats:</p>
<ul>
<li><code>TXT</code> - raw text, no timestamps</li>
<li><code>SRT</code> - sentence-level timing, used by video players to sync subtitles</li>
<li><code>WebVTT</code> - web-native subtitle format, loaded through the <code>&lt;track&gt;</code> element</li>
<li><code>JSON</code> - the rich one: sentence-level and token-level timestamps</li>
</ul>
<p>SRT and VTT only give sentence-level timing. JSON gives you both sentence-level and word-level timing. That's the difference between &quot;this sentence was spoken between 0:00 and 0:03&quot; and &quot;the word 'Hello' was spoken between 0.00s and 0.40s.&quot; The latter is what makes karaoke-style highlighting possible.</p>
<p>The <a href="https://parakeettdt.com/">Parakeet TDT</a> model outputs token-level timestamps — each sub-word piece gets a start and end time. &quot;Hello&quot; becomes three tokens: <code>He</code>, <code>ll</code>, <code>o</code>. Each has its own timestamp. We concatenate them for display, but the granularity means the karaoke highlighting is smooth — you see progress <em>within</em> a word, not just jumping word to word. For the sentence splitting logic, since tokens map to word boundaries (spaces are part of the token text), we can split without ever cutting a word in half.</p>
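<p>The sentence splitting rules (terminal punctuation, a long pause, or end of audio) can be sketched as follows. This is a minimal Python illustration under my own assumptions, not the pipeline's code: the <code>Token</code> class, the function names, and the 0.8s default are chosen for the example, and a real splitter would also need to handle cases like decimals (&quot;3.5&quot;).</p>

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str     # sub-word piece; a leading space marks a new word
    start: float  # seconds
    end: float    # seconds


def split_sentences(tokens, max_pause=0.8):
    """Group tokens into sentences at ., !, ? or at a long pause."""
    sentences, current = [], []
    for tok in tokens:
        # A long silence before this token closes the previous sentence.
        if current and tok.start - current[-1].end >= max_pause:
            sentences.append(current)
            current = []
        current.append(tok)
        # Terminal punctuation closes the sentence after this token.
        if tok.text.strip().endswith((".", "!", "?")):
            sentences.append(current)
            current = []
    if current:  # end of audio finalises whatever is left
        sentences.append(current)
    return sentences


def sentence_text(sentence):
    """Concatenate token texts; spaces are already part of the tokens."""
    return "".join(t.text for t in sentence).strip()
```

<p>Because the split only ever happens between tokens, and spaces live inside the token text, a word is never cut in half.</p>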
<p><strong>Karaoke subtitles:</strong> As audio plays, each word lights up the moment it's spoken. The subtitle display below the player shows the current sentence with spoken words in white and upcoming words in gray. This uses <code>timeupdate</code> events from the HTML5 player, a binary search to find the active sentence, then a per-token comparison against <code>currentTime</code>:</p>
<pre><code class="language-js">for (const token of sentence.tokens) {
  const spoken = currentTime &gt;= token.start;
  const cls = spoken ? &quot;text-white&quot; : &quot;text-gray-500&quot;;
  html.push(`&lt;span class=&quot;${cls}&quot;&gt;${escapeHtml(token.text)}&lt;/span&gt;`);
}
</code></pre>
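<p>For the lookup step before that loop, here is a rough Python sketch of the binary search for the active sentence, followed by the same per-token comparison. It assumes sentences shaped like the JSON output (with <code>start</code>, <code>end</code>, and <code>tokens</code>) and sorted by start time; the function names are hypothetical.</p>

```python
from bisect import bisect_right


def active_sentence(sentences, current_time):
    """Return the sentence playing at current_time, or None in a gap."""
    # In a real player, build this list once rather than on every event.
    starts = [s["start"] for s in sentences]
    # Index of the last sentence whose start is at or before current_time.
    i = bisect_right(starts, current_time) - 1
    if i == -1:
        return None  # playback has not reached the first sentence yet
    sentence = sentences[i]
    return None if current_time > sentence["end"] else sentence


def spoken_flags(sentence, current_time):
    """Pair each token text with True once playback has reached it."""
    return [(t["text"], current_time >= t["start"]) for t in sentence["tokens"]]
```

<p>The flags map directly onto the two CSS classes in the loop above: <code>True</code> for spoken tokens, <code>False</code> for upcoming ones.</p>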
<p><strong>A quick cheatsheet about the transcription formats</strong></p>
<p>TXT is plain text. Just the words, no timestamps. Useful for feeding into an LLM for summarisation, full-text search, pasting into a document, or diffing transcriptions from different models.</p>
<pre><code class="language-text">Hello, Thank you for calling the AI Engineer World's Fair 2025.
</code></pre>
<p>SRT is a widely supported subtitle format, stored as a separate file. Useful for embedding subtitles in native players or editors.</p>
<pre><code>1
00:00:00,640 --&gt; 00:00:05,200
 Hello, Thank you for calling the AI Engineer World's Fair 2025.
</code></pre>
<p>WebVTT is the web-native subtitle format, stored as a separate file. Browsers render subtitles natively via a <code>&lt;track src=&quot;subtitles.vtt&quot;&gt;</code> child of a <code>&lt;video&gt;</code> element. It is great for web accessibility; for example, screen readers can consume it.</p>
<pre><code>WEBVTT

1
00:00:00.640 --&gt; 00:00:05.200
 Hello, Thank you for calling the AI Engineer World's Fair 2025.
</code></pre>
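<p>The only difference between the SRT and WebVTT cue timings above is the millisecond separator: a comma in SRT, a dot in WebVTT. A small Python sketch of that conversion from float seconds, with a hypothetical helper name:</p>

```python
def format_timestamp(seconds, sep):
    """Format seconds as HH:MM:SS,mmm (sep=",") or HH:MM:SS.mmm (sep=".")."""
    # Work in whole milliseconds to avoid float rounding surprises.
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def cue_timing(start, end, sep):
    """Build the timing line of a subtitle cue."""
    return f"{format_timestamp(start, sep)} --> {format_timestamp(end, sep)}"
```

<p>With the sentence from the examples, <code>cue_timing(0.64, 5.2, &quot;,&quot;)</code> gives the SRT timing line and <code>cue_timing(0.64, 5.2, &quot;.&quot;)</code> gives the WebVTT one.</p>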
<p>JSON is the rich format: it contains sentence-level and token-level timestamps. The token-level timestamps align with audio playback, which is what makes karaoke-style highlighting possible.</p>
<pre><code class="language-json">{ 
  &quot;text&quot;: &quot;Hello, Thank you for calling the AI Engineer World's Fair 2025. ...&quot;,
  &quot;sentences&quot;: [{
    &quot;text&quot;: &quot;Hello, Thank you for calling the AI Engineer World's Fair 2025.&quot;,
    &quot;start&quot;: 0.64, &quot;end&quot;: 5.2,
    &quot;tokens&quot;: [
      { &quot;text&quot;: &quot; He&quot;, &quot;start&quot;: 0.64, &quot;end&quot;: 0.88, &quot;duration&quot;: 0.24 },
      { &quot;text&quot;: &quot;ll&quot;, &quot;start&quot;: 0.879, &quot;end&quot;: 1.1199, &quot;duration&quot;: 0.24 },
      { &quot;text&quot;: &quot;o&quot;, &quot;start&quot;: 1.12, &quot;end&quot;: 1.44, &quot;duration&quot;: 0.32 },
      { &quot;text&quot;: &quot;, &quot;, &quot;start&quot;: 1.44, &quot;end&quot;: 1.76, &quot;duration&quot;: 0.32 },
      ...
    ]
  }]
}
</code></pre>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[pipecat]]></category>
    </item>
    <item>
      <title><![CDATA[Using Git Worktrees to Isolate Coding Agents]]></title>
      <link>https://varunsingh.net/til/scripts/using-git-worktrees-to-isolate-coding-agents</link>
      <guid isPermaLink="true">https://varunsingh.net/til/scripts/using-git-worktrees-to-isolate-coding-agents</guid>
      <pubDate>Mon, 17 Nov 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[Instead of switching branches, you can create multiple working directories, each tied to a branch. This lets a coding agent work on one task while you (or another agent instance) work on another, avoi]]></description>
      <content:encoded><![CDATA[<p>Instead of switching branches, you can create multiple working directories, each tied to a branch. This lets a coding agent work on one task while you (or another agent instance) work on another, avoiding stashing and accidental edits.</p>
<p>One caveat: I previously put worktrees under <code>./.worktree/feature-branch</code> and added <code>.worktree</code> to <code>.gitignore</code>. That worked fine until coding agents started traversing to the Git root, at which point they discovered other worktrees or the main project itself. Once that happens, isolation is gone.</p>
<p>The fix is simple: do not nest worktrees inside the repo directory. Instead, put them next to it.</p>
<pre><code class="language-bash">cd ~/code/pipecat-core
git worktree add ../pipecat-new-stt feature/new-stt-v1
git worktree add ../pipecat-new-tts feature/new-tts-v1
cd ../pipecat-new-stt
codex
cd ../pipecat-new-tts
claude
# once the work is done, remove the worktrees (cleaner than a plain rm -rf)
cd ~/code/pipecat-core
git worktree remove ../pipecat-new-stt
git worktree remove ../pipecat-new-tts
</code></pre>
<p>Each agent now sees only the files for its branch, and the main repo stays untouched. Branch isolation, enforced by the filesystem, turns out to be exactly what coding agents need.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[scripts]]></category>
    </item>
    <item>
      <title><![CDATA[Local Semantic Search with MiniLM]]></title>
      <link>https://varunsingh.net/til/mcp/local-semantic-search-miniml-chromadb</link>
      <guid isPermaLink="true">https://varunsingh.net/til/mcp/local-semantic-search-miniml-chromadb</guid>
      <pubDate>Sun, 16 Nov 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[As a side project, I am looking at how to chunk docs and code examples. The core challenge: when a user asks "how do I add a Deepgram STT to my pipeline?", keyword search alone will not reliably surfa]]></description>
      <content:encoded><![CDATA[<p>As a side project, I am looking at how to chunk docs and code examples. The core challenge: when a user asks &quot;how do I add a Deepgram STT to my pipeline?&quot;, keyword search alone will not reliably surface the right chunks. The query uses different words than the docs and code that answer it.</p>
<p>We needed semantic search, matching by meaning rather than exact tokens; thus, the question was which embedding model to use.</p>
<p>I picked <a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2"><code>all-MiniLM-L6-v2</code></a>, a <code>sentence-transformers</code> model that maps text to 384-dimensional vectors. It runs entirely locally with no API keys or external calls, which matters for a developer tool that should work offline and not leak queries to a third party. The model is around 80 MB, small enough that the first-run download isn't painful, and inference is fast on a CPU.</p>
<p>The 384-dimensional output is a deliberate trade-off. Larger models like <code>all-mpnet-base-v2</code> (768 dimensions) score higher on general benchmarks, but for our domain — searching across a few thousand chunks of framework documentation and example code — the difference is negligible. The smaller vectors mean faster similarity computation and a smaller <code>ChromaDB</code> index on disk. We may return to this decision later, as code is not the same as text.</p>
<p>During ingestion, we chunk documentation and source files, then embed each chunk with <code>all-MiniLM-L6-v2</code> and store the vectors in <code>ChromaDB</code> using <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine distance</a> (1 minus cosine similarity, which means identical directions score 0, orthogonal vectors score 1, and opposite directions score 2). Our index currently holds 5,277 chunks — 3,996 from 306 documentation pages and 1,281 from 452 source files across two repos.</p>
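<p>To make the distance scale concrete, here is a plain-Python sketch of cosine distance as defined above (1 minus cosine similarity). It is an illustration of the formula only, not how <code>ChromaDB</code> computes it internally; the function name is my own.</p>

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])   # magnitude is ignored
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

<p>The three values land at 0, 1, and 2 respectively, so lower is closer, and vector length plays no part.</p>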
<p>To ground that in data: a single documentation page like the <a href="https://docs.pipecat.ai/guides/learn/text-to-speech">Text to Speech guide</a> is around 2,500 words and 10 code snippets. The chunker splits it into 46 records in <code>ChromaDB</code>, each headed by its section (&quot;Pipeline Placement&quot;, &quot;Frame Processing Flow&quot;, &quot;Supported TTS Services&quot;, and so on). Each of those 46 chunks gets its own 384-dimensional embedding vector, so a query like &quot;how does TTS handle interruptions&quot; can match the specific section that discusses it rather than returning the entire page.</p>
<p>At query time, the user's natural-language question gets embedded with the same model, and <code>ChromaDB</code> returns the nearest neighbours.</p>
<p>In practice, pure vector search gets us most of the way there, but it occasionally misses results where exact terms matter — a specific class name or frame type, say. So we pair it with Best Matching 25 (BM25) keyword search over the same chunks and merge the two result sets using Reciprocal Rank Fusion (RRF). The vector arm handles semantic intent and the keyword arm catches literal matches. The combination is noticeably better than either alone for our retrieval tools.</p>
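<p>RRF itself is only a few lines. In this minimal Python sketch, each ranked list contributes <code>1 / (k + rank)</code> to a chunk's score and the chunks are re-sorted by the total. The constant <code>k=60</code> is the commonly used default and an assumption here, as are the function name and the chunk ids.</p>

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical chunk ids returned by each arm for the same query.
vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

<p>Chunks that appear in both lists (<code>a</code> and <code>c</code>) rise above chunks that only one arm found, which is the behaviour we want from the merge.</p>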
<p>The main rough edge is that <code>all-MiniLM-L6-v2</code> was trained on general English text, not code. It handles docstrings and prose well, but for pure code chunks the embeddings are weaker. Inline code comments would help — perhaps a zealous coding agent would add them. Our chunking strategy mitigates this by preferring function boundaries and including surrounding context, but a code-specific embedding model could improve retrieval for symbol-heavy queries in a future iteration.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[mcp]]></category>
    </item>
    <item>
      <title><![CDATA[Structured Development Plans with Coding Agents]]></title>
      <link>https://varunsingh.net/til/coding/structured-development-plans-with-coding-agents</link>
      <guid isPermaLink="true">https://varunsingh.net/til/coding/structured-development-plans-with-coding-agents</guid>
      <pubDate>Sat, 08 Nov 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[A Claude Code skill that generates structured development plans as markdown files. Claude already creates internal plans that it writes to `~/.claude/plans` but it overwrites them as the work goes thr]]></description>
      <content:encoded><![CDATA[<p>This is a Claude Code skill that generates structured development plans as markdown files. Claude already creates internal plans that it writes to <code>~/.claude/plans</code>, but it overwrites them as the work goes through phases. Think of this skill as a way to capture your conversations and decisions vis-a-vis the coding agent's perspective. If the output veers too far from your expectations, you can have a conversation with the coding agent to adjust the plan. It also captures the Software Development Life Cycle (SDLC) phases: <em>design</em> -&gt; <em>build</em> -&gt; <em>fix</em> -&gt; <em>improve</em>, sometimes <em>trash</em> and <em>repeat</em> knowing what you have learned.</p>
<p>At the heart of it, the dev plans live <em>in the repo</em> (<code>docs/dev_plans/</code>), so they double as lightweight design docs. The checklist format makes it easy to track progress across sessions, and the issues section captures decisions you would otherwise forget. For larger features, the checklist of tasks can be organised in phases. Lastly, all good plans need to identify task dependencies, so that tasks can be executed independently by subagents. The skill produces a timestamped markdown file in <code>docs/dev_plans/</code> (e.g. <code>20251012-trail-claudes-jsonl-files.md</code>) with:</p>
<ul>
<li><strong>Header</strong> — status, branch, priority, dates</li>
<li><strong>Context &amp; Requirements</strong> — the why and what</li>
<li><strong>Implementation Checklist</strong> — phased tasks with checkboxes and interdependencies
<ul>
<li><strong>Technical Specs</strong> — files to touch, interfaces, decisions</li>
<li><strong>Testing &amp; Issues</strong> — test approach, problems hit, solutions found</li>
<li><strong>Acceptance Criteria</strong> — definition of done for each task and phase</li>
</ul>
</li>
</ul>
<pre><code>/dev-plan create feature auth-system   # new plan
/dev-plan update                       # update current plan
/dev-plan complete                     # mark done
/dev-plan list                         # list all plans
</code></pre>
<p><strong>Updated (2025-11-17)</strong>: Figured out <a href="/til/scripts/using-git-worktrees-to-isolate-coding-agents">Git Worktrees</a>; now you can spawn agents that work on different branches.<br>
<strong>Updated (2026-01-27)</strong>: Using the <a href="/til/coding/fan-out-skill-multiple-agents-and-ralph-loops"><code>fan-out</code> skill</a> to spawn subagents for each independent task.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[coding]]></category>
    </item>
    <item>
      <title><![CDATA[Running a Voice AI Cascade Pipeline on macOS]]></title>
      <link>https://varunsingh.net/til/pipecat/running-a-voice-ai-cascade-pipeline-on-macos</link>
      <guid isPermaLink="true">https://varunsingh.net/til/pipecat/running-a-voice-ai-cascade-pipeline-on-macos</guid>
      <pubDate>Fri, 31 Oct 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[[Kwindla](https://x.com/kwindla)'s [repo](https://github.com/kwindla/macos-local-voice-agents) runs a fully local voice agent on macOS using [Pipecat](https://pipecat.ai). Audio in/out happens over a ]]></description>
      <content:encoded><![CDATA[<p><a href="https://x.com/kwindla">Kwindla</a>'s <a href="https://github.com/kwindla/macos-local-voice-agents">repo</a> runs a fully local voice agent on macOS using <a href="https://pipecat.ai">Pipecat</a>. Audio in/out happens over a local WebRTC connection. The local server handles all the complexity of speech‑to‑text, turn detection, LLM responses, and text‑to‑speech. The client is a simple React app (using <a href="https://github.com/pipecat-ai/voice-ui-kit">voice ui kit</a>) that connects to the local agent. Both Silero VAD and smart‑turn v2 are used together: VAD detects speech activity, smart‑turn refines turn boundaries.</p>
<p>Models used by <code>server/bot.py</code>:</p>
<ul>
<li><strong>VAD</strong>: Silero VAD (<code>SileroVADAnalyzer</code>)</li>
<li><strong>Turn detection</strong>: smart‑turn (<code>LocalSmartTurnAnalyzerV*</code>)</li>
<li><strong>STT</strong>: MLX Whisper (<code>WhisperSTTServiceMLX</code>, default <code>MLXModel.LARGE_V3_TURBO_Q4</code>)</li>
<li><strong>LLM</strong>: OpenAI‑compatible HTTP API (LM Studio), default model id <code>gemma-3n-e4b-it-text</code>. The LLM <code>model=</code> must exactly match the model ID that LM Studio is serving, and LM Studio should be running on <code>http://127.0.0.1:1234/v1</code>.</li>
<li><strong>TTS</strong>: MLX‑Audio (<code>TTSMLXIsolated</code>, default <code>mlx-community/Kokoro-82M-bf16</code>, voice <code>af_heart</code>)</li>
</ul>
<p><strong>Warm up TTS model downloads (recommended)</strong></p>
<pre><code class="language-bash">uv run python -m mlx_audio.tts.generate --model &quot;mlx-community/Kokoro-82M-bf16&quot; --text &quot;Hello World, I'm Pipecat!&quot; --file_prefix &quot;output&quot; --audio_format wav
</code></pre>
<p><strong>Start up the server</strong></p>
<pre><code class="language-bash">cd server
uv sync
# sync will take a moment
uv run bot.py
</code></pre>
<p>First server run can take 30+ seconds due to model downloads; warming up TTS helps.</p>
<p><strong>Run the client</strong></p>
<pre><code class="language-bash">cd client
npm install
npm run dev
</code></pre>
<p>Once done, go to <code>localhost:3000</code> to see the Pipecat bot in action. See screenshots below. For TTFB, I track four moments in the logs: when the user starts speaking, when the user stops speaking (end of turn), the LLM call (first token), and when the bot starts speaking (first audio).<br>
<img src="/static/blog/2025/20251031-til-local-model-conversation.png" alt="Conversation view"><br>
<img src="/static/blog/2025/20251031-til-local-model-ttfb.png" alt="Metrics view"></p>
<p>From the sample run below: the user spoke for ~2.68s (15:01:21.315 → 15:01:23.992). STT TTFB was ~0.225s, LLM TTFB was ~0.353s, TTS TTFB was ~0.239s, and the bot started speaking at 15:01:24.921 — about 0.93s after the user stopped, or ~0.65s after the LLM call log line.</p>
<pre><code class="language-logs">2025-10-31 15:01:21.315 | DEBUG    | pipecat.transports.base_input:_handle_user_interruption:348 - User started speaking
2025-10-31 15:01:23.992 | DEBUG    | pipecat.audio.turn.smart_turn.base_smart_turn:analyze_end_of_turn:157 - End of Turn result: EndOfTurnState.COMPLETE
2025-10-31 15:01:23.992 | DEBUG    | pipecat.transports.base_input:_handle_user_interruption:372 - User stopped speaking
2025-10-31 15:01:24.218 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_ttfb_metrics:131 - WhisperSTTServiceMLX#5 TTFB: 0.2252826690673828
2025-10-31 15:01:24.218 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_processing_metrics:152 - WhisperSTTServiceMLX#5 processing time: 0.22540783882141113
2025-10-31 15:01:24.218 | DEBUG    | pipecat.services.whisper.stt:run_stt:511 - Transcription: [ Yeah, could you tell me an extension to that story? ]
2025-10-31 15:01:24.269 | DEBUG    | pipecat.services.openai.base_llm:_stream_chat_completions:247 - OpenAILLMService#5: Generating chat [[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;$PROMPT&quot;}, {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: &quot;Hello, I'm Pipecat!&quot;}, {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot; Yeah, could you tell me an extension to that story? &quot;}]]
2025-10-31 15:01:24.621 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_ttfb_metrics:131 - OpenAILLMService#5 TTFB: 0.35277700424194336
2025-10-31 15:01:24.681 | DEBUG    | tts_mlx_isolated:run_tts:178 - TTSMLXIsolated#5: Generating TTS [Unit 734 kept dancing.]
2025-10-31 15:01:24.681 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:start_tts_usage_metrics:191 - TTSMLXIsolated#5 usage characters: 22
2025-10-31 15:01:24.682 | DEBUG    | tts_mlx_isolated:_send_command:104 - Sending command: {'cmd': 'generate', 'text': 'Unit 734 kept dancing.'}
Generated segment shape: (74400,), min: -0.2208, max: 0.2241
Final audio shape: (74400,), min: -0.2208, max: 0.2241
2025-10-31 15:01:24.920 | DEBUG    | tts_mlx_isolated:_send_command:127 - Worker response: success with 198400 chars of audio data
2025-10-31 15:01:24.920 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_ttfb_metrics:131 - TTSMLXIsolated#5 TTFB: 0.23913288116455078
2025-10-31 15:01:24.921 | DEBUG    | pipecat.transports.base_output:_bot_started_speaking:567 - Bot started speaking
2025-10-31 15:01:24.929 | DEBUG    | tts_mlx_isolated:run_tts:217 - TTSMLXIsolated#5: Finished TTS [Unit 734 kept dancing.]
2025-10-31 15:01:24.929 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_processing_metrics:152 - TTSMLXIsolated#5 processing time: 0.24795174598693848
2025-10-31 15:01:24.929 | DEBUG    | tts_mlx_isolated:run_tts:178 - TTSMLXIsolated#5: Generating TTS [ He learned new moves.]
2025-10-31 15:01:24.929 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:start_tts_usage_metrics:191 - TTSMLXIsolated#5 usage characters: 22
2025-10-31 15:01:24.929 | DEBUG    | tts_mlx_isolated:_send_command:104 - Sending command: {'cmd': 'generate', 'text': ' He learned new moves.'}
Generated segment shape: (43200,), min: -0.2098, max: 0.2576
Final audio shape: (43200,), min: -0.2098, max: 0.2576
2025-10-31 15:01:25.106 | DEBUG    | tts_mlx_isolated:_send_command:127 - Worker response: success with 115200 chars of audio data
2025-10-31 15:01:25.107 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_ttfb_metrics:131 - TTSMLXIsolated#5 TTFB: 0.17730188369750977
2025-10-31 15:01:25.111 | DEBUG    | tts_mlx_isolated:run_tts:217 - TTSMLXIsolated#5: Finished TTS [ He learned new moves.]
2025-10-31 15:01:25.111 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_processing_metrics:152 - TTSMLXIsolated#5 processing time: 0.18155407905578613
2025-10-31 15:01:25.111 | DEBUG    | tts_mlx_isolated:run_tts:178 - TTSMLXIsolated#5: Generating TTS [ Other robots watched.]
2025-10-31 15:01:25.111 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:start_tts_usage_metrics:191 - TTSMLXIsolated#5 usage characters: 22

(...story goes on...)

2025-10-31 15:01:26.454 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_ttfb_metrics:131 - TTSMLXIsolated#5 TTFB: 0.20525074005126953
2025-10-31 15:01:26.464 | DEBUG    | tts_mlx_isolated:run_tts:217 - TTSMLXIsolated#5: Finished TTS [ And it's okay to follow your dreams, even if you're a robot.]
2025-10-31 15:01:26.464 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_processing_metrics:152 - TTSMLXIsolated#5 processing time: 0.21518397331237793
2025-10-31 15:01:49.187 | DEBUG    | pipecat.transports.base_output:_bot_stopped_speaking:583 - Bot stopped speaking
</code></pre>
<hr>
<p>I love the local TTS models, because for some audio quality testing, you can have the LLM create a story and then have the Pipecat bot consistently tell you that story, with tests covering interruptions, flaky internet, background noise, etc.</p>
<pre><code class="language-bash">% uv run python -m mlx_audio.tts.generate --model &quot;mlx-community/Kokoro-82M-bf16&quot; --text &quot;Unit 734 was a robot. He had gears and wires. He longed to dance. But robots aren't really built for dancing, are they? He practiced in secret. He wobbled and whirred. He tried spins and jumps. It wasn't easy. One day, the factory had a party. Music played! Unit 734 stepped forward. He danced his best dance. Everyone cheered! He showed them that even robots can dream. And sometimes, dreams come true. Unit 734 kept dancing. He learned new moves. Other robots watched. They started joining in! Soon, the factory had a robot dance club. Everyone had fun. Unit 734 proved that being different is great. It's what makes you special. And it's okay to follow your dreams, even if you're a robot.&quot; --file_prefix &quot;output&quot; --audio_format wav
</code></pre>
<p>This produces two files that can be concatenated: <a href="/static/blog/2025/20251031-til-robot-dances-1.wav">robot-dances-1.wav</a> and <a href="/static/blog/2025/20251031-til-robot-dances-2.wav">robot-dances-2.wav</a>. Sometimes you need something more boring, like a counting bot. In that case, use Python to stitch together the TTS with some pauses (dramatic or impatient, up to you!). Here is an example WAV file, produced by a short Python script, that just counts numbers: <a href="/static/blog/2025/20251031-til-count_1_100.wav">count.wav</a></p>
<pre><code class="language-bash">% uv run python counting_wav.py --start 1 --end 100 --pause-ms 400 --out count_1_100.wav
Fetching 56 files: 100%|███████████████████████████████████████████| 56/56 [00:00&lt;00:00, 23561.14it/s]
2025-10-31 14:21:13.861 | INFO     | mlx_audio.tts.models.kokoro.kokoro:_get_pipeline:261 - Creating new KokoroPipeline for language: a
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 83.5 MB/s  0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Wrote count_1_100.wav (24000 Hz)
</code></pre>
<p>The code is below.</p>
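<pre><code class="language-python">from typing import List

import numpy as np
import soundfile as sf

# parse_args(), load_model() and number_text() are defined elsewhere in counting_wav.py.
def main() -&gt; None:
    args = parse_args()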

    if args.start &gt; args.end:
        raise SystemExit(&quot;--start must be &lt;= --end&quot;)

    as_digits = args.as_digits

    model = load_model(args.model)
    pause_samples = int(args.sample_rate * (args.pause_ms / 1000.0))
    silence = np.zeros(pause_samples, dtype=np.float32)

    segments: List[np.ndarray] = []
    for n in range(args.start, args.end + 1):
        text = number_text(n, as_digits=as_digits)
        voice = args.voice or None
        chunk_list = list(model.generate(text=text, voice=voice, speed=args.speed))
        if not chunk_list:
            continue
        audio = np.concatenate([np.asarray(c.audio, dtype=np.float32) for c in chunk_list], axis=0)
        segments.append(audio)
        segments.append(silence)

    if not segments:
        raise SystemExit(&quot;No audio was generated.&quot;)

    full = np.concatenate(segments, axis=0)
    sf.write(args.out, full, samplerate=args.sample_rate)
    print(f&quot;Wrote {args.out} ({args.sample_rate} Hz)&quot;)
</code></pre>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[pipecat]]></category>
    </item>
    <item>
      <title><![CDATA[Parsing Multi-Provider Claude Code, Codex, and Gemini Usage Logs]]></title>
      <link>https://varunsingh.net/til/coding/parsing-multi-provider-claude-code-codex-and-gemini-usage-logs</link>
      <guid isPermaLink="true">https://varunsingh.net/til/coding/parsing-multi-provider-claude-code-codex-and-gemini-usage-logs</guid>
      <pubDate>Mon, 20 Oct 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[Building `trail-cli`, a vendor-neutral CLI for browsing agent logs from Codex, Claude, and Gemini, showed me a few things.

**Tokens are not just input + output**

Across providers, "input" and "outpu]]></description>
      <content:encoded><![CDATA[<p>Building <code>trail-cli</code>, a vendor-neutral CLI for browsing agent logs from Codex, Claude, and Gemini, showed me a few things.</p>
<p><strong>Tokens are not just input + output</strong></p>
<p>Across providers, &quot;input&quot; and &quot;output&quot; are composed of multiple buckets. If you only look at the headline numbers, you will undercount.</p>
<p><img src="/static/blog/2025/20251020-til-trail-coding-jsonl.png" alt="trail is a tui to browse agent logs"></p>
<p>Input buckets often include user content (what the human typed), system/developer instructions (hidden but billable), tool results (outputs from tools fed back into the model), cache read tokens (retrieved from a prompt cache), and cache creation tokens (stored for reuse later).</p>
<p>Output buckets often include assistant text (what you see), tool calls (serialised tool invocations), and reasoning/thought tokens (when providers break them out separately).</p>
<p>There is also the question of how counts are reported. Some providers report per-message deltas (Claude, Gemini), others report cumulative totals (Codex's <code>event_msg</code> + <code>token_count</code> fields). If you sum cumulative counts assuming they are deltas, you will massively overcount. For cumulative reporters, use only the final total per session.</p>
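<p>A minimal sketch of the difference, using made-up token counts rather than real log records. The function names are hypothetical and not part of <code>trail-cli</code>:</p>

```python
from typing import Iterable, List


def total_from_deltas(per_message_tokens: Iterable[int]) -> int:
    """Per-message reporters: every value is new usage, so add them up."""
    return sum(per_message_tokens)


def total_from_cumulative(running_totals: List[int]) -> int:
    """Cumulative reporters: each value already includes the earlier ones."""
    return running_totals[-1] if running_totals else 0


# Hypothetical session: three turns costing 100, 250 and 50 tokens.
deltas = [100, 250, 50]
cumulative = [100, 350, 400]  # the same usage, reported as running totals

print(total_from_deltas(deltas))          # 400
print(total_from_cumulative(cumulative))  # 400
print(sum(cumulative))                    # 850, the overcounting mistake
```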
<p>I ended up normalising on several categories: raw input tokens, cached input tokens, raw output tokens, cached output tokens, tool call tokens, and thinking/reasoning tokens, to name a few.</p>
<p>Claude stores JSONL messages with per-message usage that you sum across the session. Multiple files may belong to the same session (grouped by <code>sessionId</code>), and agent sessions (<code>agent-*.jsonl</code>) are treated as subsessions of the main session. Cache tokens (<code>cache_creation_input_tokens</code> + <code>cache_read_input_tokens</code>) need folding into <code>cache_input_tokens</code> for accurate cache totals. Tool calls and subagents that Claude spawns are reported within the same file.</p>
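<p>Roughly, the grouping and cache folding could look like the sketch below. The record shape is simplified and assumed (a flat <code>usage</code> dict next to <code>sessionId</code>); real log lines may nest these fields differently, and <code>summarise_sessions</code> is a hypothetical name:</p>

```python
from collections import defaultdict
from typing import Dict, Iterable


def summarise_sessions(records: Iterable[dict]) -> Dict[str, Dict[str, int]]:
    """Sum per-message usage per sessionId, folding both cache fields together."""
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0, "cache_input_tokens": 0}
    )
    for record in records:
        usage = record.get("usage") or {}
        session = totals[record["sessionId"]]
        session["input_tokens"] += usage.get("input_tokens", 0)
        session["output_tokens"] += usage.get("output_tokens", 0)
        # Cache creation and cache reads are folded into one cache total.
        session["cache_input_tokens"] += (
            usage.get("cache_creation_input_tokens", 0)
            + usage.get("cache_read_input_tokens", 0)
        )
    return dict(totals)


# Hypothetical records, as if read from two files of the same session.
records = [
    {"sessionId": "s1", "usage": {"input_tokens": 10, "output_tokens": 5,
                                  "cache_creation_input_tokens": 200}},
    {"sessionId": "s1", "usage": {"input_tokens": 3, "output_tokens": 7,
                                  "cache_read_input_tokens": 200}},
    {"sessionId": "s2", "usage": {"input_tokens": 1, "output_tokens": 1}},
]
summary = summarise_sessions(records)
```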
<p>Gemini's messages array mixes user and model responses, and <code>thoughts</code> appears on Gemini messages as an array of subject/description/timestamp. Its <code>tokens</code> object includes separate <code>input/output/cached/thoughts/tool/total</code> counts.</p>
<p>Codex items do not include per-item timestamps, so <code>trail-cli</code> uses <code>session.timestamp</code> as the session start time. Legacy snapshots (pre-September snapshots) can be rollouts of the same session, so we drop snapshots that are prefixes of later snapshots within the same day and include <code>session.instructions</code> in the signature. The <code>event_msg</code> + <code>token_count</code> fields report cumulative totals, so we use the final total per session. Message content may appear in multiple shapes: string, list of content parts, or payload/message nesting. Unlike Claude, Codex spawns subagents as separate session files, which we tie together with a common group ID.</p>
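<p>One way to express the prefix rule, as a sketch only. It assumes each snapshot has already been reduced to its instructions plus an ordered list of message texts, and that the caller passes in snapshots from a single day; the <code>Snapshot</code> type and function names are hypothetical:</p>

```python
from typing import List, Sequence, Tuple

# (instructions, ordered message texts); both are part of the signature.
Snapshot = Tuple[str, Sequence[str]]


def is_prefix(shorter: Snapshot, longer: Snapshot) -> bool:
    """True when `shorter` looks like an earlier rollout of `longer`."""
    short_items, long_items = list(shorter[1]), list(longer[1])
    return (
        shorter[0] == longer[0]  # same instructions
        and len(short_items) < len(long_items)
        and long_items[: len(short_items)] == short_items
    )


def drop_prefix_snapshots(snapshots: List[Snapshot]) -> List[Snapshot]:
    """Keep only snapshots that are not a strict prefix of another snapshot."""
    return [
        snap
        for snap in snapshots
        if not any(is_prefix(snap, other) for other in snapshots)
    ]


# Hypothetical snapshots from one day.
same_day = [
    ("be terse", ["hi"]),
    ("be terse", ["hi", "hello"]),
    ("be verbose", ["hi"]),
]
kept = drop_prefix_snapshots(same_day)
```

<p>Because the instructions are part of the signature, the third snapshot survives even though its messages match the start of the second.</p>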
<p>Code-wise, I chose to normalise to a common model: each provider gets an adapter that emits the same <code>Event</code> and <code>Session</code> dataclasses. The CLI does not care where the data came from:</p>
<pre><code>codex.py  ─┐
claude.py ─┼─▶ Event/Session ─▶ cli.py
gemini.py ─┘
</code></pre>
<p>Provider-specific oddities (Codex's <code>prompt_tokens</code> vs Claude's <code>input_tokens</code>, Gemini's type: <code>&quot;gemini&quot;</code> → role <code>&quot;assistant&quot;</code>) stay in the adapter. If you are building usage analytics, always sum the hidden categories or you will undercount.</p>
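<p>A stripped-down sketch of the adapter idea. The fields on <code>Event</code> and <code>Session</code> and the raw record shapes are assumptions for illustration; the real dataclasses carry more than this:</p>

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    role: str
    input_tokens: int = 0


@dataclass
class Session:
    provider: str
    events: List[Event] = field(default_factory=list)


def codex_event(raw: dict) -> Event:
    # Codex-style record: input usage arrives as prompt_tokens.
    return Event(role=raw["role"], input_tokens=raw.get("prompt_tokens", 0))


def claude_event(raw: dict) -> Event:
    # Claude-style record: input usage arrives as input_tokens.
    return Event(role=raw["role"], input_tokens=raw.get("input_tokens", 0))


def gemini_event(raw: dict) -> Event:
    # Gemini-style record: type "gemini" maps to role "assistant".
    role = "assistant" if raw.get("type") == "gemini" else raw.get("type", "user")
    return Event(role=role, input_tokens=raw.get("tokens", {}).get("input", 0))


def total_input(session: Session) -> int:
    """Consumer side: works on any Session, whatever the provider."""
    return sum(event.input_tokens for event in session.events)


session = Session(
    provider="gemini",
    events=[
        gemini_event({"type": "user", "tokens": {"input": 12}}),
        gemini_event({"type": "gemini", "tokens": {"input": 30}}),
    ],
)
```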
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[coding]]></category>
    </item>
    <item>
      <title><![CDATA[Telephony with wideband and narrowband, G.722 vs G.711]]></title>
      <link>https://varunsingh.net/til/webrtc/telephony-wideband-vs-narrowband-g722-vs-g711</link>
      <guid isPermaLink="true">https://varunsingh.net/til/webrtc/telephony-wideband-vs-narrowband-g722-vs-g711</guid>
      <pubDate>Fri, 26 Sep 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[Two quick observations about voice quality and wideband detection.

WebRTC already delivers high-quality voice using Opus (50-20000 Hz, 10-510 kbps), which supports wideband and fullband audio with bi]]></description>
      <content:encoded><![CDATA[<p>Two quick observations about voice quality and wideband detection.</p>
<p>WebRTC already delivers high-quality voice using Opus (50-20000 Hz, 10-510 kbps), which supports wideband and fullband audio with bitrate adaptation and packet loss resilience. The biggest practical distinction in audio quality is heard when audio originates from traditional telephony (PSTN). In that world, legacy codecs like G.711 are narrowband (<code>PCMU/PCMA</code>, roughly <code>~300-3400 Hz</code> at <code>64 kbps</code>), while G.722 is wideband (<code>SB-ADPCM</code>, typically <code>~50-7000 Hz</code> at <code>64 kbps</code>). When a call traverses PSTN segments, the codec may drop to G.711 and you will hear the classic “phone” sound with reduced high-frequency detail. When endpoints and the network use G.722 (or Opus end-to-end), the voice sounds noticeably more natural and crisp.</p>
<p>The narrowband-versus-wideband issue is particularly prominent with SIP-based interconnect, as Voice over LTE and Wi‑Fi Calling usually use wideband and interoperate with AMR-WB (<code>50-7000 Hz</code> at <code>6.6-23.85 kbps</code>) and AMR-NB (<code>300-3400 Hz</code> at <code>4.75-12.2 kbps</code>). However, when a call is routed to a non-VoLTE carrier (like many SIP providers), it falls back to using G.711 (worse quality but widely supported) and sometimes G.722 (better quality). This codec switch is not conveyed to the caller, which can lead to a noticeable degradation in audio quality, especially in high-frequency content. As a result, some calls sound high quality while others sound low quality, depending on the host or remote user's carrier.</p>
<p>This is easily visible in spectrograms: Opus shows energy up into the highs, while G.711 rolls off sharply around ~3.4–4 kHz.</p>
<p><img src="/static/blog/2025/20250926-daily.png" alt="Spectrogram of Opus"><br>
<img src="/static/blog/2025/20250926-carrier.png" alt="Spectrogram of G.711"></p>
<p>WebRTC only mandates Opus and G.711; G.722 is optional and AMR is generally not available in browsers. It is typical for the SIP-to-WebRTC interconnect to transcode to Opus (that’s what we do at Daily). Ergo, testing carriers that support G.722 with SIP has been high on our list. To quickly tell whether a “wideband” file is truly HD or just upsampled phone audio, I coded a small utility that inspects the recording. The simple algorithm is:</p>
<p>First, resample the file to a consistent rate and look at its frequencies over time. Then, focus on moments with actual speech. Measure how much energy sits above 4000 Hz; real wideband speech has <em>some</em>, while narrowband phone audio <em>does not</em>. Lastly, look for a sharp drop around 3800 Hz (look between 3500 Hz and 4100 Hz); a big cliff there suggests a phone-style cutoff!</p>
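<p>The steps above can be sketched with NumPy alone. This is not the <code>test_quality.py</code> utility, just an approximation of the same idea: it assumes the audio is already decoded and resampled into a mono float array, uses a crude "louder than the median frame" gate in place of real speech detection, and the function name and band edges for the cliff check are my own choices:</p>

```python
import numpy as np


def band_energy_db(signal: np.ndarray, sample_rate: int, frame: int = 1024) -> dict:
    """Compare spectral energy below and above 4 kHz on the loudest frames.

    Assumes sample_rate >= 16000 so that the 4-8 kHz band exists.
    """
    # Split into non-overlapping frames and apply a Hann window to each.
    n_frames = len(signal) // frame
    frames = signal[: n_frames * frame].reshape(n_frames, frame) * np.hanning(frame)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame, d=1.0 / sample_rate)

    # Crude activity gate: keep frames whose energy is at or above the median.
    frame_energy = power.sum(axis=1)
    active = power[frame_energy >= np.median(frame_energy)]
    spectrum = active.mean(axis=0) + 1e-12  # avoid log(0)

    def band(lo: float, hi: float) -> float:
        mask = (freqs >= lo) & (freqs < hi)
        return float(spectrum[mask].sum())

    # Energy above 4 kHz, relative to everything in 0-8 kHz.
    hi_band_db = 10 * np.log10(band(4000, 8000) / band(0, 8000))
    # Cliff check: energy just below versus just above the ~3.8 kHz knee.
    cliff_db = 10 * np.log10(band(3000, 3500) / band(4100, 4600))
    return {"hi_band_db": hi_band_db, "cliff_db": cliff_db}
```

<p>A strongly negative <code>hi_band_db</code> together with a large positive <code>cliff_db</code> points at a phone-style cutoff; the thresholds you pick for a verdict are a judgment call.</p>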
<p>Or you can ask <em>codex</em> or <em>claude</em> to build the tool for you :D</p>
<pre><code class="language-bash">% uv run test_quality.py samples/1-daily.m4a --spectrogram 1-daily.png
=== Prior 8 kHz Downsampling Detector ===
File: samples/1-daily.m4a
Analysis SR: 48000 Hz
HiBandEnergy (4-8 kHz) : -63.4 dB (relative to 0-8 kHz)
KneeSteepness (3-5 kHz): 147.9 dB/kHz
KneeFreq: 3853 Hz
Resampling consistency (residual hi-band): 0.5 dB
Verdict: Likely prior 8 kHz history  (score=4/6)

% uv run test_quality.py samples/2-carrier.wav --spectrogram 2-carrier.png
=== Prior 8 kHz Downsampling Detector ===
File: samples/2-carrier.wav
Analysis SR: 48000 Hz
HiBandEnergy (4-8 kHz) : -29.8 dB (relative to 0-8 kHz)
KneeSteepness (3-5 kHz): 43.1 dB/kHz
KneeFreq: 4972 Hz
Resampling consistency (residual hi-band): 0.1 dB
Verdict: Uncertain  (score=2/6)
</code></pre>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[webrtc]]></category>
    </item>
    <item>
      <title><![CDATA[Claude Code's Status Line]]></title>
      <link>https://varunsingh.net/til/claude/claude-statusline</link>
      <guid isPermaLink="true">https://varunsingh.net/til/claude/claude-statusline</guid>
      <pubDate>Thu, 21 Aug 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[Recently Cat (Claude Code PM) [tweeted](https://x.com/_catwu/status/1953927014244921379?s=20) that claude's status line can be updated!

My minimal statusline is: `varunsingh.net main Opus 4.1`, the `]]></description>
      <content:encoded><![CDATA[<p>Recently Cat (Claude Code PM) <a href="https://x.com/_catwu/status/1953927014244921379?s=20">tweeted</a> that claude's status line can be updated!</p>
<p>My minimal statusline is: <code>varunsingh.net main Opus 4.1</code>, the <code>git branch</code> is blue, and the model name is in green.</p>
<pre><code class="language-json">{
  &quot;statusLine&quot;: {
    &quot;type&quot;: &quot;command&quot;,
    &quot;command&quot;: &quot;input=$(cat); current_dir=$(echo \&quot;$input\&quot; | jq -r '.workspace.current_dir'); model_name=$(echo \&quot;$input\&quot; | jq -r '.model.display_name'); git_branch=$(cd \&quot;$current_dir\&quot; 2&gt;/dev/null &amp;&amp; git --no-optional-locks branch --show-current 2&gt;/dev/null || echo 'no-git'); basename_dir=$(basename \&quot;$current_dir\&quot;); printf \&quot;\\033[2m%s\\033[0m \\033[1;34m%s\\033[0m \\033[1;32m%s\\033[0m\&quot; \&quot;$basename_dir\&quot; \&quot;$git_branch\&quot; \&quot;$model_name\&quot;&quot;
  },
  &quot;alwaysThinkingEnabled&quot;: true
}
</code></pre>
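<p>If the one-liner gets hard to read, the same formatting can be sketched as a small Python function. It only uses the two JSON fields the shell command reads (<code>workspace.current_dir</code> and <code>model.display_name</code>); the function name, the example path, and passing the branch in as an argument are assumptions for illustration:</p>

```python
import json
import os


def render_status(payload: dict, git_branch: str = "no-git") -> str:
    """Dim directory name, bold blue branch, bold green model name."""
    current_dir = payload["workspace"]["current_dir"]
    model_name = payload["model"]["display_name"]
    # normpath drops a trailing slash so basename behaves like the shell tool.
    basename_dir = os.path.basename(os.path.normpath(current_dir))
    return (
        f"\033[2m{basename_dir}\033[0m "
        f"\033[1;34m{git_branch}\033[0m "
        f"\033[1;32m{model_name}\033[0m"
    )


# Hypothetical payload, shaped like the JSON the command receives on stdin.
payload = json.loads(
    '{"workspace": {"current_dir": "/Users/me/varunsingh.net"},'
    ' "model": {"display_name": "Opus 4.1"}}'
)
print(render_status(payload, git_branch="main"))
```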
<p><strong>December 2025:</strong> Hello emojis! <code>💻 macbook | 📂 varunsingh.net | 🌿 main* | 🤖 Claude Opus 4.5 | 📊 42%</code>.</p>
<pre><code class="language-json">{
  &quot;statusLine&quot;: {
    &quot;type&quot;: &quot;command&quot;,
    &quot;command&quot;: &quot;input=$(cat); current_dir=$(echo \&quot;$input\&quot; | jq -r '.workspace.current_dir'); model_name=$(echo \&quot;$input\&quot; | jq -r '.model.display_name'); context_pct=$(echo \&quot;$input\&quot; | jq -r '((.context.used // 0) / (.context.total // 1) * 100) | floor'); git_branch=$(cd \&quot;$current_dir\&quot; 2&gt;/dev/null &amp;&amp; git --no-optional-locks branch --show-current 2&gt;/dev/null || echo ''); git_dirty=$(cd \&quot;$current_dir\&quot; 2&gt;/dev/null &amp;&amp; ! git --no-optional-locks diff --quiet 2&gt;/dev/null &amp;&amp; echo '*' || echo ''); basename_dir=$(basename \&quot;$current_dir\&quot;); host_name=$(hostname -s); printf \&quot;\\033[2m💻 %s\\033[0m | \\033[2m📂 %s\\033[0m | \\033[1;34m🌿 %s%s\\033[0m | \\033[1;32m🤖 %s\\033[0m | \\033[2m📊 %s%%\\033[0m\&quot; \&quot;$host_name\&quot; \&quot;$basename_dir\&quot; \&quot;$git_branch\&quot; \&quot;$git_dirty\&quot; \&quot;$model_name\&quot; \&quot;$context_pct\&quot;&quot;
  },
  &quot;alwaysThinkingEnabled&quot;: true
}
</code></pre>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[claude]]></category>
    </item>
    <item>
      <title><![CDATA[Playwright MCP for CLIs]]></title>
      <link>https://varunsingh.net/til/mcp/playwright-mcp-for-clis</link>
      <guid isPermaLink="true">https://varunsingh.net/til/mcp/playwright-mcp-for-clis</guid>
      <pubDate>Sun, 17 Aug 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[Needed a fast way to drive headless browser tasks from a plain CLI tool without spinning up a full web UI.

This site had a small bug with the way the Mermaid diagram was rendering. Trying to prompt `claude-c]]></description>
      <content:encoded><![CDATA[<p>Needed a fast way to drive headless browser tasks from a plain CLI tool without spinning up a full web UI.</p>
<p>This site had a small bug with the way the Mermaid diagram was rendering. Trying to prompt <code>claude-code</code> to do the correct thing was a bit of a pain, because it wasn't able to decipher the full context from what I was reporting (mainly copy-pasting). Taking a screenshot and pasting it into the Claude or ChatGPT app definitely worked. But I was looking for an in-situ solution, i.e., without leaving the terminal.</p>
<p>Looking around, I found the <a href="https://github.com/microsoft/playwright-mcp">Playwright MCP</a>. You can add it to <code>claude-code</code> by running:</p>
<p><code>❯ claude mcp add playwright npx @playwright/mcp@latest</code></p>
<p>It creates a <code>.playwright-mcp</code> folder in the root of where you ran <code>claude-code</code> and all the screenshots that it takes go in there, so watch out if the directory balloons.</p>
<p>Asking Claude about the Playwright tools, here is the summary:</p>
<pre><code>  Available Playwright-capable tools in this environment:

  - mcp__playwright__browser_install
  - mcp__playwright__browser_navigate
  - mcp__playwright__browser_tabs
  - mcp__playwright__browser_take_screenshot
  - mcp__playwright__browser_snapshot
  - mcp__playwright__browser_click
  - mcp__playwright__browser_hover
  - mcp__playwright__browser_drag
  - mcp__playwright__browser_fill_form
  - mcp__playwright__browser_type
  - mcp__playwright__browser_press_key
  - mcp__playwright__browser_select_option
  - mcp__playwright__browser_wait_for
  - mcp__playwright__browser_handle_dialog
  - mcp__playwright__browser_file_upload
  - mcp__playwright__browser_evaluate
  - mcp__playwright__browser_run_code
  - mcp__playwright__browser_resize
  - mcp__playwright__browser_navigate_back
  - mcp__playwright__browser_close
  - mcp__playwright__browser_network_requests
  - mcp__playwright__browser_console_messages

  Read-only vs. mutating calls (practical view):

  Read-only: 
  - mcp__playwright__browser_snapshot, 
  - mcp__playwright__browser_take_screenshot, 
  - mcp__playwright__browser_console_messages, 
  - mcp__playwright__browser_network_requests, 
  - mcp__playwright__browser_wait_for (only observes), 
  - mcp__playwright__browser_navigate, mcp__playwright__browser_navigate_back, mcp__playwright__browser_tabs (when listing/selecting), 
  - mcp__playwright__browser_close (closes page, doesn’t change target site), 
  - mcp__playwright__browser_resize, 
  - mcp__playwright__browser_run_code (depends on code you run; can be read-only if you only query state), 
  - browser_install (affects local tooling only).
  
  Mutating the page/session: 
  - mcp__playwright__browser_click, 
    - hover, drag, 
    - type, fill_form, press_key, select_option, handle_dialog, 
    - file_upload, evaluate (if it runs code that changes state).
</code></pre>
<p>Typical workflow with these tools:</p>
<ol>
<li>Start the browser (implicit) and <code>browser_navigate</code> to your app URL.</li>
<li>Interact as needed (click/type/fill_form etc.) to reach the state under test.</li>
<li>Observe/assert using <code>browser_snapshot</code> (DOM accessibility tree) or <code>browser_take_screenshot</code>.</li>
<li>Gather logs via <code>browser_console_messages</code> or network via <code>browser_network_requests</code>.</li>
<li>Repeat navigation/interaction/observation steps as needed, finish with <code>browser_close</code> if you want to clean up.</li>
</ol>
<p><strong>UPDATED (2025-09-26):</strong></p>
<pre><code class="language-shell">❯ codex mcp add playwright npx @playwright/mcp@latest
❯ gemini mcp add playwright npx @playwright/mcp@latest
</code></pre>
<p>In LM Studio, update <code>mcp.json</code>:</p>
<pre><code class="language-json">{
  &quot;mcpServers&quot;: {
    &quot;playwright&quot;: {
      &quot;command&quot;: &quot;npx&quot;,
      &quot;args&quot;: [
        &quot;@playwright/mcp@latest&quot;
      ]
    }
  }
}
</code></pre>
<p>Amidst the September flurry of announcements by all the big labs, GitHub announced an <a href="https://github.com/mcp">MCP registry</a>, which makes it easier to discover these!</p>
<p>Also read <a href="https://til.simonwillison.net/claude-code/playwright-mcp-claude-code">Simon's TIL</a>, which I chanced upon as well while doing my search.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[mcp]]></category>
    </item>
    <item>
      <title><![CDATA[A Specification for Voice AI Evaluation]]></title>
      <link>https://varunsingh.net/post/voice-ai-eval-criteria</link>
      <guid isPermaLink="true">https://varunsingh.net/post/voice-ai-eval-criteria</guid>
      <pubDate>Wed, 30 Jul 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[A practical, industry-agnostic specification for evaluating multi-turn voice AI systems. It covers conversation flow, timing, error recovery, and responsiveness using synthetic tests.]]></description>
      <content:encoded><![CDATA[<p><strong>TL;DR:</strong> Most voice AI apps skip evaluations because so many things matter in a real conversation: timing, interruptions, and task completion. This post introduces a practical specification for evaluating voice AI platforms using synthetic data, with clear metrics for latency, flow, and recovery. It’s designed for teams building or buying production-ready voice systems.</p>
<p><img src="/static/blog/2025/ai-pillarswirl-social-glow.jpg" alt="Voice AI Evaluation Framework"></p>
<h2>Why a Specification?</h2>
<p>Most teams evaluate voice AI with ad-hoc tests that miss key conversation behaviours. Across industries, I’ve seen the same gaps: How do you measure interruption handling? What’s an acceptable latency? How do you tell if a bot sounds natural?</p>
<p>This is not a think-piece; it is an initial specification. Whether you're choosing a platform like Hamming, Coval, Freestyle, or Arise, or building from scratch, this evolving framework defines comprehensive testing. Contributions are welcome: DM me on Twitter @vr000m.</p>
<p>Specifications force clarity. Each requirement serves a purpose. Each metric has a target. Use the entire framework or just what fits. It provides a shared language for evaluating voice AI quality.</p>
<h2>Voice AI Evaluation Specification v0.1</h2>
<p>Changelog:</p>
<p>v0.1 (30 July) – Initial release of evaluation criteria and test design for voice AI systems</p>
<h3>1. Purpose &amp; Scope</h3>
<p>This specification sets out how to evaluate voice AI systems in multi-turn conversations. It focuses on measuring performance, interaction quality, and control—ensuring systems behave well in real-world settings.</p>
<h4>The Challenge of Non-Determinism</h4>
<p>Voice AI systems combine multiple non-deterministic components: LLMs generate different responses to identical prompts, VAD triggers vary with minor audio variations, and STT confidence scores fluctuate. Because of this variability, a single test is meaningless. Repeated testing provides statistical confidence. Temperature settings alone can transform a concise assistant into a chatty companion. This is why continuous evaluation is not optional—it's essential.</p>
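<p>To make "statistical confidence" concrete, here is a minimal sketch that runs one scenario many times and reports the pass rate with a Wilson score interval. The choice of interval, the default of 50 trials, and the <code>run_once</code> callback are assumptions for illustration, not requirements of this specification:</p>

```python
import math
import random
from typing import Callable, Tuple


def pass_rate_with_interval(
    run_once: Callable[[], bool], trials: int = 50, z: float = 1.96
) -> Tuple[float, float, float]:
    """Run a scenario repeatedly; return (pass rate, lower bound, upper bound)."""
    passes = sum(1 for _ in range(trials) if run_once())
    p = passes / trials
    # Wilson score interval for a binomial proportion (z=1.96 is about 95%).
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return p, max(0.0, centre - margin), min(1.0, centre + margin)


# Stand-in for a real test call: a simulated bot that passes about 90% of the time.
rng = random.Random(7)
rate, low, high = pass_rate_with_interval(lambda: rng.random() < 0.9)
print(f"pass rate {rate:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

<p>Even a scenario that passes 50 times out of 50 only supports a lower bound of roughly 93%, which is why one green run says very little.</p>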
<h4>Why Synthetic Data Matters</h4>
<p>Using real customer conversations for testing creates three problems:</p>
<ol>
<li><strong>Privacy compliance</strong>: GDPR, CCPA, and HIPAA make using real conversations legally complex</li>
<li><strong>Reproducibility</strong>: You cannot debug intermittent issues without consistent test inputs</li>
<li><strong>Edge case coverage</strong>: Real data may not yet include all the edge cases that break systems</li>
</ol>
<p>Synthetic data enables regression testing. When the LLM changes or prompts are adjusted, you can measure the impact immediately.</p>
<h4>Setting Expectations</h4>
<p>This specification covers system-level evaluation, not model training or prompt optimization. It answers questions like:</p>
<ul>
<li>Does my complete voice AI system meet latency requirements?</li>
<li>How gracefully does it handle interruptions and errors?</li>
<li>Will it perform consistently across diverse user populations?</li>
</ul>
<p>It does not cover:</p>
<ul>
<li>How to train or fine-tune language models</li>
<li>Acoustic model optimization</li>
<li>Infrastructure scaling strategies</li>
</ul>
<h4>Integration in Your Development Lifecycle</h4>
<p>Successful teams integrate voice AI evaluation at three stages:</p>
<ol>
<li><strong>Pre-deployment testing</strong>: Run the full test suite before any production release</li>
<li><strong>A/B testing</strong>: Compare configurations and measure whether differences in outcomes are statistically significant</li>
<li><strong>Production monitoring</strong>: Sample real conversations against your baseline metrics</li>
</ol>
<p>Automation is key. Tests should run like unit tests—on commits or schedules. A dashboard showing overnight performance drift across your test suite is invaluable for catching model updates, configuration changes, or emergent behaviours before customers notice. This results in the following core principles:</p>
<ul>
<li>All evaluation must use synthetic data to ensure reproducibility</li>
<li>Tests must cover both technical performance and conversational dynamics</li>
<li>Evaluations should be automated and CI/CD compatible</li>
<li>Results must be comparable across different configurations</li>
<li>Routine testing is essential—LLM variability demands daily or per-change runs</li>
</ul>
<h3>2. Use Case Coverage</h3>
<p>Your evaluation framework should support a wide range of conversational patterns across industries. Testing requirements depend on the application or use-case.</p>
<h4>Transactional Flows</h4>
<p><strong>Example: Pizza ordering bot</strong></p>
<pre><code class="language-text">User: &quot;I want a large pepperoni pizza&quot;
Bot: &quot;One large pepperoni pizza. Would you like to add any drinks?&quot;
User: &quot;Actually make that two pizzas and add a coke&quot;
Bot: &quot;Updated to two large pepperoni pizzas and one Coke. Anything else?&quot;
</code></pre>
<p>Key evaluation areas:</p>
<ul>
<li>Order updates and corrections</li>
<li>Confirmation accuracy</li>
<li>Correct total calculation</li>
<li>Secure payment flow</li>
</ul>
<p>Test how the bot handles mid-order changes like &quot;wait, change that to...&quot; and maintains state.</p>
<h4>Information Gathering</h4>
<p><strong>Example: Healthcare intake bot</strong></p>
<pre><code class="language-text">Bot: &quot;What symptoms are you experiencing today?&quot;
User: &quot;I've had a headache for three days and some dizziness&quot;
Bot: &quot;I understand. On a scale of 1-10, how severe is your headache?&quot;
User: &quot;Maybe a 7... no actually it's more like an 8&quot;
</code></pre>
<p>Key evaluation areas:</p>
<ul>
<li>Accurate data capture</li>
<li>Effective clarification strategies</li>
<li>Privacy-compliant handling</li>
</ul>
<p>Use varied phrasing, from colloquial (&quot;my head hurts&quot;) to clinical (&quot;cephalgia and vertigo&quot;), to test understanding.</p>
<h4>Complex Navigation</h4>
<p><strong>Example: Banking IVR replacement</strong></p>
<pre><code class="language-text">User: &quot;I need to dispute a charge on my credit card&quot;
Bot: &quot;I can help with that. First, I'll need to verify your identity. Can you provide your account number?&quot;
User: &quot;I don't have it with me&quot;
Bot: &quot;No problem. I can look it up with your social security number and date of birth instead.&quot;
</code></pre>
<p>Key evaluation areas:</p>
<ul>
<li>Handling partial or alternate user information</li>
<li>Robust fallback navigation</li>
<li>Support for uncommon queries</li>
</ul>
<p>Design tests that follow multiple paths, including informal input and edge cases.</p>
<h4>Voicemail and IVR Tree Navigation</h4>
<p><strong>Example: Appointment reminder bot</strong></p>
<pre><code class="language-text">Bot: &quot;Hi, this is Dr. Smith's office calling to confirm your appointment tomorrow at 2 PM&quot;
[Voicemail beep detected]
Bot: [Switches to voicemail mode] &quot;This is a reminder about your appointment tomorrow at 2 PM. Please call us back at...&quot;
</code></pre>
<p>Key evaluation areas:</p>
<ul>
<li>Voicemail and IVR detection</li>
<li>Timely delivery of critical information</li>
</ul>
<p>Include test cases for common and custom voicemail greetings, business IVRs, and delayed beep scenarios.</p>
<h4>Escalation Paths</h4>
<p><strong>Example: Customer service bot with human handoff</strong></p>
<pre><code class="language-text">User: &quot;This is ridiculous, I've been trying to resolve this for a few minutes!&quot;
Bot: &quot;I understand your frustration. Let me connect you with a representative who can help immediately.&quot;
[Bot summarizes context for human agent]
</code></pre>
<p>Key evaluation areas:</p>
<ul>
<li>Accurate detection of frustration or complexity</li>
<li>Clear escalation logic</li>
<li>Quality of context summarisation</li>
</ul>
<p>Test explicit handoff requests, tone-triggered escalations, and repeated failure cases.</p>
<h4>Contextual Conversations</h4>
<p><strong>Example: Insurance claim bot</strong></p>
<pre><code class="language-text">User: &quot;I need to file a claim for my car accident&quot;
Bot: &quot;I'll help you with that. When did the accident occur?&quot;
User: &quot;Last Tuesday&quot;
Bot: &quot;That would be October 15th. Where did it happen?&quot;
User: &quot;The same intersection where I had that other claim last year&quot;
Bot: &quot;I see you had a claim at Main and 5th Street. Is that the location?&quot;
</code></pre>
<p>Key evaluation areas:</p>
<ul>
<li>Reference resolution (time, place, previous interactions)</li>
<li>Long-term memory or cross-session recall</li>
<li>Clarification without user frustration</li>
</ul>
<p>These scenarios test whether the bot can recall relevant information and resolve references naturally.</p>
<p>Your framework must support domain-specific priorities—e.g., 100ms latency may be critical for fast food but irrelevant for insurance claims. Design flexible scoring and thresholds tailored to each use case.</p>
<h3>3. Data Requirements</h3>
<h4>3.1 Synthetic Test Data Generation</h4>
<p>Effective synthetic data must cover the full acoustic and conversational range your system will face in production.</p>
<p><strong>Voice Synthesis Setup</strong></p>
<p>Build a baseline voice library with:</p>
<ul>
<li><strong>Demographics</strong>: Diverse age groups and genders</li>
<li><strong>Regional accents</strong>: US, UK, Irish, Australian, Indian English, etc.</li>
<li><strong>Speaking patterns</strong>: Fast, slow, mumbling, clear, and casual speech</li>
<li><strong>Speech characteristics</strong>: Filler words, nervousness, varying articulation</li>
</ul>
<p>Most TTS providers support voice and rate controls; simulate other traits via prompt engineering or audio processing.</p>
<p><strong>Environmental Conditions</strong></p>
<p>Add realistic audio degradation to clean speech:</p>
<ul>
<li><strong>Background noise</strong>: Office, traffic, café, construction</li>
<li><strong>Network conditions</strong>: Packet loss (1–5%), jitter (10–100ms), compression artifacts</li>
<li><strong>Device simulation</strong>: Mobile, Bluetooth headset, speakerphone echo</li>
<li><strong>Call quality</strong>: PSTN noise, VoIP compression, cellular signal fade</li>
</ul>
<p><strong>Implementation Pipeline</strong></p>
<p>Use prompts to systematically generate diverse failure cases. Automate and version-control your data generation. Requirements:</p>
<ul>
<li>Generate configurable numbers of test scenarios (typically 100–1000 per run)</li>
<li>Apply voice diversity across the test set (target 80% profile coverage)</li>
<li>Include ambiguous intents, context confusion, and varied emotional states</li>
<li>Add environmental conditions systematically (noise, network, device)</li>
<li>Output audio in standard formats (16–24kHz WAV)</li>
<li>Store all relevant metadata and logs with audio for accurate result correlation</li>
</ul>
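<p>The requirements above can be sketched as a small, seeded manifest generator. This is a minimal sketch rather than a full pipeline: the profile, noise, and device labels are hypothetical placeholders, and it only produces the metadata that would sit next to each audio file, not the audio itself.</p>
<pre><code class="language-python">import json
import random
from dataclasses import asdict, dataclass

# Hypothetical labels; replace with your own voice library and conditions.
VOICES = ['us_female_fast', 'uk_male_slow', 'indian_english_casual', 'irish_mumbling']
NOISES = ['office', 'traffic', 'cafe', 'construction']
DEVICES = ['mobile', 'bluetooth_headset', 'speakerphone']


@dataclass
class Scenario:
    scenario_id: str
    voice: str
    noise: str
    device: str
    packet_loss_pct: float
    jitter_ms: int
    sample_rate_hz: int = 16000


def build_manifest(count, seed=0):
    '''Build `count` scenario records; the same seed gives the same manifest.'''
    rng = random.Random(seed)
    return [
        Scenario(
            scenario_id=f'scn-{index:04d}',
            voice=VOICES[index % len(VOICES)],  # round-robin for even voice coverage
            noise=rng.choice(NOISES),
            device=rng.choice(DEVICES),
            packet_loss_pct=round(rng.uniform(1.0, 5.0), 1),
            jitter_ms=rng.randint(10, 100),
        )
        for index in range(count)
    ]


def voice_coverage(manifest):
    '''Fraction of voice profiles that appear at least once.'''
    return len({scenario.voice for scenario in manifest}) / len(VOICES)


manifest = build_manifest(count=100, seed=42)
print(json.dumps(asdict(manifest[0]), indent=2))
print(f'voice coverage: {voice_coverage(manifest):.0%}')
</code></pre>
<p>Committing the seed and the label lists alongside the generated manifest is one way to keep runs reproducible and version-controlled.</p>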
<h3>4. Functional Requirements</h3>
<p>With synthetic test data in place, define what to measure in conversation. These requirements turn scenarios into measurable conversation dynamics.</p>
<h4>4.1 Conversation Dynamics</h4>
<p>Prioritize natural conversation flow, not just transcription accuracy, under real-world conditions. Focus evaluation on:</p>
<p><strong>Turn-taking Analysis</strong></p>
<p>Every conversation has implicit timing. Key metrics:</p>
<ul>
<li><strong>Response timing</strong>: User speech end to bot speech start</li>
<li><strong>Interruption handling</strong>: Speed of bot response to interruptions</li>
<li><strong>Context preservation</strong>: Retains context after interruptions</li>
<li><strong>Recovery</strong>: Smooth handling of misunderstandings</li>
<li><strong>Natural flow</strong>: Pause duration, prosody, rhythm</li>
</ul>
<p>Thresholds vary by use case—what’s responsive for support may feel rushed for therapy.</p>
<p>Test timing with scenarios like:</p>
<ul>
<li><strong>Rapid-fire questions</strong>: Multiple queries in sequence</li>
<li><strong>Hesitant speakers</strong>: Disfluent or uncertain speech</li>
<li><strong>Overlapping speech</strong>: User talks before bot finishes</li>
<li><strong>Fast transitions</strong>: User starts immediately after bot</li>
<li><strong>Early barge-ins</strong>: Interruptions in first few bot words</li>
<li><strong>Simultaneous speech</strong>: Both speak at once (can reveal latency)</li>
</ul>
<p><strong>Barge-in Handling</strong></p>
<p>Users expect instant recognition when interrupting. Tests should cover:</p>
<ol>
<li><strong>Interruption detection accuracy</strong>: Avoid false positives</li>
<li><strong>Speech cessation speed</strong>: TTS stops promptly</li>
<li><strong>Context recovery</strong>: Bot understands what was interrupted</li>
<li><strong>Resume capability</strong>: Continues appropriately if needed</li>
</ol>
<p><strong>Backchannel Processing</strong></p>
<p>Backchannels (“mm-hmm”, “right”, “okay”) keep conversations natural. Test:</p>
<ul>
<li><strong>Encouragement</strong>: “uh-huh”, “go on”, “I see”</li>
<li><strong>Agreement</strong>: “yes”, “right”, “exactly”</li>
<li><strong>Confusion</strong>: “huh?”, “what?”, “sorry?”</li>
<li><strong>Impatience</strong>: “yeah yeah”, “okay but...”</li>
</ul>
<p>Bots should not treat every backchannel as a full turn but should acknowledge engagement.</p>
<p><strong>Silence Management</strong></p>
<p>Silence handling depends on context:</p>
<table>
<thead>
<tr>
<th>Silence Duration</th>
<th>Context</th>
<th>Expected Response</th>
</tr>
</thead>
<tbody>
<tr>
<td>2–3 seconds</td>
<td>After question</td>
<td>&quot;Take your time&quot;</td>
</tr>
<tr>
<td>5+ seconds</td>
<td>Mid-explanation</td>
<td>&quot;Should I continue?&quot;</td>
</tr>
<tr>
<td>8+ seconds</td>
<td>Any context</td>
<td>&quot;Are you still there?&quot;</td>
</tr>
<tr>
<td>15+ seconds</td>
<td>Any context</td>
<td>Timeout handling</td>
</tr>
</tbody>
</table>
<p>Adjust thresholds by intent—longer pauses are fine in form-filling, but not in rapid order flows.</p>
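<p>One way to encode the table is a per-context list of thresholds. The sketch below is illustrative only: the context keys and the <code>scale</code> parameter are assumptions, added to show how thresholds could be stretched for slower intents such as form-filling.</p>
<pre><code class="language-python">from bisect import bisect_right

# Thresholds in seconds, taken from the table; context keys are illustrative.
SILENCE_POLICY = {
    'after_question': [(2, 'Take your time'), (8, 'Are you still there?'), (15, 'timeout')],
    'mid_explanation': [(5, 'Should I continue?'), (8, 'Are you still there?'), (15, 'timeout')],
}
DEFAULT_POLICY = [(8, 'Are you still there?'), (15, 'timeout')]


def silence_action(context, silence_seconds, scale=1.0):
    '''Return the action for the longest threshold already reached, or None.

    `scale` stretches every threshold, e.g. 1.5 for slow form-filling flows.
    '''
    policy = SILENCE_POLICY.get(context, DEFAULT_POLICY)
    thresholds = [seconds * scale for seconds, _ in policy]
    reached = bisect_right(thresholds, silence_seconds)
    return policy[reached - 1][1] if reached else None


print(silence_action('after_question', 2.5))             # Take your time
print(silence_action('mid_explanation', 6))              # Should I continue?
print(silence_action('after_question', 2.5, scale=1.5))  # None
print(silence_action('greeting', 20))                    # timeout
</code></pre>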
<h4>4.2 Latency &amp; Responsiveness</h4>
<p>Every stage in the voice pipeline adds delay. Measure end-to-end performance, not just individual components.</p>
<p>Key latency components:</p>
<ul>
<li><strong>VAD triggering</strong>: Speech start/stop to detection</li>
<li><strong>STT processing</strong>: Audio to transcript</li>
<li><strong>LLM inference</strong>: Transcript to response</li>
<li><strong>TTS synthesis</strong>: Response to audio</li>
<li><strong>Audio streaming</strong>: Delivering audio to user</li>
</ul>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Type</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td>vad_start_trigger_duration</td>
<td>Duration</td>
<td>Speech start to VAD detection</td>
</tr>
<tr>
<td>vad_stop_trigger_duration</td>
<td>Duration</td>
<td>Speech stop to VAD detection</td>
</tr>
<tr>
<td>stt_processing_duration</td>
<td>Duration</td>
<td>Speech stop (or VAD stop) to transcript complete</td>
</tr>
<tr>
<td>llm_first_token_latency</td>
<td>Duration</td>
<td>Transcript complete to first token</td>
</tr>
<tr>
<td>llm_complete_response_latency</td>
<td>Duration</td>
<td>Transcript complete to response complete</td>
</tr>
<tr>
<td>tts_synthesis_duration</td>
<td>Duration</td>
<td>Response complete to audio generation complete</td>
</tr>
<tr>
<td>audio_streaming_start_latency</td>
<td>Duration</td>
<td>Speech synthesis start to first audio packet</td>
</tr>
<tr>
<td>end_to_end_total_duration</td>
<td>Duration</td>
<td>User speech start to bot audio start</td>
</tr>
</tbody>
</table>
<pre><code class="language-mermaid">sequenceDiagram
    autonumber
    participant U as User
    participant Mic as Capture
    participant VAD as VAD
    participant STT as STT
    participant LLM as LLM
    participant TTS as TTS
    participant AO as Audio Output

    U-&gt;&gt;Mic: user_speech_start
    Note over U,Mic: t0 = user_speech_start

    Mic-&gt;&gt;VAD: audio frames
    VAD--&gt;&gt;Mic: vad_detection (start)
    Note over U,VAD: vad_start_trigger_duration

    U--&gt;&gt;Mic: user_speech_stop
    Mic-&gt;&gt;VAD: trailing audio
    VAD--&gt;&gt;Mic: vad_detection (stop)
    Note over U,VAD: vad_stop_trigger_duration

    Mic-&gt;&gt;STT: audio segment
    STT--&gt;&gt;STT: decode + finalize
    STT--&gt;&gt;LLM: transcript_complete (final)
    Note over VAD,STT: stt_processing_duration

    LLM--&gt;&gt;LLM: generate tokens
    LLM--&gt;&gt;TTS: first_token
    Note over STT,LLM: llm_first_token_latency

    LLM--&gt;&gt;TTS: response_complete
    Note over STT,LLM: llm_complete_response_latency

    %% TTS starts synthesizing as soon as it can (may be at first_token)
    LLM-&gt;&gt;TTS: synthesis_start
    TTS--&gt;&gt;TTS: synthesize audio
    TTS--&gt;&gt;AO: audio_generation_complete
    Note over LLM,TTS: tts_synthesis_duration

    TTS--&gt;&gt;AO: first_audio_packet_output
    Note over TTS,AO: audio_streaming_start_latency

    AO--&gt;&gt;U: bot_audio_start
    Note over U,AO: end_to_end_total_duration
</code></pre>
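<p>The metric definitions reduce to differences between event timestamps. A minimal sketch, assuming each call is logged as a dict of event name to timestamp in seconds; the event names follow the diagram, the values are made up, and <code>stt_processing_duration</code> is measured from VAD stop here (the table allows either start point).</p>
<pre><code class="language-python"># Made-up timestamps (seconds) for a single turn.
events = {
    'user_speech_start': 0.00,
    'vad_start': 0.06,
    'user_speech_stop': 2.40,
    'vad_stop': 2.70,
    'transcript_complete': 2.95,
    'first_token': 3.25,
    'synthesis_start': 3.30,
    'first_audio_packet': 3.55,
    'bot_audio_start': 3.60,
    'response_complete': 3.90,
    'audio_generation_complete': 4.60,
}

# metric name: (start event, end event), mirroring the definitions table.
METRICS = {
    'vad_start_trigger_duration': ('user_speech_start', 'vad_start'),
    'vad_stop_trigger_duration': ('user_speech_stop', 'vad_stop'),
    'stt_processing_duration': ('vad_stop', 'transcript_complete'),
    'llm_first_token_latency': ('transcript_complete', 'first_token'),
    'llm_complete_response_latency': ('transcript_complete', 'response_complete'),
    'tts_synthesis_duration': ('response_complete', 'audio_generation_complete'),
    'audio_streaming_start_latency': ('synthesis_start', 'first_audio_packet'),
    'end_to_end_total_duration': ('user_speech_start', 'bot_audio_start'),
}


def compute_metrics(timestamps):
    '''Return each metric in whole milliseconds.'''
    return {
        name: round((timestamps[end] - timestamps[start]) * 1000)
        for name, (start, end) in METRICS.items()
    }


metrics = compute_metrics(events)
for name, value_ms in metrics.items():
    print(f'{name}: {value_ms} ms')
</code></pre>
<p>Note that <code>end_to_end_total_duration</code> includes the time the user spends talking, so it grows with utterance length; comparing turns usually also needs a measure that starts at <code>user_speech_stop</code>.</p>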
<p>Test latency under:</p>
<ul>
<li><strong>Peak traffic</strong>: High concurrent usage</li>
<li><strong>Network degradation</strong>: 0–5% packet loss</li>
<li><strong>Model switching</strong>: Different STT/LLM/TTS backends</li>
<li><strong>Longer context</strong>: Increased conversation history</li>
<li><strong>Ambiguous input</strong>: Disambiguation scenarios</li>
</ul>
<p><strong>Progressive Retry Mechanisms</strong></p>
<p>Test escalation patterns that avoid user frustration:</p>
<ol>
<li>First failure: Gentle clarification (“Could you repeat that?”)</li>
<li>Second: More specific help</li>
<li>Third: Offer alternatives</li>
<li>Escalation: Human handoff or alternate channel</li>
</ol>
<p><strong>A/B Testing Infrastructure</strong></p>
<p>Automate scenario variation:</p>
<ul>
<li>Generate multiple test variations per base scenario</li>
<li>Apply different voice profiles and environmental conditions</li>
<li>Vary complexity (simple vs multi-step)</li>
<li>Ensure enough cases for statistical significance</li>
</ul>
<h3>5. Evaluation Metrics</h3>
<p>With test data and functional requirements defined, you need clear, quantifiable metrics to measure system performance. This section outlines essential metrics, quality assessment, and production monitoring.</p>
<h4>5.1 Core Performance Metrics</h4>
<p>Every voice AI system should track these key metrics:</p>
<p><strong>Time to First Audio (TTFA)</strong></p>
<p>TTFA is the end-to-end latency from when a user stops speaking to when the bot's first audio response begins. Human conversation gaps are typically 200–300ms, but for voice AI:</p>
<ul>
<li><strong>Cascade systems (STT→LLM→TTS):</strong> 800–1200ms is excellent, up to 1500ms is acceptable</li>
<li><strong>Speech-to-speech models:</strong> 600–900ms with optimized hosting</li>
<li><strong>Distributed hosting:</strong> Add 100–200ms for network overhead</li>
</ul>
<p>Under 1 second feels responsive; 1–1.5 seconds is tolerable; over 2 seconds risks user frustration and interruption. Architecture choice impacts TTFA: cascades offer more control, speech-to-speech is faster but less transparent, and co-located hosting reduces latency at higher infra cost.</p>
<p><strong>Voice Activity Detection (VAD) Accuracy</strong></p>
<p>VAD errors cause:</p>
<ul>
<li><strong>False positives</strong> (&gt;5%): Bot interrupts users</li>
<li><strong>False negatives</strong> (&gt;3%): Bot misses input</li>
</ul>
<p>Aim for 95–97% accuracy in clean audio, 85–90% in noisy conditions. Below 90%, user experience suffers.</p>
<p><strong>Barge-in Response Time</strong></p>
<p>When users interrupt, bots must respond quickly. Target &lt;200ms for barge-in handling to reduce abandonment, especially in critical scenarios like healthcare.</p>
<p><strong>Task Completion Rate</strong></p>
<p>Measures how often users achieve their goal:</p>
<ul>
<li><strong>Customer service:</strong> 85–90%</li>
<li><strong>Sales qualification:</strong> 70–75%</li>
<li><strong>Appointment booking:</strong> 90–95%</li>
<li><strong>Technical troubleshooting:</strong> 60–70%</li>
</ul>
<p>Track by intent. Simpler flows (e.g. pizza order) should see higher rates than complex, multi-step tasks.</p>
<p><strong>Single-Turn vs. Multi-Turn Performance</strong></p>
<p>Evaluate both:</p>
<ul>
<li><strong>Single-turn:</strong> Intent recognition, response completeness, consistent latency</li>
<li><strong>Multi-turn:</strong> Context retention, efficient turns-to-completion (3–5 is good), logical progression, recovery from confusion</li>
</ul>
<p>Track separately; some bots excel in one area but not the other. If average turns exceed 15 for any intent, users will likely disengage.</p>
<h4>5.2 Quality Assessment</h4>
<p>Raw metrics show what happened; quality assessment shows if the experience was good.</p>
<p><strong>LLM-Based Quality Scoring</strong></p>
<p>Use LLMs to score conversation transcripts on:</p>
<ul>
<li><strong>Understanding:</strong> Did the bot interpret intent correctly?</li>
<li><strong>Helpfulness:</strong> Was the response useful?</li>
<li><strong>Naturalness:</strong> Did the exchange flow well?</li>
<li><strong>Efficiency:</strong> Was the conversation concise?</li>
</ul>
<p>Prompt example:</p>
<pre><code class="language-text">Evaluate this conversation on a 1–5 scale for: UNDERSTANDING, HELPFULNESS, NATURALNESS, EFFICIENCY. For each, give a score and a brief justification.

UNDERSTANDING: Did the bot correctly interpret user intent?
- Consider: Misheard words, wrong intent classification, missed context

HELPFULNESS: Did the bot provide useful responses?
- Consider: Complete answers, relevant information, problem resolution

NATURALNESS: Did the conversation flow naturally?
- Consider: Appropriate responses, good timing, personality consistency

EFFICIENCY: Was the conversation appropriately concise?
- Consider: Unnecessary questions, repetition, verbose responses
</code></pre>
<p>Track distributions, not just averages. A bot that consistently scores 3.5 beats one that swings wildly between 5 and 2.</p>
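<p>A quick illustration of why the average alone hides this, using made-up scores for two hypothetical bots:</p>
<pre><code class="language-python">from statistics import mean, pstdev

# Illustrative 1-5 scores for one dimension across four conversations.
steady_bot = [3.5, 3.5, 3.5, 3.5]
erratic_bot = [5, 2, 5, 2]

for name, scores in [('steady', steady_bot), ('erratic', erratic_bot)]:
    print(f'{name}: mean={mean(scores):.2f} stdev={pstdev(scores):.2f} min={min(scores)}')
# steady: mean=3.50 stdev=0.00 min=3.5
# erratic: mean=3.50 stdev=1.50 min=2
</code></pre>
<p>Both bots average 3.5, but only the spread and the minimum show that half of the second bot's conversations went badly.</p>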
<p><strong>Human Review</strong></p>
<p>Supplement LLM scoring with targeted human review:</p>
<ul>
<li>High-value or sensitive conversations</li>
<li>Failed tasks</li>
<li>Edge or emotional cases</li>
</ul>
<p>Review 1–2% of volume, focusing on outliers.</p>
<p><strong>Sentiment Tracking</strong></p>
<p>Monitor sentiment shifts during conversations. A successful flow moves from neutral, through possible frustration, to positive resolution. Declining sentiment, even with task completion, signals issues.</p>
<h4>5.3 Production Monitoring</h4>
<p>Metrics and alerting in production are critical.</p>
<p><strong>Dashboards</strong></p>
<p>Track in real time (1-minute granularity):</p>
<ul>
<li>P50/P90/P99 latency</li>
<li>Active conversations</li>
<li>Error rates (STT, TTS, LLM)</li>
<li>Escalation (handoff) triggers</li>
</ul>
<p>Set alerts for:</p>
<ul>
<li>P90 latency &gt;1.5× baseline</li>
<li>Error rate &gt;2% in 5 minutes</li>
<li>Escalation &gt;20% above baseline</li>
</ul>
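<p>A rough sketch of how the percentiles and alert rules could be evaluated for one monitoring window. The dict keys are hypothetical, and the escalation rule is read here as "20% above baseline in relative terms", which is an assumption about the intended meaning.</p>
<pre><code class="language-python">from statistics import quantiles


def latency_percentiles(samples_ms):
    '''Return P50/P90/P99 using linear interpolation between data points.'''
    cuts = quantiles(samples_ms, n=100, method='inclusive')
    return {'p50': cuts[49], 'p90': cuts[89], 'p99': cuts[98]}


def check_alerts(window, baseline):
    '''Compare one monitoring window against baseline values.

    Hypothetical keys: p90_ms, error_rate (0-1), escalation_rate (0-1).
    '''
    alerts = []
    if window['p90_ms'] &gt; 1.5 * baseline['p90_ms']:
        alerts.append('p90_latency')
    if window['error_rate'] &gt; 0.02:
        alerts.append('error_rate')
    if window['escalation_rate'] &gt; 1.2 * baseline['escalation_rate']:
        alerts.append('escalation')
    return alerts


baseline = {'p90_ms': 1000, 'escalation_rate': 0.10}
window = {'p90_ms': 1600, 'error_rate': 0.01, 'escalation_rate': 0.11}
print(latency_percentiles(list(range(1, 101))))
print(check_alerts(window, baseline))  # ['p90_latency']
</code></pre>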
<p>Borrow from contact center KPIs:</p>
<ul>
<li>Containment rate: Resolved without human</li>
<li>Average handle time</li>
<li>First call resolution</li>
<li>Customer effort (survey)</li>
</ul>
<p><strong>Model Drift Detection</strong></p>
<p>Performance can degrade due to language shifts, seasonal changes, or new user expectations. Flag &gt;5% drops from 30-day baselines. Retrain quarterly, but act on sudden drops.</p>
<p><strong>Summary</strong></p>
<p>Start with core metrics, add quality assessment as you grow, and build monitoring to catch problems before users do.</p>
<!--
## Developer Checklist

> Use this checklist to validate platform features against the functional, metric, and scenario requirements described in Sections 3–5.

When evaluating voice AI platforms or building your evaluation framework, ensure you can answer "yes" to these questions:

**Conversation Testing**

- [ ] Can you test multi-turn conversations with context preservation?
- [ ] Do you measure end-to-end latency, not just component latency?
- [ ] Can you simulate interruptions and measure barge-in handling?
- [ ] Do you test with diverse accents and speaking styles?

**Scenario Coverage**

- [ ] Can you test voicemail detection and handling?
- [ ] Do you cover error cases and recovery flows?
- [ ] Can you simulate poor network conditions?
- [ ] Are edge cases like background noise included?

**Metrics & Measurement**

- [ ] Do you track both technical metrics (latency) and quality metrics (task completion)?
- [ ] Can you evaluate conversation naturalness beyond just accuracy?
- [ ] Is A/B testing different configurations supported?
- [ ] Do you measure consistency, not just averages?

**Operations**

- [ ] Can tests run automatically in CI/CD pipelines?
- [ ] Are results reproducible and deterministic?
- [ ] Can you run tests in parallel for efficiency?
- [ ] Is reporting suitable for both developers and stakeholders?

**Flexibility**

- [ ] Can you easily add new test scenarios?
- [ ] Does the framework handle different use cases (sales, support, etc.)?
- [ ] Can you adjust evaluation criteria per use case?
- [ ] Is synthetic test data generation supported?
-->
<h2>Moving Forward</h2>
<p>No single evaluation framework fits every use case. This specification offers a flexible foundation—whether you’re evaluating HIPAA-sensitive healthcare bots or emotionally intelligent crisis assistants. Systematic testing beats ad-hoc guesswork.</p>
<p>As you evaluate platforms like Hamming, Arize, or Coval, use this specification to ask the right questions.</p>
<p>Ask these questions of any vendor or internal system:</p>
<p>– Can it test what matters for your use case?<br>
– Does it expose the metrics you need?<br>
– Is it CI/CD compatible?</p>
<p>Once you've established reliable evaluation for your current system, you're ready to explore adaptive architectures—where evaluation complexity rises, but so does performance potential.</p>
<h2>Beyond This Specification: Adaptive Architectures (added: 15th Aug)</h2>
<p>This specification assumes a relatively static architecture where the same models handle all conversation turns. However, emerging patterns in voice AI suggest more sophisticated approaches that would require rethinking these evaluation criteria.</p>
<p><strong>Adaptive Model Selection</strong> represents the next evolution in voice AI architecture. Instead of using the same model throughout a conversation, systems dynamically route requests based on conversation context:</p>
<ul>
<li><strong>Light turns</strong> (greetings, confirmations): Route to fast, smaller models achieving &lt;800ms latency</li>
<li><strong>Complex reasoning</strong>: Switch to larger models, accepting 1500-2000ms for accuracy</li>
<li><strong>Critical moments</strong> (medical, financial): Use best available models regardless of latency</li>
</ul>
<p>This approach could reduce average latency by 30-40% while maintaining accuracy where it matters. However, evaluating such systems requires new metrics:</p>
<ul>
<li><strong>Routing accuracy</strong>: Did the system select the appropriate model for each turn?</li>
<li><strong>Transition smoothness</strong>: Do model switches create noticeable personality shifts?</li>
<li><strong>Cost optimisation</strong>: What percentage of turns use expensive models?</li>
<li><strong>Degradation patterns</strong>: How does the system perform when preferred models are unavailable?</li>
</ul>
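<p>To make the routing idea and the cost metric concrete, here is a minimal sketch. The tier names and budgets are hypothetical placeholders that mirror the bullets above, and how a turn gets classified as light, complex, or critical is deliberately left out.</p>
<pre><code class="language-python">from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    model: str                         # hypothetical model tier name
    latency_budget_ms: Optional[int]   # None means no latency cap
    expensive: bool


ROUTES = {
    'light': Route('small-fast', 800, expensive=False),
    'complex': Route('large', 2000, expensive=True),
    'critical': Route('best-available', None, expensive=True),
}


def route_turn(turn_kind):
    '''Pick a route for a classified turn; unknown kinds fall back to the large model.'''
    return ROUTES.get(turn_kind, ROUTES['complex'])


def expensive_share(turn_kinds):
    '''Cost metric: fraction of turns served by an expensive model.'''
    usage = Counter(route_turn(kind).expensive for kind in turn_kinds)
    return usage[True] / len(turn_kinds)


conversation = ['light', 'light', 'complex', 'light', 'critical']
print([route_turn(kind).model for kind in conversation])
print(f'expensive turns: {expensive_share(conversation):.0%}')  # expensive turns: 40%
</code></pre>
<p>Logging the chosen route per turn next to a human or LLM label of what the turn needed would give the routing-accuracy metric as well.</p>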
<p>If you're considering adaptive architectures, treat this specification as your baseline. Establish solid evaluation practices for single-model systems first, then layer on the additional complexity of multi-model orchestration. The fundamentals—measuring latency, tracking completion rates, assessing naturalness—remain essential regardless of architectural sophistication.</p>
<hr>
<h2>Glossary</h2>
<p><strong>VAD (Voice Activity Detection)</strong>: A signal processing technique used to detect when a speaker starts and stops talking. It impacts when the system listens, responds, or cuts off speech.</p>
<p><strong>STT (Speech-to-Text)</strong>: The transcription engine that converts spoken audio into text. Accuracy depends on model quality, domain vocabulary, and audio conditions.</p>
<p><strong>TTS (Text-to-Speech)</strong>: The synthesis engine that converts generated text responses into spoken audio. Evaluated by clarity, prosody, latency, and adaptability.</p>
<p><strong>LLM (Large Language Model)</strong>: The generative model used to produce responses based on text input. LLM latency and variability affect conversation flow and tone.</p>
<p><strong>TTFA (Time to First Audio)</strong>: The time from the end of user speech to the beginning of the bot's audio response. A key metric for conversational responsiveness.</p>
<p><strong>Barge-in</strong>: When a user interrupts the bot mid-sentence. A good system detects this quickly, stops speaking, and adjusts its response contextually.</p>
<p><strong>Containment Rate</strong>: Percentage of conversations resolved without human escalation. High containment indicates successful task completion by the bot.</p>
<p><strong>Escalation</strong>: The process of handing a conversation off to a human agent or switching to a fallback system when the bot cannot proceed.</p>
<p><strong>End-to-End Latency</strong>: Total time from the beginning of user speech to the start of bot speech, including VAD, STT, LLM, TTS, and streaming delays.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[voice ai]]></category>
      <category><![CDATA[evals]]></category>
      <category><![CDATA[testing]]></category>
    </item>
    <item>
      <title><![CDATA[Updating robots.txt for AI/LLMs]]></title>
      <link>https://varunsingh.net/til/scripts/updating-robots-txt-for-ai-llms</link>
      <guid isPermaLink="true">https://varunsingh.net/til/scripts/updating-robots-txt-for-ai-llms</guid>
      <pubDate>Thu, 03 Jul 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[We only recently got browser use and MCPs. Now, with the recent kerfuffle around AI agents/LLMs being able to access content for training, Cloudflare and other providers are going to by default block ]]></description>
      <content:encoded><![CDATA[<p>We only recently got browser use and MCPs. Now, with the recent kerfuffle around AI agents and LLMs accessing content for training, Cloudflare and other providers are going to block these agents by default unless <code>robots.txt</code> says otherwise. In my opinion, the visibility of your content in ChatGPT, Claude, and AI search engines (AEO) improves when their user-agents aren’t blocked.</p>
<p>Note: the IETF discussed this in a workshop in 2024 and recently published a <a href="https://datatracker.ietf.org/doc/html/draft-iab-ai-control-report">summary</a>, which is worth reading.</p>
<p>Anyway, getting back to <code>robots.txt</code>: I recently noticed that several AI/LLM crawlers were blocked or partially restricted on this site. That meant we needed to explicitly allow key assistants and AI search crawlers such as <code>GPTBot</code>, <code>ChatGPT-User</code>, <code>Claude-Web</code>, and <code>ClaudeBot</code>. I also added a cleaner fallback policy, replacing the catch-all <code>Disallow: /</code> under <code>User-agent: *</code> with a simpler allow-by-default rule plus targeted <code>Disallow</code> entries for specific paths.</p>
<p>Based on Claude's research, we added <code>crawl-delay</code>, understanding that not all crawlers honour it (Google ignores it; Bing may honour it).</p>
<p>Takeaways after looking at the HTTP logs:</p>
<ul>
<li>Prefer explicit “User-agent + Allow/Disallow” per bot over relying on complex catch‑all rules.</li>
<li>Avoid regex-like anchors (like $); many crawlers don’t support them.</li>
<li>Keep a clean fallback that aligns with your intent: allow most, block only what you must. See example below:</li>
</ul>
<pre><code class="language-text"># Search engines (Allow)
User-agent: Googlebot
Allow: /
Crawl-delay: 1

User-agent: DuckDuckBot
Allow: /
Crawl-delay: 1

# AI assistants and AI search (Allow)
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: cohere-ai
Allow: /

# Developer tools and generic scrapers (optional: keep blocked if you prefer)
User-agent: Python-urllib
Allow: /

User-agent: Python-requests
Allow: /

User-agent: wget
Allow: /

User-agent: curl
Allow: /

# Fallback: allow homepage, robots, blog, posts; limit specific sections
User-agent: *
Allow: /
Allow: /robots.txt
Disallow: /api/
Disallow: /static/
Crawl-delay: 10
</code></pre>
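<p>A quick way to sanity-check a policy like this before deploying it is Python's standard-library <code>urllib.robotparser</code>. The sketch below uses a trimmed policy with placeholder names (<code>ExampleBot</code>, <code>example.com</code>). One caveat: Python's parser applies the first matching rule in file order, whereas RFC 9309 crawlers use the most specific (longest) match, so the <code>Disallow</code> lines are listed before the catch-all <code>Allow</code> here to get the same answer under both interpretations.</p>
<pre><code class="language-python">from urllib.robotparser import RobotFileParser

# Trimmed policy for illustration; Disallow lines precede the catch-all Allow
# because Python's parser stops at the first rule that matches.
POLICY = '''
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /api/
Disallow: /static/
Allow: /
Crawl-delay: 10
'''

parser = RobotFileParser()
parser.parse(POLICY.splitlines())

# example.com and ExampleBot are placeholders.
print(parser.can_fetch('GPTBot', 'https://example.com/post/hello'))     # True
print(parser.can_fetch('ExampleBot', 'https://example.com/api/items'))  # False
print(parser.crawl_delay('ExampleBot'))                                 # 10
</code></pre>
<p>Real crawlers may interpret the file differently from this parser, so treat a passing check as a smoke test and confirm against the HTTP logs.</p>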
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[scripts]]></category>
    </item>
    <item>
      <title><![CDATA[Claude's Plan Mode is Brilliant]]></title>
      <link>https://varunsingh.net/til/claude/claude-plan-mode-useful</link>
      <guid isPermaLink="true">https://varunsingh.net/til/claude/claude-plan-mode-useful</guid>
      <pubDate>Wed, 02 Jul 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[One frustrating issue with Claude, and I am on the Max plans, is that it is over-eager to do the task. For the past few months, I have been appending _"think deeply"_ whenever I want it to think befor]]></description>
      <content:encoded><![CDATA[<p>One frustrating issue with Claude, and I am on the Max plan, is that it is over-eager to start on the task. For the past few months, I have been appending <em>&quot;think deeply&quot;</em> whenever I want it to think before leaping into the problem. This is in addition to <code>CLAUDE.md</code>, which has specific instructions (see excerpts below):</p>
<pre><code class="language-text">in CLAUDE.md:
- “Plan first: Create development plan in /dev_plans/ …”
- “Planning new features? Create a development plan…” 

in claude/development-process.md
- “Pre-Implementation: … Review plan for completeness and feasibility” 
</code></pre>
<p>Last week, I chanced upon <strong>Plan Mode</strong> (invoked by pressing <code>Shift+Tab</code> twice). I am not sure when this was released, or if it has been around for a while, but it is super helpful. I believe Plan Mode separates research and planning from code execution, and it is partly read-only as it can create and maintain the plan or to-do list but cannot write code.</p>
<p>💜 this new feature. As part of the new workflow, all plans go into <code>dev_plans/$yyyy-mm-plan-name.md</code>! Claude used this to build <code>abstract-image-gen.js</code>.</p>
<p><strong>Updated (2025-11-08)</strong>: Converted the repetitive instruction to create a development plan into a <a href="/til/coding/structured-development-plans-with-coding-agents"><code>~/.claude/dev-plan/SKILL.md</code></a>.<br>
<strong>Updated (2026-01-27)</strong>: Using the <a href="/til/coding/fan-out-skill-multiple-agents-and-ralph-loops"><code>fan-out</code> skill</a> to spawn subagents for each independent task. Also in Jan 2026, Claude added an explicit task list, which is stored in <code>~/.claude/tasks</code>.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[claude]]></category>
    </item>
    <item>
      <title><![CDATA[Context Engineering Across AI Code Generators]]></title>
      <link>https://varunsingh.net/post/context-engineering-across-ai-code-generators</link>
      <guid isPermaLink="true">https://varunsingh.net/post/context-engineering-across-ai-code-generators</guid>
      <pubDate>Sat, 28 Jun 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[After a month of experimenting with different AI coding assistants, I've discovered that the real differentiator isn't the AI model—it's how much control you have over context engineering.]]></description>
      <content:encoded><![CDATA[<p><strong>TL;DR:</strong> The evolution of code generators from desktop AI apps to CLI-based tools represents a fundamental shift in context engineering. While writing code with the Claude/OpenAI/Gemini desktop apps requires manual context management, CLI tools handle it automatically, letting developers focus on the task rather than the tool.<br>
<img src="/static/blog/2025/ai-geoform-social-glow.jpg" alt="AI Code Generators and Context Engineering"></p>
<p>I've spent the last month living in three different cli-based coding environments (<code>cli-ai-code</code>), and I've come to a realisation: the future of AI-assisted development isn't about which model you use—it's about who controls the context engineering. And increasingly the code generators are getting better.</p>
<p>My journey started last summer, where many developers begin: <strong>Claude's desktop app (<code>claude</code>).</strong> I'd carefully curate which files to share, write detailed code guidelines, and manage every aspect of the conversation. The process felt like conducting an orchestra—I was constantly directing attention, summarising when contexts grew too long, and asking for diffs instead of full file updates to manage tokens.</p>
<p>This approach worked, but it was exhausting. Building this very website using Claude's desktop app taught me just how much overhead manual context management creates. Every conversation turn required decisions about what to include, what to summarise, and what to leave out. The copy-pasting in and out of the app was fun, but it was the slowest part of the process. You can read about that journey in detail, but the key takeaway was simple: I was spending more time managing the AI than should be required. It was also obvious that the emerging IDE integrations would be a better option.</p>
<p><strong>Windsurf, Cursor, and GitHub Copilot</strong> offered a different experience. These tools live where developers already work, and they felt like the perfect solution. No more copy-pasting, no more context curation; just code completion and error fixing right in my editor. The agentic mode was a game-changer, because the editor could now look at the workspace or additional files to understand the context of the code. The manual context engineering for the IDEs is improving each week, and since early this year <code>vibe-coding</code> has been the norm for many. My experience with the IDE-based tools has been positive but somewhat hit-and-miss, especially with <code>React</code>, where a missing file or dependency can cause the codegen to duplicate code that already exists. Again, this is not a long-term issue, as context windows are becoming larger and the information in the <code>.cursorfiles</code> folder becomes automated. Overall, you need to ensure the agent has access to the correct files and to proper documentation, and if your repository is large, you're essentially doing the same curation work as with desktop apps, just with a different interface. The tension between giving complete control to the agent and maintaining oversight never quite resolved for me. These IDEs excelled at micro-tasks: completing code, fixing errors and bugs, and writing test cases (I wrote a lot of the tests!).</p>
<p>Recently, since the release of <code>Claude Sonnet 4.0 with code</code> in May and the re-release of <code>codex</code>, I've leaned into using <code>cli-ai-code</code> tools, and frankly fallen in love with them. The experience is so much better than the IDE-based workflow, and because the UX is so constrained, you have to lean completely into the CLI's workflows: create the <code>claude.md</code>, <code>readme.md</code>, and <code>./docs/</code> folder and keep them up to date. The <code>/compress and /clear</code> commands also force you to think a bit more about the context that the CLI has built so far for you, but for the most part the CLI takes full responsibility for context engineering. There aren't toggles for which files to read. There's no manual curation. You give it a task, and it figures out what it needs. This complete delegation initially felt uncomfortable, as I was used to being in control. But the results spoke for themselves.</p>
<p>The UX advantage of <code>cli-ai-code</code> surprised me. Unlike editor-based AIs where code constantly competes for your attention, the terminal provides focused feedback. You see the thinking steps, the <code>greps</code>, <code>seds and awks</code>, <code>glob in/out</code>, <code>regexs</code>,the task checklist and progress updates. When the AI is working, you're not distracted by syntax highlighting or autocompletions. Your sole focus remains on the terminal, on the plan, on the outcome. More importantly, <code>cli-ai-code</code> offer clear intervention points. You can stop midway if the approach seems wrong, or let it complete and then provide corrective guidance. This isn't the black box of agent mode in IDEs—it's transparent, iterative development. Similarly, the <code>claude.md</code> and <code>.cursorfiles</code> folder can contain detailed project specifications, coding standards, and architectural decisions. But this is the trade-off: IDE tools provide more granular control over context at the cost of requiring explicit creation and maintenance of that documentation.</p>
<p><strong>The OpenAI Codex Exception: Asynchronous Development</strong>: <code>codex</code> deserves special mention because it operates differently. I use it in what I call &quot;away-from-keyboard&quot; mode. The workflow is unique: provide a complete task specification—almost like a dev/product spec—and openai spins up a container in the cloud, plans the implementation, executes it, and returns a pull request. This approach has proven invaluable during commutes or when travelling with intermittent connectivity. The key is being upfront about design and architecture because each conversation spawns a fresh container build. There's no iterative back-and-forth; you need to get the specification right upfront.</p>
<p>I recently used this approach to build an entire phone number transfer system for Daily. The entire project was &quot;vibecoded&quot;—I provided the requirements, claude/codex handled the implementation, and I reviewed the resulting PR. Similarly, a queueing system I needed was entirely written through <code>cli-ai-code</code> without me writing a single line of code manually. I am currently trying out <code>gemini</code>'s CLI codegen as well.</p>
<h3>Context Engineering: The Real Differentiator</h3>
<p>The progression from desktop apps to CLI tools represents a fundamental shift in how we think about context. In desktop apps, context engineering is explicit and manual. You decide what the AI sees. In IDE integrations, it's semi-automatic but limited to open files and explicit selections, though I think they are getting better with workspace access. In CLI tools, it's completely delegated to the AI.</p>
<p>This delegation initially feels like losing control, but it's actually gaining leverage. When I use <code>cli-ai-code</code>, I'm not thinking about which files to include or how to structure my request to fit within token limits. I'm thinking about the problem I want to solve. The tool handles the complexity of understanding my codebase, finding relevant files, and maintaining context across operations. If it doesn't have access to the relevant files, it will ask for them.</p>
<p>Looking forward, the shift from manual to automatic context engineering represents more than a tooling change—it's a fundamental rethinking of the developer-AI relationship. As these tools mature, I expect we'll see even more sophisticated context understanding, better intervention mechanisms, and smoother workflows. For developers still managing context manually, I encourage you to try CLI-based tools. The initial adjustment period is worth it. You might find, as I did, that letting go of context control actually gives you more control over what matters: solving problems and building software.</p>
<p>The command line, that venerable interface we've used for decades, has found new purpose. And honestly? It feels like coming home.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[ai]]></category>
      <category><![CDATA[development]]></category>
      <category><![CDATA[code-generation]]></category>
      <category><![CDATA[context-engineering]]></category>
    </item>
    <item>
      <title><![CDATA[Abstract Art for Blog Images]]></title>
      <link>https://varunsingh.net/til/imagegen/abstract-imagegen-for-blog</link>
      <guid isPermaLink="true">https://varunsingh.net/til/imagegen/abstract-imagegen-for-blog</guid>
      <pubDate>Sat, 28 Jun 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[I use a tiny Node script (`create-abstract-imagen.js`) to generate wide abstract hero art for the blog posts. I tried both `dall-e-3` and Gemini's `imagen-4.0-generate-001`, there is a good mix of out]]></description>
      <content:encoded><![CDATA[<p>I use a tiny Node script (<code>create-abstract-imagen.js</code>) to generate wide abstract hero art for the blog posts. I tried both <code>dall-e-3</code> and Gemini's <code>imagen-4.0-generate-001</code>, and the outputs are a good mix. The prompt is simple:</p>
<pre><code>A glossy, high-contrast abstract landscape artwork inspired by Orphism, Lyrical Abstraction, and early modernist painters like Kandinsky, Klee, and Malevich. Abstraction vibe, bold gradients, landscape orientation with negative space for white titles.
</code></pre>
<p><img src="/static/blog/2025/ai-Abstract-orphism-modernist-landscape.png" alt="abstract art from prompt"></p>
<p>In addition, <code>gpt-4o</code> generates a 2–4 word filename slug (e.g., <code>ai-GlowingOrbits.png</code>). I tried image understanding for this, but it produced &quot;art with circles&quot;, which is not unique given the art that has already been generated.</p>
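<p>The slug-to-filename step can be sketched as follows. This is a minimal illustration in Python rather than the Node script itself; <code>slug_to_filename</code> and its parameters are hypothetical names, and the <code>gpt-4o</code> call that produces the words is left out.</p>
<pre><code class="language-python">import re


def slug_to_filename(slug_words, prefix="ai-", ext=".png"):
    """Turn a short phrase such as 'glowing orbits' into 'ai-GlowingOrbits.png'."""
    # Keep only alphanumeric runs so stray punctuation never reaches the filesystem.
    words = re.findall(r"[A-Za-z0-9]+", slug_words)
    if len(words) not in (2, 3, 4):
        raise ValueError("expected a 2-4 word slug")
    # Capitalize each word and join them without separators.
    camel = "".join(w[:1].upper() + w[1:].lower() for w in words)
    return prefix + camel + ext


print(slug_to_filename("glowing orbits"))  # prints ai-GlowingOrbits.png
</code></pre>
<p>Rejecting slugs outside the 2–4 word range keeps filenames short and makes an off-format model reply fail loudly instead of producing an odd asset name.</p>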
<h2>Notes</h2>
<ul>
<li>Keep <code>images/wip/</code> git-ignored; copy the final picked asset to, e.g., <code>dist/static/images/...</code></li>
<li>Landscape 1792x1024 fits the site’s hero slots; leave room on the left/top for title text.</li>
<li>If you regenerate, keep one “winner” per post to avoid blob bloat in git.</li>
</ul>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[imagegen]]></category>
    </item>
    <item>
      <title><![CDATA[The End of Headcount: How GenAI is Redefining Leadership]]></title>
      <link>https://varunsingh.net/post/the-end-of-headcount-leadership-genai</link>
      <guid isPermaLink="true">https://varunsingh.net/post/the-end-of-headcount-leadership-genai</guid>
      <pubDate>Tue, 10 Jun 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[As AI enables companies to reach $100M ARR with tiny teams, traditional leadership metrics of headcount and budgets become obsolete. The future belongs to leaders who can orchestrate AI-human collaboration.]]></description>
      <content:encoded><![CDATA[<p><strong>TL;DR: GenAI is enabling companies to achieve massive scale with minimal headcount, fundamentally disrupting traditional leadership hierarchies based on team size and budgets. Future executives must shift focus from managing people to orchestrating AI-human collaboration.</strong></p>
<p><img src="/static/blog/2025/ai-waves-social-glow.jpg" alt="Leadership in the Age of AI"></p>
<p>Last week at the AI Engineer's World Fair, engineers demonstrated how much small teams of developers could accomplish. The evidence is mounting everywhere. We're seeing companies reach $100 million in annual recurring revenue with teams that would have been considered skeleton crews in the pre-AI era. Our own teams have become progressively leaner, not through layoffs or budget cuts, but through the use of AI tools that allow each person to be more productive.</p>
<p>In addition to the smaller-teams debate, the community is locked in a debate about job safety, particularly in tech. One camp argues that younger folks, being AI-native, will dominate the job market. They've never known a world without ChatGPT, and they approach problems with AI as their first tool rather than their last resort. The other camp contends that high-skill veterans will transform from 10x engineers to 1000x engineers, leveraging their deep domain knowledge to build more quickly. The truth is, it doesn't matter which side of this debate you fall on. <strong>The outcome remains the same: there will be fewer employees.</strong> The optimistic view, and the one I subscribe to, is that AI will enable the formation of many more companies, albeit much smaller in size. Instead of one company with 5,000 employees, we might see thousands of profitable companies with 10-20 employees, creating more diverse opportunities and innovation.</p>
<p>This brings us to the elephant in the boardroom: what happens to leadership and executive roles in this new paradigm? Traditional corporate structures evolved alongside headcount. A billion-dollar ARR company was rarely a 100-person operation—it was more likely a 1,000 to 5,000 person organization, complete with layers of management, directors, VPs, and C-suite executives.</p>
<p>The currency of leadership has long been headcount and budget. Executives would proudly speak of managing teams of hundreds or thousands, of budgets in the tens of millions. Performance reviews emphasized &quot;scope of responsibility,&quot; often measured by the number of direct and indirect reports. The larger your organization chart, the more senior your position, the higher your compensation.</p>
<p>This entire framework is about to collapse.</p>
<p>When a team of 10 people augmented by AI agents can outperform a traditional team of 100, the mathematics of management change fundamentally. The question shifts from &quot;How many people do you manage?&quot; to &quot;How effectively can you orchestrate AI-human collaboration?&quot; The metric changes from headcount to impact-per-person, from budget size to efficiency ratios. Large organizations face a particularly acute challenge. They must confront the reality that AI will shrink their organizations, potentially dramatically. A department of 500 might eventually become a department of 50. This isn't just about job losses—it's about the complete dissolution of existing hierarchies. Middle management layers that existed primarily to coordinate large groups of people become redundant when AI handles coordination and routine decision-making.</p>
<h3>The New Executive Skillset</h3>
<p>Leaders must now shift their focus entirely. Instead of asking &quot;How can I grow my team?&quot; they need to ask &quot;Who in my organization can leverage AI tools to build faster with fewer resources?&quot; More critically, they need to evaluate whether their organizations even have the right type of people to thrive in an AI-augmented environment and start to upskill their existing team and figure out who can transform into AI-native talent.</p>
<p>This requires a fundamental rethinking of what leadership means. Traditional management skills—delegation, performance reviews, team building—remain relevant but become secondary to new capabilities. Leaders must become skilled at identifying AI leverage points, at knowing when human judgment is irreplaceable, and at creating systems where small teams can have outsized impact. The most successful executives of the next decade won't be those who can manage the largest teams, but those who can achieve the most with the least. They'll be measured by how much value they create per person, how effectively they blend human creativity with AI capability, how quickly they can adapt to new tools and possibilities.</p>
<p>For aspiring leaders, the path forward looks radically different. The traditional career progression of individual contributor to team lead to manager to director to VP becomes less relevant when teams shrink by an order of magnitude. Instead, career growth might look more like expanding the scope of problems you can solve with a small team, or launching spin-off ventures, or becoming a super-contributor who coordinates AI agents rather than human reports.</p>
<p>We're witnessing the end of the industrial-age organization structure. Just as the assembly line gave way to knowledge work, the knowledge work hierarchy is giving way to AI-augmented small teams. It's a revolution that will remake how we think about companies, careers, and value creation.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[ai]]></category>
      <category><![CDATA[leadership]]></category>
      <category><![CDATA[management]]></category>
      <category><![CDATA[future-of-work]]></category>
    </item>
    <item>
      <title><![CDATA[GenAI considered reliable-enough]]></title>
      <link>https://varunsingh.net/post/genai-considered-reliable-enough</link>
      <guid isPermaLink="true">https://varunsingh.net/post/genai-considered-reliable-enough</guid>
      <pubDate>Mon, 26 May 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[Drawing parallels between TCP/UDP networking protocols and GenAI reliability mechanisms to argue for context-appropriate reliability standards]]></description>
      <content:encoded><![CDATA[<p><strong>TL;DR: Just as TCP isn't 100% reliable but is considered &quot;reliable enough&quot; through checksums and retransmissions, GenAI can achieve appropriate reliability through guardrails, LLM-as-judge, and chain-of-thought reasoning.</strong></p>
<p><img src="/static/blog/2025/ai-merge-social-glow.jpg" alt="GenAI Reliability"></p>
<h1>GenAI Considered Reliable-Enough</h1>
<p>In defense of Generative AI's hallucinations and errors, let's consider this: humans and our existing systems are not 100% reliable. Even TCP, the protocol we trust for reliable data transmission, isn't perfectly reliable. Loss of transmitted packets results in retransmissions, and these retransmitted packets can also be lost, which will eventually cause the connection to terminate. Nonetheless, we consider TCP to be reliable. Why? Because it's <em>reliable enough</em> for its intended use cases, and the mechanisms that make it more reliable have been adjusted over the past few decades. Nodes within a datacenter use DCTCP, servers delivering to end users use proprietary flavors, and end users may use CUBIC, BBR, etc.</p>
<h2>The TCP Analogy: Understanding &quot;Reliable Enough&quot;</h2>
<p>Delving deeper into how TCP works reveals several mechanisms that reduce the probability of data corruption. The protocol employs checksums to verify data integrity, ensuring that what arrives matches what was sent. It uses sequence numbers to maintain ordered delivery, preventing packets from arriving out of order and corrupting the data stream. When packets are lost, TCP's retransmission mechanisms kick in, resending data until acknowledgment is received. Various timeouts govern these processes, ultimately deciding when to give up on a connection that has become unviable.</p>
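<p>As a rough illustration of the checksum idea, here is a 16-bit one's-complement sum in the style of RFC 1071 (the Internet checksum). It is a simplified sketch: real TCP also covers a pseudo-header and the TCP header, which this omits, and the function name is illustrative.</p>
<pre><code class="language-python">def internet_checksum(data):
    """16-bit one's-complement checksum over a byte string, RFC 1071 style."""
    if len(data) % 2 == 1:
        data = data + b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += data[i] * 256 + data[i + 1]  # big-endian 16-bit word
    # Fold any carries above 16 bits back into the low 16 bits.
    while total // 65536:
        total = total % 65536 + total // 65536
    return 65535 - total  # one's complement of the folded sum


payload = b"hello world!"  # even length, so the appended checksum stays word-aligned
checksum = internet_checksum(payload)
# A receiver recomputes over payload plus checksum; an intact segment yields 0.
assert internet_checksum(payload + checksum.to_bytes(2, "big")) == 0
</code></pre>
<p>A nonzero result on the receiving side signals corruption, at which point the segment is dropped and the retransmission machinery takes over.</p>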
<p>TCP introduced the concept of a connection over a connection-less packet delivery model. This layered approach to reliability offers an important lesson for GenAI systems. Although TCP failure modes are observable and can be detected, GenAI failure modes may not match this paradigm. Still, I think we can draw some parallels.</p>
<h2>GenAI's Reliability Mechanisms</h2>
<p>Following the networking analogy, GenAI needs to apply corresponding resilience mechanisms. Guardrails function as circuit-breakers in the AI system, preventing the model from generating harmful or wildly incorrect content. Just as circuit breakers prevent electrical system overload and TCP's connection timeouts prevent infinite waiting, these safety boundaries ensure the system fails gracefully rather than catastrophically.</p>
<p>The LLM-as-a-judge pattern serves a role similar to checksums in networking protocols. Where checksums verify data integrity by comparing received data against expected values, LLM-as-judge approaches use a second model, or the same model in a different mode, to evaluate the quality and accuracy of generated content. This creates a verification layer that can catch errors before they reach the end user.</p>
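<p>A minimal sketch of the pattern, assuming the generator and the judge are plain callables: the judge's verdict plays the role of the checksum, and a rejected draft triggers a retry, much like a retransmission. All names here are hypothetical, and the stand-in functions replace real model calls so the example runs on its own.</p>
<pre><code class="language-python">def generate_with_judge(prompt, generate, judge, max_attempts=3):
    """Return (draft, attempts) for the first draft the judge accepts, else (None, max_attempts)."""
    for attempt in range(1, max_attempts + 1):
        draft = generate(prompt, attempt)
        # The verdict comes from a second model, or the same model in another mode.
        if judge(prompt, draft):
            return draft, attempt
    return None, max_attempts  # give up, like a connection timing out


# Stand-ins so the sketch runs without any model API.
def fake_generate(prompt, attempt):
    return "2 + 2 = 5" if attempt == 1 else "2 + 2 = 4"


def fake_judge(prompt, draft):
    return draft.endswith("= 4")


print(generate_with_judge("What is 2 + 2?", fake_generate, fake_judge))
</code></pre>
<p>Capping the attempts matters as much as the judge itself: without a limit, a generator that never satisfies the judge would loop forever.</p>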
<p>Chain of thought (CoT) reasoning provides something analogous to sequence numbers in TCP. Just as sequence numbers ensure packets arrive in the correct order and enable reconstruction of the original message, chain of thought reasoning ensures logical progression through a problem. It creates traceable reasoning paths that can be audited and verified, making the model's decision-making process more transparent and reliable.</p>
<h2>Context-Dependent Reliability</h2>
<p>In networking, you have two fundamental choices: use TCP with its built-in reliability mechanisms, or use UDP and build your own reliability layer tailored to your specific needs. This choice depends entirely on your use case and what &quot;reliable&quot; means in your context.</p>
<p>Real-time voice and video calls demonstrate this principle perfectly. They use RTP over UDP because in conversation, latency matters more than perfection. When packets go missing, the decoder doesn't wait—it guesses and renders what it can. You might see a momentary freeze or hear a brief glitch, but the conversation continues. The system prioritizes low latency over perfect delivery because a delayed &quot;hello&quot; is worse than a slightly garbled one.</p>
<p>Streaming video services take the opposite approach. Here, media is received into a buffer before playback begins. The system can take time to ensure each packet arrives and is processed in order, playing back at the highest possible quality while carefully managing the buffer to avoid the dreaded rebuffering pause. Quality and completeness take precedence over real-time delivery because viewers would rather wait a few seconds for the video to start than watch a degraded experience. Over time, we have seen systems shift from UDP to TCP back to UDP. For example, video on demand streaming used to be over RTSP over UDP in the 90s, but unreliability and advent of browsers meant that streaming over HTTP over TCP became the norm. However, recently, because of layer ossification, HTTP over TCP is being replaced by QUIC over UDP.</p>
<h2>The GenAI Parallel</h2>
<p>We find ourselves in a similar situation with Generative AI and its ability to mimic, copy, guess, and create. The reliability requirements vary dramatically based on the application, just as they do in networking.</p>
<p>In medical diagnosis, legal document drafting, or financial analysis, we need multiple verification layers. These applications require human-in-the-loop validation, strict guardrails, and comprehensive audit trails. This is like running TCP with additional application-layer checksums—we're not just relying on the base protocol's reliability but adding extra verification because the cost of errors is too high. A misdiagnosis or a legal mistake can have life-altering consequences, so we build systems that verify, re-verify, and maintain clear chains of accountability.</p>
<p>On the other end of the spectrum, consider brainstorming sessions, first drafts, or entertainment applications. Here, GenAI operates more like UDP: some &quot;packet loss&quot; in the form of minor errors or inconsistencies is perfectly acceptable. When you're using AI to generate ideas for a marketing campaign or create variations of a design concept, perfect accuracy isn't the goal. Speed and creativity matter more than precision. A slightly nonsensical suggestion might even spark the perfect idea. Similarly, vibe-coded internal applications or proof-of-concept applications may not require the same level of reliability as production applications, and may meet the bar of &quot;good enough&quot;.</p>
<p>Most interesting are the hybrid approaches that adapt their reliability requirements dynamically. Code generation paired with test verification creates a feedback loop where the AI can be creative and make mistakes, but those mistakes are caught before they matter. Content creation with fact-checking layers allows for fluid writing while ensuring accuracy where it counts. Customer service systems can seamlessly escalate to humans when confidence drops below a threshold. These systems are like adaptive protocols that can switch their error-resilience modes based on the observed needs.</p>
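<p>The code-generation-plus-tests loop, with a handoff to a human when retries run out, might look roughly like this. It is a sketch under assumptions: the generator is a stand-in for a model call, the function names are made up for illustration, and <code>exec</code> is used only to keep the example short.</p>
<pre><code class="language-python">def verified_codegen(task, generate_code, run_tests, max_attempts=3):
    """Regenerate until the tests pass; escalate to a human when attempts run out."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        source = generate_code(task, feedback)
        ok, feedback = run_tests(source)
        if ok:
            return {"status": "accepted", "source": source, "attempts": attempt}
    return {"status": "escalate_to_human", "source": None, "attempts": max_attempts}


def fake_generate_code(task, feedback):
    # Stand-in for a model call: the first try has a bug, the retry uses the feedback.
    if feedback:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


def run_tests(source):
    namespace = {}
    exec(source, namespace)  # fine for a sketch; sandbox untrusted code in practice
    if namespace["add"](2, 3) == 5:
        return True, ""
    return False, "add(2, 3) should equal 5"


result = verified_codegen("write add(a, b)", fake_generate_code, run_tests)
print(result["status"], result["attempts"])
</code></pre>
<p>Feeding the failing assertion back into the next attempt is what turns plain retries into a feedback loop; the escalation branch is the graceful failure mode.</p>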
<p>Just as network engineers build reliable systems on unreliable networks, AI engineers must build reliable applications on probabilistic models. The key is layering your defenses. Never rely on a single checking mechanism. Multiple models reviewing each other's work, diverse prompting strategies, and varied validation approaches create a robust system that can catch different types of errors.</p>
<p>Matching reliability to requirements becomes crucial. Not every use case needs five-nines reliability, and trying to achieve it everywhere would be prohibitively expensive and slow. A chatbot helping users find documentation can tolerate occasional misunderstandings, while a system generating medical dosage recommendations cannot be incorrect.</p>
<p>We must embrace probabilistic thinking in our system design. Instead of trying to handle every edge case perfectly, we design for the 95% case and ensure the system handles the remaining 5% gracefully. This might mean clear error messages, smooth handoffs to human operators, or transparent confidence indicators that help users understand when to verify the AI's output.</p>
<p>Monitoring and adaptation round out the reliability strategy. Like TCP's congestion control algorithm that adjusts sending rates based on network conditions, AI systems should adapt their behavior based on performance metrics. If error rates increase, the system might automatically become more conservative, request additional verification, or route more requests to human review.</p>
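<p>One way to picture this adaptation is a rule that borrows the asymmetry of TCP's additive-increase/multiplicative-decrease: the share of requests sent to human review grows quickly after an observed error and relaxes slowly while things go well. The function name and constants below are assumptions chosen for illustration, not tuned values.</p>
<pre><code class="language-python">def adapt_review_rate(rate, error_seen, step=0.05, backoff=2.0, floor=0.05):
    """Return the next fraction of requests to route to human review."""
    if error_seen:
        # React quickly: double the review share, capped at reviewing everything.
        return min(1.0, rate * backoff)
    # Relax slowly: shave off a small step, but never drop below the floor.
    return max(floor, rate - step)


rate = 0.1
for error_seen in [False, True, True, False]:
    rate = adapt_review_rate(rate, error_seen)
    print(round(rate, 2))
</code></pre>
<p>The floor keeps a small sample under review even during long error-free stretches, so regressions are still noticed.</p>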
<h2>Conclusion: Redefining Reliability</h2>
<p>&quot;Reliable enough&quot; isn't settling for less. It is engineering for reality. TCP shows us that perfect reliability isn't necessary for a protocol to be considered reliable. Similarly, GenAI doesn't need to be perfect to be transformative.</p>
<p>The question isn't &quot;Is GenAI reliable?&quot; but rather &quot;Is GenAI reliable enough for my specific use case?&quot; And increasingly, with the right mechanisms in place, the answer is yes.</p>
<p>As we continue to develop AI systems, we should focus not on eliminating all errors (an impossible task even for humans), but on building appropriate reliability mechanisms for each use case. Just as the internet thrives on &quot;best effort&quot; packet delivery with reliability built in layers above, GenAI can thrive with thoughtful application of context-appropriate reliability mechanisms.</p>
<p>The future isn't about perfect AI. It's about AI that's reliable enough for the task at hand, with well-understood failure modes and appropriate safeguards.</p>
<hr>
<p>A more formal version of building reliable LLMs is documented in <a href="https://github.com/humanlayer/12-factor-agents">12-Factor Agents</a>; give it a read if you're interested in the topic.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[ai]]></category>
      <category><![CDATA[reliability]]></category>
      <category><![CDATA[networking]]></category>
      <category><![CDATA[engineering]]></category>
    </item>
    <item>
      <title><![CDATA[Mega Launch Week: Gemini, Claude, and more]]></title>
      <link>https://varunsingh.net/post/mega-launch-week-2025-05-23</link>
      <guid isPermaLink="true">https://varunsingh.net/post/mega-launch-week-2025-05-23</guid>
      <pubDate>Fri, 23 May 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[Launch week introduced Google Jules, OpenAI Codex, Claude Code, and more]]></description>
      <content:encoded><![CDATA[<p><strong>TL;DR: Google I/O launch week turned out to be a great week for AI. Announcements from Google, Anthropic, and OpenAI were the highlights of the week.</strong></p>
<p><img src="/static/blog/2025/ai-swirl-social-glow.jpg" alt="Mega Launch Week: Gemini, Claude, Codex, and more"></p>
<h2>AI’s Supersonic Week: A Use-Case-Centric Breakdown</h2>
<p>This past week, the AI landscape changed again, with advancements across various domains: Code gen, image gen, video gen, and some hardware.</p>
<hr>
<h2>Code Generation: Advancements in AI Programming Assistants</h2>
<p>The realm of AI-assisted coding has seen notable progress, with Google's Jules, Claude Code, and OpenAI Codex.</p>
<p><strong>Google Jules</strong> is an asynchronous, agentic coding assistant that integrates with the codebase/repository. Since it uses Gemini 2.5 Pro with long context, it seems to have a better chance of performing well on a larger codebase. Like the others in the category, it can plan, reason, and provide a diff of the changes it made for you to review.</p>
<p><strong>Claude Code with Opus 4 &amp; Sonnet 4</strong>: Anthropic's latest models have achieved state-of-the-art results on coding benchmarks such as SWE-bench (72.5–72.7%). These models demonstrate sustained performance on long-running tasks, maintaining focus over extended periods. However, unlike <code>codex</code>, Claude Code is not a cloud-based agent; it is a local agent that runs on your machine. The pro is that you can chat and iterate on the code quickly, since it does not need to rebuild the sandbox for each conversation. The con is that you need to be at your desk to use it.</p>
<p><strong>OpenAI Codex</strong>: Codex is a cloud-based software engineering agent designed to automate common development tasks. Integrated into ChatGPT, Codex operates in secure sandbox environments, handling tasks like writing code, debugging, and generating pull requests. Since it runs in a sandbox, it essentially runs outside of your local development environment, i.e., you can ask it to do things while on the move, but it also means that you need to upload your secrets, environment variables, etc. to ChatGPT.</p>
<p>These advancements are great for the software engineering landscape: multiple organizations are pushing the boundaries of AI-driven code generation, and it is no longer just VS Code plugins.</p>
<hr>
<h2>Image &amp; Video Generation: Enhancing Creative Capabilities</h2>
<p>AI models are increasingly capable of generating high-quality visual content:</p>
<ul>
<li><strong>Veo 3</strong>: Google's latest video generation model can produce 4K videos with synchronized audio, including speech and ambient effects, based on text prompts. The accompanying tool, <strong>Flow</strong>, allows filmmakers to iteratively steer output using text, shots, and mood boards.</li>
<li><strong>OpenAI Sora</strong>: Sora remains a benchmark for physical realism; many pictures on this site were built with Sora.</li>
<li><strong>Imagen 3</strong>: Google's updated image generation model offers improved fidelity and prompt controllability, narrowing the gap with competitors like Midjourney, Sora, and DALL·E 3.</li>
</ul>
<p>These tools are democratizing content creation, enabling users to produce professional-grade media with minimal resources.</p>
<hr>
<h2>Hardware &amp; Ambient Agents: Integrating AI into Daily Life</h2>
<p>AI is transitioning from software to integrated hardware solutions:</p>
<ul>
<li><strong>Android XR Glasses</strong>: Demonstrated at Google I/O, these lightweight headsets offer real-time translation and Gemini overlays, providing &quot;heads-up answers&quot; without the need for a phone.</li>
<li><strong>Project Astra</strong>: Google's research prototype can utilize a phone's camera to remember context and perform actions across the Android UI, indicating a shift from chat-based agents to integrated operating layers.</li>
<li><strong>&quot;io&quot; Device (OpenAI × Jony Ive)</strong>: OpenAI's acquisition of Jony Ive's startup, io, for $6.5 billion aims to develop a design-led pocket AI companion, targeting the shipment of 100 million units. This device aspires to be a screen-free, context-aware assistant, marking a significant move towards ambient AI hardware.</li>
</ul>
<p>While early attempts like Humane's AI Pin faced challenges, the continued investment and innovation in this space suggest a promising future for AI-integrated hardware.</p>
<p><em>Note: This overview is based on developments up to May 24, 2025.</em></p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[ai]]></category>
      <category><![CDATA[industry]]></category>
      <category><![CDATA[future]]></category>
    </item>
    <item>
      <title><![CDATA[Building a Modern Personal Website with Claude, Cloudflare, and GitHub]]></title>
      <link>https://varunsingh.net/post/building-modern-website-claude-cloudflare-github</link>
      <guid isPermaLink="true">https://varunsingh.net/post/building-modern-website-claude-cloudflare-github</guid>
      <pubDate>Sat, 04 Jan 2025 00:00:00 GMT</pubDate>
      
      <description><![CDATA[How I leveraged Claude's assistance to build a serverless personal website using TypeScript, Tailwind CSS, and Cloudflare's edge services]]></description>
      <content:encoded><![CDATA[<p><strong>TL;DR:</strong> Turned a 10-page LaTeX resume into a modern website by collaborating with Claude, an AI assistant. Beyond just coding, the key to success was establishing clear development patterns early, maintaining thorough documentation, and treating AI as a thoughtful collaboration partner rather than just a code generator. This post shares practical lessons learned about effective AI collaboration in software development. 🚀</p>
<p><img src="/static/blog/2025/ai-air-social-glow.jpg" alt="Building a Modern Personal Website with Claude"></p>
<h2>The Challenge 🌐</h2>
<p>For academics and professionals in technology, maintaining an up-to-date online presence is more than a nicety—it's a necessity. I found myself in a common situation: maintaining a comprehensive LaTeX document that had evolved over a decade to include hundreds of publications, talks, patents, and other professional accomplishments. While LaTeX excelled at producing formatted documents, it created friction whenever I needed to use this information in other contexts.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Just spent 6h filling out an EB1A intake form. Why cant I upload my CV/resume which already has the information, with links. It is simpler to<br>- provide Google Scholar = papers, patents<br>- provide linkedin<br>- provide links to press and awards with URLs<br>Parse, collate, and organise</p>&mdash; Varun Singh (@vr000m) <a href="https://twitter.com/vr000m/status/1615538110447878145?ref_src=twsrc%5Etfw">January 18, 2023</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>This tweet captured my frustration perfectly. The process of maintaining and reusing professional information was broken. Every time I gave a talk or published a paper, I would append it to my BibTeX file. This worked great for LaTeX compilation but meant manually copying and reformatting this information for other uses—visa applications, collaboration requests, or online profiles. The process was time-consuming and error-prone.</p>
<p>What I needed wasn't just a website, but a system that could:</p>
<ol>
<li>Accept updates through familiar tools (text editor, git)</li>
<li>Store information in a structured, queryable format</li>
<li>Maintain the single-source-of-truth principle I had with LaTeX</li>
</ol>
<p>This was where AI collaboration became interesting. The challenge wasn't primarily about web development—I'd built websites before. The real opportunity was to explore how AI could help build a system that would evolve with my needs while maintaining the simplicity of my current workflow. Working with Claude presented a unique opportunity to rethink not just the technical solution, but the entire development approach. The tool <a href="https://github.com/pipecat-ai/open-sesame">open-sesame</a> facilitated this interaction with Claude 3.5 Sonnet, setting the stage for an experiment in AI-assisted development that would prove more illuminating than I initially expected. 🤖</p>
<h2>Technical Decisions: Building for Simplicity 🛠️</h2>
<p>The technical architecture for this project emerged from a simple premise: minimize infrastructure complexity while maintaining flexibility for content updates. Rather than getting caught up in complex technology choices, I wanted the architecture discussions with Claude to focus on solving the core problem - managing professional information effectively.</p>
<p>Three key requirements drove our technical decisions. First, I needed a database that could be updated via CLI tools, maintaining my existing git-based workflow. Second, I needed a way to handle blog posts and profile images without managing a complex CDN setup. Finally, the site needed to be easily deployable and maintainable. These requirements led us to a serverless approach using Cloudflare's edge services.</p>
<pre><code class="language-mermaid">graph TD
    A[GitHub Repository] --&gt;|GitHub Actions| B[Build Pipeline]
    B --&gt;|Deploy| C[Cloudflare Pages]
    B --&gt;|Migrate| D[Cloudflare D1]
    E[Content Updates] --&gt;|Push| A
    F[Blog Images] --&gt;|Upload| G[GitHub /images/blog/]
    C --&gt;|Serve| H[Website]
    D --&gt;|Data| H
    G --&gt;|Assets| H
    I[Cloudflare KV] --&gt;|Rate Limiting| H
</code></pre>
<p>This architecture aligned naturally with my workflow: I could continue maintaining content in text files and use simple CLI commands to sync updates to the website. More importantly, it provided a foundation for building tooling that matched my existing practices rather than forcing adaptation to a new content management paradigm.</p>
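<p>To make the text-file-to-database sync a little more concrete, here is a minimal sketch of the first step: turning one BibTeX entry into a structured record that a CLI command could then write to the database. The function name <code>parseBibEntry</code>, the sample entry, and the flat-field assumption (no nested braces) are illustrative assumptions, not the site's actual tooling.</p>

```javascript
// Hypothetical helper: parses one flat BibTeX entry into a plain object.
// Assumes field values are {braced}, "quoted", or bare numbers, with no nested braces.
function parseBibEntry(text) {
  const header = text.match(/@(\w+)\s*\{\s*([^,\s]+)\s*,/);
  if (!header) return null;

  const record = { type: header[1].toLowerCase(), key: header[2] };
  const body = text.slice(header.index + header[0].length);
  const fieldPattern = /(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)"|(\d+))/g;

  for (const m of body.matchAll(fieldPattern)) {
    // Exactly one of the three capture groups is set for each field.
    record[m[1].toLowerCase()] = (m[2] ?? m[3] ?? m[4]).trim();
  }
  return record;
}

// Placeholder entry, not a real publication.
const sampleEntry = `@inproceedings{example2020,
  title = {An Example Talk},
  author = "A. Author and B. Author",
  year = 2020
}`;

const sampleRecord = parseBibEntry(sampleEntry);
```

<p>A record in this shape is straightforward to validate and insert from a script, which is what keeps the BibTeX file as the single source of truth.</p>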
<p>The real challenge, however, wasn't in choosing technologies—it was in effectively collaborating with AI to build this system in a maintainable way. As we began implementing features, it became clear that the technical decisions themselves were less important than how we approached the development process. This journey of collaboration would evolve through three distinct phases, each building upon lessons from the previous one:</p>
<ol>
<li>Establishing the Basics: Learning to communicate effectively with AI</li>
<li>Developing Systematic Patterns: Creating repeatable processes</li>
<li>Mastering Complex Development: Leveraging AI's strengths for sophisticated features</li>
</ol>
<p>This progression from simple interactions to sophisticated collaboration would prove crucial in building a robust and maintainable system. 🔄</p>
<h2>Evolution of AI Collaboration: From Code Generator to Development Partner 🤝</h2>
<p>The journey of working with AI evolved naturally through distinct phases, each building upon lessons from the previous one. What began as simple code generation requests transformed into a sophisticated development partnership that improved both code quality and development practices.</p>
<p><img src="/static/blog/2025/1-claude-collaboration-phases.svg" alt="Evolution of AI Collaboration Phases"></p>
<p>My initial interactions with Claude followed a common pattern among developers new to AI collaboration - directly requesting code implementations. &quot;I need an API endpoint for managing publications,&quot; I would say, and while the resulting code was functional, it often required significant refinement and didn't leverage the AI's full capabilities.</p>
<p>The first breakthrough came from a simple shift in approach. Instead of jumping straight to implementation, I began starting each feature with requirements discussions. &quot;Let's think about what we need for publications,&quot; I would begin. &quot;How should we structure the data to match our LaTeX format? How will we handle different publication types? What search capabilities might we need?&quot; This seemingly small change led to more thoughtful solutions and fewer revisions. More importantly, it established a pattern where Claude would ask clarifying questions before suggesting implementations.</p>
<p>Our development process evolved into a systematic approach:</p>
<pre><code class="language-mermaid">graph LR
    A[Problem Definition] --&gt; B[Solution Exploration]
    B --&gt; C[Test Design]
    C --&gt; D[Implementation]
    D --&gt; E[Validation]
    E --&gt; A
</code></pre>
<p>As the project grew more complex, the need for more structured ways to maintain context and ensure consistency became apparent. Each development session began with a brief status update: &quot;We're working on search functionality. In our last session, we chose SQLite FTS5 for full-text search and implemented the basic schema. Now we need to handle result ranking and highlighting.&quot; This context-setting became crucial for maintaining continuity across sessions.</p>
<p>A particularly valuable pattern emerged around testing. Claude's approach to test generation was systematic and thorough, often catching edge cases before they became issues in production. For instance, when implementing publication validation, what started as a simple schema check expanded into comprehensive test coverage:</p>
<pre><code class="language-jest">describe('Publication Validation', () =&gt; {
  // Basic field validation
  test('requires title and type', () =&gt; {});
  test('validates publication date format', () =&gt; {});

  // Type-specific validation
  describe('Patent Publications', () =&gt; {
    test('requires status to be pending or granted', () =&gt; {});
    test('requires patent number for granted patents', () =&gt; {});
    test('validates patent number format', () =&gt; {});
  });

  // URL validation
  describe('Publication URLs', () =&gt; {
    test('handles multiple versions (preprint, published)', () =&gt; {});
    test('validates URL format for each type', () =&gt; {});
    test('maintains URL order', () =&gt; {});
  });

  // Edge cases
  test('handles unicode characters in titles', () =&gt; {});
  test('validates dates across timezone boundaries', () =&gt; {});
  test('handles malformed JSON in URL array', () =&gt; {});
});
</code></pre>
<p>Claude didn't just list test cases; it explained the rationale behind each one. &quot;We should test timezone handling,&quot; it suggested, &quot;because publication dates might be entered in different timezones during international conferences.&quot; This kind of contextual thinking about testing scenarios helped prevent issues that might have only surfaced in production.</p>
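<p>As a rough illustration of what those tests would exercise, here is a minimal validator sketch. The function name <code>validatePublication</code> and the field names (<code>title</code>, <code>type</code>, <code>date</code>, <code>status</code>, <code>patentNumber</code>, <code>urls</code>) are assumptions for this example; only the rules themselves come from the test outline above.</p>

```javascript
// Hypothetical validator; field names are assumptions for illustration.
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

function validatePublication(pub) {
  const errors = [];
  if (!pub.title || !pub.title.trim()) errors.push('title is required');
  if (!pub.type) errors.push('type is required');
  if (pub.date && !ISO_DATE.test(pub.date)) errors.push('date must be YYYY-MM-DD');

  // Type-specific rules for patents.
  if (pub.type === 'patent') {
    if (!['pending', 'granted'].includes(pub.status)) {
      errors.push('patent status must be pending or granted');
    }
    if (pub.status === 'granted' && !pub.patentNumber) {
      errors.push('granted patents need a patent number');
    }
  }

  // URLs are stored as a JSON array string; malformed JSON is reported, not thrown.
  if (pub.urls !== undefined) {
    let urls = null;
    try {
      urls = JSON.parse(pub.urls);
    } catch {
      errors.push('urls must be valid JSON');
    }
    if (urls !== null && !Array.isArray(urls)) errors.push('urls must be a JSON array');
    if (Array.isArray(urls)) {
      for (const u of urls) {
        try {
          new URL(u);
        } catch {
          errors.push(`invalid URL: ${u}`);
        }
      }
    }
  }
  return errors;
}
```

<p>Returning a list of errors instead of throwing on the first one keeps each test case small: a test only has to assert that the message it cares about is present.</p>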
<p>Documentation evolved from an afterthought to a real-time activity. Important decisions were captured as they were made, creating a living reference for future discussions. When deciding how to handle publication URLs, for example, we documented not just the decision to store them as a JSON array, but also the rationale - publications often have multiple versions like preprints and final versions - and the implementation details around JSON validation in the data layer.</p>
<p>The real power of AI collaboration emerged when tackling complex features like search implementation. Rather than jumping straight to code, we began with thorough problem definition. &quot;Let's outline exactly what we need from search,&quot; I would say. &quot;We need to search across publications, talks, and blog posts, handle partial matches, support filtering by type and date, and implement relevance ranking.&quot; This led to rich discussions about potential approaches, from using a single FTS table with type discrimination to implementing separate FTS tables with a unified API.</p>
<p>Each potential solution was evaluated through focused questions: &quot;How would this handle cross-type relevance ranking? What about updates to primary records? How would it perform at scale?&quot; This structured approach led to catching potential issues early and producing more maintainable code. Claude's suggestions became increasingly nuanced, often identifying edge cases I hadn't considered.</p>
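<p>One way to picture the &quot;separate FTS tables with a unified API&quot; option is a merge step that normalises scores per content type before ranking across types. The sketch below is an assumption-laden illustration, not the site's implementation: <code>mergeResults</code>, the result shape, and the convention that a higher <code>score</code> is better are all hypothetical. If the scores came from FTS5's <code>bm25()</code>, where lower is better, they would need to be negated first.</p>

```javascript
// Hypothetical merge step for per-type search results.
// resultsByType: { publication: [{ id, score, date }], talk: [...], ... }
// Dates are ISO strings (YYYY-MM-DD), so string comparison orders them correctly.
function mergeResults(resultsByType, { types, since } = {}) {
  const merged = [];
  for (const [type, results] of Object.entries(resultsByType)) {
    if (types && !types.includes(type)) continue; // filter by content type
    // Normalise against the best score within this type so types are comparable.
    const best = Math.max(...results.map((r) => r.score), 0);
    for (const r of results) {
      if (since && r.date < since) continue; // filter by date
      merged.push({ ...r, type, relevance: best > 0 ? r.score / best : 0 });
    }
  }
  // Highest relevance first; newer items win ties.
  return merged.sort(
    (a, b) => b.relevance - a.relevance || b.date.localeCompare(a.date)
  );
}
```

<p>Normalising per type is only one answer to the cross-type ranking question; it trades absolute score fidelity for the guarantee that no single content type dominates the merged list.</p>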
<p>The process wasn't always smooth. Managing context across sessions proved challenging - a simple request to &quot;update the search implementation&quot; needed to become &quot;update search ranking for publications, which currently uses basic FTS5 ranking, to prioritize recent publications.&quot; Scope creep was a constant concern, with Claude sometimes suggesting ambitious additions like automatic tagging and citation parsing. Learning to guide these conversations back to core functionality became an important skill.</p>
<p>The challenge of maintaining simplicity emerged repeatedly. When Claude suggested implementing complex caching mechanisms, I learned to redirect the discussion: &quot;Before we add caching, what's our actual performance bottleneck? How could we solve this with our existing tools?&quot; These moments taught us to stay focused on immediate needs while maintaining a clear path for future enhancements.</p>
<p>More importantly, the testing-driven approach we had established began influencing our design decisions. Each feature discussion now naturally included consideration of edge cases and error conditions, with Claude proposing test scenarios that often revealed potential issues in our planned implementation. This &quot;test-first&quot; thinking helped us build more robust features from the start, rather than adding error handling as an afterthought.</p>
<p>Through this evolution, our collaboration with Claude progressed from basic code generation to sophisticated system design. Each phase taught valuable lessons about effective AI collaboration, from managing context to guiding complex discussions. But beyond the specific journey of this project, clear patterns emerged that could apply to any AI-assisted development work. These patterns, distilled from both successes and challenges, offer a framework for leveraging AI as a genuine development partner rather than just a coding tool. 🎯</p>
<h2>Practical Patterns &amp; Lessons in AI-Assisted Development 📚</h2>
<p>Building a website might seem like a straightforward task, but collaborating with AI to do so revealed insights that could apply to any software project. The most profound lesson emerged early: time invested in establishing clear communication patterns with the LLM pays enormous dividends throughout the project lifecycle. Much like onboarding a new team member, those early conversations shape all future interactions. But unlike human teammates, AI assistants need this context-setting in each session. What could have been a limitation instead became a strength, forcing clarity and precision in our technical discussions. We discovered &quot;Writing&quot; as a common ground for communication.</p>
<p>The practice of documenting decisions in real-time transformed from a project requirement into a powerful development tool. Each major decision created a reference point for future discussions. When we later needed to extend the publication schema to handle multiple paper versions, having documented our initial reasoning about JSON storage for URLs made the decision pathway clear. This documentation served not just as a record but as a thinking tool, forcing us to articulate and examine our assumptions.</p>
<p>Testing became a crucial aspect of our collaboration pattern. Rather than treating tests as verification tools, they became design sessions in themselves. The systematic way Claude approached test generation helped us think through features more thoroughly. For search functionality, what started as basic query testing evolved into a comprehensive test suite:</p>
<ul>
<li>Testing search across different content types (publications, talks, posts)</li>
<li>Verifying relevance ranking with mixed content</li>
<li>Edge cases like partial matches and special characters</li>
<li>Performance testing with large result sets</li>
<li>Handling malformed queries and invalid filters</li>
</ul>
<p>Each test case Claude proposed revealed potential edge cases or user scenarios we hadn't considered, transforming testing from a validation exercise into a design tool that shaped implementation before writing production code.</p>
<p>Counterintuitively, embracing AI's context limitations led to better code organization. The need to explain feature context in each session naturally pushed us toward more modular, well-documented code. When adding blog support, for instance, each session focused on a specific aspect - data modeling, markdown processing, or search integration. This forced modularity made the code more maintainable and easier to test, benefits that extended far beyond AI collaboration.</p>
<p>The most surprising insight came from treating edge cases and error handling not as afterthoughts but as primary design considerations. Claude's systematic approach to questioning implementation details led to more robust code from the start. When implementing the publication API, what began as a simple CRUD interface evolved to handle nuanced cases like draft states, multiple URLs per publication, and proper error handling for malformed requests. The AI's tendency to thoroughly consider failure modes resulted in more resilient code than I might have written on my own.</p>
<p>Another unexpected strength emerged in API design discussions. Claude's ability to think through different use cases helped create more intuitive and flexible interfaces. For example, when designing the publication update endpoints, our discussion naturally covered:</p>
<ul>
<li>Handling partial updates</li>
<li>Maintaining data consistency</li>
<li>Managing concurrent edits</li>
<li>Version history tracking</li>
<li>Access control implications</li>
</ul>
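<p>The first three items in that list can be sketched in a few lines. The example below is a hypothetical illustration rather than the site's actual endpoint code: <code>applyPartialUpdate</code>, the <code>version</code> field, and the list of editable fields are assumptions. It uses a version check (optimistic concurrency) as one possible way to handle concurrent edits.</p>

```javascript
// Hypothetical helper: applies a partial update to a stored publication.
const EDITABLE_FIELDS = ['title', 'type', 'date', 'status', 'urls'];

function applyPartialUpdate(current, patch, expectedVersion) {
  // Optimistic concurrency: reject the edit if someone else saved first.
  if (current.version !== expectedVersion) {
    return { ok: false, error: 'version conflict', record: current };
  }
  // Data consistency: refuse fields the caller is not allowed to change.
  const unknown = Object.keys(patch).filter((k) => !EDITABLE_FIELDS.includes(k));
  if (unknown.length > 0) {
    return { ok: false, error: `unknown fields: ${unknown.join(', ')}`, record: current };
  }
  // Partial update: only supplied fields change; everything else is carried over.
  const record = { ...current, ...patch, version: current.version + 1 };
  return { ok: true, record };
}
```

<p>Because the function returns a new record and never mutates the stored one, keeping the previous object around is enough to start a simple version history.</p>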
<p>The reality of AI collaboration proved different from initial expectations. Success came not from trying to get perfect code immediately, but from establishing a process that consistently produced maintainable, well-tested code that met project requirements. This meant being methodical, maintaining clear communication patterns, and regularly verifying that implementations aligned with project goals. The AI became most valuable not as a code generator but as a thoughtful collaborator that could challenge assumptions and suggest alternative approaches.</p>
<p>Perhaps most importantly, this project demonstrated that effective AI collaboration isn't about working around AI's limitations but about leveraging its unique characteristics. The need for explicit context in each session, far from being a drawback, encouraged better documentation and design practices. The AI's systematic approach to problem-solving helped catch edge cases early. Even the tendency to suggest multiple alternative approaches, which could seem like overhead, often led to more robust and well-considered solutions.</p>
<p>These lessons extend beyond just working with AI. Many of the patterns that emerged - clear documentation, systematic problem-solving, thorough consideration of edge cases - represent solid software development practices in any context. The AI collaboration simply made their value more apparent and their implementation more systematic.</p>
<p>The key to successful AI collaboration lies in treating it as a partner rather than just a tool. This means:</p>
<ul>
<li>Starting with clear requirements and context</li>
<li>Documenting decisions and rationale in real-time</li>
<li>Using testing as a design tool</li>
<li>Embracing systematic thinking for edge cases</li>
<li>Maintaining focus on simplicity and maintainability</li>
</ul>
<p>The complete source code for the project is available on <a href="https://github.com/vr000m/varunsingh.net">GitHub</a>. 🌟</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[cloudflare]]></category>
      <category><![CDATA[typescript]]></category>
      <category><![CDATA[ai]]></category>
      <category><![CDATA[claude]]></category>
      <category><![CDATA[development]]></category>
    </item>
    <item>
      <title><![CDATA[Stages of AI: thinking where we are heading in 2025]]></title>
      <link>https://varunsingh.net/post/ai-stages-2024-status</link>
      <guid isPermaLink="true">https://varunsingh.net/post/ai-stages-2024-status</guid>
      <pubDate>Mon, 30 Dec 2024 00:00:00 GMT</pubDate>
      
      <description><![CDATA[I think AI is going through the following stages: chatbots, assistants, co-pilots, agents or autobots]]></description>
      <content:encoded><![CDATA[<p><strong>TL;DR: AI has rapidly evolved from basic chatbots to today's 'agents' that can execute tasks. The post charts this progression towards a potential 'autobot' stage in 2025—fully autonomous AIs capable of independent, interactive decision-making—and highlights the crucial challenge of ensuring their actions align with human values.</strong></p>
<h2>The Evolution of AI Chatbots</h2>
<p>Over the Christmas break, I was vibe coding this site, <a href="http://varunsingh.net">varunsingh.net</a>, and based on my experience with <a href="http://pipecat.ai">pipecat.ai</a>, I started to think about agents more concretely. I think AI is going through the following stages:</p>
<ul>
<li>chatbots (pre-2020, a chat interface responding to most common questions)</li>
<li>assistants (2019 - soon replaced by agents, LLM-powered chatbots)</li>
<li>co-pilots (2020 - human-in-the-loop, LLM-powered chatbots)</li>
<li>agents (2023 - LLM with access to knowledge-base, APIs, databases, etc.)</li>
<li>autobots, or a better name (2025 - agents that can interact with other agents and take actions outside their sandbox)</li>
</ul>
<p><img src="/static/blog/2024/ai-flower-social-glow.jpg" alt="AI chatbot evolution"></p>
<h2>Understanding Each Stage</h2>
<p>Chatbots were simple, rules-based request-and-response bots; this was before we had LLMs.</p>
<p>Assistants are chatbots that became more reliable, similar to GPT 3.5/ChatGPT. They have an inherent understanding of language and can string together compelling statements based on their training. With the help of Retrieval-Augmented Generation (RAG) and vector databases, we can add use-case or customer-specific knowledge bases that the LLM can collate to form the response.</p>
<p>Co-pilots are, as the word suggests, assistants that have some persistence, i.e., they monitor the actions the user is taking and provide guidance based on those actions. In coding, we have GitHub Copilot, Cursor, and other tools that are able to provide code suggestions based on the context of the code. In healthcare, there are several tools that doctors and medical providers are using to summarise patient notes, provide recommendations, and provide reminders. Lastly, customer support agents get pings from the AI co-pilot while they are conversing with the end-user or working on a ticket.</p>
<p>Agents are the obvious next step: give the co-pilot or assistant the ability to take actions, i.e., the user of the agent is <strong>moving from providing instructions to describing outcomes</strong>. This is a big shift that we are seeing with code generation, but I can easily see it happening elsewhere, such as sales and CRMs, revenue recovery, and simple support actions.</p>
<h2>The Next Frontier: Autobots</h2>
<p>Lastly, the autobots, which I think some people call auto agents, i.e., agents that can interact with other agents and take actions outside of their sandbox. In the above CRM example, we may have an artificial boundary that a CRM application may not automatically terminate an unpaid account with dues accrued over several months. In the agents stage, it might notify a human that the account has been in revenue recovery for a few months and delegate the termination decision to the human in the loop. In the autobot phase, it may decide between terminating access and sending an extra set of reminder emails based on the value of the account. What we need to think about is how to ensure that autobots make decisions aligned with human values when they operate independently.</p>
<p>Going into 2025, we are definitely in the agents phase. The question is: will we make autobots this year?</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[ai]]></category>
      <category><![CDATA[industry]]></category>
      <category><![CDATA[future]]></category>
    </item>
    <item>
      <title><![CDATA[The Re-emergence of SIP: How Voice AI Brought Back the Beast]]></title>
      <link>https://varunsingh.net/post/sip-re-emergence-with-voice-ai</link>
      <guid isPermaLink="true">https://varunsingh.net/post/sip-re-emergence-with-voice-ai</guid>
      <pubDate>Mon, 15 Jul 2024 00:00:00 GMT</pubDate>
      
      <description><![CDATA[SIP is back from the sidelines as Voice AI transforms contact centres. The legacy protocol we thought WebRTC would replace is now central to the voice bot revolution.]]></description>
      <content:encoded><![CDATA[<p><strong>TL;DR:</strong> SIP—the complex telephony protocol we thought WebRTC would retire—has re‑emerged as the backbone of Voice AI bots in contact centres, dragging legacy headaches like unencrypted media and low‑quality audio with it, so we must relearn those 1990s quirks and patch them fast.</p>
<p><img src="/static/blog/2024/ai-shapes-social-glow.jpg" alt="SIP and Voice AI converging"></p>
<p>I genuinely thought we'd left SIP behind. When we built WebRTC in the early 2010s, it felt like we were creating a cleaner, more modern path forward for real-time communications. Yet here we are in 2024, and SIP has returned from the periphery to claim centre stage once again. The catalyst? Voice AI.</p>
<h3>The Legacy Beast</h3>
<p>SIP connected computer systems to telephony networks from the late 1990s through the 2000s. It became the foundation of everything from office PBX systems to 3G IMS architecture. At the time, it was revolutionary—the easiest way to bridge the gap between traditional telephony and computer systems.</p>
<p>But &quot;easiest&quot; came with a price. SIP sprawled across hundreds of RFCs, each addressing different use cases. Need to implement muting in a conference and signal that across to everyone? There's probably a couple of RFCs for that. The real complexity emerged when different vendors implemented similar features using completely different approaches. That made sense at the time: they were competing on time to market and then bringing what they had built to the standards bodies. Take DTMF tones as an example: there are three standardised ways to send them. These are in‑band audio, RFC 4733 telephone‑events (often still called “RFC 2833”), and SIP INFO messages. Three! Each one is equally valid, which means any serious Voice AI implementation needs to support all three for broad compatibility.</p>
<p>The protocol became a testament to the <a href="https://xkcd.com/927/">xkcd meme</a>: the best thing about standards is that there is always N+1 (the one being your way of doing it versus the others 😆).</p>
<h3>The WebRTC Promise</h3>
<p>When WebRTC emerged, we'd learnt from SIP and carved out a cleaner path with standardised APIs and a more focused feature set, a WebRTC &quot;profile&quot; if you will. For the past 15 years, this vision seemed to be playing out. WebRTC powered the explosion of video calling applications. Modern contact centres powered by WebRTC emerged, mainly waiting for legacy devices to be obsoleted and contact centre operators moving away from CAPEX (buy telephony hardware) to OPEX (buy seats on a CCaaS, CPaaS, or XCaaS).</p>
<p>SIP retreated to the edges. It was still there for legacy integrations where modern systems needed to connect with traditional telephony infrastructure, but it was surely fading away as those systems modernised.</p>
<h3>The Voice AI Revolution</h3>
<p>Voice AI is emerging and changing everything.</p>
<p>Contact centres are embracing voice bots at an unprecedented pace. These AI agents are replacing humans in numerous workflows—from basic customer service queries to complex multi-stage workflows. But here's the catch: these bots can only communicate with customers through existing telephony infrastructure. And what powers that infrastructure? SIP.</p>
<p>Suddenly, SIP is essential. Every voice AI company building for the enterprise market needs SIP. The protocol we thought we'd deprecated has become the gateway to one of the most exciting areas of AI development.</p>
<p>Most production voice bots still run a three‑stage pipeline—streaming automatic‑speech‑recognition, an LLM for response, and text‑to‑speech for reply—which adds roughly 300 ms of latency without the VAD and networking delay. Emerging speech‑to‑speech models preserve speaker prosody, but they hide the intermediate text, making debugging and compliance logging trickier. Either way, low‑latency hand‑off to the PSTN depends on SIP-routing behaving itself.</p>
<p>This resurrection brings all of SIP's historical baggage back to the forefront. We've grown used to the Opus codec delivering crystal-clear 48 kHz audio with built-in error resilience. Now we're back to G.711's 8 kHz sampling rate, audio quality that grates on the modern ear. Although wide‑band codecs such as G.722 and even Opus wrapped in RTP are widely implemented on modern SBCs, patchy carrier support often forces negotiations back down to G.711, keeping audio quality firmly in narrow‑band territory.<br>
When a bot’s 16 kHz or 24 kHz synth is transcoded down to 8 kHz G.711, some of the intelligibility and emotion vanish, which is why landing even on G.722 can feel like night and day.</p>
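<p>The fallback to G.711 shows up directly in the SDP offer. As a minimal sketch, the function below picks the widest-band codec a single audio offer contains. The name <code>pickAudioCodec</code> and the preference order are assumptions for illustration; the static payload types (0 for PCMU, 8 for PCMA, 9 for G722) come from RFC 3551, and the sketch ignores offers with more than one audio section.</p>

```javascript
// Preference order is an assumption for illustration: widest band first.
const PREFERRED_CODECS = ['opus', 'G722', 'PCMU', 'PCMA'];
// Static payload types from RFC 3551, which may appear without an rtpmap line.
const STATIC_PAYLOAD_TYPES = { 0: 'PCMU', 8: 'PCMA', 9: 'G722' };

function pickAudioCodec(sdp) {
  const lines = sdp.split(/\r?\n/);
  const media = lines.find((l) => l.startsWith('m=audio '));
  if (!media) return null;

  // m=audio PORT PROTO PT PT PT ...
  const payloadTypes = media.split(' ').slice(3);
  const names = {};
  for (const pt of payloadTypes) {
    if (STATIC_PAYLOAD_TYPES[pt]) names[pt] = STATIC_PAYLOAD_TYPES[pt];
  }
  // Explicit rtpmap lines name dynamic payload types such as Opus.
  for (const l of lines) {
    const m = l.match(/^a=rtpmap:(\d+) ([^/]+)\//);
    if (m && payloadTypes.includes(m[1])) names[m[1]] = m[2];
  }

  for (const wanted of PREFERRED_CODECS) {
    const pt = payloadTypes.find(
      (p) => (names[p] || '').toLowerCase() === wanted.toLowerCase()
    );
    if (pt !== undefined) return { payloadType: Number(pt), codec: names[pt] };
  }
  return null;
}
```

<p>An offer that lists only payload types 0 and 8 leaves this function nothing better than PCMU, which is the narrow-band outcome described above.</p>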
<p>Encryption, or its absence, rarely features in PSTN conversations, yet WebRTC pipelines refuse null ciphers. As a WebRTC developer working with legacy SIP and the PSTN, you must accommodate three modes: plain RTP for PSTN, SDES‑keyed SRTP for legacy SIP, and DTLS‑SRTP for WebRTC. SRTP is defined for SIP, but carrier hops almost never preserve it end‑to‑end, so voice bots usually land on plain RTP. SIP over TLS (SIPS) protects signalling, but the media plane usually falls back to plain RTP.</p>
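<p>These three modes are distinguishable from the SDP alone. The sketch below classifies an offer by its transport profile and keying attributes; <code>classifyMediaSecurity</code> and its return labels are hypothetical names, and the check is deliberately simplified to a single audio section. The profile strings and the <code>a=crypto</code> and <code>a=fingerprint</code> attributes are the standard ones for SDES and DTLS-SRTP respectively.</p>

```javascript
// Hypothetical classifier for the media security mode of an SDP offer.
function classifyMediaSecurity(sdp) {
  const lines = sdp.split(/\r?\n/);
  const media = lines.find((l) => l.startsWith('m=audio '));
  if (!media) return 'no-audio';

  const proto = media.split(' ')[2]; // m=audio PORT PROTO ...
  const hasFingerprint = lines.some((l) => l.startsWith('a=fingerprint:'));
  const hasCrypto = lines.some((l) => l.startsWith('a=crypto:'));

  // DTLS-SRTP: keys negotiated in a DTLS handshake, certificate pinned by fingerprint.
  if (proto.startsWith('UDP/TLS/RTP/SAVP') && hasFingerprint) return 'dtls-srtp';
  // SDES: SRTP keys carried inline in the SDP itself.
  if (proto.startsWith('RTP/SAVP') && hasCrypto) return 'sdes-srtp';
  // Plain RTP: no media encryption at all.
  if (proto === 'RTP/AVP' || proto === 'RTP/AVPF') return 'plain-rtp';
  return 'unknown';
}
```

<p>A gateway between WebRTC and a SIP trunk would typically terminate one mode and originate another, so a check like this is a natural first step before deciding how to bridge the two legs.</p>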
<p>And then there's DTMF (&quot;Press 1 for English, 2 for Spanish&quot;). Those three different implementation methods I mentioned? They're not just academic concerns. Voice bots, or the infrastructure in front of them, need to reliably detect when users press phone keys, whether for authentication, menu navigation, or input capture (think &quot;Enter your social security number&quot;). Missing or misinterpreting a DTMF tone isn't just a bug; it's a failed customer interaction. So, as Voice AI infrastructure providers, or as CPaaS and CCaaS vendors, we need to support all three methods.</p>
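<p>Of the three methods, RFC 4733 telephone-events are the easiest to show in code, because each event is a fixed four-byte RTP payload: an event code, an end bit plus a 6-bit volume, and a 16-bit duration. The sketch below decodes that payload; the function name <code>parseTelephoneEvent</code> and the returned object shape are assumptions for illustration, and it only covers the sixteen DTMF event codes (0 to 15).</p>

```javascript
// Event codes 0-15 map to the DTMF keys in this order (RFC 4733).
const DTMF_KEYS = '0123456789*#ABCD';

// payload: the RTP payload bytes of a telephone-event packet (Uint8Array or array).
function parseTelephoneEvent(payload) {
  if (payload.length < 4) return null;
  const event = payload[0];
  if (event >= DTMF_KEYS.length) return null; // not a DTMF event

  return {
    digit: DTMF_KEYS[event],
    end: (payload[1] & 0x80) !== 0,          // E bit: this packet ends the event
    volume: payload[1] & 0x3f,                // tone power in dBm0, sign dropped
    duration: payload[2] * 256 + payload[3],  // in RTP timestamp units
  };
}
```

<p>With the usual 8000 Hz telephone-event clock, a duration of 800 units is a 100 ms key press. End packets are normally retransmitted, so a real receiver would also deduplicate events by RTP timestamp before acting on them.</p>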
<p>These quirks multiply quickly. Early media behaviour varies wildly between carriers. Some send audio before the call officially connects; others don't. Some honour specific headers; others ignore them. Testing becomes a nightmare of edge cases and carrier-specific workarounds.</p>
<h3>Old Challenges, Novel Solutions: AMD</h3>
<p>Interestingly, the Voice AI era has introduced problems that traditional telephony never properly solved. Voicemail detection stands out as particularly thorny. When a voice bot makes an outbound call, it needs to determine whether it's reached a human or a voicemail system. Current Answering Machine Detection (AMD) systems from CPaaS vendors are notoriously unreliable. But here's where things get interesting: LLMs might actually be quite good at this. Instead of relying on simplistic audio analysis, an LLM can understand context and content. Is the voice saying &quot;Hello?&quot; or &quot;Hi, you've reached John's voicemail&quot;? For an LLM, that's a straightforward classification problem. I suspect we'll see voicemail detection become a solved problem through better prompting that leans into an LLM's probabilistic nature rather than a deterministic AMD algorithm. It's an elegant example of how AI can easily solve a technically complex problem of the past.</p>
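<p>Framed as classification, the prompt-based approach might look something like the sketch below. Everything here is hypothetical: the prompt wording, the label set, and the function names <code>buildAmdPrompt</code> and <code>parseAmdLabel</code> are assumptions, and the actual LLM call is left out because it depends on whichever client is in use.</p>

```javascript
// Hypothetical label set for answering machine detection.
const AMD_LABELS = ['human', 'voicemail', 'unsure'];

// Builds a classification prompt from the first transcribed words of the callee.
function buildAmdPrompt(transcript) {
  return [
    'You are screening the first seconds of an outbound phone call.',
    'Classify the transcript of the callee as exactly one word:',
    AMD_LABELS.join(', ') + '.',
    `Transcript: "${transcript}"`,
  ].join('\n');
}

// Maps a model reply onto a known label; anything unexpected becomes "unsure".
function parseAmdLabel(reply) {
  const label = String(reply).trim().toLowerCase().replace(/[^a-z]/g, '');
  return AMD_LABELS.includes(label) ? label : 'unsure';
}
```

<p>Keeping an explicit &quot;unsure&quot; label lets the bot fall back to a safe behaviour, such as waiting for more audio, instead of guessing.</p>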
<h2>The Path Forward</h2>
<p>As much as I might sigh about SIP's return, I'm optimistic about where this leads. Yes, we'll need to retrain a generation of engineers on protocols they never expected to learn. Outbound traffic must also satisfy STIR/SHAKEN caller‑ID attestation; unsigned AI calls risk displaying “Spam Likely” on modern handsets. Yes, we'll spend countless hours debugging carrier-specific behaviours and codec negotiations.</p>
<p>But the end result—voice bots that actually work—will transform customer experiences. Today's IVR trees are universally despised. They trap customers in rigid menu structures, forcing them to navigate byzantine option trees just to reach a human. Voice AI promises natural conversations that understand intent and resolve issues efficiently.</p>
<p>The irony isn't lost on me. We're using cutting-edge AI technology built on top of a protocol designed when dial-up modems were cutting-edge. But perhaps that's the nature of real technological progress—not always replacing the old, but finding new ways to make it valuable again.</p>
<p>SIP is back. Time to dust off those RFCs.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[voice ai]]></category>
      <category><![CDATA[sip]]></category>
      <category><![CDATA[telephony]]></category>
      <category><![CDATA[webrtc]]></category>
    </item>
    <item>
      <title><![CDATA[Why Voice AI Does Not Need SFUs]]></title>
      <link>https://varunsingh.net/post/why-voice-ai-doesnt-need-sfus</link>
      <guid isPermaLink="true">https://varunsingh.net/post/why-voice-ai-doesnt-need-sfus</guid>
      <pubDate>Sun, 23 Jun 2024 00:00:00 GMT</pubDate>
      
      <description><![CDATA[How direct peer-to-peer connections can reduce latency and improve Voice AI experiences by eliminating unnecessary infrastructure]]></description>
      <content:encoded><![CDATA[<p><strong>TL;DR:</strong> Voice AI applications should bypass SFUs and connect directly to voice bots via WebRTC, reducing jitter buffer delays and simplifying infrastructure whilst leveraging WebRTC's battle-tested last-mile optimisations.</p>
<p><img src="/static/blog/2024/ai-p2p4121-social-glow.jpg" alt="Voice AI P2P Architecture"></p>
<p>Recently, in conversation with Emil Ivov, we discussed whether SFUs are needed. He builds and operates one at <a href="https://jitsi.org/">Jitsi</a>, while at <a href="http://Daily.co">Daily.co</a>, we do the same. We recently open-sourced <a href="https://pipecat.ai">Pipecat.ai</a>, which made this conversation more relevant. Foreshadowing: we largely agreed that architecture changes are afoot.</p>
<p>The rise of Voice AI has brought new attention to an old debate in real-time communications: when do we actually need a Selective Forwarding Unit (SFU)? As someone who's spent years optimising WebRTC infrastructure, I've watched the pendulum swing from peer-to-peer (P2P) to SFU and now, I think, back to P2P. With Voice AI reshaping how we think about real-time communication, it's time to reconsider whether SFUs are truly necessary for every use case.</p>
<h3>The Traditional Role of SFUs</h3>
<p>Selective Forwarding Units emerged with group video calling. In a typical scenario, each participant sends multiple video streams at different quality levels, what we call simulcast. Each stream varies in target bitrate because they have different frame rates and resolutions, simplifying congestion control for the sender, and allowing the SFU to forward the appropriate quality to each receiver based on their bandwidth capacity. For a five-person video call, this architecture makes perfect sense: the SFU acts as an intelligent traffic router, ensuring everyone gets the best possible experience without overwhelming anyone's connection.</p>
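<p>The forwarding decision itself is small. As a minimal sketch, the function below picks the highest simulcast layer that fits a receiver's estimated bandwidth, falling back to the lowest layer when nothing fits. The name <code>selectLayer</code>, the layer names, and the bitrates are hypothetical values for illustration, not figures from any particular SFU.</p>

```javascript
// Hypothetical simulcast layers published by one sender.
const simulcastLayers = [
  { name: 'low', bitrateKbps: 150 },
  { name: 'mid', bitrateKbps: 500 },
  { name: 'high', bitrateKbps: 1500 },
];

// Picks the highest-bitrate layer that fits the receiver's available bandwidth.
function selectLayer(layers, availableKbps) {
  const sorted = [...layers].sort((a, b) => a.bitrateKbps - b.bitrateKbps);
  let chosen = sorted[0]; // fall back to the lowest layer if nothing fits
  for (const layer of sorted) {
    if (layer.bitrateKbps <= availableKbps) chosen = layer;
  }
  return chosen;
}
```

<p>The SFU runs a decision like this per receiver, which is why the sender never has to adapt its encoding to the slowest participant.</p>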
<p>But here's the thing: not all calls involve five people. In fact, the vast majority of WebRTC sessions are between just two participants. This is where the story gets interesting.</p>
<h3>WebRTC P2P4121</h3>
<p>Many communication platforms developed what Emil Ivov (from Jitsi) called the <a href="https://jitsi.org/blog/p2p4121/">WebRTC P2P4121</a> feature, peer-to-peer connections for one-to-one communication. The logic was sound: why route traffic through a server when two endpoints can communicate directly? The server's bandwidth savings alone made this attractive, not to mention the potential latency improvements.</p>
<p>Yet P2P had stumbling blocks: corporate firewalls and mobile networks threw up barriers that often required TURN servers to relay traffic anyway. If you're already maintaining TURN infrastructure to get through NATs and firewalls, the argument went, why not consolidate everything through SFUs? You'd have one infrastructure to scale and maintain instead of two.</p>
<p>This consolidation made sense in the pre-Voice AI era. The landscape also shifted dramatically during the COVID pandemic, when last-mile issues became more pronounced and routing through SFUs, even for two-person calls, became more of the norm.</p>
<h3>Why Voice AI Changes Everything</h3>
<p>Voice AI presents a fundamentally different communication pattern. When a human speaks to an AI agent, we're not dealing with a symmetric conversation between two endpoints behind unpredictable NATs. Instead, we have an AI agent running on a server with a public IP address and, in most use cases, no immediate need for multi-party capabilities. We still have last-mile issues with the human participant, so we should use WebRTC, but prefer P2P connections.</p>
<p>Think about it: a Voice AI agent, such as one built with <a href="https://pipecat.ai">Pipecat</a>, runs on infrastructure you control, so firewalls on the agent side are not a real issue. Routing through an SFU adds hop-by-hop latency, but SFUs and AI agents are often co-located in the same region of the same cloud provider, so the additional latency might only be a few milliseconds. <a href="http://Daily.co">Daily.co</a>, for example, supports P2P connections and is available in 40+ regions across two cloud providers. My concern is not so much the latency of the extra hop as the additional jitter buffer that each hop may add in the worst-case scenario.</p>
<p>Every server in your media path maintains its own jitter buffer to handle out-of-order packets. When a packet with a higher sequence number than expected arrives, the server waits to determine whether the missing packets are lost or merely late. This slightly delays the packet. At the next hop, the same process repeats, but in this case the endpoint may either drop the packet because it is past its playout time or play it out correctly. In the worst case, the jitter buffers interact poorly, each adding delay as it attempts to smooth out network inconsistencies. Thus, having just one jitter buffer, at the endpoints, is better.</p>
<p>Consider a scenario where network conditions cause packets to arrive slightly out of order. The SFU's jitter buffer holds packets for, say, 40 milliseconds to reorder them. Then those packets travel to the endpoints, where network jitter causes another reordering delay. Suddenly, you've added 40-60 milliseconds to your end-to-end latency, not from transmission time, but from buffering. The actual buffering delay depends on the NetEq implementation. Typically, there are high and low watermarks that control the amount of buffering. The high watermark is the maximum amount of buffering allowed for error resilience (retransmissions, FEC), while the low watermark is the minimum amount of buffering needed to ensure smooth playout (below this, the buffer underruns and you have no audio to play back).</p>
<p>A direct P2P connection eliminates this redundancy. The AI agent's and the human participant's WebRTC stacks handle all the jitter compensation at the endpoints, making decisions with full visibility into the end-to-end connection quality.</p>
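<p>The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope model under my own assumptions, not how NetEq or any specific SFU computes its delay: the 20 ms endpoint hold and the watermark values are made up for illustration, and only the 40 ms SFU hold comes from the scenario above.</p>

```python
def added_buffering_ms(hop_holds_ms):
    """Worst-case delay added by buffering alone.

    Each hop may hold a packet for up to its own reordering wait,
    so in the worst case the waits simply add up.
    """
    return sum(hop_holds_ms)


def clamp_target_delay(estimated_jitter_ms, low_watermark_ms, high_watermark_ms):
    """Keep the buffer's target delay between the two watermarks.

    Below the low watermark the buffer risks underrun; above the high
    watermark it adds more delay than error resilience needs.
    """
    return max(low_watermark_ms, min(estimated_jitter_ms, high_watermark_ms))


if __name__ == "__main__":
    sfu_hold_ms = 40       # from the scenario in the text
    endpoint_hold_ms = 20  # assumed worst-case endpoint reordering wait

    via_sfu = added_buffering_ms([sfu_hold_ms, endpoint_hold_ms])
    direct = added_buffering_ms([endpoint_hold_ms])
    print("via SFU:", via_sfu, "ms; direct P2P:", direct, "ms")

    # Assumed watermarks of 20 ms and 120 ms.
    for jitter in (10, 60, 500):
        print(jitter, "->", clamp_target_delay(jitter, 20, 120))
```

<p>The point of the model is only that the SFU's hold is pure overhead: the endpoint has to run its own buffer either way.</p>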
<p>The good news is that most modern WebRTC platforms are beginning to recognise this nuance. Daily's WebRTC transport, for instance, supports starting calls as P2P connections and seamlessly upgrading to SFU routing when a third participant joins. This hybrid approach gives you the best of both worlds: optimal performance for two-party conversations and the scalability of SFUs when needed (a third participant joins, recording starts, etc.).</p>
<p>This seamless transition is crucial for Voice AI applications. Imagine a customer service scenario where an AI agent handles initial queries via P2P, then smoothly brings in a human supervisor when needed. The infrastructure adapts to the use case rather than forcing all conversations through the same architectural pattern.</p>
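<p>As a rough sketch of that hybrid rule, assuming the application tracks the participant count and whether recording is on: the function and enum names below are hypothetical and are not part of Daily's or Pipecat's API.</p>

```python
from enum import Enum


class Topology(Enum):
    P2P = "p2p"
    SFU = "sfu"


def choose_topology(participant_count, recording=False, p2p_failed=False):
    """Prefer P2P for two-party calls; use the SFU when it is actually needed."""
    if participant_count > 2 or recording or p2p_failed:
        return Topology.SFU
    return Topology.P2P


def needs_migration(current, participant_count, recording=False):
    """True when a call currently on P2P should be upgraded to SFU routing."""
    wanted = choose_topology(participant_count, recording)
    return current is Topology.P2P and wanted is Topology.SFU


if __name__ == "__main__":
    # Human talking to an AI agent: stay on P2P.
    print(choose_topology(2))
    # A supervisor joins the same call: upgrade.
    print(needs_migration(Topology.P2P, 3))
```

<p>The sketch only covers the upgrade direction described above. Whether to move back to P2P when the third participant leaves is a separate design decision.</p>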
<h3>Implementation Considerations</h3>
<p>When implementing P2P connections for Voice AI, consider these factors:</p>
<p>Direct WebRTC connections to AI agents require proper signalling infrastructure. Your AI agent needs to handle WebRTC negotiation directly, which platforms like Pipecat already support. This isn't significantly more complex than SFU integration, but it does require thinking about your architecture differently.</p>
<p>Monitor your connection success rates carefully. While AI agents on public IPs should have high P2P success rates, some client networks might still pose challenges. Have a fallback strategy, whether that's TURN servers or SFU routing, for the small percentage of connections that can't establish P2P.</p>
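<p>One simple way to watch this is to count how each connection attempt ended. The sketch below assumes your application already classifies each attempt; the labels <code>direct</code>, <code>relay</code> and <code>failed</code>, the 95% threshold, and the minimum sample size are placeholders of mine to tune for your own traffic.</p>

```python
from collections import Counter


class ConnectionStats:
    """Counts how each connection attempt ended."""

    def __init__(self):
        self.outcomes = Counter()

    def record(self, outcome):
        # Hypothetical labels: "direct" (P2P), "relay" (via TURN), "failed".
        self.outcomes[outcome] += 1

    def total(self):
        return sum(self.outcomes.values())

    def rate(self, outcome):
        total = self.total()
        return self.outcomes[outcome] / total if total else 0.0


def should_alert(stats, min_direct_rate=0.95, min_samples=10):
    """Flag when too few attempts end up on a direct P2P path."""
    if min_samples > stats.total():
        return False  # not enough data to judge yet
    return min_direct_rate > stats.rate("direct")


if __name__ == "__main__":
    stats = ConnectionStats()
    for outcome in ["direct"] * 8 + ["relay", "failed"]:
        stats.record(outcome)
    print("direct rate:", stats.rate("direct"))
    print("alert:", should_alert(stats))
```

<p>Tracking relayed and failed attempts separately matters: a relayed call still works, while a failed one needs the fallback path.</p>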
<p>Design your system to handle transitions gracefully. If you start with P2P and need to add participants later, ensure your application can migrate to SFU routing without disrupting the user experience.</p>
<h3>Looking Forward</h3>
<p>The key insight for Voice AI developers is this: use WebRTC for what it does best—handling last-mile networking challenges—without automatically adopting the full SFU-centric architecture that evolved for different use cases.</p>
<p>WebRTC gives you a congestion control algorithm implemented in each endpoint, echo cancellation and noise suppression, and NAT traversal capabilities when needed.</p>
<p>You don't need to reinvent these wheels. But you also don't need to route every packet through an SFU just because that's become the default architecture for video conferencing.</p>
<p>There's also the elephant in the room that I've deliberately avoided until now: SIP. The resurrection of SIP in modern communications infrastructure adds another dimension to this discussion. But that's a topic that deserves its own deep dive—perhaps in a future post.</p>
]]></content:encoded>
      <dc:creator><![CDATA[Varun Singh]]></dc:creator>
      <category><![CDATA[webrtc]]></category>
      <category><![CDATA[voice ai]]></category>
      <category><![CDATA[networking]]></category>
    </item>
  </channel>
</rss>