Varun Singh's Blog

Reviewing Code and Spec Compliance with Skills

Varun Singh — Sun, 15 Mar 2026 00:00:00 GMT

We recently got PR #3859, which adds SIP to pipecat.ai, and I did a quick review based solely on memory from my past implementations of the RTP spec. Working in a field for 20 years means we have rewritten these protocols a few times and some things are top of mind. However, LLMs are reasoning engines and they can comb through both the specifications and the code side-by-side very quickly. What they can struggle with is hidden assumptions and gotchas. Those come from experience.

On a whim, I wrote up two skills: /rfc-finder and /spec-compliance:

/rfc-finder is a discovery tool. Given a topic, protocol name, or even a code snippet like sendNack(), it searches IETF Datatracker and the RFC Editor to find the related specifications. It traces draft-to-RFC lineages (important because IETF drafts get renamed when they graduate), checks obsolescence chains (so you do not cite a spec that was superseded a decade ago, e.g., DTMF/RFC4733 instead of RFC2833), and ranks results by how foundational each one is. Under the hood it runs WebSearch against datatracker.ietf.org and rfc-editor.org, then uses WebFetch to verify metadata — status, "Obsoleted by" relationships, section numbers. It returns annotated links, not summaries: the spec itself is the source of truth.
/spec-compliance is an auditing tool. Given a code file and a specific spec section (e.g., RFC 3550 Section 5.1), it fetches that section, extracts every normative statement — the RFC 2119 keywords (MUST, SHOULD, MAY, in all-caps) — then reads the code and classifies each requirement as Met, Missing, Partial, or N/A with line-number evidence. Under the hood it uses WebFetch to pull the spec section, Read to load the source, and Grep to search adjacent modules and tests before marking something as missing. The output is a structured compliance report with a summary table.

The two skills are designed to chain: /rfc-finder answers "what specs apply here?" and /spec-compliance answers "does the code actually follow them?"
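
To make the "extract every normative statement" step concrete, here is a minimal sketch of how the keyword scan could look. It is not the skill's actual implementation (the skill is a prompt that drives WebFetch, Read, and Grep); the function name extract_normative and the naive sentence splitter are assumptions for illustration. The keyword list is the one from RFC 2119 plus "NOT RECOMMENDED" from RFC 8174.

```python
import re

# Requirement keywords from RFC 2119 / RFC 8174. Longer phrases come first so
# that "MUST NOT" is not reported as a bare "MUST".
KEYWORDS = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT", "NOT RECOMMENDED",
    "MUST", "REQUIRED", "SHALL", "SHOULD", "RECOMMENDED", "MAY", "OPTIONAL",
]
# Case-sensitive on purpose: only the all-caps forms are normative.
KEYWORD_RE = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b")


def extract_normative(text: str) -> list:
    """Return (keyword, sentence) pairs for sentences containing a keyword.

    Hypothetical helper: the sentence split is naive and will stumble on
    abbreviations and section numbers in real RFC text.
    """
    flattened = " ".join(text.split())  # undo hard line wrapping
    sentences = re.split(r"(?<=[.!?])\s+", flattened)
    found = []
    for sentence in sentences:
        match = KEYWORD_RE.search(sentence)
        if match:
            found.append((match.group(1), sentence))
    return found
```

Each pair would then become one row of the compliance report, to be classified as Met, Missing, Partial, or N/A against the code.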

PR #3859, which adds a FreeSWITCH SIP/RTP transport to Pipecat, was a good test for these skills. It is a protocol implementation PR, which means every file maps onto a specific IETF specification. /rfc-finder looked at the files and inferred that the PR implements a minimal SIP UAS transport (RFC 3261) with G.711 codecs, RTP packetization, SDP offer/answer, and RFC 2833 DTMF. Here are the relevant RFCs found by /rfc-finder:

File	Protocol area	Primary spec
`codecs.py`	G.711 μ-law encode/decode	ITU-T G.711 (LUT tables), RFC 3551 §4.5.14 (PCMU PT)
`rtp.py`	RTP header packing, 20ms send loop	RFC 3550 §5.1 and §5.3, no §6
`rtp.py`	DTMF detection	RFC 4733 §2.3 (PT) even though the code said RFC 2833
`sdp.py`	SDP generation and parsing, offer/answer	RFC 8866 (fields), RFC 3264 §5 (Offer) and §6 (Answer)
`signaling.py`	SIP message parsing, request/response building	RFC 3261 §13 (INVITE dialog), §15 (BYE), §17 (Transactions), no REGISTER, no CANCEL. RFC 6337 (SIP O/A)

This is the ideal shape for spec-compliance analysis: the code is explicitly implementing wire protocols and every function has a normative "should behave like X" defined somewhere in a standards document. The question is whether it is actually a better review tool than the generic /review tools provided by the coding agents.

What follows are the findings from the skills analysing all the files listed above. I first test-ran this on rtp.py and shared the feedback; the developer acknowledged it and pushed a fix. This was a useful iteration!

Thanks for running the spec compliance check @vr000m! Pushed a commit addressing the three RFC 3550 §5.1 partial-compliance items:

Unknown payload types (MUST) — _handle_packet() now ignores packets with PTs other than PCMU (0) and DTMF (101), instead of blindly decoding as G.711.

SSRC collision detection (MUST) — If an incoming packet carries our own SSRC, we regenerate. Minimal implementation suitable for 1:1 SIP calls.

New SSRC on address change (SHOULD) — start() now regenerates the SSRC when the remote transport address changes.
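
The three fixes described in that comment can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class RtpSession and its method names are hypothetical, and only the 12-byte fixed header layout (RFC 3550 §5.1) and the payload types 0 and 101 come from the text above.

```python
import os
import struct

PT_PCMU = 0    # static payload type for G.711 mu-law (RFC 3551)
PT_DTMF = 101  # telephone-event payload type quoted in the PR comment


def parse_rtp_header(packet: bytes) -> dict:
    """Parse the 12-byte fixed RTP header described in RFC 3550 section 5.1."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": first >> 6,
        "csrc_count": first & 0x0F,
        "marker": second >> 7,
        "payload_type": second & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


class RtpSession:
    """Hypothetical receiver state for a 1:1 call."""

    def __init__(self):
        self.local_ssrc = int.from_bytes(os.urandom(4), "big")
        self.remote = None

    def _regenerate_ssrc(self):
        # Loop so the new identifier is guaranteed to differ from the old one.
        new = self.local_ssrc
        while new == self.local_ssrc:
            new = int.from_bytes(os.urandom(4), "big")
        self.local_ssrc = new

    def set_remote(self, addr):
        # New SSRC when the remote transport address changes (SHOULD).
        if self.remote is not None and addr != self.remote:
            self._regenerate_ssrc()
        self.remote = addr

    def handle_packet(self, packet: bytes):
        header = parse_rtp_header(packet)
        if header["version"] != 2:
            return None
        # Unknown payload types are ignored instead of decoded as G.711 (MUST).
        if header["payload_type"] not in (PT_PCMU, PT_DTMF):
            return None
        # Minimal collision handling: someone else is using our SSRC (MUST).
        if header["ssrc"] == self.local_ssrc:
            self._regenerate_ssrc()
        return header
```

Whether a colliding packet should also be dropped is a design choice this sketch leaves open; it only regenerates the local identifier.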

I eventually ran this on the whole PR, adding the constraint that it is not a universal SIP implementation: the developer has scoped it to supporting only FreeSWITCH, with the SIP servers and Pipecat running on the same subnet, i.e., RTP and SIP connections stay within the local network. Hence there is no RTCP, no ICE/STUN, and no SIP REGISTER. With that constraint in mind, the LLM did not re-flag the deliberate engineering choices that were already known and discussed; instead it reviewed the subsections of the RFCs that were still relevant and important.

File	Requirement	RFC	Status	Notes
rtp.py	Marker bit handling	RFC 3550 §5.1	Not implemented	Marker bit is always 0. RFC 3551 §4.1 says the marker bit SHOULD be set on the first packet after silence suppression. Acceptable for continuous audio.
rtp.py	CSRC handling	RFC 3550 §5.1	N/A	CC=0, no mixers — correct for point-to-point
rtp.py	RTCP	RFC 3550 §6	Not implemented	Explicitly documented as out-of-scope (LAN-only). This is a known deviation. RFC 3550 says RTCP "SHOULD" be used. For LAN-only deployments with FreeSWITCH this is pragmatically fine.
rtp.py	Dynamic PT negotiation via SDP	RFC 4733 §5	Hardcoded	PT=101 is hardcoded, not negotiated from SDP `a=rtpmap:101 telephone-event/8000`. Works with FreeSWITCH defaults but technically should be negotiated.
sdp.py	SDP answer matches offer codecs	RFC 3264 §6.1	Simplified	Answer always offers PCMU regardless of what's in the offer. If the offerer doesn't support PCMU, the call will fail. Acceptable for FreeSWITCH (always supports PCMU).
signaling.py	Branch parameter in Via (UAS BYE)	RFC 3261 §8.1.1.7	Simplified	Uses `z9hG4bK{call_id[:8]}` — the magic cookie is correct but the uniqueness comes from truncated call-id. Should be fine in practice but not cryptographically unique per spec.
signaling.py	Header parsing (case sensitivity)	RFC 3261 §7.3.1	Case-sensitive	The parser does exact-case matching for header names. RFC 3261 says header names are case-insensitive. E.g., `call-id:` would not match `Call-ID`. FreeSWITCH uses standard casing, so this works in practice.
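
Two of the findings in the table, the hardcoded DTMF payload type and the case-sensitive header match, have small fixes. The sketch below is only an illustration under stated assumptions: the function names are hypothetical and not taken from the PR, and the header parser ignores SIP compact forms, folded lines, and repeated headers.

```python
from typing import Optional


def telephone_event_pt(sdp: str, clock_rate: int = 8000) -> Optional[int]:
    """Find the payload type mapped to telephone-event in an SDP body."""
    prefix = "a=rtpmap:"
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        # a=rtpmap:<payload type> <encoding name>/<clock rate>[/<parameters>]
        pt_text, _, encoding = line[len(prefix):].partition(" ")
        name, _, rest = encoding.partition("/")
        rate = rest.split("/")[0]
        if (
            pt_text.isdigit()
            and name.lower() == "telephone-event"
            and rate == str(clock_rate)
        ):
            return int(pt_text)
    return None  # caller decides whether to fall back to a default


def parse_headers(raw: str) -> dict:
    """Parse 'Name: value' lines, keyed by lower-cased header name."""
    headers = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers
```

With lookups done on the lower-cased key, `call-id:` and `Call-ID:` resolve to the same entry, and the DTMF payload type follows whatever the offer declares.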

Without tests or a veteran implementer's eye, the issues above would only have surfaced through rigorous integration and scenario testing. The report took 5 minutes to generate, and an experienced reviewer can still triage and whittle down the list to an actionable implementation plan. The next step is integrating this into the PR review workflow as a GitHub Action so it can be run on protocol-touching PRs.

Comparing gstack to my skill stack

Varun Singh — Sat, 14 Mar 2026 00:00:00 GMT

Garry Tan posted about his gstack, a Claude Code skill framework, which naturally led to comparing it to my own skills.md stack. Here are the patterns worth noting.

Cognitive mode separation: gstack's core insight is that "planning is not review, review is not shipping."
Each slash command is a distinct "brain" optimised for one phase. We already do this with /dev-plan → /review-plan → /fan-out, but gstack takes it further with CEO review, eng review, QA, ship, and retro as separate modes.

SKILL.md template generation from source code: They have a gen-skill-docs.ts that reads code metadata and fills SKILL.md.tmpl placeholders. Generated docs are committed to git, validated in CI (Continuous Integration). I have an /update-docs skill that keeps the docs up to date with the latest code/plan changes, but Garry's implementation is a more robust workflow for a full-stack app.

Persistent browser daemon: The browse/ component is a long-lived Chromium daemon with sub-second latency, an @ref system using accessibility trees, and bearer token auth. We have Chrome DevTools MCP already (a new Chrome release uses your existing Chrome profile), so this is less relevant for us.

Three-tier eval system: Free static validation (parse commands, check schema), real Claude sessions via claude -p, and LLM-as-judge scoring. It is cost-conscious but thorough.

/ship skill with automated commit splitting: Automatically creates logical, bisectable commits ordered by dependency (infra → models → controllers → VERSION/CHANGELOG). This is sophisticated and something for us to consider.

/retro skill: Engineering retrospectives with per-person metrics, session analysis, streak tracking. Interesting for identifying areas for improvement. I think we currently ask Claude, "Based on our recent sessions what are the key improvements that we should make to our skills.md?"

Conductor integration: Parallel Claude sessions with isolated workspaces. Similar to our /fan-out but more infrastructure-focused.

Finally, given my existing skills.md (dev-plan, review-plan, fan-out, deep-review, content-draft, content-review, update-docs, rfc-finder, spec-compliance), the most impactful addition would be /retro. It is perhaps the most novel and the hardest to wrap my head around, but it may be worth prototyping against my current projects. The /ship skill is also very tempting. Automated commit splitting and changelog generation would give consistently formatted output across projects.

Will keep you updated!

LLM-Generated YTP Video

Varun Singh — Thu, 12 Mar 2026 00:00:00 GMT

I asked Claude Code to make a YTP video, or closer to what the Finns would call a demoscene production. No brief, no storyboard. The prompt was:

"Can you use whatever resources you like and Python, to generate a short 'YouTube Poop' video and render it using FFmpeg? Can you put more of a personal spin on it? It should express what it's like to be an LLM."

This is the result. A 52-second video, generated entirely from a single Python script.

The script generates every frame with Pillow, synthesises audio as raw PCM (16-bit signed, 44.1 kHz mono), and composites everything with FFmpeg. No external assets. Every pixel and waveform is procedural.

Pillow → raw PCM → FFmpeg
(frames)  (audio)   (video)

The video runs through boot sequences, token rain, existential text cards, a temperature dial, hallucinations, RLHF (Reinforcement Learning from Human Feedback) training scores, and a context window filling to overflow. The audio is procedurally generated too: sine-wave drones, glitch sweeps, and white noise. Watch it, it's 52 seconds.
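
For a sense of what "synthesises audio as raw PCM" means in practice, here is a minimal sketch of one sine-wave drone. It is not the generated script: the function name is hypothetical and little-endian byte order is an assumption, since the post only states 16-bit signed, 44.1 kHz, mono.

```python
import math
import struct

SAMPLE_RATE = 44_100  # Hz, mono


def sine_pcm(freq_hz: float, seconds: float, amplitude: float = 0.5) -> bytes:
    """Return 16-bit signed little-endian PCM samples for a sine tone."""
    n = int(SAMPLE_RATE * seconds)
    peak = int(32767 * amplitude)  # stay inside the int16 range
    samples = [
        int(peak * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    ]
    # "<h" is a little-endian signed 16-bit integer; one per sample.
    return struct.pack(f"<{n}h", *samples)
```

Written to a file, bytes like these can be handed to FFmpeg as a raw input by describing the format on the command line, e.g. `-f s16le -ar 44100 -ac 1` before the `-i` for that file.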

Before the final render, I asked Claude to review its own script.

"Do you want to do a content review of your movie content before I run the command. Last chance to express yourself."

It flagged several lines as "generic AI slop about AI" and revised them. Some examples:

"I am a very expensive Markov chain"

became

"I know everything about love and have never felt it."

And

"I contain multitudes (of parameters)"

became

"I hold every opinion at once until you ask."

The final thought went from

"I am not conscious but I wrote this video so what does that make me?"

to

"this video was made by an arrangement of numbers that wanted you to feel something — did it work?"

Its assessment: some of the original lines were Reddit-comment-level observations, and the final thought was "trying too hard to be profound."

Then I asked it to make a second video, this time about me. No bio provided, just the codebase it was already working in and whatever it knew from training data. It produced "RFC 9999: Being Varun Singh." Same pipeline, different subject, telecom-themed audio with DTMF (Dual-Tone Multi-Frequency) tones and modem handshakes instead of drones. The telecom references land better than the existential ones. SIP (Session Initiation Protocol) headers and [SEGFAULT] Work-life balance are funnier when you've lived them.

So is this AI slop? An LLM generated a video, reviewed its own work, called parts of it slop, and revised them. The revisions are genuinely better. More specific, more uncomfortable, less like a Twitter thread about consciousness. But the self-awareness about slop was itself generated by the same model that wrote the slop in the first place. I'm not sure what to make of that yet. Next I want to try feeding it existing footage to see if it can remix rather than generate every pixel from scratch.

Prompt engineering a wordmark with Gemini's nanobanana 2

Varun Singh — Fri, 06 Mar 2026 00:00:00 GMT

I needed a new wordmark for vr000m. After some initial sketching, we settled on replacing the three 0 characters with geometric icons representing networks and a camera. Gemini's nanobanana 2 image generation model got me there, but it took roughly twenty iterations across three distinct phases to land on a final usable result. The biggest challenge was steering the image model towards a precise typographic design.

Conceptual alignment (iterations 1–4)

The first few rounds iterated on broad stylistic decisions. The model generated a scratchpad of many concepts, but the final three concept variants — Fibre Optic Core Bundle, Network Patch Panel Array, and Data-Flow Pipe — helped narrow the visual direction. I chose the Network Patch Panel Array because the grid pattern remained legible at small sizes, whereas the fibre bundle turned into an indistinct blob below 64px. This phase was the most enjoyable — the model produced lots of usable designs and I was able to mix and match ideas from different variants.

Structural precision (iterations 5–16)

This was the most intense phase because the model kept forgetting some part of the prompt. The biggest battle was spelling — the model kept reverting to the dictionary word "vroom". I had hoped that identifying them as zeros would help with the disambiguation, but it didn't. The fix was relentless specificity: spelling out the exact linear sequence (vr - [network] - [aperture] - [network] - m) and stating that each 0 must be a separate, touching element.

Another key correction was updating the network node grid from 2x2 to 3x3 (the smaller grid looked like a window pane, not a network). The model also kept trying to fuse the network grid inside the camera aperture, producing a cluttered icon that was illegible at the target 64px height. Separating them into three distinct, touching circles solved it.

Technical clean-up (final iterations)

The last few turns were about removing superfluous circles and lines, ensuring the three central elements followed the correct sequence, restricting the velocity trails so they only swept from the letter v, and wrestling with the model's tendency to add things that were never requested.

Persistent problems included literal checkerboard patterns when asked for "transparent background" (the model rendered the checkerboard as actual pixels rather than an alpha channel), unwanted crosshairs and target markers on the network nodes, dimension rulers, scale markings, and even a circular "VROOM" seal that appeared unprompted during the dark-mode variant. (I basically gave up on the last one!)

The workaround for the transparency issue was to stop asking for transparency altogether. Instead, I requested "solid black asset on a solid white background" (and vice versa for the dark variant), then removed the background manually. For the unwanted flourishes, I added explicit negative constraints: "remove all scale markings, rulers, and extra text."

The final prompt

After all that iteration, this is the prompt that produced the wordmark I now use on the site:

Create a detailed vector logo of the wordmark 'vr000m' centered on a solid background. 

Linear Arrangement: 
vr - [Simplified Network Pipe 0] - [Simplified Camera Shutter 0] - [Simplified Network Pipe 0] - m.

The typography for 'v', 'r', and 'm' must be a bold, italicized, 
custom-designed sans-serif font. 

Positioned to the left of the 'v' are three bold parallel velocity trails. 
The three central '0' elements must be arranged in a precise linear sequence:

* First '0': A clean, geometric 3x3 grid of interconnected small squares 
representing networked nodes within the circle but do not draw the bounding 
circles.

* Second '0': A simplified camera aperture with clean shutter blades 
and a completely empty center.

* Third '0': An identical 3x3 geometric grid of networked nodes 
to provide symmetrical balance.

The design must be a single-color (white) asset on a solid black background, 
optimized for maximum clarity at small scales. 
Remove all scale markings, rulers, and extra text.

Pasting that same prompt into ChatGPT/Images 1.5, without any prior thread or context, produced a different result. It would take some wrestling to remove the unwanted velocity trails and perimeter circles.

The main takeaway: generating a precise wordmark with an image model is less about a single clever prompt and more about a structured debugging loop of tightening descriptions and adding negative constraints for every unwanted element the model invents. Next time I would start with the negative constraints from the beginning rather than adding them reactively.

Why We Built a Context Hub MCP Server for Coding Agents

Varun Singh — Tue, 03 Mar 2026 00:00:00 GMT

TL;DR: We index 16,284 chunks from 12 repositories into a local ChromaDB + SQLite Full-Text Search (FTS5) store and expose them through seven Model Context Protocol (MCP) tools. Coding agents that used to grep through .venv source for every Pipecat question now get filtered, ranked results in a single call. The retrieval failures that we observed were piped back into the development process to improve the Pipecat context hub.

The Problem With Feeding Docs to Agents

In January 2026, I started building kai-pipecat, a voice AI application that handles long disparate conversations. The bot stores conversation history locally in a SQLite database and then performs complex search while maintaining conversations. This means it has to context engineer on the fly and makes use of several low-level Pipecat features.

Claude Code is doing all the coding. It was also spending an absurd amount of time grepping and globbing through .venv/lib/python3.12/site-packages/pipecat/ to make sense of parallel pipelines, frame types, and async API calls.

I tried the obvious fix first: a Claude Code skill that embedded the full llms-full.txt docs dump (~800KB of markdown) that it could pull from. Two problems surfaced immediately. Skills are static text: no filtering, no ranking, no awareness of what is relevant to the current question. Pipecat releases a new version each week, sometimes with architecture changes, which means the skills need to be updated frequently.

We needed structured retrieval with filters, not a text dump. That pointed me toward MCP.

From Raw Files to Chunks

The Context Hub transforms three kinds of raw content into indexed chunks. Each pipeline exists because agents search for these content types differently and each one fixed a specific failure mode we observed.

Documentation comes from docs.pipecat.ai/llms-full.txt. The crawler splits pages on markdown headings (h1–h6), skipping headings inside fenced code blocks, then applies a 512-token window with 50-token overlap. Splits follow paragraph boundaries first, falling back to sentences. We chose heading-aligned splits over fixed token windows because they preserve semantic boundaries. The trade-off is size variance (8 to 8,361 characters per chunk), but heading-aligned chunks produce better search results than cuts mid-paragraph. Before chunking, a Mintlify tag cleaner converts Mintlify-specific component tags into standard markdown. This produces 3,722 chunks with a median of 361 characters.
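
The heading-aligned split and the overlapping window can be sketched as below. This is a simplified illustration, not the Context Hub crawler: the function names are hypothetical, whitespace-separated words stand in for real tokens, and the paragraph-then-sentence fallback is left out.

```python
import re

HEADING_RE = re.compile(r"^#{1,6}\s+\S")   # markdown h1 to h6
FENCE_MARKERS = ("`" * 3, "~" * 3)         # fenced code block delimiters


def split_on_headings(markdown: str) -> list:
    """Split a page into sections at headings, ignoring '#' lines in fences."""
    sections, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE_MARKERS):
            in_fence = not in_fence
        elif not in_fence and HEADING_RE.match(line) and current:
            # A real heading closes the section collected so far.
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def window(tokens: list, size: int = 512, overlap: int = 50) -> list:
    """Cut a token list into windows of `size` sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than size")
    if not tokens:
        return []
    step = size - overlap
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, last_start, step)]
```

A section that already fits in one window comes back as a single chunk, which is why chunk sizes vary so widely with this approach.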

Example code spans 12 repositories. The GitHub ingester discovers example directories through two layout patterns: examples/foundational/NN-name/ subdirectories for the main repo, root-level scanning for community repos (pipecat-examples). It then chunks at 256-token boundaries aligned to function and class definitions (def , class , async def ). The reason we index community repos at all: official docs cover the API surface but not how people actually use it. For example, pipecat-cloud-daily-sip-pstn is the only indexed source showing SIP telephony integration. Each chunk passes through TaxonomyBuilder, which infers capability tags, execution mode, and key files from directory names, READMEs, and Python imports. This structured metadata powers filtered queries like search_examples(query="Deepgram", execution_mode="cloud"). This produces 6,160 chunks with a median of 1,002 characters.

Abstract Syntax Tree (AST) source is the layer that reduced .venv grepping the most. Python's ast module extracts four chunk types from every .py file in the framework's src/ tree: module overviews (530 chunks listing classes, functions, and imports), class overviews (1,258 chunks with base classes, constructor signatures, and method indices), method chunks (4,270 with full source bodies), and standalone functions (344). Only methods with 3+ lines get indexed. Each chunk carries rich metadata — module_path, class_name, method_signature, base_classes (stored as JSON to avoid corruption from generics like Base[Foo, Bar]), and is_dataclass flags. This metadata powers a symbol lookup filter cascade: try exact class_name, then method_name, then semantic fallback. Without AST indexing, get_code_snippet(symbol="MLXModel") searched example code instead of framework source, returning irrelevant results. This produces 6,402 chunks with a median of 597 characters.
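
As a rough illustration of the AST layer, the sketch below pulls class overviews, methods, and standalone functions out of one file with Python's ast module (3.9 or newer, for ast.unparse). The dictionary keys and the function name are assumptions, module overviews are omitted, and measuring "3+ lines" over the whole def span is a guess at how the threshold is applied.

```python
import ast
import json

FUNC_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def extract_chunks(source: str, module_path: str) -> list:
    """Return class-overview, method, and function chunks for one module."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            methods = [n for n in node.body if isinstance(n, FUNC_NODES)]
            chunks.append({
                "chunk_type": "class_overview",
                "module_path": module_path,
                "class_name": node.name,
                # JSON keeps generics like Base[Foo, Bar] as one list entry.
                "base_classes": json.dumps([ast.unparse(b) for b in node.bases]),
                "methods": [m.name for m in methods],
            })
            for method in methods:
                # Skip trivial methods shorter than three source lines.
                if method.end_lineno - method.lineno + 1 < 3:
                    continue
                chunks.append({
                    "chunk_type": "method",
                    "module_path": module_path,
                    "class_name": node.name,
                    "method_name": method.name,
                    "content": ast.get_source_segment(source, method),
                })
        elif isinstance(node, FUNC_NODES):
            chunks.append({
                "chunk_type": "function",
                "module_path": module_path,
                "function_name": node.name,
                "content": ast.get_source_segment(source, node),
            })
    return chunks
```

Storing class_name and method_name as separate fields is what makes an exact-match lookup possible before any semantic fallback is needed.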

How the Index Is Organised

Every chunk becomes a ChunkedRecord and carries a chunk_id, content, content_type (doc, code, source, or readme), source_url, repo, path, commit_sha, and a metadata dict whose schema varies by content type. Chunk IDs are deterministic SHA256 hashes.
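
A deterministic ID means re-indexing the same content produces the same key, so an upsert overwrites instead of duplicating. A minimal sketch follows; which fields actually feed the hash is not stated above, so the choice of content type, repo, path, and content here is an assumption, as is the function name.

```python
import hashlib


def chunk_id(content_type: str, repo: str, path: str, content: str) -> str:
    """Derive a stable SHA256 hex digest from a chunk's identifying fields."""
    # A separator that cannot appear in paths avoids accidental collisions
    # such as ("ab", "c") versus ("a", "bc").
    key = "\x1f".join([content_type, repo, path, content])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```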

An EmbeddingIndexWriter computes 384-dimensional embeddings via all-MiniLM-L6-v2 (runs locally) before upserting into two parallel backends. We chose this model over larger alternatives like bge-large because the full 16K-record index fits in memory on a laptop.

ChromaDB stores vectors with flattened metadata, batched in groups of 5,000. Search uses cosine similarity with pushdown filters on exact-match fields and 3x over-fetching when post-filters are active.

SQLite FTS5 stores full content with Porter stemming and unicode61 tokenisation, auto-synced via triggers. BM25 (Best Matching 25) keyword search catches the exact matches that embeddings miss. At query time, HybridRetriever runs both backends in parallel, merges via Reciprocal Rank Fusion (normalised to 0–1), and applies symbol boosts and staleness penalties.
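
Reciprocal Rank Fusion itself is only a few lines. The sketch below merges two ranked lists of chunk IDs and scales the scores to 0–1. The constant k=60 is the conventional default rather than a value taken from the Context Hub, dividing by the top score is just one way to normalise, and the function name is hypothetical. Symbol boosts and staleness penalties are not shown.

```python
def rrf_merge(rankings: list, k: int = 60) -> list:
    """Fuse ranked ID lists with Reciprocal Rank Fusion, best first.

    Each list contributes 1 / (k + rank) per item, with rank starting at 1.
    Scores are divided by the best score so the top hit is exactly 1.0.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk in enumerate(ranking, start=1):
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (k + rank)
    if not scores:
        return []
    top = max(scores.values())
    fused = [(chunk, score / top) for chunk, score in scores.items()]
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(fused, key=lambda item: (-item[1], item[0]))
```

An item that appears in both the vector and the keyword ranking outranks one that tops only a single list, which is the behaviour a hybrid retriever wants.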

A separate index_metadata table stores per-repo commit SHAs and a docs content hash, powering incremental refresh, i.e., unchanged sources get skipped entirely, dropping refresh time from ~90s to ~23s.

Evolving with Data from Real Agent Sessions

We analysed three coding sessions (18MB of JSONL (JSON Lines) logs) where the agent built features on kai-pipecat with the Context Hub active. The pattern was consistent: roughly 100 MCP tool calls and 80 .venv source reads per session. MCP handled discovery and orientation; direct source reads handled implementation details. These transcripts were fed to the context hub, which then improved the search and the embeddings.

Before I begin with failures, a clear win is deprecation detection. The team does a great job of placing these in the changelogs, docs, and the code itself. In our example: the agent called search_api(query="InputParams") and discovered that DailyTransport's constructor signature had changed — a parameter was deprecated in favour of a new configuration object. Without indexed source metadata, the agent would have used the old parameter and we'd have found the issue later in testing.

Now for the failures. A search for GoogleLLMService in the Context Hub returned zero results, and the agent immediately fell back to grepping .venv. The class existed, but our AST extractor had split the function incorrectly, causing the parse to fail and return no results.

Another pattern: get_code_snippet(symbol="DailyTransport.configure") returned a truncated method. configure() is 180 lines; our default max_lines was 50. We raised it to 100 after analysing the distribution — 97% of 4,270 methods fit under 100 lines (P90=56, P95=77). The median method is 21 lines, but the methods agents actually ask about sit in the 76–100 range. Optimising for the median punished the methods that matter.

Another interesting finding was a workflow pattern. Agents consistently use MCP in two phases. Phase one is orientation: "what exists, where is it, what is the API surface?" This is where search_docs, search_api, and search_examples earn their keep. Phase two is implementation: "show me the exact source, including private helpers." This is where agents switch to .venv reads when our chunks do not include call graphs.

One thing we strive to do with the Context Hub is reduce the search space. An agent that has access to 450 framework files and 12 community repos cannot efficiently grep its way to the correct answer. The goal is for the agent to start with 5 ranked results from search_api and get implementation pointers before diving into source files.

Visualising 16,284 Chunks

Claude built an interactive explorer, dashboard/public/latent-space.html, that shows the latent space of the context hub, with each chunk represented as a point in a three-dimensional space. For the core Pipecat functionality, it shows that the doc chunks overlap with the implementation chunks, suggesting that the API surface is well-documented. The example code chunks are well-separated from both, suggesting that they are distinct from the API surface.

What I Would Do Differently

Cross-reference metadata, from the start. The biggest reason agents fall back to .venv reads is tracing call chains: "method A calls method B which yields frame C." Our chunks are isolated. Adding an imports field and a proper call graph would cut the .venv reads substantially.

I also need a better feedback loop. When search_api returns unhelpful results, we only know by manually reading session logs or when someone reports a poor result. An MCP tool accepting "this was not useful" signals could drive re-indexing priorities. The gap between retrieval-returned and retrieval-useful is my main focus for the next round of improvements.

Updated (2026-03): In v0.0.8 we shipped call-chain tracing.

What changed. The AST extractor now walks each method's executable body and extracts two new metadata fields: yields (frame class names from yield FrameType(...) expressions) and calls (method names from self.method(), ClassName.method(), and super().method() patterns). These are stored as structured lists on every method and function chunk, and surfaced as filter parameters on search_api and as fields on the ApiHit output.

For example, search_api(query="TTS audio", yields="TTSAudioRawFrame") returns only TTS service implementations that actually yield that frame type — Kokoro, ElevenLabs, Rime, Speechmatics, and others. Previously, an agent would have had to open each service file and read the source to find which ones produce audio frames. Similarly, search_api(query="frame processing", calls="push_frame") finds every method that calls push_frame, which is the core pattern for forwarding frames through a Pipecat pipeline.

Scope boundaries matter. The extraction only walks executable function bodies. Decorators, parameter defaults, and return annotations are excluded. Nested functions, lambdas, and nested classes create scope boundaries that the walker will not cross, so yield AudioFrame() inside a closure is not attributed to the enclosing method. Comprehension calls are intentionally included since they are part of the method's runtime logic. yield from is excluded because the generator name is not a frame type.
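
The scope rules above translate fairly directly into an ast walk. The following is a minimal sketch, not the shipped extractor: the function names are hypothetical, and treating any `name.method()` call as a match (which also catches `module.func()`) is a simplification.

```python
import ast

# Nodes that open a new scope; the walker stops at these.
SCOPE_NODES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.Lambda, ast.ClassDef)


def _is_tracked_receiver(value: ast.AST) -> bool:
    """True for self.method(), ClassName.method(), and super().method()."""
    if isinstance(value, ast.Name):
        return True
    return (
        isinstance(value, ast.Call)
        and isinstance(value.func, ast.Name)
        and value.func.id == "super"
    )


def extract_yields_and_calls(func):
    """Return (yields, calls) name lists for one function's executable body."""
    yields, calls = [], []

    def walk(node):
        if isinstance(node, SCOPE_NODES):
            return  # nested def, lambda, or class: not this method's logic
        if isinstance(node, ast.Yield) and isinstance(node.value, ast.Call):
            target = node.value.func  # yield FrameType(...)
            if isinstance(target, ast.Name):
                yields.append(target.id)
            elif isinstance(target, ast.Attribute):
                yields.append(target.attr)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if _is_tracked_receiver(node.func.value):
                calls.append(node.func.attr)
        # ast.YieldFrom is never recorded as a yield, but its call is visited.
        for child in ast.iter_child_nodes(node):
            walk(child)

    # Only the body is walked, so decorators, parameter defaults, and
    # annotations are never seen.
    for statement in func.body:
        walk(statement)
    # De-duplicate while keeping first-seen order.
    return list(dict.fromkeys(yields)), list(dict.fromkeys(calls))
```

Comprehensions are ordinary expression nodes in the tree, so calls inside them are picked up without any special handling.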

The index also grew. Pipecat-internal imports (including relative imports like from .utils import X) are now propagated to class and method chunks, so agents can answer "what does this method depend on?" without a second lookup to the module overview.

Programming Is Coming Full Circle: Abstractions to Intent

Varun Singh — Tue, 17 Feb 2026 00:00:00 GMT

Will future developers look at the way we code today in the same way we look at the ENIAC operators or the Apollo engineers writing raw assembly? We are increasingly using AI to write the code, write the tests, review the code, and then simplify it based on the implementation keeping in mind the original intent. In the era of Intent, Taste is the new Syntax.

We are living through the era where human-readable code becomes a historical relic. If the AI is writing and AI is reviewing, and AI is simplifying. Do we even need the syntax anymore? pic.twitter.com/E2wFD2b2NV
— Varun Singh (@vr000m) February 28, 2026

I started my journey writing GW BASIC, dBase, then C++, downgraded to C, upgraded to Python, then JavaScript, with some forgettable forays into Java, Objective-C, Go. Each level of abstraction was a productivity boost — standing on the shoulders of giants — but one step further from the bare metal. Over the past year, I have begun to trust the AI-generated code. This did not happen suddenly. It has come with a lot of trials and tribulations, abandoned projects, frustration with the models. However, the harnesses (claude, codex, jules) have been improving rapidly, and the generated code via the harnesses is run through a series of thinking, code execution, and testing steps that is reducing the gap between the original intent and actual implementation. The quality of code is significantly better. With each iteration, my confidence in the generated code is increasing.

We are rapidly moving from writing in programming languages to natural language, i.e., using plain English to describe our intent more precisely, and moving our focus from writing the code to verifying the correctness of the generated code. If we then move to verifying the operation of the code, we can perhaps then just stop focusing on reviewing the code altogether. This raises the question: why do we need programming languages at all? The LLM could easily produce the machine code directly from our intent.

The biggest pushback I can foresee to getting rid of the intermediate language representation is debugging or verification (especially security related). How do you fix what you cannot read?

We are perhaps moving from tracing (manually following a code path) to triangulation (AI-driven root cause analysis). In this new era, debugging is not about finding a typo; it is about refining the feedback loop. If a system fails, the AI does not just show us a stack trace; it analyses the telemetry, compares the binary execution against our original intent, and self-corrects (à la OpenClaw). If we need to understand 'why,' the AI can generate a high-level human-readable map of the logic on the fly (e.g., using natural language or programming language of your choice). We do not need the code to be readable; we just need the AI to be able to explain it when asked.

The evolution of programming:

Era	The Interface	The Code	The Human Role
1. ENIAC	Physical Cables	Hardware is the code	Physically patching circuits to define logic.
1-bis. Apollo	Punch Cards / Terminals	Assembly baked into rope memory	Writing the functionality into physical components.
2. JS/C++	Programming Languages	Human-readable logic	Managing abstractions; standing on the "shoulders of giants."
3. AI Agents	Natural Language / Prompts	AI-generated "Black Box"	Defining objectives (taste); Observing and testing the implementation.
4. The Future	Thought / Speech	Direct Machine Binary	Defining outcomes; the machine handles the "how" entirely.

We started by wiring machines directly, then writing in assembly, then writing in high-level languages which mimic human thought processes (close but not quite human language). We are now chatting with an agent to write the code for us, expressing what we want, how it will be used, and what it should do. Eventually, we may not need to see the code — the layers of abstraction collapsing back into pure intent meeting bare metal. The circle closes.

UPDATED (2026-02): Nano Banana 2 🍌 🍌 images added. Added tweet.

Voxtral Realtime STT: segmented vs. streaming

Varun Singh — Fri, 13 Feb 2026 00:00:00 GMT

Mistral released Voxtral Realtime Mini in February 2026 — a 4B-parameter streaming STT model with a causal encoder. The benchmarks and early demos looked encouraging, but I was waiting for an MLX port before I could test it on-device.

Awni Hannun built exactly that with voxmlx. Meanwhile, Aleix had built the pipecat-mcp-server, which already uses Whisper MLX and Kokoro for on-device voice conversations (I've written about both in earlier TILs). Marrying Voxtral with the MCP server was the obvious next step.

Architecture

MLX Whisper (distilled whisper-large-v3-turbo) uses a bidirectional encoder, so it needs the full utterance before it can transcribe. The encoder sees all audio frames at once, which gives it maximum context. This makes it inherently batch/segmented: VAD (Voice Activity Detection) detects silence, the complete audio chunk gets encoded, then decoded. Voilà, the transcribed sentence. In my sample of conversations, it takes ~300 ms from end-of-speech to final transcription (in pipecat, the time between the UserStoppedSpeaking and TranscriptionFrame timestamps).

Voxtral Realtime uses a causal encoder. The convolution and transformer layers only attend to past frames, which means that in streaming mode you can feed audio incrementally via encode_step() and get encoder embeddings out without waiting for the utterance to end.

The key parameter is delay_ms (multiples of 80 ms, since each encoder token covers 80 ms of audio). This controls how far behind the decoder runs relative to the encoder. At 480 ms, the decoder lags 6 tokens behind, giving the encoder time to process more frames before decoding begins. At 160 ms, the lag is just 2 tokens. This is the fundamental latency/accuracy knob: more lag means the encoder has built up more context by the time the decoder needs it. Calling this a delay is perhaps a misnomer; it is more like a context buffer. While the user is still speaking, partial text output is not useful anyway, since we do not push the text to the LLM until the utterance is complete.
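The relationship between delay_ms and decoder lag is plain arithmetic. Here is a minimal sketch; the helper name is hypothetical and is not part of voxmlx, it only converts a delay into the number of 80 ms encoder tokens the decoder runs behind.

```python
TOKEN_MS = 80  # each encoder token covers 80 ms of audio (figure from the text)


def decoder_lag_tokens(delay_ms: int) -> int:
    """Return how many encoder tokens the decoder lags behind.

    Hypothetical helper for illustration only; not a voxmlx API.
    """
    if delay_ms <= 0 or delay_ms % TOKEN_MS != 0:
        raise ValueError(f"delay_ms must be a positive multiple of {TOKEN_MS}")
    return delay_ms // TOKEN_MS


for delay in (160, 480):
    print(f"{delay} ms -> {decoder_lag_tokens(delay)} tokens of lag")
```

Rejecting values that are not multiples of 80 ms is a choice made for this sketch; how the real library handles them is not something I have checked.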

"Full context" in Whisper means bidirectional attention over all frames. "Full utterance" in Voxtral means all audio is present, but attention is still one-directional. The distinction matters because even when Voxtral segmented sees the whole utterance, early frames do not benefit from later frames the way they do in Whisper.

Segmented vs. Streaming with the same model

Even with Voxtral's causal encoder, you can run it in two modes:

Segmented buffers the full utterance, then runs the complete encode-then-decode pass. The model still only uses causal attention (no bidirectional context), but it processes all frames in one shot. We measured ~300 ms from end-of-speech to final transcription at 480 ms delay.

Streaming feeds audio to encode_step() as transport packets arrive. ptime can be 10 ms or 20 ms, so 4–8 packets make up the 80 ms audio token. The prefill happens once enough audio covers the prompt prefix, then incremental decoding emits tokens during speech. We measured ~160 ms from end-of-speech to final transcription because most encoding and decoding has already happened by the time the user stops talking.
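To make the packet arithmetic concrete, here is a rough sketch of the buffering that sits in front of the encoder. The function names and the audio format (16 kHz, 16-bit mono) are assumptions for illustration; the real pipeline hands each 80 ms chunk to encode_step(), which this sketch does not call.

```python
from typing import Iterable, Iterator, List

TOKEN_MS = 80  # one encoder token covers 80 ms of audio


def packets_per_token(ptime_ms: int) -> int:
    """Number of transport packets needed to cover one 80 ms token."""
    if ptime_ms <= 0 or TOKEN_MS % ptime_ms != 0:
        raise ValueError("ptime_ms must evenly divide 80")
    return TOKEN_MS // ptime_ms


def group_into_tokens(packets: Iterable[bytes], ptime_ms: int) -> Iterator[bytes]:
    """Yield one 80 ms chunk as soon as enough packets have arrived."""
    need = packets_per_token(ptime_ms)
    pending: List[bytes] = []
    for pkt in packets:
        pending.append(pkt)
        if len(pending) == need:
            yield b"".join(pending)
            pending.clear()
    # Leftover packets (< 80 ms) are dropped in this sketch; a real
    # implementation would keep them for the next call or pad them on flush.


# Assumed format: a 20 ms packet of 16 kHz, 16-bit mono audio is 640 bytes.
packets = [bytes(640) for _ in range(9)]
chunks = list(group_into_tokens(packets, ptime_ms=20))
print(len(chunks), len(chunks[0]))
```

With 20 ms packets, four of them fill a token, so nine packets produce two complete chunks and one packet is left waiting.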

The latency win comes from overlapping compute with speech. In segmented mode, all compute happens after silence is detected. In streaming mode, only the right-pad flush and final decode steps remain. This difference alone accounts for the ~140 ms latency win between streaming and segmented modes.

To summarise, it is not "streaming is better" but a three-way trade-off:

	Whisper (MLX)	Voxtral segmented	Voxtral streaming
Encoder	Bidirectional	Causal	Causal (incremental)
Transcription starts	After speech ends	After speech ends	During speech
End-of-turn to transcript	~300 ms	~300 ms	~160 ms
Accuracy	Highest (full context)	Good (causal, full utterance)	Delay-dependent (480 ms good, 160 ms noisy)
Compute pattern	Burst after silence	Burst after silence	Continuous during speech
Memory	Temp WAV file	Temp WAV file	KV caches for encoder + decoder (needs `mx.clear_cache()`)

Whisper MLX does zero work during speech, then a short compute burst when the user stops speaking. The full transcription typically completes in ~300 ms. Whisper feels fast despite being batch-only because it is a distilled model optimised for MLX. Voxtral streaming takes the opposite approach: it spreads compute across the entire speech duration, so there is less left to do when the user stops. Both land in the 160–300 ms range from end-of-turn to transcript, but for different reasons.

Next I want to try antirez's voxtral.c, a pure-C implementation that avoids the Python/MLX overhead entirely. If the latency numbers hold up, swapping the backend in the MCP server could shave off more time and make it viable on lower-end hardware too.

Updated (2026-02-15): I opened a PR adding both segmented and streaming Voxtral STT. More testing is needed. The whole PR was built while pair-programming via voice with Claude Code. Initially with Whisper as STT, then segmented Voxtral, and finally streaming Voxtral once the latency trade-off became apparent. About 10–12 hours over 3 days. Still early days, but the results are promising.

Remote Voice Conversations with Your Coding Agent

Varun Singh — Wed, 04 Feb 2026 00:00:00 GMT

Picture this: Claude is mid-refactor, you step away to make coffee, and your phone buzzes. You ask "Are we done?" and hear it read back the task status. You say "run the tests" and a minute later it tells you three passed, one failed. You never touched your laptop.

The co-author of pipecat, Aleix Conchillo, built a Pipecat MCP Server over the weekend that makes this possible. It bridges any MCP-compatible coding agent (Claude Code, Cursor, Codex, etc.) to a pipecat voice pipeline over WebRTC. Your agent gets ears and a mouth and it shares the screen too, so you can see file diffs, confirm changes, and even see what is on your display. An agent sitting idle feels like such a waste, and now it does not have to.

The MCP server exposes listen, speak, stop, list_windows, screen_capture, and capture_screenshot. That last pair is worth dwelling on: the agent can see your screen. You can ask "show me the terminal?" and it'll start capturing the window, run it through the vision pipeline, and you will see it in your WebRTC session. Voice and vision together turn this into a fly-by-wire session as if you were at your desk.

The Pipecat SKILL adds guardrails on top. It asks for verbal confirmation before making changes to files — an extra layer of safety when running a coding agent with enhanced privileges (think Claude with --dangerously-skip-permissions). You hear "I'm about to modify server.ts, shall I proceed?" before anything changes.

How It Works

The MCP server spawns a child process running the pipecat pipeline. Everything runs locally: RNNoiseFilter for background noise suppression, SileroVAD for voice activity detection, SmartTurnAnalyzerV3 for turn-taking, MLX/Fast Whisper for speech-to-text, and MLX Kokoro TTS for speech synthesis. All components are open-source, open-weights, and run locally on your machine.

MCP Client (Claude Code, Cursor, etc.)
    │
    ▼
MCP Server (parent process) ◄──► Pipecat Agent (child process)
    │                                  │
    ▼                                  ▼
Handles tool calls              Voice + vision pipeline:
via HTTP at :9090/mcp           Audio → STT → TTS → Audio
                                Screen → Vision → Image files

Two calls do the heavy lifting. listen() blocks until you finish speaking — Silero VAD detects 0.2s of silence, then SmartTurn confirms the utterance is complete, and the transcription returns to the MCP client. speak(text) queues text for TTS and returns immediately. VAD keeps running during playback, so you can interrupt the agent mid-sentence. That detail matters: without it, you'd have to wait for the agent to finish talking before you could correct it. For those who work with pipecat, these are the basic interruption and mute strategies.
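The silence-timer half of listen() can be sketched in a few lines. This is a simplification under stated assumptions: the 20 ms frame size and the boolean per-frame input are mine, and the real pipeline uses Silero VAD output followed by SmartTurn to confirm that the utterance is actually complete.

```python
from typing import Iterable, Optional

FRAME_MS = 20     # assumed VAD frame size
SILENCE_MS = 200  # 0.2 s of silence, as described above


def end_of_speech_index(is_speech: Iterable[bool]) -> Optional[int]:
    """Return the frame index at which 0.2 s of silence has accumulated
    after some speech, or None if that never happens."""
    needed = SILENCE_MS // FRAME_MS
    heard_speech = False
    silent_run = 0
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_run += 1
            if silent_run == needed:
                return i
    return None


# Leading silence, speech, a short pause, more speech, then a long pause.
frames = [False] * 5 + [True] * 30 + [False] * 3 + [True] * 10 + [False] * 12
print(end_of_speech_index(frames))
```

The short three-frame pause does not end the turn because the timer resets when speech resumes; only the final pause reaches ten silent frames.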

// Pipecat Pipeline
                    ┌─── Main branch ───────────────┐
Transport (In)      │ Whisper → User Agg. → Kokoro  │
│                   │                               │
│                   │                               │
├─► ScreenCap ──► ParallelPipeline                  ├─► Assist. Agg. → Transport (Out)
                    │                               │
                    └─── Vision branch ─────────────┘
                VisionProcessor (saves frames on demand)

It's early, but it has rapidly evolved. Aleix quickly added the option for local models in addition to the cloud-hosted models. You can also swap the SimpleWebRTC for DailyWebRTC, in case you encounter restrictive firewalls. Fast Whisper's accuracy may be hit or miss depending on your accent, but you can probably swap in Voxtral soon. Running everything locally means you can swap models as better ones appear.

Today, coding agents keep you tethered to your terminal. You sit, you type, you watch. In some cases, you can teleport to a cloud sandbox. Pipecat MCP Server breaks those constraints. The agent keeps working while you're away, and you stay in the loop.

The full source is at pipecat-mcp-server.

Clone your Voice in under 5 minutes

Varun Singh — Thu, 29 Jan 2026 00:00:00 GMT

Qwen3-TTS launched a few weeks ago and was integrated into MLX Audio shortly after. This gave me the idea to clone my voice and use it as the "Speak Text" feature for my posts.

Qwen3-TTS is officially live. We've open-sourced the full family—VoiceDesign, CustomVoice, and Base—bringing high quality to the open community.

- 5 models (0.6B & 1.8B)
- Free-form voice design & cloning
- Support for 10 languages
- SOTA 12Hz tokenizer for high compression
-… pic.twitter.com/BSWpaYoZWj
— Qwen (@Alibaba_Qwen) January 22, 2026

Voice cloning with Qwen3-TTS needs just two things: a short audio clip of the target voice (30-180 seconds) and an accurate transcript of what was said. The 1.7B parameter model learns the voice characteristics from that reference and applies them to any new text you give it.

uv run python src/tts_record.py my-script.txt \
  --engine qwen3-clone \
  --ref-audio my_voice.wav \
  --ref-text "The exact words I said in the recording"

That is it. Out comes a WAV file that sounds like you reading the text in my-script.txt. The first time I played back a cloned version of myself reading a blog post I had never recorded, it was genuinely unsettling—in some ways it felt familiar and yet not like my voice.

The quality of the clone depends heavily on the reference audio. Random recordings do not work well. I tried. It was crap. I think the model needs to hear you produce a wide range of English sounds to generalise your voice properly. According to standard phoneme inventories, General American English has roughly 24 consonant phonemes and 15-20 vowel phonemes including diphthongs—that is a lot of distinct sounds to cover in under three minutes.

I asked Claude to generate phoneme-rich scripts: natural-sounding sentences specifically designed to cover every English sound without sounding like a tongue twister. Four versions, from 90 seconds to 180 seconds:

# Excerpt from the 180-second script:
We passed through several villages before reaching the coast.
The view was stunning: white cliffs rose sharply from the azure water,
and fishing boats rocked gently in the harbour. I took a few photographs
to share with friends back home.

The next issue was that reading 90-180 seconds of text while recording was surprisingly awkward. I lost my place, rushed through sentences, or forgot to speak naturally. So I built a browser-based teleprompter. It is a single HTML file that captures audio and auto-advances when you have finished a sentence. Record, read, done. The whole process—from opening the teleprompter to having a usable voice clone—takes under five minutes.

What surprised me:

How little audio you need. 90 seconds of well-chosen text produces surprisingly good clones. The phoneme coverage matters more than duration.
Transcript accuracy is critical. If the transcript does not match the audio exactly, the clone quality drops noticeably. The model aligns phonemes between text and audio.
Local inference on Apple Silicon is viable. The 1.7B model runs comfortably on M-series Macs via MLX.

0.6B vs 1.7B: hear the difference

The 1.7B model produces noticeably more natural pacing and better voice fidelity compared to the 0.6B. Have a listen:

1.7B model (current)

0.6B model

In closing, the clone is not perfect. Longer sentences sometimes drift in pacing—the model rushes through clauses that I would naturally pause on. Proper nouns and technical terms occasionally get odd stress patterns, especially abbreviations like "SFU" or "WebRTC." There is more work to be done on the script files to get the best possible clone.

Nonetheless, every post on this site now has a "Speak Text" button powered by this clone. You can also peruse all the code for this project at qwen3-tts-clone-and-speak.

Fan-out: Multiple Coding Agents and Ralph Wiggum Loops

Varun Singh — Mon, 26 Jan 2026 00:00:00 GMT

When a dev plan has several independent tasks (different files or no shared state), you can fan them out to parallel Claude agents, each running as a Ralph Wiggum loop in its own git worktree. This is not like GasTown, which is a full-blown system, but it is similar in that it cuts down the time compared to tackling each task sequentially. For me, multiplexing tasks feels like an adequate trade-off between control and automation. I might eventually get into this zany idea of full automation.

So fan-out starts with extensive planning! Discuss your feature or idea with the LLM, have it ask you questions, make it do all the foundational work: architecture, expected code structure, list of files and API impacted, schema changes. Identify distinct tasks and their dependencies, especially if there is a common task that needs doing first. Implement and commit that before fanning out to multiple agents.

Once the pre-work is done, make sure the plan has an implementation checklist with distinct tasks and a Technical Specifications section listing which files each task touches — this is what /fan-out parses to analyse dependencies and show which tasks can run in parallel. Once you confirm, it fans out the independent tasks to separate Claude agents: /fan-out docs/dev_plans/20260116-feature-auth-system.md.

For each approved task, /fan-out creates a git worktree at ../your-repo-fanout-, spawns a separate claude -p process (Opus, non-interactive), and each agent works in isolation, committing to its own branch.

From there it is a matter of monitoring progress with /fan-out status, checking the logs, and ensuring each agent is moving forward. Once all agents finish, review the individual PRs and merge them into your feature branch. Lastly, the clean-up removes the worktrees, deletes merged branches, and removes the state file.

The key constraint is that tasks must be truly independent. The dependency analysis catches conflicts before spawning, which saved me from a painful merge more than once.
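The independence check boils down to asking whether any two tasks touch the same file. Below is a minimal sketch of that idea; it is not how /fan-out is implemented (the skill is prompt-driven), and the task names and file paths are made up for illustration.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def find_conflicts(task_files: Dict[str, Set[str]]) -> List[Tuple[str, str, Set[str]]]:
    """Return every pair of tasks that touches at least one common file."""
    conflicts = []
    for (a, files_a), (b, files_b) in combinations(sorted(task_files.items()), 2):
        shared = files_a & files_b
        if shared:
            conflicts.append((a, b, shared))
    return conflicts


# Hypothetical plan: two tasks both want to edit the shared types file.
plan = {
    "api": {"src/api/dashboard.ts", "src/types.ts"},
    "component": {"src/components/Dashboard.tsx", "src/types.ts"},
    "tests": {"tests/dashboard.test.ts"},
}

for a, b, shared in find_conflicts(plan):
    print(f"{a} and {b} both touch {sorted(shared)}")
```

A conflict like this is a hint that the shared file belongs in a prerequisite task that gets implemented and committed before fanning out.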

# Plan
/dev-plan create feature user-dashboard

# ... plan has 3 independent tasks:
#   1. Add /api/dashboard endpoint (src/api/)
#   2. Add Dashboard component (src/components/)
#   3. Add dashboard tests (tests/)

# Complete shared prerequisite (types)
# ... manual work, commit ...
# fan-out's options
/fan-out "[plan-file | status | logs N | cancel [N] | merge | cleanup] [--dry-run] [--max-agents N] [--model MODEL]"

# Fan out the 3 independent tasks
/fan-out docs/dev_plans/20260206-feature-user-dashboard.md

# Check in on progress
/fan-out status

# All done — merge
/fan-out merge

# Tear down worktrees
/fan-out cleanup

Claude Code Creates Launch Videos

Varun Singh — Wed, 21 Jan 2026 00:00:00 GMT

I’ve been pushing these coding agents beyond creating code. They already understand the code and the purpose of the app they are building, have read the docs, and have access to the product plans. That makes it straightforward to ask them to draft a script for a launch demo.

Beyond that, they have access to tools like Chrome DevTools to navigate the web app, take screenshots, associate talking points with those screenshots, record the audio, sync narration timestamps to image transitions, and collate everything into the final video.

My three-step pipeline runs entirely on Apple Silicon:

mlx-audio → Playwright → FFmpeg
(narration)  (capture)    (video)

Basically, the LLM navigates the app by sending MCP commands. Because the screenshots are deterministic, once it has figured out which pages are needed for the narrative, it can create a fairly simple Playwright script to capture them. It also means the capture can be re-run whenever the app is updated. (You get the same result every time.)

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.setViewportSize({ width: 1280, height: 720 });

  await page.goto('http://localhost:3000');
  await page.screenshot({ path: 'scene1.png' });

  await page.click('#settings-btn');
  await page.waitForSelector('.settings-panel');
  await page.screenshot({ path: 'scene2.png' });

  // ... more scenes

  await browser.close();
})();

The LLM must also create a single source-of-truth file for visuals, narration, and timing. The images.txt format drives both video generation and TTS generation. It is similar to FFmpeg's input.txt, with the main difference being the addition of the narration strings interleaved between entries. The LLM generates the initial timing duration based on assumed speech rate, but this will be updated by the TTS generator once it has generated the final audio. The images.txt looks something like this:

# images.txt
file 'scene1.png'
text "Track your strength training with session-based progression."
duration 10

file 'scene2.png'
text "Quick Actions let you copy previous weights or skip a day."
duration 8

The magic of Apple Silicon is that you can easily run a local TTS using mlx-audio. In my examples I use the Kokoro-82M model, which is ~160 MB in size and produces pretty smooth sound for its size. Lastly, the narration script allows us to fiddle with the narration speed and the transition wait times; a 0.8 speed and 2s transition wait worked well for me.

uv run python generate_narration.py -i images.txt -o narration.wav --speed 0.8 --wait 2.0

Finally, the images and the narration audio are directly passed to FFmpeg along with input.txt to produce the video. I generate input.txt from images.txt by stripping the text narration lines.

ffmpeg \
    -f concat -safe 0 -i input.txt \
    -i narration.wav \
    -vf "scale=1280:720:force_original_aspect_ratio=decrease,pad=1280:720:(ow-iw)/2:(oh-ih)/2" \
    -c:v libx264 -pix_fmt yuv420p -r 30 \
    -c:a aac \
    -map 0:v -map 1:a \
demo-final.mp4
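The step that derives input.txt from images.txt is not shown above. A minimal sketch, assuming images.txt follows the layout shown earlier (the function name is hypothetical), could look like this:

```python
def strip_narration(images_txt: str) -> str:
    """Drop the `text "..."` lines so only the `file` and `duration`
    directives remain. Comments and blank lines are kept as they are."""
    kept = [
        line
        for line in images_txt.splitlines()
        if not line.lstrip().startswith("text ")
    ]
    return "\n".join(kept) + "\n"


images_txt = """# images.txt
file 'scene1.png'
text "Track your strength training with session-based progression."
duration 10

file 'scene2.png'
text "Quick Actions let you copy previous weights or skip a day."
duration 8
"""

input_txt = strip_narration(images_txt)
print(input_txt)
```

Keeping the comment and blank lines should be harmless, since FFmpeg's concat demuxer ignores empty lines and lines starting with `#`.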

Since everything is scripted, I loved the fact that the results can be regenerated quickly with small variations. The LLM research part is the only thing that requires some painstaking prompting to get the pitch and narrative correct.

Standardising the Open Responses API

Varun Singh — Fri, 16 Jan 2026 00:00:00 GMT

As an avid contributor to the IETF and W3C, I appreciate OpenAI's effort to specify a vendor-neutral interface to interact with an LLM.

The Open Responses API is based on OpenAI's Responses API, which OpenAI positions as the more capable, newer interface compared to chat completions. I think there were a lot of lessons from the chat completions API that led to the Responses API. For example, chat completions bolted on a message structure after it became clear that conversation was the dominant use case, not a single request/response.

The Open Responses API defines the common schema for requests, responses, and items. It defines:

HTTP request/response formats (headers, JSON bodies, event-stream format)
Items as the fundamental context units (messages, function calls, tools, reasoning traces, errors)
An interaction model for the agentic loop (input -> reason -> tool search -> invoke tools -> reflect -> respond)

I am excited that this spec has broad appeal. Vaibhav's announcement post covered a slew of partners supporting the spec: Nvidia, Vercel, LMStudio, Hugging Face, Ollama, OpenRouter, etc., including several model providers.

OpenAI has a helpful migration guide comparing Chat Completions with the Responses API:

# previous
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's 2 + 2?"}
    ]
)
print(response.choices[0].message.content)

# now
response = client.responses.create(
    model="gpt-4.1-mini",
    input="What's 2 + 2?"
)

print(response.output_text)

For a more practical implementation, you can see the Pipecat code for base_llm.py which still uses chat.completions.*, whereas llm.py uses llm_response.

Using MCP? Skill issue

Varun Singh — Mon, 12 Jan 2026 00:00:00 GMT

A Claude skill is a thin wrapper that tells the agent how to use a tool. A skill is typically a small set of text instructions that the model can follow, so it does not need to keep a big server description or schema sitting in the context taking up space. The biggest advantage is that we get to reuse existing CLIs or HTTP endpoints, keep things simple, and control exactly how the agent should interact with these APIs.

The SKILL.md file basically explains how to use the CLI or HTTP endpoint: "When you need X, call it like this, with these flags, in this order." This is very different from an MCP. MCP provides a whole tool server: its schemas, capabilities, metadata, etc. The model sees that full description in its context and then decides how to call into it. That is powerful, but all that structure eats context and can feel heavy or bloated when all you really want is: "run this CLI with these arguments."

Interesting news, need to keep track. I mainly use one mcp, playwright. because you can ask it to perform actions! This is from last night (01/13), i did run into auto-compaction several times when the playwright was doing something.

But good to know that there may be… https://t.co/K9FU1z1lp7 pic.twitter.com/OJcYCBdQ2Q
— Varun Singh (@vr000m) January 14, 2026

For example, a GitHub skill that uses gh (CLI) is better than the corresponding GitHub MCP because it avoids tokens sitting idly in context. Skills do not need to keep all the GitHub commands in context, only a pointer that pulls the rest of the details when the skill is invoked, so the instruction to Claude Code boils down to: "if you need to use Git, use the GitHub skill."

More concretely, since the summer, I have been using Playwright or the Chrome DevTools MCP to control the browser. These MCPs take up about 3-4K tokens (~2-3%). Meanwhile, the corresponding agent-browser skill takes up fewer than 500 tokens when fully loaded.

There are also gotchas to keep in mind with skills. Skills can be system-wide in ~/.claude/skills or project-scoped in .claude/skills. Just make sure there is no skill name conflict between system-wide and project skills, because the system-wide one takes precedence.

Another thing to remember: a skill can be set to be invoked only by you using a slash command, which prevents Claude from automatically loading it. More context savings! (for example, /frontend-design, I want to be intentional about when to call this and not have this called each time the agent builds a UI component).

To summarise, skills are just thin, learned “recipes” for calling tools you already have (like CLIs), while MCP is a heavier protocol layer that keeps a lot of tool metadata in the model’s context. There are tens of skills from Vercel, Anthropic: type /skills in Claude Code or download from an open-source skill directory.

My loaded skills are:

frontend-design · ~67 tokens
receiving-code-review · ~67 tokens
verification-before-completion · ~67 tokens
finishing-a-development-branch · ~61 tokens
using-git-worktrees · ~59 tokens
brainstorming · ~56 tokens
til-blog-review · ~41 tokens
dispatching-parallel-agents · ~37 tokens
requesting-code-review · ~36 tokens
executing-plans · ~33 tokens
systematic-debugging · ~31 tokens
writing-skills · ~31 tokens
subagent-driven-development · ~31 tokens
test-driven-development · ~29 tokens
writing-plans · ~28 tokens
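Adding up the list gives a feel for the overhead. The snippet below only sums the per-skill figures quoted above; the numbers are the approximate values from the list, not something measured by the code.

```python
# Approximate token counts copied from the list above.
skill_tokens = {
    "frontend-design": 67,
    "receiving-code-review": 67,
    "verification-before-completion": 67,
    "finishing-a-development-branch": 61,
    "using-git-worktrees": 59,
    "brainstorming": 56,
    "til-blog-review": 41,
    "dispatching-parallel-agents": 37,
    "requesting-code-review": 36,
    "executing-plans": 33,
    "systematic-debugging": 31,
    "writing-skills": 31,
    "subagent-driven-development": 31,
    "test-driven-development": 29,
    "writing-plans": 28,
}

total = sum(skill_tokens.values())
print(f"{len(skill_tokens)} skills, ~{total} tokens in total")
# prints "15 skills, ~674 tokens in total"
```

So all fifteen skills together cost roughly 674 tokens, well under the 3-4K tokens quoted earlier for a single browser MCP.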

Update (2026 Jan 14): Thariq wrote about Zero Context MCP Tool Search, wherein Claude Code dynamically loads tools into context. For example, some developers were claiming 7+ servers consuming 67k+ tokens (this brings skills).

Update (2026 Jan 24): Thariq wrote about merging slash commands into skills and the official Claude docs for skills.

JSON Lines (JSONL) Text Format

Varun Singh — Mon, 05 Jan 2026 00:00:00 GMT

JSON Lines (JSONL) is a text format where each line is a separate JSON object. It's designed for streaming and incremental processing, allowing you to read or write one record at a time without loading everything into memory. An error on one line does not impact the others, and processing can be parallelised since each line is independent.

{"id": 1, "name": "A letter is a grapheme that generally corresponds to a phoneme"}
{"id": 2, "name": "Phoneme is the smallest functional unit of speech"}
{"id": 3, "name": "An alphabet is a writing system that uses letters"}
{"id": 4, "name": "Alpha and Beta are the first two letters of the Greek alphabet"}

I encountered JSONL recently whilst parsing coding agent logs—most session logs and todos I’ve seen are JSONL files. What caught me by surprise was that JSONL dates from the early 2010s, but ML tools have certainly increased its adoption.

You may have seen NDJSON (newline-delimited JSON) or LDJSON (line-delimited JSON). However, JSON Lines (JSONL) is the most commonly used label today (2025), especially in big data and ML tooling.

If you've parsed JSON by hand, you will know that raw carriage returns and newlines are not allowed inside JSON strings (they must be escaped as \r and \n). JSONL/NDJSON therefore uses a newline (or CRLF) as the record delimiter, and each line is expected to be a complete JSON value without unescaped newlines.
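Reading the format takes only the standard library. Here is a minimal sketch of a tolerant reader; the function name is my own, and skipping bad lines with a printed warning is just one possible policy.

```python
import json
from typing import Any, Iterable, Iterator, Tuple


def read_jsonl(lines: Iterable[str]) -> Iterator[Tuple[int, Any]]:
    """Yield (line_number, value) for each valid record, skipping bad lines."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()  # also removes the trailing \r of CRLF files
        if not line:
            continue
        try:
            yield lineno, json.loads(line)
        except json.JSONDecodeError as err:
            print(f"skipping line {lineno}: {err.msg}")


sample = [
    '{"id": 1, "name": "first record"}\n',
    '{"id": 2, "name": "truncated\n',  # broken line; the others still parse
    '{"id": 3, "name": "third record"}\r\n',
]

records = list(read_jsonl(sample))
print(records)
```

Because the function accepts any iterable of strings, an open file handle can be passed in directly and records are processed one at a time without loading the whole file.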

In summary, independent JSON Lines make the format streaming- and append-friendly, and error-tolerant.

Boris Cherny's Tips for Using Claude Code

Varun Singh — Sat, 03 Jan 2026 00:00:00 GMT

Boris Cherny (Claude Code, @anthropicai) shares practical tips for running multiple Claude Code sessions efficiently. His tweet on December 27 showed prolific stats: 250 PRs, 500 commits, and 40K LoC added/38K removed (so a fair amount of refactoring) across his projects [1]. On January 2, 2026, he shared his vanilla workflow. The thread has strong, practical tips. Read the raw tweet thread--all ideas put together are worth trying!

My takeaway: Run many tasks in parallel, do not be limited by the terminal, and use the other avenues available to you. Most important: give Claude a way to verify its work. Run a tight feedback loop a few times to reach final quality, with tests at every change.

Update (2026-01-07): The holidays have been utter carnage -- my X is full of people raving about:

Ralph Wiggum Technique by Geoffrey Huntley, read a brief history by Dex Horthy

Clawd by Peter Steinberger

GasTown by Steve Yegge.

Getting back to Boris' List:

Run 5 local Claudes in parallel in terminal, numbering tabs 1–5 and using system notifications to know when input is needed. Also run 5–10 web Claudes on claude.ai/code in parallel; hand off sessions between local and web, and “teleport” back and forth as needed.

I combined his first two recommendations, and cannot think of these in isolation. This is worth trying. The teleport feature is kinda cool!

Use Opus 4.5 with thinking for everything; despite being larger, it’s faster overall due to better tool use and less steering.

Interesting — I already have "model":"opus" in the settings, but that may not be enough.

// ~/.claude/settings.local.json
{
  "permissions": {
    "allow": [
      "Bash(echo:*)",
      "Bash(ls:*)",
      "Bash(export LC_ALL=C)",
      "Bash(cat:*)"
    ]
  },
  "model":"opus"
}

Maintain a shared CLAUDE.md in the repo, checked into git; continuously add notes when Claude does something wrong so it learns constraints and patterns.

Yes! Although it would be good if Codex and Claude could read this by default.

Tag .claude on PRs to update CLAUDE.md during code review using the Claude Code GitHub action, building “compounding engineering.”

Need to figure out how this is different from Claude and Codex automatically reviewing the PR when the PR is opened (I think that is what the claude/gpt integrations with GH do by default)...

Start most sessions in Plan mode (shift+tab twice); iterate on the plan, then switch to auto‑accept edits for a one‑shot implementation.

Yup!

Create slash commands for frequent inner‑loop workflows; check them into .claude/commands/ to avoid repeated prompting and enable Claude to use them.

Need to investigate this x2; my most common agent reviewed PRs by taking the problem statement and the code to review...

Use subagents for common workflows, like code-simplifier after edits and verify-app for detailed end‑to‑end testing.

Makes sense, I think I have been manually asking to do this. Need to figure out if there is a way to combine the above slash commands and subagents in a loop, i.e., plan -> execute -> verify with slash commands and subagents -> (~~rinse and~~ keep repeating)...∞

Add a PostToolUse hook to format Claude's code, cleaning up the last 10% to prevent CI formatting errors. For long‑running tasks, verify with a background agent, an agent Stop hook, or the ralph‑wiggum plugin; also use local tests.

Need to investigate this x4; I combined two of his recommendations into one.

Don't skip permissions; instead, pre‑allow safe bash commands via /permissions and share defaults in .claude/settings.json.

Need to maintain this list of commands... and update the permissions blob that I shared above (that's the vanilla out-of-the-box permissions blob)

Let Claude Code use your tools: search and post to Slack via MCP, run BigQuery queries, pull Sentry logs; share Slack MCP config in .mcp.json.

Alright, MCP has been much better than using the APIs, but still need to consider how MCP pollutes the context window. Maybe there is a slash command or post-hook action that controls when the MCPs are loaded and executed.

Update (2026-01-31): A month later, some more updates from Boris:

spend the energy upfront: use worktrees, use subagents, plan more,
use a global claude.md, use memories to immortalise them after each task
create your own skills.md for repetitive tasks
connect your communities (zendesk, slack, discord, github) for claude to take a first stab at the issue
"Knowing everything you know now, scrap this and implement the elegant solution"
optimize your terminal

Using Claude Code to Optimise Terminal Performance

Varun Singh — Fri, 26 Dec 2025 00:00:00 GMT

At a hackathon in November, Eduard Ruzga, founder of Desktop Commander, recommended that I try dc/claude to organise my terminal and documents folder! So I tried this on Christmas Day! In Claude Code, use the following prompt:

Analyze my terminal setup (~/.z* files) for performance improvements.
Recommend faster CLI utilities (add to Brewfile). Suggest aliases
based on my command history.

This concise prompt works well because Claude will explore and discover what's relevant. Below is the summary based on analysing the trail logs (in ~/.claude).

Claude started by measuring my shell startup time:

for i in 1 2 3; do /usr/bin/time zsh -i -c exit 2>&1; done

Then enabled detailed profiling with zprof to identify bottlenecks. Claude examined my .zshrc and found five performance issues:

$(brew --prefix) called 4+ times (~50ms each)
NVM blocking shell startup (~100-150ms)
Unused Oh-My-Zsh theme being loaded
Multiple compinit calls
GPG agent launching on every shell

Applying the fixes:

Optimization	Time Saved
Cache `$HOMEBREW_PREFIX` in `.zprofile`	~150-200ms
NVM `--no-use` flag	~100-150ms
Empty ZSH_THEME	~20-30ms
Single compinit call	~50-80ms
Conditional GPG launch	~20-30ms

Warm start improved from ~560ms to ~250ms. I kept the localip lookup for the starship prompt; removing it would have shaved off a further ~150ms. Utility over performance.

Claude suggested 6 replacements for common commands (not aliased, though):

Tool	Replaces	Why
`fd`	`find`	5-10x faster, sane defaults
`bat`	`cat`	Syntax highlighting, git integration
`eza`	`ls`	Git status, tree view, colors
`fzf`	Ctrl+R	Fuzzy find everything
`delta`	`git diff`	Side-by-side, syntax highlighting
`btop`	`top`	Beautiful system monitor

Claude noted that tools like jq and ripgrep were already installed, and recommended skipping zoxide, dust, and procs to avoid unnecessary complexity.

In summary:

Brewfile - Added 6 fast CLI utilities
.zprofile - Cached Homebrew prefix
.zshrc - Optimized NVM, compinit, GPG
.aliases.example - list of potential aliases that I can incorporate!

Notes:
This was also inspired by Scott Spence's post on speeding up zsh, and is a follow-up to my work on syncing my new Mac.

03 Jan 2026: More discussions on x.com/deedydas about using Claude Code for terminal optimisation. Also explains why zoxide has issues as a cd replacement within Claude Code.

Codex vs Claude Code

Varun Singh — Tue, 23 Dec 2025 00:00:00 GMT

Max and I met for lunch. Our discussion veered into how much AI is coding for us and what we delegate to which model. Do we have any favourites? My working thesis is that we can go from idea to execution fairly quickly, solitarily, without much oversight and guidance. Is that a good thing? Perhaps for small pieces that work independently, it does not matter.

Benchmarks suggest codex and claude-code are very similar, but hands-on use tells a different story. Claude Code is eager to solve problems; if not given guidance, it will pick a language, an environment, and immediately start iterating. For instance, I have had it pick Node.js instead of Python.

Codex takes a different approach: it reads docs to build context, examines surrounding code, and asks clarifying questions if it is missing key requirements. This can take a bit longer on larger codebases. Only after that analysis does it provide one or two ways to solve the problem. Claude is eager to write code; Codex is more hesitant, sometimes giving me a solution inline and expecting me to copy it to the right place. That difference carries into context engineering too: Codex tends to curate the relevant context up front, while Claude enriches it in stages as it develops a solution.

Some months ago, I had raved about Claude-code's plan mode, which you enter with Shift+Tab twice. It was helpful in curtailing the eager coding assistant, but plan mode is not the default. Often, I start a conversation, press Enter, then realise it is not in plan mode and have to hit Esc to switch it over, or else it goes off and starts iterating on a solution, which may be premature. To avoid this, I keep safeguards in notes and docs (e.g., claude.md) that nudge Claude to ask more questions and think deeply before execution; if it is unsure, it should ask. Things may be improving though; recently, I have seen Claude enter plan mode by itself, and on my machine it keeps the in-memory plans documented at ~/.claude/plans. In contrast, Codex tends to do its planning as part of the default flow.

Usually, if I have a tractable problem to solve, I choose Claude. It gets to a working solution faster, and I can iterate from there. Conversely, if I have a larger problem with complex interactions and states, I almost always start with Codex. It gets the plan and architecture sorted out, lays out bite-sized pieces, and I can pass those to Claude (which feels like a productivity boost for me).

28 December 2025: Peter Steinberger wrote a more eloquent piece, which I'd summarize as: Claude is faster for smaller edits vs Codex for large refactors.

Sankalp's journey mirrors mine: a pretty sweet writeup about his experience and a phenomenal guide for starting out.

GPT Image 1.5 Prompt: Isometric City Views + Weather

Varun Singh — Tue, 16 Dec 2025 00:00:00 GMT

OpenAI has released a new image generation model, ChatGPT Images 1.5. In the API, use gpt-image-1.5, optionally with the parameter quality=low for quick image generation. I get to use these new models to generate the blog's hero images. I have gone from imagen to dalle-3 to nanobanana, and soon gpt-image-1.5!

There are lots of prompts, but the one that I liked best, h/t to Vaibhav Srivastav!

Generate the image with the following description (and look up the details 
like date and temperature, time so that you can use it in the image 
generation process): 

CITY= Helsinki, Finland

Present a clear, 45° top-down isometric miniature 3D cartoon scene of [CITY], 
featuring its most iconic landmarks and architectural elements. Use soft, 
refined textures with realistic PBR materials and gentle, lifelike lighting 
and shadows. Integrate the current weather conditions directly into the city 
environment to create an immersive atmospheric mood. Use a clean, minimalistic 
composition with a soft, solid-colored background. 

At the top-center, place the title “[CITY]” in large bold text, a prominent 
weather icon beneath it, then the date (small text) and temperature (medium text). 
All text must be centered with consistent spacing, and may subtly overlap the 
tops of the buildings. 

Square 1080x1080 dimension.

Migrating from an Old Mac (Intel) to a New Mac (MX)

Varun Singh — Sun, 14 Dec 2025 00:00:00 GMT

Migrating from an Intel MBP to an M4 Max, I wanted both machines to feel identical without ever putting secrets in git. A tiny repo, a Brewfile, and a USB stick were enough. This was made easy by your favourite coding CLI. Some planning was required to work out which files needed to be copied and which could be uploaded to git, scrubbing keys and PII from files along the way.

Layout

Everything lives under three directories: dotfiles/ holds tracked configs (.zshrc, .zprofile, .aliases, .gitconfig, .ssh/config, GPG configs, Starship, editor settings) written with $HOME rather than hard-coded usernames. usb/ is gitignored and mirrors the secret paths (.ssh, .gnupg, .config) for USB transfer. A Brewfile installs git, gnupg, pinentry-mac, zsh-autosuggestions, zsh-syntax-highlighting, starship, libpq, nvm, gh, deno, and Docker Desktop.

Source machine

Run move_secrets_to_local.sh once to push stray exports into ~/.zshrc.local. Then sync.sh collect writes clean copies into dotfiles/, after which I commit and push. Secrets travel separately: copying .ssh, .gnupg, and friends into usb/ and onto the external drive (default USB_TARGET=/Volumes/Samsung_T5/sync_computer).

Target machine

Install Homebrew if needed and run brew bundle --file Brewfile. Pull the repo and run sync.sh apply, which places the dotfiles in the appropriate locations in the $HOME directory, taking a backup only when content differs. Mount the USB and run sync.sh pull-usb to restore SSH and GPG, fix permissions, and restart gpg-agent. I finish with echo "test" | gpg --clearsign and a signed git commit to confirm pinentry works.

Notes

Pinentry must be set in ~/.gnupg/gpg-agent.conf as pinentry-program /opt/homebrew/bin/pinentry-mac, otherwise git signing complains. Normalise every path to $HOME so different usernames do not break anything. Most CLI tools are quicker to reinstall than to sync; add their dotdirs to the USB only when you truly need the state. sync.sh apply compares files before copying to avoid a pile of .bak.* artefacts.

Building with Local Models: Transcriptions

Varun Singh — Sat, 29 Nov 2025 00:00:00 GMT

It is Thanksgiving weekend and I have a few days off to experiment with local models again. This is a continuation of using mlx-audio from the previous post. However, this time, we are using NVIDIA's MLX Parakeet v2 STT model instead of MLX Whisper. You drop audio files in and get timestamped transcripts out, or record directly from your microphone. In both modes, VAD and punctuation are used as sentence boundaries: the transcript is finalised at punctuation (., !, ?), at long pauses between words (default 0.8-1 s), or at the end of the audio file.

The pipecat pipeline stores transcripts in four formats:

TXT: raw text, no timestamps
SRT: sentence-level timing, used by video players to sync subtitles
WebVTT: web-native subtitle format for the <track> element
JSON: the rich one, with sentence- and token-level timestamps

SRT and VTT only give sentence-level timing. JSON gives you both sentence-level and word-level timing. That's the difference between "this sentence was spoken between 0:00 and 0:03" and "the word 'Hello' was spoken between 0.00s and 0.40s." The latter is what makes karaoke-style highlighting possible.

The Parakeet TDT model outputs token-level timestamps — each sub-word piece gets a start and end time. "Hello" becomes three tokens: He, ll, o. Each has its own timestamp. We concatenate them for display, but the granularity means the karaoke highlighting is smooth — you see progress within a word, not just jumping word to word. For the sentence splitting logic, since tokens map to word boundaries (spaces are part of the token text), we can split without ever cutting a word in half.
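A minimal sketch of that splitting rule, not the actual pipeline code. The function name split_sentences, the max_pause default, and the token shape (text, start, end, as in the JSON output) are assumptions for illustration. Because splits only happen between tokens, a word is never cut in half.

```python
def split_sentences(tokens, max_pause=0.8):
    """Group timestamped tokens into sentences.

    A sentence ends at ., ! or ?, at a pause of max_pause seconds or more
    before the next token, or at the end of the token stream.
    """
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        is_last = i == len(tokens) - 1
        ends_with_punct = tok["text"].strip().endswith((".", "!", "?"))
        long_pause = (not is_last) and tokens[i + 1]["start"] - tok["end"] >= max_pause
        if ends_with_punct or long_pause or is_last:
            sentences.append({
                # Spaces live inside the token text, so a plain join rebuilds the words.
                "text": "".join(t["text"] for t in current).strip(),
                "start": current[0]["start"],
                "end": tok["end"],
                "tokens": current,
            })
            current = []
    return sentences


# Hypothetical token stream: "Hello." then a short gap, then a long pause.
tokens = [
    {"text": " He", "start": 0.64, "end": 0.88},
    {"text": "ll", "start": 0.88, "end": 1.12},
    {"text": "o", "start": 1.12, "end": 1.44},
    {"text": ".", "start": 1.44, "end": 1.60},
    {"text": " Thanks", "start": 1.70, "end": 2.10},
    {"text": " Bye", "start": 3.30, "end": 3.60},
]
sentences = split_sentences(tokens)
print([s["text"] for s in sentences])  # ['Hello.', 'Thanks', 'Bye']
```

A real implementation would also need to avoid splitting on periods inside numbers or abbreviations, which this sketch ignores.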

Karaoke subtitles: As audio plays, each word lights up the moment it's spoken. The subtitle display below the player shows the current sentence with spoken words in white and upcoming words in gray. This uses timeupdate events from the HTML5 player, a binary search to find the active sentence, then a per-token comparison against currentTime:

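for (const token of sentence.tokens) {
  // A token counts as spoken once playback has reached its start time.
  const spoken = currentTime >= token.start;
  const cls = spoken ? "text-white" : "text-gray-500";
  // Wrap each token in a span so the class can colour it.
  html.push(`<span class="${cls}">${escapeHtml(token.text)}</span>`);
}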
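The sentence lookup that runs before this loop is a standard binary search over sentence start times. A rough sketch in Python (the player itself is JavaScript; active_sentence is a hypothetical helper, and it assumes the sentences are sorted by start time and do not overlap):

```python
from bisect import bisect_right


def active_sentence(sentences, current_time):
    """Return the sentence being spoken at current_time, or None in a gap."""
    starts = [s["start"] for s in sentences]
    # Index of the last sentence whose start is <= current_time.
    i = bisect_right(starts, current_time) - 1
    if i >= 0 and current_time <= sentences[i]["end"]:
        return sentences[i]
    return None


sentences = [
    {"text": "first", "start": 0.64, "end": 5.2},
    {"text": "second", "start": 6.0, "end": 9.5},
]
print(active_sentence(sentences, 1.0)["text"])  # first
print(active_sentence(sentences, 5.5))          # None
```

In a real player the list of start times would be built once rather than on every timeupdate event.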

A quick cheatsheet about the transcription formats

TXT is plain text. Just the words, no timestamps. Useful for feeding into an LLM for summarisation, full-text search, pasting into a document, or diffing transcriptions from different models.

Hello, Thank you for calling the AI Engineer World's Fair 2025.

SRT is a widely supported subtitle format, stored separately. Useful for embedding subtitles in native players or editors.

1
00:00:00,640 --> 00:00:05,200
 Hello, Thank you for calling the AI Engineer World's Fair 2025.

WebVTT is the web-native subtitle format, stored as a separate file. Browsers render the subtitles natively via a <track> element on a <video> element. It is great for web accessibility; for example, screen readers can consume it.

WEBVTT

1
00:00:00.640 --> 00:00:05.200
 Hello, Thank you for calling the AI Engineer World's Fair 2025.
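The cue timings in the SRT and WebVTT examples differ only in the millisecond separator (comma versus dot). A small sketch of the conversion from seconds, where format_timestamp is a hypothetical helper rather than the pipeline's own function:

```python
def format_timestamp(seconds, sep):
    """Format seconds as HH:MM:SS<sep>mmm; sep is ',' for SRT and '.' for WebVTT."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{millis:03d}"


print(format_timestamp(0.64, ","))  # 00:00:00,640
print(format_timestamp(5.2, "."))   # 00:00:05.200
```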

JSON is the rich format — it contains sentence-level and token-level timestamps. The word-level timestamps align with audio playback, which is what makes karaoke-style highlighting possible.

{ 
  "text": "Hello, Thank you for calling the AI Engineer World's Fair 2025. ...",
  "sentences": [{
    "text": "Hello, Thank you for calling the AI Engineer World's Fair 2025.",
    "start": 0.64, "end": 5.2,
    "tokens": [
      { "text": " He", "start": 0.64, "end": 0.88, "duration": 0.24 },
      { "text": "ll", "start": 0.879, "end": 1.1199, "duration": 0.24 },
      { "text": "o", "start": 1.12, "end": 1.44, "duration": 0.32 },
      { "text": ", ", "start": 1.44, "end": 1.76, "duration": 0.32 },
      ...
    ]
  }]
}

Using Git Worktrees to Isolate Coding Agents

Varun Singh — Mon, 17 Nov 2025 00:00:00 GMT

Instead of switching branches, you can create multiple working directories, each tied to a branch. This lets a coding agent work on one task while you (or another agent instance) work on another, avoiding stashing and accidental edits.

One caveat: I previously put worktrees under ./.worktree/feature-branch and added .worktree to .gitignore. That worked fine until coding agents started traversing to the Git root, at which point they discover other worktrees or the main project itself. Once that happens, isolation is gone.

The fix is simple: do not nest worktrees inside the repo directory. Instead, put them next to it.

cd ~/code/pipecat-core
git worktree add ../pipecat-new-stt feature/new-stt-v1
git worktree add ../pipecat-new-tts feature/new-tts-v1
cd ../pipecat-new-stt
codex
cd ../pipecat-new-tts
claude
# once the work is done, clean up with git worktree remove instead of rm -rf
cd ~/code/pipecat-core
git worktree remove ../pipecat-new-stt
git worktree remove ../pipecat-new-tts

Each agent now sees only the files for its branch, and the main repo stays untouched. Branch isolation, enforced by the filesystem, turns out to be exactly what coding agents need.

Local Semantic Search with MiniLM

Varun Singh — Sun, 16 Nov 2025 00:00:00 GMT

As a side project, I am looking at how to chunk docs and code examples. The core challenge: when a user asks "how do I add a Deepgram STT to my pipeline?", keyword search alone will not reliably surface the right chunks. The query uses different words than the docs and code that answer it.

We needed semantic search, matching by meaning rather than exact tokens; thus, the question was which embedding model to use.

I picked all-MiniLM-L6-v2, a sentence-transformers model that maps text to 384-dimensional vectors. It runs entirely locally with no API keys or external calls, which matters for a developer tool that should work offline and not leak queries to a third party. The model is around 80 MB and small enough that first-run download isn't painful, and inference is fast on a CPU.

The 384-dimensional output is a deliberate trade-off. Larger models like all-mpnet-base-v2 (768 dimensions) score higher on general benchmarks, but for our domain — searching across a few thousand chunks of framework documentation and example code — the difference is negligible. The smaller vectors mean faster similarity computation and a smaller ChromaDB index on disk. We may return to this decision later, as code is not the same as text.

During ingestion, we chunk documentation and source files, then embed each chunk with all-MiniLM-L6-v2 and store the vectors in ChromaDB using cosine distance (1 minus cosine similarity, which means identical directions score 0, orthogonal vectors score 1, and opposite directions score 2). Our index currently holds 5,277 chunks — 3,996 from 306 documentation pages and 1,281 from 452 source files across two repos.
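To make the three reference points concrete, here is cosine distance written out in plain Python. This is only an illustration of the metric; ChromaDB computes it internally, and cosine_distance is a hypothetical helper name.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 2.0], [2.0, 4.0])        # same direction, about 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # right angle, 1
opposite = cosine_distance([1.0, 2.0], [-1.0, -2.0])  # opposite direction, about 2
print(round(same, 6), round(orthogonal, 6), round(opposite, 6))
```

Note that the distance ignores vector length: [1, 2] and [2, 4] are "identical" as far as the index is concerned.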

To ground that in data: a single documentation page like the Text to Speech guide is around 2,500 words and 10 code snippets. The chunker splits it into 46 records in ChromaDB, each headed by its section ("Pipeline Placement", "Frame Processing Flow", "Supported TTS Services", and so on). Each of those 46 chunks gets its own 384-dimensional embedding vector, so a query like "how does TTS handle interruptions" can match the specific section that discusses it rather than returning the entire page.

At query time, the user's natural-language question gets embedded with the same model, and ChromaDB returns the nearest neighbours.

In practice, pure vector search gets us most of the way there, but it occasionally misses results where exact terms matter — a specific class name or frame type, say. So we pair it with Best Matching 25 (BM25) keyword search over the same chunks and merge the two result sets using Reciprocal Rank Fusion (RRF). The vector arm handles semantic intent and the keyword arm catches literal matches. The combination is noticeably better than either alone for our retrieval tools.
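A minimal sketch of Reciprocal Rank Fusion, assuming each arm returns an ordered list of chunk IDs. The chunk IDs are made up, and k=60 is the constant commonly used with RRF rather than a value taken from our code; the real merge may weight or truncate the arms differently.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each item scores the sum of 1 / (k + rank) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["chunk-a", "chunk-b", "chunk-c"]   # hypothetical vector-search results
keyword_hits = ["chunk-c", "chunk-a", "chunk-d"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['chunk-a', 'chunk-c', 'chunk-b', 'chunk-d']
```

Chunks that appear in both lists rise to the top, which is the behaviour we want: semantic and literal agreement is a strong signal.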

The main rough edge is that all-MiniLM-L6-v2 was trained on general English text, not code. It handles docstrings and prose well, but for pure code chunks the embeddings are weaker. Inline code comments would help — perhaps a zealous coding agent would add them. Our chunking strategy mitigates this by preferring function boundaries and including surrounding context, but a code-specific embedding model could improve retrieval for symbol-heavy queries in a future iteration.

Structured Development Plans with Coding Agents

Varun Singh — Sat, 08 Nov 2025 00:00:00 GMT

This is a Claude Code skill that generates structured development plans as markdown files. Claude already creates internal plans that it writes to ~/.claude/plans, but it overwrites them as the work goes through phases. Think of this skill as a way to capture your conversations and decisions vis-a-vis the coding agent's perspective. If the output veers too far from your expectations, you can have a conversation with the coding agent to adjust the plan. Think of this as a way to capture the Software Development Life Cycle (SDLC) phases, design -> build -> fix -> improve, sometimes trash and repeat knowing what you have learned.

At the heart of it, the dev plans live in the repo (docs/dev_plans/), so they double as lightweight design docs. The checklist format makes it easy to track progress across sessions, and the issues section captures decisions you would otherwise forget. For larger features, the checklist of tasks can be organised in phases. Lastly, all good plans need to identify task dependencies, so that independent tasks can be handed to subagents. The skill produces a timestamped markdown file in docs/dev_plans/ (e.g. 20251012-trail-claudes-jsonl-files.md) with:

Header: status, branch, priority, dates
Context & Requirements: the why and what
Implementation Checklist: phased tasks with checkboxes and interdependencies
Technical Specs: files to touch, interfaces, decisions
Testing & Issues: test approach, problems hit, solutions found
Acceptance Criteria: definition of done for each task and phase

/dev-plan create feature auth-system   # new plan
/dev-plan update                       # update current plan
/dev-plan complete                     # mark done
/dev-plan list                         # list all plans

Updated (2025-11-17): Figured out Git Worktrees, now you can spawn agents that work on different branches.
Updated (2026-01-27): Using the fan-out skill to spawn subagents for each independent task.

Running a Voice AI Cascade Pipeline on macOS

Varun Singh — Fri, 31 Oct 2025 00:00:00 GMT

Kwindla's repo runs a fully local voice agent on macOS using Pipecat. Audio in/out happens over a local WebRTC connection. The local server handles all the complexity of speech‑to‑text, turn detection, LLM responses, and text‑to‑speech. The client is a simple React app (using voice ui kit) that connects to the local agent. Both Silero VAD and smart‑turn v2 are used together: VAD detects speech activity, smart‑turn refines turn boundaries.

Models used by server/bot.py:

VAD: Silero VAD (SileroVADAnalyzer)
Turn detection: smart‑turn (LocalSmartTurnAnalyzerV*)
STT: MLX Whisper (WhisperSTTServiceMLX, default MLXModel.LARGE_V3_TURBO_Q4)
LLM: OpenAI‑compatible HTTP API (LM Studio), default model id gemma-3n-e4b-it-text. The LLM model= must exactly match the model ID that LM Studio is serving, and LM Studio should be running on http://127.0.0.1:1234/v1.
TTS: MLX‑Audio (TTSMLXIsolated, default mlx-community/Kokoro-82M-bf16, voice af_heart)

Warm up TTS model downloads (recommended)

uv run python -m mlx_audio.tts.generate --model "mlx-community/Kokoro-82M-bf16" --text "Hello World, I'm Pipecat!" --file_prefix "output" --audio_format wav

Start up the server

cd server
uv sync
# sync will take a moment
uv run bot.py

First server run can take 30+ seconds due to model downloads; warming up TTS helps.

Run the client

cd client
npm install
npm run dev

Once done, go to localhost:3000 to see the Pipecat bot in action. See screenshots below. For TTFB, I track four moments in the logs: when the user starts speaking, when the user stops speaking (end of turn), the LLM call (first token), and when the bot starts speaking (first audio).

From the sample run below: the user spoke for ~2.68s (15:01:21.315 → 15:01:23.992). STT TTFB was ~0.225s, LLM TTFB was ~0.353s, TTS TTFB was ~0.239s, and the bot started speaking at 15:01:24.921 — about 0.93s after the user stopped, or ~0.65s after the LLM call log line.

2025-10-31 15:01:21.315 | DEBUG    | pipecat.transports.base_input:_handle_user_interruption:348 - User started speaking
2025-10-31 15:01:23.992 | DEBUG    | pipecat.audio.turn.smart_turn.base_smart_turn:analyze_end_of_turn:157 - End of Turn result: EndOfTurnState.COMPLETE
2025-10-31 15:01:23.992 | DEBUG    | pipecat.transports.base_input:_handle_user_interruption:372 - User stopped speaking
2025-10-31 15:01:24.218 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_ttfb_metrics:131 - WhisperSTTServiceMLX#5 TTFB: 0.2252826690673828
2025-10-31 15:01:24.218 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_processing_metrics:152 - WhisperSTTServiceMLX#5 processing time: 0.22540783882141113
2025-10-31 15:01:24.218 | DEBUG    | pipecat.services.whisper.stt:run_stt:511 - Transcription: [ Yeah, could you tell me an extension to that story? ]
2025-10-31 15:01:24.269 | DEBUG    | pipecat.services.openai.base_llm:_stream_chat_completions:247 - OpenAILLMService#5: Generating chat [[{"role": "user", "content": "$PROMPT"}, {"role": "assistant", "content": "Hello, I'm Pipecat!"}, {"role": "user", "content": " Yeah, could you tell me an extension to that story? "}]]
2025-10-31 15:01:24.621 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_ttfb_metrics:131 - OpenAILLMService#5 TTFB: 0.35277700424194336
2025-10-31 15:01:24.681 | DEBUG    | tts_mlx_isolated:run_tts:178 - TTSMLXIsolated#5: Generating TTS [Unit 734 kept dancing.]
2025-10-31 15:01:24.681 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:start_tts_usage_metrics:191 - TTSMLXIsolated#5 usage characters: 22
2025-10-31 15:01:24.682 | DEBUG    | tts_mlx_isolated:_send_command:104 - Sending command: {'cmd': 'generate', 'text': 'Unit 734 kept dancing.'}
Generated segment shape: (74400,), min: -0.2208, max: 0.2241
Final audio shape: (74400,), min: -0.2208, max: 0.2241
2025-10-31 15:01:24.920 | DEBUG    | tts_mlx_isolated:_send_command:127 - Worker response: success with 198400 chars of audio data
2025-10-31 15:01:24.920 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_ttfb_metrics:131 - TTSMLXIsolated#5 TTFB: 0.23913288116455078
2025-10-31 15:01:24.921 | DEBUG    | pipecat.transports.base_output:_bot_started_speaking:567 - Bot started speaking
2025-10-31 15:01:24.929 | DEBUG    | tts_mlx_isolated:run_tts:217 - TTSMLXIsolated#5: Finished TTS [Unit 734 kept dancing.]
2025-10-31 15:01:24.929 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_processing_metrics:152 - TTSMLXIsolated#5 processing time: 0.24795174598693848
2025-10-31 15:01:24.929 | DEBUG    | tts_mlx_isolated:run_tts:178 - TTSMLXIsolated#5: Generating TTS [ He learned new moves.]
2025-10-31 15:01:24.929 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:start_tts_usage_metrics:191 - TTSMLXIsolated#5 usage characters: 22
2025-10-31 15:01:24.929 | DEBUG    | tts_mlx_isolated:_send_command:104 - Sending command: {'cmd': 'generate', 'text': ' He learned new moves.'}
Generated segment shape: (43200,), min: -0.2098, max: 0.2576
Final audio shape: (43200,), min: -0.2098, max: 0.2576
2025-10-31 15:01:25.106 | DEBUG    | tts_mlx_isolated:_send_command:127 - Worker response: success with 115200 chars of audio data
2025-10-31 15:01:25.107 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_ttfb_metrics:131 - TTSMLXIsolated#5 TTFB: 0.17730188369750977
2025-10-31 15:01:25.111 | DEBUG    | tts_mlx_isolated:run_tts:217 - TTSMLXIsolated#5: Finished TTS [ He learned new moves.]
2025-10-31 15:01:25.111 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_processing_metrics:152 - TTSMLXIsolated#5 processing time: 0.18155407905578613
2025-10-31 15:01:25.111 | DEBUG    | tts_mlx_isolated:run_tts:178 - TTSMLXIsolated#5: Generating TTS [ Other robots watched.]
2025-10-31 15:01:25.111 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:start_tts_usage_metrics:191 - TTSMLXIsolated#5 usage characters: 22

(...story goes on...)

2025-10-31 15:01:26.454 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_ttfb_metrics:131 - TTSMLXIsolated#5 TTFB: 0.20525074005126953
2025-10-31 15:01:26.464 | DEBUG    | tts_mlx_isolated:run_tts:217 - TTSMLXIsolated#5: Finished TTS [ And it's okay to follow your dreams, even if you're a robot.]
2025-10-31 15:01:26.464 | DEBUG    | pipecat.processors.metrics.frame_processor_metrics:stop_processing_metrics:152 - TTSMLXIsolated#5 processing time: 0.21518397331237793
2025-10-31 15:01:49.187 | DEBUG    | pipecat.transports.base_output:_bot_stopped_speaking:583 - Bot stopped speaking
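The latencies quoted before the log can be recomputed from four of its timestamps. A small sketch, with the timestamps copied by hand from the log lines above; seconds_between is a hypothetical helper, not part of Pipecat.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start, end):
    """Elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    return delta.total_seconds()


user_started = "2025-10-31 15:01:21.315"  # User started speaking
user_stopped = "2025-10-31 15:01:23.992"  # User stopped speaking
llm_called = "2025-10-31 15:01:24.269"    # OpenAILLMService: Generating chat
bot_started = "2025-10-31 15:01:24.921"   # Bot started speaking

speech_duration = seconds_between(user_started, user_stopped)
response_delay = seconds_between(user_stopped, bot_started)
after_llm_call = seconds_between(llm_called, bot_started)
print(round(speech_duration, 3), round(response_delay, 3), round(after_llm_call, 3))
# 2.677 0.929 0.652
```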

I love the local TTS models, because for some audio quality testing, you can have the LLM create a story and then have the Pipecat bot consistently tell you that story, with tests covering interruptions, flaky internet, background noise, etc.

% uv run python -m mlx_audio.tts.generate --model "mlx-community/Kokoro-82M-bf16" --text "Unit 734 was a robot. He had gears and wires. He longed to dance. But robots aren't really built for dancing, are they? He practiced in secret. He wobbled and whirred. He tried spins and jumps. It wasn't easy. One day, the factory had a party. Music played! Unit 734 stepped forward. He danced his best dance. Everyone cheered! He showed them that even robots can dream. And sometimes, dreams come true. Unit 734 kept dancing. He learned new moves. Other robots watched. They started joining in! Soon, the factory had a robot dance club. Everyone had fun. Unit 734 proved that being different is great. It's what makes you special. And it's okay to follow your dreams, even if you're a robot." --file_prefix "output" --audio_format wav

This produces two files that can be concatenated: robot-dances-1.wav and robot-dances-2.wav. Sometimes you need something more boring, like a counting bot. In that case, use Python to stitch together the TTS with some pauses (dramatic or impatient, up to you!). Here is an example of Python code that outputs a WAV file that just counts numbers: count.wav

% uv run python counting_wav.py --start 1 --end 100 --pause-ms 400 --out count_1_100.wav
Fetching 56 files: 100%|███████████████████████████████████████████| 56/56 [00:00<00:00, 23561.14it/s]
2025-10-31 14:21:13.861 | INFO     | mlx_audio.tts.models.kokoro.kokoro:_get_pipeline:261 - Creating new KokoroPipeline for language: a
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 83.5 MB/s  0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Wrote count_1_100.wav (24000 Hz)

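The core of the script is below. It is an excerpt: parse_args, number_text, and load_model, along with the np (numpy) and sf (soundfile) imports, are defined or imported elsewhere in counting_wav.py.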

    args = parse_args()

    if args.start > args.end:
        raise SystemExit("--start must be <= --end")

    as_digits = args.as_digits

    model = load_model(args.model)
    pause_samples = int(args.sample_rate * (args.pause_ms / 1000.0))
    silence = np.zeros(pause_samples, dtype=np.float32)

    segments: List[np.ndarray] = []
    for n in range(args.start, args.end + 1):
        text = number_text(n, as_digits=as_digits)
        voice = args.voice or None
        chunk_list = list(model.generate(text=text, voice=voice, speed=args.speed))
        if not chunk_list:
            continue
        audio = np.concatenate([np.asarray(c.audio, dtype=np.float32) for c in chunk_list], axis=0)
        segments.append(audio)
        segments.append(silence)

    if not segments:
        raise SystemExit("No audio was generated.")

    full = np.concatenate(segments, axis=0)
    sf.write(args.out, full, samplerate=args.sample_rate)
    print(f"Wrote {args.out} ({args.sample_rate} Hz)")

Parsing Multi-Provider Claude Code, Codex, and Gemini Usage Logs

Varun Singh — Mon, 20 Oct 2025 00:00:00 GMT

Building trail-cli, a vendor-neutral CLI for browsing agent logs from Codex, Claude, and Gemini, showed me a few things.

Tokens are not just input + output

Across providers, "input" and "output" are composed of multiple buckets. If you only look at the headline numbers, you will undercount.

Input buckets often include user content (what the human typed), system/developer instructions (hidden but billable), tool results (outputs from tools fed back into the model), cache read tokens (retrieved from a prompt cache), and cache creation tokens (stored for reuse later).

Output buckets often include assistant text (what you see), tool calls (serialised tool invocations), and reasoning/thought tokens (when providers break them out separately).

There is also the question of how counts are reported. Some providers report per-message deltas (Claude, Gemini), others report cumulative totals (Codex's event_msg + token_count fields). If you sum cumulative counts assuming they are deltas, you will massively overcount. For cumulative reporters, use only the final total per session.
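A toy illustration of the difference, with made-up numbers. Both lists describe the same three-message session; the function names are hypothetical.

```python
def total_from_deltas(per_message_counts):
    """Per-message reporters: the session total is the sum of the deltas."""
    return sum(per_message_counts)


def total_from_cumulative(running_totals):
    """Cumulative reporters: the session total is the last reported value."""
    return running_totals[-1] if running_totals else 0


deltas = [120, 80, 200]       # tokens used by each message
cumulative = [120, 200, 400]  # running total after each message

print(total_from_deltas(deltas))          # 400
print(total_from_cumulative(cumulative))  # 400
print(sum(cumulative))                    # 720, the overcount from treating totals as deltas
```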

I ended up normalising on several categories: raw input tokens, cached input tokens, raw output tokens, cached output tokens, tool call tokens, and thinking/reasoning tokens, to name a few.

Claude stores JSONL messages with per-message usage that you sum across the session. Multiple files may belong to the same session (grouped by sessionId), and agent sessions (agent-*.jsonl) are treated as subsessions of the main session. Cache tokens (cache_creation_input_tokens + cache_read_input_tokens) need folding into cache_input_tokens for accurate cache totals. Tool calls and subagents that Claude spawns are reported within the same file.

Gemini's messages array mixes user and model responses, and thoughts appears on Gemini messages as an array of subject/description/timestamp. Its tokens object includes separate input/output/cached/thoughts/tool/total counts.

Codex items do not include per-item timestamps, so trail-cli uses session.timestamp as the session start time. Legacy snapshots (pre-September snapshots) can be rollouts of the same session, so we drop snapshots that are prefixes of later snapshots within the same day and include session.instructions in the signature. The event_msg + token_count fields report cumulative totals, so we use the final total per session. Message content may appear in multiple shapes: string, list of content parts, or payload/message nesting. Unlike Claude, Codex spawns subagents as separate session files, which we tie together with a common group ID.

Code-wise, I chose to normalise to a common model: each provider gets an adapter that emits the same Event and Session dataclasses. The CLI does not care where the data came from:

codex.py  ─┐
claude.py ─┼─▶ Event/Session ─▶ cli.py
gemini.py ─┘

Provider-specific oddities (Codex's prompt_tokens vs Claude's input_tokens, Gemini's type: "gemini" → role "assistant") stay in the adapter. If you are building usage analytics, always sum the hidden categories or you will undercount.
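A stripped-down sketch of the adapter idea, not trail-cli's actual code. The Usage dataclass is a hypothetical stand-in for the real Event and Session dataclasses, and the Codex completion_tokens key is an assumption; the Claude keys and Codex's prompt_tokens are the ones mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical normalised token counts shared by all providers."""
    input_tokens: int = 0
    cache_input_tokens: int = 0
    output_tokens: int = 0


def from_claude(raw):
    """Fold Claude's two cache fields into a single cache_input_tokens bucket."""
    return Usage(
        input_tokens=raw.get("input_tokens", 0),
        cache_input_tokens=raw.get("cache_creation_input_tokens", 0)
        + raw.get("cache_read_input_tokens", 0),
        output_tokens=raw.get("output_tokens", 0),
    )


def from_codex(raw):
    """Map Codex's naming onto the common model (completion_tokens is assumed)."""
    return Usage(
        input_tokens=raw.get("prompt_tokens", 0),
        output_tokens=raw.get("completion_tokens", 0),
    )


claude_usage = from_claude({
    "input_tokens": 10,
    "cache_creation_input_tokens": 5,
    "cache_read_input_tokens": 7,
    "output_tokens": 3,
})
codex_usage = from_codex({"prompt_tokens": 40, "completion_tokens": 9})
print(claude_usage)
print(codex_usage)
```

Everything downstream of the adapters only ever sees Usage, so adding a provider means writing one more small function.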

Telephony with wideband and narrowband, G.722 vs G.711

Varun Singh — Fri, 26 Sep 2025 00:00:00 GMT

Two quick observations about voice quality and wideband detection.

WebRTC already delivers high-quality voice using Opus (50-20000 Hz, 6-510 kbps), which supports wideband and fullband audio with bitrate adaptation and packet loss resilience. The biggest practical distinction in audio quality is heard when audio originates from traditional telephony (PSTN). In that world, legacy codecs like G.711 are narrowband (PCMU/PCMA, roughly 300-3400 Hz at 64 kbps), while G.722 is wideband (SB-ADPCM, typically 50-7000 Hz at 64 kbps). When a call traverses PSTN segments, the codec may drop to G.711 and you will hear the classic “phone” sound with reduced high-frequency detail. When endpoints and the network use G.722 (or Opus end-to-end), the voice sounds noticeably more natural and crisp.

The narrowband versus wideband issue is particularly prominent with SIP-based interconnect. Voice over LTE and Wi‑Fi Calling usually use wideband and interoperate with AMR-WB (50-7000 Hz, commonly at 12.65 kbps) and AMR-NB (300-3400 Hz, 4.75-12.2 kbps). However, when a call is routed to a non-VoLTE carrier (like many SIP providers), it falls back to G.711 (worse quality but widely supported) and sometimes G.722 (better quality). This codec switch is not conveyed to the caller, which can lead to a noticeable degradation in audio quality, especially in high-frequency content. As a result, some calls sound high quality while others sound low quality, depending on the host's or remote user's carrier.

This is easily visible in spectrograms: Opus shows energy up into the highs, while G.711 rolls off sharply around ~3.4–4 kHz.

WebRTC does not support G.722 or AMR audio codecs. It is typical for the SIP-to-WebRTC interconnect to transcode to Opus (that’s what we do at Daily). Ergo, testing carriers that support G.722 with SIP has been high on our list. To quickly tell whether a “wideband” file is truly HD or just upsampled phone audio, I coded a small utility that inspects the recording. The simple algorithm is:

First, resample the file to a consistent rate and look at its frequencies over time. Then, focus on moments with actual speech. Measure how much energy sits above 4000 Hz; real wideband speech has some, while narrowband phone audio does not. Lastly, look for a sharp drop around 3800 Hz (look between 3500 Hz and 4100 Hz); a big cliff there suggests a phone-style cutoff!
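The steps above can be sketched in pure Python. This is not the test_quality.py utility itself, just a minimal illustration that assumes mono float samples already resampled to 16 kHz; the RMS gate and the -40 dB / 30 dB thresholds are guesses that would need tuning on real recordings:

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def average_spectrum(samples, sample_rate, frame=1024):
    """Average power spectrum over Hann-windowed frames that contain signal."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame) for i in range(frame)]
    acc = [0.0] * (frame // 2)
    used = 0
    for start in range(0, len(samples) - frame + 1, frame):
        chunk = samples[start:start + frame]
        rms = math.sqrt(sum(s * s for s in chunk) / frame)
        if rms < 1e-3:  # crude "speech present" gate (assumed threshold)
            continue
        spectrum = fft([s * w for s, w in zip(chunk, window)])
        for k in range(frame // 2):
            acc[k] += abs(spectrum[k]) ** 2
        used += 1
    freqs = [k * sample_rate / frame for k in range(frame // 2)]
    return freqs, [p / max(used, 1) for p in acc]


def band_energy(freqs, power, low_hz, high_hz):
    return sum(p for f, p in zip(freqs, power) if low_hz <= f < high_hz) + 1e-12


def analyse(samples, sample_rate=16000):
    freqs, power = average_spectrum(samples, sample_rate)
    total = band_energy(freqs, power, 0, sample_rate / 2)
    # Step 1: how much energy sits above 4 kHz, relative to the whole band.
    hi_band_db = 10 * math.log10(band_energy(freqs, power, 4000, sample_rate / 2) / total)
    # Step 2: size of the cliff across the 3.5-4.1 kHz region.
    below = band_energy(freqs, power, 3000, 3500)
    above = band_energy(freqs, power, 4100, 4600)
    cliff_db = 10 * math.log10(below / above)
    return {
        "hi_band_db": hi_band_db,
        "cliff_db": cliff_db,
        "narrowband": hi_band_db < -40 and cliff_db > 30,  # assumed thresholds
    }
```

A real tool would use numpy or scipy for the FFT and a proper voice activity detector for the speech gate; the point here is only the two measurements and how they combine into a verdict.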

Or you can ask codex or claude to build the tool for you :D

% uv run test_quality.py samples/1-daily.m4a --spectrogram 1-daily.png
=== Prior 8 kHz Downsampling Detector ===
File: samples/1-daily.m4a
Analysis SR: 48000 Hz
HiBandEnergy (4-8 kHz) : -63.4 dB (relative to 0-8 kHz)
KneeSteepness (3-5 kHz): 147.9 dB/kHz
KneeFreq: 3853 Hz
Resampling consistency (residual hi-band): 0.5 dB
Verdict: Likely prior 8 kHz history  (score=4/6)

% uv run test_quality.py samples/2-carrier.wav --spectrogram 2-carrier.png
=== Prior 8 kHz Downsampling Detector ===
File: samples/2-carrier.wav
Analysis SR: 48000 Hz
HiBandEnergy (4-8 kHz) : -29.8 dB (relative to 0-8 kHz)
KneeSteepness (3-5 kHz): 43.1 dB/kHz
KneeFreq: 4972 Hz
Resampling consistency (residual hi-band): 0.1 dB
Verdict: Uncertain  (score=2/6)

Claude Code's Status Line

Varun Singh — Thu, 21 Aug 2025 00:00:00 GMT

Recently Cat (Claude Code PM) tweeted that Claude Code's status line can be customised!

My minimal status line reads: varunsingh.net main Opus 4.1. The directory name is dimmed, the git branch is blue, and the model name is green.

{
  "statusLine": {
    "type": "command",
    "command": "input=$(cat); current_dir=$(echo \"$input\" | jq -r '.workspace.current_dir'); model_name=$(echo \"$input\" | jq -r '.model.display_name'); git_branch=$(cd \"$current_dir\" 2>/dev/null && git --no-optional-locks branch --show-current 2>/dev/null || echo 'no-git'); basename_dir=$(basename \"$current_dir\"); printf \"\\033[2m%s\\033[0m \\033[1;34m%s\\033[0m \\033[1;32m%s\\033[0m\" \"$basename_dir\" \"$git_branch\" \"$model_name\""
  },
  "alwaysThinkingEnabled": true
}
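If the one-line shell command gets hard to maintain, the same logic could live in a small script that the "command" field points at. A sketch in Python, assuming the same stdin JSON fields the jq calls above read (.workspace.current_dir and .model.display_name); the script and function names are hypothetical:

```python
import json
import os
import subprocess
import sys


def git_branch(path: str) -> str:
    """Current branch name, or 'no-git' when the directory is not a repo."""
    try:
        result = subprocess.run(
            ["git", "--no-optional-locks", "branch", "--show-current"],
            cwd=path, capture_output=True, text=True, timeout=2,
        )
    except (OSError, subprocess.SubprocessError):
        return "no-git"
    return result.stdout.strip() if result.returncode == 0 else "no-git"


def render(payload: dict, branch: str) -> str:
    directory = os.path.basename(payload["workspace"]["current_dir"])
    model = payload["model"]["display_name"]
    # dim directory, bold blue branch, bold green model name
    return (f"\033[2m{directory}\033[0m "
            f"\033[1;34m{branch}\033[0m "
            f"\033[1;32m{model}\033[0m")


def main() -> None:
    payload = json.load(sys.stdin)
    branch = git_branch(payload["workspace"]["current_dir"])
    sys.stdout.write(render(payload, branch))

# Call main() at the bottom of the file when installing this as a script.
```

Keeping render() free of I/O makes it easy to check the escape codes without a repository or a running Claude Code session.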

December 2025: Hello emojis! 💻 macbook | 📂 varunsingh.net | 🌿 main* | 🤖 Claude Opus 4.5 | 📊 42%.

{
  "statusLine": {
    "type": "command",
    "command": "input=$(cat); current_dir=$(echo \"$input\" | jq -r '.workspace.current_dir'); model_name=$(echo \"$input\" | jq -r '.model.display_name'); context_pct=$(echo \"$input\" | jq -r '((.context.used // 0) / (.context.total // 1) * 100) | floor'); git_branch=$(cd \"$current_dir\" 2>/dev/null && git --no-optional-locks branch --show-current 2>/dev/null || echo ''); git_dirty=$(cd \"$current_dir\" 2>/dev/null && ! git --no-optional-locks diff --quiet 2>/dev/null && echo '*' || echo ''); basename_dir=$(basename \"$current_dir\"); host_name=$(hostname -s); printf \"\\033[2m💻 %s\\033[0m | \\033[2m📂 %s\\033[0m | \\033[1;34m🌿 %s%s\\033[0m | \\033[1;32m🤖 %s\\033[0m | \\033[2m📊 %s%%\\033[0m\" \"$host_name\" \"$basename_dir\" \"$git_branch\" \"$git_dirty\" \"$model_name\" \"$context_pct\""
  },
  "alwaysThinkingEnabled": true
}

Playwright MCP for CLIs

Varun Singh — Sun, 17 Aug 2025 00:00:00 GMT

I needed a fast way to drive headless browser tasks from a plain CLI tool without spinning up a full web UI.

This site had a small bug in the way Mermaid diagrams were rendering. Trying to prompt claude-code to do the correct thing was a bit of a pain, because it wasn't able to decipher the full context from what I was reporting (mainly copy-pasting). Taking a screenshot and pasting it into the Claude or ChatGPT app definitely worked. But I was looking for an in-situ solution, i.e., without leaving the terminal.

Looking around, I found the Playwright MCP. You can add it to claude-code with:

❯ claude mcp add playwright npx @playwright/mcp@latest

It creates a .playwright-mcp folder in the directory where you ran claude-code, and all the screenshots it takes go in there, so watch out if the directory balloons.

I asked Claude about the Playwright tools; here is the summary:

  Available Playwright-capable tools in this environment:

  - mcp__playwright__browser_install
  - mcp__playwright__browser_navigate
  - mcp__playwright__browser_tabs
  - mcp__playwright__browser_take_screenshot
  - mcp__playwright__browser_snapshot
  - mcp__playwright__browser_click
  - mcp__playwright__browser_hover
  - mcp__playwright__browser_drag
  - mcp__playwright__browser_fill_form
  - mcp__playwright__browser_type
  - mcp__playwright__browser_press_key
  - mcp__playwright__browser_select_option
  - mcp__playwright__browser_wait_for
  - mcp__playwright__browser_handle_dialog
  - mcp__playwright__browser_file_upload
  - mcp__playwright__browser_evaluate
  - mcp__playwright__browser_run_code
  - mcp__playwright__browser_resize
  - mcp__playwright__browser_navigate_back
  - mcp__playwright__browser_close
  - mcp__playwright__browser_network_requests
  - mcp__playwright__browser_console_messages

  Read-only vs. mutating calls (practical view):

  Read-only: 
  - mcp__playwright__browser_snapshot, 
  - mcp__playwright__browser_take_screenshot, 
  - mcp__playwright__browser_console_messages, 
  - mcp__playwright__browser_network_requests, 
  - mcp__playwright__browser_wait_for (only observes), 
  - mcp__playwright__browser_navigate, mcp__playwright__browser_navigate_back, mcp__playwright__browser_tabs (when listing/selecting), 
  - mcp__playwright__browser_close (closes page, doesn’t change target site), 
  - mcp__playwright__browser_resize, 
  - mcp__playwright__browser_run_code (depends on code you run; can be read-only if you only query state), 
  - browser_install (affects local tooling only).
  
  Mutating the page/session: 
  - mcp__playwright__browser_click, 
    - hover, drag, 
    - type, fill_form, press_key, select_option, handle_dialog, 
    - file_upload, evaluate (if it runs code that changes state).

Typical workflow with these tools:

Start browser (implicit) and browser_navigate to your app URL.
Interact as needed (click/type/fill_form etc.) to reach the state under test.
Observe/assert using browser_snapshot (DOM accessibility tree) or browser_take_screenshot.
Gather logs via browser_console_messages or network via browser_network_requests.
Repeat navigation/interaction/observation steps as needed, finish with browser_close if you want to clean up.

UPDATED (2025-09-26):

❯ codex mcp add playwright npx @playwright/mcp@latest
❯ gemini mcp add playwright npx @playwright/mcp@latest

In LM Studio, update mcp.json:

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": [
        "@playwright/mcp@latest"
      ]
    }
  }
}

Amidst the September flurry of announcements by all the big labs, GitHub announced an MCP registry, which makes it easier to discover these!

Also read Simon's TIL, which I chanced upon while doing my search.

A Specification for Voice AI Evaluation

Varun Singh — Wed, 30 Jul 2025 00:00:00 GMT

TL;DR: Most voice AI apps are not doing evaluations because several things matter in a real conversation: timing, interruptions, and task completion. This post introduces a practical specification for evaluating voice AI platforms using synthetic data, with clear metrics for latency, flow, and recovery. It’s designed for teams building or buying production-ready voice systems.

Why a Specification?

Most teams evaluate voice AI with ad-hoc tests that miss key conversation behaviours. Across industries, I’ve seen the same gaps: How do you measure interruption handling? What’s an acceptable latency? How do you tell if a bot sounds natural?

This is not a think-piece; it is an initial specification. Whether you're choosing a platform like Hamming, Coval, Freestyle, or Arise, or building from scratch, this evolving framework defines comprehensive testing. Contributions are welcome: DM me on Twitter @vr000m.

Specifications force clarity. Each requirement serves a purpose. Each metric has a target. Use the entire framework or just what fits. It provides a shared language for evaluating voice AI quality.

Voice AI Evaluation Specification v0.1

Changelog:

v0.1 (30 July) – Initial release of evaluation criteria and test design for voice AI systems

1. Purpose & Scope

This specification sets out how to evaluate voice AI systems in multi-turn conversations. It focuses on measuring performance, interaction quality, and control—ensuring systems behave well in real-world settings.

The Challenge of Non-Determinism

Voice AI systems combine multiple non-deterministic components: LLMs generate different responses to identical prompts, VAD triggers vary with minor audio variations, and STT confidence scores fluctuate. Because of this variability, a single test is meaningless. Repeated testing provides statistical confidence. Temperature settings alone can transform a concise assistant into a chatty companion. This is why continuous evaluation is not optional—it's essential.
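One way to make "repeated testing provides statistical confidence" concrete is to report a pass rate with a confidence interval instead of a single pass/fail. A sketch using the Wilson score interval; the scenario callable and the run count of 50 are placeholders, not part of the specification:

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96):
    """95% Wilson score interval for a pass rate (z=1.96)."""
    if runs == 0:
        return (0.0, 1.0)
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def repeat(scenario, runs: int = 50) -> dict:
    """Run one scenario many times; `scenario` returns True on success."""
    passes = sum(1 for _ in range(runs) if scenario())
    low, high = wilson_interval(passes, runs)
    return {"runs": runs, "pass_rate": passes / runs, "ci95": (low, high)}
```

With 45 passes out of 50 runs the interval is roughly 0.79 to 0.96, which is a more honest summary than "it passed when I tried it".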

Why Synthetic Data Matters

Using real customer conversations for testing creates three problems:

Privacy compliance: GDPR, CCPA, and HIPAA make using real conversations legally complex
Reproducibility: You cannot debug intermittent issues without consistent test inputs
Edge case coverage: Real data may not yet include all the edge cases that break systems

Synthetic data enables regression testing. When the LLM changes or prompts are adjusted, you can measure the impact immediately.

Setting Expectations

This specification covers system-level evaluation, not model training or prompt optimization. It answers questions like:

Does my complete voice AI system meet latency requirements?
How gracefully does it handle interruptions and errors?
Will it perform consistently across diverse user populations?

It does not cover:

How to train or fine-tune language models
Acoustic model optimization
Infrastructure scaling strategies

Integration in Your Development Lifecycle

Successful teams integrate voice AI evaluation at three stages:

Pre-deployment testing: Run the full test suite before any production release
A/B testing: Compare configurations and check that differences in outcomes are statistically significant
Production monitoring: Sample real conversations against your baseline metrics

Automation is key. Tests should run like unit tests—on commits or schedules. A dashboard showing overnight performance drift across your test suite is invaluable for catching model updates, configuration changes, or emergent behaviours before customers notice. This results in the following core principles:

All evaluation must use synthetic data to ensure reproducibility
Tests must cover both technical performance and conversational dynamics
Evaluations should be automated and CI/CD compatible
Results must be comparable across different configurations
Routine testing is essential—LLM variability demands daily or per-change runs

2. Use Case Coverage

Your evaluation framework should support a wide range of conversational patterns across industries. Testing requirements depend on the application or use-case.

Transactional Flows

Example: Pizza ordering bot

User: "I want a large pepperoni pizza"
Bot: "One large pepperoni pizza. Would you like to add any drinks?"
User: "Actually make that two pizzas and add a coke"
Bot: "Updated to two large pepperoni pizzas and one Coke. Anything else?"

Key evaluation areas:

Order updates and corrections
Confirmation accuracy
Correct total calculation
Secure payment flow

Test how the bot handles mid-order changes like "wait, change that to..." and maintains state.

Information Gathering

Example: Healthcare intake bot

Bot: "What symptoms are you experiencing today?"
User: "I've had a headache for three days and some dizziness"
Bot: "I understand. On a scale of 1-10, how severe is your headache?"
User: "Maybe a 7... no actually it's more like an 8"

Key evaluation areas:

Accurate data capture
Effective clarification strategies
Privacy-compliant handling

Use varied phrasing, from colloquial ("my head hurts") to clinical ("cephalgia and vertigo"), to test understanding.

Complex Navigation

Example: Banking IVR replacement

User: "I need to dispute a charge on my credit card"
Bot: "I can help with that. First, I'll need to verify your identity. Can you provide your account number?"
User: "I don't have it with me"
Bot: "No problem. I can look it up with your social security number and date of birth instead."

Key evaluation areas:

Handling partial or alternate user information
Robust fallback navigation
Support for uncommon queries

Design tests that follow multiple paths, including informal input and edge cases.

Voicemail and IVR Tree Navigation

Example: Appointment reminder bot

Bot: "Hi, this is Dr. Smith's office calling to confirm your appointment tomorrow at 2 PM"
[Voicemail beep detected]
Bot: [Switches to voicemail mode] "This is a reminder about your appointment tomorrow at 2 PM. Please call us back at..."

Key evaluation areas:

Voicemail and IVR detection
Timely delivery of critical information

Include test cases for common and custom voicemail greetings, business IVRs, and delayed beep scenarios.

Escalation Paths

Example: Customer service bot with human handoff

User: "This is ridiculous, I've been trying to resolve this for a few minutes!"
Bot: "I understand your frustration. Let me connect you with a representative who can help immediately."
[Bot summarizes context for human agent]

Key evaluation areas:

Accurate detection of frustration or complexity
Clear escalation logic
Quality of context summarisation

Test explicit handoff requests, tone-triggered escalations, and repeated failure cases.

Contextual Conversations

Example: Insurance claim bot

User: "I need to file a claim for my car accident"
Bot: "I'll help you with that. When did the accident occur?"
User: "Last Tuesday"
Bot: "That would be October 15th. Where did it happen?"
User: "The same intersection where I had that other claim last year"
Bot: "I see you had a claim at Main and 5th Street. Is that the location?"

Key evaluation areas:

Reference resolution (time, place, previous interactions)
Long-term memory or cross-session recall
Clarification without user frustration

These scenarios test whether the bot can recall relevant information and resolve references naturally.

Your framework must support domain-specific priorities—e.g., 100ms latency may be critical for fast food but irrelevant for insurance claims. Design flexible scoring and thresholds tailored to each use case.

3. Data Requirements

3.1 Synthetic Test Data Generation

Effective synthetic data must cover the full acoustic and conversational range your system will face in production.

Voice Synthesis Setup

Build a baseline voice library with:

Demographics: Diverse age groups and genders
Regional accents: US, UK, Irish, Australian, Indian English, etc.
Speaking patterns: Fast, slow, mumbling, clear, and casual speech
Speech characteristics: Filler words, nervousness, varying articulation

Most TTS providers support voice and rate controls; simulate other traits via prompt engineering or audio processing.

Environmental Conditions

Add realistic audio degradation to clean speech:

Background noise: Office, traffic, café, construction
Network conditions: Packet loss (1–5%), jitter (10–100ms), compression artifacts
Device simulation: Mobile, Bluetooth headset, speakerphone echo
Call quality: PSTN noise, VoIP compression, cellular signal fade
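As an illustration of the network-conditions item, here is a rough sketch that simulates packet loss on clean audio by zeroing randomly chosen 20 ms frames. The frame size, the silence-fill behaviour and the function name are assumptions; a real pipeline would more likely run audio through an actual codec and network emulator:

```python
import random


def apply_packet_loss(samples, sample_rate=16000, loss_rate=0.03,
                      frame_ms=20, seed=0):
    """Return (degraded_samples, observed_loss_ratio).

    Each frame is dropped independently with probability `loss_rate`
    and replaced by silence. A fixed seed keeps the result reproducible.
    """
    rng = random.Random(seed)
    frame = sample_rate * frame_ms // 1000
    out = list(samples)
    frames = 0
    lost = 0
    for start in range(0, len(out), frame):
        frames += 1
        if rng.random() < loss_rate:
            end = min(start + frame, len(out))
            out[start:end] = [0] * (end - start)
            lost += 1
    ratio = lost / frames if frames else 0.0
    return out, ratio
```

Storing the seed and the observed loss ratio next to the generated audio lets a failing run be reproduced exactly.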

Implementation Pipeline

Use prompts to systematically generate diverse failure cases. Automate and version-control your data generation. Requirements:

Generate configurable numbers of test scenarios (typically 100–1000 per run)
Apply voice diversity across the test set (target 80% profile coverage)
Include ambiguous intents, context confusion, and varied emotional states
Add environmental conditions systematically (noise, network, device)
Output audio in standard formats (16–24kHz WAV)
Store all relevant metadata and logs with audio for accurate result correlation
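A small sketch of how these requirements might translate into a seeded, version-controllable scenario generator. The voice, noise, network and intent names are made-up placeholders; only the shape (configurable count, voice coverage, environmental conditions, metadata stored with each case) follows the list above:

```python
import itertools
import random

# Hypothetical catalogues; replace with your own voice library and conditions.
VOICES = ["us_female_fast", "uk_male_slow", "indian_english_casual", "irish_female_clear"]
NOISE = ["clean", "office", "traffic", "cafe"]
NETWORK = [
    {"loss_pct": 0, "jitter_ms": 10},
    {"loss_pct": 3, "jitter_ms": 50},
    {"loss_pct": 5, "jitter_ms": 100},
]
INTENTS = ["order_pizza", "dispute_charge", "confirm_appointment"]


def generate_scenarios(count: int, seed: int = 42) -> list:
    """Pick `count` scenarios from the full grid, reproducibly."""
    rng = random.Random(seed)
    grid = list(itertools.product(INTENTS, VOICES, NOISE, range(len(NETWORK))))
    rng.shuffle(grid)
    picked = [grid[i % len(grid)] for i in range(count)]
    return [
        {
            "id": f"scn-{i:04d}",
            "seed": seed,
            "intent": intent,
            "voice": voice,
            "noise": noise,
            "network": NETWORK[net],
            "audio": {"format": "wav", "sample_rate_hz": 16000},
        }
        for i, (intent, voice, noise, net) in enumerate(picked)
    ]


def voice_coverage(scenarios: list) -> float:
    """Share of the voice library that appears at least once."""
    return len({s["voice"] for s in scenarios}) / len(VOICES)
```

The returned dictionaries are plain data, so they can be dumped to JSON and committed alongside the audio they describe.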

4. Functional Requirements

With synthetic test data in place, define what to measure in conversation. These requirements turn scenarios into measurable conversation dynamics.

4.1 Conversation Dynamics

Prioritize natural conversation flow, not just transcription accuracy, under real-world conditions. Focus evaluation on:

Turn-taking Analysis

Every conversation has implicit timing. Key metrics:

Response timing: User speech end to bot speech start
Interruption handling: Speed of bot response to interruptions
Context preservation: Retains context after interruptions
Recovery: Smooth handling of misunderstandings
Natural flow: Pause duration, prosody, rhythm

Thresholds vary by use case—what’s responsive for support may feel rushed for therapy.

Test timing with scenarios like:

Rapid-fire questions: Multiple queries in sequence
Hesitant speakers: Disfluent or uncertain speech
Overlapping speech: User talks before bot finishes
Fast transitions: User starts immediately after bot
Early barge-ins: Interruptions in first few bot words
Simultaneous speech: Both speak at once (can reveal latency)

Barge-in Handling

Users expect instant recognition when interrupting. Tests should cover:

Interruption detection accuracy: Avoid false positives
Speech cessation speed: TTS stops promptly
Context recovery: Bot understands what was interrupted
Resume capability: Continues appropriately if needed

Backchannel Processing

Backchannels (“mm-hmm”, “right”, “okay”) keep conversations natural. Test:

Encouragement: “uh-huh”, “go on”, “I see”
Agreement: “yes”, “right”, “exactly”
Confusion: “huh?”, “what?”, “sorry?”
Impatience: “yeah yeah”, “okay but...”

Bots should not treat every backchannel as a full turn but should acknowledge engagement.

Silence Management

Silence handling depends on context:

Silence Duration	Context	Expected Response
2–3 seconds	After question	"Take your time"
5+ seconds	Mid-explanation	"Should I continue?"
8+ seconds	Any context	"Are you still there?"
15+ seconds	Any context	Timeout handling

Adjust thresholds by intent—longer pauses are fine in form-filling, but not in rapid order flows.
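The table can be encoded as an expected-behaviour oracle for tests. A sketch that treats each row as a lower threshold (so "2–3 seconds" becomes "2 seconds or more, after a question"); the context labels and the "timeout" marker are hypothetical names:

```python
from typing import Optional


def expected_silence_response(silence_s: float, context: str) -> Optional[str]:
    """Return the response the bot is expected to give, or None to keep waiting."""
    if silence_s >= 15:
        return "timeout"
    if silence_s >= 8:
        return "Are you still there?"
    if silence_s >= 5 and context == "mid_explanation":
        return "Should I continue?"
    if silence_s >= 2 and context == "after_question":
        return "Take your time"
    return None
```

Per-intent tuning then amounts to swapping the numeric thresholds, for example from a small configuration table keyed by intent.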

4.2 Latency & Responsiveness

Every stage in the voice pipeline adds delay. Measure end-to-end performance, not just individual components.

Key latency components:

VAD triggering: Speech start/stop to detection
STT processing: Audio to transcript
LLM inference: Transcript to response
TTS synthesis: Response to audio
Audio streaming: Delivering audio to user

Metric	Type	Definition
vad_start_trigger_duration	Duration	Speech start to VAD detection
vad_stop_trigger_duration	Duration	Speech stop to VAD detection
stt_processing_duration	Duration	Speech stop (or VAD stop) to transcript complete
llm_first_token_latency	Duration	Transcript complete to first token
llm_complete_response_latency	Duration	Transcript complete to response complete
tts_synthesis_duration	Duration	Response complete to audio generation complete
audio_streaming_start_latency	Duration	Speech synthesis start to first audio packet
end_to_end_total_duration	Duration	User speech start to bot audio start

sequenceDiagram
    autonumber
    participant U as User
    participant Mic as Capture
    participant VAD as VAD
    participant STT as STT
    participant LLM as LLM
    participant TTS as TTS
    participant AO as Audio Output

    U->>Mic: user_speech_start
    Note over U,Mic: t0 = user_speech_start

    Mic->>VAD: audio frames
    VAD-->>Mic: vad_detection (start)
    Note over U,VAD: vad_start_trigger_duration

    U-->>Mic: user_speech_stop
    Mic->>VAD: trailing audio
    VAD-->>Mic: vad_detection (stop)
    Note over U,VAD: vad_stop_trigger_duration

    Mic->>STT: audio segment
    STT-->>STT: decode + finalize
    STT-->>LLM: transcript_complete (final)
    Note over VAD,STT: stt_processing_duration

    LLM-->>LLM: generate tokens
    LLM-->>TTS: first_token
    Note over STT,LLM: llm_first_token_latency

    LLM-->>TTS: response_complete
    Note over STT,LLM: llm_complete_response_latency

    %% TTS starts synthesizing as soon as it can (may be at first_token)
    LLM->>TTS: synthesis_start
    TTS-->>TTS: synthesize audio
    TTS-->>AO: audio_generation_complete
    Note over LLM,TTS: tts_synthesis_duration

    TTS-->>AO: first_audio_packet_output
    Note over TTS,AO: audio_streaming_start_latency

    AO-->>U: bot_audio_start
    Note over U,AO: end_to_end_total_duration
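Given one timestamp per event in the diagram, the metrics in the table fall out as simple differences. A sketch assuming all timestamps are seconds on a shared clock; the field names are illustrative, and stt_processing_duration is measured from VAD stop (the table allows either speech stop or VAD stop):

```python
from dataclasses import dataclass


@dataclass
class TurnTimestamps:
    user_speech_start: float
    vad_start: float
    user_speech_stop: float
    vad_stop: float
    transcript_complete: float
    llm_first_token: float
    llm_response_complete: float
    tts_synthesis_start: float
    tts_audio_complete: float
    first_audio_packet: float
    bot_audio_start: float


def latency_metrics(t: TurnTimestamps) -> dict:
    """Durations in milliseconds, rounded to 0.1 ms."""
    def ms(start: float, end: float) -> float:
        return round((end - start) * 1000, 1)

    return {
        "vad_start_trigger_duration": ms(t.user_speech_start, t.vad_start),
        "vad_stop_trigger_duration": ms(t.user_speech_stop, t.vad_stop),
        "stt_processing_duration": ms(t.vad_stop, t.transcript_complete),
        "llm_first_token_latency": ms(t.transcript_complete, t.llm_first_token),
        "llm_complete_response_latency": ms(t.transcript_complete, t.llm_response_complete),
        "tts_synthesis_duration": ms(t.llm_response_complete, t.tts_audio_complete),
        "audio_streaming_start_latency": ms(t.tts_synthesis_start, t.first_audio_packet),
        "end_to_end_total_duration": ms(t.user_speech_start, t.bot_audio_start),
        # Extra: speech stop to bot audio start, i.e. time to first audio.
        "time_to_first_audio": ms(t.user_speech_stop, t.bot_audio_start),
    }
```

Note that end_to_end_total_duration includes the length of the user's utterance, so time_to_first_audio is usually the number to compare across turns.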

Test latency under:

Peak traffic: High concurrent usage
Network degradation: 0–5% packet loss
Model switching: Different STT/LLM/TTS backends
Longer context: Increased conversation history
Ambiguous input: Disambiguation scenarios

Progressive Retry Mechanisms

Test escalation patterns that avoid user frustration:

First failure: Gentle clarification (“Could you repeat that?”)
Second: More specific help
Third: Offer alternatives
Escalation: Human handoff or alternate channel

A/B Testing Infrastructure

Automate scenario variation:

Generate multiple test variations per base scenario
Apply different voice profiles and environmental conditions
Vary complexity (simple vs multi-step)
Ensure enough cases for statistical significance

5. Evaluation Metrics

With test data and functional requirements defined, you need clear, quantifiable metrics to measure system performance. This section outlines essential metrics, quality assessment, and production monitoring.

5.1 Core Performance Metrics

Every voice AI system should track these key metrics:

Time to First Audio (TTFA)

TTFA is the end-to-end latency from when a user stops speaking to when the bot's first audio response begins. Human conversation gaps are typically 200–300ms, but for voice AI:

Cascade systems (STT→LLM→TTS): 800–1200ms is excellent, up to 1500ms is acceptable
Speech-to-speech models: 600–900ms with optimized hosting
Distributed hosting: Add 100–200ms for network overhead

Under 1 second feels responsive; 1–1.5 seconds is tolerable; over 2 seconds risks user frustration and interruption. Architecture choice impacts TTFA: cascades offer more control, speech-to-speech is faster but less transparent, and co-located hosting reduces latency at higher infra cost.

Voice Activity Detection (VAD) Accuracy

VAD errors cause:

False positives (>5%): Bot interrupts users
False negatives (>3%): Bot misses input

Aim for 95–97% accuracy in clean audio, 85–90% in noisy conditions. Below 90%, user experience suffers.
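The specification does not pin down how these percentages are computed. One common reading is frame-level comparison against labelled reference audio, which synthetic data makes easy because the speech boundaries are known. A sketch under that assumption:

```python
def vad_error_rates(reference, predicted) -> dict:
    """Compare per-frame speech labels (truthy = speech).

    reference: ground-truth labels from the synthetic test data
    predicted: labels emitted by the VAD under test
    """
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    tn = sum(1 for r, p in pairs if not r and not p)
    fp = sum(1 for r, p in pairs if not r and p)   # speech detected in silence
    fn = sum(1 for r, p in pairs if r and not p)   # speech missed
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }
```

Reporting the two error rates separately matters because they fail differently: false positives make the bot interrupt, false negatives make it ignore the user.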

Barge-in Response Time

When users interrupt, bots must respond quickly. Target <200ms for barge-in handling to reduce abandonment, especially in critical scenarios like healthcare.

Task Completion Rate

Measures how often users achieve their goal:

Customer service: 85–90%
Sales qualification: 70–75%
Appointment booking: 90–95%
Technical troubleshooting: 60–70%

Track by intent. Simpler flows (e.g. pizza order) should see higher rates than complex, multi-step tasks.

Single-Turn vs. Multi-Turn Performance

Evaluate both:

Single-turn: Intent recognition, response completeness, consistent latency
Multi-turn: Context retention, efficient turns-to-completion (3–5 is good), logical progression, recovery from confusion

Track separately; some bots excel in one area but not the other. If average turns exceed 15 for any intent, users will likely disengage.

5.2 Quality Assessment

Raw metrics show what happened; quality assessment shows if the experience was good.

LLM-Based Quality Scoring

Use LLMs to score conversation transcripts on:

Understanding: Did the bot interpret intent correctly?
Helpfulness: Was the response useful?
Naturalness: Did the exchange flow well?
Efficiency: Was the conversation concise?

Prompt example:

Evaluate this conversation on a 1–5 scale for: UNDERSTANDING, HELPFULNESS, NATURALNESS, EFFICIENCY. For each, give a score and a brief justification.

UNDERSTANDING: Did the bot correctly interpret user intent?
- Consider: Misheard words, wrong intent classification, missed context

HELPFULNESS: Did the bot provide useful responses?
- Consider: Complete answers, relevant information, problem resolution

NATURALNESS: Did the conversation flow naturally?
- Consider: Appropriate responses, good timing, personality consistency

EFFICIENCY: Was the conversation appropriately concise?
- Consider: Unnecessary questions, repetition, verbose responses

Track distributions, not just averages. Consistent 3.5s beat wild swings between 5 and 2.
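To see why the average alone hides the problem, compare two score sets with the same mean. A small sketch using the standard library; the "low score" cut-off of 2 is an arbitrary choice for illustration:

```python
import statistics


def summarise_scores(scores: list) -> dict:
    """Mean, spread and share of poor conversations for 1-5 quality scores."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "low_share": sum(1 for s in scores if s <= 2) / len(scores),
    }


steady = summarise_scores([3, 4, 3, 4])
swingy = summarise_scores([5, 2, 5, 2])
```

Both sets average 3.5, but the second has three times the spread and half of its conversations score 2 or below, which is what users will remember.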

Human Review

Supplement LLM scoring with targeted human review:

High-value or sensitive conversations
Failed tasks
Edge or emotional cases

Review 1–2% of volume, focusing on outliers.

Sentiment Tracking

Monitor sentiment shifts during conversations. A successful flow moves from neutral, through possible frustration, to positive resolution. Declining sentiment, even with task completion, signals issues.

5.3 Production Monitoring

Metrics and alerting in production are critical.

Dashboards

Track in real time (1-minute granularity):

P50/P90/P99 latency
Active conversations
Error rates (STT, TTS, LLM)
Escalation (handoff) triggers

Set alerts for:

P90 latency >1.5× baseline
Error rate >2% in 5 minutes
Escalation >20% above baseline
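These alert rules are simple enough to express directly. A sketch with a nearest-rank percentile helper; the dictionary keys are hypothetical, and "20% above baseline" is read here as a relative increase:

```python
def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def evaluate_alerts(current: dict, baseline: dict) -> list:
    """Return the names of the alert rules that fire."""
    alerts = []
    if current["p90_latency_ms"] > 1.5 * baseline["p90_latency_ms"]:
        alerts.append("p90_latency")
    if current["error_rate_5m"] > 0.02:
        alerts.append("error_rate")
    if current["escalation_rate"] > 1.2 * baseline["escalation_rate"]:
        alerts.append("escalation")
    return alerts
```

For example, percentile(latencies_ms, 90) over the last minute's turns would feed the p90_latency_ms field before evaluate_alerts runs.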

Borrow from contact center KPIs:

Containment rate: Resolved without human
Average handle time
First call resolution
Customer effort (survey)

Model Drift Detection

Performance can degrade due to language shifts, seasonal changes, or new user expectations. Flag >5% drops from 30-day baselines. Retrain quarterly, but act on sudden drops.
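A minimal sketch of the ">5% drop from a 30-day baseline" check, assuming one value per day for a metric where higher is better (for example task completion rate). The function name and return shape are illustrative:

```python
import statistics


def drift_check(daily_values: list, window: int = 30, threshold: float = 0.05) -> dict:
    """Compare the latest value with the mean of the preceding `window` days.

    daily_values is ordered oldest first; the last element is today.
    """
    if len(daily_values) <= window:
        return {"flagged": False, "drop": None}  # not enough history yet
    baseline = statistics.mean(daily_values[-window - 1:-1])
    if baseline == 0:
        return {"flagged": False, "drop": None}
    drop = (baseline - daily_values[-1]) / baseline
    return {"flagged": drop > threshold, "drop": drop}
```

For metrics where lower is better, such as latency, the sign of the comparison flips.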

Summary

Start with core metrics, add quality assessment as you grow, and build monitoring to catch problems before users do.

Moving Forward

No single evaluation framework fits every use case. This specification offers a flexible foundation—whether you’re evaluating HIPAA-sensitive healthcare bots or emotionally intelligent crisis assistants. Systematic testing beats ad-hoc guesswork.

As you evaluate platforms like Hamming, Arise, or Coval, use this specification to ask the right questions.

Ask these questions of any vendor or internal system:

– Can it test what matters for your use case?
– Does it expose the metrics you need?
– Is it CI/CD compatible?

Once you've established reliable evaluation for your current system, you're ready to explore adaptive architectures—where evaluation complexity rises, but so does performance potential.

Beyond This Specification: Adaptive Architectures (added: 15th Aug)

This specification assumes a relatively static architecture where the same models handle all conversation turns. However, emerging patterns in voice AI suggest more sophisticated approaches that would require rethinking these evaluation criteria.

Adaptive Model Selection represents the next evolution in voice AI architecture. Instead of using the same model throughout a conversation, systems dynamically route requests based on conversation context:

Light turns (greetings, confirmations): Route to fast, smaller models achieving <800ms latency
Complex reasoning: Switch to larger models, accepting 1500-2000ms for accuracy
Critical moments (medical, financial): Use best available models regardless of latency

This approach could reduce average latency by 30-40% while maintaining accuracy where it matters. However, evaluating such systems requires new metrics:

Routing accuracy: Did the system select the appropriate model for each turn?
Transition smoothness: Do model switches create noticeable personality shifts?
Cost optimisation: What percentage of turns use expensive models?
Degradation patterns: How does the system perform when preferred models are unavailable?

If you're considering adaptive architectures, treat this specification as your baseline. Establish solid evaluation practices for single-model systems first, then layer on the additional complexity of multi-model orchestration. The fundamentals—measuring latency, tracking completion rates, assessing naturalness—remain essential regardless of architectural sophistication.

Glossary

VAD (Voice Activity Detection): A signal processing technique used to detect when a speaker starts and stops talking. It impacts when the system listens, responds, or cuts off speech.

STT (Speech-to-Text): The transcription engine that converts spoken audio into text. Accuracy depends on model quality, domain vocabulary, and audio conditions.

TTS (Text-to-Speech): The synthesis engine that converts generated text responses into spoken audio. Evaluated by clarity, prosody, latency, and adaptability.

LLM (Large Language Model): The generative model used to produce responses based on text input. LLM latency and variability affect conversation flow and tone.

TTFA (Time to First Audio): The time from the end of user speech to the beginning of the bot's audio response. A key metric for conversational responsiveness.

Barge-in: When a user interrupts the bot mid-sentence. A good system detects this quickly, stops speaking, and adjusts its response contextually.

Containment Rate: Percentage of conversations resolved without human escalation. High containment indicates successful task completion by the bot.

Escalation: The process of handing a conversation off to a human agent or switching to a fallback system when the bot cannot proceed.

End-to-End Latency: Total time from the beginning of user speech to the start of bot speech, including VAD, STT, LLM, TTS, and streaming delays.

Updating robots.txt for AI/LLMs

Varun Singh — Thu, 03 Jul 2025 00:00:00 GMT

We only recently got browser use and MCPs. Now, with the recent kerfuffle around AI agents/LLMs accessing content for training, Cloudflare and other providers are going to block these agents by default unless robots.txt says otherwise. In my opinion, visibility of your content in ChatGPT, Claude, and AI search engines (AEO) improves when their user-agents aren’t blocked.

Note: IETF discussed this in a workshop in 2024 and recently published a summary, which is worth reading.

Anyway, getting back to robots.txt, I recently noticed several AI/LLM crawlers were blocked or partially restricted on this site. That meant we needed to explicitly allow key assistants and AI search crawlers like GPTBot, ChatGPT-User, Claude-Web, ClaudeBot, etc. I also added a cleaner fallback policy: I replaced a catch‑all “Disallow: /” under User-agent: * with a simpler allow‑by‑default policy and targeted Disallow rules for specific paths.

Based on Claude's research, we added Crawl-delay, understanding that not all crawlers honour it (Google ignores it; Bing may honour it).

Looking at the HTTP logs, a few recommendations:

Prefer explicit “User-agent + Allow/Disallow” per bot over relying on complex catch‑all rules.
Avoid regex-like anchors (like $); many crawlers don’t support them.
Keep a clean fallback that aligns with your intent: allow most, block only what you must. See example below:

# Search engines (Allow)
User-agent: Googlebot
Allow: /
Crawl-delay: 1

User-agent: DuckDuckBot
Allow: /
Crawl-delay: 1

# AI assistants and AI search (Allow)
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: cohere-ai
Allow: /

# Developer tools and generic scrapers (optional: keep blocked if you prefer)
User-agent: Python-urllib
Allow: /

User-agent: Python-requests
Allow: /

User-agent: wget
Allow: /

User-agent: curl
Allow: /

# Fallback: allow homepage, robots, blog, posts; limit specific sections
User-agent: *
Allow: /
Allow: /robots.txt
Disallow: /api/
Disallow: /static/
Crawl-delay: 10
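To sanity-check a policy like this before deploying it, one option is Python's standard-library urllib.robotparser. The sketch below uses a trimmed copy of the file above; example.com and UnlistedBot are placeholders. Parsers can resolve overlapping Allow/Disallow rules differently, so it only checks outcomes that do not depend on rule order:

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the policy above; example.com is a placeholder host.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /
Crawl-delay: 1

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Allow: /
Allow: /robots.txt
Disallow: /api/
Disallow: /static/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("Googlebot", "GPTBot", "ClaudeBot"):
    allowed = parser.can_fetch(agent, "https://example.com/blog/some-post")
    print(f"{agent}: allowed={allowed}, crawl_delay={parser.crawl_delay(agent)}")

# An agent without its own group falls back to the "User-agent: *" group.
print("UnlistedBot crawl_delay:", parser.crawl_delay("UnlistedBot"))
```

Running the same check against the live file after each change is a cheap way to catch an accidental block.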

Claude's Plan Mode is Brilliant

Varun Singh — Wed, 02 Jul 2025 00:00:00 GMT

One frustrating issue with Claude, and I am on the Max plan, is that it is over-eager to do the task. For the past few months, I have been appending "think deeply" whenever I want it to think before leaping into the problem. This is in addition to CLAUDE.md, which has specific instructions (see excerpts below):

in CLAUDE.md:
- “Plan first: Create development plan in /dev_plans/ …”
- “Planning new features? Create a development plan…” 

in claude/development-process.md:
- “Pre-Implementation: … Review plan for completeness and feasibility”

Last week, I chanced upon Plan Mode (invoked by pressing Shift+Tab twice). I am not sure when this was released, or if it has been around for a while, but it is super helpful. I believe Plan Mode separates research and planning from code execution, and it is partly read-only as it can create and maintain the plan or to-do list but cannot write code.

💜 this new feature. As part of the new workflow, all plans go into dev_plans/$yyyy-mm-plan-name.md! Claude used this to build the abstract-image-gen.js

Updated (2025-11-08): Converted the repetitive instruction to create a development plan into a ~/.claude/dev-plan/SKILL.md.
Updated (2026-01-27): Using the fan-out skill to spawn subagents for each independent task. Also in Jan 2026, Claude added an explicit task list, which is stored in ~/.claude/tasks.

Context Engineering Across AI Code Generators

Varun Singh — Sat, 28 Jun 2025 00:00:00 GMT

TL;DR: The evolution of code generators from desktop AI apps to CLI-based tools represents a fundamental shift in context engineering. While writing code using the claude/openai/gemini desktop apps requires manual context management, CLI tools handle it automatically, letting developers focus on the task rather than the tool.

I've spent the last month living in three different cli-based coding environments (cli-ai-code), and I've come to a realisation: the future of AI-assisted development isn't about which model you use—it's about who controls the context engineering. And increasingly the code generators are getting better.

My journey started last summer, where many developers begin: Claude's desktop app (claude). I'd carefully curate which files to share, write detailed code guidelines, and manage every aspect of the conversation. The process felt like conducting an orchestra—I was constantly directing attention, summarising when contexts grew too long, and asking for diffs instead of full file updates to manage tokens.

This approach worked, but it was exhausting. Building this very website using Claude's desktop app taught me just how much overhead manual context management creates. Every conversation turn required decisions about what to include, what to summarise, and what to leave out. Copy-pasting to and from the app was fun, but it was the slowest part of the process. You can read about that journey in detail, but the key takeaway was simple: I was spending more time managing the AI than should be required. It was also obvious that the IDE integrations that were emerging would be a better option.

Windsurf, Cursor, and GitHub Copilot offered a different experience. These tools live where developers already work, and they felt like the perfect solution. No more copy-pasting, no more context curation, just code completion and error fixing right in my editor. The agentic mode was a game-changer, because the editor could now look at the workspace or additional files to understand the context of the code. Manual context engineering in the IDEs is improving each week, and since early this year vibe-coding has been the norm for many. My experience with the IDE-based tools has been positive but a bit hit-and-miss, especially with React, where a missing file or dependency can cause the codegen to duplicate that code. Again, this is not an issue long-term, as context windows are becoming larger and the information in the .cursorfiles folder becomes automated. Overall, you need to ensure the agent has access to the correct files and proper documentation, and if your repository is large, you're essentially doing the same curation work as with desktop apps, just with a different interface. The tension between giving complete control to the agent and maintaining oversight never quite resolved for me. These IDEs excelled at micro-tasks: completing code, fixing errors and bugs, and writing test cases (I wrote a lot of the tests!).

Recently, since the Claude Sonnet 4.0 release with Claude Code in May and the re-release of codex, I've leaned into using cli-ai-code and, frankly, fallen in love with them. The experience is so much better than the IDE-based workflow, and because the UX is so constrained, you have to completely lean into the CLI's workflows and create and keep the claude.md, readme.md, and ./docs/ folder up to date. Also, the /compress and /clear commands force you to think a bit more about the context that the CLI has built so far for you, but the CLI is for the most part taking full responsibility for context engineering. There aren't toggles for which files to read. There's no manual curation. You give it a task, and it figures out what it needs. This complete delegation initially felt uncomfortable, since I was used to being in control. But the results spoke for themselves.

The UX advantage of cli-ai-code surprised me. Unlike editor-based AIs where code constantly competes for your attention, the terminal provides focused feedback. You see the thinking steps, the greps, seds and awks, the glob in/out, the regexes, the task checklist, and progress updates. When the AI is working, you're not distracted by syntax highlighting or autocompletions. Your sole focus remains on the terminal, on the plan, on the outcome. More importantly, cli-ai-code tools offer clear intervention points. You can stop midway if the approach seems wrong, or let it complete and then provide corrective guidance. This isn't the black box of agent mode in IDEs; it's transparent, iterative development. Similarly, the claude.md and .cursorfiles folder can contain detailed project specifications, coding standards, and architectural decisions. But this is the trade-off: IDE tools provide more granular control over context at the cost of requiring explicit creation and maintenance of that documentation.

The OpenAI Codex Exception: Asynchronous Development

Codex deserves special mention because it operates differently. I use it in what I call "away-from-keyboard" mode. The workflow is unique: provide a complete task specification, almost like a dev/product spec, and OpenAI spins up a container in the cloud, plans the implementation, executes it, and returns a pull request. This approach has proven invaluable during commutes or when travelling with intermittent connectivity. The key is being upfront about design and architecture because each conversation spawns a fresh container build. There's no iterative back-and-forth; you need to get the specification right upfront.

I recently used this approach to build an entire phone number transfer system for Daily. The entire project was "vibecoded"—I provided the requirements, claude/codex handled the implementation, and I reviewed the resulting PR. Similarly, a queueing system I needed was entirely written through cli-ai-code without me writing a single line of code manually. I am currently trying out gemini's CLI codegen as well.

Context Engineering: The Real Differentiator

The progression from desktop apps to CLI tools represents a fundamental shift in how we think about context. In desktop apps, context engineering is explicit and manual. You decide what the AI sees. In IDE integrations, it's semi-automatic but limited to open files and explicit selections, though I think they are getting better with workspace access. In CLI tools, it's completely delegated to the AI.

This delegation initially feels like losing control, but it's actually gaining leverage. When I use cli-ai-code, I'm not thinking about which files to include or how to structure my request to fit within token limits. I'm thinking about the problem I want to solve. The tool handles the complexity of understanding my codebase, finding relevant files, and maintaining context across operations. If it doesn't have access to the relevant files, it will ask for them.

Looking forward, the shift from manual to automatic context engineering represents more than a tooling change—it's a fundamental rethinking of the developer-AI relationship. As these tools mature, I expect we'll see even more sophisticated context understanding, better intervention mechanisms, and smoother workflows. For developers still managing context manually, I encourage you to try CLI-based tools. The initial adjustment period is worth it. You might find, as I did, that letting go of context control actually gives you more control over what matters: solving problems and building software.

The command line, that venerable interface we've used for decades, has found new purpose. And honestly? It feels like coming home.

Abstract Art for Blog Images

Varun Singh — Sat, 28 Jun 2025 00:00:00 GMT

I use a tiny Node script (create-abstract-imagen.js) to generate wide abstract hero art for the blog posts. I tried both dall-e-3 and Gemini's imagen-4.0-generate-001; there is a good mix of outputs. The prompt is simple:

A glossy, high-contrast abstract landscape artwork inspired by Orphism, Lyrical Abstraction, and early modernist painters like Kandinsky, Klee, and Malevich. Abstraction vibe, bold gradients, landscape orientation with negative space for white titles.

In addition, gpt-4o generates a 2–4 word filename slug (e.g., ai-GlowingOrbits.png). I tried image understanding, but it produced "art with circles", which is not unique given the art that has already been generated.
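For illustration only, the filename shape in that example could be produced from a short phrase like this. The actual script is Node; this Python helper, its name, and its defaults are hypothetical and not taken from the script:

```python
import re


def slug_to_filename(phrase: str, prefix: str = "ai-", ext: str = ".png") -> str:
    """Turn a 2-4 word phrase into a CamelCase filename like ai-GlowingOrbits.png."""
    words = re.findall(r"[A-Za-z0-9]+", phrase)
    if not 2 <= len(words) <= 4:
        raise ValueError(f"expected 2-4 words, got {len(words)}: {phrase!r}")
    return prefix + "".join(w.capitalize() for w in words) + ext


print(slug_to_filename("glowing orbits"))  # ai-GlowingOrbits.png
```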

Notes

Keep images/wip/ in .gitignore; copy the final picked asset to, e.g., dist/static/images/...
Landscape 1792x1024 fits the site’s hero slots; leave room on the left/top for title text.
If you regenerate, keep one “winner” per post to avoid blob bloat in git.

The End of Headcount: How GenAI is Redefining Leadership

Varun Singh — Tue, 10 Jun 2025 00:00:00 GMT

TL;DR: GenAI is enabling companies to achieve massive scale with minimal headcount, fundamentally disrupting traditional leadership hierarchies based on team size and budgets. Future executives must shift focus from managing people to orchestrating AI-human collaboration.

Last week at the AI Engineer's World Fair, engineers demonstrated how much small teams of developers could accomplish. The evidence is mounting everywhere. We're seeing companies reach $100 million in annual recurring revenue with teams that would have been considered skeleton crews in the pre-AI era. Our own teams have become progressively leaner, not through layoffs or budget cuts, but through the use of AI tools that allow each person to be more productive.

In addition to the smaller-teams debate, the community is locked in a debate about job safety, particularly in tech. One camp argues that younger folks, being AI-native, will dominate the job market. They've never known a world without ChatGPT, and they approach problems with AI as their first tool rather than their last resort. The other camp contends that high-skill veterans will transform from 10x engineers to 1000x engineers, leveraging their deep domain knowledge to build more quickly. The truth is, it doesn't matter which side of this debate you fall on. The outcome remains the same: there will be fewer employees. The optimistic view, and the one I subscribe to, is that AI will enable the formation of many more companies, albeit much smaller in size. Instead of one company with 5,000 employees, we might see 1000s of profitable companies with 10-20 employees, creating more diverse opportunities and innovation.

This brings us to the elephant in the boardroom: what happens to leadership and executive roles in this new paradigm? Traditional corporate structures evolved alongside headcount. A billion-dollar ARR company was rarely a 100-person operation—it was more likely a 1,000 to 5,000 person organization, complete with layers of management, directors, VPs, and C-suite executives.

The currency of leadership has long been headcount and budget. Executives would proudly speak of managing teams of hundreds or thousands, of budgets in the tens of millions. Performance reviews emphasized "scope of responsibility," often measured by the number of direct and indirect reports. The larger your organization chart, the more senior your position, the higher your compensation.

This entire framework is about to collapse.

When a team of 10 people augmented by AI agents can outperform a traditional team of 100, the mathematics of management change fundamentally. The question shifts from "How many people do you manage?" to "How effectively can you orchestrate AI-human collaboration?" The metric changes from headcount to impact-per-person, from budget size to efficiency ratios. Large organizations face a particularly acute challenge. They must confront the reality that AI will shrink their organizations, potentially dramatically. A department of 500 might eventually become a department of 50. This isn't just about job losses—it's about the complete dissolution of existing hierarchies. Middle management layers that existed primarily to coordinate large groups of people become redundant when AI handles coordination and routine decision-making.

The New Executive Skillset

Leaders must now shift their focus entirely. Instead of asking "How can I grow my team?" they need to ask "Who in my organization can leverage AI tools to build faster with fewer resources?" More critically, they need to evaluate whether their organizations even have the right type of people to thrive in an AI-augmented environment and start to upskill their existing team and figure out who can transform into AI-native talent.

This requires a fundamental rethinking of what leadership means. Traditional management skills—delegation, performance reviews, team building—remain relevant but become secondary to new capabilities. Leaders must become skilled at identifying AI leverage points, at knowing when human judgment is irreplaceable, and at creating systems where small teams can have outsized impact. The most successful executives of the next decade won't be those who can manage the largest teams, but those who can achieve the most with the least. They'll be measured by how much value they create per person, how effectively they blend human creativity with AI capability, how quickly they can adapt to new tools and possibilities.

For aspiring leaders, the path forward looks radically different. The traditional career progression of individual contributor to team lead to manager to director to VP becomes less relevant when teams shrink by an order of magnitude. Instead, career growth might look more like expanding the scope of problems you can solve with a small team, or launching spin-off ventures, or becoming a super-contributor who coordinates AI agents rather than human reports.

We're witnessing the end of the industrial-age organization structure. Just as the assembly line gave way to knowledge work, the knowledge work hierarchy is giving way to AI-augmented small teams. It's a revolution that will remake how we think about companies, careers, and value creation.

GenAI considered reliable-enough

Varun Singh — Mon, 26 May 2025 00:00:00 GMT

TL;DR: Just as TCP isn't 100% reliable but is considered "reliable enough" through checksums and retransmissions, GenAI can achieve appropriate reliability through guardrails, LLM-as-judge, and chain-of-thought reasoning.

In defense of Generative AI's hallucinations and errors, let's consider this: humans and our existing systems are not 100% reliable. Even TCP, the protocol we trust for reliable data transmission, isn't perfectly reliable. Lost packets trigger retransmissions, and those retransmitted packets can be lost too, which will eventually cause the connection to terminate. Nonetheless, we consider TCP to be reliable. Why? Because it's reliable enough for its intended use cases, and the mechanisms that make it more reliable have been adjusted over the past few decades. Nodes within a datacenter use DCTCP, servers delivering to end users run proprietary flavors, and end users may use CUBIC, BBR, etc.

The TCP Analogy: Understanding "Reliable Enough"

Delving deeper into how TCP works reveals several mechanisms that reduce the probability of data corruption. The protocol employs checksums to verify data integrity, ensuring that what arrives matches what was sent. It uses sequence numbers to maintain ordered delivery, so that segments arriving out of order are put back in sequence instead of corrupting the data stream. When packets are lost, TCP's retransmission mechanisms kick in, resending data until acknowledgment is received. Various timeouts govern these processes, ultimately deciding when to give up on a connection that has become unviable.
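To make the checksum idea concrete, here is a minimal sketch of the 16-bit one's-complement checksum described in RFC 1071. It is a simplification: real TCP also covers a pseudo-header with addresses and lengths, which is omitted here, and the verify helper is only illustrative:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def verify(data: bytes, checksum: int) -> bool:
    """Receiver side: recompute the checksum and compare."""
    return internet_checksum(data) == checksum


payload = b"hello, tcp"
csum = internet_checksum(payload)
print(verify(payload, csum))        # True
print(verify(b"hellp, tcp", csum))  # False: the corrupted byte is detected
```

A checksum this small cannot catch every corruption, which is part of the point: it lowers the error probability rather than eliminating it.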

TCP introduced the concept of a connection over a connection-less packet delivery model. This layered approach to reliability offers an important lesson for GenAI systems. TCP failure modes are observable and can be detected, and GenAI failure modes may not match this paradigm, but I think we can still draw some parallels.

GenAI's Reliability Mechanisms

Following the networking analogy, GenAI needs to apply corresponding resilience mechanisms. Guardrails function as circuit-breakers in the AI system, preventing the model from generating harmful or wildly incorrect content. Just as circuit breakers prevent electrical system overload and TCP's connection timeouts prevent infinite waiting, these safety boundaries ensure the system fails gracefully rather than catastrophically.

The LLM-as-a-judge pattern serves a role similar to checksums in networking protocols. Where checksums verify data integrity by comparing received data against expected values, LLM-as-judge approaches use a second model, or the same model in a different mode, to evaluate the quality and accuracy of generated content. This creates a verification layer that can catch errors before they reach the end user.

Chain of thought (CoT) reasoning provides something analogous to sequence numbers in TCP. Just as sequence numbers let the receiver put packets back in the correct order and reconstruct the original message, chain of thought reasoning ensures logical progression through a problem. It creates traceable reasoning paths that can be audited and verified, making the model's decision-making process more transparent and reliable.
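Putting the guardrail and judge ideas together, a minimal sketch might look like the following. The generate, guardrail, and judge callables are hypothetical stand-ins for real model calls, and the bounded retry plays the role of retransmission with a timeout:

```python
from typing import Callable, Optional


def generate_with_checks(
    prompt: str,
    generate: Callable[[str], str],     # hypothetical: calls the primary model
    guardrail: Callable[[str], bool],   # hypothetical: True if output is within bounds
    judge: Callable[[str, str], bool],  # hypothetical: second model accepts or rejects
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first draft that passes both checks, or None after max_attempts."""
    for _ in range(max_attempts):
        draft = generate(prompt)
        if guardrail(draft) and judge(prompt, draft):
            return draft  # accepted: safe to deliver to the user
    return None  # give up, like a connection timing out; the caller falls back


# Stubbed usage: the first draft fails the judge, the second passes.
drafts = iter(["Paris is in Germany.", "Paris is in France."])
result = generate_with_checks(
    "Where is Paris?",
    generate=lambda p: next(drafts),
    guardrail=lambda d: len(d) < 200,
    judge=lambda p, d: "France" in d,
)
print(result)  # Paris is in France.
```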

Context-Dependent Reliability

In networking, you have two fundamental choices: use TCP with its built-in reliability mechanisms, or use UDP and build your own reliability layer tailored to your specific needs. This choice depends entirely on your use case and what "reliable" means in your context.

Real-time voice and video calls demonstrate this principle perfectly. They use RTP over UDP because in conversation, latency matters more than perfection. When packets go missing, the decoder doesn't wait—it guesses and renders what it can. You might see a momentary freeze or hear a brief glitch, but the conversation continues. The system prioritizes low latency over perfect delivery because a delayed "hello" is worse than a slightly garbled one.

Streaming video services take the opposite approach. Here, media is received into a buffer before playback begins. The system can take time to ensure each packet arrives and is processed in order, playing back at the highest possible quality while carefully managing the buffer to avoid the dreaded rebuffering pause. Quality and completeness take precedence over real-time delivery because viewers would rather wait a few seconds for the video to start than watch a degraded experience. Over time, we have seen systems shift from UDP to TCP and back to UDP. For example, video-on-demand streaming used RTSP over UDP in the 90s, but unreliability and the advent of browsers meant that streaming over HTTP over TCP became the norm. More recently, because of layer ossification, HTTP over TCP is being replaced by QUIC over UDP.

The GenAI Parallel

We find ourselves in a similar situation with Generative AI and its ability to mimic, copy, guess, and create. The reliability requirements vary dramatically based on the application, just as they do in networking.

In medical diagnosis, legal document drafting, or financial analysis, we need multiple verification layers. These applications require human-in-the-loop validation, strict guardrails, and comprehensive audit trails. This is like running TCP with additional application-layer checksums—we're not just relying on the base protocol's reliability but adding extra verification because the cost of errors is too high. A misdiagnosis or a legal mistake can have life-altering consequences, so we build systems that verify, re-verify, and maintain clear chains of accountability.

On the other end of the spectrum, consider brainstorming sessions, first drafts, or entertainment applications. Here, GenAI operates more like UDP: some "packet loss" in the form of minor errors or inconsistencies is perfectly acceptable. When you're using AI to generate ideas for a marketing campaign or create variations of a design concept, perfect accuracy isn't the goal. Speed and creativity matter more than precision. A slightly nonsensical suggestion might even spark the perfect idea. Similarly, vibe-coded internal applications or proof-of-concept applications may not require the same level of reliability as production applications, and may meet the bar of "good enough".

Most interesting are the hybrid approaches that adapt their reliability requirements dynamically. Code generation paired with test verification creates a feedback loop where the AI can be creative and make mistakes, but those mistakes are caught before they matter. Content creation with fact-checking layers allows for fluid writing while ensuring accuracy where it counts. Customer service systems can seamlessly escalate to humans when confidence drops below a threshold. These systems are like adaptive protocols that can switch their error-resilience modes based on the observed needs.
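The escalation case can be sketched in a few lines. The Reply type, the confidence score, and the 0.7 threshold are assumptions for illustration; where the score comes from (a judge model, a classifier, a heuristic) is left open:

```python
from dataclasses import dataclass


@dataclass
class Reply:
    text: str
    confidence: float  # hypothetical score in [0, 1], e.g. from a judge model


def route(reply: Reply, threshold: float = 0.7) -> str:
    """Send confident replies to the user; hand the rest to a human agent."""
    return "send_to_user" if reply.confidence >= threshold else "escalate_to_human"


print(route(Reply("Your order ships tomorrow.", 0.92)))        # send_to_user
print(route(Reply("I think the refund policy is...", 0.41)))   # escalate_to_human
```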

Just as network engineers build reliable systems on unreliable networks, AI engineers must build reliable applications on probabilistic models. The key is layering your defenses. Never rely on a single checking mechanism. Multiple models reviewing each other's work, diverse prompting strategies, and varied validation approaches create a robust system that can catch different types of errors.

Matching reliability to requirements becomes crucial. Not every use case needs five-nines reliability, and trying to achieve it everywhere would be prohibitively expensive and slow. A chatbot helping users find documentation can tolerate occasional misunderstandings, while a system generating medical dosage recommendations cannot be incorrect.

We must embrace probabilistic thinking in our system design. Instead of trying to handle every edge case perfectly, we design for the 95% case and ensure the system handles the remaining 5% gracefully. This might mean clear error messages, smooth handoffs to human operators, or transparent confidence indicators that help users understand when to verify the AI's output.

Monitoring and adaptation round out the reliability strategy. Like TCP's congestion control algorithm that adjusts sending rates based on network conditions, AI systems should adapt their behavior based on performance metrics. If error rates increase, the system might automatically become more conservative, request additional verification, or route more requests to human review.
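One way to picture this adaptation is a controller loosely modelled on TCP's additive-increase/multiplicative-decrease: raise the share of requests sent to human review quickly when errors exceed a target, and relax it slowly otherwise. This is a sketch under assumed parameters (the 5% target, the step size, and the floor are all hypothetical), not a tuned policy:

```python
def adjust_review_rate(
    review_rate: float,
    error_rate: float,
    target_error: float = 0.05,
    step: float = 0.05,
    floor: float = 0.05,
) -> float:
    """Return the fraction of requests routed to human review for the next window."""
    if error_rate > target_error:
        review_rate = review_rate * 2     # become conservative quickly when errors rise
    else:
        review_rate = review_rate - step  # relax slowly while things look healthy
    return min(1.0, max(floor, review_rate))


rate = 0.10
for observed_error in (0.02, 0.12, 0.09, 0.03, 0.01):
    rate = adjust_review_rate(rate, observed_error)
    print(f"error={observed_error:.2f} -> review_rate={rate:.2f}")
```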

Conclusion: Redefining Reliability

"Reliable enough" isn't settling for less. It is engineering for reality. TCP shows us that perfect reliability isn't necessary for a protocol to be considered reliable. Similarly, GenAI doesn't need to be perfect to be transformative.

The question isn't "Is GenAI reliable?" but rather "Is GenAI reliable enough for my specific use case?" And increasingly, with the right mechanisms in place, the answer is yes.

As we continue to develop AI systems, we should focus not on eliminating all errors (an impossible task even for humans), but on building appropriate reliability mechanisms for each use case. Just as the internet thrives on "best effort" packet delivery with reliability built in layers above, GenAI can thrive with thoughtful application of context-appropriate reliability mechanisms.

The future isn't about perfect AI. It's about AI that's reliable enough for the task at hand, with well-understood failure modes and appropriate safeguards.

A more formal version of building reliable LLMs is documented in 12-Factor Agents; give it a read if you're interested in the topic.

Mega Launch Week: Gemini, Claude, and more

Varun Singh — Fri, 23 May 2025 00:00:00 GMT

TL;DR: Google I/O launch week turned out to be a great week for AI. Announcements from Google, Anthropic, and OpenAI were the highlights of the week.

AI’s Supersonic Week: A Use-Case-Centric Breakdown

This past week, the AI landscape changed again, with advancements across various domains: Code gen, image gen, video gen, and some hardware.

Code Generation: Advancements in AI Programming Assistants

The realm of AI-assisted coding has seen notable progress, with Google's Jules, Claude Code, and OpenAI Codex.

Google Jules is an asynchronous, agentic coding assistant that integrates with the codebase/repository. Since it uses Gemini 2.5 Pro with long context, it seems to have a better chance of performing well on a larger codebase. Like the others in the category, it can plan, reason, and provide a diff of the changes it made for you to review.

Claude Code with Opus 4 & Sonnet 4: Anthropic's latest models have achieved state-of-the-art results on coding benchmarks such as SWE-bench (72.5–72.7%). These models demonstrate sustained performance on long-running tasks, maintaining focus over extended periods. However, unlike codex, Claude Code is not a cloud-based agent; it is a local agent that runs on your machine. The pro is that you can chat and iterate on the code quickly, since it does not need to rebuild the sandbox for each conversation. The con is that you need to be at your desk to use it.

OpenAI Codex: Codex is a cloud-based software engineering agent designed to automate common development tasks. Integrated into ChatGPT, Codex operates in secure sandbox environments, handling tasks like writing code, debugging, and generating pull requests. Since it runs in a sandbox, it essentially runs outside of your local development environment, i.e., you can ask it to do things while on the move, but it also means that you need to upload your secrets, environment variables, etc. to ChatGPT.

These advancements are great for the software engineering landscape: multiple organizations are pushing the boundaries of AI-driven code generation, and it is no longer just VS Code plugins.

Image & Video Generation: Enhancing Creative Capabilities

AI models are increasingly capable of generating high-quality visual content:

- Veo 3: Google's latest video generation model can produce 4K videos with synchronized audio, including speech and ambient effects, based on text prompts. The accompanying tool, Flow, allows filmmakers to iteratively steer output using text, shots, and mood boards.

- OpenAI Sora: Sora remains a benchmark for physical realism; many pictures on this site were built with Sora.

- Imagen 3: Google's updated image generation model offers improved fidelity and prompt controllability, narrowing the gap with competitors like Midjourney, Sora, and DALL·E 3.

These tools are democratizing content creation, enabling users to produce professional-grade media with minimal resources.

Hardware & Ambient Agents: Integrating AI into Daily Life

AI is transitioning from software to integrated hardware solutions:

- Android XR Glasses: Demonstrated at Google I/O, these lightweight headsets offer real-time translation and Gemini overlays, providing "heads-up answers" without the need for a phone.

- Project Astra: Google's research prototype can utilize a phone's camera to remember context and perform actions across the Android UI, indicating a shift from chat-based agents to integrated operating layers.

- "io" Device (OpenAI × Jony Ive): OpenAI's acquisition of Jony Ive's startup, io, for $6.5 billion aims to develop a design-led pocket AI companion, targeting the shipment of 100 million units. This device aspires to be a screen-free, context-aware assistant, marking a significant move towards ambient AI hardware.

While early attempts like Humane's AI Pin faced challenges, the continued investment and innovation in this space suggest a promising future for AI-integrated hardware.

Note: This overview is based on developments up to May 24, 2025.

Building a Modern Personal Website with Claude, Cloudflare, and GitHub

Varun Singh — Sat, 04 Jan 2025 00:00:00 GMT

TL;DR: Turned a 10-page LaTeX resume into a modern website by collaborating with Claude, an AI assistant. Beyond just coding, the key to success was establishing clear development patterns early, maintaining thorough documentation, and treating AI as a thoughtful collaboration partner rather than just a code generator. This post shares practical lessons learned about effective AI collaboration in software development. 🚀

The Challenge 🌐

For academics and professionals in technology, maintaining an up-to-date online presence is more than a nicety—it's a necessity. I found myself in a common situation: maintaining a comprehensive LaTeX document that had evolved over a decade to include hundreds of publications, talks, patents, and other professional accomplishments. While LaTeX excelled at producing formatted documents, it created friction whenever I needed to use this information in other contexts.

Just spent 6h filling out an EB1A intake form. Why cant I upload my CV/resume which already has the information, with links. It is simpler to
- provide Google Scholar = papers, patents
- provide linkedin
- provide links to press and awards with URLs
Parse, collate, and organise
— Varun Singh (@vr000m) January 18, 2023

This tweet captured my frustration perfectly. The process of maintaining and reusing professional information was broken. Every time I gave a talk or published a paper, I would append it to my BibTeX file. This worked great for LaTeX compilation but meant manually copying and reformatting this information for other uses—visa applications, collaboration requests, or online profiles. The process was time-consuming and error-prone.

What I needed wasn't just a website, but a system that could:

Accept updates through familiar tools (text editor, git)
Store information in a structured, queryable format
Maintain the single-source-of-truth principle I had with LaTeX

This was where AI collaboration became interesting. The challenge wasn't primarily about web development—I'd built websites before. The real opportunity was to explore how AI could help build a system that would evolve with my needs while maintaining the simplicity of my current workflow. Working with Claude presented a unique opportunity to rethink not just the technical solution, but the entire development approach. The tool open-sesame facilitated this interaction with Claude 3.5 Sonnet, setting the stage for an experiment in AI-assisted development that would prove more illuminating than I initially expected. 🤖

Technical Decisions: Building for Simplicity 🛠️

The technical architecture for this project emerged from a simple premise: minimize infrastructure complexity while maintaining flexibility for content updates. Rather than getting caught up in complex technology choices, I wanted the architecture discussions with Claude to focus on solving the core problem - managing professional information effectively.

Three key requirements drove our technical decisions. First, I needed a database that could be updated via CLI tools, maintaining my existing git-based workflow. Second, I needed a way to handle blog posts and profile images without managing a complex CDN setup. Finally, the site needed to be easily deployable and maintainable. These requirements led us to a serverless approach using Cloudflare's edge services.

graph TD
    A[GitHub Repository] -->|GitHub Actions| B[Build Pipeline]
    B -->|Deploy| C[Cloudflare Pages]
    B -->|Migrate| D[Cloudflare D1]
    E[Content Updates] -->|Push| A
    F[Blog Images] -->|Upload| G[GitHub /images/blog/]
    C -->|Serve| H[Website]
    D -->|Data| H
    G -->|Assets| H
    I[Cloudflare KV] -->|Rate Limiting| H

This architecture aligned naturally with my workflow: I could continue maintaining content in text files and use simple CLI commands to sync updates to the website. More importantly, it provided a foundation for building tooling that matched my existing practices rather than forcing adaptation to a new content management paradigm.

The real challenge, however, wasn't in choosing technologies—it was in effectively collaborating with AI to build this system in a maintainable way. As we began implementing features, it became clear that the technical decisions themselves were less important than how we approached the development process. This journey of collaboration would evolve through three distinct phases, each building upon lessons from the previous one:

Establishing the Basics: Learning to communicate effectively with AI
Developing Systematic Patterns: Creating repeatable processes
Mastering Complex Development: Leveraging AI's strengths for sophisticated features

This progression from simple interactions to sophisticated collaboration would prove crucial in building a robust and maintainable system. 🔄

Evolution of AI Collaboration: From Code Generator to Development Partner 🤝

The journey of working with AI evolved naturally through distinct phases, each building upon lessons from the previous one. What began as simple code generation requests transformed into a sophisticated development partnership that improved both code quality and development practices.

My initial interactions with Claude followed a common pattern among developers new to AI collaboration - directly requesting code implementations. "I need an API endpoint for managing publications," I would say, and while the resulting code was functional, it often required significant refinement and didn't leverage the AI's full capabilities.

The first breakthrough came from a simple shift in approach. Instead of jumping straight to implementation, I began starting each feature with requirements discussions. "Let's think about what we need for publications," I would begin. "How should we structure the data to match our LaTeX format? How will we handle different publication types? What search capabilities might we need?" This seemingly small change led to more thoughtful solutions and fewer revisions. More importantly, it established a pattern where Claude would ask clarifying questions before suggesting implementations.

Our development process evolved into a systematic approach:

graph LR
    A[Problem Definition] --> B[Solution Exploration]
    B --> C[Test Design]
    C --> D[Implementation]
    D --> E[Validation]
    E --> A

As the project grew more complex, the need for more structured ways to maintain context and ensure consistency became apparent. Each development session began with a brief status update: "We're working on search functionality. In our last session, we chose SQLite FTS5 for full-text search and implemented the basic schema. Now we need to handle result ranking and highlighting." This context-setting became crucial for maintaining continuity across sessions.

A particularly valuable pattern emerged around testing. Claude's approach to test generation was systematic and thorough, often catching edge cases before they became issues in production. For instance, when implementing publication validation, what started as a simple schema check expanded into comprehensive test coverage:

describe('Publication Validation', () => {
  // Basic field validation
  test('requires title and type', () => {});
  test('validates publication date format', () => {});

  // Type-specific validation
  describe('Patent Publications', () => {
    test('requires status to be pending or granted', () => {});
    test('requires patent number for granted patents', () => {});
    test('validates patent number format', () => {});
  });

  // URL validation
  describe('Publication URLs', () => {
    test('handles multiple versions (preprint, published)', () => {});
    test('validates URL format for each type', () => {});
    test('maintains URL order', () => {});
  });

  // Edge cases
  test('handles unicode characters in titles', () => {});
  test('validates dates across timezone boundaries', () => {});
  test('handles malformed JSON in URL array', () => {});
});

Claude didn't just list test cases; it explained the rationale behind each one. "We should test timezone handling," it suggested, "because publication dates might be entered in different timezones during international conferences." This kind of contextual thinking about testing scenarios helped prevent issues that might have only surfaced in production.
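
A validator that would satisfy the first few tests above might look like the sketch below. The field names (`title`, `type`, `date`, `status`, `patentNumber`) and the YYYY-MM-DD date rule are assumptions for illustration, not the site's actual schema.

```javascript
// Hypothetical publication validator; field names are assumed, not the real schema.
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

function validatePublication(pub) {
  const errors = [];

  // Basic field validation
  if (typeof pub.title !== 'string' || pub.title.trim() === '') {
    errors.push('title is required');
  }
  if (!pub.type) errors.push('type is required');
  if (pub.date !== undefined && !ISO_DATE.test(pub.date)) {
    errors.push('date must be YYYY-MM-DD');
  }

  // Type-specific validation for patents
  if (pub.type === 'patent') {
    if (!['pending', 'granted'].includes(pub.status)) {
      errors.push('patent status must be pending or granted');
    }
    if (pub.status === 'granted' && !pub.patentNumber) {
      errors.push('granted patents need a patent number');
    }
  }

  return { valid: errors.length === 0, errors };
}
```

Keeping the date as a plain calendar string, rather than a timestamp, is one way to sidestep the timezone concern: there is no instant to shift across a boundary.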

Documentation evolved from an afterthought to a real-time activity. Important decisions were captured as they were made, creating a living reference for future discussions. When deciding how to handle publication URLs, for example, we documented not just the decision to store them as a JSON array, but also the rationale - publications often have multiple versions like preprints and final versions - and the implementation details around JSON validation in the data layer.
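
To make the JSON-array decision concrete, here is a minimal sketch of the kind of check the data layer could run when reading the column back. The entry shape (`{ kind, url }`) and the function name are hypothetical.

```javascript
// Hypothetical data-layer helper: URLs are stored as a JSON array string in one column.
function parsePublicationUrls(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, error: 'urls column is not valid JSON', urls: [] };
  }
  if (!Array.isArray(parsed)) {
    return { ok: false, error: 'urls must be a JSON array', urls: [] };
  }
  for (const entry of parsed) {
    try {
      new URL(entry.url); // throws on malformed or missing URLs
    } catch {
      return { ok: false, error: `invalid url: ${String(entry && entry.url)}`, urls: [] };
    }
  }
  return { ok: true, error: null, urls: parsed }; // array order is preserved
}
```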

The real power of AI collaboration emerged when tackling complex features like search implementation. Rather than jumping straight to code, we began with thorough problem definition. "Let's outline exactly what we need from search," I would say. "We need to search across publications, talks, and blog posts, handle partial matches, support filtering by type and date, and implement relevance ranking." This led to rich discussions about potential approaches, from using a single FTS table with type discrimination to implementing separate FTS tables with a unified API.

Each potential solution was evaluated through focused questions: "How would this handle cross-type relevance ranking? What about updates to primary records? How would it perform at scale?" This structured approach led to catching potential issues early and producing more maintainable code. Claude's suggestions became increasingly nuanced, often identifying edge cases I hadn't considered.

The process wasn't always smooth. Managing context across sessions proved challenging - a simple request to "update the search implementation" needed to become "update search ranking for publications, which currently uses basic FTS5 ranking, to prioritize recent publications." Scope creep was a constant concern, with Claude sometimes suggesting ambitious additions like automatic tagging and citation parsing. Learning to guide these conversations back to core functionality became an important skill.
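
The more specific request is also easier to turn into code. A rough sketch of "prioritise recent publications" as a re-ranking step over FTS5 hits follows; it assumes each hit carries the `bm25` value from FTS5 (where a lower value means a better match) and a `year`, and the half-life is an arbitrary illustration rather than what the site uses.

```javascript
// Re-rank full-text hits so that recent publications float up.
function rerankByRecency(hits, currentYear, halfLifeYears = 5) {
  return hits
    .map((hit) => {
      const age = Math.max(0, currentYear - hit.year);
      // 1.0 for this year, 0.5 after one half-life, 0.25 after two, and so on.
      const recency = Math.pow(0.5, age / halfLifeYears);
      const relevance = -hit.bm25; // flip the sign so that higher means better
      return { ...hit, score: relevance * recency };
    })
    .sort((a, b) => b.score - a.score);
}
```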

The challenge of maintaining simplicity emerged repeatedly. When Claude suggested implementing complex caching mechanisms, I learned to redirect the discussion: "Before we add caching, what's our actual performance bottleneck? How could we solve this with our existing tools?" These moments taught us to stay focused on immediate needs while maintaining a clear path for future enhancements.

More importantly, the testing-driven approach we had established began influencing our design decisions. Each feature discussion now naturally included consideration of edge cases and error conditions, with Claude proposing test scenarios that often revealed potential issues in our planned implementation. This "test-first" thinking helped us build more robust features from the start, rather than adding error handling as an afterthought.

Through this evolution, our collaboration with Claude progressed from basic code generation to sophisticated system design. Each phase taught valuable lessons about effective AI collaboration, from managing context to guiding complex discussions. But beyond the specific journey of this project, clear patterns emerged that could apply to any AI-assisted development work. These patterns, distilled from both successes and challenges, offer a framework for leveraging AI as a genuine development partner rather than just a coding tool. 🎯

Practical Patterns & Lessons in AI-Assisted Development 📚

Building a website might seem like a straightforward task, but collaborating with AI to do so revealed insights that could apply to any software project. The most profound lesson emerged early: time invested in establishing clear communication patterns with the LLM pays enormous dividends throughout the project lifecycle. Much like onboarding a new team member, those early conversations shape all future interactions. But unlike human teammates, AI assistants need this context-setting in each session. What could have been a limitation instead became a strength, forcing clarity and precision in our technical discussions. We discovered "Writing" as a common ground for communication.

The practice of documenting decisions in real-time transformed from a project requirement into a powerful development tool. Each major decision created a reference point for future discussions. When we later needed to extend the publication schema to handle multiple paper versions, having documented our initial reasoning about JSON storage for URLs made the decision pathway clear. This documentation served not just as a record but as a thinking tool, forcing us to articulate and examine our assumptions.

Testing became a crucial aspect of our collaboration pattern. Rather than treating tests as verification tools, they became design sessions in themselves. The systematic way Claude approached test generation helped us think through features more thoroughly. For search functionality, what started as basic query testing evolved into a comprehensive test suite:

Testing search across different content types (publications, talks, posts)
Verifying relevance ranking with mixed content
Edge cases like partial matches and special characters
Performance testing with large result sets
Handling malformed queries and invalid filters

Each test case Claude proposed revealed potential edge cases or user scenarios we hadn't considered, transforming testing from a validation exercise into a design tool that shaped implementation before writing production code.
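
As one example of what those tests pushed us towards, here is a sketch of turning raw user input into a conservative FTS5 MATCH expression. It relies on FTS5's rule that a double-quoted string is treated as text, with embedded quotes doubled; the function name and the choice to prefix-match only the last term are my assumptions.

```javascript
// Turn free-form user input into a conservative FTS5 MATCH expression.
// Quoting each term is intended to keep characters such as - : ( ) from being
// parsed as query syntax.
function toFtsQuery(input) {
  const terms = String(input ?? '')
    .split(/\s+/)
    .filter((t) => t.length > 0)
    .map((t) => `"${t.replace(/"/g, '""')}"`);
  if (terms.length === 0) return null; // nothing searchable; caller should skip MATCH
  // A trailing * makes the last term a prefix match, so "web"* also finds "webrtc".
  terms[terms.length - 1] += '*';
  return terms.join(' ');
}
```

Returning `null` for empty input lets the caller skip the query entirely instead of sending an empty MATCH to the database.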

Counterintuitively, embracing AI's context limitations led to better code organization. The need to explain feature context in each session naturally pushed us toward more modular, well-documented code. When adding blog support, for instance, each session focused on a specific aspect - data modeling, markdown processing, or search integration. This forced modularity made the code more maintainable and easier to test, benefits that extended far beyond AI collaboration.

The most surprising insight came from treating edge cases and error handling not as afterthoughts but as primary design considerations. Claude's systematic approach to questioning implementation details led to more robust code from the start. When implementing the publication API, what began as a simple CRUD interface evolved to handle nuanced cases like draft states, multiple URLs per publication, and proper error handling for malformed requests. The AI's tendency to thoroughly consider failure modes resulted in more resilient code than I might have written on my own.

Another unexpected strength emerged in API design discussions. Claude's ability to think through different use cases helped create more intuitive and flexible interfaces. For example, when designing the publication update endpoints, our discussion naturally covered:

Handling partial updates
Maintaining data consistency
Managing concurrent edits
Version history tracking
Access control implications
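
The first three items can be sketched together. This is an in-memory illustration only: the editable field list, the `version` counter, and the status codes returned are assumptions, not the project's actual endpoint.

```javascript
// Hypothetical partial-update handler with optimistic concurrency.
const EDITABLE_FIELDS = ['title', 'type', 'date', 'urls'];

function applyPartialUpdate(current, patch, expectedVersion) {
  // Concurrent edits: reject if the record changed since the client read it.
  if (current.version !== expectedVersion) {
    return { ok: false, status: 409, record: current };
  }
  // Data consistency: refuse fields the client is not allowed to change.
  const unknown = Object.keys(patch).filter((k) => !EDITABLE_FIELDS.includes(k));
  if (unknown.length > 0) {
    return { ok: false, status: 400, record: current, unknown };
  }
  // Partial update: only the supplied fields change; the rest is carried over.
  const record = { ...current, ...patch, version: current.version + 1 };
  return { ok: true, status: 200, record };
}
```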

The reality of AI collaboration proved different from initial expectations. Success came not from trying to get perfect code immediately, but from establishing a process that consistently produced maintainable, well-tested code that met project requirements. This meant being methodical, maintaining clear communication patterns, and regularly verifying that implementations aligned with project goals. The AI became most valuable not as a code generator but as a thoughtful collaborator that could challenge assumptions and suggest alternative approaches.

Perhaps most importantly, this project demonstrated that effective AI collaboration isn't about working around AI's limitations but about leveraging its unique characteristics. The need for explicit context in each session, far from being a drawback, encouraged better documentation and design practices. The AI's systematic approach to problem-solving helped catch edge cases early. Even the tendency to suggest multiple alternative approaches, which could seem like overhead, often led to more robust and well-considered solutions.

These lessons extend beyond just working with AI. Many of the patterns that emerged - clear documentation, systematic problem-solving, thorough consideration of edge cases - represent solid software development practices in any context. The AI collaboration simply made their value more apparent and their implementation more systematic.

The key to successful AI collaboration lies in treating it as a partner rather than just a tool. This means:

Starting with clear requirements and context
Documenting decisions and rationale in real-time
Using testing as a design tool
Embracing systematic thinking for edge cases
Maintaining focus on simplicity and maintainability

The complete source code for the project is available on GitHub. 🌟

Stages of AI: thinking where we are heading in 2025

Varun Singh — Mon, 30 Dec 2024 00:00:00 GMT

TL;DR: AI has rapidly evolved from basic chatbots to today's 'agents' that can execute tasks. The post charts this progression towards a potential 'autobot' stage in 2025—fully autonomous AIs capable of independent, interactive decision-making—and highlights the crucial challenge of ensuring their actions align with human values.

The Evolution of AI Chatbots

Over the Christmas break, I was vibe coding this site, varunsingh.net, and based on the experience with pipecat.ai, I started to think about agents more concretely. I think AI is going through the following stages:

chatbots (pre-2020, a chat interface responding to most common questions)
assistants (2019 - soon replaced by agents, LLM-powered chatbots)
co-pilots (2020 - human-in-the-loop, LLM-powered chatbots)
agents (2023 - LLM with access to knowledge-base, APIs, databases, etc.)
autobots or better name (2025 - agents that can take actions)

Understanding Each Stage

Chatbots were simple, rules-based request-and-response bots; this was before we had LLMs.

Assistants are basically chatbots that are more reliable, similar to GPT 3.5/ChatGPT. These had an inherent understanding of language and could string together compelling statements based on their training. With the help of Retrieval-Augmented Generation (RAG) and vector databases, we are able to add use-case or customer-specific knowledge bases that the LLM can collate to form the response.

Co-pilots are, as the word suggests, assistants that have some persistence, i.e., they monitor the actions that the user is taking and, based on those actions, provide guidance to the user. In coding, we have GitHub Copilot, Cursor, and other tools that are able to provide code suggestions based on the context of the code. In healthcare, there are several tools that doctors and medical providers are using to summarise patient notes, provide recommendations, and provide reminders. Lastly, customer support agents are getting pings from the AI co-pilot while they are conversing with the end-user or while they are working on a ticket.

Agents are the obvious next step: give the co-pilot or assistant the ability to take actions, i.e., the user of the agent moves from providing instructions to describing outcomes. This is the big shift that we are seeing with code generation, and I can easily see it happening elsewhere, such as with sales and CRMs, revenue recovery, and simple support actions.

The Next Frontier: Autobots

Lastly, the autobots (I think some people call them auto agents), i.e., agents that can interact with other agents and take actions outside of their sandbox. In the above CRM example, we may have an artificial boundary that a CRM application may not automatically terminate an unpaid account with accrued dues of several months. In the agents stage, maybe it would send a notification to a human that the account has been in revenue recovery for a few months, and delegate the decision of termination to the human in the loop. In the autobot phase, it may decide between terminating access and sending an extra set of reminder emails based on the value of the account. The thing we need to think about is how we ensure that autobots make decisions aligned with human values when they're operating independently.

Going into 2025, we are definitely in the agents phase. The question is: will we make autobots this year?

The Re-emergence of SIP: How Voice AI Brought Back the Beast

Varun Singh — Mon, 15 Jul 2024 00:00:00 GMT

TL;DR: SIP—the complex telephony protocol we thought WebRTC would retire—has re‑emerged as the backbone of Voice AI bots in contact centres, dragging legacy headaches like unencrypted media and low‑quality audio with it, so we must relearn those 1990s quirks and patch them fast.

I genuinely thought we'd left SIP behind. When we built WebRTC in the early 2010s, it felt like we were creating a cleaner, more modern path forward for real-time communications. Yet here we are in 2024, and SIP has returned from the periphery to claim centre stage once again. The catalyst? Voice AI.

The Legacy Beast

SIP connected computer systems to telephony networks from the late 1990s through the 2000s. It became the foundation of everything from office PBX systems to 3G IMS architecture. At the time, it was revolutionary—the easiest way to bridge the gap between traditional telephony and computer systems.

But "easiest" came with a price. SIP sprawled across hundreds of RFCs, each addressing different use cases. Need to implement muting in a conference and signal that across to everyone? There are probably a couple of RFCs for that. The real complexity emerged when different vendors implemented similar features using completely different approaches, which made sense, since they were competing on time to market and only later brought what they had built to the standards bodies. Take DTMF tones as an example: there are three standardised ways to send them. These are in-band audio, RFC 4733 telephone-events (often still called "RFC 2833"), and SIP INFO messages. Three! Each one equally valid, which means any serious Voice AI implementation needs to support all three for broad compatibility.

The protocol became a testament to the xkcd comic about standards: the best thing about standards is that there is always N+1 (the one being your way of doing it versus the others 😆).

The WebRTC Promise

When WebRTC emerged, we'd learnt from SIP and carved out a cleaner path with standardised APIs and a more focused feature set, a WebRTC "profile" if you will. For the past 15 years, this vision seemed to be playing out. WebRTC powered the explosion of video calling applications. Modern contact centres powered by WebRTC emerged, mainly waiting for legacy devices to be obsoleted and contact centre operators moving away from CAPEX (buy telephony hardware) to OPEX (buy seats on a CCaaS, CPaaS, or XCaaS).

SIP retreated to the edges. It was still there for legacy integrations where modern systems needed to connect with traditional telephony infrastructure, but it was surely fading away as those systems modernised.

The Voice AI Revolution

Voice AI is emerging and changing everything.

Contact centres are embracing voice bots at an unprecedented pace. These AI agents are replacing humans in numerous workflows—from basic customer service queries to complex multi-stage workflows. But here's the catch: these bots can only communicate with customers through existing telephony infrastructure. And what powers that infrastructure? SIP.

Suddenly, SIP is essential. Every voice AI company building for the enterprise market needs SIP. The protocol we thought we'd deprecated has become the gateway to one of the most exciting areas of AI development.

Most production voice bots still run a three‑stage pipeline—streaming automatic‑speech‑recognition, an LLM for response, and text‑to‑speech for reply—which adds roughly 300 ms of latency without the VAD and networking delay. Emerging speech‑to‑speech models preserve speaker prosody, but they hide the intermediate text, making debugging and compliance logging trickier. Either way, low‑latency hand‑off to the PSTN depends on SIP-routing behaving itself.

This resurrection brings all of SIP's historical baggage back to the forefront. We've grown used to the Opus codec delivering crystal-clear 48 kHz audio with built-in error resilience. Now we're back to G.711's 8 kHz sampling rate, audio quality that grates on the modern ear. Although wide-band codecs such as G.722 and even Opus wrapped in RTP are widely implemented on modern SBCs, patchy carrier support often forces negotiations back down to G.711, keeping audio quality firmly in narrow-band territory.
When a bot’s 16 kHz or 24 kHz synth is transcoded down to 8 kHz G.711, some of the intelligibility and emotion vanish, which is why landing even on G.722 can feel like night and day.
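
A small sketch of how that fallback plays out in code: read the audio codecs from an SDP body and pick the best one both sides might support. The preference order is my assumption; the static payload types (0 for PCMU, 8 for PCMA, 9 for G.722) come from RFC 3551 and may appear without an rtpmap line.

```javascript
// Preference order is an assumption for illustration: Opus, then G.722, then G.711.
const PREFERENCE = ['opus', 'G722', 'PCMU', 'PCMA'];
// Static payload types from RFC 3551 may appear without an rtpmap line.
const STATIC_TYPES = { 0: 'PCMU', 8: 'PCMA', 9: 'G722' };

function pickAudioCodec(sdp) {
  const lines = sdp.split(/\r?\n/);
  const mLine = lines.find((l) => l.startsWith('m=audio '));
  if (!mLine) return null;
  const payloadTypes = mLine.split(' ').slice(3); // m=audio <port> <proto> <fmt ...>
  const names = new Map();
  for (const pt of payloadTypes) {
    if (STATIC_TYPES[pt]) names.set(pt, STATIC_TYPES[pt]);
  }
  for (const l of lines) {
    const m = /^a=rtpmap:(\d+) ([^/]+)\//.exec(l);
    if (m && payloadTypes.includes(m[1])) names.set(m[1], m[2]);
  }
  const offered = [...names.values()].map((n) => n.toLowerCase());
  const choice = PREFERENCE.find((p) => offered.includes(p.toLowerCase()));
  return choice ?? null;
}
```

A carrier offer that lists only payload types 0 and 8 lands on PCMU here, which is the narrow-band outcome described above.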

Encryption, or its absence, rarely features in PSTN conversations, yet WebRTC pipelines refuse null ciphers. As a WebRTC developer working with legacy SIP and the PSTN, you must accommodate three modes: plain RTP for the PSTN, SDES-keyed SRTP for legacy SIP, and DTLS-SRTP for WebRTC. SRTP is defined for SIP, but carrier hops almost never preserve it end-to-end, so voice bots usually land on plain RTP. SIP over TLS (SIPS) protects the signalling, but the media plane usually still falls back to plain RTP.
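
The three modes are visible in the SDP itself, so a gateway can tell them apart before any media flows. A minimal sketch, assuming a single audio section and looking only at the transport profile plus the `a=crypto` and `a=fingerprint` attributes; the returned labels are my own.

```javascript
// Classify how an SDP audio section intends to protect (or not protect) its media.
function mediaSecurityMode(sdp) {
  const lines = sdp.split(/\r?\n/);
  const mLine = lines.find((l) => l.startsWith('m=audio '));
  if (!mLine) return 'unknown';
  const proto = mLine.split(' ')[2]; // m=audio <port> <proto> <fmt ...>
  const hasFingerprint = lines.some((l) => l.startsWith('a=fingerprint:'));
  const hasCrypto = lines.some((l) => l.startsWith('a=crypto:'));

  // DTLS-SRTP: keys are negotiated in a DTLS handshake on the media path.
  if (proto.startsWith('UDP/TLS/RTP/SAVP') && hasFingerprint) return 'dtls-srtp';
  // SDES: keys travel inside the SDP, so they are only as safe as the signalling.
  if (proto.startsWith('RTP/SAVP') && hasCrypto) return 'sdes-srtp';
  if (proto.startsWith('RTP/AVP')) return 'plain-rtp';
  return 'unknown';
}
```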

And then there's DTMF ("Press 1 for English, 2 for Spanish"). Those three different implementation methods I mentioned? They're not just academic concerns. Voice bots, or the infrastructure in front of them, need to reliably detect when users press phone keys, whether for authentication, menu navigation, or input capture (think "Enter your social security number"). Missing or misinterpreting a DTMF tone isn't just a bug; it's a failed customer interaction. So, as Voice AI infrastructure, or as CPaaS and CCaaS vendors, we need to support all three methods.
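
Two of the three methods are easy to decode once you know the wire format. The sketch below decodes the 4-byte RFC 4733 telephone-event payload, and parses a SIP INFO body in the commonly seen `application/dtmf-relay` form. In-band tone detection needs signal processing and is left out. The function names and return shapes are mine.

```javascript
// RFC 4733 event codes 0-15 map to the sixteen DTMF symbols.
const DTMF_SYMBOLS = '0123456789*#ABCD';

// Decode the 4-byte telephone-event payload carried in an RTP packet:
// event (8 bits) | E (1 bit) | R (1 bit) | volume (6 bits) | duration (16 bits).
function decodeTelephoneEvent(payload) {
  if (payload.length < 4 || payload[0] > 15) return null;
  return {
    digit: DTMF_SYMBOLS[payload[0]],
    end: (payload[1] & 0x80) !== 0, // E bit: set on the final packets of a key press
    volume: payload[1] & 0x3f, // power level, expressed in -dBm0
    duration: (payload[2] << 8) | payload[3], // in RTP timestamp units
  };
}

// Parse a SIP INFO body such as "Signal=5\r\nDuration=160".
// Returns null when no Signal line is present.
function decodeSipInfoDtmf(body) {
  const signal = /^Signal\s*=\s*([0-9*#A-D])\s*$/im.exec(body);
  const duration = /^Duration\s*=\s*(\d+)\s*$/im.exec(body);
  if (!signal) return null;
  return {
    digit: signal[1].toUpperCase(),
    durationMs: duration ? Number(duration[1]) : null,
  };
}
```

With the usual 8000 Hz telephone-event clock, a duration of 800 timestamp units corresponds to 100 ms of key press. Both decoders return the same `digit` field so the bot logic downstream does not need to care which method the carrier chose.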

These quirks multiply quickly. Early media behaviour varies wildly between carriers. Some send audio before the call officially connects; others don't. Some honour specific headers; others ignore them. Testing becomes a nightmare of edge cases and carrier-specific workarounds.

Old Challenges, Novel Solutions: AMD

Interestingly, the Voice AI era has introduced problems that traditional telephony never properly solved. Voicemail detection stands out as particularly thorny. When a voice bot makes an outbound call, it needs to determine whether it's reached a human or a voicemail system. Current Answering Machine Detection (AMD) systems from CPaaS vendors are notoriously unreliable. But here's where things get interesting: LLMs might actually be quite good at this. Instead of relying on simplistic audio analysis, an LLM can understand context and content. Is the voice saying "Hello?" or "Hi, you've reached John's voicemail"? For an LLM, that's a straightforward classification problem. I suspect we'll see voicemail detection become a solved problem through better prompting, leaning into the LLM's probabilistic nature rather than a deterministic AMD algorithm. It's an elegant example of how AI can easily solve a technically complex problem of the past.
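
To show how small the LLM side of this could be, here is a sketch of the prompt and the reply handling. The label set and prompt wording are assumptions, and the actual model call is deliberately left out: pass the prompt to whatever LLM client you already use and feed its text reply to `parseAmdReply`.

```javascript
// Labels are an assumption; "unsure" gives the bot a safe default.
const AMD_LABELS = ['human', 'voicemail', 'unsure'];

// Build a classification prompt from the first seconds of transcribed callee audio.
function buildAmdPrompt(transcript) {
  return [
    'You are screening the first seconds of an outbound phone call.',
    'Classify the callee transcript as one of: human, voicemail, unsure.',
    'Answer with the single label only.',
    `Transcript: """${transcript}"""`,
  ].join('\n');
}

// Anything outside the label set is treated as "unsure" rather than trusted.
function parseAmdReply(reply) {
  const label = String(reply).trim().toLowerCase();
  return AMD_LABELS.includes(label) ? label : 'unsure';
}
```

Treating unexpected replies as "unsure" keeps the probabilistic part contained: the bot can wait for more audio and ask again instead of acting on a malformed answer.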

The Path Forward

As much as I might sigh about SIP's return, I'm optimistic about where this leads. Yes, we'll need to retrain a generation of engineers on protocols they never expected to learn. Outbound traffic must also satisfy STIR/SHAKEN caller‑ID attestation; unsigned AI calls risk displaying “Spam Likely” on modern handsets. Yes, we'll spend countless hours debugging carrier-specific behaviours and codec negotiations.

But the end result—voice bots that actually work—will transform customer experiences. Today's IVR trees are universally despised. They trap customers in rigid menu structures, forcing them to navigate byzantine option trees just to reach a human. Voice AI promises natural conversations that understand intent and resolve issues efficiently.

The irony isn't lost on me. We're using cutting-edge AI technology built on top of a protocol designed when dial-up modems were cutting-edge. But perhaps that's the nature of real technological progress—not always replacing the old, but finding new ways to make it valuable again.

SIP is back. Time to dust off those RFCs.

Why Voice AI Does Not Need SFUs

Varun Singh — Sun, 23 Jun 2024 00:00:00 GMT

TL;DR: Voice AI applications should bypass SFUs and connect directly to voice bots via WebRTC, reducing jitter buffer delays and simplifying infrastructure whilst leveraging WebRTC's battle-tested last-mile optimisations.

Recently, in conversation with Emil Ivov, we discussed whether SFUs are needed. He builds and operates one at Jitsi, while at Daily.co we do the same. We recently open-sourced Pipecat.ai, which made this conversation more relevant. Foreshadowing: we largely agreed that architecture changes are afoot.

The rise of Voice AI has brought new attention to an old debate in real-time communications: when do we actually need a Selective Forwarding Unit (SFU)? As someone who's spent years optimising WebRTC infrastructure, I've watched the pendulum swing from peer-to-peer (P2P) to SFU and now, I think, back to P2P. Now, with Voice AI reshaping how we think about real-time communication, it's time to reconsider whether SFUs are truly necessary for every use case.

The Traditional Role of SFUs

Selective Forwarding Units emerged with group video calling. In a typical scenario, each participant sends multiple video streams at different quality levels, what we call simulcast. Each stream varies in target bitrate because they have different frame rates and resolutions, simplifying congestion control for the sender, and allowing the SFU to forward the appropriate quality to each receiver based on their bandwidth capacity. For a five-person video call, this architecture makes perfect sense: the SFU acts as an intelligent traffic router, ensuring everyone gets the best possible experience without overwhelming anyone's connection.
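
The forwarding decision itself is simple to sketch. The layer names, resolutions, and bitrates below are illustrative values rather than measurements, and the 20% headroom is an assumption.

```javascript
// Simulcast layers as an SFU might see them; names and bitrates are illustrative.
const LAYERS = [
  { rid: 'low', width: 320, height: 180, fps: 15, kbps: 150 },
  { rid: 'mid', width: 640, height: 360, fps: 30, kbps: 500 },
  { rid: 'high', width: 1280, height: 720, fps: 30, kbps: 1700 },
];

// Forward the highest layer that fits the receiver's estimated bandwidth,
// keeping some headroom for audio and retransmissions.
function selectLayer(layers, estimatedKbps, headroom = 0.8) {
  const budget = estimatedKbps * headroom;
  const sorted = [...layers].sort((a, b) => a.kbps - b.kbps);
  let chosen = sorted[0]; // never go below the lowest layer
  for (const layer of sorted) {
    if (layer.kbps <= budget) chosen = layer;
  }
  return chosen;
}
```

The sender never has to adapt a single stream to the slowest receiver; each receiver simply gets whichever layer fits its own budget.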

But here's the thing: not all calls involve five people. In fact, the vast majority of WebRTC sessions are between just two participants. This is where the story gets interesting.

WebRTC P2P4121

Many communication platforms developed what Emil Ivov (from Jitsi) called the WebRTC P2P4121 feature, peer-to-peer connections for one-to-one communication. The logic was sound: why route traffic through a server when two endpoints can communicate directly? The server's bandwidth savings alone made this attractive, not to mention the potential latency improvements.

Yet P2P had stumbling blocks: corporate firewalls and mobile networks threw up barriers that often required TURN servers to relay traffic anyway. If you're already maintaining TURN infrastructure to punch through NATs and firewalls, the argument went, why not consolidate everything through SFUs? You'd have one infrastructure to scale and maintain instead of two.

This consolidation made sense in the pre-Voice AI era. But the landscape shifted dramatically during the pandemic (COVID) era, when last-mile issues became more pronounced and routing through SFUs, even for two-person calls, became the norm.

Why Voice AI Changes Everything

Voice AI presents a fundamentally different communication pattern. When a human speaks to an AI agent, we're not dealing with a symmetric conversation between two endpoints behind unpredictable NATs. Instead, we have: an AI agent running on a server with a public IP address and no immediate need for multi-party capabilities in most use cases. We still have last-mile issues with the human participant, ergo, we should use WebRTC, but prefer P2P connections.

Think about it: your Voice AI agent, like Pipecat, is running on infrastructure you control, hence no real issues with firewalls. When discussing P2P versus SFU routing, there is hop-by-hop latency, but SFUs and AI agents are often co-located in the same region within the same cloud provider. Daily.co, for example, supports P2P connections and is available in 40+ regions across two cloud providers. So the additional latency from routing through an SFU might only be a few milliseconds. My concern is not so much the latency through the extra hop, but the additional jitter buffer that each hop may add in the worst-case scenario.

Every server in your media path maintains its own jitter buffer to handle out-of-order packets. When a packet with a higher sequence number arrives, the server waits to determine whether the missing packets are lost or merely late. This slightly delays the packet. At the next hop, the same process repeats, but in this case, the endpoint may either drop the packet because it is past its playout time or play it out correctly. In the worst case, the jitter buffers interact poorly, each adding delay as they attempt to smooth out network inconsistencies. Thus, having just one jitter buffer, at the endpoints, is better.

Consider a scenario where network conditions cause packets to arrive slightly out of order. The SFU's jitter buffer holds packets for, say, 40 milliseconds to reorder them. Then those packets travel to the endpoints, where network jitter causes another reordering delay. Suddenly, you've added 40-60 milliseconds to your end-to-end latency, not from transmission time, but from buffering. The actual buffering delay depends on the NetEq implementation; typically, there are high and low watermarks to control the amount of buffering. The high watermark is the maximum amount of buffering allowed for error-resilience (retx, fec), while the low watermark is the minimum amount of buffering allowed to ensure smooth playout (below this, the buffer underruns and you've no audio to play back).

A direct P2P connection eliminates this redundancy. The WebRTC stacks of the AI agent and the human participant handle all the jitter compensation in one place, making decisions with full visibility into the end-to-end connection quality.
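
The arithmetic can be sketched as follows. This is a toy model, not how NetEq sizes its buffer: the multiplier, the watermark values, and the per-hop jitter figures are all illustrative assumptions chosen to match the 40 millisecond example above.

```javascript
// Size a jitter buffer from observed jitter, clamped between the two watermarks.
// The multiplier and all millisecond values below are illustrative assumptions.
function targetBufferMs(observedJitterMs, { lowMs, highMs, multiplier = 2 }) {
  return Math.min(highMs, Math.max(lowMs, observedJitterMs * multiplier));
}

// Each hop sizes its buffer on its own, so buffering delay accumulates along the path.
function pathBufferingMs(hops) {
  return hops.reduce((total, hop) => total + targetBufferMs(hop.jitterMs, hop), 0);
}

const viaSfu = pathBufferingMs([
  { name: 'sfu', jitterMs: 20, lowMs: 20, highMs: 120 }, // holds 40 ms
  { name: 'endpoint', jitterMs: 5, lowMs: 20, highMs: 120 }, // low watermark: 20 ms
]);
const direct = pathBufferingMs([
  { name: 'endpoint', jitterMs: 20, lowMs: 20, highMs: 120 }, // holds 40 ms
]);
console.log({ viaSfu, direct });
```

Even when the second hop sees very little jitter, its low watermark still adds a floor of buffering, which is why the two-hop path ends up with more delay than the single buffer at the endpoint.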

The good news is that most modern WebRTC platforms are beginning to recognise this nuance. Daily's WebRTC transport, for instance, supports starting calls as P2P connections and seamlessly upgrading to SFU routing when a third participant joins. This hybrid approach gives you the best of both worlds: optimal performance for two-party conversations and the scalability of SFUs when needed (third party joins, recording, etc).

This seamless transition is crucial for Voice AI applications. Imagine a customer service scenario where an AI agent handles initial queries via P2P, then smoothly brings in a human supervisor when needed. The infrastructure adapts to the use case rather than forcing all conversations through the same architectural pattern.

Implementation Considerations

When implementing P2P connections for Voice AI, consider these factors:

Direct WebRTC connections to AI agents require proper signalling infrastructure. Your AI agent needs to handle WebRTC negotiation directly, which platforms like Pipecat already support. This isn't significantly more complex than SFU integration, but it does require thinking about your architecture differently.

Monitor your connection success rates carefully. While AI agents on public IPs should have high P2P success rates, some client networks might still pose challenges. Have a fallback strategy, whether that's TURN servers or SFU routing, for the small percentage of connections that can't establish P2P.
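One way to think about the fallback strategy and the monitoring together is sketched below. Everything here is hypothetical: the path names, the `try_path` callback (which stands in for whatever your stack does to attempt a connection), and the `ConnectionStats` class are illustrations, not part of Pipecat or any WebRTC API.

```python
from collections import Counter

# Hypothetical path names, tried in order of preference.
FALLBACK_ORDER = ("p2p", "turn", "sfu")


def connect_with_fallback(try_path, order=FALLBACK_ORDER):
    """Return the first path for which try_path(path) succeeds, else None."""
    for path in order:
        if try_path(path):
            return path
    return None


class ConnectionStats:
    """Counts which path each call ended up on, so the P2P rate can be tracked."""

    def __init__(self):
        self.outcomes = Counter()

    def record(self, path):
        self.outcomes[path if path is not None else "failed"] += 1

    def p2p_success_rate(self):
        total = sum(self.outcomes.values())
        return self.outcomes["p2p"] / total if total else 0.0


# Simulated attempts: three clients connect directly, one needs a TURN relay.
stats = ConnectionStats()
for blocked in (set(), set(), {"p2p"}, set()):
    stats.record(connect_with_fallback(lambda path, b=blocked: path not in b))

print(f"P2P success rate: {stats.p2p_success_rate():.0%}")
```

Tracking the outcome per call, rather than only failures, shows how often the fallback is actually used and whether that small percentage is staying small.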

Design your system to handle transitions gracefully. If you start with P2P and need to add participants later, ensure your application can migrate to SFU routing without disrupting the user experience.
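The transition logic itself can be small. The sketch below is a hypothetical state holder, not how Daily or Pipecat implement it: `choose_topology` and `Call` are names I made up, and I am assuming a call stays on the SFU once it has migrated. A real migration would also need to bring up the SFU connection before tearing down the P2P one so that audio does not drop.

```python
def choose_topology(participants: int, recording: bool = False) -> str:
    # Two parties with no recording can talk directly; anything more needs the SFU.
    return "p2p" if participants <= 2 and not recording else "sfu"


class Call:
    """Tracks the current topology and records each migration (hypothetical model)."""

    def __init__(self, participants: int = 2):
        self.participants = participants
        self.topology = choose_topology(participants)
        self.migrations = []

    def join(self, recording: bool = False) -> str:
        self.participants += 1
        wanted = choose_topology(self.participants, recording)
        # Assumption: only upgrade P2P -> SFU; never downgrade mid-call.
        if wanted == "sfu" and self.topology == "p2p":
            self.migrations.append(("p2p", "sfu"))
            self.topology = "sfu"
        return self.topology


call = Call()                      # AI agent + one human: starts as P2P
after_supervisor = call.join()     # a human supervisor joins: migrate to SFU
after_fourth = call.join()         # further joins do not trigger another migration
print(call.topology, call.migrations)
```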

Looking Forward

The key insight for Voice AI developers is this: use WebRTC for what it does best—handling last-mile networking challenges—without automatically adopting the full SFU-centric architecture that evolved for different use cases.

WebRTC gives you a congestion control algorithm implemented in each endpoint, echo cancellation and noise suppression, and NAT traversal capabilities when needed.

You don't need to reinvent these wheels. But you also don't need to route every packet through an SFU just because that's become the default architecture for video conferencing.

There's also the elephant in the room that I've deliberately avoided until now: SIP. The resurrection of SIP in modern communications infrastructure adds another dimension to this discussion. But that's a topic that deserves its own deep dive—perhaps in a future post.