GenAI considered reliable-enough

Drawing parallels between TCP/UDP networking protocols and GenAI reliability mechanisms to argue for context-appropriate reliability standards

TL;DR: Just as TCP isn’t 100% reliable but is considered "reliable enough" through checksums and retransmissions, GenAI can achieve appropriate reliability through guardrails, LLM-as-judge, and chain-of-thought reasoning.

In defense of Generative AI’s hallucinations and errors, let’s consider this: humans and our existing systems are not 100% reliable. Even TCP, the protocol we trust for reliable data transmission, isn’t perfectly reliable. Lost packets trigger retransmissions, those retransmissions can themselves be lost, and enough consecutive losses will eventually cause the connection to terminate. Nonetheless, we consider TCP to be reliable. Why? Because it’s reliable enough for its intended use cases, and the mechanisms that make it reliable have been tuned over the past few decades: DCTCP runs between nodes within a datacenter, servers delivering content to end users often run proprietary congestion-control variants, and end-user machines typically use algorithms like CUBIC or BBR.

The TCP Analogy: Understanding "Reliable Enough"

Delving deeper into how TCP works reveals several mechanisms that reduce the probability of data corruption. The protocol employs checksums to verify data integrity, ensuring that what arrives matches what was sent. It uses sequence numbers to maintain ordered delivery, preventing packets from arriving out of order and corrupting the data stream. When packets are lost, TCP’s retransmission mechanisms kick in, resending data until acknowledgment is received. Various timeouts govern these processes, ultimately deciding when to give up on a connection that has become unviable.
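
To make these mechanics concrete, here is a minimal, purely illustrative Python sketch of a stop-and-wait sender: a toy hash stands in for TCP’s checksum, a simulated lossy channel stands in for the network, and a bounded retry loop plays the role of retransmission with an eventual give-up. None of this is real TCP; it only captures the shape of the idea.

```python
import hashlib
import random

def checksum(payload: bytes) -> str:
    # Toy stand-in for TCP's checksum: verify that what arrives
    # matches what was sent.
    return hashlib.md5(payload).hexdigest()

def unreliable_send(payload: bytes):
    # Simulated network: drop ~20% of packets and corrupt another ~10%.
    roll = random.random()
    if roll < 0.2:
        return None                     # packet lost
    if roll < 0.3:
        return payload[:-1] + b"?"      # packet corrupted in flight
    return payload

def send_with_retransmit(payload: bytes, max_retries: int = 5) -> bool:
    # Stop-and-wait delivery: retransmit until an intact copy gets through,
    # or give up and treat the connection as broken.
    expected = checksum(payload)
    for attempt in range(1, max_retries + 1):
        received = unreliable_send(payload)
        if received is not None and checksum(received) == expected:
            print(f"delivered on attempt {attempt}")
            return True
        print(f"attempt {attempt} failed, retransmitting")
    print("giving up: connection considered broken")
    return False

send_with_retransmit(b"hello, reliable-enough world")
```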

TCP introduced the concept of a connection on top of a connection-less packet delivery model. This layered approach to reliability offers an important lesson for GenAI systems. TCP’s failure modes are observable and detectable in ways that GenAI’s failure modes may not be, but I think we can still draw some useful parallels.

GenAI’s Reliability Mechanisms

Following the networking analogy, GenAI needs to apply corresponding resilience mechanisms. Guardrails function as circuit-breakers in the AI system, preventing the model from generating harmful or wildly incorrect content. Just as circuit breakers prevent electrical system overload and TCP’s connection timeouts prevent infinite waiting, these safety boundaries ensure the system fails gracefully rather than catastrophically.
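
As a rough illustration, a guardrail can be as simple as a pre-flight check that trips before a response ships. The sketch below is deliberately naive; the blocklist, the length budget, and the topic strings are all invented for this example, and real guardrails typically use dedicated policy models or classifiers, but the circuit-breaker shape is the same.

```python
# Invented policy for illustration only; real guardrails usually rely on
# dedicated classifiers or policy models rather than keyword matching.
BLOCKED_TOPICS = ("medical dosage", "wire transfer instructions")
MAX_RESPONSE_CHARS = 4000

def guardrail_check(draft: str):
    # Circuit-breaker behaviour: if the draft trips a policy rule, fail
    # gracefully with a refusal instead of shipping a risky answer.
    lowered = draft.lower()
    for topic in BLOCKED_TOPICS:
        if topic in lowered:
            return False, f"Blocked: the response touches '{topic}' and needs human review."
    if len(draft) > MAX_RESPONSE_CHARS:
        return False, "Blocked: the response exceeds the length budget."
    return True, draft

ok, result = guardrail_check("Here is a summary of this week's release notes...")
print(ok, result)
```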

The LLM-as-a-judge pattern serves a role similar to checksums in networking protocols. Where checksums verify data integrity by comparing received data against expected values, LLM-as-judge approaches use a second model, or the same model in a different mode, to evaluate the quality and accuracy of generated content. This creates a verification layer that can catch errors before they reach the end user.
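
Here is one way the pattern might look in code. The `client.complete` interface is a placeholder for whichever LLM SDK you actually use; what matters is the structure: one call generates, a second call grades, and nothing that fails the check is returned as-is.

```python
def generate(client, prompt: str) -> str:
    # `client.complete` is a hypothetical single-string completion call;
    # substitute your SDK's equivalent.
    return client.complete(prompt)

def judge(client, prompt: str, answer: str) -> bool:
    # The second pass acts like a checksum: an independent evaluation of the
    # generated content before it reaches the user.
    verdict = client.complete(
        "You are a strict reviewer. Does the answer below correctly and "
        "safely address the question? Reply with PASS or FAIL.\n\n"
        f"Question: {prompt}\n\nAnswer: {answer}"
    )
    return verdict.strip().upper().startswith("PASS")

def answer_with_verification(client, prompt: str, max_attempts: int = 3):
    # Regenerate until the judge accepts, or report failure rather than
    # returning an unverified answer.
    for _ in range(max_attempts):
        candidate = generate(client, prompt)
        if judge(client, prompt, candidate):
            return candidate
    return None
```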

Chain of thought (CoT) reasoning provides something analogous to sequence numbers in TCP. Just as sequence numbers ensure packets arrive in the correct order and enable reconstruction of the original message, chain of thought reasoning ensures logical progression through a problem. It creates traceable reasoning paths that can be audited and verified, making the model’s decision-making process more transparent and reliable.
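
A lightweight way to get that traceability is to ask the model for numbered steps and then parse them out for auditing. In the sketch below, the prompt wording and the "Final answer:" delimiter are arbitrary choices made for illustration.

```python
COT_PROMPT = (
    "Answer the question below. Before the final answer, list your reasoning "
    "as numbered steps so each step can be checked, then write "
    "'Final answer:' followed by the answer.\n\nQuestion: {question}"
)

def split_reasoning(response: str):
    # Like sequence numbers, numbered steps give the output an order that can
    # be audited: step 3 can be inspected without re-deriving steps 1 and 2.
    reasoning, _, final = response.partition("Final answer:")
    steps = [line.strip() for line in reasoning.splitlines() if line.strip()]
    return steps, final.strip()

example = "1. The bill is $40.\n2. 15% of $40 is $6.\nFinal answer: a $6 tip"
steps, final = split_reasoning(example)
print(steps)   # ['1. The bill is $40.', '2. 15% of $40 is $6.']
print(final)   # 'a $6 tip'
```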

Context-Dependent Reliability

In networking, you have two fundamental choices: use TCP with its built-in reliability mechanisms, or use UDP and build your own reliability layer tailored to your specific needs. This choice depends entirely on your use case and what "reliable" means in your context.

Real-time voice and video calls demonstrate this principle perfectly. They use RTP over UDP because in conversation, latency matters more than perfection. When packets go missing, the decoder doesn’t wait—it guesses and renders what it can. You might see a momentary freeze or hear a brief glitch, but the conversation continues. The system prioritizes low latency over perfect delivery because a delayed "hello" is worse than a slightly garbled one.

Streaming video services take the opposite approach. Here, media is received into a buffer before playback begins. The system can take time to ensure each packet arrives and is processed in order, playing back at the highest possible quality while carefully managing the buffer to avoid the dreaded rebuffering pause. Quality and completeness take precedence over real-time delivery because viewers would rather wait a few seconds for the video to start than watch a degraded experience. Over time, we have seen systems shift from UDP to TCP and back to UDP. Video-on-demand streaming ran over RTSP on UDP in the 90s, but its unreliability and the advent of browsers meant that streaming over HTTP on TCP became the norm. More recently, partly because of protocol ossification, HTTP over TCP is being displaced by HTTP/3, which runs over QUIC on UDP.

The GenAI Parallel

We find ourselves in a similar situation with Generative AI and its ability to mimic, copy, guess, and create. The reliability requirements vary dramatically based on the application, just as they do in networking.

In medical diagnosis, legal document drafting, or financial analysis, we need multiple verification layers. These applications require human-in-the-loop validation, strict guardrails, and comprehensive audit trails. This is like running TCP with additional application-layer checksums—we’re not just relying on the base protocol’s reliability but adding extra verification because the cost of errors is too high. A misdiagnosis or a legal mistake can have life-altering consequences, so we build systems that verify, re-verify, and maintain clear chains of accountability.

On the other end of the spectrum, consider brainstorming sessions, first drafts, or entertainment applications. Here, GenAI operates more like UDP: some "packet loss" in the form of minor errors or inconsistencies is perfectly acceptable. When you’re using AI to generate ideas for a marketing campaign or create variations of a design concept, perfect accuracy isn’t the goal. Speed and creativity matter more than precision. A slightly nonsensical suggestion might even spark the perfect idea. Similarly, vibe-coded internal tools or proof-of-concept applications may not require the same level of reliability as production applications, and can meet the bar of "good enough".

Most interesting are the hybrid approaches that adapt their reliability requirements dynamically. Code generation paired with test verification creates a feedback loop where the AI can be creative and make mistakes, but those mistakes are caught before they matter. Content creation with fact-checking layers allows for fluid writing while ensuring accuracy where it counts. Customer service systems seamlessly escalate to humans when confidence drops below a threshold. These systems are like adaptive protocols that switch their error-resilience modes based on observed conditions.
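
A minimal version of the generate-and-test loop might look like the sketch below. It assumes `pytest` is installed, and `client.complete` is again a hypothetical completion call; the important property is that a candidate which fails its tests never escapes the loop.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def passes_tests(candidate_code: str, test_code: str) -> bool:
    # The tests, not the model, decide whether a candidate is good enough.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(candidate_code)
        Path(tmp, "test_solution.py").write_text(test_code)
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", "test_solution.py"],
            cwd=tmp, capture_output=True, text=True, timeout=60,
        )
    return result.returncode == 0

def generate_until_green(client, spec: str, test_code: str, max_attempts: int = 3):
    # Feedback loop: the model may be wrong on any attempt, but a wrong
    # attempt is caught here instead of reaching the caller.
    feedback = ""
    for _ in range(max_attempts):
        candidate = client.complete(f"Write Python code for: {spec}\n{feedback}")
        if passes_tests(candidate, test_code):
            return candidate
        feedback = "The previous attempt failed the tests; please fix it and try again."
    return None
```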

Just as network engineers build reliable systems on unreliable networks, AI engineers must build reliable applications on probabilistic models. The key is layering your defenses. Never rely on a single checking mechanism. Multiple models reviewing each other’s work, diverse prompting strategies, and varied validation approaches create a robust system that can catch different types of errors.

Matching reliability to requirements becomes crucial. Not every use case needs five-nines reliability, and trying to achieve it everywhere would be prohibitively expensive and slow. A chatbot helping users find documentation can tolerate occasional misunderstandings, while a system generating medical dosage recommendations cannot be incorrect.

We must embrace probabilistic thinking in our system design. Instead of trying to handle every edge case perfectly, we design for the 95% case and ensure the system handles the remaining 5% gracefully. This might mean clear error messages, smooth handoffs to human operators, or transparent confidence indicators that help users understand when to verify the AI’s output.
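
In code, handling the remaining 5% gracefully can be as simple as a confidence gate. In the sketch below, `complete_with_confidence` and the 0.8 threshold are assumptions made for illustration; in practice the score might come from token log-probabilities, a calibration model, or an LLM-as-judge pass.

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed value; tune per use case

def answer_or_escalate(client, prompt: str) -> dict:
    # Design for the common case, handle the rest gracefully: low-confidence
    # answers are routed to a human instead of being presented as fact.
    answer, confidence = client.complete_with_confidence(prompt)  # hypothetical API
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"answer": answer, "source": "model", "confidence": confidence}
    return {
        "answer": None,
        "source": "human_queue",
        "note": "Routed to a human reviewer because model confidence was too low.",
    }
```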

Monitoring and adaptation round out the reliability strategy. Like TCP’s congestion control algorithm that adjusts sending rates based on network conditions, AI systems should adapt their behavior based on performance metrics. If error rates increase, the system might automatically become more conservative, request additional verification, or route more requests to human review.
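
Borrowing the shape of TCP’s congestion control, such adaptation might tighten the confidence threshold sharply whenever an error is observed and relax it slowly while things look healthy. All constants in this sketch are invented for illustration.

```python
class AdaptiveThreshold:
    # Loosely modelled on TCP's AIMD behaviour: back off sharply on trouble,
    # recover slowly when conditions are good.
    def __init__(self, baseline: float = 0.8):
        self.baseline = baseline
        self.threshold = baseline

    def record(self, was_error: bool) -> None:
        if was_error:
            # Sharp reaction: close half the gap to "maximally conservative".
            self.threshold += (1.0 - self.threshold) * 0.5
        else:
            # Slow recovery: drift back toward the baseline.
            self.threshold = max(self.baseline, self.threshold - 0.005)

monitor = AdaptiveThreshold()
for outcome in [False, True, False, False]:
    monitor.record(was_error=outcome)
print(round(monitor.threshold, 3))  # tightened after the error, then relaxed slightly
```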

Conclusion: Redefining Reliability

"Reliable enough" isn’t settling for less. It is engineering for reality. TCP shows us that perfect reliability isn’t necessary for a protocol to be considered reliable. Similarly, GenAI doesn’t need to be perfect to be transformative.

The question isn’t "Is GenAI reliable?" but rather "Is GenAI reliable enough for my specific use case?" And increasingly, with the right mechanisms in place, the answer is yes.

As we continue to develop AI systems, we should focus not on eliminating all errors (an impossible task even for humans), but on building appropriate reliability mechanisms for each use case. Just as the internet thrives on "best effort" packet delivery with reliability built in layers above, GenAI can thrive with thoughtful application of context-appropriate reliability mechanisms.

The future isn’t about perfect AI. It’s about AI that’s reliable enough for the task at hand, with well-understood failure modes and appropriate safeguards.


A more formal treatment of building reliable LLM applications is documented in 12-Factor Agents; give it a read if you’re interested in the topic.
