We've all shipped an LLM feature that "felt right" in dev, only to watch it break in production.
Why? Because human "eyeballing" isn't a scalable evaluation strategy.
The real challenge in building robust AI isn't just getting an LLM to generate an output. It's ensuring the output is 𝐫𝐢𝐠𝐡𝐭, 𝐬𝐚𝐟𝐞, 𝐟𝐨𝐫𝐦𝐚𝐭𝐭𝐞𝐝, 𝐚𝐧𝐝 𝐮𝐬𝐞𝐟𝐮𝐥, consistently, across thousands of diverse user inputs.
This is where 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 𝐌𝐞𝐭𝐫𝐢𝐜𝐬 become non-negotiable. Think of them as the sophisticated unit tests and integration tests for your LLM's brain.
You need to move beyond "does it work?" to "how well does it work, and why?"
This is precisely what Comet's 𝐎𝐩𝐢𝐤 is designed for. It provides the framework to rigorously grade your LLM's performance, turning subjective feelings into objective data.
Here's how we approach it, as shown in the cheat sheet below:
1./ Heuristic Metrics => the 'Linters' & 'Unit Tests'
- These are your non-negotiable, deterministic sanity checks.
- They are low-cost, fast, and catch objective failures.
- Your pipeline should fail here first.
▪️ Is it valid? → IsJson, RegexMatch
▪️ Is it faithful? → Contains, Equals
▪️ Is it close? → Levenshtein
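To make the idea concrete, here's a rough sketch of what these deterministic checks compute. Opik ships ready-made versions of the metrics listed above; the hand-rolled functions below are just illustrations of the logic, not Opik's API.

```python
import json
import re

def is_json(output: str) -> float:
    """Validity check: does the output parse as JSON? (cf. Opik's IsJson)"""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def regex_match(output: str, pattern: str) -> float:
    """Format check: does the output match an expected pattern? (cf. RegexMatch)"""
    return 1.0 if re.search(pattern, output) else 0.0

def levenshtein(a: str, b: str) -> int:
    """Edit distance: how many single-character edits separate output from reference?"""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[len(b)]
```

Because these checks are pure functions with no model calls, they're cheap enough to run on every pipeline execution, which is exactly why they should gate everything else.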
2./ LLM-as-a-Judge => the 'Peer Review'
- This is for everything that "looks right" but might be subtly wrong.
- These metrics evaluate quality and nuance where statistical rules fail.
- They answer the hard, subjective questions.
▪️ Is it true? → Hallucination
▪️ Is it relevant? → AnswerRelevance
▪️ Is it helpful? → Usefulness
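The pattern behind all of these is the same: a grading prompt plus a second model call. Here's a minimal sketch of a hallucination judge, where `call_llm` is a hypothetical stand-in for your model client (not an Opik function), and the prompt wording is illustrative:

```python
import json

# Illustrative judge prompt; Opik's built-in Hallucination metric uses its own template.
HALLUCINATION_PROMPT = """\
You are an impartial evaluator. Given a CONTEXT and an ANSWER, decide
whether the ANSWER makes claims not supported by the CONTEXT.
Reply as JSON: {{"score": <0.0 grounded .. 1.0 hallucinated>, "reason": "<one sentence>"}}

CONTEXT: {context}
ANSWER: {answer}
"""

def hallucination_score(context: str, answer: str, call_llm) -> dict:
    """Ask a judge model to grade groundedness; returns {"score": ..., "reason": ...}."""
    prompt = HALLUCINATION_PROMPT.format(context=context, answer=answer)
    return json.loads(call_llm(prompt))
```

The judge's reasoning string matters as much as the number: it's what you read when triaging a regression.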
3./ G-Eval => the dynamic 'Judge-Builder'
- G-Eval is a task-agnostic LLM-as-a-Judge.
- You define custom evaluation criteria in plain English (e.g., "Is the tone professional but not robotic?").
- It then uses Chain-of-Thought reasoning internally to analyze the output and produce a human-aligned score for those criteria.
- This allows you to test specific business logic without writing new code.
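A sketch of the G-Eval idea, again with `call_llm` as a hypothetical model client rather than Opik's actual interface: the judge first expands your plain-English criterion into its own evaluation steps, then follows them to score the output.

```python
def g_eval(criteria: str, output: str, call_llm) -> float:
    """Two-pass judge: generate chain-of-thought steps, then apply them."""
    # Pass 1: the judge writes its own rubric from your plain-English criterion.
    steps = call_llm(
        f"Write numbered evaluation steps for this criterion: {criteria}"
    )
    # Pass 2: the judge follows those steps against the actual output.
    verdict = call_llm(
        f"Criterion: {criteria}\n"
        f"Evaluation steps:\n{steps}\n"
        f"Output to grade:\n{output}\n"
        "Follow the steps, then reply with only a score between 0 and 1."
    )
    return float(verdict.strip())
```

The self-generated steps are what make the score human-aligned: you review the rubric once, instead of reviewing every output.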
4./ Custom Metrics
- For everything else.
- This is where you write your own Python code to create a metric.
- It's for when you need to check an output against a live internal API, a proprietary database, or any other logic that only your system knows.
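For example, a metric that verifies the model only recommends products that actually exist in your catalog. The dict below is a stand-in for a proprietary database or live inventory API, and the SKU format is invented for illustration; in Opik you'd wrap logic like this in a custom metric class.

```python
import re

# Stand-in for a proprietary catalog lookup only your system can perform.
VALID_SKUS = {"SKU-1001", "SKU-1002"}

def sku_exists(output: str) -> float:
    """Score 1.0 only if the output mentions SKUs and every one of them is real;
    0.0 if any SKU is invented or none are mentioned at all."""
    mentioned = set(re.findall(r"SKU-\d+", output))
    if not mentioned:
        return 0.0
    return 1.0 if mentioned <= VALID_SKUS else 0.0
```

No general-purpose metric can catch a model inventing a plausible-looking SKU; only your data can.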
Take a look at the cheat sheet for a quick breakdown.
Which metric are you implementing first for your current LLM project?
♻️ Don't forget to repost.