AGENT RELIABILITY

Machine Learning 2.0

30 minutes to demo, 3 years to production. Why prompt tweaking won't survive. How ML 2.0 reframes AI workflow development.

Eran Shlomo · The Architect

June 2, 2026·8 min read·1,746 words

Two years ago I thought the coming era would be the "everybody is a developer" era. Technically this is correct: practically everybody can now develop applications. But it does not describe well what is happening. Everybody being able to produce apps is only one result of the change, and the change itself is much more significant.

Everybody is now a machine learning model developer, and while "Product models" are very popular, they are just one of the many types of models that are now possible and accessible to everybody. Karpathy called deep learning Software 2.0, capturing the "programming by examples" change that deep learning introduced into the ML world. I believe we are now entering the Machine Learning 2.0 era.

No code, No ML (Machine learning)

Well before the current boom of AI and applications, there was a trend called "no code", which already carried the "everybody is now a developer" notion several years ago. The idea was simple: what if we give people nice building blocks with a UI and let everybody build software out of these "lego blocks"? While it had some success, it failed to deliver on the promise of opening the development world to everybody. It was more a matter of unlocking some automation opportunities for non-developers in a controlled sandbox.

Comparing this to what is going on today emphasizes the gap between the early vision of a few years back and the promise as delivered today. We have leaped forward, and indeed everybody is now a developer, or at least so it seems.

As part of the "No code" movement there was an even smaller movement, named ML 2.0, saying "everybody is a data scientist or a machine learning model developer" and following principles similar to those of the no code trend.

History played out a bit differently. ML 2.0 never achieved meaningful success and the success of No code was very limited, yet the vision was right, even if the timing was off by a few years. Today we have truly entered an era where both are happening, yet this era is "everybody is now a model developer" rather than the common notion of "everybody is now a developer". The difference between the two is huge.

For professional software (or hardware) developers it is even more confusing, since the new development methodology requires disciplines from both ML and traditional software development. In this article I will start with the basic ML 2.0 principles as they apply to this new world. They are relevant to your work whether you are developing a web app in Base44, building sales pipelines using AI workflows, automating the month-end financial close using agents, or just working as a professional developer. We are all now developing models, daily.

ML Development 2.0: What should you be doing?

Behavior is learned from data with large datasets

ML divides learning methods into two groups: supervised and unsupervised. In both cases the machine learns from humans; the difference is which human:

Supervised: The machine learns automatically from labeled examples. The domain expert injects their knowledge or the goal into the machine through labels, saved in files alongside the data, often long before the learning (also called training) takes place.
Unsupervised: The machine learns from unlabeled data. The human feedback comes in real time from the person who develops the model iteratively, often a data scientist or algorithm developer. They will always need a domain expert's assistance to evaluate the model's results, since there are no labels.

What are we doing today? In general we iterate on prompts, skills, workflows, <add here your favorite markdown name for instruction> and validate that it works locally on our system. In most cases we are actually the domain expert, yet we work in a way very similar to the unsupervised one.

This is a major gap, yet we do not easily recognize it - why?

The local tweak most of us make when doing agentic work is local overfitting: we make sure it works on that specific case, and it happens fast. Yet our prompt is unlikely to work the next day or on the next case. This is the classic trap of ML - 30 minutes to an exciting demo, 3 years to production.

What we (most of us) actually should be doing is transitioning to supervision:

Collect data
Label data
Feed the machine with the latest set of examples - called DataSet.
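As a minimal sketch of what these three steps can look like in practice, the snippet below scores two workflow variants against a small labeled dataset and keeps the better one. Everything in it is an assumption made for illustration: the `Example` type, the sample data, and the two `variant_*` functions, which stand in for real LLM calls with different prompts so that the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Example:
    text: str   # input collected from real usage
    label: str  # expected output, provided by the domain expert


# A tiny hand-labeled dataset (illustrative values only).
DATASET = [
    Example("Invoice #12 is overdue", "finance"),
    Example("Reset my password please", "support"),
    Example("Refund for order 991", "finance"),
    Example("App crashes on login", "support"),
]


def accuracy(run: Callable[[str], str], dataset: list[Example]) -> float:
    """Fraction of examples where the workflow output matches the label."""
    correct = sum(1 for ex in dataset if run(ex.text) == ex.label)
    return correct / len(dataset)


# Hypothetical stand-ins for two prompt variants. In a real setup each
# would call an LLM with a different prompt.
def variant_a(text: str) -> str:
    return "finance" if "invoice" in text.lower() else "support"


def variant_b(text: str) -> str:
    lowered = text.lower()
    return "finance" if ("invoice" in lowered or "refund" in lowered) else "support"


scores = {
    "variant_a": accuracy(variant_a, DATASET),
    "variant_b": accuracy(variant_b, DATASET),
}
best = max(scores, key=scores.get)
print(scores, best)
```

The point of the design is that the choice between variants is made by the dataset, not by how well a variant behaved on the one case we happened to be looking at.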

If you are tweaking prompts/skills/agents, you (and I) are doing it wrong. We do that because we are in the early days of this industry; as time goes by, better tools and methodologies will come and the common wisdom will change. Try to start now. Don't write a prompt, train a prompt.

Note: we don't label anymore these days. The LLM equivalent is evals.

Key takeaway: Move your activity from unsupervised methodologies to supervised methodologies. It is slower, more expensive and, to be honest, requires a lot of the mundane work of preparing datasets.

Yet if you want to start counting 9's for your AI workflow (99.999…% accuracy), this is the way (together with breaking tasks into smaller ones).

How to improve? The feedback loop

Improvement is simply defined as collecting bugs and fixing them. In the last article we discussed how AI calls them edge cases and software calls them corner cases, yet the principle is the same: our plan (the 80%) usually covers the known and the trivial, and as we develop, use and distribute our app we start collecting those exotic cases where it just fails because of things we did not know, did not predict or mischaracterized.

In traditional software development we will usually go through the following process (though there are many reasons why it does not always look like this in reality):

Try to reproduce the bug
Identify the logic failure, often called root cause analysis.
Put the fix in place
Add a test to catch it in the future

What do we do in ML?

In unsupervised modeling we will usually do the same, with one important difference: it is rare that we will fix an algorithm to handle a single corner case, so in most cases we collect these into buckets and try to handle many at once.

What happens in reality is that most are ignored. Unsupervised learning fails to deal with corner cases effectively, and this is the main reason these types of algorithms "lost" the AI race to supervised algorithms. It is worth noting that supervised methods are brute force and often get less respect from researchers.

What happens in supervised modeling?

We capture the error and ask "can we collect more data like this?". We collect and label as many examples covering this error as we can, and these examples go into our training set, with the (well-founded) hope that our next automatically generated model will handle these cases much better.

Key takeaway: When your AI workflow fails, collect examples and enrich your training sets. LLMs are great at generating synthetic data for your use case, a practice often called distillation.
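A minimal sketch of that loop, assuming the dataset is kept as a simple list of dictionaries: each production failure is recorded as a new labeled example, with exact duplicates skipped. The field names (`input`, `expected`, `source`) and the `record_failure` helper are hypothetical and chosen only for illustration.

```python
def record_failure(dataset: list[dict], text: str, expected: str) -> bool:
    """Add a failing case to the dataset unless the same input is already there.

    Returns True if a new example was added, False if it was a duplicate.
    """
    if any(ex["input"] == text for ex in dataset):
        return False
    dataset.append(
        {"input": text, "expected": expected, "source": "production_failure"}
    )
    return True


# Seed dataset with one example, then report the same failure twice.
dataset = [{"input": "Refund for order 991", "expected": "finance", "source": "seed"}]
added_first = record_failure(dataset, "Chargeback on card ending 42", "finance")
added_again = record_failure(dataset, "Chargeback on card ending 42", "finance")
print(added_first, added_again, len(dataset))
```

Keeping a `source` field is one way to later tell seed examples, real failures and synthetic examples apart when analyzing scores.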

Testing

In traditional machine learning, whether you model the data in a supervised or unsupervised way, you will have test sets. These are known sets of examples, input and expected output, which you simply run through the model to report correctness. In supervised learning you are not allowed to train on the test set (the model is not allowed to see the answers to the exam), so we usually divide our examples randomly into two groups: a training set and a test set.
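The random division described above can be sketched with the standard library alone. The `split_examples` helper, the 20% test fraction and the fixed seed are assumptions for illustration; a fixed seed simply keeps the split reproducible between runs.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


examples = [f"example-{i}" for i in range(10)]
train_set, test_set = split_examples(examples)
print(len(train_set), len(test_set))
```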

Note what we just said: both supervised and unsupervised work with test sets. This means we need evals today, even if we don't train agents. The only way to properly report and maintain your agent's progress in a professional environment is using evals, and just as in traditional supervised learning, you cannot train your models on the test evals.

In reality we often cannot prevent our test examples from leaking into the training examples and creating overfit. This makes us happy (our model's score jumps), yet the real score drops. Luckily for us we are not exposed to the real score, only our customers are, so at least in the short term we can maintain the illusion of improvement internally.
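One cheap guard against the most obvious form of leakage is to check whether any test input also appears among the training examples. The sketch below is an assumption about how such a check could look, not a complete solution: it only catches exact matches after lowercasing and whitespace cleanup, so paraphrased leaks will pass through. The helper names are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def find_leaks(train_inputs, test_inputs):
    """Return the test inputs that also appear (after normalization) in training."""
    seen = {normalize(t) for t in train_inputs}
    return [t for t in test_inputs if normalize(t) in seen]


train_inputs = ["Reset my password", "Invoice #12 is overdue"]
test_inputs = ["reset  my PASSWORD", "App crashes on login"]
leaks = find_leaks(train_inputs, test_inputs)
print(leaks)
```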

Failures

ML models are statistical, so by nature we should expect failures, yet we (and our customers) expect them "to work every time", especially in the enterprise world. No one will accept credit card billing that is 99% correct; most professional use cases are counting 9's, and many of them.

From my experience, on non-critical workflows people will accept 95% accuracy on paper and will approve 99%. This tends to be a reasonable benchmark for many AI workflows.
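To make "counting 9's" concrete, here is a small worked calculation. The volume of 1,000,000 runs is an arbitrary number picked for illustration, and both helper functions are hypothetical.

```python
from math import log10


def nines(accuracy: float) -> float:
    """Number of 9's in an accuracy figure, e.g. 0.99 is about 2, 0.999 about 3."""
    return -log10(1 - accuracy)


def expected_failures(accuracy: float, runs: int) -> int:
    """Expected number of failed runs at the given accuracy (rounded)."""
    return round(runs * (1 - accuracy))


for acc in (0.95, 0.99, 0.999):
    print(acc, round(nines(acc), 2), expected_failures(acc, 1_000_000))
```

At this assumed volume, 99% accuracy still means roughly ten thousand wrong results, which is why enterprise use cases keep asking for more 9's.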

A bigger problem is the inability to immediately learn, understand and explain the failures.

In traditional applications, bugs were logical failures of our flow, so explaining to our customers/users/regulators what went wrong and why was possible and expected. Machine learning bugs behave differently, which makes AI explainability a major issue to handle when scaling AI models into the real world.

Key takeaways:

Build with failures in mind: make it resilient and add guardrails according to the risk you are exposed to.
Communicate and educate "No 100%" when possible.
Don't use AI when simple, deterministic (AI-written) code can solve the problem.
Examples collection & evals are the way.
In agentic workflows we have a built-in cost-accuracy tradeoff: if latency and cost are not an issue, we can run multiple iterations of the same example with different models/system prompts and average the results. Accuracy will almost always increase (together with cost and latency).
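The cost-accuracy tradeoff in the last point can be sketched with a simple model. It assumes a right/wrong task where each run is correct with the same probability and runs fail independently of each other, and it combines answers by majority vote rather than numeric averaging. Real runs are rarely fully independent, so treat the numbers as an optimistic bound, not a prediction.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def majority_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent runs are correct,
    when each run is correct with probability p (use an odd n)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


print(majority_vote(["approve", "reject", "approve"]))
for n in (1, 3, 5):
    # n runs cost roughly n times as much as a single run
    print(n, round(majority_accuracy(0.9, n), 5))
```

Under these assumptions, a 90% accurate step reaches about 97.2% with three runs and about 99.1% with five, at three and five times the cost.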

The above is solid, but not complete

The above principles are all correct and should be applied to your AI workflows, yet machine learning has another subdomain called reinforcement learning. Coding and some other tasks are verifiable, which means we can save a lot of examples by giving the agent a "tool" to experiment with and to score good results. In software development, traditional tests are a good example of that, and TDD is now more important than ever.
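A minimal sketch of what "verifiable" means here: instead of labeling outputs by hand, a set of test cases scores each candidate automatically. The `score` helper, the test cases and the two `attempt_*` functions are hypothetical; the attempts stand in for code an agent might generate. This illustrates only the scoring signal, not a reinforcement learning algorithm.

```python
def score(candidate, cases):
    """Fraction of (args, expected) cases the candidate gets right.

    An exception raised by the candidate counts as a failed case.
    """
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(cases)


# Test cases for an "add two numbers" task, written before any attempt exists.
CASES = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]


def attempt_1(a, b):
    return a * b  # wrong operation, passes only by coincidence on (0, 0)


def attempt_2(a, b):
    return a + b


attempts = {"attempt_1": attempt_1, "attempt_2": attempt_2}
results = {name: score(fn, CASES) for name, fn in attempts.items()}
best = max(results, key=results.get)
print(results, best)
```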

I will cover the RL projection soon in more detail, going into how the above ML 2.0 principles, together with RL, are applied to Software 3.0 development.

If you are building agentic workflows and need help applying these ML 2.0 principles to your team's stack (examples collection, evals, guardrails, and the cost-accuracy tradeoffs), we are happy to talk. Book a demo and we will walk through it with you.

· · ·

Written by The Architect

Eran Shlomo

Cofounder & CEO of Langware Labs. Writes about AI strategy, enterprise technology, and the technical architecture behind AI coding tools.
