Week 10: Embracing Uncertainty in AI-Native Development

Aug 7, 2025 · 6 min read

A week split between reading about AI engineering practices and hands-on development across three projects: simplifying Wordbuddies after the summer break, implementing dual EID support in Breedr, and adding content to Stem Buddies.

The Probabilistic Shift

Whilst diving into articles about AI engineering mindset and building effective agents, one insight particularly resonated: “Your app’s performance is only as good as the data you collect about how it’s being used.” This isn’t just about analytics anymore—it’s about treating every user interaction as a learning opportunity in systems that can genuinely surprise us.

The distinction between workflows and agents became clearer this week too. Workflows orchestrate LLMs through predefined code paths, whilst agents allow LLMs to dynamically direct their own processes. It’s the difference between a carefully choreographed dance and improvisational jazz—both have their place, but they require entirely different approaches to quality assurance.

Evals: The New Testing Paradigm

Traditional testing assumes we know what good looks like. Write a function, assert the output, job done. But when working with AI systems, each change might make the system 5% better or 50% worse—and without proper evaluation systems, you’d never know which.

The concept of deterministic evals (simple assertions) versus more sophisticated evaluation systems for probabilistic outputs represents a fundamental evolution in how we think about software quality. It’s a shift from “does this function return the expected value?” to “how do we measure improvement in systems that can surprise us?”
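To make the contrast concrete, here's a minimal sketch (my own illustration, not from the articles): a deterministic test has one correct answer, while an eval over probabilistic output scores responses against criteria so you can track whether a change made things better or worse.

```python
# Illustrative sketch: deterministic assertion vs scored eval.

def slugify(title: str) -> str:
    # Deterministic function: one input, one correct output.
    return title.lower().replace(" ", "-")

assert slugify("Plant Detectives") == "plant-detectives"

# An LLM summary has no single correct answer, so instead of asserting
# equality we score it against criteria and watch the score over time.
def score_summary(summary: str, must_mention: list[str]) -> float:
    hits = sum(1 for term in must_mention if term.lower() in summary.lower())
    return hits / len(must_mention)

summary = "Bees navigate to flowers using polarised light and landmarks."
score = score_summary(summary, ["bees", "flowers", "light"])
assert score == 1.0  # a regression would show up as a falling score
```

The scoring function here is deliberately crude; the point is the shape of the check, not its sophistication.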

The data flywheel concept particularly intrigued me: user generates query → system responds → user provides feedback → feedback becomes new evaluation case → system improves → better responses for future users. It’s elegant in its simplicity, yet represents a massive shift from traditional development cycles.
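The core mechanic of that flywheel can be sketched in a few lines (a hypothetical illustration, with invented field names): every piece of real user feedback gets captured as a candidate evaluation case for the next release.

```python
# Hypothetical sketch of the data flywheel's core step: turning a
# real interaction into a new regression case for future eval runs.

eval_cases: list[dict] = [
    {"query": "how do bees find flowers?", "expected_topic": "bee navigation"},
]

def record_feedback(query: str, response: str, thumbs_up: bool) -> None:
    """Capture a live interaction as an evaluation case."""
    eval_cases.append({
        "query": query,
        # Positive feedback gives a reference answer; negative feedback
        # still gets captured, to be curated by a human later.
        "expected_topic": response if thumbs_up else None,
        "from_user_feedback": True,
    })

record_feedback("who was Radia Perlman?", "spanning tree protocol pioneer", True)
assert len(eval_cases) == 2  # the eval set grows with usage
```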

Simplicity Through Constraints

Sometimes the most important engineering decision is what not to build. During the summer holidays, I picked up Wordbuddies again—a project that had grown into what I can only describe as a monster. It was complex, broken, and trying to do far too much.

The solution wasn’t more features or clever architecture. It was stripping everything back to essentials: one game, ready for my kids to use. This week’s statistics tell the story:

  • 94 total commits focused on refinement
  • 43 new features, but all targeted and purposeful
  • 18 bug fixes addressing real user friction
  • Spaced repetition, session history, and configurable ElevenLabs API keys

Each addition served the core purpose rather than expanding scope. The comprehensive 300-word Year 3 spelling curriculum replaced scattered, unfocused content. Sometimes constraint breeds creativity better than unlimited possibility.
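The post doesn't specify how Wordbuddies schedules its spaced repetition, but a Leitner-box scheme is one minimal way to do it: words move up a box when spelled correctly and drop back to box one on a mistake, with review intervals growing per box.

```python
# A minimal Leitner-box scheduler, as one plausible (assumed, not
# confirmed) way to implement spaced repetition in a spelling game.
from dataclasses import dataclass, field
from datetime import date, timedelta

INTERVALS = {1: 1, 2: 3, 3: 7, 4: 14, 5: 30}  # box -> days until next review

@dataclass
class Word:
    spelling: str
    box: int = 1
    due: date = field(default_factory=date.today)

def review(word: Word, correct: bool, today: date) -> None:
    """Promote on success, demote to box 1 on a mistake."""
    word.box = min(word.box + 1, 5) if correct else 1
    word.due = today + timedelta(days=INTERVALS[word.box])

w = Word("necessary")
review(w, correct=True, today=date(2025, 8, 7))
assert w.box == 2 and w.due == date(2025, 8, 10)
```

The appeal for a kids' game is that the whole algorithm fits in a dozen lines and needs nothing stored beyond a box number and a due date per word.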

React Native Evolution

Split my Breedr work this week between feature development and a React Native upgrade—a reminder that mobile development exists in constant tension between innovation and maintenance. The dual EID support work is underway and represents the kind of complex, stateful work that becomes exponentially harder when your existing data flow wasn't designed for it.

React Native upgrades remain one of the more painful aspects of mobile development. What should be a straightforward dependency update inevitably turns into an archaeology expedition through breaking changes, deprecated APIs, and subtle behavioural shifts that only surface during testing. Half the week went on an upgrade that theoretically improves performance and developer experience, but practically means debugging why something that worked perfectly yesterday now crashes.

Nine commits might seem modest, but the focus on dual EID capabilities whilst maintaining data integrity reflects the careful balance required in production mobile applications. Each commit represents hours of testing across device configurations, OS versions, and edge cases that users will inevitably discover.

Search at Scale

Stem Buddies has reached that inflection point where success creates its own problems. What started as a handful of science posts for my daughter has grown into a proper collection of articles about everything from plant detectives to rollercoaster physics. The content is there, but finding it was becoming increasingly frustrating.

This week’s focus on enhancing the search functionality wasn’t just a nice-to-have feature—it was addressing a real usability crisis. When you’ve got stories about Janaki Ammal, explanations of how bees find flowers, and profiles of computer science pioneers like Radia Perlman, the old browse-and-hope approach simply doesn’t work anymore.

The search page redesign and index updates represent the kind of foundational work that doesn’t feel glamorous but directly impacts whether users can actually find the content they’re looking for. It’s a reminder that as our digital products grow, the infrastructure for discovery becomes just as important as the content itself.
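The actual Stem Buddies implementation isn't described in the post, but the foundational idea behind any search index can be sketched simply: an inverted index mapping terms to article slugs, so queries become set intersections rather than a linear browse. (Article titles and slugs below are invented for illustration.)

```python
# Illustrative inverted index: the minimal structure that replaces
# browse-and-hope with term lookup (not the real Stem Buddies code).
from collections import defaultdict

articles = {
    "plant-detectives": "Plant detectives: how scientists identify species",
    "bee-navigation": "How bees find flowers",
    "radia-perlman": "Radia Perlman and the spanning tree protocol",
}

index: dict[str, set[str]] = defaultdict(set)
for slug, title in articles.items():
    for term in title.lower().split():
        index[term].add(slug)

def search(query: str) -> set[str]:
    """Return slugs matching every term in the query (AND semantics)."""
    results = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*results) if results else set()

assert search("bees flowers") == {"bee-navigation"}
```

A real implementation would add stemming, tag fields, and ranking, but the data structure—terms pointing at documents—stays the same from toy to production.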

The Observable Future

What strikes me most about this week’s work is how observability has become central to everything we build. Whether it’s AI evaluation systems providing continuous feedback, mobile apps tracking user interactions across complex state machines, or content platforms monitoring engagement patterns—the systems that succeed are those that can see themselves clearly.

This observability isn’t just about monitoring; it’s about building systems that can adapt and improve based on real-world usage. The traditional waterfall from requirements to implementation to deployment is being replaced by continuous loops of hypothesis, measurement, and refinement.

As we embrace this probabilistic future, the engineers who thrive will be those who can balance uncertainty with rigour, simplicity with sophistication, and deterministic foundations with adaptive surfaces. The craft of software development isn’t becoming easier—it’s becoming more nuanced, more contextual, and ultimately more human.

Next week: diving deeper into evaluation frameworks and how they’re reshaping mobile development workflows.

~James Best