The Most Valuable Skill in an AI World Is Thinking

The biggest risk of AI isn’t that it can make mistakes. It’s that it produces answers so fast, and packages them so convincingly, that we stop noticing when it does.

Generating an answer, even to an extremely complex question, has become effortless. Consider a straightforward workplace analytics task: a business leader highlights a growing risk and asks what trends we are seeing in our marketing engagement data. With modern AI tools, all you really have to do to generate a response is copy-paste the email thread into code assist and provide some basic guidance on the data sources. Within 15 minutes it could be done, with a clean executive summary, crisp charts, and a coherent narrative.

The problem is you’d have approximately no idea whether any of it is believable. Do you send it right away? Do you review the code line-by-line, eating back the precious time you tried to save in the first place?

And before you say “well, just train AI to evaluate itself,” ask yourself whether that would really make you trust the output more. Your answer is probably no, and it’s worth working through exactly why that is. What’s needed is the right evaluation framework that will build confidence in the process while still realizing efficiency improvements. For now, there are a few reasons why this critical role needs to be owned by a human being:

People are willing to trust people, but not always AI

Imagine I handled the workplace task above without using AI – manually writing the code, interpreting the results, making the charts, and drafting the executive summary. And let’s say the business leader strongly disagreed with my interpretation and recommendations. For human-to-human interactions, this is a familiar scenario. We can schedule a meeting, discuss how we ended up interpreting the results so differently, and come to an agreement on next steps that we can both accept.

The same scenario is much more complicated if I’d just directly shared the AI output. Consider a simple case of an error in the analysis. If I’d made the error, we have clear processes in place – it can reflect in my performance evaluation and reputation, I can be asked to take a training course or receive mentorship, or I could be reassigned to a different type of work. If I violated a company code or broke a law, I could be fired, fined, or sent to jail.

But the accountability if the model made the error is unclear. Sure, we could retrain the model, but how do we know fixing this issue wouldn’t create five more? Even more concerning is that if we ask the same thing again using slightly different language, we might not even get the same answer. And the reputation that will be damaged isn’t AI’s – it’s that of the human who decided to ship the erroneous response.

We simply do not have a framework for holding AI accountable without effectively treating a human, at some level, as the owner. And if something can’t be held accountable, it is going to be hard to trust. Ultimately we understand the range of moral, legal, and ethical frameworks that other humans generally live by, and this makes them trustworthy in a way that machines are not yet.

The better AI gets, the harder evaluation becomes

To state the obvious, humans are not machines. But as AI improves and becomes better and better at modeling human behavior, the line will become increasingly blurred. AI failures will look increasingly human and persuasive, eroding the feedback loop and degrading AI performance.

But hold on, you might say. If the errors are getting harder to detect, isn’t that just a different way of saying performance is improving, not degrading? Not necessarily. Behavioral economists have spent decades identifying cognitive biases and heuristics that lead humans to make decisions that feel right but are not optimal – even in extremely high stakes conditions. The list of biases is long. As AI mimics humans more and more, will its own cognitive biases emerge?

The example that most worries me is confirmation bias, which is the psychological tendency for humans to selectively filter information in a way that supports their preexisting beliefs. Put simply, we like it when we hear something that confirms we were right a lot more than something challenging our beliefs. In a world where every chat with AI leans into this bias by telling you “that is a very meaningful thought and one you should take seriously”, it will become increasingly hard to catch enough errors to meaningfully improve the system.

Ironically, the very human behaviors we turned to machines to escape through mathematical rigor and automation could end up baked into the machines themselves — laundered through an algorithm, and handed back to us with more confidence than ever. Which means that evaluating AI isn’t just about checking the math. It’s about recognizing when a model has learned to reflect our own blind spots back at us, and having a disciplined evaluation framework to keep AI objective.

Someone has to tell AI what “good” looks like – and it’s not easy

There is a common fallacy, I think, that if we can simply create enough metrics, AI will be able to deduce the proper way to balance them and produce an objectively optimal output. But this ignores the fundamental limitation that all these metrics exist to measure something. And those things are ultimately human, tied to the goals we have, the moral code we want to live by, and the values that drive our decisions.

Take an example from baseball. If you watch baseball a lot like I do, you’ve seen an explosion of metrics in recent years. A century ago you might know a hitter’s average, home runs and RBI; now we keep track of who’s leading in exit velocity. But do we know how important hitting the ball fast is to actually winning a game? And could focusing on a metric like exit velocity actually lead to unintended negative consequences like tons of strikeouts?

“When a measure becomes a target, it ceases to be a good measure” – this is Goodhart’s Law. In theory we can ask AI to consider an ever growing list of metrics to represent and approximate the real goals we are trying to achieve. But we will always be faced with the fact that once you set the metrics, the system can be gamed.

And the risk of this with AI is high because it’s extremely good at gaming metrics. Fundamentally, the technology works by looking for predictive associations across huge sets of data points. It’s all too easy to envision how it can find unintended associations and exploit them, ultimately driving metrics that diverge more and more from the real goals we wanted to achieve.

What does it all mean for the value of human judgment in an AI powered world?

The time and effort it takes to generate content with AI is collapsing toward zero. And what it actually creates is becoming increasingly compelling. Its arguments are persuasive, its organization is sophisticated, and its graphics are fancy. It’s easy to feel right now like anything we can do, it can do better.

But at the end of the day, the consumer of all that generated content is still a human being. And humans are likely to stay in the driver’s seat for some time because we’re simply not ready to afford AI the same level of trust. We don’t have the right accountability framework and we don’t know how to tell it what to do in a way that will hold up over the long term. I think that working as an evaluator at the AI-human interface will remain a valuable human contribution for a long time to come.

I’m not saying that AI won’t be the evaluator in many cases. As the volume of what is produced increases by orders of magnitude there simply won’t be time for humans to comb through everything. But for major decisions involving large risk/reward tradeoffs, with legal or moral implications, or without a clear answer, we’re going to want a human in charge.

The shift from maker to evaluator isn’t a demotion. It’s a different skill set, and an undervalued one. Thinking Empirically is about building exactly that: the habit of asking the right questions, stress testing the answers you get, and learning when to trust what’s in front of you. Those skills matter more than ever.

How to Think Empirically Part 5: Keep Track of What You Did and How it Went

“Combine the ingredients and cook in the black pot.”

My mom was recently making a stew recipe passed down from her grandmother. This detail mattered enough for my great-grandmother to include it in the directions. But 100 years later, without the “black pot”, we were left unsure about the best replacement to use or how important it was to perfecting the stew.

This story highlights something that scientists care deeply about called replicability.

At its core, replicability comes down to a simple equation: attention to detail plus documentation. Without a clear enough record of how to reproduce the stew, some of the knowledge my great-grandmother gained through hard work and experimentation was now lost.

Document what you did

Let’s return to our pasta sauce experiment. There are a few key things we need to keep track of as we go:

Materials: which ingredients did you include, and in what amounts? What pots and pans are needed?
Procedure: how did you prepare the ingredients, and in what order did you combine them? How long did you cook?
Results: did you like this version better than the previous one?

Remember that the goal is replicability – you want to ensure that the next time you make this recipe, you can achieve the same outcome by faithfully recreating the steps. It’s helpful to think about the details someone else would need to recreate the sauce to make sure you’re capturing the right level of detail:

instead of “cook in the black pot”, try “cook for 30 minutes over medium heat using a 6.5 qt cast iron dutch oven”
instead of “add 1 teaspoon of oregano”, try “mix 1 teaspoon of dried oregano with basil and garlic and stir for 1 minute”

These details may seem minor. But small differences can easily sneak into your recipe and change your results when documentation is unclear. If you intended to test “removing red pepper flakes” but you also switched from a white to a yellow onion at the same time, you’ve unintentionally changed two variables and will have trouble learning from your experiment.
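
If you prefer a digital notebook to a paper one, the three things above (materials, procedure, results) can be captured in a small structured record. This is only a sketch of one possible layout: the `SauceAttempt` class, its field names, and the "attempt 2" label are my own assumptions for illustration, and the example values are taken from the directions above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SauceAttempt:
    """One documented attempt: materials, procedure, and results."""
    name: str
    materials: dict[str, str]   # ingredient or equipment -> amount / description
    procedure: list[str]        # ordered steps, specific enough to repeat
    liked_better_than_previous: Optional[bool] = None  # None until tasted
    notes: str = ""

    def missing_details(self) -> list[str]:
        """Return the sections of the record that are still empty."""
        missing = []
        if not self.materials:
            missing.append("materials")
        if not self.procedure:
            missing.append("procedure")
        if self.liked_better_than_previous is None:
            missing.append("results")
        return missing


# Hypothetical entry using the more specific wording suggested above.
attempt = SauceAttempt(
    name="attempt 2",
    materials={
        "dried oregano": "1 teaspoon",
        "pot": "6.5 qt cast iron dutch oven",
    },
    procedure=[
        "mix 1 teaspoon of dried oregano with basil and garlic and stir for 1 minute",
        "cook for 30 minutes over medium heat",
    ],
)

print(attempt.missing_details())  # the result has not been recorded yet
```

The `missing_details` check is a small nudge to finish the record before moving on to the next attempt.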

Record what happened

You’ll also want to make note of whether your new version of the recipe worked or not. A common framework in business experimentation is the champion vs. challenger model. In this setup, you continue to use the best-performing version (in our case, of the recipe) as the baseline until some challenger comes along that can outperform it – that challenger then becomes the new champion.

You need to keep track of which version is the champion, but there are many simple ways to do this for our pasta sauce example:

Tasting notes: description of the new sauce and comparison to the champion
Thumbs up / thumbs down: was the challenger better or worse than the champion?
Rating scale: gives more nuance to tracking the outcome vs. thumbs up/thumbs down

The exact system doesn’t matter. What matters is preserving the knowledge you gain from each attempt and setting yourself up for the next round of experimentation.
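
For anyone who thinks in code, the champion vs. challenger bookkeeping can be sketched in a few lines. This is a minimal illustration, not a standard implementation: the `Version` class, the `pick_champion` helper, and the ratings are all made up for the example, and I’ve assumed a rating scale where higher is better and a tie leaves the current champion in place.

```python
from dataclasses import dataclass


@dataclass
class Version:
    name: str
    rating: int  # assumed rating scale (e.g. 1-5); higher is better


def pick_champion(champion: Version, challenger: Version) -> Version:
    """Keep the current champion unless the challenger strictly outperforms it."""
    return challenger if challenger.rating > champion.rating else champion


# Hypothetical ratings, purely for illustration.
champion = Version("original recipe", rating=3)
history = [champion.name]

for challenger in [
    Version("no red pepper flakes", rating=4),           # beats the champion
    Version("yellow onion instead of white", rating=4),  # tie: champion stays
]:
    champion = pick_champion(champion, challenger)
    history.append(champion.name)

print(champion.name)
print(history)
```

Keeping the `history` list alongside the current champion preserves the record of each round, not just the winner.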

Detailed documentation is the most time consuming part of experimentation – just ask any scientist how much they enjoy preparing reports for publication! The most critical thing is to capture the essential details, and avoid burning yourself out on things that may not matter much (should it be store brand or name brand oregano?).

And with that, you’ve completed a testing cycle and are ready to return to goal setting for your next attempt. By following a systematic process we’ve gained clear, reusable insights from all the effort we’ve put in, and are ready to continue on with more confidence in what we know.

How to Think Empirically Part 4: Change One Thing at a Time

If you throw a handful of poo at a fan, some of it is going to stick.

In the middle of my lesson, my saxophone teacher paused to share that analogy. He wanted me to know that when he shared anecdotes, advanced concepts, and creative ideas, he didn’t necessarily expect me to remember everything. The goal was breadth — inspiring my creativity and motivating me to keep working hard.

But this approach came with a tradeoff. Neither of us could say with certainty what, exactly, I would take away from that day’s lesson.

Thinking empirically makes a different tradeoff on the spectrum of breadth vs. depth (1). Rather than explore many related ideas at once, it narrows the scope to a small number of clearly defined learning objectives.

This often means changing only one thing at a time while keeping everything else the same. Let’s go back to our pasta sauce example to illustrate the difference between these two approaches.

When we surveyed the landscape of pasta sauce solutions, we found that there are core ingredients we’re not going to want to mess with (tomatoes, olive oil, onion, garlic, and salt), as well as elements ripe for experimentation (herbs, spices, and fat):

Poo at fan strategy: keep all of the core ingredients, but make multiple changes to other elements based on what I predict will be the best. For example, maybe I remove butter, oregano, and red pepper flakes in the next attempt.
Thinking Empirically strategy: keep everything identical except for one thing. For example, I know I don’t like spicy pasta sauce, so try removing red pepper flakes.

You might be asking yourself why, if I think making multiple changes is going to create my ideal sauce, would I slow myself down by making only one change at a time? It feels faster to just try it, right?

But it actually slows down learning.

Let’s say that you do in fact like your new sauce better than the original.

If you changed only one thing (removed red pepper flakes), your conclusion is clear. Don’t put red pepper flakes in future pasta sauces that you make.

But if you changed multiple things at once, your conclusion is ambiguous. Was it the butter? The oregano? The pepper flakes? Some combination of those three? If you don’t know, you can’t make an informed decision the next time you make pasta sauce.

This is how you save time and build confidence by thinking empirically – everything you test out produces a clear, transferable insight. These insights compound, and over time you build a knowledge base that enables more reliable, predictable decision making that gets you the results you want.
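
The “one change at a time” rule is easy to check mechanically if you write each recipe down as ingredients and details. The sketch below is just an illustration under my own assumptions: the `changed_variables` helper and the way the recipes are encoded (ingredient name mapped to a short description) are hypothetical, and only the non-core ingredients plus the onion are listed to keep it short.

```python
def changed_variables(baseline: dict, attempt: dict) -> set:
    """Return every ingredient that was added, removed, or altered."""
    keys = baseline.keys() | attempt.keys()
    return {key for key in keys if baseline.get(key) != attempt.get(key)}


# Hypothetical encoding of the baseline sauce.
baseline = {
    "onion": "white",
    "butter": "included",
    "oregano": "included",
    "red pepper flakes": "included",
}

# Thinking Empirically strategy: remove only the red pepper flakes.
one_change = {k: v for k, v in baseline.items() if k != "red pepper flakes"}

# Poo at fan strategy: remove butter, oregano, and red pepper flakes together.
many_changes = {"onion": "white"}

# The sneaky case: an unplanned onion swap rides along with the intended change.
sneaky = dict(one_change, onion="yellow")

for label, attempt in [("one change", one_change),
                       ("many changes", many_changes),
                       ("sneaky", sneaky)]:
    changed = sorted(changed_variables(baseline, attempt))
    verdict = "clear conclusion" if len(changed) == 1 else "ambiguous"
    print(f"{label}: {changed} -> {verdict}")
```

Only the first attempt changes exactly one variable, so it is the only one whose result points to a single cause.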

To benefit from this compounding learning, you need to record it. In the final part of the framework we’ll tackle how to keep track of how it went.

Next: Part 5: Keep Track of What You Did and How it Went

(1) I hope I’ve been very clear that there is great value to breadth-focused learning – inspiring creativity and motivating future learning are extremely important. The approach you take to learning depends on the goal you want to achieve – take another look at our post on “Defining Your Goal” for a refresher.

How to Think Empirically Part 3: See What Solutions Are Already Out There

“Disney says go left!”

I still remember this piece of advice from a guidebook my family used when planning a Disney World trip. Supposedly, you could save waiting time by always choosing the left path whenever a line split.

Even as a kid, I was skeptical.

Is this actually true? Do half the books tell people to go right so everyone doesn’t end up in the same line? How would they even know this works?

When you encounter a surprising piece of information, curiosity is your best friend. Before diving into action on advice that is hard to explain and potentially costly (wasting precious vacation time), it’s worth pausing to ask:

What evidence supports this claim?
How much time will it save?
Does this work universally, or only sometimes?

One of the best ways to start your investigation is also the most straightforward – check additional, independent sources. Do they come to the same conclusion? Do they explain why?

Let’s turn back to our pasta sauce challenge from last time. My goal is to make an everyday, tomato-based, quick and easy sauce that will level up my pasta lunches. I am not the first person to try to solve this problem. Even though I want to find a unique solution that’s best for me, learning from others’ solutions provides a strong starting point.

In research settings, stepping back to explore the big picture before diving into details is often called a landscape analysis. By examining where solutions converge — and where they differ — patterns begin to emerge. These patterns help refine your problem space: the set of variables actually worth testing.

If every recipe uses garlic, I know I’m going to want that too. And if some use basil while others use oregano, I know this is something I’m going to want to test out.

To establish my own starting point I looked at a few different recipes:

Store bought sauce: tomatoes, water, onion, olive oil, salt, garlic, and basil.
“Perfect Easy Red Sauce” (The Food Lab): olive oil, butter, onion, garlic, oregano, red pepper flakes, tomatoes, basil, and salt.
“Best Marinara Sauce Yet” (AllRecipes): tomatoes, tomato paste, parsley, garlic, oregano, salt, black pepper, olive oil, onion, and white wine.

Even this quick survey reveals clear patterns. Tomatoes, onions, garlic, olive oil, and salt are key ingredients across all recipes that are going to form the backbone of my sauce. I can also see where there are variations – herbs and spices, fat (oil vs. butter).

This insight immediately focuses experimentation. Instead of tweaking everything at once, the foundational structure is clear — and the variables worth exploring become obvious.
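
With only three recipes this comparison is easy to do by eye, but the same idea scales to a longer list if you treat each recipe as a set of ingredients. Here is a rough sketch using the three ingredient lists above; the variable names are my own, and I’ve assumed that an ingredient counts as “core” only if it appears in every recipe.

```python
# Ingredient lists copied from the three recipes surveyed above.
recipes = {
    "Store bought sauce": {
        "tomatoes", "water", "onion", "olive oil", "salt", "garlic", "basil",
    },
    "Perfect Easy Red Sauce": {
        "olive oil", "butter", "onion", "garlic", "oregano",
        "red pepper flakes", "tomatoes", "basil", "salt",
    },
    "Best Marinara Sauce Yet": {
        "tomatoes", "tomato paste", "parsley", "garlic", "oregano", "salt",
        "black pepper", "olive oil", "onion", "white wine",
    },
}

# Where the solutions converge: ingredients present in every recipe.
core = set.intersection(*recipes.values())

# Where they differ: everything that shows up in some recipes but not all.
worth_testing = set.union(*recipes.values()) - core

print(sorted(core))
print(sorted(worth_testing))
```

The first set is the backbone of the sauce; the second is the list of candidate variables to test.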

By surveying the landscape, I transformed an overwhelming question —

“How do I make the perfect sauce?”

— into a manageable one:

“Which variations meaningfully improve this already-solid base?”

Next, it’s time to start experimenting. One variable at a time.

Next: Part 4: Change One Thing at a Time

How to Think Empirically Part 2: Decide the Right Level of Effort

We can do anything we want, but we can’t do everything we want.

The process of empirical thinking – defining goals, exploring what’s out there, changing one thing at a time, and tracking results – is powerful. It’s one of the most reliable ways to learn what actually works.

But it isn’t the right tool for every problem.

Every day we’re faced with dozens, if not hundreds, of problems to solve. What groceries do we need? What’s for dinner this week? What are we doing on the weekend? It takes time and effort to tackle a problem using a structured process, and we simply don’t have enough energy to apply this approach to everything.

This is why the next step once your goal is clear – but before diving into testing – is to decide on the level of effort you want to invest in this goal.

Consider something you might do nearly every day like making a purchase online. Each time you search for a product, you are flooded with dozens of options, personalized ads, and customer reviews. The amount of time you invest in your research is going to depend on two key things:

Potential Payoff: How different is a great version of this product from an OK one? A cheap one from an expensive one?
Effort Required: How hard is it to research this? Are there a ton of options or only a few? Is it easy or hard to find and understand relevant information?

In business settings this is often called an opportunity analysis. Simply put, this means comparing what I stand to gain against what I have to give up in order to realize that gain (usually in terms of time and treasure). Ultimately the decision is subjective—it depends on how much you value the goal you’re trying to achieve.

It’s also not a one-and-done decision. As you make progress you start to realize some of the potential upside, meaning there is less left to gain (probably at higher cost) the longer you keep going.

Let’s apply this thinking to my pasta sauce example.

In Part 1 of the framework, I decided that making a great pasta sauce was my goal, and we worked through the process of articulating that in a clear, SMART way. What we need to decide now is whether this is the right goal for me to invest my time and effort into, given everything else going on in my life. Let’s consider the costs and benefits of going down this path:

Potential payoff is unclear: I think I can make a better sauce for less money than buying one, but it’s difficult to say how much better I can do. However, I think trying at least one test will let me see if there is any improvement.
Effort required is low: making pasta sauce requires relatively cheap, easy to find ingredients. Preparing and cooking the sauce will take 1-2 hours each time.

With this simple analysis I can make an informed decision – is it worth it to me to spend a couple of hours and a few dollars to improve a meal I eat multiple times per week?

While I might decide this is worth it, you may not – and that’s totally fine! The key idea is simple: spend your time, energy, and attention on the problems that matter most to you. By clearly considering the costs and benefits up front, you can invest effort in learning what is most important, without burning yourself out on the small stuff.

And now that I’ve decided making a new pasta sauce is worth it to me, it’s time to explore the solutions (that is, the sauces) that are already out there.

Next: Part 3: See what solutions are already out there

How to Think Empirically Part 1: Defining Your Goal

Have you ever taken a look at the reviews for your favorite restaurant and found there are a surprising number of haters?

Opinions can vary, but the more you search the more you will find that disagreement often reflects different goals, not different facts. Maybe a negative review focused on slow service while I was mainly interested in delicious, high-quality food. Perhaps someone had a negative experience in-restaurant, but I ordered takeout.

Finding the “best” restaurant depends on what you consider to be the perfect dining experience. And the same is true for almost any problem we face. There are many possible solutions, but the right one depends on the goal you have.

Cooking is a perfect example of this dynamic. As an avid pasta eater and someone who genuinely enjoys cooking, developing my own everyday pasta sauce felt like a fun challenge. With a small set of inexpensive, easy-to-find ingredients, there was real opportunity to create something better than the generic options on my grocery store’s shelves. More importantly, it presented a perfect opportunity for experimentation.

Because research and experimentation are central to my work life — and because I’m a huge nerd always looking for excuses to apply empirical thinking to everyday life — I realized I could use this same structured process to develop my own sauce.

If you’re new here, you can read about How to Think Empirically and find an overview of the Thinking Empirically Framework under the Thinking Empirically Fundamentals menu. This post is the first of five to walk through the key pillars of the framework, using my pasta sauce experimentation as a practical, hands-on example of how thinking empirically can work in real life. Each week I’ll add a post on the next pillar to flesh out the full framework. You can follow along with this process in your own kitchen or, if cooking isn’t your thing, adapt the framework to an everyday challenge that matters most to you.

Part 1: Defining Your Goal

Before grabbing the pots and pans and getting the spices out of the cabinet, I needed a clearly defined objective. To set myself up for success I needed to first decide what success would actually look like. Without this step, I could easily invest a lot of time and effort without getting closer to what I want to achieve.

This is where the SMART framework can be a huge help. It is widely used in business, performance management, and research-informed settings, with the idea of setting specific, measurable, achievable, relevant, and time-bound goals.

Specific: Are we talking about tomato sauce or alfredo? Cooking for two or twenty? Kids or adults? Weeknight survival or dinner party? Specificity narrows the solution space.
Measurable: How will you know if you succeeded? This doesn’t need to be fancy. Recording a thumbs up/thumbs down, one-to-five star rating, or even focusing on prep & cooking time as the key metric is fine.
Achievable: What constraints matter? The “perfect” sauce must operate within real-world limits — budget, time, skill, effort. A solution you can’t sustain isn’t a solution.
Relevant: Is this worth solving? Do you eat pasta often enough to justify the effort? Or is this a once-a-year problem? Thinking empirically does take effort, so let’s make sure it’s worth it.
Time-bound: How long are you willing to invest in solving this problem? Do you need this by next week, or is this an open-ended project? Defining this up front prevents burnout.

Here is how I defined my goal:

Specific: Develop an everyday, tomato-based sauce for casual lunches 1–2 times per week. The key audience is me — I’m the primary pasta-eater.
Measurable: Maintain a notebook recording ingredients, quantities, methods, tasting notes, and a simple thumbs-up / thumbs-down decision.
Achievable: Use affordable, standard grocery store ingredients. Limit prep and cooking to roughly one hour per week.
Relevant: I eat pasta regularly enough for this to meaningfully improve my daily life (and potentially save money).
Time-bound: No fixed timeline. This is a fun personal project rather than a deadline-driven task.

By clearly defining the goal, I dramatically narrowed the solution space. I now knew what kind of sauce I was making, who it needed to satisfy, what constraints mattered, and how I would recognize success.

But before diving into experimentation, there’s one more critical step: deciding whether this is a goal I want to prioritize with my time and effort. That’s where we’re headed next week.

Next: Part 2: Decide the Right Level of Effort