## A hypothesis cannot be experimentally proven

TL;DR: there is a difference between “is proven” and “could not be disproven”.

Let’s say you have a hypothesis A. How do you test it? By doing a set of experiments E₁, E₂, E₃, etc., observing their outcomes and comparing them against predictions made by A.

How do you disprove A? If any collected outcome disagrees with its prediction, then A is either wrong or at least in need of rework.

How do you prove A? Alas, you cannot. No finite number of experiments actually proves A. There is always a possibility that yet another experiment Z comes along whose outcome does not agree with what A predicts for it.

Or look at it this way: are there any other, substantially different hypotheses out there that would explain the same set of experiments as well as A does? Let’s say one of them is called ℵ.

One such ℵ is: “The results you see are as they are because you are inside a simulation built specifically to show you these results.” It is, of course, a non-falsifiable statement, but it nevertheless explains the same set of outcomes as A (and almost any other set of outcomes).

## Occam is of little help

Occam’s Razor would of course tell us to prefer A to ℵ. However, it is not a law but rather a minimalist design principle: it tells us which hypothesis to prefer, not which one is true.

Consequently, we can never prove A by experiments alone. We only approximate a proof with a finite set of failed attempts at falsification.

Let’s be more specific and have an example.

## Example from programming

Let’s say you have a program ℭ written in C. The program is meant to follow a well-defined specification document. You want to prove the hypothesis: “this program ℭ operates according to the provided specification”.

You’ve prepared as many tests as you could imagine to cover all observable aspects of running the program. None of them has produced an observation that contradicts the specification.

Can you firmly state that, based on this alone, the hypothesis is true?

The choice of the underlying language, C, precludes that. C has the notion of “undefined behavior”, which covers what a program does once its execution reaches a prohibited construct. There are plenty of such constructs (190+ cases are listed in C11), among them: reading an uninitialized variable, dereferencing an invalid pointer, or (sometimes) entering an endless loop.

The C language standard places no requirements on what a program does once it has reached undefined behavior: it can do anything. “Anything” includes seemingly intended behavior, all of the time or only some of the time.

What does this mean for our program ℭ and the hypothesis about its behavior? It means that we cannot tell for sure whether a test succeeded because the hypothesis is true, or simply because there is undefined behavior somewhere in the application which, at the time of the test, “decided” to behave as intended. The latter only means that we were “lucky” in that one case. It does not guarantee that any future execution will give the expected outcome.

## White boxes and formalism

Note that this does not mean that no hypothesis can ever be proven. It means only that no amount of experimentation is sufficient on its own. When we test a hypothesis purely by experiment, we treat our system as a black box. Because of that, we can never be sure that we have exhaustively explored it by doing experiments from outside. There may always be some corner case left that, when reached, would invalidate everything we’ve learned so far about the system.

However, if you are allowed to look inside the system, then your experiments can be guided by knowledge of its internal structure. In that case, it is often possible to tell when you have seen enough to be sure.

Usually it means that there is a set of axioms and production rules hidden inside your system. Checking your hypothesis is then reduced to determining whether it can be derived from those axioms by those rules. This is not always possible either. But in many practical cases it is doable (barring computational complexity limits).

For our program ℭ, if its source code is available, we can inspect it line by line and thus discover which aspects of it could lead to undefined behavior and which code paths are well-defined. This gives us much more confidence about what our external tests actually reach and whether all essential aspects of ℭ’s operation have been exhaustively exercised.