When would I approve a large pull request made by a magic genie?

Foreword

Recently I was asked to review a huge code pull request (PR). The change was generated by, let’s say, a magic genie.

There were about 18 thousand lines modified in it. To make matters worse, the affected code originates from a 10+ year old copy of a mathematical library. The library in question looks like it was mechanically ported to C from Fortran 77, with liberal use of gotos jumping both up and down. To my horror, jump labels were sometimes given the same names as variables in the same scope (is that even allowed in C?!). On top of this pre-existing mess, the PR suggested numerous changes scattered throughout it. What could possibly go wrong.

After 30 minutes of staring at it, my comment on this PR before I closed for the day was: “I need to lie down”.

This made me think: under which circumstances would I agree to approve such a large PR with many logical changes? Its origin would not matter: it could be a human, a machine, a combination of both, some sort of divine revelation, etc.

A multi-line physical change that carries a single logical change (such as the rename of a widely used function) is usually not a big problem to review, even though I am not capable of reading through all of it.

But what about something more complicated, something that I know I won’t be able to review line-by-line in reasonable time?

I’ve come up with two requirements on the starting state of the project, and one requirement on the structure and contents of the pull request itself. In a way, they reflect my views on a “dream” source code base and on a “decently good” pull request. It turns out that change size matters less than the properties of the test suite, which are what might allow arbitrarily large changes to be made safely.

Production code has an associated test suite that offers 100% test coverage

Let’s start with an “easy” requirement: every line, every statement in the production code should be reachable from tests. No changed line is allowed to go unexercised. If I am not looking at that line, something else must “look” at it, and the test suite is what does that.
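As a minimal sketch of what “every line is reachable from tests” looks like when it fails, here is a hypothetical function (all names are invented for illustration) traced with Python’s sys.settrace, the same mechanism real coverage tools build on:

```python
import sys

# Hypothetical production function: one of its branches is never tested.
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi        # this line is never reached by the tests below
    return x

executed = set()         # line offsets within clamp that actually ran

def tracer(frame, event, arg):
    if event == "line" and frame.f_code is clamp.__code__:
        executed.add(frame.f_lineno - clamp.__code__.co_firstlineno)
    return tracer

sys.settrace(tracer)
assert clamp(-1, 0, 10) == 0    # exercises the "too low" branch
assert clamp(5, 0, 10) == 5     # exercises the pass-through branch
sys.settrace(None)

# Offset 4 is "return hi": no test visits it, so coverage is incomplete
# and any edit to that line would go entirely unobserved.
assert 4 not in executed
```

A real project would use a coverage tool rather than a hand-rolled tracer, but the principle is the same: the report must show no unvisited lines.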

Needless to say, very few real-world projects I have met in practice fulfill this requirement. But if we are talking about the nebulous goal of being able to safely apply arbitrary automatic transformations to our code base, then we must set the bar high, and exhaustive coverage is the first thing to have.

Such perfect test coverage is not enough, though. It is only a beginning.

The test suite is strong: it tracks every statement in production code

Perfect test coverage says nothing about whether any test case that reaches a given place in the production code actually “cares” about the behavior arising from that place. Figuratively speaking, it is not enough to visit every room in a house; you have to look around to see if anything wrong is going on there.

We need additional guarantees that the test suite is strong: that the test cases comprising it have expectations that collectively react to every statement in the production code.

One way to reliably achieve such a remarkable property is to grow the code base incrementally, where every test, as it is added, is demonstrated to react (both pass and fail) to every statement in the section of production code the test is meant to track.

Let’s not assume that such an ideal development process has taken place, however. Instead, what if we start from an unknown evolution history of the code base? Do not despair! Post factum, the test suite’s strength can be demonstrated by mutation testing, that mystical process of “testing the tests“.

  • For every statement found in production code,
  • if that statement is intentionally and temporarily corrupted (e.g., a plus operator is replaced by a minus, a line is removed, a logical condition is replaced by constant true, etc.) and the resulting “mutant” code base is subjected to testing,
  • then at least one test case will react to the corruption with a failing expectation, meaning that such a defect would not escape if it really happened.

Such a test suite will be tight enough to reliably detect a wide range of intentional or accidental behavioral regressions introduced at any position in the production code base.

This property is much harder to achieve than the aforementioned 100% test coverage. Not only must your tests, combined, be able to reach every nook and cranny of the production behavior, they must also pin down all observable behavior as expected.

What would it mean for people working with a code base in such an ideal state? It would mean that for any change in production code, there is at least one associated test case that tracks it and the related use case. It would be very hard for an edit in production code, arbitrarily big or small, to accidentally break that use case without anyone noticing. (I won’t cover the potential negative effect of fragile tests; this post is already quite long.)

Aside: no silver bullet

I am not saying that this state is easy, or even possible, to achieve in all cases. It is unlikely to be achievable by just throwing more genies at an existing blob of legacy code and telling them to cover it with “good” tests. Some human involvement will be required.

To make matters worse, there are a lot of known theoretical and practical issues with mutation testing:

  • It is slow: O(N×M), where N is the size of the production code (measured in basic blocks) and M is the number of test cases to run against each mutation. It can be made somewhat better by careful test selection (M can be lowered), but there is no easy way out of having to visit every line of the production code.
  • Automatic application of mutations is not guaranteed to result in a mutant that terminates (the halting problem).
  • It is impossible to guarantee that a mutation indeed changes the observable behavior of a program. For example, in the statement if (a+b > 0) print("foo");, assume that a and b are always either 0 or 1. If we replace the addition operator with a logical or, the truth table of the condition won’t change; such a mutation cannot be caught by any test. It is really a refactoring, but it will be falsely and automatically marked as a deficiency in the test suite’s strength.
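The equivalent-mutant example from the last bullet can be checked exhaustively: under the stated assumption that a and b are each always 0 or 1, the mutated condition has exactly the same truth table as the original, so no test confined to that domain can ever kill the mutant.

```python
from itertools import product

# The condition from the example above, and its mutant.
original = lambda a, b: (a + b) > 0
mutant   = lambda a, b: (a or b) > 0   # '+' replaced by logical or

# Exhaustive check over the assumed domain {0, 1} x {0, 1}.
equivalent = all(original(a, b) == mutant(a, b)
                 for a, b in product((0, 1), repeat=2))
assert equivalent   # identical truth tables: an equivalent mutant
```
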

The analysis can still be done if some human help is involved to deal with non-terminating runs and equivalent mutants. Is it worth it? If I, for some reason, want to blindly apply large changes to a code base, I want to be dead sure that its test suite is advanced enough to be up to the challenge.

We haven’t talked about the pull request yet, so let’s cover it.

The incoming pull request contains correlated changes to production code and the test suite

It is critical that after any PR is merged, the two test suite properties (100% test coverage and test suite strength) continue to hold for the updated production code and its modified test suite.

What does that mean for the contents of the PR? It depends on its stated intent: does it add new behavior? Does it remove old behavior? Does it change no behavior at all?

  • In a PR marked as a refactoring of production code, no changes to the test suite may be present. By definition, a refactoring changes the structure of production code without affecting its behavior. Our 100%-coverage test suite helps us prove this automatically in this case.

  • In a PR marked as adding new use cases to production, additional tests covering those use cases must be present. These new tests must cover 100% of the added production code. More importantly, evidence must be provided that these new tests react (i.e., fail in their expectations) to any behavioral change to the proposed production code additions. That would be “small scale” mutation testing applied to the change. Finally, no changes to already present test cases are allowed in such a PR.

  • In a PR that removes existing features, removal of both production code and the associated tests should be present. Indeed, it is impossible to remove only production code: the 100% test coverage and strength properties guarantee that at least one test case would react to the no longer present behavior. Removing just a test case should not be allowed either: that would create a hole in coverage. As a final piece of evidence justifying the removal of a feature, I would require product usage telemetry query results supporting the observation that the feature in question is not actively used by enough customers. Looking at the telemetry turns the removal into a data-driven decision: “if we drop this feature, how many customers will be affected?”
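The refactoring case from the first bullet can be sketched concretely: an untouched test suite that stays green across both versions of a hypothetical function (both versions invented for illustration) is what certifies the change as behavior-preserving.

```python
# Before the "PR": the original implementation.
def sign_before(x):
    if x < 0:
        return -1
    if x > 0:
        return 1
    return 0

# After the "PR": a refactored, branch-free form of the same behavior.
def sign_after(x):
    return (x > 0) - (x < 0)   # bool arithmetic: True is 1, False is 0

def suite(sign):
    # The test suite, deliberately untouched by the refactoring PR.
    assert sign(-7) == -1
    assert sign(0) == 0
    assert sign(42) == 1

suite(sign_before)   # green before the change
suite(sign_after)    # still green, with zero test edits: a true refactoring
```

If the PR had to edit the suite to keep it green, that alone would disqualify it from being called a refactoring.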

Final thoughts

I do not allow “refactorings” involving both the production part and the test suite of the project. Similarly, any alleged “refactoring” of just the test suite falls under suspicion. If a need for any such kind of change arises, both the coverage and strength properties of the test suite must be re-certified from scratch. Otherwise, it is surprisingly easy to disarm some or all of your tests. You would live blissfully (for a while) with always “passing” tests, not knowing that they have lost their teeth and can no longer warn you when you need it most.

I still prefer small PRs.


Written by Grigory Rechistov in Uncategorized on 31.03.2026. Tags: testing, mutation testing, llm


Copyright © 2026 Grigory Rechistov