Q and A on undisciplined preprocessor

What is undisciplined preprocessor usage?

It is code that uses the C and C++ preprocessor facilities in ways that contribute to code opacity.

  1. It uses macros where other language features are more appropriate.
  2. It interleaves conditional compilation paths with regular conditional statements, making the control flow hard to track for humans and tools alike.
  3. It generally makes code harder to understand than necessary.

There is agreement in the software industry about the dangers of unrestrained preprocessor usage [1]:

By interviewing 40 developers and by performing a survey among 202 participants, we found that most developers agree that undisciplined preprocessor usage influences code understanding (88 %), maintainability (81 %), and error proneness (86 %) negatively. The developers emphasize that they would not use undisciplined directives because they decrease code readability, obfuscate the control flow, and make the code difficult to evolve and maintain. For example, one developer elaborated on the aforementioned problems, saying: “I avoid this kind of directives; they make the source code hard to understand and maintain. My gut feeling keeps screaming possible bugs when I’m faced with a code like that.”

There is a reason that none of the modern programming languages (starting with Java) has preprocessor facilities resembling those of C/C++. Those languages that do (e.g. Rust) give considerably less expressive power to their macro systems. The Lisp family, on the other hand, has much more powerful macros, but those are quite different in their expressiveness.

Does “undisciplined” mean bugs?

Quite a few of the bugs being found revolve around macros. They are usually either tracked back to an error hidden inside a macro, or concealed by using a macro in a context where it expands into an unexpected result.

Most importantly, even if code involving a macro is correct, it may still be harder to understand and maintain than if it had been written with more attention to legibility.

Is this macro abuse affecting our project?

That depends on how widespread they are in your code.

After many years of malpractice and neglected cleanup work, certain parts of the code base may become entangled in multiple layers of conditional compilation. The attainable development rate then suffers, as engineers drown in a sea of entangled macros.

If many independent teams of engineers contribute to the project and communication between them is limited, there is a real danger of the macro hell deepening. Humans tend to look at existing bad code examples and reuse the antipatterns found in them.

As such, it requires a coordinated team effort to improve our practices around preprocessor usage. Measures include educating team members about improper and proper preprocessor usage, and having tools to help us find new maintainability issues and track existing ones.

There are commercial tools that perform static analysis! Why don’t we use them?

To the best of my knowledge, none of the available commercial or free tools does this type of preprocessor analysis.

  • Many known C++ static analyzers receive code after the preprocessing phase has finished. They can find errors in specific macro instantiations that affect program behavior. But anything else is invisible to these analyzers: conditional compilation paths that are not taken, mistyped macros that are not used anywhere, and fragile macros (e.g. ones without guards against side effects hidden in arguments) will not be reported.
  • The few tools that do perform their analysis with macros in mind may either choke on the complexity or take a disproportionately long time to process such files.

A few known examples of tools that exhibit limited support for reporting improper preprocessor usage:

  • Visual Studio 2017 suggests using constants instead of #define.
  • SonarQube makes the same suggestion and assigns code debt to constant-like #defines.
  • The CCCC analyzer complains about more than 99 possible code paths through conditional sections. Note that a human’s capacity for analyzing multiple code paths tops out somewhere between two and six.

Other tools, such as Coverity, gcc -fanalyzer, and clang-analyze, did not show any awareness of preprocessor-specific issues.

What problems do object-like macros pose?

An object-like macro is a #define with no arguments:

#define LIMIT 10 + 100

They are often used to give names to literals. Alternatively, they define an expression whose inputs are implicitly taken from the surrounding context.

The problem with object-like “constants” is that, being really just pieces of text, they may be treated in unexpectedly different ways depending on the context of use.

  • They are untyped. In different expressions, their deduced type may cause silent narrowing of the result.
  • They are lazily calculated. The LIMIT macro defined above is not equal to 110 in e.g. the following usage: int res = LIMIT * LIMIT;  The expanded expression 10 + 100 * 10 + 100 evaluates to 1110, not 12100.
  • They are opaque to other tools, such as GDB. One can easily inspect the value of a normal variable. Not so with a macro, because it has no value in the strict sense when the program is running. Other debuggers, e.g. Visual Studio’s, are more sophisticated and do help even with macros.
  • They may be undefined, but still “correctly” used in certain contexts. E.g. the line #ifdef MISTPYED_VALUE will not cause an error, even though there is a typo in the symbol name.
  • They are usually written in ALL CAPITALS. This is known to decrease text readability [2].
  • In many cases, there is a better way to define a literal without resorting to a macro (e.g. using const, enum, or constexpr); see the sketch after this list.
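
For illustration, here is a minimal sketch of the macro-free alternatives; the constant names are made up for the example:

~~~~~
#define LIMIT 10 + 100              // textual, untyped, re-expanded at each use

const int kLimitConst = 110;        // typed, visible to the debugger
enum { kLimitEnum = 110 };          // usable where a constant expression is required
constexpr int kLimitCxx = 110;      // C++11 compile-time constant

int res_macro = LIMIT * LIMIT;      // expands to 10 + 100 * 10 + 100 == 1110
int res_const = kLimitConst * kLimitConst;  // 110 * 110 == 12100, as intended
~~~~~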

What problems do function-like macros pose?

A function-like macro has zero or more arguments:

#define MAX(a, b) a > b? a:b

In addition to all the problems that object-like macros have, function-like macros have more of their own.

  • Their return result is untyped. E.g. a “boolean” macro-function may in fact return an integer outside the set {0, 1}, which in many contexts will be treated unexpectedly.
  • Their arguments are untyped. Unexpected narrowing of sub-expressions may happen.
  • Special care is required around the expansion of their arguments. E.g. every use of an argument must be enclosed in parentheses (); see the sketch after this list.
  • Special care is required around side effects in their arguments.
  • These macros are especially ugly when they do not fit on a single line. Many line-continuation symbols \ are needed to keep such a macro correct. In the preprocessed output, all these lines are glued into a single one.
  • Multi-line void macro bodies need to be enclosed in do-while(false) blocks to avoid misinterpretation in certain contexts; this is also shown in the sketch below.
  • Debugger cannot inspect their values for you.
  • You cannot put a breakpoint on a macro.
  • Special tools are required to analyze macros that call other macros, in order to partially expand them (e.g. Eclipse CDT can do that).
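
To illustrate the parenthesization and do-while(false) points, here is a minimal sketch of the usual defensive conventions, including a corrected counterpart of the MAX macro above; the SWAP_INT name is made up:

~~~~~
// Every argument use and the whole body are parenthesized; otherwise
// MAX(1 + 2, 3) or 2 * MAX(a, b) would expand into a wrong expression.
#define MAX(a, b) ((a) > (b) ? (a) : (b))

// A multi-statement void macro is wrapped in do-while(false) so that it
// acts as a single statement, e.g. in an unbraced if-else branch.
#define SWAP_INT(a, b) do { \
    int tmp_ = (a);         \
    (a) = (b);              \
    (b) = tmp_;             \
} while (false)
~~~~~

Note that even the parenthesized MAX still evaluates its arguments twice, so MAX(i++, j) remains broken; parentheses do not solve the side-effect problem.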

Function-like macros are often seen as a way to enforce inlining and ensure high performance of the resulting code. That might have been relevant in the 1990s. Compiler technology has improved significantly since then, as has CPU technology. Compilers are often better than humans at deciding which code is worth inlining. Processors execute function calls and returns faster. Excessive inlining may even decrease performance by bloating code beyond the instruction cache. It is still possible to affect compiler decisions by using real functions with static inline or similar compiler attributes; see the sketch below. Without solid profiler data, using function-like macros trades away code readability for a premature optimization, i.e. a lose-lose situation.
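
For instance, a plain function with an inlining hint achieves the same effect while remaining typed and debuggable (a sketch; always_inline is a GCC/Clang-specific attribute):

~~~~~
// A real function instead of a macro: typed, debuggable, breakpoint-able.
static inline int max_int(int a, int b) {
    return a > b ? a : b;  // arguments are evaluated exactly once
}

// If profiling really justifies forcing the issue, GCC and Clang accept:
//   static inline int max_int(int a, int b) __attribute__((always_inline));
~~~~~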

What problems do conditional compilation blocks pose?

A conditional compilation block is a block inside #if-#endif or #ifdef-#endif, with an optional #else in the middle.

Problems:

  • Nested conditional compilation sections lead to exponential growth of code paths; see the sketch after this list. Even two levels of #ifdef-#endif mean there are four compilation paths, three levels give eight paths, etc. To cover all of them with tests, the software has to be compiled four, eight, or more times, and the tests must be repeated the same number of times. Obviously, nobody does that, which leads to a lack of coverage.
  • The same uncontrolled growth of code paths happens when a single condition is too complex. #if FOO_IS_ABC && BAR_IS_XYZ entails up to four compilations to test all combinations of this block with the outside code.
  • Interleaving regular if-then-else statements (or equivalents, like the ternary operator cond ? val1 : val2) with conditionally compiled #if-#endif blocks produces the least readable code of all. It also routinely hides correctness issues.
  • The same level of confusion is planted in many other contexts, e.g. optional function arguments under conditional compilation, C++-specific definitions in a header included into both C and C++, logical expressions with several members under conditional compilation.
  • Long conditionally compiled sections leave the reader unable to decide whether the currently visible page of source text is relevant or not. A single unmarked #endif may close any of the #if-sections opened above. Without scrolling up and down and counting the #ifs and #endifs, it is impossible to tell.
  • Conditional sections may hide dead code for decades. E.g. a block of code meant for a dead host platform may remain long after support for that platform has ended. People will still be affected by such code, because they will read it and attempt to mimic it, or try to keep it in a buildable state.
  • Mistyped expressions may be silently accepted. The #ifdef directive will not report a problem if the symbol inside it is mistyped. The same applies to #if defined().
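
A minimal sketch of the path explosion: two nested conditions already produce four distinct programs, each of which must be built and tested separately.

~~~~~
#ifdef FOO
  #ifdef BAR
    /* variant 1: FOO and BAR */
  #else
    /* variant 2: FOO only */
  #endif
#else
  #ifdef BAR
    /* variant 3: BAR only */
  #else
    /* variant 4: neither */
  #endif
#endif
~~~~~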

What problems do include directives carry?

An include is an #include "file-name.h" directive. Includes are relatively harmless compared to other preprocessor facilities. Still, a few problems are specific to them.

  • Non-self-sufficient headers complicate code maintenance [3]. Adding new includes above or below them may change the outcome of compilation; see the sketch after this list.
  • Includes in the middle of a file are unexpected. All includes should be aggregated at the top of the file.
  • Includes with “one level up” ../ components in their file paths mean there is a design problem hidden within the build configuration (e.g. Makefiles). Usually it is a case of feature envy.
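
To illustrate the first point, here is a sketch of a header that is not self-sufficient; the file and function names are made up:

~~~~~
/* broken.h: uses size_t but does not include <stddef.h> itself */
size_t count_items(const char *buf);

/* user.c: compiles only because <stddef.h> happens to come first */
#include <stddef.h>   /* removing or reordering this line breaks broken.h */
#include "broken.h"
~~~~~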

Are there situations where macros are appropriate?

Of course yes. The preprocessor is a powerful mechanism which, when used sparingly and appropriately, makes code better. Some examples of appropriate usage:

  • Macros that perform code generation using token and string concatenation.
  • Generic containers in C (but not in C++).
  • Asserts and logging functions are macros because they refer to the current source code line and file name. This functionality cannot be easily replicated without the preprocessor; see the sketch after this list.
  • Code that guards host- and environment-specific facilities, e.g. interfaces to Linux and Windows, 64- and 32-bit builds, or different compilers. Such code should usually be confined to a thin layer of conditionally compiled definitions. The purpose is to hide the host differences from the rest of the code base, allowing it to be more environment-agnostic.
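
As an example of the assert/logging case, a minimal sketch of a logging macro; __FILE__ and __LINE__ expand at the point of use, which a plain function cannot replicate (before C++20’s std::source_location):

~~~~~
#include <cstdio>

// Reports the caller's location, not the macro definition's location.
// msg is expected to be a C string.
#define LOG(msg) \
    std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, (msg))
~~~~~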

Note that using the preprocessor in C++ is even less warranted than in C, because templates are usually a safer alternative free of many preprocessor deficiencies (such as opaqueness to the compiler and debugger, lack of typing, vague scoping, etc.).

Are there any examples of bad usage?

Let us now look at some examples.

Function-like macros

#define BIT(x, n) ((x) & (1ULL << (n)))

The return type of this macro is supposed to be bool (or at least an integer in the range 0..1), but the macro itself does not enforce that. Not a big deal in this case, as long as all expressions involving this macro are logical rather than arithmetic (e.g. as long as nobody tries to accumulate the number of set bits in a loop). It still leaves room for usage errors. This possibility would not be present if it instead were a function bool bit(uint64 x, int n); a sketch of such a function follows.
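
A sketch of that function; uint64_t from <cstdint> stands in for the project’s uint64 type:

~~~~~
#include <cstdint>

// The bool return type is now enforced by the compiler.
static inline bool bit(uint64_t x, int n) {
    return (x & (1ULL << n)) != 0;
}
~~~~~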

#define MASK(n) ((1ULL << (n)) | ((1ULL << (n)) - 1ULL))

Parameter n is used twice in this expression. It is up to the macro caller to ensure that (n) does not have side effects, or else they would happen twice, most likely unintentionally (see the sketch below). Again, no such issue if it were a function.
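
A sketch of how the double expansion misfires when the argument has a side effect:

~~~~~
// MASK(n++) expands into
//   ((1ULL << (n++)) | ((1ULL << (n++)) - 1ULL))
// n is incremented twice, the two shifts see different values, and the
// two unsequenced increments are in fact undefined behavior.
uint64_t m = MASK(n++);  // almost certainly not what the caller meant
~~~~~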

#define SET(x, n) BIT(x, n) ? (x) | (~MASK(n)) : (x);

Pay close attention! The trailing semicolon is likely not intentional here, but it will prevent the macro from being used in many contexts (see the sketch below). The whole macro body should probably also be wrapped in parentheses to reduce the chances of it expanding incorrectly as part of a more complex expression. None of this would be an issue if it were a function.
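
A sketch of a context where the trailing semicolon breaks compilation:

~~~~~
if (cond)
    x = SET(x, n);   // the macro brings its own ";", so this expands to ";;"
else                 // the extra empty statement detaches the "else",
    x = 0;           // and the compiler reports "else without a previous if"
~~~~~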

#define FOO_GOO(x, n) (((x) > MPOS((n))) ? (SET, POS((n))) : \
  ... many lines follow ...

A long multi-line macro with non-trivial conditional logic (a macro calling other macros) is impossible to inspect during a debugging session. I would have to use pen and paper to deduce its value, which is slow and error-prone. If it were a function, I could just ask the debugger to calculate its value for me.

#define PIPI4(ptr, i) ((int8)((uint8)((ptr)->i8[i >> 1]) << (4 * ((i ^ 1) & 1))) >> 4)

Just look at it. Can you tell what it does and how, whether it is correct, and whether it will remain correct at all potential use sites?

Object-like macros

#define GBYTE 1024 * 1024 * 1024

When used in a mask calculation expression:

uint64 mask = ~GBYTE;

It expands into:

uint64 mask = ~1024 * 1024 * 1024;

The bitwise inversion operator has higher precedence than the multiplication operator, so only 1024 gets inverted. This results in an incorrect calculation; see the sketch below.
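
Either parenthesizing the definition or, better, using a typed constant removes the precedence trap; a sketch:

~~~~~
#include <cstdint>

#define GBYTE_SAFE (1024ULL * 1024 * 1024)          // parenthesized macro
constexpr uint64_t kGbyte = 1024ULL * 1024 * 1024;  // C++ alternative

uint64_t mask = ~kGbyte;  // the inversion now applies to the whole value
~~~~~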

Conditional compilation

#if defined(DEST_FOO) && defined(SOURCE_BAR)

This seems innocuous enough. However, the source-level entanglement between these two features became severe and widespread in many places. We had to spend a significant amount of time untangling them when a requirement arrived to have them separated.

#if DOMAIN_64_BIT

...

#ifdef TARGET_HAS_FOO

Nesting Foo inside the 64-bit requirement turned out to be problematic during development. We discovered that the inner code could not be made available outside the 64-bit-only models: the innermost conditional code made many implicit assumptions about the presence of the outermost macro guard. It was all fine and invisible until someone had to move the code around.

#ifdef INCLUDE_FOO

... thousand lines here ..

#ifdef INCLUDE_FOO

... a couple of hundred lines here ...

#endif

...

#endif

There was once a large block of code with nested conditional sections. The same symbol was checked by both the inner and the outer condition, so the inner check was obviously redundant. We can only guess that it was added by someone who had failed to find the boundaries of the outermost block, so large it was, and who decided to add a new pair of guards just in case.

#if 0 /* Do not change it, it makes debugging harder */

This dead code had been around for ages. The accompanying comment explains little of its intent.

Finally, the crown of ugliness: a single if-condition with many conditional compilation blocks spliced into it.

~~~~~

if (entry
#if COND_FOO
        /* some comment */
        && (op_is_instrument(obj) || !bob->foo
#if defined(FEATURE_XYZ)
        || bob->flags_x || !(entry->attrs.a & UsedMask)
#else
        || bob->pc || !entry->attrs.b
#endif
        || obj->mode != Supervisor || !bob->baz)
#endif /* COND_FOO */
#if FEATURE_BAR
        /* HACK: Disable baz when kpi is enabled. */
        && (!bob->kpi || op_is_instrument(obj))
#endif
#ifdef FEATURE_ALFA
    /* Disable baz for shadow request */
    && (obj->access != ShadowMask)
#endif

) {...}

~~~~~

The boolean condition here is a collection of many sub-expressions. Many of them are guarded by their own conditional compilation sections, some of them with else-alternatives. Some of them are nested.

Code formatting is helpless to clarify this mess. The way to address it is to give names to the individual sub-expressions and keep them separate, as sketched below.
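
A possible shape of that refactoring; the helper names, parameter types, and the exact split are illustrative, not taken from the original code:

~~~~~
// Each feature contributes one named predicate, defined once per
// configuration, instead of splicing fragments into a single expression.
static bool foo_allows(const Bob *bob, const Obj *obj, const Entry *entry) {
#if COND_FOO
    // The FEATURE_XYZ / else variation lives here, in exactly one place.
    return op_is_instrument(obj) || !bob->foo /* ... */;
#else
    return true;  // feature compiled out: no extra restriction
#endif
}

// bar_allows() and alfa_allows() follow the same pattern.

if (entry && foo_allows(bob, obj, entry)
          && bar_allows(bob, obj)
          && alfa_allows(obj)) { /* ... */ }
~~~~~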

References

  1. Flávio Medeiros et al. Discipline Matters: Refactoring of Preprocessor Directives in the #ifdef Hell
  2. https://en.wikipedia.org/wiki/All_caps#Readability
  3. https://stackoverflow.com/questions/1892043/self-sufficient-header-files-in-c-c

Written by Grigory Rechistov in Uncategorized on 28.11.2021. Tags: preprocessor.


Copyright © 2024 Grigory Rechistov