Expression Templates in C++
Last updated on March 9th, 2025
Not to be confused with template
, an expression template is an object that represents an expression or computation.
This is a powerful metaprogramming tool because it gives developers complete control over how the computations are performed. We could evaluate an expression template any time we want, substitute values, inspect it, or even transform it. Some common applications of expression templates include lazy evaluation[1], symbolic execution, optimization, and introducing new semantic meanings to expressions.
To briefly illustrate the concept, a simple expression template representing the expression could be:
template<typename L, typename R>
struct add {
L lhs;
R rhs;
};
template<typename L, typename R>
struct mul {
L lhs;
R rhs;
};
auto expression_template = add(mul(2, 3), 4);
The syntax add(mul(2, 3), 4)
can be a bit cumbersome and unnatural for expressions[2] but with operator overloading it would be possible to build the same template by using wrapper types for leaf-node terms with syntax along the lines of term(2) * 3 + 4
or even a user-defined literal like 2_term * 3 + 4
.
In this blog post I'll briefly go over some examples of how expression templates are used then I will walk through a implementing a simple expression template that can evaluate an expression with placeholders, such as param_1 * param_2 + param_3
. I'll talk about performance and then I'll showcase how expression templates can be used to add chained comparisons to with the syntax 100_chain <= x <= 200
. Familiarity with fundamentals is required, however, I have tried to write this in such a way that it is approachable for a wide audience.
Some Applications
Expression templates for lazy evaluation are often useful in the context of linear algebra. For example, we might be interested in specific components of
Bindings for the Z3 theorem prover in many languages use expression templates to allow users to write constraints naturally instead of potentially cumbersome function calls. For example, the following constructs expression templates which are passed to Z3's symbolic solver:
x = Int('x')
y = Int('y')
solve(x > 2, y < 10, x + 2 * y == 7)
Another fantastic application is automatic differentiation, which is a method for computing derivatives based on expression trees that is more accurate than numerical differentiation and faster than symbolic differentiation.
views are also expression templates and they are evaluated lazily. They are a good example of how expression templates can be used to introducing new semantics to operations and of how they can be used with different inputs:
auto pipeline =
std::views::filter([] (auto x) { return x % 2 == 0; })
| std::views::transform([] (auto x) { return x * 5; });
std::vector<int> vec1{1, 2, 3, 4, 5};
std::vector<unsigned> vec2{6, 7, 8, 9, 10};
for(auto x : vec1 | pipeline) {
std::println("{}", x);
}
for(auto x : vec2 | pipeline) {
std::println("{}", x);
}
Another good example is Boost.Lambda which uses expression templates built around placeholders to create a shorthand lambda syntax:
vec | std::views::filter(_1 % 2 == 0)
Expression templates can be used for optimization in a few ways. At an abstract level, various optimization techniques can be applied to expression trees encoded at compile-time in expression templates. These might be useful to linear algebra libraries where the compiler might not be able to figure out high-level optimizations on vector and matrix operations. They might also be useful for various accelerated or parallel computation libraries. Another way expression templates could be used for optimization is for improving string concatenation in . In , concatenating a handful of strings like str1 + str2 + str3 + str4 + str5
results in four temporary strings to be built, the last one usually being moved to where it is used. This potentially introduces a lot of extra short-lived allocation and a lot of redundant copying of characters which the compiler is unlikely to optimize away for us. Instead, if an expression template were built representing the concatenation operation then evaluation code could analyze the expression tree and allocate a single buffer of size str1.size() + str2.size() + str3.size() + str4.size() + str5.size()
.
Lastly, my assertion library, libassert, uses expression templates and some macro tricks to save expression operands, evaluate the full expression, and write out the operands in the event of a failure:
int a = 15;
int b = 10;
ASSERT(a < b);
Assertion failed at demo.cpp:20: int main():
ASSERT(a < b);
Where:
a => 15
b => 10
Implementing a Simple Template
We're going to implement a simple expression template system for expressions with placeholder terms. Our end-goal will be to support the syntax of param_1 * param_2 + param_3
.
The first thing we need is a class representing a binary expression. The job of this class is very simple, simply encode the following expression tree:
graph TD op op --> a op --> b
Unlike the very start of this article where I used separate class templates add
and mul
for the sake of simplicity, here we're going to use a function object[3] as a template parameter which will encode the operation.
template<typename L, typename R, typename Op>
class binary_expr {
L lhs;
R rhs;
public:
constexpr binary_expr(L lhs, R rhs) : lhs(lhs), rhs(rhs) {}
};
NOTE
For the sake of simplicity I'm forgoing perfect forwarding here. The types we will be working with in this blog are all trivially copyable.
NOTE
I'll be marking all functions as cosntexpr
here. This isn't necessary or an optimization, just something we may as well do.
We're also going to add a few helper template aliases using function objects from <functional>
:
template<typename L, typename R>
using add = binary_expr<L, R, std::plus<void>>;
template<typename L, typename R>
using mul = binary_expr<L, R, std::multiplies<void>>;
We also know we want a handful of special numbered placeholders for parameters. Later during evaluation we are going to want these placeholders to know their corresponding index at compile time so we'll implement these with a non-type template parameter:
template<std::size_t Index>
struct indexed_parameter {};
constexpr indexed_parameter<0> param_1;
constexpr indexed_parameter<1> param_2;
constexpr indexed_parameter<2> param_3;
We can now start to use the function template we've created: add(mul(param_1, param_2), param_3)
builds an expression template for us to start to work with (though we can't do anything with it yet).
To achieve the ergonomics we want, param_1 * param_2 + param_3
, we can add some simple operator overloads:
template<typename T>
concept binary_expr_val =
specialization_of_nttp<T, indexed_parameter>
|| specialization_of<T, binary_expr>;
constexpr auto operator+(binary_expr_val auto lhs, binary_expr_val auto rhs) {
return add(lhs, rhs);
}
constexpr auto operator*(binary_expr_val auto lhs, binary_expr_val auto rhs) {
return mul(lhs, rhs);
}
NOTE
Because these are global operator templates it is important to constrain the templates to the types we actually handle. specialization_of
and specialization_of_nttp
are simple utilities based based on this stackoverflow answer.
Implementation for specialization_of
and specialization_of_nttp
// type templates
template <typename T, template <typename...> class Z>
struct is_specialization_of : std::false_type {};
template <typename... Args, template <typename...> class Z>
struct is_specialization_of<Z<Args...>, Z> : std::true_type {};
template <typename T, template <typename...> class Z>
concept specialization_of = is_specialization_of<T, Z>::value;
// non-type templates
template <typename T, template <auto...> class Z>
struct is_specialization_of_nttp : std::false_type {};
template <auto... Args, template <auto...> class Z>
struct is_specialization_of_nttp<Z<Args...>, Z> : std::true_type {};
template <typename T, template <auto...> class Z>
concept specialization_of_nttp = is_specialization_of_nttp<T, Z>::value;
The last thing to do is actually evaluate the expression template we've created. We'll want to support the syntax of expression.evaluate(a, b, c)
so we'll start with a top-level auto evaluate(auto&&... args) const
for binary_expr
. Evaluation is pretty simple overall, we simply evaluate operands recursively until we hit leaf nodes:
template<typename L, typename R, typename Op>
class binary_expr {
L lhs;
R rhs;
public:
constexpr binary_expr(L l, R r) : l(l), r(r) {}
constexpr auto evaluate(auto&&... args) const {
return Op{}(lhs.evaluate(args...), rhs.evaluate(args...));
}
};
This is all we need to do for binary_expr
. The next step is handling leaf nodes, specifically indexed_parameter
. We'll implement a similar evaluate
method:
template<std::size_t Index>
struct indexed_parameter {
constexpr auto& evaluate(auto&&... args) const {
return args...[Index];
}
};
I'm using pack indexing here which is supported by gcc 15 and clang 19. It's possible to implement this with template recursion or some tricks with index sequences, fold expressions, and immediately-invoked lambda-expressions (IILEs), however, for the sake of keeping this blog short it's convenient to use the new feature.
And with that we're able to evaluate our simple expression templates with a set of arguments:
auto expression = param_1 * param_2 + param_3;
std::cout << expression.evaluate(2, 3, 4); // prints 10
Final Implementation
#include <concepts>
#include <cstddef>
#include <functional>
#include <type_traits>
// type templates
template <typename T, template <typename...> class Z>
struct is_specialization_of : std::false_type {};
template <typename... Args, template <typename...> class Z>
struct is_specialization_of<Z<Args...>, Z> : std::true_type {};
template <typename T, template <typename...> class Z>
concept specialization_of = is_specialization_of<T, Z>::value;
// non-type templates
template <typename T, template <auto...> class Z>
struct is_specialization_of_nttp : std::false_type {};
template <auto... Args, template <auto...> class Z>
struct is_specialization_of_nttp<Z<Args...>, Z> : std::true_type {};
template <typename T, template <auto...> class Z>
concept specialization_of_nttp = is_specialization_of_nttp<T, Z>::value;
template<std::size_t Index>
struct indexed_parameter {
constexpr auto& evaluate(auto&&... args) const {
return args...[Index];
}
};
constexpr indexed_parameter<0> param_1;
constexpr indexed_parameter<1> param_2;
constexpr indexed_parameter<2> param_3;
template<typename L, typename R, typename Op>
class binary_expr {
L lhs;
R rhs;
public:
constexpr binary_expr(L lhs, R rhs) : lhs(lhs), rhs(rhs) {}
constexpr auto evaluate(auto&&... args) const {
return Op{}(lhs.evaluate(args...), rhs.evaluate(args...));
}
};
template<typename L, typename R>
using add = binary_expr<L, R, std::plus<void>>;
template<typename L, typename R>
using mul = binary_expr<L, R, std::multiplies<void>>;
template<typename T>
concept binary_expr_val =
specialization_of_nttp<T, indexed_parameter>
|| specialization_of<T, binary_expr>;
constexpr auto operator+(binary_expr_val auto lhs, binary_expr_val auto rhs) {
return add(lhs, rhs);
}
constexpr auto operator*(binary_expr_val auto lhs, binary_expr_val auto rhs) {
return mul(lhs, rhs);
}
NOTE
This is a very simplified approach expression templates. It has plenty of limitations such as not perfectly forwarding and only supporting indexed_parameter
s as leaf nodes. But, it illustrates the general ideas and can be expanded to support more functionality with relative ease.
Performance
We've implemented a simple expression template but how does it perform? Even this simple expression template requires a lot of objects to be constructed, functions to be called, and arguments to be passed around.
However, despite all of this, it turns out that this ends up being a zero-overhead abstraction. Looking at the following example in Compiler Explorer we find that we get instruction-for-instruction identical codegen with the expression template as if we did the same operation without the template:
int foo(int a, int b, int c) {
auto expression = param_1 * param_2 + param_3;
return expression.evaluate(a, b, c);
}
int bar(int a, int b, int c) {
return a * b + c;
}
foo(int, int, int):
imul edi, esi
lea eax, [rdi + rdx]
ret
bar(int, int, int):
imul edi, esi
lea eax, [rdi + rdx]
ret
Additionally, for the expression.evaluate(2, 3, 4)
demo in the previous section you may have noticed the instruction mov esi, 10
before the call to operator<<
. This is the compiler passing 10
directly to the function which means the compiler constant-propagated the computation expression.evaluate(2, 3, 4)
at compile-time for us.
Chained Comparisons
Many languages, notably Python, support chained comparison such as 100 <= x <= 200
with the meaning 100 <= x && x <= 200
. This functionality can provide a few main benefits: It can avoid repetition and bugs associated with repetition, it would allow fold expressions to work with comparisons (e.g. (args <= ...)
), and it is more natural interpretation of the expression than the C and interpretation of (100 <= x) <= 200
, which is legal but won't do what many users expect[4]. In recent years, there have been efforts to add support for chained comparisons to [P0893] [P3439], however, it is yet to be seen if it will become part of the language.
In this section we'll explore using expression templates to do chained comparisons in . We can't make the syntax 100 <= x <= 200
magically work as we have to build the expression template somehow, but, we can get close with wrapper types that have operator overloads. One way to do this could look like chain(100) <= chain(x) <= chain(200)
, but, we can do better. In an expression like 100 <= x <= 200
, it's sufficient to place wrap just a single leaf node in the expression tree: chain(100) <= x <= 200
. Overloaded operators will build the expression tree all the way up. We'll also support a user-defined literal to make the syntax even more minimal: 100_chain <= x <= 200
. We'll support chaining six comparison operators <=
, <
, ==
, !=
, >=
, and >
, including across precedence levels like treating chain(100) <= x == y <= chain(200)
as 100 <= x && x == y && y <= 200
.
NOTE
Because of operator precedence an expression like 100 <= x == y <= 200
parses as (100 <= x) == (y <= 200)
and we will have to use more than one chain()
wrapper to ensure the whole expression is built up as an expression template.
NOTE
[P3439] proposes only changing unparenthesized chains of comparisons while leaving expressions such as (a <= b) <= c
unchanged. I think this is a very reasonable and sensible approach. But, we won't be able to achieve quite the same semantics as there is no way for our expression templates to detect parentheses in such expressions.
We'll use a similar approach to what we used in the previous section. The main difference is that we won't use a special indexed_parameter
type and instead we'll store values directly. Once again, I am going to forgo perfect forwarding for the sake of simplicity. We'll start with the same binary_expr
class we used in the previous section:
template<typename L, typename R, typename Op>
class binary_expr {
L lhs;
R rhs;
public:
constexpr binary_expr(L lhs, R rhs) : lhs(lhs), rhs(rhs) {}
};
And we'll also add similar aliases but this time for comparison operations:
template<typename L, typename R>
using less_equal = binary_expr<L, R, std::less_equal<void>>;
// lots more ...
We'll also again add a set of operator overloads for our expression template:
template<typename L, typename R>
concept can_chain = specialization_of<L, binary_expr> || specialization_of<R, binary_expr>;
template<typename L, typename R> requires(can_chain<L, R>)
constexpr auto operator<=(L lhs, R rhs) {
return less_equal(lhs, rhs);
}
// lots more...
Again, it is important to constrain these overloads. In this case I've made a helper concept can_chain
that enables the operator overloads if either the left or right hand operands are binary_expr
s.
We could use this now with syntax along the lines of less_equal(100, x) <= 200
, however, this looks a bit funny and inconsistent and it's not ergonomic. In order to support the syntax chain(100) <= x <= 200
we'll use a helper struct to wrap a value and kick-start the expression template building.
template<typename T>
struct chain {
T wrapped_value;
};
We'll also have to update our can_chain
concept to support this new wrapper:
template<typename T>
concept chainable = specialization_of<T, binary_expr> || specialization_of<T, chain>;
template<typename L, typename R>
concept can_chain = specialization_of<L, binary_expr> || specialization_of<R, binary_expr>;
concept can_chain = chainable<L> || chainable<R>;
Now we can build our expression template with the syntax chain(a) <= b <= c
. The last step is to implement evaluation. To do this, we'll define an operator bool
on our expression template that will evaluate it when requested:
template<typename L, typename R, typename Op>
class binary_expr {
L lhs;
R rhs;
public:
constexpr binary_expr(L lhs, R rhs) : lhs(lhs), rhs(rhs) {}
constexpr operator bool() const {
...
}
};
If this binary_expr
had normal semantics it would be very simple to implement: We could simply return Op{}(lhs, rhs);
and call it a day. However, because we want to provide chaining semantics we'll have to do some extra logic. Additionally, because we store not just binary_expr
s and values we also store chain
wrappers we need to do a little extra work to unwrap the value if needed. I'm going to add a helper maybe_unwrap
for this and attempt to unwrap operands right away for evaluation to make things easier for us:
template<typename L, typename R, typename Op>
class binary_expr {
L lhs;
R rhs;
public:
constexpr binary_expr(L lhs, R rhs) : lhs(lhs), rhs(rhs) {}
template<typename T>
constexpr auto& maybe_unwrap(const chain<T>& value) const {
return value.wrapped_value;
}
template<typename T>
constexpr auto& maybe_unwrap(const T& value) const {
return value;
}
constexpr operator bool() const {
auto& lhs = maybe_unwrap(this->lhs);
auto& rhs = maybe_unwrap(this->rhs);
...
}
};
NOTE
Naming is hard so I'm reusing (and shadowing) lhs
and rhs
.
To evaluate a given binary_expr
there are four cases to consider:
- Both the left-hand and right-hand operand are
binary_expr
s - The right-hand operand is a
binary_expr
but the left-hand isn't - The left-hand operand is a
binary_expr
but the right-hand isn't - Neither are
binary_expr
Let's consider an expression template representing a <= b <= c == d <= e <= f
. This expression has a parse tree of:
graph TD eq["=="] lt1["<="] lt2["<="] lt3["<="] lt4["<="] eq --> lt1 eq --> lt3 lt1 --> lt2 lt2 --> a lt2 --> b lt1 --> c lt3 --> lt4 lt3 --> f lt4 --> d lt4 --> e
We'll start with considering a binary_expr
representing the root ==
in the expression. It has to do two things: It has to check c == d
and it has to check the rest of the chained expression. Checking c == d
requires us to recursively walk the expression tree in order to find the right-most node in the left-hand sub-tree (c
) and the left-most node in the right-hand sub-tree (d
).
We'll add two helper functions for this:
template<typename L, typename R, typename Op>
class binary_expr {
L lhs;
R rhs;
public:
constexpr binary_expr(L lhs, R rhs) : lhs(lhs), rhs(rhs) {}
constexpr auto& get_left_most() const {
if constexpr(specialization_of<L, binary_expr>) {
return lhs.get_left_most();
} else {
return lhs;
}
}
constexpr auto& get_right_most() const {
if constexpr(specialization_of<R, binary_expr>) {
return rhs.get_right_most();
} else {
return rhs;
}
}
template<typename T>
constexpr auto& maybe_unwrap(const chain<T>& value) const { ... }
template<typename T>
constexpr auto& maybe_unwrap(const T& value) const { ... }
constexpr operator bool() const { ... }
};
With these helpers, evaluating c == d
is a matter of checking Op{}(lhs.get_right_most(), rhs.get_left_most())
.
To check the rest of the chained expression we'll take advantage of a chained a <= b <= c == d <= e <= f
being equivalent to (a <= b <= c) && c == d && (d <= e <= f)
. The lhs
and rhs
for the binary_expr
for ==
are these chained sub-expressions we're interested in already, a <= b <= c
and d <= e <= f
respectively. In addition to computing c == d
all we have to do is simply evaluate the lhs
and rhs
if they are binary expressions:
template<typename L, typename R, typename Op>
class binary_expr {
L lhs;
R rhs;
public:
constexpr binary_expr(L lhs, R rhs) : lhs(lhs), rhs(rhs) {}
constexpr auto& get_left_most() const { ... }
constexpr auto& get_right_most() const { ... }
template<typename T>
constexpr auto& maybe_unwrap(const chain<T>& value) const { ... }
template<typename T>
constexpr auto& maybe_unwrap(const T& value) const { ... }
constexpr operator bool() const {
auto& lhs = maybe_unwrap(this->lhs);
auto& rhs = maybe_unwrap(this->rhs);
if constexpr(specialization_of<L, binary_expr> && specialization_of<R, binary_expr>) {
return lhs && Op{}(lhs.get_right_most(), rhs.get_left_most()) && rhs;
}
}
};
This is case one from earlier: Both lhs
and rhs
are binary_expr
s. Cases two and three fall out naturally from this, we simply evaluate the lhs
or rhs
only if they are binary_expr
s. Case four is the simplest, we just evaluate Op{}(lhs, rhs)
.
constexpr operator bool() const {
auto& lhs = maybe_unwrap(this->lhs);
auto& rhs = maybe_unwrap(this->rhs);
if constexpr(specialization_of<L, binary_expr> && specialization_of<R, binary_expr>) {
return lhs && Op{}(lhs.get_right_most(), rhs.get_left_most()) && rhs;
} else if constexpr(!specialization_of<L, binary_expr> && specialization_of<R, binary_expr>) {
return Op{}(lhs, rhs.get_left_most()) && rhs;
} else if constexpr(specialization_of<L, binary_expr> && !specialization_of<R, binary_expr>) {
return lhs && Op{}(lhs.get_right_most(), rhs);
} else {
return Op{}(lhs, rhs);
}
}
And with that we can evaluate our expression template as a chained comparison:
int x = 150;
std::cout
<< std::boolalpha
<< (chain(100) <= x <= 200); // prints true
Final Implementation
#include <concepts>
#include <cstddef>
#include <functional>
#include <type_traits>
template <typename T, template <typename...> class Z>
struct is_specialization_of : std::false_type {};
template <typename... Args, template <typename...> class Z>
struct is_specialization_of<Z<Args...>, Z> : std::true_type {};
template <typename T, template <typename...> class Z>
concept specialization_of = is_specialization_of<T, Z>::value;
template<typename T>
struct chain {
T wrapped_value;
};
template<typename L, typename R, typename Op>
class binary_expr {
L lhs;
R rhs;
public:
constexpr binary_expr(L lhs, R rhs) : lhs(lhs), rhs(rhs) {}
constexpr auto& get_left_most() const {
if constexpr(specialization_of<L, binary_expr>) {
return lhs.get_left_most();
} else {
return lhs;
}
}
constexpr auto& get_right_most() const {
if constexpr(specialization_of<R, binary_expr>) {
return rhs.get_right_most();
} else {
return rhs;
}
}
template<typename T>
constexpr auto& maybe_unwrap(const chain<T>& value) const {
return value.wrapped_value;
}
template<typename T>
constexpr auto& maybe_unwrap(const T& value) const {
return value;
}
constexpr operator bool() const {
auto& lhs = maybe_unwrap(this->lhs);
auto& rhs = maybe_unwrap(this->rhs);
if constexpr(specialization_of<L, binary_expr> && specialization_of<R, binary_expr>) {
return lhs && Op{}(lhs.get_right_most(), rhs.get_left_most()) && rhs;
} else if constexpr(!specialization_of<L, binary_expr> && specialization_of<R, binary_expr>) {
return Op{}(lhs, rhs.get_left_most()) && rhs;
} else if constexpr(specialization_of<L, binary_expr> && !specialization_of<R, binary_expr>) {
return lhs && Op{}(lhs.get_right_most(), rhs);
} else {
return Op{}(lhs, rhs);
}
}
};
template<typename L, typename R>
using equal_to = binary_expr<L, R, std::equal_to<void>>;
template<typename L, typename R>
using not_equal_to = binary_expr<L, R, std::not_equal_to<void>>;
template<typename L, typename R>
using greater = binary_expr<L, R, std::greater<void>>;
template<typename L, typename R>
using less = binary_expr<L, R, std::less<void>>;
template<typename L, typename R>
using greater_equal = binary_expr<L, R, std::greater_equal<void>>;
template<typename L, typename R>
using less_equal = binary_expr<L, R, std::less_equal<void>>;
template<typename T>
concept chainable = specialization_of<T, binary_expr> || specialization_of<T, chain>;
template<typename L, typename R>
concept can_chain = chainable<L> || chainable<R>;
template<typename L, typename R> requires(can_chain<L, R>)
constexpr auto operator==(L lhs, R rhs) {
return equal_to(lhs, rhs);
}
template<typename L, typename R> requires(can_chain<L, R>)
constexpr auto operator!=(L lhs, R rhs) {
return not_equal_to(lhs, rhs);
}
template<typename L, typename R> requires(can_chain<L, R>)
constexpr auto operator>(L lhs, R rhs) {
return greater(lhs, rhs);
}
template<typename L, typename R> requires(can_chain<L, R>)
constexpr auto operator<(L lhs, R rhs) {
return less(lhs, rhs);
}
template<typename L, typename R> requires(can_chain<L, R>)
constexpr auto operator>=(L lhs, R rhs) {
return greater_equal(lhs, rhs);
}
template<typename L, typename R> requires(can_chain<L, R>)
constexpr auto operator<=(L lhs, R rhs) {
return less_equal(lhs, rhs);
}
Syntax Sugar
The syntax chain(100) <= x <= 200
is pretty good but we can make this a little more minimal for terms with literal values by using a user-defined literal (UDL)[5]: 100_chain <= x <= 200
.
We could implement a simple UDL with the following:
constexpr auto operator ""_chain(unsigned long long value) {
return chain(value);
}
However, this runs into a limitation: The argument for an integer UDL value has to be an unsigned long long
and thus here we return a chain<unsigned long long>
. That would cause promotions in our comparison that we might not want. Additionally, it could introduce signed-unsigned comparisons in an expression which could be problematic[6].
There are a handful of ways to proceed here and cleaver tricks that can be used depending on goals but for the sake of this blog post I'll simply update the UDL to return the value as a wrapped int
:
constexpr auto operator ""_chain(unsigned long long value) {
return chain(static_cast<int>(value));
}
Fold Expressions
[P3439] highlights fold expressions as a motivating example for chained comparisons. Since we have made a tool to do chained comparison, let try it out for this application.
INFO
Fold expressions are a way to perform an operation on a pack of values. For example, if we had a function add(auto&&... values)
we could implement this as:
constexpr auto add(auto&&... values) {
return (values + ...);
}
For an input like add(1, 2, 3, 4)
this expands to return (1 + (2 + (3 + 4)));
.
Let's make a function is_sorted
that checkes if the pack of arguments are in sorted order. We can use a pack like (args <= ...)
to create an expression and then we'll use chain()
to make the fold expression build an expression template:
constexpr bool is_sorted(auto&&... args) {
return (chain(args) <= ...);
}
The easiest way to test this is a handful of static asserts:
static_assert(is_sorted(1, 2, 3, 3, 4, 5));
static_assert(!is_sorted(1, 2, 3, 4, 3, 5));
static_assert(!is_sorted(1, 2, 3, 3, 3, 1));
static_assert(!is_sorted(1, 4, 5, 1, 2, 3));
Performance
Once again, we have an elaborate setup that involves a lot of objects and function calls and template metaprogramming. None the less, compilers are good at optimizing transforming away superfluous computation.
A simple chained comparison such as chain(a) <= b <= c
generates as efficient assembly as we'd expect. Clang ends up producing assembly identical to that of a <= b && b <= c
(and gcc's is identical too other than register names):
cmp edi, esi
setle cl
cmp esi, edx
setle al
and al, cl
ret
chain(a) <= b <= c
vs a <= b && b <= c
codegen on clang
bool foo(int a, int b, int c) {
return chain(a) <= b <= c;
}
bool bar(int a, int b, int c) {
return a <= b && b <= c;
}
foo(int, int, int):
cmp edi, esi
setle cl
cmp esi, edx
setle al
and al, cl
ret
bar(int, int, int):
cmp edi, esi
setle cl
cmp esi, edx
setle al
and al, cl
ret
Now let's look at more complex example:
bool foo(int a, int b, int c, int d, int e, int f) {
return chain(a) <= b <= c == chain(d) <= e <= f;
}
bool bar(int a, int b, int c, int d, int e, int f) {
return a <= b && b <= c && c == d && d <= e && e <= f;
}
The code generated by gcc and clang for the two functions are close but not quite identical: Compiler Explorer.
Due to complexities of modern hardware (pipelining, superscalar execution, branch prediction, micro-architecture details, etc.) eyeballing x86 is impossible[7] and in general the only way to know for sure how something performs is to benchmark it. But, none the less, let's first look at some superficial eyeball metrics as a way of summarizing the high-level differences and similarities:
instructions (gcc) | instructions (clang) | branch (gcc) | branch (clang) | comparisons (gcc) | comparisons (clang) | |
---|---|---|---|---|---|---|
foo | 18 | 17 | no | yes | 5 | 5 |
bar | 15 | 15 | yes | no | 5 | 5 |
Overall, fairly similar but with minor differences. Whether or not branching is beneficial here will come down to the likelihood of the condition being branched on. In both cases, bar
uses slightly fewer instructions for both compilers. Whether or not this makes a difference will come down to port allocation, register dependencies, latencies, etc. One observation is that gcc's generation for foo
is three instructions longer than bar
and that comes down to three mov
s it has that bar
does not. That being said, this doesn't necessarily make it slower because of register renaming on x86 can make mov
s can effectively free.
We can use tools like llvm-mca or uiCA to do more sophisticated analysis and approximation of instruction performance on x86: Compiler Explorer. Here, llvm-mca estimates about 500 cycles for 100 iterations of bar
and about 540-600 cycles for 100 iterations of foo
, depending on the compiler[8].
NOTE
Tools like llvm-mca and uiCA don't model branch prediction and assume the instruction are queued in order regardless of branches.
This simulated difference isn't insignificant so it's worth investigating.
WARNING
This blog post is about to dive into some fairly technical compiler internals.
Inspecting The Optimizer
To begin figuring out what's going on here we can look at the optimization pass viewer on Compiler Explorer to inspect how clang (really LLVM[9]) is optimizing the two functions: Demo. I'm going to focus on clang in this section because LLVM, in my experience, has better tooling for studying optimization passes. The story for gcc is similar.
First looking at bar
, clang relatively early transforms this into something that is almost optimized. If you aren't familiar with LLVM's internal representation (IR) that is ok, the important thing to see here is that the instructions are a series of comparisons (icmp
s) then branches (br
s) which originate from the short-circuiting behavior of &&
in . I've also added some comments to explain what is being computed:
define dso_local noundef zeroext i1 @bar(int, int, int, int, int, int)(i32 noundef %a, i32 noundef %b, i32 noundef %c, i32 noundef %d, i32 noundef %e, i32 noundef %f) {
entry:
%cmp = icmp sle i32 %a, %b ; a <= b
br i1 %cmp, label %land.lhs.true, label %land.end ; "if the comparison is true continue, else go to the return block and return false"
land.lhs.true:
%cmp1 = icmp sle i32 %b, %c ; b <= c
br i1 %cmp1, label %land.lhs.true2, label %land.end ; "if the comparison is true continue, else go to the return block and return false"
land.lhs.true2:
%cmp3 = icmp eq i32 %c, %d ; c == d
br i1 %cmp3, label %land.lhs.true4, label %land.end ; "if the comparison is true continue, else go to the return block and return false"
land.lhs.true4:
%cmp5 = icmp sle i32 %d, %e ; d <= e
br i1 %cmp5, label %land.rhs, label %land.end ; "if the comparison is true continue, else go to the return block and return false"
land.rhs:
%cmp6 = icmp sle i32 %e, %f ; e <= f
br label %land.end ; "jump to the return block and return the value of this comparison"
land.end: ; preds = %land.rhs, %land.lhs.true4, %land.lhs.true2, %land.lhs.true, %entry
%0 = phi i1 [ false, %land.lhs.true4 ], [ false, %land.lhs.true2 ], [ false, %land.lhs.true ], [ false, %entry ], [ %cmp6, %land.rhs ]
ret i1 %0
}
A later pass then removes these branches to produce a chain of comparisons and boolean operations:
define dso_local noundef zeroext i1 @bar(int, int, int, int, int, int)(i32 noundef %a, i32 noundef %b, i32 noundef %c, i32 noundef %d, i32 noundef %e, i32 noundef %f) local_unnamed_addr {
entry:
%cmp.not = icmp sgt i32 %a, %b
%cmp1.not = icmp sgt i32 %b, %c
%or.cond = or i1 %cmp.not, %cmp1.not
%or.cond.not = xor i1 %or.cond, true
%cmp3 = icmp eq i32 %c, %d
%or.cond11 = and i1 %or.cond.not, %cmp3
%or.cond11.not = xor i1 %or.cond11, true
%cmp5.not = icmp sgt i32 %d, %e
%or.cond12 = or i1 %or.cond11.not, %cmp5.not
%cmp6 = icmp sle i32 %e, %f
%spec.select = select i1 %or.cond12, i1 false, i1 %cmp6
ret i1 %spec.select
}
Optimizing foo
takes a lot more work: We have built up a lot of complex objects and we have a lot of function calls. LLVM has to do a few passes including inlining and peephole optimizations to combine instructions in order to massage the function down to something fairly optimal:
define dso_local noundef zeroext i1 @foo(int, int, int, int, int, int)(i32 noundef %a, i32 noundef %b, i32 noundef %c, i32 noundef %d, i32 noundef %e, i32 noundef %f) local_unnamed_addr #0 personality ptr @__gxx_personality_v0 {
entry:
%cmp.i.i.i.i.i = icmp sle i32 %a, %b ; a <= b
%cmp.i.i.i.i = icmp sle i32 %b, %c ; b <= c
%0 = and i1 %cmp.i.i.i.i.i, %cmp.i.i.i.i ; a <= b && b <= c
%cmp.i.i.i = icmp eq i32 %c, %d ; c == d
%or.cond.i = and i1 %0, %cmp.i.i.i ; (a <= b && b <= c) && c == d
br i1 %or.cond.i, label %land.rhs.i, label %exit
land.rhs.i:
%cmp.i.i.i.i17.i = icmp sle i32 %c, %e ; c <= e
%cmp.i.i.i18.i = icmp sle i32 %e, %f ; e <= f
%1 = and i1 %cmp.i.i.i.i17.i, %cmp.i.i.i18.i ; c <= e && e <= f
br label %exit
exit: ; preds = %entry, %land.rhs.i
%2 = phi i1 [ false, %entry ], [ %1, %land.rhs.i ]
ret i1 %2
}
Visually, this has a different structure. Instead of a series of comparisons and branches or a direct series of comparisons and boolean operations we are greeted with a middle state of two branches each with a series of comparisons and boolean operations. LLVM doesn't further transform this function even though we know it could be further transformed in theory.
The difference in codegen comes down to how we got here: Inlining vs starting with a simple sequence of comparisons and branches. Because of how LLVM orders its passes, code for the binary_expr
evaluating a <= b && b <= c
and c <= e && e <= f
had already been optimized by the time they were inlined into the binary_expr
computing the root .. && c == d && ..
. This created the middle state where the two sub-expression evaluations had been optimized out of branch form but the root .. && c == d && ..
had not. As a result, LLVM wasn't able to transform the full evaluation into the non-branching version as easily.
In short, this is a result of pass ordering[10].
To prove the theory I'm going to manually tell LLVM what order to run optimization passes in. This takes a couple of careful steps:
- I'm going to use Compiler Explorer to dump the unoptimized LLVM IR for
foo
: CE. It's also important to turn off demangling here since we're going to feed this IR into another tool parsing the IR. - Switching the language on compiler explorer to LLVM I'm going to feed the IR into
opt
, which is an LLVM tool to run optimization passes on LLVM IR and output the resulting LLVM IR. Instead of passing-O3
or similar, I'm explicitly going to tell LLVM the order I want it to run passes in. Specifically, I want LLVM to inline functions first before absolutely anything else. After that, SROA (transforming stack-based loads and stores to registers), InstCombine (instruction peephole optimization), and SimplifyCFG are sufficient for LLVM to optimize our expression template. We can tellopt
to do this with-passes=inline,sroa,instcombine,simplifycfg
: CE.
The LLVM IR we get as a result is pretty well optimized!
define dso_local noundef zeroext i1 @_Z3fooiiiiii(i32 noundef %a, i32 noundef %b, i32 noundef %c, i32 noundef %d, i32 noundef %e, i32 noundef %f) personality ptr @__gxx_personality_v0 {
%cmp.i.i.i.not.i.i = icmp sgt i32 %a, %b
%cmp.i.i.i.i.not = icmp sgt i32 %b, %c
%or.cond = or i1 %cmp.i.i.i.not.i.i, %cmp.i.i.i.i.not
%or.cond.not = xor i1 %or.cond, true
%cmp.i.i.i = icmp eq i32 %c, %d
%or.cond10 = and i1 %or.cond.not, %cmp.i.i.i
%or.cond10.not = xor i1 %or.cond10, true
%cmp.i.i.i.not.i3.i = icmp sgt i32 %d, %e
%or.cond11 = or i1 %or.cond10.not, %cmp.i.i.i.not.i3.i
%cmp.i.i.i7.i = icmp sle i32 %e, %f
%spec.select = select i1 %or.cond11, i1 false, i1 %cmp.i.i.i7.i
ret i1 %spec.select
}
In fact, it looks so similar to the optimized LLVM IR we dumped for bar
that it's worth doing a side-by-side comparison (I've cleaned up long lines to make it fit):
Expression template | Manually chained |
---|---|
llvm
| llvm
|
And would you look at that! It's the same, IR instruction for IR instruction!
With this understanding we can create a much simpler example showing where the difference originates:
static bool abc(int a, int b, int c) {
return a <= b && b <= c;
}
static bool def(int d, int e, int f) {
return d <= e && e <= f;
}
bool foo(int a, int b, int c, int d, int e, int f) {
return abc(a, b, c) && c == d && def(d, e, f);
}
bool bar(int a, int b, int c, int d, int e, int f) {
return a <= b && b <= c && c == d && d <= e && e <= f;
}
NOTE
I'm using static
here to give functions internal linkage so they don't show up in the output assembly after being inlined.
Here, foo
and bar
are clearly equivalent and clearly have no overhead but the generated machine code very closely matches the differences we saw when we first started looking at the more elaborate comparison example.
Key takeaway
Small differences in optimization order can have far-reaching and surprising indirect consequences.
Overall, this isn't a result of overhead from the expression template and this also isn't necessarily a deficiency in LLVM or a problem with the compiler's pass ordering. It's a simple consequence of small differences in the order of transformation precluding some optimization opportunities (or maybe making others possible). Effort could be put into adding pass logic for compilers to optimize code like this, however, it's not entirely clear how best to generate an expression like this. Simply trying to remove all branches, for example, could make things faster or slower depending on likelihoods.
Conclusion
Expression templates are very useful and extremely powerful metaprogramming tool that are applicable for a wide range of uses from lazy evaluation to introducing new expression semantics to other optimization applications. Implementing expression templates can be fairly simple and they can be written in such a way that they don't add run-time overhead. I hope this has been interesting in one form or another. Cheers!
Thanks to Che and Matt Godbolt for reviewing an early draft of this blog
The wikipedia page for expression templates focuses mostly on this application but the technique is much more broadly applicable. ↩︎
Expressions like
add(mul(2, 3), 4)
are effectively polish notation (+ (* 2 3) 4
or+ * 2 3 4
). I think these are really bad for expressibility, however, they're sometimes required in languages that don't have operator overloading. Common examples include working with bignum or high-precision arithmetic libraries in languages like C. ↩︎Function objects in are objects that have a call operator. In other words, they can be used like functions. Lambdas are the most prevalent use of function templates, these are implemented under the hood as struct types with an
operator()
. Function objects are useful for template metaprogramming because they allow functions to be represented as types and used along the lines ofF{}(args...)
. ↩︎Attempting to write a chained comparison like
100 <= x <= 200
can be a common footgun for novice C and programmers. ↩︎These were added in though they have seemed to become increasingly popular in recent years. ↩︎
Comparisons between signed and unsigned integers are a footgun in (e.g.
1u < -1
evaluates totrue
), but, this happens to be another great use of expression templates! An expression template could be made that always evaluates comparisons with sign safety, e.g. withstd::cmp_less
etc. My assertion library, libassert, used to do this by default. ↩︎But that doesn't stop people from trying! ↩︎
Interestingly, llvm-mca seems to be allocating register pressure to the
mov
s that I hoped would end up being register-renamed. ↩︎Both clang and LLVM are part of the LLVM project but "clang" generally refers to the C and front-end while LLVM handles the middle and back-end of compilation (optimization in intermediate representations and final machine code generation). ↩︎
Pass ordering for optimizing compilers is a notoriously difficult problem. ↩︎