Using virtual dispatch can get relatively expensive in terms of clock cycles due to multiple levels of indirection, including indirect branching as well as `this` pointer adjustment. Wise programmers do not use virtual dispatch without a good reason, but oftentimes it is required, either by design or when creating non-template reusable components and libraries, where the final implementation of some parts of the program is not known.
Optimizing compilers do a great deal of work trying to optimize virtual dispatch as much as possible, sometimes even eliminating it from the final program entirely. Let's take a quick look at how compilers implement virtual dispatch and how it can be optimized away, and go over a few tricks programmers can use to achieve better runtime performance.
Virtual Dispatch Mechanics
The implementation of virtual dispatch in C++ is not covered by the language standard; it is up to the compiler to decide how to implement it. All of the compilers that I've worked with implement it using the virtual method table concept. The details of an exact implementation tend to get very complex once multiple inheritance, virtual inheritance, and optimizations get involved. For the simplest cases, one can think of a straightforward implementation where a base class has a pointer to an array of function pointers. To demonstrate this, let's write a functional equivalent of the below C++ program in C.
Assuming that no optimization is performed, the above code can be described in C like this:
By looking at how virtual functions are called, it becomes apparent that the invocation of a virtual function has some overhead. Basically, three steps need to be performed to call a virtual function:
- An offset must be applied to the virtual table in order to obtain the address of the slot holding the pointer to the function to call.
- That pointer must be dereferenced to read the address of the function.
- An indirect call must be made to call the function by the address stored in a register.
x86_64 Machine Code
To demonstrate this, let's look at a disassembled function that takes an instance of class A by reference and calls its virtual foo() method:
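The function in question might look like this (a sketch; foo() is assumed to occupy a non-first vtable slot so that the offset step is visible in the generated code):

```cpp
struct A {
    virtual void bar() {}             // first vtable slot
    virtual int foo() { return 7; }   // second slot, offset 8 on x86_64
};

int call_foo(A &a) {
    return a.foo();   // one virtual call, nothing else
}
```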
GCC would generate the following code for x86_64 with all optimizations turned off:
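The exact listing varies from one GCC version to another; at -O0 on x86_64 it typically looks something along these lines (assuming foo() occupies the second vtable slot, at offset 8):

```asm
call_foo(A&):
        push    rbp
        mov     rbp, rsp
        mov     QWORD PTR [rbp-8], rdi    # spill the argument to the stack
        mov     rax, QWORD PTR [rbp-8]
        mov     rax, QWORD PTR [rax]      # load vtable base address
        add     rax, 8                    # apply offset of foo() in the vtable
        mov     rdx, QWORD PTR [rax]      # read the address of the function
        mov     rax, QWORD PTR [rbp-8]
        mov     rdi, rax                  # pass 'this' in rdi
        call    rdx                       # indirect call
        pop     rbp
        ret
```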
Putting the irrelevant parts aside, there are four instructions involved in the virtual function call:
- Load the vtable base address.
- Apply the offset to get the address of the vtable slot of the function to call.
- Read the address of the function.
- Perform the indirect call.
With optimization turned on, GCC omits the frame pointer, gets rid of one extra memory load, and no longer moves parameters around on the stack. The code is much better, yet the most expensive operations are still there:
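With -O2, the same function typically compiles down to something like this (again, assuming foo() lives at offset 8 in the vtable):

```asm
call_foo(A&):
        mov     rax, QWORD PTR [rdi]      # load vtable base address
        jmp     [QWORD PTR [rax+8]]       # read foo()'s address and tail-call it
```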
In more complex scenarios, such as multiple inheritance, the compiler might also need to adjust the value of the `this` pointer that is implicitly passed to the virtual function, so that it points to the correct base class in the inheritance hierarchy. One more level of indirection might also be needed in the case of virtual inheritance.
There are a few optimizations that compilers can do to improve the runtime performance when it comes to virtual dispatch.
Extracting the Function Pointer
Calling a virtual function involves multiple steps. When a virtual function is called multiple times, such as in a loop or in subsequent calls on the same object, the compiler optimizes the code by performing the following steps just once:
- Loading the member pointer from the virtual table. It consists of a function pointer and the information needed to adjust the `this` pointer.
- Extracting the function pointer from the member pointer.
Then the compiler simply performs the call itself each time, using the extracted function pointer with the adjusted `this` pointer.
The function pointer extraction is "unsafe" and "dangerous", and the compiler does it automatically, behind the scenes. GCC, unlike other compilers, provides a non-standard extension for brave developers who know what they are doing, so that they can do this manually. It is part of the pointer-to-member function (PMF) conversion functionality and is described in §7.6 of the GCC manual:
7.6 Extracting the function pointer from a bound pointer to member function
In C++, pointer to member functions (PMFs) are implemented using a wide pointer of sorts to handle all the possible call mechanisms; the PMF needs to store information about how to adjust the this pointer, and if the function pointed to is virtual, where to find the vtable, and where in the vtable to look for the member function. If you are using PMFs in an inner loop, you should really reconsider that decision. If that is not an option, you can extract the pointer to the function that would be called for a given object/PMF pair and call it directly inside the inner loop, to save a bit of time.
Note that you still pay the penalty for the call through a function pointer; on most modern architectures, such a call defeats the branch prediction features of the CPU. This is also true of normal virtual function calls.
The syntax for this extension is
```cpp
extern A a;
extern int (A::*fp)();
typedef int (*fptr)(A *);

fptr p = (fptr)(a.*fp);
```
For PMF constants (i.e. expressions of the form &Klasse::Member), no object is needed to obtain the address of the function. They can be converted to function pointers directly:
fptr p1 = (fptr)(&A::foo);
You must specify -Wno-pmf-conversions to use this extension.
This is also possible to do with other compilers, though in a far more dangerous manner. For example, having a member pointer and knowing the ABI of a given platform (for example, the Itanium C++ ABI describes the member pointer layout in §2.3, Member Pointers), programmers can "extract" the `this` adjustment information as well as the raw function pointer, essentially performing the PMF conversion manually.
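As an illustration only, here is a sketch of how such a manual extraction could look on an x86-64 Itanium-ABI platform (GCC or Clang on Linux). The layout struct and helper names are mine, and the code relies on ABI details rather than standard C++, so it is deliberately non-portable:

```cpp
#include <cstdint>
#include <cstring>

struct A {
    virtual ~A() = default;
    virtual int foo() { return 42; }
};

// Itanium C++ ABI member pointer layout (§2.3): 'ptr' holds either the
// function address or 1 + byte-offset-into-vtable; 'adj' holds the
// this-pointer adjustment. This is ABI-specific, not portable C++.
struct pmf_layout {
    std::uintptr_t ptr;
    std::ptrdiff_t adj;
};

using raw_fn = int (*)(A *);

// Resolve the raw function pointer the same way (obj->*pmf)() would.
raw_fn extract(A *obj, int (A::*pmf)()) {
    pmf_layout rep;
    static_assert(sizeof(pmf_layout) == sizeof pmf, "unexpected PMF size");
    std::memcpy(&rep, &pmf, sizeof rep);
    char *self = reinterpret_cast<char *>(obj) + rep.adj;  // adjust 'this'
    if (rep.ptr & 1) {  // odd value: virtual, encodes vtable offset + 1
        char *vtable = *reinterpret_cast<char **>(self);
        return *reinterpret_cast<raw_fn *>(vtable + rep.ptr - 1);
    }
    return reinterpret_cast<raw_fn>(rep.ptr);  // even: a plain address
}
```

Note that in a real multiple-inheritance scenario the caller would also have to apply `adj` to the object pointer before the call; in this single-inheritance sketch the adjustment is zero.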
Devirtualization
In order to reduce the overhead and improve runtime performance, optimizing compilers attempt to convert calls to virtual functions into direct calls. This is done both within a procedure and interprocedurally, as part of indirect inlining and interprocedural constant propagation.
This optimization technique is implemented by most production-grade C++ compilers; for example, both GCC and clang perform it. To demonstrate what it does, let's say we've got a piece of code written by a third-party company that does a lot of great things and looks like this:
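A sketch of what that third-party interface might look like (the names are illustrative):

```cpp
// Hypothetical third-party library interface.
struct A {
    virtual ~A() = default;
    virtual int foo() { return 0; }
};

// Works with any object derived from A; hence the virtual call.
int do_something(A &a) {
    return a.foo();
}
```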
At this point, there isn't much that can be done because the end-user's use case is not yet known. But let's say the user of the above API writes the following program:
By looking at the code, it is obvious that class B is the final class in the inheritance hierarchy. Therefore, when the do_something() function is called, calling the foo() method directly would be functionally equivalent to calling it through the virtual table.
When compiling the above program with optimization turned on, the compiler takes that information into account. Since it has the definition of the do_something() function body, a wise compiler chooses to perform a direct call or even inline the body of the derived class's virtual function, which in the above case is B::foo().
Interestingly enough, since all of the code is known to the compiler, it optimizes the program further, eventually reducing it to just a few CPU instructions:
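Assuming the devirtualized and inlined call folds down to a constant (42 in this illustrative example), the whole program typically collapses to:

```asm
main:
        mov     eax, 42                   # the virtual call is gone entirely
        ret
```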
From the above observation we can draw a conclusion: when the compiler has all of the information it needs, it optimizes away the virtual dispatch.
Link Time Optimization
Needless to say, the above example is very simple and naive. In real life it is unlikely that the do_something() function body will be known. Because virtual dispatch was chosen over template metaprogramming, the body of the function is likely complex and resides in a compiled object file. In that case the compiler is not able to deduce the information it needs to perform devirtualization, which makes this optimization unlikely to occur in real life.
There is a way to make it work though. Both GCC and clang provide a feature called Link Time Optimization (aka LTO). When this feature is enabled, compilers generate object files in an intermediate language: GCC uses GIMPLE for this purpose, while Clang uses LLVM bitcode. Such "intermediate" object files can later be used by the compiler to perform intermodular optimizations. At the same time, the source code itself does not have to be distributed, which makes it great for paranoid software vendors. Intermodular optimizations, which in the case of LTO are performed at the linking stage, allow the compiler to inline functions defined in separate object files or, most importantly for our case, perform devirtualization across different translation units.
Getting back to our simple example: given that the user program is stored in its own source file while the do_something() function is defined in file2.cc, the compiler does not perform devirtualization. However, when both files are compiled with LTO enabled, the resulting program is optimized down to two CPU instructions. Here is an example of compiling and linking two separate files with LTO using GCC:
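A typical invocation might look like this, assuming the user program lives in file1.cc (the name is illustrative) and do_something() lives in file2.cc:

```sh
g++ -O2 -flto -c file1.cc
g++ -O2 -flto -c file2.cc
g++ -O2 -flto file1.o file2.o -o program
```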
The only time this won't work is when the code is provided in the form of a shared object, because LTO does not work across shared object boundaries.
So if you plan on providing a library that must use virtual dispatch and is performance-critical to the point where the tradeoffs of virtual dispatch start to matter, consider distributing the code in the form of a pre-compiled static library built with LTO support.
C++11 Final Keyword
Sometimes compilers are not able to determine that a class is the final class in the inheritance hierarchy. Consider the following code:
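A sketch of such code, with illustrative names:

```cpp
struct A {
    virtual ~A() = default;
    virtual int foo() { return 0; }
};

struct B : public A {
    int foo() override { return 42; }
};

// Some other translation unit could derive from B and override foo(),
// so the call below has to stay indirect.
int do_something(B &b) {
    return b.foo();
}
```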
In the above example, the compiler does not know whether some other class could inherit from class B or not. Therefore, it would not devirtualize the invocation of b.foo() when generating the code for the do_something() function body.
If, however, nothing else inherits from B, then programmers might want to consider using the final keyword introduced in C++11. Once class B is marked final, the compiler knows that B is the last class in the inheritance tree, and it immediately optimizes away all indirect calls.
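With the final specifier added (a sketch, with illustrative names), the hierarchy is sealed and the call can be devirtualized:

```cpp
struct A {
    virtual ~A() = default;
    virtual int foo() { return 0; }
};

struct B final : public A {   // nothing can inherit from B anymore
    int foo() override { return 42; }
};

int do_something(B &b) {
    return b.foo();  // now a candidate for a direct call to B::foo()
}
```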
Devirtualization in C
Even though the C language does not have the concept of virtual dispatch built in, compilers do recognize the pattern and can still optimize away the indirect calls. Both GCC and clang compile a simple C virtual dispatch example program into, well, almost nothing: the function pointer is never loaded at run time, and main is reduced to returning a constant directly.