How CentiLeo Works

Jun 24, 2020 06:17

Here we describe the stages of the CentiLeo renderer. It is organized as a pipeline and works the same way for Interactive Preview Render (IPR) and for the Picture Viewer.

It helps to know which stages are involved and which hardware devices influence them in certain types of scenes. This knowledge can help improve the rendering experience.

The process is complex under the hood, but the user controls are very simple: select the GPUs and the RAM Texture Cache size.

CentiLeo pipeline.png

Cinema 4D side (Get Scene Data)

To render a scene, CentiLeo must first get it from the host app (Cinema 4D). On Cinema's side, the scene data (geometry, tags, camera, materials, shaders and textures) is evaluated with Cinema's internal functions and provided to CentiLeo as data allocated in CPU RAM.

The process is organized so that individual polygon mesh objects are usually re-evaluated by Cinema only if they have changed (through generators, deformers, mesh or UV edits). In IPR, unchanged meshes are not reloaded for CentiLeo, which saves time.

The scene object tree (a description of objects, their positions and instances, but not the actual polygon data) is re-evaluated entirely if at least one object or object tag has changed. This tree evaluation is repeated once for each motion blur step set in the CentiLeo render settings, to support multi-step transformation blur.

Render settings and material data are cheap to load and are reloaded for CentiLeo whenever they change.
This data delivery process is optimized and will be optimized further. Still, a powerful CPU and fast RAM can help a lot for a better experience.

CentiLeo side (Scene Prepare)

On input, CentiLeo transforms all host app scene data into compressed internal data structures for further rendering. This process runs on a single Master GPU only, chosen as the most powerful GPU with the largest amount of memory among the GPUs the user selected in the render settings.

CentiLeo stores all prepared geometry data in an internal buffer, labeled ALL PREPARED SCENE GEOMETRY in the figure. This buffer is allocated in CPU RAM and takes around 40 bytes per triangle, including all the data structures. For triangles with deformation motion blur and additional vertex attributes (additional UVs), the storage per triangle can grow to 60 or even 120 bytes.
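
As a rough illustration of these per-triangle figures, the sketch below estimates the size of the prepared geometry buffer for a scene. The category names and the example triangle counts are hypothetical, and the text does not say exactly which attributes lead to 60 or to 120 bytes, so treat the result as an order-of-magnitude estimate rather than what CentiLeo actually allocates.

```python
# Approximate per-triangle costs quoted in the text, in bytes.
# The category names are hypothetical labels chosen for this sketch.
BYTES_PER_TRIANGLE = {
    "plain": 40,          # ordinary triangles
    "light_extras": 60,   # some extra attributes or deformation blur
    "heavy_extras": 120,  # upper figure mentioned in the text
}


def prepared_geometry_bytes(triangle_counts):
    """Estimate the buffer size for a dict of {category: triangle_count}."""
    return sum(
        BYTES_PER_TRIANGLE[category] * count
        for category, count in triangle_counts.items()
    )


# Hypothetical scene: 50M plain triangles plus 10M with extra attributes.
scene = {"plain": 50_000_000, "light_extras": 10_000_000}
estimate = prepared_geometry_bytes(scene)
print(f"Estimated prepared geometry: {estimate / 1024**3:.2f} GiB")
```

The estimate only covers the geometry buffer in CPU RAM. Texture caches and rendered images are accounted for separately.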

Apart from the Prepared Scene Geometry buffer, the preparation stage also creates:

  • GPU Caches (rays, textures, geometry), sized automatically from 2GB up to the maximum GPU memory capacity (see the GPU Memory Size and Out-of-core sections);
  • CPU Texture Cache (4-16GB, set by the user in the render settings);
  • Prepared Scene Textures in the form of *.cntx files.

Prepared Scene Texture files *.cntx

If the scene material data links to textures, bitmaps or HDRIs, the CentiLeo Scene Prepare stage generates cache files (*.cntx) for them and stores these cache files in the same disk locations as the original files.

These files contain unpacked bitmap data structured for high-performance out-of-core texture rendering. Using these cache files saves RAM, because in RAM we only need to allocate a much smaller, size-limited cache buffer for the texture system. They also make the scene project load and prepare much faster on later launches, because the cache does not have to be built again.

The disadvantage of *.cntx files is that they are currently uncompressed and hence occupy a lot of disk space (similar to .bmp files). We chose this approach because disk storage is much cheaper than RAM or GPU memory, and quick project launches are important. The amount of free disk space can therefore matter.

In future CentiLeo releases these cache files will be compressed to a fraction of their current size to mitigate the disk space problem.

Rendering

CentiLeo rendering is organized as a path tracer with many internal stages, all executed on one or MULTIPLE GPUs selected by the user in the render settings. The GPUs may differ in specs and on-board memory size. CentiLeo silently uses all the resources that each specific device makes available.

More powerful GPUs, and more of them, produce faster renders.

Each GPU works on its own regions of the image, constantly improving the entire image. Once regions are ready, they are contributed to the RENDERED IMAGES buffer, which includes the Main Beauty image and the other extra images selected in the render settings. This buffer is shared among all render GPUs and is allocated in CPU RAM because it may occupy too much memory for high-resolution images.

CentiLeo is designed to produce almost 100% GPU utilization during rendering.

GPU Memory Size

GPU memory size is an important property for rendering. However, the renderer doesn't see all the memory listed in the GPU product specs.

OS memory reduction. Windows 10 reserves a noticeable portion of the GPU memory on every GPU installed in the system. E.g. in a system with two GeForce GTX 1070 cards (each with 8GB of memory), the accessible portion can be only 6.8GB per GPU. This is all the memory a renderer can use.

Other memory reductions. Memory can also be lost unnecessarily because other programs use the GPU: e.g. Photoshop, Nuke, Google Chrome, other modeling apps. When memory size is critical, it's recommended to close them.

GPU caches.png

CentiLeo renderer automatically splits the remaining part of GPU memory (after OS reductions) in the following balanced way (the actual data is displayed in GPU Indicators of IPR window):

  1. Rays buffer. The size is from 640MB to 1152MB depending on the total accessible GPU memory. This buffer is needed by all the multi-stage GPU rendering tasks.
  2. Texture Cache. The size is from 256MB to 896MB depending on the total accessible GPU memory. This cache holds the texture data used by rendering calculations.
  3. Geometry Cache. The size depends on the scene geometry data size and on the accessible GPU memory that remains after the OS reduction, the Rays buffer and the Texture Cache. At this point the maximum Geometry Cache size for a GTX 1070 can be: 6800 - 1152 - 896 MB = 4752 MB.

At 40 bytes per triangle, the 4752 MB Geometry Cache of a GTX 1070 can hold up to about 125 million unique triangles, so scenes up to that size render without the out-of-core geometry mode.
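
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function names are ours, and it assumes that "MB" means 2^20 bytes, which is the reading under which 4752 MB at 40 bytes per triangle comes to roughly 125 million triangles.

```python
MIB = 1024 * 1024  # assumption: "MB" in the text means 2**20 bytes


def max_geometry_cache_mb(accessible_mb, rays_mb=1152, texture_mb=896):
    """Memory left for the Geometry Cache after the other two buffers."""
    return accessible_mb - rays_mb - texture_mb


def triangles_that_fit(cache_mb, bytes_per_triangle=40):
    """Number of whole triangles that fit in a cache of the given size."""
    return cache_mb * MIB // bytes_per_triangle


# GTX 1070 example from the text: 6800 MB accessible after the OS reduction.
cache_mb = max_geometry_cache_mb(6800)
capacity = triangles_that_fit(cache_mb)
print(f"Geometry Cache: {cache_mb} MB, about {capacity / 1e6:.0f} million triangles")

# Heavier triangles (for example 60 bytes each) reduce the capacity.
heavy_capacity = triangles_that_fit(cache_mb, bytes_per_triangle=60)
print(f"At 60 bytes per triangle: about {heavy_capacity / 1e6:.0f} million")
```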

If some meshes have extra vertex attributes or deformation motion blur, they consume more memory and fewer unique triangles fit into the GPU Geometry Cache.

CentiLeo doesn't allocate more memory upfront than the scene needs. But if the scene is too large and requires more storage than the Geometry Cache size, the out-of-core geometry process is enabled automatically.

Other GPUs, like the RTX 2080, have more memory and also a higher OS memory reduction, but the Rays buffer and Texture Cache together never exceed 2048MB on them, so there is more potential space for the Geometry Cache.

Out-of-core geometry

Even if the GPU Geometry Cache can't hold the entire scene geometry, the scene can still be rendered on the GPU using a special out-of-core geometry technology that delivers data from CPU RAM to the GPU on demand for rendering tasks. The out-of-core mode is enabled silently when needed. Within the same render job, some GPUs may work out-of-core while others, with more memory, work in-core.

CentiLeo supports rendering up to 3× the amount of geometry data that fits in the smallest Geometry Cache among the MULTIPLE GPUs selected in the render settings.

For example, with a Geometry Cache of 4752MB (as determined for the GTX 1070), which physically holds 125 million unique triangles, CentiLeo can render scenes of up to 375 million triangles in total using the out-of-core mode.
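
The rule can be sketched as a small check over the Geometry Cache sizes of the selected GPUs. The function below is our own illustration, not CentiLeo's actual decision logic: it assumes the limit is exactly 3× the smallest cache and that a GPU works in-core whenever the whole scene fits into its cache. The 8000 MB cache of the second GPU is a hypothetical value.

```python
def geometry_modes(scene_mb, geometry_cache_mb, out_of_core_factor=3):
    """Return the expected mode per GPU, or raise if the scene is too large.

    scene_mb: prepared scene geometry size in MB.
    geometry_cache_mb: one Geometry Cache size in MB per selected GPU.
    """
    limit_mb = out_of_core_factor * min(geometry_cache_mb)
    if scene_mb > limit_mb:
        raise ValueError(f"scene needs {scene_mb} MB, limit is {limit_mb} MB")
    return [
        "in-core" if scene_mb <= cache_mb else "out-of-core"
        for cache_mb in geometry_cache_mb
    ]


# A 6000 MB scene on a GTX 1070 (4752 MB cache) plus a larger, hypothetical GPU.
modes = geometry_modes(6000, [4752, 8000])
print(modes)
```

Note that the smallest cache sets the limit for the whole rig, so one small GPU in a multi-GPU setup caps the scene size for all of them.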

In our out-of-core implementation the most useful data is held in the Geometry Cache most of the time, while other portions of data come from CPU RAM to the GPU Geometry Cache on demand through the CPU RAM -> GPU connection, which involves the CPU, RAM, motherboard and PCI Express (PCIe). The scheme below shows the computer components that influence the GPU data delivery speed.

out-of-core.png

Although the software part of the out-of-core engine is very efficient, the hardware part, including PCIe and CPU RAM bandwidth, plays a very important role in out-of-core rendering.
In a MULTI GPU render setup the hardware requirements are even higher. To deliver data from RAM to the GPUs efficiently at the claimed PCIe speed, several more conditions should be met: more motherboard PCIe lanes (physical and chipset electrical), more CPU cores, more CPU lanes, a higher CPU frequency and higher-bandwidth RAM.

What are Lanes?

These are channels that let data flow through PCIe to and from the GPU (in our case). The more lanes, the higher the data bandwidth and the better the render speed for huge scenes.

E.g. a typical GeForce GPU with PCIe Gen 3 x16 uses 16 lanes (16GB/sec for transfer to the GPU, which comes to ~12GB/sec on Windows 10 in practice).

Motherboard Lanes

A motherboard may have enough physical slots for PCI Express x16 connections but not enough electrical support on the chipset side. Some motherboards have 7 physical PCIe x16 slots, but then there is also the problem of connecting, spacing and cooling the resulting GPU "hamburger". Likewise, a motherboard may claim support for 64 lanes, but a few of those lanes go to NVMe, and the GPU connections follow configuration schemes such as x16/x8/x16/x8, x16/x16/x16 or x8/x8/x8/x8/x8/x8, which means that some of the GPU PCIe connections run at half of the maximum PCIe speed. In practice, x8 is not bad at all!

Once the GPUs are placed in the motherboard PCIe slots, the PCIe lanes operate at the speed given by the specified scheme, depending on which slots are occupied.
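
To get a feeling for what such a scheme means in practice, the sketch below converts lane counts into rough transfer times. It starts from the ~12GB/sec practical figure quoted above for Gen 3 x16 and assumes that bandwidth scales linearly with the lane count, which is a simplification. The function names and the 3GB payload are hypothetical.

```python
PRACTICAL_X16_GB_PER_SEC = 12.0  # practical Gen 3 x16 figure quoted in the text


def slot_bandwidth_gb_per_sec(lanes):
    """Rough bandwidth of one slot, assuming linear scaling with lanes."""
    return PRACTICAL_X16_GB_PER_SEC * lanes / 16


def transfer_seconds(data_gb, lanes):
    """Time to push data_gb of data to a GPU connected with this many lanes."""
    return data_gb / slot_bandwidth_gb_per_sec(lanes)


# One of the schemes mentioned in the text: x16/x8/x16/x8.
scheme = [16, 8, 16, 8]
for slot, lanes in enumerate(scheme, start=1):
    seconds = transfer_seconds(3.0, lanes)
    print(f"slot {slot}: x{lanes}, 3 GB in about {seconds:.2f} s")
```

Under these assumptions an x8 slot takes twice as long per transfer as an x16 slot, which only matters when the renderer actually has to stream a lot of data out-of-core.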

CPU Lanes

To prepare data for MULTI GPU transfer in time, the CPU must quickly process the incoming requests from all the GPUs. This is harder when several GPUs working in parallel issue data transfer requests at the same time. So the number of CPU cores (one per GPU), the CPU frequency and the number of lanes supported by the CPU are all important. The CPU is most efficient if it can service the full lane count of each GPU multiplied by the number of GPUs in the rig.

Apart from other specs, CPUs are limited to 24, 32, 40, 48 or 64 lanes, which has an impact on Multi-GPU systems. These parameters determine the maximum achievable data delivery speed when several GPUs request CPU RAM data at the same time.
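
The relation between CPU lanes and the number of GPUs can be checked with simple arithmetic. The helper names below are hypothetical, and the sketch ignores lanes consumed by other devices (such as NVMe drives) and the mixed schemes real motherboards offer. It only asks which uniform link width all GPUs could get at once.

```python
def required_lanes(gpu_count, lanes_per_gpu=16):
    """Lanes needed to run every GPU at its full link width."""
    return gpu_count * lanes_per_gpu


def widest_uniform_link(cpu_lanes, gpu_count, widths=(16, 8, 4, 2, 1)):
    """Widest link width that every GPU can have at once, or 0 if none fits."""
    for width in widths:
        if width * gpu_count <= cpu_lanes:
            return width
    return 0


# Four GPUs want 64 lanes in total; a 40-lane CPU can only give each of them x8.
print(required_lanes(4))
print(widest_uniform_link(40, 4))
print(widest_uniform_link(64, 4))
```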

RAM bandwidth

When a data transfer to the GPU is requested, the CPU has to do some copy/pack work. The speed of this process depends on RAM bandwidth, which is highly influential in data-heavy scenes. DDR4 or the upcoming DDR5 memory is well suited to these tasks.

Out-of-core textures

When the GPU Texture Cache can't hold all the scene texture data, the out-of-core CPU->GPU mechanism is enabled to deliver the data needed for rendering on demand. The out-of-core textures mode is enabled often, because it's common today to have scenes with several GBs of textures, while the CentiLeo GPU Texture Cache is always configured to be smaller than 1GB.

On top of that, there is a CPU TEXTURE CACHE allocated in CPU RAM, with a size limit set by the user in the render settings.

When the scene textures don't fit into the CPU RAM Texture Cache, the requested data is transferred from the *.cntx files (on disk) to the CPU Texture Cache.
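
The three storage levels described here (GPU Texture Cache, CPU Texture Cache, *.cntx files on disk) behave like a two-level cache in front of a slow store. The sketch below illustrates that lookup order only. The tile granularity, the least-recently-used eviction policy and all names are assumptions made for this example, not details of the CentiLeo implementation.

```python
from collections import OrderedDict


class LruCache:
    """Tiny least-recently-used cache holding at most `capacity` tiles."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used


def fetch_tile(tile_id, gpu_cache, cpu_cache, read_from_disk):
    """Return (data, source), looking in the GPU cache, CPU cache, then disk."""
    data = gpu_cache.get(tile_id)
    if data is not None:
        return data, "gpu"
    data = cpu_cache.get(tile_id)
    source = "cpu"
    if data is None:
        data = read_from_disk(tile_id)  # stands in for reading a *.cntx file
        source = "disk"
        cpu_cache.put(tile_id, data)
    gpu_cache.put(tile_id, data)
    return data, source


gpu_cache, cpu_cache = LruCache(1), LruCache(2)
sources = [
    fetch_tile(t, gpu_cache, cpu_cache, lambda tile: f"tile-{tile}")[1]
    for t in (1, 1, 2, 1)
]
print(sources)
```

In the example run, the second request for tile 1 is served by the GPU cache, while the last one falls back to the CPU cache because tile 2 has pushed tile 1 out of the tiny GPU cache.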

out-of-core textures.png

The same hardware properties as for out-of-core geometry (motherboard and CPU lanes, RAM bandwidth) influence the texture engine too, but to a much smaller degree and with a smaller performance impact. The reason is that textures are 2D images, and their filtering helps to hide a lot of data-intensive requests.

Out-of-core textures in CentiLeo can support from a few dozen up to several hundred GBs of scene textures rendered on consumer GPUs at great speed.
It's also recommended to keep the TEXTURE CACHE size at 4-16GB in practically all cases.

Render Images Accumulation

CentiLeo rendered images are stored in a buffer allocated in CPU RAM. Multiple GPUs each render parts of the image and, once ready, contribute the updated data to this buffer over the PCIe bus. In practice this approach allows working with huge image resolutions and with multiple extra render images (AOVs) at the same time. In our implementation this operation has no significant influence on total rendering time.
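
To see why this buffer is kept in CPU RAM rather than in GPU memory, it helps to estimate its size. The sketch below assumes four 32-bit float channels per pixel, which is a common layout but not a documented CentiLeo format. The resolution and the number of images are hypothetical.

```python
def image_buffer_bytes(width, height, image_count, channels=4, bytes_per_channel=4):
    """Estimated size of `image_count` full-resolution images.

    Assumes 4 channels of 32-bit floats per pixel (an assumption of this sketch).
    """
    return width * height * channels * bytes_per_channel * image_count


# Hypothetical job: an 8K frame with the Main Beauty image plus four AOVs.
total_bytes = image_buffer_bytes(7680, 4320, image_count=5)
print(f"Rendered images buffer: about {total_bytes / 1024**3:.2f} GiB")
```

Under these assumptions the images alone would take a sizeable part of the memory of a GPU like the GTX 1070, memory that is better spent on the Geometry Cache.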