



GPU Zen 2

GPU Zen 2 Advanced Rendering Techniques

Edited by Wolfgang Engel

Black Cat Publishing Inc. Encinitas, CA

Editorial, Sales, and Customer Service Office Black Cat Publishing Inc. 144 West D Street Suite 204 Encinitas, CA 92009 http://www.black-cat.pub/

Copyright © 2019 by Black Cat Publishing Inc. All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without written permission from the copyright owner.

ISBN 13: 978-1-79758-314-3

Printed in the United States of America 12 11 10 09 08

10 9 8 7 6 5 4 3 2 1

Contents

Preface  xi

I  Rendering  1
   Patrick Cozzi, editor

1  Adaptive GPU Tessellation with Compute Shaders  3
   Jad Khoury, Jonathan Dupuy, and Christophe Riccio
   1.1  Introduction  3
   1.2  Implicit Triangle Subdivision  4
   1.3  Adaptive Subdivision on the GPU  9
   1.4  Discussion  13
   1.5  Acknowledgments  14
   Bibliography  15

2  Applying Vectorized Visibility on All Frequency Direct Illumination  17
   Ho Chun Leung, Tze Yui Ho, Zhenni Wang, Chi Sing Leung, and Eric Wing Ming Wong
   2.1  Introduction  17
   2.2  The Precomputed Radiance Transfer  18
   2.3  Rewriting the Radiance Equation  20
   2.4  The Vectorized Visibility  22
   2.5  Lighting Evaluation  23
   2.6  Shader Implementation for the Generalized SAT Lookup  28
   2.7  Dynamic Tessellation  30
   2.8  Results  34
   2.9  Conclusion  37
   2.10 Acknowledgments  38
   Bibliography  38

3  Nonperiodic Tiling of Noise-based Procedural Textures  41
   Aleksandr Kirillov
   3.1  Introduction  41
   3.2  Wang Tiles  42
   3.3  Nonperiodic Tiling of Procedural Noise Functions  43
   3.4  Tiled Noise Filtering  50
   3.5  Tiling Improvements  52
   3.6  Results  54
   3.7  Performance  54
   3.8  Limitations  58
   3.9  Conclusion  58
   3.10 Future Work  59
   Bibliography  60

4  Rendering Surgery Simulation with Vulkan  63
   Nicholas Milef, Di Qi, and Suvranu De
   4.1  Introduction  63
   4.2  Overview  63
   4.3  Render Pass Architecture  64
   4.4  Handling Deformable Meshes  69
   4.5  Memory Management System  71
   4.6  Performance and Results  73
   4.7  Case Study: CCT  75
   4.8  Conclusion and Future Work  76
   4.9  Source Code  77
   4.10 Acknowledgments  77
   Bibliography  77

5  Skinned Decals  79
   Hawar Doghramachi
   5.1  Introduction  79
   5.2  Overview  79
   5.3  Implementation  80
   5.4  Pros and Cons  86
   5.5  Results  87
   5.6  Conclusion  88
   Bibliography  88

II  Environmental Effects  89
   Wolfgang Engel, editor

1  Real-Time Fluid Simulation in Shadow of the Tomb Raider  91
   Peter Sikachev, Martin Palko, and Alexandre Chekroun
   1.1  Introduction  91
   1.2  Related Work  91
   1.3  Simulation  92
   1.4  Engine Integration  104
   1.5  Optimization  108
   1.6  Future Work  110
   Acknowledgments  110
   Bibliography  110

2  Real-time Snow Deformation in Horizon Zero Dawn: The Frozen Wilds  113
   Kevin Örtegren
   2.1  Introduction  113
   2.2  Related Work  114
   2.3  Implementation  114
   2.4  Results  120
   2.5  Conclusion and Discussion  122
   Bibliography  123

III  Shadows  125
   Mauricio Vives, editor

1  Soft Shadow Approximation for Dappled Light Sources  127
   Mariano Merchante
   1.1  Introduction  127
   1.2  Detecting Pinholes  129
   1.3  Shadow Rendering  133
   1.4  Temporal Filtering  135
   1.5  Results  137
   1.6  Conclusion and Future Work  139
   Bibliography  140

2  Parallax-Corrected Cached Shadow Maps  143
   Pavlo Turchyn
   2.1  Introduction  143
   2.2  Parallax Correction Algorithm  144
   2.3  Applications of Parallax Correction  149
   2.4  Results  150
   Bibliography  152

IV  3D Engine Design  155
   Wessam Bahnassi, editor

1  Real-Time Layered Materials Compositing Using Spatial Clustering Encoding  157
   Sergey Makeev
   1.1  Introduction  157
   1.2  Overview of Current Techniques  158
   1.3  Introduced Terms  159
   1.4  Algorithm Overview  159
   1.5  Algorithm Implementation  164
   1.6  Results  172
   1.7  Conclusion and Future Work  172
   Acknowledgments  175
   Bibliography  175

2  Procedural Stochastic Textures by Tiling and Blending  177
   Thomas Deliot and Eric Heitz
   2.1  Introduction  177
   2.2  Tiling and Blending  178
   2.3  Precomputing the Histogram Transformations  184
   2.4  Improvement: Using a Decorrelated Color Space  188
   2.5  Improvement: Prefiltering the Look-up Table  190
   2.6  Improvement: Using Compressed Texture Formats  195
   2.7  Results  196
   2.8  Conclusion  197
   Acknowledgments  197
   Bibliography  200

3  A Ray Casting Technique for Baked Texture Generation  201
   Alain Galvan and Jeff Russell
   3.1  Baking in Practice  202
   3.2  GPU Considerations  211
   3.3  Future Work  213
   Bibliography  214

4  Writing an Efficient Vulkan Renderer  215
   Arseny Kapoulkine
   4.1  Memory Management  216
   4.2  Descriptor Sets  219
   4.3  Command Buffer Recording and Submission  229
   4.4  Pipeline Barriers  233
   4.5  Render Passes  238
   4.6  Pipeline Objects  240
   4.7  Conclusion  245
   Acknowledgments  247

5  glTF—Runtime 3D Asset Delivery  249
   Marco Hutter
   5.1  The Goals of glTF  249
   5.2  Design Choices  250
   5.3  Feature Summary  251
   5.4  Ecosystem  256
   5.5  Tools and Workflows  257
   5.6  Extensions  261
   5.7  Application Support  263
   5.8  Conclusion  264

V  Real-time Ray Tracing  265
   Anton Kaplanyan, editor

1  Real-Time Ray-Traced One-Bounce Caustics  267
   Holger Gruen
   1.1  Introduction  267
   1.2  Previous Work  269
   1.3  Algorithm Overview  270
   1.4  Implementation Details  273
   1.5  Results  275
   1.6  Future Work  275
   1.7  Demo  276
   Bibliography  278

2  Adaptive Anti-Aliasing using Conservative Rasterization and GPU Ray Tracing  279
   Rahul Sathe, Holger Gruen, Adam Marrs, Josef Spjut, Morgan McGuire, and Yury Uralsky
   2.1  Introduction  279
   2.2  Overview  280
   2.3  Pixel Classification using Conservative Rasterization  280
   2.4  Improved Coverage and Shading Computation  284
   2.5  Image Quality and Performance  288
   2.6  Future Work  291
   2.7  Demo  291
   Bibliography  292

Preface

This book—like its long line of predecessors—is created with the intent of helping readers better achieve their goals. For generations, books have been used to preserve valuable information. They are an important source of knowledge in our modern world. With the rise of social media, information is obscured and transformed into whatever the agenda of the poster is. It became acceptable to bother other people with information that is sometimes tasteless, mindless and/or nonsensical in all areas of life, including graphics programming. Political parties and companies drive large-scale misinformation activities (sometimes called marketing or information warfare) with noise levels that are hard to bear.

This book is meant to provide an oasis of peace and intellectual reflection. All of us who worked on it tried to make sure this collection of articles is practically useful, stimulating for your mind, and a joy to read.

The awesome screenshot on the cover is provided by Jeroen Roding with permission from Guerrilla Games. Thank you! I would like to thank Eric Lengyel for editing the articles and creating the beautiful page layout. I would also like to thank Anton Kaplanyan, Mauricio Vives, Patrick Cozzi, and Wessam Bahnassi for being the section editors. At this point, I also want to thank everyone for supporting this book series and its predecessors since 2001. These books have started friendships, careers, companies, and much more over the years. They certainly changed my life in awesome ways!

Love and Peace,

—Wolfgang Engel


I

Rendering

Real-time rendering is an exciting field, in part because of how rapidly it evolves and advances, but also because of the graphics community's eagerness and willingness to share their new ideas, opening the door for others to learn and share in the fun! In this section we introduce five new rendering techniques that will be relevant to game developers, hobbyists, and anyone else interested in the world of graphics.

The article "Adaptive GPU Tessellation with Compute Shaders" by Jad Khoury, Jonathan Dupuy, and Christophe Riccio proposes making rasterization more efficient for moderately distant polygons by procedurally refining coarse meshes as they get closer to the camera with the help of compute shaders. They achieve this by manipulating an implicit (triangle-based) subdivision scheme for each polygon of the scene in a dedicated compute shader that reads from and writes to a compact, double-buffered array.

The article "Applying Vectorized Visibility on All Frequency Direct Illumination" by Ho Chun Leung, Tze Yui Ho, Zhenni Wang, Chi Sing Leung, and Eric Wing Ming Wong describes a new PRT approach with visibility functions represented in vector graphics form. This results in a different set of strengths and weaknesses compared to other PRT approaches. The new approach can preserve the fidelity of high-frequency shadows and accurately account for a huge number of light sources, even with coarsely tessellated 3D models. It can also handle the specular component, from mirror to blurry reflections.

The article "Nonperiodic Tiling of Noise-based Procedural Textures" by Aleksandr Kirillov shows a method to combine noise-based procedural texture synthesis with a nonperiodic tiling algorithm. It describes modifications to several popular procedural noise functions that directly produce texture maps containing the smallest possible complete Wang tile set. The method can be used as a preprocessing step or during application runtime.

The article "Rendering Surgery Simulation with Vulkan" by Nicholas Milef, Di Qi, and Suvranu De shows a rendering system design built around surgery simulation, including how higher-level design decisions propagate to lower-level usage of Vulkan.

The last article in the section, "Skinned Decals" by Hawar Doghramachi, describes a way to dynamically apply decals to a character, for example to show the impact position of a projectile. This technique overcomes the drawback of deferred decals in that scenario, where, if the target area is influenced by several bones, the decals "swim" on top of the target mesh.

—Patrick Cozzi


1  Adaptive GPU Tessellation with Compute Shaders

Jad Khoury, Jonathan Dupuy, and Christophe Riccio

1.1 Introduction

GPU rasterizers are most efficient when primitives project into more than a few pixels. Below this limit, the Z-buffer starts aliasing, and the shading rate decreases dramatically [Riccio 2012]; this makes the rendering of geometrically complex scenes challenging, as any moderately distant polygon will project to subpixel size. In order to minimize such subpixel projections, a simple solution consists in procedurally refining coarse meshes as they get closer to the camera. In this chapter, we are interested in deriving such a procedural refinement technique for arbitrary polygon meshes.

Traditionally, mesh refinement has been computed on the CPU via recursive algorithms such as quadtrees [Duchaineau et al. 1997, Strugar 2009] or subdivision surfaces [Stam 1998, Cashman 2012]. Unfortunately, CPU-based refinement is now fundamentally bottlenecked by the massive CPU-GPU streaming of geometric data it requires for high-resolution rendering. In order to avoid these data transfers, extensive work has been dedicated to implementing and/or emulating these recursive algorithms directly on the GPU by leveraging tessellation shaders (see, e.g., [Niessner et al. 2012, Cashman 2012, Mistal 2013]). While tessellation shaders provide a flexible, hardware-accelerated mechanism for mesh refinement, they remain limited in two respects. First, they only allow up to $\log_2 64 = 6$ levels of subdivision. Second, their performance drops along with subdivision depth [AMD 2013].

In the following sections, we introduce a GPU-based refinement scheme that is free from the limitations incurred by tessellation shaders. Specifically, our scheme allows arbitrary subdivision levels at constant memory costs. We achieve this by manipulating an implicit (triangle-based) subdivision scheme for each polygon of the scene in a dedicated compute shader that reads from and writes to a compact, double-buffered array.

First, we show how we manage our implicit subdivision scheme in Section 1.2. Then, we provide implementation details for the rendering programs we wrote that leverage our subdivision scheme in Section 1.3.

1.2 Implicit Triangle Subdivision

1.2.1 Subdivision Rule

Polygon refinement algorithms build upon a subdivision rule. The subdivision rule describes how an input polygon splits into subpolygons. Here, we rely on a binary triangle subdivision rule, which is illustrated in Figure 1.1(a). The rule splits a triangle into two similar subtriangles 0 and 1, whose barycentric-space transformation matrices are respectively

$$\mathbf{M}_0 = \begin{pmatrix} -\tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ -\tfrac{1}{2} & \phantom{-}\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{pmatrix} \tag{1.1}$$

and

$$\mathbf{M}_1 = \begin{pmatrix} \phantom{-}\tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ -\tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{pmatrix}. \tag{1.2}$$

Listing 1.1 shows the GLSL code we use to procedurally compute either $\mathbf{M}_0$ or $\mathbf{M}_1$ based on a binary value. It is clear that at subdivision level $N \geq 0$, the rule produces $2^N$ triangles; Figure 1.1(b) shows the refinement produced at subdivision level $N = 4$, which consists of $2^4 = 16$ triangles.

Figure 1.1. The (a) subdivision rule we apply on a triangle (b) uniformly and (c) adaptively. The subdivision levels for the red, blue, and green nodes are respectively 2, 3, and 4.


mat3 bitToXform(in uint bit)
{
    float s = float(bit) - 0.5;
    vec3 c1 = vec3(   s, -0.5, 0);
    vec3 c2 = vec3(-0.5,   -s, 0);
    vec3 c3 = vec3(+0.5, +0.5, 1);
    return mat3(c1, c2, c3);
}

Listing 1.1. Computing the subdivision matrix M0 or M1 from a binary value.

1.2.2 Implicit Representation

By construction, our subdivision rule produces unique subtriangles at each step. Therefore, any subtriangle can be represented implicitly via concatenations of binary words, which we call a key. In this key representation, each binary word corresponds to the partition (either 0 or 1) chosen at a specific subdivision level; Figure 1.1(b, c) shows the keys associated to each triangle node in the context of (b) uniform and (c) adaptive subdivision. We retrieve the subdivision matrix associated to each key through successive matrix multiplications in a sequence determined by the binary concatenations. For example, letting $\mathbf{M}_{0100}$ denote the transformation matrix associated to the key 0100, we have

$$\mathbf{M}_{0100} = \mathbf{M}_0 \cdot \mathbf{M}_1 \cdot \mathbf{M}_0 \cdot \mathbf{M}_0. \tag{1.3}$$

In our implementation, we store each key produced by our subdivision rule as a 32-bit unsigned integer. Below is the bit representation of a 32-bit word encoding the key 0100. Bits irrelevant to the code are denoted by the '_' character.

     MSB                                     LSB
     ____ ____ ____ ____ ____ ____ ___1 0100

Note that we always prepend the key's binary sequence with a binary value of 1 so we can track the subdivision level associated to the key easily. Listing 1.2 provides the GLSL code we use to extract the transformation matrix associated to an arbitrary key. Since we use 32-bit integers, we can store up to 32 − 1 = 31 levels of subdivision, which includes the root node. Naturally, more levels require longer words. Because longer integers are currently unavailable on many GPUs, we emulate them using integer vectors, where each component represents a 32-bit wide portion of the entire key. For more details, see our implementation, where we provide a 63-level subdivision algorithm using the GLSL uvec2 datatype.


mat3 keyToXform(in uint key)
{
    mat3 xf = mat3(1);

    while (key > 1u) {
        xf = bitToXform(key & 1u) * xf;
        key = key >> 1u;
    }

    return xf;
}

Listing 1.2. Key to transformation matrix decoding routine.

1.2.3 Iterative Construction

Subdivision is recursive by nature. Since GPU execution units lack stacks, implementing GPU recursion is difficult. In order to circumvent this difficulty, we store the triangles produced by our subdivision as keys inside a buffer that we update iteratively in a ping-pong fashion; we refer to this double buffer as the subdivision buffer. Because our keys consist of integers, our subdivision buffer is very compact. At each iteration, we process the keys independently in a compute shader, which is set to write into the second buffer. We allow three possible outcomes for each key: it can be subdivided to the next level, downgraded to the previous subdivision level, or conserved as is. Such operations are very straightforward to implement thanks to our key representation. The following bit representations match the parent of the key given in our previous example along with its two children:

         MSB                                     LSB
parent:  ____ ____ ____ ____ ____ ____ ____ 1010
key:     ____ ____ ____ ____ ____ ____ ___1 0100
child1:  ____ ____ ____ ____ ____ ____ __10 1000
child2:  ____ ____ ____ ____ ____ ____ __10 1001

Note that compared to the key representation, the other keys are either 1-bit expansions or contractions. The GLSL code to compute these representations is shown in Listing 1.3; it simply consists of bit shifts and logical operations, and is thus very cheap. Listing 1.4 provides the pseudocode we typically use for updating the subdivision buffer in a GLSL compute shader. In practice, if a key needs to be split, it emits two new words, and the original key is deleted. Conversely, when two sibling keys must merge, they are replaced by their parent's key. In order to avoid generating two copies of the same key in memory, we only emit the key once from the 0-child, identified using the test provided in Listing 1.5. We also provide some unit tests that we perform on the keys to avoid producing invalid keys in Listing 1.6. Keys that do not require any modification are simply re-emitted, unchanged.

uint parentKey(in uint key)
{
    return (key >> 1u);
}

void childrenKeys(in uint key, out uint children[2])
{
    // 1-bit expansions of the key (see the bit representations above).
    children[0] = (key << 1u) | 0u;
    children[1] = (key << 1u) | 1u;
}

Listing 1.3. Parent and children key computation.

[...]

    // ... (beginning of the listing omitted; the branch condition compares
    // the scrolled coordinate against g_FluidSimGridRes)
    {
        float2 coords = (float2(dtID.xy) + 0.5f) * g_InvFluidSimGridRes;
        coords = GetScrolledCoords(coords);
        g_uavOutput[dtID.xy] = g_texInflow.SampleLevel(
            BilinearSampler, coords, 0.0f).rgrg;
    }
    else
#endif
    {
        g_uavOutput[dtID.xy] = g_texInput[texCoord];
    }
}

Listing 1.1. Grid scrolling shader. g_vPosition stands for the character's position relative to the object with the fluid simulation, g_v(Inverse)Proxy(Half)Size stands for the size of the object in world units, and g_texInflow is a texture with the static density map.

1.3.4 Density Inflow

If you take a volume with two differently colored fluids and stir it long enough, eventually both fluids mix up to the point where you have one uniformly colored fluid. At this point, no matter how you interact with the fluid, you will never see any effect, because, practically speaking, there is now only one fluid.


Figure 1.4. Static density map (white) and simulated density area (red).

Over time, the density, disturbed by the fluid dynamics, should fade back into the static texture. We call this process of fading the density back to its default value density inflow. Listing 1.2 shows how the inflow is added into the density map. The following constants are introduced in it:

• g_FadeoutStart is the point (in [0, 1] space on the simulation grid) at which the fadeout starts, and g_FadeoutLength = 1 / (1 − g_FadeoutStart).
• g_DensityExponent defines how the inflow depends on the density itself.
• g_DissipationFactor is basically the speed at which the density map fades back to the static texture map, multiplied by the time elapsed since the previous frame.

There are a few tricks that deserve to be explained further. First, we use densityExponent in order to differentiate the speed at which high- and low-density areas dissipate. This feature is used primarily for the algae simulation: we want to get the effect of the algae 'closing back' after the player, so the higher the density is, the lower the dissipation speed. Second, the dissipation factor is made exponential in order to make the technique frame rate-independent. As mentioned in the listing description, g_DissipationFactor incorporates the elapsed time.

RWTexture2D g_uavOutput : register(u0);
Texture2D   g_texInput  : register(t0);
Texture2D   g_texInflow : register(t1);

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void main(uint3 dtID : SV_DispatchThreadID)
{
    float2 coords = (float2(dtID.xy) + 0.5f) * g_InvFluidSimGridRes;

    float proximityFactor = saturate(2.0f * length(coords - 0.5f));
    proximityFactor = 1.0f - saturate(proximityFactor - g_FadeoutStart) *
        g_FadeoutLength;

    coords = GetScrolledCoords(coords);

    float currentValue = g_texInput.Load(int3(dtID.xy, 0));
    float fetchedValue = g_texInflow.SampleLevel(
        BilinearSamplerClamp, coords, 0.0f);

    float densityExponent = pow(max(currentValue, FLT_MIN), g_DensityExponent);
    float dissipationFactor = exp(-g_fDissipationFactor * densityExponent);

    float result = lerp(fetchedValue, currentValue, dissipationFactor);
    result = fetchedValue + sign(result - fetchedValue) *
        min(abs(result - fetchedValue), proximityFactor);

    g_uavOutput[dtID.xy] = result;
}

Listing 1.2. Density inflow shader.

Since $e^{-t_1} e^{-t_2} = e^{-(t_1 + t_2)}$, multiplying by the exponential dissipation factor over one frame of 30 ms is completely equivalent to multiplying twice by the same exponential factor over two frames of 15 ms each, for instance. Last, proximityFactor is there to cross-fade the density on the border. Initially, we tried to handle it in the material shader, but that fails to work if one starts to walk backwards: when the simulated part on the border is scrolled to the center, a clear border between simulated and non-simulated areas becomes visible. What this code does, essentially, is 'shift' values towards the static map values.

1.3 Simulation E.g., a value on the simulation area boundary is always equal to the static map value. If fadeout starts from the very center of the simulation area (i.e., gFadeoutStart  0), then any value halfway (e.g., at  0, 0.25  or  0.75, 0  of the simulation grid) to the edge should be at least within 0.5 of the static map value. If it is not, it is ‘shifted’ towards it. That allows avoiding discontinuities between simulated and non-simulated areas with any pattern of character locomotion.

1.3.5 Obstacle Injection

Unlike Crane et al. [2008], we do not voxelize 3D meshes to get obstacle data, as this would have been a highly impractical use of resources for a video game. Instead, we reuse collision primitives that are already used for other purposes in the game engine. In Shadow of the Tomb Raider, character collision is defined by an array of capsules. We find the intersections of all capsules with the fluid plane. The game code allows us to query the NPCs closest to the main character. We store up to a maximum of 32 NPCs within a radius equal to the fluid object extents in a hash table, where the character's ID serves as the key. Characters that were not returned by the query in the previous frame get evicted from the hash. Thus, we are able to track the characters' collision capsules from the previous frame and evaluate velocities. We then inject obstacles and obstacle velocities just as in Harris [2004].
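The obstacle-injection shader itself is not reproduced in this excerpt, so the following is only a minimal sketch of the idea just described, not the shipped code. The Capsule2D layout, g_NumCapsules, and the output bindings are hypothetical names; each thread simply tests its cell center against every capsule's 2D core segment and, on a hit, marks the cell as an obstacle and stores the capsule's velocity, in the spirit of the obstacle handling cited above.

// Hypothetical sketch of 2D obstacle/velocity injection; not the shipped shader.
#define GROUP_SIZE 8                      // assumed thread-group size

struct Capsule2D
{
    float2 a;                             // segment start (simulation UV space)
    float2 b;                             // segment end   (simulation UV space)
    float  radius;                        // capsule radius in UV units
    float2 velocity;                      // capsule velocity (grid units/second)
};

StructuredBuffer<Capsule2D> g_Capsules            : register(t0);
RWTexture2D<float>          g_uavObstacle         : register(u0);
RWTexture2D<float2>         g_uavObstacleVelocity : register(u1);

uint   g_NumCapsules;                     // number of valid entries in g_Capsules
float2 g_InvFluidSimGridRes;              // 1 / simulation grid resolution

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void main(uint3 dtID : SV_DispatchThreadID)
{
    float2 p = (float2(dtID.xy) + 0.5f) * g_InvFluidSimGridRes;

    float  obstacle = 0.0f;
    float2 velocity = 0.0f;

    for (uint i = 0; i < g_NumCapsules; ++i)
    {
        Capsule2D c = g_Capsules[i];

        // Closest point on the capsule's core segment to the cell center.
        float2 ab = c.b - c.a;
        float  t  = saturate(dot(p - c.a, ab) / max(dot(ab, ab), 1e-6f));
        float2 closest = c.a + t * ab;

        if (distance(p, closest) < c.radius)
        {
            obstacle = 1.0f;
            velocity = c.velocity;
        }
    }

    g_uavObstacle[dtID.xy]         = obstacle;
    g_uavObstacleVelocity[dtID.xy] = velocity;
}

With many characters nearby, the flat loop over every capsule is exactly what Section 1.5.1 later replaces with a per-tile capsule list.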

1.3.6 Advection

Advection is the process of transfer of fluid properties by the velocity field of the fluid (including the velocity itself). Listing 1.3 shows the advection shader. Besides advection itself, the advection shader performs two other operations. First, it adds velocity inflow from the static map. Second, it adds viscosity by simply dampening the velocity map according to the viscosity values. Different viscosities are used for the two fluids, and the density map from the previous frame is used to blend between them. Despite not being physically correct, this approach works very well in practice.
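As a reminder, the advection step traces each cell backward along the velocity field and resamples the advected quantity $q$ there (this is the standard semi-Lagrangian scheme of Harris [2004] and Bridson and Müller-Fischer [2007], restated here rather than quoted from the chapter):

$$q(\mathbf{x},\, t + \Delta t) = q\bigl(\mathbf{x} - \Delta t\,\mathbf{u}(\mathbf{x}, t),\; t\bigr),$$

which corresponds to the bilinear fetch at pos = coords - g_DeltaTime * velocity in Listing 1.3, with the static inflow velocity added to the sampled velocity before the back-trace.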

1.3.7 Poisson Pressure Jacobi Solver

The Poisson pressure equation in the fluid simulation is solved using the Jacobi method, an iterative method for solving systems of linear equations. One typically needs on the order of 20 iterations to get plausible results, which makes this part of the algorithm a bottleneck. The pressure solver arithmetic is trivial, so the main offender is the memory reads and writes. While not much could be done about the memory writes, the reads could be optimized quite a bit. Listing 1.4 shows the optimized Jacobi solver.


RWTexture2D g_uavOutput   : register(u0);
Texture2D   g_texInput    : register(t0);
Texture2D   g_texVelocity : register(t1);
Texture2D   g_texObstacle : register(t2);
Texture2D   g_texInflow   : register(t3);
Texture2D   g_texDensity  : register(t4);

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void main(uint3 dtID : SV_DispatchThreadID)
{
#if EffectType == fxVelocity
    if (g_texObstacle.Load(int3(dtID.xy, 0)).x > OBSTACLE_THRESHOLD)
    {
        g_uavOutput[dtID.xy] = 0.0f;
    }
    else
#endif
    {
        float2 coords = (float2(dtID.xy) + 0.5f) * g_InvFluidSimGridRes;
        float2 inflowCoords = GetScrolledCoords(coords);

        float fadeout = 1.0f - saturate(pow(dot(2 * coords - 1, 2 * coords - 1),
            g_FadeoutPower));

        float2 pos = coords - g_DeltaTime *
            (g_texVelocity.SampleLevel(BilinearSampler, coords, 0.0f).xy +
             fadeout * g_Factor * (g_texInflow.SampleLevel(
                 BilinearSampler, inflowCoords, 0.0f).xy - 0.5f));

        g_uavOutput[dtID.xy] =
#if EffectType == fxVelocity
            lerp(g_ViscosityWater, g_ViscosityOil, saturate(g_texDensity.
                SampleLevel(BilinearSampler, coords, 0.0f).x)) *
#endif
            g_texInput.SampleLevel(BilinearSampler, pos, 0.0f);
    }
}

Listing 1.3. Advection shader. g_FadeoutPower is a fadeout exponent for the velocity inflow map, g_DeltaTime is the elapsed time from the previous frame, and g_ViscosityOil and g_ViscosityWater are oil and water viscosity coefficients, respectively.


First, we use local data storage (LDS) memory to prefetch from VRAM. Since most fetches (except the boundary ones) are shared between multiple threads, we effectively limit the bandwidth per thread. Second, we exploit Gather instructions to optimize the fetches. We found that this is highly beneficial even when not using LDS and using only two Gather instructions to fetch just four texels. In our case, when LDS is used, we can group together fetches corresponding to different threads and thus minimize the waste. Besides, we keep all transient textures in ESRAM to improve bandwidth.
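For reference, the solver iterates the Jacobi update of the discrete Poisson equation $\nabla^2 p = \nabla \cdot \mathbf{u}$ (this is the standard pressure projection from Harris [2004], restated here rather than quoted from the chapter). With the grid spacing folded into the divergence term, one iteration reads

$$p^{(k+1)}_{i,j} = \frac{p^{(k)}_{i-1,j} + p^{(k)}_{i+1,j} + p^{(k)}_{i,j-1} + p^{(k)}_{i,j+1} - d_{i,j}}{4},$$

where $d_{i,j}$ is the velocity divergence of the cell; this is precisely the 0.25f * (left + right + bottom + top - velocityDivergence) line at the end of Listing 1.4, with solid neighbors replaced by the center value.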

RWTexture2D g_uavOutput             : register(u0);
Texture2D   g_texVelocityDivergence : register(t0);
Texture2D   g_texPressure           : register(t1);
Texture2D   g_texObstacle           : register(t2);

groupshared float pressureLDS[GROUP_SIZE + 2][GROUP_SIZE + 2];
groupshared float obstacleLDS[GROUP_SIZE + 2][GROUP_SIZE + 2];

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void main(uint3 dtID : SV_DispatchThreadID, uint3 grtID : SV_GroupThreadID)
{
    // Load data to LDS
    if ((grtID.x < (GROUP_SIZE + 2) / 2) && (grtID.y < (GROUP_SIZE + 2) / 2))
    {
        float2 coordsNormalized = float2(dtID.xy) * g_InvFluidSimGridRes;
        coordsNormalized += float2(grtID.xy) * g_InvFluidSimGridRes;

        float4 pressureSample = g_texPressure.Gather(
            PointSampler, coordsNormalized);

        float topLeftPressure     = pressureSample.w;
        float bottomLeftPressure  = pressureSample.x;
        float topRightPressure    = pressureSample.z;
        float bottomRightPressure = pressureSample.y;

        pressureLDS[grtID.x*2][grtID.y*2]     = topLeftPressure;
        pressureLDS[grtID.x*2][grtID.y*2+1]   = bottomLeftPressure;
        pressureLDS[grtID.x*2+1][grtID.y*2]   = topRightPressure;
        pressureLDS[grtID.x*2+1][grtID.y*2+1] = bottomRightPressure;

        /* Do the same for obstacle */
    }

    GroupMemoryBarrierWithGroupSync();

    int3 coords = int3(dtID.xy, 0);
    int2 ldsID  = (int2)grtID.xy + int2(1, 1);

    float left   = pressureLDS[ldsID.x - 1][ldsID.y];
    float right  = pressureLDS[ldsID.x + 1][ldsID.y];
    float bottom = pressureLDS[ldsID.x][ldsID.y + 1];
    float top    = pressureLDS[ldsID.x][ldsID.y - 1];
    float center = pressureLDS[ldsID.x][ldsID.y];
    /* Do the same for obstacle */

    // If cell is solid, set pressure to center value instead
    if (obstacleLeft > OBSTACLE_THRESHOLD)
    {
        left = center;
    }
    /* Do the same for other sides */

    float velocityDivergence = g_texVelocityDivergence.Load(coords).x;
    g_uavOutput[dtID.xy] = 0.25f *
        (left + right + bottom + top - velocityDivergence);
}

Listing 1.4. Poisson pressure shader. Some parts were edited in order to fit on the page.

1.3.8 Algae Simulation

Initially, our simulation method was designed for incompressible fluids like oil or water. After experimenting with it, the VFX artist asked if they could reuse the same method for algae on the surface of a swamp. Figure 1.5 shows the final result of the algae simulation alongside the density maps.

The key problem with reusing the fluid simulation directly for algae is that algae is, in essence, made of 'macroparticles'. If you walk through algae, those particles move out of your way and effectively clump together or overlap with each other. Now, if we map the number of particles per unit area to density, we realize that algae behaves like a 'compressible fluid'. Practically speaking, if you walk through an area that is uniformly covered with algae simulated using the Navier–Stokes equations, you will not see any interaction.

Fortunately, simulations of similar substances have already been researched by academia [Zhu and Bridson 2005]. We used the idea of a mixed Eulerian–Lagrangian simulation, particle-in-cell (PIC). In a nutshell, every frame we convert density into particles, advect them using the velocity field, and then convert them back into density by accumulating the particles' densities via additive blending. Figure 1.6 shows how density is calculated for the algae simulation (all other steps stay the same). Let us discuss these stages (except the clear, which is pretty self-explanatory) in more detail below.


Figure 1.5. Algae simulation. In the inset, the static density inflow map is shown (white) with the simulated density area (red).

Figure 1.6. Algae fluid simulation data flow (density advection only).


Particle Density Accumulation. Listing 1.5 shows the shader that advects particles and accumulates the particle density. We allocate a one-channel 32-bit unsigned integer accumulation texture. Its resolution is the original simulation grid resolution times ALGAE_PARTICLE_GRID_MULTIPLIER. Effectively, we quantize density into ALGAE_DENSITY_QUANTIZATION quants. We add a 0.5 offset in order to avoid energy loss during the conversions. This quantization is needed because, in Shader Model 5.0, atomic operations (such as InterlockedAdd) work only with integer values.

Particle Density Resolution. Listing 1.6 shows the shader resolving particles back to density. The denominator contains ALGAE_PARTICLE_GRID_MULTIPLIER to the power of two because (before advection) each simulation grid cell contains ALGAE_PARTICLE_GRID_MULTIPLIER × ALGAE_PARTICLE_GRID_MULTIPLIER particles.
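Restating what Listings 1.5 and 1.6 compute (this is only a rewording of the code, not extra material from the chapter): with $Q$ = ALGAE_DENSITY_QUANTIZATION and $M$ = ALGAE_PARTICLE_GRID_MULTIPLIER, every simulation cell starts the frame with $M \times M$ particles, each of which deposits the integer $\lfloor d_i\,Q + 0.5 \rfloor$ into the cell it lands in after advection. The resolve step therefore divides the accumulated sum by $Q\,M^2$ to return to an average density:

$$d_{\text{out}} = \frac{1}{Q\,M^2} \sum_{i} \lfloor d_i\,Q + 0.5 \rfloor \;\approx\; \frac{1}{M^2} \sum_{i} d_i .$$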

1.4 Engine Integration

In order to be useful, fluid simulation needs to be properly integrated into the game engine. In the following sections, we discuss the way fluid simulation was made into a component and how it is used within the Shadow of the Tomb Raider material system.

RWTexture2D<uint> g_uavOutput   : register(u0);
Texture2D         g_texDensity  : register(t0);
Texture2D         g_texVelocity : register(t1);

#define ALGAE_DENSITY_QUANTIZATION     65536.0f
#define ALGAE_PARTICLE_GRID_MULTIPLIER 2

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void main(uint3 dtID : SV_DispatchThreadID)
{
    float2 coords = (float2(dtID.xy) + 0.5f) * g_InvFluidSimGridRes /
        ALGAE_PARTICLE_GRID_MULTIPLIER;

    float density = g_texDensity.Load(
        int3(dtID.xy / ALGAE_PARTICLE_GRID_MULTIPLIER, 0)).r;

    coords += g_DeltaTime * g_texVelocity.SampleLevel(
        BilinearSampler, coords, 0.0f).xy;

    uint2 uCoords = uint2(coords * g_FluidSimGridRes);

    InterlockedAdd(g_uavOutput[uCoords],
        int(density * ALGAE_DENSITY_QUANTIZATION + 0.5f));
}

Listing 1.5. Particle density accumulation shader.


RWTexture2D g_uavOutput : register(u0);
Texture2D   g_texInput  : register(t0);

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void main(uint3 dtID : SV_DispatchThreadID)
{
    g_uavOutput[dtID.xy] = (float4(g_texInput.Load(int3(dtID.xy, 0))) /
        (ALGAE_DENSITY_QUANTIZATION *
         float(ALGAE_PARTICLE_GRID_MULTIPLIER * ALGAE_PARTICLE_GRID_MULTIPLIER)));
}

Listing 1.6. Particle density resolution shader.

1.4.1 Fluid Component Architecture

In Shadow of the Tomb Raider (like in many other games), we utilize a component architecture. Essentially, that means that every entity (or Instance, in the engine's terms) in the game has an array of components of different types attached to it. For each component type, there is usually a respective Manager class instance that performs certain operations for all the components of the same type (e.g., drawing them all together). For the component types that must be rendered, there is usually an auxiliary Drawable type. The drawable is created on the main thread, and then its Draw() virtual function is called by the render thread. The drawables are allocated on the frame heap, so they are deleted when a frame has been rendered. Thus, we can safely pass information from the main thread to the render thread.

We decided to implement fluid simulation as a component within this architecture. Figure 1.7 shows how fluid simulation is integrated into the Shadow of the Tomb Raider engine architecture. When a fluid component is attached to an instance, it is created and added to this instance's array of components. Additionally, a reference to this fluid component is added to the array of fluid components within the fluid component manager. When the Process() function of the manager is called from the main thread, it loops through all of the visible instances that have fluid components attached and picks the one that is closest to the player's character. We use Equation (1.1) as the distance function, where (x, y) is the main character's relative position within the object with the fluid component, normalized to the object's extents, with (0, 0) being the center of the object.

$$d = \max\bigl(\lvert x \rvert,\ \lvert y \rvert\bigr) \tag{1.1}$$

After that, a fluid simulation drawable is created on the main thread and appended to the drawables list. This drawable encapsulates all fluid simulation parameters (e.g., inflow texture maps, viscosity, dissipation, fadeout factors, and such) needed to perform a fluid simulation step. Finally, when the render thread flushes the drawables list, the Draw() function of the fluid simulation drawable is called. This is the place where the simulation happens: compute shaders for the respective algorithm stages are dispatched.

Figure 1.7. Fluid component architecture. Blue boxes stand for classes, green boxes stand for threads. Rhombus-ending lines stand for aggregation, and arrowhead-ending lines stand for function calls. The main thread creates a fluid manager, components, and a drawable, while the rendering thread calls the Draw() function of the fluid drawable.

1.4.2 Integration with Material System

Having performed the simulation, we need to actually use it for a particular object. This is where the Shadow of the Tomb Raider material system comes into play. The Shadow of the Tomb Raider material system is based on shader nodes. Every material is represented by a graph whose vertices are shader nodes. Each shader node, in its turn, is an entity that has one or more outputs and, usually, one or more inputs (with constant nodes being a notable exception to the latter). Figure 1.8 shows the fluid simulation node within the material graph.

The inputs (left) are UVOffset, which defines an additional artist-defined UV shift, and StaticMask—a fetch from the inflow texture (this has to be exactly the same texture as the one used for the simulation). The outputs (right) are Color—a 4-component floating-point vector containing the density value, velocity field, and pressure packed into its components—and PressureGradient—a precomputed gradient of the scalar pressure value.

Listing 1.7 shows the implementation of the fluid simulation shader node. g_PlayerPos and g_InvSimulationSize are global constants set by the fluid simulation drawable. We use a simple binary fadeout, as there is already a fadeout in the simulation, but a more complex fadeout function could be used. FluidSimulationTexture and PressureGradientTexture are global textures where the fluid simulation outputs its results, and they are persistent between frames.

Figure 1.8. Fluid simulation node.

float2 texCoord = g_PlayerPos.xy - v_WorldPosition.xy;
texCoord *= g_InvSimulationSize;
texCoord += UVOffset;

float fadeout = all(abs(texCoord) < FADEOUT_THRESHOLD) ? 1 : 0;

texCoord += 0.5f;

Color = FluidSimulationTexture.SampleLevel(
    SamplerGenericAnisoBorder, float3(texCoord, 0), 0);
Color = lerp(float4(StaticMask.r, 0, 0, 0), Color, fadeout);

PressureGradient = PressureGradientTexture.SampleLevel(
    SamplerGenericAnisoBorder, float3(texCoord, 0), 0);

Listing 1.7. Fluid mask shader node. g_InvSimulationSize is the inverse of the size of the simulation area (in world units). g_PlayerPos is the player's character position, multiplied by the size of the simulation area and divided by the grid resolution.

1.4.3 Oil Shading

Shadow of the Tomb Raider features realistic water rendering with reflection, refraction, and light absorption effects. We use a simple linear interpolation between the water material and the oil material, controlled by the mask output by the fluid simulation. To add more visual interest, we add a subtle iridescent effect on top of the oil using a look-up table (see Figure 1.9) controlled by an oil thickness parameter and a Fresnel factor. See Listing 1.8 for more details.

Figure 1.9. 128×128 iridescence look-up table.

float t = Thickness;
float a = abs(dot(CameraVector, Normal));

float3 lutSample = IridescenceLUT_texture.SampleLevel(
    SamplerGenericBilinearClamp, float2(t, a), 0).rgb - 0.5;

float intensity = Intensity * 4 * (FresnelIn * (1 - FresnelIn));
FresnelOut = saturate(lutSample * intensity + FresnelIn);

Listing 1.8. Iridescence shader.

1.4.4 Algae Shading

Algae shading works in a similar way, but we put an emphasis on breaking up the smooth edges produced by the fluid simulation mask with a high-frequency mask (see Figure 1.10) that represents the micro structure of algae found in swamps. Listing 1.9 shows more details about our shader implementation.

1.5 Optimization

We have already shown how to optimize one of the bottlenecks: solving the Poisson pressure equation. However, there are a few more issues that need to be addressed.

1.5.1 Managing Many Obstacles

Originally, we were planning to handle only the main character's interaction with the fluid. Therefore, when rendering obstacles and obstacle velocities, it was possible to simply loop over all collision capsules in every simulation grid cell. Later on, we decided to add support for multiple characters that are close to the player. In this case, there could easily be too many capsules to evaluate per grid cell, so a better solution was needed.

Figure 1.10. Algae micro detail mask.

float algaeMicro;        // Algae micro mask texture input
float fluidsim;          // Fluid simulation mask input
float algaeMaskSpread;   // Parameter for the spread of algae
float algaeMaskDarkness; // Parameter for algae mask color
float algaeTransSharp;   // Parameter for transition sharpness

float mask = (algaeMicro * fluidsim * algaeMaskSpread) +
             (algaeMicro * (1 - fluidsim) * algaeMaskDarkness) + fluidsim;
float algaeMaskTransition = sat(pow(mask, algaeTransSharp));

Listing 1.9. Algae transition mask shader.

We utilize an approach similar to 2D clustered lighting. We subdivide the simulation grid into 32 × 32 tiles in each dimension. Then, we calculate on the CPU which capsules intersect which tiles and store this information in a StructuredBuffer. We use a StructuredBuffer over a ConstantBuffer because of storage limitations. Finally, instead of querying all the capsules in each grid cell, we only query the capsules from the respective tile. In practice, this was good enough to handle a realistic number of NPCs around the main character.
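The tile data layout is not listed in the chapter, so the following is a small hypothetical sketch of how such per-tile capsule lists could be consumed on the GPU; the structure, MAX_CAPS_PER_TILE, the tile granularity, and all names are assumptions, not the shipped format. The CPU fills one fixed-size entry per tile, and each simulation cell only walks the list of the tile it falls into:

// Hypothetical per-tile capsule lists; not the shipped data layout.
#define TILE_SIZE         32              // assumed: tiles of 32 x 32 simulation cells
#define MAX_CAPS_PER_TILE 8               // assumed fixed list size

struct TileCapsuleList
{
    uint count;                           // number of capsules touching this tile
    uint capsuleIndex[MAX_CAPS_PER_TILE]; // indices into the capsule buffer
};

StructuredBuffer<TileCapsuleList> g_TileLists : register(t3);
uint g_NumTilesX;                         // tiles per row, i.e., grid width / TILE_SIZE

// Inside the injection shader, the flat loop over every capsule becomes:
//
//     uint2 tile = dtID.xy / TILE_SIZE;
//     TileCapsuleList list = g_TileLists[tile.y * g_NumTilesX + tile.x];
//     for (uint i = 0; i < list.count; ++i)
//     {
//         // test g_Capsules[list.capsuleIndex[i]] against the cell as before
//     }

This keeps the per-cell cost bounded no matter how many NPCs are active around the player.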

1.5.2 Async Compute

Being independent of any graphics stage (e.g., the depth pre-pass or the lighting pass), fluid simulation became a natural candidate to be moved to async compute. We can dispatch it very early in the frame, and the results are expected relatively late in the frame (as water objects are usually rendered after all opaque objects have been rendered and lit). Therefore, we dispatch the fluid simulation on the low-priority compute pipe early in the frame. In order to ensure the simulation is done by the time it is needed, we insert a fence at the end of the simulation. Also, a wait on this fence is inserted before the pass where the water is rendered.

1.6 Future Work

In the future, we would ultimately like to see more fluid simulation utilized in games. We have demonstrated that 2D fluid simulation can be used effectively on current-generation consoles. However, 3D fluid simulation could bring even more interesting use cases. Effects such as fire or smoke can be simulated very realistically, including the interaction between the fluid and solid objects. Hierarchical grid approaches (like in Wroński [2014]) could be interesting to explore. This would potentially allow both running a fluid simulation over a bigger area and injecting the fluid simulation seamlessly into the volumetric fog 3D texture that many game rendering engines use, saving on raycasting when rendering fluid effects.

Acknowledgments

We would like to thank several colleagues who made this feature possible. Maximilien Faubert came up with the original idea to utilize fluid simulation for fluid on water and did the initial prototype. Vincent Duboisdendien and Jonathan Bard did countless code reviews and provided helpful comments on how to improve and accelerate the method. Finally, we would like to thank the entire Shadow of the Tomb Raider team, Eidos Montréal, and the Crystal Dynamics studio for providing us with the opportunity to work together on this game and to make this publication happen.

Bibliography

BRIDSON, R. AND MÜLLER-FISCHER, M. 2007. Fluid Simulation. In ACM SIGGRAPH 2007 Courses. URL: https://www.cs.ubc.ca/rbridson/fluidsimulation/fluids_notes.pdf.

BRIDSON, R., MÜLLER-FISCHER, M., AND GUENDELMAN, E. 2006. Fluid Simulation. In ACM SIGGRAPH 2006 Courses. URL: https://www.cs.ubc.ca/rbridson/fluidsimulation/2006/fluids_notes.pdf.

CRANE, K., LLAMAS, I., AND TARIQ, S. 2008. Real-Time Simulation and Rendering of 3D Fluids. In GPU Gems 3, pp. 633–675. URL: https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch30.html.

GRINSPUN, E. 2018. Animation and CGI Motion. edX. URL: https://www.edx.org/course/animation-cgi-motion-columbiax-csmm-104x.

HARRIS, M. 2004. Fast Fluid Dynamics Simulation on the GPU. In GPU Gems, pp. 637–665. URL: http://developer.download.nvidia.com/books/HTML/gpugems/gpugems_ch38.html.

VLIETINCK, J. 2009. Fluid simulation (DX11/DirectCompute). URL: http://users.skynet.be/fquake/.

WROŃSKI, B. 2014. Volumetric fog: Unified, compute shader based solution to atmospheric scattering. In ACM SIGGRAPH, Advances in the Real-Time Rendering in 3D Graphics and Games.

ZHU, Y. AND BRIDSON, R. 2005. Animating Sand as a Fluid. In ACM SIGGRAPH Papers, pp. 965–972.

2 II

Real-time Snow Deformation in Horizon Zero Dawn: The Frozen Wilds Kevin Örtegren 2.1 Introduction Having dynamic characters and objects interact with the environment makes the scene more immersive and alive. Typically games will have some form of foliage interaction with the character, where leaves and grass bend out of the way when the character moves through. Another example is ground projected decals when walking on snow. Both of these commonly used techniques lack persistence and scalability; the foliage will not be permanently deformed or crushed and the footsteps in snow usually have an upper limit to the number of decals active at the same time. Horizon Zero Dawn: The Frozen Wilds is the expansion to Horizon Zero Dawn1 and it takes place in a new snowy mountain region. Snow covers most of the landscape and we thus needed believable snow which solved some of the shortcomings of existing techniques and worked under the constraints and requirements for this particular project. Figure 2.1 gives an overview of the results we achieved. The requirements were:  We needed real-time snow deformation for any dynamic object.  It had to work in a massive open world and on top of certain static objects, like rocks and roof tops. Horizon Zero Dawn™ © 2017–2018 Sony Interactive Entertainment Europe. Developed by Guerrilla. Horizon Zero Dawn is a trademark of Sony Interactive Entertainment Europe. All rights reserved. 1

113




Figure 2.1. Showing the final applied result versus what the deformation is in the system. (a) Example of snow trails caused by the deformation system. (b) Debug overlay mesh showing the actual system deformation, colored by its normal map.

• It had to run very fast on the GPU, since the GPU frame was already laid out and optimized for the base game.
• No major asset refactor could be done, to avoid adding to the expansion download size and because we could not spare artists to go through all the content manually and fine tune for a new system.

2.2 Related Work
A few AAA games have implemented real-time snow deformation with various approaches prior to this. Some notable ones with presentations and articles on the subject include Assassin's Creed III [St-Amour 2013], Batman: Arkham Origins, and Rise of the Tomb Raider. Similar to Batman: Arkham Origins, our approach renders dynamic objects orthographically from below to determine the deformation [Barré-Brisebois 2014]. Rise of the Tomb Raider implements trail elevation [Michels and Sikachev 2016], where the snow can be elevated above the base snow height. This is something our approach does not support in the height data; it is instead faked in the surface shader of the snow using different diffuse textures and normal maps.

2.3 Implementation
This section will go through the different steps of the snow deformation algorithm. An overview block diagram depicting the algorithm is shown in Figure 2.2.


Figure 2.2. Overview of the algorithm and the needed buffers.

The algorithm consists of two passes:
1. Write orthographic height. Render all dynamic objects (characters, robots, debris from explosions, etc.) from below into a depth buffer, using an orthographic camera. (Vertex/Pixel Shader)
2. Deform & temporal filter. Take the rendered height as input and determine for each pixel if deformation has to be applied, while simultaneously performing temporal filtering on the persistent deformation data. The output of this shader is both an updated version of the persistent height data, written to one of the ping-pong buffers to be read back next frame, as well as the packed calculated normal and height data written to the result buffer. (Compute shader)
After finishing the compute shader work, the deformation data may be read from the result buffer by any shader. The needed buffers for the algorithm are (all are 1024×1024 in our case):
1. 2× persistent UNORM8 buffers containing relative deformation height, used as ping-pong buffers across GPU frames for temporal filtering.




Figure 2.3. Side view visualization of the deformation height above the different height maps. This also illustrates the use of an orthographic frustum to capture dynamic object depths locally around the player, where the deformation occurs. (Illustration not to scale.)

2. 1× 16-bit depth buffer containing linear depth for the orthographic camera, only used within the same GPU frame (can thus be aliased with other render targets).
3. 1× result buffer (UNORM10.10.10.2), which stores the packed height and normal for other shaders to read.
The relative deformation height stored in the persistent buffers and in the result buffer represents a height relative to the baked terrain/objects/water height. See Figure 2.3 for a visualization of this.
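To make the setup concrete, the following is a minimal sketch of how these buffers might be declared in HLSL. The names, and the choice of typed UAVs rather than render targets, are illustrative assumptions and not the shipped implementation.

// Hypothetical buffer declarations matching the formats listed above.
Texture2D<float>    PrevDeformation;    // persistent UNORM8 ping-pong buffer (read, previous frame)
RWTexture2D<float>  NextDeformation;    // persistent UNORM8 ping-pong buffer (write, next frame)
Texture2D<float>    ObjectDepth;        // 16-bit linear depth from the orthographic pass
RWTexture2D<float4> DeformationResult;  // UNORM10.10.10.2 result: packed normal + relative height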

2.3.1 Dynamic object representation
Most objects have a fairly high vertex count when viewed up close, so rendering them into the depth buffer would be costly. To solve this, we used lower LOD versions of objects. The LOD is automatically selected as if the object were 30 meters out from the camera. See Figure 2.4 for examples of meshes.

2.3.2 Write orthographic height
The first pass will do a scene query of dynamic objects within an orthographic frustum from below the minimum terrain height, centered on the player character. The output


Figure 2.4. Examples of low LOD skinned meshes used for rendering the dynamic object heights: (a) Watcher, (b) Character, (c) Boar.

of this pass is a simple 16-bit depth buffer containing linear depth and it serves as the input to the next pass.
To ensure that small objects have enough depth samples rendered, we render the depth using nonuniform xy axes in NDC. The nonuniformity is achieved by the equation P′ = P_xy^ndc · |P_xy^ndc|^x, where P_xy^ndc is the original post-projected xy coordinate in [−1, 1], and x is the power of the distribution, limited between 0.8 (extreme distortion) and 0.0 (linear). While the depth is linear, the normalized device coordinate (NDC) xy axes are not. This acts as a dynamic level-of-detail, where objects close to the u and v center axes in texture space get more depth samples. To read back the depth texture, the same function must be applied to the sampling coordinate; see Listing 2.1 for a helper function in HLSL. Shown in Figure 2.5 is the comparison between uniform distribution and the distribution we used in The Frozen Wilds. Notice how in Figure 2.5(b) our main character Aloy (in the center) is using many more samples in the depth texture than in Figure 2.5(a). Aloy's feet would normally only be a few samples in the depth render, which would not be enough for detailed deformation.

float2 TransformToNonUniformUV(float2 uv, float x)
{
    float2 ndc_coord = uv * 2.0 - float2(1.0, 1.0);
    ndc_coord *= pow(abs(ndc_coord), x);
    return ndc_coord * 0.5 + float2(0.5, 0.5);
}

Listing 2.1. Shader function to convert from uniform normalized UV coordinates to non-uniform normalized UV coordinates.



Figure 2.5. Nonuniform distribution example: (a) 0.0 exponent, (b) 0.3 exponent.

2.3.3 Read object height, deform, and filter
This pass is run using a compute shader which does most of the heavy lifting of this technique. First of all, each compute thread in a thread group will read back the value from the previous frame and store it in local data store (LDS); a one-texel skirt will also be read into LDS to allow for efficient neighborhood sampling. Temporal filtering is performed by using a min-average 3×3 filter on the results in LDS. The filter is time dependent and will converge towards a specified slope gradient value. A comparison between a filtered and unfiltered deformation can be seen in Figure 2.6.
Once the filtered result has been calculated we can sample the depth of the dynamic objects from the depth buffer. We can early out if the depth buffer contains a far

Figure 2.6. Comparison between a filtered (a) and unfiltered (b) deformation, viewed from the side. The deformation was made by a character standing still in 1 meter snow.

plane value (1.0), which means no object was written to that location. If the read depth value is above the baked terrain/object/water height and below the current snow height, it is applied as the new snow height. See Figure 2.3 for a visualization of this in action.
The normal is calculated using finite differences between samples in the 3×3 filter. The new height value and normal are packed and written to the UNORM10.10.10.2 output buffer. We pack the world space normal using Lambert Azimuthal Equal-Area Projection, since we know that the Z component will always be positive (pointing up in world space). The height stored represents the relative snow deformation, 0 meaning untouched snow and 1 meaning fully deformed down to the height data below it (terrain, rocks, water, etc.). The new result is also written to the next persistent buffer for input to the next frame, but before that is done we subtract the snow refill, which is based on the current precipitation rate of the game.
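As an illustration of the packing step, the sketch below encodes a hemispherical world-space normal with one common Lambert azimuthal equal-area formulation and stores it alongside the relative height in a float4 destined for a UNORM10.10.10.2 target. The channel assignment and function names are assumptions for illustration; the shipped encoding may differ in its details.

// Hypothetical packing/unpacking of the result buffer. Assumes n is a unit world-space
// normal with n.z >= 0 and height01 is the relative deformation height in [0, 1].
float4 PackDeformation(float3 n, float height01)
{
    // Lambert azimuthal equal-area encode of the upper-hemisphere normal into [0, 1]^2.
    float2 enc = n.xy / sqrt(8.0 * n.z + 8.0) + 0.5;
    return float4(enc, height01, 0.0);   // x, y: normal; z: height; w: unused
}

float3 UnpackDeformationNormal(float4 packedData)
{
    float2 fenc = packedData.xy * 4.0 - 2.0;
    float f = dot(fenc, fenc);
    return float3(fenc * sqrt(1.0 - f / 4.0), 1.0 - f / 2.0);
}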

2.3.4 Sliding window
Since we needed the technique to work in an open world, we had to move the localized system with the player and fade out the results at the edges. By trial and error we found that a 64×64 m region around the player was enough to give a convincing persistence to the deformation. The sliding is done by reading samples from the previous buffer using an offset when the system has moved one or more texels. Figure 2.7 shows this in action.
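A minimal sketch of the offset read is shown below; the window shift in texels, the buffer name, and treating out-of-window texels as undeformed snow are assumptions made for illustration.

// Hypothetical helper for the sliding-window read in the deformation compute shader.
Texture2D<float> PrevDeformation;   // previous frame's persistent buffer (assumed name)

float ReadPreviousDeformation(int2 texel, int2 windowShiftTexels, int2 bufferSize)
{
    int2 prev = texel + windowShiftTexels;             // where this world position was last frame
    bool inside = all(prev >= 0) && all(prev < bufferSize);
    return inside ? PrevDeformation.Load(int3(prev, 0)) : 0.0;  // 0 = untouched snow
}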

2.3.5 Applying results The result buffer is read by the terrain shader and deforms the terrain height and blends in the normals. We created a shader graph node for sampling the deformation height

Figure 2.7. Illustration of the deformation system during a move. The current texture reads back previous frame data using an offset, depending on how much the system moved. Textures in this example are 2×2 texels.




Figure 2.8. Example of snow lumps and piles, which respond to the deformation, placed by our procedural placement system.

and normal in any artist-created vertex or pixel shader. This led to more than just the terrain using the deformation. Snow lumps, as in Figure 2.8, and other snowy assets sampled the deformation and responded to dynamic object interaction. This is achieved by having the actual deformation height of the system differ from the rendered snow height. In our case, the snow height of the deformation system is at 1 meter, but the visual snow layer is only about 30 cm deep. This allows shaders to know about deformation which occurs above the snow height and apply it as deformation to snow lumps sticking out of the snow layer.
One interesting thing which spawned from this system was interactive thin ice/snow slush on the surface of lakes, as shown in Figure 2.9. Using the shader node to sample the deformation data, our artists could make this happen in the water surface shader. Since objects below the terrain/object/water height do not contribute to the deformation, it was even possible to stealth swim below the surface without disturbing it.

2.4 Results
With a 64×64 m region around the player, we achieve the desired quality using 1024×1024 buffers, which adds up to 8 MB of VRAM. This gives us a persistent deformation buffer with 6.25 cm resolution, which roughly matches our innermost terrain LOD mesh resolution. An example of the result buffer can be seen in Figure 2.10.


Figure 2.9. Aloy swimming and interacting with the layer of snow slush on the surface of the cold water.

Figure 2.10. An example of the result buffer after a battle with a few Watcher robots.



The GPU cost of this technique is broken down into three parts: clearing the depth, rendering the dynamic objects, and the deformation compute shader. The cost of rendering the dynamic objects depends on their number and complexity, but in general it takes 1 µs per object. The timings for the other two passes can be found in Table 2.1. The minimal cost of using the result of the deformation shader is one 32-bit texture sample and the decoding of the height and normal.

                         Clear depth    Deform compute shader
PS4                      21 µs          210 µs
PS4 Pro                  21 µs          105 µs

Table 2.1. GPU timings for the fixed cost shaders.

2.5 Conclusion and Discussion
All the initial requirements were met, but in doing so a few shortcuts had to be taken. For instance, since we couldn't do a major asset refactor, the input meshes to the system were all low-resolution dynamic shadow-casting LODs, many of which were auto-generated. These meshes worked reasonably well, but having either custom meshes, shapes, or points as input would be preferable. To create deformations from invisible objects (like melting snow with the flamethrower) artists had to create a dynamic shadow-casting mesh with a huge depth bias (so that it would not cast an actual shadow) and slowly lower it down into the ground to simulate melting snow. That was an example of a creative solution to the fact that we were limited to using shadow-casting meshes.
Using a local system around the player with a sliding window allowed us to seemingly run the system anywhere in the world, satisfying our goal for it to work in a massive open world. Since the system runs on top of baked static objects like houses or rocks, there would be overhangs without any snow deformation below them. These cases were handled by artists not allowing for deep snow under overhangs.
Performance was acceptable and fit well into our budget thanks to relying on temporal filtering, with the downside of slightly more memory usage and a frame of delay in the interaction. Since the system outputs the results in world space and a shader graph node was created for easy access to the data by artists, other use cases spawned from this, like the interactive snow slush on lakes and general snow lump assets being deformed by the system.
Going forward, it would be interesting to explore the introduction of trail elevation, similar to what was done in Rise of the Tomb Raider, and other physical behaviors of snow. Having only one layer of deformable snow proved to be a limitation which would be worth trying to solve, maybe by having multiple deformation layers which could

interact with each other. Snow from an upper layer (e.g., on a roof top) could fall down and get added to a snow deformation layer below.
Even though the size of the deformation area gave a plausible result, it would be interesting to explore a larger-scale, more persistent solution using a combination of "pre-deformed" data, either coming from streamed-in data or from our procedural placement system, and streaming out real-time deformation results to secondary storage to later stream them back in.
One of the more interesting aspects going forward is repurposing this technique to allow for interaction with other environmental assets, like vegetation, sand, mud, and water. Rendering depth from above could allow a dynamic precipitation occlusion system to be spawned from this.

Bibliography
BARRÉ-BRISEBOIS, C. 2014. Deformable Snow Rendering in Batman: Arkham Origins. In Game Developers Conference 2014. URL: https://www.gdcvault.com/play/1020379/Deformable-Snow-Rendering-in-Batman.
MICHELS, A. AND SIKACHEV, P. 2016. Deferred Snow Deformation in Rise of the Tomb Raider. In GPU Pro 7, pp. 3–16. CRC Press.
ST-AMOUR, J. 2013. Rendering Assassin's Creed III. In Game Developers Conference 2013. URL: https://www.gdcvault.com/play/1017710/Rendering-Assassin-s-Creed.


III

Shadows

Shadows are the dark companions of lights, and although both can exist on their own, they shouldn't exist without each other in games. Achieving good visual results in rendering shadows is considered one of the particularly difficult tasks of graphics programmers.
The first article, "Soft Shadow Approximation for Dappled Light Sources" by Mariano Merchante, proposes a way to mimic an effect called dappled light. This type of effect occurs when, for example, light shines through leaves that are very close together: a small patch of light can travel through them, projecting the sun's shape onto the shadow receiver.
The second article, "Parallax-Corrected Cached Shadow Maps" by Pavlo Turchyn, is the successor to another great shadow map article in GPU Pro 2 by the same author. This article describes a parallax correction algorithm for rendering sweeping shadows from a dynamic light source using a static shadow map. The resulting implementation uses Cascaded Shadow Maps up to a distance of 30 meters from the camera, and after that Adaptive Shadow Maps covering the next 500 meters and updated every 2500 frames. The use of the parallax correction algorithm enables a fairly seamless transition between dynamic shadows rendered with these two methods.
—Mauricio Vives


1 III

Soft Shadow Approximation for Dappled Light Sources Mariano Merchante

1.1 Introduction
Common shadow rendering techniques rely on solving the visibility problem through buffers that contain information related to the distance to the light source. These buffers are commonly referred to as shadow maps [Williams 1978], and although the results can be filtered with a wide variety of algorithms to get smooth penumbra effects, filtering is expensive and does not consider either the size or shape of the light source. Well known examples of filtering approaches include percentage-closer filtering [Reeves et al. 1987], percentage-closer soft shadows [Fernando 2005], variance shadow maps [Donnelly and Lauritzen 2006] and moment shadow maps [Peters and Klein 2015]. A more recent approach uses raytracing and denoising to estimate the penumbra generated by a polygonal area light [Heitz et al. 2018], but requires a complex rendering pipeline setup that may not be available to most real-time engines.
Given the complexity of analytically approximating the penumbra generated by area lights, these real-time techniques usually ignore the pinhole effect that certain high frequency objects can generate when lit by such lights. This article proposes an approximation of this effect, which can be seen working in Figure 1.1. In photography, this phenomenon is called dappled light, and is very characteristic of tree shadows: when leaves are very close together, a small patch of light can travel through them while essentially projecting the sun's shape onto the shadow receiver. Moreover, it is particularly evident in the case of crescent shadows, as shown in Figure 1.2, while a solar eclipse is occurring. Minnaert [1937] offers a comprehensive introduction to the subject matter.
This phenomenon does not only happen at perfectly infinitesimal holes in depth space, as it is an artifact of complex visibility functions. However, the effect can be approximated by just identifying these points and using subtractive masking on top of any other conventional technique. This article will describe a simple and practical



Figure 1.1. A sample scene using the proposed technique, inspired by Tufte’s [1997] sculptures.

Figure 1.2. Examples of dappled light in nature. Left: dappled light on a road. Right: Crescent shadows during a solar eclipse. (Source: Wikipedia)

algorithm that uses an arbitrary shape texture to represent the projected light shape. Moreover, the majority of the examples presented here will be related to tree shadows, given that they are one of the most prominent cases of this phenomenon while also being very common in games and real-time engines. Other situations where this effect occurs are metal grates, woven patterns, and certain furniture.

1.1.1 Algorithm Overview
The technique can be subdivided into the following steps, which we will discuss in the following sections.
1. Store the scene's shadow map.
2. For each pixel on the shadow map:
   a. Identify pinholes.
   b. Store pinholes in a uniform grid compute buffer.
3. For each grid cell in the compute buffer:
   a. Accumulate newly identified pinholes and merge with old.
   b. Flatten linked lists into grid cell arrays.
4. When rendering each shadow-receiving point:
   a. Iterate over the uniform grid.
   b. Iterate over every pinhole within a radius and accumulate contribution.
   c. Estimate a shadow factor computed by any common shadow filtering technique.
   d. Combine the accumulated contribution with the estimated shadow factor through subtraction.

1.2 Detecting Pinholes
We define pinholes as points in a shadow map that have a substantial depth discontinuity with respect to their neighbors, i.e., they are outliers within a predefined range in UV space. An additional constraint is that these points have to be further away from the neighbor cluster, or else they would become blockers. Specifically, we can identify pinholes by calculating the depth mean and variance in the shadow map, and storing points that differ wildly from the neighbor mean, with a threshold proportional to its variance. A naive implementation can be seen in Listing 1.1, and an example shadow map with detected pinholes can be found in Figure 1.3. Note that it is important to store both the raw distance to the light and the average depth of the neighbor pixels, as it will prove useful in the next sections.
A more robust approach for pinhole detection includes differentiating samples that fall within an expected radius and computing statistical values to estimate the probability that the points inside the radius represent a pinhole of that size, as shown in Figure 1.4. A set of thresholds is used to bound the variances, and by making these constraints stronger we can prevent finding false positives, such as points on the edge of the projected shape. More specifically, a pinhole is identified if the following applies:
• The variance on the outer set, B, is smaller than a specified threshold.
• A percentage of the points in the inner set, A, have greater depth than the mean of set B plus its standard deviation multiplied by some constant.
A smoother version of this idea can be implemented with a sample weight defined by distance; this helps by giving more importance to the center samples, thus reducing the total number of detected pinholes.




bool FindPinhole(float2 uv)
{
    float centerDepth = ShadowMap.SampleLevel(sampler, uv, lod).r;

    // Calculate average depth and std deviation of neighbors
    float averageDepth = ...
    float std = ...

    // Now calculate them again, including the center sample
    float stdCenter = ...
    float avgCenter = ...

    if (averageDepth > 0.0)
    {
        std /= averageDepth;
        stdCenter /= avgCenter;
    }

    return stdCenter > threshold && std < threshold;
}

Listing 1.1. A naive implementation that selects pinholes when neighbor pixels are very similar and the center pixel is an outlier with a certain threshold.

Figure 1.3. Left: The core concept of this technique is finding pinholes a) and b) and their estimated distances to the light, and then projecting textures based on the distance to the receiving surface. Right top: The original shadow map. Right bottom: The identified pinholes that will project the light shape, shown here as crosshairs.


Figure 1.4. Left: The neighborhood segmentation based on radius. Right: An example of a possible set of depth samples at a cross section of this neighborhood. Ideally, we would desire the cavity to be close in radius to our expected radius r.

The outer neighborhood B has to have low variance so that we can then approximate the estimated pinhole depth by using the average depth of the neighborhood. Cases where there’s too much noise on the outer neighborhood must be ignored, as there’s no clear way to approximate the combination of light spills happening.
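To make the radius-based test of Figure 1.4 more concrete, the following is a rough sketch written as two passes over a fixed 7×7 window. The window size, the thresholds, and the resource names are assumptions, and a practical version would cache the samples in LDS rather than sampling the shadow map twice.

Texture2D ShadowMap;
SamplerState PointClamp;

// Illustrative tuning constants, not taken from the article.
static const float2 TexelSize = float2(1.0 / 2048.0, 1.0 / 2048.0);
static const float OuterVarianceThreshold = 1e-4;
static const float OcclusionSigma = 2.0;
static const float InnerFraction = 0.6;

bool FindPinholeRadius(float2 uv, float r)   // r: expected pinhole radius, in texels
{
    // Pass 1: mean and variance of the outer ring B.
    float meanB = 0.0, m2B = 0.0, nB = 0.0;
    for (int y = -3; y <= 3; ++y)
    for (int x = -3; x <= 3; ++x)
    {
        if (length(float2(x, y)) <= r) continue;   // skip the inner set A
        float d = ShadowMap.SampleLevel(PointClamp, uv + float2(x, y) * TexelSize, 0).r;
        meanB += d; m2B += d * d; nB += 1.0;
    }
    if (nB == 0.0) return false;
    meanB /= nB;
    float varB = max(m2B / nB - meanB * meanB, 0.0);
    if (varB > OuterVarianceThreshold)
        return false;   // surroundings are too noisy to estimate an occluder depth

    // Pass 2: fraction of the inner set A that lies clearly farther from the light than B.
    float limit = meanB + OcclusionSigma * sqrt(varB);
    float nA = 0.0, deepA = 0.0;
    for (int j = -3; j <= 3; ++j)
    for (int i = -3; i <= 3; ++i)
    {
        if (length(float2(i, j)) > r) continue;    // skip the outer ring B
        float d = ShadowMap.SampleLevel(PointClamp, uv + float2(i, j) * TexelSize, 0).r;
        nA += 1.0; deepA += (d > limit) ? 1.0 : 0.0;
    }
    return deepA > InnerFraction * nA;
}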

1.2.1 Scatter Versus Gather
In a similar approach to percentage-closer soft shadows (PCSS), we separate the technique into a searching step and a rendering step. However, in contrast to PCSS, it is possible to search these points just once on the shadow map, and not per pixel being shaded. Moreover, the shadow map can also be downsampled to accelerate the search, although it might affect search accuracy. This search can also be masked in a way that only updates sections of the shadow map that occlude visible surfaces in the screen, as we will discuss in Section 1.4.3.
The projected light shape size is proportional to the distance from each shadow receiver point to each pinhole. Thus, it is computationally complex to search for pinholes for each shaded point (i.e., a gather operation), as the projected size can be both spatially incoherent and very big. It is desirable to first use a scatter approach, in which we precompute all the pinholes and store them into a uniform grid, and then iterate them efficiently as needed in the receiving shaders. An analogy can be made with depth of field techniques that splat sprites representing the aperture [Pettineo and de Rousiers 2012], as they solve a similar scatter vs. gather problem.
To achieve this there are multiple options, but a straightforward implementation uses a compute shader to search pinholes and store them into concurrent linked lists



on a compute buffer. This approach is based on Yang et al. [2010], which is generally used for order-independent transparency. It also exploits the fact that, in general, pinholes are sparsely distributed, so these lists can be reasonably small for later inspection.
We can build a grid that uniformly subdivides the shadow map containing these lists of samples, as shown in Listing 1.2. However, the actual points are distributed throughout a global buffer in a very noncoherent way. To improve coherency, an intermediate compute shader flattens these lists into flat arrays on a separate compute buffer, which the receiving shadow shader uses. Additional data can be stored, such as the "intensity" of this pinhole, which we'll discuss in Section 1.4.1. Finally, the actual count of samples is stored in a separate buffer so that iterating through these is easier. Figure 1.5 shows how the buffers are organized.

uniform RWStructuredBuffer<PinholeLink> g_PinholeLinkBuffer;
uniform RWByteAddressBuffer g_OffsetBuffer;

[numthreads(32, 32, 1)]
void ComputePinholes(uint2 id : SV_DispatchThreadID)
{
    float2 uv = ...
    bool foundPinhole = FindPinhole(uv, id, ...);
    if (foundPinhole)
    {
        uint newIndex = g_PinholeLinkBuffer.IncrementCounter();
        if (newIndex >= TotalMaxPinholeCount)
            return;

        // Get cell position from uvs
        uint2 cellPos = ...
        uint offset = cellPos.y * gridSize + cellPos.x;
        uint prevIndex;

        // Atomic swapping
        g_OffsetBuffer.InterlockedExchange(offset * 4, newIndex, prevIndex);

        // Store detected pinhole information
        PinholeLink link = ...;
        g_PinholeLinkBuffer[newIndex] = link;
    }
}

Listing 1.2. An example implementation of the linked list generation. This approach is very similar to Yang et al. [2010].
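The intermediate flattening pass is not shown in the article's listings; the sketch below illustrates one possible form of it. The PinholeLink/PinholeData layouts, the empty-list sentinel, the thread-group size, and the example constants are assumptions chosen to line up with the names used in Listings 1.2 and 1.3.

static const uint PinholeGridSize = 64;    // example values, matching one configuration from Section 1.5
static const uint PinholesPerCell = 16;

struct PinholeData                  // fields mirroring what Listing 1.3 reads; layout is an assumption
{
    float2 position;                // shadow map UV of the pinhole
    float meanDepth;                // average neighborhood depth (estimated occluder depth)
    float rawDepth;                 // unfiltered depth at the pinhole, used for occlusion tests
};

struct PinholeLink                  // assumed node layout for the concurrent linked lists
{
    PinholeData data;
    uint next;                      // index of the next node, or 0xffffffff for end of list
};

StructuredBuffer<PinholeLink>   g_PinholeLinkBuffer;
ByteAddressBuffer               g_OffsetBuffer;        // per-cell head index (assumed cleared to 0xffffffff)
RWStructuredBuffer<PinholeData> g_PinholeBuffer;       // flat array: PinholesPerCell entries per cell
RWStructuredBuffer<uint>        g_PinholeCountBuffer;  // number of valid entries per cell

[numthreads(8, 8, 1)]
void FlattenPinholeLists(uint2 cell : SV_DispatchThreadID)
{
    uint cellIndex = cell.y * PinholeGridSize + cell.x;
    uint node = g_OffsetBuffer.Load(cellIndex * 4);
    uint count = 0;
    while (node != 0xffffffff && count < PinholesPerCell)
    {
        PinholeLink link = g_PinholeLinkBuffer[node];
        g_PinholeBuffer[cellIndex * PinholesPerCell + count] = link.data;
        node = link.next;
        ++count;
    }
    g_PinholeCountBuffer[cellIndex] = count;
}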



Figure 1.5. Left: The shadow map is subdivided into a global buffer with linked lists and a separate buffer that indicates each cell’s first link. Right: The required buffers. From top to bottom: 1) The global linked list data, which uses an atomic counter along with a 2) buffer that maintains the list starting element. 3) The coherent uniform grid buffer; each cell has a maximum amount of pinholes, but 4) for reducing the amount of iterations, a size buffer is used.

1.3 Shadow Rendering
When rendering the shadow-receiving points, a search is made on the uniform grid buffer with a predefined maximum radius, and the contribution of each precomputed pinhole to the shaded point is calculated. To do this, we can estimate that each projected pinhole has a size proportional to the distance towards the receiver and the solid angle subtended by the area light. If the receiver is in the pinhole's projected range, we can sample an arbitrary shape texture with the local offset as UVs. This shape texture can easily be animated for interesting effects, such as an eclipse, and can also be mipmapped if necessary.
In the case of a distant light like the sun, it is possible to approximate the pinhole size S_p by just using a constant S_L representing the size of the area light, as Equation (1.1) shows. It is useful to use said constant to exaggerate the effect.

S_p = (d_r − d_p) S_L        (1.1)

d_r is the distance from the receiving point to the light, and d_p is the estimated average distance of the pinhole's neighborhood to the light. d_p is used to provide a good estimate of the pinhole's abstract position.
Using a single shape texture for all pinholes may be limiting, because it implies that the source light is infinitely far away and thus the projected shape is always similar, disregarding size. Because of this, it is also possible to extend this method to use different shaped textures depending on the direction towards the light, in a similar way to view-dependent impostors [Brucks 2018]. This can be useful if the light has an unconventional shape (e.g., a star polyhedron) and is very close to the occluders.


Finally, if the light shape is simple enough, it is possible to define it with a function of the pinhole's local UV space. For example, a solar eclipse can be described with two overlapping circles, and the subtracting inner circle's position can be driven by a function of time. This may reduce texture lookups drastically and improve performance if the shape function is computationally cheap. The local pinhole UV can also be rotated or distorted easily through a matrix multiplication or any kind of domain warping for specific scenarios, e.g., for shadows seen through animated water.
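A minimal sketch of such a procedural shape function for the eclipse case might look as follows; the radii, the smoothstep widths, and the way time drives the occluding disc are illustrative choices, not values from the demo.

// Hypothetical procedural light shape: a sun disc minus a moving occluding disc.
// shapeUV is the local pinhole UV in [0, 1]^2; t in [0, 1] drives the eclipse phase.
float EclipseShape(float2 shapeUV, float t)
{
    float2 p = shapeUV * 2.0 - 1.0;                          // local space in [-1, 1]
    float sun = 1.0 - smoothstep(0.9, 1.0, length(p));       // bright disc
    float2 moonCenter = float2(lerp(-2.0, 2.0, t), 0.0);     // slides across over time
    float moon = 1.0 - smoothstep(0.85, 0.95, length(p - moonCenter));
    return saturate(sun - moon);                             // light only where the sun is uncovered
}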

1.3.1 Handling Occlusion
Because the pinhole can still be occluded by other elements, we need to save the raw depth at the pinhole position, regardless of its neighbor depth values. With this depth, it is simple to check if the current shadow receiver has direct visibility within a predefined margin, and the transition can be smoothed if necessary, as Figure 1.6 shows. Although this depth can be sampled from the unfiltered shadow map when iterating through the pinhole buffer, it is more efficient to sample it when detecting the pinholes, as it then only has to be sampled once per pinhole. The resulting lookup function that samples the texture can be seen in Listing 1.3.

Figure 1.6. Pinholes can leak through occluders. To prevent this, we must still store the actual depth (not its estimated neighborhood average) at the pinhole's position in the depth map, so we can use that to prevent leaking. Left: No occlusion. Right: occlusion considered.


float CollectPinholes(uint2 cell, float2 uv, float depth)
{
    uint bufferIndex = cell.y * PinholeGridSize + cell.x;
    int count = g_PinholeCountBuffer.Load(bufferIndex);
    float accum = 0.0;
    int bufferOffset = bufferIndex * PinholesPerCell;
    for (int i = 0; i < count; ++i)
    {
        PinholeData pinhole = g_PinholeBuffer[bufferOffset + i];
        float occlusion = depth - pinhole.rawDepth;
        if (occlusion < DEPTH_PROXIMITY)
        {
            // d_r = depth
            // d_p = pinhole.meanDepth
            float CoC = abs(depth - pinhole.meanDepth);
            CoC = saturate(CoC) * _BokehSize;
            float d = distance(pinhole.position, uv);
            float2 shapeUV = CalculateUV(uv, pinhole.position, CoC);
            accum += SampleLightShape(shapeUV);
        }
    }
    return accum;
}

Listing 1.3. The pinhole rendering code. This method is called over every uniform grid cell inside a maximum predefined radius.

1.4 Temporal Filtering
Since pinholes are very evident when rendering a shape that has high-frequency details in light space, it is not surprising that the technique is highly susceptible to changes in the shadow map. This can happen due to occluder or light animations and is, in a way, similar to what happens in nature. But because the shadow map is just a discrete representation of the depth towards the light, pinholes can appear and disappear sporadically and suffer great jumps in position and size, generating strong aliasing artifacts. Ignoring these details will generate a very noisy and distracting temporal pattern that can take away from the desired effect, so it is essential to design a temporal filter that at least mitigates the effect, unless the user desires to render just a still image or a static scene.




1.4.1 Accumulation and Decay
Given that pinholes are an emergent feature of the visibility function of each scene, it is hard to predict how they move in time. A very naive approach to solving this problem is keeping track of the previous frame's uniform grid, and accumulating new points while decreasing the intensity of pinholes from previous frames. If the intensity is below a certain threshold, the pinhole is discarded. This requires an additional variable to be tracked per pinhole that represents intensity. This works visually well, but has the secondary effect that if pinholes are moving fast in a scene, the accumulation cannot catch up and the grid is saturated with decaying points, decreasing performance because of the high number of iterations required per shaded pixel.
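As a sketch of the bookkeeping this implies, the per-frame update could look like the following; the intensity field, the linear decay model, and the constants are assumptions, with newly detected pinholes assumed to enter the grid at full intensity.

// Hypothetical per-pinhole decay applied while merging the previous grid with new detections.
static const float DecayPerSecond = 4.0;       // illustrative rate
static const float IntensityThreshold = 0.05;  // below this the pinhole is dropped

// Returns false when the pinhole should be discarded from the grid.
bool DecayPinholeIntensity(inout float intensity, float deltaTime)
{
    intensity = saturate(intensity - DecayPerSecond * deltaTime);
    return intensity > IntensityThreshold;
}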

1.4.2 Merging
In addition to temporal accumulation, it is also desirable to merge any pair of points that fall within a defined radius. When merging both pinholes, we also combine the position of both points using linear interpolation, letting the developer control the result by choosing a bias between new or old points. Merging can be executed while flattening the linked list generated from the pinhole detector, but it requires O(N²) iterations over the pinhole buffer. Also, this technique forces the user to do more bookkeeping, as multiple buffers with different structures need to be kept and traversed.
This filtering technique can cause pinholes to slide and swim through the shadow map, which can visually break the phenomenon. Thus, it is important to define reasonable parameters that limit the amount of movement a pinhole can have, and to carefully merge with neighbor cells. If moving pinholes (Figure 1.7) are not properly merged, there will be clear artifacts at the edges of cells where pinholes will stop moving and just decay until discarded.

Figure 1.7. A pinhole has moved from one cell to another in this frame, and should be merged. If the merging does not occur, the previous pinhole will decay at the edge of the cell, generating an unwanted artifact.


1.4.3 Screen-space Masking
It is also possible to do the search and filtering in screen space. For example, an initial mask can be generated based on primary visibility, where only regions of the shadow map that intersect with visible geometry are marked. Then we can search for pinholes and merge/accumulate safely within that mask, possibly with some relaxation of its boundaries so that pinholes don't appear and disappear at the edges. This concept can also be implemented without filtering, as it would help reduce the complexity of pinhole search for very big shadow maps or lists of cascades that don't necessarily contribute to the visible range.
This approach would also have to explore many of the solutions that spatiotemporal algorithms implement [Korein and Badler 1983, Karis 2014, Marrs et al. 2018], given that any fast camera or object movement would leave trails on the screen, among other possible artifacts. However, it is possible that this can be useful in very constrained scenarios where the developer has control over these circumstances, such as a top-down real-time strategy game.

1.4.4 No Filtering
If the scene or the shadow casters responsible for pinhole generation are near static, then this technique is very effective. It can be used for baked lighting, where temporal filtering is unnecessary but the shadow phenomenon is desired on the static receivers. Additionally, if the user desires to have the projected (but static) pinholes interact with dynamic objects, the pinhole buffer can be calculated and stored offline. This removes the per-frame cost of pinhole detection, but requires a bit more infrastructure in the engine. The runtime cost of evaluating the projected textures is still necessary, however.

1.5 Results
Runtime performance is highly dependent on the scene and the number of pinholes per frame. Shadow map resolution and the number of shadow-receiving pixels being evaluated on screen also contribute to performance. The maximum possible pinhole count and the uniform grid subdivision count are the two biggest driving performance factors, as usually happens when iterating over uniform grids, as Table 1.1 shows. It is also fill rate bound, considering that the pinhole search is done in the fragment shader of each receiver.
Because the radius of the projected light shape is proportional to the distance of the pinhole to the receiver, the number of neighbor cells iterated increases when objects are far away, negatively impacting performance, as can be seen in Table 1.2. Having either a maximum search size or transitioning to a default soft shadow model might be enough to hide this limitation. At very big pinhole sizes the approximation breaks down visually, so it is desirable to prevent this from happening anyway.




512 512 1024 1024 2048 2048

16 16 , 64 pinholes 0.08/0.12/1.11 0.29/0.98/1.71 1.11/1.16/2.73

3232, 32 pinholes 0.08/0.05/0.95 0.29/0.21/1.19 1.1/0.38/1.93

6464, 16 pinholes 0.08/0.04/0.95 0.29/0.06/1.07 1.1/0.13/1.53

Table 1.1. Performance measurements for different uniform grid cell subdivisions and shadow map sizes. The values are in milliseconds, and correspond to pinhole detection, filtering and rendering. Note that the shadow map size implicitly drives how many pinholes can be detected, and their size. All measurements were done on an Nvidia 1070 GTX graphics card running the sample scene fullscreen at 1920 1080, and captured with Nvidia’s Nsight performance tool. Shape textures are 128128, and the number of pinholes per cell are adjusted with the grid size to keep the pinhole count constant.

Search radius       20 px    51 px    102 px    204 px    512 px
Total frame time    0.97     2.15     5.30      16.39     64.10

Table 1.2. Example measurements for different search radii when rendering a shadow receiver. As the radius increases, the number of cells required to iterate increases, negatively impacting performance. This maximum search radius also limits how big the projected pinholes can be. Time values are in milliseconds, and the radius is defined in terms of shadow map pixels.

Sampling light shape textures is another bottleneck, as it must occur for each pinhole in a specific neighborhood inside the uniform grid. Considering this limitation, it is better to design shapes that can be procedurally defined, reducing memory lookups. Finally, downsampling the shadow map also helps substantially, as both the lookups necessary for pinhole detection and the number of pinholes found decrease. This also benefits pinhole detection by offloading the filtering to the rasterizer.
In practice, we found that the simple 9-tap pinhole detector works best, coupled with the simpler temporal filtering approaches. In a way, if the pinholes are moving too much the result becomes stochastic and approximate, but at least it is not jarring to the eye. Using the more robust pinhole detection with a radius has multiple problems. First, the complexity of estimating the mean and variance of the segmented kernel makes it very expensive. However, our implementation used a very basic local mean and variance estimator, and by using shared memory, most of the impact from texture lookups can probably be mitigated. Additionally, selecting a predefined radius is not good for arbitrary scenes with animations, and even if we sampled at different radii and averaged the results, this would require even more samples, making it impractical.
Finally, a comparison with PCSS can prove useful. A naive implementation of PCSS can generate similar results, but requires many samples to reduce noise and capture pinholes, as it has no particular search method for them. It also cannot approximate arbitrary light shapes, although sampling a shape texture may be an interesting approach for future work. Figure 1.8 shows the resulting shadows. Overall, our method can achieve similar quality of high-frequency shadows to PCSS with fewer filtering samples.


Figure 1.8. Left: our pinhole approximation with a 64x64 uniform buffer and 16 pinholes per cell. Right: naive PCSS with 32 blocker search samples and 256 PCF samples. Note that PCSS has better shadow edges, but cannot approximate the same optical properties (in the case of noncircular shapes), as well as requiring many more samples for its computation.

1.6 Conclusion and Future Work
We propose a novel approximation technique to represent a common visual phenomenon that arises when rendering shadows from shapes that have high-frequency details in light space. It enables artists and developers to enhance the aesthetic of complex shadows in these circumstances, such as tree shadows. Aside from real-life scenarios where it is applicable, this effect also lets users be creative with the projected shapes, bringing non-photorealistic rendering to shadows. Figures 1.9 through 1.11 show example renders running in real time.
This technique for implementing shadow pinholes can be optimized in several ways. For example, to improve the performance, one could use an advanced spatial

Figure 1.9. Left: Common shadow mapping with simple filtering (Unity). Right: Using a circular shape to estimate the light shape.




Figure 1.10. Shadows during a simulated eclipse, where the area light is occluded procedurally.

Figure 1.11. Non-photorealistic shape textures, which can be used for artistic control of a scene, or simulating uncommon light shapes (for example, an LED array light will generate shadows similar to those from the light shape in the second image).

acceleration structure for the pinhole search, such as a quadtree. Clever use of downsampling techniques can also reduce the lookup complexity in most cases. To improve the pinhole detection, one could use a moment map for approximating the variance. To improve the temporal filtering, one could implement a more robust clustering approach that doesn’t consider just close pairs of samples. Lastly, we could combine our technique with PCSS, either by extending the blocker search with information from the pinhole buffer, or trying to apply the same shape texture concept based on the blocker search region. Finally, partial occlusion of the projected pinhole shapes by close occluders can also aid in simulating this effect.

Bibliography
BRUCKS, R. 2018. Realistic Foliage Imposter and Forest Rendering in UE4. Game Developers Conference 2018.
DONNELLY, W. AND LAURITZEN, A. 2006. Variance Shadow Maps. In Proceedings of the 2006 Symposium on Interactive 3D Graphics and Games, pp. 161–165. URL: http://doi.acm.org/10.1145/1111411.1111440.
FERNANDO, R. 2005. Percentage-closer Soft Shadows. In ACM SIGGRAPH 2005 Sketches. URL: http://doi.acm.org/10.1145/1187112.1187153.
HEITZ, E., HILL, S., AND MCGUIRE, M. 2018. Combining Analytic Direct Illumination and Stochastic Shadows. In Proceedings of the 2018 Symposium on Interactive 3D Graphics and Games, pp. 2:1–2:11. URL: http://doi.acm.org/10.1145/3190834.3190852.
KARIS, B. 2014. High Quality Temporal Anti-Aliasing. SIGGRAPH 2014.
KOREIN, J. AND BADLER, N. 1983. Temporal Anti-aliasing in Computer Generated Animation. In Proceedings of ACM SIGGRAPH '83, pp. 377–388. URL: http://doi.acm.org/10.1145/800059.801168.
MARRS, A., SPJUT, J., GRUEN, H., SATHE, R., AND MCGUIRE, M. 2018. Adaptive Temporal Antialiasing. In Proceedings of the 2018 Conference on High-Performance Graphics, pp. 1:1–1:4. URL: http://doi.acm.org/10.1145/3231578.3231579.
MINNAERT, M. 1937. Light and Color in the Outdoors. Springer, 1937.
PETERS, C. AND KLEIN, R. 2015. Moment Shadow Mapping. In Proceedings of the 2015 Symposium on Interactive 3D Graphics and Games, pp. 7–14. URL: http://doi.acm.org/10.1145/2699276.2699277.
PETTINEO, M. AND DE ROUSIERS, C. 2012. Depth of Field with Bokeh Rendering.
REEVES, W., SALESIN, D., AND COOK, R. 1987. Rendering Antialiased Shadows with Depth Maps. In Proceedings of ACM SIGGRAPH '87, pp. 283–291. URL: http://doi.acm.org/10.1145/37401.37435.
TUFTE, E. 1997. Escaping Flatland, Sculpture. URL: https://www.edwardtufte.com/tufte/sculpture.
WILLIAMS, L. 1978. Casting Curved Shadows on Curved Surfaces. In Proceedings of ACM SIGGRAPH '78, pp. 270–274. URL: http://doi.acm.org/10.1145/800248.807402.
YANG, J., HENSLEY, J., GRÜN, H., AND THIBIEROZ, N. 2010. Real-time Concurrent Linked List Construction on the GPU. In Proceedings of the 2010 Eurographics Conference on Rendering, pp. 1297–1304. URL: http://dx.doi.org/10.1111/j.1467-8659.2010.01725.x.


2 III

Parallax-Corrected Cached Shadow Maps
Pavlo Turchyn

2.1 Introduction
Rendering shadows over large viewing distances often requires processing a large number of shadow-casting objects. Many game engines that use shadow maps for long-range shadow rendering opt for caching schemes that allow distributing the costs of shadow map rendering over several frames, thus exploiting frame-to-frame coherency and rendering only a subset of shadow casters per frame, e.g., Schulz and Mader [2014] and Acton [2012]. Some game engines cache occlusion data derived from shadow maps rather than keeping plain shadow maps, e.g., Valient [2012] and Gollent [2014].

Figure 2.1. Shadows from a moving directional light rendered using two shadow map cascades. The first (near) cascade is updated every frame, and the second (far) cascade is cached and invalidated infrequently. The left image shows a mismatch between the cascades since the cached cascade is rendered with a light direction that was captured many frames ago. Parallax correction fixes this divergence as shown on the right image.



However, caching is problematic when the shadow casting light is dynamic, e.g., in a game with a dynamic day-night cycle where the sun or moon is constantly moving across the sky, thus making cached data inconsistent with the current light state. One has to either invalidate the cache often to keep the divergence small, which makes caching a less efficient optimization, or treat cached shadows as a very rough approximation of actual shadows because the error is too apparent when viewed up close.
In this paper, we describe a parallax correction algorithm for rendering sweeping shadows from a dynamic light source using a static shadow map. The use of parallax correction in Far Cry 5 enabled a fairly seamless transition between dynamic shadows rendered with two different techniques: near shadows with cascaded shadow maps (CSM) updated every frame, and far shadows with adaptive shadow maps (ASM) covering a 500 meter range and updated every 2500 frames. As a result, we are using relatively expensive CSM for rendering shadows within only 30 meters from the player's camera, which is quite a short range for an open world game.

2.2 Parallax Correction Algorithm
When rendering images of a static scene from different viewpoints, one can reproject pixels seen from one viewpoint to another provided that the image depth buffer is available and we know all required camera transforms, e.g., Mark et al. [1997]. It is straightforward to apply this approach to shadow maps: given a shadow map rendered with a shadow camera built for light direction L_0, one can reconstruct world space positions of all shadow map texels, and then render the resulting set of points (as a point cloud or a mesh) into a shadow map for a different light direction L_1. It's possible to employ a more sophisticated method of interpolation [Yang et al. 2011]. Unfortunately, these reprojection procedures are not very practical since they operate with the entire shadow map, which is often a high-resolution image. Processing a lot of texels may be computationally expensive even with relatively simple shaders. Moreover, doing reprojection over a set of shadow maps arranged via some spatial subdivision scheme, such as cascaded shadow maps, isn't straightforward for the border texels, which might end up in a different cascade when rendered with a new light camera.
Here we attempt to approximate the reprojection for a set of pixels on screen instead of warping the entire shadow map. Assume we have a shadow map generated for a dynamic directional light with direction L_0. Consider a point P_0 in world space as illustrated in Figure 2.2(a). We can reconstruct the world space position of its occluder P_occ by computing the distance to occluder d_0 from the shadow map depth:

P_occ = P_0 + d_0 L_0.        (2.1)

Suppose the light is moving, and its new direction is L1. Let’s project Pocc along the new direction L1 to get a point P1:

P_1 = P_occ − d_1 L_1.        (2.2)



Figure 2.2. Parallax correction algorithm. (a) The shadow map is rendered for a light direction L_0. The point P_occ is starting to occlude P_1 rather than P_0 when the light changes its direction from L_0 to L_1. Our idea is to sample shadows at P_0 using the shadow map computed for light direction L_0, and take the resulting shadow factor for shadow intensity at P_1. (b) We walk along the direction D starting from P_1 in small increments, sampling the shadow map at each point S_i and accumulating occluder distance values. We stop after a certain number of iterations. We compute the average value of accumulated occluder distances, which gives us an approximation of P_0 via Equation (2.4).

The practical meaning of Equation (2.2) is that we can compute shadows at P_1, i.e., P_1 will be in shadow for any d_1 ≥ 0. So far we were following this reprojection route: take a point P_0, reconstruct its occluder from a shadow map, and then use the reprojected occluder when shading the scene. However, we are really interested in doing these steps in the reverse order. For any given light direction L_1 and a point P_1, we want to find the corresponding P_0, so that we can sample shadows at P_0 using the shadow map computed for light direction L_0 and then use the resulting shadow factor for shadow intensity at P_1.
Our parallax correction algorithm can be briefly described as this: we want to get to P_0 from P_1. For this, we need two things: a good guess of the direction from P_1 to P_0 and a good guess of the length of the path from P_1 to P_0. Let's elaborate how to obtain these values. Substituting Equation (2.1) into Equation (2.2), we get

P_0 = P_1 + d_1 L_1 − d_0 L_0.        (2.3)

We attempt to solve Equation (2.3) by assuming d_1 = k d_0, where k is a constant. We will discuss how we choose the value k later in this section. Here we only note that this assumption enforces a certain relation between P_1, P_occ, and P_0. This gets us

P_0 = P_1 + d_0 (k L_1 − L_0).        (2.4)


Thus, if we want to compute shadows for light direction L_1 at an arbitrary point P_1, we can use Equation (2.4) to find the corresponding point P_0, provided that we can compute the distance to occluder d_0 and choose a reasonable value of k.
Occluder search. As follows from Equation (2.4), P_0 is located somewhere on the ray starting from P_1 in the direction

D = k L_1 − L_0.        (2.5)

We search for an occluder by marching along this ray with a certain number of iterations, computing occluder depth at each step, and then taking the average for d_0. This process is illustrated in Figure 2.2(b) and Figure 2.3(a). The search distance d_s is a scene-dependent parameter that accounts for the maximum displacement of the shadows due to the parallax we are expecting. That is, we need a larger search distance for a scene with long shadows and tall shadow casting structures. Conversely, the search distance may be shorter for a scene with an overhead light and small shadow casters. An increase in the difference between L_0 and L_1 also increases shadow parallax and thus the search distance.
Choosing the parameter k. The point P_0 is located somewhere on the ray originating at P_1, with ray direction D being controlled with the parameter k. Figure 2.3(b) illustrates that changing the value of k would result in different D, with values k > 1 corresponding to the point P_0 being closer to the occluder than P_1. Ideally, having P_0 on the


Figure 2.3. Occluder search. (a) Occluder search with 5 iterations and k = 1.5. We are sampling shadow map depths at each step to compute the distance from a point on the ray to its occluder, if there's any. (b) Effect of parameter k on the search vector D. The value k = 1.5 would give incorrect results because some points on the search ray are located inside shadow casting geometry.



surface of an object containing P_1 would give the most accurate shadows, but this is hardly possible in practice. In our example, P_0 will be either above the surface if we choose k = 2 or k = 3, or even below the surface if we choose k = 1.5. The best value for k depends on the scene. Consider a difficult situation when we want to compute parallax correction at a point located on a concave surface. It's quite probable that the occluder search may be testing points under the surfaces of nearby objects as illustrated in Figure 2.3(a), thus interpreting surrounding geometry as an occluder. Choosing a larger value for k can help prevent these errors in concavities, as shown in Figure 2.3(b). However, larger values of k can cause the method to miss smaller occluders, thus producing inaccurate results. Due to this tradeoff, one needs to pick the value of k that produces the best results for a given scene. We use the following empirical formula, where k depends on the magnitude of the difference between the light directions:

k = 1 + ‖L_1 − L_0‖.        (2.6)

Finally, we can gather all the bits we have described so far into the sample code given in Listing 2.1.
Implementation enhancements. The accuracy of shadows created with parallax correction generally degrades as the angle between the current light direction and the original light direction increases. It's possible to reduce the divergence, and thus the visual deficiencies, if we know the animation curve of the light L(t), where L is the light direction and t is time. Assume we want to update the shadow map at regular time intervals with the time step Δt, i.e., if we first update the shadow map at t_0, then the following updates will happen at t_0 + nΔt, where n is a positive integer. The best strategy when updating the shadow map at t_0 would be to take the light direction at halfway between directions L(t_0) and L(t_0 + Δt), or just taking L(t_0 + Δt/2) if the light's animation speed is constant. This minimizes angles between the light directions used for shading and the direction of the shadow map. Even though lookahead sampling requires us to give the shadows subsystem access to the curve L (thus making the implementation more complicated), we found that the resulting improvement in quality is worth the effort.
Another way to improve parallax-corrected shadows is adding a cross-fade when updating cached shadow maps to the most recent light state. We are assuming that two shadow map textures will be available at this point. It makes sense to render the updated shadow map into a separate texture because its rendering may take several frames to complete (e.g., it takes more than 80 frames to update the cached shadow map in Far Cry 5) and we still need to apply shadows to the scene while the new shadow map is only partially rendered, and thus it can't be used for shading. With both the old and new shadow maps ready, we can perform a cross-fade between shadow factors sampled from these textures, which is less visible than an immediate switch between the textures in a single frame.



uniform float3 L0;   // shadow map light direction
uniform float3 L1;   // current light direction
uniform float4x3 searchParams;   // contains search distance, etc.
uniform float4x3 worldSpaceToShadowMap;

float CalcShadow(float3 P1 /* a point in world space */)
{
#if ENABLE_PARALLAX_CORRECTION
    float k = 1.0 + length(L1 - L0);
    float3 D = k * L1 + L0;

    // Occluder search
    float3 S = P1;
    float3 dS = mul(float4(D, 1), searchParams);
    float sum = 0, cnt = 0;
    for (int i = 0; i < OCCLUDER_SEARCH_ITS; ++i)
    {
        float3 Sp = mul(float4(S, 1), worldSpaceToShadowMap);
        float occDepth = SampleShadowMapDepthTexture(Sp.xy);
        if (occDepth < Sp.z)
        {
            sum += Sp.z - occDepth;
            cnt += 1;
        }
        S += dS;
    }
    float d0 = cnt > 0 ? sum * rcp(cnt) : 0;
    float3 P0 = P1 + d0 * D;
#else
    float3 P0 = P1;
#endif

    // Compute shadow factor at P0
    float3 Pp = mul(float4(P0, 1), worldSpaceToShadowMap);
    return step(Pp.z, SampleShadowMapDepthTexture(Pp.xy));
}

Listing 2.1. Example implementation of the parallax correction for a simple orthogonal shadow map.
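As an illustration of the cross-fade mentioned in the implementation enhancements above, the small sketch below blends the shadow factors obtained from the old and the newly completed cached shadow map (each evaluated with a CalcShadow-style function bound to its own shadow map and light directions). The ramp over a fixed number of frames and all names are assumptions for illustration.

// Hypothetical blend between the outgoing and the freshly rendered cached shadow map.
float CrossFadeShadow(float oldShadow, float newShadow, uint framesSinceSwap, uint fadeLengthInFrames)
{
    float fade = saturate(framesSinceSwap / (float)fadeLengthInFrames);  // 0 = old map, 1 = new map
    return lerp(oldShadow, newShadow, fade);
}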

Algorithm limitations. While the algorithm works well as long as the parameters d_s and k are chosen appropriately, a high variation in depth of overlapping shadow casters results in incorrect parallax correction. It is caused by the occluder search hitting the furthest occluder and computing parallax correction using a biased occluder distance, which may distort shadows from closer shadow casters overlapping with shadows from


Figure 2.4. Defects occurring with high depth variation of overlapping shadow casters. From left to right: a shadow from the cone is overlapping with a shadow from the cylinder, which is much taller and further away from the camera; the PCSS estimator gives an incorrect penumbra size, resulting in a large penumbra near the cone base; our parallax correction algorithm also produces incorrect results for the same reason (overestimation of the distance to occluder), resulting in shadow distortion; the parallax-corrected shadow penumbra is incorrect too.

more distant ones, as shown in Figure 2.4. This is similar to the occluder fusion problem found in some soft shadow algorithms, such as percentage-closer soft shadows (PCSS) [Fernando 2005].

2.3 Applications of Parallax Correction

Parallax correction may be applied to a number of algorithms that utilize shadow map caching. Generally, a static shadow-casting algorithm can be used to generate the initial shadow map, and parallax correction can then be applied to adjust the map as lights move. However, the caches need to be invalidated from time to time since parallax correction is an approximate technique.

Cached cascaded shadow maps. CSM caching implies that not all cascades are updated within one frame. One way to do that is skipping cascade updates for a certain small number of frames, either replacing cascade updates with another workload of similar complexity or updating distant cascades in a round-robin manner. The system keeps using matrices and shadow map textures cached from previous frames to apply shadows to the current scene. We suppose that applying parallax correction in this scenario doesn't offer a lot of improvement, since shadow maps are meant to be updated quite frequently (every other frame or so) and small discrete changes in light direction aren't too noticeable.



The work by Schulz and Mader [2014] employs a different approach to caching, with a single shadow map containing only static objects replacing the last two cascades. Shadows cover a 1.4 km range, so the full update of this shadow map takes 10–15 ms distributed over many frames. In this scenario, parallax correction can reduce the visual discontinuity between the cascades that is caused by the lengthy shadow map update process, similar to what is demonstrated in Figure 2.1. One should just update the static shadow map periodically rather than only at certain preset points in game levels.

Acton [2012] utilizes a toroidal update scheme, also found in other algorithms such as clipmaps [Asirvatham and Hoppe 2005], to minimize the number of shadow-casting objects rendered into cascades. The main observation is that moving the player's camera only changes the cascade frustum's translation, but not its size or orientation, as long as the shadow-casting light is static. Typically there is only a small difference between the current frustum and the frustum from the previous frame, so the contents of the shadow map are nearly the same save for a few small regions. One can perform a toroidal update reusing a large portion of the previously rendered shadow map, rendering only the objects falling into the parts near the shadow map border that were previously invisible. Parallax correction improves consistency between the cascades updated every frame and the cascades updated via toroidal update, thus allowing cached data to be reused for a longer period before the cached cascades need to be rebuilt with a new light direction.

Adaptive shadow maps. Shadow map caching is an essential part of the adaptive shadow maps algorithm, e.g., Turchyn [2011]. It is possible to discretize light direction movements and build a separate hierarchy of tiles for each quantized light direction. Aside from possibly noticeable steps in the light directions, this also implies that we have to maintain two hierarchies whenever we want to update the shadow maps: one of the hierarchies is used for shading the current frame, and the other one is in the process of construction. Having two hierarchies at the same time means we need to double the size of the tile cache, and we also pay extra costs to render the tiles. Parallax correction addresses these issues. We can start using the tiles rendered with the new light direction as they become available, rather than waiting until the full tile hierarchy is ready. This way shadows are sampled from a mix of old and new tiles, with the parallax correction ensuring shadow consistency, as shown in Figure 2.5. Therefore, we can start discarding old tiles as the new tiles become available, thus reducing tile cache memory requirements and improving cache utilization.

2.4 Results

A major challenge in the development of Far Cry 5 was the addition of new rendering techniques, such as screen-space reflections, that were not present in the engine previously. The existing subsystems had to become faster to accommodate the new tech, so the shadow rendering budget was reduced from 6 ms to 4.5 ms.





Figure 2.5. Parallax correction not only enables smooth sweeping shadows with the adaptive shadow maps algorithm, but also makes it possible to start evicting old tiles from the cache before the update is fully finished, without the discontinuities that are visible in the left image. (a) ASM shadows from a mix of new and old tiles. (b) The same set of tiles with parallax correction applied.

The Far Cry series has a long history of using cached shadow maps [Valient 2012]. However, cached shadows were always treated as a low-quality solution for objects further away from the player's camera, hence cascaded shadow maps used to cover quite a large viewing distance. Adding parallax correction and improving cached shadow map filtering quality allowed us to bring cached shadows closer to the camera, reducing the range covered by CSM from 80 to 30 meters. Table 2.1 shows examples of the resulting performance improvements. We are using adaptive shadow maps for shadows covering the range from 30 to 500 meters from the camera. A typical cost of rendering a single ASM tile is around 0.5 ms on the GPU and up to 1.5 ms on the CPU on PS4.

Test    Number of objects in CSM          CSM GPU render time, ms
        80 meters      30 meters          80 meters      30 meters
 a         310            104                3.8            2.3
 b         132             68                3.9            2.8

Table 2.1. Reducing the range of cascaded shadow maps was a large win performance-wise. Far Cry 4 used CSM with three cascades covering 80 meters range from the player's camera. We have reduced the range of three cascades down to 30 meters in Far Cry 5, since parallax-corrected adaptive shadow maps are good enough to be used at ranges closer than 80 meters. A great side effect of this range reduction was the increase of CSM texel density around the player.


An ASM light direction update is triggered every 1–2 minutes of normal gameplay time, so we are avoiding the update costs in the vast majority of frames.

Our implementation of the parallax correction is relatively lightweight, with a typical GPU cost of 40–70 µs on PS4. We perform the occluder search with 7 steps over a low-resolution downsampled depth map generated using a min-depth kernel over the shadow map (see the depth extent map in Turchyn [2011]), which improves search accuracy while keeping the number of iterations low.

The importance of consistency between cached and dynamic shadows is clearer in motion than on static screenshots such as Figure 2.6. Our very infrequently updated cached shadows without parallax correction often resulted in lighting being completely different when transitioning between dynamic and cached shadow maps. In motion this change between the two types of shadows was perceived as a cross-fade between unrelated images rather than a change in a shadow's details. See this book's online sample code for a demonstration of parallax correction in motion.

Bibliography

ACTON, M. 2012. CSM Scrolling: An Acceleration Technique for the Rendering of Cascaded Shadow Maps. Advances in Real-Time Rendering in Games course, SIGGRAPH '12.

ASIRVATHAM, A. AND HOPPE, H. 2005. Terrain rendering using GPU-based geometry clipmaps. In GPU Gems 2, pp. 27–45. Addison-Wesley.

FERNANDO, R. 2005. Percentage-closer soft shadows. In ACM SIGGRAPH 2005 Sketches. URL: http://doi.acm.org/10.1145/1187112.1187153.

GOLLENT, M. 2014. Landscape Creation and Rendering in REDengine 3. Game Developers Conference 2014.

MARK, W., MCMILLAN, L., AND BISHOP, G. 1997. Post-rendering 3D Warping. In Proceedings of the 1997 Symposium on Interactive 3D Graphics, pp. 7–ff.

SCHULZ, N. AND MADER, T. 2014. Rendering Techniques in Ryse: Son of Rome. ACM SIGGRAPH '14.

TURCHYN, P. 2011. Fast Soft Shadows via Adaptive Shadow Maps. In GPU Pro 2, pp. 215–224. A K Peters.

VALIENT, M. 2012. Shadows in Games—Practical Considerations. Real-Time Shadows course, SIGGRAPH '12.

YANG, L., TSE, Y., SANDER, P., LAWRENCE, J., NEHAB, D., HOPPE, H., AND WILKINS, C. 2011. Image-based Bidirectional Scene Reprojection. In ACM Trans. Graph., 30:6, pp. 150:1–150:10.



Figure 2.6. Parallax correction improves the transition between cascaded shadow maps in the foreground and adaptive shadow maps in the background in Far Cry 5. (a) A mismatch between long-range and dynamic shadows due to a slow update of cached long-range shadows. (b) Parallax correction fixes the mismatch so that the long-range shadows are perceived as a level of detail of the dynamic shadows rather than something unrelated.


IV

3D Engine Design

Welcome to the 3D Engine Design section of GPU Zen's second volume. The five chapters presented here are a reflection of the latest trends in modern 3D engine design, as shown through advances in realism, material synthesis, and ray-tracing, as well as targeting the latest graphics API standards.

The section starts with Sergey Makeev's chapter "Real-Time Layered Materials Compositing Using Spatial Clustering Encoding". Sergey presents an algorithm that mimics the Allegorithmic Substance texture pipeline as closely as possible, but in real time. It uses a layered materials method which allows us to create composite materials using a large number of layers. This technique was successfully applied to the rendering of armored vehicles in the published action multiplayer tank game Armored Warfare.

Next, Thomas Deliot and Eric Heitz's chapter "Procedural Stochastic Textures by Tiling and Blending" describes a production-ready algorithm that synthesizes infinitely-tiling stochastic textures from small input texture examples. The technique runs in a fragment shader and requires no more than 4 texture fetches and a few computations.

The third chapter in this section is "A Ray Casting Technique for Baked Texture Generation" by Alain Galvan and Jeff Russell. This chapter shows how to bake high-polygon geometry to textures meant to be used by low-polygon geometry using GPU ray casting. Computation times are reduced drastically compared to classical CPU-based baking tools. The chapter shows example shaders to bake various types of textures, as well as highlighting a number of potential pitfalls inherent in the process.

In the fourth chapter, "Writing an efficient Vulkan renderer", Arseny Kapoulkine explores key topics for implementing Vulkan in modern 3D engines. The topics include memory allocation, descriptor set management, command buffer recording, pipeline barriers, and render passes. The chapter also discusses ways to optimize CPU and GPU performance of production desktop/mobile Vulkan renderers today, as well as what a forward-looking Vulkan renderer could do differently.

The fifth chapter, "glTF - Runtime 3D Asset Delivery" by Marco Hutter, explains Khronos Group's glTF, a transmission and delivery format for 3D assets. The chapter starts with the goals and features that are achieved with glTF and their technical implementation. Then the role of glTF in the 3D content creation workflow is laid out, showing the tools and libraries that are available to support each step of the content creation process, and how glTF may open up new application areas that rely on the efficient transfer and rendering of high-quality 3D content.

I hope you enjoy learning from this section's authors' experiences, and do not hesitate to share with us your latest findings and experiences around 3D Engine Design!

Welcome!

—Wessam Bahnassi

1 IV

Real-Time Layered Materials Compositing Using Spatial Clustering Encoding

Sergey Makeev

1.1 Introduction

Most modern rendering engines take advantage of a library of simple and well-known materials and a layered material representation to author detailed and high-quality in-game materials. Popular tools used in texturing pipelines nowadays (e.g., Allegorithmic Substance Painter and Quixel NDO Painter) are also based on the concept of layered materials [Neubelt and Pettineo 2013, Deguy et al. 2016, Karis 2013].

In this chapter, we present an algorithm that uses a layered materials method which allows us to create composite materials using a large number of layers in real time. Our algorithm is designed to mimic the Allegorithmic Substance texture pipeline as closely as possible, but in real time. The proposed technique is based on the blending of multiple well-known materials, where a shared materials library defines the surface properties for each material used in compositing.

Using our method, each mesh can have one unique UV set and several unique texture blend masks, where each blend mask defines the per-pixel blending weights for a material from the library. Each material from the library can use a detail textures technique to improve surface detail resolution. Using materials with detail textures for the composition has the advantage of breaking the texture resolution barrier and allows us to produce the final composition at a very high resolution. Having high-resolution in-game materials is especially crucial in the 4K era.

Our method supports the replacement of the library of materials and transparency modifications for the texture blend masks at runtime. Material replacement at runtime leads to a different visual appearance of the resulting composited material, which is especially important for games supporting User Generated Content or in-game customization.



The presented technique is used for rendering armored vehicles in Armored Warfare, an action multiplayer tank game published by My.com.

1.2 Overview of Current Techniques

A popular method for compositing is to pre-bake a multilayered material into a set of textures (albedo, normal, roughness, etc.) that are used by the game engine. The resulting texture set is primarily designed for a specific mesh and cannot be shared between different meshes. We call this method "Static Material Layering" [Neubelt and Pettineo 2013, Deguy et al. 2016]. This approach gives good results but requires a lot of GPU memory to store the final high-resolution textures.

To break this limitation, some modern rendering engines use a technique that is quite similar to a method that has already been proven for rendering terrain, called "Texture Splatting" [Bloom 2000]. To produce the final composited texture, these engines blend several textures using a pixel shader and a set of texture blend masks which define the transparency of the blending. We call this method "Dynamic Material Layering" [Inside Unreal 2013, Noguer 2016]. This approach works well, as shown in Figure 1.1, but is limited to a small number of simultaneous material layers due to memory and performance limitations.

Figure 1.1. An example of Dynamic Material Layering. This example mesh uses three texture blend masks to define the blending transparency and three library materials to define the surface properties.


1.3 Introduced Terms

Since different game engines and material authoring pipelines use different terms, here are the definitions used in this article.

• Material Template. One single well-known material such as gold, steel, wood, etc. A Material Template can use a tiled detail texture to give the illusion of greater detail for a material. Material Templates are used as basic blocks to create complex multi-layered materials.

• Material Mask. A grayscale texture which is used for defining transparency while compositing different Material Templates. Usually these textures are created by modern texturing tools like Allegorithmic Substance Painter and Quixel NDO Painter using a semi-procedural approach.

• Color ID. A color-coded texture which defines areas of UVs that belong to different opaque materials. An opaque material does not have a blend mask associated with it, and it is always used as the bottom layer in our composition. A color-coded representation where each unique color represents a single material is used to simplify the content pipeline and reduce the number of required textures. Each color-coded texture can represent several opaque materials, as shown in Figure 1.2.

• Layered Material. This is the material definition which is used to build the final composite material. Each Layered Material has a single Color ID associated with it and an ordered set of Material Masks which define the composition order and the blending weights of the materials. A Layered Material also has a set of Material Templates associated with it to define the visual appearance of each material used in a composition.

1.4 Algorithm Overview

Existing solutions [Inside Unreal 2013, Noguer 2016] used for Dynamic Material Layering store the transparency for different materials in the RGBA channels of a texture. When each material mask covers only a small area of the texture, such solutions are inefficient in terms of memory consumption: a lot of texture space is not used for the composition and is wasted. We observe that blend masks are usually coherent in texture space and only partially overlap each other. Using this observation, we propose storing different non-overlapping blend masks in the same texture channels. To achieve this, we group several Material Masks into sets that we call Clusters. We build the clusters based on the connectivity between texels in texture space. This allows us to use the texture space more effectively, since the different clusters can store their blend masks in the same shared texture channels.




Figure 1.2. Several opaque material masks combined into a single Color ID map and applied to the mesh.

At runtime we use this clustered representation and the set of material templates to make the final material composition, as shown in Figure 1.3.

Since we are storing different blend masks in the same texture channels, it is critical to take into account texture filtering boundaries between different clusters. Texture filtering of different blend masks leads to errors during the composition stage due to the leaking of texture blend masks from one material into another. While building the material clusters, we therefore consider which neighboring pixels are involved in texture filtering and use this information while creating the clusters.

To create the initial partitioning into clusters, we perform a connectivity analysis on the set of Material Masks. The connectivity analysis classifies all the texels which are used together for texture filtering as connected. If texels are classified as connected, they will belong to the same material cluster. When we perform the connectivity analysis, we should also take mipmap texture filtering into account. At the same time, we should limit the number of supported mipmap levels; otherwise, at the very last mipmap level all the texels would be classified as connected. For our implementation, we decided to support only the first four mipmap levels. Smaller mipmap levels are not handled by our implementation and are discarded. Supporting only the first four mipmap levels is enough to preserve good texture filtering quality and keep the number of resulting clusters small. An incomplete mipmap chain might lead to aliasing, but its level is acceptable [Mittring 2008]. In practice, the resulting aliasing is barely visible and effectively removed by most modern anti-aliasing algorithms.

Each resulting cluster should not contain more than a limited number of materials, where the number of materials depends on how many per-pixel material layers we need to support. In practice, the number of materials used in a cluster is usually equal to four or five, since we store the cluster blend weights in the BC1 or BC3 texture format.


Figure 1.3. An example of the use of the presented technique. Several texture blend masks encoded as a single RGB weights texture and a single cluster indirection texture. Encoded material blend masks and material templates from the library are composited to get the final image.

A practically unlimited total number of materials and five per-pixel materials is enough to represent even a very complex layered material.

Constructing clusters with a limited number of materials is not always possible, since we can find more connected materials than the maximum allowed number of materials per cluster. As a result, we can find a cluster that uses more than the maximum allowed number of materials. We can split such clusters into several smaller ones that meet our initial requirements. This will lead to a texture filtering error for the texels shared between the cluster edges. Filtering errors will occur due to erroneous texture filtering between blend masks from different clusters in which different materials are encoded (see Figure 1.4). We propose a solution that minimizes the leaking of the texture blend masks while splitting such clusters. For more details see Section 1.5.7.

1.4.1 Spatial Clustering Encoding Representation

At the preprocessing stage, we build a clustered representation using a single Color ID to define all the opaque materials and an ordered set of Material Masks to define all transparent materials.




Figure 1.4. The texture unit uses blend masks from different Clusters during texture filtering. The result of such filtering is a leaking of the material boundaries which leads to visual artifacts in the composition stage.

As a result of the preprocessing, we get a dataset which is used for the runtime composition and contains three different types of data.

• Cluster Indirection. An indirection texture that defines the current cluster ID for a specified texel. The cluster ID is stored using an integer texture format and defines which set of materials should be used for a given texel. Cluster Indirection is stored using a texture that has a lower resolution than the resolution of the source Material Masks: neighboring texels usually use the same set of materials, and the set of used materials rarely changes, which allows us to use a smaller-resolution texture to store this data. Since this texture contains integer data, it cannot use any texture filtering. Cluster Indirection is stored in a texture without mipmaps and fetched using a POINT texture filtering mode.

• Cluster Weights. The weight texture defines the blending weight for each material in the set of materials specified by the cluster ID. We support up to five different material masks per pixel, where the weights are stored using the BC3 texture format. Cluster Weights are stored using a texture that has the same resolution as the input Material Masks: even though the set of used materials rarely varies, neighboring texels of a blend mask can differ significantly. Cluster Weights can be correctly filtered inside the same material cluster. The texture data is stored with mipmaps and fetched using a TRILINEAR or ANISOTROPIC texture filtering mode.

• Cluster Properties. Defines the material surface properties such as albedo, roughness, metalness, etc. which are used for the final composition. We decided to use a Structured Buffer to store the Cluster Properties. Depending on the implementation, Cluster Properties can also be stored in a Constant Buffer.



Figure 1.5. An example representation and usage of the encoded data.

At the composition stage, we obtain the cluster ID for each fragment, which defines the set of used materials and the blending weights. Then, using the Cluster Properties, we obtain the surface properties for each material used in the current cluster. Afterwards, we use the blend weights and the per-material surface properties for the final composition of the surface properties for a given fragment. See Figure 1.5 for more details.

1.4.2 Order-independent Representation for Blend Masks

For the final composition, we need to blend materials in the correct order, as defined in the input data. The most common way of doing this is to perform alpha blending and composite the fragments in a back-to-front order using the following equation:

C final  C src α  C dst 1  α .


Then we repeat this operation for all the blend masks used for blending:

C final  C n α n  1  α n    C 2 α 2  1  α 2  C 1α1  1  α1  C 0 . This approach depends on the order of operations and instead of using alpha-blending we can rewrite the blending equation in weighted form:

C final  w 0C 0  w 1C 1  w 2C 2    w nC n , where:

w k  1 α n 1  α n 1   1  α k 1  α k . Since the resulting weights are normalized, we know that the sum of all weights are always equal to one. We can use this property to reconstruct one of the weights inside a composting pixel-shader instead of storing this weight in the texture channel:

w n  1  w 0  w 1  w 2   w n 1. For our implementation, we decided to use the order-independent weighted representation for the texture blend masks. The order-independent representation allows us to swap the texture channels inside the cluster freely. This property can be very useful for several further optimizations.

1.5 Algorithm Implementation

1.5.1 Extract Background Materials

First, we define an opaque material for each texel. The opaque material is also used to determine which texels are inside the UV mapping and which ones are not. Then we determine which texture resolution we should use for the latest supported mipmap level. For each texel inside this mip level, we generate a Color ID value from the original high-resolution Color ID texture. To avoid situations where the resulting mipmap texel uses several different opaque materials, we forbid using different Color IDs in neighboring texels. This natural limitation allows us to find potential clusterization issues at the very early stages of the art pipeline and helps create Color ID maps which can be efficiently clustered. This also allows us to skip the connectivity analysis for all the opaque materials, since different opaque materials never share neighboring texels.

1.5.2 Material Layers

At this step, we have the ordered set of texture blend masks called material layers. Each material layer defines the transparency of the blending for each texel.

We downsample each material layer using a MAX filter to the resolution corresponding to the latest supported mipmap level. If the resulting texel was marked as unused in the previous step, this texel is located outside of the valid UV mapping and will not be used in the composition.
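A minimal CPU-side sketch of this MAX-filter downsampling is shown below. The image container and the reduction factor (2³ = 8 when only the first four mipmap levels are supported) are assumptions for illustration, not the exact production code.

```cpp
#include <algorithm>
#include <vector>

// Downsample a blend mask with a MAX filter: a coarse texel is non-zero if
// any of the full-resolution texels it covers contributes to the blending.
std::vector<float> DownsampleMaskMax(const std::vector<float>& mask,
                                     int width, int height, int factor)
{
    const int outW = width / factor;
    const int outH = height / factor;
    std::vector<float> result(outW * outH, 0.0f);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
        {
            float& dst = result[(y / factor) * outW + (x / factor)];
            dst = std::max(dst, mask[y * width + x]);
        }
    return result;
}
```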

1.5.3 Weighted Sum Representation

At this step, we have several downsampled texture blend masks defined in the specified order used for the alpha blending. We transform the texture blend masks to a normalized weighted form using Algorithm 1.1. The normalized weighted representation also helps us to discard texels that are entirely covered by other materials and do not contribute to the final composition.

For (every texel(X,Y) in opaque layer)
{
    If (texel(X,Y) is empty in opaque layer)
    {
        skip texel
    }

    accum = 1.0
    For (every input texture blend mask)
    {
        alpha = blend_mask(X,Y)
        layer_weight(X,Y) = alpha * accum
        accum = accum * (1.0 - alpha)
    }
}

Algorithm 1.1. Converting the texture blend mask to a normalized weighted form.

1.5.4 Undirected Graph Representation

At this step, we move from a bitmap representation of the input data to an undirected graph representation. The advantage of a graph representation in comparison with a bitmap representation is that we can use graph theory for analyzing and building clusters with specific characteristics. For each texel, we find all the texture blend masks that have contributed to the given texel and assign a unique texel identifier that corresponds to the unique combination of blend masks used (a small code sketch of this identifier computation is given after Figure 1.6). Then we find the connected areas using an algorithm similar to a flood-fill algorithm and make a separate graph vertex from each unique combination. The result produced by the algorithm is shown in Figure 1.6. For each resulting graph vertex, we store an assigned identifier that encodes which blend masks were used to build this graph vertex.




Figure 1.6. Splitting bitmap data into graph vertices. Red, green, and blue circles represent areas covered by the different texture blend masks.
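A minimal sketch of the per-texel identifier computation described in this section is given below; the bitmask encoding and the container types are illustrative assumptions rather than the article's exact data structures.

```cpp
#include <cstdint>
#include <vector>

// For each texel, set one bit per blend mask that contributes to it (weight > 0).
// Texels sharing the same identifier and connected in texture space are later
// grouped into a single graph vertex by the flood-fill pass.
std::vector<uint32_t> BuildTexelIdentifiers(
    const std::vector<std::vector<float>>& layerWeights, // one weight map per layer
    int texelCount)
{
    std::vector<uint32_t> ids(texelCount, 0);
    for (uint32_t layer = 0; layer < layerWeights.size(); ++layer)
        for (int t = 0; t < texelCount; ++t)
            if (layerWeights[layer][t] > 0.0f)
                ids[t] |= 1u << layer;
    return ids;
}
```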

1.5.5 Texture Filtering Requirement Analysis

At this step, we build edges between graph vertices according to the following rule: if two graph vertices have adjacent pixels that are used for texture filtering, these vertices are connected. As a result, we build an undirected graph G(V, E) from the source bitmap data. V represents the areas affected by the different combinations of input texture blend masks. E represents the texture filtering relationships between these areas, as shown in Figure 1.7. The number of texels used for texture filtering determines the edge weight and will be used later for the graph partitioning.

Figure 1.7. Resulting undirected graph. Arrows represent edges which indicate the texture filtering relationships between vertices.


1.5.6 Finding the Number of Connected Components

The resulting undirected graph usually has several connected components. Our goal is to find the number of connected components and split the graph into several subgraphs between which no filtering is required (see Figure 1.8; a minimal sketch of counting the connected components is given after the figure). Each subgraph is processed as an independent graph in the next algorithm steps. If a resulting graph already fits our initial requirements and does not exceed the maximum allowed number of materials, then the next step is redundant and can be skipped. Otherwise, the next algorithm step splits the graph into several subgraphs with specific properties defined by our initial requirements.

Figure 1.8. A resulting graph with two connected components.
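Counting the connected components mentioned above is a standard graph operation; the following union-find sketch illustrates one possible way to do it (the edge-list representation is an assumption for illustration, not the article's exact code).

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Union-find over graph vertices; after processing all edges, the number of
// distinct roots equals the number of connected components (subgraphs that
// never share texture filtering and can be processed independently).
struct UnionFind
{
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int Find(int v) { return parent[v] == v ? v : parent[v] = Find(parent[v]); }
    void Union(int a, int b) { parent[Find(a)] = Find(b); }
};

int CountConnectedComponents(int vertexCount,
                             const std::vector<std::pair<int, int>>& edges)
{
    UnionFind uf(vertexCount);
    for (const auto& e : edges)
        uf.Union(e.first, e.second);

    int components = 0;
    for (int v = 0; v < vertexCount; ++v)
        if (uf.Find(v) == v)
            ++components;
    return components;
}
```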

1.5.7 Solving the Graph Partitioning Problem

At this step, we need to solve the graph partitioning problem and split a graph G(V, E), where V is the set of vertices and E is the set of edges, into smaller components with specific properties. Typically, graph partition problems are NP-hard, so we should use heuristics and approximations to solve the graph partition problem. To find the optimal solution, we use an iterative greedy algorithm to find a set of edges with minimal weight to cut. Our solution is inspired by the heuristic algorithm for graph partitioning proposed by Kernighan and Lin [1970].

Our goal is to divide the graph into subsets A and B, where subset A satisfies the initial requirements and the sum of edge weights from A to B is minimized. Since the weight of an edge is the number of pixels used for texture filtering, by minimizing the sum of edge weights from A to B we reduce the resulting filtering error. Our multi-pass algorithm maintains and improves a partition, using a greedy algorithm in each pass to pair up vertices of A with vertices of B, so that moving the paired vertices from one side of the partition to the other improves the partitioning.



Next, our algorithm chooses the best solution from all solutions that have been tried. Thus our algorithm attempts to find the optimal subset A which has a minimum sum of edge weights to cut. See Algorithm 1.2 for implementation details. One iteration step of the algorithm is illustrated in Figure 1.9.

1.5.8 Generating the Final Data

At the final step, we have a set of undirected graphs, where each graph fits our initial requirements. In practice, many resulting graphs use fewer materials than the maximum allowable number. To reduce the resulting number of clusters, we combine such graphs into larger ones as long as the merged results satisfy the initial requirements.

ForEach (source layers identifier existing in the graph G(V,E))
{
    Begin (split graph into initial subsets A and B)
    {
        Add all graph vertices with same identifier to subset A.
        Add all other graph vertices into subset B.
    }

    Loop
    {
        Calculate sum of edge weights between subsets A and B
        and store as solution.

        Find a vertex inside subset B that has the largest sum of
        edges crossing between subsets and can be moved to subset A
        without violating our constraints.

        If (such vertex found)
        {
            Move all vertices with same identifier as the found vertex
            from subset B to subset A.
        }
        Else
        {
            break
        }
    }
}
Return solution with the minimal weight between subsets.

Algorithm 1.2. Graph partitioning algorithm.


Figure 1.9. One step of the graph partitioning algorithm. The vertex V5 is moved from subset B into subset A.

Next, we build a normalized weighted representation as described in Section 1.5.3, but this time for the full-resolution texture data. For each resulting graph, we copy all texels used by the graph vertices from the weight textures into separate channels of the Cluster Weights texture. Then we store the used graph index in the Cluster Indirection texture using the R8_UINT format. Next, for the Cluster Weights, we generate a partial mipmap chain which is limited by the number of supported mipmaps and then compress the resulting texture using the BC1 or BC3 texture format. See Figure 1.10 for an example set of resulting textures.

1.5.9 Runtime Composition

For the final composition of the material at runtime, we use the following approach:

• Fetch the encoded cluster ID from the indirection texture.




Figure 1.10. Final data generated by our implementation. Cluster weight texture (left) and Indirection texture (right). Indirection texture is colorized and upscaled by 8 times for demonstration purposes. Image courtesy of Mail.Ru Group.

• Fetch the material blend weights from the weight texture.

• Read the surface properties stored in the Structured Buffer using the fetched cluster ID.

• Make the final composition using the obtained surface parameters and material blend weights.

See Listing 1.1 for an example of a basic compositing shader. Since we use a normalized weighted representation for storing the blend weights, we can change the blend weight of any individual material layer and renormalize the total sum of the weights. This allows us to change the transparency of individual layers at runtime.

Textures used for the composition usually have insufficient resolution. To increase the final resolution of the composition, we use detail textures stored in texture arrays, as described by Hamilton and Brown [2016]. To generate a UV set for the detail textures, we multiply the original UV set by a tiling factor. The tiling factor for the detail UV set can be specified per material. We use two sets of texture arrays: the first for the surface parameters (albedo, roughness, metallic) and the second for the normal maps. Each material used in the composition can use an arbitrary detail map for the surface properties and an arbitrary detail map for the surface normals. See Figure 1.11 for examples of the composition with and without detail textures. For the normal map blending, we use a weighted blending of partial derivatives as described in "Blending in Detail" [Barré-Brisebois and Hill 2012]. Additionally, we can blend only the two detail maps with the highest contribution weights at a medium distance and completely disable detail maps at a long distance to improve the composition performance.
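One possible way to express the per-layer transparency modification mentioned above (an illustrative formulation, not taken verbatim from the article) is to scale the stored weight $w_i$ of each layer by a user-controlled opacity $t_i \in [0, 1]$ and renormalize,

$$w_i' = \frac{t_i\, w_i}{\sum_j t_j\, w_j},$$

applied after the implicit weight has been reconstructed, so that the modified weights still sum to one.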


struct SurfaceParameters
{
    float3 albedo;
};

struct ClusterParameters
{
    SurfaceParameters layer0;
    SurfaceParameters layer1;
    SurfaceParameters layer2;
    SurfaceParameters layer3;
};

// Weights texture
Texture2D cWeights;

// Indirection texture
Texture2D cIndirection;

// Material parameters (stored per cluster)
StructuredBuffer<ClusterParameters> clusterParameters;

float4 DecodeAndComposition(float2 uv) : SV_Target0
{
    float4 weights;

    // Fetch weights
    weights.xyz = cWeights.Sample(samplerTrilinear, uv).rgb;

    // Reconstruct weight
    weights.w = 1.0 - weights.x - weights.y - weights.z;

    // Fetch index
    uint clusterIndex = cIndirection.Sample(samplerPoint, uv).r;

    // Get material params
    ClusterParameters params = clusterParameters[clusterIndex];

    // Use the material parameters and weights for a final composition
    float3 albedo = params.layer0.albedo * weights.x +
                    params.layer1.albedo * weights.y +
                    params.layer2.albedo * weights.z +
                    params.layer3.albedo * weights.w;

    return float4(albedo, 1.0);
}

Listing 1.1. An example of cluster decoding.




Figure 1.11. Rendering using detail maps (left) and without detail maps (right). Image courtesy of Mail.Ru Group.

1.5.10 Source Code

For demonstration purposes, we implemented our method using C# for preprocessing and the Unity game engine by Unity Technologies for the runtime materials composition. Full source code can be found in the supplemental materials of this book. Source code is also available at https://github.com/SergeyMakeev/GpuZen2.

1.6 Results

We used the approach described in this article for rendering the armored vehicles in the Armored Warfare game. In our game, users can customize the coloring and materials used for rendering the armored vehicles. The presented technique minimizes the number of textures stored on disk while supporting high-quality textures and allowing customization of the visual appearance. You can see some results of using our technique in Figure 1.12.

We compared the number of instructions generated by our technique with the number of instructions generated by the Unreal Engine 4 material layering technique; see Table 1.1. Table 1.2 shows build times and the number of resulting material clusters for an armored vehicle.

1.7 Conclusion and Future Work

The method described in this chapter helps to efficiently store and use more texture blend masks than would be allowed by existing methods, with some natural limitations.


Figure 1.12. An example of composited material and some material templates used in composition (Top) and the same composited material with different material templates applied (Bottom). Image courtesy of Mail.Ru Group.




Technique                                    Per-pixel    Total     Per-pixel number    ALU    TEX
                                             layers       layers    of detail textures
Unreal Engine 4 – MatLayerBlendSimple            4           4              4            82     14
Unreal Engine 4 – Custom material                3           3              0            23      2
Spatial Clustering Encoding – High               5          9+              5           140     13
Spatial Clustering Encoding – Standard           4          9+              4           104     11
Spatial Clustering Encoding – Fast               4          9+              1            42      4
Spatial Clustering Encoding – Fastest            3          9+              0            26      3

Table 1.1. Number of instructions resulting for different layered material techniques.

Asset     Input material    Build time    Mip count    Graph vertex    Cluster    Memory used for
          count                                        count           count      cluster parameters
Turret          9              1.9 s          3            1791            5          560 bytes
Hull            7              1.4 s          3            3937            3          336 bytes
Cannon          7              0.5 s          3             262            4          448 bytes
Wheels          6              0.9 s          3            1422            3          336 bytes
Tracks          5              0.2 s          3             268            1          112 bytes

Table 1.2. Build time and resulting cluster statistics for an example asset.

Also, the proposed method supports real-time material recomposition using the proposed data representation. For the most effective use of our method, it is necessary to take into account the texture-space connectivity of the different texture blend masks at the earliest stages of the art pipeline. At the same time, the proposed method is suitable for any existing art assets without additional preparation, with some tolerable texture filtering errors.

We continue to develop and refine the proposed technique. Here are some areas for further development:

• Nonlinear blending for the material composition, as proposed by Hardy and McRoberts [2006].

• Reducing the texture filtering errors when dividing clusters. Since the texture blend masks are order independent, we can swap the texture channels inside the cluster.

Using the least-squares minimization technique along the "seams" boundary as proposed by Iwanicki [2013], we can reduce the texture filtering error almost to zero.

• Composition after evaluating the BRDF instead of compositing the surface properties. Using this approach we can create accurate multi-layered materials with multiple specular lobes.

• Using the vertex color as a blend weight modifier for local dynamic material recomposition (dynamic dirt, scratches, etc.).

Acknowledgments

First, I would like to thank Vladimir Egorov, my friend and colleague, for his suggestions and early feedback on this article. I would also like to thank Peter Sikachev, Vadim Slyusarev, Bonifacio Costiniano, and Alexandre Chekroun for their feedback on this article. In addition, I would like to thank all Allods Team members as well.

Bibliography

BARRÉ-BRISEBOIS, C. AND HILL, S. 2012. Blending in Detail. URL: http://blog.selfshadow.com/publications/blending-in-detail.

BLOOM, C. 2000. Terrain Texture Compositing by Blending in the Frame-Buffer. URL: http://www.cbloom.com/3d/techdocs/splatting.txt.

DEGUY, S., OLGUIN, R., AND SMITH, B. 2016. Texturing Uncharted 4: a matter of Substance. Game Developers Conference 2016.

HAMILTON, A. AND BROWN, K. 2016. Photogrammetry and Star Wars Battlefront. Game Developers Conference 2016.

HARDY, A. AND MCROBERTS, D. 2006. Blend maps: enhanced terrain texturing. In SAICSIT 2006.

INSIDE UNREAL. 2013. A Look at Unreal Engine 4 Layered Materials. URL: https://www.unrealengine.com/news/look-at-unreal-engine-4-layered-materials.

IWANICKI, M. 2013. Lighting technology of The Last of Us. SIGGRAPH '13.

KARIS, B. 2013. Real Shading in Unreal Engine 4: Physically Based Shading in Theory and Practice. SIGGRAPH '13.

KERNIGHAN, B. AND LIN, S. 1970. An efficient heuristic procedure for partitioning graphs. In The Bell System Technical Journal, 49, pp. 291–307.

MITTRING, M. 2008. Advanced Virtual Texture Topics. SIGGRAPH '08.



NEUBELT, D. AND PETTINEO, M. 2013. Crafting a Next-Gen Material Pipeline for The Order: 1886. SIGGRAPH '13.

NOGUER, J. 2016. The Next Frontier of Texturing Workflows. URL: https://www.allegorithmic.com/blog/next-frontier-texturing-workflows.

2 IV

Procedural Stochastic Textures by Tiling and Blending

Thomas Deliot and Eric Heitz

2.1 Introduction

Heitz and Neyret [2018] recently introduced a new by-example procedural texturing method for stochastic textures, typically natural textures such as moss, granite, sand, bark, etc. Their algorithm takes as input a small texture example and synthesizes an infinite output with the same appearance, as in Figure 2.1.

Figure 2.1. Procedural stochastic textures by tiling and blending. Our algorithm runs in a fragment shader that requires no more than 4 texture fetches and a few computations. It can be efficiently integrated into a rendering engine.



The algorithm is a simple tiling-and-blending scheme augmented by a histogram-preserving blending operator that prevents the visual artifacts caused by linear blending. The cornerstone of the implementation is thus this new blending operator, which requires dedicated precomputations. In this chapter, we investigate the details of a practical implementation of this algorithm with some improvements compared to the original article. The chapter comes with a C++ OpenGL demo from which the code snippets are extracted.

2.2 Tiling and Blending

The fragment shader of our tiling-and-blending algorithm is illustrated in Figure 2.2. We partition the uv space on a triangle grid and compute the local triangle and the barycentric coordinates inside the triangle. We use a hash function to associate a random offset with each vertex of the triangle grid and use this random offset to fetch the example texture. Finally, we blend the result using the barycentric coordinates as blending weights. This method is fast because each pixel requires only a few computations and 3 texture fetches. The implementation is provided in Listing 2.1.

2.2.1 Tiling

In this section, we provide the implementation of the functions required for the tiling part of the algorithm in Listing 2.1.

Triangle grid. We use the equilateral-triangle lattice introduced in Simplex Noise [Perlin 2001]. Listing 2.2 provides the function that, for a given point in uv space, computes the vertices of its containing triangle and its barycentric coordinates w1, w2, w3 inside this triangle.

Figure 2.2. Tiling and blending. (a) Example. (b) Tiling and blending. Each pixel is obtained by blending three tiles from the example.






Figure 2.3. Results of tiling and blending. The tricky part of the algorithm is the blending operator. (a) Example image. (b) Linear tiling and blending. (c) Histogram-preserving tiling and blending.

sampler2D input; // Example texture

vec3 ProceduralTilingAndBlending(vec2 uv)
{
    // Get triangle info
    float w1, w2, w3;
    ivec2 vertex1, vertex2, vertex3;
    TriangleGrid(uv, w1, w2, w3, vertex1, vertex2, vertex3);

    // Assign random offset to each triangle vertex
    vec2 uv1 = uv + hash(vertex1);
    vec2 uv2 = uv + hash(vertex2);
    vec2 uv3 = uv + hash(vertex3);

    // Precompute UV derivatives
    vec2 duvdx = dFdx(uv);
    vec2 duvdy = dFdy(uv);

    // Fetch input
    vec3 I1 = textureGrad(input, uv1, duvdx, duvdy).rgb;
    vec3 I2 = textureGrad(input, uv2, duvdx, duvdy).rgb;
    vec3 I3 = textureGrad(input, uv3, duvdx, duvdy).rgb;

    // Linear blending
    vec3 color = w1 * I1 + w2 * I2 + w3 * I3;
    return color;
}

Listing 2.1. Tiling and blending.


With this partitioning of the uv space, each vertex is associated with a hexagonal tile chosen randomly in the input image, such that each point is covered by exactly 3 tiles and each tile is weighted by a function falling to 0 at the borders, with the sum of the weights equal to 1 everywhere (w1 + w2 + w3 = 1).

Note that the constant 2√3 controls the size of the input with respect to the size of the tiles. With this value, the height of a hexagonal tile is half the size of the input texture, which works well in general. This parameter can be adjusted depending on the input. Using larger tiles (decreasing the constant) captures more large-scale features but is more prone to visible repetitions. Using smaller tiles (increasing the constant) increases the variety of the tiles but misses large-scale features.

// Compute local triangle barycentric coordinates and vertex IDs
void TriangleGrid(vec2 uv,
                  out float w1, out float w2, out float w3,
                  out ivec2 vertex1, out ivec2 vertex2, out ivec2 vertex3)
{
    // Scaling of the input
    uv *= 3.464; // 2 * sqrt(3)

    // Skew input space into simplex triangle grid
    const mat2 gridToSkewedGrid = mat2(1.0, 0.0, -0.57735027, 1.15470054);
    vec2 skewedCoord = gridToSkewedGrid * uv;

    // Compute local triangle vertex IDs and local barycentric coordinates
    ivec2 baseId = ivec2(floor(skewedCoord));
    vec3 temp = vec3(fract(skewedCoord), 0);
    temp.z = 1.0 - temp.x - temp.y;
    if (temp.z > 0.0)
    {
        w1 = temp.z;
        w2 = temp.y;
        w3 = temp.x;
        vertex1 = baseId;
        vertex2 = baseId + ivec2(0, 1);
        vertex3 = baseId + ivec2(1, 0);
    }
    else
    {
        w1 = -temp.z;
        w2 = 1.0 - temp.y;
        w3 = 1.0 - temp.x;
        vertex1 = baseId + ivec2(1, 1);
        vertex2 = baseId + ivec2(1, 0);
        vertex3 = baseId + ivec2(0, 1);
    }
}

Listing 2.2. Computing the local triangle vertices and barycentric coordinates.

Hash function. We use the hash function given in Listing 2.3 to associate a random offset with each vertex of the triangle grid and use it to fetch the example texture. The choice of the hash function does not really matter as long as it provides enough randomness and does not introduce visible correlations between neighboring tiles.

vec2 hash(vec2 p)
{
    return fract(sin((p) * mat2(127.1, 311.7, 269.5, 183.3)) * 43758.5453);
}

Listing 2.3. The hash function used to randomize the tiles.

Fetching the example texture. We fetch the input texture with mipmapping and anisotropic filtering like a conventional texture. Note that the hardware uses screen-space derivatives to compute the mipmap level and parameterize its anisotropic filter. Typically, these derivatives are computed with the finite differences between neighboring pixels of the uv positions passed as argument to the texture function. In our case, these screen-space derivatives are broken by the random offsets if neighboring pixels are not in the same triangle. To avoid this problem, in Listing 2.1 we compute the uv derivatives before adding the random offsets and we pass them explicitly to the texture2DGrad function.

2.2.2 Blending

In this section, we address the blending part of the algorithm in Listing 2.1.

The problem of linear blending. Listing 2.1 implements a classic linear blending operator:

I  w 1 I 1  w 2 I 2  w 3 I 3.

(2.1)


Unfortunately, it does not yield satisfying results, as shown in Figure 2.3(b). The result has heterogeneous contrast and exhibits a grid-revealing pattern. Heitz and Neyret explain that the problem of linear blending is that it does not preserve the statistical properties of the input, i.e., its histogram. The problem is thus to find a blending operator that preserves the histogram.

Variance-preserving blending. Heitz and Neyret notice that in the special case where the input has a Gaussian histogram, variance-preserving blending preserves the Gaussian histogram. The expression of this operator is

$$G = \frac{w_1 G_1 + w_2 G_2 + w_3 G_3 - E[G]}{\sqrt{w_1^2 + w_2^2 + w_3^2}} + E[G], \qquad (2.2)$$

where the expectation $E[G]$ is the average color of the Gaussian input.

Histogram-preserving blending. To generalize this idea to arbitrary non-Gaussian inputs, Heitz and Neyret use a histogram transformation $T$ that makes the input Gaussian, blend with the variance-preserving blending of Equation (2.2), and finally apply the inverse histogram transformation $T^{-1}$. The overview of this algorithm is provided in Figure 2.4. This operator provides better results than linear blending, as shown in Figure 2.3(c). The following is dedicated to the implementation of this operator in the tiling-and-blending algorithm. For more details on histogram-preserving blending, we refer the reader to the original article [Heitz and Neyret 2018].

Precomputations. The histogram-preserving version of the tiling-and-blending algorithm requires the Gaussian version of the input $T(I)$ and the inverse histogram transformation $T^{-1}$. We pass them to the fragment shader as textures in Listing 2.4. Section 2.3 is dedicated to the precomputation of these textures.

uniform sampler2D Tinput; // Gaussian input T(I)
uniform sampler2D invT;   // Inverse histogram transformation T^{-1}

Listing 2.4. Textures for histogram-preserving blending.

Fragment shader. We update the blending step of Listing 2.1 with the instructions provided in Listing 2.5. Instead of sampling the original input, we sample the Gaussian input stored in texture Tinput and we use the variance-preserving blending operator of Equation (2.2). Finally, we apply the inverse histogram transformation by fetching the precomputed look-up table stored in texture invT.


Figure 2.4. Tiling and blending with histogram-preserving blending.




// Sample Gaussian values from transformed input
vec3 G1 = textureGrad(Tinput, uv1, duvdx, duvdy).rgb;
vec3 G2 = textureGrad(Tinput, uv2, duvdx, duvdy).rgb;
vec3 G3 = textureGrad(Tinput, uv3, duvdx, duvdy).rgb;

// Variance-preserving blending
vec3 G = w1 * G1 + w2 * G2 + w3 * G3;
G = G - vec3(0.5);
G = G * inversesqrt(w1 * w1 + w2 * w2 + w3 * w3);
G = G + vec3(0.5);

// Fetch LUT
vec3 color;
color.r = texture(invT, vec2(G.r, 0)).r;
color.g = texture(invT, vec2(G.g, 0)).g;
color.b = texture(invT, vec2(G.b, 0)).b;

Listing 2.5. Implementation of histogram-preserving blending in Listing 2.1.

2.3 Precomputing the Histogram Transformations

This section is dedicated to the C++ precomputation of the histogram transformation $T$ applied on the input image and the inverse histogram transformation $T^{-1}$ stored in a look-up table, which are passed to the fragment shader in Listing 2.4.

2.3.1 Target Gaussian Distribution

As shown in Figure 2.4, $T$ is a histogram transformation that makes the input distributed as a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ whose Probability Density Function (PDF) is

$$\mathrm{PDF}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right). \qquad (2.3)$$

To do this, we need to choose the parameters of the Gaussian distribution we will be using and recall some of its properties.

Parameters. We choose the target Gaussian distribution of parameters $\mu = 1/2$ and $\sigma^2 = 1/6^2$. With these parameters, the distribution fits well in the interval $[0, 1]$ (the range $\mu \pm 3\sigma$ is exactly $[0, 1]$) and can be stored with 8-bit precision.



Cumulative Distribution Function. The histogram transformation T in Section 2.3.2 requires the Cumulative Distribution Function (CDF) of the Gaussian distribution. It is the function that computes the quantile values of the distribution at a given position x:

$$\mathrm{CDF}(x) = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right). \qquad (2.4)$$

A quantile value $U = \mathrm{CDF}(x)$ is the integral of the distribution below $x$. For instance, if $U = 0.30$ it means that 30% of the integral is below $x$ and 70% is above.

float CDF(float x, float mu, float sigma)
{
    float U = 0.5f * (1 + Erf((x - mu) / (sigma * sqrtf(2.0f))));
    return U;
}

Listing 2.6. Cumulative Distribution Function (CDF) of a Gaussian.

Inverse Cumulative Distribution Function. The inverse histogram transformation $T^{-1}$ in Section 2.3.3 requires the inverse CDF:

$$\mathrm{CDF}^{-1}(U) = \mu + \sigma\sqrt{2}\,\operatorname{erf}^{-1}(2U - 1). \qquad (2.5)$$

It computes the quantile $x = \mathrm{CDF}^{-1}(U)$ of a given value $U \in [0, 1]$.

float invCDF(float U, float mu, float sigma)
{
    float x = sigma * sqrtf(2.0f) * ErfInv(2.0f * U - 1.0f) + mu;
    return x;
}

Listing 2.7. Inverse Cumulative Distribution Function (ICDF) of a Gaussian.
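Listings 2.6 and 2.7 rely on Erf and ErfInv helpers that are provided with the book's sample code. For completeness, here is one possible stand-in built on std::erf and a Newton iteration; it is an illustrative sketch, not the demo's implementation, and a dedicated rational approximation would be faster.

```cpp
#include <cmath>

float Erf(float x)
{
    return std::erf(x);
}

// Inverse error function solved by Newton iterations on erf(x) - y = 0,
// using the derivative d/dx erf(x) = 2 / sqrt(pi) * exp(-x * x).
float ErfInv(float y)
{
    float x = 0.0f;
    for (int i = 0; i < 50; ++i)
    {
        float err = std::erf(x) - y;
        x -= err / (2.0f / std::sqrt(3.14159265f) * std::exp(-x * x));
    }
    return x;
}
```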

2.3.2 Applying the Histogram Transformation T on the Input

In this section, we show how to apply the histogram transformation $T$ on the input (Step 1 in Figure 2.4). Our algorithm makes each color channel of the input distributed as the target Gaussian chosen in Section 2.3.1.



Figure 2.5. Histogram transformation of the input. We sort the pixel values I of the input and we map them to sorted values G from the target Gaussian distribution.

Algorithm. A discrete 1D histogram transformation $T$ is typically done by replacing sorted values $I$ from the input by the same number of sorted values $G$ from the target histogram, as shown in Figure 2.5.

Implementation. In Listing 2.8, we start by sorting the values of the input image. For this purpose, we use a structure PixelSortStruct that stores the coordinates and the value of a pixel. Then, we go through the sorted list of pixel values and for the $i$-th element we compute its quantile value $U = \frac{i + 1/2}{N}$. It means that $U\%$ of the list is before this element and $(1 - U)\%$ is after. We replace the pixel value by the same quantile in the Gaussian distribution using the inverse CDF of Equation (2.5): $G = \mathrm{CDF}^{-1}(U)$.

void ComputeTinput(TextureDataFloat& input, TextureDataFloat& Tinput, int channel)
{
    // Sort pixels of example image
    vector<PixelSortStruct> sortedInputValues;
    sortedInputValues.resize(input.width * input.height);
    for (int y = 0; y < input.height; y++)
    for (int x = 0; x < input.width; x++)
    {
        sortedInputValues[y * input.width + x].x = x;
        sortedInputValues[y * input.width + x].y = y;
        sortedInputValues[y * input.width + x].value =
            input.GetPixel(x, y, channel);
    }
    sort(sortedInputValues.begin(), sortedInputValues.end());

    // Assign Gaussian value to each pixel
    for (unsigned int i = 0; i < sortedInputValues.size(); i++)
    {
        // Pixel coordinates
        int x = sortedInputValues[i].x;
        int y = sortedInputValues[i].y;

        // Input quantile (given by its order in the sorting)
        float U = (i + 0.5f) / sortedInputValues.size();

        // Gaussian quantile
        float G = invCDF(U, GAUSSIAN_AVERAGE, GAUSSIAN_STD);

        // Store
        Tinput.SetPixel(x, y, channel, G);
    }
}

Listing 2.8. Applying the histogram transformation T on the input.

2.3.3 Precomputing the Inverse Histogram Transformation T⁻¹

In this section, we show how to compute the inverse histogram transformation T⁻¹ that maps Gaussian values to values from the input and store it in a look-up table (Step 3 in Figure 2.4).

Algorithm. The algorithm consists in mapping sorted values, as in the previous section (Figure 2.5). However, the computation of the values is different. Since we use a Gaussian distribution that can be well represented in the interval [0, 1], we parameterize the look-up table on this interval and we associate quantiles of the Gaussian distribution in [0, 1] with quantiles of the pixel values.

Implementation. In Listing 2.9, we start by sorting the values of the input image. Note that an optimized implementation could reuse the sorting step of Listing 2.8. Then, we go through the texels of the look-up table that parameterizes the interval [0, 1] such that the i-th texel over N is associated with the position x = (i + 1/2)/N. We compute the Gaussian quantile value at this position using Equation (2.4): U = CDF(x), and we pick the same quantile in the sorted pixel values, i.e., we fetch the element at index ⌊U · M⌋ in the sorted list if it has M entries. This is the value that we store in the look-up table.

void ComputeinvT(TextureDataFloat& input, TextureDataFloat& invT, int channel)
{
    // Sort pixels of example image
    vector<float> sortedInputValues;
    sortedInputValues.resize(input.width * input.height);
    for (int y = 0; y < input.height; y++)
    for (int x = 0; x < input.width; x++)
    {
        sortedInputValues[y * input.width + x] = input.GetPixel(x, y, channel);
    }
    sort(sortedInputValues.begin(), sortedInputValues.end());

    // Generate invT look-up table
    for (int i = 0; i < invT.width; i++)
    {
        // Gaussian value in [0, 1]
        float G = (i + 0.5f) / (invT.width);
        // Quantile value
        float U = CDF(G, GAUSSIAN_AVERAGE, GAUSSIAN_STD);
        // Find quantile in sorted pixel values
        int index = (int)floor(U * sortedInputValues.size());
        // Get input value
        float I = sortedInputValues[index];
        // Store in LUT
        invT.SetPixel(i, 0, channel, I);
    }
}

Listing 2.9. Precomputing the inverse histogram transformation T⁻¹ and storing it in a look-up table.

2.3.4 Discussion

With the fragment shader of Section 2.2 and the precomputations of Section 2.3 we already have a standalone implementation. However, this implementation has several shortcomings: color problems might appear with some inputs, because the histogram transformations are computed separately per channel, and mipmapping is incompatible with the use of a look-up table. Sections 2.4 and 2.5 are dedicated to overcoming these shortcomings.

2.4 Improvement: Using a Decorrelated Color Space

Our method, as described so far, occasionally produces procedural textures that exhibit colors that were not present in the example texture, as in Figure 2.6(b). In this section, we show how to reduce this problem by using a decorrelated color space, as in Heeger and Bergen [1995].

2.4.1 The Problem with Color-space Correlations

In Section 2.3, we computed histogram transformations for each color channel separately, which occasionally produces wrong colors in the output. Indeed, the histogram of an RGB image is not the composition of three 1D functions but rather one 3D function or a 3D point cloud, as shown in Figure 2.6(a). This 3D histogram might have inter-channel correlations, and transforming the channels separately does not preserve these correlations. For instance, the result of Figure 2.6(b) has the same 1D histogram as the input for each channel. However, since the inter-channel correlations are not preserved, the 3D shape of this histogram is not preserved and wrong colors appear in the result. We obtained the result of Figure 2.6(c) by using a color space in which the channels are not correlated, such that processing them separately is less prone to this problem.

Figure 2.6. Improvement: using a decorrelated color space. (a) Input, (b) procedural (RGB space), (c) procedural (decorrelated space). If the color channels are correlated, processing them separately might introduce wrong colors (b) that were not present in the input (a). We reduce this problem by using a color space in which the channels are not correlated (c).

2.4.2 Decorrelating the Color Space

We precompute the color space transformation before Step 1 in Figure 2.4 and revert it at the end of the fragment shader after Step 3 in Figure 2.4.

Precomputation. We start by computing the covariance matrix of the input histogram and extracting its eigenvectors, which means extracting the principal axes of the point cloud given by the pixels' colors in RGB space, as shown in Figure 2.7(a). Along these eigenvectors, the coordinates of the points are statistically decorrelated. Then, we compute the bounding box of the point cloud aligned with these eigenvectors and we find the coordinates of the points in this bounding box, as shown in Figure 2.7(b). With this parameterization, all the points are parameterized by

    P = O + v_1 V_1 + v_2 V_2 + v_3 V_3 \quad \text{with} \quad (v_1, v_2, v_3) \in [0,1]^3,    (2.6)

where the bounding box is defined by its corner O and its orthogonal axes V₁, V₂, and V₃.
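A rough sketch of the first part of this precomputation (our own code, not the chapter's), estimating the 3×3 covariance matrix of the input colors; an eigensolver such as Jacobi iteration or a linear-algebra library would then provide the decorrelated axes V₁, V₂, V₃:

#include <array>

// Covariance of the RGB point cloud of the input texture. Assumes the
// TextureDataFloat interface used by the other listings (width, height,
// GetPixel(x, y, channel)).
std::array<float, 9> ComputeColorCovariance(const TextureDataFloat& input)
{
    int n = input.width * input.height;
    float mean[3] = {0.0f, 0.0f, 0.0f};
    for (int y = 0; y < input.height; ++y)
    for (int x = 0; x < input.width; ++x)
        for (int c = 0; c < 3; ++c)
            mean[c] += input.GetPixel(x, y, c) / float(n);

    std::array<float, 9> cov = {};
    for (int y = 0; y < input.height; ++y)
    for (int x = 0; x < input.width; ++x)
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                cov[3 * i + j] += (input.GetPixel(x, y, i) - mean[i]) *
                                  (input.GetPixel(x, y, j) - mean[j]) / float(n);
    return cov;
}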

Figure 2.7. Parameterization of the decorrelated color space. (a) Eigenvectors, (b) parameterization.

Fragment shader. Before returning, the fragment shader transforms the result back to the original color space with Equation (2.6). This is done in the function ReturnToOriginalColorSpace provided in Listing 2.10.

uniform vec3 colorSpaceVector1;
uniform vec3 colorSpaceVector2;
uniform vec3 colorSpaceVector3;
uniform vec3 colorSpaceOrigin;

vec3 ReturnToOriginalColorSpace(vec3 color)
{
    vec3 result = colorSpaceOrigin
        + colorSpaceVector1 * color.r
        + colorSpaceVector2 * color.g
        + colorSpaceVector3 * color.b;
    return result;
}

Listing 2.10. Return to the original color space in the fragment shader.
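The forward transform (projecting an input color onto the decorrelated axes before the per-channel histogram transformations) is not listed in the chapter. A sketch is shown below in the same GLSL-style notation as Listing 2.10 for symmetry, although in practice this step runs on the CPU during precomputation; the names follow the bounding-box parameterization of Equation (2.6):

// Inverse of Listing 2.10 (our own sketch). The axes are orthogonal but not
// unit length (they span the bounding box), so each coordinate is a dot
// product normalized by the squared axis length.
vec3 ToDecorrelatedColorSpace(vec3 color)
{
    vec3 d = color - colorSpaceOrigin;
    return vec3(dot(d, colorSpaceVector1) / dot(colorSpaceVector1, colorSpaceVector1),
                dot(d, colorSpaceVector2) / dot(colorSpaceVector2, colorSpaceVector2),
                dot(d, colorSpaceVector3) / dot(colorSpaceVector3, colorSpaceVector3));
}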

2.5 Improvement: Prefiltering the Look-up Table

Our method, as described so far, uses mipmap levels to fetch the Gaussian input in Step 2 of Figure 2.4. However, when the lower levels of detail are used, comparing its appearance to a regular tiling of the input reveals an issue of color deviation, as shown in Figure 2.8. In this section, we show how to solve this problem by prefiltering the look-up table.

Figure 2.8. Improvement: prefiltering the look-up table. (a) Input (repeat), (b) procedural, (c) procedural (prefiltered LUT). The procedural texture uses a look-up table on top of the mipmapped input. This results in a noticeable color shift as we zoom out (b) compared to the input (a). We solve this problem by prefiltering the look-up table (c).

2.5.1 The Problem with Texture Filtering and Look-up Tables

Figure 2.9. Classic texture filtering.

Classic texture filtering. To understand the problem when filtering our procedural texture, we look into the equation of classic texture filtering. We define texture(uv) as the color of the input texture at a position uv and P as the domain covered by the pixel footprint. Figure 2.9 illustrates that the filtered color is the integral of the texture over the pixel footprint:

    \int_P \text{texture}(uv)\, duv.    (2.7)

Texture mipmapping (with anisotropic filtering for more accuracy) provides a fast way to evaluate this integral.

Procedural texture filtering (reference). Our tiling-and-blending method computes a procedural texture that is the composition of the Gaussian input texture and a look-up table (LUT) that contains the inverse histogram transformation:

    \text{procedural texture}(uv) = \text{LUT}(\text{texture}(uv)).    (2.8)

If we apply Equation (2.7) to this formulation, we obtain the following filtering equation:

    \text{filtered procedural texture} = \int_P \text{LUT}(\text{texture}(uv))\, duv.    (2.9)

As shown in Figure 2.10, this integral can be computed by sampling the values of the texture over the footprint P, passing the values through the look-up table, and averaging the results. Unfortunately, this process is too costly, and we would thus like to use mipmapping instead, as for a conventional texture.


Figure 2.10. Filtering the procedural texture. The correct filtering averages the values after the application of the look-up table. Filtering the texture before and applying the look-up table after does not produce the same result.

Filtering the procedural texture (wrong). A simple approach consists in using a mipmapped version of the input texture, fetching a single sample from it as for a conventional texture, and then passing it through the look-up table, as shown in Figure 2.10. However, this computes

    \text{filtered procedural texture} \approx \text{LUT}\left(\int_P \text{texture}(uv)\, duv\right),    (2.10)

which is not the right result because the integral and the look-up table do not commute:

    \text{LUT}\left(\int_P \text{texture}(uv)\, duv\right) \neq \int_P \text{LUT}(\text{texture}(uv))\, duv.    (2.11)

This inequality explains the color difference between Figure 2.8(a) and (b).

2.5.2 Alternative Filtering Formulation with a Look-up Table

We use the solution of Heitz et al. [2013] to the problem of filtering procedural textures with look-up tables (also called “color maps”). Their solution is based on the observation that the reference result of Equation (2.9) is a weighted average of values from the look-up table. Hence, the equation can be rewritten

    \text{filtered procedural texture} = \int \text{LUT}(t)\, H_P(t)\, dt,    (2.12)

where H_P gives the weight of each entry of the look-up table. This weight depends on the distribution of texture values t inside the pixel footprint P. The more a value t of the texture is represented, the more the entry LUT(t) contributes to the weighted average. Hence, H_P is the histogram of the values of the texture inside P. This equivalence is shown in Figure 2.11.

Implementation with a prefiltered look-up table. Applying the result of Equation (2.12) in practice requires estimating H_P for a given footprint P and computing its product integral with the look-up table. To do this in real time, we approximate H_P by a Gaussian distribution and use a look-up table prefiltered with a Gaussian filter for each level of detail of the input texture. The motivation for this approximation is that the texture effectively has a Gaussian histogram. Hence, the approximation becomes exact at the highest level of detail and remains reasonable at intermediate levels.

Figure 2.11. Alternative filtering formulation with a look-up table. Filtering the texture with the look-up table is equivalent to convolving the look-up table by the histogram of the texture values inside the pixel footprint.

2.5.3 Computing and Fetching the Prefiltered Look-up Table

Precomputation. In our implementation, we prefilter the look-up table in a function PrefilterLUT. This function creates a 2D look-up table whose width is the same as the unfiltered look-up table and whose height is the number of levels of detail of the input texture. For each level of detail L we compute the average variance in all the subwindows of width 2^L. At the first level of detail the variance is 0 and at the highest level of detail the variance is the variance of the full Gaussian texture, which is 1/6² as explained in Section 2.3.1. For each level of detail, we filter the look-up table by a Gaussian filter of the associated variance.

Fragment shader. We update the fragment shader in Listing 2.11, where we use the function textureQueryLod to obtain the level of detail of the input texture and remap it to a value in [0, 1] to obtain a y coordinate to fetch the look-up table.

// Compute LOD level to fetch the prefiltered look-up table invT
float LOD = textureQueryLod(Tinput, uv).y / float(textureSize(invT, 0).y);

// Fetch prefiltered LUT (T^{-1})
vec3 color;
color.r = texture(invT, vec2(G.r, LOD)).r;
color.g = texture(invT, vec2(G.g, LOD)).g;
color.b = texture(invT, vec2(G.b, LOD)).b;

Listing 2.11. Fetching the prefiltered look-up table in the fragment shader.
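The PrefilterLUT precomputation itself is only referenced, not listed, in the chapter. The sketch below, written under our own assumptions, filters each row of the 2D look-up table by a Gaussian whose standard deviation grows from 0 at the finest level of detail to the full Gaussian texture standard deviation 1/6 at the coarsest level; a linear ramp stands in for the per-subwindow variance measurement described above:

#include <cmath>

// lut is the unfiltered 1D table (width x 1); lutFiltered is width x lodCount.
void PrefilterLUT(const TextureDataFloat& lut, TextureDataFloat& lutFiltered,
                  int lodCount, int channel)
{
    for (int lod = 0; lod < lodCount; ++lod)
    {
        // Stand-in for the per-LOD variance estimate (see the text above).
        float t = (lodCount > 1) ? float(lod) / float(lodCount - 1) : 1.0f;
        float std = (1.0f / 6.0f) * t;

        for (int i = 0; i < lut.width; ++i)
        {
            float sum = 0.0f, wsum = 0.0f;
            for (int j = 0; j < lut.width; ++j)
            {
                float d = (j - i) / float(lut.width);
                float w = (std > 0.0f) ? std::exp(-d * d / (2.0f * std * std))
                                       : (i == j ? 1.0f : 0.0f);
                sum  += w * lut.GetPixel(j, 0, channel);
                wsum += w;
            }
            lutFiltered.SetPixel(i, lod, channel, sum / wsum);
        }
    }
}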

2.6 Improvement: Using Compressed Texture Formats

In Figure 2.12, we test our algorithm with the DXT1 compressed texture format applied to the Gaussian version of the input Tinput and the look-up table invT. We notice that the compression occasionally introduces visible artifacts when it is applied directly to our textures (Figure 2.12(b)), and a modification is necessary to support a compressed texture format.

The problem is that our histogram transformation makes all the channels have the same range of Gaussian values. This impacts the quality of the compression because the compressor optimizes an error that has become equally distributed among the channels, while the true error should be more important for channels with wide ranges. Fixing this issue is simple: instead of using the same Gaussian distribution N(1/2, 1/6²) for all the channels, we scale the Gaussian distribution such that its standard deviation around the average 1/2 becomes proportional to the actual range of the channel data. We do this modification just before sending the data to the DXT compressor and we revert it in the fragment shader. With this minor modification we were able to fix the issue and safely use the DXT1 format for all our textures (Figure 2.12(c)). Our C++ OpenGL demo provides a binary flag #define USE_DXT_COMPRESSION that enables these modifications.
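A small sketch of the fragment-shader revert described above (the uniform name and helper are our own assumptions, not code from the demo):

// Undo the per-channel scaling that was applied before DXT compression so
// that G is again distributed as N(1/2, 1/6^2) before the LUT fetch.
uniform vec3 uChannelScale; // per-channel scale applied at compression time

vec3 UnscaleGaussian(vec3 G)
{
    return (G - vec3(0.5)) / uChannelScale + vec3(0.5);
}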


Figure 2.12. Using a compressed texture format. (a) RGB8, (b) DXT1, (c) DXT1 (fixed). The DXT1 texture format fails with some inputs if it is applied directly to the textures computed by our algorithm (b). We fix this problem by scaling the range of the Gaussian texture (c).

2.7 Results

Performance and storage. In Table 2.1, we compare the performance and storage of our method to a classic texture repeat, as in Figures 2.13 and 2.14. On average, our method is 4–5 times costlier, which makes sense since we fetch the input 3 times, use one additional look-up table fetch, and use a few additional operations. The repeated tiling only requires the storage of the input texture, while our method requires the storage of the Gaussian input Tinput and the look-up table invT. Since the Gaussian input has the same size as the input, the memory overhead of our method is only the storage of the look-up table, which is small in comparison.

Generative textural space. Our method is dedicated to stochastic textures, such as the rock in Figure 2.15. It does not produce plausible results if the input presents a strong pattern-like organization, as in Figure 2.16.

Input Size   Format   Tinput      invT    Repeat      Procedural
64²          RGB8     16 KB       2 KB    0.035 ms    0.179 ms
128²         RGB8     65 KB       3 KB    0.035 ms    0.180 ms
256²         RGB8     262 KB      3 KB    0.036 ms    0.181 ms
512²         RGB8     1048 KB     3 KB    0.039 ms    0.186 ms
1024²        RGB8     4194 KB     4 KB    0.052 ms    0.200 ms
2048²        RGB8     16777 KB    5 KB    0.112 ms    0.341 ms
64²          DXT1     3 KB        1 KB    0.035 ms    0.180 ms
128²         DXT1     11 KB       1 KB    0.035 ms    0.180 ms
256²         DXT1     48 KB       1 KB    0.035 ms    0.180 ms
512²         DXT1     174 KB      1 KB    0.036 ms    0.180 ms
1024²        DXT1     699 KB      1 KB    0.039 ms    0.182 ms
2048²        DXT1     2796 KB     1 KB    0.046 ms    0.207 ms

Table 2.1. Performance and storage comparison. We compare our method to a single texture fetch in a repeated texture for various sizes of the input texture and storage formats. The Tinput and invT columns are memory; the Repeat and Procedural columns are timings. The classic repeat requires only the storage of the input texture. Our method requires the storage of the Gaussian version of the input Tinput, which has the same size as the input, and the look-up table invT. We measured the performance by rendering a full-screen quad at 1920 × 1080 resolution on a GeForce GTX 980.

2.8 Conclusion

We have presented an implementation of our procedural texturing algorithm that works well for breaking the repetition of tiled textures. This algorithm is meant to be used with stochastic textures (moss, granite, sand, etc.) and cannot be used with repetitive or strongly correlated patterns. It has little memory overhead, works well with the compressed DXT texture format, and is about four times the cost of a classic repeated texture tiling. Finally, it is straightforward to adapt it to inputs other than RGB color data, such as the normal map in Figure 2.14.

Acknowledgments

This chapter is the result of Thomas Deliot’s master’s thesis, which was supervised by Eric Heitz. Both authors conducted this work at Unity Technologies.


Figure 2.13. Comparison of classic texture repeat (a) and our procedural texturing algorithm (b) applied on the ground texture of a video game scene.

Figure 2.14. Our algorithm applied on non-RGB input. We compare classic texture repeat (a) and our procedural texturing algorithm (b) on a small-scale skin pore normal map.


Figure 2.15. Our procedural texturing algorithm applied on a rock texture.

Figure 2.16. Failure case of our method. Our method does not produce plausible results if the input presents a strong pattern-like organization.


Bibliography

HEEGER, D. AND BERGEN, J. 1995. Pyramid-based Texture Analysis/Synthesis. In Proceedings of ACM SIGGRAPH ’95, pp. 229–238.

HEITZ, E. AND NEYRET, F. 2018. High-Performance By-Example Noise using a Histogram-Preserving Blending Operator. In Proceedings of the ACM on Computer Graphics and Interactive Techniques, 1:2, pp. 25.

HEITZ, E., NOWROUZEZAHRAI, D., POULIN, P., AND NEYRET, F. 2013. Filtering Color Mapped Textures and Surfaces. In Proceedings of the Symposium on Interactive 3D Graphics and Games 2013, pp. 129–136.

PERLIN, K. 2001. Noise hardware. Real-time shading languages, SIGGRAPH 2001 Course.

3 IV

A Ray Casting Technique for Baked Texture Generation
Alain Galvan and Jeff Russell

Baking is a process of transferring surface data from high-polygon geometry to a texture meant to be used by low-polygon geometry. By baking, static data such as a model’s normals, vertex colors, displacement, and even shading terms like ambient occlusion and diffuse lighting can be efficiently represented on a low-polygon mesh. In this way, greater detail can be provided with vastly reduced geometry. This process has been crucial in asset preparation for real-time systems for many years, and will likely continue to see heavy use.

Figure 3.1. An example scene from Marmoset Toolbag 3 showing a model baked using our technique.


As baking is generally an offline process, early tools relied on CPU processing for greater flexibility. Computation times were on the order of minutes and hours. More recent implementations have used GPU processing to great effect. Increased parallelism as well as improved support for general computing, in particular ray tracing, have made GPU baking an appealing choice. Modern systems can bake results in seconds or even in real time.

This chapter outlines a technique to bake geometry on the GPU with user input, as well as a number of potential pitfalls inherent in the process. The technology demo that this chapter follows is based on the baker used in Marmoset Toolbag. This example can be downloaded at: https://github.com/alaingalvan/gpu-zen-2-baker/

3.1 Baking in Practice

The primary task of any baking system is finding the corresponding high-polygon surface for a given point on the low-polygon mesh. Typically an artist will provide two meshes: one high-polygon mesh at full detail (hereinafter referred to as the reference model), and a low-polygon version of the same model meant for use in a real-time renderer (hereinafter referred to as the working model) [Teixeira 2008]. The expected result is an image of the surface of the reference model, laid out according to the texture coordinates of the working model.

This projection is performed by tracing rays from the surface of the working model into the reference model. To determine these intersections, it is best to store the reference model in an acceleration structure such as a k-d tree [Lantz 2013] for efficient traversal. Once the nearest intersecting triangle has been found for a given ray, the desired vertex properties can be interpolated with the barycentric coordinates of the intersection and returned.

Figure 3.2. A model representation of baking, where rays are cast from the working model to the reference model.

This conceptually straightforward process is complicated considerably by differences between the reference and working models. While the shape and size of the two meshes are generally very similar, the reference mesh is not always completely bounded by the working mesh (or vice versa), and as a result there is some question about how to find the correct corresponding points between the two. Rays cast from the surface of the working model may start inside the reference model, and thereby miss their intended targets on the reference mesh.

A common solution to this problem is to allow the user to specify, either procedurally or explicitly with a third mesh, a “cage” that completely surrounds the reference model and serves as the origin for sample rays. This bounding of the reference mesh greatly reduces the possibility of incorrect intersections, as it correctly encloses areas of the reference mesh both “above” and “below” the working model.

Figure 3.3. A model of baking which includes a Cage projected from the working model.

Another issue with baking projection relates to ray directions. In general, sample rays should point “inward” in the direction of the reference model. However, the specific direction chosen obviously has a significant effect on the sample’s ultimate result. A simple approach to determine sample directions is to make sample rays run parallel to the working model’s polygon normals. This works somewhat well, but has the major drawback of creating discontinuities along polygon edges, as Figure 3.4 shows.

Figure 3.4. A comparison of baking projection techniques. The left shows ray directions based on the average normal of the surface, and the right shows rays based on the working model’s face normal. Note the discontinuities on the edges of different faces.


A more common alternative approach is to use the interpolated vertex normals of the working mesh to specify sample ray directions. This removes any first-order discontinuities in the projection, and is a fairly robust default. Most models bake fairly well with this technique.

In practice, artists will often require a combination of these two techniques for sample directions. This need arises due to “skewing”, a kind of distortion of the bake result that occurs as a result of the gradient of change of the sample directions. Details from the reference model may appear undesirably skewed or otherwise distorted in these cases. Where this occurs, choosing sample directions closer to the face normals (as in Figure 3.5) will mitigate the distortion.

We introduce the idea of a Skew Map, a grayscale map that allows users to specify which ray direction to use. The ray’s direction is thus determined by interpolating between the computed face normal and the outward smooth normal on a per-pixel basis. We also introduce the idea of an Offset Map, which allows users to specify the ray origin offset magnitude for certain parts of the mesh. Users specify a minimum and maximum offset, and ray origins are determined by interpolating between these min and max distance values along the required ray direction. In short, the offset map allows the user fine-grained control over how far the ray origin should “back up” from the working mesh, creating a de facto cage enclosure.

Figure 3.5. A visualization of a cage determined by a positional offset map and skew map. The left shows the offset cage colored with its skew map; the right shows the direction of rays projected from that cage, with the mesh rotated 180 degrees for easier visualization. Red indicates that rays should point in the direction of the working model’s normal and green indicates that rays should point in the direction of the calculated smooth normal.

A baking system with this degree of user control has proven to be a powerful tool. Users may paint skew and offset maps in texture coordinate space or in the 3D viewport, and quickly address baking errors that have been difficult to fix in the past. The process of baking models remains more art than science, but by exposing the technical aspects of baking in an intuitive way artists are able to work more effectively.

In summary, the following data are needed to bake a mesh with our approach:

• Output Render Target. The output render target, which can vary in format and size.

• Working Model. The low-polygon geometry used to determine smooth normals and used as input for any final transformations to output fragments.

• Reference Model. The high-polygon geometry used when searching for ray collisions.

• Skew Map. A map that allows users to interpolate between raycasting in the direction of the computed smooth normal and the direction of the working model’s normal.

• Offset Map. A map that defines the offset between the cage and the working model.

• Offset Bounds. A minimum and maximum offset value to offset the cage from the working model.

3.1.1 Pitfalls

One pitfall we encountered, however, was in the computation of smooth normals. We noticed that for convex geometry such as holes or corners, the smooth direction would point away from the reference model. To fix this for a given vertex, the starting ray’s direction should be the average normal if the dot product of it and that vertex normal is greater than 0; otherwise it should be reflected by that vertex’s normal.

In addition, when baking geometry, it is sometimes necessary for the user to split apart a bake into different sections to avoid rays being cast from the inside of the reference model. This splitting of working models and reference models can speed up the time it takes to bake, but now different sections can be completely isolated from all reference geometry. One possible way to mitigate this issue is to expose a user option to bake using all reference geometry. This can be useful when baking maps that require scene information, such as ambient occlusion.

Finally, since every user has unique use cases for their baked textures, it’s best to be aware of the tangent space the working model will be used in, as well as the handedness of the target application, to avoid issues such as baked normals facing the wrong direction.
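A small GLSL sketch of the smooth-normal fix described above (our own formulation, not code from the demo):

// If the averaged smooth normal points away from the vertex normal, reflect
// it about that normal so the trace direction faces the reference model
// again. Both inputs are assumed to be normalized.
vec3 FixSmoothNormal(vec3 smoothNormal, vec3 vertexNormal)
{
    if (dot(smoothNormal, vertexNormal) > 0.0)
        return smoothNormal;
    return reflect(smoothNormal, vertexNormal);
}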

3.1.2 Implementation

Before we begin discussing implementation details, it’s important to introduce the base functions that we’ll be depending on. We’ll be using the function findTraceDirection to determine the direction our ray will be cast from, findTraceOrigin to determine the origin of our rays, and finally traceRay to perform the ray tracing operation. For the sake of brevity, we’ll omit traceRay from any source listings in this chapter, but a working implementation can be found in our included example.

// Interpolate between the current input smooth normal and
// face normal to determine the final ray trace direction.
vec3 findTraceDirection(vec3 position, vec3 smoothNormal, vec2 uv, sampler2D dirMask)
{
    vec3 dx = dFdx(position);
    vec3 dy = dFdy(position);
    vec3 faceNormal = normalize(cross(dx, dy));
    float traceBlend = textureLod(dirMask, uv, 0.0).x;

    if (dot(faceNormal, smoothNormal) < 0.0)
    {
        faceNormal = -faceNormal;
    }

    vec3 diff = smoothNormal - faceNormal;
    float diffLen = length(diff);
    float maxLen = sqrt(2.0) * traceBlend;

    // Interpolate final direction
    if (diffLen > maxLen)
    {
        diff *= maxLen / diffLen;
    }
    vec3 dir = faceNormal + diff;
    return -normalize(dir);
}

// Interpolate between a range of offsets to determine the
// origin of rays cast.
vec3 findTraceOrigin(vec3 position, vec3 direction, vec2 uv, sampler2D offsetMask, vec2 offsetRange)
{
    float offset = textureLod(offsetMask, uv, 0.0).x;
    offset = mix(offsetRange.x, offsetRange.y, offset);
    vec3 traceOrigin = position - offset * direction;
    return traceOrigin;
}

Listing 3.1. Base function implementations when performing raycast calculations.

Normal Map. Normals are available directly from the reference model’s vertex data. One would simply need to compute the fragment as the current reference vertex’s normal value, and interpolate between vertices through barycentric coordinates. Tangent-based normals require that one transform the reference model normal with the working model’s normal orientation:

Figure 3.6. A tangent space normal map texture applied to a model rendered with 16× multisampling.

vec3 n = vec3(0.0, 0.0, 1.0);
vec3 traceDir = findTraceDirection(inPosition, normalize(inBakeDir),
                                   inTextureCoords, tSkewMap);
vec3 tracePos = findTraceOrigin(inPosition, traceDir, inTextureCoords,
                                tOffsetMap, uOffsetRange);

TriangleHit hit;
bool didHit = traceRay(tracePos, traceDir, hit);
if (didHit)
{
    n = hit.coords.x * uNormals[hit.vertices.x] +
        hit.coords.y * uNormals[hit.vertices.y] +
        hit.coords.z * uNormals[hit.vertices.z];
    n = normalize(n);
}

outObjectNormals.rgb = n;
outObjectNormals.a = 1.0;

outTangentNormals.rgb = vec3(dot(n, inTangents), dot(n, inBitangents), dot(n, inNormals));
outTangentNormals.a = 1.0;

Listing 3.2. An example implementation of baking the reference model’s vertex normal.


Height Map. Height maps are used as inputs in tessellation to determine areas that require more subdivisions, and to offset those areas by the input texture.

Figure 3.7. Height map computed from cage offset distance to reference model.

vec3 traceDir = findTraceDirection(inPosition, normalize(inBakeDir),
                                   inTextureCoords, dirMask);
vec3 tracePos = findTraceOrigin(inPosition, traceDir, inTextureCoords,
                                tOffsetMap, uOffsetRange);

TriangleHit hit;
bool didhit = traceRay(tracePos, traceDir, hit);
float height = didhit ? hit.distance : 0.0;
height -= length(inPosition - tracePos);

// Interpolate between user specified minimum and
// maximum height values
height = height * uHeightScaleBias.x + uHeightScaleBias.y;

outHeight.xyz = vec3(height, height, height);
outHeight.w = didhit ? 1.0 : 0.0;

Listing 3.3. Height map computed from cage offset distance to reference model.

Ambient Occlusion Map. Ambient occlusion describes the average amount of light that would be expected to miss a region from an omnidirectional light source. This value can be determined through Monte Carlo stochastic sampling of rays cast hemispherically from the hit point of the initial raycast.


Figure 3.8. An example of a baked ambient occlusion map set to 4096 rays.

#define SAMPLES 16

outAO = vec4(1.0, 1.0, 1.0, 0.0);

vec3 traceDir = findTraceDirection(inPosition, normalize(inBakeDir),
                                   inTextureCoords, dirMask);
vec3 tracePos = findTraceOrigin(inPosition, traceDir, inTextureCoords,
                                tOffsetMap, uOffsetRange);

TriangleHit hit;
if (traceRay(tracePos, traceDir, hit))
{
    vec3 pos = tracePos + traceDir * (hit.distance - uHemisphereOffset);
    vec3 basisY = normalize(hit.coords.x * uNormals[hit.vertices.x] +
                            hit.coords.y * uNormals[hit.vertices.y] +
                            hit.coords.z * uNormals[hit.vertices.z]);
    vec3 basisX = normalize(cross(basisY, fTangent));
    vec3 basisZ = cross(basisX, basisY);

    float ao = 0.0;
    float hits = 0.0;
    TriangleHit hit2;
    for (int i = 0; i < SAMPLES; ++i)
    {
        // Random direction in the hemisphere of the first hit
        vec3 d = normalize(rand3(fTexCoord + uRandSeed + float(i)));

        // Give rays that point toward the top of the hemisphere more weight
        // when averaging the final ambient occlusion (cosine weighting).
        float omega = d.y;
        d = d.x * basisX + d.y * basisY + d.z * basisZ;

        if (traceRay(pos, d, hit2))
        {
            ao += omega;   // occluded weight
        }
        hits += omega;     // total weight
    }
    ao = hits > 0.0 ? 1.0 - ao / hits : 1.0;

    outAO.xyz = vec3(ao, ao, ao);
    outAO.w = 1.0;
}

Listing 3.4. An example implementation of Monte Carlo ambient occlusion baking.

Material Atlas. Scene descriptions introduce the concept of a Mesh being composed of several Primitives, each coupled with a Material. Different parts of a mesh can correspond to different materials, which can lead to dense geometry made up of many materials. This is great for authoring reference models; however, when designing assets to be used in real-time rendering, the need arises for a simpler working model that encodes all these materials as textures to be used in a single material with a Physically Based Rendering (PBR) workflow. One solution to this problem is to process each material the working model is composed of, masking out the geometry of what’s not being baked with an alpha of 0.0.

Figure 3.9. An example of an albedo map that uses the metalness workflow, baked from the reference model’s Physically Based (PBR) materials using our technique. On the right is the mesh textured with the generated albedo map rendered with a PBR metalness workflow.


3.2 GPU Considerations

Baking represents a potentially heavy workload, even when properly optimized for GPU processing. Reference models are often composed of tens or hundreds of millions of triangles, and output resolutions up to 8k or even 16k are not uncommon. While simple bakes can be performed in just a few milliseconds, a high-resolution ambient occlusion bake can take several minutes to complete. Short preview and iteration times are of high importance to artists, which makes the performance of a baking system a primary consideration.

3.2.1 Acceleration Structure

In terms of processing time, the main task for a bake is the tracing of rays against the reference mesh. These intersection tests must run quickly, and a properly built acceleration structure is vital to improving ray tracing speed. Our system uses a k-d tree, a special case of a binary spatial partitioning tree. Each node in the tree points either to two children or, if the node is a leaf, to a list of triangles. In this way a mesh is recursively subdivided into manageable chunks so that rays can test much smaller subsets of triangles instead of the whole mesh. A full explanation of k-d tree construction and traversal is outside the scope of this chapter; we encourage interested readers to read more on the topic in Akenine-Möller et al. [2018].

Traversal of k-d trees can be thought of as a two-step process: traversing nodes, and testing triangles once a leaf node is found. Traversal time tends to grow as a function of average tree depth, and triangle testing time tends to grow as a function of average leaf triangle count. A tree must be built to balance these quantities for minimal execution time: a deep tree will spend too much time in traversal, and a shallow tree will have too many triangles to test at leaf nodes. Our k-d tree is built with metrics for preferred and maximal depth and triangle counts, and some careful exceptions to these rules. A two-level heuristic is used: depth and triangle counts are kept within preferred limits, except in cases where leaf triangle counts would be unacceptably high, in which case a looser maximum depth is enforced. This produces trees well balanced for traversal on the GPU in a variety of cases.

The code for tree traversal for ray tests rests in loops of several iterations for each ray. We utilize a stackless traversal algorithm [Popov et al. 2007] to minimize GPU register use. The complex flow of branches in algorithms like this one is handled relatively well by modern GPU hardware, but these algorithms have an inherently large variability in the time to trace rays. As a result, the shared flow control between threads in a group causes traversal to stall as a thread group waits on the slowest thread. Cases where neighboring rays are relatively incoherent (that is, landing in different parts of the tree) perform relatively poorly for this reason. Ray incoherence is also a source of cache misses, the k-d tree structure often being quite large in memory. Optimizing coherence in flow control in GPU ray tracing is an area of ongoing research.


preferredTrisPerLeaf := 12
maxTrisPerLeaf := 64
preferredNodeDepth := 23
maxNodeDepth := preferredNodeDepth + 10

buildNode(depth, triCount):
    if (depth < preferredNodeDepth and triCount > preferredTrisPerLeaf)
       or (depth < maxNodeDepth and triCount > maxTrisPerLeaf)
        Split triangles into two new nodes
        for each new node
            buildNode()
    else
        Attach triangles to leaf node

Listing 3.5. Pseudocode for k-d tree construction. Triangle and depth limits have been chosen empirically with a brute-force performance search.

3.2.2 Dividing Work

Baking processes vary greatly in their time requirements, depending on reference mesh size and shape, output resolution, sample count, and GPU capabilities. A low-resolution preview bake may take only a millisecond or two, but a final resolution bake may take several seconds or even minutes. At these longer time scales, we must contend with limits imposed by many desktop operating systems on the length of time a set of GPU render commands may take. Microsoft Windows in particular imposes a two-second timeout for any GPU process, after which time the driver is reset. Even if this timeout period were not an issue, users generally like to be able to do other things on their machines while baking occurs.

This motivates the subdivision of bakes into jobs of manageable duration, which are executed in series. The GPU is then yielded to the operating system between these jobs, to allow other programs to process and redraw, as well as to prevent driver resets. A straightforward way of dividing work in baking processes is to divide the output canvas into two-dimensional sectors and process each sector as a separate job. The size of these sectors must be carefully estimated to keep their execution time low, which can be difficult. A good estimate, if an imperfect one, is found by multiplying the pixel area of a sector with the number of rays each pixel will require. This produces fairly large sectors for simple bakes, but small sectors for bakes that cast hundreds of rays (such as ambient occlusion). Processing many small sectors comes with significant performance cost, so there is a balance to be struck between bake speed and system responsiveness. A user-specified setting for this priority has also proven helpful.
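As an illustration of that estimate, a sketch of how a sector size could be chosen; the function name and per-job ray budget are our own assumptions, not values from Toolbag:

#include <algorithm>
#include <cmath>

// Pick a square sector size so that (pixels per sector) * (rays per pixel)
// stays under a fixed per-job ray budget.
int ChooseSectorSize(int raysPerPixel, long long rayBudgetPerJob = 4000000)
{
    long long pixels = rayBudgetPerJob / std::max(1, raysPerPixel);
    int size = (int)std::sqrt((double)pixels);
    return std::max(64, std::min(size, 4096)); // clamp to a practical range
}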


3.2.3 Memory Use

Finite video memory has proven to be a troublesome constraint, as data sizes vary with user-specified inputs and settings, and are effectively unbounded. A large reference mesh of 100M triangles and a corresponding k-d tree can easily occupy multiple gigabytes of video memory on its own, let alone multiple 8k render targets and other intermediate results. As a result, the maximum mesh size and output resolution is limited somewhat by the user’s installed video memory.

Paging or piecemeal processing of meshes would at first seem an ideal solution to this problem of memory limits, but this approach is in practice extremely difficult. It is very nearly impossible to robustly predict where sample rays will land on the reference mesh without first tracing them, which rules out piecemeal processing of meshes. This requires, at minimum, the k-d tree structure and reference mesh to always fit in video memory, alongside the render output surface and the working mesh.

Output render target sizes are often limited by the video driver to 16k or less. In situations where users desire higher resolution, piecemeal processing is possible by rendering subsections and compositing into a final buffer on the CPU. A similar approach can be used for efficient multisampling of bake outputs: subsections of the image are rendered at an enlarged size, and then resolved to their final resolution in the output image. This allows for high-resolution images at high sample counts to be rendered regardless of GPU limitations.

3.3 Future Work

Even incremental optimizations of processing time or memory use are likely to be well received by users of baking software. Inconvenient processing times and hard memory limits have long been the norm in this space, and fast GPU bakers are starting to change the way 3D artwork is produced. Quicker turnaround times with larger meshes is the ever-present goal. In addition to speed and capacity, new uses for baking algorithms are always being devised and requested by users.

Further investigation into improved k-d tree construction and traversal is likely to be productive. Other acceleration structures may be faster or use less memory; in particular, a bounding volume hierarchy [Akenine-Möller et al. 2018, Chapter 26] may offer a more compact representation of a reference mesh. Additionally, current GPU traversal algorithms leave much to be desired, particularly when it comes to achieving full occupancy in thread groups. The use of persistent worker threads that consume a queue of rays, and as a result rarely go idle, has been proposed as a possible solution to the thread occupancy problem [Akenine-Möller et al. 2018, Chapter 3].

Additionally, there may be some opportunity for caching ray hits between bake outputs. When a user bakes multiple maps, some redundant work is performed between these outputs. Retaining a map of ray hits may help amortize the cost of ray casting between outputs, speeding up the crucial case of complex bakes.


Bibliography

TEIXEIRA, D. 2008. Baking Normal Maps on the GPU. In GPU Gems 3, Addison-Wesley.

POPOV, S., GÜNTHER, J., SEIDEL, H., AND SLUSALLEK, P. 2007. Stackless KD-Tree Traversal for High Performance GPU Ray Tracing. In Computer Graphics Forum, 26:3, pp. 415–424. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8659.2007.01064.x.

LANTZ, K. 2013. KD Tree Construction Using the Surface Area Heuristic, Stack-Based Traversal, and The Hyperplane Separation Theorem. URL: https://www.keithlantz.net/2013/04.

AKENINE-MÖLLER, T., HAINES, E., HOFFMAN, N., PESCE, A., IWANICKI, M., AND HILLAIRE, S. 2018. Real-Time Rendering, 4th Ed. CRC Press.

4 IV

Writing an Efficient Vulkan Renderer
Arseny Kapoulkine

Vulkan is a new explicit cross-platform graphics API. It introduces many new concepts that may be unfamiliar to even seasoned graphics programmers. The key goal of Vulkan is performance—however, attaining good performance requires in-depth knowledge about these concepts and how to apply them efficiently, as well as how particular driver implementations implement these. This article will explore topics such as memory allocation, descriptor set management, command buffer recording, pipeline barriers, render passes and discuss ways to optimize CPU and GPU performance of production desktop/mobile Vulkan renderers today as well as look at what a future looking Vulkan renderer could do differently.

Modern renderers are becoming increasingly complex and must support many different graphics APIs with varying levels of hardware abstraction and disjoint sets of concepts. This sometimes makes it challenging to support all platforms at the same level of efficiency. Fortunately, for most tasks Vulkan provides multiple options that can be as simple as reimplementing concepts from other APIs with higher efficiency due to targeting the code specifically towards the renderer needs, and as hard as redesigning large systems to make them optimal for Vulkan. We will try to cover both extremes when applicable—ultimately, this is a tradeoff between maximum efficiency on Vulkan-capable systems and implementation and maintenance costs that every engine needs to carefully pick.

Additionally, efficiency is often application-dependent—the guidance in this article is generic and ultimately best performance is achieved by profiling the target application on a target platform and making an informed implementation decision based on the results.

This article assumes that the reader is familiar with the basics of Vulkan API, and would like to understand them better and/or learn how to use the API efficiently.


4.1 Memory Management

Memory management remains an exceedingly complex topic, and in Vulkan it gets even more so due to the diversity of heap configurations on different hardware. Earlier APIs adopted a resource-centric concept—the programmer doesn’t have a concept of graphics memory, only that of a graphics resource, and different drivers are free to manage the resource memory based on API usage flags and a set of heuristics. Vulkan, however, forces you to think about memory management up front, as you must manually allocate memory to create resources.

A perfectly reasonable first step is to integrate VulkanMemoryAllocator (henceforth abbreviated as VMA), which is an open-source library developed by AMD that solves some memory management details for you by providing a general purpose resource allocator on top of Vulkan functions. Even if you do use that library, there are still multiple performance considerations that apply; the rest of this section will go over memory caveats without assuming you use VMA; all of the guidance applies equally to VMA.
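As a rough illustration of what the VMA path looks like—a sketch based on VMA 2.x usage at the time of writing; check the library's documentation for the exact current API—creating a device-local buffer is a single call:

#include "vk_mem_alloc.h"

// Assumes an initialized VmaAllocator; VMA picks a suitable memory type and
// performs the suballocation described later in this section.
VkBuffer CreateDeviceLocalBuffer(VmaAllocator allocator, VkDeviceSize size,
                                 VkBufferUsageFlags usage, VmaAllocation* outAllocation)
{
    VkBufferCreateInfo bufferInfo = { VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO };
    bufferInfo.size = size;
    bufferInfo.usage = usage;

    VmaAllocationCreateInfo allocInfo = {};
    allocInfo.usage = VMA_MEMORY_USAGE_GPU_ONLY; // device-local memory

    VkBuffer buffer = VK_NULL_HANDLE;
    vmaCreateBuffer(allocator, &bufferInfo, &allocInfo, &buffer, outAllocation, nullptr);
    return buffer;
}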

4.1.1 Memory Heap Selection

When creating a resource in Vulkan, you have to choose a heap to allocate memory from. A Vulkan device exposes a set of memory types where each memory type has flags that define the behavior of that memory, and a heap index that defines the available size. Most Vulkan implementations expose two or three of the following flag combinations¹:

• VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT—this is generally referring to GPU memory that is not directly visible from CPU; it’s fastest to access from the GPU and this is the memory you should be using to store all render targets, GPU-only resources such as buffers for compute, and also all static resources such as textures and geometry buffers.

• VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT—on AMD hardware, this memory type refers to up to 256 MB of video memory that the CPU can write to directly, and is perfect for allocating reasonable amounts of data that is written by CPU every frame, such as uniform buffers or dynamic vertex/index buffers.

¹ We only cover memory allocation types that are writable from host and readable or writable from GPU; for CPU readback of data that has been written by GPU, memory with the VK_MEMORY_PROPERTY_HOST_CACHED_BIT flag is more appropriate.

• VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT²—this is referring to CPU memory that is directly visible from the GPU; reads from this memory go over the PCI Express bus. In the absence of the previous memory type, this generally speaking should be the choice for uniform buffers or dynamic vertex/index buffers, and should also be used to store staging buffers that are used to populate static resources allocated with VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT with data.

• VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT—this is referring to GPU memory that might never need to be allocated for render targets on tiled architectures. It is recommended to use lazily allocated memory to save physical memory for large render targets that are never stored to, such as MSAA images or depth images.

On integrated GPUs, there is no distinction between GPU and CPU memory—these devices generally expose VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT that you can allocate all static resources through as well.

When dealing with dynamic resources, in general allocating in non-device-local host-visible memory works well—it simplifies the application management and is efficient due to GPU-side caching of read-only data. For resources that have a high degree of random access though, like dynamic textures, it’s better to allocate them in VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT memory and upload data using staging buffers allocated in VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT memory—similarly to how you would handle static textures. In some cases you might need to do this for buffers as well—while uniform buffers typically don’t suffer from this, in some applications using large storage buffers with highly random access patterns will generate too many PCIe transactions unless you copy the buffers to the GPU first; additionally, host memory does have higher access latency from the GPU side that can impact performance for many small draw calls.

When allocating resources from VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, in case of VRAM oversubscription you can run out of memory; in this case you should fall back to allocating the resources in non-device-local VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT memory. Naturally you should make sure that large frequently used resources such as render targets are allocated first. There are other things you can do in the event of oversubscription, such as migrating less frequently used resources from GPU memory to CPU memory—this is outside of the scope of this article; additionally, on some operating systems like Windows 10, correct handling of oversubscription requires APIs that are not currently available in Vulkan.

² Note that VK_MEMORY_PROPERTY_HOST_COHERENT_BIT generally implies that the memory will be write-combined; on some devices it’s possible to allocate non-coherent memory and flush it manually with vkFlushMappedMemoryRanges.
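A minimal sketch of how an application might pick a memory type index matching one of the combinations above (not code from the article; a real renderer would layer the fallback logic just discussed on top):

// Returns the first memory type allowed by memoryTypeBits that has all of the
// requested property flags, or UINT32_MAX if none matches.
uint32_t FindMemoryType(const VkPhysicalDeviceMemoryProperties& props,
                        uint32_t memoryTypeBits, VkMemoryPropertyFlags required)
{
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i)
    {
        bool allowed  = (memoryTypeBits & (1u << i)) != 0;
        bool hasFlags = (props.memoryTypes[i].propertyFlags & required) == required;
        if (allowed && hasFlags)
            return i;
    }
    return UINT32_MAX; // caller falls back to a weaker set of flags
}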


4.1.2 Memory Suballocation

Unlike some other APIs that allow an option to perform one memory allocation per resource, in Vulkan this is impractical for large applications—drivers are only required to support up to 4096 individual allocations. In addition to the total number being limited, allocations can be slow to perform, may waste memory due to assuming worst-case possible alignment requirements, and also require extra overhead during command buffer submission to ensure memory residency. Because of this, suballocation is necessary. A typical pattern of working with Vulkan involves performing large (e.g., 16 MB to 256 MB depending on how dynamic the memory requirements are) allocations using vkAllocateMemory, and performing suballocation of objects within this memory, effectively managing it yourself. Critically, the application needs to handle alignment of memory requests correctly, as well as the bufferImageGranularity limit that restricts valid configurations of buffers and images.

Briefly, bufferImageGranularity restricts the relative placement of buffer and image resources in the same allocation, requiring additional padding between individual allocations. There are several ways to handle this:

• Always over-align image resources (as they typically have larger alignment to begin with) by bufferImageGranularity, essentially using a maximum of the required alignment and bufferImageGranularity for address and size alignment.

• Track the resource type for each allocation, and have the allocator add the requisite padding only if the previous or following resource is of a different type. This requires a somewhat more complex allocation algorithm.

• Allocate images and buffers in separate Vulkan allocations, thus sidestepping the entire problem. This reduces internal fragmentation due to smaller alignment padding but can waste more memory if the backing allocations are too big (e.g., 256 MB).

On many GPUs the required alignment for image resources is substantially bigger than it is for buffers, which makes the last option attractive—in addition to reducing waste due to lack of extra padding between buffers and images, it reduces internal fragmentation due to image alignment when an image follows a buffer resource. VMA provides implementations for option 2 (by default) and option 3 (see VMA_POOL_CREATE_IGNORE_BUFFER_IMAGE_GRANULARITY_BIT).
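A toy sketch of the first option (our own code, not VMA's): a linear suballocator that aligns every request by the larger of the resource alignment and bufferImageGranularity. The bit-mask rounding assumes both values are powers of two, which they are in practice:

#include <algorithm>

// Bump-pointer suballocator over one large VkDeviceMemory block.
struct LinearSuballocator
{
    VkDeviceSize size = 0;    // total size of the backing allocation
    VkDeviceSize offset = 0;  // current top

    // Returns the suballocation offset, or VK_WHOLE_SIZE if the block is full.
    VkDeviceSize Allocate(VkDeviceSize requestSize, VkDeviceSize alignment,
                          VkDeviceSize bufferImageGranularity)
    {
        VkDeviceSize align = std::max(alignment, bufferImageGranularity);
        VkDeviceSize start = (offset + align - 1) & ~(align - 1); // round up
        if (start + requestSize > size)
            return VK_WHOLE_SIZE;
        offset = start + requestSize;
        return start;
    }
};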

4.1.3 Dedicated Allocations

While the memory management model that Vulkan provides implies that the application performs large allocations and places many resources within one allocation using suballocation, on some GPUs it’s more efficient to allocate certain resources as one dedicated allocation. That way the driver can allocate the resources in faster memory under special circumstances.

To that end, Vulkan provides an extension (core in 1.1) to perform dedicated allocations—when allocating memory, you can specify that you are allocating this memory for this individual resource instead of as an opaque blob. To know if this is worthwhile, you can query the extended memory requirements via vkGetImageMemoryRequirements2KHR or vkGetBufferMemoryRequirements2KHR; the resulting struct, VkMemoryDedicatedRequirementsKHR, will contain requiresDedicatedAllocation (which might be set if the allocated resource needs to be shared with other processes) and prefersDedicatedAllocation flags.

In general, applications may see performance improvements from dedicated allocations on large render targets that require a lot of read/write bandwidth, depending on the hardware and drivers.
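A brief sketch of that query using the Vulkan 1.1 core equivalents of the KHR entry points mentioned above (assumes a valid device and image handle):

VkMemoryDedicatedRequirements dedicated = { VK_STRUCTURE_TYPE_MEMORY_DEDICATED_REQUIREMENTS };
VkMemoryRequirements2 requirements = { VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2 };
requirements.pNext = &dedicated;

VkImageMemoryRequirementsInfo2 info = { VK_STRUCTURE_TYPE_IMAGE_MEMORY_REQUIREMENTS_INFO_2 };
info.image = image;

vkGetImageMemoryRequirements2(device, &info, &requirements);

if (dedicated.requiresDedicatedAllocation || dedicated.prefersDedicatedAllocation)
{
    // Give this image its own VkDeviceMemory: chain a
    // VkMemoryDedicatedAllocateInfo with the image handle into
    // VkMemoryAllocateInfo::pNext instead of suballocating.
}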

4.1.4 Mapping Memory

Vulkan provides two options when mapping memory to get a CPU-visible pointer:

• Do this before the CPU needs to write data to the allocation, and unmap once the write is complete.

• Do this right after the host-visible memory is allocated, and never unmap the memory.

The second option is otherwise known as persistent mapping and is generally a better tradeoff—it minimizes the time it takes to obtain a writeable pointer (vkMapMemory is not particularly cheap on some drivers), removes the need to handle the case where multiple resources from the same memory object need to be written to simultaneously (calling vkMapMemory on an allocation that’s already been mapped and not unmapped is not valid), and simplifies the code in general.

The only downside is that this technique makes the 256 MB chunk of VRAM that is host-visible and device-local on AMD GPUs, described in “Memory Heap Selection”, less useful—on systems with Windows 7 and an AMD GPU, using persistent mapping on this memory may force WDDM to migrate the allocations to system memory. If this combination is a critical performance target for your users, then mapping and unmapping memory when needed might be more appropriate.
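In code, persistent mapping amounts to a single vkMapMemory call right after allocation (a sketch; error handling omitted):

// Map once and keep the pointer for the lifetime of the allocation;
// assumes the memory was allocated from a host-visible memory type.
void* persistentlyMapped = nullptr;
vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &persistentlyMapped);

// Every frame: write dynamic data through persistentlyMapped.
// vkUnmapMemory is only called when the allocation is destroyed.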

4.2 Descriptor Sets

Unlike earlier APIs with a slot-based binding model, in Vulkan the application has more freedom in how to pass resources to shaders. Resources are grouped into descriptor sets that have an application-specified layout, and each shader can use several descriptor sets that can be bound individually. It’s the responsibility of the application to manage the descriptor sets to make sure that the CPU doesn’t update a descriptor set that’s in use by the GPU, and to provide the descriptor layout that has an optimal balance between CPU-side update cost and GPU-side access cost. In addition, since different rendering APIs use different models for resource binding and none of them match the Vulkan model exactly, using the API in an efficient and cross-platform way becomes a challenge. We will outline several possible approaches to working with Vulkan descriptor sets that strike different points on the scale of usability and performance.

4.2.1 Mental Model

When working with Vulkan descriptor sets, it’s useful to have a mental model of how they might map to hardware. One such possibility—and the expected design—is that descriptor sets map to a chunk of GPU memory that contains descriptors—opaque blobs of data, 16–64 bytes in size depending on the resource, that completely specify all resource parameters necessary for shaders to access resource data. When dispatching shader work, the CPU can specify a limited number of pointers to descriptor sets; these pointers become available to shaders as the shader threads launch.

With that in mind, Vulkan APIs can map more or less directly to this model—creating a descriptor set pool would allocate a chunk of GPU memory that’s large enough to contain the maximum specified number of descriptors. Allocating a set out of the descriptor pool can be as simple as incrementing the pointer in the pool by the cumulative size of allocated descriptors as determined by VkDescriptorSetLayout (note that such an implementation would not support memory reclamation when freeing individual descriptors from the pool; vkResetDescriptorPool would set the pointer back to the start of pool memory and make the entire pool available for allocation again). Finally, vkCmdBindDescriptorSets would emit command buffer commands that set GPU registers corresponding to descriptor set pointers.

Note that this model ignores several complexities, such as dynamic buffer offsets, the limited number of hardware resources for descriptor sets, etc. Additionally, this is just one possible implementation—some GPUs have a less generic descriptor model and require the driver to perform additional processing when descriptor sets are bound to the pipeline. However, it’s a useful model to plan for descriptor set allocation/usage.

4.2.2 Dynamic Descriptor Set Management

Given the mental model above, you can treat descriptor sets as GPU-visible memory—it's the responsibility of the application to group descriptor sets into pools and keep them around until the GPU is done reading them.

A scheme that works well is to use free lists of descriptor set pools; whenever you need a descriptor set pool, you allocate one from the free list and use it for subsequent descriptor set allocations in the current frame on the current thread. Once you run out of descriptor sets in the current pool, you allocate a new pool. Any pools that were used in a given frame need to be kept around; once the frame has finished rendering, as determined by the associated fence objects, the descriptor set pools can be reset via vkResetDescriptorPool and returned to the free list. While it's possible to free individual descriptor sets from a pool via VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT, this complicates the memory management on the driver side and is not recommended.

When a descriptor set pool is created, the application specifies the maximum number of descriptor sets allocated from it, as well as the maximum number of descriptors of each type that can be allocated from it. In Vulkan 1.1, the application doesn't have to handle accounting for these limits—it can just call vkAllocateDescriptorSets and handle the error from that call by switching to a new descriptor set pool. Unfortunately, in Vulkan 1.0 without any extensions, it's an error to call vkAllocateDescriptorSets if the pool does not have available space, so the application must track the number of sets and descriptors of each type to know beforehand when to switch to a different pool.

Different pipeline objects may use different numbers of descriptors, which raises the question of pool configuration. A straightforward approach is to create all pools with the same configuration that uses the worst-case number of descriptors for each type—for example, if each set can use at most 16 texture and 8 buffer descriptors, one can allocate all pools with maxSets = 1024, and pool sizes 16 × 1024 for texture descriptors and 8 × 1024 for buffer descriptors. This approach can work, but in practice it can result in very significant memory waste for shaders with different descriptor counts—you can't allocate more than 1024 descriptor sets out of a pool with the aforementioned configuration, so if most of your pipeline objects use 4 textures, you'll be wasting 75% of the texture descriptor memory.

Two alternatives that provide a better balance with respect to memory use are:

• Measure the average number of descriptors used in a shader pipeline per type for a characteristic scene and allocate pool sizes accordingly. For example, if in a given scene we need 3000 descriptor sets, 13400 texture descriptors, and 1700 buffer descriptors, then the average number of descriptors per set is 4.47 textures (rounded up to 5) and 0.57 buffers (rounded up to 1), so a reasonable configuration of a pool is maxSets = 1024, 5 × 1024 texture descriptors, 1024 buffer descriptors. When a pool is out of descriptors of a given type, we allocate a new one—so this scheme is guaranteed to work and should be reasonably efficient on average.

• Group shader pipeline objects into size classes, approximating common patterns of descriptor use, and pick descriptor set pools using the appropriate size class. This is an extension of the scheme described above to more than one size class. For example, it's typical to have large numbers of shadow/depth prepass draw calls and large numbers of regular draw calls in a scene—but these two groups have different numbers of required descriptors, with shadow draw calls typically requiring 0 to 1 textures per set and 0 to 1 buffers when dynamic buffer offsets are used. To optimize memory use, it's more appropriate to allocate descriptor set pools separately for shadow/depth and other draw calls. Similarly to general-purpose allocators that can have size classes that are optimal for a given application, this can still be managed in a lower-level descriptor set management layer as long as it's configured with application-specific descriptor set usages beforehand.
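A minimal sketch of the per-frame allocation path under this scheme is shown below, targeting the Vulkan 1.1 behavior where vkAllocateDescriptorSets simply fails when the pool is exhausted; FrameDescriptorAllocator and acquirePoolFromFreeList are hypothetical names for the renderer's own bookkeeping.

#include <vector>
#include <vulkan/vulkan.h>

// Hypothetical helper: pops an already-reset pool from a free list, or creates a new one.
VkDescriptorPool acquirePoolFromFreeList(VkDevice device);

// Per-frame, per-thread descriptor allocation state.
struct FrameDescriptorAllocator
{
    VkDevice device = VK_NULL_HANDLE;
    VkDescriptorPool currentPool = VK_NULL_HANDLE;
    std::vector<VkDescriptorPool> usedPools; // reset and recycled once the frame's fence signals
};

VkDescriptorSet allocateDescriptorSet(FrameDescriptorAllocator& alloc, VkDescriptorSetLayout layout)
{
    VkDescriptorSetAllocateInfo info = { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO };
    info.descriptorPool = alloc.currentPool;
    info.descriptorSetCount = 1;
    info.pSetLayouts = &layout;

    VkDescriptorSet set = VK_NULL_HANDLE;
    VkResult result = vkAllocateDescriptorSets(alloc.device, &info, &set);

    if (result == VK_ERROR_OUT_OF_POOL_MEMORY || result == VK_ERROR_FRAGMENTED_POOL)
    {
        // The current pool is exhausted; retire it and switch to a fresh one.
        alloc.usedPools.push_back(alloc.currentPool);
        alloc.currentPool = acquirePoolFromFreeList(alloc.device);

        info.descriptorPool = alloc.currentPool;
        vkAllocateDescriptorSets(alloc.device, &info, &set);
    }

    return set;
}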

4.2.3 Choosing Appropriate Descriptor Types

For each resource type, Vulkan provides several options for accessing it in a shader; the application is responsible for choosing an optimal descriptor type.

For buffers, the application must choose between uniform and storage buffers, and whether to use dynamic offsets or not. Uniform buffers have a limit on the maximum addressable size—on desktop hardware you get up to 64 KB of data, while on mobile hardware some GPUs only provide 16 KB (which is also the guaranteed minimum in the specification). The buffer resource can be larger than that, but the shader can only access this much data through one descriptor. On some hardware, there is no difference in access speed between uniform and storage buffers; on other hardware, depending on the access pattern, uniform buffers can be significantly faster. Prefer uniform buffers for small to medium sized data, especially if the access pattern is fixed (e.g., for a buffer with material or scene constants). Storage buffers are more appropriate when you need large arrays of data that are larger than the uniform buffer limit and are indexed dynamically in the shader.

For textures, if filtering is required, there is a choice of a combined image/sampler descriptor (where, like in OpenGL, the descriptor specifies both the source of the texture data and the filtering/addressing properties), separate image and sampler descriptors (which maps better to the Direct3D 11 model), and an image descriptor with an immutable sampler, where the sampler properties must be specified when the pipeline object is created. The relative performance of these methods is highly dependent on the usage pattern; however, in general immutable samplers map better to the recommended usage model in other newer APIs like Direct3D 12, and give the driver more freedom to optimize the shader. This does alter renderer design to a certain extent, making it necessary to implement certain dynamic portions of sampler state, like per-texture LOD bias for texture fade-in during streaming, using shader ALU instructions.
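A minimal sketch of a set layout that bakes an immutable sampler into a combined image/sampler binding follows; createTextureSetLayout is a hypothetical helper, and the sampler is assumed to have been created earlier (e.g., at device init).

#include <vulkan/vulkan.h>

// Sketch: a set layout whose single binding bakes in an immutable sampler.
VkDescriptorSetLayout createTextureSetLayout(VkDevice device, const VkSampler* samplerLinearClamp)
{
    VkDescriptorSetLayoutBinding binding = {};
    binding.binding = 0;
    binding.descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
    binding.descriptorCount = 1;
    binding.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT;
    binding.pImmutableSamplers = samplerLinearClamp; // sampler state is fixed by the layout

    VkDescriptorSetLayoutCreateInfo layoutInfo = { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO };
    layoutInfo.bindingCount = 1;
    layoutInfo.pBindings = &binding;

    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &layout);

    // When updating descriptor sets with this layout, only the image view is written;
    // the sampler comes from the layout and can't be changed per set.
    return layout;
}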

4.2.4 Slot-based Binding

A simplistic alternative to the Vulkan binding model is the Metal/Direct3D 11 model, where an application can bind resources to slots and the runtime/driver manages descriptor memory and descriptor set parameters. This model can be implemented on top of Vulkan descriptor sets; while not providing the most optimal results, it generally is a good model to start with when porting an existing renderer, and with a careful implementation it can be surprisingly efficient.

To make this model work, the application needs to decide how many resource namespaces there are and how they map to Vulkan set/slot indices. For example, in Metal each stage (VS, FS, CS) has three resource namespaces—textures, buffers, samplers—with no differentiation between, e.g., uniform buffers and storage buffers. In Direct3D 11 the namespaces are more complicated, since read-only structured buffers belong to the same namespace as textures, but textures and buffers used with unordered access reside in a separate one.

The Vulkan specification only guarantees a minimum of 4 descriptor sets accessible to the entire pipeline (across all stages); because of this, the most convenient mapping option is to have resource bindings match across all stages—for example, texture slot 3 would contain the same texture resource no matter what stage it's accessed from—and use different descriptor sets for different types, e.g., set 0 for buffers, set 1 for textures, set 2 for samplers. Alternatively, an application can use one descriptor set per stage (note that with only 4 descriptor sets guaranteed per pipeline, this approach can't handle a full pipeline setup with VS, GS, FS, TCS, and TES—which is only a problem if you use tessellation on drivers that only expose 4 descriptor sets) and perform static index remapping (e.g., slots 0–16 would be used for textures, slots 17–24 for uniform buffers, etc.)—this, however, can use much more descriptor set memory and isn't recommended. Finally, one could implement an optimally compact dynamic slot remapping for each shader stage (e.g., if a vertex shader uses texture slots 0, 4, 5, then they map to Vulkan descriptor indices 0, 1, 2 in set 0, and at runtime the application extracts the relevant texture information using this remapping table).

In all these cases, the implementation of setting a texture to a given slot wouldn't generally run any Vulkan commands and would just update shadow state; just before the draw call or dispatch you'd need to allocate a descriptor set from the appropriate pool, update it with new descriptors, and bind all descriptor sets using vkCmdBindDescriptorSets. Note that if a descriptor set has 5 resources, and only one of them changed since the last draw call, you still need to allocate a new descriptor set with 5 resources and update all of them.

To reach good performance with this approach, you need to follow several guidelines:

• Don't allocate or update descriptor sets if nothing in the set changed. In the model with slots that are shared between different stages, this can mean that if no textures are set between two draw calls, you don't need to allocate/update the descriptor set with texture descriptors.

• Batch calls to vkAllocateDescriptorSets if possible—on some drivers, each call has measurable overhead, so if you need to update multiple sets, allocating them in one call can be faster.

• To update descriptor sets, either use vkUpdateDescriptorSets with a descriptor write array, or use vkUpdateDescriptorSetWithTemplate from Vulkan 1.1. Using the descriptor copy functionality of vkUpdateDescriptorSets is tempting with dynamic descriptor management for copying most descriptors out of a previously allocated array, but this can be slow on drivers that allocate descriptors out of write-combined memory. Descriptor templates can reduce the amount of work the application needs to do to perform updates—since in this scheme you need to read descriptor information out of shadow state maintained by the application, descriptor templates allow you to tell the driver the layout of your shadow state, making updates substantially faster on some drivers.

• Finally, prefer dynamic uniform buffers to updating uniform buffer descriptors. Dynamic uniform buffers allow specifying offsets into buffer objects using the pDynamicOffsets argument of vkCmdBindDescriptorSets without allocating and updating new descriptors. This works well with dynamic constant management where constants for draw calls are allocated out of large uniform buffers, substantially reduces CPU overhead, and can be more efficient on the GPU. While on some GPUs the number of dynamic buffers must be kept small to avoid extra overhead in the driver, one or two dynamic uniform buffers should work well in this scheme on all architectures.

In general, the approach outlined above can be very efficient in terms of performance—it's not as efficient as the approaches with more static descriptor sets that are described below, but it can still run circles around older APIs if implemented carefully. Unfortunately, on some drivers the allocate-and-update path is not very optimal—on some mobile hardware, it may make sense to cache descriptor sets based on the descriptors they contain if they can be reused later in the frame.
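A minimal sketch of the "flush shadow state just before the draw" step under this model is below; ShadowState is a hypothetical CPU-side structure, the texture set is assumed to live at set = 1 with bindings 0–15, and allocateDescriptorSet/FrameDescriptorAllocator refer to the hypothetical helpers from the sketch in 4.2.2.

#include <vulkan/vulkan.h>

// Hypothetical CPU-side shadow state for the texture slots of the current draw call.
struct ShadowState
{
    VkImageView textures[16];
    VkImageLayout layouts[16];
    bool texturesDirty;
};

// Called just before vkCmdDraw*: allocate a fresh set, write all slots, bind it.
void flushTextureBindings(VkDevice device, VkCommandBuffer cmd, VkPipelineLayout pipelineLayout,
                          FrameDescriptorAllocator& alloc, VkDescriptorSetLayout textureSetLayout,
                          ShadowState& state)
{
    if (!state.texturesDirty)
        return; // nothing changed since the last draw; keep the previously bound set

    VkDescriptorSet set = allocateDescriptorSet(alloc, textureSetLayout);

    VkDescriptorImageInfo imageInfo[16];
    VkWriteDescriptorSet writes[16];

    for (uint32_t i = 0; i < 16; ++i)
    {
        imageInfo[i] = { VK_NULL_HANDLE, state.textures[i], state.layouts[i] };

        writes[i] = { VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET };
        writes[i].dstSet = set;
        writes[i].dstBinding = i;
        writes[i].descriptorCount = 1;
        writes[i].descriptorType = VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE;
        writes[i].pImageInfo = &imageInfo[i];
    }

    vkUpdateDescriptorSets(device, 16, writes, 0, nullptr);
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineLayout,
                            /*firstSet=*/1, 1, &set, 0, nullptr);

    state.texturesDirty = false;
}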

4.2.5 Frequency-based Descriptor Sets

While the slot-based resource binding model is simple and familiar, it doesn't result in optimal performance. Some mobile hardware may not support multiple descriptor sets; however, in general the Vulkan API and drivers expect an application to manage descriptor sets based on frequency of change.

A more Vulkan-centric renderer would organize the data that the shaders need to access into groups by frequency of change, and use individual sets for individual frequencies, with set = 0 representing the least frequent change and set = 3 the most frequent. For example, a typical setup would involve:

• A set = 0 descriptor set containing a uniform buffer with global, per-frame or per-view data, as well as globally available textures such as a shadow map texture array/atlas.

• A set = 1 descriptor set containing a uniform buffer and texture descriptors for per-material data, such as albedo map, Fresnel coefficients, etc.

• A set = 2 descriptor set containing a dynamic uniform buffer with per-draw data, such as a world transform array.

For set = 0, the expectation is that it only changes a handful of times per frame; it's sufficient to use a dynamic allocation scheme similar to the previous section. For set = 1, the expectation is that for most objects, the material data persists between frames, and as such could be allocated and updated only when the gameplay code changes material data. For set = 2, the data would be completely dynamic; due to the use of a dynamic uniform buffer, we'd rarely need to allocate and update this descriptor set—assuming dynamic constants are uploaded to a series of large per-frame buffers, for most draws we'd only need to update the buffer with the constant data and call vkCmdBindDescriptorSets with new offsets.

Note that due to the compatibility rules between pipeline objects, in most cases it's enough to bind sets 1 and 2 whenever a material changes, and only set 2 when the material is the same as for the previous draw call. This results in just one call to vkCmdBindDescriptorSets per draw call.

For a complex renderer, different shaders might need to use different layouts—for example, not all shaders need to agree on the same layout for material data. In rare cases it might also make sense to use more than 3 sets, depending on the frame structure. Additionally, given the flexibility of Vulkan, it's not strictly required to use the same resource binding system for all draw calls in the scene. For example, post-processing draw call chains tend to be highly dynamic, with texture/constant data changing completely between individual draw calls. Some renderers initially implement the dynamic slot-based binding model from the previous section and proceed to additionally implement frequency-based sets for world rendering to minimize the performance penalty of set management, while still keeping the simplicity of the slot-based model for the more dynamic parts of the rendering pipeline.

The scheme described above assumes that in most cases, per-draw data is larger than the size that can be efficiently set via push constants. Push constants can be set without updating or rebinding descriptor sets; with a guaranteed limit of 128 bytes per draw call, it's tempting to use them for per-draw data such as a 4×3 transform matrix for an object. However, on some architectures the actual number of push constants available for fast access depends on the descriptor setup the shaders use, and is closer to 12 bytes or so. Exceeding this limit can force the driver to spill the push constants into a driver-managed ring buffer, which can end up being more expensive than moving this data to a dynamic uniform buffer on the application side. While limited use of push constants may still be a good idea for some designs, it's more appropriate to use them in the fully bindless scheme described in the next section.
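A minimal sketch of the per-draw path in this scheme is shown below; perDrawSet is assumed to contain a VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC descriptor pointing at a large, persistently mapped per-frame buffer, so only the dynamic offset changes per draw. The helper names are hypothetical.

#include <cstring>
#include <vulkan/vulkan.h>

// Per-draw binding with a dynamic uniform buffer: the descriptor set itself stays
// the same; only the offset into the per-frame constant buffer changes per draw.
void bindPerDrawConstants(VkCommandBuffer cmd, VkPipelineLayout pipelineLayout,
                          VkDescriptorSet perDrawSet, uint32_t drawConstantsOffset)
{
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineLayout,
                            /*firstSet=*/2, /*descriptorSetCount=*/1, &perDrawSet,
                            /*dynamicOffsetCount=*/1, &drawConstantsOffset);
}

// The offset typically comes from a linear allocator over the mapped per-frame
// uniform buffer; alignment must respect minUniformBufferOffsetAlignment.
uint32_t uploadDrawConstants(char* mappedFrameBuffer, uint32_t& cursor, uint32_t alignment,
                             const void* data, uint32_t size)
{
    uint32_t offset = (cursor + alignment - 1) & ~(alignment - 1);
    memcpy(mappedFrameBuffer + offset, data, size);
    cursor = offset + size;
    return offset;
}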


4.2.6 Bindless Descriptor Designs

Frequency-based descriptor sets reduce the descriptor set binding overhead; however, you still need to bind one or two descriptor sets per draw call. Maintaining material descriptor sets requires a management layer that needs to update GPU-visible descriptor sets whenever material parameters change; additionally, since texture descriptors are cached in material data, this makes global texture streaming systems hard to deal with—whenever some mipmap levels in a texture get streamed in or out, all materials that refer to this texture need to be updated. This requires complex interaction between the material system and the texture streaming system and introduces extra overhead whenever a texture is adjusted—which partially offsets the benefits of the frequency-based scheme. Finally, due to the need to set up descriptor sets per draw call, it's hard to adapt any of the aforementioned schemes to GPU-based culling or command submission.

It is possible to design a bindless scheme where the number of required set binding calls is constant for world rendering, which decouples texture descriptors from materials, making texture streaming systems easier to implement, and facilitates GPU-based submission. As with the previous scheme, this can be combined with dynamic ad-hoc descriptor updates for parts of the scene where the number of draw calls is small and flexibility is important, such as post-processing.

To fully leverage bindless, core Vulkan may or may not be sufficient; some bindless implementations require updating descriptor sets without rebinding them after the update, which is not available in core Vulkan 1.0 or 1.1 but is possible to achieve with the VK_EXT_descriptor_indexing extension. However, the basic design described below can work without extensions, given high enough descriptor set limits; in that case it requires double buffering of the texture descriptor array described below to update individual descriptors, since the array would otherwise be constantly accessed by the GPU.

Similarly to the frequency-based design, we'll split the shader data into global uniforms and textures (set 0), material data, and per-draw data. Global uniforms and textures can be specified via a descriptor set the same way as described in the previous section.

For per-material data, we will move the texture descriptors into a large texture descriptor array (note: this is a different concept than a texture array—a texture array uses one descriptor and forces all textures to have the same size and format; a descriptor array doesn't have this limitation and can contain arbitrary texture descriptors as array elements, including texture array descriptors). Each material in the material data will have an index into this array instead of a texture descriptor; the index will be part of the material data, which will also contain the other material constants. All material constants for all materials in the scene will reside in one large storage buffer; while it's possible to support multiple material types with this scheme, for simplicity we'll assume that all materials can be specified using the same data. An example material data structure is below:


struct MaterialData
{
    vec4 albedoTint;

    float tilingX;
    float tilingY;
    float reflectance;
    float unused0; // pad to vec4

    uint albedoTexture;
    uint normalTexture;
    uint roughnessTexture;
    uint unused1; // pad to vec4
};
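One benefit of storing texture indices instead of descriptors is that texture streaming only has to touch the descriptor array, never the materials. A minimal sketch of rewriting a single array element when a texture's image view changes is below; it assumes the descriptor-indexing (or double-buffered) path where such an update is safe, and textureArraySet, the binding number, and the slot index are renderer-specific assumptions.

#include <vulkan/vulkan.h>

// When streaming changes a texture's image view (e.g., new mip levels), only the
// corresponding element of the texture descriptor array has to be rewritten;
// materials keep referring to the same index.
void updateStreamedTexture(VkDevice device, VkDescriptorSet textureArraySet,
                           uint32_t textureIndex, VkImageView newView)
{
    VkDescriptorImageInfo imageInfo = {};
    imageInfo.imageView = newView;
    imageInfo.imageLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;

    VkWriteDescriptorSet write = { VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET };
    write.dstSet = textureArraySet;
    write.dstBinding = 0;                 // binding that holds the descriptor array
    write.dstArrayElement = textureIndex; // the slot this texture occupies
    write.descriptorCount = 1;
    write.descriptorType = VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE;
    write.pImageInfo = &imageInfo;

    vkUpdateDescriptorSets(device, 1, &write, 0, nullptr);
}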

Similarly, all per-draw constants for all objects in the scene can reside in another large storage buffer; for simplicity, we'll assume that all per-draw constants have an identical structure. To support skinned objects in a scheme like this, we'll extract transform data into a separate, third storage buffer:

struct TransformData
{
    vec4 transform[3];
};

Something that we've ignored so far is the vertex data specification. While Vulkan provides a first-class way to specify vertex data by calling vkCmdBindVertexBuffers, having to bind vertex buffers per draw would not work for a fully bindless design. Additionally, some hardware doesn't support vertex buffers as a first-class entity, and the driver has to emulate vertex buffer binding, which causes some CPU-side slowdowns when using vkCmdBindVertexBuffers. In a fully bindless design, we need to assume that all vertex buffers are suballocated in one large buffer and either use per-draw vertex offsets (the firstVertex argument to vkCmdDrawIndexed) to have the hardware fetch data from it, or pass an offset in this buffer to the shader with each draw call and fetch data from the buffer in the shader. Both approaches can work well, and might be more or less efficient depending on the GPU; here we will assume that the vertex shader performs manual vertex fetching.

Thus, for each draw call we need to specify three integers to the shader:

• Material index; used to look up material data from the material storage buffer. The textures can then be accessed using the indices from the material data and the descriptor array.

• Transform data index; used to look up transform data from the transform storage buffer.

• Vertex data offset; used to look up vertex attributes from the vertex storage buffer.

We can specify these indices and additional data, if necessary, via draw data:

struct DrawData
{
    uint materialIndex;
    uint transformOffset;
    uint vertexOffset;
    uint unused0; // vec4 padding

    // ... extra gameplay data goes here
};

The shader will need to access storage buffers containing MaterialData, TransformData, and DrawData, as well as a storage buffer containing vertex data. These can be bound to the shader via the global descriptor set; the only remaining piece of information is the draw data index, which can be passed via a push constant. (Depending on the GPU architecture, it might also be beneficial to pass some of the indices, like the material index or vertex data offset, via push constants to reduce the number of memory indirections in vertex/fragment shaders.)

With this scheme, we'd need to update the storage buffers used by materials and draw calls each frame and bind them once using our global descriptor set; additionally, we need to bind index data—assuming that, like vertex data, index data is allocated in one large index buffer, we only need to bind it once using vkCmdBindIndexBuffer. With the global setup complete, for each draw call we need to call vkCmdBindPipeline if the shader changes, followed by vkCmdPushConstants to specify an index into the draw data buffer, followed by vkCmdDrawIndexed. In a GPU-centric design, we can use vkCmdDrawIndirect or vkCmdDrawIndirectCountKHR (provided by the KHR_draw_indirect_count extension) and fetch per-draw constants using gl_DrawIDARB (provided by the KHR_shader_draw_parameters extension) as an index instead of push constants. The only caveat is that for GPU-based submission, we'd need to bucket draw calls based on the pipeline object on the CPU, since there's no support for switching pipeline objects otherwise.

With this, vertex shader code to transform the vertex could look like this:

DrawData dd = drawData[gl_DrawIDARB];
TransformData td = transformData[dd.transformOffset];

vec4 positionLocal = vec4(positionData[gl_VertexIndex + dd.vertexOffset], 1.0);
vec3 positionWorld = mat4x3(td.transform[0], td.transform[1], td.transform[2]) * positionLocal;

Fragment shader code to sample material textures could look like this:

DrawData dd = drawData[drawId];
MaterialData md = materialData[dd.materialIndex];

vec4 albedo = texture(sampler2D(materialTextures[md.albedoTexture], albedoSampler),
    uv * vec2(md.tilingX, md.tilingY));
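For reference, a sketch of what the global (set = 0) layout behind these snippets might look like on the application side is shown below. The binding numbers and the array size are arbitrary assumptions; samplers (e.g., albedoSampler) would be additional bindings, possibly immutable as discussed in 4.2.3, and updating the texture array in place while it is in use would require VK_EXT_descriptor_indexing flags or double buffering.

#include <vulkan/vulkan.h>

// Sketch of a global "bindless" set = 0 layout: a few buffers plus a large sampled-image array.
VkDescriptorSetLayout createGlobalSetLayout(VkDevice device)
{
    VkDescriptorSetLayoutBinding bindings[6] = {};

    // Binding 0: per-frame globals (view/projection, lighting constants, ...).
    bindings[0] = { 0, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, 1, VK_SHADER_STAGE_ALL, nullptr };
    // Bindings 1-3: MaterialData, TransformData and DrawData storage buffers.
    bindings[1] = { 1, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_ALL, nullptr };
    bindings[2] = { 2, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_ALL, nullptr };
    bindings[3] = { 3, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_ALL, nullptr };
    // Binding 4: vertex data storage buffer (positionData) for manual vertex fetch.
    bindings[4] = { 4, VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, 1, VK_SHADER_STAGE_VERTEX_BIT, nullptr };
    // Binding 5: the texture descriptor array (materialTextures); descriptorCount must
    // fit within maxPerStageDescriptorSampledImages on the target hardware.
    bindings[5] = { 5, VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE, 4096, VK_SHADER_STAGE_FRAGMENT_BIT, nullptr };

    VkDescriptorSetLayoutCreateInfo info = { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO };
    info.bindingCount = 6;
    info.pBindings = bindings;

    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(device, &info, nullptr, &layout);
    return layout;
}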

This scheme minimizes the CPU-side overhead. Of course, fundamentally it's a balance between multiple factors:

• While the scheme can be extended to multiple formats of material, draw, and vertex data, it gets harder to manage.

• Using storage buffers exclusively instead of uniform buffers can increase GPU time on some architectures.

• Fetching texture descriptors from an array indexed by material data, which is in turn indexed by the material index, can add an extra indirection on the GPU compared to some alternative designs.

• On some hardware, various descriptor set limits may make this technique impractical to implement; to be able to index an arbitrary texture dynamically from the shader, maxPerStageDescriptorSampledImages should be large enough to accommodate all material textures—while many desktop drivers expose a large limit here, the specification only guarantees a limit of 16, so bindless remains out of reach on some hardware that otherwise supports Vulkan.

As renderers get more and more complex, bindless designs will become more involved and eventually allow moving even larger parts of the rendering pipeline to the GPU; due to hardware constraints this design is not practical on every single Vulkan-compatible device, but it's definitely worth considering when designing new rendering paths for future hardware.

4.3 Command Buffer Recording and Submission

In older APIs, there is a single timeline for GPU commands; commands recorded on the CPU execute on the GPU in the same order, as there is generally only one thread recording them; there is no precise control over when the CPU submits commands to the GPU, and the driver is expected to manage the memory used by the command stream, as well as the submission points, optimally.

In contrast, in Vulkan the application is responsible for managing command buffer memory, recording commands in multiple threads into multiple command buffers, and submitting them for execution with appropriate granularity. While with carefully written code a single-core Vulkan renderer can be significantly faster than older APIs, peak efficiency and minimal latency are obtained by utilizing many cores in the system for command recording, which requires careful memory management.


4.3.1 Mental Model

Similarly to descriptor sets, command buffers are allocated out of command pools; it's valuable to understand how a driver might implement this to be able to reason about the costs and usage implications.

A command pool has to manage memory that will be filled with commands by the CPU and subsequently read by the GPU command processor. The amount of memory used by the commands can't be statically determined; a typical implementation of a pool would thus involve a free list of fixed-size pages. A command buffer would contain a list of pages with actual commands, with special jump commands that transfer control from each page to the next one so that the GPU can execute all of them in sequence. Whenever a command needs to be allocated from a command buffer, it will be encoded into the current page; if the current page doesn't have space, the driver would allocate the next page using the free list from the associated pool, encode a jump to that page into the current page, and switch to the next page for subsequent command recording.

Each command pool can only be used from one thread concurrently, so the operations above don't need to be thread-safe. (Regrettably, Vulkan doesn't provide a way for the driver to implement thread-safe command buffer recording so that one command pool could be reused between threads; in the scheme described, cross-thread synchronization would only be required for switching pages, which is relatively rare and can be lock-free for the most part.)

Freeing a command buffer using vkFreeCommandBuffers may return the pages used by the command buffer to the pool by adding them to the free list. Resetting the command pool may put all pages used by all command buffers into the pool free list; when VK_COMMAND_POOL_RESET_RELEASE_RESOURCES_BIT is used, the pages can be returned to the system so that other pools can reuse them. Note that there is no guarantee that vkFreeCommandBuffers actually returns memory to the pool; alternative designs may involve multiple command buffers allocating chunks within larger pages, which would make it hard for vkFreeCommandBuffers to recycle memory. Indeed, on one mobile vendor, vkResetCommandPool is necessary to reuse memory for future command recording in a default setup when pools are allocated without VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT.
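As with descriptor pools, a small illustrative sketch of this mental model can help with reasoning about costs; it is not how any particular driver works.

#include <cstddef>
#include <vector>

// Illustrative sketch of the paged command buffer mental model described above.
struct CommandPage
{
    char data[64 * 1024];
    size_t used = 0;
};

struct CommandPoolModel
{
    std::vector<CommandPage*> freePages; // fixed-size pages available for reuse
};

struct CommandBufferModel
{
    CommandPoolModel* pool = nullptr;
    std::vector<CommandPage*> pages; // pages holding this buffer's commands, executed in order
};

// Reserving space for a command: append to the current page, or grab a page from the
// pool's free list (conceptually also encoding a jump to the new page).
char* reserveCommandSpace(CommandBufferModel& cb, size_t size)
{
    if (cb.pages.empty() || cb.pages.back()->used + size > sizeof(CommandPage::data))
    {
        CommandPage* page;
        if (!cb.pool->freePages.empty())
        {
            page = cb.pool->freePages.back();
            cb.pool->freePages.pop_back();
            page->used = 0;
        }
        else
        {
            page = new CommandPage();
        }
        cb.pages.push_back(page);
    }

    CommandPage* page = cb.pages.back();
    char* result = page->data + page->used;
    page->used += size;
    return result;
}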

4.3.2 Multithreaded Command Recording

Two crucial restrictions in Vulkan for command pool usage are:

• Command buffers allocated from one pool may not be recorded concurrently by multiple threads.

• Command buffers and pools cannot be freed or reset while the GPU is still executing the associated commands.

Because of these, a typical threading setup requires a set of command buffer pools. The set has to contain F * T pools, where F is the frame queue length—F is usually 2 (one frame is recorded by the CPU while another frame is being executed by the GPU) or 3—and T is the number of threads that can concurrently record commands, which can be as high as the core count on the system. When recording commands from a thread, the thread needs to allocate a command buffer using the pool associated with the current frame and thread and record commands into it.

Assuming that command buffers aren't recorded across a frame boundary, and that at a frame boundary the frame queue length is enforced by waiting for the last frame in the queue to finish executing, we can then free all command buffers allocated for that frame and reset all associated command pools. Additionally, instead of freeing command buffers, it's possible to reuse them after calling vkResetCommandPool—which means that command buffers don't have to be allocated again. While in theory allocating command buffers could be cheap, some driver implementations have measurable overhead associated with command buffer allocation. Reuse also makes sure that the driver never needs to return command memory to the system, which can make submitting commands into these buffers cheaper.

Note that depending on the frame structure, the setup above may result in unbalanced memory consumption across threads; for example, shadow draw calls typically require less setup and less command memory. When combined with the effectively random workload distribution across threads that many job schedulers produce, this can result in all command pools getting sized for the worst-case consumption. If an application is memory constrained and this becomes a problem, it's possible to limit the parallelism for each individual pass and select the command buffer/pool based on the recorded pass to limit the waste. This requires introducing the concept of size classes to the command buffer manager. With a command pool per thread and a manual reuse of allocated command buffers as suggested above, it's possible to keep a free list per size class, with size classes defined based on the number of draw calls (e.g., “