
ShaderX²: Shader Programming Tips & Tricks with DirectX 9

Edited by
Wolfgang F. Engel

Wordware Publishing, Inc.


Library of Congress Cataloging-in-Publication Data

ShaderX² : shader programming tips and tricks with DirectX 9 / edited by Wolfgang F. Engel.
    p. cm.
Includes bibliographical references and index.
ISBN 1-55622-988-7 (paperback, companion CD-ROM)
1. Computer games--Programming. 2. Three-dimensional display systems.
I. Title: ShaderX squared. II. Engel, Wolfgang F.
QA76.76.C672S48 2003
794.8'16693--dc22    2003018871
                     CIP

ISBN 1-55622-988-7
10 9 8 7 6 5 4 3 2 1
0308

© 2004, Wordware Publishing, Inc.
All Rights Reserved
2320 Los Rios Boulevard
Plano, Texas 75074

No part of this book may be reproduced in any form or by any means without permission in writing from Wordware Publishing, Inc.

Printed in the United States of America

Crystal Reports is a registered trademark of Crystal Decisions, Inc. in the United States and/or other countries. Names of Crystal Decisions products referenced herein are trademarks or registered trademarks of Crystal Decisions or its subsidiaries.

Screen shots used in this book remain the property of their respective companies.

All brand names and product names mentioned in this book are trademarks or service marks of their respective companies. Any omission or misuse (of any kind) of service marks or trademarks should not be regarded as intent to infringe on the property of others. The publisher recognizes and respects all marks used by companies, manufacturers, and developers as a means to distinguish their products.

This book is sold as is, without warranty of any kind, either express or implied, respecting the contents of this book and any disks or programs that may accompany it, including but not limited to implied warranties for the book's quality, performance, merchantability, or fitness for any particular purpose. Neither Wordware Publishing, Inc. nor its dealers or distributors shall be liable to the purchaser or any other person or entity with respect to any liability, loss, or damage caused or alleged to have been caused directly or indirectly by this book.

All inquiries for volume purchases of this book should be addressed to Wordware Publishing, Inc., at the above address. Telephone inquiries may be made by calling:
(972) 423-0090


Contents

Preface    vii
About the Authors    ix
Introduction    xix

Section I — Geometry Manipulation Tricks    1
Using Vertex Shaders for Geometry Compression    3
    Dean Calver
Using Lookup Tables in Vertex Shaders    13
    Carsten Wenzel
Terrain Geomorphing in the Vertex Shader    18
    Daniel Wagner
3D Planets on the GPU    33
    Jesse Laeuchli
Cloth Animation with Pixel and Vertex Shader 3.0    40
    Kristof Beets
Collision Shaders    58
    Takashi Imagire
Displacement Mapping    73
    Tom Forsyth

Section II — Rendering Techniques    87
Rendering Objects as Thick Volumes    89
    Greg James
Screen-aligned Particles with Minimal VertexBuffer Locking    107
    O'dell Hicks
Hemisphere Lighting with Radiosity Maps    113
    Shawn Hargreaves


Galaxy Textures    123
    Jesse Laeuchli
Turbulent Sun    127
    Jesse Laeuchli
Fragment-level Phong Illumination    131
    Emil Persson
Specular Bump Mapping on Pre-ps_1_4 Hardware    149
    Matthew Halpin
Voxel Rendering with PS_3_0    161
    Aaron Burton
Simulating Blending Operations on Floating-point Render Targets    172
    Francesco Carucci
Rendering Volumes in a Vertex & Pixel Program by Ray Tracing    177
    Eli Z. Gottlieb
Normal Map Compression    185
    Jakub Klarowicz
Drops of Water and Texture Sprites    190
    Sylvain Lefebvre
Advanced Water Effects    207
    Kurt Pelzer
Efficient Evaluation of Irradiance Environment Maps    226
    Peter-Pike J. Sloan
Practical Precomputed Radiance Transfer    232
    Peter-Pike J. Sloan
Advanced Sky Dome Rendering    240
    Marco Spoerl and Kurt Pelzer
Deferred Shading with Multiple Render Targets    251
    Nicolas Thibieroz
Meshuggah's Effects Explained    270
    Carsten Wenzel
Layered Car Paint Shader    293
    John Isidoro, Chris Oat, and Natalya Tatarchuk
Motion Blur Using Geometry and Shading Distortion    299
    Natalya Tatarchuk, Chris Brennan, Alex Vlachos, and John Isidoro



Simulation of Iridescence and Translucency on Thin Surfaces    309
    Natalya Tatarchuk and Chris Brennan
Floating-point Cube Maps    319
    Arkadiusz Waliszewski
Stereoscopic Rendering in Hardware Using Shaders    324
    Thomas Rued
Hatching, Stroke Styles, and Pointillism    340
    Kevin Buchin and Maike Walther
Layered Fog    348
    Guillaume Werle
Dense Matrix Algebra on the GPU    352
    Ádám Moravánszky

Section III — Software Shaders and Shader Programming Tips    381
Software Vertex Shader Processing    383
    Dean P. Macri
x86 Shaders–ps_2_0 Shaders in Software    396
    Nicolas Capens
SoftD3D: A Software-only Implementation of Microsoft's Direct3D API    413
    Oliver Weichhold
Named Constants in Shader Development    432
    Jeffrey Kiel

Section IV — Image Space    437
Advanced Image Processing with DirectX 9 Pixel Shaders    439
    Jason L. Mitchell, Marwan Y. Ansari, and Evan Hart
Night Vision: Frame Buffer Post-processing with ps.1.1 Hardware    465
    Guillaume Werle
Non-Photorealistic Post-processing Filters in MotoGP 2    469
    Shawn Hargreaves
Image Effects with DirectX 9 Pixel Shaders    481
    Marwan Y. Ansari
Using Pixel Shaders to Implement a Mosaic Effect Using Character Glyphs    519
    Roger Descheneaux and Maurice Ribble


Mandelbrot Set Rendering    526
    Emil Persson
Real-Time Depth of Field Simulation    529
    Guennadi Riguer, Natalya Tatarchuk, and John Isidoro

Section V — Shadows    557
Soft Shadows    559
    Flavien Brebion
Robust Object ID Shadows    580
    Sim Dietrich
Reverse Extruded Shadow Volumes    587
    Renaldas Zioma

Section VI — 3D Engine and Tools Design    595
Shader Abstraction    597
    Tom Forsyth
Post-Process Fun with Effects Buffers    614
    Tom Forsyth
Shaders under Control (Codecreatures Engine)    625
    Oliver Hoeller
Shader Integration in the Gamebryo Graphics Engine    631
    Scott Sherman, Dan Amerson, Shaun Kime, and Tim Preston
Vertex Shader Compiler    650
    David Pangerl
Shader Disassembler    667
    Jean-Sebastian Luce

Index    675


Preface

After the tremendous success of Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks, I planned to do another book with an entirely new set of innovative ideas, techniques, and algorithms. The call for authors led to many proposals from nearly 80 people who wanted to contribute to the book. Some of these proposals featured introductory material and others featured much more advanced themes. Because of the large amount of material, I decided to split the articles into introductory pieces that are much longer but explain a lot of groundwork, and articles that assume a certain degree of knowledge. This idea led to two books:

ShaderX²: Introductions & Tutorials with DirectX 9
ShaderX²: Shader Programming Tips & Tricks with DirectX 9

The first book helps the reader get started with shader programming, whereas the second book (this one) features tips and tricks that an experienced shader programmer will benefit from.

As with Direct3D ShaderX, Javier Izquierdo Villagrán (nurbs1@jazzfree.com) prepared the drafts for the cover design of both books with in-game screen shots from Aquanox 2, which were contributed by Ingo Frick, the technical director of Massive Development.

A number of people have enthusiastically contributed to both books:

Wessam Bahnassi
Andre Chen
Muhammad Haggag
Kenneth L. Hurley
Eran Kampf
Brian Peltonen
Mark Wang

Additionally, the following ShaderX² authors proofread several articles each:

Dean Calver
Nicolas Capens
Tom Forsyth
Shawn Hargreaves
Jeffrey Kiel
Hun Yen Kwoon
Markus Nuebel
Michal Valient
Oliver Weichhold


These great people spent a lot of time proofreading articles, proposing improvements, and exchanging e-mails with the other authors and me. Their support was essential to the book development process, and their work led to the high quality of the books. Thank you!

Another big thank you goes to the people in the Microsoft Direct3D discussion group (http://DISCUSS.MICROSOFT.COM/archives/DIRECTXDEV.html). They were very helpful in answering my numerous questions.

As with Direct3D ShaderX, there were some driving spirits who encouraged me to start this project and hold on through the seven months it took to complete it:

Dean Calver (Eclipse)
Jason L. Mitchell (ATI Research)
Natasha Tatarchuk (ATI Research)
Nicolas Thibieroz (PowerVR)
Carsten Wenzel (Crytek)

Additionally, I have to thank Thomas Rued from DigitalArts for inviting me to the Vision Days in Copenhagen, Denmark, and for the great time I had there. I would like to thank Matthias Wloka and Randima Fernando from nVidia for lunch at GDC 2003. I had a great time.

As usual, the great team at Wordware made the whole project happen: Jim Hill, Wes Beckwith, Heather Hill, Beth Kohler, and Paula Price took over after I sent them hundreds of megabytes of data.

There were numerous other people involved in this book project whom I have not mentioned; I would like to thank them here. It was a pleasure working with so many talented people.

Special thanks goes to my wife, Katja, and our daughter, Anna, who spent a lot of evenings and weekends during the last seven months without me, and to my parents, who always helped me to believe in my strength.

— Wolfgang F. Engel

P.S.: Plans for an upcoming project named ShaderX³ are already in progress. Any comments, proposals, and suggestions are highly welcome (wolf@shaderx.com).


About the Authors

Dan Amerson
Dan graduated from North Carolina State University in 2001 with a bachelor's degree in computer science. During his undergraduate studies, he focused on artificial intelligence research for automated camera control and positioning. After graduation, Dan joined NDL in late 2001 to work on the NetImmerse and Gamebryo engines. He works primarily on console rendering technologies and most recently served as lead programmer for the Gamebryo shader demo Eturnum.

Marwan Y. Ansari (mansari@ati.com)
Marwan is a member of the 3D Application Research Group at ATI Research. He received a master's degree in computer science from the University of Illinois at Chicago and a bachelor of science degree in computer science and mathematics from DePaul University. Prior to moving to ATI's 3D Application Research Group, he worked on OpenGL drivers for Number Nine Visual Technology before joining ATI's Digital TV group. In addition to his image space contributions to ShaderX², Marwan has also contributed to Game Programming Gems 4 and spoken about real-time video processing using shaders at the Game Developers Conference.

Kristof Beets (kristof.beets@powervr.com)
Kristof took his first steps in the 3D world by running a technical 3D fan site, covering topics such as the differences between traditional and tile-based rendering technologies. This influenced his electrical engineering studies in such a way that he wrote his thesis about wavelet compression for textures in Direct3D, a paper that won the Belgian Barco Prize. He continued his studies, obtaining a master's degree in artificial intelligence, while working as a technical editor for Beyond3D and writing various technical articles about 3D hardware, effects, and technology. As a freelance writer he wrote the "FSAA Explained" document for 3Dfx Interactive to explain the differences between various types of full-screen anti-aliasing. This document resulted in a full-time job offer at 3Dfx. Currently he is working as a developer relations engineer for PowerVR Technologies, which includes research into new graphical algorithms and techniques.

Flavien Brebion (f.brebion@vrcontext.com)
Flavien has had a passion for video games since he got an Amstrad CPC at the age of 12. He still remembers typing in hundred-page listings just to see a small sprite appear on-screen. He studied computing science at the University of Nantes, France, where he graduated with both bachelor's and master's degrees in 2000. He has also done a lot of research and developed many small games and rendering engines on his own. Currently he works at VRcontext, a virtual reality company in Brussels, where he develops software designed to display industrial models made up of millions of triangles. He works on amateur games and graphical demos in his spare time, trying to get the most out of the new, powerful video cards. His web site is http://www.fl-tw.com/opengl/SoftShadows/.

Chris Brennan (cbrennan@ati.com)
Chris graduated with bachelor's degrees in computer science and electrical engineering from Worcester Polytechnic Institute in 1997 and joined Digital Equipment Corp.'s Workstation Graphics group doing hardware design and verification. When Digital died, Chris joined ATI as a 3D ASIC designer for the Radeon line of graphics chips and then moved over to the 3D Application Research Group where he tries to get those chips to do things that were not originally thought possible.

Kevin Buchin
Kevin received his master's degree from Hasso Plattner Institute for Software Engineering in Potsdam, Germany, in 2003. He wrote his thesis on real-time non-photorealistic terrain rendering. He has studied math, logic, and computer science in Muenster, Germany, and Leeds, England, and is involved in the 3D rendering engine VRS (www.vrs3d.org) and the 3D-map software system LandExplorer (www.landex.de).

Aaron Burton (aaron.burton@powervr.com)
Aaron has been a developer relations engineer at PowerVR Technologies since he received his Honours degree in information systems engineering in 1998. His first computer was a VIC 20, though his fascination for 3D graphics began with the Atari ST. At PowerVR he has been able to indulge this interest by developing a variety of demos, benchmarks, and debug/performance tools, and supporting developers in creating faster and better games. When he's not climbing, he spends his spare time working on ray-tracing and real-time 3D demos.

Dean Calver
Games are fun! Dean figured that out at age 2 and has spent the ensuing years working on how to make better games. For the last seven years, people have even paid him to do it. Having no real preference for console or PC has meant a mixed career switching between them for every project. Professionally, he has worked on a war game, racing games, an X-COM style game, arcade classic updates and the port of Silent Hill 2 to the PC. He is currently working on an Xbox RPG called Sudeki at Climax Solent.

Nicolas Capens (sw-shader.sourceforge.net)
Nicolas is a master's student in civil engineering in computer science in Ghent, Belgium. He became interested in graphics programming after discovering some Quake mods, and he quickly learned C++ and x86 assembly by himself. His main interest is software rendering and optimization. For more than two years he has been developing his own software renderer in his spare time. He is currently focusing on implementing shader emulation using the MMX and SSE instruction sets and dynamic code generation.

Francesco Carucci
Francesco has been a professional game programmer for three years and currently works on Black&White 2 for Lionhead Studios. He studied graphics programming-related subjects at university for five years before that. His passion for video games and 3D graphics helps him spend many sleepless nights after long days of writing shader code.

Roger Descheneaux
Roger has been working on 3D graphics since the late 1980s, and he has a vaguely uncomfortable feeling that he should somehow be better at it by now. In 1991 he graduated to working on 3D graphics device drivers for IBM. The first driver he worked on was for a five-card graphics solution that sold for $30,000 and couldn't do texture mapping. The graphics hardware is slightly faster and somewhat cheaper these days. He currently works on OpenGL device drivers for ATI Research in Marlborough, Massachusetts, for graphics chips that can definitely do texture mapping.

Sim Dietrich
Sim manages the U.S. Technical Developer Relations team at nVidia Corporation. Sim has written chapters for Game Programming Gems 1 and 2 and served as editor of the Graphics Display section of Gems 2. He was a key contributor to the CgFX effort, bringing real-time shaders to Max, Maya, and SoftImage for the first time. Sim's interests include new shadow techniques and improving graphics workflow through efforts like Cg and CgFX.

Wolfgang F. Engel (wolfgang.engel@shaderx.com)
Wolfgang is the editor of ShaderX²: Introductions & Tutorials with DirectX 9, the editor and a co-author of Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks, the author of Beginning Direct3D Game Programming, and a co-author of OS/2 in Team, for which he contributed the introductory chapters on OpenGL and DIVE. He spoke at GDC 2003 and at Vision Days 2003 in Copenhagen, Denmark. He has published articles in German journals and on www.gamedev.net, www.gamasutra.com, and his own web site, www.direct3d.net. During his career in the game industry he built up two small game development units.

Tom Forsyth (tomf@muckyfoot.com)
Tom has been obsessed by 3D graphics since seeing Elite on his ZX Spectrum. Since then he has always tried to make hardware beg for mercy. Tom has written triangle-drawing routines on the Spectrum, Sinclair QL, Atari ST, Sega 32X, Saturn, Dreamcast, PC, GamePark32, and Xbox, and he's getting quite good at them now. Tom's coding past includes writing curved-surface stuff for Sega and graphics drivers for 3Dlabs. Currently he works in Guildford, England, at Mucky Foot Productions, where past projects include Urban Chaos, StarTopia, and Blade2.

Eli Z. Gottlieb
Eli is a self-taught programmer attending ninth grade at Bethlehem Central High School in Delmar, New York.

Matthew Halpin
Matthew started programming before he was 10 and has continued to hold an interest in the areas of graphics and physics. Starting with 2D vector and sprite rendering, he quickly moved onto software 3D rendering and later 3D hardware accelerated rendering combined with rigid body and particle physics systems. He has been working in the games industry since receiving a BA in computer science from Cambridge University.

Shawn Hargreaves
After finishing a degree in music, Shawn has been writing games for the last six years, most recently as lead programmer on Climax's MotoGP bike racing game. Having started out coding 2D graphics by hand in DOS (where he created the popular Allegro library, http://www.talula.demon.co.uk/allegro/index.html) and then spending time on the N64 and PS2, he is still in awe of the sorts of things that are possible with programmable shaders on Xbox and modern PC cards.

Evan Hart
Evan is a software engineer with ATI's Application Research Group where he works on technology evangelism and adoption. He is a graduate of Ohio State University.

O'dell Hicks
O'dell has been a professional game programmer since 1998 and a hobbyist several years longer than that. He has done work on both the PC and Xbox. One day he hopes to finish a game that he is working on by himself in his spare time. His web site can be found at http://odellworld.com/.

Oliver Hoeller
Oliver currently works as senior engine programmer at Piranha Bytes, which developed the RPGs Gothic I and II. He started programming at age 10 on his Commodore VIC20, working his way through 6502 (VIC20), 6510 (C64), and 68000 (Amiga) Assembler. His first game project, at age 15, was a jump and run game named Platou (Kingsoft, C64). He was an active member of the German demo scene in the '80s and early '90s. After a detour — during which he developed music software, created a security program, and worked as a consultant for web services — Oliver returned to his roots and developed his first 3D engine (Warrior Engine, 1995-98). He was lead programmer and director of development at H2Labs/Codecult and was responsible for development of the Codecreatures game system.

Takashi Imagire
Takashi has been a professional game programmer for five years, mainly working with the PlayStation and PlayStation2. Currently, he is programming real-time 3D graphics in his spare time, while focusing on the newest shader technology. A number of articles and demos on shader programming can be found on his web site at http://www.t-pot.com/. His goal is to publish his demos immediately after the release of new shader technology.

John Isidoro
John is a member of the 3D Application Research Group at ATI Technologies and a graduate student at Boston University. His research interests are in the areas of real-time graphics, image-based rendering, and machine vision.

Greg James
Greg is a software engineer with nVidia's technical developer relations group where he develops tools and demos for real-time 3D graphics. Prior to this, he worked for a small game company and as a research assistant in a high-energy physics laboratory. He is very glad to have avoided graduate school, and even happier to be working in computer graphics, which he picked up as a hobby after his father brought home a strange beige Amiga 1000.

Jeffrey Kiel
Jeff started his work in graphics as an undergrad at the University of North Carolina doing volume rendering research. After a stint in the corporate world, he moved on to work at Interactive Magic as a lead programmer on Destiny (one of the first 3D strategy games), iF18, and WarBirds. Then he joined Sinister Games to work on Shadow Company (3D squad-based strategy game) and the Dukes of Hazzard I and II on PS1. Jeff returned to his passion for graphics by joining nVidia, where he has worked on a couple of 3D engines, incorporating shader technology into real-world applications. His shader experience covers standard transform/lighting/shading, special effects, mesh animation, and particle systems.

Shaun Kime
Shaun is a software engineer at NDL where he is the lead developer on the 3ds max tools pipeline. Prior to working at NDL, he worked on the Mimesis project at North Carolina State University doing research on integrating narrative planning into virtual worlds. When he isn't at work, he can be found reviewing local pubs at http://www.drinktheworld.com.

Jakub Klarowicz
Jakub is an engine programmer at Techland where he works on all low-level aspects of game engine development. His biggest interest is, of course, real-time 3D graphics. He received an MS in computer science from Wroclaw University of Technology in 2001, and has been programming computers since he was 10. Jakub always wanted to push hardware to its limits so he started learning assembler while his friends were still playing games. In his work with 3D graphics, Jakub has gone all the way from software rendering to shader programming. He has been playing with hardware-accelerated rendering for five years, using Glide, OpenGL, and Direct3D. For the last three years he has worked with 3D graphics professionally.

Jesse Laeuchli
Jesse is a self-taught programmer who now makes his home in Budapest, Hungary. As the child of a Foreign Service officer, he has lived in such places as China, Taiwan, Africa, and Saudi Arabia. He has written for several computer magazines, books, and web sites, and is also an avid epee fencer. His web site is www.laeuchli.com/jesse/.

Sylvain Lefebvre
Sylvain is a Ph.D. student in the iMAGIS team at the French National Institute for Research in Computer Science, working on the rendering of natural scenes. He is also interested in many aspects of game programming and real-time graphics. He is currently focusing on developing new approaches with vertex and pixel shaders to handle the complexity of natural scenes. His home page is at http://www.aracknea.net.


Jean-Sebastian Luce
Jean-Sebastian has been a professional game programmer specializing in computer graphics for three years in the Nadeo studio where he worked on the games Virtual Skipper 1 and 2. He is currently working on improving their graphic engine quality by using more complex shaders for the recent games TrackMania and Virtual Skipper3. He has also studied applied mathematics, computer science, and image synthesis in a French National Institute (ENSIMAG).

Dean Macri
Dean is a software engineer with Intel Corporation where he works with software developers in optimizing the processor-specific aspects of their titles. He wrote his first graphics application, a line and circle drawing program, in TMS9900 assembly language in 1984 on a Texas Instruments 99/4A. Since then he's been hooked on graphics and programming, majoring in computer science as both an undergraduate and a graduate student. Starting in 1992, he spent five years developing high-speed assembly routines for 2D graphics transition effects at a multimedia kiosk development company. In 1998 he joined Intel where he continues to evangelize the benefits of new processors and technologies to software developers and provide their feedback to the processor architects.

Jason L. Mitchell (JasonM@ati.com)
Jason is the team lead of the 3D Application Research Group at ATI Research, makers of the Radeon family of graphics processors. Jason has worked with Microsoft on the Microsoft campus in Redmond for several years to define key new Direct3D features. Prior to working at ATI, Jason did work in human eye tracking for human interface applications at the University of Cincinnati, where he received his master's degree in electrical engineering in 1996. He received a bachelor's degree in computer engineering from Case Western Reserve University in 1994. In addition to this book's article on advanced image processing, Jason wrote about HLSL programming in ShaderX²: Shader Programming Tips & Tricks with DirectX 9, and has written for the Game Programming Gems books, Game Developer magazine, Gamasutra.com, and academic publications on graphics and image processing. He regularly presents at graphics and game development conferences around the world. His home page can be found at http://www.pixelmaven.com/jason/.

Ádám Moravánszky
Ádám is a recent graduate of the Swiss Federal Institute of Technology. After finishing his thesis in the field of real-time 3D graphics, he co-founded NovodeX (www.novodex.com), a company providing game physics middleware, where he is the chief software architect.

Christopher Oat
Christopher is a software engineer in the 3D Application Research Group at ATI, where he explores novel rendering techniques for real-time 3D graphics applications. His focus is on pixel and vertex shader development for current and future graphics platforms. Christopher has contributed as an original member of the RenderMonkey development team and as a shader programmer for ATI's demos and screen savers. He has been published in Game Programming Gems 3 (Charles River Media, 2002) and Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks (Wordware, 2002). Christopher is a graduate of Boston University.

David Pangerl
David's addiction to computers and games started early in his life, and the vision to create virtual worlds continues to be a strong force in his life. He has been involved in the production of several games, including Crash, Casanova, Hitchcock, Hannibal, and most recently Mistmare. His main interests are computer graphics, artificial intelligence, and compilers.

Kurt Pelzer
As a senior programmer at Codecult, Kurt developed several real-time simulations and technology demos built on CC's high-end 3D engine Codecreatures (e.g., a launch demo for nVidia's GeForce4 Ti generation and the well-known Codecreatures-Benchmark-Pro). He designed the innovative fx systems of Codecreatures and was involved in creating a simulation of the Shanghai TRANSRAPID track for SIEMENS AG. Kurt also worked on Piranha Bytes' PC game Gothic and the top-selling Gothic II — which were named RPG of the Year in Germany in 2001 and 2002. In prehistoric times Kurt started programming on C64 and Atari's ST; later on he studied mathematics, always focusing on computer graphics. When he's not scribbling down equations or reading the book of seven seals, Kurt works at Piranha Bytes to guarantee a high level of visual quality for the company's future products.

Emil Persson
Emil recently graduated from Luleå University of Technology in Northern Sweden after studying computer science and engineering. Over the years Emil has gathered experience from early software rendering attempts to advanced techniques in the Glide, OpenGL, and Direct3D APIs. His web site at http://esprit.campus.luth.se/~humus/ focuses on real-time 3D graphics. In the future you'll probably find Emil working as a game developer on the next generation of game engines.

Tim Preston
Tim is a software engineer working on the Direct3D sections of the Gamebryo game engine at NDL. He graduated from Princeton University in 1997 with a degree in chemistry and a desire to do pretty much anything but chemistry. He went to the University of North Carolina for a master's in computer science, where he did a lot of molecular modeling work that led to an interest in 3D graphics. When he graduated in 1999, the game industry was a good match for his experience and his goal of not doing anything too important.

Maurice Ribble
Maurice graduated in 2001 from the Milwaukee School of Engineering with a bachelor's degree in computer engineering. During his junior year he had the opportunity to take part in a summer internship at Los Alamos National Labs. He was somewhat disappointed that other people worked on million-dollar workstations while he worked on consumer-level hardware, but after writing an application that performed lighting calculations for volume textures on first-generation consumer fragment shader hardware, he realized that consumer-level hardware was in for exciting changes, and he wanted to be part of the action. He currently works on the OpenGL device driver team at ATI Research.

Guennadi Riguer
Guennadi is a software developer at ATI Technologies, where he is helping game engine developers to adopt new graphics technologies. Guennadi holds a degree in computer science from York University and previously studied at Belorussian State University of Computing and Electronics. He began programming in the mid-80s and worked on a wide variety of software development projects prior to joining ATI.

Thomas Rued (rued@digitalarts.dk)
Thomas started his programming career at the local mall in 1983, doing small graphics programs in BASIC until an angry salesperson turned the computer off and he had to start all over. Later he programmed multimedia programs for InterVision in assembler and Pascal. Then he decided that education was in order and earned a degree in computer science. He moved on to Interactive Vision for several years, where he was a senior software engineer and worked on 3D applications plus the in-house frameworks for game development using C++ and DirectX. Currently Thomas works at Digital Arts (www.digitalarts.dk) where he focuses on high-end 3D visualization stuff in real time using modern 3D hardware. In his spare time he is the co-coordinator of the Danish IGDA chapter.

Scott Sherman
Scott is a software engineer at NDL where he is the lead on the Xbox version of their graphics engine. After receiving degrees in physics and electrical engineering, a short stint in the hardware side of the computer industry led to doing on-air statistics and scoring systems programming for sporting event broadcasts. Once the excitement of live television wore off, he moved over to the field of game programming, and is currently focused on real-time 3D graphics.

Peter-Pike Sloan
Peter-Pike currently works on D3DX at Microsoft. Prior to that he worked in the Microsoft Research Graphics Group, the Scientific Computing and Imaging group at the University of Utah, PTC, and Evans & Sutherland. His primary research interests revolve around interactive graphics techniques. Most of his publications are available at http://research.microsoft.com/~ppsloan.

Marco Spoerl (http://www.marcospoerl.com)
Like just about everyone else, Marco started programming way back on a C64. After buying a PC just so he could play Doom, he learned about computer graphics. He started his professional career as an engine programmer at Codecult Software, working on the Codecreatures Game Development System and the Codecreatures Benchmark Pro. After receiving his diploma in computer science, and a short walk on the wild side as a freelance software developer, he's now working in the training and simulation department at Munich-based Krauss-Maffei Wegmann.

Natalya Tatarchuk (Natasha@ati.com)
Natalya is a software engineer working in the 3D Application Research Group at ATI Research, where she is the programming lead for the RenderMonkey IDE project. She has worked in the graphics industry for more than six years, working on 3D modeling applications and scientific visualization prior to joining ATI. Natalya graduated from Boston University with a bachelor's degree in computer science, a bachelor's degree in mathematics, and a minor in visual arts.

Nicolas Thibieroz (nicolas.thibieroz@powervr.com)
Like many kids of his generation, Nicolas discovered video games on the Atari VCS 2600. He quickly became fascinated by the mechanics behind those games, and started programming on C64 and Amstrad CPC before moving on to the PC world. Nicolas realized the potential of real-time 3D graphics while playing Ultima Underworld. This game inspired him in such a way that both his school placement and final year project were based on 3D computer graphics. After obtaining a bachelor's degree in electronic engineering in 1996, he joined PowerVR Technologies where he is now responsible for developer relations. His duties include supporting game developers, writing test programs and demos, and generally keeping up to date with the latest 3D technology.

Alex Vlachos (http://alex.vlachos.com)
Alex is a staff engineer in the 3D Application Research Group at ATI, where he has worked since 1998 focusing on 3D engine development as the lead programmer for ATI's Demo Team. He developed N-Patches (a curved surface representation introduced in Microsoft's DirectX 8), also known as PN Triangles, and TRUFORM. He has published in Game Programming Gems 1, 2, and 3, ACM Symposium on Interactive 3D Graphics (I3DG), and Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks. He has presented at Microsoft Meltdown Seattle and UK, I3DG, GDC, and GDC Europe. Alex is a graduate of Boston University.

Daniel Wagner (daniel@ims.tuwien.ac.at)
Daniel has been fascinated by programming computer graphics since he got his first PC in 1991. In 1995 he developed the software SimLinz for the Ars Electronica Center (museum of the future) in Linz, Austria. During his studies he worked for Reality2, a company that created virtual reality software. After finishing his master's thesis, "EndoView: A System for Fast Virtual Endoscopic Rendering and Registration" in summer 2001, he worked as a lead developer for BinaryBee, a company developing arcade-style web games. Daniel is currently working on his Ph.D. thesis on augmented reality at the Interactive Media Systems Group at the Vienna University of Technology.

Arkadiusz Waliszewski
Arkadiusz holds a master's degree in computer science from Poznan University of Technology and is currently a software engineer in Poland. He started his adventure with computer graphics when he got his first computer (Atari 65XE) and has become addicted. Besides real-time computer graphics, he is also interested in object-oriented programming and design. He likes good movies, dry wine, and big fluffy carpet slippers.

Maike Walther
Maike's research interests lie in computational and cognitive aspects of computer depiction. She has studied mathematics, logic, computer science, and psychology at the universities of Muenster, Germany, and Leeds, England. Maike graduated in 2003 from the Hasso Plattner Institute in Potsdam, Germany after writing her master's thesis on computer graphics and algorithms for real-time non-photorealistic rendering of 3D city models. She is currently developing for the Virtual Rendering System (www.vrs3d.org).

Oliver Weichhold
Oliver has been a programmer and developer on a number of projects, including a software implementation of the Direct3D pipeline.

Carsten Wenzel (carsten@crytek.de)
Carsten has been passionate about computer graphics ever since he got a hold of intros and demos for Amiga and PC. Although he's never really been active in the demo scene, it's always been a big inspiration for him. As a 3D programmer at Totally Games, he developed many of the pixel and vertex shaders used for special effects in an Xbox game. At that time he also wrote a tech demo for nVidia's GeForce3. His latest demo, Meshuggah, was released in spring 2002, and he received his master's degree in computer science in December 2002. He currently works at Crytek.

Guillaume Werle (guille@free.fr)
Guillaume is a 26-year-old graphics engineer at Montecristo (www.montecristogames.com). He joined the R&D department team last year where he is working on the next-generation 3D engine. In the game industry since 1998, he has done two PlayStation games for Infogrames and one PC game for Montecristo. Despite the little spare time he has, he is still an active demoscener (http://cocoon.planetd.net). His last demo, Raw Confessions, was nominated for the Demoscene Awards (http://awards.scene.org/) in the "Best Demo" category and won the "Best Graphics" award.

Renaldas Zioma
Renald Zioma has been driven (mad) by computer graphics since he saw the ZX Spectrum. After learning assembly and writing a Tetris clone for his ZX, he switched to PCs, finished school, wrote a couple of small non-commercial games, gained experience with object-oriented programming and design while working at a software development company, and received a bachelor's degree in computer science from Kaunas University of Technology. He has been working as a professional game programmer for the last two years. Recently he finished a demo of a 3D fighting game based on real-time motion recognition for Interamotion, LLC. In his spare time, he programs demos and games and organizes small demo/game scene related events in Lithuania.


Introduction

This book is a collection of articles that discuss ways to use vertex and pixel shaders to implement a variety of effects. The following provides a brief overview of these articles:

Section I — Geometry Manipulation Tricks

This section starts with a DirectX 9 sequel to Dean Calver's vertex compression article in Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks. Dean shows a number of ways to reduce vertex throughput by compressing vertex data. Carsten Wenzel points out how to use lookup tables in vertex shaders to reduce the workload of the vertex shader hardware. A feature-complete and very hardware-friendly terrain engine is explained in Daniel Wagner's article, "Terrain Geomorphing in the Vertex Shader." The speed of the example program provided with source is impressive. Creating 3D planets for a space-shooter type of game can be done entirely on the GPU, which Jesse Laeuchli shows how to do in his article "3D Planets on the GPU."

The vs_3_0 vertex shader model has a feature called vertex texturing, which Kristof Beets uses to create a very realistic-looking cloth animation in his article "Cloth Animation with Pixel and Vertex Shader 3.0." In "Collision Shaders," Takashi Imagire, who is known for the example programs on his web site (www.t-pot.com), uses shaders to calculate collisions, something that has never been shown before. The final article in this section covers using displacement mapping as a method of geometry compression. The main aim of Tom Forsyth's article is to allow people to take data from the industry's current mesh and texture authoring pipelines, and to derive displacement map data from them.
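
Dean Calver's article covers the compression schemes in detail; purely as an illustration of the general idea (and not his specific method), a vertex shader can reconstruct positions that the application quantized to 16-bit integers by applying a per-mesh scale and bias held in constant registers. The constant layout and names in this DirectX 9 HLSL sketch are assumptions made for the example:

    // Assumed constant layout: c0-c3 world-view-projection matrix,
    // c4/c5 per-mesh decompression scale and offset computed by the
    // application from the mesh's bounding box.
    float4x4 matWorldViewProj : register(c0);
    float4   posScale         : register(c4);
    float4   posBias          : register(c5);

    struct VS_INPUT
    {
        // Position quantized to integers by the application (e.g., SHORT4).
        float4 packedPos : POSITION;
    };

    float4 DecompressVS(VS_INPUT In) : POSITION
    {
        // Undo the quantization: scale and bias back into object space.
        float4 pos = In.packedPos * posScale + posBias;
        pos.w = 1.0f;
        return mul(pos, matWorldViewProj);
    }

A D3DDECLTYPE_SHORT4 vertex element arrives in the shader as four floats holding the integer values, so a single multiply-add is enough to recover the original positions.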

Section II — Rendering Techniques

The section starts with an article by Greg James that presents a convenient and flexible technique for rendering ordinary polygon objects of any shape as thick volumes of light scattering or light absorbing material with ps_1_3. O'dell Hicks shows in his article, "Screen-aligned Particles with Minimal VertexBuffer Locking," how to create screen-aligned particles with a vertex shader, bringing us one step closer to the goal of having almost everything done by the GPU. "Hemisphere Lighting with Radiosity Maps," written by Shawn Hargreaves, shows a lighting model that was designed for fast moving objects in outdoor environments. Its goals are to tie in the moving objects with their surroundings, to convey a sensation of speed, and to be capable of rendering large numbers of meshes at a good frame rate on first-generation shader hardware. The companion movie on the CD includes jaw-dropping effects.
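
As a rough illustration of the screen-aligned particle idea (a generic sketch, not necessarily the buffer-locking strategy from the article), each corner vertex can carry the particle center plus a 2D corner offset, and the vertex shader expands the quad in view space so it always faces the camera. Names and register assignments here are assumed:

    float4x4 matWorldView : register(c0);   // assumed constant layout
    float4x4 matProj      : register(c4);

    struct VS_IN
    {
        float3 center : POSITION;    // particle center, shared by all four corners
        float2 corner : TEXCOORD0;   // (-1,-1) .. (1,1) corner offset
        float  size   : TEXCOORD1;   // particle radius
    };

    struct VS_OUT
    {
        float4 pos : POSITION;
        float2 uv  : TEXCOORD0;
    };

    VS_OUT ParticleVS(VS_IN In)
    {
        VS_OUT Out;
        // Transform the center into view space, then offset in the view-space
        // XY plane so the quad is always aligned with the screen.
        float4 viewPos = mul(float4(In.center, 1.0f), matWorldView);
        viewPos.xy += In.corner * In.size;
        Out.pos = mul(viewPos, matProj);
        Out.uv  = In.corner * 0.5f + 0.5f;
        return Out;
    }

Because the expansion happens in the vertex shader, the vertex buffer holding the corner offsets never needs to be rewritten when the camera moves.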

Jesse Laeuchli has contributed two additional articles. In "Galaxy Textures," he uses a procedural model to generate easy-to-vary galaxies that can be implemented almost entirely on hardware using pixel shaders. "Turbulent Sun" demonstrates how to implement a sun using a 3D noise function. The example program runs solely on the GPU using shaders. A complete implementation of Phong lighting, together with a cube shadow mapping implementation, is shown in Emil Persson's article, "Fragment-level Phong Illumination." Getting a nicely distributed specular reflection on ps_1_1 hardware is a challenge, but Matthew Halpin shows a new and very efficient way to achieve this in "Specular Bump Mapping on Pre-ps_1_4 Hardware." With the advent of pixel shader 3_0, graphics hardware has become capable of rendering hardware-accelerated voxels. Aaron Burton's article, "Rendering Voxel Objects with PS_3_0," shows how to implement real voxels on third-generation graphics hardware. Current DirectX 9 hardware is not capable of alpha-blending between floating-point render targets, but Francesco Carucci shows a way to simulate alpha-blending on this hardware in his article, "Simulating Blending Operations on Floating-point Render Targets."

Eli Z. Gottlieb's article, "Rendering Volumes in a Vertex & Pixel Program by Ray Tracing," shows how to render volumes by using ray tracing and a volume texture on ps_2_x hardware. Using bump maps to create bump mapping effects increases the amount of data necessary in memory. Jakub Klarowicz's article, "Normal Map Compression," shows how to compress bump maps with a common DXT format. Sylvain Lefebvre discusses how to implement pattern-based procedural textures in "Drops of Water and Texture Sprites." These kinds of textures are not procedural in the sense of classic marble or wood textures, but they combine explicit textures (patterns) in order to create a larger texture with the desired appearance. Kurt Pelzer explains how to implement a realistic water simulation that is extensively usable in his article "Advanced Water Effects." If you ever wondered how this was done in the CodeCreatures engine, look no further.

Peter-Pike Sloan uses irradiance environment maps to render diffuse objects in arbitrary lighting environments in "Efficient Evaluation of Irradiance Environment Maps." He presents a method that uses spherical harmonics to efficiently represent an irradiance environment map, which is more efficient to compute and uses fewer resources than diffuse cube maps. In a second article, "Practical Precomputed Radiance Transfer," Peter-Pike Sloan shows how to use precomputed radiance transfer to illuminate rigid objects in low-frequency lighting environments with global effects like soft shadows and inter-reflections. These results are achieved by running a lengthy preprocess that computes how light is transferred from the source environment to exit radiance at a point. Marco Spoerl and Kurt Pelzer discuss how to render advanced sky domes in "Advanced Sky Dome Rendering." This article describes the implementation of a basic vertex color sky dome, which computes the correct position of both the sun and the moon depending on time of day, changes its color depending on the position of the sun, renders a projection of the sun at its correct position, and renders a projection of the moon at its correct position including the moon's current phase.
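
To give a feel for why the spherical harmonic representation is cheap to evaluate at run time, the HLSL sketch below uses the well-known quadratic irradiance formula of Ramamoorthi and Hanrahan with nine RGB coefficients stored in vertex shader constants. This is a generic illustration, not necessarily the exact formulation used in Sloan's article, and the register layout is assumed:

    // Nine RGB spherical harmonic coefficients of the lighting environment
    // (L00, L1-1, L10, L11, L2-2, L2-1, L20, L21, L22), set by the application.
    float3 L00  : register(c20);
    float3 L1m1 : register(c21);
    float3 L10  : register(c22);
    float3 L11  : register(c23);
    float3 L2m2 : register(c24);
    float3 L2m1 : register(c25);
    float3 L20  : register(c26);
    float3 L21  : register(c27);
    float3 L22  : register(c28);

    // Irradiance for a unit normal n, after Ramamoorthi and Hanrahan.
    float3 SHIrradiance(float3 n)
    {
        const float c1 = 0.429043f, c2 = 0.511664f,
                    c3 = 0.743125f, c4 = 0.886227f, c5 = 0.247708f;
        return c4 * L00
             + 2.0f * c2 * (L11 * n.x + L1m1 * n.y + L10 * n.z)
             + c1 * L22 * (n.x * n.x - n.y * n.y)
             + c3 * L20 * n.z * n.z - c5 * L20
             + 2.0f * c1 * (L2m2 * n.x * n.y + L21 * n.x * n.z + L2m1 * n.y * n.z);
    }

A vertex shader can call this function with the world-space normal and modulate the result with the material's albedo.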



Nicolas Thibieroz shows how to implement deferred shading in "Deferred Shading with Multiple Render Targets." Contrary to traditional rendering algorithms, deferred shading submits the scene geometry only once and stores per-pixel attributes into local video memory to be used in the subsequent rendering passes. Carsten Wenzel explains how he created the effects in his Meshuggah demo in "Meshuggah's Effects Explained." It is impressive what he has done on DirectX 8.1-capable hardware and on the Xbox. John Isidoro, Chris Oat, and Natalya Tatarchuk explain how they created a two-tone, suspended microflake car paint shader in "Layered Car Paint Shader." Motion blur effects as shown in the Animusic demo Pipe Dream are described in "Motion Blur Using Geometry and Shading Distortion" by Natalya Tatarchuk, Chris Brennan, Alex Vlachos, and John Isidoro. "Simulation of Iridescence and Translucency on Thin Surfaces" by Natalya Tatarchuk and Chris Brennan focuses on simulating the visual effect of translucency and iridescence of thin surfaces such as butterfly wings.
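
The heart of a deferred renderer is a geometry pass that writes several per-pixel attributes in one go. Below is a minimal ps_2_0-level HLSL sketch of such a pass; the G-buffer layout (albedo, packed normal, eye-space depth) is an assumption for illustration and not the layout from Thibieroz's article:

    struct PS_OUT
    {
        float4 albedo : COLOR0;   // diffuse color
        float4 normal : COLOR1;   // world-space normal packed into 0..1
        float4 depth  : COLOR2;   // eye-space depth
    };

    sampler diffuseMap : register(s0);

    PS_OUT GBufferPS(float2 uv       : TEXCOORD0,
                     float3 normal   : TEXCOORD1,
                     float  eyeDepth : TEXCOORD2)
    {
        PS_OUT Out;
        Out.albedo = tex2D(diffuseMap, uv);
        Out.normal = float4(normalize(normal) * 0.5f + 0.5f, 0.0f);
        Out.depth  = float4(eyeDepth, 0.0f, 0.0f, 0.0f);
        return Out;
    }

A later lighting pass then reads these render targets as textures and performs all shading in image space.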

Arkadiusz Waliszewski describes in "Floating-point Cube Maps" how to use floating-point cube maps to get a much more visually pleasing cube mapping effect. Thomas Rued compares three different kinds of stereoscopic rendering and provides shader implementations for each of them in his article "Stereoscopic Rendering in Hardware Using Shaders." The article "Hatching, Stroke Styles, and Pointillism" by Kevin Buchin and Maike Walther shows how to implement hatching by combining strokes into a texture. These compositions of strokes can convey the surface form through stroke orientation, the surface material through stroke arrangement and style, and the effect of light on the surface through stroke density. Guillaume Werle explains a technique that achieves a realistic-looking layered fog in "Layered Fog." It computes the height on a per-vertex basis and uses the texture coordinate interpolator to get per-pixel precision. Ádám Moravánszky's article, "Dense Matrix Algebra on the GPU," shows how to use shaders to solve two common problems in scientific computing: solving systems of linear equations and linear complementarity problems. Both of these problems come up in dynamics simulation, which is a field drawing increasing interest from the game developer community.
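
The per-vertex height idea behind the "Layered Fog" article mentioned above can be sketched as a vertex/pixel shader pair; this is a simplified illustration with assumed constants (fog top height and thickness), not Werle's actual implementation:

    // Vertex shader: compute a height-based fog amount and pass it to the
    // pixel shader through a texture coordinate interpolator.
    float4x4 matWorldViewProj : register(c0);
    float4x4 matWorld         : register(c4);
    float4   fogParams        : register(c8);   // x = fog top height, y = 1 / fog thickness

    struct VS_OUT
    {
        float4 pos : POSITION;
        float2 uv  : TEXCOORD0;
        float  fog : TEXCOORD1;    // 0 = no fog, 1 = fully fogged
    };

    VS_OUT FogVS(float4 pos : POSITION, float2 uv : TEXCOORD0)
    {
        VS_OUT Out;
        Out.pos = mul(pos, matWorldViewProj);
        Out.uv  = uv;
        float worldHeight = mul(pos, matWorld).y;
        Out.fog = saturate((fogParams.x - worldHeight) * fogParams.y);
        return Out;
    }

    // Pixel shader: blend the lit color toward the fog color per pixel.
    sampler baseMap  : register(s0);
    float4  fogColor : register(c0);

    float4 FogPS(float2 uv : TEXCOORD0, float fog : TEXCOORD1) : COLOR
    {
        float4 color = tex2D(baseMap, uv);
        return lerp(color, fogColor, fog);
    }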

Section III — Software Shaders and Shader Programming Tips

Dean Macri's article, "Software Vertex Shader Processing," explores optimization guidelines for writing shaders that will use the software vertex processing pipeline. Additionally, the techniques described in this article should also apply to vertex shaders written for graphics hardware. Emulating pixel shaders efficiently on the CPU might be the first step in writing a software 3D engine with shader support that runs only on the CPU. In "x86 Shaders-ps_2_0 Shaders in Software," Nicolas Capens shows how to create a fast-performing software emulation of ps_2_0 shaders by using a run-time assembler. Oliver Weichhold has created a software implementation of the Direct3D pipeline. His article, "SoftD3D: A Software-only Implementation of Microsoft's Direct3D API," describes how he did it. Jeffrey Kiel shows a very handy trick for using named constants in shader development in "Named Constants in Shader Development."


Section IV — Image Space

Jason L. Mitchell, Marwan Y. Ansari, and Evan Hart describe in their article "Advanced Image Processing with DirectX 9 Pixel Shaders" how to perform color space conversion, edge detection with the Canny filter, separable Gaussian and median filtering, and a real-time implementation of the Fast Fourier Transform with ps_2_0 shaders. The article "Night Vision: Frame Buffer Post-processing with ps.1.1 Hardware" describes how to implement an efficient night view on ps_1_1 hardware. Guillaume Werle uses a three-step approach to achieve this, first rendering the scene into a texture, converting this texture to grayscale, and using the luminance value of each pixel as the index into a gradient texture. Shawn Hargreaves shows the non-photorealistic post-processing filters he used in the game MotoGP 2 for ps_1_1 hardware and the Xbox in "Non-Photorealistic Post-processing Filters in MotoGP 2."
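
The three night-vision steps described above translate almost directly into a post-processing pixel shader. The sketch below expresses the idea at ps_2_0 level for readability; the article itself targets ps_1_1 hardware, where the dependent texture read has to be set up differently, and the sampler assignments here are assumptions:

    sampler sceneTex    : register(s0);   // the scene previously rendered to a texture
    sampler gradientTex : register(s1);   // 1D gradient ramp (dark to bright green)

    float4 NightVisionPS(float2 uv : TEXCOORD0) : COLOR
    {
        // Step 2: convert the scene color to a luminance (grayscale) value...
        float3 scene = tex2D(sceneTex, uv).rgb;
        float  lum   = dot(scene, float3(0.299f, 0.587f, 0.114f));
        // Step 3: ...and use it as the index into the gradient texture.
        return tex1D(gradientTex, lum);
    }

Step 1, rendering the scene into the source texture, is done by the application before this full-screen pass runs.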

Marwan Y. Ansari discusses in his article "Image Effects with DirectX 9 Pixel Shaders" how to achieve transition, distortion, and posterization image effects in a video shader. Roger Descheneaux and Maurice Ribble show how to achieve a mosaic-like effect via post-processing in "Using Pixel Shaders to Implement a Mosaic Effect Using Character Glyphs." The article "Mandelbrot Set Rendering" by Emil Persson shows how to implement a Mandelbrot set in a ps_2_0 pixel shader. Guennadi Riguer, Natalya Tatarchuk, and John Isidoro present two variations of a two-pass approach for depth of field simulation in their article "Real-Time Depth of Field Simulation." In both variations, the scene is rendered in the first pass with some additional information such as depth, and in the second pass some filters are run to blur the result from the first pass.

Section V — Shadows

In the article "Soft Shadows" by Flavien Brebion, a soft shadows algorithm that works as an extension of the shadow volumes algorithm is explained. This is achieved by using two volumes, the first from the standard point light (inner volume) and the second from a jittered point light position (outer volume). This second volume defines the outer contour of the penumbra. The inner and outer volumes are next rendered to the shadow map, each in one color component channel, and then blurred. Sim Dietrich shows in "Robust Object ID Shadows" how to prevent the depth aliasing problem of shadow maps by using object IDs instead of storing depth in the light view texture. In his article "Reverse Extruded Shadow Volumes," Renaldas Zioma suggests a solution for dealing with shadowing artifacts using stenciled shadow volumes that allow proper self-shadowing while using occluder geometry.

Section VI — 3D Engine and Tools Design

Tom Forsyth shows in "Shader Abstraction" how to abstract shaders by specifying a description of an ideal shader, but then in code the shader is allowed to degrade gracefully in quality according to both platform and distance from the camera. In an additional article, Tom Forsyth discusses how to generalize many of the common effects in current games into a unified framework, where multiple effects can be added, tried out, and combined at run time without replicating shared code, in order to keep speed and memory use optimal when only a few of the effects are visible. The article "Shaders under Control (Codecreatures Engine)" by Oliver Hoeller describes the base architecture used in the Codecreatures engine. Scott Sherman, Dan Amerson, Shaun Kime, and Tim Preston describe how they integrated shaders into the Gamebryo Engine. A complete high-level programming language vertex shader compiler with source is given in David Pangerl's article "Vertex Shader Compiler." The final article in this book, "Shader Disassembler," by Jean-Sebastian Luce covers the creation of a shader disassembler that can disassemble all available shader versions in DirectX 9.


Section I

Geometry Manipulation Tricks

Using Vertex Shaders for Geometry Compression
by Dean Calver

Using Lookup Tables in Vertex Shaders
by Carsten Wenzel

Terrain Geomorphing in the Vertex Shader
by Daniel Wagner

3D Planets on the GPU
by Jesse Laeuchli

Cloth Animation with Pixel and Vertex Shader 3.0
by Kristof Beets

Collision Shaders
by Takashi Imagire

Displacement Mapping
by Tom Forsyth


Using Vertex Shaders for Geometry Compression

Dean Calver

This article is a follow-up to an article I wrote in Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks. DirectX 9 has introduced new data types and added new capabilities to the vertex stream model. This, combined with more complex and faster vertex shaders, allows us to explore more advanced forms of vertex and geometry compression.

What's New in DirectX 9?

Vertex Shaders

In most cases I still use vertex shader version 1.1, as this is executed in hardware on the greatest number of machines. The new cards do benefit from the extra constant space available, which improves the amount of batching that can occur. Static branching also makes it easier to use different compression methods on different models. Vertex shader version 3.0 potentially offers a number of new capabilities, the most prominent being vertex texturing. This will offer a new range of compression methods but isn't explored here due to the current lack of hardware support.

New Vertex Stream Declaration Format

The vertex stream declaration system from DirectX 8 was completely overhauled to make it both easier to use and add new capabilities. From a compression point of view, the most interesting items are the new vertex data types and the extra control over where each element comes from in the stream (stream offset).

Limitations

When under DirectX 8 drivers (you can check via the D3DDEVCAPS2_STREAMOFFSET cap bit), most new capabilities of the DirectX 9 vertex stream declarations can't be used. Under DirectX 7 drivers, you must stick to FVF-style declarations. Also, if a declaration's stream offsets produce overlapping vertex elements, then even on DirectX 9 drivers, the D3DDEVCAPS2_VERTEXELEMENTSCANSHARESTREAMOFFSET cap bit must be set. Another limitation is that stream offsets must align on DWORD boundaries (4 bytes).


The new vertex data types now have cap bits for each new type that DirectX 9 introduced (and UBYTE4 from DirectX 8); you must check these before using them. If the cap bit for the data type that you want is set, use it; otherwise, you will have to emulate the functionality via vertex shader code or change the vertex data to a format that is available on this hardware.

NOTE: The DirectX 9 documentation states the following about each new vertex data type: "This type is valid for vertex shader version 2.0 or higher." This appears to be a documentation bug; if the cap bit is set, you can use it with any vertex shader version. There is already hardware that supports this, even hardware that doesn't support vertex shader version 2.0. (ATI supports some of the new data types on all its vertex shader-capable hardware.)

New Vertex Data Types

Most of these new types are signed, unsigned, and normalized versions of the existing DirectX 8 data types, but a few add new capabilities. The following table lists the data types sorted by bits per channel.

Data Type   Channels  Bits Per Type  Bits Per Channel  Range in Vertex Shader Register  Cap Bit?  Notes
D3DCOLOR    4         32             8                 [0,1]                            N         a
UBYTE4      4         32             8                 [0,255]                          Y
UBYTE4N     4         32             8                 [0,1]                            Y
UDEC3       3         32             10                [0,1024]                         Y         b
DEC3N       3         32             10                [-1,1]                           Y         b
SHORT2      2         32             16                [-32768,32767]                   N
SHORT4      4         64             16                [-32768,32767]                   N
USHORT2N    2         32             16                [0,1]                            Y
USHORT4N    4         64             16                [0,1]                            Y
SHORT2N     2         32             16                [-1,1]                           Y
SHORT4N     4         64             16                [-1,1]                           Y
FLOAT16_2   2         32             16                [-6.55e4,6.55e4]                 Y         c
FLOAT16_4   4         64             16                [-6.55e4,6.55e4]                 Y         c
FLOAT1      1         32             32                [-3.48e38,3.48e38]               N         d
FLOAT2      2         64             32                [-3.48e38,3.48e38]               N         d
FLOAT3      3         96             32                [-3.48e38,3.48e38]               N         d
FLOAT4      4         128            32                [-3.48e38,3.48e38]               N         d

a) D3DCOLOR also reorders elements as it enters the vertex shader: ARGB becomes RGBA.
b) The two top bits are unused and are lost without explicit vertex stream programming.
c) float16 is an OpenEXR standard, a new standard created by nVidia and PIXAR. Use D3DXFLOAT16 to manipulate (or the library in the OpenEXR SDK).
d) float is an IEEE 754 standard, corresponding to C type float.

This is quite a rich set of data types, with all data types being multiples of 32 bits (this is the reason for losing the two bits on the DEC3 formats). The cap bits to check are under D3DCAPS9.DeclTypes; the specific bit is D3DDTCAPS_datatype, and the type to use is D3DDECLTYPE_datatype (where datatype is from the list above).
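For illustration, here is a hedged sketch of such a check (it assumes a valid IDirect3DDevice9* named pD3DDev; the DEC3N preference and the fallback choices are just examples, not a recommendation from the article):

// Sketch: prefer DEC3N for compressed normals, fall back if the cap bit isn't set.
D3DCAPS9 caps;
pD3DDev->GetDeviceCaps(&caps);

D3DDECLTYPE normalType = D3DDECLTYPE_FLOAT3;      // always available
if (caps.DeclTypes & D3DDTCAPS_DEC3N)
    normalType = D3DDECLTYPE_DEC3N;               // 10/10/10 normalized
else if (caps.DeclTypes & D3DDTCAPS_UBYTE4N)
    normalType = D3DDECLTYPE_UBYTE4N;             // cheaper fallback than full float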

Reclaiming Two Bits

When DEC3N or UDEC3 formats are used, we seem to have lost two bits, but even two bits can be used quite effectively, so we want them back (e.g., if you have per-vertex branching, you could store the number of bones here). By causing two different vertex elements to point to the same memory in the vertex buffer, we can get access to our two bits (this requires the overlapped stream offset cap bit to be set).

The vertex stream declaration for a single stream, if we stored normals (a common use) as UDEC3 and wanted to reclaim our two bits, is below. The vertex shader can now bind NORMAL0 to access the data as UDEC3 and NORMAL1 as UBYTE4.

D3DVERTEXELEMENT9 decl[] =
{
    // first element, a 'normal' UDEC3 declaration
    { 0,                      // stream number
      0,                      // stream offset in bytes
      D3DDECLTYPE_UDEC3,      // vertex type for this access
      D3DDECLMETHOD_DEFAULT,  // not used so leave at default
      D3DDECLUSAGE_NORMAL,    // usage (used to bind in the vertex shader)
      0                       // usage number (you can have n normals)
    },
    // second element, a UBYTE4 that accesses the same memory as the normal above
    { 0,                      // stream number, same as first element
      0,                      // stream offset, same as first element
      D3DDECLTYPE_UBYTE4,     // vertex type for this access
      D3DDECLMETHOD_DEFAULT,  // not used so leave at default
      D3DDECLUSAGE_NORMAL,    // usage (used to bind in the vertex shader)
      1                       // usage no (so you can have n normals)
    },
    D3DDECL_END()
};

Figure 1: Data from vertex stream element to vertex register

To get our two bits in a usable form, we need to divide by 2^6 (64) and then floor the result. This has the effect of shifting the extraneous data to the right of the decimal point and only keeping the integer part, which will be our reclaimed two bits in the range 0 to 3. The floor can be removed if you are going to use the two bits as a constant address register (the mova instruction rounds to zero).

struct VS_INPUT
{
    float4 normal  : NORMAL0;
    float4 enc2Bit : NORMAL1;
};

void main( VS_INPUT input )
{
    // access normal as usual
    float3 normal = input.normal;
    // decode our 2 bits (0-3)
    float two_bits = floor(input.enc2Bit.w / 64.0);
}

A Better Compression Transform Data Type

The new DEC3N data types allow us to easily design a format with three channels with 10, 10, and 12 bits precision. This is a useful format for compression-transformed positions. (Compression transform is discussed in my "Vertex Decompression in a Shader" article in Direct3D ShaderX; briefly, it compresses positions by solving the eigen-system of the covariance matrix of the mesh positions and transforming the positions into this basis before quantization. A matrix-vector multiply in the vertex shader restores the original position.)

Many natural and man-made objects have a dominant axis (e.g., along the spine of many animals, etc.). By giving that axis the extra two bits, we are able to use a 32-bit format for some objects that would otherwise have required switching to a 64-bit format (SHORT4). For simplicity in the vertex shader, we arrange the compressor to always make z the longest axis and then append the extra two bits to it before uncompressing.

struct VS_INPUT
{
    float4 position : POSITION0;
    float4 enc2Bit  : POSITION1;
};

void main( VS_INPUT input )
{
    // get the 10,10,10 portion of the position
    float3 cpos = input.position;
    // decode our 2 bits (0-3)
    float two_bits = floor(input.enc2Bit.w / 64.0);
    // factor in the extra bits and convert back into the 0-1 range
    cpos.z = (cpos.z + two_bits) * 0.25;
    // transform by the inverse compression matrix
    float4 pos = mul( float4(cpos,1), InvCompressionTransform );
}
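For reference, here is a hypothetical sketch of the CPU-side packing this decode implies; it is an assumption, not the article's tool code. The position is taken to be already transformed into the compression basis and scaled into [0,1], z is quantized to 12 bits, and its top two bits end up in bits 6-7 of the byte that the overlapped UBYTE4 element reads as .w, which is why the shader divides by 64.

// Hypothetical packing sketch (assumption): x, y, z are already in the compression
// basis and normalized to [0,1], with z the dominant (12-bit) axis.
#include <cstdint>

uint32_t packPosition10_10_12(float x, float y, float z)
{
    uint32_t qx = uint32_t(x * 1023.0f + 0.5f) & 0x3FF;  // 10 bits
    uint32_t qy = uint32_t(y * 1023.0f + 0.5f) & 0x3FF;  // 10 bits
    uint32_t qz = uint32_t(z * 4095.0f + 0.5f) & 0xFFF;  // 12 bits for the long axis

    uint32_t zLow  = qz & 0x3FF;  // low 10 bits -> UDEC3 z channel (bits 20-29)
    uint32_t zHigh = qz >> 10;    // top 2 bits  -> bits 30-31, i.e., bits 6-7 of byte 3,
                                  // which the UBYTE4 alias reads back as .w
    return qx | (qy << 10) | (zLow << 20) | (zHigh << 30);
}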

Displacement Compression

My previous article covered the use of vertex shaders to render displacement maps. This capability can be extended to a very powerful technique that Tom Forsyth has termed "displacement compression." It's a complete family of techniques that includes patch rendering, displacement mapping, and subdivision surfaces, that any vertex shader-capable hardware can do, and it is a powerful form of geometry compression.

Usually tessellation levels are decided by the CPU, as we currently have no programmable tessellation hardware, but there are a few fixed-function hardware tessellation systems that you may be able to use. This is the technique's major limitation: to a limited degree, we can remove triangles (by sending the vertices to be clipped), but we cannot add triangles.

By using the vertex shader as a function evaluator, with the vertex stream bringing in the function parameters, we can render many geometrical surfaces. For the surfaces we use here, this consists of a barycentric surface function with an additional displacement scalar, but other surface parameterizations are possible.

There are two components that are needed for displacement compression:

• Displacement mapping: A method of retrieving a scalar displacement along the surface normal. Without it, your displacement compression becomes standard surface patch evaluation.
• Surface basis: Every displacement compression shader requires a basis system that defines the base surface before displacement. The simplest is just planar, although it could be as complex as a subdivision surface.

Displacement Mapping

There are at least four ways to get the displacement value into the vertex shader. The more advanced methods require explicit hardware support and are not covered here; refer to presentations from Mike Doggett and Tom Forsyth for details [2]. Also, Tom Forsyth's article covers actual generation of displacement data in detail [1].

The technique presented here works on any vertex shader hardware by treating the displacement map as a 1D vertex stream. It's a generalization of the technique that I presented in Direct3D ShaderX, which had an implied planar basis; with a few minor modifications it works for any surface basis.

The displacement value is stored explicitly in a vertex stream. If kept in a separate stream, it can be accessed via the CPU as a standard displacement map, or you can choose to pack it with other vertex elements. Packing will usually save space, but a separate stream can be more convenient, especially for dynamically updated displacement maps.

As there is only one one-channel vertex data type (FLOAT1), you will probably store your displacement map in another data type that has spare channels. For 8-bit displacement map data, UBYTE4 is the obvious choice. This may appear to waste a lot of space, but in practice, enough other data has to be provided so that if space is a concern, it can be reclaimed to store other surface parameters.

NOTE: Unfortunately, DirectX 9 has no GPU-powered way of transferring or sharing data between render targets and vertex streams. This is purely an API issue, but it makes GPU-based dynamic displacement maps difficult (if not impossible) under DirectX 9. Mike Doggett's OpenGL uber-buffer render-to-vertex-array demo shows what GPU modification of vertex data can do.

Pre-Filtering Displacement Maps

One form of filtering that can be used with vertex stream displacement is to store the displacement values that would occur at the lower tessellation levels alongside the usual displacement value. This is similar to mipmapping in that the filter is run before the actual rendering. As with mipmapping, you can use either point sampling (just select the appropriate displacement value) or linear filtering (select two displacement values and linearly interpolate). The main difference from mipmapping is that there is no easy way to access the texture derivatives in vertex shaders, so you will probably use a global blend factor or base it on distance from the camera.

If you store displacement values in UBYTE4, you could pack three lower levels in the other three channels, which gives you an effective linear mip filter (but with a point min/mag filter).
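As a rough illustration of that idea, here is a hedged HLSL sketch (not from the article) that blends between pre-filtered displacement levels packed into one UBYTE4 element; the per-draw blend value and the channel layout are assumptions.

// Sketch only: dispLevels.x holds the finest displacement, y/z/w progressively coarser
// pre-filtered levels; mipBlend (assumed constant per draw, e.g., from camera distance)
// selects and blends two neighboring levels, in the range [0,3).
float sampleDisplacement(float4 dispLevels, float mipBlend)
{
    float level = floor(mipBlend);
    float blend = mipBlend - level;
    float d0 = (level < 0.5) ? dispLevels.x : (level < 1.5) ? dispLevels.y : dispLevels.z;
    float d1 = (level < 0.5) ? dispLevels.y : (level < 1.5) ? dispLevels.z : dispLevels.w;
    return lerp(d0, d1, blend);
}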

Surface Basis

The key to displacement compression is reversing the standard relationship between the vertex stream and the constant registers. A vertex shader for indexed triangles can only access the data of one vertex at a time, but each vertex shader can access more than one vertex constant. Thus, if you put mesh data into constant memory, each vertex shader execution has access to multiple vertices. We upload vertices or control points to constant memory and feed normalized barycentric coordinates (aka areal coordinates) and surface indices in via the vertex stream. (For some surface bases we may need other parameters; i.e., subdivision surfaces require surrounding surface indices as well.)

The normalized barycentric coordinates and surface indices uniquely define where in the mesh (stored in constant memory) the vertex shader is currently evaluating the surface basis function.
evaluating the surface basis function.


Points Inside a Triangle

A unique point inside a triangle can be computed via the three vertices defining the triangle and the barycentric coordinates of this interior point. The three vertices for each triangle are placed into constant memory, and we store two of the barycentric coordinates in the vertex stream (k can be computed from i and j). A vertex stream triangle index is used to select which set of three vertices in constant memory makes up the triangle with which we are currently working.

Here we hit a small issue: Some vertices belong to more than one triangle. We have to duplicate each vertex attached to more than one triangle and give each one a separate index.

//HLSL code for calculating interior points of a number of triangles.
float3 VertexPos[3 * NUM_BASE_TRIANGLE];

void main(float3 vertexStream : POSITION0)
{
    float i = vertexStream.x;
    float j = vertexStream.y;
    float k = 1.0 - i - j;
    float baseIndex = vertexStream.z * 256; // un-normalize index

    float3 pos = i * VertexPos[ (baseIndex*3) + 0] +
                 j * VertexPos[ (baseIndex*3) + 1] +
                 k * VertexPos[ (baseIndex*3) + 2];
}

N-Patches

N-Patches (Curved PN Patches [3]) are a type of bicubic patch where the control points are determined from a triangle's vertex positions and normals. N-Patches come in two variations, both with cubic interpolated position, but they differ in whether the normal is interpolated linearly or quadratically. The algorithm calculates the control points for the patch and then evaluates at each point on the base triangle.

Effectively, there are two frequencies at which this vertex shader needs executing: The control points need calculating only once per patch, whereas the evaluation needs running at every vertex. Some consoles can execute this pattern on the GPU, but on current PC architectures you can either generate the control points on the CPU and upload them to vertex constant memory or recalculate the control points at every vertex. The first uses CPU power per patch, and each patch uses more constant memory (for linear normal N-Patches, 39 floats versus 18 for vertices), whereas recalculating at every vertex uses a lot of vertex shader power but allows better batching and has lower CPU overhead.

float3 VertexPos[3 * NUM_BASE_TRIANGLE];
float3 VertexNormal[3 * NUM_BASE_TRIANGLE];

// bicubic control points
float3 b300,b030,b003, b210,b120,b021, b201,b102,b012, b111;
float3 n200,n020,n002;

const float rcp3 = 1.0 / 3.0; // one third, used when building the control points

void generateControlPointsWithLinearNormals(float baseIndex)
{
    float3 v0 = VertexPos[ (baseIndex*3) + 0];
    float3 v1 = VertexPos[ (baseIndex*3) + 1];
    float3 v2 = VertexPos[ (baseIndex*3) + 2];
    float3 n0 = VertexNormal[ (baseIndex*3) + 0];
    float3 n1 = VertexNormal[ (baseIndex*3) + 1];
    float3 n2 = VertexNormal[ (baseIndex*3) + 2];

    // For the book I'll do one bicubic patch control point here; for the rest
    // see the example code on the CD/web or reference ATI's Curved PN Patch paper [3]
    float3 edge = v1 - v0;
    // E - (E.N)N
    float3 tangent1 = edge;
    float tmpf = dot( tangent1, n0 );
    tangent1 -= n0 * tmpf;
    b210 = v0 + (tangent1 * rcp3);
}

void evaluateNPatchLinearNormal(float i, float j, out float3 pos, out float3 norm)
{
    float k  = 1 - i - j;
    float k2 = k * k;
    float k3 = k2 * k;
    float i2 = i * i;
    float i3 = i2 * i;
    float j2 = j * j;
    float j3 = j2 * j;

    // bicubic position
    pos = (b300*k3) + (b030*i3) + (b003*j3) +
          (b210*3*k2*i) + (b120*3*k*i2) + (b201*3*k2*j) +
          (b021*3*i2*j) + (b102*3*k*j2) + (b012*3*i*j2) +
          (b111*6*k*i*j);
    // linear normal
    norm = (k * n200) + (i * n020) + (j * n002);
}

void main(float3 vertexStream : POSITION0)
{
    float i = vertexStream.x;
    float j = vertexStream.y;
    float baseIndex = vertexStream.z * 256;

    float3 pos, norm;
    generateControlPointsWithLinearNormals(baseIndex);
    evaluateNPatchLinearNormal(i, j, pos, norm);
}


Making It Fast Using a Linear Basis

Evaluating N-Patches via a vertex shader can be quite expensive. If you are also using a displacement map, the inherent surface curve usually isn't very important anyway. Usually when using displacement compression, we would like a basis that has a smooth surface normal but relies on the displacement map to handle the position. A linear basis has all these properties: The surface normal is smooth between patches (assuming the vertex normals are smooth), but the position before the displacement is planar. The surface normal is generated from the linear interpolation of the vertex normals (in a similar manner to how Phong shading interpolates the lighting normal).

A linear basis only requires the mesh vertex data, and as these can be shared between patches, it's usually better to store vertex indices rather than a triangle index at every interior point. This usually increases the number of vertices that can be stored in constant memory, which increases performance as more patches can be evaluated per call, at the expense of slightly larger per-vertex data.

//HLSL for a displaced linear basis surface with indexed vertices
float MAX_DISPLACEMENT_HEIGHT = 100; // this is just an example value

float3 VertexPos[NUM_BASE_VERTICES];
float3 VertexNormal[NUM_BASE_VERTICES];
float2 VertexUV[NUM_BASE_VERTICES];

struct VS_IN
{
    float2 barycentric;
    float3 indices;
    float  displacement;
};

void main( VS_IN input )
{
    float i = input.barycentric.x;
    float j = input.barycentric.y;
    float k = 1.0 - i - j;
    float i0 = input.indices.x * 256;
    float i1 = input.indices.y * 256;
    float i2 = input.indices.z * 256;

    float3 pos    = i*VertexPos[i0] + j*VertexPos[i1] + k*VertexPos[i2];
    float3 normal = i*VertexNormal[i0] + j*VertexNormal[i1] + k*VertexNormal[i2];
    float2 uv     = i*VertexUV[i0] + j*VertexUV[i1] + k*VertexUV[i2];

    normal = normalize( normal );
    pos = pos + input.displacement * normal * MAX_DISPLACEMENT_HEIGHT;
}

Barycentric coordinates are in the range [0,1] and are the same for each triangle at a particular subdivision. Indices only require a maximum of 256 values (there are currently only 256 constants), so a byte per index is enough. For the triangle indexed version, this is 1 byte + 1 byte displacement and a shared 8 bytes (two floats), and for the vertex indexed version it is 3 bytes + 1 byte displacement and a shared 8 bytes (two floats). A good approach is to place the barycentric coordinates in one stream and the indices and displacement in another. The barycentric stream can be reused by all meshes at the same subdivision level.
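One way to express that split, sketched here as an assumption rather than the article's actual declaration, is a two-stream D3DVERTEXELEMENT9 layout: stream 0 carries the shared barycentric pair, stream 1 carries the three index bytes plus the displacement byte.

// Hypothetical two-stream declaration for the vertex-indexed linear basis above.
// Usages and usage numbers are illustrative; the shader would bind them to VS_IN.
D3DVERTEXELEMENT9 linearBasisDecl[] =
{
    // stream 0: barycentric i,j - shared by every mesh at this subdivision level
    { 0, 0, D3DDECLTYPE_FLOAT2,  D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
    // stream 1: three vertex indices plus displacement, one byte each,
    // normalized so the shader can un-normalize with a multiply as in the listing above
    { 1, 0, D3DDECLTYPE_UBYTE4N, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 1 },
    D3DDECL_END()
};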

Lighting Normal

As getting a perturbed lighting normal proves to be difficult, the best option is not to bother at run time. If the displacement map is fixed, you can just create a normal map off-line that encodes the lighting normal. Even if you are vertex lighting, you can feed the normal map values into the vertex shader in the same manner as the displacement values.

If you really have to derive a sensible lighting normal in the vertex shader, it is possible with some preprocessing. If we could access the local surface points around us (perturb i and j by a small amount) and look up the displacement maps at those points, we could calculate the local post-displaced tangent plane. The only way of doing this in a vertex stream is by using a process similar to prefiltering: storing at every interior point the displacement values around us. By storing all surrounding displacement values at every interior point, we could run the surface evaluator (including the displacement) on each perturbed point and calculate the lighting normal. In practice, only storing a couple of displaced values (usually left and down) is enough to get a reasonable lighting normal.
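To make the idea concrete, here is a hedged HLSL sketch (an assumption, not the article's code) where evalDisplacedSurface() stands in for the basis evaluation plus displacement, and the two neighbouring displacement values are the extra per-vertex data described above.

// Sketch: reconstruct a lighting normal from the evaluated point and two neighbours
// offset by 'delta' in barycentric i and j. All names here are illustrative.
float3 lightingNormal(float i, float j, float delta,
                      float dispHere, float dispLeft, float dispDown)
{
    float3 p  = evalDisplacedSurface(i, j, dispHere);
    float3 pL = evalDisplacedSurface(i - delta, j, dispLeft);
    float3 pD = evalDisplacedSurface(i, j - delta, dispDown);
    return normalize(cross(pL - p, pD - p));
}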

Conclusion

Vertex shaders can be used as effective geometry decompressors; with tight packing of vertex data and techniques like displacement compression, we can save considerable memory and, more importantly, bandwidth. The cost of using extra vertex shader instructions is usually not a problem, as in most cases this isn't a bottleneck; using this "spare" vertex throughput to save bandwidth may make things run faster.

Displacement compression requires changes to the tools (these are described elsewhere [2]), but it is an important future technique that you should be thinking about implementing in the near and long term.

References

[1] Forsyth, Tom, "Displacement Mapping," ShaderX2: Shader Programming Tips & Tricks with DirectX 9, Wolfgang Engel, ed., Wordware Publishing, Inc., 2004, pp. 73-86.

[2] Doggett, Mike and Tom Forsyth, "Displacement Mapping," GDC 2003.

[3] Vlachos, A., J. Peters, C. Boyd, and J. Mitchell, "Curved PN Triangles," http://www.ati.com/developer/CurvedPNTriangles.pdf.


Using Lookup Tables in Vertex Shaders

Carsten Wenzel

When writing vertex shader code, you almost always want to squeeze out a few instructions. Maybe you have to do it in order to stay within the instruction limit, which can easily be reached when doing complex animation and lighting calculations. Or maybe you simply want to speed up your code to gain some extra frames per second. Both goals can be achieved by encoding functions and terms in your vertex shader that consume a lot of instructions (and thus time) to evaluate. Another potential scenario would be the use of empirical data for certain calculations. This is where lookup tables can come in handy.

A table lookup can be implemented quite easily using the address register a0 to index an array of constant registers c[tableBase] ... c[tableBase + tableSize - 1] containing the actual table data. Generally, you want to keep the table as small as possible. Therefore, it is often necessary to interpolate between consecutive table values. Here's an example. Say your lookup table stores values of a continuous function f(x) for all integers x in the range [0, 10]. Now it happens that you need to look up the value for f(3.25). The exact value isn't stored in the lookup table. To get an estimated result, we could use the fractional part of the index value as the blend factor for a linear interpolation, i.e.:

    f(3.25) ≈ f[3] + 0.25 * (f[4] - f[3])

Do not forget about the Nyquist theorem¹ when representing continuous functions via lookup tables, or else you'll face aliasing. That is, make sure the table is not too small, which implies that encoding terms and functions by means of lookup tables is not feasible if the range you're interested in exhibits high frequencies. Also note that the table size directly affects the precision of the interpolated result.

To demonstrate how a table lookup translates into actual shader code, let's start with a description of a sample application. Imagine you'd like to write a particle effect that simulates stars in a galaxy. They are placed in clusters on the x/z plane with some variation in y and spin around the y axis with the galaxy center being the pivot point.

¹ The Nyquist theorem describes one of the most important rules of sampling. To fully reproduce a continuous signal one needs to sample it with a frequency at least twice that of the highest frequency contained in the original signal. For example, to reproduce a full 20 kHz audio signal it has to be sampled at least 40,000 times a second.


Rotation speed is based on the squared distance (0 ≤ d² ≤ 1.0) to the center. Further assume that the vertex shader version used is 1.1, which means there are no cosine and sine instructions at your disposal, but you still want to do the animation entirely on the GPU. The following matrix M_rot describes how much a star should be rotated around the y axis after time seconds:

    a = time / (0.1 + 1000 * d²)
    c = cos(a)
    s = sin(a)

            | c  0  -s  0 |
    M_rot = | 0  1   0  0 |
            | s  0   c  0 |
            | 0  0   0  1 |

This is the rotation matrix that should be built per vertex on the GPU.

Some of you might say that cosine-sine pairs can be calculated at the same time using a Taylor-series expansion, such as the code written by Matthias Wloka, which takes nine instructions and three constant registers to execute. But you'd also need to determine a to pass it to the cosine-sine evaluation code. Since we intend to use a lookup table anyway, all these calculations can be baked together there, thus saving instructions in the vertex shader. Here is how to set up the lookup table:

const unsigned int TABLE_SIZE(64);
const unsigned int TABLE_BASE(10);

for(unsigned int uiI(0); uiI < TABLE_SIZE; ++uiI)
{
    float d2(uiI / (float) (TABLE_SIZE - 1));
    float alpha(time / (0.1f + 1000.0f * d2));
    float c(cosf(alpha));
    float s(sinf(alpha));

    D3DXVECTOR4 vLookup(c, s, 0.0f, 0.0f);
    pD3DDev->SetVertexShaderConstant(TABLE_BASE + uiI, &vLookup, 1);
}

float fIndexScale((float) (TABLE_SIZE - 1));
float fIndexOffset(0.0f);
D3DXVECTOR4 vIndex(fIndexScale, fIndexOffset, 0.0f, 0.0f);

const unsigned int TABLE_INDEX(9);
pD3DDev->SetVertexShaderConstant(TABLE_INDEX, &vIndex, 1);

This way, to look up c and s, we only need to find d², which is as simple as dotting the position of a star with itself, since the center of the galaxy is at (0, 0, 0). The previous pseudocode also sets all constants required to properly index the lookup table, as we will see very soon.
table, as we see very soon.


What remains to do is write the vertex shader to animate each particle. The code will be split into several pieces showing all necessary steps to get the stars spinning on the GPU. The following part computes the table index.

#define srcPos   v0 // (x, y, z, 1)
#define temp0    r0
#define temp1    r1
#define temp2    r2
#define worldPos r3

#define TABLE_INDEX 9
#define TABLE_BASE  10

vs.1.1

#ifdef DX9
    dcl_position0 srcPos
#endif

// calculate d^2 and table index
dp3 temp0, srcPos, srcPos
mad temp1, temp0, c[TABLE_INDEX].x, c[TABLE_INDEX].y

// get fraction of table index
expp temp0.y, temp1.y

// set table index for relative addressing of lookup table
#ifdef DX9
    add a0.x, temp1.y, -temp0.y
#else // DX8
    mov a0.x, temp1.y
#endif

The first section of the vertex shader determines the table index for the lookup table. It calculates d² and applies the index scale and offset constant. Why mad can be used to evaluate the table index in a single instruction, and how to set up the index scale and offset constant for lookup tables covering arbitrary intervals, is shown in the appendix to this article.

When copying the table index to a0, care must be taken. According to the DirectX 8.1 specs, moving a value into the address register automatically computes the floor of that value, which is exactly the behavior we are after. Quite the contrary if you use DirectX 9: Here you have to do the floor calculation yourself, because a value moved into the address register gets rounded to the nearest integer. This would obviously break the interpolation code due to a possibly incorrect index in a0.

The following part of the shader calculates the linearly interpolated table lookup value. It fetches the values for a0.x and a0.x + 1 from the lookup table. Then it takes the already-computed fraction of the table index to blend between them.
them.


// fetch two consecutive values from lookup table
mov temp1, c[a0.x + TABLE_BASE]
mov temp2, c[a0.x + TABLE_BASE + 1]

// lerp them using fraction of index
add temp2, temp2, -temp1
mad temp2, temp2, temp0.y, temp1

The third section starts off with a trick. Knowing that cos(x)² + sin(x)² = 1, we can renormalize the linearly interpolated table lookup values to feed the rotation matrix with proper ones, which is important for rotations. Now we can build the matrix and transform each particle into world space.

// renormalize cos/sin
dp3 temp1.w, temp2, temp2
rsq temp1.w, temp1.w
mul temp2, temp2, temp1.w

// build y rotation matrix
mov temp0, temp2.xzyw   // 1st row: cos 0.0 -sin 0.0
mov temp0.z, -temp0.z
mov temp1, temp2.yzxw   // 3rd row: sin 0.0 cos 0.0

// rotate particle
mov worldPos, srcPos
dp3 worldPos.x, srcPos, temp0
dp3 worldPos.z, srcPos, temp1

Once the particle is in world space, you can apply the view-projection matrix as usual, calculate the point size for the particle, set its color, etc. The following screen shot shows the result of our efforts.

Figure 1: Screen shot of vertex shader in action
Figure 1: Screen shot of vertex shader in action


Appendix

Say you'd like to create a lookup table containing tableSize entries for a function f(x) in the range [xmin, xmax]. The values stored in an array of constant registers c[tableBase] ... c[tableBase + tableSize - 1] look like this:

    c[tableBase + i] = f(xmin + i * (xmax - xmin) / (tableSize - 1)),   0 <= i < tableSize

To do a lookup you now need to map a value x from [xmin, xmax] to [tableBase, tableBase + tableSize - 1]:

    index = (x - xmin) / (xmax - xmin) * (tableSize - 1) + tableBase

This can be decoupled to:

    index = x / (xmax - xmin) * (tableSize - 1) - xmin / (xmax - xmin) * (tableSize - 1) + tableBase

In the equation above, everything but x is invariant. Taking a closer look reveals that it can be expressed in terms of a mad:

    index = indexScale * x + indexOffset

    indexScale  = (tableSize - 1) / (xmax - xmin)
    indexOffset = -xmin / (xmax - xmin) * (tableSize - 1) + tableBase

Since tableBase can be used as a fixed relative offset when fetching values from the lookup table (as can be seen in the vertex shader sample code above), indexOffset can be rewritten as:

    indexOffset = -xmin / (xmax - xmin) * (tableSize - 1)
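To tie the appendix back to the code, here is a hedged C++ sketch (xMin and xMax are hypothetical bounds; TABLE_SIZE, TABLE_INDEX, and pD3DDev are as in the sample above) that computes and uploads the scale/offset pair for an arbitrary interval:

// Sketch: index constants for a lookup table covering [xMin, xMax] instead of [0,1].
float indexScale  = (TABLE_SIZE - 1) / (xMax - xMin);
float indexOffset = -xMin / (xMax - xMin) * (TABLE_SIZE - 1);

D3DXVECTOR4 vIndex(indexScale, indexOffset, 0.0f, 0.0f);
pD3DDev->SetVertexShaderConstant(TABLE_INDEX, &vIndex, 1);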


Terrain Geomorphing in the Vertex Shader

Daniel Wagner

Introduction

Terrain rendering has heretofore been computed by a CPU and rendered by a combination of CPU and GPU. It is possible to implement a fast terrain renderer that works optimally with current 3D hardware. This is done by using geo-mipmapping, which splits the terrain into a set of smaller meshes called patches. Each patch is triangulated view-dependently into one single triangle strip. Special care is taken to avoid gaps and t-vertices between neighboring patches. An arbitrary number of textures, which are combined using multiple alpha-blended rendering passes, can be applied to the terrain. Since the terrain's triangulation changes over time, vertex normals cannot be used for lighting. Instead, a precalculated lightmap is used. In order to reduce popping when a patch switches between two tessellation levels, geomorphing is implemented. As we point out later, this splitting of the terrain into small patches allows some very helpful optimizations.

Why Geomorphing?

Terrain rendering has been an active research area for quite a long time. Although some impressive algorithms have been developed, the game development community has rarely used these methods because of their high computational demands. Recently, another reason for not using the classic terrain rendering approaches such as ROAM [Duc97] or VDPM [Hop98] emerged: Modern GPUs just don't like CPU-generated dynamic vertex data. The game developers' solution for this problem was to build very low-resolution maps and fine-tuned terrain layout for visibility optimization. In contrast to indoor levels, terrain visibility is more difficult to tune, and there are cases where the level designer just wants to show distant views.

The solution to these problems is to introduce some kind of terrain LOD (level of detail). The problem with simple LOD methods is that at the moment vertices are added or removed, the mesh changes; this leads to very noticeable popping effects. The only clean way out of this is to introduce geomorphing, which inserts new vertices along an existing edge and later moves each such vertex to its final position. As a consequence, the terrain mesh is no longer static but changes ("morphs") every frame. It is obvious that this morphing has to be done in hardware in order to achieve high performance.

Previous Work

A lot of work has already been done on rendering terrain meshes. Classic algorithms such as ROAM and VDPM attempt to generate triangulations that optimally adapt to terrain given as a heightmap. Here, "optimally" was defined as using as few triangles as possible for a given quality criterion. While this was a desirable aim some years ago, things have changed.

Today, the absolute number of triangles is not as important. As of 2003, games that render up to 200,000 triangles per frame have been released, including games such as Unreal 2. An attractive terrain triangulation takes some 10,000 triangles. This means that it is no longer important if we need 10,000 or 20,000 triangles for the terrain mesh, as long as it is done fast enough. Today "fast" also implies using as little CPU processing power as possible, since in real-life applications the CPU usually has more things to do than just drawing terrain (e.g., AI, physics, voice-over-IP compression, etc.). The other important thing today is to create the mesh in such a way that the graphics hardware can process it quickly, which usually means the creation of long triangle strips. Both requirements are mostly unfulfilled by the classic terrain meshing algorithms.

The work in this article is based on the idea of geo-mipmapping described by de Boer in [Boe00]. Another piece of work that uses the idea of splitting the terrain into a fixed set of small tiles is [Sno01], although the author does not write about popping effects or how to efficiently apply materials to the mesh.
Building the Mesh<br />

Section I — Geometry Manipulation <strong>Tricks</strong><br />

Terrain Geomorphing in the Vertex <strong>Shader</strong><br />

19<br />

The terrain mesh is created from an 8-bit heightmap that has to be sized 2^n+1<br />

* 2^n+1 (e.g., 17*17, 33*33, 65*65, etc.) in order to create n^2 * n^2 quads.<br />

The heightmap (see Figure 1a) can be created from real data (e.g., DEM) [Usg86]<br />

or by any program that can export into raw<br />

8-bit heightmap data (e.g., Corel Bryce<br />

[Cor01]). The number of vertices of a patch<br />

changes during rendering (see view-dependent<br />

tessellation), which forbids using vertex<br />

normals for lighting. Therefore, a lightmap<br />

(see Figure 1b) is used instead.<br />

In order to create the lightmap, the normals<br />

for each point in the heightmap have to<br />

be calculated first. This can be done by creating<br />

two 3D vectors, each pointing from the<br />

current height value to the neighboring height<br />

positions. Calculating the cross product of Figure 1a: A sample heightmap


Section I — Geometry Manipulation <strong>Tricks</strong><br />

20 Terrain Geomorphing in the Vertex <strong>Shader</strong><br />

these two vectors gives the current normal<br />

vector, which can be used to calculate a diffuse<br />

lighting value. To get better results, including<br />

static shadows, advanced terrain data editing<br />

software such as Wilbur [Slay95] or Corel<br />

Bryce should be used.<br />
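As a hedged illustration of that cross-product step (an assumed helper, not the article's code; height() and gridSpacing are hypothetical), a per-texel normal could be computed like this:

// Sketch: build two vectors from the current sample toward its +x and +z neighbours
// and cross them; the winding is chosen so the resulting normal points upward.
D3DXVECTOR3 computeHeightmapNormal(int x, int z)
{
    float h = height(x, z);                                              // assumed accessor
    D3DXVECTOR3 toRight(gridSpacing, height(x + 1, z) - h, 0.0f);        // step along +x
    D3DXVECTOR3 toFront(0.0f,        height(x, z + 1) - h, gridSpacing); // step along +z

    D3DXVECTOR3 n;
    D3DXVec3Cross(&n, &toFront, &toRight);
    D3DXVec3Normalize(&n, &n);
    return n;
}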

The heightmap is split into 17*17-value-sized parts called patches. The borders of neighboring patches overlap by one value (e.g., value column 16 is shared by patch 0/0 and patch 1/0). Geometry for each patch is created at run time as a single indexed triangle strip. A patch can create geometry at five different tessellation levels, ranging from full geometry (2*16*16 triangles) down to a single flat quad (two triangles; for an illustration see Figure 2). Where needed, degenerate triangles are inserted to connect the sub-strips into one large strip [Eva96].

Figure 2: The same patch tessellated at different levels ranging from full geometry (level 0) to a single quad (level 4)

In order to connect two strips, the last vertex of the first strip and the first vertex of the second strip have to be inserted twice. The result is triangles that connect the two strips in the form of a line and are therefore invisible (unless rendered in wireframe mode). The advantage of connecting small strips into one larger strip is that fewer API calls are needed to draw the patch. Since indexed vertices are used and a lot of today's graphics hardware can recognize and automatically remove degenerate triangles, the rendering and bandwidth overhead of the degenerate triangles is very low.

Calculating the Tessellation Level of a Patch

Before a frame is rendered, each patch is checked for its necessary tessellation level. It's easy to see from Figure 2 that the error of each patch increases as the number of vertices is reduced. In a preprocessing step, for each level the position of the vertex with the largest error (the one that has the largest distance to the corresponding correct position, called the "maxerror vertex" later on) is determined and saved together with the correct position.

When determining the level at which to render, all saved "maxerror vertices" are projected into the scene and the resulting errors calculated. Finally, the level with the largest error below an application-defined error boundary is chosen. In order to create a specific level's geometry, only the "necessary" vertices are written into the buffers. For example, to create level 0, all vertices are used. Level 1 leaves out every second vertex, reducing the triangle count to a quarter. Level 2 uses only every fourth vertex, and so on.
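A minimal sketch of that selection loop, assuming hypothetical per-patch arrays of maxerror vertices and their correct positions plus an assumed project() helper that maps world space to screen space (none of this is the article's actual code):

// Sketch: pick the coarsest level whose projected "maxerror" stays below the bound.
const int NUM_LEVELS = 5;

struct PatchErrors
{
    D3DXVECTOR3 maxErrorVertex[NUM_LEVELS];   // precomputed worst vertex per level
    D3DXVECTOR3 correctPosition[NUM_LEVELS];  // its position in the full-detail mesh
};

D3DXVECTOR2 project(const D3DXVECTOR3& worldPos); // assumed world-to-screen helper

int chooseTessellationLevel(const PatchErrors& p, float maxScreenError)
{
    int chosen = 0;
    for (int level = 1; level < NUM_LEVELS; ++level)
    {
        D3DXVECTOR2 delta = project(p.maxErrorVertex[level]) -
                            project(p.correctPosition[level]);
        if (D3DXVec2Length(&delta) < maxScreenError)
            chosen = level;   // still below the error bound, keep coarsening
        else
            break;            // the error only grows with coarser levels
    }
    return chosen;
}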

Connecting Patches

If two neighboring patches with different tessellation levels were simply rendered one next to the other, gaps would occur (imagine drawing any of the patches in Figure 2 next to any other). Another problem is t-vertices, which occur when a vertex is positioned on the edge of another triangle. Because of rounding errors, that vertex will not be exactly on the edge of the neighboring triangle, and small gaps that are only a few pixels in size can become visible. Even worse, when moving the camera, these gaps can appear and disappear every frame, which leads to a very annoying flickering effect.

To solve both problems, it is obvious that each patch must know its neighbors' tessellation levels. To do so, all tessellation levels are calculated first without creating the resulting geometry, and then each patch is informed about its neighbors' levels. After that, each patch updates its geometry as necessary. Geometry updating has to be done only if the inner level or any of the neighbors' levels has changed. To close gaps and prevent t-vertices between patches, a border of "adapting triangles" is created that connects the differently sized triangles (see Figure 3). It is obvious that only one of two neighboring patches has to adapt to the other. As we can see in the section "Geomorphing," it is necessary for the patch with the finer tessellation level (having more geometry) to adapt.

Figure 3a: T-vertices at the border of two patches
Figure 3b: T-vertices removed

Figure 3a shows a typical case where t-vertices occur. In Figure 3b, "adapting triangles" at the left side of the right patch are created to avoid t-vertices. Although these triangles look like good candidates for being created using triangle fans, they are also implemented using strips, since fans cannot be combined into bigger fans, as can be achieved with strips.
21


Materials

Our terrain has no shading or materials yet. Applying dynamic light by using surface normals would be the easiest way to go but would result in strange effects when patches switch tessellation levels. The reduction of vertices goes hand in hand with the loss of an equal number of normals. When a normal is removed, the resulting diffuse color value is removed too. The user notices such changes very easily, especially if the removed normal produced a color value that was very different from its neighboring color values.

The solution to this problem is easy and well known in today's computer graphics community. Instead of doing real-time lighting, we can use a precalculated lightmap, which is by its nature more resistant to vertex removal than per-vertex lighting. Besides solving our tessellation problem, it provides us with the possibility to precalculate shadows into the lightmap. The disadvantage of using lightmaps is that the light's position is now fixed to the position that was used during the lightmap's generation.

In order to apply a lightmap (see Figure 4), we need to add texture coordinates to the vertices. Since only one lightmap is used for the whole terrain, it simply spans the texture coordinates from (0,0) to (1,1).

Figure 4a: Lit terrain
Figure 4b: Same terrain with wireframe overlay
Figure 4c: Terrain with overlaid triangle mesh
Figure 4d: Getting close to the ground, the highly detailed materials become visible.
highly detailed materials become visible.


Now that the terrain's mesh is set up and shaded, it's time to apply some materials. In contrast to the lightmap, we need far more detail for materials such as grass, mud, or stone to look good. (See Figures 4c and 4d.) The texture won't be large enough to cover the complete landscape and look good, regardless of how high the resolution of a texture might be. For example, if we stretch one texture of grass over a complete terrain, the grass wouldn't even be recognizable. One way to overcome this problem is to repeat material textures.

To achieve this, we scale and wrap the texture so that it is repeated over the terrain. By setting a texture matrix we can use the same texture coordinates for the materials as for the lightmap. As we see later, this one set of (never-changing) texture coordinates, together with some texture matrices, is sufficient for an arbitrary number of materials (each one having its own scaling factor and/or rotation) and even for moving faked cloud shadows (see below).

To combine a material with the lightmap, two texture stages are set up using modulation (component-wise multiplication). The result is written into the graphics buffer. In order to use more than one material, each material is combined with a different lightmap containing a different alpha channel. Although this would allow each material to use different color values for the lightmap too, in practice this hardly makes any sense. This results in one render pass per material, which is alpha blended into the frame buffer. As we see later, a lot of fillrate can be saved if not every patch uses every material, which is the usual case (see the section titled "Optimizations"). Figure 5 shows how two materials are combined with lightmaps and then blended using an alpha map. (For better visualization, the materials' textures are not repeated in Figure 5.)

Figure 5: Combining two render passes

In the top row of Figure 5, the base material is combined with the base lightmap. Since there is nothing to be drawn before this pass, no alpha map is needed. In the bottom row, the second pass is combined with another lightmap. This time there is an alpha channel (invisible parts are drawn with checkered boxes). The resulting image is finally alpha-blended to the first pass (the right image in Figure 5).
It is important to note that this method allows each material pass to use a<br />

free scaling (repeating) factor for the color values, which results in highly detailed


Section I — Geometry Manipulation <strong>Tricks</strong><br />

24 Terrain Geomorphing in the Vertex <strong>Shader</strong><br />

materials, while the lightmap does not need to be repeated since lighting values<br />

do not need as much detail. Only two texture stages are used at once, which<br />

allows combining an arbitrary number of passes. Most applications will not need<br />

more than three or four materials.<br />

After all materials have been rendered, another pass can be drawn in order to<br />

simulate cloud shadows. Again, we can repeat the shadows in order to get more<br />

detailed-looking shadows. As we are already using a texture matrix to do scaling,<br />

we can animate the clouds easily by applying velocity to the matrix’s translation<br />

values. The effect is that the clouds’ shadows move along the surface, which<br />

makes the whole scene look far more realistic and “alive.”<br />
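For illustration, here is a hedged sketch of such a texture-matrix animation (stage index, repeat factor, and scroll speeds are made-up values, and time is assumed to hold the elapsed time in seconds):

// Sketch: scale the cloud-shadow texture so it repeats across the terrain and scroll
// it by advancing the translation part of the texture matrix every frame.
D3DXMATRIX texMat;
D3DXMatrixIdentity(&texMat);
texMat._11 = 8.0f;                        // repeat 8 times in u
texMat._22 = 8.0f;                        // repeat 8 times in v
texMat._31 = fmodf(time * 0.010f, 1.0f);  // u translation (row 3 for 2D texcoords)
texMat._32 = fmodf(time * 0.005f, 1.0f);  // v translation, a little slower

pD3DDev->SetTransform(D3DTS_TEXTURE1, &texMat);
pD3DDev->SetTextureStageState(1, D3DTSS_TEXTURETRANSFORMFLAGS, D3DTTFF_COUNT2);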

Geomorphing<br />

One problem with geometry management using level of detail is that at some<br />

point vertices will have to be removed or added, which leads to the alreadydescribed<br />

“popping” effect. In our case of geo-mipmapping, where the number of<br />

vertices is doubled or halved at each tessellation level change, this popping<br />

becomes very visible. In order to reduce the popping effect, geomorphing is introduced.<br />

The aim of geomorphing is to move (morph) vertices softly into their position<br />

in the next level before that next level is activated. If this is done perfectly, no<br />

popping but only slightly moving vertices are observed by the user. Although this<br />

vertex moving looks a little bit strange if a very low detailed terrain mesh is used,<br />

it is still less annoying to the user than the popping effect.<br />

It can be shown that only vertices with odd indices inside a patch have to<br />

move and that those vertices on even positions can stay fixed because they are<br />

not removed when switching to the next coarser tessellation level. Figure 6a<br />

shows the tessellation of a patch in tessellation level 2 from a top view. Figure 6b<br />

shows the next level of tessellation coarseness (level 3) and that the vertices 1, 2,<br />

and 3 do not have to move since they are still there in the next level. There are<br />

three possible cases in which a vertex has to move:<br />

• Case A: The vertex is on an odd x- and even y-position. Vertex has to move into the middle position between the next left (1) and the right (2) vertices.
• Case B: The vertex is on an odd x- and odd y-position. Vertex has to move into the middle position between the next top-left (1) and the bottom-right (3) vertices.
• Case C: The vertex is on an even x- and odd y-position. Vertex has to move into the middle position between the next top (2) and the bottom (3) vertices.

Things become much clearer when taking a look at the result of the morphing<br />

process: After the morphing is done, the patch is retessellated using the next tessellation

level. In Figure 6b it becomes obvious that the previously existing vertex<br />

A had to move into the average middle position between the vertices 1 and 2<br />

in order to be removed without popping.
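A small CPU-side sketch of the three cases may make the rule easier to see. The heightmap accessor and the patch-local coordinate convention used here are assumptions of the sketch, not code from the demo.

// Returns the height an odd-indexed vertex has to morph toward before the patch
// switches to the next coarser level. 'GetHeight' and the orientation of x/y are
// assumptions of this sketch.
float MorphTargetHeight(int x, int y, float (*GetHeight)(int, int))
{
    const bool oddX = (x & 1) != 0;
    const bool oddY = (y & 1) != 0;

    if (oddX && !oddY)      // Case A: middle of the left and right neighbors
        return 0.5f * (GetHeight(x - 1, y) + GetHeight(x + 1, y));
    if (oddX && oddY)       // Case B: middle of the top-left and bottom-right neighbors
        return 0.5f * (GetHeight(x - 1, y - 1) + GetHeight(x + 1, y + 1));
    if (!oddX && oddY)      // Case C: middle of the top and bottom neighbors
        return 0.5f * (GetHeight(x, y - 1) + GetHeight(x, y + 1));

    return GetHeight(x, y); // even/even vertices survive the coarser level and stay fixed
}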


Figure 6a: Fine geometry with morphing vertices

Figure 6b: Corresponding coarser tessellation level. Only odd-indexed vertices were removed.

Optimizations

Although the geometry’s creation is very fast and we are rendering the mesh<br />

using only a small number of long triangle strips (usually a few hundred

strips per frame), there are quite a few optimizations that we can do to increase<br />

the performance on the side of the processor as well as the graphics card.<br />

As described in the section titled “Materials,” we use a multi-pass rendering<br />

approach to apply more than one material to the ground. Generally, most materials<br />

will be used only in small parts of the landscape and be invisible in most others.<br />

The alpha channel of the material’s lightmap defines where which material is visible.<br />

Of course, it’s a waste of GPU bandwidth to render materials on patches that<br />

don’t use that material at all (where the material’s alpha channel is zero in the<br />

corresponding patch’s part).<br />

It’s easy to see that if the part of a material’s alpha channel that covers one<br />

distinct patch is completely set to zero, then this patch does not need to be rendered<br />

with that material. Assuming that the materials’ alpha channels won’t<br />

change during run time, we can calculate for each patch which materials will be<br />

visible and which won’t in a preprocessing step. Later at run time, only those<br />

passes are rendered that really contribute to the final image.<br />
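Such a preprocessing step can be as simple as scanning the material's alpha channel once per patch, as in the sketch below. An 8-bit alpha channel, a square lightmap, and a square patch grid are assumed, and all names are made up for illustration.

// Returns true if the given patch needs a render pass for this material, i.e., if the
// material's lightmap alpha is non-zero anywhere inside the patch's rectangle.
bool PatchUsesMaterial(const unsigned char* alpha, int texSize,
                       int patchesPerSide, int patchX, int patchY)
{
    const int texelsPerPatch = texSize / patchesPerSide;
    const int x0 = patchX * texelsPerPatch;
    const int y0 = patchY * texelsPerPatch;

    for (int y = y0; y < y0 + texelsPerPatch; ++y)
        for (int x = x0; x < x0 + texelsPerPatch; ++x)
            if (alpha[y * texSize + x] != 0)
                return true;   // at least one visible texel: keep this pass

    return false;              // alpha is zero everywhere: skip this pass for the patch
}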

Another important optimization is to reduce the number of patches that need<br />

to be rendered at all. This is done in three steps. First, a rectangle that covers the<br />

projection of the viewing frustum onto the ground plane is calculated. All patches<br />

outside that rectangle will surely not be visible. All remaining patches are culled<br />

against the viewing frustum. To do this, we clip the patches’ bounding boxes<br />

against all six sides of the viewing frustum. All remaining patches are guaranteed<br />

to lie at least partially inside the camera’s visible area. Nevertheless, not all of<br />

these remaining patches will necessarily be visible because some of them will<br />

probably be hidden behind other patches (e.g., a mountain). To optimize this case,

we can finally use a PVS (Potentially Visible Sets) algorithm to further reduce the<br />

number of patches that need to be rendered.
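The box-versus-frustum part of this test is the usual six-plane check. The following sketch assumes the planes are stored with their normals pointing into the frustum; it is conservative, i.e., it never culls a visible patch but may let a few invisible ones through.

struct Plane { float nx, ny, nz, d; };   // plane normal + distance, normal points inside

// Hypothetical test of a patch's axis-aligned bounding box against the six frustum planes.
bool BoxIntersectsFrustum(const float minPt[3], const float maxPt[3], const Plane planes[6])
{
    for (int i = 0; i < 6; ++i)
    {
        // pick the box corner that lies farthest along the plane normal
        const float px = planes[i].nx >= 0.0f ? maxPt[0] : minPt[0];
        const float py = planes[i].ny >= 0.0f ? maxPt[1] : minPt[1];
        const float pz = planes[i].nz >= 0.0f ? maxPt[2] : minPt[2];

        if (planes[i].nx * px + planes[i].ny * py + planes[i].nz * pz + planes[i].d < 0.0f)
            return false;   // the whole box lies outside this plane
    }
    return true;            // at least partially inside (conservative)
}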



PVS [Air91, Tel91] is used to determine, at run time, which patches can be<br />

seen from a given position and which are hidden by other objects (in our case, also<br />

patches). Depending on the type of landscape and the viewer’s position, a lot of<br />

patches can be removed this way. In Figure 7 the camera is placed in a valley and<br />

looks at a hill.<br />

Figure 7a: Final image
Figure 7b: Without PVS
Figure 7c: With PVS
Figure 7d: View from camera’s position
Figure 7e: Same scene as 7d from a different viewpoint with same PVS and culling performed (See Color Plate 1.)

Figure 7b shows that a lot of triangles are<br />

rendered that do not contribute to the final<br />

image because they are hidden by the front<br />

triangles forming the hill. Figure 7c shows<br />

how PVS can successfully remove most of<br />

those triangles. Figures 7d and 7e show the<br />

same PVS optimized scene, as seen from<br />

the camera’s view and as seen from above.<br />

The nice thing about PVS is that the cost of<br />

processing power is almost zero at run time<br />

because most calculations are done offline<br />

when the terrain is designed.


In order to (pre-) calculate a PVS, the area of interest is divided into smaller<br />

parts. In our case it is obvious that we should use patches for those parts. For<br />

example, a landscape consisting of 16x16 patches requires 16x16 cells on the<br />

ground plane (z=0). To allow the camera to move up and down, it is necessary to<br />

have several layers of such cells. Tests have shown that 32 layers in a range of<br />

three times the height of the landscape are enough for fine-grained PVS usage.

One problem with PVS is the large amount of memory needed to store all the<br />

visibility data. In a landscape with 16x16 patches and 32 layers of PVS data, we<br />

get 8,192 PVS cells. For each cell we have to store the 16x16 patches that are<br />

visible from that cell. This means that we have to store more than two million values.<br />

Fortunately, we only need to store one-bit values (visible/not visible) and can<br />

save the PVS as a bit field, which results in a 256Kbyte data file in this example<br />

case.<br />
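The numbers above follow directly from the cell and patch counts; a small helper like the following sketch (parameter names are assumptions) reproduces the 256Kbyte figure for 16x16 patches and 32 layers.

// One visibility bit per (PVS cell, patch) pair, packed into a bit field.
unsigned int PvsSizeInBytes(unsigned int patchesX, unsigned int patchesY, unsigned int layers)
{
    const unsigned int cells       = patchesX * patchesY * layers;   // 16*16*32 = 8,192 cells
    const unsigned int bitsPerCell = patchesX * patchesY;            // 16*16 = 256 bits per cell
    const unsigned int totalBits   = cells * bitsPerCell;            // 2,097,152 bits
    return (totalBits + 7) / 8;                                      // 262,144 bytes = 256Kbytes
}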

Figure 8 shows an example image<br />

from the PVS calculation application<br />

where the camera is located in the center<br />

of the valley (the black part in the<br />

middle of the green dots (the lighter<br />

dots at the top center)). All red dots<br />

represent those patches that are not

visible from that location. Determining<br />

whether a patch is visible from a location<br />

is done by using an LOS (line of<br />

sight) algorithm, which tracks a line<br />

from the viewer’s position to the<br />

patch’s position. If the line does not hit<br />

the landscape on its way to the patch,<br />

this patch is visible from that location.<br />
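Such an LOS test can be implemented as a simple march along the line, sampling the heightmap at each step, as in the sketch below. The heightmap accessor and the sampling density are assumptions; since the PVS is computed offline, a brute-force march is perfectly acceptable.

// Returns true if nothing along the line from the PVS cell (eye) to the patch center
// rises above the line, i.e., the patch is visible from that cell.
bool PatchVisibleFrom(float eyeX, float eyeY, float eyeZ,
                      float patchX, float patchY, float patchZ,
                      float (*HeightAt)(float, float), int steps)
{
    for (int i = 1; i < steps; ++i)
    {
        const float t = (float)i / (float)steps;
        const float x = eyeX + (patchX - eyeX) * t;
        const float y = eyeY + (patchY - eyeY) * t;
        const float z = eyeZ + (patchZ - eyeZ) * t;

        if (z < HeightAt(x, y))
            return false;   // the line dips below the terrain: patch is hidden
    }
    return true;            // nothing blocked the line
}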

To optimize memory requirements,<br />

the renderer distinguishes<br />

between patches that are active (currently visible) and those that aren’t. Only<br />

those patches that are currently active are fully resident in memory. The memory<br />

footprint of inactive patches is rather low (about 200 bytes per patch).<br />

Figure 8: PVS from top view. The camera sits in the valley in the middle of the green dots.

Geomorphing in Hardware

Doing geomorphing for a single patch basically means doing vertex tweening<br />

between the current tessellation level and the next finer one. The tessellation<br />

level calculation returns a tessellation factor in the form of a floating-point value,<br />

where the integer part means the current level and the fractional part denotes the<br />

tweening factor (e.g., a factor of 2.46 means that tweening is done between levels<br />

2 and 3 and the tweening factor is 0.46). Tweening between two mesh representations<br />

is a well-known technique in computer graphics and easily allows an implementation<br />

of morphing for one single patch (vertices that should not move simply<br />

have the same position in both representations).
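Splitting the returned factor is trivial but worth spelling out, since the two parts are used in different places: the integer part selects the tessellation level and the fractional part drives the tweening. A minimal sketch (the function name is an assumption):

#include <cmath>

void SplitTessellationFactor(float factor, int* level, float* tween)
{
    *level = (int)std::floor(factor);   // integer part: current tessellation level (e.g., 2)
    *tween = factor - (float)*level;    // fractional part: tweening factor (e.g., 0.46)
}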



The problem becomes more difficult if a patch’s neighbors are considered.<br />

Problems start with the shared border vertices, which can only follow one of the<br />

two patches but not both (unless we accept gaps). As a consequence, one patch<br />

has to adapt its border vertices to those of its neighbor. In order to do correct<br />

geomorphing, it is necessary that the finer patch allows the coarser one to dictate<br />

the border vertices’ position. This means that we do not only have to care about<br />

one tweening factor as in the single patch case but have to add four more factors<br />

for the four shared neighbor vertices. Since the vertex shader cannot distinguish<br />

between interior and border vertices, these five factors have to be applied to all<br />

vertices of a patch. So we are doing a tweening between five meshes.<br />

As if this wasn’t already enough, we also have to take special care with the<br />

inner neighbor vertices of the border vertices. Unfortunately, these vertices also<br />

need their own tweening factor in order to allow correct vertex insertion (when<br />

switching to a finer tessellation level). To point out this quite complicated situation<br />

more clearly, we go back to the example of Figure 6b. For example, we state<br />

that the patch’s left border follows its coarser left neighbor. Then the tweening<br />

factor of vertex 1 depends on the left neighbor, whereas the tweening factor of all<br />

interior vertices (such as vertex 2) depend on the patch itself. When the patch<br />

reaches its next finer tessellation level (Figure 6a), the new vertex A is inserted.<br />

Figure 9 shows the range in which vertices 1 and 2 can move and the range in<br />

which vertex A has to be inserted. (Recall that a newly inserted vertex must<br />

always lie in the middle of its preexisting neighbors.) To make it clear why vertex<br />

A needs its own tweening factor, suppose that the vertices 1 and 2 are both at<br />

their bottom position when A is inserted (tweeningL and tweeningI are both 0.0).<br />

Later on when A is removed, the vertices 1 and 2 might lie somewhere else and<br />

A would now probably not lie in the middle between those two if it had the same<br />

tweening factor as vertex 1 or vertex 2. The consequence is that vertex A must<br />

have a tweening factor (tweeningA) that depends on both the factor of vertex 1<br />

(tweeningL — the factor from the left neighboring patch) and on that of vertex 2<br />

(tweeningI — the factor by which all interior vertices are tweened).<br />

Figure 9: Vertex insertion/removal range



What we want is the following:<br />

Vertex A should:<br />

• be inserted/removed in the middle between the positions of vertex 1 and vertex 2
• not pop when the patch switches to another tessellation level
• not pop when the left neighbor switches to another tessellation level

The simple formula tweeningA = (1.0-tweeningL) * tweeningI does the job.<br />

Each side of a patch has such a tweeningA that results in four additional tessellation<br />

levels.<br />

Summing this up, we have nine tessellation levels that must all be combined<br />

every frame for each vertex. What we actually do in order to calculate the final<br />

position of a vertex is the following:<br />

PosFinal = PosBase + tweeningI*dI + tweeningL*dL + tweeningR*dR + tweeningT*dT + ...<br />

Since we only morph in one direction (as there is no reason to morph other than<br />

up/down in a heightmap-generated terrain), this results in nine multiplications<br />

and nine additions just for the geomorphing task (not taking into account any<br />

matrix multiplications for transformation). This would be quite slow in terms of<br />

performance on the CPU. Fortunately, the GPU provides us with an ideal solution.<br />

The vertex shader command dp4 can multiply four values with four other values<br />

and sum the products in just one instruction. This allows us to do all these calculations<br />

in just five instructions, which is only slightly more than a single 4x4<br />

matrix multiplication takes.<br />
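To make explicit what those dp4 instructions evaluate, here is the same morphing sum written out on the CPU. All names are assumptions of this sketch; the d-values correspond to the per-vertex y-movement deltas stored in the vertex data shown next.

// Nine multiply-adds in total, matching the PosFinal formula above: one for the patch's
// own tweening factor and eight for the two packed groups of neighbor factors.
float MorphedHeight(float baseY,
                    float tweenSelf, float dSelf,
                    const float tweenLR[4], const float dLR[4],   // left, left2, right, right2
                    const float tweenBT[4], const float dBT[4])   // bottom, bottom2, top, top2
{
    float y = baseY + tweenSelf * dSelf;
    for (int i = 0; i < 4; ++i)
        y += tweenLR[i] * dLR[i] + tweenBT[i] * dBT[i];
    return y;
}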

The following code snippet shows the vertex data and constants layout that is<br />

pushed onto the graphics card.<br />

; Constants specified by the app<br />

;<br />

; c0 = (factorSelf, 0.0f, 0.5f, 1.0f)<br />

; c2 = (factorLeft, factorLeft2, factorRight, factorRight2),<br />

; c3 = (factorBottom, factorBottom2, factorTop, factorTop2)<br />

;<br />

; c4-c7 = WorldViewProjection Matrix<br />

; c8-c11 = Pass 0 Texture Matrix<br />

;<br />

;<br />

; Vertex components (as specified in the vertex DECLARATION)<br />

;<br />

; v0 = (posX, posZ, texX, texY)<br />

; v1 = (posY, yMoveSelf, 0.0, 1.0)<br />

; v2 = (yMoveLeft, yMoveLeft2, yMoveRight, yMoveRight2)<br />

; v3 = (yMoveBottom, yMoveBottom2, yMoveTop, yMoveTop2)<br />

We see that only four vectors are needed to describe each vertex, including all<br />

tweening. Note that those vectors v0-v3 do not change as long as the patch is not<br />

retessellated; they are therefore good candidates for static vertex buffers.<br />

The following code shows how vertices are tweened and transformed by the<br />

view/projection matrix.<br />




;-------------------------------------------------------------------------<br />

; Vertex transformation<br />

;-------------------------------------------------------------------------
mov r0, v0.xzyy ; build the base vertex

mov r0.w, c0.w ; set w-component to 1.0<br />

dp4 r1.x, v2, c2 ; calc all left and right neighbor tweening<br />

dp4 r1.y, v3, c3 ; calc all bottom and top neighbor tweening<br />

mad r0.y, v1.y, c0.x, v1.x ; add factorSelf*yMoveSelf<br />

add r0.y, r0.y, r1.x ; add left and right factors<br />

add r0.y, r0.y, r1.y ; add bottom and top factors<br />

m4x4 r3, r0, c4 ; matrix transformation<br />

mov oPos, r3<br />

While this code could surely be further optimized, there is no real reason to do so,<br />

since it is already very short for a typical vertex shader.<br />

Finally, there is only texture coordinate transformation.<br />

;-------------------------------------------------------------------------<br />

; Texture coordinates<br />

;-------------------------------------------------------------------------<br />

; Create tex coords for pass 0 – material (use texture matrix)<br />

dp4 oT0.x, v0.z, c8<br />

dp4 oT0.y, v0.w, c9<br />

; Create tex coords for pass 1 – lightmap (simple copy, no transformation)<br />

mov oT1.xy, v0.zw<br />

oT0 is multiplied by the texture matrix to allow scaling, rotation, and movement<br />

of materials and cloud shadows. oT1 is not transformed, since the texture coordinates<br />

for the lightmap do not change and always span (0,0) to (1,1).<br />

Results

The following table shows frame rates achieved on an Athlon-1300 with a standard

GeForce3. The minimum scene uses just one material together with a<br />

lightmap (two textures in one render pass — see Figure 10a). The full scene renders<br />

the same landscape with three materials, plus a cloud shadow layer, plus a<br />

skybox and a large lens flare (seven textures in four render passes for the terrain<br />

— see Figure 10b).<br />

The following are frame rates achieved at different scene setups and LOD<br />

systems:<br />

                 Static LOD    Software Morphing    Hardware Morphing
Minimum Scene    587 fps       312 fps              583 fps
Full Scene       231 fps       205 fps              230 fps



The table shows that geomorphing done using the GPU is almost as fast as doing<br />

no geomorphing at all. In the minimum scene the software morphing method falls<br />

back tremendously since the CPU and the system bus cannot deliver the high<br />

frame rates (recall that software morphing needs to send all vertices over the bus<br />

each frame) achieved by the other methods. Things change when using the full<br />

scene setup. Here the software morphing takes advantage of the fact that the terrain<br />

is created and sent to the GPU only once but is used four times per frame for<br />

the four render passes, while the skybox and lens flare slow down the frame rate

independently. Notice that the software morphing method uses the same<br />

approach as for hardware morphing. An implementation fundamentally targeted<br />

for software rendering would come off far better.<br />

Figure 10a: Terrain with one material layer
Figure 10b: Same as 10a but with three materials (grass, stone, mud) + moving cloud layer + skybox + lens flare

In this article I’ve shown how to render a dynamically view-dependent triangulated<br />

landscape with geomorphing by taking advantage of today’s graphics hardware.<br />

Splitting the mesh into smaller parts allowed us to apply the described<br />

optimizations, which led to the high frame rates achieved. Further work could be

done to extend the system to use geometry paging for really large terrains. Other<br />

open topics are the implementation of different render paths for several graphics<br />

cards or using a bump map instead of a lightmap in order to achieve dynamic lighting.<br />

The new generation of DX9 cards allows the use of up to 16 textures per<br />

pass, which would enable us to draw seven materials plus a cloud shadow layer in<br />

just one pass.<br />

References

[Air91] Airey, John, “Increasing Update Rates in the Building Walkthrough System

with Automatic Model-Space Subdivision and Potentially Visible Set Calculations,”<br />

Ph.D. thesis, University of North Carolina, Chapel Hill, 1991.<br />

[Boe00] de Boer, Willem H., “Fast Terrain Rendering Using Geometrical<br />

MipMapping,” E-mersion Project, October 2000, http://www.connectii.net/<br />

emersion.



[Cor01] Corel Bryce by Corel Corporation, http://www.corel.com.<br />

[Duc97] Duchaineau, M., M. Wolinski, D. Sigeti, M. Miller, C. Aldrich, and<br />

M. Mineev-Weinstein, “ROAMing Terrain: Real-time Optimally Adapting<br />

Meshes,” IEEE Visualization, Oct. 1997, pp. 81-88, http://www.llnl.gov/<br />

graphics/ROAM.<br />

[Eva96] Evans, Francine, Steven Skiena, and Amitabh Varshney, “Optimizing<br />

triangle strips for fast rendering,” 1996, pp. 319-326, http://www.cs.sunysb.edu/<br />

evans/stripe.html.<br />

[Hop98] Hoppe, H., “Smooth View-Dependent Level-of-Detail Control and its<br />

Application to Terrain Rendering,” IEEE Visualization, Oct. 1998, pp. 35-42,<br />

http://www.research.microsoft.com/~hoppe.<br />

[Slay95] Slayton, Joseph R., Wilbur, the latest version can be retrieved at<br />

http://www.ridgenet.net/~jslayton/software.html.<br />

[Sno01] Snook, Greg, “Simplified Terrain Using Interlocking Tiles,” Game Programming Gems 2, Charles River Media, 2001, pp. 377-383.

[Tel91] Teller, Seth J. and Carlo H. Sequin, “Visibility preprocessing for interactive<br />

walkthroughs,” Computer Graphics (Proceedings of SIGGRAPH ’91), July<br />

1991, 25(4):61-69.<br />

[Usg86] U.S. Geological Survey (USGS), “Data Users Guide 5 — Digital Elevation<br />

Models,” Earth Science Information Center (ESIC), U.S. Geological Survey,<br />

507 National Center, Reston, VA, 1986.


3D Planets on the GPU<br />

Jesse Laeuchli<br />

Rendering planets in a 3D application is a difficult task. Previously, if a programmer<br />

wanted to include planets, the CPU had to juggle planet rendering and any<br />

other tasks the program might have. Now it is possible to perform almost the<br />

entire task on the GPU using vertex and pixel shaders. Moreover, the procedural<br />

model presented here allows a near infinite number of different planets to be rendered.<br />

This article examines rendering planets entirely on the GPU using nVidia’s<br />

Cg. (See [nVidia] for more information about Cg.)<br />

The most important task in rendering planets is to generate the geometry.<br />

This is usually done by first generating a sphere and then deforming the points on<br />

the sphere with some type of fractal. The sphere can be generated using the parametric<br />

equation:<br />

X=sin(u)*sin(v)<br />

Y=cos(u)*sin(v)<br />

Z=cos(v)<br />

Evaluating this equation on the GPU is fairly simple. It can be done by passing the<br />

u,v values in position.xy and then calling the sincos function. Using the sincos<br />

function (as opposed to separately calling the sin and cos functions) can make the<br />

code cleaner and faster. The code below achieves this.<br />

float fxsin;<br />

float fxcos;<br />

float fysin;<br />

float fycos;<br />

sincos(In.pos.x,fxsin,fxcos);<br />

sincos(In.pos.y,fysin,fycos);<br />

Sphere.x= fxsin* fysin;<br />

Sphere.y= fxcos* fysin;<br />

Sphere.z= fycos;<br />

After the sphere has been generated, it must be deformed to create the planet<br />

geometry. A function is needed that can be called at each point of the sphere,<br />

which will then return a scalar value that can be used to modify the sphere’s<br />

geometry. We can obtain this function by using noise to create a fractal. The<br />

fractal shown here is a hybrid multifractal [Ebert98] and is created by calling 3D<br />

noise several times and then scaling the noise by the product of the frequencies.<br />

This creates a fractal with smooth planes, rounded hills, and tall mountains. See<br />




[Ebert98] for more types of fractals. Below is the code to implement the<br />

multifractal:<br />

float MultiFractal(float3 pos, float octaves, float offset,float freqchange,float h,<br />

float4 pg[B2])<br />

{<br />

float result;<br />

float signal;<br />

float weight;<br />

float freq=1;<br />

result=(noise(pos,pg)+offset)*pow(freq,-h);<br />

freq*=freqchange;<br />

weight=result;<br />

pos*=freqchange;<br />

for(int i=0;i<octaves;i++)
{
// accumulate the remaining octaves (standard hybrid multifractal form, after [Ebert98])
if(weight>1) weight=1;
signal=(noise(pos,pg)+offset)*pow(freq,-h);
result+=weight*signal;
weight*=signal;
freq*=freqchange;
pos*=freqchange;
}
return result;
}


p=p+i[1];<br />

float4 b;<br />

b[0] = pg[ p[0] ].w;<br />

b[1] = pg[ p[1] ].w;<br />

b[2] = pg[ p[0] + 1 ].w;<br />

b[3] = pg[ p[1] + 1 ].w;<br />

b=b+i[2];<br />

// compute dot products between gradients and vectors<br />

float4 r;<br />

r[0] = dot(pg[ b[0] ].xyz, f );<br />

r[1] = dot(pg[ b[1] ].xyz, f - float3(1.0f, 0.0f, 0.0f));<br />

r[2] = dot(pg[ b[2] ].xyz, f - float3(0.0f, 1.0f, 0.0f));<br />

r[3] = dot(pg[ b[3] ].xyz, f - float3(1.0f, 1.0f, 0.0f));<br />

float4 r1;<br />

r1[0] = dot(pg[ b[0] + 1 ].xyz, f - float3(0.0f, 0.0f, 1.0f));<br />

r1[1] = dot(pg[ b[1] + 1 ].xyz, f - float3(1.0f, 0.0f, 1.0f));<br />

r1[2] = dot(pg[ b[2] + 1 ].xyz, f - float3(0.0f, 1.0f, 1.0f));<br />

r1[3] = dot(pg[ b[3] + 1 ].xyz, f - float3(1.0f, 1.0f, 1.0f));<br />

// interpolate<br />

f=scurve(f);<br />

r = lerp(r, r1, f[2]);<br />

r = lerp(r.xyyy, r.zwww, f[1]);<br />

return lerp(r.x, r.y, f[0]);<br />

}<br />


Perlin noise works well with vertex profiles but is less suitable for pixel profiles,<br />

where other (albeit lower quality) noise functions can be written that use fewer<br />

texture accesses and require fewer instructions.<br />

By passing the x,y,z coordinates<br />

of the sphere to the<br />

multifractal function, it is possible<br />

to create the planet geometry.<br />

Figure 1 is a screen shot of<br />

the generated geometry.

Figure 1: Untextured planet geometry

After the geometry has<br />

been generated, the planet<br />

needs to be textured. This can<br />

be done in a pixel shader by first<br />

creating a one-dimensional texture<br />

containing the various colors<br />

that the planet will use. In<br />

the example program, a simple

texture containing just a few<br />

shades of green, brown, and white is used, but different textures and textures<br />

containing different colors are also possible. If, for example, a planet resembling<br />

Mars is required, then the texture could be filled with reddish colors. To index<br />




into the texture, the vertex shader passes the height used to modify the sphere<br />

geometry, scaled to the range [0,1], to the pixel shader. The pixel shader then<br />

uses this to access the texture. However, this leads to a fairly unrealistic color<br />

distribution. In nature, height is not the sole basis for the terrain color. Snow does
not appear uniformly on the tops of mountains, and sometimes it falls lower down.

The same applies for grass and other types of terrain. To account for this, noise<br />

can be used to modify the index into the texture. This makes the distribution of<br />

terrain types more random and visually pleasing. Below is the code used to<br />

achieve this:<br />

float height=In.dif.x; //Height passed from vertex shader<br />

float modifyindex=(2*noise(normalize(In.tex1.xyz)*10,BaseTexture2)-1)/10; //scale noise

height+=modifyindex; //modify height.<br />

float4 color=tex1D(BaseTexture, height); //index into buffer.<br />

The noise function used here is a type of value noise. It works by indexing into an<br />

array of random variables, then linearly interpolating the results and smoothing<br />

those results with an ease curve. It uses fewer texture accesses than Perlin noise<br />

and typically requires fewer instructions. However, another noise function may be<br />

substituted for this one without a significant change in the results.<br />

Figure 2a: Planet texture generated using noise
Figure 2b: Planet texture generated using just the height value

half random(float x,float y,float z,sampler1D g)<br />

{<br />

half index=(x*6.6)+(y*7.91)+(z*8.21);<br />

index=index*0.001953125;<br />

index=h1tex1D(g,index);<br />

return index;<br />

}<br />

half3 scurve(half3 v)<br />

{<br />

return v*v*(3-2*v);<br />

}


half noise(float3 v,sampler1D g)<br />

{<br />


half3 LatticePoint=floor(v);<br />

half3 frac1=scurve(frac(v));<br />

half4 v1;<br />

v1.x = random(LatticePoint.x,LatticePoint.y,LatticePoint.z,g);<br />

v1.y = random(LatticePoint.x + 1, LatticePoint.y,LatticePoint.z,g);<br />

v1.z = random(LatticePoint.x, LatticePoint.y + 1,LatticePoint.z,g);<br />

v1.w = random(LatticePoint.x + 1, LatticePoint.y + 1,LatticePoint.z,g);<br />

half2 i1 = lerp(v1.xz , v1.yw , frac1.x);<br />

half a=lerp(i1.x , i1.y , frac1.y);<br />

v1.x = random(LatticePoint.x,LatticePoint.y,LatticePoint.z+1,g);<br />

v1.y = random(LatticePoint.x + 1, LatticePoint.y,LatticePoint.z+1,g);<br />

v1.z = random(LatticePoint.x, LatticePoint.y + 1,LatticePoint.z+1,g);<br />

v1.w = random(LatticePoint.x + 1, LatticePoint.y + 1,LatticePoint.z+1,g);<br />

i1 = lerp(v1.xz , v1.yw , frac1.x);<br />

half b=lerp(i1.x , i1.y , frac1.y);<br />

return lerp(a,b,frac1.z);
}

It is also possible to use this noise function to create a cloud layer for the planet.<br />

To do this, another slightly bigger sphere needs to be drawn around the planet,<br />

and then several octaves of noise need to be summed, each octave with successively

higher frequency and lower amplitude.<br />

color.w=noise(input,BaseTexture)+noise(input*2,BaseTexture)*.5+noise(input*4,BaseTexture)<br />

*.25+noise(input*8,BaseTexture)*.125;<br />

color.w=1-color.w;<br />

This could be improved by drawing<br />

several cloud spheres, with<br />

each sphere being slightly larger<br />

than the last. This gives the<br />

clouds a volumetric look.<br />

Oceans can easily be added to<br />

the planet by rendering a<br />

semitransparent sphere with a<br />

radius less than that of the planet<br />

sphere. Then, any land that has a<br />

low enough height value will be<br />

below water level.

Figure 3: Clouds rendered with five octaves of noise

The last step in rendering the<br />

planet is lighting it. It is quite<br />




difficult to achieve accurate per-pixel lighting on the planet. To do this, it is necessary<br />

to either recompute the sphere normals when the sphere is deformed or<br />

generate tangent space for the planet. Unfortunately, due to the current instruction<br />

count of the program, it is impossible to regenerate the normals. However, it<br />

is easy to generate tangent space for the sphere by taking the partial derivative<br />

with respect to u,v, giving:<br />

∂u = ( cos(u)*sin(v), –sin(u)*sin(v), 0 )
∂v = ( cos(v)*sin(u), cos(v)*cos(u), –sin(v) )

It would then be possible to use the amount that the sphere geometry is perturbed<br />

by to generate normals, and this would work for lighting the clouds. However,

as we generate the sphere geometry for the planet, the sphere equation changes,<br />

and so it becomes much more difficult to generate the tangent space by taking the<br />

derivative of the parametric sphere equation. The total equation is:<br />

X=sin(u)*sin(v)*Multifractal(sin(u)*sin(v), cos(u)*sin(v), cos(v))+1<br />

Y=cos(u)*sin(v)*Multifractal(sin(u)*sin(v), cos(u)*sin(v), cos(v))+1<br />

Z=cos(v)*Multifractal(sin(u)*sin(v), cos(u)*sin(v), cos(v))+1<br />

Because the partial derivative for this function is difficult to find and the vertex<br />

program is already reaching the maximum instruction limit, the example program<br />

simply uses the sphere normals to generate per-pixel lighting. This means that<br />

the planet lighting is not accurate, as the changes to the geometry of the sphere<br />

are not reflected; however, it does allow some lighting to be performed. This is<br />

done with the following code in the planet pixel shader.<br />

Out.dif.xyz= color.xyz*dot(normalize(In.tex1.xyz), In.tex2);<br />

//Light position in In.tex2, sphere normal in In.tex1<br />

Figure 4: Planet with cloud cover, noise texture, ocean,<br />

and per-pixel lighting (See Color Plate 2.)


Conclusion<br />


This article examined how to generate 3D planets using only the GPU to perform<br />

the required rendering by evaluating the multifractal, value, and Perlin noise functions<br />

almost entirely on the graphics card and using these functions to generate<br />

the planet geometry, textures, and atmosphere. This provides a good starting<br />

point for developers seeking to implement planets using the latest hardware and<br />

for further experimentation with 3D planets.<br />

References

[Ebert98] Ebert, David S., et al., Texturing and Modeling: A Procedural Approach

(Second Edition), San Diego: Academic Press, 1998.<br />

[nVidia] Cg Language information available online at http://www.cgshaders.org<br />

and http://developer.nvidia.com/view.asp?IO=cg_about.<br />

[Perlin] Perlin, Ken, “Improved Noise reference implementation,”<br />

http://mrl.nyu.edu/~perlin/noise/.<br />



Cloth Animation with Pixel and Vertex Shader 3.0

Kristof Beets

Introduction

In computer graphics, simulating cloth has always been a topic of much research<br />

[UCL02]. In everyday life we observe cloth behavior without realizing the complexity<br />

of the physics involved. The model and shaders introduced in this article<br />

attempt to simulate cloth using a simplified massless spring model, which can be<br />

executed completely by next generation graphics hardware. The spring model is<br />

used to generate the position and normal of a cloth’s control points, which are<br />

then stored into “geometry textures” using an advanced pixel shader 3.0. Finally,<br />

the vertex texturing capabilities of the vertex shader 3.0 model allows us to render<br />

the deformed cloth using the position and normal data stored in these geometry<br />

textures.<br />

Basic Cloth Model<br />


Before attempting to simulate cloth behavior using shaders, it is important to<br />

understand the underlying cloth model that we will be implementing [Elias01].<br />

Our cloth surface is modeled using a network of nodes linked together by massless<br />

springs. A first-level approximation is to connect every node to its four direct<br />

neighbor nodes, thus creating a simple grid; however, this results in an extremely<br />

flexible cloth that fails to retain its area and shape. This can be improved by connecting<br />

each node to its eight direct neighbor nodes, thus adding diagonal springs<br />

that work against shearing deformations. A final optimization is to add four or<br />

eight more connections to neighbor nodes that are two steps away; these connections<br />

again battle deformation of the original cloth shape and also avoid excessive<br />

bending of the cloth surface. Ultimately, it is possible to connect each node to all<br />

the direct neighbors and those two steps away, resulting in 24 spring connections.<br />

Figure 1 shows a central node with the various spring configurations as described.


Figure 1: The interconnection of cloth springs<br />


Now let’s introduce an actual model for these interconnecting springs. A property<br />

of springs is that they will fight against any force that attempts to compress or<br />

stretch them. This behavior can be translated into the following formula:<br />

SpringForce = SpringConst × (DefaultSpringLength – SpringLength) / DefaultSpringLength

This formula calculates the relative deformation of the spring. If the spring is<br />

stretched, the relative deformation will be negative and result in a force counteracting<br />

the stretching. If the spring is compressed, the relative deformation will be<br />

positive and result in a force counteracting the compression (see Figure 2). If the<br />

spring is untouched, the relative deformation is zero and results in no force.<br />

SpringConst translates the relative deformation into an actual force. This constant<br />

can be used to modify the power of the spring: A high number will result in a<br />

strong counteracting force, while a low number will result in a small counteracting<br />

force. It is possible to further modify the spring behavior by changing this formula.<br />

For example, we could take the square of the relative deformation, which<br />

Figure 2: An example of springs with a Deformation Force (Fd) and the<br />

resulting Spring Force (Fs)



means that the force would behave in a nonlinear way to deformations. Effectively,<br />

this is how the cloth material type can be changed.<br />
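As a small illustration, the sketch below contrasts the linear response given above with a squared variant; the function names and the particular nonlinearity are assumptions and merely stand in for "changing this formula."

#include <cmath>

float SpringForceLinear(float defaultLen, float len, float springConst)
{
    const float deform = (defaultLen - len) / defaultLen;   // negative when stretched
    return springConst * deform;
}

float SpringForceSquared(float defaultLen, float len, float springConst)
{
    const float deform = (defaultLen - len) / defaultLen;
    return springConst * deform * std::fabs(deform);        // keep the sign, square the magnitude
}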

To translate this force into a movement, we have to dig out Newton’s second law:<br />

Force = Mass × Acceleration

Or, reorganized:<br />

Acceleration = Force / Mass

Acceleration is the change of velocity over time, and velocity is the rate of change<br />

of the position over time. This can be translated to:<br />

VelocityNEW = VelocityOLD + Acceleration × Δt = VelocityOLD + (Force / Mass) × Δt

PositionNEW = PositionOLD + VelocityNEW × Δt = PositionOLD + VelocityOLD × Δt + (Force / Mass) × Δt²

In summary, the new position (after a period of time) is dependent on the old position,<br />

existing velocity, the force acting on the object, and the mass of the object.<br />

Our aim is to have a very simple model, so we will ignore velocity and acceleration<br />

and just reduce this to:<br />

PositionNEW = PositionOLD + ForceScaleConst × Force

Basically, we flatten all of the above factors into a single constant. The main property<br />

that we maintain is that the movement is related to the total force acting<br />

upon the nodes of the cloth. Combining this with our SpringForce equation, we<br />

get:<br />

PositionNEW = PositionOLD + Const × (DefaultSpringLength – SpringLength) / DefaultSpringLength

In other words, in this highly simplified model, the change of position is dependent<br />

only on the deformation of the spring multiplied by a constant.<br />

To recap, we first chose the basic model for our cloth: A grid of nodes represents<br />

the cloth surface with the nodes interconnected by a network of springs.<br />

The second step was to build a model for these springs that describes how the<br />

node will move under the impact of its neighboring nodes.<br />

Finally, we bring all of this together into one complete model that also takes<br />

into account external factors, such as gravity and collisions with objects; I’ll introduce<br />

this model using easy-to-understand pseudocode:<br />

Variables<br />

VECTOR ARRAY: ClothOld (0 to X, 0 to Y) (init. with start positions)<br />

VECTOR ARRAY: ClothNew (0 to X, 0 to Y) (target for result of model)<br />

VECTOR: MovementVector


VECTOR: SpringVector<br />

VECTOR: ForceVector<br />

VECTOR: Gravity (init. to (0, 0, g, 0) where g is gravity)<br />

SCALAR: Length<br />

SCALAR: ForceScaler<br />

CONSTANT SCALAR: NormalLength (undeformed length of spring)<br />

CONSTANT SCALAR: SmallAmount (const translates force to movement)<br />

Functions<br />

CheckConstraints (checks for collision, intersection, etc.)<br />

DisplayCloth (displays the cloth)<br />

Main Processing Loop<br />

For every node (x,y) on the cloth:<br />

MovementVector = Gravity<br />

For each of the 4/8/12/16/... neighboring points<br />

SpringVector = (position of neighbor) - (position of node)<br />

Length = length of SpringVector<br />

NormalLength = undeformed length SpringVector<br />

ForceScaler = (Length - NormalLength) / NormalLength<br />

SpringVector = (SpringVector/Length)<br />

ForceVector = SpringVector * ForceScaler<br />

ForceVector = ForceVector * SmallAmount<br />

MovementVector += ForceVector<br />

End of loop<br />

ClothNew (x,y) = ClothOld(x,y)+ MovementVector<br />

CheckConstraints (ClothNew (x,y))<br />

End of loop<br />

DisplayCloth (ClothNew)<br />

Copy all the values in ClothNew to ClothOld (double buffering)<br />

Repeat Main Processing Loop forever<br />


The pseudocode above shows an iterative loop that processes the input to create<br />

updated output positions. These output positions are then fed back into the system<br />

as input to create the next position and so on. The code uses a vector array to<br />

store the node positions; this array is initialized with the start positions of the<br />

nodes (cloth) before executing the main loop. For each node, the code looks at a<br />

certain number of neighboring nodes and, based on the distance between the current<br />

node and its neighbors, calculates the corresponding forces. The sum of<br />

these forces is then converted into a translation, which is added to the original<br />

position of the node along with some motion due to a static gravity. The conversion<br />

from forces to motion is done using a constant. This constant has to be chosen<br />

carefully: If the value is too big, the motion will be too large and the network<br />

will become unstable; if the constant is too small, the model will take forever to<br />

evolve. The new position finally undergoes a constraint check that involves<br />

checking collisions with objects. Specifically, if the new node position is within a<br />

constraining object, the node position has to be updated so the cloth will drape<br />

correctly on top of the object rather than sit inside it.



Implementation Using Shaders

Now that we have a model to simulate cloth, we can start to convert it to the<br />

world of pixel and vertex shaders so that full hardware acceleration can be used.<br />

The model uses a double-buffered vector array to store the position of each<br />

node; this is implemented using textures. This storage needs to support both<br />

reading and writing, which is possible with textures created with the
D3DUSAGE_RENDERTARGET flag. This position (x, y, and z) needs to be stored with sufficient

accuracy — at least a 16-bit float per component should be used. This can be<br />

achieved by using either a 64-bit texture format (such as D3DFMT_A16B16G16R16F)<br />

or Multiple Render Targets (MRTs, such as 2 x D3DFMT_G16R16F). Our goal is to<br />

use these values to create the final geometry on screen. If vertex lighting is<br />

required, a normal vector will also be needed for each vertex. This brings the<br />

number of components to six: x, y, z and Nx, Ny, Nz. These can be stored easily<br />

and efficiently in three render targets with format D3DFMT_G16R16F. Because the<br />

texture data contains positions and normals, it effectively contains geometry; for<br />

this reason, these textures are referred to as geometry textures. The size of these<br />

geometry textures matches the number of nodes in our cloth grid (tessellation).<br />

For example, if we want a 32x32 grid of nodes forming the cloth, we need a 32x32<br />

texture. Now that we have decided on our storage format, we can start to use it to<br />

implement our algorithm, which we will split into the following six phases: initialization,<br />

depth, cloth, constraint, normal map, and display.<br />
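Allocating these geometry textures in Direct3D 9 is straightforward; the sketch below shows one possible way to create the three G16R16F render targets. The function and variable names are assumptions, and a double-buffered implementation would simply allocate two such sets and ping-pong between them.

#include <d3d9.h>

// Create the three two-channel 16-bit float render targets holding (x,y), (z,Nx), (Ny,Nz).
bool CreateGeometryTextures(IDirect3DDevice9* device, UINT gridSize, IDirect3DTexture9* tex[3])
{
    for (int i = 0; i < 3; ++i)
    {
        if (FAILED(device->CreateTexture(gridSize, gridSize, 1,
                                         D3DUSAGE_RENDERTARGET, D3DFMT_G16R16F,
                                         D3DPOOL_DEFAULT, &tex[i], NULL)))
            return false;
    }
    return true;
}

// Binding them as MRTs for the cloth phase would then look like:
//   IDirect3DSurface9* surf;
//   tex[i]->GetSurfaceLevel(0, &surf);
//   device->SetRenderTarget(i, surf);
//   surf->Release();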

Initialization Phase<br />

The initialization phase is only run once at the start of the program or when we<br />

want to restart the cloth simulation. This phase fills the geometry textures<br />

(MRTs) with their initial<br />

startup values and clears the<br />

buffers. To keep things simple,<br />

we restrict our scene to a unit<br />

cube. The cloth starts at the<br />

top of the cube and falls down<br />

(possibly colliding with objects<br />

causing constraints) until it<br />

reaches a stable position or the<br />

bottom of the cube that is<br />

effectively the floor. This is<br />

illustrated in Figure 3.<br />

The initial values for our<br />

MRT are (x, y, start height of<br />

cloth). Since we are working in<br />

a unit cube, the x and y posi-<br />

tions can be generated quite<br />

easily using a trivial vertex and<br />

pixel shader program. All we<br />

Figure 3: The scene containing cloth and objects<br />

within the unit cube


need to do is render a full-screen quad with texture coordinates interpolating from<br />

0 to 1 along both the x and y-axes. We then store the interpolated texture coordinate<br />

for each pixel using the pixel shader, since each interpolated coordinate<br />

matches the position of a node.<br />

Vertex shader code:<br />

vs 3 0<br />

; Input registers<br />

dcl position0 v0 ;Position in NORMALIZED SCREEN COORDINATES<br />

dcl texcoord0 v4 ;Texture coordinates = base node position<br />

; Output registers<br />

dcl position0 o0.xyzw ;Vertex position<br />

dcl texcoord0 o1.xy ;Texcoord<br />

; C8 contains scaling constants that influence the cloth size<br />

mov r0, v0<br />

mov r0.w, c21.w<br />

mov o0, r0 ;Output Position<br />

mad r1, v4, c8.x, c8.y ;Scale cloth – change init positions<br />

mov o1.xy, r1; ;Output Texture Coord = node position<br />

Pixel shader code:<br />

ps 3 0<br />

; Input<br />

dcl texcoord0 v0.xy ;Tex Coord = node position<br />

; Output results<br />

mov r0.rg, v0.xy<br />

mov oC0, r0 ;Node (X,Y) = interpolated texcoord<br />

mov r0, c12 ; = (start height, 0.0f, 0.0f, 0.0f)

mov oC1, r0 ;Write Initial Depth<br />

mov r0, c12.y<br />

mov oC2, r0 ;Init to Zero<br />

At the end of this phase, we have initialized all our buffers and they are ready for<br />

processing by the following phases.<br />

Depth Phase<br />

Section I — Geometry Manipulation <strong>Tricks</strong><br />

Cloth Animation with Pixel and Vertex <strong>Shader</strong> 3.0<br />

45<br />

So far, we have not discussed how to handle constraints. The main aim for this<br />

implementation is to have cloth draping realistically over a collection of objects.<br />

When the objects are simple, it is easy to use a mathematical constraint. For<br />

example, it is quite trivial to detect if the new position of a node is inside a sphere.<br />

However, when working with more complex objects, such as a human body, a teapot,<br />

a table, etc., it becomes considerably more difficult to use mathematical



constraints. To handle cloth draping over complex objects, we use depth maps<br />

(height field). Since we have cloth falling down, we need at least a depth value for<br />

every vertical column within the unit cube. Using the (x, y) position of a node, we<br />

can then do a dependent read within the depth map to detect if a collision has<br />

occurred.<br />

Having only a top depth map does impose some limitations. For example,<br />

cloth might drape over a table, and during this process a tip of the cloth might flap<br />

down and move slightly underneath the table. If this happens, the tip of the cloth<br />

could suddenly be affected by the constraint (i.e., the table surface), and the tip<br />

will be moved instantly to the top of the table surface by the constraint, creating a<br />

cloth loop. Obviously this behavior is incorrect and can cause severe instability<br />

within the node-network. To solve this problem, a range is placed on the constraints.<br />

Specifically, it only applies the constraint if the depth value of the node is<br />

within a certain range of the constraint depth value. This issue and the solution<br />

are illustrated in Figure 4.<br />

Figure 4: The cloth loop problem and its solution using ranged constraints<br />
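Expressed in code, the ranged constraint is a two-sided test rather than a simple clamp. The sketch below assumes the cloth falls toward smaller z values and that the top depth map returns the height of the object underneath a node; both conventions, like the names, are assumptions of the sketch.

// Snap the node back onto the object surface only if it has penetrated the object
// and is still within 'range' of that surface (this avoids the cloth loop problem).
void ApplyTopDepthConstraint(float* nodeZ, float surfaceZ, float range)
{
    if (*nodeZ < surfaceZ && *nodeZ > surfaceZ - range)
        *nodeZ = surfaceZ;
}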

An even better constraint system is to use a “cube” depth map, meaning we create<br />

a depth map for all the surfaces of our unit cube; this allows us to do true volumetric<br />

testing. This can be illustrated using a sphere as the constraining object:<br />

The top and bottom depth maps will contain depth values that indicate the start of<br />

the sphere volume (the top depth map contains the depth of the top half of the<br />

sphere) and the end of the sphere volume (the bottom depth map contains the<br />

depth of the bottom half of the sphere). It is quite easy to fetch both of these<br />

depth values and do a comparison, and only if the node is between the top and<br />

bottom constraint points (that is, inside the volume along the z-axis) does it need<br />

to be moved to the value stored within the closest map. The same principle can


be applied to the other faces of the depth

cube map. Essentially, we check if a node is<br />

within the volume along each of the x, y, and<br />

z-axes, and only when a test along an axis<br />

indicates the point is inside the volume would<br />

the node position be corrected. This technique<br />

is illustrated in Figure 5.<br />

The resolution of the depth map(s)<br />

should be high enough to avoid jagged artifacts<br />

in the geometry; for static scenes, these<br />

depth map(s) only have to be calculated at the<br />

start of the simulation, so there is no reason<br />

not to use a sufficiently high resolution. For<br />

dynamic scenes, the situation is different<br />

because whenever the constraints change<br />

(i.e., objects move), the depth map(s) need to<br />

be regenerated, which incurs a fillrate cost.<br />

The following vertex shader code is used with an orthographic projection (as<br />

perspective distortion is unwanted in these depth constraint maps) to store the<br />

world space linear depth into the texture:<br />

vs 3 0<br />

; Input registers<br />

dcl position0 v0 ;Position in NORMALIZED SCREEN COORDINATES<br />

; Output registers<br />

dcl position0 o0.xyzw ;Vertex position<br />

dcl texcoord0 o1.x ;Texcoord<br />

; C0-3 contains World+View+Proj Matrix<br />

; C4-7 contains World+View Matrix<br />

; C9 contains scene scaling and translation values<br />


m4x4 r0, v0, c0 ;Transform by view/projection/world matrix<br />

mov o0, r0 ;Output position<br />

dp4 r1.z, v0, c6 ;Transform by world+view Z<br />

mov r1 , r1.z<br />

;Convert to world space depths rather than camera relative depths:<br />

add r1 , -r1, c9.x<br />

mul r1 , r1, c9.y ;Scale to unit cube depth sizes<br />

mov o1.x, r1 ;Move scaled world depth result into tex coord<br />

The pixel shader simply stores the depth value, created in the vertex shader and<br />

passed on through a texture coordinate field, in the render target.<br />

The current demonstration application implements a single top depth map<br />

with constraint range; a full cube depth map version might be added at a later<br />

stage.<br />


Figure 5: The usage of a cube depth<br />

map along one axis



Cloth Phase<br />

The cloth phase is where the real action occurs. The pixel shader used in this<br />

phase will need to read the node’s position and apply a step of the earlier<br />

described iterative cloth model to generate a new position. To create this shader,<br />

we need to translate our previous pseudocode into pixel and vertex shader code.<br />

Our pseudocode involves operations on the center node position using the position<br />

of several neighboring nodes as input. These positions need to be fetched<br />

from textures (filled during the initialization phase or during previous cloth<br />

phases), which requires texture coordinates that we will set up in the vertex<br />

shader. The following code sets up 16 2D texture coordinates; this is achieved by<br />

storing two sets of 2D coordinates in a single 4D coordinate register.<br />

vs 3 0<br />

; Input registers<br />

dcl position0 v0<br />

dcl texcoord0 v4<br />

; Output registers<br />

dcl position0 o0.xyzw ; Vertex position<br />

dcl texcoord0 o1.xyzw ; center texcoord<br />

dcl texcoord1 o2.xyzw ; 1 of 8 dual 2D Coords<br />

dcl texcoord2 o3.xyzw ; 2 of 8 dual 2D Coords<br />

dcl texcoord3 o4.xyzw ; 3 of 8 dual 2D Coords<br />

dcl texcoord4 o5.xyzw ; 4 of 8 dual 2D Coords<br />

dcl texcoord5 o6.xyzw ; 5 of 8 dual 2D Coords<br />

dcl texcoord6 o7.xyzw ; 6 of 8 dual 2D Coords<br />

dcl texcoord7 o8.xyzw ; 7 of 8 dual 2D Coords<br />

dcl texcoord8 o9.xyzw ; 8 of 8 dual 2D Coords<br />

; VERTEX POSITION<br />

; -------------------------
mov r0, v0

mov r0.w, c21.w<br />

mov o0, r0<br />

; MODEL TEXTURE COORDINATES<br />

; -------------------------<br />

; Copy base XY into both sections to generate two 2D coords per vector<br />

mov o1.xy, v4 ; Center position<br />

mov r1, v4.xyxy ; Copy Base into both sections<br />

; c10-c17 contain the delta for each neighbor position - set in code<br />

add r0, r1, c10<br />

mov o2.xyzw, r0 ; 1 out of 8 dual 2D Coords<br />

add r0, r1, c11<br />

mov o3.xyzw, r0 ; 2 out of 8 dual 2D Coords<br />

add r0, r1, c12<br />

mov o4.xyzw, r0 ; 3 out of 8 dual 2D Coords<br />

add r0, r1, c13<br />

mov o5.xyzw, r0 ; 4 out of 8 dual 2D Coords


add r0, r1, c14<br />

mov o6.xyzw, r0 ; 5 out of 8 dual 2D Coords<br />

add r0, r1, c15<br />

mov o7.xyzw, r0 ; 6 out of 8 dual 2D Coords<br />

add r0, r1, c16<br />

mov o8.xyzw, r0 ; 7 out of 8 dual 2D Coords<br />

add r0, r1, c17<br />

mov o9.xyzw, r0 ; 8 out of 8 dual 2D Coords<br />

After generating these texture coordinates, we can use them efficiently in our<br />

pixel shader. (Note the following code is written so it is easy to compare with the<br />

pseudocode — it is not performance optimized.)<br />

ps 3 0<br />

; Samplers<br />

dcl 2d s0 ;Input Textures MRT0 (x,y)<br />

dcl 2d s1 ;Input Textures MRT1 (Z,Nx)<br />

; Input registers<br />

dcl texcoord0 v0.xyzw ;Base Pos<br />

dcl texcoord1 v1.xyzw ;Neighbor Dual Coord Set 1/8<br />

dcl texcoord2 v2.xyzw ;Neighbor Dual Coord Set 2/8<br />

dcl texcoord3 v3.xyzw ;Neighbor Dual Coord Set 3/8<br />

dcl texcoord4 v4.xyzw ;Neighbor Dual Coord Set 4/8<br />

dcl texcoord5 v5.xyzw ;Neighbor Dual Coord Set 5/8<br />

dcl texcoord6 v6.xyzw ;Neighbor Dual Coord Set 6/8<br />

dcl texcoord7 v7.xyzw ;Neighbor Dual Coord Set 7/8<br />

dcl texcoord8 v8.xyzw ;Neighbor Dual Coord Set 8/8<br />

; Constants<br />

; c1..4 Set in code to weight to translate force to translation<br />

; c7..10 Set in code to default spring length constants<br />

defi i0, 4, 1, 1, 0 ;Used for loop<br />

; Init Movement vector<br />

mov r0, c0 ;Init movement vector with gravity<br />

; Sample Main Position<br />

texld r1, v0.xy, s0 ;Main Pos (x,y)<br />

texld r2, v0.xy, s1 ;Main Pos (z,Nx)<br />

mov r1.z, r2.x<br />

mov r1.w, r2.y ;Main Pos (x,y,z,Nx)<br />

; Main processing loop for 16 neighbor nodes split up in 4 cases each 4 nodes<br />

;CaseA+C:Length of "1.0" and "2.0" for undeformed springs (Axis Springs)<br />

;CaseB+D:Length of "1.4" and "2.8" for undeformed springs (Diagonal Springs)<br />

loop aL, i0 ; loop 4 times (aL = 1..4, as defined by i0 above)
;Case A
texld r2, v[aL].xy, s0 ;Sample neighbor (X,Y)
texld r3, v[aL].xy, s1 ;Sample neighbor (Z)

mov r2.z, r3.x ;Neighbor (X,Y,Z) = R2<br />

add r2, r2, -r1 ;Spring Vector = Neighbor - Main<br />

dp3 r3.x, r2, r2 ;Sum of Squares<br />

; Is it an edge pixel? - if we clamp to the same value then don't do maths

if ne r3.x, c0.x<br />

rsq r4.x, r3.x ;RSQ of Sum of Squares = for Normalization<br />

rcp r5.x, r4.x ;RCP of RSQ of Sum of Squares = Length = R5<br />

add r5.x, r5.x, -c7.x ;Create Force scale using default lengths<br />

mul r5.x, r5.x, c7.y ;R5 is Force Scale<br />

mul r2, r2, r4.x ;Normalized Spring Vector<br />

mul r2, r2, r5.x ;r2 is Force Vector<br />

;Convert Force Vector to translation and add it to final movement vector<br />

mad r0, r2, c1, r0<br />

endif<br />

;Case C<br />

texld r2, v[aL].zw, s0 ;Sample neighbor (X,Y)<br />

texld r3, v[aL].zw, s1 ;Sample neighbor (Z)<br />

mov r2.z, r3.x ;Neighbor (X,Y,Z) = R2<br />

add r2, r2, -r1 ;Spring Vector = Neighbor - Main<br />

dp3 r3.x, r2, r2 ;Sum of Squares<br />

; Is it an edge pixel ?-ifweclamp to the same value then don't do maths<br />

if ne r3.x, c0.x<br />

rsq r4.x, r3.x ;RSQ of Sum of Squares = for Normalization<br />

rcp r5.x, r4.x ;RCP of RSQ of Sum of Squares = Length = R5<br />

add r5.x, r5.x, -c9.x ;Create Force scale using default lengths<br />

mul r5.x, r5.x, c9.y ;R5 is Force Scale<br />

mul r2, r2, r4.x ;Normalized Spring Vector<br />

mul r2, r2, r5.x ;r2 is Force Vector<br />

;Convert Force Vector to translation and add it to final movement vector<br />

mad r0, r2, c3, r0<br />

endif<br />

;Case B<br />

texld r2, v[aL+4].xy, s0 ;Sample neighbor (X,Y)<br />

texld r3, v[aL+4].xy, s1 ;Sample neighbor (Z)<br />

mov r2.z, r3.x ;Neighbor (X,Y,Z) = R2<br />

add r2, r2, -r1 ;Spring Vector = Neighbor - Main<br />

dp3 r3.x, r2, r2 ;Sum of Squares<br />

; Is it an edge pixel ?-ifweclamp to the same value then don't do maths<br />

if ne r3.x, c0.x<br />

rsq r4.x, r3.x ;RSQ of Sum of Squares = for Normalization<br />

rcp r5.x, r4.x ;RCP of RSQ of Sum of Squares = Length = R5<br />

add r5.x, r5.x, -c8.x ;Create Force scale using default lengths<br />

mul r5.x, r5.x, c8.y ;R5 is Force Scale<br />

mul r2, r2, r4.x ;Normalized Spring Vector<br />

mul r2, r2, r5.x ;r2 is Force Vector<br />

;Convert Force Vector to translation and add it to final movement vector<br />

mad r0, r2, c2, r0<br />

endif<br />

;Case D


texld r2, v[aL+4].zw, s0 ;Sample neighbor (X,Y)<br />

texld r3, v[aL+4].zw, s1 ;Sample neighbor (Z)<br />

mov r2.z, r3.x ;Neighbor (X,Y,Z) = R2<br />

add r2, r2, -r1 ;Spring Vector = Neighbor - Main<br />

dp3 r3.x, r2, r2 ;Sum of Squares<br />

; Is it an edge pixel ?-ifweclamp to the same value then don't do maths<br />

if ne r3.x, c0.x<br />

rsq r4.x, r3.x ;RSQ of Sum of Squares = for Normalization<br />

rcp r5.x, r4.x ;RCP of RSQ of Sum of Squares = Length = R5<br />

add r5.x, r5.x, -c10.x ;Create Force scale using default lengths<br />

mul r5.x, r5.x, c10.y ;R5 is Force Scale<br />

mul r2, r2, r4.x ;Normalized Spring Vector<br />

mul r2, r2, r5.x ;r2 is Force Vector<br />

;Convert Force Vector to translation and add it to final movement vector<br />

mad r0, r2, c4, r0<br />

endif<br />

endloop<br />

;Write Out Final Values<br />

add r2, r1, r0<br />

mov r3, r2.z<br />

mov oC0, r2 ; (X, Y)<br />

mov oC1, r3 ; (Z, X)<br />

The pixel shader code contains three large sections. The first section handles the initial setup, such as initializing the movement with a fixed gravity factor and reading the main node position. The second section is the main processing loop, which contains four subsections. These subsections correspond to different spring groups, as described in our model (see Figure 1). The code within each subsection calculates the force created by the spring between the central node and its neighbors, based on the distance between the nodes and the original undeformed spring length. This last element is a constant, which is different for nodes along the diagonal (relative lengths of √2 and 2√2, roughly 1.4 and 2.8) and nodes along the axis (relative lengths of 1.0f and 2.0f); this is the main difference between the subsections. The final and third section adds the movement vector to the original node position and writes the result out to the render targets. This shader can be adapted to use more or fewer neighbor positions; for details, check the shaders included with the demo application, which support 4, 8, 12, 16, 20, and 24 neighbor nodes.
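The shaders above leave their constants to the application: the vertex shader expects the eight neighbor texel offsets in c10-c17, while the pixel shader expects gravity in c0, the force-to-translation weights in c1-c4, and the rest-length constants for the four spring groups in c7-c10. The following is only a rough sketch of what that setup might look like on the application side; the function name, the numeric weights, and the gravity value are illustrative placeholders rather than the demo's actual values.

// Sketch: feeding the cloth shaders their constants (Direct3D 9, C++).
// Register layout follows the shader comments above; numeric values are placeholders.
#include <d3d9.h>

void SetClothConstants(IDirect3DDevice9* pDevice, float texelW, float texelH)
{
    // Vertex shader c10..c17: offsets to the 8 neighbor texels, packed as two
    // 2D coordinates per constant (xy = first ring, zw = second ring).
    float neighborDeltas[8][4] =
    {
        { +texelW, 0, +2*texelW, 0 },                  // right        / right, 2nd ring
        { -texelW, 0, -2*texelW, 0 },                  // left         / left,  2nd ring
        { 0, +texelH, 0, +2*texelH },                  // up           / up,    2nd ring
        { 0, -texelH, 0, -2*texelH },                  // down         / down,  2nd ring
        { +texelW, +texelH, +2*texelW, +2*texelH },    // diagonals    / diagonals, 2nd ring
        { -texelW, +texelH, -2*texelW, +2*texelH },
        { +texelW, -texelH, +2*texelW, -2*texelH },
        { -texelW, -texelH, -2*texelW, -2*texelH },
    };
    pDevice->SetVertexShaderConstantF(10, &neighborDeltas[0][0], 8);

    // Pixel shader c0: gravity (c0.x doubles as the "edge pixel" sentinel in the
    // if_ne test), c1..c4: force-to-translation weights, c7..c10: rest lengths.
    float gravity[4]     = { 0.0f, 0.0f, -0.001f, 0.0f };          // placeholder magnitude
    float weights[4][4]  = { {0.3f, 0.3f, 0.3f, 0}, {0.3f, 0.3f, 0.3f, 0},
                             {0.15f,0.15f,0.15f,0}, {0.15f,0.15f,0.15f,0} };
    float restLen[4][4]  = { {1.0f*texelW, 0.5f, 0, 0},            // c7:  Case A, axis
                             {1.4f*texelW, 0.5f, 0, 0},            // c8:  Case B, diagonal
                             {2.0f*texelW, 0.5f, 0, 0},            // c9:  Case C, axis, 2nd ring
                             {2.8f*texelW, 0.5f, 0, 0} };          // c10: Case D, diagonal, 2nd ring
    pDevice->SetPixelShaderConstantF(0, gravity, 1);
    pDevice->SetPixelShaderConstantF(1, &weights[0][0], 4);
    pDevice->SetPixelShaderConstantF(7, &restLen[0][0], 4);
}

The 1.0/1.4/2.0/2.8 ratios in the rest lengths mirror the axis and diagonal spring lengths quoted in the shader comments.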

Performance Considerations

The cloth shader code contains a loop, but while a loop makes the code easy to understand and read, it might not be optimal for hardware execution. If the hardware supports enough instructions, it might be better to unroll this loop, since by unrolling the loop no cycles would be wasted on actually executing the loop instructions (i.e., the compare and jump operations). However, in most cases a developer should not have to worry about this, since the driver's compiler should automatically handle it according to the capabilities of the host 3D device.

To handle cloth border cases correctly, where there are fewer neighbor nodes to consider, the shader contains a conditional dynamic branch (if_ne). By using branching, it is possible to jump over some instructions that do not need to be executed. For example, in the above shader the branch stops seven instructions from being executed in some cases; however, this comes at the overhead of executing the conditional branching instruction itself in all cases. Depending on the cost of the branching instruction (which is hardware dependent), it might be better to implement a different (cheaper or faster) mechanism to handle the border cases correctly, such as a cmp or setp instruction.

At first glance, this shader might look very complex, and one might expect poor or non-real-time performance; however, it is important to understand that this shader is only executed on a very small set of pixels. A 64x64 grid is equivalent to rendering a 64x64 pixel texture and results in a network with 4,096 vertices. A render target of 64x64 pixels (or even 128x128, which results in a network with 16,384 vertices) is negligible compared to a default 1024x768 screen resolution. So even though the shader is complex, it is only applied to a very small number of pixels, and hence real-time performance is still achieved.

Constraint Phase

During the constraint phase, we check all the new node positions and verify whether they have collided with an object. If they have, the node has to be moved so it sits on top of the object. As described before, this is implemented with a depth compare against the depth map that we created during a previous phase. All we need to do is use the (x, y) position of the node as a texture coordinate to do a dependent texture read into the depth map. We can then compare the node's current z position with the depth value stored for that column in the unit cube, and if the new depth value is smaller (i.e., closer to the floor) than the value of the depth map, we replace the node's z value with the depth map's value. To avoid instability, we add a safety margin to this compare so that we only constrain nodes that have a depth value within a certain range of the stored depth value. This way, if the tip of the tablecloth moves under the table, it is not suddenly jerked to the top of the table. This can be achieved using the following pixel shader code:

ps_3_0

; Declare inputs
dcl_2d s0                ; X, Y
dcl_2d s1                ; Z, Nx
dcl_2d s2                ; Ny, Nz
dcl_2d s3                ; Depth Map
dcl_texcoord0 v0.rg      ; Base Tex Coord
def c0, 0.05, 0.0, 0.0, 0.0   ; Controls range of the constraint

texld r0, v0, s0         ; Fetch (X,Y) of Node
texld r1, r0, s3         ; Read Depth Map at (X,Y) = Z Constraint
texld r2, v0, s1         ; Fetch (Z,Nx) of Node
add r4.r, r1.r, -r2.r    ; Subtract Cloth and Constraint Z
if_gt r4.r, c0.x         ; Compare with range
mov r1.x, r2.r           ; Keep cloth (e.g., cloth tip under table)
else
max r4, r1.r, r2.r       ; Constrain cloth to largest value
mov r1.x, r4.x           ; Update the output
endif
mov oC0, r0              ; Output (X, Y)
mov oC1, r1              ; Output (Z, Nx)

Different kinds of constraints can be introduced in this phase. We could have implemented a mathematical constraint, or we could simply use this shader to lock certain vertices in place (e.g., cloth hanging from two hooks, elastic cloth in a frame, etc.). The possibilities are endless and easy to implement.

Normal Map Phase

This phase calculates a normal for each node based on the neighboring nodes' information. This concept alone is probably worth a complete article; the current implementation creates two vectors (using a cross shape) from the four neighboring nodes and calculates the cross product to generate the normal. This is a very basic implementation; while more advanced solutions are possible, which would probably result in better image quality, they also come with increased sampling and processing costs. The sampling positions are set up in the vertex shader (similar to that illustrated in the cloth phase section) and processed as follows by the pixel shader:

ps_3_0

; Samplers
dcl_2d s0                ; MRT0 (X, Y)
dcl_2d s1                ; MRT1 (Z, Nx)

; Inputs
dcl_texcoord0 v0.xy      ; Main Node Sample Coord
dcl_texcoord1 v1.xy      ; Right Node Sample Coord
dcl_texcoord2 v2.xy      ; Top Node Sample Coord
dcl_texcoord3 v3.xy      ; Left Node Sample Coord
dcl_texcoord4 v4.xy      ; Bottom Node Sample Coord

texld r0 , v0, s0        ; Center Node (X,Y)
texld r11, v0, s1        ; Center Node (Z)
mov oC0, r0              ; Output (X,Y) to MRT0

texld r0, v1, s0         ; Right Node (X,Y)
texld r1, v1, s1         ; Right Node (Z)
mov r3.xy, r0
mov r3.z, r1.x           ; Right Node (X,Y,Z)

texld r0, v2, s0         ; Top Node (X,Y)
texld r1, v2, s1         ; Top Node (Z)
mov r4.xy, r0
mov r4.z, r1.x           ; Top Node (X,Y,Z)

texld r9 , v3, s0        ; Left Node (X,Y)
texld r10, v3, s1        ; Left Node (Z)
mov r5.xy, r9
mov r5.z, r10.x          ; Left Node (X,Y,Z)

texld r9 , v4, s0        ; Bottom Node (X,Y)
texld r10, v4, s1        ; Bottom Node (Z)
mov r6.xy, r9
mov r6.z, r10.x          ; Bottom Node (X,Y,Z)

; Create vectors for cross product
add r0.xyz, r3.xyz, -r5.xyz
add r1.xyz, r4.xyz, -r6.xyz

; Cross product and normalization
crs r7.xyz, r0, r1
nrm r0, r7.xyz           ; Vertex Normal

; Output results to MRT1 and MRT2
mov r11.y, r0.x
mov r9, r0.y
mov r9.g, r0.z
mov oC1, r11             ; (Z, Nx)
mov oC2, r9              ; (Ny, Nz)

Display Phase

The final phase is the display phase, which will render our deformed cloth on the screen. To achieve this, we need to read every node's (vertex's) position and normal from the texture, rescale from the unit cube space into world space, transform, and display them on screen. All of this is achieved using the following vertex shader code:

vs_3_0

; Input Registers
dcl_position0 v0
dcl_texcoord0 v4

; Output Registers
dcl_position0 o0.xyzw    ; Final Vertex Position
dcl_color0 o1            ; Diffuse color for lighting
dcl_texcoord0 o2.xy      ; Texture Coordinates

; Samplers
dcl_2d s0                ; (X, Y)
dcl_2d s1                ; (Z, Nx)
dcl_2d s2                ; (Ny, Nz)

def c10, 120.0, 240.0, 100.0, 0.0     ; Scale Factor
def c11, 0.4267, -0.853, 0.298, 0.0   ; LIGHT
def c12, 0.0, 0.0, 0.0, 1.0           ; Init value

; Sample Vertex Textures
texldl r1, v4, s0        ; Read Node (X, Y)
texldl r2, v4, s1        ; Read Node (Z, Nx)
texldl r3, v4, s2        ; Read Node (Ny, Nz)

; Create XYZ in r4
mov r4.xy, r1            ; Grab XY
mov r4.z, r2.x           ; Grab Z

; Create NxNyNz in r5
mov r5.x, r2.y           ; Grab Nx
mov r5.yz, r3.xxy        ; Grab Ny, Nz

; Create Final Node/Vertex Position
mov r6, c12
mad r6.x, r4.x, c10.y, -c10.x   ; Rescale [0 -> 1] => [-120 -> 120]
mul r6.y, r4.z, c10.y           ; Rescale [0 -> 1] => [0 -> 240]
mad r6.z, r4.y, -c10.y, c10.x   ; Rescale [0 -> 1] => [-120 -> 120]
m4x4 r2, r6, c0          ; Transformation (c0 set in code)
dp3 r4, r4, c11          ; Simple Lighting Model
mov o0, r2               ; Output Position
mov o1, r4               ; Output Diffuse Color
mov o2.xy, v4            ; Output Texture Coordinate

The above vertex shader should be easy to understand, as vertex texturing is the only exciting new feature used. Vertex texturing is virtually identical to texture accesses done in the pixel shader. It is, however, essential to understand the impact of vertex texturing on performance. All texture accesses come with high latencies, meaning that the period between fetching a value from a texture and being able to use the result can be quite long. There will be a lot of clock cycles spent moving the data from external memory into the chip (on a cache miss), through the cache, through the texture filtering calculation, and eventually into the vertex shader. For this reason, throughput when using vertex texturing can potentially be quite low; however, it also means that if the shader has instructions that do not rely on the result of the texture fetch, the texture fetch can be "free," since non-dependent instructions can be executed while waiting for the texture data to arrive. On the other hand, if there are no non-dependent instructions, the hardware may stall while waiting for the texture data, and valuable processing power will be lost. Given this potentially high per-vertex cost, it is essential to maximize vertex cache usage (e.g., using D3DX's Mesh Optimize functions).
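One way to get that cache reuse with D3DX is to let the mesh optimizer reorder faces and vertices before the display phase runs. The sketch below assumes the display grid exists as an ID3DXMesh; the function name and the epsilon passed to GenerateAdjacency are placeholders rather than code from the demo.

// Sketch: improve post-transform vertex cache reuse of the display mesh with D3DX.
#include <d3dx9.h>
#include <vector>

HRESULT OptimizeForVertexCache(ID3DXMesh* pMesh)
{
    std::vector<DWORD> adjacency(pMesh->GetNumFaces() * 3);
    HRESULT hr = pMesh->GenerateAdjacency(1e-6f, &adjacency[0]);
    if (FAILED(hr))
        return hr;

    // Reorder faces and vertices so that each vertex (and thus each set of
    // vertex texture fetches) is processed as few times as possible.
    return pMesh->OptimizeInplace(D3DXMESHOPT_VERTEXCACHE | D3DXMESHOPT_ATTRSORT,
                                  &adjacency[0], NULL, NULL, NULL);
}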

The pixel shader used during the display phase applies a simple base texture with diffuse lighting; this is to maintain acceptable performance on the Direct3D reference device given the lack of 3D hardware supporting the 3.0 shader model.


Overview

NOTE: Vertex texturing should never be referred to as displacement mapping, as displacement mapping is only a very small subset of the millions of possibilities that can be brought to life by the ability to read texture data from within the vertex shader. The algorithm and the geometry textures presented here are just one such case: Geometry is stored as a position (and normal) within a texture, and the massive parallel processing power of the pixel shader can be used to modify that geometry using complex physics or simulation models. In this case, a simple physics-based model is implemented, but other interesting possibilities include fluid dynamics, metaballs, and chemical simulations.

Figure 6 shows an overview of the various shaders and buffers as they work together to bring cloth animation to life:

Figure 6: Shader interaction overview

The initialization phase writes the default node positions into the MRTs, and the depth phase writes the results of the depth render to a texture. The main processing loop then executes the cloth phase on the node positions, and the result undergoes the constraint phase. At this point, the cloth phase can start another iteration followed by another constraint phase. After looping through the cloth and constraint phases for a certain number of iterations, the normal map phase creates a new MRT, which contains the position and normal, and these are fed into the display phase, which creates the final on-screen result.

Color Plate 3 illustrates the contents of the position and normal map MRTs as well as the final result in wireframe and solid mode.


Sample Application

A sample application and a movie can be found on the companion CD. Updated versions are available at www.pvrdev.com and www.shaderx2.com.

Conclusion

This article described a method of bringing real-time cloth simulations to life using the high performance and flexibility of pixel and vertex shader 3.0. A simple physics model was introduced together with various methods to apply constraints. This was then translated into a number of advanced shaders making use of advanced new functionality only found in the 3.0 shader model, such as dynamic branching and loops within the pixel shader, and texturing from within the vertex shader.



Collision Shaders

Takashi Imagire

Introduction

It is well known that GPU power is evolving at a rate far exceeding the expectations of Moore's Law for general CPU power growth. However, this does not necessarily mean a simple speedup of the GPU. The GPU processes data at a much quicker speed than the CPU because of the parallel nature of the vertex and pixel pipelines. Rendering is a special process that is easy to parallelize. Although general-purpose calculations cannot always be carried out by the GPU, if processes are well suited to parallelization, they can likely be processed at high speeds using the GPU.

In games, collision detection is one of the most processor-demanding processes. Collision detection is complicated: it tends to be split into many separate calculations for the many different situations that can occur, and it is difficult to write as a single routine. For collision detection between objects, there is a "brute-force" algorithm that is simple but has a high processing load. The geometry of the objects is mapped to a two-dimensional depth texture, and collision detection is performed for each texel of the texture. Since each texel is processed independently, the calculation parallelizes well and can be carried out at high speed by the GPU, reducing the calculation time. This article discusses this method of calculation by the GPU.

Calculation by the GPU not only brings about an improvement given its incredible evolution speed, but it also lessens the load on the CPU, which can therefore assign more time to other processes (e.g., AI). In some game situations the CPU is busy, whereas in others the GPU is. The situation may change quickly depending on the scene. If it is possible to predict which processor carries the higher load, the calculation can be assigned to the other, and the application will attain more efficient processing. (Of course, in order to always be able to do this, the CPU and GPU must be able to perform identical processing. This will probably be difficult. Additionally, this process of changing over to the GPU will only be used for specific scenes.)

As another advantage, if calculating only by the GPU is possible, we do not have to wait for data to be locked in video memory before the CPU accesses it. For example, when analyzing a rendering result by the CPU, we have to wait for the GPU to finish a rendering. Generally, when processing using the CPU and the GPU simultaneously, blocking often occurs since the data cannot be used by the other processor until processing is completed. The performance will improve, since we no longer have to wait for the other processor to be able to access the results.

Visibility Test

In performing the collision detection by the GPU, we first consider a simple case: that of a scene with a wall. Although an object will be rendered when it lies in front of a wall, it will not be rendered when it is placed behind a wall because it has not passed the z test. That is, the front or back relationship between objects can be judged by the number of pixels rendered.

Let's now consider the case where we transpose this wall to the ground and set a camera pointing upward from underneath the wall, which we think of as the earth's surface. When an object is above the surface, the object is not rendered, since it is on the other side of the wall. But if the object is moved below the ground, it is rendered. Since the rendering of the object takes place after the rendering of the surface, it can be deduced that the object collided with the surface.

Figure 1: Looking at the object from under the ground and its rendered images

We will now consider a concrete implementation. In order to detect whether the rendering was carried out, it is easiest to use an asynchronous notification mechanism introduced in DirectX 9. If an asynchronous notification is used, the number of rendered pixels can simply be measured. When using asynchronous notification, the object needed by the application side is a pointer to an IDirect3DQuery9 object.

IDirect3DQuery9* m_pQuery;


The initialization of the IDirect3DQuery9 object is performed by IDirect3DDevice9::CreateQuery. In order to count the number of pixels rendered, D3DQUERYTYPE_OCCLUSION is specified as the argument of IDirect3DDevice9::CreateQuery.

m_pd3dDevice->CreateQuery(D3DQUERYTYPE_OCCLUSION, &m_pQuery);

m_pQuery is used twice, before and after rendering the object. As opposed to normal rendering, the sub-surface camera must be prepared when preparing to render for collision detection. In order to prepare this camera, it is necessary to set the viewpoint on the bottom of the ground and make the observed point directly above it. Since the direction of the camera's target is along the Y-axis, the up direction of the camera must be set along the Z-axis so that it is not parallel to the direction of the camera's view.

D3DXVECTOR3 vEye      = D3DXVECTOR3(0.0f, -1.0f, 0.0f);
D3DXVECTOR3 vLookatPt = D3DXVECTOR3(0.0f,  0.0f, 0.0f);
D3DXVECTOR3 vUp       = D3DXVECTOR3(0.0f,  0.0f, 1.0f);
D3DXMatrixLookAtLH(&mView, &vEye, &vLookatPt, &vUp);
m_pd3dDevice->SetTransform(D3DTS_VIEW, &mView);

//                        width  height  min z   max z
D3DXMatrixOrthoLH(&mProj, 10.0f, 10.0f, -10.0f, 10.0f);
m_pd3dDevice->SetTransform(D3DTS_PROJECTION, &mProj);

In the rendering loop of each frame, the z-buffer that records the geometry of the ground is generated by rendering the ground first. Since the camera looks at the underside of the ground, it is extremely important that the culling mode be set to none or reverse; otherwise, the ground will not be rendered. Moreover, after the ground has been rendered, the original culling mode must be restored.

m_pd3dDevice->SetRenderState(D3DRS_CULLMODE, D3DCULL_CW);
// ... render the ground ...
m_pd3dDevice->SetRenderState(D3DRS_CULLMODE, D3DCULL_CCW);

Next, the asynchronous notification that measures the number of written pixels is started, and the rendering of the object to be tested for collision is carried out. After finishing the rendering, we must stop counting the rendered pixels.

m_pQuery->Issue(D3DISSUE_BEGIN);
// ... render the object ...
m_pQuery->Issue(D3DISSUE_END);

The number of rendered pixels is counted by calling m_pQuery->Issue with the D3DISSUE_BEGIN argument before rendering is performed and passing the D3DISSUE_END parameter after rendering is complete.

Once the rendering is completed, the result of the asynchronous notification can be received. The number of rendered pixels is determined by calling IDirect3DQuery9::GetData. Its arguments are a pointer to a DWORD variable that receives the result, the size of that variable (sizeof(DWORD)), and a flag (D3DGETDATA_FLUSH in the code below). If the function is successful, S_OK is returned as the result. If the rendering is not yet completed, an error is returned.


DWORD pixels;
while(S_OK != (hr = m_pQuery->GetData(&pixels, sizeof(DWORD), D3DGETDATA_FLUSH))){
    if(D3DERR_DEVICELOST == hr) break;
}

if(1000 < pixels){
    // Threshold check (reconstructed; only "if(1000" survives in the source text):
    // enough pixels were rendered, so the object is treated as having collided.
}


Collision Map

When performing collision detection, after a collision is detected, we want to find out the area with which the object collided. Although we can find out by asynchronous notification where an object currently is when objects penetrate, we do not know the point of collision. Moreover, a collision cannot be detected when an object moves so quickly that it jumps across the area between the ground and the camera in a single frame. We will now explore more detailed collision detection by examining the path that the object moved along.

The "path volume" is introduced to detect any collisions with the moving object. This is an object similar to the well-known "shadow volume." Just as the shadow volume is a mesh that includes the object that casts a shadow as well as that object extruded in the direction away from the light, the path volume is a mesh that encloses both the object at its present position and the object at its past position.

Since the path volume is determined by the same method used to create shadow volumes cast by parallel light sources, many different methods exist for generating it [Brennan]. For example, a second mesh can be prepared in which degenerate polygons are embedded along all the edges of the original mesh. These degenerate quadrangles use the two vertices at either end of an edge, each specified twice, and the normal vectors of the two faces that share the edge are assigned to the two overlapping vertices, respectively. The path volume is created dynamically at rendering time. The normal vector of every vertex is compared with the velocity of the object (specifically, the dot product of the two vectors is calculated, and its sign determines whether the vertex goes to the past or the present position). A vertex of the prepared mesh is drawn at the present position when its normal points in the direction of movement; otherwise, it is drawn at the past position. For the portions of the mesh where the dot product of the normal and velocity vectors changes sign, the edge is filled with a degenerate quadrangle.

Figure 2: Meshes shown at the present position and at the last position

Figure 3: Path volume
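The per-vertex choice between the present and past position comes down to the sign of the dot product between the vertex normal and the object's velocity. In the demo this test belongs in the vertex shader that extrudes the path volume; the following is merely a CPU-side sketch of the same test, with made-up names.

#include <d3dx9math.h>

// Decide which end of the path volume a vertex belongs to.
// Returns true if the vertex should be emitted at the present position,
// false if it should stay at the past position.
bool UsePresentPosition(const D3DXVECTOR3& vertexNormal,
                        const D3DXVECTOR3& objectVelocity)
{
    // A positive dot product means the vertex faces the direction of movement.
    return D3DXVec3Dot(&vertexNormal, &objectVelocity) > 0.0f;
}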


There is a simpler method that uses the original mesh as is, without introducing additional degenerate polygons. For each vertex, the normal vector of the original mesh is compared with the direction of movement using the dot product, and the vertex is rendered at the present position when the dot product is greater than 0. The past position is used when the directions differ by more than 90 degrees (dot product < 0). Although this method processes only half the vertex data compared with the first method, the result is not exact. With this method, the edges where the sign of the dot product between the movement direction and the normal changes are not extruded; instead, the faces whose vertex normals produce both positive and negative dot products with the movement direction are stretched. Since the original mesh changes shape through this stretching, the generated path volume will be smaller than the "correct" one, although it will still be a subset of it. (See Figure 4.) Therefore, it can be used only as a simple approximation. However, when actually used in a game, this method of determining the path volume is not a bad idea, as it makes the processing load lighter.

Figure 4: Simple but incomplete path volume

Although the path volume that connects the present position and the past position was introduced here as a linear path, it can also be created when the movement lies along a curve. When the movement of an object is as complicated as the parabolic movement of free fall, a path volume bounded by the curved surface surrounding the swept volume can be used. (In practice, such a curved surface would be finely subdivided using a tessellator unit or similar.)

Next, we explain how to determine the area of collision using a path volume. This is very similar to the calculation of a shadowed area using a shadow volume. The render target for collision detection will be referred to as a "collision map" here. First, it is initialized by clearing the collision map to black. (Although any color is sufficient for collision detection, black is convenient for special effects.) Next, a camera is placed on the bottom of the ground and turned upward, and the rendering of the earth's surface is carried out. It is not necessary to write anything to a color component at this time; the purpose is just to write the depth values of the ground into the z-buffer. The next step is rendering the path volume, and this is carried out twice. In the first pass, only the polygons of the path volume that face the camera are drawn, in white (in fact, any color is sufficient as long as it is different from the color of the ground). In the second pass, only the polygons that face away from the camera are drawn, in the same black color as the ground. In both passes of the path volume rendering, it is necessary to set the render states so that the z-buffer is not written but the color components are (the z test must still be performed). Consequently, the area that was drawn by the first pass but not by the second (because it failed the z test) remains white in the collision map. This white area is exactly the region where the ground and the object made contact. (In Figure 5, for clarity, the ground is seen from the side, whereas when actually performing the test, the camera sits just under the ground and a collision map is created that corresponds one-to-one with the ground plane.)

Figure 5: Rendering the front surface and the back surface and taking the difference between two images
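The render states for the two path-volume passes follow directly from the description above: depth test on, depth writes off, and the culling mode flipped between the passes. The sketch below shows one possible setup; it assumes clockwise front faces and a device pointer named pDevice, and the draw calls themselves are only indicated by comments.

// Sketch: render states for the two-pass path volume rendering into the collision map.
#include <d3d9.h>

void DrawPathVolumeToCollisionMap(IDirect3DDevice9* pDevice)
{
    // Keep the ground's depth values, but do not overwrite them.
    pDevice->SetRenderState(D3DRS_ZENABLE, D3DZB_TRUE);
    pDevice->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);

    // Pass 1: polygons facing the camera, drawn in white.
    pDevice->SetRenderState(D3DRS_CULLMODE, D3DCULL_CCW);   // cull back faces
    // ... set the output color to white and draw the path volume ...

    // Pass 2: polygons facing away from the camera, drawn in the ground color (black).
    pDevice->SetRenderState(D3DRS_CULLMODE, D3DCULL_CW);    // cull front faces
    // ... set the output color to black and draw the path volume again ...

    // Texels that stay white passed the z test in pass 1 but failed it in pass 2,
    // i.e., they lie in the contact area.
    pDevice->SetRenderState(D3DRS_ZWRITEENABLE, TRUE);
}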

When using a collision map along with a path volume to determine whether an object has collided, we need to handle the use of the asynchronous notification a bit differently. Because the path volume will have rendered pixels without a collision necessarily occurring, we need to count both the number of rendered pixels of the front-facing polygons and the number of rendered pixels of the back-facing polygons and take the difference between the two numbers.

This method of creating a collision map returns the right result only when rendering the path volume of a convex object. In the case of a concave object, if an indented portion is seen side-on when drawing the front of the path volume, some pixels are drawn to multiple times, and the actual area of collision can be overwritten by a front polygon. When dealing with complicated objects that are not convex, it is necessary to find the difference between the render targets, and a stencil buffer is often used for this more exact method. First, the stencil buffer is filled with 0. When drawing the front surface of the path volume, the stencil value is incremented; when drawing the back, it is decremented. The area where the back was not drawn because of a z test failure but the front surface was is the area where the final value in the stencil buffer is not 0. (On some GPUs supporting the features of DirectX 9, such as the Radeon 9700 Pro, the rendering of the path volume can be completed in one drawing pass using the two-sided stencil feature.) Since two or more objects can be processed one after another when a stencil buffer is used (without clearing the stencil buffer between them), there are many merits to using a stencil buffer. The only problem is that it is difficult to use asynchronous notification with the two-sided stencil approach.
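Translated into render states, the stencil variant increments on front faces and decrements on back faces while the z test is still applied. The sketch below shows one possible setup using two separate passes rather than the two-sided stencil path; pDevice and the draw calls are placeholders.

// Sketch: stencil-based path volume rendering for non-convex objects.
#include <d3d9.h>

void DrawPathVolumeWithStencil(IDirect3DDevice9* pDevice)
{
    pDevice->Clear(0, NULL, D3DCLEAR_STENCIL, 0, 1.0f, 0);       // stencil = 0

    pDevice->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);
    pDevice->SetRenderState(D3DRS_STENCILENABLE, TRUE);
    pDevice->SetRenderState(D3DRS_STENCILFUNC, D3DCMP_ALWAYS);
    pDevice->SetRenderState(D3DRS_STENCILFAIL, D3DSTENCILOP_KEEP);
    pDevice->SetRenderState(D3DRS_STENCILZFAIL, D3DSTENCILOP_KEEP);

    // Front faces: increment where the z test passes.
    pDevice->SetRenderState(D3DRS_CULLMODE, D3DCULL_CCW);
    pDevice->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_INCR);
    // ... draw the path volume ...

    // Back faces: decrement where the z test passes.
    pDevice->SetRenderState(D3DRS_CULLMODE, D3DCULL_CW);
    pDevice->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_DECR);
    // ... draw the path volume again ...

    // Texels with a non-zero stencil value lie inside the collision area.
    pDevice->SetRenderState(D3DRS_STENCILENABLE, FALSE);
    pDevice->SetRenderState(D3DRS_ZWRITEENABLE, TRUE);
}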

The created collision map can be used for special effects by the application. In Figure 6, a special effect that puts "flares" in the collision area is demonstrated.

Figure 6: Using a collision map, we draw "flares" in the areas where bullets hit the ground.

The collision map records an instantaneous collision and is updated each frame. Since we want everything to keep burning wherever the ground and the object have collided so far, the created collision map is rendered into another, accumulated map using additive composition. Initially, the accumulated map is completely black. If collision maps are added to it one after another, it will eventually become pure white; that is, every place on the ground will blaze up.

Although making this accumulated collision map flare up is a problem completely different from collision detection, it is still an important visual problem. The method involves post-processing, which applies an effect in screen space. First, the accumulated collision map is transformed to coincide with the ground and then rendered to a screen-aligned texture. This makes the "burning" areas of the screen become white. Next, we combine this texture via multiplicative composition with a random animated texture that we call "the seed of fire"; this random animation gives the appearance of burning bits of wood. However, with simple multiplication, the areas where the bullets hit the ground merely flicker on and off. In order to express the way the flame moves upward, a technique using an "afterimage" is used [James]. To make a flame, two screen-aligned textures are prepared for accumulation. We take the accumulated texture from one frame ago, reduce the intensity of its color, shift it upward a little in screen space, and render it into the other accumulated texture. The current burning texture is then drawn on top using additive composition. Changing the amount of color reduction affects the size of the flames: if the color is reduced only a little, the flame fades slowly and rises high; conversely, if the color is reduced greatly, it quickly fades to 0 and the flame hardly rises. If the resulting texture is finally drawn over the whole screen with additive composition, the ground blazes up red. This technique is two-dimensional and has the fault that it does not account for areas where the flame should be obscured, such as those beyond a mountain. Furthermore, to add realism, yet another texture consisting of a blurred version of the accumulated collision map is created; using this, the ground is darkened to represent the scorched areas. When rendering this texture on the ground, after blurring it in two dimensions, we transform it so that it coincides with the ground, as we did for the burning texture, and darken the area using subtractive composition with black.

Figure 7: Rendering steps

Reflection by the Interaction

If the area of collision between two objects is known, it is natural to want to calculate the interaction between them. Here, as one example, we consider an interaction in which a bullet rebounds from the ground. If the incident velocity vector toward the ground is vin and the normal vector of the ground is n, the velocity vector vout after rebounding from the ground is vout = vin - 2*(n · vin)n. Therefore, an object can be reflected if the normal vector of the ground and the velocity before reflection are known.

Figure 8: Vectors for velocity reflection
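Written out with the D3DX math helpers, the reflection is a one-liner; the function name here is made up, and the ground normal is assumed to be unit length.

#include <d3dx9math.h>

// Reflect an incident velocity about a unit-length ground normal:
// vout = vin - 2 * (n . vin) * n
D3DXVECTOR3 ReflectVelocity(const D3DXVECTOR3& vIn, const D3DXVECTOR3& n)
{
    return vIn - 2.0f * D3DXVec3Dot(&n, &vIn) * n;
}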


Here, the problem is to determine the normal vector at the place where the bullet reflected. As long as a collision map is used, the place that collided can be determined only as an area rather than a single point. To find the normal, it is necessary to choose one collision point by averaging the pixels of the area (i.e., find the centroid of the collision area). The alternative, averaging the normal vectors over the collision area, can lead to anomalies. Consider the case where a bullet collides with the center of a sharp mountain peak: the averaged normal at the hit location may come out level, but the summit of a mountain is not necessarily level. Thus, where the normal vector changes sharply from place to place, averaging the normals over the collision area may produce a normal vector that does not exist anywhere on the original ground. Therefore, it is more reliable to choose specific coordinates and calculate the normal vector at that point.

In order to calculate the centroid, it is necessary to prepare two textures beforehand. One texture, called the coordinate map, is based on the image of the ground as seen from the bottom. Here, the texture coordinate values are output to the red and green color components of the texels, creating a linear ramp. Moreover, in the blue component, the value 0.0f is written in the areas where the ground does not exist and 1.0f where it does. The other texture is the normal map, which stores the ground normals indexed by these texture coordinates. By using the texture coordinates obtained from the coordinate map, it becomes possible to look up the value of a normal vector directly in the normal map texture.

In the preceding example, since only the normal vector was used in the collision response calculations, only the normal map was prepared. Sometimes we may also want to use the position on the ground model, such as when moving a bullet up to the ground surface so that it does not sink into the ground on contact. In this case, it is necessary to prepare another texture, indexed by the same texture coordinates, that contains the height (the geometry) of the ground.

When a model without a texture is used (in other words, the model has no texture coordinates), a suitable coordinate system, such as world coordinates, should be chosen to index the normal map and height map of the ground; the normal vector at the point of collision can then be derived without texture coordinates.

Figure 9: A texture coordinate map looking from below and a normal map


At each rendering cycle, the texture coordinates of the center of the collision area are calculated using the collision map and the coordinate map. We compute this efficiently by using hardware bilinear texture filtering, since filtering can be thought of as computing the average of the values contained in a given area. First, we multiply our collision map texture with the coordinate map texture. However, the right result will not be obtained if we simply filter the composite texture consisting of the collision and coordinate maps. Since the domain over which we want the average is the collision area, not the entire texture, we need to divide by the relative extent of the collision area.

We prepare a render target with a small size, similar to creating a mipmap. For this render target, all texels of the original texture that correspond to each destination texel are filtered, in effect "averaging" the values contained in their color components. The blue component of the texture coordinate map is used to derive the extent of the collision area: since 1.0 is written in the blue component wherever the ground exists, the blue component of the "averaged" texture represents the area of collision, where 0.0 would mean no collision area whatsoever and 1.0 would mean that the area of collision covered the entire ground plane.

It is efficient to perform this composition together with the averaging. The final result, the texture coordinates of the centroid of the collision area, equals the xy (red and green) components of the filtered composite texture divided by the z (blue) component. Multiplying the whole texture by a constant does not change this result. Average values over an area can therefore be computed efficiently by placing each sampling point at the center of four texels, so that the four texels are read with one bilinearly filtered sample. Furthermore, the result remains valid even for a 2x2 render target: by sampling the center of the created texture, the average value of the whole texture can be calculated and used as the final result.

Figure 10: Multiplying a texture coordinate map and collision map and subsequent averaging using bilinear filtering

If the normal map is sampled using the texture coordinates determined above, the normal vector of the point of collision is obtained.
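Concretely, if the downsampled composite texel carries the coordinate-weighted collision values in red and green and the collision coverage in blue, the centroid is simply red/blue and green/blue. A small helper making that division explicit is shown below; the struct and function names are illustrative only.

// Recover the centroid of the collision area from the averaged composite texel.
struct AveragedTexel { float r, g, b; };   // filtered (collision map * coordinate map) texel

bool CollisionCentroid(const AveragedTexel& t, float& u, float& v)
{
    if (t.b <= 0.0f)
        return false;          // no collision area this frame
    u = t.r / t.b;             // red  = coordinate-weighted sum, blue = coverage
    v = t.g / t.b;
    return true;               // (u, v) can now be used to sample the normal map
}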

In order to change the direction of an object using a normal vector, we have to pass the derived normal vector to the collision response calculation. For a CPU-based calculation, we can access this data from the GPU by locking texture memory and doing a texture read, or by using asynchronous notifications as described earlier. However, such methods are very slow and cannot be considered practical, and it is hard to imagine that feedback of data from the GPU to the CPU will become fast in the future. Therefore, it is necessary to redesign the application so that these calculations, normally performed by the CPU, can be performed by the GPU.

The most important consideration is the memory that stores the information. In terms of particle calculations, the CPU and the GPU are capable of almost the same thing. However, the memory that each can access directly is different: the CPU reads from main memory, while the GPU reads from video memory. We can use a texture as a means to read and write data in video memory. Thus, in order to move the particle calculation from the CPU to the GPU, the position and velocity of an object are recorded in a "particle map" texture. For example, the position of an object is saved in the top row of the texture and its velocity in the second row. When treating two or more objects, each object's attributes are arranged horizontally and indexed via the x-coordinate. Furthermore, the acceleration that acts on each particle is written in the row under the velocity. Storing the data in a texture this way is well suited to our purposes, since the row under each texel of the texture is the time derivative of the value represented by that texel; similarly, the row above each texel represents the time integral of the values. If we denote position, velocity, and acceleration by x, v, and a, respectively, the operation that composites the texture shifted by one row can be written with the following expressions:

x = x + v
v = v + a

This formula represents the movement of the object over a unit of time, assuming constant acceleration. The processing, in which the texture is shifted and rendered, consists of only a one-pass rendering of a polygon the size of the texture, which can be done very fast.

Figure 11: Particle map

The problem that remains is the calculation of the acceleration. The acceleration value is what allows us to change the movement of an object in the scene. Although it is 0 during uniform straight-line motion, when the object reflects off the ground we need to determine the acceleration that changes the velocity to the reflected vector. The formula to use is 2*(n·v)n, which yields the expected behavior for the reflection. Here, n is the normal vector calculated using the collision map. The HLSL program for deriving the acceleration using a particle map is as follows.

float4 ReflectPS( REFLECT_VS_OUTPUT In ) : COLOR0
{
    float4 acceleration;

    float4 coord    = tex2D( CoordSamp,    In.Tex0 );
    float4 velocity = 2.0f*tex2D( VelocitySamp, In.Tex1 ) - 1.0f;
    float  pixels   = coord.z;
    coord /= pixels;
    float4 normal   = 2.0f*tex2D( NormalSamp, coord ) - 1.0f;

    if(pixels < 0.0001f)                // reconstructed branch: threshold value is illustrative
    {
        acceleration = 0.0f;            // no collision area this frame, so no reflection
    }
    else
    {
        acceleration = 2.0f*dot( normal.xyz, velocity.xyz )*normal;   // 2*(n.v)n
    }

    return acceleration*0.5f + 0.5f;    // pack back into [0,1], mirroring the unpacking above
}


{0, 12, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL,   0},
{0, 24, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0},
{0, 32, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_LOOKUP,  D3DDECLUSAGE_SAMPLE,   0},
D3DDECL_END()
};

The texel coordinates of the position map are contained in this newly added vertex data. For example, if the width of the position map is defined as MAP_WIDTH, the texture coordinates for the i-th object are set to (i/MAP_WIDTH, 0). Since each instance of an object in a scene may refer to a different position in the texture, one solution would be to create additional copies of the mesh that differ only in the position map texture index. However, since in DirectX we can set the source of the displacement-map-related vertex data to be another stream, and only the texture coordinate data differs between the meshes, we can save memory by avoiding redundant vertex data.

At most one texture used for displacement mapping can be referred to in this manner from a vertex shader program; here, it is used to refer to the position coordinate. However, we also need the velocity of the object for generating and extruding the path volume. When using the displacement mapping method, it is not possible to refer to multiple values in the particle map per vertex, so we need to set the initial velocity through the CPU.

At the time this sample program was written, the vs_3_0 standard was not yet supported by existing DirectX hardware. Therefore, only the displacement mapping technique is demonstrated here. In addition, GPUs supporting floating-point textures and displacement mapping in hardware did not exist yet. The displacement mapping technique is a provisional one, and in the future, texture reads in vs_3_0 shaders will be the preferred method.

Conclusion

In this chapter, methods of collision detection and response by the GPU using an asynchronous notification and a collision map were discussed. Both methods involve checking whether the rendering of the object has been carried out and judging from that whether it has collided or not.

Since the sample program using this method is included on the companion CD, I encourage you to play with the source. (These programs have been checked on GeForce FX 5800 Ultra and Radeon 9700 Pro cards.)

One example that can use this method immediately is recording bullet marks in a texture in an FPS. As another example, in a racing game, accurate depths of dents due to collisions could be recorded in a texture using a displacement map.

Currently, for an actual game, the GPU is insufficient for general processing, and the performance of the GPU can be used only for drawing. However, it is expected that using the GPU for purposes other than rendering, such as collision detection, will become possible in the future.


References

[Brennan] Brennan, Chris, "Shadow Volume Extrusion Using a Vertex Shader," Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks, Wolfgang Engel, ed., Wordware Publishing, Inc., 2002, pp. 188-194.

[James] James, Greg, "Operations for Hardware-Accelerated Procedural Texture Animation," Game Programming Gems 2, Charles River Media, Inc., 2001, pp. 497-509.


Displacement Mapping

Tom Forsyth

Principles

Displacement mapping is essentially a method of geometry compression. A low-polygon base mesh is tessellated in some way. The vertices created by this tessellation are then displaced along a vector, usually the normal of the vertex. The distance that they are displaced is looked up in a 2D map called a displacement map.
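In code, the core operation is a single multiply-add per tessellated vertex. The sketch below only illustrates the principle; the names are made up, and the displacement value is assumed to have already been sampled from the 2D map at the vertex's UV.

#include <d3dx9math.h>

// Displace a tessellated vertex along its (unit) normal by the value read
// from the displacement map at that vertex's UV.
D3DXVECTOR3 DisplaceVertex(const D3DXVECTOR3& basePosition,
                           const D3DXVECTOR3& baseNormal,
                           float displacement)       // sampled from the 2D map
{
    return basePosition + baseNormal * displacement;
}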

The main aim of this article is to allow people to take data from the industry's current mesh and texture authoring pipelines and derive displacement map data from them. There will also be some discussion of rendering techniques on past, current, and future hardware.

It is worth mentioning that the problems and restrictions inherent in authoring for displacement maps are the same as those that occur when authoring for normal maps because they are essentially two different representations of the same thing. Generating normal maps has recently come into fashion, and there is plenty of hardware around to support it. If you are going to be generating normal maps, generating and using displacement map data is a relatively simple enhancement to the tool chain and rendering pipeline. As shown later, there is already widespread hardware support for at least some form of displacement mapping, new and faster hardware has been released recently, and there is, no doubt, even more direct support for displacement maps on the way.

Advantages

Using displacement maps reduces the amount of memory required for a given mesh level of detail. Bulky vertex data is replaced by a 2D array of displacements, typically 8 or 16 bits in size, with most attributes such as texture positions, tangent vectors, and animation weights implicit. This reduces storage requirements and the bandwidth needed to send that data to the rendering hardware, both of which are major limits on today's platforms. Alternatively, it allows much higher detail meshes to be stored or rendered in the same amount of memory space or bandwidth.

Reducing the mesh to a far simpler version (typically around a few hundred vertices rather than tens of thousands) means operations such as animation and morphing are cheaper. They can therefore be moved from the GPU back onto the CPU, which is a much more general-purpose processor. Because of this, the range of possible operations is expanded: more complex animations are possible and different techniques can be used, such as multi-target morphing (for facial animation), volume preservation (for bulging muscles), and cloth simulation. One other advantage is that the animation algorithms used are no longer tied to the specific GPU platform or to the lowest common denominator of platforms. Indeed, the animation programmer no longer needs to know the core details of the graphics platform to experiment with and implement new techniques.

A more abstract advantage is that using displacement maps turns meshes, tricky 3D entities with complex connectivity, into a few 2D entities. 2D objects (i.e., textures and images) have been studied extensively, and there are a lot of existing techniques that can now be applied to meshes. For example:

- Mesh simplification and LOD becomes mipmap generation.
- Compression can use frequency-based methods such as Fourier transforms or wavelets.
- Procedural generation of meshes can use existing 2D fractal and image-compositing methods.
- Morphing becomes a matter of blending 2D images together.
- End-user customization involves 2D grayscale images rather than complex meshes.

Using graphics hardware and render-to-texture techniques, many of the above features can be further accelerated.

Disadvantages

Displacement maps place some restrictions on the meshes that can be authored and are not applicable everywhere. Highly angular, smooth, or faceted objects do not have much fine or complex surface detail and are better represented either by standard polygonal mesh data or some sort of curved surface representation, such as the Bezier family of curves or subdivision surfaces.

Highly crinkled or fractal data such as trees or plants are not easy to represent using displacement maps, since there is no good 2D parameterization to use over their surfaces.

Meshes that overlap closely or have folds in them can be a problem, such as collars, cuffs, or layers of material like jackets over shirts, or particularly baggy bits of material. This is because a displacement map can only hold a single height value. Although this is a problem at first, if artists can author or change the mapping of displacement maps, they can map each layer to a different part of the displacement map and duplicate each layer in the low-polygon base mesh. Automated tools are also easy to modify to do this correctly.

Authoring displacement maps almost always requires specialized tools; it is very hard to directly author the sort of maps discussed here (large-scale ones that cover a whole object). However, the amount of work required to write, adapt, or buy these tools is small compared to the benefits. The recommended tools are discussed below.


At first glance, hardware support is slim for displacement mapping. Currently,<br />

only two PC graphics cards support it natively (the Parhelia and members of the<br />

Radeon 9x00 series) and none of the consoles. However, with a bit of thought, displacement<br />

mapping methods can be applied to a much wider range of hardware.<br />

On the PC, anything using any sort of vertex shader can use them, including software<br />

VS pipelines used by many people for animation or bump-mapping on older<br />

cards. On the consoles, the VU units of the PS2 can use displacement maps<br />

directly, and any CPU with a SIMD-style instruction set (such as the<br />

GameCube’s) can efficiently render displacement map data. On the consoles, the<br />

reduction in memory use and memory bandwidth is well worth the extra effort.<br />

Required Source Data<br />

To use displacement mapping in hardware or software, you eventually need the<br />

basic ingredients:<br />

• A low-polygon base mesh
• A “unique” UV texture mapping for the base mesh
• A heightfield displacement map for displacement of vertices
• A normal map for lighting

Typically, displacement maps are lower resolution than normal maps, though they<br />

may demand more precision. Additionally, displacement maps and normal maps<br />

usually share the same mapping, since the same problems must be solved by both<br />

— filtering (especially mipmapping), representation of discontinuities, texel resolution<br />

at appropriate places on a mesh, and assigning each texel a unique position<br />

on the mesh.<br />

How you get these basic ingredients is almost entirely up to the art team and<br />

the available tools. They are available from many sources in many combinations.<br />

For reference, all vertex numbers given are for a human figure that would<br />

normally take around 10,000 vertices to represent with a raw mesh with around<br />

40 bones. Typically, there are twice as many triangles as vertices in a mesh.<br />

Low-Polygon Base Mesh<br />


As a guide, this mesh is around 100 vertices for a human figure, depending on the<br />

quality of animation required and the complexity of the clothing. The artists can<br />

directly author this mesh, or it can be derived from higher-polygon meshes by<br />

using a variety of mesh simplification techniques. These may be completely automatic,<br />

or they may be semiautomatic with visual checks and tweaks by artists.<br />

There are many methods to automatically reduce meshes in complexity.<br />

Those based on half-edge collapses are popular, especially as they can also be<br />

used to directly author progressive mesh sequences, which are useful for rendering<br />

continuous levels of detail on older hardware. Other options include using<br />

Delaunay-style parameterization and remeshing and also voxelizing the mesh and<br />

remeshing from appropriately filtered voxel data.<br />


Unique Texture Mapping<br />

Displacement and normal maps generally require a mapping over the mesh, which<br />

ensures that each texel is used no more than once. Although not strictly necessary<br />

in some specialized cases (for example, when an object has perfect left/right<br />

symmetry), in general the extra flexibility is well worth the effort.<br />

The unique mapping can be authored directly, using a spare mapping channel<br />

in the mesh. Automated generation is possible using the variety of “texture atlas”<br />

methods that exist, including the same Delaunay-style parameterization as the<br />

above remeshing or using the technique in Gu’s “Geometry Images” [1] of a minimal<br />

number of cuts to unfold and flatten a mesh onto a square plane.<br />

There are also existing unique mapping solutions in 3D authoring tools, such<br />

as 3ds max’s “flatten” mapping. However, it is important to note that it is not the<br />

high-polygon mesh that needs the unique mapping but the low-polygon version.<br />

Unique-mapping the high-polygon mesh can work in some cases, but it tends to<br />

introduce a lot of unwanted discontinuities, which hinder many of the polygon-reduction

techniques used to produce the low-polygon base mesh. If the mesh<br />

simplification is a plug-in for the authoring package, that can be performed first,<br />

before unique mapping. Alternatively, the mesh can be exported, simplified by<br />

external tools, and reimported for unique mapping. Although clumsy, this does<br />

have the advantage that the artists can tweak the automated unique mapping —<br />

sometimes a useful ability.<br />

The unique mapping can also be used for lightmap generation or procedural<br />

textures, if required.<br />

Heightfield Displacement Map and Normal Map<br />

Displacement maps can be authored directly using grayscale textures and suitable<br />

art tools. However, 8 bits per pixel is generally not sufficient for a high-precision<br />

displacement map, and few if any art packages handle 16-bit grayscales. Even<br />

when they do, since they are designed for visual use rather than heightfield<br />

authoring, the control over the values is relatively coarse, and it is hard for artists<br />

to achieve anything but an approximation of the correct shape. In practice, this<br />

leads to “caricatures” of the object.<br />

A better choice is to author most or all of the data using a high-polygon mesh.<br />

Using the unique mapping above, each texel on the displacement and normal<br />

maps has a single position on the low-polygon base mesh. A ray is cast from that<br />

position along the interpolated low-polygon normal and the intersection found<br />

with the high-polygon mesh. The normal of the high-polygon mesh is written into<br />

the normal map, and the distance along the ray to the point of intersection is written<br />

into the displacement map. Remember that these distances may be negative<br />

— the ray needs to trace both outward and inward from the low-polygon mesh, as<br />

shown in Figure 1.<br />
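A minimal C++ sketch of this per-texel loop is shown below. The types and helpers (TexelSample, rasterizeBaseMesh, intersectSigned, reportAmbiguousTexel) are hypothetical placeholders standing in for whatever geometry library is already in the tool chain; only the overall flow of writing one signed distance and one high-polygon normal per texel follows the process described above.

// Sketch only: bake displacement and normal maps by casting a ray from each
// texel's position on the low-polygon base mesh along its interpolated normal.
// All types and helper functions here are hypothetical placeholders.
void BakeDisplacementAndNormalMaps(const Mesh& baseMesh, const Mesh& highMesh,
                                   FloatImage& dispMap, Vec3Image& normalMap)
{
    // One sample per texel covered by the base mesh's unique UV mapping.
    for (const TexelSample& s : rasterizeBaseMesh(baseMesh, dispMap.width(), dispMap.height()))
    {
        Ray ray;
        ray.origin    = s.surfacePosition;                 // position on the base mesh
        ray.direction = normalize(s.interpolatedNormal);   // interpolated low-poly normal

        Hit hit;
        if (intersectSigned(highMesh, ray, &hit))          // searches outward and inward
        {
            dispMap.set(s.x, s.y, hit.signedDistance);     // may be negative
            normalMap.set(s.x, s.y, hit.surfaceNormal);    // high-poly normal for lighting
        }
        else
        {
            reportAmbiguousTexel(s.x, s.y);                // flag for artist review, not hidden
        }
    }
}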

When creating the high-polygon mesh, the artists still need to be aware that<br />

they are indirectly authoring a heightfield. Folding or overlaps of geometry will<br />

not be recorded well by a heightfield. In practice, we find it is better to have the<br />

ray-caster report difficult or ambiguous intersection cases and have the artists fix


the mesh (either the high- or low-polygon ones as appropriate) than to attempt to<br />

make the ray-caster very intelligent. These tricky cases are rare, and this method<br />

highlights them rather than trying to hide them, reducing unwanted surprises.<br />

Normal maps (either object-space or surface-local space) are almost impossible<br />

to author directly but are easily generated from displacement maps or bump<br />

maps. Although a bump map is actually a heightfield and essentially the same thing as a displacement map, bump maps are routinely generated by hand, since absolute scale is far less important when generating normal maps than when displacing positions.

High-frequency displacement and normal maps are fairly easy to author by<br />

using bump maps. These are used to provide texture to a surface or add small<br />

ridges or creases, such as those between panels of a car body. These are often<br />

applied to medium-polygon meshes to add fine details, rather than to the<br />

low-polygon mesh that is used in displacement mapping. It is easy to apply them<br />

to existing or generated displacement and normal maps, as long as there is<br />

already a unique texture mapping. The high frequency implies small displacements,<br />

so the lack of a well-controlled scale for those displacements is not as<br />

much of a problem. Having a crease in clothing twice as large as desired is not a<br />

major problem, unlike having a character’s nose twice as long as it should be.<br />

Note that the mapping of these high-frequency maps is kept flexible on the artist’s<br />

end. They do not need to be uniquely mapped, and it is perfectly acceptable to tile<br />

a small bump map over a larger surface to provide noise and detail. They will be<br />

rendered into the normal and displacement maps by the ray-caster, and it is those<br />

that are uniquely mapped.<br />

Figure 1: A displacement map applies scalar offsets along the interpolated normals of a base mesh.

Mucky Foot Choices

At Mucky Foot we tend to author medium-polygon meshes (around 3,000 vertices<br />

for humans) with high-frequency bump maps. It is more efficient for the artists to<br />

put small creases and surface texture into a bump map than it is to generate them<br />

with polygonal creases, and it is just as visually effective. It also reduces the<br />


problem of high-frequency polygon data confusing the ray-caster and causing multiple<br />

intersections.<br />

For some objects, we author the unique mappings directly. Manual unique<br />

mapping is typically used on objects such as people, since they are already<br />

mapped fairly uniquely, except for left/right symmetry. This is easily fixed by<br />

selecting the right half of the part of the object that has been mapped this way<br />

(typically everything below the neck) and adding 1 to either the U or V value of<br />

the texture coordinates. Since these meshes use texture wrap address mode by<br />

default (as opposed to clamp-to-edge), this does not affect the diffuse, specular,<br />

etc., texture maps, but it does create a unique mapping for use by the displacement<br />

map. This mapping is then packed (see below) for better texel efficiency.<br />
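As a rough C++ illustration of that symmetry fix (the Vertex layout and the right-half selection test are hypothetical, not Mucky Foot's actual tool code), the selected half of the mesh simply has a whole texture repeat added to one coordinate in the channel that will become the unique mapping:

#include <cstddef>
#include <vector>

struct Vertex
{
    float u0, v0;   // diffuse/specular mapping (wrap addressing, left unchanged)
    float u1, v1;   // copy of the mapping that will become the unique mapping
};

// Offset the mirrored half by a whole texture repeat so the two halves no
// longer share texels in the unique-mapping channel. With wrap addressing the
// diffuse/specular maps are unaffected; the unique channel is then packed.
void MakeMirroredHalfUnique(std::vector<Vertex>& verts, const std::vector<bool>& isRightHalf)
{
    for (std::size_t i = 0; i < verts.size(); ++i)
        if (isRightHalf[i])
            verts[i].u1 += 1.0f;
}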

For other objects, we generate unique mapping automatically using fairly<br />

standard “texture atlas” creation techniques. In some cases, such as buildings,<br />

3ds max’s “flatten” tool mostly does a good enough job and has the benefit that<br />

the artists can directly tweak any problem areas. In other cases, this produces too<br />

many seams (or takes too much time to fix up by hand), and we first reduce the<br />

mesh to the low-polygon base mesh version and then uniquely map the object<br />

using our own texture atlas code.<br />

To produce a low-polygon base mesh, Mucky Foot uses a quadric error metric-based<br />

semiautomatic edge-collapse sequence that is visually checked and<br />

manually tweaked where necessary. Fully automated reduction is generally<br />

acceptable down to around 500 vertices, and then manual tweaking can be<br />

required in a few places to reduce to around 100 vertices. The tweaking is generally<br />

required to collapse features that are visually less important, such as the feet,<br />

or prevent collapse of perceptually important features, such as the face, elbows,<br />

knees, and hands. Automation of these (for example, taking bone weights into<br />

account) was attempted with mixed results. It seems generally quicker and better<br />

simply to allow the artists full control by this stage; frequently, the extra “intelligence”<br />

of the tool gets in the way. Production of a low-polygon mesh typically<br />

takes around 30 minutes per human mesh, which compares well with the initial<br />

authoring time.<br />

As well as producing the low-polygon base mesh, this process also generates<br />

a view-independent progressive mesh, which is useful when rendering the mesh<br />

on some hardware (see below). The same tool also produces VIPM sequences for<br />

objects that do not use displacement maps — simple or smooth objects such as<br />

coffee mugs, dustbins, chairs, and tables.<br />

The high- or mid-polygon meshes that the artists author are only used<br />

as input to the offline ray-caster; they are not used directly at run time. Because<br />

of this, the limits imposed by the rendering pipeline on polygon counts are almost<br />

totally removed. The new limit on polygon count is simply whatever the artists<br />

have time to author. The limits on connectivity, large or small polygon sizes, and<br />

mesh complexity are also largely removed — as long as a sensible low-polygon<br />

base mesh can be produced. Games are getting bigger and becoming more limited<br />

by what we have the time, talent, and manpower to author, rather than by the<br />

hardware, and this extra flexibility allows the artists to optimize for their time<br />

rather than for the peculiarities of a graphics engine.


Art Tools<br />

Although we use our own VIPM and texture atlas libraries, it should be noted<br />

that they are fairly standard algorithms, and many of the tools mentioned later<br />

would do just as good a job. We use our own code simply because it was already<br />

written and is now well integrated with our tool chain.<br />

We found a number of tools handy when authoring displacement maps. Many of<br />

these tools have other uses, such as the QEM-based edge collapser which also<br />

generates view-independent progressive mesh data. Some of them already exist<br />

in various forms, and experimenting with these off-the-shelf solutions is a very<br />

good idea. Many produce readily usable data, while others make useful test cases<br />

before committing to writing custom tools.<br />

Displacement Map Previewer<br />

If displacement maps are authored directly, some sort of preview tool is usually<br />

needed. Some 3D packages may have displacement map renderers included, but if<br />

not, it is fairly simple to write a brute-force previewer that simply tessellates the<br />

base mesh to the resolution of the displacement map — one quad per texel.<br />

Although it is a lot of triangles to draw, it is not unreasonable if done on a single<br />

object at a time. A 512x512 map requires half a million triangles to render, which<br />

can be done at acceptable speeds on most decent PC graphics cards.<br />

If displacement maps are extracted from a high-polygon mesh, this previewer<br />

is usually not necessary.<br />

Unique Mapping Checker

When creating unique texture mappings manually, it is easy to accidentally map<br />

two areas of mesh to the same bit of texture. This is easily solved by rendering<br />

the mesh to a texture using the UV mapping as XY coordinates, counting each<br />

time a particular texel is touched. Where a texel is touched more than once, render<br />

an opaque red texel. Otherwise, render a translucent blue texel. When the<br />

mesh is loaded back into a 3D modeling package and the texture applied to it, any<br />

red/opaque texels show where the problem spots are. As there will be red texels<br />

in both places that conflict, it is easy to spot and correct the overlap.<br />
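A simplified, self-contained C++ version of such a checker is sketched below, assuming the unique mapping is available as three UV pairs per base-mesh triangle. It only counts texel coverage; turning counts into red/blue texels is left to whatever image code is at hand, and a production tool would use a consistent fill rule so texels on edges shared by neighboring triangles are not falsely reported.

#include <algorithm>
#include <cstddef>
#include <vector>

struct UV { float u, v; };

// Count how many triangles of the uniquely mapped mesh touch each texel.
// A count greater than one means two areas of mesh map to the same texel.
std::vector<int> CountTexelCoverage(const std::vector<UV>& tris,  // 3 UVs per triangle
                                    int width, int height)
{
    std::vector<int> coverage(width * height, 0);
    for (std::size_t t = 0; t + 2 < tris.size(); t += 3)
    {
        const UV a = tris[t], b = tris[t + 1], c = tris[t + 2];
        float d = (b.u - a.u) * (c.v - a.v) - (c.u - a.u) * (b.v - a.v);
        if (d == 0.0f) continue;                          // skip degenerate triangles
        int minX = (int)(std::min({a.u, b.u, c.u}) * width);
        int maxX = (int)(std::max({a.u, b.u, c.u}) * width);
        int minY = (int)(std::min({a.v, b.v, c.v}) * height);
        int maxY = (int)(std::max({a.v, b.v, c.v}) * height);
        for (int y = std::max(minY, 0); y <= std::min(maxY, height - 1); ++y)
        for (int x = std::max(minX, 0); x <= std::min(maxX, width  - 1); ++x)
        {
            // Barycentric test against the texel center.
            float px = (x + 0.5f) / width, py = (y + 0.5f) / height;
            float w0 = ((b.u - px) * (c.v - py) - (c.u - px) * (b.v - py)) / d;
            float w1 = ((c.u - px) * (a.v - py) - (a.u - px) * (c.v - py)) / d;
            float w2 = 1.0f - w0 - w1;
            if (w0 >= 0.0f && w1 >= 0.0f && w2 >= 0.0f)
                ++coverage[y * width + x];
        }
    }
    return coverage;
}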

This tool is usually a special mode of the ray-caster, since both rasterize<br />

base-mesh polygons onto a uniquely mapped texture. The difference is that the<br />

ray-caster does a lot more work to decide what data to write to the texels.<br />

Ray-caster

The ray-caster rasterizes base-mesh triangles to the displacement and normal

maps. For each texel, it casts a ray from the texel’s position on the base mesh<br />

(after interpolation by whatever basis is used — linear, N-Patches, subdivision<br />

surface, etc.) along the normal, looking for the best intersection with the<br />


high-polygon mesh. “Best” is defined by various heuristics. It is usually the nearest<br />

intersection to the base mesh, though if multiple intersections are close<br />

together, this often indicates a high-frequency part of the mesh that folds back on<br />

itself or a mesh “decal” where smaller polygonal details have been added over a<br />

coarser part of the mesh. Usually, the furthest of these bunched intersections is<br />

used. This heuristic will take some tweaking — alternatively, it is often wise to<br />

highlight problem areas so that the artists can manually check and tweak them.<br />

The ray-caster takes the normal of the high-polygon mesh, modifies it by any<br />

applied bump map, and writes it to the normal map.<br />

It takes the distance along the ray from the base mesh to the intersection and<br />

writes that value into the displacement map. Any high-frequency bump map<br />

applied to the high-polygon mesh will also modify the displacement at this stage<br />

as a “detail” displacement map. In theory, a bump map should perturb the highpolygon<br />

mesh and alter where the ray intersects it. However, we have found that<br />

simply adding the bump map height onto the intersection distance produces perfectly<br />

acceptable results, as long as the bump map has a small displacement scale<br />

and is only used for creases and small bumps, rather than major features.<br />

After the ray-caster has written texel data to the normal and displacement<br />

maps, the maps are usually sent through a dilation filter, which spreads written<br />

values outward to any neighboring unwritten texels. This fills in the gaps between<br />

mapped areas with sensible data and ensures that filtering still brings in sensible<br />

data, especially when mipmapping.<br />
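One pass of such a dilation filter can be as simple as the following self-contained C++ sketch (a basic 4-neighbor version; real tools typically iterate it several times or search for the nearest written texel):

#include <vector>

// One dilation pass: any unwritten texel that has a written neighbor copies
// that neighbor's value and becomes written. Repeat until the gutters between
// patches are filled far enough for the chosen filtering and mipmapping.
void DilateOnce(std::vector<float>& texels, std::vector<bool>& written, int w, int h)
{
    std::vector<float> outTex = texels;
    std::vector<bool>  outWr  = written;
    const int dx[4] = { 1, -1, 0, 0 };
    const int dy[4] = { 0, 0, 1, -1 };
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
        {
            if (written[y * w + x]) continue;
            for (int n = 0; n < 4; ++n)
            {
                int nx = x + dx[n], ny = y + dy[n];
                if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                if (!written[ny * w + nx]) continue;
                outTex[y * w + x] = texels[ny * w + nx];   // spread nearest written value
                outWr[y * w + x]  = true;
                break;
            }
        }
    texels.swap(outTex);
    written.swap(outWr);
}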

ATI’s Normal Mapper [2], nVidia’s Melody, and Crytek’s PolyBump [3] all do<br />

this ray-casting, though at the time of publication all only output normal maps. It<br />

would be simple to modify them to output displacement data as well, and this support<br />

is planned for them. All include a variety of heuristics to decide the “best”<br />

ray intersection to use for various cases.<br />

Unique Mapping Packer<br />

There are two problems in unique mapping. One is to get a unique mapping so<br />

that no texel is used in two places, and the other is to pack the many small areas<br />

of connected triangle “patches” together on the texture in the most efficient way.<br />

The first can be solved by automation, but human intervention is frequently necessary,<br />

and it involves some judgment calls. Fortunately, these decisions are usually<br />

easy and quick for humans to make.<br />

The second part — equivalent to the problem of packing odd shapes in a box<br />

— is tedious for humans. But because it involves no perceptive judgment calls, it<br />

is simple to leave a computer crunching away through possible solutions (possibly<br />

overnight) until it finds a good one. To reduce “bleeding” between patches due to<br />

filtering (especially mipmapping), patches must be separated by certain minimum<br />

numbers of texels. After packing, these texels are filled with the value of the<br />

nearest used texel (again so that filtering does not bring in undefined values).<br />

Where unique texturing is generated or tweaked by hand, this automatic<br />

packing allows artists to concentrate on the task of uniquely mapping an object.<br />

They do not have to simultaneously keep all the bits optimally packed — they can


scatter them all over the UV domain and arrange them for easier mental labeling<br />

(all the parts of one type together, etc.).<br />

A further enhancement is to analyze the frequency of the displacement and<br />

normal map data in each triangle patch and scale them up or down to allocate<br />

more texture space to the areas with the higher frequency data. By packing the<br />

patches together after this scaling, a given size of displacement or normal map<br />

will be spread over the object with more texels applied to detailed areas.<br />

It is important to not completely remove the artist-determined scales.<br />

A maximum grow/shrink factor of two in each UV axis is sufficient to ensure good<br />

use of available space but allows artists to deliberately allocate extra texel space<br />

to areas of high importance, such as the face and hands of people, and reduce perceptually<br />

minor parts, such as the undersides of cars, which are very crinkly but<br />

not very visible (unless it’s that sort of game, of course!).<br />

Note that this scaling implies a slightly more complex pipeline. First the<br />

patches are packed together without scaling. This is just to get them all onto a<br />

single map of reasonable size — the packing does not need to be very efficient.<br />

Then the ray-caster is run to produce a first approximation of the displacement<br />

and normal map data. For quick previews, that data is then used directly for<br />

display.<br />

For final artwork, the frequency of the data in each patch is determined, and<br />

the patches are scaled accordingly and repacked — possibly with a more expensive<br />

or thorough algorithm. Then the ray-caster is run again with this new optimal<br />

mapping, usually with a very high-resolution map. The large map is then<br />

filtered down to the actual size stored on disk. This second pass is typically run on<br />

a batch job overnight, which means it can devote a lot of time to finding near-optimal<br />

packing and use really big maps for the ray-casting phase, removing as many<br />

sampling artifacts as possible.<br />
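As one possible sketch of the frequency-driven scaling step: the factor-of-two clamp follows the text, but the particular frequency measure (mean absolute difference between neighboring displacement texels) and the normalization against the average patch are illustrative assumptions, not the method the chapter prescribes.

#include <algorithm>
#include <cmath>
#include <vector>

// Estimate how "busy" a patch's displacement data is, then turn that into a
// per-patch UV scale clamped to [0.5, 2.0] so artist-chosen texel allocation
// is never completely overridden.
float PatchScaleFromFrequency(const std::vector<float>& patchTexels, int w, int h,
                              float averageFrequencyOfAllPatches)
{
    double sum = 0.0;
    int count = 0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x + 1 < w; ++x, ++count)
            sum += std::fabs(patchTexels[y * w + x + 1] - patchTexels[y * w + x]);
    float frequency = count ? float(sum / count) : 0.0f;
    float scale = frequency / std::max(averageFrequencyOfAllPatches, 1e-6f);
    return std::min(2.0f, std::max(0.5f, scale));
}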

Alternative methods of optimizing texture space for signal frequency are<br />

given by Sander et al. [4].<br />

Mesh Reduction<br />


Mesh reduction is probably the trickiest tool to get right since it usually needs to<br />

have an interactive element to it and it relies on a lot of heuristics.<br />

The most common mesh-reduction techniques are based on incremental<br />

edge or half-edge collapses. This technique produces a progressive mesh [5] as it<br />

works, which can be used for rendering continuous level of detail meshes. Many<br />

heuristics exist to decide the order of edge collapses, most based on the quadric<br />

error metric by Garland and Heckbert [6] or modifications of it by Hoppe [7].<br />

An increasing number of existing tools can be used for this:<br />

• The Direct3DX library PMesh interface
• Melody tool by nVidia [8]
• Galaxy3 source library by Charles Bloom [9]
• Source code to my article “Comparison of VIPM Methods” in Game Programming Gems 2 [12]


The above all use edge-collapse methods. Alternatively, there are various styles of remeshing using Delaunay triangulation [10] or voxelizing and remeshing.

Rendering

Once the basic data of a low-polygon mesh, a displacement map, a normal map,

and a mapping for the maps is obtained, the data can be processed for the capabilities<br />

of the target hardware. Much of the details are either proprietary (in the case<br />

of consoles) or have been discussed elsewhere (in the case of my “displacement<br />

compression” techniques [11]), so only brief outlines are given here. Fortunately,<br />

this processing rarely requires any human intervention and is fairly simple number<br />

crunching. I address each platform separately.<br />

The techniques for rendering normal maps are fairly standard between most<br />

of these platforms. The exception (as always) is the PlayStation 2, but again these<br />

details are proprietary.<br />

Adaptive Displacement Mapping<br />

• Matrox Parhelia, future hardware

Make mipmaps of the displacement map and render the low-polygon mesh with<br />

the displacement map. If necessary, feed some distance-related or perceptual<br />

biases into the adaptive tessellator. The hardware does the rest.<br />

Pre-sampled Displacement Mapping<br />

• ATI Radeon 9700, maybe PlayStation 2, and GameCube

Offline, regularly and uniformly tessellate the base mesh in software and sample<br />

the displacement map at the generated vertices. This produces an array of<br />

n(n+1)/2 displacements for each triangle on the base mesh. These values are<br />

swizzled in a hardware-specified manner into a linear stream fed to the vertex<br />

shader unit. At run time, the vertex shader unit performs this tessellation itself,<br />

reads the values from the displacement stream, and draws the final displaced<br />

vertices.<br />
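The offline sampling might be sketched in C++ as below. The per-corner UVs, the SampleDisplacementMap helper, and the grid convention are assumptions for illustration; the count quoted in the text (n(n+1)/2) corresponds to a slightly different definition of the tessellation level, and the hardware-specified swizzle of the resulting stream is omitted entirely.

#include <vector>

struct UV { float u, v; };

// Placeholder for a bilinear fetch from the baked displacement map.
float SampleDisplacementMap(const UV& uv);

// Walk a regular barycentric grid over one base-mesh triangle and sample the
// displacement map at each generated vertex. The values are later swizzled
// into the hardware's order and fed to the vertex shader as a stream.
std::vector<float> BuildDisplacementStream(const UV& uvA, const UV& uvB, const UV& uvC,
                                           int level)   // rows in the tessellation grid
{
    if (level < 1) level = 1;
    std::vector<float> stream;
    for (int i = 0; i <= level; ++i)
        for (int j = 0; j <= level - i; ++j)
        {
            float a = float(i) / level;                  // barycentric weights
            float b = float(j) / level;
            float c = 1.0f - a - b;
            UV uv = { a * uvA.u + b * uvB.u + c * uvC.u,
                      a * uvA.v + b * uvB.v + c * uvC.v };
            stream.push_back(SampleDisplacementMap(uv)); // (level+1)(level+2)/2 values
        }
    return stream;
}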

To perform level of detail transitions, repeat the above process for a variety<br />

of different tessellation amounts (generally the powers of two), giving an effective<br />

“mipmap chain” of displacement streams. This allows discrete LOD transitions,<br />

though with some popping as the mesh switches from one tessellation level to the<br />

next.<br />

To remove the popping, each displacement stream entry holds two displacements<br />

rather than one. The first holds the standard displacements, and the second<br />

holds the upsampled displacements from the lower LOD tessellation. In the<br />

vertex shader (or equivalent), a per-mesh scalar interpolates between the two<br />

sets of displacements. Just using these upsampled values should give a mesh that<br />

is visually identical to the lower LOD version. As an object goes away from the<br />

camera, this allows the high LOD version to smoothly morph into the low LOD


version and then the low LOD version swaps in with no visual popping but reducing<br />

the triangle and vertex count.<br />

Because this method samples the displacement map in a predictable manner,<br />

you may get some improvement in quality by ray-casting at the required positions<br />

directly rather than going via a displacement map. This also means that a unique<br />

mapping is not required for displacements, since there is no actual 2D displacement<br />

map but simply a displacement stream for each triangle of the base mesh.<br />

However, a unique mapping is still required for the normal map.<br />

The Radeon 9500-9800 series are currently the only cards to explicitly support<br />

this method, though it seems possible that the PlayStation 2 and GameCube<br />

could also implement it with a bit of work. As with all things on the PS2, it<br />

depends heavily on the rest of the rendering pipeline being used.<br />

Displacement Compression<br />


• All PC cards with hardware vertex shader support (nVidia GeForce 3 and better, ATI Radeon 8500 and better, and others), GameCube, Xbox, PlayStation 2, software vertex shader pipelines on DX6 or better cards

The base mesh vertices are uploaded to the memory of the vertex unit rather<br />

than in a standard mesh/vertex stream. This may need to be done in multiple sections<br />

because of limited vertex unit memory, with each section drawn before the<br />

next is uploaded.<br />

Tessellation of the mesh is performed offline to whatever degree required,<br />

and the tessellated vertices and/or indices are fed in as a standard mesh. The difference<br />

is that rather than holding a raw vertex position, normal, texture coordinates,<br />

etc., each vertex stores only the following data:<br />

• Three indices to three base-mesh vertices
• Two barycentric coordinates that interpolate between the base-mesh vertices
• A displacement value

This reduces the size of a vertex to 6 bytes (though many systems require padding<br />

of the vertices up to 8 bytes). The vertex unit interpolates position, normal,<br />

texture coordinates, tangent vectors, and so on from the given three base-mesh<br />

vertices and the two barycentric coordinates. The vertex is then displaced along<br />

the interpolated normal.<br />
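The data flow can be pictured with the CPU-side C++ reference below. It is a sketch only: the exact packing, quantization ranges (dispScale/dispBias), and the choice of uint8_t fields are assumptions, and real implementations do this interpolation in the vertex shader against base vertices held in vertex-unit constant memory.

#include <cstdint>

struct Vec3 { float x, y, z; };

// Base-mesh vertex, uploaded to vertex unit memory in sections.
struct BaseVertex { Vec3 position; Vec3 normal; float u, v; };

// One tessellated vertex: 6 bytes (often padded to 8).
struct CompressedVertex
{
    uint8_t i0, i1, i2;     // indices of three base-mesh vertices
    uint8_t b0, b1;         // two quantized barycentric coordinates
    uint8_t displacement;   // quantized displacement along the normal
};

// CPU reference of the work the vertex unit performs for each vertex.
Vec3 DecompressPosition(const CompressedVertex& cv, const BaseVertex base[],
                        float dispScale, float dispBias)
{
    const BaseVertex& v0 = base[cv.i0];
    const BaseVertex& v1 = base[cv.i1];
    const BaseVertex& v2 = base[cv.i2];
    float b0 = cv.b0 / 255.0f, b1 = cv.b1 / 255.0f, b2 = 1.0f - b0 - b1;
    float d  = cv.displacement / 255.0f * dispScale + dispBias;

    Vec3 p, n;   // linear basis shown; an N-Patch-style basis could be used instead
    p.x = b0*v0.position.x + b1*v1.position.x + b2*v2.position.x;
    p.y = b0*v0.position.y + b1*v1.position.y + b2*v2.position.y;
    p.z = b0*v0.position.z + b1*v1.position.z + b2*v2.position.z;
    n.x = b0*v0.normal.x   + b1*v1.normal.x   + b2*v2.normal.x;
    n.y = b0*v0.normal.y   + b1*v1.normal.y   + b2*v2.normal.y;
    n.z = b0*v0.normal.z   + b1*v1.normal.z   + b2*v2.normal.z;

    p.x += n.x * d;  p.y += n.y * d;  p.z += n.z * d;    // displace along the normal
    return p;
}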

It is important to realize that this method does not require the hardware to<br />

tessellate the mesh. All tessellation is performed offline, and a fairly standard<br />

mesh renderer is used. The difference is that the vertices are compressed using<br />

the data from the displacement map.<br />

Interpolation can be performed using any basis, but linear and bicubic<br />

are common. Linear interpolation is fine for most objects, though highly animated<br />

objects may benefit from using an N-Patch-style basis because it is relatively<br />

smooth, even under heavy mesh distortion.<br />

As with presampled displacement mapping, there is no actual 2D displacement<br />

map (the displacements are held by the vertices themselves), so the displacement<br />

for each vertex can be sampled directly using the ray-caster if desired.<br />


Level of detail transitions can be done using the same trick as with<br />

presampled displacement mapping — storing two displacements per vertex and<br />

lerping between them — or using view-independent progressive meshes. Mucky<br />

Foot currently uses the lerping method on the PlayStation 2; on other platforms<br />

with indexed primitive support, we use “sliding window” VIPM [12].<br />

In some cases, the interpolated texture coordinates (used for the diffuse and<br />

normal maps) are slightly distorted from the desired coordinates. The simple<br />

solution is to add 2 bytes to the vertex format that offset the UV values from the<br />

interpolated ones. This brings the vertex size up to 8 bytes. On the PC, vertices<br />

are required to be multiples of 4 bytes anyway, and on other platforms, the larger<br />

vertices are still a substantial improvement on traditional mesh data. The other<br />

option is to distort the diffuse maps slightly to correct for this effect — this fits in<br />

easily with some pipelines.<br />

It is possible to reformulate this method so that instead of sending base-mesh<br />

vertices to the vertex unit, base-mesh triangles are sent. Each displaced vertex<br />

then only needs a single index to determine which triangle it is on. This reduces<br />

the possible size of vertices down to 4 bytes. However, since there are typically<br />

more triangles than vertices in the base mesh, more information is required to

store a triangle, and vertex unit storage space is typically at a premium, this may<br />

be slower except for highly tessellated objects with simple base meshes.<br />

Start-of-Day Tessellation<br />

• Slow CPUs with DX5 or earlier graphics cards, software rasterizers, laptops, PDAs, mobile phones

These devices do not have enough polygon throughput and/or CPU power to use<br />

run-time displacement mapping to any useful extent. However, you can tessellate<br />

and displace the data using software either at installation time or at start of day.<br />

By tessellating according to the CPU speed of the machine and tessellating multiple<br />

versions of each mesh, you still gain the advantages of adapting polygon count<br />

to the scene complexity, machine capability, and the size of each mesh on the<br />

screen without having to author them directly.<br />

If the data is delivered over a link or medium with limited bandwidth or size (for example, over a modem or on a multi-game “sampler” disk), you gain the excellent

compression and space savings that come with using displacement and normal<br />

maps.<br />

On really slow hardware, the low-polygon base mesh is just used directly with

no tessellation at all.<br />

Some software rasterizers may be able to do normal mapping, and some<br />

hardware may be able to use the displacement map data to do emboss bump-mapping.<br />

Otherwise, it is easy to do a prelighting phase applied to the normal map<br />

with the mesh in its default pose and light coming from above to give lights and<br />

shadows in appropriate places. While not strictly correct, it produces images easily<br />

acceptable by the standards of the available hardware but does not cost any<br />

extra authoring time to produce.
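Such a prelighting pass is little more than a dot product per texel. The C++ sketch below assumes the normals have already been unpacked to floats and that a simple ambient term is wanted; both are illustrative choices rather than anything prescribed by the text.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Bake simple top-down diffuse lighting into a grayscale map using the normal
// map of the mesh in its default pose. Not physically correct, but it places
// light and shade plausibly at no extra authoring cost.
std::vector<float> BakePrelight(const std::vector<Vec3>& normals,   // one per texel
                                float ambient)                      // e.g. 0.3f
{
    const Vec3 lightDir = { 0.0f, 1.0f, 0.0f };      // light from above
    std::vector<float> light(normals.size());
    for (std::size_t i = 0; i < normals.size(); ++i)
    {
        float ndotl = normals[i].x * lightDir.x +
                      normals[i].y * lightDir.y +
                      normals[i].z * lightDir.z;
        light[i] = std::min(1.0f, ambient + std::max(0.0f, ndotl) * (1.0f - ambient));
    }
    return light;
}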


Summary<br />


Displacement mapping reduces memory use and increases mesh detail. Once displacement<br />

maps are authored, highly scalable content is easy to generate automatically,<br />

allowing an application to use very long view distances, more complex<br />

scenes, a wide variety of platforms, and (to an extent) future-proof itself and the<br />

art assets for future hardware.<br />

The difficulties of authoring displacement maps directly are reduced to a far<br />

more manageable pipeline with a few simple tools and a small amount of artist<br />

training. Previously, greater effort was frequently taken when authoring and<br />

re-authoring different levels of detail for different platforms or to rebalance processing<br />

load for specific scenes. Almost all of the difficulties with displacement<br />

maps are shared by the generation of normal maps — if generating one, you can<br />

frequently get the other with very little effort.<br />

Despite appearances, there is already wide hardware support for displacement<br />

maps — all the current consoles and almost all “gamer” PC hardware.<br />

Newer hardware allows more efficient implementations of displacement mapping,<br />

but any of the methods listed give speed and size advantages over raw mesh<br />

rendering.<br />

References

[1] Gu, X., S. Gortler, and H. Hoppe, “Geometry Images,” ACM SIGGRAPH ’02, pp. 355-361.

[2] ATI Normal Mapper tool, available from http://mirror.ati.com/developer/index.html.

[3] Crytek PolyBump package, http://www.crytek.com/.

[4] Sander, P., S. Gortler, J. Snyder, and H. Hoppe, “Signal-specialized parametrization,” Eurographics Workshop on Rendering 2002, http://research.microsoft.com/~hoppe/.

[5] Hoppe, H., “Progressive meshes,” ACM SIGGRAPH ’96, pp. 99-108.

[6] Garland, M. and P. Heckbert, “Surface simplification using quadric error metrics,” SIGGRAPH ’97 Proceedings, Aug. 1997.

[7] Hoppe, H., “New quadric metric for simplifying meshes with appearance attributes,” IEEE Visualization 1999, October 1999, pp. 59-66.

[8] Melody tool by nVidia, www.nvidia.com.

[9] Galaxy3 source library by Charles Bloom, http://www.cbloom.com/3d/galaxy3/.

[10] Eck, M., T. DeRose, T. Duchamp, H. Hoppe, M. Lounsbery, and W. Stuetzle, “Multiresolution analysis of arbitrary meshes,” Computer Graphics, 1995.

[11] Forsyth, T., “Where Have All the Bumpmaps Gone” (Meltdown 2000) and “Highly Scalable Character Rendering” (Meltdown 2001), available at http://www.tomforsyth.pwp.blueyonder.co.uk/.

[12] Forsyth, T., “Comparison of VIPM Methods,” Game Programming Gems 2, Charles River Media, 2001.


Section II

Rendering Techniques

Rendering Objects as Thick Volumes, by Greg James
Screen-aligned Particles with Minimal VertexBuffer Locking, by O’dell Hicks
Hemisphere Lighting with Radiosity Maps, by Shawn Hargreaves
Galaxy Textures, by Jesse Laeuchli
Turbulent Sun, by Jesse Laeuchli
Fragment-level Phong Illumination, by Emil Persson
Specular Bump Mapping on Pre-ps_1_4 Hardware, by Matthew Halpin
Voxel Rendering with PS_3_0, by Aaron Burton
Simulating Blending Operations on Floating-point Render Targets, by Francesco Carucci
Rendering Volumes in a Vertex & Pixel Program by Ray Tracing, by Eli Z. Gottlieb
Normal Map Compression, by Jakub Klarowicz
Drops of Water and Texture Sprites, by Sylvain Lefebvre
Advanced Water Effects, by Kurt Pelzer
Efficient Evaluation of Irradiance Environment Maps, by Peter-Pike J. Sloan
Practical Precomputed Radiance Transfer, by Peter-Pike J. Sloan
Advanced Sky Dome Rendering, by Marco Spoerl and Kurt Pelzer
Deferred Shading with Multiple Render Targets, by Nicolas Thibieroz
Meshuggah’s Effects Explained, by Carsten Wenzel
Layered Car Paint Shader, by John Isidoro, Chris Oat, and Natalya Tatarchuk
Motion Blur Using Geometry and Shading Distortion, by Natalya Tatarchuk, Chris Brennan, Alex Vlachos, and John Isidoro
Simulation of Iridescence and Translucency on Thin Surfaces, by Natalya Tatarchuk and Chris Brennan
Floating-point Cube Maps, by Arkadiusz Waliszewski
Stereoscopic Rendering in Hardware Using Shaders, by Thomas Rued
Hatching, Stroke Styles, and Pointillism, by Kevin Buchin and Maike Walther
Layered Fog, by Guillaume Werle
Dense Matrix Algebra on the GPU, by Ádám Moravánszky


Rendering Objects as Thick Volumes

Greg James

Introduction

This article presents a convenient and flexible technique for rendering ordinary<br />

polygon objects of any shape as thick volumes of light-scattering or light-absorbing<br />

material. Vertex and pixel shaders are used in multipass rendering to generate<br />

a measure of object thickness at each pixel. These thicknesses are then used to<br />

produce the colors of the object on screen. For example, we can render a volumetric<br />

shaft of light by creating a simple polygonal model of the light shaft. Each<br />

frame, new thickness information for this object is rendered from the current<br />

point of view, and the thicknesses are converted to colors. The result is a true<br />

volumetric rendering of the object suitable for interactive dynamic scenes.<br />

The technique can be implemented on hardware that supports Microsoft’s<br />

pixel shaders version 1.3 or higher and runs at real-time frame rates in complex<br />

scenes. No preprocessing or special treatment of the volume object geometry is<br />

required, making it trivial to animate and distort the volume objects. An efficient<br />

and simple method is given to properly render any volume objects, convex or concave,<br />

and handle complex intersection cases where opaque objects of any shape<br />

penetrate the volumes. This article also introduces a new method of dithering to<br />

eliminate the effects of aliased thickness information. The dithering is accomplished<br />

using texture data, and it does not complicate the rendering or require<br />

additional passes.<br />

This article focuses on rendering based on the thickness visible from the current<br />

viewpoint. This is suitable for volumes of single-scattering material. In this<br />

case, each bit of light arriving at the viewpoint is the result of only one scattering<br />

interaction within the object, and the total amount of light is a function of the total<br />

thickness. As the visible thickness increases, the number of scatterers or the<br />

chance of scattering increases. The scattering can both add light and attenuate<br />

light as a function of thickness. More sophisticated models of scattering could be<br />

employed but will not be presented here. Hoffman and Preetham have a good<br />

demo and introduction to various types of scattering [Hoffman02].<br />

The appearance of the volume objects is easy to control, and an artist-created<br />

color ramp can be used to map object thickness to color. While the technique<br />

treats objects as volumes of constant density, the color ramp allows us to map<br />

increasing thickness to an exponential ramp, overbright saturated colors, or any<br />

89


Section II — Rendering Techniques<br />

90 Rendering Objects as Thick Volumes<br />

arbitrary colors. The technique is being used in several upcoming games and has<br />

great promise for bringing practical volumetric effects to interactive real-time<br />

rendering.<br />

The Big Picture<br />

This technique is a significant departure from traditional 3D rendering. It involves<br />

rendering to off-screen textures, rendering depth information as RGBA colors,<br />

using simple vertex shader programs and textures to encode information, and<br />

using alpha blending to add and subtract high-precision encoded depth information.<br />

Rather than jump into detailed discussion right away let’s begin with an<br />

overview of the complete rendering process, so you can clearly see what’s<br />

involved and how the technique compares to other approaches.<br />

The full implementation of the technique is illustrated in Figure 1. These<br />

steps render any volumetric shape, handle all solid objects intersecting the volumes,<br />

dither the thickness information, and handle any camera position in the<br />

scene, whether the camera is inside or outside of the volumes or solid objects.<br />

Rendering proceeds as follows and is covered in greater detail later in the article:<br />

1. Opaque objects are rendered to the ordinary back buffer. See Figure 1a.<br />

2. The view-space depth of opaque objects that might intersect the volume<br />

objects is rendered to a texture that we label O. Depth is encoded as RGBA<br />

colors. See Figure 1b.<br />

3. All volume object back faces are rendered to texture B using additive RGBA<br />

blending to sum the depths. A pixel shader samples O while rendering each<br />

triangle in order to handle intersections. See Figure 1c.<br />

4. All volume object front faces are rendered to texture F while sampling O to<br />

handle intersections. See Figure 1d.<br />

5. Textures B and F are sampled to compute the volume thickness, convert this<br />

to color, and blend the color to the scene rendered in Step 1. See Figure 1e.<br />

One of the advantages of this technique is that the rendering does not have to<br />

change in order to handle various intersection cases and camera positions. No<br />

extra passes or knowledge about the objects is required as long as the depth complexity<br />

of the volume objects remains below a certain adjustable limit. A later section<br />

presents this in greater detail, but the depth complexity limit depends on the<br />

precision of the thickness information. This can be adjusted from frame to frame.<br />

A depth complexity of 16 or 32 volume object faces can be rendered at high precision<br />

with no additional passes.
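In outline, a frame of this technique looks like the following C++-style sketch. The helper names (SetRenderTarget, DrawOpaqueScene, and so on) are placeholders for the application's own renderer and whatever D3D calls it wraps; only the pass order and the use of the O, B, and F textures are taken from the steps above.

// Placeholder hooks into the application's renderer (not a real API):
void SetRenderTarget(int rt);                 // BACKBUFFER, TEX_O, TEX_B, or TEX_F
void DrawOpaqueScene();                       // step 1
void DrawOpaqueDepthsEncoded();               // step 2: RGB-encoded view-space depth
void DrawVolumeFaces(bool backFaces);         // steps 3-4: additive RGBA blend, samples O
void DrawFullScreenThicknessToColor();        // step 5: thickness from B and F via color ramp

enum { BACKBUFFER, TEX_O, TEX_B, TEX_F };

void RenderFrame()
{
    SetRenderTarget(BACKBUFFER);  DrawOpaqueScene();                 // 1
    SetRenderTarget(TEX_O);       DrawOpaqueDepthsEncoded();         // 2
    SetRenderTarget(TEX_B);       DrawVolumeFaces(true);             // 3: back faces, summed
    SetRenderTarget(TEX_F);       DrawVolumeFaces(false);            // 4: front faces, summed
    SetRenderTarget(BACKBUFFER);  DrawFullScreenThicknessToColor();  // 5: blend result to scene
}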


Figure 1: Overview of the rendering steps. Five rendering passes produce correct results for all cases where volume objects intersect opaque objects and for all camera locations inside or outside of the objects. One additional pass (not shown) is required for hardware that does not support pixel shaders 2.0.

Computing Thickness

First, we need a way to get thickness information from ordinary polygon hulls.<br />

Dan Baker presents a technique for this in the Microsoft <strong>DirectX</strong> 8.1 SDK<br />

VolumeFog example [Baker02]. His approach can be extended in a number of<br />

ways, but the basic approach is to calculate thickness by subtracting the viewspace<br />

depth of an object’s front faces from the depth of the back faces. The depths<br />

of an object’s faces are rendered to off-screen render targets, and the thickness is<br />

computed from information in the render targets. At any given pixel, if we sum<br />

the depths of all of an object’s front faces at that pixel and sum the depths of all<br />

back faces, the thickness through the object is the back face sum minus the front<br />

face sum. This is illustrated in Figure 2.<br />

Figure 2: For a given pixel on screen, the thickness through the object is the sum of the<br />

depths of all front faces at that pixel subtracted from the sum of the depths of all back<br />

faces at that pixel.
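For a concave object the sums matter. If a pixel's view ray enters and leaves the volume twice, with front-face depths 0.2 and 0.6 and back-face depths 0.4 and 0.9, the thickness is (0.4 + 0.9) - (0.2 + 0.6) = 0.5, exactly the two occupied spans (0.2 and 0.3) added together. A trivial C++ helper (a reference for the math only, not how the GPU computes it) makes the rule explicit:

#include <numeric>
#include <vector>

// Thickness at one pixel: sum of back-face depths minus sum of front-face depths.
float ThicknessAtPixel(const std::vector<float>& frontDepths,
                       const std::vector<float>& backDepths)
{
    float front = std::accumulate(frontDepths.begin(), frontDepths.end(), 0.0f);
    float back  = std::accumulate(backDepths.begin(),  backDepths.end(),  0.0f);
    return back - front;
}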



Depth is calculated at each vertex as part of the standard 3D view transform. This<br />

is interpolated for standard Z-buffer rendering, but the Z-buffer information is not<br />

practical to use for this technique. It is too costly in terms of performance, and the<br />

graphics APIs have no flexibility for summing and differencing the Z-buffer information.<br />

Attempting to manipulate the information on our own would require<br />

decompressing and copying the GPU data across to the CPU for processing. This<br />

would break the parallelism of the two processors, stall the GPU, and burden the<br />

CPU unnecessarily.<br />

Instead, we can use standard RGBA 8-bit color rendering and additive blending<br />

to accomplish the thickness calculations entirely on the GPU. A high-precision<br />

depth value can be split up and encoded across the color channels of an<br />

RGBA-8 color. I’ll refer to this as RGB-encoding of the depth information. Standard<br />

blend operations can then sum the encoded values. This allows us to process<br />

and sum, for example, 12-bit or 18-bit depth information using commonplace<br />

RGBA-8 render targets.<br />

The latest generation of consumer GPUs (the GeForce FX and Radeon 9800<br />

series) has introduced support for rendering high-precision color information with<br />

up to 32 bits per color component for a total of 128 bits per RGBA color. Unfortunately,<br />

these chips do not support additive blending of these high-precision colors,<br />

so they are not capable of performing the depth sums as efficiently or quickly as<br />

with RGBA-8 additive blending.<br />

RGB-Encoding of Values<br />

A standard RGBA-8 render target can do a fantastic job of storing and accumulating<br />

high-precision scalar (1D) values. The bits of a number can be split across the<br />

8-bit red, green, blue, and alpha color channels using any number of the low bits of<br />

each channel. When the bits of a number are split across the R, G, and B colors, I<br />

call it an RGB-encoded value. A particular case is illustrated in Figure 3, where a<br />

15-bit number is split into three 5-bit color values. The precision at which we can<br />

encode values is given by the number of low bits, L, that we use in each color<br />

channel multiplied by the number of color channels. For example, if we use four<br />

low bits (L=4) from each R, G, and B channel, we can encode 12-bit values (3*4).<br />

It’s important to note that we use only a few of the lowest bits of each color<br />

channel to encode any single value. The remaining high bits are left empty so that<br />

when two or more values are added, the low bits can carry over into the unused<br />

high bits. RGB-encoded values can be added together using standard RGBA blend<br />

operations until all the bits of any color channel are full. At that point, any further<br />

additions will be lost because the bits of one color channel do not carry into the<br />

other channels. The number of high “carry” bits in each color channel is (8-L),<br />

and the number of RGB-encoded values we can add together without error is 2^(8-L).

There is a tradeoff between the precision that we can encode and the number of<br />

encoded values that can be added together. For our case of encoding a 15-bit value<br />

(L=5), we have three carry bits, so we can sum at most eight values into any<br />

given RGBA-8 color. Figure 3 includes a table relating precision to the number of<br />

values that can be safely added.
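The bit layout and the carry behavior are easy to verify on the CPU. The self-contained C++ sketch below encodes integer values with L low bits per channel (using blue for the least significant bits, as in the ramp scheme described below), adds the encoded colors the way the blend unit would, and decodes the sum; it mirrors the scheme described here rather than any shipping shader code.

#include <cassert>
#include <cstdint>

struct RGB8 { uint8_t r, g, b; };

// Split an integer (up to 3L bits) into the L low bits of each channel.
RGB8 Encode(uint32_t value, int L)
{
    uint32_t mask = (1u << L) - 1u;
    RGB8 c;
    c.b = uint8_t(value & mask);                 // blue holds the least significant bits
    c.g = uint8_t((value >> L) & mask);          // green holds the middle bits
    c.r = uint8_t((value >> (2 * L)) & mask);    // red holds the most significant bits
    return c;
}

// Recombine the channels; works on sums too, as long as no channel has
// overflowed its 8 bits (at most 2^(8-L) encoded values may be added).
uint32_t Decode(const RGB8& c, int L)
{
    return uint32_t(c.b) + (uint32_t(c.g) << L) + (uint32_t(c.r) << (2 * L));
}

void CarryCheck()
{
    const int L = 5;                             // 15-bit values, up to 8 additions
    RGB8 a = Encode(1234, L), b = Encode(4321, L);
    RGB8 sum = { uint8_t(a.r + b.r), uint8_t(a.g + b.g), uint8_t(a.b + b.b) };
    assert(Decode(sum, L) == 1234u + 4321u);     // low bits carry into the spare high bits
}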


Figure 3: Encoding a 15-bit value using five low bits (L=5) of each 8-bit R, G, and B color channel. The diagram on the right relates the number of low bits, L, to the precision of each value and the number of encoded values that can be added into an RGB-8 color before error occurs due to saturating all of the bits of a particular color channel.

Figure 4 illustrates the RGB-encoding applied to a steadily increasing value. The<br />

RGB-encoded value is the sum of the R, G, and B ramps at a point along the axis.<br />

Only two bits per color channel are used to better illustrate the relationship of the<br />

colors, and a scheme is used where blue holds the least significant bits, green<br />

holds the middle significant bits, and red holds the most significant bits. Thus, the<br />

green values go through one cycle each time red increases by one bit, and blue<br />

cycles once for each green increment.<br />

Figure 4: Encoding 6-bit depth values using two low bits (L=2) of each channel of an<br />

RGB-8 color. Depth varies from 0 to 1 from the near clipping plane to the far clipping<br />

plane and is encoded by adding the blue, green, and red color ramps shown. At a depth<br />

of 1.0, the color is RGB=(3,3,3) out of the full (255, 255, 255) range, and at 0.75, the<br />

color is (3,0,0).<br />

Applying this RGB-encoding of depth to the simple scene in Figure 5a gives the<br />

result shown in Figure 5b. Here, four bits are used from each color channel. The<br />

RGB colors are displayed overbright because in practice the low bit values of each<br />

color would appear mostly black. Red values are too low to be noticeable in Figure<br />

5b, but if the objects extended farther toward the far clip plane, the red values<br />

would become noticeable. In practice, the RGB-encoded depths are rendered to<br />

an off-screen texture render target. This allows us to read back the depths in later<br />

rendering operations, which is important for handling solid objects that intersect<br />

the volumes of fog.



Figure 5: Objects rendered with a) traditional shading and b) RGB-encoded depth<br />

rendering at 12 bits of precision (L=4). The RGB-encoded colors are shown overbright,<br />

as their actual range, in this case from [0, 16] out of [0, 255], would appear mostly<br />

black.<br />

RGB-encoding is easy to achieve using programmable vertex shaders and small<br />

color ramp textures. The encoding can be applied to any per-vertex scalar that we<br />

compute in a vertex shader, but here all we care about is the per-vertex depth.<br />

The vertex shader computes a depth value at each vertex as part of the standard<br />

3D transform from object space to homogenous clip space (HCLIP space), as<br />

shown in the following vertex shader assembly code VS 1:<br />

DP4 r1.x, V_POSITION, c[CV_WORLDVIEWPROJ_0]
DP4 r1.y, V_POSITION, c[CV_WORLDVIEWPROJ_1]
DP4 r1.z, V_POSITION, c[CV_WORLDVIEWPROJ_2]
DP4 r1.w, V_POSITION, c[CV_WORLDVIEWPROJ_3]
MOV oPos, r1

V_POSITION is the input vertex position in object space, and the CV_WORLDVIEWPROJ_ constants are the elements of the standard 4x4 transform-and-project

matrix used in 3D rendering. The r1.w component is the vertex’s distance to the<br />

camera plane (not the radial distance to the camera and not the distance to the<br />

near clip plane), so it behaves correctly when linearly interpolated in<br />

rasterization. This W component is easily turned into three texture coordinates<br />

that can access small color ramp textures to achieve the encoding of Figure 5. All<br />

we have to do is scale the W component so it varies from 0 to 1 from the near to<br />

far plane and scale that value by the number of times each color ramp repeats.<br />

The color ramp textures are typically small with one texel per color value,<br />

and they are created to match our choice of the number of low bits, L, that we are<br />

using from each channel. The color value at each texel of the color ramps is simply<br />

an integer, L-bits in size, corresponding to the texel location. For example, if<br />

we choose a 12-bit encoding, then L=4 bits from each color channel, so we use R,<br />

G, and B ramps 16 texels long (16 being 2^L), with values ranging from 0 to 15 over

the 16 texels. The lowest coordinate texel, which is at (0,0), has the value 0 out of<br />

255, the second texel has the value 1 of 255, the eighth is 7, etc.<br />

Texture repeating or wrapping is enabled so that one color ramp can repeat<br />

many times over the range of depths. The texture coordinate for the red texture<br />

ramp is the W depth scaled to [0,1] from the near to far clip plane. (The range<br />

notation [n,m] denotes all numbers from n to m, inclusive of the limit values n and


m.) The coordinate for green is this same coordinate multiplied by the number of<br />

values in the red color ramp, which is 2^L, so that the green color ramp repeats 2^L

times, or once for each bit increment of the red color. The texture coordinate for<br />

the blue color ramp is the red coordinate scaled by 2^L * 2^L, or 2^(2L), so that the blue

ramp repeats once for each increment of the green color ramp. For the case of<br />

Figure 4, where L=2, the texture coordinate for green ranges from 0 to 4, and the

blue coordinate spans [0,16]. These texture coordinates are calculated and output<br />

by the vertex code fragment listed in VS 2, where “near” and “far” denote the distances<br />

to the near and far clip planes.<br />

// CV_RAMPSCALE = ( 1.0, 2^L, 2^(2L), 0 )
// CV_NEARFAR = ( 1/(far-near), -near/(far-near), 0, 0 )
// Scale r1.w to [0,1] from near to far clip planes
MAD r1.w, r1.w, c[CV_NEARFAR].x, c[CV_NEARFAR].y
// oT0 = ( [0,1], [0,2^L], 0, 0 ) + ( 0, 0, 1, 1 )
MAD oT0.xyzw, r1.w, c[CV_RAMPSCALE].xyww, c[CV_RAMPSCALE].wwxx
// oT1 = ( [0,2^(2L)], 0, 1, 1 )
MAD oT1, r1.w, c[CV_RAMPSCALE].zwww, c[CV_RAMPSCALE].wwxx

Three separate color ramp textures could be used, but it is better to combine the<br />

red and green ramps into a single texture, where red is accessed with the X coordinate<br />

and green is accessed with the Y coordinate. This saves a texture fetch and<br />

math operation in the pixel shader that fetches the color ramps. The blue ramp is<br />

not merged with red and green because that would entail using a 3D volume texture.<br />

For simple color ramps, such a texture is wasteful. As we see later, the blue<br />

bits are special, and we can use a larger 2D dithered color ramp to dither the least<br />

significant bit of depth information. To form the RGB-encoded depth value and<br />

output it to a texture render target, we use the following pixel shader code, PS 1,<br />

which adds the red-green color from T0 to the blue color from T1:<br />

ps.1.1<br />

TEX t0 // read red+green texture ramp<br />

TEX t1 // read blue texture ramp<br />

ADD r0, t0, t1 // output RGB-encoded value<br />

These simple shader fragments and the small RGB ramp textures are all we need to render functions of depth and distance for any object. Vertex positions can be animated in software or in a vertex shader, and the depth encoding behaves correctly in all cases. In practice, more operations are added to these shaders to compare depth-encoded objects to previously rendered RGB-encoded values, but there is more about that later.
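As a concrete illustration of the ramps described above, the following C++ sketch fills the combined red+green ramp and the separate blue ramp for a given L. The BGRA byte packing and the BuildRampTextures name are assumptions made only for this example; creating the Direct3D textures, uploading the data, and enabling texture wrapping are left to the application.

// Sketch: fill the combined red+green ramp (2^L x 2^L texels, X = red, Y = green)
// and the blue ramp (2^L x 1 texels). Each texel stores its own L-bit index
// directly in the low bits of the 8-bit channel, as described in the text.
#include <vector>
#include <cstdint>

struct RampTextures
{
    std::vector<uint32_t> redGreen;   // (1 << L) * (1 << L) texels, BGRA packed
    std::vector<uint32_t> blue;       // (1 << L) texels, BGRA packed
};

RampTextures BuildRampTextures(int L)
{
    const int n = 1 << L;             // texels per ramp, e.g. 16 for L = 4
    RampTextures t;
    t.redGreen.resize(n * n);
    t.blue.resize(n);

    for (int y = 0; y < n; ++y)
        for (int x = 0; x < n; ++x)
        {
            // Texel (x, y) holds red = x and green = y, each an L-bit integer.
            const uint32_t r = static_cast<uint32_t>(x);
            const uint32_t g = static_cast<uint32_t>(y);
            t.redGreen[y * n + x] = (0xFFu << 24) | (r << 16) | (g << 8);
        }

    for (int x = 0; x < n; ++x)
        t.blue[x] = (0xFFu << 24) | static_cast<uint32_t>(x);   // blue channel only

    return t;
}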

Decoding RGB-encoded Values


There are several advantages to this scheme of encoding high-precision values. Unlike exponentiation (RGBE) or multiplicative (RGBM) encoding, RGB-encoded values can be added by simply adding each color channel. An offset can be added to each color channel in order to store negative values in a biased state, and decoding the values can be done in a single dot product operation. Hardware that supports Microsoft's D3D8 pixel shaders version 1.1 or higher can perform the decode in one shader instruction.

The RGB-encoding scheme spreads the bits into each color channel by dividing the middle and high bit ranges by a scale factor or shifting the bits down to begin from zero. Decoding the values is simply a matter of multiplying each color channel by that same scale factor (shifting the bits back up to their original values) and adding the shifted values together. This is accomplished by a single dot product, as illustrated in equation 1.1, where V_Decoded is the decoded value, T0 is the RGB-encoded value, C is a constant vector of scale values for each channel, and scale is an arbitrary scale factor that can be used as a convenient control to adjust the output to some meaningful or more pleasant visual range.

( C.x, C.y, C.z ) = scale * ( 1.0, 1/2^L, 1/2^(2L) )

V_Decoded = C.x * T0.red + C.y * T0.green + C.z * T0.blue
          = C DOT T0                                         (1.1)

The multipliers, C, for each channel depend on the number of bits, L, we chose from each channel to encode the values. Typically, four or five bits are used from each channel, so the values 1/2^L and 1/2^(2L) may be small. For example, with L=5 we have C = ( 1, 1/32, 1/1024 ). Since the values may be small, the dot product must be executed at high precision in the pixel shader.

Pixel shader versions 1.1 to 1.4 have two classes of operations: texture addressing operations and arithmetic operations. Arithmetic operations can be executed at 9-bit precision per color component. This is not sufficient to hold small scale values like 1/1024, so we should use the texture addressing operations, which must be performed at floating-point precision. If we're working with pixel shaders 2.0 and higher, all operations are executed at floating-point precision, so the ps.2.0 arithmetic operations can be used. Shader program fragments to perform the decode for ps.1.3 and ps.2.0 are shown in listings PS 2 and PS 3. In these shaders, a dot product operation generates a texture coordinate, which we use to access a color ramp texture. This maps the RGB-encoded values (depth, thickness, etc.) to any colors we like, and it provides a convenient way for artists to control the look of the volume objects.

// Vertex shader pseudocode to set up values
// for the pixel shader below
vs.1.1
MOV oT0, SCREEN_COORDS                  // map the texture to the screen
MOV oT1, scale * ( 1, 2^(-L), 2^(-2L) )

ps.1.3
TEX t0           // read the RGB-encoded value
TEXDP3 t1, t0    // decode and output the color from the t1 texture
MOV r0, t1       // output the color value

ps.2.0
// c0 = scale * ( 1, 2^(-L), 2^(-2L) )
dcl t0.xyzw
dcl_2d s0              // a texture with RGB-encoded values
dcl_2d s1              // a color ramp mapping value to color

TEXLD   r0, t0, s0     // read the RGB-encoded value
DP3_SAT r0, r0, c0     // decode the RGB-encoded value
TEXLD   r0, r0, s1     // convert the value to a color
MOV     oC0, r0        // output the color

These shader fragments work for RGB-encoded values with both positive and negative values in the color channels. They also work for RGB-encoded values that result from adding or subtracting two or more encoded values. As you'll see in the next section, two sums of RGB-encoded values are easily subtracted and decoded. The result of decode(A) - decode(B) is identical to the result of decode(A - B), which is very convenient! At each pixel we can easily calculate the front face and back face depth sums, subtract them to get an RGB-encoded thickness value, and convert this into a color for the volume object.
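The decode and its additive property are easy to check on the CPU. The following sketch works on the raw integer byte values written into the ramps; in the shader the channels arrive as [0,1] colors, which is why the decode constant there becomes scale * (1, 2^-L, 2^-2L). Encode, Decode, and CheckAdditiveProperty are illustrative names, not part of the demo code.

// Sketch: CPU reference of the RGB encode/decode, with the most significant
// L bits in red and the least significant in blue.
#include <cassert>

struct EncodedRGB { int r, g, b; };

EncodedRGB Encode(int depth, int L)             // depth in [0, 2^(3L))
{
    const int mask = (1 << L) - 1;
    return { (depth >> (2 * L)) & mask,         // red   = high bits
             (depth >> L)       & mask,         // green = middle bits
              depth             & mask };       // blue  = low bits
}

int Decode(const EncodedRGB& e, int L)
{
    return (e.r << (2 * L)) + (e.g << L) + e.b; // the dot product of equation 1.1
}

void CheckAdditiveProperty()
{
    const int L = 4;
    EncodedRGB a = Encode(1000, L), b = Encode(350, L);
    // Differences of encoded values decode to the difference of the originals,
    // as long as per-channel carries/borrows are not clipped away.
    EncodedRGB d = { a.r - b.r, a.g - b.g, a.b - b.b };
    assert(Decode(d, L) == 1000 - 350);
}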

Rendering Thick Volumes with Nothing Intersecting the Volumes


Applying RGB encoding to the method of computing an object's thickness gives us a way to render ordinary polygonal objects as thick volumes of material. This section gives a step-by-step discussion of the rendering for volumes in free space where no opaque objects intersect the volumes. A few issues related to each step are also presented, and I focus on the ps.2.0 implementation. The ps.1.3 implementation is almost identical, and information about the differences is included in the demo source code. Situations where objects intersect the volumes are far more common; these are covered in the next section.

To render volumes in free space, two off-screen texture render targets are needed. These render targets could be a lower resolution than the ordinary back buffer, but if the rendered volumes have high-contrast edges, the render targets should match the back buffer size to reduce aliasing. These color render targets may or may not need an associated depth buffer. That depends on whether or not the solid objects in the scene can occlude the volumes, and on the geometry used to render the final volume object color into the back buffer. A simple approach that handles occlusion is to use a depth buffer and render the final color with a large quad covering the entire screen. In that case, rendering proceeds as follows:

1. Render the scene to the color and depth buffers as you normally would with no volumetric objects.

2. Switch to an off-screen texture render target, which I call the "back faces" target. In Direct3D, the depth buffer can be shared between the back buffer and the texture render target if they are the same size and multisample type. Otherwise, the depth of occluders in the scene needs to be rendered again into a separate depth buffer that matches the texture render target. Clear the render target to black, set the cull mode to render volume object back faces, set the depth test to less-equal, and disable depth writes. Render the RGB-encoded depth of all back faces of the volume objects with additive blending to the color target. Where several back faces overlap, the encoded depths will be added to form a sum of all back face depths at each pixel. The result will be similar to Figure 1c, where colors are shown overbright to better illustrate their values.

3. Switch the color render target to the other off-screen render target, which I call the "front faces" target. Clear it to black and render the RGB-encoded depth of all volume object front faces to create the front face depth sum. The result will be similar to Figure 1d.

4. If using hardware that supports pixel shaders 2.0, switch to the ordinary color and depth back buffers from step 1. Disable depth testing. Render a single quad covering the entire back buffer with the pixel shader listed in PS 4. This shader builds on the shader from listing PS 3. It samples the depth sums in the "back faces" texture, samples the "front faces" texture, computes the object thickness, and converts the thickness to the volume object color. It converts the thickness to a color value using an arbitrary color ramp texture bound to sampler s2. The color ramp can be created by an artist or computed from a mathematical model of light scattering. For hardware that supports only pixel shaders 1.3, an extra pass and texture render target are needed to compute the RGB-encoded thickness value and supply it to the shader PS 2 listed above.

ps.2.0
// c0 = scale * ( 1, 2^(-L), 2^(-2L) )
dcl v0.xyzw
dcl t0.xyzw
dcl_2d s0              // back face depth sum
dcl_2d s1              // front face depth sum
dcl_2d s2              // color ramp

TEXLD   r0, t0, s0     // back face depth sum
TEXLD   r1, t0, s1     // front face depth sum
ADD     r0, r0, -r1    // RGB-encoded thickness = back - front
DP3_SAT r0, r0, c0     // decode to a floating-point coordinate
TEXLD   r0, r0, s2     // convert the thickness to a fog color
MOV     oC0, r0
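For orientation, here is a minimal D3D9-style sketch of the four steps, assuming the default convention where front faces are wound clockwise. DrawVolumeObjects and DrawFullScreenQuad stand in for application code, the occluder depth handling from step 2 is omitted, and only the state setup reflects the technique itself.

// Sketch of steps 1-4 as a render loop; not the demo's actual code.
#include <d3d9.h>

void DrawVolumeObjects(IDirect3DDevice9* dev);    // application code (placeholder)
void DrawFullScreenQuad(IDirect3DDevice9* dev);   // application code (placeholder)

void RenderVolumes(IDirect3DDevice9* dev,
                   IDirect3DSurface9* backFaceRT,  IDirect3DTexture9* backFaceTex,
                   IDirect3DSurface9* frontFaceRT, IDirect3DTexture9* frontFaceTex,
                   IDirect3DSurface9* backBuffer,  IDirect3DTexture9* thicknessRamp,
                   IDirect3DPixelShader9* encodeDepthPS,        // PS 1 above
                   IDirect3DPixelShader9* thicknessToColorPS)   // PS 4 above
{
    // Step 1 (normal scene render to the back buffer) is assumed to have happened.

    // Common state: additive blend of RGB-encoded depths, no depth writes.
    dev->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
    dev->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_ONE);
    dev->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_ONE);
    dev->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);
    dev->SetRenderState(D3DRS_ZFUNC, D3DCMP_LESSEQUAL);
    dev->SetPixelShader(encodeDepthPS);

    // Step 2: sum of back-face depths (cull the clockwise front faces).
    dev->SetRenderTarget(0, backFaceRT);
    dev->Clear(0, NULL, D3DCLEAR_TARGET, 0x00000000, 1.0f, 0);
    dev->SetRenderState(D3DRS_CULLMODE, D3DCULL_CW);
    DrawVolumeObjects(dev);

    // Step 3: sum of front-face depths.
    dev->SetRenderTarget(0, frontFaceRT);
    dev->Clear(0, NULL, D3DCLEAR_TARGET, 0x00000000, 1.0f, 0);
    dev->SetRenderState(D3DRS_CULLMODE, D3DCULL_CCW);
    DrawVolumeObjects(dev);

    // Step 4: decode the thickness and convert it to a color over the whole screen.
    dev->SetRenderTarget(0, backBuffer);
    dev->SetRenderState(D3DRS_ZENABLE, FALSE);
    // Set whatever blend mode the volume color should use into the scene (omitted).
    dev->SetTexture(0, backFaceTex);     // s0: back face depth sum
    dev->SetTexture(1, frontFaceTex);    // s1: front face depth sum
    dev->SetTexture(2, thicknessRamp);   // s2: thickness-to-color ramp
    dev->SetPixelShader(thicknessToColorPS);
    DrawFullScreenQuad(dev);
}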

An alternate approach is to use the volume object geometry instead of a full-screen quad to drive the computation of volume object color and rendering to the back buffer. The choice depends on the coverage and complexity of the volume objects, and you can switch between methods depending on the viewpoint and performance. This geometry provides the appropriate pixel coverage on screen. It creates pixels over an area so the pixel shader receives input and can perform the computations. If we use a simple full-screen quad, we waste fill rate rendering pixels where there is no volume thickness. If we use the volume objects themselves, we might reduce the fill rate by drawing pixels only where the volumes are, but we could spend more time transforming vertices or passing over pixels more than once where the depth complexity of the volume objects is greater than one. Since alpha blending is used to blend the volume's color into the scene, the depth complexity is important. If we use the volume object geometry, we need to enable a stencil or destination alpha test to avoid blending the volume color more than once at each pixel where the depth complexity might be greater than one.

If we use the volume objects to drive the processing, we need a shader that projects the front and back face depth sum textures from steps 2 and 3 onto the volume object geometry so that the pixel shader receives the correct values for each point on screen. This is simply a matter of turning the screen-space position into a texture coordinate ranging over [0,1] across the full screen. A Direct3D vertex shader code fragment for this is listed in VS 3. This code is also used in handling solid objects that may intersect the volumes, since it can project rendered texture information at each pixel onto the same pixels as they are rendered again, regardless of the shape of the geometry. The code is useful in many multipass approaches, so it's good to keep in mind for other effects.

vs.1.1
// Transform the position to clip space and output it
DP4 r1.x, V_POSITION, c[CV_WORLDVIEWPROJ_0]
DP4 r1.y, V_POSITION, c[CV_WORLDVIEWPROJ_1]
DP4 r1.z, V_POSITION, c[CV_WORLDVIEWPROJ_2]
DP4 r1.w, V_POSITION, c[CV_WORLDVIEWPROJ_3]
MOV oPos, r1

// Convert the geometry screen position to a texture coordinate,
// so we can project previously rendered textures to the same
// pixels on screen for any geometry.
// CV_CONSTS_1 = ( 0.0, 0.5, 1.0, 2.0 )
MUL r1.xy, r1.xy, c[CV_CONSTS_1].yyyy

// Add w/2 to x,y to shift from (x/w,y/w) in the
// range [-1/2,1/2] to (x/w,y/w) in the range [0,1]
MAD r1.xy, r1.wwww, c[CV_CONSTS_1].yyyy, r1.xy

// Invert the y coordinate by setting y = 1-y
// Remember, w != 1, so 1.0 really equals 1*w
// and we compute y = 1*w - y
ADD r1.y, r1.w, -r1.y

// Add a half-texel offset to sample from texel centers,
// not texel corners (a D3D convention)
// Multiply by w because w != 1
MAD r1.xy, r1.wwww, c[CV_HALF_TEXEL_SIZE], r1.xy

// Output to texture coordinate t0
MOV oT0, r1
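Written out in plain C++ after the perspective divide (which the projective texture read performs per pixel), the mapping is just a scale, a bias, a vertical flip, and a half-texel offset. halfTexelU/V correspond to the CV_HALF_TEXEL_SIZE constant above, and the function is purely illustrative.

// Sketch: clip-space position to [0,1] screen texture coordinates.
inline void ClipToTexCoord(float x, float y, float w,
                           float halfTexelU, float halfTexelV,
                           float& u, float& v)
{
    u = 0.5f * (x / w) + 0.5f + halfTexelU;   // [-1,1] -> [0,1], texel-center offset
    v = 0.5f - 0.5f * (y / w) + halfTexelV;   // y is flipped for texture space
}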


The steps above work for any camera position in the scene. The camera can move through the volume objects and the rendering remains correct, with a volume thickness contribution from only the part of the volume in front of the near clip plane. This is a consequence of choosing our RGB depth encoding to start from 0 at the near clip plane. Without this, the volume object's polygons would have to be clamped or capped at the near plane so their depth values are not clipped away.

These steps require the volume objects to be closed hulls. There can be no back face without a corresponding front face, and vice versa. The depth complexity of the volume objects must also be kept below the limit for adding the RGB-encoded values together. The front and back faces are summed to separate render targets, so the depth complexity limit applies only to the number of back or front faces, not to the total of front and back faces. In practice, using 12-bit encoding (L=4) allows 16 front and back faces (see the table in Figure 3), which is more than enough for interesting scenes. The approach also sums the thicknesses of overlapping volumes, so where two volumes intersect, the thickness will be greater. Everitt's technique of depth peeling [Everitt02] could be applied to eliminate the overlapping areas, but this would require several more passes and might be too costly in terms of performance.

Handling Solids Intersecting the Volumes

Often, we want opaque objects to pass through the volume objects, or we want to place the volumes in a scene without having to worry about their polygons being clipped away by solid objects. Handling the areas where solid objects intersect the volumes is key to making the technique easy to use. Fortunately, even complex intersection cases are easy to handle using one additional render target texture and a comparison of RGB-encoded values in the pixel shader.

If an opaque object cuts through a volume object's hull, we have to use the depth to the opaque object instead of the depth to the volume object faces that are occluded by the opaque object. This ensures that we get thickness contributions for only the part of the volume that is visible in front of the opaque object. This can be accomplished by doing a comparison of depth values in the pixel shader as each face of the volume objects is rendered. The pixel shader that created the RGB-encoded depth value (as in listing PS 1) can also sample a texture containing the solid object depth. This texture holds the RGB-encoded depth of the solid object closest to the near plane. The pixel shader compares the RGB-encoded depth of the volume object pixel being rendered to the encoded depth of the nearest solid object and outputs whichever value is the lesser depth. This allows both the back faces and front faces of volume objects to be occluded by the solid objects. It is an efficient way to handle any and all solids that might penetrate the volumes, and it handles complex volumes of any shape or depth complexity. Where a volume object face goes inside or behind a solid object, its depth contribution becomes the depth of the solid object. The volume object is effectively clamped to always begin from the solid object, and the varying depth complexity of concave or folded volume objects is handled correctly in all cases. This is illustrated in Figure 6.

Figure 6: Handling opaque objects intersecting the volume objects. The volume geometry is shown with a dotted line. A pixel shader compares the volume object depth to the solid object depth read from a texture and outputs the lesser depth value. This results in depth information being taken from the geometry shown with solid lines. The depth comparison clamps the occluded volume object pixels to the nearest solid object depth, effectively limiting the volume object thickness to the proper amount for all intersection cases.

To implement this approach, first render a texture to hold the RGB-encoded depth of the nearest part of any opaque objects that may penetrate or occlude the volume objects. There is no need to render all the opaque objects in the scene to this texture, and the regions of intersection do not have to be computed. Next, render the volume object faces according to steps 2 and 3 from the previous section, but in the pixel shader, sample the solid object depth texture, perform the depth comparison, and output the lesser depth. The depth is either the opaque object depth read from the texture or the depth of the volume object face at that pixel. Shaders to perform this on ps.1.3 and ps.2.0 hardware are shown in listings PS 5 and PS 6.


The depth comparison barely fits within the limits of a ps.1.3 shader. Unfortunately, Direct3D API restrictions require an additional instruction slot for the CMP instruction on ps.1.1 hardware, so this comparison can't be expressed in a ps.1.1 shader. Also note that the ps.1.3 shader can't decode the RGB-encoded values to a high-precision scalar, so it relies on comparing each R, G, and B channel separately. It scales and clamps each R, G, and B difference to [-1,1]. The ps.1.3 comparison will not work for RGB-encoded values where the high carry bits are used, but this doesn't present a problem. Additional comments are provided in the demo's shader source code. Since the ps.2.0 shader operates at floating-point precision, it is simpler and can handle values where the carry bits are on.

ps.1.3
// RGB-encoded depth comparison.
// Outputs the lesser RGB-encoded value.
// Requires saturation to the [-1,1] range.

// Weights for each of the RGB channels
DEF c7, 1.0, 0.66, 0.31, -0.66
// CMP uses >= 0.0, so use this small bias to get a
// "less than zero" comparison
DEF c6, -0.01, -0.01, -0.01, -0.01

TEX t0                 // red+green ramp texture
TEX t1                 // blue ramp texture
TEX t3                 // depth of solid objects
ADD t2, t0, t1         // add R+G+B to make the depth value

// Difference between the pixel depth and the solid object depth.
// Use *4 to increase the contrast. The goal is to saturate
// each R,G,B channel of the signed number to the values
// -1, 0, or +1.
ADD_x4 r1, -t3, t2     // diff * 4
ADD_x4 r1, r1, r1      // diff * 32
ADD_x4 r1, r1, r1      // diff * 256

// DP3 the saturated difference with the c7 weights.
// The result is positive, negative, or zero depending on the
// difference between the high-precision values that t3 and t2
// represent.
DP3_x4 r1, r1, c7

// Subtract a small value from r1
ADD r1, r1, c6

// Compare the r1 decision value to 0. If r1 is positive,
// output t3; otherwise output t2.
CMP r0, r1, t3, t2

ps.2.0
// Comparison of RGB-encoded depths.
// Outputs the lesser RGB-encoded value.
// c0 = scale * ( 1, 2^(-L), 2^(-2L) )
dcl t0.xyzw
dcl t1.xyzw
dcl t3.xyzw
dcl_2d s0              // red+green ramp texture for the depth encode
dcl_2d s1              // blue ramp texture for the depth encode
dcl_2d s3              // RGB-encoded depth value of the nearest solid

TEXLD  r0, t0, s0      // red+green part of the depth encoding
TEXLD  r1, t1, s1      // blue part of the depth encoding
ADD    r0, r0, r1      // depth of the volume object's pixel
TEXLDP r1, t3, s3      // RGB-encoded depth from the texture at s3

// RGB-encoded difference
ADD r2, r0, -r1
// Decode to a positive or negative value
DP4 r2, r2, CPN_RGB_TEXADDR_WEIGHTS
// Choose the lesser value: r2 >= 0 ? r1 : r0
CMP r3, r2.xxxx, r1, r0
MOV oC0, r3

Dithering the Low Bit

In this technique, depth is represented by discrete values, so aliasing can appear in the thickness values. This aliasing is always present. Even at high precision, it can be noticeable, especially if thin objects have their thickness multiplied by a large scale factor in order to generate some visible contribution. Depth aliasing appears as sharp transitions between light and dark in the color of the rendered volume, and it is shown in Figures 7a and 7b. Luckily, there is a painless way to dither the lowest bit of depth information, which breaks the sharp bands into dithered transitions that appear smooth. The results are shown in Figures 7c and 7d.



Figure 7: Artifacts of depth aliasing in severe cases are shown in a and b. Low depth precision is used to accentuate the effects. Dithering the depth information breaks the artifacts into gradual noisy transitions, which appear smooth, as shown in c and d.

The dithering is very easy to implement. The lowest bit of depth, held in the blue color channel, is read from a small color ramp texture, as described above. To dither the depth information, all we have to do is dither the color ramp texture! As long as the texture coordinate used to access the color ramp is not aliased, this approach works very well. The texture coordinate is a high-precision floating-point value with more precision than the 12 or 15 bits of depth precision typically used. The demo source code has a few functions to create dithered color ramps. To access the dither pattern, a 2D texture coordinate is used instead of the 1D coordinate oT1 (.x only) of listing VS 2. The dither pattern varies in the second (.y) coordinate direction, so the Y coordinate can vary randomly to create gradual dithered transitions. There is some subtlety involved in wrapping the dither pattern in the X direction from one end of the texture to the other so that dithering continues where the color ramp repeats. This involves using the alpha channel of the blue ramp to hold a value that represents a negative increment to the next highest (green) bit.
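A minimal sketch of such a dithered blue ramp is shown below: each row is the basic 0..2^L-1 ramp with a random -1/0/+1 offset per texel, so varying the Y coordinate dithers the least significant bit. The wrap-around handling through the alpha channel described above is deliberately omitted, and the function name is only illustrative; the demo's own ramp functions are more complete.

// Sketch: build a dithered 2D blue ramp (width = 2^L, "rows" rows of jitter).
#include <vector>
#include <cstdint>
#include <cstdlib>

std::vector<uint8_t> BuildDitheredBlueRamp(int L, int rows)
{
    const int width = 1 << L;
    std::vector<uint8_t> texels(width * rows);
    for (int y = 0; y < rows; ++y)
        for (int x = 0; x < width; ++x)
        {
            int jitter = (std::rand() % 3) - 1;              // -1, 0, or +1
            int value  = x + jitter;
            if (value < 0)         value = 0;                // clamp at the ends instead
            if (value > width - 1) value = width - 1;        // of borrowing from green
            texels[y * width + x] = static_cast<uint8_t>(value);
        }
    return texels;
}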

The dithered depth information can be used to create interesting noise effects in the volume rendering. As the depth precision is lowered, the dithered thickness becomes progressively noisier. At just a few bits of depth precision, the volume shape and thickness remain recognizable, and the rendering appears as though it were coming from a noisy video source. An example is shown in Figure 8.



Figure 8: A curious volume noise effect is produced by an object spanning only three bits of depth precision. The depth-aliased rendering is shown in a, where only a few bit-increments of thickness occur. Dithering the least significant bit of depth during depth encoding gives the result in b, which is a close but noisy match to the actual volume.

Negative Thicknesses

Objects can be treated as contributing negative thickness to the volume rendering. This could be used to render shafts of shadow cutting through a volume or to modulate the thickness of objects. It is accomplished by simply reversing the front and back faces for negative objects. To render an object as subtracting from the total thickness, its front-facing triangles are rendered to the "back faces" depth sum texture, and the back-facing triangles are rendered to the "front faces" depth texture. This approach is not robust and works correctly only if the negative objects lie entirely within a positive thickness. If two negative volumes overlap or if a negative volume extends outside of a positive volume, they will still subtract from the total thickness. This would create areas of over-subtraction, where the total thickness ends up being too thin. For some situations, the effects of this are not problematic, but to properly handle such cases, Everitt's technique of depth peeling [Everitt02] could be used. This requires several more rendering passes and would have a substantial impact on frame rate.

Additional Thoughts

For slower hardware, you may want to use a single texture for both the front face and back face depth sums. This texture would begin from a mid-range value, such as RGB = (128,128,128). The back face depths are added to this, and the front face depths are subtracted using subtractive frame buffer blending. The advantage is that one less non-power-of-2 texture needs to be read. The disadvantage is that the depth complexity that can be handled in a single pass for a given choice of RGB-encoding precision is half of what it is when two separate render target textures are used.
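A sketch of the blend states for this single-target variant follows, assuming D3D9 and hardware support for the subtractive blend operation; DrawVolumeBackFaces and DrawVolumeFrontFaces are placeholders for application code.

// Sketch: add back-face depths to the mid-gray target, subtract front-face depths.
#include <d3d9.h>

void DrawVolumeBackFaces(IDirect3DDevice9* dev);    // application code (placeholder)
void DrawVolumeFrontFaces(IDirect3DDevice9* dev);   // application code (placeholder)

void RenderDepthSumsSingleTarget(IDirect3DDevice9* dev)
{
    dev->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
    dev->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_ONE);
    dev->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_ONE);

    // Back faces: encoded depths are added to the mid-range starting value.
    dev->SetRenderState(D3DRS_BLENDOP, D3DBLENDOP_ADD);
    DrawVolumeBackFaces(dev);

    // Front faces: encoded depths are subtracted (result = dest - src).
    dev->SetRenderState(D3DRS_BLENDOP, D3DBLENDOP_REVSUBTRACT);
    DrawVolumeFrontFaces(dev);
}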

The method of RGB encoding can be extended to higher precision and depth complexity by spreading the bits across multiple render targets. For example, you could use two RGBA-8 surfaces to encode 64-bit values. Older hardware will run out of precision in the texture coordinate interpolation used to read the color ramp textures, but newer hardware can calculate the color ramp values in the pixel shader itself.

The method of rendering RGB-encoded depths to a texture and projecting these back onto objects can be used to implement shadow mapping. Depths can be rendered from the point of view of a light, and any number of passes can be used to implement complex and customizable filtering of the shadow map comparisons. This could provide high-quality hardware-accelerated shadow mapping for near-real-time applications.

Going a step further, the thickness through an object to a light source can be computed and used to render translucent materials like jade or to approximate the appearance of sub-surface scattering. It is also possible to render shadows from semitransparent objects, where the shadow darkness depends on the thickness through the objects.

Conclusion

Programmable shaders and a few render-to-texture passes are all that is needed to render ordinary polygon objects as thick volumes of light-scattering material. This article presented an efficient and direct means to render and accumulate high-precision values using 8-bit-per-component render targets, handle objects intersecting and occluding any volume object shape, and eliminate aliasing artifacts. The approach works for any viewpoint in the scene, and it is trivial to animate the volume geometry. The technique can be used on the large installed base of Direct3D8 ps.1.3 hardware. When hardware supports additive blending to floating-point render targets, the method of RGB encoding can be abandoned to simplify the implementation.

Using thickness to determine the appearance of objects offers exciting new possibilities for real-time interactive rendering. Scenes can be filled with dynamic wisps of fog, clouds, and truly volumetric beams of light that are easily created and controlled. Intuitive controls and color ramps govern the appearance of the volume objects, though more sophisticated treatments of scattering could also be employed.



Example Code

Example code and additional images for this technique and others are available from nVidia's developer web site:
http://developer.nvidia.com
http://developer.nvidia.com/view.asp?IO=FogPolygonVolumes

References

[Baker02] Baker, Dan, VolumeFog D3D example, Microsoft D3D8.1 and D3D9 SDKs, http://www.microsoft.com.

[Everitt02] Everitt, Cass, "Order Independent Transparency," http://developer.nvidia.com/view.asp?IO=order_independent_transparency.

[Hoffman02] Hoffman, Naty and Kenny Mitchell, "Photorealistic Real-Time Outdoor Light Scattering," Game Developer magazine, CMP Media, Inc., Vol. 9 #8, August 2002, pp. 32-38.


Screen-aligned Particles with Minimal VertexBuffer Locking

O'dell Hicks

Terminology

A particle is the smallest component of a particle system, or group of common particles. A spark in a spark shower, a snowflake in a blizzard, and a puff of smoke from a campfire are examples of particles. The puff of smoke is a good example of a particle being the smallest component. In real life, smoke consists of microscopic particles floating in the air. Unless you are doing an extremely complex simulation of smoke, this is obviously too small, so we need to make a reasonable approximation. A single, wispy puff often works as the smallest element. In this article, particles are treated as screen-aligned planar geometry (in our case, quads composed of two triangles), but they can also be geometrically complex objects.

A particle system is basically a common behavior pattern for particles. In general, it will define texture, color, size, motion, and other attributes. A particle system binds particles into functionality, such as campfire smoke or a blizzard.

A particle emitter is an object that initializes new particles (or recycles old ones that have gone through their life cycle), usually with physical data from the emitter. For example, a particle emitter attached to the tail of a missile would pass along the current position and direction of the missile for a more realistic emitting of smoke.

Aligning to the Screen

Screen-aligned particles, or billboards, always face the viewport in the same way so that a different perspective of them is never seen, regardless of view orientation. This gives the illusion of the particles having a three-dimensional volume. A circle appears to be a sphere (a solid of revolution), and a 2D image of smoke becomes a 3D cloud. At this point, you may be thinking that a solution already exists in recent video cards: point sprites. While they can be useful in limited examples from hardware vendors, at the time of publication, they leave something to be desired. The biggest issue, in my opinion, is that the largest they can be is 64x64 pixels in screen space. Imagine playing a game at a resolution of 1280x1024 pixels. In the distance, smoke from the campfire will look good, but as you get close, the smoke will begin to look weird. When near enough to walk through the smoke, it will appear as odd-looking, sparse puffs. Another limitation is that particles can't be rotated about the view's forward axis.

So, how do we go about manually aligning our billboards? The answer lies in the view transformation matrix.

Figure 1: The orientation vectors of the view transformation matrix

As Figure 1 shows, the orientation vectors of the view transform can be directly retrieved from the matrix.

NOTE: You may wish to reference a general-purpose 3D text to better understand why we can use these components of the view matrix directly. Such an explanation is beyond the scope of this book.

Figure 2: Billboard drawn normally

Of interest are the Right and Up vectors, the first three components of the first and second columns, respectively. Screen alignment is a two-axis operation not requiring the forward vector, as it would only affect the distance from the camera and not alignment.

NOTE: This applies to row-major matrices as used by Direct3D. Other graphics APIs may use column-major matrices, so the orientation extraction must be changed accordingly.
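As a sketch, extracting those axes and offsetting one billboard corner looks like this in C++ with D3DX types; ExpandCorner and its parameters are illustrative, and the vertex shader later in this article does the same thing with the vectors supplied in constants c20 and c21.

// Sketch: pull Right and Up out of a row-major D3D view matrix (first three
// elements of the first and second columns) and push a vertex away from the
// particle center along them.
#include <d3dx9.h>

D3DXVECTOR3 ExpandCorner(const D3DXMATRIX& view, const D3DXVECTOR3& center,
                         float rightFactor, float upFactor)
{
    D3DXVECTOR3 right(view._11, view._21, view._31);   // first column  = Right
    D3DXVECTOR3 up   (view._12, view._22, view._32);   // second column = Up
    return center + right * rightFactor + up * upFactor;
}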


If each of the four billboard vertices (in world space) were offset from the center of the billboard along the Up and Right axes, they would form a square, as shown in Figures 3 and 4.

Figure 3: Billboard drawn aligned to the viewport
Figure 4: Four vertices of a billboard projected along the view's Up and Right vectors

The unused forward vector can be used for a test, just to prove alignment. If the normal of the plane formed by the four projected vertices were calculated, the dot product of it and the forward vector would show that they are parallel for any view orientation.

The Particle Vertex Shader

Unless your particles are static and the camera never moves, this alignment will have to be recalculated every frame. Particles are generally very active, so various attributes such as color and size are also likely to change. Before vertex shaders, all of this required locking the vertex buffer and modifying vertices every frame. But with a bit of planning and special use of vertex data, we can greatly reduce the amount of locking. Very complex particle systems will require locking more often, but many effects can simply be driven by a time value set in a vertex shader constant. For now, we focus on the alignment. The following is a simple vertex structure for a particle:

struct ParticleVertex
{
    float x, y, z;                  // vertex position
    unsigned int diffuse;           // color
    float tu, tv;                   // one texture coordinate set
    float rightFactor, upFactor;    // custom data for projecting vertices
};

The members rightFactor and upFactor are each a combination of two values: half the width and half the height, respectively, of the billboard, and the direction (positive or negative; see Figure 2) along the Right and Up vectors in which the vertex is projected. It is half the width and height because on each axis, vertices are projected along a unit vector positively and negatively, resulting in a doubled width and height. Alternatively, you could multiply the Up and Right vectors by 0.5f. Each particle has four vertices, and all vertices must be set up properly, as demonstrated by the following pseudocode:

For each particle
{
    For all four vertices in this particle       // set common vertex data
    {
        vertex.xyz = particle.center;            // all vertices start at the center of the particle
        vertex.diffuse = color;
    }

    // now set unique data
    Set UVs

    // vertices are referred to as upperLeft, upperRight, lowerLeft, and lowerRight
    // to indicate their position relative to the particle's center
    vertex[upperLeft].rightFactor  = -1.0f * halfBillboardWidth;
    vertex[upperLeft].upFactor     =  1.0f * halfBillboardHeight;

    vertex[upperRight].rightFactor =  1.0f * halfBillboardWidth;
    vertex[upperRight].upFactor    =  1.0f * halfBillboardHeight;

    vertex[lowerRight].rightFactor =  1.0f * halfBillboardWidth;
    vertex[lowerRight].upFactor    = -1.0f * halfBillboardHeight;

    vertex[lowerLeft].rightFactor  = -1.0f * halfBillboardWidth;
    vertex[lowerLeft].upFactor     = -1.0f * halfBillboardHeight;
}

With this vertex data set up properly, we then set some constants and our shader (as well as appropriate render states). I have my vertices in world coordinates, but you may have them in a local object space if desired. Just remember to first transform the position by the world matrix before projecting along the view vectors. In my shader, c0 through c4 contain the concatenated, transposed view and projection transformations. c20 is the view's Right vector, and c21 is the Up vector. The fourth float (w component) of both constants is set to 0.0f.

vs_1_1   ; vertex shader 1.1

#define VIEW_PROJECTION_TRANSPOSED c0
#define VIEW_RIGHT_VECTOR c20
#define VIEW_UP_VECTOR c21

dcl_position v0    ; vertex position in register v0
dcl_color v1       ; vertex color in register v1
dcl_texcoord v2    ; vertex texture coordinates in register v2
dcl_texcoord1 v3   ; custom data, albeit declared as a texture coordinate (see DX9 SDK)

; right: r0 = view.Right * vertex.rightFactor
mul r0, VIEW_RIGHT_VECTOR, v3.x
; up: r1 = view.Up * vertex.upFactor
mul r1, VIEW_UP_VECTOR, v3.y

; final world position = position + up projection + right projection
add r2, r0, r1
add r2, v0, r2

; transform to homogeneous clip space
m4x4 oPos, r2, VIEW_PROJECTION_TRANSPOSED

; set diffuse and texture coordinates
mov oD0, v1
mov oT0.xy, v2

We now have particles that are aligned properly every frame, without having to touch the vertex buffer! But since they are static, they are rather dull and nearly useless. Let's look at a more complex particle structure:

struct DynamicParticleVertex
{
    float x, y, z;                  // vertex position
    float tu, tv;                   // one texture coordinate set
    float rightFactor, upFactor;    // custom data for projecting vertices
    float velocityX, velocityY, velocityZ;
    unsigned int beginningColor, endingColor;
};

A non-looping particle effect driven by a normalized timer ranging from 0.0f to 1.0f is easy to create. Once an event, such as a grenade exploding, triggers the need for the effect, the vertex buffer is locked once at the very start to initialize the effect. Vertex positions will be set, as well as velocityX, velocityY, and velocityZ and beginningColor and endingColor. The velocity can be treated as a texture coordinate set with three components, just as upFactor and rightFactor are defined through a two-component texture coordinate set.

NOTE: Don't forget to scale the velocity to the actual time. If the effect plays over 30 seconds, then the velocity should be scaled by 30. Also, all four vertices of a particle should have the same velocity.


The beginningColor would be the color at the start of the effect. In a fiery blast, some particles may be white hot, while others might be a cooler orange or red. The endingColor would be the color at the end of the effect, usually RGBA(0,0,0,0) for additive blends and RGBA(255,255,255,0) for most other types of blending, resulting in a totally transparent, faded-out particle.

Applying this data in the shader is easy. Another constant is needed to pass along the normalized time. Since there will be an interpolation between colors, one constant's values should be the complement of the time (1.0 - current time).



For the velocity, just multiply the vertex's velocity by the current time constant, and add that to the position before transforming to homogeneous clip space:

; v3 is the vertex velocity, c22 is the time constant, and r2 is the vertex aligned to the
; viewport, before the transform to clip space
#define TIME c22.x
#define ACCELERATION c22.y           ; 0.5f * acceleration * time^2

mul r3, v3, TIME                     ; r3 = velocity * time
add r3.y, r3.y, ACCELERATION         ; apply gravity along the up axis
add r4, r2, r3                       ; r4 = position + velocity and acceleration offset
; then transform r4 to homogeneous clip space

The color can easily be linearly interpolated. The complement of the current time is multiplied by the beginning color and added to the ending color multiplied by the current time:

diffuse = (complement of time * beginning color) + (time * ending color)

; c22.x is the time, c22.y is the complement (packed differently than in the previous
; fragment), v4 is the start color, v5 is the end color
mul r1, v4, c22.y
mul r2, v5, c22.x
add oD0, r1, r2

Most particle effects can be done very easily in this manner. However, a more complex particle simulation, such as a tornado, would require more touching of the data between frames (not every single frame, if done cleverly, but more than just at the start of the effect).

Wrapping It Up

Be sure to check out my sample on the CD to see some cool examples of particle effects. One major issue that I haven't touched on is sorting. Obviously, alpha-blended objects should be drawn last in the scene. A very useful thing to know is that additively blended particles (Source = D3DBLEND_ONE, Destination = D3DBLEND_ONE) do not have to be sorted amongst each other. Since they are all simply added to the frame buffer, any order will give the same results. For effects with a different blend type, you will have to sort, but you don't have to mess with the vertex buffer. If you keep a system memory copy of your particles and track the centers, you can sort them and modify the index buffer, which touches far less memory than adjusting the vertex buffer.
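A sketch of that index-buffer sort follows: particles are ordered back to front and only the 16-bit indices are rewritten, while the vertex buffer stays untouched. ParticleRef and the four-vertices-per-quad layout are assumptions about the system's bookkeeping, not code from the sample.

// Sketch: rebuild the index buffer contents for back-to-front particle drawing.
#include <vector>
#include <algorithm>
#include <cstdint>

struct ParticleRef { float viewDepth; uint16_t firstVertex; };

void FillSortedIndices(std::vector<ParticleRef>& particles,
                       std::vector<uint16_t>& indices)
{
    std::sort(particles.begin(), particles.end(),
              [](const ParticleRef& a, const ParticleRef& b)
              { return a.viewDepth > b.viewDepth; });   // farthest first

    indices.clear();
    for (const ParticleRef& p : particles)
    {
        const uint16_t v = p.firstVertex;               // quad = two triangles
        const uint16_t quad[6] = { v, uint16_t(v + 1), uint16_t(v + 2),
                                   v, uint16_t(v + 2), uint16_t(v + 3) };
        indices.insert(indices.end(), quad, quad + 6);
    }
}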

Summary

In this chapter, we learned how to do screen-aligned particles with a vertex shader, bringing us one step closer to the big goal of having almost everything done by the GPU. With a little cleverness, you can make vertex shaders for almost any basic type of particle effect.


Hemisphere Lighting with Radiosity Maps

Shawn Hargreaves

This lighting model was designed for fast-moving objects in outdoor environments. Its goals are to tie the moving objects in with their surroundings, convey a sensation of speed, and be capable of rendering large numbers of meshes at a good frame rate on first-generation shader hardware.

It combines a crude form of radiosity lighting with world-to-object shadowing, using just one texture lookup and four pixel shader blend instructions. Approximation is the name of the game here, with performance being by far the most important consideration!

Figure 1a: Diffuse (dot3) sunlight plus radiosity hemisphere lookup (See Color Plate 4.)
Figure 1b: With the addition of specular and Fresnel contributions, using a static cube map holding an image of typical surroundings

Hemisphere Lighting

There is an apocryphal story that in the early days of color television, someone pulled off a successful scam selling kits that claimed to upgrade existing black and white TVs to display a color picture. This was done using a bit of plastic that fitted over the TV screen and tinted the top third of the display blue, the middle green, and the bottom brown, on the assumption that most things in life have sky at the top, trees in the middle, and soil underneath. This was not perhaps the most robust of solutions, and I suspect the people who bought this kit were not terribly happy, but it would have worked okay, at least for a few carefully chosen images!

Hemisphere lighting is basically just a shader implementation of the same concept.

Most conventional lighting models evaluate some sort of equation for the more important few lights in a scene and then add in a constant ambient term as an approximation of all the leftover bits and pieces. This works nicely for scenes with many complex light sources but is less than ideal for outdoor environments where all the direct light is coming from the sun. With only a single light source available, fully half of every object will be in shadow and will thus be illuminated only by the ambient term. This gets even worse in overcast or rainy weather conditions because as the sun is obscured by clouds, its direct contribution becomes less, and almost all of the light in the scene ends up being provided by the catch-all ambient constant. A constant amount of light results in a constant color, which makes things look flat and boring.

This is clearly wrong because you only have to step outside on a foggy morning to notice that even though the sun itself may be entirely hidden, there is still enough variation in light levels that you can easily make out the contours of whatever you are looking at.

The problem with conventional lighting models is that in the real world, the majority of light does not come directly from a single source. In an outdoor setting, some of it does indeed come straight from the sun, but more comes equally from all parts of the sky, and still more is reflected back from the ground and other surrounding objects. These indirect light sources are extremely important because they will often provide a much larger percentage of the total illumination than the sun itself.

Hemisphere lighting is a simple way of emulating the indirect light contributions found in a typical outdoor scene. Any kind of complex radiosity lighting could be modeled by encoding the surrounding light sources into an HDR (high dynamic range) cube map, but it is impractical to update such a cube map in real time as large numbers of objects move around the world. So we need to approximate, cutting down the complexities of the real world into a more efficient real-time model.

The sun is easy: A per-vertex dot3 can handle that quite nicely. With this taken out of the equation, we are left with a pattern of light sources that can be roughly divided into:

- Sky: Usually blue, emits lots of light, located above your head
- Ground: Some other color, darker than the sky, located underneath you

This is trivial to evaluate in a vertex shader; just set your sky and ground colors as constants, and use the y component of the vertex normal to interpolate between them!
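On the CPU, the whole hemisphere term is just a remapped dot with "up" followed by a lerp, as in this sketch. HemisphereLight is an illustrative name only; the shaders later in this article compute an equivalent ground-to-sky tween factor from the vertex normal.

// Sketch: per-vertex hemisphere lighting, assuming a normalized world-space normal.
#include <d3dx9.h>

D3DXVECTOR3 HemisphereLight(const D3DXVECTOR3& normal,
                            const D3DXVECTOR3& skyColor,
                            const D3DXVECTOR3& groundColor)
{
    float blend = 0.5f * normal.y + 0.5f;                  // map -1..1 to 0..1
    return groundColor + (skyColor - groundColor) * blend; // lerp(ground, sky, blend)
}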


Radiosity Maps

Hemisphere lighting avoids the flatness that can result from a constant ambient term, but it also poses a question: What should you use for the ground color? Bearing in mind our goal of making moving objects fit in with their surroundings, it would be good if this could change appropriately depending on your location in the world.

The solution is obvious: Encode the ground color into a texture as a large, top-down image of the landscape. This map can then be sampled at a position corresponding to the location of the object or, even better, offset some distance (a meter or so works well) along the vertex normal. Adding this offset stretches the sample area to include a larger region of the ground image and introduces some horizontal lighting variation in addition to the vertical ground-to-sky transition.

The results may not be exactly what a high-end renderer would describe as radiosity lighting, but it can be a remarkably good approximation. The underside of an object picks up color from the ground directly beneath it, while the sides are influenced by the scenery slightly off to each side, and the top is affected entirely by the sky color.

Figure 2: Hemisphere lighting with and without a vertex normal offset

Making the Map

Ground color maps can easily be generated by taking a screen shot of your level viewed from above with an orthographic projection. The results can be improved if you preprocess the mesh by removing polygons that are too high above the ground surface and rotating vertical polygons to face upward so that elements like the sides of fences will contribute to the radiosity colors.

I also found it useful to add about 10 percent of random noise to the resulting texture, as this introduces a subtle speed-dependent flicker that gives an effective sense of motion as you move around the world.



A 1024x1024 texture (only half a megabyte when encoded in DXT1 format) is sufficient to represent a couple of square miles of landscape with enough precision to make out details such as alternating colors along the rumble strip at the edge of a racetrack or dappled patterns of light and shadow in a forest scene.

Figure 3: The images in Figures 1, 4, and 5 were created along the zoomed-in section of this 2048x512 radiosity map. (See Color Plate 5.)

Shadowing

Once you have the ground color encoded in a texture, it seems that static environment-to-object shadows ought to "just work" if you put a dark patch in the relevant portion of the radiosity map. Compared to other shadowing techniques, this is highly approximate but also incredibly cheap, and it can be very effective, especially for complex shadow patterns such as a forest floor.

Unfortunately, it doesn't "just work." The problem with using the radiosity map to encode shadows is that even if you darken down the ground color, the sky color is still a constant and so will not be affected.

There are several possible solutions:

- Use one texture to encode the ground color and another to encode shadows. This is the highest quality and most controllable approach, but it burns two texture units and requires double the amount of storage for the two textures.

- You could encode the shadow amount into the alpha channel of the radiosity texture. In this case, your ground color would be (radiosity.rgb * radiosity.a), while the sky color would be (sky_color_constant * radiosity.a). This works well, but using alpha in the radiosity map requires at least an 8-bit texture format, such as DXT5. For such a large image, storage space is a serious concern.

- At the risk of excessive approximation, it is possible to collapse the ground color and shadow data into a single RGB texture, thus allowing it to be stored in 4-bit-per-texel DXT1 format. The process is:

   1. Convert your radiosity map into an HSV color space.
   2. Find the average V (brightness) value.
   3. Normalize the map so that all texels have this same constant brightness level, except for areas shadowed by static environment geometry, for which the brightness is left at a lower level. Hue and saturation are not affected by this process.



   4. Convert back into RGB format.
   5. Work out a scaling factor that will turn the average radiosity brightness into your desired sky color, and set this as a shader constant (a sketch of this preprocessing pass follows the list).
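Here is a sketch of that offline pass. HSV brightness (V) is just max(r,g,b), and scaling a texel's RGB uniformly changes V while leaving hue and saturation untouched, so no explicit HSV conversion is needed. The Texel layout, the inShadow flag, and the returned scale factor are assumptions about the tool chain, not code from the article.

// Sketch: brightness normalization of the radiosity map (steps 1-5).
#include <vector>
#include <algorithm>

struct Texel { float r, g, b; bool inShadow; };

float NormalizeRadiosityBrightness(std::vector<Texel>& map, float skyBrightness)
{
    // Steps 1-2: average brightness of the unshadowed texels.
    double sum = 0.0; int count = 0;
    for (const Texel& t : map)
        if (!t.inShadow) { sum += std::max({t.r, t.g, t.b}); ++count; }
    const float averageV = static_cast<float>(sum / count);

    // Steps 3-4: scale unshadowed texels so they all reach the average
    // brightness; shadowed texels keep their darker values.
    for (Texel& t : map)
    {
        if (t.inShadow) continue;
        float v = std::max({t.r, t.g, t.b});
        if (v > 0.0f)
        {
            float s = averageV / v;
            t.r *= s; t.g *= s; t.b *= s;
        }
    }

    // Step 5: a rough scaling constant that maps the normalized brightness to
    // the desired sky color (the exact value depends on the grayscale conversion).
    return skyBrightness / averageV;
}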

At run time, the ground color can be looked up directly from the modified radiosity map. Except for areas of shadow, this will now be lacking any variation in brightness, but the changes in hue are enough for the technique to remain effective.

To calculate the sky color, dot the ground color with (1/3, 1/3, 1/3) in your pixel shader, thus converting it to grayscale. Because of the brightness normalization, this will produce a constant value for all non-shadowed areas, or a darker shade of gray if you are in shadow. Multiplying this value by the sky color scaling constant gives a correctly shadowed version of the sky term.

Combining the ground color and shadow information into a single texture creates one final dilemma: Where should this texture be sampled? The radiosity lighting works best if the sample position is offset along the vertex normal, but that is blatantly incorrect for shadowing, where the map should be sampled directly at the vertex position.

A hacky compromise is to apply an offset along the left-to-right component of the normal but not in the front/back direction, so polygons facing forward or backward will sample the radiosity map at their exact position, while side-facing polygons use an offset sample point. Since objects usually travel roughly along their forward axis, this maintains a nice solid transition as they move in and out of shadow, while still allowing radiosity tints to be picked up from either side of their exact location.

Figure 4: These images show the hemisphere lighting on its own, using a single DXT1 format radiosity map that encodes both shadow and ground color information. (See Color Plate 6.)



The Shaders

The complete lighting model combines four elements:

- Base texture
- Radiosity texture combining ground color and shadow information. The vertex shader calculates the sample location and ground-to-sky tweening factor, while the pixel shader generates a shadowed version of the sky color based on the grayscale of the ground color and performs the hemisphere tween.
- Environment cube map containing a static image of a typical area of the level along with a specular highlight in the alpha channel. The envmap intensity is calculated in the vertex shader, combining a per-vertex reflectivity amount (mapped by artists) with a 1-cos Fresnel approximation.
- The direct sun contribution is calculated per vertex using a straightforward infinitely distant dot3 light.

vs.1.1

// vertex inputs:
#define iPos     v0    // vertex position
#define iNormal  v1    // vertex normal
#define iDiffuse v2    // reflectivity amount
#define iTex0    v3    // base texture coordinates

dcl_position  iPos
dcl_normal    iNormal
dcl_color0    iDiffuse
dcl_texcoord0 iTex0

// constants:
def c0, 0, 0, 0, 1

#define VS_CONST_0 c[0].x
#define VS_CONST_1 c[0].w

#define VS_EYEPOS  1     // object space eye position

#define VS_CAMERA1 2     // 4x4 object to screen matrix
#define VS_CAMERA2 3
#define VS_CAMERA3 4
#define VS_CAMERA4 5

#define VS_ENVMAP1 6     // 3x3 object to world matrix
#define VS_ENVMAP2 7
#define VS_ENVMAP3 8

#define VS_FOG     9     // fog transform vector

#define VS_AMBIENT        10   // ambient light color
#define VS_LIGHT_COLOR    11   // diffuse light color
#define VS_LIGHT_DIR      12   // object space light direction

#define VS_RADIOSITY_U    13   // radiosity U mapping
#define VS_RADIOSITY_V    14   // radiosity V mapping
#define VS_RADIOSITY_SIDE 15   // object sideways offset
#define VS_RADIOSITY_SAT  16   // ground vs. sky vector

// outputs:
//
// oPos = position
// oFog = fogging
//
// oT0 = base texture coordinates
// oT1 = radiosity map sample location
// oT2 = environment cube map coordinates
//
// oD0.xyz = dot3 sunlight
// oD1.xyz = radiosity ground to sky tween factor
// oD0.w = fresnel term
// oD1.w = specular intensity

// transform the vertex position
mul r0, c[VS_CAMERA1], iPos.x
mad r0, c[VS_CAMERA2], iPos.y, r0
mad r0, c[VS_CAMERA3], iPos.z, r0
add oPos, c[VS_CAMERA4], r0

// calculate the fog amount
dp4 oFog, iPos, c[VS_FOG]

// output the base texture coords
mov oT0.xy, iTex0

// **************** RADIOSITY HEMISPHERE ****************

// stretch the radiosity lookup area to either side of the model
dp3 r0.x, iNormal, c[VS_RADIOSITY_SIDE]
mad r0.xyz, r0.x, c[VS_RADIOSITY_SIDE], iPos

// planar map the radiosity texture
mov r0.w, VS_CONST_1
dp4 oT1.x, r0, c[VS_RADIOSITY_U]
dp4 oT1.y, r0, c[VS_RADIOSITY_V]


// calculate the ground to sky radiosity tween factor
dp4 oD1.xyz, iNormal, c[VS_RADIOSITY_SAT]

// **************** FRESNEL / SPECULAR CUBE MAP ****************

// calculate and normalize the eye->vertex vector
sub r0.xyz, iPos, c[VS_EYEPOS]
dp3 r0.w, r0, r0
rsq r0.w, r0.w
mul r0.xyz, r0, r0.w

// dot the vertex normal with eye->vert
dp3 r1.x, r0, iNormal

// fresnel term = (1 - r1.x) * reflectivity amount
mad oD0.w, r1.x, iDiffuse.x, iDiffuse.x

// also output a non-fresnel version of the reflectivity amount
mov oD1.w, iDiffuse.x

// reflect the view direction through the vertex normal
add r1.x, r1.x, r1.x
mad r0.xyz, iNormal, -r1.x, r0

// transform the environment map sample location into worldspace
dp3 oT2.x, r0, c[VS_ENVMAP1]
dp3 oT2.y, r0, c[VS_ENVMAP2]
dp3 oT2.z, r0, c[VS_ENVMAP3]

// **************** DOT3 SUNLIGHT ****************

// let's do a boring old school per vertex diffuse light, too...
dp3 r0.x, iNormal, c[VS_LIGHT_DIR]
max r0.x, r0.x, VS_CONST_0
mul r0.xyz, r0.x, c[VS_LIGHT_COLOR]
add oD0.xyz, r0, c[VS_AMBIENT]

ps.1.1

// inputs:
//
// v0.rgb = dot3 sunlight
// v1.rgb = radiosity ground to sky tween factor
// v0.a = fresnel term
// v1.a = specular intensity
//
// c1 = sky color

def c0, 0.3333, 0.3333, 0.3333, 0.3333

tex t0    // base texture
tex t1    // radiosity texture
tex t2    // environment cube map

// envmap + specular
lrp r0.rgb, v0.a, t2, t0       // fresnel tween between envmap and base
mad r0.rgb, t2.a, v1.a, r0     // add the specular component

// radiosity hemisphere
dp3 r1.rgb, t1, c0             // grayscale version of the ground color
mul r1.rgb, r1, c1             // calculate sky color
lrp r1.rgb, v1, t1, r1         // tween between ground and sky
mul_x2 r0.rgb, r0, r1          // apply the radiosity color

// per vertex sunlight
mul_x2 r0.rgb, r0, v0          // output color * diffuse
+mov r0.a, t0.a                // output base texture alpha


Figure 5: The complete lighting model, combining a base texture, radiosity hemisphere,<br />

Fresnel cube map, and dot3 sunlight (See Color Plate 7.)



Additional Considerations

This form of lighting can easily be simplified, using cheaper versions to implement<br />

shader LOD on distant objects. Most significantly, the per-pixel radiosity<br />

lookups and sky color calculations can be replaced by a single CPU texture lookup<br />

at the center of the object, with the resulting sky and ground colors set as vertex<br />

shader constants, the hemisphere tween evaluated per vertex, and no work at all<br />

required in the pixel shader.<br />

Doing single-texel CPU lookups into the radiosity map is extremely fast, and<br />

this data can be useful in many places. For instance, a particle system might do a<br />

ground color lookup when spawning a new dust particle, to see if it should be in<br />

shadow and also so it can be tinted to match the hue of its surroundings.<br />
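To make such CPU lookups concrete, here is a minimal sketch of how a particle system might read the radiosity map. It assumes a system-memory RGB copy of the map and the same planar U/V mapping vectors that the vertex shader uses; the structure and function names are illustrative rather than taken from the game code.

// Minimal sketch of a CPU-side radiosity lookup.
struct RadiosityMap
{
    const unsigned char* texels;   // RGB8 data (e.g., a decompressed copy of the DXT1 map)
    int width, height;
    float worldToMapU[4];          // planar mapping: u = dot(worldPos, U.xyz) + U.w
    float worldToMapV[4];
};

void SampleRadiosity(const RadiosityMap& map, const float pos[3], float rgbOut[3])
{
    float u = map.worldToMapU[0]*pos[0] + map.worldToMapU[1]*pos[1] +
              map.worldToMapU[2]*pos[2] + map.worldToMapU[3];
    float v = map.worldToMapV[0]*pos[0] + map.worldToMapV[1]*pos[1] +
              map.worldToMapV[2]*pos[2] + map.worldToMapV[3];

    int x = (int)(u * (map.width  - 1) + 0.5f);
    int y = (int)(v * (map.height - 1) + 0.5f);
    if (x < 0) x = 0; if (x >= map.width)  x = map.width  - 1;
    if (y < 0) y = 0; if (y >= map.height) y = map.height - 1;

    const unsigned char* texel = map.texels + (y * map.width + x) * 3;
    rgbOut[0] = texel[0] / 255.0f;   // a new dust particle can be tinted by this color,
    rgbOut[1] = texel[1] / 255.0f;   // and its grayscale level tells whether it spawns in shadow
    rgbOut[2] = texel[2] / 255.0f;
}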

The radiosity maps can easily become very large, but they are also highly<br />

compressible. Large areas of the map will typically contain either flat color or<br />

smooth gradients, so good results can be obtained by splitting it into a grid of tiles<br />

and adjusting the resolution of each tile according to how much detail it contains.<br />

At run time, a quick render to texture can expand the area around the camera<br />

back out into a continuous full resolution map.<br />

Because the radiosity map is only a two-dimensional image, there will obviously<br />

be problems with environments that include multiple vertical levels. Such<br />

cases can be handled by splitting the world into layers with a different radiosity<br />

map for each, but this lighting model is not well suited to landscapes with a great<br />

deal of vertical complexity.<br />

References

Philip Taylor (Microsoft Corporation) discusses hemisphere lighting at:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndrive/html/directx11192001.asp

The non-approximate version: Image Based Lighting, Cunjie Zhu, University of<br />

Delaware: http://www.eecis.udel.edu/~czhu/IBL.pdf<br />

Videos demonstrating various elements of the shaders presented above can be<br />

found on the companion CD.<br />

The lighting model presented in this article is used in the game MotoGP 2, on PC<br />

and Xbox, developed by Climax and published by THQ. MotoGP 1 used a similar<br />

radiosity technique but without the Fresnel term on the environment cube map.


Galaxy Textures<br />

Jesse Laeuchli<br />

In many space simulations, it is useful to be able to render galaxies to provide a background setting. Galaxies are a good effect to generate procedurally: real galaxies come in many types and variations, so being able to vary the generated results pays off. In this article, a procedural model is presented that can be implemented almost entirely on the GPU using Cg pixel shaders.

Cluster Galaxies<br />

The simplest type of galaxy to render is the cluster galaxy. A cluster galaxy is a<br />

group of stars clustered together. This is fairly easy to simulate using a single<br />

quad and a pixel shader. First, a quad is drawn and a texture filled with values<br />

from [0,1] assigned to it. Then, for each pixel the distance from the center of the<br />

quad needs to be found. To do this, the Cg function distance() can be used to find<br />

the Euclidean distance between two points:

float4 center;<br />

center.x=.5;<br />

center.y=.5;<br />

center.z=.0;<br />

center.w=.0;<br />

float d=distance(center,In.tex);<br />

After the distance d has been obtained, the star density of the current point can be<br />

found with the following equation:<br />

stardensity = (1 - 2*d) * randomtexture^(d*10)

This equation rescales d from the range [0, 0.5] to [0, 1], inverts that so it falls from 1 to 0, and then multiplies it by randomtexture raised to the power of d times 10. This causes the star density to fall off away from the center of the galaxy.

Note that the constants 2 and 10 can be modified to adjust the appearance of<br />

the galaxy. In addition, if the texture coordinates used to fetch the random variables<br />

are changed over time, the galaxy structure will appear to change. While<br />

this may not be very realistic (real galaxies take many years to change drastically),<br />

it looks nice and can provide the application with more visual interest.<br />


Cg Cluster Galaxy code:<br />

float random(float2 xy, sampler2D BaseTexture)
// Index into a texture filled with random values
{
    float color = tex2D(BaseTexture, xy).x;
    return color;
}

// Body of the fragment program (In, Time, and BaseTexture are its inputs):
Pix Out;
float2 InputTest;
float4 center;
center.x = .5;
center.y = .5;
center.z = .0;
center.w = .0;
float d = distance(center, In.tex);
float randomtexture = random((In.tex.xy*10) + Time.xx, BaseTexture);  // random texture
d = (1 - d*2) * pow(randomtexture, d*10);
Out.dif.x = d;
Out.dif.y = d;
Out.dif.z = d;

Figure 1: Real cluster galaxy
Figure 2: Procedural cluster galaxy

Spiral Galaxies<br />

These equations can easily be modified to create another type of galaxy — a spiral<br />

galaxy. Spiral galaxies are galaxies with swirls of stellar matter emanating from<br />

their center. The amount, thickness, and length of the spirals vary greatly in real<br />

galaxies, so any procedural model should be easily modifiable.<br />

The first step in modifying the equation to support spirals is to find out the<br />

angle at which the current pixel is being shaded. In other words, as the galaxy is<br />

circular in shape, we need to find which angle of the circle the point being shaded<br />

lies in.


We can do that by finding the inverse tangent of the point, like this:<br />

float angle=atan((In.tex.y-.5)/(In.tex.x-.5));<br />

However, as atan's range is [–π/2, π/2], it is necessary to move it so that its range is positive. It is also helpful to convert the output into degrees.

float angle=atan((In.tex.y-.5)/(In.tex.x-.5)); //find angle<br />

angle=degrees(angle); //convert angle<br />

angle+=270; //Move angle to (0,360)<br />

Next, the galaxy needs to be split into spiral sections. This can be done by performing<br />

a modulus operation on the angle.<br />

angle % (360*d)

The modulus applied to the angle increases as the shaded point moves farther from the center of the galaxy. Spirals only result if this modulus grows with the distance; if it is kept constant, the galaxy is split into straight sections instead. The calculation can be skipped if the point is too close to the center of the galaxy, as the center does not swirl.

if (d > .15)
{
    angle = fmod(angle, 360*d);
    dense = dense * angle;   // 'dense' holds the star density computed as before
    dense /= 25;             // scale the density
    Out.dif.x = dense;
    Out.dif.y = dense;
    Out.dif.z = dense;
}

It is easy to change the appearance of the galaxy by changing a few of the constants<br />

used here. For example, lowering 360 will change the structure of the galaxy<br />

by changing the number of spirals. Changing the amount that dense is divided by will change the galaxy's overall brightness.
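For reference, the complete per-pixel spiral computation can be written out on the CPU. The sketch below follows the shader math under the assumption of [0,1] texture coordinates and a caller-supplied random value standing in for the random texture fetch, so it is an illustration rather than the article's shader code.

#include <math.h>

// CPU reference for the spiral galaxy density of one pixel.
// (u, v) is the texture coordinate in [0,1]; randomValue plays the role of the random texture fetch.
float SpiralGalaxyDensity(float u, float v, float randomValue)
{
    float dx = u - 0.5f;
    float dy = v - 0.5f;
    if (fabsf(dx) < 1e-6f) dx = 1e-6f;               // avoid dividing by zero in atan
    float d = sqrtf(dx*dx + dy*dy);                  // distance from the center

    float dense = (1.0f - 2.0f*d) * powf(randomValue, d * 10.0f);   // cluster galaxy falloff

    if (d > 0.15f)                                   // the core does not swirl
    {
        float angle = atanf(dy / dx) * 57.29578f;    // inverse tangent, converted to degrees
        angle += 270.0f;                             // make the angle positive
        angle = fmodf(angle, 360.0f * d);            // split the galaxy into spiral sections
        dense = dense * angle / 25.0f;               // scale the density
    }
    return dense;
}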

Figure 3: Two procedural spiral galaxies<br />




Figure 4: Two real spiral galaxies<br />

Figure 5: Space background, showing galaxies generated<br />

with the methods shown here<br />

Summary<br />

This article examined a procedural model for generating various types of animated galaxies, which can be combined into a space background.


Turbulent Sun<br />

Jesse Laeuchli<br />

Many 3D engines today have situations where the sun needs to be rendered.<br />

Whether the engine is being used to create a space simulation, flight simulation,<br />

space conquest game, or even just a shooter, the sun needs to be displayed. Many<br />

developers solve this by taking a sphere, texturing it bright yellow, and whitening<br />

out the user’s view, but this is visually boring. By using the latest pixel shaders<br />

and Cg, the sun can be rendered and animated entirely on the GPU. This article<br />

examines how this can be done using nVidia’s Cg and shows several styles of<br />

suns that can be generated by tweaking the shader.<br />

The most important part of the sun shader is a noise function. As of<br />

publication, Cg does not yet have a noise function implemented, so a large part of<br />

the code is spent implementing one. In this shader, a 3D value noise function is<br />

used. This takes a 1D table (stored in a 1D texture), looks into the table for the<br />

random values surrounding each pixel, and then interpolates between them using<br />

an ease curve and linear interpolation. Vnoise is used here because it only uses<br />

one channel of one texture unit and reduces the number of texture lookups used.<br />

On the downside, it uses a few more instructions than a gradient noise implementation.<br />

Using a different noise function will not affect the way the final image looks<br />

significantly, so another noise function may be substituted as desired. Below is<br />

the code for vnoise:<br />

half random(float x,float y,float z,sampler1D g)<br />

{<br />

half index=(x*6.6)+(y*7.91)+(z*8.21);<br />

index=index*0.001953125;<br />

index=h1tex1D(g,index);<br />

return index;<br />

}<br />

half3 scurve(half3 v)<br />

{<br />

return v*v*(3-2*v);<br />

}<br />

half noise(float3 v,sampler1D g)<br />

{<br />

half3 LatticePoint=floor(v);<br />

half3 frac1=scurve(frac(v));<br />

half4 v1;<br />



v1.x = random(LatticePoint.x, LatticePoint.y, LatticePoint.z,g);<br />

v1.y = random(LatticePoint.x + 1, LatticePoint.y, LatticePoint.z,g);<br />

v1.z = random(LatticePoint.x, LatticePoint.y + 1, LatticePoint.z,g);<br />

v1.w = random(LatticePoint.x + 1, LatticePoint.y + 1, LatticePoint.z,g);<br />

half2 i1 = lerp(v1.xz , v1.yw , frac1.x);<br />

half a=lerp(i1.x , i1.y , frac1.y);<br />

//<br />

v1.x = random(LatticePoint.x, LatticePoint.y, LatticePoint.z+1,g);<br />

v1.y = random(LatticePoint.x + 1, LatticePoint.y, LatticePoint.z+1,g);<br />

v1.z = random(LatticePoint.x, LatticePoint.y + 1, LatticePoint.z+1,g);<br />

v1.w = random(LatticePoint.x + 1, LatticePoint.y + 1, LatticePoint.z+1,g);<br />

i1 = lerp(v1.xz , v1.yw , frac1.x);<br />

half b=lerp(i1.x , i1.y, frac1.y);<br />

return lerp(a,b,frac1.z);
}

For each pixel, the function random is called eight times to index into the table for<br />

the random values (if animation is not desired, this can be reduced to four) and<br />

then lerps between them. The artifacts generated by using lerps instead of a<br />

better interpolation function are surprisingly small in this instance, and in any<br />

case they are masked by using an ease curve to smooth it. Half variables are used<br />

instead of floats to improve the performance, as the visual difference is not noticeable<br />

in this instance. The texture is a 1D texture containing values from –1 to 1 in<br />

the red channel.<br />

To generate a sun texture, this function should be called several times per<br />

pixel with different frequencies each call, using the texture coordinates as the x<br />

and y parameters and time for the z parameter, if animation is desired. The output<br />

is then applied to the final color. Calling noise more times makes the image have<br />

more detail, but calling it past certain levels dependent on the display resolution<br />

is useless, as there will be more detail than a single pixel can capture. Also, calling<br />

noise more times increases the length of the program, so it is desirable to call<br />

it fewer times. In the example below, it is called four times.<br />

ninput.x=(IN.TexCoord0.x)*10;
ninput.y=(IN.TexCoord0.y)*10;
ninput.z=Time.x;
float suncolor = noise(ninput,texture1) + (noise(ninput*2,texture1)*.5) +
                 (noise(ninput*4,texture1)*.25) + (noise(ninput*8,texture1)*.125);

When setting the output color, giving added weight to the red component makes<br />

the sun look better. In Figure 1, the red is set to twice the value of green, and the<br />

blue component is set to 0.


A more interesting look can be achieved by using a turbulence function. Turbulence<br />

is the same as the sum of the noise functions used above, but instead of<br />

using normal noise, the absolute value of signed noise is used. Signed noise is<br />

usually created by:<br />

2*noise(x,y,z)-1;<br />

This scales the output to (–1,1).<br />

Note that it is not required that the value 2 be used and that other values can<br />

sometimes yield more interesting results. In Figure 2, a value of 1.5 was used.<br />

This of course changes the range to which the noise is scaled.

The change to the shader is simple:<br />

Noise function…
.....
    return 1.5*lerp(a,b,frac1.z)-1;
}


Figure 1: Sun generated using noise
Figure 2: Sun generated using turbulence

float test=abs(noise(ninput,texture1))+abs((noise(ninput*2,texture1))*.5)+<br />

abs((noise(ninput*4,texture1))*.25) +abs((noise(ninput*8,texture1))*.125);<br />

This looks better than just the sum of normal noise, but one other interesting<br />

form is possible. Instead of getting the absolute value of every call to snoise, add<br />

all the values of snoise and then take the absolute value. (See Figure 3.)<br />

float test=abs(noise(ninput,texture1) + (noise(ninput*2,texture1)*.5) +
               (noise(ninput*4,texture1)*.25) + (noise(ninput*8,texture1)*.125));
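The two variants are easy to compare side by side in plain C++. The sketch below assumes some snoise() function returning signed noise in (–1, 1), such as the modified vnoise above; it is a reference for the math, not the shader itself.

#include <math.h>

// Signed noise in (-1, 1); the modified vnoise above (or any other implementation) can be used.
float snoise(float x, float y, float z);

// Classic turbulence: take the absolute value of each octave and sum them.
float Turbulence(float x, float y, float z)
{
    return fabsf(snoise(x, y, z))
         + fabsf(snoise(2*x, 2*y, 2*z)) * 0.5f
         + fabsf(snoise(4*x, 4*y, 4*z)) * 0.25f
         + fabsf(snoise(8*x, 8*y, 8*z)) * 0.125f;
}

// The alternative look: sum the octaves first, then take a single absolute value.
float FoldedNoiseSum(float x, float y, float z)
{
    float sum = snoise(x, y, z)
              + snoise(2*x, 2*y, 2*z) * 0.5f
              + snoise(4*x, 4*y, 4*z) * 0.25f
              + snoise(8*x, 8*y, 8*z) * 0.125f;
    return fabsf(sum);
}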


After the sun has been rendered, a flare can be drawn using vertex shaders to<br />

give the impression of light rays being emitted from the sun. The basic idea is to<br />

render a circle around the sun, then deform the uppermost vertexes of the circle<br />

by a constantly updated random value. This shader can be used even on graphics<br />

cards that support only DirectX 8 functionality or even a software implementation

of vertex shaders. To generate the vertex coordinates, the number of triangles to<br />

use must be decided on. Obviously, the more triangles used, the closer the<br />

approximation is to a true circle, and the better it looks. In the following example,




1000 triangles are used. After the number of triangles has been chosen, the vertex<br />

number must be passed for each vertex, as well as the random number used<br />

to deform the vertex position. The vertex shader then uses the sincos Cg function<br />

to generate the vertex positions. The following code is used to do this. Position.x<br />

contains the number of the vertex being rendered, and position.y is used to<br />

deform the flare and control the size. AmountOfVertexes is the uniform parameter<br />

passed to the shader containing the number of vertexes in the flare:<br />

float4 temppos=IN.position;
float step=IN.position.x*(6.283185307179586476925286766559)/AmountOfVertexes;
sincos(step, temppos.x, temppos.y);
temppos.x = (temppos.x * IN.position.y);
temppos.y = (temppos.y * IN.position.y);

To animate it, position.y should be updated with new random values periodically.<br />

Figure 3: Sun generated with moderate turbulence

It is important when passing the parameters that a value of 1 be specified for the<br />

innermost vertexes so they stay connected to the sun. Also, the flares should<br />

become more transparent the farther they reach from the sun:<br />

OUT.cColor.x = 1;      // Flame-like color
OUT.cColor.y = .7125;  // Flame-like color
OUT.cColor.z = 0;
OUT.cColor.w = 1-(IN.position.y-1)*2;

Figure 4: Sun with corona<br />
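On the CPU side, the input stream for such a flare ring can be filled with something like the following sketch. The FlareVertex layout and the use of rand() are assumptions for illustration; they simply follow the convention described above, where position.x carries the vertex number and position.y the deform/size value.

#include <stdlib.h>

// Illustrative vertex layout: x carries the vertex number, y the deform/size value.
struct FlareVertex
{
    float x, y, z, w;
};

// Fill one ring of flare vertices. Inner vertices get a value of 1 so they stay attached
// to the sun; outer vertices get a random deform value, refreshed periodically to animate.
void BuildFlareRing(FlareVertex* inner, FlareVertex* outer, int amountOfVertexes)
{
    for (int i = 0; i < amountOfVertexes; i++)
    {
        inner[i].x = (float)i;
        inner[i].y = 1.0f;                               // stays connected to the sun
        inner[i].z = inner[i].w = 0.0f;

        outer[i].x = (float)i;
        outer[i].y = 1.0f + rand() / (float)RAND_MAX;    // random deform in [1, 2]
        outer[i].z = outer[i].w = 0.0f;
    }
}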

Summary

In conclusion, this article has examined how to generate value noise in a Cg

shader and then use it to produce turbulence, which can be applied per-pixel to<br />

create animated sun textures. Also, a method for displaying and animating a light<br />

flare around the sun has been shown. The entire source code for the shaders and<br />

the example code can be seen on the companion CD.


Fragment-level Phong Illumination

Emil Persson

Introduction

Phong illumination really isn't anything new. The Phong illumination model has been around for almost three decades now. First introduced by Phong Bui-Tuong in 1975, this model is still frequently used in both the offline rendering world and the real-time graphics world. Due to the complex math behind the model, it has until recently only been used for vertex lighting in the real-time rendering world. Both the Direct3D and OpenGL illumination models closely follow the Phong model with some small variations. Doing it on a vertex level often causes visible artifacts and a less-than-convincing look, unless you use a very high tessellation. With advances like the dot3 operation in the fixed function pipeline, we came a step closer to getting lighting on a per-pixel level. Unfortunately, the limitations of the fragment processing pipeline meant a lot of compromises had to be made, even in DirectX 8 level pixel shaders. With a limited range of [–1,1], or [–8,8] in PS 1.4, and with the limited precision that DirectX 8 level graphics cards offer, much of the required math is simply not possible to do. Further, the fact that there are no advanced math instructions in these graphics solutions is another obstacle on our way toward advanced lighting, not to mention the instruction limit. For these reasons, tricks like packing attenuation into a 3D texture, using cube maps for normalization, and using textures as lookup tables for exponentiation of the specular component have been the norm for the past generation.

Fortunately, this will sooner or later be nothing but a bad memory. With DirectX 9 level hardware, we not only have the close-to-infinite range of floating-point components and much higher precision, but we are also able to do advanced math and have a lot more instructions to play with before reaching the hardware limits. This means that for the first time ever, we are able to truly evaluate the Phong illumination model for each pixel completely in a pixel shader. I will state, however, that even though we are finally able to evaluate the whole Phong illumination model in the pixel shader, there are still considerations and limitations that need to be addressed. The number one consideration to take into account is, of course, performance. Even with the top high-end graphics cards of today, the full equation can be quite demanding on the fragment pipeline, and if care is not taken, performance will suffer. We address some of these issues later in this article.


The Phong Illumination Model<br />

Let me start by introducing the Phong illumination model:<br />

I = A_coeff·A_color·D_color + Σ_i [ Att · L_color · ( D_coeff·D_color·(N · L_i) + S_coeff·S_color·(R · V)^S_exp ) ]

So what does all this do? Let’s consider every component and their purpose. The<br />

first component, I, is of course the resulting color or intensity. The other components,<br />

A, D, and S, represent three different attributes of light and are called<br />

ambient, diffuse, and specular.<br />

The Diffuse Component<br />

We begin with diffuse, as it’s the most intuitive (though not the simplest) of<br />

these. To understand what diffuse lighting is, take a piece of paper and point a<br />

light toward it (or just imagine it). The paper may represent a polygon in our little<br />

world. When the paper faces the light, it receives a lot of light and looks bright<br />

white. Now slowly turn the paper around until the edge faces the light instead. As<br />

you can see, it fades with the angle as the paper faces away from the light. This<br />

phenomenon is what diffuse lighting represents. The actual math behind this is<br />

what we see in the middle of the equation above, N·Li. N is the normal of the surface,<br />

and Li is the light vector. The light vector<br />

is a vector that points from the point we’re<br />

lighting toward the light. The light vector<br />

should be normalized (that is, being of length<br />

1). The same should of course be true for the<br />

normal too. The dot product factor will thus<br />

be a number between –1 and 1. We don’t want<br />

negative light contribution, so all dot products<br />

in this article are assumed to be clamped to<br />

the [0...1] range. Why does this expression<br />

give us the desired result? See Figure 1 for an Figure 1: The diffuse component<br />

illustration.<br />

A dot product between two perpendicular vectors will return 0. That’s the<br />

case with light lying in the surface plane in the illustration above. Anything<br />

behind the surface will return a negative number and thus be clamped to 0. A light<br />

shining perpendicularly toward the surface from above will return 1, and anything<br />

lighting the surface from an angle will get a higher contribution as the light vector<br />

approaches the surface vector. Quite intuitive, but this is of course no proof of<br />

correctness. At this time, it’s better to spill the beans: The Phong model isn’t correct.<br />

It’s just an approximation of how light tends to behave but nowhere near<br />

acceptable for studying optics. However, in graphics we don’t need correctness;<br />

our main concern is to please the eye. Thus, the motto is: If it looks good, then it<br />

is good. Phong illumination looks good and consequently is good. That it can’t<br />

predict how photons interact with matter is not going to concern us a whole lot.
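As a quick sanity check of this behavior, the clamped diffuse factor can be written as a couple of lines of C++ (a sketch that assumes both vectors are already normalized):

// Clamped diffuse factor: 1 when the light shines straight down on the surface,
// 0 when the light lies in the surface plane or behind it. N and L must be unit length.
float DiffuseFactor(const float N[3], const float L[3])
{
    float d = N[0]*L[0] + N[1]*L[1] + N[2]*L[2];
    return d > 0.0f ? d : 0.0f;     // negative contributions are clamped away
}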


If we go back to the equation, you can see that the diffuse contribution is multiplied with two other variables, D_coeff and D_color. D_color is the color of the material of the surface, commonly represented by a texture or a constant color. We use a texture, which is the normal base material texture used in many applications and games and should not need any further introduction. D_coeff is simply a variable telling how much the diffuse component is going to contribute to the whole lighting equation. You'll notice that there's also an A_coeff and an S_coeff variable, which control how much ambient and specular we want. For performance, we do not necessarily need to care about all of these. In fact, it can be beneficial to just bake D_coeff into the base texture; if you want less diffuse contribution, you can simply use a darker texture. The same can be done for the ambient and specular coefficients, which gives us a somewhat simpler expression, where the components A, D, and S have their coefficients and colors pre-baked into single entities:

I = A·D + Σ_i [ Att · L_color · ( D·(N · L_i) + S·(R · V)^exp ) ]

The Specular Component

So far we have only discussed diffuse lighting. Diffuse lighting works well for<br />

materials like wood, stone, fabric, etc. But it won’t work that well for materials<br />

like plastic, metal, and porcelain. Why not? These materials have a property that<br />

rough wood, for instance, lacks — they are shiny. Shininess is the property that<br />

the specular component tries to resemble. For rough wood you could do without<br />

any shininess, and it would look pretty good. But even for rough wood, a small<br />

specular component can enhance the image. There’s a saying in graphics: “If you<br />

can’t make it good, make it shiny.” But be careful; with Phong illumination, you<br />

can make it good, so you shouldn’t need to resort to making it overly shiny.<br />

Unless you’re trying to make it look like polished wood, you should use a low<br />

specular component. The best images are created by carefully balancing specular<br />

and diffuse properly according to the properties of these materials in real life.<br />

So how is the specular component calculated? The idea is similar to that of<br />

the diffuse component. To begin with, compute the dot product between the<br />

reflection vector R and the view vector V. The view vector is similar to the light<br />

vector, except the view vector is the vector from the camera to the lit point rather<br />

than from the light to that said point. This reveals a significant property of specular<br />

lighting. While diffuse lighting is viewpoint independent, specular is by its<br />

nature very viewpoint dependent. If you navigate around in your little computer<br />

graphics world, it doesn’t matter from where you observe a piece of rough wood;<br />

it’ll look the same regardless of where you view it from. That’s not true for materials<br />

like plastics though; as you move around, you’ll see the reflection of the light<br />

in the surface, and as the viewpoint changes the reflection will move around.<br />

Figure 2 (on the following page) illustrates the behavior of the specular<br />

component.<br />

You, of course, get maximum reflected light if you view the surface from<br />

somewhere along the reflection vector. As you move away from this vector, you<br />




see less and less reflected light. If you were<br />

to use the dot product of the view vector<br />

and reflection vector, the surface still<br />

wouldn’t look particularly shiny; rather it<br />

would just look bleached. Why is that?<br />

Figure 2: The specular component

Think of a perfectly reflective surface — a mirror in other words. A mirror will only reflect light in the exact direction of the

reflection vector. That is, if you viewed it at<br />

a slight angle off from the ideal reflection angle, as in Figure 2, you wouldn’t see<br />

any reflected light at all. Thus, in that case the dot product alone obviously doesn’t<br />

work. Also think of a dull surface. It will reflect light in all possible directions,<br />

so reflections should be visible from pretty much everywhere, even though you<br />

don’t see a sharp reflection but rather just a uniformly lit surface. The difference<br />

is the spread. The mirror doesn’t have any spread, while the dull material has a<br />

significant spread. In other words, the more reflective the material, the faster the<br />

light falls off as you move away from the reflection vector. Enter the specular<br />

exponent. As you can see in the Phong equation, the dot product of the view vector<br />

and the reflection vector is raised to a power. This exponent represents the<br />

shininess of the material. The higher the exponent, the shinier the material. A<br />

specular exponent of infinity is a mirror, and a specular exponent of 0 is a completely<br />

dull surface where light is spread equally in all directions. If you didn’t<br />

raise the specular to a power, basically using a specular exponent of 1, you still<br />

have a pretty dull surface. Normal values of the specular exponent tend to be<br />

around 8 to 64. We will use a constant specular exponent of 24, something I<br />

choose because it looks pretty good. Remember, if it looks good, then it is good.

<strong>With</strong> pixel shaders 2.0, nothing really prevents us from changing the shininess of<br />

the surface by storing the exponent in a texture and using that as a lookup table<br />

for specular exponents for each pixel. This can be used to let rusty parts of metal<br />

be non-shining while letting the intact parts shine as appropriate. A dark region in<br />

this texture represents a non-shiny area, while bright regions are those that are<br />

shiny. I’ll leave this as an exercise for the interested, however, and instead focus<br />

on a more important part by which we can create a quite similar effect — gloss.<br />

Gloss is basically just another word for the specular coefficient. As you<br />

remember, we baked the coefficients together with the colors for each of the components<br />

— ambient, diffuse, and specular. One often leaves the specular color as<br />

white, which basically reduces the S component to be nothing but the specular<br />

coefficient, or the gloss. This is because most shiny materials don’t significantly<br />

change the color of the light as it reflects off the surface. Some materials do, though, and if you're going to simulate this behavior you should of course keep

the specular color component. Gloss is an important part of the equation, however,<br />

and should generally be left in the equation. It often gives better results to<br />

just alter the gloss instead of the specular component across a surface to do<br />

effects like rusty parts of a metal surface. So we will use a texture containing the<br />

gloss, a so-called gloss map. If you want to use a specular color, you can bake it<br />

into the gloss map, but in our case we will take advantage of the fact that we only


have a single property to take care of and use a single channel texture to store our<br />

gloss map, which reduces the bandwidth need.<br />
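Putting these pieces together, the specular term can be sketched on the CPU as follows. The code assumes normalized N, L, and V vectors with V pointing from the surface toward the camera, uses the exponent of 24 mentioned above, and is only a reference for the math rather than the pixel shader developed later in this article.

#include <math.h>

// Specular term of the Phong model: reflect the light vector about the normal,
// dot the result with the view vector, clamp, raise to the exponent, and scale by the gloss.
// N, L, V are assumed normalized; V points from the surface toward the camera.
float SpecularTerm(const float N[3], const float L[3], const float V[3],
                   float gloss, float exponent /* e.g., 24 */)
{
    float NdotL = N[0]*L[0] + N[1]*L[1] + N[2]*L[2];

    float R[3];                                      // R = 2*(N.L)*N - L
    for (int i = 0; i < 3; i++)
        R[i] = 2.0f * NdotL * N[i] - L[i];

    float RdotV = R[0]*V[0] + R[1]*V[1] + R[2]*V[2];
    if (RdotV < 0.0f) RdotV = 0.0f;                  // no negative light

    return powf(RdotV, exponent) * gloss;
}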

Attenuation<br />

In real life, light fades as the lit surface gets farther from the light. The falloff is<br />

roughly a 1/r² function (think of the area of a sphere with the light in its center). In

real life, light sources aren’t really a dimensionless point in space either. A<br />

lightbulb, for instance, while not particularly large, is still not exactly infinitesimal<br />

either. So if we applied an attenuation factor of 1/r², we wouldn't get very realistic

results. To better capture the behavior of light, a slightly more complex function is<br />

commonly used:<br />

Att = 1 / (c + l·r + q·r²)

We have constant, linear, and quadratic attenuation — c, l, and q in the formula above. It's not necessary to use all components; I usually drop the linear component, since it doesn't add a whole lot and places the heaviest load on the fragment pipeline because it requires a square root. Usually it's enough to just offset the inverse square function with a constant. Setting this constant to 1 will usually suit us well. So the attenuation function we use is:

Att = 1 / (1 + q·r²)
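In C++ form, with the squared distance passed in directly so that no square root is needed, this is simply (a sketch):

// Attenuation with the constant term fixed at 1:  Att = 1 / (1 + q * r^2).
// Passing the squared distance avoids the square root a linear term would require.
float Attenuation(float distanceSquared, float quadraticCoefficient)
{
    return 1.0f / (1.0f + quadraticCoefficient * distanceSquared);
}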

The Ambient Component<br />


If we were to implement the lighting equation as discussed so far, we would get<br />

quite good results. However, there’s still something that will hurt the impression<br />

of reality. Polygons in our little virtual world that face away from our light will be<br />

black. This may sound natural, as no light would hit it. However, experience tells<br />

us otherwise. If you’re in a decently lit room, you’ll have a hard time finding a surface<br />

that’s so dark you can’t see its details and texture. Nothing really gets black.<br />

Why is that? When light hits a surface, some of it scatters back. Some of that light<br />

hits our eyes, which is the sole reason we can see anything at all. Not every photon<br />

scattering off from a surface will hit the eyes of the viewer though; some will<br />

bounce away and hit other surfaces. Some of that light will then once again scatter<br />

back into the scene. This is called indirect lighting and is something that our<br />

Phong model doesn’t take care of. Fortunately, there’s a very cheap way to fake it.<br />

Enter the ambient component. While none of the components of Phong illumination<br />

are particularly real or physically correct, the ambient is the most fake of<br />

them all. In fact, it clearly goes against all our knowledge about light. But as<br />

always, if it looks good, then it is good. Ambient gets rid of the blackness of unlit<br />

surfaces and gives a decent impression that indirect light is present in the scene.<br />

This alone is a noble enough goal to justify its place in the Phong model, and<br />

given how cheap it is to implement, one would really have to justify the decision<br />

not to use ambient.



So what is ambient then? Basically it’s nothing but a constant light that hits<br />

every surface. One assumes that the light scattered off from the surfaces in the<br />

scene is uniformly distributed in all directions and all places. This is hardly close<br />

to reality but works reasonably well for most normal scenes. <strong>With</strong> light hitting a<br />

surface uniformly from all directions, you get no reflective behavior. It’s also completely<br />

angle independent, so anything like diffuse is out the window too.<br />

Basically you end up with just the texture color multiplied with a constant of how<br />

much ambient you want in the scene; very simple but quite effective. For being so<br />

effective and yet so cheap, it’s easily the most worthwhile calculation your fragment<br />

shader can do.<br />

Fragment-level Evaluation<br />

In real life, few surfaces are really flat; this is a painful truth for the graphic artist,<br />

as it becomes so much harder to create realistic environments given the base<br />

primitives of 3D graphics. However, there are solutions, and Phong illumination<br />

on a fragment level gives you opportunities to ease the burden on the artist without<br />

the need for zillions of tiny triangles to simulate rough surfaces. Also, it would<br />

be wasteful to do all this work on every pixel without taking advantage of the possibilities<br />

this gives you. One could, for instance, just interpolate the normals and<br />

evaluate the Phong equation on each pixel. While this would certainly look better<br />

than normal per-vertex lighting, it would still look flat. Fortunately, the Phong<br />

illumination model still has room to improve<br />

this significantly. The solution is, of course,<br />

to store them in a texture and look them up<br />

on a per-pixel level instead of just interpolating<br />

the normals. This is what’s commonly<br />

called a normal map, or bump map. This will<br />

let you give surfaces properties that real<br />

surfaces tend to have, like roughness,<br />

bumpiness, and fine details. However, this<br />

introduces some important issues, and the<br />

full concept can be a significant threshold<br />

for many people to get over. Let’s take it<br />

from the beginning and study the issues that it raises in detail.

Figure 3: World space normals

So let’s assume that we have a texture with the normals stored. We sample<br />

this texture in our pixel shader and do the math. Will this work? Those who have<br />

tried (including me before I understood these issues) can assure you that it’ll look<br />

very odd and incorrect. It will look okay at some spots but wrong in most others.<br />

If the normal map was created exactly for the given direction of a polygon, it<br />

would work, but we can’t create a separate normal map for every direction that a<br />

texture may be located in our scene. Not only would our dear artist refuse to take<br />

on this tremendous job, but even if he did, we would bring the graphic card to its<br />

knees due to the extreme memory requirements. So this is obviously not an<br />

option. Ideally, we would want a base texture, a normal map, and a gloss map to go


together for each material. This is the solution I come to in the end, so why do I<br />

insist that just sampling a texture and doing the math requires a separate texture<br />

for each given possible direction? Consider a simple example: You are inside a<br />

cube. All six faces use the same texture and the same normal map. Now assume<br />

we want them all to look flat, so we store a constant normal (say for instance, (1,<br />

0, 0)) in the normal map. Applying this to all six faces will give us something like<br />

Figure 3. Of course, you’d want the normals to point into the box. The faces of the<br />

cube obviously have different normals, and in this case only one face has correct<br />

normals. It may appear impossible at first that the faces can share the same normal<br />

map given that they are oriented differently and have very different normals.<br />

Using a separate normal map seems to be the only solution at first. Fortunately,<br />

there’s a better solution.<br />

Tangent Space<br />


To solve the problem, we need to introduce the concept of a vector space. Imagine<br />

that we removed the axis pointers in Figure 3. How would we know which direction<br />

is the X direction? We wouldn’t! Why? Because the direction of X is nothing<br />

but an arbitrary choice that we have made. There’s no fundamental truth behind<br />

this choice. It’s just a choice as good as any. Imagine that we put X into Y’s position<br />

and vice versa. Suddenly the (1, 0, 0) normal would be incorrect for a face<br />

that it was correct for before. Not only that, but suddenly it’s correct for the face<br />

in the bottom of the cube. Now imagine that we used different meanings of X, Y,<br />

and Z for each face. What would that imply? (1, 0, 0) can be the correct normal for<br />

every face; we only need to adjust our coordinate system to suit the normals.<br />

This may seem backward, but it is an extremely handy thing to do in graphics.<br />

A vector space is basically just a coordinate system. You have three vectors<br />

defining the direction of each major axis. These are the vectors pointing in the X,<br />

Y, and Z directions, as defined by that vector space. There are two vector spaces<br />

that are important to us right now. First, the standard vector space we place all<br />

our objects into is called world space. This is the vector space you’ve been using<br />

even though you may not have realized it. As we place our objects in absolute<br />

coordinates, the world space is defined by the vectors (1, 0, 0), (0, 1, 0), (0, 0, 1).<br />

The other space that’s important to us is the so-called tangent space. It is defined<br />

by the tangent vectors and the surface normal. Note that we still need the surface<br />

normal even though we have normals stored in the normal map. The difference<br />

though is that the surface normal is a normal normal (no pun intended) — i.e., it’s<br />

defined in world space. The normal map, however, contains normals in tangent<br />

space. To better understand the concept of tangent spaces, try to think of a texture<br />

quad. The vectors that define this vector space are the ones that point in the<br />

direction of the U and V texture coordinate in world space. The normal points<br />

perpendicularly right up from the surface as usual. Figure 4 (on the following<br />

page) may help you understand the concept. The tangent space in this figure is<br />

thus defined by (0, 1, 0), (0, 0, –1), and (1, 0, 0), since the direction of the U texture<br />

coordinate points in the Y direction, V points in the –Z direction, and the<br />

normal points in the X direction.
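In practice, the tangent vectors are usually derived per triangle from the vertex positions and texture coordinates and then averaged and normalized per vertex. The following sketch shows the per-triangle step; it uses a simple Vec3 helper type and is a common construction given here for illustration, not code taken from this article's demo.

struct Vec3 { float x, y, z; };

static Vec3 Sub(Vec3 a, Vec3 b)    { Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
static Vec3 Scale(Vec3 a, float s) { Vec3 r = { a.x * s, a.y * s, a.z * s }; return r; }

// Per-triangle tangent basis from positions and texture coordinates.
// uVec follows the direction of the U texture coordinate, vVec the V direction;
// in a real tool these are accumulated per vertex, averaged, and normalized
// before being stored alongside the surface normal.
void TriangleTangents(const Vec3 p[3], const float u[3], const float v[3],
                      Vec3* uVec, Vec3* vVec)
{
    Vec3 e1 = Sub(p[1], p[0]);
    Vec3 e2 = Sub(p[2], p[0]);
    float du1 = u[1] - u[0], dv1 = v[1] - v[0];
    float du2 = u[2] - u[0], dv2 = v[2] - v[0];

    float r = 1.0f / (du1 * dv2 - du2 * dv1);        // assumes a non-degenerate mapping

    *uVec = Scale(Sub(Scale(e1, dv2), Scale(e2, dv1)), r);
    *vVec = Scale(Sub(Scale(e2, du1), Scale(e1, du2)), r);
}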



Now that we have our tangent space, what’s<br />

next? Well, we need to store the tangent space<br />

for each vertex in our geometry and pass that<br />

along with the vertex and texture coordinates to<br />

the vertex shader. The vertex shader needs to<br />

transform the light vector and reflection vector<br />

into tangent space and pass that along to the<br />

pixel shader. The pixel shader can then work as<br />

usual and use the normal from the normal map.<br />

An obvious question at this point is, of<br />

course, how do we create the normal map?<br />

Unfortunately, there’s no general method for creating<br />

a normal map from a base texture. Instead, the artist needs to create the normal map along with the base texture as two separate but obviously connected entities.

Figure 4: Tangent space

It's quite unintuitive to draw a normal

map, however; a height map is much more intuitive. It’s easier to think of white<br />

as high and black as low than it is to think of pink as pointing to the right and light<br />

green as pointing down, etc. Fortunately, there’s a general way to convert a height<br />

map into a normal map that can also be done at load time. All you need to do is<br />

apply a Sobel filter to every pixel. Fortunately, the concept of a Sobel filter is quite<br />

simple, but if that's still too much, you can resort to the D3DXComputeNormalMap function. Basically, a Sobel filter finds the slope of a grayscale picture. First you apply the Sobel filter in the X direction and then in the Y direction to form the vector (dX, dY, 1). Then normalize this vector, and you're done. The filter

kernels look like this:<br />

    –1  0  1        –1 –2 –1
    –2  0  2         0  0  0
    –1  0  1         1  2  1

If you're unfamiliar with the concept of filter kernels, just place the pixel that you're filtering right now in the middle square. Then multiply each pixel that each square covers with the number that's in that square and sum it all together. The result is your filtered value. So applying the left filter will give you dX, and applying the right one will give you dY.
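A load-time conversion along these lines might look like the sketch below. It assumes an 8-bit grayscale height map with clamped addressing at the borders and writes unpacked floating-point normals, leaving the packing into a D3DFMT_X8R8G8B8 texture (described in the next section) to the caller; the bumpScale parameter is an assumption that controls how strong the resulting bumps appear.

#include <math.h>

// Convert a grayscale height map into a normal map using the two Sobel kernels above.
// heights: width*height bytes; normals: width*height*3 floats (x, y, z per texel).
void HeightMapToNormalMap(const unsigned char* heights, int width, int height,
                          float* normals, float bumpScale /* assumed strength factor, e.g. 1.0f */)
{
    // Clamped fetch so the 3x3 filter also works along the borders.
    auto H = [&](int tx, int ty) -> float
    {
        if (tx < 0) tx = 0; if (tx >= width)  tx = width  - 1;
        if (ty < 0) ty = 0; if (ty >= height) ty = height - 1;
        return (float)heights[ty * width + tx];
    };

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            // Left kernel: slope in X. Right kernel: slope in Y.
            float dX = -H(x-1,y-1) + H(x+1,y-1)
                       - 2*H(x-1,y) + 2*H(x+1,y)
                       - H(x-1,y+1) + H(x+1,y+1);
            float dY = -H(x-1,y-1) - 2*H(x,y-1) - H(x+1,y-1)
                       + H(x-1,y+1) + 2*H(x,y+1) + H(x+1,y+1);

            // Form the vector (dX, dY, 1) and normalize it, as described in the text.
            float nx = dX * bumpScale / 255.0f;
            float ny = dY * bumpScale / 255.0f;
            float nz = 1.0f;
            float invLen = 1.0f / sqrtf(nx*nx + ny*ny + nz*nz);

            float* out = normals + (y * width + x) * 3;
            out[0] = nx * invLen;
            out[1] = ny * invLen;
            out[2] = nz * invLen;
        }
    }
}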

Implementation<br />

If you’ve read everything to this point, you are probably getting a little tired of all<br />

the theory. So without further ado, let’s dive straight into the implementation.<br />

The first thing we need to define is our vertex format. As we’ve concluded earlier<br />

in this text, the data we need is a vertex position, a texture coordinate, and our<br />

tangent space. This gives us this vertex format:<br />

struct TexVertex {<br />

Vertex vertex;<br />

float s, t;<br />

Vertex uVec, vVec, normal;<br />

};


Now we need to feed this info into the vertex shader. Feeding the vertex and texture<br />

coordinates into a vertex shader should be pretty much straightforward. It’s<br />

important to note at this time though that texture coordinates no longer need to<br />

be in any way related to textures. They are really nothing but generic interpolated<br />

properties. So we feed info into the vertex shader through texture coordinates<br />

and then pass new texture coordinates from the vertex shader into the pixel<br />

shader. So the vertex declaration looks like this:<br />

D3DVERTEXELEMENT9 texVertexFormat[] = {
    { 0, 0,                                      D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT,
      D3DDECLUSAGE_POSITION, 0},
    { 0, 1 * sizeof(Vertex),                     D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT,
      D3DDECLUSAGE_TEXCOORD, 0},
    { 0, 1 * sizeof(Vertex) + 2 * sizeof(float), D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT,
      D3DDECLUSAGE_TEXCOORD, 1},
    { 0, 2 * sizeof(Vertex) + 2 * sizeof(float), D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT,
      D3DDECLUSAGE_TEXCOORD, 2},
    { 0, 3 * sizeof(Vertex) + 2 * sizeof(float), D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT,
      D3DDECLUSAGE_TEXCOORD, 3},
    D3DDECL_END()
};

The vertex shader needs to compute the light vector and the view vector from<br />

the provided data. Thus, we need to provide the vertex shader with the camera<br />

position and light position. This is best done with vertex shader constants, as<br />

these attributes don’t change with the geometry in any way. Once the view and<br />

light vectors are done, we need to transform them into tangent space. The transformation<br />

is just a matrix multiplication, which is nothing but a set of dot products.<br />

As these are three-dimensional properties, we need only do a dp3 operation<br />

with each of uVec, vVec, and the normal. The resulting vertex shader ends up as

something like this:<br />

vs.2.0

dcl_position  v0
dcl_texcoord0 v1    // TexCoord
dcl_texcoord1 v2    // uVec
dcl_texcoord2 v3    // vVec
dcl_texcoord3 v4    // normal

// c0-c3 = mvp matrix
// c4 = camera position
// c5 = light position

// Transform position
m4x4 oPos, v0, c0

// Output texcoord
mov oT0, v1

sub r0, c5, v0      // r0 = light vector


dp3 oT1.x, r0, v2<br />

dp3 oT1.y, r0, v3<br />

dp3 oT1.z, r0, v4 // oT1 = light vector in tangent space<br />

sub r1, c4, v0 // r1 = view vector<br />

dp3 oT2.x, r1, v2<br />

dp3 oT2.y, r1, v3<br />

dp3 oT2.z, r1, v4 // oT2 = view vector in tangent space<br />

Everything should now be properly set up for the most important piece of code of<br />

ours, the pixel shader, which will do all the tough work. As everything is now in<br />

tangent space, we can carry on all operations as if all data, including the normal<br />

from the normal map, had been in world space. The pixel shader will be much longer,<br />

so we’ll go through it step by step instead of just printing all the code right<br />

here. Let’s start with the diffuse component.<br />

ps.2.0

dcl t0.xy
dcl t1
dcl_2d s0
dcl_2d s1

def c0, 2.0, 1.0, 0.0, 0.0          // (2.0, 1.0, unused ...)

texld r0, t0, s0                    // r0 = base
texld r1, t0, s1                    // r1 = bump

mad r1.xyz, r1, c0.x, -c0.y         // bump[0..1] => bump[-1..1]
dp3 r7.w, r1, r1
rsq r7.w, r7.w
mul r1.xyz, r1, r7.w                // r1 = post-filter normalized bump map

dp3 r7, t1, t1
rsq r7.w, r7.x
mul r3, t1, r7.w                    // r3 = normalized light vector

dp3_sat r4, r3, r1                  // r4 = diffuse
mul r4, r4, r0                      // r4 = base * diffuse
mov oC0, r4

This should be pretty straightforward. We begin by sampling our base texture and<br />

grabbing the normal from the bump map. We could have used floating-point textures<br />

given that normals can have components that range from –1 to 1, but that<br />

would reduce performance without a whole lot of image quality improvement.<br />

Actually, it would reduce the image quality on current hardware, since at the time<br />

of publication no hardware is available that supports filtering on floating-point<br />

textures. Instead, we take the traditional approach of packing it into a normal<br />

D3DFMT_X8R8G8B8 texture. This means that we have to unpack it in our


shader though, and that’s the mad (as in “multiply” and “add,” not “crazy”)<br />

instruction right after the sampling. Note that the linear filter on the normal map<br />

isn’t really that suitable for normals, so after the filtering, the normal may no longer<br />

be of unit length but rather slightly shorter. This may not matter a whole lot<br />

for diffuse, but it does matter quite a lot for specular. If the length is 0.99 instead<br />

of 1.0 and you raise it to, say, 24, it’ll end up not with the wanted 1.0 but rather<br />

something much lower, 0.99^24 ≈ 0.785, which will make our specular highlights

significantly less sharp. So the post-filter normalization is certainly needed,<br />

though maybe not this early, but it doesn’t hurt to use a better normal for diffuse<br />

also. The normalization process is quite simple. As you may remember from linear<br />

algebra, a vector dot multiplied with itself is the squared length of that vector.<br />

So what we do is take the inverse square root of that squared length, which gives<br />

us the inverse of the length. Multiply the vector with the inverse length, and the<br />

normalization is done. The same is then done to the light vector. After the<br />

normalizations, we can just do the dot product between these vectors, multiply<br />

that with the base texture, and our diffuse is done. Note that we use dp3_sat as

opposed to just dp3. This is so that all negative dot products get clamped to zero.<br />

We don’t want negative light, remember?<br />

So far, the output doesn’t look particularly impressive. The most obvious<br />

drawback is the lack of attenuation. So far, only the angle matters, not how far<br />

from the light the surface is. We’ll remedy the problem right away. So we’ll need<br />

this piece of code inserted right after the light vector normalization:<br />

// c2 = (constant attenuation, quadratic attenuation, unused ...)<br />

mad r5, c2.y, r7, c2.x<br />

rcp r5, r5.x // r5 = attenuation<br />


This will give us our wanted attenuation factor, which we can multiply with our<br />

diffuse to get a properly attenuated light. So the last step in the shader changes is<br />

as follows:<br />

mul r4, r4, r0 // r4 = base * diffuse<br />

mul r4, r4, r5 // r4 = base * diffuse * attenuation<br />

mov oC0, r4<br />


Next up is our specular. To begin with, we need to sample our gloss map. It has<br />

the same texture coordinates as the base texture and normal map, so it’s straightforward<br />

to add. As you may remember from<br />

our vertex shader above, we get our view<br />

vector in t2. So we normalize as we did with<br />

the light vector. We then need to compute<br />

the reflection vector. The reflection vector is<br />

illustrated in Figure 5.

Figure 5: Reflection vector

Once the reflection vector is done, we basically just need to do the dot product,

raise it to a power, and multiply with the gloss and we’re done. We’ll add the specular<br />

exponent to the first constant. The code ends up something like this:




dcl t2
dcl_2d s2
...
def c0, 2.0, 1.0, 24.0, 0.0    // (2.0, 1.0, specular exponent, 0.0)
...
texld r2, t0, s2               // r2 = gloss
...
dp3 r7, t2, t2
rsq r7.w, r7.x
mul r6, t2, r7.w               // r6 = normalized view vector

dp3 r7, r3, r1
mul r7, r7, c0.x
mad r3, r7, r1, -r3            // r3 = reflection vector
dp3_sat r3, r3, r6
pow r3, r3.x, c0.z             // r3 = specular
mul r3, r3, r2                 // r3 = specular * gloss

Given the discussion above, there shouldn’t be a whole lot of questions about this<br />

code. Now we just need to combine it with the diffuse component. The last piece<br />

of code is as follows:<br />

mad r4, r4, r0, r3 // r4 = base * diffuse + specular * gloss<br />

mul r4, r4, r5 // r4 *= attenuation<br />

mov oC0, r4<br />

The last piece of the equation that remains is the ambient, which is also the simplest to implement. So without further ado, let's go right to the task. We need to pass the ambient factor to the shader. There are some unused components in our c2 constant, so we'll just use one of these. Then we only need to squeeze another instruction into the final combining code.

// c2 = (constant attenuation, quadratic attenuation, ambient, unused)
mad r4, r4, r0, r3 // r4 = base * diffuse + specular * gloss
mul r4, r4, r5 // r4 *= attenuation
mad r4, r0, c2.z, r4 // r4 += base * ambient
mov oC0, r4

Yes, that's it. The Phong model is now complete and ready for some serious action.

Aliasing

While we already get pretty good results, there are still a couple of issues that need to be addressed. One such issue is aliasing. You probably already know why we use techniques like mipmapping. If you don't have the mathematical background, you probably at least know from experience that not using mipmapping will cause severe shimmering artifacts on objects at a distance. Why is that? The mathematical explanation is that it violates the Nyquist criterion. That probably sounds like Greek to most people; only those with a signal processing background will be familiar with it. Basically, we are stating that the frequencies present in the texture are higher than half the sampling rate, which may only confuse you more, but it's actually quite an easy concept to understand, even though it would take a higher degree of mathematical skill to do the reasoning rigorously. Assume we are rendering to a resolution of 256x256, a resolution that will hardly ever be used in real life, but for this example it makes the issues easy to understand. Assume we also have a 256x256 texture containing a checkerboard pattern (that is, every other pixel is black and white). Ignoring that we usually have linear filtering, it would appear that mapping this texture onto the full screen will work just fine: Every other pixel gets black and white. Now assume we map it to the upper-left 128x128 pixels. Only every other pixel from the texture will end up on screen (still ignoring filters), so by seemingly unfortunate bad luck we get only the black pixels. Obviously, information got lost in the process. It's hard to get something useful in this situation either way, but at least we would want all pixels in the texture to contribute to the final result, producing some kind of gray. Alright, you say, and you point out that this is exactly what a linear filter will do for us. True, in this case using a linear filter would be enough, but then consider another checkerboard texture, this time with each 2x2 block of pixels being either white or black. Mapping this to either 256x256 or 128x128 will work just fine. Now map it to 64x64 and consider the results. We're back in the same situation: we will get nothing but black, as nothing from the white 2x2 pixel blocks will ever be touched by the linear filter. Obviously, information once again got lost. Ideally, we would want every 4x4 block in the texture to contribute to each pixel. This is basically what mipmapping does. It tries to match the pixel and texel rates by using smaller down-sampled textures to better fit the spacing between where in the texture each pixel would sample. So when mapping to a 256x256 pixel area, the full 256x256 mipmap would be used, while when mapping it to a 64x64 pixel area, it would use a 64x64 mipmap. For anything in between, it would interpolate between the two closest mipmap levels for a smooth transition. Doing this should effectively get rid of all kinds of texture shimmer artifacts related to texture sampling.

So what's up with all this theory? The problem is solved, right? Well, I'd love that to be true. Unfortunately, it's not. During the DirectX 7 era, one could pretty much state that it was a solved problem, but with the pixel shaders of today, we are basically back at square one again. Why? Well, during the DirectX 7 era, textures were combined with simple arithmetic operations, like modulating a base texture with a lightmap and possibly adding an environment map onto that. Simple arithmetic operations like multiplications and additions don't change the frequency properties of the texture, so as long as you use these simple operations, you'll be fine. Unfortunately, this is not the case with operations like dot products, which basically kick all the assumptions behind mipmapping out the window. This means that we once again see shimmering. Since the trend is that multisampling replaces supersampling as the preferred anti-aliasing technique, we won't get any help there either. The situation is not as horrible as it may first appear, however. We just need to be aware of the problem and carefully tackle it. While mipmapping may no longer perfectly match our source, it certainly helps us a lot. Again, what's the reason for shimmering? There are too-high frequencies in the source material. What can we do about it? Reduce the high-frequency components in our textures; in plain English, use blurrier textures. It is important to note that there's no need to use a blurrier base texture, since it will only be part of simple arithmetic operations. Our main target is instead our normal map, and to some extent the gloss map. The general advice is to avoid having sharp contrasts in the normal map. You also don't necessarily need to use the whole 0-to-1 range when creating your height map. Sharp contrasts in the gloss map are generally not desired either; smoother transitions in the gloss map can help hide the aliasing artifacts slightly. It's also noteworthy that a high specular exponent, while giving sharper and generally better-looking specular highlights, also adds to the aliasing, so these two factors need to be balanced. Some good advice is to use a blurrier normal map the higher the specular exponent is. That is, a shiny surface will need a blurrier normal map, while a matte surface may do fine with fairly high contrasts in the normal map. Aliasing certainly occurs from the diffuse too, so you can't use normal maps that are too sharp for dull surfaces either. It's also important to note that the artifacts tend to occur in the lower mipmap levels, so it may help to not only downsample the previous mipmap level when creating the mipmap chain but also apply a soft blur filter.
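As one way to act on that advice, here is a rough C++ sketch of building a mip level with a 2x2 box downsample followed by a gentle 3x3 blur on a single float channel; the filter choice is mine and only illustrates the idea.

#include <algorithm>
#include <vector>

// Downsample one channel of a (size x size) mip level to half size with a 2x2 box
// filter, then soften the result with a 3x3 box blur to pull high frequencies down.
std::vector<float> DownsampleWithBlur(const std::vector<float>& src, int size)
{
    int half = size / 2;
    std::vector<float> mip(half * half), blurred(half * half);

    // 2x2 box downsample.
    for (int y = 0; y < half; ++y)
        for (int x = 0; x < half; ++x)
        {
            float sum = src[(2 * y) * size + 2 * x] + src[(2 * y) * size + 2 * x + 1]
                      + src[(2 * y + 1) * size + 2 * x] + src[(2 * y + 1) * size + 2 * x + 1];
            mip[y * half + x] = sum * 0.25f;
        }

    // 3x3 box blur with clamped addressing.
    for (int y = 0; y < half; ++y)
        for (int x = 0; x < half; ++x)
        {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                {
                    int sx = std::clamp(x + dx, 0, half - 1);
                    int sy = std::clamp(y + dy, 0, half - 1);
                    sum += mip[sy * half + sx];
                }
            blurred[y * half + x] = sum / 9.0f;
        }
    return blurred;
}

For a normal map, each component would be filtered this way and the vectors renormalized afterward.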

If you work for a game or content creation company, it's important that you make sure the artists understand these issues. Unlike many other issues that can be handled graciously by the programmer, this one requires awareness from the artists. The best thing the programmer can do is educate the artists and provide good tools for previewing the material.

Shadows

There is one thing left that seriously hurts the impression of reality, and that's the lack of shadows. It would be wasteful to spend all this time implementing Phong illumination and leave it in this state. There are several shadowing techniques to choose from, some of them existing in several different forms. Unfortunately, they all suck in one way or another. The two most common are stencil shadows and shadow mapping. The advantages of stencil shadows are that the shadows are pixel accurate and stenciling is widely supported. The disadvantages are that it's slow, not particularly scalable, hard to implement, and not very general, and it may interfere with some anti-aliasing techniques. The advantages of shadow mapping are that it's reasonably fast, quite scalable, easy to implement, and very general. The disadvantage is that the shadows are prone to aliasing. It has enough pluses, though, to make it my shadow technique of choice.

The idea behind shadow mapping is simple. In the first pass, you render the distance to the light into a texture from the light's point of view. Then in the second pass, you check the distance to the light against what's stored in the texture from pass 1. If the distance is larger than the stored value, obviously some other object in the same line of view covers the light, which implies that the point is in shadow. Otherwise, it's lit. Quite simple, isn't it? In our case, we use omnidirectional lights, so we need to render to a cube map instead of a normal texture. As we're only interested in distance and not colors, etc., we can use a much simpler pass: no textures, just plain geometry. For that we need a pair of simple shaders.

vs.2.0
dcl_position v0
// c0-c3 = mvp matrix
// c5 = light position
// Transform position
m4x4 oPos, v0, c0
sub oT0, c5, v0 // oT0 = light vector

It can't be simpler; just compute the light vector. No tangent spaces or anything, just a subtraction and we're done. The pixel shader isn't any more complex.

ps.2.0
dcl t0
dp3 r0, t0, t0
mov oC0, r0


The dot product of the light vector with itself gives the squared length of the light vector. Normally, one would compare the distances, but the squared distances work just as well and give a significant speed boost. There is an issue we need to take care of for this to work well, however: When comparing with the stored distance, there will unavoidably be precision errors due to the finite resolution of our shadow map and the limited number of bits. For this reason, you need to bias the distance to give some headroom for precision errors. Normally, you would just add a small number. However, if we're using squared distances, this won't work very well due to the non-linear spacing that we have. It would effectively make our bias smaller and smaller with distance, and artifacts would soon be visible. If we used a larger bias, we would instead get problems with missing shadows close to the light. Unfortunately, there's no optimal bias in between either; rather, we could find biases that cause both artifacts. Instead, we take a different approach: We just multiply the distance by a constant slightly less than 1. This instead defines the allowed error as a certain percentage of the distance, which works much better. Only very close up to the light will there be artifacts. If this is a problem, there's still the option of using the linear distance rather than the squared distance (but at a performance cost, of course).
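The comparison itself is tiny; this C++ sketch spells out the squared-distance test with the multiplicative bias, using 0.97 as an assumed bias factor (the same value the shader constant uses later).

#include <cstdio>

// Returns 1.0f if the pixel is lit, 0.0f if it is in shadow.
// storedDistSq comes from the shadow cube map; pixelDistSq is the squared distance
// from the light to the pixel being shaded. biasFactor < 1 scales the receiver's
// distance down, so the allowed error grows proportionally with distance.
static float ShadowFactor(float storedDistSq, float pixelDistSq, float biasFactor = 0.97f)
{
    return (storedDistSq - pixelDistSq * biasFactor >= 0.0f) ? 1.0f : 0.0f;
}

int main()
{
    std::printf("%.1f\n", ShadowFactor(25.0f, 25.5f)); // small error: still lit
    std::printf("%.1f\n", ShadowFactor(25.0f, 64.0f)); // occluder in front: shadowed
    return 0;
}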

Note that squared distances will return quite large numbers, certainly larger than 1 in general, unless we use a very small world. So we'll need a floating-point texture to store them in. We could use a normal fixed-point texture too, but then we'd need to scale the values down so that we never get anything larger than 1. We can't allow clamping, as that would destroy our shadows. Also, floating point better suits our quadratic representation of distance. So the best choice for us is a D3DFMT_R32F texture. Note that some pixel shader 2.0 hardware doesn't support floating-point cube maps, but otherwise this is an ideal format, as it is single channel and floating point with high precision. If you need to support such hardware, you're better off just using the linear distance instead.

To implement shadows, we also need to change our lighting shaders. Our vertex shader will receive another line:

mov oT3, -r0 // oT3 = shadow map coordinate

This line isn't obvious just by looking at it; instead you must take a look at the old vertex shader and see that r0 will contain the light vector from earlier computations (that is, the light position minus the vertex position). We want to look up in the cube map in the direction from the light position toward the vertex position (that is, the exact opposite direction of the light vector). So that's how we come up with -r0. The pixel shader gets more extensive additions. First we need some basic setup, and then we sample the shadow map:

dcl t3
dcl_cube s3
...
def c1, 0.97, 1.0, 0.0, 0.0 // (biasfactor, averaging factors)
...
texld r8, t3, s3 // r8 = shadow map

Then right after we normalize the light vector, we'll squeeze in an instruction to compute the biased distance to the light. r7.x contains the squared distance to the light from the previous calculations above.

mul r8.y, r7.x, c1.x // r8.y = lengthSqr(light vector) * biasfactor

We now need to get a shadow factor (that is, 0 if we're in shadow and 1 otherwise). So we'll compare and grab a 0 or 1 from our c1 constant, depending on the outcome of the comparison.

sub r8.x, r8.x, r8.y
cmp r8.x, r8.x, c1.y, c1.z // r8.x = shadow factor

Now we only need to multiply this with our diffuse and specular components. The ambient will be left alone, though, as we want ambient to be visible in shadowed areas too. So the component combining is changed to this:

mad r4, r4, r0, r3 // r4 = base * diffuse + specular * gloss
mul r4, r4, r5 // r4 *= attenuation
mul r4, r4, r8.x // r4 *= shadow factor
mad r4, r0, c2.z, r4 // r4 += base * ambient
mov oC0, r4

Ta da, we have shadows! We could leave it at this and be fairly satisfied. That doesn't mean there are no improvements left to be made, however. Sure enough, I have another trick for you. While the shadows created with the above code look fairly good, there is a problem. If the shadow map is of low resolution (say 256x256), we will get pixelation of the shadows, with obvious stair-stepping along the edges. What can we do about it? Well, we could increase the resolution of our shadow map, but that will quickly kill our performance. Rendering to a 512x512 shadow map requires four times the fill rate of rendering to a 256x256 shadow map. Instead, let's try to anti-alias our shadows. How can we do that? By taking several samples and averaging them. So we just take the normal shadow map sampling position, add an arbitrary constant to offset it slightly, and take another sample. Take three additional samples for a total of four to get a decent smoothing of the edges. So we need to provide three additional sampling positions from the vertex shader.

def c8, 1.0, 2.0, -1.0, 0.0
def c9, 2.0, -1.0, 1.0, 0.0
def c10, -1.0, 1.0, 2.0, 0.0
...
sub oT4, c8, r0
sub oT5, c9, r0
sub oT6, c10, r0

The pixel shader gets its fair share of additions also. The changes are pretty straightforward. First we just sample at the newly provided sample positions:

dcl t4
dcl t5
dcl t6
...
texld r9, t4, s3 // r9 = shadow map
texld r10, t5, s3 // r10 = shadow map
texld r11, t6, s3 // r11 = shadow map
...

Then we need to revise the shadow factor calculation slightly. Let's use 0.25 instead of 1.0 for obvious reasons. We accumulate the results from all sample comparisons in r8.x, so the final combining code remains the same:

def c1, 0.97, 0.25, 0.0, 0.0 // (biasfactor, averaging factors)
...
sub r8.x, r8.x, r8.y
sub r9.x, r9.x, r8.y
sub r10.x, r10.x, r8.y
sub r11.x, r11.x, r8.y
cmp r8.x, r8.x, c1.y, c1.z
cmp r9.x, r9.x, c1.y, c1.z
cmp r10.x, r10.x, c1.y, c1.z
cmp r11.x, r11.x, c1.y, c1.z
add r8.x, r8.x, r9.x
add r8.x, r8.x, r10.x
add r8.x, r8.x, r11.x

And that's it. The shadows should now look much smoother. If we look closely, though, we can still see stair-stepping; more samples would solve that. I'll leave that as an exercise for those who are interested.
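As a starting point for that exercise, here is a small C++ sketch that averages an arbitrary number of shadow comparisons into one factor; the offsets themselves would still come from extra interpolators, exactly as in the four-sample version.

#include <cstdio>
#include <vector>

// Average an arbitrary number of shadow map comparisons into one soft shadow factor.
// Each entry of storedDistSq is the value fetched at one offset sample position.
static float SoftShadowFactor(const std::vector<float>& storedDistSq,
                              float pixelDistSq, float biasFactor = 0.97f)
{
    if (storedDistSq.empty())
        return 1.0f;
    float biased = pixelDistSq * biasFactor;
    float sum = 0.0f;
    for (float stored : storedDistSq)
        sum += (stored - biased >= 0.0f) ? 1.0f : 0.0f;
    return sum / static_cast<float>(storedDistSq.size());
}

int main()
{
    // Two of four samples see an occluder: the edge pixel gets a half shadow.
    std::printf("%.2f\n", SoftShadowFactor({ 25.0f, 25.0f, 9.0f, 9.0f }, 25.0f));
    return 0;
}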

We've come a long way. We have implemented something that was hardly possible to do in real time just a year ago. It's fascinating how far graphics technology has advanced recently, and we're still moving. As mentioned several times in this article, we are still doing many things that are hardly real, but as technology goes forward, I hope we can overcome these problems also. I hope we can join up some time in the future and implement real soft shadows, real indirect lighting, and real displaced geometry instead of normal-mapped simulations. See you then.


Specular Bump Mapping on Pre-ps_1_4 Hardware

Matthew Halpin

Introduction

This article presents a selection of techniques that can be used to improve the quality and flexibility of specular bump mapping on pre-ps_1_4 hardware. The techniques are targeted at ps_1_1 hardware, with certain optimizations presented for higher pixel shader versions. It should be noted that these techniques are strictly suitable for pre-ps_1_4 hardware; there are more efficient and elegant solutions for ps_1_4 and better hardware. There exists extensive documentation for per-pixel lighting on these higher pixel shader versions (see [1] and [2]).

The following diagrams are provided to define some terms that will be used extensively in this chapter.

Figure 1: Phong shading terms

Figure 2: Blinn shading terms


The equation for the specular lighting term using Phong shading is:

S = C * (R.L)^P

where S is the final specular value, C is the specular light value, R is the reflected eye vector, L is the light vector, and P is the surface specular power.

The equation for the specular lighting term using Blinn shading is:

S = C * (N.H)^P

where S is the final specular value, C is the specular light value, N is the surface normal, H is the interpolated half vector, and P is the surface specular power.
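To keep the two models straight, here is a minimal C++ sketch of both terms, assuming unit-length inputs; Phong reflects the eye vector and dots it with L, while Blinn dots N with the half vector. It is only a reference illustration of the two equations above.

#include <cmath>

struct Vec3 { float x, y, z; };

static float Dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Reflect v about n (both assumed unit length).
static Vec3 Reflect(const Vec3& v, const Vec3& n)
{
    float d = 2.0f * Dot(n, v);
    return { d * n.x - v.x, d * n.y - v.y, d * n.z - v.z };
}

// S = C * (R.L)^P, with R the reflected eye vector.
static float PhongSpecular(float c, const Vec3& eye, const Vec3& n, const Vec3& l, float p)
{
    float d = Dot(Reflect(eye, n), l);
    return c * std::pow(d > 0.0f ? d : 0.0f, p);
}

// S = C * (N.H)^P, with H the (interpolated) half vector.
static float BlinnSpecular(float c, const Vec3& n, const Vec3& h, float p)
{
    float d = Dot(n, h);
    return c * std::pow(d > 0.0f ? d : 0.0f, p);
}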

All techniques discussed in this article make use of per-vertex tangent space data (tangent, normal, and bi-normal). See [2] for an explanation of tangent spaces.

There are a number of aspects that affect the quality of a given specular bump mapping technique. Here are the ones relevant to this article:

• Half vector or reflected eye vector. A technique can use an interpolated half vector (Blinn shading), or it can interpolate a vertex-to-eye vector and calculate the reflection vector through the pixel normal (Phong shading). In the case of Blinn shading, the interpolated half vector is combined with the per-pixel normal using the dot product to find the parameter for the specular power function. In the case of Phong shading, the per-pixel reflected vector is combined with an interpolated light vector using the dot product to find the parameter for the specular power function. Phong shading generally gives a nicer-looking highlight but requires more work per pixel.

• Normalized half or reflected eye vector. When interpolating a vector between vertices, the length of the vector can be shortened. This can cause problems for vectors that are meant to be of unit length (light vector, half vector, eye vector). A technique can ignore this problem on the assumption that the mesh will be tessellated enough for the vector shortening to not have a visible effect, or it can normalize the vector using a normalization cube map or shader arithmetic.

• Per-pixel specular power. A technique can allow a per-pixel specular power value to be looked up from a texture, or it can allow a fixed specular power value for each primitive render. Per-pixel specular power normally imposes a fill-rate cost on the technique but allows meshes to be batched together if they normally would have been rendered separately due to requiring different specular values. Per-pixel specular power primarily provides more flexibility for the artwork.

• Per-pixel specular level (gloss). This is a per-pixel value that is used to modulate the specular pass. It can be used to vary the surface material between being shiny (e.g., metal) and being dull (e.g., rust).

• Arbitrarily high specular exponent. Some techniques may be limited in the range of specular power values that they are able to work with. Generally, high values will be unavailable rather than low values.

• Amount of banding for high specular exponents. Raising a value to a power is not directly supported by ps_1_x hardware. There are a number of solutions that can broadly be split into two categories: texture lookups and arithmetic operations. Precision problems often occur when using arithmetic operations to raise a value to a high power (e.g., a value can be raised to a power of 16 by doing four successive multiply operations, but banding artifacts will be visible). This is because ps_1_1 hardware generally has limited-precision fixed-point arithmetic units, and each multiply loses precision.

All the final solutions presented below use Blinn shading (using half vectors) rather than Phong shading.

In the following sections, techniques are presented for achieving various combinations of the aspects presented above. This includes some alternatives for the aspects that are of lower quality, as these are still useful as optimizations for rendering objects that may not necessarily require the higher quality techniques.

Standard Shader

Here is an example standard shader that could be used to do specular (and diffuse) bump mapping:

; c0 - c3 = local to clip space matrix
; c4 = camera position in local space
; c5 = light position in local space
vs_1_1
dcl_position0 v0
dcl_texcoord0 v1
dcl_normal v2
dcl_tangent v3
dcl_binormal v4
m4x4 oPos, v0, c0 ; transform position into screen space
mov oT0, v1 ; output uv for diffuse texture
mov oT1, v1 ; output uv for normal map
sub r0, c4, v0 ; vertex to camera vector
dp3 r0.w, r0, r0
rsq r0.w, r0.w
mul r0, r0, r0.w ; Normalized view dir in r0
sub r1, c5, v0 ; vertex to light vector
dp3 r1.w, r1, r1
rsq r1.w, r1.w
mul r1, r1, r1.w ; Normalized light dir in r1
add r0, r0, r1 ; add view and light vectors
dp3 r0.w, r0, r0
rsq r0.w, r0.w
mul r0, r0, r0.w ; Normalized half vector in r0
dp3 oT2.x, v3.xyz, r1.xyz
dp3 oT2.y, v4.xyz, r1.xyz
dp3 oT2.z, v2.xyz, r1.xyz ; Tangent space light dir in oT2
dp3 oT3.x, v3.xyz, r0.xyz
dp3 oT3.y, v4.xyz, r0.xyz
dp3 oT3.z, v2.xyz, r0.xyz ; Tangent space half vector in oT3

; c0 = diffuse color
; c1 = specular color
ps_1_1
tex t0 ; Diffuse texture
tex t1 ; Normal map
texm3x2pad t2, t1_bx2 ; u = (N.L)
texm3x2tex t3, t1_bx2 ; v = (N.H)
mul r0, t0, c0 ; diffuse texture * diffuse light color
mul r0, r0, t3.a ; diffuse * (N.L)
mad r0, t3, c1, r0 ; (((N.H)^p) * specular) + diffuse

This shader uses (N.L) as the texture u coordinate and (N.H) as the texture v coordinate. The texture contains u in the alpha component and v^p in the RGB components. The light vector and half vector are not normalized per pixel.

Per-pixel Specular Power

Techniques have been investigated that use arithmetic instructions to achieve per-pixel variable specular power (see [3]). The most significant disadvantage of this approach is the banding that occurs due to precision problems. The approach presented in this article uses texture lookups to evaluate the specular power function for a number of specular power values and then uses arithmetic instructions to interpolate the final value. For example, the specular power function may be evaluated for power values of 2, 10, 30, and 60, and a value of 45 can then be achieved by interpolating halfway between the 30 and 60 values. The per-pixel specular power is stored as a single texture channel with the range [0, 1]. This maps directly to the range of specular powers; hence a value of 0 means a power of 2, and a value of 1 means a power of 60.

It is possible to evaluate the specular power function at up to four values due to textures having up to four channels (alpha, red, green, blue). Each channel can store the precomputed result for one specular power value. This can be achieved by precalculating a 1D texture that stores u^p for four different values of p, one in each channel, and then doing a texture lookup with the result of the lighting dot product (N.H) as the u coordinate.
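Building that lookup on the CPU is straightforward; the following C++ sketch uses the example exponents 2, 10, 30, and 60 and writes four 8-bit channels per texel. The texture width and quantization are illustrative choices, and uploading the data to a D3D texture is left out.

#include <cmath>
#include <cstdint>
#include <vector>

// Build a width-entry 1D lookup: each texel holds u^p for four exponents,
// one per channel (A, R, G, B), with u = (N.H) used as the texture coordinate.
std::vector<uint8_t> BuildSpecularPowerTexture(int width)
{
    const float powers[4] = { 2.0f, 10.0f, 30.0f, 60.0f }; // A, R, G, B
    std::vector<uint8_t> texels(width * 4);
    for (int i = 0; i < width; ++i)
    {
        float u = (width > 1) ? float(i) / float(width - 1) : 0.0f;
        for (int c = 0; c < 4; ++c)
        {
            float v = std::pow(u, powers[c]);
            texels[i * 4 + c] = static_cast<uint8_t>(v * 255.0f + 0.5f);
        }
    }
    return texels;
}

A width of 256 is usually plenty when the result is quantized to 8 bits anyway.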

The next step in the shader is to use the per-pixel specular power value to interpolate between these four specular results. Initially this will be achieved using another texture operation for simplicity, but this will later be converted to arithmetic instructions.

An interpolation between two values can be achieved using this equation:

R = (t * a) + ((1 - t) * b)


This is of the form of a weighted sum of a and b, where the weight for a is t and the weight for b is (1 - t). Here are the graphs for these weights, given t:

Figure 3

This needs to be extended to interpolate between four values. Value a must have a weight of 1 when t = 0, and b, c, and d must have weights of 0. Likewise, b must have a weight of 1 when t = 1/3, c when t = 2/3, and d when t = 1. The graphs for wa, wb, wc, and wd are given below:

Figure 4



Texture-Encoded Interpolation

These weights can be encoded in a texture in a similar way as described above for evaluating the specular power function. In this case, wa goes in the alpha channel, wb in the red channel, wc in green, and wd in blue. This texture then needs to be read with the per-pixel specular power value as the u coordinate.
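The weight texture can be precalculated the same way. This C++ sketch evaluates the four triangle-shaped weight functions, with channel A peaking at t = 0, R at 1/3, G at 2/3, and B at 1, matching the channel assignment in the text; the width and 8-bit quantization are again just illustrative choices.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Triangle weight centered at 'center' with a half-width of 1/3, clamped to [0, 1].
static float Weight(float t, float center)
{
    return std::max(0.0f, 1.0f - std::fabs(t - center) * 3.0f);
}

// 1D texture addressed by the per-pixel specular power value t in [0, 1].
// Channels: A = wa (peak at 0), R = wb (1/3), G = wc (2/3), B = wd (1).
std::vector<uint8_t> BuildWeightTexture(int width)
{
    const float centers[4] = { 0.0f, 1.0f / 3.0f, 2.0f / 3.0f, 1.0f };
    std::vector<uint8_t> texels(width * 4);
    for (int i = 0; i < width; ++i)
    {
        float t = (width > 1) ? float(i) / float(width - 1) : 0.0f;
        for (int c = 0; c < 4; ++c)
            texels[i * 4 + c] = static_cast<uint8_t>(Weight(t, centers[c]) * 255.0f + 0.5f);
    }
    return texels;
}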

The final step is to evaluate the weighted sum:

R = (wa * a) + (wb * b) + (wc * c) + (wd * d)

This equation is the same as a 4D dot product. Certain pixel shader versions have a dp4 instruction, but where this isn't available, a dp3 followed by a mad can be used as follows:

dp3 r0, t1, t2
mad r0, t1.a, t2.a, r0

The dp3 instruction only modulates and sums the red, green, and blue components, so the alpha component must be added with the mad instruction.

Here is the final pixel shader:

; c0 = specular color
ps_1_2
tex t0 ; read normal map with specular power in alpha
texdp3tex t1, t0_bx2 ; read (N.L)^p for 4 different values of p
texreg2ar t2, t0 ; read weights for each component
dp3 r0, t1, t2 ; dp3
mad r0, t1.a, t2.a, r0 ; extend dp3 to dp4
mul r0, r0, c0 ; specular * specular light color

NOTE This shader only evaluates specular lighting, not diffuse.

The texreg2ar instruction uses the source texture alpha component as the u coordinate and the source texture red component as the v coordinate. In this case, it is being used to look up into the 1D weights texture, so the red component is ignored.

The pixel shader instruction texdp3tex is only available in pixel shader versions 1.2 and 1.3, so for version 1.1 the texm3x2tex instruction can be used instead, though a wasted texm3x2pad instruction must be included as well. Here is the final ps_1_1 shader:

; c0 = specular color
ps_1_1
tex t0 ; read normal map with specular power in alpha
texm3x2pad t1, t0_bx2
texm3x2tex t2, t0_bx2 ; read (N.L)^p for 4 different values of p
texreg2ar t3, t0 ; read weights for each component
dp3 r0, t2, t3 ; dp3
mad r0, t2.a, t3.a, r0 ; extend dp3 to dp4
mul r0, r0, c0 ; specular * specular light color


Arithmetic Interpolation

The interpolation of the four specular values can be implemented as arithmetic instructions instead of a texture lookup. This has the advantage of reducing texture bandwidth as well as making the technique compatible with more complex specular solutions that require all four texture stages.

The graphs for wa, wb, wc, and wd need to be created using pixel shader arithmetic instructions. Here is the shader fragment that achieves this:

; c0 = -0.33333, 0.0, 0.33333, 0.66666
; c1 = 0.0, 0.33333, 0.66666, 1.0
; c2 = 0.75, 0.75, 0.75, 0.75
sub_sat r0, t0.a, c0 ; offset rising edges
mul_x4_sat r0, r0, c2 ; scale rising edges
sub_sat r1, t0.a, c1 ; offset falling edges
mul_x4_sat r1, r1, c2 ; scale falling edges
sub_sat r0, r0, r1 ; combine rising and falling edges

NOTE The input per-pixel specular power value is in t0.a; r0 receives the four calculated weights used to interpolate the final specular value.

This shader calculates each weight in parallel (in each component). It consists of three parts: constructing the rising edge of the triangle, constructing the falling edge of the triangle, and combining the two. The rising and falling edges have gradients of 3 and -3, but the pixel shader constants can only be in the range of -1 to 1. Hence, a _x4 modifier must be used in combination with multiplying by 0.75 to achieve a multiply by 3.
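The same edges are easy to verify in C++; this sketch reproduces the sub_sat/mul_x4_sat sequence for one scalar power value, with the saturation written out explicitly. It is a reference check, not part of the shader.

#include <algorithm>
#include <cstdio>

static float Saturate(float x) { return std::min(1.0f, std::max(0.0f, x)); }

// Reproduces the shader fragment for a single channel: a weight built from a
// rising edge starting at 'rise' and a falling edge starting at 'fall' (= rise + 1/3).
static float EdgeWeight(float t, float rise, float fall)
{
    float rising  = Saturate(Saturate(t - rise) * 0.75f * 4.0f); // sub_sat + mul_x4_sat
    float falling = Saturate(Saturate(t - fall) * 0.75f * 4.0f);
    return Saturate(rising - falling);                           // final sub_sat
}

int main()
{
    // The four channels use the offsets held in c0 and c1 in the fragment above.
    const float rise[4] = { -1.0f / 3.0f, 0.0f, 1.0f / 3.0f, 2.0f / 3.0f };
    const float fall[4] = {  0.0f, 1.0f / 3.0f, 2.0f / 3.0f, 1.0f };
    for (int c = 0; c < 4; ++c)
        std::printf("channel %d weight at t=0.5: %.3f\n", c, EdgeWeight(0.5f, rise[c], fall[c]));
    return 0;
}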

In this section, texture and arithmetic solutions have been provided for achieving an approximation of per-pixel variable specular power using segmented interpolation.

Normalized Specular Bump Mapping

The example shaders given previously all suffer from denormalized half vectors. This happens when a vector that is meant to be of unit length is linearly interpolated between vertices. When a denormalized vector is used in a dot product operation, the result will be smaller than it would be for a normalized vector. This can cause dulling of specular highlights in the middle of triangles and triangle edges. To fix this, the vector needs to be normalized after interpolation and before being used in a dot product. Usually, a normalization cube map is used to normalize the vector because the texture-addressing algorithm for cube maps is independent of vector magnitude.

Using a normalization cube map prohibits a technique from using a texture lookup to implement the specular power function. This is because the relevant instructions (texm3x2tex, etc.) do dot products between one texture result and one texture coordinate, rather than between two arbitrary texture results.



To overcome this problem, the cube map can be directly used to implement the specular power function as well. The technique is an extension of the standard environment-mapped bump mapping shader. Here is the standard shader:

; c0 = specular light color
ps_1_1
tex t0
texm3x3pad t1, t0_bx2
texm3x3pad t2, t0_bx2
texm3x3vspec t3, t0_bx2
mul r0, t3, c0

This shader transforms the per-pixel normal value in t0 into world space, calculates the eye reflection vector, and looks this up in an environment cube map (the eye vector is interpolated in oT1.w, oT2.w, and oT3.w).

The cube map will be precalculated as having a single light pointing in a fixed direction. Each texel will be calculated using the specular function with the normalized u,v,w coordinates of the texel as the input vector and a fixed vector for the light direction, e.g., (0,1,0).
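The per-texel evaluation can be sketched in C++ as follows; the function takes the (unnormalized) direction of a cube map texel and returns the stored value for one exponent, leaving the face and texel enumeration to the caller since that is purely bookkeeping.

#include <cmath>
#include <cstdint>

// Value to store in one cube map texel: the specular power function evaluated for the
// texel's direction (dx, dy, dz) against the fixed light/half direction (0,1,0).
uint8_t SpecularCubeTexel(float dx, float dy, float dz, float power)
{
    float len = std::sqrt(dx * dx + dy * dy + dz * dz);
    float ndoth = (len > 0.0f) ? dy / len : 0.0f;   // dot(normalize(dir), (0,1,0))
    if (ndoth < 0.0f) ndoth = 0.0f;
    return static_cast<uint8_t>(std::pow(ndoth, power) * 255.0f + 0.5f);
}

For the per-pixel power variant later in this article, the same loop would write four different exponents into the four channels of the cube map.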

Using the above shader, this will give a consistent specular highlight in the fixed direction (0,1,0), which is incorrect, as the highlight needs to point in the direction of the light. By adjusting the interpolated texture coordinates (associated with t1, t2, and t3) in the vertex shader, the cube map can be aligned with the direction of the light. To ease this process, the shader will be switched to use Blinn shading rather than Phong shading. This means that the cube map needs to be aligned with the vertex half vector, and the last texture instruction needs to be texm3x3tex rather than texm3x3vspec. This also means that the vertex shader doesn't need to output the eye vector in the w components of oT1, oT2, and oT3, as the pixel shader doesn't need this information.

In order to perform this alignment, a 3x3 matrix needs to be constructed that will transform the half vector onto the fixed vector (0,1,0). This matrix defines a world space coordinate system that will be referred to as "light space." Once this matrix has been determined, it needs to be combined with the tangent space matrix (tangent, normal, and bi-normal) so that the per-pixel normal will be transformed from tangent space into light space before looking up in the cube map.

The light space matrix consists of three vectors defining the x, y, and z axes of the light space in world space.

There is one constraint on creating the light space matrix: The y-axis must be the same as the vertex half vector because the fixed light direction was (0,1,0). Hence, the other two axes must be in the plane perpendicular to the half vector in order to form an orthogonal basis, though their specific orientation does not affect the lighting, as the cube map is symmetrical around the y-axis.

NOTE The only difference between directional and point lights for this technique is in the calculation of the half vector (directional lights use the light direction; point lights use the normalized vertex-to-light vector). After the half vector has been calculated, the vertex shaders are identical and the pixel shaders are completely identical.


Light Space Interpolation Consistency

One way of generating the other two axes (x and z) is to take any fixed vector, cross it with the half vector to get the z-axis, and then cross the half vector with this z-axis to get the x-axis. Once normalized, these two vectors will form an orthonormal basis with the half vector.

Problems occur with this technique when the half vector points near the fixed vector or nearly opposite the fixed vector. This is because neighboring vertices might have half vectors that surround the fixed vector. When this happens, the x- and z-axes for neighboring vertices will point in radically different directions, as the small difference between the half vector and the fixed vector defines their direction. This problem manifests itself as small highlights appearing inside triangles where there shouldn't be any. Hence, a fixed vector cannot be used to construct a light space that can be consistently interpolated.

To overcome this problem, the tangent space vectors can be used as a starting point for constructing a light space that can be consistently interpolated over the mesh. Simply cross the vertex tangent vector with the half vector to get the light space z-axis. Then cross the half vector with this z-axis to get the light space x-axis. This works because a highlight is usually only visible on the surface wherever the half vector is near to the surface normal and hence approximately perpendicular to the tangent vector. Additionally, the tangent vector can be consistently interpolated over the mesh (this is an assumption that all bump mapping techniques have to be able to make about the tangent and bi-normal vectors).

NOTE This light space construction isn't guaranteed to be artifact free; if the mesh is so sparsely tessellated that neighboring tangent space vectors vary greatly, then the half vector could point too close to the tangent vector and risk causing small highlight artifacts.

Here are the vertex and pixel shaders to implement this:

vs_1_1
dcl_position0 v0
dcl_texcoord0 v1
dcl_normal v2
dcl_tangent v3
dcl_binormal v4
m4x4 oPos, v0, c0 ; transform position into screen space
mov oT0, v1 ; output uv for normal map
sub r0, c5, v0
dp3 r0.w, r0, r0
rsq r0.w, r0.w
mul r0, r0, r0.w ; Normalized light dir in r0
sub r1, c4, v0
dp3 r1.w, r1, r1
rsq r1.w, r1.w
mul r1, r1, r1.w ; Normalized view dir in r1
add r2, r0, r1
dp3 r2.w, r2, r2
rsq r2.w, r2.w
mul r2, r2, r2.w ; Normalized half vector in r2
; Work out lightspace
; LightY = half vector. (r2)
; LightZ = Tangent x LightY (r6)
mul r6, v3.zxyw, r2.yzxw
mad r6, v3.yzxw, r2.zxyw, -r6
; Normalize
dp3 r6.w, r6, r6
rsq r6.w, r6.w
mul r6, r6, r6.w
; LightX = LightY x LightZ (r7)
mul r7, r2.zxyw, r6.yzxw
mad r7, r2.yzxw, r6.zxyw, -r7
; Normalize
dp3 r7.w, r7, r7
rsq r7.w, r7.w
mul r7, r7, r7.w
; Work out Tangent in lightspace
dp3 oT1.x, v3.xyz, r7.xyz
dp3 oT2.x, v3.xyz, r2.xyz
dp3 oT3.x, v3.xyz, r6.xyz
; Work out Bi-normal in lightspace
dp3 oT1.y, v4.xyz, r7.xyz
dp3 oT2.y, v4.xyz, r2.xyz
dp3 oT3.y, v4.xyz, r6.xyz
; Work out Normal in lightspace
dp3 oT1.z, v2.xyz, r7.xyz
dp3 oT2.z, v2.xyz, r2.xyz
dp3 oT3.z, v2.xyz, r6.xyz

; c0 = specular light color
ps_1_1
tex t0 ; read normal in .rgb
texm3x3pad t1, t0_bx2
texm3x3pad t2, t0_bx2
texm3x3tex t3, t0_bx2 ; lookup specular cube map with light space normal
mul r0, t3, c0 ; multiply by specular color


Normalized Specular Bump Mapping with Per-pixel Power

This section combines the techniques described in the previous sections to give a shader that can render specular bump mapping with a normalized half vector and per-pixel variable specular power.

Taking the normalized half vector shader described above as a starting point, the first thing to do to extend it to allow per-pixel specular power is to precalculate four different specular functions, one in each channel of the cube map. Once this has been done, the final shader simply involves copying the shader fragments from the per-pixel specular power section and adjusting the register names so they fit together. The vertex shader is the same as for the normalized half vector technique, so here is the final pixel shader:

; c0 = -0.33333, 0.0, 0.33333, 0.66666
; c1 = 0.0, 0.33333, 0.66666, 1.0
; c2 = 0.75, 0.75, 0.75, 0.75
; c3 = Specular light color
ps_1_1
tex t0 ; read normal in .rgb and power in .a
texm3x3pad t1, t0_bx2
texm3x3pad t2, t0_bx2
texm3x3tex t3, t0_bx2 ; lookup specular cube map with light space normal
sub_sat r0, t0.a, c0 ; offset rising edges
mul_x4_sat r0, r0, c2 ; scale rising edges
sub_sat r1, t0.a, c1 ; offset falling edges
mul_x4_sat r1, r1, c2 ; scale falling edges
sub_sat r1, r0, r1 ; combine rising and falling edges
dp3 r0, t3, r1 ; dp3 between weights and specular values
mad r0, t3.a, r1.a, r0 ; extend to dp4
mul r0, r0, c3 ; multiply by specular color

NOTE Light attenuation can be added to this shader: Calculate the attenuation value in the vertex shader, modulate it with the specular light color, output this in the oD1 register, and then replace the final pixel instruction with mul r0, r0, v1.

Implementation Details

The demo associated with this article provides implementation details for some of the techniques discussed. All the techniques use normalized half vectors. Pressing the S key will cycle through different specular techniques to provide a comparison between them. The "Fixed High" and "Per Pixel" options implement the techniques described in this article, while the others use standard arithmetic techniques for approximating specular power functions.

When implementing specular bump mapping on ps_1_1 hardware, it is important to use the most efficient shaders that will achieve the desired effect. For example, if per-pixel specular power is not needed (and it won't be for many surfaces), then use a more efficient and simpler shader. Similarly, if a fixed power of 8 is sufficient for a given surface and the banding is not noticeable, then use it rather than the shaders described in this article.

The demo shows how to combine a diffuse pass with the specular pass while incorporating per-pixel specular level maps (gloss maps). It is also extended to allow for multiple lights. The gloss map is stored in the diffuse texture alpha channel and written to the destination alpha channel during the first pass. In subsequent passes, alpha writing is disabled (by using the D3DRS_COLORWRITEENABLE render state). This means that the gloss map is permanently stored in the destination alpha channel, and specular passes can be modulated and added using the following blend mode: D3DRS_SRCBLEND = D3DBLEND_DESTALPHA; D3DRS_DESTBLEND = D3DBLEND_ONE.
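In D3D9 terms, the render state setup for those passes looks roughly like the following sketch; the split into two helper functions and the exact color-write flags used in the first pass are my assumptions, reconstructed from the states the text names rather than taken from the demo source.

#include <d3d9.h>

// First pass: write color and store the gloss map in destination alpha.
void SetupFirstPass(IDirect3DDevice9* device)
{
    device->SetRenderState(D3DRS_COLORWRITEENABLE,
                           D3DCOLORWRITEENABLE_RED | D3DCOLORWRITEENABLE_GREEN |
                           D3DCOLORWRITEENABLE_BLUE | D3DCOLORWRITEENABLE_ALPHA);
    device->SetRenderState(D3DRS_ALPHABLENDENABLE, FALSE);
}

// Subsequent specular passes: leave destination alpha alone and blend
// specular * destAlpha (the stored gloss) additively onto the frame buffer.
void SetupSpecularPass(IDirect3DDevice9* device)
{
    device->SetRenderState(D3DRS_COLORWRITEENABLE,
                           D3DCOLORWRITEENABLE_RED | D3DCOLORWRITEENABLE_GREEN |
                           D3DCOLORWRITEENABLE_BLUE);
    device->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
    device->SetRenderState(D3DRS_SRCBLEND, D3DBLEND_DESTALPHA);
    device->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_ONE);
}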

Conclusion

In this article, two primary factors that affect specular bump mapping were addressed: per-pixel specular power values and normalized high-exponent specular power functions. It has been shown how to achieve either or both of these factors in a single pass on ps_1_1 level hardware while minimizing banding artifacts.

Acknowledgments

Thanks to Oscar Cooper, Dean Calver, Andrew Vidler, and Peter Halpin for proofreading.

Thanks to Tim Mews for providing the artwork in the demo.

References

[1] ATI developer page, http://www.ati.com/developer/. Papers, http://www.ati.com/developer/techpapers.html.

[2] nVidia developer page, http://developer.nvidia.com/.

[3] Beaudoin, Philippe and Juan Guardado, "A Non-Integer Power Function on the Pixel Shader," http://www.gamasutra.com/features/20020801/beaudoin_01.htm.


Voxel Rendering with PS_3_0

Aaron Burton

Introduction

With the advent of pixel shader 3_0, graphics hardware has become capable of rendering hardware-accelerated voxels.

Voxel objects are stored as a three-dimensional map of matter, with each voxel (or texel in a volume map) indicating something about that "lump" of matter: its color, translucency, or "power" in the case of metaballs. In the "power" case, a threshold value is used; voxel values that are above this value are considered to be solid matter, with the rest considered to be empty space. (Alternatively, an implementation could reverse the comparison and consider matter to exist in all voxels with a value below the threshold.)

Typically, voxel objects are converted to polygons before rendering using the "marching cubes" algorithm or something similar. The method presented here submits a single eight-vertex cube and extracts the surface in the pixel shader; a ray is traced step by step through the volume, sampling the texture at each step, searching for matter.

The method requires pixel shader loops, dynamic flow control (IF, BREAK), unlimited texture reads, and unlimited dependent reads; it also makes use of function calls.

The Plan Revealed

The basic recipe is simple.

1. Take a volume texture(s) containing the voxel object. D3DTADDRESS_CLAMP should be used, although small changes to the code would allow texture repeats if that is desired.

2. Render a cube. The eight vertices have 3D texture coordinates applied that simply map the entire volume texture to the cube (i.e., one vertex has coordinates [0 0 0] and the opposite corner is [1 1 1]). If the front clip plane clips this cube, it must have capping polygons added to cover the hole created.

3. In the vertex shader, output the 3D texture coordinate and a vector indicating the direction of the line from camera to vertex.

4. In the pixel shader, start at the given texture coordinate and step along the line, deeper into the volume. Sample the texture at each step.

The length of the camera-to-vertex vector, which is calculated in the vertex shader, directly controls the distance stepped through the volume texture each loop; thus volume textures containing small objects will require the vector length to be no longer than the size of a voxel. For example, a 64x64x64 volume texture would require the length of the step vector to be 1/64.

Note that the normal issues with vector interpolation apply: As the cube is rendered, this vector will be linearly interpolated between vertices, and so it may be shorter for some pixels (those toward the middle of polygons) than it should be. This is not likely to be a problem unless both the near clip plane and the object are very close to the camera, so the shaders herein do not attempt corrective action.

The voxels may be rendered in many ways; the three demonstrated here are accumulative (e.g., additive, such as holograms or light volumes, or multiplicative, such as smoke), solid, and lit.

The Vertex Shader

The vertex shader used is simple and the same no matter which rendering method is chosen.

// Constants: c0..c3 World.View.Proj matrix
// c4 Camera position (model space)
// c5.x Length of camera->vertex vector for pixel shader
vs_3_0
dcl_position v0
dcl_texcoord v1
dcl_position o0
dcl_texcoord0 o1
dcl_texcoord1 o2
m4x4 o0, v0, c0 // Output: transformed position
mov o1.xyz, v1 // Output: texture coordinate
sub r1, v0, c4 // Camera->Vertex vector in model space...
nrm r2, r1 // ...normalize it...
mul r1, r2, c5.x // ...and scale it to the right "step" length.
mov o2.xyz, r1 // Output: Camera direction in texture space


Accumulated Voxels

This is the simplest method. Accumulated voxels require a single volume texture containing the color at each voxel. The colors are summed (accumulated) as the ray traces through the volume.

Accumulated voxel rendering is most obviously applied as additive (e.g., holography or light volumes) or obscuring, as smoke or fog (i.e., multiplicative blending), or another blend type for another effect. There can be a problem with using accumulated voxels as light volumes (i.e., the light from a window shining through a dusty atmosphere) or smoke/fog volumes, which is covered later in this article in the section titled "The Problem with Depth Buffering."

1. As the ray steps through the volume, the texture is read and accumulated. The final number must be modulated by some value, or else it will likely be too bright, at least in the case of additive blending.

2. If a fixed number of steps is taken through the volume, this number can then be used to divide the total and gain the final result; this will work on pixel shader 2_0 hardware. Note that if pixel shader 2_0 hardware is used with D3DTADDRESS_CLAMP, the pixel shader will be unable to terminate the ray when it hits the edge of the volume; this will require the edge voxels to be empty, as otherwise, matter in the edges will appear to be stretched to infinity.

3. If the ray is terminated when it leaves the [0..1] texture coordinate range, then fewer texture reads will be performed, and thus pixel shader 2_0 is no longer sufficient. The number chosen to modulate the final result can be tuned to give the desired brightness or set from the maximum number of steps possible through the volume. The number must be constant across all pixels (to avoid color banding), not calculated per pixel.

// Constants: c1.x Length of step, used as scale factor for final result
// i0 Loop enough times to step completely through object
ps_3_0
def c0, 0, 1, 0, 0
dcl_texcoord0 v0.xyz // Start texture coordinate
dcl_texcoord1 v1.xyz // Ray "step vector"
dcl_volume s0 // "Matter map" - use alpha channel
dcl_volume s1 // Color map
mov r0.xyz, v0 // r0.xyz is current sample position
mov r0.w, c0.x // r0.w is number of samples taken
texld r1, v0, s0
rep i0
add r0.xyz, r0, v1 // Step further along the ray
add r0.w, r0.w, c0.y // Increment sample count
// Stop if any ray coord leaves [0..1] range
mov_sat r3.xyz, r0
sub r3.xyz, r3, r0
abs r3.xyz, r3
dp3 r3.w, r3, c0.y
if_ne r3.w, c0.x
break
endif
// Load the texture and accumulate
texld r2, r0, s0 // Load the texture...
add r1.a, r1, r2 // ...and add it to the total
endrep
// Scale result and output color
mul r1.a, r1.a, c1.x
mov oC0, r1.a

Solid Voxels

With solid voxels, the aim is to step through the volume texture until a non-empty voxel is found and then render its color. This requires the ray to terminate when it leaves the volume or finds matter. This is simple and should be easily understood by examining the shader code.

If point sampling is used, a special color, or color-key, can be used to indicate whether the voxel is opaque or not. If bilinear filtering is to be used, and this does improve the results, then an additional alpha channel must be used to indicate the existence of matter, as color filtering will prevent the use of a color-key. Most of the ray tracing operation will consist of sampling the information contained in the alpha channel, as the final color is only retrieved after an opaque voxel has been found. For this reason, it may be best for performance if the matter (alpha) map and color map are two separate volume textures. For the matter map, D3DFMT_A8 will do, although 1 bit per pixel might be sufficient (e.g., for a landscape). For metaballs and other surfaces that have varying densities, an 8-bit channel could be ideal. It is then possible to vary the threshold value that the pixel shader considers to be matter; this effectively changes the isosurface that is rendered.

The ray traces through the matter map, sampling each step and searching for matter. When matter is found, the pixel shader can then sample the color map and write the value. In Figure 1, the circles indicate sample positions, and the arrows indicate the direction of the ray. Sampling stops and the object color is output as soon as matter is found, in this case by the rightmost sample, the fifth in the image.

Figure 1: The ray steps through the volume, sampling the texture to search for matter.

// Constants: c1.y Threshold value to detect matter
// i0 Loop enough times to step completely through object
ps_3_0
def c0, 0.5f, 1, 0, 0
dcl_texcoord0 v0.xyz // Start texture coordinate
dcl_texcoord1 v1.xyz // Ray "step vector"
dcl_volume s0 // "Matter map" - use alpha channel
dcl_volume s1 // Color map
mov r0.xyz, v0 // r0 is our ray position; it starts at v0 and adds v1 with each step
mov r1, c0.z // Initialize output with black
rep i0
texld r2, r0, s0 // Sample the "matter map"
// Matter detected?
if_gt r2.a, c1.y
texld r1, r0, s1 // Sample the color texture.
break // Done!
endif
add r0.xyz, r0, v1 // Step further along the ray
// Stop if any ray coord leaves [0..1] range
mov_sat r2.xyz, r0
sub r2.xyz, r2, r0
abs r2.xyz, r2
dp3 r2.w, r2, c0.y
if_ne r2.w, c0.z
break
endif
endrep
// Output color
mov oC0, r1

Lit Voxels (Calculating a Normal)

In order to light the rendered voxels, a normal must be obtained. Depending on how the volume data was extracted, another volume texture containing normal data could be accessed for visible voxels. If such data is not available, then a normal vector must be calculated per pixel; computing an accurate normal from volume data is a complex task that could be the object of an article of its own. The techniques described below give a good approximation of normal calculation; however, more sampling data and more complex algorithms could be used to improve the visual accuracy of the results.

One approach is to trace three rays per<br />

pixel and calculate the cross product of the<br />

resulting hit points, but this is inordinately<br />

expensive — and problem-rich.<br />

A better solution is illustrated in Figure<br />

2 and described here:<br />

1. The hit point is found just as with the<br />

solid voxels approach.<br />

2. Sample a number of points on a sphere<br />

around the hit point (eight sample points<br />

are shown as diamond shapes in the<br />

diagram).<br />

3. Sum the offset vectors (i.e., the vector from hit point to sample point; vectors<br />

a, b, c, and d in the diagram) that do not hit matter and normalize the resulting<br />

vector (N in the diagram). This then approximates the normal.<br />
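For readability, here is a minimal HLSL-style sketch of steps 2 and 3. It is not the article's listing (which appears below in assembly); the sampler, threshold, radius, and offset table are illustrative assumptions.

// Assumed resources (illustrative names):
sampler3D MatterMap;   // alpha channel holds the matter/density value
float     Threshold;   // density above this counts as matter
float     Radius;      // sampling sphere radius in texture space

// Eight sample directions, one per octant (0.577 ~ 1/sqrt(3)).
static const float3 Offsets[8] =
{
    float3( 0.577,  0.577,  0.577), float3(-0.577,  0.577,  0.577),
    float3( 0.577, -0.577,  0.577), float3(-0.577, -0.577,  0.577),
    float3( 0.577,  0.577, -0.577), float3(-0.577,  0.577, -0.577),
    float3( 0.577, -0.577, -0.577), float3(-0.577, -0.577, -0.577)
};

float3 ApproximateNormal(float3 hitPos)
{
    float3 n = 0;
    for (int i = 0; i < 8; i++)
    {
        // Offsets whose sample point contains no matter contribute to the normal.
        if (tex3D(MatterMap, hitPos + Offsets[i] * Radius).a <= Threshold)
            n += Offsets[i];
    }
    // Note: the thin-sliver case discussed below can still produce a zero sum,
    // which needs a fallback (the article uses the inverse ray direction).
    return normalize(n);
}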

Improvements

The points sampled in order to generate the normal can be tuned to get the best results.

1. The offset vector to each sample position on a sphere can be stored in pixel shader constants. These positions are stored for just one octant of the sphere; the sample positions for the other seven octants can be found by inverting the offset vector across each axis. Thus, three supplied offset vectors, with signs flipped and summed to the hit point, give coordinates for 24 texture samples. (Figure 3, which is 2D, would store two offset vectors per quadrant for a total of eight sample positions.)

2. Better results may be gained by sampling from several spheres of differing radii; this can easily be achieved by varying the lengths of the supplied offset vectors.

There is potential for an error if the ray hits a thin sliver of matter, as shown in Figure 3. Offset vectors 2, 3, 4, 6, 7, and 8 all detect "no matter"; the summed result is a zero-length normal. An additional check solves this problem: When calculating the normal, only perform texture samples for cases where the dot product of the offset vector and the step vector is negative (i.e., in Figure 3, sample points 5, 6, 7, and 8 would be tested); this also has the benefit of halving the number of texture reads.

Figure 2: Additional texture samples can calculate an approximate normal.

Figure 3: Thin slivers of matter can generate invalid normals unless fixed with additional code.

// Constants: c1.x Length of step
//            c1.y Threshold value to detect matter
//            i0   Loop enough times to step completely through object
ps_3_0
def c0, 0.5f, 1, 0, -1
def c2, 0.363f, 0.363f, 0.858f, 0   // Offset vectors; 3 matter checks per octant
def c3, 0.363f, 0.858f, 0.363f, 0   // (used for normal calculation)
def c4, 0.858f, 0.363f, 0.363f, 0
def c20, 0, 1, 0, 0                 // Light vector
dcl_texcoord0 v0.xyz                // Start texture coordinate
dcl_texcoord1 v1.xyz                // Ray "step vector"
dcl_volume s0                       // "Matter map" - use alpha channel
dcl_volume s1                       // Color map
mov r0.xyz, v0                      // Initialize ray position to v0; add v1 with each step
mov r11, c0.z                       // Initialize output with black
rep i0
texld r1, r0, s0
// Matter detected?
if_gt r1.a, c1.y
// Zero r1; it will be used to sum the vectors contributing to the normal
mov r1, c0.z
mov r2, c2
mul r2, r2, c1.x                    // r2 is the offset to sample around curr pos
call l0                             // Will update r1 with normal contributions
mov r2, c3
mul r2, r2, c1.x                    // r2 is the offset to sample around curr pos
call l0                             // Will update r1 with normal contributions
mov r2, c4
mul r2, r2, c1.x                    // r2 is the offset to sample around curr pos
call l0                             // Will update r1 with normal contributions
// If the normal is zero, use the inverse camera direction
dp3 r1.w, r1, c0.y
if_eq r1.w, c0.z
mov r1.xyz, -v1
endif
// Now normalize the normal & do some lighting
nrm r2, r1
dp3 r3, r2, c20
mad r11, r3, c0.x, c0.x
break
endif
add r0.xyz, r0, v1                  // Step further along the ray
// Stop if any ray coord leaves [0..1] range
mov_sat r1.xyz, r0
sub r1.xyz, r1, r0
abs r1.xyz, r1
dp3 r1.w, r1, c0.y
if_ne r1.w, c0.z
break
endif
endrep
// Output color
mov oC0, r11
ret // End of main

//////////////////////////////////////////////////////////////////////////////
// Purpose: Check for matter around a position.
// In:   r0.xyz Hit position
//       r1.xyz Summed normal contributions
//       r2.xyz Offset vector, in octant 0, to use to search around r0
// Out:  r1.xyz Updated with new contributions (if any)
// Uses: r3..r5
//////////////////////////////////////////////////////////////////////////////
label l0
mov r3, r2              // Octant 0
call l1
mul r3.xyz, r2, c0.yyw  // Octant 1
call l1
mul r3.xyz, r2, c0.ywy  // Octant 2
call l1
mul r3.xyz, r2, c0.yww  // Octant 3
call l1
mul r3.xyz, r2, c0.wyy  // Octant 4
call l1
mul r3.xyz, r2, c0.wyw  // Octant 5
call l1
mul r3.xyz, r2, c0.wwy  // Octant 6
call l1
mul r3.xyz, r2, c0.www  // Octant 7
call l1
ret // End of function: l0

//////////////////////////////////////////////////////////////////////////////
// Purpose: Check a position for matter; sum the offset vector if no hit.
// In:   r0.xyz Hit position
//       r1.xyz Summed normal contributions
//       r3.xyz Offset vector
// Out:  r1.xyz Updated with new contributions (if any)
// Uses: r3..r5
//////////////////////////////////////////////////////////////////////////////
label l1
// Only check this sample point if the offset vector faces the camera
dp3 r3.w, r3, v1
if_lt r3.w, c0.z
add r4.xyz, r3, r0
texld r5, r4, s0
// If there is no matter here, the offset vector can contribute to the normal
if_le r5.a, c1.y
add r1, r3, r1 // Add to the generated normal
endif
endif
ret // End of function: l1

The Problem with Depth Buffering

Translucent, "accumulative" voxel objects will most likely be drawn after the other geometry of the scene. Figures 4 and 5 show an object inside a light volume. Ideally, a ray being traced from the front of the volume would terminate not only at the back of the volume but also when it hits an object, as shown in Figure 5. The supplied code does not demonstrate this. Thus, if an object gradually moves through the volume from behind to in front, the intensity visible in front of the object will not smoothly diminish; it will switch from full to none as the object passes through the front of the volume. Fixing this requires knowledge of the contents of the depth buffer, which currently can only be achieved by having a copy of it in a texture.

Figure 4: An object in a light beam, which is passing through a dusty atmosphere

Figure 5: Top view of Figure 4, showing the eye position and ray paths

With both solid and lit voxels, it may be desirable to have the objects correctly depth buffered by calculating and outputting the correct depth value oDepth in the pixel shader. For solid objects that should never intersect other objects, this is not necessary and should be strongly avoided for performance reasons, as 3D hardware typically prefers the pixel shader not to output a depth value. However, in most other cases, correct depth buffering will likely be required. Note that, unlike the accumulative case, it is not necessary to have the current depth buffer values available as an input to the pixel shader.
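As a minimal sketch (not part of the article's code) of what outputting such a depth value could look like in ps_3_0 HLSL, assuming a hypothetical WorldViewProj matrix and the ray hit point in object space:

struct PsOut
{
    float4 color : COLOR0;
    float  depth : DEPTH;   // corresponds to oDepth in the assembly model
};

PsOut OutputVoxelDepth(float3 hitObjectPos, float4 color,
                       uniform float4x4 WorldViewProj)
{
    PsOut o;
    o.color = color;
    // Re-project the hit point and write post-projection z/w,
    // which is the value the depth buffer compares against.
    float4 clipPos = mul(float4(hitObjectPos, 1.0f), WorldViewProj);
    o.depth = clipPos.z / clipPos.w;
    return o;
}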

Comparison with "Stacked Quads" Method

Another method that has been used to visualize a volume texture is the "stacked quads" approach. A collection of quads is rendered: screen aligned, equally spaced in depth, and all with the same screen-space size, each with 3D texture coordinates applied that extract a different "slice" of the volume. The volume texture can be rotated by changing the texture coordinates or moved by shifting the stacked quads. Typically, additive blending is used. This can be compared to the accumulative, additive voxels; the distance between each quad is similar to the length of the step vector. One approach uses more geometry; the other uses more pixel shader power. If an alpha test were used with the stacked quads approach, the visual result would be similar to the solid rendering approach. Lighting (that is, normal calculation) appears to be beyond the stacked quads method.

Imposters

Just as with polygonal models, "imposters" may be used to cut the processing costs of voxel objects. In addition, voxel objects could conceivably be used as a 3D imposter. This would require code to be written to render slices of polygonal objects to the slices of a volume texture. 2D imposters must be updated when the view of the object significantly changes, for reasons including motion toward or away from the camera or when the object changes (animates) or rotates. A 3D imposter would not have to be updated due to rotation.

Shadows

Voxel objects are compatible with shadow-buffer shadows but not with stencil shadows. This is because a shadow texture containing the voxel object can be rendered as easily as rendering it to the screen, but there is no geometry information from which to calculate a shadow volume for stencil shadows.

Generating Voxel Data

Voxel data can come from many sources. Physical objects can be scanned and converted to voxel data, for example, from MRI or CT scans. Polygonal objects can be rendered in slices to volume textures; an approach similar to rendering stencil volumes should work, though appropriate code was left as a "future direction." Another approach is algorithmically generating voxel data, as in the case of metaballs; the volume texture is filled with power values, and the pixel shader code considers matter to exist when the power is above a certain threshold.
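As an illustration of the kind of power field that could be written into the volume texture, here is a small HLSL-style sketch of a classic metaball falloff; the ball count, centers, and radius are assumptions, and in practice the field would be evaluated while filling the volume rather than inside the ray-stepping shader.

float MetaballPower(float3 p, float3 centers[4], float radius)
{
    float power = 0;
    for (int i = 0; i < 4; i++)
    {
        float3 d = p - centers[i];
        // Inverse-square falloff; "matter" exists wherever power > threshold.
        power += (radius * radius) / max(dot(d, d), 0.0001f);
    }
    return power;
}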

CD Demo

The demo included on the companion CD requires 3D hardware supporting the pixel shader 3_0 model. If no such 3D accelerator is present, a batch file forcing the reference device is provided so that the demo can still be run; note that performance then becomes a direct function of the CPU power of the host platform. The CD also contains movies illustrating some of the techniques described in this article.

Summary

Consumer triangle-accelerating hardware has become sufficiently advanced that it is now able to hardware accelerate voxels and metaballs. This article described how to render voxel objects using volume textures. Different implementations were proposed, which can be evaluated depending on the application required. While it might be some time before consumer 3D hardware can run anything more than a few voxel objects at acceptable speeds, this technique opens the door to a new set of possible visual effects which, until now, were not accessible in mainstream real-time rendering.


Simulating Blending Operations on Floating-point Render Targets

Francesco Carucci

Introduction

One of the most exciting new features introduced with DX9-class hardware (R300 and NV30) is floating-point textures, where the RGBA channels are not fixed-point numbers restricted to the range 0.0 to 1.0, as in the past, but 16- or 32-bit floating-point numbers, which extends both dynamic range and precision. A floating-point texture can be used both as input to a pixel shader and as a render target that stores the result of the pixel color computation. This comes with some limitations: Current generation hardware can't do blend operations on floating-point render targets. This article introduces a technique to overcome the limitation when non-overlapping geometry is rendered.

Creating a Floating-point Texture

Direct3D 9 currently supports 16- and 32-bit per-channel formats for textures that can be used as render targets:

D3DFMT_R16F           16-bit texture using 16 bits only for the red channel
D3DFMT_G16R16F        32-bit texture using 16 bits for the red channel and 16 bits for the green channel
D3DFMT_A16B16G16R16F  64-bit texture using 16 bits for each channel (alpha, blue, green, red)
D3DFMT_R32F           32-bit texture using 32 bits only for the red channel
D3DFMT_G32R32F        64-bit texture using 32 bits for the red channel and 32 bits for the green channel
D3DFMT_A32B32G32R32F  128-bit texture using 32 bits for each channel (alpha, blue, green, red)

The 16-bit per-channel format is an s10e5 floating-point number with a 10-bit mantissa and a 5-bit exponent. The 32-bit per-channel format is an s23e8 floating-point number with a 23-bit mantissa and an 8-bit exponent.

Here is some code to create a floating-point texture to use as a render target:


LPDIRECT3DSURFACE9 gColorBufferSurface[2];
LPDIRECT3DTEXTURE9 gColorBufferTexture[2];

// create color buffer surfaces
gD3DDevice->CreateTexture(
    gWidth, gHeight,
    1,
    D3DUSAGE_RENDERTARGET,
    D3DFMT_A16B16G16R16F,
    D3DPOOL_DEFAULT,
    &gColorBufferTexture[0],
    NULL);

// get associated surfaces
gColorBufferTexture[0]->GetSurfaceLevel(
    0,
    &gColorBufferSurface[0]);

Here gWidth and gHeight are the screen's current width and height. The texture and surface for the second color buffer (gColorBufferTexture[1] and gColorBufferSurface[1]) can be created in the same way and will be used later in the code.

The following code sets the floating-point texture surface as the current render target:

gD3DDevice->SetRenderTarget(
    0,
    gColorBufferSurface[0]);

It's now possible to output the color resulting from the pixel shader computation to the floating-point render target and make full use of higher precision floating-point math, for example, to accumulate the results of several lighting passes without the limitations imposed by fixed-point math, both in precision and dynamic range.

Overview of the Technique

Current generation DX9-class hardware cannot perform post-pixel shader operations, such as blending, dithering, alpha testing, fogging, and masking, on floating-point render targets, but blending operations can, with some limitations, be simulated in the pixel shader using floating-point precision.

The idea behind this technique is simple: By setting the floating-point render target both as an input texture and as the output of a pixel shader, we can read the color value of the currently rendered pixel, blend it with the color computed in the pixel shader, and then write it again into the render target in the same pass. An additive blending operation, for example, looks like this:

FinalColor = TexelColor + PixelColor

where FinalColor is the color being written to the render target, TexelColor is the color read from the input texture, and PixelColor is the color computed in the pixel shader.


The idea cannot be implemented in a straightforward manner as described because having the same render target texture both as input and as output of the same pixel shader is not officially supported by current hardware and might lead to undefined and unpredictable results, due to the presence of texture and frame-buffer caches that usually do not talk to each other.

As a workaround, we can borrow the double buffering idea and create two floating-point textures; when the first one is used as the input texture, the second one is used as the render target. After the pass is rendered, they are swapped for the next pass. The pseudocode for this is as follows:

Set i to 0
Clear texture[0]
Clear texture[1]
For each pass
    Set texture[i] as input texture
    Set texture[not i] as render target
    Render geometry
    Swap texture[0] and texture[1]

Reading the Input Texture in the Pixel Shader

The first step is to properly read, from within the pixel shader, the input texture written by the previously rendered pass. Each fragment processed by the pixel shader corresponds to a unique texel in the input texture, which is the pixel computed in the previous pass. To compute the texture coordinates used to access that texel, the vertex shader passes the position of each vertex, transformed into normalized device coordinates (x, y, and z vary from -1.0 to 1.0), through a texture coordinate interpolator; this is automatically linearly interpolated per pixel and gives the needed position of each fragment in the right space. This is implemented by the following HLSL code:

struct sVertexOutput
{
    float4 pos: POSITION;
    float2 texcoord0: TEXCOORD0;
    float4 position: TEXCOORD1;
};

sVertexOutput main(
    sVertexInput vIn)
{
    sVertexOutput o;

    // transform vertex position using the model view projection matrix
    float4 pos = mul(vIn.position, matModelViewProj);

    // copy transformed position to output vertex position
    o.pos = pos;

    // copy input vertex texture coordinates to the first interpolator
    o.texcoord0 = vIn.texcoord0;

    // copy transformed position to the second interpolator
    o.position = pos;

    // return output vertex
    return o;
}

To compute the 2D coordinates used to access the input texture, the pixel shader must now project the fragment's position in normalized device space or, in other words, divide x and y by the w component; notice that this is exactly the same operation performed internally by the rasterizer when a pixel is projected after being transformed by the projection matrix. After the projection, x and y are in the range -1.0 to 1.0; a simple scale and bias gives the right texture coordinates (in the range 0.0 to 1.0) to access the color buffer texture. Here is a function in HLSL that accepts as input a sampler for a color buffer texture and a transformed position and returns the right pixel value:

float4 tex2DRect(
    in sampler2D s,
    in float4 position)
{
    float2 tc;

    tc.x = ( position.x / position.w) * 0.5 + 0.5;
    tc.y = (-position.y / position.w) * 0.5 + 0.5;

    return tex2D(s, tc);
}

This function can be optimized by using a projected texture access, which saves the explicit division by w. The scale and bias can be moved to the vertex shader to further save some instructions in the pixel shader. While production code should use these tricks, the code presented here is more readable and easier to test during the development cycle.
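As a rough sketch of the optimization mentioned above (an assumption about one possible arrangement, not the article's code), the vertex shader can pre-apply the scale and bias to the projected position so that the pixel shader can use tex2Dproj and let the texture unit perform the division by w:

// Vertex shader side (sketch): store a pre-scaled/biased projected position.
//   o.position.x  = 0.5f * (pos.x + pos.w);
//   o.position.y  = 0.5f * (pos.w - pos.y);
//   o.position.zw = pos.zw;

// Pixel shader side: tex2Dproj divides .xy by .w, yielding the same
// coordinates as tex2DRect but without the explicit division.
float4 tex2DRectProj(in sampler2D s, in float4 positionSB)
{
    return tex2Dproj(s, positionSB);
}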

Putting Everything Together

Finally, everything is put together in a simple shader that shows the technique:

float4 main(
    in float4 pos: POSITION,
    in float2 texcoord0: TEXCOORD0,
    in float4 position: TEXCOORD1): COLOR
{
    float4 o;

    // fetch the diffuse texture
    float4 src = tex2D(texture, texcoord0);

    // read the previous pass' source pixel
    float4 dst = tex2DRect(ColorBuffer, position);

    // additive blend!
    o = src + dst;

    return o;
}

In a real shader, the source color comes from some kind of computation (in this small example, it's just a texture fetch from a diffuse map) and is blended with the color in the destination color buffer to achieve the wanted result.

When Should One Use It?

I have used this technique to accumulate lighting passes in a high-precision color buffer with pleasing results. This is the blending equation used:

Cf = (A + L1 + L2) * T

where Cf is the final color, A is the color from an ambient light pass (it is useful to fill the z-buffer with an inexpensive rendering pass so as to take full advantage of the early-z rejection unit of DirectX 9-class GPUs), L1 and L2 are the colors from two lighting passes, and T is the final texture color blended with the content of the floating-point accumulation buffer.

The result of this computation might be greater than 1.0, which cannot be displayed directly unless it is saturated or properly mapped. A simple logarithmic function can be used to map the result to the [0.0, 1.0] displayable color range and avoid color saturation (a problem likely to appear in such a situation). Refer to "color mapping" techniques in high dynamic range rendering for further details on this topic.
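As one possible (hedged) example of such a mapping, a logarithmic operator along these lines could be applied in the final pass; the whitePoint parameter is an assumption, chosen as the intensity that should map to full white:

float3 MapToDisplayable(float3 hdrColor, uniform float whitePoint)
{
    // log2(1 + c) / log2(1 + whitePoint) maps [0, whitePoint] into [0, 1]
    // monotonically, compressing highlights instead of clipping them.
    return log2(1.0f + hdrColor) / log2(1.0f + whitePoint);
}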

This technique can't be used when the source color needs to be blended with a destination color computed in the same pass, as the required color has not yet been written to the destination color buffer. This is a common problem when developing a particle system, for example. In that case, I would suggest rendering the particle system into a normal color buffer after the color mapping, which also reduces fillrate usage.

The last problem with floating-point color buffers is the huge amount of fillrate needed to render to a 64- or 128-bit render target; but if the application is CPU bound or fillrate is not an issue, the use of floating-point color buffers significantly improves an application's visual fidelity at a negligible cost.


Rendering Volumes in a Vertex & Pixel Program by Ray Tracing

Eli Z. Gottlieb

3D graphics programmers have long strived for the most realistic graphics possible. We've used mathematical lighting models, textures, bump maps, environment maps, and now vertex and pixel shaders to achieve as much realism as possible with as much speed and interactivity as possible. And as always, we have used streams of triangle vertex data to represent shapes. Still, perfect realism can only come from one thing: ray tracing. Ray tracing simulates how photons interact with the environment by following a ray as it moves through a scene. There have always been two problems with ray tracing, though: First, you need access to all the scene data at once, and second, performing ray-triangle intersections and reflections is very slow, even when done on the GPU via shaders. Volume textures are a solution to both of these problems. Volumes store the entire scene as a 3D array of voxels, which in memory are xyzw floating-point vectors, just like pixels. This means we can trace rays through volumes, and we're no longer reliant on triangles that would eventually grow smaller than a pixel! In this article, an algorithm is presented for tracing rays through volume textures and using the rays for lighting, implemented in v_2_x vertex and pixel shaders. The algorithm is written in Cg for readability, but you can implement it in any shading language you want.

To Trace a Ray

In order to trace a ray through a volume, we first need to be able to render a volume. For this algorithm, we trace a ray at the same time as rendering the volume, so we just need to render a cube or another low-poly model to put the volume in. We also need to pass the vertex's object space coordinates to the pixel shader for use as texture coordinates into the volume. To do that, we use this fairly simple vertex shader:

struct VertexIn
{
    float3 xyz : POSITION;
};

struct VertexOut
{
    float4 xyzw : POSITION;
    float3 xyz : TEXCOORD0;
};

float4 Vec3To4(float3 v)
{
    return float4(v.x,v.y,v.z,1);
}

VertexOut main(VertexIn vIn,
    uniform float4x4 mWorld,
    uniform float4x4 mView,
    uniform float4x4 mProj)
{
    VertexOut fOut;
    fOut.xyz = vIn.xyz;
    fOut.xyzw = mul(Vec3To4(vIn.xyz),mWorld);
    fOut.xyzw = mul(fOut.xyzw,mView);
    fOut.xyzw = mul(fOut.xyzw,mProj);
    return fOut;
}

In the pixel shader, we use the texture coordinate to trace the ray through the volume and sample the appropriate voxels. Now we've got to start writing our pixel shader.

Inputs and Outputs of the Pixel Shader

The first thing we do to write the pixel shader is figure out what parameters we need and what data to return. We definitely need the output of the vertex shader, and this ray tracing algorithm doesn't require any depth output or multiple render targets, so we just return a float4 for the pixel color. This pixel shader doesn't do much; it's just a starting point to work from.

struct VertexIn
{
    float4 xyzw : POSITION; //Clip space position for rasterizer.
    float3 xyz : TEXCOORD0; //Object space position.
};

struct FragOut
{
    float4 color : COLOR0; //Output pixel color.
};

FragOut main(VertexIn vIn)
{
    FragOut fOut;
    fOut.color = float4(1,1,1,1);
    return fOut;
}


As you can see, we need a way to store our volume texture data. For ray tracing, we're going to need a per-voxel normal, a voxel color, and a voxel emissive light color. That's too much for one voxel in any format, so we have to split it into multiple textures. We use an ARGB volume texture to contain the normal data in x, y, and z, and it also holds a 1D texture coordinate into the other two textures in its alpha component. That component is used to look up a 1D ARGB color texture and a 1D RGB emissive light texture. In reality, light doesn't have an alpha component, so there is no alpha in the emissive light texture. After those changes, the result is this shader:

FragOut main(VertexIn vIn,
    uniform sampler3D tVolume, //Normal in RGB, texcoord in A.
    uniform sampler1D tColor, //RGBA voxel color.
    uniform sampler1D tLight) //RGB emissive light color.
{
    FragOut fOut;
    fOut.color = float4(1,1,1,1);
    return fOut;
}

Notice that we gain an advantage in working with the two 1D textures: identical voxels can look up the same coordinate, thus saving memory. Now we need a way to map our polygon model with the volume texture. To do this, we sample the volume and use the .a swizzle supported by Cg to look up into the color texture and set the color. Then we look up into the light texture and light the pixel. First, though, to get a proper coordinate into the volume texture, we must transform the object space vertex position from a space with the origin at the center, where OpenGL and Direct3D 9 place the origin, to a space with the origin at the top-left front corner. To do this, we just multiply it by an appropriate matrix. Now our code looks much closer to rendering volume textures.

const float4x4 mVolumeCoords = {1,0,0,0,
                                0,-1,0,0,
                                0,0,1,0,
                                0.5,-0.5,1,0};

float4 Vec3To4(float3 v) //Converts float3 to float4 w/ w=1.
{
    return float4(v.x,v.y,v.z,1);
}

FragOut main(VertexIn vIn,
    uniform sampler3D tVolume, //Normal in RGB, texcoord in A.
    uniform sampler1D tColor, //RGBA voxel color.
    uniform sampler1D tLight) //RGB emissive light color.
{
    FragOut fOut;
    float3 texcoord = mul(vIn.xyz,mVolumeCoords);
    fOut.color = tex1D(tColor,tex3D(tVolume,texcoord).a);
    fOut.color.rgb *= tex1D(tLight,tex3D(tVolume,texcoord).a).rgb;
    return fOut;
}

There's still one more thing necessary to make a shader for volume rendering. If the sampled voxel is empty, we should extrude the texture coordinate along the vector from the eyepoint to the vertex position. This is where things start to get messy. To do the extrusion, we need to use a while loop, and the Cg compiler can't compile while loops in pixel shaders. This means you'll have to translate the shader into ASM yourself. For the extrusion we need to add a new uniform parameter, a float3 containing the eye position in object space. We also need the volume texture dimensions so we always extrude one voxel. Finally, we arrive at the shader for rendering volume textures.

struct VertexIn
{
    float4 xyzw : POSITION; //Clip space position for rasterizer.
    float3 xyz : TEXCOORD0; //Object space position.
};

struct FragOut
{
    float4 color : COLOR0; //Output pixel color.
};

const float4x4 mVolumeCoords = {1,0,0,0,
                                0,-1,0,0,
                                0,0,1,0,
                                0.5,-0.5,1,0};

float4 Vec3To4(float3 v) //Converts float3 to float4 w/ w=1.
{
    return float4(v.x,v.y,v.z,1);
}

FragOut main(VertexIn vIn,
    uniform float3 vDimens, //Dimensions of the volume texture.
    uniform float3 vEyePos, //Eye position in object space.
    uniform sampler3D tVolume, //Normal in RGB, texcoord in A.
    uniform sampler1D tColor, //RGBA voxel color.
    uniform sampler1D tLight) //RGB emissive light color.
{
    FragOut fOut;
    bool bPixelFound = false;
    float3 vVolPoint = vIn.xyz; //Cartesian point in the volume to sample.
    while(!bPixelFound)
    {
        fOut.color = tex1D(tColor,tex3D(tVolume,mul(vVolPoint,mVolumeCoords)).a);
        if(fOut.color.a > 0)
        {
            bPixelFound = true;
            fOut.color.rgb *= tex1D(tLight,tex3D(tVolume,mul(vVolPoint,mVolumeCoords)).a);
        }
        vVolPoint += normalize(vVolPoint-vEyePos)/vDimens;
    }
    return fOut;
}

Okay, now how do we ray trace with that? Well, one aspect of ray tracing is finding out where the ray is going. This means that we trace it through the volume and reflect and refract it off of voxels as necessary. We can already trace a ray and reflect it in the pixel shader by storing the coordinates of the voxel we're currently sampling in vVolPoint, giving the ray a velocity, iterating through a loop to reflect vVolPoint against voxels, and at the end of every iteration adding the velocity vector to vVolPoint to trace the ray one voxel further. Here's the resulting shader:

const float4x4 mVolumeCoords = {1,0,0,0,
                                0,-1,0,0,
                                0,0,1,0,
                                0.5,-0.5,1,0};

float4 Vec3To4(float3 v) //Converts float3 to float4 w/ w=1.
{
    return float4(v.x,v.y,v.z,1);
}

FragOut main(VertexIn vIn,
    uniform float3 vDimens, //Dimensions of the volume texture.
    uniform float3 vEyePos, //Eye position in object space.
    uniform sampler3D tVolume, //Normal in RGB, texcoord in A.
    uniform sampler1D tColor, //RGBA voxel color.
    uniform sampler1D tLight) //RGB emissive light color.
{
    FragOut fOut;
    fOut.color = float4(1,1,1,1); //Color is accumulated multiplicatively.
    float3 vRayPoint = vIn.xyz; //Cartesian point in the volume to sample.
    float3 vRayDir = normalize(vIn.xyz-vEyePos);
    float3 vLight = float3(0,0,0); //RGB light.
    while(length(vRayPoint) <= 1) //If length(vRayPoint) > 1 we would be sampling voxels
                                  //outside the volume.
    {
        float3 vNormal = tex3D(tVolume,mul(vRayPoint,mVolumeCoords));
        if (length(vNormal) > 0)
        {
            fOut.color.rgb *= tex1D(tColor,tex3D(tVolume,mul(vRayPoint,mVolumeCoords)).a).rgb;
            vLight += tex1D(tLight,tex3D(tVolume,mul(vRayPoint,mVolumeCoords)).a).rgb;
            if (dot(vRayDir,vNormal) > 0) //Allow for 2-sided objects to be represented w/ a
                                          //1-sided set of voxels.
                vRayDir = reflect(vRayDir,-vNormal);
            else
                vRayDir = reflect(vRayDir,vNormal);
        }
        vRayPoint += vRayDir/vDimens;
    }
    return fOut;
}

Now there's only one thing missing: refraction. To implement it, we would have to be able to start a new loop whenever the ray is refracted. This is impossible because Cg code can't add new loops to itself at run time. We can, however, encapsulate the tracing of the ray in a function and call the function inside itself whenever we need to refract. So the next question is, how do we know when to refract? We check the alpha component of the voxel's color. If it's below 1, then the voxel is transparent to some degree and we need to refract through it. To get the refractive index, a one-component float texture is sampled. Finally, the color, light, ray position, and ray direction need to be passed to the tracing function every time it's called. All the unchanging parameters can be made global variables. We end up with this final version of the ray tracing code:

struct VertexIn
{
    float4 xyzw : POSITION; //Clip space position for rasterizer.
    float3 xyz : TEXCOORD0; //Object space position.
};

struct FragOut
{
    float4 color : COLOR0;
};

const float4x4 mVolumeCoords = {1,0,0,0,
                                0,-1,0,0,
                                0,0,1,0,
                                0.5,-0.5,1,0};

float4 vColor;
float3 vLight; //No alpha in light.
float3 vNormal;
float3 vDimens;
sampler3D tVolume;   //An RGBA volume texture for normals in object space and a texcoord.
sampler1D tColors;   //An RGBA texture containing voxel colors.
sampler1D tLight;    //An RGB texture containing emissive colors.
sampler1D tRefracts; //A one-component texture for refractive indices.

float4 Vector3To4(float3 v)
{
    return float4(v.x,v.y,v.z,1);
}

float4 TraceRay(float4 vColor,inout float3 vLight,float3 vRayPos,float3 vRayDir)
{
    float4 fOut = float4(1,1,1,1); //Color is accumulated multiplicatively below.
    bool bLit = false;
    if (length(vNormal) == 1)
    {
        vColor = tex1D(tColors,tex3D(tVolume,mul(Vector3To4(vRayPos),mVolumeCoords)).a);
        vLight += tex1D(tLight,tex3D(tVolume,mul(Vector3To4(vRayPos),mVolumeCoords)).a).rgb;
        if (dot(vRayDir,vNormal) > 0)
            vRayDir = reflect(vRayDir,-vNormal);
        else
            vRayDir = reflect(vRayDir,vNormal);
        fOut *= vColor;
    }
    vRayPos += vRayDir/vDimens;
    while (!bLit)
    {
        if (vColor.a < 1)
        {
            float4 vRefractColor = TraceRay(vColor,vLight,vRayPos,
                refract(vRayDir,vNormal,tex1D(tRefracts,tex3D(tVolume,mul(Vector3To4(vRayPos),mVolumeCoords)).a).r));
            vColor = vColor.a*vColor * vRefractColor*vRefractColor.a;
        }
        vNormal = tex3D(tVolume,mul(Vector3To4(vRayPos),mVolumeCoords)).rgb;
        if (length(vNormal) == 1)
        {
            vColor = tex1D(tColors,tex3D(tVolume,mul(Vector3To4(vRayPos),mVolumeCoords)).a);
            vRayDir = reflect(vRayDir,vNormal);
            fOut *= vColor;
        }
        else if (length(vRayPos) == 1)
        {
            fOut *= vColor * clamp(vLight,0,1) + (vLight - 1.xxx); //Diffuse and specular lighting.
            bLit = true;
        }
        vRayPos += vRayDir/vDimens;
    }
    return fOut;
}

FragOut main(VertexIn vIn,
    uniform float3 vDimens, //Dimensions of the volume texture.
    uniform float3 vEyePos, //The eye position, in object space.
    uniform sampler3D Volume, //An RGBA volume texture for normals in object space and a texcoord.
    uniform sampler1D Colors, //An RGBA texture containing voxel colors.
    uniform sampler1D Light, //An RGB texture containing emissive colors.
    uniform sampler1D Refracts) //A one-component texture for refractive indices.
{
    FragOut fOut;
    tVolume = Volume;
    tColors = Colors;
    tLight = Light;
    tRefracts = Refracts;
    float3 vRayDir = normalize(vIn.xyz-vEyePos);
    float3 vRayPos = vEyePos;
    vColor = tex1D(tColors,tex3D(tVolume,mul(Vector3To4(vRayPos),mVolumeCoords)).a);
    vLight = tex1D(tLight,tex3D(tVolume,mul(Vector3To4(vRayPos),mVolumeCoords)).a).rgb;
    vNormal = tex3D(tVolume,mul(Vector3To4(vRayPos),mVolumeCoords)).rgb;
    fOut.color = float4(0,0,0,1);
    fOut.color = TraceRay(vColor,vLight,vRayPos,vRayDir);
    fOut.color.a = 1;
    return fOut;
}

Some Optimizations

Of course, there are some things that you can do to make the shader run faster or use less memory. One would be to limit how many times the tracing function can nest. Another, obviously, would be to compress the textures. You can also make a fairly small quality sacrifice to save memory by using point lights to trace rays back to. You may even see a way to optimize the shader code (this wouldn't surprise me). As is, though, I think the shader runs fast enough on most volumes to be used in real time.

What to Use It For

Probably the best current use of the shader is to render the unchanging parts of a scene as volumes so you can obtain the benefits of ray tracing. Another good use of volume ray tracing is rendering objects with complicated emissive lighting, such as fireballs, lightning, flashlights, and anything else that shines. On future hardware, you might be able to blit volume textures, allowing the rendering of the whole scene as volumes so you can show an entire ray-traced scene! For now, however, we don't have such OGL 2.0-type hardware. In conclusion, volume textures can have rays traced through them, starting from the pixel's object space coordinates, to render them with ray tracing effects such as shadowing, reflection, and refraction. This algorithm achieves that goal using hardware pixel shaders for real-time speed.


Normal Map Compression

Jakub Klarowicz

The development of new rendering techniques and the increasing popularity of per-pixel lighting make normal maps play more and more of a significant role in the texture set of any 3D project. New tools for geometry simplification that move all geometry details to bump maps (like ATI's Normal Mapper or nVidia's Melody) generate an additional amount of high-resolution normal maps. This leads to a situation where big chunks of precious video memory are consumed by normal maps.

In the past, when the number and size of textures used in graphical projects were increasing quickly, developers faced the same problem. The solution was a lossy texture compression standard called DXTC, which allows for memory footprints up to six times smaller for a texture without noticeable loss in image quality. Unfortunately, the direct use of the DXTC format for normal map compression leads to poor quality results. Because of the nature of the data contained in normal maps, as well as the way they are used in the rendering process, all distortions and errors caused by the compression are easily noticeable. Only very subtle and soft normal maps can be compressed with acceptable quality.

This article presents a new approach to the compression of normal maps using DXTC. Special preprocessing of a map before compression allows much more detail to be retained in comparison with direct compression and also results in significantly better visual quality of renderings.

Technique

The main idea of the technique is to use all four RGBA channels of the DXT5 format to compress a three-channel normal map (RGB). The map is transformed in such a way that part of the information is moved from RGB to the alpha channel A. Since the RGB and A channels are compressed independently in the DXT5 format, the loss of information during compression is smaller and the artifacts caused by the compression are less visible.

The transformation that has to be applied to normal maps before compression is very simple. One channel of the normal map (R, G, or B) is copied to the alpha channel A and is then cleared (filled with zero values). The transformed map consists of the same components as the original map, with one of them moved into the alpha channel A and one of the R, G, or B channels containing only zeroes. Normal maps preprocessed in such a way are then compressed with any application that is able to compress to the DXT5 format.

In order to use the compressed normal map, it is necessary to move the component stored in the alpha channel back to its original place (R, G, or B). Decoding is performed during rendering and is also very simple. It requires one additional pixel shader instruction and one additional pixel shader constant. Here's some ps_1_1 code that does the job:

ps_1_1
def c0, 1, 0, 0, 0     // 1 in red channel
tex t0                 // read normal map
mad t0, t0.a, c0, t0   // move alpha to red channel
...                    // use normal map

The code assumes that the red channel has been stored in alpha. For the other cases, the only thing that differs is the value of the c0 constant: It should be 0,1,0,0 for green in alpha and 0,0,1,0 for blue in alpha. The c0 constant doesn't have to be stored in the pixel shader code directly; it can be set using the Direct3D function SetPixelShaderConstantF. This way, the proper constant can be set according to how the normal map being used has been prepared.
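For reference, the same decode step could be written in HLSL along the lines of the following sketch; the function and parameter names are illustrative, and channelMask plays the role of the c0 constant above ((1,0,0) for red in alpha, (0,1,0) for green, (0,0,1) for blue):

float3 DecodeSwizzledNormal(sampler2D normalMap, float2 uv, uniform float3 channelMask)
{
    float4 t = tex2D(normalMap, uv);
    // Move the component stored in alpha back into its original (cleared) channel.
    float3 packed = t.rgb + t.a * channelMask;
    // Assumes the usual [0,1] -> [-1,1] normal packing.
    return normalize(packed * 2.0f - 1.0f);
}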

Results

The figures on this and the following page present a per-pixel lit quad with the same normal map. Notice the "blocky" look of the directly compressed map. These blocks correspond to 4x4 groups of pixels, the basic unit on which DXTC compression operates. As you can see in Figure 3, the presented method significantly reduces the "blockiness." You can also see that much more detail is preserved in the map.

Figure 1: Uncompressed normal map

The best results are obtained with tangent space normal maps. Vectors in model space normal maps vary a great deal, and thus compression artifacts are more visible. It is best to place the component that has the greatest variation in the alpha channel; for tangent space normal maps, it's either R or G.

There's a small application on the CD that shows the results for various normal maps. All maps have been coded with the green-in-alpha method.


Figure 2: Normal map compressed directly to DXT5

Figure 3: Normal map compressed with the described method

How to Prepare the Normal Map

Here's a complete procedure for preparing a normal map with the green channel copied to the alpha channel, using Adobe Photoshop 7.0 and the nVidia DDS Compression Plugin [1].

1. Load the uncompressed map into Photoshop.
2. Choose Channels from the Window menu.
3. Select the green channel in the Channels tab.
4. Right-click on the green channel.
5. Select Duplicate Channel from the pop-up menu and press OK in the dialog that appears.
6. Select the green channel again.
7. Choose All from the Select menu.
8. Choose Clear from the Edit menu.
9. Save the map using Save As and the DDS format, making sure the Alpha Channels check box is selected.
10. Choose DXT5 format in the Save Format combo box.
11. Press Save.

Why It Works

This section assumes the reader's basic knowledge of the DXT5 format, especially its compression methods. Details can be found in [2].

At first sight, it may be surprising that map quality can be improved just by moving one normal component to the alpha channel. One reason is that the alpha channel is quantized separately from the RGB channels; the other is the fact that it is quantized with increased accuracy. In DXT5, each 4x4 block of pixels quantizes the alpha to eight separate values. These eight values are distributed uniformly over a sub-range of the [0, 1] alpha range. This way, the compressed alpha channel can represent the original content very precisely. The accuracy of the representation depends on how much the original values differ within the 4x4 block: the less they differ, the narrower the sub-range is, and thus the smaller the quantization error. Fortunately, in most normal maps the variation of the normal vector is smooth and doesn't change by a large amount within one block.

Due to the removal of one of the RGB components, the quantization of the RGB channels is also improved. The DXT5 format quantizes the color channel to a palette of four colors for each 4x4 block. Two of these colors are stored within the block, and the other two are derived as a weighted sum of the former two. This four-color palette can be treated as a set of four points placed on a line in 3D space. Two of these points can be chosen arbitrarily to explicitly define the line. The 16 points in 3D space representing the sixteen original colors in the block are replaced with four colors from the palette. It is now obvious why directly compressed normal maps look so bad: each block is represented by only four different vectors, and smoothly varying normal vectors are snapped to four vectors from the palette.

By removing one of the RGB components, the dimension of the original 16 points is reduced by one. Now four points from the palette represent 16 points in 2D space, so the quantization error is smaller in most cases. Further improvement follows with normal maps authored for tangent space because such normal maps are almost 2D initially. Vectors in tangent space maps are near (0,0,1), so if the x or y component is removed, the remaining vector is almost 1D and only one of its components varies. This makes a very good fit to the 1D color quantization of DXT5.

Pros and Cons

Pros:

• The described method is widely available because of good hardware support for DXTC.
• It uses only 8 bits per pixel, a three times smaller memory footprint.
• Only one TMU is used (no additional TMUs are required).
• It is easy to use and does not require dedicated tools for compression.
• Decoding in the pixel shader is fast and simple.
• It uses DXT5, so it guarantees that the color channel is decoded with full 24-bit precision on all nVidia hardware.

Cons:

• There is a small amount of additional data stored with each normal map: the constant required to decode the map, which has to be set in the pixel shader.
• It doesn't work as well for model space normal maps as for tangent space maps.
• Very rough normal maps still look bad when compressed; an uncompressed format is required in such cases.

References

[1] DXTC texture tools: http://developer.nvidia.com/view.asp?IO=ps_texture_compression_plugin.

[2] DirectX SDK 9.0, DirectX documentation for C++.


Drops of Water and Texture Sprites

Sylvain Lefebvre

Introduction

Textures are present in almost every real-time graphics application. They are a very convenient way of improving the appearance of a surface at low cost (in comparison to the geometry needed to obtain the same appearance). We can distinguish two main types of textures: explicit textures, which consist of (potentially large) images that are loaded into memory, and procedural textures, which are computed on the fly at the pixel level (only the procedure needs to be stored).

While the former are supported by almost all graphics hardware, procedural textures have been limited to software renderers. Nowadays, as graphics hardware becomes more and more powerful, classical procedural textures become affordable on the GPU [1].

However, in many applications something between purely procedural and explicit textures is needed; we want textures to be created by procedurally combining explicit textures. We refer to these textures as pattern-based procedural textures (see [2]). They are not procedural in the sense of classical marble or wood textures [3], but they combine explicit textures (patterns) in order to create a larger texture with the desired appearance. For instance, we may need footsteps to appear on top of a snow texture, impacts to appear where a bullet hit a wall, or drops of water to fall along a surface. These kinds of dynamic textures cannot be explicit: Just imagine a game level with thousands of surfaces. Each time a bullet hits a wall, we have to update the corresponding texture, but as we do not want the same impact to appear on all surfaces, we need to create a copy of all the texture data for each surface. The required amount of texture memory would be prohibitive. Moreover, the resolution of the impacts would not be able to exceed the resolution of the original texture.

Yet there are bullet impacts on the walls in my favorite first-person shooter game. How is this done? In most games, effects like bullet impacts are created by adding a small textured geometric quad, called a decal, on top of the geometry. This has some inconveniences. First, the quad has to be positioned precisely and split along the edges of the model on which it is applied. Second, this introduces geometry for non-geometrical reasons; the effect is purely textural, and here we have to simulate it using geometry. Third, imagine that we want to animate the decals on top of the surface; we would have to do all the positioning work at each frame for each decal. Working with complex surfaces, this would have a high computational cost.

To overcome this problem, [2] introduced a method of implementing texture sprites using graphics hardware. Texture sprites allow us to position and animate sprites in a texture. The main advantage is that we no longer have to worry about the geometry; the problem is solved directly in texture space. Another interesting point is that the resolution of the sprites can be higher than the resolution of the underlying texture. The memory cost of this technique is low, and it can be implemented on the GeForce3/Radeon 8500 (with some limitations) and on the latest programmable graphics boards. The Drops of Water shader uses the texture sprites technique to render drops over a texture. It is therefore a pattern-based procedural texture, as its appearance results from the procedural combination of explicit textures (the drop and the surface texture).

In the section titled "Texture Sprites," we introduce the texture sprites technique and describe its implementation on various types of hardware. In the section titled "The Drops of Water Effect," we see how to combine texture sprites, a wet surface effect, and a magnification effect in order to create the Drops of Water shader.

Texture Sprites

The texture sprites technique relies on a functionality of graphics hardware that was first designed to create fake bump mapping. This functionality is called offset textures. It allows an offset to be encoded in one texture and added to the texture coordinates of a second texture. We use this in order to position sprites in our textures.

Figure 1: The value read in the offset texture (stage 0) is used to add an offset to the texture coordinates of a second texture (stage 1).

Reference Texture

First we need to store our sprites in a texture, which is called the reference texture in the discussion that follows. The reference texture is treated as a regular grid. Each cell of this grid is a square block of pixels called a tile. In each tile, we can either store a sprite or leave it empty (i.e., filled with the background color). For implementation reasons, we need empty tiles around a tile containing a sprite. This limitation can be overcome on the latest graphics hardware, as explained in the "Implementation Notes" section.


Offset Map

Now we can see how to use offset textures to procedurally create a new texture from a given explicit texture. We begin with a practical case: Let's take an offset texture with a resolution of 8x8 pixels. Each pixel encodes an offset. If we apply this offset texture to a geometric quad without any filtering, we obtain a grid of 8x8 cells, one for each pixel of the offset texture. Each cell contains one (du,dv) vector. Now we take another texture with a 256x256 pixel resolution, map it onto the same quad, and use the offset texture to perturb its texture coordinates. Each cell of the offset texture therefore applies a translation to the texture coordinates of a block of 64x64 pixels of the second texture (256/8 = 64). This way we can independently translate the content of each texture block by simply updating the corresponding cell of the offset texture (only one pixel in video memory). As the translation is applied to texture coordinates, this method can be used with any textured geometry. This type of offset texture is called an offset map.
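A minimal HLSL-style sketch of this lookup, assuming illustrative sampler and coordinate names (the article's actual implementation may differ), could read:

float4 SampleWithOffsetMap(sampler2D offsetMap,   // low-resolution offset map (point sampled)
                           sampler2D refTexture,  // texture whose blocks are translated
                           float2 uvOffsetMap,    // coordinates into the offset map
                           float2 uvReference)    // coordinates into the second texture
{
    // Each texel of the offset map holds one (du,dv) translation for its cell.
    float2 duv = tex2D(offsetMap, uvOffsetMap).rg;
    return tex2D(refTexture, uvReference + duv);
}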

Sprite Positioning

Figure 2: A reference texture containing two sprites. It consists of 4x2 tiles of 8x8 pixels each. The resolution of this reference texture is therefore 32x16 pixels. Each sprite is surrounded by empty tiles (the texture is cyclic).

We use an offset map to procedurally create a new texture from the tiles of the reference texture. This new texture shows the sprites at their chosen positions. We need each cell of the offset map to cover a tile of the reference texture. Suppose that the offset map and the reference texture are respectively in texture stages 0 and 1 and have independent texture coordinates. The idea is to compute the texture coordinates of the reference texture so that one cell of the offset map covers exactly one tile. This can easily be done in a vertex program.

Figure 3: Each cell of the offset map covers exactly one tile of the reference texture.


Visually, the effect is as if each cell of the offset map were a window opened on the reference texture. This window has the same resolution as one tile. In each cell we can choose which part of the reference texture will be displayed. As we can choose to display only a small part of a tile of the reference texture, we need to have empty tiles around one sprite tile; we do not want a neighboring sprite to be visible if we only display a small part of a sprite.

Figure 4: Adding an offset (u,v) to the texture coordinates results in a translation (-u,-v) of the sprite in the texture.

As you can see in Figure 4, to translate the sprite by (0.5,0) we have to use an offset of (–0.5,0). It comes from the fact that the offset is added to the texture coordinates. Let’s assume that the point at the top-left corner of the sprite is at (0,0) in the reference texture. Point M, without using the offset map, has texture coordinates (0.5,0). When we add the offset (–0.5,0) to M, the resulting texture coordinate is (0,0); that is why we can see the top-left corner of the sprite at this location. Visually, the sprite has been translated by (0.5,0).

We now have a reference texture containing some sprites and an offset map that allows us to display sprites or parts of sprites in its cells. How do we position a sprite at a given texture coordinate?

Imagine that we want a sprite at texture coordinates (u,v). That is to say, we want the top-left corner of the tile containing the sprite to be at (u,v). We can easily compute in which cell (gi,gj) of the offset map the (u,v) texture coordinates lie; if the offset map has a resolution of NxN, the corresponding cell is (% is the modulo operator; floor(x) returns the greatest integer that is less than or equal to x):

(gi,gj) = (floor(N * u) % N, floor(N * v) % N)

As the tile of the sprite has the same size as the cell of the offset map, four cells of the offset map are needed to display the sprite. First we position the chosen sprite tile in the four concerned cells of the offset map: (gi,gj), (gi+1,gj), (gi,gj+1), (gi+1,gj+1). See Figure 6, step 1. For this purpose, we need to compute which tile (ti,tj) of the reference texture is displayed in the cell at the (u,v) texture coordinates if we were not using an offset map.

Figure 5: To display a sprite at an arbitrary position in the texture, four cells of the offset map are needed. When the sprite in tile (ti,tj) is translated, the neighboring tiles become visible.



If the reference texture contains T by T tiles (if tileres x tileres is the size in pixels of a tile, the reference texture has a size in pixels of (T * tileres) by (T * tileres)), the corresponding tile is:

(ti,tj) = (gi % T, gj % T)

We can therefore compute the offset (du_ij, dv_ij) to be stored in (gi,gj) in order to have the tile containing the sprite (si,sj) displayed instead of the tile (ti,tj):

(du_ij, dv_ij) = ((si – ti)/T, (sj – tj)/T)

This offset translates the tile (si,sj) on top of the tile (ti,tj).

Now we have to take into account the relative position of the sprite tile within the cell (gi,gj). See Figure 6, step 2. The relative position (Δu_ij, Δv_ij) of (u,v) within (gi,gj) is computed as follows:

(Δu_ij, Δv_ij) = (u*N – gi, v*N – gj)

To make the sprite appear at the correct position in tile (gi,gj), we have to translate it by (Δu_ij, Δv_ij)/T. Because we work in texture coordinate space, we have to subtract this vector from the previously computed offset (du_ij, dv_ij); as explained before, translating a sprite by (tu,tv) corresponds to subtracting (tu,tv) from the offset. Given (Δu_ij, Δv_ij), the (Δu, Δv) values of the four cells are:

(Δu_ij, Δv_ij)                                  (Δu_(i+1)j, Δv_(i+1)j) = (Δu_ij – 1, Δv_ij)
(Δu_i(j+1), Δv_i(j+1)) = (Δu_ij, Δv_ij – 1)     (Δu_(i+1)(j+1), Δv_(i+1)(j+1)) = (Δu_ij – 1, Δv_ij – 1)

The entire process of sprite positioning is summarized in Figure 6.

Figure 6: We want to position the top-left corner of the sprite at (u,v). Step 1: Using the offset map and the computed (du,dv), the tile containing the sprite is placed in the four cells. Step 2: A translation (Δu, Δv) is added in the four cells in order to display only the needed part of the sprite. Each cell displays a part of the sprite. When viewed side by side, the sprite appears to be at the chosen (u,v) coordinates.
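To make the bookkeeping concrete, here is a small numeric example (the numbers are our own and purely illustrative): assume N = 8, T = 4, and that the sprite is stored in tile (si,sj) = (1,1). To place its top-left corner at (u,v) = (0.3,0.6), we get (gi,gj) = (floor(8*0.3) % 8, floor(8*0.6) % 8) = (2,4), (ti,tj) = (2 % 4, 4 % 4) = (2,0), and (Δu_ij, Δv_ij) = (2.4 – 2, 4.8 – 4) = (0.4, 0.8). Cell (2,4) therefore stores the offset ((1–2)/4 – 0.4/4, (1–0)/4 – 0.8/4) = (–0.35, 0.05), and the three neighboring cells store the same expression with Δu and/or Δv reduced by 1.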


Rotation and Scaling

Now that we are able to position sprites in a texture, we can go further and apply transformations on each sprite (i.e., scaling and rotation). The idea is to transform the sprite within its tile by applying the transformation to the texture coordinates before accessing the sprite texture data. To do so, we need a map of the same resolution as the offset map to store the transformation associated with a sprite (i.e., rotation angle and scaling factor). This map will be called the transformation map. In the same way that a sprite uses four cells of the offset map, the four corresponding cells of the transformation map will store the transformation information of the sprite.

Imagine that we are accessing a texture with (u0,v0) coordinates, where (u0,v0) is in [0,1]x[0,1]. If we want the texture to appear rotated by an angle α around its center (0.5,0.5), we have to compute new texture coordinates (u1,v1) using a 2D rotation formula:

(u1,v1) = (  cos(–α)*(u0 – 0.5) + sin(–α)*(v0 – 0.5) + 0.5,
            –sin(–α)*(u0 – 0.5) + cos(–α)*(v0 – 0.5) + 0.5 )

As the transformation is applied on texture coordinates, we have to use an angle of –α to rotate the texture by an angle α (see Figure 7).

If we want the texture to appear scaled by a factor of s from its center, we have to compute new texture coordinates (u2,v2) using a 2D scaling formula:

(u2,v2) = ( (u1 – 0.5)/s + 0.5, (v1 – 0.5)/s + 0.5 )

We have to scale the texture coordinates by a factor of 1/s to scale the texture by a factor of s.
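The same arithmetic can be written down on the CPU side; the following is a minimal sketch (the function and parameter names are ours, not from the demo code), which is also handy for checking the shader implementation shown later:

#include <cmath>

// Rotate by 'angle' (radians) and scale by 's' a point given in tile space
// [0,1]x[0,1], both around the tile center (0.5,0.5). Returns false if the
// transformed coordinates leave the tile, in which case the background
// color should be used (see the "Limitations" discussion below).
bool TransformTileCoords(double tu, double tv, double angle, double s,
                         double& outU, double& outV)
{
    // rotation by -angle (texture space moves opposite to the sprite)
    double cu = tu - 0.5, cv = tv - 0.5;
    double ru =  std::cos(-angle) * cu + std::sin(-angle) * cv;
    double rv = -std::sin(-angle) * cu + std::cos(-angle) * cv;
    // scaling of the texture by s means scaling the coordinates by 1/s
    outU = ru / s + 0.5;
    outV = rv / s + 0.5;
    return outU >= 0.0 && outU <= 1.0 && outV >= 0.0 && outV <= 1.0;
}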

Figure 7: Scaling texture coordinates

Applying this method to the tile of a sprite in the reference texture is straightforward: We can easily express the texture coordinates in the tile space. Indeed, given (u,v) coordinates in the reference texture, the corresponding coordinates in tile space are (where frac extracts the fractional part of a floating-point number):

(tu,tv) = (frac(u * T), frac(v * T))



The corresponding tile index is:

(si,sj) = (floor(u * T), floor(v * T))

From (tu,tv), we can go back to reference texture space by computing:

(u,v) = ((si + tu)/T, (sj + tv)/T)

But wait! What will ensure that we are not going out of the sprite tile? Nothing! So we have to check whether we are going out of the tile; this is done by testing whether the coordinates still lie in [0,1]x[0,1] after transformation. This is important because, as multiple sprites may be stored in the reference texture, neighboring sprites may become visible if we shrink a sprite beyond a certain threshold (even with the empty tiles; see Figure 8).

Limitations

There are some constraints on the scaling that we can apply: If we enlarge a sprite too much, it may be clipped by the border of the offset map cells. However, we can shrink the sprite as much as we want. In order to allow arbitrary rotations, the shape of a sprite must also be contained in a circle centered at the tile center.

Figure 8: Reference texture. One sprite with a scaling of 0.5 (2.0 in texture space); only empty tiles are displayed. If we shrink the sprite too much, other sprites become visible (left). If the hardware allows it (see the “Implementation” section), we can check that the texture coordinates are still in the sprite tile; if not, we display the background color (right).

Overlapping Sprites

As each sprite requires four cells of the offset map, multiple grids should be used in order to allow the sprites to overlap. This could be done in multiple passes or in one pass, depending on the hardware that you are working with (see the Drops of Water effect implementation).

Precision

The limited precision of per-pixel arithmetic operations can result in positioning problems; an artifact can appear on the sprite at tile borders. This issue is mainly solved by carefully creating the positioning texture: All arithmetic operations are done with respect to 8 or 16 bits of precision. However, some artifacts may still be visible when the viewpoint is very close to the texture. On recent graphics boards, the introduction of floating-point textures and 32-bit precision solves this problem.


Filtering

• Mipmapping — As we are packing sprites in the same texture, we have to prevent the use of the coarsest mipmapping levels. Indeed, there is a level for which each pixel corresponds to the average color of one tile of the reference texture. All coarser levels are not correct, as they are computed by using pixels from multiple tiles (a small sketch of how to limit the mip chain accordingly follows this list).

• Far viewpoint — If the viewpoint is far from the surface, aliasing occurs on the offset map: Multiple cells of the offset map are projected onto one pixel of the screen. To solve this problem, we have to compute a color version of the offset map, where each pixel is the average color of the pixels displayed in the offset map cell. When far from the surface, this texture should be used instead of the texture sprites.
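One simple way to enforce the mipmapping restriction, assuming a Direct3D 9 code path and power-of-two tiles (the helper below is our own sketch, not part of the original demo), is to create the reference texture with only the valid number of mip levels so that the averaged-across-tiles levels are never generated:

// Number of valid mip levels for a reference texture built from tiles of
// tileRes x tileRes pixels: levels 0 .. log2(tileRes) are usable, so we
// request log2(tileRes) + 1 levels in total (e.g., 4 levels for 8x8 tiles).
unsigned int UsableMipLevels(unsigned int tileRes)
{
    unsigned int levels = 1;
    while (tileRes > 1) { tileRes >>= 1; ++levels; }
    return levels;
}
// Pass this value as the MipLevels argument when creating the reference
// texture (for example, via D3DXCreateTextureFromFileEx), so Direct3D stops
// the mip chain before tiles start being averaged together.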

Implementation

Software Positioning

Software positioning of sprites is done by the PutSprite function. Its parameters are a sprite index, a texture coordinate, a rotation angle, and a scaling factor. The PutSprite function computes the indices of the four cells of the offset map that need to be updated in order to position the sprite. It also computes the relative coordinates of the sprite in these cells. Then it calls RefTexIJ2OffsetMapIJ, which updates each cell in order to display the sprite tile with the correct translation. Note that we pack the transformation data (i.e., rotation angle and scaling factor) in the blue and alpha channels of the offset map.

void PutSprite(int si,int sj,
               double u,double v,
               double angle,double scale)
{
    double ugrid,vgrid;
    double du,dv;
    int    gi,gj;

    // ensure u and v are both > 0.0 (because of modulo operations)
    // textures are cyclic with a period of 1.0:
    // adding an integer to texture coordinates
    // does not change the final result
    if (u < 0.0)
        u=1.0+(u-(int)u);
    if (v < 0.0)
        v=1.0+(v-(int)v);
    // compute pos in offset map
    ugrid=(u*m_dwOffsetMapRes);
    vgrid=(v*m_dwOffsetMapRes);
    // compute offset map cell index
    gi=(int)ugrid;
    gj=(int)vgrid;
    // compute texture coordinates relative to the cell
    du=ugrid-gi;
    dv=vgrid-gj;
    // cell i,j
    RefTexIJ2OffsetMapIJ(si,sj, gi,gj,     du,dv,         angle,scale);
    // cell i+1,j
    RefTexIJ2OffsetMapIJ(si,sj, gi+1,gj,   du-1.0,dv,     angle,scale);
    // cell i,j+1
    RefTexIJ2OffsetMapIJ(si,sj, gi,gj+1,   du,dv-1.0,     angle,scale);
    // cell i+1,j+1
    RefTexIJ2OffsetMapIJ(si,sj, gi+1,gj+1, du-1.0,dv-1.0, angle,scale);
    // update offset map in video memory
    UpdateOffsetMap();
}

void RefTexIJ2OffsetMapIJ(int si,int sj,
                          int gi,int gj,
                          double delta_u,double delta_v,
                          double angle,double scale)
{
    int    ti,tj;
    double du,dv;
    int    l_T,idu,idv;

    // ensure gi,gj are in grid bounds
    gi %= m_dwOffsetMapRes;
    gj %= m_dwOffsetMapRes;
    // compute what tile would be here if we were not using an offset map
    ti=gi % m_dwRefTexNbTiles;
    tj=gj % m_dwRefTexNbTiles;
    // compute offset to apply in this cell
    du=(si-ti) - delta_u;
    dv=(sj-tj) - delta_v;
    // encoding du,dv as 8-bit integers (low precision!)
    l_T=128 / m_dwRefTexNbTiles;
    idu=(int)(du*l_T);
    idv=(int)(dv*l_T);
    // write to grid
    m_OffsetMap[(gi+gj*m_dwOffsetMapRes)*4  ]=(BYTE)idu;
    m_OffsetMap[(gi+gj*m_dwOffsetMapRes)*4+1]=(BYTE)idv;
    // transformation data in blue and alpha channel of offset map
    m_OffsetMap[(gi+gj*m_dwOffsetMapRes)*4+2]=(BYTE)(angle*256.0/360.0);
    m_OffsetMap[(gi+gj*m_dwOffsetMapRes)*4+3]=(BYTE)(255.0*scale);
}

GeForce 3/4 (and higher), ps 1.3 (no rotation, no scaling)

With ps 1.3, the implementation relies on the texbem instruction to add the translation encoded in the offset map to the texture coordinates of the reference texture.

ps.1.3
tex    t0
texbem t1, t0
mov    r0, t1

Radeon 8500 (and higher), ps 1.4 (no rotation, no scaling)

The implementation is straightforward; we simply read the offset map and add the offset to the texture coordinates.

ps.1.4
texcrd r0.xyz, t1          // read texture coordinates
texld  r1, t0              // read offset map
add    r1.xyz, r0, r1_bx2  // add offset to tex coords
phase
texld  r0, r1.xyz          // read reference texture

GeForce FX / Radeon 9700, HLSL/ps 2.0

The implementation includes rotation and scaling of sprites. The transformation of texture coordinates is done by the transformedLookup function. The Cg code would be almost the same.

half4 transformedLookup(uniform sampler2D tex,
                        half2 ctex,
                        half angle,half scale)
{
  half4 c;
  // transform coordinates from reference texture space to tile space
  half2 gcoords=ctex*RefTexNbTiles;
  half2 uv0=frac(gcoords);      // tile space
  half2 isprite=floor(gcoords); // sprite index (si,sj)
  // apply rotation
  half si,cs;
  sincos(-angle*6.28,si,cs);
  uv0=uv0-0.5;
  half2 uv1=half2( uv0.x*cs + uv0.y*si,
                  -uv0.x*si + uv0.y*cs);
  uv1=uv1+0.5;
  // apply scaling
  uv1=uv1-0.5;
  half2 uv2=uv1/scale;
  uv2=uv2+0.5;
  // are coordinates still in sprite tile?
  if ( uv2.x > 1.0 || uv2.x < 0.0
    || uv2.y > 1.0 || uv2.y < 0.0)
    c=bkgColor;
  else
    c=tex2D(tex,(uv2+isprite)/RefTexNbTiles);
  return c;
}

float4 ps20TSprite(VS_OUTPUT In) : COLOR
{
  float4 color;
  // read offset map
  float4 mapdata=tex2D(SOff0,In.Tex);
  // unpack offset
  float2 offset=2.0*mapdata.rg-1.0;
  // apply offset
  float2 uv0=offset+In.Grid;
  // apply transformation
  float angle=mapdata.b;
  float scale=mapdata.a;
  color=transformedLookup(STex0,uv0,angle,scale);
  return (color);
}

Implementation Notes

With ps 2.0/ps 2.x, it is possible to get rid of the empty tiles in the reference texture by testing whether the texture coordinates are outside of the sprite tile. If the texture coordinates are outside, we use the background color. As there was no conditional statement before ps 2.0, we had to use empty tiles.

Extensions

Several extensions of this method are possible, such as random positioning of sprites according to a spatial probability distribution or aperiodic tiling. Please refer to the paper [2] for more information on extensions of this method.


The Drops of Water Effect

The Drops of Water effect involves the following techniques:

• Texture sprites for the positioning of drops
• Magnification effect for the rendering of drops
• Phong illumination model for the rendering of surfaces

The Phong illumination model will not be described here. Please refer to the article “Fragment-level Phong Illumination” in Section II by Emil Persson.

Drops Motion

The animation of drops is obtained by computing a direction and speed for each drop. At each time step, we update the position of each drop using its direction vector multiplied by its speed. The direction is basically a straight line going from top to bottom with some random angle perturbation. The speed depends on the size of the drop and is also randomly perturbed. If we want drops to fall along a complex surface, we must also take curvature into account.
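A minimal per-frame update along these lines could look like the following sketch (the structure fields and the random model are illustrative assumptions, not taken from the demo code):

#include <cstdlib>

struct Drop {
    float u, v;        // position in texture space [0,1)
    float dirU, dirV;  // falling direction, dirV close to 1 (top to bottom)
    float speed;       // depends on drop size
    float size;
};

float RandomIn(float lo, float hi)
{
    return lo + (hi - lo) * (float)rand() / (float)RAND_MAX;
}

void UpdateDrop(Drop& d, float dt)
{
    // small random perturbation of the falling direction
    d.dirU += RandomIn(-0.02f, 0.02f);
    // bigger drops fall faster; perturb the speed a little as well
    d.speed = d.size * RandomIn(0.9f, 1.1f);
    d.u += d.dirU * d.speed * dt;
    d.v += d.dirV * d.speed * dt;
    // textures are cyclic with a period of 1.0
    if (d.u <  0.0f) d.u += 1.0f;
    if (d.u >= 1.0f) d.u -= 1.0f;
    if (d.v >= 1.0f) d.v -= 1.0f;
}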

Wet Surface

The best way to render the appearance of a wet surface is to begin by looking at a real wet surface. Let’s drop some water on a table (the computer keyboard is not a good example — avoid it). What happens? Basically, you can see that the surface becomes darker and shinier. In terms of the Phong illumination model, this implies that the diffuse coefficient of the surface decreases while the specular coefficient increases.

To keep track of wet areas, drops are rendered into a texture as white spots. This texture is called the wet areas texture. At each time step, the drops are rendered at their new positions on top of the previous time step’s wet areas texture. To obtain a drying effect, we simply darken the previous time step’s texture. This results in a white trace following the path of each drop. The more time passes, the darker the trace becomes.

The wet areas texture is then used to modify the diffuse and specular coefficients of the final surface. If there is a white pixel in the wet areas texture, a low diffuse coefficient and a high specular coefficient are used. The darker the pixel, the higher the diffuse coefficient and the lower the specular coefficient. Now we have animated drops that leave a wet trace on the surface!



Figure 9: The wet areas texture... Figure 10: ...and the corresponding final result (see Color Plate 8).

Magnification Effect

Looking carefully at a real drop of water, we can see that it behaves like a small magnifying glass. This is due to the refraction of light rays passing from air to water [4]. Even if it were possible to compute the exact refraction of rays hitting the drop surface [5], it would be costly.

There is a much simpler way to render such an effect (which has absolutely no physical correctness!). The idea is to compute an offset to be added to the texture coordinates of the underlying texture at each pixel. This offset is computed in order to render the behavior of a magnifying glass: It depends both on the surface shape and the viewpoint position. The offset formula is:

offset = –mag_coeff * (texcoords – center) / height(texcoords) – viewvector * view_coeff

offset, texcoords, center, and viewvector are 2D vectors, and height(texcoords) returns the height of the drop at texcoords; it is a scalar value. mag_coeff and view_coeff are also scalar values. Increasing mag_coeff results in an increased magnification effect. Increasing view_coeff results in more dependency between the viewpoint and the aspect of the drop. The demo application allows interactively changing these parameters.

Figure 11: mag_coeff = 0.00, 0.15, 0.30, 0.45, 0.60, 0.75


Combining All

Each drop is encoded as a sprite. The final effect uses multiple layers of texture sprites in order to allow the overlapping of drops. The program detects overlapping after updating the position of drops and tries to put overlapping drops in different texture layers. It also merges close drops into bigger drops. The rendering algorithm for multiple layers of sprites proceeds as follows: First it renders the background surface, and then it renders each layer of sprites.

Render object with background texture
For I=1 to number of sprite layers
    If there are some sprites in the layer I
        Render object using texture sprites shader for layer I
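The layer-assignment step mentioned above (“tries to put overlapping drops in different texture layers”) could be sketched as follows; the types, names, and the greedy strategy are our own illustration, not the demo’s actual code:

#include <vector>

struct Layer {
    std::vector<bool> used;   // one flag per offset-map cell
    int res;                  // offset map resolution N
    bool CellsFree(int gi, int gj) const {
        for (int dj = 0; dj < 2; ++dj)
            for (int di = 0; di < 2; ++di)
                if (used[((gi + di) % res) + ((gj + dj) % res) * res])
                    return false;
        return true;
    }
    void MarkCells(int gi, int gj) {
        for (int dj = 0; dj < 2; ++dj)
            for (int di = 0; di < 2; ++di)
                used[((gi + di) % res) + ((gj + dj) % res) * res] = true;
    }
};

// Returns the index of the first layer whose four offset-map cells are still
// free at this position, or -1 if every layer is occupied (the caller may
// then merge the drop with a neighbor).
int AssignDropToLayer(std::vector<Layer>& layers, double u, double v)
{
    for (size_t i = 0; i < layers.size(); ++i) {
        int gi = (int)(u * layers[i].res) % layers[i].res;
        int gj = (int)(v * layers[i].res) % layers[i].res;
        if (layers[i].CellsFree(gi, gj)) {
            layers[i].MarkCells(gi, gj);
            return (int)i;
        }
    }
    return -1;
}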

For each layer of texture sprites, only pixels that effectively correspond to a sprite are rendered. All other pixels are discarded. The Drops of Water effect begins by rendering the wet areas texture and then renders the final object. The complete algorithm is as follows:

Animate drops
Set render target to wet areas texture
Render wet areas texture with a darkening factor
For each layer of drops,
    If there are some drops in the layer
        Render drops as white spots
Set render target to screen
Render surface with per-pixel Phong model
For each layer of drops,
    If there are some drops in the layer
        Render drops with magnification effect and Phong model


Note that if the program only uses one or two layers of sprites, all the previous operations can be done in one fragment program (i.e., in one pass). The choice of multipass versus all-in-one-pass rendering depends on the complexity of the geometry: If there is a lot of geometry, a more complex pixel shader should be used, as the rendering of the geometry will take a long time. If the geometry is simple, we can use multipass rendering, as the geometry can be rendered very quickly. Nevertheless, to determine which approach is best in a particular case, it is best to test both. Indeed, rendering bottlenecks are difficult to identify on modern hardware, and testing is often better than assuming.

The Drops of Water effect is written using Cg. It runs on hardware with ps 2.x support. It cannot run on ps 2.0 because of the program length limitation (64 instructions on ps 2.0). It is, however, possible to simplify the code in order to make it shorter. There are three Cg fragment programs: the first program renders the drops as white spots for the wet areas texture, the second program renders the underlying surface with the Phong model, and the third program renders a layer of drops of water. The textures used by the programs are:



OffsetMap: Texture sprites offset map for the current layer
DropNrms:  Reference texture encoding drop normals in the RGB channels and drop height in the alpha channel
ColorMap:  Texture of the underlying surface
NrmsMap:   Normal map of the underlying surface
WetAreas:  Wet areas texture

The first Cg program (renders a layer of drops in the wet areas texture):

PixelOut main(DowV2F IN,
              uniform sampler2D OffsetMap : texunit0,
              uniform sampler2D DropNrms  : texunit1)
{
  half3 color;
  half2 coords;
  half4 offset;
  half4 drop;
  // ======================================
  // texture sprites
  // -> look in offset map
  offset=h4tex2D(OffsetMap,IN.TCoords0.xy);
  offset.xy=(offset.xy-0.5)*2.0;
  coords.xy=(offset.xy+IN.TCoords1.xy);
  drop=transformedLookup(DropNrms,coords,offset.z,offset.w);
  // -> if not in a drop, discard fragment
  if (drop.w < 0.7)
    discard;
  // -> else output white color
  PixelOut OUT;
  OUT.COL = half4(half3(1.0,1.0,1.0),1.0);
  return OUT;
}

The second Cg program (renders the underlying surface):

PixelOut main(DowV2F IN,
              uniform sampler2D ColorMap : texunit2,
              uniform sampler2D NrmsMap  : texunit3,
              uniform sampler2D WetAreas : texunit4)
{
  half3 color;
  // ===================
  // floor lighting
  // -> compute per-pixel Light and View vector
  half3 nL=normalize(IN.L);
  half3 nV=normalize(IN.V);
  half3 H=(nV+nL)*0.5;
  // -> wet areas texture is used to control diffuse and specular
  //    coefficients
  half wetfloor=h1tex2D(WetAreas,IN.TCoords0.xy);
  half diffatten=0.45+0.55*(1.0-wetfloor);
  // -> read surface normal
  half3 fnrm=h3tex2D(NrmsMap,IN.TCoords0.xy)*2.0-1.0;
  // -> compute diffuse and specular terms
  half fspec=pow(dot(fnrm,H),50.0)*wetfloor;
  half fdiff=diffatten*dot(fnrm,nL);
  // -> final color
  color=h3tex2D(ColorMap,IN.TCoords0.xy)*fdiff+fspec;
  PixelOut OUT;
  OUT.COL = half4(color,1.0h);
  return OUT;
}

The third Cg program (renders a layer of drops):

PixelOut main(DowV2F IN,
              uniform sampler2D OffsetMap : texunit0,
              uniform sampler2D DropNrms  : texunit1,
              uniform sampler2D ColorMap  : texunit2,
              uniform half MagCoeff,
              uniform half ViewCoeff)
{
  half3 color;
  half2 coords;
  half4 offset;
  half4 drop;
  // ======================================
  // texture sprites
  // -> look in offset map
  offset=h4tex2D(OffsetMap,IN.TCoords0.xy);
  offset.xy=(offset.xy-0.5)*2.0;
  coords=(offset.xy+IN.TCoords1.xy);
  coords=frac(coords);
  drop=transformedLookup(DropNrms,coords,offset.z,offset.w);
  // -> if not in a drop, discard fragment
  if (drop.w < 0.1)
    discard;
  // ===================
  // drop lighting
  // -> compute per-pixel Light and View vector
  half3 nL=normalize(IN.L);
  half3 nV=normalize(IN.V);
  half3 H=(nV+nL)*0.5;
  // -> magnification effect
  half2 decal=-(MagCoeff*(coords-0.75)/drop.w)-nV.xy*ViewCoeff;
  // -> unpack drop normal
  half3 nrm=(drop.xyz*2.0-1.0);
  // -> specular + diffuse
  half spec=pow(dot(nrm,H),20.0)*0.75;
  half diff=(0.6+0.5*dot(nrm,nL));
  // -> color
  color=h3tex2D(ColorMap,IN.TCoords0.xy+decal.xy)*diff+spec;
  // -> alpha for antialiasing of drop edges
  half alpha=min((drop.w-0.1)/0.2,1.0);
  PixelOut OUT;
  OUT.COL = half4(color,alpha);
  return OUT;
}

The Companion CD

There are two demos on the companion CD. The first program (tsprite) is written with DirectX/HLSL and illustrates the texture sprites technique with various hardware implementations. The second program (dow) is written with DirectX/Cg and demonstrates the Drops of Water effect. Both programs have parameters that can be interactively changed. Use the menus or press F1 for help.

Conclusion

The Drops of Water effect is a complex shader that involves many different techniques. It is an illustration of how much textures can improve the appearance of a surface and how they can be used to achieve complex animated effects. I hope that you had fun playing with these little drops and that you will find hundreds of different applications for the texture sprites technique.

Acknowledgments

Thanks to Przemek Prusinkiewicz, Julia Taylor-Hell, and Samuel Hornus for carefully proofreading this article.

References

[1] 3D Procedural Texturing in nVidia Cg Effect Browser — Cg Toolkit.
[2] Lefebvre, Sylvain and Fabrice Neyret, “Pattern Based Procedural Textures,” Proceedings of the ACM SIGGRAPH 2003 Symposium on Interactive 3D Graphics, http://www-imagis.imag.fr/Membres/Sylvain.Lefebvre/pattern.
[3] Ebert, David S., F. Kenton Musgrave, Darwyn Peachey, Ken Perlin (Editor), and Steven Worley, Texturing & Modeling: A Procedural Approach, Academic Press, 2003.
[4] Glassner, Andrew S. (Editor), An Introduction to Ray Tracing, Academic Press, 1989.
[5] Eye Raytrace in nVidia Cg Effect Browser — Cg Toolkit.


Advanced Water Effects

Kurt Pelzer

Introduction

A water simulation as realistic as possible and as widely usable as possible is desired for many targeted applications, such as a basic component of the game play, as an idyllic ambient element, or simply as a delimitation of game worlds. The first ShaderX book [Engel 2002] had several articles about this topic from different viewpoints using version 1.x shaders. Additionally, some tech demos and benchmark tools have presented impressive water effects.

In order to achieve a further increase in visual quality, we need the following features, among others:

• An exact mixing of the visible reflection and semitransparent underwater scene with respect to the involved materials at the boundaries (specifically single boundaries of less dense to more dense materials — for example, air-to-water) and the different angles of incidence between the line of vision and the tangent planes of the rippled water surface. Each wave and ripple has to be visible by a correct Fresnel reflection.

• The water surface must be animated as realistically as possible — that is, all ripples move in a common direction (but the smaller ones with a lower speed) and smoothly change their look at run time without visible repetitions at higher viewpoints.

• Depending on the distance from the water surface, the lighting of the visible underwater objects must be changed to make different water depths recognizable by the simulated absorption of light. This absorption should be adjustable for each color channel.

• The complete water effect must fade out at the water’s edge to hide the coarseness of the game world’s polygonal construction, and this fading should be done automatically to handle a changing water level or world geometry at run time.

Based on the new extended shaders as well as the increased performance of DirectX 9-compliant video cards, you can build a top-quality and fast water effect that includes the above features. This article presents an implementation using vertex and pixel shader version 2.0 instructions (it is possible to build shaders with reduced effects based on version 1.x instructions, but the goal of this article


is to introduce the complete effects). The composition of the complete water simulation is presented first; all components are explained more precisely in later sections. Additionally, a demo application with included source code is available on the companion CD. Screen shots of the demo are shown in Color Plates 9 and 10.

Overview

Before we discuss each component of the advanced water effects, it makes sense to display the general idea of the complete water simulation.

Figure 1: Overview of water simulation

Figure 1 and the following outline should help you find your path through this article:

Preparation of the Underwater Scene
• Rendering the Underwater Scene (First Render Pass)
• Modifications Dependent on Water Depth (Second Render Pass)
• Projection of the Final Underwater Scene Texture

Preparation of the Above-Water Scene
• Rendering the Reflection Map
• The Detail Map

Faking Waves and Ripples
• Animated Surface Bumping
• Per-Pixel Fresnel Reflection

The Complete Shader Programs
• The Final Vertex Shader
• The Final Pixel Shader


Preparation of the Underwater Scene

We have to run two render passes to generate a realistic underwater scene view. The first pass simply fills a render-target texture with the scene (see the following section). Depending on the water depth, a second render pass modifies this texture to receive a more realistic absorption of the light and make different water depths recognizable (see the section “Modifications Dependent on Water Depth (Second Render Pass)”). Later on, when the water plane is rendered, we have to project the final texture onto the water surface (see the section “Projection of the Final Underwater Scene Texture”). Figure 2 displays the whole process.

Figure 2: Preparation of the underwater scene

Rendering the Underwater Scene (First Render Pass)

We want to simulate a close-to-reality view into the water. The seabed and objects like fish or plants should be distortable by faked bumps (see the section “Animated Surface Bumping”). So, we have to render the underwater scene view each frame again into a render-target texture. For this job, we use the original camera. A clip plane cuts off the invisible part of the scene above the water surface (see Figure 3).

Figure 3: Original camera and clip plane
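A minimal sketch of such a clip plane setup under Direct3D 9 (the function and variable names are ours; note that with a programmable vertex shader D3D9 expects the plane in clip space, hence the inverse-transpose transform):

#include <d3dx9.h>

void EnableUnderwaterClipPlane(IDirect3DDevice9* device,
                               const D3DXMATRIX& view,
                               const D3DXMATRIX& proj,
                               float waterLevel)
{
    // world-space plane that keeps only points with y <= waterLevel
    D3DXPLANE worldPlane(0.0f, -1.0f, 0.0f, waterLevel);

    D3DXMATRIX viewProj = view * proj;
    D3DXMATRIX invTransp;
    D3DXMatrixInverse(&invTransp, NULL, &viewProj);
    D3DXMatrixTranspose(&invTransp, &invTransp);

    D3DXPLANE clipPlane;
    D3DXPlaneTransform(&clipPlane, &worldPlane, &invTransp);

    device->SetClipPlane(0, (const float*)&clipPlane);
    device->SetRenderState(D3DRS_CLIPPLANEENABLE, D3DCLIPPLANE0);
}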


Modifications Dependent on Water Depth (Second Render Pass)

With a second render pass, this render-target texture (containing the underwater scene view) will be modified so that afterward the alpha channel holds the blending for the water’s edge and the color channels contain a darkened color (depending on the current water depth). This darkening is to make different water depths recognizable and simulate water pollution. To compute the intensity for each color channel, we use a reduction formula: exp(–d * λ). You may know this type of formula from the law of radioactive decay. The parameter d is the current depth of water, and the λ (which may take three different values for red, green, and blue) controls the water tint (see Figure 4).

Figure 4: Reduction formula exp(–d * λ)

Since in the pixel shader the exponential function with basis 2 is available, our three λ are the reciprocals of the half-life values for each color component. This is the depth underneath the waterline where the red, green, and blue components of the light are reduced to half of their brightness. So, our three λ are very simple: λ_red = 1/Half-Life-Of-Red, λ_green = 1/Half-Life-Of-Green, and λ_blue = 1/Half-Life-Of-Blue. If we select a greater value for λ_blue, its half-life is going to be smaller. That means the blue component quickly disappears from the water color, giving all underwater objects a dimmed blue color and yellow-dyed appearance.
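As a small illustration of this half-life interpretation (the helper below and its half-life values are our own example, not part of the original demo), the per-channel attenuation for a given underwater depth can be computed like this:

#include <cmath>

struct RGB { float r, g, b; };

// depth and half-life values are in the same world-space units;
// 2^(-depth/halfLife) equals exp(-depth * lambda) with lambda = 1/halfLife.
RGB UnderwaterAttenuation(float depth, RGB halfLife)
{
    RGB a;
    a.r = exp2f(-depth / halfLife.r);
    a.g = exp2f(-depth / halfLife.g);
    a.b = exp2f(-depth / halfLife.b);
    return a;
}
// e.g., halfLife = {4, 3, 1}: one unit below the surface the blue component
// is already halved, so deeper objects drift toward a yellow tint.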

Our reduction formula requires knowing the depth (underwater, not z) of the current pixel being processed. To do this, we pass the vertex position to the pixel shader PS-1 as texture coordinates. This guarantees that the vertex position will be interpolated linearly, providing us with the underwater depth of each pixel. So, the vertex shader VS-1 is very simple (and can also be implemented using version 1.x instructions):

// VERTEX SHADER (for DX9 hardware and better) VS-1
// FUNCTION: Modifies underwater scene texture
//
// INPUT:
// v0      = position (3 floats)
// c0 - c3 = world/view/proj matrix

// version instruction
vs_2_0

// declare registers
dcl_position v0

// transform position into projection space
dp4 oPos.x, v0, c0   // c0 = first row of transposed world/view/proj-matrix.
dp4 oPos.y, v0, c1   // c1 = second row of transposed world/view/proj-matrix.
dp4 oPos.z, v0, c2   // c2 = third row of transposed world/view/proj-matrix.
dp4 oPos.w, v0, c3   // c3 = fourth row of transposed world/view/proj-matrix.

// transfer position to pixel shader
mov oT0, v0          // We pass the vertex position to the pixel shader as tex coord.

This is the associated pixel shader PS-1:

// PIXEL SHADER (for DX9 hardware and better) PS-1
// FUNCTION: Modifies underwater scene texture
//
// INPUT:
// t0 = object position (in world space)
// c0 = cam-point in object-space
// c1 = water height in y component, fading scale in alpha component
// c2 = λ's for absorption of light (λ(Red), λ(Green), λ(Blue))

// version instruction
ps_2_0

// define the constants
def c3, 0.00f, 1.00f, 0.00f, 0.00f

// declare the used resources
dcl t0

// calculate the alpha value for water's edge fading
mov     r1, c3            // Calculate the underwater depth
mad     r1, -t0, r1, c1   // (distance: water plane -> object),
mul_sat r0.a, c1.a, r1.g  // scale this value and clamp the result to [0,1].
rsq     r0.a, r0.a        // We want to see a smooth fading, so computing the
rcp_sat r0.a, r0.a        // square root will be fine.

// calculate the underwater absorption of light
mul     r2.rgb, c2, r1.g  // Calculate d * λ for each color.
exp_sat r0.r, -r2.r       // exp( -d * λ ) for red color.
exp_sat r0.g, -r2.g       // exp( -d * λ ) for green color.
exp_sat r0.b, -r2.b       // exp( -d * λ ) for blue color.

// output color
mov oC0, r0               // Output: The final color intensities and the fading alpha.

The result of this second render pass must be multiplied by the current content of the underwater scene texture. Therefore, the alpha blending needs the following parameters:

D3DRS_SRCBLEND  = D3DBLEND_DESTCOLOR;
D3DRS_DESTBLEND = D3DBLEND_ZERO;

Projection of the Final Underwater Scene Texture

To project the final underwater scene texture onto the water surface, the vertex shader VS-2 must receive the transposed version of the following matrix:

ProjectionMatrix = OrigCam.ViewMatrix * OrigCam.ProjectionMatrix * TrafoMatrix

OrigCam.ViewMatrix * OrigCam.ProjectionMatrix transforms the world into projection space, where coordinates range from –1 to +1 (see Figure 5).

Figure 5: From world space to projection space

To map these coordinates from the projection space, where (x, y) belongs to the range [–1,+1], to the texture space, where (x, y) belongs to the range [0,1], we have to multiply by a special TrafoMatrix. This transformation matrix causes the transition from projection space to texture space by scaling and translating the vertex positions of all objects:

              | 0.5   0    0   0 |
TrafoMatrix = | 0    -0.5  0   0 |
              | 0     0    0   0 |
              | 0.5   0.5  1   1 |

A sign change in the y component is necessary for correct alignment because, from the texture’s point of view, the scene seems to be inverted (top and bottom are swapped; the v components of texture coordinates use an inverted y direction — see Figure 6).

Figure 6: Inverted y (= v) direction for texture coordinates
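On the CPU side, building this matrix with D3DX could look like the following sketch (the function name is ours; row-vector convention as used by D3DX):

#include <d3dx9.h>

D3DXMATRIX BuildRefractionTexMatrix(const D3DXMATRIX& view,
                                    const D3DXMATRIX& proj)
{
    // projection space [-1,+1] -> texture space [0,1], with v flipped
    D3DXMATRIX trafo(0.5f,  0.0f, 0.0f, 0.0f,
                     0.0f, -0.5f, 0.0f, 0.0f,
                     0.0f,  0.0f, 0.0f, 0.0f,
                     0.5f,  0.5f, 1.0f, 1.0f);
    D3DXMATRIX m = view * proj * trafo;
    D3DXMatrixTranspose(&m, &m);   // the shader expects the transposed rows
    return m;                      // upload rows to c8-c10 for VS-2
}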


The vertex shader VS-2 (used to build the water surface) computes the dot products of the water’s vertex positions and the rows of this projection matrix. The resulting values are the new texture coordinates for the underwater scene texture and will be sent to the pixel shader PS-2:

// r0 contains the translated object position
dp4 r9.x,  r0, c8   // c8  = first row of transposed refraction-projection-matrix
dp4 r9.y,  r0, c9   // c9  = second row of transposed refraction-projection-matrix
dp4 r9.zw, r0, c10  // c10 = third row of transposed refraction-projection-matrix
mov oT2, r9         // output: underwater scene tex coords (send to pixel shader)

Finally, in the pixel shader these texture coordinates will be bumped like those of the reflection map before sampling the texture, but with a lowered strength and contrarotated direction (see the section “Animated Surface Bumping”).

Preparation of the Above-Water Scene

The above-water scene contains two different areas: objects that can be seen in the reflection and floating details on the water surface.

Rendering the Reflection Map

To simulate a close-to-reality reflection, we have to render the above-water scene view (maybe with a reduced object LOD) each frame again into a render-target texture. For this job, we need the original camera mirrored at the water surface. The invisible part of the scene under the water surface will be cut off by a clip plane to prevent obstructions of vision for this new camera (see Figure 7).

Figure 7: Mirrored camera and clip plane

Mirroring the original camera at the water plane is done by a simple modification of the view matrix. It works like mirroring the world about the water surface:

MirrorCam.ViewMatrix = MirrorTrafoMatrix * OrigCam.ViewMatrix

Starting with a vertex v = (x, y, z) that is to be reflected, we translate the scene in the y direction by the negative water level: v = (x, y–wl, z). So the mirror plane will become the xz plane. Subsequently, reflection about the translated plane is done by changing the sign of the y component of the input vertex: v = (x, –(y–wl), z). Finally, the mirror plane will be shifted back to its old place again: v = (x, –(y–wl)+wl, z) = (x, –y+2*wl, z). The previous transformations can be put in the following matrix form:

                    | 1   0             0   0 |
MirrorTrafoMatrix = | 0  -1             0   0 |
                    | 0   0             1   0 |
                    | 0   2*WaterLevel  0   1 |

To project this reflection map (rendered with the mirrored camera) onto the water surface, the vertex shader VS-2 must receive the transposed version of the following matrix:

ProjectionMatrix = MirrorCam.ViewMatrix * OrigCam.ProjectionMatrix * TrafoMatrix

MirrorCam.ViewMatrix * OrigCam.ProjectionMatrix transforms the world into projection space. Just as before, to map these coordinates from the projection space to the texture space, we have to multiply by a special TrafoMatrix. This transformation matrix equals the one we used for projecting the underwater scene texture (see the section “Projection of the Final Underwater Scene Texture”). As done for the underwater scene map, the vertex shader VS-2 computes the dot products of the water’s vertex positions and the rows of the projection matrix. The resulting values are the new texture coordinates for the reflection map and are sent to the pixel shader PS-2:

// r0 contains the water's vertex position
dp4 r9.x,  r0, c4   // c4 = first row of transposed reflection-projection-matrix
dp4 r9.y,  r0, c5   // c5 = second row of transposed reflection-projection-matrix
dp4 r9.zw, r0, c6   // c6 = third row of transposed reflection-projection-matrix
mov oT1, r9         // output: reflection tex coords (send to the pixel shader)

Finally, in the pixel shader these texture coordinates will be distorted by animated bump-normals before sampling the texture (see the section “Animated Surface Bumping”).

The Detail Map

To provide the water surface with additional details (like algae or oil), we add a separate texture that has an alpha channel to indicate sections of different transparency. This detail map must be able to be tiled seamlessly without showing visible artifacts. Sections without these details should be completely transparent, and objects like algae should get a semitransparent alpha value. The map will be translated each frame like the bump-normal maps (see the section “Animated Surface Bumping”), but this translation is scaled down to run at a lower speed. This creates the realistic impression that the water and the objects floating on it move more slowly than the waves. Like the reflection and underwater scene maps, this detail texture will be distorted in the pixel shader PS-2 (but the bump-normal gets a different scaling factor for this job):

// r5 contains the bump (see the section "Animated Surface Bumping")
mul   r0.rgb, r5, c10.b  // c10.b = scaling factor to reduce the bumping
add   r0.rgb, r0, t3     // t3 contains the original tex coords for the detail map
texld r3, r0, s3         // load filtered detail texel from tex sampler 3

This reinforces the impression of a close-to-reality animation of the water surface (see the following section). Blending with the remaining part of the water effect will happen later on in the pixel shader by calculating a linear interpolation:

lrp_sat r8.rgb, r3.a, r3, r7  // r7 contains the blended reflection and underwater scene

Faking Waves and Ripples

Now we have to add realistic waves and ripples to the water surface. We make use of the interferences between multiple bump map layers and introduce an exact per-pixel Fresnel reflection to make each surface bump visible.

Animated Surface Bumping

The water surface animation is done with simulated waves and ripples that must smoothly change their shape at run time without visible repetitions at higher viewpoints (no visible tiles). As a basic resource, we only need one bump-normal map that can be tiled seamlessly without visible artifacts. This map must be used at least once again in a second layer to overlap the first one. We want to mix multiple bump-map layers to use interference effects between them. Each layer has to be scaled with a different factor and has its own time-stamp-controlled translation (see Figure 8).

Figure 8: Two layers with different scalings and time-stamp-based translation

For example, dealing with the first layer is done in the vertex shader VS-2, like this:

mov r1, c13.x       // c13.x = current time-stamp
mul r2, r1, c14     // c14 = translation for the first layer coords
frc r2.xy, r2       // only use the fractional component
mul r3, v8, c15     // v8 = original bump map coords, c15 = scaling factor for
                    // the first layer coords
add oT0.xy, r2, r3  // calc the final tex coords for first bump layer

The other layers must be scaled and translated in the same way (but with different parameters). The content mixing of the overlapping layers (with different


weighting factors for each one) is done afterward in the pixel shader PS-2. For example, two bump-normal maps in four layers can be blended this way:

texld r0, t0, s0        // load first normal layer – first bump map
texld r4, t4, s0        // load second normal layer – first bump map
texld r5, t5, s4        // load third normal layer – second bump map
texld r6, t6, s4        // load fourth normal layer – second bump map
mul r6.rgb, r6, c3      // c3 = scaling factor for fourth normal
mad r5.rgb, r5, c2, r6  // c2 = scaling factor for third normal
mad r4.rgb, r4, c1, r5  // c1 = scaling factor for second normal
mad r5.rgb, r0, c0, r4  // c0 = scaling factor for first normal
add r5.rgb, r5, c4      // c4 = (-0.5f*(c0+..+c3)) for color-to-vector trafo

Each ripple and its strength can be detected by bumping the reflection, underwater scene, and detail maps. Also, the changing reflection and refraction shares in the final blending help to make out the ripples and waves (see the following section). Additionally, the contrarotated and differently scaled bumping of the reflection and underwater scene maps increases the visual quality. The refraction at the air-water boundary reduces the bump effect for the underwater scene map; therefore, the scaling factor for the refraction bumps should have a lesser absolute value (see Figure 9).

Figure 9: Reduced bumping range for the underwater scene map

Bumping the reflection and underwater scene maps is done in the pixel shader PS-2 this way:

// r5 = mixed bump vector, c12 and c13 = scaling factors for reflection and refraction bumps
mad r7, r5, c12, t1  // add scaled bump to reflection tex coords
mad r8, r5, c13, t2  // add scaled bump to refraction tex coords
texldp r1, r7, s1    // load filtered reflection texel
texldp r2, r8, s2    // load filtered refraction texel

Per-Pixel Fresnel Reflection

The Fresnel term gives a description of how much light is reflected at the boundary of two materials. The rest of the light finds its refracted way into the second, semitransparent material. We get the strongest reflection (total reflection) as long as the angle of incidence of the light ray (just as the ray of view) is greater than a “critical” angle (Snell’s Law). When the light ray is orthogonal to the surface, there is only a dim reflection (see Figure 10).

Figure 10: Different Fresnel reflections

A good approximation of the correct Fresnel term is this formula:

(1)  R(α) = R(0) + (1 – R(0)) * (1 – cos(α))^5,   with R(0) = (n1 – n2)^2 / (n1 + n2)^2

(n1 and n2 are the indices of refraction for the involved materials.)

You may also use a much simpler approximation: R(α) = 1 – cos(α). But this formula doesn’t take the indices of refraction into account and has a stronger divergence from the original graph. This divergence produces an unnaturally strong reflection (see Figure 11). That is why we prefer the better approximation (1). Although it’s a more complex formula with higher run-time costs, we use it for our calculations.

Figure 11: Original Fresnel for air-to-water and two approximations

The indices of refraction for air and water are these:

n1 = 1.000293 (air)    n2 = 1.333333 (water at 20°C / 68°F)

So we get the following constants for the Fresnel approximation at the air-to-water boundary:

R(0) = 0.02037f    1 – R(0) = 0.97963f
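As a quick check of these constants (our own arithmetic, using only the indices above): R(0) = (1.000293 – 1.333333)^2 / (1.000293 + 1.333333)^2 = (–0.333040)^2 / (2.333626)^2 ≈ 0.110916 / 5.445810 ≈ 0.02037, and therefore 1 – R(0) ≈ 0.97963.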


The Fresnel reflection based on (1) is done in the pixel shader PS-2, like this:

// -r6 contains the normalized cam-to-surface vector and r5 is the surface normal
dp3_sat r7.a, -r6, r5         // calculate cos(alpha)
add r8.a, c10.a, -r7.a        // c10.a = 1.0f
mul r6.a, r8.a, r8.a          // squared
mul r6.a, r6.a, r6.a          // quadric
mul r6.a, r6.a, r8.a          // quintic
mad r6.a, c10.g, r6.a, c10.r  // c10.g = 1-R(0) and c10.r = R(0)

The cam-to-surface vector will be precalculated by the vertex shader VS-2 to send it (packed as a color output) to the pixel shader PS-2:

add r10, r0, -c12             // r0 = position of current water vertex, c12 = cam-
                              // point in object-space
nrm r8, r10                   // normalize the cam-to-surface vector (each component
                              // must fit into [-1,1])
mad oD0.xyz, r8, c22.z, c22.z // c22.z = 0.5f

In the pixel shader, our vector must be unpacked again before running the Fresnel calculation. Also, a renormalization of the vector is necessary because a color interpolation may have taken place (Gouraud shading):

// v0 contains the pre-calculated cam-to-surface vector as color
add r7.rgb, v0, c5  // c5 = ( -0.5f, -0.5f, -0.5f, 0.0f )
nrm r6.rgb, r7      // normalize the cam-to-surface vector

The Fresnel code in the pixel shader receives the normal of the rippled water surface by normalizing the previously calculated bump vector and exchanging the y and z components afterward (see the section “Animated Surface Bumping”):

nrm r6.rgb, r5      // normalize bump vector
mov r5.r, r6.r      // keep the x component
dp3 r5.g, r6, c14   // c14 = ( 0.0f, 0.0f, 1.0f, 0.0f )
dp3 r5.b, r6, c15   // c15 = ( 0.0f, 1.0f, 0.0f, 0.0f )

The “lying” bump vector is set upright by this coordinate exchange and takes its correct place as a normal in the tangent space of the bumped water surface (see Figure 12).

Figure 12: Normal vectors of the bumped water surface

Of course, the bump-normal maps must be prepared for this operation. We simply use a “bumped” height map and convert it into a bump-normal map using a


method introduced by the Direct3D extensions (D3DX) utility library: the D3DXComputeNormalMap function. After calculating R(α), we are going to mix the underwater scene and the reflection by simply using a linear interpolation:

// r5.a contains the R(alpha) value
lrp_sat r7.rgb, r5.a, r1, r2  // r1 and r2 are the reflection and underwater
                              // scene texels

Now we can see all reflections on the water with different strengths, depending on the ripples and waves. The detailed information in [Wloka 2002] should be useful for those of you who need indices of refraction for other materials or want to gain a better knowledge of this topic (approximating the Fresnel reflection).

The Complete <strong>Shader</strong> Programs<br />

In this section, the final vertex and pixel shader programs (VS-2 and PS-2) are<br />

presented using shader instructions from version 2.0. They must be activated<br />

when rendering the water plane. Based on the calculated alpha value, the color<br />

result must be blended to the frame buffer’s current content. So, the source and<br />

destination blending factors are:<br />

D3DRS_SRCBLEND = D3DBLEND_SRCALPHA;<br />

D3DRS_DESTBLEND = D3DBLEND_INVSRCALPHA;<br />

That will make the effect fade out at the water’s edge (see Figure 13).<br />
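On the application side this corresponds to something like the following Direct3D 9 setup (a sketch; the function and device pointer names are illustrative):

#include <d3d9.h>

// Sketch: enable alpha blending for the water pass so the alpha computed by
// the pixel shader fades the effect out at the water's edge.
void SetWaterBlendStates(IDirect3DDevice9* pDevice)
{
    pDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
    pDevice->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_SRCALPHA);
    pDevice->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_INVSRCALPHA);
}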

Figure 13: Fading dependent on water depth<br />

The Final Vertex Shader<br />


// VERTEX SHADER (for DX9 hardware and better) VS-2<br />

// FUNCTION: Water effect<br />

//<br />

// INPUT:<br />

// v0 = position (3 floats)<br />

// v1 = normal (3 floats)<br />

// v8 = tex coord stage 0 (2 floats – bump map)<br />

// v9 = tex coord stage 1 (3 floats - reflection)<br />

// v10 = tex coord stage 2 (3 floats - underwater scene)<br />


// v11 = tex coord stage 3 (2 floats - surface details)<br />

// c0 - c3 = world/view/proj matrix<br />

// c4 - c7 = reflection texture trafo matrix<br />

// c8 - c11 = underwater scene texture trafo matrix<br />

// c12 = cam-point in object-space & water height<br />

// c13 = time<br />

// c14 = first bump-normal map coords translation<br />

// c15 = first bump-normal map coords scalar<br />

// c16 = second bump-normal map coords translation<br />

// c17 = second bump-normal map coords scalar<br />

// c18 = third bump-normal map coords translation<br />

// c19 = third bump-normal map coords scalar<br />

// c20 = fourth bump-normal map coords translation<br />

// c21 = fourth bump-normal map coords scalar<br />

// c22 = 0.5f in z component<br />

// version instruction
vs_2_0

// define the constants
def c22, 1.0f, 1.0f, 0.5f, 1.0f

// declare registers
dcl_position v0
dcl_normal v1
dcl_texcoord0 v8
dcl_texcoord1 v9
dcl_texcoord2 v10
dcl_texcoord3 v11

// transform position into projection space<br />

mov r0, v0 // We are able to change the water level at run time.<br />

add r0.y, r0.y, c12.w // So, we have to add the current height difference<br />

// to original water y coord.<br />

dp4 oPos.x, r0, c0 // This lifted plane has to be transformed by the<br />

// current world/view/proj matrix.<br />

dp4 oPos.y, r0, c1 // (ditto)<br />

dp4 oPos.z, r0, c2 // (ditto)<br />

dp4 oPos.w, r0, c3 // (ditto)<br />

// calc projective tex coords<br />

dp4 r9.x, r0, c4 // Based on lifted water position we calculate tex coords<br />

dp4 r9.y, r0, c5 // for the reflection map<br />

dp4 r9.zw, r0, c6 // (ditto)<br />

mov oT1, r9 // and hand them over to the pixel shader.<br />

dp4 r9.x, r0, c8 // Based on lifted water position we calculate tex coords<br />

dp4 r9.y, r0, c9 // for the underwater scene map<br />

dp4 r9.zw, r0, c10 // (ditto)<br />

mov oT2, r9 // and hand them over to the pixel shader.<br />

mov oT3.xy, v11 // Tex coords for detail map are passed to pixel shader.


// calc the distorted bump-normal map coords<br />

mov r1, c13.x // Based on the current time stamp we calculate some scaled<br />

// and translated<br />

mul r2, r1, c14 // coordinates for the bump-normal map layers.<br />

frc r2.xy, r2 // (ditto)<br />

mul r3, v8, c15 // (ditto)<br />

add oT0.xy, r2, r3 // Output: Tex coords for the first bump-normal map layer.<br />

mul r2, r1, c16 // (ditto)<br />

frc r2.xy, r2 // (ditto)<br />

mul r3, v8, c17 // (ditto)<br />

add oT4.xy, r2, r3 // Output: Tex coords for the second bump-normal map layer.<br />

mul r2, r1, c18 // (ditto)<br />

frc r2.xy, r2 // (ditto)<br />

mul r3, v8, c19 // (ditto)<br />

add oT5.xy, r2, r3 // Output: Tex coords for the third bump-normal map layer.<br />

mul r2, r1, c20 // (ditto)<br />

frc r2.xy, r2 // (ditto)<br />

mul r3, v8, c21 // (ditto)<br />

add oT6.xy, r2, r3 // Output: Tex coords for the fourth bump-normal map layer.<br />

// compute the cam-to-water vector<br />

add r10, r0, -c12 // Based on lifted water plane we calculate normalized current<br />

nrm r8, r10 // cam-to-water vector.<br />

// Prepare for per-pixel normalization<br />

mad oD0.xyz, r8, c22.z, c22.z // This vector (packed in a color) has to be passed<br />

// to the pixel shader.<br />

The Final Pixel Shader<br />


// PIXEL SHADER (for DX9 hardware and better) PS-2<br />

// FUNCTION: Water effect<br />

//<br />

// INPUT:<br />

// v0 = cam-to-water vector in cam-space<br />

// t0 = tex coords for first bump-normal map layer<br />

// t1 = tex coords for reflection texture<br />

// t2 = tex coords for underwater scene texture<br />

// t3 = tex coords for surface detail texture<br />

// t4 = tex coords for second bump-normal map layer<br />

// t5 = tex coords for third bump-normal map layer<br />

// t6 = tex coords for fourth bump-normal map layer<br />

// s0 = first bump-normal map<br />

// s1 = reflection texture<br />

// s2 = underwater scene texture<br />

// s3 = surface detail texture<br />

// s4 = second bump-normal map<br />

// c0 = scale first bump-normal map layer (z component must be 2.f)<br />

// c1 = scale second bump-normal map layer (z component must be 2.f)<br />

// c2 = scale third bump-normal map layer (z component must be 2.f)<br />

// c3 = scale fourth bump-normal map layer (z component must be 2.f)<br />

// c4 = weighted shift for color-to-vector trafo (-0.5f*(c0+..+c3))<br />


// c5 = shift for color-to-vector trafo<br />

// c10 = r(0) (air&water), 1-r(0) for Fresnel, detail bump scaling, const 1<br />

// c11 = shift bumped reflection map<br />

// c12 = scale bumps in reflection map<br />

// c13 = scale bumps in refraction map<br />

// c14 + c15 = change y and z components of bump normal<br />

// version instruction
ps_2_0

// define the constants
def c5, -0.5f, -0.5f, -0.5f, 0.0f
def c10, 0.0204f, 0.9796f, 0.3f, 1.0f
def c14, 0.0f, 0.0f, 1.0f, 0.0f
def c15, 0.0f, 1.0f, 0.0f, 0.0f

// declare the used resources
dcl v0
dcl t0
dcl t1
dcl t2
dcl t3
dcl t4
dcl t5
dcl t6
dcl_2d s0
dcl_2d s1
dcl_2d s2
dcl_2d s3
dcl_2d s4

// load the bump-normal map layers<br />

texld r0, t0, s0 // Load content of first bump-normal layer (using first b-n map)<br />

texld r4, t4, s0 // Load content of second bump-normal layer (using first b-n map)<br />

texld r5, t5, s4 // Load content of third bump-normal layer (using second b-n map)<br />

texld r6, t6, s4 // Load content of fourth bump-normal layer (using second b-n map)<br />

// scale and add the content of the different bump-normal layers<br />

mul r6.rgb, r6, c3 // All four sampled bump-normal colors have to be mixed<br />

mad r5.rgb, r5, c2, r6 // (ditto)<br />

mad r4.rgb, r4, c1, r5 // (ditto)<br />

mad r5.rgb, r0, c0, r4 // (ditto)<br />

add r5.rgb, r5, c4 // and unpacked (color-to-vector-trafo) to be usable as<br />

// current bump vector.<br />

// shift the bumped reflection map<br />

add r7, r5, c11 // Shift the bump vector to prevent reflection artifacts<br />

// at the water’s edge.<br />

// scale bumps in reflection and refraction map<br />

mad r7, r7, c12, t1 // Use a scaled bump vector to modify the coords of<br />

// the reflection map.


mad r8, r5, c13, t2 // Use a scaled bump vector to modify the coords of the<br />

// underwater scene map.<br />

// load the bumped refraction and underwater scene<br />

texldp r1, r7, s1 // Load reflection texel (using modified tex coords).<br />

texldp r2, r8, s2 // Load underwater scene texel (using modified tex coords).<br />

// exchange y and z components of bump-normal (from now on this vector can be used as<br />

// normal vector)<br />

nrm r6.rgb, r5 // Normalize the calculated bump-normal vector.<br />

mov r5.r, r6.r // Keep the x component of this vector.<br />

dp3 r5.g, r6, c14 // Exchange the y and z component of this vector.<br />

dp3 r5.b, r6, c15 // (ditto)<br />

// load the surface detail (also bumped)<br />

mul r0.rgb, r5, c10.b // Bump the tex coords for surface details<br />

add r0.rgb, r0, t3 // (ditto)<br />

texld r3, r0, s3 // and load the surface detail texel.<br />

// renormalize cam-to-water vector in v0<br />

add r7.rgb, v0, c5 // Unpack cam-to-water vector (passed by vertex shader).<br />

nrm r6.rgb, r7 // Renormalize this vector.<br />

// dot cam-to-water vector with the mixed normal vector<br />

dp3_sat r7.a, -r6, r5 // Calculate the cosine of the angle between both vectors.<br />

// calculate the Fresnel term (air-to-water)<br />

add r8.a, c10.a, -r7.a // Use this cosine to calculate the Fresnel approximation.<br />

mul r6.a, r8.a, r8.a // (ditto)<br />

mul r6.a, r6.a, r6.a // (ditto)<br />

mul r6.a, r6.a, r8.a // (ditto)<br />

mad r6.a, c10.g, r6.a, c10.r // (ditto)<br />

// blend underwater scene and reflection map<br />

// use the alpha of the underwater scene map to reduce Fresnel reflection at water’s edge<br />

mul r6.a, r6.a, r2.a // Modulate strength of Fresnel reflection with<br />

// underwater alpha.<br />

lrp_sat r7.rgb, r6.a, r1, r2 // Blend both maps (underwater scene and reflection).<br />

// blend in the surface details<br />

mov r8.a, r2.a               // (Yep, this line can be cancelled if we use

lrp_sat r8.rgb, r3.a, r3, r7 // r2.rgb as target for the lrp_sat result in this line

mov oC0, r8                  // and move the complete r2 register to oC0)

Further Improvements<br />


Of course, the top quality water simulation explained in this article has several<br />

areas that could stand further improvements, including these four:<br />

• The absorption of light calculated in the second render pass of the underwater<br />

scene (see the section “Modifications Dependent on Water Depth”)



should not only be based on the current distance between water surface and<br />

underwater object; the underwater part of the cam-to-object view line should<br />

also be taken into account. In fact, the light passes the area between water<br />

surface and object two times before it arrives at the camera (see Figure 14).<br />

This way, we receive a high absorption of light also for lower water levels if<br />

the camera looks with a flat angle of incidence.<br />

• A correct refraction of the line of sight is not simulated at the moment (see

Figure 15). Following Snell's law, the angle of refraction β would have to be calculated this way:

β = arcsin( sin(α) * c1/c2 )

...with α = angle of incidence, c1 = 1.000293 (refraction index of air), and c2

= 1.333333 (refraction index of water); a small host-side sketch of this computation follows after this list.

Figure 14: The light passes the area between water surface and object two times.<br />

Figure 15: Refraction of the line of sight<br />

• Additional specular reflections will create highlights on the water surface and<br />

increase the visual quality. The water articles in the first <strong><strong>Shader</strong>X</strong> book<br />

[Engel 2002] present an easy way to add them.<br />

• This article is intended as an introduction to the presented advanced water<br />

effects. To increase the practical usability, we should implement the complete<br />

effect using a high-level shader language (like DX9-HLSL or Cg). This will<br />

raise the level of abstraction and decouple the programs from specific<br />

hardware.
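As announced in the refraction bullet above, here is a small host-side sketch of the Snell's law computation (function and parameter names are illustrative):

#include <math.h>

// Sketch: Snell's law for the air-to-water boundary.
// alpha is the angle of incidence in radians; n1 and n2 are the indices of
// refraction of air (1.000293) and water (1.333333).
float RefractedAngle(float alpha, float n1, float n2)
{
    // n1 * sin(alpha) = n2 * sin(beta)  =>  beta = asin(sin(alpha) * n1 / n2)
    return asinf(sinf(alpha) * n1 / n2);
}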


Conclusion

There are many starting points to develop further improvements for the visual

quality.

Using multiple render passes (reflection map 1, underwater scene 1+1, water<br />

surface 1), we created a close-to-reality water effect, including the desired features<br />

listed at the beginning. Especially in view of the per-pixel Fresnel reflection<br />

for each ripple, the water surface has a good three-dimensional look. Additionally,<br />

the contrarotated and different scaled bumping of the reflection and underwater<br />

scene maps increases the simulation’s quality. Without decreasing quality, we are<br />

able to change the world geometry or the water level due to the dynamic darkening<br />

of the underwater scene map and also the supported alpha-fading at the<br />

water’s edge. On the companion CD, you can find a demo application and complete<br />

source code presenting an executable implementation of the discussed<br />

water simulation.

References

[Engel 2002] Engel, Wolfgang F., ed., Direct3D ShaderX: Vertex and Pixel Shader

Tips and Tricks, Wordware Publishing, 2002.

[Wloka 2002] Wloka, M., “Fresnel Reflection,” nVidia Technical Report, 2002,<br />

http://developer.nvidia.com/docs/IO/3035/ATT/FresnelReflection.pdf.<br />



Efficient Evaluation of Irradiance<br />

Environment Maps<br />

Peter-Pike J. Sloan<br />

Introduction<br />


Irradiance environment maps [3] are an effective technique to render diffuse<br />

objects in arbitrary lighting environments. They are more efficient to compute<br />

and use fewer resources than diffuse cube maps [1]. In [3], a method is presented<br />

that uses spherical harmonics to efficiently represent an irradiance environment<br />

map. Based on the surface normal, two formulas for evaluation are presented.<br />

One is tailored for matrix/vector operations using the standard matrix form of a<br />

quadratic equation, and the other more direct technique is specified for “systems<br />

not optimized for matrix and vector operations.” This article presents a more efficient<br />

mapping of the second formula; with the current vertex shader instruction<br />

set, it uses seven constant registers and 11 instructions versus 12 constant registers<br />

and 15 instructions using the quadratic form.<br />

One significant benefit to dealing with lights using irradiance environment<br />

maps or diffuse cube maps is that the cost of evaluating the representation is<br />

independent of the number of lights. This is because the lighting environment is<br />

represented as a whole, and the integration is precomputed for all possible normals<br />

instead of evaluated on the fly for each light. Using spherical harmonics to<br />

represent irradiance environment maps is much more efficient than diffuse cube<br />

maps and can make other techniques built around them (like [1]) more efficient.<br />

Greger et al. [2] precomputes a regular volume of irradiance cube maps in a<br />

scene. Diffuse objects can be efficiently moved around inside the scene by interpolating<br />

between the different cube maps. Using spherical harmonics would just<br />

amount to interpolating the spherical harmonic coefficients instead. One limitation<br />

of these techniques is that they are only truly accurate for convex objects. If<br />

shadows or other global effects are to be modeled, other techniques, like<br />

precomputed radiance transfer (PRT) [4], need to be used.


Background<br />

This article focuses on how to light a diffuse object with a distant spherical lighting<br />

environment, ignoring shadows and other global effects. This is done by computing<br />

the radiance (light) leaving a point on the object, which requires an<br />

evaluation of the reflection integral:

$$R_p(v) = \frac{\rho_d}{\pi} \int_s L(s)\, H_N(s)\, ds$$

...where ρd is the diffuse reflectance (commonly referred to as the albedo) of the surface and is in

the range [0,1]; the division by π guarantees energy conservation¹. This integral

represents the irradiance², where L(s) is a spherical function representing the

lighting environment and HN(s) is the cosine term (dot product of the surface normal<br />

at the given point and a unit direction clamped to zero if it is negative). The<br />

domain of integration is over the sphere of all incoming directions denoted by the<br />

variable s. It is possible to precompute this integral for every possible normal,<br />

since it is just a convolution of HN(s) against L(s) for all possible normal directions.<br />

This results in a spherical function that can be evaluated to determine<br />

reflected radiance.<br />

In image processing, when convolving large kernels, it is more efficient to<br />

use the Fourier transform to project the image and the kernel into frequency space,

evaluate the convolution (which amounts to multiplication in frequency space),

and then inverse transform the image back into the spatial domain. The<br />

cosine kernel is a very large one, covering half the sphere. Ramamoorthi and<br />

Hanrahan [3] observed that the projection of HN(s) into the “frequency-space” of<br />

the sphere (using spherical harmonics, which is described in the next section)<br />

results in almost all of the energy existing in the first nine coefficients. This<br />

means that an accurate representation of the convolved spherical function representing<br />

exit radiance for any normal direction can be expressed using just nine<br />

numbers per color channel instead of a full diffuse cube map.<br />

Spherical Harmonics for Rendering<br />

Spherical harmonics are the natural basis functions to use on the sphere. This<br />

article only briefly describes them; for a more complete description, look at the<br />

references in [3] and [4]. The mathematical form of the complex spherical harmonics<br />

is as follows:<br />

$$Y_l^m(\theta,\varphi) = K_l^m\, e^{im\varphi}\, P_l^{|m|}(\cos\theta), \qquad l \in \mathbb{N},\ -l \le m \le l$$

...where the parameterization of the unit sphere is:

$$s = (x, y, z) = (\sin\theta\cos\varphi,\ \sin\theta\sin\varphi,\ \cos\theta) \qquad (1)$$

1 The net energy leaving a point is never greater than the net energy impinging on it.<br />

2 Irradiance is the integral of the lighting environment impinging on a point. For a diffuse object, the<br />

irradiance should be multiplied by the albedo of the surface divided by π to compute the light<br />

leaving the surface — this is commonly referred to as radiance.<br />



The Pl^m are the associated Legendre polynomials, and the Kl^m are the normalization
constants:

$$K_l^m = \sqrt{\frac{(2l+1)\,(l-|m|)!}{4\pi\,(l+|m|)!}}$$

When representing lighting, the complex form is not interesting, so the real form<br />

of the spherical harmonics can be used:<br />

$$y_l^m = \begin{cases} \sqrt{2}\,\mathrm{Re}(Y_l^m) & m > 0 \\ \sqrt{2}\,\mathrm{Im}(Y_l^m) & m < 0 \\ Y_l^0 & m = 0 \end{cases} \qquad (2)$$

The spherical harmonics can also be expressed as polynomials in 3D, where evaluation<br />

is restricted to the surface of the unit sphere. These polynomials can be<br />

computed by factoring equation (2) using the trigonometric forms for x, y, and z<br />

in equation (1) and trigonometric identities. The index l represents the band<br />

index and corresponds to the degree of the polynomial (analogous to frequency).<br />

A complete lth degree basis contains (l+1)² coefficients and can represent all<br />

polynomials through degree l. For simplicity, this article sometimes uses a form<br />

that represents each basis function with a single index yi, where i=l(l+1)+m+1.<br />

The spherical harmonics are an orthonormal basis; this means that<br />

$\int y_i(s)\, y_j(s)\, ds = 1$ if $(i = j)$ and $0$ if $(i \ne j)$. One byproduct of the above definition<br />

is that the basis functions can be defined with any sign convention. Since sign<br />

conventions vary, care has to be taken when mixing between definitions of the<br />

basis functions found in different references. This is particularly true when using<br />

projection coefficients or code that is found on the web.<br />

Since the spherical harmonics form an orthonormal basis, projecting a function<br />

into them is straightforward. Here is the formula for evaluating the projected<br />

function:<br />

$$\tilde{f}(s) = \sum_{l,m} l_l^m\, y_l^m(s)$$

...where the ll^m are the projection coefficients of the function f into the spherical harmonic
basis (i.e., a linear combination of the spherical harmonics with these coefficients
results in an optimal approximation of the function). They can be
computed by integrating the function against the basis functions over the sphere:

$$l_l^m = \int f(s)\, y_l^m(s)\, ds$$

If the function is band limited, a finite number of bands are required to exactly<br />

reconstruct the function. If the sequence is truncated before the band limit, the<br />

approximation is optimal in a least squares sense (this is the same with the Fourier<br />

transform). This projection can be done using code from the web page of the<br />

author of [3] or by using the spherical harmonic projection functions in the latest<br />

version of the DirectX SDK.
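For illustration, the projection can also be approximated numerically. The following is a minimal Monte Carlo sketch, not the SDK's implementation; evalF and evalY are placeholder callbacks for the lighting environment and the i-th basis function:

#include <stdlib.h>
#include <math.h>

// Sketch: approximate l_i = Integral f(s) y_i(s) ds with uniform random
// directions on the sphere.
void ProjectOntoSH(float (*evalF)(const float dir[3]),
                   float (*evalY)(int i, const float dir[3]),
                   int numCoeffs, int numSamples, float* coeffsOut)
{
    const float PI = 3.14159265f;
    for (int i = 0; i < numCoeffs; ++i)
        coeffsOut[i] = 0.0f;

    for (int s = 0; s < numSamples; ++s)
    {
        // uniform direction on the unit sphere
        float z   = 2.0f * rand() / (float)RAND_MAX - 1.0f;
        float phi = 2.0f * PI * rand() / (float)RAND_MAX;
        float r   = sqrtf(1.0f - z * z);
        float dir[3] = { r * cosf(phi), r * sinf(phi), z };

        float f = evalF(dir);
        for (int i = 0; i < numCoeffs; ++i)
            coeffsOut[i] += f * evalY(i, dir);
    }

    // each sample carries the weight 4*pi / numSamples (uniform density 1/(4*pi))
    float w = 4.0f * PI / (float)numSamples;
    for (int i = 0; i < numCoeffs; ++i)
        coeffsOut[i] *= w;
}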


The convolution formula for spherical harmonics is very simple. Given a circularly
symmetric function h³ oriented in z expressed in spherical harmonics, the
convolution⁴ of another function f (projection coefficients ll^m) with h (projection
coefficients hl^0) is just:

$$c_l^m = \sqrt{\frac{4\pi}{2l+1}}\, h_l^0\, l_l^m$$

...where the cl^m are the projection coefficients of the convolved function. All circularly
symmetric functions oriented with z only have one non-zero basis function in each
band⁵ — namely the one where m is equal to zero.

Evaluating the spherical harmonics scaled by the convolved coefficients in a<br />

given direction results in the same value that one would get if one computed an<br />

integral of the product of h oriented in the chosen direction with f over the<br />

sphere. If h is a cosine lobe clamped to zero in the lower hemisphere, f is a lighting<br />

environment, and the direction is a surface normal, then this integral is<br />

exactly the irradiance evaluated at the corresponding point. The cosine lobe has<br />

most of its energy in the first nine coefficients, so the higher frequency terms of<br />

the lighting environment have a minimal contribution; this is the fundamental<br />

observation in [3].<br />

Here are the explicit representations of the first nine spherical harmonics in<br />

polynomial form:<br />

$$y_0^0 = \frac{1}{2\sqrt{\pi}}$$

$$(y_1^1;\ y_1^{-1};\ y_1^0) = \frac{\sqrt{3}}{2\sqrt{\pi}}\,(-x;\ -y;\ z)$$

$$(y_2^{-2};\ y_2^1;\ y_2^{-1}) = \frac{\sqrt{15}}{2\sqrt{\pi}}\,(xy;\ -xz;\ -yz)$$

$$y_2^0 = \frac{\sqrt{5}}{4\sqrt{\pi}}\,(3z^2 - 1)$$

$$y_2^2 = \frac{\sqrt{15}}{4\sqrt{\pi}}\,(x^2 - y^2)$$

In the shader in the next section, there are seven constant registers that have to<br />

be computed whenever the lighting environment changes. Given the projection of<br />

the lighting environment into spherical harmonics resulting in coefficients Rl^m,
Gl^m, Bl^m for the red, green, and blue channel of the lighting environment, respectively,
they are defined as follows:

3 A circularly symmetric function on the sphere is one that only has variation in one direction — i.e.,<br />

if you align the direction of variation with the z axis, the function is constant along lines of constant<br />

latitude, and the function varies in θ but not φ using the spherical parameterization defined above.<br />

4 A non-circularly symmetric function convolved with a spherical function would not result in a<br />

spherical function. This is because circularly symmetric functions can be aligned with a point on<br />

the sphere without any extra parameters (because of the single direction of variation).<br />

5 When oriented in z there is no variation in the variable φ, and any basis function where m ≠ 0<br />

integrates to zero for any function that has this symmetry.


Shader

        cAr                  cAg                  cAb                  cBr         cBg         cBb         cC
x   –c1·R1^1             –c1·G1^1             –c1·B1^1             c2·R2^–2    c2·G2^–2    c2·B2^–2    c4·R2^2
y   –c1·R1^–1            –c1·G1^–1            –c1·B1^–1            –c2·R2^–1   –c2·G2^–1   –c2·B2^–1   c4·G2^2
z   c1·R1^0              c1·G1^0              c1·B1^0              3·c3·R2^0   3·c3·G2^0   3·c3·B2^0   c4·B2^2
w   c0·R0^0 – c3·R2^0    c0·G0^0 – c3·G2^0    c0·B0^0 – c3·B2^0    –c2·R2^1    –c2·G2^1    –c2·B2^1    x

...where:

$$c_0 = n_0;\quad c_1 = h_1 n_1;\quad c_2 = h_2 n_2;\quad c_3 = h_2 n_3;\quad c_4 = h_2 n_4;\quad h_1 = \tfrac{2}{3};\quad h_2 = \tfrac{1}{4}$$

$$n_0 = \frac{1}{2\sqrt{\pi}};\quad n_1 = \frac{\sqrt{3}}{2\sqrt{\pi}};\quad n_2 = \frac{\sqrt{15}}{2\sqrt{\pi}};\quad n_3 = \frac{\sqrt{5}}{4\sqrt{\pi}};\quad n_4 = \frac{\sqrt{15}}{4\sqrt{\pi}}$$

The hi are the convolution coefficients divided by π (irradiance is turned into exit
radiance), and the ni are the normalization coefficients of the basis functions. The
x in cC can be any value, since it is not used by the shader.

In [3] transforming the surface normals by the inverse of the lights’ rotation<br />

relative to the model is proposed. While that is necessary for deformable objects,<br />

for rigid objects it is more efficient to rotate the lighting coefficients directly<br />

before loading them into the shaders — this saves three instructions. If materials<br />

are stored per vertex, the following shader would need one extra instruction that<br />

multiplies the lighting times the diffuse reflectance of the surface.<br />
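The sketch below shows one way the seven constants could be filled on the CPU, following the table above. The array layout and names are assumptions, not the article's code; the nine coefficients per channel are assumed to be stored with the 0-based index i = l(l+1)+m (so index 0 is L0^0, 1–3 are the l=1 band, 4–8 the l=2 band):

#include <math.h>

// Sketch: pack the table above into the seven float4 shader constants.
// out[0..2] = cAr/cAg/cAb, out[3..5] = cBr/cBg/cBb, out[6] = cC.
void PackIrradianceConstants(const float shR[9], const float shG[9],
                             const float shB[9], float out[7][4])
{
    const float pi = 3.14159265f;
    const float n0 = 1.0f / (2.0f * sqrtf(pi));
    const float n1 = sqrtf(3.0f)  / (2.0f * sqrtf(pi));
    const float n2 = sqrtf(15.0f) / (2.0f * sqrtf(pi));
    const float n3 = sqrtf(5.0f)  / (4.0f * sqrtf(pi));
    const float n4 = sqrtf(15.0f) / (4.0f * sqrtf(pi));
    const float c0 = n0, c1 = (2.0f / 3.0f) * n1, c2 = 0.25f * n2,
                c3 = 0.25f * n3, c4 = 0.25f * n4;

    const float* sh[3] = { shR, shG, shB };
    for (int ch = 0; ch < 3; ++ch)
    {
        const float* L = sh[ch];
        out[ch][0] = -c1 * L[3];                 // multiplies x
        out[ch][1] = -c1 * L[1];                 // multiplies y
        out[ch][2] =  c1 * L[2];                 // multiplies z
        out[ch][3] =  c0 * L[0] - c3 * L[6];     // constant part

        out[3 + ch][0] =  c2 * L[4];             // multiplies xy
        out[3 + ch][1] = -c2 * L[5];             // multiplies yz
        out[3 + ch][2] =  3.0f * c3 * L[6];      // multiplies z^2
        out[3 + ch][3] = -c2 * L[7];             // multiplies xz
    }
    out[6][0] = c4 * shR[8];                     // cC multiplies x^2 - y^2
    out[6][1] = c4 * shG[8];
    out[6][2] = c4 * shB[8];
    out[6][3] = 0.0f;                            // unused by the shader
}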

vs_1_1
dcl_position v0
dcl_normal v1
m4x4 oPos, v0, c0
mov r0, v1                 ; read port limits on inputs
; compute 1st 4 basis functions – linear + constant
; v1 is the normal with a homogeneous 1
; c* are precomputed constants
dp4 r1.r, r0, cAr          ; r channel from 1st 4 basis functions
dp4 r1.g, r0, cAg          ; g channel from 1st 4 basis functions
dp4 r1.b, r0, cAb          ; b channel from 1st 4 basis functions
; compute polynomials for next 4 basis functions
mul r2, r0.xyzz, r0.yzzx   ; r2 is xy/yz/z^2/xz

; add contributions – store in r2<br />

dp4 r3.r, r2, cBr<br />

dp4 r3.g, r2, cBg<br />



dp4 r3.b, r2, cBb<br />

; compute the final basis function x^2-y^2<br />

mul r0.xy, r0.xy, r0.xy ; x^2 y^2 – other slots are free<br />

add r0.x, r0.x, -r0.y ; x^2-y^2,<br />

mad r2.rgb, cC.rgb, r0.x, r3.rgb

add r0, r1.rgb, r2.rgb     ; r0 is now rgb lighting

Acknowledgments

Thanks to Dan Baker for carefully proofreading this article. Wolfgang Engel, Tom
Forsyth, and Willem de Boer provided useful feedback on early versions as well.

References

[1] Brennan, C., “Diffuse Cube Mapping,” Direct3D <strong><strong>Shader</strong>X</strong>: Vertex and Pixel<br />

<strong>Shader</strong> <strong>Tips</strong> and <strong>Tricks</strong>, Wolfgang Engel, ed., Wordware Publishing, 2002, pp.<br />

287-289.<br />

[2] Greger, G., P. Shirley, P. Hubbard, and D. Greenberg, “The Irradiance Volume,”<br />

IEEE Computer Graphics and Applications, 18(2):1998, pp. 32-43.<br />

[3] Ramamoorthi, R. and P. Hanrahan, “An Efficient Representation for Irradiance<br />

Environment Maps,” Computer Graphics, SIGGRAPH 2001, pp. 497-500.<br />

[4] Sloan, P., J. Kautz, and J. Snyder, “Precomputed Radiance Transfer for Real-<br />

Time Rendering in Dynamic, Low-Frequency Lighting Environments,” Computer<br />

Graphics, SIGGRAPH 2002, pp. 527-536.


Practical Precomputed Radiance
Transfer

Peter-Pike J. Sloan

Abstract

Precomputed radiance transfer (PRT) is a technique that enables rigid objects to be<br />

illuminated in low-frequency lighting environments with global effects like soft<br />

shadows and interreflections in real time. It achieves these results by running a<br />

lengthy preprocess that computes how light is transferred from a source environment<br />

to exit radiance at a point. This article discusses the technique in general<br />

and focuses on a practical example using the recently introduced compressed [7]<br />

form in a vertex shader.<br />

Introduction<br />


Generating accurate depictions of complex scenes in interesting lighting environments<br />

is one of the primary goals in computer graphics. The general solution to<br />

this problem requires the solution of an integral equation that is difficult to solve,<br />

even in non-interactive settings [1]. In interactive graphics, shortcuts are generally<br />

taken by making simplifying assumptions of several properties of the scene;<br />

the materials are generally assumed to be simple. The lighting environment is<br />

either approximated with a small number of point and directional lights or environment<br />

maps and transport complexity (i.e., how the light bounces around the<br />

scene — interreflections, caustics, and shadows are examples) are only modeled<br />

in a limited way. For example, shadows are computed for dynamic point lights but<br />

not for environment maps.<br />

There is a lot of interesting previous work that has trade-offs different from<br />

those made with PRT. Polynomial texture maps [3] allow interactive rendering of<br />

diffuse objects with textures that capture local interreflections and scattering, but<br />

they are limited to point light sources. There are several papers that deal with<br />

prefiltering environment maps — most notably in [4], diffuse objects can be interactively<br />

rendered in arbitrary environments by projecting the lighting environment<br />

into spherical harmonics, and the prefiltering is done via convolution in the<br />

frequency domain. This technique is flexible and allows for dynamic geometry, but<br />

no transport complexity is modeled; it is technically only accurate for convex


objects. In [5], a method is presented that can interactively render a wider range<br />

of materials but still does not handle other forms of transport complexity.<br />

Extensions to precomputed radiance transfer is currently an active research<br />

area; while this work focuses on practical concerns and shaders for the most efficient<br />

formulation (diffuse objects), it has been successfully applied to more general<br />

reflectance models [2], integrated with ideas from bidirectional texture<br />

functions and texture synthesis [8], and compressed to handle higher frequency<br />

lighting and extended to handle subsurface scattering [7]. The primary limitations<br />

are relatively low-frequency lighting environments and, more importantly, while<br />

the objects are assumed to be rigid, they are not allowed to deform.<br />

General Background on Precomputed Radiance Transfer<br />

For a diffuse object illuminated in distant lighting environment L, the reflected<br />

radiance at a point P on the surface is:<br />

$$R_p(v) = \frac{\rho_d}{\pi} \int_s L(s)\, V_p(s)\, H_N(s)\, ds \qquad (1)$$

...where Vp represents visibility, a binary function that is 1 in a given direction if a<br />

ray in that direction originating at the point can “see” the light source and 0 otherwise.<br />

HN represents the projected area (or cosine term), and the integration is<br />

over the hemisphere about the point’s normal. The diffuse reflectance (or albedo)<br />

of the surface is ρd and is generally an RGB color, where each value is between

zero and one. The division by π maps irradiance (the integral) into exit radiance

(what we see) and guarantees energy conservation (i.e., the amount of energy<br />

reflected is never greater than the amount of energy arriving at a point).<br />

With a point or directional light, the lighting environment is effectively a delta<br />

function, which turns the integral into a simple function evaluation — the cosine<br />

of the angle between the light and the normal if the direction is not in shadow or<br />

just zero if it is. Since the object is diffuse, the reflected radiance is the same in all<br />

directions, and the integral does not depend on the view direction. The key idea<br />

behind precomputed radiance transfer is to approximate the lighting environment<br />

using a set of basis functions over the sphere:<br />

$$L(s) \approx \sum_i l_i\, B_i(s)$$

...where the Bs are a set of basis functions and the ls are the coefficients corresponding<br />

to the optimal (in a least squares sense) projection of the lighting environment<br />

into the basis functions; that is, they minimize:<br />

$$\int \Big( L(s) - \sum_i l_i B_i(s) \Big)^2 ds \qquad (2)$$


If the basis functions are orthogonal, this just amounts to integrating the lighting<br />

environment against the basis functions, while in general it is necessary to integrate<br />

against the duals of the basis functions.<br />

Now substitute the approximation of the lighting environment into (1):


$$R_p(v) = \frac{\rho_d}{\pi} \int_s \Big( \sum_i l_i B_i(s) \Big) V_p(s)\, H_N(s)\, ds \qquad (3)$$

Recall two concepts from basic calculus: the integral of a sum equals the sum of<br />

the integrals, and constants can be pulled outside of integrals1 . This allows us to<br />

reformulate (3), as follows:<br />

$$R_p(v) = \rho_d \sum_i l_i \left( \frac{1}{\pi} \int_s B_i(s)\, V_p(s)\, H_N(s)\, ds \right) \qquad (4)$$

The important thing to note about the above equation is that the integral only<br />

depends on the choice of basis functions, not on the value of the particular lighting<br />

environment or the albedo of the surface. This means that if you precompute the<br />

integral for each basis function at every point on the object, you are left with the<br />

following expression for reflected radiance:<br />

$$R_p(v) = \rho_d \sum_i l_i\, t_{pi}$$

A dot product between the global lighting coefficients and the spatially varying<br />

(through the index p) transfer vector scaled by the albedo is all that is required. If<br />

the lighting and transfer are represented as vectors (L and T, respectively), this<br />

equation becomes:<br />

$$R_p(v) = \rho_d\, (T \cdot L) \qquad (5)$$

Compression<br />

As the number of basis functions grows, these transfer vectors become larger and<br />

the data size can become unwieldy. A compression technique was recently proposed<br />

[7] that can significantly reduce both the compute and storage requirements<br />

of the technique. The vertices (or texels) in a mesh are split into discrete<br />

clusters, and each cluster is approximated with a mean and an optimal linear<br />

basis. Mathematically:<br />

$$T_p \approx M_k + \sum_j w_{pj}\, P_{kj}$$

...where Tp is the transfer vector at a point, Mk is the mean for cluster k, the Pkj represents the local linear basis for the cluster, and the wpj represents the coordinates<br />

of Tp subtracted from the mean in the given basis. The important thing to<br />

note is that k and wpj vary at every sample (vertex or texel), while Mk and Pkj are<br />

constant for a given cluster.<br />

If this approximation for the transfer vector is now inserted into equation (5),<br />

the following equation results:<br />

$$R_p(v) = \rho_d \left( \Big( M_k + \sum_j w_{pj} P_{kj} \Big) \cdot L \right)$$

1 This is because integration is a linear operator, that is I(f+g) = I(f)+I(g) and I(s*f) = s*I(f) where f<br />

and g represent functions, s represents a scalar, and I represents integration.<br />



Again, exploiting a linear operator (the dot product), the terms can be redistributed<br />

into the final form:<br />

$$R_p(v) = \rho_d \left( (M_k \cdot L) + \sum_j w_{pj}\, (P_{kj} \cdot L) \right) \qquad (6)$$

An important thing to note is that the dot products in the above equation only<br />

depend on per-cluster information, so they can be performed once per frame and<br />

stored as constants in the shader. This also makes the evaluation of reflected radiance<br />

independent of the dimensionality of the lighting basis. Given K clusters and<br />

N local basis vectors, K*(N+1)*3 coefficients have to be computed for each frame<br />

(the 3 is for colored lights) and stored as constants.<br />

Choice of Basis Functions<br />

While any basis functions that approximate the sphere can be used, much of the<br />

previous work has focused on using the real spherical harmonics. In this article<br />

we just touch on the basic formulation and some useful properties. More thorough<br />

descriptions can be found in the references in [6]. The mathematical form of the<br />

complex spherical harmonics is as follows:<br />

$$Y_l^m(\theta,\varphi) = K_l^m\, e^{im\varphi}\, P_l^{|m|}(\cos\theta), \qquad l \in \mathbb{N},\ -l \le m \le l$$

...where the parameterization of the sphere is:

$$s = (x, y, z) = (\sin\theta\cos\varphi,\ \sin\theta\sin\varphi,\ \cos\theta) \qquad (7)$$

The Pl^m are the associated Legendre polynomials, and the Kl^m are the normalization
constants:

$$K_l^m = \sqrt{\frac{(2l+1)\,(l-|m|)!}{4\pi\,(l+|m|)!}}$$

The real form of the spherical harmonics is:<br />

$$y_l^m = \begin{cases} \sqrt{2}\,\mathrm{Re}(Y_l^m) & m > 0 \\ \sqrt{2}\,\mathrm{Im}(Y_l^m) & m < 0 \\ Y_l^0 & m = 0 \end{cases} \qquad (8)$$

The spherical harmonics can also be expressed as polynomials in 3D where evaluation<br />

is restricted to the surface of the unit sphere. These polynomials can be<br />

computed by factoring equation (8) using the trigonometric forms for x, y, and z in<br />

equation (7) and trigonometric identities; [4] has examples through the quadratics.<br />

The index l represents the band index and corresponds to the degree of the<br />

polynomial; a complete lth degree basis contains (l+1)² coefficients and can represent<br />

all polynomials through degree l. For simplicity, we sometimes use a form<br />

that represents each basis function with a single index yi, where i=l(l+1)+m+1.<br />

The spherical harmonics are what is known as an orthonormal basis; this means<br />

that $\int y_i(s)\, y_j(s)\, ds = \delta_{ij}$, which is 1 if $(i = j)$ and 0 if $(i \ne j)$.



One byproduct of the above definition is that the basis functions can be<br />

defined with any sign convention. Care has to be taken when mixing between definitions<br />

of the basis functions found in different references, particularly when<br />

using projection coefficients or code that is found on the web. Generating the<br />

least squares optimal projection coefficients that minimize equation (2) is simple<br />

for any orthogonal basis:<br />

$$l_i = \int y_i(s)\, f(s)\, ds$$

One other important property of spherical harmonics is that they are rotationally<br />

invariant. This is analogous to the translation invariance in the Fourier transform<br />

and can be mathematically expressed as follows: R(proj(f(s)))= proj(f(R(s))),<br />

where R is a rotation matrix and proj represents a projection into spherical harmonics.<br />

This means that the shape of the projection is stable under rotations, so<br />

there will not be any temporal artifacts as the light or object is rotated. Rotations<br />

can be computed in various ways, but all forms ultimately are a linear combination<br />

of the projection coefficients (i.e., the rotated coefficients can be computed by a<br />

matrix multiplication); for low orders, symbolic integration can be used to compute<br />

the entries in these rotation matrices, which happen to be polynomials of the<br />

coefficients of R. See [2] and [6] for a more thorough description and other<br />

references.<br />

Setting up the Shaders<br />

Before PRT can be used, several things have to occur. Any lights in the scene<br />

have to be represented using spherical harmonics and combined into a single<br />

spherical function (just add them together, possibly rotating them independently),<br />

the spherical function representing the lights has to be transformed into object<br />

space (i.e., the basis functions have to be oriented in the same way they were during<br />

the precomputation), and if compression is used, the per-cluster dot products<br />

have to be performed.<br />

The latest DirectX SDK update has several functions for mapping lights into<br />

spherical harmonic coefficients and rotating them using any pure rotation matrix.<br />

In particular, directional/cone/spot lights can be evaluated, and cube maps can be<br />

directly projected. After the lights have been evaluated in world space, they need<br />

to be rotated into object space (using the transpose of the rigid rotation mapping<br />

the object from object space to world space) or, alternatively, always evaluated<br />

directly in object space (i.e., one rotation applied directly to coefficients or the<br />

“lighting” functions evaluated with direction vectors mapped into object space).<br />

If compression is used, the shader constants need to be evaluated (the per-cluster<br />

dot products from equation (6)) before being uploaded.<br />

The uncompressed form requires a transfer vector to be stored at every texel<br />

or vertex, while the compressed form requires a mapping into the start of the corresponding<br />

sample’s clusters’ constants (in the constant table) and coefficients in<br />

the clusters’ linear subspace (generally much fewer than the number of basis<br />

functions used to represent the lighting environment). If more clusters are used


than can be represented in the constant table of the graphics card, a multi-pass<br />

technique described in [7] must be employed instead.<br />

Compressed Shader<br />

The following shader is written in HLSL and parameterized by a single variable<br />

NPCA representing the number of PCA basis vectors stored in each cluster. This<br />

has to be passed in to the compiler or set with a #define at the top of the shader.<br />

The HLSL code assumes that NPCA is a multiple of 4. If NPCA is zero, the technique<br />

is pure vector quantization (this is an inferior compression technique and<br />

should not be used in general). BLENDWEIGHT0 has to index into the start of<br />

the cluster for each vertex. The per-cluster data (from equation (6)) is stored to<br />

minimize the number of constant registers. The mean is first stored as an RGB<br />

color, and then all of the dot products of the PCA vectors with the R lighting coefficients<br />

are stored, followed by G and B. Constraining NPCA to be a multiple of 4<br />

allows this more efficient packing scheme that reduces the number of constant<br />

registers and the number of assembly instructions that are required. If a single<br />

color is stored for each basis vector (effectively wasting all of the alpha channels),<br />

NPCA+1 constant vectors are needed for each cluster, while this scheme<br />

requires (NPCA/4)*3 + 1 registers per cluster — wasting only one slot (the<br />

alpha channel of the dot product with the per-cluster mean).<br />
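On the host side, the per-cluster constants from equation (6) could be packed into this layout along the following lines. This is only a sketch; the function name, array shapes, and the vector container are assumptions, not the demo's code:

#include <vector>

// Sketch: dot product of two transfer/lighting coefficient vectors.
static float Dot(const float* a, const float* b, int n)
{
    float d = 0.0f;
    for (int i = 0; i < n; ++i) d += a[i] * b[i];
    return d;
}

// Sketch: build the constant block for one cluster.
// mean  : M_k, numLightCoeffs entries
// pca   : numPCA basis vectors P_kj, numLightCoeffs entries each
// light : 3 color channels of lighting coefficients, numLightCoeffs each
void PackClusterConstants(const float* mean,
                          const float* const* pca,
                          const float* const* light,
                          int numLightCoeffs, int numPCA,
                          std::vector<float>& constantsOut)
{
    // cluster mean color: (M_k . L_r, M_k . L_g, M_k . L_b, unused)
    for (int ch = 0; ch < 3; ++ch)
        constantsOut.push_back(Dot(mean, light[ch], numLightCoeffs));
    constantsOut.push_back(0.0f);

    // then the PCA dot products for R, then G, then B (numPCA values each,
    // numPCA being a multiple of 4 so they fill whole float4 registers)
    for (int ch = 0; ch < 3; ++ch)
        for (int j = 0; j < numPCA; ++j)
            constantsOut.push_back(Dot(pca[j], light[ch], numLightCoeffs));
}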

// inputs stored per vertex<br />

struct VS_INPUT<br />

{<br />

float4 vPosition : POSITION;<br />

int vClusterInfo : BLENDWEIGHT0;<br />

#if (NPCA>0)<br />

float4 vPCAWts[(NPCA+3)/4] : BLENDWEIGHT1;<br />

#endif<br />

};<br />

// outputs – position and color<br />

struct VS_OUTPUT_DIFF<br />

{<br />

float4 vPosition : POSITION;<br />

float4 vDiff : COLOR;<br />

};<br />

// all of the constant registers are mapped<br />

// if vs1.1 is used have to use the appropriate #<br />

// This assumes that the first 4 constants contain<br />

// the complete transformation matrix to NDC space.<br />

float4 c[255] : register(vs_2_0, c0);<br />

VS_OUTPUT_DIFF DiffShader(const VS_INPUT v)<br />

{<br />

VS_OUTPUT_DIFF o;



// 1st four constants are the transformation matrix<br />

matrix mXform;<br />

mXform[0] = c[0];<br />

mXform[1] = c[1];<br />

mXform[2] = c[2];<br />

mXform[3] = c[3];<br />

o.vPosition = mul(mXform,v.vPosition); // xform point<br />

int iIndexBase = v.vClusterInfo.x;<br />

float4 vRes = c[iIndexBase]; // cluster mean color<br />

#if (NPCA > 0)<br />

float PCACoefs[NPCA] = (float[NPCA])v.vPCAWts;<br />

// accumulate R/G/B each in a 4 vector<br />

float4 vRed = 0;<br />

float4 vGreen = 0;<br />

float4 vBlue = 0;<br />

// compute the sum from equation 6<br />

// do R/G/B in parallel, 4 coefficients at a time<br />

   // reconstructed sketch of the loop body: accumulate the per-cluster PCA
   // dot products, assuming the packing described above (mean first, then
   // the R block, then G, then B, each NPCA/4 float4s long)
   for (int i = 0; i < (NPCA/4); i++)
   {
      vRed   += v.vPCAWts[i] * c[iIndexBase + 1 + (NPCA/4)*0 + i];
      vGreen += v.vPCAWts[i] * c[iIndexBase + 1 + (NPCA/4)*1 + i];
      vBlue  += v.vPCAWts[i] * c[iIndexBase + 1 + (NPCA/4)*2 + i];
   }
   vRes.r += dot(vRed,   float4(1,1,1,1));
   vRes.g += dot(vGreen, float4(1,1,1,1));
   vRes.b += dot(vBlue,  float4(1,1,1,1));
#endif
   o.vDiff = vRes;
   return o;
}


Acknowledgments

This article is based on research results generated in collaboration with several
individuals — in particular John Snyder, Jan Kautz, and Jesse Hall. Jason Sandlin
and Ben Luna have been extremely helpful discussing these ideas. Wolfgang
Engel provided valuable feedback and encouragement while I wrote this article.

References

[1] Kajiya, J., “The Rendering Equation,” SIGGRAPH 1986, pp. 143-150.<br />

[2] Kautz, J., P. Sloan, and J. Snyder, “Fast, Arbitrary BRDF Shading for Low-Frequency<br />

Lighting Using Spherical Harmonics,” 12th Eurographics Workshop on<br />

Rendering, pp. 301-308.<br />

[3] Malzbender, T., D. Gelb, and H. Wolters, “Polynomial Texture Maps,”<br />

SIGGRAPH 2001, pp. 519-528.<br />

[4] Ramamoorthi, R. and P. Hanrahan, “An Efficient Representation for Irradiance<br />

Environment Maps,” SIGGRAPH 2001, pp. 497-500.<br />

[5] Ramamoorthi, R. and P. Hanrahan, “Frequency Space Environment Map Rendering,”<br />

SIGGRAPH 2003, pp. 517-526.<br />

[6] Sloan, P., J. Kautz, and J. Snyder, “Precomputed Radiance Transfer for<br />

Real-Time Rendering in Dynamic, Low-Frequency Lighting Environments,”<br />

SIGGRAPH 2002, pp. 527-536.<br />

[7] Sloan, P., J. Hall, J. Hart, and J. Snyder, “Clustered Principal Components for<br />

Precomputed Radiance Transfer,” SIGGRAPH 2003, pp. 382-391.<br />

[8] Sloan, P., X. Liu, H. Shum, and J. Snyder, “Bi-Scale, Low-Frequency Radiance<br />

Self-Transfer,” SIGGRAPH 2003, pp. 370-375.


Advanced Sky Dome Rendering<br />

Marco Spoerl and Kurt Pelzer<br />

Introduction<br />

With the current interest of both hobbyist and professional game programmers in<br />

landscape rendering and the shift from indoor to outdoor environments in game<br />

design, one aspect has become very important: the sky. Traditional approaches for<br />

rendering skies only used texture-mapped domes. This is acceptable for indoor<br />

environments, since the player only sees a glimpse of it. Such sky domes can be<br />

implemented both easily and efficiently. But their major drawback is a lack of flexibility,<br />

which makes it difficult to render dynamic effects like changes with the<br />

time of day.<br />

This article describes a better solution and illustrates the implementation of<br />

a basic vertex color sky dome that:<br />

• Computes the correct position of both the sun and the moon, depending on

time of day

• Changes its color depending on the position of the sun

• Renders a projection of the sun at its correct position

• Renders a projection of the moon at its correct position, including the moon’s

current phase using per-pixel lighting

Position of Sun and Moon<br />


[SchlyterA] and [SchlyterB] give excellent instruction for computing the positions<br />

of sky objects from their orbital elements. Although the algorithms presented<br />

there are significantly simplified, they still work well for real-time computer<br />

graphics.<br />

Using the equations in [SchlyterA], it’s simple to compute the position of the<br />

sun. The orbital elements used are longitude of perihelion (w), mean distance (a)<br />

measured in astronomical units (AU), eccentricity (e), and mean anomaly (M).<br />

Most of those and the obliquity of the ecliptic (oblecl) depend on the current time.<br />

From these, all other elements needed to determine the position of the sun are<br />

computed: mean longitude (L), eccentric anomaly (E), rectangular coordinates (x,<br />

y), distance (r) measured in astronomical units, true anomaly (v), and longitude


(lon). Finally, the ecliptic rectangular coordinates are calculated and stored as the<br />

sun’s current position.<br />
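A sketch of that computation for the sun, assuming the time-dependent orbital elements have already been evaluated per [SchlyterA] and converted to radians; function and parameter names are illustrative:

#include <math.h>

// Sketch: ecliptic rectangular coordinates of the sun from its orbital
// elements, following the simplified scheme described above.
// w = longitude of perihelion, a = mean distance (1.0 AU for the sun),
// e = eccentricity, M = mean anomaly; w and M are given in radians here.
void ComputeSunPosition(float w, float a, float e, float M,
                        float* outX, float* outY, float* outZ)
{
    // eccentric anomaly (first-order approximation of Kepler's equation)
    float E = M + e * sinf(M) * (1.0f + e * cosf(M));

    // rectangular coordinates in the orbital plane
    float x = a * (cosf(E) - e);
    float y = a * (sinf(E) * sqrtf(1.0f - e * e));

    // distance, true anomaly, and longitude
    float r   = sqrtf(x * x + y * y);
    float v   = atan2f(y, x);
    float lon = v + w;

    // ecliptic rectangular coordinates (the sun's ecliptic latitude is zero)
    *outX = r * cosf(lon);
    *outY = r * sinf(lon);
    *outZ = 0.0f;
}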

In addition to the basic orbital elements mentioned above, the moon needs<br />

the longitude of the ascending node (N) and inclination (i). Everything else is<br />

computed similar to the sun with the exception of the longitude (lon) and the new<br />

parameter latitude (lat), which are computed using the ecliptic rectangular coordinates<br />

and the fact that all distances are not measured in astronomical units but<br />

Earth radii. After that, the longitude, latitude, and distance are corrected using<br />

the perturbations of the moon, and the resulting final coordinates are stored as<br />

the current position. The computation of<br />

the moon’s position is completed with two<br />

important values, elongation and the resulting<br />

phase angle, needed later to display the<br />

moon’s phase.<br />

The computed positions, as illustrated<br />

in Figure 1, are geocentric (i.e., the viewer<br />

is located at the center of the Earth).<br />

Although this affects the results for the<br />

moon (but not for the sun, as it’s too far<br />

away), the position is not shifted to<br />

topocentric coordinates for the sake of clar-<br />

ity. Still another correction has to be made<br />

to both results regarding the geocentric<br />

position: the influence of the Earth’s rotation.<br />

Without it, one “day” would last one<br />

year — the time needed for the Earth to circle the sun or, from a geocentric point<br />

of view, for the sun to circle the Earth. To simulate the effect, the sky object’s<br />

position vector is simply transformed by a matrix built from a rotation around the<br />

z-axis by an angle relative to the time of day. Of course, any other axis of rotation<br />

could be chosen to simulate a lower course instead of the current one, which<br />

always leads through the zenith.<br />
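A small sketch of that rotation using D3DX (the 24-hour mapping and the function name are assumptions):

#include <d3dx9.h>

// Sketch: simulate the Earth's rotation by spinning the geocentric position
// of a sky object around the z-axis, proportional to the time of day.
void ApplyEarthRotation(D3DXVECTOR3* pPosition, float timeOfDayInHours)
{
    float angle = 2.0f * D3DX_PI * (timeOfDayInHours / 24.0f);

    D3DXMATRIX matRotZ;
    D3DXMatrixRotationZ(&matRotZ, angle);

    D3DXVec3TransformCoord(pPosition, pPosition, &matRotZ);
}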

Figure 1: Difference between geocentric and topocentric coordinates

Rendering the Dome

The dome itself is a simple sphere created at run time. Only the position and normal<br />

of each vertex are stored. Texture coordinates and vertex color are computed<br />

in the vertex shader. One characteristic of the dome is its transformation. The<br />

viewer is always in the center of the sky sphere (i.e., the sky dome moves as the<br />

viewer moves). To take this into account, the translational part of the view matrix<br />

is set to zero before building the transformation matrix for the vertex program.<br />

As the dome has to be huge to cover the whole world while moving around<br />

with the viewer, another modification has to be made. When transforming the vertex<br />

position to clip space inside the vertex shader, the z- and w-coordinates are<br />

made equal (that is, after the division of the x-, y-, and z-coordinates by the<br />

w-component z becomes 1, making the vertex lie on the far clipping plane).



; v0 = position
; c0 - c3 = proj matrix * view matrix

vs_1_1

dcl_position v0

; transform position
dp4 oPos.x, v0, c0
dp4 oPos.y, v0, c1
dp4 oPos.zw, v0, c3   ; simply force z to 1.0f by making it equal to w

NOTE Throughout this article, the world matrix is ignored, as it’s always the<br />

identity matrix.<br />

Determination of the sky color is roughly oriented to [Nishita], which<br />

defines equations to compute the color using single and multiple scattering.<br />

Compared to the equations presented in that paper, this article uses a slightly different,<br />

much simplified formula to calculate the final vertex color.<br />

$$C_V(\theta) = \big( K_r F_r(\theta) + K_m F_m(\theta) \big)\, S\, C_S$$

Fr and Fm are the phase functions for molecule and aerosol scattering, respectively.<br />

Kr and Km are the colors of the molecules and aerosols. S is a scaling factor<br />

relative to the vertex’s y-coordinate, and Cs is the current color of the sun.<br />

The shader starts with the calculation of the scaling value and an exponent<br />

needed later:<br />

(...)<br />

; v0 = position<br />

; c16 = constants (-1000.0f*fInvR*fInvR, 3000.0f, 0.0f, 2.0f)<br />

; calculate steps<br />

mul r0.x, v0.y, v0.y<br />

mad r0.xw, r0.x, c16.xxzz, c16.yyww<br />

(...)<br />

The angle needed to calculate molecule and aerosol scattering is computed using<br />

the sun normal and the current vertex normal. This is feasible, as the viewer is<br />

always standing at the dome’s center.<br />

; v1 = normal<br />

; c12 = sun normal (sunN.x, sunN.y, sunN.z, 0.0f)<br />

(...)<br />

dcl normal v1<br />

(...)<br />

"; calc angle normal - sunnormal<br />

dp3 r0.y, v1, c12<br />

(...)<br />

Molecule scattering is performed using the equation:


$$F_r(\theta) = \frac{3}{4}\,\big( 1 + \cos^2\theta \big)$$

; c15 = constants (0.0f, 1.0f, 0.75f, 0.9f)<br />

(...)<br />

; calculate fr (molecule scattering)<br />

lit r3.z, r0.yyww<br />

mad r4.x, r3.z, c15.z, c15.z<br />

(...)<br />

Aerosol scattering is simply a scalar 1.75f or 0.75f, depending on the magnitude of<br />

the angle calculated earlier.<br />

"; c15 = constants (0.0f, 1.0f, 0.75f, 0.9f)<br />

(...)<br />

; calculate fm (aerosol scattering)<br />

max r6.x, r0.y, -r0.y<br />

sge r6.x, r6.x, c15.y<br />

add r5.x, r6.x, c15.z<br />

(...)<br />

Afterward, the final color is computed as follows:<br />

; c11 = sun color (red, green, blue, 0.0f)<br />

; c12 = sun normal (sunN.x, sunN.y, sunN.z, 0.0f)<br />

; c13 = constants (KrRed, KrGreen, KrBlue, 0.0f)<br />

; c14 = constants (KmRed, KmGreen, KmBlue, 0.0f)<br />

(...)<br />

; Calculate the color regarding to the scattering values<br />

mul r7.xyz, c13, r4.x ; multiply Fr by Kr color of the molecules<br />

mad r9.xyz, c14, r5.x, r7 ; multiply Fm by Km color of the aerosols and add<br />

mul r9.xyz, r9.xyz, r0.x ; scale by steps<br />

mul oD0.xyz, r9, c11 ; output vertex color scaled by the sun’s current color<br />

(...)<br />

Rendering the Sun

Drawing the sun is actually quite simple. Every sky object has a camera associated with it. As seen in Figure 2, this camera is used as a texture projector located at the world origin, looking at the sun's current position. Its projection and view matrices are concatenated with a special matrix needed to handle the transformation from projection space to texture space.

Figure 2: Basic texture projection setup


TexSpaceMatrix =
    |  0.5f   0.0f   0.0f   0.0f |
    |  0.0f  -0.5f   0.0f   0.0f |
    |  0.0f   0.0f   0.0f   0.0f |
    |  0.5f   0.5f   1.0f   1.0f |

The transposed view-projection-texspace matrix is passed into the vertex shader, where the texture coordinates for the projected sun texture are computed based on the current vertex position.
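The article's listings do not include the host-side setup of this matrix. As a rough illustration only, the following C++ sketch shows one way the projector matrix could be built with D3DX and uploaded to c5-c7; the function name, the 10-degree field of view, and the near/far values are assumptions, not taken from the original code.

#include <d3dx9.h>

// Build the sun projector matrix (view * projection * texture-space) and
// upload its transpose to vertex shader constants c5-c7, matching the
// register layout documented in the shader listings. Sketch only.
void SetSunProjectorMatrix(IDirect3DDevice9* pDevice, const D3DXVECTOR3& vSunPos)
{
    // Projector camera: sits at the world origin and looks at the sun.
    D3DXVECTOR3 vEye(0.0f, 0.0f, 0.0f);
    D3DXVECTOR3 vUp (0.0f, 1.0f, 0.0f);
    D3DXMATRIX  matView, matProj, matResult;

    D3DXMatrixLookAtLH(&matView, &vEye, &vSunPos, &vUp);

    // Narrow FOV so the projected sun covers only a small part of the dome
    // (the 10-degree value is an arbitrary example).
    D3DXMatrixPerspectiveFovLH(&matProj, D3DXToRadian(10.0f), 1.0f, 0.1f, 10.0f);

    // Projection space [-1,1] -> texture space [0,1], with y flipped and
    // z/w handled as in the TexSpaceMatrix shown above.
    D3DXMATRIX matTex(0.5f,  0.0f, 0.0f, 0.0f,
                      0.0f, -0.5f, 0.0f, 0.0f,
                      0.0f,  0.0f, 0.0f, 0.0f,
                      0.5f,  0.5f, 1.0f, 1.0f);

    matResult = matView * matProj * matTex;

    // The shader performs dp4 against rows, so the transpose is uploaded;
    // only three rows (c5-c7) are referenced by the shader.
    D3DXMatrixTranspose(&matResult, &matResult);
    pDevice->SetVertexShaderConstantF(5, (const float*)&matResult, 3);
}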

; v0       = position
; c5 - c7  = view matrix * projection matrix * texspace matrix (sun projector)
; c15      = constants (0.0f, 1.0f, 0.75f, 0.9f)
(...)
dp4 oT0.x, v0, c5
dp4 oT0.y, v0, c6
dp4 r0.zw, v0, c7
sge r1.w, r0.w, c15.x
mul oT0.zw, r0, r1.w
(...)

The extra operations before storing the oT0.zw coordinates are needed to avoid a second projection behind the sky object's camera.

The pixel shader is a simple fixed-function-style ADDSMOOTH operation that blends the sun texture (as seen in Figure 3), indexed via the texture coordinates, with the calculated sky color passed in via v0.

; v0 = calculated sky color
; t0 = sun texture

ps.1.1

tex t0                    ; fetch sun texture
(...)
mad_sat r0, t0, 1-v0, v0
(...)

Figure 3: The sun texture

Rendering the Moon and Its Phase

Displaying the moon is basically the same as drawing the sun. Again, a transposed view-projection-texspace matrix is passed in to the vertex shader to compute the texture coordinates.

; v0       = position
; c8 - c10 = view matrix * projection matrix * texspace matrix (moon projector)
; c15      = constants (0.0f, 1.0f, 0.75f, 0.9f)
(...)
dp4 r2.x, v0, c8
dp4 r2.y, v0, c9
dp4 r0.zw, v0, c10
sge r1.w, r0.w, c15.x
mul r2.zw, r0, r1.w
mov oT1, r2
mov oT2, r2

If rendered like this (just with the texture, as seen in Figure 4), the moon will always appear full because its predominant feature — the moon's phase — is missing. To solve this, per-pixel lighting using a spherically based normal map is necessary. There's only one problem: As the moon itself is not a real object, how do you compute the tangent space matrix needed for per-pixel lighting? Simple answer: You don't. An imaginary viewer on Earth mostly sees the same side of the moon (i.e., the moon "object" can be thought of as a simple textured quad always facing the viewer). The imaginary sun rotates around that object (remember that geocentric coordinates are used — that is, both sun and moon are circling the Earth), with the light vector being perpendicular to the front of the quad at full moon and perpendicular to the back at new moon. So the light vector used is simply the vector (0.0f, 0.0f, -1.0f) rotated around the quad's local y-axis by the moon's phase angle computed earlier. Figure 5 shows the solution.

Figure 4: The moon texture
Figure 5: Setup for lighting the moon

With this composition, the tangent space matrix would be the identity matrix and is omitted altogether. Bearing that in mind, the transformation of the light vector inside the vertex program is easy.

; c17 = lightvec sun->moon (light.x, light.y, light.z, 0.5f)
(...)
mov r0, c17
mad oD1.xyz, -r0.xyz, c17.w, c17.w
(...)
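The code that fills c17 on the host is not listed in the article. A minimal sketch of how the light vector could be derived from the phase angle, assuming D3DX and a hypothetical function name, might look like this:

#include <d3dx9.h>

// Rotate the full-moon light direction (0, 0, -1) around the quad's local
// y-axis by the moon's phase angle and upload it to c17, with 0.5f in w as
// expected by the vertex shader snippet above. Sketch only.
void SetMoonLightVector(IDirect3DDevice9* pDevice, float fPhaseAngle)
{
    D3DXVECTOR3 vLight(0.0f, 0.0f, -1.0f);   // light vector at full moon
    D3DXMATRIX  matRotY;

    D3DXMatrixRotationY(&matRotY, fPhaseAngle);
    D3DXVec3TransformNormal(&vLight, &vLight, &matRotY);

    const float c17[4] = { vLight.x, vLight.y, vLight.z, 0.5f };
    pDevice->SetVertexShaderConstantF(17, c17, 1);
}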

Figure 6: The moon normal map

The pixel shader uses the normal map, as seen in Figure 6, and the light vector passed in oD1 (or v1, respectively) to "light" the moon texture. The resulting value is multiplied by an external factor and added to the current color. This factor is relative to the sun's current normalized y-position and is used to "disguise" the moon during the day.

; c0 = scaling factor for the moon color
; v1 = transformed light vector for the moon phase

ps.1.1

(...)
tex t1                        ; fetch moon texture
tex t2                        ; fetch moon normal map
(...)
dp3 r1, t2_bx2, v1_bx2        ; calculate angle between light vector and moon normal
mul_sat r1, t1, r1            ; light moon texture
mad_sat r0, r1, c0, r0        ; multiply by scaling factor and add to current color
(...)

Putting It All Together

These are the final shaders:

;*************************************************************************
; Vertex Shader
; Author:   Marco Spoerl
; Function: Sky dome
;
; v0 = position (3 floats)
; v1 = normal (3 floats)
; c0 - c4  = proj matrix * view matrix
; c5 - c7  = view matrix * proj matrix * texspace matrix (sun projector)
; c8 - c10 = view matrix * proj matrix * texspace matrix (moon projector)
; c11 = sun color (red, green, blue, 0.0f)
; c12 = sun normal (sunN.x, sunN.y, sunN.z, 0.0f)
; c13 = constants (KrRed, KrGreen, KrBlue, 0.0f)
; c14 = constants (KmRed, KmGreen, KmBlue, 0.0f)
; c15 = constants (0.0f, 1.0f, 0.75f, 0.9f)
; c16 = constants (-1000.0f*fInvR*fInvR, 3000.0f, 0.0f, 2.0f)
; c17 = lightvec sun->moon (light.x, light.y, light.z, 0.5f)
;*************************************************************************

vs.1.1

dcl_position v0
dcl_normal   v1

; transform position
dp4 oPos.x, v0, c0
dp4 oPos.y, v0, c1
dp4 oPos.zw, v0, c3          ; simply force z to 1.0f by making it equal to w

; calculate steps
mul r0.x, v0.y, v0.y
mad r0.xw, r0.x, c16.xxzz, c16.yyww

; calc angle normal - sunnormal
dp3 r0.y, v1, c12

; calculate fr (molecule scattering)
lit r3.z, r0.yyww
mad r4.x, r3.z, c15.z, c15.z

; calculate fm (aerosol scattering)
max r6.x, r0.y, -r0.y
sge r6.x, r6.x, c15.y
add r5.x, r6.x, c15.z

; calculate the color with regard to the scattering values
; Kr color of the molecules
mul r7.xyz, c13, r4.x
; Km color of the aerosols
mad r9.xyz, c14, r5.x, r7
; scale by steps
mul r9.xyz, r9.xyz, r0.x
; output color scaled by current sun color
mul oD0.xyz, r9, c11

; output transformed light vector for the moon phase
mov r0, c17
mad oD1.xyz, -r0.xyz, c17.w, c17.w

; output projected texcoord0 (sun)
dp4 oT0.x, v0, c5
dp4 oT0.y, v0, c6
dp4 r0.zw, v0, c7
sge r1.w, r0.w, c15.x
mul oT0.zw, r0, r1.w

; output projected texcoord1/2 (moon/moonnormals)
dp4 r2.x, v0, c8
dp4 r2.y, v0, c9
dp4 r0.zw, v0, c10
sge r1.w, r0.w, c15.x
mul r2.zw, r0, r1.w
mov oT1, r2
mov oT2, r2

;*************************************************************************
; Pixel Shader
; Author:   Marco Spoerl
; Function: Sky dome
;
; c0 = scaling factor for the moon color
; v0 = calculated vertex color
; v1 = transformed light vector for the moon phase
; t0 = sun texture
; t1 = moon texture
; t2 = moon normal map
;*************************************************************************

ps.1.1

; fetch textures
tex t0                        ; sun
tex t1                        ; moon
tex t2                        ; moon normals

; ADDSMOOTH vertex color and sun
mad_sat r0, t0, 1-v0, v0

; calculate moon color
dp3 r1, t2_bx2, v1_bx2
mul_sat r1, t1, r1

; ADD current color and scaled moon color
mad_sat r0, r1, c0, r0


Where to Go from Here

There are a number of improvements that can be made to these shaders. Among them are the following:

• Currently, no stars are displayed. They can be implemented using a static cube map made from photographs of the night sky. A better approach is the run-time creation of a map using the positional data from the public Bright Star Catalogue (BSC) or displaying them as points using the information from the BSC. Furthermore, the star map must be rotated by the time of day to simulate the Earth's rotation.

• In addition to the position of the sky objects, [SchlyterA] gives equations to compute the apparent diameter. The resulting value can be used to change the projected size of the object.

• The textures for the moon and the sun can be generated at run time using real 3D objects. Solar and lunar eclipses would then be possible.

• Other sky objects can be simulated (e.g., planets, asteroids, and comets).

• The topocentric position can be used instead of the geocentric one. Note that when using topocentric coordinates, the positions of the stars have to change depending on the position of the viewer. The night sky in the northern hemisphere, for example, is different from that in the southern hemisphere.

• A better night sky model can be applied, simulating the influence of the moonlight, the stars, and phenomena like zodiacal light or airglow. [Jensen] has some great ideas on that.

• Fog and haze are missing. The method outlined in [Hoffman] can be useful as an expansion or substitution.

• Clouds are missing, a topic that could fill many books. They can be implemented using simple noise textures or with approaches described in [Harris] or [Miyazaki], for example. With clouds, effects like rainfall, snowfall, and maybe even rainbows or lightning can be simulated.

Conclusion

This article showed a basic approach for implementing a simple, non-static sky dome. Although incomplete and not very accurate, it is a good starting point for further research into the topic of outdoor rendering. The complete source code and a sample application can be found on this book's companion CD. In addition, Color Plates 9 and 10 show screen shots of the sky along with the advanced water effects discussed earlier in the book.

Acknowledgments

Thanks to Max Dennis Luesebrink and Matthias Wloka for their work on the very early version of the sky dome and its vertex shader.


References

[Harris] Harris, Mark J. and Anselmo Lastra, "Real-Time Cloud Rendering," Eurographics 2001 Proceedings, Vol. 20, No. 3, pp. 76-84.

[Hoffman] Hoffman, Naty and Arcot J. Preetham, "Rendering Outdoor Light Scattering in Real Time," Proceedings of the 2002 Game Developers Conference.

[Jensen] Jensen, Henrik Wann, Fredo Durand, Michael M. Stark, Simon Premoze, Julie Dorsey, and Peter Shirley, "A Physically-Based Night Sky Model," SIGGRAPH 2001, Computer Graphics Proceedings, Annual Conference Series, pp. 399-408.

[Miyazaki] Miyazaki, R., S. Yoshida, Y. Dobashi, and T. Nishita, "A Method for Modeling Clouds based on Atmospheric Fluid Dynamics," Pacific Graphics 2001, pp. 363-372.

[Nishita] Nishita, Tomoyuki, Yoshinori Dobashi, Kazufumi Kaneda, and Hideo Yamashita, "Display Method of the Sky Color Taking into Account Multiple Scattering," SIGGRAPH 1996, Computer Graphics Proceedings, Annual Conference Series, pp. 379-386.

[SchlyterA] Schlyter, Paul, "How to compute planetary positions," available online at http://hem.passagen.se/pausch/comp/ppcomp.html.

[SchlyterB] Schlyter, Paul, "Computing planetary positions — a tutorial with worked examples," available online at http://stjarnhimlen.se/comp/tutorial.html.


Deferred Shading with Multiple Render Targets

Nicolas Thibieroz

Introduction

Traditional rendering algorithms submit geometry and immediately apply shading effects to the rasterized primitives. Complex shading effects often require multiple render passes to produce the final pixel color, with the geometry submitted every pass. Deferred shading (aka quad shading) is an alternative rendering technique that submits the scene geometry only once, storing per-pixel attributes into local video memory to be used in the subsequent rendering passes. In these later passes, screen-aligned quads are rendered, and the per-pixel attributes contained in the buffer are retrieved at a 1:1 mapping ratio so that each pixel is shaded individually. The following figure illustrates the principle of deferred shading.

Figure 1: Deferred shading flow diagram with arbitrary examples of stored data (position, normal, and color)

Deferred shading has a number of advantages over traditional rendering. Firstly, only a single geometry pass is required, even if shadow algorithms are used. By virtually eliminating multiple geometry passes, the saved vertex throughput can be used to dramatically increase the polygon budget of the single geometry pass, improving the scene's realism without compromising the performance of the shading passes (as they are not dependent on the underlying geometry).

Secondly, all shading calculations are performed per-pixel, as each pixel has a unique set of properties. For shading effects that simulate lighting, this is preferable to using interpolated vertex shader outputs. This subtle difference can have a dramatic impact on the quality of the visual results.

Thirdly, deferred shading has the advantage of reducing pixel overdraw, as only the initial geometry pass may have an average overdraw value above 1. This is because all shading passes operate on pixels that are already visible; thus, no overdrawn pixel will ever be touched by the pixel shader during the shading passes.

These advantages make deferred shading an interesting alternative to traditional multi-pass rendering, though the accrued memory footprint and bandwidth requirements need careful consideration when implementing this technique. This article describes deferred shading in detail and gives practical implementation examples that show how deferred shading can be applied to current and future games using the DirectX 9 API and beyond.

Multiple Render Targets

Prior to DirectX 9, one could only output a maximum of 32 bits consisting of four color components to a render target in a single pass. However, deferred shading requires a greater number of components to accommodate the pixel attributes calculated during the initial pass. Multiple render targets (MRTs), a new feature of DirectX 9, allow up to four render targets to be written to in the same rendering pass, bringing the total number of output components to 16 and the maximum precision to 512 bits (although these can vary depending on the host 3D device). Without MRT support, outputting more than four components would require additional geometry passes.

These MRTs are used to store scene information during the geometry pass (or "building pass") and are then accessed as textures during the shading passes. Note that MRTs have the following limitations in the DirectX 9 API:

• They must be of identical size.

• They can only be of different bit depths if the D3DPMISCCAPS_MRTINDEPENDENTBITDEPTHS cap is exported by the 3D device.

• Dithering, alpha testing, fogging, blending, or masking are only supported if the 3D device exposes the D3DPMISCCAPS_MRTPOSTPIXELSHADERBLENDING cap.

• They may not be antialiased.

Each MRT contains per-pixel information about the scene and therefore should be of the same size as the main render target (the back buffer). Because the back buffer's width and height are usually not a power of two, 3D devices not supporting the D3DPTEXTURECAPS_NONPOW2CONDITIONAL cap will need to create MRTs at the next power-of-two size above the main render target dimensions (e.g., for a 1280x1024 back buffer, the MRT's size will be 2048x1024). Although non-power-of-two MRTs have limitations, these do not directly affect the algorithm.

Multi-element textures (METs) are another, albeit less flexible, feature of DirectX 9 that closely resembles MRTs. METs are basically a set of textures of the same format packed together. This limitation and the fact that DirectX 9 exports a single MET format called D3DFMT_MULTI2_ARGB8 (two textures of 8888 format) make METs unsuitable for deferred shading.
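As a rough, hedged illustration of how the MRTs used in this article might be created and bound for the building pass, consider the following C++ sketch. The formats follow the choices discussed in the next sections (64-bit float position, 32-bit normal, 32-bit diffuse); the helper name, the global array, and the omission of power-of-two handling and error checking are assumptions for brevity, not code from the article.

#include <d3d9.h>

// Create the three attribute-buffer textures and bind them as render
// targets 0-2 for the building pass. Sketch only.
IDirect3DTexture9* g_pMRT[3];

void CreateAndBindMRTs(IDirect3DDevice9* pDevice, UINT width, UINT height)
{
    const D3DFORMAT formats[3] =
    {
        D3DFMT_A16B16G16R16F,   // MRT#0: world space position
        D3DFMT_A8R8G8B8,        // MRT#1: world space normal (fixed point)
        D3DFMT_A8R8G8B8         // MRT#2: diffuse color
    };

    for (int i = 0; i < 3; ++i)
    {
        pDevice->CreateTexture(width, height, 1, D3DUSAGE_RENDERTARGET,
                               formats[i], D3DPOOL_DEFAULT, &g_pMRT[i], NULL);

        IDirect3DSurface9* pSurface = NULL;
        g_pMRT[i]->GetSurfaceLevel(0, &pSurface);
        pDevice->SetRenderTarget(i, pSurface);   // bind as render target i
        pSurface->Release();
    }
}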

Attribute Buffer

The attribute buffer is the name we give to the scene data we are storing in our MRT textures during the building pass. The choice of stored data depends on the shading model. For instance, Gouraud-shaded directional (or parallel) lights only require each pixel's normal and material properties, whereas Phong-shaded point or spotlights require the pixel's position in addition to its normal and material properties.

The implementation detailed in this article assumes all lights required during the shading passes are based on the Phong model, and therefore the attribute buffer contains the following:

• Pixel position (X, Y, Z)

• Pixel normal vector (X, Y, Z)

• Pixel diffuse color (R, G, B)

Pixel Position

This is the world space position of the pixel. Note that it is also possible to store the position in a different coordinate system (eye space being the other logical alternative), providing all data stored and used in the pixel shader during the lighting passes is in the same space.

World space pixel positions can be calculated by transforming each vertex in the vertex shader by the world matrix for the associated model and outputting the result to a 3D texture coordinate. The iterated texture coordinates in the pixel shader define the world space position for this pixel.

A 16-bit per-channel float format is ideal to store the pixel's world position in order to accommodate a wide range of possible values with sufficient precision (i.e., D3DFMT_A16B16G16R16F). See Color Plate 11 for an illustration of position data.

Pixel Normal Vector

This is the world space normalized normal vector for the pixel. There are two options we can choose from when producing the world space normal vectors for storage in the property buffer: model space and tangent space.

Pixel Normals Defined in Model Space

The first and simplest option is to have the normals in bump maps already defined in model space. Model space normal maps are sometimes calculated with a software tool that generates a low-poly model with model space normal maps from a high-poly model. With normals defined in model space, a simple world matrix transformation in the pixel shader is all that is required to obtain world space normals. Although this solution is simple to implement in terms of calculations, it also requires more texture memory, since texture maps are usually unique for each model. Also, the artist may have the scene's normal data already defined in tangent space, so it might be more convenient from a design point of view to use these instead.

Pixel Normals Defined in Tangent Space

The second option is more complicated but operates from normals defined in tangent space (also called texture space). Tangent space is a coordinate system indicating the texture surface orientation at each vertex. Traditional per-pixel lighting usually involves transforming the light vector into tangent space so that a DOT3 operation can be performed with the tangent space normals contained in the bump map. See Equation (1).

V_{TS} = V_{MS} \cdot (TS) = V_{MS} \cdot \begin{pmatrix} T_X & N_X & B_X \\ T_Y & N_Y & B_Y \\ T_Z & N_Z & B_Z \end{pmatrix}    (1)

With deferred shading, the normals need to be transformed from tangent space to world space. In the same way a vector is transformed from model space to tangent space using the tangent space matrix, a tangent space vector can be transformed to model space using the inverse of the tangent space matrix. Conveniently, the tangent space matrix is a pure rotation matrix, so its inverse is simply its transpose. See Equation (2).

V_{MS} = V_{TS} \cdot (TS)^{-1} = V_{TS} \cdot \begin{pmatrix} T_X & N_X & B_X \\ T_Y & N_Y & B_Y \\ T_Z & N_Z & B_Z \end{pmatrix}^{T} = V_{TS} \cdot \begin{pmatrix} T_X & T_Y & T_Z \\ N_X & N_Y & N_Z \\ B_X & B_Y & B_Z \end{pmatrix}    (2)

Because we need the normal vectors in world space, we also need to transform them with the rotation part of the world matrix associated with the current model. The equation becomes:

V_{WS} = V_{TS} \cdot (TS)^{T} \cdot (W)    (3)

For static models, it is a good idea to have their local orientation match their orientation in world space so that only a simple transformation with the transpose of the tangent space matrix is necessary. This saves a few instructions compared to the dynamic object's case, which requires Equation (3) to be fully honored.

Because we need a set of transposed tangent space vectors at each pixel, the normal, binormal, and tangent vectors are passed to the pixel shader through a set of three 3D texture coordinates. The iterated vectors define our tangent space matrix for the current pixel. In theory, these vectors need renormalization before they can be used to transform the normal; however, in practice the difference in visual quality is negligible, provided the scene tessellation is high enough.

Precision and Storage

In both cases it can be desirable to renormalize the iterated normals for improved accuracy. Although linearly interpolated normals are usually close enough to unit vectors, the error margin accumulates when complex math operations are performed on these vectors (e.g., calculation of reflection vectors).

Although a float format could be used to store world space normal vectors in the attribute buffer, it is more economical and usually sufficiently accurate to use a 32-bit integer format instead (D3DFMT_A2W10V10U10, D3DFMT_A2B10G10R10, D3DFMT_Q8W8V8U8, D3DFMT_A8R8G8B8).

The deferred shading implementation described in this article uses tangent space normal maps. See Color Plate 12 for an illustration of normal data.

Pixel Diffuse Color

The pixel's diffuse color is also stored in the property buffer. This color is extracted from the diffuse texture associated with the model and can be stored using a simple 32-bit texture format (D3DFMT_A8R8G8B8). Diffuse data is shown in Color Plate 13.

Building Pass

This section details how the attribute buffer is constructed and stored in MRTs. During the building pass, all the data relevant to the scene is calculated and stored in our MRTs. The pixel shader sends the relevant data into each of the three MRTs (i.e., pixel position, normal, and color).

Vertex Shader Code

The vertex shader code used for the building pass is fairly simple.

;-------------------------------------------------------------------
; Constants specified by the app
; c0-c3 = Global transformation matrix (World*View*Projection)
; c4-c7 = World transformation matrix
;
; Vertex components
; v0 = Vertex Position
; v1, v2, v3 = Inverse of tangent space vectors
; v4 = 2D texture coordinates (model coordinates)
;-------------------------------------------------------------------
vs.2.0

dcl_position v0         ; Vertex position
dcl_binormal v1         ; Transposed binormal
dcl_tangent  v2         ; Transposed tangent
dcl_normal   v3         ; Transposed normal
dcl_texcoord v4         ; Texture coordinates for diffuse and normal map

; Vertex transformation
m4x4 oPos, v0, c0       ; Transform vertices by WVP matrix

; Model texture coordinates
mov oT0.xy, v4.xy       ; Simply copy texture coordinates

; World space coordinates
m4x3 oT1.xyz, v0, c4    ; Transform vertices by world matrix (no w needed)

; Inverse (transpose) of tangent space vectors
mov oT2.xyz, v1
mov oT3.xyz, v2
mov oT4.xyz, v3         ; Pass in transposed tangent space vectors

Pixel Shader Code

The pixel shader version used has to be able to output multiple color values; therefore, pixel shader 2.0 or higher is required. The model texture coordinates are used to sample the pixel diffuse color and normal from their respective textures. The world space coordinates are directly stored into the position MRT. Finally, the transposed tangent space vectors are used to transform the sampled normal before storing it into the normal MRT.

;-------------------------------------------------------------------
; Constants specified by the app
; c0-c3 = World transformation matrix for model
;-------------------------------------------------------------------
ps.2.0

; Samplers
dcl_2d s0               ; Diffuse map
dcl_2d s1               ; Normal map

; Texture coordinates
dcl t0.xy               ; Texture coordinates for diffuse and normal map
dcl t1.xyz              ; World-space position
dcl t2.xyz              ; Binormal
dcl t3.xyz              ; Tangent
dcl t4.xyz              ; Normal (Transposed tangent space vectors)

; Constants
def c30, 1.0, 2.0, 0.0, 0.0
def c31, 0.2, 0.5, 1.0, 1.0

; Texture sampling
texld r2, t0, s1        ; r2 = Normal vector from normal map
texld r3, t0, s0        ; r3 = Color from diffuse map

; Store world-space coordinates into MRT#0
mov oC0, t1             ; Store pixel position in MRT#0

; Convert normal to signed vector
mad r2, r2, c30.g, -c30.r       ; r2 = 2*(r2 - 0.5)

; Transform normal vector from tangent space to model space
dp3 r4.x, r2, t2
dp3 r4.y, r2, t3
dp3 r4.z, r2, t4        ; r4.xyz = model space normal

; Transform model space normal vector to world space. Note that only
; the rotation part of the world matrix is needed.
; This step is not required for static models if their
; original model space orientation matches their orientation
; in world space. This would save 3 instructions.
m4x3 r1.xyz, r4, c0

; Convert normal vector to fixed point
; This is not required if the destination MRT is float or signed
mad r1, r1, c31.g, c31.g        ; r1 = 0.5*r1 + 0.5

; Store world-space normal into MRT#1
mov oC1, r1

; Store diffuse color into MRT#2
mov oC2, r3

We've already established that all models rendered with an identity rotation in their world matrix do not need to have their normals further transformed by this matrix. As shown above, skipping this step would save three pixel shader instructions. However, this implies using two different shaders, one for dynamic models and another for static ones. As these are usually not rendered in a defined order, the amount of swapping between the two pixel shaders could be excessive. Using extended pixel shader 2.0 (ps_2_x) or pixel shader 3.0, static flow control can be used to determine if transformation with the world matrix is needed.

Note that all texture filtering, such as trilinear or anisotropic, need only be performed in the building pass; the shading passes will directly access the already-filtered pixels from the MRT surfaces that we have rendered (using point sampling).

NOTE It is possible to store extra components into the attribute buffer using some form of compression. For instance, bias and scale operations would allow two 16-bit integer components to be stored in a 32-bit component. Pixel shader instructions are required to pack and unpack the data, but it enables more elements to be stored in case there are not enough outputs available.

Shading Passes

Each shading pass needs only to send a screen-aligned quad so that the shading calculations affect the entire render surface. The quad's texture coordinates have to be set up so that all pixels in the quad reference our MRT data at a 1:1 mapping ratio. Direct3D's sampling rules stipulate that an offset is required in order to achieve a perfect texel-to-pixel mapping. Given the width (W) and height (H) of the back buffer, the vertices forming the full-screen quad need to be set up in the following way:

Figure 2: Setting up the vertex structure for a screen-aligned quad

POSscreen is the vertex position in screen space coordinates. TEX represents the 2D texture coordinates to use for this vertex.

NOTE The same 1:1 mapping ratio is achieved by offsetting screen space positions by -0.5 and setting texture coordinates to (0,0), (1,0), (0,1), and (1,1).

Because the contents of the MRTs already represent visible pixels, there is no need to enable depth buffering during the shading passes, saving valuable memory bandwidth.
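A minimal C++ sketch of such a quad, following the variant described in the NOTE above (positions offset by -0.5, texture coordinates left at (0,0)-(1,1)), is shown below. It uses a pre-transformed FVF quad for brevity; the article itself uses a POSITIONT vertex declaration (listed next), so the struct name and the FVF path here are illustrative assumptions only.

#include <d3d9.h>

// Screen-aligned quad with pre-transformed positions offset by -0.5 so that
// texels map to pixels 1:1 when sampling the MRT textures. Sketch only.
struct QUADVERTEX
{
    float x, y, z, rhw;     // D3DFVF_XYZRHW position
    float u, v;             // texture coordinates into the MRTs
};

void DrawFullScreenQuad(IDirect3DDevice9* pDevice, float W, float H)
{
    const QUADVERTEX quad[4] =
    {
        { -0.5f,    -0.5f,    0.0f, 1.0f, 0.0f, 0.0f },
        { W - 0.5f, -0.5f,    0.0f, 1.0f, 1.0f, 0.0f },
        { -0.5f,    H - 0.5f, 0.0f, 1.0f, 0.0f, 1.0f },
        { W - 0.5f, H - 0.5f, 0.0f, 1.0f, 1.0f, 1.0f }
    };

    pDevice->SetFVF(D3DFVF_XYZRHW | D3DFVF_TEX1);
    pDevice->DrawPrimitiveUP(D3DPT_TRIANGLESTRIP, 2, quad, sizeof(QUADVERTEX));
}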

Vertex Shader Code

No vertex processing is required, since screen space coordinates are sent directly to the pixel shader. To skip vertex processing, the vertex declaration must include the D3DDECLUSAGE_POSITIONT definition. In this case, only the transformed position and the texture coordinates are required, thus the vertex declaration is:

D3DVERTEXELEMENT9 declTPositionUV[] =
{
    { 0,  0, D3DDECLTYPE_FLOAT4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITIONT, 0},
    { 0, 16, D3DDECLTYPE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD,  0},
    D3DDECL_END()
};
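As a short usage note (not part of the article's listings), such a declaration would typically be created once and bound before drawing the full-screen quads; the function names below are made up for illustration.

#include <d3d9.h>

// declTPositionUV is the array defined just above in the text.
extern D3DVERTEXELEMENT9 declTPositionUV[];
IDirect3DVertexDeclaration9* g_pQuadDecl = NULL;

void InitQuadDeclaration(IDirect3DDevice9* pDevice)
{
    // Create once at startup.
    pDevice->CreateVertexDeclaration(declTPositionUV, &g_pQuadDecl);
}

void BindQuadDeclaration(IDirect3DDevice9* pDevice)
{
    // Bind before rendering the shading-pass quads; with POSITIONT the
    // positions are treated as already transformed.
    pDevice->SetVertexDeclaration(g_pQuadDecl);
}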

Pixel Shader Code

The calculations to be performed during the shading passes obviously depend on the needs of the application. It is up to the programmer to decide what lighting or special effects can be applied with the data available in the attribute buffer. The actual shading algorithms are not affected by their transition to a deferred shading context.

The following pixel shader is based on Phong lighting and implements diffuse and specular shading with distance attenuation. Each active light's contribution is accumulated by additively blending a full-screen quad onto the frame buffer. A number of optimizations can be implemented in this shader, but for simplicity they are not shown here. For example, another lookup texture could be used to replace the 3-slot pow instruction, or a simpler falloff model could be used, etc. A cube normalization map is used for vector normalization, although the nrm instruction is an alternative choice. The falloff texture is a simple lookup texture containing the light attenuation based on the pixel's distance from the light divided by its maximum range (texture clamping is used).

;-------------------------------------------------------------------
; Constants specified by the app
; c0  : light position in world space
; c8  : camera position in world space
; c22 : c22.a = 1/(light max range), c22.rgb = 1.0f
;-------------------------------------------------------------------
ps.2.0

; Samplers
dcl_2d s0               ; MRT#0 = Pixel position in world space
dcl_2d s1               ; MRT#1 = Pixel normal vector
dcl_2d s2               ; MRT#2 = Pixel diffuse color
dcl_2d s3               ; Falloff texture
dcl_cube s4             ; Cube normalization texture map

; Texture coordinates
dcl t0.xy               ; Quad screen-space texture coordinates

; Constants
def c20, 0.5, 2.0, -2.0, 1.0
def c21, 8.0, -0.75, 4.0, 0.0

; Retrieve property buffer data from MRT textures
texld r0, t0, s0        ; r0.xyz = Pixel world space position
texld r2, t0, s1        ; r2.xyz = Pixel normal vector
texld r3, t0, s2        ; r3.rgb = Pixel color

; Convert normal to signed vector
; This is not required if the normal vector was stored in a signed
; or float format
mad r2, r2, c20.y, -c20.w       ; r2 = 2*r2 - 1

; Calculate pixel-to-light vector
sub r1.xyz, c0, r0              ; r1 = Lpos - Vpos
mov r1.w, c20.w                 ; Set r1.w to 1.0
nrm r4, r1                      ; Normalize vector (r4.w = 1.0/distance)

; Compute diffuse intensity
dp3 r5.w, r4, r2                ; r5.w = (N.L)

; FallOff
rcp r6, r4.w                    ; r6 = 1/(1/distance) = distance
mul r6, r6, c22.a               ; Divide by light max range
texld r6, r6, s3                ; Sample falloff texture

; Compute halfway vector
sub r1.xyz, c8, r0              ; Compute view vector V (pixel to camera)
texld r1, r1, s4                ; Normalize vector with cube map
mad r1, r1, c20.y, -c20.w       ; Convert vector to signed format
add r1, r1, r4                  ; Add view and light vector
texld r1, r1, s4                ; Normalize half angle vector with cube map
mad r1, r1, c20.y, -c20.w       ; Convert to signed format

; Compute specular intensity
dp3_sat r1.w, r1, r2            ; r1.w = sat(H.N)
pow r1.w, r1.w, c21.r           ; r1.w = (H.N)^8

; Set specular to 0 if pixel normal is not facing the light
cmp r1.w, r5.w, r1.w, c21.w     ; r1.w = ( (N.L)>=0 ) ? (H.N)^8 : 0

; Output final color
mad r0, r3, r5.w, r1.w          ; Modulate diffuse color and diffuse
                                ; intensity and add specular
mul r0, r0, r6                  ; Modulate with falloff
mov oC0, r0                     ; Output final color

Advanced Shading Passes

The simple lighting pass shown above is only the beginning of what can be achieved with deferred shading. Because shading passes are no longer geometry-dependent and are only applied to visible geometry, their utilization is optimal. This performance saving can be used to implement more complex shaders or effects.


Better Lighting

A "real" specular calculation based on the camera reflection vector is straightforward to implement, since both the pixel and the camera position are known. That is, instead of calculating (H.N), the light reflection vector around the normal, given by R = 2*(N.L)*N - L, can be calculated and used in a dot product operation with the view vector. The specular properties of each pixel could be stored in the attribute buffer (e.g., using the alpha of the diffuse texture in our implementation) and used as the power of the specular calculation.

Different types of light can be implemented, from directional and point lights to spotlights or custom-shaped lights (light volumes are discussed later in this article). More complex light attenuation models can be implemented using math instructions inside the pixel shader instead of a texture lookup. Light maps can be used for custom-shaped lights or when there is a need to restrict light contribution to a defined area. The demo on the CD offers some implementations of the effects mentioned above.

Extended Attribute Buffer

Providing memory storage and bandwidth are not limiting factors (see later in this article for a discussion about bandwidth considerations), the attribute buffer can be used to store additional data relevant to the scene properties, allowing more complex effects to be implemented later on in the shading passes.

In the example shown in this article, the material properties are simply approximated to a diffuse color issued from each model's diffuse map. Other properties, like specular maps, specular power maps, detail maps, tangent vectors for anisotropic lighting, a Fresnel term, BRDF data, etc., could be stored in the attribute buffer. There are potentially so many material properties to store that there might not be enough space to accommodate them. Various tricks can be used to effectively compress as much data as possible in a reasonably allocated MRT space, like storing material IDs, using volume textures, etc.

Shadows

Deferred shading is fully compatible with the rendering of shadows within the scene.

Let's consider stencil shadows (as used in the demo). In the traditional rendering case (e.g., Doom III-style rendering), each shadow-casting light requires the scene geometry to be submitted (on top of shadow volumes) so that only non-shadowed areas are affected by the lighting calculations. When a high number of lights are used, the number of geometry passes quickly becomes overwhelming. With deferred shading, only the shadow volumes and a full-screen quad per light are required:

Building pass
For each light:
    Clear stencil buffer, disable color writes
    Render shadow volumes onto stencil buffer
    Enable color writes, stencil test passes for non-shadowed areas
    Light pass

Shadow maps are also compatible with deferred shading. Since the world space pixel position can be stored in the attribute buffer, the distance from the pixel to the current light can be calculated in the shading pass. It can then be compared to the depth value in the shadow map to determine if the pixel is in shadow or not. This technique is sometimes referred to as forward shadow mapping.
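To make the outline above a little more concrete, here is a hedged C++ sketch of the per-light loop. The exact stencil configuration (z-pass vs. z-fail volumes, two-sided stencil, increment/decrement operations) depends on the shadow volume variant used, so the states below are only one plausible arrangement, not the demo's actual code; RenderShadowVolumes and RenderLightPass are hypothetical helpers.

#include <d3d9.h>
#include <vector>

struct Light;                                               // application-defined
void RenderShadowVolumes(IDirect3DDevice9*, const Light&);  // hypothetical helper
void RenderLightPass(IDirect3DDevice9*, const Light&);      // hypothetical helper

void RenderShadedLights(IDirect3DDevice9* pDevice, const std::vector<Light>& lights)
{
    for (size_t i = 0; i < lights.size(); ++i)
    {
        // Clear stencil buffer, disable color writes.
        pDevice->Clear(0, NULL, D3DCLEAR_STENCIL, 0, 1.0f, 0);
        pDevice->SetRenderState(D3DRS_COLORWRITEENABLE, 0);

        // Render shadow volumes onto the stencil buffer.
        pDevice->SetRenderState(D3DRS_STENCILENABLE, TRUE);
        RenderShadowVolumes(pDevice, lights[i]);

        // Enable color writes; stencil test passes only for non-shadowed
        // areas (reference 0 = no shadow volume covered this pixel).
        pDevice->SetRenderState(D3DRS_COLORWRITEENABLE,
                                D3DCOLORWRITEENABLE_RED | D3DCOLORWRITEENABLE_GREEN |
                                D3DCOLORWRITEENABLE_BLUE | D3DCOLORWRITEENABLE_ALPHA);
        pDevice->SetRenderState(D3DRS_STENCILFUNC, D3DCMP_EQUAL);
        pDevice->SetRenderState(D3DRS_STENCILREF, 0);
        pDevice->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_KEEP);

        // Light pass: full-screen quad (or projected volume) with the
        // deferred lighting pixel shader.
        RenderLightPass(pDevice, lights[i]);
    }

    pDevice->SetRenderState(D3DRS_STENCILENABLE, FALSE);
}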

Higher Color Range

DirectX 9 does not allow blending into floating-point render targets. Because of this limitation, alpha blending cannot be used to accumulate high dynamic range lighting calculation results into a destination frame buffer in float format. With fixed-point render targets, color channels are automatically clamped to 1.0 during the blending process, so if a higher color range is not needed, it is more economical and straightforward to blend all shading contributions into a fixed-point frame buffer. High color range effects, however, require an alternative solution.

The most obvious workaround to this limitation is to ignore the hardware blending capabilities and perform alpha blending in the pixel shader manually. The idea is to alternate between two floating-point render targets; one contains the current contents of all shading contributions prior to the current pass and is set as a texture input to the pixel shader, while the other is set as the destination render target and receives the new color values updated with the contributions of the current pass.
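A minimal sketch of this ping-pong scheme is shown below. It assumes two D3DFMT_A16B16G16R16F render target textures created elsewhere; the array name, the sampler index, and the function name are illustrative only and not taken from the article.

#include <d3d9.h>

// Alternate between two floating-point render targets to emulate additive
// blending in the shader. g_pAccum[] holds the two accumulation textures.
extern IDirect3DTexture9* g_pAccum[2];

void BeginShadingPass(IDirect3DDevice9* pDevice, int pass)
{
    const int src = pass & 1;    // previous accumulation result
    const int dst = src ^ 1;     // target for this pass

    // Previous contributions are read as a texture...
    pDevice->SetTexture(4, g_pAccum[src]);

    // ...while the other surface receives "previous + current light".
    IDirect3DSurface9* pTarget = NULL;
    g_pAccum[dst]->GetSurfaceLevel(0, &pTarget);
    pDevice->SetRenderTarget(0, pTarget);
    pTarget->Release();

    // The pixel shader then outputs tex(previous) + lighting result instead
    // of relying on alpha blending, which is unavailable for floating-point
    // targets in DirectX 9.
}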

Translucency

In traditional rendering, alpha-blended polygons require a rendering pass of their own. With deferred shading, alpha polygons requiring shading (e.g., stained-glass objects) also need separate processing, although punch-through polygons (i.e., alpha-tested) can be sent with all other opaque triangles during the building pass. This is because the attribute buffer can only store properties for a single pixel; thus, any translucent pixels rendered on top of these would need to be stored elsewhere. Unless additive blending is used, real translucent pixels (using an order-dependent blending mode like SRCALPHA-INVSRCALPHA) need to be blended with the back buffer only once to avoid repeated contributions from themselves. Hence, all shading passes need to be performed on the translucent pixels before any blending with the background takes place. Another problem is overlapping translucency; any translucent pixel rendered on top of a previously rendered translucent pixel will need to take into account the new pixel color at this location (i.e., resulting from the previous blending). These two points are of paramount importance, since ignoring them can cause visual errors in the resulting render.

Unfortunately, there is no easy way to overcome these problems. For a real-time application relying on performance, the best idea is probably to avoid using translucent objects that require lighting (light-independent translucent objects like explosions, laser beams, smoke, etc., are unaffected because they typically do not need to be shaded). One could attempt to concatenate all shading passes in one large shader and gross-sort the translucent triangles from back to front using this shader. However, this might not be practical because of the sampler and instruction limits in the pixel shader. Furthermore, this might not be compatible with some shadowing techniques like stencil buffering. If lit translucent objects are a requirement of your scene and you are prepared to "go all the way," then a solution is depth peeling (i.e., shading each "layer" of translucency separately before blending it with the frame buffer). In practice, the additional memory, performance, and complexity caused by depth peeling do not make it a very attractive solution to the translucency problem inherent in deferred shading.

Deferred Shading Optimizations

MRT Configuration and Bandwidth

Care must be taken when choosing the format and bit depth of the MRTs to use with deferred shading. With DirectX 9 support for IEEE float formats, one might be tempted to use four MRTs in the D3DFMT_A32B32G32R32F format for the added precision and robustness they bring. However, not only will the memory requirements be considerable (at a 1024x768 screen resolution, this corresponds to a total of 48MB), but the required memory bandwidth for each frame is also likely to cause a performance bottleneck.

Let’s examine in detail how much memory bandwidth deferred shading<br />

requires. We define the following variables and assume some practical values for<br />

them:<br />

Variable<br />

Name<br />

Description Values Justification<br />

W, H Render width, height 1024, 768 Typical rendering resolution<br />

Z BPP, BB BPP<br />

Depth/stencil buffer bit depth, back<br />

buffer bit depth<br />

32 bpp,<br />

32 bpp<br />

Typical depth/back buffer bit<br />

depth<br />

Overdraw Average overdraw per pixel 3 Average overdraw<br />

Sorting Average number of pixels<br />

arranged in a back to front order.<br />

In the worst case where all pixels<br />

are ordered back to front this value<br />

equals Overdraw; in the best case<br />

it will be equal to 1:<br />

1�Sorting�Overdraw.<br />

TBPP Average texture bit depth (e.g., if<br />

half of textures are 32bpp and<br />

other half is 16bpp, then TBPP=24) Section II — Rendering Techniques<br />

Deferred Shading with Multiple Render Targets<br />

1.5 We assume that half of<br />

overdrawn pixels are drawn<br />

back to front.<br />

24 bpp Mix of compressed, 32bpp<br />

and 32bpp+ textures<br />

263


Section II — Rendering Techniques<br />

264 Deferred Shading with Multiple Render Targets<br />

Variable<br />

Name<br />

TB, TS Description Values Justification<br />

Average number of texture lookups<br />

in building pass/shading pass pixel<br />

shader<br />

n Number of shading passes to<br />

perform<br />

2, 4 Two texture lookup in<br />

building pass (e.g., diffuse<br />

and normal map), four in<br />

shading pass (e.g., light<br />

maps, cube normalization<br />

map, etc.)<br />

8 Average of eight full-screen<br />

shading passes<br />

nMRT Number of MRT surfaces: 1�nMRT�4 - Variable<br />

MRTBPP MRT bit depth - Variable<br />

Table 1: Description of variables used in bandwidth calculation<br />

Assumptions

Because of different 3D acceleration implementations among graphics adapters, it can be difficult to model a universal bandwidth equation. However, by making a number of assumptions about the rendering environment, we can get close to an accurate result. Firstly, we assume the target 3D accelerator is an immediate mode renderer with some form of early Z support (i.e., the depth test is performed before the pixel shader — we can reasonably expect this optimization from all new DX9 accelerators). Secondly, we assume a texture sample needs only a single texel fetch, regardless of texture filtering. Finally, we ignore any form of cache effectiveness.

Let's see how much memory needs to be transferred across the bus for a single frame using deferred shading. We start by analyzing each feature requiring memory traffic during the building pass:

Depth/Stencil:     W × H × Z_BPP × (Overdraw + Sorting)
Back Buffer:       0 (the back buffer is neither read nor written during the building pass)
Textures:          W × H × T_BPP × T_B × Sorting
Geometry buffers (vertex and index buffers): C_Geometry (constant value)
MRTs:              n_MRT × W × H × MRT_BPP × Sorting

Adding these together, we get:

Memory_OneFrame = W × H × [Z_BPP × (Overdraw + Sorting) + T_BPP × T_B × Sorting + n_MRT × MRT_BPP × Sorting] + C_Geometry    (4)

Let’s now examine how much memory needs to be transferred across the bus<br />

during the n shading passes.<br />

Depth/Stencil: 0 (depth buffer disabled)<br />

Back Buffer: 2 (R/W)�W�H�BB BPP�n<br />

Textures: W�H�T S�T BPP�n


Geometry buffers: 0<br />

MRTs: nMRT�W�H�MRTBPP�n Adding these together we get:<br />

MemoryOneFrame=W�H�n��� (R/W)�BBBPP+TS�TBPP+nMRT�MRTBPP� (5)<br />

By adding the amounts of external memory accesses to perform during the building<br />

and the n shading passes and multiplying by a desired frame rate of 60 fps, we<br />

obtain the following bandwidth formula:<br />

� �MRT<br />

BPP �nMRT �( Sorting �n)<br />

� �<br />

� �<br />

�<br />

�<br />

� ��ZBPP<br />

�(<br />

Overdraw<br />

� Sorting)<br />

Bandwidth60 fps � W � H �<br />

�<br />

�<br />

� �<br />

� CGeometry �TBPP �TB�Sorting �<br />

��<br />

60Bytes / Sec<br />

� �<br />

�<br />

�<br />

�<br />

� ��<br />

�n��2� BBBPP �TS�TBPP� ��<br />

�<br />

�<br />

Practical Example

Using our practical values (with MRT_BPP expressed in bytes per pixel), the bandwidth equation reduces to approximately:

Bandwidth_60fps ≈ 0.05 × (MRT_BPP × n_MRT × 9.5 + 205) + 1 GBytes/Sec

A gross approximation of the end result is:

Bandwidth_60fps ≈ (n_MRT × MRT_BPP) / 2 + 10 GBytes/Sec    (7)

Here are some examples of bandwidth values for a defined number of MRTs and their bit depth for the selected variables:

             MRT_BPP = 32 bpp    MRT_BPP = 64 bpp    MRT_BPP = 128 bpp
n_MRT = 2    14 GBytes/Sec       18 GBytes/Sec       26 GBytes/Sec
n_MRT = 3    16 GBytes/Sec       22 GBytes/Sec       34 GBytes/Sec
n_MRT = 4    18 GBytes/Sec       26 GBytes/Sec       42 GBytes/Sec

Table 2: Bandwidth figures for various MRT configurations with the selected variables

This table clearly shows the overwhelming bandwidth requirements when an unreasonable MRT configuration is chosen. Of course, other factors influence the bandwidth requirements (notably the number of shading passes to perform), but overall it is wiser to select the minimum possible number and bit depth of MRTs that can accommodate the stored data.
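Equation (6) can also be evaluated programmatically when experimenting with different configurations. The small helper below is only a calculator for the formula as reconstructed above, under the stated assumptions; it is not code from the article, and it deliberately keeps C_Geometry as an explicit parameter.

// Evaluate the per-frame bandwidth estimate of Equation (6) at 60 fps.
// All bit depths are given in bits per pixel and converted to bytes here;
// cGeometry is the constant geometry traffic in bytes per frame.
double EstimateBandwidthGBps(double W, double H,
                             double zBpp, double bbBpp, double texBpp,
                             double overdraw, double sorting,
                             double texBuild, double texShade,
                             double nPasses, double nMRT, double mrtBpp,
                             double cGeometry)
{
    const double zBytes   = zBpp   / 8.0;
    const double bbBytes  = bbBpp  / 8.0;
    const double texBytes = texBpp / 8.0;
    const double mrtBytes = mrtBpp / 8.0;

    const double perPixel =
          zBytes   * (overdraw + sorting)                   // depth/stencil, building pass
        + texBytes * texBuild * sorting                     // textures, building pass
        + mrtBytes * nMRT * (sorting + nPasses)             // MRT writes + reads
        + nPasses  * (2.0 * bbBytes + texShade * texBytes); // back buffer R/W + textures, shading

    const double bytesPerSecond = (W * H * perPixel + cGeometry) * 60.0;
    return bytesPerSecond / 1.0e9;   // GBytes/Sec
}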


Optimized MRT Configuration

The implementation described in this article stores position, normal, and diffuse color in MRTs, totaling 64+32+32=128 bits of data. Providing the 3D device supports the D3DPMISCCAPS_MRTINDEPENDENTBITDEPTHS cap, this data can be rearranged in an optimized configuration so that memory bandwidth and footprint are reduced. Consider the following configuration of four MRTs:

• MRT#0: D3DFMT_G16R16F: Store X, Y position in the Red and Green channels
• MRT#1: D3DFMT_R16F: Store Z position in the Red channel
• MRT#2: D3DFMT_A8R8G8B8: Store diffuse color in RGB, normal Z in A
• MRT#3: D3DFMT_A8L8: Store normal X in A, Y in L

This equates to a total of 96 bits. Note that the 3D device has to support these formats as render targets for this configuration to be valid. The pixel shader code in the building and shading passes needs to be adjusted to write and read the MRT data into and from their appropriate components.
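The render target support mentioned above can be verified with the standard IDirect3D9::CheckDeviceFormat query. The loop below is only a sketch of such a check (the function name and the HAL/default-adapter choices are assumptions), not code from the article.

#include <d3d9.h>

// Verify that each format of the packed MRT configuration is usable as a
// render target texture on the current adapter. Sketch only.
bool SupportsPackedMRTConfig(IDirect3D9* pD3D, D3DFORMAT adapterFormat)
{
    const D3DFORMAT mrtFormats[4] =
    {
        D3DFMT_G16R16F,     // MRT#0: X, Y position
        D3DFMT_R16F,        // MRT#1: Z position
        D3DFMT_A8R8G8B8,    // MRT#2: diffuse RGB + normal Z
        D3DFMT_A8L8         // MRT#3: normal X, Y
    };

    for (int i = 0; i < 4; ++i)
    {
        if (FAILED(pD3D->CheckDeviceFormat(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
                                           adapterFormat, D3DUSAGE_RENDERTARGET,
                                           D3DRTYPE_TEXTURE, mrtFormats[i])))
            return false;
    }
    return true;
}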

Using 2D Shapes to Optimize Shading Passes

Sending full-screen-aligned quads for all shading passes results in all the screen pixels being processed by the pixel shader. Depending on the number of passes to perform, the screen resolution, and the shader complexity, this can lead to excessive and costly pixel processing. By knowing the boundaries at which a light or special effect ceases to contribute to the scene, only the screen pixels that are inside those limits need to be sent (e.g., for a point light, this would be a 2D-projected sphere whose radius is the maximum range of the light). Explosions, cone lights, etc., can also benefit from using 2D shapes during the shading passes.

Projected Volumes

Unless the screen coordinates to which an effect is to be applied are already known (HUD, part-screen filters, etc.), in most cases the 2D shapes to deal with will be projections of 3D volumes into screen coordinates. This is required so that world space units can be correctly transformed and mapped into screen space.

Let's consider a simple point light of maximum range MaxRange. Instead of sending a full-screen quad during the shading pass, a sphere of radius MaxRange is transformed and rasterized. This results in only the pixels inside the sphere being affected by the light and thus processed. Note that the pixel shader's falloff calculation must match the range used by the volume (i.e., pixels whose distance from the light source is greater than MaxRange have a falloff of zero); otherwise the difference in intensity between pixels inside and outside the sphere will be clearly visible.

Sending a volume will have the same visual result as sending a full-screen quad but without the overhead of processing out-of-bounds pixels. Color Plate 14 demonstrates this idea of processing only those pixels that are affected by the light.


Back Faces Only

Because the camera could be located inside the volume, it is important to render the back faces of the volume only. Providing the volume is closed, is convex, and does not intersect the far clip plane, this will ensure the projected shape is always visible on the screen (for non-convex light volumes, a convex bounding volume should be used). If the culling order were not reversed for those volumes, then the projected shapes would only be correct if the camera were outside the volumes. Rendering back faces only ensures the projected shape is correct regardless of the camera position. Note that turning back-face culling off completely is not a solution, because some screen pixels would end up being processed twice.

Because we are interested in the intersection of the light area with the visible pixels in the scene, a further optimization to this technique is to only render the back-facing pixels of a volume whenever the normal depth buffer visibility test fails. This can be achieved by simply inverting the Z test when rendering the volume (e.g., using D3DCMP_GREATER instead of D3DCMP_LESSEQUAL).
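The render state setup for such a volume pass might look like the sketch below. Note that the depth test is re-enabled here (read-only, inverted) purely for this volume optimization, whereas the article's plain full-screen passes disable depth buffering entirely; the additive blend states, the function name, and the assumption of the default clockwise front-face winding are illustrative choices, not taken from the article.

#include <d3d9.h>

// States for drawing a light volume: rasterize back faces only, use an
// inverted read-only depth test, and accumulate the light additively.
void SetLightVolumeStates(IDirect3DDevice9* pDevice)
{
    pDevice->SetRenderState(D3DRS_CULLMODE, D3DCULL_CW);      // cull front faces -> back faces only
    pDevice->SetRenderState(D3DRS_ZENABLE, D3DZB_TRUE);
    pDevice->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);       // never write depth in shading passes
    pDevice->SetRenderState(D3DRS_ZFUNC, D3DCMP_GREATER);     // inverted test: pass where scene geometry
                                                              // lies in front of the volume's back face
    pDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);    // additively accumulate the light contribution
    pDevice->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_ONE);
    pDevice->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_ONE);
}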

Mapping Volume Texture Coordinates to Screen Space

Calculating screen space texture coordinates for the 2D projection of a volume is more complicated than for an already-transformed full-screen quad. Given the homogeneous clipping coordinates x_H, y_H, z_H, w_H and the back buffer dimensions Width and Height, the equations to retrieve the screen-space texture coordinates u_S and v_S are:

u_S = \frac{x_H}{w_H} \cdot \frac{1}{2} + \frac{1}{2} + \frac{1}{2 \cdot Width}
v_S = -\frac{y_H}{w_H} \cdot \frac{1}{2} + \frac{1}{2} + \frac{1}{2 \cdot Height}    (8)

Although the vertex shader could pass the homogeneous clipping coordinates directly to the pixel shader, a sensible optimization is to precalculate the texture coordinates in the vertex shader so that the pixel shader need only perform the projective divide. Conveniently, the latter can be obtained during the sampling process by a projected texture lookup. The equations become:

u_S = \frac{1}{w_H} \left( \frac{x_H}{2} + \frac{w_H}{2} + \frac{w_H}{2 \cdot Width} \right)
v_S = \frac{1}{w_H} \left( -\frac{y_H}{2} + \frac{w_H}{2} + \frac{w_H}{2 \cdot Height} \right)    (9)

Vertex Shader Code

;-------------------------------------------------------------------
; Constants specified by the app
; c0-c3 = Global transformation matrix (World*View*Projection)
; c9    = 1.0f/(2.0f*dwScreenWidth), 1.0f/(2.0f*dwScreenHeight)
;
; Vertex components
; v0 = Vertex Position
;-------------------------------------------------------------------
vs.2.0

dcl_position v0         ; Vertex position

def c8, 0.5, -0.5, 0.0, 0.0

; Vertex transformation
m4x4 r0, v0, c0         ; Transform vertices by WVP matrix
mov oPos, r0            ; Output position

; Compute texture coordinates
mul r0.xy, r0, c8               ; x/2, -y/2
mad r0.xy, r0.w, c8.x, r0       ; x/2 + w/2, -y/2 + w/2
mad r0.xy, r0.w, c9, r0         ; x/2 + w/2 + w/(2*Width),
                                ; -y/2 + w/2 + w/(2*Height)
mov oT0, r0             ; Output texture coordinates

Pixel Shader Code

The iterated texture coordinates issued from the vertex shader are divided by wH and used as texture coordinates to sample the MRT textures using a projected texture lookup.

ps_2_0

; Texture coordinates
dcl t0.xyzw        ; iterated texture coordinates

; Samplers
dcl_2d s0          ; MRT#0 = World space position
dcl_2d s1          ; MRT#2 = World space normal vector
dcl_2d s2          ; MRT#3 = Pixel Diffuse Color

; Projected texture lookup into MRT textures
texldp r0, t0, s0  ; r0 = world space position
texldp r1, t0, s1  ; r1 = world space normal
texldp r2, t0, s2  ; r2 = Diffuse map color

NOTE: With pixel shader 3.0 support, the process of retrieving texture coordinates for a projected shape is simpler; the position register (which contains the x, y screen position of the pixel currently being processed) can be used with some scaling and biasing to perform this calculation directly.

Using shapes is better suited for smaller lights and special effects with limited range. For bright lights or global effects likely to affect the entire render, a full-screen quad is a better solution.


CD Demo

The demo on the companion CD shows a scene using deferred shading (see Color Plate 15). A total of nine lights are used — one point light affecting the entire screen ("full-screen" light), two point lights rendered using projected sphere shapes, and six cone lights rendered using projected cone shapes. A DX9-class 3D accelerator with vertex shader 2.0/pixel shader 2.0 and MRT support is required to run this demo.

The application can be controlled using menu options (press Alt to bring up the menu in full-screen mode). The latest version of the demo is available on the PowerVR Developer Relations web site at www.pvrdev.com.

Summary

This article described deferred shading and showed how to implement this technique using multiple render targets in DirectX 9. The two-phase process of the algorithm was detailed with shader code examples as supportive material. Some advanced deferred shading effects were proposed, and the robustness and overall simplicity of the technique should encourage graphics programmers to invent their own ideas for even better usage of this rendering algorithm. Finally, performance and memory footprint considerations were discussed, and optimizations were suggested to improve the effectiveness of the technique.


Meshuggah's Effects Explained

Carsten Wenzel

Before We Start...

What exactly is Meshuggah? It's the name of my final year project at university and was released in March 2001 at http://meshuggah.4fo.de. It uses DirectX 8.1 class hardware to render a number of shader effects. All of them are of a different flavor and range from simple to fairly advanced. Meshuggah has two modes of operation. There's a demo mode where all effects run in a sequence and are synchronized to a music track. The other mode is the interactive browser mode, which allows you to view each effect individually and tweak all of its parameters. One of the main goals while developing those shader effects was eye candy. This means that the effects described herein are not pure vertex or pixel shader tricks. Rather, they combine various techniques with shader technology to produce something "visually stunning." Quite a few of them were heavily inspired by some great productions released by different demo scene groups during the last couple of years. Others are my take on research results published in various papers.

You are reading these words because Wolfgang Engel somehow came across Meshuggah's web site (around the time Direct3D ShaderX was hitting the shelves) and sent me an e-mail asking whether I'd like to contribute to the successor.

I hope the following pages offer some useful and interesting stuff to shader fanatics — Xbox and PC game developers alike. I'm sure there are still a handful of developers targeting DirectX 8.1. Before we delve into the details, please have a look at the demo on the companion CD to get an idea of what you can expect to find on the following pages. The demo comes with full source code for you to study and experiment with. It contains some changes made to the original version; specifically, these are minor fixes, speed-ups, and cleanups in shader and demo code.

Infinite Zoom on the z Plane

The subject of the first effect is a zoom into an "infinitely" detailed picture placed on the z plane. Instead of starting to blur or get blocky, the picture constantly reveals new detail. Ideally, hand-drawn pictures are used to render the zoom sequence, as shown in Contour by The Black Lotus and Spot by Exceed (both downloadable at [1]). Due to the author's lack of artistic skills, we use fractals (e.g., the Mandelbrot set) instead of hand-drawn art. Don't confuse it with a typical fractal zoom though; we don't evaluate the Mandelbrot set for each frame that we render. It's a strictly image-based rendering effect.

The basic algorithm works as follows. Given is a sequence of bitmaps for a predefined zoom path, each bitmap refining a certain area of the previous one (see Figure 1).

Figure 1: Two consecutive bitmaps of a zoom sequence

Some rules apply in order to maintain the illusion of a zoom.

• As mentioned before, each bitmap has to reside strictly within the previous one.
• Each bitmap should cover at least one quarter of the previous one. This way, new detail will always be available for the area we're zooming into, thus preventing blurry results.
• The bitmap resolution should at least match the screen resolution. This avoids blurry parts in the corners of a rendered frame.

Achieving a zoom effect is now just a matter of drawing these bitmaps properly. The bitmap sequence needs to be laid out on the z plane. Position and size — width and height if it's not a square — of the first bitmap are set to some default values (e.g., (0, 0) and 1). For the following bitmaps, position and size result from their absolute location within the first bitmap and how much of its area they occupy (see Figure 2).¹

¹ As we zoom into the Mandelbrot set, we can also derive position and size values from the complex coordinates used to render the bitmap sequence.


Figure 2: Possible layout of bitmaps for a zoom sequence

To render a frame of the zoom, we draw the sequence of bitmaps in ascending order with depth test and write being disabled. The position and size values for each bitmap are used to form a quad or rectangle, respectively. In fact, not all bitmaps need to be drawn. It is enough to begin with the bitmap having the highest index in the sequence that still fills the entire screen for the current zoom position of the camera.

The camera hovers over the z plane, looking straight down using a 90-degree field of view. Its position is determined by evaluating a spline function based on the key points of the zoom path for a given time t. Each key point maps directly to a bitmap. It describes the position of the camera necessary to see the associated bitmap entirely at full resolution. For a 90-degree field of view, the key point c_i for bitmap i placed at position p_i and having a size of s_i (or width w_i and height h_i) is as follows:

c_i.x = p_i.x + s_i · 0.5    or    c_i.x = p_i.x + w_i · 0.5
c_i.y = p_i.y + s_i · 0.5    or    c_i.y = p_i.y + h_i · 0.5
c_i.z = –(s_i · 0.5)         or    c_i.z = –(max(w_i, h_i) · 0.5)

Applying those formulas to the two bitmaps in Figure 2 yields c_1 = (0.5, 0.5, –0.5) and c_2 = (0.45, 0.62, –0.25).
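For illustration, the key point formulas for square bitmaps translate into a few lines of C++ (a sketch; the Vec3 and Bitmap types and their members are placeholders, not the demo's actual interface):

struct Vec3   { float x, y, z; };
struct Bitmap { float px, py, s; };    // position (px, py) and size s on the z plane

// camera key point from which bitmap i is seen entirely at full resolution (90-degree FOV)
Vec3 KeyPoint( const Bitmap& b )
{
    return Vec3 { b.px + 0.5f * b.s,   // c_i.x = p_i.x + s_i * 0.5
                  b.py + 0.5f * b.s,   // c_i.y = p_i.y + s_i * 0.5
                  -0.5f * b.s };       // c_i.z = -(s_i * 0.5)
}

Evaluating the zoom path spline through these key points then yields the camera position for any time t.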

The shaders used for rendering the zoom are straightforward. The vertex shader transforms each quad/rectangle into clip space and passes the texture coordinates on to the pixel shader. The pixel shader just fetches a color from a bitmap and stores it as the output color.

Meshuggah uses ten mipmapped textures with a resolution of 1024x1024 pixels to represent the sequence of bitmaps for a zoom. Each texture alone would take up 6MB if stored as a raw 32-bit ARGB texture. To save texture memory and bandwidth without compromising zoom depth, compressed textures are used. This way (using DXT1), we can cut down the required storage space to one-eighth of the original uncompressed textures. Obviously, all this isn't for free. Saving a texture in a compressed format will introduce artifacts that are more or less visible, depending on the texture's content. Also, compressed texture formats usually store colors at a lower depth (16 bits for the colors at either extreme of a compression block in DXT1), thus introducing color bleeding. In order to remedy this and improve texture quality, dithering should be performed when saving textures in a compressed format.

Anisotropic Lighting on Hair Strands

In this section, we generate hair strands that move around, bend under the force of gravity and inertia, and have anisotropic lighting applied to them. For some of you, it might be reminiscent of the hair effect in Lapsus by Maturefurk (also available for download at [1]).

Let's first talk about how physics are implemented in order to animate the hair. As mentioned above, the two forces used to animate the hair in this effect are gravity and inertia. Gravity is simply a constant vector for all hairs. Inertia is simulated by calculating the second derivative of a function describing each hair tip's current position.

What information is necessary to create hair strands? Well, assuming each hair starts at the origin and ends at some point on a unit sphere, we only need a unitized normal for the default hair direction (in case there is zero gravity and zero inertia). To get some variation, we also store the hair length as well as a 1D texture coordinate. The 1D texture coordinate will be used in conjunction with a 1D noise texture to model streaks running along the hair. Using randomized values to set up all components gives each hair an individual and more natural look.

It's time to generate the hairs. Each hair has a fixed number of joints for which a position needs to be calculated. We start at the origin and set the current hair direction to its default. For the rest of the joints, we sum up the current direction, the global gravity vector, and the hair's inertia vector to a new temporary direction vector. This temporary direction vector needs to be normalized. The position for each hair joint is then calculated by taking the position of the previous joint and adding the normalized temporary direction vector scaled by the hair length to it. The current hair direction is updated by setting it to the normalized temporary direction vector. This translates into the following code:

vCurPos = (0,0,0);
vCurDir = vHairDir;
for( int i( 0 ); i < cNumJoints; ++i )   // loop bound name is an assumption
{
    vCurDir = Normalize( vCurDir + vGravity + vInertia );   // new joint direction
    vJointPos[ i ] = vCurPos += vCurDir * fHairLength;      // advance by the hair length and store
}


We connect all adjacent joints by generating quads between them. In DirectX this can be done efficiently using triangle strips. Note that multiple hairs can be put into one big triangle strip by stitching them together to increase rendering performance. In order to generate a quad, we actually need two position vectors per joint. Otherwise, we would be generating lines — that is, triangles with an area of zero! By crossing the initial direction vector of each hair with the global gravity vector and normalizing the result, we get a vector that, when shortened by some amount (depending on how thick the hair should be), can be added to each joint's position vector to get the second position vector that we need. At this point, we should also prepare a 1D texture coordinate for both position vectors so we can form vertices that can be sent to the vertex shader later. The 1D texture coordinate for the first position vector is simply the one specified for each hair; the 1D texture coordinate for the second position vector is the sum of the first one and an arbitrary value, which is constant for all hairs. The bigger this arbitrary constant value, the thinner the streaks.
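A CPU-side sketch of this vertex setup might look as follows (the helper functions, vertex layout, and variable names are assumptions made for illustration; only the math mirrors the description above):

// side vector shared by all joints of one hair, scaled by the desired hair thickness
Vec3 vSide = Normalize( Cross( vHairDir, vGravity ) ) * fHairThickness;

for( int i = 0; i < cNumJoints; ++i )
{
    // two vertices per joint form one quad (two strip triangles) per hair segment;
    // the second vertex offsets the 1D texture coordinate to create the streaks
    AddStripVertex( vJointPos[ i ],         vJointDir[ i ], fHairTexCoord );
    AddStripVertex( vJointPos[ i ] + vSide, vJointDir[ i ], fHairTexCoord + fStreakOffset );
}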

Now that we have generated the geometry, we need to define an appropriate vertex and pixel shader to give the hair the proper look. Rendering the hair by applying normal Phong shading works, although the result looks rather plastic. Instead we take advantage of an anisotropic shading model to get the kind of lighting that we're after. [2] and [3] describe how this can be done in hardware using a 2D texture as a lookup table for diffuse and specular intensities. The following two formulas denote the diffuse and specular intensity for a given point P on a surface, with L being the vector from P to the light source, N the surface normal in P, T the tangent vector in P, V the vector from P to the viewer, and R the reflected light vector in P.

Diffuse intensity:   L · N = sqrt(1 – (L · T)²)
Specular intensity:  V · R = sqrt(1 – (L · T)²) · sqrt(1 – (V · T)²) – (L · T) · (V · T)

As you can see from those formulas, the only two values we need for the texture lookup are L·T and V·T. We pass the tangent vector for each joint of a hair (that is, the current hair direction) along with the position vector and the 1D texture coordinate to the vertex shader. Here we calculate L and V for each vertex and dot them with T to get the proper texture coordinates for a lookup in the anisotropic light map. Care must be taken to map L·T and V·T from [–1, 1] to [0, 1], since the result of a dot product for two normalized vectors is in [–1, 1] but the corresponding texture address range is [0, 1].
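To illustrate how such a light map could be filled on the CPU, here is a sketch under a few assumptions: a 256x256 table, diffuse intensity stored in the RGB channels and specular intensity in alpha (which is how the pixel shader below reads it), a specular exponent of 16, and a placeholder WriteTexel helper. The actual demo may use different values.

#include <algorithm>
#include <cmath>

void BuildAnisotropicLightMap()
{
    const int   cSize      = 256;
    const float cSpecPower = 16.0f;                               // assumed specular exponent
    for( int y = 0; y < cSize; ++y )                              // v axis: L.T
    {
        float LdotT = 2.0f * y / ( cSize - 1 ) - 1.0f;            // map [0, 1] back to [-1, 1]
        for( int x = 0; x < cSize; ++x )                          // u axis: V.T
        {
            float VdotT    = 2.0f * x / ( cSize - 1 ) - 1.0f;
            float diffuse  = sqrtf( std::max( 0.0f, 1.0f - LdotT * LdotT ) );
            float specular = diffuse * sqrtf( std::max( 0.0f, 1.0f - VdotT * VdotT ) )
                           - LdotT * VdotT;
            specular = powf( std::max( specular, 0.0f ), cSpecPower );
            WriteTexel( x, y, diffuse, specular );                // placeholder: RGB = diffuse, A = specular
        }
    }
}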

The following is a hair vertex shader:

#include "..\..\Effects\Hair\HairShaderConstants.h"

#define srcP    v0
#define srcT    v1
#define srcTex0 v2

#define V      r2
#define T      r0
#define L      r3
#define PWorld r1
#define Temp   r10
#define Temp1  r11

vs.1.1

// def CV_CONSTANTS, 0.0, 0.5, 1.0, 2.0

// compute world space position
dp4 PWorld.x, srcP, c[ CV_WORLD_0 ]
dp4 PWorld.y, srcP, c[ CV_WORLD_1 ]
dp4 PWorld.z, srcP, c[ CV_WORLD_2 ]
dp4 PWorld.w, srcP, c[ CV_WORLD_3 ]

// vector from vertex position to eye
add V, c[ CV_VIEWERPOS ], -PWorld
dp3 V.w, V, V
rsq V.w, V.w
mul V, V, V.w

// transform tangent into world space
dp3 T.x, srcT, c[ CV_WORLDIT_0 ]
dp3 T.y, srcT, c[ CV_WORLDIT_1 ]
dp3 T.zw, srcT, c[ CV_WORLDIT_2 ]

// normalize tangent
dp3 T.w, T, T
rsq T.w, T.w
mul T, T, T.w

// vector from vertex position to light
add L, c[ CV_LIGHTPOS ], -PWorld
dp3 L.w, L, L
rsq L.w, L.w
mul L, L, L.w

// generate texture coordinates for anisotropic lighting
// and map from [-1, 1] to [0, 1]
dp3 Temp.x, V, T
dp3 Temp.y, L, T
mad oT0.xy, Temp.xy, c[ CV_CONSTANTS ].y, c[ CV_CONSTANTS ].y

// copy texture coordinate for 1D hair streaks texture
mov oT1.x, srcTex0.x

// transform vertex into clip space
dp4 oPos.x, srcP, c[ CV_WORLDVIEWPROJ_0 ]
dp4 oPos.y, srcP, c[ CV_WORLDVIEWPROJ_1 ]
dp4 oPos.z, srcP, c[ CV_WORLDVIEWPROJ_2 ]
dp4 oPos.w, srcP, c[ CV_WORLDVIEWPROJ_3 ]

The pixel shader finally computes the lighting. It fetches the diffuse and specular intensity from the anisotropic light map as well as the color of the streaks from the 1D noise texture. The colors and intensities are modulated and combined in the following way:

#include "..\..\Effects\Hair\HairShaderConstants.h"

ps.1.1

tex t0    // get texel from anisotropic texture
tex t1    // get texel from 1D hair streaks texture

// r1 = specular intensity * specular color
mul r1, c[ CP_SPECULAR_COLOR ], t0.a
// r1 *= hair streak
mul r1, r1, t1
// r0 = diffuse intensity * hair streak
mul r0, t0, t1
// r0 = r0 * diffuse color + specular term in r1
mad r0, r0, c[ CP_DIFFUSE_COLOR ], r1

Reflections and Refractions on Soft Objects

Soft objects are usually expressed implicitly by mathematical functions. In order to render them, it is therefore necessary to determine the polygonal representation of their so-called iso-surface first. This is where the marching cubes algorithm or one of its derivatives comes into play.² It allows us to determine the polygonal surface of a 3D density field for a given density threshold. The density data either comes from 3D volumetric data sets (e.g., taken from MRI scans) or is generated by evaluating density functions (e.g., those used to describe soft objects). Meshuggah's soft objects are a conglomeration of simple blobs. A blob can be defined as follows³:

p = (px  py  pz)^T
o = (ox  oy  oz)^T
F(p) = 1 / |p – o|²

F returns the density at a given point p for a blob originated at o. Given such a density function, we are able to calculate its partial derivative to obtain the normal N for a point p in our density field. Having correct normals is crucial when rendering the surface later on.

The following is the partial derivative of the density function F(p) and its application to calculate a normal N(p).

² For an introduction to the marching cubes algorithm, refer to [4]. It also provides a C/C++ implementation.
³ Other formulas for modeling soft objects are presented in [5].


∇F(p) = ( ∂F(p)/∂x   ∂F(p)/∂y   ∂F(p)/∂z )^T = 2 · (o – p) / |p – o|⁴

N(p) = –∇F(p) = 2 · (p – o) / |p – o|⁴
     = 2 · (px – ox   py – oy   pz – oz)^T / ((px – ox)² + (py – oy)² + (pz – oz)²)²

For more complicated density functions, N can be approximated this way:

N_approx(p) ≈ ( F(p) – F(p + (ε 0 0)^T)
                F(p) – F(p + (0 ε 0)^T)
                F(p) – F(p + (0 0 ε)^T) )^T     for a small offset ε, 0 < ε ≪ 1

To build a complex soft object, the density functions of several individual blobs are summed up. The normal at a given point p in space is the sum of each blob's N(p).
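Expressed in C++, the two blob functions above amount to the following minimal sketch (the Vec3 type is a stand-in; for a composite soft object the returned values of all blobs are summed, as described):

struct Vec3 { float x, y, z; };

// density of a single blob centered at o:  F(p) = 1 / |p - o|^2
float BlobDensity( const Vec3& p, const Vec3& o )
{
    float dx = p.x - o.x, dy = p.y - o.y, dz = p.z - o.z;
    return 1.0f / ( dx * dx + dy * dy + dz * dz );
}

// analytic (unnormalized) blob normal:  N(p) = 2 * (p - o) / |p - o|^4
Vec3 BlobNormal( const Vec3& p, const Vec3& o )
{
    float dx = p.x - o.x, dy = p.y - o.y, dz = p.z - o.z;
    float r2 = dx * dx + dy * dy + dz * dz;
    float s  = 2.0f / ( r2 * r2 );
    return Vec3 { s * dx, s * dy, s * dz };
}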

As mentioned earlier in the text, we use the marching cubes algorithm to find the polygonal surface of the soft object. However, the tessellation code in Meshuggah takes special care to avoid brute-force testing of all voxels of the 3D density field, which would result in a run-time complexity of O(n³). Since the soft object's surface usually cuts through just a small fraction of the total number of voxels, we track its surface to limit further processing to those only. For each blob we therefore trace a line from its center out (any direction will do) until we find a voxel that the soft object's surface cuts through. If it hasn't been tessellated yet, we compute the polygonal surface for it and then progress to all neighbors also cut by the surface until all interlinked voxels have been visited. Otherwise, the blob penetrates another one that has already been processed. An algorithm outline is given in [6]. Figure 3 illustrates the tessellation procedure on a slice of a 3D density field for a soft object consisting of five blobs.
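In outline, the surface tracking could be written as the following sketch (the Blob and Voxel types and all helpers, such as FindFirstSurfaceVoxel, SurfaceCuts, Polygonise, and Neighbours, are placeholders; see [6] for the actual algorithm):

#include <queue>

void TessellateBlob( const Blob& blob )
{
    Voxel v = FindFirstSurfaceVoxel( blob );        // walk from the blob center outward
    if( IsTessellated( v ) )
        return;                                     // blob penetrates an already processed one

    std::queue< Voxel > open;
    open.push( v );
    while( !open.empty() )
    {
        Voxel cur = open.front();
        open.pop();
        if( IsTessellated( cur ) )
            continue;
        Polygonise( cur );                          // run marching cubes on this voxel only
        MarkTessellated( cur );
        for( const Voxel& n : Neighbours( cur ) )   // follow the surface into adjacent voxels
            if( SurfaceCuts( n ) && !IsTessellated( n ) )
                open.push( n );
    }
}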

Figure 3: Efficiently tessellating a soft object. The arrows indicate surface tracking. The lines around the object indicate the soft object's surface for a given density threshold.


Now that the polygonal representation of the surface has been generated, we can focus on the shaders. The goal is to render the soft object like a liquid. To do this, a vertex shader has to calculate the reflected view vector and an approximately refracted view vector for each vertex. It also computes an approximation of the Fresnel term given in [7], which is used later in the pixel shader to combine two colors sampled from an environment map that correspond to the reflected and refracted view directions.

The following is a reflection and refraction vertex shader:

#include "..\..\Effects\SoftObjects\SoftObjects<strong>Shader</strong>Constants.h"<br />

#define srcP v0<br />

#define srcN v1<br />

#define N r0<br />

#define V r1<br />

#define NShort r9<br />

#define Temp r10<br />

#define Temp1 r11<br />

vs.1.1<br />

// def CV CONSTANTS, 0.0, 0.5, 1.0, 2.0<br />

// normalize normal<br />

dp3 N.w, srcN, srcN<br />

rsq N.w, N.w<br />

mul N, N.w, srcN<br />

// vector from vertex to eye<br />

add V, c[ CV VIEWERPOS ], -srcP<br />

dp3 V.w, V, V<br />

rsq V.w, V.w<br />

mul V, V, V.w<br />

// calculate approximated Fresnel term F<br />

// F = Fresnel factor *(1-V.N)^2<br />

dp3 Temp, V, N<br />

add Temp, c[ CV CONSTANTS ].z, -Temp<br />

mul Temp, Temp, Temp<br />

mul oD0.xyz, Temp, c[ CV FRENSEL FACTOR ].x<br />

// calculate reflection vector<br />

//R=2*(E.N) *N-V<br />

dp3 Temp, N, V<br />

mul Temp1, Temp, c[ CV CONSTANTS ].w<br />

mul Temp1, Temp1, N<br />

add oT0.xyz, Temp1, -V<br />

// calculate refraction vector


R'=2*(E.NShort) *N-V<br />

mul NShort, N, c[ CV REFRACT ]<br />

dp3 Temp, NShort, V<br />

mul Temp1, Temp, c[ CV CONSTANTS ].w<br />

mul Temp1, Temp1, NShort<br />

add oT1.xyz, Temp1, -V<br />

// transform vertex to clip space<br />

dp4 oPos.x, srcP, c[ CV WORLDVIEWPROJ 0]<br />

dp4 oPos.y, srcP, c[ CV WORLDVIEWPROJ 1]<br />

dp4 oPos.z, srcP, c[ CV WORLDVIEWPROJ 2]<br />

dp4 oPos.w, srcP, c[ CV WORLDVIEWPROJ 3]<br />

Reflection and refraction pixel shader:

#include "..\..\Effects\SoftObjects\SoftObjectsShaderConstants.h"

ps.1.1

tex t0    // get reflected color
tex t1    // get refracted color

// blend between refracted color and reflected color
// using Fresnel term
lrp r0, v0, t0, t1

Volumetric Beams via Radial Blur

Volumetric beams as the result of a spherical object emitting light can be rendered quite convincingly using a radial blur. This technique doesn't suffer from artifacts usually found at silhouette edges when rendering shafts of light via multiple additively blended shells — that is, by extruding the object's mesh several times. The beams effect in Meshuggah demonstrates the use of a radial blur by rendering a "seething" sun. To achieve this, two things need some further thought:

• How to render the turbulent surface of a sun
• How to efficiently implement a radial blur in 3D hardware

Let's begin with how to render the sun surface. It's basically a textured sphere. A great source for textures of that kind is [8], where you can find real sun pictures shot by SOHO — the Solar & Heliospheric Observatory project carried out by the European Space Agency (ESA) and the U.S. National Aeronautics and Space Administration (NASA). One of the sun pictures provided there was taken to create the texture in Figure 4.

In Figure 4 the alpha channel (left) is a filtered version of the color channel (right) at a higher scale. Photoshop's Glowing Edges filter was used to create it. Bright pixels indicate areas of high sun activity, which will eventually end up as long, intense beams. To make things more interesting, texture data from the alpha channel should be animated when rendering the sun surface to gain a complex motion.
The following vertex and pixel shaders take care of this job. The following is a sun surface vertex shader:

#include "..\..\Effects\Beams\RenderSunShaderConstants.h"

#define srcP   v0
#define srcTex v1
#define Temp   r0

vs.1.1

// apply texture scale factor
mul Temp, srcTex, c[ CV_TEXSCALE ]

// animate texture coordinates
add oT0.xy, Temp.xy, c[ CV_TEXOFFSET_0 ].xy
add oT1.xy, Temp.xy, c[ CV_TEXOFFSET_1 ].xy
add oT2.xy, Temp.xy, c[ CV_TEXOFFSET_2 ].xy
add oT3.xy, Temp.xy, c[ CV_TEXOFFSET_3 ].xy

// transform surface vertex into clip space
dp4 oPos.x, srcP, c[ CV_WORLDVIEWPROJ_0 ]
dp4 oPos.y, srcP, c[ CV_WORLDVIEWPROJ_1 ]
dp4 oPos.z, srcP, c[ CV_WORLDVIEWPROJ_2 ]
dp4 oPos.w, srcP, c[ CV_WORLDVIEWPROJ_3 ]

Figure 4: Sun surface texture

The vertex shader creates four unique texture coordinates by scaling and translating the original uv coordinates fed to it. These are used later in the pixel shader to create a complex motion of the sun's surface. The offset constants are updated per frame. Care must be taken when setting these values. Ideally, none of the texture coordinates should move in the same direction at any time. Otherwise, the eye will be able to keep the animation of the individual texture coordinates apart, thus destroying the illusion of a complex motion.


#include “..\..\Effects\Beams\RenderSun<strong>Shader</strong>Constants.h"<br />

ps.1.1<br />

// get four samples from sun surface texture map<br />

tex t0<br />

tex t1<br />

tex t2<br />

tex t3<br />

// calculate weighted sum of four alpha values<br />

mul r0.a, t0.a, c[ CP SURFACE BLEND VALUE ].a<br />

mad r0.a, t1.a, c[ CP SURFACE BLEND VALUE ].a, r0.a<br />

mad r0.a, t2.a, c[ CP SURFACE BLEND VALUE ].a, r0.a<br />

mad r0.a, t3.a, c[ CP SURFACE BLEND VALUE ].a, r0.a<br />

// modulate weighted alpha value on<br />

// surface color to compute final output<br />

mul r0, t0, r0.a<br />

The sun’s surface is animated by sampling the surface texture four times and calculating<br />

the weighted sum of the alpha values — i.e., 0.25 · (t0.a + t1.a + t2.a +<br />

t3.a). The result is then modulated on the color of the first sample to get the final<br />

output color (see Figure 5a).<br />

Now that the sun's surface is rendered, a radial blur has to be applied to it to get volumetric beams. But how do we implement a radial blur taking advantage of 3D hardware? One way is to transform the image from Cartesian coordinates (x, y) to polar coordinates (r, φ) and then do a horizontal blur (or vertical blur, depending on which axis corresponds to r and φ after the transformation) and finally transform the result back to Cartesian coordinates. The problem with this approach is that the CPU needs frame buffer read access to do it, a big performance bottleneck on current 3D hardware. Fortunately, there is another way to do the very same thing, which is particularly well suited for hardware-accelerated rendering.

To accumulate a series of gradually zoomed-in versions of a source texture, we render it into a destination texture with alpha blending being enabled. The source texture contains the image to be used for the radial blur. In our case, it's the sun's surface that we just rendered in the previous step. The destination texture stores the result of the radial blur. Have a look at the following code snippet in which we render a radial blur:

Clear( pDstTexture );
SetRenderTarget( pDstTexture );
SetTexture( 0, pSrcTexture );
EnableAlphaBlending( true );
SetSourceBlend( ONE );
SetDestinationBlend( ONE );
EnableDepthTest( false );
for( int i( 0 ); i < cNumBlurSteps; ++i )
{
    // each pass draws the source texture as a screen quad that is zoomed in a
    // little further and additively blended into the destination texture
    // (loop bound, zoom increment, and helper name are assumptions)
    DrawZoomedQuad( pSrcTexture, 1.0f + i * fZoomStep );
}


Figure 5: Sun surface (a) and radial blur applied to it (b)

Simple, Faked Displacement Mapping

Due to the fast pace at which new generations of GPUs are developed, displacement mapping has become more and more a standard feature in hardware-accelerated rendering. Displacement maps, as opposed to bump maps, won't destroy the impression of surface detail once a certain part of an object that you're looking at has turned to such a degree that it shows its silhouette. With bump maps, this impression fades at the silhouette edges of an object because these bumps don't really exist physically; they have influence on lighting only. Displacement maps, however, alter geometry based on the height values stored in them, which affects both an object's silhouette and lighting. Another advantage of displacement mapping is that it can reduce memory bandwidth during rendering. A low-resolution mesh of an object with a displacement map applied can be used to render a highly detailed version of the same object. Since tessellation happens entirely on the GPU, geometry data transfer can be kept minimal. Nevertheless, bump maps have their place when rendering minuscule surface details, such as scratches, creases, pores, etc.

So how does it work in general? Take a triangle of an arbitrary mesh, for example. For each vertex there is a position, a normal, and a 2D texture coordinate for the displacement map. Upon sending the triangle to the GPU, it gets subdivided multiple times. Position, normal, and texture coordinates for each vertex inserted artificially due to subdivision are interpolated. Now every vertex position needs to be altered. Otherwise, we would end up having a lot of small triangles representing the original one. Therefore, for each vertex, the GPU determines the height value from the displacement map at the provided 2D texture coordinate and uses this value to displace the vertex position along the vertex normal. Likewise, vertex normals need to be perturbed, as the object's surface has changed.

At the time that Meshuggah was created, DirectX didn't support displacement mapping. That's why this section demonstrates a very simple approach to fake it with the help of the CPU.

To keep the workload of subdivision low, we limit ourselves to one quad that should be displacement mapped. This way, we can simply subdivide it in a grid-like fashion. There's no need to interpolate normals, and the position as well as texture coordinates can be derived directly from the grid position. Also, the displaced position of a vertex and its perturbed normal can be calculated directly from the associated height value and its neighbors in the displacement map. To get a high-quality displacement map effect, the quad needs to be subdivided quite a bit depending on the amount and size of detail provided in the displacement map. In Meshuggah, the displacement map has a size of 256x256 pixels. The quad is split into 255x255 tiny pieces that need to be rendered every frame. For each row of quads, a triangle strip is built and sent to the GPU.
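A rough sketch of this CPU-side grid setup follows (assumptions throughout: Height() samples the displacement map bilinearly, the quad lies in the xz plane and is displaced along y, and the vertex structure, scale factor, and helper names are placeholders):

void BuildDisplacedGrid()
{
    const int   cGrid = 256;                          // 256x256 vertices -> 255x255 quads
    const float cStep = 1.0f / ( cGrid - 1 );

    for( int z = 0; z < cGrid; ++z )
    {
        for( int x = 0; x < cGrid; ++x )
        {
            float u = x * cStep;
            float v = z * cStep;
            float h = Height( u, v );                 // height value from the displacement map

            Vertex& vtx = vertices[ z * cGrid + x ];
            vtx.pos = Vec3 { u, h * fDisplaceScale, v };
            // central differences of neighboring heights approximate the perturbed normal
            float dx = Height( u - cStep, v ) - Height( u + cStep, v );
            float dz = Height( u, v - cStep ) - Height( u, v + cStep );
            vtx.normal = Normalize( Vec3 { dx, 2.0f * cStep, dz } );
            vtx.uv = Vec2 { u, v };
        }
    }
    // one triangle strip per row of quads is then built from two adjacent vertex rows
}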

The above technique should now be able to lift a 2D logo up into three dimensions. Environment and decal mapping enhance the visual detail after displacement mapping has been applied.

The vertex shader calculates the reflected view vector as well as the Fresnel term for each vertex. It's very similar to the soft object vertex shader. The pixel shader uses the reflected view vector to fetch a color from an environment map. It is multiplied by the material color of the logo. The result is combined with the material to blend in reflections. Since we want certain parts of the logo to appear darker, we create a copy of that color with half the intensity. The final output color is computed by taking the logo texture into account. Its RGB channels contain a decal value to blend between the two colors that we've just calculated. As a last step, we copy the value from the alpha channel of the logo texture to mask out all invisible logo parts.

The following is a pixel shader to render the displaced logo:

#include "..\..\Effects\DisplacementMap\DisplacementMapConstants.h"<br />

ps.1.1<br />

tex t0 // get texel from logo map (decal in rgb, mask in alpha)<br />

tex t1 // get texel from environment map<br />

// environment * material<br />

mul r0, t1, c[ CP MATERIALCOLOR ]<br />

// blend between material color and environment * material color<br />

// based on Fresnel term<br />

lrp r0, v0, c[ CP MATERIALCOLOR ], r0<br />

// get the same color with half the intensity<br />

mov d2 r1, r0<br />

// use logo decal value to blend between light and dark color and<br />

// mask out parts of the logo that are invisible (co-issuing)<br />

lrp r0.rgb, 1 - t0, r0, r1<br />

+ mov r0.a, t0.a


Ocean Scene

The ocean scene is Meshuggah's most complex effect, both visually/computationally and in terms of shader usage. It's based on a statistical model to realistically animate ocean waves ([9] and [10]). In combination with shaders to calculate the color of water and per-pixel bumpy reflections⁴, it results in a highly believable rendering of an ocean. Making use of a high dynamic range Fresnel term allows better visualization of glossy spots on the ocean surface while still maintaining a high image contrast. The color of water is evaluated utilizing simplified equations from [11] to determine the view-dependent color of water. Contrast enhancement is based on an approach described in [12] to further improve the visual results when rendering ocean water.

First and foremost, a realistic model for simulating and animating ocean water is necessary to generate an appropriate mesh used for rendering. As mentioned above, [9] — in some greater detail — and [10] describe a statistical model based on observations of the real sea. In this model, a wave height field is decomposed into a set of sine waves with different amplitudes and phases. While the model itself provides formulas to generate these amplitudes and phases, an inverse Fast Fourier Transform (iFFT) converts them back into the spatial domain, thus creating the wave height field required for building the ocean mesh. It also allows calculating proper normals and displacement vectors for the height field. These are used for lighting calculations and forming choppy waves.

A big advantage of this particular model is that it produces an ocean height field that tiles perfectly. In Meshuggah, a 64x64 ocean height field is repeated four times, both horizontally and vertically, to form a 256x256 ocean mesh.

Ocean color computations performed in the shaders take several contributing factors into account to determine a final pixel color. First up, there is the color of water. As mentioned above, we can take the equations provided in [11] and simplify them by treating the water surface as a flat plane. The result is an equation only depending on the angle between the viewer v, a point p, and its normal n on the ocean surface. It returns greenish colors for angles of about 90 degrees (viewer looks over a wave) and dark blue colors for angles near or equal to 0 degrees (viewer looks straight at a wave). Another factor influencing the final color is reflected skylight taken from an environment map. To further enhance details of the water, surface reflections will be per-pixel based, meaning that the viewer will be able to see little ripples on the water surface coming from a dynamically updated bump map.⁵ In order to not overemphasize reflections, we calculate a Fresnel term for the air-to-water case (slightly different from the one used in previous sections), multiply it by the reflected skylight color, and add the result to the color of water. Otherwise, reflections would make the water look more like liquid metal. [10] proposed the following approximation of the Fresnel term for the air-to-water case. λ denotes the angle between the viewer v and the surface normal n.

⁴ These represent low-scale ripples on the water surface.
⁵ The normals of the current ocean height field are used here.


Fresnel(λ) = 1 / (1 + cos(λ))⁸

The Fresnel term is a good candidate for performing contrast enhancement. It influences how much reflected skylight should be added to the color of water for a given pixel. By multiplying the Fresnel term by an exposure factor, we can increase the intensity in areas of the ocean surface where direct sunlight is reflected while leaving other areas relatively dark. Further details follow the listing of the ocean scene vertex shader:

#include "..\..\Effects\OceanScene\<strong>Shader</strong>Constants.h"<br />

#include "..\..\Effects\OceanScene\Ocean<strong>Shader</strong>Constants.h"<br />

#define srcP v0<br />

#define srcN v1<br />

#define srcTex v2<br />

#define P r0<br />

#define V r1<br />

#define S r2<br />

#define SxT r3<br />

#define T r4<br />

#define Temp r10<br />

#define Temp1 r11<br />

vs.1.1<br />

// def CV CONSTANTS, 0.0, 0.5, 1.0, 2.0<br />

// scale and translate vertex<br />

mul P, srcP, c[ CV MESH XYZ SCALE ]<br />

add P, P, c[ CV MESH XYZ OFFSET ]<br />

// apply curvature<br />

add Temp, P, -c[ CV VIEWERPOS ]<br />

mul Temp, Temp, c[ CV CONSTANTS ].zxz<br />

dp3 Temp, Temp, Temp<br />

mad P.y, -Temp.x, c[ CV CURVATURE ].x, P.y<br />

// generate S, T and SxT<br />

dp3 SxT.w, srcN, srcN<br />

rsq SxT.w, SxT.w<br />

mul SxT, srcN, SxT.w<br />

mov S, c[ CV CONSTANTS ].zxx<br />

mul T, S.zxyw, SxT.yzxw<br />

mad T, S.yzxw, SxT.zxyw, -T<br />

dp3 T.w, T, T


sq T.w, T.w<br />

mul T, T, T.w<br />

mul S, SxT.zxyw, T.yzxw<br />

mad S, SxT.yzxw, T.zxyw, -S<br />

// set up transformation matrix for bump map normals<br />

mov oT1.x, S.x<br />

mov oT2.x, S.y<br />

mov oT3.x, S.z<br />

mov oT1.y, SxT.x<br />

mov oT2.y, SxT.y<br />

mov oT3.y, SxT.z<br />

mov oT1.z, T.x<br />

mov oT2.z, T.y<br />

mov oT3.z, T.z<br />

// set up view vector for per-pixel reflections<br />

// put it into per-pixel reflection matrix<br />

add oT1.w, c[ CV VIEWERPOS ].x, -P.x<br />

add oT2.w, c[ CV VIEWERPOS ].y, -P.y<br />

add oT3.w, c[ CV VIEWERPOS ].z, -P.z<br />

// set up texture uv for bump map<br />

mul oT0.xy, srcTex.xy, c[ CV BUMP UV SCALE ].xy<br />

// calculate normalized view vector<br />

add V, c[ CV VIEWERPOS ], -P<br />

dp3 V.w, V, V<br />

rsq V.w, V.w<br />

mul V, V, V.w<br />

// set up lerp factor for ocean color<br />

dp3 oD0.xyz, V, SxT<br />

// calculate approximated Fresnel term F<br />

// 1<br />

// F = -------------------------------------<br />

// ( 1 + V.N ) ^ FresnelApprox PowFactor<br />

dp3 Temp, V, SxT<br />

add Temp, c[ CV CONSTANTS ].z, Temp<br />

mov Temp.y, c[ CV FRESNELAPPROX POWFACTOR ].x<br />

lit Temp.z, Temp.xxyy<br />

rcp Temp.z, Temp.z<br />

mul Temp.z, Temp.z, c[ CV DYNAMIC RANGE ].x<br />

// set up high dynamic range Fresnel term<br />

expp Temp1.y, Temp.z<br />

mov oD0.w, Temp1.y<br />

Section II — Rendering Techniques<br />

Meshuggah’s Effects Explained<br />

287


Section II — Rendering Techniques<br />

288 Meshuggah’s Effects Explained<br />

add Temp.z, Temp.z, -Temp1.y<br />

mul oD1.w, Temp.z, c[ CV DYNAMIC RANGE ].y<br />

// transform vertex to clip space<br />

dp4 oPos.x, P, c[ CV WORLDVIEWPROJ 0]<br />

dp4 oPos.y, P, c[ CV WORLDVIEWPROJ 1]<br />

dp4 oPos.z, P, c[ CV WORLDVIEWPROJ 2]<br />

dp4 oPos.w, P, c[ CV WORLDVIEWPROJ 3]<br />

The first step transforms a vertex of our generated ocean mesh into world space. Then a curvature factor is applied to the world space position. It alters the height (y) of each ocean vertex based on the squared x/z distance to the current view position.

Doing per-pixel reflections requires setting up a transformation matrix, which is used to transform normals fetched from a bump map into world space⁶ so that the view vector is reflected correctly. Normals stored in the bump map are compatible with our (left-handed) world space coordinate system. That is, if the ocean surface is a flat (y) plane, the transformation matrix is set to identity. Since our ocean mesh is based on a height field, which in turn is based on a rectangular grid, generating the transformation matrix is easy. It's formed by three normalized vectors x, y, z (also called s, t, and sxt), which are copied into the consecutive output texture registers oT1-oT3. oT0 is reserved for the bump map's texture coordinates. The following is the matrix to transform bump map normals:

y = normal of ocean mesh vertex
z = (1 0 0)^T × y
x = y × (0 0 1)^T
M = (x y z)

⁶ It's the inverse orthonormal basis for the tangent space of each vertex.

The view vector, which is also necessary for the reflection computations later in the pixel shader, is stored in the w components of oT1-oT3.

To calculate the color of water, we determine a lerp factor that is used in the pixel shader to blend between two colors. As you can see, it is a really stripped-down version of the original formulas given in [11]. For a given ocean mesh vertex, the lerp factor is just the cosine of the angle between the viewer and the vertex normal. A more complex color function encoded in an environment map, as in [10], was avoided since all available texture slots are already reserved for other purposes and a second render pass was omitted for performance reasons.

The next step determines the Fresnel term to be used for blending in reflected skylight in the pixel shader. As mentioned earlier, it serves as a high dynamic range value. In order to overcome DirectX 8.1 class hardware's limited range of color inputs and color math, we do the following steps. The Fresnel term is multiplied by a constant user-customizable exposure factor. Its range is limited to [0, 4], thus saving pixel shader instructions as we will see later. Since any value written to output color registers oD0 or oD1 gets clamped to [0, 1], we need to split up our Fresnel term. The fractional portion of it is extracted and copied to oD0. The integer portion is divided by the maximum exposure factor and copied to oD1. This way, we avoid any clamping of color values. That's it for the vertex shader. Let's have a look at the ocean scene pixel shader to see how it does its work:

#include "..\..\Effects\OceanScene\<strong>Shader</strong>Constants.h"<br />

#include "..\..\Effects\OceanScene\Ocean<strong>Shader</strong>Constants.h"<br />

ps.1.1<br />

tex t0 // get normal from bump map<br />

texm3x3pad t1, t0 // transform normal.x into world space<br />

texm3x3pad t2, t0 // transform normal.y into world space<br />

texm3x3vspec t3, t0 // transform normal.z into world space<br />

// and get color for reflected view vector<br />

// apply high dynamic range Fresnel term to env map<br />

mul x4 r0, t3, v1.a<br />

mad r0, t3, v0.a, r0<br />

// calculate ocean color<br />

lrp r1, v0, c[ CP OCEAN COLOR DARK ], c[ CP OCEAN COLOR LIGHT ]<br />

// combine ocean and env color<br />

add r0, r0, r1<br />

The reflected skylight color is determined by transforming a normal from our bump map into world space using the interpolated per-pixel transformation matrix stored in t1-t3 and then reflecting the view vector to look up a color in the corresponding skylight environment map.

To apply the high dynamic range Fresnel term, we first multiply the reflected skylight color from the environment map by v1 (corresponds to oD1) and use the _x4 instruction modifier to even out the division that was performed when splitting up the Fresnel term in the vertex shader. Then we multiply the reflected skylight color from the environment map by v0 (corresponds to oD0) and add it to the previous result. This yields the desired Fresnel · reflected skylight color, which is added to the color of water.

Volumetric Light via Ray Casting

The second volumetric light effect in Meshuggah creates shafts of light via ray casting. These are shining through a bitmap (e.g., a logo) mapped onto a plane in 3D. Each ray starts traveling from the light's origin and runs along its given direction until it either hits the plane (that is, an opaque part of the bitmap) or its intensity has fallen below a certain threshold. To simplify things, we cast rays on the z plane. The following equation shows the intersection of a ray with the z plane.

r(t) = o + t · d
n = (0 0 1)^T,  p = (0 0 0)^T

Plane equation:  nx · x + ny · y + nz · z = n · p,  i.e.,  z = 0

oz + t · dz = 0   ⇒   t_Intersection = –oz / dz

Intersection = r(t_Intersection) = o + t_Intersection · d

If a ray hits the plane and the intersection point is actually within the boundary of the bitmap mapped onto the plane, then the corresponding opacity value from the bitmap determines whether the ray should continue or stop traveling.
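In code, the per-ray test described by this equation might look like the following sketch (MapToBitmap, Opacity, and the opacity threshold are illustrative placeholders):

#include <cmath>

// returns true if the ray (origin o, direction d) is stopped by an opaque part of the bitmap
bool RayHitsBitmap( float ox, float oy, float oz, float dx, float dy, float dz )
{
    if( fabsf( dz ) < 1e-6f )
        return false;                              // ray runs parallel to the z plane
    float t = -oz / dz;                            // t_Intersection = -o_z / d_z
    if( t < 0.0f )
        return false;                              // plane lies behind the ray origin
    float x = ox + t * dx;                         // intersection point on z = 0
    float y = oy + t * dy;

    float u, v;
    if( !MapToBitmap( x, y, u, v ) )               // outside the bitmap: the ray keeps traveling
        return false;
    return Opacity( u, v ) > cOpaqueThreshold;     // an opaque texel stops the ray
}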

Rendering is split up into three parts:

1. For each ray, draw the part from the origin of the light to its intersection point with the z plane. At the intersection point, the ray's intensity is set to one over the traveled ray length. Another distance-based attenuation function can be used as well.
2. Blend the bitmap mask onto the result of the last render step.
3. For each ray that didn't hit the bitmap, draw the part from its intersection point with the z plane to infinity.

To allow smooth changes between "ray hit" and "ray didn't hit" in step three, we store opacity values at eight bits instead of one bit. This way, we can gradually fade out a ray when it comes close to an opaque part of the bitmap. All rays are rendered using alpha blending with both source and destination blend factors set to one. Ray edges can be smoothed by applying a 1D gradient texture (black → white → black) perpendicular to the ray direction.

The shaders involved in this effect are rather simple. Texture coordinate and ray color (the intensity was already premodulated at ray setup) are copied by the vertex shader. The pixel shader modulates the interpolated ray color on the color fetched from the gradient texture.

Transition Effects

This final section deals with transition effects. It describes the technique used in Meshuggah to make things more interesting than a simple color fade. The basic idea is to have a per-pixel ramp function determine how much to blend between two corresponding pixels in a source and destination frame. To allow arbitrary transition patterns, we use a texture to store per-pixel threshold values (see Figure 6), which will later be accessed by a transition pixel shader to evaluate a ramp function. The transition effect is applied as some sort of post-processing step. Therefore, a screen space rectangle is rendered stretching the transition texture over the entire frame to feed the transition shader with a per-pixel threshold value.

All threshold values stored in the texture are in the range [0, 1]. The darker a pixel in the transition texture, the longer it takes a pixel of the source frame to appear in the final image and vice versa. To control the transition's progress, we introduce a constant value fTransTime, 0 ≤ fTransTime ≤ 1, which is set per frame. Setting fTransTime to zero and gradually increasing it to one will smoothly blend between the destination and source frame. The following ramp function implements the specified behavior:

f(fTransTime, fTransThreshold) = clamp(16 · (fTransThreshold + 2 · fTransTime – 1), 0, 1)

Obviously, this ramp function was chosen to be efficiently evaluated in a pixel shader. It directly translates to the following pixel shader code:
shader. It directly translates to the following pixel shader code:<br />

#include "Transition<strong>Shader</strong>Constants.h"<br />

ps.1.1<br />

tex t0 // get threshold value<br />

// set color of source frame and calculate<br />

// first part of ramp function (co-issuing)<br />

mov r0.rgb, c[ CP TRANSITION COLOR ]<br />

+ add x4 r0.a, t0, c[ CP TRANSITION TIME ] bx2<br />

// calculate final part of ramp function<br />

mov x4 r0.a, r0.a<br />

It computes and stores the result of the ramp function in the alpha channel of the output color register. During alpha blending, this value is used to combine the source and destination frame accordingly. It should be noted that in this implementation, the source frame is just a constant color. But your app could also render the source frame into a texture and use that instead. The destination frame is stored in the frame buffer. Figure 7 shows the result of the transition shader applied to the ocean scene in Meshuggah.

Figure 6: A texture containing threshold values for the transition pixel shader

Figure 7: Result of the transition pixel shader for fTransTime = 0.5 using the texture shown in Figure 6


Wrapping It Up

In the last few sections, various real-time graphic effects that take advantage of DirectX 8.1-style vertex and pixel shaders were presented. The shaders primarily dealt with typical tasks found in computer graphics, such as animation, lighting, and texturing. Implementing the effects required developing different tricks to overcome some limitations of the DirectX 8.1 class hardware and enable hardware-friendly rendering. Porting the shaders to DirectX 9 vertex and pixel shader assembly or re-engineering them in HLSL or Cg should be fairly simple and might be interesting for further experiments.

References and Further Reading

[1] Resources for demo scene productions: http://www.scene.org, http://www.pouet.net.

[2] Heidrich, Wolfgang and Hans-Peter Seidel, "Anisotropic Reflections in OpenGL," http://www9.informatik.uni-erlangen.de/eng/research/rendering/anisotropic/.

[3] Kilgard, Mark J., "Hardware Accelerated Anisotropic Lighting," http://developer.nvidia.com.

[4] Bourke, Paul, "Polygonising a Scalar Field," http://astronomy.swin.edu.au/pbourke/modelling/polygonise/.

[5] Bourke, Paul, "Implicit Surfaces," http://astronomy.swin.edu.au/pbourke/modelling/implicitsurf/.

[6] Jönsson, Andreas, "Fast Metaballs," http://www.angelcode.com/articles/metaballs/metaballs.asp.

[7] Watt, Alan and Mark Watt, Advanced Animation and Rendering Techniques, Addison-Wesley, 1992.

[8] "The very latest SOHO images," http://sohowww.nascom.nasa.gov/data/realtime-images.html.

[9] Tessendorf, Jerry, "Simulating Ocean Water," http://home1.gte.net/tssndrf/index.html.

[10] Jensen, Lasse S. and Robert Goliáš, "Deep Water Animation and Rendering," http://www.swrendering.com/water, http://www.gamasutra.com/gdce/jensen/jensen_01.htm.

[11] Cohen, Jonathan, Chris Tchou, Tim Hawkins, and Paul Debevec, "Real-time High Dynamic Range Texture Mapping," http://www.ict.usc.edu/~jcohen/hdrtm.html.

[13] Nishita, Tomoyuki and Eihachiro Nakamae, "Method of Displaying Optical Effects within Water using Accumulation Buffer," http://nis-lab.is.s.u-tokyo.ac.jp/~nis/pub_nis.html.


Layered Car Paint Shader

John Isidoro, Chris Oat, and Natalya Tatarchuk

Figure 1: Two-tone, suspended microflake car paint rendered in real time using an HLSL pixel shader in DirectX 9

The application of paint to a car's body can be a complicated process. Expensive auto body paint is usually applied in layered stages and often includes dye layers, clear coat layers, and metallic flakes suspended in enamel. The result of these successive paint layers is a surface that exhibits complex light interactions, giving the car a smooth, glossy, and sparkly finish. The car model shown here uses a relatively low number of polygons but employs a high-precision normal map generated by an appearance-preserving simplification algorithm (visit http://www.ati.com/developer/ for more information on the ATI Normal Mapper tool). Due to the pixel shader operations performed across the smoothly changing surfaces (such as the hood of the car), a 16-bit per-channel normal map is necessary.

Normal Map Decompression

The first step in this pixel shader is normal decompression. Since the normals are stored in surface local coordinates (aka tangent space), we can assume that the z component of the normals is positive. Thus, we can store x and y in two channels of a 16-16 texture map and derive z in the pixel shader from +sqrt(1 - x^2 - y^2). This gives us much higher precision than a traditional 8-8-8-8 normal map (even 10 or 11 bits per channel is not enough for this particular shader) for the same memory footprint.
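As a minimal sketch (the sampler name and the two-channel 16-16 format binding are assumptions, not part of the original listing), the decompression can be written as:

sampler compressedNormalMap;   // assumed: 16-16 map with x in red and y in green

float3 DecompressNormal( float2 vTex )
{
   // Fetch the two stored components and expand them from [0, 1] to [-1, 1]:
   float2 vXY = tex2D( compressedNormalMap, vTex ).rg * 2.0f - 1.0f;

   // A tangent-space normal always has a positive z component,
   // so z can be derived from x and y:
   float fZ = sqrt( saturate( 1.0f - dot( vXY, vXY ) ) );

   return float3( vXY, fZ );
}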

Figure 2: Two-tone, microflake, clear coat, and final lighting on side rearview mirror

Base Color

The normal decompression described above is performed on a surface normal map, which is generated from an appearance-preserving simplification process (N) and a high-frequency normalized vector noise map (Nn), which is repeated across the surface. These two normals are used to compute two perturbed normals that are used to simulate the two-toned nature of the paint as well as the microflake suspended in an inner coat of the paint.
suspended in an inner coat of the paint.


These normals, Ns and Nss, are computed as follows:

Ns = (a·Nn + b·N) / ||a·Nn + b·N||,   where a < b

Nss = (c·Nn + d·N) / ||c·Nn + d·N||,   where c = d

The coefficients a, b, c, and d above are constant input parameters to the pixel shader that determine the distributions of the perturbed normals. The magnitude of these perturbed normals determines the width of the region in which the microflake is readily visible. The two normals are dotted with the view vector and used as parameters in the following polynomial, which determines the color of the base coat and the strength of the microflake term:

c0(Ns·V) + c1(Ns·V)^2 + c2(Ns·V)^4 + c3(Nss·V)^16

The first three terms of this polynomial perform the blend between the two tones of the paint. The fourth term adds an extra layer of sparkle for the microflake's contribution. Constants c0, c1, and c2 correspond to the base paint colors, while c3 corresponds to the microflake color.
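Expressed directly in HLSL, this polynomial might look like the following sketch. The function and parameter names here are hypothetical; the complete listing later in this article computes the same terms under the names fFresnel1, fFresnel2, paintColor0, paintColorMid, paintColor2, and flakeLayerColor.

float4 BaseCoatColor( float3 vNs, float3 vNss, float3 vView,
                      float4 c0, float4 c1, float4 c2, float4 c3 )
{
   float fNsV  = saturate( dot( vNs,  vView ) );   // Ns . V
   float fNssV = saturate( dot( vNss, vView ) );   // Nss . V
   float fNsV2 = fNsV * fNsV;

   // c0(Ns.V) + c1(Ns.V)^2 + c2(Ns.V)^4 + c3(Nss.V)^16
   return c0 * fNsV +
          c1 * fNsV2 +
          c2 * fNsV2 * fNsV2 +
          c3 * pow( fNssV, 16 );
}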

Clear Coat Paint Layer

Figure 3: Metallic microflakes, suspended in clear enamel, are applied over a base paint coat (dye layer) and result in subsurface light scattering.

The final step in rendering the painted areas of the car is the inclusion of the clear coat through the addition of an environment map, as shown below. One interesting aspect of the clear coat term is the decision to store the environment map in an RGBScale form to simulate high dynamic range in a low memory footprint. The alpha channel of the texture, shown on the right in Figure 4, represents one-sixteenth of the true range of the data, while the RGB, shown on the left, represents the normalized color. In the pixel shader, the alpha channel and RGB channels are multiplied together and multiplied by eight to reconstruct a cheap form of HDR reflectance. This is multiplied by a subtle Fresnel term before being added to the lighting terms described above.
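Isolated from the full shader below, the RGBScale decode amounts to only a few instructions. This is just a sketch; the sampler and scale constant names are assumptions (the complete listing uses showroomMap and a tweakable brightnessFactor).

samplerCUBE environmentMap;   // assumed: RGBScale-encoded HDR cube map
float       fRGBScale;        // assumed: range scale applied after the multiply

float3 DecodeRGBScale( float3 vReflection )
{
   float4 cEnv = texCUBE( environmentMap, vReflection );

   // Alpha stores a scaled-down fraction of the texel's true range;
   // multiplying it back in restores a cheap form of HDR reflectance:
   return cEnv.rgb * cEnv.a * fRGBScale;
}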


Figure 4: The top face of the HDR cubic environment map with RGB channels on the left and the alpha channel on the right

The full HLSL pixel shader for the car paint and trim is shown here:

struct PsInput
{
   float2 Tex        : TEXCOORD0;
   float3 Tangent    : TEXCOORD1;
   float3 Binormal   : TEXCOORD2;
   float3 Normal     : TEXCOORD3;
   float3 View       : TEXCOORD4;
   float3 SparkleTex : TEXCOORD5;
};

float4 main( PsInput i ) : COLOR
{
   // Fetch from the incoming normal map:
   float3 vNormal = tex2D( normalMap, i.Tex );

   // Scale and bias the fetched normal to move it into the [-1.0, 1.0] range:
   vNormal = 2.0f * vNormal - 1.0f;

   // The microflake normal map is a high-frequency normalized vector noise
   // map which is repeated across the surface. Fetching the value from it
   // for each pixel allows us to compute a perturbed normal for the surface
   // to simulate the appearance of microflakes suspended in the coat of paint:
   float3 vFlakesNormal = tex2D( microflakeNMap, i.SparkleTex );

   // Don't forget to bias and scale to shift the color into the [-1.0, 1.0] range:
   vFlakesNormal = 2 * vFlakesNormal - 1.0;

   // This shader simulates two layers of microflakes suspended in
   // the coat of paint. To compute the surface normal for the first layer,
   // the following formula is used:
   //    Np1 = (a*Np + b*N) / ||a*Np + b*N||, where a < b
   float3 vNp1 = microflakePerturbationA * vFlakesNormal + normalPerturbation * vNormal;

   // To compute the surface normal for the second layer of microflakes, which
   // is shifted with respect to the first layer of microflakes, we use this formula:
   //    Np2 = (c*Np + d*N) / ||c*Np + d*N||, where c == d
   float3 vNp2 = microflakePerturbation * ( vFlakesNormal + vNormal );

   // The view vector (which is currently in world space) needs to be normalized.
   // This vector is normalized in the pixel shader to ensure higher precision of
   // the resulting view vector. For this highly detailed visual effect, normalizing
   // the view vector in the vertex shader and simply interpolating it is insufficient
   // and produces artifacts.
   float3 vView = normalize( i.View );

   // Transform the surface normal into world space (in order to compute the
   // reflection vector used for the environment map lookup):
   float3x3 mTangentToWorld = transpose( float3x3( i.Tangent, i.Binormal, i.Normal ) );
   float3 vNormalWorld = normalize( mul( mTangentToWorld, vNormal ) );

   // Compute the reflection vector resulting from the clear coat of paint on the
   // metallic surface:
   float fNdotV = saturate( dot( vNormalWorld, vView ) );
   float3 vReflection = 2 * vNormalWorld * fNdotV - vView;

   // Here we just use a constant gloss value to bias reading from the environment
   // map; however, in the real demo we use a gloss map which specifies which
   // regions will have the reflection slightly blurred.
   float fEnvBias = glossLevel;

   // Sample the environment map using this reflection vector:
   float4 envMap = texCUBEbias( showroomMap, float4( vReflection, fEnvBias ) );

   // Premultiply by alpha:
   envMap.rgb = envMap.rgb * envMap.a;

   // Brighten the environment map sampling result:
   envMap.rgb *= brightnessFactor;

   // Compute the modified Fresnel term for reflections from the first layer of
   // microflakes. First transform the perturbed surface normal for that layer into
   // world space and then compute the dot product of that normal with the view vector:
   float3 vNp1World = normalize( mul( mTangentToWorld, vNp1 ) );
   float fFresnel1 = saturate( dot( vNp1World, vView ) );

   // Compute the modified Fresnel term for reflections from the second layer of
   // microflakes. Again, transform the perturbed surface normal for that layer into
   // world space and then compute the dot product of that normal with the view vector:
   float3 vNp2World = normalize( mul( mTangentToWorld, vNp2 ) );
   float fFresnel2 = saturate( dot( vNp2World, vView ) );


   //
   // Compute the final paint color: combines all layers of paint as well as two
   // layers of microflakes
   //
   float fFresnel1Sq = fFresnel1 * fFresnel1;

   float4 paintColor = fFresnel1   * paintColor0   +
                       fFresnel1Sq * paintColorMid +
                       fFresnel1Sq * fFresnel1Sq * paintColor2 +
                       pow( fFresnel2, 16 ) * flakeLayerColor;

   // Combine the result of the environment map reflection with the paint color:
   float fEnvContribution = 1.0 - 0.5 * fNdotV;

   float4 finalColor;
   finalColor.a   = 1.0;
   finalColor.rgb = envMap.rgb * fEnvContribution + paintColor.rgb;

   return finalColor;
}

Conclusion

This shader was developed using empirically gathered phenomenological illumination characteristics rather than actual physical material attributes. Many different car paint swatches were observed under various lighting conditions. This shader strives to reproduce the observed characteristics of those swatches. This article demonstrated a compelling simulation for the illumination of car paint using a real-time pixel shader.


Motion Blur Using Geometry and Shading Distortion

Natalya Tatarchuk, Chris Brennan, Alex Vlachos, and John Isidoro

Introduction

When our demo team decided to implement a real-time version of the Animusic Pipe Dream animation shown in the SIGGRAPH 2001 Electronic Theater, we realized that we needed to create a fast, reliable technique for rendering a convincing motion blur effect for the many balls moving around the complicated music machine shown in Figure 1. Although many approaches exist for simulating motion blur in computer-generated images, some of them weren't fast enough or accurate enough for our purposes. Rather than draw the moving balls several times at different points in space and time, as one might do with an accumulation buffer [Haeberli90], we chose to draw each ball once, distorting its appearance to make it appear as if it were in motion during the frame "exposure." This technique is an extension of the approach taken by [Wloka96].

Figure 1: A screen shot from the real-time Animusic Pipe Dream demo (See Color Plate 16 for another view of the demo.)

The motion blur effect on film appears due to finite camera shutter speeds. When an object moves too quickly compared to the shutter speed, the image of that object moves across the film surface while the shutter is open. The image on film appears smeared, depending on how fast the object was moving with respect to the observer. Motion blur is an important cue for realistic rendering of our world, and it becomes ever more significant as computer-generated imagery approaches the detailed look of cinematic art. Motion blur is very common in photography and motion pictures and can be used in different ways for specific artistic choices. Some artists use motion blur deliberately to delineate dynamic motion in photographs. Human beings perceive motion blur as natural, and thus it is expected for a convincing computer-generated simulation.

Simulating the Motion Blur Effect

For our purposes, let's consider the movement of objects in object space. There, motion blur is caused by the object moving through a finite amount of space during a short time step (equivalent to the exposure time due to an open shutter). This allows us to use the distance that the object moved since the previous time step as an approximation for the instantaneous velocity value necessary to compute the blurring of the object. The previous work by Wloka and Zeleznik eliminates the obvious discrete renderings of the ball, which are inevitable in an accumulation buffer approach, while our technique also accounts for blurring of the object and computes a more accurate approximation of its contribution to the scene over time. To achieve the look we wanted, we used a vertex shader and a pixel shader to distort both the shape and shading of the balls.

Sample RenderMonkey Workspace

The RenderMonkey IDE, an environment for shader development, can be downloaded from http://www.ati.com/developer/sdk/radeonSDK/html/Tools/RenderMonkey.html. Along with the application itself, it installs a series of workspaces, including one called Motion Blur.rfx. This workspace contains the effect that we are describing in this article. Feel free to modify any of the parameters to the shaders to explore their effects on the final visual result. Of course, you can also modify the actual shaders to understand the algorithms in greater depth.

Geometry Distortion

We distort the shape of the ball along the movement direction vector to simulate the stretching of the object as it moves quickly across the image plane with respect to the observer. Each individual ball is modeled as a "capsule" (two hemispheres with a connecting cylinder), which is aligned with the direction of motion. The vertex shader stretches the ball tangent to the direction of motion, as shown below. The vertices of the front half of the capsule (those whose normals have a positive dot product with the direction of motion) are pushed in the direction of motion, and the vertices of the back half of the capsule (those whose normals have a negative dot product with the direction of motion) are pulled in the opposite direction. Naturally, the total amount of stretching is the distance the ball moved since the previous frame.

In Figure 2, we show how the ball's shape is distorted by the vertex shader in the direction tangent to the path of motion at the time instant centered in the current finite shutter time.

Figure 2: Shape distortion (left: without motion blur; right: with motion blur)

The capsule geometry is stretched in the vertex shader as a function of distance traveled, d, measured in ball diameter units. Figure 3 shows a snapshot from the demo with a distorted ball moving quickly through the scene.

Figure 3: Geometry distortion for a quickly moving object

The vertex shader also computes a blurriness factor from 1 / (1 + d). The object becomes more blurred as it moves faster past the observer; thus, it is very important to use relative speed rather than the actual movement of the ball in the scene, since motion blur is relative to the observation point. The motion blur amount is measured as a distance traveled relative to the camera's motion. If the observer moves at the same speed as the object, no motion blur is perceived. To determine the rate of fade-out for the object, we calculate how much visible area is covered as the ball moves (which is measured in ball widths). The fade-out rate is calculated in the range [0..1] and interpolated across the polygons to be used in the pixel shader to determine how much to blur the shading of the balls.

Vertex Shader

Below you can see an example of the vertex shader that performs the shape distortion. It is also used to compute the diffuse illumination contribution from two run-time lights, which is propagated to the pixel shader.

float4x4 mBallOrientation;
float4x4 inv_view_matrix;
float4x4 view_proj_matrix;
float4x4 mW0Matrix;

float4 vColor;
float4 vExtensionDirection;
float4 vObjectCenter;
float4 vMotionDirection;
float4 vAmbientPos3;
float4 vAmbientPos1;
float4 vAmbientPos2;
float4 vAmbientColor1;
float4 mLight1Pos;
float4 mLight2Pos;
float4 vLight1Color;
float4 vLight2Color;
float4 vAmbientColor3;
float4 vAmbientColor2;

float fBallSize;
float fObjectScale;
float fSpeed;
float fZoom;

struct VS_OUTPUT
{
   float4 ProjPos : POSITION;
   float3 Diffuse : COLOR0;
   float4 Normal  : TEXCOORD0;
   float3 View    : TEXCOORD1;
   float3 Light1  : TEXCOORD2;
   float3 Light2  : TEXCOORD3;
   float3 Pos     : TEXCOORD4;
};

VS_OUTPUT main( float4 Pos : POSITION, float3 Normal : NORMAL )
{
   VS_OUTPUT o = (VS_OUTPUT)0;

   Pos = float4( Pos.xyz * fObjectScale, 1 );

   float4 vCameraPosition = mul( inv_view_matrix, float4( 0, 0, 0, 1 ) );

   // Calculate the view vector:
   float3 vView = normalize( vCameraPosition - mul( mW0Matrix, Pos ) );

   // Calculate velocity relative to the eye:
   float3 vVelocity = vMotionDirection * fSpeed;

   // Motion vector as seen by the eye:
   float3 vEyeMotion = vVelocity - vView * dot( vVelocity, vView );

   // Speed relative to the observer:
   float fEyeSpeed = length( vEyeMotion );

   // Calculate the area that the stretched ball will cover on the screen;
   // it depends on the instantaneous velocity of the moving object and its size:
   float fBallCoverage = 1 + fEyeSpeed / fBallSize;

   // Calculate the blurriness factor for later alpha blending of the ball;
   // the faster it moves, the more "smeared" it will appear:
   float fBlurriness = 1 - 1 / fBallCoverage;

   // Export the blurriness factor to the pixel shader:
   o.Normal.w = fBlurriness;

   // Translate the object to the object origin:
   float4 vObjectPos = Pos - vObjectCenter;

   // Extend the position to elongate the ball relative to the speed:
   vObjectPos += fSpeed * vExtensionDirection * -sign( Normal.z );

   // Re-orient the ball along its motion path (rotate it into the correct
   // orientation):
   float3 vOrientedPosition = mul( (float3x3)mBallOrientation, vObjectPos.xyz );

   // Remove the world matrix rotation:
   vOrientedPosition = mul( vOrientedPosition, (float3x3)mW0Matrix );

   // Translate the object back to where it started:
   vOrientedPosition += vObjectCenter;

   // Transform the position into world space and output it:
   float4 vWorldPos = mul( mW0Matrix, float4( vOrientedPosition, 1 ) );
   o.Pos = vWorldPos;

   o.ProjPos = mul( view_proj_matrix, float4( vWorldPos.xyz * fZoom, vWorldPos.w ) );

   //
   // Calculate the normal
   //

   // Rotate the normal into the correct orientation:
   float3 vWorldNormal = mul( (float3x3)mBallOrientation, Normal );

   // Remove the world matrix rotation of the normal:
   vWorldNormal = mul( vWorldNormal, (float3x3)mW0Matrix );

   // Transform the normal to world space:
   vWorldNormal = mul( (float3x3)mW0Matrix, vWorldNormal );

   o.Normal.xyz = vWorldNormal;

   //
   // Light vectors for specular lighting:
   //

   // Light vector 1:
   o.Light1 = normalize( (float3)mLight1Pos - (float3)vWorldPos );

   // Light vector 2:
   o.Light2 = normalize( (float3)mLight2Pos - (float3)vWorldPos );

   //
   // Compute the diffuse illumination contribution:
   //
   o.Diffuse  = max( dot( vWorldNormal, normalize( vAmbientPos1.xyz - vWorldPos.xyz ) ), 0 ) * vAmbientColor1;
   o.Diffuse += max( dot( vWorldNormal, normalize( vAmbientPos2.xyz - vWorldPos.xyz ) ), 0 ) * vAmbientColor2;
   o.Diffuse += max( dot( vWorldNormal, normalize( vAmbientPos3.xyz - vWorldPos.xyz ) ), 0 ) * vAmbientColor3;
   o.Diffuse += max( dot( vWorldNormal, o.Light1 ), 0 ) * vLight1Color;
   o.Diffuse += max( dot( vWorldNormal, o.Light2 ), 0 ) * vLight2Color;
   o.Diffuse  = o.Diffuse * vColor;

   // More accurate view vector:
   o.View = normalize( vCameraPosition - vWorldPos );

   return o;
}

Shading Distortion

In addition to merely stretching the capsule geometry along the tangent to the path of the ball's motion, the shading of the object is affected by the blurriness factor computed above. The most important visual cue for the motion blur effect is the increasingly transparent quality of the object as it moves more quickly across the screen. This creates the impression that the object really is moving rapidly in the scene. Figure 4 shows a comparison rendering of two objects moving at different speeds. Even just glancing at the two snapshots, we get the impression that the ball in the right picture is moving much faster than the ball in the left picture.

In our shader, blurring is achieved in multiple ways. A number of factors contribute to the final ball color, including two specular highlights and an environment map, all of which are blurred as a function of the ball's motion during the frame. In the case of the two specular highlights on each ball, the specular exponent and its intensity are lowered as the ball goes faster, which effectively broadens the highlight on the surface of the ball. This serves to spread out the highlight and make it appear to be blurred in the direction of the ball's motion. In essence, the goal is to spread the energy radiating from the specular highlight among the pixels that the highlight would move across during a finite frame exposure time. In the case of the environment map, we use the texCUBEbias pixel shader intrinsic, which performs biased texture sampling with a bias that can be computed per-pixel, selectively sampling the smaller mip levels. This blurs the environment map term and induces some over-blurring of the texture as the ball moves faster through the image.

Figure 4: Velocity-dependent shading distortion (left: slowly moving ball, speed = 0.6; right: quickly moving ball, speed = 2.15)

// Apply the environment map to the object, taking into account
// speed-dependent blurring:
float3 vCubeLookup = vReflection + i.Pos / fEnvMapRadius;
float4 cReflection = texCUBEbias( tCubeEnv,
    float4( vCubeLookup, fBlur * fTextureBlur ) ) * vReflectionColor;

In the last few instructions of the pixel shader, the diffuse and specular components of illumination are combined. Because the specular contribution can be greater than one, we perform part of the frame buffer compositing operation (Src * SrcAlpha) in the pixel shader before the colors are clamped to the zero-to-one range. Each pixel is composited with the frame buffer with a Src + SrcAlpha * Dest blend. Doing the Src * SrcAlpha premultiplication in the pixel shader gives a more accurate result, since it happens prior to pixel shader output color saturation. See [Tatarchuk03] for a more detailed description of this blending approach as it is used for preserving specular highlights and color saturation for a translucent and iridescent surface.

Pixel Shader

Following is the complete pixel shader used for the effect described in this article. This is a DirectX HLSL pixel shader, which can be compiled to the ps_2_0 target.

float4 vReflectionColor;
float4 vLight1Color;
float4 vLight2Color;

float fBaseSpecularIntensity;
float fTextureBlur;
float fSpecularExpBlurScale;
float fSpecularExp;
float fSpecularDimScale;
float fEnvMapRadius;

sampler tCubeEnv;

struct PS_INPUT
{
   float3 Diffuse : COLOR0;
   float4 Normal  : TEXCOORD0;
   float3 View    : TEXCOORD1;
   float3 Light1  : TEXCOORD2;
   float3 Light2  : TEXCOORD3;
   float3 Pos     : TEXCOORD4;
};

float4 main( PS_INPUT i ) : COLOR
{
   // Extract the blurring factor from the normal vector interpolator:
   float fBlur = i.Normal.w;

   // Compute the reflection vector:
   float3 vNormal = normalize( i.Normal.xyz );
   float3 vReflection = normalize( 2 * dot( i.View, vNormal ) * vNormal - i.View );

   // Compute the fade-out rate for the moving ball, taking the
   // Fresnel effect into account:
   float fFirstBallWidthFade = saturate( 2 * fBlur );
   float fRestBallWidthFade  = saturate( 2 - 2 * fBlur );
   float fFresnel = 1 - saturate( dot( vNormal, i.View ) );
   float fAlpha = fRestBallWidthFade * ( 1 - fFirstBallWidthFade * fFresnel );



   // Environment map the object, taking into account
   // speed-dependent blurring:
   float3 vCubeLookup = vReflection + i.Pos / fEnvMapRadius;
   float4 cReflection = texCUBEbias( tCubeEnv,
       float4( vCubeLookup, fBlur * fTextureBlur ) ) * vReflectionColor;

   // Compute the smearing of the specular highlights depending on the amount
   // of motion blur:
   float fBlurredSpecularExp = max( 1, fSpecularExpBlurScale * fBlur + fSpecularExp );
   float fSpecularIntensity  = fBaseSpecularIntensity * ( 1 - ( fBlur * fSpecularDimScale ) );

   // Compute the specular contribution for the first light:
   float3 cSpecular1 = pow( saturate( dot( vReflection, i.Light1 ) ),
                            fBlurredSpecularExp ) * fSpecularIntensity * vLight1Color;

   // Compute the specular contribution for the second light:
   float3 cSpecular2 = pow( saturate( dot( vReflection, i.Light2 ) ),
                            fBlurredSpecularExp ) * fSpecularIntensity * vLight2Color;

   // Combine the input diffuse contribution with both specular
   // highlights and the environment map term:
   float3 cColor = cReflection.rgb + cSpecular1 + cSpecular2 + i.Diffuse;

   // Determine the actual blending amount:
   float alpha = fRestBallWidthFade *
                 ( 1 - fFirstBallWidthFade * ( 1 - saturate( dot( -vNormal, -i.View ) ) ) );

   // Premultiply by alpha and output the color:
   return float4( cColor * alpha, alpha );
}

Summary

In this article we described an efficient way to implement a convincing motion blur effect by using speed-dependent shape distortion and alignment of objects along the path of movement, combined with shading distortion to simulate accurate blurring of objects as they move quickly on the screen. Figure 5 shows the progression of a ball slowing down to collide with another object, a drum, and quickly moving away after the collision. In each of the pictures in that figure, we can see how the geometry and shading are changed to provide visual cues about the movement of the ball.
of the ball.


Figure 5: Motion blurring of a moving ball (panels: the ball moving quickly toward the drum; the ball about to hit the drum; the moment right after impact; the ball moving away after the collision)

References

[Haeberli90] Haeberli, Paul E. and Kurt Akeley, "The accumulation buffer: Hardware support for high-quality rendering," SIGGRAPH 1990, pp. 309-318.

[Tatarchuk03] Tatarchuk, N. and C. Brennan, "Simulation of Iridescence and Translucency on Thin Surfaces," ShaderX2: Shader Programming Tips & Tricks with DirectX 9, Wolfgang Engel, ed., Wordware Publishing, 2004, pp. 309-318.

[Wloka96] Wloka, M. and R.C. Zeleznik, "Interactive Real-Time Motion Blur," Visual Computer, Springer Verlag, 1996.


Simulation of Iridescence and Translucency on Thin Surfaces

Natalya Tatarchuk and Chris Brennan

Introduction

This article focuses on simulating the visual effect of translucency and iridescence of thin surfaces such as butterfly wings. When creating a visual impression of a particular material, an important characteristic of that surface is the luminosity of the material. There are various ways in which a surface can be luminous: from sources with or without heat, from an outside source, or from the object itself (other than a mere reflection). Luminous objects that exhibit certain combinations of these characteristics can be described as translucent or iridescent, depending on the way that the surface "scatters" incoming light.

Translucency of a material is determined by the ability of that surface to allow light to pass through without full transparency. Translucent materials can only receive light and thus can be luminous only when lit from an outside source. Although there has been ample research in recent years on interactive simulation of fully translucent surfaces such as marble or wax [Jensen01], this article focuses on simulating translucency for thin surfaces. A good example of the visual effect of translucency of thin surfaces is holding a piece of rice paper against a light source (for example, Chinese lanterns). You would see that the light makes the rice paper seem to glow from within, yet you cannot see the light source through the paper because the paper scatters the incoming light.

Iridescence is an effect caused by the interference of light waves resulting from multiple reflections of light off of surfaces of varying thickness. This visual effect can be detected as a rainbow pattern on the surface of soap bubbles and gasoline spills and, in general, on surfaces covered with a thin film diffracting different frequencies of incoming light in different directions. The surface of a soap bubble exhibits iridescence due to a layer of air, which varies in thickness, between the top and bottom surfaces of the bubble. The reflected colors vary along with the thickness of the surface. Mother-of-pearl, compact discs, and various gemstones share that quality. Perhaps most captivating of all, however, is the iridescence seen on the wings of many beautiful butterflies, such as blue pansy butterflies, Junonia orithya, or the malachite butterflies, Siproeta stelenes. These wings exhibit vivid colorful iridescence (see Figure 1 for examples), the color of which has been shown to be independent of the pigmentation of the wings and is attributed to the microstructure of the scales located on and within the butterfly wings.

Figure 1: Butterflies in nature (left: blue pansy butterfly, Junonia orithya; right: malachite butterfly, Siproeta stelenes)

Algorithm

The effect described in this chapter simulates translucency and iridescent patterns of delicate butterfly wings. To generate iridescence, we have merged the approaches described in "Bubble Shader" [Isidoro02] and "Textures as Lookup Tables for Per-Pixel Lighting Computations" [Vlachos02].

Inputs

This effect uses a geometric model with a position, a normal, a set of texture coordinates, a tangent, and a binormal vector. All of these components are supplied to the vertex shader. At the pixel level, we combine gloss, opacity, and normal maps for a multi-layered final look. The gloss map is used to contribute to "satiny" highlights on the butterfly wings. The opacity map allows the wings to have variable transparency, and the normal map is used to give the wings a bump-mapped look to allow for more surface thickness variations. The input texture coordinates are used to sample all texture maps.

Sample RenderMonkey Workspace

The RenderMonkey IDE, an environment for shader development, can be downloaded from http://www.ati.com/developer/sdk/radeonSDK/html/Tools/RenderMonkey.html. Along with the application itself, the RenderMonkey installer installs a series of workspaces, including one called Iridescent Butterfly.rfx. This workspace contains the effect that we are describing in this article. Feel free to modify any of the parameters to the shaders to explore their effect on the final visual result. Of course, you can also modify the actual shaders to understand their algorithms in greater depth.


Vertex Shader

The vertex shader for this effect computes vectors that are used by the pixel shader to compute the illumination result. At the pixel level, the view vector and the light vector are used for calculating diffuse illumination and scattered illumination off of the surface of the wings, which contributes to the translucency effect. The halfway vector is used for generation of glossy highlights on the wing's surface. In the vertex shader, however, these vectors are simply transformed to tangent space.

struct VS_OUTPUT
{
   float4 Pos   : POSITION;
   float2 Tex   : TEXCOORD0;
   float3 View  : TEXCOORD1;
   float3 Light : TEXCOORD2;
   float3 Half  : TEXCOORD3;
};

VS_OUTPUT main( float4 Pos      : POSITION,
                float4 Normal   : NORMAL0,
                float2 Tex      : TEXCOORD0,
                float3 Tangent  : TANGENT0,
                float3 Binormal : BINORMAL0 )
{
   VS_OUTPUT Out = (VS_OUTPUT)0;

   // Output the transformed vertex position:
   Out.Pos = mul( view_proj_matrix, Pos );

   // Propagate the input texture coordinates:
   Out.Tex = Tex;

   // Compute the light vector (object space):
   float3 vLight = normalize( mul( inv_view_matrix, lightPos ) - Pos );

   // Define the tangent space matrix:
   float3x3 mTangentSpace;
   mTangentSpace[0] = Tangent;
   mTangentSpace[1] = Binormal;
   mTangentSpace[2] = Normal;

   // Output the light vector in tangent space:
   Out.Light = mul( mTangentSpace, vLight );

   // Compute the view vector (object space):
   float3 vView = normalize( view_position - Pos );

   // Output the view vector in tangent space:
   Out.View = mul( mTangentSpace, vView );

   // Compute the half angle vector (in tangent space):
   Out.Half = mul( mTangentSpace, normalize( vView + vLight ) );

   return Out;
}

Pixel Shader

The pixel shader computes the illumination value for a particular pixel on the surface of the butterfly wings, taking into account the light propagated through the surface of the wings due to the translucency effect and the light emitted due to the iridescence of the wings.

First, we load the color value from a base texture map. For efficiency reasons, we have stored the opacity in the alpha channel of the base texture map. We also load a normal vector (in tangent space) from the precomputed normal map and a gloss value, which is used to modulate highlights on the surface of the wings. The scalar gloss map is stored in the alpha channel of the normal map. Combining three-channel texture maps with single-channel grayscale value maps allows us to load two values with a single texture fetch.

float3 vNormal, baseColor;
float  fGloss, fTransparency;

// Load the normal and gloss map:
float4( vNormal, fGloss ) = tex2D( bump_glossMap, Tex );

// Load the base and opacity map:
float4( baseColor, fTransparency ) = tex2D( base_opacityMap, Tex );

Don't forget to scale and bias the fetched normal into the [-1.0, 1.0] range:

// Signed scale the normal:
vNormal = vNormal * 2 - 1;

Figure 2 displays the contents of the texture maps used for this effect:

Figure 2: Input texture maps (base texture map for the wings, opacity texture map, normal map for bump mapping, and gloss map)

The texture address mode should be set to CLAMP in u and v for both of these texture maps. They should also be trilinearly filtered (MAGFILTER = linear, MINFILTER = linear, and MIPFILTER = linear).

Translucency

Next we compute the translucency effect. The amount of light scattered through a thin surface is proportional to the incident angle of the light on the back side. So, similar to a Lambertian diffuse calculation, we dot the light vector with the negative of the normal vector. We also use a prespecified translucency coefficient in addition to the fetched opacity value to control the amount of scattered light:

float3 scatteredIllumination = saturate( dot( -vNormal, Light ) ) *
                               fTransparency * translucencyCoeff;

As described above, the scattered light contribution is dependent on both the direction of the incident light and the surface normal for the pixel location. This contribution to the diffuse illumination of the wings' surface is what accounts for their subtle glow. Figure 3 illustrates the contribution of scattered reflected light on the surface of the wings. If you modify the pixel shader for this effect in the RenderMonkey workspace to output only the scattered light contribution, you will be able to investigate how this contribution changes as you rotate the model.

To simulate the varying thickness of the scales on the surface of the butterfly wings, as well as within them, we use a normal map to perturb the normal vectors. The usual diffuse illumination is computed with a simple dot product and a global ambient term:

float3 diffuseContribution = saturate( dot( vNormal, Light ) ) + ambient;
ambient;<br />

Figure 3 shows the result of computing diffusely reflected light for the butterfly<br />

wings.<br />

313


Section II — Rendering Techniques<br />

314 Simulation of Iridescence and Translucency on Thin Surfaces<br />

Scattered light contribution Diffuse term<br />

Figure 3: Diffuse illumination<br />

In the next step we combine the base texture map with the diffuse term and the<br />

scattered reflected light contribution to compute a final value for diffuse surface<br />

illumination:<br />

baseColor *= scatteredIllumination + diffuseContribution;<br />

Figure 4 illustrates the results of this operation. Now we can see how scattered<br />

reflected light contributes to the translucency effect of the butterfly wings in the<br />

more illuminated portions of wings in the final picture in Figure 4.<br />

Figure 4: Combining the base texture map with scattered reflected light and diffusely<br />

reflected light<br />

Since butterfly wings in nature have varying transparency, we had our artists paint in the appropriate opacity values. However, since the desired effect has transparent geometry that also has specular highlights, we must take care when blending to properly achieve transparency. Typically, blending transparent materials is done during the blending stage of the rendering pipeline. However, if the object that you are rendering as transparent is a specular surface, blending should be done before actually applying specular highlights to the surface. A brute-force approach for this is to render two separate passes (the diffuse alpha-blended pass first), adding the specular highlights in the second pass. But to speed up our effect, we wanted to do it all in one pass. This requires some tricks with the way alpha blending is used. Since the specular pass is additive and the diffuse is blended, we want to pre-blend the diffuse color and add the specular color in the shader. Then, during the blending stage, the source color is not modified; it's simply added, since that portion of the blending equation is already taken care of in the shader code. The destination has the same blend as it normally would with the standard two-pass "brute-force" approach. If we look at the blending equation, the two-pass approach is expressed in the following form:

Pass 1: diffuseIlluminationColor * alpha + destination * (1 - alpha)
Pass 2: specularColor + destination

The single-pass approach can be expressed as follows:

Pass 1: (diffuseIlluminationColor * alpha + specularColor) * 1 + destination * (1 - alpha)

Here's the portion of the shader that premultiplies the diffuse illumination result with the alpha:

float fOpacity = 1 - fTransparency;

// Premultiply alpha blend to avoid clamping:
baseColor *= fOpacity;

Figure 5 illustrates the effect of this action on the diffuse illumination result.

Figure 5: Premultiplied alpha blending

Iridescence

One of the reasons why butterflies are so captivating in nature is the iridescence of their wings. To simulate iridescent patterns on the surface of our simulated butterfly wings, we use an approach similar to the technique described in the "Bubble Shader" article in the original Direct3D ShaderX book [Isidoro02]. Iridescence is a view-dependent effect, so we can use the view vector, which was computed in the vertex shader and interpolated across the polygon in tangent space. We also use scale and bias coefficients to scale and bias the index, making the iridescence change more quickly or slowly across the surface of the wings. You can explore the effects of these parameters by modifying the variables iridescence_speed_scale and iridescence_speed_bias in the Iridescent Butterfly.rfx RenderMonkey workspace.

// Compute the index into the iridescence gradient map,
// which consists of N.V coefficients:
float fGradientIndex = dot( vNormal, View ) * iridescence_speed_scale +
                       iridescence_speed_bias;

// Load the iridescence value from the gradient map based on the
// index we just computed above:
float4 iridescence = tex1D( gradientMap, fGradientIndex );

This effect uses a 1D gradient texture map (Figure 6) for computing color-shifted iridescence values. This texture should have trilinear filtering enabled and the MIRROR texture address mode selected for both the u and v coordinates.

Figure 6: Gradient texture map

Figure 7 illustrates the resulting iridescence value.

Figure 7: Iridescence of butterfly wings

To add satiny highlights to the surface of the wings, we use a gloss map generated by the artist. We compute the gloss value based on the result fetched from the gloss map and N·H for determining the placement of specular highlights. Finally, we add the gloss contribution to the previously computed diffuse illumination result to obtain the final result:

// Compute the final color using this equation:
//    N.H * Gloss * Iridescence + Diffuse
float fGlossIndex = fGloss *
                    ( saturate( dot( vNormal, Half ) ) * gloss_scale + gloss_bias );

baseColor += fGlossIndex * iridescence;

Figure 8 shows the final color for the butterfly wings.

Figure 8: Assembled final color

To render the final effect correctly, we output the previously computed scalar fOpacity in the alpha channel of our result:

return float4( baseColor, fOpacity );

Because of the premultiplication mentioned earlier, this means that our alpha blend factors should be ONE for the SRCBLEND render state and INVSRCALPHA for DESTBLEND.

Summary

In this chapter we presented a technique for simulating translucency and iridescence on thin surfaces, such as butterfly wings. Our technique combines scattered reflected light with diffusely reflected light and a color-shifted iridescence value for a visually interesting final result. You can see this effect in Color Plate 17 and in the Chimp Demo in the ATI Radeon 9800 demo suite on the ATI web site (see Figure 9): http://www.ati.com/developer/demos/r9800.html.

References

[Isidoro02] Isidoro, J. and D. Gosselin, "Bubble Shader," Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks, Wolfgang Engel, ed., Wordware Publishing, 2002, pp. 369-375.

[Jensen01] Jensen, H.W., S.R. Marschner, M. Levoy, and P. Hanrahan, "A Practical Model for Subsurface Light Transport," proceedings of SIGGRAPH 2001.

[Vlachos02] Vlachos, A., J. Isidoro, and C. Oat, "Textures as Lookup Tables for Per-Pixel Lighting Computations," Game Programming Gems 3, Charles River Media, 2002.


Figure 9: Translucent and iridescent shader-based effects are used to render the butterfly in the ATI Radeon 9800 Chimp Demo using hardware acceleration. (Image courtesy of ATI Technologies Inc.)


Floating-point Cube Maps

Arkadiusz Waliszewski

Introduction

Floating-point computations are one of the most welcome features of DX9-compatible hardware. Floating-point surfaces add more precision and make many algorithms possible to implement. Unfortunately, the current hardware line implements them with many limitations. This article addresses issues related to floating-point textures.

Floating-point Cube Maps

Cube maps are a very important texture paradigm in computer graphics. They allow texture lookups based on the direction of a three-dimensional vector. Unfortunately, not all hardware supports cube maps based on floating-point surfaces.

We can use 64-bit textures instead (16 bits per component) if we do not care about dynamic range and memory footprint. But what if the hardware also does not support this format with cube maps or we need a greater dynamic range?

The trick is to use a classic 32-bit RGBA texture to represent IEEE 32-bit floating-point values with limited precision. We pack the floating-point values (RGB) into a single 32-bit texture, where the RGB channels keep the mantissas of the original floating-point values and the A channel keeps a common exponent. This format is known as the Radiance RGBE format.

The common exponent keeps the greatest exponent from the RGB floating-point channels. Because of that, all mantissas must be normalized into a [0.5, 1] range and adjusted to the common exponent. This means a loss of precision in channels that hold values significantly smaller than the channel with the greatest exponent. This loss of data is perfectly acceptable in many areas (e.g., in high dynamic range environment mapping) because significantly smaller components would not be visible after conversion from high dynamic range values to the visible range (8-bit back buffer).

The following example illustrates this idea. The floating-point values:

R = 0.9, G = 0.01, and B = 0.1

are encoded into RGBE format as:

R = 230, G = 2, B = 25, and E (exponent) = 128

This is because 230/255 is about 0.9, 2/255 is about 0.008, 25/255 is about 0.098, and the common exponent is 2 to the power of 0 (the exponent is stored with a bias of 128, so a stored value of 128 corresponds to an exponent of 0).

The second example shows a much greater dynamic range. The floating-point values:

R = 355.0, G = 0.01, and B = 10.003

are encoded into RGBE format as:

R = 177, G = 0, B = 5, and exponent E = 137

This means the exponent is 2 to the power of 9 (a scale of 512), and the decoded values are:

R = 177/255 * 512 = 355.388
G = 0/255 * 512 = 0
B = 5/255 * 512 = 10.04

We lost all data from the green channel, but the value in the green channel is significantly smaller than those in the red and blue channels, so its contribution to the final image is insignificant.
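The encoding step itself (which the demo source on the CD performs when converting floating-point values) is not listed in the article. As a rough sketch, and assuming positive input values, it could be expressed in HLSL-style code as the inverse of the decode function shown next:

// Packs a positive floating-point RGB color into 8-bit RGBE
// (the inverse of the sampleCube decode below).
float4 encodeRGBE( float3 vColor )
{
   // The common exponent comes from the largest component
   // (a tiny epsilon avoids log2(0) for black texels):
   float fMaxChannel = max( max( vColor.r, vColor.g ), max( vColor.b, 1e-6 ) );

   // Smallest power of two that is >= fMaxChannel:
   float fExponent = ceil( log2( fMaxChannel ) );

   // Normalize the mantissas against the shared scale and store
   // the exponent biased by 128 in the alpha channel:
   return float4( vColor / exp2( fExponent ),
                  ( fExponent + 128.0 ) / 255.0 );
}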

The following HLSL pixel shader code decodes the RGBE format into an RGB high dynamic range value (it assumes that the BaseMapSampler sampler is defined and holds the cube map texture).

// Samples the cube map texture and decodes it into a high dynamic range value.
// fPos - direction vector (for the lookup into the cube map texture)
float3 sampleCube( float3 fPos )
{
   float4 tmp = texCUBE( BaseMapSampler, fPos );
   return tmp.rgb * exp2( tmp.a * 255.0 - 128.0 );
}

RGBE format covers about 76 orders of magnitude with 1 percent relative accuracy. It is perfectly acceptable in areas where the alpha channel is not needed and loss of data in some circumstances is acceptable. This solution requires all mantissas to be positive. Negative values are uncommon but can be handled with additional code and the loss of one mantissa bit.

Cube Map Filtering

Although some hardware supports floating-point cube maps, there is no hardware that supports floating-point surfaces with filtering better than point sampling. Sometimes such a limitation is not important, but usually the aliasing artifacts related to point sampling are unacceptable.

The solution to this problem is to write our own filtering routine in a pixel shader. We limit our discussion to bilinear filtering and cube maps.

First, we must understand how bilinear filtering works. Consider a one-dimensional texture. Graphics hardware converts a texture coordinate from the floating-point value (range [0, 1]) to "texel space" (range [0, width of the 1D texture]). For point sampling, the filtered value comes from the one texel that is closest to the converted texture coordinate. For bilinear sampling, the hardware fetches the two texels that lie closest to the converted coordinate (the converted coordinate is usually between two texels). The hardware then computes the contribution of each texel based on its distance from the coordinate value in texel space and lerps these two texels to produce the final pixel.

Figure 1: Filtering
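As an illustration only (this helper is not part of the article's code, and the sampler and width constants are assumptions), manual bilinear filtering of a 1D texture bound with point sampling could look like this:

sampler tTex1D;          // assumed: 1D texture bound with point sampling
float   fTextureWidth;   // assumed: width of the texture in texels

float4 bilinear1D( float fCoord )
{
   // Move the coordinate from [0, 1] into texel space:
   float fTexelPos = fCoord * fTextureWidth;

   // The two closest texel centers and the blend weight between them:
   float fTexel0 = floor( fTexelPos - 0.5 ) + 0.5;
   float fTexel1 = fTexel0 + 1.0;
   float fBlend  = frac( fTexelPos - 0.5 );

   // Fetch both texels (back in [0, 1] space) and lerp between them:
   float4 c0 = tex1D( tTex1D, fTexel0 / fTextureWidth );
   float4 c1 = tex1D( tTex1D, fTexel1 / fTextureWidth );
   return lerp( c0, c1, fBlend );
}

The cube map shader later in this article uses a slightly simplified offset convention, but the idea is the same.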

This approach is directly applicable to two- and three-dimensional textures. In the case of cube maps, however, this is not always true because of the cube map face selection step, which usually precedes the filtering phase. It means that on some hardware, cube map bilinear (or better) filtering is done "per cube face," and some artifacts are noticeable on the boundaries of the faces (you will notice them if you know what to search for).

We take another approach. We completely eliminate the face selection phase from the pixel shader. We treat the cube map texture coordinates like regular 3D texture coordinates, and we use these coordinates to fetch eight texels from the cube map and lerp them to produce the final pixel. Although this approach is not 100 percent mathematically correct, it usually produces images better filtered than built-in hardware filtering and simplifies the pixel shader code.

We can summarize this by writing some pseudocode for the algorithm:

- Multiply each component of the texture coordinate by the size of the texture (in pixels) to get the coordinate in "texel space."
- For each component, compute the two closest integer values (e.g., the value 123.4 gives 123 and 124).
- Compute the contribution of each integer part by using the fractional part of the component (e.g., 123.4 gives us integer 123 with contribution 1 - 0.4 = 0.6 and 124 with contribution 0.4).
- From these integer values, construct eight possible texture coordinates and fetch eight texels (first divide these coordinates by the texture size to get coordinates in the 0-1 range).
- Compute the final pixel value using the contributions and the fetched texels (the exact formula is in the following shader code).
exact formula is in the following shader code).


Section II — Rendering Techniques<br />

322 Floating-point Cube Maps<br />

The following HLSL shader code demonstrates this technique:<br />

float4 hlsl filtered cube shader(float3 uvw : TEXCOORD0) : COLOR<br />

{<br />

// should be defined outside the shader<br />

float3 textureSize = float3(32, 32, 32);<br />

float3 textureSizeDiv = float3(0.03125, 0.03125, 0.03125);<br />

float3 halfPixel = float3(0.5, 0.5, 0.5);<br />

float3 oneConst = float3(1.0, 1.0, 1.0);<br />

// multiply coordinates by the texture size<br />

float3 texPos = uvw * textureSize;<br />

// compute first integer coordinates<br />

float3 texPos0 = floor(texPos + halfPixel);<br />

// compute second integer coordinates<br />

float3 texPos1 = texPos0 + oneConst;<br />

// perform division on integer coordinates<br />

texPos0 = texPos0 * textureSizeDiv;<br />

texPos1 = texPos1 * textureSizeDiv;<br />

// compute contributions for each coordinate<br />

float3 blend = frac(texPos + halfPixel);<br />

// construct 8 new coordinates<br />

float3 texPos000 = texPos0;<br />

float3 texPos001 = float3(texPos0.x, texPos0.y, texPos1.z);<br />

float3 texPos010 = float3(texPos0.x, texPos1.y, texPos0.z);<br />

float3 texPos011 = float3(texPos0.x, texPos1.y, texPos1.z);<br />

float3 texPos100 = float3(texPos1.x, texPos0.y, texPos0.z);<br />

float3 texPos101 = float3(texPos1.x, texPos0.y, texPos1.z);<br />

float3 texPos110 = float3(texPos1.x, texPos1.y, texPos0.z);<br />

float3 texPos111 = texPos1;<br />

// sample cube map (using function defined earlier)<br />

float3 C000 = sampleCube(texPos000);<br />

float3 C001 = sampleCube(texPos001);<br />

float3 C010 = sampleCube(texPos010);<br />

float3 C011 = sampleCube(texPos011);<br />

float3 C100 = sampleCube(texPos100);<br />

float3 C101 = sampleCube(texPos101);<br />

float3 C110 = sampleCube(texPos110);<br />

float3 C111 = sampleCube(texPos111);<br />

// compute final pixel value by lerping everything<br />

float3 C = lerp(
    lerp(lerp(C000, C010, blend.y),
         lerp(C100, C110, blend.y),
         blend.x),
    lerp(lerp(C001, C011, blend.y),
         lerp(C101, C111, blend.y),
         blend.x),
    blend.z);

return float4(C.r, C.g, C.b, 1.0f);
}

Conclusion

This article showed how to overcome the lack of floating-point cube maps and<br />

floating-point surface filtering. By combining the two techniques presented here,<br />

we can use high dynamic range cube maps on hardware that does not support<br />

them natively.<br />

Demo

A simple demo application on the CD shows the presented techniques in action.

The left part of Figure 2 shows the classic teapot rendered with a pixel shader<br />

that implements the described algorithms, whereas the right part is rendered<br />

with a standard 32-bit cube map. Corresponding source code also shows how to<br />

simply convert floating-point values into an RGBE format.<br />
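As a rough illustration of what such a conversion looks like (the demo source on the CD is the authoritative version; the function below is only a sketch of the common shared-exponent RGBE layout, with names of our own choosing):

#include <algorithm>
#include <cmath>

// Illustrative sketch: packs a linear float RGB color into four bytes (R, G, B, E),
// where E is a shared exponent stored with a bias of 128.
void EncodeRGBE(float r, float g, float b, unsigned char out[4])
{
    float maxComp = std::max(r, std::max(g, b));
    if (maxComp < 1e-32f)
    {
        out[0] = out[1] = out[2] = out[3] = 0;   // treat (near) black as all zeros
        return;
    }
    int exponent;
    float mantissa = std::frexp(maxComp, &exponent);   // maxComp = mantissa * 2^exponent
    float scale = mantissa * 256.0f / maxComp;         // maps maxComp into [128, 256)
    out[0] = (unsigned char)(r * scale);
    out[1] = (unsigned char)(g * scale);
    out[2] = (unsigned char)(b * scale);
    out[3] = (unsigned char)(exponent + 128);
}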

Figure 2: Using a pixel shader vs. a cube map<br />



Stereoscopic Rendering in<br />

Hardware Using Shaders

Thomas Rued<br />

Introduction<br />

Even though we push more polygons than ever before, use modern rendering
hardware, and enjoy ever-increasing resolution and bit depth, we
still end up letting our projection matrix project our 3D space into a 2D image that

is presented on the screen. This works fine for many applications, but since we<br />

are striving for more photorealistic images and illusions of the real world as we<br />

see it, it is only natural to expand this 2D imaging into a full 3D stereoscopic<br />

image.<br />

Volumetric visualization of medical data is only one area where the perception<br />

of depth can be critical. Games and virtual reality applications can also use<br />

the third axis as a cool gimmick for the users. Earlier, this was only done by using<br />

expensive VR equipment, shutter-glasses, and the like.<br />

The anaglyph (red/green) method (invented by Wilhelm Rollmann in 1853)<br />

is an easy and cheap way to get started with stereoscopy. Early on, it could only be
implemented in software; with the advent of rendering hardware, pixel processing
power increased, but at first the technique was limited to whatever capabilities
the rendering hardware had.

With the appearance of programmable graphics hardware, the stereoscopic

possibilities opened up even more. Now it is possible to use more than just traditional<br />

anaglyph encoding. New innovative ways of rendering stereo can be developed<br />

and implemented directly through the use of shaders.<br />

Stereoscopic Rendering Overview<br />

Stereoscopic rendering can be done in many ways, depending on the needs,<br />

equipment, and requirements of the final picture.<br />

Active and Passive Stereo<br />


When using special glasses that actually take part in the encoding of the stereoscopic<br />

image, it’s called active stereo. Typical systems using active stereo are<br />

shutterglass systems. When the glasses don’t do anything other than decode the


image, it is called passive stereo. Passive stereo is the area that we are going to<br />

explore in this article, since this is the technique we can implement and use<br />

through pixel shaders on modern 3D rendering hardware.<br />

Generic Stereoscopy<br />

To do stereoscopic rendering in a generic way, we need to expand our normal geometric<br />

pipeline a bit. First, we work with two cameras instead of one (left/right<br />

eye). Second, we need to do a final compositing of the left and right images to create<br />

the final stereoscopic image presented on the screen.<br />

The compositing itself requires temporary storage of image data in a render<br />

target and good control of the viewport and FOV. We go through the setup and use<br />

of these primitives.<br />

Even though there are a handful of different ways to actually compose the<br />

stereo image, they all share the same virtual camera setup. This is the best way<br />

to have a good stereo image. Consequently, we use a fair amount of space on a
detailed description of that aspect.
We also look at stereoscopic artifacts and what we can do about them. Finally, we

go through three different shaders doing stereoscopic compositing.<br />

Stereoscopic Camera Setup<br />

Basic Scene<br />


Before rendering anything, we have to set up our camera(s) in the correct way.<br />

Failing to do this results in an incorrect stereo image, which in turn leads to depth
perception anomalies and an incorrect visualization of the geometry in the scene.

For our demonstration purposes, we assume a simple scene consisting of an<br />

object and a camera. For stereoscopic rendering, we need two cameras. But for<br />

this part, we assume a single camera setup, since the double scene rendering due<br />

to an extra camera pass is trivial.<br />

Figure 1: Our basic test scene with an object and a camera



Viewport and Aspect Ratio<br />

Since we are using render targets for our intermediate results and our final<br />

compositing is done using shaders, we need to use textures for render targets.<br />

As with all textures, we have to select power-of-two-sized textures. This isn't exactly

what we want for 800x600 display modes, etc., but with a little help from our projection<br />

matrix, field of view, and aspect ratio, we can render to the full target that<br />

when scaled to the screen looks correct.<br />

Figure 2: Basic scene rendered to a<br />

quadratic render target texture with<br />

respect to the aspect ratio of the<br />

final destination<br />

Figure 3: Render target rendered on final<br />

destination. Now the aspect ratio of the<br />

image is correct, but scaling artifacts may<br />

occur.<br />

First we set the virtual camera viewports to the actual render target size (512x<br />

512, for example). Then we set up the projection matrix to render the scene as if<br />

it was actually rendered to our destination viewport size (1024x768, for example).<br />

This is accomplished by specifying a horizontal and vertical FOV that corresponds<br />

to the aspect ratio of the 1024x768 viewport.<br />

If we choose to set a horizontal FOV, we have to calculate the vertical FOV<br />

by using the formula below:<br />

Vertical FOV (radians) = 2 * tan^-1((1 / aspect) * tan(horizontal FOV / 2))

For example: for a 1024x768 viewport, the traditional 4:3 aspect will be 1.333333.<br />

Setting the horizontal FOV to 60 degrees (1.047 radians) results in a vertical FOV<br />

as follows:<br />

Vertical FOV (radians) = 2 * tan^-1(0.75 * tan(1.047 / 2)) ≈ 0.82 radians (about 47 degrees)
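For reference, a small C++ helper (the naming is ours, not from the demo) that applies this formula directly:

#include <cmath>

// Computes the vertical FOV (radians) from a horizontal FOV (radians) and an
// aspect ratio (width / height), following the formula above.
float VerticalFovFromHorizontal(float horizontalFov, float aspect)
{
    return 2.0f * std::atan(std::tan(horizontalFov * 0.5f) / aspect);
}

// Example: VerticalFovFromHorizontal(1.047f, 4.0f / 3.0f) returns roughly 0.82 radians.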

Doing this makes use of the total allocated texture size and still gets a correct<br />

aspect on the final image when it is used on the original destination viewport.<br />

However, if a texture with a different size than the actual viewport is used, the<br />

final image will contain scaling artifacts. In practice, these artifacts can be limited<br />

by using traditional texture filtering.<br />

If you need a full 1:1 mapping between render target pixels and screen pixels,<br />

you have to make a render target texture with at least the size of the destination<br />

viewport and then only use a subpart of this texture. By doing this you will not<br />

need to set up a special projection matrix. You can render directly to the render


target, using the same viewport and projection matrix as you would use for
non-render-target rendering. Just remember that later on you need to use special

texture coordinates to obtain the rendered part of the render target. Also, you<br />

will not make use of the total amount of allocated render target memory.<br />

Figure 4: Basic scene rendered to an<br />

oversized quadratic render target,<br />

resulting in waste of expensive<br />

memory<br />

Camera Optics<br />

Even though the camera FOV in computer graphics typically is expressed in<br />

terms of degrees and/or radians, it is often practical to express it in mm. In that<br />

way, we get closer to the real photographic world, but we also get values that we<br />

can reuse later in the basis calculation (see below). The FOV is translated from<br />

mm to radians below:<br />

Horizontal FOV (radians) = 2 * tan^-1(36 / (2 * mm))

The vertical FOV can be calculated by using the formula above.<br />
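A corresponding helper (again only a sketch with our own naming) could look like this:

#include <cmath>

// Converts a focal length in mm (assuming the 36 mm film width used above)
// into a horizontal FOV in radians.
float HorizontalFovFromFocalLength(float focalLengthMm)
{
    return 2.0f * std::atan(36.0f / (2.0f * focalLengthMm));
}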

Figure 5: Render target rendered on the final destination. Only the original subpart of the quadratic render target has been retrieved.

Depth Buffer

Setting up render targets for scene rendering requires the attachment of a depth<br />

buffer. We need two render targets but can reuse the depth buffer between render<br />

sessions:<br />

Clear left render target and common depth buffer<br />

Render scene to left render target<br />

Clear right render target and common depth buffer<br />

Render scene to right render target<br />


In this way, we reuse the depth buffer memory. You must make sure that you do<br />

not use a multipass render back end, as that presumes a certain state of the depth<br />

buffer between passes.
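In Direct3D 9 terms, the sequence above might be sketched as follows (the pointer names are placeholders, the scene rendering calls are omitted, and error checking is left out):

#include <d3d9.h>

// Renders both eyes into their render target surfaces while sharing one
// depth-stencil surface, as described above.
void RenderBothEyes(IDirect3DDevice9* device,
                    IDirect3DSurface9* leftTarget,
                    IDirect3DSurface9* rightTarget,
                    IDirect3DSurface9* sharedDepth)
{
    device->SetDepthStencilSurface(sharedDepth);

    device->SetRenderTarget(0, leftTarget);
    device->Clear(0, NULL, D3DCLEAR_TARGET | D3DCLEAR_ZBUFFER, 0, 1.0f, 0);
    // ... render the scene with the left camera ...

    device->SetRenderTarget(0, rightTarget);
    device->Clear(0, NULL, D3DCLEAR_TARGET | D3DCLEAR_ZBUFFER, 0, 1.0f, 0);
    // ... render the scene with the right camera ...
}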



Parallax, Exo-, Endo-, and Meso-Stereo<br />

In general, the amount of stereo can be expressed as the parallax in the picture.<br />

The zero parallax is located at the screen plane with the positive parallax inside<br />

the screen and the negative outside. When all the objects in a scene are in front of<br />

the screen plane, having only negative parallax, it is called exo-stereo, and when<br />

all the objects in a scene are behind the screen plane, having only positive parallax,<br />

it is called endo-stereo. Normally, the scene is composed in a way where both<br />

a negative and positive parallax is present, resulting in meso-stereo.<br />

We can calculate a correct parallax difference for the scene by using:<br />

d = A / n

where:

d = parallax difference

A = the distance between the eyes equal to 65 mm (international standard value)<br />

n = the scale factor between the physical width of the filmstrip and the width of<br />

the screen<br />
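For example, a small helper (names are ours) that computes the parallax difference for a given screen width:

// Parallax difference d = A / n, where A is the eye separation (65 mm) and
// n is the ratio between the physical screen width and the 36 mm film width.
float ParallaxDifference(float screenWidthMm, float eyeSeparationMm = 65.0f)
{
    float n = screenWidthMm / 36.0f;
    return eyeSeparationMm / n;   // a 1500 mm wide screen gives about 1.56 mm
}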

Figure 6: Visualization of the parallax in a stereoscopic scene. Notice that the object<br />

outside the screen plane has a lighter shadow to the right and the object inside the<br />

screen plane to the left. The object located at the screen plane doesn’t have any<br />

parallax (zero parallax).<br />

We started by setting the camera’s horizontal FOV and aspect ratio with respect<br />

to the render target size and frame buffer destination viewport size. Then we<br />

found the needed parallax for the scene. Now it’s time to set up the relationship<br />

between the two cameras — the relationship between “the two eyes,” so to<br />

speak.


Basis<br />

First, let me point out that the basic way you set up two virtual cameras for any<br />

kind of stereoscopic rendering is in a parallel way. Some systems wrongly make<br />

use of converging cameras, which doesn’t result in a correct stereo simulation.<br />

The only way to set the cameras up (both virtual and real cameras) is to let their<br />

aiming lines be parallel.<br />

Figure 7: Converging cameras. This<br />

is the incorrect way to set up the<br />

camera’s aiming lines.<br />

That said, the distance between the camera’s aiming lines is denoted as the basis.<br />

To define a good basis is the “holy grail” of getting good stereo perception.<br />

Calculating the basis can be done in different ways, depending on the scene<br />

rendered and the available amount of data.<br />

Optimal Depth<br />

When rendering a big scene, it’s difficult to accurately define where the far point<br />

is. In that case, we calculate the basis in the following way:<br />

b = d * ((N / f) – 1)

where:

b = Basis
d = Parallax difference in the final picture
N = Near point
f = Focal length

Figure 8: Parallel cameras. This is the correct way to initially set up the camera's aiming lines.

Maximum Depth

If the far distance is known (which is the case for a simple model viewer and the<br />

like), we can estimate the near and far point exactly and calculate the basis:<br />

b = (d * F * N) / (f * (F – N))

where:<br />

F = Far point



Notice that the near and far points aren’t necessarily the same as the near and far<br />

z clip planes of the projection matrix.<br />

Here is an example: We have a simple scene with a known far point distance, so<br />

we use the latter basis formula:<br />

A = 65 mm<br />

n = 1500 mm / 36 mm = 42, normal PC screen and filmstrip<br />

d = 65 mm / 42 = 1.56 mm

F = 900 mm<br />

N = 300 mm<br />

f = 35 mm, typical wide-angle lens<br />

b = (1.56 mm * 900 mm * 300 mm) / (35 mm * (900 mm – 300 mm))
b = 20 mm

Depending on our scene, we have to scale this from millimeters into world units.<br />
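The calculation above translates directly into code; here is a small sketch (our own naming) of the maximum depth formula:

// b = (d * F * N) / (f * (F - N)); all distances must be in the same unit.
float StereoBasis(float parallaxDiff,   // d
                  float nearPoint,      // N
                  float farPoint,       // F
                  float focalLength)    // f
{
    return (parallaxDiff * farPoint * nearPoint) /
           (focalLength * (farPoint - nearPoint));
}

// StereoBasis(1.56f, 300.0f, 900.0f, 35.0f) returns roughly 20 (mm).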

Basis Factor<br />

Often, the basis is calculated through the use of a basis factor (bf). We can calculate<br />

the basis factor based on the optimal depth formula above and deduce the formula<br />

below:<br />

bf = N/b<br />

where:<br />

bf = basis factor<br />

The basis factor is used to recalculate the basis in situations where the scene<br />

setup is the same and only the camera translation and rotation has changed. In<br />

these situations, the near point changes a lot, and we can benefit from recalculating<br />

the basis the following way:<br />

b = N/bf<br />

where:<br />

bf = basis factor found from the initial scene setup<br />

Giantism and Lilliputism<br />

It should be noted that if the camera basis is smaller than the distance between<br />

the eyes (retinal disparity), the result is called giantism. If the basis is bigger than<br />

the distance between the eyes, it is called lilliputism. These artifacts can be useful<br />

in situations where the amount of stereo isn’t good enough. If we are simulating a<br />

large terrain, it’s difficult to get good stereo from objects far away. In this case, we<br />

can use a larger basis resulting in artificial depth on the distant objects (lilliputism<br />

– model effect). The opposite scenario could also be the case if we had a game concentrated<br />

on nano-technology and we needed to visualize small objects. In this<br />

case, we can get a good stereo effect by using a smaller basis (giantism).<br />

Now that our basis is set up correctly, let’s talk about how these two virtual<br />

cameras are used within a dynamic scene with frequent camera transformation<br />

updates.


Camera Transformation<br />

Since it is important that the two virtual cameras maintain their basis within the<br />

lifetime of the application, we must guarantee that the transformations made to<br />

the cameras don’t break this individual relationship. So, instead of using a traditional<br />

lookat function directly on the left and right camera, you should use a<br />

dummy matrix.<br />

If our scene graph contains a forward kinematics/attachment system, it’s<br />

easy to make a dummy actor and attach the two cameras to this actor, which is the<br />

one controlled and moved around by the user. If that isn’t the case, we need to use<br />

a matrix stack or multiply the dummy element and camera matrices. In either<br />

case, we have to set the initial transformation for the virtual cameras.<br />

Initial Transformation<br />

This is a fairly simple step. By initializing the virtual camera’s transformation<br />

matrix to the identity matrix and only offsetting them in x direction by the basis,<br />

we can multiply them with the camera dummy’s transformation matrix and maintain<br />

the correct basis even after the camera is translated and rotated arbitrarily.<br />

A common mistake is to offset the cameras with a +/– basis, resulting in a<br />

distance between the cameras of 2*basis. This can be fixed by offsetting with +/–<br />

basis/2.<br />

1 0 0 +/– Basis/2<br />

0 1 0 Y<br />

0 0 1 Z<br />

0 0 0 1<br />
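Using D3DX, the offset can be built as in the following sketch (function and variable names are ours; note that D3D's row-vector convention stores the translation in the fourth row rather than the fourth column shown above):

#include <d3dx9.h>

// Builds the left and right camera world matrices from a single "dummy" matrix
// that the application moves around, offsetting each eye by half the basis.
void BuildEyeMatrices(const D3DXMATRIX& dummyWorld, float basis,
                      D3DXMATRIX* leftWorld, D3DXMATRIX* rightWorld)
{
    D3DXMATRIX offsetLeft, offsetRight;
    D3DXMatrixTranslation(&offsetLeft,  -basis * 0.5f, 0.0f, 0.0f);
    D3DXMatrixTranslation(&offsetRight, +basis * 0.5f, 0.0f, 0.0f);

    *leftWorld  = offsetLeft  * dummyWorld;   // offset first, then the dummy transform
    *rightWorld = offsetRight * dummyWorld;
}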


The illustrations below show the setup. Here the cameras are initially positioned<br />

attached to a dummy — or at least transformed by a dummy matrix. Then the<br />

“stereo-camera” can be moved around freely.<br />

Figure 9: Initial x offset of<br />

the cameras to set up the<br />

parallel aiming lines with<br />

the basis distance<br />

Figure 10: The cameras are attached to a dummy or<br />

manipulated by a matrix stack.



Figure 11: The dummy object can be moved freely around the scene. Traditional matrix math guarantees that the parallel aiming lines and the basis are maintained.

Stereo Window

When looking at a stereo image that has a correct basis setup, we will typically<br />

have some objects in the scene behind the screen plane and some objects in front<br />

of the screen plane.<br />

For static images or prerendered movies, we have a good chance of controlling<br />

how and where this is happening, but for real-time graphics, we can’t guarantee<br />

what the user is doing and the user will consequently get into situations<br />

where the stereo perception will break down (diplopia).<br />

The problem arises when an object located outside of the screen is suddenly<br />

clipped by the screen’s border. This doesn’t happen in the real world. Neither will<br />

an object that is located farther away block an object located nearer in a real-world<br />

situation. In these situations, the brain receives conflicting depth information and<br />

the stereoscopic effect is destroyed.<br />

One trick is to use blurry borders (StereoWindowSuppress), since this will<br />

blur out the artifact for the brain and make it more believable. However, this<br />

doesn’t look good on ordinary scenes, but for a sniper rifle or submarine periscope,<br />

this could be a good solution.<br />

A general solution though is to move the conflicting object behind the stereo<br />

window, resulting in a more positive parallax. In practice, this means that we push<br />

our scene into the screen until the objects clipped by the border are behind the<br />

border, resulting in correct stereo perception by the brain.<br />

Figure 12: An object located in front of<br />

the stereo window but clipped by the<br />

border of the screen. This results in bad<br />

stereo perception.<br />

Figure 13: The object is moved behind the<br />

stereo window, resulting in a correct<br />

perception of depth.


Image Offset<br />

Traditionally, offsetting the “left” and “right” images in the horizontal direction<br />

does this. The “left” is moved in one horizontal direction, and the “right” is<br />

moved in the other. Consequently, this results in missing pixels at the left and<br />

right borders. Since we are working in a virtual world, we do not always have to<br />

do things the traditional way, so why not do the offsetting at the matrix projection<br />

level?<br />

Matrix Offset<br />

By offsetting at the matrix projection level, we do exactly the same thing as the<br />

image offsetting, but since we are doing it at matrix level, we do not get the artifacts<br />

of missing pixels. To do this, we need to look at the projection matrix:<br />

?? 0 +/– stereo window offset 0<br />

0 ?? ?? 0<br />

0 0 ?? ??<br />

0 0 1 0<br />
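In code, and assuming D3D's row-vector matrix layout, one way to apply this is to adjust the _31 element of the projection matrix (the sign and magnitude are scene dependent):

#include <d3dx9.h>

// Adds a horizontal stereo window offset at the projection level. Because the
// added term is multiplied by z and later divided by w (= z), it produces a
// constant shift in screen space without losing border pixels.
void ApplyStereoWindowOffset(D3DXMATRIX& projection, float offset)
{
    projection._31 += offset;   // use +offset for one eye and -offset for the other
}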

Notice that even though we pushed the rendered scene into the screen in this<br />

situation to fix a stereo artifact, we could also use the same technique to move<br />

things out of the screen, making them fly over the keyboard. Just make sure that<br />

the screen borders don’t clip the objects, since that ruins the stereo effect, as<br />

stated earlier. Also make sure that a hyper-/hypo-stereo artifact doesn’t appear.<br />

Stereo Window — Basis Relationship<br />

Since both the basis and stereo window are manipulating the parallax of the final<br />

stereoscopic image, it should be noted that these two depend on each other.<br />

Sometimes we have to readjust the basis when the stereo window is set and vice<br />

versa. Unfortunately, the relationship in many ways depends on the given scene,<br />

so a general rule can’t be set forth. It is a subjective evaluation by a person used<br />

to looking at stereoscopic images. Advanced formulas can be used, but in the end<br />

it still depends on the scene’s content. So, for this article we do not go into depth<br />

on that aspect.<br />

Compositing<br />


When rendering a scene in stereo, we need to render the scene twice, one for<br />

each eye. Two virtual cameras (left and right eye) render the scene from its point<br />

of view, resulting in a slight difference, which is used by our brain to calculate the<br />

depth. This is exactly the same case when we look at the real world through our<br />

eyes.<br />

We cannot render the scene directly to the frame buffer, since we have to<br />

post-process the two rendered scenes into a final stereoscopic image. We



therefore store the intermediate result in a separate render target for later use,<br />

one for each eye.<br />

Render Targets<br />

In DirectX, render targets are either handled as a surface (IDirect3DSurface9) or

as a texture (IDirect3DTexture9). Since we need the render target to be able<br />

to be used as a texture later on, we do not use the CreateRenderTarget(...)<br />

method, since it creates a surface (IDirect3DSurface9). Instead, we use the traditional<br />

CreateTexture(...) method. We have to create a new texture by using the<br />

D3DUSAGE_RENDERTARGET flag, and we have to allocate it in the default memory pool<br />

by using the D3DPOOL_DEFAULT. The memory is updated per frame and is therefore<br />

too expensive for the texture manager to update. On the downside, this means that

we have to check if the target is lost and restore/recreate it if this is the case.<br />
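A minimal creation sketch (error handling omitted; the format and size are chosen only for illustration) could look like this:

#include <d3d9.h>

// Creates one render target texture per eye, as described above. Because the
// resource lives in D3DPOOL_DEFAULT, it must be released and recreated when
// the device is lost.
IDirect3DTexture9* CreateEyeRenderTarget(IDirect3DDevice9* device, UINT size)
{
    IDirect3DTexture9* texture = NULL;
    device->CreateTexture(size, size, 1, D3DUSAGE_RENDERTARGET,
                          D3DFMT_A8R8G8B8, D3DPOOL_DEFAULT, &texture, NULL);
    return texture;   // NULL on failure; real code should check the HRESULT
}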

For the final image presentation, a screen-sized quad is used, which renders<br />

the scene by using the two render targets as texture inputs and using a pixel<br />

shader to do the actual mixing of the pictures.<br />

Vertex Format

The quad’s vertex format can be constructed in different ways, depending on what<br />

kind of information we need. One common part though is the transformed position<br />

of the position.<br />

struct Vertex<br />

{<br />

float x,y,z,w;<br />

.<br />

.<br />

.<br />

};<br />

After this declaration, we continue on and allocate a vertex buffer containing the<br />

four corners of the screen (0,0,1024,768). Remember to set the w value to 1.<br />

LPDIRECT3DVERTEXBUFFER9 vb;<br />

CreateVertexBuffer(4*sizeof(Vertex), 0, D3DFVF_XYZRHW | ..., D3DPOOL_MANAGED, &vb, NULL);

Rendering

Later on in the rendering pipeline, we do the actual rendering by setting the pipeline

up in the following way:<br />

Set the current stream: SetStreamSource(0, vb, 0, sizeof(Vertex));<br />

In this case we do not use any vertex shader: SetVertexShader(NULL);

Set the vertex format used in the creation: SetFVF(D3DFVF_XYZRHW | ...);<br />

Set the stereoscopic pixel shader: SetPixelShader(...);

Set left render target: SetTexture(0,...);<br />

Set right render target: SetTexture(1,...);<br />

Draw the quad: DrawPrimitive(D3DPT_TRIANGLESTRIP,0,2);


Pixel Shader Implementation

Doing stereoscopic rendering using pixel shaders doesn’t include shutter glasses,<br />

polarized glasses, and similar devices, since these solutions require a special<br />

hardware setup. We only make use of the 3D hardware, the monitor, and some<br />

special glasses.<br />

We look into three different methods: the traditional anaglyph, an enhanced<br />

anaglyph (ColorCode 3-D), and finally the ChromaDepth system.<br />

Traditional Anaglyph

Anaglyph is widely known from the ’50s as the special “red/green” glasses that<br />

people use in cinemas to get an extra illusion of realism. The system is quite simple<br />

and only requires some cheap cardboard viewers with a red filter covering the<br />

left eye and a green filter covering the right. The system gives good perception of<br />

depth but lacks the reproduction of color. Later the system was expanded to be a<br />

red/cyan solution, in which colors were better preserved since no color information<br />

was discarded. A red/cyan pixel shader can be found in the following code.<br />

ps.1.1 // GeForce 3 class hardware support<br />

def c0, 1.0f, 0.0f, 0.0f, 0.0f // R separation mask (for left).<br />

def c1, 0.0f, 1.0f, 1.0f, 0.0f // GB separation mask (for right).<br />

tex t0 // Declaration of left.<br />

tex t1 // Declaration of right.<br />

// Left<br />

mul r1, t0, c0 // Remove GB channels.<br />

// Right<br />

mad r0, t1,c1, r1 // Add GB to R.<br />


The shader starts off by removing the GB channels from the left RGB input. This<br />

is done by multiplying R*1, G*0, and B*0; the result is saved in r1. The second<br />

step multiplies the right RGB input: R*0, G*1, and B*1 and adds the saved result<br />

from r1. Finally, it outputs the result to r0. In short, the shader takes the R from<br />

the left and the GB from the right and forms a new RGB value that is output.<br />

Colors

Even though the red/cyan anaglyph system was a great improvement over the
red/green and red/blue systems, it didn't make it all the way. The problem

is that the red/cyan filters separate both color and depth, and the eye-brain<br />

system can only manage to recombine the depth information — not the colors.<br />

This has resulted in a quite good stereo effect but very limited and distorted colors,<br />

plus a very possible stereoscopic sheen.



ColorCode 3-D

In 1997 a Danish scientist found that instead of splitting both color and depth<br />

between the eyes, it would be easier for the brain to use one eye for color and one<br />

for depth. The result was ColorCode 3-D. In many ways the ColorCode 3-D system<br />

is similar to the traditional anaglyph system. The big difference is the way<br />

the picture is stereo encoded and later decoded by the glasses and the brain. At<br />

first, it seems like the ColorCode 3-D system hurts the eyes more than the traditional<br />

anaglyph system, but after a few seconds the brain adapts to this system<br />

and you experience stereo in full color.<br />

Ghosting

Shutter glasses, polarized glasses, and anaglyph-based systems all lack a 100

percent perfect separation of the left and right image. Consequently, this results<br />

in a “ghosting” effect, where some of the left image can be seen by the right eye<br />

and vice versa. The ColorCode 3-D system has greatly reduced this problem by<br />

using a finalizing step in its encoding of pixels. By implementing a special ghost-removing

algorithm, special color correction steps, etc., the final result gets rid of<br />

most of the ghosting and other color artifacts.<br />

Since the appearance of programmable hardware, the ColorCode 3-D encoding<br />

and color correction/ghost removal steps have been possible to implement<br />

using pixel shader 1.1 and above. It must be implemented as a two-step solution<br />

on ps < 1.4 and can be implemented as a single-step solution on ps >= 1.4 by<br />

using the phase instruction. The system is protected by patent and license, so I<br />

can’t show the full pixel shader here, but a pseudocode version of pixel shader 1.4<br />

can be found in the following code. Additionally, a screen shot of the output is<br />

shown in Color Plate 18.<br />

ps.1.4 // ATI Radeon 8500 class hardware support

def c0, 1.0f, 1.0f, 0.0f, 0.0f // RG separation mask (for left).<br />

def c1, 0.0f, 0.0f, 1.0f, 0.0f // B separation mask (for right).<br />

def c2, ... // Weights for right blue channel.<br />

def c3, 1.0f, 1.0f, 1.0f, 0.0f // Mask for collapsing weighted RGB values into B.<br />

texld r0,t0 // Declaration of left.<br />

texld r1,t1 // Declaration of right.<br />

// Left – calculate color<br />

mul r3, r0, c0 // Remove B channel, and store in r3<br />

// Right – Calculate depth<br />

mul r0, r1, c2 // Compose new B value as a weighted RGB value<br />

dp3 r0, r0, c3 // Collapse RGB values into a grayscale<br />

mad r0, r0, c1, r3 // Separate B and add to existing RG result in r3


phase<br />

texld r2,r0 // Dependent lookup in a volume texture using rgb as uvw<br />

mov r0,r2 // Result of full color-converted ColorCode encoding<br />

The shader starts by removing all blue from the left RGB. A balanced grayscale<br />

is then made from the right RGB values, which are added as the blue channel.<br />

Lastly, the result is stored in r0. In ps



The ChromaDepth encoding is split into two shaders. The vertex shader<br />

retrieves the z, scales it to fit within a [0..1] range value, and outputs it as a texture<br />

coordinate. Once that is done, the pixel shader only needs to take the z<br />

(located in t0.x), adjust the calculated texture coordinates (to simulate the stereo<br />

window), and do a lookup into a 1D texture. The texel retrieved is then output.<br />

Stereo Comparison

Looking at the different results from the three types of encoding shaders, it’s<br />

clear that they all give a feeling of depth but are very different at the same time.<br />

Anaglyph

The traditional anaglyph is easy to implement, and it is widely used. Consequently,

a lot of viewers are already out there, resulting in a lot of potential users.<br />

It also gives good quality stereo. On the downside, it is lacking in the reproduction<br />

of color.<br />

ColorCode 3-D<br />

ColorCode 3-D takes care of the color problem. It’s a bit more difficult to use for<br />

the very first time, but after a few seconds, the picture is clearly much better. The<br />

stereo quality is even comparable with the much more expensive shutter glass<br />

systems because the ColorCode 3-D system uses a post-processing step where<br />

the picture is finalized and the stereo artifacts removed. On the downside,<br />

ColorCode 3-D isn’t that widely used, and it is protected by a patent, so you have<br />

to contact ColorCode 3-D Ltd. for more information on the full hardware-accelerated<br />

ColorCode 3-D implementation.<br />

ChromaDepth<br />

ChromaDepth is the only system using clear separation filters. This results in<br />

both good stereo and clean colors. Unfortunately, the whole system’s depth<br />

encoding is based on the visible spectrum of colors to reproduce stereo. So, for<br />

generic use, the system isn’t very usable. Also, it should be noted that since the<br />

system uses clear filters, the separation isn’t that good, resulting in a lot of ghosting.<br />

However, if we need to create artistic stereo that from the start has been<br />

designed for ChromaDepth, the system is good. Finally, we have shown that a<br />

ChromaDepth shader can be implemented as a visualization of the depth buffer,<br />

resulting in minimal rendering overhead (the scene needs to be rendered only<br />

once), so it is clearly the fastest method of doing stereo if colors aren’t needed.


Conclusion<br />

It has been proven scientifically that stereo images improve visual perception by<br />

as much as 400 percent. Stereo is a direction on the road to a more realistic simulation<br />

of reality. Therefore, stereo must be considered an option on the same level<br />

as more geometry, better light calculations, smoother shadows, more detailed textures,<br />

etc.<br />

Viewing Ortho-, Hypo-, and Hyper-stereo<br />

It should be noted that when a given scene is viewed from the same initial position<br />

as it was first recorded, it is called ortho-stereo. The normal goal for stereoscopic<br />

rendering is to achieve the highest possible reproduction of the initial<br />

scene. However, if the scene is viewed from a closer distance, it results in less<br />

stereo (hypo-stereo), and if the distance is larger, there is more stereo (hyper-stereo).<br />

These general rules should be kept in mind when presenting stereoscopic<br />

images on bigger screens for audiences, etc.<br />

Future Implementation<br />

Even though these general stereoscopic setup rules hold for both current and<br />

future stereoscopic systems that make use of virtual camera setups, there are still<br />

issues to consider. Traditionally, game development has included a lot of tricks and<br />

hacks to make it all work in real time. One of these hacks is to make use of<br />

imposters for complex imaging. Particles, flares, beams, trees, grass, etc., all<br />

make use of some kind of “quad-based” system. This works fine for traditional<br />

non-stereo 3D applications, but if translated into a stereo system directly, these<br />

illusions are suddenly noticeable by the viewer. A flat quad-based flare appears flat<br />

and consequently destroys the illusion of a glowing ball of fire.<br />

The solution is to use even more advanced shaders, where depth is considered<br />

and a kind of fake depth image is made, resulting in a correct visual appearance<br />

when viewed in stereo. Depth sprites might be one of the steps in the right<br />

direction, but research must be done in this area.<br />

Acknowledgments<br />


Svend B. Sørensen, ColorCode 3-D Ltd., www.colorcode3d.com.<br />

Tammy Barnett and William K. Chiles, American Paper Optics,<br />

www.chromatek.com.<br />

Niels Husted Kjær, Medical Insight, www.medical-insight.com.<br />

Mark Rudings, Titoonic, www.titoonic.com.<br />

Illustrations: Kristina Gordon, Dotrix, www.dotrix.dk.<br />

Modeling: Thomas Suurland, www.suurland.com.<br />



Hatching, Stroke Styles, and<br />

Pointillism<br />

Kevin Buchin and Maike Walther<br />

Introduction<br />

Hatching is a common technique used in non-photorealistic rendering (NPR). For<br />

hatching, a series of strokes are combined into textures. These compositions of<br />

strokes can convey the surface form through stroke orientation, the surface material<br />

through stroke arrangement and style, and the effect of light on the surface<br />

through stroke density.<br />

Up until now, an important issue of real-time hatching techniques has been<br />

how to employ the limited programmability of the graphics hardware currently<br />

available. Pixel programmability has now reached a state where we can shift the<br />

focus to adding more flexibility to the hatching scheme and combining hatching<br />

with other techniques for creating new effects.<br />

We present a hatching scheme and some extensions to it, namely changing<br />

the stroke style interactively and hatching with specular highlights. Then we<br />

show how we integrate hand drawings into a scene, taking into account the effect<br />

of lighting. Finally, we show how to choose a color for each stroke — depending<br />

on the background color — that can be used for a pointillistic style.<br />

Approaches to Hatching<br />


For hatching, strokes have to be chosen from a collection of possible strokes to<br />

convey some tonal value. A possible approach to this problem is to think of each<br />

stroke having a priority and choose strokes according to their priority (i.e., using<br />

only the most important strokes for light tonal values and adding less important<br />

strokes in areas of darker tonal values). Such collections of strokes are called prioritized<br />

stroke textures [Winkenbach94] and can be seen as the basis for current<br />

hatching schemes.<br />

For real-time hatching, stroke textures for some specific tonal values and different<br />

mipmap levels can be precomputed and blended at run time according to<br />

the given tonal value [Praun01]. To maintain a constant stroke width in screen<br />

space, the mipmap levels contain strokes of the same texel width. Thus, higher<br />

mipmap levels contain fewer strokes than lower mipmap levels for representing


the same tonal value. This technique can be implemented using pixel shaders, as<br />

presented in the first ShaderX book [Card02].

Prioritized stroke textures can also be implemented using a thresholding<br />

scheme (i.e., encoding intensity thresholds and information on resulting color<br />

values in a texture). For instance, this information can be differences in tone<br />

[Webb02]. While strokes fade in gradually when blending stroke textures for given<br />

tonal values, using a thresholding scheme lets strokes appear more suddenly.<br />

There is more to hatching than the actual rendering. In particular, texture-based

hatching only works well with an appropriate texture parameterization. An<br />

overview of the complete hatching process is given in [Domine01]. Here we focus<br />

on the shader used for hatching.<br />

Our Thresholding Scheme<br />


Our approach is to encode a stroke by its color and intensity threshold. Figure 1<br />

shows a sample stroke texture with grayscale values (a) and corresponding intensity<br />

thresholds (b). An advantage of this approach is that we don’t need to decide<br />

how the stroke color is combined with the background color (for instance, adding,<br />

overlaying, modulating, or replacing the background color) when creating the<br />

textures.<br />

Figure 1: (a) A texture containing the stroke colors as grayscale values, (b) a texture<br />

containing the corresponding intensity thresholds<br />

The stroke colors can be stored in the RGB channels of a texture and the corresponding<br />

intensity thresholds in the alpha-channel of the same texture. To be able<br />

to distinguish the color values and intensity thresholds of the different strokes in<br />

one texture, the texture may not contain overlapping strokes. For drawing overlapping<br />

strokes, we use several textures (for instance, two for horizontal strokes<br />

and two for vertical strokes). Instead of actually using several textures, we can<br />

reuse one stroke texture by translating and/or rotating the original texture coordinates.<br />

To keep the pixel shader simple, we do this in the vertex shader by adding<br />

several texture coordinates to the output Out and — for two horizontal and two<br />

vertical stroke textures — the following lines:



Out.Tex0 = Tex0;<br />

Out.Tex1 = Tex0 + offset1.xy;<br />

Out.Tex2 = Tex0.yx + offset1.zw;<br />

Out.Tex3 = Tex0.yx + offset2.xy;<br />

To each stroke texture we assign an intensity interval [start, end] and map the<br />

threshold t in the alpha channel to this interval by start + t*(end-start). After computing

a desired intensity, we can modulate the background color with the stroke<br />

color using the following lines of code:<br />

float4 stroke = tex2D(stroke_sampler, Tex0);
color *= (intensity < start + stroke.a/q) ? stroke.rgb : 1.0;

...with q = 1/(end-start). In the pixel shader, we compute an intensity using a lighting

model and, again in the case of two horizontal and two vertical applications of one<br />

stroke texture, add the following lines:<br />

float3 color = background_color.rgb;
float4 stroke = tex2D(stroke_sampler, Tex0);
color *= (intensity < 0.75 + stroke.a/4) ? stroke.rgb : 1.0;
stroke = tex2D(stroke_sampler, Tex1);
color *= (intensity < 0.5 + stroke.a/4) ? stroke.rgb : 1.0;
stroke = tex2D(stroke_sampler, Tex2);
color *= (intensity < 0.25 + stroke.a/4) ? stroke.rgb : 1.0;
stroke = tex2D(stroke_sampler, Tex3);
color *= (intensity < stroke.a/4) ? stroke.rgb : 1.0;
return float4(color.r, color.g, color.b, background_color.a);

A teapot rendered with this technique is shown in Figure 2.<br />

Figure 2: A teapot hatched using our thresholding scheme


Varying the Line Style<br />


We can extend the above technique to allow variation of the hatching strokes at<br />

run time. For this, we do not encode strokes directly into a stroke texture but<br />

instead encode lookups into single-stroke textures. We call these textures<br />

stroke-lookup textures.<br />


Figure 3: (a) – (d) show the RGBA channels of a stroke-lookup texture and (e) is an illustration of a<br />

lookup. (a) shows the R channel that contains the lookup in s, (b) the G channel that contains the<br />

lookup in t, (c) the B channel that contains the threshold, and (d) the A channel that is used as a<br />

stencil.<br />

A simple example of a stroke-lookup texture is shown in Figure 3. The channels<br />

R and G store the lookups in t and s, channel B stores the threshold, and alpha is<br />

used as a stencil to prevent incorrect interpolation. For achieving a roughly uniform<br />

screen width of the strokes — as in hand-drawn hatchings — we use<br />

mipmap levels with strokes of the same texel size. Standard generation of mipmap<br />

levels would halve the texel width of a stroke in each level, thus strokes farther<br />

away from the viewer would be thinner than those close to the viewer. For correct<br />

interpolation between these mipmap levels, we extend the stroke-lookup coordinates<br />

from [0,1] to [–0.5,1.5]. For this we scale the texture coordinates appropriately,<br />

as illustrated in Figure 4.



Figure 4: Illustration of the texture coordinates tau and scaled<br />

stroke-lookup coordinates lookup_x and lookup_y<br />

The above calculations have to be adapted in the following way:<br />

float4 lookup = tex2D(stroke_lookup_sampler, Tex0);
lookup.xy = (lookup.xy - 0.25)*2;
bool stroke_flag = (intensity < interv.x + interv.y*lookup.b) && (lookup.a > 0.99);
color *= stroke_flag ? tex2D(single_stroke_sampler, lookup.xy) : 1.0;

For different strokes, we use lookups into several different single-stroke textures.<br />

For this, we use an additional texture with indices for single-stroke textures.<br />

Alternatively, we could encode the indices into the stenciling channel. To keep the<br />

pixel shader simple, we assume that we have done the lighting computation in the<br />

vertex shader. The simple pixel shader, using lookups into two different single-stroke

textures — a short and a long stroke — could look like this:<br />

float4 getStrokeColor(float2 texCoord, float shiftedIntensity) {
    float4 lookup = tex2D(stroke_lookup_sampler, texCoord);
    lookup.xy = (lookup.xy - 0.25)*2;
    float stroke = tex2D(index, texCoord);
    float4 stroke_color = (stroke < 0.5) ?
        tex2D(short_stroke, lookup.xy) : tex2D(long_stroke, lookup.xy);
    bool stroke_flag = (lookup.w > 0.99) && (shiftedIntensity < lookup.z/4.0);
    stroke_color = stroke_flag ? stroke_color : 1.0;
    return stroke_color;
}

float4 main(
    float2 Tex0 : TEXCOORD0,
    float2 Tex1 : TEXCOORD1,
    float2 Tex2 : TEXCOORD2,
    float2 Tex3 : TEXCOORD3,
    float2 Tex4 : TEXCOORD4,
    float4 Diff : COLOR0 ) : COLOR
{
    float4 color = 1.0;
    color *= getStrokeColor(Tex0, Diff.x - 0.75);
    color *= getStrokeColor(Tex1, Diff.x - 0.5);
    color *= getStrokeColor(Tex2, Diff.x - 0.25);
    color *= getStrokeColor(Tex3, Diff.x - 0.0);
    return color;
}

Figure 5 shows the use of different stroke styles in combination with specular<br />

highlights.<br />

Hatching with Specular Highlights<br />

So far, we have used strokes only to darken a rendition where the desired intensity<br />

is below a certain threshold. But we can also draw light strokes — assuming<br />

the background is not white. We use this for drawing specular highlights. To draw<br />

light strokes, we just need to check whether the intensity is above a given threshold.<br />

We can use the same stroke textures as for dark strokes — and possibly combine<br />

dark and light strokes — by taking one minus the previous threshold to<br />

maintain the stroke priorities. This effect is illustrated in Figure 5.<br />

Figure 5: (a) A shaded teapot and (b) – (d) the same teapot hatched with specular highlights and different stroke styles

Lighting Hand-drawn Illustrations

Hatching can be used for lighting a hand-drawn illustration and integrating it into<br />

a 3D scene. We use billboards for placing the illustration in the scene. For the<br />

lighting computation, we need to provide normals for the illustration. We do this<br />

by roughly approximating our hand drawing by simple primitives and rendering<br />

these with a pixel shader outputting the color-encoded (normalized) normal vectors<br />

or with a normalization cube map. Using the normals, we choose which<br />

strokes to draw according to the light. Figure 6 (a) shows a hand-drawn image that<br />

we approximated with the shape in (b) and placed in a scene in (c).



Figure 6: (a) A hand-drawn image as input, (b) approximation of the shape, (c) resulting image in<br />

a 3D scene (See Color Plate 19.)<br />

Stroke Colors and Pointillism

The last effect chose the stroke color according to the background color. The<br />

resulting strokes were unicolored simply because the background color of one<br />

stroke did not change. In the case of varying background or base color for a<br />

stroke, we would still like to draw unicolored strokes — as is typical for stroke-based

illustrations, such as mosaics, oil paintings, and many others. We can do<br />

this by encoding offsets into the strokes, which are used for reading the base color<br />

so that all points of a stroke read the same color. Figure 7 shows the R channel of<br />

such a stroke. The brush uses values from black (0) to white (1). This value has to<br />

be scaled by the maximal relative brush width and height offset_scale.xy in the<br />

pixel shader. If brushes of different sizes are used simultaneously, smaller brushes<br />

should use values from a smaller interval. The code for modifying the texture<br />

coordinate used to determine the base color could look like this:<br />

float2 offset = (tex2D(offset_sampler, Tex.xy*scale).xy - 0.5) * offset_scale.xy / scale;

float2 newTex = Tex + offset;<br />

Used on its own, this technique can create a<br />

pointillistic style. Figure 8 shows some<br />

examples.<br />

Conclusion

Pixel shaders offer great possibilities for

implementing stroke-based rendering techniques.<br />

As examples of this we have shown<br />

a hatching scheme and several effects<br />

extending this scheme. We hope that these<br />

examples may serve as an inspiration for the<br />

many more effects that are possible.<br />

Figure 7: R channel of the texture with<br />

the brush used in Figure 8 and the<br />

relative maximal brush width<br />

offset_scale.x.


Figure 8: The RenderMonkey Iridescent Butterfly rendered (a) without the effect, (b) using a brush in the form of the word "ShaderX²," (c) and (d) using the brush shown in Figure 7 with different values for scale.

References

[Card02] Card, Drew and Jason Mitchell, "Non-Photorealistic Rendering with Pixel and Vertex Shaders," Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks, Wolfgang Engel, ed., Wordware Publishing, 2002, pp. 319-333.

[Domine01] Dominé, Sébastien, Ashu Rege, and Cem Cebenoyan, “Real-Time<br />

Hatching (Tribulations in),” GDC, 2001.<br />

[Praun01] Praun, Emil, Hugues Hoppe, Matthew Webb, and Adam Finkelstein,<br />

“Real-Time Hatching,” proceedings of SIGGRAPH 2001, pp. 581-586.<br />

[Webb02] Webb, Matthew, Emil Praun, Adam Finkelstein, and Hugues Hoppe,<br />

“Fine Tone Control in Hardware Hatching,” proceedings of NPAR 2002, pp. 53-ff.<br />

[Winkenbach94] Winkenbach, Georges and David H. Salesin, “Computer-Generated<br />

Pen-and-Ink Illustration,” proceedings of SIGGRAPH 1994, pp. 91-100.


Layered Fog<br />

Guillaume Werle

Overview

Vertical or layered fog can drastically change the mood and realism level of the<br />

scenes produced by your rendering engine. It is also probably one of the easiest<br />

effects that you can implement using shaders. If you’re learning shaders and want<br />

quick, good-looking results, this article is for you.<br />

Figure 1: Volumetric fog in action<br />

As shown in Figure 1, the basic idea is to use the height from some plane instead<br />

of the depth or distance from the viewpoint as the density factor. The technique<br />

described in this article computes the height on a per-vertex basis and uses the<br />

texture coordinate interpolator to get per-pixel precision. This way, the faces<br />

don’t have to be split at the fog’s boundaries.


Integrating Fog in Your Engine<br />

The key to success when designing a shader-driven engine is modularity. To<br />

increase efficiency and reduce the number of shaders needed, the rendering<br />

should be split into independent passes.<br />

Since it uses alpha blending, fog needs to be computed and rendered in the<br />

final stage of the rendering pipeline. D3D’s blending render states need to be set<br />

in the following way to mix the fog with the frame buffer content:<br />

SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
SetRenderState(D3DRS_SRCBLEND, D3DBLEND_SRCALPHA);
SetRenderState(D3DRS_DESTBLEND, D3DBLEND_INVSRCALPHA);

Depth precision issues (also known as Z-fighting) may appear when using multiple<br />

passes for rendering. A little bit of tweaking of your projection matrix should<br />

eliminate these artifacts. Additional information on this topic can be found on Tom<br />

Forsyth’s web page at http://tomsdxfaq.blogspot.com/2002_07_01_tomsdxfaq_<br />

archive.html#79344425.<br />

Density Formula<br />

Since the density of the fog is based on its height, we want to have maximum<br />

opacity at the low boundary of our fog range (FogLowRange) and no fog at all when<br />

our height is equal or above the high boundary (FogHighRange).<br />

Using this very simple formula, we get a coefficient representing this value.<br />

VertexDelta = (FogHighRange – VertexHeight)
FogDelta = (FogHighRange – FogLowRange)
VertexDensity = VertexDelta / FogDelta

Since we can't divide in a vertex shader, we have to multiply VertexDelta by the
reciprocal of FogDelta (InvFogDelta).

InvFogDelta = 1.0 / (FogHighRange – FogLowRange)
VertexDensity = (FogHighRange – VertexHeight) * InvFogDelta

Implementation

Now, let’s take a look at the vertex shader implementation:<br />

vs.1.1<br />

// constants 0123=world * view * projection matrix<br />

m4x4 oPos, v0, c0 // vertex in screenspace output<br />


mov oT0, v7 // copy diffuse texture coordinates<br />

// constants 4567=world matrix<br />

dp4 r0, v0, c5 // vertex y position in world space<br />

// constant 8 = (FogHighRange, InvFogDelta, 0.0f , 1.0f)<br />

sub r0, c8.xxxx, r0 // (high – y)<br />

mul oT1, r0, c8.yyyy // output = (high - y) * InvFogDelta<br />

The density value might exceed the range [0.0, 1.0], but this isn't an issue. The
density value will be interpreted in the pixel shader using the texcoord instruction.

This opcode interprets texture coordinates as RGBA values and will clamp them.<br />

The final pixel density also depends on the camera height. CameraDensity<br />

can be computed using the same formula as the vertex density. This value should<br />

be computed only one time per frame and then passed to the pixel shader using a<br />

constant register. This value will need to be clamped in the range [0.0, 1.0].
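For illustration, the per-frame update could be done as in the following sketch (names are ours; the constant register index matches the pixel shader listed below):

#include <d3d9.h>

// Computes CameraDensity with the same formula as the vertex density, clamps it
// to [0.0, 1.0], and uploads it to pixel shader constant c1 once per frame.
void UpdateCameraDensity(IDirect3DDevice9* device, float cameraHeight,
                         float fogHighRange, float invFogDelta)
{
    float density = (fogHighRange - cameraHeight) * invFogDelta;
    if (density < 0.0f) density = 0.0f;
    if (density > 1.0f) density = 1.0f;

    float c1[4] = { density, density, density, density };
    device->SetPixelShaderConstantF(1, c1, 1);
}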

The diffuse texture might use alpha blending for transparency, so we also<br />

need to take this into account.<br />

If we summarize the above, we end up with this formula:<br />

FogPixel.a = saturate(VertexDensity+CameraDensity)*FogColor.a*Diffuse.a<br />

FogPixel.rgb = FogColor.rgb<br />

Here is the pixel shader implementation:<br />

ps.1.1<br />

tex t0 // diffuse texture<br />

texcoord t1 // vd = vertex density<br />

// t1 = (vd, vd, vd, 1.0f)<br />

// c2 = red mask (1.0f , 0.0f, 0.0f, 0.0f)<br />

dp3 r0, t1, c2 // copy the red component everywhere<br />

// r0 = (vd, vd, vd, vd)<br />

// cd = camera density<br />

// c1 = (cd, cd, cd, cd)<br />

add_sat r0, r0, c1 // VertexDensity + CameraDensity

// c0 = fog color<br />

mul r0, r0, c0 // c0.a = fog density<br />

mul r0, r0, t0 // diffuse texture opacity<br />

mov r0.rgb, c0 // output<br />

// r0.rgb = fog color<br />

// r0.a= fog density<br />

NOTE: The DirectX documentation states that the alpha value of a pixel

fetched with the texcoord instruction should be 1.0, but I haven’t seen any<br />

piece of hardware that follows this rule. You usually get the .w value passed<br />

by the vertex shader.


Final Words<br />


I implemented this shader for a demo scene application called Raw Confession.<br />

The demo can be found on the companion CD or downloaded from my web page:<br />

http://cocoon.planet-d.net/raw/!Raw_Beta.zip.<br />

Special thanks to Jeff Kiel from nVidia for proofreading this article.


Dense Matrix Algebra on the GPU

Ádám Moravánszky

Introduction

Perhaps the most important innovation of the latest generation of programmable graphics processors (GPUs) is their capability to work with floating-point color data. Previous generations of GPUs have worked with up to a byte of integer data per color channel. Developers working on graphics engines with advanced lighting effects often complained about banding artifacts, even in true-color video modes, because multiplicative effects quickly made the round-off error caused by the limited precision noticeable. The advent of GPUs that represent each color channel with a 32-bit floating-point value has thus been widely celebrated in the real-time graphics community.

More importantly, while 8-bit color channel precision is often adequate, the dynamic range is quite limited. Floating-point color buffers make it possible to work with brightness values well beyond the maximum value that can be represented in the final image. Though the dynamic range of the output device stays the same, intermediate values during a computation are no longer clamped to this range. This way, a much more realistic simulation of lighting is possible, resulting in vibrant images.

The topic of this article is made possible by the emergence of floating-point color support as well, but we will not be dealing with either of the often-cited advantages of floating-point buffers described above. In fact, we will not be rendering images in the conventional sense at all. Instead, we look at the GPU as a powerful vector coprocessor to the CPU. We use it to solve two common problems in scientific computing: solving systems of linear equations and linear complementarity problems. Both of these problems come up in dynamics simulation, which is a field drawing increasing interest from the game developer community.

By implementing these algorithms on the GPU, we hope to achieve a performance gain or at least free up CPU resources, which can then be better spent running algorithms that are not vectorizable. Because the GPU usually has its hands full rendering graphics in a computer game, and because GPUs with floating-point color support are anything but widespread, the results of this article are initially primarily of theoretical interest for the game community. However, if we can show convincing performance figures that make such application of GPUs desirable, we may soon find these applications becoming practical and widespread. If GPU performance continues to grow at its present rate, we may eventually see researchers and engineers abandoning expensive supercomputers for clusters of GPU-equipped PCs.

Previous Work

The fundamental concept of linear algebra is the matrix. Matrices are used in simulation in order to describe a linear relationship in a concise way. A significant amount of research has gone into working with large dense matrices. BLAS (Basic Linear Algebra Subprograms) [2, 7] has emerged as the standard interface to linear algebra libraries. Freely available implementations of BLAS include ATLAS (Automatically Tuned Linear Algebra Software) [9], a linear algebra library that includes Intel SSE2 and AMD 3DNow! optimized matrix multiply kernels. These fast kernels, combined with ATLAS' cache-friendly memory access pattern achieved by special ordering of the input data, make it one of the fastest dense matrix libraries available on the PC platform. In [6], Larsen and McAllister first investigated using GPUs for linear algebra. At the time of its publication, floating-point pixel processing was not yet available, so their results were not practical for real-world problems. The papers [1, 5], made available after this article was initially submitted, tackle the representation of sparse matrices on the GPU.

While ATLAS provides a selection of higher-level linear algebra operations, such as solving linear systems, at its core ATLAS is a high-performance matrix multiply kernel, which is then leveraged by the higher-level operations. We follow the same principle in our GPU matrix library: We implement a few basic matrix operations using shaders, including matrix multiply, and then use these as building blocks to solve the higher-level problems. While we have not written a full GPU BLAS implementation due to time constraints, we show how to implement all the basic components necessary for this goal.

Implementation

Our implementation consists of a matrix class that carries out all the core arithmetic operations. It interfaces with the GPU using the DirectX 9 Graphics SDK. The user interface is a script interpreter that parses matrix operation instructions out of a text stream, manages matrix variable names, reads and writes matrix variable data to file, and passes operations for execution to the matrix class. We discuss the matrix class below, as well as two examples of its use.

Matrix Textures

If the GPU is to perform large matrix multiplication for us, the first thing we need to do is represent the matrix data in a format that is accessible by the GPU. GPUs can in principle work on two basic types of data: geometry and texture maps. Textured geometry is preferable because of its more compact representation when compared with highly tessellated geometry with vertex colors. Also, unlike geometry, textures can be output by the GPU in the form of render target surfaces. If we store a matrix as a texture and then perform a matrix operation, such as matrix addition, by rendering two textures with additive blending into a third render target surface, the storage format of the resulting matrix can be identical to the input format. This is a desirable property because this way we can immediately reuse the resulting texture as an input to another operation without having to perform format conversion.

We would like our library to work with matrices of real numbers because this domain is the most generally useful for simulation problems, especially dynamics simulation. Integers would be too restrictive, while complex numbers are usually not required. Note that the system we present could be extended to handle complex numbers should this be the case. Real numbers are most efficiently approximated on computers using floating-point numbers of various precisions. Unfortunately, GPUs still only support single-precision floating-point, and future support for double or higher precision is unlikely, as this sort of precision is not thought to be needed for graphics applications. Nonetheless, single-precision floating-point is adequate for many applications.

Storage Format

There are several ways in which the elements of a matrix can be mapped to the pixels of a texture image. Perhaps the most obvious approach would be to take a luminance (one channel per pixel) image and fill it with the matrix data using a direct mapping of elements to pixels in either row- or column-major format. The disadvantage is, of course, that GPUs are optimized to process RGBA pixels and thus have four-way SIMD for executing pixel shaders. A luminance texture would only use a quarter of the available bandwidth.

Instead, we pack four adjacent matrix elements into a single pixel's RGBA channels. The simplest possibilities are to pack either rows or columns of four. While this packing does make square matrices into 4:1 aspect rectangular textures, it makes the writing of the pixel shaders for multiplication quite straightforward. Other schemes, such as packing 2×2 rectangular submatrices into each pixel, complicate the pixel shaders for doing matrix multiplication and offer no clear advantage. It is interesting to note that CPU linear algebra packages like ATLAS primarily get their speed boost by storing the matrices in a convoluted but very cache-friendly way. Data locality is an important key to performance. Unfortunately, in contrast to CPU programming, we have only relatively high-level control of the GPU. In particular, the order in which pixels get processed is an undocumented implementation detail. Usually, the GPU automatically stores textures in a swizzled form to improve cache coherence. It may be interesting to investigate whether more exotic storage formats can boost performance, but one would have to do quite a bit of experimentation, without necessarily being able to generalize the results to different GPUs.

The final question regarding data storage is whether there is any difference between packing rows or columns of four into a pixel. One important difference comes up when we consider doing vector operations. It is important that pixels be created along the length of a vector, instead of across it. In the latter case, a vector would only fill one color channel and leave three empty. In this implementation, we arbitrarily decided to go with storing CPU matrices in row-major format and working with column vectors. Thus, we put 4×1 sub-column vectors into each pixel. The width of a texture that corresponds to an n×m matrix is thus m, while the height is ⌈n/4⌉ (n/4 rounded up to the next integer).
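Spelled out as code, the element-to-texel mapping just described looks like the following minimal sketch. The function and parameter names (matrixData, texels, texPitchInFloats) are illustrative, not part of the matrix class; the source data is assumed to be row-major, as stated above:

// Element (r, c) of the matrix lands in texel (x, y) = (c, r/4), channel r % 4 (R, G, B, or A).
void packMatrix(const float* matrixData, unsigned nRows, unsigned nCols,
                float* texels, unsigned texPitchInFloats)
{
    for (unsigned r = 0; r < nRows; ++r)
        for (unsigned c = 0; c < nCols; ++c)
        {
            unsigned y       = r / 4;   // texel row
            unsigned channel = r % 4;   // RGBA channel inside the texel
            texels[y * texPitchInFloats + c * 4 + channel] = matrixData[r * nCols + c];
        }
}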

To create a matrix texture from some source data, we create an appropriately sized render target surface using the D3DFMT_A32B32G32R32F floating-point pixel format. We don't need any mipmapping; in fact, we render with point sampling to prevent texture filtering from falsifying our computations.

Creating a render target texture is technically only necessary if we want the matrix to serve as a destination for matrix operations; in our application, we choose not to keep track of this distinction and treat all matrices equally for the sake of simplicity.

Unfortunately, in DirectX 9 it is not possible to lock render target surfaces, so we need to create an identically formatted temporary texture in the SYSTEMMEM pool. This texture's surface is then locked, and the matrix data is read into it. Finally, we use the DirectX method UpdateTexture() to copy the temporary texture into our render target texture.

Reading back from the matrix texture happens in the same way, except that this time the method GetRenderTargetData() is used to copy from the matrix texture to the temporary texture.
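A minimal sketch of this round trip, under the assumption of a device pointer d3dDevice and texture dimensions texWidth and texHeight (error handling and Release() calls omitted; the variable names are illustrative):

IDirect3DTexture9 *mathTexture = 0, *stagingTexture = 0;

// Render target texture in the default pool, plus a lockable system-memory twin.
d3dDevice->CreateTexture(texWidth, texHeight, 1, D3DUSAGE_RENDERTARGET,
                         D3DFMT_A32B32G32R32F, D3DPOOL_DEFAULT, &mathTexture, 0);
d3dDevice->CreateTexture(texWidth, texHeight, 1, 0,
                         D3DFMT_A32B32G32R32F, D3DPOOL_SYSTEMMEM, &stagingTexture, 0);

// Upload: lock the system-memory copy, fill it with packed matrix data, then copy.
D3DLOCKED_RECT rect;
stagingTexture->LockRect(0, &rect, 0, 0);
// ... write matrix elements into rect.pBits (see the packing sketch above) ...
stagingTexture->UnlockRect(0);
d3dDevice->UpdateTexture(stagingTexture, mathTexture);

// Readback: copy the render target into the staging texture, then lock and read it.
IDirect3DSurface9 *mathSurface = 0, *stagingSurface = 0;
mathTexture->GetSurfaceLevel(0, &mathSurface);
stagingTexture->GetSurfaceLevel(0, &stagingSurface);
d3dDevice->GetRenderTargetData(mathSurface, stagingSurface);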

Matrix Operations

Assignment

After reading in the data, we are ready to perform some matrix operations. We start by implementing three basic operations — matrix assignment, addition, and multiplication. Later we will add some others as required by our higher-level algorithms. Note that some operations are not strictly necessary and could be expressed using others. For example, assignment could be emulated by adding a zero matrix to the source matrix. Still, writing special-case code when optimizations are possible is a good idea.

Matrix assignment is the most elementary operation, so we cover it first to introduce some details in our code:

void Matrix::copy(Matrix & other) {

Note that while the reference rasterizer works fine with the render target surface being the same as one of the source textures, this case is not officially supported by Direct3D and should be avoided. In the case of assignment, it is obviously a null operation to assign a matrix to itself, so we can simply early out in this case.

if (this == &other) return;

If the destination texture is not the same size as the source texture, it needs to be resized. We resize a texture by releasing it and creating a new one of the correct size.

resize(other.getNRows(), other.getNCols());

If one of the dimensions of the matrix is 0, there is nothing to do:

if (nRows * nCols == 0) return;

Next, we set the destination texture as the render target, begin the scene, assign vertex and pixel shaders, and assign the source texture to the 0th sampler. For this simple operation, we do not really need shader support and could do the same operation with the fixed-function pipeline and texture combiners. On the other hand, any hardware that supports floating-point pixel formats will most likely have shader support as well, so we might as well use them. We omit DirectX error handling in the cited code for clarity.

d3dDevice->SetRenderTarget(0, mathSurface);
d3dDevice->BeginScene();
d3dDevice->SetVertexShader( vertexShaders[VS_SINGLE_TEX_QUAD] );
d3dDevice->SetPixelShader( pixelShaders[PS_COPY] );
d3dDevice->SetTexture(0, other.mathTexture);

Next, we render a single quadrilateral polygon that exactly covers the destination texture by using a triangle fan with four vertices. This is what our vertex buffer contains:

MathVertex quad[4] = {
//    x      y
 { -1.0f, -1.0f},
 { +1.0f, -1.0f},
 { +1.0f, +1.0f},
 { -1.0f, +1.0f}};

We have 2D clip space coordinates for each vertex. Because we won't be rendering 3D shapes, and because texture coordinates can be trivially generated in the vertex shader from this basic data, it is all we need. We place this data into a managed pool vertex buffer and do not worry about it anymore. It is used for all the matrix operations except multiplication.

The actual rendering code looks like this:

d3dDevice->SetStreamSource( 0, quadVertexBuffer, 0, sizeof(MathVertex));

float TexcoordBiasW = (1.0f/cols2TextureWidth(nCols)) * 0.5f;
float TexcoordBiasH = (1.0f/rows2TextureHeight(nRows)) * 0.5f;

float consts[4 * 2] = {
 0.5, -0.5, 0.5, 1,
 0.5 + TexcoordBiasW, 0.5 + TexcoordBiasH, 0, 0 };

d3dDevice->SetVertexShaderConstantF(0, consts, 2);
d3dDevice->DrawPrimitive( D3DPT_TRIANGLEFAN, 0, 2 );
d3dDevice->EndScene();
}

The function of the texture coordinate bias values that get passed to the vertex shader is to line up the destination pixels with the source texel centers by shifting the texture coordinates by half a texel. If we were to omit this, the texture would be sampled halfway between texels at each pixel, making it effectively random which of the four neighboring texels the point sampling would pick.

cols2TextureWidth() and rows2TextureHeight() simply map matrix dimensions to texture dimensions using the formula mentioned previously:

inline unsigned roundUpDivide(unsigned a, unsigned b) { return (a + b - 1) / b; }
inline unsigned rows2TextureHeight(unsigned rows) { return roundUpDivide(rows, 4); }
inline unsigned cols2TextureWidth (unsigned cols) { return cols; }

The vertex shader we use, SINGLE_TEX_QUAD, is shown below:

// c0 = [ 0.5, -0.5, 0.5, 1]
// c1 = [ 0.5 + TexcoordBiasW, 0.5 + TexcoordBiasH, 0, 0]
vs_1_1
dcl_position v0
mov oPos, v0
mov oPos.zw, c0.zw
mov r0, c1
mad oT0.xy, v0.xy, c0.xy, r0.xy

We basically emit the vertices that we put in the vertex buffer in clip space after assigning some constant values to the z and w coordinates. The texture coordinates are computed from the vertex position in a single instruction, which involves flipping the vertical axis and applying the bias constants described above.

Finally, the pixel shader is shown below. It serves to simply copy the input texture to the destination surface:

// PS_COPY   out = tex0
ps_2_0
dcl_2d s0
dcl t0
texld r0, t0, s0
mov oC0, r0

We tried using HLSL to produce these shaders, and several of them were prototyped that way, but the DirectX shader compiler failed to produce efficient code for the more involved matrix multiply cases, so we decided to stay with hand-coded assembly for this project. The use of pixel shader 2.0 or greater is necessary in the case of this simple shader not because of any special instructions or even the number of instructions, but because lower pixel shader versions automatically clamp their final result to [0,1]. We would like to use the entire floating-point range.


Addition

Addition is very similar to assignment. Because we have the limitation that the destination texture may not be the same as either of the source textures, we need to code both a general add and an accumulate (+=) operation. We only cover the binary version here because the accumulate version is the same as the above assignment with additive blending with the existing render target turned on.

void Matrix::add(Matrix & a, Matrix & b) {
if (a.nRows != b.nRows || a.nCols != b.nCols)
 throw "matrix dimensions don't agree";
if (this == &a) { add(b); return; }
else if (this == &b) { add(a); return; }
resize(a.nRows, a.nCols);
if (a.nRows * a.nCols == 0) return;

d3dDevice->SetRenderTarget(0, mathSurface);
d3dDevice->BeginScene();
d3dDevice->SetVertexShader( vertexShaders[VS_SINGLE_TEX_QUAD] );
d3dDevice->SetPixelShader( pixelShaders[PS_ADD] );
d3dDevice->SetTexture(0, a.mathTexture);
d3dDevice->SetTexture(1, b.mathTexture);
d3dDevice->SetStreamSource( 0, quadVertexBuffer, 0, sizeof(MathVertex) );

float TexcoordBiasW = (1.0f/cols2TextureWidth(nCols)) * 0.5f;
float TexcoordBiasH = (1.0f/rows2TextureHeight(nRows)) * 0.5f;

float consts[4 * 2] = {
 0.5, -0.5, 0.5, 1,
 0.5 + TexcoordBiasW, 0.5 + TexcoordBiasH, 0, 0 };

d3dDevice->SetVertexShaderConstantF(0, consts, 2);
d3dDevice->DrawPrimitive( D3DPT_TRIANGLEFAN, 0, 2 );
d3dDevice->EndScene();
}

There are only a few places where the above differs from the assignment code. First, we need to check whether the dimensions of the two source matrices match; otherwise, the addition operation is mathematically undefined. We also check whether one of the source operands is the same as the destination and call the special-case accumulate code in that case. The second texture is also assigned to the second texture sampler. We use the same vertex shader as before.

The pixel shader is a different one but not much more complicated; it simply performs additive blending of the two source textures:

// PS_ADD   out = tex0 + tex1
ps_2_0
dcl_2d s0
dcl_2d s1
dcl t0
texld r0, t0, s0
texld r1, t0, s1
add r0, r0, r1
mov oC0, r0
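For completeness, the state difference that turns the assignment path into the accumulate (+=) variant mentioned at the start of this section could look like the following sketch (only the differing render states are shown; the rest is the same shader, texture, quad, and DrawPrimitive() call as in copy()):

// Hedged sketch: additive blending into the existing render target around the copy draw call.
d3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, TRUE );
d3dDevice->SetRenderState( D3DRS_SRCBLEND, D3DBLEND_ONE );
d3dDevice->SetRenderState( D3DRS_DESTBLEND, D3DBLEND_ONE );
// ... same setup and DrawPrimitive() as in copy() ...
d3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, FALSE );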

Multiplication

Writing a general matrix multiply is a bit more challenging because, unlike addition, it doesn't reduce to mere image blending. Figure 1 shows the schematic for our matrix multiply procedure.

Figure 1: Schematic of matrix multiply

The texture corresponding to the left operand matrix A is shown on the left side. The texture of the right-side operand matrix B is at the top right. C, the result matrix, is shown at the bottom right. C = A B should hold after the operation completes.

By the definition of matrix multiplication, the number of columns in A has to equal the number of rows in B. We call this range of q numbers the inner dimension. Finally, the number of rows in A is equal to the number of rows in C, and the number of columns in B is equal to the number of columns in C. We call these the outer dimensions.

In our Figure 1 example, matrix A is 14×30 and matrix B is 30×12. The 4×1 submatrices stored by the textures in a single pixel are shown as ovals. Because the matrices' heights are not exactly divisible by four, the last two elements of the last row of pixels are unused, indicated by their white color. Note that the texture representing A is only 30 pixels wide. The last two columns of white ovals with gray markings represent samples read in by the pixel shader outside of the [0,1] texture coordinate range; these virtual texels need to read as zero. They are necessary so our pixel shader can always work with blocks of four texels, even if the input matrix sizes are not exact multiples of four.

Like any pixel shader, the matrix multiply code has to emit a single (partially computed) pixel. Each pixel stores four values, each of which is a dot product between a row vector of A and a column vector of B. Both of these vectors have q elements, where q is 30 in our example. Thus, at each pixel, we need to perform four of these dot products, which is the same as a 4×q matrix-vector multiplication. Because q may be quite large, our GPU may not be able to sample all of the 5q/4 texels necessary in one pass due to pixel shader instruction count limits. Thus, we need to decompose this operation into a set of smaller operations depending on our instruction count limits.

Our atomic pixel shader operation is a 4×4 matrix-vector multiplication, where the 4×4 matrix is fetched from A and the 4×1 vector from B. We refer to this atomic multiply as a MOP, for "matrix operation." We need to perform ⌈q/4⌉ of these MOPs per pixel and accumulate the results in order to obtain the final result for an output pixel. We pack as many of these MOPs into our pixel shader as possible.

In our example, we assume a hypothetical pixel shader that can perform no more than three of these MOPs in a single pass. In general, we define the macro numMOpsPerFragment as the number of MOPs that can fit into a pixel shader. For ps 2.0, we managed to fit six of them.

If the hypothetical example shader can do three MOPs per pass, and we need a total of ⌈30/4⌉ = 8 MOPs for the final result, we need to perform ⌈8/3⌉ = 3 additive passes, as indicated in the figure. Ps 2.0 would only need two passes.

As an example, we have highlighted a pixel in the destination texture. The pixel shader that emits this pixel as part of the second additive pass samples the darkened 15 texels from A and B.
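Restated with the helper function introduced earlier, the pass-count arithmetic is simply the following (q stands for the inner dimension, i.e., a.nCols; this is just the computation described above, not additional library code):

unsigned numMOps  = roundUpDivide(q, 4);                        // MOPs needed per output pixel
unsigned numQuads = roundUpDivide(numMOps, numMOpsPerFragment); // additive passes (quads) to render
// For q = 30: numMOps = 8; the hypothetical 3-MOP shader needs 3 passes, ps 2.0 (6 MOPs) needs 2.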

Of course, we do not have to worry about the outer dimensions; they are taken care of by the inherent parallel processing of the GPU in the form of adjacent pixels, just as in the assignment and addition shaders. Now that we have covered the theory, we present the implementation:

void Matrix::multiply(Matrix & a, Matrix & b) {

As usual, we need to check whether we're trying to render a texture onto itself. Here we do not have a backup plan, so we simply report an error:

if (this == &a || this == &b)
 throw "can't operate inplace -- not supported by D3D.";

If the matrix dimensions do not agree, matrix multiplication is undefined:

if (a.nCols != b.nRows)
 throw "matrix dimensions don't agree";

resize(a.nRows, b.nCols);
if (nRows * nCols == 0) return;


First, we compute a few constants depending on the input sizes and the number of instructions permitted in the pixel shader. We render numQuads quads aligned with the destination surface, with additive blending. We compute numQuads with the formulas given above.

const unsigned numQuads = roundUpDivide(rows2TextureHeight(b.nRows), numMOpsPerFragment);

As we did for assignment and addition, we compute texture coordinate bias values to ensure that texels are sampled at their centers. Here we have two input textures, so we need to do this twice:

const float TexcoordBiasW  = (1.0f/cols2TextureWidth(nCols)) * 0.5f;
const float TexcoordBiasH  = (1.0f/rows2TextureHeight(nRows)) * 0.5f;
const float TexcoordBiasAW = (1.0f/cols2TextureWidth(a.nCols)) * 0.5f;
const float TexcoordBiasBH = (1.0f/rows2TextureHeight(b.nRows)) * 0.5f;

A single pixel shader performs several MOPs. We supply it with texture coordinates for the first five samples corresponding to the first MOP but only provide texture coordinate increments relative to these first five, which the pixel shader can use to compute the texture coordinates of the subsequent samples. tcMOpIncrementBH is the height of a texel in B, which is the amount the pixel shader has to seek down in the texture to get to the pixel used for the next MOP.

const float tcPixelBH = 2 * TexcoordBiasBH;
const float tcMOpIncrementBH = tcPixelBH;

The second increment we need is that of the texture coordinates between the additive passes. These will be used by the vertex shader, as we do not pass explicit texture coordinates, in order to minimize the size of our vertex buffer.

const float tcPassIncrementBH = numMOpsPerFragment * tcPixelBH;

The same constants are also computed for the other input texture:

const float tcPixelAW = 2 * TexcoordBiasAW;
const float tcMOpIncrementAW = 4 * tcPixelAW;
const float tcPassIncrementAW = numMOpsPerFragment * tcMOpIncrementAW;

The meaning of the vertex and pixel shader constants will become clear when we look at the shaders:

float vconsts[] = {
 0.5 + TexcoordBiasW, 0.5 + TexcoordBiasH, 0 + TexcoordBiasBH, 0,
 0 + TexcoordBiasAW, tcPixelAW + TexcoordBiasAW,
 2 * tcPixelAW + TexcoordBiasAW, 3 * tcPixelAW + TexcoordBiasAW,
 tcPassIncrementBH, tcPassIncrementAW, 0, 0
};

float pconsts[] = {
 1 * tcMOpIncrementAW, 0, 0, 0,   // 2 MOPs
 0, 1 * tcMOpIncrementBH, 0, 0,
 2 * tcMOpIncrementAW, 0, 0, 0,   // 3 MOPs
 0, 2 * tcMOpIncrementBH, 0, 0,
 3 * tcMOpIncrementAW, 0, 0, 0,   // 4 MOPs
 0, 3 * tcMOpIncrementBH, 0, 0,
 4 * tcMOpIncrementAW, 0, 0, 0,   // 5 MOPs
 0, 4 * tcMOpIncrementBH, 0, 0,
 5 * tcMOpIncrementAW, 0, 0, 0,   // 6 MOPs
 0, 5 * tcMOpIncrementBH, 0, 0,
};

d3dDevice->SetRenderTarget(0, mathSurface);
d3dDevice->BeginScene();
d3dDevice->SetVertexDeclaration( vertexDeclaration2 );
d3dDevice->SetVertexShader( vertexShaders[VS_MULT_1] );
d3dDevice->SetPixelShader( pixelShaders[PS_MULT_0] );
d3dDevice->SetTexture(0, a.mathTexture);
d3dDevice->SetTexture(1, b.mathTexture);
d3dDevice->SetStreamSource( 0, quadsVertexBuffer, 0, TINYVERTEX_SIZE );
d3dDevice->SetVertexShaderConstantF(1, vconsts, 3);
d3dDevice->SetPixelShaderConstantF(0, pconsts, 2 * numMOpsPerFragment);

The vertex buffer contains a triangle list in the following format:

/*  x   y  quadIndex
 { -1, -1, 0 },
 { +1, -1, 0 },
 { +1, +1, 0 },
 { -1, -1, 0 },
 { +1, +1, 0 },
 { -1, +1, 0 },
 { -1, -1, 1 },
 { +1, -1, 1 },
 { +1, +1, 1 },
 { -1, -1, 1 },
 { +1, +1, 1 },
 { -1, +1, 1 },
 ....
 { -1, -1, 99 },
 { +1, -1, 99 },
 { +1, +1, 99 },
 { -1, -1, 99 },
 { +1, +1, 99 },
 { -1, +1, 99 },
*/

The first two numbers are the 2D clip space coordinates of the vertex, as before. We have also added a value that is the index of the quad that the vertex belongs to in the sequence. The vertex shader uses this index value for texture coordinate generation. Because the data is so simple, we pack each vertex into a 32-bit word and use the D3DDECLTYPE_UBYTE4 data type. As the bytes are unsigned, we add one to the coordinates, storing –1 as 0 and +1 as 2. Finally, we render 2 * numQuads of these triangles:

d3dDevice->DrawPrimitive( D3DPT_TRIANGLELIST, 0, 2 );
if (numQuads > 1)
{
 d3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, TRUE );
 d3dDevice->DrawPrimitive( D3DPT_TRIANGLELIST, 6, 2 * (numQuads - 1));
 d3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, FALSE );
}
d3dDevice->EndScene();
}

On to the shaders. The vertex shader's job is to "decompress" the very frugal quantity of data from the vertex buffer and generate decent texture coordinates. Note how we have submitted all our rendering passes after the first in a single DrawPrimitive() call, so there is no room to perform any state changes between quads. The vertex shader has to use the quad index from the vertex buffer to tell which pass is being performed.

vs_1_1
dcl_position v0
def c0, 0.5, -0.5, 0.5, 1
def c4, -1, -1, 0, 0

Because we have encoded the input vertex coordinates as unsigned bytes, we map them back to signed values by subtracting one.

add r3.xy, v0.xy, c4.xy   // map from [0,2] to [-1,1]
mov oPos.xy, r3.xy        // emit position
mov oPos.zw, c0.zw

We start the texture coordinate generation by taking the vertex coordinate as the starting point and inverting the vertical axis; this is the same as in the previous shaders.

mov r0.xy, c1.xy          // transform viewport axes to texture uv axes
mad r0.xy, r3.xy, c0.xy, r0.xy

Next, we need to compute the U texture coordinates for texture A and the V texture coordinates for texture B. These depend on which pass we are in; the pass index is stored in v0.w. This is multiplied by tcPassIncrementAW and tcPassIncrementBH, respectively, which are constants computed above and stored in c3.

mul r1, v0.w, c3.zzxz     // can't 'mad' as it would reference two constants in one instruction
add r1, r1, c1
mul r2, v0.w, c3.yyyy
add r2, r2, c2

Finally, we emit the five texture coordinates needed for the first MOP of the pixel shader. The V coordinates of texture A and the U coordinate of texture B are simply stretched along with the quad to map linearly over the entire destination surface. Even though it would be trivial to compute the four texture coordinates of A in the pixel shader itself, we choose to do as much of this work as possible in the vertex shader. This way, we avoid bumping up against the very restrictive pixel shader instruction count limits, particularly the dependent texture sampling limits.

mov oT0.x, r2.x
mov oT1.x, r2.y
mov oT2.x, r2.z
mov oT3.x, r2.w
mov oT0.y, r0.y
mov oT1.y, r0.y
mov oT2.y, r0.y
mov oT3.y, r0.y
mov oT4.x, r0.x
mov oT4.y, r1.z

All the matrix element arithmetic is done in the pixel shader. We have made the pixel shader generic in the sense that it is made up of as many MOPs as it is possible to execute at once on the target architecture, which is six in ps 2.0. When new hardware becomes available that supports newer pixel shader versions, getting a performance boost should only be a matter of duplicating some additional MOP blocks in the shader and incrementing the ps version declaration. Our ps 2.0 implementation uses 30 texld instructions of the maximum 32 and is thus very close to optimal.

Inputs to the pixel shader are the registers t0...t3, the texture coordinates of four horizontally adjacent pixels in A; t4, the texture coordinate for texture B; and a large set of constants: c0.x holds the texture coordinate increment needed to move four pixels along a row of A, while c1.y has the increment needed to move one pixel down in B. c2 and c3 are two times c0 and c1, respectively; c4 and c5 are the same values times three, and so on. Because we have many constant registers available and few instruction slots, it is good to precompute these values.

ps_2_0
dcl t0.xyzw
dcl t1.xyzw
dcl t2.xyzw
dcl t3.xyzw
dcl t4.xyzw
dcl_2d s0
dcl_2d s1

To perform the first MOP, we fetch the needed data:

texld r0, t0, s0
texld r1, t1, s0
texld r2, t2, s0
texld r3, t3, s0
texld r4, t4, s1

...and execute the 4×4 matrix-vector multiply. The result is held in r5.

mul r5, r4.xxxx, r0
mad r5, r4.yyyy, r1, r5
mad r5, r4.zzzz, r2, r5
mad r5, r4.wwww, r3, r5

If we had defined numMOpsPerFragment as 1 above, we would just write r5 to oC0 and be done. However, we have not yet exhausted the capacities of the pixel shader, so we keep going:

#if numMOpsPerFragment >= 2

The texture coordinates are adjusted to correspond to the next set of inputs:

add r6, t0, c0
add r7, t1, c0
add r8, t2, c0
add r9, t3, c0
add r10, t4, c1

Then we sample the textures as before. Note, however, that we now use registers r6 through r10 instead of r0 through r4. This is because ps 2.0 does not allow us to sample a texture into any one register more than four times, so the destination registers have to be rotated.

texld r6, r6, s0
texld r7, r7, s0
texld r8, r8, s0
texld r9, r9, s0
texld r10, r10, s1

We accumulate the result of the second matrix-vector product with the first:

mad r5, r10.xxxx, r6, r5
mad r5, r10.yyyy, r7, r5
mad r5, r10.zzzz, r8, r5
mad r5, r10.wwww, r9, r5
#endif

MOPs three to six are identical save for the register rotation we mentioned:

#if numMOpsPerFragment >= 3
add r0, t0, c2
add r1, t1, c2
add r2, t2, c2
add r3, t3, c2
add r4, t4, c3
texld r0, r0, s0
texld r1, r1, s0
texld r2, r2, s0
texld r3, r3, s0
texld r4, r4, s1
mad r5, r4.xxxx, r0, r5
mad r5, r4.yyyy, r1, r5
mad r5, r4.zzzz, r2, r5
mad r5, r4.wwww, r3, r5
#endif

#if numMOpsPerFragment >= 4
add r6, t0, c4
add r7, t1, c4
add r8, t2, c4
add r9, t3, c4
add r10, t4, c5
texld r6, r6, s0
texld r7, r7, s0
texld r8, r8, s0
texld r9, r9, s0
texld r10, r10, s1
mad r5, r10.xxxx, r6, r5
mad r5, r10.yyyy, r7, r5
mad r5, r10.zzzz, r8, r5
mad r5, r10.wwww, r9, r5
#endif

#if numMOpsPerFragment >= 5
add r0, t0, c6
add r1, t1, c6
add r2, t2, c6
add r3, t3, c6
add r4, t4, c7
texld r0, r0, s0
texld r1, r1, s0
texld r2, r2, s0
texld r3, r3, s0
texld r4, r4, s1
mad r5, r4.xxxx, r0, r5
mad r5, r4.yyyy, r1, r5
mad r5, r4.zzzz, r2, r5
mad r5, r4.wwww, r3, r5
#endif

#if numMOpsPerFragment >= 6
add r6, t0, c8
add r7, t1, c8
add r8, t2, c8
add r9, t3, c8
add r10, t4, c9
texld r6, r6, s0
texld r7, r7, s0
texld r8, r8, s0
texld r9, r9, s0
texld r10, r10, s1
mad r5, r10.xxxx, r6, r5
mad r5, r10.yyyy, r7, r5
mad r5, r10.zzzz, r8, r5
mad r5, r10.wwww, r9, r5
#endif

mov oC0, r5

There are a few additional details to be mentioned. Because a pixel shader operates on 4×(4·numMOpsPerFragment) submatrices, only input matrices with dimensions that are multiples of 4·numMOpsPerFragment are handled trivially. Other matrix sizes incur extra work because the pixel shading involving the last column block of A and the last row block of B reads in zeros and performs redundant computations. We even have to do work to ensure that zeros, and not undefined values, do indeed get read in. First, we set the texture addressing mode to border mode with a black border color. Unfortunately, not all GPUs support this feature. To support these GPUs, we either need to change the way we store the matrices in the surfaces so that the edge texels are not used, set the edge pixels to black and use clamp mode, or restrict ourselves to matrices with row and column counts that are multiples of 4·numMOpsPerFragment. Finally, the pixel shader does a lot of redundant work when processing input matrices with an inner dimension significantly smaller than 4·numMOpsPerFragment. Of course, such small matrices are best processed on the CPU anyway to avoid the overhead of creating and reading back textures.
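A minimal sketch of the sampler setup just described (black border addressing, plus the point sampling mentioned earlier), assuming the same device pointer and the two samplers used by the multiply code:

for (DWORD s = 0; s < 2; s++)
{
 d3dDevice->SetSamplerState(s, D3DSAMP_ADDRESSU, D3DTADDRESS_BORDER);
 d3dDevice->SetSamplerState(s, D3DSAMP_ADDRESSV, D3DTADDRESS_BORDER);
 d3dDevice->SetSamplerState(s, D3DSAMP_BORDERCOLOR, 0x00000000);   // black, zero alpha
 d3dDevice->SetSamplerState(s, D3DSAMP_MINFILTER, D3DTEXF_POINT);  // no filtering, as noted earlier
 d3dDevice->SetSamplerState(s, D3DSAMP_MAGFILTER, D3DTEXF_POINT);
}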

Transposed Multiplication

In practice, we rarely need to compute the transpose of a matrix as such, but we often need to multiply the transpose of a matrix with another matrix. We implement a transposed multiply operation to be able to do this. The operation we now describe implements C := A^T B, where A, B, and C are still defined as in the last section. The operation C := A B^T is also useful, but its implementation would be very similar to this one, so we omit it. As we see later, this operation ends up being more costly than the plain multiply. For this reason, it may be worthwhile to implement a simple transpose operation C := A^T as well, even though this operation can be inefficiently emulated using this code with B = 1. Such an operation would be a clear win if a sequence of multiplications were needed with a certain transposed matrix, and perhaps even in general.

A trivial CPU implementation of the transposed multiply code would simply exchange the row and column indexing of A, but this is not so easy on the GPU because we have packed several matrix elements into a single pixel, so the transpose has to happen on two levels: The atomic 4×4 matrix pixel shader operation has to be transposed, and the ordering of these submatrices also needs to be reversed. An indication of this added complexity is that this time we only managed to fit four transposed MOPs into our ps 2.0 pixel shader, as opposed to six for the plain multiply. Because this new constant is different from the previous one, we define it as numMTOpsPerFragment.

Figure 2: Schematic for transposed matrix multiply

A diagram to explain this algorithm is provided in Figure 2. This is exactly the same problem as given in Figure 1, with the matrix A now provided in transposed form. To compute C as before, we transpose A while we perform the multiply. The matrices B and C are unchanged. The darkened regions again show the texels sampled to compute the contribution of the second pass to the black output pixel. Note that in matrix A, the region consists of three vertically stacked MOPs, each of which has four texels in a horizontal row. Our pixel shader will now be stepping four texels along a row of A before resetting the horizontal offset and taking a step downward. The pixel shader has to move the starting texture coordinate of A downward between passes.

The C++ code for the operation is quite similar to that of the plain multiply:

void Matrix::multiplyAT(Matrix & a, Matrix & b) {
if (this == &a || this == &b)
 throw "can't operate inplace -- not supported by D3D.";
if (a.nRows != b.nRows)
 throw "matrix dimensions don't agree";
resize(a.nCols, b.nCols);
if (nRows * nCols == 0) return;

const unsigned numQuads = roundUpDivide(rows2TextureHeight(b.nRows), numMTOpsPerFragment);

const float TexcoordBiasW   = (1.0f/cols2TextureWidth(nCols)) * 0.5f;
const float TexcoordBiasH   = (1.0f/rows2TextureHeight(nRows)) * 0.5f;
const float TexcoordBiasAW  = (1.0f/cols2TextureWidth(a.nCols)) * 0.5f;
const float TexcoordBiasABH = (1.0f/rows2TextureHeight(a.nRows)) * 0.5f;

We compute bias values as usual above, and the offsets for texture B are also unchanged below:

const float tcPixelBH = 2 * TexcoordBiasABH;
const float tcMOpIncrementBH = tcPixelBH;
const float tcPassIncrementBH = numMTOpsPerFragment * tcPixelBH;

The offsets for matrix A are now needed in both the horizontal and vertical directions:

const float tcPixelAW = 2 * TexcoordBiasAW;
const float tcPixelAH = 2 * TexcoordBiasABH;
const float tcMOpIncrementAH = tcPixelAH;
const float tcPassIncrementAH = numMTOpsPerFragment * tcMOpIncrementAH;

There is an additional issue in the transposed multiply that did not show up before. Previously, it was always proper for the vertex shader to simply linearly map vertex coordinates from the range [1,–1] to the texture coordinate range [0,1] to define the U or V texture coordinates of an input texture. Now, however, the U dimension of texture A is mapped vertically and the V dimension horizontally. If A is not square, and there are unused components in the bottom row of the destination texture because its height is not a multiple of four, the mapping has to be adjusted. We map the vertex range [1,–1] to [0, quotient], where quotient is computed below. In effect, we virtually round the texture size up to the nearest multiple of four. The rest of the code should be familiar.

const unsigned awidth = cols2TextureWidth(a.nCols);
unsigned modW = awidth % 4;
if (modW != 0) modW = 4 - modW;
const float quotient = (awidth + modW)/(float)awidth;
const float halfQuot = quotient * 0.5f;

float vconsts[] = {
 0.5, -halfQuot, 0.5, 1,
 0.5 + TexcoordBiasW, 0.5 + TexcoordBiasH, 0 + TexcoordBiasABH, 0,
 0 + TexcoordBiasABH, 0, 0, 0,
 0, halfQuot + TexcoordBiasAW, 0, 0,
 tcPassIncrementBH, tcPassIncrementAH, 0, 0
};

float pconsts[] = {
 tcPixelAW, 0, 0, 0,
 0, 1 * tcMOpIncrementBH, 0, 0,
 0, 1 * tcPixelAH, 0, 0,
 0, 2 * tcMOpIncrementBH, 0, 0,
 0, 2 * tcPixelAH, 0, 0,
 0, 3 * tcMOpIncrementBH, 0, 0,
 0, 3 * tcPixelAH, 0, 0,
 0, 4 * tcMOpIncrementBH, 0, 0,
 0, 4 * tcPixelAH, 0, 0,
 0, 5 * tcMOpIncrementBH, 0, 0,
 0, 5 * tcPixelAH, 0, 0,
};

d3dDevice->SetRenderTarget(0, mathSurface);
d3dDevice->BeginScene();
d3dDevice->SetVertexDeclaration( vertexDeclaration2 );
d3dDevice->SetVertexShader( vertexShaders[VS_MULT_T] );
d3dDevice->SetPixelShader( pixelShaders[PS_MULT_T] );
d3dDevice->SetTexture(0, a.mathTexture);
d3dDevice->SetTexture(1, b.mathTexture);
d3dDevice->SetStreamSource( 0, quadsVertexBuffer, 0, TINYVERTEX_SIZE );
d3dDevice->SetVertexShaderConstantF(0, vconsts, 5);
d3dDevice->SetPixelShaderConstantF(0, pconsts, 1 + 2 * (numMTOpsPerFragment - 1) );

d3dDevice->DrawPrimitive( D3DPT_TRIANGLELIST, 0, 2 );
if (numQuads > 1)
{
 d3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, TRUE );
 d3dDevice->DrawPrimitive( D3DPT_TRIANGLELIST, 6, 2 * (numQuads - 1) );
 d3dDevice->SetRenderState( D3DRS_ALPHABLENDENABLE, FALSE );
}
d3dDevice->EndScene();
}

The vertex shader code first extracts and emits the vertex position, as the straight multiply did:

vs_1_1
dcl_position v0
def c5, -1, -1, 0, 0
add r3.xy, v0.xy, c5.xy
mov oPos.xy, r3.xy
mov oPos.zw, c0.zw

The vertex-to-texture-coordinate mapping is now done twice because, as discussed above, texture A's texture coordinate range is no longer always [0,1], while B's still is. These four instructions could be optimized into fewer, but the vertex shader is not a bottleneck here.

mov r0.xy, c1.xy
mad r0.xy, r3.xy, c0.xy, r0.xy
mov r1.xy, c3.xy
mad r1.xy, r3.xy, c0.xy, r1.xy

The code to add offsets to the texture coordinates is the same as in the plain multiply, except that it works along different dimensions for A:

mul r3, v0.w, c4.zzxz
add r3, r3, c1
mul r2, v0.w, c4.yyyy
add r2, r2, c2

Note that unlike before, we only emit two texture coordinates. We were not able to optimize the pixel shader in this case by precomputing more texture coordinates here.

mov oT0.x, r1.y
mov oT0.y, r2.x
mov oT1.x, r0.x
mov oT1.y, r3.z

Below is the last shader presented in this article. Notice that after we fetch the first sample from A, we keep nudging the texture coordinates to the right to fetch the next three samples. The last texld samples the 4-vector from B.

ps_2_0
dcl t0.xyzw
dcl t1.xyzw
dcl_2d s0
dcl_2d s1
texld r0, t0, s0
add r4, t0, c0
texld r1, r4, s0
add r4, r4, c0
texld r2, r4, s0
add r4, r4, c0
texld r3, r4, s0
texld r4, t1, s1

The transposed multiply can be accomplished with four dp4s, and the result goes to r5:

dp4 r5.x, r4, r0
dp4 r5.y, r4, r1
dp4 r5.z, r4, r2
dp4 r5.w, r4, r3

#if numMTOpsPerFragment >= 2

To execute the next MOP, we push the original t0 downward in A by adding c2 and then again sample four consecutive pixels. We rotate the sampling destination registers to avoid getting a fourth-order dependent read error from the shader compiler for as long as possible.

add r0, t0, c2
texld r6, r0, s0
add r0, r0, c0
texld r7, r0, s0
add r0, r0, c0
texld r8, r0, s0
add r0, r0, c0
texld r9, r0, s0
add r1, t1, c1
texld r10, r1, s1
dp4 r6.x, r10, r6
dp4 r6.y, r10, r7
dp4 r6.z, r10, r8
dp4 r6.w, r10, r9
add r5, r5, r6
#endif

#if numMTOpsPerFragment >= 3

The third and fourth blocks simply continue to follow this pattern.

add r4, t0, c4
texld r0, r4, s0
add r4, r4, c0
texld r1, r4, s0
add r4, r4, c0
texld r2, r4, s0
add r4, r4, c0
texld r3, r4, s0
add r4, t1, c3
texld r4, r4, s1
dp4 r6.x, r4, r0
dp4 r6.y, r4, r1
dp4 r6.z, r4, r2
dp4 r6.w, r4, r3
add r5, r5, r6
#endif

#if numMTOpsPerFragment >= 4
add r0, t0, c6
texld r6, r0, s0
add r0, r0, c0
texld r7, r0, s0
add r0, r0, c0
texld r8, r0, s0
add r0, r0, c0
texld r9, r0, s0
add r1, t1, c5
texld r10, r1, s1
dp4 r6.x, r10, r6
dp4 r6.y, r10, r7
dp4 r6.z, r10, r8
dp4 r6.w, r10, r9
add r5, r5, r6
#endif

Unfortunately, the above conditional block is the last one that compiles with ps 2.0 because the first texld of the next block produces a fourth-order texop error. Even so, this hand-coded assembly still manages to pack much more math into the pixel shader than the HLSL compiler managed.

#if numMTOpsPerFragment >= 5
add r7, t0, c8
texld r0, r7, s0
add r7, r7, c0
texld r1, r7, s0
add r7, r7, c0
texld r2, r7, s0
add r7, r7, c0
texld r3, r7, s0
add r4, t1, c7
texld r4, r4, s1
dp4 r6.x, r4, r0
dp4 r6.y, r4, r1
dp4 r6.z, r4, r2
dp4 r6.w, r4, r3
add r5, r5, r6
#endif

mov oC0, r5

Other Operations

From the operations described above, we create more by writing different variations and by writing macro operations that build on them. We briefly summarize them here:

float Matrix::dot(Matrix & vec); // returns this^T * vec

This is only defined if both operands are vectors. The dot product of two vectors a and b equals a^T b, so we can reuse the transposed multiply operation. This results in a temporary 1×1 texture whose red component we read out and return.
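A minimal sketch of how dot() could be put together from the pieces shown so far; readBackScalar() is a hypothetical helper that performs the 1×1 readback via the staging texture path described earlier, not a method defined in this article:

float Matrix::dot(Matrix & vec)
{
 Matrix tmp;
 tmp.multiplyAT(*this, vec);   // (n x 1)^T * (n x 1)  ->  1 x 1 result texture
 return tmp.readBackScalar();  // hypothetical helper: fetch the single red component to the CPU
}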

float Matrix::normSquared();

Only defined for a vector a, this simply calls a.dot(a).

void Matrix::multiply(float c); // this *= c

Multiplication by a constant is implemented with a simple shader that performs a multiplicative blend between the destination and a c-colored quad.

void Matrix::add(Matrix & b); // this += b

Unary accumulate is the copy() operation with additive blending with the render target turned on.

void Matrix::max(Matrix & a, float ref); // this = max(a, ref)

This operation is also similar to copy() but employs the max pixel shader opcode to compute, per element, the maximum of the corresponding element of a and the reference value.

void Matrix::mad(Matrix & b, float c); // this += b * c
void Matrix::mad(Matrix & a, Matrix & b, float c); // this = a + b * c
void Matrix::mad(Matrix & a, Matrix & b); // this += a .* b
void Matrix::madad(Matrix & a, Matrix & b, Matrix & c, Matrix & d); // this = a + (b + c) .* d

Finally, all the different flavors of the mad (multiply-add) operation are a combination of the add and constant-multiply shaders. We use .* to denote elementwise (array) multiplication.

We also implemented some initialization operations. To create a zero matrix, we simply clear the texture to black. Identity matrices and other special matrices are best implemented by writing the appropriate data with the CPU. This is also how matrices are saved to and loaded from file.
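A possible realization of the zero-matrix initialization mentioned above, phrased as a sketch on top of the earlier code fragments (the zeros() signature matches its use in the algorithms below; resize(), mathSurface, and d3dDevice are the ones already introduced):

void Matrix::zeros(unsigned rows, unsigned cols)
{
 resize(rows, cols);
 if (nRows * nCols == 0) return;
 d3dDevice->SetRenderTarget(0, mathSurface);
 d3dDevice->Clear(0, 0, D3DCLEAR_TARGET, D3DCOLOR_ARGB(0, 0, 0, 0), 1.0f, 0);  // clear to black
}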

Applications

In this section we describe two high-level algorithms that use the operations described above. Neither of them reads intermediate results (other than scalars) back to the CPU, so all the real work still happens on the GPU as a sequence of render-to-texture operations. It would be possible to optimize both of them by writing special-purpose macro shaders that combine several basic matrix operations to reduce the number of render-to-texture operations. We have done this to a small degree by implementing the multiply-add operations, but in general we would like to keep our operations small in number and reusable.

Both of the discussed methods are iterative. Iterative methods, in contrast to pivoting methods, are typically simple and perform a small number of matrix operations to converge to the desired result, rather than performing a large number of scalar operations that are often difficult to vectorize.

Conjugate Gradients

The conjugate gradient algorithm was developed at ETH Zurich in 1952 [4]. It is the most common iterative algorithm used to solve a system of linear equations of the form Ax = b, where the matrix A and the vector b are given and the vector x is to be found. Although the algorithm has been extended to handle more general classes of matrices, we only deal with the simplest version, which requires A to be symmetric positive definite.

Our implementation of the algorithm does not have any DirectX or shader code of its own. Instead, it uses the methods of the matrix class we created. The three operand matrices are given:

Matrix &A = ...;
Matrix &x = ...;
Matrix &b = ...;
unsigned n = b.getNRows();

If the algorithm is used in a physics simulation context, it is often desirable to warm-start it with the solution of the problem from the previous simulation time step, in the hope that the solution of the current time step is nearby. If the size of the input vector x is compatible with the size of A, we assume that the user wants to warm-start with x; otherwise, we start with a first guess of zero:

if (x.getNRows() != n || x.getNCols() != 1)
 x.zeros(n, 1);

The algorithm uses three temporary vectors:

Matrix p, r, s;
p.copy(b);
r.copy(b);
float rr = r.normSquared();
s.multiply(A, p);
float t = p.dot(s);
float alpha = rr / t;
x.mad(p, alpha);
float rrnew = rr;

The conjugate gradient algorithm is proven to converge to the exact solution within n steps, though we could settle for an approximate solution after fewer iterations.

unsigned iter = n;
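The body of the iteration loop is missing from this copy of the text. As a reference point, a minimal sketch of a textbook conjugate gradient iteration, phrased only with the matrix class operations declared above and reusing the variable names from the initialization code (a reconstruction, not necessarily the author's exact loop), could look like this:

for (unsigned k = 2; k <= iter; k++) {
 r.mad(s, -alpha);        // r += s * (-alpha): update the residual
 rrnew = r.normSquared();
 p.multiply(rrnew / rr);  // p *= beta, with beta = rrnew / rr
 p.add(r);                // p += r: new search direction
 s.multiply(A, p);        // s = A * p
 t = p.dot(s);
 alpha = rrnew / t;
 x.mad(p, alpha);         // x += alpha * p
 rr = rrnew;
}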
Linear Complementarity Problems
As before, A and b are given, and x is to be found. We use the projected Jacobi<br />

method [8] for solving the problem, which is perhaps the simplest way to do so,<br />

though not necessarily the one with the best convergence properties. The projected<br />

Jacobi algorithm can be stated succinctly as the recursion:<br />

�<br />

x �max( x � D( Ax �b),<br />

0)<br />

i�1i i<br />

Where D is defined as:<br />

�<br />

D � �diagonal( A)<br />

1<br />

� is a constant that steers convergence. Clever implementations of the algorithm<br />

tune this value, while the solver runs to speed up convergence; we just use a<br />

fixed value. This algorithm again requires A to be symmetric positive definite.<br />

As before, we first receive the matrices we are to operate on. Note that<br />

because d is a constant, it is also expected to be provided as an input. This time,<br />

the number of iterations is also a mandatory input because with this algorithm,<br />

there is no guaranteed convergence for a certain number of iterations. In the code<br />

below we store the diagonal elements of the diagonal matrix D in a column vector<br />

d.<br />

Matrix &x=...;<br />

Matrix &A=...;<br />

Matrix &d=...;<br />

Matrix &b=...;<br />

unsigned iter = ...;<br />

unsigned n = b.getNRows();<br />

Matrix w, t;<br />

Here we again warm-start the algorithm with the initial value of x if it exists; otherwise,<br />

we start at zero:<br />

if (x.getNRows() != n || x.getNCols() != 1)<br />

{<br />

x.zeros(n, 1);<br />

w.zeros(n, 1);<br />

}<br />

else<br />

w.multiply(A, x);<br />

t.madad(x, b, w, d);<br />

x.max(t, 0);<br />

for (unsigned k=1; k<iter; k++)<br />
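A minimal sketch of the loop body, assuming each iteration simply repeats the three statements used for the first step above:<br />

{<br />
    w.multiply(A, x);       // w = A * x<br />
    t.madad(x, b, w, d);    // presumably t = x - d*(Ax - b), per the recursion above<br />
    x.max(t, 0);            // project the iterate onto x >= 0<br />
}<br />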


The above loop implements the iteration presented above. Here too the most<br />
expensive operation is a matrix vector multiply.<br />

Results<br />

Figure 3: GPU vs. CPU performance for various operations involving a<br />
1000×1000 matrix<br />


Figure 3 summarizes the running times for the matrix copy, add, multiply, transposed<br />

multiply, conjugate gradients, and projected Jacobi algorithms. We have profiled<br />

four configurations: Our GPU implementation was run on a Radeon 9500 Pro<br />

plugged into a PC with an 800 MHz AMD Athlon processor and 256 MB of RAM,<br />

and on a Radeon 9700 Pro in a 2.4 GHz Pentium 4 PC with 256 MB of RAM. The two<br />

CPU configurations were a simple C implementation by the author and a program<br />

using the ATLAS library. Both of the CPU configurations were timed on a 1.6<br />

GHz Intel Pentium 4 processor-equipped PC with 256 MB RAM. We used a version<br />

of ATLAS optimized for Pentium 4 CPUs with 8 KB L1 cache and 256 KB L2<br />

cache; these specifications correspond to our test system. Note that while ATLAS<br />

gains a sizeable performance boost from this cache-size specific optimization, it<br />

means that the library has to be reconfigured for each target platform; this is<br />

somewhat impractical for interactive entertainment software that is to be distributed<br />

to end users. In comparison, a <strong>DirectX</strong> 9 program, which only indirectly<br />

interfaces with hardware and thus exploits the video card’s manufacturer-specific<br />

driver optimizations, is more flexible.<br />

All GPU timings represent the time elapsed between the call of the appropriate<br />

matrix class method and completion of the readback of the result matrix from<br />

video to system memory. Retrieving results to system memory is a significant<br />

portion of the time needed for copy and addition but is negligible for the other<br />



operations. The C implementation is an order of magnitude slower than either the<br />

ATLAS or the GPU implementation, which fall in the same performance class.<br />

The GPU transposed multiply is slower than the straight multiply because it<br />

needs more rendering passes. The CPU implementations’ speeds are the same<br />

for straight and transposed multiply because here the difference boils down to a<br />

change in matrix indexing and computation order that does not even need to influence<br />

cache coherence.<br />

The projected Jacobi algorithm for LCPs runs faster than conjugate gradients<br />

for a given problem size because the number of render-to-texture operations in<br />

the loop is much lower, and we never read back any intermediate values from the<br />

GPU — not even scalars. Conjugate gradients read back a scalar value in<br />

normSquared() — this is the result of a dot product operation and is retrieved<br />

from the 1×1 render target texture used in this case. Because it is only a single<br />

value, it does not stress the limited texture memory to system memory bandwidth<br />

of PCs, but it still hurts performance because it forces the CPU and GPU to<br />

work together in lockstep instead of working asynchronously. One could further<br />

optimize conjugate gradients by merging its sequence of vector operations into a<br />

single operation by writing a custom “macro” pixel shader. This is left as an exercise<br />

for the reader.<br />

Because LCP and conjugate gradients consist of a sequence of vector-vector<br />

and matrix-vector operations, ATLAS is unable to leverage its optimized<br />

matrix-matrix kernel and instead adds significant overhead, losing out to the<br />

inlined C version. For these two algorithms, data caching is less of a bottleneck<br />

because there are fewer numbers to work with. Instead, raw floating-point performance<br />

is dominant, catapulting the Radeons into the lead.<br />

Figure 4: Collapsing wall of 60 cubes


Conclusion<br />

Figure 4 shows a practical application: a wall of 60 cubes collapsing. The simulation<br />

is performed by the projected Jacobi code. The simulation uses ω = –0.1 and<br />

does 2n iterations, where n is the size of the input matrix. If the problem is<br />

expressed as a single dense matrix (with an initial size of 400×400), the simulation<br />

runs at two seconds per frame. If the problem is dynamically decomposed<br />

into small sub-problems, real-time performance is achieved. Of course, more<br />

advanced LCP algorithms (which are less suitable for GPU implementation) can<br />

achieve even better results because of potentially lower storage overhead (sparse<br />

matrices) and much faster convergence.<br />

In [6], Larsen and McAllister have benchmarked matrix multiplies on GeForce3<br />

GPUs and found that the GPU, working with byte values, achieves similar performance<br />

to the CPU using single-precision floating-point, effectively operating on<br />

four times as much data. They were pessimistic about the prospect of a fourfold<br />

increase in GPU performance, even if GPUs were to integrate floating-point processing<br />

capabilities. Our results above indicate that this has indeed happened only<br />

two years later.<br />

We have observed that the Radeon GPUs have outperformed optimized CPU<br />

code running on a mid-range PC when executing two important algorithms.<br />

Moreover, the performance penalty due to moving data to the GPU and back to<br />

main memory can be negligible compared to the overall cost of the computation<br />

when the problem size is sufficient. As a result, this additional source of computing<br />

power should not be ignored. Instead, algorithms must be found that can<br />

exploit the specific strengths of the GPU. Approaches that split the work between<br />

CPU and GPU and thus achieve maximum parallelism will make it possible to run<br />

simulations of previously intractable scales on low-cost PCs.<br />

Acknowledgments<br />

Thanks to Wolfgang Engel, Tom Forsyth, and ATI and nVidia developer relations<br />
for help with different aspects of this project and to Erwin Coumans, Mark Harris,<br />
Stephane Redon, and Jan Paul van Waveren for valuable suggestions.<br />

References<br />

[1] Bolz, Jeff, Ian Farmer, Eitan Grinspun, and Peter Schröder, “Sparse Matrix<br />

Solvers on the GPU: Conjugate Gradients and Multigrid,” to appear in the proceedings<br />

of SIGGRAPH 2003.<br />

[2] Dongarra, J.J., J. Du Croz, S. Hammarling, and R.J. Hanson, “An extended set<br />

of FORTRAN Basic Linear Algebra Subprograms,” ACM Trans. Math. Soft., 14<br />

(1988), pp. 1-17.



[3] Facius, Axel, “Iterative Solution of Linear Systems with Improved Arithmetic<br />

and Result Verification,” Ph.D. thesis, Universität Karlsruhe, July 2000.<br />

[4] Hestenes, M. and E. Stiefel, “Methods of conjugate gradients for solving linear<br />

systems,” J. Research Nat. Bur. Standards 49, 1952.<br />

[5] Krüger, Jens and Rüdiger Westermann, “Linear Algebra Operators for GPU<br />

Implementation of Numerical Algorithms,” to appear in the proceedings of<br />

SIGGRAPH 2003.<br />

[6] Larsen, E.S. and D. McAllister, “Fast Matrix Multiplies using Graphics Hardware,”<br />

SuperComputing 2001 Conference, Denver, CO, November 2001.<br />

[7] Lawson, C.L., R.J. Hanson, D. Kincaid, and F.T. Krogh, “Basic Linear Algebra<br />

Subprograms for FORTRAN usage,” ACM Trans. Math. Soft., 5 (1979), pp.<br />

308-323.<br />

[8] Murty, K.G., Linear Complementarity, Linear and Nonlinear <strong>Programming</strong>,<br />

Helderman-Verlag, 1988.<br />

[9] Whaley, R.C. and J. Dongarra, “Automatically Tuned Linear Algebra Software,”<br />

SuperComputing 1998 Conference, Orlando, FL, November 1998.


Section III<br />

Software <strong>Shader</strong>s and<br />

<strong>Shader</strong> <strong>Programming</strong><br />

<strong>Tips</strong><br />

Software Vertex <strong>Shader</strong> Processing<br />

by Dean P. Macri<br />

x86 <strong>Shader</strong>s–ps_2_0 <strong>Shader</strong>s in Software<br />

by Nicolas Capens<br />

SoftD3D: A Software-only Implementation<br />

of Microsoft’s Direct3D API<br />

by Oliver Weichhold<br />

Named Constants in <strong>Shader</strong> Development<br />

by Jeffrey Kiel<br />



Software Vertex <strong>Shader</strong> Processing<br />

Dean P. Macri<br />

Introduction<br />

Recent advances in processor performance coupled with the programmability<br />

available in the vertex shader models provided in Microsoft <strong>DirectX</strong> give developers<br />

considerable freedom to write customized and specialized techniques for<br />

real-time graphics and gaming. <strong>DirectX</strong> 9 introduced the vs.2.0, vs.2.x, and vs.3.0<br />

shader models that provide static and dynamic branching capabilities, looping,<br />

conditional execution, predication, and more flexible addressing and computational<br />

ability. Rather than waiting for a proliferation of graphics cards that support<br />

these features in hardware, developers can begin using these features today with<br />

software vertex processing. In addition, highly optimized vector processing for<br />

non-graphics data can be done using the software vertex processing pipeline.<br />

This article explores optimization guidelines for writing shaders that use the<br />

software vertex processing pipeline. While that is the main goal, the techniques<br />

described here should also apply to vertex shaders written for graphics hardware.<br />

First I talk about why you’d want to use software vertex processing and I describe<br />

some non-graphics algorithms and techniques that can benefit from software<br />

vertex processing. Then I describe some of the processor features that give the<br />

pipeline significant performance capability and discuss some of the design characteristics<br />

of the Intel implementation of the software vertex processing pipeline.<br />

Next, I walk through the specific optimization guidelines, why they have an<br />

impact, and how to get the highest performance out of your shaders. Finally, I<br />

wrap up with a description of the included sample program and shaders.<br />

Throughout this article, I refer to the Intel-optimized portion of the <strong>DirectX</strong><br />

runtime as a compiler because it recompiles shader bytecode to IA-32 instruction<br />

sequences. Any reference to the word “compiler” is with that context in mind. If<br />

you’re eager to just begin optimizing your vertex shaders and aren’t really interested<br />

in the motivation or background material, feel free to skip ahead to the<br />

“Optimization Guidelines” section.<br />




Why Software Vertex Processing?<br />

Once you’ve looked at the vertex shader models provided by <strong>DirectX</strong>, several reasons<br />

for doing software vertex processing quickly come to mind:<br />

• Support for graphics hardware that doesn’t have vertex shaders<br />

• Use of higher shader versions than are supported by available hardware<br />

• Writing shaders with more instructions than the limits of hardware shaders<br />

After some additional thought, other, less obvious reasons that might be envisioned<br />

for doing software vertex processing include:<br />

• Doing work other than pixel processing on the output of the vertex shader<br />

• Doing non-graphics type work using the familiar, vector-oriented shader<br />
model<br />

• Ensuring predictable performance and compatibility across end-user systems<br />

While each of these is relatively self explanatory, I want to focus on the first two<br />

items of the second list and discuss in more detail why you might want to do<br />

these things.<br />

<strong>With</strong> all the flexibility that the programmable aspects of the <strong>DirectX</strong> pipeline<br />

provide, there are still some situations in which you want to short-circuit the<br />

pipeline and do your own work on the intermediate results. One example would<br />

be using the vertex shader portion of the pipeline for transforming bounding volumes<br />

for occlusion or collision detection tests. You don’t want to send those<br />

transformed volumes through the pixel processing pipeline; you just want to do<br />

tests on them to determine whether or not other data should be sent through the<br />

graphics pipeline. In this case, developers have often chosen to implement their<br />

own transformation code to do the work, and often, the optimization level of that<br />

code is minimal. Using the ProcessVertices() method of the IDirect3DDevice9<br />

interface, developers can write a small vertex shader that does the transformation<br />

and writes the results out to a system memory vertex buffer. That buffer can then<br />

be examined for doing the occlusion or collision tests. The end result is that fast<br />

Streaming SIMD Extensions (SSE) optimized code will be used on the transformations,<br />

resulting in a net performance gain with a minimal amount of<br />

development.<br />
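As a rough sketch of this call sequence (not the article’s sample code; the helper name, buffer setup, and NULL output declaration are assumptions, and error handling is omitted):<br />

#include <d3d9.h><br />

// Hypothetical helper: run a transform-only vertex shader over bounding-volume<br />
// vertices in software and read the results back for occlusion/collision tests.<br />
// Assumes the device and buffers were set up for software vertex processing.<br />
void TransformBoundingVolumes(IDirect3DDevice9 *device,<br />
                              IDirect3DVertexBuffer9 *srcVB,<br />
                              IDirect3DVertexDeclaration9 *decl,<br />
                              IDirect3DVertexShader9 *transformVS,<br />
                              IDirect3DVertexBuffer9 *sysmemVB,   // system memory destination<br />
                              UINT vertexCount, UINT stride)<br />
{<br />
    device->SetStreamSource(0, srcVB, 0, stride);<br />
    device->SetVertexDeclaration(decl);<br />
    device->SetVertexShader(transformVS);<br />

    // Run the vertex shader on the CPU; results land in the system memory buffer.<br />
    // The output declaration argument is left NULL here; check the ProcessVertices()<br />
    // documentation for the exact requirements of your vertex formats.<br />
    device->ProcessVertices(0, 0, vertexCount, sysmemVB, NULL, 0);<br />

    void *data = NULL;<br />
    if (sysmemVB->Lock(0, 0, &data, D3DLOCK_READONLY) == D3D_OK)<br />
    {<br />
        // ... examine the transformed bounding volumes here ...<br />
        sysmemVB->Unlock();<br />
    }<br />
}<br />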

Another possibility would be using a vertex shader to update positions and<br />

velocities for a large particle system using numerical integration. The software<br />

vertex processing pipeline could quickly go through a large number of particles,<br />

again using the ProcessVertices() method of the IDirect3DDevice9 interface.<br />

After processing, additional work for preparing a graphical display of the data<br />

could be done.<br />

These two examples should provide some reasonable motivation for why<br />

software vertex processing is useful. The next section describes how the high<br />

performance is achieved using the features of recent processors.


Processor Features and Performance Advances<br />

<strong>With</strong> the introduction of the Intel Pentium 4 processor with hyper-threading technology,<br />

Intel has brought multi-processing to consumer PCs. Combined with the<br />

SSE introduced in the Pentium III processor and the Streaming SIMD Extensions<br />

2 (SSE2) introduced in the Pentium 4 processor, there’s now support for SIMD<br />

operations on 128-bit data in the form of single- and double-precision floating-point<br />

numbers as well as integer values from bytes to double-quadwords with<br />

all sizes in between, as seen in Figure 1. The software implementation of the vertex<br />

processing pipeline described in the next section takes advantage of all these<br />

features. To set the context for better understanding of that information, I briefly<br />

describe some of these features here. Considerably more in-depth information on<br />

the Pentium 4 processor and its features can be found at http://www.intel.com/<br />

products/desktop/processors/pentium4/.<br />

Figure 1: SSE and SSE2 data types<br />

Streaming SIMD Extensions (SSE)<br />

If you’ve been optimizing assembly routines or happen to have browsed through<br />

the instruction set manuals for Intel’s latest processors, you’re well aware that<br />

the vertex shader instructions don’t map one-to-one to the IA-32 instruction set.<br />

However, the SSE and SSE2 instructions are a fairly close fit. Using these<br />

instruction set extensions, engineers at Intel have created a compiler that takes<br />

already-compiled vertex shader bytecode and converts it to optimal SSE and<br />

SSE2 instruction sequences.<br />

The nature of the SSE and SSE2 instruction sets is most appropriate for<br />

operating on data in pairs of registers rather than within a single register. Some<br />

instructions in the vertex shader specifications, like dp3 and dp4, combine the values<br />

within a given register and produce a single result. If the vertex shader<br />

instructions were mapped as directly as possible to SSE and SSE2 instructions,



then “horizontal” operations like dp3 and dp4 would eliminate three-quarters of<br />

the computation bandwidth provided. In an attempt to fully utilize the compute<br />

bandwidth provided for maximum performance, incoming data is transformed<br />

such that one 128-bit SSE/SSE2 register (also known as an XMM register) contains<br />

a single component (.x, .y, .z, or .w) of four vertices. So completely representing<br />

a four-component register from a vertex shader would require four XMM<br />

registers but would contain information for four vertices. Similarly, vertex shader<br />

constant registers are replicated (and stored in memory initially) so that a single<br />

constant register would also consume four XMM registers. Figure 2 shows an<br />

example of a single shader register transformed into four XMM registers.<br />

Figure 2: Transforming data from AOS to SOA<br />

The data at the left of Figure 2 is known as array-of-structures (AOS) data<br />

because it could be represented by a C/C++ data structure with one element of<br />

the structure for each component of the vector. For multiple vectors, you’d have<br />

an array of the structures, hence the name. The data at the right of Figure 2 is<br />

known as structure-of-arrays (SOA) data because you could have a C/C++ structure<br />

with four arrays, one for the x components, one for the y components, etc.<br />
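As a small C++ illustration (not from the article), the two layouts for a batch of four vertices could be declared like this:<br />

// Array-of-structures: one structure per vertex; an array holds the batch.<br />
struct Vertex4AOS<br />
{<br />
    float x, y, z, w;<br />
};<br />
Vertex4AOS batchAOS[4];   // batchAOS[i] holds all four components of vertex i<br />

// Structure-of-arrays: one array per component, so the four x values sit<br />
// contiguously and can be loaded into a single XMM register with one load.<br />
struct Vertex4SOA<br />
{<br />
    float x[4];<br />
    float y[4];<br />
    float z[4];<br />
    float w[4];<br />
};<br />
Vertex4SOA batchSOA;      // batchSOA.x[i] is the x component of vertex i<br />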

Hyper-Threading Technology<br />

The latest addition to the Pentium 4 processor family is support for simultaneous<br />

multithreading with what’s known as hyper-threading technology (HT technology).<br />

HT technology adds a second “logical” processor to the package of a<br />

Pentium 4 processor. The operating system can schedule processes and threads<br />

on the second processor, and behind the scenes the processor schedules operations<br />

from each of the logical processors onto the shared execution resources of<br />

the processor core. The processor caches and most internal data structures are<br />

shared between the logical processors, with only a few essential components<br />

being duplicated, as seen in Figure 3. Because typical IA-32 code sequences have<br />

branches or memory loads that stall the processor temporarily, each of the logical<br />

processors can make progress, and combined they can often achieve an additional 15 to<br />

20 percent speedup over the same code running on a similar processor with only<br />

one logical processor active.<br />

<strong>With</strong> HT technology reaching consumer desktop PCs, Intel engineers working<br />

on the shader compiler took the opportunity to create multiple threads of vertex<br />

shader processing code when sufficiently large batches of vertices are being


processed. The exact number of vertices<br />

that produce the additional thread is<br />

determined at run time based on several<br />

factors. It’s worth noting that if your application<br />

is doing software vertex processing<br />

and running on a multiprocessor system<br />

or a system with HT technology, one or<br />

more additional threads of execution may<br />

be created to boost performance.<br />

<strong>With</strong> that sampling of background<br />

information, we now have sufficient information<br />

to look closely at the guidelines<br />

that we can use when writing software<br />

vertex shaders. The next section does<br />

just that.<br />

Figure 3: Intel Pentium 4 processor with<br />
HT technology diagram<br />

Optimization Guidelines<br />

The optimization guidelines listed here can help improve the performance of vertex<br />

shaders running in software. Following the list is a detailed description of each<br />

of the guidelines with some examples of how to apply them and why they impact<br />

performance.<br />

• Use the highest shader version provided by the API.<br />
• Use the macros.<br />
• Define heavily used constants.<br />
• Write directly to output registers.<br />
• Use source swizzle and destination masks.<br />
• Minimize dependency chains.<br />
• Minimize temp register usage between basic blocks.<br />
• Use rep instruction if aL isn’t needed in a loop.<br />
• Avoid address register usage or reorder vertices based on expected values.<br />
• Try to eliminate unnecessary conditionals.<br />
• Use predicates for a few instructions when masking won’t work.<br />
• Use the break instructions to early-exit from loops.<br />
• Use conditionals to early-exit from rarely used code.<br />
• Try to arrange conditional data based on expected behavior.<br />
• Profile!<br />



Use the Highest <strong>Shader</strong> Version Provided by the API<br />

<strong>DirectX</strong> 9 introduced three new vertex shader models: vs.2.0, vs.2.x, and vs.3.0.<br />

For the software pipeline, the 2.x and 3.0 models also have vs.2.sw and vs.3.sw<br />

versions that eliminate the constraints on the number of instructions available as<br />

well as extend the register and label limits. By using the highest shader version<br />

available (vs.3.0 or vs.3.sw if limits are a concern), you can take advantage of all<br />

the features provided by that shader model to get the best performance possible.<br />

The Mandelbrot sample shader provided on the companion CD illustrates the performance<br />

advantage of having instructions like break_ge available to early-exit<br />

from loops. Using it yielded a 2.5x speedup in some cases over the vs.2.0 version.<br />

Use the Macros<br />

The shader models in <strong>DirectX</strong> 9 include several macro instructions — some new<br />

and some carried over from previous shader versions. Some examples are the<br />

matrix multiplication instructions (M4X4, M3X4, etc.) and cross-product (CRS)<br />

instructions. Using them helps the compiler make smarter decisions about which<br />

registers need to be temporarily saved to memory and which can be discarded<br />

after use. As a general guideline, if an operation can be done with a single shader<br />

instruction or a combination of other shader instructions, use the single instruction<br />

version. The following code sequence illustrates an example:<br />

Before:<br />

dp4 r0.x, v0, c2<br />
dp4 r0.y, v0, c3<br />
dp4 r0.z, v0, c4<br />
dp4 r0.w, v0, c5<br />
mul r2, v1.yzxw, v2.zxyw<br />
mad r2, -v2.yzxw, v1.zxyw, r2<br />

After:<br />

m4x4 r0, v0, c2<br />
crs r2, v1, v2<br />

Define Heavily Used Constants<br />

When you use the def, defi, and defb instructions to define constants in a shader,<br />

the compiler can examine the specific values and produce code that is more optimal<br />

than if the constants were defined through the <strong>DirectX</strong> APIs outside of the<br />

shader. As an example, if you use the def instruction to define a four-wide constant<br />

of all zeros (0.0f) or all ones (1.0f), the compiler can make smart decisions<br />

when that constant is used. Assuming these are in constants C0 and C1, respectively,<br />

if the compiler encounters an addition of C0 to another register, it won’t<br />

have to generate any code because adding zero to a value won’t change the value.<br />

Similarly, if a value is multiplied by C1, the compiler again wouldn’t have to generate<br />

any code. Any constants that will be the same, regardless of how the shader is<br />

used, should be defined directly in the shader itself using the def, defi, and defb<br />

instructions.


Write Directly to Output Registers<br />

As described previously, the small number of SIMD registers and the reformatting<br />

of the input data to work on four vertices at a time mean that the shader compiler<br />

must generate code to save temporary results to memory whenever they’re<br />

generated but not immediately used. For this reason, if a result being generated is<br />

ultimately to be copied to an output register, modify the shader code to write<br />

directly to that output register rather than to a temporary register that eventually<br />

gets copied.<br />

Use Source Swizzle and Destination Masks<br />

Like the previous guideline, the limited number of SIMD registers available<br />

means that if results are calculated that aren’t used, one or more SIMD registers<br />

are wasted and unnecessary instructions for spilling data to memory and then<br />

restoring it have to be generated. If you need fewer than all four components of a<br />

source register or you only need a subset of the results generated by an instruction,<br />

make certain to specify source swizzle and destination write masks to inform<br />

the compiler of that. In the following example, the mov on the left only uses one<br />

component of r0 and only writes one component of oPos. The code on the right<br />

shows how to improve this to minimize register usage. Note that the mov instruction<br />

doesn’t change.<br />

Before:<br />

dp3 r0, v0, v1<br />
add r0, r0, v1<br />
mov oPos.x, r0.x<br />

After:<br />

dp3 r0.x, v0.xyz, v1.xyz<br />
add r0.x, r0.x, v1.x<br />
mov oPos.x, r0.x<br />

Minimize Dependency Chains<br />


When writing shader assembly, you’re often faced with a situation of one instruction<br />

operating on the results generated by a previous instruction. A given<br />

sequence of instructions in which the input of each instruction is dependent on<br />

the output of previous instructions is called a dependency chain. When implementing<br />

your algorithms in shaders, try to make dependency chains as short as possible<br />

or eliminate them altogether. Also, try to keep the instructions of a<br />

dependency chain clustered closely together rather than spread throughout the<br />

shader. Doing so gives the compiler more flexibility in scheduling instructions and<br />

register usage so that long latency instructions can be overlapped with shorter<br />

latency instructions and less spilling of registers to memory needs to be done.<br />

Minimize Temp Register Usage between Basic Blocks<br />


A basic block is a piece of code that has no branching (in the form of if statements,<br />

call statements, and loop or rep constructs). The original vs.1.0 and<br />

vs.1.1 shader models consisted entirely of one basic block because no branching<br />

of any kind existed. In the vs.2.0 and higher shader models, however, multiple



basic blocks can exist in a given shader. To assist the compiler in its optimizations,<br />

it’s recommended that temporary registers are not reused between basic blocks.<br />

Otherwise, the compiler will have to generate code to save and restore the temp<br />

register from memory.<br />

Use rep Instruction if aL Isn’t Needed in a Loop<br />

Two types of looping were introduced in the vs.2.0 shader model: rep and loop.<br />

The rep instruction enables a sequence of instructions to be repeated for a number<br />

of iterations based on an integer constant. The loop instruction works similarly,<br />

but it provides the aL register used as a loop counter for indexing into the<br />

constant pool. For loops where you don’t need to access the constant registers<br />

based on the loop counter, use the rep instruction because the compiler generates<br />

less code for it.<br />

Avoid Address Register Usage or Reorder Vertices Based<br />

on Expected Values<br />

A common theme with the optimizations described here is that anything that<br />

hurts the performance due to the SOA arrangement of data should be eliminated<br />

or avoided. Another example in this area is the use of the address register, a0.<br />

The address register is used to index into the constant registers, and because it<br />

can be computed dynamically within the shader, it can have different values for<br />

different vertices. If you can accomplish a task without the use of the address register,<br />

do so — even if it means a few extra instructions in your shader.<br />

Figure 4 shows the effect of using the address register when the values for<br />

the four vertices in the address register are not all the same. As you can see from<br />

the diagram, the compiler must produce code that extracts each of the four components<br />

(based on the four values in the address register x component) from the<br />

various constant values and then combine them together to produce the final<br />

result (r5.x). The overhead of doing this is approximately 20 clocks. If all four values<br />

of the address register across four vertices were the same, a single instruction<br />

would suffice. Therefore, if possible, reorder your vertices based on the<br />

expected values in the address register.<br />

Figure 4: Address register usage effect on SOA data



Try to Eliminate Unnecessary Conditionals<br />

Similar to the problem associated with address register usage, conditionals that<br />

vary from one vertex to the next can severely impact performance of software<br />

vertex shaders. When a conditional is encountered, the compiler must generate<br />

code to compute the outcome of the conditional for all four vertices and then execute<br />

both paths (for an if/else statement) and mask and combine the results.<br />

Because of this, it’s best to avoid conditionals if at all possible.<br />

One way to avoid conditionals is to use numerical masking. The sge and slt<br />

instructions compute a 1.0 or 0.0 result for each component of the destination<br />

register based on pair-wise comparisons between the source registers. By doing<br />

both comparisons, two multiplies, and an add, you can produce results that are<br />

commonly implemented with an if/else sequence. For example:<br />

Before:<br />

if_lt r0.x, r1.x<br />
add r2, r2, r3<br />
else<br />
add r2, r2, r4<br />
endif<br />

After:<br />

slt r8, r0.x, r1.x<br />
sge r9, r0.x, r1.x<br />
mul r8, r8, r3 ; "if" portion<br />
mul r9, r9, r4 ; "else" portion<br />
add r8, r8, r9 ; combined results<br />
add r2, r2, r8<br />

Granted, this adds several instructions to the flow and is less easy to understand,<br />

but the performance gain can be significant because no branches are generated.<br />

Use Predicates for a Few Instructions when<br />

Masking Won’t Work<br />

In cases where you have a small number of instructions in the if and else parts<br />

of a conditional but where masking as described in the previous guideline won’t work (or<br />

at least not as easily), you can use predicates to achieve the same results. Here’s<br />

an example, similar to the previous one, that uses predicates to avoid branching:<br />

Before:<br />

if_lt r0.x, r1.x<br />
mul r2, r1.x, c8<br />
add r2, r2, r0.x<br />
else<br />
mul r2, r0.x, c9<br />
add r2, r2, r1.x<br />
endif<br />

After:<br />

setp_lt p0.x, r0.x, r1.x<br />
(p0) mul r2, r1.x, c8<br />
(p0) add r2, r2, r0.x<br />
(!p0) mul r2, r0.x, c9<br />
(!p0) add r2, r2, r1.x<br />


The benefit of using predication is that the code is still fairly readable and the<br />

compiler can generate code that is still branch free to obtain the highest performance<br />

possible.



Use the Break Instructions to Early-exit from Loops<br />

The old saying “the fastest code is the code that isn’t executed” means that if you<br />

can avoid doing some computation, do so. <strong>With</strong> the break_xx instructions provided<br />

in the vs.2.x and vs.3.0 shader models, you can check for conditions being<br />

met and exit out of a loop that won’t be doing any useful computation for the<br />

remainder of its iterations. The Mandelbrot sample included on the companion<br />

CD illustrates this quite nicely. As mentioned in the first tip, the speedup that<br />

resulted was very significant.<br />

Use Conditionals to Early-exit from Rarely Used Code<br />

Use of any of the transcendental instructions (log, exp, pow) in vertex shader code<br />

causes the compiler to generate a call to an optimized routine. However, the performance<br />

impact can be significant if a large number of these are used. In some<br />

cases, it’s possible to do a dynamic branch based on comparison of values and<br />

avoid having to do the expensive computation. One example is when generating<br />

specular highlights in lighting code; by checking to see if the highlight color is<br />

black or if the specular power is very low, you can branch over the specular lighting<br />

calculation and avoid the call to the pow function. Whenever possible, use conditionals<br />

and branches (not masking or predication) to avoid expensive operations<br />

like log, exp, and pow.<br />

Try to Arrange Conditional Data Based on Expected Behavior<br />

When you do have to use conditionals with branching, if possible you can help the<br />
compiler achieve better performance by rearranging your data based on expected<br />
true/false behavior of the conditionals. A rudimentary example is a conditional<br />
that did something different based on whether a face was front facing or back facing.<br />
If you make sure to group your vertices based on spatial locality of the faces,<br />
you get better clustering of vertices that are all on front-facing triangles and vertices<br />
that are all on back-facing triangles. The processor’s branch prediction is<br />
more accurate in these cases and the performance of your shader is higher.<br />

Profile!<br />

The best way to get great performance out of any piece of code, whether C++,<br />

assembly, or, in this case, vertex shader code, is to profile repeatedly with optimization<br />

in between. Other than watching frame rates as you make tweaks to the<br />

shader assembly, there hasn’t been much that you could do to profile your vertex<br />

shaders, since the tools available are rather minimal. Now, Intel VTune Performance<br />

Analyzer 7.0 has a feature that makes profiling and optimizing software<br />

vertex shaders extremely simple. A trial version is available at http://www.<br />

intel.com/software/products/vtune/vpa/eval.htm. To get a trial license, visit<br />

http://www.intel.com/software/products/distributors/shader_x.htm.<br />

If you write a test case that uses software vertex processing to process a<br />

batch of triangles and it does so in a repeatable way, when you profile the



application using VTune Analyzer 7.0, you’ll see a large spike in the module<br />

SW<strong>Shader</strong>s.exe.jit where SW<strong>Shader</strong>s will be replaced with the name of your<br />

application (see Figure 5).<br />

Figure 5: VTune Analyzer 7.0<br />


Double-clicking on this module will list a few routines that are arbitrarily named,<br />

based on the order in which your vertex shaders and declarations were declared.<br />

If you have multiple vertex shaders, you’ll see multiple routines, but with just one<br />

vertex shader, you should see a routine that handles the vertex declaration mapping<br />

and another that implements the vertex shader itself (see Figure 6).<br />

Figure 6: Hotspots in vertex shader code



Double-clicking on the vertex shader module brings up a listing of the vertex<br />

shader code with clockticks indicating which instructions consumed the most<br />

CPU time, as seen in Figure 7. In this example, we can see that the dp3 instruction<br />

contributed to the biggest portion of time in this shader.<br />

Figure 7: Vertex shader profile<br />

One thing to be aware of when using the VTune Analyzer to profile software vertex<br />

shaders is that sampling must be enabled when the vertex shader declarations<br />

and shaders are created. Two common scenarios that would prevent this and leave<br />

you scratching your head trying to figure out why the biggest spike is in a module<br />

called Other32 are:<br />

• Using a start delay that causes the VTune Analyzer to miss the creation of<br />
the shaders. Start delays are often used to give your application time to<br />
launch and load data so you don’t end up profiling all the startup operations.<br />

• Using the VTPause() and VTResume() APIs to only enable profiling of a specific<br />
portion of your application.<br />

If you find that your sampling is primarily showing up in a module called Other32<br />

(which you can’t drill down into), then one of these situations is probably arising.<br />

The best recommended fix is to use the VTPause() and VTResume() APIs and<br />

make sure that you’ve enabled sampling around any creation of vertex shaders<br />

and declarations.<br />
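As a small sketch of that recommendation (the prototypes are assumed from the API names above; in a real project they come from the VTune API header):<br />

#include <d3d9.h><br />

// Assumed prototypes for the VTune sampling controls mentioned above.<br />
extern "C" void VTResume(void);<br />
extern "C" void VTPause(void);<br />

void CreateShadersWithSamplingEnabled(IDirect3DDevice9 *device)<br />
{<br />
    VTResume();    // make sure sampling is active while shaders are created<br />

    // ... CreateVertexDeclaration() / CreateVertexShader() calls go here ...<br />

    VTPause();     // suspend sampling again until the section you want to profile<br />
}<br />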

Sample <strong>Shader</strong><br />

The sample program on the companion CD, built on the <strong>DirectX</strong> 9 framework, can<br />

be used to see the relative performance of two shaders. The program uses a vertex<br />

shader to calculate values of the Mandelbrot set. In doing so, it illustrates how


the features available in later shader models can significantly improve the<br />
performance of software shaders in some cases. Figure 8 shows a screen capture<br />
from the sample shader.<br />

Figure 8: Mandelbrot sample shader<br />

It’s interesting to note that as you zoom in on the Mandelbrot set using the dialog<br />
controls in the sample, it’s possible to get to regions that are very noisy in terms<br />
of values staying close to the set and values escaping from the set. In those<br />
situations, the branching (in the form of a break instruction) in the optimized version<br />
can actually cause the performance to degrade. This is one reason why it’s<br />
very important to profile your shader across as broad a sampling of uses as possible.<br />
Of course, if your shader doesn’t have any branching, then the performance<br />
will be independent of the workload.<br />

Conclusion<br />

The software vertex processing support in <strong>DirectX</strong> 9 provides excellent performance<br />

to begin shipping game titles that incorporate vertex shading for doing<br />

both graphics and non-graphics-related computation. It provides a flexible and<br />

straightforward way to produce vectorized code that can take full advantage of the<br />

instruction sets and technology features of the latest processors to reach consumers.<br />

The guidelines presented here can help you ensure that your shaders will<br />

perform optimally, free the processor to do other game-related calculations, and<br />

enable your titles to stand out from the crowd.<br />

Acknowledgments<br />


I’d like to acknowledge a few individuals who made this article possible. First and<br />

foremost is Ronen Zohar at Intel. Ronen has worked closely with Microsoft in<br />

optimizing the Intel implementation of the processor-specific graphics pipeline in<br />

<strong>DirectX</strong>. Ronen also assembled the list of guidelines described in this article and<br />

wrote the original version of the Mandelbrot shader. Additional thanks go to Will<br />

Damon and Kim Pallister at Intel who provided source material, feedback, and<br />

suggestions that helped improve this article.


x86 <strong>Shader</strong>s–ps_2_0 <strong>Shader</strong>s in<br />

Software<br />

Nicolas Capens<br />

Introduction<br />

Programmable graphics existed long before hardware acceleration. Unfortunately,<br />

the processors used back then were too slow for performing even the simplest<br />

shading tasks in real time. Nevertheless, software shaders are still more flexible<br />

than hardware-accelerated shaders. In this article we investigate the possibility of<br />

real-time software shading with modern processors like the Pentium III and 4.<br />

First, we see a few reasons why software rendering is still useful. Then we try to<br />

find out what makes the straightforward emulation so slow and what can be done<br />

about it. After that we discuss how to implement an efficient software renderer<br />

and <strong>DirectX</strong> pixel shader emulator. Finally, we solve a few practical problems<br />

encountered in the rendering pipeline.<br />

x86 <strong>Shader</strong>s<br />


If we could make software shaders reasonably efficient, they would have a good<br />

chance of surviving the battle against hardware-accelerated shaders because they<br />

have their own benefits. Although hardware-accelerated shader support is becoming<br />

more common for the professional industry and competitive gamers, it is not<br />

an obvious feature for low-budget, portable, or office computers. But even in<br />

these cases where top performance is not a requirement, the unlimited programmability<br />

of software shaders brings interesting new possibilities. Other benefits<br />

are getting exactly the same result on pretty much all systems without unsolvable<br />

driver issues or hardware limitations. It frees us from the complexity of handling<br />

graphics card capabilities, so we can focus on the actual application. Software rendering<br />

can also work together with hardware rendering to combine their advantages<br />

and implement the features not supported by the hardware as a fallback.<br />

Pentium III-compatible processors are much more widespread than graphics<br />

acceleration hardware with pixel shader support. Also remember that people who<br />

don’t regularly play games commonly spend more money on a processor than on a<br />

good graphics card. Schools and universities mostly use computers without 3D<br />

graphics cards but often require 3D animations and simulations. Because software



rendering has no limitations, it can be ahead of hardware to develop new<br />

technologies.<br />

Current <strong>DirectX</strong> 9 support for pixel shaders is only useful as a reference<br />

because it takes several seconds per frame, even at low resolutions. There are a<br />

few reasons why the current shader emulation implementation is this slow. Writing<br />

an efficient software renderer for one specific task isn’t very hard with the<br />

help of assembly optimizations, but as soon as we need multiple shaders, it is too<br />

much work to write different code for them and optimize everything manually. So<br />

we need to make it more generic by using lots of control statements to select different<br />

operations. On modern processors with deep pipelines, these control statements<br />

drastically reduce performance. If the processor cannot predict the result<br />

of a comparison correctly, it starts executing code from a wrong branch before it<br />

realizes this error. In this case, it has to stall for many cycles and has to start filling<br />

the pipelines again. Even when the comparisons are perfectly predictable, it<br />

still takes at least a compare and a jump per shader instruction and per render<br />

state. Also, an important performance bottleneck of modern processors is the<br />

memory throughput for the code, so it is necessary to make the inner loop small<br />

by avoiding redundant instructions — not to mention the cache incoherency we<br />

create by jumping.<br />

So what we need is a way to eliminate these control statements. This can be<br />

done by selecting only those blocks of code that perform the wanted operation.<br />

This is just like conditional compilation, only at run time. In other words, instead<br />

of writing optimized routines manually, we need to generate them automatically.<br />

We can immediately make a parallel with hardware-accelerated shaders here,<br />

since these are also compiled at run time. Compilation of <strong>DirectX</strong> shaders is quite<br />

complex and generates much slower code than manually written assembly code,<br />

so what we need for maximum performance is a run-time assembler for our main<br />

processor’s x86 assembly language. Instead of writing our shaders in the <strong>DirectX</strong><br />

shader assembly language, we now write x86 shaders.<br />

Let’s see if this is feasible. Although 3D graphics cards have tremendous parallel<br />

processing power, the main processor is still clocked almost ten times higher.<br />

A quick calculation shows us that on a Pentium 4 running at 2.4 GHz, we could<br />

have 100 clock cycles available per pixel for a resolution of 800x600 and a target<br />

FPS of 50. So to get the most out of it, we need to use instructions that perform<br />

complex tasks in little time. The MMX and SSE instruction sets are especially<br />

very interesting for software rendering. The MMX instruction set is specialized<br />

in doing operations on four 16-bit integers in parallel, which is ideal for high-precision<br />

color operations. The SSE instructions operate on four single-precision floating-point<br />

numbers in parallel, so this is ideal for processing vertex components.<br />

Not only can the programmable shaders be run-time assembled, but also the<br />

fixed-function pipeline can be optimized this way. Some pixel shader instructions<br />

also depend on the current render state. <strong>With</strong> hardware, this isn’t a problem,<br />

since one tiny transistor can decide what part of the silicon has to be used for the<br />

current operation. Here the compare will also be evaluated for all pixels, but this<br />

takes a negligible amount of time. In software, we again need conditional compilation.<br />

This way, only exactly those instructions needed for the current render state



are assembled. We don’t even need the transistor that the hardware needed.<br />

Instead of hard-wired logic, we now have soft-wired logic!<br />

SoftWire is also the name of the assembler that we use throughout this article.<br />

Unlike many overcomplicated commercial assemblers that are aimed at generating<br />

executables and libraries from many source files, SoftWire is specialized at<br />

generating functions that can be called at run time. It is written in C++ and available<br />

as an open-source project under the LGPL license [1]. The object-oriented<br />

interface was designed especially with simplicity for run-time code generation in<br />

mind. We specify a file to assemble, along with the external data and conditional<br />

parameters, and it returns the corresponding block of machine code that can be<br />

called directly.<br />

Let’s try an easy example of an x86 shader assembled with SoftWire to get<br />

started. We subtract 1 from all color components, so we get a fade-to-black effect.<br />

Of course, this can also be done with graphics cards without shader support, but<br />

this example is only for illustrating the very basics. Suppose we already have the<br />

rest of our rendering pipeline and we want our inner pixel loop to be run-time<br />

assembled as a function. This isn’t optimal yet, but the x86 code could look like<br />

this:<br />

p0001h: // Label to reference static data<br />

DW 0001h // Used for subtracting 1 from every<br />

DW 0001h // 16-bit color component<br />

DW 0001h<br />

DW 0001h<br />

Fade:<br />

mov esi, [displayPointer]<br />

punpcklbw mm0, [esi] // Load pixel 32-bit -> 64-bit<br />

psubusw mm0, [p0001h] // Subtraction with saturation<br />

packuswb mm0, mm0 // Pack pixel 64-bit -> 32-bit<br />

movd [esi], mm0 // Write pixel<br />

emms<br />

ret<br />

Let’s store this in a Fade.asm file. Now all we need to do is tell SoftWire where<br />

the displayPointer external variable is stored by passing its address, and then we<br />

can assemble this code. The basic interface of the SoftWire::Assembler class<br />

looks like this:<br />

Assembler(const char *fileName);<br />

const void *callable(const char *entryPoint = 0);<br />

static void defineExternal(void *pointer, const char *name);<br />

static void defineSymbol(int value, const char *name);<br />

To define the displayPointer external variable, we need to use the<br />

defineExternal method, like this:<br />

defineExternal(&displayPointer, "displayPointer");<br />


There is a handy macro to make it a little easier to do this:<br />

ASM EXPORT(displayPointer);<br />

Because the defineExternal method is static, this works even when no Assembler<br />

has been constructed yet. Now that we have defined the externals, we are<br />

ready to assemble the file:<br />

Assembler x86("Fade.asm"); // Construct an assembler called x86<br />

At this point, the file has been translated to machine code, but it is not loaded into<br />

a block of memory, so it is not yet ready to be called. To link and load the code, we<br />

need to do the following:<br />

void (*x86<strong>Shader</strong>)() = (void(*)())x86.callable("Fade");<br />

We can now call x86<strong>Shader</strong>() in our inner pixel loop, and it will start executing<br />

from the Fade label. Pretty boring stuff for now, so let’s start playing with the conditional<br />

compilation a bit. Fading to black can be useful, but fading to white is<br />

often used too. To do this, we only need a few extra lines:<br />

Fade:<br />

mov esi, [displayPointer]<br />

punpcklbw mm0, [esi] // Load pixel 32-bit -> 64-bit<br />

#if fadeBlack<br />

psubusw mm0, [p0001h] // Subtraction with saturation<br />

#else<br />

paddusw mm0, [p0001h] // Addition with saturation<br />

#endif<br />

packuswb mm0, mm0 // Pack pixel 64-bit -> 32-bit<br />

movd [esi], mm0 // Write pixel<br />

Depending on the value of fadeBlack, this code now fades the screen to black or<br />

white. All we have to do is tell the assembler the value of fadeBlack, and we can<br />

assemble the file again. This is done with the defineSymbol method:<br />

defineSymbol(fadeState(), "fadeBlack");<br />

or a handy macro:<br />


bool fadeBlack = fadeState();<br />

ASM DEFINE(fadeBlack);<br />


That’s it! These are the most important features that SoftWire is capable of for<br />

using x86 shaders. It all seems limited, but that’s only because we’ve used the<br />

simplest example. The complete x86 instruction set is supported with all 32-bit<br />

addressing modes, and the conditional compilation can be made arbitrarily complex.<br />

Also, C++ functions can be called from within the assembly code — again<br />

by exporting it. It is also possible to jump to or call internal labels. For more information,<br />

please refer to the SoftWire documentation [1].<br />

But why use SoftWire if we could just use inline x86 assembly and let our<br />

C++ compiler do the rest? SoftWire has one important extra advantage over<br />

inline assembly — namely the conditional compilation. Suppose the fixed-function<br />

vertex pipeline has more than 1000 different combinations of states and we would



like to be able to do them all in software. <strong>With</strong>out using soft-wiring technology, we<br />

would have to write 1000 files and wait until they are all compiled before we can<br />

test them. But even if this could be done with things like macros and templates<br />

and we could put them in one huge library, it would still not be very useful. If we<br />

wanted to add just one extra feature, our library would have to double in size and<br />

it would take twice as long to compile. So this problem grows exponentially, and<br />

using a run-time assembler is the only option left as soon as we need more than a<br />

few render states.<br />

Assembling takes time though. SoftWire was not designed for fast compilation.<br />

In practice, the average x86 shader takes almost a millisecond. If we had to<br />

assemble every shader for every frame and for every render state, it would disturb<br />

the rendering smoothness. Therefore, it is important that we use some sort<br />

of caching system to minimize the number of shaders being reassembled. In the<br />

future, SoftWire might feature a system for fast relinking when only a few constants<br />

are changed. This is possible because many of the instructions being<br />

assembled remain the same.<br />
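A minimal sketch of such a cache (not taken from the SoftWire distribution), assuming the Assembler object owns the generated machine code and therefore has to be kept alive, and using a placeholder render-state key:<br />

#include <map><br />
// plus the SoftWire header that declares SoftWire::Assembler<br />

typedef void (*PixelRoutine)();<br />

struct Cached<strong>Shader</strong><br />
{<br />
    SoftWire::Assembler *assembler;   // kept alive; assumed to own the machine code<br />
    PixelRoutine routine;<br />
};<br />

std::map<unsigned, Cached<strong>Shader</strong>> shaderCache;     // key encodes the render state<br />

PixelRoutine getFade<strong>Shader</strong>(bool fadeBlack)<br />
{<br />
    unsigned key = fadeBlack ? 1 : 0;             // placeholder render-state key<br />

    std::map<unsigned, Cached<strong>Shader</strong>>::iterator it = shaderCache.find(key);<br />
    if (it != shaderCache.end())<br />
        return it->second.routine;                // reuse previously assembled code<br />

    SoftWire::Assembler::defineSymbol(fadeBlack, "fadeBlack");<br />

    Cached<strong>Shader</strong> entry;<br />
    entry.assembler = new SoftWire::Assembler("Fade.asm");   // assemble only on a miss<br />
    entry.routine = (PixelRoutine)entry.assembler->callable("Fade");<br />

    shaderCache[key] = entry;<br />
    return entry.routine;<br />
}<br />

The same lookup can first try the specialized rasterizers discussed next before falling back to the general-purpose code.<br />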

In many situations, we also need a rasterizer optimized for only one purpose. While conditional compilation allows supporting any render state, it doesn't give us optimal results, especially for simple operations. For example, deferred rendering techniques draw the scene in a first pass only to the depth buffer to obtain visibility information. In this case, a hand-optimized rasterizer could be faster because we can use all registers for just one task and even process a few pixels in parallel. Another example is the rasterization of small polygons. For geometry with a high polygon count, scanlines that are only one or two pixels long occur a lot. These scanlines don't need the complex setup of longer scanlines, and a few approximations can be used that don't degrade quality. Affine texture mapping and per-polygon mipmapping are examples of approximations that can significantly boost performance. This can be combined with the caching technique: when we need a certain rasterizer for the first time, we first check whether any of the specialized rasterizers supports the current render mode, and if this fails, we use the general rasterizer. This doesn't mean that the specialized rasterizers can't use conditional compilation; for example, the specialized depth buffer rasterizer could use conditional compilation for the different depth compare modes. If a rasterizer is needed again, we first check whether there is a compatible rasterizer in the cache. Since we search for rasterizers in order from most specialized to most general, we know that the first compatible rasterizer we find will also be the most efficient.
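A minimal sketch of that lookup order (the types and the supports predicate are illustrative, not taken from an actual renderer):

    struct RenderMode;   // the render states relevant to rasterization

    class Rasterizer
    {
    public:
        virtual ~Rasterizer() {}
        virtual bool supports(const RenderMode &mode) const = 0;   // can this rasterizer handle the mode?
        virtual void rasterize(const RenderMode &mode) = 0;
    };

    // 'rasterizers' is ordered from most specialized (e.g., depth-only) to most
    // general, so the first compatible entry is also the most efficient one.
    Rasterizer *selectRasterizer(Rasterizer *rasterizers[], int count, const RenderMode &mode)
    {
        for(int i = 0; i < count; i++)
        {
            if(rasterizers[i]->supports(mode))
                return rasterizers[i];
        }
        return 0;   // unreachable if the last entry is the fully general rasterizer
    }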

x86 shaders probably sound great in theory, but what about the results? Well, SoftWire was designed especially for a software-only renderer. It is still a work in progress, but it already reaches 20 FPS in 640x480x32 display mode on a Celeron running at 1200 MHz for Quake III [2] scenes with all features turned on. This includes bilinear filtering, exact mipmapping, lightmapping, per-pixel perspective correction, dynamic lighting, fog, and more. On a Pentium 4 running at 2400 MHz, we get 22 FPS. This is a disappointingly small improvement over the Celeron and is caused by the Pentium 4's much higher latency for SIMD instructions. On a Pentium III running at 450 MHz, we get 9 FPS, which shows that the Celeron suffers a bit from its smaller cache and slower front side bus. We can conclude that performance can be considered real time on a 1 GHz Pentium III or a 2 GHz Pentium 4. Note that Quake III was not designed for software rendering, so with some specific optimizations aimed at preventing overdraw, it could be much faster. For more recent information and screen shots, please visit the SoftWire web site [1].

A great advantage of SoftWire over hardware-accelerated shaders is that we have access to the complete x86 instruction set and all memory! We can make our shader code as long and as complex as we like, we can do operations that are not available in current hardware shader languages, and we have full control over memory. Want to implement your own filtering method? Need an extra memory buffer for storing certain parameters? Want to render voxels instead of polygons? Want to do ray tracing or call a recursive function for photon mapping? None of this is a problem when using SoftWire. It is also open source, so adding specific features is easy. The conditional compilation can also help make the code processor independent. When SSE support is not detected, we can fall back to regular FPU instructions or use AMD's 3DNow! instructions. Note, however, that MMX uses the same registers as the FPU, and a slow emms instruction is needed when switching from MMX to FPU mode. For AMD processors, a faster femms instruction can be used, but 3DNow! again shares the FPU registers. Because of this, and because SSE operates on four instead of two floating-point numbers, SSE is far superior. Luckily, the newest generation of AMD's Athlon processors also supports SSE. The SSE instruction set also has some important instructions that complement the MMX integer instructions, so it is a must for efficient rendering on an x86 processor.
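With run-time intrinsics, such a fallback is just ordinary C++ control flow around the code generation. In the sketch below, cpuSupportsSSE is assumed to be filled in by a CPUID check at startup, and the operand names are illustrative:

    // The generated shader behaves identically either way; only the emitted
    // instructions differ, and the choice is made once, at assembly time.
    void SwShader::encodeAdd4()
    {
        if(cpuSupportsSSE)
        {
            movaps(xmm0, xmmword_ptr [src0]);
            addps(xmm0, xmmword_ptr [src1]);
            movaps(xmmword_ptr [dst], xmm0);
        }
        else
        {
            // Emit MMX/3DNow! or plain FPU instructions here instead, and do not
            // forget the (f)emms instruction when the code mixes MMX and FPU.
        }
    }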

But there are also disadvantages to using x86 shaders. We cannot expect to reach the same performance as dedicated hardware; although it has a higher clock frequency, the main processor's power lies primarily in its versatility, not in raw number crunching. The conditional compilation also has a disadvantage compared to writing everything manually: it is hard to do instruction scheduling, since we cannot know in advance which instructions will be assembled. But this slight performance loss is nothing compared to the flexibility that it brings. For people who are not familiar with x86 code, and especially the SIMD instruction sets, SoftWire might not seem very interesting. Why not immediately compile the DirectX shader instructions to x86 assembly? Although we see how to do this later, there are two main reasons why using x86 assembly is ideal when aiming for maximum performance. First of all, to use software shaders, we also need a fixed-function geometry pipeline and scan converter, and to make these efficient, they need to be written in x86 assembly anyway. Secondly, even if we already had these and still wanted to write the shaders in the DirectX shader languages, generated code can never be as optimal as manually written code. So it is best to simply regard x86 assembly as another shader assembly language. But never fear; SoftWire can be used for much more than pure x86 shaders.

This brings us to another powerful feature of SoftWire that we haven't touched on yet. SoftWire supports a form of macros that is ideal for abstracting the x86 code into more powerful "instructions" like those of DirectX shaders. They are much like C++ inline functions, and they use a similar syntax:

    inline mad(dest, src0, src1, src2)
    {
        movq dest, src0
        psrlw dest, 1
        pmullw dest, src1
        psrlw dest, 7
        paddw dest, src2
    }

...which can be invoked like this:

    mad mm0, mm1, mm2, mm3

Assuming the MMX registers hold the data in signed 8.8 fixed-point format, this "instruction" is compatible with the ps 1.3 mad instruction. It is even possible to use the ps 1.3 register names with some macros. The r0 and r1 registers can be mapped onto mm0 and mm1, t0 to t3 can be mapped onto mm2 to mm5, and the read-only registers can be held in memory. This leaves mm6 and mm7 available for temporary results in the macros. This way, ps 1.3 can be implemented efficiently with a very similar syntax.
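As another sketch in the same style (not code from the article), a ps 1.3 mul on the same signed 8.8 fixed-point data could be wrapped identically, simply dropping the final addition:

    inline mul(dest, src0, src1)
    {
        movq dest, src0
        psrlw dest, 1
        pmullw dest, src1
        psrlw dest, 7
    }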

ps_2_0 Shaders

So what about the more advanced pixel shaders from DirectX 9, like ps 2.0? These shaders use floating-point numbers everywhere, so MMX is of little use for the pixel operations. Unfortunately, SSE instructions are also at least twice as slow as MMX instructions, so SSE is less suited for real time. On the other hand, MMX needed many shift instructions for fixed-point arithmetic, which can now be eliminated, and all MMX registers become available for things like texture filtering. But the number of registers is still a limiting factor. The ps 2.0 standard specifies no less than 12 temporary registers, which is more than what SSE offers, so we can't simply map DirectX 9 registers onto x86 registers like we could with ps 1.3. These extra problems make it too slow for games on the current generation of processors. But with soft-wiring techniques and SIMD instructions, it is still much faster than the reference rasterizer. For things like CAD, the frame rate only has to be interactive, and the viewport is usually smaller than the screen resolution.

Let's first focus on how to actually compile ps 2.0 shaders into x86 assembly. Because we need to use SSE and MMX, we need a convenient way to generate many different instructions. As a first attempt, we could create a string with the instruction and its operands and let SoftWire assemble it, but this is hopelessly cumbersome. We would need to write many functions to generate code for just a few instructions. It is also a huge detour: we first select an instruction and its operands, write that to a string, let SoftWire parse this string and check its syntax and semantics, and only then can we generate the machine code.


Luckily, there's a shortcut called run-time intrinsics. In SoftWire, run-time intrinsics are a set of member functions of the Assembler with the same names as the x86 instructions, which generate the corresponding machine code at run time and put it into the assembler's internal buffer. So we can simply construct a new Assembler instance without specifying a file in the constructor and start generating code by writing the assembly instructions as intrinsics. Besides being very easy to use, they are also safer than assembling from a file because the syntax is checked almost completely at compile time. Also, because it is all still written in C++, things like passing pointers to external data and conditional compilation become trivial.

To become familiar with the use of run-time intrinsics, let's write something more useful than the fade code. Suppose we want to compute the dot product of two 3D vectors r0 and r1 and store the result in r2. Also suppose that we have a class SwShader derived publicly from SoftWire::Assembler:

    void SwShader::encodeDP3()
    {
        movaps(xmm0, xmmword_ptr [r0]);
        movaps(xmm1, xmmword_ptr [r1]);
        mulps(xmm0, xmm1);
        movhlps(xmm1, xmm0);
        addss(xmm1, xmm0);
        shufps(xmm0, xmm0, 0x01);
        addss(xmm0, xmm1);
        movss(dword_ptr [r2], xmm0);
    }

This clearly shows that the C++ intrinsics syntax closely resembles the usual assembly syntax, so it is easy to convert existing code. In the above example, we used fixed source and destination "registers" stored in memory. Of course, we can just as easily parse the instruction and its operands from a ps 2.0 shader file and encode the instruction with the corresponding registers.

For the parsing job, many tools for generating a parser from a grammar are available. Popular scanner and parser generators like Flex and Bison are slightly too sophisticated and do not produce C++ code. A simpler parser generator better suited to this task is CppCC by Alec Panovici [3]. It produces understandable C++ code, and it is very easy to derive the SwShader class used above from the parser class. This allows parsing the file and directly storing the data as a list of ps 2.0 intermediate code instructions in the SwShader class.

Notice that in the example only two SSE registers are used and three memory operations are needed. A much better situation would be to use all available registers to eliminate most of the loading and storing, but as we discussed before, we can't directly map ps 2.0 registers to SSE registers. The best compromise is to map as many ps 2.0 registers as possible to SSE registers and keep the rest in memory. This is known as virtual register allocation. Optimal register allocation cannot be solved in polynomial time, so a few heuristics are available. The most straightforward method is to just assign SSE registers to the most frequently used ps 2.0 registers. Unfortunately, this means that whenever we use any other ps 2.0 register, an SSE register needs to be freed by writing it back to memory. This is called spilling, and it adds lots of extra instructions with slow memory accesses. Most compilers use a graph coloring heuristic. Although near-optimal in the amount of spill code needed, it is also quite expensive to compute. The most popular solution for run-time compilation is linear-scan register allocation [4]. It is quite straightforward: it tracks when registers need to be allocated and when they become free for other variables, which avoids much of the spill code. When a spill is necessary, a simple heuristic decides which register is needed the least. It is very fast and does not produce too much spill code.
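To make the idea concrete, here is a heavily simplified sketch of linear-scan allocation over precomputed live intervals; the data structures are hypothetical, and details such as fixed registers, live-range holes, and the ordering of the active list are ignored:

    #include <algorithm>
    #include <vector>

    struct Interval
    {
        int start, end;   // first and last instruction where the ps 2.0 register is live
        int reg;          // assigned SSE register, or -1 if kept in memory (spilled)
    };

    static const int NUM_SSE_REGS = 8;

    // Assigns at most eight SSE registers to the intervals; the rest are spilled.
    void linearScan(std::vector<Interval> &intervals)
    {
        std::sort(intervals.begin(), intervals.end(),
                  [](const Interval &a, const Interval &b) { return a.start < b.start; });

        std::vector<Interval*> active;           // intervals currently holding a register
        bool used[NUM_SSE_REGS] = {};

        for(size_t i = 0; i < intervals.size(); i++)
        {
            Interval &cur = intervals[i];

            // Expire intervals that ended before this one starts.
            for(size_t j = 0; j < active.size(); )
            {
                if(active[j]->end < cur.start) { used[active[j]->reg] = false; active.erase(active.begin() + j); }
                else j++;
            }

            if(active.size() == size_t(NUM_SSE_REGS))
            {
                // All registers busy: spill the interval that ends last (simple heuristic).
                Interval *victim = *std::max_element(active.begin(), active.end(),
                    [](Interval *a, Interval *b) { return a->end < b->end; });
                if(victim->end > cur.end)
                {
                    cur.reg = victim->reg;       // steal the register from the victim
                    victim->reg = -1;
                    *std::find(active.begin(), active.end(), victim) = &cur;
                }
                else
                    cur.reg = -1;                // keep the current value in memory
            }
            else
            {
                for(int r = 0; r < NUM_SSE_REGS; r++)
                    if(!used[r]) { cur.reg = r; used[r] = true; active.push_back(&cur); break; }
            }
        }
    }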

Let's see how we can integrate linear-scan register allocation into SoftWire's run-time intrinsics. Since it is all written in C++, we can write a function xmmreg, which takes a ps 2.0 register as an argument and returns the SSE register to which it is mapped. This can then be used as the argument for the run-time intrinsic. If the ps 2.0 register is not already mapped to an SSE register, the xmmreg function adds spilling code if necessary. Similarly, we can write a function r_m128, which returns either a register or a memory reference. This needs some illustration. Suppose the destination ps 2.0 register is stored in dst, the source registers in src0 and src1, and tmp0 and tmp1 are temporary registers; then the code would look like this:

    void SwShader::encodeDP3(Operand &dst, Operand &src0, Operand &src1)
    {
        movaps(xmmreg(tmp0), r_m128(src0));
        mulps(xmmreg(tmp0), r_m128(src1));
        movhlps(xmmreg(tmp1), xmmreg(tmp0));
        addss(xmmreg(tmp1), r_m32(tmp0));
        shufps(xmmreg(tmp0), r_m128(tmp0), 0x01);
        addss(xmmreg(tmp0), r_m32(tmp1));
        movss(xmmreg(dst), r_m32(tmp0));
    }

We have now solved two problems: we have made the source and destination registers variable, and if enough SSE registers are available, we have eliminated many memory operations. But it's still not optimal. The first instruction is only needed to preserve the source register; if the destination register is equal to one of the source registers, we can eliminate some instructions. Here we can use run-time conditional compilation again. Thanks to run-time intrinsics, this can simply be written in C++. Let's try this for the abs instruction:

    void SwShader::encodeABS()
    {
        if(dst != src0)
        {
            movaps(xmmreg(dst), xmmreg(src0));
        }

        // Constant that clears the sign bit of all four floats. An integer element
        // type is used so that the literal 0x7FFFFFFF ends up in memory verbatim.
        typedef __declspec(align(16)) unsigned int dword4[4];
        static dword4 signMask = {0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF};

        andps(xmmreg(dst), signMask);
    }

Another important optimization is not to copy data from a register to itself. This can be done by overloading the mov* instructions and detecting when the operands are equal. The list of possible optimizations is almost endless, and with run-time intrinsics they can be implemented in a convenient manner.
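A sketch of that check (the wrapper and the operand type are illustrative; in practice comparing the mapped hardware register index is enough):

    // Emit a register-to-register movaps only when source and destination differ;
    // copying a register onto itself would be a wasted instruction.
    void SwShader::emitMovaps(const OperandXMMREG &dst, const OperandXMMREG &src)
    {
        if(dst != src)
        {
            movaps(dst, src);   // fall through to the normal run-time intrinsic
        }
    }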

Note that when a render state changes, we need to reassemble the shader, because the change can modify the behavior of texture lookup instructions and the like. Luckily, we don't need to parse the file again; only the code generation has to be redone. Since run-time intrinsics work directly with the SoftWire code, this takes little time. As discussed before, caching techniques can further eliminate unnecessary work.

For a proof of concept for compiling ps 2.0 shaders to x86 assembly using SoftWire and CppCC, take a look at the swShader project [5]. This work in progress shows the details of how to implement the above techniques. Preliminary tests on the Celeron 1200 show that a pixel fill rate of five million pixels per second can be reached for a long shader of 40 arithmetic instructions translated into 150 SSE instructions. Compared to hardware-accelerated pixel shaders, this is very little, but it is certainly a lot better than the reference rasterizer. At modest resolutions, it is sufficient for interactive rendering.

Rendering Pipeline Operations

This article is complemented by efficient solutions to a few practical problems that come up when implementing a software renderer. An efficient implementation of shader emulation using soft-wiring and SIMD instructions isn't worth anything if the rest of the pipeline is inefficient or inaccurate. Since we have already discussed the operations at the pixel level, let's build it up all the way to the primitive level, which is the input of the renderer. Keep in mind that with software rendering we are not limited in the way we implement something, but the following methods have proven their usefulness in practice.

Closest to the pixels are the frame buffer, the depth buffer, and the textures. The most limiting factor here is memory bandwidth. Take, for example, the Pentium 4 with a 400 MHz front side bus, which has a theoretical memory bandwidth limit of 3.2 GB/s. Suppose we render at a resolution of 800x600 with a 32-bit frame buffer and depth buffer. Per frame we might want to clear the frame buffer, clear the depth buffer, and then refill them both with the new frame. With an average overdraw of two and twice that number of depth buffer tests, this results in about 15 MB per frame. This seems small, but when we target 30 FPS, it takes one-seventh of the time available for one frame, and that's without all the other memory accesses, like texturing and fetching code. That's millions of clock cycles during which the processor can hardly do any useful arithmetic.
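As a rough check of those numbers (assuming 4 bytes per pixel for both buffers and counting each depth test as one 4-byte access):

$$800 \times 600 \times 4\ \text{bytes} \approx 1.92\ \text{MB per buffer}$$

$$\underbrace{2 \times 1.92}_{\text{clears}} \;+\; \underbrace{2 \times 1.92}_{\text{color writes, overdraw 2}} \;+\; \underbrace{4 \times 1.92}_{\text{depth tests}} \;\approx\; 15.4\ \text{MB per frame}$$

$$15.4\ \text{MB} \times 30\ \text{FPS} \approx 460\ \text{MB/s} \approx \tfrac{1}{7} \times 3.2\ \text{GB/s}$$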

We can't overcome this bandwidth limit, so it is important to minimize the number of memory accesses. In most situations, it is possible to eliminate clearing the frame buffer, since the whole scene is redrawn and fills the frame buffer anyway. Reducing overdraw can be done efficiently by rendering from front to back and using one bit per pixel to indicate whether it has already been filled. Unfortunately, sorting the polygons is not easy, so using a depth buffer is often the only option. Clearing the depth buffer can be eliminated by using positive depth values in even frames and negative values, with the inverted depth compare mode, in odd frames. With soft-wiring, we can change the compare mode instantly. So by sacrificing the sign bit of the depth buffer values, we can clear the buffer at almost no cost.
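One way to realize this is sketched below. It assumes a "reversed" depth value in which larger means closer (for example 1 - z), which is what makes the cross-frame comparisons come out right; the member names and compare-mode constants are illustrative:

    // Per-frame setup for the no-clear depth buffer trick described above.
    void Renderer::beginFrame()
    {
        evenFrame = !evenFrame;

        if(evenFrame)
        {
            depthSign = +1.0f;                // stored depth in (0, 1]
            depthCompare = COMPARE_GREATER;   // closer (larger) value wins
        }
        else
        {
            depthSign = -1.0f;                // stored depth in [-1, 0)
            depthCompare = COMPARE_LESS;      // closer (more negative) value wins
        }
        // Stale values from the previous frame have the opposite sign, so the first
        // pixel written this frame always passes against them, exactly as if the
        // buffer had been cleared. The sign and compare mode are baked into the
        // rasterizer by the conditional compilation, so there is no per-pixel cost.
    }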

Texture memory access is something completely different. Because the addresses are not consecutive, the data is often not ready in the cache, and the CPU has to read it from RAM. Not only does the memory bus have a maximum throughput, the memory itself also needs time before the data becomes available. With an access time of 6 ns, this means roughly a dozen clock cycles spent waiting for a few bytes, and a lot more for actually finishing the move instruction. For this reason, the SSE instruction set also features prefetch instructions. They load the requested data into the cache without blocking other operations. When they are issued a couple of instructions before the data is used, no memory stall occurs because the data is already available in the cache.
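In the generated texture-sampling code the pattern looks roughly like this; the operand helpers are assumed by analogy with the xmmword_ptr form used earlier, and the address register is only an example:

    prefetchnta(byte_ptr [esi]);    // start pulling the texel's cache line in early
    // ...a few instructions of useful work: address arithmetic, filter setup...
    movd(mm0, dword_ptr [esi]);     // by now the texel should already be in the cache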

When a triangle is rasterized, it is broken up into horizontal lines of pixels called scanlines. The scanline setup is responsible for providing the input to the pixel shader. This is the ideal phase to convert some parameters from floating-point format to fixed-point integer format, since fixed-point values are faster to interpolate and it is quite slow to do conversions per pixel. For colors with integer components, it is important to use saturated interpolation to avoid artifacts at the edges. Run-time conditional compilation can be used to interpolate only those parameters needed by the pixel shader. More specifically for DirectX shaders, the dcl instructions determine which texture coordinates need to be interpolated.

The triangle setup stage is a lot more advanced. First of all, it is responsible for determining where each scanline starts and ends (called scan-conversion). An excellent reference for this is the "Perspective Texture Mapping" article by Chris Hecker [6]. Calculating gradients for interpolants like the depth value and the texture coordinates is another important task of the triangle setup. A common pitfall is to compute them per scanline. This is not necessary, since the slope of a triangle is equal everywhere. Another problem is precision. Take, for example, the horizontal slope of the z-value, $\partial z / \partial x$. Most implementations compute $\Delta z$ along the longest scanline and divide it by the scanline's length. This 2D method brings many problems with sub-pixel accuracy, and it also doesn't work well with thin polygons. A much more robust 3D solution is to use the plane equation:

$$Ax + By + Cz + Dw = 0$$

Now we want to know $\Delta z / \Delta x$, which is equal to $\partial z / \partial x$, since it is constant. It can be computed by differentiating:

$$A\,\partial x + B\,\partial y + C\,\partial z + D\,\partial w = 0$$

Keeping y and w constant, we get:

$$\frac{\partial z}{\partial x} = -\frac{A}{C}$$

For the gradient in the y direction, we get -B/C. Starting from the plane equation, we get all the other gradients in a similar way: we just substitute z with the parameter for which we want to compute the gradients and recalculate the plane equation with the cross product for this new parameter. A nice property is that C is independent of the parameter for which we calculate the gradients, so it only has to be computed once for all gradients. This is because C is twice the area of the triangle, which is of course constant. We can even compute -1/C once and multiply it by the different A and B values to avoid many divisions. D is never needed, so only A and B have to be computed for every interpolant. With SSE, this takes just a few clock cycles when computing four gradients in parallel.
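As a sketch (the function and parameter names are illustrative), the per-triangle gradient setup described above could look like this for one interpolant:

    // Gradients of an interpolant over screen space, from the plane through the
    // three screen-space vertices. Assumes a non-degenerate triangle (C != 0).
    void computeGradients(float x0, float y0, float x1, float y1, float x2, float y2,
                          float p0, float p1, float p2,     // interpolant at each vertex
                          float &dpdx, float &dpdy)
    {
        // Normal of the plane through (x, y, p): cross product of the two edges.
        float A = (y1 - y0) * (p2 - p0) - (p1 - p0) * (y2 - y0);
        float B = (p1 - p0) * (x2 - x0) - (x1 - x0) * (p2 - p0);
        float C = (x1 - x0) * (y2 - y0) - (y1 - y0) * (x2 - x0);   // twice the triangle area

        float invC = -1.0f / C;   // depends only on the positions: compute once per triangle
        dpdx = A * invC;          // -A / C
        dpdy = B * invC;          // -B / C
    }

Since invC depends only on the vertex positions, a real setup computes it once per triangle and reuses it for the A and B of every interpolant.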

This method of computing gradients is also an argument for rendering only triangles. We could be tempted to render polygons with more than three vertices, but this brings many problems. To compute the plane equation, we can use only three vertices, so the other vertices would need to lie in the same plane for the same gradients to be usable. This is easy to visualize for the z component: if not all vertices lie in the same plane, the polygon is not flat, and so the z gradients can't be constant. To avoid precision artifacts and to keep things simple, it is advisable to rasterize only triangles. With n-gons we always need a loop, while with triangles we avoid the loop setup and the stalls caused by mispredicted jumps.

Mipmapping requires computing the compression of the texture with respect to the screen coordinates. In general, the relationship between screen coordinates and homogeneous texture coordinates is given by:

$$u' = \frac{\partial u'}{\partial x}\,x + \frac{\partial u'}{\partial y}\,y + u_0' \qquad v' = \frac{\partial v'}{\partial x}\,x + \frac{\partial v'}{\partial y}\,y + v_0'$$

...where the apostrophe denotes that these texture coordinates are homogeneous, and $u_0'$ and $v_0'$ are the coordinates at the origin of the screen (x = 0, y = 0). The texture compression in the affine x direction is the length of the vector $(\partial u / \partial x,\ \partial v / \partial x)$:

$$m_x = \sqrt{\left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2}$$

To get the affine texture coordinates u and v, we need to divide the homogeneous coordinates by w:

$$m_x = \sqrt{\left(\frac{\partial}{\partial x}\!\left(\frac{u'}{w}\right)\right)^2 + \left(\frac{\partial}{\partial x}\!\left(\frac{v'}{w}\right)\right)^2}$$


Even though there are fast SSE instructions for computing the square root, and we already have 1/w for perspective correct texture mapping, it's not that simple. The gradients of the affine texture coordinates are not constants like the gradients of the homogeneous texture coordinates, so we have to find out how the compression changes as a function of the screen coordinates. First we use the quotient rule for derivatives:

$$\frac{\partial}{\partial x}\!\left(\frac{u'}{w}\right) = \frac{\dfrac{\partial u'}{\partial x}\,w - u'\,\dfrac{\partial w}{\partial x}}{w^2} \qquad \frac{\partial}{\partial x}\!\left(\frac{v'}{w}\right) = \frac{\dfrac{\partial v'}{\partial x}\,w - v'\,\dfrac{\partial w}{\partial x}}{w^2}$$

so that:

$$m_x = \frac{1}{w^2}\sqrt{\left(\frac{\partial u'}{\partial x}\,w - u'\,\frac{\partial w}{\partial x}\right)^2 + \left(\frac{\partial v'}{\partial x}\,w - v'\,\frac{\partial w}{\partial x}\right)^2}$$

We now have a function in which all gradients are constant and can be computed with the plane equation method. Next, we investigate how this compression value changes when we step from one pixel to the next, in the hope that we can simplify the formula. So let's write u', v', and w as functions of x and y and substitute them into the formula. To simplify notation, we drop the apostrophe on the homogeneous coordinates:

$$u = \frac{\partial u}{\partial x}\,x + \frac{\partial u}{\partial y}\,y + u_0 \qquad v = \frac{\partial v}{\partial x}\,x + \frac{\partial v}{\partial y}\,y + v_0 \qquad w = \frac{\partial w}{\partial x}\,x + \frac{\partial w}{\partial y}\,y + w_0$$

$$m_x = \frac{1}{w^2}\sqrt{\left(\frac{\partial u}{\partial x}\left(\frac{\partial w}{\partial y}\,y + w_0\right) - \frac{\partial w}{\partial x}\left(\frac{\partial u}{\partial y}\,y + u_0\right)\right)^2 + \left(\frac{\partial v}{\partial x}\left(\frac{\partial w}{\partial y}\,y + w_0\right) - \frac{\partial w}{\partial x}\left(\frac{\partial v}{\partial y}\,y + v_0\right)\right)^2}$$

Suddenly we lose the dependency on x under the square root. So although this formula looks horrible, only the w in front changes as we step along the scanline. In other words, we only need to compute the square root once per scanline; but there's more. The gradients and $u_0$, $v_0$, and $w_0$ are all constants, so we can collect them into new constants. If we give these new constants logical names, we can also make this long formula look a little nicer:

$$C_u = \frac{\partial u}{\partial x}\frac{\partial w}{\partial y} - \frac{\partial w}{\partial x}\frac{\partial u}{\partial y} \qquad C_v = \frac{\partial v}{\partial x}\frac{\partial w}{\partial y} - \frac{\partial w}{\partial x}\frac{\partial v}{\partial y}$$

$$U_x = \frac{\partial u}{\partial x}\,w_0 - \frac{\partial w}{\partial x}\,u_0 \qquad V_x = \frac{\partial v}{\partial x}\,w_0 - \frac{\partial w}{\partial x}\,v_0$$

$$m_x = \frac{1}{w^2}\sqrt{\left(C_u\,y + U_x\right)^2 + \left(C_v\,y + V_x\right)^2}$$

For the compression in the y direction, $m_y$, we have a similar formula, now with constants $U_y$ and $V_y$. The biggest compression factor determines which mipmap level we have to use:

$$m = \max\left(m_x,\ m_y\right)$$

We could evaluate this maximum at every pixel, but that is not necessary. We can approximate it by computing m only at the vertices and interpolating in between. This is also what many hardware implementations do, since the difference is hardly noticeable. By multiplying m by w² at the vertices and multiplying twice by 1/w at every pixel, this becomes simple linear interpolation. We already need 1/w for perspective texture mapping, so this adds no extra cost. Computing the maximum at the vertices also saves us from having to interpolate both compression factors. The gradients for m can again be computed using the plane equation method. The last thing that needs to be done per pixel is taking the log2 of m, because mipmap dimensions are progressively divided by two. This can be implemented efficiently either by using the exponent of the floating-point representation or by converting to integer and using the bsr instruction to find the index of the highest set bit. It can be handy to index the mipmaps starting from 0 for a 2^16 x 2^16 texture down to 15 for a 1x1 texture. For non-square textures, the biggest dimension determines the index, to avoid over-blurring.
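A sketch of the floating-point exponent variant (assuming IEEE 754 single precision and m >= 1):

    #include <cstring>

    // Returns floor(log2(m)) by extracting the biased exponent of the float.
    int mipLevelFromCompression(float m)
    {
        unsigned int bits;
        std::memcpy(&bits, &m, sizeof(bits));    // reinterpret the bit pattern safely
        return int((bits >> 23) & 0xFF) - 127;   // exponent field minus the bias
    }

The result is then clamped against the number of mipmap levels actually present for the texture.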

One last trick that can be used in the triangle setup is the method for sorting the vertices from top to bottom. Normally, sorting involves copying the elements to a temporary location and swapping values. For vertices with many components, this can cost a considerable amount of time. Fortunately, it is not necessary, since we can just sort references to the vertices instead. This trick can be used for any sorting problem, but it is often forgotten.

Before triangles can be rasterized, they need to be clipped, projected, and then scaled into the viewport. It is very important to do this in homogeneous space to have z-values independent of the w coordinate, which is needed for perspective correction. We could use a clipping volume like the one used in DirectX, but there is a more efficient choice. If we use [0, 1] x [0, 1] x [0, 1] as the clipping volume, the distances from a vertex to the left, bottom, and front clipping planes are equal to its x, y, and z coordinates, respectively:

$$0 \le x \le 1,\quad 0 \le y \le 1,\quad 0 \le z \le 1$$
$$\Leftrightarrow\quad 0 \le X \le W,\quad 0 \le Y \le W,\quad 0 \le Z \le W$$
$$\Leftrightarrow\quad 0 \le X \;\wedge\; 0 \le W - X,\quad 0 \le Y \;\wedge\; 0 \le W - Y,\quad 0 \le Z \;\wedge\; 0 \le W - Z$$

In this notation, lowercase letters are screen coordinates, while capital letters are clip space coordinates. As we can see, only three subtractions are needed. For the comparisons, it is advisable to use the sign bit of the floating-point format instead of the compare instructions, because this is faster and also works with SSE, so we can stay in MMX mode. This can be useful for interpolating color values efficiently when an edge is clipped. To interpolate only those values needed by the rasterizer, we can once more use run-time conditional compilation.

In this article we focused primarily on pixel shading instead of vertex T&L, because the latter can be performed efficiently by DirectX. However, if we want to develop an API-independent renderer, soft wiring can also make a big difference in performance for vertex processing. Implementing a vertex shader compiler is similar to implementing the pixel shader compiler. There is just one important difference, which allows another big optimization: with pixel shaders we have to work per pixel, but with vertex shaders we can work per component. For example, when transforming a vertex in a straightforward manner, we need a lot of shuffle instructions to do the dot products. All this internal data movement is wasted time, so we would like to eliminate it. We can do this by noting that the x, y, z, and w coordinates can be computed independently. So transforming all vertices separately, or first transforming all x components, then all y components, and so on, performs exactly the same operations. However, when working per component instead of per vertex, we do not need any shuffle instructions, because four x coordinates from different vertices in an SSE register are independent. So what we do is store a structure-of-arrays for the components instead of an array-of-structures for the vertices. The components are then processed in groups of four in a tight loop. To implement this, it is useful to first write the SSE code using only the single-scalar instructions and then replace them all with packed-scalar instructions that do exactly the same operation on four components at a time.
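Written with compiler intrinsics instead of SoftWire for brevity, the structure-of-arrays transform of one output component looks roughly like this (a sketch; the data layout and parameter names are assumptions):

    #include <xmmintrin.h>

    // x, y, z, w: 16-byte aligned component arrays (structure-of-arrays).
    // outX receives the transformed x component; count is a multiple of four.
    // m00..m30 are the matrix elements that contribute to x.
    void transformX(const float *x, const float *y, const float *z, const float *w,
                    float *outX, int count,
                    float m00, float m10, float m20, float m30)
    {
        __m128 c0 = _mm_set1_ps(m00);
        __m128 c1 = _mm_set1_ps(m10);
        __m128 c2 = _mm_set1_ps(m20);
        __m128 c3 = _mm_set1_ps(m30);

        for(int i = 0; i < count; i += 4)   // four vertices per iteration, no shuffles
        {
            __m128 r = _mm_mul_ps(_mm_load_ps(&x[i]), c0);
            r = _mm_add_ps(r, _mm_mul_ps(_mm_load_ps(&y[i]), c1));
            r = _mm_add_ps(r, _mm_mul_ps(_mm_load_ps(&z[i]), c2));
            r = _mm_add_ps(r, _mm_mul_ps(_mm_load_ps(&w[i]), c3));
            _mm_store_ps(&outX[i], r);
        }
    }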

As mentioned before, manual scheduling of instructions is not possible. However, we could develop an automatic optimizer: first a peephole optimizer and then a scheduler. Scheduling can considerably improve performance, since a Pentium III or 4 cannot execute dependent instructions, or instructions that use the same execution unit, in parallel. Unfortunately, optimal scheduling cannot be done in polynomial time. There are heuristics, but they need to know many parameters for each instruction, including the throughput, the latency, and the execution units it uses. Since there are so many instructions and variants, this is hard to implement. The Pentium III and 4 are superscalar, pipelined processors with out-of-order execution, which basically means that we can't predict when a certain instruction is executed. Instructions are also composed of one or more micro-instructions, but this micro-code is a well-kept secret. It is stored in a ROM that is model specific, making it almost impossible to have accurate parameters for a heuristic scheduler.

A much easier approach is to treat the processor as a theoretical black box: a block of code goes in, and we can only see the results; we don't even care how it works internally. In practice, this means that we just measure the execution time of the code and see what happens if we change the instruction order. We use a brute-force approach, running many tests and keeping the code with the smallest execution time. By constructing a graph of instruction dependencies, every permutation of the instructions that preserves the dependencies can be tested. This can quickly become billions of possibilities, but with a good heuristic, a near-optimal solution can be found in linear time. However, this method is not that easy to use in a renderer, because an accurate performance measurement is needed. A benchmark with only one small triangle could execute in less than a millisecond, but the timing is not always consistent because of interrupts, task switches, and other stalls, so the test has to be repeated long enough to get a reliable average. We often also need more than one shader, and we can't let the user wait for hours before we start rendering at around 20 percent higher performance. One solution is to run the scheduler in the background in another thread. By temporarily raising the thread priority, we can try to avoid task switches. The longer the user runs the application, the more the shaders get optimized. This also means that when the user closes the application, the optimal scheduling is lost. It is also pointless to spend 20 percent of processor time on scheduling if we expect at most a 20 percent performance increase from it. Another solution is to compile the shaders in advance and put them in a library, but then we lose flexibility; the system used for scheduling can also differ significantly from the user's system, which influences the optimal instruction order. The last solution is to give the user the option to run a long benchmark that schedules the shaders for the current settings and keeps the results.

The brute-force scheduling method isn't very practical for graphics shaders, but there are still other options. One method that certainly deserves attention is peephole scheduling: looking at only a few instructions at a time, we can reduce the dependencies between them and empirically balance the execution unit usage. This method is very fast, since it doesn't need to measure the performance of the altered code, and it is scalable. On the other hand, changes that appear beneficial at first can work adversely in practice because of the hard-to-predict behavior of processors with out-of-order execution. So in the end, we just have to accept that we can't do a perfect job here.


Conclusion

We conclude this article by summarizing the results. Software rendering shows new potential thanks to the speed of modern SIMD processors, the optimizations made possible by soft wiring, and the ultimate programmability of x86 shaders. For applications where reliability and a wide market are more important than the race for performance, it is an interesting alternative. It is also debatable whether games really need a great deal of eye candy, and more frames per second than can be perceived, to have good gameplay. Let's not forget web multimedia, where we can't assume that everyone has hardware acceleration that supports the wanted features, but where soft wiring can get the best performance out of any type of CPU. It is also useful for scripting engines, JIT compilers, and optimization in general. Run-time intrinsics are very easy to use, so it takes little time to convert existing assembly code to intrinsics. Application areas like real-time sound and video processing also benefit a lot from this, as it is less common to have hardware support for these tasks; in these cases, it can also be worth spending a few minutes to find a near-optimal instruction order. So let's break free from our dependence on hardware support and do things that have never been tried before!

References

[1] Capens, Nicolas, SoftWire Run-Time Assembler Library, http://softwire.sourceforge.net.

[2] id Software, Quake III, http://www.idsoftware.com.

[3] Panovici, Alec, The C++ Compiler Compiler (CppCC), http://cppcc.sourceforge.net.

[4] Poletto, Massimiliano and Vivek Sarkar, "Linear-scan register allocation," ACM Transactions on Programming Languages and Systems, Volume 21, Issue 5 (Sept. 1999), pp. 895-913, http://www-124.ibm.com/developerworks/oss/jikesrvm/info/pubs.shtml#toplas99.

[5] Capens, Nicolas, SoftWire Shader Emulator (swShader), http://sw-shader.sourceforge.net.

[6] Hecker, Chris, "Perspective Texture Mapping," April/May 1995, http://www.d6.com/users/checker/misctech.htm.


SoftD3D: A Software-only Implementation of Microsoft's Direct3D API

Oliver Weichhold

In this article I describe the process of making SoftD3D and some of the challenges that had to be overcome along the long road to completion.

Some time ago, our company decided to shift part of its focus toward the upcoming generation of embedded devices like the Compaq IPAQ. There was a variety of products in the pipeline, among them an application that required real-time 3D graphics. After a great deal of internal discussion and brainstorming, the team opted against the acquisition or in-house development of a proprietary 3D engine. Instead, I was assigned to develop a piece of software that would allow everyone involved to continue to work on the new platform with a well-known technology. That technology was Direct3D 8.

Fortunately, at the time I was carrying out the initial planning for the project, a subset of the DirectX 8 API was already available for our target platform (the Compaq IPAQ). Unfortunately, due to the lack of built-in 3D hardware, Direct3D was present but not operational on this platform. I recalled that the IDirect3D8 interface featured a member called RegisterSoftwareDevice, which according to the documentation could be used to plug in arbitrary (software) rasterizers. I reckoned that I just had to write a software rasterizer instead of reinventing the entire API.

Despite the fact that the RegisterSoftwareDevice member was not supported on the Win32 platform, I started my research and consumed every piece of publicly available documentation and software related to the task, ranging from the Windows DDK (device development kit) to actual graphics device driver source code. After a few days, I came to the conclusion that it wasn't meant to be; RegisterSoftwareDevice was either not supported at all or reserved for internal use by Microsoft. I had to do it the hard way.

At first I felt overwhelmed by the challenge of writing a compatible implementation of a technology that Microsoft had developed with significant manpower. True, I had a good deal of experience with all major Direct3D releases, but that experience was from the standpoint of an application developer, not that of an API implementer or even a device driver developer.

After concluding my initial research into the inner workings of the Direct3D pipeline, during which the diagrams by Rich Thomson proved to be extremely useful, I began to iron out the object hierarchy. Since DirectX is based on Microsoft's Component Object Model (COM) technology, decisions had to be made about how the Direct3D COM interfaces would be exposed. Under normal circumstances, this process would be pretty straightforward, but when you are dealing with a real-time environment where every clock cycle counts, things can easily get a bit more complicated.

The goal was to produce a Dynamic Link Library (DLL) that would be binary compatible with Microsoft's own D3D8.DLL, which implements the core of Direct3D version 8. That task was easily accomplished because D3D8.DLL exports only a single function relevant to application developers, named Direct3DCreate8. The implementation of the Direct3DCreate8 function exported by SoftD3D.dll is shown below:

    Direct3D8 * __stdcall Direct3DCreate8(UINT SDKVersion)
    {
        CComObject<Direct3D8> *p;
        CComObject<Direct3D8>::CreateInstance(&p);

        if(p)
            p->AddRef();

        return p;
    }

The next task involved creating a set of skeleton classes from the object model whose sole purpose, for now, was to expose the associated COM interfaces and get things running. The actual implementation of most interface members was left blank at that time. The skeleton for the class implementing the IDirect3DDevice8 interface is shown below:

    class ATL_NO_VTABLE Device8 :
        public CComObjectRootEx,
        public IDirect3DDevice8
    {
        BEGIN_COM_MAP(Device8)
            COM_INTERFACE_ENTRY_IID(IID_IDirect3DDevice8, IDirect3DDevice8)
        END_COM_MAP()

    protected:
        // IDirect3DDevice8
        STDMETHOD(TestCooperativeLevel)();
        STDMETHOD_(UINT, GetAvailableTextureMem)();
        ...
    };

NOTE: Some readers might notice that I'm using the ActiveX Template Library (ATL) and wonder why I did this, considering the real-time nature of the project. The answer is that it is comfortable and has virtually no performance impact.


Then it was time to implement the "adapters" that would be responsible for displaying the rendered images on screen. This task was especially tricky since I was planning to do most of the work on Win32 to speed up the development process. As always, if you develop something on a host platform that is primarily targeted at another platform that is far inferior performance-wise, you have to be very careful not to make design decisions that don't work out on the target platform later on.

The plan was to employ DirectDraw 7 for frame buffer management on Windows. To make sure that I wouldn't hit a dead end when porting the project to the target hardware, I began evaluating and benchmarking various libraries dealing with frame buffer and off-screen surface management on PocketPCs. Luckily, it turned out that the approach taken by most of those libraries was close to optimal performance-wise and, even better, conceptually compatible with DirectDraw. Having solved that problem, I wrote two adapters: one for Windows using DirectDraw and another for the Compaq IPAQ using the GapiDraw library.

    class ATL_NO_VTABLE DDraw7Adapter :
        public CComObjectRootEx,
        public IAdapter
    {
        BEGIN_COM_MAP(DDraw7Adapter)
            COM_INTERFACE_ENTRY_IID(__uuidof(IAdapter), IAdapter)
        END_COM_MAP()

        // IAdapter
        STDMETHOD_(UINT, GetModeCount)();
        STDMETHOD(EnumModes)(UINT Mode, D3DDISPLAYMODE* pMode);
        STDMETHOD(BeginScene)();
        STDMETHOD(EndScene)();
        STDMETHOD(Flip)();
        ...
    };

Before even thinking about working on the actual rendering pipeline, another problem had to be solved: floating-point calculations. Unlike almost every modern personal computer, embedded devices such as PocketPCs are usually not equipped with special hardware that handles floating-point calculations, and the IPAQ is no exception. Floating-point calculations are nonetheless possible on these devices, but they are slow; very slow. Even the fastest floating-point libraries for StrongARM processors (which power the IPAQ) are between ten and 100 times slower than integer calculations. This poses a huge problem for software that primarily deals with floating-point calculations. However, there is a solution called fixed-point math. For the reader unfamiliar with this concept, here is a quick rundown:

Fixed-point math is a simple way to speed up floating-point calculations by using integers to represent fractional numbers. Fixed-point formats are usually expressed in the xx.yy form, where xx describes the number of bits before the binary point and yy the number of bits after it. In the case of 16.16 fixed point, 65535.99998 is the largest possible number.

Because the actual calculations are performed using integer math, fixed point runs at an acceptable speed, even on the low-end CPUs used in embedded devices. But there's a hook: by using fixed-point calculations, you sacrifice precision and range, which can easily render the performance gain irrelevant if not handled correctly.

Regardless of the potential problems, I decided to add support for fixed-point math right from the start. Thanks to C++'s operator overloading, it was relatively easy to produce a fixed-point class that, used in conjunction with custom type definitions, handles all floating-point calculations and can be switched in or out by changing a simple preprocessor symbol.
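A minimal sketch of such a class in 16.16 format (the name, the operator set, and the switching typedef are illustrative, not SoftD3D's actual code):

    // 16.16 fixed-point number: 16 integer bits, 16 fractional bits.
    class Fixed
    {
    public:
        Fixed() : raw(0) {}
        Fixed(float f) : raw(int(f * 65536.0f)) {}
        Fixed(int i)   : raw(i << 16) {}

        operator float() const { return raw / 65536.0f; }

        Fixed operator+(Fixed rhs) const { return fromRaw(raw + rhs.raw); }
        Fixed operator-(Fixed rhs) const { return fromRaw(raw - rhs.raw); }

        // Use a 64-bit intermediate so the full product survives before shifting back.
        Fixed operator*(Fixed rhs) const { return fromRaw(int(((long long)raw * rhs.raw) >> 16)); }
        Fixed operator/(Fixed rhs) const { return fromRaw(int((((long long)raw) << 16) / rhs.raw)); }

    private:
        static Fixed fromRaw(int r) { Fixed f; f.raw = r; return f; }
        int raw;
    };

    // A project-wide typedef lets one preprocessor symbol switch the whole
    // renderer between floating-point and fixed-point builds.
    #ifdef USE_FIXED_POINT
        typedef Fixed real;
    #else
        typedef float real;
    #endif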

The next step involved writing all the resource management and setup code necessary to create and initialize the device, the frame buffer, textures, the depth buffer, and the vertex and index buffers allocated by a client application. At that point, I was already using DirectX SDK samples for testing, and everything worked quite well, at least with the SDK samples, as I soon found out.

Let's now move on to the implementation of the rendering pipeline.

Vertex Stream Splitting

The first thing that happens when any of the DrawPrimitive member functions of IDirect3DDevice8 is called is the stream setup. This stage performs the separation of incoming vertex data for further processing. The term "incoming" refers to vertex streams specified by a previous call to SetStreamSource or to the vertex data passed to any of the DrawPrimitiveUP functions.

    class VertexStreamsInfo
    {
    public:
        VertexStreamsInfo();
        virtual ~VertexStreamsInfo();

        // Attributes
        VertexStreamInfo m_Streams[VSTREAM_MAX];
        DWORD m_dwTexCoordCount;
        DWORD m_dwCombinedFVF;
        DWORD m_dwMaskAllocated;   // bitmask indicating streams that were allocated in the
                                   // vertex pipeline and should be freed after processing

        // Implementation
        HRESULT Init(StreamInfo *pSI, DWORD dwVertexShader);
        HRESULT Init(BYTE *pData, DWORD dwStride, DWORD dwVertexShader);
        HRESULT Init(StreamInfo *pSI, DWORD *pDeclaration);
    };

The Init member is overloaded three times to handle the following cases:

- Fixed-function pipeline
- Fixed-function pipeline using DrawPrimitiveUP
- Programmable pipeline

After the Init function returns, the incoming vertex stream(s) have been separated into 1-n stream information blocks stored in m_Streams, which are passed further down the pipeline. For example, an FVF (flexible vertex format) of D3DFVF_XYZ | D3DFVF_TEX2 is split into three distinct streams: one for the vertex position and one for each of the two texture coordinates.

Depending on whether we are dealing with transformed or untransformed vertices, the following section can be skipped.

Vertex Processing

During this stage, the incoming vertex streams are processed. This is done by concatenating one or more discrete implementations of the internal IVertexStreamProcessor interface.

    interface IVertexStreamProcessor
    {
        STDMETHOD(ProcessStreams)(VertexStreamsInfo *pVSI, DWORD dwStartVertex,
                                  DWORD dwNumVertices) PURE;
        STDMETHOD_(void, SetNext)(IVertexStreamProcessor *pNext) PURE;
    };

Each vertex processor operates on all vertices between StartVertex and NumVertices, using single or multiple input streams and producing 0-n output streams. The resulting set of vertex streams is then passed to the next processor in the chain (if present).

There are various reasons behind the decision to handle vertex data in a streaming manner, as opposed to processing vertices as needed:

- Best performance with indexed primitives.
- The use of SIMD instructions (single instruction, multiple data) like MMX, SSE, and 3DNow!. These instructions are best suited to larger batches of data and may even cause heavy speed penalties when the processor has to switch often between modes (FPU/MMX).
- The memory layout of the vertex buffers is optimized for processor cache friendliness and SIMD instructions.

The following vertex processors are currently implemented:

- VSP_XFormWorldViewProjection: Applies the world/view/projection transformation and produces an XYZW output stream from an incoming XYZ stream
- VSP_XFormWorldViewProjection_X86_SSE: Applies the world/view/projection transformation and produces an XYZW output stream from an incoming XYZ stream; SSE optimized
- VSP_XFormWorldView: Applies the world/view transformation and produces an XYZW output stream from an incoming XYZ stream
- VSP_VertexFog: Produces a FOG output stream from an incoming XYZW stream
- VSP_VertexFogRange: Produces a FOG output stream from an incoming XYZW stream; range-based fog
- VSP_TexGen_CameraSpace_Normal: Texture coordinate generation
- VSP_TexGen_CameraSpace_Position: Texture coordinate generation
- VSP_TexGen_CameraSpace_ReflectionVector: Texture coordinate generation
- VSP_TC_Transform: Texture coordinate transformation
- VSP_Lighting: Direct3D fixed-function pipeline conformant vertex lighting; produces DIFFUSE and SPECULAR output streams
- VSP_Lighting_X86_SSE: Direct3D fixed-function pipeline conformant vertex lighting; produces DIFFUSE and SPECULAR output streams; SSE optimized

The number and type of vertex processors that get concatenated into a processing chain depends solely on the current renderstate settings.

Rasterizer/Interpolator Setup

This stage consists of a large switch tree that picks the rasterizer best suited to handling the current renderstate settings. SoftD3D implements more than 40 distinct rasterizers and interpolators, which differ only in the number and quality of the interpolated values. This might sound like a lot of work (and it is), but fortunately this high level of specialization can be accomplished through extensive use of C++ templates to combine simple fragments into more powerful ones. More on this later.

At this point, we enter the main triangle loop. This loop iterates over all triangles that are affected by the current DrawPrimitive call.

Backface Culling<br />

Depending on the current value of the D3DRS_CULLMODE renderstate, triangles get<br />

backface culled. There are a number of possible approaches to backface culling.<br />

SoftD3D performs backface culling in object space because of the advantage of<br />

the possible rejection of triangles very early in the pipeline.<br />

To do this, we first compute the viewer position in object space using the
following pseudocode:

ViewerPosition.x = -Transform[D3DTS_VIEW].m[0][0] * Transform[D3DTS_VIEW].m[3][0] -
                    Transform[D3DTS_VIEW].m[0][1] * Transform[D3DTS_VIEW].m[3][1] -
                    Transform[D3DTS_VIEW].m[0][2] * Transform[D3DTS_VIEW].m[3][2];

ViewerPosition.y = -Transform[D3DTS_VIEW].m[1][0] * Transform[D3DTS_VIEW].m[3][0] -
                    Transform[D3DTS_VIEW].m[1][1] * Transform[D3DTS_VIEW].m[3][1] -
                    Transform[D3DTS_VIEW].m[1][2] * Transform[D3DTS_VIEW].m[3][2];

ViewerPosition.z = -Transform[D3DTS_VIEW].m[2][0] * Transform[D3DTS_VIEW].m[3][0] -
                    Transform[D3DTS_VIEW].m[2][1] * Transform[D3DTS_VIEW].m[3][1] -
                    Transform[D3DTS_VIEW].m[2][2] * Transform[D3DTS_VIEW].m[3][2];

After that, we calculate a vector from the object space viewer position to the triangle<br />

and take the dot product of this vector and the normal vector of the triangle.<br />

The resulting scalar is treated differently depending on whether D3DRS_CULLMODE<br />

= D3DCULL_CCW or D3DRS_CULLMODE = D3DCULL_CW.<br />

Because this process is repeated for every triangle, optimized implementations<br />

for different processor architectures exist.<br />
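For illustration, the per-triangle test could look like the following sketch. The Vector3 type with Cross() and Dot() helpers is assumed, the sign convention is only indicative, and D3DCULL_NONE/D3DCULL_CW/D3DCULL_CCW are the standard Direct3D cull modes; this is not SoftD3D's actual code.

// Hypothetical object-space backface test: build the triangle normal, form the
// viewer-to-triangle vector, and use the sign of their dot product together
// with the current cull mode.
bool IsBackfacing(const Vector3 &v0, const Vector3 &v1, const Vector3 &v2,
                  const Vector3 &viewerPosition, DWORD cullMode)
{
    Vector3 normal = Cross(v1 - v0, v2 - v0);    // winding-dependent triangle normal
    Vector3 toTri  = v0 - viewerPosition;        // viewer position to triangle
    float   facing = Dot(toTri, normal);

    if (cullMode == D3DCULL_NONE)
        return false;                            // never reject
    if (cullMode == D3DCULL_CCW)                 // Direct3D default
        return facing >= 0.0f;
    return facing <= 0.0f;                       // D3DCULL_CW
}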

Vertex Assembly<br />


At this point, almost all the data required by the rasterizer is available. The vertex<br />

processing stage has transformed the vertices using the world/view/projection<br />

matrices, lit the vertices, and applied texture transformations and fog. But the<br />

generated data has been stored in a stream layout, which is not exactly optimal for<br />

the rasterizer.<br />

To avoid time-consuming pointer arithmetic at the rasterizer stage, the data<br />

from all streams is now reconciled into a single vertex structure. This is achieved<br />

by dereferencing the streams using pointer arithmetic. To speed up this process,<br />

optimized versions of the code for each Direct3D primitive type have been<br />

implemented.<br />

class TLVertex
{
public:
    Vector4  Position;
    Vector2D ScreenPosition;   // Computed AFTER clipping
    TLColor  Diffuse;          // normalized for interpolation
    TLColor  Specular;         // normalized for interpolation
    float    Fog;
};

class TLVertexTex : public TLVertex
{
public:
    TexCoord tc[MAX_TEX_COORD];
};

After all three vertices of the current triangle have been initialized, they are<br />

passed down to the next stage.<br />




Clipping<br />

During clipping, we test an individual triangle against the canonical view frustum
in homogeneous clip space (–w ≤ x ≤ w, –w ≤ y ≤ w, 0 ≤ z ≤ w).
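A common way to organize such a test is a per-vertex outcode against the six clip planes, as in the following hypothetical C++ sketch; the Vector4 type and the flag names are assumptions, and the sketch is not SoftD3D's actual clipping code.

// Hypothetical per-vertex outcode against the canonical Direct3D clip volume
// (-w <= x <= w, -w <= y <= w, 0 <= z <= w). If the three outcodes of a
// triangle share a set bit, the triangle is completely outside one plane and
// can be rejected; if all three are zero, no clipping is needed at all.
enum ClipFlags
{
    CLIP_LEFT = 1,   CLIP_RIGHT = 2,
    CLIP_BOTTOM = 4, CLIP_TOP = 8,
    CLIP_NEAR = 16,  CLIP_FAR = 32
};

unsigned ComputeOutcode(const Vector4 &p)        // p is a clip-space position
{
    unsigned code = 0;
    if (p.x < -p.w)  code |= CLIP_LEFT;
    if (p.x >  p.w)  code |= CLIP_RIGHT;
    if (p.y < -p.w)  code |= CLIP_BOTTOM;
    if (p.y >  p.w)  code |= CLIP_TOP;
    if (p.z <  0.0f) code |= CLIP_NEAR;
    if (p.z >  p.w)  code |= CLIP_FAR;
    return code;
}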


__forceinline void FastFloat2Int_SetFloor()
{
    __asm
    {
        fnstcw  wFastFloatTemp
        mov     ax, wFastFloatTemp
        and     ax, ~(3 << 10)      ; clear the x87 rounding-control bits
        or      ax, 1 << 10         ; select round-down (floor) rounding
        mov     wFastFloatTemp, ax
        fldcw   wFastFloatTemp
    }
}



SoftD3D implements interpolators for:

• x and y coordinates in screen space

• w for depth buffering

• Diffuse color (not perspective correct)

• Specular color (not perspective correct)

• Fog color (not perspective correct)

• One to eight texture coordinates (perspective correct)

Obviously, there’s quite a large number of possible combinations. Computing all<br />

gradients regardless of the renderstate settings is not an option because doing so<br />

wastes a huge amount of processor time. What we need is a set of interpolators<br />

that only work on values that are really going to be used for rendering. One could
write customized code for each possible case, but not only would this be a cumbersome
process, it would also be very prone to human error. Once again, C++ templates
come to our rescue.

If we could split up the problem into smaller parts and let the compiler combine<br />

those fragments into a working unit without suffering a performance impact,<br />

we’d be set. This is exactly how SoftD3D implements its set of more than 40 distinct<br />

rasterizers. Take a look at the following class declaration:<br />

template <...>
class Rasterizer : public IRasterizer
{
    ...
};

The Rasterizer class acts as a template-based abstract base class for all<br />

rasterizers, importing its entire functionality from the template arguments specified<br />

by derived classes.<br />

The class declaration below shows a discrete rasterizer that is responsible for<br />

rendering Gouraud-shaded triangles not using a depth buffer. The RTriGouraud<br />

rasterizer really consists of just the declaration. No additional code or data had to<br />

be added.<br />

class RTriGouraud : public Rasterizer<...>
{
};

Now take a look at this one:

template <...>
class RTriGouraudZ : public Rasterizer<...>
{
};

As you might have guessed, this rasterizer also renders Gouraud-shaded triangles,<br />

but this version includes support for interpolating depth values (1/w).


Let’s summarize this stage again. During triangle setup, we compute a set of<br />

constants called gradients that allow us to interpolate arbitrary values across the<br />

surface of the triangle for further processing.<br />
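As a concrete illustration of what a gradient is, the following C++ sketch solves the plane equation of one interpolated quantity for its per-pixel rate of change in x and y; the struct and function are hypothetical and not SoftD3D's actual triangle setup code.

// Hypothetical gradient setup for one interpolated quantity (for example, 1/w
// or a color channel). Given its value at the three screen-space vertices,
// solve for its rate of change per pixel in x and in y.
struct Gradient { float dVdX, dVdY; };

Gradient ComputeGradient(float x0, float y0, float v0,
                         float x1, float y1, float v1,
                         float x2, float y2, float v2)
{
    // Twice the signed area of the triangle in screen space.
    float denom = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0);
    float oneOverArea = 1.0f / denom;            // assumes a non-degenerate triangle

    Gradient g;
    g.dVdX = ((v1 - v0) * (y2 - y0) - (v2 - v0) * (y1 - y0)) * oneOverArea;
    g.dVdY = ((v2 - v0) * (x1 - x0) - (v1 - v0) * (x2 - x0)) * oneOverArea;
    return g;
}

// With these constants, stepping one pixel to the right adds dVdX to the value
// and stepping one scanline down adds dVdY, which is exactly what the scanline
// interpolators described below rely on.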

Scanline Rendering<br />

When the rendering process gets down to the scanline level, things start getting<br />

interesting — and complicated. The scanline level is the place where the most<br />

time is spent during rendering, and performance can be gained but more often is<br />

lost. This is the place where every processor cycle matters. Speed is the name of<br />

the game.<br />

There are three main techniques for implementing the scanline loop:

• A single complex loop — one loop (Pixel p; for(i = Left; i <= Right; i++) { ... })
walks the span and performs all per-pixel work inside it.

• Multiple simple loops — the scanline is buffered (PixelArray pa[Width];) and
several small, specialized loops each perform one task over the whole span.

• Dynamic compilation — this advanced technique works by compiling small
assembly fragments into complex subroutines at run time.



Before a scanline is rendered, SoftD3D determines whether the rasterizer will
have to deal with textures or fog. If any of these conditions is true,
then a suitable texture cascade is assembled for later use.

The heart of the system is an interface called ITSSProcessor.<br />

interface ITSSProcessor
{
    STDMETHOD_(void, SetNext)(ITSSProcessor *pNext) PURE;
    STDMETHOD_(ITSSProcessor *, GetNext)() PURE;
    STDMETHOD_(BOOL, IsStatic)() PURE;

    virtual void __fastcall Process(Device8 *pD3DD, int Width, BOOL bUsePMask) PURE;
    virtual void __fastcall ProcessBounds(Device8 *pD3DD, int Width, BOOL bUsePMask) PURE;
};

Classes that implement this interface are called TSSProcessors and act as a<br />

hybrid between the interpolator and processor that handles an entire scanline.<br />

Similar to the vertex processing chain explained in the “Vertex Processing”<br />

section, TSSProcessors are concatenated to a chain that, once it is done processing,<br />

has completed the entire Direct3D texture cascade plus fog blending.<br />
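A minimal sketch of how such a chain might be driven for one scanline is shown below; only the ITSSProcessor interface above is taken from the text, and the driver loop itself is hypothetical.

// Hypothetical driver loop: run every TSSProcessor in the chain over the
// current scanline. Together the processors implement the Direct3D texture
// cascade plus fog blending for that span of pixels.
void RunTextureCascade(ITSSProcessor *pChain, Device8 *pD3DD, int Width, BOOL bUsePMask)
{
    for (ITSSProcessor *p = pChain; p != NULL; p = p->GetNext())
        p->Process(pD3DD, Width, bUsePMask);
}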

The process of assembling the texture cascade represented one of the biggest
challenges because it had to meet the following requirements:

• Be very fast.

• Perform texture usage tracking — detect and eliminate combinations of texture
stage states and texture settings that will effectively result in no texture
output, thus eliminating the very costly texel lookup phase.

• Perform data flow tracking — detect and eliminate cases where data is transported
and/or processed in earlier texture stages only to be discarded later in
the cascade.

• Produce accurate results.

Because of the highly speed-sensitive nature of every piece of code executed in<br />

the context of a scanline, a large number of specialized TSSProcessors exist.<br />

TSSProcessors can be categorized into the following groups:

• Texel fetchers: Fill a color channel with color and alpha data from a texture,
with or without bilinear filtering

• Color operators: Carry out the operations defined by the Direct3D D3DTOP
enumeration

• Fog blenders

Texel fetchers represent the most heavily optimized code in SoftD3D. Specialized<br />

implementations for virtually all combinations of processor architectures and texture<br />

formats exist.<br />

Color operators have been heavily (MMX) optimized as well, although not to<br />

the same extent as texel fetchers.


Alpha Testing<br />


If D3DRS_ALPHATESTENABLE is true, then the alpha testing stage is fed with either<br />

the output of the final TSSProcessor in the texture blending cascade or the output<br />

of the scanline color interpolator.<br />

The actual comparison is performed by one implementation of the<br />

IAlphaTester interface:<br />

interface IAlphaTester<br />

{<br />

virtual void SetRef(DWORD dwAlphaRef) PURE;<br />

virtual void Test(ScanLineContext *pSLC, int Width) PURE;<br />

};<br />

SoftD3D features IAlphaTester implementations for each member of the<br />

D3DRS_ALPHAFUNC enumeration.<br />

Once a call to the Test() method is complete, a bit in the pixel mask contained<br />

in the ScanLineContext object is set for each pixel that didn’t pass the<br />

alpha test, and the number of pixels that didn’t pass the test is returned.<br />
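As an illustration, one such implementation might look like the following sketch. The ScanLineContext fields used here (a packed ARGB color stream, a pixel-mask bitfield, and a skip counter) are assumptions; only the IAlphaTester interface itself comes from the text.

// Hypothetical IAlphaTester for D3DCMP_GREATEREQUAL: mark every pixel whose
// alpha is below the reference value as skipped in the scanline's pixel mask.
class AlphaTesterGreaterEqual : public IAlphaTester
{
    DWORD m_dwRef;                                         // reference alpha, 0..255
public:
    virtual void SetRef(DWORD dwAlphaRef) { m_dwRef = dwAlphaRef; }

    virtual void Test(ScanLineContext *pSLC, int Width)
    {
        int skipped = 0;
        for (int x = 0; x < Width; x++)
        {
            DWORD alpha = (pSLC->Color[x] >> 24) & 0xFF;   // assumed packed ARGB stream
            if (alpha < m_dwRef)
            {
                pSLC->PixelMask[x >> 5] |= 1u << (x & 31); // assumed 32-bit mask words
                skipped++;
            }
        }
        pSLC->SkippedPixels = skipped;                     // assumed skip counter field
    }
};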

Z-Testing

If D3DRS_ZENABLE is true, then the Z-testing stage is fed with the output of the 1/w

interpolator.<br />

interface IZTester
{
    virtual void SetBuffer(IDirect3DSurface8 *pSurface) PURE;
    virtual void SetStart(int x, int y) PURE;
    virtual void SkipX() PURE;
    virtual int  Test(ScanLineContext *pSLC, int Width, BOOL bUsePMask, BOOL bUpdate) PURE;
};

The actual comparison is performed by one implementation of the IZTester interface,
which works exactly like the aforementioned IAlphaTester interface.

Pixel Mask Check<br />


Alpha testing and Z-testing are always performed as early as possible in the pipeline.
According to the Direct3D specification, alpha testing is always performed
before Z-testing.

Either testing method produces both a bitfield of masked (skipped) pixels and<br />

a pixel skip counter. If the value of the counter equals the width of the current<br />

scanline, this case is treated as an early out condition, and the entire scanline is<br />

skipped.<br />

If no early out condition is met, all of the following operations must obey the<br />

state of the pixel mask produced by either or both testing stages.



Alpha Blending<br />

If D3DRS_ALPHABLENDENABLE is true, alpha blending is performed on all non-masked<br />

pixels produced by the previous stages.<br />

This is another area of the pipeline that posed a serious challenge. The first<br />

problem that I encountered had something to do with memory accesses. To<br />

explain the problem, we recall how color blending actually works:<br />

If an application enables texture blending, Direct3D must then combine the<br />

color value of the processed polygon pixel with the pixel already stored in the<br />

frame buffer. Direct3D uses the following formula to determine the final color for<br />

each pixel in the primitive’s image [2].<br />

FinalColor = TexelColor × SourceBlendFactor + PixelColor × DestBlendFactor

Right, “Direct3D must then combine the color value of the processed polygon<br />

pixel with the pixel already stored in the frame buffer.” The end of the quote is<br />

the important part.<br />

We must read from the frame buffer! So what’s the problem, you might ask.<br />

The problem is the frame buffer itself. When you allocate a frame buffer intended<br />

for displaying real-time graphics, you preferably allocate video memory on the<br />

graphics card. This way, you get the fastest performance if and only if you restrict<br />

yourself to writing to that chunk of memory because when a frame has been rendered<br />

and must be presented to the user, the video driver only has to change a<br />

memory address on the card in the best case or perform a fast video memory to<br />

video memory transfer in the worst case. That’s great, but unfortunately, this is<br />

not the case with alpha blending because, as mentioned above, we have to read<br />

the frame buffer; reading from video memory is very slow because that memory<br />

area is not cached by the host CPU.<br />

Initially I thought the performance advantage of a video memory frame buffer<br />

would outweigh the slow memory access during blending operations. This<br />

assumption indeed applies to simple test applications. But with real-world applications,<br />

the picture shifted into the opposite direction, which in the end forced me<br />

to abandon video memory frame buffers, bite the bullet, and suffer from system<br />

memory to video memory transfers for every frame.<br />

Another problem was the huge number of possible blending operations<br />

resulting from:<br />

• The number of defined blend factors, independently set for both source and
destination (15)

• The number of different frame buffer formats (2)

My solution was to write small code fragments that handle each of the aforementioned
cases and glue those fragments together using macros: DECLARE_ALPHABLENDER(ONE, ZERO);

The device object tracks the various alpha blending-related render states at<br />

all times and provides a suitable alpha blending handler to the rasterizer on<br />

demand, which is cached as long as the affected render states remain unchanged.
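To give an idea of what one of these glued-together fragments computes, here is a hypothetical per-pixel blend for the SRCALPHA/INVSRCALPHA factor pair on packed 32-bit colors; it is a sketch, not SoftD3D's generated code.

// Hypothetical blend fragment: dst = src * srcAlpha + dst * (1 - srcAlpha),
// computed per channel on packed ARGB values using an integer approximation.
static inline DWORD BlendSrcAlphaInvSrcAlpha(DWORD src, DWORD dst)
{
    DWORD a   = (src >> 24) & 0xFF;               // source alpha, 0..255
    DWORD inv = 255 - a;
    DWORD out = 0;
    for (int shift = 0; shift <= 16; shift += 8)  // blend B, G, and R channels
    {
        DWORD s = (src >> shift) & 0xFF;
        DWORD d = (dst >> shift) & 0xFF;
        DWORD c = (s * a + d * inv + 127) / 255;  // rounded fixed-point blend
        out |= (c & 0xFF) << shift;
    }
    return out | (src & 0xFF000000);              // keep the source alpha
}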


Output to the Render Target<br />

Finally, we have reached the last stage of the rendering pipeline.<br />

As I’ve mentioned before, the entire pixel pipeline operates on 32-bit RGBA<br />

packed color values organized into streams. The result of all the previous rendering<br />

stages is a single output stream that represents an entire scanline.<br />

The final task is to write this stream to the correct location within the render<br />

target surface, optionally performing a conversion of the RGBA32 value to the<br />

pixel format of the target surface.<br />

The color conversion turned out to be another major bottleneck, but at least<br />

on the x86 platform the performance gained from employing MMX and SSE<br />

instructions for the color conversion was tremendous.<br />

This concludes our trip down the graphics pipeline of SoftD3D.<br />

Related Problems<br />


Clearing the Depth Buffer<br />

During one of my profiling sessions, I noticed a spike in VTune’s function graph. It<br />

turned out to be the Clear() member of the ZBufferSurface32 class. Ironically, I<br />

only noticed how much time was actually spent in that function because of a bug<br />

in my test client that yielded no polygon output. Therefore, almost 99 percent of<br />

the processor time was spent clearing the depth buffer and blitting the off-screen<br />

render target surface into the video memory.<br />

Today I know that clearing the Z buffer can be avoided with software<br />

rasterizers, but back then I spent a good amount of time optimizing the Clear()<br />

function. The result outperforms the initial rep stosd solution by 300 to 400 percent<br />

on an AMD Athlon CPU.<br />

The secret behind this huge performance increase is the use of the new<br />

movntq instruction, which writes directly to a location in memory, bypassing the<br />

cache hierarchy.<br />
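For illustration, an equivalent non-temporal clear written with the corresponding compiler intrinsics (rather than hand-written movntq assembly) might look like the sketch below. It assumes a 32-bit build, an 8-byte-aligned buffer, and an even DWORD count, and it is not SoftD3D's actual Clear() routine.

#include <cstddef>      // size_t
#include <mmintrin.h>   // __m64, _mm_set1_pi32, _mm_empty
#include <xmmintrin.h>  // _mm_stream_pi (movntq), _mm_sfence

// Hypothetical clear routine using non-temporal stores: each movntq writes
// 8 bytes straight to memory, bypassing the cache hierarchy, so clearing the
// buffer does not evict useful data from the caches.
void ClearDepthBuffer32(unsigned int *pBuffer, size_t dwordCount, unsigned int clearValue)
{
    __m64  value = _mm_set1_pi32((int)clearValue);       // two copies of the clear value
    __m64 *p     = reinterpret_cast<__m64 *>(pBuffer);

    for (size_t i = 0; i < dwordCount / 2; ++i)
        _mm_stream_pi(p + i, value);                     // movntq [p + i], value

    _mm_sfence();   // make the streaming stores globally visible
    _mm_empty();    // emms: leave MMX state so x87 code can run again
}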

Real-world Applications<br />


When the feature set of SoftD3D grew beyond the scope of DirectX SDK samples,

we began to test the library with real-world applications. This proved to be a<br />

very good decision. The first real-world test was done using Unreal Tournament.<br />

Of course, the game crashed.<br />

A debugging session quickly unveiled several problems with SoftD3D’s<br />

resource management. As those problems were ironed out, the game was still<br />

crashing. But this time, it was not because of a bug in SoftD3D but simply a result<br />

of the game not honoring some of the bits in the D3DCAPS8 structure — a problem<br />

more or less repeated by almost any application we tried for testing.<br />

One application would simply refuse to initialize if certain caps bits were not<br />

available, although the application wouldn’t even make use of those features, and<br />

another one insisted on creating textures using unsupported formats. This forced




me to add some features that were initially excluded from SoftD3D’s feature set<br />

and write several workarounds to overcome problems caused by applications<br />

making tricky assumptions about the availability of features and the internal format<br />

of some resources.<br />

Yes, it was a lot of work, but the result was one giant leap toward a mature<br />

product.<br />

The Future

The outlook for the future of SoftD3D begins with a quick glance back to the days

when I was working on the pixel pipeline of the library.<br />

I was just working on an SSE-optimized version of a texture filter object<br />

when it struck me: Wouldn’t it be cool if we could generate this assembly code at<br />

run time instead of writing dozens of specialized routines? Coincidentally, I was

working on another Direct3D-related project at the same time, and nVidia’s<br />

NVLink tool [3] was used for this project to generate D3D vertex shaders at run<br />

time.<br />

NVLink works by taking a number of small pieces of vertex shader code<br />

called fragments and “sewing” them together at run time. The advantage is that<br />

one no longer has to write specialized shaders for each rendering state. DirectX 9

offers a very similar functionality through the ID3DXFragmentLinker interface<br />

[2]. Take a look at these vertex shader fragments:<br />

#beginfragment f_load_r_diffusecolor_incoming_diffuse<br />

mov r_diffusecolor, v2<br />

#endfragment<br />

#beginfragment f_write_diffuse_result<br />

mov oD0, r_diffusecolor<br />

#endfragment<br />

The most obvious departure from ordinary vertex shader code is that hardware<br />

registers have been replaced by symbolic names. These symbols act as virtual<br />

registers and define an “interface” between fragments. For example, the virtual<br />

register r_diffusecolor is referenced by both the f_load_r_diffusecolor_<br />

incoming_diffuse and the f_write_diffuse_result fragments. The assembly<br />

code generated by NVLink for a shader program using both fragments would look<br />

like this:<br />

mov r0, v2<br />

mov oD0, r0<br />

Because of NVLink’s optimizing capabilities, the final output is this:<br />

mov oD0, v2<br />

The unnecessary use of a temporary register has been eliminated.<br />

This concept inspired me to try a different solution for implementing<br />

SoftD3D’s pipeline. However, my own solution differed in a number of aspects<br />

from NVLink. My first decision was to define the shader fragments as binary x86



assembly code blocks to avoid the difficult task of writing a parser. Because of this<br />

decision, the project was doomed! But read on.<br />

The shader fragments were implemented as C++ inline assembly:<br />

PXOAPI void PXO_LERP_DIFFUSE_MMX_X86()<br />

{<br />

__asm<br />

{<br />

movq mm0, QWORD PTR [ebx + PXOC_Diffuse]<br />

paddsw mm0, QWORD PTR [ebx + PXOC_DiffuseStep]<br />

movq QWORD PTR [ebx + PXOC_Diffuse], mm0<br />

}<br />

}<br />


The C++ compiler does all the dirty work and generates the data to embed the<br />

fragments into SoftD3D in binary form. In order to avoid the tracking of processor<br />

register usage by fragments, a bank of “virtual” registers was defined and all real<br />

processor registers declared as scratch registers.<br />

At run time, SoftD3D generated a list of fragments depending on the current<br />

render states and the available processor features and linked them together into<br />

one big chunk of assembly code. It worked — somehow. The code was slower<br />

than the compiler-generated code! This was mainly because of the ever-increasing<br />

need to move certain data from processor registers to main memory in order<br />

to resolve register conflicts between fragments. Now I was paying the price for<br />

not writing a parser and thus losing the ability to easily track the use of processor

registers by the combined fragment program. I decided to discontinue this part of<br />

the project.<br />

Looking back, I feel that my initial attempt at implementing a programmable<br />

software pipeline was merely half-baked, and its failure was actually a good thing<br />

because I’ve learned quite a lot from this failure, and most importantly it cleared<br />

the way for a better implementation. Ironically, the solution was there all the time,<br />

but I didn’t realize it — until now.<br />

I was trying to figure out why one of my shaders (for another project) didn’t<br />

produce the expected results, and I suspected a compiler bug. So I dug down into<br />

the compiled shader token array. The token array is a simple collection of codes<br />

emitted by the vertex or pixel shader assembler. These codes instruct the driver<br />

on how to create the shader. The format of tokens within each shader code determines<br />

its uniqueness. A shader code token is a DWORD with a specific format.<br />

The driver (in this case, SoftD3D) reads the shader code’s tokens to interpret the<br />

code.<br />

Each individual shader code is formatted with a general token layout. The<br />

first token must be a version token. The version token provides the version number<br />

of the code and also determines whether the code is for a pixel or vertex<br />

shader. Shader content follows the version token and is composed of various

instruction tokens, perhaps intermingled with comment tokens and white space.<br />

Depending on the precise operation that an instruction token specifies, destination<br />

and source parameter tokens can also be part of the shader content and follow<br />

an instruction token. For example, if the instruction token specifies an add



operation, the driver determines that one destination and two source parameter<br />

tokens follow the instruction token. An end token completes the shader code.<br />

• Version token: Describes the version number of the shader code and
informs the driver whether the shader code is for a pixel or vertex shader

• Instruction token: Informs the driver of a specific operation to perform

• Destination parameter token: Describes properties of a destination register

• Source parameter token: Describes properties of a source register

• Label token: Used for certain operations (for example, D3DSIO_CALLNZ)

• Comment token: Describes the length of the comment that follows

• End token: Informs the driver of the end of the shader code

The video driver parses the token array and compiles it into a set of hardware-specific
opcodes and register states. It should be possible to mimic that behavior

by compiling the tokens into processor-specific assembly language at run time.<br />

Because compilation occurs at run time, processor-specific extensions could be<br />

used. It would all depend on the quality of the token compiler. The idea was born.<br />
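As a rough illustration of the first thing such a token compiler has to do, the following sketch walks a compiled shader token array, classifies the version token, and skips comment blocks until it reaches the end token. The bit layouts used (the 0xFFFE/0xFFFF version markers, the 0x0000FFFF end token, and the comment length in bits 16 through 30) follow the public DirectX shader-token documentation; the helper function itself is hypothetical.

#include <cstdio>

typedef unsigned long DWORD;   // as defined on Win32

// Hypothetical first pass over a compiled shader token array: identify the
// shader type and version, then walk the stream, skipping comment blocks,
// until the end token is reached. Real instruction decoding would go where
// the trailing comment is.
void WalkShaderTokens(const DWORD *pTokens)
{
    DWORD version = *pTokens++;
    bool isPixelShader = ((version >> 16) & 0xFFFF) == 0xFFFF;   // 0xFFFE = vertex
    printf("%s shader, version %lu.%lu\n",
           isPixelShader ? "pixel" : "vertex",
           (version >> 8) & 0xFF,                                // major version
           version & 0xFF);                                      // minor version

    while (*pTokens != 0x0000FFFF)                               // end token
    {
        DWORD token = *pTokens;
        if ((token & 0xFFFF) == 0xFFFE)                          // comment token
        {
            DWORD length = (token >> 16) & 0x7FFF;               // length in DWORDs
            pTokens += 1 + length;                               // skip the comment
            continue;
        }
        // ... decode instruction, destination, and source parameter tokens here
        pTokens++;
    }
}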

Before I got to work on the actual implementation, I wrote a series of tools to<br />

improve my understanding of the subject. The logical first step was a shader<br />

disassembler. Not only was I forced to understand the meaning of every single bit<br />

but the resulting source code still acts as an invaluable reference as well.<br />

Figure 2: Shader disassembler

After the disassembler was done, I felt that my understanding of the subject was<br />

still not proficient enough, and I began to work on its counterpart — a macro<br />

assembler. Should the requirement of compiling shader source code at run time<br />

ever arise, the technology would be there.<br />

Now I won’t go into the gory details, but I have to say that it was actually less<br />

complicated than expected thanks to a dream team called Flex & Yacc — well<br />

known in the UNIX world.<br />

Armed with in-depth knowledge and lots of sample code, I went on with the<br />

implementation. I quickly realized that the integration of the shader framework




into the existing pipeline of SoftD3D was not an easy task. Some architectural<br />

decisions still had to be made before real work could be done. The most important<br />

one was how to map the hardware register set exposed by vertex and pixel<br />

shaders in the correct way.<br />

As mentioned before, the final implementation is supposed to compile the<br />

assembled shaders into processor-specific assembly code. This process is far from<br />

being a trivial issue, and implementing it with parts of the execution environment<br />

yet to be determined would be crazy. An interim implementation was necessary<br />

— an interpreter. This is exactly where SoftD3D stands now. It executes vertex<br />

shaders up to version 3.0 using an interpreter. Of course, the interpreter works<br />

slowly, but this is irrelevant because its sole purpose was laying the foundation for<br />

the final goal: the shader compiler — and this has been accomplished.<br />

References

[1] Hecker, Chris, "Perspective Texture Mapping," Game Developer magazine,
April/May '95, http://www.d6.com/users/checker/misctech.htm.

[2] Microsoft, DirectX 9.0 Programmer's Reference, 2002.

[3] NVLink, nVidia, http://developer.nvidia.com/view.asp?IO=nvlink_2_1.



Named Constants in Shader Development


Jeffrey Kiel<br />

If you’re like me, you have been looking at the articles in this book saying things<br />

like, “Wow, look at that!” or “I didn’t think you could do something that cool!”<br />

Inevitably, you decide that some shader does have the exact effect that will put<br />

your app over the top (with your own tweaks added, of course), so you start to<br />

delve deeper into the workings of the code. While doing this, you come across<br />

some code that looks like this:<br />

ps.1.0<br />

tex t0<br />

dp3 r0, c0, t0<br />

...<br />

So, of course, you look just a few lines below this and see:<br />

float c0[4] = {0.2125f, 0.7154f, 0.0721f, 1.0f}; // Convert RGB to Grayscale<br />

pDevice->SetPixelShaderConstant(0, c0, 1);

Being the expert programmer that you are, you figure out that the data in the float<br />

array c0 is being placed into constant register 0, which corresponds to c0 in the<br />

shader code. You further deduce that it is taking the color value retrieved from<br />

the texture and modulating it by this constant to convert it into a grayscale value.<br />

You continue this exercise on some more lines of shader code and quickly get<br />

overwhelmed trying to remember what the different constants actually mean, flipping<br />

back and forth between the shader definition and the setting of the constants,<br />

which you can only hope happen in close proximity. Finally, you tear the page out<br />

of the book and turn it over so you can look at them side by side. Frustrated, you<br />

get out your colored pens and begin to color code the constants in the shader with<br />

those in the code where they are set. Darn, why don’t they make highlighters in<br />

all colors, like my crayons as a kid? ARGH! OK, OK, calm down. Maybe you<br />

aren’t as hyped up on caffeine to let it get this far (yeah, right), but it can be difficult<br />

to make sense of other people’s code that uses numeric identifiers for things<br />

like constants. You would probably fire the guy who wrote his C code like we used<br />

to write Pascal back in high school — with variable names like v1. There must be<br />

a better way.<br />

Yes, there is. Though the implementation is simple (trivial, really), the usefulness<br />

is enormous. This article shows you an easy, useful way to incorporate


named constants into your code that make development easier, help with bug finding,<br />

and even help with performance.<br />

The Way Things Should Be<br />

Here is the example from above with named constants:<br />

ps.1.0<br />

tex t0<br />

dp3 r0, c[RGB_TO_GRAYSCALE], t0<br />

...<br />

float afRGBToGrayscale[4] = {0.2125f, 0.7154f, 0.0721f, 1.0f}; // Convert RGB to grayscale<br />

pDevice->SetPixelShaderConstant(RGB_TO_GRAYSCALE, afRGBToGrayscale, 1);

This is obviously much more readable and maintainable.<br />

How Did They Do That?<br />


The real magic in this solution happens with a data structure to map the string<br />

names to enumerated values and some #define macros. First, let’s look at how to<br />

make the constants file. Typically, you want to create one include file (that does<br />

not have any multiple include guards on it like #ifndef or #pragma) that contains<br />

all of your constants. Remember, this is a normal include file, so comments are<br />

just fine. In my example, all of the constants can only contain alphanumeric and<br />

underscore characters, but you can change the parser to take other characters<br />

into account. For example:<br />

//////////////////////////<br />

// shaderdefines.h<br />

// This file defines all of the shader constants<br />

// Vertex shader constants<br />

DEFINECONSTANT(ONE, 1)<br />

DEFINECONSTANT(TRANSPOSE_WORLDVIEWPROJ_MATRIX, 2) // Transpose WORLDVIEWPROJ matrix<br />

// Pixel shader constants<br />

DEFINECONSTANT(RGB_TO_GRAYSCALE, 0) // Convert RGB to grayscale<br />

I placed some other examples to help you to see that any constant from any<br />

shader can be defined here.<br />

In some other code (either an include file if you have multiple source files<br />

containing shaders or in the single shader source file), you need to define the enumeration<br />

that enables you to use these values as constants in your C code. This is<br />

how it is done:<br />

#define DEFINECONSTANT(name, n) name = n,<br />

enum SHADERCONSTANTS {<br />

#include "shaderdefines.h"<br />

};<br />




#undef DEFINECONSTANT<br />

As you can see, we define a macro that takes the first parameter as the name and<br />

the second parameter as the enumerated value. Understand that you can reuse<br />

elements in the enumeration, thereby having multiple names with the same<br />

numeric value.<br />

Once the enumeration is defined, we need to create the data structure used<br />

when parsing the shader. This should probably be done in the same source file<br />

that contains the parser. This is a simple version of it:<br />

typedef struct _ShaderConstNameMap {
    char *pcName;
    int nID;
} ShaderConstNameMap;

As you can see, this simply provides a table to map from the ASCII name to the<br />

associated number. We populate this table with the following code:<br />

#define DEFINECONSTANT(name, n) {#name, n},<br />

ShaderConstNameMap G_aSCNameMap[] = {

#include "shaderdefines.h"<br />

};<br />

#undef DEFINECONSTANT<br />

We now have the enumeration and the mapping table defined; all that is left is creating<br />

the parser.<br />

Parsing the Shader Values

I decided that efficiency was not that important for the parser, since it is a small<br />

part of the loading effort. Rather than spending a lot of time on a superefficient<br />

parser with lots of bells and whistles, I decided to make it simple and just add the<br />

capability of having integer offsets for the constants (i.e., c[RGB_TO_GRAYSCALE + 2]).
Here is the code for this simple parser:

#define elementsof(x) (sizeof(x)/sizeof(x[0]))<br />

void ProcessConstants(char *pcOriginal, char *pcProcessed)<br />

{<br />

char *pcWalker = pcOriginal, *pcTagWalker, *pcLengthWalker;<br />

char *pcOutWalker = pcProcessed;<br />

int ii, nTagLength, nTagCount = elementsof(G_aSCNameMap);<br />

int nMultiplier = 1, nOffset = 0;<br />

bool bTagFound;<br />

char zTmp[8], zThisTag[80];<br />

while(*pcWalker != '\0') {<br />

bTagFound = false;<br />

if(*pcWalker == '[') {<br />

pcTagWalker = pcWalker + 1;



// Figure out how many chars to the next nonalphanum & non-underscore char<br />

pcLengthWalker = pcTagWalker;<br />

while(isalnum(*pcLengthWalker) || *pcLengthWalker == '_')<br />

pcLengthWalker++;<br />

// Copy this tag into a temporary buffer<br />

strncpy(zThisTag, pcTagWalker, pcLengthWalker - pcTagWalker);<br />

zThisTag[pcLengthWalker - pcTagWalker] = '\0';<br />

// Look for the tag in the table<br />

for(ii = 0; ii < nTagCount; ii++) {<br />

nTagLength = strlen(G_aSCNameMap[ii].pcName);<br />

if(!strcmp(G_aSCNameMap[ii].pcName, zThisTag)) {<br />

// We have a match, check for additions and subtractions...<br />

pcTagWalker += nTagLength; // Skip past the tag name<br />

while(*pcTagWalker != '\0' && *pcTagWalker != ']') {<br />

if(*pcTagWalker == '+') {<br />

// It is assumed to be addition, so skip it<br />

pcTagWalker++;<br />

}<br />

else if(*pcTagWalker == '-') {<br />

nMultiplier = -1;<br />

pcTagWalker++;<br />

}<br />

else if(*pcTagWalker == ' ') {<br />

// OK, skip it<br />

pcTagWalker++;<br />

}<br />

else if(isdigit(*pcTagWalker)) {<br />

nOffset = atoi(pcTagWalker);<br />

// Skip over all of the digits...<br />

while(isdigit(*pcTagWalker))<br />

pcTagWalker++;<br />

}<br />

else {<br />

// Probably should complain here...<br />

pcTagWalker++;<br />

}<br />

}<br />


// OK, now that we have the tag and the offset, replace this in the outgoing<br />

// string...<br />

if(*pcTagWalker == ']') {<br />

sprintf(zTmp, "%d", G_aSCNameMap[ii].nID + (nMultiplier * nOffset));<br />

*pcOutWalker = '\0'; // Temporary so the strcat will work<br />

strcat(pcOutWalker, zTmp);<br />

pcOutWalker += strlen(zTmp);<br />

pcWalker = pcTagWalker + 1; // Skips the close bracket we are on<br />





bTagFound = true;<br />

nOffset = 0; // Reset to 0 in case it was used<br />

break;<br />

}<br />

else {<br />

// Probably should complain here...<br />

}<br />

}<br />

}<br />

}<br />

if(!bTagFound) {<br />

*pcOutWalker = *pcWalker;<br />

pcOutWalker++;<br />

pcWalker++;<br />

}<br />

}<br />

*pcOutWalker = '\0';<br />

}<br />

Wrap-up

Hopefully this provides a good basis for using named constants in your code.

Another nice feature of using named constants is that you can rearrange the constants<br />

in your code to make uploading them more efficient. If you do some timing<br />

tests, you can see that the more you group the constants that change, the better<br />

performance you see when setting them. Also, you could use this same code to<br />

redefine the texture registers (t0/SetTexture()), temporary registers, etc. It might<br />

lead to shaders that are very large, but since you will probably compile them and<br />

ship a binary version, it might be worth the added readability during development<br />

time.
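Putting the pieces together, a hypothetical use of the parser might look like this; the shader string and buffer sizes are invented for illustration, while G_aSCNameMap and ProcessConstants() are the table and function listed above.

#include <stdio.h>
#include <string.h>

// Hypothetical driver code: expand the named constants in a shader string
// before handing it to the D3DX assembler.
int main()
{
    char zOriginal[256] =
        "ps.1.0\n"
        "tex t0\n"
        "dp3 r0, c[RGB_TO_GRAYSCALE], t0\n";
    char zProcessed[256];

    ProcessConstants(zOriginal, zProcessed);

    // With RGB_TO_GRAYSCALE defined as 0, the third line now reads
    // "dp3 r0, c0, t0" and the whole string can be passed to
    // D3DXAssembleShader() and CreatePixelShader() as usual.
    printf("%s", zProcessed);
    return 0;
}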


Section IV<br />

Image Space<br />

Advanced Image Processing with DirectX 9 Pixel Shaders
by Jason L. Mitchell, Marwan Y. Ansari, and Evan Hart

Night Vision: Frame Buffer Post-processing with ps.1.1 Hardware
by Guillaume Werle

Non-Photorealistic Post-processing Filters in MotoGP 2
by Shawn Hargreaves

Image Effects with DirectX 9 Pixel Shaders
by Marwan Y. Ansari

Using Pixel Shaders to Implement a Mosaic Effect Using Character Glyphs
by Roger Descheneaux and Maurice Ribble

Mandelbrot Set Rendering
by Emil Persson

Real-Time Depth of Field Simulation
by Guennadi Riguer, Natalya Tatarchuk, and John Isidoro



Advanced Image Processing with DirectX 9 Pixel Shaders

Jason L. Mitchell, Marwan Y. Ansari, and Evan Hart

Introduction

With the introduction of the ps_2_0 pixel shader model in DirectX 9, we were able
to significantly expand our ability to use consumer graphics hardware to perform
image processing operations. This is due to the longer program length, the ability
to sample more times from the input image(s), and the addition of floating-point
internal data representation. In Direct3D ShaderX: Vertex and Pixel Shader Tips
and Tricks, we used the ps_1_4 pixel shader model in DirectX 8.1 to perform
basic image processing techniques, such as simple blurs, edge detection, transfer
functions, and morphological operators [Mitchell02]. In this chapter, we extend our
image processing toolbox to include color space conversion, a better edge detection
filter called the Canny filter, separable Gaussian and median filters, and a
real-time implementation of the Fast Fourier Transform.

Review

As shown in our original image processing article in the Direct3D ShaderX book,

post-processing of 3D frames is fundamental to producing a variety of interesting<br />

effects in game scenes. Image processing is performed on a GPU by using the<br />

source image as a texture and drawing a screen-aligned quadrilateral into the back<br />

buffer or another texture. A pixel shader is used to process the input image to<br />

produce the desired result in the render target.<br />

Figure 1: Using a pixel shader for image processing by rendering<br />

from one image to another<br />


Image processing is especially powerful when the color of the destination pixel is<br />

the result of computations done on multiple pixels from the source image. In this<br />

case, we sample the source image multiple times and use the pixel shader to combine<br />

the data from the multiple samples (or taps) to produce a single output.<br />
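The mechanics of one such pass are sketched below as Direct3D 9 host code; the vertex layout and the half-pixel offset follow common practice, but the function is an illustrative sketch rather than code from this chapter, and device setup such as render states and shader constants is omitted.

#include <d3d9.h>

// Hypothetical post-processing pass: draw a screen-aligned quad that maps the
// source texture 1:1 onto the render target so the bound pixel shader can
// combine its taps into each destination pixel.
struct QuadVertex { float x, y, z, rhw, u, v; };

void RunFilterPass(IDirect3DDevice9 *dev, IDirect3DTexture9 *src,
                   IDirect3DSurface9 *dst, IDirect3DPixelShader9 *ps,
                   int width, int height)
{
    // The -0.5 offset aligns texel centers with pixel centers in Direct3D 9.
    QuadVertex quad[4] =
    {
        { -0.5f,         -0.5f,          0.5f, 1.0f, 0.0f, 0.0f },   // top left
        { width - 0.5f,  -0.5f,          0.5f, 1.0f, 1.0f, 0.0f },   // top right
        { -0.5f,         height - 0.5f,  0.5f, 1.0f, 0.0f, 1.0f },   // bottom left
        { width - 0.5f,  height - 0.5f,  0.5f, 1.0f, 1.0f, 1.0f },   // bottom right
    };

    dev->SetRenderTarget(0, dst);       // destination image
    dev->SetTexture(0, src);            // source image
    dev->SetPixelShader(ps);            // the filter itself
    dev->SetFVF(D3DFVF_XYZRHW | D3DFVF_TEX1);
    dev->DrawPrimitiveUP(D3DPT_TRIANGLESTRIP, 2, quad, sizeof(QuadVertex));
}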

Color Space Conversion<br />

Before we get into interesting multi-tap filters, we present a pair of shaders that<br />

can be used to convert between HSV and RGB color spaces. These shaders perform<br />

some relatively complex operations to convert between color spaces, even<br />

though they are only single-tap filters.<br />

For those who may not be familiar with HSV space, it is a color space that is<br />

designed to be intuitive to artists who think of a color’s tint, shade, and tone<br />

[Smith78]. Interpolation in this color<br />

space can be more aesthetically pleasing<br />

than interpolation in RGB space. Additionally,<br />

when comparing colors, it may<br />

be desirable to do so in HSV space. For<br />

example, in RGB space the color {100,<br />

0, 0} is very different from the color {0,<br />

0, 100}. However, their V components<br />

in HSV space are equal. Colors, represented<br />

by {hue, saturation, value}<br />

triples, are defined to lie within a hexagonal<br />

pyramid, as shown in Figure 2.<br />

The hue of a color is represented by<br />

an angle between 0° and 360° around the<br />

central axis of the hexagonal cone. A<br />

color’s saturation is the distance from<br />

the central (achromatic) axis, and its<br />

value is the distance along the axis. Both<br />

saturation and value are defined to be<br />

between 0 and 1.<br />

Figure 2: HSV color space

We have translated the pseudocode RGB-to-HSV transformation from [Foley90]
to the DirectX 9 High Level Shading Language (HLSL) and compiled it for the
ps_2_0 target. If you are unfamiliar with HLSL, you can refer to the "Introduction
to the DirectX 9 High Level Shading Language" article in ShaderX2:
Introductions & Tutorials with DirectX 9. As described in [Smith78], you can see

that the RGB_to_HSV() function in this shader first determines the minimum and<br />

maximum channels of the input RGB color. The max channel determines the<br />

value of the HSV color or how far along the achromatic central axis of the hexagonal<br />

cone the HSV color will be. The saturation is then computed as the difference<br />

between the max and min RGB channels divided by the max. Hue (the angle<br />

around the central achromatic axis) is then a function of the channel that had the<br />

max magnitude and thus determined the value.


float4 RGB_to_HSV (float4 color)

{<br />

float r, g, b, delta;<br />

float colorMax, colorMin;<br />

float h=0, s=0, v=0;<br />

float4 hsv=0;<br />


r = color[0];<br />

g = color[1];<br />

b = color[2];<br />

colorMax = max (r,g);<br />

colorMax = max (colorMax,b);<br />

colorMin = min (r,g);<br />

colorMin = min (colorMin,b);<br />

v = colorMax; // this is value<br />

if (colorMax != 0)<br />

{<br />

s = (colorMax - colorMin) / colorMax;<br />

}<br />

if (s != 0) // if not achromatic<br />

{<br />

delta = colorMax - colorMin;<br />

if (r == colorMax)<br />

{<br />

h = (g-b)/delta;<br />

}<br />

else if (g == colorMax)<br />

{<br />

h = 2.0 + (b-r) / delta;<br />

}<br />

else // b is max<br />

{<br />

h = 4.0 + (r-g)/delta;<br />

}<br />

}<br />

h*=60;<br />

if (h < 0)
{
    h += 360;
}

hsv[0] = h / 360.0;  // move h into the 0 to 1 range
hsv[1] = s;
hsv[2] = v;

return hsv;
}



The HSV-to-RGB transformation, also translated from [Foley90], is shown below<br />

in HLSL.<br />

float4 HSV_to_RGB (float4 hsv)

{<br />

float4 color=0;<br />

float f,p,q,t;<br />

float h,s,v;<br />

float r=0,g=0,b=0;<br />

float i;<br />

if (hsv[1] == 0)<br />

{<br />

if (hsv[2] != 0)<br />

{<br />

color = hsv[2];<br />

}<br />

}<br />

else<br />

{<br />

h = hsv.x * 360.0;<br />

s = hsv.y;<br />

v = hsv.z;<br />

if (h == 360.0)<br />

{<br />

h=0;<br />

}<br />

h /=60;<br />

i = floor (h);<br />

f = h-i;<br />

p=v*(1.0 - s);<br />

q=v*(1.0 - (s * f));<br />

t=v*(1.0 - (s * (1.0 -f)));<br />

if (i == 0)<br />

{<br />

r=v;<br />

g=t;<br />

b=p;<br />

}<br />

else if (i == 1)<br />

{<br />

r=q;<br />

g=v;<br />

b=p;<br />

}<br />

else if (i == 2)<br />

{<br />

r=p;<br />

g=v;


b=t;
}

else if (i == 3)<br />

{<br />

r=p;<br />

g=q;<br />

b=v;<br />

}<br />

else if (i == 4)<br />

{<br />

r=t;<br />

g=p;<br />

b=v;<br />

}<br />

else if (i == 5)<br />

{<br />

r=v;<br />

g=p;<br />

b=q;<br />

}<br />

color.r = r;
color.g = g;
color.b = b;
}

return color;
}

Other Color Spaces<br />

It is worth noting that RGB and HSV are not the only color spaces of interest in<br />

computer graphics. For example, the original paper [Smith78] that introduced<br />

HSV also introduced a color space called HSL (for hue, saturation, and lightness),<br />

where L is often the same as the Luminance (Y) channel used in the YIQ color<br />

space. If you are interested in learning more about color spaces, [Smith78] and<br />

[Foley90] both provide excellent discussions.<br />

Now that we have introduced some reasonably advanced single-tap image<br />

operations for converting between color spaces, we can discuss a few multi-tap<br />

filters that perform some sophisticated image processing operations.<br />

Advanced Edge Detection<br />


In Direct3D ShaderX, we discussed the Roberts and Sobel edge detection filters

[Mitchell02]. Here, we expand upon those filters and introduce an implementation<br />

of the Canny edge detection filter [Canny86].



Step-by-Step Approach<br />

As outlined in [Jain95], the Canny edge detection filter can be implemented by<br />

performing the following operations:<br />

1. Apply a Gaussian blur.<br />

2. Compute the partial derivatives at each texel.<br />

3. Compute the magnitude and direction of the line (tan⁻¹) at each point.

4. Sample the neighbors in the direction of the line and perform nonmaxima<br />

suppression.<br />

Naturally, we implement this in a series of steps, each using a different shader to<br />

operate on the output from the preceding step. A Gaussian blur is the first<br />

shader that is run over the input image. This is done to eliminate any high<br />

frequency noise in the input image. Various filter kernel sizes can be used for this<br />

step.<br />

The next step in the process is computation of the partial derivatives (P and
Q) in the u and v directions, respectively:

P = ∂f/∂u        Q = ∂f/∂v

Then the magnitude of the derivative is computed using the standard formula:

Magnitude = √(P² + Q²)

Finally, the P and Q values are used to determine the direction of the edge at that
texel using the standard equation:

θ = atan2(Q, P)

Magnitude and θ are written out to an image so that the next shader can use them
to complete the Canny filter operation. The edge direction, θ, is a signed quantity
in the range of –π to π and must be packed into the 0 to 1 range in order to prevent
loss of data between rendering passes. In order to do this, we pack it by
computing:

A = abs(θ) / π

You’ve probably noticed that due to the absolute value, this function is not invertible,<br />

hence data is effectively lost. This does not present a problem for this particular<br />

application due to symmetries in the following step.<br />

The final pass involves sampling the image to get the magnitude and the
edge direction, θ, at the current location. The edge direction, θ, must now be
unpacked into its proper range. Figure 3 shows a partitioning of all values of θ (in
degrees) into four sectors.



The sectors are symmetric and map to the possible ways that a line can
pass through a 3×3 set of pixels. In the previous step, we took the absolute
value of θ and divided it by π to put it in the 0 to 1 range. Since we know that A
is already between 0 and 1 from the previous step, we are almost done. Since
the partitioning is symmetric, it was an excellent way to reduce the number of
comparisons needed to find the correct neighbors to sample. Normally, to complete
the mapping, we would multiply A by 4 and be done. However, if you look
closely at Figure 3, you can see that the sectors are centered around 0 and 180.
In order to compensate for this, the proper equation is:

Sector = floor((A – π/16) × 4)

Figure 3: The 360 degrees of an angle partitioned into four sectors

Next, we compute the neighboring texel coordinates by checking which sector<br />

this edge goes through. Now that the neighbors have been sampled, we compare<br />

the current texel’s magnitude to the magnitudes of its neighbors. If its magnitude<br />

is greater than both of its neighbors, then it is the local maximum and the value is<br />

kept. If its magnitude is less than either of its neighbors, then this texel’s value is<br />

set to zero. This process is known as nonmaxima suppression, and its goal is to<br />

thin the areas of change so that only the greatest local changes are retained. As a<br />

final step, we can threshold the image in order to reduce the number of false<br />

edges that might be picked up by this process. The threshold is often set by the<br />

user when he or she finds the right balance between true and false edges.<br />

As you can see in Figure 4, the Canny filter produces one-pixel-wide edges<br />

unlike more basic filters such as a Sobel edge filter.<br />

Figure 4: One-pixel-wide edges from the Canny filter



Figure 5: Gradient magnitudes from the Sobel filter (see<br />

[Mitchell02])<br />

Implementation Details<br />

This shader is implemented in the VideoShader application on the companion CD
(see the section 4\04 folder) using HLSL and can be compiled for the ps_2_0
target or higher. In this implementation, the samples are taken from the eight
neighbors adjacent to the center of the filter. Looking at the HLSL code, you can
see an array of float two-tuples called sampleOffsets[]. This array defines a set of
2D offsets from the center tap, which are used to determine the locations from
which to sample the input image. The locations of these samples relative to the
center tap are shown in Figure 6.

Figure 6: Locations of taps as defined in sampleOffsets[]

The four steps of the Canny edge detection filter described above have been<br />

collapsed into two rendering passes requiring the two shaders shown below. The<br />

first shader computes the gradients P and Q followed by the Magnitude and direction
(θ). After packing θ into the 0 to 1 range, Magnitude and θ are written out to a
temporary surface.

sampler InputImage;<br />

float2 sampleOffsets[8] : register (c10);


struct PS_INPUT

{<br />

float2 texCoord:TEXCOORD0;<br />

};<br />

float4 main( PS_INPUT In ) : COLOR

{<br />

int i =0;<br />

float4 result;<br />

float Magnitude, Theta;<br />

float p=0,q=0;<br />

float pKernel[4] = {-1, 1, -1, 1};<br />

float qKernel[4] = {-1, -1, 1, 1};<br />

float2 texCoords[4];<br />

float3 texSamples[4];<br />

float PI = 3.1415926535897932384626433832795;<br />


texCoords[0] = In.texCoord + sampleOffsets[1];<br />

texCoords[1] = In.texCoord + sampleOffsets[2];<br />

texCoords[2] = In.texCoord;<br />

texCoords[3] = In.texCoord + sampleOffsets[4];<br />

for(i=0; i < 4; i++)
{
    texSamples[i] = tex2D(InputImage, texCoords[i]).rgb;
    // ... (remainder of the listing: convolve the samples with pKernel and
    // qKernel to form P and Q, compute Magnitude and Theta, pack Theta into
    // the 0 to 1 range, and write both to the output)
}



The second shader samples the two neighbors that lie along the edge direction using dependent
reads. The magnitudes of these neighbor samples along with a user-defined
threshold are then used to determine whether this pixel is a local maximum or
not, resulting in either 0 or 1 being output as the final result.

sampler InputImage;<br />

float2 sampleOffsets[8] : register (c10);<br />

float4 UserInput : register (c24);<br />

struct PS_INPUT

{<br />

float2 texCoord:TEXCOORD0;<br />

};<br />

float4 main( PS_INPUT In ) : COLOR

{<br />

int i =0;<br />

float4 result;<br />

float Magnitude, Theta;<br />

float2 texCoords[4];<br />

float4 texSamples[3];<br />

float PI = 3.1415926535897932384626433832795;<br />

// Tap the current texel and figure out line direction<br />

texSamples[0] = tex2D( InputImage, In.texCoord);<br />

Magnitude = texSamples[0].r;<br />

// Sample two neighbors that lie in the direction of the line<br />

// Then find out if this texel has a greater Magnitude.<br />

Theta = texSamples[0].a;<br />

// Must unpack theta. Prior pass made Theta range between 0 and 1<br />

// But we really want it to be either 0, 1, 2, or 3. See [Jain95]
// for more details.
Theta = (Theta - PI/16) * 4; // Now theta is between 0 and 4

Theta = floor(Theta); // Now theta is an INT.<br />

if( Theta == 0)<br />

{<br />

texCoords[1] = In.texCoord + sampleOffsets[4];<br />

texCoords[2] = In.texCoord + sampleOffsets[3];<br />

}<br />

else if(Theta == 1)<br />

{<br />

texCoords[1] = In.texCoord + sampleOffsets[2];<br />

texCoords[2] = In.texCoord + sampleOffsets[5];<br />

}<br />

else if(Theta == 2)<br />

{<br />

texCoords[1] = In.texCoord + sampleOffsets[1];


texCoords[2] = In.texCoord + sampleOffsets[6];
}

else //if(Theta == 3)<br />

{<br />

texCoords[1] = In.texCoord + sampleOffsets[0];<br />

texCoords[2] = In.texCoord + sampleOffsets[7];<br />

}<br />

// Take other two samples<br />

// Remember they are in the direction of the edge<br />

for(i=1; i
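   {
      // NOTE: the rest of this listing was lost in extraction; the following is
      // a reconstruction based on the description above (suppress this texel
      // unless it is a local maximum above the user threshold), not the
      // original code.
      texSamples[i] = tex2D(InputImage, texCoords[i]);
   }

   // Keep this pixel only if its Magnitude exceeds both neighbors along the
   // edge direction and the user-defined threshold (assumed to be UserInput.x).
   result = 0.0;
   if( Magnitude > texSamples[1].r &&
       Magnitude > texSamples[2].r &&
       Magnitude > UserInput.x )
   {
      result = 1.0;
   }

   return result;
}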



Many useful filters are separable; for example, blurring final frames in high dynamic range space to simulate light scattering is typically done with large separable kernels. In this final section of the chapter, we discuss three separable filtering operations: the Gaussian blur, a median filter approximation, and the Fast Fourier Transform.

Separable Gaussian<br />

A very commonly used separable filter is the Gaussian filter, which can be used to<br />

perform blurring of 2D images. The 2D isotropic (i.e., circularly symmetric)<br />

Gaussian filter, g2D(x, y), samples a circular neighborhood of pixels from the input<br />

image and computes their weighted average, according to the following equation:<br />

g_{2D}(x, y) = \frac{1}{2\pi\sigma^2} \, e^{-\frac{x^2 + y^2}{2\sigma^2}}

...where σ is the standard deviation of the Gaussian and x and y are the coordinates of image samples relative to the center of the filter. The standard deviation, σ, determines the size of the filter.

This means that we sample a local<br />

area of texels from the input image and<br />

weight them according to the above equation.<br />

For example, for a Gaussian with σ =

1, we compute the following filter kernel<br />

(after normalization).<br />

In theory, the Gaussian has infinite<br />

extent, but the contribution to the final<br />

result is insignificant for input texels outside<br />

of this 5×5 region.<br />
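The kernel itself appears only as a figure in the original layout; one commonly used normalized 5×5 approximation for σ = 1 (not necessarily the exact values shown in the book's figure) is:

\frac{1}{273}
\begin{bmatrix}
1 & 4  & 7  & 4  & 1 \\
4 & 16 & 26 & 16 & 4 \\
7 & 26 & 41 & 26 & 7 \\
4 & 16 & 26 & 16 & 4 \\
1 & 4  & 7  & 4  & 1
\end{bmatrix}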

An extremely important property of<br />

the Gaussian is that it is separable. That is,<br />

it can be rearranged in the following<br />

manner:<br />

g_{2D}(x, y) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{x^2}{2\sigma^2}} \right) \left( \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{y^2}{2\sigma^2}} \right) = g_{1D}(x) \cdot g_{1D}(y)

This means that we can implement a given Gaussian with a series of 1D filtering<br />

operations: one horizontal (g_{1D}(x)) and one vertical (g_{1D}(y)). This allows us to implement Gaussians with much larger kernels (larger σ) while performing the

same amount of calculations that are required to implement a smaller non-separable<br />

filter kernel. This technique was used in our real-time implementation of Paul<br />

Debevec’s Rendering with Natural Light animation as seen in Figure 7.<br />

After rendering the scene in high dynamic range space, Debevec performed a<br />

number of large Gaussian blurs on his 2D rendered scene to obtain blooms on<br />

bright areas of the scene. In order to do this in real-time, we exploited the Gaussian’s<br />

separability to perform the operation efficiently. In our case, we used σ = 7,

which resulted in a 25×25 Gaussian.


Figure 7: Frame from real-time Rendering with Natural Light (See Color Plate 20.)

Due to the fact that we have only eight texture coordinate interpolators in the<br />

ps_2_0 pixel shader programming model, we must derive some of our texture<br />

coordinates in the pixel shader as deltas from the center tap location. To make the<br />

most efficient use of the hardware, we can perform as many reads from the input<br />

image as possible using non-dependent texture reads.<br />

In our implementation, we divided our samples into three types: inner taps,<br />

outer taps, and the center tap. The center tap (c) and inner taps (x) shown in Figure<br />

8 are performed using interpolated texture coordinates (and hence non-dependent<br />

texture reads).<br />

Figure 8: Layout of 13 taps of separable Gaussian<br />

The outer taps (o) shown in Figure 8 are sampled using texture coordinates computed<br />

in the pixel shader. That is, they are done with dependent reads. Note that<br />

the center tap (c) uses pick-nearest filtering and is aligned with the center of a<br />

specific texel in the input image. The other 12 taps all use bilinear filtering and<br />

are aligned so that they sample from two different texels in the input image. This<br />

Gaussian filter is implemented in HLSL in the following shader:<br />

float4 hlsl_gaussian (float2 tapZero    : TEXCOORD0,
                      float2 tap12      : TEXCOORD1,
                      float2 tapMinus12 : TEXCOORD2,
                      float2 tap34      : TEXCOORD3,
                      float2 tapMinus34 : TEXCOORD4,
                      float2 tap56      : TEXCOORD5,
                      float2 tapMinus56 : TEXCOORD6 ) : COLOR
{
   float4 accum, Color[NUM_INNER_TAPS];

   Color[0] = tex2D(nearestImageSampler, tapZero);    // sample 0
   Color[1] = tex2D(linearImageSampler, tap12);       // samples 1, 2
   Color[2] = tex2D(linearImageSampler, tapMinus12);  // samples -1, -2
   Color[3] = tex2D(linearImageSampler, tap34);       // samples 3, 4
   Color[4] = tex2D(linearImageSampler, tapMinus34);  // samples -3, -4
   Color[5] = tex2D(linearImageSampler, tap56);       // samples 5, 6
   Color[6] = tex2D(linearImageSampler, tapMinus56);  // samples -5, -6

   accum  = Color[0] * gTexelWeight[0];               // Weighted sum of samples
   accum += Color[1] * gTexelWeight[1];
   accum += Color[2] * gTexelWeight[1];
   accum += Color[3] * gTexelWeight[2];
   accum += Color[4] * gTexelWeight[2];
   accum += Color[5] * gTexelWeight[3];
   accum += Color[6] * gTexelWeight[3];

   float2 outerTaps[NUM_OUTER_TAPS];
   outerTaps[0] = tapZero + gTexelOffset[0];          // coord for samp 7, 8
   outerTaps[1] = tapZero - gTexelOffset[0];          // coord for samp -7, -8
   outerTaps[2] = tapZero + gTexelOffset[1];          // coord for samp 9, 10
   outerTaps[3] = tapZero - gTexelOffset[1];          // coord for samp -9, -10
   outerTaps[4] = tapZero + gTexelOffset[2];          // coord for samp 11, 12
   outerTaps[5] = tapZero - gTexelOffset[2];          // coord for samp -11,-12

   // Sample the outer taps
   for (int i = 0; i < NUM_OUTER_TAPS; i++)
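   {
      // NOTE: the loop body and the end of this listing were lost in
      // extraction; this is a reconstruction consistent with the weighting
      // pattern above (gTexelWeight[4..6] for the three outer tap pairs),
      // not the original code.
      accum += tex2D(linearImageSampler, outerTaps[i]) * gTexelWeight[4 + i/2];
   }

   return accum;
}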



Another useful separable operation is an approximation to the median filter. A median filter selects the middle value from a set of samples. For example, given the values {9, 3, 6, 1, 2, 2, 8}, you can sort them to get {1, 2, 2, 3, 6, 8, 9} and select the middle value, 3. Hence, the median of these values is 3. In image processing, a median filter is commonly used to remove “salt and pepper” noise from images prior to performing other image processing operations. It is good for this kind of operation because it is not unduly influenced by outliers in the input data (i.e., the noise) the way that a mean would be. Additionally, the output of a median filter is guaranteed to be a value that actually appears in the input image data; a mean does not have this property.

As it turns out, an approximation to a 2D median filter can be implemented efficiently in a separable manner [Gennert03]. Say we have sampled a 3×3 region of our input image and ranked the nine samples. We can first take the median of each row of the ranked data and then take the median of these three row medians to get an approximation to the median of the whole 3×3 region. When the ranks are evenly distributed across the rows, this yields the fifth-ranked image sample, which is the correct value. We say that this method is only an approximation to a true median filter because the true median will not be found if the ranked data is not so evenly distributed within the filter kernel; certain arrangements of ranked data produce an incorrect median.



For a 3×3 filter kernel, however, the worst case that this separable median filter<br />

implementation will give you is the fourth or sixth rank instead of the fifth, which<br />

may be adequate for many applications.<br />
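The original figures illustrating these rankings are not reproduced here; the following worked example (illustrative, not the book's original figures) shows both cases. With evenly distributed ranks, the row medians are 2, 5, and 8, and their median is the true median 5; with a skewed arrangement the approximation lands on rank 4 instead:

\underbrace{\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}}_{\text{row medians } 2,5,8 \;\Rightarrow\; 5}
\qquad
\underbrace{\begin{pmatrix} 1 & 2 & 5 \\ 3 & 4 & 9 \\ 6 & 7 & 8 \end{pmatrix}}_{\text{row medians } 2,4,7 \;\Rightarrow\; 4}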

We have implemented this separable approximation to a median filter with a<br />

two-pass rendering approach. The first pass finds the median of each 3×1 region<br />

of the image and outputs it to an intermediate buffer. The second pass performs<br />

the same operation on each 1×3 region of the intermediate buffer. The end result<br />

is equivalent to the separable median algorithm outlined above.<br />

Median Filter HLSL Implementation<br />

In our HLSL implementation of the separable median approximation, both passes<br />

will use the FindMedian() function, which takes three scalar inputs:<br />

float FindMedian(float a, float b, float c)
{
   float median;

   if( a < b )
   {
      if( b < c )
         median = b;
      else
         median = max(a, c);
   }
   else
   {
      if( a < c )
         median = a;
      else
         median = max(b, c);
   }

   return median;
}

The first pass of the 3×3 median filter, shown below, takes three samples from<br />

the input image: the texel at the current location and the left and right neighbors.<br />

The median red, green, and blue values are found independently, and the result is<br />

written out to a temporary surface.<br />

sampler InputImage;
float2 sampleOffsets[8];

struct PS_INPUT
{
   float2 texCoord:TEXCOORD0;
};

float4 main( PS_INPUT In ) : COLOR
{
   int    i = 0;
   float4 result;
   float2 texCoords[3];
   float3 texSamples[3];

   texCoords[0] = In.texCoord + sampleOffsets[3];
   texCoords[1] = In.texCoord;
   texCoords[2] = In.texCoord + sampleOffsets[4];

   // the left and right neighbors of this texel
   for(i = 0; i < 3; i++)
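   {
      // NOTE: the loop body and the end of this pass were lost in extraction;
      // this is a reconstruction following the description above (take the
      // median of the three samples independently per channel), not the
      // original code.
      texSamples[i] = tex2D(InputImage, texCoords[i]).rgb;
   }

   result.r = FindMedian(texSamples[0].r, texSamples[1].r, texSamples[2].r);
   result.g = FindMedian(texSamples[0].g, texSamples[1].g, texSamples[2].g);
   result.b = FindMedian(texSamples[0].b, texSamples[1].b, texSamples[2].b);
   result.a = 1.0;

   return result;
}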



The second pass, shown below, performs the same operation on the intermediate buffer, this time using the current texel and its top and bottom neighbors.

sampler InputImage;
float2 sampleOffsets[8];

struct PS_INPUT
{
   float2 texCoord:TEXCOORD0;
};

float4 main( PS_INPUT In ) : COLOR
{
   int    i = 0;
   float4 result;
   float2 texCoords[3];
   float3 texSamples[3];

   texCoords[0] = In.texCoord + sampleOffsets[1];
   texCoords[1] = In.texCoord;
   texCoords[2] = In.texCoord + sampleOffsets[6];

   // the top and bottom neighbors of this texel
   for(i = 0; i < 3; i++)
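   {
      // NOTE: as in the first pass, the remainder of this listing is a
      // reconstruction: fetch the three samples and take the per-channel
      // median of the medians.
      texSamples[i] = tex2D(InputImage, texCoords[i]).rgb;
   }

   result.r = FindMedian(texSamples[0].r, texSamples[1].r, texSamples[2].r);
   result.g = FindMedian(texSamples[0].g, texSamples[1].g, texSamples[2].g);
   result.b = FindMedian(texSamples[0].b, texSamples[1].b, texSamples[2].b);
   result.a = 1.0;

   return result;
}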


Median-filtering the red, green, and blue channels of the image independently is a<br />

reasonably arbitrary decision that seems to work well for our data. You may find<br />

that another approach, such as converting to luminance and then determining the<br />

median luminance, works better for your data.<br />

Fourier Transform<br />

A very powerful concept in image processing is transformation of spatial domain<br />

images into the frequency domain via the Fourier Transform. All of the images that<br />

we have discussed so far have existed in the spatial domain. Using the Fourier<br />

Transform, we can transform them to the frequency domain, where the images<br />

are represented not by a 2D array of real-valued brightness values distributed<br />

spatially, but by a 2D array of complex coefficients that are used to weight a set of<br />

sine waves, which when added together would result in the source image. This<br />

set of sine waves is known as a Fourier series, named for its originator, Jean<br />

Baptiste Joseph Fourier. Fourier’s assertion was that any periodic signal can be<br />

represented as the sum of a series of sine waves. This applies to any sort of signal,<br />

including images. The conversion from the spatial domain to the frequency<br />

domain is performed by a Fourier Transform. In the case of digital images consisting<br />

of discrete samples (pixels), we use a Discrete Fourier Transform (DFT). The<br />

equations for performing a DFT and its inverse on a two-dimensional image are<br />

shown below:<br />

Fourier Transform

H(u, v) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} h(x, y)\, e^{-i 2\pi (ux/M + vy/N)}

Inverse Fourier Transform

h(x, y) = \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} H(u, v)\, e^{i 2\pi (ux/M + vy/N)}

...where h(x, y) is the value of the pixel located at location (x, y), H(u, v) is the value of the image in frequency space at location (u, v), M is the width of the image in pixels, and N is the height of the image in pixels.

For these equations, it is important to remember that these are complex numbers (i is the square root of –1). Additionally, from complex math:

e^{ix} = \cos(x) + i\sin(x) \quad \text{and} \quad e^{-ix} = \cos(x) - i\sin(x)

GPU Implementation<br />


A naïve implementation of these operations would be an extremely expensive processing step, O(n⁴) in big O notation. Fortunately, much research has gone into a class of algorithms known as Fast Fourier Transforms (FFTs). These algorithms refactor the transform equations above to reduce the complexity to O(n log n).

The initial algorithm described to accomplish this is referred to as “Decimation in<br />

Time” and was published in 1965 by Cooley and Tukey [Cooley65]. As it turns<br />

out, the decimation in time algorithm translates very naturally to multipass



rendering on graphics hardware with floating-point pixel processing pipelines. Our<br />

multipass rendering technique is based on the code listed in [Crane96].<br />

The FFT uses two primary optimizations to minimize its computational complexity.<br />

The first optimization that the FFT makes is to exploit the transform’s<br />

separability and break the two-dimensional transform into several one-dimensional<br />

transforms. This is done by performing a one-dimensional FFT across the<br />

rows of the image followed by a one-dimensional FFT along the columns of the<br />

resulting image. This greatly reduces the growth in complexity of the operation as<br />

the image size grows. The next optimization uses the fact that a Fourier Transform<br />

of size N can be rewritten as the sum of two Fourier Transforms of size N/2,<br />

eliminating redundant computations. This portion of the optimization reduces the<br />

cost of the one-dimensional transforms from O(n²) to O(n log n).

The first thing to note when using a GPU to implement an FFT based on the<br />

decimation in time algorithm is that, to maintain most of its efficiency improvements,<br />

the algorithm must be implemented in multiple passes by rendering to<br />

floating-point temporary buffers. If the spatial domain image is color (i.e., has multiple<br />

channels), these temporary buffers need to be set up as multiple render targets,<br />

since the frequency domain representation of the image uses complex<br />

numbers, thus doubling the number of channels on the output.<br />

For a width × height image, the decimation in time FFT algorithm takes<br />

log2(width) + log2(height) + 2 rendering passes to complete. For example, a<br />

512×512 image takes 20 rendering passes, which renders at approximately 30<br />

frames per second on today’s fastest graphics processors. Because each step of<br />

the computation is based solely on the previous step, we are able to conserve<br />

memory and ping-pong between two floating-point renderable textures to implement<br />

the following steps of the decimation in time algorithm:<br />

1. Horizontal scramble using scramble map to do dependent texture reads from<br />

the original image<br />

2. log2 (width) butterfly passes<br />

3. Vertical scramble using scramble map again<br />

4. log2 (height) butterfly passes<br />

Let’s describe each of these steps in detail.<br />

Scramble<br />

The decimation in time algorithm starts with a phase referred to as a scramble.<br />

This phase reorders the data such that:<br />

data[i] :=: data[rev(i)]<br />

...where rev(i) is the bit reverse of i.<br />

In other words, the data member at location i is swapped with the data member<br />

at the location at the bit-reversed address of i. The bit reverse of a given<br />

value is its mirror image written out in binary. For example, the bit reverse of<br />

0111 is 1110. Figure 10 shows an example of a scramble of a 16-element image.


Values connected by arrows in Figure 10 are swapped during the scramble step.<br />

Obviously, symmetric values such as 0000, 0110, 1001, and 1111 are left in place.<br />

Since pixel shaders can’t easily do such bit-twiddling of pixel addresses, the most<br />

effective way to perform the scramble step is via a dependent read from the input<br />

image using a specially authored scramble map stored in another texture to provide<br />

the bit-twiddled address from which to do the dependent read. The shader to<br />

perform such a dependent read for the horizontal scramble is shown below:<br />

sampler scramble    : register(s0);
sampler sourceImage : register(s1);

struct PS_INPUT
{
   float1 scrambleLoc:TEXCOORD0;
   float2 imagePos:TEXCOORD1;
};

float4 main( PS_INPUT In ) : COLOR
{
   float2 fromPos;

   fromPos = In.imagePos;

   // scramble the x coordinate
   // fromPos.x gets assigned the red channel of the scramble texture
   fromPos.x = tex1D(scramble, In.scrambleLoc);

   return tex2D(sourceImage, fromPos);
}

It is important to remember that the scramble map must contain enough bits to<br />

uniquely address each texel in the source image. Typically, this means the texture<br />

should be a 16-bit single channel texture, preferably an integer format such as<br />

D3DFMT_L16.<br />

Figure 10: Simple scramble of 16×1 image

Butterflies

Once the image has been scrambled, a series of butterfly operations are applied to<br />

the image. In each butterfly pass, a pair of pixels is combined via a complex multiply<br />

and add. Due to the inability of graphics processors to write to random locations<br />

in memory, this operation must be done redundantly on both of the pixels in<br />

the pair, and therefore some of the ideal FFT efficiency gains are lost. The



locations of the paired pixels are encoded in a butterfly map. The butterfly map is<br />

as wide as the source image and has one row for each butterfly step. The code for<br />

applying horizontal butterflies is shown below.<br />

//all textures sampled nearest
sampler butterfly   : register(s0);
sampler sourceImage : register(s1);

struct PS_INPUT
{
   float2 srcLocation:TEXCOORD0;
};

//constant to tell which pass is being used
float pass; // pass = passNumber / log2(width)

float4 main( PS_INPUT In ) : COLOR
{
   float2 sampleCoord;
   float4 butterflyVal;
   float2 a;
   float2 b;
   float2 w;
   float  temp;

   sampleCoord.x = In.srcLocation.x;
   sampleCoord.y = pass;
   butterflyVal  = tex2D( butterfly, sampleCoord);
   w = butterflyVal.ba;

   //sample location A
   sampleCoord.x = butterflyVal.y;
   sampleCoord.y = In.srcLocation.y;
   a = tex2D( sourceImage, sampleCoord).ra;

   //sample location B
   sampleCoord.x = abs(butterflyVal.x);
   sampleCoord.y = In.srcLocation.y;
   b = tex2D( sourceImage, sampleCoord).ra;

   //multiply w*b (complex numbers)
   temp = w.x*b.x - w.y*b.y;
   b.y  = w.y*b.x + w.x*b.y;
   b.x  = temp;

   //perform a + w*b or a - w*b
   a = a + ((butterflyVal.x < 0.0) ? -b : b);

   //make it a 4 component output for good measure
   return a.xxxy;
}


The shader performs an extremely simple operation to accomplish its goal. First,<br />

it fetches a texture to determine where on this line of the image to get two parameters<br />

a and b. This same texel contains a factor w that is combined with a and b<br />

to produce the final result. From these parameters, the algorithm can actually produce<br />

two of the results needed for the next pass (a' and b'), but since GPUs do not<br />

perform random writes to memory, the texture also includes a flag for which value<br />

to leave at this location. The following equation, a butterfly operation, shows the<br />

math used to convert a and b to a' and b'.<br />

a' = a + w·b,  b' = a − w·b

The shader only concerns itself with a single-channel image and expects that the

real component is fetched into the first component and the imaginary component<br />

is fetched into the fourth component. To handle more components, the shader<br />

does not need to change significantly, but it does need to use separate textures<br />

and multiple render targets to handle more than two channels simultaneously.<br />

The largest amount of magic is in the special butterfly texture. This texture contains<br />

the offsets of the a and b parameters to the function in its first two components<br />

and the real and imaginary parts of the w parameter in its last two<br />

components. Additionally, the second texture coordinate is given a sign to encode<br />

whether this execution of the shader should produce a' or b'. To ensure an accurate<br />

representation of all this with the ability to address a large texture, a 32-bit<br />

per-component floating-point texture is the safest choice.<br />

After the scramble and butterfly passes are applied in the horizontal direction,<br />

the same operations are applied to the columns of the image to get the vertical<br />

FFT. The overall algorithm looks something like the following pseudocode:<br />

// Horizontal scramble first
SetSurfaceAsTexture( surfaceA); //input image
SetRenderTarget( surfaceB);
LoadShader( HorizontalScramble);
SetTexture( ButterflyTexture[log2(width)]);
DrawQuad();

// Horizontal butterflies
LoadShader( HorizontalButterfly);
SetTexture( ButterflyTexture[log2(width)]);
for(i = 0; i < log2(width); i++)



// Vertical butterflies
LoadShader( VerticalButterfly);
SetTexture( ButterflyTexture[log2(height)]);
for(i = 0; i < log2(height); i++)
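{
   // NOTE: the bodies of the two butterfly loops, and the vertical scramble
   // pass that sits between the horizontal and vertical phases, were lost in
   // extraction. This is a sketch consistent with the ping-pong scheme
   // described above; the helper names are assumptions made in the style of
   // the listing, not the original pseudocode. The horizontal butterfly loop
   // has the same shape, using log2(width) and the HorizontalButterfly shader,
   // and the vertical scramble mirrors the horizontal one with the
   // VerticalScramble shader.
   SwapSurfaces( surfaceA, surfaceB);           // previous target becomes the source
   SetSurfaceAsTexture( surfaceA);
   SetRenderTarget( surfaceB);
   SetShaderConstant( pass, i / log2(height));  // pass = passNumber / log2(height)
   DrawQuad();
}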


Utilizing the FFT

Besides just providing an interesting way to look at and analyze images, the frequency<br />

space representation allows certain operations to be performed more efficiently<br />

than they could be in the spatial domain.<br />

First, removing high frequencies that contribute to aliasing can be most easily<br />

performed in frequency space. The simplest implementation of this simply<br />

crops the image in frequency space to remove the higher frequencies. This is the<br />

application of what is called the ideal filter, but its results tend to be anything but<br />

ideal on an image of finite size. The ideal filter really has an infinite width in the<br />

spatial domain, so when the cropped image is transformed back to the spatial<br />

domain, sharp edges will ring with ghosts propagating in the image. Other filters<br />

have been designed to work around such issues. One well-known filter for this<br />

sort of purpose is the Butterworth filter.<br />

Additionally, frequency space can be used to apply extremely large convolutions<br />

to an image. Convolutions in image space are equivalent to multiplication in<br />

the frequency domain. So instead of having a multiply and add for each element of<br />

a convolution mask at each pixel, as would be required in the spatial domain, the<br />

operation takes only a multiply per pixel in the frequency domain. This is most<br />

useful on large non-separable filters like the Laplacian of Gaussians (LoG), which<br />

produces a second order derivative that can be used to find contours in images. In<br />

Figure 14, a LoG filter has been applied to the reference image used throughout<br />

the section. To apply the filter in the frequency domain, the image and the filter<br />

must first both be transformed into the frequency domain with the Fourier Transform.<br />

The filter must also be centered and padded<br />

with zeros so that it is the same size as the image<br />

to which it is being applied. Once in the frequency<br />

domain, the filter and image — both of which contain<br />

complex numbers — must undergo a complex<br />

multiplication. The result is next run through<br />

the Inverse Fourier Transform. Finally, the image<br />

must be translated similar to the way in which the<br />

frequency space images are translated to get the<br />

correct image. This last step appears to be often<br />

unmentioned in discussions of this operation, but<br />

failure to do it can lead to a fruitless bug hunt.<br />
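As an illustration of the frequency-domain multiply described above, a pixel shader along the following lines could combine the transformed image with the transformed, centered, and zero-padded filter. This is a sketch rather than code from the chapter's sample application; the sampler names and the channel layout (real in red, imaginary in alpha, matching the butterfly shader above) are assumptions.

sampler imageFreq  : register(s0);   // FFT of the image
sampler filterFreq : register(s1);   // FFT of the padded filter

float4 main( float2 uv : TEXCOORD0 ) : COLOR
{
   float2 A = tex2D(imageFreq,  uv).ra;   // complex value of the image
   float2 B = tex2D(filterFreq, uv).ra;   // complex value of the filter

   // complex multiply: (A.x + i*A.y) * (B.x + i*B.y)
   float2 C;
   C.x = A.x*B.x - A.y*B.y;
   C.y = A.y*B.x + A.x*B.y;

   return C.xxxy;                         // same packing as the butterfly shader
}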

Figure 14: 17×17 Laplacian of Gaussian operation

Conclusion

In this chapter, we’ve added some sophisticated tools to our image processing toolbox, including HSV↔RGB color space conversion, the Canny edge detection

filter, and separable implementations of a Gaussian blur, a median filter, and the<br />

decimation in time formulation of the Fast Fourier Transform. We hope that these<br />

implementations, presented here in the industry standard <strong>DirectX</strong> 9 High Level<br />

Shading Language, are easy for you to drop into your own image processing



applications. We also hope that they inspire you to create even more powerful<br />

image processing operations specific to your needs.<br />

Sample Application<br />

The image processing techniques presented in this chapter were developed using<br />

live and recorded video fed to Direct3D via the Microsoft Video Mixing Renderer<br />

(VMR). The sample app, Video<strong>Shader</strong>, demonstrates the use of Direct3D and the<br />

VMR, with the above filters and several others implemented using HLSL. Source<br />

for the sample application and all of the shaders is available on the companion CD<br />

as well as the ATI Developer Relations web site (www.ati.com/developer). The<br />

latest version of Video<strong>Shader</strong> is available at http://www2.ati.com/misc/demos/<br />

ATI-9700-Video<strong>Shader</strong>-Demo-v1.2.exe.<br />

Acknowledgments

Thanks to John Isidoro of Boston University and ATI Research for the separable<br />

Gaussian filter implementation. Thanks to Michael Gennert of Worcester Polytechnic<br />

Institute and David Gosselin of ATI Research for discussions that resulted<br />

in the implementation of the separable median filter approximation.<br />

References

[Canny86] Canny, John, “A Computational Approach to Edge Detection,” IEEE

PAMI 8(6) 679-698, November 1986.<br />

[Cooley65] Cooley, J. W. and J. W. Tukey, “An Algorithm for the Machine Calculation

of Complex Fourier Series,” Mathematics of Computation, 19, 297-301, 1965.<br />

[Crane96] Crane, Randy, A Simplified Approach to Image Processing: Classical and<br />

Modern Techniques in C, Prentice Hall, 1996.<br />

[Foley90] James Foley, Andries van Dam, Steven K. Feiner, and John F. Hughes,<br />

Computer Graphics: Principles and Practice, Second Ed., Addison-Wesley, 1990.<br />

[Gennert03] Gennert, Michael, personal communication, 2003.<br />

[Jain95] Jain, Ramesh and Rangachar Kasturi, et al., Machine Vision, McGraw Hill,<br />

1995.<br />

[Mitchell02] Mitchell, Jason L., “Image Processing with 1.4 Pixel <strong>Shader</strong>s in<br />

Direct3D,” Direct3D <strong><strong>Shader</strong>X</strong>: Vertex and Pixel <strong>Shader</strong> <strong>Tips</strong> and <strong>Tricks</strong>, Wolfgang<br />

Engel, ed., Wordware Publishing, 2002, pp. 258-269.<br />

[Smith78] Smith, Alvy Ray, “Color Gamut Transform Pairs,” SIGGRAPH ’78, pp.<br />

12-19.


Night Vision: Frame Buffer Post-processing with ps.1.1 Hardware

Guillaume Werle

Introduction

A few years ago, when hardware-accelerated rendering was starting to be a common<br />

feature in every game engine, players complained that all games were somehow<br />

looking quite the same.<br />

Now that programmable hardware is available, the entire rendering process<br />

can be configured. This means that any game with a creative graphic programmer<br />

or a skilled technical artist can have its own graphic touch and look that is different<br />

from the others.<br />

Frame buffer post-processing is one of the easiest ways to achieve a unique<br />

look. Many resources on these topics are available on the Internet nowadays, but<br />

most of them make use of ps.1.4 hardware. In this article I describe how to use<br />

texture-dependent reads on ps.1.1 class hardware to achieve the following effect<br />

(see Figures 1 and 2).<br />

Figure 1: Scene from the Raw Confessions demo (models and textures by Christophe Romagnoli and Guillaume Nichols)

Figure 2: Scene from the Raw Confessions demo (models and textures by Christophe Romagnoli and Guillaume Nichols)

Description<br />

Texture-dependent reads are definitely harder to use when targeting ps.1.1 hardware.<br />

The rendering process is split into several passes to take care of this issue.<br />

Here’s a quick description of the required steps to achieve this effect:<br />

1. Render the scene in a texture.<br />

2. Convert to grayscale while rendering in another render texture.<br />

3. Use the luminance value of each pixel as an index into a gradient texture and<br />

render in the frame buffer.<br />

Technical Brief on Render Texture<br />

Instead of rendering directly in the frame buffer, the rendering must be done in a<br />

texture. Create a texture with the same size, color format, and depth format as<br />

your frame buffer, and then use the ID3DXRenderToSurface interface provided with the D3DX library to wrap the BeginScene() and EndScene() calls.

NOTES<br />

•  Render textures’ dimensions don’t need to be aligned on a power of two if the D3DPTEXTURECAPS_NONPOW2CONDITIONAL cap is set.
•  Don’t use the D3DXCreateTexture() function to create your render texture; this function will round the dimensions to the nearest power of two, even if it’s not needed.

Converting to Grayscale<br />

The luminance value of a color can be computed using a dot product.<br />

Luminance = Red × 0.3 + Green × 0.59 + Blue × 0.1

The following pixel shader applies this formula to output the luminance value in<br />

every color channel:<br />

ps.1.1<br />

tex t0 // rgb texture<br />

// c0 = (0.3, 0.59, 0.1, 1.0)<br />

dp3 r0, t0, c0 // r0 = t0.r * 0.3 + t0.g * 0.59 + t0.b * 0.1


Quad Rendering<br />

Once the scene is stored in the texture, a quad is used for rendering into the<br />

frame buffer.<br />

Microsoft Direct3D’s texture sampling rules place texel centers half a texel away from the top-left corner of the texture. For example, with bilinear filtering enabled, if you sample at the coordinates (0,0) and the addressing mode is set to wrap, the resulting color will be a mix of the four corners of the texture.

Knowing this, an offset of a half-texel size (0.5 / TextureSize) must be added<br />

to the texture coordinates.<br />

Color Remapping<br />

This is the last step of the effect. The pixel shader in charge of the color remapping<br />

uses the texreg2gb dest, src instruction. This opcode is able to interpret<br />

the green and blue color components of the source register — the grayscale texture<br />

— as texture coordinates to sample the destination register — the gradient<br />

texture.<br />

The gradient texture is a simple 1D texture. Figure 3 shows the gradient<br />

used to produce Figures 1 and 2.<br />

This code snippet shows the whole pixel shader :<br />

ps.1.1<br />

// t0 grayscale texture<br />

// t1 gradient<br />

tex t0 // grayscale texture<br />

texreg2gb t1, t0 // sample t1 at the coordinates (t0.g, t0.b)<br />

mov r0, t1 // output<br />
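For comparison, on ps_2_0-class hardware the grayscale conversion and the gradient remap could be collapsed into a single HLSL pass. This is only a sketch of the idea — the sampler names are illustrative and it is not the shader used in the demo:

sampler sceneImage;    // rendered scene (render texture)
sampler gradientMap;   // 1D gradient texture

float4 main( float2 uv : TEXCOORD0 ) : COLOR
{
   float3 rgb = tex2D(sceneImage, uv).rgb;
   float  lum = dot(rgb, float3(0.3, 0.59, 0.1));   // same weights as the ps.1.1 version
   return tex1D(gradientMap, lum);                  // remap luminance through the gradient
}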

Figure 3: 1D gradient texture

Enhancement

Those shaders leave a great deal of room for visual improvements and<br />

experimentation; for example, a blur filter can be applied while converting the picture<br />

to grayscale, and some extra textures can be blended with the gradient<br />

remapping results. Figures 1 and 2 use this technique to achieve the scanline<br />

screening effects.



Final Words<br />

I implemented this shader for a demo scene application called Raw Confession.<br />

The demo can be found on the companion CD or downloaded from my web page:<br />

http://cocoon.planet-d.net/raw/!Raw_Beta.zip.<br />

The corresponding RenderMonkey workspace can be found on the companion<br />

CD as well.<br />

Special thanks to Bertrand Carre and David Levy for proofreading this<br />

article.


Non-Photorealistic Post-processing Filters in MotoGP 2

Shawn Hargreaves

Stylized rendering techniques are cool, and with programmable shader hardware,<br />

they can be easy to implement too. This article discusses how such effects can be<br />

applied as a post-process over the top of a conventional renderer. The goal is to<br />

have a minimal impact on the structure of an existing engine so that if you are<br />

already using 100 different shaders, adding a new stylized filter should only<br />

increase this to 101, rather than needing a modified version of every existing<br />

shader.<br />

The idea is to render your scene as normal but to an off-screen texture<br />

instead of directly to the D3D back buffer. The resulting image is then copied<br />

across to the back buffer by drawing a single full-screen quad, using a pixel shader<br />

to modify the data en route. Because this filter is applied entirely as a 2D image<br />

space process, it requires no knowledge of the preceding renderer. In fact, these<br />

techniques can just as easily be used over the top of video playback as with the<br />

output of a real-time 3D engine.<br />

Setting Up the Swap Chain<br />

The first step in getting ready to apply a post-processing filter is setting up a swap<br />

chain that lets you hook in a custom pixel shader operation. A typical D3D swap<br />

chain looks something like this:<br />

Depending on which D3DSWAPEFFECT you specified when creating the device,<br />

the Present() call might swap the two buffers or it might copy data from the back<br />

buffer to the front buffer, but either way there is no room for you to insert your<br />

own shader anywhere in this process.<br />

To set up a post-processing filter, you need to create a texture with D3D-<br />

USAGE_RENDERTARGET and an associated depth buffer. You can then set<br />

EnableAutoDepthStencil to FALSE in the D3DPRESENT_PARAMETERS<br />


structure, since you will not need a depth buffer while drawing to the D3D back<br />

buffer. This results in a triple-buffered swap chain:<br />

The rendering process now looks like:<br />

•  BeginScene().
•  Set your texture as the active render target.
•  Draw the 3D scene.
•  Restore the D3D back buffer as the render target, and set your texture surface as the active texture.
•  Draw a full-screen quad to copy the image, applying pixel shader filtering effects.
•  Draw 2D user interface elements such as menus and the heads-up display, which will not be affected by stylistic processing.
•  Call EndScene() and D3D Present().

An incidental benefit of extending the swap chain in this way is that if you create<br />

your device with D3DSWAPEFFECT_COPY (so the D3D back buffer will persist<br />

from one frame to the next), you can turn on alpha blending during the filter operation<br />

to get a cheap full-screen motion blur, blending in some proportion of the<br />

previous frame along with the newly rendered image.<br />

The main disadvantage is that texture render targets do not support<br />

multisampling, so you cannot use any of the clever antialiasing techniques found<br />

in modern GPUs.<br />

What Size Render Target?<br />

Display resolutions tend to have dimensions like 640x480, 1024x768, or<br />

1280x1024, which are not powers of two. This is OK as long as the driver exposes<br />

the NONPOW2CONDITIONAL texture capability, but some hardware does not<br />

support this, and even on cards that do, pixel shader versions 1.0 to 1.3 do not<br />

allow dependent reads into such textures.<br />

The solution is simple: Round up your render target size to the next larger<br />

power of two, which for a 640x480 display mode is 1024x512. While drawing the<br />

scene, set your viewport to only use the top-left 640x480 subset of this image,<br />

and when you come to copy it across to the D3D back buffer, modify your texture<br />

coordinates accordingly.<br />

Beware of a common “gotcha” in the calculation of those texture coordinates.<br />

To preserve correct texture filtering during a full-screen image copy, a half-texel<br />

offset must be added. When copying the top-left portion of a 1024x512 render target<br />

onto a 640x480 back buffer, the correct texture coordinates are:<br />

Top left: u = 0.5 / 1024 v = 0.5 / 512<br />

Bottom right: u = 640.5 / 1024 v = 480.5 / 512


Color Conversions<br />

OK, so we are all set up to feed the output of our main renderer through the pixel<br />

shader of our choosing. What do we want that shader to do?<br />

The most obvious, and probably most widely useful, type of operation is to<br />

perform some kind of colorspace conversion. This could be as simple as a colored<br />

tint for a certain type of lens or a gamma curve adjustment to simulate different<br />

camera exposure settings. The image could be converted to monochrome or sepia<br />

tone, and by swapping or inverting color channels, night vision and infrared<br />

effects can easily be imitated.<br />

The most flexible way of transforming colors is to use the RGB value sampled<br />

from the source image as coordinates for a dependent read into a volume texture.<br />

A full 256x256x256 lookup table would be prohibitively large (64MB if stored<br />

in a 32-bit format!), but thanks to bilinear filtering of the volume texture, a much<br />

smaller table can still give adequate results. Even a small volume texture is still<br />

much bigger than a pixel shader, though, and its effects cannot so easily be<br />

changed just by modifying a few constants. So wherever possible, I think it is<br />

better to do your work directly in the shader.<br />

This is the saturation filter from MotoGP 2. The result of using this filter on<br />

Figure 1 is shown in Figure 2 (see Color Plate 21). It makes bright colors more<br />

intense and primary, while converting the more subtle tones to grayscale. This<br />

demonstrates several important principles of pixel shader color manipulation.<br />

Figure 1: The source image used by all<br />

the filters shown in the following figures<br />

ps.1.1

def c0, 1.0, 1.0, 1.0, 1.0
def c1, 0.3, 0.59, 0.11, 0.5

    tex t0                        // sample the source image

1:  mov_x4_sat r0.rgb, t0_bx2     // saturate
2:  dp3_sat r1.rgba, r0, c0       // do we have any bright colors?
3:  dp3 r1.rgb, t0, c1            // grayscale
4:  lrp r0.rgb, r1.a, r0, r1      // interpolate between color and grayscale
5:  + mov r0.a, t0.a              // output alpha

Figure 2: Color saturation filter



Instruction #1 (mov_x4_sat) calculates an intensified version of the input color.<br />

The _bx2 input modifier scales any values less than 0.5 down to zero, while the<br />

_x4 output modifier scales up the results, so anything greater than 0.625 will be<br />

saturated right up to maximum brightness. It is important to remember the _sat<br />

modifier when doing this sort of over-brightening operation because if you leave it<br />

out, calculations that overflow 1.0 will produce inconsistent results on different<br />

hardware. <strong>DirectX</strong> 8 cards will always clamp the value at 1.0 (as if the _sat modifier<br />

was present), but ps 1.4 or 2.0 hardware will not.<br />

Instruction #3 calculates a monochrome version of the input color by dotting<br />

it with the constant [0.3, 0.59, 0.11]. It doesn’t really matter what values you use<br />

for this, and [0.33, 0.33, 0.33] might seem more logical, but these values were<br />

chosen because the human eye is more sensitive to the luminance of the green<br />

channel and less sensitive to blue. These are the same scaling factors used by the<br />

standard YUV colorspace.<br />

Instruction #4 chooses between the saturated or monochrome versions of<br />

the color, based on what instruction #2 wrote into r1.a. That summed all the saturated<br />

color channels to create a Boolean selector value, which contains 1 if the<br />

input color is bright enough to saturate upward or 0 if the input is dark enough for<br />

the _bx2 modifier to round it down to black. In C, this calculation would be:<br />

color = (source brightness > 0.5) ? saturated color : monochrome color<br />

The cnd pixel shader instruction seems ideal for such a task, but in fact it is<br />

almost always better to use lrp instead. A discrete test such as cnd tends to cause<br />

popping artifacts as the input data moves past its selection threshold, where a lrp<br />

can give a smooth transition over a range of values.<br />
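As a reference point, the same saturation filter can be written in a few lines of HLSL on ps_2_0-class hardware. This is a sketch of the equivalent arithmetic, not code from MotoGP 2, and the sampler name is assumed:

sampler sceneSampler;   // the off-screen scene texture

float4 saturation_filter( float2 uv : TEXCOORD0 ) : COLOR
{
   float4 src    = tex2D(sceneSampler, uv);
   float3 boost  = saturate((src.rgb * 2 - 1) * 4);        // _bx2 followed by _x4 with saturation
   float  bright = saturate(dot(boost, float3(1, 1, 1)));  // 1 if any channel saturated upward
   float  gray   = dot(src.rgb, float3(0.3, 0.59, 0.11));  // luminance
   return float4(lerp(gray.xxx, boost, bright), src.a);    // blend between grayscale and boosted color
}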

Displacement Effects<br />

You can do color conversions by using the input color as source coordinates for a<br />

dependent texture read into a lookup table. But what if you reversed this and used<br />

another texture as the source for a dependent read into the image of your main<br />

scene? The shader is trivial:<br />

ps.1.1

tex t0                    // sample the displacement texture
texm3x2pad t1, t0_bx2
texm3x2tex t2, t0_bx2     // sample the source image

mov r0, t2

But by the appropriate choice of a displacement texture, all sorts of interesting<br />

effects can be achieved — heat haze, explosion shock waves, refraction of light<br />

through patterned glass, or raindrops on a camera lens (or in this case, sticking<br />

with a non-photorealistic theme, the mosaic effect shown in Figure 3; see Color<br />

Plate 21).


Our goal is to cover the screen with a<br />

grid of hexagons, each filled with a solid<br />

color. Imagine what a 160x100 CGA<br />

mode might have looked like if IBM had<br />

used hexagonal pixels…<br />

This can be done by tiling a hexagonal<br />

displacement texture over the<br />

screen. The values in this texture modify<br />

the location at which the main scene<br />

image is sampled. <strong>With</strong> the right displacement<br />

texture, all pixels inside a<br />

hexagon can be adjusted to sample<br />

exactly the same texel from the source<br />

image, turning the contents of that hexagon into a single flat color.<br />

The texture coordinates for stage 0 control how many times the displacement<br />

texture is tiled over the screen, while stages 1 and 2 hold a 3x2 transform<br />

matrix. If the hexagon pattern is tiled H times horizontally and V times vertically,<br />

your vertex shader outputs should be:<br />

oPos oT0 oT1 oT2<br />

(0, 0) (0, 0) (0.5 * (right – left) / H, 0, left) (0, 0.5 * (bottom – top) / V, top)<br />

(1, 0) (H, 0) (0.5 * (right – left) / H, 0, right) (0, 0.5 * (bottom – top) / V, top)<br />

(1, 1) (H, V) (0.5 * (right – left) / H, 0, right) (0, 0.5 * (bottom – top) / V, bottom)<br />

(0, 1) (0, V) (0.5 * (right – left) / H, 0, left) (0, 0.5 * (bottom – top) / V, bottom)<br />

left, right, top, and bottom are the texture coordinates given in the “What Size Render<br />

Target” section.<br />

The horizontal offset comes from the red channel of the displacement texture<br />

and the vertical offset from the green channel. The blue channel must contain<br />

solid color, as this will be multiplied with the last column of the oT1/oT2 matrix to<br />

give the base coordinates onto which the displacement is added.<br />

The only remaining question is what to put in your displacement texture. If<br />

you are good with Photoshop, you could probably draw one using the gradient fill<br />

tools, but it is easier to generate it in code. My hexagon pattern was created by a<br />

render to texture using the function:<br />

void draw_mosaic_hexagon()
{
    //            x    y       r    g    b
    draw_quad(Vtx(0.0, 0.0,    0.5, 0.5, 1.0),
              Vtx(0.5, 0.0,    1.0, 0.5, 1.0),
              Vtx(0.5, 0.5,    1.0, 1.0, 1.0),
              Vtx(0.0, 0.5,    0.5, 1.0, 1.0));

    draw_quad(Vtx(1.0, 1.0,    0.5, 0.5, 1.0),
              Vtx(0.5, 1.0,    0.0, 0.5, 1.0),
              Vtx(0.5, 0.5,    0.0, 0.0, 1.0),
              Vtx(1.0, 0.5,    0.5, 0.0, 1.0));

    draw_quad(Vtx(1.0, 0.0,    0.5, 0.5, 1.0),
              Vtx(0.5, 0.0,    0.0, 0.5, 1.0),
              Vtx(0.5, 0.5,    0.0, 1.0, 1.0),
              Vtx(1.0, 0.5,    0.5, 1.0, 1.0));

    draw_quad(Vtx(0.0, 1.0,    0.5, 0.5, 1.0),
              Vtx(0.5, 1.0,    1.0, 0.5, 1.0),
              Vtx(0.5, 0.5,    1.0, 0.0, 1.0),
              Vtx(0.0, 0.5,    0.5, 0.0, 1.0));

    draw_quad(Vtx(0.0, 0.333,  0.0, 0.333, 1.0),
              Vtx(1.0, 0.333,  1.0, 0.333, 1.0),
              Vtx(1.0, 0.667,  1.0, 0.667, 1.0),
              Vtx(0.0, 0.667,  0.0, 0.667, 1.0));

    draw_tri( Vtx(0.0, 0.333,  0.0, 0.333, 1.0),
              Vtx(1.0, 0.333,  1.0, 0.333, 1.0),
              Vtx(0.5, 0.167,  0.5, 0.167, 1.0));

    draw_tri( Vtx(0.0, 0.667,  0.0, 0.667, 1.0),
              Vtx(1.0, 0.667,  1.0, 0.667, 1.0),
              Vtx(0.5, 0.833,  0.5, 0.833, 1.0));
}

This displacement texture produces a hexagonal mosaic pattern.

Figure 3: Hexagonal mosaic filter using dependent texture reads

Rendering more complex animating patterns into the displacement texture can<br />

produce a huge range of cubist or pointillistic style distortions; check out the<br />

Kaleidoscope filter in MotoGP 2 for some examples. Scaling or rotating the offset<br />

values in oT1 and oT2 also gives interesting results.<br />

Cartoon Rendering<br />

The main characteristics of a cartoon style are black lines around the edges of<br />

objects and the use of flat color where there would normally be textured detail or<br />

smooth lighting gradients. There are plenty of ways to achieve these effects, most<br />

of which have been described in detail elsewhere, but the technique presented<br />

here is unusual in that it requires minimal changes to an existing renderer and no<br />

artwork or mesh format alterations whatsoever.<br />

The first step is to add black borders<br />

by running an edge detect filter over the<br />

image of our scene. This is done by setting<br />

the same texture three times on different<br />

stages with the texture<br />

coordinates slightly offset. The pixel<br />

shader compares the brightness of adjacent<br />

samples and, if the color gradient is<br />

steep enough, marks this as an edge<br />

pixel by turning it black. The following<br />

shader produced the image shown in Figure<br />

4 (see Color Plate 21):<br />


Figure 4: The cartoon shader starts by<br />

applying an edge detect filter.


ps.1.1

def c0, 0.3, 0.59, 0.11, 0

tex t0                      // sample the source image
tex t1                      // sample offset by (-1, -1)
tex t2                      // sample offset by (1, 1)

dp3 r0.rgba, t1, c0         // grayscale sample #1 in r0.a
dp3 r1.rgba, t2, c0         // grayscale sample #2 in r1.a

sub_x4 r0.a, r0, r1         // diagonal edge detect difference
mul_x4_sat r0.a, r0, r0     // square edge difference to get absolute value

mul r0.rgb, r0, 1-r0.a      // output color * edge detect
+ mov r0.a, t0.a            // output alpha

More accurate edge detection can be done by using a larger number of sample<br />

points or by including a depth buffer and looking for sudden changes in depth as<br />

well as color (see the references at the end of this article), but in this case we<br />

don’t actually want that precise of a result! <strong>With</strong> the samples offset along a single<br />

diagonal line, the filter favors edges in one direction compared to the other, which<br />

gives a looser, more hand-drawn appearance.<br />

Image-based edge detection can pick out borders that would be impossible to<br />

locate using a geometric approach, such as the lines around individual clouds in<br />

the sky texture.<br />

Getting rid of unwanted texture detail is not as easy to do as a post-process,<br />

so for that we do need to change the main rendering engine. This is a trivial alteration,<br />

however, as you undoubtedly<br />

already have flat color versions of all<br />

your textures loaded into memory. Simply<br />

set D3DSAMP_MAXMIPLEVEL (or<br />

D3DTSS_MAXMIPLEVEL in DX8) to<br />

something greater than zero, and all that<br />

nasty high-resolution texture detail will<br />

go away, as shown in Figure 5 (see Color<br />

Plate 21).<br />

While you are at it, if you have any<br />

alpha texture cutouts, such as trees, a<br />

few trivial changes can make their black<br />

borders thicker and more solid. You<br />

probably already have a good idea how to<br />


Figure 5: Changing the mipmap settings to remove texture detail

do that, as chances are that you spent quite a while trying to get rid of those very<br />

same black borders at some point in the past. If you are using premultiplied alpha,<br />

disable it. If you are using alpha blending, turn it off and go back to simple alpha<br />

tests. If you have D3DRS_ALPHAREF set to something sensible, change it to 0<br />

or 1 — instant black borders around everything you draw!<br />

Unfortunately, this still isn’t quite enough to give a plausible cartoon effect,<br />

so I’m going to have to break the “no new shaders” rule and change the lighting



model. Smooth gradients from light to dark just don’t look right in a cartoon<br />

world. The lighting should be quantized into only two or three discrete levels,<br />

with sudden transitions from light to shadow.<br />

This is easy to do with a texture lookup. Discard everything but the most significant<br />

light source, and then in your vertex shader, output the light intensity as a<br />

texture coordinate:<br />

#define LIGHT_DIR 1                  // object space light direction vector

dp3 oT1.x, v1, c[LIGHT_DIR]          // dot vertex normal with the light direction

This light value is used to lookup into a 1D texture containing three discrete levels<br />

of brightness:<br />
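The lookup strip itself appears only as an image in the original layout. Arithmetically, ps_2_0-class hardware could produce a similar three-level quantization without the texture lookup, along the lines of the sketch below; this is not the MotoGP 2 shader, and the band values (0.0, 0.5, 1.0) are an assumption about what such a texture might contain.

// Sketch only: quantize a 0..1 light value into three bands, mimicking a
// three-level 1D lookup texture.
float toon_light( float NdotL )
{
   float banded = min(floor(saturate(NdotL) * 3), 2);   // 0, 1, or 2
   return banded * 0.5;                                 // -> 0.0, 0.5, or 1.0
}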

Figure 6 (see Color Plate 21) shows the<br />

final cartoon renderer, combining the<br />

edge detect filter, changes to the<br />

mipmap and alpha test settings, and<br />

three-level quantized lighting. It isn’t<br />

quite a pure post-processing effect, but it<br />

still only required three new shaders:<br />

one pixel shader for the edge detection<br />

and two vertex shaders for the toon<br />

lighting (one for the bike, another for the<br />

animating rider).<br />

Figure 6: The complete cartoon mode uses a discrete three-level lighting shader on bike and rider.

Pencil Sketch Effect

The most important characteristics of a pencil sketch can be summarized as:<br />

•  Drawing starts with an empty piece of white paper, which is then darkened down by the addition of pencil strokes.
•  Both the intensity and direction of the strokes may be varied to convey shape and form, but stroke direction is mostly regular.
•  Sketches are often entirely monochrome. Even when colored pencils are used, there will be a limited color palette.
•  When animations are made from a series of sketches, they tend to be at extremely low framerates due to the amount of manual labor involved in drawing them.

The first step is obviously to create a texture holding a suitable pattern of pencil<br />

strokes. I used two images with slightly different stroke graphics aligned in opposite<br />

directions:


These are combined into a single texture, with the first stroke pattern in the red<br />

channel and the second in the blue channel. This combined stroke texture is tiled<br />

over the screen, set on texture stage 1, while the main scene image is set on<br />

stage 0. This is then processed through the shader:<br />

ps.1.1

def c1, 1.0, 0.0, 0.0, 0.0
def c2, 0.0, 0.0, 1.0, 0.0

    tex t0                        // sample the source image
    tex t1                        // sample the stroke texture

1:  mul_x4_sat r0.rgb, t0, c0     // scale the frame buffer contents
2:  mul r0.rgb, 1-r0, 1-t1        // image * pencil stroke texture
3:  dp3_sat r1.rgba, r0, c1       // r1.a = red channel
4:  dp3_sat r1.rgb, r0, c2        // r1.rgb = blue channel
5:  mul r1.rgb, 1-r1, 1-r1.a      // combine, and convert -ve color back to +ve
6:  mov_x4_sat r0.rgb, t0         // overbrighten the frame buffer contents
7:  mul r0.rgb, r1_bx2, r0        // combine sketch with base texture
8:  mul_x2 r0.rgb, r0, v0         // tint
9:  + mov r0.a, v0.a


Instruction #1 scales the input color by a constant (c0), which is set by the application.<br />

This controls the sensitivity of the stroke detection — too high and there<br />

will be no strokes at all but too low and the strokes will be too dense. It needs to<br />

be adjusted according to the brightness of your scene: somewhere between 0.25<br />

and 0.5 generally works well.<br />

Instruction #2 combines the input color with the stroke texture in parallel<br />

for both the red and blue channels. It also inverts both colors by subtracting them<br />

from one. This is important because sketching operates in a subtractive colorspace.<br />

Unlike a computer monitor, which adds light over a default black surface, a<br />

pencil artist is removing light from the white paper. It seems highly counterintuitive<br />

from my perspective as a graphics programmer, but the hatching in the<br />

blue sky area is actually triggered by the red color channel, while the hatching on<br />

the red bike comes from the blue channel! This is because there is no need for<br />

any hatching to add blue to the sky, all colors already being present in the default



white background. On the contrary, the sky needs hatching in order to remove the<br />

unwanted red channel, which will leave only the desired shade of blue. We are<br />

drawing the absence of color rather than its presence, and this means we have to<br />

invert the input values to get correct results.<br />

Instructions #3 and #4 separate out the red and blue color channels, creating Figure 7 and Figure 8, while instruction #5 combines them back together, producing Figure 9 (see Color Plate 22). Although this is purely a monochrome image, the input color is controlling the direction of the stroke texture. The blue sky and red bike are shaded in opposing directions, while dark areas such as the wheels combine both stroke directions to give a crosshatch pattern.

Figure 7: Sketch strokes keyed off the inverse of the red color channel
Figure 8: Sketch strokes in the alternate direction keyed off the inverse blue channel
Figure 9: Both stroke directions combined together

Instruction #6 scales up the input color by a massive amount, producing Figure 10 (see Color Plate 22). Most of the image has been saturated to full white, with only a few areas of intensely primary color retaining their hue. When this is multiplied with the sketch pattern (instruction #7, producing Figure 11), it reintroduces a small amount of color to select parts of the image, while leaving the bulk of the hatching in monochrome.

Figure 10: Overbrightening the source image removes all but the most primary of colors.


The final step, in instruction #8, is to apply a colored tint. Figure 12 (see Color Plate 22) shows the final image with a sepia hue.

Figure 11: The stroke pattern from Figure 9 is multiplied with the color data from Figure 10.
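For readers who find the ps.1.1 assembly hard to follow, the same sequence can be sketched in HLSL. This is only an illustrative restatement, not the shipping MotoGP 2 code; the sampler, constant, and function names are assumptions, and the _x2/_x4/_bx2 modifiers are written out as explicit arithmetic.

sampler sceneImage  : register(s0);   // rendered frame (stage 0), assumed name
sampler strokeImage : register(s1);   // stroke patterns in red and blue (stage 1), assumed name
float4  strokeScale : register(c0);   // stroke-detection sensitivity (roughly 0.25 to 0.5)

float4 PencilSketchPS( float2 uv0  : TEXCOORD0,
                       float2 uv1  : TEXCOORD1,
                       float4 tint : COLOR0 ) : COLOR
{
    float4 scene   = tex2D(sceneImage,  uv0);
    float4 strokes = tex2D(strokeImage, uv1);

    // #1: scale the scene and saturate (mul_x4_sat with c0)
    float3 scaled = saturate(scene.rgb * strokeScale.rgb * 4.0);

    // #2: invert both inputs and multiply -- the subtractive color space step
    float3 inked = (1.0 - scaled) * (1.0 - strokes.rgb);

    // #3, #4: separate the red- and blue-keyed stroke directions
    float red  = inked.r;
    float blue = inked.b;

    // #5: combine the two directions into one monochrome sketch value
    float sketch = (1.0 - blue) * (1.0 - red);

    // #6: overbrighten the scene (mov_x4_sat)
    float3 bright = saturate(scene.rgb * 4.0);

    // #7: _bx2 the sketch value and modulate the overbrightened scene
    float3 result = (sketch * 2.0 - 1.0) * bright;

    // #8, #9: tint (mul_x2 with the vertex color) and pass its alpha through
    return float4(saturate(result * tint.rgb * 2.0), tint.a);
}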

This is all very well for a still image, but how is a pencil sketch to move? It looks silly if the pencil strokes stay in exactly the same place from one frame to the next, with only the image beneath them moving. But if we randomize the stroke texture in any way, the results will flicker horribly at anything approaching a decent refresh speed. Real sketched animations rarely run any faster than ten or 15 frames per second, but that is hardly desirable in the context of a 3D game engine!

My compromise was to run at full framerate, redrawing the input scene for each frame, but to only move the stroke texture at periodic intervals. The low framerate movement of the pencil strokes can fool the eye into thinking that the scene is only being redrawn at a plausible pencil sketch type of rate, but it remains smooth and responsive enough for a player to interact with the underlying game.
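On the shader side, this compromise only needs the stroke texture coordinates to pass through an offset that the application changes every few frames and leaves alone in between. A minimal sketch (the constant and sampler names are placeholders, not the original MotoGP 2 code):

float4  strokeOffset;                 // written by the app only every few frames
sampler strokeImage : register(s1);   // tiled stroke texture on stage 1

// Sample the stroke pattern with the slowly updating offset; the scene
// itself is still rendered and post-processed at full framerate.
float4 SampleStrokes(float2 strokeUV)
{
    return tex2D(strokeImage, strokeUV + strokeOffset.xy);
}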

Figure 12: The final sketch image is given a yellow tint.

References

Better edge detection methods:

Mitchell, Jason L., “Image Processing with 1.4 Pixel Shaders in Direct3D,” Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks, Wolfgang Engel, ed., Wordware Publishing, 2002, pp. 258-269.

Mitchell, Jason, Chris Brennan, and Drew Card, “Real-Time Image-Space Outlining for Non-Photorealistic Rendering,” SIGGRAPH 2002, http://www.ati.com/developer/SIGGRAPH02/NPROutlining_Mitchell.pdf.

Applying image post-processing techniques to video streams:

Ansari, Marwan, “Video Image Processing Using Shaders,” ATI Research, http://www.ati.com/developer/gdc/GDC2003_VideoShader.pdf.


Doing “proper” sketch rendering with awareness of the underlying geometry, rather than as an image post-process. These techniques can give far more sophisticated results but are less easy to fit over the top of an existing renderer:

Buchin, Kevin and Maike Walther, “Hatching, Stroke Styles, and Pointillism,” ShaderX2: Shader Programming Tips & Tricks with DirectX 9, Wolfgang Engel, ed., Wordware Publishing, Inc., 2004, pp. 340-347.

Praun, E., H. Hoppe, M. Webb, and A. Finkelstein, “Real-time hatching,” SIGGRAPH 2001, http://research.microsoft.com/~hoppe/hatching.pdf.

The conventional approach to cartoon rendering, using geometry rather than image post-processing:

NVidia sample program, http://developer.nvidia.com/view.asp?IO=Toon_Shading

The cartoon and pencil sketch techniques presented in this article were developed for the Xbox game MotoGP by Climax and published by THQ. MotoGP 2 adds new filters such as saturate and mosaic and supports these effects in the PC version as well as on the Xbox.


Image Effects with DirectX 9 Pixel Shaders

Marwan Y. Ansari

Introduction

When most engineers think of pixel and vertex shaders, they instantly think of fully 3D scenes with animations and textures and the like. However, in the article “Image Processing with 1.4 Pixel Shaders in Direct3D” [Mitchell02] as well as the article “Advanced Image Processing Using DirectX 9 Shaders” in this book, it was shown that shaders can be used for far more than the hardware architects planned.

In “Advanced Image Processing Using DirectX 9 Shaders,” shaders are used to perform image processing operations, such as Canny edge detection, HSV↔RGB color conversions, and the Fast Fourier Transform. In this article, we deal more with the implementation of cool (well, at least we think so) image space special effects rather than the science of image processing.

This chapter discusses three classes of image effects:

• Transitions
• Distortions
• Posterization

All of the effects discussed in this chapter were developed using the VideoShader application supplied on the companion CD and found on the ATI developer relations web site. Furthermore, all of the effects you see here run in real time on live video (most at better than 30 frames per second on Radeon 9700 class hardware).



Some Review and Notes about This Article

The techniques in this article rely on some basic concepts when processing images. Firstly, all techniques discussed here are applied to images stored in textures. Though these algorithms were built to be applied to live video, they are not limited to that field. The incoming texture can be the result of a 3D rendered scene, live video, or any 2D image.

The vertex shader does little more than transform the vertex positions and normals. It does, however, pack both transformed and raw normal information into texture coordinates so that the pixel shader can use them.

The geometry used for these effects is a simple screen-aligned quad. When we render the screen-aligned quad, we put the image that we want to process on one of the texture samplers, typically sampler 0, and render. We term each rendering of the quad a pass. For 3D graphics engineers, this makes complete sense. However, in the area of image processing and image effects, rather than saying we rendered a single quad, we say we “processed the image” once or “went over the image” once. So in this article, though each pass does render a screen-aligned quad, it is referred to as a pass over the image.

The two implementations of the Kuwahara filter rely on multiple passes over the image while the other effects in this article are single pass. In some cases, the output of pass n is used as the input to pass n+1. In other cases, one pass outputs to a temporary buffer (a renderable texture), the next pass outputs to another temporary buffer (another renderable texture), and a third pass samples both of those buffers in order to get the final resultant image.

Some algorithms discussed in this article and in the “Advanced Image Processing Using DirectX 9 Shaders” article rely on constants that are loaded from the application. Some constants may not change over time, while some may change every frame. Below is a list of constants used in the effects discussed in this chapter:

• sampleOffsets[8]: Offsets for sampling the 3×3 area around the current texel. See Figure 1 and the sketch after this list. Pixel width and height are defined as 1/image width and 1/image height, respectively. Therefore, sampleOffsets[2] contains the pixel width in the x component and the pixel height in the y component, while sampleOffsets[6] contains 0 in the x component and negative pixel height in the y component. These values are added to the incoming texture coordinate inside the pixel shader to sample the eight nearest neighbors.

Figure 1: X represents the current texel surrounded by its eight nearest neighbors. The values in the neighboring texels (0...7) are the indexes used to reference them in sampleOffsets.


• viewMatrix: View matrix. This is used in distortion effects.
• UserInput.z: Last X location the user clicked. This is used in thresholding images. The left edge is 0 and the right edge is 1.
• UserPoint1.xy: First location that the user clicked. Coordinates are in the range [0,1]. Used in the ripple shader.
• UserPoint2.xy: Second location that the user clicked. Coordinates are in the range [0,1]. Used in the ripple shader.
• Timers.w: Time value oscillating between –1 and 1. This allows effects to use time to animate. The application updates this value on behalf of the shader. It takes about 15 seconds for the time value to cycle from 1 to –1 and back to 1.
• Pt1Time.x: Time value remaining for the first click. It decreases from 1 to 0. This allows effects to run for a while and then stop without having the application swap out the shader. Used in the ripple shader.
• Pt1Time.y: Time value remaining for the first click. It increases from 0 to 1. This allows effects to run for a while and then stop without having the application swap out the shader. Used in the ripple shader.
• Pt2Time.x: Same as Pt1Time.x but for the second ripple.
• Pt2Time.y: Same as Pt1Time.y but for the second ripple.
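As a concrete illustration of the Figure 1 layout, one set of values consistent with the description above is shown below. In VideoShader the application writes these into the sampleOffsets constants each frame; the image size, the pixelWidth/pixelHeight names, and the static initializer here are only for illustration.

static const float pixelWidth  = 1.0 / 640.0;   // 1/image width  (example size)
static const float pixelHeight = 1.0 / 480.0;   // 1/image height (example size)

// One layout consistent with the description above: indexes 0..2 form one row,
// 3..4 the sides, 5..7 the opposite row, so that sampleOffsets[2] holds
// (+pixelWidth, +pixelHeight) and sampleOffsets[6] holds (0, -pixelHeight).
static const float2 sampleOffsetsExample[8] =
{
    float2(-pixelWidth,  pixelHeight), float2(0,  pixelHeight), float2( pixelWidth,  pixelHeight),
    float2(-pixelWidth,  0),                                    float2( pixelWidth,  0),
    float2(-pixelWidth, -pixelHeight), float2(0, -pixelHeight), float2( pixelWidth, -pixelHeight)
};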

Finally, all of the shaders in this chapter are implemented using Microsoft’s High Level Shading Language (HLSL) compiled for the ps.2.0 shader model under DirectX 9. For more information on HLSL, the ps.2.0 shader model, or DirectX 9, please see the Microsoft web site, the DirectX 9 SDK help files, or ShaderX2: Introductions & Tutorials with DirectX 9 [Mitchell04].

Transition Effects

Transition effects can be used when creating cut scenes in a game or when switching between multiple video streams. They are the simplest of image effects to perform because they tend to rely on spatial positioning inside the image more than on the content of the pixels. In this section we discuss three transition effects: left-right slide, spin and shrink away, and spice transitions (created by Pixelan Software LLC).

Left-Right Slide

The left-right slide transition is fairly straightforward. It creates the appearance that one image is sliding in over the other. All the work is done in the pixel shader, and no constants other than time are needed. Figure 2 shows snapshots of the progression of the left-right slide transition.


Figure 2: In the five snapshots above, you can see the image of the messy cubicle slide over the product logo until it is finally completely hidden.

The left-right slide algorithm:

For each texel (u,v)
    time = time * 1.1
    u' = u – time
    If (u' < 0 or u' > 1)
        Sample background image using u,v
    Else
        Sample first image using u',v
    Return sample

The left-right slide transition (the first lines of this listing are missing in this copy; the declarations below are assumed to match the other transition shaders in this article):

sampler inputImage;     // image to be processed (assumed; matches the squeeze shader)
sampler productImage;   // background image (assumed)
float4  Timers;         // time value (assumed)

struct PS_INPUT
{
    float2 texCoord:TEXCOORD0;
};

float4 main( PS_INPUT In ) : COLOR
{
    // xPosition = [-1.1, 1.1]
    float  xPosition = Timers.w * 1.1;
    float2 newCoords;
    float4 c = .5;

    //===========================================//
    // Calc new u coord by subtracting off time. //
    // (time range : [-1.1, 1.1])                //
    //===========================================//
    xPosition = In.texCoord.x - xPosition;

    // if the u coord is outside the 0 to 1 range,
    // the background image is showing.
    if((xPosition < 0) || (xPosition > 1))
    {
        c = tex2D(productImage, In.texCoord);
    }
    else
    {
        newCoords.y = In.texCoord.y;
        newCoords.x = xPosition;
        c = tex2D( inputImage, newCoords);
    }
    return c;
}

Left-Right Squeeze

In a similar vein, we can squeeze the two images to make the transition. Again, the only constant used is time. Instead of clipping the images against the left or right side of the window, we keep one image touching the left edge while the other is always touching the right edge. As time changes, the images squeeze and stretch in the window. Figure 3 shows snapshots of the progression of the left-right squeeze.

Figure 3: In the five snapshots above, you can see the image of the messy cube squeeze against the product image as time goes forward.

The left-right squeeze transition algorithm:

For each texel
    Length = Scale and bias time to be between 0 and 1
    If (Length < u)
        u' = u scaled based on length
        Sample first image using u',v
    Else
        u' = u scaled based on length
        Sample second image using u',v
    Return sample

This algorithm scales and biases time from –1 to 1 into the range of 0 to 1. Doing so allows us to use the new time value as the current location of the boundary between the two images. Bear in mind that time ranges from 0 to 1 independently of u'.

Then the algorithm goes on to compute u' based on the length of one of the images:

    u' = (u – length) / (1 – length)    if the new time value is less than this texel's u value

...or:

    u' = u / length    if the new time value is greater than this texel's u value

Here, (u – length) is the distance between the texel and the left edge of the image and (1 – length) is the total width of the image during this time. This division results in a value that ranges from [0,1]. For example, with length = 0.75, a texel at u = 0.9 maps to u' = (0.9 – 0.75) / (1 – 0.75) = 0.6 within that image.

The left-right squeeze transition:

//----------------------------------------------------------//
// LRSlidingTransition20.hlsl
//
// Left to right squeezing transition between two images.
// Compute the squeezing square’s current position based on
// time...see if the current pixel is inside or outside that
// square and pick the right sample to display.
//
// Marwan Y. Ansari - ATI Research, Inc. - 2003
//----------------------------------------------------------//
sampler inputImage;     // image to be processed
sampler productImage;   // background image
float4  Timers;         // time value

struct PS_INPUT
{
    float2 texCoord:TEXCOORD0;
};

float4 main( PS_INPUT In ) : COLOR
{
    float  length = Timers.w;   // length = [-1..1]
    float2 scaledCoords;
    float4 c = .5;

    //===================================================//
    // Get the length of the shrinking area based on the
    // time. Here we make length = [0..1]
    //===================================================//
    length = (length * 0.5) + 0.5;

    // find out where this point is in relation to the image
    if ( length < In.texCoord.x )
    {
        //================================================//
        // y doesn't change
        //
        //          1
        // x = ---------- * (texCoord.x - length)
        //     1 - length
        //
        // which maps the image into the right side of
        // the screen.
        //================================================//
        scaledCoords.y = In.texCoord.y;
        scaledCoords.x = (1.0 / (1.0 - length)) *
                         (In.texCoord.x - length);

        c = tex2D(productImage, scaledCoords);
    }
    else
    {
        //================================================//
        // y doesn't change
        //
        //        1
        // x = ------ * texCoord.x
        //     length
        //
        // which maps the image into the left side of the
        // screen.
        //================================================//
        scaledCoords.y = In.texCoord.y;
        scaledCoords.x = (1.0 / length) * In.texCoord.x;
        scaledCoords.x = saturate(scaledCoords.x);

        c = tex2D( inputImage, scaledCoords);
    }
    return c;
}

Spin and Shrink Away

Another interesting transition effect can give the appearance of geometry without actually sending any new vertices down the pipe. Below is the algorithm for a transition that makes one image appear to spin and shrink away from the viewer. This is quite simple to do by applying a transformation in the pixel shader. Again, only one set of vertices is ever sent down. As seen in the shaders above, this method relies on knowing where you are in the image (by way of u,v) and picking an image to display. Figure 4 shows snapshots of the progression of the spin and shrink away transition.

Figure 4: In the five snapshots above, you can see the image of the messy cubicle rotate and shrink away until it is completely gone.

The spin and shrink transition algorithm:

For each texel (u,v)
    Length = time scaled and biased to be 0 to 1
    Get sine and cosine based on length
    Construct rotation matrix
    Rotate coordinates (u,v) about the center
    If (current u,v are inside the transformed coordinates)
        Compute u',v' based on time and rotation of u,v
        Sample first image using u',v'
    Else
        Sample second image using u,v
    Return sample

A 1D texture is used to compute the sine and cosine values used in the rotation matrices. The texture filter naturally and linearly interpolates to compute values not present in the sine table. Since sine and cosine can be considered the same function one quarter out of phase, you can do the following:

    sin = fetch into sin texture (length)
    cos = fetch into sin texture (length + .25)

After constructing the rotation matrix, the coordinates must be translated to the origin before transformation. Rather than building this directly into the matrix through matrix concatenation, we simply subtract the 0.5 offset before the transformation. This saves operations, since it reduces matrix multiplies and does vector addition instead, and it centers the square at (0, 0) instead of at (0.5, 0.5).

Now the current coordinate, which may or may not be an element of the spinning, shrinking square, has been translated so that it is centered about the origin. All the points in the spinning, shrinking square are bound between (lengthOver2, lengthOver2) ... (–lengthOver2, –lengthOver2). Since all the coordinates are symmetric and are bound by lengthOver2, we can simplify the comparison by taking the absolute value of the coordinates. If the absolute value of either u' or v' is greater than lengthOver2, then this point is considered outside the rotating square. See Figure 5.

If the current texel is considered outside the spinning, shrinking square, we can simply display the background image. If it is inside the square, we must recompute the texture coordinates based on rotation and the shrinking effect. This is done in two steps. First, compute the coordinates inside the square and then rotate those coordinates.


The square’s edge (in texture space) that corresponds to (0, 0) will be at offset = 0.5 – lengthOver2. That is the offset from either the left edge or the bottom of the texture. Subtracting that value from our current u,v and dividing the entire amount by length scales the current u,v properly, based on its shrinking as time progresses:

    (u', v') = ((u, v) – (0.5 – lengthOver2)) * (1 / length)

The resulting u',v' is measured relative to (or from) the top-left edges of the square (this is actually a translation of the u,v axes from the top-left corner of the texture to the top-left corner of the square). Then, division by length results in u,v pairs that are in the range [0,1]: (0,0) at the top-left corner of the square, (1,1) at the bottom-right corner of the square. Thus, the texture is completely mapped onto the spinning, shrinking square.

The coordinate u',v' is correctly placed inside the shrunken square and ready for the final step, which is a rotation about the center of the square. As we did before, subtract 0.5 from the coordinates, multiply by the rotation matrix built earlier, and add 0.5 to the result:

    (u', v') = (u', v') – 0.5
    (u', v') = mul((u', v'), rotationMatrix) + 0.5

The u',v' coordinates computed are now properly scaled down and rotated based on time. Now sample the image to be shrunk and spun away with the u',v' coordinates and return the sample.
The spin and shrink away transition:<br />

//----------------------------------------------------------//<br />

// SpinTransition 2 20.hlsl<br />

//<br />

// Rotates image as it spins to and from the user.<br />

// Compute the shrinking, spinning square’s current position<br />

// based on time...see if the current pixel is inside or<br />

// outside that square and pick the right sample to display.<br />

//<br />

// Marwan Y. Ansari - ATI Research, Inc. - 2003<br />

Section IV — Image Space<br />

Image Effects with <strong>DirectX</strong> 9 Pixel <strong>Shader</strong>s<br />

Figure 5: We can optimize the number of comparisons that we need to<br />

perform by using the absolute value of our coordinates. This limits their range<br />

to the shaded area above.<br />

489


Section IV — Image Space<br />

490 Image Effects with <strong>DirectX</strong> 9 Pixel <strong>Shader</strong>s<br />

//----------------------------------------------------------//<br />

sampler inputImage;<br />

sampler sinTexture;<br />

sampler productImage;<br />

float4 Timers;<br />

struct PS INPUT<br />

{<br />

float2 texCoord:TEXCOORD0;<br />

};<br />

float4 main( PS INPUT In ) : COLOR<br />

{<br />

float length = Timers.w; // length = [-1..1]<br />

float lengthOver2, sinValue, cosValue;<br />

float2 sinCoords = 0;<br />

float3x3 rotMat =0; // initialize the matrix to 0<br />

float3 coords=0;<br />

float3 rotatedCoords;<br />

float4 c=.5;<br />

//===============================//<br />

// Get the length of the shrinking square based<br />

// on the time...this is done above. Here we make<br />

// length = [0..1]<br />

length = ((length * 0.5) + 0.5 ) ;<br />

lengthOver2 = length * 0.5;<br />

// Use length and length - .25 to sample sin and cos<br />

sinCoords.x = length;<br />

sinValue = tex2D( sinTexture, sinCoords);<br />

sinCoords.x = length - .25;<br />

cosValue = tex2D( sinTexture, sinCoords);<br />

// build rotation matrix<br />

rotMat[0][0] = cosValue;<br />

rotMat[0][1] = sinValue;<br />

rotMat[1][0] = -sinValue;<br />

rotMat[1][1] = cosValue;<br />

rotMat[2][2] = 1.0;<br />

// Rotate the box around the screen center<br />

coords.xy = In.texCoord -.5 ;<br />

rotatedCoords = mul( coords, rotMat);<br />

// Now the coords have been x-lated and rotated about the<br />

// origin. Test if its abs() goes past the lengthOver2.<br />

// This method simplifies the compare to a simple


}<br />

//<br />

// Since all the coords are symmetric...and since they are<br />

// bound by lengthOver2, we can simplify the comparison by<br />

// taking the absolute value of the coords.<br />

// If the abs value of either x or y is greater than lengthOver2,<br />

// then this point is considered outside the rotating square.<br />

rotatedCoords = abs(rotatedCoords);<br />

if( lengthOver2
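The rest of this listing is cut off in this copy. A minimal sketch of the remainder, following the algorithm and the remapping equations above (it reuses the names from the listing and assumes productImage holds the background image), would be:

    // outside the rotating square: show the background image
    if( (lengthOver2 < rotatedCoords.x) || (lengthOver2 < rotatedCoords.y) )
    {
        c = tex2D( productImage, In.texCoord);
    }
    else
    {
        // map the texel into the shrunken square...
        float2 newCoords = (In.texCoord - (0.5 - lengthOver2)) / length;

        // ...then rotate it about the square's center
        coords.xy = newCoords - 0.5;
        newCoords = mul( coords, rotMat).xy + 0.5;

        c = tex2D( inputImage, newCoords);
    }
    return c;
}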



Spice Transitions

The last transition we discuss uses two maps: one to control the transition effect and one to help blending. The transitions, called spices, can be used as plug-ins to a number of video editing tools. See the Pixelan web site for more details.

The spice transition algorithm:

For each texel
    Sample image1, image2, and transition map
    LUTcoord = transition map sample + time     // time = [-1, 1]
    Sample LUTblender based on LUTcoord
    Result = lerp between image1 and image2 based on LUTblender
    Return Result

Figure 6: Three spice transition maps. As time increases, the light areas are the first to transition between the two images while the dark areas are the last. As time decreases, the light areas transition last and the dark areas transition first.

Figure 7: Lookup table (LUT) used for transitioning between two images. The gray area in the center keeps the edges from making hard, noticeable transitions.

Figure 6 shows three transition maps. As time increases, the light areas are the first to transition between the images, while the dark areas are last. Figure 7 shows the lookup table (LUT) that is used to linearly interpolate (lerp) between the two images. The gray area in the middle ensures that the edge transitions are not too harsh.

What is particularly useful about this approach is that you can create many interesting effects by changing the transition maps. More transition maps are available on the CD in the VideoShader application (see Figure 8).

However, these are just a few possible transition effects. Just about any transition effect that is currently available today in the television or movie industry can be done inside of a pixel shader.


A spice transition:

//--------------------------------------------------------------//
// ImageFade20.hlsl
//
// Performs a wipe transition between 2 images.
//
// Marwan Y. Ansari - ATI Research, Inc. - 2002
//--------------------------------------------------------------//
sampler inputImage;
sampler ProductImage;
sampler WipeTransition;   // Transition map
sampler LUT;              // Lookup table
float4  Timers;

struct PS_INPUT
{
    float2 texCoord0:TEXCOORD0;
};

float4 main( PS_INPUT In ) : COLOR
{
    float4 c = .5;
    float4 currFrameSample;
    float4 targetSample;
    float4 transitionSample;
    float4 lutSample;
    float2 lutCoord;
    float  timeScalar;

    currFrameSample  = tex2D( inputImage,     In.texCoord0);
    targetSample     = tex2D( ProductImage,   In.texCoord0);
    transitionSample = tex2D( WipeTransition, In.texCoord0);

    timeScalar = Timers.w;

    // .7 just slows the transition down.
    // saturate ensures that we clamp from 0 to 1.
    lutCoord  = saturate(transitionSample + .7 * timeScalar);
    lutSample = tex2D( LUT, lutCoord);

    c = lerp(targetSample, currFrameSample, lutSample);

    return c;
}

Figure 8: Spice transitions using the maps shown in Figure 6 over time.

Distortion Effects

The effects discussed so far have dealt with blending between images for scene transitions. They were based on time and the spatial position of the current pixel shader coordinate or on the value of a transition map at a pixel location. Other techniques involve directly affecting the image by distorting it. Distorting the image is a common effect in gaming as well as in the movie industry.

Distortion effects executed fully in the pixel shader can be useful when the application does not want to or cannot create the extra geometry to perturb the image. This section shows how, in some cases, environment mapping could be used to fake geometry and even animation.

For example, the ripple effect discussed here uses no normal maps and no extra geometry to create the ripples. They are all generated inside the pixel shader based on a time value.

Fun House Mirror

The first distortion effect mimics the fun house mirror that you might see at a carnival. A fun house mirror is simply a mirror with a curve in it. When you look into it, instead of seeing a 1:1 reflection of yourself, you see a distorted image due to the curves in the material. Rather than actually rendering the curvy geometry in real time, we use a normal map that was generated using ATI’s Normal Mapper Tool (available for download on the ATI developer web site).

Figure 9: Fun house mirror effects

The normal map used here is 16-bit per component and contains only the x and y components. The z component (assumed to be positive) can be derived from x and y using the following relation:

    z = sqrt(1 – x² – y²)

Computing this third component costs three instructions in a ps.2.0 shader. However, the savings of writing 32 bits per normal map pixel vs. writing 64 bits is significant. Conversely, with this approach we are able to store a 16-bit per component normal map in the same footprint as an 8-8-8-8 texture, gaining much-needed precision.

The fun house mirror algorithm:

For each texel
    Fetch normal from normal map
    Compute third component (this step could be optional)
    Derive per-vertex eye vector using camera position
    Compute reflection vector
    Sample from cube map based on reflection vector
    Return sample

As you can see, this is simple cube mapping, except that rather than using geometry to compute the surface normal, we use a normal map. This method is particularly effective when used with live video, as the interactivity is what makes the effect interesting.

Since we are dealing with a single image rather than a full 3D environment, we need to discuss how exactly we populate the cube map. Only one face of the cube map (the positive z face) is populated with the target image, while the rest of the cube map is initialized to blue.

Also, when we compute the normal in the shader, we add a “fudgeFactor” component. This has nothing to do with the actual algorithm. It is strictly to overcome an issue with the normal map used that caused the normals to point outside the positive z face. This is strictly an aesthetic issue, not an algorithmic one. fudgeFactor was found by trial and error until the final image was acceptable. The VideoShader application allows you to edit the shader files using any text editor and reload the new shader without having to restart the application. See the documentation in the VideoShader directory for more information.

A fun house shader using a cube map:

//-------------------------------------------------------------//
// FunHouseCubeShader.hlsl
//
// Renders the image with a fun house mirror effect
// using a cube map.
//
// Marwan Y. Ansari - ATI Research, Inc. - 2002
//-------------------------------------------------------------//
sampler  normalMap;
sampler  cubeTexture;
float3x3 viewMatrix;   // World view matrix

struct PS_INPUT
{
    float2 texCoord0:TEXCOORD0;
    float3 texCoord4:TEXCOORD4;   // interpolated pos in obj space
};

float4 main( PS_INPUT In ) : COLOR
{
    float4 fudgeFactor = {0, -0.00001, 0.21, 0.0};
    float4 c = .5;
    float2 normalCoord = In.texCoord0;
    float3 normal = 0, reflection = 0, xformedEye;

    normalCoord.y = 1.0 - normalCoord.y;   // texture is upside down
    normal = tex2D( normalMap, normalCoord);

    // the normals are coming in on two components;
    // we need to derive the third component: k = sqrt(1 - i^2 - j^2)
    normal.z = sqrt( 1.0f - ((normal.x * normal.x) +
                             (normal.y * normal.y)));
    normal += fudgeFactor;

    // derive per-vertex eye vector using camera position.
    // Use the (-untransformed position) as a camera in obj space
    // and x-form it to view space. Resultant vector taken as
    // vector from camera to current position.
    xformedEye.xyz = mul( -In.texCoord4, viewMatrix);
    xformedEye.xyz = normalize(xformedEye);

    // Compute viewer's reflection vector
    reflection.xyz = dot(normal, xformedEye) * 2 * normal -
                     (dot(normal, normal) * xformedEye);

    // Flip right and left so that it acts like a mirror.
    reflection.x = -reflection.x;

    // sample the cube map
    c = texCUBE( cubeTexture, reflection );

    return c;
}

Shower Door

3D graphics applications commonly combine environment mapping with bump mapping to achieve a bumpy-shiny effect (EMBM). EMBM can also be used to give the impression of looking through glass of varying thicknesses, as in Figure 10.

For this effect, we first compute the eye vector and perturb it by a noise map that is generated offline. Finally, we use the perturbed vector to sample into our environment map. Since the effect is only applied to a single image, we perform a cube map lookup.

Shower door effect algorithm:

For each texel
    Derive per-vertex eye vector using camera position
    Sample noise map
    Add noise sample to eye vector
    Compute reflection vector
    Sample from texture map based on reflection vector
    Return sample

The shower door effect:

//-------------------------------------------------------------//
// BumpVideo20.hlsl
//
// EMBM the video image.
//
// Marwan Y. Ansari - ATI Research, Inc. - 2003
//-------------------------------------------------------------//
sampler  inputImage;
sampler  noiseMap;
float3x3 viewMatrix;

struct PS_INPUT
{
    float2 texCoord0:TEXCOORD0;
    float3 texCoord4:TEXCOORD4;   // vertex position
    float3 texCoord6:TEXCOORD6;   // normal
};

float4 main( PS_INPUT In ) : COLOR
{
    float4 c = 0.5;
    float3 normal = In.texCoord6;
    float3 xformedEye = 0;
    float3 reflection = 0;

    // derive per-vertex eye vector using camera position
    xformedEye.xyz = mul(-In.texCoord4, viewMatrix);
    xformedEye.xyz = normalize(xformedEye);

    // Multiply by .25 to decrease the effect of the normal map
    normal.xyz += tex2D(noiseMap, In.texCoord0) * .25;

    // Compute viewer's reflection vector
    reflection.xyz = dot(normal, xformedEye) * 2 * normal -
                     (dot(normal, normal) * xformedEye);

    // perform the cube map lookup on just one face because
    // I know there is only one face.
    reflection.xy *= 1.0 / reflection.z;
    reflection.xy  = (-.5 * reflection) + .5;

    // reflect about the X
    reflection.x = 1 - reflection.x;

    c = tex2D(inputImage, reflection);

    return c;
}

Figure 10: The shower door effect is created by using a noise texture to perturb the normals used in environment mapping.

Ripple

Figure 11: The two ripple centers are colored magenta and fade as the ripple fades. These five images show, in left to right order, the ripples’ dissipation after they have fully propagated. (See Color Plate 23.)

The above shaders show how environment mapping and bump mapping can be used to simulate geometric perturbations. With a little more effort, we can simulate more interesting effects, such as ripples. Figure 11 shows two ripples affecting an image over time. This is a common effect used in the film industry.

Though manipulating vertex information to simulate a ripple effect is commonplace in computer graphics, the interesting notion here is that all the computations are done in the pixel shader. No vertex information other than the normal is required with this method.

The overall ripple shader algorithm:

For each texel
    Compute the normal for ripple one at this point in time
    Compute the normal for ripple two at this point in time
    Combine ripples one and two with the vertex normal
    Derive per-vertex eye vector using camera position
    Compute reflection vector
    Sample from map based on reflection vector
    Return sample

The algorithm for computing the ripple normal (the calling function passes in the ripple center, the time left on the ripple, and the current texel position):

Compute wave radius
Compute the vector from ripple center to current texel (direction vector)
If current texel is inside ripple
    Look up sine value of current pixel based on frequency
    Compute ripple effect as sine * height * decreasing time
    Result = Direction vector * ripple effect
Return Result

The interesting piece of the algorithm is the function to compute the ripple’s normal. In the ps.2.0 shader model, we can compute two ripples in a single pixel shader using the method described here. There are methods that allow for an arbitrary number of ripples, but they require writing to an extra buffer and are left as an exercise for the reader.

We first compute the ripple’s radius based on the amount of time elapsed since the ripple was started. The radius of the ripple is its time multiplied by the ripple’s speed. In this shader, speed is a constant. Then we compute the distance and direction vector from the ripple center to the current texel location.

Next, since we want the ripple to propagate out from a central point, we can test whether the current texel has been affected by the ripple yet. This is easily done by comparing the distance (from the current texel to the ripple center) to the wave radius. If the distance is less than the wave radius, then this texel is part of the ripple.

Now that we know this texel is affected by the ripple, we need to compute exactly what effect the ripple will have. The ripple we use is based on a sine wave. Rather than computing the sine value inside the shader, it is more efficient to use a sine lookup table stored in a texture. Computing the texture coordinates for the sine texture is done by multiplying the ripple frequency by the distance from the ripple center. After the sine value is retrieved, it must be multiplied by the ripple height and its remaining time. Multiplying by the remaining time dampens the ripple to 0 as the ripple runs out. The only step remaining is multiplying the ripple effect by the direction vector (from the ripple center to the current texel). This constructs a normal that is pointed in the correct direction from the ripple center.

It should be noted that the ripple normals only have values in the x and y components, while the z component is zero. After combining the two ripple normals with the vertex normal, the z component of the combined normal is equal to the z component of the vertex normal.

After the combined normal is computed, the remaining steps are just environment mapping. For more information on environment mapping, see Real-Time Rendering [Möller99].
Rendering [Möller99].


The ripple shader:

//-------------------------------------------------------------//
// TwoRipple20.hlsl
//
// Creates 2 ripples in the image and propagates them out over
// time.
//
// Marwan Y. Ansari - ATI Research, Inc. - 2002
// Many thanks to Chris Brennan (ATI Research) for his help.
//-------------------------------------------------------------//
sampler  inputImage;
sampler  sinTexture;
float3x3 viewMatrix;
float4   UserPoint1;   // first click position
float4   UserPoint2;   // second click position
float4   Pt1Time;
float4   Pt2Time;

struct PS_INPUT
{
    float2 texCoord0:TEXCOORD0;   // current location
    float3 texCoord4:TEXCOORD4;   // position in obj space
    float3 texCoord6:TEXCOORD6;   // normal
};

float3 ComputeRippleNormal(float2 CurrLocation,
                           float4 RippleCenter, float4 PtTime)
{
    float4 ptModulator;
    float3 ptDistance = 0;
    float3 ptDirectionVector = 0;
    float2 ptsSin = 0;
    float2 sinCoords = 0;
    float3 ripple = 0;
    float  distanceToCenter;
    float  freqTweak = 4.0;
    float  waveSpeedTweak = 9.4;
    float  waveHeightTweak = .125;
    float  waveRadius;
    float  isInsideWave;

    ptModulator = PtTime * 0.5;   // time/2 to slow it a bit.
                                  // y ranges from 0 to 0.5
    waveRadius = ptModulator.y * waveSpeedTweak;

    ptDistance.xy     = (RippleCenter - CurrLocation);
    ptDirectionVector = normalize(ptDistance);

    // As time goes on, waveRadius increases b/c waveAge increases.
    isInsideWave = waveRadius -
                   sqrt(dot(ptDistance, ptDistance));

    if(isInsideWave > 0)   // Allows the ripples to grow from RippleCenter.
    {                      // Otherwise, all pixels would be affected by
                           // the ripple at time 0.

        // mul -waveRadius by freq to get sin coord.
        // make sinCoords.x negative so we start with a trough
        // instead of a crest.
        sinCoords.x = -isInsideWave * freqTweak;
        ptsSin = tex2D(sinTexture, sinCoords);   // Get sin value

        // Keep in mind that ptModulator.x ranges from -0.5 to .5
        // ...it decreases over time. mul by 1/8 just to tweak...
        // then dampen the ripple effect with time.
        ptsSin *= waveHeightTweak * ptModulator.x;

        ripple = ptsSin.x * ptDirectionVector;
    }

    return ripple;
}

float4 main( PS_INPUT In ) : COLOR
{
    float4 c = .5;
    float3 xformedEye;
    float3 reflection = 0;
    float3 combinedNormal;
    float3 ripple1 = 0, ripple2 = 0;

    ripple1 = ComputeRippleNormal(In.texCoord0,
                                  UserPoint1,
                                  Pt1Time);
    ripple2 = ComputeRippleNormal(In.texCoord0,
                                  UserPoint2,
                                  Pt2Time);

    // build normal for environment mapping by adding ripples.
    // The vertex shader sends texCoord6 in negated, so we must
    // also negate it.
    combinedNormal = (ripple1 + ripple2) - In.texCoord6;

    // derive per-vertex eye vector using camera position
    xformedEye.xyz = mul(-In.texCoord4, viewMatrix);
    xformedEye.xyz = normalize(xformedEye);

    // Compute viewer's reflection vector
    reflection.xyz = dot(combinedNormal, xformedEye) *
                     2 * combinedNormal -
                     (dot(combinedNormal, combinedNormal) * xformedEye);

    // perform the cube map lookup on just one face because I
    // know there is only one face.
    reflection.xy *= 1.0 / reflection.z;
    reflection.xy  = (-.5 * reflection) + .5;

    // sample the reflection map
    c = tex2D(inputImage, reflection);

    return c;
}

As you can see, it is easy to simulate geometric perturbations using environment mapping in the pixel shader. As long as the normal is computable, it can be generated procedurally, stored in a texture, or both.

Posterization

So far in this chapter we have discussed transition and distortion effects that are applied to images. With the growing popularity of non-photorealistic rendering (NPR) in real-time computer graphics, we have also looked into image-space NPR techniques. In this section, we discuss the use of a Kuwahara filter to generate an NPR look. We discuss two Kuwahara kernels (5×5 and 7×7). The 5×5 kernel is simpler to implement and only requires two passes. The 7×7 kernel is harder to implement and must be broken into four passes. As the kernel size increases, it becomes more complicated to apply.

Kuwahara (5×5 filter size)

The Kuwahara filter is a non-linear edge-preserving smoothing operation [Young]. The filter is centered at the current texel and relies on sampling the neighboring texels to compute the new value for the current texel.

This filter operates on a given 5×5 region by breaking it into four sub-regions that overlap by a single pixel in the horizontal and vertical direction. Our implementations here are based on 5×5 and 7×7 square regions, as shown in Figure 12.

Figure 12: The 5×5 kernel has four 3×3 sub-regions. Here, two of those sub-regions have been broken out to emphasize the one-pixel-wide overlap.


The sub-region size is determined using the following method described in [Young]: given a square kernel of length k, sub-regions are of size j×j, where

    j = (k + 1) / 2

So given our 5×5 kernel, our sub-regions are

    j = (5 + 1) / 2 = 3,   i.e., the sub-regions are 3×3

Again, in this implementation, we use square kernels, but other sizes and shapes can be computed by applying this formula to the length and width.

For each sub-region, the mean color and sample variance (referred to in this article simply as variance) are computed. Once all four mean/variance pairs are computed, the variances are compared. The lowest variance is found and its matching mean is taken as the color for the current pixel. So, the color for the current texel is selected based on which of the four areas has the smallest change in color.

This effect requires two passes to implement. The first pass renders not into the frame buffer but rather into a renderable texture. The second pass uses the renderable texture from the first pass as its input.

The algorithm for a 5×5 Kuwahara kernel:

First pass
    For each texel
        Sample the 3×3 region around this texel and find the mean.
        Compute the variance of this 3×3 region.
        Store the mean as rgb, store the variance in the alpha.
        Return the mean/variance as a texel. (Stores into the renderable texture)

Second pass (use first pass as input)
    For each texel
        Sample the 4 texels that are diagonal from this texel.
        Compare the texels’ variances (alpha), finding the lowest.
        Return the mean associated with the lowest variance.

The first pass computes the mean color by sampling the current texel and its eight neighbors. The sampling is achieved by storing the pixel width (1/width) and pixel height (1/height) offsets into a set of constant registers (see the “Some Review and Notes about This Article” section earlier in this article). These offsets are individually added to this texel’s location and then used as new texture coordinates.

After the nine samples are taken (one for each texel in the 3×3 sub-region), the texels are summed component-wise, and the total is divided by 9 (the number of samples). For example, if we had two pixels, {r1, g1, b1} and {r2, g2, b2}, after summing component-wise the result would be {r1 + r2, g1 + g2, b1 + b2}. The final result of the mean is:

    mean = ( (r1 + ... + r9) / 9,  (g1 + ... + g9) / 9,  (b1 + ... + b9) / 9 )

Computing the variance is just as straightforward. The variance is defined as

    v = σ² = (1/N) · Σ (i = 1..N) [ (sample_i.r – mean.r)² + (sample_i.g – mean.g)² + (sample_i.b – mean.b)² ]

where v is the variance and σ² is the squared standard deviation [Weisstein].

As you can see, it is also possible to get the standard deviation this way by taking the square root of the variance and storing it. It should be noted, however, that there is the possibility of artifacts appearing due to the quantizing of the standard deviation (σ). Recall that σ’s range is [0,1] and that σ² will logically be at the lower end of that scale. Since we store σ² into an 8-bit value, it is possible that it will be quantized to the same value as some other σ², even though the values are slightly different. There are two solutions to this issue. The first is to use higher precision render targets. The other is to take the square root of σ² inside the shader, since that operation can be done at floating-point precision. In our implementation, we did not need to find the standard deviation, so this issue is just something that you may want to keep in mind.

Once the mean and variance are computed, we store them in the output pixel with the mean replicated into the RGB components and the variance stored in alpha. The result is that this pass creates a mean/variance map, which is used by the next pass. Each texel in the mean/variance map contains the mean and variance of the corresponding 3×3 area in the original image.

Figure 13: The centers for the four 3×3 sub-regions in a Kuwahara filter are diagonal from the current texel location in the mean/variance map.

In the second pass, we only need to compare the variances for the regions of the Kuwahara filter to find the lowest. To get the variances, we simply sample the four pixels of the mean/variance map that lie on a diagonal to the current pixel. If this texel is at (u,v), then we want to sample the mean/variance map at:

    u + (1/width), v + (1/height)
    u + (1/width), v – (1/height)
    u – (1/width), v + (1/height)
    u – (1/width), v – (1/height)


See Figure 13. The texture-wrapping mode should be set to CLAMP in order to prevent edges from interfering with each other. After sampling these values, it is trivial to compare the variances and store the mean.

This technique is excellent for posterizing images. However, since the filters are so small (5×5 in this case), the effect is very subtle after just one iteration is applied. We found that in order to get interesting results, we need to run the 5×5 Kuwahara kernel a total of four times. That makes four filters at two passes per filter for a total of eight passes. This will vary, however, with image size.

Finally, after all the passes have been done, we can add a last touch to the image. In order to get a nice NPR effect, we need to add black outlines to the image to highlight the different color regions the way comic books or cartoons do.

An easy and effective way of doing this is to use an edge detection filter. In the article “Advanced Image Processing Using DirectX 9 Shaders,” we discussed the Canny edge detection filter. We have found that although the Canny edge detection filter is impressive, its thin edges are not adequate for outlining. The Sobel edge detection filter gives rather thick edges, however, and is perfect for the kind of outlining we wanted to aesthetically complement the Kuwahara filter. For a full discussion of the theory behind the Sobel edge detection filter, see [Jain95] or [Gonzalez92]. For a two-pass implementation of the filter, see [Mitchell02], or for a single-pass implementation, see the VideoShader application on the CD.

on the CD.<br />

By modifying the Sobel edge detector to combine the resultant edges with the posterized scene, we can achieve our thick outlines. After the Sobel edge detector finds the edges, we subtract the result from one and threshold it based on user input. This way the user can select what he or she perceives as enough edges or outlines interactively.

See Figure 14 for the output of the 5×5 Kuwahara filter with Sobel outlines. The filter was run four times over the image, with a fifth pass to add the outlines.
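A minimal sketch of that final combination step is given below. It assumes the posterized image and a one-channel Sobel edge result are already available in textures, and it reuses the UserInput.z threshold constant described earlier; the sampler names and the separate-pass structure are assumptions rather than the book's single-pass implementation.

sampler posterizedImage;   // output of the repeated Kuwahara passes (assumed name)
sampler edgeImage;         // Sobel edge strength for the same frame (assumed name)
float4  UserInput;         // UserInput.z = user-selected threshold in [0,1]

float4 OutlinePS( float2 uv : TEXCOORD0 ) : COLOR
{
    float4 color = tex2D(posterizedImage, uv);
    float  edge  = tex2D(edgeImage, uv).r;

    // invert the edge strength and threshold it against the user's setting,
    // giving 0 on strong edges and 1 everywhere else
    float outline = step(UserInput.z, 1.0 - edge);

    // multiply in the outline so strong edges become black strokes
    return float4(color.rgb * outline, color.a);
}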

Figure 14: A 5×5 Kuwahara filter plus outlines based on the Sobel edge detection filter has been applied to the image for real-time posterization. (See Color Plate 24.)

The following computes the mean and variance of a 3×3 area for the 5×5 Kuwahara filter:

//-------------------------------------------------------------//
// MeanVarianceGeneric.hlsl
//
// Get the mean and variance of the 3x3 area around this pixel.
//
// Marwan Y. Ansari - ATI Research, Inc. - 2002
//-------------------------------------------------------------//
sampler inputImage;
float2  sampleOffsets[8];

struct PS_INPUT
{
    float2 texCoord:TEXCOORD0;
};

float4 main( PS_INPUT In ) : COLOR
{
    int    i = 0;
    float4 c = .5;
    float2 texCoords[9];
    float4 texSamples[9];
    float4 total = 0;
    float4 mean;
    float  variance;


// Select the color based on the lowest variance.<br />

//<br />

// Marwan Y. Ansari - ATI Research, Inc. - 2002<br />

//----------------------------------------------------------
sampler inputImage;
float2 sampleOffsets[8];

struct PS_INPUT
{
float2 texCoord:TEXCOORD0;
};

float4 main( PS_INPUT In ) : COLOR
{
float2 sampleCoords[4];
float4 s0, s1, s2, s3;
float4 lowestVariance, lowestVariance2;
float s0a, s1a, s2a, s3a, la, l2a;

sampleCoords[0] = In.texCoord + sampleOffsets[0];//up left<br />

sampleCoords[1] = In.texCoord + sampleOffsets[2];//up right<br />

sampleCoords[2] = In.texCoord + sampleOffsets[5];//dn left<br />

sampleCoords[3] = In.texCoord + sampleOffsets[7];//dn right<br />

s0 = tex2D(inputImage, sampleCoords[0]);<br />

s1 = tex2D(inputImage, sampleCoords[1]);<br />

s2 = tex2D(inputImage, sampleCoords[2]);<br />

s3 = tex2D(inputImage, sampleCoords[3]);<br />

s0a = s0.a; s1a = s1.a;<br />

s2a = s2.a; s3a = s3.a;<br />

// Compare first 2 samples<br />

if( s0a < s1a )<br />

{<br />

lowestVariance = s0;<br />

la = s0a;<br />

}<br />

else<br />

{<br />

lowestVariance = s1;<br />

la = s1a;<br />

}<br />

// Compare second 2 samples<br />

if( s2a < s3a )<br />

{<br />

lowestVariance2 = s2;<br />

l2a = s2a;<br />

}<br />

else<br />

{<br />

lowestVariance2 = s3;
l2a = s3a;
}

// Compare the winners of the 2 previous compares.
if( l2a < la )
{
lowestVariance = lowestVariance2;
}

return lowestVariance;
}

Kuwahara (7×7 filter size)<br />

Next we discuss the implementation of a 7×7 Kuwahara filter, which, due to its<br />

large area, must be performed in four passes. Recall that a Kuwahara filter breaks<br />

an area into four sub-regions, as shown previously. Hence, a 7×7 filter breaks into<br />

four 4×4 regions.<br />

The algorithm for the 7×7 Kuwahara:

First Pass — input is input image.
    Compute the mean for the 4×4 around this texel.
    Store result to a mean map.

Second Pass — input is input image and mean map.
    Compute the partial variance of the 4×4 sub-region by computing
    the variance of the 3×3 area around the current texel.
    Store to partial variance map.

Third Pass — input is input image, mean map, and partial variance map.
    Compute the partial variance of the 4×4 sub-region by computing
    the partial variance for the remaining L-shaped region.
    Combine both partial variances into a final variance.
    Store into a final mean/variance map.

Fourth Pass — input is mean/variance map.
    Sample the four 4×4 sub-region centers.
    Select the mean/variance pair with the lowest variance.

The first pass of our four-pass 7×7 Kuwahara filter is solely dedicated to computing<br />

the mean for each texel in the image. Inside the shader, this is done in two<br />

steps. The first step computes the offsets for the 3×3 area around the current<br />

texel and fetches the texture samples for them. Next, those nine samples are<br />

summed. The offsets for the remaining L-shaped region (see Figure 15) are then<br />

computed, and those seven texels are also sampled and added to the total. Finally,<br />

we divide the sum by 16 and store it to the mean map.<br />

The next two passes of this effect are dedicated to computing the variance.<br />

Due to the high number of instructions needed to compute the variance across a<br />

4×4 region, we must compute the variances in two passes (passes two and three<br />

of this effect).<br />

The second pass computes the partial variance of the 4×4 sub-region by<br />

computing the variance of the 3×3 region surrounding the current texel. This is<br />

performed in the same way that the 5×5 Kuwahara was performed. Figure 15<br />

shows where the 3×3 region is located inside the 4×4 sub-region. The partial<br />

variance is stored to a partial variance map and is used in the next pass.


Figure 15: The variance for the 7×7 Kuwahara filter is implemented by calculating the variance of the 3×3 area and the L-shaped area in two separate passes. The 3×3 region’s variance is computed first, and the L-shaped region is computed next. The “current texel location” is the location of the output texel for this 4×4 sub-region.

The third pass computes the partial variance for the remaining area of the 4×4<br />

sub-region. The seven pixels that create the L-shaped region on one side of the<br />

3×3 area surrounding the current texel (see Figure 16) are sampled. Their partial<br />

variance is then combined with the partial variance from the previous pass to<br />

compute the final variance for the 4×4 sub-region. As before, the mean is replicated<br />

across RGB and the variance is stored in alpha, creating a final mean/variance<br />

map for this 4×4 sub-region.<br />

The fourth and final pass of our 7×7 Kuwahara filter implementation happens exactly in the same way as the final pass of the 5×5 Kuwahara filter, with the exception that the four texel samples are not direct diagonals of the current texel. Rather, they are located at:

u – (1/width), v + (1/height)           // up and to the left
u + (2 * 1/width), v + (1/height)       // up and to the right
u – (1/width), v – (2 * 1/height)       // down and to the left
u + (2 * 1/width), v – (2 * 1/height)   // down and to the right

The reason for this odd offsetting may not seem clear at first. Remember in the prior steps that the current texel has always been three rows over and three rows down from the top-left corner of the 4×4 sub-region. That texel holds the correct mean/variance for the 4×4 sub-region, and we must be careful to get our sample point from the correct location. Figure 16 shows the relative locations.

Figure 16: The mean, both partial variances, and the final mean/variance results are all stored in the same positions relative to the current texel location in each 4×4 sub-region.

Now that the 7×7 Kuwahara has been applied, we can finalize the posterizing effect by adding one more pass to perform the Sobel outlines as discussed above. As you can see in Figure 17, we achieve a very nice posterizing effect on images using this method.

The benefit to using the 7×7 Kuwahara kernel over the 5×5 kernel is that you get better posterization with fewer passes. The 5×5 kernel required four passes of the entire filter (eight passes total) to achieve good results. The 7×7 kernel only requires two passes (eight passes total) to achieve a better result. They may require the same number of passes over the image, but the 7×7 requires fewer full kernel passes. By performing fewer kernel passes, the edge gradients are better maintained and you get nicer outlines while still posterizing based on larger sub-regions. So for the same number of passes and roughly the same amount of work, you can get better posterization using a larger kernel.

Figure 17: A 7×7 Kuwahara filter plus outlines based on the Sobel edge detection filter has been applied to the image for real-time posterization. The 7×7 filter’s advantage over the 5×5 filter is better posterization for about the same number of instructions. (See Color Plate 25.)

The following computes the mean of a 4×4 area for the 7×7 Kuwahara:

//--------------------------------------------------------------<br />

// Mean4x4.hlsl<br />

//<br />

// Get the Mean of the 4x4 area around a pixel<br />

//<br />

// Marwan Y. Ansari - ATI Research, Inc. - 2002


//--------------------------------------------------------------
sampler inputImage;
float2 sampleOffsets[8];

struct PS_INPUT
{
float2 texCoord:TEXCOORD0;
};

float4 main( PS_INPUT In ) : COLOR
{
int i = 0;
float4 c = .5;
float2 texCoords[9];
float4 texSamples[9];
float4 total = 0;
float4 mean;

for(i = 0; i < 9; i++)

return c;
}

A variance over a 3×3 area just for a 7×7 Kuwahara:<br />

//----------------------------------------------------------<br />

// Variance3x3.hlsl<br />

//<br />

// Get the partial variance of the 4x4 area by getting the 3x3<br />

// area around this pixel.<br />

//<br />

// Marwan Y. Ansari - ATI Research, Inc. - 2002<br />

//---------------------------------------------------------
sampler inputImage;
sampler meanMap;
float2 sampleOffsets[8];

struct PS_INPUT
{
float2 texCoord:TEXCOORD0;
};

float4 main(PS_INPUT In) : COLOR
{
int i = 0;
float4 c = .5;
float2 texCoords[9];
float4 texSamples[9];
float4 total = 0;
float4 mean;
float variance;

for(i = 0; i < 9; i++)


A variance over an L-shaped area for a 4×4 area for a 7×7 Kuwahara:<br />

//------------------------------------------------------------<br />

// Variance LShaped.hlsl<br />

//<br />

// Get the variance of the L-shaped region for a 4x4 area for a 7x7 Kuwahara.<br />

//<br />

// Marwan Y. Ansari - ATI Research, Inc. - 2002<br />

//-----------------------------------------------------------
sampler inputImage;
sampler meanMap;
sampler variance3x3Map;
float2 sampleOffsets[8];

struct PS_INPUT
{
float2 texCoord:TEXCOORD0;
};

float4 main(PS_INPUT In) : COLOR
{
int i = 0;
float4 c = .5;
float2 texCoords[7];
float4 texSamples[7];
float variance, xOffset, yOffset;
float4 total = 0;
float4 v3x3, vLShaped, mean;

xOffset = sampleOffsets[0].x *2;<br />

yOffset = sampleOffsets[0].y *2;<br />

// Compute sample offsets for the L-shape.<br />

// 0123<br />

// 4***<br />

// 5*X* X=this texel<br />

// 6***<br />

texCoords[0].x = In.texCoord.x + xOffset;<br />

texCoords[0].y = In.texCoord.y + yOffset; // 0<br />

texCoords[1].x = In.texCoord.x + sampleOffsets[0].x;<br />

texCoords[1].y = In.texCoord.y + yOffset; // 1<br />

texCoords[2].x = In.texCoord.x ;<br />

texCoords[2].y = In.texCoord.y + yOffset; // 2<br />

texCoords[3].x = In.texCoord.x - sampleOffsets[0].x;

texCoords[3].y = In.texCoord.y + yOffset; // 3<br />

texCoords[4].x = In.texCoord.x + xOffset;<br />

texCoords[4].y = In.texCoord.y + sampleOffsets[0].y; // 4<br />

texCoords[5].x = In.texCoord.x + xOffset;<br />

texCoords[5].y = In.texCoord.y; // 5
texCoords[6].x = In.texCoord.x + xOffset;
texCoords[6].y = In.texCoord.y - sampleOffsets[0].y; // 6

// Sample the L-shape.
for(i = 0; i < 7; i++)


The final pass for the 7×7 Kuwahara again selects the mean/variance pair with the lowest variance:

float s0a, s1a, s2a, s3a, la, l2a;

sampleCoords[0] = In.texCoord + sampleOffsets[0]; //up left<br />

sampleCoords[1] = In.texCoord + sampleOffsets[2] + sampleOffsets[1]; //up right<br />

sampleCoords[2] = In.texCoord + sampleOffsets[5] + sampleOffsets[6]; //dn left<br />

sampleCoords[3] = In.texCoord + sampleOffsets[7] + sampleOffsets[7]; //dn right<br />

s0 = tex2D(inputImage, sampleCoords[0]);<br />

s1 = tex2D(inputImage, sampleCoords[1]);<br />

s2 = tex2D(inputImage, sampleCoords[2]);<br />

s3 = tex2D(inputImage, sampleCoords[3]);<br />

s0a = s0.a; s1a = s1.a;<br />

s2a = s2.a; s3a = s3.a;<br />

// Compare first two samples<br />

if( s0a < s1a )<br />

{<br />

lowestVariance = s0;<br />

la = s0a;<br />

}<br />

else<br />

{<br />

lowestVariance = s1;<br />

la = s1a;<br />

}<br />

// Compare second two samples<br />

if( s2a < s3a )<br />

{<br />

lowestVariance2 = s2;<br />

l2a = s2a;<br />

}<br />

else<br />

{<br />

lowestVariance2 = s3;<br />

l2a = s3a;<br />

}<br />

// Compare the winners of the two previous compares.<br />

if( l2a < la )<br />

{<br />

lowestVariance = lowestVariance2;<br />

}<br />

return lowestVariance;
}



Using Sobel edge detection to create outlines:<br />

//--------------------------------------------------------------<br />

// SobelOutlines.hlsl<br />

//<br />

// Compute edges using Sobel operators, and then recomposite the<br />

// edges onto the image as outlines.<br />

//<br />

// Marwan Y. Ansari - ATI Research, Inc. - 2002<br />

//-------------------------------------------------------------
sampler inputImage;
float2 sampleOffsets[8];
float4 UserInput;

struct PS_INPUT
{
float2 texCoord:TEXCOORD0;
};

float4 main(PS_INPUT In) : COLOR
{
int i = 0;
float4 c = .5;
float2 texCoords;
float4 texSamples[8];
float4 vertGradient;
float4 horzGradient;

for(i = 0; i < 8; i++)


// combine Sobel edge with current image.
c *= tex2D(inputImage, In.texCoord);

return c;
}

Combining Effects Using the VideoShader Application

The techniques presented in this chapter were developed using live and recorded video fed to Direct3D via the Microsoft Video Mixing Renderer (VMR). The sample app, VideoShader, demonstrates the use of Direct3D and the VMR, with the above filters and several others implemented using HLSL. Source for the sample application and all of the shaders is available on the companion CD as well as the ATI Developer Relations web site (www.ati.com/developer).

The effects that we have discussed in this chapter are excellent stand-alone effects. However, you may want to concatenate the effects to get a greater distortion or transition between distorted and undistorted images, etc.

The VideoShader application has a set of combined shader effects mapped to the F keys and the number keys. Adding effects is done when the application is running in windowed mode. By right-clicking on the left window pane, a menu appears with a list of shaders that you may insert. You can choose to insert your chosen effect before or after the shader that you right-clicked on.

Conclusion

This article discussed three classes of image effects that are possible with modern hardware: transitions, distortions, and posterizations. All can run in real time on today’s DirectX 9 hardware and are relatively easy to implement. Since all of them run in image space, no extra geometry is needed to get some really cool results. We hope that this article has persuaded you to try out some of these effects in your own work.

Acknowledgments<br />


Thanks to Jason Mitchell and Evan Hart for their help in writing this chapter.<br />

Thanks also to Chris Brennan, David Gosselin, and Chris Oat (all from ATI<br />

Research) who helped in various stages of getting the effects implemented for the<br />

Video<strong>Shader</strong> application.<br />

Special thanks to Jason Mitchell for giving me the idea to implement the<br />

Kuwahara filter and to Larry Seiler of ATI Research for his ideas on optimizing<br />

the 5×5 Kuwahara filter.<br />

Finally, many thanks to Muhammad Haggag at Ain Shams University, Egypt,<br />

for his help in proofreading this chapter.


References

ATI Developer web site, http://www.ati.com/devrel

Pixélan Software, LLC, http://www.pixelan.com

[Gonzalez92] Gonzalez, Rafael C. and Richard E. Woods, Digital Image Processing, Addison-Wesley, 1992.

[Jain95] Jain, Ramesh, Rangachar Kasturi, et al., Machine Vision, McGraw-Hill, 1995.

[Mitchell02] Mitchell, Jason L., “Image Processing with 1.4 Pixel Shaders in Direct3D,” Direct3D ShaderX: Vertex and Pixel Shader Tips and Tricks, Wolfgang Engel, ed., Wordware Publishing, 2002, pp. 258-269.

[Mitchell04] Mitchell, Jason L. and Craig Peeper, “Introduction to the DirectX High Level Shading Language,” ShaderX2: Introduction & Tutorials with DirectX 9, Wolfgang Engel, ed., Wordware Publishing, 2004, pp. 1-61.

[Möller99] Möller, Tomas and Eric Haines, Real-Time Rendering, A.K. Peters, Ltd., 1999.

[Weisstein] Weisstein, Eric, “Eric Weisstein’s World of Mathematics,” http://mathworld.wolfram.com/Variance.html.

[Young] Young, I.T., J.J. Gerbrands, and L.J. van Vliet, “Image Processing Fundamentals — Smoothing Operations,” http://www.ph.tn.tudelft.nl/Courses/FIP/noframes/fip-Smoothin.html.


Using Pixel Shaders to Implement a Mosaic Effect Using Character Glyphs

Roger Descheneaux and Maurice Ribble

Introduction

A mosaic is a picture made by setting small pieces of glass or colored tiles onto a<br />

surface. Individually, the small pieces look nothing like the picture, but when<br />

assembled into an image and viewed from a distance, they form a cohesive whole<br />

that accurately represents the intent of the image.<br />

The technique described here uses a post-processing pixel shader that takes<br />

a screen image and converts it into a mosaic. Rather than using glass or tiles to<br />

form the mosaic, we use window-aligned rectangles containing images of various<br />

intensities. While this technique is appropriate for use with any images, in our<br />

example we use character glyphs to represent the screen image as a sequence of<br />

letters and numbers. The difference in brightness between the various glyphs can<br />

be viewed as forming a monochromatic image. Here is an example of an image<br />

processed using this technique:<br />

The teapot on the left is the original image, while the teapot on the right is the<br />

result of post-processing this image and converting it into character glyphs. This<br />

technique occurs entirely in hardware. It can be performed in a single-pass pixel<br />

shader, though in the following example we use several passes for the sake of<br />

simplicity.<br />


Algorithm Overview<br />

The original image is divided into a series of image-aligned blocks. Each block is<br />

the size of one glyph in the character set to be rendered. We compute the intensity<br />

of the color in this block by downsampling the image using a linear texture filter<br />

to compute the pixel averages. We then rescale the image to its original size,<br />

replacing the original pixel color with the average color for each pixel block.<br />

We then use a pixel shader to replace each pixel in the original image with<br />

one pixel from the character set to be rendered. The pixel chosen is based on the<br />

intensity of the block in the downsampled image to which the pixel belongs and<br />

the offset of the pixel within the pixel block.<br />

Sample Program<br />

// Step 1: Draw the image to the back buffer.<br />

glViewport (0, 0, width, height);<br />

glDisable(GL_FRAGMENT_PROGRAM_ARB);
glDisable(GL_TEXTURE_2D);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
glMatrixMode(GL_MODELVIEW);

glLoadIdentity();<br />

glRotatef(angle, 1.0, 1.0, 1.0);<br />

// Draw the base image first. We'll convert it later as a<br />

// postprocessing step.<br />

glEnable(GL_LIGHTING);
glEnable(GL_DEPTH_TEST);
glColor3f(1.0, 1.0, 0.0);
glutSolidTeapot(0.5);
glDisable(GL_LIGHTING);
glDisable(GL_DEPTH_TEST);

glFlush();<br />

// Step 2: Downsample the image.<br />

glActiveTextureARB(GL_TEXTURE0_ARB);
glEnable(GL_TEXTURE_2D);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glLoadIdentity();

// Step 2a: Copy the screen to a texture.
glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, texImageWidth, texImageHeight);

// Step 2b: Draw the image back to the screen at half its previous resolution.
glViewport (0, 0, texImageWidth / 2, texImageHeight / 2);
glBegin(GL_QUADS);

glTexCoord2f(0.0, 0.0);


glVertex2f(-1.0, -1.0);<br />

glTexCoord2f(1.0, 0.0);<br />

glVertex2f(1.0, -1.0);<br />

glTexCoord2f(1.0, 1.0);<br />

glVertex2f(1.0, 1.0);<br />

glTexCoord2f(0.0, 1.0);<br />

glVertex2f(-1.0, 1.0);<br />

glEnd();<br />

// Step 3. Repeat the process to get it to 1/4 its original size.
glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, texImageWidth / 2, texImageHeight / 2);
glViewport (0, 0, texImageWidth / 4, texImageHeight / 4);
glBegin(GL_QUADS);

glTexCoord2f(0.0, 0.0);<br />

glVertex2f(-1.0, -1.0);<br />

glTexCoord2f(0.5, 0.0);<br />

glVertex2f(1.0, -1.0);<br />

glTexCoord2f(0.5, 0.5);<br />

glVertex2f(1.0, 1.0);<br />

glTexCoord2f(0.0, 0.5);<br />

glVertex2f(-1.0, 1.0);<br />

glEnd();<br />

// Step 4: Now 1/8th.<br />

glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, texImageWidth / 4, texImageHeight / 4);
glViewport (0, 0, texImageWidth / 8, texImageHeight / 8);
glBegin(GL_QUADS);

glTexCoord2f(0.0, 0.0);<br />

glVertex2f(-1.0, -1.0);<br />

glTexCoord2f(0.25, 0.0);<br />

glVertex2f(1.0, -1.0);<br />

glTexCoord2f(0.25, 0.25);<br />

glVertex2f(1.0, 1.0);<br />

glTexCoord2f(0.0, 0.25);<br />

glVertex2f(-1.0, 1.0);<br />

glEnd();<br />

// Now perform one more pass to scale the image. We have to change<br />

// the aspect ratio of the downsampled image to match the aspect<br />

// ratio of the font, so each block of the character in the font<br />

// matches the same color in the downsampled image. The font is<br />

// 8x10 pixels. This means that we have to scale the Y direction<br />

// of the rendering somewhat so that ten pixels in height correspond<br />

// to one pixel in this final texture image.<br />

glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, texImageWidth / 8, texImageHeight / 8);
glViewport (0, 0, texImageWidth / 8, texImageHeight / 8 / 1.25);
glBegin(GL_QUADS);

glTexCoord2f(0.0, 0.0);<br />

glVertex2f(-1.0, -1.0);<br />

glTexCoord2f(0.125, 0.0);<br />

glVertex2f(1.0, -1.0);<br />

glTexCoord2f(0.125, 0.125);
glVertex2f(1.0, 1.0);
glTexCoord2f(0.0, 0.125);
glVertex2f(-1.0, 1.0);
glEnd();

glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, texImageWidth / 8, texImageHeight / 8);
glViewport (0, 0, texImageWidth, texImageHeight);
glBegin(GL_QUADS);

glTexCoord2f(0.0, 0.0);<br />

glVertex2f(-1.0, -1.0);<br />

glTexCoord2f(0.125, 0.0);<br />

glVertex2f(1.0, -1.0);<br />

glTexCoord2f(0.125, 0.125 / 1.25);<br />

glVertex2f(1.0, 1.0);<br />

glTexCoord2f(0.0, 0.125 / 1.25);<br />

glVertex2f(-1.0, 1.0);<br />

glEnd();<br />

// Copy the final downsampled image into our texture.<br />

glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, texImageWidth, texImageHeight);

// These scaling factors are used to match texels in the<br />

// downsampled image with pixels in the original screen image.<br />

scaleW = 1.0 * width / texImageWidth;<br />

scaleH = 1.0 * height / texImageHeight;<br />

// Now activate the fragment program and re-render the scene using<br />

// the original image as a texture source.<br />

glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glEnable(GL_FRAGMENT_PROGRAM_ARB);
glViewport (0, 0, width, height);
glBegin(GL_QUADS);
glTexCoord2f(0.0, 0.0);
glMultiTexCoord2fARB(GL_TEXTURE1_ARB, 0.0, 0.0);

glVertex2f(-1.0, -1.0);<br />

glTexCoord2f(scaleW, 0.0);<br />

glMultiTexCoord2fARB(GL_TEXTURE1_ARB, 1.0 * width / 8.0, 0.0);
glVertex2f(1.0, -1.0);
glTexCoord2f(scaleW, scaleH);
glMultiTexCoord2fARB(GL_TEXTURE1_ARB, 1.0 * width / 8.0, 1.0 * height / 10.0);
glVertex2f(1.0, 1.0);
glTexCoord2f(0.0, scaleH);
glMultiTexCoord2fARB(GL_TEXTURE1_ARB, 0.0, 1.0 * height / 10.0);

glVertex2f(-1.0, 1.0);<br />

glEnd();<br />

// Display the final image.
glDisable(GL_FRAGMENT_PROGRAM_ARB);
glutSwapBuffers();
}

Explanation of the Sample Program<br />

The first step of the process is to determine the average intensity of each character<br />

cell in the image. To do this, we downsample the image a number of times,<br />

which depends on the size of the font being used. In this example, we’re using a<br />

font that is eight pixels wide by ten pixels high, so we downsample the image four<br />

times using a linear texture filter. This creates a texture in which each texel represents<br />

an 8x10 section of the original image containing the average intensity for<br />

that 8x10 section of the image. We then upsample the image using this average<br />

image as the source, creating a final texture that has a size equal to the original<br />

processed image but contains values in each 8x10 region that correspond to the<br />

average value of that 8x10 region on the original image.<br />

Finally, we use that block image as the source texture of a final copy with the<br />

pixel shader enabled, which will replace each pixel with the corresponding pixel in<br />

an appropriate character.<br />

When we render the font, we pass in two sets of texture coordinates. The<br />

first set of texture coordinates ranges from 0.0 to 1.0, encompassing the entire<br />

image. The second set of texture coordinates ranges from the width of the<br />

image/8, the font width, and the height of the image/10, the font height. Later we<br />

take the fractional part of this index, which will repeat every eight pixels in the x<br />

direction and every ten pixels in the y direction.<br />

The code containing the font itself is not included here in the interest of<br />

saving space, but it consists of a single texture, which is 512 pixels wide and 16<br />

pixels high. The image in the texture is a series of 8x10 characters arranged horizontally<br />

from left to right with the darkest characters in the leftmost part of the<br />

image and the brightest characters in the rightmost part of the image.<br />
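Before looking at the fragment program itself, the per-pixel lookup math can be summarized with a small CPU-side sketch. The function below is our own simplified reference of the intended lookup (64 glyphs of 8×10 pixels in a 512×16 texture), not code from the sample program.

#include <cmath>

// Reference (CPU) version of the per-pixel glyph lookup used by the
// mosaic effect. Returns the (s, t) coordinate to sample in the font
// texture for a screen pixel.
// blockColor : averaged RGB of the 8x10 block this pixel belongs to
// cellS,cellT: second texcoord set (pixel position / glyph size), so the
//              fractional part gives the offset inside the glyph cell
void GlyphTexCoord(const float blockColor[3], float cellS, float cellT,
                   float* outS, float* outT)
{
    const float numChars   = 64.0f;         // glyphs in the font strip
    const float charScaleS = 1.0f / 64.0f;  // one glyph's width in the 512-wide texture
    const float charScaleT = 10.0f / 16.0f; // glyph height / texture height

    // Intensity of the block (same weights as the grayScale constant).
    float intensity = 0.30f * blockColor[0] + 0.59f * blockColor[1] + 0.11f * blockColor[2];

    // Quantize to the left edge of the nearest glyph in the strip.
    float glyphStartS = std::floor(intensity * numChars) / numChars;

    // Offset of this pixel inside its 8x10 cell, in [0..1).
    float fracS = cellS - std::floor(cellS);
    float fracT = cellT - std::floor(cellT);

    *outS = glyphStartS + fracS * charScaleS;
    *outT = fracT * charScaleT;
}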

Sample Pixel <strong>Shader</strong><br />

!!ARBfp1.0<br />

# Constants used by the program.<br />

# This first constant is the scale factor to convert an RGB<br />

# value to a black-and-white value.<br />

PARAM grayScale = {0.30, 0.59, 0.11, 1.0};<br />

# This constant converts S, T coordinates in character space<br />

# to coordinates in font string space. See below for details.<br />

# The values here are 1/number of characters in the font<br />

# and the ratio of the character height to the font height-<br />

# in this case, 1/64 and 10/16.<br />

PARAM charScale = {0.015625, 0.625, 0.0, 0.0};<br />

# This constant is the number of characters in the glyph<br />

# array. It is used in the operation that computes the<br />

# beginning of a glyph in the s-direction.<br />


PARAM numChars = {64.0, 0.0, 0.0, 0.0};<br />

# This is the inverse of the constant above. It's used to<br />

# convert the beginning of the character back into the<br />

# glyph array space.<br />

PARAM recipChars = program.local[2];<br />

TEMP blockOffset;<br />

TEMP sColor;<br />

TEMP charOffset;<br />

TEMP charCoords;<br />

# Interpolants.<br />

ATTRIB tc = fragment.texcoord[0];<br />

ATTRIB cc = fragment.texcoord[1];<br />

OUTPUT oColor = result.color;<br />

# Look up the pixel color for this character block.<br />

TEX sColor, tc, texture[0], 2D;<br />

# Compute its intensity<br />

DP3 blockOffset, sColor, grayScale;<br />

# Round it to the s-coordinate of the beginning of the<br />

# nearest character to the computed intensity value.<br />

MUL blockOffset, blockOffset, numChars;<br />

FLR blockOffset, blockOffset;<br />

MUL blockOffset, blockOffset, recipChars;<br />

# Using the second set of texture coordinates, find the<br />

# offset of this pixel within the character block. After<br />

# this operation, both X and Y will be in the range 0-1,<br />

# where 0 is the bottom left-most part of a character<br />

# and 1 is the upper right-most part.<br />

FRC charOffset, cc;<br />

# Multiply this number in the 0-1 range by the fraction<br />

# that represents a single pixel within the glyph array.<br />

# In the x direction, this is 1/the number of characters in<br />

# the font. In the y direction, it's the ratio of the font<br />

# height to the height of the texture it's in.<br />

# Add the result to the start of the glyph in the glyph<br />

# array. The result is the coordinate of the texel with<br />

# which this pixel should be replaced.<br />

MAD charCoords, charOffset, charScale, blockOffset;<br />

TEX oColor, charCoords, texture[1], 2D;<br />

END;<br />

The pixel shader has two parts. First, it determines the s-coordinate offset into<br />

the texture, which is the list of characters sorted from darkest to lightest. It does<br />

this by multiplying the intensity of the pixel by the number of characters in the<br />

font, taking the floor of that value and then dividing by the number of characters<br />

in the font, quantizing it so that the intensity value now falls on a coordinate that<br />

is the leftmost part of a character within the font texture.<br />

The second part of the pixel shader adds the coordinates of the glyph within<br />

the glyph array to the start of the glyph array. It then performs a lookup into the<br />

glyph array and replaces the pixel on the screen with a texel from the glyph array.


Conclusion<br />


The effect described above can be used to replace any block of an image with a<br />

rectangular glyph based on the intensity of that area of the image. Various<br />

improvements to the algorithm are possible. For example, the entire operation<br />

could be performed in a single pass by adding instructions to the pixel shader that<br />

compute the intensity value, rather than using the texture unit to downsample the<br />

image.<br />

The effect above is also not limited to text glyphs. It can be used to render<br />

any image using a series of other images. For example, it could be used to draw a<br />

picture of a person with a mosaic of scenes from that person’s life.<br />

Neither is the effect limited to grayscale images. While grayscale is suitable<br />

for the rendering of character glyphs, lookups based on color are also possible —<br />

for example, using 3D textures, it would be possible to sort the image by the<br />

intensities of the different color components.<br />

Acknowledgments<br />


Thanks are due to Marwan Ansari for giving us the original idea for performing<br />

image processing using pixel shaders and for encouraging us to publish this work.


Mandelbrot Set Rendering<br />

Emil Persson<br />

Introduction<br />

With the arrival of DirectX 9 level hardware, a whole new world of possibilities has opened up in graphics. One such possibility that the new floating-point pixel shaders have opened is the ability to evaluate advanced mathematical operations without significant precision loss or limited range. The kinds of applications that one tends to think of first where the capabilities of DirectX 9 shaders can be beneficial are various lighting scenarios, atmospheric effects, animation, etc. These are all very interesting topics to dive deep into, but there’s another dimension that these shaders open up that may not have crossed our minds the first time we learned about the new shaders. For the first time, we can utilize pixel shaders to visualize the wonderful world of fractals.

The Mandelbrot Fractal<br />


Probably the most famous and well-known fractal is the Mandelbrot set. The Mandelbrot set basically consists of the complex numbers that after an infinite number of iterations of a simple formula are still within close range of the origin. While the higher math behind all this and all its implications is something that interests only a select few, the graphical art that you can produce with such a series is something that can amaze just about everyone.

How does one visualize the Mandelbrot set? Easy — you simply take a complex number, evaluate a function on this number, and get a new number. Then repeat this a sufficient number of times. The classical iteration looks like this:

    Z_{i+1} = Z_i² + C

Here C is the original number and Z_i is the number that we are working with. We begin by setting Z_0 to C. Expanding this formula into real and imaginary parts of the complex number, we get these two formulas:

    X_{i+1} = X_i² – Y_i² + C_x
    Y_{i+1} = 2·X_i·Y_i + C_y


X is the real part, and Y is the imaginary part. Now we only need to do this math in a pixel shader. The first thing that we need to do is pass the constant C to the pixel shader. Where do we get C from, and what is it really? As we want to visualize the Mandelbrot set, we want to view every point in the XY plane that belongs to the Mandelbrot set, as these are clearly different from the pixels that don’t belong to the set. This means that we are interested in those points in the XY plane that, after an infinite number of iterations of the formulas above, are still close to the origin. So what we pass as C is basically the position of a point in the XY plane. We will define a subset of the plane (for instance, the rectangle (–2, –2) – (2, 2)) and draw this range as a quad covering the whole viewport. C is basically the position, and thus we will pass it as a texture coordinate, which will be interpolated over the surface. The Mandelbrot set definition declares that we need to loop the equations above an infinite number of times in order to decide whether a point is within the set or not. Obviously, this is impossible to do, so usually one just loops it a sufficient number of times and then decides if we are still close to the origin. If we are, then we assume that we are part of the Mandelbrot set, a fairly reasonable assumption.
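For comparison with the shader version that follows, the iteration is easy to express on the CPU. The helper below is only a reference sketch; the choice of returning the squared distance mirrors the discussion of the shader, and it is not code from the sample.

// CPU reference for the Mandelbrot iteration Z = Z^2 + C.
// Returns the squared distance from the origin after 'iterations' steps,
// which is the same quantity the pixel shader later maps to a color.
float MandelbrotDistanceSquared(float cx, float cy, int iterations)
{
    float x = cx;   // Z0 = C
    float y = cy;

    for (int i = 0; i < iterations; ++i)
    {
        // X' = X^2 - Y^2 + Cx,  Y' = 2XY + Cy
        float nx = x * x - y * y + cx;
        float ny = 2.0f * x * y + cy;
        x = nx;
        y = ny;
    }
    return x * x + y * y;
}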

An iteration of the formulas above can be done in three instructions. C is<br />

passed in texture coordinates t0, r0 contains our Z in its x and y components, and<br />

r2 is a temporary register. The implementation will look like this:<br />

mad r2.xy, r0.x, r0, t0
mad r1.x, -r0.y, r0.y, r2.x
mad r1.y, r0.x, r0.y, r2.y

As you can see, the result ends up in another register, r1. This is because both<br />

the x and y components of the previous value are needed in the evaluation of the<br />

new value, so we can’t overwrite any of them. So we write the results to r1<br />

instead. In the next iteration, we can do the same thing again but with r0 and r1<br />

reversed such that the next result ends up in r0 again. Then we only need to take<br />

these two iterations and cut and paste them until we reach the limit of the hardware.<br />

A Radeon 9700, for instance, accepts pixel shaders of at most 64 ALU<br />

instructions. This means that we can get, at most, 21 iterations, but probably<br />

fewer in reality because we probably prefer to do something cool with the end<br />

result before we write it to the frame buffer. In the code for this article, we end up<br />

with 19 iterations.<br />

Visualizing It<br />


We’ve done our 19 iterations — now what? Well, we need to transform it into<br />

something meaningful for the eye. There are billions of ways to do this, and which<br />

one we choose is arbitrary and can be based on our subjective preference. For all<br />

this to be meaningful though, we need to make the pixels that end up in the<br />

Mandelbrot set visually different from those that didn’t. Traditionally, when one<br />

renders Mandelbrot sets on the CPU, people have used whatever number of loops<br />

it took until we ended up at a distance larger than two from the origin. This is<br />

then used to look up a color from a palette. Unfortunately, this kind of information



is not available to us. As of publication, there’s no support for data-based branching<br />

in the pixel shaders in any hardware available on the market, so we can’t count<br />

loops. All the information that we have is the final position after all our iterations.<br />

This is sufficient, however, and we map this distance to a color. A large distance<br />

means that it’s not in the Mandelbrot set, while a small distance means that it<br />

most likely is. To get some nice coloration, we use the distance as a texture coordinate<br />

and look up the color from a texture with a dependent texture read. This<br />

texture is one-dimensional and contains a color spectrum not too different looking<br />

from a rainbow, except that it softly fades to black to the right. The distance can<br />

be anywhere from zero to very high, so instead of just mapping it directly, we use<br />

a similar formula as when one maps high dynamic range images into the 0…1<br />

interval with exposure control. We use the following formula:<br />

    R = 1 – 2^(–c·d)

R is our resulting color, and d is our distance. Instead of bothering with taking the<br />

square root in order to find the distance, we just use the squared distance; mind<br />

you, this final step is no exact science — it is better classified as art. The constant<br />

c is just an arbitrary constant that says how far from the origin a point can be<br />

without mapping to our black edge of the texture. We select it purely on subjective<br />

grounds; I have found that something around 8 will suit us well. The final<br />

implementation is pretty straightforward:<br />

def c0, 0.0, 1.0, 8.0, 0.0<br />

...<br />

mov r1.z, c0.x<br />

dp3_sat r0, r1, r1
mul r0.x, r0.x, c0.z
exp r0.x, -r0.x

sub r0, c0.y, r0.x<br />

texld r0, r0, s0<br />

mov oC0, r0<br />

First we fill r1.z with a zero so we can use the dot product instruction without<br />

reading uninitialized components. You may wonder why we use a dp3_sat;<br />

shouldn’t we use dp3? Well, we should. Unfortunately, in practice some implementations seem to have problems raising numbers to large negative exponents; this can create some noisy artifacts. However, as 2^(–8) is already a very small number, there is no visual difference if we clamp it. We should now have a nice colored

Mandelbrot before our eyes.


Real-Time Depth of Field Simulation

Guennadi Riguer, Natalya Tatarchuk, and John Isidoro

Introduction

Photorealistic rendering attempts to generate computer images with quality<br />

approaching that of real-life images. Quite often, computer-rendered images look<br />

almost photorealistic, but they are missing something subtle — something that<br />

makes them look synthetic or too perfect. Depth of field is one of those very<br />

important visual components of real photography that makes images look “real.”<br />

In “real-world” photography or cinematography, the physical properties of the<br />

camera cause some parts of the scene to be blurred, while maintaining sharpness<br />

in other areas. While blurriness sometimes can be thought of as an imperfection<br />

and undesirable artifact that distorts original images and hides some of the scene<br />

details, it can also be used as a tool to provide valuable visual clues and guide a<br />

viewer’s attention to important parts of the scene. Using depth of field effectively<br />

can improve photorealism and add an artistic touch to rendered images. Figure 1<br />

shows a simple scene rendered with and without depth of field.<br />

Figure 1: A scene rendered with and without depth of field (left: with depth of field; right: no depth of field)

Recent developments in the field of programmable graphics hardware allow us to<br />

simulate complex visual effects such as depth of field in real time. This article<br />

presents two real-time implementations of the depth of field effect using <strong>DirectX</strong><br />

9 class hardware. The High Level Shading Language (HLSL) from Microsoft is<br />

used to simplify shader development.<br />

529



Camera Models and Depth of Field<br />

Computer images rendered with conventional methods look too sharp and lack<br />

the defects and artifacts of real cameras. <strong>With</strong>out these defects, it is hard to trick<br />

the eye into believing the images were captured by a real camera. Better camera<br />

models become even more important when computer-generated images have to<br />

be combined with ones produced by a real camera. The visual discrepancy mostly<br />

comes from the difference between physical cameras and the camera models normally<br />

used in computer graphics. Computer graphics generally implicitly use a<br />

pinhole camera model, while real cameras use lenses of finite dimensions.<br />

Pinhole Camera Model<br />

In the pinhole camera, light rays scattered from objects pass though an infinitely<br />

small pinhole lens. Only a single ray emanating from each point in the scene is<br />

allowed to pass though the pinhole. All rays going in other directions are ignored.<br />

Because only a single ray passes though the pinhole, only a single ray hits the<br />

imaging plane at any given point. This creates an image that is always in focus.<br />

Figure 2 illustrates the pinhole camera in action.<br />

Figure 2: Pinhole camera<br />

Thin Lens Camera Model<br />

In the real world, all lenses have finite dimensions and let through rays coming<br />

from multiple different directions. As a result, parts of the scene are sharp only if<br />

they are located at or near a specific focal distance. For a lens with focal length f, a sharp image of a given object is produced at the imaging plane offset from the lens by v, when the object is at the distance u from the lens. This is described by the thin lens equation:

    1/u + 1/v = 1/f

The distance from the image plane to the object in focus can be expressed as:

    z_focus = u + v

Figure 3 demonstrates how the thin lens camera works.


Figure 3: Thin lens camera<br />

Multiple rays scattered from a given point on an object will pass through the lens,<br />

forming the cone of light. If the object is in focus, all rays will converge at a single<br />

point on the image plane. However, if a given point on an object in the scene is<br />

not near the focal distance, the cone of light rays will intersect the image plane in<br />

an area shaped like a conic section. Typically, the conic section is approximated by<br />

a circle called the circle of confusion.<br />

The circle of confusion diameter b depends on the distance of the plane of<br />

focus and lens aperture setting a (also known as the f-stop). For a known focus<br />

distance and lens parameters, the size of the circle of confusion can be calculated<br />

as:<br />

    b = | D · f · (z_focus – z) / (z_focus · (z – f)) |

    where D = f / a is the lens diameter

Any circle of confusion greater than the smallest point a human eye can resolve<br />

contributes to the blurriness of the image that we see as depth of field.<br />
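These relationships are straightforward to evaluate in code. The helpers below are our own illustrative sketch of the thin lens math, not part of the article's sample code; the same circle of confusion expression appears later in the scene rendering pixel shader.

#include <cmath>

// Thin lens helpers. All distances are in the same units (e.g., meters).
// f      : focal length of the lens
// a      : aperture setting (f-stop), so the lens diameter is D = f / a
// zFocus : distance from the lens to the plane in focus
// z      : distance from the lens to the point being shaded

// Image-plane offset v for an object at distance u, from 1/u + 1/v = 1/f.
float ImagePlaneOffset(float f, float u)
{
    return 1.0f / (1.0f / f - 1.0f / u);
}

// Circle of confusion diameter for a point at depth z.
float CircleOfConfusion(float f, float a, float zFocus, float z)
{
    float D = f / a;    // lens diameter
    return std::fabs(D * f * (zFocus - z) / (zFocus * (z - f)));
}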

Overview of Depth of Field Techniques<br />


A number of techniques can be used to simulate depth of field in rendered scenes.<br />

One technique used in offline rendering employs distributed ray tracing. For each<br />

image point, multiple rays are shot through the lens. Coming from a single point<br />

of the image plane, these rays focus on the single point of the object if it is at the<br />

focal point. If the object is not in focus, the rays get scattered into the environment,<br />

which contributes to blurring. Because the rays accurately sample the surrounding<br />

environment, this method produces the most realistic depth of field<br />

effect, lacking many artifacts produced by other methods. The quality, however,<br />

comes at a cost, and this technique is unacceptable for real-time rendering.<br />

Another method involves the accumulation buffer. The accumulation buffer<br />

integrates images from multiple render passes. Each of the images is rendered<br />

from a slightly different position and direction within the virtual lens aperture.<br />

While less complex than ray tracing, this method is also quite expensive because<br />

images have to be rendered many times to achieve good visual results.



A cheaper and more reasonable alternative for real-time implementation is<br />

the post-processing method. Usually, this method involves two-pass rendering.<br />

On the first pass, the scene is rendered with some additional information, such as<br />

depth. On the second pass, some filter is run on the result of the first pass to blur<br />

the image. This article presents two variations of this general post-processing<br />

approach. Each version has some strengths and weaknesses and can produce<br />

high-quality photorealistic depth of field effects on <strong>DirectX</strong> 9 graphics hardware.<br />

Depth of Field Implementation via Simulation<br />

of Circle of Confusion<br />

The first implementation that we present is an extension to the post-processing<br />

method proposed by Potmesil and Chakravarty in [Potmesil83]. On the first pass,<br />

we render the scene, outputting the color as well as the information necessary to<br />

blur the image. On the second pass, we filter the image from the first pass with a<br />

variable-sized filter kernel to simulate the circle of confusion. A blurriness factor<br />

computed on the first pass controls the size of the filter kernel used in the second<br />

pass. Special measures are taken to eliminate the leaking of color of objects in<br />

focus onto backgrounds that have been blurred.<br />

Pass One: Scene Rendering<br />

First, the whole scene is rendered by outputting depth and blurriness factor, which<br />

is used to describe how much each pixel should be blurred, in addition to the<br />

resulting scene rendering color. This can be accomplished by rendering the scene<br />

to multiple buffers at one time. <strong>DirectX</strong> 9 has a useful feature called Multiple Render<br />

Targets (MRT) that allows simultaneous shader output into the multiple<br />

renderable buffers. Using this feature gives us the ability to output all of the data<br />

channels (scene color, depth, and blurriness factor) in our first pass. One of the<br />

MRT restrictions on some hardware is the requirement for all render surfaces to<br />

have the same bit depth while allowing use of different surface formats. Guided by<br />

this requirement, we can pick the D3DFMT_A8R8G8B8 format for the scene color<br />

output and the two-channel texture format D3DFMT_G16R16 format for the depth<br />

and blurriness factor. As shown in Figure 4, both formats are 32 bits per pixel and<br />

provide us with enough space for the necessary information at the desired<br />

precision.<br />

Figure 4: Pixel shader output for a scene rendering pass
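As an illustrative sketch of the application side, the two render targets could be created and bound as shown below. The function and variable names are our assumptions, a real application would create the textures once at startup rather than every frame, and the article's actual sample code may differ.

#include <d3d9.h>

// Sketch: create and bind the two MRT surfaces for the scene pass.
// 'pDevice' is an already-created IDirect3DDevice9; error checking omitted.
void BindSceneMRTs(IDirect3DDevice9* pDevice, UINT width, UINT height)
{
    IDirect3DTexture9 *pSceneTex = NULL, *pDepthBlurTex = NULL;
    IDirect3DSurface9 *pSceneSurf = NULL, *pDepthBlurSurf = NULL;

    // Scene color in A8R8G8B8, depth + blurriness factor in G16R16
    // (both are 32 bits per pixel, as required by the MRT restriction).
    pDevice->CreateTexture(width, height, 1, D3DUSAGE_RENDERTARGET,
                           D3DFMT_A8R8G8B8, D3DPOOL_DEFAULT, &pSceneTex, NULL);
    pDevice->CreateTexture(width, height, 1, D3DUSAGE_RENDERTARGET,
                           D3DFMT_G16R16, D3DPOOL_DEFAULT, &pDepthBlurTex, NULL);

    pSceneTex->GetSurfaceLevel(0, &pSceneSurf);
    pDepthBlurTex->GetSurfaceLevel(0, &pDepthBlurSurf);

    // COLOR0 and COLOR1 outputs of the scene pixel shader go to
    // render targets 0 and 1, respectively.
    pDevice->SetRenderTarget(0, pSceneSurf);
    pDevice->SetRenderTarget(1, pDepthBlurSurf);
}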


Scene Rendering Vertex <strong>Shader</strong><br />

The vertex shader for the scene rendering pass is just a regular vertex shader<br />

with one little addition: It outputs scene depth in the camera space. This depth<br />

value is later used in the pixel shader to compute blurriness factor.<br />

An example of a simple scene vertex shader is shown below:<br />

/////////////////////////////////////////////////////////////////////<br />

float3 lightPos; // light position in model space<br />

float4 mtrlAmbient;<br />

float4 mtrlDiffuse;<br />

matrix matWorldViewProj;<br />

matrix matWorldView;<br />

/////////////////////////////////////////////////////////////////////<br />

struct VS_INPUT

{<br />

float4 vPos: POSITION;<br />

float3 vNorm: NORMAL;<br />

float2 vTexCoord: TEXCOORD0;<br />

};<br />

struct VS_OUTPUT

{<br />

float4 vPos: POSITION;<br />

float4 vColor: COLOR0;<br />

float fDepth: TEXCOORD0;<br />

float2 vTexCoord: TEXCOORD1;<br />

};<br />

/////////////////////////////////////////////////////////////////////<br />

VS_OUTPUT scene_shader_vs(VS_INPUT v)
{
VS_OUTPUT o = (VS_OUTPUT)0;

float4 vPosWV;<br />

float3 vNorm;<br />

float3 vLightDir;<br />

// Transform position<br />

o.vPos = mul(v.vPos, matWorldViewProj);<br />

// Position in camera space<br />

vPosWV = mul(v.vPos, matWorldView);<br />

// Output depth in camera space<br />

o.fDepth = vPosWV.z;<br />

// Compute diffuse lighting<br />

vLightDir = normalize(lightPos - v.vPos);<br />

vNorm = normalize(v.vNorm);
o.vColor = dot(vNorm, vLightDir) * mtrlDiffuse + mtrlAmbient;

// Output texture coordinates
o.vTexCoord = v.vTexCoord;

return o;
}

Scene Rendering Pixel <strong>Shader</strong><br />

The pixel shader of the scene rendering pass needs to compute the blurriness factor<br />

and output it along with the scene depth and color. To abstract from the different<br />

display sizes and resolutions, the blurriness is defined to lie in the [0..1]<br />

range. A value of zero means the pixel is perfectly sharp, while a value of one corresponds<br />

to the pixel of the maximal circle of confusion size. The reason behind<br />

using the [0..1] range is twofold. First, the blurriness is not expressed in terms of<br />

pixels and can scale with resolution during the post-processing step. Second, the<br />

values can be directly used as sample weights when eliminating “bleeding”<br />

artifacts.<br />

For each pixel of a scene, this shader computes the circle of confusion size<br />

based on the formula provided in the preceding discussion of the thin lens model.<br />

Later in the process, the size of the circle of confusion is scaled by the factor corresponding<br />

to the size of the circle in pixels for a given resolution and display size.<br />

As a last step, the blurriness value is divided by a maximal desired circle of confusion<br />

size in pixels (variable maxCoC) and clamped to the [0..1] range. Sometimes it<br />

might be necessary to limit the circle of confusion size (through the variable<br />

maxCoC) to reasonable values (i.e., ten pixels) to avoid sampling artifacts caused by<br />

an insufficient number of filter taps.<br />

An example of a scene pixel shader that can be compiled to the ps 2.0 shader<br />

model is shown below:<br />

/////////////////////////////////////////////////////////////////////<br />

float focalLen;<br />

float Dlens;<br />

float Zfocus;<br />

float maxCoC;<br />

float scale;<br />

sampler TexSampler;<br />

float sceneRange;<br />

/////////////////////////////////////////////////////////////////////<br />

struct PS_INPUT

{<br />

float4 vColor: COLOR0;<br />

float fDepth: TEXCOORD0;<br />

float2 vTexCoord: TEXCOORD1;<br />

};


struct PS_OUTPUT

{<br />

float4 vColor: COLOR0;<br />

float4 vDoF: COLOR1;<br />

};<br />

/////////////////////////////////////////////////////////////////////<br />

PS_OUTPUT scene_shader_ps(PS_INPUT v)
{
PS_OUTPUT o = (PS_OUTPUT)0;

// Output color<br />

o.vColor = v.vColor * tex2D(TexSampler, v.vTexCoord);<br />

// Compute blur factor based on the CoC size scaled and<br />

// normalized to the [0..1] range<br />

float pixCoC = abs(Dlens * focalLen * (Zfocus - v.fDepth) /<br />

(Zfocus * (v.fDepth - focalLen)));<br />

float blur = saturate(pixCoC * scale / maxCoC);<br />

// Depth/blurriness value scaled to the [0..1] range<br />

o.vDoF = float4(v.fDepth / sceneRange, blur, 0, 0);<br />

return o;
}

Pass Two: Post-processing<br />


During the post-processing pass, the results of the previous rendering are processed,<br />

and the color image is blurred based on the blurriness factor computed in<br />

the first pass. Blurring is performed using a variable-sized filter representing the<br />

circle of confusion. To perform image filtering, a simple screen-aligned quadrilateral<br />

is drawn, textured with the results of the first pass. Figure 5 shows the<br />

quad’s texture coordinates and screen positions for a render target of W×H<br />

dimensions. The quad corner positions are shifted by –0.5 pixels to properly align<br />

texels to pixels.<br />

Figure 5: Texture coordinates and vertex positions for screen space quad
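The viewportScale and viewportBias constants used by the vertex shader below are not listed in the text, so the following sketch shows one plausible way to set them up for the half-pixel shift described above. The constant values and helper name are our assumptions rather than the sample's actual code.

#include <d3dx9.h>

// Sketch: feed viewportScale/viewportBias to the post-processing vertex
// shader so the clip-space quad is shifted by -0.5 pixels and texel centers
// line up with pixel centers on DirectX 9.
void SetQuadAlignment(ID3DXEffect* pEffect, DWORD dwRTWidth, DWORD dwRTHeight)
{
    // One pixel spans 2/W (or 2/H) in clip space, so half a pixel is 1/W (1/H).
    D3DXVECTOR4 viewportScale(1.0f, 1.0f, 1.0f, 1.0f);
    D3DXVECTOR4 viewportBias(-1.0f / (FLOAT)dwRTWidth,
                              1.0f / (FLOAT)dwRTHeight, 0.0f, 0.0f);

    pEffect->SetVector("viewportScale", &viewportScale);
    pEffect->SetVector("viewportBias",  &viewportBias);
}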



This vertex shader is designed for a vs 1.1 compilation target.<br />

//////////////////////////////////////////////////////////////////////<br />

float4 viewportScale;<br />

float4 viewportBias;<br />

//////////////////////////////////////////////////////////////////////<br />

struct VS INPUT<br />

{<br />

float4 vPos: POSITION;<br />

float2 vTexCoord: TEXCOORD;<br />

};<br />

struct VS OUTPUT<br />

{<br />

float4 vPos: POSITION;<br />

float2 vTexCoord: TEXCOORD0;<br />

};<br />

//////////////////////////////////////////////////////////////////////<br />

VS OUTPUT dof filter vs(VS INPUT v)<br />

{<br />

VS OUTPUT o=(VSOUTPUT)0;<br />

}<br />

// Scale and bias viewport<br />

o.vPos = v.vPos * viewportScale + viewportBias;<br />

// Pass through the texture coordinates<br />

o. vTexCoord = v.vTexCoord;<br />

return o;<br />

Post-processing Pixel Shader

The filter kernel in the post-processing step has 13 samples — a center sample and 12 outer samples, as shown in Figure 6. The number of samples was dictated by practical reasons of real-time implementation and represents the maximum number of samples that can be processed by a 2.0 pixel shader in a single pass. The center tap is aligned with the pixel being filtered, while the outer taps are sampled from nearby pixels. The filter uses stochastic sampling, and the outer samples are aligned in the filter according to a Poisson disk distribution. Other sample patterns can be used to achieve specific artistic results, as presented later in the "Bokeh" section.

Figure 6: Depth of field filter kernel

The filter size is computed per-pixel from the blurriness value of the center sample and the maximum allowable circle of confusion size. Figure 7 shows the relationship between blurriness and filter kernel size.

Figure 7: Relationship between blurriness and filter size

The post-processing pixel shader computes filter sample positions based on 2D offsets stored in the filterTaps array and the size of the circle of confusion. The 2D offsets are locations of taps for the filter of one pixel in diameter. The following code shows how these values can be initialized in the program, according to the render target resolution.
void SetupFilterKernel()
{
   // Scale tap offsets based on render target size
   FLOAT dx = 0.5f / (FLOAT)dwRTWidth;
   FLOAT dy = 0.5f / (FLOAT)dwRTHeight;

   D3DXVECTOR4 v[12];
   v[0]  = D3DXVECTOR4(-0.326212f * dx, -0.40581f  * dy, 0.0f, 0.0f);
   v[1]  = D3DXVECTOR4(-0.840144f * dx, -0.07358f  * dy, 0.0f, 0.0f);
   v[2]  = D3DXVECTOR4(-0.695914f * dx,  0.457137f * dy, 0.0f, 0.0f);
   v[3]  = D3DXVECTOR4(-0.203345f * dx,  0.620716f * dy, 0.0f, 0.0f);
   v[4]  = D3DXVECTOR4( 0.96234f  * dx, -0.194983f * dy, 0.0f, 0.0f);
   v[5]  = D3DXVECTOR4( 0.473434f * dx, -0.480026f * dy, 0.0f, 0.0f);
   v[6]  = D3DXVECTOR4( 0.519456f * dx,  0.767022f * dy, 0.0f, 0.0f);
   v[7]  = D3DXVECTOR4( 0.185461f * dx, -0.893124f * dy, 0.0f, 0.0f);
   v[8]  = D3DXVECTOR4( 0.507431f * dx,  0.064425f * dy, 0.0f, 0.0f);
   v[9]  = D3DXVECTOR4( 0.89642f  * dx,  0.412458f * dy, 0.0f, 0.0f);
   v[10] = D3DXVECTOR4(-0.32194f  * dx, -0.932615f * dy, 0.0f, 0.0f);
   v[11] = D3DXVECTOR4(-0.791559f * dx, -0.59771f  * dy, 0.0f, 0.0f);

   // Set array of offsets
   pEffect->SetVectorArray("filterTaps", v, 12);
}

Once sample positions are computed, the filter averages color from its samples to derive the blurred color. When the blurriness value is close to zero, all samples come from the same pixel and no blurring happens. As the blurriness factor increases, the filter will start sampling from more and more neighboring pixels, thus increasingly blurring the image. All images are sampled with D3DTEXF_LINEAR filtering. Using linear filtering is not very accurate on the edges of objects where depth might abruptly change; however, it produces better overall quality images in practice.
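For completeness, a minimal sketch of the corresponding sampler setup (the sampler stage indices are illustrative only; the actual binding depends on the effect framework in use):

// Hedged sketch: bilinear filtering on the samplers used by the DoF filter.
void SetDofSamplerStates(IDirect3DDevice9* pDevice)
{
    for (DWORD stage = 0; stage < 2; ++stage) // scene color and depth/blur samplers
    {
        pDevice->SetSamplerState(stage, D3DSAMP_MAGFILTER, D3DTEXF_LINEAR);
        pDevice->SetSamplerState(stage, D3DSAMP_MINFILTER, D3DTEXF_LINEAR);
        pDevice->SetSamplerState(stage, D3DSAMP_MIPFILTER, D3DTEXF_NONE);
        pDevice->SetSamplerState(stage, D3DSAMP_ADDRESSU,  D3DTADDRESS_CLAMP);
        pDevice->SetSamplerState(stage, D3DSAMP_ADDRESSV,  D3DTADDRESS_CLAMP);
    }
}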

One of the problems commonly associated with all post-filtering methods is leaking of color from sharp objects onto the blurry backgrounds. This results in faint halos around sharp objects, as can be seen on the left side of Figure 8. The color leaking happens because the filter for the blurry background will sample color from the sharp object in the vicinity due to the large filter size. To solve this problem, we will discard the outer samples that can contribute to leaking according to the following criteria: If the outer sample is in focus and it is in front of the blurry center sample, it should not contribute to the blurred color. This can introduce a minor popping effect when objects go in or out of focus. To combat sample popping, the outer sample blurriness factor is used as a sample weight to fade out its contribution gradually. The right side of Figure 8 shows a portion of a scene fragment with color leaking eliminated.

Figure 8: Elimination of color leaking (left: leaking of sharp objects; right: sharp objects without color leaking)

Below, we show a depth of field pixel shader that implements the concepts discussed above. This shader can be compiled to the 2.0 pixel shader model.

//////////////////////////////////////////////////////////////////////
#define NUM_DOF_TAPS 12

float maxCoC;
float2 filterTaps[NUM_DOF_TAPS];

sampler SceneColorSampler;
sampler DepthBlurSampler;

struct PS_INPUT
{
   float2 vTexCoord: TEXCOORD;
};
//////////////////////////////////////////////////////////////////////
float4 dof_filter_ps(PS_INPUT v) : COLOR
{
   // Start with center sample color
   float4 colorSum = tex2D(SceneColorSampler, v.vTexCoord);
   float totalContribution = 1.0f;

   // Depth and blurriness values for center sample
   float2 centerDepthBlur = tex2D(DepthBlurSampler, v.vTexCoord);

   // Compute CoC size based on blurriness
   float sizeCoC = centerDepthBlur.y * maxCoC;

   // Run through all filter taps
   for (int i = 0; i < NUM_DOF_TAPS; i++)
   {
      // Compute tap texture coordinates from the tap offset and CoC size
      float2 tapCoord = v.vTexCoord + filterTaps[i] * sizeCoC;

      // Fetch tap color and depth/blurriness values
      float4 tapColor = tex2D(SceneColorSampler, tapCoord);
      float2 tapDepthBlur = tex2D(DepthBlurSampler, tapCoord);

      // Compute tap contribution: taps behind the center sample always
      // contribute; taps in front are weighted by their blurriness to
      // avoid color leaking from sharp foreground objects
      float tapContribution = (tapDepthBlur.x > centerDepthBlur.x) ?
         1.0f : tapDepthBlur.y;

      // Accumulate color and sample contribution
      colorSum += tapColor * tapContribution;
      totalContribution += tapContribution;
   }

   // Normalize color sum
   float4 finalColor = colorSum / totalContribution;
   return finalColor;
}

Now that we have discussed our implementation, which models the circle of confusion with a variable-sized stochastic filter kernel, we can describe an implementation that is based on a separable Gaussian filter.
Depth of Field Rendering by Blurring with Separable Gaussian Filter

This separable Gaussian filter approach differs from the previous approach of simulating depth of field in two ways. First, it does not utilize multiple render targets for outputting depth information. Second, to simulate the blurring that occurs in depth of field, we apply a Gaussian filter during the post-processing stage instead of simulating the circle of confusion of a physical camera lens.

Implementation Overview

In this method, we first render the scene at full resolution to an offscreen buffer, outputting depth information for each pixel to the alpha channel of that buffer. We then downsample this fully rendered scene into an image one-fourth the size (half in x and half in y) of the original. Next, we perform blurring of the downsampled scene in two passes by running the image through two passes of a separable Gaussian filter — first along the x axis and then along the y axis. On the final pass, we blend between the original full resolution rendering of our scene and the blurred post-processed image based on the distance of each pixel from the specified focal plane stored in the downsampled image. The intermediate filtering results are stored in 16-bit per-channel integer format (D3DFMT_A16B16G16R16) for extra precision. Let's discuss this method in more detail, going step by step through the different rendering passes and shaders used.
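To make the resource requirements concrete, the following sketch shows one way the offscreen render targets described above could be created. The function and variable names are ours, and the choice of D3DFMT_A8R8G8B8 for the full-resolution scene target is an assumption; only the 16-bit per-channel format for the intermediate targets comes from the text:

// Hedged sketch: creating the offscreen render targets used by this method.
HRESULT CreateDofRenderTargets(IDirect3DDevice9* pDevice, UINT width, UINT height,
                               IDirect3DTexture9** ppScene,      // full-resolution scene, blurriness in alpha
                               IDirect3DTexture9** ppDownsample, // quarter-size scene
                               IDirect3DTexture9** ppBlurredX,   // after the X blur pass
                               IDirect3DTexture9** ppBlurredXY)  // after the Y blur pass
{
    HRESULT hr = pDevice->CreateTexture(width, height, 1, D3DUSAGE_RENDERTARGET,
                                        D3DFMT_A8R8G8B8, D3DPOOL_DEFAULT, ppScene, NULL);
    if (FAILED(hr)) return hr;

    const UINT w2 = width / 2, h2 = height / 2;
    IDirect3DTexture9** smallTargets[3] = { ppDownsample, ppBlurredX, ppBlurredXY };
    for (int i = 0; i < 3; ++i)
    {
        hr = pDevice->CreateTexture(w2, h2, 1, D3DUSAGE_RENDERTARGET,
                                    D3DFMT_A16B16G16R16, D3DPOOL_DEFAULT,
                                    smallTargets[i], NULL);
        if (FAILED(hr)) return hr;
    }
    return S_OK;
}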

Pass One: Scene Rendering

During the scene rendering pass, we render the scene to the full resolution offscreen buffer, computing color information and a depth falloff value for each pixel. The depth falloff value will determine how much each pixel will be blurred during the subsequent post-processing stage. The distance from the focal plane is output to the alpha channel of the offscreen buffer.

Scene Rendering Vertex Shader

We compute the depth falloff value and the distance from the focal plane in the vertex shader. First, we determine the distance of each vertex from the focal plane in view space. We then output the distance from the focal plane, scaled to the [0..1] range, into a texture coordinate interpolator. This is illustrated in the following vertex shader, compiled to vertex shader target vs 1.1:

/////////////////////////////////////////////////////////////////////
float3 lightPos; // light position in model space
float4 mtrlAmbient;
float4 mtrlDiffuse;

matrix matWorldViewProj;
matrix matWorldView;

float fFocalDistance;
float fFocalRange;
/////////////////////////////////////////////////////////////////////
struct VS_INPUT
{
   float4 vPos:      POSITION;
   float3 vNorm:     NORMAL;
   float2 vTexCoord: TEXCOORD0;
};

struct VS_OUTPUT
{
   float4 vPos:      POSITION;
   float4 vColor:    COLOR0;
   float  fBlur:     TEXCOORD0;
   float2 vTexCoord: TEXCOORD1;
};
/////////////////////////////////////////////////////////////////////
VS_OUTPUT scene_shader_vs(VS_INPUT v)
{
   VS_OUTPUT o = (VS_OUTPUT)0;
   float4 vPosWV;
   float3 vNorm;
   float3 vLightDir;

   // Transform position
   o.vPos = mul(v.vPos, matWorldViewProj);

   // Position in camera space
   vPosWV = mul(v.vPos, matWorldView);

   // Normalized distance to focal plane in camera space,
   // used as a measure of blurriness for depth of field
   o.fBlur = saturate(abs(vPosWV.z - fFocalDistance) / fFocalRange);

   // Compute diffuse lighting
   vLightDir = normalize(lightPos - v.vPos);
   vNorm = normalize(v.vNorm);
   o.vColor = dot(vNorm, vLightDir) * mtrlDiffuse + mtrlAmbient;

   // Output texture coordinates
   o.vTexCoord = v.vTexCoord;

   return o;
}
Scene Rendering Pixel Shader

In the pixel shader, we render our scene as desired. The alpha channel receives the blurriness value expressed as the distance from the focal plane. This pixel shader is designed to be compiled into a ps 2.0 target.

/////////////////////////////////////////////////////////////////////
sampler TexSampler;
/////////////////////////////////////////////////////////////////////
struct PS_INPUT
{
   float4 vColor:    COLOR0;
   float  fBlur:     TEXCOORD0;
   float2 vTexCoord: TEXCOORD1;
};
/////////////////////////////////////////////////////////////////////
float4 scene_shader_ps(PS_INPUT v) : COLOR
{
   float3 vColor;

   // Output color
   vColor = v.vColor * tex2D(TexSampler, v.vTexCoord);

   // Output blurriness in alpha
   return float4(vColor, v.fBlur);
}
Pass Two: Downsampling

To downsample the full resolution image, we simply render a quad one-fourth the size of the original image while sampling from the original image and outputting it to the smaller offscreen buffer. The alpha channel of the downsampled image receives a blurriness value computed as the scaled distance from the focus plane for each pixel. This information will be used during post-processing to control the amount of blurring applied to the downsampled image as well as to blend between a blurred image of the scene and the original rendering to simulate the effect of depth of field.

Downsampling Vertex Shader

In this simple vertex shader, we transform the vertices into clip space and propagate incoming texture coordinates to the pixel shader. Note that at this point, the incoming model must be a screen-aligned quad of dimensions one-fourth the size of the original image.

matrix matWorldViewProj;
//////////////////////////////////////////////////////////////////////
struct VS_OUTPUT
{
   float4 vPos: POSITION;
   float2 vTex: TEXCOORD0;
};
//////////////////////////////////////////////////////////////////////
VS_OUTPUT main(float4 Pos: POSITION, float2 Tex: TEXCOORD0)
{
   VS_OUTPUT o = (VS_OUTPUT)0;

   // Output transformed vertex position:
   o.vPos = mul(matWorldViewProj, Pos);

   // Propagate texture coordinate to the pixel shader
   o.vTex = Tex;

   return o;
}

Downsampling Pixel Shader

In the pixel shader for the downsampling pass, we sample the original scene rendering using texture coordinates from the smaller screen-aligned quad and store the results in an offscreen render target. This pixel shader can be compiled to ps 1.4 or above.

//////////////////////////////////////////////////////////////////////
sampler renderTexture;
//////////////////////////////////////////////////////////////////////
float4 main(float2 Tex: TEXCOORD0) : COLOR
{
   // Downsample rendered scene:
   return tex2D(renderTexture, Tex);
}
Post-processing for Simulation of Depth of Field

One of the most frequently used filters for performing smoothing of an image is the Gaussian filter (see Figure 9). Typically, the filter is applied in the following way:

F = \frac{1}{S} \sum_{i=1}^{n} \sum_{j=1}^{n} P_{ij} C_{ij}

...where F is the filtered value of the target pixel, P is a pixel in the 2D grid, C is a coefficient in the 2D Gaussian matrix, n is the vertical/horizontal dimensions of the matrix, and S is the sum of all values in the Gaussian matrix.

Figure 9: Gaussian filter kernel

Once a suitable kernel has been calculated, Gaussian smoothing can be performed using standard convolution methods. The convolution can in fact be performed fairly quickly since the equation for the 2D isotropic Gaussian is separable into x and y components. Thus, the 2D convolution can be performed by first convolving with a 1D Gaussian in the x direction and then convolving with another 1D Gaussian in the y direction. This allows us to apply a larger size filter to the input image in two successive passes of 1D filters. We perform this operation by rendering into a temporary buffer and sampling a line (or a column, for y axis filtering) of texels in each of the passes.

The size of the downsampled buffer determines the size of texels used for controlling sampling points for the Gaussian filter taps. This can be precomputed as a constant to the shader ahead of time. The following is an example of how the filter tap offset can be computed.
void SetupFilterKernel()
{
   // Scale tap offsets based on render target size
   FLOAT dx = 1.0f / (FLOAT)dwRTWidth;
   FLOAT dy = 1.0f / (FLOAT)dwRTHeight;

   D3DXVECTOR4 v[7];
   v[0] = D3DXVECTOR4(0.0f, 0.0f, 0.0f, 0.0f);
   v[1] = D3DXVECTOR4(1.3366f  * dx, 0.0f, 0.0f, 0.0f);
   v[2] = D3DXVECTOR4(3.4295f  * dx, 0.0f, 0.0f, 0.0f);
   v[3] = D3DXVECTOR4(5.4264f  * dx, 0.0f, 0.0f, 0.0f);
   v[4] = D3DXVECTOR4(7.4359f  * dx, 0.0f, 0.0f, 0.0f);
   v[5] = D3DXVECTOR4(9.4436f  * dx, 0.0f, 0.0f, 0.0f);
   v[6] = D3DXVECTOR4(11.4401f * dx, 0.0f, 0.0f, 0.0f);

   // Set array of horizontal offsets for X-pass
   m_pEffect->SetVectorArray("horzTapOffs", v, 7);

   v[0] = D3DXVECTOR4(0.0f, 0.0f, 0.0f, 0.0f);
   v[1] = D3DXVECTOR4(0.0f, 1.3366f  * dy, 0.0f, 0.0f);
   v[2] = D3DXVECTOR4(0.0f, 3.4295f  * dy, 0.0f, 0.0f);
   v[3] = D3DXVECTOR4(0.0f, 5.4264f  * dy, 0.0f, 0.0f);
   v[4] = D3DXVECTOR4(0.0f, 7.4359f  * dy, 0.0f, 0.0f);
   v[5] = D3DXVECTOR4(0.0f, 9.4436f  * dy, 0.0f, 0.0f);
   v[6] = D3DXVECTOR4(0.0f, 11.4401f * dy, 0.0f, 0.0f);

   // Set array of vertical offsets for Y-pass
   m_pEffect->SetVectorArray("vertTapOffs", v, 7);
}
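For reference, the sketch below shows one way such merged-tap offsets and weights could be derived from a discrete 1D Gaussian, exploiting bilinear filtering to fetch two adjacent texels with a single tap. This is our own illustration; the article's exact constants may have been produced or tuned differently:

// Hedged sketch: deriving paired-tap offsets and weights from a 1D Gaussian.
// Two adjacent texels (at distances d and d+1) are merged into one bilinear
// fetch positioned at their Gaussian-weighted centroid.
#include <cmath>
#include <vector>

struct Tap { float offset; float weight; };

std::vector<Tap> BuildGaussianTaps(int pairs, float sigma)
{
    std::vector<Tap> taps;
    taps.push_back({ 0.0f, 1.0f });   // center texel (unnormalized weight)
    float total = 1.0f;

    for (int p = 0; p < pairs; ++p)
    {
        float d0 = 1.0f + 2.0f * p;   // first texel of the pair
        float d1 = d0 + 1.0f;         // second texel of the pair
        float g0 = std::exp(-d0 * d0 / (2.0f * sigma * sigma));
        float g1 = std::exp(-d1 * d1 / (2.0f * sigma * sigma));

        // Bilinear fetch between the two texels, weighted by their Gaussian values
        float offset = (d0 * g0 + d1 * g1) / (g0 + g1);
        float weight = g0 + g1;

        taps.push_back({ offset, weight });   // positive side
        total += 2.0f * weight;               // the negative side mirrors it
    }

    for (Tap& t : taps)                       // normalize so the kernel sums to 1
        t.weight /= total;
    return taps;
}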

Pass Three: Separable Gaussian Filtering in X Axis

First, we perform Gaussian filter blurring along the x axis of the downsampled image. For each pixel in the downsampled image, we sample n texture samples dynamically along the x axis in the following manner:

Figure 10: Samples for applying 1D Gaussian filter

The center sample and the inner taps of the filter are done with interpolated texture coordinates, which are computed in the vertex shader. To compute the offsets for the first seven samples, we use input texture coordinates and the precomputed tap offsets based on the image resolution.

In the pixel shader, we sample the image for the center tap and the first six inner taps, using nearest filtering for the center sample and bilinear sampling for the inner samples.

The pixel shader code derives the texture coordinates for the outer samples based on precomputed deltas from the location of the center sample. The outer samples are fetched via dependent reads, as texture coordinates are derived in the pixel shader itself.

All samples are weighted based on the predefined weight thresholds and blurriness values and added together. This results in a weighted sum of 25 texels from the source image, which is large enough to allow us to create a convincing blurring effect for simulating depth of field without violating the maximum number of instructions for the 2.0 pixel shader.

Note that the output of this pass is directed to a separate offscreen buffer. At this point we have used three separate offscreen render targets: one to output results of the full scene rendering, one to output results of the downsampling pass, and one to output the results of Gaussian blurring.
Vertex Shader for X Axis of Separable Gaussian Filter

This vertex shader is designed for a vs 1.1 compilation target.

//////////////////////////////////////////////////////////////////////
float4 viewportScale;
float4 viewportBias;

// Offsets 0-3 used by vertex shader, 4-6 by pixel shader
float2 horzTapOffs[7];
//////////////////////////////////////////////////////////////////////
struct VS_INPUT
{
   float4 vPos:      POSITION;
   float2 vTexCoord: TEXCOORD;
};

struct VS_OUTPUT_TEX7
{
   float4 vPos:     POSITION;
   float2 vTap0:    TEXCOORD0;
   float2 vTap1:    TEXCOORD1;
   float2 vTap2:    TEXCOORD2;
   float2 vTap3:    TEXCOORD3;
   float2 vTap1Neg: TEXCOORD4;
   float2 vTap2Neg: TEXCOORD5;
   float2 vTap3Neg: TEXCOORD6;
};
//////////////////////////////////////////////////////////////////////
VS_OUTPUT_TEX7 filter_gaussian_x_vs(VS_INPUT v)
{
   VS_OUTPUT_TEX7 o = (VS_OUTPUT_TEX7)0;

   // Scale and bias viewport
   o.vPos = v.vPos * viewportScale + viewportBias;

   // Compute tap coordinates
   o.vTap0    = v.vTexCoord;
   o.vTap1    = v.vTexCoord + horzTapOffs[1];
   o.vTap2    = v.vTexCoord + horzTapOffs[2];
   o.vTap3    = v.vTexCoord + horzTapOffs[3];
   o.vTap1Neg = v.vTexCoord - horzTapOffs[1];
   o.vTap2Neg = v.vTexCoord - horzTapOffs[2];
   o.vTap3Neg = v.vTexCoord - horzTapOffs[3];

   return o;
}
Pixel Shader for X Axis of Separable Gaussian Filter

This pixel shader is fine-tuned to compile for a ps 2.0 compilation target.

//////////////////////////////////////////////////////////////////////
// Thresholds for computing sample weights
float4 vThresh0 = {0.1, 0.3, 0.5, -0.01};
float4 vThresh1 = {0.6, 0.7, 0.8, 0.9};

sampler renderTexture;

// Offsets 0-3 used by vertex shader, 4-6 by pixel shader
float2 horzTapOffs[7];
//////////////////////////////////////////////////////////////////////
struct PS_INPUT_TEX7
{
   float2 vTap0:    TEXCOORD0;
   float2 vTap1:    TEXCOORD1;
   float2 vTap2:    TEXCOORD2;
   float2 vTap3:    TEXCOORD3;
   float2 vTap1Neg: TEXCOORD4;
   float2 vTap2Neg: TEXCOORD5;
   float2 vTap3Neg: TEXCOORD6;
};
//////////////////////////////////////////////////////////////////////
float4 filter_gaussian_x_ps(PS_INPUT_TEX7 v) : COLOR
{
   // Samples
   float4 s0, s1, s2, s3, s4, s5, s6;
   float4 vWeights4;
   float3 vWeights3;

   // Accumulated color and weights
   float3 vColorSum;
   float fWeightSum;

   // Sample taps with coordinates from VS
   s0 = tex2D(renderTexture, v.vTap0);
   s1 = tex2D(renderTexture, v.vTap1);
   s2 = tex2D(renderTexture, v.vTap2);
   s3 = tex2D(renderTexture, v.vTap3);
   s4 = tex2D(renderTexture, v.vTap1Neg);
   s5 = tex2D(renderTexture, v.vTap2Neg);
   s6 = tex2D(renderTexture, v.vTap3Neg);

   // Compute weights for first 4 samples (including center tap)
   // by thresholding blurriness (in sample alpha)
   vWeights4.x = saturate(s1.a - vThresh0.x);
   vWeights4.y = saturate(s2.a - vThresh0.y);
   vWeights4.z = saturate(s3.a - vThresh0.z);
   vWeights4.w = saturate(s0.a - vThresh0.w);

   // Accumulate weighted samples
   vColorSum = s0 * vWeights4.x + s1 * vWeights4.y +
               s2 * vWeights4.z + s3 * vWeights4.w;

   // Sum weights using DOT
   fWeightSum = dot(vWeights4, 1);

   // Compute weights for three remaining samples
   vWeights3.x = saturate(s4.a - vThresh0.x);
   vWeights3.y = saturate(s5.a - vThresh0.y);
   vWeights3.z = saturate(s6.a - vThresh0.z);

   // Accumulate weighted samples
   vColorSum += s4 * vWeights3.x + s5 * vWeights3.y +
                s6 * vWeights3.z;

   // Sum weights using DOT
   fWeightSum += dot(vWeights3, 1);

   // Compute tex coords for other taps
   float2 vTap4    = v.vTap0 + horzTapOffs[4];
   float2 vTap5    = v.vTap0 + horzTapOffs[5];
   float2 vTap6    = v.vTap0 + horzTapOffs[6];
   float2 vTap4Neg = v.vTap0 - horzTapOffs[4];
   float2 vTap5Neg = v.vTap0 - horzTapOffs[5];
   float2 vTap6Neg = v.vTap0 - horzTapOffs[6];

   // Sample the taps
   s0 = tex2D(renderTexture, vTap4);
   s1 = tex2D(renderTexture, vTap5);
   s2 = tex2D(renderTexture, vTap6);
   s3 = tex2D(renderTexture, vTap4Neg);
   s4 = tex2D(renderTexture, vTap5Neg);
   s5 = tex2D(renderTexture, vTap6Neg);

   // Compute weights for three samples
   vWeights3.x = saturate(s0.a - vThresh1.x);
   vWeights3.y = saturate(s1.a - vThresh1.y);
   vWeights3.z = saturate(s2.a - vThresh1.z);

   // Accumulate weighted samples
   vColorSum += s0 * vWeights3.x + s1 * vWeights3.y +
                s2 * vWeights3.z;

   // Sum weights using DOT
   fWeightSum += dot(vWeights3, 1);

   // Compute weights for 3 samples
   vWeights3.x = saturate(s3.a - vThresh1.x);
   vWeights3.y = saturate(s4.a - vThresh1.y);
   vWeights3.z = saturate(s5.a - vThresh1.z);

   // Accumulate weighted samples
   vColorSum += s3 * vWeights3.x + s4 * vWeights3.y +
                s5 * vWeights3.z;

   // Sum weights using DOT
   fWeightSum += dot(vWeights3, 1);

   // Divide weighted sum of samples by sum of all weights
   vColorSum /= fWeightSum;

   // Color and weights sum output scaled (by 1/256)
   // to fit values in 16 bit 0 to 1 range
   return float4(vColorSum, fWeightSum) * 0.00390625;
}
Pass Four: Separable Gaussian Filtering in Y Axis

In the next pass, we perform a similar operation but with blurring along the y axis. The input to this pass is the image that we just blurred along the x axis. The output of this pass is directed to an offscreen render target (blurredXYTexture), which is going to be used during final image compositing.

Vertex Shader for Y Axis of Separable Gaussian Filter

In the vertex shader we again compute the first set of texture sample offsets to be used in the pixel shader for sampling the pre-blurred image. This vertex shader uses exactly the same approach as the vertex shader in the previous pass but with different offset values. This particular vertex shader is designed for a vs 1.1 compilation target.

//////////////////////////////////////////////////////////////////////
float4 viewportScale;
float4 viewportBias;

// Offsets 0-3 used by vertex shader, 4-6 by pixel shader
float2 vertTapOffs[7];
//////////////////////////////////////////////////////////////////////
struct VS_INPUT
{
   float4 vPos:      POSITION;
   float2 vTexCoord: TEXCOORD;
};

struct VS_OUTPUT_TEX7
{
   float4 vPos:     POSITION;
   float2 vTap0:    TEXCOORD0;
   float2 vTap1:    TEXCOORD1;
   float2 vTap2:    TEXCOORD2;
   float2 vTap3:    TEXCOORD3;
   float2 vTap1Neg: TEXCOORD4;
   float2 vTap2Neg: TEXCOORD5;
   float2 vTap3Neg: TEXCOORD6;
};
//////////////////////////////////////////////////////////////////////
VS_OUTPUT_TEX7 filter_gaussian_y_vs(VS_INPUT v)
{
   VS_OUTPUT_TEX7 o = (VS_OUTPUT_TEX7)0;

   // Scale and bias viewport
   o.vPos = v.vPos * viewportScale + viewportBias;

   // Compute tap coordinates
   o.vTap0    = v.vTexCoord;
   o.vTap1    = v.vTexCoord + vertTapOffs[1];
   o.vTap2    = v.vTexCoord + vertTapOffs[2];
   o.vTap3    = v.vTexCoord + vertTapOffs[3];
   o.vTap1Neg = v.vTexCoord - vertTapOffs[1];
   o.vTap2Neg = v.vTexCoord - vertTapOffs[2];
   o.vTap3Neg = v.vTexCoord - vertTapOffs[3];

   return o;
}

Pixel Shader for Y Axis of Separable Gaussian Filter

Similar to processing the image in the previous pass, we again sample the first seven samples along the y axis using interpolated texture offsets and combine these samples using appropriate kernel weights. Then we compute the next six offset coordinates and sample the image using dependent texture reads. Finally, we combine all weighted samples and output the value into an offscreen buffer. This pixel shader is compiled to a ps 2.0 target.
//////////////////////////////////////////////////////////////////////
float4 vWeights0 = {0.080, 0.075, 0.070, 0.100};
float4 vWeights1 = {0.065, 0.060, 0.055, 0.050};

sampler blurredXTexture;

// Offsets 0-3 used by vertex shader, 4-6 by pixel shader
float2 vertTapOffs[7];
//////////////////////////////////////////////////////////////////////
struct PS_INPUT_TEX7
{
   float2 vTap0:    TEXCOORD0;
   float2 vTap1:    TEXCOORD1;
   float2 vTap2:    TEXCOORD2;
   float2 vTap3:    TEXCOORD3;
   float2 vTap1Neg: TEXCOORD4;
   float2 vTap2Neg: TEXCOORD5;
   float2 vTap3Neg: TEXCOORD6;
};
//////////////////////////////////////////////////////////////////////
float4 filter_gaussian_y_ps(PS_INPUT_TEX7 v) : COLOR
{
   // Samples
   float4 s0, s1, s2, s3, s4, s5, s6;

   // Accumulated color and weights
   float4 vColorWeightSum;

   // Sample taps with coordinates from VS
   s0 = tex2D(blurredXTexture, v.vTap0);
   s1 = tex2D(blurredXTexture, v.vTap1);
   s2 = tex2D(blurredXTexture, v.vTap2);
   s3 = tex2D(blurredXTexture, v.vTap3);
   s4 = tex2D(blurredXTexture, v.vTap1Neg);
   s5 = tex2D(blurredXTexture, v.vTap2Neg);
   s6 = tex2D(blurredXTexture, v.vTap3Neg);

   // Modulate sampled color values by the weights stored
   // in the alpha channel of each sample
   s0.rgb = s0.rgb * s0.a;
   s1.rgb = s1.rgb * s1.a;
   s2.rgb = s2.rgb * s2.a;
   s3.rgb = s3.rgb * s3.a;
   s4.rgb = s4.rgb * s4.a;
   s5.rgb = s5.rgb * s5.a;
   s6.rgb = s6.rgb * s6.a;

   // Aggregate all samples weighting them with predefined
   // kernel weights, weight sum in alpha
   vColorWeightSum = s0 * vWeights0.w +
                     (s1 + s4) * vWeights0.x +
                     (s2 + s5) * vWeights0.y +
                     (s3 + s6) * vWeights0.z;

   // Compute tex coords for other taps
   float2 vTap4    = v.vTap0 + vertTapOffs[4];
   float2 vTap5    = v.vTap0 + vertTapOffs[5];
   float2 vTap6    = v.vTap0 + vertTapOffs[6];
   float2 vTap4Neg = v.vTap0 - vertTapOffs[4];
   float2 vTap5Neg = v.vTap0 - vertTapOffs[5];
   float2 vTap6Neg = v.vTap0 - vertTapOffs[6];

   // Sample the taps
   s0 = tex2D(blurredXTexture, vTap4);
   s1 = tex2D(blurredXTexture, vTap5);
   s2 = tex2D(blurredXTexture, vTap6);
   s3 = tex2D(blurredXTexture, vTap4Neg);
   s4 = tex2D(blurredXTexture, vTap5Neg);
   s5 = tex2D(blurredXTexture, vTap6Neg);

   // Modulate sampled color values by the weights stored
   // in the alpha channel of each sample
   s0.rgb = s0.rgb * s0.a;
   s1.rgb = s1.rgb * s1.a;
   s2.rgb = s2.rgb * s2.a;
   s3.rgb = s3.rgb * s3.a;
   s4.rgb = s4.rgb * s4.a;
   s5.rgb = s5.rgb * s5.a;

   // Aggregate all samples weighting them with predefined
   // kernel weights, weight sum in alpha
   vColorWeightSum += (s0 + s3) * vWeights1.x +
                      (s1 + s4) * vWeights1.y +
                      (s2 + s5) * vWeights1.z;

   // Average combined sample for all samples in the kernel
   vColorWeightSum.rgb /= vColorWeightSum.a;

   // Account for scale factor applied in previous pass
   // (blur along the X axis) to output values
   // in 16 bit 0 to 1 range
   return vColorWeightSum * 256.0;
}
Figure 11 shows the result of applying the 25×25 separable Gaussian to the downsampled image:

Figure 11: 25×25 Gaussian blurred image

Pass Five: Compositing the Final Output

In the final pass, we create a composite image of the actual scene rendering with the Gaussian blurred image using the distance from the focal plane information that is stored in the alpha channel of the original image. The two offscreen render targets are used to sample that information (in our example, renderTexture is used to sample from full-scene rendering pass results, and blurredXYTexture contains results of applying Gaussian filtering to the downsampled image). All textures are sampled using interpolated texture coordinates.

Vertex Shader for Final Composite Pass

In this vertex shader, we simply transform the vertices and propagate the texture coordinate to the pixel shader. The vertex shader is designed to compile to a vs 1.1 target.

//////////////////////////////////////////////////////////////////////
float4 viewportScale;
float4 viewportBias;
//////////////////////////////////////////////////////////////////////
struct VS_INPUT
{
   float4 vPos: POSITION;
   float2 vTex: TEXCOORD;
};

struct VS_OUTPUT
{
   float4 vPos: POSITION;
   float2 vTex: TEXCOORD0;
};
//////////////////////////////////////////////////////////////////////
VS_OUTPUT final_pass_vs(VS_INPUT v)
{
   VS_OUTPUT o = (VS_OUTPUT)0;

   // Scale and bias viewport
   o.vPos = v.vPos * viewportScale + viewportBias;

   // Propagate texture coordinate to the pixel shader
   o.vTex = v.vTex;

   return o;
}
Pixel Shader for Final Composite Pass

In this pixel shader, we do the compositing of the final image. We retrieve the depth falloff distance stored in the alpha channel of the original scene rendering. This focal plane distance is used as a blending weight to blend between the post-processed Gaussian-blurred image and the original full resolution scene rendering. This pixel shader is designed to compile to a ps 1.4 target or above.

//////////////////////////////////////////////////////////////////////
sampler blurredXYTexture;
sampler renderTexture;
//////////////////////////////////////////////////////////////////////
float4 final_pass_ps(float2 Tex: TEXCOORD0) : COLOR
{
   // Sample Gaussian-blurred image
   float4 vBlurred = tex2D(blurredXYTexture, Tex);

   // Sample full-resolution scene rendering result
   float4 vFullres = tex2D(renderTexture, Tex);

   // Interpolate between original full-resolution and
   // blurred images based on blurriness
   float3 vColor = lerp(vFullres, vBlurred, vFullres.a);

   return float4(vColor, 1.0);
}

Figure 12 shows the result of the final compositing stage.

Figure 12: Final composite image for depth of field effect

The images for Figures 11 and 12 are taken from a screen saver from the ATI Radeon 9700 demo suite, which you can download from http://www.ati.com/developer/screensavers.html.
Bokeh

Different lenses with the same apertures and focal distances produce slightly different out-of-focus images. In photography, the "quality" of an out-of-focus or blurred image is described by the Japanese term "bokeh." While this term is mostly familiar to photographers, it is relatively new to computer graphics professionals.

The perfect lens should have no spherical aberration and should focus incoming rays in a perfect cone of light behind the lens. In such a camera, if the image is not in focus, each blurred point is represented by a uniformly illuminated circle of confusion. All real lenses have some degree of spherical aberration and always have a non-uniform distribution of light in the light cone and thus in the circle of confusion. The lens' diaphragm and number of shutter blades can also have some effect on the shape of the circle of confusion. "Bokeh," which is a Japanese phoneticization of the French word bouquet, describes this phenomenon and is a subjective factor, meaning that there is no objective way to measure people's reaction to this phenomenon. What might be considered "bad" bokeh under certain circumstances can be desirable for some artistic effects and vice versa.

To simulate different lens bokehs, one can use filters with different distributions and weightings of filter taps. Figure 13 demonstrates part of the same scene processed with blur filters of the same size but with different filter tap distributions and weightings.

Figure 13: Real-time lens bokeh (panels: "good" bokeh; "bad" bokeh; rectangular diaphragm; triangle-shaped diaphragm)
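As an illustration of the idea, the sketch below generates a hypothetical tap set distributed inside a regular polygon, approximating the bokeh of an N-bladed diaphragm. This is our own example of a different tap distribution and is not the pattern used to produce Figure 13; the resulting array could be fed to the filterTaps constant shown earlier:

// Hedged sketch: filter taps distributed over a regular polygon (purely illustrative).
#include <cmath>
#include <cstdlib>

void BuildPolygonalTaps(D3DXVECTOR4* taps, int numTaps, int numBlades, float dx, float dy)
{
    const float twoPi = 6.2831853f;
    for (int i = 0; i < numTaps; ++i)
    {
        // Pick a random point inside the polygon by blending two adjacent
        // corner directions with random barycentric weights.
        int   edge = std::rand() % numBlades;
        float a0   = twoPi * edge / numBlades;
        float a1   = twoPi * (edge + 1) / numBlades;
        float u    = (float)std::rand() / RAND_MAX;
        float v    = (float)std::rand() / RAND_MAX;
        if (u + v > 1.0f) { u = 1.0f - u; v = 1.0f - v; }

        float x = u * std::cos(a0) + v * std::cos(a1);
        float y = u * std::sin(a0) + v * std::sin(a1);

        taps[i] = D3DXVECTOR4(x * dx, y * dy, 0.0f, 0.0f);
    }
}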


Summary

This article presented two different real-time implementations of a depth of field effect using DirectX 9 class programmable graphics hardware. Simulating depth of field is very important for creating convincing visual representations of our world on the computer screen. It can also be used in artistic ways to generate more cinematic-looking visual effects. See Color Plate 26 for examples of the techniques discussed in this article.

References

[Potmesil83] Potmesil, Michael and Indranil Chakravarty, "Modeling motion blur in computer-generated images," SIGGRAPH Proceedings of the 10th Annual Conference on Computer Graphics and Interactive Techniques, 1983.

Section V

Shadows

Soft Shadows
by Flavien Brebion

Robust Object ID Shadows
by Sim Dietrich

Reverse Extruded Shadow Volumes
by Renaldas Zioma
Soft Shadows
Flavien Brebion

Presentation

In recent years, dynamic real-time shadowing has slowly replaced in the programmer's heart (if not in the implementation) the old, fake static techniques used for ages in games and other applications. Although this has improved the quality and realism of scenes, it is still far from being perfect. One of these techniques, commonly referred to as shadow volumes or stencil shadows, has a very plastic look because of its hard, sharp edges. In this article I propose a technique to fake real-time soft shadows that is an extension of the shadow volumes algorithm. By no means do I claim it is perfect or physically correct, but it gives decent results at interactive frame rates on DirectX 8 ps_1_4/DirectX 9-generation hardware.

Standard Shadowing Algorithms

There are a number of algorithms currently used for shadowing.

Per-vertex Shadowing

APIs such as DirectX or OpenGL implement their lighting equation per-vertex, but this does not handle shadowing. To handle shadows, one idea is to precalculate all the light contributions per-vertex and store them in a single color that is used later as a modulation factor. Although very fast, this technique requires the scene to be heavily tessellated, and the calculations are too slow to be done in real time. Consequently, the lights or occluders cannot move.

Lightmapping

Another popular technique is called lightmapping. This was used in the Quake and Unreal engines and derivatives a few years ago (and still is, to some extent). This is basically an extension of the per-vertex lighting to store the light's contributions per texel instead of per vertex. The lighting is also precalculated but embedded into small resolution textures called lightmaps. These lightmaps are modulated with the textures later on, giving a pretty realistic look and feel to the scene. Although they no longer rely on heavy tessellation, they require a lot of texture memory and still cannot be used for dynamic lights. It is still possible to generate many lightmaps per light (for example, to light/unlight a room), but this technique is inherently static.

Shadow Mapping

Shadow maps are more and more often used for real-time dynamic shadowing. Their concept is extremely simple: A depth buffer is rendered from the light's point of view, and the Z values are compared with the depth buffer normally computed from the viewer's point of view by projecting the light's buffer onto the scene. This effectively means that for a given pixel, the comparison result tells if the pixel is visible from the light or not. Unfortunately, shadow maps suffer from a number of problems like aliasing (due to the limited resolution or bit precision of the light's depth buffer) and Z-fighting (due to the projection of the light's depth buffer onto the scene's geometry) and require a hardware Z-comparison test. Last but not least, omni-directional lights are hard to support since an omni-light's point of view is a 360° sphere. It is still possible to use a cube map instead, but that means a six-times performance cost penalty. On the other hand, shadow maps do not generate any additional geometry or excessive fillrate needs.

Projected Shadows

The idea behind projected shadows is to generate an object's shadow into a shadow texture and to project it from the light's point of view to blend it over the scene. To avoid wasting one pass to project the shadow on the scene, it is possible to locate the subset of the scene where the shadow is lying. However, self-shadowing and omni-lights are not easily supported.
Stencil Shadows

With the addition of hardware-accelerated stencil buffers to consumer video cards, the technique known as shadow volumes has become more and more widely used. Lightmapping was static and needed the light maps to be precomputed; stencil shadows, on the other hand, are fully dynamic: The light or the scene's objects can move in real time with no restrictions. Shadow volumes generally use the stencil buffer to simulate a simple form of ray-casting (although it is also possible to use the color or alpha buffer to simulate the stencil buffer) to count, for each ray passing through a pixel of the screen, the number of times it intersects the shadow volumes, the definition of a shadow volume being a half-space formed by a light and an occluding triangle.

Shadow volumes (also commonly called stencil shadows) are not easy to implement. They suffer from many problems, two of which are the viewer-inside-volume problem and the heavy fillrate. The first problem appears as soon as the viewer enters the shadow volume; in that case, the intersection counter simulated by the stencil is messed up, and some shadowed areas appear lit while some lit areas appear shadowed. It is possible to fix that by capping the shadow volume to the near clipping plane, but it is not convenient due to the high CPU workload. Another possibility is to cast a ray from the viewer to the light and determine if it hits an occluder; however, the test is only valid for the single pixel that is located at the center of the screen, and artifacts can still occur when half the screen is inside the shadow volume while the other half is outside. Bill Bilodeau and Mike Songy [1] and John Carmack [2] (who arrived at the same approach independently) were the first to propose a modification of the basic shadow volumes algorithm by pushing the problem away to the far clipping plane instead. It is then possible to either limit the shadow volume length (in the case of attenuated lights) or tweak the projection matrix to push the far plane to infinity, which completely solves the problem since there is no longer any possible intersection.
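As a brief illustration of the "far plane at infinity" trick mentioned above, one common formulation (a sketch, not code from this article) takes the limit of the standard left-handed Direct3D projection terms as the far plane goes to infinity:

// Hedged sketch: left-handed perspective matrix with the far plane at infinity,
// suitable for depth-fail (Z-fail) shadow volume rendering.
D3DXMATRIX BuildInfiniteProjectionLH(float fovY, float aspect, float zn)
{
    float yScale = 1.0f / tanf(fovY * 0.5f);
    float xScale = yScale / aspect;

    D3DXMATRIX m;
    ZeroMemory(&m, sizeof(m));
    m._11 = xScale;
    m._22 = yScale;
    m._33 = 1.0f;   // limit of zf / (zf - zn) as zf -> infinity
    m._34 = 1.0f;
    m._43 = -zn;    // limit of -zn * zf / (zf - zn) as zf -> infinity
    return m;
}

In practice a tiny epsilon is often subtracted from m._33 to avoid precision issues at the far plane.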

The second problem with stencil shadows is that they generate a lot of new polygons to display; not only do new vertices have to be generated and transformed, but the polygons also have to be rasterized. Even if we disable color rendering and Z-buffering, since we just need to increase/decrease the stencil values, it still has a heavy cost. One solution, which is based on the fact that given two adjacent triangles in a mesh, their respective shadow volume contributions in the stencil buffer cancel each other, is to calculate the silhouette of the whole mesh on the CPU and generate a single, huge shadow volume for it, effectively decreasing the number of polygons to be displayed. This greatly improves the performance, but new vertices still have to be generated (by extruding the silhouette edges) and transformed. Fortunately, with the advent of vertex shaders, this step can be completely off-loaded to the GPU processor. For further details about the stencil shadows algorithm and its problems and solutions, please see "The Theory of Stencil Shadow Volumes" in ShaderX 2: Introductions & Tutorials with DirectX 9.

Brotman and Balder [3] implemented real-time soft shadows in 1984 by jittering multiple shadow volumes. Instead of rendering one shadow volume for the light in black, they render n shadow volumes from n lights, all randomly displaced around the original light's center, and sum up their contributions. Unfortunately, this is not very practical since more than ten samples generally have to be used to obtain good results (Cass Everitt and Mark J. Kilgard [4] from nVidia have implemented it with a cluster of 12 lights), which means an effective fillrate cost increase of a factor of ten on an algorithm that is already fillrate intensive.
Overview of the Soft Shadows Algorithm

The real-time models previously described generally work for point lights, spotlights (projected shadows, shadow maps), or for perfect point, omnidirectional lights (shadow volumes, shadow maps). As a result, none of them effectively simulates the effect of a non-perfect light source (that is, a volumetric light source). In reality, no light is a point; even the sun in the sky has a visible radius causing penumbras. Perfect point lights generate sharp/hard shadows, but if we want to enhance the realism and quality of real-time computer-rendered scenes, we need a way to fake the soft shadows caused by volumetric lights. Figure 1 shows an example of a volumetric light as seen from the viewpoints A, B, and C. If you cast an infinite number of rays from the light to the point, some of them might hit an occluder, while some of them might reach the point. When 100% of the light reaches the point, we say the point is completely visible from the light (point A). When 0% of the light reaches the point, we say the point is completely shadowed by the light (point C). We define the penumbra as the set of points that receive more than 0% of the light rays but less than 100%; that is, all the points that are partially visible from the light (point B). Although this model is not perfectly accurate, it provides a good basis to start developing our soft shadows technique.

Figure 1: A volumetric light causing soft shadows. Point A receives 100% of the light (fully lit); point B receives around 60% of the light (partially lit) and is in the light's penumbra; point C receives no light at all (it is completely hidden by the occluder) and is in the occluder's shadow.

The idea behind the soft shadows algorithm is simple and works as an extension of the shadow volumes algorithm. We render the shadowed areas from screen-space to a texture, which we call the shadow map (not to be confused with the shadowing technique of the same name described in the "Shadow Mapping" section); pixels can basically take two values: 0 if the pixel is lit or 1 if it is shadowed. As the shadow volume algorithm works with a perfect point light source, no other values at that time are possible. Now, if we had a way to generate the penumbra area in that texture, we could just apply a blurring filter to the pixels of the penumbra near the inner contour to obtain a nice, smooth transition between completely shadowed pixels (0% of light received, texel value of 1) and completely lit pixels (100% of light received, texel value of 0). This is obviously not a physically accurate solution but would greatly help to enhance the visual quality of the scene and lessen that "plastic" look.
In reality, there are many types of volumetric lights. For the sake of simplicity, we are going to simulate a single, simple type of volumetric light: the spherical light, which is defined as a center point and an emission radius. As the shadow volumes algorithm does not work with spherical lights but with point lights, we need a way to locate the penumbra of a spherical light by using shadow volumes only. It's a tough task, but not impossible. Figure 2 shows a possible solution; a shadow volume is generated from a standard point light, located at the position of the spherical light. This volume defines the inner contour of the penumbra, which we call the inner volume. This process uses a vertex shader to optimize the shadow volumes calculation and is described in the "Standard Shadowing Algorithms" section. After that, a second volume is generated from a jittered point light. Jittering is the action of moving the position of the point light source by a small vector for each vertex of the scene. This jittering process is described in the "Outer Volume by Jittering the Point Light Source" section. Figure 3 shows a simple point light compared to a spherical light and a jittered point light. This new volume is called the outer volume and defines the outer contour of the penumbra.

Figure 2: Inner and outer volumes for an area light

The inner and outer volumes are next rendered to the shadow map, each in one color component channel. For instance, the inner volume is stored in the red component, while the outer volume is stored in the green component. Other components (blue and alpha) are simply ignored. The blurring process uses this shadow map both as input and output recursively and needs to know which pixels are in the penumbra and which are not (hence the decomposition of the inner and outer areas into different channels). There is more on this in the section "Blurring the Penumbra."
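One possible way to direct each volume into its own channel is the color write mask; below is a minimal sketch under the assumption that the shadow map is the current render target (the render states and passes actually used by the article are described in later sections):

// Hedged sketch: writing the inner and outer volumes into separate channels.
void RenderPenumbraVolumes(IDirect3DDevice9* pDevice)
{
    // Inner volume -> red channel only
    pDevice->SetRenderState(D3DRS_COLORWRITEENABLE, D3DCOLORWRITEENABLE_RED);
    // ... draw the inner shadow volume here ...

    // Outer volume -> green channel only
    pDevice->SetRenderState(D3DRS_COLORWRITEENABLE, D3DCOLORWRITEENABLE_GREEN);
    // ... draw the outer (jittered) shadow volume here ...

    // Restore all channels for subsequent passes
    pDevice->SetRenderState(D3DRS_COLORWRITEENABLE,
        D3DCOLORWRITEENABLE_RED | D3DCOLORWRITEENABLE_GREEN |
        D3DCOLORWRITEENABLE_BLUE | D3DCOLORWRITEENABLE_ALPHA);
}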

After this blurring step, we can use a pixel shader to modulate the scene with a lightmap, obtained from the shadow map. This quite straightforward step is described in the section "Using the Shadow Map," followed by miscellaneous considerations in the section "Other Considerations."

Penumbra Generation

The first step to our algorithm is to determine the inner and outer volumes forming the penumbra. The inner volume is generated by using the shadow volumes algorithm with a point light located at the same position as the considered spherical light. Let's first see an overview of the shadow volumes algorithm and how it could be implemented in hardware by using vertex shaders. This vertex shader is later modified for the outer volume to support a jittering coefficient, which indicates how far the light has to be displaced.

Figure 3: A point light source L (left) is defined as a position P. A sphere light source L' (middle) is defined as a position and a radius. A jittered light source L'' (right) is a point light source whose center is displaced (jittered) by a vector of length R around P.

Inner Volume Using Hardware Stencil Shadows
To generate a shadow volume for a simple point light, we use the Z-fail technique. The silhouette is first generated on the CPU by looking at which faces are visible from the point light's position and which are not. For each triangle, the plane equation is used to determine if the light is on the positive side (front-facing triangle) or the negative one (back-facing triangle). The algorithm looks like this:

For each triangle T,
   Let a, b, c and d be the coefficients of the plane equation for T;
      the normal N of the plane is then (a, b, c)
   Let L be the light's position vector,
   Triangle is front-facing if (N dot L) + d > 0
End For

Silhouette edges are then computed by looking at which edges are shared between two triangles that are not both on the same side with respect to the light's position:

For each triangle T,
   For each triangle T' that is adjacent to T,
      Let E be the edge defined by T and T',
      E is a silhouette edge when T is front-facing the light and T' is back-facing,
         OR when T is back-facing and T' is front-facing.
   End For
End For
This information (silhouette edges, front-facing triangles, and back-facing triangles) is enough to generate the shadow volumes. Front-facing triangles are left untouched; they form the near capping of the volume. Back-facing triangles are extruded away from the light; they form the far capping. Finally, a quadrilateral is generated from a silhouette edge and its extruded equivalent. These form the sides of the volume. This process is shown in Figure 4 for a simple occluder quadrilateral.

Figure 4: The shadow volume generated from an occluder O is formed by the near capping (V0, V1, V2, V3), the far capping (V0', V1', V2', V3'), and the four sides (V0, V0', V1', V1; V1, V1', V2', V2; V2, V2', V3', V3; V3, V3', V0', V0).

To perform these operations in a vertex shader, we cannot generate any new vertices. The idea is to do the extrusion (which is the process of moving a vertex from the near capping to the far capping) in the vertex shader by associating an extrusion weight to each vertex and duplicating the vertex buffer; the original vertex buffer contains all the vertices with a weight of 0, and the duplicated buffer contains the same vertices but with a weight of 1 to indicate that they have to be extruded away from the light. The light position itself is passed as a shader constant. Since the shadow volume is not lit or textured, all we need for our vertex format is a triplet for the position and a single coefficient. We then append these two buffers together, as shown in Figure 5.

Figure 5: Simple vertex buffer format for shadow volume extrusion in a vertex shader
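A minimal sketch of such a duplicated vertex buffer, using structure and function names of our own choosing (the actual sample code may differ):

// Hedged sketch: building the static 2N-vertex buffer used for shadow volume
// extrusion. The first N vertices carry an extrusion weight of 0 (near cap),
// the second N carry a weight of 1 (to be pushed away from the light).
struct ShadowVolumeVertex
{
    float x, y, z;   // object-space position
    float extrude;   // extrusion weight: 0 or 1
};

void BuildShadowVolumeVertices(const D3DXVECTOR3* meshPositions, DWORD N,
                               ShadowVolumeVertex* out) // out must hold 2N entries
{
    for (DWORD i = 0; i < N; ++i)
    {
        ShadowVolumeVertex v;
        v.x = meshPositions[i].x;
        v.y = meshPositions[i].y;
        v.z = meshPositions[i].z;

        v.extrude = 0.0f;  out[i]     = v;   // near-cap copy
        v.extrude = 1.0f;  out[i + N] = v;   // extruded copy
    }
}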


As a result, we end up with a vertex buffer of size 2N (N being the original vertex<br />

buffer size for the whole mesh) that is completely static. Note that it is possible to<br />

use different streams to avoid the vertex buffer duplication if memory is a concern.<br />

But how do we form the shadow volume now, since it is different for every frame? One solution is to use a dynamic index buffer. We build the shadow volumes by referencing the vertices with an extrusion weight of 0 (for the near capping) or by referencing the vertices with an extrusion weight of 1 (for the far capping). So, for example, the side V0, V0', V1', V1 of the shadow volume shown in Figure 4 is formed by appending the indices 0, 4, 1, 1, 4, 5 (assuming we are working with triangle lists) to the index buffer. The following algorithm shows how to build the indices for shadow volumes:

Reset dynamic index array I for shadow volumes of mesh.
For each triangle T of the mesh,
    Let N be the number of vertices in the mesh (the vertex buffer has a length of 2N vertices),
    Let V0, V1, and V2 be the mesh indices of the vertices of the triangle T,
    If T is back-facing the light,
        /// one triangle for the far capping
        Append the indices (V0 + N), (V1 + N) and (V2 + N) to I
    Else
        /// one triangle for the near capping
        Append the indices V0, V1 and V2 to I,
        /// now form the shadow volume sides from the silhouette edges
        If edge defined by V0-V1 is a silhouette edge,
            Append the indices V0, V1, (V0 + N), (V1 + N) to I
        Else if edge defined by V1-V2 is a silhouette edge,
            Append the indices V1, V2, (V1 + N), (V2 + N) to I,
        Else if edge defined by V2-V0 is a silhouette edge,
            Append the indices V2, V0, (V2 + N), (V0 + N) to I
        End If
    End If
End For
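A C++ sketch that follows the pseudocode above and expands each silhouette-edge quad into the six triangle-list indices of the 0, 4, 1, 1, 4, 5 example; the silhouette-edge query is a placeholder for whatever adjacency data the engine keeps:

#include <vector>

// Append one side quad (edge a-b and its extruded copies) as two triangles,
// matching the 0, 4, 1, 1, 4, 5 example from the text.
static void AppendSideQuad(std::vector<unsigned>& I, unsigned a, unsigned b, unsigned N)
{
    I.push_back(a); I.push_back(a + N); I.push_back(b);
    I.push_back(b); I.push_back(a + N); I.push_back(b + N);
}

// Rebuild the dynamic index buffer for one mesh and one light.
// 'frontFacing' and 'isSilhouetteEdge' come from the CPU silhouette pass described earlier.
void BuildShadowIndices(const std::vector<unsigned>& triIndices,      // 3 indices per triangle
                        const std::vector<bool>& frontFacing,         // per triangle
                        bool (*isSilhouetteEdge)(unsigned, unsigned), // per edge (V0, V1); placeholder
                        unsigned N,                                   // original vertex count
                        std::vector<unsigned>& I)
{
    I.clear();
    for (size_t t = 0; t < frontFacing.size(); ++t)
    {
        unsigned V0 = triIndices[3*t], V1 = triIndices[3*t+1], V2 = triIndices[3*t+2];
        if (!frontFacing[t])
        {
            // far capping: the back-facing triangle, referenced through the extruded copies
            I.push_back(V0 + N); I.push_back(V1 + N); I.push_back(V2 + N);
        }
        else
        {
            // near capping: the front-facing triangle as-is
            I.push_back(V0); I.push_back(V1); I.push_back(V2);
            // sides from the silhouette edges, following the pseudocode's Else-if chain
            if      (isSilhouetteEdge(V0, V1)) AppendSideQuad(I, V0, V1, N);
            else if (isSilhouetteEdge(V1, V2)) AppendSideQuad(I, V1, V2, N);
            else if (isSilhouetteEdge(V2, V0)) AppendSideQuad(I, V2, V0, N);
        }
    }
}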

So now that we have generated the shadow volume vertices (statically) and indices (dynamically, or semi-statically if the light does not move every frame), we still need to perform the extrusion in a vertex shader. This is not very hard; the vertex shader just translates the vertex away from the light by a distance that depends on the extrusion weight. The extrusion distance is a constant equal to the light attenuation radius; that is, the radius at which the light no longer contributes anything to the scene. The different steps performed by the vertex shader are described in Figure 6.

Figure 6: The steps performed by the vertex shader for shadow volume extrusion. L is the light's position (constant), D is the radius (constant), V is the incoming vertex's position, and E is the vertex's extrusion weight (generally 0 or 1).


The following is the vertex shader's code for these steps:

vs.1.1
; Constants used in the shader:
; c1.x  : light attenuation radius
; c4-c7 : world-view-projection matrix
; c20   : light position in object space
dcl_position0 v0
dcl_blendweight0 v3
; step 1: calculate M
sub r8, v0, c20
; step 2: normalize M, result is M'
dp3 r8.w, r8, r8
rsq r8.w, r8.w
mul r8.xyz, r8.xyz, r8.w
; step 3: G = D * M'
mul r8, r8, c1.x
; step 4: G' = E * G
mul r8, r8, v3.x
; step 5: V' = V + G'
add r8, r8, v0
; r8.w is no longer correct, fix it
mov r8.w, v0.w
; transform vertex:
m4x4 oPos, r8, c4

Rendering the shadow volume is not very hard, and you can easily find a lot of information on that topic. The stencil buffer is first cleared to a constant (to avoid wrapping issues; assuming we have eight stencil bits, we use the reference value 128). Texture mapping, Gouraud shading, and Z-buffer writes are all disabled, but we keep the Z-buffer test. We also disable writing to the color buffer, since we are only interested in the stencil values. We set up the stencil test to always pass and the stencil operation to increment on pixels that fail the Z-buffer test. Back-face culling is enabled for clockwise triangles. Then the shadow volumes are rendered using the extrusion vertex shader. Once this is done, we reverse the back-face culling order to counterclockwise and the stencil operation to decrement on Z-buffer failure, and we render the shadow volumes again. Note that rendering the shadow volumes can be collapsed into a single pass using the two-sided stencil test and operations if the hardware supports them.
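A hedged DirectX 9 render-state sketch of the two stencil passes just described (DrawShadowVolumes is a placeholder for the draw calls issued with the extrusion vertex shader bound):

#include <d3d9.h>

void DrawShadowVolumes(IDirect3DDevice9* pDevice);   // placeholder for the actual draw calls

void RenderShadowVolumePasses(IDirect3DDevice9* pDevice)
{
    // Common state: keep the Z test, but write neither depth nor color.
    pDevice->Clear(0, NULL, D3DCLEAR_STENCIL, 0, 1.0f, 128);   // clear stencil to the reference value 128
    pDevice->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);
    pDevice->SetRenderState(D3DRS_COLORWRITEENABLE, 0);
    pDevice->SetRenderState(D3DRS_STENCILENABLE, TRUE);
    pDevice->SetRenderState(D3DRS_STENCILFUNC, D3DCMP_ALWAYS);
    pDevice->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_KEEP);
    pDevice->SetRenderState(D3DRS_STENCILFAIL, D3DSTENCILOP_KEEP);

    // First pass: cull clockwise triangles, increment stencil where the Z test fails.
    pDevice->SetRenderState(D3DRS_CULLMODE, D3DCULL_CW);
    pDevice->SetRenderState(D3DRS_STENCILZFAIL, D3DSTENCILOP_INCR);
    DrawShadowVolumes(pDevice);

    // Second pass: reverse the culling order, decrement stencil where the Z test fails.
    pDevice->SetRenderState(D3DRS_CULLMODE, D3DCULL_CCW);
    pDevice->SetRenderState(D3DRS_STENCILZFAIL, D3DSTENCILOP_DECR);
    DrawShadowVolumes(pDevice);
}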


Outer Volume by Jittering the Point Light Source

The outer volume is a bit different from the inner volume. To generate the inner volume, we have been using the shadow volumes algorithm without any modification. The outer volume, however, requires us to jitter the point light source (that is, for each vertex in the scene, to move the light source in order to simulate a volumetric light). We call the jittering vector the vector used to translate the light source per vertex. It has a direction and a length. We can use the radius of the spherical light as the length, but what about its direction? If we choose a constant direction, the penumbra does not exactly match the inner shadow, depending on where the occluder is compared to the light position. Another idea is to move the light toward the object center, but depending on the shape of the object, the same problem appears. Now let's consider the problem from another angle: the outer volume looks very similar to the inner volume as it would be generated from a "bigger" object, as demonstrated in Figure 7. Maybe we can inflate the object to generate the outer volume from a point light source. Let's see what happens when we try to do that.

Figure 7: By using a simple point light L, we can generate an inner volume from an occluder object and an outer volume from the inflated object. The penumbra is defined by the outer area excluding the inner one.

The first problem is to generate the inflated object. Parker, Shirley, and Smits [5], who faced a similar problem in their own technique, proposed using a transformation that is natural to the object (such as, for a sphere, a bigger sphere). Unfortunately, they do not propose a solution for an arbitrary mesh, as there is no information about its nature. One idea is to average all the normals for all the vertices of the object (independent of the angle between the faces or other conditions) and expand all the vertices in the direction pointed out by the normals, as demonstrated for a cube in Figure 8.
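A minimal C++ sketch of the normal-averaging inflation described above, assuming an indexed triangle mesh with precomputed face normals (all names here are illustrative); the normalized inflation vectors are kept because they are reused later for jittering:

#include <vector>
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 Normalize(Vec3 v)
{
    float len = std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
    if (len > 0.0f) { v.x /= len; v.y /= len; v.z /= len; }
    return v;
}

// Average the face normals around every vertex (ignoring angles between faces),
// then push each vertex outward along that averaged normal.
void InflateMesh(std::vector<Vec3>& positions,
                 const std::vector<unsigned>& indices,   // 3 per triangle
                 const std::vector<Vec3>& faceNormals,   // 1 per triangle
                 float inflateDistance,
                 std::vector<Vec3>& inflationNormals)    // per-vertex, reused for jittering
{
    inflationNormals.assign(positions.size(), Vec3());
    for (size_t t = 0; t < faceNormals.size(); ++t)
        for (int k = 0; k < 3; ++k)
        {
            Vec3& n = inflationNormals[indices[3*t + k]];
            n.x += faceNormals[t].x; n.y += faceNormals[t].y; n.z += faceNormals[t].z;
        }

    for (size_t i = 0; i < positions.size(); ++i)
    {
        inflationNormals[i] = Normalize(inflationNormals[i]);
        positions[i].x += inflationNormals[i].x * inflateDistance;
        positions[i].y += inflationNormals[i].y * inflateDistance;
        positions[i].z += inflationNormals[i].z * inflateDistance;
    }
}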

The real problem with this approach appears when an inflated face intersects another face, as demonstrated in Figure 9. In that case, shadowing artifacts appear. These situations should be prevented. One idea is to use CSG operations to remove the vertices that lie inside the object's mesh and then clean up the mesh to keep it two-manifold (as required by the Z-fail stencil algorithm). It is also possible to ignore the problem if it is rare due to the nature of the object or if the inflation distance is small compared to the object's size (in which case the artifacts are small or unnoticeable).

Figure 8: Side view of a simple cube with its face normals (left), the same cube with averaged normals (middle), and the cube inflated along the normals (right)

Figure 9: Problems with the inflating algorithm when face intersection occurs while inflating the mesh. Here we have the original object (left) with its inflation vectors (middle) and the object after inflation (right).

If we try to use this inflated object with the stencil shadows algorithm from a point light source, the generated outer volume is incorrect. First of all, the same depth values from the Z-buffer cannot be reused for the inner and outer volumes; shadow volumes require the Z-test to be enabled, but the vertices of the inflated object differ from the vertices of the normal object. Since the inner volume uses the normal object and the outer volume uses the inflated object, their vertices don't match. As a result, we have to reset the Z-buffer by rendering the scene once again between the inner and outer volume passes. It gets even worse when you consider multiple lights, because this means you have to reset the Z-buffer not once for the whole scene but twice per light.

Assuming we can live with the performance drop induced by not using the same vertices to render the inner and outer volumes, we still have another problem. Indeed, as the vertices don't match, the penumbra near the occluder doesn't match either, which is not correct, as seen in Figure 10. Compare with Figure 2, which is what we actually want.

So what choices do we have left? Remember that when we were speaking of jittering the light, we needed a direction and a length. We proposed to use the emission radius of the spherical light as the length, but we were unsure about the direction, as a constant vector wouldn't work. But what would happen if we used the direction of the inflation vector as our jittering direction? Figures 11 and 12 show that by actually using the inverse direction of the inflation vector, we can create an outer volume that has all the wanted properties. The vector N is the inflation vector for the vertex V. When extruding that vertex V away from the light, instead of using the center of the light P, we jitter P by N'. N' is found by normalizing the inflation vector, negating it, and then multiplying it by R (the radius of the spherical light).

Figure 10: Using an inflated object results in a near-correct but still invalid outer volume near the original occluder vertices.

Figure 11: Using the inflation vector N at the vertex V to jitter the light's position P by a vector N'

Figure 12: Side view of the jittering process for two vertices of a box occluder

This jittering operation is different for every vertex of the considered occluder(s), since the inflation vector N is different for every vertex V.

To implement this on top of our current shadow volume vertex shader, we only need one more piece of information: the normalized inflation vector N. Consequently, the vertex format for shadow volume vertices has to be updated (see Figure 13).


Figure 13: Extended vertex buffer format to jitter the light while doing shadow volume extrusion

The vertex shader only needs minor additions:

; constants:
; c3.x contains the jittering distance
; c20 contains the light position P
; declare the inflation vector
dcl_normal0 v4
; jitter the light position in r1
mul r1, v4, c3.x
sub r1, c20, r1

The jittering distance is equal to the radius of the spherical light if we are in the outer volume generation pass or to 0 for the inner volume pass. That way, we can use the same shader for both inner and outer volume passes. The complete shadow volume extrusion vertex shader code becomes:

vs.1.1
; Constants used in the shader:
; c1.x  : light attenuation radius
; c3.x  : light emission radius (radius of the spherical light)
; c4-c7 : world-view-projection matrix
; c20   : light position in object space
dcl_position0 v0
dcl_blendweight0 v3
dcl_normal0 v4
; jitter the light
mul r1, v4, c3.x
sub r1, c20, r1
; step 1: calculate M
sub r8, v0, r1
; step 2: normalize M, result is M'
dp3 r8.w, r8, r8
rsq r8.w, r8.w
mul r8.xyz, r8.xyz, r8.w
; step 3: G = D * M'
mul r8, r8, c1.x
; step 4: G' = E * G
mul r8, r8, v3.x
; step 5: V' = V + G'
add r8, r8, v0
; r8.w is no longer correct, fix it
mov r8.w, v0.w
; transform vertex:
m4x4 oPos, r8, c4
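A hedged sketch of how the application side might drive this shader for the two passes; the constant registers follow the listing above, SetVertexShaderConstantF is the standard DirectX 9 call, and the world-view-projection matrix in c4-c7 is assumed to be set elsewhere:

#include <d3d9.h>

// c1.x = light attenuation radius, c3.x = jittering distance, c20 = light position (object space)
void SetExtrusionConstants(IDirect3DDevice9* pDevice,
                           const float lightPosObj[4],   // (x, y, z, 1)
                           float attenuationRadius,
                           float jitterDistance)         // 0 for the inner pass, emission radius for the outer pass
{
    float c1[4] = { attenuationRadius, 0, 0, 0 };
    float c3[4] = { jitterDistance,    0, 0, 0 };
    pDevice->SetVertexShaderConstantF(1,  c1,          1);
    pDevice->SetVertexShaderConstantF(3,  c3,          1);
    pDevice->SetVertexShaderConstantF(20, lightPosObj, 1);
}

// Usage: SetExtrusionConstants(dev, lightPos, attenRadius, 0.0f)          for the inner volume,
//        SetExtrusionConstants(dev, lightPos, attenRadius, emissionRadius) for the outer volume.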

Blurring the Penumbra

The initialization step of the blurring algorithm starts by rendering the inner volume to the red component of the shadow map and the outer volume to the green component of the shadow map. As the result of the shadow volumes algorithm can be seen as a boolean (inside shadow or outside shadow), we encode it as a color component with a value of 0 when the pixel is outside the shadow or a value of 1 (or 255) when the pixel is inside the shadow. Since there are two volumes (inner and outer), there are four possible combinations. One of them is actually not possible (full red, no green), since the outer shadow completely covers the inner shadow. Three combinations remain, as seen in Figure 14:

• No red, no green: outside both the inner and outer shadows; the pixel receives 100% of the light
• No red, full green: inside the outer shadow but outside the inner shadow; in penumbra
• Full red, full green: inside the inner shadow; the pixel receives 0% of the light

Blue and alpha components have no particular significance and are ignored.

Figure 14: Contents of the color buffer of the shadow map after a typical penumbra generation


To render the shadow map, we enable render-to-texture, clear the background color to black, and initialize the depth and stencil buffers. First, we render the inner volume as described in the "Inner Volume Using Hardware Stencil Shadows" section by using a point light source with the shadow volumes algorithm. The result of this pass is restricted to the stencil buffer, so the next thing to do is output it to the red component of the shadow map. To do that, we just draw a simple red quadrilateral covering the viewport with the stencil test enabled. The same process is done for the green component with the outer volume; the stencil buffer is cleared again (but not the Z-buffer, since the vertices haven't changed since the previous pass), the outer volume is rendered to the stencil buffer as described in the "Outer Volume by Jittering the Point Light Source" section, and then a green quadrilateral is rendered with the stencil test enabled. At that point, the contents of the shadow map should be similar to what is seen in Figure 14. Here is a summary of the whole algorithm:

Enable render-to-texture for an NxN texel shadow map,
Clear color buffer to black, clear Z-buffer and stencil,
Render scene to the Z-buffer to prepare stencil shadows,
Render inner shadow volumes to the stencil buffer,
Enable stencil test, enable color mask (write to red only),
Display a red full-screen quadrilateral for pixels in inner shadow,
Clear stencil again,
Render outer shadow volumes by jittering the light with the inflation vectors,
Enable stencil test, enable color mask (write to green only),
Display a green full-screen quadrilateral for pixels in outer shadow,
Clean up states
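A hedged DirectX 9 sketch of the "write to red only" steps in the summary above; the stencil compare function is an assumption (the text only says the stencil test is enabled), and DrawFullScreenQuad is a placeholder:

#include <d3d9.h>

void DrawFullScreenQuad(IDirect3DDevice9* pDevice, D3DCOLOR color);   // placeholder

void WriteInnerShadowToRed(IDirect3DDevice9* pDevice)
{
    // One reasonable stencil compare (not spelled out in the text): pass where the
    // counted value differs from the cleared reference 128, i.e., the pixel is in shadow.
    pDevice->SetRenderState(D3DRS_STENCILENABLE, TRUE);
    pDevice->SetRenderState(D3DRS_STENCILREF, 128);
    pDevice->SetRenderState(D3DRS_STENCILFUNC, D3DCMP_NOTEQUAL);
    pDevice->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_KEEP);
    pDevice->SetRenderState(D3DRS_COLORWRITEENABLE, D3DCOLORWRITEENABLE_RED);
    DrawFullScreenQuad(pDevice, D3DCOLOR_ARGB(255, 255, 0, 0));        // solid red over the viewport
    // The outer-volume pass repeats this with D3DCOLORWRITEENABLE_GREEN and a green quad.
}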

The last step is to blur the pixels near the inner contour of the penumbra. The idea here is to apply a blur filter to the red component of the shadow map but allow reads or writes only for texels that have full green values (that is, texels inside the penumbra or the inner shadow). This should result in a nice, soft blur of the shadow around the penumbra without messing up the texels that are not shadowed. It is important to note that the blur is only applied to the red component; the green component is left unchanged.

First of all, let's see how blurring can be implemented in hardware without any read or write restriction. Blurring a texture works, for a given texel, by sampling neighboring texels of the texture and averaging their color values (see Figure 15). Since it is not possible to sample the same texture many times on the same texture unit sampler (except with dependent texture reads, but you don't want to do that for obvious performance reasons), what we actually do is bind the same texture to different texture unit samplers. Then it is all a matter of offsetting the texture coordinates by a small value for each sampler.

Figure 15: A simple four-tap blurring filter. The black dot R represents the center texel; the A, B, C, and D dots represent four possible neighboring texels. They all have a weight of 1, hence the resulting texel is the average of its neighbors.


No vertex shader is needed, since the offsets can be precomputed directly. In Figure 16 a three-tap filter is used to show the texture coordinate offsets: a texture A is bound on unit 0 with an offset of (U + 0.025, V - 0.1); a texture B is bound on unit 1 with an offset of (U - 0.1, V + 0.025); a texture C is bound on unit 2 with an offset of (U + 0.1, V + 0.1). The following pixel shader code shows how to sample four textures and average them:

ps.1.4
; weight for each sample
def c0, 0.25, 0.25, 0.25, 0.25
; simple 4-tap blur: sample 4 times
texld r0, t0
texld r1, t1
texld r2, t2
texld r3, t3
; weight r0
mul r0, r0, c0
; weight r1, sum up into r0
mad r0, r1, c0, r0
; weight r2, sum up into r0
mad r0, r2, c0, r0
; weight r3, sum up into r0
mad r0, r3, c0, r0
; now r0 holds 0.25 * t0 + 0.25 * t1 + 0.25 * t2 + 0.25 * t3,
; which is the same as (t0 + t1 + t2 + t3) / 4

For our soft shadow algorithm, we use a custom blur filter with six taps (hence using six texture units). The taps are more or less arranged in a circle, as seen in Figure 17.

Figure 16: A three-tap filter and the texture offsets. The gray square represents the UV space of the filtered texture output. The three input textures' UV spaces are all offset compared to the gray one.

Figure 17: The simple six-tap blur filter that we use for smoothing our shadow map

Now that we've seen how to implement a simple blur, we need to include the read and write restrictions so that we only blur texels near the penumbra. First, let's say that we are sampling a texel that is outside the outer shadow; its contribution should be ignored. One way is to keep track of the weights and only perform the blur between the texels that are inside the outer shadow. For example, say that in Figure 17 only the R, A, and B taps are inside the outer shadow; then we'd sum up the R, A, and B taps' contributions, each with a weight of 1/3. Unfortunately, this is not practical, since you'd have to use an additional counter register to keep track of the total weight and divide the final result by it. Instead, it is possible to keep a constant weight of 1/6 by replacing the taps to ignore with the center tap's contribution, which is always valid. Thus, in our example we'd sum up the A and B taps' contributions once and the R tap's contribution four times (once normally, plus C, D, and E ignored), always using a weight of 1/6. Here is the algorithm:

result = 0
For each tap in [R, A, B, C, D, E] do
    If tap is inside outer shadow,
        result = result + current tap's contribution / 6
    else
        result = result + R tap's contribution / 6
    End If
End Foreach

Note that testing whether a tap is inside the outer shadow basically means sampling the tap and testing whether its green component is 1. The write restriction is no more complex. We only want to blur the red channel of texels that are inside the outer shadow, and we have an easy way to do that: multiply the computed result by the green component of the center tap. Thus, if the center tap is outside the outer shadow, the result is 0, independent of the neighboring texels' contributions. Otherwise, the contributions are not modified:

result = result * [If R tap is in outer shadow then 1 else 0]

The whole algorithm becomes:

If R tap is in outer shadow,
    result = 0
    For each tap in [R, A, B, C, D, E] do
        If tap is inside outer shadow,
            result = result + current tap's contribution / 6
        else
            result = result + R tap's contribution / 6
        End If
    End Foreach
Else
    result = 0
End If
Output.red = result
Output.green = R tap's green component

To implement this algorithm with pixel shader 1.4, we need a way to do the "tap is in outer shadow" test. The green component of a texel of the shadow map determines whether that texel is inside (value of 1) or outside (value of 0) the outer shadow, so with the cnd instruction we can perform the test and conditionally move either the considered tap's contribution or the center (black) tap's contribution. We also need to weight and accumulate (with a mad instruction) the contributions into a register used as the result, which means we need at least 12 instructions: six for the conditional moves plus six for the accumulations. This exceeds the maximum instruction count for a ps_1_4 phase, so we would have to use two phases. Fortunately, it is possible to accumulate the result in an alpha component and collapse the instructions into pairs, as a color operation followed by an alpha operation counts as a single instruction in the ps_1_4 model. With this trick, a single phase with six texture samples is enough. The following is the code for that pixel shader:

; Pixel shader version 1.4
ps.1.4
; weight constant for each incoming texture's contribution
def c0, 0.1666, 0.1666, 0.1666, 0.1666
; sample the 6 textures. t0 is at the center of the pixel;
; all other textures are sampled with an offset.
texld r0, t0
texld r1, t1
texld r2, t2
texld r3, t3
texld r4, t4
texld r5, t5
; stores in r1.r either r0.r or r1.r, depending on the
; value of r1.g. This is a conditional mov. In the meantime,
; we'll weight the contribution of r0.r in the alpha
; component of r0 and add it to the result.
cnd r1.r, r1.g, r1.r, r0.r
+mul r0.a, r0.r, c0.z
; basically does the same for r2 and the contribution of
; r1.r. As you notice, there's a 1-instruction "delay"
; between calculating the contribution of a texture and
; weighting and adding it to the final result.
cnd r2.r, r2.g, r2.r, r0.r
+mad r0.a, r1.r, c0.z, r0.a
; same here, but for r3's contribution and weighting r2.
cnd r3.r, r3.g, r3.r, r0.r
+mad r0.a, r2.r, c0.z, r0.a
; same here, but for r4's contribution and weighting r3.
cnd r4.r, r4.g, r4.r, r0.r
+mad r0.a, r3.r, c0.z, r0.a
; same here, but for r5's contribution and weighting r4.
cnd r5.r, r5.g, r5.r, r0.r
+mad r0.a, r4.r, c0.z, r0.a
; we need a final weighting due to the 1-instruction "delay";
; this is done here, by weighting r5 and adding it up to
; the final result, r0.
mad r0.r, r5.r, c0.z, r0.a
; this final one masks the result depending on the
; value of r0.g; do not forget that the green component is
; always either 0 or 1. We basically only allow writing
; when the center pixel is from the penumbra or the inner
; shadow:
mul r0.r, r0.r, r0.g
; that's it! Note that r0.g hasn't been modified at all
; by the shader; in summary, all the red components of the
; offset textures have been blurred by an average filter
; and the green component left untouched (blurring doesn't
; affect the penumbra/shadowed areas)

To get a nice blur, this process has to be repeated a certain number of times (I found three or four to give pretty good results, but it is a performance trade-off). As it is not possible to render to the shadow map while the same shadow map is bound as input, a double (or more) buffer scheme has to be used, as seen in Figure 18.

Figure 18: The complete blurring process pipeline
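A hedged sketch of the ping-pong scheme of Figure 18; BlurPass is a placeholder that binds the six offset samplers and draws a full-screen quad with the blur pixel shader above:

#include <d3d9.h>
#include <utility>

void BlurPass(IDirect3DDevice9* pDevice, IDirect3DTexture9* source);   // placeholder

// Returns the texture that holds the final blurred result.
IDirect3DTexture9* RunBlurSteps(IDirect3DDevice9* pDevice,
                                IDirect3DTexture9* mapA,   // initial shadow map (red/green encoded)
                                IDirect3DTexture9* mapB,   // scratch buffer of the same size
                                int numBlurSteps)          // three or four work well per the text
{
    for (int i = 0; i < numBlurSteps; ++i)
    {
        IDirect3DSurface9* target = NULL;
        mapB->GetSurfaceLevel(0, &target);
        pDevice->SetRenderTarget(0, target);   // write into the buffer we are NOT reading
        target->Release();

        BlurPass(pDevice, mapA);               // read mapA through the six offset samplers

        std::swap(mapA, mapB);                 // ping-pong for the next step
    }
    return mapA;                               // last buffer written
}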

Using the Shadow Map

After the last blurring step, the shadow map's red component contains values in the [0, 1] range that correspond to the amount of shadowing of the pixel (0 means fully lit, 1 means fully shadowed). To use the shadow map, we only need to project it onto the scene using a standard screen-space projection; that is, we directly use the screen-space coordinates of a vertex as input for the texture coordinates. We can implement any custom lighting equation in a pixel shader and modulate the light's contribution by the red component of the shadow map.

As you are completely free to use any lighting equation, the soft shadows method can be used with per-vertex lighting or per-pixel lighting with bump mapping or specular lighting (or anything else). Multiple lights can be collapsed into one pass; for example, if we use per-vertex lighting and six texture units, we can use one texture for the diffuse map and five shadow maps for five soft-shadowed lights and sum up the contributions of each light in a pixel shader. The following are the vertex and pixel shaders to perform the projection and access the shadow map:

vs.1.1
; constants:
; c0.y  : 0.5
; c4-c7 : world-view-projection matrix
dcl_position0 v0
dcl_texcoord0 v2
; screen-space coordinates of the vertex:
m4x4 r10, v0, c4
; generate texture coordinates for the projection:
mov r6, r10
mov r6.z, r6.w
; r6.x = (x/z) * 0.5 + 0.5 = (x * 0.5 + 0.5 * z) / z
mul r6.x, r6.x, c0.y
mad r6.x, r6.z, c0.y, r6.x
; r6.y = 1 - ((y/z) * 0.5 + 0.5) = (0.5 * z - 0.5 * y) / z
mul r6.y, r6.y, c0.y
mad r6.y, r6.z, c0.y, -r6.y
; outputs:
mov oPos, r10
mov oT0, v2
mov oT1, r6

ps.1.4
; sample t0 (diffuse map) and t1 (shadow map)
texld r0, t0
texld r1, t1_dw.xyw
; r1.r is the light contribution. Use 1-r1.r because values in the shadow map range
; from 0 (lit) to 1 (shadowed), and we want the opposite.
mul r0, r0, 1-r1.r

Other Considerations

The Z-fail shadow volume algorithm requires the scene meshes to be two-manifold and form closed volumes. For a given edge, there must be exactly two adjacent triangles, no less and no more. It is possible to tweak the geometry to handle non-closed meshes, but it is not obvious. This is something to keep in mind when trying to implement the CSG simplifications for intersecting faces when inflating objects, as after the CSG operation the meshes have to remain two-manifold.

It is recommended that you keep a 1:1 ratio between the screen resolution and the shadow map size when possible, although you can decrease the shadow map size to increase performance. It is important to note that when using shadow maps smaller than the screen resolution, the shadow information can "melt" together and you can see "halos" appear around objects. This can be seen around the cat's legs and the temple columns in the demo on the companion CD. This is generally not too bad until you reach 1:4 ratios, like using a 256x256 shadow map for a 1024x768 screen. Aliasing also becomes visible when you start moving; shadows near the occluders, which should be sharp, are quickly blurred. On the other hand, halving the size of the shadow map yields a four times difference in fillrate and pixel shading rate, so it is a trade-off.

One idea to save fillrate or pixel shading rate is to keep the stencil test enabled when blurring the shadow map. Unfortunately, this requires clearing the color buffer of the shadow map to black anyway (since the masked area is not written due to the stencil test), so in the end there is no performance improvement.

To improve the blurring filter, it is possible to enable bilinear filtering on the shadow map inputs. However, this causes the green channel to be filtered too, which leads to a natural blur after a few blurring steps. To avoid that, as the green channel has to be preserved, filtering always has to be disabled for the center tap.

You might have to play with the shadow map offsets to ensure that the incoming and outgoing pixel positions match on screen, in order to avoid the shadow map being shifted by a few pixels after each blurring step.

Results

A demo is available on the companion CD. It requires DirectX 9 with a vertex shader 1.1 and pixel shader 1.4 compatible video card and at least six texture units (Radeon 8500 and up, GeForce FX and up). It shows a 6,000-triangle temple scene with six lights, two of which are casting real-time soft shadows. It runs at around 10 to 15 frames per second on a Radeon 8500 with a 512x512 shadow map and a four-step blur process. On a Radeon 9700, performance increases up to 30 frames per second. The stencil shadows algorithm could be optimized, and no culling of any sort is being done. Art is courtesy of Daniel Cornibert (danielcornibert@hotmail.com). Figures 19 and 20 show some scenes from the demo.

Figure 19: Screen shot taken from the soft shadows demo. This 6,000-triangle temple scene features six lights, two of which are casting spherical soft shadows, and runs at up to 35 fps on a Radeon 9700. The two lights are animated. (See Color Plate 27.)


Figure 20: Another screen shot from the temple scene, with a different viewpoint. The white dot near the cattaur (a cross between a cat and a centaur) shows the main light's position. (See Color Plate 27.)

References

[1] Bilodeau, Bill and Mike Songy, "Real Time Shadows," Creativity 1999, Creative Labs Inc. sponsored game developer conferences, Los Angeles, California, and Surrey, England, May 1999.

[2] Carmack, John, unpublished correspondence, 2000.

[3] Brotman, Lynne and Norman Badler, "Generating Soft Shadows with a Depth Buffer Algorithm," IEEE Computer Graphics and Applications, October 1984, pp. 5-12.

[4] Everitt, Cass and Mark J. Kilgard, "Practical and Robust Stenciled Shadow Volumes for Hardware-Accelerated Rendering," March 2002.

[5] Parker, S., P. Shirley, and B. Smits, "Single Sample Soft Shadows," Tech. Rep. UUCS-98-019, Computer Science Department, University of Utah, October 1998.


Robust Object ID Shadows

Sim Dietrich

Shadows are still an active area of research in real-time graphics. The two most popular approaches are shadow volumes and depth-based shadows, often called shadow maps.

Currently in wide use by newer graphics engines, shadow volumes can cost huge amounts of fillrate and cannot be cached from frame to frame. Also, shadow volumes are a vertex geometry-based approach, so any geometry created from texel or pixel manipulation, such as alpha testing, won't work with them.

Depth-Based Shadows

Texture or z-buffer-based approaches such as depth-based shadows handle all geometry types equally. In addition, they are view independent, so they can be cached to save fillrate. However, that very view independence causes various forms of aliasing artifacts [Stamminger02].

Projected shadow techniques rely on rendering the light's view of the scene into a texture or z-buffer, referred to here as the light view texture. This texture is then projected back onto the scene for each camera-visible pixel, and the depth from the light of each pixel is compared to the depth stored in the shadow map. If the depth stored in the light view texture is closer to the light than the depth computed for the pixel in question, that pixel is deemed shadowed.

One form of aliasing that arises when using shadow maps is depth aliasing. The depth function in the light view texture is discontinuous because there is only a single depth value for each texel. If the light view texture is magnified from the scene camera's point of view, the depth value calculated for the pixel will differ from the value looked up in the texture. This is due to resampling the depth function at two different frequencies.

This form of aliasing can be reduced by applying a bias to the depth calculation, adjusting the light's depth function either at the shadow creation or shadow testing phase to ensure there is some minimum distance between a shadow caster and receiver. The magnitude of this bias dictates the size of the smallest distance that can be supported between shadow caster and receiver. However, since the polygon may be at an arbitrary slope with respect to the light, a scalar bias is not sufficient [Wang94]. Instead, z-based approaches can use the new DirectX 9 render states D3DRS_DEPTHBIAS and D3DRS_SLOPESCALEDEPTHBIAS, the latter of which allows for greater biases at greater polygon slopes.
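A hedged DirectX 9 sketch of setting these biases; both render states take an IEEE float reinterpreted as a DWORD, and the values shown are illustrative only:

#include <d3d9.h>

// Slope-scaled bias pushes steeply sloped polygons further; the constant bias
// adds a small fixed offset in normalized depth units.
void SetDepthBias(IDirect3DDevice9* pDevice, float slopeScale, float constantBias)
{
    pDevice->SetRenderState(D3DRS_SLOPESCALEDEPTHBIAS, *(DWORD*)&slopeScale);
    pDevice->SetRenderState(D3DRS_DEPTHBIAS,           *(DWORD*)&constantBias);
}
// e.g., SetDepthBias(pDevice, 1.5f, 0.0005f);   // example values, not from the article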


In Figure 1, the dotted line represents the true polygon depth, while the stair-step pattern represents the depth stored in the light view texture. Depending on exactly where the true polygon depth is sampled, it will appear in front of or behind the depth stored in the light view texture. A depth bias can effectively move the stair-step line and the continuous line apart to try to avoid aliasing artifacts.

Figure 1: Depth aliasing

Texture-Based Depth Shadows

Not all graphics hardware supports true z-based shadows, which allow use of a z-buffer as a texture, so shader authors must rely on manipulating textures instead in order to encode depth [Dietrich01]. Because the polygonal depth bias is applied post-pixel shader, the depth bias is useless for texture-based shadow approaches. Scalar biases can still be used, but different scenes will require varying biases, making this shadow technique less general.

One way around this challenging depth aliasing problem is to avoid using depth entirely. Instead of storing depth in the light view texture, object IDs can be stored [Hourcade85].

Each object that can cast or receive shadows is assigned a unique object ID and rendered to the light view texture. Next, the object ID texture is projected back onto the scene, and each rendered object tests whether its object ID matches the projected object ID. If so, then that part of the object is nearest to the light and is not shadowed. If not, there must be some other object at least partially blocking the light from this pixel.

By getting rid of the depth term, the depth aliasing problems go away completely. Object ID shadows do not require a depth bias at all.

There are a few remaining problems, however: how to assign object IDs and how to account for projection aliasing. One approach is to assign object IDs per object or character, but that prevents the object or character from casting shadows onto itself. In order to achieve self-shadowing, another approach is to assign IDs per animation bone or model segment [Vlachos01]. An extension of this approach would assign a separate object ID to every triangle in the scene, including world geometry. Any of these approaches can work, but all share a common problem with object ID-based shadows: projection aliasing.

Let's say that we have a simple scene with a flat floor and a vertical wall that meet at a corner. Let's further assume that these two sections get assigned different object IDs, and the light is shining directly at the floor/wall junction (see Figure 2). In this case, a naive object ID shadowing algorithm would produce self-shadowing errors along the boundary between the wall and floor.


Figure 2: Varying object IDs along object boundaries

This happens because we start with a line rasterized at some resolution and orientation on the screen. During the shadow testing pixel shader, the point at the center of each pixel is projected back to the light view texture, and point sampling takes place to look up an object ID along another line in texture space of a differing resolution and orientation. These two lines in screen space and texture space will always almost line up, but rarely will they line up perfectly.

Since rasterization snaps to the nearest pixel center, any aliasing error can only be up to half a pixel wide. So, if we sample a 2x2 pixel area, thereby band-limiting the signal so that a single-pixel object ID difference is ignored, we can safely eliminate this form of aliasing. This can be accomplished by sampling a 2x2 area in the light view texture and only applying shadowing when all four samples agree that the pixel is in shadow [Dietrich01].

An easy way to achieve this for simple scenes is to restrict oneself to sorted 8-bit object IDs and use bilinear filtering [Vlachos01]. However, for scenes any more complex, 8 bits won't be enough for more than 256 objects.

The following is a shader that can handle 2^28 different object IDs on ps.1.1 hardware using DirectX 8.1 or higher. The large number of object IDs is important in cases where the shadowing technique will be used for both characters and world geometry. Over 268 million IDs is enough that a level of any size can be handled without having to sort or intelligently assign IDs; simply avoiding duplicate IDs for the various geometry pieces is enough.

One pass is required to pass down the object ID (typically stored in each vertex for API call efficiency) into the light view texture.

Pass 1: To Light View Texture

ps.1.1
// Just output the Object ID, stored across 7 bits each of
// R, G, B, & A from the diffuse iterator
mov r0, v0

Two passes are required to test each object for shadowing, but the results are guaranteed not to suffer from projection or depth aliasing. Both passes use the same pixel and vertex shader to sample two of the object IDs in the 2x2 texel region. The results of each pass are written to dest alpha as 1 for lit and 0 for shadowed. That way, if either one or both passes achieve a result of lit, the final result will be marked as lit, preventing self-shadowing artifacts on object ID boundaries.

Following the second shadow test, the lighting can be additively blended into the scene, based on dest alpha:

DestColor = SrcColor * DestAlpha + DestColor
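A hedged DirectX 9 render-state sketch of that blend; the lighting pass supplies SrcColor, and destination alpha was written by the two shadow-test passes:

#include <d3d9.h>

// Additive blend gated by destination alpha: DestColor = SrcColor * DestAlpha + DestColor
void SetShadowedLightingBlend(IDirect3DDevice9* pDevice)
{
    pDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
    pDevice->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_DESTALPHA);
    pDevice->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_ONE);
}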

Passes 2 and 3: To Back Buffer

ps.1.1
tex t0                              // fetch id0
tex t1                              // fetch id1
// grab 1st ID & compare
sub_x4 r0, v0, t0                   // diff0 * 8
add_x4 r0, r0, r0                   // diff0 * 64
dp3_x4_sat t0.rgb, r0, r0           // square diff0, sum * 64
+mul_x4_sat r0.a, r0.a, r0.a        // sum * 64
// grab 2nd ID & compare
sub_x4 r1, v0, t1                   // diff1 * 8
add_x4 r1, r1, r1                   // diff1 * 64
dp3_x4_sat t1.rgb, r1, r1           // square diffs, sum * 64
+mul_x4_sat t1.a, r1.a, r1.a
add_x4_sat t1.rgb, t1, t1.a         // sum * 64 * 4 = 256
+add_x4_sat r0.a, r0.a, t0.b        // sum * 64 * 4 = 256
add_sat r0, 1-r0.a, 1-t1            // reverse & add results of
                                    // both id checks
// r0 now contains 1 for lit, 0 for shadowed

The main concept behind the shader is to perform two ID checks in parallel. Because the object IDs are not sorted by depth from the light but simply assigned at preprocess time, the difference between two IDs may be positive, negative, or zero.

The repeated _x4 scaling is used to try to force a possibly small difference in IDs, like 2/255, to be scaled all the way to -1, 0, or 1.

The dp3-based squaring is used as an absolute value, making negative differences count as positive differences. This is important so that during the summing up of the R, G, B, and A channels, negative differences don't cancel out positive ones.

In order to make this shader work, the object IDs must be allocated so that none of the R, G, B, or A channels differ by only an LSB. This basically means allocating IDs in twos on a per-channel basis. Allocating IDs by twos for each color channel ensures that the minimum difference in any non-zero subtraction result is two. This allows us to shave off another "scale up by 2" instruction, which would have forced this to a three-pass approach.
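A sketch of one encoding that satisfies this rule, packing a 28-bit ID into 7 bits per channel and shifting each channel left by one so per-channel values always step by two; this layout is an assumption consistent with the text, not code from the article:

#include <d3d9.h>

// Encode a 28-bit object ID as a vertex diffuse color (v0 in the shaders above).
// Each channel carries 7 bits of the ID in its upper bits, so no two IDs can
// differ by only an LSB in any channel.
DWORD EncodeObjectID(unsigned int id)              // id in [0, 2^28)
{
    BYTE r = (BYTE)(((id >> 21) & 0x7F) << 1);
    BYTE g = (BYTE)(((id >> 14) & 0x7F) << 1);
    BYTE b = (BYTE)(((id >>  7) & 0x7F) << 1);
    BYTE a = (BYTE)(( id        & 0x7F) << 1);
    return D3DCOLOR_ARGB(a, r, g, b);
}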

Undersampling

This set of shaders completely solves the projection aliasing problem, but one problem remains that is inherent in all object ID shadowing methods: undersampling.

If a triangle in the scene projects to a very thin or small area in the light's view texture, such that it doesn't cross a pixel center and thus fails to be rasterized at all, the entire triangle will appear in shadow, because some other nearby object ID is found in the light view texture instead during the shadow testing phase.

One can reduce this problem by not giving every triangle in a finely tessellated mesh its own object ID. Smaller triangles undersample more often. Increasing the resolution of the light view texture decreases the occurrence of undersampling artifacts.

Undersampling is most common on silhouette edges with respect to the light and can be hidden to some degree by adjusting the lighting equation from N dot L to something like (N dot L)^2, which causes lighting on silhouettes to fade out more quickly.

Undersampling also occurs with depth-based shadows but is less serious there because a depth bias can correct it to some degree.

Mipmapped Object IDs

One idea to work around object ID undersampling is to utilize mipmapping to select the object IDs.

Rather than storing object IDs in the vertices of each triangle or mesh section, one can store them in a texture. Point sampling should be used to fetch the appropriate ID. The mipmaps for the object ID texture are constructed so that when sections that share the same object ID start to become thin or small and are at risk of undersampling, neighboring object ID sections are merged together. This allows a gradual lessening of shadow detail in a more controlled manner.

Figure 3: Mipmapped object IDs

The largest mipmap level contains areas of varying object ID. Each triangle in the shadow-casting or -receiving mesh has texture coordinates referencing one of these areas of constant object ID. The texture coordinates must be created such that they don't reference a neighboring object ID section.

The next smaller object ID mipmap is simply a smaller version of the first, whereas the next two mipmap levels show how the object ID sections are merged so that the smaller sections are subsumed into the larger sections. Eventually there is only a single object ID remaining, at which point the mesh will not self-shadow.

These mipmaps cannot be used in the traditional manner, however, because while they work fine from the light's point of view, one would have to write a complicated ps.2.0+ pixel shader in order to reproduce the light-view mipmap selection during the shadow testing phase.

One could instead simply choose a mipmap level for an entire mesh or mesh section. Unfortunately, that reduces batch sizes because it would require switching the LOD bias and mipmap clamping state between each object draw call.

Object ID LOD

Rather than using mipmapping to solve the problem, one can simply use the concept of geometric LOD. When constructing discrete geometric LODs, one must generate object IDs to store in each vertex of every triangle of the mesh. The LOD algorithm used to reduce geometry will automatically simplify the mesh, and the object ID-creation algorithm is run on the resulting lower-polygon mesh.

This is still not a great solution for many applications, however. One reason is that not all geometry in the level may have LODs generated for it, such as the world geometry.

Another reason is that neither the mipmap nor the LOD approach completely solves the problem of undersampling; rather, each approach reduces the frequency of errors.

Combining Object and Geometric LOD

Another approach to reducing object ID aliasing is to have a two-dimensional LOD table. One axis would represent distance from the shadowing light, and the other axis would represent distance from the scene camera. The camera distance axis would choose a mesh of a certain level of geometric complexity. The light distance axis would select a texture map that corresponds to a lower number of unique object IDs for the mesh.

Object ID Allocation for Convex Regions

Yet another approach that helps reduce undersampling problems is to identify convex areas or volumes of a mesh and make the entire convex section share the same object ID. This works because a convex mesh cannot shadow itself.

A simple version of this approach allocates object IDs via plane equations. Since two coplanar triangles can never shadow each other, no matter if they are adjacent or disjoint, one can give all coplanar triangles the same object ID. This is very effective in reducing object ID aliasing with world geometry because an entire floor or wall can share the same object ID, no matter how finely tessellated. Actually, each of the four walls and the floor of a rectangular room could share the same ID, since they form a convex region, but this may be a hard case to detect in practice if not using a BSP for world geometry.
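A rough sketch of such plane-based allocation; the quantization of the plane coefficients is an arbitrary assumption used here to merge nearly identical planes, not a detail from the article:

#include <cmath>
#include <map>
#include <vector>

struct Plane { float a, b, c, d; };   // normalized plane equation of a triangle

// Group triangles by (quantized) plane equation and give every group one object ID,
// so that all coplanar triangles of a floor or wall end up with the same ID.
std::vector<unsigned int> AssignCoplanarObjectIDs(const std::vector<Plane>& triPlanes)
{
    typedef std::map<std::vector<int>, unsigned int> PlaneMap;
    PlaneMap ids;
    std::vector<unsigned int> triIDs(triPlanes.size());
    unsigned int nextID = 0;

    for (size_t t = 0; t < triPlanes.size(); ++t)
    {
        const Plane& p = triPlanes[t];
        std::vector<int> key(4);
        key[0] = (int)std::floor(p.a * 256.0f + 0.5f);   // quantize the plane so nearly
        key[1] = (int)std::floor(p.b * 256.0f + 0.5f);   // identical planes share a key
        key[2] = (int)std::floor(p.c * 256.0f + 0.5f);
        key[3] = (int)std::floor(p.d * 16.0f  + 0.5f);

        PlaneMap::iterator it = ids.find(key);
        if (it == ids.end())
            it = ids.insert(std::make_pair(key, nextID++)).first;
        triIDs[t] = it->second;                          // all coplanar triangles share this ID
    }
    return triIDs;
}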

Summary

For a large class of applications that aren't tessellated enough for undersampling to cause significant problems, object ID shadows remain a viable alternative to depth-based approaches, given an efficient and robust method of achieving them, such as those presented above.

References

[Dietrich01] Dietrich, D. Sim, "Practical Priority Buffer Shadows," Game Programming Gems 2, Charles River Media, 2001, pp. 481-487.

[Hourcade85] Hourcade, J.C. and A. Nicolas, "Algorithms for Antialiased Cast Shadows," Computers and Graphics, vol. 9, no. 3, 1985, pp. 259-265.

[Stamminger02] Stamminger, Mark and George Drettakis, "Perspective Shadow Maps," http://www-sop.inria.fr/reves/publications/data/2002/SD02/PerspectiveShadowMaps.pdf.

[Vlachos01] Vlachos, Alex, David Gosselin, and Jason Mitchell, "Self-Shadowing Characters," Game Programming Gems 2, Charles River Media, 2001, pp. 421-423.

[Wang94] Wang, Y. and S. Molnar, "Second-Depth Shadow Mapping," http://www.cs.unc.edu/~molnar/Papers/Shadow.ps.


Reverse Extruded Shadow Volumes

Renaldas Zioma

Introduction

This article suggests a solution for dealing with shadowing artifacts that uses stenciled shadow volumes and allows proper self-shadowing while using occluder geometry that is separate from the visible geometry. Occluder geometry can be simplified, optimized for shadow volume extrusion in the vertex shader, and animated in real time. This solution is derived and adapted for stenciled shadow volumes from the work of Yulan Wang and Steven Molnar on shadow mapping [1].

The reverse extruded shadow volumes technique relies on a correct illumination model to hide shadowing artifacts. Breaking the illumination model, for example when using a darkening approach (where light is subtracted in shadowed areas) [2], may produce artifacts on the polygons facing away from the light source and requires special treatment. The simple case of a darkening approach with one light source is discussed later.

Why Separate Occluder and Visible Geometry?

There are many scenarios in which it's useful to have separate geometry for shadow volume construction. You may have to add extra triangles along sharp edges, and possibly extra vertices at the same position but with different normals, since the normals in a visible mesh are used for lighting calculations while shadow volume extrusion needs face normals instead [3]. Maybe you have to remove unnecessary triangles to reduce occluder mesh complexity. Or you may go even further and reduce the number of bone influences per vertex in the animated occluder to gain some more speed (if the shadowing algorithm is not already fillrate bound).

Problems and Solutions

Separate occluder geometry is very useful for improving the visual quality of the shadow or gaining some performance; however, when applied with the conventional shadow volume extrusion algorithm, it suffers from unacceptable lighting artifacts: harsh shadows appear on the lighted side of the visible geometry. This is caused by occluder geometry protruding from the visible geometry or by ambiguities in depth values (z-fighting).

Figure 1: Shadowing artifacts on the lighted side of the visible geometry [8]

There are a number of ad hoc solutions that can be used to reduce such shadowing artifacts; however, as we see later, they are not very robust.

• Fit the occluder mesh within the visible mesh [3]
  This solution works only for static geometry. In the case of dynamically modified geometry, even if all vertices of the occluder mesh are kept inside the visible mesh, there is no way to ensure that the occluder mesh isn't protruding.

Figure 2: "Fitted within" mesh before and after bending

• Inset vertices of the occluder mesh along their normals
  This reduces protruding of the occluder mesh during animations, but at the expense of smaller shadows. It also requires careful hand-tuning for each animation. There is another bad thing about this solution: it may break depending on the distance between the viewer and the geometry, since the depth buffer isn't linear (the post-perspective divide means a non-linear distribution) and the vertices are inset by a constant amount (the actual bias in depth buffer units will vary over the frustum) [4].


• Add bias to polygons of the visible mesh in post-perspective space
  To add an offset to polygons in post-perspective space, the ZBIAS state in DirectX 8 (Tom Forsyth suggests near and far plane shifting as a better solution than ZBIAS [5], however) and the DEPTHBIAS and SLOPESCALEDEPTHBIAS states in DirectX 9 are available. While this solution overcomes the non-linear distribution problem, it doesn't solve the problems listed earlier. Also, it introduces even more hand-tuning: a trade-off between eliminating shadowing artifacts and shadows that are pushed too far away to be realistic.

The algorithm becomes more and more complex. It requires more hand-tuned parameters or introduces new restrictions. This encourages attacking the problem from another perspective and searching for more robust solutions.

Let's leave stenciled shadow volumes for a bit and look at another shadowing solution: shadow maps (the light's depth buffers). The cleverness of shadow mapping is that the depth buffer generated by rendering the scene from the light is a precomputed light visibility test over the light's view volume. The visibility test is of the form: a point p seen from the light is lit if p.z <= d(p.x, p.y), where d is the depth stored in the light view texture at (p.x, p.y).

With the conventional technique, the light's depth buffer is filled with the front faces (with respect to the light), so the stored depth and the depth computed for a lit front face are nearly equal, and limited precision causes false self-shadowing. Wang and Molnar observed that, for closed occluders, the depth buffer can instead be filled with the back faces; the stored back-face z-values and the tested front-face z-values are likely far enough apart to not falsely self-shadow [6]. This allows transferring the artifacts from the front faces to the back faces. Artifacts on the back faces do not matter because they are already known to be in shadow and can be hidden by the illumination model.

If this technique works for shadow mapping, when the right parameters for correct illumination and suitable back-face and front-face z-values are chosen, it might work just as well for stenciled shadow volumes.

Bringing Shadow Mapping Wisdom to Shadow Volumes

The conventional shadow volume extrusion algorithm treats polygons facing the light source as light occluders (shadow casters). Polygons that are facing away from the light source are projected to infinity in order to form the shadow volume. Once again, notice the similarity to the conventional shadow map technique, where the front faces are used to fill the depth buffer, forming the "front" of the shadow volume, which in turn extends to infinity.

Here is the conventional shadow volume extrusion algorithm (L is the light's direction, N is the normal of the occluder polygon):

• If L.N < 0, project the vertex to infinity (or just far enough) along the normal.
• Leave other vertices unchanged.

Now let's try to apply Wang and Molnar's wisdom to stenciled shadow volumes. Instead of using polygons facing the light source as shadow casters, we could just use the polygons facing away from the light source; in other words, we need to reverse the conventional shadow volume extrusion technique.

Here is the reverse shadow volume extrusion algorithm:

• If L.N > 0, project the vertex to infinity (or just far enough) along the normal.
• Leave other vertices unchanged.

Figure 4: Reversed and conventional shadow volume extrusion


What does this technique do? It's the same as what Wang and Molnar described for shadow mapping — it doesn't reduce the shadowing ambiguities but transfers them from polygons that face the light source to polygons that face away from the light source. Since these polygons are always in shadow by definition, the illumination model will hide these artifacts automatically.

Implementation of Reverse Shadow Volume Extrusion

The following code illustrates the reverse shadow volume extrusion algorithm implemented as a vertex shader. The first part of the vertex shader calculates the direction to the light and then normalizes it.

#define POSITION        v0.xyz
#define NORMAL          v3.xyz
#define WORLD_VIEW_PROJ c0
#define LIGHT           c4.xyz  // light position in object space
#define EXTRUDE         c5.x    // extrusion offset
#define LN_THRESHOLD    c5.y

vertexshader vs =
#ifndef DX9
    decl {
        stream 0;
        float v0[3]; // position
        float v3[3]; // normal
    }
#endif
    asm {
        vs.1.1
#ifdef DX9
        dcl_position POSITION
        dcl_normal NORMAL
#endif
        add r0.xyz, LIGHT, -POSITION    // r0: L
        dp3 r0.w, r0.xyz, r0.xyz        // r0.w: |L|^2
        rsq r0.w, r0.w                  // r0.w: 1/|L|
        mul r0.xyz, r0.xyz, r0.www      // r0.xyz: normalize( L )

Figure 5: Shadowing artifacts on the lighted side, without light, and hidden by the light



The following part of the shader calculates the dot product between the occluder normal and the direction to the light in order to decide if the occluder polygon is facing the light or not. If the dot product is greater than or equal to the threshold (the threshold usually equals 0), the polygon is considered as facing the light and should be projected away from the light. The result of the sge instruction, which equals 1 if the polygon is facing the light and 0 otherwise, is used as a mask for the projection distance.

Also, please note that the new vertex shader differs from the conventional shadow volume extrusion vertex shader only in one instruction. Instead of the slt instruction, it uses the sge instruction.

        dp3 r0.w, r0.xyz, NORMAL        // r0.w: L.N
        sge r0.w, r0.w, LN_THRESHOLD    // r0.w: 1.0f if frontface!
                                        // r0.w: 0.0f if backface
        mul r0.w, r0.w, EXTRUDE         // r0.w: extrusion coefficient

The final part of the shader makes the actual extrusion of the vertex by adding the projection distance to the vertex position. The work of the shader is finished with the transform of the vertex position to projection space by the world-view-projection matrix.

        mad r0.xyz, -r0.xyz, r0.www, POSITION // r0.xyz: extruded vertex
        mov r0.w, v0.w
        m4x4 oPos, r0, WORLD_VIEW_PROJ        // vertex to projection space
    };

Shadows via Darkening

If the shadowing approach where light is subtracted in shadowed areas is used, the reverse shadow volume extrusion technique may produce artifacts on the polygons that are facing away from the light.

This happens because such an approach actually breaks the illumination model, and the reverse shadow volume technique relies on correct lighting calculations to hide the artifacts. Light may be subtracted from polygons that were not lit, resulting in regions that are darker than the ambient term. In order to fix this problem for a single light source, the ambient term must be added to the frame buffer only after the darkening pass.

Analysis

The reverse shadow volume extrusion technique has the following strengths:

• It provides an easy way to use separate occluder and visible geometry for stencil shadows.
• It allows proper self-shadowing using stencil shadows.
• It's more robust than ad-hoc solutions for conventional shadow volume extrusion and does not introduce fragile algorithms and hand-tuned parameters.
• It's easy to migrate from conventional shadow volume extrusion to the reverse — the magic is done in only one instruction.


The reverse shadow volume extrusion technique has the following weaknesses:

• A closed mesh is required for shadow generation. Actually, if a vertex shader is already being used for shadow volume extrusion, a closed mesh is required in any case. If the occluder mesh isn't closed, then a preprocessing step must generate a closed one. Notice that the visible mesh need not be closed, since it is not involved in shadow volume extrusion at all.
• A correct illumination model is required. Since the shadowing-via-darkening approach breaks the illumination model, special treatment is needed when such an approach is used to avoid artifacts on the polygons facing away from the light.
• The occluder geometry must be sufficiently sized — back faces should not protrude or z-fight with the front faces.

Summary

The reverse shadow volume extrusion technique introduces an easy way to use separate occluder and visible geometry and allows proper self-shadowing with the stenciled shadow volumes approach.

References

[1] Wang, Yulan and Steven Molnar, "Second-Depth Shadow Mapping," UNC-CS Technical Report TR94-019, 1994.

[2] Dietrich, Sim, "Shadow Techniques," GDC 2001.

[3] Everitt, Cass and Mark J. Kilgard, "Optimized Stencil Shadow Volumes," GDC 2003.

[4] Kilgard, Mark J., "Shadow Mapping with Today's OpenGL Hardware," CEDEC 2001.

[5] Forsyth, Tom, "Why ZBIAS is not a good thing," http://tomsdxfaq.blogspot.com/.

[6] Everitt, Cass, Ashu Rege, and Cem Cebenoyan, "Hardware Shadow Mapping," http://developer.nvidia.com/docs/IO/1830/ATT/shadow_mapping.pdf.

[7] Williams, Lance, "Casting Curved Shadows on Curved Surfaces," Computer Graphics, SIGGRAPH '78 proceedings, pp. 270-274.

[8] Pranckevicius, Aras, "Reverse extruded shadow volumes," http://www.gim.ktu.lt/nesnausk/nearaz/texts/revext.html.


Section VI

3D Engine and Tools Design

Shader Abstraction
by Tom Forsyth

Post-Process Fun with Effects Buffers
by Tom Forsyth

Shaders under Control (Codecreatures Engine)
by Oliver Hoeller

Shader Integration in the Gamebryo Graphics Engine
by Scott Sherman, Dan Amerson, Shaun Kime, and Tim Preston

Vertex Shader Compiler
by David Pangerl

Shader Disassembler
by Jean-Sebastian Luce


Shader Abstraction

Tom Forsyth

The Problems

There are many problems that crop up when writing any sort of graphics engine, whether aimed just at the PC platform or at multiple platforms, including consoles. The big one is almost always scalability, specifically the scalability of the shaders used for rendering the scene. Even when aiming at a single console platform, scalability is still important for allowing maximum detail in foreground areas while not spending time rendering this detail in the background, where it is not visible. When developing for multiple platforms or a sensible range of PC hardware, scalability becomes extremely important.

Various conventional solutions exist, none of them ideal. A brief list of the possible cases and their traditional solutions follows:

Multiple PC cards — TSS, PS1.1, PS1.3, PS1.4; FFP, VS1.1; P/VS 2.0, P/VS 3.0¹

• Use the lowest common denominator. Ugly.
• Fix a high "minimum spec." Reduces your possible market.
• Two or three versions of the low-level engine. Large coder, artist, and QA time.

Multiple platforms — PC, XB, PS2, GC

• Multiple and completely separate engines. Lots of coder time. Artists author multiple versions.
• Shared API layer. Essentially the lowest-common-denominator solution again.

Scalability within scenes according to distance

• Mesh Level of Detail (VIPM, static LoD) can help but only geometry, not fillrate.
• Fade stuff out (litter, incidental objects, texture layers) — looks rather strange.

1 TSS is TextureStageState, a reference to the SetTextureStageState call used to set up this style of pixel blending. FFP is the fixed-function pipeline — the transform and light part of Direct3D that was replaced by vertex shaders. PS is pixel shader. VS is vertex shader. P/VS is both pixel and vertex shader.



A Single Solution?

The solution proposed in this article is to abstract shaders. In many ways, this is just a combination of some of the above ideas, but it is much more thorough than the solutions above and affects the design of the entire engine. The abstraction of shaders means that the models (and therefore the artists who authored those models) specify a description of an ideal shader, but then in code the shader is allowed to degrade gracefully in quality according to both platform and distance from the camera.

This concept is not new. Features like detail textures have been optional extras for some time in many games, enabled only on higher-end systems, but they essentially work from the lowest common denominator upward by attempting to artificially add detail. This gives a much poorer quality result than working the other way — allowing artists to author high and scale down.

With the advent of some very complex shader models such as anisotropic BRDFs², self-shadowing bump maps, and displacement mapping, the gap between the lowest acceptable quality of shaders and the highest possible grows even wider. Adding a few noise functions over the top of a texture map fails to impress people any more, and the alternative is to discard the lower-end systems, which make up a large proportion of the market.

There is a small speed cost associated with this flexibility. However, it is matched by considerable advantages: the run-time speed difference is small, and in many cases the abstraction allows other optimizations to be applied that more than counteract the slight overhead.

Essentially, the idea is for the artists to design objects with the highest superset of features available. They model and describe objects with as much shader detail as they have time and resources for, without worrying too much about which target platforms can render the data they produce or how much of the scene will be rendered with all that data. Naturally, some judgment is required to balance time spent producing super-detailed shader information against the likely benefit, but the artists are not tied rigidly to the target platform(s) in the usual way.

For simplicity, I usually refer to the different types of PC cards as different platforms. The only difference between the two cases is that on the PC, the choice of "platform" is not known until game installation or start of day, while on the console it can be determined at compile time. However, the methods used do not add any performance penalty for this relatively late decision, and the only slowdown is at load time; even then it is small compared to the time taken to retrieve mesh and texture data from the hard drive.

2 Bidirectional Reflectance Distribution Functions


The Material

The core concept in this abstraction is the Material³. Frequently implemented as a C++ class, this is a black box that wraps up all the rendering details from the rest of the engine. Mesh data, textures, lighting, position, orientation, animation, and so on are fed into the Material and out come pixels on the screen. This data can be fed in when exporting from the various content creation stages and intermediate formats stored on the hard drive, CD, or DVD, or it can be fed in at run time for dynamic objects such as sprites, HUD, font draws, particle systems, and so on. Either way, there are a few well-defined input formats that are shared by all Materials, but the details of rendering and the internal data stored on the hard drive, CD, or DVD are all private to the specific Material.

The shared input interface allows a single mesh with complex shader data to be fed to a wide range of Materials with different rendering styles, and the Material deals with the details of efficient rendering at a certain quality and speed level. Because all the intermediate information (such as vertex buffers, shaders, texture formats, and content) is private to the Material, each Material can be individually optimized for particular situations and platforms without worrying about breaking any other parts of the code. In this way, the huge complexity of a multi-format, highly scalable rendering system is kept manageable by reducing as many interdependencies as possible.

Materials are generally both subclassed (using virtual C++ classes) and instanced. A subclass is used when there is a different style of rendering — a different number of passes, different types of inputs (e.g., lighting info, shadow buffers, environment maps), and so on. A different instance is used when the code is the same but the details are different — for example, different render states or pixel or vertex shaders. Essentially, any time there is an "if" in the rendering function that tests data determined at start of day, it is probably time to use a different class rather than an instance.

A Simple Static Material

To help make a few of these ideas concrete, here is a very simple implementation of a static Material class. A static Material is used for meshes that do not have their mesh data changed by the CPU at run time — positions, texture coordinates, vertex coloring, and so on are purely determined at export time. However, this does include cases where data is generated by the GPU — animation, changing lighting conditions, environment mapping, and so on. This covers most meshes that are output from a 3D art package. The opposite is a dynamic class, such as a particle system or a font draw — these will be addressed later⁴. To keep it simple, this engine does not yet have any textures or animation:

3 To make it clear when I am talking about the code construct, I capitalize the M when talking about it, as opposed to a "material," which is usually used as a way of referring to the properties of a particular surface. The concepts overlap but are not always identical.



class Material
{
private:
    MaterialDescriptor desc;
    Material ( const MaterialDescriptor &md );
    virtual ~Material();
public:
    static Material *FindOrCreate ( const MaterialDescriptor &md );
    virtual void *Export ( size_t &SizeOfDataInBytes,
                           const FatVertex *pVertices,
                           const u32 *pIndices,
                           int iNumVerts, int iNumTris )=0;
    virtual void Render ( const void *pExportedData,
                          const Matrix43 &orientation )=0;
};

// And some derived classes would be:
class MaterialUnlit : public Material { /* ... */ };
class MaterialLit : public Material { /* ... */ };
class MaterialLitShiny : public Material { /* ... */ };

Note that Material is a base class and cannot be created directly. It can only be used as a template and interface for the other Materials derived from this class that implement specific rendering methods. However, the Material class is the interface that the rest of the engine uses; it does not (except in very specific circumstances) need to know about any of the derived Material classes. The following sections describe each part of this class.

MaterialDescriptor

MaterialDescriptor is a relatively large, inefficient structure with various combinations of flags and enums that specifies the properties of a certain Material. The flags say things like "this material has a diffuse texture," "this material is shiny," "this material has vertex colors in the mesh data," and so on. This structure is used to uniquely identify a Material, and no two Materials will have the same MaterialDescriptor, even if they are different derived classes.
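To make the idea concrete, here is a minimal sketch of what such a descriptor might look like; the specific fields and flag names are illustrative assumptions, not the actual structure used in the engine described here.

#include <cstdint>

// Hypothetical descriptor: plain flags and enums, cheap to copy and compare.
struct MaterialDescriptor
{
    enum class Platform  : std::uint8_t { PC_DX7, PC_DX8, PC_DX9, Xbox, PS2, GameCube };
    enum class InputData : std::uint8_t { None, ParticleVerts, FontVerts };

    // Feature flags -- one bit per "this material has/is ..." property.
    static const std::uint32_t HasDiffuseTexture = 1u << 0;
    static const std::uint32_t HasVertexColours  = 1u << 1;
    static const std::uint32_t IsShiny           = 1u << 2;
    static const std::uint32_t HasBumpMap        = 1u << 3;
    static const std::uint32_t IsAnimated        = 1u << 4;

    Platform      platform      = Platform::PC_DX9;
    InputData     dynamicFormat = InputData::None;  // "none" for static Materials
    std::uint32_t flags         = 0;

    bool operator==(const MaterialDescriptor &o) const
    {
        return platform == o.platform &&
               dynamicFormat == o.dynamicFormat &&
               flags == o.flags;
    }
};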

Material::FindOrCreate()

To ensure that no two Materials will have the same descriptor, the engine is not allowed to create Materials itself. What the engine does when it needs a new Material is create the appropriate descriptor and pass this to the Material::FindOrCreate() function. This will either return an existing Material or create a new one; it keeps an internal list of all existing Materials. This function is a simple version of a "class factory," where all the classes it can create are derived from Material, but the actual type (MaterialUnlit, MaterialLit, MaterialLitShiny) is determined by the descriptor that is fed in. Although searching through the list of existing Materials and trying to match up the MaterialDescriptor is relatively slow (it can be sped up by hashing the descriptor as a quick test), this is only done when new objects and new materials are created, which is usually only at start of day or at appropriate intervals, such as loading new levels. In practice, it is not a speed problem.

4 Note that on some platforms, the distinction is a little more precise; see "Dynamic Materials" later for more.
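A sketch of how that lookup might work, assuming the MaterialDescriptor sketched above; the registry container, the FNV-style hash, and the CreateDerivedMaterialFor() factory helper are illustrative assumptions rather than the article's actual implementation.

#include <cstdint>
#include <vector>

class Material;  // the base class declared earlier in this article

// Assumed factory helper that news up MaterialUnlit / MaterialLit / ... based on the flags.
Material *CreateDerivedMaterialFor(const MaterialDescriptor &md);

namespace
{
    struct MaterialEntry
    {
        std::uint32_t      hash;      // quick-reject value
        MaterialDescriptor desc;      // exact comparison
        Material          *pMaterial;
    };
    std::vector<MaterialEntry> g_Materials;

    std::uint32_t HashDescriptor(const MaterialDescriptor &md)
    {
        // FNV-1a style mix over the descriptor's fields (assumed layout).
        std::uint32_t h = 2166136261u;
        auto mix = [&h](std::uint32_t v) { h ^= v; h *= 16777619u; };
        mix(static_cast<std::uint32_t>(md.platform));
        mix(static_cast<std::uint32_t>(md.dynamicFormat));
        mix(md.flags);
        return h;
    }
}

Material *FindOrCreateSketch(const MaterialDescriptor &md)
{
    const std::uint32_t h = HashDescriptor(md);
    for (const MaterialEntry &e : g_Materials)
        if (e.hash == h && e.desc == md)       // hash first, then full compare
            return e.pMaterial;

    Material *pNew = CreateDerivedMaterialFor(md);
    g_Materials.push_back(MaterialEntry{ h, md, pNew });
    return pNew;
}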

Material::Export()

The Export function is called by the developer when converting data from the format output by art packages into data to be written to the distribution media (CD, DVD, archive, WAD file, etc.). For each mesh that is placed in the game, a MaterialDescriptor is created that describes its rendering properties, FindOrCreate() is used to find the specific Material that will do the rendering, and then the mesh data is sent to the Export() method of this Material.

The input vertex format — FatVertex — is shared across all Materials and frequently contains space for all the possible required data (such as tangent vectors, multiple texture coordinate sets, vertex colors, and so on), even if most Materials do not use or require this data. Similarly, the index data is in a standard format. Indexed triangle lists with 32-bit indices are very common because they are simple for content packages and load/save routines to use, but again the format is up to the application. Whether the target platform prefers indexed or non-indexed data, lists, strips, fans, or quads does not affect the input to Export() — any expanding, reordering, splitting, etc. is handled internally by the Material::Export() call, since only it knows the rendering specifics.

The Export function takes all the mesh data (and texture data, which has been omitted for simplicity for now) and processes it into a hardware-friendly form that it can easily use at run time. This usually involves stripping or reordering the triangles for efficiency, reindexing them if needed, removing all the data in the FatVertex that is either not needed or cannot be used (for example, tangent-space data for a Material that cannot render bump maps), compressing the data for the target platform, and so on. The final processed data is packed into a single continuous chunk of bytes in whatever format the Material requires and passed back as the result of Material::Export(). The size of the data is written to SizeOfDataInBytes. Of course, any sort of data-packaging method may be used — STL vectors of bytes and so on are also handy.

It is then the mesh's responsibility to write this data to the distribution media in a place where it can load it again later. The mesh will also write out the MaterialDescriptor that it used to find the Material.

Note that the same mesh data (FatVertices and indices) may be fed to almost any Material's Export() method. If data is provided that is not used by the Material's rendering style, it is either ignored or incorporated in some sensible manner to try to approximate the desired effect. For example, feeding a shininess map to a Material that only renders using vertex specular will probably make it look at the map, take the average shininess value, and use that as its overall shininess. Alternatively, if the Material uses a per-vertex shininess value, it may sample the shininess map at each of the mesh vertices. If there are two possible ways to process the data from input format to rendering format, the mesh can choose which to use by sending a standard set of flags to the Export() function.

If data is missing that the Material would normally require, it must be able to cope and generate default "NULL" data. For example, if a Material renders with a diffuse texture but is passed mesh data with no texture, it needs to generate a small pure white texture to use so that the same visual effect is seen on-screen. This is inefficient and should normally be flagged as a bad thing; ideally the mesh should have selected a different Material by specifying in the MaterialDescriptor that it did not have a diffuse texture. However, this should not be a fatal error, and the code should be able to cope and use sensible defaults. These cases do happen because, as can be seen later, Material::FindOrCreate() does not necessarily return the Material that was asked for.

Material::Render()

When the mesh is loaded while the game is running (for example, when loading a level that the mesh is in), it loads the MaterialDescriptor that it saved, calls Material::FindOrCreate(), and stores the Material* it gets back. It can then throw the bulky MaterialDescriptor away, since it doesn't need it any more — it has a pointer to the Material singleton instead. The mesh also loads the big chunk of data returned by Export(). It still doesn't know what is in this data, but it does know that if it passes the data to Material::Render() along with its current orientation, the mesh will appear on the screen.

The Render() call knows what format the data is in because this is the same Material class that created the data during the Export() call, so it can use the data directly to render the mesh on-screen in the most efficient way possible.
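For concreteness, the mesh-side code implied here might look roughly like the sketch below. The Mesh class and the File I/O helper are assumptions for illustration; Material, MaterialDescriptor, Matrix43, FindOrCreate(), and Render() are the types and calls described in this article.

#include <cstddef>

// Assumed I/O helper; the real engine would use its own loading code.
class File { public: void Read(void *p, std::size_t n); const void *ReadChunk(); };

class Mesh
{
    Material   *pMaterial = nullptr;
    const void *pChunk    = nullptr;   // opaque blob produced by Material::Export()
    Matrix43    orientation;

public:
    void Load(File &f)
    {
        MaterialDescriptor md;
        f.Read(&md, sizeof(md));                   // descriptor written out at export time
        pMaterial = Material::FindOrCreate(md);    // bulky lookup, but only at load time
        pChunk    = f.ReadChunk();                 // format known only to pMaterial
        // md can be thrown away now -- the Material pointer is all the mesh keeps.
    }

    void Draw() const
    {
        pMaterial->Render(pChunk, orientation);    // the Material turns the blob into pixels
    }
};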

Scalability

Each Material instance has exactly one rendering style with one quality level on one target platform. A Material that renders on a PS2 will not render on an Xbox, and vice versa, although there may be two Materials that render using the same techniques and produce the same pixel output. Making the target platform (again, counting different classes of PC cards as different platforms) part of the MaterialDescriptor allows the export phase to be explicit about which targets it is exporting to. It can create two otherwise identical Materials, one for each platform, and call Material::Export() on both with the same mesh data. They can then do completely different things to transform the data into native formats for the target hardware, such as stripping, batching, and data format conversion.

Note that although the export phase can call Material::Export() on Materials that are destined for PS2 and Xbox, it cannot normally call the Material::Render() method on these Materials because that will only work on the target platform itself. One possible exception is that on a PC compile, these Materials may render some sort of emulation of the target hardware. This allows artists to preview how their data will look on various platforms without needing a full export cycle each time or needing a console development station at every desk. However, in the case of consoles, it is extremely hard to get the outputs looking identical, since one is on a high-resolution monitor and one is usually outputting to a television.

Each Material knows how to downgrade itself in quality, with the assumption that this method will be quicker to render. If one of the targets is the PC, Materials will also know how to downgrade themselves to require less sophisticated hardware — for example, to use a lower pixel or vertex shader version number, or to drop back to the TextureStageState pipe or the fixed-function texture and lighting pipeline. These are exposed by the methods Material::FindQuicker() and Material::FindSimpler() that both either return a different Material (via an internal call to FindOrCreate) or return NULL if there is no quicker or simpler way to render the data.

At export time, a mesh will call Material::FindOrCreate() with its desired MaterialDescriptor. This will find the highest-quality Material available, which should render the mesh with all its data on-screen. The mesh calls Export() and saves the data produced. It then keeps calling FindQuicker() and/or FindSimpler() and exporting more data each step of the way for each of the Materials until one or both return NULL for the quicker/simpler versions of the Material.
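A sketch of that export-time loop under the interface shown earlier; the ExportFile output helper and its SaveChunk() call are assumptions, and for brevity only the FindQuicker() chain is walked.

// Walk down the quality chain, writing one exported blob per Material in the chain.
void ExportMeshChain(const MaterialDescriptor &desired,
                     const FatVertex *pVerts, const u32 *pIndices,
                     int nVerts, int nTris,
                     ExportFile &out)                       // assumed output helper
{
    Material *pMat = Material::FindOrCreate(desired);       // highest-quality Material first
    while (pMat != NULL)
    {
        size_t sizeBytes = 0;
        void *pData = pMat->Export(sizeBytes, pVerts, pIndices, nVerts, nTris);
        out.SaveChunk(pMat, pData, sizeBytes);               // remember which Material owns the blob

        pMat = pMat->FindQuicker();                          // and/or FindSimpler(); NULL ends the chain
    }
}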

At run time, the same process is carried out to load the different versions of the mesh, but this time calling Import() on each Material. FindOrCreate() finds the highest-detail Material, the mesh calls Import() on that, and then goes down the chain of Materials returned by successive FindQuicker() calls, importing each in turn.

Note that on the PC target there is no need to call FindSimpler() at run time. The initial FindOrCreate() call will have returned a Material that is valid for the current hardware capabilities. Once the highest-detail Material is returned, it guarantees that any FindQuicker() results will also be valid on the same platform.

A FindQuicker() call should never return a Material that uses more hardware resources than the current Material. In some cases, especially when dealing with the legacy DX7 TextureStageState and fixed-function pipeline interfaces, this can get fairly tricky; in practice, what usually ends up happening is that there is a generic, single-texture LoD chain of Materials, a generic dual-texture chain, and also several chains specifically targeted at certain common chipsets. If the chipset detected is not one of those specifically supported, then the generic single- or dual-texture chain is used as appropriate. In theory there are various capability bits that can be checked for bits of functionality, but because of driver bugs, in practice it is safest to simply identify the card directly. If identification fails, use a generic "safe" feature set that seems to work on all known cards. Achieving good DX7 compatibility is a huge topic and outside the scope of this article, but using this Material system keeps the implementation relatively simple and easy to modify.

In some cases, the fact that each Material performs its own export call can produce duplicate data. For example, a Material that handles three vertex lights can degrade to a Material that only handles one vertex light, and the mesh data stored by each is usually the same — it is only the vertex shader that changes. To avoid this, when Material::Export() is called, it exports the data to a memory buffer rather than a file and compares the contents against all other previously exported mesh buffers. If it finds a match, it uses the existing file rather than duplicating the data. Using a suitable hash function such as a CRC keeps this comparison fast. The export is performed multiple times, but the extra time taken is not usually much, and this method removes any dependencies between the different Export() calls for different Materials — all it checks is whether their final output data is identical or not. This keeps maintenance problems to a minimum and maintains flexibility.
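A sketch of that duplicate check; the container, the function names, and the WriteNewChunkFile() helper are assumptions, and a simple FNV-style hash stands in for the CRC mentioned above.

#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <vector>

std::string WriteNewChunkFile(const void *pData, std::size_t size);   // assumed: writes the blob, returns its filename

namespace
{
    struct ExportedBuffer { std::vector<unsigned char> bytes; std::string filename; };
    std::multimap<std::uint32_t, ExportedBuffer> g_Exported;   // hash -> previously exported blobs

    std::uint32_t HashBuffer(const void *pData, std::size_t size)
    {
        const unsigned char *p = static_cast<const unsigned char *>(pData);
        std::uint32_t h = 2166136261u;
        for (std::size_t i = 0; i < size; ++i) { h ^= p[i]; h *= 16777619u; }
        return h;
    }
}

// Returns the file to reference: an existing one if an identical blob was already written.
std::string FindOrWriteExport(const void *pData, std::size_t size)
{
    const std::uint32_t h = HashBuffer(pData, size);
    auto range = g_Exported.equal_range(h);
    for (auto it = range.first; it != range.second; ++it)
        if (it->second.bytes.size() == size &&
            std::memcmp(it->second.bytes.data(), pData, size) == 0)
            return it->second.filename;                      // identical output: reuse the existing file

    ExportedBuffer eb;
    const unsigned char *p = static_cast<const unsigned char *>(pData);
    eb.bytes.assign(p, p + size);
    eb.filename = WriteNewChunkFile(pData, size);
    return g_Exported.insert(std::make_pair(h, eb))->second.filename;
}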

A further quality improvement is possible if each Material not only knows which other Material is a version of itself that is quicker to render but also knows how to smoothly downgrade itself visually to match that Material. For example, a Material that renders a diffuse and detail texture will fade the detail texture to nothing, and it then visually matches the simpler Material without a detail texture. This is done by adding an argument to the Material::Render() method that controls the Level of Detail (LoD). A value of 1.0 renders the Material with full detail. An LoD value of 0.0 renders the Material with reduced detail, making it visually identical to calling Material::FindQuicker()->Render() with an LoD value of 1.0. This allows the mesh to gradually degrade the current Material that it is using until it can then swap to using the quicker Material while avoiding any sudden pop. These sudden pops attract the eye of the player and are very distracting, whereas a smooth change is far harder to notice.
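One way the calling code might drive that argument per frame is sketched below; the distance-to-LoD mapping is an arbitrary assumption, and the three-argument Render() is the LoD-taking extension just described.

// pChain[0] is the highest-quality Material, pChain[i+1] == pChain[i]->FindQuicker().
void DrawWithLoD(Material *const pChain[], const void *const pChunks[], int chainLen,
                 float distance, const Matrix43 &orientation)
{
    const float bandSize = 20.0f;                        // assumed: one quality band per 20 units
    float band = distance / bandSize;
    int index  = static_cast<int>(band);
    if (index < 0)         index = 0;
    if (index >= chainLen) index = chainLen - 1;

    // 1.0 = full detail; as it falls toward 0.0 this Material fades to match the next one,
    // so the swap to pChain[index + 1] at the band boundary does not pop.
    float lod = 1.0f - (band - static_cast<float>(index));
    if (lod < 0.0f) lod = 0.0f;

    pChain[index]->Render(pChunks[index], orientation, lod);  // the LoD-taking Render() described above
}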

For continuous worlds that do not have any discrete boundaries, such as landscapes, each mesh (where a "mesh" is a section of landscape) cannot be given a single LoD value; otherwise, seams will appear between parts of the landscape. In this case, the LoD value will be calculated by the Material independently at each vertex. However, the calling program can still calculate the highest possible LoD for each mesh and call the appropriate Material for that LoD. This does complicate the system, but in practice continuous landscape engines are highly specialized anyway — these Materials are likely to always be used through a different set of interfaces designed specifically for landscape rendering.

In some cases, there may be Materials that do not do a transition themselves but link to a Material that can. This allows rendering to usually use Materials that do not transition (and therefore may execute faster), but when going from one to the other, it goes via a Material that can transition but executes slightly slower. The idea is that few things in the scene are using the transitional Materials at any one time, so performance is retained but without the visually objectionable popping of a sudden change. In practice, this is rarely necessary, as most effects (detail maps, bump maps) can be faded out with little speed impact.

Data-Hiding and Maintenance

Because almost all Material classes, whatever their descriptor, can be fed the same mesh data at export time and use the same external interfaces, it is easy to keep them isolated from each other. This means that the most common rendering cases can be separated out into their own separate Material and minutely optimized without breaking all the other Materials. All the less-common shader effects or combinations can be done by far more general Materials that may handle a lot of the cases by rendering more complex shaders and inserting dummy black or white colors or textures or constant factors of 0 or 1. Additionally, quite late in the project more Materials can be added that optimize the commonly used cases without changing any other Materials. This allows optimization to be done right up to the last moment without the fear of having broken some obscure combination of shader features, since that combination will still use well-tested general-purpose Materials.

This idea of starting with general cases that work everywhere (although perhaps not optimally) and specializing as the project goes on and the common requirements are better understood is a powerful one. It allows the engine to support a huge range of features without committing programmer time to optimizing every possible combination of supported features and without producing exponentially uglier code as the number of special cases increases.

Implementation Details

In practice, life is a little more complex. In addition to a Material::Export() method, there is usually a Material::Import() method that actually takes the large chunk of data. This method will create things like D3D vertex and index buffers, fill them with the data from the large chunk, then store the buffer pointers in a smaller chunk of data, which it will then hand back to the mesh. The large chunk of data can now be freed. It is this smaller chunk of data that the mesh passes to the Material::Render() method.

Along with the Material::Import() method, there is a Material::DeleteImportedData() method that is used to clean up the mesh data (for example, when loading a different level where this mesh is not used). This takes the same chunk of data as passed to Material::Render(), releases the vertex and index buffers, and frees the remaining memory.

Material::Export() also takes a number of miscellaneous mesh properties, such as overall shininess, overall diffuse color, bump map bumpiness factor, etc. These are frequently passed in either as an array of floats or Vec4s — the array is always a standard size, and data is at a standard offset within that array — or as a structure or class⁵. Again, this data is of a standard format across all Materials and can be passed to any Material::Export() call, and it will be interpreted correctly. The values that are used by the Material will also be stored in the big chunk of data that Material::Export() returns.

When created, Materials will bolt together and compile any shaders or render state blocks that they need. In our engine, this step has been separated out into a "validation" stage — the theory being that you can hang onto a shader and use Material::Validate (create shaders) and Material::Invalidate (free shaders) at will. This just lets us do some more memory management without worrying about reference counting and other annoying bookkeeping, but it is purely a style thing.

5 Although a structure or class is more sophisticated and makes for more readable code, in practice we use an array of Vec4s, since it can be indexed with a number rather than a name.



Dynamic Materials

A dynamic Material is essentially the same as a static Material except that its mesh data changes at run time. These are used for rendering items such as heads-up displays, scores, text fonts, particle systems, diagrammatic items like arrows and aiming reticles, incidental effects such as tracer trails, and decal effects such as bullet holes and footprints.

It is worth noting that the difference between static and dynamic Materials is fairly subtle on platforms such as the GameCube and PCs without hardware transform and lighting or animation capability. In these cases, the CPU performs the role of a GPU vertex unit and changes mesh data at run time. The distinction is whether the data is generated by relatively higher-level shared code (either cross-platform or cross-Material), as in the case of font and particle drawing, or if the routines are low level and specific to the Material, which is the case for animation and lighting. Essentially, a dynamic Material exposes its mesh format outside the Material, while a static Material does not.

To allow this, these meshes have a second Material::Render() method that takes pointers to various mesh structures — vertices and indices and so on. This interface can simply take the same arguments as the Material::Export() call, but using a wide format such as a FatVertex is usually inefficient. The usual way to deal with this is to have part of the MaterialDescriptor define the input data format. For static Materials, this entry is left as "none," but for dynamic Materials, this is explicitly and precisely defined.
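A hypothetical shape for that second entry point, with the vertex layout pinned down by the descriptor rather than by FatVertex; the ParticleVertex layout and the class name are assumptions for illustration, and Material and Matrix43 are the types from earlier in the article.

#include <cstdint>

// Assumed fixed input format declared by this Material's descriptor.
struct ParticleVertex
{
    float         pos[3];
    std::uint32_t colour;   // packed ARGB
    float         uv[2];
};

class MaterialParticles /* : public Material */
{
public:
    // Second, dynamic Render() entry point: the caller builds the vertex and index
    // data every frame, already in the exact format this Material expects.
    void Render(const ParticleVertex *pVerts, int nVerts,
                const std::uint16_t *pIndices, int nIndices,
                const Matrix43 &orientation);
};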

Some Materials may have a common input format that can be driven by the same code on all target platforms, with a small bit of run-time manipulation to get it into a format usable by the hardware. A common example is the status display — it usually consists of a small number of elements with simple render techniques and is identical across all platforms. To keep code simple, a common format can be used, meaning that code written on one platform automatically works on the others.

Other cases may have very target-specific formats and be used only in code used on that platform; a common example is the particle system code, which although dynamic needs to be very fast. The common code for particle systems is usually at quite a high level to take advantage of hardware and CPU quirks, and the actual use of the Material::Render() method is buried deep in platform-specific code.

Textures

So far, the discussion of Materials has avoided textures in detail. One way to integrate them is to treat them like any other mesh data. They are fed in some standard form (for example, a linear 32-bit ARGB array) to Material::Export(), and the export code manipulates and combines the data to produce a given number of hardware textures in various formats. For example, a diffuse texture and an opacity texture (a grayscale map where white=opaque and black=transparent) would be fed in, and in most cases the Material would combine them into a single ARGB texture with the opacity map in the alpha channel. This texture then gets put into the big chunk of data returned by Material::Export(), and at run time the actual textures are created.

In practice this is not sensible. A lot of source textures are shared between many different meshes (for example, tarmac road textures, brickwork, tree and bush leaves, and so on), and this naïve export method removes the ability to share the hardware versions of the textures because each Material is by design isolated from other Materials. Removing the sharing uses a lot of memory and slows everything down.

One way to avoid this is to keep the texture export process separate from Materials. So if the artist uses the same image in two places, the same hardware texture is used, and both Material renderers must be able to use that hardware texture.

For simple Materials, this can work. But for more complex shaders, it is useless. A good example is bump mapping. Typically, artists will provide a diffuse (unlit) texture and a grayscale heightfield bump map. There are three different styles of rendering.

The simplest is no bump map. The bump map is prelit at export time from some "standard" lighting direction (usually above the object), and this lighting is combined with the unlit diffuse texture to produce a single prelit diffuse texture.

The next method is emboss bump mapping. In the most common implementation, the bump map height is put in the alpha channel of a texture, and the unlit diffuse texture is placed in the RGB channels.

The third method is normal-map bump mapping. In this, the bump map heightfield is processed to find the gradients of the heightfield, and those are used to produce a normal map — a map of vectors where the XYZ of each normal is held in the RGB channels of a texture. The unlit diffuse texture is put into a second texture's RGB channels. The alpha channels of both maps are unused.

These three methods all require the source textures to be processed in some way, and two of them require that the two source images be combined so that, for example, using the same bump map with a different diffuse texture would require a different hardware texture.

To solve these conflicting requirements — wanting the Material to decide how to process its textures and yet letting textures be shared between otherwise unrelated Materials — Mucky Foot uses a class called a TextureSource. This class only exists on the exporter side of things, not on the target platforms, and each TextureSource describes a hardware texture. It does this not by storing pixels directly but by storing the processing steps that a Material has applied to a number of source images (TGAs, JPGs, etc.) to obtain the final hardware texture. All TextureSources share a common base class:

class TextureSource
{
private:
    TextureSource();
    virtual ~TextureSource();
public:
    virtual const Image *GenerateImage ( void )=0;
    virtual String GenerateName ( void )=0;
};

Some TextureSources directly describe images stored on disk:

class TextureSourceTGA : public TextureSource
{
public:
    TextureSourceTGA ( String sFilename );
};

Others describe an image in terms of an operation on other images:

class TextureSourceNormalMap : public TextureSource
{
private:
    TextureSource *ptsSource;
public:
    TextureSourceNormalMap ( TextureSource *ptsSource );
};

class TextureSourceAlphaColourCombine : public TextureSource
{
private:
    TextureSource *ptsAlphaSource, *ptsColourSource;
public:
    TextureSourceAlphaColourCombine ( TextureSource *ptsAlphaSource,
                                      TextureSource *ptsColourSource );
};

These classes are created like so:

TextureSource *ptsBumpmap = new TextureSourceTGA ( "bumpy.tga" );
TextureSource *ptsDiffuse = new TextureSourceTGA ( "colours.tga" );
TextureSource *ptsNormal  = new TextureSourceNormalMap ( ptsBumpmap );
TextureSource *ptsEmboss  = new TextureSourceAlphaColourCombine ( ptsBumpmap, ptsDiffuse );

TextureSources may be chained together indefinitely, sometimes producing chains like this:

TextureSource *ptsNormal = new TextureSourceNormalMap (
    new TextureSourceChangeContrast ( 2.0f,
    new TextureSourceInvert (
    new TextureSourceToGrayscale (
    new TextureSourceTGA ( "brick.tga" )))));

The TextureSource::GenerateImage() method returns an image (which is just a raw 32-bit ARGB linear format with a width and height) that is the actual result of the TextureSource. This image is cached so that GenerateImage() can be called multiple times without redoing all the image generation work, which is why the result is a const Image*. For the TextureSourceTGA class, GenerateImage() simply loads the TGA off disk and writes the data into its cached Image. For the TextureSourceNormalMap class, GenerateImage() first calls ptsSource->GenerateImage() and processes the returned Image as a heightfield to create its own Image — a normal map. Similarly, TextureSourceAlphaColourCombine calls ptsAlphaSource->GenerateImage() and ptsColourSource->GenerateImage(), takes the alpha channel from the first and the color channels from the second, and combines them into its own ARGB Image, which it returns.

So using these, the Material::Export() call is passed a set of TextureSources — almost always just TextureSourceTGA classes referring to raw texture artwork used on the mesh. The Material then processes these as it wants by creating new TextureSources, using the passed-in TextureSources as arguments to the constructors. These new processed TextureSources are then returned in a list by Material::Export() to tell the mesh what texture data the Material::Render() call is going to require. The exact number of TextureSources returned is up to the Material::Export() call, and the order they are returned in must match the order in which they are passed to the Material::Render() call.

Each of the TextureSources returned by Material::Export() then gets the GenerateImage() method called on it, and the resulting Image is processed into a hardware texture format and exported to the final target platform. As with the large contiguous chunk of data returned by Material::Export(), the main engine code does not know or care what data is inside those TextureSources.

Except in one respect. The main engine would like to know if that same data is used by any other meshes, so it can generate and load only one texture and pass it to all the meshes that need it. The way we do this is by calling TextureSource::GenerateName(), which returns a text string that describes the chain of TextureSource calls, usually something like "NormMap(Contrast(2.0,Invert(Grey(TGA("brick.tga")))))". The strings for every exported texture in the world are kept; if two strings match, the outputs must be the same, and only one texture is created. Another way to do this would be to hash the contents and formats of hardware textures and compare the hashes. If the hashes match, a closer check is made on the individual texels; if they match, then the two can be merged into a single texture. This involves a lot more data checking and is slower but more aggressive. For example, "NormMap(Contrast(2.0,Grey(Invert(TGA("brick.tga")))))" describes exactly the same data, but it is a different string. It may be worth doing this aggressive check every now and again.
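The name-based sharing might be implemented roughly as sketched below. Only GenerateName() and GenerateImage() are from the article; the free helper functions, the map container, the ConvertImageToTexture() helper, and the use of std::string in place of the engine's String type are all assumptions.

#include <map>
#include <string>

// Sketch: build the name strings recursively so identical processing chains
// produce identical strings and therefore share one hardware texture.
std::string SketchNameTGA(const std::string &sFilename)
{
    return "TGA(\"" + sFilename + "\")";
}

std::string SketchNameNormalMap(const std::string &sSourceName)
{
    return "NormMap(" + sSourceName + ")";
}

// Export side: one hardware texture per distinct chain name.
std::map<std::string, Texture *> g_TexturesByName;               // assumed container

Texture *FindOrExportTexture(TextureSource *pts)
{
    const std::string name = pts->GenerateName();                // treating String as a std::string here
    std::map<std::string, Texture *>::iterator it = g_TexturesByName.find(name);
    if (it != g_TexturesByName.end())
        return it->second;                                       // same chain already exported

    Texture *pTex = ConvertImageToTexture(pts->GenerateImage()); // assumed conversion helper
    g_TexturesByName[name] = pTex;
    return pTex;
}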

Note that this addresses the same problem as multiple Materials producing identical mesh data, but from the other end. The reasons for this are pragmatic — there is a lot more reuse of texture data than there is of mesh data; the operations performed on texture data are more tightly defined and shared by many Materials; and the time taken to produce texture images is far longer than for mesh data, so an early duplication check is far more important.

So two calls have changed now that textures have been added.

virtual void *Material::Export ( size_t &SizeOfDataInBytes,
                                 std::list<TextureSource*> &lptsOutputTextures,
                                 TextureSource *ptsInputTextures[],
                                 const FatVertex *pVertices,
                                 const u32 *pIndices,
                                 int iNumVerts, int iNumTris )=0;

virtual void Material::Render ( const void *pExportedData,
                                Texture *ptexTextures[],
                                const Matrix43 &orientation )=0;

Note that the array ptsInputTextures[] is always a fixed size, and the indexes are defined using a global enum or similar so that, for example, ptsInputTextures[0] is always the diffuse texture (if any exists), while ptsInputTextures[5] is always the bump map texture (if any exists).

However, the same is not true of ptexTextures[]; this array has the same number of elements as the returned list lptsOutputTextures, and the two have a one-to-one correlation (each TextureSource gets GenerateImage() called to get its Image, which is then turned into a platform-specific hardware Texture). Note that the Material::Render() call does not need to know how large ptexTextures[] is — it already knows because of the sort of Material it is.

Animation

When rendering an animated mesh, the animated skeleton is passed as yet another argument to the Material::Render() call in a standard form, and the Material deals with all the details of rendering with that set of bones. Similar processing can be performed by Material::Export() on the skeleton of the mesh, but it is usually not necessary, and it simplifies the code if there is a single shared format for skeleton and animation data.

Whether a mesh is animated or not is also a flag in the MaterialDescriptor, since animated meshes require a different sort of vertex processing and lighting pipeline to non-animated meshes.

Lighting

Lighting for a mesh must usually be generated at run time (with the exception of radiosity-style lightmaps). This causes problems because as the mesh moves around or the environment changes with gameplay, different numbers and types of lighting (spot, directional, etc.) will affect the mesh. One solution is for each Material to always use a fixed number of lights of each type. For example, a certain Material will always use two directional lights — no more and no less. This is acceptable for some situations, but if there are infrequent cases where more lights would give a better result, it would be useful to spend a small amount of time to do this. In the case where there is only one light affecting an object, it is a waste of performance to always use two (and set the second light to black, for example).

One solution is for each mesh to create an array of Materials that are all the same, except each is capable of doing a different number of lights and/or combination of lights. However, this rapidly produces a huge number of combinations (e.g., up to four lights of three types requires an array of 256 materials⁶). Although many of these may reference the same material, a pointer to the material and the chunk of data it requires has added 4 KB to every mesh and a potentially massive amount of exported mesh data. The other disadvantage is that the mesh has no knowledge of whether certain combinations are easy to reduce to simpler cases because it does not know anything about the specific platform capabilities.

A good example case is when using a relatively complex function, such as Spherical Harmonics⁷ (SH), to perform vertex lighting but standard dot3 to do per-pixel lighting. This case assumes all lights are directional for ease of illustration. Typically only zero, one, or two lights shining on an object will have dot3 lighting (usually the brightest), and the others will be done using SH.

For a mesh without a bump map or on a platform with no bump map support, there is no per-pixel lighting at all, so all lights are encoded in the SH lighting, and the same Material can be used for any number of lights.

If the mesh has a bump map, it will use a different Material. On DX7-class hardware, the Material will want to apply the single brightest light as a bump-mapped light and encode the rest into SH. So there are two cases — no lights (or only ambient lights) and one or more lights.

On DX8-class hardware with pixel shaders, it is cheap and effective to bump map two lights. So now there are three cases — zero, one, or two bump-mapped lights. Additionally, there may be a threshold where turning off the second bump-mapped light gives a speed increase at very little loss of quality, and this threshold will be controlled by some combination of light brightness and mesh Level of Detail. The point is that this judgment is very specific to the rendering method used.

The mesh itself doesn't really want to have to deal with this sort of complexity every frame to decide which Material to use. It may end up doing a lot of work, only to have the particular Material not use the results at all (in the above case of hardware with no bump map support). The solution we found was to allow the Material itself to look at the lighting context and make judgment calls internally. This may mean switching vertex or pixel processing pipelines according to the number and type of lights, but that fits within what Materials are allowed to do.

The mesh sets up the lighting system with pertinent instance-specific information, such as its position, size, and current animation state, and then it is the Material that asks how many and what sort of lights are affecting the mesh. It then uses that information to decide which shaders to use, puts the light data into the correct shader constants, and renders the mesh.
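Inside a bump-mapping Material's Render(), that judgment might be sketched like this; LightContext, g_Lighting, the MaterialLitBumped class name, and the private helper functions are all assumptions, while the zero/one/two dot3-light split follows the DX8-class example above.

// Sketch of a DX8-class bump-mapped Material choosing its own lighting path.
void MaterialLitBumped::Render(const void *pExportedData, const Matrix43 &orientation)
{
    // The mesh has already registered its position/size/animation state with the
    // lighting system; now ask which lights actually matter for this instance.
    LightContext lc = g_Lighting.QueryLights(orientation);     // assumed API

    int dot3Lights = lc.NumBrightDirectionalLights();          // assumed query
    if (dot3Lights > 2)
        dot3Lights = 2;                                        // the per-pixel path handles at most two

    // Everything else is folded into the per-vertex SH terms.
    SelectShadersFor(dot3Lights);          // assumed helper: picks the vertex/pixel shader pair
    UploadLightConstants(lc, dot3Lights);  // assumed helper: brightest lights to dot3, rest to SH
    DrawExportedChunk(pExportedData);      // assumed helper: issues the actual draw calls
}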

Another advantage is that optimizing lighting selection code in a single Material affects all meshes that use that Material; there is no need to change mesh code in multiple places to take advantage of this. This means the routines used can be extremely specific to that single Material. This is important if it is a frequently used Material — hand-tuning that one case may give useful speed or quality increases. Again, these improvements can be made late in the project, and since the changes are localized, the chance of adding unnoticed bugs in obscure cases is reduced.

6 Counting a disabled light as a fourth "type," and without wasting CPU effort for each mesh at run time to reorder the incident lights, this requires 4*4*4*4 = 256 combinations.

7 SH lighting is roughly equivalent to sampling a low-res cubic environment map but done in the vertex shader and using cunning math rather than textures. Many more details can be found by Googling for "Spherical Harmonic Irradiance," but the important point here is that it captures the environment lighting well, and is the same cost no matter how many lights are in that environment.

Batching and Sorting<br />


It has always been the case that some type of sorting of draw order is beneficial.<br />

The obvious example is sorting all alpha-blended objects, so they are drawn after<br />

all opaque objects, and then drawing them from back to front. Also helpful is sorting<br />

opaque objects by texture and/or shader. As shaders (both pixel and vertex)<br />

become larger, the benefit from this sorting grows. Also helpful is to sort opaque<br />

objects in a very rough front-to-back order, since this allows the Z-buffer to reject<br />

as many pixels as possible without shading them or writing them to the frame<br />

buffer.<br />

Another sorting order is needed when using various forms of shadow buffer<br />

or reflection rendering. To reduce the amount of video memory needed, the usual<br />

method is to use only one or two render target textures, render the required<br />

information (shadow or reflection) to them, then render all the materials that use<br />

these, and repeat with the next shadow or reflection until the scene is complete.<br />

This requires the rendering to be sorted by which render target it uses. On<br />

lower-end hardware or in the distance, this type of rendering cannot be done, and<br />

a generic prerendered environment map or “blob shadow” texture will be used<br />

instead. No special sorting is required here; indeed, this type of sorting can slow<br />

down the rendering unnecessarily.<br />

These examples illustrate that the criteria for sorting the drawing order of<br />

objects is yet again determined directly by the Material and not by the object<br />

itself. The way to do this is to change the relevant Material::Render() method so<br />

that it just wraps up the inputs to the call (textures, orientation, etc.) into a convenient<br />

data structure and adds it to a list. At the end of this phase, the lists are<br />

sorted according to the requirements of the Material(s) and replayed in order —<br />

this time actually rendering the data to the screen. Although this storage and traversal<br />

of lists takes CPU cycles and memory bandwidth, it is usually a savings<br />

overall because the expensive states of the graphics pipeline (texture, shader,<br />

etc.) are changed less often than when drawn in an arbitrary order.<br />
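A minimal sketch of this deferred submission, using hypothetical RenderItem and RenderQueue types, might look like the following; each Material::Render() call would simply build a RenderItem and add it to the queue, which is sorted and replayed at the end of the phase:

    #include <algorithm>
    #include <vector>

    // Everything needed to actually draw the object later.
    struct RenderItem
    {
        const void* textures;   // textures, shader constants, ...
        const void* mesh;       // geometry to draw
        float       sortKey;    // depth, texture ID, etc., as the Material wishes
    };

    struct RenderQueue
    {
        std::vector<RenderItem> items;

        void Add(const RenderItem& item) { items.push_back(item); }

        // Each Material supplies its own comparison (back-to-front, by texture, ...).
        template <class Compare>
        void FlushSorted(Compare cmp, void (*draw)(const RenderItem&))
        {
            std::sort(items.begin(), items.end(), cmp);
            for (size_t i = 0; i < items.size(); ++i)
                draw(items[i]);
            items.clear();
        }
    };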

Where this batching and sorting turns out not to be a savings in practice, it is<br />

easy to leave those Material::Render() methods doing actual immediate rendering.<br />

This is usually only true for a few special cases, such as rendering fonts and<br />

particle systems, both of which are typically batched well at a higher level.<br />

Conclusion

Abstracting shaders and referencing them by desired rendering style rather than

by actual rendering style allows excellent scalability for multiple platforms and<br />

multiple PC graphics cards without authoring multiple versions of artwork or sacrificing<br />

quality on the high end or speed and compatibility on the low end.



Using encapsulation or data hiding allows the implementation of individual<br />

Materials to be hidden from the rest of the engine and from each other, increasing<br />

code robustness and adaptability and allowing programmers to focus their optimization<br />

efforts on only the most common cases.<br />

While the changes to a traditional game engine are major, once made, the<br />

system is robust, understandable, and flexible, again using the principles of data<br />

hiding and letting each Material decide what its inputs are going to be and what<br />

rendering schemes it will use, rather than forcing everything into the same<br />

system.


Post-Process Fun with Effects Buffers

Tom Forsyth

Hardware is now becoming powerful enough that framebuffer post-processing<br />

effects can supplement pure polygon rendering. These effects treat the world not<br />

as geometric shapes but as an image to manipulate.<br />

Previous Work

The most common current example is depth of field blur. Examples include

samples from many graphics card manufacturers and the “Depth Of Field”<br />

DirectX 9 sample 1 and in games such as Splinter Cell 2 and others. The framebuffer

is successively blurred to another surface 3 , then parts of that blurred version<br />

are blended back onto the framebuffer to simulate parts that are out of focus.<br />

Another common example is heat haze or distortion, as in Jak and Daxter 4 or<br />

Metroid Prime 5 . Here, rendered objects do not directly change the color of the<br />

framebuffer; they move pixels in the framebuffer around — either by only a few<br />

pixels to cause a heat-haze effect or by large amounts of the screen to give a<br />

“raindrops on glass” effect.<br />

Overview

The idea behind this article is to generalize many of these effects into a unified

framework where multiple effects can be added, tried out, and combined at run<br />

time without replicating shared code, while keeping speed and memory use optimal

when only a few of the effects are visible.<br />

Multiple back buffers are created, all of which are texture render targets. The<br />

“main” buffer is the size of the screen and has the standard RGB scene rendered<br />

to it. The other buffers are called effects buffers. Various objects and particle systems<br />

render to them instead of (or in addition to) rendering to the main buffer.<br />

1 DepthOfField sample demo, DirectX 9 SDK, available from http://msdn.microsoft.com/directx

2 Tom Clancy’s Splinter Cell by Ubisoft — full of frame post-processing features, notably the Xbox version<br />

(http://www.splintercell.com/)<br />

3 In practice many implementations blur the image inside the pixel shader and use the result immediately,<br />

rather than rendering the blurred version to a separate surface, but the principle is the same.<br />

4 Jak and Daxter by Naughty Dog (http://www.naughtydog.com/)<br />

5 Metroid Prime by Retro Studios (http://www.metroidprime.com/)


Once the main scene and the effects buffers are rendered, they are all combined<br />

together using various texture-processing passes and rendered to the real<br />

back buffer, which is then presented. The values in the effects buffers are not usually<br />

colors, but they determine how much of a particular effect is done to the main<br />

buffer. For example, a high value in the “blur” effects buffer makes the main<br />

buffer very blurry at that pixel, while a low value leaves it sharp and unfiltered.<br />

The Z-buffer is shared between all buffers. Usually, to make sure effects are<br />

occluded properly, the main buffer scene is rendered with standard Z-buffer settings,<br />

and then the effects buffers are rendered to with Z-tests turned on but<br />

Z-writes turned off. This ensures that the effects are properly occluded by solid<br />

objects so that effects such as a heat-haze hidden behind a solid wall do not affect<br />

the wall itself. Not writing to the Z-buffer means that effects do not sort perfectly<br />

between themselves, but in practice most effects renders use additive blending,<br />

which is commutative, or the incorrect sorting is hard to see, or the objects can<br />

be rendered in back-to-front order to fix the problem.<br />

Rendering Structure<br />


Traditional rendering engines go through each object in the list of objects in the<br />

world and compare them against the viewing frustum. If they are visible, they are<br />

rendered immediately.<br />

This system is rather different because the effects buffers must be rendered<br />

after the main buffer so that they respect the Z-buffer information, and preferably<br />

all the objects for a single effect should be rendered together to reduce the number<br />

of render target and state changes.<br />

To do this, each effect has a list of objects (meshes, particle systems, etc.)<br />

that produce or influence that effect. The lists are cleared at the start of each<br />

frame. The order of operations then becomes:

• Search the list of objects in the world for those visible.
• For each visible object, for each effect it uses, add it to that effect's list.
• Optionally sort each effect's list.
• For each effect, change to its rendering buffer and render all the objects in its list.
• Finally, combine all rendering buffers together to make the final image.
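Expressed as code, the frame loop above might be sketched as follows (Effect, WorldObject, and the helper functions are hypothetical placeholders, not an actual API):

    #include <vector>

    struct WorldObject;                       // mesh, particle system, ...

    struct Effect
    {
        std::vector<WorldObject*> objects;    // cleared every frame
        void SetAsRenderTarget();             // bind its buffer/channels
        void Sort();                          // optional, effect-specific order
        void RenderList();                    // calls back into each object
    };

    bool IsVisible(const WorldObject* obj);                             // frustum test
    void AddToEffectLists(WorldObject* obj, std::vector<Effect*>& fx);  // per-effect IDs/callbacks
    void CompositeBuffers(std::vector<Effect*>& fx);                    // final combination pass

    void RenderFrame(std::vector<WorldObject*>& world, std::vector<Effect*>& fx)
    {
        for (size_t i = 0; i < fx.size(); ++i)        // clear the per-effect lists
            fx[i]->objects.clear();

        for (size_t i = 0; i < world.size(); ++i)     // visibility + list building
            if (IsVisible(world[i]))
                AddToEffectLists(world[i], fx);

        // The main buffer is assumed to be rendered first with standard Z settings;
        // each effect then renders with Z-test on and Z-writes off.
        for (size_t i = 0; i < fx.size(); ++i)
        {
            fx[i]->Sort();
            fx[i]->SetAsRenderTarget();
            fx[i]->RenderList();
        }

        CompositeBuffers(fx);
    }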

Of course, some sort of hierarchy or volume-query device is used for efficiency<br />

instead of checking every object in the world against the frustum. Many engines<br />

already have a lot of this structure in place for other reasons, but the use of<br />

effects buffers makes it even more integral to the rendering process.<br />

Note that each object can contribute to multiple effects channels. For example,<br />

a flame particle system is partially rendered to the main buffer as visible<br />

flames, but also rendered to both the blur and distortion buffers to get the heat-haze

effect. For this reason, when an object adds itself to an effect’s list, it uses<br />

some form of ID or a unique callback address, so when it is later called for rendering,<br />

it knows which effect style to render.



Dynamic Allocation<br />

Conceptually, each effect has an independent buffer to which it renders. This<br />

buffer may be anywhere from one to four channels in size. For example, a single-channel<br />

effect, such as blur, could be allocated a render target with an A8 or<br />

L8 format. However, in practice all render targets are 32-bit ARGB buffers, and<br />

the four channels are shared out dynamically between any effects passes that are<br />

currently active (active means that there is something on-screen that produces<br />

this effect).<br />

Allocating dynamically allows the minimum number of render targets to be<br />

used so that if no objects of a particular effect type are in the visible frustum, no<br />

rendering is performed for them and the number of buffers used can be reduced.<br />

This does complicate the writing of shaders to some extent, but the advantage is<br />

that multiple effects may be scattered around the environment at whim (or even<br />

subject to the player’s actions — moving objects about and so on) with near-optimal<br />

rendering speed at all times.<br />

To partially simplify matters, effects that require more than a single channel<br />

are always allocated the same color channels — usually RG, RGB, or ARGB,<br />

according to their design. These are allocated first. After this initial phase, any<br />

single-channel effects are allocated from the remaining free channels of the render<br />

targets. It is usually relatively simple to allow single-channel effects to change<br />

channels at run time using write masks and swizzling. Where exceptions exist,<br />

they can be restricted to certain channels during allocation, though it may lead to<br />

inefficient use of memory. Because there are typically many more single-channel<br />

effects than multi-channel effects, this order of allocation works well.<br />

As mentioned, render targets could be allocated one-per-effect of the correct<br />

size. However, support for these buffer types is more limited, especially as render<br />

targets. Since the target is frequently PS 1.1-style hardware, the number of independent<br />

texture reads for the final combining pass is limited and has an effect on<br />

speed. Using only two texture reads instead of three or four, even if the number of<br />

bytes read is the same, usually has speed benefits from better texture cache use<br />

and allowing more parallelism. In addition, the SetRenderTarget call in <strong>DirectX</strong> is<br />

notoriously slow on some graphics cards (though slowly improving over time),<br />

and reducing the number of these calls is a big speed boost. Changing the write<br />

mask (D3DRS_COLORWRITEENABLE) is usually much faster than changing<br />

the render target.<br />
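A sketch of how a single-channel effect might be routed into its dynamically assigned channel follows; the ChannelSlot structure is a hypothetical output of the allocator, and only the write mask changes between effects that share a render target:

    #include <d3d9.h>

    // Hypothetical channel assignment produced by the dynamic allocator.
    struct ChannelSlot
    {
        IDirect3DSurface9* target;     // shared 32-bit ARGB render target
        DWORD              writeMask;  // e.g., D3DCOLORWRITEENABLE_GREEN
    };

    // Render one single-channel effect into its assigned channel. Changing the
    // write mask is much cheaper than a SetRenderTarget call, so effects that
    // share a target only pay for a single target change between them.
    void RenderSingleChannelEffect(IDirect3DDevice9* dev,
                                   const ChannelSlot& slot,
                                   void (*renderObjects)(IDirect3DDevice9*))
    {
        dev->SetRenderTarget(0, slot.target);   // in practice only when it changes
        dev->SetRenderState(D3DRS_COLORWRITEENABLE, slot.writeMask);

        renderObjects(dev);                     // draw everything in this effect's list

        // Restore all channels for whoever renders next.
        dev->SetRenderState(D3DRS_COLORWRITEENABLE,
                            D3DCOLORWRITEENABLE_RED | D3DCOLORWRITEENABLE_GREEN |
                            D3DCOLORWRITEENABLE_BLUE | D3DCOLORWRITEENABLE_ALPHA);
    }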

One case where the allocation scheme needs to be modified slightly is feedback<br />

effects in channels. If the channel is not cleared between frames, dynamic<br />

allocation needs to be modified so that the same channel and target are used each<br />

time. This is easily done in code, though it can lead to inefficient allocation in<br />

some cases. If the feedback has a maximum number of frames that it will persist<br />

for before fading away, the channel can be turned off that many frames after the<br />

last object that is using that effect has moved out of the frustum.


Alpha-blended Objects<br />

Traditionally, all solid objects need to be rendered first in rough front-to-back<br />

order. Then all alpha-blended objects must be rendered in back-to-front order,<br />

usually without writing to the Z-buffer. This partitioning into two phases is sometimes<br />

ugly and hacked together. The effect-buffer rendering scheme introduces<br />

the concept of multiple passes and channels as a first-class feature, which means<br />

it can be used to do this partitioning with a lot more elegance.<br />

The two passes (opaque and alpha-blended) are made into separate “effects,”<br />

but both allocated the same RGB channel to render to. Since the opaque pass is<br />

always rendered before all other effects passes and nearly all effects passes will<br />

use alpha-blending and not write to the Z-buffer, this handles the alpha-blended<br />

parts of the scene automatically. Additionally, since all objects are added to their<br />

respective effect buffer in a list before any are rendered, it is simple to insert a<br />

sorting phase on each effect’s list. This can sort strictly back-to-front for the<br />

alpha-blended pass and other effects passes that require strict sorting and in<br />

whatever order is optimal for the opaque part — sorted by shader and texture,<br />

then rough front-to-back order, and so on.<br />

Note that many objects render in both passes. The opaque parts of the object<br />

are rendered in the first pass using a high alpha-test value and alpha-blending disabled<br />

and the translucent parts rendered in the second pass using a low alpha-test<br />

value and alpha-blending enabled. This also reduces the need to self-sort complex<br />

self-intersecting objects, such as trees and bushes, which can normally be very<br />

costly in CPU cycles.<br />
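The render states for the two passes might be set up roughly as in the sketch below (the alpha-test thresholds are illustrative and would be tuned per asset):

    #include <d3d9.h>

    // Pass 1: opaque pass. A high alpha-test value keeps only nearly solid texels;
    // blending is off and Z-writes are on.
    void SetupOpaquePass(IDirect3DDevice9* dev)
    {
        dev->SetRenderState(D3DRS_ALPHABLENDENABLE, FALSE);
        dev->SetRenderState(D3DRS_ALPHATESTENABLE,  TRUE);
        dev->SetRenderState(D3DRS_ALPHAREF,         0xC0);               // high threshold
        dev->SetRenderState(D3DRS_ALPHAFUNC,        D3DCMP_GREATEREQUAL);
        dev->SetRenderState(D3DRS_ZWRITEENABLE,     TRUE);
    }

    // Pass 2: translucent pass. A low alpha-test value still discards fully
    // transparent texels; blending is on, Z-test on, Z-writes off.
    void SetupTranslucentPass(IDirect3DDevice9* dev)
    {
        dev->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
        dev->SetRenderState(D3DRS_SRCBLEND,         D3DBLEND_SRCALPHA);
        dev->SetRenderState(D3DRS_DESTBLEND,        D3DBLEND_INVSRCALPHA);
        dev->SetRenderState(D3DRS_ALPHATESTENABLE,  TRUE);
        dev->SetRenderState(D3DRS_ALPHAREF,         0x08);               // low threshold
        dev->SetRenderState(D3DRS_ALPHAFUNC,        D3DCMP_GREATEREQUAL);
        dev->SetRenderState(D3DRS_ZWRITEENABLE,     FALSE);
    }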

Different Sizes of Effects Buffers<br />


One possible option is to use an effects buffer smaller than the standard back<br />

buffer in order to save fillrate and memory use. For many effects, this reduced<br />

resolution is sufficient; effects such as distortion and blur work perfectly well<br />

when halved in resolution in each direction, and in some cases the softer edges<br />

may be a desired effect. However, this does mean the effects buffer cannot simply<br />

share the existing Z-buffer. Two possibilities exist, depending on platform capabilities<br />

or application requirements.<br />

First, the main buffer is rendered to a full-sized buffer to set up the Z-buffer.<br />

Then a shrink-blit is done from the Z-buffer to the smaller-sized Z-buffer used by<br />

the effect rendering. This shrink does not necessarily have to filter correctly (or<br />

at all), so frequently the hardware can be spoofed into performing this shrink-blit<br />

by pretending the contents are simply pixel or texel data and disabling filtering.<br />

This smaller Z-buffer is then used when rendering effects to reject pixels hidden<br />

by solid parts of the scene. Although this shrink-blit is not reliably or efficiently<br />

possible on PC cards, this method works well on most consoles.<br />

Alternatively, the effect buffers can ignore the standard Z-buffer but set aside<br />

one of the channels where any rendering always writes depth into (8 bits is usually<br />

sufficient for this purpose). When doing the post-process pass to combine the



main buffer and the effects buffers, this 8-bit channel is used to reject effect texels<br />

further away than the main buffer's Z value. For example, under DirectX this

would be done using the texm3x2depth (PS 1.3) or texdepth (PS 1.4) instructions<br />

or by writing to the oDepth register (PS 2.0+). However, this has the problem<br />

that only one depth value can be stored, so a distant effect (behind a solid object)<br />

that is covered by a close effect (in front of the solid object) will still be rendered.<br />

These artifacts may be few or subtle enough to be acceptable for the reduction in<br />

fillrate.<br />

Multiple Render Targets<br />

DX9 exposes the concept of multiple render targets. Up to four render targets can<br />

be written to by a single draw call. This potentially removes the need to render<br />

objects multiple times, once for each effect. However, the savings are only useful<br />

in some situations.<br />

First, most objects are only rendered to one of the buffers. Very few render<br />

the same triangles to multiple buffers, and where they do (for example, the displacement<br />

and blur values are frequently used together), it is easy to ensure that<br />

these channels are placed in the same buffer (RG and alpha channels in this case)<br />

and both rendered at once.<br />

Using MRTs is also dogged by implementation problems. First, all render<br />

targets must be updated; the “texkill” instruction affects them all or none of<br />

them. The channel write masks are always respected, but if two effects need to<br />

use pixel-by-pixel kills on different pixels, they cannot be rendered in the same<br />

pass. Second, alpha-blending with MRTs is not well supported by hardware, which<br />

makes many effects impossible because most require additive blending. PS 3.0<br />

requires that alpha-blending work on all render target formats and when using<br />

MRT, but in PS 2.0, this capability is modified by caps bits and is often missing.<br />

On the other hand, some hardware prefers that four texture targets are bound<br />

as different outputs, and then the color write masks are used to completely turn<br />

some of those targets off for different renders. The alternative is multiple<br />

SetRenderTarget calls, which can be slow, especially if the Z-buffer needs to be<br />

shared between all the renders. At the time of publication, there is little or no<br />

hardware to test on to compare relative speeds of the various techniques, but it is<br />

worth noting the possibilities for the future.<br />
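Under DirectX 9 the relevant device capabilities can at least be checked up front; a small sketch (per-format blending checks via CheckDeviceFormat are omitted):

    #include <d3d9.h>

    // Decide whether the effects buffers can be filled with MRTs in one pass:
    // check the number of simultaneous render targets and whether post-pixel-
    // shader blending is available when MRTs are bound.
    bool CanUseMRTForEffects(IDirect3DDevice9* dev, DWORD targetsNeeded)
    {
        D3DCAPS9 caps;
        if (FAILED(dev->GetDeviceCaps(&caps)))
            return false;

        if (caps.NumSimultaneousRTs < targetsNeeded)
            return false;

        // Most effects rely on additive blending into their channels.
        if (!(caps.PrimitiveMiscCaps & D3DPMISCCAPS_MRTPOSTPIXELSHADERBLENDING))
            return false;

        return true;
    }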

Auto-gen Mipmapping<br />

Many graphics cards can automatically calculate mipmaps of render target textures<br />

with very little speed impact. This is much easier than generating mipmap<br />

levels manually by successive render target changes and shrinking. Although this<br />

is all that most drivers are actually doing internally for auto-gen mipmapping, they<br />

can take advantage of any hardware quirks and the reduced API call overhead. In<br />

some cases, there is specialized hardware that performs the mipmapping operation,<br />

making it virtually free.


Because of this, it is worth looking at using this facility when performing blur<br />

processing. While simply using the lower mipmap levels raw can produce obvious<br />

and objectionable bilinear filtering artifacts, combining them with samples from<br />

larger mipmap levels can remove these artifacts while using fewer passes and/or<br />

samples than the more conventional single-layer filter kernels.<br />
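Under DirectX 9 this facility is requested with the D3DUSAGE_AUTOGENMIPMAP flag when the render target texture is created; a sketch, with most capability and error checking omitted (support should be verified with the D3DCAPS2_CANAUTOGENMIPMAP cap and CheckDeviceFormat):

    #include <d3d9.h>

    // Create a render target texture whose lower mip levels are generated
    // automatically whenever the top level changes.
    IDirect3DTexture9* CreateAutoMipRenderTarget(IDirect3DDevice9* dev,
                                                 UINT width, UINT height)
    {
        IDirect3DTexture9* tex = NULL;
        if (FAILED(dev->CreateTexture(width, height,
                                      0,   // runtime builds the chain; only the top level is accessible
                                      D3DUSAGE_RENDERTARGET | D3DUSAGE_AUTOGENMIPMAP,
                                      D3DFMT_A8R8G8B8, D3DPOOL_DEFAULT,
                                      &tex, NULL)))
            return NULL;

        tex->SetAutoGenFilterType(D3DTEXF_LINEAR);
        tex->GenerateMipSubLevels();   // optional hint to regenerate now
        return tex;
    }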

Specific Effects<br />

A few common examples are given here as an illustration, and most are shown in<br />

action in the demo. However, the number of effects is huge, particularly odd<br />

game-specific ones such as magical, supernatural, or alien.<br />

Saturation and Desaturation<br />

One effect channel can control the amount of saturation a pixel receives. Complete<br />

desaturation is otherwise known as converting to grayscale, and it is easy to<br />

create in a pixel shader by doing a dot-product between the main buffer color and<br />

a “grayscale vector,” which is usually something like (0.299, 0.587, 0.114) 6 . When<br />

the result of the dot-product is replicated to the RGB channels, this represents<br />

the fully unsaturated color. Simply interpolating between this and the main buffer<br />

color allows gradual desaturation of the color. It is the interpolation factor used<br />

that is stored in the effect buffer. Interpolating away from this grayscale produces<br />

a more saturated color with more vibrant colors. Note that in PS 1.x, the interpolant<br />

of the “lerp” instruction can be clamped to the range [0-1] before use 7 .<br />

Therefore, if interpolations outside this range are used, it is better to explicitly<br />

use a subtract instruction followed by a multiply-add to do the interpolation.<br />
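For example, the grayscale weights and a default saturation value can be uploaded as pixel shader constants from the application; the pixel shader then performs the dot product and the interpolation. A sketch, with arbitrary register numbers:

    #include <d3d9.h>

    // c0 = grayscale weights (the Y vector of the RGB-to-YIQ conversion),
    // c1 = default saturation when no effect-buffer value is present.
    // In the shader: gray   = dot(mainColor.rgb, c0.rgb)
    //                result = gray + (mainColor - gray) * saturationFromEffectBuffer
    void SetDesaturationConstants(IDirect3DDevice9* dev, float defaultSaturation)
    {
        const float grayWeights[4] = { 0.299f, 0.587f, 0.114f, 0.0f };
        const float saturation[4]  = { defaultSaturation, 0.0f, 0.0f, 0.0f };

        dev->SetPixelShaderConstantF(0, grayWeights, 1);
        dev->SetPixelShaderConstantF(1, saturation, 1);
    }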

Any linear color transformation can be performed using the same technique<br />

— again, both toward and away from the post-transform result as desired. This<br />

can be used to perform “film grading” of images, and using an effect buffer to control<br />

it allows selected areas to be graded differently. For example, the lighting on a<br />

character’s face may be accentuated, while the lighting on the background is<br />

muted to concentrate attention. Desaturation can portray illness or death without<br />

having to directly change the rendering style or the textures used. There are<br />

obvious applications for magical effects using particle systems that “suck the life”<br />

out of the surroundings. Subtle effects like these were used extensively in the<br />

Lord of the Rings: The Fellowship of the Ring 8 .<br />

Blur and Depth of Field<br />


As a single channel effect, this holds a value going from 0 (no blur) to 1 (a heavy<br />

blur). Two classes of the effect can render into this buffer.<br />


6 This vector is the Y vector of the RGB to the YIQ conversion matrix. Depending on whether you are<br />

working in gamma-corrected or linear space, these values may change slightly.<br />

7 Not in all implementations, but in some common ones, so it is worth noting this restriction<br />

8 Lord of the Rings: The Fellowship of the Ring — Special Extended DVD Edition, the section titled “Digital<br />

Grading”



The first is three-dimensional objects that cause blurring of the image<br />

beyond them. Heat haze above hot surfaces and flames produces blur (as well as<br />

the separate distortion effect), and these are usually rendered using particle systems<br />

that add values into this buffer. Frosted or dirty glass can also render<br />

additively into this buffer to blur objects behind them; these are rendered as<br />

objects that are the shape of the glass itself.<br />

The second class is the depth-of-field simulation of a camera lens. A particular<br />

distance from the camera is chosen as the current focal length, and objects in<br />

front of or behind this depth will be blurred the further that they are from the<br />

depth. One way to do this is to render the entire opaque scene a second time,<br />

writing values corresponding to depth information into this buffer. This is done in<br />

many sample applications. The problem is that rendering the scene a second time<br />

is expensive in geometry throughput. One way around this is to ensure that this<br />

channel is the alpha channel of the main buffer and render both depth and color<br />

together. However, this clashes with the fake HDR rendering effect, which would<br />

also like to do this, and in many cases, there is a much simpler method available.<br />

Depth is already rendered to the Z-buffer, and it is useful to be able to use<br />

this information. Some consoles can read the Z-buffer directly as pixel information,<br />

and with cunning scaling or lookup tables, this can be transformed into a<br />

depth-blur value. Alternatively, a simple method that works well is to wait until<br />

the Z-buffer is set up by the opaque pass and then render fullscreen planes at various<br />

distances using Z-testing. Because the depth of field blur is usually fairly subtle,<br />

having only eight or 16 different values (and therefore eight or 16 different<br />

planes) is enough. For objects further than the depth of focus, planes are rendered<br />

using additive blending at successively closer distances and lower blur values<br />

using a less-than Z-test but no Z-writes. Each pass sets a bit in the stencil buffer,<br />

and pixels are only rendered where the stencil buffer is clear. Using the stencil<br />

buffer ensures that each pixel is only shaded once and by the furthest “blur plane”<br />

that is still in front of the solid object at that pixel.<br />

Although this requires rendering many fullscreen planes, this does not usually<br />

consume huge amounts of fillrate, since most cards have very fast Z-buffer

rejection and most of the pixels in these planes will be rejected.<br />

To blur objects closer than the focal depth, the same trick is used but by rendering<br />

planes close to the camera, moving away to the focal plane, and using a<br />

greater-than Z-test. In this way, the objects closest to the camera have the earlier<br />

planes render to the effect buffer and are the most blurred.<br />
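A sketch of the far-side loop follows (the near side is symmetrical, using a greater-than Z-test); DrawFullscreenQuadAtDepth, the plane count, and the depth mapping are illustrative assumptions:

    #include <d3d9.h>

    // Assumed helper: draws a fullscreen quad at the given view-space depth,
    // writing 'blurValue' into the blur channel of the effects buffer.
    void DrawFullscreenQuadAtDepth(IDirect3DDevice9* dev, float depth, float blurValue);

    // After the opaque pass has filled the Z-buffer: lay down blur planes behind
    // the focal depth, from far to near, each with a lower blur value. The stencil
    // buffer ensures every pixel is written only by the furthest plane that is
    // still in front of the solid geometry at that pixel.
    void RenderFarBlurPlanes(IDirect3DDevice9* dev, float focalDepth, float farDepth)
    {
        const int kNumPlanes = 8;                       // 8-16 is usually enough

        dev->SetRenderState(D3DRS_ZWRITEENABLE,     FALSE);
        dev->SetRenderState(D3DRS_ZFUNC,            D3DCMP_LESS);
        dev->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
        dev->SetRenderState(D3DRS_BLENDOP,          D3DBLENDOP_ADD);
        dev->SetRenderState(D3DRS_SRCBLEND,         D3DBLEND_ONE);
        dev->SetRenderState(D3DRS_DESTBLEND,        D3DBLEND_ONE);

        dev->SetRenderState(D3DRS_STENCILENABLE,    TRUE);
        dev->SetRenderState(D3DRS_STENCILFUNC,      D3DCMP_EQUAL);
        dev->SetRenderState(D3DRS_STENCILREF,       0);   // only where still clear
        dev->SetRenderState(D3DRS_STENCILPASS,      D3DSTENCILOP_INCRSAT);
        dev->SetRenderState(D3DRS_STENCILFAIL,      D3DSTENCILOP_KEEP);
        dev->SetRenderState(D3DRS_STENCILZFAIL,     D3DSTENCILOP_KEEP);

        for (int i = 0; i < kNumPlanes; ++i)
        {
            float t     = 1.0f - i / (float)kNumPlanes;  // 1 = furthest plane
            float depth = focalDepth + t * (farDepth - focalDepth);
            float blur  = t;                             // further = blurrier
            DrawFullscreenQuadAtDepth(dev, depth, blur);
        }

        dev->SetRenderState(D3DRS_STENCILENABLE, FALSE);
    }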

The actual blurring can be done in a variety of ways. Some very intricate and<br />

high-quality methods are available; see some of the graphics card manufacturer<br />

demos for examples on various bits of hardware. However, a common one that<br />

works on a wide variety of hardware is to apply a blur filter to the main buffer and<br />

then for each pixel blend between the blurred version and the unblurred main<br />

buffer according to the value in this buffer.<br />

Fake High Dynamic Range Rendering<br />

Any really shiny or inherently bright objects render a low-contrast version of<br />

themselves to this single-channel effect to approximate the “extra brightness”




that they have. This extra brightness cannot be seen directly in the framebuffer<br />

because of the limited range of an integer framebuffer, but the extra is rendered<br />

into this channel.<br />

In the back buffer composition pass, the main buffer is scaled by this effects<br />

channel, then blurred, and then added back onto the main buffer. Dark and normal-brightness<br />

objects write 0 to this channel, so they will be dark in this blurred<br />

version. Bright objects write positive values, and these will be blurred and added<br />

back. This simulates the “bloom” that over-bright images produce in both camera<br />

lenses and the human eye, and this bloom is very effective at conveying the<br />

“extra brightness” that the limited gamut of the monitor and framebuffer cannot<br />

directly convey.<br />

For inherently bright objects such as the sun or light sources like fires or<br />

lamp bulbs, the value written to the effects buffer is a fixed value, depending on<br />

the object’s brightness. For very shiny objects, their environment map is rendered<br />

as both RGB values and this extra brightness. Shiny objects simply modulate<br />

the extra brightness channel by their shininess and write it to the effects<br />

channel. This allows sunlight to bloom off shiny objects, such as chrome or car<br />

paint.<br />

Note that the main buffer must be modulated by the extra brightness before<br />

being blurred; a bright object will produce a light bloom that covers a dark area.<br />

This is different from the depth of field effect, which usually needs to avoid this<br />

effect to look good. Additionally, the size of the blur filter for depth of field is usually<br />

much smaller than for this bloom effect. Therefore, the two blur passes cannot<br />

usually be combined into one.<br />

As well as a general glow effect, the filter used can be all sorts of odd shapes,<br />

notably “star” filters, which blur the image in only a few discrete directions. This<br />

filter was used to extreme effect in Wreckless 9 .<br />

Because so many objects can write to this effect buffer in common scenes<br />

(notably anything even slightly shiny), it is common to put this buffer in the alpha<br />

channel of the main buffer and combine rendering of the two into a single pass.<br />

This can be tricky if the destination alpha channel is used for other rendering<br />

effects, but if these effects are used purely for opaque multi-pass texturing tricks,<br />

it is still possible, as long as the last pass always writes the extra brightness value<br />

to the channel.<br />

Distortion

This effect displaces pixels from the main buffer to the screen with the X and Y

screen offsets stored in two channels. This can be used for a heat-haze shimmer<br />

effect with particle systems, as seen in Jak and Daxter, or with larger distortions<br />

to produce water-droplet-on-glass effects, as seen in Metroid Prime, or with glass<br />

objects to simulate refraction of light through them.<br />

9 Wreckless, also called DOUBLE S.T.E.A.L. Masaki Kawase’s GDC2003 talk on this is available at<br />

http://www.daionet.gr.jp/~masa/.



The two effect channels store vertical and horizontal offset data, and in the<br />

final combiner, they are usually used as inputs to the “texbem” pixel shader<br />

instruction to look up offset data into the main buffer.<br />

There are two problems here. First, texbem in PS 1.1 takes signed values<br />

(where 0x00 represents a value of 0.0), rather than the more usual offset values<br />

(where 0x80 represents a value of 0.0). Fortunately, PS 1.2 and above allow the<br />

_bx2 modifier on texbem that converts offset data to signed data before use, as<br />

when it is used with the dot3 instruction.<br />

The other related problem is that when rendering data to the effect buffer,<br />

ideally an object should be able to use a blend that either decreases or increases<br />

the offset values as necessary so that accumulating offsets would work as<br />

expected, especially for multiple heat-haze particle systems. Some sort of additive<br />

blend would be ideal, but since the data is offset, a standard additive blend can<br />

only ever increase the data, never decrease it. By changing the D3DRS_BLENDOP operation from D3DBLENDOP_ADD to D3DBLENDOP_SUBTRACT, a render

pass can subtract data, but this change cannot be performed every pixel.<br />

Ideally, an “add bias” blend would be used, where buffer=buffer+texture–0.5,<br />

much like the D3DTOP_ADDSIGNED operation in the TextureStageState pipeline<br />

or the pixel-shader equivalent of using the _bias modifier on one of the arguments,<br />

but no such alpha-blend exists.<br />

A simple solution that can work well in some cases is to blend between the<br />

two values using a SRCALPHA:INVSRCALPHA 10 blend rather than adding them.<br />

Although this is dependent on the rendering order of objects (unlike using proper<br />

addition) and is not at all correct (except in a few special cases), it can look convincingly<br />

good and has the advantage of not requiring multiple passes.<br />

A more correct solution is to render twice — once using D3DBLENDOP_ADD with a ONE:ONE blend, clamping negative texture values to zero, and

once using D3DBLENDOP_REVSUBTRACT, inverting the texture and again<br />

clamping negative values to zero. This is the most flexible and accurate but<br />

requires two rendering passes of each object. It can also hit saturation problems.<br />

Within a single object or particle system, the positive offsets may balance the

negative ones to give a net result of no change. However, if the initial value is 0.5<br />

and all the positive offsets are rendered first to give an offset to +0.7, the result<br />

can saturate to 1.0, when in fact it should be 1.2. Then the negative ones are rendered<br />

and cause an offset of –0.7. <strong>With</strong>out saturation, this would give the initial<br />

value of 0.5, but because of saturation, the actual result is 0.3. This is usually not a<br />

very noticeable effect; when distortions of the image become large enough to saturate,<br />

they are typically so large that only a general idea of what they are doing is<br />

visually perceptible, and the errors caused by saturation are hidden.<br />
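As a sketch, the blend states for those two passes might be set up as below; the clamping of negative (or inverted) texture values to zero is assumed to happen in the pixel pipeline and is not shown:

    #include <d3d9.h>

    // Pass A: add the positive part of the offsets.
    void SetupOffsetAddPass(IDirect3DDevice9* dev)
    {
        dev->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
        dev->SetRenderState(D3DRS_BLENDOP,   D3DBLENDOP_ADD);
        dev->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_ONE);
        dev->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_ONE);
        // Pixel pipeline clamps negative texture values to zero for this pass.
    }

    // Pass B: subtract the negative part (texture inverted before use), since
    // REVSUBTRACT computes destination minus source.
    void SetupOffsetSubtractPass(IDirect3DDevice9* dev)
    {
        dev->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
        dev->SetRenderState(D3DRS_BLENDOP,   D3DBLENDOP_REVSUBTRACT);
        dev->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_ONE);
        dev->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_ONE);
        // Again, negative values are clamped to zero before blending.
    }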

Because many objects and special effects render to both the blur and<br />

distortion fields, such as heat-haze and frosted glass, these two channels are frequently<br />

dynamically assigned so that they are in the same render target and<br />

10 I use the convention that an A:B blend means that D3DRS_SRCBLEND=D3DBLEND_A,<br />

D3DRS_DESTBLEND= D3DBLEND_B. This convention is now incomplete because it does not include<br />

the D3DRS_BLENDOP function, which was added in DX7, but it is nevertheless a useful shorthand, which<br />

many understand.


rendered to at the same time. In many cases, it may be worth combining them

properly to make a single, three-channel effect. This means that depth of field<br />

rendering needs to know about the distortion channels and leave them alone, but<br />

this is an easy fix.<br />

Edge Detection<br />



For cartoon-style effects, performing edge detection on various bits of data can<br />

produce some very nice images. Edge detection can be performed purely on final<br />

framebuffer color information. However, the results are hard to predict in some<br />

cases and can pick up strange details and miss others. To be properly effective,<br />

edge detection needs to be performed on more user-defined data. The easiest<br />

way to provide this data is in a separate effects channel. These values are not necessarily<br />

linear values — all that matters is whether two are sufficiently different<br />

to produce an edge or similar enough not to. By simply assigning two polygons a<br />

“different enough” value, an edge is automatically produced. These values can<br />

come from textures, vertex colors, shading, or a combination of all three, which<br />

means that cartoon edges do not need to match geometry edges — a common<br />

problem with some techniques.<br />

G-buffers

Geometry buffers are an interesting extension of having multiple output buffers.

Instead of performing shading while rendering the geometry, the idea is to simply<br />

rasterise and Z-reject the raw geometric data and write that to a variety of buffers.<br />

Data such as depth (which along with screen position produces a world position),<br />

normal, material ID, and surface-local position (otherwise known as texture<br />

co-ordinates) are written to the G-buffers, and all texture compositing, lookups,<br />

and shading are performed once per screen pixel.<br />

The advantage is that shading is performed once and only once on each<br />

screen pixel, allowing much more complex shaders (n) and larger depth complexity<br />

(m) but getting O(n+m) cost rather than O(nm) with traditional methods.<br />

While a true G-buffer has many problems, such as dealing with translucency<br />

and large amounts of temporary storage, some of the concepts can be useful when<br />

designing special effects.<br />

One example of this is the cartoon shader demo produced by ATI 11 that<br />

recorded material ID, depth, and normal and from those fields produced some<br />

impressive images.<br />

Feedback

Sometimes feedback can be very useful in rendering special effects. It can produce

interesting and complex patterns without requiring complex geometry or<br />

multiple rendering passes by using frame-to-frame image coherence. Examples<br />

11 The “table with cheese and wine” demo from Real-Time 3D Scene Postprocessing presented at GDC 2003<br />

— available from http://www.ati.com/developer/




include smoke, fire, fog, and others. In the game Blade2 12 , we used a feedback<br />

effect on sprays of blood from sword or gunshot wounds. A simple low-cost particle<br />

system was used for the blood droplets, but they were rendered to an offscreen<br />

buffer. This buffer was not cleared each frame, just darkened, and while<br />

doing so, it was distorted by rendering it as a mesh to itself. The result was then<br />

blended over the framebuffer at the end of each frame. The visual effect was that<br />

the cheap particle system was given size and persistence and became more like a<br />

stream with volume than a group of particles. Producing the same effect purely<br />

with geometry every frame would have taken roughly ten times the fillrate and<br />

vertex processing.<br />

The crucial thing is that only the blood should be used in feedback, whereas<br />

previous effects, such as motion blur, have performed feedback on the entire<br />

screen, which has limited application. Also note that the good thing about blood is<br />

that it is a single color. We only used a single channel (the destination alpha channel<br />

of the framebuffer, which was otherwise unused) for the intensity of the blood;<br />

when blending it back to the framebuffer, it was tinted red. At the same time, the<br />

intensity was oversaturated so that while the values in the effect buffer faded linearly<br />

to zero with time, the visual effect was not linear; it spent a while at full<br />

brightness followed by a fairly abrupt fade to nothing. This helped increase the<br />

apparent volume of the “stream” of blood. Very tasteful.<br />
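A sketch of the per-frame update for such a feedback channel is shown below; all helper functions are hypothetical, and note that on PC hardware sampling a texture while it is the current render target is undefined, so the darken/distort step usually ping-pongs between two buffers:

    #include <d3d9.h>

    // Hypothetical helpers for this sketch.
    void DrawFeedbackMesh(IDirect3DDevice9* dev, IDirect3DTexture9* src, float fade);
    void DrawBloodParticles(IDirect3DDevice9* dev);
    void RestoreMainRenderTarget(IDirect3DDevice9* dev);
    void CompositeFeedbackOverFrame(IDirect3DDevice9* dev, IDirect3DTexture9* feedback);

    // Per-frame update of a feedback channel: never cleared, only darkened and
    // distorted, then the cheap particle system is added, and the result is
    // tinted and blended over the frame at the end.
    void UpdateFeedbackEffect(IDirect3DDevice9* dev,
                              IDirect3DSurface9* feedbackTarget,
                              IDirect3DTexture9* feedbackTexture)
    {
        dev->SetRenderTarget(0, feedbackTarget);

        // 1. Darken and distort last frame's contents by drawing them back as a
        //    slightly warped, slightly faded mesh.
        DrawFeedbackMesh(dev, feedbackTexture, 0.95f);

        // 2. Add this frame's particles with an additive blend.
        dev->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
        dev->SetRenderState(D3DRS_BLENDOP,   D3DBLENDOP_ADD);
        dev->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_ONE);
        dev->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_ONE);
        DrawBloodParticles(dev);

        // 3. Tint the stored intensity red, oversaturate it, and blend it over
        //    the main buffer.
        RestoreMainRenderTarget(dev);
        CompositeFeedbackOverFrame(dev, feedbackTexture);
    }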

The Demo

The demo shows the use of a couple of these effects. The most important thing

demonstrated is the dynamic allocation of channels in render targets, with each<br />

effect dynamically enabled/disabled according to toggles. In practice, these would<br />

be according to what objects are visible in the viewing frustum, and the intent is<br />

to minimize the average memory required for render targets and the average<br />

fillrate used to render the objects and combine the final image.<br />

The demo also demonstrates the use of the callback system for rendering<br />

objects, the simple way that the previously special case of rendering alpha-blended

objects last now fits easily into the more general framework, and the way<br />

that sorting each effect by different criteria is simple to add.<br />

Conclusion

Many single effects have been previously demonstrated using secondary buffers

or the alpha channel of an ARGB render target. A method has been shown for unifying<br />

the common features of many of these post-processing effects to apply them<br />

together in the same frame, use the minimum amount of run-time memory and<br />

fillrate to do so, and apply them selectively only to parts of the frame that require<br />

them.<br />

12 Blade2 by Mucky Foot, published by Activision on PS2 and Xbox. The technique described here was only<br />

implemented on the Xbox version.


Shaders under Control (Codecreatures Engine)

Oliver Hoeller

Introduction

Shaders can no longer be ignored in modern 3D engine design; they introduce aspects that early engine architectures never had to consider. Starting from the base architecture used in the Codecreatures engine (as seen in the Codecreatures benchmark), I would like to point out some of the issues that matter in complex environments.

The following aspects are described in this article:

• Multiple passes per object to map surface effects with several render passes
• Various shadow effects, including scene rendering used for stencil shadows and dynamic reflections on surfaces (water, mirrors, shiny structures)
• Seamless worlds with large visibility ranges and expansive indoor areas, as well as the management of resources like textures, materials, and meshes

Essential Base Architecture of a Modern 3D Engine<br />

Here is a simplified structural overview of a modular 3D engine.<br />

Figure 1: Modular engine architecture overview<br />

The next sections describe all systems represented in Figure 1.<br />




Subsystems<br />

The lowest level consists of the subsystems: thin wrappers around 3D APIs such as OpenGL or DirectX. Their primary purpose is to abstract the differences between these APIs.

Scene and Resource Management<br />

This area is often underestimated. With complex scenes, it is important to have cache systems (organized, for example, as least-recently-used caches) to handle data-intensive resources. Another underestimated topic is the enormous memory consumption of meshes. To gain flexibility, meshes are stored as streams that do not necessarily match the format the hardware ultimately wants, so it makes sense for these hardware-abstracted meshes to be served through an appropriate cache. Systems that are not visually relevant, such as collision (e.g., collision meshes), can use the same cache systems.

A few aspects of the hardware-bound format of mesh data are worth mentioning. The number of texture coordinate pairs in the flexible vertex format structure plays an important role, and vertex formats that deviate from the "standard" layout of position, normal, colors (diffuse, specular), and texture coordinate pairs should also be considered. Texture coordinate pairs are usually assigned to a specific texture type. For example:

Stage #0: Diffuse texture
Stage #1: Detail texture
Stage #2-n: Lightmap, per-pixel light, bump map, environment maps, etc.

The engine-specific interleaved vertex streams are converted by the resource system into the desired API (DirectX)-compliant vertex format before the render pipeline starts rendering the scene. In most cases, cache-friendly vertex structures are 32 or 64 bytes in size.
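For example, one possible 32-byte interleaved vertex layout under DirectX (illustrative only, not the engine's actual format):

    #include <d3d9.h>

    // One cache-friendly 32-byte vertex: position, normal, and one set of 2D
    // texture coordinates.
    struct EngineVertex
    {
        float x, y, z;      // position          (12 bytes)
        float nx, ny, nz;   // normal            (12 bytes)
        float u, v;         // diffuse texture   ( 8 bytes)
    };                      // total: 32 bytes

    const DWORD kEngineVertexFVF = D3DFVF_XYZ | D3DFVF_NORMAL | D3DFVF_TEX1;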

We use a directed acyclic (scene) graph to organize the scene content<br />

hierarchically. Each object instance in the scene graph can be referenced using a

unique value (such as a string name).<br />

Visibility Determination System<br />

In principle, this system is composed of several stages that perform different optimizations for the view cone of the current camera. It is concerned with (object) culling, HSD (hidden surface determination), and HOD (hidden object determination), as well as an optimal representation of the remaining scene and its visible objects.

I will not describe all of the stages here, but I will point out some important requirements that a modern render pipeline must satisfy. The system should be able to create arbitrary so-called frustum databases. Each database holds the output of all the optimizing stages of our render pipeline, that is, all the objects that remain visible from the view cone (frustum) of the camera in use.


Splitting the scene into databases that contain optimized, camera-adapted data streams is crucial for several reasons, one being that new shader effects often consist of multiple passes (see the section titled "Material System").

Frustum Processor<br />

The function of the frustum processor (FP) is to take the data objects referenced through the resource system, such as meshes and textures, and convert them into an optimal format for the hardware. The optimized meshes are stored either in one large vertex buffer or in several individual vertex buffers, depending on mesh type and size. Grouping objects with similar materials and choosing appropriate sort criteria (textures, shaders, z-depth, etc.) is important because these criteria determine how fast the objects can be rendered by the hardware (bandwidth, texture state/shader state switches, occlusion culling).

Per-object Multipass Technique<br />

A characteristic of the frustum processor is that it can render an object multiple times. How often an object must be rendered is determined by its material, which exports a function returning the required number of passes. The render target used by the frustum processor can be changed transparently by the material system, which provides a suitable render target texture. Combining textures, and managing the intermediate (mix) and frame textures involved, is handled entirely by the material system.
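The contract between the frustum processor and a material might be sketched like this (hypothetical interface names, not the Codecreatures API):

    // The material tells the frustum processor how many passes an object needs
    // and sets up each pass (render target, textures, shaders, blend states)
    // before the object is drawn.
    class IMaterial
    {
    public:
        virtual ~IMaterial() {}
        virtual int  GetNumPasses() const = 0;
        virtual void BeginPass(int pass) = 0;  // may redirect to a render target texture
        virtual void EndPass(int pass) = 0;
    };

    struct SceneObject { IMaterial* material; /* mesh, transform, ... */ };

    void DrawObjectGeometry(SceneObject& obj);  // assumed helper

    void RenderObject(SceneObject& obj)
    {
        const int passes = obj.material->GetNumPasses();
        for (int i = 0; i < passes; ++i)
        {
            obj.material->BeginPass(i);
            DrawObjectGeometry(obj);
            obj.material->EndPass(i);
        }
    }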

Figure 2: A multistage render pipeline with frustum databases

Material System

With help from the material system, effects can be combined over an arbitrary number of texture passes. It relies on the per-object multipass ability of the frustum processor, which is informed of the number of render passes required.

The material system supports the following stages, which can be blended together as needed:

• Illumination stage (lightmap, vertex/pixel light, dynamic shadows)
• Object stage (diffuse and detail texture, etc.)
• Environment stage (environment maps, etc.)

The material system needs the render-to-texture feature of per-object multipass rendering to implement the object stage mentioned above; the technique is used when the hardware cannot mix the object's material in a single pass and the result cannot simply be blended into the frame buffer. The material system also contains the surface or effect shaders, which are edited and used here. With support from the scene multipass technique (described in the following section), it is possible, for example, to implement night-vision goggles or other "whole scene" visual effects in your game; the material system can carry extra data and shaders to achieve them. Another interesting area is material level of detail, which is useful in larger outdoor scenarios: to save rendering time, the material system can substitute simpler shader representations for a material that is mapped onto a distant object.

Per-scene Multipass Technique<br />

This technique, in contrast to the similar per-object multipass technique, enables<br />

rendering the entire scene to a texture surface.

Figure 3: Workflow per scene/object render passes<br />

Here the render-to-texture feature is used to request dynamic camera views of the current scene (optionally filtered by certain criteria and with the remaining objects given a special master material), which are then rendered to a texture and used as a resource. This can be used for dynamic per-pixel illumination, for reflection and refraction effects (e.g., crystal sculptures, polished surfaces, or water), and for real-time updates of cube maps.

An appropriate frustum database that the frustum processor can process must be created first. Infinite recursion must be prevented; this can be done, for example, with a frame-based reference counter, or simply by having the material recognize that it triggered the process itself and disable the feature for that pass.

Current Drawbacks<br />


There are various problems still to solve; a few are mentioned here. The frustum processor can have trouble finding an optimal sort order for the scene objects that must be drawn. We can prepare precompiled optimizations, such as grouping similar materials, to reduce draw calls into the API and minimize render state changes, but unfortunately there is no reasonable universal solution that works well for both indoor and outdoor scenarios.

The per-object and per-scene multipass techniques described above have a serious restriction concerning performance stalls and memory consumption. There is at present no way to render directly to compressed textures, so the render target textures for these multipass techniques must be kept in an uncompressed format.

Additional render target textures with different resolutions should possess<br />

their own depth buffer (for performance reasons). However, these depth buffers<br />

require additional memory on the graphic card; this type of texture resides on the<br />

graphic card and isn’t paged out.<br />

Intelligent management of these textures is necessary because they would otherwise consume a lot of memory on the graphics card. Textures that are kept around for effects that rarely or never change should be created at application start (and stored permanently), or even compressed at run time, although run-time compression can itself cause performance problems.

In addition, hardware stalls are possible if render target textures are needed too early. The graphics hardware does not fill these targets immediately; state changes and draw calls are buffered and executed later, and the lag between setting up a scene and actually drawing it can amount to three or four frames. Accessing such a texture too early (through a Lock() call on a surface, e.g., for compression, or through an immediate use such as SetTexture followed by a draw call right after scene setup) can therefore cause unwanted stalls in the graphics hardware.



Summary<br />


Today's engine architectures differ substantially from earlier designs, which had limited control over the geometry and pixel pipelines and were strongly dependent on the fixed-function path of the graphics API. DirectX 9, or the appropriate OpenGL extensions, makes it possible to intervene in geometry setup as well as in the rasterizing pipeline. As described above, modern engine design has to consider many things in order to take advantage of the available API features.

Outlook

The development of shaders, their flexibility, and their possibilities is still in its early stages. Shader programs will become

more extensive and more flexible in the future, the number of pixel pipelines will<br />

increase, and graphics hardware will be able to process more texture stages per<br />

pass.<br />

The goal is to equip environments with better materials so that they look more realistic and match the designer's vision as closely as possible. Improved HOD algorithms and refined shadow techniques are also conceivable.

Some basic features that have so far been neglected should also be implemented. These include:

• Rendering to compressed textures
• Status queries for rendering processes such as render-to-texture (e.g., IsBusy, etc.)


Shader Integration in the Gamebryo Graphics Engine

Scott Sherman, Dan Amerson, Shaun Kime, and Tim Preston

Introduction

As can be seen from the results shown in many of the other articles in this book,<br />

well-written, creative pixel and vertex shaders can generate incredible visual<br />

impact. However, an often-overlooked aspect of shaders is their integration into a<br />

larger-scale graphics/rendering engine. As a provider of 3D graphics runtimes and<br />

tools, we at NDL have grappled with this issue directly. The framework, support<br />

code, and tools required to ensure that shaders can reach their full potential in a<br />

general game engine can be extensive. Every aspect of both game engine design<br />

and game development workflow comes into play, covering the gamut of shader<br />

integration with artist tools, loading of vertex and pixel shaders into the engine,<br />

shader asset management, and even shader parameter animation.<br />

Above and beyond basic shader integration, today’s game developers require<br />

the ability to customize their own shaders, requiring any shader-engine integration<br />

to be as flexible as it is complete. When designing the integration with NDL’s<br />

Gamebryo (a very general, “genre-agnostic” engine), this flexibility requirement<br />

proved quite difficult to satisfy in a way that was both complete and easy to use.<br />

This article discusses the approach that NDL took for integrating shaders in<br />

the latest release of our 3D graphics engine and toolkit, Gamebryo. We start out<br />

with a short history of shader support in our engine, discussing the original problems<br />

we anticipated and our attempt to solve them. A list of requirements determined<br />

as a result of our original system is presented, followed by in-depth<br />

coverage of the current system. The article concludes with a case study discussing<br />

the development of a sample application, Eturnum, which demonstrates the<br />

power of the new system.<br />

The Past

The Initial Problem

When programmable shaders entered the development scene, much excitement<br />

and fanfare heralded the impressive graphics effects that were then possible on<br />

the new hardware. While many developers scrambled to take advantage of this<br />




powerful new capability and improve the visual appeal of their games, initial<br />

acceptance of shaders in 3D games was nowhere near universal. Despite the best<br />

intentions of developers and hardware manufacturers alike, shaders did not have<br />

the immediate, intended effects on the look and feel of most games. Initially, this<br />

was often blamed on the lack of an installed base of shader-capable hardware.<br />

However, we felt (and later saw, through interactions with our customers) that<br />

there were other factors involved in this lack of shader usage.<br />

The first stumbling block to shader integration is that, in the most basic sense, shaders are just problem-specific, hardware-specific pieces of code. As an example, consider the case where a game is being developed with full shader support. To take advantage of the latest hardware, the team will need to write shaders to the most recent specifications.

However, legacy hardware will require support. The same shaders will have to be implemented for this older hardware, and a method for handling the effect without hardware shader support must also be developed. As you can see, the number of shader programs required can quickly add up. This situation is different from the case where customer hardware in general has differing capabilities, such as the number of textures per pass, available stages, and the like. Differing multitexturing capabilities, for example, can be handled far more easily by having fixed-function effects fall back to multipass solutions on older hardware. When forming the passes for a fixed-function run, some simple capability checks and predefined blending rules allow the same code to run on many hardware configurations. Pixel and vertex shaders, however, require a completely different version of the code to take full advantage of the hardware that the application is running on. Many developers were not ready to sacrifice the engines that they had developed in exchange for writing and optimizing hundreds of assembly programs to accomplish the same task. The feeling is similar to the time when developers had to write to graphics-card-specific APIs to get the results they wanted.

Another large inhibitor to widespread adoption is the integration of shaders with the art pipeline. Without a clean, flexible framework supporting them, shaders essentially require the developer to hard-code the data values placed into registers. This takes the creativity out of the hands in which it belongs: the artist's. It can also lead to a large drain on programmer productivity, as programmers lose cycles tweaking values in shader code on behalf of the artists.

Finally, we believe that one reason for the lack of shader adoption is the set of available sample shaders. Oddly enough, the problem is not a lack of such sample shaders but rather the sheer number of different shader frameworks upon which these sample shaders are based. While this sounds contradictory, a large number of sample shaders were available everywhere from hobbyist web sites to the actual graphics card manufacturers, all implemented using completely different frameworks and assumptions. The available samples use DirectX effect files, the nVidia Effects Browser, ATI RenderMonkey, and even homebrew "editors" — most taking wildly different approaches to integrating shaders. No provider supplied a clear way to integrate their particular format into an actual game; their examples were very specific to the framework that they provided for viewing the effects.


The NetImmerse Solution

Gamebryo is a cross-platform 3D graphics engine and game toolkit that evolved from NDL's previous product, NetImmerse. The versatility of the engine is evident in the number of genres in which it has been utilized, including MMORPGs, role-playing, racing, and space combat.

Since Gamebryo originated from the NetImmerse engine, it is appropriate to give a short discussion of the shader system that NetImmerse contained, covering the problems and stumbling blocks that occurred with the system. The first version of shader integration, presented in NetImmerse 4.0, was known as the Configurable Texture Pipeline. It would have been more aptly named the Configurable Rendering Pipeline, as it allowed developers to completely customize the rendering of objects with NetImmerse.

The original system functioned by supplying a class interface, the ConfigurableTextureData (CTD) class, through which the engine set up and executed the passes required for rendering an object. Most of the functions returned a code to the renderer allowing sections of the pipeline to be skipped. For example, when the derived class set a pixel shader on a pass, the function could return a value to the renderer indicating that it should not set the pixel shader itself. These return values provided complete customizability of the rendering pipeline: a derived class could skip a single step of the default rendering path, all the way up to completely bypassing NetImmerse's rendering of the object altogether.

While the system was quite powerful with respect to what could be accomplished, there were several problems with it. The base implementation of the class was the default rendering pipeline, which in hindsight was a mistake. The default pipeline of NetImmerse (and Gamebryo as well) is rather powerful, allowing high-level representations of both dynamic and static effects to be applied to rendered objects and handled by the engine seamlessly. For example, multiple projected lights and shadows, fog and environment maps, and numerous other visual effects can be applied to an object. This system allowed a large amount of flexibility with respect to applying various effects to an object but added several member functions and variables that were not required to override the pipeline. This increased complexity ultimately led to confusion when developers were first starting out deriving their own custom implementations, and some did not have the time to invest in learning how to use the system effectively.

Another side effect of the base class being the default pipeline was that implementing simple shader programs (for example, a single-pass effect that utilized a vertex and pixel shader) required quite a bit of code to be written, proving to be more complex than necessary. A "simple" pipeline would require developers to implement the six virtual functions that composed the rendering path. Since the default pipeline was so complex, it was difficult to leverage any of the base functionality of the class for simple operations. To accomplish relatively easy tasks, these functions typically had to be replaced completely in the derived CTD. This was a design oversight that, while providing a powerful interface to rendering, made the system much more difficult and time consuming to use.



CTD usage was complicated further by the fact that the only way to access the functionality of the system was programmatically. There were no capabilities for streaming the classes to and from files, so assigning any CTD-derived class to an object required the application to do so "by hand" at run time. This omission prevented adding an easy way to integrate them into the art pipeline, which in turn hampered productivity, as the assets could only be viewed in the game itself or a modified viewer that contained the required derived classes.

With respect to supporting available formats, the system left that in the hands of the developer. Unfortunately, this typically would require that the derived class handle every aspect of rendering an object. In short, the system required far too much development for too little return.

Requirements of a Shader System

Once the CTD system was in the field, the issues listed above and others arose, which led to the compilation of a list of requirements for the next version of the system. The major issues that we felt should be addressed are presented here, with a brief description of each.

Ease of Use without Sacrificing Power or Flexibility

First and foremost, the system should be easy to use out of the box while still allowing more advanced users to implement any effect they can devise. It should take minimal time and, if possible, no source code compilation to apply a simple vertex and/or pixel shader program to a rendered object, but it should not "handcuff" developers by requiring they follow a strict implementation model. This will allow developers to utilize the system at a level at which they are both capable and comfortable.

Art Pipeline Integration

The system should allow shader support to be integrated directly into the art development pipeline. The system should also expose "editable" parameters to allow the artist to experiment to obtain the desired look. This is a key element of any shader system, as it keeps visually creative control in the hands of the artist and allows for asset viewers to display the object exactly as it appears in-game, reducing the model/export/view iterations.

Simple Access to Shader Collections

The system should provide a way for shader authors to easily distribute new and updated shaders to the rest of the development team. This will streamline the development process, allowing for quick integration of new shader assets. Shader collection support also aids in the integration of shader support in the art pipeline, as the art tools can work with a known interface.


Support Industry Standard Formats

The system should support as many of the currently available, viable shader file formats as possible. This includes formats such as DirectX effect files, nVidia's CgFX files, and ATI RenderMonkey files. Doing so will allow the shader author to work in the format with which he is most comfortable. Supporting these formats also makes it easier to integrate samples gleaned from this book and other sources into a team's palette of shaders. The capability to leverage existing tools, such as RenderMonkey, is also gained with this approach.

Data-driven Support

The system should allow for rendering effects via a data-driven method. This means that it should be possible to integrate new shaders with no code compilation required. This aspect of the system could involve a script-based format that allows for text files to be written "describing" the effect. Another alternative would be to have an external shader editing application generate a binary file that contains the details of the rendering task.

A Unified Rendering Path

The system should have a well-defined interface for the way the renderer displays any geometric object. All objects, whether using a custom shader or the default pipeline, should be processed in the same manner by the renderer. There should not be two paths through the rendering pipeline; objects with a shader should follow the same pipeline as those without one. The interface must also supply sufficient low-level access such that users can completely replace the default rendering pipeline by deriving from the interface class if they wish to. Providing the developer with a precise definition of what the renderer expects during each phase of displaying an object will allow for this. Finally, the interface should be as straightforward as possible, with a family of derived classes to increase the supplied functionality in logical increments. This will allow developers to select their level of integration, easing the task of developing shaders while not restricting what can be accomplished.

The Gamebryo Shader System

With these requirements in mind, work began on the next implementation of shader integration in NDL's technology. The Gamebryo Shader System provides the classes and framework required to implement vertex and pixel shader support with all the power of the previous system, while providing alternative, more accessible methods for accomplishing the same tasks. The system is also fully integrated into the art pipeline, including support for custom coded shaders supplied by developers to be used with no tool modifications.



Within the context of Gamebryo, we use slightly different terminology than DirectX for shader-related components, so a short list of definitions may be helpful here.

• shader: An abstract representation of a complete rendering effect to apply to an object. A shader is a complete visual effect that is applied when rendering an object in the engine, including all passes and render states that are required to achieve said effect.
• shader programs: Vertex and pixel shaders in Gamebryo. This naming convention is intended to specify them with more relevance to their function/purpose. A shader can utilize numerous shader programs — one vertex and/or one pixel shader program per pass required to achieve the effect. In general, both pixel and vertex shaders are referred to as shader programs when they are not identified specifically as either a pixel or a vertex shader.
• shader library: A collection of shaders encapsulated in a DLL and/or static library package. The Xbox version of the shader system only supports static libraries, as DLLs are not available on the platform. Providing a DLL version is required for use in the art tools, so the system does not have to be recompiled to support new libraries. The term can also refer to the interface functions defined for accessing the collection.
• binary shader: An interface to a completely data-driven implementation of a visual effect. This term can also refer to the actual data representing a data-driven shader.

System Components

The Gamebryo Shader System contains several low-level classes that aid in the implementation and utilization of shaders in the pipeline. As any engine will contain a similar set of classes, we will not describe them in detail. The classes are straightforward, providing a direct representation of the hardware settings of the device when rendering. They cover render state and texture stage settings, pixel and vertex shaders, constant register mapping, and passes that make up a rendering effect, providing the building blocks used in the construction of the Gamebryo rendering pipeline.

Ease of Use without Sacrificing Power or Flexibility

The primary goal of the Gamebryo Shader System is to allow developers to add shaders to their applications with minimal start-up time while still allowing more advanced users to have complete control over the rendering pipeline if they desire. To allow developers to quickly prototype shaders, we supply a text-based format, the NDL Shader Format (NSF). We also support the use of existing "external" formats such as DirectX effect files, should shader authors opt for utilizing them in their development. CgFX and RenderMonkey support is also being added to provide developers with a wide selection of industry-supported shader formats. As shader authors generate more advanced effects that require capabilities beyond any of these formats, they can move to deriving their own shader interface classes from a number of supplied interfaces, ranging from a bare-bones interface class up to deriving from the default pipeline used by the engine. These derived classes may be grouped into a shader library for easy integration into both their application and the art pipeline.

Art Pipeline Integration

Why Integration Is Important

Art pipeline integration essentially requires shaders to be present in every step along the art pipeline. Integration does not mean writing shader code, although certainly some artists with a programming slant could do a good job of it. Art pipeline integration simply means exposing "editable" values to your artists and allowing them to preview the content in your engine throughout the pipeline. Any value that is not a required hard-coded shader constant is a great candidate to be an artist-adjustable value. For example, writing a toon shader usually involves an indirect texture lookup into a one-dimensional texture. This texture is a perfect candidate for an editable value. Often you'll be surprised with what an artist can accomplish when allowed to play with the parameters.
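As a concrete illustration of such an editable value, the C++ sketch below builds the kind of banded one-dimensional ramp a toon shader would sample with N.L. The band boundaries and intensities are arbitrary placeholder values that an artist would normally author or paint instead of leaving them hard-coded.

#include <array>
#include <cstdint>

// Illustrative only: builds a banded 1D ramp a toon shader could sample with N.L.
// Exposing this texture to the artist, rather than hard-coding band positions in
// the shader, is exactly the sort of "editable value" discussed above.
std::array<std::uint8_t, 256> BuildToonRamp()
{
    std::array<std::uint8_t, 256> ramp{};
    for (int i = 0; i < 256; ++i) {
        if (i < 64)       ramp[i] = 40;   // shadow band
        else if (i < 160) ramp[i] = 140;  // midtone band
        else              ramp[i] = 255;  // fully lit band
    }
    return ramp;
}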

Artists using Gamebryo have the option to generate art content inside of either 3ds max or Maya, and our suite of plug-ins and tools converts the content into the Gamebryo format. Additionally, artists have the ability to view how their art will look inside the art package in a separate preview window. It is absolutely critical for the art pipeline integration to give instant feedback to the artist. This feedback is doubly necessary for shaders. Artists are used to seeing their options and playing around with them to achieve the visual result that they want. This is how most artists build a mental model of their art package. Often, nontraditional rendering effects require a fair bit of experimentation and tweaking to build a valid mental model. Rendering pipeline equations simply won't suffice. If the artist can tweak a few parameters and then preview what effect those changes had, he will feel more comfortable working with shaders and become productive with them significantly faster.

How We Integrated Shaders in the Art Pipeline

Adding additional, dynamic user interface items to any application can prove to be quite difficult. Unfortunately, this is precisely what was required for integration with the art package. Luckily, 3ds max and Maya both provide a simple mechanism for adding these items — "custom attributes" in Max and "extra attributes" in Maya. For the purposes of this discussion, we simply refer to both as custom attributes. In both art packages, custom attributes add additional data structures and optional user interface widgets that extend the normal meaning of objects. For instance, you could add a gradient ramp texture for a toon shader to a material. A custom attribute containing the texture and all of the GUI for editing that texture can be applied with a little bit of MEL or MAXScript. Furthermore, the underlying mechanisms for animation in the art package are automatically supported in the scripting language. Therefore, keyframe animation of values comes essentially "for free" with this approach. Supporting custom attributes and their keyframed values in the exporter is fairly straightforward.

Shaders and their descriptions are loaded at application startup by the Gamebryo plug-ins. All of the known shaders are available to the artist in Max's Gamebryo Shader or Maya's NiMultiShader. A drop-down list allows artists to select the shader that they wish to use. Descriptions of the overall shader, its attributes, supported pixel and vertex shader versions, and descriptions of each technique are available at the press of a button. Once the user applies the selected shader to the object, custom attributes are dynamically generated and applied to the current object. Artists can then edit these attributes just like any other attribute in the art package. For example, colors can be edited through the standard color picker. Textures can be edited through the standard user interface widgets. Animation of each shader attribute is as simple as animating anything else in Max or Maya.

See Figure 1 for an example of the shader parameter interface in 3D Studio MAX and Figure 2 for an example of the Maya interface.

Figure 1: The 3D Studio MAX artist interface
Figure 2: The Maya artist interface

This familiar user interface gets the artist up and running with shaders very quickly, but familiarity alone is not enough. We have found that shaders are often developed in a very iterative fashion: programmers move effects from vertex code to pixel code, requiring different user interface widgets, or artists ask programmers for more attributes to modify. Luckily, the custom attribute solution in both packages supports redefining custom attributes. As long as the attribute names don't change, their values, including all keyframes, can be transferred to the new definition. New attributes are filled in with their default values. These features are an incredible help to the artists in the development of a game. Without such features, artists would have to redo art assets any time the shader changed. We augment this process by auto-detecting whenever the shader changes, comparing its attribute definition in the Max or Maya file to the definition loaded at application startup. If the definitions are different, the user is notified and can choose to upgrade the art assets or leave them as is.

Simple Access to Shader Collections

No matter how flexible or powerful a shader system is, if it does not supply an easy way to update the development team with new shaders, production can potentially be hindered during their integration into the application. To provide shader authors with a simple mechanism for supplying new and updated shaders to the rest of the development team, the Gamebryo Shader System takes a shader library approach. A shader library is an interface for accessing a collection of shaders via a static or dynamically linked library that contains the code for the shader(s).

To utilize the shader library system, the application simply registers a library with the system. When a call is made to attach a shader to an object, either by the file streaming system or the application itself, each registered library will be checked for the presence of the requested shader. If a library contains the shader, it will be retrieved and registered with the system using a reference counting system. A function in the shader class itself is then called to allow any special processing of the geometry that it may require, such as generating tangent space data if it is not already present in the geometric object. Applications may register as many libraries as they wish, allowing for shaders to be grouped into libraries by concepts such as level, unit type, spells, or any category that makes sense for the application.
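The sketch below mirrors that registration-and-lookup flow in C++. All of the names here (Shader, ShaderLibrary, ShaderSystem, AttachShader, and so on) are invented placeholders for illustration; Gamebryo's actual classes and method signatures differ.

#include <map>
#include <memory>
#include <string>
#include <vector>

class GeometryObject;  // stand-in for a renderable object

// Hypothetical stand-ins for the engine concepts described in the text.
class Shader {
public:
    virtual ~Shader() = default;
    // One-time, shader-specific geometry setup (e.g., tangent-space generation).
    virtual void PrepareGeometry(GeometryObject& object) = 0;
};

class ShaderLibrary {
public:
    virtual ~ShaderLibrary() = default;
    // Return the named shader, or null if this library does not contain it.
    virtual std::shared_ptr<Shader> GetShader(const std::string& name) = 0;
};

class ShaderSystem {
public:
    void RegisterLibrary(std::shared_ptr<ShaderLibrary> lib) {
        m_libraries.push_back(std::move(lib));
    }

    // Walk the registered libraries in order, cache the shader once found
    // (reference counted here via shared_ptr), and let it preprocess the geometry.
    std::shared_ptr<Shader> AttachShader(const std::string& name, GeometryObject& obj) {
        std::shared_ptr<Shader>& slot = m_active[name];
        for (auto& lib : m_libraries) {
            if (slot)
                break;
            slot = lib->GetShader(name);
        }
        if (slot)
            slot->PrepareGeometry(obj);
        return slot;
    }

private:
    std::vector<std::shared_ptr<ShaderLibrary>> m_libraries;
    std::map<std::string, std::shared_ptr<Shader>> m_active;
};

Grouping shaders by level, unit type, or spell then amounts to deciding which RegisterLibrary calls the application makes at startup.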

For integration into the art tool chain, the shader library interface also contains a description mechanism. The supplied description contains a short description of the library itself as well as a description for each shader that it holds. The shader descriptions hold information on the various implementations of the shader, its requirements in terms of hardware and platform, and the attributes that it utilizes. These descriptions provide the means for exposing artist-editable shader parameters to the modeling tools, as well as comments to aid the artist by describing what a given shader does and what its parameters are used for.

Gamebryo ships with a shader library, NSBShaderLib, for recognizing and loading NDL Shader Binary (NSB) files. This format is a binary data representation of a visual effect to apply to a rendered object and is described in more detail later in this chapter. When registered with the system, the library will search a given directory, optionally recursing subfolders, identifying all NSB files contained within and adding them to an internal list. When a particular shader is requested for attachment to a geometric object, the library will search this list. If the shader exists in the library but has not yet been loaded, the most appropriate implementation is instantiated and returned, chosen according to the capabilities of the hardware and the requested shader versions.


Support Industry Standard Formats

To allow shader authors to work with the format that they are most comfortable with, the Gamebryo Shader System contains support for DirectX effect files, nVidia's CgFX files, and ATI's RenderMonkey files. Classes derived from our base shader interface were written to encapsulate the functionality required to utilize these formats. Some requirements for how they are authored were defined to ease the integration of these file types into the engine, such as identifying artist-editable values via annotations in DirectX effect files or using the grouping feature in ATI's RenderMonkey files to represent different implementations of the same effect. These should not hinder developers but do add an additional burden on those wishing to take full advantage of our system when using "external" formats.

By providing the support framework for a particular format in the form of a shader library, developers can easily integrate those shaders simply by registering the library with the shader system. This approach also allows NDL to handle integrating new formats without requiring the delivery of a full source code update to our customers. This adds significant expandability to the system, allowing support for future updates and advances to be added as simply as possible.

To give an example of the benefit of this shader library approach to external formats: simply by deriving a shader class that wraps the D3DXEffect interface and packaging it in a shader library, support for the DX9 FX file format was added with no code changes to the renderer. Other "industry standard" formats are each given their own shader library to handle accessing them. This allows developers to select which formats, including their own custom ones, they wish to support in their applications simply by registering the corresponding library.

Data-driven Support

In an effort to simplify using the system and provide a rapid prototyping capability, Gamebryo shader integration was designed to provide data-driven support. When we speak of data-driven shaders, we mean allowing a shader to be utilized with no source code compilation required. To accomplish this, data-driven shaders contain a list of passes, which in turn contain render state and texture stage state settings, as well as the pixel and vertex shader programs and their corresponding constant register settings required to obtain the desired effect. One method that supplies this type of support is the previously mentioned binary shader format (NSB). NSB files are supported via a shader library, much like the DX9 effects support, and provide developers with the ability to quickly and easily add shaders to their game.

A key component for supplying data-driven shaders is a class that maps data values to shader constant registers.


Constant Register Mapping

To allow for mapping data values to shader constant registers, Gamebryo provides the constant mapping class. Two derived classes provide specific implementations for pixel shader and vertex shader constants. Each class consists of a map of entries representing the data and the register(s) to which it is mapped.

Support is included for several data source mappings to a given set of registers. A constant type maps a constant data value, such as Taylor series coefficients. A predefined type maps one of a set of Gamebryo-defined values; each derived shader constant map class contains specific values for that usage type. For example, the vertex shader-specific class defines mappings for the world-view-projection transform and the diffuse material color. This data will be automatically updated and set on the device when the object is being rendered. A per-object attribute type maps a data value (or attribute) attached to the rendered object. This mapping allows a single shader to achieve different visual results by having parameters differ for each object being rendered. The global attribute type, similar to the per-object attribute, maps a data value from a global table of parameters. This mapping is helpful for setting values such as lighting parameters. Global attributes do require the application to update them as necessary. Finally, there is the operator type, which allows the shader author to perform a mathematical operation on two other entries and map the result to the shader constant register(s). This mapping is useful for using the CPU to reduce shader instruction count and potentially the number of constant registers utilized. For example, transforming the light position to object space can be done once per object, as opposed to once per vertex, since the result is the same for all vertices in the mesh.

Entries include support for a number of data types, including Boolean values, vectors of one to four unsigned integers, vectors of one to four floats, 3x4 or 4x4 matrices, four-component floating-point color values, and texture images.

Constant register mapping is a key feature of the engine, allowing for truly data-driven shaders. Creating a constant map greatly simplifies updating and setting shader constant register values for a pass. This automatic mapping of constant registers also eliminates one of the primary reasons most shader integration approaches require custom C++ code.
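A minimal C++ sketch of what such a map entry might look like is shown below. The enum values and field names are invented for illustration and do not match Gamebryo's actual constant map classes; the comment about the upload call refers to the standard DirectX 9 device method.

#include <cstdint>
#include <string>
#include <vector>

// Which kind of data source feeds a block of constant registers.
enum class MappingSource {
    Constant,        // fixed data baked into the map (e.g., approximation coefficients)
    Predefined,      // engine-supplied value (e.g., world-view-projection matrix)
    ObjectAttribute, // per-object attribute attached to the rendered object
    GlobalAttribute, // value pulled from a global, application-updated table
    Operator         // result of a CPU-side operation on two other entries
};

struct ConstantMapEntry {
    std::string   name;             // attribute or predefined-value name
    MappingSource source;
    std::uint32_t firstRegister;    // first constant register to write
    std::uint32_t registerCount;    // number of four-float registers occupied
    float         constantData[16]; // inline storage used by Constant-type entries
};

// Each frame, a vertex shader constant map would resolve every entry to a value
// and upload it, e.g., with IDirect3DDevice9::SetVertexShaderConstantF.
struct VertexShaderConstantMap {
    std::vector<ConstantMapEntry> entries;
};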

Binary Shaders and the NDL Shader Format

To facilitate a completely data-driven approach to implementing shaders, NDL developed a set of libraries that allow for text-based representations of shaders. Two libraries were created to support developers taking this approach to including shaders in their application.

The first library, NiBinaryShaderLib, implements a shader-derived class, NiBinaryShader, which has been extended to allow for directly setting groups of device states and pass configurations to implement a visual effect. This class removes the need for developers to write C++ code to implement shaders in the Gamebryo Shader System. By simply providing different data for the various class members, a wide range of effects can be achieved.



An abstract, platform-independent representation of this data is also supplied in this library via the class NSBShader. These shaders can be streamed to and from storage devices and written to NDL Shader Binary (NSB) files. Due to differences in enumeration values between Xbox, DX8, and DX9-based D3D implementations, a format that could be stored and subsequently used on all platforms was required, with the conversion code occurring at load time. The class also supports the concept of multiple implementations, which are different methods for achieving the same visual effect. This system is similar to Techniques in the DirectX effect file format. At load time, the system will search the implementations of an effect and return the most appropriate version, taking the system hardware and the requested versions into account to form the decision.

The second library that Gamebryo provides is a utility library, NSFParserLib, which parses NDL Shader Format files and generates the corresponding NSB files. The NDL Shader Format (NSF) is a text-based file format that allows developers to write shaders in a simple language and apply them to objects with no C++ code to write and no compilation required. Shader authors can define both global and per-object attributes, which in turn can be mapped to shader constant registers. All device settings can be defined in the file, including render states, texture stage settings, and pixel and vertex shader programs. The format is similar to the DirectX effect files, allowing multiple implementations of the same shader to provide legacy hardware support. The library operates by searching a given file directory, optionally recursing its subfolders, looking for NSF files. When found, the file will be parsed, and its corresponding NSB file will be written.

NOTE: The binary shader library was separated from the parser library to allow developers to implement their own text-based formats without having to also develop an underlying binary representation.

The NSBShaderLib library used in conjunction with the NSFParserLib provides a complete system for rapid prototyping of shaders within applications or the tool chain. The application can simply run the NSFParser on all NSF files, generating the corresponding NSB files for all those found. Then, by registering the NSBShaderLib, both existing NSB files and newly generated ones will be available for applying to rendered objects.
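In code, that prototyping loop is essentially two calls at startup. The function names and the directory below are placeholders invented for this sketch; the real NSFParserLib and NSBShaderLib entry points are named differently.

#include <iostream>
#include <string>

// Hypothetical stand-ins for the NSFParserLib and NSBShaderLib entry points.
namespace nsf {
    void ParseDirectory(const std::string& path, bool recurse) {
        // A real implementation would scan 'path' (optionally recursing) for
        // .nsf files and emit a matching .nsb file for each one it finds.
        std::cout << "parsing NSF files under " << path
                  << (recurse ? " (recursive)\n" : "\n");
    }
}
namespace nsb {
    void RegisterLibrary(const std::string& path, bool recurse) {
        // A real implementation would index every .nsb file under 'path' so
        // shaders can later be attached to rendered objects by name.
        std::cout << "registering NSB shaders under " << path
                  << (recurse ? " (recursive)\n" : "\n");
    }
}

int main() {
    const std::string shaderDir = "Data/Shaders";  // placeholder path
    nsf::ParseDirectory(shaderDir, true);   // step 1: NSF (text) -> NSB (binary)
    nsb::RegisterLibrary(shaderDir, true);  // step 2: make the NSB shaders available
    return 0;
}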

A Unified Rendering Path

A common path through the rendering pipeline is key to allowing shader developers to fully understand what is happening "under the hood" with their creations. A clean, consistent interface must exist that provides all the functionality required to allow not only the application of shader programs to rendered objects but also the definition of a complete rendering pipeline through which the data will flow. The interface should be completely clear as to what the developer needs to implement in order to achieve the desired effect. As described previously, the original CTD system failed quite severely at this particular goal.


At the lowest level, the Gamebryo Shader System contains a class, NiShader, which is simply a name and implementation number for a shader. This interface is provided to minimize cross-platform compilation issues as well as keep the door open for future expansion to our other supported platforms. The heart of D3D-based shader integration in Gamebryo is an interface class derived from this, appropriately named NiD3DShaderInterface.

NiD3DShaderInterface

NiD3DShaderInterface defines how the engine renders geometric objects. It is the lowest level that a developer can derive from, giving developers complete power to achieve what they wish to accomplish. This power also means the developer has complete responsibility for properly setting up the hardware for the rendering of the object. The concept is the same as the original CTD system design, with a bit more structure to the interface functions.

An Initialize function is called when a shader is created, allowing for any class-specific initialization required, such as registering shader programs and other one-time tasks. Pre- and post-process functions, called before and after any other processing of a rendered object, are exposed to allow for any specialized setup and shutdown code required by the class. A function to update the pipeline is called once per rendered object to allow for the formation of passes based on higher-level static and dynamic effects. A derived class does not have to form the passes each time the object is rendered, but the approach is permitted.

For each pass on a rendered object, a call is made to configure the current pass on the hardware. A function for setting up the required transformations is also called. Typically, the calculation would be performed on the first pass and cached for subsequent passes. It is possible for an implementation to perform per-pass modifications to the results if needed. A call is then made to set optional shader programs and their corresponding constant registers. The final per-pass call is to prepare the geometry for rendering, which ensures that the geometry is packed in the format required for rendering. If the geometry has not been packed at this point, this function is expected to do so, as well as set the stream sources and indices.

Two additional functions exist in the interface: one to indicate to the shader that a new rendered object is being processed and another to indicate to the shader that the renderer is ready to begin the next pass.

These functions correspond to the steps that occur during the rendering of geometry in the Gamebryo engine. By defining this level of interaction, developers know exactly what is expected by the pipeline and can implement whatever effect they wish with minimal reworking of their ideas to fit within our framework. This interface replicates and extends the level of interaction that developers were permitted with the previous system, thus meeting one of our major design goals to provide the same power and flexibility of the original system.
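A skeletal derived class built against an interface shaped like the one just described might look as follows. The base class and the virtual method names below merely paraphrase the steps in the text; they are not NiD3DShaderInterface's actual function names.

// Hypothetical mirror of the per-object / per-pass steps described above.
class SimpleShaderInterface {
public:
    virtual ~SimpleShaderInterface() = default;
    virtual bool Initialize() = 0;                 // one-time setup (register programs, etc.)
    virtual void PreProcessObject() = 0;           // before any other per-object work
    virtual unsigned int UpdatePipeline() = 0;     // form the passes; returns pass count
    virtual void SetupPass(unsigned int pass) = 0; // configure hardware for the current pass
    virtual void SetupTransformations() = 0;       // compute/cache transforms (first pass)
    virtual void SetupShaderPrograms() = 0;        // bind programs, fill constant registers
    virtual void PrepareGeometry() = 0;            // pack vertices, set streams and indices
    virtual void PostProcessObject() = 0;          // after the object has been rendered
};

// A minimal single-pass effect only has to fill in these steps.
class SinglePassEffect : public SimpleShaderInterface {
public:
    bool Initialize() override { /* create and register the shader programs */ return true; }
    void PreProcessObject() override {}
    unsigned int UpdatePipeline() override { return 1; }   // always one pass
    void SetupPass(unsigned int) override { /* render and texture stage states */ }
    void SetupTransformations() override { /* compute and cache the WVP matrix */ }
    void SetupShaderPrograms() override { /* set vertex/pixel programs and constants */ }
    void PrepareGeometry() override { /* ensure packed geometry, set stream sources */ }
    void PostProcessObject() override {}
};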



The Shader Interface Extended

The system also provides additional classes to ease the integration of shaders in the engine. We opted to build these classes in a manner that supplies developers with increasing levels of functionality to aid in shader-based development. The default rendering pipeline is also implemented as a shader interface-based class.

NiD3DShader

NiD3DShader is derived from NiD3DShaderInterface, adding additional functionality and members to aid in the implementation of shader-based effects. It includes a definition of how to pack the geometry, an optional group of global render states that are set once for the entire effect (considered "global" states for the shader, such as setting the depth test to enabled), optional global shader constant register mappings for both vertex and pixel shader programs, and an array of pass instances that make up the complete rendering effect.

The NiD3DShader class provides an interface for implementing shader-based effects without requiring large amounts of code to be written. The derived class can simply fill in the members of the class with the appropriate settings and let the base implementation take care of the rendering details.

NiD3DDefaultShader

NiD3DDefaultShader implements the default rendering pipeline for the Gamebryo engine and is derived from the NiD3DShader class. It analyzes the platform-independent static and dynamic effects applied to the object and constructs the appropriate passes to achieve the complete effect. If a shader instance is not present on an object being rendered in Gamebryo, the default shader is used.

Deriving from the NiD3DDefaultShader class allows a developer to extend the pipeline that ships with Gamebryo in a number of ways. One could extend the existing pipeline to implement a new rendering technique that has not yet been incorporated into the engine. Another option is to alter the functionality provided (for example, the pipeline implements projected lights by modulating them, but a developer could derive from the class and implement projected lights as additive). One suggested usage is to implement "fast path" shaders, which remove sections of the pass-construction step for effects that the developer will not be using. For example, if the game does not use projected lights, the part of the pipeline that analyzes and sets up projected lights could be removed, resulting in a faster path through the renderer while maintaining all the functionality with respect to other effects.
effects.<br />

This three-tiered approach to the shader classes supplies a well-defined<br />

structure in which shaders can be developed. Depending on their needs, developers<br />

can opt for any of the interfaces that will provide them with their required<br />

level of access to the rendering pipeline. Figure 3 shows the options for deriving<br />

classes that the original CTD system presented to developers.


As you can see from this diagram, developers had limited options for shader support;<br />

there is only one “entry point” for deriving their shader classes. Compound<br />

this with the fact that the interface was quite muddled with functionality for<br />

implementing the default pipeline, none of which was required for implementing<br />

shaders, and it is understandable why so many problems arose with its usage.<br />

The Gamebryo system derivation options are displayed in Figure 4.<br />

The new system presents developers with far more options for what level they<br />

may derive from, from implementing shaders with no C++ code required to completely<br />

overriding the rendering of objects. The underlying interface to the rendering<br />

is much more clearly defined, laying out the functionality they are required<br />

to provide at each step. This makes the task of developing shaders far easier<br />

when using the Gamebryo engine.<br />

Case Study: Eturnum<br />

Section VI — 3D Engine and Tools Design<br />

<strong>Shader</strong> Integration in the Gamebryo Graphics Engine<br />

Figure 3: Class diagram for the NetImmerse shader system<br />

Figure 4: Class diagram for the Gamebryo shader system<br />

645<br />

To understand the reasoning behind the design goals outlined in this chapter, it is helpful to consider a case study of a sample application. This section discusses different aspects and issues that arose in developing NDL's shader-driven demo for GDC 2003, Eturnum. We examine each of the six design goals and how Eturnum reinforced the applicability of those goals. Finally, we explore some of the lessons learned with Eturnum, paying particular attention to areas that still require improvement.


About Eturnum

The original idea for Eturnum was to showcase the Gamebryo Shader System by placing a highly detailed character in a realistic scene. In addition to showing high polygon throughput and performance, the demo renders almost every surface with a custom shader. These shaders implement a variety of fairly common pixel and vertex shader effects, such as dot3 bump mapping, palettized skinning, and thin film interference.

Design Goals

Since the final goal of Eturnum was to demonstrate the power of the Gamebryo Shader System, it makes sense to examine how the development of the demo reinforced the previously stated design goals.

Unified Rendering Path

Because all objects in Gamebryo take the same path through the renderer, we were able to leverage some utility out of the default, fixed-function pipeline in the development of Eturnum. Generally, neither art assets nor shader code exist at the beginning of a project. It is unsatisfactory to stall either aspect of development to wait for the other. Since the Gamebryo Shader System was designed with a unified rendering path in mind, it was possible to begin designing art assets for Eturnum that would later have shaders attached to them. Unifying the rendering path meant that no work would be lost when shaders were applied. All the parameters for materials, textures, etc. carried directly over from the default pipeline shader to the custom shaders written for Eturnum.

Data-driven Support

Shaders for Eturnum were developed using NSF files. As stated previously, NSF files contain information for a rendering effect in a script file, much like Microsoft's DirectX effects files. Using NSF files allowed rapid, data-driven iteration to occur on the shaders. This rapid feedback cycle produced more refined shaders without consuming significant amounts of time rebuilding programs. The NSF files were parsed at run time, making changes to the NSF files instantly apparent in the application.

Industry Standard Formats

The Gamebryo Shader System is designed to be compatible with major industry standards while still adding information specific to Gamebryo where necessary. With this fact in mind, effects for Eturnum were often built initially in ATI's RenderMonkey program. Although the library to import RenderMonkey XML files directly into Gamebryo was not complete at the time the demo was authored, translating the information from a RenderMonkey file to NSF was trivial given the compatible setup of the Gamebryo Shader System.


Simple Access to Shader Collections

Parsing the NSF files at run time for Eturnum provided a strong, data-driven shader model for the application. Additionally, it fulfilled another requirement for the shader system — providing simple access to shader collections for all members of the development team. Throughout Eturnum, the shaders for each member of the development team could be updated with a simple text file that was parsed at run time.

Art Pipeline Integration

The Gamebryo plug-ins for 3ds max and Maya both support the use of shader libraries at run time, and they ship with libraries to parse NSF files and load the associated NSB files. This fact was invaluable to the development of Eturnum. It was possible for artists to create the assets, assign shaders to assets, and preview those assets before even considering exporting for use in the application. Additionally, the use of custom attributes to hold shader parameters in both plug-ins allowed the artists rather than the programmers to modify shader effects.

Figure 5: The original artist interface

Ease of Use without Sacrificing Power or Flexibility

Of all the design requirements for the Gamebryo Shader System, generating an easy-to-use system without limiting the creativity of artists and programmers was the most difficult and the most important. In the end, the system showed its strength by demonstrating just this capability in the development of Eturnum. For our alien, the initial shader effect implemented matrix palette skinning with a base map and two per-vertex directional lights. This effect was satisfactory, but the demo called for more. We needed to add a warp effect to the alien for use when he teleports in and out of the temple. Because the system was designed to allow almost any imaginable effect, the warp effect was easily coded with a change to the NSF file and changes to the accompanying shader programs.

Once the alien was teleporting into the temple, however, the art staff wanted per-pixel lighting and dot3 bump mapping on the character. Again, these features were not a problem to add since the system was designed for easy use and powerful expansion. Changes to the NSF file were instantly recognized in the art packages, and the art team did not have to redesign any art to fit the new shaders. Additional custom attributes in the art packages were transparently added and set to default values without disrupting the art production pipeline in the slightest.


Lessons Learned

Although the development of Eturnum reinforced the design goals of the Gamebryo Shader System, some important lessons were learned regarding the shader system and shader-driven development in general.

First, although distributing NSF files to our art staff allowed easy access to shaders, this solution was still not optimal. It was always possible that a single NSF file out of the group was not synchronized. Ideally, all shaders would be collected into a single file that represented the current database for use in the application. Much like a source control database, this file would always contain the most current version of all the shader effects.

We also found that understandable, artist-editable parameters must be available in the art development packages to provide the necessary customizability and creativity. An initial version of the warp effect used hard-coded values to control the waves of color that washed over our character. When the shader was changed to use the custom attributes shown in Figure 5, artists had the ability to control the effect but also had difficulty understanding how the various numerical parameters affected the shader. To address this confusion, the effect was changed to use a texture map rather than numeric inputs. For this texture, the U coordinate of lookups was calculated from the normal dotted with the view vector, and the V coordinate was the current value of the WarpAlpha parameter. When presented with this change, one artist stayed three extra hours to play with the effect because he was having so much fun with it.

Figure 6: The more artist-friendly interface
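For reference, the texture coordinates described above reduce to a couple of lines of math. The sketch below is a CPU-side illustration of the same formula, not the shader code actually used in the demo, and it assumes both input vectors are already normalized.

struct Vec3 { float x, y, z; };

static float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// U: how directly the surface faces the viewer (N.V); V: the animated WarpAlpha
// value. The artist-authored 2D texture then fully defines the warp effect.
void WarpLookupUV(const Vec3& normal, const Vec3& toViewer, float warpAlpha,
                  float& u, float& v)
{
    float facing = Dot(normal, toViewer);
    u = facing < 0.0f ? 0.0f : facing;  // clamp back-facing cases to the start of the ramp
    v = warpAlpha;
}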

Wrap-Up and Future Plans

A fully configurable and programmable pipeline can provide huge benefits to a game's overall visual impact. Such a pipeline allows the flexibility for very creative effects but cannot stand alone. For artists to create these effects, they must be provided with an environment that makes the authoring of shader-driven assets simple and understandable. During the development of Eturnum, we discovered that a programmer-centric interface to shaders was not enough to enable the full abilities of the artist efficiently. The Gamebryo tools team created an artist-centric integration that leveraged the user interface components provided by the SDKs of the art tools themselves, ensuring that the controls looked and behaved in ways that were familiar to the artist. This increased the artist's shader-tweaking productivity significantly, an important factor as project milestones approached.

The Gamebryo Shader System was designed with a clear set of requirements that we at NDL felt were needed to provide a complete shader solution to our developers. The system met or exceeded most of the goals we set, but like any software development project, additional features are in the process of being integrated or are planned for future integration. These include:

• Allowing the constant mapping technique for per-object and global attributes to be used for other values, such as render state and texture stage settings
• Adding shader constant register management to minimize SetShaderConstant calls
• Developing a method for supporting high-level shader languages in the NDL Shader Format; such support is currently only available via the DirectX effect, RenderMonkey, and CgFX paths
• Implementing a method for integrating render-to-texture effects into the data-driven system
• Adding a WAD-type system for packaging binary shaders as well as their associated pixel and vertex shader files
• Developing tools that analyze a scene with the default pipeline and generate custom, per-object shaders that contain only the effects required to render them
• Developing and supporting a cross-platform shader language that provides support on all Gamebryo platforms

Due to the constant advancement of graphics hardware, shader integration is never truly complete in a graphics engine. While the Gamebryo Shader System is full-featured with respect to our list of requirements, we are looking forward to its continued development.


Vertex Shader Compiler

David Pangerl

Introduction

The Vertex Shader Compiler (VSC) is a free, C-based DirectX vertex shader (VS) compiler. It provides a high-level programming language (HLPL) for DirectX VS assembler programming. The output of the VSC is optimized VS assembly with additional run-time information. This article describes what the VSC can do for a VS programmer and how it can be used for simpler and more efficient high-level writing of vertex shaders.

Current Features<br />

The VSC is a fast, robust, and stable compiler. It was designed to divide the vertex<br />

shader pipeline into several parts. It has a powerful plug-in system, built-in<br />

optimization, and a precompiler. Its features include:<br />

� Compilation of a program provided via string or file<br />

� Same code for different mesh vertex types<br />

� Function swapping<br />

� Fast compilation<br />

� Good optimization<br />

� Handling register restrictions<br />

� VS array support<br />

� Small and simple C++ code (~3300 lines) with a very simple interface<br />

VSC License<br />


The VSC is copyrighted by ZootFly. This library is free software. You can redistribute<br />

it and/or modify it under the terms of the BSD-style license, which is<br />

included in this library in the file licence_bsd.txt.


Concept

DirectX Vertex Shader Assembler Language

The <strong>DirectX</strong> VS assembler is a direct way of programming the vertex processing.<br />

It is very powerful and useful, although using it requires knowledge from a variety<br />

of fields:<br />

� Basic assembler knowledge<br />

� Instruction specifications (<strong>DirectX</strong> documentation)<br />

� Instruction restrictions (<strong>DirectX</strong> documentation)<br />

� Register usage (<strong>DirectX</strong> documentation)<br />

� Register per instruction restrictions (<strong>DirectX</strong> documentation)<br />

Even the simplest function (length, sin, cos, etc.) requires a lot of assembler<br />

instructions. It also requires a slightly different approach than the HLPL. Furthermore,<br />

the maintenance, debugging, and upgrades of the assembler code are very<br />

hard and demanding tasks.<br />

This is why we decided to develop a custom HLPL, which eliminates all the<br />

tedious tasks of assembler programming.<br />

The VSC language is based on the C syntax with some simplifications and modifications,<br />

as shown in the following example:<br />

//first.cvs
#include <...>

float sqr(float a)
{
   return a*a;
}

void vsmain(float param)
{
   out.Pos=m4x4(in.Pos * sqr(param) + 10, FinalMatrix.x);
}


Because of the <strong>DirectX</strong> VS assembler specifics, there are some architectural differences<br />

from classic C:<br />

• All functions are compiled inline (there is no stack).

� All function parameters are treated as references.<br />

� Only float and vector types are supported (the VS register consists of four<br />

floats).


• There are no if, while, for, and switch statements (the DirectX 8 VS has no branching instructions; this will be implemented in the DirectX 9 VS).

Furthermore, some simplifications have been made due to the relative simplicity<br />

of the VS programs:<br />

� No scopes<br />

� No user-defined types and structures<br />

� Much simpler precompiler<br />

The nature of VS leads to some grammar additions:<br />

� VS input and output structure definition<br />

� Plug-in introduction<br />

� Constant register reservation introduction<br />

Types

• The float type is IEEE single-precision (32-bit) floating point.
• A vector is a structure of four floats (see the following C definition of a vector structure).

struct vector {<br />

float x;<br />

float y;<br />

float z;<br />

float w;<br />

};<br />

Variables

Local variables can be defined at any point in the program, as in C++. The VSC doesn't support global variables.

Constants

The VSC supports two types of constants: float and vector constants. Float constants are defined by a single float number. Vector constants are defined as an array of one to four float numbers in parentheses. Note the following examples:

// float constant<br />

float a=10;<br />

float b=1.234;<br />

//<br />

a=a*0.5+b*0.2;<br />

// vector constants<br />

vector x=(1);<br />

vector y=(1,1,1,0);<br />

//<br />

x=in.Pos + (0,1,0);


The constant definition table contains additional data exported with the program assembly, as shown below. It can be used to set the constants from within a user program.

//Result of first.cvs with constant definition table export (first.vs)<br />

vs.1.1<br />

mov r0.x, c4.x // max reg per ins. salvage<br />

mul r0.y, c4.x, r0.x<br />

mad r0, inPos, r0.y, c5.x // | (o2.1)<br />

m4x4 outPos, r0, c0 // assignment| (o1.1)<br />

// Plugin: FinalMatrix::x=c0<br />

// Plugin: FinalMatrix::y=c1<br />

// Plugin: FinalMatrix::z=c2<br />

// Plugin: FinalMatrix::w=c3<br />

// Constant: c5=(10.000,?,?,?)<br />
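The exported table can be consumed on the application side. The following is a minimal C++ sketch, not part of the VSC itself, assuming a valid IDirect3DDevice9* and that the registers match the comments above; the mapping of the vsmain parameter to c4 is an assumption here, since the listing does not show it.

#include <d3d9.h>
#include <d3dx9math.h>

// Feed the registers reported by the VSC output above:
//   FinalMatrix -> c0..c3 (radical plug-in), the literal 10.0 -> c5.x,
//   and the vsmain parameter 'param' (assumed here to have been mapped to c4.x).
void SetFirstVsConstants(IDirect3DDevice9* device,
                         const D3DXMATRIX& finalMatrix,
                         float param)
{
    // c0..c3: the FinalMatrix plug-in (four vectors)
    device->SetVertexShaderConstantF(0, (const float*)&finalMatrix, 4);

    // c4.x: the vsmain parameter (only .x is read by the shader)
    float c4[4] = { param, 0.0f, 0.0f, 0.0f };
    device->SetVertexShaderConstantF(4, c4, 1);

    // c5.x: the literal constant 10.0 from "in.Pos * sqr(param) + 10"
    float c5[4] = { 10.0f, 0.0f, 0.0f, 0.0f };
    device->SetVertexShaderConstantF(5, c5, 1);
}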

Functions

All functions in the VSC are compiled inline. All function parameters are treated as references (modifying the parameter within a function will modify the value of the variable that was passed to the function).

The program entry function is vsmain. Parameters can be passed to the<br />

vsmain function, as shown below.<br />

//vsmain.cvs<br />

output vertexshader {<br />

vector Pos;<br />

};<br />

void vsmain(float a, vector b)<br />

{<br />

out.Pos=a*b;<br />

}<br />

Here is the output:<br />

vs.1.0<br />

// vsmain parameter assignment:<br />

// parameter 'a'=c0.x.<br />

// parameter 'b'=c1.<br />

mov r0, c1 // max reg per ins. salvage<br />

mul outPos, c0.x, r0 // assignment||max reg per ins. salvage (o1 failed) (o1.1)<br />

All <strong>DirectX</strong> VS instructions and macros are added as VSC functions, as shown in<br />

the following table:<br />

Functions<br />

add(anytype a, anytype b)<br />

dp3(anytype a, anytype b)<br />

dp4(anytype a, anytype b)<br />

dst(anytype a, anytype b)<br />


exp(anytype a)<br />

expp(anytype a)<br />

frc(anytype a)<br />

lit(anytype a)<br />

log(anytype a)<br />

logp(anytype a)<br />

m3x2(anytype a, anytype b)<br />

m3x3(anytype a, anytype b)<br />

m3x4(anytype a, anytype b)<br />

m4x3(anytype a, anytype b)<br />

m4x4(anytype a, anytype b)<br />

mad(anytype a, anytype b, anytype c)<br />

max(anytype a, anytype b)<br />

min(anytype a, anytype b)<br />

mov(anytype a, anytype b)<br />

mul(anytype a, anytype b)<br />

rcp(anytype a)<br />

rsq(anytype a)<br />

sge(anytype a, anytype b)<br />

slt(anytype a, anytype b)<br />

sub(anytype a, anytype b)<br />

VSC also supports some comparison expressions:

Syntax        Description
(a>=b?c)      Returns c if a>=b; otherwise returns 0.
(a>=b?c:d)    Returns c if a>=b; otherwise returns d.
(a<b?c)       Returns c if a<b; otherwise returns 0.
(a<b?c:d)     Returns c if a<b; otherwise returns d.


Here is an example of the input/output structure definition:

output vertexshader {<br />

vector Pos; // vertex position<br />

vector DColor; // diffuse color<br />

vector SColor; // specular color<br />

float PointSize; // sprite point size<br />

float Fog; // fog value<br />

vector Tex0; // texture [0..7] coordinates<br />

vector Tex1;<br />

vector Tex2;<br />

vector Tex3;<br />

vector Tex4;<br />

vector Tex5;<br />

vector Tex6;<br />

vector Tex7;<br />

};<br />

input vertexshader {<br />

vector Pos;<br />

vector Normal;<br />

vector Color;<br />

vector Tex0;<br />

vector Tex1;<br />

vector Tex2;<br />

vector Tex3;<br />

};<br />

Plug-ins

Plug-ins are the VSC substitutes for structures in the C language. They are the

main link between the VS and the main program and are used to provide the VS<br />

with the main program data (i.e., object matrix, camera matrix, final matrix, lighting<br />

information, soft binding, etc.) and to manipulate the constants in the VS.<br />

There are two types of plug-ins: simple and radical. The simple plug-in<br />

optimizes its variable space usage. It reserves space in the VS constant variable<br />

space only for those variables that are actually used, while the radical plug-in<br />

reserves the constant space for all the plug-in variables defined in the plug-in if any of them is used (see the first.cvs example in the "Constants" section).

All of the plug-in variables are read-only.<br />

Here is the plug-in grammar:

plugin definition : [radical] plugin '(' argument list ')' ';'
argument list     : argument
                  | argument ',' argument list
argument          : type name array definition
array definition  :
                  | '[' number ']'
type              : float
                  | vector



This is an example of a plug-in:<br />

// plugins.cvs<br />

input vertexshader {<br />

vector Pos;<br />

};<br />

output vertexshader {<br />

vector Pos;<br />

vector DColor;<br />

};<br />

radical plugin FinalMatrix(vector x, vector y, vector z, vector w);<br />

plugin Light(vector Color, vector Direction, float Range);<br />

void vsmain()<br />

{<br />

out.Pos=m4x4(in.Pos, FinalMatrix.x);<br />

out.DColor=Light.Color;<br />

}<br />

The radical plug-in FinalMatrix reserves constant space for all variables x, y, z,<br />

and w, while the simple plug-in Light reserves constant space only for the variable<br />

Color that is used in the shader (see below).<br />

vs.1.0<br />

m4x4 outPos, inPos, c0 // assignment| (o1.1)<br />

mov outDColor, c4 // assignment<br />

// Plugin: FinalMatrix::x=c0<br />

// Plugin: FinalMatrix::y=c1<br />

// Plugin: FinalMatrix::z=c2<br />

// Plugin: FinalMatrix::w=c3<br />

// Plugin: Light::Color=c4<br />

Reservation of Constant Registers<br />

The VSC allows you to reserve temporary and constant registers. This is useful<br />

when you want to use the VSC output with user modifications. Below are examples<br />

of the constant reservation grammar and the use of the constant reservation:<br />

reserve definition : reserve register type '(' register list ')' ';'
register list      : register
                   | register ',' register list
register           : number
                   | number '.' '.' number
                   | number ',' register
register type      : temp
                   | const

reserve const (10,11,12,20..30); // reserve register c10, c11, c12, and c20 through c30<br />

reserve temp (0..4); // reserve register r0, r1, r2, r3, and r4


Arrays<br />

The VSC supports arrays as plug-in variables (see the “Plug-ins” section). As the<br />

array parameter type, only vector type can be used.<br />

Here is an example:<br />

// softbind.cvs
input vertexshader {
   vector JointIndex;
   vector JointWeight;
   //
   vector PosOffset0;
   vector PosOffset1;
};

output vertexshader {
   vector Pos;
};

plugin Skeleton(vector mat[78]); // max 26 bones per one mesh

void vsmain()
{
   vector t=in.JointIndex*768; // 256*3 (3 vectors per matrix)
   //
   out.Pos=m4x3(in.PosOffset0, Skeleton.mat[ t.x ]) * in.JointWeight.x +
           m4x3(in.PosOffset1, Skeleton.mat[ t.y ]) * in.JointWeight.y;
}
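On the application side, the Skeleton plug-in expects the bone matrices packed as three vectors per bone (26 bones x 3 vectors = 78 registers). The following C++ sketch shows one way this could be uploaded; it assumes a valid IDirect3DDevice9*, that the VSC assigned the plug-in a base constant register (called skeletonBaseRegister here), and a row-vector 4x3 layout, none of which are dictated by the article.

#include <d3d9.h>
#include <d3dx9math.h>

// Upload up to 26 bone matrices as 3 float4 registers per bone (a 4x3 matrix),
// matching the 'plugin Skeleton(vector mat[78])' declaration above.
void UploadSkeleton(IDirect3DDevice9* device,
                    UINT skeletonBaseRegister,   // whatever register the VSC assigned to Skeleton.mat
                    const D3DXMATRIX* boneMatrices,
                    UINT boneCount)              // boneCount <= 26
{
    for (UINT bone = 0; bone < boneCount; ++bone)
    {
        // Transpose so that each constant register holds one row of the 4x3
        // transform consumed by m4x3 in the shader (layout assumption).
        D3DXMATRIX t;
        D3DXMatrixTranspose(&t, &boneMatrices[bone]);
        device->SetVertexShaderConstantF(skeletonBaseRegister + bone * 3,
                                         (const float*)&t, 3);
    }
}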

Precompiler<br />


The VSC has a simple built-in precompiler, which uses the grammar shown in the<br />

following table:<br />

Grammar                                  Description
#include '"' filename '"'                The file "filename" is inserted. The file is searched for
                                         starting from the source file directory.
#include '<' filename '>'                The file "filename" is inserted. The file uses the include
                                         search path.
#define name                             Define precompiler variable name.
#undefine name                           Undefine precompiler variable name.
#O0                                      Set optimization level 0 (no optimization).
#O1                                      Set optimization level 1.
#O2                                      Set optimization level 2.
#On                                      Set optimization level Max.
#ifdef name statements                   If precompiler variable name exists, statements will be
  [ #else statements ] #endif            compiled; otherwise, if else statements exist, they will be
                                         compiled.
#ifndef name statements                  If precompiler variable name doesn't exist, statements will
  [ #else statements ] #endif            be compiled; otherwise, if else statements exist, they will
                                         be compiled.



Function Swapping<br />

A VS program has to perform several standard calculations, such as matrix transformations, lighting, etc. To write a specific VS without some form of reuse, we would need a separate shader for every combination of transformation and lighting method.

The number of shaders we have to write for each new effect is multiplied by the number of different transformation methods (normal transformation, soft binding, wobbly effect, teleport effect, etc.) and different lighting methods (ambient light, directional light, light with attenuation, light map, etc.). For example, if we have transformations T1 and T2, lighting methods L1 and L2, and we want to write a VS for effect E, we need to write the shaders T1+L1+E, T1+L2+E, T2+L1+E, and T2+L2+E.

With the VSC, you can divide the VS pipeline into swappable pieces and avoid this explosion in the number of required shaders.

In the transformation example below, you can see an instance of how to use<br />

function swapping for dividing transformation types. First, we introduce the virtual<br />

function vstransform. This function will transform the input position and normal.

Then we write all the different transformations we need.<br />

At compile time, we specify the appropriate function swap: If we want a normal<br />

transformation, we specify vstransform, vstransform_normal; however, if we<br />

want a wobbly effect, we specify vstransform, vstransform_wobbly (see included<br />

file compileall.bat).<br />

// transformation.cvs
input vertexshader {
   vector Pos;
   vector Normal;
};

output vertexshader {
   vector Pos;
   vector Normal;
};

// normal mesh vertex type transformation
void vstransform_normal(vector pos,vector nor)
{
   pos=in.Pos;
   nor=in.Normal;
}

// wobbly effect transformation
void vstransform_wobbly(vector pos, vector nor)
{
   pos=in.Pos * in.Pos;
   nor=in.Normal * 2;
}
}


radical plugin FinalMatrix(vector x, vector y, vector z, vector w);

void vsmain()
{
   // position and normal transformation
   vector pos,nor;
   vstransform(pos, nor);
   // transform position into world space
   out.Pos=m4x4(pos, FinalMatrix.x);
}
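How the swap is specified depends on how you drive the compiler; compileall.bat passes it on the command line. As a purely hypothetical C++ sketch (the real VSC interface is not shown in this article), the idea is simply a map from the virtual function name to the concrete implementation chosen per shader variant:

#include <map>
#include <string>
#include <vector>

// Hypothetical description of one shader variant to build: the source file
// plus the function swaps that resolve each virtual function (e.g., vstransform).
struct ShaderVariant
{
    std::string sourceFile;                        // e.g., "transformation.cvs"
    std::map<std::string, std::string> swaps;      // virtual name -> concrete name
};

// Build the two transformation variants from the listing above.
std::vector<ShaderVariant> MakeTransformVariants()
{
    std::vector<ShaderVariant> variants;

    ShaderVariant normal;
    normal.sourceFile = "transformation.cvs";
    normal.swaps["vstransform"] = "vstransform_normal";
    variants.push_back(normal);

    ShaderVariant wobbly;
    wobbly.sourceFile = "transformation.cvs";
    wobbly.swaps["vstransform"] = "vstransform_wobbly";
    variants.push_back(wobbly);

    return variants;
}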

Optimization

VSC uses very simple optimization techniques, described in this section.<br />

Although they are simple, they produce very optimized assembler code.<br />

Register Scope Optimization<br />


When register A is set by a mov from register B, then within the scope of register A (the scope starts at the instruction that sets A and ends at the instruction that overwrites it with a new value), all occurrences of register A are replaced with register B.

Instruction   Register 1   Register 2   Register 3
mov           A            B
ins1          *            A            *
ins2          *            *            A

...is replaced with:

Instruction   Register 1   Register 2   Register 3
ins1          *            B            *
ins2          *            *            B

A set instruction (any instruction except mov) followed by a mov of its result into a target register is replaced with the set instruction writing directly to the target register (if the target register scope allows it):

Instruction   Register 1   Register 2   Register 3
ins1          A            *            *
mov           B            A

...is replaced with:

Instruction   Register 1   Register 2   Register 3
ins1          B            *            *

All instructions for setting the registers that are not used are removed (output<br />

registers are excluded).<br />




Additional Optimizations<br />

The combination of multiplication and addition is replaced with a single<br />

instruction.<br />

Instruction   Register 1   Register 2   Register 3
mul           A            B            C
add           D            A            E

...is replaced with:

Instruction   Register 1   Register 2   Register 3   Register 4
mad           D            B            C            E

And:

Instruction   Register 1   Register 2   Register 3
mul           A            B            C
add           D            E            A

...is replaced with:

Instruction   Register 1   Register 2   Register 3   Register 4
mad           D            B            C            E
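As an illustration only (this is not the VSC source), the mul+add fusion can be expressed as a simple peephole pass over an instruction list; the Instr type and register naming below are made up for the sketch.

#include <string>
#include <vector>

// A deliberately simplified instruction representation for this sketch:
// regs[0] is the destination, regs[1..] are the sources.
struct Instr
{
    std::string op;                  // "mul", "add", "mad", ...
    std::vector<std::string> regs;   // e.g. { "D", "A", "E" }
};

// Fuse "mul A,B,C" followed by "add D,A,E" (or "add D,E,A") into "mad D,B,C,E".
// The real compiler must also verify that A is not read anywhere else
// (the register scope test); that check is omitted here.
void FuseMulAdd(std::vector<Instr>& code)
{
    for (size_t i = 0; i + 1 < code.size(); ++i)
    {
        Instr& mul = code[i];
        Instr& add = code[i + 1];
        if (mul.op != "mul" || add.op != "add")
            continue;
        if (mul.regs.size() < 3 || add.regs.size() < 3)
            continue;

        const std::string a = mul.regs[0];          // the mul result register
        std::string other;
        if (add.regs[1] == a)      other = add.regs[2];
        else if (add.regs[2] == a) other = add.regs[1];
        else                       continue;        // the add does not consume the mul result

        Instr mad;
        mad.op = "mad";
        mad.regs.push_back(add.regs[0]);            // D
        mad.regs.push_back(mul.regs[1]);            // B
        mad.regs.push_back(mul.regs[2]);            // C
        mad.regs.push_back(other);                  // E
        code[i] = mad;
        code.erase(code.begin() + i + 1);
    }
}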

VSC Standard Library<br />

The VSC provides a small set of built-in functions to simplify VS programming.<br />

These functions are similar to the C standard library functions and include mathematical,<br />

geometric, and VSC functions.<br />

Mathematical Functions<br />

Table 1 lists the mathematical functions provided in the VSC Standard Library.<br />

The list includes functions useful in trigonometry, exponentiation, and rounding.<br />

Functions change the parameters only where noted.<br />

Table 1: Mathematical functions<br />

Function Description<br />

float abs(float a) Absolute value of a.<br />

float acos(float a)    Arccosine of a; a is in range [–1,1]; result is in range [0,pi].

float all(vector x) Returns 0 if any component of x is equal to 0;<br />

returns 1 otherwise.<br />

float any(vector x) Returns 1 if any component of x is equal to 1;<br />

returns 0 otherwise.<br />

float asin(float a)    Arcsine of a; a is in range [–1,1]; result is in range [–pi/2,pi/2].

float atan(float a) Arctangent of a in range [–pi/2,pi/2].<br />

float atan2(float a, float b) Arctangent of b/a in range [–pi,pi].


Function Description<br />

float ceil(float a) Smallest integer not less than a.<br />

float clamp(float a, float b, float c)    a clamped to the range [b,c] as follows: Returns b if a<b. Returns c if a>c. Otherwise returns a.

float cos(float a) Cosine of a in range [–pi,pi].<br />

float cross(vector x, vector y) Cross product of x and y.<br />

float exp(float a) Exponential function e^a.<br />

float exp2(float a) Exponential function 2^a.<br />

float floor(float a) Largest integer not greater than a.<br />

float fmod(float a, float b)    Remainder of a/b with the same sign as a. b must not be equal to 0.

float frac(float a) Fractional part of a.<br />

float ldexp(float a, float b) a * 2^b.<br />

float lerp(float a, float b, float f) (1–f)*a + f*b.<br />

float log(float a) Natural logarithm ln(a). a must be greater than<br />

0.<br />

float log2(float a) Base 2 logarithm of a. a must be greater than<br />

0.<br />

float log10(float a) Base 10 logarithm of a. a must be greater than<br />

0.<br />

float max(float a, float b) Maximum of a and b.<br />

float min(float a, float b) Minimum of a and b.<br />

float pow(float a, float b) a^b.<br />

float round(float a) Closest integer to a.<br />

float rsqrt(float a) Reciprocal square root of a.<br />

float sign(float a) Returns 1 if a>0.Returns –1 if a< 0.<br />

Otherwise returns 0.<br />

float sin(float a) Sine of a in range [–pi,pi].<br />

float sqrt(float a) Square root of a. a must be greater than 0.<br />

void fsplit(float a, float b, float c) Splits a into integral part b and fractional part c.<br />

void sincos(float a, float sin, float cos) sin is set to sin(a), and cos is set to cos(a).<br />

Geometric Functions<br />

Table 2 lists the geometric functions provided in the VSC Standard Library.<br />

Table 2: Geometric functions<br />


Function Description<br />

float distance(vector x, vector y) Euclidean distance between points x and y.<br />

float length(vector x) Euclidean length of vector x.<br />

float normalize(vector x) Returns a vector of length 1 that points in the<br />

same direction as vector x.<br />

vector reflect(vector n, vector i) Returns reflection vector for surface normal n<br />

and eye to position vector direction i.<br />




VSC Functions

Table 3 lists the VSC functions provided in the VSC Standard Library.<br />

Table 3: VSC functions<br />

vector bumpmapuv(vector lightdir)
   Bump map vector calculation for uv output.
vector bumpmapcolor(vector lightdir)
   Bump map vector calculation for color output.
vector light(vector pos, vector normal, vector lightpos, vector lightcolor)
   Simple lighting calculation.
vector lightadvanced(vector pos, vector normal, vector lightpos, vector lightcolor, vector lightrange)
   Advanced lighting calculation with light falloff.
vector lightattenuation(vector pos, vector normal, vector lightpos, vector lightcolor, vector tex0, vector tex1)
   Lighting calculation with a texture attenuation lookup. The function returns the texture uv for stage 0 in tex0 and the texture uv for stage 1 in tex1.

Examples

This section includes example programs written in VSC.

Lighting

Lighting shaders:
• Light attenuation
• Per-pixel specularity

Light attenuation:

#include <...>

void vsmain()
{
   // position transformation
   vector pos,nor;
   vstransform(pos, nor);
   // std output
   out.Pos=m4x4(pos, FinalMatrix.x);
   // attenuation texture coordinates
   lightattenuation(pos, nor, SelectedLight.Pos, SelectedLight.Color,
                    SelectedLight.Range, out.Tex0, out.Tex1);
   // light intensity
   out.DColor=light(pos, nor, SelectedLight.Pos, SelectedLight.Color);
   // base texture for alpha
   #ifdef inTex0
   out.Tex2=in.Tex0;
   #endif
}

Per-pixel specularity:

#include <...>

void vsmain()
{
   vector pos,nor;
   vstransform(pos, nor);
   // std output
   out.Pos=m4x4(pos, FinalMatrix.x);
   //
   vector lightdir=normalize(SpecularLight.Pos - pos);
   vector half=normalize(normalize(CameraMatrix.Pos - pos) + lightdir);
   //
   out.DColor.xyzw=SpecularLight.Color;
   // color cube normal color
   out.Tex0=bumpmapuv(half);
   //
   #ifdef inLightMapUV
   out.Tex1=in.LightMapUV;
   #else
   out.Tex1=(0,0,0,0);
   #endif
}

Base Shaders

Base shaders:<br />

� bump mapping<br />

� multi-texture blending<br />

Bump mapping:

#include <...>

void vsmain()
{
   // position transformation
   vector pos,nor;
   vstransform(pos, nor);
   // std output
   out.Pos=m4x4(pos, FinalMatrix.x);
   #ifdef inTex1
   out.Tex0=in.Tex1;
   #else
   out.Tex0=in.Tex0;
   #endif
   // normal map
   vector lightdir=normalize(SelectedLight.Pos - pos);
   out.Tex1=bumpmapuv(lightdir);
   // attenuation texture coordinates
   lightattenuation(pos, nor, SelectedLight.Pos, SelectedLight.Color,
                    SelectedLight.Range, out.Tex2, out.Tex3);
   // light color
   out.SColor=SelectedLight.Color;
}

Multi-texture blending:

#include <...>

void vsmain()
{
   vector pos,nor;
   vstransform(pos, nor);
   out.Pos=m4x4(pos, FinalMatrix.x);
   // define color
   out.DColor=in.Color.zwxy;
   out.SColor=in.Color.xyzw;
   // define as many texture outputs as given
   #ifdef inTex1
   out.Tex0=in.Tex0;
   out.Tex1=in.Tex1;
   out.Tex2=in.Tex1;
   #else
   out.Tex0=in.Tex0;
   out.Tex1=in.Tex0;
   out.Tex2=in.Tex0;
   #endif
}

Effects

Effects:

� Volumetric shadows<br />

Volumetric shadows:

#include <...>

void vsmain(float projectionlength)
{
   // in Tex0.x is plane D parameter
   // test if face normal is facing towards light
   vector pos,nor;
   vstransform(pos, nor);
   //
   vector lightdir=pos - Light1.Pos;
   lightdir=normalize(lightdir);
   float test=dp3(nor, lightdir).x;
   pos.xyz=pos + (test


Gallery

Acknowledgments<br />

I am very grateful to Gregor Grlj for proofreading. I am also thankful to Helena<br />

Smigoc and Mladen Zagorac for lecturing.


Shader Disassembler

Jean-Sebastian Luce

NOTE: In this article, the word "shader" refers to programs for both the vertex and pixel pipelines.

Microsoft <strong>DirectX</strong> 9 introduces a new shading language, High Level Shading Language<br />

(HLSL), which is much easier to use compared to shader assembler. However,<br />

since the video driver understands nothing but shader byte code, both HLSL<br />

and assembler shader code have to be respectively compiled and assembled by<br />

the <strong>DirectX</strong> runtime. Unlike shader assembler, shader byte code is not “readable”<br />

by humans. This article gives a solution for converting shader byte code (back)<br />

into assembly instructions.<br />

A <strong>Shader</strong> Disassembler: What Is It Useful For?<br />

Although HLSL has many benefits for the programmer, its main drawback is that<br />

generated code is not always as optimal as hand-written assembly code. Sometimes,<br />

when coding in HLSL, if the programmer forgets to use fully vectorized<br />

operations, a non-optimal binary program can still result. For instance, the following<br />

code scales a 2D vector by two uniform constants:<br />

// target: vs_1_1
struct VS_OUTPUT {float4 Position : POSITION; float2 Tc0 : TEXCOORD0;};

VS_OUTPUT vsMain(float4 Position : POSITION, uniform float4x4 ObjectPrCamera,
                 uniform float ScaleU, uniform float ScaleV)
{
    VS_OUTPUT Output = (VS_OUTPUT) 0;
    Output.Position = mul(Position, ObjectPrCamera);
    Output.Tc0.x = Position.x*ScaleU;
    Output.Tc0.y = Position.y*ScaleV;
    return Output;
}

This results in the HLSL compiler generating the equivalent of the following assembly:

vs_1_1
dcl_position0 v0
m4x4 oPos, v0, c0       ; c0-c3 = ObjectPrCamera
mul oT0.x, v0.x, c4.x   ; c4 = ScaleU
mul oT0.y, v0.y, c5.x   ; c5 = ScaleV

An obvious improvement is to scale by a 2D vector instead of two scalar<br />

constants:<br />

// target: vs_1_1
struct VS_OUTPUT {float4 Position : POSITION; float2 Tc0 : TEXCOORD0;};

VS_OUTPUT vsMain(float4 Position : POSITION, uniform float4x4 ObjectPrCamera,
                 uniform float2 ScaleUV)
{
    VS_OUTPUT Output = (VS_OUTPUT) 0;
    Output.Position = mul(Position, ObjectPrCamera);
    Output.Tc0 = Position*ScaleUV;
    return Output;
}

...which is compiled to the following (note that one less instruction is generated):

vs_1_1
dcl_position0 v0
m4x4 oPos, v0, c0     ; c0-c3 = ObjectPrCamera
mul oT0.xy, v0, c4    ; c4 = ScaleUV

Moreover, early pixel shader hardware (ps_1_x) is very limited in its capabilities<br />

compared to later versions (few address operations and arithmetic instructions,<br />

small number of available instruction slots). Therefore, HLSL coding for these<br />

platforms should be done with care. Even ps_2_0-compliant hardware has important<br />

limitations (like the lack of arbitrary swizzling), which can force the compiler<br />

to use more instruction slots. An HLSL programmer should at least know the limitations<br />

of the target platform, and getting a look at the generated assembly would<br />

be helpful in writing optimal code. For this reason, reading and checking the generated<br />

shader assembly of an HLSL program is important. In the <strong>DirectX</strong> 9 SDK,<br />

the only way to view the assembly code generated by the HLSL compiler is to<br />

use the external compiler fxc.exe with the -Fc flag. Because this tool is not suited<br />

to be called from another program, and since we can get clearer assembly, as we<br />

see later in the “Disassembler Integration and Customization” section, let’s<br />

implement a vertex/pixel shader disassembler. In addition, this disassembler can<br />

be used to view shaders where you don’t have the source assembly.<br />
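For completeness, here is a minimal C++ sketch of obtaining the byte code that such a disassembler consumes, using the stock D3DX compiler. The entry point name ("vsMain") and profile ("vs_1_1") are placeholders for whatever your shader uses, and error handling is reduced to the essentials.

#include <cstring>
#include <d3dx9shader.h>

// Compile an HLSL vertex shader with D3DX and hand the resulting token
// stream to a byte-code disassembler.
bool CompileAndInspect(const char* hlslSource)
{
    LPD3DXBUFFER        byteCode  = NULL;
    LPD3DXBUFFER        errors    = NULL;
    LPD3DXCONSTANTTABLE constants = NULL;

    HRESULT hr = D3DXCompileShader(hlslSource, (UINT)strlen(hlslSource),
                                   NULL, NULL,            // no #defines, no include handler
                                   "vsMain", "vs_1_1", 0,
                                   &byteCode, &errors, &constants);
    if (FAILED(hr))
    {
        if (errors) errors->Release();   // errors->GetBufferPointer() holds the messages
        return false;
    }

    // This DWORD stream (version token, instruction tokens, ..., end token)
    // is what a byte-code disassembler parses.
    const DWORD* tokens = (const DWORD*)byteCode->GetBufferPointer();
    (void)tokens;                        // pass to your disassembler here

    // The constant table is also what the register-name callback shown later can use.
    if (constants) constants->Release();
    byteCode->Release();
    return true;
}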

<strong>Shader</strong> Byte Code Description<br />

To understand how a shader byte code disassembler works, we start by describing<br />

the byte code itself: the elements of the byte code (called tokens) and their<br />

formatting. The <strong>DirectX</strong> 9 shader byte code is simpler than, for instance, the<br />

80x86 code; instructions are always made of 32-bit aligned tokens in little-endian byte order.

The first token in a compiled shader program is always a version token, corresponding<br />

to vs_x_y and ps_x_y assembler instructions.


Version Token
  Bits 31-16:  0xFFFF = pixel pipeline, 0xFFFE = vertex pipeline
  Bits 15-8:   major version
  Bits 7-0:    minor version

For 1_x vertex and pixel shader targets, the minor version field contains a subversion.<br />

For instance, 0xFFFF0103 is a version token meaning ps_1_3. But for<br />

<strong>DirectX</strong> 9 targets (>=2_0), the minor version has another meaning:<br />

0x00: normal target (for instance, 0xFFFF0200 means ps_2_0)<br />

0x01: extended target (for instance, 0xFFFF0201 means ps_2_x)<br />

0xFF: software target (for instance, 0xFFFF02FF means ps_2_sw)<br />
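A minimal C++ sketch of decoding a version token according to the layout above (the function name is my own):

#include <cstdint>
#include <cstdio>

// Decode a version token such as 0xFFFE0101 (vs_1_1) or 0xFFFF0200 (ps_2_0).
void PrintShaderVersion(uint32_t token)
{
    const uint32_t pipeline = token >> 16;        // 0xFFFF = pixel, 0xFFFE = vertex
    const uint32_t major    = (token >> 8) & 0xFF;
    const uint32_t minor    = token & 0xFF;       // sub-version, or 0x01/0xFF for _x/_sw targets

    printf("%s, version %u.%u\n",
           (pipeline == 0xFFFF) ? "pixel shader" : "vertex shader", major, minor);
}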

Each assembler instruction generates one instruction token (even if this instruction<br />

is a macro like “m4x4,” which uses four instruction slots in the graphic chip),<br />

containing the following information:<br />

Instruction Token
  Bit 31:      0
  Bit 30:      Co-issue
  Bit 29:      0
  Bit 28:      Predicate
  Bits 27-24:  Token count (excluding this one)
  Bits 23-16:  Specific controls
  Bits 15-0:   Operation code

The co-issue bit has meaning only on pixel shader versions earlier than 2_0. When<br />

this bit is set, the corresponding instruction is executed simultaneously with the<br />

preceding instruction. The two instructions can be paired only if they are executed<br />

concurrently in the RGB and alpha pipes.<br />

The predicate bit has meaning only on vertex and pixel shader versions 2_0<br />

and later. When this bit is set, the value of the predicate register (p0) is used to<br />

control, at run time, the instruction write per component. For instance:<br />

if p0=(true, true, false, false)<br />

"(p0) add r1, r2, r3” only writes r1.xy.<br />


When the predicate bit is set, an extra predicate source token is inserted between<br />

the destination token and the first source token. This extra token describes the<br />

predicate register used and is formatted like the source register token, which will<br />

be seen later.<br />

The specific controls field has meaning only for the ifc, breakc, and setp<br />

instructions, with values from 1 to 6 corresponding to gt, eq, ge, lt, ne, and le,<br />

respectively.<br />

The operation code value is one of the values in the D3DSHADER_IN-<br />

STRUCTION_OPCODE_TYPE enum and is defined for all vertex and pixel<br />

assembler instructions (for instance, the nop operation code is 0, and mov is 1).<br />
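Putting these fields together, a C++ sketch of pulling apart an instruction token (the names are my own; the masks simply follow the bit layout described above):

#include <cstdint>

struct InstructionInfo
{
    uint32_t opcode;           // D3DSHADER_INSTRUCTION_OPCODE_TYPE value (bits 15-0)
    uint32_t specificControls; // comparison code for ifc/breakc/setp (bits 23-16)
    uint32_t tokenCount;       // parameter tokens that follow (bits 27-24)
    bool     predicated;       // bit 28
    bool     coIssued;         // bit 30 (pixel shaders earlier than 2_0 only)
};

InstructionInfo DecodeInstructionToken(uint32_t token)
{
    InstructionInfo info;
    info.opcode           = token & 0xFFFF;
    info.specificControls = (token >> 16) & 0xFF;
    info.tokenCount       = (token >> 24) & 0xF;
    info.predicated       = ((token >> 28) & 0x1) != 0;
    info.coIssued         = ((token >> 30) & 0x1) != 0;
    return info;
}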

Depending on the precise operation that an instruction token specifies<br />

(defined by the operation code field), destination/source parameter tokens can follow<br />

the instruction token. For example, if the instruction token specifies an add operation,<br />

one destination and two source parameter tokens would follow. The following<br />

table outlines the description of the parameter token:


Destination Parameter Token
  Bit 31:      1
  Bits 30-28:  Register type (1)
  Bits 27-24:  Shift scale
  Bits 23-20:  Result modifier
  Bits 19-16:  Write mask
  Bits 15-14:  0
  Bit 13:      Relative addressing
  Bits 12-11:  Register type (2)
  Bits 10-0:   Register number
In the destination parameter token, the register type is split into two parts: (1) =<br />

bits 0-2 and (2) = bits 3 and 4. The final value is one of those defined in the<br />

D3DSHADER_PARAM_REGISTER_TYPE enum. Using this field, we can determine<br />

if the register is a constant, a temporary, a texture, etc. This field is used<br />

with the register number field to get the full register name (for instance, c21).<br />

There are a few special cases for register type/index:<br />

� The registers an and tn share the same register type value (i.e., D3DSPR_<br />

ADDR=D3DSPR_TEXTURE), since addressing and texture registers are<br />

valid only in the vertex and pixel pipelines, respectively.<br />

� The registers oPos, oFog, and oPts have the same type D3DSPR_RASTOUT<br />

but are distinguished by the register number, ranging from 0 to 2.<br />

� Similarly, the registers vPos and vFace have the same type D3DSPR_MISC-<br />

TYPE but are distinguished by register numbers 0 and 1.<br />

� The relative addressing bit is meaningful only on vertex shader versions 3_0<br />

and later.<br />

� The write mask field contains a bit per destination channel, set to 1 if the<br />

component is written to (bit 16=X, bit 17 = Y, …).<br />

The final two fields modify certain instructions rather than the destination register<br />

itself.<br />

� The shift scale field is a 4-bit signed scale (0x00=none, 0x01=_x2,<br />

0x0F=_d2, for instance mul_x2 r0, r1, t0).<br />

� The result modifier field can be a combination of _sat, _pp (partial precision),<br />

and _centroid.<br />

The format of the source parameter token is similar to the destination parameter<br />

token, with the exception that the write mask, result modifier, and shift scale are<br />

replaced by swizzle and source modifiers.<br />

Source Parameter Token
  Bit 31:      1
  Bits 30-28:  Register type (1)
  Bits 27-24:  Source modifier
  Bits 23-16:  Swizzle
  Bits 15-14:  0
  Bit 13:      Relative addressing
  Bits 12-11:  Register type (2)
  Bits 10-0:   Register number

� The register type field works identically to the field in the destination parameter<br />

token.<br />

� The source modifier field is one of the values in the D3DSHADER_PARAM_<br />

SRCMOD_TYPE enum.


� The swizzle field contains (for each of the four destination components) two<br />

bits selecting the source component (0=X, 1=Y, 2=Z, and 3=W). For<br />

instance, if bits 17-16 hold 0x1, this means that the source component Y is<br />

swizzled into the destination component X.<br />

� The relative addressing flag has meaning for all vertex shader versions and<br />

pixel shader versions 3_0 and later. Relative addressing enables constant register<br />

index selection at run time, depending on the value of an address register<br />

(a0 or aL) component. The actual constant register index selected is the<br />

sum of the token’s constant register number and the (run-time) value of the<br />

address register component. For instance:<br />

mov r0, c16[a0.x]<br />

copies the value of the constant register n “16+a0.x” into r0.<br />

On the vs_1_1 target, only a0.x is available. On vs_2_0 and ps_3_0 or<br />

later targets, a0.x/y/z/w and aL are available.<br />

When relative addressing is enabled and when the shader version is at least 2_0<br />

(vs_1_1 always implicitly uses the unique address register a0.x), a relativeaddressing<br />

token follows this parameter token.<br />

Relative-Addressing Token
  Bit 31:      1
  Bits 30-28:  Register type (1)
  Bits 27-20:  Unused
  Bits 19-16:  Component index
  Bits 15-14:  0
  Bit 13:      Unused
  Bits 12-11:  Register type (2)
  Bits 10-0:   Register number
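A C++ sketch of decoding a source parameter token according to the tables above (the helper names are mine; the register-type reassembly follows the split described for the destination token):

#include <cstdint>

struct SourceParam
{
    uint32_t registerType;      // D3DSHADER_PARAM_REGISTER_TYPE value
    uint32_t registerNumber;
    uint32_t sourceModifier;    // D3DSHADER_PARAM_SRCMOD_TYPE value
    uint8_t  swizzle[4];        // source component selected for dest x,y,z,w (0=X .. 3=W)
    bool     relativeAddressing;
};

SourceParam DecodeSourceToken(uint32_t token)
{
    SourceParam p;
    // Register type is split: part (1) in bits 30-28, part (2) in bits 12-11.
    p.registerType       = ((token >> 28) & 0x7) | (((token >> 11) & 0x3) << 3);
    p.registerNumber     = token & 0x7FF;
    p.sourceModifier     = (token >> 24) & 0xF;
    p.relativeAddressing = ((token >> 13) & 0x1) != 0;
    for (int i = 0; i < 4; ++i)
        p.swizzle[i] = (uint8_t)((token >> (16 + 2 * i)) & 0x3);
    return p;
}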

The special assembler instructions def, defi, and defb, respectively, require floating-point<br />

number, integer, or Boolean source parameters, which represent immediate<br />

values written in the assembly.<br />

The assembler instruction dcl is followed by a special parameter token giving<br />

(in addition to the register type and number) a usage and usage index, or a texture<br />

type (for a sampler register).<br />

A comment token may also be used by the compiler to store debug information<br />

(we read it only to compute the jump offset to the next instruction token).<br />

Comment Token
  Bit 31:      0
  Bits 30-16:  Length in DWORDs (not including the comment token)
  Bits 15-0:   0xFFFE

An end token (identified by the value 0x0000FFFF) terminates the shader byte<br />

code stream.<br />

Additional information about <strong>DirectX</strong> 9 shader byte code can be found at [1].



Disassembly Algorithm<br />

Since each instruction is followed by one or more parameter tokens before reaching<br />

the next instruction token, to disassemble the shader byte code, all we have to<br />

do is parse each instruction token and then, depending on the instruction found,<br />

parse its expected parameter tokens, output the disassembly, and loop for the<br />

next instruction.<br />

The first thing to do with a new token is check it against the special tokens<br />

(namely, the version token, comment tokens, and the end token). If we are not<br />

processing one of those tokens, we have to check the operation code of the<br />

instruction. Then we come to the first difficult part of the task; there are a lot of<br />

different instructions with various prototypes (type and number of arguments of<br />

the instruction).<br />
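The overall loop can be sketched in C++ as follows. This is a simplified outline only, not the article's actual code; it merely walks the tokens, following the token layout described above, and dispatches on the opcode.

#include <cstdint>
#include <cstdio>

// Walk a compiled shader token stream: version token, then instruction and
// comment tokens, terminated by the end token 0x0000FFFF.
void WalkShaderTokens(const uint32_t* tokens, size_t count)
{
    if (count == 0)
        return;

    size_t i = 0;
    const uint32_t version = tokens[i++];   // vs_x_y / ps_x_y
    printf("%s_%u_%u\n", (version >> 16) == 0xFFFF ? "ps" : "vs",
           (version >> 8) & 0xFF, version & 0xFF);

    while (i < count)
    {
        const uint32_t token = tokens[i];

        if (token == 0x0000FFFF)            // end token
            break;

        if ((token & 0xFFFF) == 0xFFFE)     // comment token: skip its payload
        {
            i += 1 + ((token >> 16) & 0x7FFF);
            continue;
        }

        const uint32_t opcode     = token & 0xFFFF;
        const uint32_t paramCount = (token >> 24) & 0xF;

        // A real disassembler switches on 'opcode' here and formats the
        // destination/source parameter tokens that follow.
        printf("opcode %u, %u parameter token(s)\n", opcode, paramCount);

        i += 1 + paramCount;
    }
}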

To solve that issue, we use C preprocessor macros. If you read the Dx9-<br />

<strong>Shader</strong>BCode_Instr.h file, you’ll notice that (almost) every instruction is<br />

described by a line giving its OpCode_ID (for instance, D3DSIO_NOP) and then<br />

its parameters (RegSrc/Dst for source/destination register, Real/Bool/Integer for<br />

immediate values). A switch is done on the opcode, and for each instruction, the<br />

macros expand the parameters with calls to instruction/parameters disassembly<br />

methods. A few instruction opcodes have to be handled in a specific way because<br />

they have a unique syntax (like the dcl instruction) or a behavior that depends on<br />

the vertex/pixel shader target version (like texcoord/texcrd and tex/texld).<br />

We now look, in further detail, at how each instruction is disassembled. First,<br />

the CatInstr method outputs the instruction name in lowercase, appends an<br />

optional specific control text for some comparison instructions (ifc, breakc, and<br />

setp), and then appends an optional shift text and a destination modifier text (both<br />

are taken from the next destination parameter token, if applicable).<br />

Then, the destination and source parameters are disassembled by the<br />

CatParamRegSrc and CatParamRegDst methods. These two methods are based<br />

upon the CatRegister method, which outputs the register name and an optional<br />

index. CatParamRegSrc begins to write the start of a possible source modifier (for<br />

example, negation or bias), followed by the register name, an optional address<br />

text with the right swizzle (in this case, an address register token would follow<br />

the current source parameter token), and the end of the modifier if necessary.<br />

Finally, a swizzle specifier is added (".x" is written for ".xxxx").

CatParamRegDst is simpler, since only a write mask has to be appended to the<br />

register.<br />

CoIssue and Predicated instruction modifiers are handled before the OpCode<br />

switch. The predicated case is the more difficult one, since the predicate register<br />

token is inserted between the first and second parameter tokens (InstrToken,<br />

Param1Token, PredicateToken, Param2Token, ...), so the idea is to jump over the<br />

predicate token in order to make the following instruction disassembly work.



Disassembler Integration and Customization<br />

This disassembler is easily integrated with C++ code; as an example, in the<br />

DisAsm.cpp file, only one DisAssemble call is required, passing the byte code<br />

data as input parameters. The disassembly is returned in an allocated string (valid<br />

until another DisAssemble call).<br />

In addition, symbol information taken from constant tables (for example, the<br />

fact that the constant ObjectPrCamera uses c0 to c3 registers) can be incorporated<br />

to help in understanding the assembly.<br />

A callback method pointer can be set to override constant and sampler register<br />

name disassembly. This callback is called with the register index and a “user<br />

data” in parameters. If the user returns a valid string pointer, the constant will be<br />

displayed as “cMyConstantName” or “sMySamplerName” in the output asm<br />

(returning 0 selects the normal register name, which is defined by the <strong>DirectX</strong><br />

shader assembler reference as when no callback is installed). Here is a sample of<br />

how to implement a constant name callback:<br />

struct SConstantTable {
    D3DXCONSTANT_DESC* Constants;
    Natural Count;
};

const char* GetRegisterNameC(Natural iRegister, void* UserData)
{
    SConstantTable* Table = (SConstantTable*) UserData;
    for (Natural i=0; i<Table->Count; i++)
    {
        const D3DXCONSTANT_DESC& Constant = Table->Constants[i];
        if (Constant.RegisterSet!=D3DXRS_FLOAT4)
            continue;
        Integer iRegisterRel = iRegister-Constant.RegisterIndex;
        if (iRegisterRel<0 || iRegisterRel>=Constant.RegisterCount)
            continue;
        static char Name[128];   // static so the returned pointer stays valid
        if (Constant.RegisterCount==1)
            strcpy(Name, Constant.Name);
        else
            sprintf(Name, "%s_%d", Constant.Name, iRegisterRel);
        return Name;
    }
    return Null; // index out of bounds
}

SConstantTable Table; // to fill
CDx9ShaderDisasm::UserData = &Table;
CDx9ShaderDisasm::CallbackGetRegisterNameC = &GetRegisterNameC;

The input and output (vs_3_0 only) registers can be displayed using their declaration semantic (usage and usage index) rather than their register index (again, this helps in understanding the assembly). There is a special case for the vs_3_0 output registers, since several semantics can be linked to the same register.



In this case, we get back to the register index naming convention to remove any<br />

ambiguity. Because these registers cannot be understood by <strong>DirectX</strong> assembler,<br />

there is the possibility of disabling this (enabled by default) semantic register<br />

naming behavior by changing two booleans’ values (one for the input registers<br />

and one for the output registers).<br />

References

[1] Direct3D Driver Shader Codes (MSDN Library, Direct3D DDK),
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/graphics/hh/graphics/d3denum-9o6f.asp


Color Plate 1. A PVS optimized scene (See page 26.)<br />

Color Plate 2. Planet with cloud cover, noise texture, ocean, and per-pixel lighting (See page 38.)


Color Plate 3. The contents of the position and normal map MRTs as well as the final result in wireframe and solid mode (See page 56.)


Color Plate 4. (left) Diffuse (dot3) sunlight plus radiosity hemisphere lookup. (right) <strong>With</strong> the<br />

addition of specular and Fresnel contributions, using a static cube map holding an image of<br />

typical surroundings. (See page 113.)<br />

Color Plate 5. (left) The images in Figures 1, 4, and 5 in the “Hemisphere Lighting with Radiosity<br />

Maps” article were created along the zoomed-in section on this 2048x512 radiosity map. (See page<br />

116.)


Color Plate 6. These images show the hemisphere lighting on its own, using a single DXT1 format<br />

radiosity map that encodes both shadow and ground color information. (See page 117.)<br />

Color Plate 7. The complete lighting model, combining a base texture, radiosity hemisphere,<br />

Fresnel cube map, and dot3 sunlight. (See page 121.)


Color Plate 8. Rendering drop of water (See page 202.)<br />

Color Plate 9. Screen shot from the water demo illustrating advanced<br />

water and sky dome effects (See pages 208 and 249.)<br />

Color Plate 10. Screen shot from the water demo illustrating<br />

advanced water and sky dome effects (See pages 208 and 249.)


Color Plate 11. Position data (See page 253.)<br />

Color Plate 12. Normal data (See page 255.)


Color Plate 13. Diffuse data (See page 255.)<br />

Color Plate 14. The blue cone shapes represent the areas where the lights affect the scene, i.e., the<br />

pixels on which the pixel shader will be executed. (See page 266.)


Color Plate 15. Final render using deferred shading from demo (See page 269.)<br />

Color Plate 16. A screen shot from the Animusic demo shows motion blur via geometry and shading<br />

distortion. (See page 299.)


Color Plate 17. A translucent and iridescent material computed in a single pass with a 2.0 pixel shader. The diffuse illumination is computed by modulating the base texture (C) with the combined scattered (A) and diffuse (B) illumination contributions. The resulting diffuse color is pre-multiplied by inverse alpha (F) to simulate correct alpha-blending in a single pass without clamping specular highlights. To compute the iridescence contribution (J) to the final image, we compute glossy specular highlights (G) and combine them with iridescence values (H) resulting from applying an iridescence map. The resulting specular iridescent shading is combined with the pre-blended translucent diffuse shading (I) to achieve the final visual effect (K). (Images courtesy of Natalya Tatarchuk and Chris Brennan, ATI Technologies Inc.)


Color Plate 18. Stereoscopic rendering in hardware (See page 336.)<br />

Color Plate 19. Result of a hand-drawn image as input in a 3D scene (See page 346.)


Color Plate 20. A real-time implementation of Paul Debevec’s Rendering with Natural Light animation using the separable<br />

Gaussian filter. (See page 451.)


Color Plate 21. Figure 1 (See page 471.), Figure 2 (See page 471.), Figure 3 (See page 473.), Figure 4 (See page 474.), Figure 5 (See page 475.), Figure 6 (See page 476.)


Color Plate 22. Figure 7 (See page 478.), Figure 8 (See page 478.), Figure 9 (See page 478.), Figure 10 (See page 478.), Figure 11 (See page 479.), Figure 12 (See page 479.)


Color Plate 23. The two ripple centers are colored magenta and fade as the ripple fades. These five images show, in left to right order, the ripples’ dissipation after<br />

they have fully propagated. (See page 498.)<br />

Color Plate 25. A 7×7 Kuwahara filter plus outlines based on the Sobel edge<br />

detection filter has been applied to the image for real-time posterization. The<br />

7×7 filter’s advantage over the 5×5 filter is better posterization for about the<br />

same number of instructions. (See page 510.)<br />

Color Plate 24. A 5x5 Kuwahara filter plus outlines based on the Sobel edge<br />

detection filter has been applied to the image for real-time posterization.<br />

(See page 505.)


Example of real-time depth of field simulation using post-processing techniques described in the<br />

article "Real-Time Depth of Field Simulation." Courtesy of ATI Technologies, Inc. (See page 556.)<br />

Example of the wireframe for the real-time depth of field simulation using post-processing<br />

techniques described in the article “Real-Time Depth of Field Simulation.” Courtesy of ATI<br />

Technologies, Inc. (See page 556.)<br />

Color Plate 26


Screen shot taken from the soft shadows demo. This 6,000-triangle temple scene features six<br />

lights, two of which are casting spherical soft shadows, and runs at up to 35 fps on a Radeon<br />

9700. The two lights are animated. (See page 578.)<br />

Another screen shot from the temple scene, with a different viewpoint. The white dot near the<br />

cattaur (a cross between a cat and a centaur) shows the main light's position. (See page 578.)<br />

Color Plate 27


Index

2D noise function, 34

2D shapes, using to optimize shading<br />

passes, 266-268<br />

3D noise function, 34<br />

A<br />

above-water scene, creating,<br />

213-215<br />

acceleration, calculating, 69-70<br />

accumulated voxels, 163-164<br />

accumulation buffer, using for depth<br />

of field, 531<br />

active stereo, 324<br />

adaptive displacement mapping, 82<br />

address registers, avoiding, 390<br />

aliasing, 142-144, 580<br />

alpha-blended objects, 617<br />

alpha blending, 426, 617<br />

alpha testing, 425<br />

ambient component, 135-136<br />

anaglyph, 335, 338<br />

implementing, 335<br />

arbitrarily high specular exponent,<br />

150-151<br />

arithmetic interpolation, 155<br />

array-of-structures data, 386<br />

arrays in VSC, 657<br />

art pipeline integration, 637-639<br />

aspect ratio, setting, 326<br />

asynchronous notification, 59-61<br />

attenuation, 135

attribute buffer, 253-255<br />

building pass, 255-257<br />

extending, 261<br />

shading pass, 258-260<br />

specular calculation, 261<br />

B<br />

backface culling, 418-419<br />

base shaders, 663-664<br />

basis, 329<br />

basis functions, 235<br />

bilinear filtering, 321<br />

billboards, 107<br />

aligning, 107-109<br />

binary shaders, 636, 641-642<br />

bits, reclaiming, 5-6<br />

Blinn shading, 149-150<br />

implementing, 151-152, 154,<br />

155<br />

blur effect, 619-620<br />

blurriness factor, 532<br />

bokeh, 555<br />

break instructions, using to<br />

early-exit, 392<br />

bump map, 136<br />

bump mapping, 607<br />

implementing, 663-664<br />

butterfly operations, 459-460<br />

implementing, 460-461<br />

C<br />

camera,<br />

models, 530-531<br />

setting for stereoscopic rendering,<br />

329<br />

transforming, 331-332<br />

camera lens, focus of, 555<br />

Canny edge detection filter,<br />

444-446<br />

implementing, 446-449<br />

car paint effect, 293<br />

base color, 294-295<br />

clear coat, 295-296<br />

implementing, 296-298<br />

cartoon rendering, 474-476<br />

Cg,<br />

using for drops of water effect,<br />

203-206<br />

using for sun effects, 127-130<br />

using to render planets, 33-39<br />

character glyph, using for mosaic<br />

effect, 519-520<br />

ChromaDepth, 337, 338<br />

implementing, 337-338<br />

circle of confusion, 531<br />

simulating, 532-539<br />

clipping, 420<br />

cloth animation,<br />

implementing, 45-55<br />

implementing with shaders,<br />

44-56<br />

initializing, 44-45<br />

normals, 53-54<br />

overview of, 56<br />

rendering, 54-55<br />

setting constraints, 52-53<br />

setting positions, 48-51<br />

cloth model, 40-44<br />

cloth, simulating, 40-57<br />

cluster galaxy, 123<br />

rendering, 123-124<br />

Codecreatures engine, 625<br />


collision, determining interactions<br />

in, 66-71<br />

collision detection, 58<br />

using cube map, 61<br />

visibility test, 59-61<br />

z test, 59-61<br />

collision map, 63, 65<br />

creating, 63-66<br />

color remapping, 467<br />

color space, 440<br />

conversion, 471<br />

ColorCode 3-D, 336, 338<br />

using to reduce ghosting,<br />

336-337<br />

comment token, 671<br />

compression, 234-235<br />

implementing, 237-238<br />

normal map, 185-186<br />

compression transform data type,<br />

6-7<br />

conditional data, arranging, 392<br />

conditionals,<br />

eliminating, 391<br />

using to early-exit, 392<br />

conjugate gradients algorithm,<br />

374-375, 378<br />

constant mapping class, 641<br />

constant register mapping, 641<br />

constant registers, reserving in<br />

VSC, 656<br />

constants,<br />

defining, 388<br />

in VSC, 652-653<br />

named, 433<br />

reasons for naming, 432<br />

constants file, creating, 433-434<br />

constraints, handling, 45-47<br />

contrast, enhancing, 286<br />

control statements, eliminating,<br />

397<br />

CPU vs. GPU performance,<br />

377-379<br />

cube map,<br />

filtering floating-point, 320-321<br />

floating-point, 319<br />

using floating-point, 320<br />

using for collision detection, 61<br />

D<br />

darkening, using for shadows, 592<br />

data type, compression transform,<br />

6-7



data types, <strong>DirectX</strong> 9, 4-5<br />

decals, using, 190<br />

decimation in time algorithm,<br />

457-458<br />

deferred shading, 251<br />

advantages of, 251-252<br />

optimizing, 260-262, 263-268<br />

using shadows with, 261-262<br />

dependency chains, 389<br />

minimizing, 389<br />

depth aliasing, 102-103, 580-581<br />

depth-based shadows, 580-581<br />

depth buffer,<br />

clearing, 427<br />

setting, 327<br />

depth buffering, drawbacks to,<br />

169-170<br />

depth of field, 529<br />

simulating using circle of confusion,<br />

532-539<br />

simulating using separable<br />

Gaussian filter, 540-554<br />

techniques, 531-532<br />

depth, calculating, 91-92, 329-330<br />

depth of field effect, 619-620<br />

depth peeling, 263<br />

desaturation effect, 619<br />

destination masks, using, 389<br />

destination parameter token, 670<br />

detail map, creating, 214-215<br />

diffuse component, 132-133<br />

diffuse lighting, 132<br />

diffuse object, lighting, 227<br />

Direct3D texture formats, 172<br />

<strong>DirectX</strong> 9,<br />

data types, 4-5<br />

new features in, 3-7<br />

<strong>DirectX</strong> vertex shader assembler,<br />

651<br />

disassembly algorithm, 672<br />

Discrete Fourier Transform, 457<br />

displacement compression, 7-12,<br />

83-84<br />

displacement effects, 472-474<br />

displacement map, 73<br />

pre-filtering, 8<br />

tools for creating, 79-82<br />

displacement mapping, 7-8, 56, 73,<br />

283<br />

advantages of, 73-74<br />

disadvantages of, 74-75<br />

faking, 283-284<br />

requirements for using, 75<br />

distortion effects, 494-502,<br />

621-623<br />

distributed ray tracing, using for<br />

depth of field, 531<br />

dithering, 102-104<br />

downsampling, 542<br />

drops of water,<br />

animating, 201<br />

magnification of, 202<br />

drops of water effect, 201-202<br />

algorithm, 203<br />

implementing, 203-206<br />

DXT5 format, 187-188<br />

DXTC, 185<br />

advantages of, 188<br />

disadvantages of, 188-189<br />

using for normal map compression,<br />

185-186<br />

dynamic allocation, 616<br />

dynamic compilation technique,<br />

423<br />

dynamic Material, 606<br />

E<br />

edge detection, 443-449, 623<br />

effects, 619-624<br />

effects buffers, 614-615<br />

sizing, 617-618<br />

endo-stereo, 328<br />

engine architecture, 625-629<br />

problems with, 629<br />

enumeration, defining, 433-434<br />

Eturnum case study, 645-648<br />

exo-stereo, 328<br />

explicit textures, 190<br />

extrusion, 564<br />

F<br />

falloff, 135<br />

far capping, 564<br />

far viewpoint, 197<br />

Fast Fourier Transforms, 457-458<br />

implementing, 458-462<br />

using, 461-463<br />

feedback effect, 623-624<br />

filter kernels, 138<br />

filtering, 197, see also Canny edge<br />

detection filter and separable<br />

filtering techniques<br />

bilinear, 321<br />

float constants in VSC, 652-653<br />

floating-point cube maps, 319<br />

filtering, 320-321<br />

using, 320<br />

floating-point numbers, problems<br />

with conversion, 420-421<br />

floating-point precision, using to<br />

simulate blending operations,<br />

173-174<br />

floating-point render target, creating,<br />

172-173<br />

floating-point textures, 172<br />

setting as render target, 173<br />

using blending equation, 176<br />

fog,<br />

implementing, 349-351<br />

integrating, 349<br />

layered, 348<br />

vertical, 348<br />

fog density, calculating, 349<br />

Fourier Transform, 457<br />

fractals, 526<br />

creating, 33-35<br />

frame buffer,<br />

post-processing, 465<br />

reading from, 426<br />

frequency domain, 457<br />

Fresnel term, using for contrast<br />

enhancement, 286<br />

frustum processor, 627<br />

fun house mirror distortion,<br />

494-495<br />

implementing, 495-496<br />

function swapping, 658-659<br />

functions in VSC, 653-654<br />

G<br />

galaxy effects, 123<br />

Gamebryo engine, 631, 633<br />

Gamebryo Shader System, 635<br />

components of, 636<br />

features, 636-642<br />

future of, 648-649<br />

Gaussian filter, using to smooth<br />

image, 543-545<br />

geocentric positions, 241<br />

geometric functions in VSC, 661<br />

geometry,<br />

buffers, 623<br />

compression and vertex<br />

shaders, 3-12<br />

distorting, 300-301<br />

textures, 44<br />

geo-mipmapping, 24<br />

geomorphing, 18-19, 24-25, 27-30<br />

frame rates, 30-31<br />

ghost removal, 336<br />

giantism, 330<br />

gloss, 134, 150<br />

GPU,<br />

rendering planets on, 33-39<br />

using to calculate collision<br />

detection, 58<br />

vs. CPU performance, 377-379<br />

graphics engine, problems with<br />

scalability, 597<br />

grayscale, converting to, 466<br />

ground color map, 115<br />

generating, 115-116<br />

H<br />

hair strands, generating, 273-276<br />

half vector, 150<br />

hatching, 340<br />

implementing, 341-342<br />

methods, 340-341


using to light illustrations,<br />

345-346<br />

varying line style of, 343-345<br />

with specular highlights, 345<br />

heightfield displacement map,<br />

76-77<br />

hemisphere lighting, 113-114<br />

implementing, 118-121<br />

optimizing, 122<br />

high dynamic range rendering, faking,<br />

620-621<br />

HLSL, drawbacks to using, 667<br />

HLSL shaders<br />

butterfly operations, 460<br />

Canny edge detection filter,<br />

446-449<br />

floating-point cube map filtering,<br />

322-323<br />

fun house mirror distortion,<br />

495-496<br />

HSV-to-RGB transformation,<br />

442-443<br />

Kuwahara filter (5x5), 505-508<br />

Kuwahara filter (7x7), 510-517<br />

left-right slide, 484-485<br />

left-right squeeze transition,<br />

486-487<br />

RGB-to-HSV transformation,<br />

440-441<br />

ripple distortion, 500-502<br />

separable Gaussian filter,<br />

451-452<br />

separable median filter, 454-456<br />

shower door distortion,<br />

497-498<br />

spice transitions, 493-494<br />

spin and shrink away transition,<br />

489-491<br />

HLSL vertex shader, compression,<br />

237-238<br />

HSL color space, 443<br />

HSV color space, 440<br />

HSV-to-RGB transformation,<br />

implementing, 442-443<br />

hue, 440<br />

hyper-threading, 386-387<br />

I<br />

iFFT, 285<br />

image effects, types of, 481<br />

image processing, 439-440<br />

imposters and voxel objects,<br />

170-171<br />

indirect lighting, 135<br />

inner volume, 562-563<br />

generating, 563-566<br />

input texture, reading, 174-175<br />

input/output in VSC, 654-655<br />

instruction token, 669<br />

interface, importance of, 642-643<br />

interpolation,<br />

arithmetic, 155<br />

texture-encoded, 154<br />

interpolator, setting up, 418<br />

inverse Fast Fourier Transformations,<br />

see iFFT<br />

Inverse Fourier Transform, 457<br />

iridescence, 309<br />

simulating, 315-317<br />

irradiance environment maps, 226<br />

implementing, 230-231<br />

J<br />

jittering, 562<br />

using to generate outer volume,<br />

567-571<br />

K<br />

Kuwahara filter (5x5), 502-505<br />

implementing, 505-508<br />

Kuwahara filter (7x7), 508-510<br />

implementing, 510-517<br />

L<br />

layered fog, 348<br />

left-right slide transition, 483-484<br />

implementing, 484-485<br />

left-right squeeze transition,<br />

485-486<br />

implementing, 486-487<br />

level of detail, see LOD<br />

light attenuation, implementing,<br />

662-663<br />

light space interpolation, 157-158<br />

light view texture, 580<br />

lighting, 610-612<br />

normal, 12<br />

shaders, 662-663<br />

lightmap,<br />

applying to terrain, 22-24<br />

combining with material, 23-24<br />

creating, 19-20<br />

lightmapping, 559-560<br />

lilliputism, 330<br />

linear basis, 11<br />

using, 11-12<br />

linear complementarity problem,<br />

375-377<br />

lit voxels, 165-166<br />

LOD, 18<br />

lookup table,<br />

setting up, 14<br />

setting up index scale in, 17<br />

setting up offset constant in, 17<br />

using with vertex shaders,<br />

13-16<br />

loops, performance considerations<br />

of, 51-52<br />

low-polygon base mesh, 75<br />

M<br />

macros, using, 388<br />

magnification effect, 202<br />

Mandelbrot set, 526-527<br />

visualizing, 527-528<br />

material,<br />

applying to terrain, 23-24<br />

combining with lightmap, 23-24<br />

Material class,<br />

dynamic, 606<br />

scalability of, 602-604<br />

static, 599-602, 606<br />

textures, 606-610<br />

using to animate mesh, 610<br />

using to implement shader<br />

abstraction, 599-602<br />

material system, 628-629<br />

Material::DeleteImportedData()<br />

function, 605<br />

Material::Export() function,<br />

601-602, 605<br />

Material::FindOrCreate() function,<br />

600-601<br />

Material::Import() function, 605<br />

Material::Render() function, 602<br />

MaterialDescriptor, 600<br />

mathematical functions in VSC,<br />

660-661<br />

matrix, 353<br />

mapping to texture image,<br />

354-355<br />

offset, 333<br />

storing elements of, 354-355<br />

texture, 353-354<br />

matrix operations, 355<br />

addition, 358-359<br />

assignment, 355-357<br />

miscellaneous, 373-374<br />

multiplication, 359-367<br />

transposed multiplication,<br />

367-373<br />

mesh,<br />

lighting, 610-612<br />

rendering, 610<br />

mesh reduction, 81-82<br />

Meshuggah, 270<br />

effects in, 270-291<br />

meso-stereo, 328<br />

MET (multi-element texture), 253<br />

mipmapping, 8, 197, 618-619<br />

object IDs, 584-585<br />

model space, 254<br />

defining pixel normals in, 254<br />

modular 3D engine architecture,<br />

625-629<br />

moon,<br />

positioning, 240-241<br />

rendering, 244-246<br />

mosaic, 519<br />

mosaic effect, 519


implementing, 520-523<br />

motion blur, 299-300<br />

implementing, 302-304,<br />

306-308<br />

simulating, 300-302<br />

MRT (multiple render target), 252,<br />

618<br />

bandwidth requirements,<br />

263-265<br />

using, 252-253<br />

MRT configuration, 263-265<br />

optimized, 266<br />

multi-element texture, see MET<br />

multifractal, creating, 33-34<br />

multiple render target, see MRT<br />

multiple simple loops technique,<br />

423<br />

multi-texture blending, implementing,<br />

664<br />

N<br />

named constants, 433<br />

near capping, 564<br />

negative thickness, 104<br />

NetImmerse engine, 633<br />

problems with, 633-634<br />

NiBinaryShaderLib, 641<br />

NiD3DDefaultShader class,<br />

644-645<br />

NiD3DShader class, 644<br />

NiD3DShaderInterface class, 643<br />

noise function,<br />

using in sun shader, 127-128<br />

using to create fractal, 33-35<br />

using to texture planet, 36-37<br />

nonmaxima suppression, 445<br />

normal decompression, 294<br />

normal map, 136<br />

creating, 138<br />

preparing with green channel,<br />

187<br />

rendering, 82-84<br />

using with displacement mapping,<br />

76-77<br />

normal map compression, using<br />

DXTC for, 185-186<br />

normalized half vector, 150<br />

normalized reflected eye vector,<br />

150<br />

normalized specular bump mapping,<br />

155-156<br />

shaders, 156, 157-158, 159<br />

with per-pixel power, 159<br />

normals,<br />

calculating, 165-166<br />

optimizing generation of,<br />

166-169<br />

precision of, 255<br />

N-Patches, 9<br />

using, 9-10<br />

NSBShaderLib, 639<br />

NSFParserLib, 642<br />

Nyquist theorem, 13<br />

O<br />

object ID, 581<br />

allocation, 585-586<br />

LOD, 585<br />

mipmap, 584-585<br />

passing, 582-584<br />

shadows, 581<br />

using mipmapping to select,<br />

584-585<br />

occluder geometry, 587-588<br />

ocean scene, 285<br />

implementing, 286-289<br />

offset map, 192<br />

offset textures, 191<br />

optimizations<br />

calculating normals, 166-169<br />

deferred shading, 260-262,<br />

263-268<br />

hemisphere lighting, 122<br />

ray tracing, 184<br />

rendering thick volumes,<br />

104-105<br />

sky dome, 248-249<br />

software vertex processing,<br />

387-394<br />

terrain rendering, 25-27<br />

VSC, 659-660<br />

water simulation, 223-224<br />

outer volume, 562-563<br />

generating, 567-571<br />

output registers, writing to, 389<br />

P<br />

parallax, 328<br />

parallel cameras, 329<br />

particle, 107<br />

emitter, 107<br />

system, 107<br />

particle effects, implementing,<br />

110-112<br />

pass, 482<br />

passive stereo, 324-325<br />

patches, 18<br />

calculating tessellation level of,<br />

20-21<br />

connecting, 21<br />

using PVS with, 25-27<br />

path volume, 62<br />

generating, 62-63<br />

using to determine area of collision,<br />

63-66<br />

pattern-based procedural textures,<br />

190<br />

pencil sketch effect, 476-479<br />

penumbra, 561<br />

blurring, 571-576<br />

generating, 563-571<br />

Perlin noise function, using to create<br />

fractal, 34-35<br />

per-pixel<br />

Fresnel reflection, 216-219<br />

specular level, 150<br />

specular power, 150, 152-153<br />

per-pixel specularity, implementing,<br />

663<br />

per-vertex shadowing, 559<br />

Phong illumination,<br />

fragment-level, 136-137<br />

implementing, 139-142<br />

model, 131-132<br />

Phong shading, 149-150<br />

pinhole camera, 530<br />

pixel<br />

diffuse color, 255<br />

mask check, 425<br />

normal vector, 253-255<br />

position, 253<br />

pixel shaders<br />

accessing shadow map, 577<br />

accumulated voxels, 163-164<br />

anaglyph, 335<br />

attribute buffer building pass,<br />

256-257<br />

blurring penumbra, 575-576<br />

car paint effect, 296-298<br />

circle of confusion simulation,<br />

534-535, 538-539<br />

cloth animation, 45, 49-51,<br />

52-53, 53-54<br />

color remapping, 467<br />

displacement mapping, 284<br />

downsampling, 543<br />

ghost removal, 336-337<br />

grayscale conversion, 466<br />

hair, 276<br />

hemisphere lighting, 120-121<br />

layered fog, 349-350<br />

mapping volume texture coordinates,<br />

268<br />

matrix addition, 358-359<br />

matrix assignment, 357<br />

matrix multiplication, 364-367<br />

matrix transposed multiplication,<br />

371-273<br />

mosaic effect, 523-524<br />

motion blur, 306-307<br />

normalized specular bump<br />

mapping, 156, 158<br />

normalized specular bump<br />

mapping with per-pixel<br />

power, 159<br />

ocean scene, 289<br />

optimizing normal generation,<br />

167-169<br />

passing object ID to back<br />

buffer, 583


passing object ID to lighting<br />

view texture, 582<br />

pencil sketch effect, 477<br />

Phong illumination, 140-142<br />

ray tracing, 178-182<br />

reading input texture, 174-175<br />

reflection and refraction, 279<br />

rendering objects as thick volumes,<br />

95, 96-97, 98, 101-102<br />

sampling and averaging textures,<br />

573<br />

saturation filter, 471-472<br />

separable Gaussian filter, 542,<br />

547-549, 550-552, 554<br />

shading pass, 259-260<br />

shadow mapping, 145-148<br />

single-stroke textures, 344-345<br />

sky dome, 248<br />

solid voxels, 165<br />

specular bump mapping, 151,<br />

154, 155<br />

sprite positioning, 199-200<br />

sun surface, 281<br />

translucency and iridescence,<br />

312-313<br />

underwater scene, 211<br />

using to process image,<br />

439-440<br />

water simulation, 221-223<br />

planets,<br />

creating cloud layer for, 37<br />

generating geometry for, 33-35<br />

lighting, 37-38<br />

rendering, 33-39<br />

texturing, 35-37<br />

plug-ins in VSC, 655-656<br />

point, calculating inside a triangle,<br />

9<br />

pointillism, 346-347<br />

polynomial texture maps, 232-233<br />

posterization, 502-517<br />

post-processing, 439<br />

using for depth of field, 531<br />

post-processing filter, setting up,<br />

469-470<br />

Potentially Visible Set, see PVS<br />

precision, limitations of, 196<br />

precompiler in VSC, 657<br />

precomputed radiance transfer, see<br />

PRT<br />

predicates, using, 391<br />

pre-ps_1_4 hardware, specular<br />

bump mapping on, 149-160<br />

pre-sample displacement mapping,<br />

82-83<br />

prioritized stroke textures, 340-341<br />

procedural textures, 190<br />

profiling, 392-394<br />

projected Jacobi algorithm, 376,<br />

378<br />

projected shadows, 560<br />

projection aliasing, 581-582<br />

PRT (precomputed radiance transfer),<br />

232, 233-234<br />

using, 236-237<br />

ps_2_0 shaders, 402-405<br />

PVS (Potentially Visible Set), 25<br />

using to reduce number of<br />

patches, 25-27<br />

Q<br />

quad rendering, 467<br />

quad shading, see deferred shading<br />

R<br />

radial blur,<br />

rendering, 281-283<br />

using to render volumetric<br />

beams, 279-283<br />

radiosity maps, 115<br />

generating, 115-116<br />

rasterizer, setting up, 418<br />

ray casting, for volumetric lighting<br />

effect, 289-290<br />

ray tracing, 177, 184<br />

disadvantages of, 177<br />

distributed, 531<br />

optimizing, 184<br />

pixel shader, 178-182<br />

vertex shader, 177-178,<br />

182-184<br />

reference textures, 191-192<br />

reflected eye vector, 150<br />

reflection,<br />

adding to water scene, 216-219<br />

calculating for soft objects,<br />

276-279<br />

reflection map, rendering, 213-214<br />

refraction,<br />

calculating for soft objects,<br />

276-279<br />

implementing, 182<br />

relative-addressing token, 671<br />

render targets,<br />

multiple, 618<br />

outputting to, 427<br />

sizing, 470<br />

render texture, 466<br />

rendering order, 615<br />

rep instruction, using, 390<br />

resource management, 626<br />

reverse extruded shadow volumes<br />

technique, 587<br />

reverse shadow volume extrusion,<br />

590-591<br />

advantages of, 592-593<br />

disadvantages of, 593<br />

implementing, 591-592<br />

RGB color space, 440<br />

RGB-encoded values, decoding,<br />

95-97<br />

RGB-encoding, 92-95<br />

RGB-to-HSV transformation,<br />

implementing, 440-441<br />

ripple distortion, 498-499<br />

implementing, 500-502<br />

run-time intrinsics, 403<br />

using, 403-404<br />

S<br />

saturation, 440<br />

saturation effect, 619<br />

saturation filter, implementing,<br />

471-472<br />

scalability,<br />

of Material class, 602-604<br />

problems with in graphics<br />

engine, 597<br />

scan conversion, 421<br />

scanline loop, implementing, 423<br />

scanline rendering, 423-427<br />

scene management, 626<br />

scramble phase, 458-459<br />

screen space texture coordinates,<br />

mapping, 267-268<br />

screen-alignment, 107-109<br />

separable filtering techniques,<br />

449-450<br />

separable Gaussian filter, 450-451,<br />

545-555<br />

implementing, 451-452,<br />

540-554<br />

separable median filter, 452-454<br />

implementing, 454-457<br />

shader, 636<br />

shader abstraction, 598<br />

implementing with Material<br />

class, 599-602<br />

shader byte code, 668-671<br />

shader disassembler,<br />

customization of, 673-674<br />

integration of, 673<br />

reasons for using, 667-668<br />

shader emulation, implementing,<br />

405-411<br />

shader integration, drawbacks to,<br />

632<br />

shader library, 636, 639<br />

using, 639<br />

shader programs, 636<br />

shader system, requirements of,<br />

634-635<br />

shader values, parsing, 434-436<br />

shaders,<br />

integrating into art pipeline,<br />

637-639<br />

using latest version, 388<br />

shading distortion, 304-306<br />

shading, deferred, 251-252


shadow map, 562<br />

using, 576<br />

shadow mapping, 144, 560<br />

implementing, 144-148<br />

shadow volume extrusion algorithm,<br />

590<br />

shadow volumes, 62, 560-561<br />

shadowing, 116-117<br />

using darkening for, 592<br />

shadowing algorithms, 559-563<br />

shadowing artifacts,<br />

preventing, 567-568<br />

reducing, 588-590<br />

shadows,<br />

adding to scene, 144<br />

and voxel objects, 171<br />

using with deferred shading,<br />

261-262<br />

shininess, 133<br />

shower door distortion, 496-497<br />

implementing, 497-498<br />

silhouette edges, calculating, 564<br />

single complex loop technique, 423<br />

sky dome,<br />

implementing, 246-248<br />

optimizing, 248-249<br />

rendering, 241-243<br />

requirements of, 240<br />

Sobel edge detection, creating outlines<br />

with, 516-517<br />

Sobel filter, using, 138<br />

soft objects, rendering, 276-279<br />

soft shadows, 561<br />

algorithm, 561-563<br />

SoftD3D,<br />

future of, 428-431<br />

rasterizer implementation, 422<br />

reasons for developing,<br />

413-416<br />

software shaders, 396<br />

software vertex processing, 383<br />

optimizing, 387-394<br />

reasons to use, 384<br />

SoftWire, 398<br />

advantages of, 399-400<br />

macro support, 401-402<br />

using, 398-399<br />

vs. x86 shaders, 402<br />

solid voxels, 164-165<br />

sorting, 612<br />

source parameter token, 670-671<br />

source swizzle masks, using, 389<br />

spatial domain, 457<br />

specular bump mapping<br />

on pre-ps_1_4 hardware,<br />

149-160<br />

shaders, 150-151, 154, 155<br />

specular component, 133-134<br />

specular highlights, hatching with,<br />

345<br />

spherical harmonics, 227-230,<br />

235-236<br />

spice transitions, 491-493<br />

implementing, 493-494<br />

spilling, 404<br />

spin and shrink away transition,<br />

487-489<br />

implementing, 489-491<br />

spiral galaxy, 124<br />

rendering, 124-126<br />

spring model for cloth animation,<br />

40-44<br />

sprites, see also texture sprites<br />

overlapping, 196<br />

positioning, 192-194, 197-200<br />

rotating, 195-196<br />

scaling, 195-196<br />

SSE, 385-386<br />

stacked quads, 170<br />

star effect, implementing, 14-16<br />

start-of-day tessellation, 84<br />

static Material, 599-602, 606<br />

stencil shadows, 144, 560-561<br />

using to generate inner volume,<br />

563-566<br />

stereo,<br />

active, 324<br />

passive, 324-325<br />

stereo scene, compositing,<br />

333-334<br />

stereo window, 332-333<br />

stereoscopic camera, setting up,<br />

325-328<br />

stereoscopic rendering, 324-325<br />

stereoscopic rendering methods,<br />

335<br />

comparing, 338<br />

stereoscopy, 325<br />

stream processing, 417-418<br />

stream, setting up, 416-417<br />

Streaming SIMD Extensions, see<br />

SSE<br />

stroke colors, 346<br />

stroke-lookup textures, 343<br />

implementing, 344-345<br />

structure-of-arrays data, 386<br />

sun,<br />

positioning, 240-241<br />

rendering, 243-244<br />

rendering surface of, 279-281<br />

sun effects, 127<br />

sun texture, generating, 128-130<br />

surface basis, 7, 8<br />

surface bumping, 215-216<br />

swap chain, setting up, 469-470<br />

T<br />

tangent space, 137-138, 254<br />

defining pixel normals in,<br />

254-255<br />

temporary register usage, minimizing,<br />

389-390<br />

terrain,<br />

applying lightmap to, 22-24<br />

applying material to, 23-24<br />

terrain mesh, creating, 19-20<br />

terrain rendering, 18<br />

optimizations for, 25-27<br />

problems with, 19<br />

terrain triangulation, 19<br />

tessellation calculation, 27-30<br />

tessellation level, calculating for<br />

patch, 20-21<br />

texture blending, 423-424<br />

texture coordinates, mapping to<br />

screen space, 267-268<br />

texture formats, 172<br />

texture image, mapping to matrix,<br />

354-355<br />

texture mapping,<br />

problems with unique, 80-81<br />

unique, 76<br />

texture space, see tangent space<br />

texture sprites, 191, see also sprites<br />

disadvantages of, 196-197<br />

implementing, 197-200<br />

using, 191-196<br />

texture-based depth shadows,<br />

581-584<br />

texture-dependent reads, 466<br />

texture-encoded interpolation, 154<br />

textures, 190<br />

explicit, 190<br />

floating-point, 172<br />

offset, 191<br />

pattern-based procedural, 190<br />

procedural, 190<br />

reference, 191-192<br />

using with Material, 606-610<br />

thick volumes,<br />

handling intersection of,<br />

100-102<br />

implementing, 94-102<br />

optimizing, 104-105<br />

rendering, 89-91<br />

rendering in free space, 97-100<br />

thickness,<br />

calculating, 91-92<br />

negative, 104<br />

thin lens camera, 530-531<br />

thresholding scheme, 341<br />

implementing for hatching,<br />

341-342<br />

tile, 191<br />

tokens, 668-669<br />

topocentric positions, 241<br />

transformation map, 195<br />

transition effects, 483-494<br />

transitioning, 290-291


translucency, 309<br />

calculating, 313-315<br />

translucent pixels, working with,<br />

262-263<br />

triangles, setting up, 421-423<br />

triangulation, 19<br />

t-vertices, 21<br />

tweening, 27-30<br />

types in VSC, 652<br />

U<br />

undersampling, 584<br />

mipmapping to select object<br />

IDs, 584-585<br />

reducing, 584-586<br />

underwater scene,<br />

creating, 209-212<br />

pixel shader, 211<br />

projecting, 212-213<br />

vertex shader, 210-211<br />

unique texture mapping, 76<br />

problems with, 80-81<br />

V<br />

value, 440<br />

value noise, 36-37<br />

variables in VSC, 652<br />

vector constants in VSC, 652-653<br />

vector space, 137<br />

version token, 668-669<br />

vertex assembly, 419<br />

vertex format, defining, 138-139<br />

vertex pointer arrays, using, 420<br />

vertex processing, 417-418<br />

vertex shader 3.0, 3<br />

Vertex Shader Compiler, see VSC<br />

vertex shaders<br />

accessing shadow map, 576-577<br />

attribute buffer building pass,<br />

255-256<br />

ChromaDepth, 337-338<br />

circle of confusion simulation,<br />

533-534, 536<br />

cloth animation, 45, 47, 48-49,<br />

54-55<br />

downsampling, 542-543<br />

hair, 274-276<br />

hemisphere lighting, 118-120<br />

irradiance environment maps,<br />

230-231<br />

layered fog, 349-350<br />

mapping volume texture coordinates,<br />

267-268<br />

matrix assignment, 357<br />

matrix multiplication, 363-364<br />

matrix transposed multiplication,<br />

370-371<br />

motion blur, 302-304<br />

normalized specular bump mapping,<br />

157-158<br />

ocean scene, 286-288<br />

particle effects, 110-112<br />

Phong illumination, 139-140<br />

ray tracing, 177-178, 182-184<br />

reflection and refraction,<br />

278-279<br />

rendering inner volume, 566<br />

rendering objects as thick volumes,<br />

94, 95, 99<br />

rendering outer volume, 570<br />

reverse shadow volume extrusion,<br />

591-592<br />

separable Gaussian filter,<br />

540-541, 546, 549-550, 553<br />

shadow mapping, 145-148<br />

shadow volume extrusion,<br />

570-571<br />

sky dome, 246-248<br />

specular bump mapping,<br />

150-151<br />

star effect, 14-16<br />

sun surface, 280<br />

translucency and iridescence,<br />

311-312<br />

underwater scene, 210-211<br />

volume rendering, 180-181<br />

voxel rendering, 162<br />

water simulation, 219-221<br />

vertex shaders,<br />

and geometry compression,<br />

3-12<br />

using lookup tables with, 13-16<br />

vertex stream declaration format,<br />

changes to, 3<br />

limitations of, 3-4<br />

vertex stream splitting, 416-417<br />

vertex texturing, 55-56<br />

vertex tweening, 27-30<br />

vertical fog, 348<br />

view transformation matrix, using<br />

to align billboards, 108-109<br />

viewport, setting, 326<br />

virtual register allocation, 403-404<br />

visibility determination system,<br />

626-627<br />

visibility test, for collision detection,<br />

59-61<br />

visible geometry, 587-588<br />

vnoise function, using in sun<br />

shader, 127-128<br />

volume, rendering, 180-181<br />

volumes, see thick volumes<br />

volumetric beams, rendering with<br />

radial blur, 279-283<br />

volumetric lighting via ray casting,<br />

289-290<br />

volumetric lights, 562<br />

volumetric shadows, implementing,<br />

664-665<br />

voxel data, generating, 171<br />

voxel objects, 161<br />

and imposters, 170-171<br />

and shadows, 171<br />

voxels,<br />

accumulated, 163-164<br />

lit, 165-166<br />

rendering, 161-162<br />

solid, 164-164<br />

VSC (Vertex Shader Compiler),<br />

650<br />

examples, 662-666<br />

features of, 650<br />

functions, 662<br />

input/output structure, 654-655<br />

optimizing, 659-660<br />

plug-ins, 655-656<br />

VSC language, 651-657<br />

vs. C syntax, 651-652<br />

VSC Standard Library, 660-662<br />

VTune Analyzer 7.0, 392-394<br />

W<br />

water simulation,<br />

adding reflection to, 216-219<br />

adding waves to, 215-219<br />

creating above-water scene,<br />

213-215<br />

creating underwater scene,<br />

209-213<br />

implementing, 219-223<br />

optimizing, 223-224<br />

overview of, 208<br />

requirements for, 207<br />

simulating depth, 210-212<br />

waves, adding to water scene,<br />

215-219<br />

wet areas texture, 201-202<br />

world space, 137<br />

world space pixel normal vector,<br />

253<br />

world space pixel positions, 253<br />

calculating, 253<br />

X<br />

x86 shaders, 396-402<br />

disadvantages of, 401<br />

vs. SoftWire, 402<br />

Z<br />

z test, for collision detection, 59-61<br />

Z-buffer, 615<br />

Z-fail shadow volume algorithm,<br />

577-578<br />

zoom effect, 270<br />

implementing, 271-273<br />

Z-testing, 425


Learn FileMaker Pro 6<br />

1-55622-974-7 $39.95<br />

6 x 9 504 pp.<br />

FileMaker Pro 6 Developer’s Guide<br />

to XML/XSL<br />

1-55622-043-X $49.95<br />

6 x 9 416 pp.<br />

Advanced FileMaker Pro 6 Web<br />

Development<br />

1-55622-860-0 $59.95<br />

6 x 9 464 pp.<br />

Official Butterfly.net Game<br />

Developer’s Guide<br />

1-55622-044-8 • $59.95<br />

6 x 9 500 pp.<br />

Introduction to 3D Game Programming<br />

with DirectX 9.0<br />

1-55622-913-5 $49.95<br />

6 x 9 424 pp.<br />

Looking for more?<br />

Check out Wordware’s market-leading Game Developer’s Library<br />

featuring the following new releases and backlist titles.<br />

Game Development and Production<br />

1-55622-951-8 $49.95<br />

6 x 9 432 pp.<br />

ShaderX 2: Introductions &<br />

Tutorials with DirectX 9<br />

1-55622-902-X $44.95<br />

6 x 9 384 pp.<br />

LightWave 3D 7.5 Lighting<br />

1-55622-354-4 $69.95<br />

6 x 9 496 pp.<br />

Advanced 3D Game Programming<br />

with DirectX 9.0<br />

1-55622-968-2 $59.95<br />

6 x 9 552 pp.<br />

Strategy Game Programming<br />

with DirectX 9.0<br />

1-55622-922-4 $59.95<br />

6 x 9 560 pp.<br />

Essential LightWave 3D 7.5<br />

1-55622-226-2 $44.95<br />

6 x 9 424 pp.<br />

Visit us online at www.wordware.com for more information.


Direct3D ShaderX: Vertex and Pixel<br />

Shader Tips and Tricks<br />

1-55622-041-3 $59.95<br />

7½ x 9¼ 520 pp.<br />

Modeling a Character in 3DS Max<br />

1-55622-815-5 $44.95<br />

7½ x 9¼ 544 pp.<br />

Game Design: Theory and Practice<br />

1-55622-735-3 $49.95<br />

7½ x 9¼ 608 pp.<br />

Games That Sell!<br />

1-55622-950-X $34.95<br />

6 x 9 336 pp.<br />

Vector Game Math Processors<br />

1-55622-921-6 $59.95<br />

6 x 9 528 pp.<br />

Advanced Linux 3D Graphics<br />

Programming<br />

1-55622-853-8 $59.95<br />

7½ x 9¼ 640 pp.<br />

Game Design Foundations<br />

1-55622-973-9 $39.95<br />

6 x 9 400 pp.<br />

DirectX 9 Audio Exposed: Interactive<br />

Audio Development<br />

1-55622-288-2 $59.95<br />

6 x 9 568 pp.<br />

Java 1.4 Game Programming<br />

1-55622-963-1 $59.95<br />

6 x 9 672 pp.<br />

Use the following coupon code for online specials: Shader9887


About the CD<br />

The companion CD contains examples and source code discussed in the articles.<br />

There are folders for each section and subfolders for each article within<br />

the sections, although there may not be an example for some articles. Many<br />

folders include a readme.txt document that explains the examples, contains<br />

instructions, and lists hardware requirements.<br />

Simply place the CD in your CD drive and select the folder for which you<br />

would like to see the example.<br />

Warning: By opening the CD package, you accept the terms and<br />

conditions of the CD/Source Code Usage License Agreement on the<br />

following page.<br />

Additionally, opening the CD package makes this book nonreturnable.


CD/Source Code Usage License Agreement<br />

Please read the following CD/Source Code usage license agreement before opening the CD and<br />

using the contents therein:<br />

1. By opening the accompanying software package, you are indicating that you have read and<br />

agree to be bound by all terms and conditions of this CD/Source Code usage license<br />

agreement.<br />

2. The compilation of code and utilities contained on the CD and in the book are copyrighted<br />

and protected by both U.S. copyright law and international copyright treaties, and is owned<br />

by Wordware Publishing, Inc. Individual source code, example programs, help files,<br />

freeware, shareware, utilities, and evaluation packages, including their copyrights, are<br />

owned by the respective authors.<br />

3. No part of the enclosed CD or this book, including all source code, help files, shareware,<br />

freeware, utilities, example programs, or evaluation programs, may be made available on a<br />

public forum (such as a World Wide Web page, FTP site, bulletin board, or Internet news<br />

group) without the express written permission of Wordware Publishing, Inc. or the author of<br />

the respective source code, help files, shareware, freeware, utilities, example programs, or<br />

evaluation programs.<br />

4. You may not decompile, reverse engineer, disassemble, create a derivative work, or otherwise<br />

use the enclosed programs, help files, freeware, shareware, utilities, or evaluation<br />

programs except as stated in this agreement.<br />

5. The software, contained on the CD and/or as source code in this book, is sold without warranty<br />

of any kind. Wordware Publishing, Inc. and the authors specifically disclaim all other<br />

warranties, express or implied, including but not limited to implied warranties of merchantability<br />

and fitness for a particular purpose with respect to defects in the disk, the program,<br />

source code, sample files, help files, freeware, shareware, utilities, and evaluation programs<br />

contained therein, and/or the techniques described in the book and implemented in the<br />

example programs. In no event shall Wordware Publishing, Inc., its dealers, its distributors,<br />

or the authors be liable or held responsible for any loss of profit or any other alleged or<br />

actual private or commercial damage, including but not limited to special, incidental, consequential,<br />

or other damages.<br />

6. One (1) copy of the CD or any source code therein may be created for backup purposes. The<br />

CD and all accompanying source code, sample files, help files, freeware, shareware, utilities,<br />

and evaluation programs may be copied to your hard drive. With the exception of freeware<br />

and shareware programs, at no time can any part of the contents of this CD reside on more<br />

than one computer at one time. The contents of the CD can be copied to another computer,<br />

as long as the contents of the CD contained on the original computer are deleted.<br />

7. You may not include any part of the CD contents, including all source code, example programs,<br />

shareware, freeware, help files, utilities, or evaluation programs in any compilation of<br />

source code, utilities, help files, example programs, freeware, shareware, or evaluation programs<br />

on any media, including but not limited to CD, disk, or Internet distribution, without<br />

the express written permission of Wordware Publishing, Inc. or the owner of the individual<br />

source code, utilities, help files, example programs, freeware, shareware, or evaluation<br />

programs.<br />

8. You may use the source code, techniques, and example programs in your own commercial or<br />

private applications unless otherwise noted by additional usage agreements as found on the<br />

CD.<br />

Warning: By opening the CD package, you accept the terms and conditions<br />

of the CD/Source Code Usage License Agreement.<br />

Additionally, opening the CD package makes this book nonreturnable.
