1 Introduction
Rigid body simulations are often used in games. Because of their
computational cost, large-scale simulations have not been feasible in
real time. However, with the increasing performance of GPUs, real-time
large-scale simulation is becoming realistic. A rigid body simulation
is too complicated to implement on the GPU in a straightforward manner;
one of the challenges is implementing an efficient constraint solver.
A projected Gauss-Seidel method is often used to solve constraints
[Catto 2005]. The input and output of the solver are the velocities of
the rigid bodies. Because of this data dependency, constraints sharing
a body cannot be solved in parallel, which is a challenge when a GPU is
used for a rigid body simulation. One solution is to split the
constraints into groups called batches; within a batch, no body is
shared among constraints, so the constraints can be solved in parallel.
Introducing batches solves the problem of parallel constraint solving,
but batch creation itself is a serial process. To run the entire rigid
body simulation pipeline on the GPU, batch creation must also be
performed on the GPU.
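The batching idea can be illustrated with a short serial sketch. The
constraint representation here (a pair of body indices) and the greedy
strategy are illustrative assumptions, not the paper's GPU algorithm:

```python
def create_batches(constraints):
    """Greedily split constraints (pairs of body indices) into
    batches so that no two constraints in a batch share a body."""
    batches = []
    remaining = list(constraints)
    while remaining:
        used = set()          # bodies already touched by this batch
        batch, rest = [], []
        for a, b in remaining:
            if a in used or b in used:
                rest.append((a, b))   # conflict: defer to a later batch
            else:
                batch.append((a, b))
                used.update((a, b))
        batches.append(batch)
        remaining = rest
    return batches
```

For a chain of bodies 0-1-2-3, the constraints (0,1) and (2,3) share no
body and land in the first batch, while (1,2) must wait for a second
batch. All constraints within a batch can then be solved in parallel,
but the batch construction loop itself is inherently serial, which is
the problem the paper addresses.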
We present a method to create batches and solve constraints in parallel
on the GPU. Our batch creation is a two-step process: a global split
followed by local batch creation. The global split separates the
constraints into constraint groups. In local batch creation, a SIMD
(a GPU consists of SIMD engines, or SIMDs, each of which executes on
wide SIMD hardware; e.g., an ATI FirePro™ V8800 has 20 SIMD engines,
each with 64 SIMD lanes) processes a constraint group efficiently
without any global synchronization. The batch creation has a good
memory access pattern because it performs most of its random memory
accesses in the fast on-chip local data share (LDS). The constraint
solver also assigns a constraint group to a SIMD. This approach not
only minimizes global communication but also localizes the
computation, which can improve the cache hit rate. Our constraint
solver moves some dispatch work that previously had to be done by the
CPU to the GPU, which reduces the dispatch overhead on the CPU.
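For context, the projected Gauss-Seidel iteration referenced above can
be sketched on a deliberately simplified 1D model. The constraint form
(drive the relative velocity of a body pair toward non-negative, with
the accumulated impulse clamped at zero) and all variable names are
assumptions for illustration, not the paper's solver:

```python
def pgs_solve(v, inv_mass, constraints, iterations=10):
    """Projected Gauss-Seidel on 1D contact-style constraints.
    For each body pair (a, b), apply equal and opposite impulses so
    that the relative velocity v[b] - v[a] becomes >= 0, projecting
    the accumulated impulse onto [0, inf) so contacts only push."""
    accum = [0.0] * len(constraints)
    for _ in range(iterations):
        for i, (a, b) in enumerate(constraints):
            rel = v[b] - v[a]                    # relative velocity
            eff = inv_mass[a] + inv_mass[b]      # effective inverse mass
            d = -rel / eff                       # impulse to zero rel
            new = max(accum[i] + d, 0.0)         # projection step
            d = new - accum[i]
            accum[i] = new
            v[a] -= inv_mass[a] * d
            v[b] += inv_mass[b] * d
    return v
```

The inner loop reads and writes `v[a]` and `v[b]` in place, which is
exactly the data dependency that prevents two constraints sharing a
body from being solved on different SIMD lanes at the same time.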
e-mail: takahiro.harada@amd.com
2 Method
2.1 Global Split
First, constraints are split into disjoint constraint groups, each of
which can be processed in parallel on a SIMD. However, it is not
always possible to remove all dependencies between groups. An example
is a chain in which all the constraints are connected; in this
situation, the constraints cannot be split into disjoint groups. Our
approach is to create sets of constraint groups. A set contains
several constraint groups with no dependencies among them, but
constraints in different sets can have dependencies (Fig. 2). To
divide the constraints into groups, a spatial split is used: the
simulation space is diced into a grid and each constraint is assigned
to a cell. After the split, non-adjacent cells do not share
constraints, so they can be processed in parallel. The constraints in
each cell make up a constraint group. Because we use a two-dimensional
split, the cells, and therefore the constraint groups, are divided into
four independent sets, which are solved in four steps. Solving a
constraint group is an independent job processed in parallel; on the
GPU, a SIMD processes a constraint group. However, there can still be
dependencies among the constraints within a group: they cannot be
processed in parallel on a wide SIMD architecture, although they could
be solved directly on a multi-core CPU. Local batch creation solves
this issue.
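The four-set structure of the two-dimensional split can be sketched as
a parity coloring of the grid cells (the function name and the 4×4
grid used for checking are illustrative assumptions):

```python
def cell_set(ix, iy):
    """Set index (0-3) of grid cell (ix, iy). Two cells in the same
    set differ by an even amount in both coordinates, so they are
    never adjacent; all cells of one set can be solved concurrently,
    one SIMD per cell, and the four sets run as four steps."""
    return (ix % 2) + 2 * (iy % 2)
```

Cells (0,0) and (2,0) land in the same set but are two cells apart, so
their constraint groups share no bodies; cells (0,0) and (1,0) are
adjacent and are placed in different sets, serializing their solves.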
Implementation Global split is performed by computing a cell index for
each constraint, followed by a radix sort that orders the constraints
by cell index. The sorted constraints are then scanned to obtain, for
each cell, the offset of its constraint group and the number of
constraints in the group.
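Serially, these three steps amount to the following sketch. The
constraint positions, cell size, and grid width are hypothetical
inputs, and Python's stable sort and a dictionary scan stand in for
the GPU radix sort and prefix-sum scan:

```python
def build_constraint_groups(constraint_positions, cell_size, grid_w):
    """Compute a cell index per constraint, sort constraints by cell
    index, then scan the sorted list to obtain, per cell, the offset
    and size of its constraint group."""
    # 1. cell index for each constraint (2D grid flattened row-major)
    def cell_index(x, y):
        return int(y // cell_size) * grid_w + int(x // cell_size)

    indexed = [(cell_index(x, y), i)
               for i, (x, y) in enumerate(constraint_positions)]
    # 2. sort by cell index (radix sort on the GPU)
    indexed.sort(key=lambda t: t[0])
    # 3. scan for the offset and count of each cell's group
    offsets, counts = {}, {}
    for pos, (cell, _) in enumerate(indexed):
        if cell not in offsets:
            offsets[cell] = pos          # first constraint of this group
        counts[cell] = counts.get(cell, 0) + 1
    order = [i for _, i in indexed]      # constraints in group order
    return order, offsets, counts
```

After this pass, each SIMD can fetch its group with one offset and one
count, with no further global communication.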