1 Introduction
Rigid body simulations are often used in games. Because of their
computational cost, large-scale simulations have not been feasible in
real time. However, with the increasing performance of GPUs, real-time
large-scale simulation is becoming realistic. A rigid body simulation
is too complicated to implement on the GPU in a straightforward manner;
one of the challenges is implementing an efficient constraint solver.
A projected Gauss-Seidel method is often used to solve constraints
[Catto 2005]. The input and output of the solver are the velocities of
the rigid bodies. Because of this data dependency, constraints sharing
a body cannot be solved in parallel, which is a challenge when a GPU is
used for a rigid body simulation. One solution is to split the
constraints into groups called batches; within a batch, no body is
shared among constraints, so the constraints can be solved in parallel.
Introducing batches solves the problem of parallel constraint solving,
but batch creation itself is a serial process. To run the entire rigid
body simulation pipeline on the GPU, batch creation must also be
performed on the GPU.
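The batching idea can be illustrated with a short serial sketch. The
constraint representation here (a pair of body indices) and the greedy
strategy are illustrative assumptions, not the paper's GPU algorithm:

```python
def create_batches(constraints):
    """Greedily split constraints (pairs of body indices) into
    batches so that no two constraints in a batch share a body."""
    batches = []
    remaining = list(constraints)
    while remaining:
        used = set()          # bodies already touched by this batch
        batch, rest = [], []
        for a, b in remaining:
            if a in used or b in used:
                rest.append((a, b))   # conflict: defer to a later batch
            else:
                batch.append((a, b))
                used.update((a, b))
        batches.append(batch)
        remaining = rest
    return batches
```

For a chain of bodies 0-1-2-3, the constraints (0,1) and (2,3) share no
body and land in the first batch, while (1,2) must wait for a second
batch. All constraints within a batch can then be solved in parallel,
but the batch construction loop itself is inherently serial, which is
the problem the paper addresses.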
We present a method to create batches and solve constraints in parallel
on the GPU. Our batch creation is a two-step process: a global split
followed by local batch creation. The global split separates the
constraints into constraint groups. In local batch creation, a SIMD
(a GPU consists of SIMD engines, or SIMDs, each of which executes on
wide SIMD hardware; e.g., an ATI FirePro™ V8800 has 20 SIMD engines,
each with 64 SIMD lanes) processes a constraint group efficiently
without any global synchronization. The batch creation has a good
memory access pattern because it performs most of its random memory
accesses in the fast on-chip local data share (LDS). The constraint
solver also assigns a constraint group to a SIMD. This approach not
only minimizes global communication but also localizes the
computation, which can improve the cache hit rate. Our constraint
solver moves some dispatch work that previously had to be done by the
CPU to the GPU, which reduces the dispatch overhead on the CPU.
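For context, the projected Gauss-Seidel iteration referenced above can
be sketched on a deliberately simplified 1D model. The constraint form
(drive the relative velocity of a body pair toward non-negative, with
the accumulated impulse clamped at zero) and all variable names are
assumptions for illustration, not the paper's solver:

```python
def pgs_solve(v, inv_mass, constraints, iterations=10):
    """Projected Gauss-Seidel on 1D contact-style constraints.
    For each body pair (a, b), apply equal and opposite impulses so
    that the relative velocity v[b] - v[a] becomes >= 0, projecting
    the accumulated impulse onto [0, inf) so contacts only push."""
    accum = [0.0] * len(constraints)
    for _ in range(iterations):
        for i, (a, b) in enumerate(constraints):
            rel = v[b] - v[a]                    # relative velocity
            eff = inv_mass[a] + inv_mass[b]      # effective inverse mass
            d = -rel / eff                       # impulse to zero rel
            new = max(accum[i] + d, 0.0)         # projection step
            d = new - accum[i]
            accum[i] = new
            v[a] -= inv_mass[a] * d
            v[b] += inv_mass[b] * d
    return v
```

The inner loop reads and writes `v[a]` and `v[b]` in place, which is
exactly the data dependency that prevents two constraints sharing a
body from being solved on different SIMD lanes at the same time.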
e-mail: takahiro.harada@amd.com
2 Method
2.1 Global Split
First, constraints are split into disjoint constraint groups, each of
which can be processed in parallel on a SIMD. However, it is not
always possible to remove all dependencies between groups. An example
is a chain in which all the constraints are connected; in this
situation, the constraints cannot be split into disjoint groups. Our
approach is to create sets of constraint groups. A set contains
several constraint groups with no dependencies among them, but
constraints in different sets can have dependencies (Fig. 2). To
divide the constraints into groups, a spatial split is used: the
simulation space is diced into a grid and each constraint is assigned
to a cell. After the split, non-adjacent cells do not share
constraints, so they can be processed in parallel. The constraints in
each cell make up a constraint group. Because we use a two-dimensional
split, the cells, and therefore the constraint groups, are divided into
four independent sets, which are solved in four steps. Solving a
constraint group is an independent job processed in parallel; on the
GPU, a SIMD processes a constraint group. However, there can still be
dependencies among the constraints within a group: they cannot be
processed in parallel on a wide SIMD architecture, although they could
be solved directly on a multi-core CPU. Local batch creation solves
this issue.
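The four-set structure of the two-dimensional split can be sketched as
a parity coloring of the grid cells (the function name and the 4×4
grid used for checking are illustrative assumptions):

```python
def cell_set(ix, iy):
    """Set index (0-3) of grid cell (ix, iy). Two cells in the same
    set differ by an even amount in both coordinates, so they are
    never adjacent; all cells of one set can be solved concurrently,
    one SIMD per cell, and the four sets run as four steps."""
    return (ix % 2) + 2 * (iy % 2)
```

Cells (0,0) and (2,0) land in the same set but are two cells apart, so
their constraint groups share no bodies; cells (0,0) and (1,0) are
adjacent and are placed in different sets, serializing their solves.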
Implementation Global split is performed by computing a cell index for
each constraint, followed by a radix sort that orders the constraints
by cell index. The sorted constraints are then scanned to obtain, for
each cell, the offset of its constraint group and the number of
constraints in the group.
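Serially, these three steps amount to the following sketch. The
constraint positions, cell size, and grid width are hypothetical
inputs, and Python's stable sort and a dictionary scan stand in for
the GPU radix sort and prefix-sum scan:

```python
def build_constraint_groups(constraint_positions, cell_size, grid_w):
    """Compute a cell index per constraint, sort constraints by cell
    index, then scan the sorted list to obtain, per cell, the offset
    and size of its constraint group."""
    # 1. cell index for each constraint (2D grid flattened row-major)
    def cell_index(x, y):
        return int(y // cell_size) * grid_w + int(x // cell_size)

    indexed = [(cell_index(x, y), i)
               for i, (x, y) in enumerate(constraint_positions)]
    # 2. sort by cell index (radix sort on the GPU)
    indexed.sort(key=lambda t: t[0])
    # 3. scan for the offset and count of each cell's group
    offsets, counts = {}, {}
    for pos, (cell, _) in enumerate(indexed):
        if cell not in offsets:
            offsets[cell] = pos          # first constraint of this group
        counts[cell] = counts.get(cell, 0) + 1
    order = [i for _, i in indexed]      # constraints in group order
    return order, offsets, counts
```

After this pass, each SIMD can fetch its group with one offset and one
count, with no further global communication.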