[Parallel Algorithm] Parallel Prefix Sum Algorithm, mainly with MPI
[2024 Summer GeekPie HPC Lecture Notes]
- Reference: https://blog.csdn.net/weixin_45937291/article/details/127605085
- This is not the same as `MPI_Scan`, which computes a prefix sum with respect to the rank; this prefix sum is with respect to the array itself.
Assume there are $n$ elements and $p$ processes. Steps:
- Each process serially computes the prefix sums of its own `ceil(n/p)` elements, treating them as an independent array and ignoring all earlier elements. The last process may hold fewer elements, depending on the size. We assume the local indices are `[start_idx, end_idx-1]`, i.e. `[start_idx, end_idx)`. The pseudocode for this step is:
```
rank: process id (rank)
prefix_sum[N]: output buffer for the answer
N: # of elements
size: aka `p`, # of processes

start_idx, end_idx = IDX_GEN(rank)
prefix_sum[start_idx] = data[start_idx]
for i = [start_idx+1 ... end_idx)
    prefix_sum[i] = prefix_sum[i-1] + data[i]
end for
```
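The step above can be sketched as a plain-Python simulation. This is not MPI code: all ranks are simulated in one loop, and `idx_gen` is a hypothetical block-partitioning helper standing in for the pseudocode's `IDX_GEN`.

```python
import math

def idx_gen(rank, n, p):
    """Hypothetical IDX_GEN: block-partition n elements over p processes.
    Each process gets ceil(n/p) elements; the last one may get fewer."""
    chunk = math.ceil(n / p)
    start = rank * chunk
    end = min(start + chunk, n)
    return start, end

def local_prefix_sums(data, n, p):
    """Step 1: each (simulated) process serially computes the prefix
    sums of its own block, ignoring all earlier elements."""
    prefix_sum = [0] * n
    for rank in range(p):
        start, end = idx_gen(rank, n, p)
        if start >= end:
            continue  # this rank holds no elements
        prefix_sum[start] = data[start]
        for i in range(start + 1, end):
            prefix_sum[i] = prefix_sum[i - 1] + data[i]
    return prefix_sum

# Blocks [1,2,3], [4,5,6], [7] -> per-block prefix sums
print(local_prefix_sums([1, 2, 3, 4, 5, 6, 7], 7, 3))
# -> [1, 3, 6, 4, 9, 15, 7]
```

Each block's last entry (`6`, `15`, `7`) is the block total, which is exactly the value fed into step 2.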
- Now call the `PARALLEL_PREFIX_SUM` function:
```
function PARALLEL_PREFIX_SUM(pid, X, size)
    pid: process id (rank)
    X: prefix_sum[end_idx-1] of the process (its last local prefix sum)
    size: # of processes

    ret_sum = X
    total_sum = X
    d = ceil(log2(size)) /* Depth */
    /* If `size` is not a power of 2, then `ceil` is needed. */
    /* This can also be written as
       `for (j2 = 1; j2 < size; j2 *= 2)` and `partner_pid = XOR(pid, j2)`,
       where `j2` is equivalent to `2**j`. */
    for j = 0...d-1 (including d-1)
        partner_pid = XOR(pid, 2**j)
        if partner_pid >= size then
            continue /* If `size` is not a power of 2 */
        end if
        /* Send to & recv from the partner */
        recv_sum = 0 /* declare */
        Send_and_Recv(
            me=pid,
            dest=partner_pid,
            source=partner_pid,
            message_sent=total_sum,
            buffer_recv=recv_sum
        )
        /* In <mpi.h>, MPI_Sendrecv is implemented to avoid the deadlocks caused by
           mismatched separate MPI_Send and MPI_Recv calls.
           This simplifies this loop a lot! */
        total_sum += recv_sum /* i.e. total_sum = old_total_sum + recv_sum */
        if partner_pid < pid then
            ret_sum += recv_sum /* i.e. ret_sum = old_ret_sum + recv_sum */
        end if
    end for
    return ret_sum
end
```
Points emphasized in English in the comments:

2.1 `d` (depth) is a helper variable. If computing `log2` is inefficient, the loop can be rewritten in the form `for (j2 = 1; j2 < size; j2 *= 2)`, with the `partner_pid` computation changed to `partner_pid = XOR(pid, j2)`.

2.2 If `p` is not an integer power of 2, note that `partner_pid` may go out of range; in that case, skip the exchange for that round. Also, the computation of `d` (if `d` is used instead of `j2`) must round up (`ceil`).

2.3 The function should be called as follows:

```
ret_sum = PARALLEL_PREFIX_SUM(pid=rank, X=prefix_sum[end_idx-1], size=size)
```
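The XOR exchange pattern can be checked with a round-by-round Python simulation. This sketch runs all ranks in lockstep inside one function (a real MPI rank would execute just one column of this table, exchanging `total_sum` with its partner via `MPI_Sendrecv`); the function name is illustrative, not from any MPI binding.

```python
def parallel_prefix_sum_rounds(X, size):
    """Simulate PARALLEL_PREFIX_SUM for all ranks at once.
    X[pid] is rank pid's last local prefix sum; returns each rank's ret_sum."""
    ret_sum = list(X)
    total_sum = list(X)
    # (size - 1).bit_length() == ceil(log2(size)) for size >= 1
    d = (size - 1).bit_length()
    for j in range(d):
        old_total = list(total_sum)  # values before this round's exchange
        for pid in range(size):
            partner = pid ^ (1 << j)  # XOR(pid, 2**j)
            if partner >= size:
                continue  # size not a power of 2: partner out of range
            recv = old_total[partner]  # what MPI_Sendrecv would deliver
            total_sum[pid] += recv
            if partner < pid:
                ret_sum[pid] += recv
    return ret_sum

print(parallel_prefix_sum_rounds([1, 2, 3, 4], 4))  # -> [1, 3, 6, 10]
print(parallel_prefix_sum_rounds([1, 1, 1, 1, 1], 5))  # -> [1, 2, 3, 4, 5]
```

The second call exercises the non-power-of-2 case: some ranks skip a round, yet every rank's `ret_sum` still ends up correct, which is exactly the point of note 2.2.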
- After the call, the return value `ret_sum` in every process is the true `prefix_sum[end_idx-1]` value. We can compute the difference between the old (un-updated) `prefix_sum[end_idx-1]` and the new `ret_sum`, and add it to the serially-computed local subarray:

```
q = ret_sum - prefix_sum[end_idx-1]
for i = [start_idx ... end_idx)
    prefix_sum[i] += q
end for
```
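Putting the three steps together, here is a hedged end-to-end Python simulation (the function name is illustrative; a real implementation would run one rank per MPI process) that can be checked against a serial prefix sum:

```python
import math
from itertools import accumulate

def prefix_sum_mpi_style(data, p):
    """Simulate the full 3-step parallel prefix sum for p simulated ranks."""
    n = len(data)
    chunk = math.ceil(n / p)
    blocks = [(r * chunk, min(r * chunk + chunk, n)) for r in range(p)]

    # Step 1: local serial prefix sums, one block per rank
    prefix = list(data)
    for start, end in blocks:
        for i in range(start + 1, end):
            prefix[i] += prefix[i - 1]

    # Step 2: XOR exchange on each rank's last local prefix sum
    X = [prefix[end - 1] if end > start else 0 for start, end in blocks]
    ret = list(X)
    total = list(X)
    for j in range((p - 1).bit_length()):  # ceil(log2(p)) rounds
        old = list(total)
        for pid in range(p):
            partner = pid ^ (1 << j)
            if partner >= p:
                continue  # p not a power of 2
            total[pid] += old[partner]
            if partner < pid:
                ret[pid] += old[partner]

    # Step 3: shift each block by q = ret_sum - prefix_sum[end_idx-1]
    for pid, (start, end) in enumerate(blocks):
        if start >= end:
            continue
        q = ret[pid] - prefix[end - 1]
        for i in range(start, end):
            prefix[i] += q
    return prefix

data = [3, 1, 4, 1, 5, 9, 2, 6]
assert prefix_sum_mpi_style(data, 3) == list(accumulate(data))
```

The assertion compares the simulated result against Python's serial `itertools.accumulate`, which is the correctness criterion for the whole scheme.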
Complexity Analysis

Before parallelizing: $\mathcal O(n)$

After parallelizing: $\mathcal O(n/p + \log p)$

| | Computing time | Communicating time |
|---|---|---|
| Step 1 | $\mathcal O(n/p)$ (serial within each process) | 0 |
| Step 2 | $\mathcal O(\log p)$ | $\mathcal O(\log p)$ |
| Step 3 | $\mathcal O(n/p)$ (serial within each process) | 0 |
| Total | $2n/p + \log p$ | $\log p$ |
Why?

- It takes full advantage of the commutativity of the XOR operation.
- A serial pass over the whole array length is replaced by a serial pass of depth $\log p$, plus two short serial sections of length $n/p$ that run in parallel across processes.