[Parallel Algorithm] Parallel Prefix Sum Algorithm, mainly with MPI
[2024 Summer GeekPie HPC Lecture Notes]
- Reference: https://blog.csdn.net/weixin_45937291/article/details/127605085
- This is not the same as `MPI_Scan`, which computes a prefix sum with respect to the rank; this prefix sum is with respect to the array itself.
Assume there are $n$ elements and $p$ processes. Steps:
- Each process serially computes the prefix sums of its own `ceil(n/p)` elements, treating them as an independent array and ignoring all earlier elements. The last process may hold fewer elements, depending on the size. We assume the local indices are `[start_idx, end_idx-1]`, i.e. `[start_idx, end_idx)`. The pseudocode for this step is:
```
rank: process id (rank)
prefix_sum[N]: output buffer for the answer
N: # of elements
size: aka `p`, # of processes

start_idx, end_idx = IDX_GEN(rank)
prefix_sum[start_idx] = data[start_idx]
for i = [start_idx+1 ... end_idx)
    prefix_sum[i] = prefix_sum[i-1] + data[i]
end for
```
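The step above can be sketched as a plain-Python simulation. This is not MPI code: all ranks are simulated in one loop, and `idx_gen` is a hypothetical block-partitioning helper standing in for the pseudocode's `IDX_GEN`.

```python
import math

def idx_gen(rank, n, p):
    """Hypothetical IDX_GEN: block-partition n elements over p processes.
    Each process gets ceil(n/p) elements; the last one may get fewer."""
    chunk = math.ceil(n / p)
    start = rank * chunk
    end = min(start + chunk, n)
    return start, end

def local_prefix_sums(data, n, p):
    """Step 1: each (simulated) process serially computes the prefix
    sums of its own block, ignoring all earlier elements."""
    prefix_sum = [0] * n
    for rank in range(p):
        start, end = idx_gen(rank, n, p)
        if start >= end:
            continue  # this rank holds no elements
        prefix_sum[start] = data[start]
        for i in range(start + 1, end):
            prefix_sum[i] = prefix_sum[i - 1] + data[i]
    return prefix_sum

# Blocks [1,2,3], [4,5,6], [7] -> per-block prefix sums
print(local_prefix_sums([1, 2, 3, 4, 5, 6, 7], 7, 3))
# -> [1, 3, 6, 4, 9, 15, 7]
```

Each block's last entry (`6`, `15`, `7`) is the block total, which is exactly the value fed into step 2.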
- Now call the `PARALLEL_PREFIX_SUM` function:
```
function PARALLEL_PREFIX_SUM(pid, X, size)
    pid: process id (rank)
    X: prefix_sum[end_idx-1] of the process (its last local prefix sum)
    size: # of processes

    ret_sum = X
    total_sum = X
    d = ceil(log2(size)) /* Depth */
    /* If `size` is not a power of 2, then `ceil` is needed. */
    /* This can also be written as
       `for (j2 = 1; j2 < size; j2 *= 2)` and `partner_pid = XOR(pid, j2)`,
       where `j2` is equivalent to `2**j`. */
    for j = 0...d-1 (including d-1)
        partner_pid = XOR(pid, 2**j)
        if partner_pid >= size then
            continue /* If `size` is not a power of 2 */
        end if
        /* Send to & recv from the partner */
        recv_sum = 0 /* declare */
        Send_and_Recv(
            me=pid,
            dest=partner_pid,
            source=partner_pid,
            message_sent=total_sum,
            buffer_recv=recv_sum
        )
        /* In <mpi.h>, MPI_Sendrecv is implemented to avoid the deadlocks caused by
           mismatched separate MPI_Send and MPI_Recv calls.
           This simplifies this loop a lot! */
        total_sum += recv_sum /* i.e. total_sum = old_total_sum + recv_sum */
        if partner_pid < pid then
            ret_sum += recv_sum /* i.e. ret_sum = old_ret_sum + recv_sum */
        end if
    end for
    return ret_sum
end
```
Points emphasized in English in the comments:

2.1 `d` (depth) is a helper variable. If computing `log2` is inefficient, the loop can be rewritten in the form `for (j2 = 1; j2 < size; j2 *= 2)`, with the `partner_pid` computation changed to `partner_pid = XOR(pid, j2)`.

2.2 If `p` is not an integer power of 2, note that `partner_pid` may go out of range; in that case, skip the exchange for that round. Also, the computation of `d` (if `d` is used instead of `j2`) must round up (`ceil`).

2.3 The function should be called as follows:

```
ret_sum = PARALLEL_PREFIX_SUM(pid=rank, X=prefix_sum[end_idx-1], size=size)
```
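The XOR exchange pattern can be checked with a round-by-round Python simulation. This sketch runs all ranks in lockstep inside one function (a real MPI rank would execute just one column of this table, exchanging `total_sum` with its partner via `MPI_Sendrecv`); the function name is illustrative, not from any MPI binding.

```python
def parallel_prefix_sum_rounds(X, size):
    """Simulate PARALLEL_PREFIX_SUM for all ranks at once.
    X[pid] is rank pid's last local prefix sum; returns each rank's ret_sum."""
    ret_sum = list(X)
    total_sum = list(X)
    # (size - 1).bit_length() == ceil(log2(size)) for size >= 1
    d = (size - 1).bit_length()
    for j in range(d):
        old_total = list(total_sum)  # values before this round's exchange
        for pid in range(size):
            partner = pid ^ (1 << j)  # XOR(pid, 2**j)
            if partner >= size:
                continue  # size not a power of 2: partner out of range
            recv = old_total[partner]  # what MPI_Sendrecv would deliver
            total_sum[pid] += recv
            if partner < pid:
                ret_sum[pid] += recv
    return ret_sum

print(parallel_prefix_sum_rounds([1, 2, 3, 4], 4))  # -> [1, 3, 6, 10]
print(parallel_prefix_sum_rounds([1, 1, 1, 1, 1], 5))  # -> [1, 2, 3, 4, 5]
```

The second call exercises the non-power-of-2 case: some ranks skip a round, yet every rank's `ret_sum` still ends up correct, which is exactly the point of note 2.2.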
- After the call, the return value `ret_sum` in every process is the true `prefix_sum[end_idx-1]` value. We can compute the difference between the old (un-updated) `prefix_sum[end_idx-1]` and the new `ret_sum`, and add it to the serially-computed local subarray:

```
q = ret_sum - prefix_sum[end_idx-1]
for i = [start_idx ... end_idx)
    prefix_sum[i] += q
end for
```
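Putting the three steps together, here is a hedged end-to-end Python simulation (the function name is illustrative; a real implementation would run one rank per MPI process) that can be checked against a serial prefix sum:

```python
import math
from itertools import accumulate

def prefix_sum_mpi_style(data, p):
    """Simulate the full 3-step parallel prefix sum for p simulated ranks."""
    n = len(data)
    chunk = math.ceil(n / p)
    blocks = [(r * chunk, min(r * chunk + chunk, n)) for r in range(p)]

    # Step 1: local serial prefix sums, one block per rank
    prefix = list(data)
    for start, end in blocks:
        for i in range(start + 1, end):
            prefix[i] += prefix[i - 1]

    # Step 2: XOR exchange on each rank's last local prefix sum
    X = [prefix[end - 1] if end > start else 0 for start, end in blocks]
    ret = list(X)
    total = list(X)
    for j in range((p - 1).bit_length()):  # ceil(log2(p)) rounds
        old = list(total)
        for pid in range(p):
            partner = pid ^ (1 << j)
            if partner >= p:
                continue  # p not a power of 2
            total[pid] += old[partner]
            if partner < pid:
                ret[pid] += old[partner]

    # Step 3: shift each block by q = ret_sum - prefix_sum[end_idx-1]
    for pid, (start, end) in enumerate(blocks):
        if start >= end:
            continue
        q = ret[pid] - prefix[end - 1]
        for i in range(start, end):
            prefix[i] += q
    return prefix

data = [3, 1, 4, 1, 5, 9, 2, 6]
assert prefix_sum_mpi_style(data, 3) == list(accumulate(data))
```

The assertion compares the simulated result against Python's serial `itertools.accumulate`, which is the correctness criterion for the whole scheme.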
Complexity Analysis

Before parallelizing: $\mathcal O(n)$

After parallelizing: $\mathcal O(n/p + \log p)$

| | Computing time | Communicating time |
|---|---|---|
| Step 1 | $\mathcal O(n/p)$ (serial within each process) | 0 |
| Step 2 | $\mathcal O(\log p)$ | $\mathcal O(\log p)$ |
| Step 3 | $\mathcal O(n/p)$ (serial within each process) | 0 |
| Total | $2n/p + \log p$ | $\log p$ |
Why?

- It takes full advantage of the commutativity of the XOR operation.
- A serial pass over the whole array length is replaced by a serial pass of depth $\log p$, plus two short serial sections of length $n/p$ that run in parallel across processes.