[Parallel Algorithm] Parallel Prefix Sum Algorithm, mainly by MPI

2024-07-19
2 min read

[2024 Summer GeekPie HPC Lecture Notes] Catalog of Notes


Case for the 15-element array and 4 processes

假设这里有 $n$ elements, $p$ processes. Steps:

  1. Serially compute the prefix sums of only ceil(n/p) elements, regarding these elements as an independent array. The last process may have less elements w.r.t. the size. We assume the indices are [start_idx, end_idx-1], i.e. [start_idx, end_idx).

串行计算ceil(n/p)个元素的前缀和。这些元素组成一个子数组,只计算这个子数组的前缀和,而不考虑再前面的元素。假设这些元素 index 是 [start_idx, end_idx-1],这一步的伪代码如下:

rank: process id (rank)
prefix_sum[N]: where the `ans` should be in
N: # of elements
size: aka `p`, # of processes

start_idx, end_idx = IDX_GEN(rank)
prefix_sum[start_idx] = data[start_idx]
for i = [start_idx+1 ... end_idx)
	prefix_sum[i] = prefix_sum[i-1]+data[i]
end for
  1. Now call the PARALLEL_PREFIX_SUM function:
function PARALLEL_PREFIX_SUM(pid, X, size)
	pid: process id (rank)
	X: prefix_sum[end_idx] of the process
	size: # of processes
	
	ret_sum = X
	total_sum = X
	d = ceil(log2(size)) /* Depth */
	/* If `size` is not power of 2, then `ceil` is needed. */
	
	/* This can also be written as
	
	`for (j2 = 1; j2 < size; j2 *= 2)` and `partner_pid = XOR(pid, j2)`.
	
	where `j2` is equivalent to `2**j`
	*/
	
	for j = 0...d-1 (including d-1)
		partner_pid = XOR(pid, 2**j) 
		if partner_pid >= size then
			continue /* If `size` is not power of 2 */
		end if
		
        /* Send & Recv from the partner */
		recv_sum = 0 /* declare */
		Send_and_Recv(
			me=pid,
			dest=partner_pid, 
			souce=partner_pid,
			message_sent=total_sum,
			buffer_recv=recv_sum
        )
		/* 
		In <mpi.h>, MPI_SendRecv is implemented to avoid deadlocks caused by setting 
		wrong separated MPI_Send and MPI_Recv
		This facilitates this loop a lot!
		*/
		
		total_sum += recv_sum /* i.e. total_sum = old_total_sum + recv_total_sum */
		if partner_pid < pid then
			ret_sum += recv_sum /* i.e. prefix_sum = old_total_sum + recv_total_sum */
		end if
	end for
	return ret_sum
end

在注释中用英语强调的点:

2.1 d(depth) 作为辅助变量,其log2如果实现效率低,可以换成 for(j2 = 0; j2 < size; j2 *= 2) 的形式,并且在 partner_pid 的计算中也更改为 partner_pid = XOR(pid, j2)

2.2 如果 p 不是2的整数次幂,那么需要注意的是,

  • partner_id 有可能会越界,此时不要执行下一步操作
  • d 的计算(如果使用了 d 而不是 j2)需要 向上取整。

2.3 函数的调用应该如下:

ret_sum = PARALLEL_PREFIX_SUM(pid=rank, X=prefix_sum[end_idx-1], size=size)
  1. After calling, the return value ret_sum in every process is the true prefix_sum[end_idx] value.

    We can compute the difference between the old (un-updated prefix_sum[end_idx]) and the new ret_sum:

    把函数的返回值加到原来串行计算的子数组上:

    q = ret_sum - prefix_sum[end_idx-1]
    for i = [start_idx ... end_idx)
    	prefix_sum[i] += q
    end for
    

Amortized Analysis

并行化之前:(Before parallelizing)$ \mathcal O(n) $

并行化之后:(After parallelizing) $\mathcal O(n/p +\log p)$

Computing time Communicating time
Step 1 $\mathcal O(n/p)$ (if only serial) 0
Step 2 $\mathcal O(\log p)$ $\mathcal O(\log p)$
Step 3 $\mathcal O(n/p)$ (if only serial) 0
$2n/p + \log p$ $\log p$

Why?

  1. Taking full advantage of commutativity of XOR operation

    充分利用了 XOR 的交换性

  2. Follow the serial execution of array length -> Follow the serial execution of depth=log p and 2 little serial sections of length n/p parallelized