# Parallel Algorithms

Two closely related models of parallel computation.

Circuits

Logic gates (AND/OR/not) connected by wires

important measures

number of gates

depth (clock cycles in synchronous circuit)

PRAM

\(P\) processors, each with a RAM, local registers

global memory of \(M\) locations

each processor can in one step do a RAM op or read/write to one global memory location

synchronous parallel steps

not realistic, but explores “degree of parallelism”

Essentially the same models, but let us focus on different things.

## Circuits

Logic gates (AND/OR/not) connected by wires

important measures

number of gates

depth (clock cycles in synchronous circuit)

bounded vs unbounded fan-in/out

\(AC(k)\) and \(NC(k)\): unbounded and bounded fan-in with depth \(O(\log^k n)\) for problems of size \(n\)

\(AC(k) \subset NC(k) \subset AC(k+1)\) using full binary tree

\(NC=\cup NC(k)=\cup AC(k)\)

Addition

consider adding \(a_i\) and \(b_i\) with carry \(c_{i-1}\) to produce output \(s_i\) and next carry \(c_i\)

Ripple carry: \(O(n)\) gates, \(O(n)\) time

Carry lookahead: \(O(n)\) gates, \(O(\log n)\) time

preplan for late arrival of \(c_i\).

given \(a_i\) and \(b_i\), three possible cases for \(c_i\)

if \(a_i=b_i\), then \(c_i=a_i\) determined without \(c_{i-1}\):

*generate*\(c_1=1\) or*kill*\(c_i=0\)otherwise,

*propagate*\(c_i=c_{i-1}\)write \(x_i=k,g,p\) accordingly

consider \(3 \times 3\) “multiplication table” for effect of two adders in a row.

pair propagates previous carry only if both propagate.

\(x_i\) \(k\) \(p\) \(g\) \(k\) \(k\) \(k\) \(g\) \(x_{i-1}\) \(p\) \(k\) \(p\) \(g\) \(g\) \(k\) \(g\) \(g\) Now let \(y_0=k\), \(y_i = y_{i-1} \times x_i\)

constraints “multiplication table” by induction

\(x_i\) \(k\) \(p\) \(g\) \(k\) \(k\) \(k\) \(g\) \(y_{i-1}\) \(p\) \(k\) never \(g\) \(g\) \(k\) \(g\) \(g\) conclude: \(y_i=k\) means \(c_i=0\), \(y_i=g\) means \(c_i=1\), and \(y_i=p\) never happens

so problem reduced to computing all \(y_i\) in parallel

Parallel prefix

Build full binary tree

two gates at each node

pass up product of all children

pass down product of all \(x_i\) preceding leftmost child

works for any associative function

## PRAM

various conflict resolutions (CREW, EREW, CRCW)

\(CRCW(k) \subset EREW(k+1)\)

\(NC = \cup CRCW(k)\)

PRAMs simulate circuits, and vice versa

So \(NC\) well-defined

differences in practice

CRCW can OR in \(O(1)\) time

EREW requires \(\log n\) time (info theory lower bound)

EREW needs \(\log n\) to find max

CRCW finds max in constant time with \(n^2\) processors

Compare every pair

If an item loses, write “not max” in its entry

Check all entries

If item is max (not overwritten), write its value in answer

in \(O(\log\log n)\) time with \(n\) processors

Suppose \(k\) items remain

Make \(k^2/n\) blocks of \(n/k\) items

quadratic time max for each: \((k^2/n)(n/k)^2=n\) processors total

recurrence: \(T(k)=1+T(k^2/n)\)

\(T(n/2^i)=1+T(n/2^{2i})\)

so \(\log\log n\) iters.

parallel prefix

using \(n\) processors

list ranking EREW

next pointers \(n(x)\)

\(d(x)+=d(n(x))\); \(n(x)=n(n(x))\).

by induction, sum of values on path to end doesn’t change

# Work-Efficient Algorithms

Idea:

We’ve seen parallel algorithms that are somewhat “inefficient”

do more total work (processors times time) than sequential

Ideal solution: arrange for total work to be proportional to best sequential work

*Work-Efficient Algorithm*Then a small number of processors (or even 1) can “simulate” many processors in a fully efficient way

Parallel analogue of “cache oblivious algorithm”—you write algorithm once for many processors; lose nothing when it gets simulated on fewer.

Brent’s theorem

Different perspective on work: count number of processors actually working in each time step.

If algorithm does \(x\) total work and critical path \(t\)

Then \(p\) processors take \(x/p+t\) time

So, if use \(p=x/t\) processors, finish in time \(t\) with efficient algorithm

detail: assigning processors to tasks can be tricky if e.g. work emerges dynamically during execution

Work-efficient parallel prefix

linear sequential work

going to need \(\log n\) time

so, aim to get by with \(n/\log n\) processors

give each processor a block of \(\log n\) items to add up

reduces problem to \(n/\log n\) values

use old algorithm

each processor fixes up prefixes for its block

Work-efficient list ranking

harder: can’t easily give contiguous “blocks” of \(\log n\) to each processor (requires list ranking)

However, assume items in arbitrary order in array of \(n\) structs, so

*can*give \(\log n\) distinct items to each processor.use random coin flips to knock out “alternate” items

shortcut any item that is heads and has tails successor

requires at most one shortcut

and constant probability every other item is shortcut (and independent)

so by chernoff, 1/16 of items are shortcut out

“compact” remaining items into smaller array using parallel prefix on

**array**of pointers (ignoring list structure) to collect only “marked” nodes and update their pointerslet each processor handle \(\log n\) (arbitrary) items

\(O(n/\log n)\) processors, \(O(\log n)\) time

After \(O(\log\log n)\) rounds, number of items reduced to \(n/\log n\)

use old algorithm

result: \(O(\log n \log\log n)\) time, \(n/\log n\) processors

to improve, use faster “compaction” algorithm to collect marked nodes: \(O(\log\log n)\) time randomized, or \(O(\log n/\log\log n)\) deterministic. get optimal alg.

How about deterministic algorithm? Use “deterministic coin tossing”

take all local maxima as part of ruling set.

Euler tour to reduce to parallel prefix for

inorder traversal of tree

computing depth in tree

numbering leaves

work efficient

# Expression Tree Evaluation

plus and times nodes.

Idea

merge leaves into parents (good for bushy tree)

pointer jump degree-2 nodes (good for tall trees)

combine?

how “pointer jump” an operator?

Generalize problem:

“product lookahead”

Each tree edge has a label \((a,b)\)

meaning that if subtree below evaluates to \(y\) then value \((ay+b)\) should be passed up edge

Merging a single leaf

method for eliminating all left-child leaves

root \(q\) with right child \(p\) (product node) on edge labelled \((a_3,b_3)\)

\(p\) has left child edge \((a_1,b_1)\) leaf \(\ell\) with value \(v\)

right child edge to \(s\) with label \((a_2,b_2)\)

fold out \(p\) and \(\ell\), make \(s\) a child of \(q\)

what label of new edge?

prepare for \(s\) subtree to eval on \(y\).

choose \(a,b\) such that \(ay+b=a_3\cdot[(a_1v+b_1)\cdot(a_2y+b_2)]+b_3\)

Parallelize

keep tree full, so killing all leaves kills whole tree

collapse a leaf and pointer jump parent

problem: can’t do to both children at once

solution: number leaves in-order (Euler Tour)

three step process:

shunt odd-numbered left-child leaves

shunt odd-number right-child leaves

divide leaf-numbers by 2

guarantees never simultaneously shunt siblings

# Sorting

Parallelizing binary search

placing one item into a sorted array takes \(\log n\) time with 1 processor

but with \(k\) processors can improve to \(\log_k n\)

compare to \(k\) evenly spaced items

use concurrent OR to find out which pair I’m between in \(O(1)\) time

recurse

shows how to gain parallel speed by wasting work

in particular, can find in \(O(1)\) time with \(n\) processors

or even \(\sqrt{n}\) processors

CREW Merge sort:

merge two length-\(k\) sequences using \(k\) processors

each element of first seq. uses binary search to find place in second

so knows how many items smaller

so knows rank in merged sequence: go there

then do same for second list

\(O(\log k)\) time with \(k\) processors

handle all length-\(k\) lists in parallel with \(n\) processors

total time \(O(\sum_{i \le \lg n} \log 2^i)=O(\log^2 n)\) with \(n\) processors

Better with more processors?

use \(n^2\) processors to do all pairwise comparisons in parallel

For each item, count number of wins in parallel in \(O(\log n)\) time

This determines item’s rank in output list

Use same ideas with fewer processors?

Note can sort \(\sqrt{n}\) items in \(O(\log n)\) time

Use insights to improve merge?

Idea: nearby items in one list go to nearby places in merged list

So don’t do whole computation for every item

Break \(A\) into blocks and figure out roughly where each block goes in \(B\)

Then spread each block to exact right places in \(B\) (recursively)

Details: merge \(n\) items of \(A\) into \(m\) items of \(B\) in \(O(\log\log n)\) time using \(m+n\) processors

choose \(\sqrt{n}\) evenly spaced fenceposts \(\alpha_i\) among \(A\)

use \(\sqrt{m}\) processors to find exact location of each in \(B\)

total processors \(\sqrt{nm}\le \le \max(n,m) \le n+m\)

split \(B\) at locations of \(\alpha_i\)

now know \(\sqrt{n}\) items (between \(\alpha_i\) and \(\alpha_{i+1}\)) go in each split piece of \(B\)

So recurse: \(T(n)=2+T(\sqrt{n})=O(\log\log n)\)

Use in parallel merge sort: \(O(\log n \log\log n)\) with \(n\) processors.

Cole shows how to “pipeline” merges, get optimal \(O(\log n)\) time.

Qsort idea

\(\sqrt{n}\) pivots

sort all in \(O(\log n)\) time using \(n\) processors

binary search each item to locate position in pivots

recurse

\(T(n) = O(\log n) + T(\sqrt{n}) = O(\log n)\)

rigorous analysis messy

need to deal with variation in sizes of subproblems

# Connectivity and connected components

Linear time sequential trivial.

## Directed

Squaring adjacency matrix

\(\log n\) time to reduce diameter to 1

\(mn\) processors for first iter, but adds edges

so, \(n^3\) processors

and space to \(n^2\) even if initial graph is sparse

improvements to \(mn\) processors

But “transitive closure bottleneck” still bedevils parallel algs.

Even if just want to find \(s\)-\(t\) path we don’t know any better

Problem is that output has size \(n^2\) and we don’t know which part of it matters.

## Undirected

Basic approach:

Sets of connected vertices grouped as stars

One root, all others parent-point to it

Initially all vertices alone

Edge “live” if connects two distinct stars

Find live edges in constant time by checking roots

For live edge with roots \(u<v\), connect \(u\) as child of \(v\)

May be conflicts, but CRCW resolves

Now get stars again

Use pointer jumping

Note: may have chains of links, so need \(\log n\) jumps

Every live star attached to another

So number of stars decreases by 2

\(m+n\) processors, \(\log^2 n\) time.

Smarter: interleave hooking and jumping:

Maintain set of rooted trees

Each node points to parent

Hook some trees together to make fewer trees

Pointer jump (once) to make shallower trees

Eventually, each connected component is one star

More details:

“top” vertex: root or its children

detect top vertex in constant time

each vertex has label

find root label of each top vertex

Can detect if am star in constant time:

no pointer double reaches root

for each edge:

If ends both on top, different nonstar components, then hook smaller index component to larger

may be conflicting hooks; assume CRCW resolves

indices prevent cycles

If star points to non-star, hook it

do one pointer jump

Potential function: height of live stars and tall trees

Live stars get hooked to something (star or internal)

But never hooked to leaf. So add 1 to height of target

So sum of heights doesn’t go up

But now, every unit of height is in a tall tree

Pointer doubling decreases by at least 1/3

Total height divided each time

So \(\log n\) iterations

Summary: \(O(m+n)\) processors, \(O(\log n)\) time.

Improvements:

\(O((m+n)\alpha(m,n)/\log n)\) processors, \(\log n\) time, CRCW

Randomized \(O(\log n)\), \(O(m/\log n)\) processors, EREW

# Randomization

Randomization in parallel:

load balancing

symmetry breaking

isolating solutions

Classes:

NC: poly processor, polylog steps

RNC: with randomization. polylog runtime, monte carlo

ZNC: las vegas NC

immune to choice of R/W conflict resolution

# Sorting

Quicksort in parallel:

\(n\) processors

each takes one item, compares to splitter

count number of predecessors less than splitter

determines location of item in split

total time \(O(\log n)\)

combine: \(O(\log n)\) per layer with \(n\) processors

problem: \(\Omega(\log^2 n)\) time bound

problem: \(n\log^2 n\) work

Using many processors:

do all \(n^2\) comparisons

use parallel prefix to count number of items less than each item

\(O(\log n)\) time

or \(O(n)\) time with \(n\) processors

Combine with quicksort:

Note: single pivot step inefficient: uses \(n\) processors and \(\log n\) time.

Better: use \(\sqrt{n}\) simultaneous pivots

Choose \(\sqrt{n}\) random items and sort fully to get \(\sqrt{n}\) intervals

For all \(n\) items, use binary search to find right interval

recurse

\(T(n)=O(\log n)+T(\sqrt{n})=O(\log n + \frac12\log n+\frac14\log n + \cdots)=O(\log n)\)

Formal analysis:

consider root-leaf path to any item \(x\)

argue total number of parallel steps on path is \(O(\log n)\)

consider item \(x\)

claim splitter within \(\alpha\sqrt{n}\) on each side

since prob. not at most \((1-\alpha\sqrt{n}/n)^{\sqrt{n}} \le e^{-\alpha}\)

fix \(\gamma, d<1/\gamma\)

define \(\tau_k = d^k\)

define \(\rho_k = n^{(2/3)^k}\) \((\rho_{k+1}=\rho_k^{2/3})\)

note size \(\rho_k\) problem takes \(\gamma^k\log n\) time

note size \(\rho_k\) problem odds of having child of size \(>\rho_{k+1}\) is less than \(e^{-\rho_k^{1/6}}\)

argue at most \(d^k\) size-\(\rho_k\) problems whp

follows because probability of \(d^k\) size-\(\rho_k\) problems in a row is at most

deduce runtime \(\sum d^k\gamma_k = \sum (d\gamma)^{k}\log n = O(\log n)\)

note: as problem shrinks, allowing more divergence in quantity for whp result

minor detail: “whp” dies for small problems

OK: if problem size \(\log n\), finish in \(\log n\) time with \(\log n\) processors

# Maximal independent set

trivial sequential algorithm

inherently sequential

from node point of view: each thinks can join MIS if others stay out

randomization breaks this symmetry

Randomized idea

each node joins with some probability

all neighbors excluded

many nodes join

few phases needed

Algorithm:

all degree 0 nodes join

node \(v\) joins with probability \(1/2d(v)\)

if edge \((u,v)\) has both ends marked, unmark lower degree vertex

put all marked nodes in IS

delete all neighbors

Intuition: \(d\)-regular graph

vertex vanishes if it or neighbor gets chosen

mark with probability \(1/2d\)

prob (no neighbor marked) is \((1-1/2d)^d\), constant

so const prob. of neighbor of \(v\) marked—destroys \(v\)

what about unmarking of \(v\)’s neighbor?

prob(unmarking forced) only constant as argued above.

So just changes constants

const fraction of nodes vanish: \(O(\log n)\) phases

Implementing a phase trivial in \(O(\log n)\).

Prob chosen for IS, given marked, exceeds \(1/2\)

suppose \(w\) marked. only unmarked if higher degree neighbor marked

higher degree neighbor marked with prob. \(\le 1/2d(w)\)

only \(d(w)\) neighbors

prob. any superior neighbor marked at most \(1/2\).

For general case, define good vertices

good: at least \(1/3\) neighbors have lower degree

prob. no neighbor of good marked \(\le (1-1/2d(v))^{d(v)/3} \le e^{-1/6}\).

So some neighbor marked with prob. \(1-e^{-1/6}\)

Stays marked with prob. 1/2

deduce prob. good vertex killed exceeds \((1-e^{-1/6})/2\)

Problem: perhaps only one good vertex?

Good edges

any edge with a good neighbor

has const prob. to vanish

show half edges good

deduce \(O(\log n)\) iterations.

Proof

Let \(V_B\) be bad vertices; we count edges with both ends in \(V_B\).

direct edges from lower to higher degree \(d_i\) is indegree, \(d_o\) outdegree

if \(v\) bad, then \(d_i(v) \le d(v)/3\)

deduce \[\sum_{V_B} d_i(v) \le \frac13 \sum_{V_B} d(v)=\frac13\sum_{V_B}( d_i(v)+d_o(v))\]

so \(\sum_{V_B} d_i(v) \le \frac12 \sum_{V_B} d_o(v)\)

which means indegree can only “catch” half of outdegree; other half must go to good vertices.

more carefully,

\(d_o(v)-d_i(v) \ge \frac13(d(v))=\frac13(d_o(v)+d_i(v))\).

Let \(V_G,V_B\) be good, bad vertices

degree of bad vertices is \[\begin{aligned} 2e(V_B,V_B)+e(V_B,V_G)+e(V_G,V_B) &= & \sum_{v\in V_B} d_o(v)+d_i(v)\\ &\le &3\sum(d_o(v)-d_i(v))\\ &= &3(e(V_B,V_G)-e(V_G,V_B))\\ &\le &3(e(V_B,V_G)+e(V_G,V_B)\end{aligned}\] Deduce \(e(V_B,V_B) \le e(V_B,V_G)+e(V_G,V_B)\). result follows.

Derandomization:

Analysis focuses on edges,

so unsurprisingly, pairwise independence sufficient

not immediately obvious, but again consider \(d\)-uniform case

prob vertex marked \(1/2d\)

neighbors \(1,\ldots,d\) in increasing degree order

Let \(E_i\) be event that \(i\) is marked.

Let \(E'_i\) be \(E_i\) but no \(E_j\) for \(j<i\)

\(A_i\) event no neighbor of \(i\) chosen

Then prob eliminate \(v\) at least \[\begin{aligned} \sum \Pr[E'_i \cap A_i] &= &\sum\Pr[E'_i]\Pr[A_i \mid E'_i]\\ &\ge &\sum \Pr[E'_i]\Pr[A_i]\end{aligned}\]

Wait: show \(\Pr[A_i \mid E'_i] \ge \Pr[A_i]\)

true if independent

measure \(\Pr[\neg A_i \mid E'_i] \le \sum \Pr[E_w \mid E'_i]\) (sum over neighbors \(w\) of \(i\))

measure \[\begin{aligned} \Pr[E_w \mid E'_i] &= &\frac{\Pr[E_w \cap E']}{\Pr[E'_i]}\\ &= &\frac{\Pr[(E_w \cap \neg E_1 \cap \cdots) \cap E_i]}{\Pr[(\neg E_1 \cap \cdots) \cap E_i]} \\ &= &\frac{\Pr[E_w \cap \neg E_1 \cap \cdots \mid E_i]}{\Pr[\neg E_1 \cap \cdots \mid E_i]} \\ &\le &\frac{\Pr[E_w \mid E_i]}{1-\sum_{j\le i}\Pr[E_j \mid E_i]}\\ &= &\Theta(\Pr[E_w])\end{aligned}\] (last step assumes \(d\)-regular so only \(d\) neighbors with odds \(1/2d\))

But expected marked neighbors \(1/2\), so by Markov \(\Pr[A_i]>1/2\)

so prob eliminate \(v\) exceeds \(\sum\Pr[E'_i]=\Pr[\cup E_i]\)

lower bound as \(\sum\Pr[E_i]-\sum\Pr[E_i \cap E_j] = 1/2-d(d-1)/8d^2 > 1/4\)

so \(1/2d\) prob. \(v\) marked but no neighbor marked, so \(v\) chosen

Generate pairwise independent with \(O(\log n)\) bits

try all polynomial seeds in parallel

one works

gives deterministic \(NC\) algorithm

with care, \(O(m)\) processors and \(O(\log n)\) time (randomized)

LFMIS P-complete.

# Perfect Matching

We focus on bipartite; book does general case.

Last time, saw detection algorithm in \(\RNC\):

Tutte matrix

Sumbolic determinant nonzero iff PM

assign random values in \(1,\ldots,2m\)

Matrix Mul, Determinant in \(\NC\)

How about finding one?

If unique, no problem

Since only one nozero term, ok to replace each entry by a 1.

Remove each edge, see if still PM in parallel

multiplies processors by \(m\)

but still \(\NC\)

Idea:

make unique minimum

**weight**perfect matchingfind it

Isolating lemma: [MVV]

Family of distinct sets over \(x_1,\ldots,x_m\)

assign random weights in \(1,\ldots,2m\)

Pr(unique min-weight set)\(\ge 1/2\)

Odd: no dependence on number of sets!

(of course \(<2^m\))

Proof:

Fix item \(x_i\)

\(Y\) is min-value sets containing \(x_i\)

\(N\) is min-value sets not containing \(x_i\)

true min-sets are either those in \(Y\) or in \(N\)

how decide? Value of \(x_i\)

For \(x_i=-\infty\), min-sets are \(Y\)

For \(x_i=+\infty\), min-sets are \(N\)

As increase from \(-\infty\) to \(\infty\), single transition value when both \(X\) and \(Y\) are min-weight

If only \(Y\) min-weight, then \(x_i\) in every min-set

If only \(X\) min-weight, then \(x_i\) in no min-set

If both min-weight, \(x_i\) is

*ambiguous*Suppose no \(x_i\) ambiguous. Then min-weight set unique!

Exactly one value for \(x_i\) makes it ambiguous given remainder

So Pr(ambiguous)\(=1/2m\)

So Pr(any ambiguous)\(< m/2m=1/2\)

Usage:

Consider tutte matrix \(A\)

Assign random value \(2^{w_i}\) to \(x_i\), with \(w_i \in 1,\ldots,2m\)

Weight of matching is \(2^{\sum w_i}\)

Let \(W\) be minimum sum

Unique w/pr \(1/2\)

If so, determinant is odd multiple of \(2^W\)

Try removing edges one at a time

Edge in PM iff new determinant\(/2^W\) is even.

Big numbers? No problem: values have poly number of bits

\(NC\) algorithm open.

For exact matching, \(P\) algorithm open.