Data-Parallel Programming Models
--------------------------------
"The world consists of arrays/matrices to be processed"

Idea
----
- Concurrency is obtained by applying the same operation to all elements of a data structure
- Data-parallel program = sequence of data-parallel operations

Techniques
----------
- Decomposition of data structures/operations
- Mapping of segments to different nodes

Array Statements
----------------
Fortran 90 (F90)
- array assignment statements
- array intrinsic functions

Scalar program (F77):

    integer A(10,10), B(10,10), c
    integer i,j
    do i = 1,10
      do j = 1,10
        A(i,j) = B(i,j)+c
      enddo
    enddo

Array program (F90):

    integer A(10,10), B(10,10), c
    A = B+c

Processes whole arrays!

- Array/scalar operations (as above)

- Array/array operations

    integer A(10,10), B(10,10), C(10,10)
    A = B+C

- Array reduction operations

    integer A(10), B
    B = SUM(A)

- Array sectioning

    integer A(10), B(10)
    A(1:7) = B(1:7)+B(2:8)

- Masked ("restricted") array assignment

    real A(10)
    WHERE(A /= 0) A = 1.0/A

Vectorization is performed automatically by the compiler (for real parallelization see HPF below).

SIMD Model
----------
- Single Instruction, Multiple Data
- Set of nodes, all of which apply the *same* operation at the same time, in a synchronized ("lock-step") fashion

Example: Matrix Multiplication
------------------------------
Algorithm: cube of n^3 processors P(x,y,z)

[Figure: n x n x n processor cube; A(i,k) lies in the x-y face, B(k,j) is rotated 90 degrees into the x-z face, and the result C(i,j) accumulates in the y-z face at x = 0]

- Map A(i,j) to P(j,i,k) for all k (replicated along z)
- Map B(i,j) to P(i,k,j) for all k (replicated along y)
- Each processor computes a single product
  - P(k,i,j): C_ijk = A_ik * B_kj
- Partial products are added along the x-direction
  - P(0,i,j): C_ij = Sum_k C_ijk

Program:

    int A[N][N], B[N][N], C[N][N];
    plural int a, b, c;
    int i;

    a = A[iyproc][ixproc];
    b = B[ixproc][izproc];
    c = a*b;
    for (i = 0; i < N-1; i++)
      if (ixproc == 0)
        c += xnetE[1].c;   /* column 0 accumulates; true branch runs first */
      else
        c = xnetE[1].c;    /* other columns shift their values westward */
    if (ixproc == 0)
      C[iyproc][izproc] = c;

Maspar architecture and programming language
- 2D grid of nodes (arithmetic units)
- xnetE[1].var - variable var on the right ("eastern") neighbor
  - an access results in communication
- scalar variables are kept centrally on the host computer
- plural
variables are duplicated on the nodes
- ixproc, iyproc, izproc hold the coordinates of the current node
- plural statements (those involving plural variables) are executed on every node

Structure like a sequential program
- single flow of control
- only one instruction is executed at each time step (on many processors)
- conditionals restrict the set of active processors
  - first, all processors for which the condition is true execute
  - then, all processors for which the condition is false execute

SPMD Model
----------
Single Program, Multiple Data
- Same basic idea (sequential program structure)
- Relaxes the SIMD restriction
- Different nodes may execute different operations (depending on the actual data)

High Performance Fortran (HPF)
- F90 is a subset
- Extension to an SPMD language
- Data distribution directives (PROCESSORS, DISTRIBUTE)
- Data alignment directives (ALIGN)

          integer A(M,N), B(N,M)
    !HPF$ PROCESSORS P(32)
    !HPF$ DISTRIBUTE A(BLOCK,*) ONTO P
    !HPF$ ALIGN B(i,j) WITH A(j,i)
          FORALL (I=1:M, J=1:N) A(I,J) = A(I,J) + B(J,I)

- PROCESSORS declares a (virtual) processor grid
- DISTRIBUTE maps an array onto the processor grid
  - BLOCK distribution
  - CYCLIC distribution
- ALIGN aligns an array with another array
- FORALL exhibits concurrency
  - independent loop iterations

Alternative - annotated DO loops

    !HPF$ INDEPENDENT
          do i = 1,M
            do j = 1,N
              A(i,j) = A(i,j) + B(j,i)
            enddo
          enddo

Implicit parallelism:
- simple independent do loops are automatically recognized by the compiler

SPMD model:
- bodies of parallel do loops may contain arbitrary code
- each parallel process executes independently of the others

    do i = 1,M
      do j = 1,N
        if (A(i,j).gt.0) then
          A(i,j) = A(i,j) + B(j,i)
        else
          A(i,j) = A(i,j) - B(j,i)
        endif
      enddo
    enddo

For more details on HPF, see the handouts from Foster, Chapter 7.

Functional Languages
--------------------
Backus 1978, Turing Award Lecture
"Can Programming be Liberated from the von Neumann Style?
A Functional Style and Its Algebra of Programs"

Abandon "word-at-a-time programming" in favor of "combining forms"

FP (Functional Programming System)
----------------------------------
- Objects (atoms or sequences)
- Functions

    alpha f : <x1,...,xn> = <f:x1, ..., f:xn>             (apply-to-all)
    /f : <x1,...,xn> = f(x1, f(x2, ... f(x_(n-1), xn)))   (reduction)
    (f o g) : x = f:(g:x)                                 (composition)
    trans : <<x11,...,x1n>, ..., <xm1,...,xmn>>
          = <<x11,...,xm1>, ..., <x1n,...,xmn>>           (transposition)

Example: Inner product of vectors
- IP = (/+) o (alpha *) o trans

Matrix multiplication
- MM = (alpha alpha IP) o (alpha distl) o distr o [1, trans o 2]

    distl : <y, <x1,...,xn>> = <<y,x1>, ..., <y,xn>>
    distr : <<x1,...,xn>, z> = <<x1,z>, ..., <xn,z>>
    [f1,...,fn] : x = <f1:x, ..., fn:x>                   (construction)
    i : <x1,...,xn> = x_i                                 (selection)

Program = Op o Op o ... o Op
- sequence of stages
- each stage potentially allows concurrency (alpha, /, ...)

Starting point of large interest in functional languages and their use for parallelism.

Connection Machine Lisp
-----------------------
Steele, Hillis, TMC

Data type "xapping" (associative mapping)
- unordered set of (index->value) pairs
    x = {sky->blue, apple->red, grass->green}
- indexing:      (xref x 'apple) => red
- xectors:       [1 2 4] = {0->1, 1->2, 2->4}
- apply-to-all:  (alpha f '[1 2 4]) => [(f 1) (f 2) (f 4)]
- gathering:     (beta + '[1 2 3]) => 6
- ...

    (defun inner-product (f g p q)
      (reduce f (alpha funcall g p q)))

    (inner-product #'+ #'* '[1 2 3] '[4 5 6]) => 32

Used for implementing data-parallel symbolic programs on the Connection Machine CM-2 (SIMD type, 65536 single-bit processors)

NESL
----
A nested data-parallel language (Guy Blelloch, CMU)

Parallel operations on sequences

    {negate(a): a in [3, 4, -9, 5]}               => [-3, -4, 9, -5]
    {negate(a): a in [3, 4, -9, 5] | a < 4}       => [-3, 9]
    {a+b: a in [3, 4, -9, 5]; b in [1, 2, 3, 4]}  => [4, 6, -6, 9]

Various operations on sequences
- sequence expressions like the above
- op_scan(a) (parallel prefix of a with op), sum(a) (reduction), ...
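The sequence operations above can be sketched in plain Python as a sequential stand-in (`apply_to_all` and `plus_scan` are hypothetical helper names chosen here for illustration; real NESL evaluates each element in parallel, and its plus_scan is the exclusive prefix sum):

```python
def apply_to_all(f, seq, cond=None):
    """{f(a): a in seq | cond(a)} -- apply-to-all with optional filter."""
    return [f(a) for a in seq if cond is None or cond(a)]

def plus_scan(seq):
    """Exclusive prefix sum: result[i] = seq[0] + ... + seq[i-1]."""
    out, acc = [], 0
    for x in seq:
        out.append(acc)
        acc += x
    return out

print(apply_to_all(lambda a: -a, [3, 4, -9, 5]))              # [-3, -4, 9, -5]
print(apply_to_all(lambda a: -a, [3, 4, -9, 5],
                   cond=lambda a: a < 4))                     # [-3, 9]
print([a + b for a, b in zip([3, 4, -9, 5], [1, 2, 3, 4])])   # [4, 6, -6, 9]
print(plus_scan([1, 2, 3, 4]))                                # [0, 1, 3, 6]
```

In the data-parallel reading, each list comprehension is one lock-step operation over all elements at once; the sequential loop in `plus_scan` would be replaced by a logarithmic-depth parallel scan.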
Nested parallelism

    {sum(v): v in [a, b, c]}

- parallelism within each sum
- parallelism among the three instances of sum
- recursive activations are possible

    function qsort(a) =
      if (#a < 2) then a
      else
        let pivot   = a[#a/2];
            lesser  = {e in a | e < pivot};
            equal   = {e in a | e = pivot};
            greater = {e in a | e > pivot};
            result  = {qsort(v): v in [lesser, greater]}
        in result[0] ++ equal ++ result[1]

General
-------
Easiest-to-use form of parallelism
- conceptually a single control flow
- applies only to independent computation/data segments
  - independent DO loops
  - array expressions and the like
- well suited to a functional form of parallelism
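As an illustrative sketch (not NESL itself), the nested-parallel qsort above translates almost line for line into sequential Python; the comprehension over [lesser, greater] is exactly where NESL would run the two recursive calls as nested parallel work:

```python
def qsort(a):
    """Sequential mimic of the NESL qsort above."""
    if len(a) < 2:
        return a
    pivot = a[len(a) // 2]
    # each of these comprehensions is one flat data-parallel step in NESL
    lesser  = [e for e in a if e < pivot]
    equal   = [e for e in a if e == pivot]
    greater = [e for e in a if e > pivot]
    # {qsort(v): v in [lesser, greater]} -- the nested-parallel step:
    # both recursive sorts are independent of each other
    result = [qsort(v) for v in [lesser, greater]]
    return result[0] + equal + result[1]

print(qsort([5, 3, -9, 5, 4]))  # [-9, 3, 4, 5, 5]
```

The point of the nested model is that parallelism inside each recursive call and parallelism between the two calls compose, so the available work stays large even as the subproblems shrink.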