Data-Parallel Programming Models

"The world consists of arrays/matrices to be processed"
Idea

 Concurrency obtained by applying the same operation to all elements of a
data structure
 Data-parallel program = sequence of data-parallel operations
Techniques

 Decomposition of data structure/operations
 Mapping of segments to different nodes
Array Statements

Fortran 90 (F90)
 array assignment statements
 array intrinsic functions
Scalar program (F77):
integer A(10,10), B(10,10), c
integer i, j
do i = 1,10
   do j = 1,10
      A(i,j) = B(i,j) + c
   enddo
enddo
Array program (F90):
integer A(10,10), B(10,10), c
A = B+c
Processes whole arrays!
 Array/scalar operations
Array/array operations
integer A(10,10), B(10,10), C(10,10)
A = B+C
Array reduction operations
integer A(10), B
B = SUM(A)
Array sectioning
integer A(10), B(10)
A(1:7) = B(1:7)+B(2:8)
Restricted arrays
real A(10)
WHERE (A /= 0) A = 1.0/A
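These F90 array operations have close analogues in NumPy; the following is an illustrative sketch (the arrays and values are made up for the example, not from the notes):

```python
import numpy as np

B = np.arange(10, 0, -1)   # stands in for B(10): 10, 9, ..., 1
c = 5

# Array/scalar operation: A = B + c
A = B + c

# Array reduction: SUM(A)
total = A.sum()            # 105 here

# Array sectioning: A(1:7) = B(1:7) + B(2:8)  (NumPy indices are 0-based)
A[0:7] = B[0:7] + B[1:8]

# Restricted assignment: WHERE (A /= 0) A = 1.0/A
A = A.astype(float)
A[A != 0] = 1.0 / A[A != 0]
```

As in F90, each statement operates on whole arrays at once; the loops are implicit.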
Vectorization is done automatically by the compiler
(for real parallelization, see HPF below)
SIMD Model

 Single Instruction, Multiple Data
 Set of nodes all of which apply the *same* operation at the same time
in a synchronized ("lockstep") fashion
Example: Matrix Multiplication

Algorithm:
Cube of n^3 processors with coordinate axes x, y, z
[figure: cube with arrays A, B (rotated 90 degrees left), and C mapped onto its faces]
Map A(i,j) to P(j,i,k)
Map B(i,j) to P(i,k,j)
Each processor computes a single product
 P_ijk: C_ijk = A_ik * B_kj
Partial products are added along the x direction
 P_0ij: C_ij = Sum_k C_ijk
[figure: planes A(i,k), B(k,j), and C(i,j) within the cube]
Program:
int A[N,N], B[N,N], C[N,N];
plural int a, b, c;
a = A[iyproc, ixproc];
b = B[ixproc, izproc];
c = a*b;
for (i = 0; i < N-1; i++)
    if (ixproc > 0)
        c = xnetE[1].c;
    else
        c += xnetE[1].c;
if (ixproc == 0) C[iyproc, izproc] = c;
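The cube algorithm can be checked with a small NumPy simulation (an illustrative sketch, not Maspar code; n and the test matrices are arbitrary): every position (i, j, k) holds one product C_ijk = A_ik * B_kj, and the partial products are then summed along the k axis.

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
A = rng.integers(0, 10, (n, n))
B = rng.integers(0, 10, (n, n))

# Step 1: n^3 "processors" each form a single product C_ijk = A_ik * B_kj
C_ijk = A[:, None, :] * B.T[None, :, :]   # shape (n, n, n); axis 2 plays the role of k

# Step 2: partial products are reduced along k
C = C_ijk.sum(axis=2)

assert np.array_equal(C, A @ B)   # agrees with the ordinary matrix product
```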
Maspar architecture and programming language
 2D grid of nodes (arithmetic units)
 xnetE[1].var
 variable var on the right ("eastern") neighbor
 access results in communication
 scalar variables stored centrally on the host computer
 plural variables duplicated on the nodes
 ixproc, iyproc, izproc hold coordinates of current node
 plural statements (involving plural vars) executed on every node
Structure like sequential program
 single flow of control
 only one instruction executed at each time step (on many procs)
 conditionals restrict set of active processors
 first, all processors for which the condition is true execute
 then, all processors for which the condition is false execute
SPMD Model

Single Program, Multiple Data
 Same basic idea (sequential program structure)
 Relax SIMD restriction
 Different nodes may execute different operations
(depending on actual data)
High Performance Fortran (HPF)
 F90 is a subset
 Extension to SPMD language
 Data distribution directives (PROCESSORS, DISTRIBUTE)
 Data alignment directives (ALIGN)
integer A(M,N), B(N,M)
!HPF$ PROCESSORS P(32)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
!HPF$ ALIGN B(i,j) WITH A(j,i)
FORALL (I=1:M, J=1:N) A(I,J) = A(I,J) + B(J,I)
 PROCESSORS declares a (virtual) processor grid
 DISTRIBUTE maps an array onto the processor grid
 BLOCK distribution
 CYCLIC distribution
 ALIGN aligns an array with another array
 FORALL exhibits concurrency
 independent loop iterations
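The two distribution patterns assign array indices to processors as follows; a small Python sketch with hypothetical helper names (block_owner, cyclic_owner are not HPF functions):

```python
import math

def block_owner(i, n, p):
    """Processor owning element i (0-based) of an n-element
    array BLOCK-distributed over p processors."""
    block_size = math.ceil(n / p)
    return i // block_size

def cyclic_owner(i, p):
    """Processor owning element i under a CYCLIC distribution."""
    return i % p

# 8 elements over 4 processors:
#   BLOCK : owners [0, 0, 1, 1, 2, 2, 3, 3]
#   CYCLIC: owners [0, 1, 2, 3, 0, 1, 2, 3]
```

BLOCK keeps neighboring elements together (good locality); CYCLIC spreads them round-robin (good load balance).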
Alternative
 annotated DO loops
!HPF$ INDEPENDENT
do i = 1,M
   do j = 1,N
      A(i,j) = A(i,j) + B(j,i)
   enddo
enddo
Implicit parallelism:
 Simple independent do loops automatically recognized by compiler
SPMD model:
 bodies of parallel do loops may contain arbitrary code
 each parallel process executes independently of the others
do i = 1,M
   do j = 1,N
      if (A(i,j) .gt. 0) then
         A(i,j) = A(i,j) + B(j,i)
      else
         A(i,j) = A(i,j) - B(j,i)
      endif
   enddo
enddo
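Since the branch depends on the data, different iterations take different paths; in array form this becomes a masked selection. A NumPy sketch (arbitrary test data, not from the notes):

```python
import numpy as np

M, N = 3, 4
rng = np.random.default_rng(1)
A = rng.integers(-5, 6, (M, N)).astype(float)
B = rng.integers(-5, 6, (N, M)).astype(float)

# if A(i,j) > 0 then A(i,j) + B(j,i) else A(i,j) - B(j,i)
A2 = np.where(A > 0, A + B.T, A - B.T)
```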
For more details on HPF, see the handouts from Foster, Chapter 7.
Functional Languages

Backus 1978, Turing Award Lecture
"Can Programming be Liberated from the von Neumann Style? A Functional Style
and Its Algebra of Programs"
Abandon "word-at-a-time programming"
in favor of "combining forms"
FP (Functional Programming System)

 Objects (atoms, or sequences <x1, ..., xn>)
 Functions
alpha f:<x1, ..., xn> = <f:x1, ..., f:xn>
/f:<x1, ..., xn> = f:<x1, f:<x2, ..., f:<x(n-1), xn> ... >>
(f o g):x = f:(g:x)
trans:<<x11, ..., x1m>, ..., <xn1, ..., xnm>> = <<x11, ..., xn1>, ..., <x1m, ..., xnm>>
Example:
Inner product of vectors
 IP = (/ +) o (alpha *) o trans
Matrix multiplication
 MM = (alpha alpha IP) o (alpha distl) o distr o [1, trans o 2]
distl:<y, <z1, ..., zn>> = <<y, z1>, ..., <y, zn>>
distr:<<y1, ..., yn>, z> = <<y1, z>, ..., <yn, z>>
[f1, ..., fn]:x = <f1:x, ..., fn:x>
i:<x1, ..., xn> = x_i
Program = Op o Op o ... o Op
 sequence of stages
 each stage potentially allows concurrency (alpha, /, ....)
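The combining forms map directly onto higher-order functions; a Python sketch (the names alpha_, insert_, compose are hypothetical stand-ins for alpha, /, and o):

```python
from functools import reduce

def alpha_(f):            # alpha f: apply-to-all
    return lambda xs: [f(x) for x in xs]

def insert_(f):           # /f: insert (reduction)
    return lambda xs: reduce(f, xs)

def compose(*fs):         # f o g o ...: right-to-left composition
    def composed(x):
        for f in reversed(fs):
            x = f(x)
        return x
    return composed

def trans(xss):           # transpose a sequence of sequences
    return [list(t) for t in zip(*xss)]

def mul(pair):
    x, y = pair
    return x * y

# IP = (/+) o (alpha *) o trans
IP = compose(insert_(lambda a, b: a + b), alpha_(mul), trans)
```

IP([[1, 2, 3], [4, 5, 6]]) transposes to pairs, forms the element-wise products [4, 10, 18], then sums them to 32; each stage is a potential source of concurrency.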
Starting point of broad interest in functional languages and
their use for parallelism.
Connection Machine Lisp

Steele, Hillis, TMC
Data type "xapping" (associative mapping)
 unordered set of (index->value) pairs
x = {sky->blue, apple->red, grass->green}
 indexing
(xref x 'apple) => red
 xectors
[1, 2, 4] = {0->1, 1->2, 2->4}
 apply-to-all
(alpha f '[1,2,4]) => [(f 1), (f 2), (f 4)]
 gathering
(beta + '[1,2,3]) => 6
 ....
(defun innerproduct (f g)
  #'(lambda (p q) (beta f (alpha funcall g p q))))
(funcall (innerproduct #'+ #'*) '[1 2 3] '[4 5 6]) => 32
Used for implementing dataparallel symbolic programs on the Connection Machine
CM-2 (SIMD type, 65,536 single-bit processors)
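A xapping is essentially a finite map, so its operators can be sketched with Python dicts (alpha_map and beta_reduce are hypothetical names standing in for alpha and beta):

```python
from functools import reduce

def alpha_map(f, *xappings):
    """Apply f elementwise, over the indices present in all arguments."""
    common = set(xappings[0]).intersection(*xappings[1:])
    return {i: f(*(x[i] for x in xappings)) for i in common}

def beta_reduce(f, xapping):
    """Gathering: combine all values of a xapping with f."""
    return reduce(f, xapping.values())

x = {"sky": "blue", "apple": "red", "grass": "green"}   # a xapping
# indexing: (xref x 'apple) => red  corresponds to  x["apple"]

# inner product as in the CM Lisp example
p = {0: 1, 1: 2, 2: 3}   # xector [1 2 3]
q = {0: 4, 1: 5, 2: 6}   # xector [4 5 6]
ip = beta_reduce(lambda a, b: a + b, alpha_map(lambda a, b: a * b, p, q))
```

Here ip combines the element-wise products 4, 10, 18 into 32, mirroring the beta/alpha pipeline.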
NESL

A nested data-parallel language (Guy Blelloch, CMU)
Parallel operations on sequences
{negate(a): a in [3, -4, -9, 5]} =>
[-3, 4, 9, -5]
{negate(a): a in [3, -4, -9, 5] | a < 4} =>
[-3, 4, 9]
{a+b: a in [3, -4, -9, 5]; b in [1, 2, 3, 4]}
=> [4, -2, -6, 9]
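These sequence comprehensions behave much like Python list comprehensions, except that Python evaluates them sequentially while NESL may evaluate the elements in parallel (illustrative data matching the examples above):

```python
def negate(a):
    return -a

r1 = [negate(a) for a in [3, -4, -9, 5]]                     # [-3, 4, 9, -5]
r2 = [negate(a) for a in [3, -4, -9, 5] if a < 4]            # [-3, 4, 9]
r3 = [a + b for a, b in zip([3, -4, -9, 5], [1, 2, 3, 4])]   # [4, -2, -6, 9]
```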
Various operations on sequences
 sequence expressions like above
 op_scan(a) (prefix scan of a with op)
....
Nested parallelism
{sum(v): v in [a, b, c]}
 parallelism within each sum
 parallelism among three instances of sum
 recursive activations possible
function qsort(a) =
if (#a < 2) then a
else
  let pivot   = a[#a/2];
      lesser  = {e in a | e < pivot};
      equal   = {e in a | e = pivot};
      greater = {e in a | e > pivot};
      result  = {qsort(v): v in [lesser, greater]}
  in result[0] ++ equal ++ result[1]
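The same algorithm in Python (a sequential sketch; in NESL the three filters and the two recursive calls in the comprehension may run in parallel):

```python
def qsort(a):
    if len(a) < 2:               # #a < 2
        return a
    pivot = a[len(a) // 2]
    lesser  = [e for e in a if e < pivot]
    equal   = [e for e in a if e == pivot]
    greater = [e for e in a if e > pivot]
    # nested parallelism: both recursive sorts could proceed concurrently
    result = [qsort(v) for v in [lesser, greater]]
    return result[0] + equal + result[1]   # ++ is sequence append
```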
General

Easiest-to-use form of parallelism
 conceptually a single control flow
 works only on independent computation/data segments
 independent DO loops
 array expressions and the like
 well suited to the functional form of parallelism