Single program, multiple data

{{Short description|Computing technique used to achieve parallelism}} {{redirect|SPMD|the membrane technology|Semipermeable membrane device}} {{missing information|GPUs|date=November 2019}}  In computing, '''single program, multiple data''' ('''SPMD''') is a term that has been used to refer to computational models for exploiting parallelism whereby multiple processors cooperate in the execution of a program in order to obtain results faster.

The term SPMD was introduced in 1983 and was used to denote two different computational models:

# by Michel Auguin (University of Nice Sophia-Antipolis) and François Larbey (Thomson/Sintra),<ref name=":0">{{Cite book |author1=M. Auguin |author2=F. Larbey |chapter=OPSILA: an advanced SIMD for numerical analysis and signal processing |title=Microprocessing and microprogramming: EUROMICRO ; Symposium. proceedings ; Microcomputers: developments in industry, business and education. 13-16 Sep 1983 |date=1983 |location=Amsterdam |publisher=North-Holland |isbn=0-444-86742-2}}</ref><ref name=":1">{{Cite book |author1=M. Auguin |author2=F. Labrey |chapter=A Multi-processor SIMD Machine: OPSILA |title=Microcomputers, usage and design |date=1985 |publisher=North-Holland |location=Amsterdam |isbn=0-444-87835-1 |editor1=K. Waldschmidt |editor2=B. Myhrhaug}}</ref><ref name=":2">{{Cite journal |date=1987 |title=Experience Using a SIMD/SPMD Multiprocessor Architecture |journal=Multiprocessing and Microprogramming |volume=21 |issue=1–5 |pages=171–178|doi=10.1016/0165-6074(87)90034-2 |last1=Auguin |first1=M. |last2=Boeri |first2=F. |last3=Dalban |first3=J.P |last4=Vincent-Carrefour |first4=A. }}</ref> as a "''fork-and-join''" and data-parallel approach where the ''parallel tasks ("single program")'' are split-up and run simultaneously ''in lockstep'' on multiple ''SIMD processors'' with different inputs, and # by Frederica Darema (IBM),<ref name=":3">{{cite book | last1=Darema | first1=Frederica | title=Recent Advances in Parallel Virtual Machine and Message Passing Interface | chapter=The SPMD Model: Past, Present and Future | series=Lecture Notes in Computer Science | date=2001 | volume=2131 | page=1 | doi=10.1007/3-540-45417-9_1 | isbn=978-3-540-42609-7 }}</ref><ref name=":4">{{Cite report |author1=F. Darema-Rogers |author2=D. A. George |author3=V. A. Norton |author4=G. F. Pfister |date=23 January 1985|chapter=A VM Parallel Environment |title=IBM Research Report RC 11225 |publisher=IBM T. J. Watson Research Center}}</ref><ref name=":5">{{Cite journal |title=A single-program-multiple-data computational model for EPEX/FORTRAN |journal=Journal of Parallel Computing |volume=7 |year=1988 |pages=11–24|doi=10.1016/0167-8191(88)90094-4 |last1=Darema |first1=F. |last2=George |first2=D.A. |last3=Norton |first3=V.A. |last4=Pfister |first4=G.F. }}</ref> where "''all (processors) processes begin executing the same program... but through synchronization directives ... self-schedule themselves to execute different instructions and act on different data''" and enabling MIMD parallelization of a given program, and is a more general approach than data-parallel and more efficient than the fork-and-join for parallel execution on general purpose multiprocessors.

The (IBM) SPMD is the most common style of parallel programming and can be considered a subcategory of MIMD in that it refers to MIMD execution of a given ("single") program.<ref name=":6">{{Cite journal |last=Flynn |first=Michael |title=Some Computer Organizations and Their Effectiveness |url=https://www.cs.utah.edu/~hari/teaching/paralg/Flynn72.pdf |journal=IEEE Transactions on Computers |date=1972 |volume=C-21 |issue=9 |pages=948–960|doi=10.1109/TC.1972.5009071 |s2cid=18573685 }}</ref> It is also a prerequisite for research concepts such as active messages and distributed shared memory.{{Flynn's Taxonomy}}

==SPMD vs SIMD== thumb|An example of "Single program, multiple data" In SPMD parallel execution, multiple autonomous processors simultaneously execute the same program at independent points, rather than in the lockstep that SIMD or SIMT imposes on different data. In SIMD the same operation (instruction) is applied on multiple data to manipulate data streams (a version of SIMD is vector processing where the data are organized as vectors).

Unlike SIMD, SPMD does not require special support from the processor it's run on, be it CPUs or GPUs.

SPMD and SIMD are not mutually exclusive: each SPMD program can include SIMD, or vector, or GPU sub-processing. Many CPUs include multiple SIMD-capable cores, each of which can participate in SPMD; the same applies to many GPUs containing several SIMD "streams". SPMD has been used for parallel programming of both message passing and shared-memory machine architectures.

== Operation ==

===Distributed memory=== On distributed memory computer architectures, SPMD implementations usually employ message passing programming. A distributed memory computer consists of a collection of interconnected, independent computers, called nodes. For parallel execution, each node starts its own program and communicates with other nodes by sending and receiving messages, calling send/receive routines for that purpose. Other ''parallelization directives'' such as Barrier synchronization may also be implemented by messages. The messages can be sent by a number of communication mechanisms, such as TCP/IP over Ethernet, or specialized high-speed interconnects such as InfiniBand or Omni-Path. For distributed memory environments, serial sections of the program can be implemented by identical computation of the serial section on all nodes rather than computing the result on one node and sending it to the others, if that improves performance by reducing communication overhead.

Nowadays, the programmer is isolated from the details of the message passing by standard interfaces, such as PVM and MPI.

Distributed memory is the programming style used on parallel supercomputers from homegrown Beowulf clusters to the largest clusters on the Teragrid, as well as present GPU-based supercomputers.

===Shared memory=== On a shared memory machine (a computer with several interconnected CPUs that access the same memory space), the sharing can be implemented in the context of either physically shared memory or logically shared (but physically distributed) memory; in addition to the shared memory, the CPUs in the computer system can also include local (or private) memory. For either of these contexts, synchronization can be enabled with hardware enabled primitives (such as compare-and-swap, or fetch-and-add. For machines that do not have such hardware support, locks can be used and data can be "exchanged" across processors (or, more generally, ''processes'' or threads) by depositing the sharable data in a shared memory area. When the hardware does not support shared memory, packing the data as a "message" is often the most efficient way to program (logically) shared memory computers with large number of processors, where the physical memory is local to processors and accessing the memory of another processor takes longer. SPMD on a shared memory machine can be implemented by standard processes (heavyweight) or threads (lightweight).

Shared memory multiprocessing (both symmetric multiprocessing, SMP, and non-uniform memory access, NUMA) presents the programmer with a common memory space and the possibility to parallelize execution. With the (IBM) SPMD model the cooperating processors (or processes) take different paths through the program, using parallel directives (''parallelization and synchronization directives'', which can utilize compare-and-swap and fetch-and-add operations on shared memory synchronization variables), and perform operations on data in the shared memory ("shared data"); the processors (or processes) can also have access and perform operations on data in their local memory ("private data"). In contrast, with fork-and-join approaches, the program starts executing on one processor and the execution splits in a parallel region, which is started when parallel directives are encountered; in a parallel region, the processors execute a parallel task on different data. A typical example is the parallel DO loop, where different processors work on separate parts of the arrays involved in the loop. At the end of the loop, execution is synchronized (with soft- or hard-barriers<ref name=":5" />), and processors (processes) continue to the next available section of the program to execute. The (IBM) SPMD has been implemented in the current standard interface for shared memory multiprocessing, OpenMP, which uses multithreading, usually implemented by lightweight processes, called threads.

===Combination of levels of parallelism=== Current computers allow exploiting many parallel modes at the same time for maximum combined effect. A distributed memory program using MPI may run on a collection of nodes. Each node may be a shared memory computer and execute in parallel on multiple CPUs using OpenMP. Within each CPU, SIMD vector instructions (usually generated automatically by the compiler) and superscalar instruction execution (usually handled transparently by the CPU itself), such as pipelining and the use of multiple parallel functional units, are used for maximum single CPU speed.

== Implementations == MPI is commonly used to implement SPMD. As mentioned earlier it is suited for distributed memory systems (multiple machines) but also works on shared-memory scenarios (multiple cores).<ref>{{cite journal |last1=Strout |first1=M.M. |last2=Kreaseck |first2=B. |last3=Hovland |first3=P.D. |title=Data-Flow Analysis for MPI Programs |date=2006 |pages=175–184 |doi=10.1109/ICPP.2006.32}}</ref>

=== GPUs and other accelerators === Most graphics shaders are written in a SPMD programming model: the code describes an operation on a single element. The code is then turned into parallel code by the shader compiler using whatever parallelism facilities the hardware offers (multiple units of SIMD in the case of most GPUs, multiple units of SIMT in the case of Nvidia GPUs). CUDA likewise follows an SPMD/SIMT model.<ref name="pharr">{{cite web |title=The story of ispc: origins (part 1) |url=https://pharr.org/matt/blog/2018/04/18/ispc-origins |website=pharr.org |date=2018}}</ref> When targeting SIMD hardware, control flow is typically mapped onto SIMD operations by predication, which restricts what portions of a vector register is changed using a mask.<ref name=ipsc/><ref name=ipsc2>{{cite conference |last1=Pharr |first1=Matt |last2=Mark |first2=William R. |title=ispc: A SPMD compiler for high-performance CPU programming |date=May 2012 |pages=1–13 |doi=10.1109/InPar.2012.6339601|conference=2012 Innovative Parallel Computing (InPar)|url=https://ispc.github.io/papers/ispc_inpar_2012.pdf}}</ref> GPUs generally do not have a ''unified'' address space; instead, there are several levels of memory available, only some of which are shared among shader programs.<ref>{{cite web |title=WebGPU Shading Language |url=https://www.w3.org/TR/WGSL/#address-space |website=www.w3.org |language=en}}</ref>

In the machine learning libraries Jax and PyTorch, SPMD is used to distribute the work (''shard'') over multiple devices, either automatically or manually.<ref>{{cite web |title=PyTorch/XLA SPMD User Guide — PyTorch/XLA master documentation |url=https://docs.pytorch.org/xla/master/spmd.html |website=docs.pytorch.org}}</ref><ref>{{cite web |title=Introduction to parallel programming — JAX documentation |url=https://docs.jax.dev/en/latest/sharded-computation.html |website=docs.jax.dev}}</ref> By having all devices run what is functionally the same program, automatic work distribution becomes much easier and the need for cross-device communication is reduced.<ref>{{Cite| publisher = arXiv| doi = 10.48550/arXiv.2105.04663| last1 = Xu| first1 = Yuanzhong| last2 = Lee| first2 = HyoukJoong| last3 = Chen| first3 = Dehao| last4 = Hechtman| first4 = Blake| last5 = Huang| first5 = Yanping| last6 = Joshi| first6 = Rahul| last7 = Krikun| first7 = Maxim| last8 = Lepikhin| first8 = Dmitry| last9 = Ly| first9 = Andy| last10 = Maggioni| first10 = Marcello| last11 = Pang| first11 = Ruoming| last12 = Shazeer| first12 = Noam| last13 = Wang| first13 = Shibo| last14 = Wang| first14 = Tao| last15 = Wu| first15 = Yonghui| last16 = Chen| first16 = Zhifeng| title = GSPMD: General and Scalable Parallelization for ML Computation Graphs| date = 2021-12-23}}</ref>

Clang offers an SPMD code-generation mode for its OpenMP offloading support in addition to the regular mode.<ref name="clang14omp">{{cite web |title=OpenMP Support — Clang 14.0.0 documentation |url=https://releases.llvm.org/14.0.0/tools/clang/docs/OpenMPSupport.html |website=releases.llvm.org}}</ref><ref>{{cite web |title=LLVM/OpenMP Runtimes |url=https://github.com/llvm/llvm-project/blob/llvmorg-23-init/openmp/docs/design/Runtimes.rst |website=GitHub |language=en}}</ref> Clang's OpenCL part considers a target to be ''SPMD'' if the hardware is able to spawn multiple work-items on-the-fly.<ref>{{cite web |title=Clang Compiler User's Manual |url=https://github.com/llvm/llvm-project/blob/llvmorg-23-init/clang/docs/UsersManual.rst |quote=For "non-SPMD" targets which cannot spawn multiple work-items on the fly using hardware, which covers practically all non-GPU devices such as CPUs and DSPs}}</ref>

=== Single CPU core (SPMD on SIMD) === Intel IPSC (Implicit SPMD Program Compiler) is an open-source compiler for SPMD programs written in a dialect of C. It turns input programs, which are written in a seemingly single-threaded SPMD model, into efficient x86 (SSE2 to AVX512) or ARM (NEON) SIMD code or Intel GPU SIMD code.<ref name=ipsc>{{cite web |title=Intel® ISPC User's Guide § The ISPC Parallel Execution Model|url=https://ispc.github.io/ispc.html#the-ispc-parallel-execution-model |website=ispc.github.io}}</ref> Most of IPSC was written by Matt Pharr. According to him, IPSC was designed to produce a compiler that generates good wide-vector code for Larrabee-like architectures. Automatic vectorization proved too fragile to use reliably and a shader-like solutution was sought.<ref name="pharr"/><ref name=ipsc2/>

The NSIMD library offers an SPMD interface, similar in concept to IPSC's. It targets scalar, x86 (SSE2 to AVX-512) SIMD, ARM (NEON or SVE) SIMD, POWERPC VMX/VSX SIMD, CUDA, ROCm, and OneAPI.<ref>{{cite web |title=NSIMD documentation |url=https://agenium-scale.github.io/nsimd/ |website=agenium-scale.github.io}}</ref>

SPMD-on-SIMD (similar to IPSC) has been academically demonstrated on LLVM,<ref>{{cite conference |last1=Kandiah |first1=Vijay |last2=Lustig |first2=Daniel |last3=Villa |first3=Oreste |last4=Nellans |first4=David |last5=Hardavellas |first5=Nikos |title=Parsimony: Enabling SIMD/Vector Programming in Standard Compiler Flows |date=17 February 2023 |pages=186–198 |doi=10.1145/3579990.3580019 |url=https://research.nvidia.com/publication/2023-02_parsimony-enabling-simdvector-programming-standard-compiler-flows|conference=CGO 2023: Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimizati}}</ref><ref>{{cite conference |last1=Kruppe |first1=Robin |last2=Oppermann |first2=Julian |last3=Sommer |first3=Lukas |last4=Koch |first4=Andreas |title=Extending LLVM for Lightweight SPMD Vectorization: Using SIMD and Vector Instructions Easily from Any Language |date=February 2019 |pages=278–279 |doi=10.1109/CGO.2019.8661165 |conference=2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)}}</ref> but none has been accepted into official LLVM as of March 2026.

==History== The acronym SPMD for "Single-Program Multiple-Data" has been used to describe two different computational models for exploiting parallel computing, and this is due to both terms being natural extensions of Flynn's taxonomy.<ref name=":6" /> The two respective groups of researchers were unaware of each other's use of the term SPMD to independently describe different models of parallel programming.

The term SPMD was proposed first in 1983 by Michel Auguin (University of Nice Sophia-Antipolis) and François Larbey (Thomson/Sintra) in the context of the OPSILA parallel computer and in the context of a fork-and-join and data parallel computational model approach.<ref name=":0" /> This computer consisted of a master (controller processor) and SIMD processors (or vector processor mode as proposed by Flynn). In Auguin's SPMD model, the same (parallel) task ("''same program''") is executed on different (SIMD) processors ("''operating in lock-step mode''"<ref name=":0" /> acting on a part ("slice") of the data-vector. Specifically, their 1985 paper<ref name=":1" /> and others<ref name=":2" /><ref name=":0" /> stated:<blockquote>We consider the SPMD (Single Program, Multiple Data) operating mode. This mode allows simultaneous execution of the same task (one per processor) but prevents data exchange between processors. Data exchanges are only performed under SIMD mode by means of vector assignments. We assume synchronizations are summed-up to switchings (sic) between SIMD and SPMD operatings [sic] modes using global fork-join primitives.</blockquote>Starting around the same timeframe (in late 1983 – early 1984), the SPMD term was proposed by Frederica Darema (at IBM at that time, and part of the RP3 group) to define a different SPMD computational model that she proposed,<ref name=":5" /><ref name=":4" /><ref name=":3" /> as a programming model which in the intervening years has been applied to a wide range of general-purpose high-performance computers (including RP3 - the 512-processor IBM Research Parallel Processor Prototype) and has led to the current parallel computing standards. The (IBM) SPMD programming model assumes a multiplicity of processors which operate cooperatively, all executing the same program but can take different paths through the program based on parallelization directives embedded in the program:<ref name=":5" /><ref name=":4" /><ref name=":3" /><ref name=":7">{{cite book | last1=Darema | first1=Frederica | title=Supercomputing | chapter=Applications environment for the IBM Research Parallel Processor Prototype (RP3) | series=Lecture Notes in Computer Science | date=1988 | volume=297 | pages=80–95 | doi=10.1007/3-540-18991-2_6 | isbn=978-3-540-18991-6 }}</ref><ref name=":8">{{Cite report |last=Darema |first=Frederica |date=1986 |chapter=Parallel Applications Development for Shared Memory Systems |title=IBM Research Report RC12229 |publisher=IBM T. J. Watson Research Center |location=Yorktown Heights, NY}}</ref><blockquote>All processes participating in the parallel computation are created at the beginning of the execution and remain in existence until the end ... [the processors/processes] execute different instructions and act on different data ... the job [(work)] to be done by each process is allocated dynamically ... [i.e. the processes] self-schedule themselves to execute different instructions and act on different data [thus self-assign themselves to cooperate in execution of serial and parallel tasks (as well as replicate tasks) in the program.]</blockquote>The notion ''process'' generalized the term ''processor'' in the sense that multiple processes can execute on a processor (to for example exploit larger degrees of parallelism for more efficiency and load-balancing). The (IBM) SPMD model was proposed by Darema as an approach different and more efficient than the fork-and-join that was pursued by all others in the community at that time; it is also more general than just "data-parallel" computational model and can encompass fork-and-join (as a subcategory implementation). The original context of the (IBM) SPMD was the RP3 computer (the 512-prosessor IBM Research Parallel Processor Prototype), which supported general purpose computing, with both distributed and (logically) shared memory.<ref name=":7" /> The (IBM) SPMD model was implemented by Darema and IBM colleagues into the EPEX (Environment for Parallel Execution), one of the first prototype programming environments.<ref name=":5" /><ref name=":4" /><ref name=":3" /><ref name=":7" /><ref name=":8" /><ref>{{Cite report |author1=J. M. Stone |author2=F. Darema-Rogers |author3=V. A. Norton |author4=G. F. Pfister |date=30 September 1985 |chapter=Introduction to the VM/EPEX Fortran Preprocessor |title=IBM Research Report RC11407 |publisher=IBM T. J. Watson Research Center |location=Yorktown Heights, NY}}</ref> The effectiveness of the (IBM) SPMD was demonstrated for a wide class of applications,<ref name=":7" /><ref name=":3" /> and was implemented in the IBM FORTRAN in 1988,<ref>{{Cite journal |date=1988 |title=IBM Parallel FORTRAN |journal=IBM Systems Journal |volume=27 |issue=4 |pages=416–435|doi=10.1147/sj.274.0416 |last1=Toomey |first1=L. J. |last2=Plachy |first2=E. C. |last3=Scarborough |first3=R. G. |last4=Sahulka |first4=R. J. |last5=Shaw |first5=J. F. |last6=Shannon |first6=A. W. }}</ref> the first vendor-product in parallel programming; and in MPI (1991 and on), OpenMP (1997 and on), and other environments which have adopted and cite the (IBM) SPMD Computational Model.

By the late 1980s, there were many distributed computers with proprietary message passing libraries. The first SPMD standard was PVM. The current de facto standard is MPI.

The Cray parallel directives were a direct predecessor of OpenMP.

==References== {{reflist}}

==External links== *[http://www.cisl.ucar.edu/docs/ibm/ref/parallel.html Parallel job management and message passing] *[http://www.llnl.gov/casc/Overture/henshaw/documentation/App/manual/node36.html Single Program Multiple Data stream] {{Webarchive|url=https://web.archive.org/web/20040604110624/http://www.llnl.gov/casc/Overture/henshaw/documentation/App/manual/node36.html |date=2004-06-04 }} *[https://web0.tc.cornell.edu/Services/Education/Topics/Parallel/Design/SPMD.aspx SPMD] *[http://math.nist.gov/~KRemington/Primer/distrib.html Distributed-memory programming] {{Webarchive|url=https://web.archive.org/web/20131213105449/http://math.nist.gov/~KRemington/Primer/distrib.html |date=2013-12-13 }}

Category:Parallel computing Category:Flynn's taxonomy