Amdahl's Law

17 Dec 2024

5 min. read

Every quantitative field has its fundamental laws. Engineers will point to mass, momentum, and energy conservation. Modern physicists will smirk and point to Noether’s theorem instead. Economists will point to supply and demand. Financial professionals will point to compound interest. Mathematicians will point to Gödel’s Incompleteness Theorem. Computer scientists will gesture in the direction of Alan Turing. In high-performance computing (HPC), the first thing you learn is how hard it can be to parallelize something. That, in turn, motivates you to learn how to figure out whether or not you should[^1].

Every algorithm can be thought of as a set of sections that are parallelizable together with a set of sections that are not. Pregnancy is the classic example of a process that is not parallelizable: nine women can have nine babies in nine months, but as the joke goes, regardless of what the manager thinks, nine women cannot have one baby in one month. With a single core, the wall-time of the algorithm is the sum of the time spent in the serial sections, $T_{ser}$, and the time spent in the parallelizable sections, $T_{par}$,

\[\begin{align} T(1) = T_{ser} + T_{par} \end{align}\]
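
To make the decomposition concrete, here is a minimal, hypothetical Python sketch: a toy program with a setup phase that must run serially and a loop over independent chunks that could, in principle, run in parallel. Timing the two phases separately gives rough estimates of $T_{ser}$ and $T_{par}$; the workload and function names are invented purely for illustration.

```python
import time

def serial_setup():
    # Stand-in for inherently sequential work (parsing a config, building an index, ...).
    total = 0
    for i in range(2_000_000):
        total += i
    return total

def process_chunk(chunk):
    # One independent unit of work; chunks like this are what we could parallelize.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    t0 = time.perf_counter()
    serial_setup()
    t1 = time.perf_counter()

    chunks = [range(i * 100_000, (i + 1) * 100_000) for i in range(16)]
    results = [process_chunk(c) for c in chunks]  # run serially here, so this measures T_par
    t2 = time.perf_counter()

    t_ser, t_par = t1 - t0, t2 - t1
    print(f"T_ser ~ {t_ser:.3f} s, T_par ~ {t_par:.3f} s, T(1) ~ {t_ser + t_par:.3f} s")
```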

When multiple processors are introduced, which is certainly not the only way to add parallelism but is the simplest to think about, we cut the time of the parallel section accordingly. Assuming the workload is evenly balanced and the processors are uniform, and setting aside the rare situations where the specific details of the hardware and the problem align to produce super-linear speedups, we can assume that $p$ processors will reduce the wall-time of the parallel section by a factor of $p$,

\[\begin{align} T(p) \geq T_{ser} + \frac{T_{par}}{p} \end{align}\]

In practice, the workload will not be evenly balanced, bottlenecks will exist, and so on, so we treat the expression on the RHS as a lower bound on the wall-time of the algorithm, hence the inequality sign. What we are really interested in is an estimate of the speedup,

\[\begin{align} S(p) &= \frac{T(1)}{T(p)} \\ \Rightarrow S(p) &\leq \frac{T_{ser} + T_{par}}{T_{ser} + \frac{T_{par}}{p}} \label{eqn:speedup} \end{align}\]
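
Given estimates of $T_{ser}$ and $T_{par}$, both bounds are trivial to evaluate. A minimal sketch, with hypothetical helper names and made-up timings chosen only for illustration:

```python
def t_lower_bound(t_ser, t_par, p):
    # Best-case wall-time on p processors: only the parallel part shrinks.
    return t_ser + t_par / p

def speedup_upper_bound(t_ser, t_par, p):
    # Best-case speedup S(p) = T(1) / T(p).
    return (t_ser + t_par) / t_lower_bound(t_ser, t_par, p)

# Hypothetical measurement: 2 s of serial work, 18 s of parallelizable work.
t_ser, t_par = 2.0, 18.0
for p in (1, 2, 4, 8, 16):
    print(f"p={p:2d}: T(p) >= {t_lower_bound(t_ser, t_par, p):5.2f} s, "
          f"S(p) <= {speedup_upper_bound(t_ser, t_par, p):4.2f}")
```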

Frequently, when you have two quantities like this that add up to some total in mathematics or physics, you know that a coefficient is swiftly going to show up which characterizes how much each one contributes. For example, in probability, if an event has two outcomes then we can ascribe a probability, $\alpha$, to one of them, and the probability of the other is $1 - \alpha$. We do something similar here: we say that there is a coefficient, $\alpha$, which gives the fraction of the overall runtime taken by the serial portion,

\[\begin{align} T_{ser} &= \alpha T(1) \\ \therefore T_{par} &= (1 - \alpha)T(1) \end{align}\]

Consequently, we insert this into Equation (\ref{eqn:speedup}), and rearrange,

\[\begin{align} \therefore S(p) &\leq \frac{\alpha T(1) + (1 - \alpha)T(1)}{\alpha T(1) + \frac{1}{p}(1 - \alpha)T(1)} \\ &\leq \frac{\alpha + 1 - \alpha}{\alpha + \frac{1 - \alpha}{p}} \\ &\leq \frac{1}{\alpha + \frac{1 - \alpha}{p}} \label{eqn:amdahl_final} \end{align}\]
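
In code, the bound in Equation (\ref{eqn:amdahl_final}) is a one-liner. A minimal sketch, with a few arbitrary values of $\alpha$ and $p$:

```python
def amdahl_speedup(alpha, p):
    # Upper bound on speedup: serial fraction alpha, p processors.
    return 1.0 / (alpha + (1.0 - alpha) / p)

for alpha in (0.5, 0.1, 0.01):
    bounds = [round(amdahl_speedup(alpha, p), 1) for p in (2, 8, 64, 1024)]
    print(f"alpha = {alpha:4}: {bounds}")
```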

Equation (\ref{eqn:amdahl_final}) gives us a general bound on the speedup; this is Amdahl’s Law. One of the first things physicists like to do when they find the expression they were looking for is take limits,

\[\begin{align} &\lim_{p\rightarrow 1} S(p) \leq 1 \\ &\lim_{p\rightarrow \infty} S(p) \leq \frac{1}{\alpha} \end{align}\]

The first limit is what we expect: when there is only a single core in our cluster, we expect at best no change in performance, and at worst we might even see a slowdown due to the software overhead of introducing parallelism. This analysis applies to strong scaling, where even with a fixed problem size, more processors yield a speedup over fewer. As we will see in a moment, this is not always true, and in practice we usually don’t talk of strong scaling as a property of an algorithm so much as the regime for a scaling test.
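
The second limit is the more interesting one, and it is easy to check numerically by plugging an arbitrary serial fraction into Equation (\ref{eqn:amdahl_final}); with $\alpha = 0.05$, for example, the ceiling is $1/\alpha = 20$, and we get most of the way there long before the core count becomes astronomical:

```python
alpha = 0.05  # arbitrary serial fraction, chosen only for illustration

for p in (16, 256, 4_096, 1_000_000):
    bound = 1.0 / (alpha + (1.0 - alpha) / p)  # Equation (amdahl_final)
    print(f"p = {p:>9,}: S(p) <= {bound:.2f}")
```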

The fact remains that more does not always equal better, and for a very important reason when it comes to parallel programming. We can already see a hint of this in the second limit: even with an infinitude of processing cores, the speedup we can obtain is still limited by the structure of the algorithm, here represented by the coefficient $\alpha$. To wrap this up, let’s mention weak scaling, and point to how we analyze speedup when we vary both the problem size and the number of processors. A more general law is required here, known as Gustafson’s Law, and we will have to add additional parameters that describe how well the respective sections of the algorithm scale when the data volume is varied.
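
As a hedged sketch of where that goes: Gustafson’s Law is often stated in terms of the serial fraction $s$ measured on the parallel run of the scaled problem, giving a scaled speedup of $s + (1 - s)p$. Note that $s$ is not the same quantity as the $\alpha$ above, which is measured on the single-core run, and this is only the textbook form, not the fuller parameterization alluded to here. A quick numerical comparison, with an arbitrary fraction of 0.05 in both roles:

```python
def amdahl(alpha, p):
    # Fixed problem size (strong scaling): speedup saturates at 1/alpha.
    return 1.0 / (alpha + (1.0 - alpha) / p)

def gustafson(s, p):
    # Problem size grown with p (weak scaling): scaled speedup keeps growing.
    return s + (1.0 - s) * p

for p in (8, 64, 1024):
    print(f"p={p:4d}: Amdahl <= {amdahl(0.05, p):5.1f}, Gustafson ~ {gustafson(0.05, p):6.1f}")
```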

[^1]: Schmidt, B., et al. (2017). Parallel Programming: Concepts and Practice (1st ed.). Morgan Kaufmann.