Smooth maximum

In mathematics, a '''smooth maximum''' of an indexed family ''x''<sub>1</sub>, ..., ''x''<sub>''n''</sub> of numbers is a smooth approximation to the maximum function <math>\max(x_1,\ldots,x_n),</math> meaning a parametric family of functions <math>m_\alpha(x_1,\ldots,x_n)</math> such that for every {{tmath|\alpha}}, the function {{tmath|m_\alpha}} is smooth, and the family converges to the maximum function {{tmath|m_\alpha \to \max}} as {{tmath|\alpha\to\infty}}. The concept of '''smooth minimum''' is similarly defined. In many cases, a single family approximates both: maximum as the parameter goes to positive infinity, minimum as the parameter goes to negative infinity; in symbols, {{tmath|m_\alpha \to \max}} as {{tmath|\alpha \to \infty}} and {{tmath|m_\alpha \to \min}} as {{tmath|\alpha \to -\infty}}. The term can also be used loosely for a specific smooth function that behaves similarly to a maximum, without necessarily being part of a parametrized family.

== Examples ==

=== Boltzmann operator ===

Figure: Smoothmax of (−''x'', ''x'') versus ''x'' for various parameter values. The curve is very smooth for <math>\alpha = 0.5</math> and sharper for <math>\alpha = 8</math>.

For large positive values of the parameter <math>\alpha</math>, the following formulation is a smooth, differentiable approximation of the maximum function. For negative values of the parameter that are large in absolute value, it approximates the minimum.

:<math> \mathcal{S}_\alpha (x_1,\ldots,x_n) = \frac{\sum_{i=1}^n x_i e^{\alpha x_i}}{\sum_{i=1}^n e^{\alpha x_i}} </math>

<math>\mathcal{S}_\alpha</math> has the following properties:
# <math>\mathcal{S}_\alpha\to \max</math> as <math>\alpha\to\infty</math>
# <math>\mathcal{S}_0</math> is the arithmetic mean of its inputs
# <math>\mathcal{S}_\alpha\to \min</math> as <math>\alpha\to -\infty</math>
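The three properties above can be checked numerically. The following is a minimal Python sketch, not a reference implementation; the function name <code>boltzmann</code> and the sample inputs are illustrative assumptions. The exponents are shifted by their largest value, which leaves the ratio unchanged but avoids overflow for large <math>|\alpha|</math>.

```python
import math

def boltzmann(xs, alpha):
    """Boltzmann operator: average of xs weighted by exp(alpha * x_i)."""
    # Shift every exponent by the largest one so exp() cannot overflow;
    # the common factor cancels between numerator and denominator.
    shift = max(alpha * x for x in xs)
    weights = [math.exp(alpha * x - shift) for x in xs]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)

xs = [1.0, 2.0, 3.0]
print(boltzmann(xs, 0.0))    # alpha = 0: arithmetic mean of xs
print(boltzmann(xs, 50.0))   # large positive alpha: close to max(xs)
print(boltzmann(xs, -50.0))  # large negative alpha: close to min(xs)
```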

The gradient of <math>\mathcal{S}_{\alpha}</math> is closely related to softmax and is given by

:<math> \nabla_{x_i}\mathcal{S}_\alpha (x_1,\ldots,x_n) = \frac{e^{\alpha x_i}}{\sum_{j=1}^n e^{\alpha x_j}} [1 + \alpha(x_i - \mathcal{S}_\alpha (x_1,\ldots,x_n))]. </math>

This makes the softmax function useful for optimization techniques that use gradient descent.
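As a sanity check on the gradient formula, the sketch below compares it with a central finite-difference estimate. The helper names (<code>boltzmann</code>, <code>boltzmann_grad</code>, <code>numeric_grad</code>), the step size <code>h</code>, and the test point are assumptions made for illustration only. A useful side observation is that the partial derivatives sum to 1, because the softmax weights sum to 1 and the weighted average of <math>x_i - \mathcal{S}_\alpha</math> is zero.

```python
import math

def boltzmann(xs, alpha):
    shift = max(alpha * x for x in xs)
    weights = [math.exp(alpha * x - shift) for x in xs]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)

def boltzmann_grad(xs, alpha):
    """Analytic gradient: softmax weight times [1 + alpha * (x_i - S_alpha)]."""
    shift = max(alpha * x for x in xs)
    weights = [math.exp(alpha * x - shift) for x in xs]
    total = sum(weights)
    s = sum(w * x for w, x in zip(weights, xs)) / total
    return [(w / total) * (1.0 + alpha * (x - s)) for w, x in zip(weights, xs)]

def numeric_grad(xs, alpha, h=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for i in range(len(xs)):
        up, down = list(xs), list(xs)
        up[i] += h
        down[i] -= h
        grads.append((boltzmann(up, alpha) - boltzmann(down, alpha)) / (2 * h))
    return grads

xs, alpha = [0.5, -1.0, 2.0], 1.5
print(boltzmann_grad(xs, alpha))
print(numeric_grad(xs, alpha))       # should agree to several decimal places
print(sum(boltzmann_grad(xs, alpha)))  # partial derivatives sum to about 1
```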

This operator is sometimes called the Boltzmann operator,<ref name="Asadi">{{cite journal |last1=Asadi |first1=Kavosh |last2=Littman |first2=Michael L. |author-link2=Michael L. Littman |date=2017 |title=An Alternative Softmax Operator for Reinforcement Learning |url=https://proceedings.mlr.press/v70/asadi17a.html |journal=PMLR |volume=70 |pages=243–252 |arxiv=1612.05628 |access-date=January 6, 2023}}</ref> after the Boltzmann distribution.

=== LogSumExp ===
{{main|LogSumExp}}

Another smooth maximum is LogSumExp:

:<math>\mathrm{LSE}_\alpha(x_1, \ldots, x_n) = \frac{1}{\alpha} \log \sum_{i=1}^n \exp(\alpha x_i)</math>

This can also be normalized if the <math>x_i</math> are all non-negative, yielding a function with domain <math>[0,\infty)^n</math> and range <math>[0, \infty)</math>:

:<math>g(x_1, \ldots, x_n) = \log \left( \sum_{i=1}^n \exp x_i - (n-1) \right)</math>

The <math>(n - 1)</math> term corrects for the fact that <math>\exp(0) = 1</math> by canceling out all but one zero exponential, and <math>\log 1 = 0</math> if all <math>x_i</math> are zero.
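Both LogSumExp forms can be sketched in a few lines of Python. The function names <code>logsumexp</code> and <code>normalized_lse</code> are illustrative assumptions, and the sketch assumes <math>\alpha \neq 0</math>. Factoring out the largest exponent before calling <code>exp</code> is a common way to avoid overflow and does not change the result mathematically.

```python
import math

def logsumexp(xs, alpha=1.0):
    """LSE_alpha(x) = (1/alpha) * log(sum(exp(alpha * x_i))), for alpha != 0."""
    # Factor out the largest exponent so exp() cannot overflow.
    m = max(alpha * x for x in xs)
    return (m + math.log(sum(math.exp(alpha * x - m) for x in xs))) / alpha

def normalized_lse(xs):
    """g(x) = log(sum(exp(x_i)) - (n - 1)), defined here for non-negative inputs."""
    if any(x < 0 for x in xs):
        raise ValueError("inputs must be non-negative")
    return math.log(sum(math.exp(x) for x in xs) - (len(xs) - 1))

print(logsumexp([1.0, 2.0, 3.0], alpha=10.0))  # slightly above max = 3
print(logsumexp([1000.0, 1000.0]))             # no overflow despite large inputs
print(normalized_lse([0.0, 0.0, 0.0]))         # all-zero input maps to zero
print(normalized_lse([0.0, 0.0, 5.0]))         # close to 5: the zero inputs cancel
```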

=== Mellowmax ===

The mellowmax operator<ref name="Asadi"/> is defined as follows:

:<math>\mathrm{mm}_\alpha(x) = \frac{1}{\alpha} \log \left( \frac{1}{n} \sum_{i=1}^n \exp(\alpha x_i) \right)</math>

It is a non-expansive operator. As <math>\alpha \to \infty</math>, it acts like a maximum. As <math>\alpha \to 0</math>, it acts like an arithmetic mean. As <math>\alpha \to -\infty</math>, it acts like a minimum. This operator can be viewed as a particular instantiation of the quasi-arithmetic mean. It can also be derived from information-theoretic principles as a way of regularizing policies with a cost function defined by KL divergence. The operator has previously been utilized in other areas, such as power engineering.<ref>{{cite journal |last1=Safak |first1=Aysel |date=February 1993 |title=Statistical analysis of the power sum of multiple correlated log-normal components |journal=IEEE Transactions on Vehicular Technology |volume=42 |issue=1 |pages=58–61 |doi=10.1109/25.192387 }}</ref>
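The three limiting behaviours of mellowmax can be observed numerically. The following Python sketch is illustrative only; the function name <code>mellowmax</code> and the chosen parameter values are assumptions, and <math>\alpha = 0</math> is rejected because the formula is defined there only as a limit.

```python
import math

def mellowmax(xs, alpha):
    """mm_alpha(x) = (1/alpha) * log(mean(exp(alpha * x_i))), for alpha != 0."""
    if alpha == 0:
        raise ValueError("alpha must be non-zero; the mean is only the limit")
    # Factor out the largest exponent so exp() cannot overflow.
    m = max(alpha * x for x in xs)
    mean_exp = sum(math.exp(alpha * x - m) for x in xs) / len(xs)
    return (m + math.log(mean_exp)) / alpha

xs = [1.0, 2.0, 3.0]
print(mellowmax(xs, 100.0))   # approaches max(xs) = 3 from below
print(mellowmax(xs, 1e-6))    # approaches the arithmetic mean, 2
print(mellowmax(xs, -100.0))  # approaches min(xs) = 1 from above
```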

==== Connection between LogSumExp and Mellowmax ====

LogSumExp and Mellowmax differ only by a constant: <math>\mathrm{LSE}_\alpha = \mathrm{mm}_\alpha + \frac{\log n}{\alpha}</math>. For <math>\alpha > 0</math>, LogSumExp is never smaller than the true maximum. It exceeds the maximum by at most <math>\frac{\log n}{\alpha}</math>, which occurs when all ''n'' arguments are equal, and it equals the maximum exactly when all but one argument is <math>-\infty</math>. Similarly, for <math>\alpha > 0</math>, Mellowmax is never larger than the true maximum. It falls short of the maximum by at most <math>\frac{\log n}{\alpha}</math>, which occurs when all but one argument is <math>-\infty</math>, and it equals the maximum exactly when all ''n'' arguments are equal.
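For <math>\alpha > 0</math> these bounds follow from <math>e^{\alpha \max_i x_i} \le \sum_{i=1}^n e^{\alpha x_i} \le n\, e^{\alpha \max_i x_i}</math>: taking logarithms and dividing by <math>\alpha</math> gives <math>\max \le \mathrm{LSE}_\alpha \le \max + \tfrac{\log n}{\alpha}</math>, and subtracting <math>\tfrac{\log n}{\alpha}</math> gives the corresponding bounds for Mellowmax. The Python sketch below checks the two extreme cases numerically. The function names are illustrative assumptions, and a very large negative number stands in for <math>-\infty</math>.

```python
import math

def logsumexp(xs, alpha):
    m = max(alpha * x for x in xs)
    return (m + math.log(sum(math.exp(alpha * x - m) for x in xs))) / alpha

def mellowmax(xs, alpha):
    # Same as LogSumExp with the sum replaced by a mean.
    m = max(alpha * x for x in xs)
    return (m + math.log(sum(math.exp(alpha * x - m) for x in xs) / len(xs))) / alpha

alpha = 4.0
gap = math.log(3) / alpha          # log(n) / alpha with n = 3

equal = [2.0, 2.0, 2.0]            # all arguments equal
dominated = [2.0, -1e9, -1e9]      # stand-in for "all but one argument is -infinity"

print(logsumexp(equal, alpha) - max(equal), gap)          # LSE overshoots by the full gap
print(mellowmax(equal, alpha) - max(equal))               # Mellowmax is exact here
print(logsumexp(dominated, alpha) - max(dominated))       # LSE is exact here
print(max(dominated) - mellowmax(dominated, alpha), gap)  # Mellowmax undershoots by the gap
```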

=== p-Norm ===
{{main|P-norm}}

Another smooth maximum is the p-norm:

:<math> \| (x_1, \ldots, x_n) \|_p = \left( \sum_{i=1}^n |x_i|^p \right)^\frac{1}{p} </math>

which converges to <math>\| (x_1, \ldots, x_n) \|_\infty = \max_{1\leq i\leq n} |x_i| </math> as <math>p \to \infty</math>.
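The convergence can be illustrated with a short Python sketch. The function name <code>p_norm</code> and the sample vector are assumptions for illustration. Note that the limit is the maximum of the absolute values, so the p-norm approximates <math>\max(x_1,\ldots,x_n)</math> itself only when the inputs are non-negative. Dividing by the largest magnitude before raising to the power ''p'' is a rescaling added here to avoid overflow for large ''p''.

```python
import math

def p_norm(xs, p):
    """p-norm of xs, rescaled by the largest magnitude for numerical safety."""
    m = max(abs(x) for x in xs)
    if m == 0.0:
        return 0.0
    # (sum |x_i|^p)^(1/p) == m * (sum (|x_i| / m)^p)^(1/p)
    return m * sum((abs(x) / m) ** p for x in xs) ** (1.0 / p)

xs = [3.0, -4.0, 1.0]
for p in (1, 2, 10, 100):
    # The values decrease towards max(|x_i|) = 4 as p grows.
    print(p, p_norm(xs, p))
```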

=== Smooth maximum unit ===

The following binary operator is called the Smooth Maximum Unit (SMU):<ref>{{Cite arXiv|eprint = 2111.04682|last1 = Biswas|first1 = Koushik|last2 = Kumar|first2 = Sandeep|last3 = Banerjee|first3 = Shilpak|author4 = Ashish Kumar Pandey|title = SMU: Smooth activation function for deep networks using smoothing maximum technique|year = 2021| class=cs.LG }}</ref>

:<math> \begin{align} \textstyle\max_\varepsilon(a, b) &= \frac{a + b + |a - b|_\varepsilon}{2} \\ &= \frac{a + b + \sqrt{(a - b)^2 + \varepsilon}}{2} \end{align} </math>

where <math>\varepsilon \geq 0</math> is a parameter. As <math>\varepsilon \to 0</math>, <math>|\cdot|_\varepsilon \to |\cdot|</math> and thus <math>\textstyle\max_\varepsilon \to \max</math>.
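A direct transcription of this formula into Python is sketched below. The function name <code>smooth_max</code> and the sample values are illustrative assumptions. Since <math>\sqrt{(a-b)^2 + \varepsilon} \ge |a - b|</math>, the result is never smaller than the true maximum, and at <math>a = b</math> it exceeds it by <math>\sqrt{\varepsilon}/2</math>.

```python
import math

def smooth_max(a, b, eps):
    """Smooth maximum of a and b using sqrt((a - b)^2 + eps) in place of |a - b|."""
    return (a + b + math.sqrt((a - b) ** 2 + eps)) / 2.0

print(smooth_max(1.0, 3.0, 0.0))    # eps = 0 recovers the exact maximum
print(smooth_max(1.0, 3.0, 1.0))    # slightly above 3
print(smooth_max(2.0, 2.0, 0.04))   # at a = b the overshoot is sqrt(eps) / 2
```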

==See also==
* LogSumExp
* Softmax function
* Generalized mean

==References==
{{Reflist}}


https://www.johndcook.com/soft_maximum.pdf

M. Lange, D. Zühlke, O. Holz, and T. Villmann, "Applications of lp-norms and their smooth approximations for gradient based learning vector quantization," ''in Proc. ESANN'', Apr. 2014, pp. 271-276. (https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014-153.pdf)