# Smooth maximum

> Mediated Wiki article. Canonical URL: https://mediated.wiki/source/Smooth_maximum
> Markdown URL: https://mediated.wiki/source/Smooth_maximum.md
> Source: https://en.wikipedia.org/wiki/Smooth_maximum
> Source revision: 1317993932
> License: Creative Commons Attribution-ShareAlike 4.0 International (https://creativecommons.org/licenses/by-sa/4.0/)

Mathematical approximation

In [mathematics](/source/Mathematics), a **smooth maximum** of an [indexed family](/source/Indexed_family) *x*1, ..., *x**n* of numbers is a [smooth approximation](/source/Smooth_approximation) to the [maximum](/source/Maximum) function max ( x 1 , … , x n ) , {\displaystyle \max(x_{1},\ldots ,x_{n}),} meaning a [parametric family](/source/Parametric_family) of functions m α ( x 1 , … , x n ) {\displaystyle m_{\alpha }(x_{1},\ldots ,x_{n})} such that for every ⁠ α {\displaystyle \alpha } ⁠, the function ⁠ m α {\displaystyle m_{\alpha }} ⁠ is smooth, and the family converges to the maximum function ⁠ m α → max {\displaystyle m_{\alpha }\to \max } ⁠ as ⁠ α → ∞ {\displaystyle \alpha \to \infty } ⁠. The concept of **smooth minimum** is similarly defined. In many cases, a single family approximates both: maximum as the parameter goes to positive infinity, minimum as the parameter goes to negative infinity; in symbols, ⁠ m α → max {\displaystyle m_{\alpha }\to \max } ⁠ as ⁠ α → ∞ {\displaystyle \alpha \to \infty } ⁠ and ⁠ m α → min {\displaystyle m_{\alpha }\to \min } ⁠ as ⁠ α → − ∞ {\displaystyle \alpha \to -\infty } ⁠. The term can also be used loosely for a specific smooth function that behaves similarly to a maximum, without necessarily being part of a parametrized family.

## Examples

### Boltzmann operator

Smoothmax of (−x, x) versus x for various parameter values. Very smooth for

        α

    {\displaystyle \alpha }

=0.5, and more sharp for

        α

    {\displaystyle \alpha }

=8.

For large positive values of the parameter α > 0 {\displaystyle \alpha >0} , the following formulation is a smooth, [differentiable](/source/Differential_(calculus)) approximation of the maximum function. For negative values of the parameter that are large in absolute value, it approximates the minimum.

- S α ( x 1 , … , x n ) = ∑ i = 1 n x i e α x i ∑ i = 1 n e α x i {\displaystyle {\mathcal {S}}_{\alpha }(x_{1},\ldots ,x_{n})={\frac {\sum _{i=1}^{n}x_{i}e^{\alpha x_{i}}}{\sum _{i=1}^{n}e^{\alpha x_{i}}}}}

S α {\displaystyle {\mathcal {S}}_{\alpha }} has the following properties:

1. S α → max {\displaystyle {\mathcal {S}}_{\alpha }\to \max } as α → ∞ {\displaystyle \alpha \to \infty }

1. S 0 {\displaystyle {\mathcal {S}}_{0}} is the [arithmetic mean](/source/Arithmetic_mean) of its inputs

1. S α → min {\displaystyle {\mathcal {S}}_{\alpha }\to \min } as α → − ∞ {\displaystyle \alpha \to -\infty }

The gradient of S α {\displaystyle {\mathcal {S}}_{\alpha }} is closely related to [softmax](/source/Softmax_function) and is given by

- ∇ x i S α ( x 1 , … , x n ) = e α x i ∑ j = 1 n e α x j [ 1 + α ( x i − S α ( x 1 , … , x n ) ) ] . {\displaystyle \nabla _{x_{i}}{\mathcal {S}}_{\alpha }(x_{1},\ldots ,x_{n})={\frac {e^{\alpha x_{i}}}{\sum _{j=1}^{n}e^{\alpha x_{j}}}}[1+\alpha (x_{i}-{\mathcal {S}}_{\alpha }(x_{1},\ldots ,x_{n}))].}

This makes the softmax function useful for optimization techniques that use [gradient descent](/source/Gradient_descent).

This operator is sometimes called the Boltzmann operator,[1] after the [Boltzmann distribution](/source/Boltzmann_distribution).

### LogSumExp

Main article: [LogSumExp](/source/LogSumExp)

Another smooth maximum is [LogSumExp](/source/LogSumExp):

- L S E α ( x 1 , … , x n ) = 1 α log ⁡ ∑ i = 1 n exp ⁡ α x i {\displaystyle \mathrm {LSE} _{\alpha }(x_{1},\ldots ,x_{n})={\frac {1}{\alpha }}\log \sum _{i=1}^{n}\exp \alpha x_{i}}

This can also be normalized if the x i {\displaystyle x_{i}} are all non-negative, yielding a function with domain [ 0 , ∞ ) n {\displaystyle [0,\infty )^{n}} and range [ 0 , ∞ ) {\displaystyle [0,\infty )} :

- g ( x 1 , … , x n ) = log ⁡ ( ∑ i = 1 n exp ⁡ x i − ( n − 1 ) ) {\displaystyle g(x_{1},\ldots ,x_{n})=\log \left(\sum _{i=1}^{n}\exp x_{i}-(n-1)\right)}

The ( n − 1 ) {\displaystyle (n-1)} term corrects for the fact that exp ⁡ ( 0 ) = 1 {\displaystyle \exp(0)=1} by canceling out all but one zero exponential, and log ⁡ 1 = 0 {\displaystyle \log 1=0} if all x i {\displaystyle x_{i}} are zero.

### Mellowmax

The mellowmax operator[1] is defined as follows:

- m m α ( x ) = 1 α log ⁡ 1 n ∑ i = 1 n exp ⁡ α x i {\displaystyle \mathrm {mm} _{\alpha }(x)={\frac {1}{\alpha }}\log {\frac {1}{n}}\sum _{i=1}^{n}\exp \alpha x_{i}}

It is a [non-expansive](/source/Non-expansive_function) operator. As α → ∞ {\displaystyle \alpha \to \infty } , it acts like a maximum. As α → 0 {\displaystyle \alpha \to 0} , it acts like an arithmetic mean. As α → − ∞ {\displaystyle \alpha \to -\infty } , it acts like a minimum. This operator can be viewed as a particular instantiation of the [quasi-arithmetic mean](/source/Quasi-arithmetic_mean). It can also be derived from information theoretical principles as a way of regularizing policies with a cost function defined by KL divergence. The operator has previously been utilized in other areas, such as power engineering.[2]

#### Connection between LogSumExp and Mellowmax

LogSumExp and Mellowmax are the same function differing by a constant log ⁡ n α {\displaystyle {\frac {\log {n}}{\alpha }}} . LogSumExp is always larger than the true max, differing at most from the true max by log ⁡ n α {\displaystyle {\frac {\log {n}}{\alpha }}} in the case where all n arguments are equal and being exactly equal to the true max when all but one argument is − ∞ {\displaystyle -\infty } . Similarly, Mellowmax is always less than the true max, differing at most from the true max by log ⁡ n α {\displaystyle {\frac {\log {n}}{\alpha }}} in the case where all but one argument is − ∞ {\displaystyle -\infty } and being exactly equal to the true max when all n arguments are equal.

### p-Norm

Main article: [P-norm](/source/P-norm)

Another smooth maximum is the [p-norm](/source/P-norm):

- ‖ ( x 1 , … , x n ) ‖ p = ( ∑ i = 1 n | x i | p ) 1 p {\displaystyle \|(x_{1},\ldots ,x_{n})\|_{p}=\left(\sum _{i=1}^{n}|x_{i}|^{p}\right)^{\frac {1}{p}}}

which converges to ‖ ( x 1 , … , x n ) ‖ ∞ = max 1 ≤ i ≤ n | x i | {\displaystyle \|(x_{1},\ldots ,x_{n})\|_{\infty }=\max _{1\leq i\leq n}|x_{i}|} as p → ∞ {\displaystyle p\to \infty } .

An advantage of the p-norm is that it is a [norm](/source/Norm_(mathematics)). As such it is [scale invariant](/source/Scale_invariant) ([homogeneous](/source/Homogeneous_function)): ‖ ( λ x 1 , … , λ x n ) ‖ p = | λ | ⋅ ‖ ( x 1 , … , x n ) ‖ p {\displaystyle \|(\lambda x_{1},\ldots ,\lambda x_{n})\|_{p}=|\lambda |\cdot \|(x_{1},\ldots ,x_{n})\|_{p}} , and it satisfies the [triangle inequality](/source/Triangle_inequality).

### Smooth maximum unit

The following binary operator is called the Smooth Maximum Unit (SMU):[3]

- max ε ( a , b ) = a + b + | a − b | ε 2 = a + b + ( a − b ) 2 + ε 2 {\displaystyle {\begin{aligned}\textstyle \max _{\varepsilon }(a,b)&={\frac {a+b+|a-b|_{\varepsilon }}{2}}\\&={\frac {a+b+{\sqrt {(a-b)^{2}+\varepsilon }}}{2}}\end{aligned}}}

where ε ≥ 0 {\displaystyle \varepsilon \geq 0} is a parameter. As ε → 0 {\displaystyle \varepsilon \to 0} , | ⋅ | ε → | ⋅ | {\displaystyle |\cdot |_{\varepsilon }\to |\cdot |} and thus max ε → max {\displaystyle \textstyle \max _{\varepsilon }\to \max } .

## See also

- [LogSumExp](/source/LogSumExp)

- [Softmax function](/source/Softmax_function)

- [Generalized mean](/source/Generalized_mean)

## References

1. ^ [***a***](#cite_ref-Asadi_1-0) [***b***](#cite_ref-Asadi_1-1) Asadi, Kavosh; [Littman, Michael L.](/source/Michael_L._Littman) (2017). ["An Alternative Softmax Operator for Reinforcement Learning"](https://proceedings.mlr.press/v70/asadi17a.html). *PMLR*. **70**: 243–252. [arXiv](/source/ArXiv_(identifier)):[1612.05628](https://arxiv.org/abs/1612.05628). Retrieved January 6, 2023.

1. **[^](#cite_ref-2)** Safak, Aysel (February 1993). "Statistical analysis of the power sum of multiple correlated log-normal components". *IEEE Transactions on Vehicular Technology*. **42** (1): {58–61. [doi](/source/Doi_(identifier)):[10.1109/25.192387](https://doi.org/10.1109%2F25.192387).

1. **[^](#cite_ref-3)** Biswas, Koushik; Kumar, Sandeep; Banerjee, Shilpak; Ashish Kumar Pandey (2021). "SMU: Smooth activation function for deep networks using smoothing maximum technique". [arXiv](/source/ArXiv_(identifier)):[2111.04682](https://arxiv.org/abs/2111.04682) [[cs.LG](https://arxiv.org/archive/cs.LG)].

[https://www.johndcook.com/soft_maximum.pdf](https://www.johndcook.com/soft_maximum.pdf)

M. Lange, D. Zühlke, O. Holz, and T. Villmann, "Applications of lp-norms and their smooth approximations for gradient based learning vector quantization," *in Proc. ESANN*, Apr. 2014, pp. 271-276. ([https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014-153.pdf](https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014-153.pdf))

---
Adapted from the Wikipedia article [Smooth maximum](https://en.wikipedia.org/wiki/Smooth_maximum) by Wikipedia contributors ([contributor history](https://en.wikipedia.org/wiki/Smooth_maximum?action=history)). Available under [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/). Changes may have been made.