# Subgradient method

> Mediated Wiki article. Canonical URL: https://mediated.wiki/source/Subgradient_method
> Markdown URL: https://mediated.wiki/source/Subgradient_method.md
> Source: https://en.wikipedia.org/wiki/Subgradient_method
> Source revision: 1334089787
> License: Creative Commons Attribution-ShareAlike 4.0 International (https://creativecommons.org/licenses/by-sa/4.0/)

{{Short description|Concept in convex optimization mathematics}}
'''Subgradient methods''' are [convex optimization](/source/convex_optimization) methods which use [subderivatives](/source/Subderivative). Originally developed by [Naum Z. Shor](/source/Naum_Z._Shor) and others in the 1960s and 1970s, subgradient methods are convergent when applied even to a non-differentiable objective function. When the objective function is differentiable, subgradient methods for unconstrained problems use the same search direction as the method of [gradient descent](/source/gradient_descent).

Subgradient methods are slower than Newton's method when applied to minimize twice continuously differentiable convex functions. However, Newton's method fails to converge on problems that have non-differentiable kinks.

In recent years, some [interior-point methods](/source/interior-point_methods) have been suggested for convex minimization problems, but subgradient projection methods and related bundle methods of descent remain competitive. For convex minimization problems with very large number of dimensions, subgradient-projection methods are suitable, because they require little storage.

Subgradient projection methods are often applied to large-scale problems with decomposition techniques. Such decomposition methods often allow a simple distributed method for a problem.

==Classical subgradient rules==

Let <math>f : \Reals^n \to \Reals</math> be a [convex function](/source/convex_function) with domain <math>\Reals^n.</math>
A classical subgradient method iterates
<math display=block>x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)} \ </math>
where <math>g^{(k)}</math> denotes ''any'' [subgradient](/source/subgradient) of <math> f \ </math> at <math>x^{(k)}, \ </math> and <math>x^{(k)}</math> is the <math>k^{th}</math> iterate of <math>x.</math> 
If <math>f \ </math> is differentiable, then its only subgradient is the gradient vector <math>\nabla f</math> itself.
It may happen that <math>-g^{(k)}</math> is not a descent direction for <math>f \ </math>  at <math>x^{(k)}.</math>  We therefore maintain a list <math>f_{\rm{best}} \ </math> that keeps track of the lowest objective function value found so far, i.e.
<math display=block>f_{\rm{best}}^{(k)} = \min\{f_{\rm{best}}^{(k-1)} , f(x^{(k)}) \}.</math>

===Step size rules===

Many different types of step-size rules are used by subgradient methods.  This article notes five classical step-size rules for which convergence [proof](/source/mathematical_proof)s are known:

*Constant step size, <math>\alpha_k = \alpha.</math>
*Constant step length, <math>\alpha_k = \gamma/\lVert g^{(k)} \rVert_2,</math> which gives <math>\lVert x^{(k+1)} - x^{(k)} \rVert_2 = \gamma.</math>
*Square summable but not summable step size, i.e. any step sizes satisfying <math display="block">\alpha_k\geq0,\qquad\sum_{k=1}^\infty \alpha_k^2 < \infty,\qquad \sum_{k=1}^\infty \alpha_k = \infty.</math>
*Nonsummable diminishing, i.e. any step sizes satisfying <math display="block">\alpha_k \geq 0,\qquad \lim_{k\to\infty} \alpha_k = 0,\qquad \sum_{k=1}^\infty \alpha_k = \infty.</math>
*Nonsummable diminishing step lengths, i.e. <math>\alpha_k = \gamma_k/\lVert g^{(k)} \rVert_2,</math> where <math display="block">\gamma_k \geq 0,\qquad \lim_{k\to\infty} \gamma_k = 0,\qquad \sum_{k=1}^\infty \gamma_k = \infty.</math>
For all five rules, the step-sizes are determined "off-line", before the method is iterated; the step-sizes do not depend on preceding iterations.  This "off-line" property of subgradient methods differs from the "on-line" step-size rules used for descent methods for differentiable functions: Many methods for minimizing differentiable functions satisfy Wolfe's sufficient conditions for convergence, where step-sizes typically depend on the current point and the current search-direction. An extensive discussion of stepsize rules for subgradient methods, including incremental versions, is given in the books by Bertsekas<ref>{{cite book
  | last = Bertsekas
  | first = Dimitri P.
  | author-link = Dimitri P. Bertsekas
  | title = Convex Optimization Algorithms
  | edition = Second
  | publisher = Athena Scientific
  | year = 2015
  | location = Belmont, MA.
  | isbn =  978-1-886529-28-1
}}</ref> and by Bertsekas, Nedic, and Ozdaglar.<ref>{{cite book
|last1=Bertsekas
|first1=Dimitri P.
|last2=Nedic
|first2=Angelia
|last3 = Ozdaglar
|first3 = Asuman
| title = Convex Analysis and Optimization
| edition = Second
| publisher = Athena Scientific
| year = 2003
| location = Belmont, MA.
| isbn =  1-886529-45-0 }}</ref>

===Convergence results===

For constant step-length and scaled subgradients having [Euclidean norm](/source/Euclidean_norm) equal to one, the subgradient method converges to an arbitrarily close approximation to the minimum value, that is
:<math>\lim_{k\to\infty} f_{\rm{best}}^{(k)} - f^* <\epsilon</math> by a result of [Shor](/source/Naum_Z._Shor).<ref>
The approximate convergence of the constant step-size (scaled) subgradient method is stated as Exercise 6.3.14(a) in [Bertsekas](/source/Dimitri_P._Bertsekas) (page 636): {{cite book
  | last = Bertsekas
  | first = Dimitri P.
  | author-link = Dimitri P. Bertsekas
  | title = Nonlinear Programming
  | edition = Second
  | publisher = Athena Scientific
  | year = 1999
  | location = Cambridge, MA.
  | isbn = 1-886529-00-0 
}} On page 636, Bertsekas attributes this result to Shor: {{cite book
  | last = Shor
  | first = Naum Z.
  | author-link = Naum Z. Shor
  | title = Minimization Methods for Non-differentiable Functions
  | publisher = [Springer-Verlag](/source/Springer-Verlag)
  | isbn = 0-387-12763-1
  | year = 1985
}}
</ref>
These classical subgradient methods have poor performance and are no longer recommended for general use.<ref name="Lem"/><ref name="KLL"/> However, they are still used widely in specialized applications because they are simple and they can be easily adapted to take advantage of the special structure of the problem at hand.

==Subgradient-projection and bundle methods==
During the 1970s, [Claude Lemaréchal](/source/Claude_Lemar%C3%A9chal) and Phil Wolfe proposed "bundle methods" of descent for problems of convex minimization.<ref>
{{cite book
  | last = Bertsekas
  | first = Dimitri P.
  | author-link = Dimitri P. Bertsekas
  | title = Nonlinear Programming
  | edition = Second
  | publisher = Athena Scientific
  | year = 1999
  | location = Cambridge, MA.
  | isbn = 1-886529-00-0 
}} 
</ref> The meaning of the term "bundle methods" has changed significantly since that time. Modern versions and full convergence analysis were provided by Kiwiel.
<ref>
{{cite book|last=Kiwiel|first=Krzysztof|title=Methods of Descent for Nondifferentiable Optimization|publisher=[Springer Verlag](/source/Springer_Verlag)|location=Berlin|year=1985|pages=362|isbn=978-3540156420 |mr=0797754}}
</ref> Contemporary bundle-methods often use "[level](/source/level_set) control" rules for choosing step-sizes, developing techniques from the "subgradient-projection" method of Boris T. Polyak (1969). However, there are problems on which bundle methods offer little advantage over subgradient-projection methods.<ref name="Lem">
{{cite book| last=Lemaréchal|first=Claude|author-link=Claude Lemaréchal|chapter=Lagrangian relaxation|pages=112–156|title=Computational combinatorial optimization: Papers from the Spring School held in Schloß Dagstuhl, May 15–19, 2000|editor=Michael Jünger and Denis Naddef|series=Lecture Notes in Computer Science|volume=2241|publisher=Springer-Verlag| location=Berlin|year=2001|isbn=3-540-42877-1|mr=1900016|doi=10.1007/3-540-45586-8_4|s2cid=9048698 }}</ref><ref name="KLL">
{{cite journal|last1=Kiwiel|first1=Krzysztof&nbsp;C.|last2=Larsson |first2=Torbjörn|last3=Lindberg|first3=P.&nbsp;O.|title=Lagrangian relaxation via ballstep subgradient methods|url=http://rcin.org.pl/Content/139438/PDF/RB-2002-76.pdf |journal=[Mathematics of Operations Research](/source/Mathematics_of_Operations_Research)|volume=32|date=August 2007|number=3|pages=669–686|mr=2348241|doi=10.1287/moor.1070.0261}}
</ref>

==Constrained optimization==
===Projected subgradient===
One extension of the subgradient method is the '''projected subgradient method''', which solves the constrained [optimization](/source/Mathematical_optimization) problem
:minimize <math>f(x) \ </math> subject to <math display=block>x \in \mathcal{C}</math>

where <math>\mathcal{C}</math> is a [convex set](/source/convex_set). 
The projected subgradient method uses the iteration
<math display=block>x^{(k+1)} = P \left(x^{(k)} - \alpha_k g^{(k)}\right)</math>
where <math>P</math> is projection on <math>\mathcal{C}</math> and <math>g^{(k)}</math> is any subgradient of <math>f \ </math> at <math>x^{(k)}.</math>

===General constraints===

The subgradient method can be extended to solve the inequality constrained problem

:minimize <math>f_0(x) \ </math> subject to <math display=block>f_i (x) \leq 0,\quad i = 1,\ldots,m</math>

where <math>f_i</math> are convex.  The algorithm takes the same form as the unconstrained case
<math display=block>x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)} \ </math>
where <math>\alpha_k>0</math> is a step size, and <math>g^{(k)}</math> is a subgradient of the objective or one of the constraint functions at <math>x. \ </math>  Take
<math display=block>g^{(k)} = 
\begin{cases} 
  \partial f_0 (x)  & \text{ if } f_i(x) \leq 0 \; \forall i = 1 \dots m \\
  \partial f_j (x)  & \text{ for some } j \text{ such that } f_j(x) > 0 
\end{cases}</math>
where <math>\partial f</math> denotes the [subdifferential](/source/subdifferential) of <math>f. \ </math>  If the current point is feasible, the algorithm uses an objective subgradient; if the current point is infeasible, the algorithm chooses a subgradient of any violated constraint.

==See also==

* {{annotated link|Stochastic gradient descent}}

==References==

{{reflist}}

==Further reading==

* {{cite book
  | last = Bertsekas
  | first = Dimitri P.
  | title = Nonlinear Programming
  | publisher = Athena Scientific
  | year = 1999
  | location = Belmont, MA.
  | isbn = 1-886529-00-0 
}}
* {{cite book
|last1=Bertsekas
|first1=Dimitri P.
|last2=Nedic
|first2=Angelia
|last3 = Ozdaglar
|first3 = Asuman
| title = Convex Analysis and Optimization
| edition = Second
| publisher = Athena Scientific
| year = 2003
| location = Belmont, MA.
| isbn =  1-886529-45-0
}} 
* {{cite book
  | last = Bertsekas
  | first = Dimitri P.
  | title = Convex Optimization Algorithms
  | publisher = Athena Scientific
  | year = 2015
  | location = Belmont, MA.
  | isbn = 978-1-886529-28-1 
}}
* {{cite book
  | last = Shor
  | first = Naum Z.
  | title = Minimization Methods for Non-differentiable Functions
  | publisher = [Springer-Verlag](/source/Springer-Verlag)
  | isbn = 0-387-12763-1
  | year = 1985
}}
* {{cite book|last=Ruszczyński|author-link=Andrzej Piotr Ruszczyński|first=Andrzej|title=Nonlinear Optimization|publisher=[Princeton University Press](/source/Princeton_University_Press)|location=Princeton, NJ|year=2006|pages=xii+454|isbn=978-0691119151 |mr=2199043}}

==External links==

* [http://www.stanford.edu/class/ee364a/ EE364A] and [http://www.stanford.edu/class/ee364b/ EE364B], Stanford's convex optimization course sequence.

{{Convex analysis and variational analysis}}
{{optimization algorithms|convex}}

Category:Convex analysis
Category:Convex optimization
Category:Mathematical optimization
Category:Optimization algorithms and methods

---
Adapted from the Wikipedia article [Subgradient method](https://en.wikipedia.org/wiki/Subgradient_method) by Wikipedia contributors ([contributor history](https://en.wikipedia.org/wiki/Subgradient_method?action=history)). Available under [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/). Changes may have been made.