Adjoint state method

{{Short description|Numerical method}} {{primary sources|date=January 2010}} The '''adjoint state method''' is a numerical method for efficiently computing the gradient of a function or operator in a numerical optimization problem.<ref>{{Cite journal|last1=Pollini|first1=Nicolò|last2=Lavan|first2=Oren|last3=Amir|first3=Oded|date=2018-06-01|title=Adjoint sensitivity analysis and optimization of hysteretic dynamic systems with nonlinear viscous dampers|journal=Structural and Multidisciplinary Optimization|language=en|volume=57|issue=6|pages=2273–2289|doi=10.1007/s00158-017-1858-2|s2cid=125712091|issn=1615-1488}}</ref> It has applications in geophysics, seismic imaging, photonics and more recently in neural networks.<ref>Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud ''Neural Ordinary Differential Equations'' [https://arxiv.org/abs/1806.07366 Available online]</ref>

The adjoint state space is chosen to simplify the physical interpretation of equation constraints.<ref name='Plessix_2006_GJI'>Plessix, R-E. "A review of the adjoint-state method for computing the gradient of a functional with geophysical applications." Geophysical Journal International, 2006, 167(2): 495-503. [https://academic.oup.com/gji/article/167/2/495/559970 free access on GJI website]</ref>

Adjoint state techniques allow the use of integration by parts, resulting in a form which explicitly contains the physically interesting quantity. An adjoint state equation is introduced, including a new unknown variable.

The adjoint method formulates the gradient of a function towards its parameters in a constraint optimization form. By using the dual form of this constraint optimization problem, it can be used to calculate the gradient very fast. A nice property is that the number of computations is independent of the number of parameters for which you want the gradient. The adjoint method is derived from the dual problem<ref>{{cite journal |last1=McNamara |first1=Antoine |last2=Treuille |first2=Adrien |last3=Popović |first3=Zoran |last4=Stam |first4=Jos |title=Fluid control using the adjoint method |url=https://www.dgp.toronto.edu/public_user/stam/reality/Research/pdf/sig04.pdf |access-date=28 October 2022 |archive-url=https://web.archive.org/web/20220129011505/https://www.dgp.toronto.edu/public_user/stam/reality/Research/pdf/sig04.pdf |archive-date=29 January 2022 |journal=ACM Transactions on Graphics |volume=23 | issue=3| pages=449–456 |doi=10.1145/1015706.1015744 |date=August 2004 |url-status=live}}</ref> and is used e.g. in the Landweber iteration method.<ref>{{cite web |last1=Lundvall |first1=Johan |title=Data Assimilation in Fluid Dynamics using Adjoint Optimization |url=http://liu.diva-portal.org/smash/get/diva2:24091/FULLTEXT01.pdf |publisher=Linköping University of Technology |access-date=28 October 2022 |archive-url=https://web.archive.org/web/20221009083111/http://liu.diva-portal.org/smash/get/diva2:24091/FULLTEXT01.pdf |archive-date=9 October 2022 |location=Sweden |language=en |date=2007 |url-status=live}}</ref>

The name '''adjoint state method''' refers to the dual form of the problem, where the adjoint matrix <math>A^*=\overline A ^T</math> is used.

When the initial problem consists of calculating the product <math>s^T x</math> and <math>x</math> must satisfy <math>Ax=b</math>, the dual problem can be realized as calculating the product {{nowrap|<math>r^T b</math> (<math> = s^T x </math>)}}, where <math>r</math> must satisfy <math>A^* r = s </math>. And <math> r </math> is called the adjoint state vector.

== General case ==

The original adjoint calculation method goes back to Jean Céa,<ref>{{Cite journal|last1=Cea|first1=Jean|date=1986|title=Conception optimale ou identification de formes, calcul rapide de la dérivée directionnelle de la fonction coût|journal=ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique|language=fr|volume=20|issue=3|pages=371–402|doi=10.1051/m2an/1986200303711 |url=http://www.numdam.org/item/M2AN_1986__20_3_371_0/|doi-access=free}}</ref> with the use of the Lagrangian of the optimization problem to compute the derivative of a functional with respect to a shape parameter.

For a state variable <math>u \in \mathcal{U}</math>, an optimization variable <math>v\in\mathcal{V}</math>, an objective functional <math>J:\mathcal{U}\times\mathcal{V}\to\mathbb{R}</math> is defined. The state variable <math>u</math> is often implicitly dependent on <math>v</math> through the (direct) state equation <math>D_v(u)=0</math> (usually the weak form of a partial differential equation), thus the considered objective is <math>j(v) = J(u_v,v)</math>, where <math>u_v</math> is the solution of the state equation given the optimization variables <math>v</math>. Usually, one would be interested in calculating <math>\nabla j(v)</math> using the chain rule: :<math> \nabla j(v) = \nabla_v J(u_v,v) + \nabla_u J(u_v)\nabla_v u_v. </math>

Unfortunately, the term <math>\nabla_v u_v</math> is often very hard to differentiate analytically since the dependence is defined through an implicit equation. The Lagrangian functional can be used as a workaround for this issue. Since the state equation can be considered as a constraint in the minimization of <math>j</math>, the problem

:<math> \text{minimize}\ j(v) = J(u_v,v) </math> :<math> \text{subject to}\ D_v(u_v) = 0 </math>

has an associate Lagrangian functional <math> \mathcal{L}:\mathcal{U}\times\mathcal{V}\times\mathcal{U} \to \mathbb{R} </math> defined by

:<math> \mathcal{L}(u,v,\lambda) = J(u,v) + \langle D_v(u),\lambda\rangle, </math>

where <math>\lambda\in\mathcal{U}</math> is a Lagrange multiplier or '''adjoint state variable''' and <math>\langle\cdot,\cdot\rangle</math> is an inner product on <math>\mathcal{U}</math>. The method of Lagrange multipliers states that a solution to the problem has to be a stationary point of the lagrangian, namely

:<math> \begin{cases} d_u\mathcal{L}(u,v,\lambda;\delta_u) = d_uJ(u,v;\delta_u) + \langle \delta_u,D_v^*(\lambda)\rangle = 0 & \forall \delta_u \in \mathcal{U},\\ d_v\mathcal{L}(u,v,\lambda;\delta_v) = d_vJ(u,v;\delta_v) + \langle d_vD_v(u;\delta_v),\lambda\rangle = 0 & \forall \delta_v\in\mathcal{V},\\ d_\lambda\mathcal{L}(u,v,\lambda;\delta_\lambda) = \langle D_v(u), \delta_\lambda\rangle = 0 \quad & \forall \delta_\lambda \in \mathcal{U}, \end{cases} </math> where <math>d_xF(x;\delta_x)</math> is the Gateaux derivative of <math>F</math> with respect to <math>x</math> in the direction <math>\delta_x</math>. The last equation is equivalent to <math>D_v(u) = 0</math>, the state equation, to which the solution is <math>u_v</math>. The first equation is the so-called adjoint state equation,

:<math> \langle \delta_u,D_v^*(\lambda)\rangle = -d_uJ(u_v,v;\delta_u) \quad \forall \delta_u \in \mathcal{U}, </math>

because the operator involved is the adjoint operator of <math>D_v</math>, <math>D_v^*</math>. Resolving this equation yields the adjoint state <math>\lambda_v</math>. The gradient of the quantity of interest <math>j</math> with respect to <math>v</math> is <math>\langle \nabla j(v),\delta_v\rangle = d_v j(v;\delta_v) = d_v \mathcal{L}(u_v,v,\lambda_v;\delta_v) </math> (the second equation with <math>u=u_v</math> and <math>\lambda=\lambda_v</math>), thus it can be easily identified by subsequently resolving the direct and adjoint state equations. The process is even simpler when the operator <math>D_v</math> is self-adjoint or symmetric since the direct and adjoint state equations differ only by their right-hand side.

== Example: Linear case == In a real finite dimensional linear programming context, the objective function could be <math>J(u,v) = \langle Au , v \rangle</math>, for <math>v\in\mathbb{R}^n</math>, <math>u\in\mathbb{R}^m</math> and <math> A \in \mathbb{R}^{n\times m}</math>, and let the state equation be <math> B_vu = b</math>, with <math> B_v \in \mathbb{R}^{m\times m}</math> and <math> b\in \mathbb{R}^m</math>.

The Lagrangian function of the problem is <math> \mathcal{L}(u,v,\lambda) = \langle Au,v\rangle + \langle B_vu - b,\lambda\rangle</math>, where <math>\lambda\in\mathbb{R}^m</math>.

The derivative of <math>\mathcal{L}</math> with respect to <math>\lambda</math> yields the state equation as shown before, and the state variable is <math>u_v = B_v^{-1}b</math>. The derivative of <math>\mathcal{L}</math> with respect to <math>u</math> is equivalent to the adjoint equation, which is, for every <math>\delta_u \in \mathbb{R}^m</math>, :<math> d_u[\langle B_v\cdot- b, \lambda\rangle](u;\delta_u)= - \langle A^\top v,\delta u\rangle \iff \langle B_v \delta_u,\lambda\rangle = - \langle A^\top v,\delta u\rangle \iff \langle B_v^\top \lambda + A^\top v,\delta_u\rangle = 0 \iff B_v^\top \lambda = - A^\top v.</math> Thus, we can write symbolically <math>\lambda_v = - B_v^{-\top}A^\top v</math>. The gradient would be :<math> \langle \nabla j(v),\delta_v\rangle = \langle Au_v,\delta_v\rangle + \langle \nabla_vB_v : \lambda_v\otimes u_v,\delta_v\rangle,</math> where <math> \nabla_v B_v = \frac{\partial B_{ij}}{\partial v_k}</math> is a third-order tensor, <math>\lambda_v \otimes u_v = \lambda^\top_v u_v</math> is the dyadic product between the direct and adjoint states and <math>:</math> denotes a double tensor contraction. It is assumed that <math>B_v</math> has a known analytic expression that can be differentiated easily.

=== Numerical consideration for the self-adjoint case === If the operator <math> B_v</math> was self-adjoint, <math> B_v = B_v^\top</math>, the direct state equation and the adjoint state equation would have the same left-hand side. In the goal of never inverting a matrix, which is a very slow process numerically, a LU decomposition can be used instead to solve the state equation, in <math>O(m^3)</math> operations for the decomposition and <math>O(m^2)</math> operations for the resolution. That same decomposition can then be used to solve the adjoint state equation in only <math>O(m^2)</math> operations since the matrices are the same.

== References == <references/>

== External links ==

* A well written explanation by Errico: [https://dx.doi.org/10.1175/1520-0477(1997)078%3C2577:WIAAM%3E2.0.CO;2 What is an adjoint Model? ] * Another well written explanation with worked examples, written by Bradley [http://cs.stanford.edu/~ambrad/adjoint_tutorial.pdf] * More technical explanation: A [https://doi.org/10.1111/j.1365-246X.2006.02978.x review] of the adjoint-state method for computing the gradient of a functional with geophysical applications * MIT course [https://web.archive.org/web/20150906095132/http://ocw.mit.edu/courses/mathematics/18-325-topics-in-applied-mathematics-waves-and-imaging-fall-2012/lecture-notes/MIT18_325F12_Chapter4.pdf] * MIT notes [http://math.mit.edu/~stevenj/18.336/adjoint.pdf]

Category:Numerical analysis