# Conditional mutual information

> Mediated Wiki article. Canonical URL: https://mediated.wiki/source/Conditional_mutual_information
> Markdown URL: https://mediated.wiki/source/Conditional_mutual_information.md
> Source: https://en.wikipedia.org/wiki/Conditional_mutual_information
> Source revision: 1354129700
> License: Creative Commons Attribution-ShareAlike 4.0 International (https://creativecommons.org/licenses/by-sa/4.0/)


[[Image:VennInfo3Var.svg|thumb|256px|right|[Venn diagram](/source/Venn_diagram) of information theoretic measures for three variables <math>x</math>, <math>y</math>, and <math>z</math>, represented by the lower left, lower right, and upper circles, respectively. The conditional mutual informations <math>I(x;z|y)</math>, <math>I(y;z|x)</math> and <math>I(x;y|z)</math> are represented by the yellow, cyan, and magenta regions, respectively.]]

In [probability theory](/source/probability_theory), particularly [information theory](/source/information_theory), the '''conditional mutual information'''<ref name = Wyner1978>{{cite journal|last=Wyner|first=A. D. |title=A definition of conditional mutual information for arbitrary ensembles|journal=Information and Control|year=1978|volume=38|issue=1|pages=51–59|doi=10.1016/s0019-9958(78)90026-8|doi-access=free}}</ref><ref name = Dobrushin1959>{{cite journal|last=Dobrushin|first=R. L. |title=General formulation of Shannon's main theorem in information theory|journal=Uspekhi Mat. Nauk|year=1959|volume=14|pages=3–104}}</ref> is, in its most basic form, the [expected value](/source/expected_value) of the [mutual information](/source/mutual_information) of two random variables given the value of a third.

==Definition==
For random variables <math>X</math>, <math>Y</math>, and <math>Z</math> with [support sets](/source/Support_(mathematics)) <math>\mathcal{X}</math>, <math>\mathcal{Y}</math> and <math>\mathcal{Z}</math>, we define the conditional mutual information as

:<math>
I(X;Y|Z) = \int_\mathcal{Z} D_{\mathrm{KL}}( P_{(X,Y)|Z} \| P_{X|Z} \otimes P_{Y|Z} ) \, dP_{Z}.
</math>

This may be written in terms of the expectation operator: <math>I(X;Y|Z) = \mathbb{E}_Z [D_{\mathrm{KL}}( P_{(X,Y)|Z} \| P_{X|Z} \otimes P_{Y|Z} )]</math>.

Thus <math>I(X;Y|Z)</math> is the expected (with respect to <math>Z</math>) [Kullback–Leibler divergence](/source/Kullback%E2%80%93Leibler_divergence) from the conditional joint distribution <math>P_{(X,Y)|Z}</math> to the product of the conditional marginals <math>P_{X|Z}</math> and <math>P_{Y|Z}</math>. Compare with the definition of [mutual information](/source/mutual_information).

==In terms of PMFs for discrete distributions==
For discrete random variables <math>X</math>, <math>Y</math>, and <math>Z</math> with [support sets](/source/Support_(mathematics)) <math>\mathcal{X}</math>, <math>\mathcal{Y}</math> and <math>\mathcal{Z}</math>, the conditional mutual information <math>I(X;Y|Z)</math> is given by
:<math>
I(X;Y|Z) = \sum_{z\in \mathcal{Z}} p_Z(z) \sum_{y\in \mathcal{Y}} \sum_{x\in \mathcal{X}}
      p_{X,Y|Z}(x,y|z) \log \frac{p_{X,Y|Z}(x,y|z)}{p_{X|Z}(x|z)p_{Y|Z}(y|z)}
</math>
where the marginal, joint, and/or conditional [probability mass function](/source/probability_mass_function)s are denoted by <math>p</math> with the appropriate subscript. This can be simplified as

:<math>
I(X;Y|Z) = \sum_{z\in \mathcal{Z}} \sum_{y\in \mathcal{Y}} \sum_{x\in \mathcal{X}} p_{X,Y,Z}(x,y,z) \log \frac{p_Z(z)p_{X,Y,Z}(x,y,z)}{p_{X,Z}(x,z)p_{Y,Z}(y,z)}.
</math>
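The simplified sum can be evaluated directly from a joint PMF. The following is a minimal sketch, assuming the PMF is stored as a Python dictionary keyed by <code>(x, y, z)</code> and that logarithms are taken base 2 (so the result is in bits). The function name and the noisy-copy example are illustrative choices, not part of the definition.

```python
from collections import defaultdict
from math import log2


def conditional_mutual_information(p_xyz):
    """I(X;Y|Z) in bits from a joint PMF given as {(x, y, z): probability}."""
    p_z = defaultdict(float)
    p_xz = defaultdict(float)
    p_yz = defaultdict(float)
    # Accumulate the marginals p_Z, p_{X,Z} and p_{Y,Z} that appear in the formula.
    for (x, y, z), p in p_xyz.items():
        p_z[z] += p
        p_xz[(x, z)] += p
        p_yz[(y, z)] += p

    total = 0.0
    for (x, y, z), p in p_xyz.items():
        if p > 0.0:  # outcomes with zero probability contribute nothing
            total += p * log2(p_z[z] * p / (p_xz[(x, z)] * p_yz[(y, z)]))
    return total


# Hypothetical example: Z is a fair bit, and X and Y are independent noisy
# copies of Z (each flipped with probability 0.1). X and Y are then
# conditionally independent given Z, so I(X;Y|Z) should vanish.
flip = 0.1
p_xyz = {}
for z in (0, 1):
    for x in (0, 1):
        for y in (0, 1):
            p_x_given_z = 1 - flip if x == z else flip
            p_y_given_z = 1 - flip if y == z else flip
            p_xyz[(x, y, z)] = 0.5 * p_x_given_z * p_y_given_z

cmi = conditional_mutual_information(p_xyz)  # zero up to rounding error
```

In this example every log ratio equals 1 inside the logarithm, so the sum is zero up to floating-point rounding, even though <math>X</math> and <math>Y</math> are not unconditionally independent.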

==In terms of PDFs for continuous distributions==
For (absolutely) continuous random variables <math>X</math>, <math>Y</math>, and <math>Z</math> with [support sets](/source/Support_(mathematics)) <math>\mathcal{X}</math>, <math>\mathcal{Y}</math> and <math>\mathcal{Z}</math>, the conditional mutual information <math>I(X;Y|Z)</math> is given by
:<math>
I(X;Y|Z) = \int_{\mathcal{Z}} \bigg( \int_{\mathcal{Y}} \int_{\mathcal{X}}
      \log \left(\frac{p_{X,Y|Z}(x,y|z)}{p_{X|Z}(x|z)p_{Y|Z}(y|z)}\right) p_{X,Y|Z}(x,y|z) dx dy \bigg) p_Z(z) dz
</math>
where the marginal, joint, and/or conditional [probability density function](/source/probability_density_function)s are denoted by <math>p</math> with the appropriate subscript. This can be simplified as

:<math>
I(X;Y|Z) = \int_{\mathcal{Z}} \int_{\mathcal{Y}} \int_{\mathcal{X}} \log \left(\frac{p_Z(z)p_{X,Y,Z}(x,y,z)}{p_{X,Z}(x,z)p_{Y,Z}(y,z)}\right) p_{X,Y,Z}(x,y,z) \, dx \, dy \, dz.
</math>
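For jointly Gaussian variables the triple integral has a closed form. The sketch below relies on a standard fact that is not stated in this article: the differential entropy of a <math>k</math>-dimensional Gaussian with covariance <math>\Sigma</math> is <math>\tfrac{1}{2}\log\bigl((2\pi e)^k \det\Sigma\bigr)</math>. Under that assumption, for scalar <math>X</math>, <math>Y</math>, <math>Z</math> the integral reduces to <math>\tfrac{1}{2}\log\frac{\det\Sigma_{XZ}\,\det\Sigma_{YZ}}{\det\Sigma_{Z}\,\det\Sigma_{XYZ}}</math> (in nats). The helper names and the example model are hypothetical.

```python
from math import log


def det(m):
    """Determinant by cofactor expansion (adequate for the tiny matrices here)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total


def sub(cov, idx):
    """Covariance submatrix for the variables whose indices are listed in idx."""
    return [[cov[i][j] for j in idx] for i in idx]


def gaussian_cmi(cov, x=0, y=1, z=2):
    """I(X;Y|Z) in nats for scalar jointly Gaussian X, Y, Z with covariance cov."""
    num = det(sub(cov, [x, z])) * det(sub(cov, [y, z]))
    den = det(sub(cov, [z])) * det(sub(cov, [x, y, z]))
    return 0.5 * log(num / den)


# Hypothetical model: Z ~ N(0, 1), X = Z + N1, Y = Z + N2, where N1 and N2 are
# independent unit-variance Gaussian noise terms. Variable order is (X, Y, Z).
cov = [[2.0, 1.0, 1.0],
       [1.0, 2.0, 1.0],
       [1.0, 1.0, 1.0]]

cmi = gaussian_cmi(cov)  # X and Y are conditionally independent given Z
# Unconditional I(X;Y) from the same covariance, for comparison.
mi = 0.5 * log(det(sub(cov, [0])) * det(sub(cov, [1])) / det(sub(cov, [0, 1])))
```

Here <code>cmi</code> is zero while <code>mi</code> is strictly positive: the dependence between <math>X</math> and <math>Y</math> is entirely explained by <math>Z</math>.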

==Some identities==
Conditional mutual information can also be written in terms of joint and conditional [entropies](/source/Entropy_(information_theory)):<ref>{{cite book |last1=Cover |first1=Thomas |author-link1=Thomas M. Cover |last2=Thomas |first2=Joy A. |title=Elements of Information Theory |edition=2nd |location=New York |publisher=[Wiley-Interscience](/source/Wiley-Interscience) |date=2006 |isbn=0-471-24195-4}}</ref>
:<math>\begin{align}
I(X;Y|Z) &= H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) \\
         &= H(X|Z) - H(X|Y,Z) \\
         &= H(X|Z)+H(Y|Z)-H(X,Y|Z).
\end{align}</math>
This can be rewritten to show its relationship to mutual information
:<math>I(X;Y|Z) = I(X;Y,Z) - I(X;Z)</math>
usually rearranged as '''the chain rule for mutual information'''
:<math>I(X;Y,Z) = I(X;Z) + I(X;Y|Z)</math>
or
:<math>I(X;Y|Z) = I(X;Y) - (I(X;Z) - I(X;Z|Y))\,.</math>
Another equivalent form of the above is
:<math>\begin{align}
I(X;Y|Z) &= H(Z|X) + H(X) + H(Z|Y) + H(Y) - H(Z|X,Y) - H(X,Y) - H(Z)\\
         &= I(X;Y) + H(Z|X) + H(Z|Y) - H(Z|X,Y) - H(Z)
\end{align}\,.</math>
Like mutual information, conditional mutual information can be expressed as a [Kullback–Leibler divergence](/source/Kullback%E2%80%93Leibler_divergence):

:<math> I(X;Y|Z) = D_{\mathrm{KL}}[ p(X,Y,Z) \| p(X|Z)p(Y|Z)p(Z) ]. </math>

Or as an expected value of simpler Kullback–Leibler divergences:
:<math> I(X;Y|Z) = \sum_{z \in \mathcal{Z}} p( Z=z ) D_{\mathrm{KL}}[ p(X,Y|z) \| p(X|z)p(Y|z) ]</math>,
:<math> I(X;Y|Z) = \sum_{y \in \mathcal{Y}} p( Y=y ) D_{\mathrm{KL}}[ p(X,Z|y) \| p(X|Z)p(Z|y) ]</math>.
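These identities can be checked numerically. The following sketch, which assumes a small hand-picked joint PMF over three bits and base-2 logarithms, compares the entropy form <math>H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)</math> with the expected-divergence form summed over <math>z</math>. All names and probability values are illustrative.

```python
from collections import defaultdict
from math import log2


def marginal(p_xyz, keep):
    """Marginalize a joint PMF {(x, y, z): p} onto the coordinate indices in keep."""
    out = defaultdict(float)
    for outcome, p in p_xyz.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out


def entropy(pmf):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0.0)


# Arbitrary illustrative joint PMF over three bits (hand-picked, sums to 1).
weights = [0.20, 0.05, 0.10, 0.15, 0.05, 0.25, 0.10, 0.10]
outcomes = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
p_xyz = dict(zip(outcomes, weights))

X, Y, Z = 0, 1, 2
# Entropy form: H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
cmi_entropy = (entropy(marginal(p_xyz, [X, Z])) + entropy(marginal(p_xyz, [Y, Z]))
               - entropy(p_xyz) - entropy(marginal(p_xyz, [Z])))

# Expected-divergence form: sum_z p(z) * D_KL[p(X,Y|z) || p(X|z) p(Y|z)]
p_z, p_xz, p_yz = (marginal(p_xyz, k) for k in ([Z], [X, Z], [Y, Z]))
cmi_kl = 0.0
for z in (0, 1):
    pz = p_z[(z,)]
    for x in (0, 1):
        for y in (0, 1):
            pxy_z = p_xyz[(x, y, z)] / pz
            px_z, py_z = p_xz[(x, z)] / pz, p_yz[(y, z)] / pz
            cmi_kl += pz * pxy_z * log2(pxy_z / (px_z * py_z))
```

The two quantities agree up to floating-point rounding, as the identities predict.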

==More general definition==
A more general definition of conditional mutual information, applicable to random variables with continuous or other arbitrary distributions, depends on the concept of '''[regular conditional probability](/source/regular_conditional_probability)'''.<ref>D. Leao, Jr. et al. ''Regular conditional probability, disintegration of probability and Radon spaces.'' Proyecciones. Vol. 23, No. 1, pp. 15–29, May 2004, Universidad Católica del Norte, Antofagasta, Chile [http://www.scielo.cl/pdf/proy/v23n1/art02.pdf PDF]</ref>

Let <math>(\Omega, \mathcal F, \mathfrak P)</math> be a [probability space](/source/probability_space), and let the random variables <math>X</math>, <math>Y</math>, and <math>Z</math> each be defined as a Borel-measurable function from <math>\Omega</math> to some state space endowed with a topological structure.

Consider the [Borel measure](/source/Borel_measure) (on the [σ-algebra](/source/%CF%83-algebra) generated by the open sets) in the state space of each [random variable](/source/random_variable) defined by assigning each Borel set the <math>\mathfrak P</math>-measure of its preimage in <math>\mathcal F</math>.  This is called the [pushforward measure](/source/pushforward_measure) <math>X _* \mathfrak P = \mathfrak P\big(X^{-1}(\cdot)\big).</math>  The '''support of a random variable''' is defined to be the [topological support](/source/Support_(measure_theory)) of this measure, i.e. <math>\mathrm{supp}\,X = \mathrm{supp}\,X _* \mathfrak P.</math>

Now we can formally define the [conditional probability measure](/source/conditional_probability_distribution) given the value of one (or, via the [product topology](/source/product_topology), more) of the random variables.  Let <math>M</math> be a measurable subset of <math>\Omega</math> (i.e. <math>M \in \mathcal F</math>), and let <math>x \in \mathrm{supp}\,X.</math>  Then, using the [disintegration theorem](/source/disintegration_theorem):
:<math>\mathfrak P(M | X=x) = \lim_{U \ni x}
  \frac {\mathfrak P(M \cap \{X \in U\})}
        {\mathfrak P(\{X \in U\})}
  \qquad \textrm{and} \qquad \mathfrak P(M|X) = \int_M d\mathfrak P\big(\omega|X=X(\omega)\big),</math>
where the limit is taken over the open neighborhoods <math>U</math> of <math>x</math>, as they become arbitrarily small with respect to [set inclusion](/source/Subset).

Finally we can define the conditional mutual information via [Lebesgue integration](/source/Lebesgue_integration):
:<math>I(X;Y|Z) = \int_\Omega \log
  \Bigl(
  \frac {d \mathfrak P(\omega|X,Z)\, d\mathfrak P(\omega|Y,Z)}
        {d \mathfrak P(\omega|Z)\, d\mathfrak P(\omega|X,Y,Z)}
  \Bigr)
  d \mathfrak P(\omega),
  </math>
where the integrand is the logarithm of a [Radon–Nikodym derivative](/source/Radon%E2%80%93Nikodym_derivative) involving some of the conditional probability measures we have just defined.

==Note on notation==
In an expression such as <math>I(A;B|C),</math> the symbols <math>A,</math> <math>B,</math> and <math>C</math> need not be restricted to representing individual random variables, but could also represent the joint distribution of any collection of random variables defined on the same [probability space](/source/probability_space).  As is common in [probability theory](/source/probability_theory), we may use the comma to denote such a joint distribution, e.g. <math>I(A_0,A_1;B_1,B_2,B_3|C_0,C_1).</math>  Hence the use of the semicolon (or occasionally a colon or even a wedge <math>\wedge</math>) to separate the principal arguments of the mutual information symbol.  (No such distinction is necessary in the symbol for [joint entropy](/source/joint_entropy), since the joint entropy of any number of random variables is the same as the entropy of their joint distribution.)

== Properties ==
===Nonnegativity===
It is always true that
:<math>I(X;Y|Z) \ge 0</math>,
for discrete, jointly distributed random variables <math>X</math>, <math>Y</math> and <math>Z</math>.  This result has been used as a basic building block for proving other [inequalities in information theory](/source/inequalities_in_information_theory), in particular, those known as Shannon-type inequalities. Conditional mutual information is also non-negative for continuous random variables under certain regularity conditions.<ref>{{cite book |last1=Polyanskiy |first1=Yury |last2=Wu |first2=Yihong |title=Lecture notes on information theory |date=2017 |page=30 |url=http://people.lids.mit.edu/yp/homepage/data/itlectures_v5.pdf}}</ref>

===Interaction information===
Conditioning on a third random variable may either increase or decrease the mutual information: that is, the difference <math>I(X;Y) - I(X;Y|Z)</math>, called the [interaction information](/source/interaction_information), may be positive, negative, or zero. This holds even when the random variables are pairwise independent. For example, let <math>X</math> and <math>Z</math> be independent with <math display="block">X \sim \mathrm{Bernoulli}(0.5), Z \sim \mathrm{Bernoulli}(0.5), \quad Y=\left\{\begin{array}{ll} X & \text{if }Z=0\\ 1-X & \text{if }Z=1 \end{array}\right.</math>Then <math>X</math>, <math>Y</math> and <math>Z</math> are pairwise independent, and in particular <math>I(X;Y)=0</math>, but <math>I(X;Y|Z)=1</math> bit. (Here <math>Y</math> is the [xor](/source/xor) of <math>X</math> and <math>Z</math>, so <math>Z</math> acts as the "secret key" for the "plaintext" <math>X</math> and the "ciphertext" <math>Y</math>.)
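The xor example can be verified by brute-force enumeration. This is a minimal sketch assuming base-2 logarithms; the helper <code>prob</code> is a hypothetical convenience for reading marginals off the joint PMF.

```python
from itertools import product
from math import log2

# Joint PMF of (X, Y, Z) with X, Z independent fair bits and Y = X xor Z.
p_xyz = {(x, x ^ z, z): 0.25 for x, z in product((0, 1), repeat=2)}


def prob(**fixed):
    """Probability that the named coordinates take the given values, e.g. prob(x=0, z=1)."""
    names = ("x", "y", "z")
    return sum(p for outcome, p in p_xyz.items()
               if all(outcome[names.index(k)] == v for k, v in fixed.items()))


# I(X;Y) = sum over (x, y) of p(x,y) * log2[ p(x,y) / (p(x) p(y)) ]
mi_xy = sum(prob(x=x, y=y) * log2(prob(x=x, y=y) / (prob(x=x) * prob(y=y)))
            for x, y in product((0, 1), repeat=2) if prob(x=x, y=y) > 0)

# I(X;Y|Z) = sum over (x, y, z) of p(x,y,z) * log2[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]
cmi_xy_z = sum(p * log2(prob(z=z) * p / (prob(x=x, z=z) * prob(y=y, z=z)))
               for (x, y, z), p in p_xyz.items())

print(mi_xy, cmi_xy_z)  # 0 bits unconditionally, 1 bit given Z
```

Each of the four possible outcomes has probability 1/4 and contributes a log ratio of exactly 1 bit to the conditional sum, while every unconditional log ratio is 0.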

===Chain rule for mutual information===
The chain rule (as derived above) provides two ways to decompose <math>I(X;Y,Z)</math>:
:<math>
\begin{align}
I(X;Y,Z) &= I(X;Z) + I(X;Y|Z) \\
         &= I(X;Y) + I(X;Z|Y)
\end{align}
</math>
The [data processing inequality](/source/data_processing_inequality) is closely related to conditional mutual information and can be proven using the chain rule.
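
As a sketch of that proof, assume that <math>X \to Y \to Z</math> forms a Markov chain, meaning that <math>X</math> and <math>Z</math> are conditionally independent given <math>Y</math> (this is the usual setting for the inequality and is an assumption added here). Then <math>I(X;Z|Y) = 0</math>, and equating the two decompositions of <math>I(X;Y,Z)</math> gives
:<math>I(X;Y) = I(X;Z) + I(X;Y|Z) \ge I(X;Z),</math>
where the inequality follows from the nonnegativity of <math>I(X;Y|Z)</math>.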

==Interaction information==
{{main|Interaction information}}
The conditional mutual information is used to inductively define the '''interaction information''', a generalization of mutual information, as follows:
:<math>I(X_1;\ldots;X_{n+1}) = I(X_1;\ldots;X_n) - I(X_1;\ldots;X_n|X_{n+1}),</math>
where
:<math>I(X_1;\ldots;X_{n-1}|X_n) = \mathbb{E}_{X_n} \bigl[I(X_1;\ldots;X_{n-1})|X_n\bigr].</math>
Because the conditional mutual information can be greater than or less than its unconditional counterpart, the interaction information can be positive, negative, or zero, which makes it hard to interpret.
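
For three variables, the sign behaviour can be illustrated with a short computation. The sketch below assumes base-2 logarithms and expands <math>I(X;Y) - I(X;Y|Z)</math> into joint entropies. The two example distributions (an xor triple and three copies of one fair bit) are illustrative choices.

```python
from collections import defaultdict
from math import log2


def entropy_of(p_xyz, keep):
    """Entropy in bits of the marginal of {(x, y, z): p} on the coordinate indices in keep."""
    marg = defaultdict(float)
    for outcome, p in p_xyz.items():
        marg[tuple(outcome[i] for i in keep)] += p
    return -sum(p * log2(p) for p in marg.values() if p > 0.0)


def interaction_information(p_xyz):
    """I(X;Y;Z) = I(X;Y) - I(X;Y|Z), expanded into joint entropies."""
    h = lambda *keep: entropy_of(p_xyz, keep)
    mi = h(0) + h(1) - h(0, 1)                    # I(X;Y)
    cmi = h(0, 2) + h(1, 2) - h(0, 1, 2) - h(2)   # I(X;Y|Z)
    return mi - cmi


xor = {(x, x ^ z, z): 0.25 for x in (0, 1) for z in (0, 1)}  # Y = X xor Z
copies = {(b, b, b): 0.5 for b in (0, 1)}                    # X = Y = Z, one fair bit

negative_case = interaction_information(xor)     # conditioning on Z creates dependence
positive_case = interaction_information(copies)  # conditioning on Z removes dependence
```

The xor triple gives an interaction information of −1 bit, while the three identical copies give +1 bit, showing that both signs occur.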

==References==
<references/>


---
Adapted from the Wikipedia article [Conditional mutual information](https://en.wikipedia.org/wiki/Conditional_mutual_information) by Wikipedia contributors ([contributor history](https://en.wikipedia.org/wiki/Conditional_mutual_information?action=history)). Available under [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/). Changes may have been made.