Influential observation

{{Short description|Observation that would cause a large change if deleted}}
[[File:Anscombe's quartet 3.svg|right|400px|thumb|In [[Anscombe's quartet]] the two [[dataset]]s on the bottom both contain influential points. All four sets are identical when examined using simple summary statistics, but vary considerably when graphed. If one point is removed, the line would look very different.]]

In [[statistics]], an '''influential observation''' is an observation for a [[Estimation theory|statistical calculation]] whose deletion from the dataset would noticeably change the [[result]] of the calculation.<ref>{{citation|title=Elementary Statistics for Geographers|first1=James E.|last1=Burt|first2=Gerald M.|last2=Barber|first3=David L.|last3=Rigby|publisher=Guilford Press|year=2009|isbn=9781572304840|page=513|url=https://books.google.com/books?id=p7YMOPuu8ugC&pg=PA513}}.</ref> In particular, in [[regression analysis]] an influential observation is one whose deletion has a large effect on the parameter estimates.<ref name = "Everitt"/>

== Assessment ==

Various methods have been proposed for measuring influence.<ref>{{cite web |first=Larry |last=Winner |title=Influence Statistics, Outliers, and Collinearity Diagnostics |date=March 25, 2002 |url=http://stat.ufl.edu/~winner/sta6127/influence.doc }}</ref><ref>{{cite book |last1=Belsley |first1=David A. |last2=Kuh |first2=Edwin |last3=Welsch |first3=Roy E. | year=1980 |title=Regression Diagnostics: Identifying Influential Data and Sources of Collinearity |publisher=[[John Wiley & Sons]] |location=New York |series=Wiley Series in Probability and Mathematical Statistics |isbn=0-471-05856-4 |pages=11–16 |url=https://books.google.com/books?id=GECBEUJVNe0C&pg=PA11 }}</ref> Assume an estimated regression <math>\mathbf{y} = \mathbf{X} \mathbf{b} + \mathbf{e}</math>, where <math>\mathbf{y}</math> is an ''n''×1 column vector for the response variable, <math>\mathbf{X}</math> is the ''n''×''k'' [[design matrix]] of explanatory variables (including a constant), <math>\mathbf{e}</math> is the ''n''×1 residual vector, and <math>\mathbf{b}</math> is a ''k''×1 vector of estimates of the population parameter vector <math>\mathbf{\beta} \in \mathbb{R}^{k}</math>. Also define <math>\mathbf{H} \equiv \mathbf{X} \left(\mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathsf{T}}</math>, the [[projection matrix]] of <math>\mathbf{X}</math>. The following measures of influence can then be defined:

# <math>\text{DFBETA}_{i} \equiv \mathbf{b} - \mathbf{b}_{(-i)} = \frac{\left( \mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{x}_{i}^{\mathsf{T}} e_{i}}{1 - h_{ii}}</math>, where <math>\mathbf{b}_{(-i)}</math> denotes the coefficients estimated with the ''i''-th row <math>\mathbf{x}_{i}</math> of <math>\mathbf{X}</math> deleted, and <math>h_{ii} = \mathbf{x}_{i} \left( \mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{x}_{i}^{\mathsf{T}}</math> denotes the ''i''-th element of the main diagonal of <math>\mathbf{H}</math>. Thus DFBETA measures the difference in each parameter estimate with and without the ''i''-th observation. There is a DFBETA for each variable and each observation (if there are ''n'' observations and ''k'' variables, there are ''n''·''k'' DFBETAs).<ref>{{cite web |title=Outliers and DFBETA |url=http://www.albany.edu/faculty/kretheme/PAD705/SupportMat/DFBETA.pdf |url-status=live |archive-date=May 11, 2013 |archive-url=https://web.archive.org/web/20130511013229/http://www.albany.edu/faculty/kretheme/PAD705/SupportMat/DFBETA.pdf }}</ref> The table shows DFBETAs for the third dataset from Anscombe's quartet (bottom left chart in the figure):

{| class="wikitable" style="text-align: center; margin-left:auto; margin-right:auto;" border="1"
! x
! y
! intercept
! slope
|-
| 10.0 || 7.46 || -0.005 || -0.044
|-
| 8.0 || 6.77 || -0.037 || 0.019
|-
| '''13.0''' || '''12.74''' || '''-357.910''' || '''525.268'''
|-
| 9.0 || 7.11 || -0.033 || 0
|-
| 11.0 || 7.81 || 0.049 || -0.117
|-
| 14.0 || 8.84 || 0.490 || -0.667
|-
| 6.0 || 6.08 || 0.027 || -0.021
|-
| 4.0 || 5.39 || 0.241 || -0.209
|-
| 12.0 || 8.15 || 0.137 || -0.231
|-
| 7.0 || 6.42 || -0.020 || 0.013
|-
| 5.0 || 5.73 || 0.105 || -0.087
|}

{{ordered list
| item1_value=2 | 1 = [[DFFITS]] ("difference in fits") measures the scaled change in the fitted value for an observation when that observation is deleted.
| item2_value=3 | 2 = [[Cook's distance|Cook's ''D'']] measures the effect of removing a data point on all the parameters combined.<ref name="Everitt">{{cite book | last = Everitt | first = Brian | title = The Cambridge Dictionary of Statistics | publisher = Cambridge University Press | location = Cambridge, UK New York | year = 1998 | isbn = 0-521-59346-8 | url = https://archive.org/details/cambridgediction00ever_0 }}</ref>
}}
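
The closed-form expression for DFBETA given above can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes a simple regression with an intercept and one slope, uses the ''x'' and ''y'' values of Anscombe's third dataset as listed in the table, and compares the formula against a literal leave-one-out refit. The function names (<code>fit_line</code>, <code>dfbeta_closed_form</code>, <code>dfbeta_by_refit</code>) are hypothetical and chosen only for illustration. It computes the raw difference <math>\mathbf{b} - \mathbf{b}_{(-i)}</math>; statistical packages often report a scaled variant, so printed magnitudes may differ from tabulated ones.

```python
# Anscombe's third dataset, in the row order of the table above.
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def dfbeta_closed_form(xs, ys):
    """Raw DFBETA_i = (X'X)^-1 x_i' e_i / (1 - h_ii) for each observation."""
    n = len(xs)
    b0, b1 = fit_line(xs, ys)
    # X'X = [[n, sx], [sx, sxx]] for the design matrix with rows [1, x_i].
    sx = sum(xs)
    sxx = sum(xi * xi for xi in xs)
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    result = []
    for xi, yi in zip(xs, ys):
        e = yi - (b0 + b1 * xi)            # residual e_i
        v0 = inv[0][0] + inv[0][1] * xi    # first entry of (X'X)^-1 x_i'
        v1 = inv[1][0] + inv[1][1] * xi    # second entry of (X'X)^-1 x_i'
        h = v0 + xi * v1                   # leverage h_ii
        result.append((v0 * e / (1 - h), v1 * e / (1 - h)))
    return result


def dfbeta_by_refit(xs, ys):
    """Raw DFBETA_i obtained by deleting row i and refitting."""
    b0, b1 = fit_line(xs, ys)
    result = []
    for i in range(len(xs)):
        c0, c1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        result.append((b0 - c0, b1 - c1))
    return result


if __name__ == "__main__":
    pairs = zip(x, dfbeta_closed_form(x, y), dfbeta_by_refit(x, y))
    for xi, closed, refit in pairs:
        print(f"x={xi:5.1f}  closed={closed[0]:+.4f},{closed[1]:+.4f}"
              f"  refit={refit[0]:+.4f},{refit[1]:+.4f}")
```

In this sketch the two approaches agree up to floating-point error, and the observation at ''x'' = 13 produces by far the largest change in the slope, which is consistent with the highlighted row of the table.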

== Outliers, leverage and influence ==

An [[outlier]] may be defined as a [[data point]] that differs markedly from other observations.<ref>{{Cite journal |last=Grubbs |first=F. E. |date=February 1969 |title=Procedures for detecting outlying observations in samples |journal=Technometrics |volume=11 |issue=1 |pages=1–21 |doi= 10.1080/00401706.1969.10490657|quote=An outlying observation, or "outlier," is one that appears to deviate markedly from other members of the sample in which it occurs.}}</ref><ref>{{cite book |last=Maddala |first=G. S. |author-link=G. S. Maddala |chapter=Outliers |title=Introduction to Econometrics |location=New York |publisher=MacMillan |edition=2nd |year=1992 |isbn=978-0-02-374545-4 |pages=[https://archive.org/details/introductiontoec00madd/page/89 89] |quote=An outlier is an observation that is far removed from the rest of the observations. |chapter-url=https://books.google.com/books?id=nBS3AAAAIAAJ&pg=PA89 |url=https://archive.org/details/introductiontoec00madd/page/89 }}</ref> [[High-leverage point]]s are observations made at extreme values of the independent variables.<ref>{{cite book |last=Everitt |first=B. S. |year=2002 |title=Cambridge Dictionary of Statistics |publisher=Cambridge University Press |isbn=0-521-81099-X }}</ref> Both types of atypical observation tend to pull the regression line toward the point.<ref name = "Everitt"/> In Anscombe's quartet, the bottom right image has a point with high leverage and the bottom left image has an outlying point.
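
The distinction between leverage and outlyingness can be illustrated with the diagonal of the projection matrix. The sketch below is illustrative only: it assumes a straight-line fit with an intercept, for which <math>h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2</math>, and it uses the ''x'' values of Anscombe's third and fourth datasets. The fourth dataset's values are not listed in this article and are taken here as an assumption from the commonly tabulated quartet; the function name <code>leverages</code> is hypothetical.

```python
def leverages(xs):
    """Leverage h_ii of each point for a line fitted with an intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in xs]


# x values of Anscombe's third dataset (bottom left chart).
x3 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
# x values of Anscombe's fourth dataset (bottom right chart); assumed here,
# not given in the article: ten points at x = 8 and one at x = 19.
x4 = [8.0] * 7 + [19.0] + [8.0] * 3

if __name__ == "__main__":
    for name, xs in (("third", x3), ("fourth", x4)):
        hs = leverages(xs)
        largest = max(range(len(xs)), key=lambda i: hs[i])
        print(f"{name} dataset: largest leverage {hs[largest]:.3f} "
              f"at x = {xs[largest]}")
```

Under these assumptions the isolated point in the fourth dataset has a leverage of essentially 1, so the fitted line is forced through it, whereas the unusual point in the third dataset (at ''x'' = 13) has only moderate leverage and stands out through its ''y'' value instead.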

== See also ==
* [[Influence function (statistics)]]
* [[Outlier]]
* [[Leverage (statistics)|Leverage]]
** [[Partial leverage]]
* [[Regression analysis]]
* {{slink|Cook's distance|Detecting highly influential observations}}
* [[Anomaly detection]]

== References ==
{{reflist}}

== Further reading ==
* {{cite journal |first1=Catherine |last1=Dehon |first2=Marjorie |last2=Gassner |first3=Vincenzo |last3=Verardi |title=Beware of 'Good' Outliers and Overoptimistic Conclusions |journal=Oxford Bulletin of Economics and Statistics |volume=71 |issue=3 |pages=437–452 |year=2009 |doi=10.1111/j.1468-0084.2009.00543.x |s2cid=154376487 }}
* {{cite book |first=Peter |last=Kennedy |author-link=Peter Kennedy (economist) |chapter=Robust Estimation |title=A Guide to Econometrics |location=Cambridge |publisher=The MIT Press |edition=Fifth |year=2003 |isbn=0-262-61183-X |pages=372–388 |chapter-url=https://books.google.com/books?id=B8I5SP69e4kC&pg=PA372 }}

[[Category:Actuarial science]]
[[Category:Regression diagnostics]]
[[Category:Robust statistics]]