Optimal matching

{{Short description|Sequence analysis in social science}} {{distinguish|text=maximum matching in graph theory or the statistical problem of finding an optimal match for causal inference}} '''Optimal matching''' is a sequence analysis method used in social science, to assess the dissimilarity of ordered arrays of tokens that usually represent a time-ordered sequence of socio-economic states two individuals have experienced. Once such distances have been calculated for a set of observations (e.g. individuals in a cohort) classical tools (such as cluster analysis) can be used. The method was tailored to social sciences<ref>A. Abbott and A. Tsay, (2000) ''[http://smr.sagepub.com/cgi/content/abstract/29/1/3 Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect]'' Sociological Methods & Research], Vol. 29, 3-33. {{doi|10.1177/0049124100029001001}}</ref> from a technique originally introduced to study molecular biology (protein or genetic) sequences (see sequence alignment). Optimal matching uses the Needleman-Wunsch algorithm.

== Algorithm == Let <math>S = (s_1, s_2, s_3, \ldots s_T)</math> be a sequence of states <math>s_i</math> belonging to a finite set of possible states. Let us denote <math>{\mathbf S}</math> the sequence space, i.e. the set of all possible sequences of states.

Optimal matching algorithms work by defining simple operator algebras that manipulate sequences, i.e. a set of operators <math>a_i: {\mathbf S} \rightarrow {\mathbf S}</math>. In the most simple approach, a set composed of only three basic operations to transform sequences is used: * one state <math>s</math> is inserted in the sequence <math>a^{\rm Ins}_{s'} (s_1, s_2, s_3, \ldots s_T) = (s_1, s_2, s_3, \ldots, s', \ldots s_T) </math> * one state is deleted from the sequence <math>a^{\rm Del}_{s_2} (s_1, s_2, s_3, \ldots s_T) = (s_1, s_3, \ldots s_T)</math> and * a state <math>s_1</math> is replaced (substituted) by state <math>s'_1</math>, <math>a^{\rm Sub}_{s_1,s'_1} (s_1, s_2, s_3, \ldots s_T) = (s'_1, s_2, s_3, \ldots s_T)</math>.

Imagine now that a ''cost'' <math>c(a_i) \in {\mathbf R}^+_0</math> is associated to each operator. Given two sequences <math>S_1</math> and <math>S_2</math>, the idea is to measure the ''cost'' of obtaining <math>S_2</math> from <math>S_1</math> using operators from the algebra. Let <math>A={a_1, a_2, \ldots a_n}</math> be a sequence of operators such that the application of all the operators of this sequence <math>A</math> to the first sequence <math>S_1</math> gives the second sequence <math>S_2</math>: <math>S_2 = a_1 \circ a_2 \circ \ldots \circ a_{n} (S_1)</math> where <math>a_1 \circ a_2</math> denotes the compound operator. To this set we associate the cost <math>c(A) = \sum_{i=1}^n c(a_i)</math>, that represents the total cost of the transformation. One should consider at this point that there might exist different such sequences <math>A</math> that transform <math>S_1</math> into <math>S_2</math>; a reasonable choice is to select the cheapest of such sequences. We thus call distance <br> <math>d(S_1,S_2)= \min_A \left \{ c(A)~{\rm such~that}~S_2 = A (S_1) \right \} </math> <br> that is, the cost of the least expensive set of transformations that turn <math>S_1</math> into <math>S_2</math>. Notice that <math>d(S_1,S_2)</math> is by definition nonnegative since it is the sum of positive costs, and trivially <math>d(S_1,S_2)=0</math> if and only if <math>S_1=S_2</math>, that is there is no cost. The distance function is symmetric if insertion and deletion costs are equal <math>c(a^{\rm Ins}) = c(a^{\rm Del})</math>; the term ''indel'' cost usually refers to the common cost of insertion and deletion.

Considering a set composed of only the three basic operations described above, this proximity measure satisfies the triangular inequality. Transitivity however, depends on the definition of the set of elementary operations.

== Criticism == Although optimal matching techniques are widely used in sociology and demography, such techniques also have their flaws. As was pointed out by several authors (for example L. L. Wu<ref>L. L. Wu. (2000) ''[http://smr.sagepub.com/cgi/content/refs/29/1/41 Some Comments on "Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect"] {{Webarchive|url=https://web.archive.org/web/20061024143211/http://smr.sagepub.com/cgi/content/refs/29/1/41 |date=2006-10-24 }}'' Sociological Methods & Research, 29 41-64. {{doi|10.1177/0049124100029001003}}</ref>), the main problem in the application of optimal matching is to appropriately define the costs <math>c(a_i)</math>.

== Software == * [http://www.stat.ruhr-uni-bochum.de/tda.html TDA] is a powerful program, offering access to some of the latest developments in transition data analysis. * [http://ideas.repec.org/a/tsj/stataj/v6y2006i4p435-460.html STATA] has implemented a package to run optimal matching analysis. * [http://traminer.unige.ch/ TraMineR] is an open source R-package for analyzing and visualizing states and events sequences, including optimal matching analysis.

== References and notes == <references/>

Category:Data mining Category:Statistical distance Category:Quantitative research