Dice-Sørensen coefficient

{{Short description|Statistic used for comparing the similarity of two samples}}
The '''Dice-Sørensen coefficient''' {{crossreference|(see below for other names)}} is a statistic used to gauge the similarity of two [[Sample (statistics)|samples]]. It was independently developed by the botanists [[Lee R. Dice|Lee Raymond Dice]]<ref>{{cite journal |last=Dice |first=Lee R. |title=Measures of the Amount of Ecologic Association Between Species |jstor=1932409 |journal=Ecology |volume=26 |issue=3 |year=1945 |pages=297–302 |doi=10.2307/1932409 |bibcode=1945Ecol...26..297D |s2cid=53335638 }}</ref> and [[Thorvald Sørensen]],<ref>{{cite journal |last=Sørensen |first=T. |year=1948 |title=A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons |journal=[[Kongelige Danske Videnskabernes Selskab]] |volume=5 |issue=4 |pages=1–34 }}</ref> who published in 1945 and 1948 respectively.

==Name==
The index is known by several other names, especially '''Sørensen–Dice index''',<ref name ="carass">{{cite journal | last1=Carass | first1=A. | last2=Roy | first2=S. | last3=Gherman | first3=A. | last4=Reinhold | first4=J.C. | last5=Jesson |first5=A. | last6=Arbel | first6=T. | last7=Maier | first7=O. | last8=Handels | first8=H. | last9=Ghafoorian | first9=M. | last10=Platel | first10=B. | last11=Birenbaum | first11=A. | last12=Greenspan | first12=H. | last13=Pham | first13=D.L. | last14=Crainiceanu | first14=C.M. | last15=Calabresi | first15=P.A. | last16=Prince | first16=J.L. | last17=Gray Roncal | first17=W.R. | last18=Shinohara | first18=R.T. | last19=Oguz | first19=I. | display-authors=5 | title=Evaluating White Matter Lesion Segmentations with Refined Sørensen-Dice Analysis | journal=Scientific Reports | volume=10 | issue= 1| year=2020 | issn=2045-2322 | doi=10.1038/s41598-020-64803-w | page=8242 | pmid=32427874| pmc=7237671 | bibcode=2020NatSR..10.8242C | doi-access=free }}</ref> '''Sørensen index''' and '''Dice's coefficient'''. Other variations include the "similarity coefficient" or "index", such as '''Dice similarity coefficient''' ('''DSC'''). Common alternate spellings for Sørensen are ''Sorenson'', ''Soerenson'' and ''Sörenson'', and all three can also be seen with the ''–sen'' ending (the [[Danish phonology|Danish letter ø]] is phonetically equivalent to the German/Swedish ö, which can be written as oe in ASCII).

Other names include:
* [[F1 score]]
* [[Jan Czekanowski|Czekanowski]]'s binary (non-quantitative) index<ref name ="gallagher"/>
* Measure of genetic similarity<ref name="nei">{{cite journal | title=Mathematical model for studying genetic variation in terms of restriction endonucleases | last1 = Nei | first1 = M. | last2 = Li | first2 = W.H. | journal = [[Proceedings of the National Academy of Sciences of the United States of America|PNAS]] | year = 1979 | volume = 76 | issue = 10 | pages = 5269–5273 | doi = 10.1073/pnas.76.10.5269 | pmid = 291943 | pmc = 413122 | bibcode = 1979PNAS...76.5269N | doi-access = free }}</ref>
* Zijdenbos similarity index,<ref>{{cite conference | last1=Prescott | first1=J.W. | last2=Pennell | first2=M. | last3=Best | first3=T.M. | last4=Swanson | first4=M.S. | last5=Haq | first5=F. | last6=Jackson | first6=R. | last7=Gurcan | first7=M.N. | title=2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society | chapter=An automated method to segment the femur for osteoarthritis research | publisher=IEEE | year=2009 | pages=6364–6367 | doi=10.1109/iembs.2009.5333257 | pmc=2826829 }}</ref><ref>{{cite journal | last1=Swanson | first1=M.S. | last2=Prescott | first2=J.W. | last3=Best | first3=T.M. | last4=Powell | first4=K. | last5=Jackson | first5=R.D. | last6=Haq | first6=F. | last7=Gurcan | first7=M.N. | title=Semi-automated segmentation to assess the lateral meniscus in normal and osteoarthritic knees | journal=Osteoarthritis and Cartilage | volume=18 | issue=3 | year=2010 | issn=1063-4584 | doi=10.1016/j.joca.2009.10.004 | pages=344–353 | pmc=2826568 | pmid=19857510}}</ref> referring to a 1994 paper of Zijdenbos et al.<ref name ="zijdenbos">{{cite journal | last1=Zijdenbos | first1=A.P. | last2=Dawant | first2=B.M. | last3=Margolin | first3=R.A. | last4=Palmer | first4=A.C. | title=Morphometric analysis of white matter lesions in MR images: method and validation | journal=IEEE Transactions on Medical Imaging | volume=13 | issue=4 | year=1994 | issn=0278-0062 | doi=10.1109/42.363096 | pages=716–724 | pmid=18218550 | bibcode=1994ITMI...13..716Z }}</ref><ref name ="carass" />

==Formula==
Sørensen's original formula was intended to be applied to discrete data. Given two sets, ''X'' and ''Y'', it is defined as

:<math> DSC = \frac{2 |X \cap Y|}{|X| + |Y|}</math>

where |''X''| and |''Y''| are the [[Cardinality|cardinalities]] of the two sets (i.e. the number of elements in each set). The Sørensen index equals twice the number of elements common to both sets divided by the sum of the number of elements in each set. Equivalently, the index is the size of the intersection as a fraction of the average size of the two sets.
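The set definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name <code>dice_coefficient</code> is hypothetical, and returning 1.0 when both sets are empty is an assumed convention, since the formula is undefined (0/0) in that case.

```python
def dice_coefficient(x: set, y: set) -> float:
    """Return 2|X ∩ Y| / (|X| + |Y|) for two finite sets."""
    total = len(x) + len(y)
    if total == 0:
        # Both sets are empty, so the ratio is 0/0.
        # Treating two empty sets as identical is an assumption made here.
        return 1.0
    return 2 * len(x & y) / total


if __name__ == "__main__":
    # Two elements in common, set sizes 3 and 4: 2*2 / (3 + 4) = 4/7.
    print(dice_coefficient({"a", "b", "c"}, {"b", "c", "d", "e"}))
```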

When applied to Boolean data, using the definition of true positive (TP), false positive (FP), and false negative (FN), it can be written as

:<math> DSC = \frac{2 \mathit{TP}}{2 \mathit{TP} + \mathit{FP} + \mathit{FN}}</math>.

It is different from the [[Jaccard index]] which only counts true positives once in both the numerator and denominator. DSC is the quotient of similarity and ranges between 0 and 1.<ref>{{Cite journal |last1=Murguía |first1=Miguel |last2=Luis Villaseñor |first2=José |date=2003 |title=Estimating the effect of the similarity coefficient and the cluster algorithm on biogeographic classifications |url=https://www.sekj.org/PDF/anbf40/anbf40-415.pdf |journal=Annales Botanici Fennici |volume=40 |pages=415–421 |issn=0003-3847}}</ref> It can be viewed as a [[similarity measure]] over sets.
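The Boolean form, and its contrast with the Jaccard index, can be illustrated as follows. The helper names (<code>confusion_counts</code>, <code>dsc_from_counts</code>, <code>jaccard_from_counts</code>) are hypothetical, and the value returned when all counts are zero is an assumed convention.

```python
from typing import Sequence, Tuple


def confusion_counts(truth: Sequence[bool], pred: Sequence[bool]) -> Tuple[int, int, int]:
    """Count true positives, false positives and false negatives."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    return tp, fp, fn


def dsc_from_counts(tp: int, fp: int, fn: int) -> float:
    """DSC = 2TP / (2TP + FP + FN); true positives are counted twice."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # assumed convention for 0/0


def jaccard_from_counts(tp: int, fp: int, fn: int) -> float:
    """Jaccard = TP / (TP + FP + FN); true positives are counted once."""
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # assumed convention for 0/0


if __name__ == "__main__":
    truth = [True, True, True, False, False]
    pred = [True, True, False, True, False]
    tp, fp, fn = confusion_counts(truth, pred)
    print(tp, fp, fn)
    print(dsc_from_counts(tp, fp, fn), jaccard_from_counts(tp, fp, fn))
```

For the same counts the Dice value is never smaller than the Jaccard value, because the extra true-positive term appears in both the numerator and the denominator.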

Similarly to the [[Jaccard index]], the set operations can be expressed in terms of vector operations over binary vectors '''a''' and '''b''':

:<math>s_v = \frac{2 | \bf{a} \cdot \bf{b} |}{| \bf{a} |^2 + | \bf{b} |^2} </math>

which gives the same outcome over binary vectors and also provides a more general similarity measure for real-valued vectors.
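A small sketch of the vector form, written in plain Python to stay self-contained. The function name <code>dice_vectors</code> is hypothetical, and raising an error for two zero vectors is a choice made for this example only.

```python
from typing import Sequence


def dice_vectors(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 2|a·b| / (|a|^2 + |b|^2) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_sq = sum(x * x for x in a) + sum(y * y for y in b)
    if norm_sq == 0:
        # Both vectors are zero; the ratio is undefined (assumed handling).
        raise ValueError("similarity is undefined for two zero vectors")
    return 2 * abs(dot) / norm_sq


if __name__ == "__main__":
    # Indicator vectors of the sets {0, 1, 2} and {1, 2, 3, 4}.
    a = [1, 1, 1, 0, 0]
    b = [0, 1, 1, 1, 1]
    print(dice_vectors(a, b))  # same value as the set formula, 4/7
```

For binary vectors the dot product counts the shared elements and each squared norm equals the set size, which is why the result matches the set definition.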

For sets ''X'' and ''Y'' of keywords used in [[information retrieval]], the coefficient may be defined as twice the shared information (intersection) over the sum of cardinalities, which is the set formula given above.<ref>{{cite book |last=van Rijsbergen |first=Cornelis Joost |year=1979 |title=Information Retrieval |url=https://www.dcs.gla.ac.uk/Keith/Preface.html |publisher=Butterworths |location=London |isbn=3-642-12274-4 }}</ref>

When taken as a [[string similarity]] measure, the coefficient may be calculated for two strings, ''x'' and ''y'' using [[bigram]]s as follows:<ref>{{cite conference |last=Kondrak |first=Grzegorz |author2=Marcu, Daniel |author3= Knight, Kevin |year=2003 |title=Cognates Can Improve Statistical Translation Models |book-title=Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics |pages=46–48 |url=https://www.isi.edu/~marcu/papers/cognates-hlt2003.pdf}}</ref>

:<math>s = \frac{2 n_t}{n_x + n_y}</math>

where ''n''<sub>''t''</sub> is the number of character bigrams found in both strings, ''n''<sub>''x''</sub> is the number of bigrams in string ''x'' and ''n''<sub>''y''</sub> is the number of bigrams in string ''y''. For example, to calculate the similarity between:

:<code>night</code>
:<code>nacht</code>

We would find the set of bigrams in each word:
:{<code>ni</code>,<code>ig</code>,<code>gh</code>,<code>ht</code>}
:{<code>na</code>,<code>ac</code>,<code>ch</code>,<code>ht</code>}

Each set has four elements, and the intersection of these two sets has only one element: <code>ht</code>.

Inserting these numbers into the formula gives ''s'' = (2 · 1) / (4 + 4) = 0.25.
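The worked example can be reproduced with a short Python sketch. It assumes, as in the example above, that each string is reduced to its ''set'' of character bigrams (repeated bigrams count once); a multiset variant is also possible. The names <code>bigrams</code> and <code>bigram_dice</code> are hypothetical.

```python
def bigrams(s: str) -> set:
    """Return the set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bigram_dice(x: str, y: str) -> float:
    """Return 2*n_t / (n_x + n_y) using bigram sets."""
    bx, by = bigrams(x), bigrams(y)
    total = len(bx) + len(by)
    if total == 0:
        # Strings shorter than two characters have no bigrams;
        # falling back to exact comparison is an assumption made here.
        return 1.0 if x == y else 0.0
    return 2 * len(bx & by) / total


if __name__ == "__main__":
    print(sorted(bigrams("night")))       # ['gh', 'ht', 'ig', 'ni']
    print(bigram_dice("night", "nacht"))  # 0.25
```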

=== Continuous Dice Coefficient ===
Source:<ref>{{cite bioRxiv |last1=Shamir|first1=Reuben R.|last2=Duchin|first2=Yuval|last3=Kim|first3=Jinyoung|last4=Sapiro|first4=Guillermo|last5=Harel|first5=Noam|date=2018-04-25|title=Continuous Dice Coefficient: a Method for Evaluating Probabilistic Segmentations|language=en|biorxiv=10.1101/306977}}</ref>

For a discrete (binary) ground truth <math>A</math> and continuous measures <math>B</math> in the interval [0,1], the following formula can be used:

:<math> cDC = \frac{2 |A \cap B|}{c |A| + |B|}</math>

where <math>|A \cap B| = \Sigma_i a_ib_i </math> and <math>|B| = \Sigma_i b_i </math>.

The coefficient <math>c</math> can be computed as follows:

:<math> c = \frac{\Sigma_i a_ib_i}{\Sigma_i a_i \operatorname{sign}{(b_i)}}</math>

If <math> \Sigma_i a_i \operatorname{sign}{(b_i)} = 0</math> which means no overlap between A and B, c is set to 1 arbitrarily.
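The computation can be sketched as follows, assuming the continuous Dice coefficient takes the form <math>cDC = 2|A \cap B| / (c|A| + |B|)</math> with <math>|A| = \Sigma_i a_i</math>, as in the cited preprint. The function name <code>continuous_dice</code> is hypothetical, and returning 1.0 when both inputs are entirely zero is an assumption of this example.

```python
from typing import Sequence


def continuous_dice(a: Sequence[int], b: Sequence[float]) -> float:
    """Continuous Dice for binary ground truth a and scores b in [0, 1]."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    intersection = sum(ai * bi for ai, bi in zip(a, b))  # |A ∩ B|
    size_a = sum(a)                                      # |A|
    size_b = sum(b)                                      # |B|
    # Σ a_i sign(b_i): since b_i >= 0, sign(b_i) is 1 where b_i > 0, else 0.
    overlap = sum(ai for ai, bi in zip(a, b) if bi > 0)
    # c is set to 1 when A and B do not overlap, as stated in the text.
    c = intersection / overlap if overlap > 0 else 1.0
    denom = c * size_a + size_b
    if denom == 0:
        return 1.0  # both inputs empty; assumed convention
    return 2 * intersection / denom


if __name__ == "__main__":
    # B covers exactly the support of A, but with probability 0.5.
    print(continuous_dice([1, 1, 0], [0.5, 0.5, 0.0]))
```

In this example <math>c = 0.5</math>, so the score is 1 even though the probabilities are below 1: the weighting by <math>c</math> avoids penalising a segmentation whose support matches the ground truth.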

==Difference from Jaccard==
This coefficient is not very different in form from the [[Jaccard index]]. In fact, both are equivalent in the sense that given a value for the Sørensen–Dice coefficient <math>S</math>, one can calculate the respective Jaccard index value <math>J</math> and vice versa, using the equations <math>J=S/(2-S)</math> and <math>S=2J/(1+J)</math>.
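The two conversion equations can be checked numerically with a short sketch; the function names are hypothetical. For the bigram sets of <code>night</code> and <code>nacht</code>, the intersection has one element and the union seven, so ''S'' = 0.25 should correspond to ''J'' = 1/7.

```python
def dice_to_jaccard(s: float) -> float:
    """J = S / (2 - S), valid for 0 <= S <= 1."""
    return s / (2 - s)


def jaccard_to_dice(j: float) -> float:
    """S = 2J / (1 + J), valid for 0 <= J <= 1."""
    return 2 * j / (1 + j)


if __name__ == "__main__":
    s = 0.25
    j = dice_to_jaccard(s)
    print(j)                   # approximately 1/7
    print(jaccard_to_dice(j))  # recovers approximately 0.25
```

Both mappings are increasing on [0, 1], so the two indices always rank pairs of sets in the same order.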

Since the Sørensen–Dice coefficient does not satisfy the [[triangle inequality]], it can be considered a [[Metric (mathematics)#Generalized metrics|semimetric]] version of the Jaccard index.<ref name ="gallagher"/>

The function ranges between zero and one, like Jaccard. Unlike Jaccard, the corresponding difference function

:<math>d(X, Y) = 1 - \frac{2 | X \cap Y |}{| X | + | Y |} </math>

is not a proper distance metric as it does not satisfy the triangle inequality.<ref name ="gallagher">Gallagher, E.D., 1999. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.1334&rep=rep1&type=pdf COMPAH Documentation], University of Massachusetts, Boston</ref> The simplest counterexample of this is given by the three sets <math>X=\{a\}</math>, <math>Y=\{b\}</math> and <math>Z = X \cup Y = \{a, b\}</math>. We have <math>d(X,Y)=1</math> and <math>d(X,Z)=d(Y,Z)=1/3</math>. To satisfy the triangle inequality, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side. However, <math>d(X, Z) + d(Y, Z) = 2/3 < 1 = d(X, Y)</math>.
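The counterexample can be verified directly. The sketch below is illustrative only: the function names are hypothetical and both functions assume non-empty sets. The Jaccard distance is included for contrast.

```python
def dice_distance(x: set, y: set) -> float:
    """d(X, Y) = 1 - 2|X ∩ Y| / (|X| + |Y|), for non-empty sets."""
    return 1 - 2 * len(x & y) / (len(x) + len(y))


def jaccard_distance(x: set, y: set) -> float:
    """1 - |X ∩ Y| / |X ∪ Y|, for non-empty sets."""
    return 1 - len(x & y) / len(x | y)


if __name__ == "__main__":
    X, Y = {"a"}, {"b"}
    Z = X | Y
    # Dice: 1/3 + 1/3 is less than 1, so the triangle inequality fails.
    print(dice_distance(X, Z) + dice_distance(Y, Z) < dice_distance(X, Y))
    # Jaccard: 1/2 + 1/2 is not less than 1, so it holds for this triple.
    print(jaccard_distance(X, Z) + jaccard_distance(Y, Z) >= jaccard_distance(X, Y))
```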

==Applications==
The Sørensen–Dice coefficient is useful for ecological community data (e.g. Looman & Campbell, 1960<ref>{{cite journal | last1 = Looman | first1 = J. | last2 = Campbell | first2 = J.B. | year = 1960 | title = Adaptation of Sorensen's K (1948) for estimating unit affinities in prairie vegetation | journal = Ecology | volume = 41 | issue = 3| pages = 409–416 | jstor=1933315 | doi = 10.2307/1933315 | bibcode = 1960Ecol...41..409L }}</ref>). Justification for its use is primarily empirical rather than theoretical (although it can be justified theoretically as the intersection of two [[fuzzy set]]s<ref>{{cite journal | last1 = Roberts | first1 = D.W. | s2cid = 12573576 | year = 1986 | title = Ordination on the basis of fuzzy set theory | doi = 10.1007/BF00039905 | journal = Vegetatio | volume = 66 | issue = 3| pages = 123–131 }}</ref>). As compared to [[Euclidean distance]], the Sørensen distance retains sensitivity in more heterogeneous data sets and gives less weight to outliers.<ref>McCune, Bruce & Grace, James (2002) Analysis of Ecological Communities. Mjm Software Design; {{ISBN|0-9721290-0-6}}.</ref> Recently the Dice score (and its variations, e.g. logDice taking a logarithm of it) has become popular in computer [[lexicography]] for measuring the lexical association score of two given words.<ref>[https://nlp.fi.muni.cz/raslan/2008/raslan08.pdf#page=14 Rychlý, P. (2008) A lexicographer-friendly association score. Proceedings of the Second Workshop on Recent Advances in Slavonic Natural Language Processing RASLAN 2008: 6–9]</ref> logDice is also used as part of the Mash Distance for genome and metagenome distance estimation.<ref>Ondov, Brian D., et al. "Mash: fast genome and metagenome distance estimation using MinHash." Genome biology 17.1 (2016): 1-14.</ref> Finally, Dice is used in [[image segmentation]], in particular for comparing algorithm output against reference masks in medical applications.<ref name ="zijdenbos"/>
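As an illustration of the logDice variant, the sketch below assumes the definition commonly attributed to Rychlý (2008): logDice = 14 + log<sub>2</sub> ''D'', where ''D'' = 2''f''<sub>''xy''</sub> / (''f''<sub>''x''</sub> + ''f''<sub>''y''</sub>) is the Dice score computed from the co-occurrence frequency of the two words and their individual frequencies. The function name and argument names are hypothetical.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice = 14 + log2(2*f_xy / (f_x + f_y)), under the assumed definition."""
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        # The logarithm is undefined when the words never co-occur.
        raise ValueError("frequencies must be positive")
    dice = 2 * f_xy / (f_x + f_y)
    return 14 + math.log2(dice)


if __name__ == "__main__":
    # Words that always co-occur reach the maximum score of 14.
    print(log_dice(5, 5, 5))
    # Halving the Dice score lowers logDice by exactly one point.
    print(log_dice(1, 2, 2))
```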

==Abundance version==
The expression is easily extended to [[Abundance (ecology)|abundance]] instead of presence/absence of species. This quantitative version is known by several names:
* Quantitative Sørensen–Dice index<ref name ="gallagher"/>
* Quantitative Sørensen index<ref name ="gallagher"/>
* Quantitative Dice index<ref name ="gallagher"/>
* [[Bray–Curtis dissimilarity|Bray–Curtis similarity]] (1 minus the ''Bray-Curtis dissimilarity'')<ref name ="gallagher"/>
* [[Jan Czekanowski|Czekanowski]]'s quantitative index<ref name ="gallagher"/>
* [[Hugo Steinhaus|Steinhaus]] index<ref name ="gallagher"/>
* [[E. C. Pielou|Pielou]]'s percentage similarity<ref name ="gallagher"/>
* Proportion of specific agreement<ref>{{cite journal|first1=Indu|last1=Ayappa|first2=Robert G|last2=Norman|year=2000|title=Non-Invasive Detection of Respiratory Effort-Related Arousals (RERAs) by a Nasal Cannula/Pressure Transducer System|journal=Sleep|volume=23|issue=6|pages=763–771|doi=10.1093/sleep/23.6.763|pmid=11007443|doi-access=free}}</ref> or positive agreement<ref>{{cite web|url=https://www.john-uebersax.com/stat/raw.htm#binobs|title=Raw Agreement Indices|author=John Uebersax}}</ref>
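A minimal sketch of the quantitative version, assuming the usual Bray–Curtis form in which the intersection is replaced by the sum of the smaller abundance of each species: 2 Σ min(''x''<sub>''i''</sub>, ''y''<sub>''i''</sub>) / (Σ ''x''<sub>''i''</sub> + Σ ''y''<sub>''i''</sub>). The function name <code>quantitative_dice</code> and the sample counts are hypothetical.

```python
from typing import Sequence


def quantitative_dice(x: Sequence[float], y: Sequence[float]) -> float:
    """Bray–Curtis similarity for two abundance vectors over the same species."""
    if len(x) != len(y):
        raise ValueError("both samples must list the same species")
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("similarity is undefined for two empty samples")
    shared = sum(min(xi, yi) for xi, yi in zip(x, y))
    return 2 * shared / total


if __name__ == "__main__":
    # Counts of three species at two sites (made-up numbers).
    site_1 = [6, 7, 4]
    site_2 = [10, 0, 6]
    print(quantitative_dice(site_1, site_2))  # 20/33
    # With 0/1 presence data the result equals the set-based coefficient.
    print(quantitative_dice([1, 1, 0], [0, 1, 1]))  # 0.5
```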

==See also==
* [[Correlation]]
* [[F1 score]]
* [[Hellinger distance]]
* [[Jaccard index]]
* [[Hamming distance]]
* [[Mantel test]]
* [[Morisita's overlap index]]
* [[Overlap coefficient]]
* [[Renkonen similarity index]]
* [[Tversky index]]
* [[Universal adaptive strategy theory (UAST)]]

==References==
{{reflist}}

==External links==
{{Wikibooks|Algorithm implementation|Strings/Dice's coefficient|Dice's coefficient}}

{{DEFAULTSORT:Sorensen-Dice coefficient}}
[[Category:Information retrieval evaluation]]
[[Category:String metrics]]
[[Category:Measure theory]]
[[Category:Similarity measures]]