# String metric

> Mediated Wiki article. Canonical URL: https://mediated.wiki/source/String_metric
> Markdown URL: https://mediated.wiki/source/String_metric.md
> Source: https://en.wikipedia.org/wiki/String_metric
> Source revision: 1239983587
> License: Creative Commons Attribution-ShareAlike 4.0 International (https://creativecommons.org/licenses/by-sa/4.0/)

{{Short description|Metric that measures the distance between two strings of text}}
{{redirect|String distance|the distance between strings and the fingerboard in musical instruments|Action (music)}}

In [mathematics](/source/mathematics) and [computer science](/source/computer_science), a '''string metric''' (also known as a '''string similarity metric''' or '''string distance function''') is a [metric](/source/metric_(mathematics)) that measures [distance](/source/distance) ("inverse similarity") between two [text strings](/source/string_(computer_science)) for [approximate string matching](/source/approximate_string_matching) or comparison and in [fuzzy string searching](/source/approximate_string_matching). A requirement for a string ''metric'' (e.g. in contrast to [string matching](/source/string_matching)) is fulfillment of the [triangle inequality](/source/triangle_inequality). For example, the strings "Sam" and "Samuel" can be considered to be close.<ref>{{cite book
 | last = Lu
 | first = Jiaheng | title = Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data | chapter = String similarity measures and joins with synonyms |display-authors=etal 
 | year=2013
 | pages=373–384
 | chapter-url=https://dl.acm.org/citation.cfm?id=2465313| doi = 10.1145/2463676.2465313
 | isbn = 9781450320375
 | s2cid = 2091942 }}</ref> A string metric provides a number indicating an algorithm-specific indication of distance.

The most widely known string metric is a rudimentary one called the [Levenshtein distance](/source/Levenshtein_distance) (also known as edit distance).<ref>{{cite journal
 | last = Navarro
 | first = Gonzalo
 | title = A guided tour to approximate string matching
 | journal = ACM Computing Surveys |volume=33 |issue=1
 | year=2001
 | pages=31–88
 | doi=10.1145/375360.375365| hdl=10533/172862
 | s2cid = 207551224
 | hdl-access=free
 }}</ref>  It operates between two input strings, returning a number equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as [Levenshtein distance](/source/Levenshtein_distance) have expanded to include phonetic, [token](/source/token_(parser)), grammatical and character-based methods of statistical comparisons.

String metrics are used heavily in [information integration](/source/information_integration) and are currently used in areas including [fraud detection](/source/Data_analysis_techniques_for_fraud_detection), [fingerprint analysis](/source/fingerprint_analysis), [plagiarism detection](/source/plagiarism_detection), [ontology merging](/source/ontology_merging), [DNA analysis](/source/DNA_analysis), RNA analysis, [image analysis](/source/image_analysis), evidence-based [machine learning](/source/machine_learning), [database](/source/database) [data deduplication](/source/data_deduplication), [data mining](/source/data_mining), [incremental search](/source/incremental_search), [data integration](/source/data_integration), [malware detection](/source/Malware),<ref>{{cite journal |author1=[Shlomi Dolev](/source/Shlomi_Dolev) | last2=Mohammad |first2=Ghanayim |last3=Alexander |first3=Binun |last4=Sergey |first4=Frenkel |last5=Yeali |first5=S. Sun |title=Relationship of Jaccard and edit distance in malware clustering and online identification |journal=16th IEEE International Symposium on Network Computing and Applications |date=2017 |pages=369–373}}</ref> and semantic [knowledge integration](/source/knowledge_integration).

==List of string metrics==

<!-- This can be a separate article, someday. -->
* [Levenshtein distance](/source/Levenshtein_distance), or its generalization [edit distance](/source/edit_distance)
* [Damerau–Levenshtein distance](/source/Damerau%E2%80%93Levenshtein_distance)
* [Sørensen–Dice coefficient](/source/S%C3%B8rensen%E2%80%93Dice_coefficient)
* [Block distance](/source/Block_distance) or [L1 distance](/source/L1_distance) or [City block distance](/source/City_block_distance)
* [Hamming distance](/source/Hamming_distance)
* [Simple matching coefficient](/source/Simple_matching_coefficient) (SMC)
* [Jaccard similarity](/source/Jaccard_similarity) or [Jaccard coefficient](/source/Jaccard_coefficient) or [Tanimoto coefficient](/source/Tanimoto_coefficient)
* [Tversky index](/source/Tversky_index)
* [Overlap coefficient](/source/Overlap_coefficient)
* [Variational distance](/source/Variational_distance)<ref name="sam">[http://www.coli.uni-saarland.de/courses/LT1/2011/slides/stringmetrics.pdf Sam's String Metrics - Computational Linguistics and Phonetics]</ref>
* [Hellinger distance](/source/Hellinger_distance) or [Bhattacharyya distance](/source/Bhattacharyya_distance)
* [Information radius](/source/Information_radius) ([Jensen–Shannon divergence](/source/Jensen%E2%80%93Shannon_divergence))
* [Skew divergence](/source/Skew_divergence)<ref name="sam"/>
* [Confusion probability](/source/Confusion_probability)<ref name="sam"/>
* [Tau metric](/source/Kendall_tau_distance), an approximation of the [Kullback–Leibler divergence](/source/Kullback%E2%80%93Leibler_divergence)
* [Fellegi and Sunters metric](/source/Fellegi_and_Sunters_metric) (SFS)<ref name="sam"/>
* [Maximal matches](/source/Maximal_matches)<ref name="sam"/>
* [Grammar-based distance](/source/Grammar-based_distance)<ref>Russell, David J., et al. [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-601 "A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences."] BMC bioinformatics 11.1 (2010): 1-14.</ref>
* [TFIDF](/source/Tf%E2%80%93idf) distance metric<ref>{{Cite journal|title = A Comparison of String Distance Metrics for Name-Matching Tasks.|url = https://dl.acm.org/doi/10.5555/3104278.3104293|date = 2003-08-01|pages = 73–78|first1 = William|last1 = Cohen|first2 = Pradeep|last2 = Ravikumar|first3 = Stephen|last3 = Fienberg}}</ref>

There also exist functions which measure a dissimilarity between strings, but do not necessarily fulfill the triangle inequality, and as such are not ''metrics'' in the mathematical sense. An example of such function is the [Jaro–Winkler distance](/source/Jaro%E2%80%93Winkler_distance).

==Selected string measures examples==

{| class="wikitable"
|-
! Name
! Description
! Example
|-
|[Hamming distance](/source/Hamming_distance)
| Only for strings of the same length. Number of changed characters.
| "'''{{mono|1=ka<span style="color:#0082ff">rol</span>in}}'''" and "'''{{mono|1=ka<span style="color:red;">thr</span>in}}'''" is 3.
|-
|[Levenshtein distance](/source/Levenshtein_distance) and [Damerau–Levenshtein distance](/source/Damerau%E2%80%93Levenshtein_distance)
| Generalization of Hamming distance that allows for different length strings, and (with Damerau) for transpositions
| {{mono|'''k'''itt'''e'''n}} and {{mono|'''s'''itt'''i'''n'''g'''}} have a distance of 3. 
# {{mono|'''k'''itten}} → {{mono|'''s'''itten}} (substitution of "s" for "k")
# {{mono|sitt'''e'''n}} → {{mono|sitt'''i'''n}} (substitution of "i" for "e")
# {{mono|sittin}} → {{mono|sittin'''g'''}} (insertion of "g" at the end).
|-
|[Jaro–Winkler distance](/source/Jaro%E2%80%93Winkler_distance)
|
| JaroWinklerDist("MARTHA","MARHTA") =
:<math>d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}\right) = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6-\frac{2}{2}}{6}\right) = 0.944</math>
* <math>m</math> is the number of ''matching characters'';
* <math>t</math> is half the number of ''transpositions''(<code>"MARTHA"[3]!=H, "MARHTA"[3]!=T</code>).
<!--|-
|[Simple matching coefficient](/source/Simple_matching_coefficient) (SMC)
|-->
<!--|-
|-
|[Jaccard similarity](/source/Jaccard_similarity) or [Jaccard coefficient](/source/Jaccard_coefficient) or [Tanimoto coefficient](/source/Tanimoto_coefficient)
|-->
|-
|[Most frequent k characters](/source/Most_frequent_k_characters)
|
|MostFreqKeySimilarity('<span style="color:red;">r</span><span style="color:#0082ff">e</span>s<span style="color:#0082ff">e</span>a<span style="color:red;">r</span>ch', 's<span style="color:#0082ff">ee</span>king', 2) = 2
<!--|-
|[Tversky index](/source/Tversky_index)
|-->
<!--|-
|[Overlap coefficient](/source/Overlap_coefficient)
|-->
<!--|-
|[Variational distance](/source/Variational_distance)
|-->
<!--|-
|[Hellinger distance](/source/Hellinger_distance) or [Bhattacharyya distance](/source/Bhattacharyya_distance)
|-->
<!--|-
|[Information radius](/source/Information_radius) ([Jensen–Shannon divergence](/source/Jensen%E2%80%93Shannon_divergence))
|-->
<!--|-
|[Skew divergence](/source/Skew_divergence)
|-->
<!--|-
|[Confusion probability](/source/Confusion_probability)
|-->
<!--|-
|[Tau metric](/source/Tau_metric), an approximation of the [Kullback–Leibler divergence](/source/Kullback%E2%80%93Leibler_divergence)
|-->
<!--|-
|[Fellegi and Sunters metric](/source/Fellegi_and_Sunters_metric) (SFS)
|-->
<!--|-
|[Maximal matches](/source/Maximal_matches)
|-->
|}

==References==

{{Reflist}}

==External links==
*[https://web.archive.org/web/20070304092115/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html#qgram String Similarity Metrics for Information Integration]  A fairly complete overview {{webarchive |url=https://web.archive.org/web/*/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html#ukkonen |date=* }}
* [http://www.speech.cs.cmu.edu/ Carnegie Mellon University open source library]
* [http://rockymadden.com/stringmetric/ StringMetric project] a [Scala](/source/Scala_programming_language) library of string metrics and phonetic algorithms
* [https://github.com/NaturalNode/natural Natural project] a [JavaScript](/source/JavaScript) natural language processing library which includes implementations of popular string metrics

{{strings}}

{{DEFAULTSORT:String Metric}}
Category:String metrics
Category:Metrics

---
Adapted from the Wikipedia article [String metric](https://en.wikipedia.org/wiki/String_metric) by Wikipedia contributors ([contributor history](https://en.wikipedia.org/wiki/String_metric?action=history)). Available under [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/). Changes may have been made.
