Medoid

{{Short description|Objects maximally similar to other objects in a dataset}} '''Medoids''' are representative objects of a [[data set]] or a [[Cluster analysis|cluster]] within a data set whose sum of dissimilarities to all the objects in the cluster is minimal.<ref name="Struyf1997">{{cite journal |last1=Struyf |first1=Anja |last2=Hubert |first2=Mia |author2-link=Mia Hubert |author3-link=Peter Rousseeuw |last3=Rousseeuw |first3=Peter |journal=Journal of Statistical Software |year=1997 |title=Clustering in an Object-Oriented Environment |volume=1 |issue=4 |pages=1–30 |url=http://www.jstatsoft.org/v01/i04 }}</ref> Medoids are similar in concept to [[mean]]s or [[centroids]], but medoids are always restricted to be members of the data set. Medoids are most commonly used on data when a mean or centroid cannot be defined, such as graphs. They are also used in contexts where the centroid is not representative of the dataset like in images, 3-D trajectories and [[gene expression]]<ref name="newpam">{{cite journal |last1=van der Laan |first1=Mark J. |author-link=Mark van der Laan |last2=Pollard |first2=Katherine S. |last3=Bryan |first3=Jennifer |author-link3=Jenny Bryan |year=2003 |title=A New Partitioning Around Medoids Algorithm |journal=Journal of Statistical Computation and Simulation |publisher=Taylor & Francis Group |volume=73 |issue=8 |pages=575–584 |doi=10.1080/0094965031000136012 |s2cid=17437463 |url=https://biostats.bepress.com/ucbbiostat/paper105 }}</ref> (where while the data is sparse the medoid need not be). These are also of interest while wanting to find a representative using some distance other than [[Euclidean distance#Squared Euclidean distance|squared euclidean distance]] (for instance in movie-ratings).

For some data sets there may be more than one medoid, as with medians. A common application of the medoid is the [[k-medoids]] clustering algorithm, which is similar to the [[k-means]] algorithm but works when a mean or centroid is not definable. This algorithm basically works as follows. First, a set of medoids is chosen at random. Second, the distances to the other points are computed. Third, data are clustered according to the medoid they are most similar to. Fourth, the medoid set is optimized via an iterative process.

Note that a medoid is not equivalent to a [[median]], a [[geometric median]], or [[centroid]]. A [[median]] is only defined on 1-dimensional data, and it only minimizes dissimilarity to other points for metrics induced by a [[Norm (mathematics)#Absolute-value norm|norm]] (such as the [[Norm (mathematics)#Taxicab norm or Manhattan norm|Manhattan distance]] or [[Euclidean distance]]). A [[geometric median]] is defined in any dimension, but unlike a medoid, it is not necessarily a point from within the original dataset.

== Definition == Let <math display="inline">\mathcal X := \{ x_1, x_2, \dots, x_n \}</math> be a set of <math display="inline"> n </math> points in a space with a [[Metric (mathematics)|distance function]] d. Medoid is defined as

:<math> x_{\text{medoid}} = \arg\min_{y \in \mathcal X} \sum_{i=1}^n d(y, x_i).</math>

== Clustering with medoids ==

Medoids are a popular replacement for the cluster mean when the distance function is not (squared) Euclidean distance, or not even a [[metric space|metric]] (as the medoid does not require the [[triangle inequality]]). When partitioning the data set into clusters, the medoid of each cluster can be used as a representative of each cluster.

Clustering algorithms based on the idea of medoids include: * Partitioning Around Medoids (PAM), the standard [[k-medoids]] algorithm * Hierarchical Clustering Around Medoids (HACAM), which uses medoids in [[hierarchical clustering]]

== Algorithms to compute the medoid of a set == From the definition above, it is clear that the medoid of a set <math>\mathcal X</math> can be computed after computing all pairwise distances between points in the ensemble. This would take <math display="inline">O(n^2) </math> distance evaluations (with <math>n=|\mathcal X|</math>). In the worst case, one can not compute the medoid with fewer distance evaluations.<ref name=":1">Newling, James; & Fleuret, François (2016); "A sub-quadratic exact medoid algorithm", in ''Proceedings of the 20th International Conference on Artificial Intelligence and Statistics'', PMLR 54:185-193, 2017 [http://proceedings.mlr.press/v54/newling17a/newling17a.pdf Available online].</ref><ref name=":0">{{cite journal |last1=Bagaria |first1=Vivek |last2=Kamath |first2=Govinda M. |last3=Ntranos |first3=Vasilis |last4=Zhang |first4=Martin J. |last5=Tse |first5=David |title=Medoids in almost linear time via multi-armed bandits |date=2017 |arxiv=1711.00817 }}</ref> However, there are many approaches that allow us to compute medoids either exactly or approximately in sub-quadratic time under different statistical models.

If the points lie on the real line, computing the medoid reduces to computing the median which can be done in <math display="inline"> O(n) </math> by Quick-select algorithm of Hoare.<ref>[[Charles Antony Richard Hoare|Hoare, Charles Antony Richard]] (1961); "Algorithm 65: find", in ''Communications of the ACM'', ''4''(7), 321-322</ref> However, in higher dimensional real spaces, no linear-time algorithm is known. RAND<ref>[[David Eppstein|Eppstein, David]]; & Wang, Joseph (2006); "Fast approximation of centrality", in ''Graph Algorithms and Applications'', '''5''', pp. 39-45</ref> is an algorithm that estimates the average distance of each point to all the other points by sampling a random subset of other points. It takes a total of <math display="inline">O\left(\frac{n \log n}{\epsilon^2}\right)</math> distance computations to approximate the medoid within a factor of <math display="inline">(1+\epsilon\Delta)</math> [[with high probability]], where <math display="inline">\Delta</math> is the maximum distance between two points in the ensemble. Note that RAND is an [[approximation algorithm]], and moreover <math display="inline">\Delta</math> may ''not'' be known apriori.

RAND was leveraged by TOPRANK <ref>{{cite book |doi=10.1007/978-3-540-69311-6_21 |chapter=Ranking of Closeness Centrality for Large-Scale Social Networks |title=Frontiers in Algorithmics |series=Lecture Notes in Computer Science |year=2008 |last1=Okamoto |first1=Kazuya |last2=Chen |first2=Wei |last3=Li |first3=Xiang-Yang |volume=5059 |pages=186–195 |isbn=978-3-540-69310-9 }}</ref> which uses the estimates obtained by RAND to focus on a small subset of candidate points, evaluates the average distance of these points ''exactly'', and picks the minimum of those. TOPRANK needs <math display="inline"> O(n^{\frac{5}{3}} \log^{\frac{4}{3}} n)</math> distance computations to find the ''exact'' medoid with high probability under a distributional assumption on the average distances.

trimed <ref name=":1" /> presents an algorithm to find the medoid with <math display="inline">O(n^{\frac{3}{2}} 2^{\Theta(d)})</math> distance evaluations under a distributional assumption on the points. The algorithm uses the triangle inequality to cut down the search space.

Meddit<ref name=":0" /> leverages a connection of the medoid computation with [[multi-armed bandit]]s and uses an upper-Confidence-bound type of algorithm to get an algorithm which takes <math display="inline">O(n \log n)</math> distance evaluations under statistical assumptions on the points.

Correlated Sequential Halving<ref name=":corrSH">Baharav, Tavor Z.; & Tse, David N. (2019); "Ultra Fast Medoid Identification via Correlated Sequential Halving", in ''Advances in Neural Information Processing Systems'', [http://papers.nips.cc/paper/8623-ultra-fast-medoid-identification-via-correlated-sequential-halving.pdf available online]</ref> also leverages multi-armed bandit techniques, improving upon Meddit. By exploiting the correlation structure in the problem, the algorithm is able to provably yield drastic improvement (usually around 1-2 orders of magnitude) in both number of distance computations needed and wall clock time.

== Implementations == An implementation of RAND, TOPRANK, and trimed can be found [https://github.com/idiap/trimed here]. An implementation of Meddit can be found [https://github.com/bagavi/Meddit here] and [https://github.com/EdwardRaff/JSAT/blob/master/JSAT/src/jsat/clustering/MEDDIT.java here]. An implementation of Correlated Sequential Halving can be found [https://github.com/TavorB/Correlated-Sequential-Halving here].

== Medoids in text and natural language processing (NLP) == Medoids can be applied to various text and NLP tasks to improve the efficiency and accuracy of analyses.<ref>{{Cite web |last1=Dai |first1=Qiongjie |last2=Liu |first2=Jicheng |date=July 2019 |title=The Exploration and Application of K-medoids in Text Clustering |url=http://www.isaacpub.org/images/PaperPDF/JAAM_100130_2019061217522981776.pdf |access-date=25 April 2023}}</ref> By clustering text data based on similarity, medoids can help identify representative examples within the dataset, leading to better understanding and interpretation of the data.

=== Text clustering === Text clustering is the process of grouping similar text or documents together based on their content. Medoid-based clustering algorithms can be employed to partition large amounts of text into clusters, with each cluster represented by a medoid document. This technique helps in organizing, summarizing, and retrieving information from large collections of documents, such as in search engines, social media analytics and recommendation systems.<ref>{{Cite web |title=What is Natural Language Processing? |url=https://www.oracle.com/artificial-intelligence/what-is-natural-language-processing/}}</ref>

=== Text summarization === Text summarization aims to produce a concise and coherent summary of a larger text by extracting the most important and relevant information. Medoid-based clustering can be used to identify the most representative sentences in a document or a group of documents, which can then be combined to create a summary. This approach is especially useful for extractive summarization tasks, where the goal is to generate a summary by selecting the most relevant sentences from the original text.<ref>{{Cite web |last1=Hu |first1=Po |last2=He |first2=Tingting |last3=Ji |first3=Donghong |title=Chinese Text Summarization Based on Thematic Area Detection |url=https://aclanthology.org/W04-1018.pdf}}</ref>

=== Sentiment analysis === [[Sentiment analysis]] involves determining the sentiment or emotion expressed in a piece of text, such as positive, negative, or neutral. Medoid-based clustering can be applied to group text data based on similar sentiment patterns. By analyzing the medoid of each cluster, researchers can gain insights into the predominant sentiment of the cluster, helping in tasks such as opinion mining, customer feedback analysis, and social media monitoring.<ref>{{Cite journal |last1=Pessutto |first1=Lucas |last2=Vargas |first2=Danny |last3=Moreira |first3=Viviane |date=24 February 2020 |title=Multilingual aspect clustering for sentiment analysis |journal=Knowledge-Based Systems |volume=192 |article-number=105339 |doi=10.1016/j.knosys.2019.105339 |s2cid=211830280 |url=https://www.sciencedirect.com/science/article/abs/pii/S0950705119306070|url-access=subscription }}</ref>

=== Topic modeling === Topic modeling is a technique used to discover abstract topics that occur in a collection of documents. Medoid-based clustering can be applied to group documents with similar themes or topics. By analyzing the medoids of these clusters, researchers can gain an understanding of the underlying topics in the text corpus, facilitating tasks such as document categorization, trend analysis, and content recommendation.<ref>{{Cite journal |last1=Preud'homme |first1=Gregoire |last2=Duarte |first2=Kevin |date=18 February 2021 |title=Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark |journal=Scientific Reports |volume=11 |issue=1 |page=4202 |doi=10.1038/s41598-021-83340-8 |pmid=33603019 |pmc=7892576 |bibcode=2021NatSR..11.4202P }}</ref>

=== Techniques for measuring text similarity in medoid-based clustering === When applying medoid-based clustering to text data, it is essential to choose an appropriate [[similarity measure]] to compare documents effectively. Each technique has its advantages and limitations, and the choice of the similarity measure should be based on the specific requirements and characteristics of the text data being analyzed.<ref name=":02">{{Cite journal |last1=Amer |first1=Ali |last2=Abdalla |first2=Hassan |date=14 September 2020 |title=A set theory based similarity measure for text clustering and classification |journal=Journal of Big Data |volume=7 |article-number=74 |doi=10.1186/s40537-020-00344-3 |s2cid=256403960 |doi-access=free }}</ref> The following are common techniques for measuring text similarity in medoid-based clustering: [[File:CosineSimilarity.png|thumb|339x339px|This example shows how cosine similarity will compare the angle of lines between objects to determine how similar the items are. Note that most text embeddings will be at least a few hundred dimensions instead of just two.]]

==== Cosine similarity ==== [[Cosine similarity]] is a widely used measure to compare the similarity between two pieces of text. It calculates the cosine of the angle between two document vectors in a high-dimensional space.<ref name=":02" /> Cosine similarity ranges between -1 and 1, where a value closer to 1 indicates higher similarity, and a value closer to -1 indicates lower similarity. By visualizing two lines originating from the origin and extending to the respective points of interest, and then measuring the angle between these lines, one can determine the similarity between the associated points. Cosine similarity is less affected by document length, so it may be better at producing medoids that are representative of the content of a cluster instead of the length.

==== Jaccard similarity ==== [[File:Jaccard_Similarity_Formula.png|thumb|283x283px|This Jaccard similarity formula can easily be applied to text.]] Jaccard similarity, also known as the Jaccard coefficient, measures the similarity between two sets by comparing the ratio of their intersection to their union. In the context of text data, each document is represented as a set of words, and the Jaccard similarity is computed based on the common words between the two sets. The Jaccard similarity ranges between 0 and 1, where a higher value indicates a higher degree of similarity between the documents.{{citation needed|date=June 2024}}

==== Euclidean distance ==== [[File:EUDistance.png|thumb|346x346px|This example shows how Euclidean distance will calculate the distance between objects to determine how similar the items are. Note that most text embeddings will be at least a few hundred dimensions instead of just two.]] Euclidean distance is a standard distance metric used to measure the dissimilarity between two points in a multi-dimensional space. In the context of text data, documents are often represented as high-dimensional vectors, such as TF vectors, and the Euclidean distance can be used to measure the dissimilarity between them. A lower Euclidean distance indicates a higher degree of similarity between the documents.<ref name=":02" /> Using Euclidean distance may result in medoids that are more representative of the length of a document.

==== Edit distance ==== Edit distance, also known as the [[Levenshtein distance]], measures the similarity between two strings by calculating the minimum number of operations (insertions, deletions, or substitutions) required to transform one string into the other. In the context of text data, edit distance can be used to compare the similarity between short text documents or individual words. A lower edit distance indicates a higher degree of similarity between the strings.<ref>{{Cite web |last=Wu |first=Gang |date=17 December 2022 |title=String Similarity Metrics – Edit Distance |url=https://www.baeldung.com/cs/string-similarity-edit-distance}}</ref>

=== Medoid applications in large language models ===

==== Medoids for analyzing large language model embeddings ==== [[File:CarMedoidExample.png|thumb|309x309px|This is an example of how text can be grouped with similar items when embedded based on location. This represents grouping by Euclidean distance. If these were grouped by a different similarity measure like cosine similarity, then the medoids may be different.]] Medoids can be employed to analyze and understand the [[vector space]] representations generated by large language models (LLMs), such as BERT, GPT, or RoBERTa. By applying medoid-based clustering on the embeddings produced by these models for words, phrases, or sentences, researchers can explore the semantic relationships captured by LLMs. This approach can help identify clusters of semantically similar entities, providing insights into the structure and organization of the high-dimensional embedding spaces generated by these models.<ref>{{Cite web |last=Mokhtarani |first=Shabnam |date=26 August 2021 |title=Embeddings in Machine Learning: Everything You Need to Know |url=https://www.featureform.com/post/the-definitive-guide-to-embeddings}}</ref>

==== Medoids for data selection and active learning ==== Active learning involves choosing data points from a training pool that will maximize model performance. Medoids can play a crucial role in data selection and active learning with LLMs. Medoid-based clustering can be used to identify representative and diverse samples from a large text dataset, which can then be employed to fine-tune LLMs more efficiently or to create better training sets. By selecting medoids as training examples, researchers can may have a more balanced and informative training set, potentially improving the generalization and robustness of the fine-tuned models.<ref>{{Cite arXiv |last1=Wu |first1=Yuexin |last2=Xu |first2=Yichong |last3=Singh |first3=Aarti |last4=Yang |first4=Yiming |last5=Dubrawski |first5=Artur |title=Active Learning for Graph Neural Networks via Node Feature Propagation |year=2019 |class=cs.LG |eprint=1910.07567 }}</ref>

==== Medoids for model interpretability and safety ==== Applying medoids in the context of LLMs can contribute to improving model interpretability. By clustering the embeddings generated by LLMs and selecting medoids as representatives of each cluster, researchers can provide a more interpretable summary of the model's behavior.<ref>{{Cite arXiv |last1=Tiwari |first1=Mo |last2=Mayclin |first2=James |last3=Piech |first3=Chris |last4=Zhang |first4=Martin |last5=Thrun |first5=Sebastian |last6=Shomorony |first6=Ilan |title=BanditPAM: Almost Linear Time k-Medoids Clustering via Multi-Armed Bandits |year=2020 |class=cs.LG |eprint=2006.06856 }}</ref> This approach can help in understanding the model's decision-making process, identifying potential biases, and uncovering the underlying structure of the LLM-generated embeddings. As the discussion around interpretability and safety of LLMs continues to ramp up, using medoids may serve as a valuable tool for achieving this goal.

== Real-world applications == As a versatile clustering method, medoids can be applied to a variety of real-world issues in numerous fields, stretching from biology and medicine to advertising and marketing, and social networks. Its potential to handle complex data sets with a high degree of perplexity makes it a powerful device in modern-day data analytics.

=== Gene expression analysis === In gene expression analysis,<ref>{{Cite journal |last1=Zhang |first1=Yan |last2=Shi |first2=Weiyu |last3=Sun |first3=Yeqing |date=2023-02-17 |title=A functional gene module identification algorithm in gene expression data based on genetic algorithm and gene ontology |journal=BMC Genomics |volume=24 |issue=1 |pages=76 |doi=10.1186/s12864-023-09157-z |issn=1471-2164 |pmc=9936134 |pmid=36797662 |doi-access=free }}</ref> researchers utilize advanced technologies consisting of microarrays and RNA sequencing to measure the expression levels of numerous genes in biological samples, which results in multi-dimensional data that can be complex and difficult to analyze. Medoids are a potential solution by clustering genes primarily based on their expression profiles, enabling researchers to discover co-expressed groups of genes that could provide valuable insights into the molecular mechanisms of biological processes and diseases.

=== Social network analysis === For social network evaluation,<ref>{{Cite journal |last1=Saha |first1=Sanjit Kumar |last2=Schmitt |first2=Ingo |date=2020-01-01 |title=Non-TI Clustering in the Context of Social Networks |journal=Procedia Computer Science |series=The 11th International Conference on Ambient Systems, Networks and Technologies (ANT) / The 3rd International Conference on Emerging Data and Industry 4.0 (EDI40) / Affiliated Workshops |language=en |volume=170 |pages=1186–1191 |doi=10.1016/j.procs.2020.03.031 |s2cid=218812939 |issn=1877-0509|doi-access=free }}</ref> medoids can be an exceptional tool for recognizing central or influential nodes in a social network. Researchers can cluster nodes based on their connectivity styles and identify nodes which are most likely to have a substantial impact on the network’s function and structure. One popular approach to making use of medoids in social network analysis is to compute a distance or similarity metric between pairs of nodes based on their properties.

=== Market segmentation === Medoids also can be employed for market segmentation,<ref>{{cite journal |last1=Wu |first1=Zengyuan |last2=Jin |first2=Lingmin |last3=Zhao |first3=Jiali |last4=Jing |first4=Lizheng |last5=Chen |first5=Liang |title=Research on Segmenting E-Commerce Customer through an Improved K-Medoids Clustering Algorithm |journal=Computational Intelligence and Neuroscience |date=18 June 2022 |volume=2022 |pages=1–10 |doi=10.1155/2022/9930613 |pmid=35761867 |pmc=9233613 |doi-access=free }}</ref> which is an analytical procedure that includes grouping clients primarily based on their purchasing behavior, demographic traits, and various other attributes. Clustering clients into segments using medoids allows companies to tailor their advertising and marketing techniques in a manner that aligns with the needs of each group of customers. The medoids serve as representative factors within every cluster, encapsulating the primary characteristics of the customers in that group.

The Within-Groups Sum of Squared Error (WGSS) is a formula employed in market segmentation that aims to quantify the concentration of squared errors within clusters. It seeks to capture the distribution of errors within groups by squaring them and aggregating the results.The WGSS metric quantifies the cohesiveness of samples within clusters, indicating tighter clusters with lower WGSS values and a correspondingly superior clustering effect. The formula for WGSS is:

<math display="block" id="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9233613/">\text{WGSS} = \frac{1}{2} \left[(m_1-1) \overline{d_1^2} + \cdots + (m_k-1) \overline{d_k^2}\right]</math>

Where <math>\overline{d_1^2}</math> is the average distance of samples within the ''k''-th cluster and <math>m_k</math> is the number of samples in the ''k''-th cluster.

=== Anomaly detection === Medoids can also be instrumental in identifying anomalies, and one efficient method is through cluster-based [[anomaly detection]]. They can be used to discover clusters of data points that deviate significantly from the rest of the data. By clustering the data into groups using medoids and comparing the properties of every cluster to the data, researchers can clearly detect clusters that are anomalous.{{citation needed|date=June 2024}}

== References == <references />

== External links == * [https://www.youtube.com/watch?v=4b5d3muPQmA StatQuest k-means video] used for visual reference in [[#Visualization_of_the_medoid-based_clustering_process]] section

[[Category:Cluster analysis]] [[Category:Means]]