One-class classification

{{short description|Approach to training in machine learning}} In machine learning, '''one-class classification''' ('''OCC'''), also known as '''unary classification''' or '''class-modelling''', is an approach to the training of binary classifiers in which only examples of one of the two classes are used.<ref>{{cite journal | vauthors = Oliveri P | title = Class-modelling in food analytical chemistry: Development, sampling, optimisation and validation issues - A tutorial | journal = Analytica Chimica Acta | volume = 982 | pages = 9–19 | date = August 2017 | pmid = 28734370 | doi = 10.1016/j.aca.2017.05.013 | bibcode = 2017AcAC..982....9O | hdl = 11567/881059 | hdl-access = free }}</ref>

Examples include the monitoring of helicopter gearboxes,<ref>{{cite CiteSeerX | vauthors = Japkowicz N, Myers C, Gluck M | date = 1995 | title = A Novelty Detection Approach to Classification | pages = 518–523 |citeseerx = 10.1.1.40.3663}}</ref><ref>{{cite thesis | url= https://rucore.libraries.rutgers.edu/rutgers-lib/59019/ | vauthors = Japkowicz N | date = 1999 | title = Concept-Learning in the Absence of Counter-Examples:An Autoassociation-Based Approach to Classification | publisher = Rutgers University}}</ref><ref>{{cite journal | url= https://link.springer.com/content/pdf/10.1023/A:1007660820062.pdf | vauthors = Japkowicz N | date = 2001 | title = Supervised Versus Unsupervised Binary-Learning by Feedforward Neural Networks | journal = Machine Learning | volume = 42 | pages = 97–122 | doi = 10.1023/A:1007660820062 | s2cid = 7298189 | doi-access = free }}</ref> motor failure prediction,<ref>{{cite news | url= http://papers.nips.cc/paper/1077-a-neural-network-autoassociator-for-induction-motor-failure-prediction.pdf | vauthors = Petsche T, Marcantonio A, Darken C, Hanson S, Kuhn G, Santoso I | date = 1996 | title = A Neural Network Autoassociator for Induction Motor Failure Prediction | publisher = NIPS}}</ref> or assessing the operational status of a nuclear plant as 'normal':<ref name=":1">{{cite thesis |title=One-class classification: Concept-learning in the absence of counter-examples. |date=2001 |degree=Ph.D. |publisher=University of Delft |url=https://rduin.nl/papers/thesis_01_tax.pdf |vauthors=Tax D |location=The Netherlands}}</ref> In such scenarios, there are few, if any, examples of the catastrophic system states – rare outliers – that comprise the second class. Alternatively, the class that is being focused on may cover a small, coherent subset of the data and the training may rely on an information bottleneck approach.<ref>{{cite book|last=Crammer|first=Koby|title=Twenty-first international conference on Machine learning - ICML '04 |chapter=A needle in a haystack |date=2004|chapter-url=https://dl.acm.org/citation.cfm?id=1015399|page=26|doi=10.1145/1015330.1015399|isbn=978-1-58113-838-2 |s2cid=8736254}}</ref>

In practice, counter-examples from the second class may be used in later rounds of training to further refine the algorithm.

== Overview == The term one-class classification (OCC) was coined by Moya & Hush (1996)<ref>{{cite journal | last1 = Moya | first1 = M. | last2 = Hush | first2 = D. | year = 1996 | title = Network constraints and multi- objective optimization for one-class classification | journal = Neural Networks | volume = 9 | issue = 3| pages = 463–474 | doi = 10.1016/0893-6080(95)00120-4 }}</ref> and many applications can be found in scientific literature, for example outlier detection, anomaly detection, novelty detection. A feature of OCC is that it uses only sample points from the assigned class, so that a representative sampling is not strictly required for non-target classes.<ref>{{cite journal| vauthors = Rodionova OY, Oliveri P, Pomerantsev AL |date=2016-12-15|title=Rigorous and compliant approaches to one-class classification |journal=Chemometrics and Intelligent Laboratory Systems|volume=159|pages=89–96|doi=10.1016/j.chemolab.2016.10.002|hdl=11567/864539 |hdl-access=free}}</ref>

== Introduction == thumb|right|The hypersphere containing the target data having center c and radius r. Objects on the boundary are support vectors, and two objects lie outside the boundary having slack greater than 0. SVM based one-class classification (OCC) relies on identifying the smallest hypersphere (with radius r, and center c) consisting of all the data points.<ref>{{cite journal |last1=Zineb |first1=Noumir |last2=Honeine |first2=Paul |last3=Richard |first3=Cedue |title=On simple one-class classification methods. |journal=IEEE International Symposium on Information Theory Proceedings |date=2012 |publisher=IEEE, 2012}}</ref> This method is called Support Vector Data Description (SVDD). Formally, the problem can be defined in the following constrained optimization form,

<math display="block"> \min_{r,c} r^2 \text{ subject to, } ||\Phi(x_i) - c||^2 \le r^2 \;\; \forall i = 1, 2, ..., n</math>

However, the above formulation is highly restrictive, and is sensitive to the presence of outliers. Therefore, a flexible formulation, that allow for the presence of outliers is formulated as shown below,

<math>\text{subject to, } ||\Phi(x_i) - c||^2 \le r^2 + \zeta_i \;\; \forall i = 1, 2, ..., n</math>

From the Karush–Kuhn–Tucker conditions for optimality, we get

<math>c = \sum_{i=1}^{n}\alpha_i\Phi(x_i),</math>

where the <math>\alpha_i</math>'s are the solution to the following optimization problem:

<math>\max_\alpha \sum_{i=1}^{n}\alpha_i\kappa(x_i, x_i) - \sum_{i, j = 1}^{n}\alpha_i\alpha_j\kappa(x_i, x_j) </math>

subject to, <math>\sum_{i=1}^{n}\alpha_i = 1 \text{ and } 0 \le \alpha_i \le \frac{1}{\nu n} \text{for all } i = 1,2,...,n.</math>

The introduction of kernel function provide additional flexibility to the One-class SVM (OSVM) algorithm.<ref>{{Cite book|last1=Khan|first1=Shehroz S.|last2=Madden|first2=Michael G.|date=2010|editor-last=Coyle|editor-first=Lorcan|editor2-last=Freyne|editor2-first=Jill|chapter=A Survey of Recent Trends in One Class Classification|title=Artificial Intelligence and Cognitive Science|volume=6206|series=Lecture Notes in Computer Science|language=en|publisher=Springer Berlin Heidelberg|pages=188–197|doi=10.1007/978-3-642-17080-5_21|isbn=978-3-642-17080-5|hdl=10379/1472|s2cid=36784649 |hdl-access=free}}</ref>

=== PU (Positive Unlabeled) learning === A similar problem is '''PU learning''', in which a binary classifier is constructed by semi-supervised learning from only ''positive'' and ''unlabeled'' sample points.<ref>{{cite book|title=Web Data Mining|last=Liu|first=Bing|publisher=Springer|year=2007|pages=165–178}}</ref>

In PU learning, two sets of examples are assumed to be available for training: the positive set <math>P</math> and a ''mixed set'' <math>U</math>, which is assumed to contain both positive and negative samples, but without these being labeled as such. This contrasts with other forms of semisupervised learning, where it is assumed that a labeled set containing examples of both classes is available in addition to unlabeled samples. A variety of techniques exist to adapt supervised classifiers to the PU learning setting, including variants of the EM algorithm. PU learning has been successfully applied to text,<ref>{{cite conference |author=Bing Liu |author2=Wee Sun Lee |author3=Philip S. Yu |author3-link=Philip S. Yu |author4=Xiao-Li Li |name-list-style=amp |year=2002 |title=Partially supervised classification of text documents |conference=ICML |pages=8–12}}</ref><ref>{{cite conference |author=Hwanjo Yu |author2=Jiawei Han |author3=Kevin Chen-Chuan Chang |title=PEBL: positive example based learning for web page classification using SVM |conference=ACM SIGKDD |year=2002}}</ref><ref>{{cite conference |author=Xiao-Li Li |author2=Bing Liu |name-list-style=amp |title=Learning to classify text using positive and unlabeled data |conference=IJCAI |year=2003}}</ref> time series,<ref>{{cite conference |author=Minh Nhut Nguyen |author2=Xiao-Li Li |author3=See-Kiong Ng |name-list-style=amp |title=Positive Unlabeled Learning for Time Series Classification |conference=IJCAI |year=2011}}</ref> bioinformatics tasks,<ref>{{cite conference |author=Peng Yang |author2=Xiao-Li Li |author3=Jian-Ping Mei |author4=Chee-Keong Kwoh |author5=See-Kiong Ng |name-list-style=amp |title=Positive-Unlabeled Learning for Disease Gene Identification |conference=Bioinformatics, Vol 28(20)|year=2012}}</ref><ref>{{cite journal |author=Bugnon, L. A. |author2=Yones, C. |author3=Milone, D. H. |author4=Stegmayer, G. |name-list-style=amp |title=Genome-wide discovery of pre-miRNAs: comparison of recent approaches based on machine learning |journal= Oxford Bioinformatics |year=2020|volume=22 |issue=3 |doi=10.1093/bib/bbaa184 |pmid=32814347 }}</ref> and remote sensing data.<ref>{{cite journal|last1=Li|first1=W.|last2=Guo|first2=Q.|last3=Elkan|first3=C.|date=February 2011|title=A Positive and Unlabeled Learning Algorithm for One-Class Classification of Remote-Sensing Data|journal=IEEE Transactions on Geoscience and Remote Sensing|volume=49|issue=2|pages=717–725|doi=10.1109/TGRS.2010.2058578|bibcode=2011ITGRS..49..717L|s2cid=267120|issn=0196-2892}}</ref>

== Approaches == Several approaches have been proposed to solve one-class classification (OCC). The approaches can be distinguished into three main categories, '''density estimation''', '''boundary methods''', and '''reconstruction methods'''.<ref name=":1" />

=== Density estimation methods === Density estimation methods rely on estimating the density of the data points, and set the threshold. These methods rely on assuming distributions, such as Gaussian, or a Poisson distribution. Following which discordancy tests can be used to test the new objects. These methods are robust to scale variance.

'''Gaussian model'''<ref>{{Cite book|url=https://books.google.com/books?id=T0S0BgAAQBAJ&q=neural+networks+for+pattern+recognition&pg=PP1|title=Neural Networks for Pattern Recognition|last1=Bishop|first1=Christopher M.|last2=Bishop|first2=Professor of Neural Computing Christopher M.|date=1995-11-23|publisher=Clarendon Press|isbn=978-0-19-853864-6|language=en}}</ref> is one of the simplest methods to create one-class classifiers. Due to Central Limit Theorem (CLT),<ref>{{Cite book |last=Ullman |first=Neil R |date=2017-01-01|title=Elementary statistics|url=http://archives.umc.edu.dz/handle/123456789/116529}}{{dead link|date=June 2023}}</ref> these methods work best when large number of samples are present, and they are perturbed by small independent error values. The probability distribution for a d-dimensional object is given by:

<math>p_{\mathcal{N}}(z;\mu;\Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}} |\Sigma|^\frac{1}{2}}\exp\left\{-\frac{1}{2}(z-\mu)^T \Sigma^{-1} (z - \mu)\right\}</math>

Where, '''<math>\mu</math>''' is the mean and '''<math>\Sigma</math>''' is the covariance matrix. Computing the inverse of covariance matrix ('''<math>\Sigma^{-1}</math>''') is the costliest operation, and in the cases where the data is not scaled properly, or data has singular directions pseudo-inverse <math>\Sigma^+</math> is used to approximate the inverse, and is calculated as <math>\Sigma^T(\Sigma \Sigma^T)^{-1}</math>.<ref>{{Cite web|url=http://bookstore.siam.org/wc02/|title=Introduction to Applied Mathematics|website=SIAM Bookstore|language=en|access-date=2019-04-29}}</ref>

=== Boundary methods === Boundary methods focus on setting boundaries around a few set of points, called target points. These methods attempt to optimize the volume. Boundary methods rely on distances, and hence are not robust to scale variance. K-centers method, NN-d, and SVDD are some of the key examples.

'''K-centers'''

In K-center algorithm,<ref>{{Cite book|last1=Ypma|first1=Alexander|last2=Duin|first2=Robert P. W.|chapter=Support objects for domain approximation |date=1998|editor-last=Niklasson|editor-first=Lars|editor2-last=Bodén|editor2-first=Mikael|editor3-last=Ziemke|editor3-first=Tom|title=Icann 98|series=Perspectives in Neural Computing|language=en|publisher=Springer London|pages=719–724|doi=10.1007/978-1-4471-1599-1_110|isbn=978-1-4471-1599-1}}</ref> <math>k</math> small balls with equal radius are placed to minimize the maximum distance of all minimum distances between training objects and the centers. Formally, the following error is minimized,

<math>\varepsilon_{k-center} = \max_i ( \min_k || x_i - \mu_k ||^2 )</math>

The algorithm uses forward search method with random initialization, where the radius is determined by the maximum distance of the object, any given ball should capture. After the centers are determined, for any given test object '''<math>z</math>''' the distance can be calculated as,

<math>d_{k-centr}(z) = \min_k || z - \mu_k ||^2</math>

=== Reconstruction methods === Reconstruction methods use prior knowledge and generating process to build a generating model that best fits the data. New objects can be described in terms of a state of the generating model. Some examples of reconstruction methods for OCC are, k-means clustering, learning vector quantization, self-organizing maps, etc.

== Applications ==

=== Document classification === The basic Support Vector Machine (SVM) paradigm is trained using both positive and negative examples, however studies have shown there are many valid reasons for using ''only'' positive examples. When the SVM algorithm is modified to only use positive examples, the process is considered one-class classification. One situation where this type of classification might prove useful to the SVM paradigm is in trying to identify a web browser's sites of interest based only off of the user's browsing history.

=== Biomedical studies === One-class classification can be particularly useful in biomedical studies where often data from other classes can be difficult or impossible to obtain. In studying biomedical data it can be difficult and/or expensive to obtain the set of labeled data from the second class that would be necessary to perform a two-class classification. A study from The Scientific World Journal found that the typicality approach is the most useful in analysing biomedical data because it can be applied to any type of dataset (continuous, discrete, or nominal).<ref name=":0">{{cite journal | vauthors = Irigoien I, Sierra B, Arenas C | title = Towards application of one-class classification methods to medical data | journal = TheScientificWorldJournal | volume = 2014 | article-number = 730712 | date = 2014 | pmid = 24778600 | pmc = 3980920 | doi = 10.1155/2014/730712 | doi-access = free }}</ref> The typicality approach is based on the clustering of data by examining data and placing it into new or existing clusters.<ref>{{cite journal | vauthors = Irigoien I, Arenas C | title = INCA: new statistic for estimating the number of clusters and identifying atypical units | journal = Statistics in Medicine | volume = 27 | issue = 15 | pages = 2948–73 | date = July 2008 | pmid = 18050154 | doi = 10.1002/sim.3143 | s2cid = 24791212 }}</ref> To apply typicality to one-class classification for biomedical studies, each new observation, <math>y_0 </math>, is compared to the target class, <math>C</math>, and identified as an outlier or a member of the target class.<ref name=":0" />

=== Unsupervised Concept Drift Detection === One-class classification has similarities with unsupervised concept drift detection, where both aim to identify whether the unseen data share similar characteristics to the initial data. A concept is referred to as the fixed probability distribution which data is drawn from. In unsupervised concept drift detection, the goal is to detect if the data distribution changes without utilizing class labels. In one-class classification, the flow of data is not important. Unseen data is classified as typical or outlier depending on its characteristics, whether it is from the initial concept or not. However, unsupervised drift detection monitors the flow of data, and signals a drift if there is a significant amount of change or anomalies. Unsupervised concept drift detection can be identified as the continuous form of one-class classification.<ref>{{cite journal |title=Concept learning using one-class classifiers for implicit drift detection in evolving data streams |journal=Artificial Intelligence Review |date= November 2020 |last1=Gözüaçık |first1= Ömer |last2=Can |first2=Fazli |volume=54 |issue=5 |pages=3725–3747 |doi=10.1007/s10462-020-09939-x |hdl=11693/77042 |s2cid=229506136 |hdl-access=free }}</ref> One-class classifiers are used for detecting concept drifts.<ref>{{cite journal |title=One-class classifiers with incremental learning and forgetting for data streams with concept drift |journal=Soft Computing |year=2015 |last1=Krawczyk |first1=Bartosz |last2=Woźniak |first2=Michał |volume=19 |issue=12 |pages=3387–3400 |doi=10.1007/s00500-014-1492-5 |s2cid=207011971 |doi-access=free }}</ref>

== References == {{reflist|30em}}

Category:Statistical classification Category:Classification algorithms