Polynomial kernel

{{Short description|Machine learning kernel function}} {{about|machine learning|polynomial kernels in complexity theory|Kernelization}}

[[File:Svm 8 polinomial.JPG|thumb|300px|Illustration of the mapping <math>\varphi</math>. On the left a set of samples in the input space, on the right the same samples in the feature space where the polynomial kernel <math>K(x,y)</math> (for some values of the parameters <math>c</math> and <math>d</math>) is the inner product. The hyperplane learned in feature space by an SVM is an ellipse in the input space.]] In [[machine learning]], the '''polynomial kernel''' is a [[kernel function]] commonly used with [[support vector machine]]s (SVMs) and other [[Kernel trick|kernelized]] models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models.

Intuitively, the polynomial kernel looks not only at the given features of input samples to determine their similarity, but also combinations of these. In the context of [[regression analysis]], such combinations are known as interaction features. The (implicit) feature space of a polynomial kernel is equivalent to that of [[polynomial regression]], but without the combinatorial blowup in the number of parameters to be learned. When the input features are binary-valued (booleans), then the features correspond to [[logical conjunction]]s of input features.<ref name="Goldberg2008">Yoav Goldberg and Michael Elhadad (2008). splitSVM: Fast, Space-Efficient, non-Heuristic, Polynomial Kernel Computation for NLP Applications. Proc. ACL-08: HLT.</ref>

==Definition== For degree-{{mvar|d}} polynomials, the polynomial kernel is defined as<ref>{{cite web |url=https://www.cs.tufts.edu/~roni/Teaching/CLT/LN/lecture18.pdf |title=Archived copy |accessdate=2012-11-12 |url-status=dead |archiveurl=https://web.archive.org/web/20130415231446/http://www.cs.tufts.edu/~roni/Teaching/CLT/LN/lecture18.pdf |archivedate=2013-04-15 }}</ref>

:<math>K(\mathbf{x},\mathbf{y}) = (\mathbf{x}^\mathsf{T} \mathbf{y} + c)^{d}</math>

where {{mvar|x}} and {{mvar|y}} are vectors of size {{mvar|n}} in the ''input space'', i.e. vectors of features computed from training or test samples and {{math|''c'' ≥ 0}} is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. When {{math|''c'' {{=}} 0}}, the kernel is called homogeneous.<ref>{{cite arXiv |last=Shashua |first=Amnon |eprint=0904.3664v1 |title=Introduction to Machine Learning: Class Notes 67577 |class=cs.LG |year=2009}}</ref> (A further generalized polykernel divides {{math|''x''T''y''}} by a user-specified scalar parameter {{mvar|a}}.{{r|lin2012}})

As a kernel, {{mvar|K}} corresponds to an inner product in a feature space based on some mapping {{mvar|φ}}:

:<math>K(\mathbf{x},\mathbf{y}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle</math>

The nature of {{mvar|φ}} can be seen from an example. Let {{math|''d'' {{=}} 2}}, so we get the special case of the quadratic kernel. After using the [[multinomial theorem]] (twice—the outermost application is the [[binomial theorem]]) and regrouping,

:<math>K(\mathbf{x},\mathbf{y}) = \left(\sum_{i=1}^n x_i y_i + c\right)^2 = \sum_{i=1}^n \left(x_i^2\right) \left(y_i^2 \right) + \sum_{i=2}^n \sum_{j=1}^{i-1} \left( \sqrt{2} x_i x_j \right) \left( \sqrt{2} y_i y_j \right) + \sum_{i=1}^n \left( \sqrt{2c} x_i \right) \left( \sqrt{2c} y_i \right) + c^2 </math>

From this it follows that the feature map is given by:

:<math> \varphi(x) = \left( x_n^2, \ldots, x_1^2, \sqrt{2} x_n x_{n-1}, \ldots, \sqrt{2} x_n x_1, \sqrt{2} x_{n-1} x_{n-2}, \ldots, \sqrt{2} x_{n-1} x_{1}, \ldots, \sqrt{2} x_{2} x_{1}, \sqrt{2c} x_n, \ldots, \sqrt{2c} x_1, c \right) </math>

generalizing for <math>\left(\mathbf{x}^{T}\mathbf{y} + c\right)^d</math>, where <math>\mathbf{x}\in\mathbb{R}^{n}</math>, <math>\mathbf{y}\in \mathbb{R}^{n}</math> and applying the [[multinomial theorem]]:

<math> \begin{alignat}{2} \left(\mathbf{x}^{T}\mathbf{y} + c\right)^d & = \sum_{j_1+j_2+\dots +j_{n+1}=d} \frac{\sqrt{d!}}{\sqrt{j_1! \cdots j_n! j_{n+1}!}} x_1^{j_1}\cdots x_n^{j_n} \sqrt{c}^{j_{n+1}} \frac{\sqrt{d!}}{\sqrt{j_1! \cdots j_n! j_{n+1}!}} y_1^{j_1}\cdots y_n^{j_n} \sqrt{c}^{j_{n+1}}\\ &=\varphi(\mathbf{x})^{T} \varphi(\mathbf{y}) \end{alignat} </math>

The last summation has <math>l_d=\tbinom {n+d}{d}</math> elements, so that: :<math> \varphi(\mathbf{x}) = \left(a_{1},\dots, a_{l},\dots,a_{l_d} \right ) </math> where <math>l=(j_1,j_2,...,j_{n},j_{n+1})</math> and :<math> a_{l}=\frac{\sqrt{d!} }{\sqrt{j_1! \cdots j_n!j_{n+1}! }} x_1^{j_1}\cdots x_n^{j_n} \sqrt{c}^{j_{n+1}}\quad|\quad j_1+j_2+\dots+j_n +j_{n+1} = d </math>

==Practical use== Although the [[RBF kernel]] is more popular in SVM classification than the polynomial kernel, the latter is quite popular in [[natural language processing]] (NLP).{{r|Goldberg2008}}<ref name="Chang2010">{{cite journal |first1=Yin-Wen |last1=Chang |first2=Cho-Jui |last2=Hsieh |first3=Kai-Wei |last3=Chang |first4=Michael |last4=Ringgaard |first5=Chih-Jen |last5=Lin |year=2010 |url=http://jmlr.csail.mit.edu/papers/v11/chang10a.html |title=Training and testing low-degree polynomial data mappings via linear SVM |journal=[[Journal of Machine Learning Research]] |volume=11 |pages=1471–1490}}</ref> The most common degree is {{math|''d'' {{=}} 2}} (quadratic), since larger degrees tend to [[overfitting|overfit]] on NLP problems.

Various ways of computing the polynomial kernel (both exact and approximate) have been devised as alternatives to the usual non-linear SVM training algorithms, including:

* full expansion of the kernel prior to training/testing with a linear SVM,{{r|Chang2010}} i.e. full computation of the mapping {{mvar|φ}} as in polynomial regression; * [[Association rule learning|basket mining]] (using a variant of the [[apriori algorithm]]) for the most commonly occurring feature conjunctions in a training set to produce an approximate expansion;<ref name="Kudo2003">{{cite conference |first1=T. |last1=Kudo |first2=Y. |last2=Matsumoto |year=2003 |title=Fast methods for kernel-based text analysis |conference=Proc. ACL}}</ref> * [[inverted index]]ing of support vectors.{{r|Kudo2003}}{{r|Goldberg2008}}

One problem with the polynomial kernel is that it may suffer from [[Numerical stability|numerical instability]]: when {{math|''x''T''y'' + ''c'' < 1}}, {{math|''K''(''x'', ''y'') {{=}} (''x''T''y'' + ''c'')''d''}} tends to zero with increasing {{mvar|d}}, whereas when {{math|''x''T''y'' + ''c'' > 1}}, {{math|''K''(''x'', ''y'')}} tends to infinity.<ref name="lin2012">{{cite conference |first=Chih-Jen |last=Lin |year=2012 |url=http://www.csie.ntu.edu.tw/~cjlin/talks/mlss_kyoto.pdf |title=Machine learning software: design and practical use |conference=Machine Learning Summer School |location=Kyoto}}</ref>

<ref name="Shewchuk">{{cite conference|first=Chih-Jen |last=Lin |year=2012 |url=https://people.eecs.berkeley.edu/~jrs/189s21/lec/16.pdf |title=Shewchuk |conference=Machine Learning Summer School |location=Kyoto}}</ref>

==References== <references/>

[[Category:Kernel methods for machine learning]]