# Polynomial kernel

> Mediated Wiki article. Canonical URL: https://mediated.wiki/source/Polynomial_kernel
> Markdown URL: https://mediated.wiki/source/Polynomial_kernel.md
> Source: https://en.wikipedia.org/wiki/Polynomial_kernel
> Source revision: 1337746775
> License: Creative Commons Attribution-ShareAlike 4.0 International (https://creativecommons.org/licenses/by-sa/4.0/)

Machine learning kernel function

This article is about machine learning. For polynomial kernels in complexity theory, see [Kernelization](/source/Kernelization).

Illustration of the mapping

        φ

    {\displaystyle \varphi }

. On the left a set of samples in the input space, on the right the same samples in the feature space where the polynomial kernel

        K
        (
        x
        ,
        y
        )

    {\displaystyle K(x,y)}

 (for some values of the parameters

        c

    {\displaystyle c}

 and

        d

    {\displaystyle d}

) is the inner product. The hyperplane learned in feature space by an SVM is an ellipse in the input space.

In [machine learning](/source/Machine_learning), the **polynomial kernel** is a [kernel function](/source/Kernel_function) commonly used with [support vector machines](/source/Support_vector_machine) (SVMs) and other [kernelized](/source/Kernel_trick) models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models.

Intuitively, the polynomial kernel looks not only at the given features of input samples to determine their similarity, but also combinations of these. In the context of [regression analysis](/source/Regression_analysis), such combinations are known as interaction features. The (implicit) feature space of a polynomial kernel is equivalent to that of [polynomial regression](/source/Polynomial_regression), but without the combinatorial blowup in the number of parameters to be learned. When the input features are binary-valued (booleans), then the features correspond to [logical conjunctions](/source/Logical_conjunction) of input features.[1]

## Definition

For degree-d polynomials, the polynomial kernel is defined as[2]

- K ( x , y ) = ( x T y + c ) d {\displaystyle K(\mathbf {x} ,\mathbf {y} )=(\mathbf {x} ^{\mathsf {T}}\mathbf {y} +c)^{d}}

where x and y are vectors of size n in the *input space*, i.e. vectors of features computed from training or test samples and *c* ≥ 0 is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. When *c* = 0, the kernel is called homogeneous.[3] (A further generalized polykernel divides *x*T*y* by a user-specified scalar parameter a.[4])

As a kernel, K corresponds to an inner product in a feature space based on some mapping φ:

- K ( x , y ) = ⟨ φ ( x ) , φ ( y ) ⟩ {\displaystyle K(\mathbf {x} ,\mathbf {y} )=\langle \varphi (\mathbf {x} ),\varphi (\mathbf {y} )\rangle }

The nature of φ can be seen from an example. Let *d* = 2, so we get the special case of the quadratic kernel. After using the [multinomial theorem](/source/Multinomial_theorem) (twice—the outermost application is the [binomial theorem](/source/Binomial_theorem)) and regrouping,

- K ( x , y ) = ( ∑ i = 1 n x i y i + c ) 2 = ∑ i = 1 n ( x i 2 ) ( y i 2 ) + ∑ i = 2 n ∑ j = 1 i − 1 ( 2 x i x j ) ( 2 y i y j ) + ∑ i = 1 n ( 2 c x i ) ( 2 c y i ) + c 2 {\displaystyle K(\mathbf {x} ,\mathbf {y} )=\left(\sum _{i=1}^{n}x_{i}y_{i}+c\right)^{2}=\sum _{i=1}^{n}\left(x_{i}^{2}\right)\left(y_{i}^{2}\right)+\sum _{i=2}^{n}\sum _{j=1}^{i-1}\left({\sqrt {2}}x_{i}x_{j}\right)\left({\sqrt {2}}y_{i}y_{j}\right)+\sum _{i=1}^{n}\left({\sqrt {2c}}x_{i}\right)\left({\sqrt {2c}}y_{i}\right)+c^{2}}

From this it follows that the feature map is given by:

- φ ( x ) = ( x n 2 , … , x 1 2 , 2 x n x n − 1 , … , 2 x n x 1 , 2 x n − 1 x n − 2 , … , 2 x n − 1 x 1 , … , 2 x 2 x 1 , 2 c x n , … , 2 c x 1 , c ) {\displaystyle \varphi (x)=\left(x_{n}^{2},\ldots ,x_{1}^{2},{\sqrt {2}}x_{n}x_{n-1},\ldots ,{\sqrt {2}}x_{n}x_{1},{\sqrt {2}}x_{n-1}x_{n-2},\ldots ,{\sqrt {2}}x_{n-1}x_{1},\ldots ,{\sqrt {2}}x_{2}x_{1},{\sqrt {2c}}x_{n},\ldots ,{\sqrt {2c}}x_{1},c\right)}

generalizing for ( x T y + c ) d {\displaystyle \left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}} , where x ∈ R n {\displaystyle \mathbf {x} \in \mathbb {R} ^{n}} , y ∈ R n {\displaystyle \mathbf {y} \in \mathbb {R} ^{n}} and applying the [multinomial theorem](/source/Multinomial_theorem):

( x T y + c ) d = ∑ j 1 + j 2 + ⋯ + j n + 1 = d d ! j 1 ! ⋯ j n ! j n + 1 ! x 1 j 1 ⋯ x n j n c j n + 1 d ! j 1 ! ⋯ j n ! j n + 1 ! y 1 j 1 ⋯ y n j n c j n + 1 = φ ( x ) T φ ( y ) {\displaystyle {\begin{alignedat}{2}\left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}&=\sum _{j_{1}+j_{2}+\dots +j_{n+1}=d}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}y_{1}^{j_{1}}\cdots y_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\\&=\varphi (\mathbf {x} )^{T}\varphi (\mathbf {y} )\end{alignedat}}}

The last summation has l d = ( n + d d ) {\displaystyle l_{d}={\tbinom {n+d}{d}}} elements, so that:

- φ ( x ) = ( a 1 , … , a l , … , a l d ) {\displaystyle \varphi (\mathbf {x} )=\left(a_{1},\dots ,a_{l},\dots ,a_{l_{d}}\right)}

where l = ( j 1 , j 2 , . . . , j n , j n + 1 ) {\displaystyle l=(j_{1},j_{2},...,j_{n},j_{n+1})} and

- a l = d ! j 1 ! ⋯ j n ! j n + 1 ! x 1 j 1 ⋯ x n j n c j n + 1 | j 1 + j 2 + ⋯ + j n + j n + 1 = d {\displaystyle a_{l}={\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\quad |\quad j_{1}+j_{2}+\dots +j_{n}+j_{n+1}=d}

## Practical use

Although the [RBF kernel](/source/RBF_kernel) is more popular in SVM classification than the polynomial kernel, the latter is quite popular in [natural language processing](/source/Natural_language_processing) (NLP).[1][5] The most common degree is *d* = 2 (quadratic), since larger degrees tend to [overfit](/source/Overfitting) on NLP problems.

Various ways of computing the polynomial kernel (both exact and approximate) have been devised as alternatives to the usual non-linear SVM training algorithms, including:

- full expansion of the kernel prior to training/testing with a linear SVM,[5] i.e. full computation of the mapping φ as in polynomial regression;

- [basket mining](/source/Association_rule_learning) (using a variant of the [apriori algorithm](/source/Apriori_algorithm)) for the most commonly occurring feature conjunctions in a training set to produce an approximate expansion;[6]

- [inverted indexing](/source/Inverted_index) of support vectors.[6][1]

One problem with the polynomial kernel is that it may suffer from [numerical instability](/source/Numerical_stability): when *x*T*y* + *c* < 1, *K*(*x*, *y*) = (*x*T*y* + *c*)*d* tends to zero with increasing d, whereas when *x*T*y* + *c* > 1, *K*(*x*, *y*) tends to infinity.[4]

[7]

## References

1. ^ [***a***](#cite_ref-Goldberg2008_1-0) [***b***](#cite_ref-Goldberg2008_1-1) [***c***](#cite_ref-Goldberg2008_1-2) Yoav Goldberg and Michael Elhadad (2008). splitSVM: Fast, Space-Efficient, non-Heuristic, Polynomial Kernel Computation for NLP Applications. Proc. ACL-08: HLT.

1. **[^](#cite_ref-2)** ["Archived copy"](https://web.archive.org/web/20130415231446/http://www.cs.tufts.edu/~roni/Teaching/CLT/LN/lecture18.pdf) (PDF). Archived from [the original](https://www.cs.tufts.edu/~roni/Teaching/CLT/LN/lecture18.pdf) (PDF) on 2013-04-15. Retrieved 2012-11-12.{{[cite web](https://en.wikipedia.org/wiki/Template:Cite_web)}}: CS1 maint: archived copy as title ([link](https://en.wikipedia.org/wiki/Category:CS1_maint:_archived_copy_as_title))

1. **[^](#cite_ref-3)** Shashua, Amnon (2009). "Introduction to Machine Learning: Class Notes 67577". [arXiv](/source/ArXiv_(identifier)):[0904.3664v1](https://arxiv.org/abs/0904.3664v1) [[cs.LG](https://arxiv.org/archive/cs.LG)].

1. ^ [***a***](#cite_ref-lin2012_4-0) [***b***](#cite_ref-lin2012_4-1) Lin, Chih-Jen (2012). [*Machine learning software: design and practical use*](http://www.csie.ntu.edu.tw/~cjlin/talks/mlss_kyoto.pdf) (PDF). Machine Learning Summer School. Kyoto.

1. ^ [***a***](#cite_ref-Chang2010_5-0) [***b***](#cite_ref-Chang2010_5-1) Chang, Yin-Wen; Hsieh, Cho-Jui; Chang, Kai-Wei; Ringgaard, Michael; Lin, Chih-Jen (2010). ["Training and testing low-degree polynomial data mappings via linear SVM"](http://jmlr.csail.mit.edu/papers/v11/chang10a.html). *[Journal of Machine Learning Research](/source/Journal_of_Machine_Learning_Research)*. **11**: 1471–1490.

1. ^ [***a***](#cite_ref-Kudo2003_6-0) [***b***](#cite_ref-Kudo2003_6-1) Kudo, T.; Matsumoto, Y. (2003). *Fast methods for kernel-based text analysis*. Proc. ACL.

1. **[^](#cite_ref-Shewchuk_7-0)** Lin, Chih-Jen (2012). [*Shewchuk*](https://people.eecs.berkeley.edu/~jrs/189s21/lec/16.pdf) (PDF). Machine Learning Summer School. Kyoto.

---
Adapted from the Wikipedia article [Polynomial kernel](https://en.wikipedia.org/wiki/Polynomial_kernel) by Wikipedia contributors ([contributor history](https://en.wikipedia.org/wiki/Polynomial_kernel?action=history)). Available under [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/). Changes may have been made.
