# Structured prediction

> Mediated Wiki article. Canonical URL: https://mediated.wiki/source/Structured_prediction
> Markdown URL: https://mediated.wiki/source/Structured_prediction.md
> Source: https://en.wikipedia.org/wiki/Structured_prediction
> Source revision: 1273324430
> License: Creative Commons Attribution-ShareAlike 4.0 International (https://creativecommons.org/licenses/by-sa/4.0/)

{{Short description|Supervised machine learning techniques}}
{{Machine learning|Problems}}
'''Structured prediction''' or '''structured output learning''' is an [umbrella term](/source/umbrella_term) for [supervised](/source/Supervised_learning) [machine learning](/source/machine_learning) techniques that involves [predicting](/source/predicting) structured objects, rather than [discrete](/source/Statistical_classification) or [real](/source/Regression_analysis) values.<ref>Gökhan BakIr, Ben Taskar, Thomas Hofmann, Bernhard Schölkopf, Alex Smola and SVN Vishwanathan (2007), [https://mitpress.mit.edu/books/predicting-structured-data Predicting Structured Data], MIT Press.</ref>

Similar to commonly used supervised learning techniques, structured prediction models are typically trained by means of observed data in which the predicted value is compared to the [ground truth](/source/ground_truth), and this is used to adjust the model parameters. Due to the complexity of the model and the interrelations of predicted variables, the processes of model training and inference are often computationally infeasible, so [approximate inference](/source/approximate_inference) and learning methods are used.

==Applications==

An example application is the problem of translating a [natural language](/source/natural_language) sentence into a syntactic representation such as a [parse tree](/source/parse_tree). This can be seen as a structured prediction problem<ref name="Laf:McC:Per01" /> in which the structured output domain is the set of all possible parse trees. Structured prediction is used in a wide variety of domains including [bioinformatics](/source/bioinformatics), [natural language processing](/source/natural_language_processing) (NLP), [speech recognition](/source/speech_recognition), and [computer vision](/source/computer_vision). 

===Example: sequence tagging===

Sequence tagging is a class of problems prevalent in NLP in which input data are often sequential, for instance sentences of text. The sequence tagging problem appears in several guises, such as [part-of-speech tagging](/source/part-of-speech_tagging) (POS tagging) and [named entity recognition](/source/named_entity_recognition). In POS tagging, for example, each word in a sequence must be 'tagged' with a ''class label'' representing the type of word:

:{|
| This || [DT](/source/Determiner)
|-
| is || [VBZ](/source/Verb)
|-
| a || [DT](/source/Determiner)
|-
| tagged || [JJ](/source/Adjective)
|-
| sentence. || [NN](/source/Noun)
|}

The main challenge of this problem is to resolve [ambiguity](/source/ambiguity): in the above example, the words "sentence" and "tagged" in English can also be [verbs](/source/%3Awikt%3Asentence).

While this problem can be solved by simply performing [classification](/source/Statistical_classification) of individual [tokens](/source/Lexical_analysis), this approach does not take into account the empirical fact that tags do not occur independently; instead, each tag displays a strong [conditional dependence](/source/conditional_dependence) on the tag of the previous word. This fact can be exploited in a sequence model such as a [hidden Markov model](/source/hidden_Markov_model) or [conditional random field](/source/conditional_random_field)<ref name="Laf:McC:Per01">{{cite conference |author=Lafferty, J. |author2=McCallum, A. |author3=Pereira, F. |title=Conditional random fields: Probabilistic models for segmenting and labeling sequence data|book-title =Proc. 18th International Conf. on Machine Learning |date = 2001| pages= 282–289|url=http://www.cis.upenn.edu/~pereira/papers/crf.pdf }}
</ref> that predicts the entire tag sequence for a sentence (rather than just individual tags) via the [Viterbi algorithm](/source/Viterbi_algorithm).

==Techniques==

Probabilistic [graphical model](/source/graphical_model)s form a large class of structured prediction models. In particular, [Bayesian network](/source/Bayesian_network)s and [random field](/source/random_field)s are popular. Other algorithms and models for structured prediction include [inductive logic programming](/source/inductive_logic_programming), [case-based reasoning](/source/case-based_reasoning), [structured SVM](/source/structured_SVM)s, [Markov logic network](/source/Markov_logic_network)s, [Probabilistic Soft Logic](/source/Probabilistic_Soft_Logic), and [constrained conditional model](/source/constrained_conditional_model)s. The main techniques are:

* [Conditional random field](/source/Conditional_random_field)s
* [Structured support vector machine](/source/Structured_support_vector_machine)s
* [Structured ''k''-nearest neighbours](/source/Structured_kNN)
* [Recurrent neural network](/source/Recurrent_neural_network)s, in particular [Elman networks](/source/Recurrent_neural_network)
* [Transformers](/source/Transformer_(deep_learning_architecture)).

===Structured perceptron===
One of the easiest ways to understand algorithms for general structured prediction is the structured perceptron by [Collins](/source/Michael_Collins_(computational_linguist)).<ref>{{cite conference |first=Michael |last=Collins |year=2002 |title=Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms |conference=Proc. EMNLP |volume=10 |url=http://acl.ldc.upenn.edu/W/W02/W02-1001.pdf}}</ref> This algorithm combines the [perceptron](/source/perceptron) algorithm for learning [linear classifier](/source/linear_classifier)s with an inference algorithm (classically the [Viterbi algorithm](/source/Viterbi_algorithm) when used on sequence data) and can be described abstractly as follows:

# First, define a function <math>\phi(x,y)</math> that maps a training sample <math>x</math> and a candidate prediction <math>y</math> to a vector of length <math>n</math> (<math>x</math> and <math>y</math> may have any structure; <math>n</math> is problem-dependent, but must be fixed for each model). Let <math>GEN</math> be a function that generates candidate predictions.
# Then:

::Let <math>w</math> be a weight vector of length <math>n</math>

::For a predetermined number of iterations:

:::For each sample <math>x</math> in the training set with true output <math>t</math>:

::::Make a prediction <math>\hat{y}</math>: <math>\hat{y}={\operatorname{arg\,max}}\, \{y \in GEN(x)\}\,(w^T, \phi(x,y))</math>

::::Update <math>w</math> (from <math>\hat{y}</math> towards <math>t</math>): <math>w=w+c(-\phi(x, \hat{y})+ \phi(x,t))</math>, where <math>c</math> is the [learning rate](/source/learning_rate).

In practice, finding the argmax over <math>{GEN}({x})</math> is done using an algorithm such as Viterbi or a [max-sum](/source/max-sum_algorithm), rather than an [exhaustive search](/source/exhaustive_search) through an exponentially large set of candidates.

The idea of learning is similar to that for [multiclass perceptrons](/source/Perceptron).

==References==
<references/>
* Noah Smith, [https://www.cs.cmu.edu/~nasmith/LSP/ Linguistic Structure Prediction], 2011.
* Michael Collins, [https://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models], 2002.

==External links==
* [https://github.com/ashish01/CollinsTagger Implementation of Collins structured perceptron]

Category:Structured prediction

---
Adapted from the Wikipedia article [Structured prediction](https://en.wikipedia.org/wiki/Structured_prediction) by Wikipedia contributors ([contributor history](https://en.wikipedia.org/wiki/Structured_prediction?action=history)). Available under [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/). Changes may have been made.