Sentence boundary disambiguation

{{Short description|Issue when parsing sentence structure}}
'''Sentence boundary disambiguation''' ('''SBD'''), also known as '''sentence breaking''', '''sentence boundary detection''', and '''sentence segmentation''', is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or it may mark an abbreviation or a decimal point, or form part of an ellipsis or an email address, among other possibilities. About 47% of the periods in ''The Wall Street Journal'' corpus denote abbreviations.<ref>{{cite conference|url=https://www.researchgate.net/publication/2605947|title=Automatic extraction of rules for sentence boundary disambiguation |author1=E. Stamatatos |author2=N. Fakotakis |author3=G. Kokkinakis |name-list-style=amp |book-title=Proceedings of the Workshop on Machine Learning in Human Language Technology |pages=88–92 |publisher=University of Patras}}</ref> Question marks and exclamation marks can be similarly ambiguous because of their use in emoticons, source code, and slang.

Some languages, including Japanese and Chinese, have unambiguous sentence-ending markers.

==Strategies==
The standard 'vanilla' approach to locating the end of a sentence applies the following rules to each candidate punctuation mark:{{clarify|date=February 2015}}

:(a) If it is a period, it ends a sentence.
:(b) If the preceding token is in the hand-compiled list of abbreviations, then it does not end a sentence.
:(c) If the next token is capitalized, then it ends a sentence.

This strategy gets about 95% of sentences correct.<ref>{{cite web|url=http://www.attivio.com/attivio/blog/263-doing-things-with-words-part-two-sentence-boundary-detection.html|title= Doing Things with Words, Part Two: Sentence Boundary Detection|first=John |last=O'Neil |access-date=2009-01-03|archive-url=https://web.archive.org/web/20090221004022/http://www.attivio.com/attivio/blog/263-doing-things-with-words-part-two-sentence-boundary-detection.html|archive-date=2009-02-21|url-status=dead}}</ref> The remaining 5% often involve abbreviated names such as "D. H. Lawrence" (with whitespace between the initials), idiosyncratic spellings used for stylistic purposes (often a single name, such as the entertainment product title ".hack//SIGN"), and non-standard punctuation or non-standard usage ''of'' punctuation.
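The three rules can be sketched in a few lines of Python. This is only an illustrative reading of the list above, not a reference implementation: it assumes whitespace tokenization, that rule (c) overrides rule (b), and a tiny made-up abbreviation list (<code>ABBREVIATIONS</code>); the function name <code>split_sentences</code> is likewise hypothetical.

<syntaxhighlight lang="python">
# Hypothetical hand-compiled abbreviation list (stored without the final period).
ABBREVIATIONS = {"dr", "mr", "mrs", "vs", "etc", "e.g", "i.e"}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences using rules (a)-(c), applied in order."""
    tokens = text.split()  # assumption: tokens are whitespace-separated
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue
        boundary = True  # rule (a): a period ends a sentence
        if token[:-1].lower() in abbreviations:
            boundary = False  # rule (b): a known abbreviation does not
            next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
            if next_token[:1].isupper():
                boundary = True  # rule (c): a capitalized next token does
        if boundary:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no final period
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("We compared apples vs. oranges today. The results were clear."))
print(split_sentences("She met Dr. Smith."))
</syntaxhighlight>

Under these assumptions the first call yields two sentences, while the second is wrongly split after "Dr." because rule (c) fires on the capitalized name. This is one kind of error that simple rules leave behind.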

Another approach is to automatically learn a set of rules from a set of documents where the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model.<ref>{{cite web|url=http://www.aclweb.org/anthology/A/A97/A97-1004.pdf|title=A Maximum Entropy Approach to Identifying Sentence Boundaries |first1=JC |last1=Reynar |first2=A |last2=Ratnaparkhi |access-date=2009-01-03}}</ref> The SATZ<ref name="satz">{{cite web|archive-url=https://web.archive.org/web/20070922132340/http://elib.cs.berkeley.edu/src/satz/|archive-date=2007-09-22|url-status=dead|url=http://elib.cs.berkeley.edu/src/satz/|title=SATZ: An Adaptive Sentence Boundary Detector}}</ref> architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
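As a rough illustration of learning from pre-marked text, the sketch below derives a list of non-terminal tokens from training sentences instead of compiling it by hand. It is a deliberately simple stand-in, not the maximum entropy model or the SATZ neural network cited above. The function names, the <code>min_count</code> parameter, and the assumption that the training data is a list of already-separated sentences are all hypothetical.

<syntaxhighlight lang="python">
from collections import Counter


def learn_non_terminal_tokens(marked_sentences, min_count=1):
    """Collect tokens that end in a period but occur before the marked break."""
    counts = Counter()
    for sentence in marked_sentences:
        tokens = sentence.split()
        for token in tokens[:-1]:  # the last token is the true boundary
            if token.endswith("."):
                counts[token.lower()] += 1
    return {token for token, n in counts.items() if n >= min_count}


def split_with_learned_tokens(text, non_terminal):
    """Break after '.', '?' or '!' unless the token was learned as non-terminal."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token.lower() not in non_terminal:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


# Toy training data in which each string is one pre-marked sentence.
training = [
    "Prof. Jones arrived at 9 a.m. sharp.",
    "The talk was given by Dr. Lee.",
]
learned = learn_non_terminal_tokens(training)
print(split_with_learned_tokens("Dr. Lee thanked Prof. Jones. Everyone left.", learned))
</syntaxhighlight>

A statistical model would instead weigh many contextual features of each candidate punctuation mark, but the training signal is the same: text in which the breaks are already marked.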

==Software==
;Examples of use of Perl compatible regular expressions ("PCRE")
:* <syntaxhighlight lang="ragel" inline>((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])</syntaxhighlight>
:* <syntaxhighlight lang="php" inline>$sentences = preg_split("/(?<!\..)([\?\!\.]+)\s(?!.\.)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE);</syntaxhighlight> (for PHP)
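For comparison, the first pattern can be adapted to Python's <code>re</code> module. The sketch below is an adaptation, not the original expression: the groups are made non-capturing so that <code>re.split</code> returns only the sentences, and the names <code>BOUNDARY</code> and <code>split_on_pattern</code> are illustrative.

<syntaxhighlight lang="python">
import re

# Adapted from the first pattern above, with non-capturing groups.
BOUNDARY = re.compile(
    r'(?:(?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]"))'  # end mark, optionally inside a closing quote
    r'(?:\r\n|\s)'                                  # the whitespace that gets consumed
    r'(?="?[A-Z])'                                  # next sentence starts with a capital letter
)


def split_on_pattern(text):
    """Split text wherever the boundary pattern matches."""
    return BOUNDARY.split(text)


print(split_on_pattern('He said "Stop." Then he left. It was 3.5 miles away.'))
</syntaxhighlight>

The decimal point in "3.5" is left alone because no whitespace follows it. Because the pattern has no abbreviation list, a string such as "Dr. Smith" would still be split after the period.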

;Online use, libraries, and APIs
:* sent_detector{{snd}}Java<ref>{{cite web |url=http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector |title=SentParBreaker Web page |archive-url=https://web.archive.org/web/20071112103940/http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector |archive-date=2007-11-12 |url-status=dead}}</ref>
:* Lingua-EN-Sentence{{snd}}Perl<ref>{{Cite web|url=https://metacpan.org/release/SHLOMOY/Lingua-EN-Sentence-0.25|title=Lingua-EN-Sentence-0.25 - Module for splitting text into sentences. - metacpan.org|website=metacpan.org}}</ref>
:* Sentence.pm{{snd}}Perl<ref>{{Cite web|url=https://metacpan.org/release/TGROSE/HTML-Summary-0.017/view/lib/Text/Sentence.pm|title=Text::Sentence - module for splitting text into sentences - metacpan.org|website=metacpan.org}}</ref>
:* SATZ{{snd}}An Adaptive Sentence Segmentation System{{snd}}by David D. Palmer{{snd}}C<ref name="satz" />
;Toolkits that include sentence detection
:* Apache OpenNLP<ref>{{Cite web|url=https://opennlp.apache.org/|title=Apache OpenNLP|website=opennlp.apache.org}}</ref>
:* FreeLing<ref>{{cite web |url=https://nlp.lsi.upc.edu/freeling/ |title=Welcome {{!}} FreeLing Home Page}}</ref>
:* Natural Language Toolkit<ref>{{Cite web|url=https://www.nltk.org/|title=NLTK :: Natural Language Toolkit|website=www.nltk.org}}</ref>
:* Stanford NLP<ref>{{Cite web|url=https://nlp.stanford.edu/software/index.shtml|title=Software - The Stanford Natural Language Processing Group|website=nlp.stanford.edu}}</ref>
:* GExp<ref>{{Cite web|url=https://code.google.com/archive/p/graph-expression/wikis/SentenceSplitting.wiki|title=Google Code Archive - Long-term storage for Google Code Project Hosting.|website=code.google.com}}</ref>
:* CogComp-NLP<ref>{{Cite web|url=https://github.com/CogComp/cogcomp-nlp|title=CogCompNLP|date=January 2, 2024|via=GitHub}}</ref>

==References==
{{reflist}}

==External links==
*[https://spacy.io/universe/project/python-sentence-boundary-disambiguation pySBD - python Sentence Boundary Disambiguation]

Category:Tasks of natural language processing