Full-text search

{{Short description|Search using the full text of documents}}

In text retrieval, '''full-text search''' refers to a set of techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on specific parts of documents, such as titles, abstracts, selected sections, or bibliographical references.<ref name=":0">{{Cite book |last1=Manning |first1=Christopher D. |title=Introduction to Information Retrieval |last2=Raghavan |first2=Prabhakar |last3=Schütze |first3=Hinrich |publisher=Cambridge University Press |year=2009 |isbn=978-0-521-86571-5}}</ref><ref name=":1">{{Cite book |last1=Baeza-Yates |first1=Ricardo |title=Modern Information Retrieval |last2=Ribeiro-Neto |first2=Berthier |publisher=Addison-Wesley |year=1999 |isbn=0-201-39829-X |edition=1st}}</ref><ref name=":5">{{Cite web |title=History Of Search Engines |url=https://www.researchgate.net/publication/265104813 |archive-url=http://web.archive.org/web/20260214050152/https://www.researchgate.net/publication/265104813_History_Of_Search_Engines |archive-date=February 14, 2026 |access-date=March 14, 2026 |website=ResearchGate |language=en}}</ref>

In a full-text search, a search engine examines the words in stored documents to find matches for user queries.<ref name=":6">{{Cite journal |last1=Yuwono |first1=Budi |last2=Lee |first2=Dik L. |date=1997 |title=Server Ranking for Distributed Text Retrieval Systems on the Internet |url=http://www.worldscientific.com/doi/abs/10.1142/9789812819536_0005 |language= |publisher=World Scientific |pages=41–49 |doi=10.1142/9789812819536_0005 |isbn=978-981-02-3107-1|journal=Database Systems for Advanced Applications '97|url-access=subscription }}</ref> Full-text-search techniques began to appear in the 1960s (for example, IBM STAIRS in 1969), and became common in online bibliographic databases during the 1990s. Many websites and application programs, including word processors, implement full-text search functionality.<ref name=":0" /><ref name=":1" /><ref>{{Cite web |last=Salton |first=Gerard |date=1974 |title=Information Storage and Retrieval |url=https://files.eric.ed.gov/fulltext/ED101718.pdf}}</ref><ref>{{Cite book |last1=Salton |first1=Gerard |title=Introduction to Modern Information Retrieval |last2=McGill |first2=Michael J. |date=1983 |publisher=McGraw-Hill |isbn=0-07-054484-0}}</ref> Some web search engines, such as the former AltaVista, indexed the full text of web pages, while others indexed only selected portions of pages.<ref name=":1" /><ref name=":5" /><ref>{{Cite web |last=Sullivan |first=Danny |date=2013-06-28 |title=A Eulogy For AltaVista, The Google Of Its Time |url=https://searchengineland.com/altavista-eulogy-165366 |access-date=2026-03-15 |website=Search Engine Land |language=en}}</ref>

==Indexing==

When dealing with a small number of documents, a full-text-search engine can scan the contents of each document directly for every query, a strategy known as "serial scanning." Some tools, such as grep, operate in this way.<ref name=":0" /><ref name=":2" />

When the number of documents or queries is large, full-text search is typically divided into two tasks: indexing and searching. During indexing, the engine scans all documents and builds a list of search terms, often called an index, or more precisely, a concordance. During the search stage, queries are performed against the index rather than the original documents.<ref name=":0" /><ref name=":1" /><ref name=":2">{{Cite book |last1=Witten |first1=Ian H. |title=Managing Gigabytes: Compressing and Indexing Documents and Images |last2=Moffat |first2=Alistair |last3=Bell |first3=Timothy C. |publisher=Morgan Kaufmann |year=1999 |isbn=1-55860-570-3 |edition=2nd}}</ref><ref name=":3" />

The indexer records each term found in a document and may note its position within the text. Common words, known as stop words (for instance, "the" or "and"), are often omitted because they add little value to search results. Some indexers also perform stemming, which reduces words to their base form; for example, "drives", "drove", and "driven" may all be indexed under the concept word "drive."<ref name=":0" /><ref name=":1" /><ref name=":2" /><ref>{{Cite journal |last1=Göksel |first1=Gökhan |last2=Arslan |first2=Ahmet |last3=Dinçer |first3=Bekir Taner |date=2023 |title=A selective approach to stemming for minimizing the risk of failure in information retrieval systems |journal=PeerJ Computer Science |volume=9 |article-number=e1175 |doi=10.7717/peerj-cs.1175 |doi-access=free|issn=2376-5992 |pmc=10280253 |pmid=37346699}}</ref>
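The indexing and search stages described above can be illustrated with a minimal Python sketch. It is not taken from any particular search engine: the stop-word list, the hand-written stem table, and the function names <code>build_index</code> and <code>search</code> are assumptions made for illustration only. A real indexer would typically use a stemming algorithm rather than a lookup table.

```python
import re
from collections import defaultdict

# Illustrative assumptions: a tiny stop-word list and a hand-written stem table.
STOP_WORDS = {"the", "and", "a", "of", "to"}
STEMS = {"drives": "drive", "drove": "drive", "driven": "drive"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Indexing stage: map each term to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                continue  # stop words are not indexed, but still occupy a position
            term = STEMS.get(word, word)  # index the base form when one is known
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, word):
    """Search stage: consult the index instead of rescanning the documents."""
    term = STEMS.get(word.lower(), word.lower())
    return sorted(index.get(term, {}))


documents = {
    1: "She drives to the bank",
    2: "He drove the car and parked",
    3: "The river bank flooded",
}
index = build_index(documents)
print(search(index, "driven"))  # documents 1 and 2 are both indexed under "drive"
print(index["bank"])            # positions of "bank" in documents 1 and 3
```

Recording word positions costs extra space, but it is what makes phrase and proximity queries possible without returning to the original text.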

==The precision vs. recall tradeoff==

''Figure: Diagram of a low-precision, low-recall search.''

'''Recall''' and '''precision''' are standard measures of search effectiveness. Recall is the proportion of all relevant results that a search returns, while precision is the proportion of returned results that are relevant. Formally, recall is the ratio of relevant results returned to the total number of relevant results available, and precision is the ratio of the number of relevant results returned to the total number of results returned.<ref name=":0" /><ref name=":1" /><ref name=":6" /><ref name="isbn1430215941" />

The diagram at the right illustrates a search with low recall and low precision. In the diagram, red and green dots represent the total population of potential search results for a given query, with green dots indicating relevant results and red indicating irrelevant results. Relevance is indicated by proximity to the center of the inner circle. The results actually returned by the search are highlighted on a light-blue background. In the example, only one relevant result of three possible relevant results was returned, giving a recall of 1/3 (33%). The precision is 1/4 (25%), since only one of the four results returned was relevant.<ref name="isbn1430215941">{{cite book|last1=Coles|first1=Michael|year=2008|title=Pro Full-Text Search in SQL Server 2008|edition=1st|publisher=Apress|isbn=978-1-4302-1594-3|last2=Cotter|first2=Hilary}}</ref>
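The arithmetic of the example can be reproduced with a short Python sketch. The result identifiers (<code>g1</code> to <code>g3</code> for relevant results, <code>r1</code> to <code>r3</code> for irrelevant ones) and the function name <code>precision_recall</code> are hypothetical labels chosen for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned and a set of relevant results."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant results that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Three relevant results exist; the search returns four results, one of them relevant.
relevant = {"g1", "g2", "g3"}
returned = {"g1", "r1", "r2", "r3"}

precision, recall = precision_recall(returned, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```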

Due to the ambiguities of natural language, full-text-search systems typically include features such as filtering to increase precision and stemming to increase recall. Controlled-vocabulary search can also help alleviate low-precision results by tagging documents to reduce ambiguity.<ref name=":0" /><ref name=":1" /><ref name=":3" /> There is generally a trade-off between precision and recall: increasing precision can reduce overall recall, while increasing recall may lower precision.<ref name=":0" /><ref name=":1" /><ref name="isbn1430215941" />

==False-positive problem==

Full-text search can retrieve many documents that are not relevant to the intended query. Such documents are called '''false positives''' (see Type I error). The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language.<ref name=":0" /><ref name=":1" /><ref name=":3">{{Cite journal |last1=Rivas |first1=A. R. |last2=Iglesias |first2=E. L. |last3=Borrajo |first3=L. |date=2014 |title=Study of query expansion techniques and their application in the biomedical information retrieval |journal=TheScientificWorldJournal |volume=2014 |article-number=132158 |doi=10.1155/2014/132158 |doi-access=free |issn=1537-744X |pmc=3958669 |pmid=24723793}}</ref> In the accompanying diagram, false positives are represented by irrelevant results (red dots) that were returned by the search (highlighted on a light-blue background).

'''Clustering techniques''', often based on Bayesian algorithms, can help reduce false positives. For example, for a search term such as "bank", clustering can categorize documents into groups such as "financial institution", "place to sit", or "place to store." Depending on the occurrence of words relevant to these categories, a search term or a search result can be assigned to one or more categories. This approach is widely used in the e-discovery domain.<ref>{{Cite web |last=Socha |first=George |date=August 12, 2017 |title=Cluster Clear: Are Clustering Tools the Solution to Tedious Identification and Reduction Processes? {{!}} Judicature |url=https://judicature.duke.edu/articles/cluster-clear-are-clustering-tools-the-solution-to-tedious-identification-and-reduction-processes/ |access-date=2026-03-14 |website=judicature.duke.edu |language=en-US}}</ref>
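The idea of assigning a result to categories according to the words it contains can be sketched in Python as follows. This is a deliberately simplified stand-in that counts overlapping cue words; it is not a Bayesian model and not a real clustering algorithm. The cue-word lists and the function name <code>categorize</code> are invented for illustration.

```python
import re

# Hypothetical cue words for three senses of "bank" (illustrative only).
CUE_WORDS = {
    "financial institution": {"loan", "account", "deposit", "interest"},
    "place to sit": {"bench", "seat", "sit", "grassy"},
    "place to store": {"blood", "data", "storage", "memory"},
}


def categorize(text, min_hits=1):
    """Assign a text to every category with at least min_hits matching cue words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {category: len(words & cues) for category, cues in CUE_WORDS.items()}
    return sorted(category for category, score in scores.items() if score >= min_hits)


print(categorize("The bank raised the interest rate on every savings account"))
print(categorize("The blood bank offers a loan of storage equipment"))
```

The second text matches cue words from two categories and is therefore assigned to both, which mirrors the statement that a result can belong to one or more categories.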

== Synonym problem ==

At a basic level, search engines return items that contain the exact phrase listed in the query. Tools and methodologies exist to account for grammatical or typographical errors and to refine results; however, these techniques still typically require a close textual match. Because there are often multiple ways to refer to an entity or concept, full-text search may fail to retrieve an item if the exact term is not used in the query.

Synonyms can be handled, by a technique distinct from semantic search, by creating an index of related terms, so that when one variation of a word is searched, items containing any of the related terms may also be returned.<ref>{{Cite journal |last=Beall |first=Jeffrey |date=2008-09-01 |title=The Weaknesses of Full-Text Searching |url=https://www.sciencedirect.com/science/article/pii/S0099133308001067 |journal=The Journal of Academic Librarianship |volume=34 |issue=5 |pages=438–444 |doi=10.1016/j.acalib.2008.06.007 |issn=0099-1333|url-access=subscription }}</ref>
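An index of related terms of this kind can be sketched in Python. The synonym groups, the sample documents, and the function names <code>expand</code> and <code>search</code> are assumptions made for illustration; in practice such groups are usually curated by hand or drawn from a thesaurus.

```python
# Hypothetical synonym groups; each term maps to the whole group it belongs to.
SYNONYM_GROUPS = [
    {"car", "automobile", "auto"},
    {"physician", "doctor"},
]
RELATED = {term: group for group in SYNONYM_GROUPS for term in group}


def expand(term):
    """Return the query term together with any related terms."""
    term = term.lower()
    return RELATED.get(term, {term})


def search(documents, term):
    """Return ids of documents containing the term or any of its related terms."""
    wanted = expand(term)
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if wanted & set(text.lower().split())
    )


documents = {
    1: "used automobile for sale",
    2: "family doctor near me",
    3: "bicycle repair shop",
}
print(search(documents, "car"))        # matches document 1 through "automobile"
print(search(documents, "physician"))  # matches document 2 through "doctor"
```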

==Performance improvements==

The limitations of full-text searching have been addressed in two ways: by providing users with tools to express search questions more precisely, and by developing algorithms that improve retrieval precision.<ref name=":0" /><ref name=":1" /><ref name=":2" /><ref name="isbn1430215941" /><ref name=":4" />

===Improved querying tools===

* Keywords and synonym search, or query expansion: A technique in which document creators (or trained indexers) supply lists of words that describe the subject of a text, including synonyms. Keywords improve recall, especially when the search term does not appear explicitly in the text.
* Field-restricted search: Some search engines allow users to limit searches to specific fields within a stored data record, such as "Title" or "Author."
* Boolean queries: Searches using Boolean operators (for example, "encyclopedia" AND "online" NOT "Encarta") can increase precision. The AND operator retrieves only documents containing all specified terms; NOT excludes documents containing a term. The OR operator can be used to increase recall, for instance, "encyclopedia" AND "online" OR "Internet" NOT "Encarta." This query also retrieves documents about online encyclopedias that use the term "Internet" instead of "online." Restricting a query to increase precision may sometimes reduce recall significantly.<ref name=":4">{{Cite report |url=http://hdl.handle.net/10919/19378 |title=Experimental Comparison of Schemes for Interpreting Boolean Queries |last1=Lee |first1=Whay C. |last2=Fox |first2=Edward A. |date=May 1, 1988 |publisher=Department of Computer Science, Virginia Polytechnic Institute & State University |hdl=10919/19378 |language=en}}</ref>
* Phrase search: Matches documents containing an exact sequence of words, such as "Wikipedia, the free encyclopedia."
* Concept search: Matches multi-word concepts, for example by using compound term processing. This approach is increasingly used in e-discovery solutions.
* Concordance search: Produces an alphabetical list of all principal words that occur in a text along with their immediate context.
* Proximity search: Retrieves documents in which two or more words occur within a specified distance; for example, "Wikipedia" WITHIN2 "free" retrieves documents where "Wikipedia" and "free" are separated by at most two words.
* Regular expression search: Uses a complex but powerful syntax to specify precise retrieval conditions.
* Fuzzy search: Retrieves documents that match the query terms approximately, allowing for small variations as measured, for example, by edit distance.
* Wildcard search: Substitutes one or more characters in a query with a wildcard symbol (e.g., *). For example, "s*n" matches "sin", "son", or "sun."
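Several of the querying tools listed above can be sketched in Python over a small positional index. The sample documents and all function names are hypothetical, and the sketch makes two assumptions that the list does not state: the Boolean example is read as "encyclopedia" AND ("online" OR "Internet") NOT "Encarta", and the wildcard <code>*</code> is allowed to match any number of characters, which is how Python's <code>fnmatch</code> module treats it.

```python
import re
from fnmatch import fnmatchcase

DOCS = {
    1: "Wikipedia, the free encyclopedia",
    2: "an online encyclopedia by Encarta",
    3: "a free Internet encyclopedia",
    4: "the sun rose over the online library",
}

# Positional index: term -> {doc_id: [word positions]}
INDEX = {}
for doc_id, text in DOCS.items():
    for pos, term in enumerate(re.findall(r"[a-z]+", text.lower())):
        INDEX.setdefault(term, {}).setdefault(doc_id, []).append(pos)


def docs_with(term):
    """Set of document ids containing the term."""
    return set(INDEX.get(term.lower(), {}))


def phrase(words):
    """Phrase search: documents where the words occur consecutively, in order."""
    terms = [w.lower() for w in words]
    hits = set()
    for doc_id, starts in INDEX.get(terms[0], {}).items():
        for start in starts:
            if all(start + k in INDEX.get(t, {}).get(doc_id, [])
                   for k, t in enumerate(terms)):
                hits.add(doc_id)
    return hits


def within(a, b, max_between):
    """Proximity search: a and b separated by at most max_between words."""
    a, b = a.lower(), b.lower()
    return {
        doc_id
        for doc_id in docs_with(a) & docs_with(b)
        if any(abs(i - j) - 1 <= max_between
               for i in INDEX[a][doc_id] for j in INDEX[b][doc_id])
    }


def wildcard(pattern):
    """Wildcard search over indexed terms; * matches any run of characters."""
    return sorted(t for t in INDEX if fnmatchcase(t, pattern.lower()))


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fuzzy(term, max_edits=1):
    """Fuzzy search: indexed terms within max_edits of the query term."""
    return sorted(t for t in INDEX if edit_distance(term.lower(), t) <= max_edits)


# Boolean query: AND is set intersection, OR is union, NOT is set difference.
boolean_hits = (
    docs_with("encyclopedia") & (docs_with("online") | docs_with("Internet"))
) - docs_with("Encarta")

print(boolean_hits)
print(phrase(["the", "free", "encyclopedia"]))
print(within("Wikipedia", "free", 2))
print(wildcard("s*n"))
print(fuzzy("encyclopdia"))
```

Expressing the Boolean operators as set operations on the index shows why AND and NOT can only shrink a result set, while OR can only enlarge it.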

==Software==

The following is a partial list of software products that support full-text indexing and search. Some of these are accompanied by documentation describing their architecture or algorithms, which may provide additional insight into how full-text search is implemented.

{{col-float}}
=== Free and open source software ===
<!--

Please do not add web links or products which do not have Wikipedia articles. They will be summarily deleted.

-->
* Apache Lucene
* Apache Solr
* ArangoSearch
* BaseX
* KinoSearch
* Lemur/Indri
* MariaDB
* mnoGoSearch
* MySQL
* OpenSearch
* PostgreSQL
* Searchdaimon
* Sphinx
* Swish-e
* Terrier IR Platform
* Xapian
{{col-float-break}}

=== Proprietary software ===
<!--

Please do not add web links or products which do not have Wikipedia articles. They will be summarily deleted.

-->
* Algolia
* Autonomy Corporation
* Azure Search
* Bar Ilan Responsa Project
* Basis database
* Brainware
* BRS/Search
* Concept Searching Limited
* Dieselpoint
* dtSearch
* Elasticsearch
* Endeca
* Exalead
* Fast Search & Transfer
* Inktomi
* Lucid Imagination
* MarkLogic
* MongoDB
* SAP HANA
* Swiftype
* Thunderstone Software LLC
* Vivísimo
{{col-float-end}}

== References ==
{{Reflist}}

==See also==
* Pattern matching and string matching
* Compound term processing
* Enterprise search
* Information extraction
* Information retrieval
* Faceted search
* WebCrawler, first FTS engine
* Search engine indexing – how search engines generate indices to support full-text searching

{{DEFAULTSORT:Full Text Search}}
Category:Text editor features
Category:Information retrieval genres