Concordancer

{{Short description|Computer program that constructs concordances from text corpora}}

A '''concordancer''' is a computer program that automatically constructs a concordance—an alphabetised index of every occurrence of a word or phrase in a body of text, each entry displayed with its surrounding context. Concordancers are primary tools in corpus linguistics, lexicography, computer-assisted translation, and language teaching. The most common display format is the key word in context (KWIC) layout, in which each hit appears centred on a line with a fixed span of words to its left and right, enabling rapid scanning of usage patterns across many occurrences.

== History ==

=== Pre-computational concordances ===

The compilation of concordances predates computers by many centuries. Around 1230, the French Dominican cardinal Hugh of Saint-Cher directed a team of friars in assembling a concordance of the Latin Vulgate Bible, generally regarded as the first systematic concordance of any text.<ref name="christianity-hugh">{{cite web |url=https://www.christianity.com/church/church-history/timeline/1201-1500/hugh-of-st-chers-concordance-11629840.html |title=Hugh of St. Cher's Concordance |website=Christianity.com |access-date=2025-04-01}}</ref> To help readers locate passages, Hugh divided each biblical chapter into lettered sections. Later milestones include a Hebrew Old Testament concordance compiled by Rabbi Mordecai Nathan (1448), Alexander Cruden's ''Complete Concordance to the Holy Scriptures'' (1737), and the manuscript ''Asaf ha-Mazkir'', an unfinished concordance to the Babylonian Talmud compiled by Moses Rigotz around the turn of the 19th century.<ref name="jewish-encyc-concordance">{{cite encyclopedia |title=Concordance |encyclopedia=Jewish Encyclopedia |year=1906 |url=https://jewishencyclopedia.com/articles/4584-concordance |access-date=2026-05-04}}</ref>

=== First computer concordance ===

The first concordance produced with computing assistance was the ''Index Thomisticus'', a comprehensive lexical index of the works of Thomas Aquinas and related authors, totalling approximately 10.6 million Latin words. The Italian Jesuit priest Roberto Busa conceived the project in 1946 and secured the sponsorship of IBM in 1949 after a meeting with chairman Thomas J. Watson.<ref name="histinfo-busa">{{cite web |url=https://www.historyofinformation.com/detail.php?entryid=2321 |title=Father Roberto Busa Conceives the Index Thomisticus |website=History of Information |access-date=2025-04-01}}</ref> From around 1950, keypunch operators in Gallarate, Italy, encoded the texts onto punched cards, using processing methods developed by IBM executive Paul Tasman. The full 56-volume printed edition was completed around 1980, followed by a CD-ROM edition in 1989 and a web-accessible version in 2005.

=== The KWIC format ===

The key word in context (KWIC) display was formalised as a computational technique by Hans Peter Luhn, a researcher at IBM, in a 1960 paper in ''American Documentation''.<ref name="luhn1960">{{cite journal |last=Luhn |first=H. P. |year=1960 |title=Key word-in-context index for technical literature (kwic index) |journal=American Documentation |volume=11 |issue=4 |pages=288–295 |doi=10.1002/asi.5090110403}}</ref> In KWIC output, each instance of the search term (the ''node word'') is centred on a line with a fixed window of words to each side; sorting the resulting lines alphabetically by the immediately adjacent word reveals collocational and phraseological patterns at a glance.<ref name="lancs-conc">{{cite web |url=http://corpora.lancs.ac.uk/clmtp/2-conc.php |title=Concordancing |work=Corpus Linguistics: Method, Theory and Practice |publisher=Lancaster University |access-date=2025-04-01}}</ref>
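
The KWIC procedure described above can be illustrated with a minimal sketch. The function names (<code>kwic</code>, <code>sort_by_r1</code>, <code>format_lines</code>), the simple regular-expression tokeniser, and the sample text are assumptions made for illustration; they do not reproduce Luhn's method or any particular concordancer.

```python
import re

def kwic(text, node, window=4):
    """Return (left, node, right) token lists for each occurrence of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits

def sort_by_r1(hits):
    """Sort concordance lines by the first word to the right of the node."""
    return sorted(hits, key=lambda h: h[2][0] if h[2] else "")

def format_lines(hits, width=30):
    """Right-align the left context so the node word falls in one column."""
    return [
        f"{' '.join(left):>{width}}  {node}  {' '.join(right)}"
        for left, node, right in hits
    ]

sample = ("The cat sat on the mat. A cat chased the dog. "
          "The dog saw a cat run away.")

for line in format_lines(sort_by_r1(kwic(sample, "cat"))):
    print(line)
```

With this sample, the three hits for ''cat'' are ordered by their right-hand neighbour (''chased'', ''run'', ''sat''), which is the kind of sort that brings recurring phrases together in a real concordance.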

=== COCOA ===

One of the first dedicated concordancing programs was '''COCOA''' (COunt and COncordance Generation on Atlas), created in 1965 by D. B. Russell at University College London and the Atlas Computer Laboratory in Harwell, Oxfordshire.<ref name="cocoa-chilton">{{cite web |url=https://www.chilton-computing.org.uk/acl/applications/cocoa/p001.htm |title=COCOA: Count and Concordance Generation on Atlas |website=Chilton Computing |access-date=2025-04-01}}</ref> Written in approximately 4,000 cards of FORTRAN, it processed text annotated with flat, non-hierarchical markup tags and could produce word counts and concordances in multiple languages. Within its first six months COCOA had been applied to texts in at least six languages. A second version designed for multiple mainframe platforms was distributed to British computing centres in the mid-1970s. Growing dissatisfaction with its interface and the eventual withdrawal of Atlas Laboratory support prompted British funding bodies to commission a successor program.

=== Oxford Concordance Program ===

The '''Oxford Concordance Program''' (OCP) was designed and written in FORTRAN by Susan Hockey and Ian Marriott at Oxford University Computing Services (OUCS) between 1979 and 1980 and first released in 1981.<ref name="ocp-hockey87">{{cite journal |last1=Hockey |first1=Susan |last2=Martin |first2=John |year=1987 |title=The Oxford Concordance Program Version 2 |journal=Literary and Linguistic Computing |volume=2 |issue=2 |pages=125–131 |doi=10.1093/llc/2.2.125}}</ref> Hockey and Marriott acknowledged that OCP owed much to COCOA and to the CLOC system at the University of Birmingham. OCP accepted COCOA-format markup to encode metadata such as author, act, scene, and line number, and was described by its authors as "a machine-independent text analysis program for producing word lists, indices and concordances in a variety of languages and alphabets." By the mid-1980s it had been licensed to approximately 240 institutions in 23 countries.<ref name="ocp-cti">{{cite web |url=https://users.ox.ac.uk/~ctitext2/resguide/resources/o125.html |title=Oxford Concordance Program |publisher=CTI Centre for Textual Studies, Oxford University |access-date=2025-04-01}}</ref> Version 2, a rewrite carried out in 1985–86, was documented in a 1987 article by Hockey and John Martin.<ref name="ocp-hockey87" /> A personal computer version, Micro-OCP, was developed for the IBM PC and sold by Oxford University Press from the late 1980s.

=== Personal computer era ===

The availability of affordable personal computers in the 1980s and 1990s enabled standalone concordancing applications that analysts could run locally without specialist computing facilities. '''MicroConcord''', developed by Mike Scott and Tim Johns and published by Oxford University Press in 1993 for MS-DOS, was among the first concordancers designed specifically for classroom language teaching.<ref name="timjohns">{{cite web |url=https://lexically.net/TimJohns/Kibbitzer/timconc.htm |title=Tim Johns: concordancing in the language classroom |website=lexically.net |access-date=2025-04-01}}</ref> '''WordSmith Tools''', also developed by Mike Scott, was first released in 1996 and became one of the most widely used corpus analysis suites in academic linguistics research.<ref name="ws-home">{{cite web |url=https://www.lexically.net/wordsmith/ |title=WordSmith Tools |publisher=Lexical Analysis Software |access-date=2025-04-01}}</ref> Other tools from this era include TACT (University of Toronto, 1989), a suite of MS-DOS freeware programs for literary text analysis, and MonoConc, a Windows concordancer created by Michael Barlow.

=== Web-based concordancers ===

From the late 1990s onwards, web-based concordancers hosted on remote servers gave researchers browser access to large preloaded corpora without requiring local storage or processing. The '''Sketch Engine''', developed by Adam Kilgarriff and Pavel Rychlý (Masaryk University), was launched commercially in July 2003 by Lexical Computing Limited and introduced ''word sketches'', automatically generated one-page profiles of a word's typical grammatical relations and collocations.<ref name="sketch-2004">{{cite conference |last1=Kilgarriff |first1=Adam |last2=Rychlý |first2=Pavel |last3=Smrž |first3=Pavel |last4=Tugwell |first4=David |year=2004 |title=The Sketch Engine |book-title=Proceedings of the 11th EURALEX International Congress |location=Lorient |pages=105–116 |url=https://www.euralex.org/elx_proceedings/Euralex2004/011_2004_V1_Adam%20KILGARRIFF,%20Pavel%20RYCHLY,%20Pavel%20SMRZ,%20David%20TUGWELL_The%20%20Sketch%20Engine.pdf}}</ref> Freely downloadable desktop tools appeared in the same period: '''AntConc''', created by Laurence Anthony at Waseda University, Tokyo, was first released in 2002 as freeware for Windows, macOS, and Linux.<ref name="antconc-2005">{{cite conference |last=Anthony |first=Laurence |year=2005 |title=AntConc: Design and development of a freeware corpus analysis toolkit for the technical writing classroom |book-title=Proceedings of the IEEE International Professional Communication Conference |pages=729–737 |doi=10.1109/IPCC.2005.1494244}}</ref>

== Features ==

Modern concordancers typically offer a range of analytical functions beyond basic KWIC display.<ref name="lancs-conc" /><ref name="weisser">{{cite web |url=https://martinweisser.org/corpora_site/concordancers.html |title=Concordancers: An Overview |first=Martin |last=Weisser |access-date=2025-04-01}}</ref> These commonly include:

* '''KWIC display''' with the node word centred and context words in aligned columns, sortable by the word one, two, or three positions to the left or right of the node (L1–L3 and R1–R3)
* '''Concordance plots''', visualising the distribution of hits as marks along a scaled bar representing each text in the corpus
* '''Frequency and word lists''', both alphabetical and ranked by frequency
* '''Collocation statistics''', identifying words that co-occur with the search term more often than chance, quantified by measures such as mutual information, the t-score, or log-likelihood
* '''Keyword analysis''', comparing word frequencies between a study corpus and a reference corpus to identify statistically distinctive items
* '''N-gram analysis''', finding frequently recurring word sequences of a specified length
* '''Part-of-speech tagging''' integration, allowing searches filtered to particular grammatical categories
* '''Unicode support''' for multilingual text
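
As an illustration of the collocation statistics listed above, the following sketch scores the words found within a fixed window of a node word. It assumes one common textbook formulation, in which the expected co-occurrence count is ''f''(node) × ''f''(collocate) × span / ''N'', mutual information is log<sub>2</sub>(''O''/''E''), and the t-score is (''O'' − ''E'')/√''O''. Individual tools may define these measures differently, and the function name <code>collocates</code> and the toy data are hypothetical.

```python
import math
from collections import Counter

def collocates(tokens, node, window=2):
    """Score words co-occurring with `node` within +/- `window` tokens.

    Returns {word: (observed, mutual_information, t_score)}.
    The expected-frequency formula is an assumption (see lead-in).
    """
    n = len(tokens)
    freq = Counter(tokens)
    observed = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        for j in range(lo, hi):
            if j != i:
                observed[tokens[j]] += 1

    span = 2 * window  # number of context slots around each hit
    scores = {}
    for word, o in observed.items():
        expected = freq[node] * freq[word] * span / n
        mi = math.log2(o / expected)
        t = (o - expected) / math.sqrt(o)
        scores[word] = (o, mi, t)
    return scores

toy = "strong tea and strong coffee but powerful computer and strong tea".split()
for word, (o, mi, t) in sorted(collocates(toy, "strong", window=1).items()):
    print(f"{word:<8} observed={o}  MI={mi:.2f}  t={t:.2f}")
```

On realistic corpora, mutual information tends to favour rare but exclusive collocates, while the t-score favours frequent ones, which is why concordancers often offer both.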

Bilingual and parallel concordancers additionally display aligned text in two or more languages side by side, enabling comparison of translation equivalents across language pairs.
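
A parallel concordance lookup can be sketched as a search over sentence-aligned pairs. The example below assumes the alignment has already been done and is stored as a list of (source, target) tuples; the function name <code>parallel_concordance</code> and the sample sentences are hypothetical and do not reflect any specific tool.

```python
import re

def parallel_concordance(pairs, query):
    """Return aligned (source, target) pairs whose source contains `query`
    as a whole word, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(query)}\b", re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]

# Toy sentence-aligned English-French corpus (illustrative data only).
aligned = [
    ("The house is small.", "La maison est petite."),
    ("I bought a new house.", "J'ai acheté une nouvelle maison."),
    ("The garden is large.", "Le jardin est grand."),
]

for src, tgt in parallel_concordance(aligned, "house"):
    print(f"{src:<28} | {tgt}")
```

Reading down the target column of the results lets the user see how the query word was rendered in each aligned sentence.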

== Notable concordancers ==

=== WordSmith Tools ===

Created by Mike Scott and first released in 1996, '''WordSmith Tools''' is a Windows corpus analysis suite that evolved from MicroConcord.<ref name="ws-home" /><ref name="ws-manual">{{cite web |url=https://lexically.net/downloads/version4/wordsmith.pdf |title=WordSmith Tools Version 4 Manual |last=Scott |first=Mike |publisher=Lexical Analysis Software |access-date=2025-04-01}}</ref> Its three core modules are ''Concord'' (KWIC concordances), ''WordList'' (frequency and alphabetical word lists), and ''Keywords'' (statistical keyword identification relative to a reference corpus). Oxford University Press used WordSmith Tools for dictionary preparation work. Version 4.0 is freely available; later versions are sold by Lexical Analysis Software Limited.
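
Keyword identification relative to a reference corpus can be sketched with the log-likelihood statistic, one measure commonly used for this purpose. The code below is a generic illustration, not a description of how WordSmith Tools computes keyness; the function names and the toy corpora are assumptions.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score.

    a, b: frequency of the word in the study and reference corpus
    c, d: total token counts of the study and reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study, reference):
    """Rank words that are relatively more frequent in `study`."""
    s, r = Counter(study), Counter(reference)
    c, d = len(study), len(reference)
    scored = [
        (word, log_likelihood(a, r[word], c, d))
        for word, a in s.items()
        if a / c > r[word] / d  # keep positive keywords only
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

study = ["corpus"] * 3 + ["the"] * 2
reference = ["the"] * 3 + ["a"] * 2 + ["corpus"]
for word, score in keywords(study, reference):
    print(f"{word:<8} LL={score:.2f}")
```

A word with identical relative frequency in both corpora scores zero; larger scores indicate a greater difference between the two corpora.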

=== AntConc ===

'''AntConc''' is a freeware, multiplatform concordancing toolkit created by Laurence Anthony, Professor of Applied Linguistics at Waseda University, Tokyo.<ref name="antconc-home">{{cite web |url=https://www.laurenceanthony.net/software/antconc/ |title=AntConc |first=Laurence |last=Anthony |publisher=Waseda University |access-date=2025-04-01}}</ref> First released in 2002 and formally described in a 2005 academic paper, it runs on Windows, macOS, and Linux. Its tools include a KWIC concordancer, a concordance plot for visualising distribution across texts, a collocates tool, a keyword list, and an n-gram analysis module. Because it is free and requires only plain text files, AntConc is widely used in linguistics courses and independent research worldwide.

=== Sketch Engine ===

The '''Sketch Engine''' is a corpus management and query system co-created by Adam Kilgarriff and Pavel Rychlý and launched in 2003 by Lexical Computing Limited.<ref name="sketch-2004" /><ref name="sketch-10y">{{cite journal |last1=Kilgarriff |first1=Adam |last2=Baisa |first2=Vít |last3=Bušta |first3=Jan |last4=Jakubíček |first4=Miloš |last5=Kovář |first5=Vojtěch |last6=Michelfeit |first6=Jan |last7=Rychlý |first7=Pavel |last8=Suchomel |first8=Vít |year=2014 |title=The Sketch Engine: Ten Years On |journal=Lexicography |volume=1 |issue=1 |pages=7–36 |doi=10.1007/s40607-014-0009-9}}</ref> It provides browser-based access to over 800 corpora in more than 100 languages. Beyond concordance searching, it offers word sketches, collocation analysis, distributional thesaurus construction, keyword and terminology extraction, and diachronic analysis. It is used by major publishers including Macmillan and Oxford University Press for lexicographic research. A subset tool, SKELL (Sketch Engine for Language Learning), is freely accessible to individual learners.

=== Wmatrix ===

'''Wmatrix''' is a web-based corpus processing environment developed by Paul Rayson at the University Centre for Computer Corpus Research on Language (UCREL), Lancaster University.<ref name="wmatrix">{{cite web |url=https://ucrel.lancs.ac.uk/wmatrix/ |title=Wmatrix: A web-based corpus processing environment |first=Paul |last=Rayson |publisher=UCREL, Lancaster University |access-date=2025-04-01}}</ref> Alongside concordances and frequency lists, Wmatrix integrates CLAWS part-of-speech tagging and the USAS semantic tagger, enabling keyword analysis simultaneously at the levels of individual words, grammatical categories, and semantic domains—an approach that extends standard keyword methods beyond simple lexical comparison.

=== ParaConc ===

'''ParaConc''', developed by Michael Barlow, is a Windows concordancer for parallel (multilingual) corpora that accepts up to four aligned texts in different languages.<ref name="paraconc">{{cite web |url=https://paraconc.com/ |title=ParaConc |first=Michael |last=Barlow |access-date=2025-04-01}}</ref> Designed for contrastive analysis, translation studies, and language learning research, it includes a "Hot words" feature that uses relative frequency data to suggest likely translation equivalents of a search word.

=== LancsBox ===

'''LancsBox''' is a free, cross-platform corpus analysis tool developed at Lancaster University under the direction of Vaclav Brezina.<ref name="lancsbox">{{cite web |url=http://corpora.lancs.ac.uk/lancsbox/index.php |title=LancsBox |publisher=Lancaster University |access-date=2025-04-01}}</ref> Released in 2015, it supports more than 15 languages and includes a KWIC concordancer, frequency analysis, and a GRAPH tool that renders collocations as an interactive network diagram. It integrates the TreeTagger for part-of-speech annotation and was designed to lower barriers to corpus analysis in teaching and research contexts.

== Applications ==

=== Corpus linguistics ===

Concordancers are the primary analytical instruments of corpus linguistics, providing systematic access to patterns of use across large samples of authentic text. Common research uses include studying collocations and phraseology, analysing semantic prosody, comparing language varieties, and tracking lexical and grammatical change over time. Large reference corpora such as the British National Corpus (approximately 100 million words) and the Corpus of Contemporary American English (over one billion words) are typically queried through dedicated web concordancers.

=== Lexicography ===

John Sinclair at the University of Birmingham pioneered the systematic use of concordance data in dictionary making through the COBUILD project, funded by Collins from the early 1980s. The project produced the ''Collins COBUILD English Language Dictionary'' (1987), generally considered the first major English dictionary compiled entirely from corpus evidence rather than invented illustrative examples.<ref name="cobuild">{{cite web |url=https://blog.collinsdictionary.com/the-history-of-cobuild/ |title=The History of COBUILD |publisher=Collins Dictionary |access-date=2025-04-01}}</ref> Concordance lines allowed lexicographers to observe authentic collocates, typical syntactic environments, and register distinctions that introspection-based methods had tended to miss. Corpus-driven methods have since become standard practice in commercial lexicography.

=== Computer-assisted translation ===

In computer-assisted translation (CAT) software, a concordance search allows translators to query a translation memory for all previously translated instances of a word or phrase in context, which helps keep terminology consistent across a document or project. Bilingual concordancers, which search sentence-aligned parallel corpora in two languages simultaneously, are also used to locate translation equivalents in existing translated texts. Web-based bilingual concordancers such as Linguee and Reverso Context extend this capability to large publicly accessible multilingual corpora.

=== Language teaching ===

Tim Johns at the University of Birmingham coined the term ''data-driven learning'' (DDL) around 1990 to describe a pedagogical approach in which language learners use concordancers to explore corpus evidence and discover grammatical and lexical patterns inductively, acting as "language detectives" rather than passive recipients of pre-stated rules. Johns and Mike Scott developed MicroConcord (1993) specifically for classroom use. DDL has since been studied extensively across second and foreign language teaching contexts and has been found to support learner autonomy and awareness of collocational patterns.

== References ==

Category:Corpus linguistics
Category:Translation software
Category:Natural language processing