# Lexer hack

> Mediated Wiki article. Canonical URL: https://mediated.wiki/source/Lexer_hack
> Markdown URL: https://mediated.wiki/source/Lexer_hack.md
> Source: https://en.wikipedia.org/wiki/Lexer_hack
> Source revision: 1346365181
> License: Creative Commons Attribution-ShareAlike 4.0 International (https://creativecommons.org/licenses/by-sa/4.0/)

{{Short description|Technique in computer programming}}
In [computer programming](/source/computer_programming), the '''lexer hack''' is a solution to [parsing](/source/parsing) a [context-sensitive](/source/context-sensitive_grammar) grammar such as [C](/source/ANSI_C), where classifying a sequence of characters as a variable name or a type name requires contextual information, by feeding contextual information backwards from the parser to the lexer.

The lexer hack is frowned upon in modern compilers as it creates tight coupling between otherwise largely independent steps in the compilation process. Instead, identifier-like tokens are tokenized as identifiers and later disambiguated by the parser, allowing cleaner [separation of concerns](/source/separation_of_concerns).

== Problem ==
The fundamental problem is differentiating types from other identifiers. In the following example, the lexical class of <code>A</code> cannot be determined without further contextual information:
<syntaxhighlight lang="C">
A * B;
</syntaxhighlight>

This code could either be multiplication or a declaration, depending on context.

In more detail, in a [compiler](/source/compiler), the lexer performs one of the earliest stages of converting the [source code](/source/source_code) to a program. It scans the text to extract meaningful ''tokens'', such as words, numbers, and strings. The parser analyzes sequences of tokens attempting to match them to syntax rules representing language structures, such as loops and variable declarations. A problem occurs here if a single sequence of tokens can ambiguously match more than one syntax rule.

This ambiguity can happen in C if the lexer does not distinguish between variable and [typedef](/source/typedef) identifiers.<ref name="Roskind 91">{{cite web |url=http://www.cs.utah.edu/research/projects/mso/goofie/grammar5.txt |author=Roskind, James A. |date=1991-07-11 |title=A YACC-able C++ 2.1 GRAMMAR, AND THE RESULTING AMBIGUITIES |access-date=2008-11-27 |archive-url=https://web.archive.org/web/20070622120718/http://www.cs.utah.edu/research/projects/mso/goofie/grammar5.txt |archive-date=2007-06-22 |url-status=dead }}</ref> For example, in the C expression:

<syntaxhighlight lang="C">
A * B;
</syntaxhighlight>

the lexer may find these tokens:

# ?? 'A'
# operator '*'
# identifier 'B'
# punctuation ';'

Depending on whether <code>A</code> is a typedef name or not it may be desirable to tokenize <code>A</code> as either an ''identifier'' or a ''type'' so the parser does not have to handle an ambiguous parse. This grammatical ambiguity is known as the "typedef-name: identifier" problem, due to the name of the problematic [production rule](/source/production_(computer_science)).<ref>{{cite book | isbn=0131103628 | author=Brian W. Kernighan and Dennis M. Ritchie | title=The C Programming Language | edition=2nd | location=Englewood Cliffs/NJ | publisher=Prentice Hall | series=Prentice Hall Software Series | date=Apr 1988 |chapter=Appendix A |page=236}}</ref><ref name="Bendersky 07">{{cite web | url=http://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar/ | title=The context sensitivity of C's grammar | author=Bendersky, Eli | date=2007-11-24}}</ref>

== The hack solution ==

The solution generally consists of feeding information from the semantic [symbol table](/source/symbol_table) back into the lexer. That is, rather than functioning as a pure one-way [pipeline](/source/Pipeline_(software)) from the lexer to the parser, there is a backchannel from semantic analysis back to the lexer. This mixing of parsing and semantic analysis is generally regarded as inelegant, which is why it is called a "[hack](/source/Hack_(computer_science))".

Without added context, the lexer cannot distinguish type identifiers from other identifiers because all identifiers have the same format. With the hack in the above example, when the lexer finds the identifier ''A'' it should be able to classify the token as a type identifier. The rules of the language would be clarified by specifying that typecasts require a type identifier and the ambiguity disappears.

The problem also exists in [C++](/source/C%2B%2B) and parsers can use the same hack.<ref name="Roskind 91" /><!-- This needs clarification. C++ may be worse due to template names. -->

== Alternative solutions ==

This problem does not arise (and hence needs no "hack" in order to solve) when using [lexerless parsing](/source/lexerless_parsing) techniques, as these are intrinsically contextual. These are generally seen as less elegant designs, however, because they lack the modularity of having a [concurrent](/source/concurrent_computation) lexer and parser in a pipeline.{{Citation needed|date=April 2016}}

Some parser generators, such as the [byacc](/source/Berkeley_Yacc)-derived BtYacc ("Backtracking Yacc"), give the generated parser the ability to try multiple attempts to parse the tokens. In the problem described here, if an attempt fails because of semantic information about the identifier, it can backtrack and attempt other rules.<ref>{{cite web | url=http://www.siber.com/btyacc/ | title=BtYacc 3.0}} Based on Berkeley yacc with modifications by Chris Dodd and Vadim Maslov.</ref>

The [Clang](/source/Clang) parser handles the situation in a completely different way, namely by using a non-reference lexical grammar. Clang's lexer does not attempt to differentiate between type names and variable names: it simply reports the current token as an identifier. The parser then uses Clang's [semantic analysis](/source/Semantic_analysis_(compilers)) library to determine the nature of the identifier. This allows a simpler and more maintainable architecture than The Lexer Hack.<ref>{{cite web|last=Bendersky|first=Eli|title=How Clang handles the type / variable name ambiguity of C/C++|url=http://eli.thegreenplace.net/2012/07/05/how-clang-handles-the-type-variable-name-ambiguity-of-cc/}}</ref> This is also the approach used in most other modern languages, which do not distinguish different classes of identifiers in the lexical grammar, but instead defer them to the parsing or semantic analysis phase, when sufficient information is available.{{example needed|date=December 2018}}

== See also ==
* [Dangling else](/source/Dangling_else)<!-- other basic syntax problem -->
*[Most vexing parse](/source/Most_vexing_parse)

== References ==
{{Reflist}}

== Citations ==

* http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elkhound/index.html
* http://cs.nyu.edu/rgrimm/papers/pldi06.pdf
* http://cens.ioc.ee/local/man/CompaqCompilers/ladebug/ladebug-manual-details.html {{Webarchive|url=https://web.archive.org/web/20050106023644/http://cens.ioc.ee/local/man/CompaqCompilers/ladebug/ladebug-manual-details.html |date=2005-01-06 }}
* [https://doi.org/10.1007%2Fs00236-004-0137-z DOI.org]
* [https://archive.today/20130415002840/http://news.gmane.org/find-root.php?group=gmane.comp.lang.groovy.jsr&article=843&type=blog]
* http://groups.google.com/group/comp.compilers/browse_frm/thread/db7f68e9d8b49002/fa20bf5de9c73472?lnk=st&q=%2B%22the+lexer+hack%22&rnum=1&hl=en#fa20bf5de9c73472

{{DEFAULTSORT:Lexer hack, The}}
Category:C (programming language)
Category:C++
Category:Parsing
Category:Articles with example C code

---
Adapted from the Wikipedia article [Lexer hack](https://en.wikipedia.org/wiki/Lexer_hack) by Wikipedia contributors ([contributor history](https://en.wikipedia.org/wiki/Lexer_hack?action=history)). Available under [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/). Changes may have been made.