# Unicode equivalence

> Mediated Wiki article. Canonical URL: https://mediated.wiki/source/Unicode_equivalence
> Markdown URL: https://mediated.wiki/source/Unicode_equivalence.md
> Source: https://en.wikipedia.org/wiki/Unicode_equivalence
> Source revision: 1348506736
> License: Creative Commons Attribution-ShareAlike 4.0 International (https://creativecommons.org/licenses/by-sa/4.0/)

Aspect of the Unicode standard

**Unicode equivalence** is the specification by the [Unicode](/source/Unicode) [character](/source/Character_(computing)) encoding standard that some sequences of [code points](/source/Code_point) represent essentially the same character. The feature was introduced in the standard to allow compatibility with pre-existing standard [character sets](/source/Character_set), which often included similar or identical characters.

Unicode provides two such notions, [canonical](/source/Canonical_form) equivalence and compatibility. [Code point](/source/Code_point) sequences that are defined as **canonically equivalent** are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E n [LATIN SMALL LETTER N](/source/N) followed by U+0303 ◌̃ [COMBINING TILDE](/source/Combining_character) is defined by Unicode to be canonically equivalent to the single code point U+00F1 ñ [LATIN SMALL LETTER N WITH TILDE](/source/%C3%91). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as [alphabetizing](/source/Alphabetical_order) names or [searching](/source/String_searching), and may be substituted for each other. Similarly, each [Hangul](/source/Hangul) syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo; a vowel conjoining jamo; and, if appropriate, a trailing conjoining jamo.
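As a minimal sketch of the canonical equivalence described above, the following uses Python's standard `unicodedata` module. The variable names are illustrative only, and the Hangul syllable U+D55C is an arbitrary choice that does not come from the article.

```python
import unicodedata

decomposed = "n\u0303"   # U+006E followed by U+0303 COMBINING TILDE
precomposed = "\u00f1"   # U+00F1 LATIN SMALL LETTER N WITH TILDE

# The raw code point sequences differ...
raw_equal = decomposed == precomposed
# ...but they agree once both are brought to the same normal form.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

# A Hangul syllable block decomposes into its conjoining jamo
# (example syllable chosen for illustration).
syllable = "\ud55c"  # U+D55C HANGUL SYLLABLE HAN
jamo = unicodedata.normalize("NFD", syllable)
jamo_code_points = [f"U+{ord(ch):04X}" for ch in jamo]

print(raw_equal, nfc_equal, nfd_equal)
print(jamo_code_points)
```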

Sequences that are defined as **compatible** are assumed to have possibly distinct appearances but the same meaning in some contexts. Thus, for example, U+FB00 ﬀ LATIN SMALL LIGATURE FF, a [typographic ligature](/source/Typographic_ligature), is defined to be compatible with, but not canonically equivalent to, the sequence U+0066 U+0066 (two Latin "f" letters). Compatible sequences may be treated the same way in some applications (such as [sorting](/source/Sorting) and [indexing](/source/Index_(database))) but not in others, and they may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
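The difference between the two notions can be observed with a short sketch, again assuming Python's `unicodedata` module: canonical normalization leaves the ligature alone, while compatibility normalization replaces it with two letters.

```python
import unicodedata

ligature = "\ufb00"  # U+FB00 LATIN SMALL LIGATURE FF

# Canonical normalization: the ligature has no canonical equivalent.
canonical = unicodedata.normalize("NFC", ligature)
# Compatibility normalization: the ligature maps to U+0066 U+0066.
compatibility = unicodedata.normalize("NFKC", ligature)

print(canonical == ligature)   # the ligature is unchanged
print(compatibility == "ff")   # two separate letters
```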

The standard also defines a [text normalization](/source/Text_normalization) procedure, called **Unicode normalization**, which replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the **normalization form** or **normal form** of the original text. For both of the equivalence notions, Unicode defines two normal forms, one **fully composed** (where multiple code points are replaced by single points whenever possible) and one **fully decomposed** (where single points are split into multiple ones).

## Sources of equivalence

### Character duplication

Main article: [Duplicate characters in Unicode](/source/Duplicate_characters_in_Unicode)

For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the letter "A with a [ring diacritic](/source/Ring_diacritic) above" is encoded as U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE (a letter of the [alphabet](/source/Alphabet) in [Swedish](/source/Swedish_language) and several other [languages](/source/Language)) or as U+212B Å ANGSTROM SIGN. However, the symbol for [angstrom](/source/Angstrom) is defined to be that Swedish letter, and most other symbols that are letters (such as ⟨V⟩ for [volt](/source/Volt)) do not have a separate code point for each usage. In general, the code points of truly identical characters are defined to be canonically equivalent.

### Combining and precomposed characters

For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters (such as U+00F1 for "ñ" or U+00C5 for "Å") or as combinations of two or more characters (such as U+FB00 for the ligature ﬀ or U+0132 for the [Dutch letter](/source/Dutch_alphabet) [Ĳ](/source/IJ_(digraph))).

For consistency with other standards and greater flexibility, Unicode also provides codes for many elements that are not used on their own but are meant instead to modify or combine with a preceding base character. Examples of those [combining characters](/source/Combining_character) are U+0303 ◌̃ [COMBINING TILDE](/source/%CC%83) and the [Japanese](/source/Japanese_script) diacritic [dakuten](/source/Dakuten) (U+3099 ◌゙ COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK).

In the context of Unicode, **character composition** is the process of replacing the code points of a base letter followed by one or more combining characters with a single [precomposed character](/source/Precomposed_character); and **character decomposition** is the opposite process.

In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order they may occur.

#### Example

Amélie with its two canonically equivalent Unicode forms (NFC and NFD):

| NFC character | A | m | é | | l | i | e |
|---|---|---|---|---|---|---|---|
| NFC code point | 0041 | 006d | 00e9 | | 006c | 0069 | 0065 |
| NFD code point | 0041 | 006d | 0065 | 0301 | 006c | 0069 | 0065 |
| NFD character | A | m | e | ◌́ | l | i | e |
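The code points in the Amélie example can be reproduced with a small sketch. `code_points` is a hypothetical helper written for this illustration, not part of any library.

```python
import unicodedata

def code_points(text: str) -> list[str]:
    """Return each code point of text as four lowercase hex digits."""
    return [f"{ord(ch):04x}" for ch in text]

word = "Am\u00e9lie"
nfc = unicodedata.normalize("NFC", word)  # composition: é stays one code point
nfd = unicodedata.normalize("NFD", word)  # decomposition: e + U+0301

print(code_points(nfc))
print(code_points(nfd))
```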

### Typographical non-interaction

Some scripts regularly use multiple combining marks that do not, in general, interact typographically, and do not have precomposed characters for the combinations. Pairs of such non-interacting marks can be stored in either order. Generally, the alternative sequences are canonically equivalent. The rules that define their sequencing in the canonical form also define whether they are considered to interact.
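A hedged illustration of non-interacting marks: a dot below (U+0323) and an acute accent (U+0301) attach to different sides of the base letter. The base letter "q" is chosen here only because no precomposed form exists for the combination.

```python
import unicodedata

dot_below_first = "q\u0323\u0301"  # q + COMBINING DOT BELOW + COMBINING ACUTE ACCENT
acute_first = "q\u0301\u0323"      # same marks, stored in the other order

# The two marks have different combining classes, so they do not interact.
classes = (unicodedata.combining("\u0323"), unicodedata.combining("\u0301"))

# Both orders normalize to the same sequence, so they are canonically equivalent.
same_nfd = (unicodedata.normalize("NFD", dot_below_first)
            == unicodedata.normalize("NFD", acute_first))

print(classes, same_nfd)
```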

### Typographic conventions

Unicode provides code points for some characters or groups of characters that are modified only for aesthetic reasons (such as [ligatures](/source/Typographic_ligature), the [half-width katakana](/source/Half-width_katakana) characters, or the [full-width](/source/Full-width) [Latin letters](/source/ISO_basic_Latin_alphabet) for use in Japanese texts) or to add new semantics without losing the original one (such as digits in [subscript](/source/Subscript) or [superscript](/source/Superscript) positions, or the circled digits (such as "①") inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent since the distinction has some semantic value and affects the rendering of the text.

## Encoding errors

[UTF-8](/source/UTF-8) and [UTF-16](/source/UTF-16) (and also some other Unicode encodings) do not allow all possible sequences of [code units](/source/Code_unit). Different software will convert invalid sequences into Unicode characters using varying rules, some of which are very lossy (such as by turning all invalid sequences into the same character). That can be considered a form of normalization and can lead to the same difficulties as others.
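As one example of such a lossy rule, Python's built-in decoder can replace every invalid byte with U+FFFD REPLACEMENT CHARACTER. The byte strings below are made up for illustration; other software may apply different rules.

```python
# Two different byte strings, each ending in a byte that is never valid in UTF-8.
bad_a = b"caf\xff"
bad_b = b"caf\xfe"

# errors="replace" substitutes U+FFFD for each invalid byte.
decoded_a = bad_a.decode("utf-8", errors="replace")
decoded_b = bad_b.decode("utf-8", errors="replace")

# The distinct inputs are no longer distinguishable after decoding.
print(decoded_a == decoded_b)
```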

## Normalization

Text-processing software that implements Unicode string search and comparison must take into account the presence of equivalent code point sequences. Without that feature, users searching for a particular code point sequence would be unable to find visually indistinguishable text that has a different but canonically equivalent code point representation.

### Algorithms

Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the [representative](/source/Representative_(mathematics)) element of an [equivalence class](/source/Equivalence_class), multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two equivalence criteria: the composed forms NFC and NFKC and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a **canonical ordering** on the code point sequence, which is necessary for the normal forms to be unique.

To compare or search Unicode strings, software can use either composed or decomposed forms; the choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance, some [typographic ligatures](/source/Typographic_ligature) like U+FB03 (ﬃ), [Roman numerals](/source/Roman_numerals) like U+2168 (Ⅸ) and even [subscripts and superscripts](/source/Unicode_subscripts_and_superscripts), e.g. U+2075 (⁵), have their own Unicode code points. Canonical normalization (NF) does not affect any of those, but compatibility normalization (NFK) decomposes the ffi ligature into its constituent letters, so a search for U+0066 (f) as a substring would succeed in an NFKC normalization of U+FB03 but not in an NFC normalization of U+FB03. The same is true when searching for the Latin letter I (U+0049) in the precomposed Roman numeral Ⅸ (U+2168). Similarly, the superscript ⁵ (U+2075) is transformed to 5 (U+0035) by compatibility mapping.
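The effect of the equivalence criterion on search can be sketched as follows. `contains` is a hypothetical helper for this illustration; real search software would typically do more (case folding, collation, and so on).

```python
import unicodedata

def contains(haystack: str, needle: str, form: str) -> bool:
    """Substring test after normalizing both strings to the same form."""
    return (unicodedata.normalize(form, needle)
            in unicodedata.normalize(form, haystack))

ffi = "\ufb03"    # U+FB03 LATIN SMALL LIGATURE FFI
nine = "\u2168"   # U+2168 ROMAN NUMERAL NINE
super5 = "\u2075" # U+2075 SUPERSCRIPT FIVE

print(contains(ffi, "f", "NFC"))    # canonical: the ligature is untouched
print(contains(ffi, "f", "NFKC"))   # compatibility: the ligature becomes "ffi"
print(contains(nine, "I", "NFKC"))  # the numeral becomes "IX"
print(unicodedata.normalize("NFKC", super5))
```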

Transforming superscripts into baseline equivalents may not be appropriate, however, for [rich text](/source/Rich_text) software because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains **compatibility formatting tags** that provide additional details on the compatibility transformation.[1] In the case of typographic ligatures, this tag is simply `<compat>`, while for the superscript it is `<super>`. Rich text standards like [HTML](/source/HTML) take into account the compatibility tags. For instance, HTML uses its own markup to position a U+0035 in a superscript position.[2]
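The formatting tags can be inspected through `unicodedata.decomposition`, which returns the raw decomposition mapping as a string. The helper `split_mapping` below is an assumption made for this sketch; it simply separates an optional leading tag from the mapped code points.

```python
import unicodedata

def split_mapping(ch: str) -> tuple[str, str]:
    """Return (tag, mapped_text); tag is "" for canonical or absent mappings."""
    fields = unicodedata.decomposition(ch).split()
    # A compatibility mapping starts with a tag in angle brackets.
    tag = fields.pop(0) if fields and fields[0].startswith("<") else ""
    mapped = "".join(chr(int(cp, 16)) for cp in fields)
    return tag, mapped

print(split_mapping("\ufb03"))  # ffi ligature
print(split_mapping("\u2075"))  # superscript five
print(split_mapping("\u00f1"))  # canonical mapping, so no tag
```

A rich text converter could, under this approach, use the tag to decide which markup to emit instead of discarding the distinction.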

### Normal forms

The four Unicode normalization forms and the algorithms (transformations) for obtaining them are listed in the table below.

| Form | Name | Description |
|---|---|---|
| NFD | Normalization Form Canonical Decomposition | Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order. |
| NFC | Normalization Form Canonical Composition | Characters are decomposed and then recomposed by canonical equivalence. |
| NFKD | Normalization Form Compatibility Decomposition | Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order. |
| NFKC | Normalization Form Compatibility Composition | Characters are decomposed by compatibility, then recomposed by canonical equivalence. |
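To make the four forms concrete, the sketch below applies each of them to one sample string. The sample (é followed by the ligature ﬁ) and the `code_points` helper are chosen for this illustration only.

```python
import unicodedata

def code_points(text: str) -> str:
    """Format text as space-separated uppercase hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

sample = "\u00e9\ufb01"  # é (U+00E9) followed by ﬁ (U+FB01)

forms = {
    form: code_points(unicodedata.normalize(form, sample))
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}

for form, cps in forms.items():
    print(f"{form:5} {cps}")
# NFD   0065 0301 FB01
# NFC   00E9 FB01
# NFKD  0065 0301 0066 0069
# NFKC  00E9 0066 0069
```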

All of those algorithms are [idempotent](/source/Idempotent) transformations, meaning that a string that is already in one of those normalized forms will not be modified if processed again by the same algorithm.
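Idempotence can be spot-checked on a sample string, as in this sketch. The sample text is arbitrary, and a check on one string illustrates the property rather than proving it.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")
sample = "Am\u00e9lie \ufb03 \u2075 \u212b"  # arbitrary mixed sample

def is_idempotent(form: str, text: str) -> bool:
    """True if normalizing an already normalized string changes nothing."""
    once = unicodedata.normalize(form, text)
    twice = unicodedata.normalize(form, once)
    return once == twice

results = {form: is_idempotent(form, sample) for form in FORMS}
print(results)
```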

The normal forms are not [closed](/source/Closure_(mathematics)) under string [concatenation](/source/Concatenation).[3] For defective Unicode strings starting with a Hangul vowel or trailing [conjoining jamo](/source/Hangul_Jamo_(Unicode_block)), concatenation can break composition.
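A small sketch of the concatenation problem: each piece below is already in NFC on its own, yet the joined string is not. `is_nfc` is a hypothetical helper, and the example strings are chosen for illustration.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text

# A base letter, and a string that starts with a combining mark.
left, right = "e", "\u0301"
# A leading Hangul jamo, and a string that starts with a vowel jamo.
lead, vowel = "\u1112", "\u1161"

print(is_nfc(left), is_nfc(right), is_nfc(left + right))
print(is_nfc(lead), is_nfc(vowel), is_nfc(lead + vowel))

# Re-normalizing after concatenation restores the normal form.
joined = unicodedata.normalize("NFC", lead + vowel)
print(f"U+{ord(joined):04X}")
```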

The normalization transformations are not [injective](/source/Injective_function) (they map different original glyphs and sequences to the same normalized sequence) and thus also not [bijective](/source/Bijection) (the original text cannot be restored). For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining [ring above](/source/Ring_above) "◌̊"), which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
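The angstrom example can be checked directly. In this sketch, two distinct inputs collapse to one output, so the original choice of code point cannot be recovered from the normalized text.

```python
import unicodedata

angstrom_sign = "\u212b"  # U+212B ANGSTROM SIGN
letter_a_ring = "\u00c5"  # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

nfd_sign = unicodedata.normalize("NFD", angstrom_sign)
nfd_letter = unicodedata.normalize("NFD", letter_a_ring)
nfc_sign = unicodedata.normalize("NFC", angstrom_sign)

print(angstrom_sign == letter_a_ring)  # distinct before normalization
print(nfd_sign == nfd_letter)          # identical after NFD
print(nfc_sign == letter_a_ring)       # NFC yields U+00C5, never U+212B
```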

A single character (other than a Hangul syllable block) that gets replaced by another under normalization can be identified in the Unicode tables by its non-empty decomposition mapping field that lacks a compatibility tag.

### Canonical ordering

The canonical ordering is concerned mainly with the ordering of a sequence of combining characters. For the examples in this section, the characters are assumed to be [diacritics](/source/Diacritic), but in general, some diacritics are not combining characters, and some combining characters are not diacritics.

Unicode assigns each character a **combining class**, which is identified by a numerical value. Non-combining characters have class number 0, and combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a [stable sorting](/source/Sorting_algorithm#Stability) algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically and so the two possible orders are *not* considered equivalent.
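The reordering step can be sketched in a few lines. `canonical_order` is a hypothetical helper that performs only the stable sort described above; it does not decompose or compose characters, so it is not a full normalization implementation.

```python
import unicodedata

def canonical_order(text: str) -> str:
    """Stably sort each run of non-zero combining class characters by class."""
    result: list[str] = []
    run: list[str] = []
    for ch in text:
        if unicodedata.combining(ch) != 0:
            run.append(ch)
        else:
            # sorted() is stable: marks of equal class keep their relative order.
            result.extend(sorted(run, key=unicodedata.combining))
            run.clear()
            result.append(ch)
    result.extend(sorted(run, key=unicodedata.combining))
    return "".join(result)

# Acute (class 230) stored before ogonek (class 202): the pair is swapped.
reordered = canonical_order("o\u0301\u0328")
# Acute and circumflex share class 230: their order is preserved.
unchanged = canonical_order("e\u0301\u0302")

print([f"{ord(c):04X}" for c in reordered])
print([f"{ord(c):04X}" for c in unchanged])
```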

For example, the character U+1EBF (ế), used in the [Vietnamese alphabet](/source/Vietnamese_alphabet), has both an acute and a circumflex accent. Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent). The combining classes for the two accents are both 230 and so U+1EBF is not equivalent to U+0065 U+0301 U+0302.

Since not all combining sequences have a precomposed equivalent (the last one in the previous example can only be reduced to U+00E9 U+0302), even the normal form NFC is affected by combining characters' behavior.
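The Vietnamese example can be verified with `unicodedata`. This sketch shows that swapping two marks of the same combining class produces a string that is not equivalent, and that NFC can only partially compose it.

```python
import unicodedata

precomposed = "\u1ebf"       # U+1EBF, e with circumflex and acute
swapped = "e\u0301\u0302"    # e + acute + circumflex (marks in the other order)

decomposition = unicodedata.normalize("NFD", precomposed)
classes = [unicodedata.combining(ch) for ch in decomposition[1:]]

nfc_swapped = unicodedata.normalize("NFC", swapped)

print([f"{ord(c):04X}" for c in decomposition])
print(classes)
print([f"{ord(c):04X}" for c in nfc_swapped])
print(nfc_swapped == precomposed)
```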

## Errors from normalization differences

When two applications share Unicode data but normalize them differently, errors and data loss may result. In one specific instance, [OS X](/source/OS_X) normalized Unicode filenames sent from the [Netatalk](/source/Netatalk) and [Samba](/source/Samba_(software)) file- and printer-sharing software. Netatalk and Samba did not recognize the altered filenames as equivalent to the original, which led to data loss.[4][5] Resolving such an issue is non-trivial, as normalization is not losslessly invertible.
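One defensive pattern is to compare names in a common normal form instead of comparing raw code points. The sketch below is illustrative only: `find_by_name` and the file names are hypothetical, and it does not describe how Netatalk, Samba or OS X actually handle the problem.

```python
import unicodedata

def find_by_name(names: list[str], wanted: str) -> list[str]:
    """Return stored names that are canonically equivalent to wanted."""
    key = unicodedata.normalize("NFC", wanted)
    return [n for n in names if unicodedata.normalize("NFC", n) == key]

# Name as one system might store it (decomposed)...
stored = ["re\u0301sume\u0301.txt"]
# ...and as another system might request it (precomposed).
requested = "r\u00e9sum\u00e9.txt"

naive_hit = requested in stored             # raw comparison misses the file
matches = find_by_name(stored, requested)   # normalized comparison finds it

print(naive_hit, matches == stored)
```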

## See also

- [Complex text layout](/source/Complex_text_layout) – Neighbour-dependent grapheme positioning

- [Diacritic](/source/Diacritic) – Modifier mark added to a letter

- [IDN homograph attack](/source/IDN_homograph_attack) – Visually similar letters in domain names

- [ISO/IEC 14651](/source/ISO%2FIEC_14651) – String comparison algorithm standard

- [Ligature (typography)](/source/Ligature_(typography)) – Glyph combining two or more letterforms

- [Precomposed character](/source/Precomposed_character) – Compound character with single codepoint

- [Representative glyph](/source/Representative_glyph) – Non-specific archetype to represent a grapheme

- [uconv](/source/Uconv) – Conversion utility software that can convert to and from the NFC and NFD Unicode normalization forms

- [Unicode](/source/Unicode) – Character encoding standard

- [Unicode compatibility characters](/source/Unicode_compatibility_characters) – Characters encoded solely to maintain round-trip convertibility with other standards

## Notes

1. ["UAX #44: Unicode Character Database"](https://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings). Unicode.org. Retrieved 20 November 2014.

2. ["Unicode in XML and other Markup Languages"](http://unicode.org/reports/tr20/tr20-2.html#Compatibility). Unicode.org. Retrieved 20 November 2014.

3. Per [What should be done about concatenation](http://www.unicode.org/faq/normalization.html#5)

4. ["netatalk / Bugs / #349 volcharset:UTF8 doesn't work from Mac"](https://sourceforge.net/tracker/?func=detail&aid=2727174&group_id=8642&atid=108642). *[SourceForge](/source/SourceForge)*. Retrieved 20 November 2014.

5. ["rsync, samba, UTF8, international characters, oh my!"](https://web.archive.org/web/20100109162824/http://forums.macosxhints.com/archive/index.php/t-99344.html). 2009. Archived from [the original](http://forums.macosxhints.com/archive/index.php/t-99344.html) on January 9, 2010.

## References

- [Unicode Standard Annex #15: Unicode Normalization Forms](http://unicode.org/reports/tr15/)

## External links

- [Unicode.org FAQ - Normalization](https://www.unicode.org/faq/normalization.html)

- [Charlint - a character normalization tool](http://www.w3.org/International/charlint/) written in Perl


---
Adapted from the Wikipedia article [Unicode equivalence](https://en.wikipedia.org/wiki/Unicode_equivalence) by Wikipedia contributors ([contributor history](https://en.wikipedia.org/wiki/Unicode_equivalence?action=history)). Available under [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/). Changes may have been made.