IEEE 754

{{Short description|IEEE standard for floating-point arithmetic}} {{Use dmy dates|date=May 2019|cs1-dates=y}} {{Floating-point}}

The '''IEEE Standard for Floating-Point Arithmetic''' ('''IEEE 754''') is a [[technical standard]] for [[floating-point arithmetic]] originally established in 1985 by the [[Institute of Electrical and Electronics Engineers]] (IEEE). The standard [[#Design rationale|addressed many problems]] found in the diverse floating-point implementations that made them difficult to use reliably and [[Software portability|portably]]. Many hardware [[floating-point unit]]s use the IEEE 754 standard.

The standard defines:
* ''arithmetic formats:'' sets of [[Binary code|binary]] and [[decimal]] floating-point data, which consist of finite numbers (including [[signed zero]]s and [[subnormal number]]s), [[infinity|infinities]], and special "not a number" values ([[NaN]]s)
* ''interchange formats:'' encodings (bit strings) that may be used to exchange floating-point data in an efficient and compact form
* ''rounding rules:'' properties to be satisfied when rounding numbers during arithmetic and conversions
* ''operations:'' arithmetic and other operations (such as [[trigonometric functions]]) on arithmetic formats
* ''exception handling:'' indications of exceptional conditions (such as [[division by zero]], overflow, etc.)

[[IEEE 754-2008 revision|IEEE 754-2008]], published in August 2008, includes nearly all of the original [[IEEE 754-1985]] standard, plus the [[IEEE 854-1987]] (Radix-Independent Floating-Point Arithmetic) standard. {{anchor|2019}}The current version, IEEE 754-2019, was published in July 2019.<ref>{{Harvnb|IEEE 754|2019}}</ref> It is a minor revision of the previous version, incorporating mainly clarifications, defect fixes and new recommended operations.

==History==
{| class="wikitable floatright" style="padding-left: 1.5em;"
|+Timeline of Floating-Point Arithmetic Standard
|-
! Year
! Official Standard
|-
| 1982 || IEC 559:1982
|-
| 1985 || IEEE 754-1985
|-
| 1987 || IEEE 854-1987
|-
| 1989 || IEC 559:1989
|-
| 2008 || IEEE 754-2008
|-
| 2011 || ISO/IEC/IEEE 60559:2011
|-
| 2019 || IEEE 754-2019
|-
| 2020 || ISO/IEC 60559:2020
|-
| 2029 || {{TBA}}
|}

The need for a floating-point standard arose from chaos in the business and scientific computing industry in the 1960s and 1970s. IBM used a [[IBM hexadecimal floating-point|hexadecimal floating-point format]] with seven bits always used for the exponent regardless of precision. [[Control Data Corporation|CDC]] and [[Cray]] computers used [[ones' complement]] representation, which admits a value of +0 and −0. CDC 60-bit computers did not have full 60-bit adders, so integer arithmetic was limited to 48 bits of precision from the floating-point unit. Exception processing from divide-by-zero was different on different computers. Moving data between systems and even repeating the same calculations on different systems was often difficult.

The first IEEE standard for floating-point arithmetic, [[IEEE 754-1985]], was published in 1985. It covered only binary floating-point arithmetic.

A new version, [[IEEE 754-2008 revision|IEEE 754-2008]], was published in August 2008, following a seven-year revision process, chaired by Dan Zuras and edited by [[Mike Cowlishaw]]. It replaced both IEEE 754-1985 (Binary Floating-Point Arithmetic) and [[IEEE 854-1987]] (Radix-Independent Floating-Point Arithmetic) standards. The binary formats in the original standard are included in this new standard along with three new basic formats, one binary and two decimal. To conform to the current standard, an implementation must implement at least one of the basic formats as both an arithmetic format and an interchange format.

The international standard '''ISO/IEC/IEEE 60559:2011''' (with content identical to IEEE 754-2008) has been approved for adoption through [[International Organization for Standardization|ISO]]/[[International Electrotechnical Commission|IEC]] [[ISO/IEC JTC 1|JTC 1]]/SC 25 under the ISO/IEEE PSDO Agreement<ref>{{cite web|last=Haasz|first=Jodi|url=http://grouper.ieee.org/groups/754/email/msg04167.html|title=FW: ISO/IEC/IEEE 60559 (IEEE Std 754-2008)|website=[[IEEE]]|access-date=4 April 2018|archive-url=https://web.archive.org/web/20171027190846/http://grouper.ieee.org/groups/754/email/msg04167.html|archive-date=2017-10-27}}</ref><ref>{{cite web |publisher=ISO|title=ISO/IEEE Partner Standards Development Organization (PSDO) Cooperation Agreement |url=https://grouper.ieee.org/groups/802/minutes/jul2008/opening_reports/psdo1.pdf |access-date=27 December 2021 |date=2007-12-19}}</ref> and published.{{sfn|ISO/IEC JTC 1/SC 25|2011}}

The current version, IEEE 754-2019 published in July 2019, is derived from and replaces IEEE 754-2008, following a revision process started in September 2015, chaired by David G. Hough and edited by Mike Cowlishaw. It incorporates mainly clarifications (e.g. ''totalOrder'') and defect fixes (e.g. ''minNum''), but also includes some new recommended operations (e.g. ''augmentedAddition'').<ref name=IEEE754-errata>{{cite web|url=https://speleotrove.com/misc/IEEE754-errata.html|title=IEEE 754-2008 errata|first=Mike|last=Cowlishaw|website=speleotrove.com|date=13 November 2013|access-date=24 January 2020}}</ref><ref>{{cite web|url=https://754r.ucbtest.org/|title=ANSI/IEEE Std 754-2019 |website=ucbtest.org|access-date=16 January 2024}}</ref>

The international standard '''ISO/IEC 60559:2020''' (with content identical to IEEE 754-2019) has been approved for adoption through ISO/IEC [[ISO/IEC JTC 1|JTC 1]]/SC 25 and published.{{sfn|ISO/IEC JTC 1/SC 25|2020}}

The next projected revision of the standard is in 2029.<ref>{{cite web |url=https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/future.txt |title=Issues for the next revision of 754 |website=[[IEEE]] |access-date=12 August 2024}}</ref>

== Formats ==
An IEEE 754 ''format'' is a "set of representations of numerical values and symbols". A format may also include how the set is encoded.<ref>{{Harvnb|IEEE 754|2008|loc=§2.1.27}}.</ref>

A floating-point format is specified by
* a base (also called ''radix'') ''b'', which is either 2 (binary) or 10 (decimal) in IEEE 754;
* a precision ''p'';
* an exponent range from ''emin'' to ''emax'', with ''emin'' = 1 − ''emax'', or equivalently ''emin'' = − (''emax'' − 1), for all IEEE 754 formats.

A format comprises
* Finite numbers, which can be described by three integers: ''s'' = a ''sign'' (zero or one), ''c'' = a ''[[significand]]'' (also called a ''coefficient'' or ''mantissa'') having no more than ''p'' digits when written in base ''b'' (i.e., an integer in the range 0 through ''b''<sup>''p''</sup> − 1), and ''q'' = an ''exponent'' such that ''emin'' ≤ ''q'' + ''p'' − 1 ≤ ''emax''. The numerical value of such a finite number is {{nowrap|(−1)<sup>''s''</sup> × ''c'' × ''b''<sup>''q''</sup>}}.{{efn|1=For example, if the base is 10, the sign is 1 (indicating negative), the significand is 12345, and the exponent is −3, then the value of the number is {{nowrap|(−1)<sup>1</sup> × 12345 × 10<sup>−3</sup>}} = {{nowrap|−1 × 12345 × 0.001}} = −12.345.}} Moreover, there are two zero values, called [[signed zero]]s: the sign bit specifies whether a zero is +0 (positive zero) or −0 (negative zero).
* Two infinities: +∞ and −∞.
* Two kinds of [[NaN]] (not-a-number): a quiet NaN (qNaN) and a signaling NaN (sNaN).

For example, if ''b'' = 10, ''p'' = 7, and ''emax'' = 96, then ''emin'' = −95, the significand satisfies 0 ≤ ''c'' ≤ {{val|9999999}}, and the exponent satisfies {{nowrap|−101 ≤ ''q'' ≤ 90}}. Consequently, the smallest non-zero positive number that can be represented is 1×10<sup>−101</sup>, and the largest is 9999999×10<sup>90</sup> (9.999999×10<sup>96</sup>), so the full range of numbers is −9.999999×10<sup>96</sup> through 9.999999×10<sup>96</sup>. The numbers −''b''<sup>1−''emax''</sup> and ''b''<sup>1−''emax''</sup> (here, −1×10<sup>−95</sup> and 1×10<sup>−95</sup>) are the smallest (in magnitude) ''normal numbers''; non-zero numbers between these smallest numbers are called [[subnormal number]]s.
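The value formula and the limits of the example format can be checked with exact rational arithmetic. The following is a minimal sketch; the helper name <code>finite_value</code> and the constant names are illustrative choices, not terms from the standard.

```python
from fractions import Fraction

def finite_value(s: int, c: int, q: int, b: int = 10) -> Fraction:
    """Exact value (-1)^s * c * b^q of a finite floating-point number."""
    return (-1) ** s * c * Fraction(b) ** q

# Parameters of the example format: b = 10, p = 7, emax = 96.
B, P, EMAX = 10, 7, 96
EMIN = 1 - EMAX              # emin = 1 - emax
Q_MIN = EMIN - (P - 1)       # smallest q allowed by emin <= q + p - 1
Q_MAX = EMAX - (P - 1)       # largest q allowed by q + p - 1 <= emax

example = finite_value(1, 12345, -3)            # the footnote's -12.345
smallest_positive = finite_value(0, 1, Q_MIN)   # smallest subnormal
largest = finite_value(0, B**P - 1, Q_MAX)      # largest finite number
smallest_normal = Fraction(B) ** EMIN           # b^(1 - emax)

print(EMIN, Q_MIN, Q_MAX)
print(float(example))
```

Using <code>Fraction</code> keeps every quantity exact, so the comparison is not affected by the rounding of the host's binary floating-point type.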

=== Representation and encoding in memory ===
Some numbers may have several possible floating-point representations. For instance, if ''b'' = 10, and ''p'' = 7, then −12.345 can be represented by −12345×10<sup>−3</sup>, −123450×10<sup>−4</sup>, and −1234500×10<sup>−5</sup>. However, for most operations, such as arithmetic operations, the result (value) does not depend on the representation of the inputs.

For the decimal formats, any representation is valid, and the set of these representations is called a ''cohort''. When a result can have several representations, the standard specifies which member of the cohort is chosen.

For the binary formats, the representation is made unique by choosing the smallest representable exponent allowing the value to be represented exactly. Further, the exponent is not represented directly, but a [[Exponent bias|bias]] is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers. For numbers with an exponent in the normal range (the exponent field being neither all ones nor all zeros), the leading bit of the significand will always be 1. Consequently, a leading 1 can be implied rather than explicitly present in the memory encoding, and under the standard the explicitly represented part of the significand will lie between 0 and 1. This rule is called ''leading bit convention'', ''implicit bit convention'', or ''hidden bit convention''. This rule allows the binary format to have an extra bit of precision. The leading bit convention cannot be used for the subnormal numbers as they have an exponent outside the normal exponent range and scale by the smallest represented exponent as used for the smallest normal numbers.
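As an illustration of the biased exponent and the implicit leading bit, the sketch below splits a binary32 encoding into its fields and rebuilds the exact value. It assumes the binary32 layout of 1 sign bit, 8 exponent bits with bias 127, and 23 trailing significand bits; the function names are hypothetical.

```python
import struct
from fractions import Fraction

BIAS = 127           # binary32 exponent bias
TRAILING_BITS = 23   # explicitly stored significand bits

def decode_binary32(x: float):
    """Return (sign, biased exponent, trailing significand) of x as binary32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    biased_exp = (bits >> TRAILING_BITS) & 0xFF
    trailing = bits & ((1 << TRAILING_BITS) - 1)
    return sign, biased_exp, trailing

def value_from_fields(sign: int, biased_exp: int, trailing: int) -> Fraction:
    """Exact value of a finite encoding (biased_exp must not be 255)."""
    fraction = Fraction(trailing, 1 << TRAILING_BITS)
    if biased_exp == 0:
        # Zero or subnormal: no implicit bit, scaled by the smallest exponent.
        significand, exponent = fraction, 1 - BIAS
    else:
        # Normal number: the leading 1 is implied, not stored.
        significand, exponent = 1 + fraction, biased_exp - BIAS
    return (-1) ** sign * significand * Fraction(2) ** exponent

fields = decode_binary32(0.15625)      # 0.15625 = 1.25 * 2**-3
print(fields, value_from_fields(*fields))
```

The subnormal branch uses the same exponent as the smallest normal numbers, which is why the leading-bit convention cannot apply there.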

Due to the possibility of multiple encodings (at least in formats called ''interchange formats''), a NaN may carry other information: a sign bit (which has no meaning, but may be used by some operations) and a ''payload'', which is intended for diagnostic information indicating the source of the NaN (but the payload may have other uses, such as ''NaN-boxing''<ref>{{Cite web |url=https://udn.realityripple.com/docs/Mozilla/Projects/SpiderMonkey/Internals |title=SpiderMonkey Internals |website=udn.realityripple.com |access-date=11 March 2018}}</ref><ref>{{Cite book |last1=Klemens |first1=Ben |title=21st Century C: C Tips from the New School |date=September 2014 |publisher=O'Reilly Media, Incorporated |page=160 |url=https://books.google.com/books?id=ASuiBAAAQBAJ |access-date=11 March 2018|isbn=9781491904442 }}</ref><ref>{{Cite web |url=https://github.com/zuiderkwast/nanbox |title=zuiderkwast/nanbox: NaN-boxing in C |website=[[GitHub]] |access-date=11 March 2018}}</ref>).

=== Basic and interchange formats ===

The standard defines five basic formats that are named for their numeric base and the number of bits used in their interchange encoding. There are three binary floating-point basic formats (encoded with 32, 64 or 128 bits) and two decimal floating-point basic formats (encoded with 64 or 128 bits). The [[binary32]] and [[binary64]] formats are the ''single'' and ''double'' formats of [[IEEE 754-1985]] respectively. A conforming implementation must fully implement at least one of the basic formats.

The standard also defines ''[[#Interchange formats|interchange formats]]'', which generalize these basic formats.<ref>{{Harvnb|IEEE 754|2008|loc=§3.6}}.</ref> For the binary formats, the leading bit convention is required. The following table summarizes some of the possible interchange formats (including the basic formats).
<div class="noresize">
{{Table alignment}}
{|class="wikitable defaultright col1left col2left col3center col12left"
! scope=col colspan=3 |
! scope=col colspan=2 | Significand
! scope=col colspan=2 | Exponent
! scope=col colspan=4 | Properties{{efn|Approximative values. For exact values see each format's individual Wikipedia entry}}
! scope=col colspan=1 |
|-
! scope=col | Name
! scope=col | Common name
! scope=col {{verth|Radix}}
! scope=col {{verth|Digits{{efn|Number of digits in the radix used, including any implicit digit, but not counting the sign bit.}}}}
! scope=col style=max-width:8em {{verth|Decimal digits{{efn|Corresponding number of decimal digits, see text for more details.}}}}
! scope=col | Min
! scope=col | Max
! scope=col | ''MAXVAL''
! scope=col style=min-width:8em | log<sub>10</sub> ''MAXVAL''
! scope=col style=min-width:6em | ''MINVAL''>0 (normal)
! scope=col style=min-width:6em | ''MINVAL''>0 (subnormal)
! scope=col | Notes
|-
| [[Half-precision floating-point format|binary16]] || Half precision || 2 || 11 || 3.31 || −14 || 15 || 65504 || 4.816 || 6.10{{x10^|−5}} || 5.96{{x10^|−8}} || Interchange
|-
| [[Single-precision floating-point format|binary32]] || Single precision || 2 || 24 || 7.22 || −126 || 127 || 3.40{{x10^|38}} || 38.532 || 1.18{{x10^|−38}} || 1.40{{x10^|−45}} || Basic
|-
| [[Double-precision floating-point format|binary64]] || Double precision || 2 || 53 || 15.95 || −1022 || 1023 || 1.80{{x10^|308}} || 308.255 || 2.23{{x10^|−308}} || 4.94{{x10^|−324}} || Basic
|-
| [[Quadruple-precision floating-point format|binary128]] || Quadruple precision || 2 || 113 || 34.02 || −16382 || 16383 || 1.19{{x10^|4932}} || 4932.075 || 3.36{{x10^|−4932}} || 6.48{{x10^|−4966}} || Basic
|-
| [[Octuple-precision floating-point format|binary256]] || Octuple precision || 2 || 237 || 71.34 || −262142 || 262143 || 1.61{{x10^|78913}} || 78913.207 || 2.48{{x10^|−78913}} || 2.25{{x10^|−78984}} || Interchange
|-
| [[Decimal32 floating-point format|decimal32]] || || 10 || 7 || 7 || −95 || 96 || 1.0{{x10^|97}} || 97 − 4.34{{x10^|−8}} || 1{{x10^|−95}} || 1{{x10^|−101}} || Interchange
|-
| [[Decimal64 floating-point format|decimal64]] || || 10 || 16 || 16 || −383 || 384 || 1.0{{x10^|385}} || 385 − 4.34{{x10^|−17}} || 1{{x10^|−383}} || 1{{x10^|−398}} || Basic
|-
| [[Decimal128 floating-point format|decimal128]] || || 10 || 34 || 34 || −6143 || 6144 || 1.0{{x10^|6145}} || {{nowrap|6145 − 4.34{{x10^|−35}}}} || 1{{x10^|−6143}} || 1{{x10^|−6176}} || Basic
|}
</div>

In the table above, integer values are exact, whereas values in decimal notation (e.g. 1.0) are rounded values. The minimum exponents listed are for normal numbers; the special [[subnormal number]] representation allows even smaller (in magnitude) numbers to be represented with some loss of precision. For example, the smallest positive number that can be represented in binary64 is 2<sup>−1074</sup>; contributions to the −1074 figure include the ''emin'' value −1022 and all but one of the 53 significand bits (2<sup>−1022 − (53 − 1)</sup> = 2<sup>−1074</sup>).
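The binary64 figures can be reproduced on a platform whose Python <code>float</code> is binary64, which is an assumption of this sketch although it holds for mainstream builds. Note that Python's <code>sys.float_info.min_exp</code> follows the C convention and is one greater than ''emin''.

```python
import math
import sys

P = sys.float_info.mant_dig         # precision p in bits
EMIN = sys.float_info.min_exp - 1   # C's DBL_MIN_EXP is emin + 1

smallest_normal = math.ldexp(1.0, EMIN)                # 2**emin
smallest_subnormal = math.ldexp(1.0, EMIN - (P - 1))   # 2**(emin - (p - 1))

print(P, EMIN)
print(smallest_normal, smallest_subnormal)
# Halving the smallest subnormal is a tie between 0 and itself; ties-to-even gives 0.
print(smallest_subnormal / 2)
```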

''Decimal digits'' is the precision of the format expressed in terms of an equivalent number of decimal digits. It is computed as ''digits'' × log<sub>10</sub> ''base''. For example, binary128 has approximately the same precision as a 34-digit decimal number.
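The "Decimal digits" column can be recomputed from the "Digits" column with this formula. A short sketch, in which the function name <code>decimal_digits</code> is an illustrative choice:

```python
import math

def decimal_digits(digits: int, base: int) -> float:
    """Equivalent number of decimal digits: digits * log10(base)."""
    return digits * math.log10(base)

BINARY_PRECISIONS = {
    "binary16": 11,
    "binary32": 24,
    "binary64": 53,
    "binary128": 113,
    "binary256": 237,
}

for name, p in BINARY_PRECISIONS.items():
    print(f"{name}: {decimal_digits(p, 2):.2f}")
```

For the decimal formats the base is 10, so the result is simply the number of digits.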

log<sub>10</sub> ''MAXVAL'' is a measure of the range of the encoding. Its integer part is the largest exponent shown on the output of a value in scientific notation with one leading digit in the significand before the decimal point (e.g. 3.403{{x10^|38}} is approximately the largest value in binary32, 9.999999{{x10^|96}} is the largest value in decimal32).

The binary32 (single) and binary64 (double) formats are two of the most common formats used today. The figure below shows the absolute precision for both formats over a range of values. This figure can be used to select an appropriate format given the expected value of a number and the required precision.

[[File:IEEE754.svg|thumb|none|550px|Precision of binary32 and binary64 in the range 10<sup>−12</sup> to 10<sup>12</sup>]]

An example of a layout for [[Single-precision floating-point format|32-bit floating point]] is [[File:Float example.svg|none]] and the [[Double-precision floating-point format|64-bit layout]] is similar.

=== Extended and extendable precision formats ===
The standard specifies optional [[Extended precision|extended]] and extendable precision formats, which provide greater precision than the basic formats.<ref>{{Harvnb|IEEE 754|2008|loc=§3.7}}.</ref> An extended precision format extends a basic format by using more precision and more exponent range. An extendable precision format allows the user to specify the precision and exponent range. An implementation may use whatever internal representation it chooses for such formats; all that needs to be defined are its parameters (''b'', ''p'', and ''emax''). These parameters uniquely describe the set of finite numbers (combinations of sign, significand, and exponent for the given radix) that it can represent.

The standard recommends that language standards provide a method of specifying ''p'' and ''emax'' for each supported base ''b''.<ref>{{Harvnb|IEEE 754|2008|loc=§3.7}} states: "Language standards should define mechanisms supporting extendable precision for each supported radix."</ref> The standard recommends that language standards and implementations support an extended format which has a greater precision than the largest basic format supported for each radix ''b''.<ref>{{Harvnb|IEEE 754|2008|loc=§3.7}} states: "Language standards or implementations should support an extended precision format that extends the widest basic format that is supported in that radix."</ref> For an extended format with a precision between two basic formats, the exponent range must be as great as that of the next wider basic format. For instance, a binary extended format with 64 bits of precision must have an ''emax'' of at least 16383. The [[x87]] [[Extended precision#x86 extended-precision format|80-bit extended format]] meets this requirement.

The original [[IEEE 754-1985]] standard also had the concept of ''extended formats'', but without any mandatory relation between ''emin'' and ''emax''. For example, the [[Motorola 68881]] 80-bit format,<ref>{{cite book |title=Motorola MC68000 Family |series=Programmer's Reference Manual |year=1992 |publisher=NXP Semiconductors |pages=1-16,1-18,1-23 |url=https://www.nxp.com/docs/en/reference-manual/M68000PRM.pdf}}</ref> where ''emin'' = − ''emax'', was a conforming extended format, but it became non-conforming in the 2008 revision.

=== Interchange formats ===
Interchange formats are intended for the exchange of floating-point data using a bit string of fixed length for a given format.

====Binary====
For the exchange of binary floating-point numbers, interchange formats of length 16 bits, 32 bits, 64 bits, and any multiple of 32 bits ≥ 128{{efn|Contrary to decimal, there is no binary interchange format of 96-bit length. Such a format is still allowed as a non-interchange format, though.}} are defined. The 16-bit format is intended for the exchange or storage of small numbers (e.g., for graphics).

The encoding scheme for these binary interchange formats is the same as that of IEEE 754-1985: a sign bit, followed by ''w'' exponent bits that describe the exponent offset by a ''[[Exponent bias|bias]]'', and ''p'' − 1 bits that describe the significand. The width of the exponent field for a ''k''-bit format is computed as ''w'' = round(4 log<sub>2</sub>(''k'')) − 13. The existing 64- and 128-bit formats follow this rule, but the 16- and 32-bit formats have more exponent bits (5 and 8 respectively) than this formula would provide (3 and 7 respectively).
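The width formula is easy to evaluate for the formats mentioned here. In the sketch below, the function names are illustrative, and the precision is derived from the layout described above (one sign bit, ''w'' exponent bits, ''p'' − 1 significand bits, ''k'' bits in total).

```python
import math

def exponent_width(k: int) -> int:
    """Exponent field width w = round(4 * log2(k)) - 13 for a k-bit format."""
    return round(4 * math.log2(k)) - 13

def precision(k: int) -> int:
    """Precision p, from k = 1 (sign) + w + (p - 1)."""
    return k - exponent_width(k)

for k in (16, 32, 64, 128, 256):
    print(k, exponent_width(k), precision(k))
```

For ''k'' = 16 and ''k'' = 32 the formula yields 3 and 7, whereas the actual formats use 5 and 8 exponent bits, so <code>precision</code> is only meaningful for 64 bits and for multiples of 32 bits from 128 upward.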

As with IEEE 754-1985, the biased-exponent field is filled with all 1 bits to indicate either infinity (trailing significand field = 0) or a NaN (trailing significand field ≠ 0). For NaNs, quiet NaNs and signaling NaNs are distinguished by using the most significant bit of the trailing significand field exclusively,{{efn|The standard recommends 0 for signaling NaNs, 1 for quiet NaNs, so that a signaling NaN can be quieted by changing only this bit to 1, while the reverse could yield the encoding of an infinity.}} and the payload is carried in the remaining bits.
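These rules can be expressed as a small classifier over raw binary32 bit patterns. This is a sketch that assumes the recommended convention (most significant trailing bit set for quiet NaNs); the function name <code>classify_binary32</code> is hypothetical, and it works on integers only so that no NaN has to pass through the host's floating-point unit.

```python
EXP_MASK = 0xFF          # 8 exponent bits of binary32
TRAILING_MASK = 0x7FFFFF # 23 trailing significand bits
QUIET_BIT = 0x400000     # most significant trailing significand bit

def classify_binary32(bits: int) -> str:
    """Classify a 32-bit pattern as finite, infinity, quiet NaN or signaling NaN."""
    biased_exp = (bits >> 23) & EXP_MASK
    trailing = bits & TRAILING_MASK
    if biased_exp != EXP_MASK:
        return "finite"
    if trailing == 0:
        return "infinity"
    return "quiet NaN" if trailing & QUIET_BIT else "signaling NaN"

for pattern in (0x3F800000, 0x7F800000, 0x7FC00000, 0x7F800001):
    print(f"{pattern:#010x}: {classify_binary32(pattern)}")
```

Clearing the quiet bit of <code>0x7FC00000</code> gives <code>0x7F800000</code>, the encoding of +∞, which illustrates why the recommended convention makes quieting safe but not the reverse.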

====Decimal====
For the exchange of decimal floating-point numbers, interchange formats of any multiple of 32 bits are defined. As with binary interchange, the encoding scheme for the decimal interchange formats encodes the sign, exponent, and significand. Two different bit-level encodings are defined, and interchange is complicated by the fact that some external indicator of the encoding in use may be required.

The two options allow the significand to be encoded as a compressed sequence of decimal digits using [[densely packed decimal]] or, alternatively, as a [[Binary integer decimal|binary integer]]. The former is more convenient for direct hardware implementation of the standard, while the latter is more suited to software emulation on a binary computer. In either case, the set of numbers (combinations of sign, significand, and exponent) that may be encoded is identical, and [[Floating point#Special values|special values]] (±zero with the minimum exponent, ±infinity, quiet NaNs, and signaling NaNs) have identical encodings.

== Rounding rules ==
The standard defines five rounding rules. The first two rules round to a nearest value; the others are called ''[[directed rounding]]s''.

=== Roundings to nearest ===
* '''[[Rounding#Rounding half to even|Round to nearest, ties to even]]''' – rounds to the nearest value; if the number falls midway, it is rounded to the nearest value with an even least significant digit.
* '''[[Rounding#Rounding half away from zero|Round to nearest, ties away from zero]]''' (or '''ties to away''') – rounds to the nearest value; if the number falls midway, it is rounded to the nearest value above (for positive numbers) or below (for negative numbers).

At the extremes, a value with a magnitude strictly less than <math>k=b^{\text{emax}}\left(b-\tfrac{1}{2}b^{1-p}\right)</math> will be rounded to the minimum or maximum finite number (depending on the value's sign). Any numbers with exactly this magnitude are considered ties; this choice of tie may be conceptualized as the midpoint between <math>\pm b^{\text{emax}}(b-b^{1-p})</math> and <math>\pm b^{\text{emax}+1}</math>, which, were the exponent not limited, would be the next representable floating-point numbers larger in magnitude. Numbers with a magnitude strictly larger than {{mvar|k}} are rounded to the corresponding infinity.<ref>{{Harvnb|IEEE 754|2008|loc=§4.3.1|ps=. "In the following two rounding-direction attributes, an infinitely precise result with magnitude at least <math>b^{\text{emax}}(b-\tfrac{1}{2}b^{1-p})</math> shall round to <math>\infty</math> with no change in sign."}}</ref>

"Round to nearest, ties to even" is the default for binary floating point and the recommended default for decimal. "Round to nearest, ties to away" is only required for decimal implementations.<ref>{{Harvnb|IEEE 754|2008|loc=§4.3.3}}</ref>

=== Directed roundings ===
* '''Round toward 0''' – directed rounding towards zero (also known as ''truncation'').
* '''Round toward +∞''' – directed rounding towards positive infinity (also known as ''rounding up'' or ''ceiling'').
* '''Round toward −∞''' – directed rounding towards negative infinity (also known as ''rounding down'' or ''floor'').

{| class="wikitable"
|+ Example of rounding to integers using the IEEE 754 rules
!rowspan=2| Mode
!colspan=4| Example value
|-
! +11.5
! +12.5
! −11.5
! −12.5
|-
| to nearest, ties to even || +12.0 || +12.0 || −12.0 || −12.0
|-
| to nearest, ties away from zero || +12.0 || +13.0 || −12.0 || −13.0
|-
| toward 0 || +11.0 || +12.0 || −11.0 || −12.0
|-
| toward +∞ || +12.0 || +13.0 || −11.0 || −12.0
|-
| toward −∞ || +11.0 || +12.0 || −12.0 || −13.0
|}
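The rows of the rounding table can be reproduced with Python's <code>decimal</code> module, whose rounding constants correspond to the five rules. The mapping in <code>MODES</code> and the helper <code>round_to_integer</code> are illustrative choices for this sketch; the module implements the General Decimal Arithmetic Specification and is used here only as a convenient way to select a rounding direction.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN, ROUND_HALF_UP)

MODES = {
    "to nearest, ties to even": ROUND_HALF_EVEN,
    "to nearest, ties away from zero": ROUND_HALF_UP,
    "toward 0": ROUND_DOWN,
    "toward +inf": ROUND_CEILING,
    "toward -inf": ROUND_FLOOR,
}

def round_to_integer(value: str, mode: str) -> Decimal:
    """Round a decimal string to an integer using the named rounding rule."""
    return Decimal(value).quantize(Decimal("1"), rounding=MODES[mode])

SAMPLES = ("11.5", "12.5", "-11.5", "-12.5")
table = {mode: [int(round_to_integer(v, mode)) for v in SAMPLES] for mode in MODES}

for mode, row in table.items():
    print(f"{mode:32} {row}")
```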

Unless specified otherwise, the floating-point result of an operation is determined by applying the rounding function on the infinitely precise (mathematical) result. Such an operation is said to be ''correctly rounded''. This requirement is called ''correct rounding''.<ref>{{Harvnb|IEEE 754|2019|loc=§2.1}}</ref>

== Required operations ==

Required operations for a supported arithmetic format (including the basic formats) include:

* Conversions to and from integer<ref name="IEEE 754 2008 loc=§5.3.1">{{Harvnb|IEEE 754|2008|loc=§5.3.1}}</ref><ref name="IEEE 754 2008 loc=§5.4.1">{{Harvnb|IEEE 754|2008|loc=§5.4.1}}</ref>
* Previous and next consecutive values<ref name="IEEE 754 2008 loc=§5.3.1"/>
* Arithmetic operations (add, subtract, multiply, divide, square root, [[Multiply–accumulate operation|fused multiply–add]], remainder, minimum, maximum)<ref name="IEEE 754 2008 loc=§5.3.1"/><ref name="IEEE 754 2008 loc=§5.4.1"/>
* Conversions (between formats, to and from strings, etc.)<ref>{{Harvnb|IEEE 754|2008|loc=§5.4.2}}</ref><ref>{{Harvnb|IEEE 754|2008|loc=§5.4.3}}</ref>
* Scaling and (for decimal) quantizing<ref>{{Harvnb|IEEE 754|2008|loc=§5.3.2}}</ref><ref>{{Harvnb|IEEE 754|2008|loc=§5.3.3}}</ref>
* Copying and manipulating the sign (abs, negate, etc.)<ref>{{Harvnb|IEEE 754|2008|loc=§5.5.1}}</ref>
* Comparisons and total ordering<ref name=total-ordering>{{Harvnb|IEEE 754|2008|loc=§5.10}}</ref><ref>{{Harvnb|IEEE 754|2008|loc=§5.11}}</ref>
* Classification of numbers (subnormal, finite, etc.) and testing for NaNs<ref>{{Harvnb|IEEE 754|2008|loc=§5.7.2}}</ref>
* Testing and setting status flags<ref>{{Harvnb|IEEE 754|2008|loc=§5.7.4}}</ref>

=== Comparison predicates ===

The standard provides comparison predicates to compare one floating-point datum to another in the supported arithmetic format.<ref>{{Harvnb|IEEE 754|2019|loc=§5.11}}</ref> Any comparison with a NaN is treated as unordered. −0 and +0 compare as equal.
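Both rules are visible in languages whose floating-point comparisons follow IEEE 754. A brief sketch in Python, assuming its <code>float</code> comparisons map directly to the hardware's IEEE 754 predicates, as they do on mainstream platforms:

```python
import math

nan = math.nan

# Every ordered comparison involving a NaN is false, including equality with itself.
ordered_results = [nan < 1.0, nan > 1.0, nan <= 1.0, nan >= 1.0, nan == nan]
# "Not equal" is the one predicate that is true for unordered operands.
nan_differs_from_itself = nan != nan

# The two zeros compare equal even though their sign bits differ.
zeros_equal = (0.0 == -0.0)
signs = (math.copysign(1.0, 0.0), math.copysign(1.0, -0.0))

print(ordered_results, nan_differs_from_itself)
print(zeros_equal, signs)
```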

=== Total-ordering predicate ===

The standard provides a predicate ''totalOrder'', which defines a [[total order]]ing on canonical members of the supported arithmetic format.<ref name="IEEE 754 2019 loc=§5.10">{{Harvnb|IEEE 754|2019|loc=§5.10}}</ref> The predicate agrees with the comparison predicates (see section {{section link||Comparison predicates}}) when one floating-point number is less than the other. The main differences are:<ref name=rust_total_cmp>{{cite web |title=Implement total_cmp for f32, f64 by golddranks · Pull Request #72568 · rust-lang/rust |url=https://github.com/rust-lang/rust/pull/72568 |website=GitHub |language=en}} – contains relevant quotations from IEEE 754-2008 and -2019. Contains a type-pun implementation and explanation.</ref>
* NaN is sortable.
** NaN is treated as if it had a larger absolute value than Infinity (or any other floating-point numbers). (−NaN < −Infinity; +Infinity < +NaN.)
** qNaN and sNaN are treated as if qNaN had a larger absolute value than sNaN. (−qNaN < −sNaN; +sNaN < +qNaN.)
** NaN is then sorted according to the payload. In IEEE 754-2008, a NaN with a lesser payload is treated as having a lesser absolute value. In IEEE 754-2019, any implementation-defined ordering is acceptable.
* Negative zero is treated as smaller than positive zero.
* If both sides of the comparison refer to the same floating-point datum, the one with the lesser exponent is treated as having a lesser absolute value.<ref name="IEEE 754 2019 loc=§5.10"/>

The ''totalOrder'' predicate does not impose a total ordering on all encodings in a format. In particular, it does not distinguish among different encodings of the same floating-point representation, as when one or both encodings are non-canonical.<ref name="IEEE 754 2019 loc=§5.10"/> IEEE 754-2019 incorporates clarifications of ''totalOrder''.

For the binary interchange formats whose encoding follows the IEEE 754-2008 recommendation on [[NaN#Encoding|placement of the NaN signaling bit]], the comparison is identical to one that [[type punning|type puns]] the floating-point numbers to a sign–magnitude integer (assuming a payload ordering consistent with this comparison), an old trick for FP comparison without an FPU.<ref name="Herf_2001"/>
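The type-punning comparison can be sketched for binary64 as a sort key. The function name <code>total_order_key</code> is hypothetical; the sketch assumes that <code>struct</code> packs a Python <code>float</code> as an IEEE 754 binary64 bit string, and it converts the sign–magnitude pattern into an ordinary signed integer by flipping the magnitude bits of negative values.

```python
import math
import struct

MAGNITUDE_MASK = 0x7FFFFFFFFFFFFFFF   # all bits except the sign bit

def total_order_key(x: float) -> int:
    """Integer key whose ordering matches the sign-magnitude order of x's bits."""
    (bits,) = struct.unpack(">q", struct.pack(">d", x))   # signed 64-bit view
    # Negative floats have the sign bit set, so 'bits' is negative; flipping the
    # magnitude bits makes larger magnitudes sort lower among the negatives.
    return bits ^ MAGNITUDE_MASK if bits < 0 else bits

positive_nan = math.copysign(math.nan, 1.0)
values = [positive_nan, math.inf, 1.0, 0.0, -0.0, -1.0, -math.inf]
ordered = sorted(values, key=total_order_key)
print(ordered)
```

Unlike the ordinary comparison predicates, this key places −0 before +0 and gives a positive NaN a position after +∞.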

== Exception handling ==
{{See also|Floating-point arithmetic#Exception handling}}

The standard defines five exceptions, each of which returns a default value and has a corresponding status flag that is raised when the exception occurs.{{efn|No flag is raised in certain cases of underflow.}} No other exception handling is required, but additional non-default alternatives are recommended (see {{slink||Alternate exception handling}}).

The five possible exceptions are
; Invalid operation: mathematically undefined, e.g., the square root of a negative number. By default, returns qNaN.
; Division by zero: an operation on finite operands gives an exact infinite result, e.g., 1/0 or log(0). By default, returns ±infinity.
; Overflow: a finite result is too large to be represented accurately (i.e., its exponent with an unbounded exponent range would be larger than ''emax''). By default, returns ±infinity for the round-to-nearest modes (and follows the rounding rules for the directed rounding modes).
; Underflow: a result is very small (outside the normal range). By default, returns a number less than or equal to the minimum positive normal number in magnitude (following the rounding rules); a [[subnormal number]] always implies an underflow exception, but by default, if it is exact, no flag is raised.
; Inexact: the exact (i.e., unrounded) result is not representable exactly. By default, returns the correctly rounded result.

These are the same five exceptions as were defined in IEEE 754-1985, but the ''division by zero'' exception has been extended to operations other than the division.
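The default behaviour (return a default value and raise a status flag) can be observed with Python's <code>decimal</code> module when all traps are disabled. This is a sketch, not a conformance demonstration: the module implements the General Decimal Arithmetic Specification, it raises some additional signals of its own, and the context parameters below (7 digits, exponent range −95 to 96) are chosen here to resemble decimal32. The helper <code>run</code> is hypothetical.

```python
from decimal import Context, Decimal

# No traps: exceptional conditions only set flags and return default results.
ctx = Context(prec=7, Emax=96, Emin=-95, traps=[])

def run(operation):
    """Run an operation and return (result, names of the flags it raised)."""
    ctx.clear_flags()
    result = operation()
    raised = {signal.__name__ for signal, is_set in ctx.flags.items() if is_set}
    return result, raised

invalid = run(lambda: ctx.sqrt(Decimal(-1)))
div_by_zero = run(lambda: ctx.divide(Decimal(1), Decimal(0)))
overflow = run(lambda: ctx.multiply(Decimal("9E96"), Decimal(10)))
underflow = run(lambda: ctx.multiply(Decimal("1E-95"), Decimal("1E-10")))
inexact = run(lambda: ctx.divide(Decimal(1), Decimal(3)))

for name, (result, raised) in [("invalid", invalid), ("division by zero", div_by_zero),
                               ("overflow", overflow), ("underflow", underflow),
                               ("inexact", inexact)]:
    print(f"{name:17} -> {result}  flags={sorted(raised)}")
```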

Some decimal floating-point implementations define additional exceptions,<ref>{{cite web|url=https://docs.python.org/library/decimal.html#signals|title=9.4. decimal — Decimal fixed point and floating point arithmetic — Python 3.6.5 documentation|website=docs.python.org|access-date=4 April 2018}}</ref><ref>{{cite web|url=http://speleotrove.com/decimal/daexcep.html|title=Decimal Arithmetic - Exceptional conditions|website=speleotrove.com|access-date=4 April 2018}}</ref> which are not part of IEEE 754:
; Clamped: a result's exponent is too large for the destination format. By default, trailing zeros will be added to the coefficient to reduce the exponent to the largest usable value. If this is not possible (because this would cause the number of digits needed to be more than the destination format) then an overflow exception occurs.
; Rounded: a result's coefficient requires more digits than the destination format provides. An inexact exception is signaled if any non-zero digits are discarded.

Additionally, operations such as quantize signal the invalid operation exception when either operand is infinite or when the result does not fit the destination format.<ref>{{Harvnb|IEEE 754|2008|loc=§7.2(h)}}</ref>

== Special values ==
=== Signed zero ===
{{Main|Signed zero}}
In the IEEE 754 standard, zero is signed, meaning that there exist both a "positive zero" (+0) and a "negative zero" (−0). In most [[run-time environment]]s, positive zero is printed as "<code>0</code>" and negative zero as "<code>-0</code>". The two values behave as equal in numerical comparisons, but some operations return different results for +0 and −0. For instance, {{math|1/(−0)}} returns negative infinity, while {{math|1/(+0)}} returns positive infinity (so that the identity {{math|1=1/(1/±∞) = ±∞}} is maintained). Other common [[discontinuous function|functions with a discontinuity]] at {{math|1=''x'' = 0}} which might treat +0 and −0 differently include [[Gamma function|{{math|Γ(''x'')}}]] and the [[Square root#Principal square root of a complex number|principal square root]] of {{math|''y'' + ''xi''}} for any negative number ''y''. As with any approximation scheme, operations involving "negative zero" can occasionally cause confusion. For example, in IEEE 754, {{math|''x'' {{=}} ''y''}} does not always imply {{math|1=1/''x'' = 1/''y''}}, as {{math|1=0 = −0}} but {{math|1/0 ≠ 1/(−0)}}.{{sfn|Goldberg|1991}} Moreover, the reciprocal square root{{efn|See [[Fast inverse square root]] and [[Methods of computing square roots#Iterative methods for reciprocal square roots]]}} of {{math|±0}} is {{math|±∞}} while the mathematical function <math>1/\sqrt{x}</math> over the real numbers does not have any negative value.
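The effect of the sign of zero on a branch cut can be observed with Python's <code>cmath.sqrt</code>, which distinguishes the two sides of the negative real axis by the sign of a zero imaginary part. A short sketch; the division example is not reproduced because Python raises <code>ZeroDivisionError</code> for float division by zero instead of returning an infinity.

```python
import cmath
import math

neg_zero = -0.0

zeros_compare_equal = (neg_zero == 0.0)
sign_of_neg_zero = math.copysign(1.0, neg_zero)   # reveals the hidden sign bit

# Principal square root of y + xi for negative y, approached from either side.
above_axis = cmath.sqrt(complex(-4.0, 0.0))
below_axis = cmath.sqrt(complex(-4.0, -0.0))

print(zeros_compare_equal, sign_of_neg_zero)
print(above_axis, below_axis)
```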

=== Subnormal numbers ===
{{Main|Subnormal numbers}}
Subnormal values fill the [[arithmetic underflow|underflow]] gap with values where the absolute distance between them is the same as for adjacent values just outside the underflow gap. This is an improvement over the older practice of having only zero in the underflow gap, with underflowing results replaced by zero (flush to zero).<ref name="Muller_2010">{{cite book |author-last1=Muller |author-first1=Jean-Michel |author-last2=Brisebarre |author-first2=Nicolas |author-last3=de Dinechin |author-first3=Florent |author-last4=Jeannerod |author-first4=Claude-Pierre |author-last5=Lefèvre |author-first5=Vincent |author-last6=Melquiond |author-first6=Guillaume |author-last7=Revol |author-first7=Nathalie|author7-link=Nathalie Revol |author-last8=Stehlé |author-first8=Damien |author-last9=Torres |author-first9=Serge |title=Handbook of Floating-Point Arithmetic |date=2010 |publisher=[[Birkhäuser]] |edition=1 |isbn=978-0-8176-4704-9 |doi=10.1007/978-0-8176-4705-6  |ref=muller_et_al_pg_16 |url=https://books.google.com/books?id=baFvrIOPvncC&pg=PA16}}</ref>

Modern floating-point hardware usually handles subnormal values (as well as normal values), and does not require software emulation for subnormals.
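
The spacing property can be checked directly. The following sketch assumes Python 3.9 or later (for <code>math.ulp</code>) and a binary64 <code>float</code> with subnormals enabled; it compares the spacing inside the underflow gap with the spacing just above it, and shows a difference of two distinct normal numbers that would be lost under flush to zero.

```python
import math
import sys

smallest_normal = sys.float_info.min      # 2**-1022 for binary64
smallest_subnormal = math.ulp(0.0)        # 2**-1074 for binary64

# Spacing between neighbours inside the gap and just above it.
spacing_in_gap = math.ulp(smallest_subnormal)
spacing_above_gap = math.ulp(smallest_normal)

# Two distinct normal numbers whose difference is subnormal.
x = smallest_normal
y = 1.5 * smallest_normal
difference = y - x                        # 2**-1023, not zero

print(spacing_in_gap == spacing_above_gap, difference)
```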

=== Infinities === {{further|topic=the concept of infinite|Infinity}} The infinities of the [[extended real number line]] can be represented in IEEE floating-point datatypes, just like ordinary floating-point values such as 1 or 1.5. They are not error values in any way, though they are often used as replacement values when there is an overflow (whether this happens depends on the rounding mode). Upon a divide-by-zero exception, a positive or negative infinity is returned as an exact result. An infinity can also be introduced as a numeral (like C's "INFINITY" macro, or "{{math|∞}}" if the programming language allows that syntax).

IEEE 754 requires infinities to be handled in a reasonable way, such as
* {{math|1=(+∞) + (+7) = (+∞)}}
* {{math|1=(+∞) × (−2) = (−∞)}}
* {{math|1=(+∞) × 0 =}} NaN – there is no meaningful thing to do
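
The three identities above, and overflow to infinity under the default rounding, can be reproduced with a short Python sketch (assuming a binary64 <code>float</code>; variable names are illustrative):

```python
import math

inf = math.inf

sum_result = inf + 7.0          # stays +infinity
product_result = inf * -2.0     # becomes -infinity
invalid_result = inf * 0.0      # invalid operation, yields NaN

# Overflow under round-to-nearest is replaced by an infinity.
overflowed = 1e308 * 10.0

print(sum_result, product_result, invalid_result, overflowed)
```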

=== NaNs === {{Main|NaN}} IEEE 754 specifies a special value called "Not a Number" (NaN) to be returned as the result of certain "invalid" operations, such as 0/0, {{math|∞×0}}, or sqrt(−1). In general, NaNs will be propagated, i.e. most operations involving a NaN will result in a NaN, although functions that would give some defined result for any given floating-point value will do so for NaNs as well, e.g. NaN ^ 0 = 1. There are two kinds of NaNs: the default ''quiet'' NaNs and, optionally, ''signaling'' NaNs. A signaling NaN in any arithmetic operation (including numerical comparisons) will cause an "invalid operation" [[#Exception handling|exception]] to be signaled.
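
A brief Python sketch of quiet-NaN behaviour, assuming a binary64 <code>float</code>. Python does not expose signaling NaNs in ordinary arithmetic, so only quiet NaNs are shown.

```python
import math

nan = math.nan

propagated = nan + 1.0             # most operations propagate the NaN
equal_to_itself = (nan == nan)     # comparisons with NaN are unordered
power = nan ** 0                   # defined for every input, so not NaN
from_invalid = math.inf - math.inf # an invalid operation produces NaN

print(propagated, equal_to_itself, power, from_invalid)
```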

The representation of NaNs specified by the standard has some unspecified bits that could be used to encode the type or source of error; but there is no standard for that encoding. In theory, signaling NaNs could be used by a [[runtime system]] to flag uninitialized variables, or extend the floating-point numbers with other special values without slowing down the computations with ordinary values, although such extensions are not common. A variant of this approach (sometimes called "NaN-boxing") is used by some [[JavaScript]] runtimes<ref name="wingolog2011">{{cite web |last1=Wingo |first1=Andy |title=value representation in javascript implementations |url=https://wingolog.org/archives/2011/05/18/value-representation-in-javascript-implementations |website=wingolog |access-date=9 September 2025 |archive-url=https://web.archive.org/web/20250821070214/https://wingolog.org/archives/2011/05/18/value-representation-in-javascript-implementations |archive-date=21 August 2025 |url-status=live |date=18 May 2011}}</ref> and [[LuaJIT]]<ref name="mikepall2009">{{cite web |last1=Pall |first1=Mike |title=LuaJIT 2.0 intellectual property disclosure and research opportunities |url=http://article.gmane.org/gmane.comp.lang.lua.general/58908 |website=gmane.comp.lang.lua.general (Usenet) |access-date=9 September 2025 |archive-url=https://web.archive.org/web/20091107031558/http://article.gmane.org/gmane.comp.lang.lua.general/58908 |archive-date=7 November 2009 |format=email |date=2 November 2009 |quote=Design aspects of the VM: [...] NaN-tagging: 64 bit tagged values are used for stack slots and table slots.}}</ref> to store 64-bit pointer values and IEEE 754 double-precision floating-point values in the same data type, allowing runtimes to eliminate the overhead of extra memory allocations and indirections for floating-point values.
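
The idea behind NaN-boxing can be illustrated as follows. This is a simplified sketch, not the layout used by any particular JavaScript engine or by LuaJIT: it stores a small integer in the low 51 fraction bits of a binary64 quiet NaN and reads it back. The names <code>box</code> and <code>unbox</code> are hypothetical, and the sketch assumes the platform preserves quiet-NaN payloads when a value is copied, which the standard does not guarantee in every situation.

```python
import math
import struct

QUIET_NAN_BITS = 0x7FF8_0000_0000_0000  # exponent all ones, quiet bit set
PAYLOAD_MASK = (1 << 51) - 1            # fraction bits below the quiet bit


def box(payload: int) -> float:
    """Hide a small non-negative integer inside a quiet NaN (illustrative)."""
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload does not fit in 51 bits")
    bits = QUIET_NAN_BITS | payload
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def unbox(value: float) -> int:
    """Recover the payload from a NaN produced by box()."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    return bits & PAYLOAD_MASK


boxed = box(0x1234)
print(math.isnan(boxed), hex(unbox(boxed)))
```

Ordinary numbers are stored unchanged, so arithmetic on them needs no unpacking; only values that are NaNs have to be inspected for a payload.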

== Design rationale == [[File:William Kahan.jpg|thumb|right|233px|[[William Kahan]], a primary architect of the Intel [[80x87]] floating-point coprocessor and IEEE 754 floating-point standard]]

It is a common misconception that the more esoteric features of the IEEE 754 standard discussed here, such as extended formats, NaN, infinities, subnormals etc., are only of interest to [[numerical analyst]]s, or for advanced numerical applications. In fact the opposite is true: these features are designed to give safe robust defaults for numerically unsophisticated programmers, in addition to supporting sophisticated numerical libraries by experts. The key designer of IEEE 754, [[William Kahan]], notes that it is incorrect to "... [deem] features of IEEE Standard 754 for Binary Floating-Point Arithmetic that ...[are] not appreciated to be features usable by none but numerical experts. The facts are quite the opposite. In 1977 those features were designed into the Intel 8087 to serve the widest possible market... Error-analysis tells us how to design floating-point arithmetic, like IEEE Standard 754, moderately tolerant of well-meaning ignorance among programmers".<ref name="Kahan_2001_JavaHurt">{{cite web |author-first1=William Morton |author-last1=Kahan |author-link1=William Morton Kahan |author-first2=Joseph |author-last2=Darcy |date=2001 |orig-date=1998-03-01 |url=http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf |archive-url=https://web.archive.org/web/20000816043653/http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf |archive-date=2000-08-16 |url-status=live |title=How Java's floating-point hurts everyone everywhere |access-date=2003-09-05}}</ref> * The special values such as infinity and NaN ensure that the floating-point arithmetic is algebraically complete: every floating-point operation produces a well-defined result and will not—by default—throw a machine interrupt or trap. Moreover, the choices of special values returned in exceptional cases were designed to give the correct answer in many cases. 
For instance, under IEEE 754 arithmetic, a [[continued fraction]] such as <math display=block>R(z) := 7 - \cfrac{3}{z - 2 + \cfrac{4}{z - 3 } }</math> can be implemented straightforwardly and will give the correct answer even when there is a [[division by zero]], because any positive number divided by zero results in {{math|+∞}}, e.g. when {{math|1=''z'' = 3}}, {{math|1= ''R''(''z'') = 7}}.<ref name="Kahan_1981_WhyIEEE">{{cite web |url=http://www.cs.berkeley.edu/~wkahan/ieee754status/why-ieee.pdf |archive-url=https://web.archive.org/web/20041204070746/http://www.cs.berkeley.edu/~wkahan/ieee754status/why-ieee.pdf |archive-date=2004-12-04 |url-status=live |title=Why do we need a floating-point arithmetic standard? |page=26 |author-first=William Morton |author-last=Kahan |author-link=William Morton Kahan |date=1981-02-12}}</ref> As noted by Kahan, the unhandled trap that followed an overflow in a floating-point to 16-bit integer conversion, and that caused the [[Ariane flight V88|loss of an Ariane 5]] rocket, would not have happened under the default IEEE 754 floating-point policy.<ref name="Kahan_2001_JavaHurt"/> * Subnormal numbers ensure that for ''finite'' floating-point numbers {{mvar|x}} and {{mvar|y}}, {{math|1=''x'' − ''y'' = 0}} if and only if {{math|1=''x'' = ''y''}}, as expected; this property did not hold under earlier floating-point representations.<ref name="Severance_1998">{{cite web |url=http://www.eecs.berkeley.edu/~wkahan/ieee754status/754story.html |title=An Interview with the Old Man of Floating-Point |author-first=Charles |author-last=Severance |author-link=Charles Severance (computer scientist) |date=1998-02-20}}</ref> * On the design rationale of the x87 [[extended precision|80-bit format]], Kahan notes: "This Extended format is designed to be used, with negligible loss of speed, for all but the simplest arithmetic with float and double operands. 
For example, it should be used for scratch variables in loops that implement recurrences like polynomial evaluation, scalar products, partial and continued fractions. It often averts premature Over/Underflow or severe local cancellation that can spoil simple algorithms".<ref name="Kahan_1996_Baleful">{{cite web |url=http://www.cs.berkeley.edu/~wkahan/ieee754status/baleful.pdf |archive-url=https://web.archive.org/web/20131013011212/http://www.cs.berkeley.edu/~wkahan/ieee754status/baleful.pdf |archive-date=2013-10-13 |url-status=live |title=The Baleful Effect of Computer Benchmarks upon Applied Mathematics, Physics and Chemistry |author-first=William Morton |author-last=Kahan |author-link=William Morton Kahan |date=1996-06-11}}</ref> Computing intermediate results in an extended format with high precision and extended exponent has precedents in the historical practice of scientific [[significant figures#Arithmetic|calculation]] and in the design of [[scientific calculator]]s e.g. [[Hewlett-Packard]]'s [[financial calculator]]s performed arithmetic and financial functions to three more significant decimals than they stored or displayed.<ref name="Kahan_1996_Baleful"/> The implementation of extended precision enabled standard elementary function libraries to be readily developed that normally gave double precision results within one [[unit in the last place]] (ULP) at high speed. * Correct rounding of values to the nearest representable value avoids systematic biases in calculations and slows the growth of errors. Rounding ties to even removes the statistical bias that can occur in adding similar figures. * Directed rounding was intended as an aid with checking error bounds, for instance in [[interval arithmetic]]. It is also used in the implementation of some functions. 
* The mathematical basis of the operations, in particular correct rounding, allows one to prove mathematical properties and design floating-point algorithms such as [[2Sum|2Sum, Fast2Sum]] and [[Kahan summation algorithm]], e.g. to improve accuracy or implement multiple-precision arithmetic subroutines relatively easily.
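
Two of the points in the list above can be sketched in Python. The first part evaluates the continued fraction {{math|''R''(''z'')}} at {{math|1=''z'' = 3}}; because Python raises <code>ZeroDivisionError</code> for float division by zero instead of following the IEEE 754 default, a hypothetical helper <code>ieee_div</code> emulates the default result. The second part is the classic 2Sum algorithm, checked against exact rational arithmetic. All function names are illustrative, and a binary64 <code>float</code> with round-to-nearest is assumed.

```python
import math
from fractions import Fraction


def ieee_div(a: float, b: float) -> float:
    """Division with IEEE 754 default results for a zero divisor (emulated)."""
    try:
        return a / b
    except ZeroDivisionError:
        if a == 0.0 or math.isnan(a):
            return math.nan                      # 0/0 is an invalid operation
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        return math.copysign(math.inf, sign)     # x/0 is a signed infinity


def R(z: float) -> float:
    """The continued fraction 7 - 3 / (z - 2 + 4 / (z - 3))."""
    return 7.0 - ieee_div(3.0, z - 2.0 + ieee_div(4.0, z - 3.0))


def two_sum(a: float, b: float) -> tuple[float, float]:
    """2Sum: return s = fl(a + b) and the rounding error e, so a + b = s + e."""
    s = a + b
    a_virtual = s - b
    b_virtual = s - a_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e


print(R(3.0))                 # the inner division by zero is harmless
s, e = two_sum(0.1, 0.2)
exact_sum = Fraction(0.1) + Fraction(0.2)
print(Fraction(s) + Fraction(e) == exact_sum)
```

The 2Sum check relies on correct rounding of each addition and subtraction, which is what makes the error term exactly representable.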

A property of the single- and double-precision formats is that their encoding allows one to easily sort them without using floating-point hardware, as if the bits represented [[sign-magnitude]] integers, although it is unclear whether this was a design consideration (it seems noteworthy that the earlier [[IBM hexadecimal floating-point]] representation also had this property for normalized numbers). With the prevalent [[two's-complement]] representation, [[Type punning|interpreting]] the bits as signed integers sorts the positives correctly, but with the negatives reversed; as one possible correction for that, with an [[exclusive or|xor]] to flip the sign bit for positive values and all bits for negative values, all the values become sortable as unsigned integers (with {{nobr|−0 < +0}}).<ref name="Herf_2001">{{cite web |author-last=Herf |author-first=Michael |title=radix tricks |url=http://stereopsis.com/radix.html |website=stereopsis: graphics |date=December 2001}}</ref>
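
The xor-based correction described above can be sketched for binary64 as follows. The function name <code>sort_key</code> is illustrative; NaNs are left out because they have no place in the numerical order.

```python
import struct

SIGN_BIT = 1 << 63
ALL_BITS = (1 << 64) - 1


def sort_key(x: float) -> int:
    """Map a binary64 value to an unsigned integer with the same ordering."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits & SIGN_BIT:
        return bits ^ ALL_BITS   # negative: flip all bits
    return bits ^ SIGN_BIT       # positive: flip only the sign bit


values = [3.5, -0.0, float("-inf"), 1e-310, 0.0, -2.0, float("inf"), -1e300]
by_bits = sorted(values, key=sort_key)
print(by_bits)
```

The subnormal value <code>1e-310</code> is ordered correctly as well, and −0 sorts immediately before +0.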

== Recommendations ==

=== Alternate exception handling === The standard recommends optional exception handling in various forms, including presubstitution of user-defined default values, traps (exceptions that change the flow of control in some way), and other exception handling models that interrupt the flow, such as try/catch. The traps and other exception mechanisms remain optional, as they were in IEEE 754-1985.
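
The difference between default handling (a flag is raised and a substitute result is returned) and a trap can be illustrated with Python's <code>decimal</code> module. That module implements the General Decimal Arithmetic specification and is not itself an IEEE 754 language binding, so this is only an analogy for the models described above.

```python
from decimal import Decimal, DivisionByZero, localcontext

# Default-style handling: no trap, the flag is set and infinity is returned.
with localcontext() as ctx:
    ctx.clear_flags()
    ctx.traps[DivisionByZero] = False
    quotient = Decimal(1) / Decimal(0)
    flag_raised = bool(ctx.flags[DivisionByZero])

# Trap-style handling: the same operation interrupts the flow of control.
with localcontext() as ctx:
    ctx.traps[DivisionByZero] = True
    try:
        Decimal(1) / Decimal(0)
        trapped = False
    except DivisionByZero:
        trapped = True

print(quotient, flag_raised, trapped)
```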

=== Recommended operations === Clause 9 in the standard recommends additional mathematical operations<ref>{{Harvnb|IEEE 754|2019|loc=§9.2}}</ref> that language standards should define.<ref>{{Harvnb|IEEE 754|2008|loc=Clause 9}}</ref> None are required in order to conform to the standard.

The following are recommended arithmetic operations, which must round correctly:<ref>{{Harvnb|IEEE 754|2019|loc=§9.2}}.</ref> * [[exp(x)|<math>e^x</math>]], <math>2^x</math>, <math>10^x</math> * [[exp(x)−1|<math>e^x - 1</math>]], <math>2^x - 1</math>, <math>10^x - 1</math> * [[Natural logarithm|<math>\ln x</math>]], [[Binary logarithm|<math>\log_{2} x</math>]], [[Common logarithm|<math>\log_{10} x</math>]] * [[ln(1+x)|<math>\ln (1 + x)</math>]], <math>\log_{2} (1 + x)</math>, <math>\log_{10} (1 + x)</math> * [[Hypot|<math display=inline>\sqrt{x^2 + y^2}</math>]] * [[Reciprocal square root|<math>1\big/\sqrt{x\vphantom{t}}</math>]] * <math>(1 + x)^n</math> for <math>x \ge -1</math> (named ''compound'' and used to compute an [[exponential growth]], whose rate cannot be less than −1)<ref>{{cite web |title=Too much power - pow vs powr, powd, pown, rootn, compound |url=https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/power.txt |website=[[IEEE]] |access-date=16 January 2024 |quote=Since growth rates can't be less than -1, such rates signal invalid exceptions.}}</ref> * [[nth root|<math>x^{\frac{1}{n}}</math>]] * [[Exponentiation#Integer exponents|<math>x^n</math>]], [[Exponentiation|<math>x^y</math>]] * [[sin (trigonometry)|<math>\sin x</math>]], [[cos (trigonometry)|<math>\cos x</math>]], [[tan (trigonometry)|<math>\tan x</math>]] * [[arcsin (trigonometry)|<math>\arcsin x</math>]], [[arccos (trigonometry)|<math>\arccos x</math>]], [[arctan (trigonometry)|<math>\arctan x</math>]], [[atan2|<math>\operatorname{atan2}(y, x)</math>]] * <math>\operatorname{sinPi} x = \sin \pi x</math>, <math>\operatorname{cosPi} x = \cos \pi x</math>, <math>\operatorname{tanPi} x = \tan \pi x</math> (see also: [[Multiples of π]]) * <math>\operatorname{asinPi} x = \tfrac1\pi \arcsin x</math>, <math>\operatorname{acosPi} x = \tfrac1\pi \arccos x</math>, <math>\operatorname{atanPi} x = \tfrac1\pi \arctan x</math>, <math>\operatorname{atan2Pi} (y, x) = \tfrac1\pi 
\operatorname{atan2}(y, x)</math> (see also: [[Multiples of π]]) * [[sinh (mathematical function)|<math>\sinh x</math>]], [[cosh (mathematical function)|<math>\cosh x</math>]], [[tanh (mathematical function)|<math>\tanh x</math>]] * [[arsinh|<math>\operatorname{arsinh} x</math>]], [[arcosh|<math>\operatorname{arcosh} x</math>]], [[artanh|<math>\operatorname{artanh} x</math>]]
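
The reason for recommending dedicated operations such as <math>e^x - 1</math> and <math display=inline>\sqrt{x^2 + y^2}</math> can be seen in a short Python sketch. Python's <code>math</code> functions are not guaranteed to be correctly rounded, so the sketch only illustrates why composing the basic operations is not equivalent; a binary64 <code>float</code> is assumed.

```python
import math

x = 1e-10

# Composing exp and subtraction cancels most significant digits.
naive = math.exp(x) - 1.0
dedicated = math.expm1(x)
reference = x + x * x / 2.0      # leading terms of the Taylor series

# Squaring overflows although the final result is representable.
big = 1e200
naive_norm = math.sqrt(big * big + big * big)
safe_norm = math.hypot(big, big)

print(naive, dedicated)
print(naive_norm, safe_norm)
```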

The <math>\operatorname{asinPi}</math>, <math>\operatorname{acosPi}</math> and <math>\operatorname{tanPi}</math> functions were not part of the IEEE 754-2008 standard because they were deemed less necessary.<ref>{{cite web |url=http://grouper.ieee.org/groups/754/email/msg03842.html |url-status=dead |archive-url=https://web.archive.org/web/20170706053605/http://grouper.ieee.org/groups/754/email/msg03842.html |archive-date=2017-07-06 |title=Re: Missing functions tanPi, asinPi and acosPi |website=[[IEEE]] |access-date=4 April 2018}}</ref> <math>\operatorname{asinPi}</math> and <math>\operatorname{acosPi}</math> were mentioned, but this was regarded as an error.<ref name=IEEE754-errata/> All three were added in the 2019 revision.

The recommended operations also include setting and accessing dynamic mode rounding direction,<ref>{{Harvnb|IEEE 754|2008|loc=§9.3}}.</ref> and implementation-defined vector reduction operations such as sum, scaled product, and [[dot product]], whose accuracy is unspecified by the standard.<ref>{{Harvnb|IEEE 754|2008|loc=§9.4}}.</ref>

{{anchor|Augmented arithmetic operation}} {{As of|2019}}, ''augmented arithmetic operations''<ref>{{Harvnb|IEEE 754|2019|loc=§9.5}}</ref> for the binary formats are also recommended. These operations, specified for addition, subtraction and multiplication, produce a pair of values consisting of a result correctly rounded to nearest in the format and the error term, which is representable exactly in the format. At the time of publication of the standard, no hardware implementations were known, but very similar operations had already been implemented in software using well-known algorithms. The history and motivation for their standardization are explained in a background document.<ref name="Riedy_2018">{{cite web |author-last1=Riedy |author-first1=Jason |author-last2=Demmel |author-first2=James |title=Augmented Arithmetic Operations Proposed for IEEE-754 2018 |publisher=25th IEEE Symposium on Computer Arithmetic (ARITH 2018) |pages=49–56 |url=http://www.ecs.umass.edu/arith-2018/pdf/arith25_34.pdf |access-date=2019-07-23 |url-status=live |archive-url=https://web.archive.org/web/20190723172615/http://www.ecs.umass.edu/arith-2018/pdf/arith25_34.pdf |archive-date=2019-07-23}}</ref><ref name="Revision_2019">{{cite web |url=https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/ |title=ANSI/IEEE Std 754-2019 – Background Documents |website=[[IEEE]] |access-date=16 January 2024}}</ref>

{{anchor|Minimum and maximum operation}} As of 2019, the formerly required ''minNum'', ''maxNum'', ''minNumMag'', and ''maxNumMag'' in IEEE 754-2008 are now [[Deprecation|deprecated]] due to their [[Associative property|non-associativity]]. Instead, two sets of new minimum and maximum operations are recommended.<ref>{{Harvnb|IEEE 754|2019|loc=§9.6}}.</ref> The first set contains ''minimum'', ''minimumNumber'', ''maximum'' and ''maximumNumber''. The second set contains ''minimumMagnitude'', ''minimumMagnitudeNumber'', ''maximumMagnitude'' and ''maximumMagnitudeNumber''. The history and motivation for this change are explained in a background document.<ref>{{cite web|author-last1=Chen |author-first1=David |url=https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/minNum_maxNum_Removal_Demotion_v3.pdf|title=The Removal/Demotion of MinNum and MaxNum Operations from IEEE 754-2018 |website=[[IEEE]]|access-date=16 January 2024}}</ref>
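
The distinction between the two kinds of operation in the first set can be sketched in Python. This is a simplified model based on a reading of the 2019 operations, in which ''minimum'' propagates NaNs while ''minimumNumber'' prefers a number over a NaN, and both treat −0 as smaller than +0; signaling NaNs and exception flags are ignored, and the function names are illustrative.

```python
import math


def minimum(x: float, y: float) -> float:
    """NaN-propagating minimum (simplified model)."""
    if math.isnan(x) or math.isnan(y):
        return math.nan
    if x == y:
        # Covers +0 and -0, which compare equal: prefer the negative zero.
        return x if math.copysign(1.0, x) < 0 else y
    return x if x < y else y


def minimum_number(x: float, y: float) -> float:
    """Minimum that returns the number when exactly one operand is NaN."""
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return minimum(x, y)


nan = math.nan
print(minimum(1.0, nan), minimum_number(1.0, nan), minimum(0.0, -0.0))

# For comparison, Python's built-in min depends on operand order with NaN.
print(min(nan, 1.0), min(1.0, nan))
```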

=== Expression evaluation === The standard recommends how language standards should specify the semantics of sequences of operations, and points out the subtleties of literal meanings and optimizations that change the value of a result. By contrast, the previous [[IEEE 754-1985|1985]] version of the standard left aspects of the language interface unspecified, which led to inconsistent behavior between compilers, and between different optimization levels of the same [[optimizing compiler]].

Programming languages should allow a user to specify a minimum precision for intermediate calculations of expressions for each radix. This is referred to as ''preferredWidth'' in the standard, and it should be possible to set this on a per-block basis. Intermediate calculations within expressions should be calculated, and any temporaries saved, using the maximum of the width of the operands and the preferred width if set. Thus, for instance, a compiler targeting [[x87]] floating-point hardware should have a means of specifying that intermediate calculations must use the [[Extended precision#IEEE 754 extended precision formats|double-extended format]]. The stored value of a variable must always be used when evaluating subsequent expressions, rather than any precursor from before rounding and assigning to the variable.

=== Reproducibility === The IEEE 754-1985 version of the standard allowed many variations in implementations (such as the encoding of some values and the detection of certain exceptions). IEEE 754-2008 has reduced these allowances, but a few variations still remain (especially for binary formats). The reproducibility clause recommends that language standards should provide a means to write reproducible programs (i.e., programs that will produce the same result in all implementations of a language) and describes what needs to be done to achieve reproducible results.

Concrete examples of potentially non-reproducible behavior can be found in C and [[C++]], which allow the use of higher precision for results of floating-point operations and contraction of floating-point expressions, such as regular multiply-and-add into [[Multiply–accumulate operation|FMA]] and <code>1.0/sqrt(x)</code> into a reciprocal square root as a single instruction.<ref>{{cite web |url=https://devblogs.microsoft.com/cppblog/the-fpcontract-flag-and-changes-to-fp-modes-in-vs2022/ |title=The /fp:contract flag and changes to FP modes in VS2022 |last=Beeraka |first=Gautham |date=14 December 2021 |website=devblogs.microsoft.com |publisher=Microsoft |access-date=9 June 2025}}</ref> C and C++ compilers such as [[GNU Compiler Collection|GCC]] and [[cl.exe]] generally allow both by default unless specifically asked not to, as these changes can generate faster code without obvious loss of accuracy. Compilers also offer more overtly non-compliant "fast" optimizations.<ref>{{cite web |title=Optimize Options (Using the GNU Compiler Collection (GCC)) |url=https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html |website=gcc.gnu.org}}</ref><ref>{{cite web |title=/fp (Specify floating-point behavior) |url=https://learn.microsoft.com/en-us/cpp/build/reference/fp-specify-floating-point-behavior?view=msvc-170 |website=learn.microsoft.com |language=en-us}}</ref> [[C mathematical functions]] are usually not implemented to be "correctly rounded", which adds to the problem.<ref>{{cite web |title=Does any floating point-intensive code produce bit-exact results in any x86-based architecture? |url=https://stackoverflow.com/a/41001110 |website=Stack Overflow |language=en}}</ref> The floating-point environment may also be unexpectedly changed by third-party code.
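
The effect of contracting a multiply-and-add into an FMA can be demonstrated without special hardware. The following Python sketch emulates the single rounding of a fused operation with exact rational arithmetic (<code>fractions.Fraction</code>) and compares it with the two roundings of the unfused expression; the helper name <code>fused_multiply_add</code> is illustrative, and a binary64 <code>float</code> with round-to-nearest is assumed.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a*b + c with one final rounding, using exact rationals."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -27
c = -(1.0 + 2.0 ** -26)

# Unfused: the product is rounded first, discarding its 2**-54 term.
unfused = a * a + c

# Fused: the exact product is kept until the single final rounding.
fused = fused_multiply_add(a, a, c)

print(unfused, fused)
```

Both results are valid outcomes for the same source expression, depending on whether the compiler contracts it, which is why such transformations affect reproducibility.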

== Character representation == {{see also|Floating-point arithmetic#Binary-to-decimal conversion with minimal number of digits}}

The standard requires operations to convert between basic formats and ''external character sequence'' formats.<ref>{{Harvnb|IEEE 754|2008|loc=§5.12}}.</ref> Conversions to and from a decimal character format are required for all formats. Conversion to an external character sequence must be such that conversion back using round to nearest, ties to even will recover the original number. There is no requirement to preserve the payload of a quiet NaN or signaling NaN, and conversion from the external character sequence may turn a signaling NaN into a quiet NaN.

The original binary value will be preserved by converting to decimal and back again using:<ref>{{Harvnb|IEEE 754|2008|loc=§5.12.2}}.</ref>
* 5 decimal digits for binary16,
* 9 decimal digits for binary32,
* 17 decimal digits for binary64,
* 36 decimal digits for binary128.

For other binary formats, the required number of decimal digits is{{efn|As an implementation limit, correct rounding is only guaranteed for the number of decimal digits required plus 3 for the largest supported binary format. For instance, if binary32 is the largest supported binary format, then a conversion from a decimal external sequence with 12 decimal digits is guaranteed to be correctly rounded when converted to binary32; but conversion of a sequence of 13 decimal digits is not; however, the standard recommends that implementations impose no such limit.}}

: <math>1 + \lceil p\log_{10}(2)\rceil,</math>

where ''p'' is the number of significant bits in the binary format, e.g. 237 bits for binary256.
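
The formula can be evaluated directly and compared with the digit counts listed above. The sketch below also shows, for a binary64 value, that 17 significant digits round-trip while 15 do not; the function name is illustrative.

```python
import math


def digits_for_round_trip(p: int) -> int:
    """Decimal digits needed to round-trip a binary format with p significant bits."""
    return 1 + math.ceil(p * math.log10(2))


formats = [("binary16", 11), ("binary32", 24), ("binary64", 53),
           ("binary128", 113), ("binary256", 237)]
table = {name: digits_for_round_trip(p) for name, p in formats}
print(table)

x = 0.1 + 0.2
short = format(x, ".15g")    # too few digits to identify x
full = format(x, ".17g")     # enough digits for binary64
print(short, float(short) == x)
print(full, float(full) == x)
```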

When using a decimal floating-point format, the decimal representation will be preserved using:
* 7 decimal digits for decimal32,
* 16 decimal digits for decimal64,
* 34 decimal digits for decimal128.

Algorithms, with code, for correctly rounded conversion from binary to decimal and decimal to binary are discussed by Gay,<ref>{{Citation |first=David M. |last=Gay |title=Correctly rounded binary-decimal and decimal-binary conversions |series=Numerical Analysis Manuscript |id=90-10 |publisher=AT&T Laboratories |location=Murray Hill, NJ, US |date=November 30, 1990 |url=http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.4049 }}</ref> and a program for testing such conversions is described by Paxson and Kahan.<ref>{{Citation |title=A Program for Testing IEEE Decimal–Binary Conversion |series=Manuscript |first1=Vern |last1=Paxson |first2=William |last2=Kahan |author2-link=William Kahan |date=May 22, 1991 |citeseerx = 10.1.1.144.5889}}</ref>

=== Hexadecimal literals ===

The standard recommends providing conversions to and from ''external hexadecimal-significand character sequences'', based on [[C99]]'s hexadecimal floating point literals. Such a literal consists of an optional sign (<code>+</code> or <code>-</code>), the indicator "0x", a hexadecimal number with or without a period, an exponent indicator "p", and a decimal exponent with optional sign. The syntax is not case-sensitive.<ref>{{Harvnb|IEEE 754|2008|loc=§5.12.3}}</ref> The decimal exponent scales by powers of 2. For example, <code>0x0.1p0</code> is 1/16 and <code>0x0.1p-4</code> is 1/256.<ref>{{cite web |title=6.9.3. Hexadecimal floating point literals — Glasgow Haskell Compiler 9.3.20220129 User's Guide |url=https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/hex_float_literals.html |website=ghc.gitlab.haskell.org |access-date=29 January 2022}}</ref>
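
Python exposes this notation through <code>float.fromhex</code> and <code>float.hex</code>, which can be used to check the two examples above (a sketch assuming a binary64 <code>float</code>):

```python
sixteenth = float.fromhex("0x0.1p0")     # 1/16
tiny = float.fromhex("0x0.1p-4")         # 1/256
upper = float.fromhex("-0X1.8P1")        # case-insensitive: -(1.5 * 2)

# The reverse conversion is exact: no decimal rounding is involved.
text = (0.1).hex()
round_tripped = float.fromhex(text)

print(sixteenth, tiny, upper, text)
```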

== See also == * [[bfloat16 floating-point format]] * [[Binade]] * [[Coprocessor]] * [[C99#IEEE 754 floating-point support|C99]] for code examples demonstrating access and use of IEEE 754 features * [[Floating-point arithmetic#IEEE 754: floating point in modern computers|Floating-point arithmetic]], for history, design rationale and example usage of IEEE 754 features * [[Fixed-point arithmetic]], for an alternative approach at computation with rational numbers (especially beneficial when the exponent range is known, fixed, or bound at compile time) * [[IBM System z9]], the first CPU to implement IEEE 754-2008 decimal arithmetic (using hardware microcode) * [[IBM z10]], [[IBM z196]], [[IBM zEC12 (microprocessor)|IBM zEC12]], and [[IBM z13 (microprocessor)|IBM z13]], CPUs that implement IEEE 754-2008 decimal arithmetic fully in hardware * [[ISO/IEC 10967]], language-independent arithmetic (LIA) * [[Minifloat]], low-precision binary floating-point formats following IEEE 754 principles * [[POWER6]], [[POWER7]], and [[POWER8]] CPUs that implement IEEE 754-2008 decimal arithmetic fully in hardware * [[strictfp]], an obsolete keyword in the [[Java (programming language)|Java programming language]] that previously restricted arithmetic to IEEE 754 single and double precision to ensure reproducibility across common hardware platforms (as of Java 17, this behavior is required) * [[Table-maker's dilemma]] for more about the correct rounding of functions * [[Standard Apple Numerics Environment]] * [[Tapered floating point]] * [[Unum (number format)#Posit (Type III Unum)|Posit]], an alternative number format

== Notes == {{Notelist}}

== References == {{Reflist}}

=== Standards === * {{Cite book |title=IEEE Standard for Binary Floating-Point Arithmetic |series=ANSI/IEEE STD 754-1985 |pages=1–20 |date=12 October 1985 |publisher=IEEE |doi=10.1109/IEEESTD.1985.82928 |isbn=0-7381-1165-1 }} * {{Cite book |title=IEEE Standard for Floating-Point Arithmetic |series=IEEE STD 754-2008 |pages=1–70 |author=IEEE Computer Society |date=29 August 2008 |publisher=IEEE |id=IEEE Std 754-2008 |doi=10.1109/IEEESTD.2008.4610935 |ref=CITEREFIEEE_7542008 |isbn=978-0-7381-5753-5 }} * {{Cite book |title=IEEE Standard for Floating-Point Arithmetic |series=IEEE STD 754-2019 |pages=1–84 |author=IEEE Computer Society |date=22 July 2019 |publisher=IEEE |id=IEEE Std 754-2019 |doi=10.1109/IEEESTD.2019.8766229 |isbn=978-1-5044-5924-2 |ref=CITEREFIEEE_7542019}} * {{Cite book |last=ISO/IEC JTC 1/SC 25|title=ISO/IEC/IEEE 60559:2011 — Information technology — Microprocessor Systems — Floating-Point arithmetic |url=https://www.iso.org/standard/57469.html |publisher=ISO |pages=1–58 |date=June 2011}} * {{Cite book |last= ISO/IEC JTC 1/SC 25|title=ISO/IEC 60559:2020 — Information technology — Microprocessor Systems — Floating-Point arithmetic |url=https://www.iso.org/standard/80985.html |url-access=subscription |publisher=ISO |pages=1–74 |date=May 2020}}

=== Secondary references === * [http://speleotrove.com/decimal Decimal floating-point] arithmetic, FAQs, bibliography, and links * [http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm Comparing binary floats] * [https://web.archive.org/web/20111201211023/http://babbage.cs.qc.cuny.edu/IEEE-754.old/References.xhtml IEEE 754 Reference Material] * [http://speleotrove.com/decimal/854mins.html IEEE 854-1987] – History and minutes * [https://web.archive.org/web/20171230124220/http://grouper.ieee.org/groups/754/reading.html Supplementary readings for IEEE 754]. Includes historical perspectives.

== Further reading == * {{cite journal |author-first=David |author-last=Goldberg |title=What Every Computer Scientist Should Know About Floating-Point Arithmetic |journal=[[ACM Computing Surveys]] |date=March 1991 |volume=23 |issue=1 |pages=5–48 |doi=10.1145/103162.103163 |doi-access=free }} (With the addendum "Differences Among IEEE 754 Implementations": [https://web.archive.org/web/20171011072644/http://www.cse.msu.edu/~cse320/Documents/FloatingPoint.pdf], [https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html]) * {{cite journal | first = Chris | last = Hecker | author-link = Chris Hecker | title = Let's Get To The (Floating) Point | journal = [[Game Developer (magazine)|Game Developer]] |date=February 1996 | pages = 19–24 | url = http://chrishecker.com/images/f/fb/Gdmfp.pdf }} *{{cite journal | first = Charles | last = Severance | title = IEEE 754: An Interview with William Kahan | journal = [[IEEE Computer]] |date=March 1998 | volume = 31 | issue = 3 | pages = 114–115 | doi = 10.1109/MC.1998.660194 | url = http://www.dr-chuck.com/dr-chuck/papers/columns/r3114.pdf | access-date = 8 March 2019 }} * {{cite book | first = Mike | last = Cowlishaw |chapter =Decimal floating-point: Algorism for computers | title = 16th IEEE Symposium on Computer Arithmetic, 2003. Proceedings. | author-link = Mike Cowlishaw |date=June 2003 | pages = 104–111 | isbn = 978-0-7695-1894-7 | chapter-url = http://speleotrove.com/decimal/IEEE-cowlishaw-arith16.pdf | access-date = 14 November 2014 | publisher = IEEE Computer Society | location = Los Alamitos, Calif. | doi = 10.1109/ARITH.2003.1207666 }}. 
* {{cite journal | first = David | last = Monniaux | title = The pitfalls of verifying floating-point computations | journal = [[ACM Transactions on Programming Languages and Systems]] |date=May 2008 | pages = 1–41 | volume = 30 | issue = 3 | doi = 10.1145/1353445.1353446 | url = https://hal.science/hal-00128124/en/ |arxiv = cs/0701192 }}: A compendium of non-intuitive behaviours of floating-point on popular architectures, with implications for program verification and testing. * {{cite book |author-last1=Muller |author-first1=Jean-Michel |author-last2=Brunie |author-first2=Nicolas |author-last3=de Dinechin |author-first3=Florent |author-last4=Jeannerod |author-first4=Claude-Pierre |author-first5=Mioara |author-last5=Joldes |author-last6=Lefèvre |author-first6=Vincent |author-last7=Melquiond |author-first7=Guillaume |author-last8=Revol |author-first8=Nathalie|author8-link=Nathalie Revol |author-last9=Torres |author-first9=Serge |title=Handbook of Floating-Point Arithmetic |date=2018 |orig-year=2010 |publisher=[[Birkhäuser]] |edition=2 |isbn=978-3-319-76525-9 |doi=10.1007/978-3-319-76526-6|url=https://cds.cern.ch/record/1315760 }} * {{cite book |first=Michael L. |last=Overton |title=Numerical Computing with IEEE Floating Point Arithmetic |year=2001 |location=[[Courant Institute of Mathematical Sciences]], [[New York University]], New York, US |doi=10.1137/1.9780898718072 |edition=1 |publisher=[[Society for Industrial and Applied Mathematics|SIAM]] |publication-place=Philadelphia, US |isbn=978-0-89871-482-1 |id=978-0-89871-571-2, 0-89871-571-7 }} 2nd edition, 2025. SIAM. {{ISBN|978-1-61197-840-7}}. * [http://blogs.mathworks.com/cleve/2014/07/07/floating-point-numbers/ Cleve Moler on Floating Point numbers] * {{cite book |author-first=Nelson H. F. 
|author-last=Beebe |title=The Mathematical-Function Computation Handbook - Programming Using the MathCW Portable Software Library |date=2017-08-22 |location=Salt Lake City, UT, US |publisher=[[Springer International Publishing AG]] |edition=1 |isbn=978-3-319-64109-6 |doi=10.1007/978-3-319-64110-2}} * {{cite journal |author-first=David G. |author-last=Hough |title=The IEEE Standard 754: One for the History Books |journal=Computer |date=December 2019 |volume=52 |issue=12 |pages=109–112 |publisher=[[IEEE]] |doi=10.1109/MC.2019.2926614 |url = https://www.computer.org/csdl/magazine/co/2019/12/08909942/1f8KFWxbTCU }}

== External links == * [https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/ ANSI/IEEE Std 754-2019] {{Wikibooks | Floating Point | Special Numbers | special numbers specified in the IEEE 754 standard }} {{Commons category|IEEE 754}} *{{cite video |title=Kahan on creating IEEE Standard Floating Point |date=16 November 2020 |work=Turing Awardee Clips |url=https://www.youtube.com/watch?v=ATCpecsyPE8| archive-url=https://ghostarchive.org/varchive/youtube/20211108/ATCpecsyPE8| archive-date=2021-11-08 | url-status=live}}{{cbignore}}

[[Category:Computer arithmetic]] [[Category:IEEE standards]] [[Category:Floating point types]] [[Category:Binary arithmetic]]