API
Functions
- gfloat.decode_float(fi: FormatInfo, i: int) FloatValue [source]
Given
FormatInfo
and integer code point, decode to aFloatValue
- Parameters:
fi¶ (FormatInfo) – Floating point format descriptor.
i¶ (int) – Integer code point, in the range \(0 \le i < 2^{k}\), where \(k\) =
fi.k
- Returns:
Decoded float value
- Raises:
ValueError – If
i
is outside the range of valid code points infi
.
- gfloat.round_float(fi: FormatInfo, v: float, rnd: RoundMode = RoundMode.TiesToEven, sat: bool = False) float [source]
Round input to the given
FormatInfo
, given rounding mode and saturation flagAn input NaN will convert to a NaN in the target. An input Infinity will convert to the largest float if
sat
, otherwise to an Inf, if present, otherwise to a NaN. Negative zero will be returned if the format has negative zero, otherwise zero.- Parameters:
- Returns:
A float which is one of the values in the format.
- Raises:
ValueError – The target format cannot represent the input (e.g. converting a NaN, or an Inf when the target has no NaN or Inf, and
sat
is false)
- gfloat.encode_float(fi: FormatInfo, v: float) int [source]
Encode input to the given
FormatInfo
.Will round toward zero if
v
is not in the value set. Will saturate to Inf, NaN, fi.max in order of precedence. Encode -0 to 0 if not fi.has_nzFor other roundings and saturations, call
round_float()
first.- Parameters:
fi¶ (FormatInfo) – Describes the target format
v¶ (float) – The value to be encoded.
- Returns:
The integer code point
- gfloat.decode_block(fi: BlockFormatInfo, block: Iterable[int]) Iterable[float] [source]
Decode a
block
of integer codepoints in Block Formatfi
The scale is encoded in the first value of
block
, with the remaining values encoding the block elements.The size of the iterable is not checked against the format descriptor.
- Parameters:
fi¶ (BlockFormatInfo) – Describes the block format
block¶ (Iterable[int]) – Input block
- Returns:
A sequence of floats representing the encoded values.
- gfloat.encode_block(fi: BlockFormatInfo, scale: float, vals: Iterable[float]) Iterable[int] [source]
Encode a
block
of bytes into block Format descibed byfi
The
scale
is explicitly passed, and is converted to 1/(1/scale) before rounding to the target format.It is checked for overflow in the target format, and will raise an exception if it does.
- Parameters:
fi¶ (BlockFormatInfo) – Describes the target block format
scale¶ (float) – Scale to be recorded in the block
vals¶ (Iterable[float]) – Input block
- Returns:
A sequence of ints representing the encoded values.
- Raises:
ValueError – The scale overflows the target scale encoding format.
Classes
- class gfloat.FormatInfo[source]
Class describing a floating-point format, parametrized by width, precision, and special value encoding rules.
- name: str
Short name for the format, e.g. binary32, bfloat16
- k: int
Number of bits in the format
- precision: int
Number of significand bits (including implicit leading bit)
- emax: int
Largest exponent, emax, which shall equal floor(log_2(maxFinite))
- has_nz: bool
Set if format encodes -0 at (sgn=1,exp=0,significand=0). If False, that encoding decodes to a NaN labelled NaN_0
- has_infs: bool
Set if format includes +/- Infinity. If set, the non-nan value with the highest encoding for each sign (s) is replaced by (s)Inf.
- num_high_nans: int
Number of NaNs that are encoded in the highest encodings for each sign
- has_subnormals: bool
Set if format encodes subnormals
- is_signed: bool
Set if the format has a sign bit
- is_twos_complement: bool
Set if the format uses two’s complement encoding for the significand
- property tSignificandBits: int
The number of trailing significand bits, t
- property expBits: int
The number of exponent bits, w
- property signBits: int
The number of sign bits, s
- property expBias: int
The exponent bias derived from (p,emax)
- This is the bias that should be applied so that
\(floor(log_2(maxFinite)) = emax\)
- property bits: int
The number of bits occupied by the type.
- property eps: float
The difference between 1.0 and the next smallest representable float larger than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard,
eps = 2**-52
, approximately 2.22e-16.
- property epsneg: float
The difference between 1.0 and the next smallest representable float less than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard,
epsneg = 2**-53
, approximately 1.11e-16.
- property iexp: int
The number of bits in the exponent portion of the floating point representation.
- property machep: int
The exponent that yields eps.
- property max: float
The largest representable number.
- property maxexp: int
The smallest positive power of the base (2) that causes overflow.
- property min: float
The smallest representable number, typically
-max
.
- property num_nans: int
The number of code points which decode to NaN
- property code_of_nan: int
Return a codepoint for a NaN
- property code_of_posinf: int
Return a codepoint for positive infinity
- property code_of_neginf: int
Return a codepoint for negative infinity
- property code_of_zero: int
Return a codepoint for (non-negative) zero
- property has_zero: bool
Does the format have zero?
This is false if the mantissa is 0 width and we don’t have subnormals - essentially the mantissa is always decoded as 1. If we have subnormals, the only subnormal is zero, and the mantissa is always decoded as 0.
- property code_of_negzero: int
Return a codepoint for negative zero
- property code_of_max: int
Return a codepoint for fi.max
- property code_of_min: int
Return a codepoint for fi.min
- property smallest_normal: float
The smallest positive floating point number with 1 as leading bit in the significand following IEEE-754.
- property smallest_subnormal: float
The smallest positive floating point number with 0 as leading bit in the significand following IEEE-754.
- property smallest: float
The smallest positive floating point number.
- property is_all_subnormal: bool
Are all encoded values subnormal?
- class gfloat.FloatClass[source]
Enum for the classification of a FloatValue.
- NORMAL = 1
A positive or negative normalized non-zero value
- SUBNORMAL = 2
A positive or negative subnormal value
- ZERO = 3
A positive or negative zero value
- INFINITE = 4
A positive or negative infinity (+/-Inf)
- NAN = 5
Not a Number (NaN)
- class gfloat.RoundMode[source]
Enum for IEEE-754 rounding modes.
Result r is obtained from input v depending on rounding mode as follows
- TowardZero = 1
\(\max \{ r ~ s.t. ~ |r| \le |v| \}\)
- TowardNegative = 2
\(\max \{ r ~ s.t. ~ r \le v \}\)
- TowardPositive = 3
\(\min \{ r ~ s.t. ~ r \ge v \}\)
- TiesToEven = 4
Round to nearest, ties to even
- TiesToAway = 5
Round to nearest, ties away from zero
- class gfloat.FloatValue[source]
A floating-point value decoded in great detail.
- ival: int
Integer code point
- fval: float
Value. Assumed to be exactly round-trippable to python float. This is true for all <64bit formats known in 2023.
- exp: int
Raw exponent without bias
- expval: int
Exponent, bias subtracted
- significand: int
Significand as an integer
- fsignificand: float
Significand as a float in the range [0,2)
- signbit: int
1 => negative, 0 => positive
- Type:
Sign bit
- fclass: FloatClass
See FloatClass
- class gfloat.BlockFormatInfo[source]
BlockFormatInfo(name: str, etype: gfloat.types.FormatInfo, k: int, stype: gfloat.types.FormatInfo)
- name: str
Short name for the format, e.g. BlockFP8
- etype: FormatInfo
Element data type
- k: int
Scaling block size
- stype: FormatInfo
Scale datatype
- property element_bits: int
The number of bits in each element, d
- property scale_bits: int
The number of bits in the scale, w
- property block_size_bytes: int
The number of bytes in a block
Pretty printers
- gfloat.float_pow2str(v: float, min_exponent: float = -inf) str [source]
Render floating point values as exact fractions times a power of two.
Example: float_pow2str(127.0) is “127/64*2^6”,
That is (a significand between 1 and 2) times (a power of two).
If min_exponent is supplied, then values with exponent below min_exponent, are printed as fractions less than 1, with exponent set to min_exponent. This is typically used to represent subnormal values.