API

Functions

gfloat.decode_float(fi: FormatInfo, i: int) -> FloatValue

Decode the integer code point i, interpreted in the format described by fi, to a FloatValue.

Parameters:
  • fi (FormatInfo) – Floating point format descriptor.

  • i (int) – Integer code point, in the range \(0 \le i < 2^{k}\), where \(k\) = fi.k

Returns:

Decoded float value

Raises:

ValueError – If i is outside the range of valid code points in fi.
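To make the decoding step concrete, here is a minimal standalone sketch of how a code point can be split into sign, exponent, and significand fields. It does not call gfloat and is not the library's implementation: decode_finite is a hypothetical helper, the bias is passed in by hand, and only finite values of a signed format with subnormals are handled (Inf and NaN code points are ignored).

```python
def decode_finite(i: int, k: int, p: int, bias: int) -> float:
    """Decode code point i of a signed k-bit format with precision p.

    Hypothetical helper: finite values only, no Inf/NaN handling.
    """
    if not 0 <= i < 2**k:
        raise ValueError(f"code point {i} is outside [0, 2**{k})")
    t = p - 1                          # trailing significand bits
    w = k - p                          # exponent bits, since k = 1 + w + t
    sign = -1.0 if (i >> (k - 1)) & 1 else 1.0
    exp = (i >> t) & (2**w - 1)        # raw exponent field
    sig = i & (2**t - 1)               # trailing significand field
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent pinned at 1 - bias.
        return sign * sig * 2.0 ** (1 - bias - t)
    # Normal: the implicit leading 1 is restored as 2**t.
    return sign * (2**t + sig) * 2.0 ** (exp - bias - t)


# Assumed E5M2-like layout: k=8, precision=3, bias=15.
print(decode_finite(0x3C, 8, 3, 15))   # 1.0
```

The range check mirrors the ValueError documented above; everything else is an illustration of the field layout under the stated assumptions.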

gfloat.round_float(fi: FormatInfo, v: float, rnd: RoundMode = RoundMode.TiesToEven, sat: bool = False) -> float

Round the input to a value representable in the format described by fi, using the given rounding mode and saturation flag.

An input NaN converts to a NaN in the target format. An input infinity converts to the largest float if sat is True; otherwise to Inf if the format has infinities; otherwise to NaN. A negative-zero input returns negative zero if the format has negative zero, otherwise zero.

Parameters:
  • fi (FormatInfo) – Describes the target format

  • v (float) – Input value to be rounded

  • rnd (RoundMode) – Rounding mode to use

  • sat (bool) – Saturation flag: if True, round overflowed values to fi.max

Returns:

A float which is one of the values in the format.

Raises:

ValueError – If the target format cannot represent the input (e.g. converting a NaN when the target has no NaN, or an Inf when the target has neither Inf nor NaN and sat is False).
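The rounding and saturation behaviour described above can be sketched in plain Python for finite inputs. This is an illustration only, not gfloat's code: round_to_format is a hypothetical helper, it implements ties-to-even only, and the parameters p (precision), emin (smallest normal exponent), and fmax (largest finite value) are supplied by hand rather than read from a FormatInfo.

```python
import math


def round_to_format(v: float, p: int, emin: int, fmax: float,
                    sat: bool = False) -> float:
    """Round finite v to p bits of precision, ties to even (hypothetical helper)."""
    if v == 0.0:
        return v
    _, e = math.frexp(v)                 # |v| = m * 2**e with 0.5 <= m < 1
    expval = max(e - 1, emin)            # clamp so subnormals share one spacing
    quantum = 2.0 ** (expval - (p - 1))  # spacing of representable values near v
    r = round(v / quantum) * quantum     # Python's round() breaks ties to even
    if abs(r) > fmax:
        # Overflow: clamp to the largest finite value if sat, else go to Inf.
        return math.copysign(fmax if sat else math.inf, v)
    return r


# Assumed E5M2-like parameters: p=3, emin=-14, fmax=57344.
print(round_to_format(1.375, 3, -14, 57344.0))           # 1.5
print(round_to_format(1e6, 3, -14, 57344.0, sat=True))   # 57344.0
```

Dividing by the spacing and rounding to an integer keeps the arithmetic exact, because the spacing is a power of two.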

gfloat.encode_float(fi: FormatInfo, v: float) -> int

Encode input to the given FormatInfo.

Rounds toward zero if v is not in the format's value set. Out-of-range values saturate to Inf, NaN, or fi.max, in that order of precedence. Encodes -0 as 0 if fi.has_nz is False.

For other roundings and saturations, call round_float() first.

Parameters:
  • fi (FormatInfo) – Describes the target format

  • v (float) – The value to be encoded.

Returns:

The integer code point
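As a rough illustration of round-toward-zero encoding, the sketch below builds a table of the non-negative finite values of a small format and looks up the largest one not exceeding |v|. It is a standalone sketch, not gfloat's algorithm: the E5M2-like layout, the constants CODE_MAX and CODE_INF, and the helper names are all assumptions, and NaN inputs are not handled. Values beyond the largest finite value map to the Inf code, following the precedence stated above for a format that has Inf.

```python
import math
from bisect import bisect_right

T, BIAS = 2, 15                    # assumed E5M2-like: 2 trailing bits, bias 15
CODE_MAX, CODE_INF = 0x7B, 0x7C    # assumed codes for the max finite value and +Inf


def decode_pos(code: int) -> float:
    """Decode a non-negative finite code point (hypothetical helper)."""
    exp, sig = code >> T, code & (2**T - 1)
    if exp == 0:
        return sig * 2.0 ** (1 - BIAS - T)
    return (2**T + sig) * 2.0 ** (exp - BIAS - T)


# Values increase monotonically with the code, so the list is sorted.
VALUES = [decode_pos(c) for c in range(CODE_MAX + 1)]


def encode_toward_zero(v: float) -> int:
    """Encode v, rounding toward zero; NaN is not handled in this sketch."""
    sign = 0x80 if math.copysign(1.0, v) < 0 else 0
    mag = abs(v)
    if mag > VALUES[-1]:
        return sign | CODE_INF
    # Index of the largest table value <= |v| is also its code point.
    return sign | (bisect_right(VALUES, mag) - 1)


print(hex(encode_toward_zero(1.2)))   # 0x3c
```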

gfloat.decode_block(fi: BlockFormatInfo, block: Iterable[int]) -> Iterable[float]

Decode a block of integer code points in the block format described by fi.

The scale is encoded in the first value of block, with the remaining values encoding the block elements.

The size of the iterable is not checked against the format descriptor.

Parameters:
  • fi (BlockFormatInfo) – Describes the block format

  • block (Iterable[int]) – Input block

Returns:

A sequence of floats representing the encoded values.
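The block layout described above (scale first, elements after) can be sketched as follows. This is a standalone illustration, not the library's code: decode_block_sketch and both per-code decoders are hypothetical, and it assumes each decoded element is simply multiplied by the decoded scale.

```python
from typing import Callable, Iterable, Iterator


def decode_block_sketch(decode_scale: Callable[[int], float],
                        decode_elem: Callable[[int], float],
                        block: Iterable[int]) -> Iterator[float]:
    """Yield scale * element for each element code (hypothetical helper)."""
    it = iter(block)
    scale = decode_scale(next(it))   # first code point carries the shared scale
    for code in it:
        yield scale * decode_elem(code)


def e8m0(code: int) -> float:
    """Assumed scale decoder: a power of two with bias 127."""
    return 2.0 ** (code - 127)


def int8(code: int) -> float:
    """Assumed element decoder: 8-bit two's complement integer."""
    return float(code - 256 if code >= 128 else code)


print(list(decode_block_sketch(e8m0, int8, [128, 1, 255, 4])))  # [2.0, -2.0, 8.0]
```

As in the documented function, the length of the block is not checked against any format descriptor.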

gfloat.encode_block(fi: BlockFormatInfo, scale: float, vals: Iterable[float]) -> Iterable[int]

Encode a block of float values into the block format described by fi.

The scale is explicitly passed, and is converted to 1/(1/scale) before rounding to the target format.

The scale is checked for overflow in the target scale format, and a ValueError is raised if it overflows.

Parameters:
  • fi (BlockFormatInfo) – Describes the target block format

  • scale (float) – Scale to be recorded in the block

  • vals (Iterable[float]) – Input block

Returns:

A sequence of ints representing the encoded values.

Raises:

ValueError – The scale overflows the target scale encoding format.
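A minimal sketch of the encoding flow described above: check the scale, apply the 1/(1/scale) step, emit the scale code, then emit one code per element. Everything here is an assumption for illustration: encode_block_sketch, the two encoders, and the scale_max parameter are hypothetical, and dividing each element by the scale before encoding is an assumed convention chosen to match a decoder that multiplies by the scale.

```python
import math
from typing import Callable, Iterable, List


def encode_block_sketch(scale: float, vals: Iterable[float],
                        encode_scale: Callable[[float], int],
                        encode_elem: Callable[[float], int],
                        scale_max: float) -> List[int]:
    """Return [scale code, element codes...] (hypothetical helper)."""
    if scale > scale_max:
        raise ValueError(f"scale {scale} overflows the scale format")
    recip = 1.0 / scale
    scale = 1.0 / recip   # the 1/(1/scale) step mentioned above
    return [encode_scale(scale)] + [encode_elem(recip * v) for v in vals]


def e8m0_code(s: float) -> int:
    """Assumed scale encoder: s must be a power of two; bias 127."""
    return int(math.log2(s)) + 127


def int8_code(x: float) -> int:
    """Assumed element encoder: round, clamp to int8, two's complement."""
    return max(-128, min(127, round(x))) & 0xFF


print(encode_block_sketch(2.0, [2.0, -2.0, 8.0], e8m0_code, int8_code, 2.0 ** 127))
# [128, 1, 255, 4]
```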

Classes

class gfloat.FormatInfo

Class describing a floating-point format, parametrized by width, precision, and special value encoding rules.

name: str

Short name for the format, e.g. binary32, bfloat16

k: int

Number of bits in the format

precision: int

Number of significand bits (including implicit leading bit)

emax: int

Largest exponent, emax, which shall equal floor(log_2(maxFinite))

has_nz: bool

Set if format encodes -0 at (sgn=1,exp=0,significand=0). If False, that encoding decodes to a NaN labelled NaN_0

has_infs: bool

Set if format includes +/- Infinity. If set, the non-NaN value with the highest encoding for each sign (s) is replaced by (s)Inf.

num_high_nans: int

Number of NaNs that are encoded in the highest encodings for each sign

has_subnormals: bool

Set if format encodes subnormals

is_signed: bool

Set if the format has a sign bit

is_twos_complement: bool

Set if the format uses two’s complement encoding for the significand

property tSignificandBits: int

The number of trailing significand bits, t

property expBits: int

The number of exponent bits, w

property signBits: int

The number of sign bits, s

property expBias: int

The exponent bias derived from (p,emax)

This is the bias that should be applied so that

\(\lfloor \log_2(\mathrm{maxFinite}) \rfloor = \mathrm{emax}\)
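A small worked sketch of that relationship: the bias is whatever maps the largest raw exponent field that still holds finite values onto emax. The helper exp_bias and its top_exponent_all_special flag are hypothetical names introduced here; the flag says whether the all-ones exponent field is entirely taken up by Inf/NaN, as in IEEE-754 formats.

```python
def exp_bias(w: int, emax: int, top_exponent_all_special: bool) -> int:
    """Bias for a format with w exponent bits (hypothetical helper)."""
    # Largest raw exponent field that still encodes a finite value.
    top_finite_exp = 2**w - (2 if top_exponent_all_special else 1)
    return top_finite_exp - emax


print(exp_bias(8, 127, True))    # 127: IEEE binary32
print(exp_bias(5, 15, True))     # 15: IEEE binary16
print(exp_bias(4, 8, False))     # 7: OCP E4M3, whose top exponent holds finite values
```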

property bits: int

The number of bits occupied by the type.

property eps: float

The difference between 1.0 and the smallest representable float greater than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, eps = 2**-52, approximately 2.22e-16.

property epsneg: float

The difference between 1.0 and the largest representable float less than 1.0. For example, for 64-bit binary floats in the IEEE-754 standard, epsneg = 2**-53, approximately 1.11e-16.
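Both quantities follow directly from the precision p: the spacing just above 1.0 is 2**(1 - p), and the spacing just below 1.0 is half of that. The sketch below checks this against Python's own binary64 floats; the function names are hypothetical and nothing here calls gfloat.

```python
import sys


def eps_from_precision(p: int) -> float:
    """Gap between 1.0 and the next float above it, for precision p."""
    return 2.0 ** (1 - p)


def epsneg_from_precision(p: int) -> float:
    """Gap between 1.0 and the next float below it, for precision p."""
    return 2.0 ** (-p)


# binary64 has p = 53, matching the figures quoted above.
print(eps_from_precision(53) == sys.float_info.epsilon)   # True
print(epsneg_from_precision(53) == 2.0 ** -53)            # True
```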

property iexp: int

The number of bits in the exponent portion of the floating point representation.

property machep: int

The exponent that yields eps.

property max: float

The largest representable number.

property maxexp: int

The smallest positive power of the base (2) that causes overflow.

property min: float

The smallest representable number, typically -max.

property num_nans: int

The number of code points which decode to NaN

property code_of_nan: int

Return a codepoint for a NaN

property code_of_posinf: int

Return a codepoint for positive infinity

property code_of_neginf: int

Return a codepoint for negative infinity

property code_of_zero: int

Return a codepoint for (non-negative) zero

property has_zero: bool

Does the format have zero?

This is False if the mantissa has zero width and the format has no subnormals; in that case the mantissa is always decoded as 1. If such a format has subnormals, the only subnormal is zero, and the mantissa is always decoded as 0.

property code_of_negzero: int

Return a codepoint for negative zero

property code_of_max: int

Return a codepoint for fi.max

property code_of_min: int

Return a codepoint for fi.min

property smallest_normal: float

The smallest positive floating point number with 1 as leading bit in the significand following IEEE-754.

property smallest_subnormal: float

The smallest positive floating point number with 0 as leading bit in the significand following IEEE-754.

property smallest: float

The smallest positive floating point number.
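For an IEEE-style format these values can be worked out from the bias and precision alone. The sketch below is a standalone check against Python's binary64 floats, with hypothetical helper names; it assumes the smallest normal exponent is 1 - bias and that subnormals extend p - 1 bits below it.

```python
import sys


def smallest_normal(bias: int) -> float:
    """2**emin, where emin = 1 - bias (hypothetical helper)."""
    return 2.0 ** (1 - bias)


def smallest_subnormal(bias: int, p: int) -> float:
    """One unit in the last place of the subnormal range (hypothetical helper)."""
    return 2.0 ** (1 - bias - (p - 1))


# binary64: bias = 1023, p = 53.
print(smallest_normal(1023) == sys.float_info.min)   # True
print(smallest_subnormal(1023, 53))                  # 5e-324
```

For a format with subnormals, the smallest positive value would be the subnormal one; without subnormals it would be the smallest normal.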

property is_all_subnormal: bool

Are all encoded values subnormal?

class gfloat.FloatClass

Enum for the classification of a FloatValue.

NORMAL = 1

A positive or negative normalized non-zero value

SUBNORMAL = 2

A positive or negative subnormal value

ZERO = 3

A positive or negative zero value

INFINITE = 4

A positive or negative infinity (+/-Inf)

NAN = 5

Not a Number (NaN)

class gfloat.RoundMode

Enum for IEEE-754 rounding modes.

Result r is obtained from input v depending on rounding mode as follows

TowardZero = 1

\(\operatorname{sign}(v) \cdot \max \{ |r| ~ s.t. ~ |r| \le |v| \}\)

TowardNegative = 2

\(\max \{ r ~ s.t. ~ r \le v \}\)

TowardPositive = 3

\(\min \{ r ~ s.t. ~ r \ge v \}\)

TiesToEven = 4

Round to nearest, ties to even

TiesToAway = 5

Round to nearest, ties away from zero
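The five modes are easiest to compare on a grid of integers, where a value such as 2.5 sits exactly on a tie. The sketch below is a standalone illustration: Mode is a hypothetical stand-in that copies the member names and values listed above, and round_to_integer rounds to whole numbers rather than to a floating-point format.

```python
import math
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in mirroring the RoundMode members above."""
    TowardZero = 1
    TowardNegative = 2
    TowardPositive = 3
    TiesToEven = 4
    TiesToAway = 5


def round_to_integer(v: float, mode: Mode) -> int:
    """Round finite v to an integer under the given mode."""
    if mode is Mode.TowardZero:
        return math.trunc(v)
    if mode is Mode.TowardNegative:
        return math.floor(v)
    if mode is Mode.TowardPositive:
        return math.ceil(v)
    if mode is Mode.TiesToEven:
        return round(v)          # Python's round() breaks ties to even
    # TiesToAway: truncate, then step away from zero if the
    # discarded fraction is at least one half.
    t = math.trunc(v)
    if abs(v - t) >= 0.5:
        t += 1 if v > 0 else -1
    return t


print([round_to_integer(2.5, m) for m in Mode])    # [2, 2, 3, 2, 3]
print([round_to_integer(-2.5, m) for m in Mode])   # [-2, -3, -2, -2, -3]
```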

class gfloat.FloatValue

A floating-point value decoded in great detail.

ival: int

Integer code point

fval: float

The value as a Python float. Assumed to round-trip exactly through Python float, which holds for all formats narrower than 64 bits known in 2023.

exp: int

Raw exponent field, before the bias is subtracted

expval: int

Exponent, bias subtracted

significand: int

Significand as an integer

fsignificand: float

Significand as a float in the range [0,2)

signbit: int

Sign bit: 1 => negative, 0 => positive

fclass: FloatClass

See FloatClass
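One way to read the fields together is as a sign, a significand in [0,2), and an unbiased exponent whose product gives the value. The sketch below illustrates that reading on ordinary Python floats. It is an assumption about how the fields relate for finite values, not a statement of the library's internals, and split_fields and join_fields are hypothetical helpers that do not construct a FloatValue.

```python
import math
from typing import Tuple


def split_fields(v: float) -> Tuple[int, float, int]:
    """Return (signbit, fsignificand, expval) for a finite, normal v."""
    signbit = 1 if math.copysign(1.0, v) < 0 else 0
    m, e = math.frexp(abs(v))       # |v| = m * 2**e with 0.5 <= m < 1
    return signbit, 2.0 * m, e - 1  # rescale so 1 <= significand < 2


def join_fields(signbit: int, fsignificand: float, expval: int) -> float:
    """Rebuild the value as (-1)**signbit * fsignificand * 2**expval."""
    return (-1.0) ** signbit * math.ldexp(fsignificand, expval)


print(split_fields(-6.0))                  # (1, 1.5, 2)
print(join_fields(*split_fields(-6.0)))    # -6.0
```

Subnormal values of a narrow format would instead keep the minimum exponent and use a significand below 1, which this sketch does not model.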

class gfloat.BlockFormatInfo

BlockFormatInfo(name: str, etype: gfloat.types.FormatInfo, k: int, stype: gfloat.types.FormatInfo)

name: str

Short name for the format, e.g. BlockFP8

etype: FormatInfo

Element data type

k: int

Scaling block size

stype: FormatInfo

Scale datatype

property element_bits: int

The number of bits in each element, d

property scale_bits: int

The number of bits in the scale, w

property block_size_bytes: int

The number of bytes in a block
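The block size arithmetic can be sketched as one scale plus k elements, packed into whole bytes. The function below is a hypothetical standalone helper, and the requirement that the total be a multiple of 8 bits is an assumption of this sketch.

```python
def block_size_bytes(element_bits: int, k: int, scale_bits: int) -> int:
    """Bytes occupied by one scale and k elements (hypothetical helper)."""
    bits = scale_bits + k * element_bits
    if bits % 8 != 0:
        raise ValueError(f"block of {bits} bits is not a whole number of bytes")
    return bits // 8


# 32 eight-bit elements sharing one eight-bit scale.
print(block_size_bytes(8, 32, 8))   # 33
# 32 four-bit elements sharing one eight-bit scale.
print(block_size_bytes(4, 32, 8))   # 17
```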

Pretty printers

gfloat.float_pow2str(v: float, min_exponent: float = -inf) -> str

Render floating point values as exact fractions times a power of two.

Example: float_pow2str(127.0) is “127/64*2^6”, that is, (a significand between 1 and 2) times (a power of two).

If min_exponent is supplied, then values with exponent below min_exponent are printed as fractions less than 1, with exponent set to min_exponent. This is typically used to represent subnormal values.
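The rendering rule above can be approximated with the standard library. The sketch below is not the library's implementation: pow2str_sketch is a hypothetical helper, it is only known to match the documented output for the 127.0 example, and its formatting of zero, negative values, and whole-number significands is a guess.

```python
import math
from fractions import Fraction


def pow2str_sketch(v: float, min_exponent: float = -math.inf) -> str:
    """Render finite v as 'fraction*2^exponent' (hypothetical helper)."""
    if v == 0.0:
        return "0"
    m, e = math.frexp(abs(v))        # |v| = m * 2**e with 0.5 <= m < 1
    sig, exp = 2.0 * m, e - 1        # rescale so 1 <= sig < 2
    if exp < min_exponent:
        # Pin the exponent and let the significand drop below 1.
        sig = math.ldexp(sig, exp - int(min_exponent))
        exp = int(min_exponent)
    sign = "-" if v < 0 else ""
    return f"{sign}{Fraction(sig)}*2^{exp}"


print(pow2str_sketch(127.0))             # 127/64*2^6
print(pow2str_sketch(2.0 ** -16, -14))   # 1/4*2^-14
```

Fraction converts the binary significand exactly, so the printed numerator and denominator are exact rather than decimal approximations.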

gfloat.float_tilde_unless_roundtrip_str(v: float, width: int = 14, d: int = 8) -> str

Return a base-10 string representation of v, at most width characters wide and with d decimal digits.