binary

basics

A system of storing data using just two digits: 1 and 0.
Everything in a computer is ultimately stored in binary (high voltage wire = 1, low voltage wire = 0)
Generally rooted in the mathematical concept of binary (as a base 2 system of representing numbers)
Since computers tend to “think” in binary, it is often useful to work with values in binary. By convention, a binary value is written with the prefix “0b” (for example, 0b101 is 5).
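A small sketch of the “0b” convention, using Python only because it happens to accept the same prefix. The variable names are made up for illustration.

```python
# Binary literals use the 0b prefix; bin() converts back to that notation.
value = 0b1011            # 1*8 + 0*4 + 1*2 + 1*1
as_text = bin(11)         # the string form of 11 in binary
parsed = int("1011", 2)   # parse a string of bits as a base 2 number

print(value, as_text, parsed)
```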

operations

&, |, ~, ^: convert every operand to binary, then apply the operation bit by bit (AND, OR, NOT, XOR respectively)
<<, >>:

Left shift: Convert to binary, then move all bits left, appending 0s as needed (shifting left by n is equivalent to multiplying by $2^n$)
Right shift (logical): Convert to binary, then move all bits right, prepending 0s as needed (shifting right by n is equivalent to dividing an unsigned value by $2^n$, rounding down)
for example:
![[Pasted image 20240523104246.png]]
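The operations above can be tried out directly. This is a minimal sketch with arbitrary 4-bit example values. The 4-bit mask on `~` is an assumption needed because Python integers have no fixed width.

```python
a, b = 0b1100, 0b1010

and_bits = a & b          # 1 only where both bits are 1
or_bits = a | b           # 1 where at least one bit is 1
xor_bits = a ^ b          # 1 where the bits differ
# Python ints are unbounded, so mask to an assumed 4-bit width
# to see only the flipped bits.
not_bits = ~a & 0b1111

left = 0b0011 << 2        # move bits left by 2, same as 3 * 2**2
right = 0b1100 >> 2       # move bits right by 2, same as 12 // 2**2

for name, v in [("and", and_bits), ("or", or_bits), ("xor", xor_bits),
                ("not", not_bits), ("left", left), ("right", right)]:
    print(f"{name:>5}: 0b{v:04b} = {v}")
```

For non-negative integers, Python's `>>` behaves like the logical right shift described above. For negative integers it keeps the sign instead.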

signed numbers

![[Pasted image 20240523110950.png]]
Formally:

Define a “bias”
To interpret stored binary: Read the data as an unsigned number, then add the bias
To store a data value: Subtract the bias, then store the resulting number as an unsigned number
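The two rules above, sketched as code. The bias of -127 and the 8-bit width are assumptions picked for illustration (they match the exponent field used later for floats). `store` and `interpret` are hypothetical helper names.

```python
BIAS = -127   # assumed example bias
BITS = 8      # assumed width of the stored field

def store(value, bias=BIAS, bits=BITS):
    """Subtract the bias, then keep the result as an unsigned number."""
    stored = value - bias
    if not 0 <= stored < (1 << bits):
        raise ValueError(f"{value} does not fit in {bits} bits with bias {bias}")
    return stored

def interpret(stored, bias=BIAS):
    """Read the data as an unsigned number, then add the bias."""
    return stored + bias

print(f"{store(0):08b}")        # bit pattern that represents the value 0
print(interpret(0b00000000))    # value of the all-zeros pattern
print(interpret(0b11111111))    # value of the all-ones pattern
```

With this choice of bias, the all-zeros pattern is the most negative value and the all-ones pattern is the most positive one, so stored patterns sort in the same order as the values they represent.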

float

fixed point representation

![[Pasted image 20240523113405.png]]
but what about other numbers?

very large number: 31,556,926,010 ($3.155692601 \times 10^{10}$)
very small number: 0.000000000052917710 ($5.2917710 \times 10^{-11}$)

floating point

IEEE standard 754!
Take scientific notation as an example:
![[Pasted image 20240526155108.png]]
Similarly, floating point represents a number in the form $A \times 2^B$

1 bit: sign bit
8 bits: exponent (B)
23 bits: significand(A)
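$(-1)^s \times (1 + \text{significand}) \times 2^{\text{exponent} - 127}$, where $s$ is the sign bit and the significand bits are read as a fraction in $[0, 1)$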
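To check the formula against real bit patterns, here is a sketch that splits a 32-bit float into its three fields with Python's standard `struct` module. `decode_float32` and `normalized_value` are hypothetical helper names. The formula only applies to normalized numbers, not to the special cases.

```python
import struct

def decode_float32(x):
    """Return (sign, exponent field, significand field) of x as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with the offset of 127
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction

def normalized_value(sign, exponent, fraction):
    """Apply (-1)^s * (1 + significand) * 2^(exponent - 127)."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)

s, e, f = decode_float32(-6.5)        # -6.5 = -1.625 * 2^2
print(s, e, f"{f:023b}")
print(normalized_value(s, e, f))
```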

special cases

0

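Zero has no normalized representation (a normalized number always has an implied leading 1), so it is stored as all zeros in the exponent and significand. The sign bit still applies, which gives +0 and -0.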

large & small numbers

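The exponent fields 0 and 255 are reserved for special values, so normalized numbers cover a limited range. Overflow: the magnitude is more than about $3.4 \times 10^{38}$. Underflow: the nonzero magnitude is less than about $1.2 \times 10^{-38}$.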

$\pm \infty$

IEEE standard: exponent 1111 1111, significand zero for $\pm \infty$ (the sign bit selects $+\infty$ or $-\infty$)

not a number(NaN)

exponent 1111 1111, significand nonzero.
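The special bit patterns can be built by hand and read back as floats. This is a sketch using the standard `struct` module. `float32_from_fields` is a hypothetical helper, and the particular nonzero significand chosen for NaN is arbitrary.

```python
import math
import struct

def float32_from_fields(sign, exponent, fraction):
    """Assemble a 32-bit pattern from its three fields and read it as a float."""
    bits = (sign << 31) | (exponent << 23) | fraction
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

pos_zero = float32_from_fields(0, 0x00, 0)        # all zeros
neg_zero = float32_from_fields(1, 0x00, 0)        # only the sign bit set
pos_inf = float32_from_fields(0, 0xFF, 0)         # exponent all ones, significand zero
neg_inf = float32_from_fields(1, 0xFF, 0)
nan = float32_from_fields(0, 0xFF, 0x400000)      # exponent all ones, significand nonzero

print(pos_zero, neg_zero, pos_inf, neg_inf, nan)
```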

Another problem: there is a gap between FP numbers and zero!

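smallest positive normalized number: $2^{-126}$
smallest gap between two adjacent normalized numbers: $2^{-149}$ (that is, $2^{-126} \times 2^{-23}$), far smaller than the gap between $2^{-126}$ and zero
Solution: denormalized numbers (exponent field all zeros; no implied leading 1; implicit exponent for all denorms = -126)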
See the IEEE-754 Floating Point Converter (h-schmidt.net) to convert values to and from IEEE 754 floats.
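A quick check of how denormalized numbers fill the gap next to zero, sketched with the standard `struct` module. `float32_from_bits` is a hypothetical helper name.

```python
import struct

def float32_from_bits(bits):
    """Read a 32-bit pattern as a single precision float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

# Exponent field 1, significand 0: the smallest normalized number.
smallest_normalized = float32_from_bits(0x00800000)
# Exponent field 0, significand 1: the smallest denormalized number.
smallest_denorm = float32_from_bits(0x00000001)
# Exponent field 0, significand all ones: the largest denormalized number.
largest_denorm = float32_from_bits(0x007FFFFF)

print(smallest_normalized == 2.0 ** -126)
print(smallest_denorm == 2.0 ** -149)
print(smallest_normalized - largest_denorm == smallest_denorm)
```

Without the implied leading 1, a denorm has the value significand $\times 2^{-126}$, so the denorms step from zero up to the smallest normalized number in equal steps of $2^{-149}$.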

other floating point representations

double precision floating point

extend the 32 bits to 64 bits!
sign: 1 bit; exponent: 11 bits; significand: 52 bits!
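For comparison with single precision, a sketch that splits an IEEE 754 double (64 bits: 1 sign bit, 11 exponent bits, 52 significand bits, exponent offset 1023) into its fields. `decode_float64` is a hypothetical helper name. Python's own `float` is a double on common platforms, so no precision is lost when packing.

```python
import struct

def decode_float64(x):
    """Return (sign, exponent field, significand field) of x as a 64-bit double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, stored with the offset of 1023
    fraction = bits & ((1 << 52) - 1)    # 52 bits
    return sign, exponent, fraction

s, e, f = decode_float64(-6.5)           # -6.5 = -1.625 * 2^2
print(s, e - 1023, f"{f:052b}")
print((-1) ** s * (1 + f / 2**52) * 2.0 ** (e - 1023))
```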