Lecture 2. Computer Representation of Data


Binary digits

Data is kept in a computer as a series of high and low electronic signals, and so they are considered base 2 numbers.

A single digit of a binary number is thus the atom of computing, since all information is composed of binary digits or bits.

In any number base $b$, the value of the $i$th digit $d$ is $$d\times b^i$$ where $i$ starts at $0$ and increases from right to left, eg.

$$1011_2 = (1\times 2^3) + (0\times 2^2) + (1\times 2^1) + (1\times 2^0)= 8_{10}+0_{10}+2_{10}+1_{10}=11_{10}$$

The least significant bit is used to refer to the rightmost bit and most significant bit to the leftmost bit.

Integers

Definition

An integer is a number that can be written without a fractional component.

Computers can perform operations on numbers whose precision is finite and fixed. The ARMv8 subset we have studied in previous lecture allows doublewords (64 bits) to store information. We like to use these 64 bits to encode both postive and negative numbers. The most used representation is two's complement:

\begin{align} 00000000\_00000000\_00000000\_00000000\_00000000\_00000000\_00000000\_00000000_2 &= 0_{10}\nonumber\\ 00000000\_00000000\_00000000\_00000000\_00000000\_00000000\_00000000\_00000001_2 &= 1_{10}\nonumber\\ 00000000\_00000000\_00000000\_00000000\_00000000\_00000000\_00000000\_00000010_2 &= 2_{10}\nonumber\\ &\cdots\nonumber\\ 01111111\_11111111\_11111111\_11111111\_11111111\_11111111\_11111111\_11111111_2 &= 92233720360854775807_{10}\nonumber\\ 10000000\_00000000\_00000000\_00000000\_00000000\_00000000\_00000000\_00000000_2 &= -92233720360854775808_{10}\nonumber\\ 10000000\_00000000\_00000000\_00000000\_00000000\_00000000\_00000000\_00000001_2 &= -92233720360854775807_{10}\nonumber\\ &\cdots\nonumber\\ 11111111\_11111111\_11111111\_11111111\_11111111\_11111111\_11111111\_11111101_2 &= -3_{10}\nonumber\\ 11111111\_11111111\_11111111\_11111111\_11111111\_11111111\_11111111\_11111110_2 &= -2_{10}\nonumber\\ \end{align}
  • Two's complement does have one negative number that has no corresponding positive number: $-92233720360854775808_{10}$.

  • Two's complement has the advantage that all negative numbers have a $1$ in the most significant bit. Thus, hardware needs to test only this bit to see if a number is positive or negative (with the number $0$ is considered positive). This bit is often called the sign bit

  • The conversion to base $10$ is straightforward: $$(x_{63}\times-2^{63})+(x_{62}\times2^{62})+(x_{61}\times2^{61})+\cdots+(x_{1}\times2^{1})+(x_{0}\times2^{0})$$

Arithmetic

Negation

Two's complement notation allows a quick away to negate an integer: simply invert every 0 and 1, then add one to the result.

Sign extension

An integer represented in $n$ bits can be converted easily to an integer represented in more than $n$ bits: take the most significant bit (the sign bit) and replicate it to fill the new bits of the larger representation.

Addition and Subtraction

Addition is just what you expect. Digits are added bit by bit from right to left, with carries passed to the next digit to the left.

Subtraction uses addition: the appropriate operand is simply negated before being added.

\begin{align} \ \ 00000111_2&=7_{10}\nonumber\\ +\ 00000110_2&=6_{10}\nonumber\\ \hline \ \ 00001101_2&=13_{10}\nonumber \end{align}\begin{align} \ \ 00000111_2&=7_{10}\nonumber\\ -\ 00000110_2&=-6_{10}\nonumber\\ \hline \ \ 00000111_2&=7_{10}\nonumber\\ +\ 11111010_2&=-6_{10}\nonumber\\ \hline \ \ 00000001_2&=1_{10}\nonumber \end{align}

Multiplication and Division

Both implementations of these operations in hardware are inspired by the multiplication and division in longhand, eg. \begin{align} \ \ \ \ \ 1000_2 &\nonumber\\ \times\ \ \ \ 1001_2&\nonumber\\ \hline \ \ \ \ \ 1000_2&\nonumber\\ \ \ \ \ 0000_2\ \ &\nonumber\\ \ \ \ 0000_2\ \ \ \ &\nonumber\\ +\ 1000_2\ \ \ \ \ \ &\nonumber\\ \hline \ \ 1001000_2&\nonumber\\ \end{align}

Floating Point Numbers

Definition

Programming languages also support numbers with fractions, which are called reals in mathematics: \begin{align} 3.14159265\dots&:& \pi\nonumber\\ 2.71828\dots&:& e\nonumber\\ 1.0\times10^{-9}&:&0.000000001\nonumber\\ 1.333333\times10^1&:&\frac{40}{3}\nonumber \end{align} The notation for the last two numbers is called normalized scientific notation, which has a single digit to the left of the decimal point different from $0$.

Binary numbers can also be shown in scientific notation: $$1.0\times2^{-1}: 0.5$$ The point in the expression is called binary point.

Numbers for which the binary point is not fixed are called floating point numbers: $$(-1)^{S}\times F\times 2^{E_2}$$

A designer of floating point representation must find a compromise between the size of the fraction and the size of the exponent. This tradeoff is between precision and range.

The ARMv8 subset we have studied uses also doublewords to represent floating point values:

  • S: a single bit
  • F: 52 bits
  • E_2: 11 bits

IEEE 754 encoding

To pack even more bits into the number the IEEE 754 standard makes the leading bit of normalized binary numbers implicit. So the number is 24 bits long in single precision and 53 bits long in double precision:

$$(-1)^{S}\times(1+F)\times2^{E_2}$$

Other features of IEEE 754 are special symbols to represent unusual events, eg. $-\infty$, $+\infty$ and $NaN$. The latter is used as the result of an invalid operation.

To allow easy sorting, the exponent is placed before the fraction and the most negative exponent is represented as $00\dots00_2$ and the most positive as $11\dots11_2$. This convention is called biased notation:

$$(-1)^{S}\times(1+F)\times2^{E_2+B_2}$$

where $B_2$ is in single precision $127_{10}$ and in double precision $1023_{10}$.

Double precision allows numbers as small as $1.7976931348623157e-308$ and as large as $1.7976931348623157e308$. Overflow means that the exponent is too large to be represented in the exponent field and underflow means that the nonzero fraction is so small that it cannot be represented. When an overflow or an underflow occurs, an exception is raised.

The distance between two adjacent representable floating point numbers is not constant, but is smaller for smaller values and larger for larger values.

Arithmetic

Addition

Following algorithm is used:

  1. Compare the exponents of the two numbers; shift fraction of the smaller one to the right until its exponent would match the larger exponent.

  2. Add the fractions.

  3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent.

  4. Check overflow or underflow? If yes, generate an exception.

  5. Round the fraction to the appriorate number of bits.

  6. Is the result still normalized? If yes, we are done; if no go to step 3.

Eg, add $0.5_{10}=0\ 01110\ (1)0000000000$ and $-0.4375_{10}=1\ 01101\ (1)1100000000$ as binary halfwords (16 bits: 1 / 5 / 10):

  1. $1\ 01101\ (1)1100000000 = 1\ 01110\ (0)1110000000$

  2. \begin{align} 0\ 01110\ (1)0000000000 + 1\ 01110\ (0)1110000000 &= 0\ 01110\ (1)0000000000 + 0\ 01110\ (1)0010000000\nonumber\\ &=0\ 01110\ (0)0010000000\nonumber \end{align}

  1. $0\ 01110\ (0)0010000000 = 0\ 01011\ (1)0000000000$

  2. There is no overflow or underflow

  3. There is no change to the bits due to rounding

Multiplication

Following algorithm is used:

  1. Add the exponents of both operands. Attention we have to substract the bias from the sum!
  2. Multiplicate the fractions.
  3. Normalize the product.
  4. Check overflow or underflow? If yes, generate an exception.
  5. Round the fraction to the appriorate number of bits.
  6. Is the result still normalized? If no go to step 3.
  7. Set the sign of the product to positive if the signs of both operands are the same; if they differ make the sign negative.

Eg, multiply $0.5_{10}=0\ 01110\ (1)0000000000$ and $-0.4375_{10}=1\ 01101\ (1)1100000000$ as binary halfwords (16 bits: 1 / 5 / 10):

  1. $01101 + 01110 - 01111 = 01100$
  2. $(1)0000000000 \times (1)1100000000 = (1)1100000000$
  3. $01100\ (1)1100000000$ is normalized
  4. There is no overflow of underflow
  5. There is no change to the bits due to rounding
  6. $01100\ (1)1100000000$ is still normalized
  7. $1\ 01100\ (1)1100000000$ is the final result

Characters

Computers can also process text.

ASCII

The American Standard Code for Information (ASCII) is a representation of the English alphabet. 128 specified characters are encoded into seven-bit integers. Ninety-five of them are printable: these include the digits 0 to 9, lowercase letters a to z, uppercase letters A to Z, and punctuation symbols.

value char value char value char value char value char value char
32 SPACE 48 0 64 @ 80 P 96 ` 112 p
33 ! 49 1 65 A 81 Q 97 a 113 q
34 " 50 2 66 B 82 R 98 b 114 r
35 # 51 3 67 C 83 S 99 c 115 s
36 $ 52 4 68 D 84 T 100 d 116 t
37 % 53 5 69 E 85 U 101 e 117 u
38 & 54 6 70 F 86 V 102 f 118 v
39 ' 55 7 71 G 87 W 103 g 119 w
40 ( 56 8 72 H 88 X 104 h 120 x
41 ) 57 9 73 I 89 Y 105 i 121 y
42 * 58 : 74 J 90 Z 106 j 122 z
43 + 59 ; 75 K 91 [ 107 k 123 {
44 , 60 < 76 L 92 |108 l 124 |
45 - 61 = 77 M 93 ] 109 m 125 }
46 . 62 > 78 N 94 ^ 110 n 126 ~
47 / 63 ? 79 O 95 _ 111 o 127 DEL

UTF-8

UTF-8 is a variable width character encoding capable of encoding all valid code points in Unicode (Universal Coded Character Set) using one to four bytes.

It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.

UTF-8 is the dominant character encoding for the World Wide Web.