Other stuff

String class

Older compilers do not provide the predefined class String. Here is an implementation in StringClass.h. Here a code example how to use it main1.cp. Another code example using it in the Standard Template Library (STL) is here main3.cp

Note:
the declaration of the class and the definition of its methods are packed into a single ".h" header file. When using it, you may have to separate this in a declaration(.h) and method definition(.cpp) files.

String I/O stream

An object of the string class can be used as a stream with the class istrstream.h. Here a code example how to use it main2.cp.

Note:
the declaration of the class and the definition of its methods are packed into a single ".h" header file. When using it, you may have to separate this in a declaration(.h) and method definition(.cpp) files.

Cleaning OSX '._' (dot underscore) files

MacOSX uses hidden files to store system data/resources whose names start with "._". In some cases those files cause trouble. Here is an application for MacOS classic that builds a list of those files and offers their deletion or to make them visible.
The source code is provided so adaptations are possible :
CleanOSX binhex
CleanOSX macbin

Multiprecision and binary number coding

Having to deal with the computation of CRC (Cyclic Redundancy Check) I saw that those use up to 64 bits elements. By looking on the Web I found pages about the implementation and optimisation of 128 bits division. Beeing interested I looked for an implementation on my old 32 bit PowerMac. The Compiler Writer's Guide gives an example for 64 bits division that is a good base. Using the same procedure I started with the 96/64 bits division and later the 128/64 bits division. The division case is often considered as it is the most complex operation.
First a little description of binary numbers.

Coding of unsigned and signed numbers

For simplicity consider the case of 4 bits. Positive numbers would be:
1 : 0001, 2 : 0010, 3 : 0011, 4 : 0100 and 5 : 0101
The negative numbers are coded in 2's complement:
-1 : 1111, -2 : 1110, -3 : 1101, -4 : 1100, -5 : 1011
But if considering the numbers as unsigned:
15 : 1111, 14 : 1110, 13 : 1101, 12 : 1100, 11 : 1011. The ranges of the signed numbers would go from -8 (1000) to 7 (0111), and for the unsigned 0 to 15 (1111).

When adding a positive and a negative number, if the result is >= 0 you get a carry bit that must be ignored. Examples:
1+-1 = 1 0000, -1+2 = 1 0001, -4+5 = 1 0001
The same when adding 2 negative numbers.
But when adding a positive and a negative number, if the result is < 0 you do not get a carry bit. Examples:
-2+1= 0 1111, -4+3= 0 1111, -5+2= 0 1101

But if considering the numbers as unsigned, the carry is then an overflow. The first case would then be 1 + 15 and give a result that can't be represented on 4 bits.
The detection of the other cases of overflow is by considering the carrying bits (when adding binary numbers if adding digits that are both set to 1, 1 + 1 = 10 a bit must be forwarded to the next higher digit) on the left most bit. If the carried in bit (from the next lower bits) is not equal to the carried out bit (the final carry bit), the operation cauuses an overflow.

In the PowerPC documentation (programming environment manual), the overflow bit in XER is set when "the carry out of the msb is not equal to the carry out of msb+1". Probably the right explanation is msb-1, that is the carry in of the msb. The carry is set if there is a carry out of the msb. So it indicates an overflow for the unsigned addition.
Overflow examples:
Signed: -8 + -1 = 1000 + 1111 = 1 0111, 7 + 1 = 0111 + 0001 = 1000
The carry in of the msb is 0 and out is 1, and in the second 1 and 0.
If the two addition are considered unsigned:
8 + 15 = 23 and 7 + 1 = 8, the first has a carry bit indicating an overflow and the second presents no problem.

Some time I was wondering if the subtract instruction sets the overflow and carry bits differently from the addition looking how the overflow by 0x00800000 - 0x80000000 is set considering a little too fast that 0x80000000 is the 2's complement if itself. In fact the overflow happens in this conversion, and not in the final operation!

Multiprecision calculations

To perform calculations with integer numbers of higher size than the processor's precision, the numbers are held in several registers.
Note on multiplication:
The multiplication of 2 arguments with n bits giving a result on 2n bits does not overflow. For signed multiplication:

maximum
	. -2^(n-1) * -2^(n-1)  = + 2^(2n-2) < 2^(2n-1)
	. (2^(n-1)-1) * (2^(n-1)-1) = 2^(2n-2) - (2 * 2^(n-1)) + 1 = 2^(2(n-1)) - 1 - 2*(2^(n-1) - 1) < 2^(2(n-1)) - 1 for n>1 < 2^(2n-1)
minimum . -2^(n-1) * (2^(n-1)-1) = - 2^(2n-2)) + 2^(n-1) > -2(2n-1)

For unsigned:

maximum	. (2^(n) -1) * (2^(n) -1) = 2^(2n) -2 * 2^(n) + 1 = 2^(2n) - 1 - 2 * (2^(n) - 2) < 2^(2n) - 1 for n > 1
And the case 0 is obvious.

Multiprecision 32 bits PowerPC assembler

The division implementations are based on the 64/64 bits division algorithm described in the PowerPC Compile Writer`s Guide.

The PowerPC Compiler Writer's Guide can be found on following links:

https://www.warthman.com/projects-IBM-CWG-PowerPC-compiler-writers-guide.htm
With direct link to the document -> https://www.warthman.com/images/IBM_PPC_Compiler_Writer's_Guide-cwg.pdf
https://e-booksdirectory.com/details.php?ebook=3237
With direct link to the document -> https://cr.yp.to/2005-590/powerpc-cwg.pdf
http://releases-origin.llvm.org/9.0.1/docs/CompilerWriterInfo.html
-> http://www.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF7785256996007558C6 But link does not work anymore
https://archive.org/stream/powerpc-cwg/powerpc-cwg_djvu.txt Raw text lacking of document formatting

As is explained in the CWG, the division algorithm is implemented for unsigned values. For a signed division, the sign must be dealt with separately and the case -2^(N-1)/-1 must be caught.

Recently I renamed the division routines and reordered them between the files. The tests were extended:

The names are div_x_y with x (max) number of bits of the dividend (always 128 or 96) and y the (max) number of (significant) bits of the divisor.
Logically one expects the divisions 128:128, 128:96, 128:64 and perhaps 128:32(not implemented). But when performing the division operation, if the highest bit (when numbering from 0 95, 63 and 31) is 1, a carry may be shifted in the higher register of the dividend so an additional adde has to be performed just to consider the carry as well as a subtraction (subtract 0) to take it into account. (See Important note for details).
If this highest bit is not set, (divisor having not more than 95, 63 or 31 significant bits) then this will not happen allowing to spare an addition and a subtraction in the loop. This may be interesting for performance.
Those routines are named div128_95, 128_63 and 96_63 and defined in file div2.s. A routine div128_31 is not implemented.

A quick description of the files:

test.s : test routines, addition subtraction to check overflow and carry settings and the bits shifting for the 96 and 128 bits divisions.
div.s : 64/64 bits division as in the CWG and a version where the divisor is shifted (got it working). Follows the 96/64 and 128/64 bits implementations (check the note!) and last a 128/96 bits version that uses macros for the bit shifting.
div2.s : 96/63, 128/63 and 128/95 bits implementations (check the note!) that can be used as optimized versions.
div3.s : special case for a 16 bits divisor.
lib_64.s : 64 bits addition and subtraction. With the result, the 3 bits Summary Overflow, Overflow and Carry are given. A multiplication (result on 128 bits) is also implemented that I used to check the division.
lib_128.s : 128 bits addition and subtraction with as before the 3 bits from the XER register.
shifttest.s : definition of macros for bits shifting in 4 registers. For left and right 2 versions, one where the destination register are identical to the source, and one where they differ. This allows instruction reordering for little performance gains.
main_shift.c : Testing routine of the shifting macros.

All those assembler routines can be imported into C. The use is shown in main_div.c implementing the tests.
Further tests are implemented in main_div2.c for the routines in div2.s and div3.s.

Important note:
In the 96 and 128 bits divisions I thought the registers of higher order than the 2 divisor registers were always 0 and suppressed all calculations for them.
Contrary to what I thought first, the registers of higher order in the 96 and 128 bits are not always 0. From bit 63 (numbering start with 0) in the shifted dividend a carry bit may be pushed into bit 64 (generally the bit of higher order than the msb of the divisor).
This case can only happen when the msb of the divisor is set (bit 63, divisor >= 2^63). For example if during the operation you have the 64 higher bits of the dividend < divisor, a 0 is put into the quotient and the dividend shifted.
To illustrate take as simple example the case of 96 bits division with elements of 4 bits instead of 32. We want:

0010 0010 0000 : 1001 0000.
Alignment of the significant msbs gives:

0000	1000	1000
	1001	0000

A 0 would be put into the quotient, and the shift of the dividend will put a 1 into the highest order register. This must be taken into account when comparing the 2 values on the next step!

To test this I first used the two following divisions:

	Test_128.high = 0x02020000;
	Test_128.middle_u = 0x00000000;
	Test_128.middle_l = 0x00000000;
	Test_128.low = 0x00000000;

	Divisor.high = 0x90000000;
	Divisor.low = 0x00000000;

	div_128(&Test_128, &Divisor, &Quot_128, &Remain_64);

	Test_96.high = 0x02020000;
	Test_96.middle = 0x00000000;
	Test_96.low = 0x00000000;

	div_96 (&Test_96, &Divisor, &Quot_96, &Remain_64);

Surprisingly, when calculating the value backwards, I found out that the division gave the correct result! It seems that the last subtraction gives a negative result, setting the carry which is not changed as the last subtraction is suppressed giving the right result for the quotient.
But I found 2 examples where this does not work:

0x4b30 c635 1234 fedc 0001 0000 0000 0000 : 0xb2f3 0101 0000 0001
0x3000 000f 0000 1111 0000 0000 0 .. 0 : 0xe000 000f 0000 0000

For those divisions, the carry into the third register has to be taken into account as done in the safe version of 128:64 and 96:64 bits division algorithm defined in div.s
They of course need a couple of registers and instructions more. So if it is sure that the divisor < 2^63, the routines in file div2.s may be used as optimized version (optimization is still small!).

Combination of 2 divisions to implement division with bigger numbers

To implement the 128 bits division with the 96 bits one, we use a combination of 2 96 bits divisions as follows:
a,b,c,d represent each a 32 bit value, and abcd a 128 bits value:

abcd   abc * 2^32    d    quo_h*dvs+rem_h * 2^32    d                    rem_h*2^32 + d                   quo_l*dvs + rem_l
---- = ---------- + --- = ---------------------- + --- = quo_h * 2^32 + --------------- = quo_h * 2^32 + ------------------
 dvs      dvs       dvs            dvs             dvs                         dvs.                            dvs

The final quotient is then: quo_h * 2^32 + quo_l
And the final remainder : rem_l

This scheme is used in thefollowig case.

Optimisation case of divisor with 16 or less significant bits

Routine div128_16 in file div3.s and usage in main_div2.c.
If the divisor has not more than 16 significant bits, a faster way to implement the division is to use the assembly div instruction.
This routine implements an optimized version of the 128 bits division in case the divisor has not more than 16 significant bits. It simply uses the PowerPC implemented div instruction.
This mathematical process is as follows.
The principle is the same as in the implementation of a 128 division with 2 96 bits division. Let's consider the following operation:

abcd : e
Where a represents 32 bits, b, c, d and e represent all 16 bits:
    abcd      a * 2^48 + b * 2^32 + c * 2^16 + d    a             b             c             d
F = ------ = ----------------------------------- = --- * 2^48 +  --- * 2^32 +  --- * 2^16 +  ---
     e                        e                    e              e             e             e

When considering that every fraction gives a quotient q and a remainder:

x / e = qi + ri  : e

You start by dividing the highest order 32 bits by the 16 bits divisor giving a remainder r1 of 16 bits (at most) and the 32 (at most) highest order bits of the final quotient q1.
The remainder then gives the 16 higher order bits for the next division associated with the next 16 bits of the original dividend (b). This next division provides the following 16 bits of the quotient and a new 16 bits remainder. Again this remainder is associated with the following 16 bits of the original dividend for the next division step.

F = q1 * 2^48 + (r1 * 2^48 + b * 2^32) : e  +  (c* 2^16 + d) : e = q1 * 2^48 + ((r1 * 2^16 + b) : e) * 2^16 +  (c* 2^16 + d) : e
= q1 * 2^48 + q2 * 2^32 +  ((r2 * 2^16 + c) : e)  + d : e

As r1 and b are 16 bits values, the expression (r1 * 2^48 + b * 2^32) just concatenates the 2 values into a 32 bits value.

First the ho word (1) of the dividend is divided by the divisor giving the upper 32 bits of the final quotient and a 16 bits remainder.
This remainder is concatenated with the upper 16 bits of the second word of the dividend (high order half word 2) to build the temporary second dividend whose division by the divisor generates a 16 bit quotient and again a 16 bit remainder. Repeating the operation with every 16 bits slice finally generates the whole quotient. The process can be described with the following figure with the 4 32 bits

+-----------------+-----------------+-----------------+-----------------+
|       1         |        2        |        3        |        4        |	dividend
+-----------------+-----------------+-----------------+-----------------+
|                 |        |        |        |        |0......0 x......x|	divisor
|                 |        |        |        |        |        |        |
|                 |        |        |        |        |        |        |
|  dividend 1     |        |        |        |        |        |        |
|  / divisor      |        |        |        |        |        |        |
|         remaind1|ho hw 2 |        |        |        |        |        |
|            / divisor     |        |        |        |        |        |
|                 |        |        |        |        |        |        |
|                 |remaind2|lo hw 2 |        |        |        |        |
|                 |   / divisor     |        |        |        |        |
|                 |        |        |        |        |        |        |
|                 |        |remaind3|ho hw 3 |        |        |        |
|                 |        |    / divisor    |        |        |        |
|                 |        |        |        |        |        |        |
|                 |        |        |remaind4|lo hw 3 |        |        |
|                 |        |        |   / divisor     |        |        |
|                 |        |        |        |        |        |        |
|                 |        |        |        |remaind5|ho hw 4 |        |


|    quotient 1   |quotie 2|quotie 3|quotie 4|quotie 5|quotie 6|quotie 7|

Last the stuffit archive Multiprec.sit. This also contains a Symantec C++ 8.6 project.

Other example using number coding

Lastly I had a case using the rounding routines (floor and ceil).
When converting an angle to a sector, I came over an optimisation example. Two cases are considered:

- the input angle ranges from 0 to 360 degrees
- the input angle ranges from -180 to +180 degrees.

If the circle is to be divided into 2^n sectors we have (numbered from 1 to 2^n):
Case 1:

	sector = (int)floor(/(360/2^n)) + 1;

Case 2:

	temp = (int)floor(/(360/2^n));
	sector = (temp & (2^n-1)) + 1;

In the last case we take the n lower bits from the temporary value. On the positive range (0 to 180) the values for temp are 1 to 2^(n-1) and on the negative side, (-180 to 0) -2^(n-1) to -1. With the coding of the negative values, the n lower bits taken as unsigned will range from 2^(n-1) to 2^n-1.
When disassembling the C code we have (I took 8 sectors so n = 3):

	return (int)floor(az/45.0) + 1;
00000010: C8210058  lfd        fp1,88(SP)
00000014: C8020000  lfd        fp0,@5(RTOC)
00000018: FC210024  fdiv       fp1,fp1,fp0
0000001C: 48000001  bl         .floor
00000020: 60000000  nop
00000024: FC00081E  fctiwz     fp0,fp1
00000028: D8010038  stfd       fp0,56(SP)
0000002C: 8061003C  lwz        r3,60(SP)
00000030: 38630001  addi       r3,r3,1
00000034: 80010048  lwz        r0,72(SP)
00000038: 38210040  addi       SP,SP,64
0000003C: 7C0803A6  mtlr       r0
00000040: 4E800020  blr

	temp = (int)floor(az/45.0);
00000014: C8210068. lfd        fp1,104(SP)
00000018: C8020000  lfd        fp0,@5(RTOC)
0000001C: FC210024  fdiv       fp1,fp1,fp0
00000020: 48000001  bl         .floor
00000024: 60000000  nop
00000028: FC00081E  fctiwz     fp0,fp1
0000002C: D8010038  stfd       fp0,56(SP)
00000030: 83E1003C  lwz        r31,60(SP)
	return (temp & 0x0007) + 1;
00000034: 57E3077E  clrlwi     r3,r31,29
00000038: 38630001  addi       r3,r3,1
0000003C: 80010058  lwz        r0,88(SP)
00000040: 38210050  addi       SP,SP,80
00000044: 7C0803A6  mtlr       r0
00000048: 83E1FFFC  lwz        r31,-4(SP)
0000004C: 4E800020  blr

Now the the floor routine is meant to return a floating point value, making a conversion to integer necessary (fctiwz instruction).
But the PowerPC processors provide an assembler instruction that directly returns an integer value. I don't know how the floor routine is implemented in the library, but it seems more efficient to use this instruction (fctiw) to implement a routine int new_floor(double x).
This would be the new code:

				dialect	PowerPC

					file	'floor.s'

# Only the sections can be exported (dos not work with labels)
#				export my_floor1[DS]
				export .my_floor1[PR] => 'floor1_s'


# Floor with change of FSCR lower bits
		csect	my_floor1[DS]
				dc.l	.my_floor1[PR]
				dc.l	TOC[tc0]
				
		csect	.my_floor1[PR]
my_floor1:
# Get 64 bits aligned address and 8 bytes on the stack into r12
	addi	r12, r1, -15
	rlwinm	r12, r12, 0, 0, 28
# Save FPSCR at r12 address 
	mffs	fp13
	stfd	fp13, 0(r12)
# If lower 2 bits already 11 round directly
	lwz	r11, 4(r12)
	rlwinm	r10, r11, 0, 30, 31
	cmpi	cr7,1,r10,3
	bc	12,30, Round1	# beq	Round1
# Set lower 2 bits of FPSCR (round towards -INF)
	ori	r11,r11,3
	stw	r11, 4(r12)
	lfd	fp12, 0(r12)
	mtfsf	255, fp12
	fctiw	fp11, fp1
	stfd	fp11, -8(r12)
	lwz		r3, -4(r12)
# Restore FPSCR
	mtfsf	255, fp13
	blr
Round1:
	fctiw	fp11, fp1
	stfd	fp11, -8(r12)
	lwz		r3, -4(r12)
	blr

The rounding type (towards -infinity) must be selected in the FPSCR register by setting the lower 2 bits. If they are not both set additional code is necessary. As this is a leaf routine, no stack frame is necessary.
The other rounding (toward + infinity, zero or to nearest integer) can also be selected.
As the modification of the FPSCR register may cause a context switch, this could cause a little perfomance loss. Therefore and for interest I also implemented a version of the routines without using FPSCR selection. In principle, it starts by cutting the mantissa past the comma that is simply the rounding towards zero (fctiwz) and after performs a correction:

Floor: if the cut part of the mantissa is nonzero, subtract 1 for negative values,
Ceil: if the cut part of the mantissa is nonzero, add 1 for the positive values,
Round to nearest: the first bit after the comma is checked and if nonzero, add 1 for positive values or subtract 1 for negative ones. This rounds a mantissa of x.5 to x+1 (away from zero), but makes things simpler.

Now the special cases of negative exponents had to be implemented. In the test file, 2 versions: the first without branching, and the second using a branch to dissociate cases exp < -1 and exp = -1. Last added this to the floor and ceil routines were it was also missing.
Lastly, I tested what were the default rounding selection in the FPSCR on MacOS Classic, as well as how a value x.5 is rounded. For this I added 2 routines at the end of file Round.s. It seems that the default selection is the rounding to nearest (both lower bits of FPSCR set to 0), and a value of x.5 is always rounded to the even number: this is the so called rounding half to even (see explanations)
Precisely 2*n+0.5 is rounded to 2*n and 2*n+1.5 is rounded to 2*n+2.
With this a bug correction in the stack handling as well as a copy paste error, and a little optimisation (replaced rlwinm by andi. to extract the 2 lower bits of FPSCR).
The assembler sources for the actual library routines can be found here here. There are always 2 versions: one using the settings in the FPSCR and the built-in instruction fctiw, the other not touching the FPSCR using rounding to zero instruction fctiwz and performing the adaptations necessary. I packed the code and project for Symantec C++ 8.6 here. It contains other conversion routines as well as little testing routines and debugging code. See the ReadMe.txt file. Another C project for CodeWarrior Pro 6 holds the C code for sector computation. It is the source of the disassamblies above.

Last update 29 May 2024