Fast Memory Copy

We experimented with faster memory copy routines back when we were using P5-based (Pentium) machines.

* Design

We will first explain the three methods we tested: rep/movsl as used in the standard libc, an unrolled/prefetched copy using integer registers, and an unrolled/prefetched copy using floating-point registers.

* libc

This is the standard memory copy routine in libc (it is also used to move things around in the kernel):
	movb	%cl,%al
	shrl	$2,%ecx			/* copy longword-wise */
	cld
	rep movsl
	movb	%al,%cl
	andb	$3,%cl			/* copy remaining bytes */
	rep movsb
This can move about 40MB/s on our 133MHz Pentium (Triton chipset, 60ns EDO RAM, 256KB PB cache) as a user-level program. The movs instruction, when prefixed by rep, moves data from the area pointed to by the %esi ("source index") register to the area pointed to by %edi ("destination index"). The number of items to move is in the %ecx ("count") register. The routine speeds up the copy by using movsl (move 32-bit longword) until at most 3 bytes remain, and then uses movsb (move byte) to do the rest.
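
The split between longwords and leftover bytes described above can be sketched in portable C. This is only an illustration of the counting logic, not the libc source; the function name copy_words_then_bytes is hypothetical, and a compiler will not necessarily turn these loops into rep/movs instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes as n/4 longwords plus n%4 bytes. */
void copy_words_then_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t words = n >> 2;  /* like shrl $2,%ecx: count of 32-bit longwords */
    size_t tail = n & 3;    /* like andb $3,%cl: 0 to 3 leftover bytes */

    assert(n == 0 || (dst != NULL && src != NULL));

    while (words-- > 0) {   /* the part rep movsl handles */
        uint32_t w;
        memcpy(&w, s, sizeof w);  /* memcpy keeps unaligned access well defined */
        memcpy(d, &w, sizeof w);
        s += 4;
        d += 4;
    }
    while (tail-- > 0)      /* the part rep movsb handles */
        *d++ = *s++;
}
```

For an 11-byte copy, this moves two longwords (8 bytes) and then 3 single bytes.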

* Copy through integer registers

The next version has a hand-unrolled loop with prefetching. The sample code below copies 64 bytes per iteration:
	cmpl $63,%ecx
	jbe unrolled_tail

	.align 2,0x90
unrolled_loop:
	movl 32(%esi),%eax		/* prefetch next cache line */
	cmpl $67,%ecx
	jbe unrolled_tmp		/* and one more if we have */
	movl 64(%esi),%eax		/*   >= 68 bytes to move */

	.align 2,0x90
unrolled_tmp:
	movl 0(%esi),%eax		/* load in pairs */
	movl 4(%esi),%edx
	movl %eax,0(%edi)		/* store in pairs */
	movl %edx,4(%edi)
	movl 8(%esi),%eax
	movl 12(%esi),%edx
	movl %eax,8(%edi)
	movl %edx,12(%edi)
	movl 16(%esi),%eax
	movl 20(%esi),%edx
	movl %eax,16(%edi)
	movl %edx,20(%edi)
	movl 24(%esi),%eax
	movl 28(%esi),%edx
	movl %eax,24(%edi)
	movl %edx,28(%edi)
	movl 32(%esi),%eax
	movl 36(%esi),%edx
	movl %eax,32(%edi)
	movl %edx,36(%edi)
	movl 40(%esi),%eax
	movl 44(%esi),%edx
	movl %eax,40(%edi)
	movl %edx,44(%edi)
	movl 48(%esi),%eax
	movl 52(%esi),%edx
	movl %eax,48(%edi)
	movl %edx,52(%edi)
	movl 56(%esi),%eax
	movl 60(%esi),%edx
	movl %eax,56(%edi)
	movl %edx,60(%edi)
	addl $-64,%ecx
	addl $64,%esi
	addl $64,%edi
	cmpl $63,%ecx
	ja unrolled_loop

unrolled_tail:				/* this part same as libc */
	movl %ecx,%eax
	shrl $2,%ecx
	cld
	rep movsl
	movl %eax,%ecx
	andl $3,%ecx
	rep movsb

Note that it also attempts to prefetch the next cache lines by touching the longwords at src+32 and src+64. (We are assuming the cache line size is 32 bytes.)

This version gives us up to 60MB/s on the same machine, or a 50% speedup, if we unroll the loop enough.
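
The structure of this loop can be sketched in C as follows. This is a rough analogue under stated assumptions, not a translation that will compile to the same instructions: the name copy_unrolled64 and the LINE constant are hypothetical, the "prefetch" is just a dummy read through a volatile variable, and the fixed-count inner loop stands in for the hand-unrolled load/store pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE 32  /* assumed cache line size, as in the text */

/* Hypothetical helper: copy 64 bytes per iteration, touching ahead. */
void copy_unrolled64(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    volatile unsigned char sink = 0;  /* keeps the touch reads from being dropped */

    assert(n == 0 || (dst != NULL && src != NULL));

    while (n > 63) {
        sink = s[LINE];          /* touch the next cache line */
        if (n > 67)
            sink = s[2 * LINE];  /* and one more if >= 68 bytes remain */

        /* Eight load-pair/store-pair steps; written out by hand in the assembly. */
        for (size_t off = 0; off < 64; off += 8) {
            uint32_t a, b;
            memcpy(&a, s + off, 4);      /* load in pairs */
            memcpy(&b, s + off + 4, 4);
            memcpy(d + off, &a, 4);      /* store in pairs */
            memcpy(d + off + 4, &b, 4);
        }
        n -= 64;
        s += 64;
        d += 64;
    }
    (void)sink;

    while (n-- > 0)              /* tail: fewer than 64 bytes left */
        *d++ = *s++;
}
```

The bounds checks mirror the assembly: the read at offset 64 is only issued when at least 68 bytes remain, so it never reaches past the source buffer.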

* Copy through floating-point registers

The last version, which uses floating-point registers instead of integer registers for temporary storage, looks like this:
	cmpl $63,%ecx
	jbe unrolled_tail

4:
	pushl %ecx
	cmpl $1792,%ecx                 /* prefetch up to 1792 bytes */
	jbe 2f                          /* (1792 = 2048 - 256) */
	movl $1792,%ecx
2:
	subl %ecx,0(%esp)
	cmpl $256,%ecx
	jb 5f
	pushl %esi
	pushl %ecx
	.align 4,0x90
3:
	movl 0(%esi),%eax
	movl 32(%esi),%eax
	movl 64(%esi),%eax
	movl 96(%esi),%eax
	movl 128(%esi),%eax
	movl 160(%esi),%eax
	movl 192(%esi),%eax
	movl 224(%esi),%eax
	addl $256,%esi
	subl $256,%ecx
	cmpl $256,%ecx
	jae 3b
	popl %ecx
	popl %esi
5:
	.align 2,0x90
unrolled_loop:
	fildq 0(%esi)
	fildq 8(%esi)
	fildq 16(%esi)
	fildq 24(%esi)			/* load 8 quad (64-bit) words */
	fildq 32(%esi)
	fildq 40(%esi)
	fildq 48(%esi)
	fildq 56(%esi)
	fistpq 56(%edi)
	fistpq 48(%edi)
	fistpq 40(%edi)
	fistpq 32(%edi)                 /* store them in reverse order */
	fistpq 24(%edi)
	fistpq 16(%edi)
	fistpq 8(%edi)
	fistpq 0(%edi)
	addl $-64,%ecx
	addl $64,%esi
	addl $64,%edi
	cmpl $63,%ecx
	ja unrolled_loop
	popl %eax
	addl %eax,%ecx
	cmpl $64,%ecx
	jae 4b

unrolled_tail:				/* this part same as libc */
	movl %ecx,%eax
	shrl $2,%ecx
	cld
	rep movsl
	movl %eax,%ecx
	andl $3,%ecx
	rep movsb
The Intel x86 floating-point unit has eight 80-bit registers organized as a stack. The fildq (floating-point integer load quadword) instruction loads a 64-bit integer into an 80-bit register, converting it to floating point in the process. (Note that there is no data loss, since the 80-bit floating-point format has 64 bits for the significand.) The fistpq (floating-point integer store and pop quadword) instruction does the opposite.

This version can move up to 80MB/s, or a 100% speedup, on the same machine. The speed doesn't seem to go up much with even more unrolling.
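
The "no data loss" point about the 64-bit significand can be checked with a small C sketch. It assumes a platform where long double is the x87 80-bit format (LDBL_MANT_DIG of 64, as with GCC on x86); the function names are hypothetical, and C casts are used here as a stand-in for fildq/fistpq rather than those instructions themselves. A 53-bit double is shown for contrast.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>

/* Integer -> long double -> integer; exact when the significand has >= 64 bits. */
int64_t via_long_double(int64_t v)
{
    long double f = (long double)v;  /* analogous to fildq */
    return (int64_t)f;               /* analogous to fistpq */
}

/* Same trip through a 53-bit double; large odd values get rounded. */
int64_t via_double(int64_t v)
{
    volatile double f = (double)v;   /* volatile forces rounding to double */
    return (int64_t)f;
}

/* Small self-check; skips the long double cases if the format is too narrow. */
void check_roundtrips(void)
{
    int64_t big = (INT64_C(1) << 53) + 1;  /* needs 54 significand bits */

    if (LDBL_MANT_DIG >= 64) {
        assert(via_long_double(INT64_MAX) == INT64_MAX);
        assert(via_long_double(INT64_MIN) == INT64_MIN);
        assert(via_long_double(big) == big);
    }
    assert(via_double(big) != big);        /* lost the low bit */
}
```

Any 64-bit pattern is a valid integer, so the copy loop can push arbitrary data through the FP stack this way, which would not be safe with a plain floating-point load of the same bits into a narrower format.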

* Current Status

The fast memory copy routine is in the FreeBSD kernel starting from release 2.2. It automatically detects a P5-class processor and enables itself. Starting with release 2.2.6, it also runs a small test and enables itself only when it is actually faster than the regular version (it doesn't help on the AMD K6, for instance, even though that is a P5-class chip).
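
The idea of running a small test and enabling the fast routine only when it wins can be sketched in user-level C. This is not the kernel's code: pick_copy, copy_baseline, copy_candidate, the 4096-byte buffer and the repetition count are all hypothetical choices for illustration, and clock() is a coarse timer, so results on a real machine may vary from run to run.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Stand-in for the regular routine. */
void *copy_baseline(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Stand-in for the routine under trial (here just a byte loop). */
void *copy_candidate(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n-- > 0)
        *d++ = *s++;
    return dst;
}

/* CPU time used by reps copies of n bytes with f. */
static clock_t time_copy(copy_fn f, void *dst, const void *src, size_t n, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        f(dst, src, n);
    return clock() - start;
}

/* Return candidate only if it beat baseline in the trial, else baseline. */
copy_fn pick_copy(copy_fn baseline, copy_fn candidate)
{
    static unsigned char src[4096], dst[4096];
    clock_t tb, tc;

    assert(baseline != NULL && candidate != NULL);
    memset(src, 0xA5, sizeof src);
    tb = time_copy(baseline, dst, src, sizeof src, 1000);
    tc = time_copy(candidate, dst, src, sizeof src, 1000);
    return tc < tb ? candidate : baseline;
}
```

A caller would run pick_copy(copy_baseline, copy_candidate) once at startup and use the returned function pointer for later copies; ties go to the baseline, so the candidate is enabled only when it measures strictly faster.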

* Results

Here are the results we measured on our machines as well as others.

* Pentium 133, Triton Chipset, EDO memory

This is our reference platform, a Dell Dimension XPS P133c with the Intel Triton chipset, 256KB pipeline burst cache and 60ns extended data out (EDO) memory. Note that as we unroll the loop further (the first number after the slash (/) is the number of bytes copied in one iteration), the point where the curve separates from the libc version (the lowest green line) moves to the right.

It seems like copying 64 bytes using FP registers (the first blue line) is the best solution.

* Pentium 166, Triton Chipset, EDO memory

This is another Triton, with a 166MHz Pentium, 256KB PB cache and 60ns EDO memory on an Asus PCI/I-P55TP4N motherboard. It is faster than the 133MHz Pentium above. This was contributed by Simon Nybroe.

* Pentium 100, Triton Chipset

Here is yet another Triton with a 100MHz Pentium and 256KB PB cache, but with 60ns regular memory, on an Asus PCI/I-P55TP4XE motherboard. Compared to the two above, the integer copy numbers are slightly (~15%) worse and the FP copy numbers are much (~30%) worse.

* Pentium 90, Triton II Chipset

Here is a Triton II (my own) with a 90MHz Pentium and 512KB PB cache, but 60ns regular memory on an Asus PCI/I-P55T2P4 motherboard. It is about 10% slower than the 100MHz Triton, which makes sense.

* Pentium 90, SiS Chipset

This is the exact same machine with a different motherboard (SiS chipset, no PB cache). See how much difference the motherboard makes.

* Pentium 100, SiS Chipset

This is a DEC Venturis Slimline.

* Pentium 90, Neptune Chipset

This is an HP Vectra.

* Pentium 90, Pluto Chipset

This one was contributed by Marc van Kempen.

* P6 200, Natoma Chipset

This is a 200-MHz P6 donated to us by Intel. It has a Natoma chipset on a "server" motherboard.

Note that using floating-point registers doesn't help at all, as the best it can do is match the speed of libc bcopy. Also, the maximum bandwidth is only about half that of the Dell 133-MHz Pentium above.

* P6 200, Step-B Orion Chipset

This was contributed by Wayne Scott. Note that the fastest numbers come from libc's bcopy; this is because the P6 has a fast string copy mode for rep/movs that kicks in at about 128 bytes. The highest line is for the "server" motherboard (4-way interleaved memory) and the rest were measured on the "desktop" (2-way interleave). The fast string copy doesn't seem to help

* P6 150, (?) Orion Chipset

Here is another P6, contributed by Garrett Wollman. It doesn't seem that the fast string copy mode is helping him.

* P6 180, Orion Chipset

Here is yet another P6, by Andrew Gallatin <gallatin@stat.Duke.EDU>. It is a Micron Pro Magnum-180. It doesn't seem that the fast string copy mode is helping him either.

* 486-100, SiS Chipset

Here is a 486 contributed by Kenneth Merry. We can see that using floating-point registers doesn't help on a 486.

* 486-100, SiS Chipset

The last one is a 486 contributed by Mats Lofkvist. It looks slightly different from the one above although it's the same CPU and chipset (different versions, maybe?). But it's quite clear that floating-point registers won't help in either case.
