Jeff;834087 said:
Pointing out statistics, but not showing the code used to provide these statistics is just dumb. Next time, show the code, or don't post such things...who knows, perhaps you didn't do something properly, thus providing false and/or invalid results.
I overlooked that, oops.
I've been running this in debug mode, on a 64-bit platform.
Test machine is an AMD X2 5700+ (2.7 GHz) with 4GB DDR2 PC3200, running Windows 7 Ultimate 64-bit.
Code:
using System;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Tester testing = new Tester();
            testing.RunTest();
        }
    }

    public class Tester
    {
        private int m_Index;
        private int m_BufferLength = 512;

        public void RunTest()
        {
            System.Console.WriteLine("Testing System.BitConverter.GetBytes() against Bit-shifting");
            System.Console.WriteLine("Test run of BitShifted:");
            // 100,000,000 runs should return an average enough result. (And show marked differences)
            Random RNG = new Random();
            Stopwatch watch = new Stopwatch();
            watch.Start();
            // Little bit of overhead from the RNG, but I don't think it will affect the results too much.
            for (int i = 0; i < 100000000; i++)
            {
                Bitshifted(RNG.Next());
            }
            watch.Stop();
            System.Console.WriteLine("Bitshifted finished in {0}", watch.Elapsed);

            System.Console.WriteLine("Test run of GetBytes:");
            // Reset before restarting, otherwise the second timing includes the first run.
            watch.Reset();
            watch.Start();
            // Same as above, but at least it's equal overhead on both methods.
            for (int i = 0; i < 100000000; i++)
            {
                Managed(RNG.Next());
            }
            watch.Stop();
            System.Console.WriteLine("GetBytes finished in {0}", watch.Elapsed);
            System.Threading.Thread.Sleep(10000);
        }

        public void Flush()
        {
            // Stubbed.
        }

        /// <summary>
        /// This method uses bit-shifting to return a bytestream.
        /// </summary>
        public byte[] Bitshifted(int toWrite)
        {
            m_Index = 0;
            byte[] TheBuffer = new byte[4]; // 4 bytes is enough for an int
            if ((this.m_Index + 4) > this.m_BufferLength)
            {
                this.Flush();
            }
            TheBuffer[this.m_Index++] = (byte)(toWrite >> 0x18 & 0xFF);
            TheBuffer[this.m_Index++] = (byte)(toWrite >> 0x10 & 0xFF);
            TheBuffer[this.m_Index++] = (byte)(toWrite >> 0x8 & 0xFF);
            TheBuffer[this.m_Index++] = (byte)(toWrite & 0xFF);
            return TheBuffer;
        }

        /// <summary>
        /// This method uses System.BitConverter.GetBytes() to return a bytestream.
        /// </summary>
        public byte[] Managed(int toWrite)
        {
            m_Index = 0;
            if ((this.m_Index + 4) > this.m_BufferLength)
            {
                this.Flush();
            }
            // Apply the returned data directly, rather than calling it four times.
            byte[] TheBuffer = System.BitConverter.GetBytes(toWrite);
            return TheBuffer;
        }
    }
}
there's no "different architecture" for C#, there's just the one defined in .NET standard.
What I meant by "different architecture" was moving from a big-endian CPU to a little-endian
machine, such as running Mono on a Sun SPARC (big-endian); Windows (x86/x64) is little-endian.
Unless, unbeknownst to me, .NET shields you from this somehow?
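As far as I know .NET doesn't shield you from it, but it does at least expose the native byte order through `BitConverter.IsLittleEndian`, so you can detect it at runtime. A quick sketch (the expected bytes in the comment are for a little-endian box like x86/x64):

```csharp
using System;

class EndianProbe
{
    static void Main()
    {
        // GetBytes emits the int in the machine's native byte order,
        // which is exactly why the result differs between CPUs.
        byte[] bytes = BitConverter.GetBytes(0x01020304);

        Console.WriteLine("IsLittleEndian: {0}", BitConverter.IsLittleEndian);
        // "04-03-02-01" on x86/x64; "01-02-03-04" on a big-endian CPU.
        Console.WriteLine(BitConverter.ToString(bytes));
    }
}
```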
I admit that GetBytes being significantly slower than bit-shifting seems odd. At first, I figured it was down to the
stack and call overhead from calling into the GetBytes function so frequently.
So I had a look at the code for System.BitConverter.GetBytes(int):
Code:
public static unsafe byte[] GetBytes(int value)
{
byte[] buffer = new byte[4];
fixed (byte* numRef = buffer)
{
*((int*) numRef) = value;
}
return buffer;
}
At first glance it seems like that'd be faster than the bit-shifting; it's just a straight-up memory mangle rather than a bunch
of maths, and it should only take a few assembly instructions to perform. (And it does.)
So I modified the testing code and planted the innards of GetBytes straight into my test method, to reduce the effect of
calling into another method so frequently, then fired up the test with the modified Managed(int) method and..
Making Managed() into:
Code:
/// <summary>
/// This method inlines the innards of System.BitConverter.GetBytes()
/// to cut out the extra method call.
/// </summary>
public unsafe byte[] Managed(int toWrite)
{
    //m_Index = 0;
    //if ((this.m_Index + 4) > this.m_BufferLength)
    //{
    //    this.Flush();
    //}
    byte[] buffer = new byte[4];
    fixed (byte* numRef = buffer)
    {
        *((int*)numRef) = toWrite;
    }
    return buffer;
}
Returns:
Code:
Testing System.BitConverter.GetBytes() against Bit-shifting
Test run of BitShifted:
Bitshifted finished in 00:00:08.8765452
Test run of GetBytes:
GetBytes finished in 00:00:14.3915536
It's still slower. Poking around in the assembly gives this:
Code:
TheBuffer[this.m_Index++] = (byte)(toWrite >> 0x18 & 0xFF);
00000068 mov rax,qword ptr [rsp+000000C0h]
00000070 mov eax,dword ptr [rax+8]
00000073 mov dword ptr [rsp+34h],eax
00000077 mov eax,dword ptr [rsp+34h]
0000007b mov dword ptr [rsp+30h],eax
0000007f mov ecx,dword ptr [rsp+34h]
00000083 add ecx,1
00000086 mov rax,qword ptr [rsp+20h]
0000008b mov qword ptr [rsp+38h],rax
00000090 mov rax,qword ptr [rsp+000000C0h]
00000098 mov dword ptr [rax+8],ecx
0000009b mov eax,dword ptr [rsp+000000C8h]
000000a2 sar eax,18h
000000a5 and eax,0FFh
000000aa mov dword ptr [rsp+40h],eax
000000ae movsxd rcx,dword ptr [rsp+30h]
000000b3 mov rax,qword ptr [rsp+38h]
000000b8 mov rax,qword ptr [rax+8]
000000bc mov qword ptr [rsp+48h],rcx
000000c1 cmp qword ptr [rsp+48h],rax
000000c6 jae 00000000000000D4
000000c8 mov rax,qword ptr [rsp+48h]
000000cd mov qword ptr [rsp+48h],rax
000000d2 jmp 00000000000000D9
000000d4 call FFFFFFFFF49A6950
000000d9 mov rdx,qword ptr [rsp+38h]
000000de mov rcx,qword ptr [rsp+48h]
000000e3 movzx eax,byte ptr [rsp+40h]
000000e8 mov byte ptr [rdx+rcx+10h],al
Yeah.. that's just one of the array operations; the other three have roughly the same number of movs behind them, which
partly confirms my theory about bit-shifting being more operations on the processor.
Looking at the disassembly for the Managed() method gives:
Code:
fixed (byte* numRef = buffer)
00000060 mov rax,qword ptr [rsp+20h]
00000065 mov qword ptr [rsp+38h],rax
0000006a cmp qword ptr [rsp+20h],0
00000070 je 000000000000007F
00000072 mov rax,qword ptr [rsp+38h]
00000077 mov rax,qword ptr [rax+8]
0000007b test eax,eax
0000007d jne 000000000000008A
0000007f mov qword ptr [rsp+28h],0
00000088 jmp 00000000000000C8
0000008a mov rax,qword ptr [rsp+38h]
0000008f mov rax,qword ptr [rax+8]
00000093 mov qword ptr [rsp+40h],0
0000009c cmp qword ptr [rsp+40h],rax
000000a1 jae 00000000000000AF
000000a3 mov rax,qword ptr [rsp+40h]
000000a8 mov qword ptr [rsp+40h],rax
000000ad jmp 00000000000000B4
000000af call FFFFFFFFF49A6660
000000b4 mov rcx,qword ptr [rsp+38h]
000000b9 mov rax,qword ptr [rsp+40h]
000000be lea rax,[rcx+rax+10h]
000000c3 mov qword ptr [rsp+28h],rax
{
000000c8 nop
*((int*)numRef) = toWrite;
000000c9 mov rcx,qword ptr [rsp+28h]
000000ce mov eax,dword ptr [rsp+68h]
000000d2 mov dword ptr [rcx],eax
}
000000d4 nop
000000d5 mov qword ptr [rsp+28h],0
return buffer;
000000de mov rax,qword ptr [rsp+20h]
000000e3 mov qword ptr [rsp+30h],rax
000000e8 jmp 00000000000000EA
So, it seems like the memory copy ( *((int*)numRef) = toWrite; ) is only 3 instructions in total, but setting up the fixed() requirement
for the source bytes is what takes so long. The code's a bit spaghetti'fied, but I can at least tell that the jmps only move around
within the region of the pinning. There's a call that goes somewhere else; I presume that's to tell the GC to pin the memory in place. (KeepAlive()?)
I guess if you could do the memory assignment directly, without having to pin the source buffer into memory to protect
it from the GC, GetBytes() might be faster. Maybe the compiler is quietly optimizing the code behind the scenes with MMX/SIMD
instructions? My assembly is not as good as it used to be, and I can only get the gist of the compiler's output. I know that MOVSXD is part
of the x86-64 additions, but the rest looks like plain un-SIMD'd instructions to me.
If anyone here understands assembly better, feel free to explain what's going on.
So, in summary: yes, bit-shifting is still faster than GetBytes no matter how you mangle it. I still have concerns that the bit-shifting will
break if you moved to a big-endian machine, but 220%+ performance seems like a good reason to just branch a conditional for other-endian
machines.
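For what it's worth, that conditional could be as simple as checking `BitConverter.IsLittleEndian` once and reversing when the native order doesn't match the order you want on the wire. A rough sketch (the helper name is mine, and I haven't tried it on an actual big-endian box):

```csharp
using System;

static class ByteSplitter
{
    // Always produce big-endian (network-order) bytes, the same order
    // the bit-shift version emits, regardless of the CPU's native order.
    public static byte[] GetBytesBigEndian(int value)
    {
        byte[] bytes = BitConverter.GetBytes(value); // native order
        if (BitConverter.IsLittleEndian)
            Array.Reverse(bytes); // flip only on little-endian CPUs
        return bytes;
    }
}

class Demo
{
    static void Main()
    {
        byte[] b = ByteSplitter.GetBytesBigEndian(0x01020304);
        Console.WriteLine(BitConverter.ToString(b)); // "01-02-03-04" on any CPU
    }
}
```

On x86 that pays for the Array.Reverse on every call, so the shifts probably still win there; the point is only that the endian branch is a one-liner.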
I also can't help but wonder if you could make this *really* fast by buffering up all the ints you want to split into bytes, feeding
them into a CUDA/OpenCL program, running them through a stream processor, and getting the result back as a texture or something.
You could also maybe use a cached MemoryStream object to push ints into and get byte[]s back out, as below, but that'd
probably have an even bigger overhead, and now it's into the realm of vastly over-engineering a simple problem.
Code:
using (MemoryStream stream = new MemoryStream())
{
    using (BinaryWriter writer = new BinaryWriter(stream))
    {
        writer.Write(src); // src is the int to convert
        return stream.ToArray();
    }
}
Yeah, long post, I know. I got interested in why GetBytes was slower even though it looks like it shouldn't be. I'm still
using the bit shifts in my normal code, and will continue to do so until fixed() operations become less expensive to call en masse.