SSE SIMD and ARM NEON research
I started off wanting to know the CPU cycle counts and possible cache misses of SSE SIMD instructions, but was kind of mind-blown at what SIMD can actually do. There’s also a hell of a lot of SIMD stuff that even the best compilers won’t produce from plain C or C++ code. An example is pulling the sign bit of each value in a SIMD register into EAX, which is damn handy for some math.
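To make the sign-bit trick concrete, here is a minimal sketch in C using intrinsics rather than raw assembler. The function name sign_mask4 is made up for illustration. The SSE path relies on _mm_movemask_ps, which maps to the MOVMSKPS instruction, and a plain scalar fallback is included for builds where the compiler does not define __SSE__.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: returns a 4-bit mask where bit i is set
 * when v[i] has its sign bit set (so -0.0f counts as negative). */
static int sign_mask4(const float v[4])
{
#if defined(__SSE__)
    __m128 x = _mm_loadu_ps(v);  /* unaligned load of four floats */
    return _mm_movemask_ps(x);   /* MOVMSKPS: sign bits into a general register */
#else
    /* Scalar fallback: read the top bit of each float by hand. */
    int mask = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &v[i], sizeof bits);
        mask |= (int)(bits >> 31) << i;
    }
    return mask;
#endif
}
```

Calling it on { -1.0f, 2.0f, -3.0f, 4.0f } gives 5, because elements 0 and 2 are negative and so bits 0 and 2 are set.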
The original SSE stuff is only the baseline, because there’s now SSE4 (just checked, and there is; behind the times, I am). Being able to do multiple math operations on a single register has tickled my interest for a long time. Tonight I decided to put some proper research into it. Until I got sidetracked by ARM NEON.
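As a rough illustration of "multiple math operations on a single register", the sketch below adds four pairs of floats with one packed add. The name add4 is invented for this example. The SSE path uses the _mm_add_ps intrinsic (the ADDPS instruction), and there is a scalar loop for builds where __SSE__ is not defined.

```c
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i = 0..3. */
static void add4(const float a[4], const float b[4], float out[4])
{
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);             /* four floats in one register */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* one ADDPS does all four sums */
#else
    /* Scalar fallback: one add per element. */
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

The SSE path replaces four scalar adds with a single packed add, which is the part that got me interested.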
Anyway, that’s beside the point, and before I go on to the ARM stuff: over the next few nights at least, I’m going to be trying out more assembler programming, this time using SIMD instructions, possibly to implement a noise algorithm. Later on, not over the next few nights, I will look at using this experience for 3D matrix calcs.
But… Then I looked deeper into the ARM NEON…
What I found with NEON is that it is kind of like a hyper-threaded architecture. From what I read, the CPU will run 2 NEON instructions per cycle, but during CPU downtime (stalls, waits, etc.). I need to dig deeper to understand that, but it does sound very much like the way hyper-threading works on the Intel Core processors. Still good.
Another thing I liked about the ARM assembler language is that it is just so awesome. In standard assembler, you load registers, multiply one register by 2/4/8, add them, and the last instruction grabs the result. In ARM, you load the basic registers, and in one instruction you can apply offsets and bit manipulation to get the address and store the result. Crazy.
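Here is a small C sketch of the kind of operation I mean. The function names are made up. The ARM instructions in the comments are roughly what I would expect a compiler to emit for 32-bit ARM, based on the shifted-register operand forms in the instruction set, and are not verified compiler output.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store value at base[index].
 * The address is base + index * 4. On 32-bit ARM this can be a single
 * store with a shifted index, roughly: STR r2, [r0, r1, LSL #2] */
static void store_scaled(int32_t *base, size_t index, int32_t value)
{
    base[index] = value;
}

/* Hypothetical helper: a + b * 8 as one shift-and-add.
 * On 32-bit ARM, roughly: ADD r0, r0, r1, LSL #3 */
static uint32_t add_shifted(uint32_t a, uint32_t b)
{
    return a + (b << 3);
}
```

The shift rides along as part of the operand, so no separate multiply or shift instruction is needed before the add or the store.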
I’ll come back to the ARM stuff later on. For now, I’ll be focusing on x86_64 and all the SIMD stuff. Over the next few days I’ll run some tests and hopefully post some test code. That is, if I get something running.