|
Location: Desktop development - C/C++ License: The Microsoft Public License (Ms-PL) SSE2 Vectorization of Alphablend CodePosted by Wong Shao VoonUsing SSE2 to speed up alphablending |
Skill: BeginnerPosted: 29/10/2011Views: 164Rating: 5.00 /5Popularity: 0.00 |
| Sign Up to vote for this article |
In this article, we will explore vectorizing the pixel alphablending code in my earlier article. I believe most C/C++ programmers has tinkered with SIMD instructions like MMX and Streaming SIMD Extensions 2 (SSE) or even Advanced Vector Extensions (AVX). Most programmers, I believe, gave up after their first SIMD attempt, after that effort produced an mediocre or worse performance than the original non-vectorizing code. The main reason for slow performance, is CPU cycles are typically wasted in packing the primitive data into the vector prior to SIMD execution and unpacking them into their proper place in the structure after execution. We will dive deeper into this in the next section. Main reason that SIMD is slow to be adopted by programmers around the world, is because SIMD code is harder to write and the resultant code, even with using instrinics, are not readable and even less understandable.
In this article, we will only focus on SSE2; the reason being MMX is a 64bit SIMD technology which does not offer significant speed up compared to SSE2 (mostly 128 bit SIMD from Intel) on today's 64-bit processors, and Advanced Vector Extensions (AVX) (256 bit SIMD from Intel) is not chosen for this article because not many mainstream users own these latest Intel processors that can exploit AVX: SSE2 has been around since 2001. (Okay, I admit, the real reason is actually because I do not own AVX supported processor.) All, but oldest, personal computers should have housed a x86/x64 processor which support SSE2. Rather than using inline assembly code, we will use SSE2 instrinics provided by Visual Studio to assist us in calling the SSE2 instructions: Currently, Microsoft 64 bit C/C++ compiler does not permit inline assembly code.
Let us imagine we have a structure type named SomeStruct as shown below and we have a array of SomeStruct, essentially, an array of structures.
To use SSE2 arithmetic instructions on aInteger, we must first pack 4 aInteger into a 128 bit vector of integer type, __m128i.

After we have got the computation result, we must unpack 4 aInteger into its individual structure objects in the array.

This is usually what kills the performance! To enhance performance, programmers are advised to pack their data into structure of arrays rather than array of structures; the same advice is applicable to GPGPU programming so as to allow memory coalescing. It is often not possible to alter the existing data domain/model to unnaturally suit the SIMD data packing requirement. In the structure shown below, arrInteger, arrDouble and arrShort are usually pointers to dynamic allocated arrays whose size is only known during runtime but I show them here to be of array type so as to emphasize they are of array type rather than pointer type.
As the reader may have noticed, integer, double and short data type have different sizes. 4 integers would fit into __m128i and 2 doubles would fit into __m128i and 8 shorts would fit into __m128i. This might present difficulties, when programmer are writing code to loop through the __m128i arrays to do computation. For our case, this is not a problem, because we are only using unsigned short type in this alphablend article.

For our SSE2 alphablend code, we will, as I have just said, use unsigned short type to store each color channel. The reader may ask why use unsigned short to store a value which only ranged from 0 to 255? The answer is due to the multiplication used, the intermediate values would most likely exceed 255 while the final result is still stayed within 255. In normal coding, we can use byte variables to do the same computation, without using 16-bit variable. The reason is because the byte value is automatically promoted to integer (word size of the platform) before any computation takes place and is also automatically converted back to byte after computation. We have declared __m128i array to store unsigned short for each color channel of 2 alphablending images. m_fullsize is to store the full size of the array in bytes. m_arraysize is the array size in terms of 128-bit sized element. m_remainder is to hold the number to indicate which of the unsigned short are valid in the last 128-bit element if the number of unsigned short required is not cleanly divisble by 8 (8 x 16 bits=128-bits). For example, 4 integers fits into the __m128i, if we have 6 integers to computer, array of 2 __m128i element is needed but on the last __m128i element, only 2 integers is used, thus m_remainder would reflect 2.
With GetUShortElement method, we can get unsigned short element from the __m128i array, using an index which counts in terms of unsigned short elements.
With Get128BitElement method, we can get __m128i element from the __m128i array, using an bigIndex which counts in terms of __m128i elements. bigSize is size of __m128i array in 128-bit chunks. smallRemainder is number of valid unsigned short element in the last __m128i element. When we detected that it is last element and the number of short integer is not clearly divisible by 8, we need to set the unused shorts in the last __m128i element to 1s, so as to prevent division-by-zero error. It is a good practice to do this for every last element returned, regardless it is integer type or float type.
We must take care to allocate our __m128i array with _aligned_malloc function so that they would align with 16 bytes boundary as required by 128-bit SSE2 instructions. We calculate and store remainder if the number of unsigned short required is not cleanly divisble by 8 (as explained before)(Note: there are 8 unsigned short (16 bits data type) in 128-bit).
After allocating our __m128i arrays, we will populate them with the color information from the image object. We use GetUShortElement to get our unsigned short element from __m128i array.
Lastly, we must remember to deallocate the __m128i arrays with _aligned_free in destructor of the parent class. This is good practice to remember to write deallocation code before we begin our main coding.
Before we run any SSE2 code, we should ensure that SSE2 is supported on the processor first. We can use the SIMD Detector class, written by the author, to do the check.
Now, we have come to the meat of article: using SSE2 to do alphablending. The formula used is ((SrcColor * Alpha) + (DestColor * InverseAlpha)) >> 8. The below code listing is heavily commented. We use _mm_set1_epi16 instruction to set the same short value (0) in every short element in the __m128i element. _mm_mullo_epi16 instruction is used to multiply the unsigned short and only retain the 16 bits if the final result is overflowed into 32 bits. We used _mm_add_epi16 instruction to adding of unsigned shorts. _mm_srli_epi16 instruction is to do shift right by 8 times to simulate division by 256. We can get away without using division instruction which is a blessing because SSE2 does not support division for integers.

This is timing of doing alphablending 1000 times on a Intel i7 870 at 2.93 Ghz. The performance characteristic is different from what is posted in the earlier article because the benchmarks are done on different PCs. The earlier PC used, has 'retired' from useful service in this world.

(Lower is better) Unoptimized Code : 4679 milliseoncds Optimized Code : 2108 milliseoncds Optimized Code with Shift : 1931 milliseoncds SSE2 Optimized Code : 862 milliseoncds
Below is number of times of speedup we got when compared to the unoptimized version. But comparing the SSE2 speedup to the 2nd runner up (Optimized Code with Shift), the speedup is only 2.2 times, not the 8 times as we have expected. So what is the memory consumption is needed for this SSE2 speedup? 2 bytes for each color channel instead of 1 byte. The original implmentation need 4 bytes to store a color pixel, ARGB. (4 bytes is used because processor can load and store 32 bit data faster than 24 bit data which may required shifting it to proper 32 bit boundary before loading inside the 32 bit register). The SSE2 version need 6 bytes to store a color pixel, RGB. Individual alpha channel is not stored because they are not used. SSE2 version would need 50% more of memory to perform its magic.
(Higher is better) Unoptimized Code : 1x Optimized Code : 2.2x Optimized Code with Shift : 2.4x SSE2 Optimized Code : 5.4x
We have used SSE2 to speed up from the previous fastest performance by 2 times. The speedup is not dramatic. It is probably not worth the amount of extended effort and unreadable code to do this. To achieve better readability, we could use the vector class, by Agner Fog of Copenhagen University College of Engineering, which basically extends the ones listed in dvec.h header. These vector class support overloaded operators for -, +, *, / and so on. And they also support the missing operations like multiplication and division for some integer type in SSE2 though those operations are sometimes implemented using the scalar version of instruction (not SSE2, so they have no speed advantage). This eliminates the programmer's effort to implement these unsupported/missing operations himself/herself. Any suggestion to improve the performance further is welcome. We could ramp up performance by using OpenMP or Parallel Patterns Library (PPL) in tandem with SSE2. But that is an article for another day.
This article, along with any associated source code and files, is licensed under The Microsoft Public License (Ms-PL)
| Wong Shao Voon
| I guess I'll write here what I does in my free time, than to write an accolade of skills which I currently possess. I believe the things I does in my free time, say more about me. When I am not working, I like to watch Japanese anime. I am also writing some movie script, hoping to see my own movie on the big screen one day. I like to jog because it makes me feel good, having done something meaningful in the morning before the day starts. I also writes articles for IntelliProject; I have a few ideas to write about but never get around writing because of hectic schedule. Location: |
Sign up to post message on the article message board!