---
created_at: '2017-09-08T12:26:28.000Z'
title: Implementing strcmp, strlen, and strstr using SSE 4.2 instructions (2008)
url: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2
author: tambourine_man
points: 101
story_text:
comment_text:
num_comments: 21
story_id:
story_title:
story_url:
parent_id:
created_at_i: 1504873588
_tags:
- story
- author_tambourine_man
- story_15199648
objectID: '15199648'
year: 2008
---
[Source](https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 "Permalink to Implementing strcmp, strlen, and strstr using SSE 4.2 instructions - strchr.com")
_Abstract:_
Using new Intel Core i7 instructions to speed up string manipulation.
Created 9 years ago by _Peter Kankowski_
Last changed 4 years ago
_Contributors_: Dap and Adam Messinger
Filed under _Low-level code optimization_
# Implementing strcmp, strlen, and strstr using SSE 4.2 instructions
## The new instructions
SSE 4.2 introduces four instructions (PcmpEstrI, PcmpEstrM, PcmpIstrI, and PcmpIstrM) that can be used to speed up text processing code (including strcmp, memcmp, strstr, and strspn functions).
Intel has published [a description of the new instruction formats][3], but no sample code or high-level guidelines. This article tries to provide them.
```asm
MovDqU    xmm0, dqword[str1]
PcmpIstrI xmm0, dqword[str2], imm
```
_PcmpIstrI_ is one of the new string-handling instructions that compare their operands. The first operand is always an SSE register (typically loaded with MovDqU). The second operand can be a memory location. Note **the immediate operand**, which consists of several bit fields controlling operation modes.
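In C or C++, the same instruction is exposed by compilers as the _mm_cmpistri intrinsic in <nmmintrin.h>; the _SIDD_* constants are simply the bit fields of the immediate operand. A minimal sketch (mine, not from the original article; the function name is made up):

```c
#include <nmmintrin.h>   /* SSE 4.2 string intrinsics */

/* Returns the index of the first byte at which the two 16-byte chunks differ
   (treating the end of one string as a difference), or 16 if they compare
   equal. The immediate selects unsigned bytes, "equal each" aggregation and
   negative polarity -- the same EQUAL_EACH + NEGATIVE_POLARITY value used by
   strcmp_sse42 later in the article. */
int first_diff_index(const char *str1, const char *str2)
{
    __m128i a = _mm_loadu_si128((const __m128i *)str1);   /* MovDqU    */
    __m128i b = _mm_loadu_si128((const __m128i *)str2);
    return _mm_cmpistri(a, b,                              /* PcmpIstrI */
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                        _SIDD_NEGATIVE_POLARITY);
}
```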
## Aggregation operations
The heart of a string-processing instruction is **the aggregation operation** (immediate bits [3:2]).
**Equal each** (imm[3:2] = 10). This operation compares two strings (think of strcmp or memcmp). The result of the comparison is a bit mask (1 if the corresponding bytes are equal, 0 if not equal). For example:
```
operand1 = "UseFlatAssembler"
operand2 = "UsingAnAssembler"
IntRes1  =  1100000111111111
```
**Equal any** (imm[3:2] = 00). The first operand is a character set, the second is a string (think of strspn or strcspn). The bit mask includes 1 if the character belongs to the set, 0 if not:
```
operand2 = "You Drive Me Mad", operand1 = "aeiouy"
IntRes1  =  0110001010010010
```
**Ranges** (imm[3:2] = 01). The first operand consists of ranges, for example, "azAZ" means "all characters from a to z and all characters from A to Z":
```
operand2 = "I'm here because", operand1 = "azAZ"
IntRes1  =  1010111101111111
```
**Equal ordered** (imm[3:2] = 11). Substring search (strstr). The first operand contains a string to search for, the second is a string to search in. The bit mask includes 1 if the substring is found at the corresponding position:
```
operand2 = "WhenWeWillBeWed!", operand1 = "We"
IntRes1  =  0000100000001000
```
After computing the aggregation function, IntRes1 can be complemented, expanded into a byte mask, or shrunk into an index. The result is written into the XMM0 or ECX register. The Intel manual explains these details well, so there is no need to repeat them here.
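As a concrete check of the "equal any" example, the following small program (a sketch of mine, not from the article; the buffer layout is an assumption) reproduces the IntRes1 mask for "You Drive Me Mad" using the _mm_cmpistrm intrinsic with bit-mask output:

```c
#include <nmmintrin.h>
#include <stdio.h>

int main(void)
{
    /* 16-byte buffers: the character set is zero-terminated inside the
       chunk, the text fills all 16 bytes. */
    const char set[16]  = "aeiouy";
    const char text[16] = "You Drive Me Mad";

    __m128i s = _mm_loadu_si128((const __m128i *)set);
    __m128i t = _mm_loadu_si128((const __m128i *)text);

    /* PcmpIstrM, "equal any", bit-mask output: bit i of the result is 1
       when text[i] belongs to the set. */
    __m128i m = _mm_cmpistrm(s, t, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                                   _SIDD_BIT_MASK);

    /* Prints 4946: bits 1, 2, 6, 8, 11, 14 -- the same pattern as
       IntRes1 = 0110001010010010 above, read with bit 0 as the first
       character. */
    printf("%04x\n", (unsigned)(_mm_cvtsi128_si32(m) & 0xffff));
    return 0;
}
```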
## Other features of SSE 4.2
* The strings do not need to be aligned.
* The processor properly handles the end-of-string cases for both zero-terminated strings and Pascal-style strings.
* You can use the instructions with Unicode (16-bit) characters as well as signed or unsigned bytes (see the sketch after this list).
* Four aggregation operations can be used to implement a wide range of string-processing functions.
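To illustrate the Unicode point above, here is a minimal sketch of mine (the name length_utf16 and the overall approach are assumptions, not the article's code) that measures the length of a zero-terminated UTF-16 string in 16-bit code units using the word-sized element mode. Like the byte version below, it reads unaligned 16-byte chunks:

```c
#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t length_utf16(const uint16_t *s)
{
    const __m128i zero = _mm_setzero_si128();
    size_t offset = 0;                  /* offset counted in 16-bit units */
    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + offset));
        /* Word mode processes 8 elements per chunk; the intrinsic returns
           the index of the first zero word, or 8 if none is present. */
        int idx = _mm_cmpistri(zero, chunk,
                               _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx != 8)
            return offset + (size_t)idx;
        offset += 8;
    }
}
```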
## Implementation
The following strcmp and strlen functions were written and published before processors with SSE 4.2 support were available. Later, they were tested on real hardware and found to be correct.
Compile with [FASM][4]:

```asm
; Immediate byte constants
EQUAL_ANY         = 0000b
RANGES            = 0100b
EQUAL_EACH        = 1000b
EQUAL_ORDERED     = 1100b
NEGATIVE_POLARITY = 010000b
BYTE_MASK         = 1000000b

; ==== strcmp ====
strcmp_sse42:
        ; Using __fastcall convention, ecx = string1, edx = string2
        mov     eax, ecx
        sub     eax, edx                ; eax = ecx - edx
        sub     edx, 16
STRCMP_LOOP:
        add     edx, 16
        MovDqU  xmm0, dqword[edx]
        ; find the first *different* bytes, hence negative polarity
        PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH + NEGATIVE_POLARITY
        ja      STRCMP_LOOP
        jc      STRCMP_DIFF
        ; the strings are equal
        xor     eax, eax
        ret
STRCMP_DIFF:
        ; subtract the first different bytes
        add     eax, edx
        movzx   eax, byte[eax + ecx]
        movzx   edx, byte[edx + ecx]
        sub     eax, edx
        ret

; ==== strlen ====
strlen_sse42:
        ; ecx = string
        mov     eax, -16
        mov     edx, ecx
        pxor    xmm0, xmm0
STRLEN_LOOP:
        add     eax, 16
        PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH
        jnz     STRLEN_LOOP
        add     eax, ecx
        ret

; ==== strstr ====
strstr_sse42:
        ; ecx = haystack, edx = needle
        push    esi
        push    edi
        MovDqU  xmm2, dqword[edx]       ; load the first 16 bytes of needle
        Pxor    xmm3, xmm3
        lea     eax, [ecx - 16]
        ; find the first possible match of 16-byte fragment in haystack
STRSTR_MAIN_LOOP:
        add     eax, 16
        PcmpIstrI xmm2, dqword[eax], EQUAL_ORDERED
        ja      STRSTR_MAIN_LOOP
        jnc     STRSTR_NOT_FOUND
        add     eax, ecx                ; save the possible match start
        mov     edi, edx
        mov     esi, eax
        sub     edi, esi
        sub     esi, 16
        ; compare the strings
@@:
        add     esi, 16
        MovDqU  xmm1, dqword[esi + edi]
        ; mask out invalid bytes in the haystack
        PcmpIstrM xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK
        MovDqU  xmm4, dqword[esi]
        PAnd    xmm4, xmm0
        PcmpIstrI xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
        ja      @B
        jnc     STRSTR_FOUND
        ; continue searching from the next byte
        sub     eax, 15
        jmp     STRSTR_MAIN_LOOP
STRSTR_NOT_FOUND:
        xor     eax, eax
STRSTR_FOUND:
        pop     edi
        pop     esi
        ret
```
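For comparison with the intrinsics used in the comments below, here is a rough C translation of strlen_sse42 (a sketch of mine, not part of the original article; the name strlen_sse42_c is made up). Like the assembly, it reads 16 bytes at a time from an unaligned address, so it shares the page-boundary caveat raised in the comments.

```c
#include <nmmintrin.h>
#include <stddef.h>

size_t strlen_sse42_c(const char *s)
{
    const __m128i zero = _mm_setzero_si128();   /* pxor xmm0, xmm0 */
    size_t offset = 0;
    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + offset));
        /* "Equal each" against an all-zero first operand: returns the index
           of the first zero byte in the chunk, or 16 if there is none. */
        int idx = _mm_cmpistri(zero, chunk,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx != 16)
            return offset + (size_t)idx;
        offset += 16;
    }
}
```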
The first processor with SSE 4.2 support was the [Intel Core i7][5], followed by the [Intel Core i5][6].
## Additional reading
* [Intel Software Developer Manuals.][3]
* [comp.arch discussion][7] about the new string-processing instructions.
* [Optimizing strlen][8] on processors without SSE4 support.
* [A review of Intel Nehalem][9] processor, which includes support for the text processing instructions.
**Typo in Intel manual:** on figure 5-1, "imm8[6:5]" near _Optional boolean negation_ should be "imm8[5:4]".
![Peter Kankowski][10]
Peter Kankowski
## About the author
Peter is the developer of [Aba Search and Replace][11], a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP. He can be reached at [kankowski@gmail.com.][12]
## 23 comments
Ten recent comments are shown below. [Show all comments][13]
Jeroen van Bemmel, 6 years ago
It may be better to implement these functions using intrinsics; that way, the compiler is better able to choose the right registers:
```cpp
    long diff = cs - ct;
    long nextbytes = 16;  // asm ("rdx") does not work
    ct -= 16;
    // Force GCC to use a register for nextbytes? Makes the code much worse! (adds LEA)
    // __asm__ __volatile__( "mov $16, %0" : "=r"(nextbytes) );
loop:
    // could align it: __asm__ __volatile__( ".align 16\n" : : : "memory" );
    __m128i ct16chars = _mm_loadu_si128( (const __m128i *) (ct += nextbytes) );
    int offset = _mm_cmpistri( ct16chars, * (const __m128i *) (ct + diff),
                               _SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY );
    __asm__ __volatile__ goto( "ja %l[loop]\n jc %l[not_equal]" : : : "memory" : loop, not_equal );
    return 0;
not_equal:
    return ct[diff + offset] - ct[offset];
```
I did not manage to get gcc to emit instructions that use a register instead of a direct constant; it could be that the compiler believes it performs worse...
(NB: I am using the "asm goto" feature in the above code; it sometimes interacts poorly with optimization because gcc does not know the jumps depend on the flags status.)
Jeroen van Bemmel, 6 years ago
Comment on the SSE4.2 strlen() implementation: it actually performs much worse than the following SSE2 implementation (inspired by Agner Fog's implementation):
```c
#include <emmintrin.h>          // SSE2 intrinsics
#include <stddef.h>
typedef unsigned char u8;       // u8 is not defined in the original snippet

size_t strlen( const char* s ) {
    __m128i zero = _mm_set1_epi8( 0 );
    __m128i *s_aligned = (__m128i*) (((long)s) & -0x10L);
    u8 misbits = ((long)s) & 0xf;
    __m128i s16cs = _mm_load_si128( s_aligned );
    __m128i bytemask = _mm_cmpeq_epi8( s16cs, zero );
    int bitmask = _mm_movemask_epi8( bytemask );
    bitmask = (bitmask >> misbits) << misbits;
    // Alternative: use TEST instead of BSF, then BSF at end (only). Much better on older CPUs
    // TEST has latency 1, while BSF has 3!
    while (bitmask == 0) {
        s16cs = _mm_load_si128( ++s_aligned );
        bytemask = _mm_cmpeq_epi8( s16cs, zero );
        bitmask = _mm_movemask_epi8( bytemask );
    }
    // bsf only takes 1 clock cycle on modern cpus
    return ((( const char* ) s_aligned ) - s) + __builtin_ctz( bitmask );
}
```
Measurements using Agner Fog's testPMC routines on a Core i7, calling 100000 * strlen() for 4 test strings:
C++ inlined version of the above code:
```
Processor 0
    Clock   Core cyc   Instruct       Uops   BrMispred
  6977183    4862520   11500107   12627774          29
  6558494    4810778   11500010   12600524          22
  6640618    4852955   11500011   12600727           7
  6559492    4810098   11500010   12600491           8
  6566901    4805810   11500010   12600462           9
  6535459    4796366   11500010   12600415           3
  6542663    4803975   11500010   12600415           4
  6555762    4797329   11500011   12600935          13
```
SSE2 version:
```
Processor 0
    Clock   Core cyc   Instruct       Uops   BrMispred
 42095645   33382200    2500115   48931751           1
 31560226   33403649    2500014   48910384           2
 30967280   33373350    2500016   48907965           3
 30473645   33376225    2500015   48907719           2
 30468486   33378743    2500016   48907860           1
 30474334   33377705    2500016   48907859           1
 30473657   33378184    2500015   48907638           1
 30474393   33376966    2500015   48908046           1
```
So even though the SSE4 strlen() code above contains far fewer instructions, they translate into many uops, resulting in more CPU cycles.
Jeroen van Bemmel, 6 years ago
Correction: the second measurement above is, of course, for the SSE4 version.
Peter Kankowski, 6 years ago
Sorry for the late answer. My results on Core i5:
| Function | Short strings (1 to 10 characters, from the Gettysburg Address) | A long string ("Jabberwocky") |
| --- | --- | --- |
| Microsoft's strlen | 990 | 919 |
| [my_strlen][8] | 640 | 1156 |
| Agner Fog's strlen | 1132 | 2200 |
| SSE2 (your version) | 886 | 533 |
| SSE4.2 | 613 | 610 |
So your version is the winner for long strings. You may be interested in looking at [a strchr implementation][14] using SSE2. In comparison with it, you carefully avoided the branch for the first misaligned chunk. Nice job!
Sirmabus, 5 years ago
Peter, related but a bit off topic: have you ever gone about optimizing pattern matching with a mask?
Basically the common byte search with wildcards, which typically looks like this:
```
55 8B EC 6A FF 68 ?? ?? ?? ?? 64 A1 00 00 00 00 50 64 89 25 00 00 00 00 81
```
There, the "??" represent the variable bytes that you want to skip.
There are many ways of representing this kind of data that could be baked beforehand.
One could have an array of the same size with flag bytes, 01 for keep/compare and 00 for skip, or some sort of run-length encoding, etc.; and of course not deal with the pattern as a string (no need for that overhead).
I have the idea that with SSE instructions and some sort of mask, using bit operations, one might be able to process many pattern bytes at a time.
Any ideas?
Peter Kankowski, 5 years ago
You already described it. Instead of 01 for keep/compare, use 0xFF. So in the mask, you will have 0xFF for each byte you want to compare and 0 for each byte you want to skip. Then run a PAND instruction on the loaded bytes and the mask. After PAND, all skipped bytes will turn to zero. In the pattern, you should have zeros for the skipped bytes, so that they compare equal.
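For illustration, here is a minimal sketch of that idea in C with SSE2 intrinsics (the function name match16 and the fixed 16-byte pattern length are my assumptions, not part of the comment):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdbool.h>
#include <stdint.h>

/* pattern: the bytes to match, with 0 in the wildcard positions;
   mask:    0xFF for bytes to compare, 0x00 for bytes to skip. */
static bool match16(const uint8_t *data, __m128i pattern, __m128i mask)
{
    __m128i chunk = _mm_loadu_si128((const __m128i *)data);
    chunk = _mm_and_si128(chunk, mask);           /* PAND: zero the skipped bytes  */
    __m128i eq = _mm_cmpeq_epi8(chunk, pattern);  /* PCMPEQB: per-byte comparison  */
    return _mm_movemask_epi8(eq) == 0xFFFF;       /* all 16 bytes must match       */
}
```

A longer pattern can be handled by repeating this check for each 16-byte chunk of the pattern.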
Renat Saifutdinov, 4 years ago
Performance tests of the strlen_sse42 function do not make sense, because the function has a bug that causes a crash when the null terminator of the source string is located near a protected memory page boundary (the function always reads 16 bytes at a time, possibly past the end of the string).
Aligned reads with masking of the false bits are necessary.
Test for Windows platform:
```c
#include <windows.h>
#include <string.h>

// Assumed prototype for the assembly routine above (32-bit __fastcall).
extern int __fastcall strlen_sse42(const char *s);

int length;

int main()
{
    enum { N = 0x1000 };
    // allocate two pages
    char *str = (char*)VirtualAlloc(0, N + N, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    // protect the next page
    DWORD old;
    BOOL res = VirtualProtect(str + N, N, PAGE_NOACCESS, &old);
    size_t i;
    // fill the first page
    for(i = 0; i != N; ++i)
    {
        str[i] = 'a' + (i % 20);
        if((i % 13) == 0) str[i] = 0;
    }
    str[N - 1] = 0; // terminator at the end of the first page
    for(size_t i = 0; i != N; ++i)
    {
        length = strlen_sse42(str + i); // <<< crash!
        if(strlen(str + i) != length) *(char*)0 = 0;
    }
    return 0;
}
```
Peter Kankowski, 4 years ago
Thank you, I will change it to use aligned reads.
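For reference, one possible shape of that fix is sketched below (my code, not the author's final version): the first, possibly misaligned chunk is checked with an SSE2 compare-and-movemask so that the bits belonging to bytes before the string can be shifted off, and all remaining chunks are read aligned, so no load crosses a page boundary. The name strlen_sse42_aligned and the use of __builtin_ctz (GCC/Clang; use _BitScanForward on MSVC) are assumptions.

```c
#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t strlen_sse42_aligned(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    /* Round the pointer down to 16; the bytes before s are in the same page. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned misalign = (unsigned)((uintptr_t)s & 15);

    /* First chunk: compare against zero, then shift off the mask bits that
       belong to bytes before the start of the string. */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero)) >> misalign;
    if (mask)
        return (size_t)__builtin_ctz(mask);

    /* Remaining chunks are 16-byte aligned, so the SSE 4.2 loop is safe. */
    for (;;) {
        p += 16;
        int idx = _mm_cmpistri(zero, _mm_load_si128((const __m128i *)p),
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx != 16)
            return (size_t)(p + idx - s);
    }
}
```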
KWillets, 2 years ago
I found an evil hack for this while I was trying to figure out the flags. After this section:
```asm
STRCMP_LOOP:
        add     edx, 16
        MovDqU  xmm0, dqword[edx]
        ; find the first *different* bytes, hence negative polarity
        PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH + NEGATIVE_POLARITY
        ja      STRCMP_LOOP
```
ecx is only 16 here if the strings are equal (and the terminator has been found), so the final sub just needs to yield 0. In particular, if we turn 16 into 0:
```asm
        and     ecx, 15
```
we make this yield 0:
```asm
        movzx   eax, byte[eax + ecx]
        movzx   edx, byte[edx + ecx]
        sub     eax, edx
```
since the first bytes of the comparison must be equal in this case. For other cases ecx is unchanged.
I don't know if it's better, but it's one way to handle the exit conditions.
Ostap, 4 months ago
Beautiful!
Thanks, Peter.
The 32-bit implementation was exactly what I was looking for.
It works better than the native VC12 version.
© Peter Kankowski, 2006–2018. Some rights reserved.
[1]: https://www.strchr.com/
[2]: https://www.strchr.com/ico/social_256.png
[3]: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
[4]: http://www.flatassembler.net
[5]: http://en.wikipedia.org/wiki/Intel_Core_i7
[6]: http://en.wikipedia.org/wiki/Intel_Core_i5
[7]: http://groups.google.com/group/comp.arch/browse_thread/thread/420b8b782086d03a
[8]: https://www.strchr.com/optimized_strlen_function
[9]: http://arstechnica.com/gadgets/2008/04/what-you-need-to-know-about-nehalem/3/
[10]: https://www.strchr.com/media/avatars/1.jpg
[11]: http://www.abareplace.com
[12]: mailto:kankowski%40gmail.com
[13]: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2?allcomments=1#comments
[14]: http://www.strchr.com/optimized_strlen_function?allcomments=1#comment_634
[15]: https://www.strchr.com/kcaptcha.php ""
[16]: https://www.strchr.com/first_language
[17]: https://www.strchr.com/hash_functions
[18]: https://www.strchr.com/links
[19]: https://www.strchr.com/software_interface
[20]: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2
[21]: https://www.strchr.com/table-driven
[22]: https://www.strchr.com/x86_machine_code_statistics
[23]: https://www.strchr.com/plural_forms#comment_827
[24]: https://www.strchr.com/plural_forms#comment_826
[25]: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2#comment_825
[26]: https://www.strchr.com/calendar#comment_824
[27]: https://www.strchr.com/calendar#comment_823