---
created_at: '2017-09-08T12:26:28.000Z'
title: Implementing strcmp, strlen, and strstr using SSE 4.2 instructions (2008)
url: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2
author: tambourine_man
points: 101
story_text:
comment_text:
num_comments: 21
story_id:
story_title:
story_url:
parent_id:
created_at_i: 1504873588
_tags:
- story
- author_tambourine_man
- story_15199648
objectID: '15199648'
year: 2008
---
[Source](https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 "Permalink to Implementing strcmp, strlen, and strstr using SSE 4.2 instructions - strchr.com")
_Abstract:_
Using new Intel Core i7 instructions to speed up string manipulation.
Created 9 years ago by _Peter Kankowski_
Last changed 4 years ago
_Contributors_: Dap and Adam Messinger
Filed under _Low-level code optimization_
# Implementing strcmp, strlen, and strstr using SSE 4.2 instructions
## The new instructions
SSE 4.2 introduces four instructions (PcmpEstrI, PcmpEstrM, PcmpIstrI, and PcmpIstrM) that can be used to speed up text processing code (including strcmp, memcmp, strstr, and strspn functions).
Intel has published [a description of the new instruction formats][3], but no sample code or high-level guidelines. This article tries to provide them.
```asm
MovDqU    xmm0, dqword[str1]
PcmpIstrI xmm0, dqword[str2], imm
```
_PcmpIstrI_ is one of the new string-handling instructions that compare their operands. The first operand is always an SSE register (typically loaded with MovDqU). The second operand can be a memory location. Note **the immediate operand**, which consists of several bit fields controlling operation modes.
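In C or C++, the same instruction is exposed by compilers as the _mm_cmpistri intrinsic in <nmmintrin.h>; the _SIDD_* constants are simply the bit fields of the immediate operand. A minimal sketch (mine, not from the original article; the function name is made up):

```c
#include <nmmintrin.h>   /* SSE 4.2 string intrinsics */

/* Returns the index of the first byte at which the two 16-byte chunks differ
   (treating the end of one string as a difference), or 16 if they compare
   equal. The immediate selects unsigned bytes, "equal each" aggregation and
   negative polarity -- the same EQUAL_EACH + NEGATIVE_POLARITY value used by
   strcmp_sse42 later in the article. */
int first_diff_index(const char *str1, const char *str2)
{
    __m128i a = _mm_loadu_si128((const __m128i *)str1);   /* MovDqU    */
    __m128i b = _mm_loadu_si128((const __m128i *)str2);
    return _mm_cmpistri(a, b,                              /* PcmpIstrI */
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                        _SIDD_NEGATIVE_POLARITY);
}
```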
## Aggregation operations
The heart of a string-processing instruction is **the aggregation operation** (immediate bits [3:2]).
**Equal each** (imm[3:2] = 10). This operation compares two strings (think of strcmp or memcmp). The result of the comparison is a bit mask (1 if the corresponding bytes are equal, 0 if not equal). For example:
```
operand1 = "UseFlatAssembler"
operand2 = "UsingAnAssembler"
IntRes1  =  1100000111111111
```
**Equal any** (imm[3:2] = 00). The first operand is a character set, the second is a string (think of strspn or strcspn). The bit mask includes 1 if the character belongs to the set, 0 if not:
```
operand2 = "You Drive Me Mad", operand1 = "aeiouy"
IntRes1  =  0110001010010010
```
**Ranges** (imm[3:2] = 01). The first operand consists of ranges, for example, "azAZ" means "all characters from a to z and all characters from A to Z":
```
operand2 = "I'm here because", operand1 = "azAZ"
IntRes1  =  1010111101111111
```
**Equal ordered** (imm[3:2] = 11). Substring search (strstr). The first operand contains a string to search for, the second is a string to search in. The bit mask includes 1 if the substring is found at the corresponding position:
```
operand2 = "WhenWeWillBeWed!", operand1 = "We"
IntRes1  =  0000100000001000
```
After computing the aggregation function, IntRes1 can be complemented, expanded into a byte mask, or shrunk into an index. The result is written into the XMM0 or ECX register. The Intel manual explains these details well, so there is no need to repeat them here.
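As a concrete check of the "equal any" example, the following small program (a sketch of mine, not from the article; the buffer layout is an assumption) reproduces the IntRes1 mask for "You Drive Me Mad" using the _mm_cmpistrm intrinsic with bit-mask output:

```c
#include <nmmintrin.h>
#include <stdio.h>

int main(void)
{
    /* 16-byte buffers: the character set is zero-terminated inside the
       chunk, the text fills all 16 bytes. */
    const char set[16]  = "aeiouy";
    const char text[16] = "You Drive Me Mad";

    __m128i s = _mm_loadu_si128((const __m128i *)set);
    __m128i t = _mm_loadu_si128((const __m128i *)text);

    /* PcmpIstrM, "equal any", bit-mask output: bit i of the result is 1
       when text[i] belongs to the set. */
    __m128i m = _mm_cmpistrm(s, t, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                                   _SIDD_BIT_MASK);

    /* Prints 4946: bits 1, 2, 6, 8, 11, 14 -- the same pattern as
       IntRes1 = 0110001010010010 above, read with bit 0 as the first
       character. */
    printf("%04x\n", (unsigned)(_mm_cvtsi128_si32(m) & 0xffff));
    return 0;
}
```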
## Other features of SSE 4.2
* The strings do not need to be aligned.
* The processor properly handles the end-of-string cases for both zero-terminated strings and Pascal-style strings.
* You can use the instructions with Unicode (16-bit) characters as well as signed or unsigned bytes (see the sketch after this list).
* Four aggregation operations can be used to implement a wide range of string-processing functions.
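To illustrate the Unicode point above, here is a minimal sketch of mine (the name length_utf16 and the overall approach are assumptions, not the article's code) that measures the length of a zero-terminated UTF-16 string in 16-bit code units using the word-sized element mode. Like the byte version below, it reads unaligned 16-byte chunks:

```c
#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t length_utf16(const uint16_t *s)
{
    const __m128i zero = _mm_setzero_si128();
    size_t offset = 0;                  /* offset counted in 16-bit units */
    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + offset));
        /* Word mode processes 8 elements per chunk; the intrinsic returns
           the index of the first zero word, or 8 if none is present. */
        int idx = _mm_cmpistri(zero, chunk,
                               _SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx != 8)
            return offset + (size_t)idx;
        offset += 8;
    }
}
```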
## Implementation
The following strcmp and strlen functions were written and published before processors with SSE 4.2 support were available. Later, they were tested on real hardware and found to be correct.
Compile with [FASM][4]:

```asm
; Immediate byte constants
EQUAL_ANY         = 0000b
RANGES            = 0100b
EQUAL_EACH        = 1000b
EQUAL_ORDERED     = 1100b
NEGATIVE_POLARITY = 010000b
BYTE_MASK         = 1000000b

; ==== strcmp ====
strcmp_sse42:
        ; Using __fastcall convention, ecx = string1, edx = string2
        mov     eax, ecx
        sub     eax, edx                ; eax = ecx - edx
        sub     edx, 16
STRCMP_LOOP:
        add     edx, 16
        MovDqU  xmm0, dqword[edx]
        ; find the first *different* bytes, hence negative polarity
        PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH + NEGATIVE_POLARITY
        ja      STRCMP_LOOP
        jc      STRCMP_DIFF
        ; the strings are equal
        xor     eax, eax
        ret
STRCMP_DIFF:
        ; subtract the first different bytes
        add     eax, edx
        movzx   eax, byte[eax + ecx]
        movzx   edx, byte[edx + ecx]
        sub     eax, edx
        ret

; ==== strlen ====
strlen_sse42:
        ; ecx = string
        mov     eax, -16
        mov     edx, ecx
        pxor    xmm0, xmm0
STRLEN_LOOP:
        add     eax, 16
        PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH
        jnz     STRLEN_LOOP
        add     eax, ecx
        ret

; ==== strstr ====
strstr_sse42:
        ; ecx = haystack, edx = needle
        push    esi
        push    edi
        MovDqU  xmm2, dqword[edx]       ; load the first 16 bytes of needle
        Pxor    xmm3, xmm3
        lea     eax, [ecx - 16]
        ; find the first possible match of 16-byte fragment in haystack
STRSTR_MAIN_LOOP:
        add     eax, 16
        PcmpIstrI xmm2, dqword[eax], EQUAL_ORDERED
        ja      STRSTR_MAIN_LOOP
        jnc     STRSTR_NOT_FOUND
        add     eax, ecx                ; save the possible match start
        mov     edi, edx
        mov     esi, eax
        sub     edi, esi
        sub     esi, 16
        ; compare the strings
@@:
        add     esi, 16
        MovDqU  xmm1, dqword[esi + edi]
        ; mask out invalid bytes in the haystack
        PcmpIstrM xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK
        MovDqU  xmm4, dqword[esi]
        PAnd    xmm4, xmm0
        PcmpIstrI xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
        ja      @B
        jnc     STRSTR_FOUND
        ; continue searching from the next byte
        sub     eax, 15
        jmp     STRSTR_MAIN_LOOP
STRSTR_NOT_FOUND:
        xor     eax, eax
STRSTR_FOUND:
        pop     edi
        pop     esi
        ret
```
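For comparison with the intrinsics used in the comments below, here is a rough C translation of strlen_sse42 (a sketch of mine, not part of the original article; the name strlen_sse42_c is made up). Like the assembly, it reads 16 bytes at a time from an unaligned address, so it shares the page-boundary caveat raised in the comments.

```c
#include <nmmintrin.h>
#include <stddef.h>

size_t strlen_sse42_c(const char *s)
{
    const __m128i zero = _mm_setzero_si128();   /* pxor xmm0, xmm0 */
    size_t offset = 0;
    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + offset));
        /* "Equal each" against an all-zero first operand: returns the index
           of the first zero byte in the chunk, or 16 if there is none. */
        int idx = _mm_cmpistri(zero, chunk,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx != 16)
            return offset + (size_t)idx;
        offset += 16;
    }
}
```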
The first processor with SSE 4.2 support was the [Intel Core i7][5], followed by the [Intel Core i5][6].
## Additional reading
* [Intel Software Developer Manuals.][3]
* [comp.arch discussion][7] about the new string-processing instructions.
* [Optimizing strlen][8] on processors without SSE4 support.
* [A review of Intel Nehalem][9] processor, which includes support for the text processing instructions.
**Typo in Intel manual:** on figure 5-1, "imm8[6:5]" near _Optional boolean negation_ should be "imm8[5:4]".
![Peter Kankowski][10]
Peter Kankowski
## About the author
Peter is the developer of [Aba Search and Replace][11], a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP. He can be reached at [kankowski@gmail.com.][12]
## 23 comments
Ten recent comments are shown below. [Show all comments][13]
Jeroen van Bemmel, 6 years ago
It may be better to implement these functions using intrinsics; that way, the compiler is better able to choose the right registers:
```cpp
    long diff = cs - ct;
    long nextbytes = 16;  // asm ("rdx") does not work
    ct -= 16;
    // Force GCC to use a register for nextbytes? Makes the code much worse! (adds LEA)
    // __asm__ __volatile__( "mov $16, %0" : "=r"(nextbytes) );
loop:
    // could align it: __asm__ __volatile__( ".align 16\n" : : : "memory" );
    __m128i ct16chars = _mm_loadu_si128( (const __m128i *) (ct += nextbytes) );
    int offset = _mm_cmpistri( ct16chars, * (const __m128i *) (ct + diff),
                               _SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY );
    __asm__ __volatile__ goto( "ja %l[loop]\n jc %l[not_equal]" : : : "memory" : loop, not_equal );
    return 0;
not_equal:
    return ct[diff + offset] - ct[offset];
```
I did not manage to get gcc to emit instructions that use a register instead of a direct constant; it could be that the compiler believes it performs worse...
(NB: I am using the "asm goto" feature in the above code; it sometimes interacts poorly with optimization because gcc does not know the jumps depend on the flags status.)
Jeroen van Bemmel, 6 years ago
Comment on the SSE4.2 strlen() implementation: it actually performs much worse than the following SSE2 implementation (inspired by Agner Fog's implementation):
```c
#include <emmintrin.h>          // SSE2 intrinsics
#include <stddef.h>
typedef unsigned char u8;       // u8 is not defined in the original snippet

size_t strlen( const char* s ) {
    __m128i zero = _mm_set1_epi8( 0 );
    __m128i *s_aligned = (__m128i*) (((long)s) & -0x10L);
    u8 misbits = ((long)s) & 0xf;
    __m128i s16cs = _mm_load_si128( s_aligned );
    __m128i bytemask = _mm_cmpeq_epi8( s16cs, zero );
    int bitmask = _mm_movemask_epi8( bytemask );
    bitmask = (bitmask >> misbits) << misbits;
    // Alternative: use TEST instead of BSF, then BSF at end (only). Much better on older CPUs
    // TEST has latency 1, while BSF has 3!
    while (bitmask == 0) {
        s16cs = _mm_load_si128( ++s_aligned );
        bytemask = _mm_cmpeq_epi8( s16cs, zero );
        bitmask = _mm_movemask_epi8( bytemask );
    }
    // bsf only takes 1 clock cycle on modern cpus
    return ((( const char* ) s_aligned ) - s) + __builtin_ctz( bitmask );
}
```
Measurements using Agner Fog's testPMC routines on a Core i7, calling 100000 * strlen() for 4 test strings:
C++ inlined version of the above code:
```
Processor 0
    Clock   Core cyc   Instruct       Uops   BrMispred
  6977183    4862520   11500107   12627774          29
  6558494    4810778   11500010   12600524          22
  6640618    4852955   11500011   12600727           7
  6559492    4810098   11500010   12600491           8
  6566901    4805810   11500010   12600462           9
  6535459    4796366   11500010   12600415           3
  6542663    4803975   11500010   12600415           4
  6555762    4797329   11500011   12600935          13
```
SSE2 version:
```
Processor 0
    Clock   Core cyc   Instruct       Uops   BrMispred
 42095645   33382200    2500115   48931751           1
 31560226   33403649    2500014   48910384           2
 30967280   33373350    2500016   48907965           3
 30473645   33376225    2500015   48907719           2
 30468486   33378743    2500016   48907860           1
 30474334   33377705    2500016   48907859           1
 30473657   33378184    2500015   48907638           1
 30474393   33376966    2500015   48908046           1
```
So even though the SSE4 strlen() code above contains far fewer instructions, they translate into many uops, resulting in more CPU cycles.
Jeroen van Bemmel, 6 years ago
Correction: the second measurement above is, of course, for the SSE4 version.
Peter Kankowski, 6 years ago
Sorry for the late answer. My results on Core i5:
| Function | Short strings (1 to 10 characters, from the Gettysburg Address) | A long string ("Jabberwocky") |
| --- | --- | --- |
| Microsoft's strlen | 990 | 919 |
| [my_strlen][8] | 640 | 1156 |
| Agner Fog's strlen | 1132 | 2200 |
| SSE2 (your version) | 886 | 533 |
| SSE4.2 | 613 | 610 |
So your version is the winner for long strings. You may be interested in looking at [a strchr implementation][14] using SSE2. In comparison with it, you carefully avoided the branch for the first misaligned chunk. Nice job!
Sirmabus, 5 years ago
Peter, related but a bit off topic: have you ever gone about optimizing pattern matching with a mask?
Basically the common byte search with wildcards, which typically looks like this:
```
55 8B EC 6A FF 68 ?? ?? ?? ?? 64 A1 00 00 00 00 50 64 89 25 00 00 00 00 81
```
There, the "??" represent the variable bytes that you want to skip.
There are many ways of representing this kind of data that could be baked beforehand.
One could have an array of the same size with flag bytes, 01 for keep/compare and 00 for skip, or some sort of run-length encoding, etc.; and of course not deal with the pattern as a string (no need for that overhead).
I have the idea that with SSE instructions and some sort of mask, using bit operations, one might be able to process many pattern bytes at a time.
Any ideas?
Peter Kankowski, 5 years ago
You already described it. Instead of 01 for keep/compare, use 0xFF. So in the mask, you will have 0xFF for each byte you want to compare and 0 for each byte you want to skip. Then run a PAND instruction on the loaded bytes and the mask. After PAND, all skipped bytes will turn to zero. In the pattern, you should have zeros for the skipped bytes, so that they compare equal.
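For illustration, here is a minimal sketch of that idea in C with SSE2 intrinsics (the function name match16 and the fixed 16-byte pattern length are my assumptions, not part of the comment):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdbool.h>
#include <stdint.h>

/* pattern: the bytes to match, with 0 in the wildcard positions;
   mask:    0xFF for bytes to compare, 0x00 for bytes to skip. */
static bool match16(const uint8_t *data, __m128i pattern, __m128i mask)
{
    __m128i chunk = _mm_loadu_si128((const __m128i *)data);
    chunk = _mm_and_si128(chunk, mask);           /* PAND: zero the skipped bytes  */
    __m128i eq = _mm_cmpeq_epi8(chunk, pattern);  /* PCMPEQB: per-byte comparison  */
    return _mm_movemask_epi8(eq) == 0xFFFF;       /* all 16 bytes must match       */
}
```

A longer pattern can be handled by repeating this check for each 16-byte chunk of the pattern.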
Renat Saifutdinov, 4 years ago
Performance tests of the strlen_sse42 function do not make sense, because the function has a bug that causes a crash when the null terminator of the source string is located near a protected memory page boundary (the function always reads 16 bytes at a time, possibly past the end of the string).
Aligned reads with masking of the false bits are necessary.
Test for Windows platform:
```c
#include <windows.h>
#include <string.h>

// Assumed prototype for the assembly routine above (32-bit __fastcall).
extern int __fastcall strlen_sse42(const char *s);

int length;

int main()
{
    enum { N = 0x1000 };
    // allocate two pages
    char *str = (char*)VirtualAlloc(0, N + N, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    // protect the next page
    DWORD old;
    BOOL res = VirtualProtect(str + N, N, PAGE_NOACCESS, &old);
    size_t i;
    // fill the first page
    for(i = 0; i != N; ++i)
    {
        str[i] = 'a' + (i % 20);
        if((i % 13) == 0) str[i] = 0;
    }
    str[N - 1] = 0; // terminator at the end of the first page
    for(size_t i = 0; i != N; ++i)
    {
        length = strlen_sse42(str + i); // <<< crash!
        if(strlen(str + i) != length) *(char*)0 = 0;
    }
    return 0;
}
```
Peter Kankowski, 4 years ago
Thank you, I will change it to use aligned reads.
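For reference, one possible shape of that fix is sketched below (my code, not the author's final version): the first, possibly misaligned chunk is checked with an SSE2 compare-and-movemask so that the bits belonging to bytes before the string can be shifted off, and all remaining chunks are read aligned, so no load crosses a page boundary. The name strlen_sse42_aligned and the use of __builtin_ctz (GCC/Clang; use _BitScanForward on MSVC) are assumptions.

```c
#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t strlen_sse42_aligned(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    /* Round the pointer down to 16; the bytes before s are in the same page. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned misalign = (unsigned)((uintptr_t)s & 15);

    /* First chunk: compare against zero, then shift off the mask bits that
       belong to bytes before the start of the string. */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero)) >> misalign;
    if (mask)
        return (size_t)__builtin_ctz(mask);

    /* Remaining chunks are 16-byte aligned, so the SSE 4.2 loop is safe. */
    for (;;) {
        p += 16;
        int idx = _mm_cmpistri(zero, _mm_load_si128((const __m128i *)p),
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
        if (idx != 16)
            return (size_t)(p + idx - s);
    }
}
```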
KWillets, 2 years ago
I found an evil hack for this while I was trying to figure out the flags. After this section:
```asm
STRCMP_LOOP:
        add     edx, 16
        MovDqU  xmm0, dqword[edx]
        ; find the first *different* bytes, hence negative polarity
        PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH + NEGATIVE_POLARITY
        ja      STRCMP_LOOP
```
ecx is only 16 here if the strings are equal (and the terminator has been found), so the final sub just needs to yield 0. In particular, if we turn 16 into 0:
```asm
        and     ecx, 15
```
we make this yield 0:
```asm
        movzx   eax, byte[eax + ecx]
        movzx   edx, byte[edx + ecx]
        sub     eax, edx
```
since the first bytes of the comparison must be equal in this case. For other cases ecx is unchanged.
I don't know if it's better, but it's one way to handle the exit conditions.
Ostap, 4 months ago
Beautiful!
Thanks, Peter.
The 32-bit implementation was exactly what I was looking for.
It works better than the native VC12 version.
© Peter Kankowski, 2006–2018. Some rights reserved.
[1]: https://www.strchr.com/
[2]: https://www.strchr.com/ico/social_256.png
[3]: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
[4]: http://www.flatassembler.net
[5]: http://en.wikipedia.org/wiki/Intel_Core_i7
[6]: http://en.wikipedia.org/wiki/Intel_Core_i5
[7]: http://groups.google.com/group/comp.arch/browse_thread/thread/420b8b782086d03a
[8]: https://www.strchr.com/optimized_strlen_function
[9]: http://arstechnica.com/gadgets/2008/04/what-you-need-to-know-about-nehalem/3/
[10]: https://www.strchr.com/media/avatars/1.jpg
[11]: http://www.abareplace.com
[12]: mailto:kankowski%40gmail.com
[13]: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2?allcomments=1#comments
[14]: http://www.strchr.com/optimized_strlen_function?allcomments=1#comment_634
[15]: https://www.strchr.com/kcaptcha.php ""
[16]: https://www.strchr.com/first_language
[17]: https://www.strchr.com/hash_functions
[18]: https://www.strchr.com/links
[19]: https://www.strchr.com/software_interface
[20]: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2
[21]: https://www.strchr.com/table-driven
[22]: https://www.strchr.com/x86_machine_code_statistics
[23]: https://www.strchr.com/plural_forms#comment_827
[24]: https://www.strchr.com/plural_forms#comment_826
[25]: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2#comment_825
[26]: https://www.strchr.com/calendar#comment_824
[27]: https://www.strchr.com/calendar#comment_823