
Un Auditted #2

Open
valarauca opened this issue Dec 9, 2016 · 10 comments

Comments

@valarauca
Owner

I don't know if this is working as intended.

I wouldn't trust a single rando online to tell me it was working as intended.

This'll be an open issue/thread until there is a consensus that the code is correct. Please use it to raise concerns, make requests, or yell at me for doing something wrong.

Thank you.

@valarauca
Owner Author

There is an Audit Branch now

I'm going to be placing platform-specific asm dumps there, along with the build scripts I used to generate them.

Initial Windows MSVC AMD64 Commit

@valarauca
Owner Author

valarauca commented Dec 10, 2016

Initial Review:


The compare equality family: OKAY

  • ct_u8_eq
  • ct_u16_eq
  • ct_u32_eq
  • ct_u64_eq

These have no branches. They appear to be constructed out of nothing but xor, not, and, and shr, with a movzx to handle the smaller types. This looks constant time; there isn't really any magic the processor can do here.

Example
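For reference, a minimal Rust sketch of the kind of branch-free equality described above (assuming the usual XOR-and-fold construction; this is not necessarily the crate's exact source):

    // Returns true iff x == y using only bitwise ops (sketch, assumed shape).
    fn ct_u8_eq(x: u8, y: u8) -> bool {
        let mut z = !(x ^ y);   // all bits set iff x == y
        z &= z >> 4;            // fold the byte down to a single bit
        z &= z >> 2;
        z &= z >> 1;
        (z & 1) == 1            // final single-bit check
    }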


The optional select family: OKAY

  • ct_select_u8
  • ct_select_u16
  • ct_select_u32
  • ct_select_u64

These have been compiled down to 4 instructions each (only the register size changes between them). They all take the form of:

        test    cl, cl		#Is Arg0 true?
        cmove   rdx, r8 	#CONDITIONALLY OVERWRITE Arg2
        mov     rax, rdx	#Put Arg2 in return register
        ret 			

So this boils down to a branch + conditional move. According to Agner Fog, CMOV is a fixed-latency instruction, albeit non-free (1-6 cycles depending on vendor/generation). The issue still stands that this is a branch, and it can be mispredicted.

A misprediction costs roughly a 20-40 tick delta, and that delta depends on predictor state rather than on whether the condition was true or false. I think this is OKAY. I don't think an external attacker across Ethernet can know my branch-predictor state.
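For comparison, a mask-based way to write the select avoids even the cmov (a sketch using the usual negate-the-bool mask trick; not necessarily how the crate implements it):

    // Returns a if flag is true, else b, with no branch and no cmov (sketch).
    fn ct_select_u8(flag: bool, a: u8, b: u8) -> u8 {
        let mask = (flag as u8).wrapping_neg(); // 0xFF if flag, 0x00 otherwise
        (mask & a) | (!mask & b)
    }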


The compare buffer family: BROKE AS A JOKE

  • ct_u8_slice_eq
  • ct_u16_slice_eq
  • ct_u32_slice_eq
  • ct_u64_slice_eq

Well this sucks!

        sub     rsp, 40
        mov     r10, rdx
        cmp     r10, r9                    #length check
        jnz     ?_003                      #Fail Length Check Branch
        mov     r9b, -1
        test    r10, r10
        jz      ?_002                      #Dumb optimization for array len == 0
        xor     edx, edx
?_001:  cmp     rdx, r10                   #start loop, bounds check
        jnc     ?_005                      #PANIC on bad memory access
        movzx   eax, byte [r8+rdx]
        xor     al, byte [rcx+rdx]
        inc     rdx
        or      r9b, al
        cmp     rdx, r10                    #Are we at the end of the loop?
        jc      ?_001                       #Loop again
        not     r9b
?_002:  mov     ecx, r9d                    #This _should_ be a call to ct_u8_eq,
        shr     cl, 4                       #But it is inlined
        and     cl, r9b
        mov     eax, ecx
        shr     al, 2
        and     al, cl
        mov     ecx, eax
        shr     cl, 1
        test    cl, al                      #ct_u8_eq is being inlined here
        setne   al                          #but ct_u8_eq DOESN'T USE TEST
        jmp     ?_004                       #jump past zeroing eax to return TRUE
?_003:  xor     eax, eax
?_004:  add     rsp, 40
        ret
?_005: 
        #Rust Panic
        #Unwind the Stack

I think changing the ct_u8_eq(2) to be #[inline(never)] will solve this. My guess is that ct_u8_eq(2) was inlined, then optimized; if you compare my commented section to the actual ct_u8_eq(2) code, you can see the difference.

I am very concerned about that last branch. It really isn't doing anything productive. The single-list optimization (I'm assuming that's what it is) seems stupid, but I really don't know how to combat it.
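At the Rust level the proposed fix would look something like this (a sketch of the accumulate-then-compare shape the asm suggests, with the attribute added; names and signatures are assumed, not copied from the crate):

    // Force the scalar compare to stay an out-of-line call so LLVM cannot
    // re-inline and re-optimize it away (sketch, assumed shape).
    #[inline(never)]
    fn ct_u8_eq(x: u8, y: u8) -> bool {
        let mut z = !(x ^ y);
        z &= z >> 4;
        z &= z >> 2;
        z &= z >> 1;
        (z & 1) == 1
    }

    fn ct_u8_slice_eq(a: &[u8], b: &[u8]) -> bool {
        if a.len() != b.len() {
            return false;            // early exit on length mismatch, as in the asm
        }
        let mut acc = 0u8;
        for i in 0..a.len() {
            acc |= a[i] ^ b[i];      // accumulate any differing bits
        }
        ct_u8_eq(acc, 0)             // single final compare, never inlined
    }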


Conditional Copying: BROKE AS A JOKE

  • ct_copy_u8
  • ct_copy_u16
  • ct_copy_u32
  • ct_copy_u64

The functions have 2 paths. Both iterate across the whole buffer, but one does A LOT more work per element. So it is pretty obvious these are not consistent time.

ct_copy_u8:
        sub     rsp, 40                                 
        mov     rax, rdx                                
        cmp     r8, qword [rsp+50H]                     
        jnz     ?_030                                  #If Array Len's aren't equal panic 
        test    r8, r8                                  
        jz      ?_028                                  #If array len is 0 short circuit to TRUE
        xor     edx, edx                               
        test    cl, cl                                 
        jz      ?_027                                  #Jump to a loop that does nothing
?_026:  cmp     rdx, r8                                #This loop does stuff
        jnc     ?_029                                  
        movzx   ecx, byte [r9+rdx]                     
        mov     byte [rax+rdx], cl                     
        inc     rdx                                    
        cmp     rdx, r8                                #Determine to keep looping
        jc      ?_026                                  #LOOP
        jmp     ?_028                                  #RETURN
?_027:  cmp     rdx, r8                                #This loop does N O T H I N G
        jnc     ?_029                                  #Still has a sanity check b/c LMAO
        inc     rdx                                    
        cmp     rdx, r8                                 
        jc      ?_027				       #Should we keep looping?
?_028:  add     rsp, 40
        ret
?_029:  #Rust Panic
?_030:  #Rust panic

Comments, questions, concerns?

Am I not understanding the assembly?

I'm guessing and reading docs at the same time, so my confidence is a bit low.

@valarauca
Owner Author

I think changing the ct_u8_eq(2) to be #[inline(never)] will solve this. My guess is that ct_u8_eq(2) was inlined, then optimized; if you compare my commented section to the actual ct_u8_eq(2) code, you can see the difference.

So this totally fixed everything

ct_u8_slice_eq:; Function begin
        sub     rsp, 40              
        mov     r10, rdx             
        cmp     r10, r9              
        jnz     ?_002                	#these jumps are based on
        test    r10, r10      		#non-equal lens
        jz      ?_003                	#and zero len
        xor     r9d, r9d             	#so they're A-OKAY
        xor     edx, edx             	
?_001:  cmp     rdx, r10             
        jnc     ?_006                	#sanity check
        mov     al, byte [r8+rdx]    
        xor     al, byte [rcx+rdx]   
        inc     rdx                  
        or      al, r9b              
        cmp     rdx, r10             
        mov     r9b, al              
        jc      ?_001                	#repeat loop
        jmp     ?_004                	#exit loop
?_002:  xor     eax, eax             	
        jmp     ?_005               	#weird non-equal list stuff 
?_003:  xor     eax, eax             
?_004:  movzx   ecx, al              
        xor     edx, edx             
        call    ct_u8_eq		#10/10
?_005:  nop                   		
        add     rsp, 40               
        ret                           
?_006:
					#Rust Panic
					#Unwind the Stack

Inspired by this fix, I'm going to just call into the constant-time select for the conditional slice copy; a sketch of what I mean is below.

I think that'll work since I'm just looping over a function call. The loop will be constant time, and so will the function. So life will be amazing.
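Concretely, something along these lines (a sketch assuming ct_select_u8 takes the flag plus the two candidate bytes; the crate's exact signature may differ):

    // Copy src into dst only when flag is true, doing identical work either
    // way (sketch, assumed shape).
    fn ct_copy_u8(flag: bool, dst: &mut [u8], src: &[u8]) {
        assert_eq!(dst.len(), src.len()); // length mismatch panics, matching the asm
        for i in 0..dst.len() {
            // Every iteration performs the same select, regardless of flag.
            dst[i] = ct_select_u8(flag, src[i], dst[i]);
        }
    }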

@valarauca
Owner Author

valarauca commented Dec 10, 2016

I think that'll work since I'm just looping over a function call. The loop will be constant time, and so will the function. So life will be amazing.

And life is amazing 😎😎😎😎😎😎😎

ct_copy_u8:; Function begin
        push    r15                  
        push    r14                  
        push    rsi                  
        push    rdi                  
        push    rbp                  
        push    rbx                  
        sub     rsp, 40             
        mov     r14, r9             
        mov     rsi, r8             
        mov     rbx, rdx            
        cmp     rsi, qword [rsp+80H]
        jnz     ?_029               	#Jump to explicit panic for non-equal args
        test    rsi, rsi            
        jz      ?_027               	#Skipping looping if array len == 0
        xor     edi, edi            
        movzx   r15d, cl            
?_026:  cmp     rdi, rsi            
        jnc     ?_028          		#Memory safety panic     
        lea     rbp, [rdi+1H]       
        movzx   r8d, byte [rbx+rdi] 
        movzx   edx, byte [r14+rdi] 
        mov     ecx, r15d           
        call    ct_select_u8        	#constant time select
        mov     byte [rbx+rdi], al  
        cmp     rbp, rsi              
        mov     rdi, rbp              
        jc      ?_026                	#loop
?_027:  add     rsp, 40               
        pop     rbx                  
        pop     rbp                  
        pop     rdi                  
        pop     rsi                  
        pop     r14                  
        pop     r15                  
        ret                          
?_028: 
					#Rust Panic
?_029: 
					#Rust Panic

The Audit Rev3 branch appears FULLY FUNCTIONAL on Windows MSVC AMD64. All advertised guarantees hold. Branching only occurs on zero array lengths and on memory-safety panics 💯

@valarauca
Owner Author

All previous changes have been merged into master as of ed6901c

@valarauca
Owner Author

valarauca commented Dec 11, 2016

So I wrote a [timer crate](https://github.com/valarauca/amd64_timer) to help me benchmark whether the True and False cases have different timings.

Oh it isn't even close to consistent :(

@glfmn

glfmn commented Jun 18, 2017

Have you considered using the cargo bench feature with black boxes to compare performance?

@valarauca
Owner Author

valarauca commented Jun 18, 2017

Not a bad idea.

Performance really isn't the first and foremost goal; the timing is. Also, cargo bench isn't great for what this issue is covering, since I'm trying to be CPU-clock consistent.

But yes, putting up cargo bench numbers would be good; a rough sketch of what that could look like is below.
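Something like this, using nightly's #[bench] harness and test::black_box (a sketch: the benchmark names, buffer sizes, and the crate/function paths are assumptions, not copied from the crate):

    #![feature(test)]
    extern crate test;
    extern crate consistenttime;

    use test::{black_box, Bencher};
    use consistenttime::ct_u8_slice_eq;

    // Equal buffers: exercises the "true" path.
    #[bench]
    fn bench_slice_eq_true(b: &mut Bencher) {
        let x = black_box([0u8; 64]);
        let y = black_box([0u8; 64]);
        b.iter(|| black_box(ct_u8_slice_eq(&x, &y)));
    }

    // Differing buffers: exercises the "false" path.
    #[bench]
    fn bench_slice_eq_false(b: &mut Bencher) {
        let x = black_box([0u8; 64]);
        let y = black_box([1u8; 64]);
        b.iter(|| black_box(ct_u8_slice_eq(&x, &y)));
    }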


In reality I've mostly failed to make this consistent time, and I should pull it from crates.io as it doesn't live up to its claim.

@burdges
Contributor

burdges commented Jun 18, 2017

There is another crate trying to do this now: https://github.com/isislovecruft/subtle/. I've no idea if it will succeed either, but it's worth exploring.
