Implementation notes: aarch64, rpi4ubuntu64, crypto_stream/chacha8

Computer: rpi4ubuntu64
Architecture: aarch64
CPU ID: 410fd083
SUPERCOP version: 20191221
Operation: crypto_stream
Primitive: chacha8
Time   Object size  Test size      Implementation      Compiler  Benchmark date  SUPERCOP version
3610   3860 0 4     14456 848 792  dolbeau/arm-neon    gcc_-march=native_-mtune=native_-O2_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
3623   4596 0 4     16497 856 808  dolbeau/arm-neon    gcc_-march=native_-mtune=native_-O3_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
3893   3820 0 1     15500 784 808  dolbeau/arm-neon    clang_-mcpu=native_-O3_-fomit-frame-pointer_-fwrapv_-Qunused-arguments_-fPIC_-fPIE  20200110  20191221
4541   3260 0 4     13008 832 784  dolbeau/arm-neon    gcc_-march=native_-mtune=native_-Os_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
5535   2908 0 4     14825 856 808  dolbeau/mipsel-msa  gcc_-march=native_-mtune=native_-O3_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
5560   3212 0 4     15105 856 808  e/merged            gcc_-march=native_-mtune=native_-O3_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
5598   2844 0 4     14745 856 808  e/ref               gcc_-march=native_-mtune=native_-O3_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
5633   2048 0 4     11768 832 784  e/merged            gcc_-march=native_-mtune=native_-Os_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
5650   2420 0 4     13000 848 792  e/merged            gcc_-march=native_-mtune=native_-O2_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
5748   4740 0 4     16649 856 808  e/regs              gcc_-march=native_-mtune=native_-O3_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
6059   4300 0 4     15008 848 792  dolbeau/arm-neon    gcc_-march=native_-mtune=native_-O_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
6728   2924 0 1     14596 784 808  e/merged            clang_-mcpu=native_-O3_-fomit-frame-pointer_-fwrapv_-Qunused-arguments_-fPIC_-fPIE  20200110  20191221
7004   2476 0 1     14156 784 808  dolbeau/mipsel-msa  clang_-mcpu=native_-O3_-fomit-frame-pointer_-fwrapv_-Qunused-arguments_-fPIC_-fPIE  20200110  20191221
7004   2476 0 1     14148 784 808  e/ref               clang_-mcpu=native_-O3_-fomit-frame-pointer_-fwrapv_-Qunused-arguments_-fPIC_-fPIE  20200110  20191221
7254   2476 0 1     14148 784 808  e/regs              clang_-mcpu=native_-O3_-fomit-frame-pointer_-fwrapv_-Qunused-arguments_-fPIC_-fPIE  20200110  20191221
8532   3748 0 4     14448 848 792  e/merged            gcc_-march=native_-mtune=native_-O_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
10190  2168 0 4     12752 848 792  e/regs              gcc_-march=native_-mtune=native_-O2_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
11814  3144 0 4     13848 848 792  e/regs              gcc_-march=native_-mtune=native_-O_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
13349  1944 0 4     11664 832 784  e/regs              gcc_-march=native_-mtune=native_-Os_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
13368  1800 0 4     11528 832 784  e/ref               gcc_-march=native_-mtune=native_-Os_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
13369  1800 0 4     11544 832 784  dolbeau/mipsel-msa  gcc_-march=native_-mtune=native_-Os_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
13411  2136 0 4     12720 848 792  e/ref               gcc_-march=native_-mtune=native_-O2_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
13876  2188 0 4     12792 848 792  dolbeau/mipsel-msa  gcc_-march=native_-mtune=native_-O2_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
14369  2640 0 4     13352 848 792  dolbeau/mipsel-msa  gcc_-march=native_-mtune=native_-O_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
14371  2640 0 4     13336 848 792  e/ref               gcc_-march=native_-mtune=native_-O_-fomit-frame-pointer_-fwrapv_-fPIC_-fPIE  20200110  20191221
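
To make the spread in the table easier to read, each Time value can be expressed relative to the fastest row. The sketch below does this for a handful of rows taken from the table (Time 3610 for dolbeau/arm-neon with gcc -O2, 3893 for dolbeau/arm-neon with clang -O3, 5598 for e/ref with gcc -O3, 14371 for e/ref with gcc -O). The struct and function names are hypothetical, the compiler strings are abbreviated, and the page does not state the unit of the Time column.

```c
#include <stddef.h>

/* One table row reduced to the fields needed here (hypothetical type). */
struct timing {
    const char *impl;      /* Implementation column */
    const char *compiler;  /* abbreviated Compiler column */
    unsigned    time;      /* Time column, as listed in the table */
};

static const struct timing rows[] = {
    { "dolbeau/arm-neon", "gcc -O2",   3610 },
    { "dolbeau/arm-neon", "clang -O3", 3893 },
    { "e/ref",            "gcc -O3",   5598 },
    { "e/ref",            "gcc -O",   14371 },
};

/* Smallest Time among n rows (n must be at least 1). */
static unsigned fastest(const struct timing *t, size_t n)
{
    unsigned best = t[0].time;
    for (size_t i = 1; i < n; i++)
        if (t[i].time < best)
            best = t[i].time;
    return best;
}

/* Time of row i divided by the fastest Time among the n rows. */
static double slowdown(const struct timing *t, size_t n, size_t i)
{
    return (double)t[i].time / (double)fastest(t, n);
}
```

With these four rows, slowdown(rows, 4, 2) is about 1.55 and slowdown(rows, 4, 3) is about 3.98, so the choice of implementation and optimization level changes the measured time by up to roughly a factor of four.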

Compiler output

Implementation: amd64-ssse3
Security model: unknown
Compiler: clang -mcpu=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
chacha.s: chacha.s:19:5: error: unknown token in expression
chacha.s: mov %rsp,%r11
chacha.s: ^
chacha.s: chacha.s:19:5: error: invalid operand
chacha.s: mov %rsp,%r11
chacha.s: ^
chacha.s: chacha.s:20:5: error: invalid token in expression
chacha.s: and $31,%r11
chacha.s: ^
chacha.s: chacha.s:20:5: error: invalid operand
chacha.s: and $31,%r11
chacha.s: ^
chacha.s: chacha.s:21:5: error: invalid token in expression
chacha.s: add $384,%r11
chacha.s: ^
chacha.s: chacha.s:21:5: error: invalid operand
chacha.s: add $384,%r11
chacha.s: ^
chacha.s: chacha.s:22:5: error: unknown token in expression
chacha.s: sub %r11,%rsp
chacha.s: ^
chacha.s: chacha.s:22:5: error: invalid operand
chacha.s: sub %r11,%rsp
chacha.s: ^
chacha.s: chacha.s:23:6: error: unknown token in expression
chacha.s: ...

Number of similar (compiler,implementation) pairs: 1, namely:
Compiler  Implementations
clang -mcpu=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE  amd64-ssse3

Compiler output

Implementation: amd64-ssse3
Security model: unknown
Compiler: gcc -march=native -mtune=native -O2 -fomit-frame-pointer -fwrapv -fPIC -fPIE
chacha.s: chacha.s: Assembler messages:
chacha.s: chacha.s:19: Error: operand 1 must be an integer register -- `mov %rsp,%r11'
chacha.s: chacha.s:20: Error: operand 1 must be an integer or stack pointer register -- `and $31,%r11'
chacha.s: chacha.s:21: Error: operand 1 must be an integer or stack pointer register -- `add $384,%r11'
chacha.s: chacha.s:22: Error: operand 1 must be an integer or stack pointer register -- `sub %r11,%rsp'
chacha.s: chacha.s:23: Error: operand 1 must be an integer register -- `mov %rdi,%r8'
chacha.s: chacha.s:24: Error: operand 1 must be an integer register -- `mov %rsi,%rsi'
chacha.s: chacha.s:25: Error: operand 1 must be an integer register -- `mov %rsi,%rdi'
chacha.s: chacha.s:26: Error: operand 1 must be an integer register -- `mov %rdx,%rdx'
chacha.s: chacha.s:27: Error: operand 1 must be an integer or stack pointer register -- `cmp $0,%rdx'
chacha.s: chacha.s:29: Error: unknown mnemonic `jbe' -- `jbe ._done'
chacha.s: chacha.s:31: Error: operand 1 must be an integer register -- `mov $0,%rax'
chacha.s: chacha.s:33: Error: operand 1 must be an integer register -- `mov %rdx,%rcx'
chacha.s: chacha.s:35: Error: unknown mnemonic `rep' -- `rep stosb'
chacha.s: chacha.s:37: Error: operand 1 must be an integer or stack pointer register -- `sub %rdx,%rdi'
chacha.s: chacha.s:39: Error: unknown mnemonic `jmp' -- `jmp ._start'
chacha.s: chacha.s:47: Error: operand 1 must be an integer register -- `mov %rsp,%r11'
chacha.s: chacha.s:48: Error: operand 1 must be an integer or stack pointer register -- `and $31,%r11'
chacha.s: chacha.s:49: Error: operand 1 must be an integer or stack pointer register -- `add $384,%r11'
chacha.s: chacha.s:50: Error: operand 1 must be an integer or stack pointer register -- `sub %r11,%rsp'
chacha.s: chacha.s:52: Error: operand 1 must be an integer register -- `mov %rdi,%r8'
chacha.s: chacha.s:54: Error: operand 1 must be an integer register -- `mov %rsi,%rsi'
chacha.s: chacha.s:56: Error: operand 1 must be an integer register -- `mov %rdx,%rdi'
chacha.s: chacha.s:58: Error: operand 1 must be an integer register -- `mov %rcx,%rdx'
chacha.s: chacha.s:60: Error: operand 1 must be an integer or stack pointer register -- `cmp $0,%rdx'
chacha.s: ...

Number of similar (compiler,implementation) pairs: 4, namely:
Compiler  Implementations
gcc -march=native -mtune=native -O2 -fomit-frame-pointer -fwrapv -fPIC -fPIE  amd64-ssse3
gcc -march=native -mtune=native -O3 -fomit-frame-pointer -fwrapv -fPIC -fPIE  amd64-ssse3
gcc -march=native -mtune=native -O -fomit-frame-pointer -fwrapv -fPIC -fPIE  amd64-ssse3
gcc -march=native -mtune=native -Os -fomit-frame-pointer -fwrapv -fPIC -fPIE  amd64-ssse3
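
The rejected lines are x86-64 instructions in AT&T syntax (source operand first), which is why both aarch64 assemblers refuse them. For readers unfamiliar with that syntax, the first four instructions (mov %rsp,%r11; and $31,%r11; add $384,%r11; sub %r11,%rsp) appear to reserve a 32-byte-aligned scratch area on the stack. A minimal C sketch of the same arithmetic follows; frame_adjust is a hypothetical name, and a plain integer stands in for the stack pointer register.

```c
#include <stdint.h>

/* Mirrors the four rejected x86-64 instructions, one statement each.
 * frame_adjust is a hypothetical name; sp stands in for %rsp. */
static uint64_t frame_adjust(uint64_t sp)
{
    uint64_t r11 = sp;   /* mov %rsp,%r11 */
    r11 &= 31;           /* and $31,%r11  : offset of sp within a 32-byte block */
    r11 += 384;          /* add $384,%r11 : plus 384 bytes of scratch space */
    return sp - r11;     /* sub %r11,%rsp : new stack pointer, a multiple of 32 */
}
```

For example, frame_adjust(1000) yields 608, which is a multiple of 32. None of this is expressible with the operand names in the log on aarch64, so the amd64-ssse3 implementation cannot build on this machine regardless of optimization level.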

Compiler output

Implementation: goll_gueron
Security model: unknown
Compiler: clang -mcpu=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
stream.c: In file included from stream.c:11:
stream.c: In file included from /usr/lib/llvm-9/lib/clang/9.0.0/include/immintrin.h:14:
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:50:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_vec_init_v2si(__i, 0);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:129:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_packsswb((__v4hi)__m1, (__v4hi)__m2);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:159:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_packssdw((__v2si)__m1, (__v2si)__m2);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:189:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_packuswb((__v4hi)__m1, (__v4hi)__m2);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:216:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_punpckhbw((__v8qi)__m1, (__v8qi)__m2);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:239:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_punpckhwd((__v4hi)__m1, (__v4hi)__m2);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:260:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_punpckhdq((__v2si)__m1, (__v2si)__m2);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:287:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_punpcklbw((__v8qi)__m1, (__v8qi)__m2);
stream.c: ...

Number of similar (compiler,implementation) pairs: 1, namely:
Compiler  Implementations
clang -mcpu=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE  goll_gueron

Compiler output

Implementation: goll_gueron
Security model: unknown
Compiler: gcc -march=native -mtune=native -O2 -fomit-frame-pointer -fwrapv -fPIC -fPIE
stream.c: stream.c:11:10: fatal error: immintrin.h: No such file or directory
stream.c: 11 | #include <immintrin.h>
stream.c: | ^~~~~~~~~~~~~
stream.c: compilation terminated.

Number of similar (compiler,implementation) pairs: 4, namely:
Compiler  Implementations
gcc -march=native -mtune=native -O2 -fomit-frame-pointer -fwrapv -fPIC -fPIE  goll_gueron
gcc -march=native -mtune=native -O3 -fomit-frame-pointer -fwrapv -fPIC -fPIE  goll_gueron
gcc -march=native -mtune=native -O -fomit-frame-pointer -fwrapv -fPIC -fPIE  goll_gueron
gcc -march=native -mtune=native -Os -fomit-frame-pointer -fwrapv -fPIC -fPIE  goll_gueron
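
Both goll_gueron failures trace back to stream.c including immintrin.h, the x86 SIMD intrinsics header. gcc on this aarch64 machine does not find the header at all, while clang finds its own copy but cannot compile the x86 builtins inside it. Portable sources commonly select the intrinsics header by target architecture. The following is only a sketch of that pattern, not code from SUPERCOP; simd_header_name and SIMD_HEADER are hypothetical names, while the predefined macros being tested are provided by gcc and clang.

```c
/* Pick the SIMD intrinsics header that matches the target architecture. */
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>            /* x86 only: MMX/SSE/AVX intrinsics */
#define SIMD_HEADER "immintrin.h"
#elif defined(__aarch64__) || defined(__ARM_NEON)
#include <arm_neon.h>             /* Arm NEON intrinsics */
#define SIMD_HEADER "arm_neon.h"
#else
#define SIMD_HEADER "none"        /* no SIMD header: fall back to portable C */
#endif

/* Hypothetical helper: reports which header this build selected. */
static const char *simd_header_name(void)
{
    return SIMD_HEADER;
}
```

On a machine like the one benchmarked here, such a guard would be expected to select arm_neon.h, and an implementation written only against immintrin.h would still have no code path to use.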

Compiler output

Implementation: krovetz/avx2
Security model: unknown
Compiler: clang -mcpu=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
stream.c: In file included from stream.c:8:
stream.c: In file included from /usr/lib/llvm-9/lib/clang/9.0.0/include/immintrin.h:14:
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:50:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_vec_init_v2si(__i, 0);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:129:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_packsswb((__v4hi)__m1, (__v4hi)__m2);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:159:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_packssdw((__v2si)__m1, (__v2si)__m2);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:189:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_packuswb((__v4hi)__m1, (__v4hi)__m2);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:216:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_punpckhbw((__v8qi)__m1, (__v8qi)__m2);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:239:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_punpckhwd((__v4hi)__m1, (__v4hi)__m2);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:260:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_punpckhdq((__v2si)__m1, (__v2si)__m2);
stream.c: ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
stream.c: /usr/lib/llvm-9/lib/clang/9.0.0/include/mmintrin.h:287:12: error: invalid conversion between vector type '__m64' (vector of 1 'long long' value) and integer type 'int' of different size
stream.c: return (__m64)__builtin_ia32_punpcklbw((__v8qi)__m1, (__v8qi)__m2);
stream.c: ...

Number of similar (compiler,implementation) pairs: 1, namely:
Compiler  Implementations
clang -mcpu=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE  krovetz/avx2

Compiler output

Implementation: krovetz/avx2
Security model: unknown
Compiler: gcc -march=native -mtune=native -O2 -fomit-frame-pointer -fwrapv -fPIC -fPIE
stream.c: stream.c:8:10: fatal error: immintrin.h: No such file or directory
stream.c: 8 | #include <immintrin.h>
stream.c: | ^~~~~~~~~~~~~
stream.c: compilation terminated.

Number of similar (compiler,implementation) pairs: 4, namely:
Compiler  Implementations
gcc -march=native -mtune=native -O2 -fomit-frame-pointer -fwrapv -fPIC -fPIE  krovetz/avx2
gcc -march=native -mtune=native -O3 -fomit-frame-pointer -fwrapv -fPIC -fPIE  krovetz/avx2
gcc -march=native -mtune=native -O -fomit-frame-pointer -fwrapv -fPIC -fPIE  krovetz/avx2
gcc -march=native -mtune=native -Os -fomit-frame-pointer -fwrapv -fPIC -fPIE  krovetz/avx2

Compiler output

Implementation: krovetz/vec128
Security model: unknown
Compiler: clang -mcpu=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
stream.c: stream.c:80:2: error: -- Implementation supports only machines with neon, altivec or SSE2
stream.c: #error -- Implementation supports only machines with neon, altivec or SSE2
stream.c: ^
stream.c: stream.c:151:14: warning: implicit declaration of function 'NONCE' is invalid in C99 [-Wimplicit-function-declaration]
stream.c: vec s3 = NONCE(np);
stream.c: ^
stream.c: stream.c:151:9: error: initializing 'vec' (vector of 4 'unsigned int' values) with an expression of incompatible type 'int'
stream.c: vec s3 = NONCE(np);
stream.c: ^ ~~~~~~~~~
stream.c: stream.c:152:36: error: use of undeclared identifier 'VBPI'
stream.c: for (iters = 0; iters < inlen/(BPI*64); iters++) {
stream.c: ^
stream.c: stream.c:91:19: note: expanded from macro 'BPI'
stream.c: #define BPI (VBPI + GPR_TOO) /* Blocks computed per loop iteration */
stream.c: ^
stream.c: stream.c:152:36: error: use of undeclared identifier 'GPR_TOO'
stream.c: stream.c:91:26: note: expanded from macro 'BPI'
stream.c: #define BPI (VBPI + GPR_TOO) /* Blocks computed per loop iteration */
stream.c: ^
stream.c: stream.c:155:19: error: use of undeclared identifier 'ONE'
stream.c: v7 = v3 + ONE;
stream.c: ^
stream.c: stream.c:176:13: warning: implicit declaration of function 'ROTW16' is invalid in C99 [-Wimplicit-function-declaration]
stream.c: DQROUND_VECTORS(v0,v1,v2,v3)
stream.c: ^
stream.c: ...

Number of similar (compiler,implementation) pairs: 1, namely:
Compiler  Implementations
clang -mcpu=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE  krovetz/vec128

Compiler output

Implementation: krovetz/vec128
Security model: unknown
Compiler: gcc -march=native -mtune=native -O2 -fomit-frame-pointer -fwrapv -fPIC -fPIE
stream.c: stream.c:80:2: error: #error -- Implementation supports only machines with neon, altivec or SSE2
stream.c: 80 | #error -- Implementation supports only machines with neon, altivec or SSE2
stream.c: | ^~~~~
stream.c: stream.c: In function 'crypto_stream_chacha8_krovetz_vec128_xor':
stream.c: stream.c:151:14: warning: implicit declaration of function 'NONCE' [-Wimplicit-function-declaration]
stream.c: 151 | vec s3 = NONCE(np);
stream.c: | ^~~~~
stream.c: stream.c:151:14: error: incompatible types when initializing type 'vec' {aka '__vector(4) unsigned int'} using type 'int'
stream.c: stream.c:91:19: error: 'VBPI' undeclared (first use in this function); did you mean 'BPI'?
stream.c: 91 | #define BPI (VBPI + GPR_TOO) /* Blocks computed per loop iteration */
stream.c: | ^~~~
stream.c: stream.c:152:36: note: in expansion of macro 'BPI'
stream.c: 152 | for (iters = 0; iters < inlen/(BPI*64); iters++) {
stream.c: | ^~~
stream.c: stream.c:91:19: note: each undeclared identifier is reported only once for each function it appears in
stream.c: 91 | #define BPI (VBPI + GPR_TOO) /* Blocks computed per loop iteration */
stream.c: | ^~~~
stream.c: stream.c:152:36: note: in expansion of macro 'BPI'
stream.c: 152 | for (iters = 0; iters < inlen/(BPI*64); iters++) {
stream.c: | ^~~
stream.c: stream.c:91:26: error: 'GPR_TOO' undeclared (first use in this function)
stream.c: 91 | #define BPI (VBPI + GPR_TOO) /* Blocks computed per loop iteration */
stream.c: | ^~~~~~~
stream.c: stream.c:152:36: note: in expansion of macro 'BPI'
stream.c: 152 | for (iters = 0; iters < inlen/(BPI*64); iters++) {
stream.c: ...

Number of similar (compiler,implementation) pairs: 4, namely:
Compiler  Implementations
gcc -march=native -mtune=native -O2 -fomit-frame-pointer -fwrapv -fPIC -fPIE  krovetz/vec128
gcc -march=native -mtune=native -O3 -fomit-frame-pointer -fwrapv -fPIC -fPIE  krovetz/vec128
gcc -march=native -mtune=native -O -fomit-frame-pointer -fwrapv -fPIC -fPIE  krovetz/vec128
gcc -march=native -mtune=native -Os -fomit-frame-pointer -fwrapv -fPIC -fPIE  krovetz/vec128
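
The first krovetz/vec128 diagnostic is the implementation's own #error at stream.c:80. The later errors about NONCE, VBPI, GPR_TOO and ONE look like fallout, since those names are presumably defined only in the architecture branches that were skipped. The log does not show which macros the guard tests, so the reason it fires on this NEON-capable aarch64 machine cannot be read off this page. One way to investigate is to check which candidate feature macros the compiler actually predefines. The sketch below does that; the macro list is an assumption about what such a guard might test, and simd_feature_mask and the FEAT_ constants are hypothetical names.

```c
/* Bit flags for the feature macros probed below (hypothetical names). */
enum {
    FEAT_ARM_NEON_OLD = 1 << 0,  /* __ARM_NEON__ (older spelling) */
    FEAT_ARM_NEON     = 1 << 1,  /* __ARM_NEON   (ACLE spelling)  */
    FEAT_ALTIVEC      = 1 << 2,  /* __ALTIVEC__                   */
    FEAT_SSE2         = 1 << 3   /* __SSE2__                      */
};

/* Returns one bit per feature macro that the compiler predefines
 * for the current target and flags. */
static unsigned simd_feature_mask(void)
{
    unsigned mask = 0;
#ifdef __ARM_NEON__
    mask |= FEAT_ARM_NEON_OLD;
#endif
#ifdef __ARM_NEON
    mask |= FEAT_ARM_NEON;
#endif
#ifdef __ALTIVEC__
    mask |= FEAT_ALTIVEC;
#endif
#ifdef __SSE2__
    mask |= FEAT_SSE2;
#endif
    return mask;
}
```

Building this with the same compiler flags as in the log and printing the result would show, for instance, whether only one of the two NEON spellings is defined, which would be one possible explanation for the guard falling through to #error.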