eBACS: ECRYPT Benchmarking of Cryptographic Systems
|How to submit new software:||Tips||hash||stream||aead||dh||kem||encrypt||sign|
|List of primitives measured:||lwc||sha3||hash||stream||lwc||caesar||aead||dh||kem||encrypt||sign|
|Measurements indexed by machine:||lwc||sha3||hash||stream||lwc||caesar||aead||dh||kem||encrypt||sign|
|List of subroutines:||verify||decode||encode||sort||core||hashblocks||scalarmult|
The latest release of SUPERCOP measures the performance of hash functions, secret-key stream ciphers, public-key encryption systems, public-key signature systems, and public-key secret-sharing systems. SUPERCOP integrates and improves upon
Specifically, SUPERCOP measures cryptographic primitives according to several criteria:
To collect measurements, simply download, unpack, and run SUPERCOP:
wget https://bench.cr.yp.to/supercop/supercop-20210125.tar.xz unxz < supercop-20210125.tar.xz | tar -xf - cd supercop-20210125 nohup sh do &Put the resulting supercop-20210125/bench/*/data.gz file on the web, and send the URL to the eBACS/eBATS/eBASC/eBASH mailing list.
Your computer's clock needs to be reasonably accurate. You need standard compilation tools (apt install build-essential on Debian). You should also install xsltproc (apt install xsltproc) to compile the Keccak Code Package, and OpenSSL (apt install libssl-dev) to benchmark implementations that use OpenSSL, but if these aren't installed then SUPERCOP will still benchmark other implementations.
Multiple computers that share filesystems can run SUPERCOP in the same directory. Each computer will create its own subdirectory of bench, labelled by the computer's name, and will perform all work inside that subdirectory.
wget https://bench.cr.yp.to/supercop/supercop-20210125.tar.xz unxz < supercop-20210125.tar.xz | tar -xf - cd supercop-20210125 nohup sh data-do &Put the resulting supercop-data/*/data.gz file on the web, and send the URL to the eBACS/eBATS/eBASC/eBASH mailing list.
The disadvantage of this method is that it consumes extra disk space (typically 20 gigabytes or more, and many inodes). The big advantage of this method is incrementality: an updated version of SUPERCOP will automatically reuse most of the work from the supercop-20210125 run, benchmarking only new code and changed code, so the new benchmark run will finish much more quickly. (However, if an OS update has changed the compiler version, everything will be automatically re-benchmarked.)
Another advantage of this method is parallelizability: on (e.g.) a 4-core machine you can run
nohup sh data-do 4 &to finish the benchmarks almost 4 times as quickly.
Heterogeneous cores. ARM big.LITTLE systems have two different types of cores supporting the same instructions. Depending on the CPU load, the OS may decide to run jobs on the smaller, lower-power cores or on the larger, faster cores. It is also possible, although less common, to have heterogeneous cores on Intel/AMD systems.
Running ./data-do 4 automatially pins tasks to cores 0, 1, 2, 3, but how do you figure out that cores 0, 1, 2, 3 all have the same type? And how do you benchmark the other types of cores? The simple answer is to instead run ./data-do-biglittle (without a number). This script figures out which different types of cores there are, and separately benchmarks each type. The script should work on any Linux system with Python and with enough CPU data in /sys/devices/system/cpu and /proc/cpuinfo; if in doubt, try running ./data-do-biglittle -d first for a quick overview of which cores it will use.
Interrupts and hyperthreading. SUPERCOP checks the CPU's cycle counter, runs a computation, checks the cycle counter again, runs the computation again, etc. It records all the resulting measurements. It also runs the measuring program multiple times in case there are variations from one run to another (from, e.g., randomized memory layouts). Medians and quartiles are reported on the web pages, and any severe discrepancies are flagged in red.
An interrupt (for an incoming network packet, for example, or for an operating-system clock tick) will slow down a computation. This is much less important for SUPERCOP than it is for most benchmarking tools: a typical cryptographic operation measured by SUPERCOP completes quickly and is unlikely to be interrupted.
However, the picture changes if interrupts are frequent enough to interrupt 25% or more of the computations. The most common way for this to happen is for the same CPU core to be busy doing something else. It is thus recommended to run SUPERCOP on an idle machine.
On an otherwise idle machine, ./data-do 4 can interfere with itself, because what the OS calls cores 0, 1, 2, 3 is actually 4 "hyperthreads" competing for resources on just 2 cores. On typical Linux machines, hyperthreading is indicated by dashes or commas in the files in /sys/devices/system/cpu/cpu*/topology/thread_siblings_list: for example, if cpu2 and cpu6 both say 2,6 then they are actually one physical core running two threads.
./data-do-biglittle automatically checks for hyperthreading and runs on separate cores. An alternative is to turn off hyperthreading in the BIOS.
Hyperthreading can gain performance for some mixes of applications. It would be interesting, although challenging, to robustly benchmark this effect.
Underclocking and overclocking. A CPU that nominally runs at, e.g., 2GHz will often try to save power by running at a lower speed if it doesn't think it's busy enough to justify running at 2GHz. A computation then takes longer. A user who isn't satisfied with the resulting performance can and does tell the CPU to avoid this slowdown, so eBACS policy is to measure performance without this slowdown.
Note that the slowdown is not necessarily proportional to CPU frequency: for example, if DRAM continues running at full speed then a computation that accesses DRAM can end up taking fewer CPU cycles. Furthermore, it is not safe to assume that SUPERCOP creates enough load to avoid underclocking.
Inadequate cooling might force a CPU to reduce speed so as to avoid overheating. Standard cooling systems are designed so that the CPU can always run at its nominal frequency without overheating. Machines where cooling problems are detected are marked as unstable in eBACS tables.
The same CPU might also sometimes be able to run at a higher speed: "Turbo Boost", "Turbo Core", etc. The reason the CPU's nominal speed isn't higher is that the CPU can't always run at a higher speed: some computations will make the CPU overheat with standard cooling systems or, even worse, draw too much power. The power and temperature depend on the mix of current and recent computations, requiring lengthy benchmarks for accurate measurements and raising the question of what mixes are appropriate.
The current goal of eBACS is limited to measuring performance at the nominal CPU frequency. This is not meant to understate the importance of understanding the CPU frequencies that can be achieved in various environments, the performance of cryptographic software at higher frequencies, and the ability of software to sustain those frequencies.
Given the above considerations, it is recommended that SUPERCOP users disable underclocking, disable overclocking, and check for adequate cooling.
On current Linux systems, the system administrator can disable overclocking with
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
on Intel systems or
echo 0 > /sys/devices/system/cpu/cpufreq/boost
on AMD systems. This lasts until the next boot. For an effect that persists across reboots, the system administrator can do the same in boot scripts. An alternative is to disable overclocking in the BIOS.
On current Linux systems, the system administrator can disable underclocking as follows (unless the BIOS is configured to limit OS control):
for i in /sys/devices/system/cpu/cpu*/cpufreq do echo performance > $i/scaling_governor done
Some CPU-OS combinations provide more fine-grained control, such as
for i in /sys/devices/system/cpu/cpu*/cpufreq do echo userspace > $i/scaling_governor echo 2000000 > $i/scaling_setspeed done
to set a fixed 2GHz frequency. On these machines, you can see the available frequencies listed in /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies. When the top two frequencies are separated by just 1000, the larger is Turbo Boost; use the smaller.
Beware that it is common for an OS to enable underclocking starting on the scale of a minute after boot, overriding a boot script that disables underclocking. The system administrator can disable the enable-underclocking script for future reboots, but details depend on the OS; update-rc.d ondemand disable works on some systems, and systemctl disable ondemand.service works on some systems.
Low-precision clocks and mis-timed clocks. It is rare for a CPU to not provide a cycle counter. It is, however, common for a CPU to make its cycle counter available only to the OS kernel, at least by default. If the best clock that SUPERCOP can find ticks only 1000000 times per second, while the computation of interest finishes in 1.5 microseconds, then SUPERCOP's measurements will have tremendous variation: 1 microsecond, then 2 microseconds, then 1 microsecond, then 2 microseconds, etc. SUPERCOP could increase the precision by measuring the clock only after, say, 100 copies of the computation, but this would hide other variations of interest and would make the measurements less robust against frequent OS interrupts. It is better to give SUPERCOP access to a cycle counter.
On Intel/AMD systems, SUPERCOP generally uses the RDTSC instruction. This instruction has always been documented to count clock cycles since CPU reset, and normally the OS makes it available to applications by default. Intel and AMD now creatively interpret "clock cycles" to mean not CPU cycles, but rather the cycles of an artificial clock always running at the nominal CPU frequency; this discrepancy does not matter if the CPU is running at the nominal CPU frequency. Beware that some Intel CPUs run the artificial clock at a different frequency from the nominal CPU frequency.
Starting in 2021, SUPERCOP favors the RDPMC instruction on Linux systems that make RDPMC available to applications. The cycle counter provided through RDPMC runs at the actual CPU frequency, so accidental variations in the CPU frequency have much less effect on RDPMC results than on RDTSC results. There is still an effect from accesses to off-CPU resources such as full-speed DRAM, so controlling the CPU frequency is still recommended.
If /proc/sys/kernel/perf_event_paranoid says 3 then the system administrator will need to set it to 2 to enable RDPMC. People who disable RDPMC are generally reacting to side-channel attacks that use RDPMC, but there is little reason to believe that the variable-time software targeted by those attacks would be secure against a competent attacker armed only with RDTSC or other clocks.
Xen systems similarly disable RDPMC by default, but it may be possible to enable RDPMC with vpmu=on.
On current ARM systems under Linux, the following kernel modules enable the cycle counter on all cores: https://github.com/thoughtpolice/enable_arm_pmu (32-bit) and https://github.com/rdolbeau/enable_arm_pmu (64-bit). More variations are required for older systems, such as the following:
Further variations. Some cryptographic computations naturally vary in run time. An obvious example is RSA key generation, using rejection sampling to generate primes. In theory SUPERCOP could perform these computations enough times to see that their time follows a stable distribution, and could then remove the red flags.
Most computations run more slowly the first time because the code needs to be loaded into cache. SUPERCOP does not attempt to measure this effect (the effect disappears in medians and quartiles), but it has begun recording code size, which puts an upper bound on the effect.
There are many other microarchitectural effects that create variations from one run of a computation to another run of the same computation, even when the inputs and randomness are identical: e.g., branch-predictor state.
The database, in uncompressed form, consists of a series of database entries. Each database entry is a line consisting of the following space-separated words:
SUPERCOP automatically tries all available implementations of each primitive, and many compilers for each implementation, to select the fastest combination of implementation and compiler. Each try produces a database entry with the following words:
If a compiler issues an error message (or a warning or any other output), SUPERCOP produces a database entry with the following words:
If an implementation fails to run (for example, because it uses machine instructions not supported by the CPU), SUPERCOP produces a database entry with the following words:
SUPERCOP then measures the performance of the selected implementation and compiler on a wider variety of specific operations; for example, hash functions are selected on the basis of 1536-byte hashing, but are then measured for hashing 0 bytes, 1 byte, 2 bytes, 3 bytes, etc. Each specific operation produces a database entry with the following words:
SUPERCOP also records the selected implementation in a separate database entry with the following words:
SUPERCOP also records the selected compiler in a separate database entry with the following words:
SUPERCOP also records the CPU identifier in a separate database entry with the following words:
SUPERCOP also records the number of CPU cycles per second in a separate database entry with the following words:
SUPERCOP also records the cycle-counting mechanism in a separate database entry with the following words: