Splitting The Nanosecond To Find A Nanosecond

2nd December, 2020

For years, trading system designers have marvelled at the lengths a small group of HFT firms will go to in order to find an extra nanosecond over their competitors. Anecdotally, people understand why being fast matters far less than being a hair faster than the next firm. Finding that hair could be a case of being in a rack closer to the matching engine, using a metre less cable, or moving to the next Layer 1 switch.

In our last blog post, “Five Nanos and Fibre Taps”, we talked Layer 1 hardware, test bench cadence and the science of five nanoseconds per metre. In this post we’re back to share the results of sub-nanosecond benchmarking, crosspoint proximity, and why all ports aren’t created equal.

Sub Nanoseconds

The first port of call for round two was to enable sub-nanosecond hardware timestamping in our test bench, giving us finer granularity with a time resolution down to the picosecond. A picosecond is 10^-12 seconds (the twelfth decimal place of a second), otherwise described as a trillionth of a second. This level of precision allows us to benchmark with a very high level of accuracy and to report detailed, data-backed latency figures for each network component in a trading stack.
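To put a picosecond in perspective, here is a minimal Python sketch that combines it with the five-nanoseconds-per-metre rule of thumb from the previous post. The constant and function names are our own illustrative choices, and the figure for cable delay is the rule of thumb only, not a measurement of any specific cable. Timestamps are kept as integer picoseconds because a 64-bit float carries roughly 15 to 17 significant decimal digits, which is not enough to hold epoch seconds and picoseconds together.

```python
PS_PER_NS = 1_000                 # picoseconds in a nanosecond
NS_PER_S = 1_000_000_000          # nanoseconds in a second
CABLE_DELAY_PS_PER_M = 5_000      # assumption: ~5 ns per metre rule of thumb


def to_picoseconds(seconds: int, nanos: int, picos: int = 0) -> int:
    """Combine the parts of a timestamp into one integer picosecond count."""
    return (seconds * NS_PER_S + nanos) * PS_PER_NS + picos


def cable_length_mm(delay_ps: int) -> float:
    """Length of cable (in millimetres) that accounts for a given delay."""
    return delay_ps * 1000 / CABLE_DELAY_PS_PER_M


print(to_picoseconds(1, 0))       # 1000000000000, a trillion ps in a second
print(cable_length_mm(1_000))     # 200.0, so one nanosecond is about 20 cm
print(cable_length_mm(1))         # 0.2, so one picosecond is about 0.2 mm
```

Under that rule of thumb, a single picosecond corresponds to roughly a fifth of a millimetre of cable, which is why trace lengths inside a device start to show up at this resolution.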

Crosspoint Proximity

Looking at the first set of results from the test bench (performed on a Layer 1 device in port replication mode), it became apparent that each port reports a subtly different latency profile. We are talking about nanosecond and picosecond differences per port, but understanding the reason for this variance and, perhaps more pertinently, how to capitalise on it was clearly going to be key. We were soon able to dispel the common misunderstanding that two ports physically located beside one another have a lower latency profile than two ports at opposite ends of the device. It transpires that we need to take into consideration the ‘board traces’, that is, the length of the actual circuitry used to connect a given port to the fabric of the device, as well as the concept of the crosspoint component.

The crosspoint itself, a form of internal patching matrix, allows the user to configure which port an ingress signal should egress from. How does this affect port-to-port latency profiles? It means we need to consider the flow not from a port-to-port perspective, but from port-to-crosspoint-to-port. This creates a requirement to understand a port’s proximity to the crosspoint component itself and the length of the board traces that carry a signal between the two. Interestingly, the Layer 1 device we are currently benchmarking has the crosspoint located in the lower central section of the device, directly behind the physical front panel of ports. Looking deeper at the results, we see that the lowest port replication latency profile is achieved when the ingress and egress ports are those in the lowermost central location of the front panel. Again, we are talking about subtle variances in time here, but in the world of electronic trading and order entry executions those differences could mean everything.

So how can this information be leveraged? By performing multiple tests with different device configurations and analysing the results, we now understand the optimal configuration with regard to which ports should be consumed by hosts first. Note the emphasis on the word “first” here. Naturally, as the number of hosts increases, the ports left to allocate sit further from the crosspoint. So, whilst we can now fully optimise our Layer 1 configurations, we must understand and accept that eventually port-to-crosspoint latency variance will play a part in the overall latency of the system.
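The port-to-crosspoint-to-port view can be sketched as a simple model: the latency of a path is the ingress trace delay, plus the crosspoint itself, plus the egress trace delay. The Python below is a minimal illustration under that assumption. The per-port figures are made-up placeholders chosen only to show the shape of the calculation; they are not our measured results, and the function names are hypothetical.

```python
def path_latency_ps(trace_ps: dict, ingress: int, egress: int,
                    crosspoint_ps: int) -> int:
    """Model a replicated path as ingress trace + crosspoint + egress trace."""
    return trace_ps[ingress] + crosspoint_ps + trace_ps[egress]


def allocation_order(trace_ps: dict) -> list:
    """Ports sorted closest-to-crosspoint first (ties broken by port number)."""
    return sorted(trace_ps, key=lambda port: (trace_ps[port], port))


# Placeholder trace delays in picoseconds, NOT measured values.
# Ports 23 and 24 stand in for the lower central ports nearest the crosspoint.
example_trace_ps = {1: 900, 2: 850, 23: 300, 24: 310, 47: 880, 48: 920}
EXAMPLE_CROSSPOINT_PS = 3_500     # placeholder, NOT a measured value

print(allocation_order(example_trace_ps))
# [23, 24, 2, 47, 1, 48]
print(path_latency_ps(example_trace_ps, 23, 24, EXAMPLE_CROSSPOINT_PS))
# 4110
print(path_latency_ps(example_trace_ps, 1, 2, EXAMPLE_CROSSPOINT_PS))
# 5250
```

In this toy model, ports 1 and 2 sit side by side on the panel yet give a slower path than 23 and 24, because both are further from the crosspoint. Allocating hosts in the returned order uses the best ports first, and each later host inevitably lands on a longer trace.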

Test Bench

The task of benchmarking any network device inevitably involves physical repetition, a mass of results, data integrity and presentation challenges and an overall need for a structured approach.

Initially, moving an SFP from one port to another was something we just couldn’t avoid. We automated the scheduling of each test by sending a packet through the device at a fixed interval, only to realise later that choosing the value of that interval is quite a challenge. Too low and the tester would have to rush; too high and time would be wasted.

Creating a tester-controlled scheduler script ensured data integrity through the use of sequencing, and gave the tester the convenience of choosing when to start and stop each test. This also helped to ensure that environmental or human error factors were not impacting the resulting data. It meant the tester could verify the next test, or even write some notes on the next blog post that just happened to come to mind!
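As a rough idea of what a tester-controlled scheduler can look like, here is a minimal Python sketch. It is not our actual script: `run_schedule`, `RunRecord`, and the `send_packet` and `confirm` callbacks are hypothetical names, and the real packet generation is left to whatever the caller passes in. The point is only that each test gets a sequence number and nothing is sent until the tester confirms the SFP has been moved.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class RunRecord:
    sequence: int       # ties captured packets back to one test run
    egress_port: int
    packets_sent: int


def run_schedule(egress_ports: Iterable[int],
                 send_packet: Callable[[int, int], None],
                 confirm: Callable[[str], bool],
                 packets_per_test: int = 3) -> List[RunRecord]:
    """Run one test per egress port, waiting for the tester before each."""
    records = []
    for sequence, port in enumerate(egress_ports, start=1):
        # The tester decides when to start; declining stops the schedule.
        if not confirm(f"Test {sequence}: SFP in port {port}? [y/n] "):
            break
        for packet_number in range(packets_per_test):
            send_packet(sequence, packet_number)
        records.append(RunRecord(sequence, port, packets_per_test))
    return records
```

On a real bench, `confirm` could be as simple as `lambda prompt: input(prompt).strip().lower() == "y"`, so the tester sets the pace instead of a fixed interval.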

As we tested different-sized packets through each port on a device, the output of the benchmarking process for a single flow direction grew quickly. On a 48-port device in port replication mode there are 47 possible egress ports, so multiplying the packet-size variants by 47 results in 94 individual tests.
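The arithmetic is easy to check by enumerating the test matrix. In the Python sketch below, 94 tests across 47 egress ports implies two packet-size variants; the sizes used (64 and 1518 bytes) and the choice of port 1 as ingress are assumptions for illustration only.

```python
from itertools import product


def build_test_matrix(port_count: int, ingress_port: int,
                      packet_sizes: tuple) -> list:
    """Every (packet size, egress port) pair for one flow direction."""
    egress_ports = [port for port in range(1, port_count + 1)
                    if port != ingress_port]
    return list(product(packet_sizes, egress_ports))


# Assumed values: ingress on port 1, packet sizes of 64 and 1518 bytes.
matrix = build_test_matrix(port_count=48, ingress_port=1,
                           packet_sizes=(64, 1518))
print(len(matrix))    # 94
print(matrix[0])      # (64, 2)
```

Each additional packet size adds another 47 tests per flow direction, which is why a structured approach matters long before the matrix gets large.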

Post-processing the data now becomes a fundamental requirement and function of the test bench. This consists of taking the raw results, dissecting the packet headers to determine what type of packet was seen and its size, comparing it to the previous packet to ensure a match (since we are comparing timestamps taken at two points in the network), and finally extracting the sub-nanosecond hardware timestamps. In order to extract a value of time from those timestamps, we had to investigate the format of the additional packet headers, which allowed us to perform byte-level decoding. Once we had all our data in the correct format (and by format we mean readable by humans), presenting the data for analysis became a trivial task.
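The matching and byte-level decoding steps can be sketched roughly as follows. The timestamp layout here is entirely hypothetical: we assume a 10-byte big-endian trailer of seconds (uint32), nanoseconds (uint32) and picoseconds within the nanosecond (uint16). Real timestamping hardware uses vendor-specific formats, so treat `TRAILER`, `split_frame` and `latency_ps` as illustrative names rather than a description of any product. Header dissection for packet type and size is left out to keep the sketch short.

```python
import struct

# Hypothetical trailer: seconds (uint32), nanoseconds (uint32),
# picoseconds within the nanosecond (uint16), all big-endian.
TRAILER = struct.Struct("!IIH")


def split_frame(frame: bytes):
    """Return (payload, timestamp in integer picoseconds) for one capture."""
    if len(frame) < TRAILER.size:
        raise ValueError("frame too short to carry a timestamp trailer")
    payload, trailer = frame[:-TRAILER.size], frame[-TRAILER.size:]
    seconds, nanos, picos = TRAILER.unpack(trailer)
    return payload, (seconds * 1_000_000_000 + nanos) * 1_000 + picos


def latency_ps(ingress_frame: bytes, egress_frame: bytes) -> int:
    """Latency between two captures of the same packet, in picoseconds."""
    payload_in, ts_in = split_frame(ingress_frame)
    payload_out, ts_out = split_frame(egress_frame)
    if payload_in != payload_out:
        raise ValueError("packet mismatch: captures are not the same packet")
    return ts_out - ts_in


# Illustrative captures only; the timestamps are invented for the example.
packet = b"\x00" * 64
before = packet + TRAILER.pack(1606867200, 100, 250)
after = packet + TRAILER.pack(1606867200, 105, 130)
print(latency_ps(before, after))    # 4880 (picoseconds, i.e. 4.88 ns)
```

Refusing to compute a latency when the payloads differ is what protects the results from a dropped or reordered packet quietly skewing the figures.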


Our benchmarking is now complete for the first key vendor Layer 1 device and shows some very interesting results. We can now confirm its ability to perform port replication in 5 ns with a variance of +/- 1 ns, meaning that it is possible to achieve a ~4 ns latency profile *if* you get your crosspoint proximity right!

With two other key vendors lined up for the test bench over the coming weeks, we are well on our way to being able to provide vendor-agnostic solutions to our current and prospective clients, with the knowledge and expertise to further optimise such solutions and achieve the fastest in ultra-low latency connectivity. Stay tuned for the results of these tests!

– Josh Patel, Network Engineer at Options

0 responses to “Splitting The Nanosecond To Find A Nanosecond”

  1. michael botlo says:

    This is approaching experimental physics territory, happy to make an intro to CERN, lmk.
