Latency Spinbit Implementation Experience

Last update: 2017-11-28
Questions? Suggestions? devaerep@student.ethz.ch
This document is kinda append only.

Implementation

Spinbit implemented in Minq according to PR 609
For experimental purposes, a full measurement byte is added to the QUIC header (long & short).
The spinbit state machine is implemented completely independently of other Minq code. The only reason the state machine cannot be implemented as a shim between UDP and QUIC is that the measurement byte (bit) should be authenticated by QUIC.
- On connection init, Minq calls c.measurement = newMeasurementData(role), passing only whether this endpoint is the server or the client.
- On packet reception, Minq calls c.measurement.incommingMeasurementTasks(&hdr), passing only the QUIC packet header.
- On packet transmission, Minq calls c.measurement.hdrData.encode(), which returns the measurement byte to be added to the Minq header.
Implementation code on Github.
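
A minimal sketch of the endpoint side of the spin signal, as I understand the proposal: the server reflects the spin value of the newest packet it received, and the client sends the inverse of it. This is not the Minq code. The names spinEndpoint, onReceive and nextSpin are hypothetical, and the sketch assumes full (already expanded) packet numbers.

```go
package main

// role tells the state machine how to derive the outgoing spin value.
type role int

const (
	roleClient role = iota
	roleServer
)

// spinEndpoint holds the per-connection spin state (hypothetical type).
type spinEndpoint struct {
	role      role
	spin      bool   // value to put in the next outgoing header
	seenAny   bool   // true once a packet has been received
	highestPn uint64 // highest packet number received so far
}

func newSpinEndpoint(r role) *spinEndpoint {
	return &spinEndpoint{role: r}
}

// onReceive updates the outgoing spin value from an incoming packet.
// Packets that are not newer than the newest packet seen are ignored,
// so reordered packets cannot flip the outgoing value back.
func (e *spinEndpoint) onReceive(pn uint64, spin bool) {
	if e.seenAny && pn <= e.highestPn {
		return
	}
	e.seenAny, e.highestPn = true, pn
	if e.role == roleServer {
		e.spin = spin // the server reflects
	} else {
		e.spin = !spin // the client inverts
	}
}

// nextSpin returns the spin value for the next outgoing packet.
func (e *spinEndpoint) nextSpin() bool {
	return e.spin
}
```

Feeding each side's nextSpin() into the other side's onReceive() makes the value flip once per round trip, which is what the observer later measures.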

Test Setup

Mininet used for testing. Full testing framework available on Github.

network topology:

             static shaped      dynamic shaped
                 link 0             link 1
                   |                  |
                   V                  V
    +----------+       +----------+       +----------+
    | client-0 | <---> | switch-0 | <---> | switch-1 |
    +----------+       +----------+       +----------+
                                                ^
                                                |  <-- unshaped link
                                                |         link 2
                                                V
                                          +----------+
                                          | observer |
                                          | switch-2 |
                                          +----------+
                                                ^
                                                |  <-- unshaped link
                                                |         link 3
                                                V
    +----------+       +----------+       +----------+
    | server-0 | <---> | switch-4 | <---> | switch-3 |
    +----------+       +----------+       +----------+
                   ^                  ^
                   |                  |
                 link 5             link 4
             static shaped      dynamic shaped

Static and dynamic shaped links have a netem qdisc attached to them when the network is initialized.
The parameters of static shaped links are set when they are created, and are never changed.
The parameters of dynamic shaped links are set when they are created, and are modified later during the emulation too.
All links are bandwidth limited to 100 Mb/s.
tcpdump is running on the observer (switch-2). The observer records data flowing in both directions, but keeps separate state for spinbit analysis in each direction.
The server and the client each dump the per-packet acknowledgement RTT.
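
The per-direction analysis on the observer can be sketched roughly as follows. This is an illustration, not the code from the testing framework: the type spinObserver, its fields and the made-up trace in exampleSamples are all hypothetical. One such state object would be kept for each direction of the flow.

```go
package main

import "time"

const ms = time.Millisecond

// spinObserver holds the spin state for ONE direction of a flow.
type spinObserver struct {
	started  bool          // true once the first packet has been seen
	lastSpin bool          // spin value of the previous packet
	haveEdge bool          // true once at least one edge has been seen
	lastEdge time.Duration // capture time of the previous edge
}

// observe processes one packet (capture time and spin value). It returns an
// RTT sample and true when the packet is a spin edge that follows an earlier
// edge; the sample is the time between the two edges.
func (o *spinObserver) observe(ts time.Duration, spin bool) (time.Duration, bool) {
	if !o.started {
		o.started, o.lastSpin = true, spin
		return 0, false
	}
	if spin == o.lastSpin {
		return 0, false // no transition
	}
	o.lastSpin = spin
	prev, hadEdge := o.lastEdge, o.haveEdge
	o.lastEdge, o.haveEdge = ts, true
	if !hadEdge {
		return 0, false // first edge: nothing to measure against yet
	}
	return ts - prev, true
}

// exampleSamples feeds a short made-up trace through the observer.
func exampleSamples() []time.Duration {
	trace := []struct {
		ts   time.Duration
		spin bool
	}{
		{0 * ms, false}, {5 * ms, false}, {20 * ms, true},
		{25 * ms, true}, {40 * ms, false}, {60 * ms, true},
	}
	o := &spinObserver{}
	var samples []time.Duration
	for _, p := range trace {
		if rtt, ok := o.observe(p.ts, p.spin); ok {
			samples = append(samples, rtt)
		}
	}
	return samples
}
```

For the made-up trace, exampleSamples() yields two samples of 20 ms each: the first edge (at 20 ms) only arms the observer, and every later edge produces one sample.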

Experiments & Results

Bulk data upload to server

In this set of experiments, the client initiates a connection to the server and uploads a 100 MiB file. The static links are shaped to have a 10 ms delay. At t = 30 s netem parameters are added to the dynamic links. The results for different netem settings are listed below.

delay 10ms <time vs rtt diagram> <time vs rtt diagram (detail)>
- Clearly, the latency spinbit is an indicator of application RTT.
- There is a very strong correlation between the spinbit data and the client measurements (note that the client has more samples: the spinbit measurement gives 2 samples per RTT, while the client gets 1 sample per packet).
- In this example, the spinbit actually provides more data than would otherwise be available to the server. This is because the server sends almost exclusively ACK-only packets. Only the occasional MAX_STREAM_DATA frame sent by the server elicits an acknowledgement, and thus provides an RTT data point.
- Around the 80 s mark you can see some points where the spinbit-reported RTT drops to almost zero (not on plot). This is caused by light reordering. The observed spinbit sequence is something like 11111100100000 <Wireshark screenshot>. There are a number of straightforward solutions to this:
  1. A two bit spinbit.
  2. Consider the (last few bits of the) packet number in the spinbit analysis.
  3. Filter out almost-zero values.
delay 500us <time vs rtt diagram> <time vs rtt diagram (detail)>
- Small RTT variations can also be observed.
delay 10ms 3ms 25 <time vs rtt diagram> <CC and FC info>
- The spinbit can deal with jitter.
- As the congestion window drops, the server sends far fewer MAX_STREAM_DATA frames, causing its RTT estimate frequency to drop even further. The spinbit is not affected.
loss random 0.01 <time vs rtt diagram>
- The spinbit can deal with light loss.
loss random 2 <time vs rtt diagram>
- The spinbit can deal with heavy loss.
delay 1ms reorder 1 25 <time vs rtt diagram>
- The spinbit can deal with light reordering.
delay 1ms reorder 10 25 <time vs rtt diagram>
- The spinbit does not really like heavier reordering.
- But the methods outlined above will probably fix that.
delay 1ms reorder 50 25 <time vs rtt diagram>
- Same.

The occasional small packet

In this set of experiments, the client sends a small Hello packet to the server every 100 ms. The static links are shaped to have a 10 ms delay. At t = 30 s netem parameters are added to the dynamic links. Because the results are very similar, we will focus on delay 10ms <time vs rtt diagram>.

We see from the figure that the spinbit data is pretty useless. This is easily understood when we look at the packets flowing. Consider the following example, where every link has a delay of 1 ms, and Hello packets are sent every 10 ms. The following packets are then generated / observed / received:

 10 ms: H 1                 11 ms: H 1                 12 ms: H 1
  0 ms: H 0                  1 ms: H 0                  2 ms: H 0
 +----------+               +----------+               +----------+
 |          |-------------->|          |-------------->|          |
 |  client  |               | observer |               |  server  |
 |          |<--------------|          |<--------------|          |
 +----------+               +----------+               +----------+
  4 ms: A 0                  3 ms: A 0                  2 ms: A 0
 14 ms: A 1                 13 ms: A 1                 12 ms: A 1

Legend:
  H: Hello
  A: Ack
  0: Spin 0
  1: Spin 1

So, the observer sees one spinbit transition per Hello packet transmitted. As Brian puts it: "The spinbit measures the dominant frequency of the protocol".
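
The timeline above can be reproduced with a small simulation. It is a sketch under simplifying assumptions that are mine, not measured behaviour: every packet is acknowledged immediately, the ACK arrives exactly one RTT after the packet was sent and reflects its spin value, and the client flips its spin value when the ACK arrives. All names are hypothetical.

```go
package main

import "time"

const ms = time.Millisecond

// sentPacket records when the ACK for a packet arrives and which spin value
// the packet carried.
type sentPacket struct {
	ackAt time.Duration
	spin  bool
}

// edgeIntervals simulates a client that sends n packets, one every period,
// and returns the time between consecutive spin edges in the client's
// outgoing packets. An observer on the path sees the same intervals, shifted
// by a constant one-way delay.
func edgeIntervals(rtt, period time.Duration, n int) []time.Duration {
	var inFlight []sentPacket
	var intervals []time.Duration
	var lastEdge time.Duration
	spin, lastSpin, haveEdge := false, false, false
	for k := 0; k < n; k++ {
		now := time.Duration(k) * period
		// Process every ACK that has arrived by now: the client inverts
		// the reflected spin value.
		for len(inFlight) > 0 && inFlight[0].ackAt <= now {
			spin = !inFlight[0].spin
			inFlight = inFlight[1:]
		}
		// An edge only becomes visible when a packet is actually sent.
		if k > 0 && spin != lastSpin {
			if haveEdge {
				intervals = append(intervals, now-lastEdge)
			}
			lastEdge, haveEdge = now, true
		}
		lastSpin = spin
		inFlight = append(inFlight, sentPacket{ackAt: now + rtt, spin: spin})
	}
	return intervals
}

// helloIntervals uses the numbers of the example: RTT 4 ms, Hello every 10 ms.
func helloIntervals() []time.Duration {
	return edgeIntervals(4*ms, 10*ms, 5)
}

// bulkIntervals models a busy sender: RTT 40 ms, one packet every 1 ms.
func bulkIntervals() []time.Duration {
	return edgeIntervals(40*ms, 1*ms, 130)
}
```

helloIntervals() returns three intervals of 10 ms, the Hello period, although the RTT is 4 ms. bulkIntervals() returns two intervals of 40 ms, the RTT, because a packet is always available to carry the edge as soon as the client flips.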

The occasional small packet with echo

Same as above, but now the server also echoes back the Hello packets. We again look at delay 10ms <time vs rtt diagram>. Let us again consider the packets flowing through the network.

 20 ms: HA 0                 21 ms: HA 0               22 ms: HA 0
 10 ms: HA 1                 11 ms: HA 1               12 ms: HA 1
  4 ms: A  1                  5 ms: A  1                6 ms: A  1
  0 ms: H  0                  1 ms: H  0                2 ms: H  0
 +----------+               +----------+               +----------+
 |          |-------------->|          |-------------->|          |
 |  client  |               | observer |               |  server  |
 |          |<--------------|          |<--------------|          |
 +----------+               +----------+               +----------+
  4 ms: EA 0                 3 ms: EA 0                 2 ms: EA 0
 14 ms: EA 1                13 ms: EA 1                12 ms: EA 1

Legend:
  H: Hello
  E: Echo
  A: Ack
  0: Spin 0
  1: Spin 1

We can see that here the correct RTT can be measured only once, after which we again measure the Hello frequency rather than the RTT. Two more things should be noted here:

- Here you can measure the up- and downstream delay separately and then add them up to get the RTT, although this has some problems too.
- The jitter in the client RTT data is caused by the long ACK delay for ACK-only packets. This can easily be fixed by ignoring the timing info from ACK-only packets.

Single vs dual bit spin signal

To verify how well a dual bit spin signal performs, I ran some tests. I'll only show delay 5ms reorder 50 25 here, as heavy reordering was the only thing the one bit spinbit had problems with. <time vs rtt diagram> <comparison one vs two bits>.

It looks like this two bit signal is completely immune to reordering. In fact, in my current implementation (where I ignore all spin signals that are not (current_spin + 1) % 4), a packet would need to be reordered by 3 RTTs before it causes an incorrect latency measurement.
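
The acceptance rule quoted above, applied on the observer, could look roughly like this. It is a sketch, not my actual analysis code: twoBitObserver and exampleTwoBitSamples are hypothetical names, and the trace is made up to include two late packets.

```go
package main

import "time"

const ms = time.Millisecond

// twoBitObserver tracks a two bit spin signal (values 0..3) in one direction.
type twoBitObserver struct {
	started  bool
	current  uint8         // last accepted spin value
	haveEdge bool          // true once at least one edge was accepted
	lastEdge time.Duration // capture time of the last accepted edge
}

// observe accepts a packet as an edge only if its spin value is the
// successor of the current value, (current + 1) % 4. Packets carrying the
// current value or an older value (reordered packets) are ignored.
func (o *twoBitObserver) observe(ts time.Duration, spin uint8) (time.Duration, bool) {
	spin &= 0x3
	if !o.started {
		o.started, o.current = true, spin
		return 0, false
	}
	if spin != (o.current+1)%4 {
		return 0, false
	}
	o.current = spin
	prev, hadEdge := o.lastEdge, o.haveEdge
	o.lastEdge, o.haveEdge = ts, true
	if !hadEdge {
		return 0, false // first edge: nothing to measure against yet
	}
	return ts - prev, true
}

// exampleTwoBitSamples runs a made-up trace with two reordered packets
// (value 0 at 21 ms and value 1 at 41 ms) through the observer.
func exampleTwoBitSamples() []time.Duration {
	trace := []struct {
		ts   time.Duration
		spin uint8
	}{
		{0 * ms, 0}, {5 * ms, 0}, {20 * ms, 1}, {21 * ms, 0},
		{22 * ms, 1}, {40 * ms, 2}, {41 * ms, 1}, {60 * ms, 3},
	}
	o := &twoBitObserver{}
	var samples []time.Duration
	for _, p := range trace {
		if rtt, ok := o.observe(p.ts, p.spin); ok {
			samples = append(samples, rtt)
		}
	}
	return samples
}
```

exampleTwoBitSamples() returns two samples of 20 ms. With a one bit signal the late packets at 21 ms and 41 ms would each have produced spurious edges and near-zero samples.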

One bit spinbit with filtering

In order to verify the value added by adding a second bit to the spin signal, I compared it to a number of other approaches:

- Rejecting RTT samples if they are below a certain static threshold (1 ms).
- Rejecting RTT samples if they are below a certain dynamic threshold (10 % of the minimum of the last 10 samples).
- Ignoring all packets with Pns lower than the highest Pn already received (this is similar to what the endpoints do in PR 609).
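
The three approaches can be sketched as small filters. These are illustrations with hypothetical names, not the analysis code. Two details are assumptions on my part: the dynamic filter keeps only accepted samples in its window, and the Pn filter works on full (already expanded) packet numbers.

```go
package main

import "time"

// staticThreshold is the 1 ms static threshold mentioned above.
const staticThreshold = time.Millisecond

// acceptStatic rejects samples below the static threshold.
func acceptStatic(sample time.Duration) bool {
	return sample >= staticThreshold
}

// dynamicFilter rejects samples below 10 % of the minimum of the last
// 10 accepted samples.
type dynamicFilter struct {
	recent []time.Duration // up to 10 most recent accepted samples
}

func (f *dynamicFilter) accept(sample time.Duration) bool {
	if len(f.recent) > 0 {
		lowest := f.recent[0]
		for _, s := range f.recent[1:] {
			if s < lowest {
				lowest = s
			}
		}
		if sample < lowest/10 {
			return false
		}
	}
	f.recent = append(f.recent, sample)
	if len(f.recent) > 10 {
		f.recent = f.recent[1:]
	}
	return true
}

// pnFilter drops packets that are not newer than the newest packet seen,
// so that reordered packets never reach the spin analysis.
type pnFilter struct {
	seen    bool
	highest uint64
}

func (f *pnFilter) inOrder(pn uint64) bool {
	if f.seen && pn <= f.highest {
		return false
	}
	f.seen, f.highest = true, pn
	return true
}
```

The first two filters act on RTT samples after the fact, while pnFilter acts on packets before edge detection. That is why the third approach behaves like the two bit signal: it removes the reordered packets instead of guessing which samples they spoiled.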

Some results:

delay 1ms reorder 10 25 <time vs rtt diagram> <time vs rtt rejection diagram>
- Using a static threshold removes some bad samples, but not all.
- Using a dynamic threshold performs better, but is still not ideal.
- Considering the Pns yields the same result as a two bit spinbit.
delay 5ms reorder 50 25 <time vs rtt diagram> <time vs rtt rejection diagram>
- Same

The reason that considering the Pns and a two bit spin signal give the same results is that they effectively do the same thing: they detect reordering. The main performance difference between the two is how much reordering they tolerate. For the two bit signal, reordering up to 3 RTTs can be handled. For the Pns method, reordering up to the point where a Pn wrap-around occurs is tolerable. However, when reordering is this heavy, the receiving endpoint cannot properly expand the packet numbers anymore, so more bits of the Pn should be sent anyway.

In conclusion: adding a second bit to the spin signal does make the spin signal significantly more resilient against reordering. However, similar results can also be obtained by using the information already in the QUIC header. Therefore, the second spin bit adds little or no information for a passive observer.

The Valid Bit

Design

From the above, it is clear that although the spinbit can provide a passive observer with valuable information about the application RTT, the information derived from the spinbit is not always related to the RTT. Furthermore, it is not possible to derive from the spinbit whether it is providing a measurement of the RTT or of some other frequency. Therefore, it is useful for a passive observer to have an indication of when the spinbit carries valid information about the RTT.

One proposal is to add a blocked bit to the QUIC header (PR 279). While such a bit might give some information as to when the spinbit signal is valid, and when it is not, this would only be an indirect signal. To ensure that a passive observer can know with complete certainty that the spinbit is carrying valid information about the RTT, I propose the valid bit.

The valid bit works as follows:

- It is initialized by both endpoints to 1.
- Upon receiving an edge on the spin signal, the endpoint records the current timestamp.
- When generating an edge on the outgoing spin signal, the endpoint checks if this edge is generated within a certain time window after receiving the edge on the incoming spin signal. If this is the case, the valid bit is set to 1. Otherwise it is set to 0.

Currently I have been using 1 ms for this time window, but other (dynamic) values might be better.
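
The rules above can be sketched as follows. This is not the Minq code and the names are hypothetical. Two details are assumptions: the window check is inclusive, and an outgoing edge generated before any incoming edge was recorded leaves the bit at its initial value of 1.

```go
package main

import "time"

// defaultWindow is the 1 ms window mentioned above.
const defaultWindow = time.Millisecond

// validBitEndpoint keeps the state an endpoint needs for the valid bit.
type validBitEndpoint struct {
	window     time.Duration // maximum delay between incoming and outgoing edge
	haveRxEdge bool          // true once an incoming edge has been recorded
	lastRxEdge time.Duration // time of the most recent incoming edge
	valid      bool          // value of the valid bit in outgoing headers
}

// newValidBitEndpoint starts with the valid bit set to 1.
func newValidBitEndpoint(window time.Duration) *validBitEndpoint {
	return &validBitEndpoint{window: window, valid: true}
}

// onIncomingEdge records when an edge was received on the spin signal.
func (e *validBitEndpoint) onIncomingEdge(now time.Duration) {
	e.haveRxEdge, e.lastRxEdge = true, now
}

// onOutgoingEdge is called when the endpoint sends the packet that carries
// an edge. It updates the valid bit and returns the value to send.
func (e *validBitEndpoint) onOutgoingEdge(now time.Duration) bool {
	if e.haveRxEdge {
		e.valid = now-e.lastRxEdge <= e.window
	}
	return e.valid
}
```

An endpoint that has data queued sends its edge almost immediately after receiving one, so the bit stays at 1. An endpoint that only sends a heartbeat every 100 ms generates its edge long after the window has passed, so the bit drops to 0.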

The valid bit has a number of advantages:

- It is a direct signal about the validity of the spinbit.
- Instead of measuring the time between two spinbit transitions in the same direction of the flow, an observer can monitor how long it takes before a transition in one direction is reflected in the other direction. This way, the up- and downstream delay can be measured independently. The valid bit indicates when such a half-RTT measurement is valid.
- Thus, one spin bit edge with valid = 1 generated by an endpoint is enough for an observer to sample the up- or downstream delay.
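
A rough sketch of how an observer could take such half-RTT samples. The names are hypothetical and the trace in exampleHalfRTT is made up. The idea: an edge seen in one direction was generated by the endpoint it comes from, in response to the last edge seen in the other direction, so the time between the two is the delay between the observer and that endpoint (there and back).

```go
package main

import "time"

const ms = time.Millisecond

// direction identifies which way a packet travels past the observer.
type direction int

const (
	toServer direction = iota
	toClient
)

// halfRTTObserver remembers the most recent spin edge in each direction.
type halfRTTObserver struct {
	haveEdge [2]bool
	lastEdge [2]time.Duration
}

// onEdge is called for every spin edge, with the valid bit carried by the
// packet. It returns a half-RTT sample and true when the edge is marked
// valid and an edge in the opposite direction has been seen before.
func (o *halfRTTObserver) onEdge(d direction, ts time.Duration, valid bool) (time.Duration, bool) {
	other := 1 - d
	o.haveEdge[d], o.lastEdge[d] = true, ts
	if !valid || !o.haveEdge[other] {
		return 0, false
	}
	return ts - o.lastEdge[other], true
}

// exampleHalfRTT runs a made-up trace in which the client generates one
// late edge (valid = 0) at 21 ms.
func exampleHalfRTT() []time.Duration {
	edges := []struct {
		d     direction
		ts    time.Duration
		valid bool
	}{
		{toServer, 1 * ms, true}, {toClient, 3 * ms, true},
		{toServer, 5 * ms, true}, {toClient, 7 * ms, true},
		{toServer, 21 * ms, false}, {toClient, 23 * ms, true},
	}
	o := &halfRTTObserver{}
	var samples []time.Duration
	for _, e := range edges {
		if s, ok := o.onEdge(e.d, e.ts, e.valid); ok {
			samples = append(samples, s)
		}
	}
	return samples
}
```

exampleHalfRTT() returns four samples of 2 ms. The invalid client edge at 21 ms is skipped, but the server edge that reflects it at 23 ms still yields a valid sample for the server side.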

Test Results

For testing the valid bit, I introduce a new type of traffic pattern: bursty. It consists of calm periods of 15 s where heartbeats are sent to the server every 100 ms. After this period the client transfers 1 MiB of bulk data, after which the next calm period starts.

bursty <time vs rtt diagram>
- The pink regions indicate where the valid bit was set to 0.
- The valid bit is successful at filtering out invalid RTT measurements.
- The valid bit has an instantaneous response.
delay 500us <time vs rtt diagram>
- Under normal circumstances the valid bit triggers occasionally.
- In this example 35 times for a total invalid time of 1.72 s
- This is a good thing rather than a bad thing
- As production grade QUIC implementations will probably be faster, this will happen less often
delay 5ms reorder 50 25 <time vs rtt diagram>
- Under heavy reordering, the valid bit triggers more often.
- In this example 188 times for a total invalid time of 9.89 s
- This is again a good thing rather than a bad thing: the trace without rejected samples is cleaner.