Musings: The Case for Validating Inputs in Software-Defined WANs

2024-12-13

This was a really interesting paper from HotNets '24 because it flips (as I understand it) the traditional perspective on its head by validating SDN inputs rather than outputs. The authors challenge the assumption typically embedded in the process of validating outputs: that the inputs are correct. This is motivated by reports finding that "over one third" of major outages (I presume at Google) were caused by incorrect inputs to the SDN controller.

A tangent to the validation of outputs

It reminded me of a presentation we had in my networking class (Security and Performance Challenges in Networked Systems) from Steffen Smolka titled "P4-Based Automated Reasoning (P4-BAR) for the (Networking) Masses!" The premise of his work is to constrain the configuration space of switch ASICs from a variety of inputs so as to reduce the complexity of the SDN controller (and thus make it easier to reason about). It similarly flips the paradigm: instead of programs being written to fit the ASICs they run on, the ASICs "abide" by the programs themselves. I think his graphic is very intuitive. From the controller's perspective, the ASIC becomes very predictable despite its inherent complexity.

P4-BAR uses static tests to validate whether P4 programs run as intended on the ASICs. It's a case of validating outputs. I take this tangent to illustrate the novelty of Krentsel et al.'s paper, which shifts the focus to validating the inputs of the SDN controller instead.

The premise

The paper addresses how it is even possible for the SDN controller to receive "incorrect" inputs given that it "reads network state directly from routers". As always, it comes down to bugs: bugs in network operator code and even in the underlying fabric itself. The authors observe that network operators rely on ad-hoc checks that are usually static and don't reflect the dynamism of the network, so they are difficult to manage and largely insufficient.

A key observation is that

Perhaps more fundamentally, our analysis reveals that inputs are often incorrect not because they cannot possibly occur or are unlikely to occur, but because they are not currently occurring; i.e., they do not reflect the current state of the network.
...
We argue that input validation must be based on dynamic invariants that ensure an SDN controller’s inputs reflect current network state.

once again indicating a need for dynamic as opposed to static tests. Because the current network state, as observed by the SDN controller through different signals, might itself be incorrect, the authors rely on the symmetry inherent in networked systems. Most obviously, they use the conservation of flow (i.e., bytes_in $\approx$ bytes_out).

Hodor

(nice name)

Design

Collects raw signals from network devices (only routers?)
1. I think the authors' experience at Google shows here. They simply acknowledge that network operators already know what signals to collect because they designed the network (though they acknowledge that sometimes this design has bugs)
"Hardens" signals (representing either the state of the network or the intent of network operators) from the underlying network
1. "Hardened" means that signals are validated against the symmetry inherent in the network and that discrepancies are resolved
2. Signals are divided into low-level (interface counters) and high-level (control levels) signals. Low-level signals are the baseline with which high-level signals must align
Dynamically checks that the inputs to the SDN controller are consistent with current network state
1. On fail, revert to the last input state
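
To make the "check, and on failure revert to the last input state" step concrete for myself, here is a minimal sketch. It is not Hodor's implementation: the `InputGate` class, the snapshot layout, the signal names `a_tx_bytes` / `b_rx_bytes`, and the 2% tolerance are all assumptions I made up for illustration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical snapshot format: signal name -> value.
Snapshot = Dict[str, float]


class InputGate:
    """Admit a snapshot only if every invariant holds; otherwise reuse the last good one."""

    def __init__(self, invariants: List[Callable[[Snapshot], bool]]):
        self.invariants = invariants
        self.last_good: Optional[Snapshot] = None

    def admit(self, snapshot: Snapshot) -> Optional[Snapshot]:
        if all(check(snapshot) for check in self.invariants):
            self.last_good = dict(snapshot)  # remember a copy of the validated input
            return snapshot
        # Validation failed: fall back to the last validated input.
        # This is None if nothing has ever passed validation.
        return self.last_good


def link_symmetric(s: Snapshot, rel_tol: float = 0.02) -> bool:
    """Bytes sent at end A of a link should roughly match bytes received at end B."""
    sent, received = s["a_tx_bytes"], s["b_rx_bytes"]
    return abs(sent - received) <= rel_tol * max(sent, received, 1.0)


gate = InputGate([link_symmetric])
good = {"a_tx_bytes": 1000.0, "b_rx_bytes": 995.0}
bad = {"a_tx_bytes": 1000.0, "b_rx_bytes": 0.0}

first = gate.admit(good)   # passes the invariant, handed to the controller
second = gate.admit(bad)   # fails the invariant, controller keeps the previous input
```

One thing the sketch makes obvious is that the gate needs a policy for the very first snapshot: if it fails, there is no earlier state to revert to.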

Hodor's redundancy checks

$R_1$: symmetry across ends of the link (bytes_in $\approx$ bytes_out)
$R_2$: flow conservation across all router ports
1. expanding the constraints here allows the incorrect counter to be determined, assuming an isolated incorrect instance
$R_3$: alternative signals (link status can be inferred from whether bytes_in == 0 && bytes_out == 0)
$R_4$: manufactured signals (i.e., active neighbor probes)
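
A rough sketch of how I picture $R_1$ and $R_3$ as code. The function names and the relative tolerance are my own assumptions, not anything from the paper, and the $R_3$ check is deliberately naive: it would flag a link that is up but idle, which a real system would have to handle.

```python
def r1_link_symmetry(tx_bytes: float, rx_bytes: float, rel_tol: float = 0.02) -> bool:
    """R1: bytes sent at one end of a link should roughly equal bytes received at the other."""
    return abs(tx_bytes - rx_bytes) <= rel_tol * max(tx_bytes, rx_bytes, 1)


def r3_inferred_link_up(bytes_in: float, bytes_out: float) -> bool:
    """R3: infer link status from counters; no traffic in either direction suggests 'down'."""
    return not (bytes_in == 0 and bytes_out == 0)


def r3_status_consistent(reported_up: bool, bytes_in: float, bytes_out: float) -> bool:
    """Compare the reported link status with the status inferred from the counters."""
    return reported_up == r3_inferred_link_up(bytes_in, bytes_out)


symmetric = r1_link_symmetry(1000, 990)         # within the 2% tolerance
asymmetric = r1_link_symmetry(1000, 0)          # one end sees nothing
stale_status = r3_status_consistent(True, 0, 0)  # reported up, but no traffic at all
```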

For $R_4$, I'm curious whether this can affect the accuracy of the network signals, especially since they "believe that the surface area for bugs that Hodor introduces is relatively small". It certainly seems unlikely for a breaking bug to be introduced in Hodor, but the claim that "Hodor does not process or aggregate signals but only reads and compares them" should be accompanied by an asterisk if they are sending active probes. If the probe sending rate were bugged, however improbable that is, it could significantly affect the fidelity of the signals.

The flow conservation constraint is cool and succinct.

$$\forall v \in V, \quad \sum_{e \in E_{in}(v)} \mathrm{counter}(e) = \sum_{e \in E_{out}(v)} \mathrm{counter}(e) + \mathrm{dropped}(v)$$

where $V$ is the set of vertices (routers) and $E$ is the set of physical links connecting them. This allows you to solve for up to $|V| - 1$ unknowns.
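
As a worked example of the constraint at a single router, the sketch below solves for one unknown counter given all the others. The function name, the link labels, and the numbers are invented for illustration, and it assumes exactly one counter at the router is missing or untrusted.

```python
from typing import Dict, Optional, Tuple

Counters = Dict[str, Optional[float]]  # link name -> byte count, None if unknown


def recover_missing_counter(
    in_counters: Counters, out_counters: Counters, dropped: float
) -> Tuple[str, float]:
    """Solve sum(in) = sum(out) + dropped for the single counter set to None."""
    missing_in = [k for k, v in in_counters.items() if v is None]
    missing_out = [k for k, v in out_counters.items() if v is None]
    if len(missing_in) + len(missing_out) != 1:
        raise ValueError("exactly one counter must be unknown")

    known_in = sum(v for v in in_counters.values() if v is not None)
    known_out = sum(v for v in out_counters.values() if v is not None)

    if missing_in:
        # unknown_in = sum(out) + dropped - known_in
        return missing_in[0], known_out + dropped - known_in
    # unknown_out = sum(in) - dropped - known_out
    return missing_out[0], known_in - dropped - known_out


# Router with two ingress links and two egress links; the counter on e3 is unknown.
name, value = recover_missing_counter(
    in_counters={"e1": 700, "e2": 300},
    out_counters={"e3": None, "e4": 400},
    dropped=50,
)
# 700 + 300 = value + 400 + 50, so value is 550 on link "e3"
```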

Symbolic Regression

My reading of this paper coincided with the poster presentations in networking class. One of the graduate students had been working on a new project to automate constraint mining from network data. The search space for symbolic regression (SR) is massive, and automating the task (especially when results need to be sound) is difficult because of the computational complexity. This paper seemed tangentially connected to his work. The authors mention symbolic regression but dismiss such approaches because they "may capture spurious relationships... that are not fundamental to the system's operation", essentially arguing that the utility of the captured results relies too heavily on the fidelity of the underlying raw data. I feel like this particular concern might be overstated, but I do recognize the difficulty of the problem. This is just to say that applying symbolic regression to hardened network signals (as in Hodor) could be quite interesting to investigate, and that I'm actively learning about the area.

The Case for Validating Inputs in Software-Defined WANs

Krentsel, Alexander and Iyer, Rishabh and Keslassy, Isaac and Ratnasamy, Sylvia and Shaikh, Anees and Shakir, Rob

Published: 2024

https://doi.org/10.1145/3696348.3696874