11 The Control Network

11.1 Overview

At every CloudLab cluster, the majority of the allocatable nodes (primarily general-purpose servers) are connected to a persistent, shared IPv4 subnet known as the control network. Each cluster has its own control network namespace, typically a subset of the affiliated university’s larger IPv4 namespace.

The control network at each cluster is used for two primary purposes. First, since it is Internet accessible, it provides the path through which users interact with their experiment nodes, including getting data in and out of CloudLab. Second, it is used by CloudLab infrastructure for experiment control. This includes initial imaging and configuration of nodes as well as ongoing monitoring of resource usage at the cluster level. It is also used for infrastructure coordination and control between the individual clusters and the master CloudLab portal at Utah.

The control network is distinct from the network topology connecting nodes as defined in an experiment’s profile. That topology, known as the experiment network, is realized at experiment instantiation time on a completely separate fabric of physical switches, using VLANs on those switches to implement an experiment’s links and LANs. Each experiment network is isolated from those of other active experiments and exists only for the lifetime of the experiment.

Because the control and experiment networks are fundamentally different, the control network interface on nodes should not be considered “just another interface” in your experiment that can be used for inter-node communication in addition to those in the profile topology. Additionally, CloudLab is designed primarily to support “closed world” experimentation in which all network activity takes place between nodes within the experiment on the profile-defined experiment network. It is not intended to support “Internet-wide” experimentation, such as scanning external Internet hosts, measuring bandwidth between CloudLab and other sites, hosting long-term Internet-visible services, or making heavy use of external storage services.

We do allow Internet-wide experiments on occasion, but this requires permission from, and coordination with, CloudLab staff, since such traffic tends to trigger monitors, resulting in abuse reports from both CloudLab host institutions and external sites.

As such, heavy use of the control network by nodes in an experiment for any reason is considered a mistake and will be flagged as a control network violation.

The following sections elaborate on how to ensure your experiments use the correct network (including a list of common mistakes and how to empirically verify correct interface use), how to respond to a Control Network Violation message, why it is so important to avoid the control network, and finally, after all the negativity, what activity is allowed on the control network.

11.2 Ensuring your experiment nodes use the correct interfaces

From the perspective of an experiment node, the control network and any experiment networks are accessed through the same OS-provided mechanisms. This requires that the experimenter know how to distinguish the two and ensure that they use the correct interfaces.

On the surface, the distinction is simple: active experiment interfaces will have 10.x.x.x private IPv4 addresses assigned to them. The control network interface will have an Internet-routable IP address in the IANA-assigned IP space of the specific CloudLab cluster. All applications running on experiment nodes that transmit packets should use the 10.x.x.x addresses (or the aliases in /etc/hosts) when addressing other nodes in the experiment. All applications listening for connections from other nodes in the experiment should bind to those names/addresses as well.
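As an illustration of the addressing rule above, the following is a minimal Python sketch, not part of CloudLab itself. It assumes only what the text states, namely that experiment interfaces carry 10.x.x.x addresses. The helper names `is_experiment_address` and `listen_on_experiment_address` are hypothetical.

```python
import ipaddress
import socket

# Assumption taken from the text: experiment interfaces use 10.x.x.x addresses.
EXPERIMENT_NET = ipaddress.ip_network("10.0.0.0/8")


def is_experiment_address(addr: str) -> bool:
    """Return True if addr is a literal IPv4 address inside 10.0.0.0/8."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Not an IP literal (e.g. a hostname); resolve it first if needed.
        return False
    return ip.version == 4 and ip in EXPERIMENT_NET


def listen_on_experiment_address(addr: str, port: int) -> socket.socket:
    """Create a TCP listener bound to one experiment address only.

    Refuses the wildcard address 0.0.0.0, which would also accept
    connections arriving on the control network interface.
    """
    if not is_experiment_address(addr):
        raise ValueError(f"{addr} is not a 10.x.x.x experiment address")
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((addr, port))
    sock.listen()
    return sock
```

A server would call something like `listen_on_experiment_address("10.10.1.1", 8080)`, where the address and port are placeholders for whatever your profile assigned to the node.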

This sounds simple enough, but unfortunately there are a number of scenarios in which identifying the control interface is not so easy or it is not obvious that you are using it. The next section details some of the common mistakes made and how to avoid them.

11.2.1 Common causes of control network misuse

11.2.2 Empirically verifying interface use

It is always a good idea to ensure that your experiment is using the correct network interfaces after initial experiment setup has been done and once the involved applications are running. To do this you should start with the smallest instance of your experiment that is reasonable, just two nodes if possible, and then scale up to your desired size once you are confident that things are behaving correctly.

“Behaving correctly” in this context means “not using the control network excessively”, so we are only interested in getting a sense of the volume of traffic on the control network interface and whether there are any services listening on the interface that need not be. Following is a short list of techniques you can use to determine that.
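One way to get a sense of traffic volume on an interface is sketched below in Python. This is an illustration rather than a CloudLab-provided tool. It assumes a Linux node that exposes per-interface counters in `/proc/net/dev`, and it assumes you have already identified the name of the control interface (the one carrying the Internet-routable address). The function names are hypothetical.

```python
import time


def parse_proc_net_dev(text):
    """Map interface name -> (rx_bytes, rx_packets, tx_bytes, tx_packets)."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 10:
            continue
        # Receive columns start at index 0, transmit columns at index 8.
        counters[name.strip()] = (int(fields[0]), int(fields[1]),
                                  int(fields[8]), int(fields[9]))
    return counters


def rates(before, after, seconds):
    """Return (bits_per_sec, packets_per_sec), summed over both directions."""
    rx_b, rx_p, tx_b, tx_p = (a - b for a, b in zip(after, before))
    return ((rx_b + tx_b) * 8 / seconds, (rx_p + tx_p) / seconds)


def sample_interface(ifname, interval=10.0, path="/proc/net/dev"):
    """Read the counters twice, interval seconds apart, and return the rates."""
    def read():
        with open(path) as f:
            return parse_proc_net_dev(f.read())[ifname]

    before = read()
    time.sleep(interval)
    return rates(before, read(), interval)
```

Running `sample_interface` against the control interface while your application is under load gives a rough bits-per-second and packets-per-second figure. If those numbers track your application's data rate, the application is probably using the wrong network.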

If you want to figure out if any of your experiment services are listening for connections on the control network, there are a couple of ways to test:
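As one possible check, the Python sketch below lists IPv4 TCP listeners and flags those reachable from outside the experiment network. It is an illustration only. It assumes a Linux node, reads the text format of `/proc/net/tcp`, and assumes a little-endian host such as x86-64 when decoding addresses. IPv6 listeners in `/proc/net/tcp6` are not covered. The function names are hypothetical.

```python
import ipaddress
import socket
import struct

LISTEN_STATE = "0A"  # TCP_LISTEN as printed in /proc/net/tcp
EXPERIMENT_NET = ipaddress.ip_network("10.0.0.0/8")


def parse_listeners(proc_net_tcp_text):
    """Return a list of (ip, port) for sockets in the LISTEN state."""
    listeners = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4 or fields[3] != LISTEN_STATE:
            continue
        hex_ip, hex_port = fields[1].split(":")
        # Assumption: little-endian host, so the hex address is byte-reversed.
        ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
        listeners.append((ip, int(hex_port, 16)))
    return listeners


def reachable_from_control_network(ip):
    """True for the wildcard 0.0.0.0 and any non-loopback, non-10.x address."""
    addr = ipaddress.ip_address(ip)
    if addr.is_loopback:
        return False
    return addr not in EXPERIMENT_NET


def exposed_listeners(path="/proc/net/tcp"):
    """Listeners on this node that accept control network connections."""
    with open(path) as f:
        return [(ip, port) for ip, port in parse_listeners(f.read())
                if reachable_from_control_network(ip)]
```

Expect some entries that are legitimate, such as the SSH daemon you use to log in. The entries to look at are your own experiment services bound to 0.0.0.0 or to the control address.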

11.3 What to do if you get a Control Network Violation email

There are two forms of “control network violations” that we send email notifications about. One is the “very unusual amount of traffic over the shared control network” message, which is the result of our auto-detection of high traffic volumes (cluster-specific, but typically exceeding 1Gb/sec or 100,000 packets/sec) over an extended period of time (averaged over 10 minutes). The other is the “we have determined that one or more nodes in the experiment has been compromised” message, which is usually the result of CloudLab staff receiving a message from the host site or an external administrator that a CloudLab node is engaging in undesirable behavior.
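To make the thresholds concrete, here is a minimal Python sketch of a rolling 10-minute average compared against the typical limits quoted above. It is not CloudLab's detector, whose details are cluster-specific and not described here. It assumes "1Gb/sec" means 10^9 bits per second and that you feed it cumulative byte and packet counters for the control interface. The class and function names are hypothetical.

```python
from collections import deque


class RollingRate:
    """Average bit and packet rates over a sliding time window."""

    def __init__(self, window_seconds=600.0):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, total_bytes, total_packets)

    def add(self, timestamp, total_bytes, total_packets):
        """Record cumulative counters observed at the given time."""
        self.samples.append((timestamp, total_bytes, total_packets))
        # Keep a single sample at or before the start of the window.
        while (len(self.samples) >= 2
               and self.samples[1][0] <= timestamp - self.window):
            self.samples.popleft()

    def average(self):
        """Return (bits_per_sec, packets_per_sec) across the kept samples."""
        if len(self.samples) < 2:
            return (0.0, 0.0)
        t0, b0, p0 = self.samples[0]
        t1, b1, p1 = self.samples[-1]
        dt = t1 - t0
        if dt <= 0:
            return (0.0, 0.0)
        return ((b1 - b0) * 8 / dt, (p1 - p0) / dt)


def exceeds_typical_limits(bits_per_sec, packets_per_sec,
                           max_bps=1e9, max_pps=100_000):
    """Compare against the typical limits; real limits vary by cluster."""
    return bits_per_sec > max_bps or packets_per_sec > max_pps
```

Staying under these figures is not a guarantee that no notification will be sent, since the actual thresholds differ between clusters.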

In addition to sending the email, we may also:

In the high traffic volume case, we generally first send the email and wait for an hour or two for a response. If we get no response, we will freeze your account and quarantine your experiment. If we still get no response, we will terminate the experiment and leave your account frozen. Upon receiving a reply and reaching a resolution, we will unfreeze your account and release the experiment from quarantine, allowing the nodes to boot from disk again.

In the suspected compromise case, we will immediately freeze your account and quarantine the experiment. Again, if there is no response, we will terminate the experiment and leave your account frozen. Upon receiving a reply and reaching a resolution, we will unfreeze your account so that you can log in to the nodes while they are in the MFS and extract any valuable data from the disks. We will then terminate the experiment. As a rule, we never resume experiments with nodes that have been compromised.

See Using the Recovery MFS for more information on accessing a quarantined node’s disk from the MFS.

Your responsibility as a user is first to make sure you whitelist email from CloudLab! We will only use email to reach out to you. It is not good to discover that your experiment has been terminated because you never saw email from us.

If you receive one of these emails from us, do not panic! In most instances, it is not a serious problem for us, just a situation that needs to be fixed. Even with compromised nodes, we understand that these things happen. If you make an honest effort to fix the problem and are diligent in future experiments, all is well. We want you to continue using CloudLab for your research!

Don’t panic, but do be responsible and respond quickly to the email so that the problem can be resolved quickly. Your initial response should include an explanation of what you are attempting to do, why that may have caused the problem observed, and what you plan to do to fix it.

We do make exceptions when the “excessive” traffic is necessary, or when a “compromised” node really isn’t.

11.4 Why is it so important to avoid the control network?

This document has been hammering home the point “Using the control network is bad!” while largely citing reasons why it causes problems for the CloudLab infrastructure. For example, high volumes of traffic (data or packet rate) can interfere with infrastructure services, in particular those that are UDP-based such as DHCP, TFTP, and our multicast image distribution mechanism. This traffic can also affect other experiments, interfering with their interactions with the infrastructure or introducing overhead as they receive and reject rogue traffic. This is exacerbated by the control network fabric not being as well provisioned as the experiment fabric: its switches are typically 1Gb or 10Gb, with less inter-switch bandwidth and a less balanced topology. The control network’s relative transparency to applications, coupled with its accessibility from the Internet, makes it much easier to inadvertently run vulnerable services with insecure default security policies that are exposed more than intended. It also allows for running services that can hijack the equivalent infrastructure services, e.g., ARP, DHCP, DNS, or NFS.

These are all reasons we as infrastructure providers care about. There are also good reasons that you as an experimenter should care. The experiment network fabric consists of higher bandwidth node NICs (10, 25, 100Gbps) coupled with better switches and higher bandwidth interconnects. This fabric provides isolation guarantees by using separate switch VLANs per experiment link/LAN, and optional performance guarantees by not allowing over-provisioning of logical links onto the physical links. Together these provide a more reproducible environment for experimentation. Accidental use of the control network instead of the experiment network can undermine these characteristics. You might perform a benchmark expecting it to run over an isolated 100Gb link, but a misconfiguration might cause it to use a 10Gb link over the shared network. Likewise, a dynamic routing experiment over a carefully constructed multi-hop topology might be circumvented by the routing daemons discovering and using the control network, where all nodes are one hop from every other node.
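Before running a benchmark, it can help to confirm which local address the kernel would choose to reach a peer. The Python sketch below is one way to do that, offered as an illustration under the assumption stated earlier that experiment interfaces use 10.x.x.x addresses. It relies on the fact that connecting a UDP socket selects a route and source address without sending any packet. The function names are hypothetical, and the port number is arbitrary.

```python
import ipaddress
import socket

# Assumption taken from the text: experiment interfaces use 10.x.x.x addresses.
EXPERIMENT_NET = ipaddress.ip_network("10.0.0.0/8")


def source_address_for(peer, port=9):
    """Return the local IPv4 address the kernel would use to reach peer."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # connect() on a UDP socket only performs a route lookup.
        s.connect((peer, port))
        return s.getsockname()[0]


def uses_experiment_network(peer):
    """True if traffic to peer would leave from a 10.x.x.x address."""
    return ipaddress.ip_address(source_address_for(peer)) in EXPERIMENT_NET
```

If `uses_experiment_network` returns False for a peer you address by name, the name is probably resolving to that node's control network address rather than to its experiment alias.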

11.5 When can I use the control network?

While there are many usage patterns of the control network you should avoid, there are times when it is legitimate to use it. These include:

In general, low bandwidth or one-time “bursty” use between CloudLab nodes and the Internet, or between CloudLab nodes in one experiment and nodes in another experiment, is okay: think human interaction or 1Gb/sec or less of TCP traffic.

We are flexible, however, so if you need to move a large amount of data in or out of CloudLab or between experiments quickly, just let us know in advance.