One thing that’s still underrated in programming: skipping a library dependency and
coding the small thing you need directly. No over-generalized library, no unnecessary abstractions, no dependency hell – just
straightforward code that solves your specific problem in the most direct way possible.
To put a longer story on it: I’ve been working on provisioning VMs with cloud-init NoCloud for
the cloud orchestration platform I’m building. To bootstrap VM state from a fresh image, I need to expose a VFAT disk from the host hypervisor
to the guest with the VM’s metadata (Network setup, hostname, users, etc.). The disk image needs to be dynamically built during
VM provisioning time, and injected with the correct instance information inside the VFAT volume for the guest to read.
VM provisioning happens in Rust code, so there are a few different ways I considered implementing this:
- Build up a directory in a local scratch temp dir with the instance metadata files, and shell out to mkfs.vfat with something like std::process to turn the directory into a VFAT volume.
- Use an existing Rust library like fatfs to write the instance metadata files into a new FAT filesystem.
Both of these require taking on external program or library dependencies.
I went a different way instead. I made a single zero-dependency Rust function that builds up the FAT filesystem in memory, and writes out the bytes directly to a file on the host.
It’s a few hundred lines of code, most of which are comments and tests. I read an old Microsoft FAT specification, wrestled with the hackish oddities of Long Filename support, and got
something working in a few hours’ time.
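To give a flavor of what writing the bytes directly involves, here is a minimal sketch of only the first sector of a FAT12 volume. It is an illustration, not the function described above: the floppy-style geometry (2880 sectors, 224 root directory entries, 9 sectors per FAT), the OEM name, the volume serial, and the helper names fat12_boot_sector, put_padded and put_u16 are all assumptions. The field offsets follow the boot sector layout in the Microsoft FAT specification.

```rust
const SECTOR_SIZE: usize = 512;

/// Copy `text` into `dst`, padding the remainder with ASCII spaces.
fn put_padded(dst: &mut [u8], text: &[u8]) {
    dst.fill(b' ');
    let n = text.len().min(dst.len());
    dst[..n].copy_from_slice(&text[..n]);
}

/// Write a little-endian u16 at `offset`.
fn put_u16(sector: &mut [u8], offset: usize, value: u16) {
    sector[offset..offset + 2].copy_from_slice(&value.to_le_bytes());
}

/// Build a FAT12 boot sector with hard-coded, floppy-style geometry.
/// Every constant below is an illustrative assumption.
fn fat12_boot_sector(label: &str) -> [u8; SECTOR_SIZE] {
    let mut s = [0u8; SECTOR_SIZE];
    s[0..3].copy_from_slice(&[0xEB, 0x3C, 0x90]); // jump instruction stub
    put_padded(&mut s[3..11], b"MSWIN4.1");       // OEM name (8 bytes)
    put_u16(&mut s, 11, SECTOR_SIZE as u16);      // bytes per sector
    s[13] = 1;                                    // sectors per cluster
    put_u16(&mut s, 14, 1);                       // reserved sectors (this one)
    s[16] = 2;                                    // number of FAT copies
    put_u16(&mut s, 17, 224);                     // root directory entries
    put_u16(&mut s, 19, 2880);                    // total sectors (16-bit field)
    s[21] = 0xF0;                                 // media descriptor
    put_u16(&mut s, 22, 9);                       // sectors per FAT
    put_u16(&mut s, 24, 18);                      // sectors per track
    put_u16(&mut s, 26, 2);                       // number of heads
    // Bytes 28..36 (hidden sectors, 32-bit total sector count) stay zero.
    s[38] = 0x29;                                 // extended boot signature
    s[39..43].copy_from_slice(&0x1234_5678u32.to_le_bytes()); // volume serial
    put_padded(&mut s[43..54], label.as_bytes()); // volume label (11 bytes)
    put_padded(&mut s[54..62], b"FAT12");         // informational type string
    s[510] = 0x55;                                // boot sector signature
    s[511] = 0xAA;
    s
}
```

Calling fat12_boot_sector("cidata") yields the 512 bytes that would sit at offset 0 of the image; the FAT copies, root directory, and file data would follow.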
What’s missing from this implementation? So much!
- Only supports a fixed number of FAT filesystem clusters
- Only supports fixed filesystem sector sizes
- Only supports a hard-coded root directory size
- Hardcoded filesystem label only (it has to be the string cidata to work with cloud-init anyway!)
- No read support
- Basic FAT12 only (No larger files with FAT32)
- No general file write interface: Just supports writing the 4 metadata files we need and that’s it
- Immutable writes only, no updates / mutations.
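The "FAT12 only" item hides some of the byte-level fiddliness: each allocation table entry is 12 bits wide, so two entries share three bytes. The sketch below shows one way to read and write packed entries. The helper names fat12_get and fat12_set are hypothetical and are not taken from the implementation described here.

```rust
/// Read the 12-bit entry for `cluster` from a packed FAT12 table.
fn fat12_get(fat: &[u8], cluster: usize) -> u16 {
    // Each entry occupies 1.5 bytes, so entry N starts at byte N + N/2.
    let offset = cluster + cluster / 2;
    let pair = u16::from_le_bytes([fat[offset], fat[offset + 1]]);
    if cluster % 2 == 0 {
        pair & 0x0FFF // even entries use the low 12 bits
    } else {
        pair >> 4 // odd entries use the high 12 bits
    }
}

/// Write the 12-bit entry for `cluster`, preserving the neighbouring entry.
fn fat12_set(fat: &mut [u8], cluster: usize, value: u16) {
    let offset = cluster + cluster / 2;
    let pair = u16::from_le_bytes([fat[offset], fat[offset + 1]]);
    let value = value & 0x0FFF;
    let updated = if cluster % 2 == 0 {
        (pair & 0xF000) | value
    } else {
        (pair & 0x000F) | (value << 4)
    };
    fat[offset..offset + 2].copy_from_slice(&updated.to_le_bytes());
}
```

With a write-once image, a builder would only need the setter: mark the two reserved entries, then chain each file’s clusters and terminate the chain with an end-of-chain value such as 0xFFF.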
But this is all I need (and will probably ever need!). If I ever start doing anything fancier and dynamic with FAT filesystems, I can re-evaluate.
Instead, I get no breaking API changes from upstream libraries, no CVEs to patch, no license changes or rug-pulls, and code that’s straightforward and easy to fix and debug directly. Why carry around the extra weight of more dependencies for such a tightly scoped problem? We’re only writing 4 small metadata files to the VFAT disk image after all.
Was this the quickest way to solve the problem? Probably not! Slapping in another dependency via cargo add and gluing some more APIs together would have probably gotten the problem “out of the way” faster from the start. But the initial expediency shouldn’t overshadow the benefits of building something tidy, compact, and definitionally simple like this.
I want less duct tape and glue in my programs. I want fewer moving pieces and less maintenance. I want code that’s mine. I want systems that are load-bearing and ready for production. I want more functionality, with less dependency.
If you’re looking for a Rust-based VMM alternative to Firecracker,
I’d check out Cloud Hypervisor. Cloud Hypervisor ended up hitting that sweet spot for me
in many places where Firecracker is chasing different outcomes.
Both projects share so many things to love:
- Entirely Rust based, sharing many quality crates from the rust-vmm project.
- Built on top of the Linux KVM hypervisor
- Good virtio device support (disks, network interfaces, etc.)
- Simple VMM HTTP API
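As a rough illustration of that HTTP API on the Cloud Hypervisor side: the VMM serves a REST API over a local unix socket (the path is chosen with --api-socket), under /api/v1/. The sketch below hand-rolls a GET to the vmm.ping endpoint using only the standard library. The function names api_request and ping are made up for this example, and reading a single chunk of the response is a simplification that assumes the reply fits in one read.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::path::Path;

/// Build a minimal HTTP/1.1 request line and headers for an API endpoint.
fn api_request(method: &str, endpoint: &str) -> String {
    format!(
        "{} /api/v1/{} HTTP/1.1\r\nHost: localhost\r\nAccept: */*\r\n\r\n",
        method, endpoint
    )
}

/// Send `GET /api/v1/vmm.ping` over the VMM's unix socket and return the
/// first chunk of the raw response (status line, headers, and usually body).
fn ping(socket: &Path) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(socket)?;
    stream.write_all(api_request("GET", "vmm.ping").as_bytes())?;
    // Simplification: one read of up to 4 KiB instead of full HTTP parsing.
    let mut buf = [0u8; 4096];
    let n = stream.read(&mut buf)?;
    Ok(String::from_utf8_lossy(&buf[..n]).into_owned())
}
```

In a real control plane you would more likely use an HTTP client that supports unix sockets; this only shows how little is needed to talk to the API.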
In contrast to Firecracker, Cloud Hypervisor directly targets long-lived, stateful VMs (but can also slim down to ‘microvm’ territory),
and thus supports other features that Firecracker will most likely never add.
UEFI boot support
Unlike Firecracker, which only supports direct Linux boot, Cloud Hypervisor can boot a minimal UEFI firmware. This
gives your guest VM direct control of the kernel upgrade lifecycle from within the guest. It also allows running OS types other
than Linux. If you’re running long-lived stateful VMs that expect to manage their own kernel version upgrades over
time, then this is what you want. I’m no fan of the complexities of UEFI, but the reality is that if you want widespread
guest boot support (including Windows, etc.), having at least basic UEFI support is a good and useful thing.
Firecracker requires using Linux Direct Boot. This limits your
guest VMs to Linux only, and requires that the VMM itself manage the kernel upgrade lifecycle of the guest VMs. Firecracker is optimizing
for something different than Cloud Hypervisor: Lightweight, short-lived VMs that are biased towards serverless function workloads (It
was built for Amazon’s Lambda after all). Cloud Hypervisor also supports Linux direct boot in addition to UEFI support.
Qcow2 disk images
Firecracker only ships with raw disk image support. The biggest drawback I hit with raw disk images is that they require
pre-allocating the entire guest disk, consuming valuable host disk capacity (or implementing other complex setups like device-mapper for overlay).
These pre-allocations also lead to extra byte copying when shuffling block devices around a host cluster.
Instead, Cloud Hypervisor has direct Qcow2 support. This gives us baked in support for thin-allocated disk images (saving over-allocation), as well as backing-file
support. You can have read-only overlay semantics for shared disk image lineage, snapshots, etc. No external device
mapper required.
For generating base images, it’s much easier to generate Qcow2 images with a base OS install, and then at VM provisioning time,
issue a quick metadata operation to the Qcow2 file to resize the image for each provisioned guest.
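One way to script that workflow, sketched here under the assumption that the qemu-img tool is installed on the host (the post does not say which tool performs the resize): create a thin overlay on top of a shared base image, then grow its virtual size. The function names and file names are hypothetical.

```rust
use std::path::Path;
use std::process::Command;

/// Command that creates a thin qcow2 overlay backed by a read-only base image.
/// Equivalent to: qemu-img create -f qcow2 -F qcow2 -b <base> <overlay>
fn overlay_command(base: &Path, overlay: &Path) -> Command {
    let mut cmd = Command::new("qemu-img");
    cmd.args(["create", "-f", "qcow2", "-F", "qcow2", "-b"])
        .arg(base)
        .arg(overlay);
    cmd
}

/// Command that changes the virtual size of an image, e.g. to "20G".
/// Growing a qcow2 image this way only updates metadata.
fn resize_command(image: &Path, size: &str) -> Command {
    let mut cmd = Command::new("qemu-img");
    cmd.arg("resize").arg(image).arg(size);
    cmd
}
```

At provisioning time you would call .status() on each command in turn and check the exit code, for example overlay_command(Path::new("base.qcow2"), Path::new("vm1.qcow2")) followed by resize_command(Path::new("vm1.qcow2"), "20G").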
Beyond Qcow2 disk images, I previously wrote an experimental implementation of network attached block devices for Firecracker.
While this quick and dirty experiment worked, Firecracker’s focus on short-lived VMs for serverless functions made me doubt whether network-attached storage code patches would be welcome upstream. I’d
love to push contributions back to Cloud Hypervisor that enable arbitrary block-device backend support.
Live migration
I haven’t gotten to plumb in live migration yet, but this is something Firecracker is also unlikely to ever add, due to
its focus on short-lived VMs. I’m looking forward to implementing this one on my cluster control plane!
Not Amazon
Fundamentally, I’m just not comfortable making deep investments in a project that’s governed by Amazon. Amazon’s values don’t align well
with mine as of late, and I’d rather invest in tooling that doesn’t put more compute power in Jeff Bezos’ pocket.
To be sure: there are other hyperscalers involved in Cloud Hypervisor, but the diversity gives me more confidence. The last major release saw contributions
from technologists at Microsoft, Google, Meta, IBM, Tencent, and several more.
I’m much choosier about licenses and governance of open source projects I contribute to these days. Too many rug-pulls for comfort.
Give it a try
I was able to add Cloud Hypervisor support to my internal cloud control plane in about a day or so. It shares many of the broad
feature sets that Firecracker has, and then goes further.
If you’ve ever been intrigued by Firecracker, and are looking for an alternative whose mission isn’t so pigeon-holed into serverless, I’d
give Cloud Hypervisor a look!
NOTE: This is a longer explanation to a question I responded to on GitHub about dynamically adding listeners / services to Pingora.
I wanted to show the technique I’m using to dynamically manage Pingora LoadBalancer instances inside a general proxy / load balancer service I’m building. Pingora is a Rust-based proxy library from Cloudflare that can be used to build high-performance HTTP load balancers and proxy services.
Pingora’s design prompts you to set up your process using a static Service graph that you build
during process startup. This is similar to patterns found in Guava’s Service interface, for managing several logical asynchronous services inside a single process. You configure your various services (load balancers, proxy services, background health checks, etc.), add them to a Server instance, and then start the Server, which takes over the lifecycle of your services.
From Pingora’s getting started guide:
fn main() {
    let mut server = Server::new(None).unwrap();
    server.bootstrap();
    let upstreams = LoadBalancer::try_from_iter(["1.1.1.1:443", "1.0.0.1:443"]).unwrap();
    let mut lb = http_proxy_service(&server.configuration, LB(Arc::new(upstreams)));
    lb.add_tcp("0.0.0.0:6188");
    server.add_service(lb);
    server.run_forever();
}
The Server struct normally does a lot of heavy lifting for you:
- Clean process startup / shutdown, including handling the correct Service dependency ordering.
- Managing each Service’s tokio Runtime. The Server instance creates an individual Runtime instance for each Service that it manages.
- Zero downtime listener socket handoff: The Server handles handing off listener file descriptors over a unix socket to a new process to support zero downtime upgrades. This is very similar to how something like Envoy proxy does online hot restarts.
One major problem: After setting up services, the API to start the Server returns a Rust “Never type”.
Once your program hands off control to Server#run_forever, you’re never getting it back.
impl Server {
    /// Start the server using Self::run and default RunArgs.
    /// This function will block forever until the server needs to quit. So this would be the last function to call for this object.
    /// Note: this function may fork the process for daemonization, so any additional threads created before this function will be lost to any service logic once this function is called.
    pub fn run_forever(self) -> !
}
It’s quite easy to forgo the Server type entirely, still use all the Pingora services, and retain full control of your process. This gives you the ability to dynamically start / stop
services (including LoadBalancer instances), or do whatever else you want. It’s on you to manage clean shutdown, and provision tokio runtimes. You lose out on some of the other built-ins like zero downtime hot restarts, but in my case, it’s worth it.
Starting a LoadBalancer without a Server is pretty straightforward:
fn make_load_balancer() -> GenBackgroundService<LoadBalancer<RoundRobin>> {
    let backends = Backends::new(Box::new(ResourceDiscovery));
    let mut load_balancer = LoadBalancer::from_backends(backends);
    let health_check = TcpHealthCheck::new();
    load_balancer.set_health_check(health_check);
    load_balancer.health_check_frequency = Some(Duration::from_secs(5));
    load_balancer.update_frequency = Some(Duration::from_secs(30));
    let load_balancer_service = background_service("health_check", load_balancer);
    load_balancer_service
}
fn main() -> Result<(), anyhow::Error> {
    // Manage our own tokio Runtime
    let runtime = tokio::runtime::Runtime::new()
        .expect("Could not start tokio runtime");
    // Create a Vec of tokio task handles, so we can wait for
    // them to finish during process shutdown
    let mut tasks = Vec::new();
    // Each service watches this common channel to trigger clean shutdown.
    let (shutdown_tx, shutdown_rx) = tokio::sync::watch::channel(false);
    let load_balancer = make_load_balancer();
    // Clone the receiver before spawning, so `shutdown_rx` itself is not
    // moved into the task and stays available for other services.
    let lb_shutdown = shutdown_rx.clone();
    // Start a load balancer on the tokio runtime ourselves
    tasks.push(runtime.spawn(async move {
        load_balancer.task()
            .start(lb_shutdown)
            .await
    }));
}
Starting a proxy service is also straightforward:
fn main() -> Result<(), anyhow::Error> {
    // `runtime`, `tasks` and `shutdown_rx` are set up as in the previous snippet.
    // With no Server, we have to manage our own ServerConf
    let server_config: Arc<ServerConf> = Arc::new(Default::default());
    let mut proxy_service = http_proxy_service(&server_config, Proxy);
    proxy_service.add_tcp("0.0.0.0:80");
    // Clone the receiver before spawning so `shutdown_rx` is not moved away.
    let proxy_shutdown = shutdown_rx.clone();
    // Start the http proxy on the tokio Runtime
    tasks.push(runtime.spawn(async move {
        proxy_service.start_service(None, proxy_shutdown, 1)
            .await
    }));
}
Here’s an example where we start a new LoadBalancer service whenever we receive an event on a tokio signal (in this case,
a timer). This is the control loop running on main, that we use in place of Server#run_forever:
fn main() -> Result<(), anyhow::Error> {
    let mut tasks = Vec::new();
    let (shutdown_tx, shutdown_rx) = tokio::sync::watch::channel(false);
    // Setup services, like above (load balancers, proxies, etc.)
    // then proceed to main control loop.
    runtime.block_on(async {
        let mut interval = tokio::time::interval(Duration::from_secs(30));
        loop {
            tokio::select! {
                // Wait for shutdown. Normally the Server handles all external signal handling for you.
                _ = tokio::signal::ctrl_c() => {
                    tracing::info!("Got shutdown signal. Stopping");
                    // Trigger shutdown to all services
                    shutdown_tx.send(true)?;
                    // Join / wait for all tasks to stop
                    for task in tasks {
                        if let Err(err) = task.await {
                            tracing::error!("Join error during task shutdown: {:?}", err);
                        }
                    }
                    break;
                }
                // Contrived example: Making a LoadBalancer on a timer. In practice, you'd probably stash
                // your LoadBalancer instances in a shared data structure, and start / stop them on whatever
                // signal is meaningful for your service.
                //
                // This example uses a timer, but you can manage this any way you want.
                _ = interval.tick() => {
                    // Clone the receiver per task. Moving `shutdown_rx` itself into the
                    // async block would make it unavailable on the next loop iteration.
                    let shutdown = shutdown_rx.clone();
                    tasks.push(runtime.spawn(async move {
                        make_load_balancer().task()
                            .start(shutdown)
                            .await;
                    }));
                }
            }
        }
        Ok(())
    })
}
In real code, I run through a full reconciliation process inside the proxy process. The proxy calls
out to the control plane to fetch the list of configured load balancers, and starts / stops them when
they’re created or destroyed. This design allows me to host multiple LoadBalancer instances inside the same process, while still maintaining separate backend configurations for each distinct load balancer. The control tasks run on a separate tokio Runtime from the proxy and load balancer services that handle requests / responses.
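The heart of a reconciliation pass like that is a set difference between what the control plane wants and what is currently running. Here is a minimal, standard-library-only sketch of that step. The function name reconcile and the use of plain string names to identify load balancers are assumptions for illustration, not the real control plane API.

```rust
use std::collections::HashSet;

/// Compare the desired set of load balancer names against the running set.
/// Returns (names to start, names to stop), each sorted for stable output.
fn reconcile(
    desired: &HashSet<String>,
    running: &HashSet<String>,
) -> (Vec<String>, Vec<String>) {
    // Desired but not running: needs to be started.
    let mut to_start: Vec<String> = desired.difference(running).cloned().collect();
    // Running but no longer desired: needs to be stopped.
    let mut to_stop: Vec<String> = running.difference(desired).cloned().collect();
    to_start.sort();
    to_stop.sort();
    (to_start, to_stop)
}
```

Each name in the first list would get a freshly spawned task with its own shutdown channel, and each name in the second would have its shutdown signal sent and its task handle awaited.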
Check out Pingora if you’re interested in a Rust based load balancer library.