Saturday, January 9, 2016

A comparison of alternatives to init(8) and rc(8)

(Full disclosure: I am the author of relaunchd, one of the projects being evaluated in this article.)

There are several projects underway to create a new system initialization and service management framework for FreeBSD and other Unix-like systems. This article compares three of these projects: launchd (as ported by NextBSD), relaunchd, and nosh.

The boot process

For the purpose of this discussion, the boot process can be broken into three stages:
  1. the boot loader, loader(8)
  2. the primordial process, init(8)
  3. the run control service, rc(8)
The functions of the boot loader will not be covered in this article.

The boot loader hands control to init, which has the following basic responsibilities:
  • enter single user mode, if requested
  • spawn a number of getty(8) terminals for the console
  • launch the rc mechanism
After performing its duties for the early part of the boot process, the init process continues running in the background to perform several other duties:
  • restarting getty if it dies
  • reaping orphaned child processes
  • waiting for a signal to reboot or shut down the system
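
As a rough illustration of the "restart getty if it dies" duty, here is a minimal sketch in Python. It is not how init(8) is implemented; the supervise() helper and the stand-in child command are hypothetical, and the restart limit exists only so that the example terminates.

```python
import subprocess
import sys


def supervise(argv, max_starts=3):
    """Start argv, wait for it to exit, and start it again.

    Returns the exit codes that were collected. A real init would loop
    forever; max_starts only keeps this sketch finite.
    """
    exit_codes = []
    for _ in range(max_starts):
        child = subprocess.Popen(argv)
        # wait() blocks until the child exits and reaps it, so no zombie
        # process is left behind.
        exit_codes.append(child.wait())
    return exit_codes


if __name__ == "__main__":
    # Stand-in for getty: a process that exits right away with status 7.
    fake_getty = [sys.executable, "-c", "import sys; sys.exit(7)"]
    print(supervise(fake_getty))
```

The sketch only covers children that the supervisor started itself. Reaping orphans that get re-parented to PID 1 is a separate duty that it does not attempt to show.
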
The rc mechanism is a set of shell scripts that determine what services to start, and in what order. Once everything has been started, the rc process terminates.
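
To make "in what order" concrete: FreeBSD's rc scripts declare what they provide and require in PROVIDE and REQUIRE comment lines, and rcorder(8) sorts the scripts accordingly. The sketch below shows the same idea with Python's graphlib module. The service names and their requirements are made up for illustration, and this is not a reimplementation of rcorder.

```python
from graphlib import TopologicalSorter  # Python 3.9 or later

# Hypothetical services, each mapped to the services it requires.
REQUIRES = {
    "network": set(),
    "syslogd": set(),
    "sshd": {"network", "syslogd"},
    "nginx": {"network"},
}


def start_batches(requires):
    """Group services into batches that depend only on earlier batches."""
    sorter = TopologicalSorter(requires)
    sorter.prepare()  # raises graphlib.CycleError on circular requirements
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def serial_order(requires):
    """Flatten the batches into one list, the way a serial boot runs them."""
    return [name for batch in start_batches(requires) for name in batch]


if __name__ == "__main__":
    print(start_batches(REQUIRES))
    print(serial_order(REQUIRES))
```

Services in the same batch do not depend on each other, so a smarter service manager could start them at the same time instead of one after another.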

Problems with the current process

While the current init and rc mechanisms are classic parts of the Unix design, there are several reasons why people are interested in replacing them:
  • reliability - if a service dies, the rc mechanism will not automatically restart it.
  • performance - services are started serially, which is slow and does not take advantage of the parallelism of today's multi-core hardware.
  • security - the shell scripts that rc uses do not allow for the enhanced security features that are possible in other systems.
  • complexity - shell scripts are programs, which makes them more difficult to manage than the simple configuration files used by most modern alternatives.
  • dependency management - the rc system requires service dependencies to be explicitly defined in each rc script. In many alternative systems, dependencies are automatically handled through mechanisms like socket activation.
  • features - the rc system has a stable interface, which limits the ability to extend it with new features. It can only handle basic functions like starting a service, stopping it, and checking whether it has died.
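
For readers who have not seen socket activation before, here is a minimal sketch of the idea in Python. The service manager creates the listening socket itself and starts the service only when the first client connects, handing over the already-open socket. Clients can connect before the service is running, which is why start-up order stops mattering. Everything here (the activate_on_demand() helper, the tiny child program, passing the descriptor number as a command-line argument) is made up for illustration and is not how launchd or any of the projects below implements it.

```python
import select
import socket
import subprocess
import sys

# Hypothetical service: it inherits a listening socket and answers one client.
CHILD = r"""
import socket, sys
listener = socket.socket(fileno=int(sys.argv[1]))
conn, _ = listener.accept()
conn.sendall(b"hello from the service\n")
conn.close()
"""


def activate_on_demand(listener, timeout=5.0):
    """Start the service once a client is waiting on the listening socket."""
    ready, _, _ = select.select([listener], [], [], timeout)
    if not ready:
        return None  # nobody connected, so the service is never started
    fd = listener.fileno()
    # pass_fds keeps the descriptor open (same number) in the child process.
    return subprocess.Popen([sys.executable, "-c", CHILD, str(fd)], pass_fds=[fd])


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # let the kernel pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    # The client connects while no service process exists yet; the kernel
    # queues the connection on the listening socket.
    client = socket.create_connection(("127.0.0.1", port), timeout=10)
    service = activate_on_demand(listener)
    reply = client.recv(1024)
    client.close()
    service.wait()
    listener.close()
    return reply


if __name__ == "__main__":
    print(demo())
```

The example relies on POSIX file descriptor inheritance, so it is meant for FreeBSD, Linux, and similar systems.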

Proposed solutions

This article takes a look at three of the proposed solutions.

launchd

The NextBSD project has ported the original launchd from Apple, and offers a very close approximation to the experience of using launchd in OS X. One major difference is that within NextBSD, the launchd.plist(5) jobs are specified using JSON instead of XML.

To achieve this feat, the NextBSD developers wrote a Mach compatibility layer that allows launchd to use Mach IPC and (possibly) other features of Mach.

relaunchd

The relaunchd project has started from scratch and built a workalike to launchd without using any of the original code. It is not yet feature-complete, but a lot of progress has been made. It works under FreeBSD and Linux, and uses LibUCL to parse its configuration files.

Unlike the two other projects in this article, relaunchd is explicitly designed not to replace init(8) and rc(8). Instead, it is intended to provide additional features and benefits to user programs that want them, without disrupting the existing initialization and boot system. The idea is that programs will gradually see the benefits of switching to relaunchd management, and the number of things being managed under the traditional rc mechanism will gradually shrink to almost nothing.

nosh

The nosh project does not aim to be a clone of launchd; rather, it tries to take the best features of all of the modern rc/init replacements and combine them into a new system. It has compatibility shims for systemd, Solaris' SMF, Red Hat's chkconfig/service, and OpenBSD's rcctl.

nosh has also been designed to work with FreeBSD and Linux.

Feature comparison


The table below gives a quick comparison between the three projects:

Feature                                                | launchd    | relaunchd                                           | nosh
-------------------------------------------------------+------------+-----------------------------------------------------+--------------------------
Replaces init(8) and runs as PID #1                    | Yes        | No                                                  | Yes [1]
Replaces rc(8) and manages all services                | Yes        | No                                                  | Yes [1]
IPC mechanism                                          | Mach ports | None currently; planned support for DBus and libipc | None
Configuration file format                              | JSON       | JSON [2]                                            | Custom scripting language
Compatibility shims for systemd and other init systems | No         | No                                                  | Yes
Process supervision                                    | Yes        | Planned, but not implemented                        | Yes
Cron replacement                                       | Yes        | Planned, but not implemented                        | No
Milestones and targets                                 | No         | No                                                  | Yes
Built-in syslog replacement                            | No         | No                                                  | Yes

[1] Nosh allows you to experiment with running alongside an existing init/rc mechanism, but the documentation implies this is not the goal of the project.
[2] Technically, jobs could be defined in any format that LibUCL supports, which (currently) is JSON, YAML, and nginx-style.


Thursday, December 3, 2015

First DTrace hack

I was able to get a rough port of the DTrace Toolkit script "statsnoop" to run on FreeBSD. This largely involved removing references to Solaris syscalls that don't exist on FreeBSD, and commenting out some Solaris-specific lookups for fstat().

The patch is below:

# diff -u statsnoop statsnoop.FreeBSD 
--- statsnoop   2015-11-12 05:11:04.000000000 -0500
+++ statsnoop.FreeBSD   2015-12-03 00:38:45.089471021 -0500
@@ -191,9 +191,9 @@
  /*
   * Print stat event
   */
- syscall::stat:entry, syscall::stat64:entry, syscall::xstat:entry,
- syscall::lstat:entry, syscall::lstat64:entry, syscall::lxstat:entry,
- syscall::fstat:entry, syscall::fstat64:entry, syscall::fxstat:entry
+ syscall::stat:entry,
+ syscall::lstat:entry,
+ syscall::fstat:entry
  {
        /* default is to trace unless filtering */
        self->ok = FILTER ? 0 : 1;
@@ -204,34 +204,29 @@
        (OPT_trace == 1 && TRACE == probefunc) ? self->ok = 1 : 1;
  }
 
- syscall::stat:entry, syscall::stat64:entry,
- syscall::lstat:entry, syscall::lstat64:entry, syscall::lxstat:entry
+ syscall::stat:entry,
+ syscall::lstat:entry
  /self->ok/
  {
        self->pathp = arg0;
  }
 
- syscall::xstat:entry
- /self->ok/
- {
-       self->pathp = arg1;
- }
-
- syscall::stat:return, syscall::stat64:return, syscall::xstat:return,
- syscall::lstat:return, syscall::lstat64:return, syscall::lxstat:return
+ syscall::stat:return,
+ syscall::lstat:return
  /self->ok/
  {
        self->path = copyinstr(self->pathp);
        self->pathp = 0;
  }
 
- syscall::fstat:return, syscall::fstat64:entry, syscall::fxstat:entry
+/*
+ syscall::fstat:return
  /self->ok/
  {
        self->filep = curthread->t_procp->p_user.u_finfo.fi_list[arg0].uf_file;
  }
 
- syscall::fstat:return, syscall::fstat64:return, syscall::fxstat:return
+ syscall::fstat:return
  /self->ok/
  {
         this->vnodep = self->filep != 0 ? self->filep->f_vnode : 0;
@@ -239,10 +234,11 @@
             cleanpath(this->vnodep->v_path) : "") : "";
        self->filep = 0;
  }
+*/
 
- syscall::stat:return, syscall::stat64:return, syscall::xstat:return,
- syscall::lstat:return, syscall::lstat64:return, syscall::lxstat:return,
- syscall::fstat:return, syscall::fstat64:return, syscall::fxstat:return
+ syscall::stat:return,
+ syscall::lstat:return,
+ syscall::fstat:return
  /self->ok && (! OPT_failonly || (int)arg0 < 0) && 
      ((OPT_file == 0) || (OPT_file == 1 && PATHNAME == copyinstr(self->pathp)))/
  {
@@ -275,9 +271,9 @@
  /* 
   * Cleanup 
   */
- syscall::stat:return, syscall::stat64:return, syscall::xstat:return,
- syscall::lstat:return, syscall::lstat64:return, syscall::lxstat:return,
- syscall::fstat:return, syscall::fstat64:return, syscall::fxstat:return
+ syscall::stat:return,
+ syscall::lstat:return,
+ syscall::fstat:return
  /self->ok/
  {
        self->path = 0;


And voila, I'm able to watch all the stat(2) calls in real time. Here is a sample of the output, with warnings removed:
 
 
# ./statsnoop.FreeBSD 2>&1 | grep -v 'invalid address'
  UID    PID COMM          FD PATH                 
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
 1001   1373 konsole        0 /etc/nsswitch.conf   
    0   1074 sh            -1 /var/tmp/appcafe/dispatch-queue 
    0  10683 sleep          0 /etc                 
    0  10683 sleep          0 /etc/libmap.conf     
    0  10683 sleep          0 /usr                 
    0  10683 sleep          0 /usr/local           
    0  10683 sleep          0 /usr/local/etc       
    0  10683 sleep         -1 /usr/local/etc/libmap.d 

For some reason, konsole loves to make sure that /etc/nsswitch.conf has not changed, and calls stat() constantly. Sounds like a job for stated(8) to solve, someday.

Monday, November 2, 2015

relaunchd v0.1 released

Version 0.1 of the relaunchd project has been released, and submitted to the FreeBSD ports tree:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204240

Friday, October 30, 2015

stated v0.1 released

The first version of stated has been released. You can find out more about it by visiting the new website at:

http://mheily.github.io/stated/

It has been submitted for inclusion in the FreeBSD ports tree:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204172

Sunday, October 25, 2015

Sharing state with stated

In my spare time, I've been working on a new publish/subscribe mechanism called stated (pronounced "state dee").

This mechanism allows unrelated programs to exchange information about changes to their internal state. It is not yet complete, but I think enough of the API and implementation is in place for people to take a look.

I was inspired to write stated by looking at what Apple did with their notify(3) API. The basic idea is good, but there are some problems with the Apple design:
  1. The API design is a mix of stateless and stateful functions.
  2. State information is limited to a single integer.
  3. Security around the system namespace is weak.
  4. It is not easily portable due to entangling dependencies.
  5. The name conflicts with an existing open source library.
I'll spend the rest of this blog post discussing these problems in more detail, and showing how stated addresses each of them.

API confusion

The original design of the notify(3) API was stateless, and the ability to include state information was added later. In fact, the first sentence of the manpage still reads:
"These routines allow processes to exchange stateless notification events"
I wanted an API that was clean and focused on one thing: state change notifications. In the state(3) API, all notifications must include state information.

Not all states are integers

When it comes to the kind of state information that can be communicated, the notify(3) API is totally inadequate. You are limited to communicating state via a single unsigned integer value. By contrast, the state(3) API allows you to publish a character string of arbitrary length. This would allow you to publish a simple string as the state value, or encode a more complex set of values using JSON or XML or whatever encoding scheme you like.

To explain why this is important, imagine you have a daemon that is responsible for controlling the timezone; call it "timezoned" for example. Now imagine that you are a program that cares about the timezone, and you want to be notified whenever someone changes the timezone.

Using the state(3) API, timezoned can publish the name of the new time zone as the state, and subscribers can read this value and update their internal cache. Subscribers do not need to know the details of how/where the timezone is set; all they need to know is that the service that publishes information about the system.timezone state has told them that the new timezone is "America/New_York".
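
To show the shape of this interaction, here is a toy, in-memory publish/subscribe sketch in Python. It is not the state(3) API and it does not talk to stated; the StateBus class and its method names are invented for this example. The only things taken from the description above are that every notification carries a string value and that subscribers are told the new value directly.

```python
class StateBus:
    """In-memory stand-in for a state daemon (hypothetical, not stated)."""

    def __init__(self):
        self._values = {}
        self._subscribers = {}

    def subscribe(self, name, callback):
        """Register callback(name, value) for changes to the named state."""
        self._subscribers.setdefault(name, []).append(callback)

    def publish(self, name, value):
        """Record a new string value and notify every subscriber."""
        if not isinstance(value, str):
            raise TypeError("state must be a string")
        self._values[name] = value
        for callback in self._subscribers.get(name, []):
            callback(name, value)

    def current(self, name):
        """Return the last published value, or None if there is none."""
        return self._values.get(name)


bus = StateBus()
cache = {}
# The subscriber keeps a local cache and never asks where the value came from.
bus.subscribe("system.timezone", lambda name, value: cache.update({name: value}))
# "timezoned" publishes the new zone by name.
bus.publish("system.timezone", "America/New_York")
print(cache["system.timezone"])
```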

By contrast, the notify(3) API would require timezoned to coerce the current timezone into an unsigned integer. It would be up to the calling program to figure out what that means, and to do some kind of lookup to convert that into the user's preferred name for the timezone.
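
A small sketch of what that coercion could look like in practice. The table and the numeric IDs are entirely made up; nothing here is part of notify(3). The point is only that an integer-valued state needs a lookup table that the publisher and every subscriber have to agree on.

```python
# Hypothetical table that the publisher and all subscribers must share.
ZONE_IDS = {1: "UTC", 2: "America/New_York", 3: "Europe/Paris"}
ZONE_BY_NAME = {name: zone_id for zone_id, name in ZONE_IDS.items()}


def encode_zone(name):
    """Publisher side: squeeze a zone name into an integer."""
    return ZONE_BY_NAME[name]


def decode_zone(zone_id):
    """Subscriber side: recover the name, or None if the ID is unknown."""
    return ZONE_IDS.get(zone_id)


print(decode_zone(encode_zone("America/New_York")))
# An ID that is missing from the subscriber's copy of the table is useless.
print(decode_zone(99))
```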

Insecure global namespace

The notify(3) API provides an unprotected global namespace with no isolation between the operating system and unprivileged users. Any program can impersonate any other program, and publish notifications on its behalf.

By contrast, state(3) provides a "secure-by-default" approach to the global notification namespace. Processes running under UID 0 are considered "the system" and have full control over the global namespace. All unprivileged users are confined to their own user.uid.### namespace, and are not able to publish to the global namespace.
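
The rule can be written down in a few lines. This is only a sketch of the policy as described above, not code from stated. The may_publish() function is hypothetical, and I am assuming here that names in a user's namespace look like "user.uid.1001.something".

```python
def may_publish(uid, name):
    """Decide whether a process running as uid may publish the named state."""
    if uid == 0:
        # UID 0 is "the system" and controls the whole namespace.
        return True
    # Everyone else is confined to their own prefix. The trailing dot stops
    # UID 10 from matching names that belong to UID 1001.
    return name.startswith(f"user.uid.{uid}.")


print(may_publish(0, "system.timezone"))
print(may_publish(1001, "user.uid.1001.editor.theme"))
print(may_publish(1001, "system.timezone"))
```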

Entangling dependencies

The Apple implementation of libnotify is not very portable, because it depends on other Apple-specific technologies like Mach, the Apple System Logger, libdispatch, and the C blocks extension.

By contrast, stated tries to limit itself to standard POSIX facilities as much as possible. There are a few exceptions:
  • kqueue(2) is used for monitoring file descriptors
  • a tmpfs filesystem is used to avoid writing notification information to disk
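
For anyone unfamiliar with descriptor monitoring, the sketch below waits for a file descriptor to become readable using Python's selectors module, which picks the best mechanism available on the platform (kqueue on the BSDs, for example). It only illustrates the general technique; stated itself is not written this way, and the wait_readable() helper is made up.

```python
import os
import selectors


def wait_readable(fd, timeout=1.0):
    """Return True if fd becomes readable within timeout seconds."""
    with selectors.DefaultSelector() as sel:
        sel.register(fd, selectors.EVENT_READ)
        # select() returns an empty list when the timeout expires.
        return bool(sel.select(timeout))


read_end, write_end = os.pipe()
print(wait_readable(read_end, timeout=0.1))  # nothing has been written yet
os.write(write_end, b"x")
print(wait_readable(read_end))
os.close(read_end)
os.close(write_end)
```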

A rose by any other name

The name of the Apple implementation is "libnotify", which is already the name of an existing freedesktop.org package in FreeBSD. To avoid clashing with this existing package, I decided to release my new library under the name "libstate", with the corresponding daemon named "stated". These names are not currently in use in Linux or BSD.