
bus1 — Kernel Message Bus

Synopsis

#include <linux/bus1.h>

Description

The bus1 Kernel Message Bus defines and implements a distributed object model. It allows local processes to send messages to objects owned by remote processes, as well as share their own objects with others. Object ownership is static and cannot be transferred. Access to remote objects is prohibited, unless it was explicitly granted. Processes can transmit messages to a remote object via the message bus, transferring a data payload, object access rights, file descriptors, or other auxiliary data.

To participate on the message bus, a peer context must be created. Peer contexts are kernel objects, identified by a file descriptor. They are not bound to any process, but can be shared freely. The peer context provides a message queue to store all incoming messages, a registry for all locally owned objects, and tracks access rights to remote objects. A peer context never serves as a routing entity, but merely as an anchor for peer-owned resources. Any message on the bus is always destined for an object, and the bus takes care to transfer a message into the message queue of the peer context that owns this object.

The message bus manages object access using capabilities. That is, by default only the owner of an object is granted access rights. No other peer can access the object, nor are they aware of its existence. However, access rights can be transmitted as auxiliary data with any message, effectively granting them to the receiver of the message. This even works transitively; that is, any peer that was granted access to an object can pass on those rights, even if it does not own the object. Note, however, that access rights can never be revoked, short of the owner destroying the object.

Nodes and Handles

Each peer context comes with a registry of owned objects, which in bus1 parlance are called nodes. A peer is always the exclusive owner of all nodes it has created. Ownership cannot be transferred. The message bus manages access rights to nodes as a set of handles held by each peer. For each node a peer has access to, whether it is local or remote, the message bus keeps a handle on the peer. Initially when a node is created the node owner is the only peer with a handle to the newly created node. Handles are local to each peer, but can be transmitted as auxiliary data with any message, effectively allocating a new handle to the same node in the destination peer. This works transitively, and each peer that holds a handle can pass it on further, or deliberately drop it. As long as a peer has a handle to a node it can send messages to it. However, a node owner can, at any time, decide to destroy a node. This causes all further message transactions to this node to fail, although messages that have already been queued for the node are still delivered. When a node is destroyed, all peers that hold handles to the node are notified of the destruction. Moreover, if the owner of a node that has been destroyed releases all its handles to the node, no further messages or notifications destined for the node are delivered.

Handles are the only way to refer to both local and remote nodes. For each handle allocated on a peer, a 64-bit ID is assigned to identify that particular handle on that particular peer. The ID is only valid locally on that peer; it cannot be used by remote peers to address the handle (in other words, the ID namespace is tied to each peer and does not define global entities). When creating a new node, userspace freely selects the ID, except that the BUS1_HANDLE_FLAG_MANAGED bit must be cleared; when receiving a handle from a remote peer, the kernel assigns the ID, which always has the BUS1_HANDLE_FLAG_MANAGED bit set. Additionally, the BUS1_HANDLE_FLAG_REMOTE flag tells whether a specific ID refers to a remote handle (if set), or to an owner handle (if unset). An ID assigned by the kernel is never reused, even after a handle has been dropped. The kernel keeps a user-reference count for each handle. Every time a handle is exposed to a peer, the user-reference count of that handle is incremented by one. This is never done asynchronously, but only synchronously when an ioctl is called by the holding peer. Therefore, a peer can reliably deduce the current user-reference count of all its handles, regardless of any ongoing message transaction. References can be explicitly dropped by a peer. Once the counter of a handle hits zero, the handle is destroyed, its ID becomes invalid, and if it was assigned by the kernel, it will not be reused again. Note that a peer can never hold multiple different handles to the same node; rather, the kernel always coalesces them into a single handle, using the user-reference counter to track it. However, if a handle is fully released, but the peer later acquires a handle to the same remote node again, its ID will be different, as IDs are never reused.
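
For illustration, the following minimal sketch classifies a handle ID by these flag bits. It relies only on the constants named above and assumes they are provided by <linux/bus1.h>:

#include <linux/bus1.h>
#include <linux/types.h>
#include <stdio.h>

/* Classify a handle ID as described above: the MANAGED bit tells whether the
 * kernel assigned the ID, the REMOTE bit whether it refers to a remote node. */
static void classify_handle(__u64 id)
{
        if (!(id & BUS1_HANDLE_FLAG_MANAGED))
                printf("0x%llx: userspace-chosen ID, owner handle\n",
                       (unsigned long long)id);
        else if (id & BUS1_HANDLE_FLAG_REMOTE)
                printf("0x%llx: kernel-assigned ID, handle to a remote node\n",
                       (unsigned long long)id);
        else
                printf("0x%llx: kernel-assigned ID, owner handle\n",
                       (unsigned long long)id);
}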

New nodes are allocated on-demand by passing the desired ID to the kernel in any ioctl that accepts a handle ID. When allocating a new node, the node owner implicitly also gets a handle to that node. As long as the node is valid, the kernel will pin a single user-reference to the owner's handle. This guarantees that a node owner always retains access to their node, until they explicitly destroy it (which will make it possible for userspace to release the handle like any other). Once all the handles to a local node have been released, no more messages destined for the node will be received. Otherwise, a handle to a local node behaves just like any other handle, that is, user-references are acquired and released according to its use. However, whenever the overall sum of all user-references on all handles to a node drops to one (which implies that only the pinned reference of the owner is left), a release-notification is queued on the node owner. If the counter is incremented again, any such notification is dropped, if not already dequeued.

Message Transactions

A message transaction atomically transfers a message to any number of destinations. Unless requested otherwise, the message transaction fully succeeds or fully fails.

To receive message payloads, each peer has an associated shmem-backed pool which may be mapped read-only by the receiving peer. The kernel copies the message payload directly from the sending peer into each receiver's pool, without an intermediary kernel buffer. The pool is divided into slices, each holding one message. When a message is received, its offset into the pool, in bytes, is returned to userspace, and userspace has to explicitly release the slice once it has finished with it.

The kernel amends all data messages with the uid, gid, pid, tid, and optionally the security context of the sending peer. The information is collected from the sending peer when the message is sent and translated into the namespaces of the receiving peer's file-descriptor.

Seed Message

Every peer may pin a special seed message. Only the peer itself may set and retrieve the seed, and at most one seed message may be pinned at any given time. The seed typically describes the peer itself and pins any nodes and handles necessary to bootstrap the peer.

Resource quotas

Each user has a fixed amount of available resources. The limits are static, but may be overridden by module parameters. Limits are placed on the amount of memory a user's pools may consume, the number of handles a user may hold, the number of inflight messages that may be destined for a user, and the number of file descriptors that may be inflight to a user. All inflight resources are accounted on the receiving peer.

As resources are accounted on the receiver, a quota mechanism is in place to avoid intentional or unintentional resource exhaustion by a malicious or broken sending user. At the time of a message transaction, the sending user may consume in total (including what is consumed by previous transactions) half of the total resources of the receiving user that have not been consumed by another user. For example, if the receiving user may hold 1024 slices and other users have already consumed 256 of them, the sending user may consume at most 384 of the remaining 768. When a message is dequeued, its resource consumption is deaccounted from the sending user's quota.

If a receiving peer does not dequeue any of its incoming messages, it would be possible for a user's quota to be fully consumed by one peer, making it impossible to communicate with other functioning peers owned by the same user. A second quota is therefore enforced per peer: at the time of a message transaction, the receiving peer may consume in total (including what is consumed by previous transactions) half of the total resources available to the sending user that have not been consumed by another peer.

Global Ordering

Despite there being no global synchronization, all events on the bus, such as sending or receiving of messages, release of handles or destruction of nodes, behave as if they were globally ordered. That is, for any two events it is always possible to consider one to have happened before the other in such a way that it is consistent with all the effects observed on the bus.

For instance, if two events occur on one peer (say, the sending of a message and the destruction of a node), and they are observed on another peer (by receiving the message and receiving a destruction notification for the node), we are guaranteed that the order in which the events occurred and the order in which they were observed is the same.

Consider a further example involving three peers: if a message is sent from one peer to two others, and after receiving the message the first recipient sends a further message to the second recipient, it is guaranteed that the original message is received before the subsequent one.

This principle of causality is also respected in the presence of side-channel communication. That is, if one event may have triggered another, even on different, disconnected peers, we are guaranteed that the events are ordered accordingly. To be precise, if one event (such as receiving a message) completed before another (such as sending a message) was started, then they are ordered accordingly.

Even where there can be no causal relationship, a global order is still guaranteed. If two events happen concurrently, there can never be any inconsistency in which is considered to have occurred before the other. By way of example, consider two peers each sending one message to the same two destination peers: both recipients are guaranteed to receive the two messages in the same order, even though that order may be arbitrary.

Operating on a bus1 file descriptor

The bus1 peer file descriptor supports the following operations:

open(2)

A call to open(2) on the bus1 character device (usually /dev/bus1) creates a new peer context identified by the returned file descriptor.
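
As a minimal sketch, a peer context can be created as follows, assuming the device node is available at the usual /dev/bus1 path and accepts the open flags shown:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Each open() of the character device creates a new, independent
         * peer context; the returned file descriptor identifies it. */
        int fd = open("/dev/bus1", O_RDWR | O_CLOEXEC);
        if (fd < 0) {
                perror("open /dev/bus1");
                return 1;
        }

        /* ... use the peer via ioctl(2), mmap(2), poll(2) ... */

        close(fd);
        return 0;
}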

poll(2), select(2) (and similar)

The file descriptor supports poll(2) (and analogously epoll(7)) and select(2), as follows:

  • The file descriptor is readable (the readfds argument of select(2); the POLLIN flag of poll(2)) if one or more messages are ready to be dequeued.

  • The file descriptor is writable (the writefds argument of select(2); the POLLOUT flag of poll(2)) if the peer has not been shut down yet (i.e., the peer can be used to send messages).

  • The file descriptor signals a hang-up (overloaded on the readfds argument of select(2); the POLLHUP flag of poll(2)) if the peer has been shut down.

The bus1 peer file descriptor also supports the other file descriptor multiplexing APIs: pselect(2), and ppoll(2).
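
A minimal sketch of blocking until a message can be dequeued, using only the readiness semantics listed above:

#include <poll.h>
#include <stdio.h>

/* Block until the peer referred to by 'fd' has a message queued (POLLIN)
 * or has been shut down (POLLHUP). Returns 0 if a message is ready. */
static int wait_for_message(int fd)
{
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        if (poll(&pfd, 1, -1) < 0) {
                perror("poll");
                return -1;
        }
        if (pfd.revents & POLLHUP) {
                fprintf(stderr, "peer has been shut down\n");
                return -1;
        }
        return (pfd.revents & POLLIN) ? 0 : -1;
}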

mmap(2)

A call to mmap(2) installs a memory mapping of the peer's message pool into the caller's address space. No writable mappings are allowed. Furthermore, the pool has no fixed size, but grows dynamically with the demands of the peer.
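
A minimal sketch of mapping the pool; the mapping size is chosen by the caller (the pool itself grows dynamically), and a shared read-only mapping is assumed:

#include <sys/mman.h>
#include <stdio.h>

/* Map 'size' bytes of the peer's message pool read-only. Writable mappings
 * of the pool are rejected by the kernel. */
static void *map_pool(int fd, size_t size)
{
        void *pool = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);

        if (pool == MAP_FAILED) {
                perror("mmap");
                return NULL;
        }
        return pool;
}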

ioctl(2)

The following bus1-specific commands are supported:

BUS1_CMD_PEER_DISCONNECT

This command disconnects a peer and takes no argument. All slices, handles, nodes and queued messages are released and destroyed, and all future operations on the peer will fail with -ESHUTDOWN.

BUS1_CMD_PEER_QUERY

This command queries the state of a peer context. It takes the following structure as argument:

struct bus1_cmd_peer_reset {
        __u64 flags;
        __u64 peer_flags;
        __u64 max_slices;
        __u64 max_handles;
        __u64 max_inflight_bytes;
        __u64 max_inflight_fds;
};

flags must always be set to 0. The state as set via BUS1_CMD_PEER_RESET, or the default state if it was never reset, is returned.
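
A minimal sketch of querying the peer state, assuming the command is issued via ioctl(2) with a pointer to the structure:

#include <linux/bus1.h>
#include <sys/ioctl.h>
#include <stdio.h>

/* Query the current peer state: flags must be 0, all other fields are
 * filled in by the kernel on return. */
static int query_peer(int fd)
{
        struct bus1_cmd_peer_reset query = { .flags = 0 };

        if (ioctl(fd, BUS1_CMD_PEER_QUERY, &query) < 0) {
                perror("BUS1_CMD_PEER_QUERY");
                return -1;
        }

        printf("max_slices=%llu max_handles=%llu\n",
               (unsigned long long)query.max_slices,
               (unsigned long long)query.max_handles);
        return 0;
}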

BUS1_CMD_PEER_RESET

This command resets a peer context. It takes the following structure as argument:

struct bus1_cmd_peer_reset {
        __u64 flags;
        __u64 peer_flags;
        __u64 max_slices;
        __u64 max_handles;
        __u64 max_inflight_bytes;
        __u64 max_inflight_fds;
};

If peer_flags has BUS1_PEER_FLAG_WANT_SECCTX set, the security context of the sending task is attached to each message received by this peer. max_slices, max_handles, max_inflight_bytes, and max_inflight_fds are the resource limits for this peer. Note that these are simply maximum values; resource usage is also limited per user.

If flags has BUS1_CMD_PEER_RESET_FLAG_FLUSH_SEED set, the seed message is dropped, and if BUS1_CMD_PEER_RESET_FLAG_FLUSH is set, all slices and handles are released, all messages are dropped from the queue and all nodes that are not pinned by the seed message are destroyed.
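
A minimal sketch of a reset that enables BUS1_PEER_FLAG_WANT_SECCTX; the limit values below are arbitrary illustrations, not recommendations:

#include <linux/bus1.h>
#include <sys/ioctl.h>
#include <stdio.h>

/* Reset the peer: request security contexts on received messages and set
 * per-peer resource limits. No flush flags are set. */
static int reset_peer(int fd)
{
        struct bus1_cmd_peer_reset reset = {
                .flags              = 0,
                .peer_flags         = BUS1_PEER_FLAG_WANT_SECCTX,
                .max_slices         = 1024,
                .max_handles        = 1024,
                .max_inflight_bytes = 1024 * 1024,
                .max_inflight_fds   = 64,
        };

        if (ioctl(fd, BUS1_CMD_PEER_RESET, &reset) < 0) {
                perror("BUS1_CMD_PEER_RESET");
                return -1;
        }
        return 0;
}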

BUS1_CMD_HANDLE_TRANSFER

This command transfers a handle from one peer context to another. It takes the following structure as argument:

struct bus1_cmd_handle_transfer {
        __u64 flags;
        __u64 src_handle;
        __u64 dst_fd;
        __u64 dst_handle;
};

flags must always be set to 0. src_handle is the ID of the handle being transferred in the source context, and dst_fd is the file descriptor representing the destination peer context. dst_handle must be BUS1_HANDLE_INVALID and is set to the new handle ID in the destination context on return.

If dst_fd is set to -1 the source context is also used as the destination.
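
A minimal sketch of transferring a handle between two peer contexts, illustrating the fields described above:

#include <linux/bus1.h>
#include <linux/types.h>
#include <sys/ioctl.h>
#include <stdio.h>

/* Transfer the handle 'src_handle' held by the peer 'src_fd' into the peer
 * context behind 'dst_fd'; returns the new handle ID in the destination,
 * or BUS1_HANDLE_INVALID on failure. */
static __u64 transfer_handle(int src_fd, __u64 src_handle, int dst_fd)
{
        struct bus1_cmd_handle_transfer xfer = {
                .flags      = 0,
                .src_handle = src_handle,
                .dst_fd     = dst_fd,
                .dst_handle = BUS1_HANDLE_INVALID,
        };

        if (ioctl(src_fd, BUS1_CMD_HANDLE_TRANSFER, &xfer) < 0) {
                perror("BUS1_CMD_HANDLE_TRANSFER");
                return BUS1_HANDLE_INVALID;
        }
        return xfer.dst_handle;
}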

BUS1_CMD_HANDLE_RELEASE

This command releases one user reference to a handle. It takes a handle ID as argument.
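
A minimal sketch, assuming the handle ID is passed to ioctl(2) by pointer, as with the structured commands:

#include <linux/bus1.h>
#include <linux/types.h>
#include <sys/ioctl.h>
#include <stdio.h>

/* Drop one user-reference to the handle 'id'; once the last reference is
 * released, the handle is destroyed and its ID becomes invalid. */
static int release_handle(int fd, __u64 id)
{
        if (ioctl(fd, BUS1_CMD_HANDLE_RELEASE, &id) < 0) {
                perror("BUS1_CMD_HANDLE_RELEASE");
                return -1;
        }
        return 0;
}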

BUS1_CMD_NODE_DESTROY

This command destroys a set of nodes. It takes the following structure as argument:

struct bus1_cmd_node_destroy {
        __u64 flags;
        __u64 ptr_nodes;
        __u64 n_nodes;
};

flags must always be set to 0, ptr_nodes must be a pointer to an array of handle IDs of owner handles of local nodes, and n_nodes must be the size of the array.
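
A minimal sketch that destroys a single node; the ptr_nodes field is assumed to carry the user-space address of the ID array cast to 64 bits:

#include <linux/bus1.h>
#include <linux/types.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stdio.h>

/* Destroy a single local node, identified by its owner handle 'node_id'.
 * All peers holding handles to it receive a destruction notification. */
static int destroy_node(int fd, __u64 node_id)
{
        struct bus1_cmd_node_destroy destroy = {
                .flags     = 0,
                .ptr_nodes = (uintptr_t)&node_id,
                .n_nodes   = 1,
        };

        if (ioctl(fd, BUS1_CMD_NODE_DESTROY, &destroy) < 0) {
                perror("BUS1_CMD_NODE_DESTROY");
                return -1;
        }
        return 0;
}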

BUS1_CMD_SLICE_RELEASE

This command releases one slice from the local pool. It takes a pool offset to the start of the slice to be released.

BUS1_CMD_SEND

This command sends a message. It takes the following structure as argument:

struct bus1_cmd_send {
        __u64 flags;
        __u64 ptr_destinations;
        __u64 ptr_errors;
        __u64 n_destinations;
        __u64 ptr_vecs;
        __u64 n_vecs;
        __u64 ptr_handles;
        __u64 n_handles;
        __u64 ptr_fds;
        __u64 n_fds;
};

flags may be set to at most one of BUS1_SEND_FLAG_CONTINUE and BUS1_SEND_FLAG_SEED. If BUS1_SEND_FLAG_CONTINUE is set, messages that cannot be delivered due to errors on a remote peer do not make the whole transaction fail, but merely set the corresponding error code in the error array. If BUS1_SEND_FLAG_SEED is set, the message replaces the seed message on the local peer. In this case, n_destinations must be 0.

ptr_destinations is a pointer to an array of handle IDs, ptr_errors is a pointer to an array of corresponding errno codes, and n_destinations is the length of the arrays. The message being sent is delivered to the peer context owning the nodes pointed to by each of the handles in the array.

ptr_vecs is a pointer to an array of iovecs and n_vecs is the length of the array. The iovecs represent the payload of the message which is delivered to each destination.

ptr_handles is a pointer to an array of handle IDs and n_handles is the length of the array. Each of the handles in this array is installed in each destination peer context at receive time. If the underlying node has been destroyed at the time the message is delivered (the message would be ordered after the node's destruction notification) then BUS1_HANDLE_INVALID will be delivered instead.

ptr_fds is a pointer to an integer array of file descriptors and n_fds is the length of the array. Each of the file descriptors in this array may be installed in the destination peer context at receive time (see below).
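
A minimal sketch of sending a small payload to a single destination handle; the ptr_* fields are assumed to carry user-space addresses cast to 64 bits, and no handles or file descriptors are attached:

#include <linux/bus1.h>
#include <linux/types.h>
#include <sys/ioctl.h>
#include <sys/uio.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Send a NUL-terminated string as the payload of a message destined for
 * the node behind 'destination'. */
static int send_payload(int fd, __u64 destination, const char *text)
{
        struct iovec vec = {
                .iov_base = (void *)text,
                .iov_len  = strlen(text) + 1,
        };
        struct bus1_cmd_send send = {
                .flags            = 0,
                .ptr_destinations = (uintptr_t)&destination,
                .n_destinations   = 1,
                .ptr_vecs         = (uintptr_t)&vec,
                .n_vecs           = 1,
        };

        if (ioctl(fd, BUS1_CMD_SEND, &send) < 0) {
                perror("BUS1_CMD_SEND");
                return -1;
        }
        return 0;
}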

BUS1_CMD_RECV

This command receives a message. It takes the following structure as argument:

struct bus1_cmd_recv {
        __u64 flags;
        __u64 max_offset;
        struct {
                __u64 type;
                __u64 flags;
                __u64 destination;
                __u32 uid;
                __u32 gid;
                __u32 pid;
                __u32 tid;
                __u64 offset;
                __u64 n_bytes;
                __u64 n_handles;
                __u64 n_fds;
                __u64 n_secctx;
        } msg;
};

If BUS1_RECV_FLAG_PEEK is set in flags, the received message is not dropped from the queue. If BUS1_RECV_FLAG_SEED is set, the peer's seed is received rather than a message from the queue. If BUS1_RECV_FLAG_INSTALL_FDS is set, the file descriptors attached to the received message are installed in the receiving process. Care must be taken when using this flag from more than one process on the same message, as file descriptor numbers are per-process and not per-peer.

max_offset indicates the maximum offset into the pool the receiving peer is able to read. If a message slice would exceed this offset, the call fails with -ERANGE.

msg.type indicates the type of message. BUS1_MSG_NONE is never returned. BUS1_MSG_DATA indicates a regular message sent from another peer, possibly containing a payload, as well as attached handles and file descriptors. BUS1_MSG_NODE_DESTROY indicates that the node referenced by the handle in msg.destination was destroyed by its owner. BUS1_MSG_NODE_RELEASE indicates that all the references to handles referencing the node in msg.destination have been released.

msg.flags indicates additional flags of the message. BUS1_MSG_FLAG_HAS_SECCTX indicates that a security context was attached to the message (to distinguish an empty n_secctx from an invalid one). BUS1_MSG_FLAG_CONTINUE indicates that there are more messages queued which belong to the same message transaction.

msg.destination is the ID of the destination node or handle of the message.

msg.uid, msg.gid, msg.pid, and msg.tid are the user, group, process and thread ID of the process that created the sending peer context.

msg.offset is the offset, in bytes, into the pool of the payload and msg.n_bytes is its length.

msg.n_handles is the number of handles attached to the message. The handle IDs are stored in the pool following the payload (and possibly padding to make the array 8-byte aligned).

msg.n_fds is the number of file descriptors attached to the message, or 0 if BUS1_RECV_FLAG_INSTALL_FDS was not set. The file descriptor numbers are stored in the pool following the handle array (and possibly padding to make the array 8-byte aligned).

msg.n_secctx is the number of bytes attached to the message, which contain the security context of the sender. The security context is stored in the pool following the payload (and possibly padding to make it 8-byte aligned).
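
A minimal sketch of dequeuing one message and, for data messages, releasing its pool slice afterwards; it assumes BUS1_CMD_SLICE_RELEASE takes a pointer to the offset, analogous to the other commands:

#include <linux/bus1.h>
#include <linux/types.h>
#include <sys/ioctl.h>
#include <stdio.h>

/* Dequeue one message and report what was received; 'pool_size' is the
 * length of the caller's read-only pool mapping. */
static int receive_one(int fd, __u64 pool_size)
{
        struct bus1_cmd_recv recv = {
                .flags      = 0,
                .max_offset = pool_size,
        };

        if (ioctl(fd, BUS1_CMD_RECV, &recv) < 0) {
                perror("BUS1_CMD_RECV");
                return -1;
        }

        if (recv.msg.type == BUS1_MSG_DATA) {
                printf("data message: %llu bytes at pool offset %llu from uid %u\n",
                       (unsigned long long)recv.msg.n_bytes,
                       (unsigned long long)recv.msg.offset,
                       recv.msg.uid);

                /* Release the slice once the payload has been consumed. */
                if (ioctl(fd, BUS1_CMD_SLICE_RELEASE, &recv.msg.offset) < 0) {
                        perror("BUS1_CMD_SLICE_RELEASE");
                        return -1;
                }
        } else if (recv.msg.type == BUS1_MSG_NODE_DESTROY) {
                printf("node 0x%llx was destroyed\n",
                       (unsigned long long)recv.msg.destination);
        }
        return 0;
}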

close(2)

A call to close(2) releases the passed file descriptor. When all file descriptors associated with the same peer context have been closed, the peer is shut down. This destroys all nodes of that peer, releases all handles, flushes its queue and pool, and deallocates all related resources. Messages that have been sent by the peer and are still queued on destination queues are unaffected by this.

Return value

All bus1 operations return zero on success. On failure, a negative error code is returned.

Errors

These are all standard errors generated by the bus layer. See the description of each ioctl for details on their occurrence.

EAGAIN

No messages ready to be read.

EBADF

Invalid file descriptor.

EDQUOT

Resource quota exceeded.

EFAULT

Cannot read or write ioctl parameters.

EHOSTUNREACH

The destination object is no longer available.

EINVAL

Invalid ioctl parameters.

EMSGSIZE

The message to be sent exceeds its allowed resource limits.

ENOMEM

Out of kernel memory.

ENOTTY

Unknown ioctl.

ENXIO

Unknown object.

EOPNOTSUPP

Operation not supported.

EPERM

Permission denied.

ERANGE

The message to be received would exceed the maximal offset.

ESHUTDOWN

Local peer was already disconnected.