Wednesday, 16 February 2011

Introducing multicast Unix sockets

I have been working on implementing multicast Unix sockets in the Linux kernel. This allows a process to send a message on a socket to a multicast group with a single sendmsg() system call, and the message will be received by all sockets that are members of the multicast group.

This work has been sponsored by my employer Collabora.

Several projects could benefit from this new IPC system:

  1. D-Bus

    D-Bus is a message bus system. Applications traditionally exchange D-Bus messages through a central process, dbus-daemon. When dbus-daemon receives a message, it determines the recipients and delivers the message to each recipient’s socket. This architecture causes dbus-daemon to wake up for every single message, which means expensive context switches, memory copies and processing. If the D-Bus peers were part of a multicast group, the kernel could deliver D-Bus messages directly to the recipients. It could use socket filters to deliver them only to the correct recipients, according to the D-Bus match rules.
  2. The T IPC system

    In the same manner as D-Bus, multicast Unix sockets and socket filters could be used by the T IPC system.
  3. Udev

    Udev uses Linux netlink sockets to send multicast messages from udevd to libudev listeners. Netlink sockets are usually used for communication between the kernel and userspace, but can also be used for userspace-only communication. Netlink has limitations, though: there are only 32 multicast groups system-wide, and only root can send multicast messages.

    Update: netlink has not had the 32-group limit since 2005.

My implementation aims to be a general purpose multicast IPC system, without the limitations of netlink multicast. The kernel patches and a test suite are available in git:

git clone git://git.collabora.co.uk/git/user/alban/linux-2.6.35.y/.git unix-multicast18
git clone git://git.collabora.co.uk/git/user/alban/check-unix-multicast

Multicast is implemented on datagram and seqpacket sockets, but not on stream sockets. It would not make sense on stream sockets because the messages are not delimited and there would be no guarantee that several senders’ messages would not be mixed. The semantics are different between datagram and seqpacket sockets.
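The difference in message boundaries can be observed on ordinary (non-multicast) Unix sockets. The following sketch sends two messages and records what each recv() call returns. The helper name recv_pattern is made up for this example.

```python
import socket


def recv_pattern(sock_type, messages):
    """Send each message with its own send(), then list what each recv() returns."""
    tx, rx = socket.socketpair(socket.AF_UNIX, sock_type)
    try:
        for message in messages:
            tx.send(message)
        rx.setblocking(False)  # stop reading once the queue is empty
        chunks = []
        while True:
            try:
                data = rx.recv(4096)
            except BlockingIOError:
                break
            chunks.append(data)
        return chunks
    finally:
        tx.close()
        rx.close()


datagram_chunks = recv_pattern(socket.SOCK_DGRAM, [b"ab", b"cd"])
seqpacket_chunks = recv_pattern(socket.SOCK_SEQPACKET, [b"ab", b"cd"])
stream_chunks = recv_pattern(socket.SOCK_STREAM, [b"ab", b"cd"])
```

On datagram and seqpacket sockets every recv() returns exactly one message. On a stream socket only the byte sequence is preserved: the two messages may come back glued together in a single recv(), so a receiver could not tell where one sender's message ends and another's begins.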

Multicast on datagram sockets

[Figure: Communication on datagram multicast Unix sockets]

The setsockopt() call which creates the multicast group binds the socket to the multicast address. Messages sent to the group are received by all members, including the sender if it joined with the “loopback” feature. A recipient may use socket filtering to avoid receiving some messages; this does not affect delivery of those messages to the other peers in the group.

The daemon controlling the multicast group can receive the messages from the peers if the feature “send-to-peer” is enabled.

Multicast on seqpacket sockets

[Figure: Communication on seqpacket multicast Unix sockets]

Seqpacket sockets are connection-based and the daemon can control who is able to join the group. The daemon can receive the messages from the peers on its accepted sockets (A1, A2, A3 on the diagram above) with the “send-to-peer” feature. It is useful for D-Bus: dbus-daemon can reply to the method calls on the bus driver.

Socket filter for D-Bus

Each socket can upload a socket filter (or Berkeley Packet Filter, BPF) into the kernel. Socket filters are small programs that the kernel executes for every message sent to the socket. If the socket filter returns zero, the message is discarded and the receiving process does not need to wake up.

The socket filter could be modified and uploaded again into the kernel every time the D-Bus peer adds or removes a D-Bus match rule, or gains or loses a unique or well-known name. This way, D-Bus messages are delivered only to the right recipients instead of to every D-Bus peer.
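To make the mechanism concrete, here is a minimal sketch of a classic BPF filter attached to an ordinary (non-multicast) Unix datagram socket on a mainline Linux kernel. It keeps only messages whose first byte matches a chosen value, which is a stand-in for a real D-Bus match rule and not how D-Bus matching would actually be encoded. The helper names first_byte_filter, attach_filter and filter_demo are made up for this example, and SO_ATTACH_FILTER is hard-coded to 26, the value used on most architectures.

```python
import ctypes
import socket
import struct

SO_ATTACH_FILTER = 26  # from <asm-generic/socket.h>; assumed, differs on a few architectures

# Classic BPF opcodes, as defined in <linux/bpf_common.h>
BPF_LD_B_ABS = 0x30  # BPF_LD | BPF_B | BPF_ABS: load the byte at offset k into A
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K: jump depending on A == k
BPF_RET_K = 0x06     # BPF_RET | BPF_K: return the constant k


def first_byte_filter(wanted):
    """Build a 4-instruction program: accept if payload[0] == wanted, else drop."""
    instructions = [
        (BPF_LD_B_ABS, 0, 0, 0),     # A = first byte of the message
        (BPF_JEQ_K, 0, 1, wanted),   # equal: fall through; different: skip one
        (BPF_RET_K, 0, 0, 0xFFFF),   # non-zero: keep up to 0xFFFF bytes
        (BPF_RET_K, 0, 0, 0),        # zero: discard the message
    ]
    # struct sock_filter { __u16 code; __u8 jt; __u8 jf; __u32 k; }
    return b"".join(struct.pack("HBBI", *insn) for insn in instructions)


def attach_filter(sock, program):
    """Attach a packed classic BPF program to a socket."""
    buf = ctypes.create_string_buffer(program, len(program))
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; }
    fprog = struct.pack("HP", len(program) // 8, ctypes.addressof(buf))
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER, fprog)


def filter_demo():
    """Send three datagrams and return the ones that pass the receiver's filter."""
    tx, rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        attach_filter(rx, first_byte_filter(ord("A")))
        rx.setblocking(False)
        for message in (b"Apple", b"Banana", b"Avocado"):
            tx.send(message)
        received = []
        while True:
            try:
                received.append(rx.recv(4096))
            except BlockingIOError:
                break
        return received
    finally:
        tx.close()
        rx.close()
```

The sender gets no error for the discarded message: the filter belongs to the receiving socket, and the kernel drops the datagram on that side only.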

Socket filters may be applied on SOCK_DGRAM and SOCK_SEQPACKET sockets only. They make little sense on SOCK_STREAM because there are no message boundaries. Since each D-Bus message must then fit in a single datagram, the size of D-Bus messages is limited to about 110kB. The limit can be raised with setsockopt(SO_SNDBUF) up to a maximum of 219kB (or more by tuning /proc/sys/net/core/wmem_max, which caps SO_SNDBUF).
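The relation between the send buffer and the largest message can be checked on ordinary Unix datagram sockets. The sketch below is only an illustration on a mainline Linux kernel, with a made-up helper name (fits_in_one_datagram); the exact thresholds depend on the kernel version and its sysctl settings, so the numbers used here are arbitrary.

```python
import errno
import socket


def fits_in_one_datagram(size, sndbuf=None):
    """Return True if a datagram of `size` bytes can be sent, False on EMSGSIZE.

    If `sndbuf` is given, it is requested with SO_SNDBUF first. Linux doubles
    the requested value internally and caps it at net.core.wmem_max.
    """
    tx, rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        if sndbuf is not None:
            tx.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
        tx.setblocking(False)
        try:
            tx.send(b"x" * size)
            return True
        except OSError as exc:
            if exc.errno == errno.EMSGSIZE:
                return False  # message larger than the send buffer allows
            raise
    finally:
        tx.close()
        rx.close()
```

For instance, with a requested send buffer of 4096 bytes, a 100-byte message goes through while a 64 KiB message is rejected with EMSGSIZE.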

Atomic delivery

Messages are delivered atomically to all recipients. This holds even when the sender is interrupted by a signal, is killed, runs out of memory or is blocked by flow control. I don’t want a message to be delivered to only some of the recipients. When the sendmsg() system call returns an error (such as EAGAIN), it is guaranteed that nobody received the message.
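The all-or-nothing guarantee can be illustrated with a small userspace model. This is not the kernel code from my patches, only a sketch of the semantics: the Recipient class and deliver_atomically function are hypothetical names, and the queues are plain in-memory deques with an arbitrary capacity.

```python
import errno
from collections import deque


class Recipient:
    """A receiving queue that holds at most `capacity` messages."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def has_room(self):
        return len(self.queue) < self.capacity


def deliver_atomically(recipients, message):
    """Queue `message` on every recipient, or on none of them.

    All queues are checked before any of them is modified, so a failure
    never leaves the message delivered to only part of the group.
    """
    if not all(r.has_room() for r in recipients):
        raise BlockingIOError(errno.EAGAIN, "a recipient queue is full")
    for r in recipients:
        r.queue.append(message)


def atomic_demo():
    small, large = Recipient(capacity=1), Recipient(capacity=2)
    deliver_atomically([small, large], "m1")
    try:
        deliver_atomically([small, large], "m2")  # `small` is already full
        failed = False
    except BlockingIOError:
        failed = True
    return failed, list(small.queue), list(large.queue)
```

In the demo the second delivery fails because one queue is full, and the recipient that still had room does not receive the message either.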

Ordering

When several senders are sending messages concurrently, the recipients need to receive messages in the same order. Here is a scenario I want to avoid:

[Figure: Message ordering]

A and B each send one message concurrently to recipients C and D. Without proper locking, C and D could receive the messages from A and B in different orders. My patches take care of this, and the test suite checks that messages are received in the same order by all recipients.
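As a userspace analogy (again, not the kernel implementation), the following sketch has two sender threads delivering to two recipient queues. Holding one lock for the whole fan-out of a message is what keeps both queues in the same order. All names here are invented for the example.

```python
import threading


def ordered_fan_out(messages_per_sender=1000):
    """Two senders deliver concurrently to queues C and D under one lock."""
    group_lock = threading.Lock()
    queue_c, queue_d = [], []

    def sender(name):
        for i in range(messages_per_sender):
            message = (name, i)
            # The lock covers delivery to *all* recipients, so no other
            # sender can slip a message in between the two appends.
            with group_lock:
                queue_c.append(message)
                queue_d.append(message)

    threads = [threading.Thread(target=sender, args=(name,)) for name in ("A", "B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return queue_c, queue_d
```

If each queue were locked separately instead, sender A could append to C first while sender B appends to D first, which is exactly the scenario described above.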

Flow control

When a reader is too slow to consume messages from its receiving queue, the queue can fill up. There are several ways to manage this situation:

  • Have infinite sized receiving queues. This is not really an option, is it?
  • Drop messages, either silently or with a notification to the recipient (“you have lost some messages”). These are the correct semantics for udev. Netlink sockets notify recipients about lost messages with ENOBUFS in recvmsg(2).
  • Block the sender. The sender will block, or send() will return EAGAIN with non-blocking sockets. poll() or select() will tell when a message can be delivered again.
  • Disconnect the slow recipient.
  • Disconnect the spammy sender.

The correct solution for D-Bus is not trivial. This is not a new problem: even without multicast Unix sockets, dbus-daemon already has the same problem. See the discussion in bug #33606. The current implementation of multicast Unix sockets either drops messages silently or blocks the sender, depending on the setting of the multicast group.
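The “block the sender” behaviour already exists on ordinary Unix datagram sockets and can be observed without the multicast patches. The sketch below uses a non-blocking sender and a reader that never reads, then drains the queue and asks select() whether sending is possible again. The function names are made up for this example, and the number of messages that fit in the queue depends on kernel settings.

```python
import select
import socket


def fill_until_eagain(tx, payload=b"x", limit=1_000_000):
    """Send until the kernel refuses with EAGAIN; return how many were queued."""
    sent = 0
    while sent < limit:
        try:
            tx.send(payload)
        except BlockingIOError:  # EAGAIN: the receiving side is full
            return sent
        sent += 1
    raise RuntimeError("the queue never filled up")


def flow_control_demo():
    tx, rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        tx.setblocking(False)
        queued = fill_until_eagain(tx)
        # Zero timeout: just ask whether a send would succeed right now.
        writable_when_full = bool(select.select([], [tx], [], 0)[1])
        for _ in range(queued):
            rx.recv(16)  # the slow reader finally catches up
        writable_after_drain = bool(select.select([], [tx], [], 1.0)[1])
        return queued, writable_when_full, writable_after_drain
    finally:
        tx.close()
        rx.close()
```

With a multicast group the question becomes harder, because one slow recipient would then block the sender for every other member of the group as well.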
