Friday, 25 February 2011

Importance of the scheduling priority in D-Bus

Context switches are expensive. When do they happen in D-Bus communication?

application1 sends a D-Bus message to application2 through dbus-daemon

application1 sends a D-Bus message to application2 through dbus-daemon

The kernel may perform a context switch:

  • During the sendmsg() system call, when application1 sends a D-Bus message to dbus-daemon
  • During the sendmsg() system call, when dbus-daemon sends a D-Bus message to application2
  • At the end of a time slice

I instrumented libdbus in order to trace events such as message parsing, dispatching messages in dbus-daemon,  sendmsg(), recvmsg(). I took inspiration from Federico’s performance scripts but without using strace because straced processes have a big overhead compared to the duration of events I want to measure. My instrumentation doesn’t save traced events in a file to avoid overhead as much as possible, but just keep them in memory and save them after the experiment is finished. This is triggered:

  1. when a process using libdbus exits. The save function is registered by atexit().
  2. when a process using libdbus receives a SIGUSR2 signal. The save function is registered by sigaction().

I collect logs generated by dbus-daemon and all D-Bus peers and feed them to plot-timeline.py to generate timelines. I estimate the overhead of logging between 3 microseconds and 5 microseconds by logging consecutively 2 events. All the experiments are run in a virtual machine with one processor.

Experiment #1: a D-Bus signal delivered to 10 recipients

I sent one D-Bus signal with dbus-send and received by 10 dbus-monitor. You can reuse this script.

In this experiment, dbus-send was context switched off during the send() system call and that system call returned after dbus-daemon received the message, sent it to the 10 recipients, and all of them received it.

We can see dbus-daemon iterating over every recipients and sending the message on each recipient’s socket.

There is 12 context switches in this experiment:

  • 1 context switch from the sender to dbus-daemon
  • 10 context switches to each recipient
  • 1 final context switch to the sender

Experiment #2: 10 D-Bus signal delivered to 10 recipients

It is the same experiment as before but the sender emits 10 D-Bus signals (as fast as it could) instead of a single one. dbus-send cannot do that, so I used dbus-ping-pong.

The first three D-Bus signal are delivered in the same way; this timeline looks like the first timeline. There is also 12 context switches per D-Bus signal emitted.

But something different happen for the 7 next D-Bus signals: dbus-daemon is context switched off during the send() on every recipient.

The scheduler tries to maintain fairness along all the processes by giving them the same amount of CPU time. But dbus-daemon has about 10 times more work to do than every single recipient because it sends 10 messages and the recipients only receive 1 message. So when dbus-daemon writes on a recipient’s socket, there is two processes in the running state and the scheduler has to take a decision between a CPU hungry process (dbus-daemon) and a recipient. The recipient is chosen and it causes a context switches.

The context switches per D-Bus signal emitted are:

  • 1 context switch from the sender to dbus-daemon
  • 1 context switch from dbus-daemon to the first recipient
  • 1 context switch from the first recipient back to dbus-daemon
  • again 2 times 9 context switches between dbus-daemon and the remaining recipients
  • 1 context switch from dbus-daemon to the sender

That’s 22 context switches in total instead of 12 for the same amount of work and the same amount of messages delivered.

The experiment was run in 0.036121 seconds.

Experiment #3: 10 D-Bus signal delivered to 10 recipients with a high priority dbus-daemon

With the command renice, I change the priority of dbus-daemon to -15.

In this case, the kernel chose the first scheduling strategy with 12 context switches per D-Bus signal emitted. The experiment was run in 0.019320 seconds. It is 1.8 times faster.

Most D-Bus messages don’t have so many recipients. But still, it is interesting to see how much the priority of dbus-daemon influences the performance.

On Debian, dbus-daemons (session and system bus) run with the default priority (nice level: 0). On the N900, they run with a higher priority (nice level: -5).

Experiment #4: details of dispatching

Let’s zoom on dbus-daemon’s dispatching for two different sizes of D-Bus signals:277 bytes and 17kB.

277 bytes D-Bus signal:

  • Total: 524 microseconds
  • Receiving: 36 microseconds (7%)
  • Dispatch: 164 microseconds (31%)
    • 79 microseconds in match rules
  • Sending: 324 microseconds (62%)

17kB D-Bus signal:

  • Total: 1026 microseconds
  • Receiving: 413 microseconds (40%)
  • Dispatch: 230 microseconds (22%)
    • 79 microseconds in match rules
  • Sending: 383 microseconds (37%)

Thursday, 24 February 2011

D-Bus traffic pattern with Telepathy

Telepathy is a big user of D-Bus, both on the GNOME desktop with Empathy and on the N900. When I have a lot of Telepathy accounts and contacts, the start-up time could be about 10 seconds. How much is the D-Bus communication responsible for the start-up time? When I tried the D-Bus in the kernel prototype on the N900, I win 1.3 seconds. It shows it is not negligible and there is room for improvements. But the slowness may not be due only to D-Bus itself but the way it is used. It is interesting to look at the D-Bus traffic pattern. This topic was already largely covered by Will’s  talk on Profiling and Optimizing D-Bus APIs but I want to give a preview on future D-Bus analysis tools. The existing D-Bus tools dbus-monitor, D-Feet and Bustle are still very useful of course.

A glance at D-Bus messages

The easiest to begin with is to record all the D-Bus traffic with dbus-monitor while I start Empathy, send a few messages, and quit. Let’s see if there is any surprise. dbus-monitor generated a 13MB text file containing 8313 D-Bus messages:

  • 2850 method calls
  • 761 methods returns
  • 4699 signals
  • 3 errors

The number of method calls and method returns are quite different! This is because dbus-monitor doesn’t record method returns from the bus driver (org.freedesktop.DBus, implemented by dbus-daemon). Indeed, my log file contains 2066 method calls to the bus driver:

Method calls for D-Bus match rules (AddMatch/RemoveMatch) generate a lot of traffic. The test was done with the upstream dbus-glib 0.88-2.1, without the patch on DBusGProxy (Bug #23846) to have per-member match rules. If that patch was applied, the number of D-Bus match rules could be even bigger.

If the D-Bus match rules were implemented with socket filters on multicast Unix sockets, AddMatch and RemoveMatch messages would be completely replaced by the system call setsockopt(), reducing the number of context switches from D-Bus peers to dbus-daemon. But the situation could be improved with the current dbus-daemon implementation: 628 AddMatch are match rules on NameOwnerChanged (48%), generated by dbus-glib, but there is actually only 25 unique match rules. The duplicate match rules are caused by Bug #33646 on DBusGProxy.

Performance depends on the size of message

Now, let’s have a look on the size of messages and the impact on performances. I can use dbus-ping-pong to estimate how long it takes to send a D-Bus message with a specific size. The following script starts the ping-pong messages between two D-Bus peers, with several message sizes, and I gathered the results in the next graph:

#!/bin/bash

./dbus-ping-pong server org.Echos &
PID=$!
echo "Server is PID=$PID"
sleep 1

for i in 1 3 10 31 100 316 1000 3162 10000 31622 100000 316227 1000000 3162277 10000000 ; do
  echo $i
  ./dbus-ping-pong client org.Echos 100 $i
done

kill $PID

Performance of dbus-ping-pong with various message sizes

Figure 1

I ran the test twice with different values of max_bytes_read_per_iteration, the size of the buffer used to read on the socket; the first time with 2048 bytes and the second time with 65536. When the read buffer is 32 times bigger, there is less recvmsg() system calls for big incoming D-Bus messages and it makes the test 28% faster.

Usual message sizes in Telepathy

Let’s see what are the usual sizes of messages in the context of Telepathy. I just run bustle-dbus-monitor with a patch to get the size of each message (Bug #34067).



Figure 2

Most messages are between 101 bytes and 316 bytes (in orange on the graph).

With piecewise linear interpolation (spreadsheet source) based on the results of dbus-ping-pong, I can approximate how long takes the transmission of each individual message, and then what size of messages are the bottleneck and would be worth optimizing.

It may not be a good estimation because there is more computation in the real Telepathy scenario than in the synthetic dbus-ping-pong scenario: the number of match rules in dbus-daemon is different, there is more connections, and the Telepathy processes are not only sending D-Bus messages so the CPU caches are not behaving the same way. The real time spent in D-Bus communication should be larger than what is computed here with piecewise linear interpolation. But it can still give an idea.



Figure 3

For messages with a small size (under 3000 bytes), the two histograms (figure 2 and figure 3) have the same shape because the ping-pong time is almost constant for those messages. For the larger messages, they take a bit more time than their distribution because the ping-ping time is getting bigger on those messages.

Messages between 101 bytes and 316 bytes are the ones causing the most delay in Telepathy start-up.

The Stats interface

We can look at those messages in more details with the new Stats interface in dbus-daemon. The script messages_stats.py creates a csv file from the Stats interface containing the peak size of messages, the peak number of recipients and number of each D-Bus method call and signal. With that data, I can check that no messages from Telepathy are received too broadly, and if they are, I could use the GetConnectionMatchRules patch to find the application which subscribes to a too broad match rule.

Most of the D-Bus messages generated by Telepathy have a reasonable number of recipients. Only 2 different D-Bus signals have more than 3 recipients, and it is only a few messages in total:

  • org.freedesktop.Telepathy.Connection.Interface.Aliasing.AliasesChanged (4 recipients)
  • org.freedesktop.Telepathy.Connection.Interface.SimplePresence.PresencesChanged (4 recipients)

Repartition of messages per number of recipients

Figure 4

The most sent D-Bus messages by Telepathy are:

  • org.freedesktop.Telepathy.Connection.Interface.Contacts.GetContactAttributes (17 calls)
  • org.freedesktop.Telepathy.Account.AccountPropertyChanged (11 signals)

I don’t see anything obviously bad here. Telepathy is a good D-Bus citizen!

Wednesday, 16 February 2011

Introducing multicast Unix sockets

I have been working on implementing multicast Unix sockets in the Linux kernel. This allows a process to send a message on a socket to a multicast group with one system call sendmsg() and the message will be received by all sockets member of the multicast group.

This work has been sponsored by my employer Collabora.

Several projects could benefit from this new IPC system:

  1. D-Bus

    D-Bus is a message bus system. Applications exchange D-Bus messages traditionally through a central process, dbus-daemon. When dbus-daemon receives the message, it determines the recipients and delivers the message to each recipient’s socket. This architecture causes dbus-daemon to wake up for every single message causing expensive context switches, memory copies and processing. If the D-Bus peers were part of a multicast group, the kernel could deliver D-Bus messages directly to the recipients. It could use socket filters to deliver them only to the correct recipients, according to the D-Bus match rules.
  2. The T IPC system

    In the same manner as D-Bus, multicast Unix sockets and socket filters could be used by the T IPC system.
  3. Udev

    Udev uses Linux’ netlink sockets to send multicast messages from udevd to libudev listeners. Netlink sockets are usually used for communication between the kernel and userspace, but can also be used for userspace-only communication. It has limitations though; there are only 32 multicast groups, system-wide, and only root can send multicast messages.

    Update: netlink does not have that limit anymore since 2005.

My implementation aims to be a general purpose multicast IPC system, without the limitations of netlink multicast. The kernel patches and a test suite are available in git:

git clone git://git.collabora.co.uk/git/user/alban/linux-2.6.35.y/.git unix-multicast18
git clone git://git.collabora.co.uk/git/user/alban/check-unix-multicast

Multicast is implemented on datagram and seqpacket sockets, but not on stream sockets. It would not make sense on stream sockets because the messages are not delimited and there would be no guarantee that several senders’ messages would not be mixed. The semantics are different between datagram and seqpacket sockets.

Multicast on datagram sockets

Communication on datagram multicast Unix sockets

The setsockopt() call which creates the multicast group binds the socket on the multicast address. Messages sent to the group are received by all members, including the sender, if it joined with the “loopback” feature. Socket filtering may be used by a recipient to avoid receiving messages, however this does not affect delivery of the message to other peers in the group.

The daemon controlling the multicast group can receive the messages from the peers if the feature “send-to-peer” is enabled.

Multicast on seqpacket sockets

 Communication on seqpacket multicast Unix sockets

Seqpacket sockets are connection-based and the daemon can control who is able to join the group. The daemon can receive the messages from the peers on its accepted sockets (A1, A2, A3 on the diagram above) with the “send-to-peer” feature. It is useful for D-Bus: dbus-daemon can reply to the method calls on the bus driver.

Socket filter for D-Bus

Each socket can upload a socket filter (or Berkeley Packet Filter, BPF) in the kernel. Socket filters are small programs, executed for every message sent to a socket. If the socket filter returns zero, the message is discarded and the process does not need to wake up.

The socket filter could be modified and uploaded in the kernel every time the D-Bus peer wants to add or remove a D-Bus match rule and get or lose a unique or well-known name. So D-Bus messages are not delivered to every D-Bus peer but only the right recipients.

Socket filters may be applied on SOCK_DGRAM and SOCK_SEQPACKET only. They make little sense on SOCK_STREAM because there are no message boundaries. This limits the size of D-Bus messages to about 110kB, although it can be changed with setsockopt(SO_SNDBUF) up to a maximum of 219kB (or more by tuning /proc/sys/net/core/rmem_max).

Atomic delivery

Messages are delivered atomically to all recipients. This is true even when the sender is interrupted by a signal, killed, lacks memory or is blocked because of the flow control. I don’t want a message to be partially delivered to some recipients. When the system call sendmsg() returns an error (such as EAGAIN), it is guaranteed that nobody received the message.

Ordering

When several senders are sending messages concurrently, the recipients need to receive messages in the same order. Here is a scenario I want to avoid:

Message ordering

A and B are sending one message concurrently to recipients C and D. Without proper locking, the recipients could receive the messages A and B in a different order. My patches take care of this and the test suite checks that messages are received in the same order by all recipients.

Flow control

When a reader is too slow to consume messages from its receiving queue, the receiving queue could be full. There is several ways to manage this situation:

  • Have infinite sized receiving queues. This is not really an option, is it?
  • Drop messages, either silently or with a notification to the recipient (”you have lost some messages”). This is the correct semantic for udev. Netlink sockets notifies recipients about lost messages with ENOBUFS in recvmsg(2).
  • Block the sender. The sender will block or send() will return EAGAIN with non-blocking sockets. Poll() or select() will tell when a message can be delivered again.
  • Disconnect the slow recipient
  • Disconnect the spammy sender

The correct solution for D-Bus is not trivial. This is not a new problem: even without multicast Unix sockets, dbus-daemon already has the same problem. Discussion in bug #33606. The current implementation of multicast Unix sockets either drops messages silently or blocks the sender, depending on the setting of the multicast group.

Wednesday, 15 September 2010

D-Bus in the kernel, faster!

In the last few weeks at Collabora, I worked on having D-Bus directly in the Linux kernel to improve its performances. I based my work on what Ian Molton did. For now it’s just a prototype, but some benchmarks I did show a good performance improvement.

The cost of context switches

When an application sends a message on the bus to another application, the message is first sent to dbus-daemon through an Unix socket. The kernel copies the message to the receiving queue of dbus-daemon and dbus-daemon wakes up. Then dbus-daemon adds the sender field in the header of the message, and sends it to the recipients according to the destination field and the match rules (usually one recipient but there could be more for signals or in case of eavesdropping).

D-Bus message transmission with dbus-daemon

D-Bus message transmission with dbus-daemon

So dbus-daemon wakes up on every message, it costs a context switch and a memory copy.

Don’t wake up, dbus-daemon!

The idea of D-Bus in the kernel is to deliver the message directly to the right recipients without waking up any intermediary process. We added a new kind of socket, “AF_DBUS”, that behaves in a similar way to Unix sockets. Every application using DBus would need to use the new socket type, but that just means changing libdbus and the few other libraries for D-Bus.

Introducing AF_DBUS

Introducing AF_DBUS

The kdbus kernel module reads all the messages and check for the messages “Hello“, “NameAcquired“, “NameLost” and “AddMatch” to get to know the unique names, the well-known names and the match rules. Then it is able to deliver the messages only to the right applications and shortcut dbus-daemon. If there are several recipients, the message is still memcpied only one time thanks to skb_clone().

Introducing AF_DBUS

AF_DBUS does not wake up dbus-daemon unnecessarily

The prototype still uses dbus-daemon for authentication and D-Bus activation. The bus driver org.freedesktop.DBus is still implemented in dbus-daemon.

New performances

The first benchmark is dbus-ping-pong. It is a simple tool to call a D-Bus method and wait for the reply 10000 times. I tried it both on a KVM/i386 virtual machine and on a N900/arm.

  • KVM/i386, without kdbus: 3.887s
  • KVM/i386, with kdbus: 2.085s (x1.8)
  • N900/arm, without kdbus: 28.00s
  • N900/arm, with kdbus: 9.23s (x3)

I tried Adrien Bustany’s benchmark tool designed with Tracker in mind. My test is on KVM/i386 with #define CHUNK_SIZE 8 in order to have D-Bus messages of 6905 bytes (the kdbus prototype has a limitation at the moment: messages bigger than SKB_MAX_ALLOC or 8kB are still delivered to dbus-daemon so it does not bring better performances).

  • KVM/i386, without kdbus: Fetched 300000 rows in 32874ms
  • KVM/i386, with kdbus:Fetched 300000 rows in 24368ms (26% faster - x1.35)

I also tested how long does a N900 take to connect to Jabber and show my contacts’ presence on the desktop widgets. I measured the time manually but ran enough tests to be sure it is consistent.

  • N900/arm, without kdbus:  avg 11.87s
  • N900/arm, with kdbus: avg 10.56s (x1.12)

Sources

Keep in mind this is only a proof-of-concept. It is not ready for merging and has limitations, including security ones. However,  I managed to use kdbus for both the system and session bus on a N900 and, with just a few hacks, everything worked fine.