Using DRTIO and subkernels

In larger or more spread-out systems, a single core device might not be suited to managing all the RTIO operations or channels necessary. For these situations ARTIQ supplies Distributed Real-Time IO, or DRTIO. This allows systems to be configured with some or all of their RTIO channels distributed to one or several satellite core devices, which are linked to the master core device. These remote channels are then accessible in kernels on the master device exactly like local channels.

While the components of a system, as well as the distribution of peripherals among satellites, are necessarily fixed in the system configuration, the specific topology of core and satellite links is flexible and can be changed whenever necessary. It is supplied to the core device by means of a routing table (see below). Kasli and Kasli-SoC devices use the SFP ports for DRTIO connections, whereunder links should be high-speed duplex serial lines operating 1Gbps or more.

Certain peripheral cards with onboard FPGAs of their own (e.g. Shuttler) can be configured as satellites in a DRTIO setting, allowing them to run their own subkernels and make use of DDMA. In these cases, the EEM connection to the core device is used for DRTIO communication (DRTIO-over-EEM).

Note

As with other configuration changes (e.g. adding new hardware), if you are in possession of a non-distributed ARTIQ system and you’d like to expand it into a DRTIO setup, it’s easily possible to do so, but you need to be sure that both master and satellite are (re)flashed with this in mind. As usual, if you obtained your hardware from M-Labs, you will normally be supplied with all the binaries you need, through afws_client or otherwise.

Note

Do not confuse the DRTIO master device (used to mean the central controlling core device of a distributed system) with the ARTIQ master (the central piece of software of ARTIQ’s management system, which interacts with artiq_client and the dashboard.) artiq_run can be used to run experiments on DRTIO systems just as easily as non-distributed ones, and the ARTIQ master interacts with the central core device regardless of whether it’s configured as a DRTIO master or standalone.

Using DRTIO

Configuring the routing table

By default, DRTIO assumes a routing table for a star topology (i.e. all satellites directly connected to the master), with destination 0 being the master device’s local RTIO core and destinations 1 and above corresponding to devices on the master’s respective downstream ports. To use any other topology, it is necessary to supply a corresponding routing table in the form of a binary file, written to flash storage under the key routing_table. The binary file is easily generated in the correct format using artiq_route. This example is for a chain of 3 devices:

# create an empty routing table
$ artiq_route rt.bin init

# set destination 0 to the master's local RTIO core
$ artiq_route rt.bin set 0 0

# for destination 1, first use hop 1 (the first downstream port)
# then use the local RTIO core of that second device.
$ artiq_route rt.bin set 1 1 0

# for destination 2, use hop 1 and reach the second device as
# before, then use hop 1 on that device to reach the third
# device, and finally use the local RTIO core (hop 0) of the
# third device.
$ artiq_route rt.bin set 2 1 1 0

$ artiq_route rt.bin show
  0:   0
  1:   1   0
  2:   1   1   0

$ artiq_coremgmt config write -f routing_table rt.bin

The local RTIO core of the master device is considered a destination like any other; it must be explicitly listed in the routing table to be accessible to kernels.

All routes must end with the local RTIO core of the destination device. Incorrect routing tables will cause RTIODestinationUnreachable exceptions.

As with other configuration changes, the core device should be restarted (artiq_coremgmt reboot, power cycle, etc.) for changes to take effect.

Using the core language with DRTIO

Remote channels are accessed just as local channels are (e.g., most commonly, by calling self.setattr_device() and then referencing the device by name.)

Latency

Each hop (link traversed) increases the RTIO latency of a destination by a significant amount; however, this latency is constant and can be compensated for in kernels. To limit latency in a system, fully utilize the downstream ports of devices to reduce the depth of the tree, instead of creating chains. In some situations, the use of subkernels (see below) may also bypass potential latency issues.

Distributed Direct Memory Access (DDMA)

By default on DRTIO systems, all events recorded by the master’s DMA core are kept and played back on the master. With distributed DMA, RTIO events that should be played back on remote destinations are distributed to the corresponding satellites. In some cases (typically, large buffers on several satellites with high event throughput), it allows for better performance and higher bandwidth, as the RTIO events do not have to be sent over the DRTIO link(s) during playback.

To enable distributed DMA for the master, simply provide an enable_ddma=True argument for the record() method - taking a snippet from the non-distributed example in the core language tutorial:

@kernel
def record(self):
    with self.core_dma.record("pulses", enable_ddma=True):
        # all RTIO operations now go to the "pulses"
        # DMA buffer, instead of being executed immediately.
        for i in range(50):
            self.ttl0.pulse(100*ns)
            delay(100*ns)

In standalone systems, as well as in subkernels (see below), this argument is ignored; in standalone systems it is meaningless and in subkernels it must always be enabled for structural reasons.

Enabling DDMA on a purely local sequence on a DRTIO system introduces an overhead during trace recording which comes from additional processing done on the record, so careful use is advised. Due to the extra time that communicating with relevant satellites takes, an additional delay before playback may be necessary to prevent a RTIOUnderflow when playing back a DDMA-enabled sequence.

Subkernels

Rather than only offloading the RTIO channels to satellites and limiting all processing to the master core device, it is fully possible to run kernels directly on satellite devices. These are referred to as subkernels. Using subkernels to process and control remote RTIO channels can free up resources on the core device.

Subkernels behave for the most part like regular kernels; they accept arguments, can return values, and are marked by the decorator @subkernel(destination=i), where i is the satellite’s destination number as used in the routing table. To call a subkernel, call it like any other function. There are however a few caveats:

  • subkernels do not support RPCs,

  • subkernels do not support (recursive) DRTIO (but they can call other subkernels and send messages to each other, see below),

  • they support DMA, for which DDMA is considered always enabled,

  • their return values must be fully annotated with an ARTIQ type,

  • their arguments should be annotated, and only basic ARTIQ types are supported,

  • they can raise exceptions, but the exceptions cannot be caught by the master (they can only be caught locally or propagated directly to the host),

  • while self is allowed as an argument, it is retrieved at compile time and exists as a purely local object afterwards. Any changes made by other kernels will not be visible, and changes made locally will not be applied anywhere else.

Subkernels in practice

Subkernels begin execution as soon as possible when called. By default, they are not awaited, but awaiting is necessary to receive results or exceptions. The await function subkernel_await(function, [timeout]) takes as argument the subkernel to be awaited and, optionally, a timeout value in milliseconds. If the timeout is reached without response from the subkernel, a SubkernelError is raised. If no timeout value is supplied the function waits indefinitely for the return. Negative timeout values are ignored.

For example, a subkernel performing integer addition:

from artiq.experiment import *


@subkernel(destination=1)
def subkernel_add(a: TInt32, b: TInt32) -> TInt32:
    return a + b

class SubkernelExperiment(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    @kernel
    def run(self):
        subkernel_add(2, 2)
        result = subkernel_await(subkernel_add)
        assert result == 4

Subkernels are compiled after the main kernel and immediately sent to the designated satellite. When they are called, the master simply instructs the subkernel to load and run the corresponding kernel. When self is used in subkernels, it is embedded into the compiled and uploaded data; this is the reason why changes made do not propagate between kernels.

If a subkernel is called on a satellite where a kernel is already running, the newer kernel overrides silently, and the previous kernel will not be completed.

Warning

Be careful with use of self.core.reset() around subkernels. Since self in subkernels is purely local, calling self.core.reset() in a subkernel will only affect that specific satellite and its own FIFOs. On the other hand, calling self.core.reset() in the master kernel will clear FIFOs in all satellites, regardless of whether a subkernel is running, but will not stop the subkernel. As a result, any event currently in a FIFO queue will be cleared, but the subkernels may continue to queue events. This is likely to result in odd behavior; it’s best to avoid using self.core.reset() during the lifetime of any subkernels.

If a subkernel is complex and its binary relatively large, the delay between the call and actually running the subkernel may be substantial. If it’s necessary to minimize this delay, subkernel_preload(function) should be used before the call.

While a subkernel is running, the satellite is disconnected from the RTIO interface of the master. As a result, regardless of what devices the subkernel itself uses, none of the RTIO devices on that satellite will be available to the master, nor will messages be passed on to any further satellites downstream. This applies both to regular RTIO operations and DDMA. While a subkernel is running, a satellite may use its own local DMA, but an attempt by any other device to run DDMA through the satellite will fail. Control is returned to the master when no subkernel is running – to be sure that a device will be accessible, await before performing any RTIO operations on the affected satellite.

Note

Subkernels do not exit automatically if a master kernel exits, and are seamlessly carried over between experiments. Much like RTIO events left in FIFO queues, the nature of seamless transition means subkernels left running after the end of an experiment cannot be guaranteed to complete (as they may be overriden by newer subkernels in the next experiment). Following experiments must also be aware of the risk of attempting to reach RTIO devices currently ‘blocked’ by an active subkernel left over from a previous experiment. This can be avoided simply by having each experiment await all of its subkernels at some point before exiting. Alternatively, if necessary, a system can be sanitized by calling trivial kernels in each satellite – any leftover subkernels will be overriden and automatically cancelled.

Calling other kernels

Subkernels can call other kernels and subkernels. For a more complex example:

from artiq.experiment import *

class SubkernelExperiment(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl0")
        self.setattr_device("ttl8")  # assuming it's on satellite

    @subkernel(destination=1)
    def add_and_pulse(self, a: TInt32, b: TInt32) -> TInt32:
        c = a + b
        self.pulse_ttl(c)
        return c

    @subkernel(destination=1)
    def pulse_ttl(self, delay: TInt32) -> TNone:
        self.ttl8.pulse(delay*us)

    @kernel
    def run(self):
        subkernel_preload(self.add_and_pulse)
        self.core.reset()
        delay(10*ms)
        self.add_and_pulse(2, 2)
        self.ttl0.pulse(15*us)
        result = subkernel_await(self.add_and_pulse)
        assert result == 4
        self.pulse_ttl(20)

In this case, without the preload, the delay after the core reset would need to be longer. Depending on the connection, the call may still take some time in itself. Notice that the method pulse_ttl() can be called both within a subkernel and on its own.

Note

Subkernels can call subkernels on any other satellite, not only their own. Care should however be taken that different kernels do not call subkernels on the same satellite, or only very cautiously. If, e.g., a newer call overrides a subkernel that another caller is awaiting, unpredictable timeouts or locks may result, as the original subkernel will never return. There is not currently any mechanism to check whether a particular satellite is ‘busy’; it is up to the programmer to handle this correctly.

Message passing

Apart from arguments and returns, subkernels can also pass messages between each other or the master with built-in subkernel_send() and subkernel_recv() functions. This can be used for communication between subkernels, to pass additional data, or to send partially computed data. Consider the following example:

from artiq.experiment import *

@subkernel(destination=1)
def simple_message() -> TInt32:
    data = subkernel_recv("message", TInt32)
    return data + 20

class MessagePassing(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    @kernel
    def run(self):
        simple_self()
        subkernel_send(1, "message", 150)
        result = subkernel_await(simple_self)
        assert result == 170

The subkernel_send(destination, name, value) function requires three arguments: a destination, a name for the message (to be used for identification in the corresponding subkernel_recv()), and the passed value.

The subkernel_recv(name, type, [timeout]) function requires two arguments: message name (matching exactly the name provided in subkernel_send) and expected type. Optionally, it also accepts a third argument, a timeout for the operation in milliseconds. As with subkernel_await, the default behavior is to wait as long as necessary, and a negative argument is ignored.

A message can only be received while a subkernel is running, and is placed into a buffer to be retrieved when required. As a result send executes independently of any receive and never deadlocks. However, a receive function may timeout or lock (wait forever) if no message with the correct name and destination is ever sent.