DAQ

The File-based Filter Farm in the CMS DAQ MarkII

The CMS DAQ system will be upgraded after LS1 in order to replace obsolete network equipment, use more homogeneous switching technologies, and prepare the ground for future upgrades of the detector front-ends. The experiment parameters for post-LS1 data taking remain similar to those of Run 1: a Level-1 aggregate rate of 100 kHz and an aggregate HLT output bandwidth of up to 2 GB/s. A moderate event-size increase is anticipated from the increased pile-up and from changes in the detector readout.

The original Filter Farm design was successfully operated in 2010–2013, and its efficiency and fault tolerance were brought to an excellent level. There are, however, a number of disadvantages in that design at the interface between the DAQ data flow and the High-Level Trigger (HLT) that warrant careful scrutiny in view of the deployment of DAQ2 after LS1:

  1. The reduction of the number of RU builder output ports in the new DAQ2 design requires splitting the BU and FU functionality. The additional synchronous connections would essentially cancel the simplification gained from the smaller number of RU builder ports.

  2. The concurrency of XDAQ and CMSSW tasks in the HLT calls for special releases of CMSSW that integrate the XDAQ framework, which requires synchronisation of the common code base (compilers, system libraries, tools).

  3. The CMSSW runtime environment is not well suited to the DAQ runtime strategy.

  4. The synchronous operation of the HLT requires reconciling the DAQ and CMSSW state-machine transitions and does not allow sufficient time decoupling of the DAQ. Certain CMSSW operations also incur large overheads (e.g. condition loading at run start).

Hardware

The main characteristic of the HLT hardware environment is its rapid evolution. The existing architecture makes it possible to accommodate different generations of HLT nodes easily and seamlessly, and to deploy new hardware rapidly when machine conditions warrant it. It is clearly desirable to maintain and increase this flexibility. In 2015, we will also need to integrate the legacy HLT nodes that have not yet reached their end of life.

Software

The primary goal of the file-based Filter Farm (FFF) design is to decouple the online and offline code bases. There are three main interfaces to the HLT which require special treatment in the DAQ: the raw input interface, which feeds data from the detector; the control interface, which manages the life cycle and state transitions of the CMSSW executable; and the monitoring interface, which provides both fast feedback to the operator and persistent information, such as trigger counts, that is stored in a database. The output interface, on the other hand, is easily replaced with a file, at the price of a more complex management of a large number of files in the local storage area.

The most challenging task, from a performance point of view, is to feed input binary data from a single BU into several HLT nodes at a data rate of up to 4 GB/s and an event rate of up to 2 kHz, i.e. an average event size of 2 MB.

The Raw Data format from the detector Front-End Drivers (FEDs) naturally lends itself to providing the building blocks for a simple low-level binary format of the input data. Each FED provides a variable-length, 64-bit-aligned block of data consisting of a standardised header (64 b) and trailer (64 b) encapsulating the variable-length, detector-dependent payload. The Common Data Format defines the header and trailer content and bit-field assignments; the trailer provides a word count and a checksum (CRC16). A raw CMS event consists of the concatenation of all the FED blocks produced by the detector, and a file can be structured as a concatenation of events, each prepended with a 64-bit-aligned event header.
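A minimal sketch of this framing in C++ is shown below. The accessor bit positions and the event-header fields are illustrative assumptions for this example; the authoritative bit-field assignments are those of the Common Data Format specification.

    #include <cstdint>

    // Illustrative view of the 64-bit FED framing words described above.
    // Bit positions are a sketch, not the authoritative Common Data
    // Format assignments.
    struct FedHeaderWord {
      uint64_t w;
      uint32_t sourceId() const { return (w >> 8) & 0xFFFu; }     // FED id
      uint32_t lv1Id()    const { return (w >> 32) & 0xFFFFFFu; } // L1 event number
    };

    struct FedTrailerWord {
      uint64_t w;
      uint32_t lengthWords() const { return (w >> 32) & 0xFFFFFFu; } // block length in 64-bit words
      uint16_t crc16()       const { return (w >> 16) & 0xFFFFu; }   // payload checksum
    };

    // Hypothetical 64-bit-aligned event header prepended to each event in
    // the file; the actual field content is a design choice of the FFF.
    struct EventHeader {
      uint32_t version;      // format version
      uint32_t runNumber;    // run the event belongs to
      uint32_t eventNumber;  // event number within the run
      uint32_t payloadBytes; // size of the concatenated FED blocks
    };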

In DAQ1, Run Control sequences the HLT through its internal CMSSW states via an XDAQ-based adaptor layer on top of the CMSSW state model. The different time scale and inherent variability of the HLT processing time, however, can cause undesirable lead times in completing state transitions, which are sometimes interpreted as malfunctions. In the new file-based design, the execution flow is entirely data-driven and exclusively under the control of the CMSSW executable: the lifetime of the CMSSW processes is controlled by the input data and monitored by independent watchdog processes.
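The sketch below illustrates this data-driven control flow: work arrives as files, and the end of the run is signalled by a sentinel file rather than by an external state-machine transition. The directory layout, file names and polling scheme are hypothetical, chosen only to make the idea concrete.

    #include <chrono>
    #include <filesystem>
    #include <iostream>
    #include <thread>

    namespace fs = std::filesystem;

    // Data-driven processing loop (sketch): raw files appearing on the
    // BU RAM disk drive the execution; no external state machine is
    // involved. All names below are illustrative.
    int main() {
      const fs::path inputDir{"/fff/ramdisk/run000001"};        // assumed layout
      const fs::path endOfRun = inputDir / "end_of_run.marker"; // assumed sentinel

      while (true) {
        bool processed = false;
        for (const auto& entry : fs::directory_iterator(inputDir)) {
          if (entry.path().extension() != ".raw") continue;
          std::cout << "processing " << entry.path() << '\n';
          // ... unpack the FED blocks and run the HLT menu on each event ...
          fs::remove(entry.path()); // claiming the file marks progress
          processed = true;
        }
        if (!processed) {
          if (fs::exists(endOfRun)) break; // no more data will come: exit cleanly
          std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
      }
      return 0;
    }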

All the monitoring information is generated either by services in the flow of the HLT execution (for rates etc.) or by the watchdog processes, which observe the parameters of the system itself (e.g. disk usage).

In the FFF, the online data-quality monitoring (DQM) is served events and histograms via files on the local disks of the FUs. The aggregation of the DQM data will follow the same scheme as for the normal event data: the event and histogram data needed by online DQM are saved to dedicated storage, from where the online DQM system can access the files.

Design and implementation 

Figure 2: Schematic architecture of the FFF system

The structure of the basic building block of the DAQ2 EvF system is dictated by the form factor chosen for the EVB network and by the CPU and networking requirements for the corresponding per-BU rate. For example, processing a 2 kHz event rate with the typical HLT CPU usage observed during Run 1 (of the order of 100 ms per event) requires roughly 200 cores. The most recent machines currently deployed in DAQ1 are dual 8-core motherboards arranged in a four-fold 2U chassis, i.e. 64 cores per chassis, so a complete “BU appliance” would require a minimum of three chassis and the corresponding network interconnect. To avoid investing in retrofitting legacy machines with expensive 10 GbE interfaces, these will be connected using the existing 1 GbE links and branched into the new 10/40 GbE switch using existing line cards.

A RAM disk has been chosen as the baseline solution to store the HLT input data, as it provides a simple and effective high-bandwidth buffer without the problems inherent in classic storage or SSDs. At the scale of interest, it is also the cheapest solution fulfilling all the requirements.

A complete appliance test has been carried out by the DAQ group using dummy events of fixed size, built out of four fragments generated in four RU machines. The Builder Unit could write 1 MB events at a rate of up to 5.9 kHz to files resident on a 256 GB RAM disk. When the RAM disk was exported to the FUs using NFSv4, 192 CMSSW processes running on legacy FU machines connected over 2 GbE links could read concurrently at a rate of up to 1.9 kHz, whereas 16 processes connected over 10 GbE could read concurrently at a rate of up to 3.9 kHz.

The CMSSW modules necessary for reading input and writing output event data in a format supporting concatenation exist and are being finalised. Monitoring uses a self-describing JSON format that is aggregated at the different levels by completely generic software. The CMSSW processes are controlled by a service daemon, which is entirely data-driven. In the coming weeks, the file-based HLT demonstrator will be completed with the hierarchical concatenation of output, providing a small-scale, full-chain functional HLT system entirely based on files. The choice of the storage technology for the storage and transfer system, i.e. the Storage Manager replacement, will be discussed in a future article.
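To illustrate why a self-describing format allows completely generic aggregation, the sketch below combines per-process monitoring documents into a per-node summary by following a combination rule carried in the document itself. The field names and rules are invented for the example, and the JSON parsing is abstracted away into plain maps.

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Each monitoring document carries both the values and the rule for
    // combining them, so the aggregator needs no knowledge of what the
    // quantities mean. Names and rules are illustrative.
    enum class Rule { Sum, Max };

    struct MonDoc {
      std::map<std::string, double> data;     // measured values
      std::map<std::string, Rule> definition; // how to combine each field
    };

    MonDoc aggregate(const std::vector<MonDoc>& docs) {
      MonDoc out;
      for (const auto& d : docs) {
        out.definition = d.definition; // assumed identical across sources
        for (const auto& [key, value] : d.data) {
          double& slot = out.data[key]; // zero-initialised on first use
          switch (d.definition.at(key)) {
            case Rule::Sum: slot += value; break;
            case Rule::Max: slot = std::max(slot, value); break;
          }
        }
      }
      return out;
    }

    int main() {
      const std::map<std::string, Rule> def = {
        {"events", Rule::Sum}, {"errors", Rule::Sum}};
      const std::vector<MonDoc> perProcess = {
        {{{"events", 1250.0}, {"errors", 0.0}}, def},
        {{{"events", 1310.0}, {"errors", 2.0}}, def}};
      const MonDoc perNode = aggregate(perProcess);
      for (const auto& [key, value] : perNode.data)
        std::cout << key << " = " << value << '\n'; // events = 2560, errors = 2
    }

The same logic can be applied again at the next level up, merging per-node summaries into a per-appliance or system-wide view, which is what is meant by hierarchical aggregation.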


by E. Meschi