An assessment of a model for error processing in the CMS Data Acquisition System

The CMS Data Acquisition System consists of O(20000) interdependent services. A system providing exception and application-specific monitoring data is essential for the operation of such a cluster. Due to the number of services involved, the amount of monitoring data is higher than a human operator can handle efficiently. Moving the expert knowledge for error analysis from the operator to a dedicated system is therefore a natural choice. This reduces the number of notifications to the operator for simpler visualization and provides meaningful error cause descriptions and suggestions for possible countermeasures. This paper discusses an architecture of a workflow-based hierarchical error analysis system based on Guardians for the CMS Data Acquisition System. Guardians provide a common interface for error analysis of a specific service or subsystem. To provide effective and complete error analysis, the requirements regarding the information sources, monitoring and configuration, are analyzed. Formats for common notification types are defined and a generic Guardian based on Event-Condition-Action rules is presented as a proof of concept.


Introduction
The Compact Muon Solenoid (CMS) experiment at the CERN LHC pp collider has to cope with an interaction rate of 40 MHz. Since no purely software-based distributed system can digest the total detector data of 1 MByte for a single event every 25 ns, pre-selection is performed in custom-built, pipelined processors that reside close to the detectors.
The resulting data rate of 100 kHz is processed by the CMS data acquisition system [6], which consists of O(20000) interdependent services. It follows a service-oriented architecture (SOA) [1] [8] in which each service provides a SOAP control interface [10]. High-level data acquisition applications have been implemented using the XDAQ framework [7]. The CMS data acquisition system also provides low-level monitoring and alarming information through the XDAQ monitoring and alarming system (XMAS) [2] infrastructure, which is based on a scalable and distributed publish/subscribe eventing system [3] and currently handles O(100000) notifications per second. This paper presents the architecture of a dedicated error processing system that reduces the number of notifications to the operator for simpler visualization and provides meaningful error cause descriptions and suggestions for possible countermeasures.

Gap Analysis
The CMS data acquisition system provides monitoring and alarming information but no facilities that analyze this information to derive high-level interpretations. Such an error processing system compares the actual state with the nominal state of the monitored system. Configuration information defines the nominal state; run-time information describes the actual state. Run-time information is provided through XMAS, but it lacks state information and integration with legacy services. As the system continues to be developed, error processing algorithms require continuous adaptation. To ease this task, the algorithms shall be formulated independently of communication protocol and format.
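The comparison of actual and nominal state can be illustrated with a minimal sketch. The service names, the state strings and the `find_deviations` helper are hypothetical and chosen only for illustration; they are not part of XMAS or XDAQ.

```python
from typing import Dict, List, Tuple


def find_deviations(
    nominal: Dict[str, str], actual: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (service, expected, observed) for every service off its nominal state."""
    deviations = []
    for service, expected in sorted(nominal.items()):
        # A service without any run-time report is treated as "unknown".
        observed = actual.get(service, "unknown")
        if observed != expected:
            deviations.append((service, expected, observed))
    return deviations


# Nominal state as it might be derived from configuration information (assumed values).
nominal = {"builder-01": "running", "builder-02": "running", "filter-01": "running"}
# Actual state as it might be reported at run time (assumed values).
actual = {"builder-01": "running", "builder-02": "failed"}

deviations = find_deviations(nominal, actual)
print(deviations)
```

In this sketch a missing run-time report counts as a deviation, which reflects the gap noted above for services that do not publish state information.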

Technologies
Continuing with a service-based approach and taking the previously mentioned requirements and constraints into account, we implemented an error processing system with Web Workflows. Web Workflows combine business processes with the Web by encapsulating a workflow behind a SOAP Web Service with a defined interface. They separate the protocols and formats, which are handled by the Workflow engine, from the error processing algorithms, which are defined as Workflows.
Major business process management software vendors provide their own Web Workflow engine implementations, for example Oracle BPEL process manager and IBM WebSphere Process Manager [17]. We chose the ActiveBPEL workflow engine [4] as it implements protocol interoperability (SOAP over HTTP) with the existing monitoring system out of the box and can be extended with new communication protocols and data formats without modifying Workflows. It provides a standards-compliant workflow editor and depends on a limited number of software packages (Tomcat and Java) that are already used in the CMS experiment.

Run-time and Configuration Information
Run-time information represents the actual condition of the running system and can be categorized as shown in Figure 1:
• State information contains information about the actual state of services. With hierarchical states as defined in ASAP [5] (Figure 2) we can impose general states for visualization and error processing and allow flexibility by refinement of states when necessary for control. For example, a service can define a custom sub-state open.running.discard to indicate that it is operational but discarding incoming data.
• Error information describes exceptions that could not be handled locally by services. It embeds a complete exception trace for debugging. In addition, custom properties can be added at each level of the exception trace to provide further information for automated error processing.
Configuration information represents the nominal condition of the running system. It can be categorized into hardware and software information. Hardware information describes the setup of hosts, devices and networks. Software information specifies applications, services and communication endpoints.
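The refinement of hierarchical states described above can be sketched as a prefix test on dot-separated state names. The function name `is_in_state` is an assumption made for this illustration; only the state name open.running.discard is taken from the text.

```python
def is_in_state(actual: str, general: str) -> bool:
    """Return True if `actual` equals `general` or is a refinement (sub-state) of it."""
    # Appending the separator avoids treating "open.runningfast" as a
    # refinement of "open.running".
    return actual == general or actual.startswith(general + ".")


state = "open.running.discard"
print(is_in_state(state, "open"))          # general state, e.g. for visualization
print(is_in_state(state, "open.running"))  # refined state, e.g. for control
print(is_in_state(state, "closed"))
```

Under this assumption, visualization and error processing can test against a general state such as open while control logic still distinguishes the custom sub-state.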

Error Processing Architecture
A high-level error processing system is responsible for detecting the cause of errors on startup and during operation of the monitored system. To do so, it analyzes differences between the actual and the nominal status of the system. The general architecture of our error processing system is depicted in Figure 3. The data layer contains the services of the monitored system, which may emit data into the monitoring and alarming system. The logic layer contains the monitoring and error processing system, and the visualization layer contains the graphical user interface the operator interacts with.
The error processing system contains two kinds of services, an Error Processor and Guardians. In our system the Error Processor is an intermediary that subscribes to the monitoring and alarming system and asynchronously receives all error notifications generated by the services in the data layer. These notifications are then forwarded to error processing components called Guardians. Guardians are logically ordered in a hierarchy as depicted in Figure 4 and contain expert knowledge about specific services or subsystems. A Guardian is only interested in a specific subset of notifications and thus provides a filter expression to reduce the number of notifications it receives from the Error Processor. The low-level Guardians observe specific services whereas the higher-level ones observe groups of services. If a Guardian cannot identify the cause of an error directly, it may emit an exception, which is passed to a higher-level Guardian. Error processing should always be done on the lowest possible layer without incorporating knowledge about other subsystems or services. This keeps the higher-level Guardians abstract and confined to their respective group of applications. If a Guardian can identify the cause of an error, it may emit a notification to the operator.
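The filtering and escalation behaviour described above can be sketched as follows. This is a simplified illustration and not the implemented system: all class, field and service names are hypothetical, and escalation is modelled as a direct call to the parent Guardian, whereas in the architecture derived notifications travel back through XMAS and the Error Processor.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Notification:
    source: str  # name of the emitting service or Guardian
    text: str    # free-form error description


@dataclass
class Guardian:
    name: str
    accepts: Callable[[Notification], bool]            # filter expression
    diagnose: Callable[[Notification], Optional[str]]  # error cause, or None
    parent: Optional["Guardian"] = None

    def handle(self, n: Notification) -> Optional[str]:
        """Return an operator notification, escalating when no cause is found."""
        cause = self.diagnose(n)
        if cause is not None:
            return f"{self.name}: {cause}"
        if self.parent is not None:
            # Derived notification: the parent sees this Guardian as the source.
            return self.parent.handle(Notification(self.name, n.text))
        return None


def dispatch(guardians: List[Guardian], n: Notification) -> Optional[str]:
    """Error Processor role: forward n to the first Guardian whose filter matches."""
    for g in guardians:
        if g.accepts(n):
            return g.handle(n)
    return None


subsystem = Guardian(
    name="daq-subsystem",
    accepts=lambda n: False,  # only reached through escalation in this sketch
    diagnose=lambda n: "name resolution failure" if "unknown host" in n.text else None,
)
builder = Guardian(
    name="event-builder",
    accepts=lambda n: n.source.startswith("builder"),
    diagnose=lambda n: "input buffer full" if "buffer" in n.text else None,
    parent=subsystem,
)

local = dispatch([builder, subsystem], Notification("builder-03", "buffer overflow"))
escalated = dispatch([builder, subsystem], Notification("builder-03", "unknown host ru-12"))
print(local)
print(escalated)
```

The low-level Guardian resolves the first notification itself, while the second is escalated because its cause lies outside the knowledge of that Guardian.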
All Guardians provide the same SOAP interface and as such may be implemented in any language. This allows integration with existing rule-based systems or custom error processing code where a generic Guardian is insufficient. The request message to a Guardian contains a list of error notifications and a list of URLs of monitoring data servers, which may be queried for more information. The response message contains operator notifications if an error cause could be identified, or a derived error notification otherwise. Additionally, it encloses a list of matched notification identifiers.
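The content of the request and response messages can be sketched with plain data classes. The field names and the example URL are assumptions made for illustration; the actual SOAP message schema is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ErrorNotification:
    id: str    # unique notification identifier
    text: str  # error description


@dataclass
class GuardianRequest:
    notifications: List[ErrorNotification]
    monitor_urls: List[str]  # monitoring data servers that may be queried


@dataclass
class GuardianResponse:
    operator_notifications: List[str] = field(default_factory=list)
    derived_notifications: List[ErrorNotification] = field(default_factory=list)
    matched_ids: List[str] = field(default_factory=list)


def unmatched_ids(request: GuardianRequest, response: GuardianResponse) -> List[str]:
    """Identifiers of request notifications that no rule of the Guardian matched."""
    matched = set(response.matched_ids)
    return [n.id for n in request.notifications if n.id not in matched]


request = GuardianRequest(
    notifications=[ErrorNotification("n1", "disk full"), ErrorNotification("n2", "odd state")],
    monitor_urls=["http://monitor.example/xmas"],  # hypothetical URL
)
response = GuardianResponse(
    operator_notifications=["Disk full on host A"],
    matched_ids=["n1"],
)
leftover = unmatched_ids(request, response)
print(leftover)
```

The list of matched identifiers lets the caller determine which notifications were not covered by any rule.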
Finally, the Error Processor forwards operator notifications to the operator and error notifications to XMAS, and asks the operator to redefine the rules for unmatched notifications, which are referenced by their unique identifier. The derived error notifications are then sent asynchronously from XMAS to the Error Processor, which forwards them to another Guardian, effectively achieving the logical data flow depicted in Figure 4.
We chose to implement error processing using BPEL as it already provides powerful languages for filtering (XPath) [12] and querying (XQuery) [11] XML data. Using those features we implemented a generic Guardian, which processes Event-Condition-Action (ECA) rules [9]. A rule that checks the diskUsage of our computers is shown in Figure 5. This is an example of a rule that is not triggered by an error notification but is triggered periodically and checks service-specific information. A set of rules is specific to one Guardian. A low-level Guardian, for example, defines a set of rules to process errors emitted by a specific service type, effectively leading to disjoint sets of rules for different Guardians. Higher-level Guardians define rules that match only notifications derived by low-level Guardians, leading to the hierarchical error processing depicted in Figure 4.
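The structure of an Event-Condition-Action rule can be sketched in a few lines. This is not the BPEL rule of Figure 5: the event name, the 90% threshold, the host name and the message text are assumptions chosen for illustration, and only the diskUsage property name is taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    event: str                           # event that triggers evaluation
    condition: Callable[[Dict], bool]    # predicate over the monitoring data
    action: Callable[[Dict], str]        # produces an operator notification


def process(rules: List[Rule], event: str, data: Dict) -> List[str]:
    """Run the action of every rule whose event matches and whose condition holds."""
    return [r.action(data) for r in rules if r.event == event and r.condition(data)]


disk_rule = Rule(
    event="timer.minute",                       # periodic trigger (assumed name)
    condition=lambda d: d["diskUsage"] > 0.9,   # assumed threshold
    action=lambda d: f"Disk on {d['host']} is {d['diskUsage']:.0%} full",
)

alerts = process([disk_rule], "timer.minute", {"host": "host-17", "diskUsage": 0.95})
quiet = process([disk_rule], "timer.minute", {"host": "host-18", "diskUsage": 0.40})
print(alerts)
print(quiet)
```

Because each Guardian owns its own rule list, the same `process` function can serve low-level and higher-level Guardians alike in this sketch.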

Enhancements
During evaluation of existing workflow engines we identified some shortcomings of BPEL and missing components necessary for integration with our system:
• BPEL workflows can only be triggered through SOAP messages and not through timers or more complex rules.
• ActiveBPEL natively supports only SOAP-based protocols.
Integration: As not all services publish directly into XMAS, we added custom workflow checking scripts, which query the states of those services over SSH and publish their information into XMAS through SOAP messages. In addition, some services use a custom binary protocol for performance reasons.
Although WSDL allows defining interfaces independently of transport protocols, the ActiveBPEL engine only supports SOAP over HTTP as a protocol by default. ActiveBPEL solves this problem by providing InvokationHandlers, which translate between the internal workflow engine data representation (XML) and custom formats and protocols, thereby allowing seamless integration with our system.

Summary
This paper summarizes requirements and pitfalls during design and implementation of a generic error processing system using the CMS experiment as a case study. The presented error processing architecture relies on Workflow and Web Service technologies, which allow seamless integration into the existing environment. We implemented a generic workflow-based Guardian, which performs error processing based on ECA rules.
Tests of the error processing system were performed in the production environment of the CMS data acquisition system. In particular, ECA rules have been defined for commonly encountered errors, such as failing service location protocol (SLP) servers and domain name resolution (DNS) servers. The error causes have been identified indirectly from error notifications emitted by data acquisition applications.
We observed that error notifications in our system can be classified with regard to the number of originators and the number of notifications per originator. A Guardian will handle those kinds of errors in the following ways:
• A transient error emitted by one originator leads to a single error notification. It will be matched by one specific rule in a low-level Guardian and will directly or indirectly (through a higher-level Guardian) emit one operator notification.
• A transient error emitted by multiple originators leads to multiple error notifications. It will be matched by one specific rule in a low-level Guardian and will directly or indirectly emit one operator notification.
• A permanent error emitted by one or multiple originators leads to multiple error notifications sent repeatedly. It will be matched by one specific rule in a low-level Guardian and will directly or indirectly emit the same operator notification repeatedly.
Our tests have shown that error notifications from multiple originators dominate the number of notifications. Our error processing system can handle these errors and reduces the number of notifications by a factor equal to the number of originators. This shows that the presented architecture is an adequate approach to analyzing errors found in the CMS data acquisition system.
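The reduction for a transient error reported by multiple originators can be sketched as a grouping step: all notifications matched by the same rule collapse into one operator notification. The rule name, the matching function and the notification texts are hypothetical and serve only to illustrate the counting.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def reduce_notifications(
    notifications: Iterable[Tuple[str, str]],
    match_rule: Callable[[str], Optional[str]],
) -> Dict[str, List[str]]:
    """Group (originator, text) pairs by matching rule: one entry per rule."""
    originators_by_rule = defaultdict(set)
    for originator, text in notifications:
        rule = match_rule(text)
        if rule is not None:
            originators_by_rule[rule].add(originator)
    return {rule: sorted(names) for rule, names in originators_by_rule.items()}


def match_rule(text: str) -> Optional[str]:
    # Stand-in for one ECA rule condition (assumed text pattern).
    return "dns-failure" if "cannot resolve" in text else None


# Three originators report the same transient error.
burst = [(f"builder-{i:02d}", "cannot resolve host ru-12") for i in range(1, 4)]
summary = reduce_notifications(burst, match_rule)
print(summary)
```

Three error notifications yield a single entry here, which corresponds to a reduction by the number of originators.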
The low-level Guardians split the system into disjoint parts and thus scale to the number of services found in our system. Scalability is, however, limited by the Error Processor, which needs to forward all error notifications between XMAS and the Guardians. Additional measurements in the XDAQ framework revealed a performance bottleneck induced by the overhead of the SOAP protocol, which limits the throughput to 200 messages per second.
Planned improvements to the current system include porting XMAS to a binary protocol to reduce the protocol overhead. Guardians shall subscribe directly to XMAS to improve scalability of the error processing system. This requires extending the subscription mechanism to support complex filter expressions that take notification properties into account.

Figure 3 UML collaboration diagram of the error processing system and related services.

Figure 4 Logical data flow for error notifications (arrows in the middle) and operator notifications (arrows on the right).

Figure 6 UML collaboration diagram for auxiliary components.

• Service information contains dynamic data ranging from statistics to configuration data not known a priori. It is freely definable and usually specific to applications.

Figure 7 Event Generator rule of a timer emitting an event once per minute.

Figure 8 Event Generator rule for triggering a web service (workflow) based on a timer event.
17th International Conference on Computing in High Energy and Nuclear Physics (CHEP09) IOP Publishing Journal of Physics: Conference Series 219 (2010) 022039 doi:10.1088/1742-6596/219/2/022039