Using MaxCompiler for the high level synthesis of trigger algorithms

Firmware for FPGA trigger applications at the CMS experiment is conventionally written using hardware description languages such as Verilog and VHDL. MaxCompiler is an alternative, Java based, tool for developing FPGA applications which uses a higher level of abstraction from the hardware than a hardware description language. An implementation of the jet and energy sum algorithms for the CMS Level-1 calorimeter trigger has been written using MaxCompiler to benchmark against the VHDL implementation in terms of accuracy, latency, resource usage, and code size. A Kalman Filter track fitting algorithm has been developed using MaxCompiler for a proposed CMS Level-1 track trigger for the High-Luminosity LHC upgrade. The design achieves a low resource usage, and has a latency of 187.5 ns per iteration.


Keywords: Data processing methods; Pattern recognition, cluster finding, calibration and fitting methods; Trigger algorithms; Trigger concepts and systems (hardware and software)

Introduction
High level synthesis (HLS) is an alternative way to develop applications for FPGAs, which are typically programmed using hardware description languages (HDLs) such as VHDL and Verilog. HLS tools enable development of applications from a procedural description, rather than from the behavioural description written in HDL. The use of HLS may lead to shorter development times and more maintainable code bases, given the different approach, and the abstraction of hardware specifics from the application design. The requirements for any HLS tool in a High Energy Physics (HEP) triggering application are: that the design must be testable and verifiable; that the utilization of the FPGA resources should be efficient compared to a hand-written HDL equivalent; and that the latency of the design meets the strict requirements for trigger systems. In this paper we explore the use of MaxCompiler [1] in HEP trigger applications, using the Level-1 (L1) trigger of the Compact Muon Solenoid (CMS) [2] experiment at the Large Hadron Collider (LHC).
MaxJ is the language used to describe algorithms for MaxCompiler. An extension of Java, MaxJ allows the programmer to use a higher level of abstraction from the hardware than when using HDLs. MaxCompiler compiles the code into output compatible with the FPGA vendor synthesis tools. The synthesisable output takes the form of a combination of VHDL and vendor specific components, such as for DSPs or RAMs.

MaxCompiler
A key feature of MaxCompiler is the scheduling of the design into a pipeline automatically. The scheduler determines which operations must be in sequence with others, and which can be performed in parallel. In the default mode each logic operation is followed by a register, forming the pipeline.

JINST 12 C02015
Figure 1. Dataflow graph for the simple design performing z = x² + y with inputs x, y and output z. Operations are represented as nodes on the graph. The default latency for a DSP operation is 3 clock cycles (this is configurable), so the signal y is registered by the compiler for 3 clock cycles to maintain the synchronisation between x and y. The register is represented by the box labelled '3'.
When paths of different latency must meet, the shorter branch is pipelined so as to synchronise the two paths. A 'dataflow graph' of the design is produced, showing the scheduling of operations, and is used by the compiler for further optimisation of the design. Automatic pipelining simplifies design greatly, since the developer is freed from having to synchronise signals by hand. Meeting the requirement for low latency in a trigger algorithm may require tuning the number of registers in the datapath, which is supported by the language; this comes down to finding a balance between pipeline length and clock frequency.

Figure 1 shows an example dataflow graph for the simple design taking inputs x and y and outputting z = x² + y, produced using the following code:

```java
DFEVar x = io.input("x", dfeInt(16)); // A 16 bit signed integer
DFEVar y = io.input("y", dfeInt(16));
DFEVar z = x * x + y;
io.output("z", z, z.getType());
```

In the source code the developer explicitly separates the part intended for the FPGA from the part that runs on the CPU, using different objects for each. 'DFEVars' describe the objects on the FPGA, leaving the usual int, float, etc. types on the CPU for use at compile time only, for example to define constant values or to create multiple instances of an operation. The following code inputs a variable x and multiplies it with ten integer constants, in parallel, storing the results in the array y. The compiler will also map the operations to hardware efficiently, using shifts for multiplications by powers of 2, and simply registering x rather than multiplying by 1.

```java
DFEVar x = io.input("x", dfeInt(16));
DFEVar[] y = new DFEVar[10];
for (int n = 0; n < 10; n++) {
    y[n] = x * n;
}
```

As a high level language, MaxJ provides some abstraction from the hardware implementation specifics of operations.
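As an illustration of what the scheduler does (plain Python, not MaxCompiler internals; the per-operation latencies are assumed), the path balancing described above can be sketched as a graph traversal that pads the shorter input of each node with delay registers:

```python
# Minimal sketch of automatic pipeline scheduling: each node's output is
# ready at the max of its input arrival times plus its own latency, and
# inputs that arrive early are padded with delay registers. The latency
# values are illustrative, matching the 3-cycle DSP default in figure 1.

LATENCY = {"mul": 3, "add": 1, "input": 0}  # assumed per-op latencies

def schedule(graph):
    """graph: node -> (op, [input nodes]); returns (node -> ready cycle,
    delay registers inserted on each edge)."""
    ready, delays = {}, {}
    def visit(node):
        if node in ready:
            return ready[node]
        op, ins = graph[node]
        arrivals = [visit(i) for i in ins]
        start = max(arrivals, default=0)
        for i, t in zip(ins, arrivals):
            if t < start:                      # shorter branch:
                delays[(i, node)] = start - t  # pad with registers
        ready[node] = start + LATENCY[op]
        return ready[node]
    for n in graph:
        visit(n)
    return ready, delays

# The 'z = x*x + y' example of figure 1:
g = {
    "x": ("input", []),
    "y": ("input", []),
    "sq": ("mul", ["x", "x"]),
    "z": ("add", ["sq", "y"]),
}
ready, delays = schedule(g)
print(ready["z"])          # pipeline depth: 3 (mul) + 1 (add) = 4
print(delays[("y", "z")])  # y delayed 3 cycles to meet x*x, as in figure 1
```

The 3-cycle register on the y branch is exactly the box labelled '3' in figure 1.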
Memories are defined more similarly to CPU applications: as a queue, FIFO, RAM, or ROM, rather than by explicit instantiation of memory IP, or by inferral by the synthesis tool from port definitions as with HDLs. Similarly, an operation such as x × y + z can be optimised by the compiler to use a DSP, with the source code simply stating 'x * y + z'. This abstraction also allows the developer some freedom in targeting applications to FPGAs from different vendors, since the compiler is tasked with mapping operations to hardware to a greater extent than in a conventional HDL application, notwithstanding different physical features in different FPGAs.
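As a rough illustration of the kind of constant-multiplier mapping described above (plain Python standing in for the compiler's decision logic, not MaxCompiler's actual implementation):

```python
# Sketch of constant-multiplier lowering: multiplication by a power of
# two becomes a shift, multiplication by 1 a plain register (wire), and
# only general constants fall through to a DSP multiply.
def lower_const_mul(c):
    """Return a (description, function) pair for multiplying by constant c."""
    if c == 1:
        return ("register", lambda x: x)
    if c > 0 and (c & (c - 1)) == 0:           # power of two
        k = c.bit_length() - 1
        return (f"shift<<{k}", lambda x: x << k)
    return ("dsp-multiply", lambda x: x * c)

for c in (1, 2, 8, 10):
    kind, f = lower_const_mul(c)
    print(c, kind, f(3))
```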

CMS Level-1 Calorimeter Trigger
The CMS Level-1 trigger processes measurements from the muon detectors and calorimeters to decide whether to trigger readout of the full detector. Within the calorimeter trigger, the electromagnetic, hadronic, and hadronic forward calorimeters (ECAL, HCAL, and HF respectively) provide trigger tower primitives with transverse energy and position at a reduced resolution.
A time-multiplexed architecture is used, with the result that a single node receives and processes all trigger towers of a given event. A first hardware layer sends the calorimeter data over optical fibre to a second layer over a period of several bunch crossings. All towers from one azimuthal ring at a particular pseudorapidity (η) slice are sent simultaneously, one ring from each side of the detector. There are 68 η slices, and 72 towers in a 2π ring in φ. Nodes in the second hardware layer find jets, e/γ, and τ candidates, as well as the total and missing transverse energy in each event. The second hardware layer comprises MP7 boards, each featuring a Xilinx Virtex-7 XC7VX690T FPGA for processing [4].
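A hypothetical sketch of this data ordering (illustrative indices, not the real CMS link map) as seen by a layer-2 node:

```python
# Hypothetical mapping of trigger towers to (link, clock cycle) at the
# layer-2 node: the 72 towers of one eta ring arrive in parallel across
# links in one clock cycle, while successive eta slices arrive on
# successive cycles.
N_PHI = 72

def arrival(eta_slice, phi_index):
    """Map a trigger tower to (link, clock cycle)."""
    return phi_index % N_PHI, eta_slice  # phi -> link, eta -> time

# A separation in phi is a separation across links, at the same time:
l0, t0 = arrival(12, 10)
l1, t1 = arrival(12, 11)
print(l0 != l1, t0 == t1)  # True True
# A separation in eta is a separation in time on the same link:
l2, t2 = arrival(20, 10)
print(l2 == l0, t2 - t0)   # True 8
```

This time separation in η is what forces the jet algorithm, described below, to pipeline data across many clock cycles.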

Jet and Energy Sum Algorithm Implementation
The two CMS L1 calorimeter trigger algorithms re-implemented using MaxCompiler are the jet and energy sum algorithms. The energies of all trigger towers are summed, separately for the electromagnetic and hadronic parts, to determine the total transverse energy (ET) in the event; in parallel, the vector sum is calculated to determine the missing transverse energy. In the calorimeter trigger, the jet energy is the sum of the tower energies in a fixed 9 × 9 tower window, where the central tower of the jet must have the greatest energy. Pile-up is estimated locally from the energy in trigger towers neighbouring the window, and is subtracted from the jet. The twelve highest-energy jets are passed to the output.
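A minimal sketch of the two sums (plain Python with a toy tower list; the firmware operates on fixed-point tower primitives rather than floats):

```python
# Scalar sum of tower ET gives the total ET; the vector sum of the ET
# components gives the missing transverse energy.
import math

def energy_sums(towers):
    """towers: list of (et, phi_radians). Returns (total_et, missing_et)."""
    total = sum(et for et, _ in towers)
    px = sum(et * math.cos(phi) for et, phi in towers)
    py = sum(et * math.sin(phi) for et, phi in towers)
    return total, math.hypot(px, py)

# Two back-to-back towers: energies add, vectors cancel.
tot, met = energy_sums([(50.0, 0.0), (50.0, math.pi)])
print(tot, round(met, 6))  # 100.0 0.0
```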
Due to the data ordering introduced by the time multiplexing, a separation in φ between towers corresponds to a separation across input links, while a separation in η corresponds to a separation in time. Implementing the jet algorithm therefore requires significant pipelining to form jets whose energies span 9 towers in η. All the algorithms operate at a 240 MHz clock frequency.
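Since η slices arrive one per clock cycle, a width-9 sum in η must buffer 9 slices before the first window is complete; a one-dimensional sketch (plain Python, toy energies):

```python
# Each clock cycle delivers one eta slice; a 9-wide window in eta is
# only available once 9 slices have been buffered, which is why the
# jet finder needs deep pipelining.
from collections import deque

def eta_window_sums(slices, width=9):
    """slices: per-cycle tower energies at successive eta; yields the
    width-wide window sum once enough slices are buffered."""
    buf = deque(maxlen=width)
    for s in slices:
        buf.append(s)
        if len(buf) == width:
            yield sum(buf)

sums = list(eta_window_sums([1] * 20, width=9))
print(len(sums), sums[0])  # 12 window positions, each summing to 9
```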

MaxJ Implementation
Making efficient use of the resources of the chip in MaxJ required similar design patterns to the VHDL implementation. For example, neighbouring jet candidates have energy contributions from many of the same trigger towers. When computing the energy of many jet candidates in parallel, building the sum of tower energies in reusable arrangements makes for efficient use of the FPGA resources. In both languages this requires some particular constructions by the developer.
By passing the same data through each implementation we see that the two firmware outputs match well, as shown in figure 2. The MaxJ implementation obtains results that are bit-identical to the VHDL, apart from a small difference in the distribution of the azimuthal angles of jets. This is due to an internal difference in the sorting: in the MaxJ implementation, jets of equal energy and pseudorapidity can emerge in the opposite order to the VHDL implementation. This difference would not affect the physics performance of the trigger. Targeting the MP7 board, the MaxJ implementation uses 7% more LUT slices than the VHDL, while DSP and BRAM usage is identical. The MaxJ source used approximately half the number of lines of code of the VHDL.

CMS Level-1 Track Trigger Demonstrator
In order to maximally exploit the scientific potential of the LHC, the machine will be upgraded to provide a peak instantaneous luminosity of 5-7.5 × 10³⁴ cm⁻² s⁻¹ and a total integrated luminosity of 3000 fb⁻¹ [5]. This upgrade, known as the High-Luminosity LHC (HL-LHC), is due to begin construction in 2024. In the conditions of 140-200 pile-up interactions expected, reconstructing tracks from the silicon tracker at Level-1, and exploiting them in the trigger, will be necessary in order to keep object energy thresholds as low as possible at the given trigger accept rate. A 'pT module' has been designed to reduce the data volume from the tracker to the trigger [6]. Charged particles are measured on each of two segmented silicon sensor layers, spaced approximately 1 mm apart; the bend induced by the 3.8 T magnetic field causes a displacement between the two measurements. Pairs of measurements within a configurable window are combined to form a 'stub'. Only stubs are to be read out to the L1 track trigger, with a pT threshold of 2-3 GeV. Approximately 98% of tracks are below 2 GeV and are considered less useful for the trigger, so reading out stubs only achieves a factor 10 reduction in the data rate to the trigger.
The track trigger must reconstruct charged particle tracks with a latency of around 4 µs. One proposal for a track trigger design is based on an FPGA implementation of a Hough Transform (HT) [7]. The HT performs track finding in the plane transverse to the beam pipe, grouping stubs which form a viable trajectory in this 2-dimensional perspective. Following the HT, further steps are required to perform a fit to the stubs in 3 dimensions, filter the tracks to remove false associations of stubs, and remove duplicated tracks. A Kalman Filter [8] (KF) has been developed to carry out track fitting and simultaneous filtering. The Kalman Filter firmware is partly developed using MaxJ, and targets the MP7 board.

Kalman Filter
Kalman filters have been used extensively in tracking in HEP [9], and indeed on parallel architectures [10]. CMS uses a Kalman Filter at the High Level Trigger and for offline reconstruction. The filter begins with an estimate of the track parameters -provided by the preceding HT which found the track. Stubs are introduced to the KF iteratively, each time pulling the parameters by an amount dependent on the relative uncertainties of the measurement and the parameters, as seen in equations (3.1)-(3.9). Multiple stubs on the same detector layer are not considered to belong to the same track, and the candidate is split into independent trajectories. As a local method, the Kalman filter allows stubs to be tested for compatibility with a trajectory before updating the parameters. This is a useful feature for fitting track candidates found by the HT, which can contain stubs far from the true track in the z dimension. Another general feature of local track fitting methods is the need for smaller matrices than a global method for the same number of track parameters and measurements, which may yield an algorithm with lower resource usage in the FPGA.
Equations (3.1)-(3.9) update an estimate of the track parameters x_{k−1} and their covariance matrix C_{k−1} on layer k−1 to an estimate x_k with covariance matrix C_k on layer k. The matrix F_{k−1} projects the track parameters forward with process noise Q_{k−1}; H_k predicts the stub position on layer k; m_k is a stub measurement, and V_k is its measurement uncertainty.
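The equations themselves are not reproduced in this text. As a sketch, in the notation above, the standard linear Kalman filter prediction and update recursion takes the form below; the paper's equations (3.1)-(3.9) should correspond to these steps, though their exact arrangement and numbering is not shown here.

```latex
\begin{align}
x_k^{k-1} &= F_{k-1}\, x_{k-1}                          & &\text{(predicted parameters)} \\
C_k^{k-1} &= F_{k-1}\, C_{k-1}\, F_{k-1}^{T} + Q_{k-1}  & &\text{(predicted covariance)} \\
r_k &= m_k - H_k\, x_k^{k-1}                            & &\text{(measurement residual)} \\
S_k &= H_k\, C_k^{k-1}\, H_k^{T} + V_k                  & &\text{(residual covariance)} \\
K_k &= C_k^{k-1}\, H_k^{T}\, S_k^{-1}                   & &\text{(Kalman gain)} \\
x_k &= x_k^{k-1} + K_k\, r_k                            & &\text{(updated parameters)} \\
C_k &= (I - K_k H_k)\, C_k^{k-1}                        & &\text{(updated covariance)}
\end{align}
```

In this form the only matrix inversion required is that of the residual covariance S_k, which is 2 × 2 for a two-coordinate stub measurement.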
A simplified 4-parameter fit with linear equations is implemented, fitting x = (1/2R, φ₀, tan λ, z₀), where R is the track radius of curvature (proportional to the pT), φ₀ is the initial track angle in the x-y plane, tan λ is the track gradient in the r-z plane, and z₀ is the vertex position. The matrices in this parametrisation are a combination of 4 × 4, 2 × 2, and 4 × 2, and the matrix inversion required is of a 2 × 2. The firmware uses fixed-point arithmetic, with data types specified by the number of bits before and after the radix point. This allows the use of integer operations, which are cheaper in terms of FPGA resources and latency than floating-point operations. MaxCompiler contains many features to simplify the production of efficient fixed-point designs.
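As an illustration of such a fixed-point representation (plain Python, with illustrative bit widths rather than the ones used in the firmware):

```python
# A value is stored as a signed integer of 'bits' total bits with
# 'frac' bits after the radix point, so arithmetic reduces to integer
# operations. Widths here (18 bits, 14 fractional) are illustrative.
def to_fixed(value, bits, frac):
    """Quantise to signed fixed point; raises on overflow."""
    q = round(value * (1 << frac))
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= q <= hi:
        raise OverflowError(f"{value} does not fit in Q{bits - frac}.{frac}")
    return q

def from_fixed(q, frac):
    return q / (1 << frac)

x = from_fixed(to_fixed(0.7071, 18, 14), 14)
print(abs(x - 0.7071) < 2 ** -14)  # quantisation error below one LSB
```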

MaxJ Implementation
Let x and y be two signed numbers, of n_x and n_y bits respectively. The product x × y needs at most n_x + n_y − 1 bits to represent all outcomes, except for the single corner case in which both operands take their most negative value, which requires one further bit. In the Xilinx Virtex-7 FPGA targeted for this application, the DSPs used to perform multiplications have inputs for one 18-bit and one 25-bit signed value. Multiple DSPs can be used if either multiplicand is larger than the single DSP input size.
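For small widths this bit-growth rule can be checked exhaustively. The sketch below (plain Python, illustrative widths) enumerates all products of a 5-bit by a 6-bit signed value; the only product needing more than n_x + n_y − 1 bits is the both-most-negative corner case.

```python
# Exhaustive check of signed-multiplication bit growth for small widths.
def fits_signed(v, bits):
    return -(1 << (bits - 1)) <= v <= (1 << (bits - 1)) - 1

nx, ny = 5, 6  # small enough to enumerate every operand pair
overflows = [
    (x, y)
    for x in range(-(1 << (nx - 1)), 1 << (nx - 1))
    for y in range(-(1 << (ny - 1)), 1 << (ny - 1))
    if not fits_signed(x * y, nx + ny - 1)
]
# Only (-16) * (-32) = 512 exceeds the 10-bit signed range [-512, 511]:
print(overflows)  # [(-16, -32)]
```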
It is evident that if the product x × y is to be used in another multiplication -as occurs frequently in equations (3.1)-(3.9) -then the number of bits used to encode x × y must be kept to no more than 18 or 25 bits in order to use only 1 DSP for the subsequent multiplication and keep the overall resource usage small. The developer can select from a collection of strategies to allow the compiler to set the bit width and radix point location for the output. After setting the bit size to be either 18 or 25, for minimal DSP usage, the tool is able to set the radix point given a requirement to avoid either underflow or overflow. Alternatively the developer is able to set the point manually. This feature was used extensively in the Kalman Filter firmware, achieving a DSP usage of one per multiplication operation.
A matrix multiplication is given by (AB)_ij = Σ_k A_ik B_kj. Since the matrices in this Kalman Filter are small, each product A_ik × B_kj is executed in parallel, with the summation performed as a balanced adder tree on the products. Figure 3 shows the dataflow graph for a matrix multiplication for one element of an '(n × 4) × (4 × m)' product. Furthermore, matrix components which are constants known at compile time are optimised by the compiler, as described in section 1.1. The division required to find the matrix inverse is performed using an efficient, custom algorithm requiring 1 BRAM and 1 DSP. The implementation of a single Kalman Filter (equations (3.1)-(3.9)) uses approximately 1% each of the LUTs, DSPs, and BRAMs of a Xilinx Virtex-7 XC7VX690T FPGA on the MP7 board, and is capable of running at the desired 240 MHz clock frequency, with a single-iteration latency of 45 clock cycles, or 187.5 ns.
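A minimal sketch of this scheme (plain Python, not MaxJ): each elementwise product is formed independently, as one DSP each would be in the firmware, and a pairwise reduction stands in for the balanced adder tree, whose depth of ⌈log₂ k⌉ sets the added latency.

```python
# Parallel matrix multiply with a balanced (pairwise) adder tree.
def tree_sum(vals, depth=0):
    """Pairwise summation; returns (sum, tree depth)."""
    if len(vals) == 1:
        return vals[0], depth
    nxt = [sum(vals[i:i + 2]) for i in range(0, len(vals), 2)]
    return tree_sum(nxt, depth + 1)

def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[tree_sum([A[i][p] * B[p][j] for p in range(k)])[0]
             for j in range(m)] for i in range(n)]

# A 4-term dot product, as in the (n x 4) x (4 x m) case, needs an
# adder tree of depth 2 (= ceil(log2 4)):
_, depth = tree_sum([1, 2, 3, 4])
print(depth)                                        # 2
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```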

Simulation
MaxCompiler contains features to facilitate communication between the firmware and other CPU applications -whether in hardware, or simulation. In developing the Kalman Filter, the MaxCompiler simulation was found to be an extremely useful feature. The tool compiles the firmware into a model which uses the data-flow graph and libraries supporting the use of arbitrary bit width maths operations. This model takes the form of a '.max' file, which can be most directly interfaced with a C/C++ application, but can also be used with other languages such as Python and MATLAB. A function call passes formatted data to the firmware (whether in simulation or hardware) and the output can be processed on the CPU.
For the Kalman Filter this was used to test the firmware on Monte Carlo stubs using the CMS experiment software, CMSSW [11], and a simulation framework for studying the Hough Transform performance. This direct simulation from within a C++ application facilitated comparisons between the Kalman Filter firmware and a floating-point C++ implementation of the algorithm on a per-iteration basis, as seen in figure 4 for the 1/2R and φ₀ track parameters. The majority of update operations are seen to match between the two implementations to better than 0.5%.
In addition to comparisons of the firmware output with a software model, debug information was used to tune and debug the MaxJ. The available debug data includes the collection of results of intermediate calculations within the firmware, and numerical exception information, making it easy to find type underflows and overflows from incorrect truncations.

Control Flow
Stubs and states are passed into the KF by a module, written in VHDL, which stores the objects in a RAM and a FIFO respectively. The control flow module maintains an address book for stubs, and loops over the stubs in a candidate in order of layer. Due to the pipelined design of the filtering equations, it is possible to pass new data into the filter at every clock cycle. This allows a single KF processor to filter multiple independent track candidates simultaneously. This design demonstrates the possibility of developing an application in a combination of VHDL and MaxJ. An advantage of this approach is the decoupling of the data-flow and control-flow, whereby the high level language is used to describe the maths operations, and the HDL is used to move data within the FPGA. A configuration including 36 independent KF processors has been placed and routed in the target FPGA and is capable of operation at 240 MHz.
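A toy model of this idea (plain Python; the 45-cycle latency is taken from the text, everything else is illustrative): because the filter is fully pipelined, a new (candidate, stub) pair can enter every clock cycle, and one KF processor therefore interleaves many independent candidates.

```python
# Toy model of a fully pipelined filter: one new item enters per clock
# cycle, and each result emerges after the fixed pipeline latency, so
# throughput is one update per cycle regardless of the 45-cycle latency.
from collections import deque

LATENCY = 45  # single-iteration latency, in clock cycles, from the text

def run_pipeline(inputs):
    """inputs: list of (candidate_id, stub); returns (output cycle, item)."""
    pipe, out, cycle = deque(), [], 0
    for item in inputs:                    # one new item per cycle
        pipe.append((cycle + LATENCY, item))
        cycle += 1
    while pipe:
        out.append(pipe.popleft())
    return out

# Interleaving stubs from two independent candidates:
results = run_pipeline([("cand0", "s0"), ("cand1", "s0"), ("cand0", "s1")])
print(results[0])                      # first result emerges at cycle 45
print(results[1][0] - results[0][0])   # successive results 1 cycle apart
```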

Summary
MaxJ, a high level synthesis language, has been explored as a design tool for trigger applications in High Energy Physics. By implementing existing algorithms from the CMS calorimeter trigger, written originally in VHDL, in MaxJ, we have shown that MaxCompiler can produce identical results, using 7% more LUT slices and the same number of DSPs and BRAMs. The MaxJ code for the calorimeter trigger application requires approximately half the number of code lines of the VHDL. The smaller code-base, together with the higher level of abstraction offered by the MaxJ language, suggest the possibility of developing more maintainable, and more widely understood, trigger algorithms.
We have then demonstrated the capabilities of the compiler by developing a complex track fitting algorithm for the CMS Time Multiplexed Track Trigger demonstrator. Features of the tool which assist the development of such an application have been highlighted, including the fixed-point optimisations and the simulation of the firmware interfaced with a C++ application. The resulting firmware achieves a good matching with a C++ floating-point implementation, and each Kalman Filter utilises just 1% of the resources of a Xilinx Virtex-7 XC7VX690T, allowing 36 to be used in parallel at 240 MHz. The Kalman Filter must now be integrated into the Time Multiplexed Track Trigger demonstrator system.