Implementation of FPGA-based level-1 tracking at CMS for the HL-LHC

A new approach to track reconstruction is presented for use in the all-hardware first level of the CMS trigger. The approach is intended for the upgraded all-silicon tracker, to be installed for the High Luminosity era of the LHC (HL-LHC). The upgraded LHC machine is expected to deliver a luminosity on the order of 5 × 10³⁴ cm⁻²s⁻¹, corresponding to about 125 pileup events in each bunch crossing at a frequency of 40 MHz. To keep the CMS trigger rate at a manageable level under these conditions, quick decisions must be made on the events to be processed. The algorithm is estimated to complete in under 5 µs, well within the requirements of the CMS L1 trigger for track identification. The algorithm is integer-based, allowing it to be implemented on an FPGA. We are currently working on a demonstrator hardware implementation using a Xilinx Virtex-6 FPGA. Results from C++ and Verilog simulations are presented, showing the algorithm's performance in terms of data throughput and parameter resolution.


Algorithm overview
The approach presented here is similar to offline track reconstruction methods used by the CMS and ATLAS experiments [2,3]. With advances in integrated circuit technology, such complex algorithms can now be used in an online environment such as the level-1 trigger of the CMS detector. The algorithm is divided into sequential steps that allow for parallelism in the track processing. The track finding is seeded by pairs of stubs from adjacent layers in the tracker that are combined to form 'tracklets'. Then, using the detector origin as a constraint in the r − φ plane, we calculate an initial estimate of the track parameters of the tracklet. We select tracklets with pT of at least 2 GeV and a longitudinal impact parameter of |z0| < 15 cm. The tracklets are then projected to other layers assuming a uniform magnetic field, both outside-in and inside-out depending on the layers used for seeding. For example, if a track was seeded in layers 1 and 2, we project to layers 3, 4, 5, and 6; if instead it was seeded in layers 3 and 4, we project inward to layers 1 and 2 and outward to layers 5 and 6. We then look in the projected layers for stubs consistent with the trajectory of a high-pT track. If a stub is found within a given window around the expected track projection, we include it in the track candidate and store the residual, i.e. the difference in position. Using these residuals together with precalculated derivatives, we perform a linearized χ² fit to correct the initial track parameters. Since we seed in multiple pairs of layers, a track can be reconstructed more than once; in the last step we remove the duplicates arising from the multiple seeding. Finally we pass the tracks to the global trigger, which can then associate the level-1 tracks with muons, electrons, and jets and make a better decision on which events to keep.
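To make the seeding step concrete, the following floating-point sketch computes the initial tracklet parameters from a stub pair, assuming a uniform 3.8 T field and the small-angle approximation φ(r) ≈ φ0 − r · rinv/2 for a trajectory through the beamline; the actual implementation is integer-based, and all names here are illustrative rather than taken from it.

    #include <cmath>
    #include <optional>

    struct Stub   { double r, phi, z; };          // cm, rad, cm
    struct Params { double rinv, phi0, t, z0; };  // curvature (1/cm), azimuth, r-z slope, impact (cm)

    std::optional<Params> seedTracklet(const Stub& s1, const Stub& s2) {
        const double dr = s2.r - s1.r;
        if (dr == 0.0) return std::nullopt;                // stubs must be on different radii
        // Curvature from the azimuthal bend between the two layers.
        const double rinv = -2.0 * (s2.phi - s1.phi) / dr;
        const double phi0 = s1.phi + 0.5 * s1.r * rinv;
        // A straight line in the r-z plane gives the slope and impact parameter.
        const double t  = (s2.z - s1.z) / dr;
        const double z0 = s1.z - s1.r * t;
        // pT [GeV] = 0.3 * B[T] * R[m]; with r in cm, pT = 0.3 * 3.8 * 0.01 / |rinv|.
        const double pt = 0.3 * 3.8 * 0.01 / std::fabs(rinv);
        if (pt < 2.0 || std::fabs(z0) > 15.0) return std::nullopt;  // seeding cuts
        return Params{rinv, phi0, t, z0};
    }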

Algorithm timing
The space available to store each event in a buffer limits the time allowed for the trigger decision. With the LHC bunch crossing rate of 40 MHz, the level-1 trigger decision must be made in less than 10 µs, and thus the track reconstruction must be completed in approximately 5 µs to leave time for the global trigger to associate tracks with the other physics objects and make a decision. To increase the time available for processing, we consider time multiplexing the system by a factor of four, so that each copy of the system receives a new event every 100 ns. The hardware implementation of the algorithm can be done in eight steps for a total of 800 ns. These steps are:
1. Sort the input stubs by their corresponding layer. Each stub stores a coarse estimate of the pT and the coordinates in r, φ, and z.
2. Sort the stubs into bins of z and φ and store a reduced version of the data, which contains only a few of the most significant bits of the stub coordinates.
3. Select possible tracklets from allowed stub pairs. A lookup table is used to check the consistency of the stubs with a high-pT track coming from the origin.
4. For the tracklets selected in the previous step, calculate an initial estimate of the track parameters and the projected positions in the other layers.
5. Route the projections into bins of z and φ, as in step 2.
6. Find stubs matching the projected tracklet, using the reduced stub data.
7. Calculate the difference in position between the stubs and the projected tracklet.
8. Calculate the corrections to the initial estimates of the track parameters using the differences obtained in the previous step and lookup tables for precalculated derivatives.
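Step 8 amounts to a small matrix-vector product mapping the stored residuals onto corrections to the four track parameters. The following sketch illustrates the update in floating point; the dimensions (up to four projected layers, each contributing a φ and a z residual) and all names are illustrative assumptions, and the firmware uses integer arithmetic with tabulated derivatives.

    #include <array>

    constexpr int kPar = 4;   // rinv, phi0, t, z0
    constexpr int kRes = 8;   // assumed: 4 projected layers x (phi, z) residuals

    // deriv[p][r]: d(parameter p)/d(residual r), precalculated offline for a
    // given seeding/projection layer combination and stored in a lookup table.
    using DerivTable = std::array<std::array<double, kRes>, kPar>;

    std::array<double, kPar> fitCorrections(const DerivTable& deriv,
                                            const std::array<double, kRes>& residual) {
        std::array<double, kPar> delta{};   // corrections to the initial estimate
        for (int p = 0; p < kPar; ++p)
            for (int r = 0; r < kRes; ++r)
                delta[p] += deriv[p][r] * residual[r];   // one multiply-accumulate
        return delta;                        // new params = initial + delta
    }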
Every 100 ns a new event is received and each earlier event moves on to the next step, so a different event is being processed in each step of the algorithm. The number of objects that can be processed depends on the clock speed achieved by the hardware. A block diagram of the algorithm is shown in figure 1, where each module is replicated four times to represent the time multiplexing.

Figure 1. Block diagram of the tracking algorithm. Every module is replicated four times to represent the time multiplexing of the system. The detector sends data every 25 ns, so each copy receives data every 100 ns.
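The resulting event flow can be modeled with a toy simulation such as the sketch below; the factor of four and the eight steps are taken from the text above, while everything else is illustrative.

    #include <array>
    #include <cstdio>

    int main() {
        constexpr int kCopies = 4;   // time-multiplexing factor
        constexpr int kStages = 8;   // processing steps 1-8
        // pipeline[c][s] = event id currently in stage s of copy c (-1 = empty)
        std::array<std::array<int, kStages>, kCopies> pipeline;
        for (auto& p : pipeline) p.fill(-1);

        for (int bx = 0; bx < 16; ++bx) {    // bunch crossings, 25 ns apart
            int copy = bx % kCopies;         // round-robin event dispatch
            // advance this copy's pipeline by one step (one new event per 100 ns)
            for (int s = kStages - 1; s > 0; --s)
                pipeline[copy][s] = pipeline[copy][s - 1];
            pipeline[copy][0] = bx;          // event id = bunch crossing number
        }
        // after 16 BX, copy 0 holds events 0, 4, 8, 12 in stages 3, 2, 1, 0
        std::printf("copy 0, stage 3 holds event %d\n", pipeline[0][3]);
        return 0;
    }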

System architecture
We divide the detector into 28 sectors in the r − φ plane (figure 2), such that all tracks with pT > 2 GeV are fully contained in at most two sectors. This avoids large data transfers between sectors beyond the two nearest neighbors. We also divide the barrel into four z regions, which are associated with the input links from the detector. We define a 'virtual module' as a subdivision of a sector in bins of z and φ. Each z region is subdivided into virtual modules separately for even and odd layers: the odd layers are divided into three virtual modules in φ while the even layers are divided into four, with the first and last virtual modules of the even layers shared between adjacent sectors. The subdivision into virtual modules has been optimized so that a stub in a virtual module can only form a tracklet with stubs in two virtual modules of the neighboring layer. Figure 3 shows an example of a high-momentum track (pT > 2 GeV) that is seeded from virtual module 2 in the inner layer and virtual module 3 in the outer layer. A low-momentum track would instead be seeded from modules 2 and 4, so we do not consider that combination of virtual modules. Subdividing φ and z into virtual modules thus reduces the number of possible tracklet combinations before any processing is done.

Figure 3. Virtual module subdivision of a φ sector. A track (green) with pT > 2 GeV is seeded from stubs in virtual modules 2 and 3 in the inner and outer layers. A lower-momentum track (red) has stubs in modules 2 and 4; we therefore do not consider this pair for reconstruction.
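For illustration, the sketch below shows how a stub's sector and virtual module could be looked up with the granularity quoted above (28 sectors; three virtual modules per sector in odd layers, four in even layers); the exact φ offsets and the module overlaps of the real design are not reproduced here.

    #include <cmath>

    struct VMIndex { int sector; int vmPhi; };

    VMIndex locate(double phi, int layer) {
        constexpr double kTwoPi = 6.283185307179586;
        constexpr int kSectors = 28;
        constexpr double kSectorWidth = kTwoPi / kSectors;
        // wrap phi into [0, 2*pi)
        phi = std::fmod(std::fmod(phi, kTwoPi) + kTwoPi, kTwoPi);
        const int sector = static_cast<int>(phi / kSectorWidth);
        const double local = phi - sector * kSectorWidth;   // phi within the sector
        const int nVM = (layer % 2 == 1) ? 3 : 4;           // odd vs even layers
        const int vm = static_cast<int>(local / (kSectorWidth / nVM));
        return {sector, vm};
    }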

Algorithm performance

Occupancy
In minimum-bias events, the large majority of the tracks seen at CMS come from very soft interactions. The input stubs are already required to be consistent with tracks of pT greater than 2 GeV. Most of the stubs that pass this initial requirement are combinatorial and do not correspond to genuine high-pT tracks. There are still on average 65 and 55 stubs per φ sector in the two innermost layers respectively, as seen in figure 4 (left). As a result, we would need to process on average 65 × 55 ≈ 3600 stub pairs per sector with seeding in just the innermost layers. Most of these are fake combinations, and there would be many more if we counted the seeding in other layers. Figure 4 (right) shows the average number of stubs in a virtual module for the two innermost layers. A stub pair from an allowed combination of virtual modules is known as a tracklet candidate.
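Because the stubs are already binned into virtual modules (step 2 above), tracklet candidates are formed only from the allowed virtual-module combinations, as the following sketch illustrates; the allowed-pair list is a hypothetical stand-in for the precomputed geometry table.

    #include <array>
    #include <utility>
    #include <vector>

    // Stubs are already binned by virtual module, so candidates are formed
    // only within allowed VM combinations instead of over all stub pairs.
    std::vector<std::pair<int, int>> trackletCandidates(
            const std::array<std::vector<int>, 3>& innerVM,   // stub ids, odd layer
            const std::array<std::vector<int>, 4>& outerVM,   // stub ids, even layer
            const std::vector<std::pair<int, int>>& allowedVM) {
        std::vector<std::pair<int, int>> candidates;
        for (const auto& [ivm, ovm] : allowedVM)    // e.g. (2,3) allowed, (2,4) not
            for (int i : innerVM[ivm])
                for (int o : outerVM[ovm])
                    candidates.emplace_back(i, o);  // a tracklet candidate
        return candidates;
    }

With about 65 and 55 stubs per sector in the innermost layers, restricting the loops to allowed virtual-module pairs avoids forming most of the roughly 3600 possible combinations in the first place.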

Parameter resolution
We estimate the resolution of the track parameters using a standalone C++ implementation of the algorithm with both floating-point and integer precision. Figure 6 shows the resolution of the track parameters for the two implementations. The parameter resolution is calculated as the difference between the values generated in the Monte Carlo event simulation and the output of the algorithm.
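The integer version can be emulated in C++ by storing each parameter as an integer multiple of a fixed least significant bit, as in the sketch below; the bit widths and LSB values shown are illustrative assumptions, not the ones used in the algorithm.

    #include <cmath>
    #include <cstdint>

    // Quantize a value to a signed fixed-point word with `bits` total bits,
    // where `lsb` is the value of the least significant bit.
    int32_t toFixed(double value, double lsb, int bits) {
        const int32_t maxCode = (1 << (bits - 1)) - 1;
        int32_t code = static_cast<int32_t>(std::lround(value / lsb));
        if (code > maxCode) code = maxCode;          // saturate on overflow
        if (code < -maxCode - 1) code = -maxCode - 1;
        return code;
    }

    double fromFixed(int32_t code, double lsb) { return code * lsb; }

    // Illustrative example: store z0 (|z0| < 15 cm) in a 12-bit word.
    // double z0rec = fromFixed(toFixed(z0, 30.0 / 4096, 12), 30.0 / 4096);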

Hardware implementation
The algorithm presented here can be implemented in commercial hardware, taking advantage of the flexibility offered by Field Programmable Gate Arrays (FPGAs). As a first step towards a complete hardware implementation, we have simulated the algorithm in Verilog; the simulation reproduces the results of the C++ integer version exactly. We are using the Gigabit Link Interface Board (GLIB) [4], based on a Xilinx Virtex-6 FPGA, as a test system, as it provides a relatively easy interface through IPBus. We are currently porting the algorithm to the FPGA, and the calculations implemented so far produce the expected results.

System scale
With newer technology becoming available, we plan to use Xilinx Virtex-7 FPGAs in our design. These FPGAs provide several times more resources than the Virtex-6, and we expect that a single chip could be enough to implement one sector. Given the expected stub density at the upgraded detector, each of these chips would receive approximately 200 Gb/s and send out about 100 Gb/s of data. The IO blocks needed for this data flux are already present in ATCA blades carrying Virtex-7 FPGAs, such as the Pulsar IIb [7]. Taking into account the factor of four time multiplexing, the system would require on the order of 100 FPGAs. We hope to benefit from experience already gained by other systems in CMS, such as the trigger or the DAQ. With the resources available in current FPGAs, and certainly with the technology that will come before the HL-LHC era, we estimate, based on the resource requirements for memories, logic, and DSP slices, that a small number of chips can process an entire φ sector. Xilinx has introduced its new generation of products, called "UltraScale", which already promises many more resources and up to 90% utilization without performance degradation [5]. Other manufacturers are developing FPGAs based on 14 nm technology, which will bring chips with higher density and lower power consumption [6].

Summary
Including tracking in the level-1 trigger is a requirement for the CMS experiment at the HL-LHC in order to cope with the challenging environment at the increased luminosity. We have presented a possible approach to tracking based on seeding tracklets and extrapolating to other layers, where we look for matching hits. Simulations have shown that this method is viable for an integer implementation in commercial hardware. Work is ongoing towards a slice test using the GLIB, which will provide a more realistic estimate of the processing time as well as the resource usage.