Redundancy or GaAs? Two different approaches to solve the problem of SEU (Single Event Upset) in a Digital Optical Link.

Southern Methodist University, 3215 Daniel Avenue, DALLAS TX 75275, USA.
bdink@mail.physics.smu.edu

Andrieux.M-L, Gallin-Martel.L
Institut des Sciences Nucleaires, 53 Avenue des Martyrs, F-38026 Grenoble Cedex, France.

The Royal Institute of Technology (KTH), Physics Department Frescati, Frescativagen 24, S-10405 Stockholm, Sweden.

Rethore.F
Centre de Physique des Particules de Marseille, 163 Avenue de Luminy, Case 907, F-13288 Marseille Cedex, France

Abstract

The fast digital optical links for the ATLAS Liquid Argon Calorimeter must survive in a high radiation environment with a total fluence of $3 \times 10^{13}$ neutrons (1 MeV Si)/cm$^2$ and a total dose of 10 kGy (Si). The links, based on the Agilent Technologies Glink serializer/deserializer set, show a total dose radiation resistance to neutrons and gammas that would allow for 10 years of operation in the ATLAS detector. We have observed, however, an unacceptable rate of single event upsets (SEU) due to neutrons interacting in the silicon-based serializer.

In order to solve this problem, we have developed two link systems. The first one, Dual-Glink, is based on the principle of redundancy: data are sent on two independent links. On the reception side, data are analyzed and error recovery is performed without dead time. The second solution uses a GaAs serializer/deserializer set from TriQuint, for which we observe a very small number of SEUs. In addition, its high speed of 2.5 Gb/s allows the data to be transmitted twice during one event period and allows for error recovery.

The design of the three types of link, their performance in the laboratory and the results of the radiation tests are presented.

**I. SINGLE GLINK**

A fast digital optical link was developed for the readout of the Front End Boards (FEB) of the ATLAS Liquid Argon calorimeter. This link is based on the serializer/deserializer set HDMP1022/1024 from Agilent Technologies. The block diagram of the link is shown in Fig.1.

We use a multiplexer to convert the input data stream of 32 bits @ 40.08 MHz coming from the FEBs into a stream of 16 bits @ 80.16 MHz, as required by the HDMP1022 input data bus. The serializer, working in double frame mode, provides a 1.6 Gb/s differential output, which is converted into an optical signal by a Methode transceiver (MDX19-4-1-S) that uses an 850 nm VCSEL. A 50/125 graded index optical fiber carries the optical signal 200 meters to the counting room, where the reception part of the link is located. On the reception side we use the receiving part of a Methode transceiver to recover the 1.6 Gb/s electrical signal. This signal is the input to an HDMP1024 deserializer. A demultiplexer programmed in an ALTERA 7128 FPGA transforms the 16 bits @ 80.16 MHz output of the deserializer into a 32 bits @ 40.08 MHz stream identical to that coming from the FEB.
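The width conversion described above can be illustrated with a minimal Python sketch. The function names and the ordering of the two halves (upper 16 bits first) are assumptions made for illustration only; the actual multiplexer is a hardware design and its word ordering is not specified here.

```python
def mux32to16(words32):
    """Split each 32-bit word into two 16-bit words.

    Sending the upper half first is an assumption, not taken from the design.
    """
    out = []
    for w in words32:
        out.append((w >> 16) & 0xFFFF)
        out.append(w & 0xFFFF)
    return out


def demux16to32(words16):
    """Rebuild 32-bit words from consecutive pairs of 16-bit words."""
    if len(words16) % 2:
        raise ValueError("need an even number of 16-bit words")
    return [(hi << 16) | lo for hi, lo in zip(words16[0::2], words16[1::2])]


# The payload rate is unchanged by the conversion (clock values in kHz).
payload_in_kbps = 32 * 40_080    # 32 bits @ 40.08 MHz
payload_out_kbps = 16 * 80_160   # 16 bits @ 80.16 MHz

feb_words = [0xDEADBEEF, 0x00000001, 0xFFFF0000]   # arbitrary test words
roundtrip = demux16to32(mux32to16(feb_words))
```

In this sketch `roundtrip` equals `feb_words`, which mirrors the requirement that the stream delivered after the demultiplexer is identical to the one coming from the FEB.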

The timing quality of the link is best described by its eye diagram, shown in Fig.2. Several prototypes worked in the laboratory for extended periods of time, showing a Bit Error Rate better than $10^{-16}$.

**Figure 2: Eye Diagram of the Single Glink. 100ps/div.**

Since the transmission part of the link must survive for 10 years in a high radiation environment with a total fluence of $3 \times 10^{13}$ neutrons (1 MeV Si)/cm$^2$ and 10 kGy (Si) for photons, we have carried out neutron and gamma radiation tests on the serializer, transceivers and optical fibers with integrated doses exceeding those expected at the LHC. The multiplexer will be designed in DMILL technology and was not part of these tests. The tests allowed us to select and qualify the components of the link: the Methode transceiver was selected for its resistance to neutron radiation [1], and the Plasma Optics fiber for its resistance to gamma radiation.

**A. Photon exposure**

The Plasma Optics optical fiber survived 10 kGy of photon radiation with little change of attenuation (<0.2 dB/meter) [2]. Three Methode transceivers were tested using a Co-60 source with a high photon dose rate of 2.8 kGy/h, i.e., about 400,000 times greater than that expected in ATLAS. They stopped sending light after a total dose ranging from 3.5 kGy to 13 kGy. Since each transceiver contains both electronic components and a plastic lens covering the VCSEL, we attribute this problem to the darkening of the plastic material of the lens. At lower photon dose rates, such darkening anneals with time owing to the transmission of light at 850 nm. To test this, we continued to operate the transceivers without external photon radiation. The transceivers recovered completely after a short period of operation, confirming our belief that the effect was due to the dose rate and not to the total dose. The Glink serializers survived 45.3 kGy. No SEU was observed during irradiation.

**B. Neutron exposure**

The transmission part of the link was tested in neutron radiation. All the components survived the total dose exposure; however, we observed SEUs. Six neutron radiation tests were carried out in 3 different facilities: ISN, CERI and Chalmers. From the SEU rates recorded during these tests, we extrapolated the SEU rate expected in the ATLAS environment [3]. This work took into account the fact that the neutron spectrum used for the irradiation tests is not the same as the neutron spectrum in ATLAS. The latter is known only from simulations and is subject to substantial uncertainty.

Several types of errors were observed during these tests: bit flips, clock corruption, signal inversion, loss of frame and loss of PLL. All of them affect the data transmission in different ways. Data are transmitted from the FEB to the Read Out Drivers (ROD) using a special ATLAS format [4]. The transmission time for one event is about 10 μs. Thanks to the structure of the ATLAS format, and in particular to the parity information contained in each packet, it is possible to detect almost every type of error by analysing the data at the receiver part of the link.

Each type of error generates a dead time if we assume that events with corrupted transmission are discarded. Transitory errors such as bit flips and missing clock pulses induce a dead time of about 10 μs, which corresponds to one event only. We also observed long lasting errors, which are the dominant cause of the overall dead time. These consist of losses of frame and losses of PLL. When such an error occurs, the link is blind for about 10 ms, which corresponds to about 1000 ATLAS events.
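The relation between error type and lost events can be written down as a small bookkeeping sketch. The two durations (about 10 μs and about 10 ms) and the 10 μs event time come from the text; the dictionary keys, the function names and the error counts in the example are hypothetical, and treating each class as one fixed duration is a simplification.

```python
EVENT_US = 10  # transmission time of one event, in microseconds

# Approximate blind time per error, in microseconds.
BLIND_US = {
    "bit_flip": 10,           # transitory
    "clock_corruption": 10,   # transitory (missing clock pulses)
    "loss_of_frame": 10_000,  # long lasting
    "loss_of_pll": 10_000,    # long lasting
}


def events_lost(error_type):
    """Number of events discarded for one error (ceiling division)."""
    return -(-BLIND_US[error_type] // EVENT_US)


def dead_time_seconds(error_counts):
    """Total dead time for a dictionary {error_type: number_of_errors}."""
    total_us = sum(BLIND_US[kind] * n for kind, n in error_counts.items())
    return total_us / 1_000_000


# Hypothetical counts, chosen only to show why long lasting errors dominate.
example = {"bit_flip": 100, "loss_of_pll": 3}
example_dead_time = dead_time_seconds(example)  # 100*10 us + 3*10 ms
```

With these made-up counts, three PLL losses contribute 30 ms while one hundred bit flips contribute only 1 ms, which illustrates why the long lasting errors are the dominant cause of dead time.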

The detailed study of the extrapolation of the SEU-induced dead time to a system of 1600 Glinks in the ATLAS environment is presented in [3]. The extrapolated error rate is 0.65 error/link/hour, which corresponds to a dead time of about 10 hours over the 10 years of LHC running. Given the uncertainties of the extrapolation procedure, this dead time is not negligible, and two more links were therefore developed with the aim of reducing it significantly.

**II. THE DUAL-GLINK DESIGN**

**A. CONCEPT AND DESCRIPTION**

The “Dual Glink” concept is based on the assumption that the neutron-generated single event upsets are distributed randomly in time and that, therefore, the probability of two SEUs occurring within the same event is negligible given the expected neutron flux. The incoming digitised parallel data stream of 32 lines at 40.08 MHz is split inside the link front-end multiplexer into two sets of 16 lines driven at 80.16 MHz. The information on those two sets of lines is transmitted via two independent Glinks operating in double frame mode. On the receiving board, the outputs of the two receivers are fed into an FPGA-based switch selector, which checks the parity of each of the two transmissions, compares each to the parity encoded in the data, checks the status of the control Bit 15 in each link and finally compares the two data streams to each other. If the data from one of the two links are corrupted, the switch sends out the data stream from the other link. The output contains 8 additional flags monitoring the quality of the data and the status of the links. In the unlikely case of both links having transmission problems, these output flags provide “invalid data” information to the ROD. The only detectable but unrecoverable type of error is a simultaneous flip of an even number of bits in the transmitted word, which gives the same parity as that encoded by the FEB. The block diagrams of the transmission side and of the reception side of the “Dual Glink” system are shown in Fig.3 and Fig.4.
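The decision made by the switch selector can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a single parity bit per 16-bit word is assumed (the actual parity layout is defined by the ATLAS format [4]), the Bit 15 check and the 8 status flags are omitted, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


def parity_bit(word: int) -> int:
    """Parity of a 16-bit word (assumed scheme: 1 if the number of ones is odd)."""
    return bin(word & 0xFFFF).count("1") & 1


@dataclass
class Selection:
    word: Optional[int]  # selected data word, or None if no valid copy exists
    source: str          # "A", "B" or "none"
    valid: bool          # False corresponds to the "invalid data" information


def select(word_a: int, parity_a: int, word_b: int, parity_b: int) -> Selection:
    """Choose between the two copies of a word received on links A and B."""
    ok_a = parity_bit(word_a) == parity_a
    ok_b = parity_bit(word_b) == parity_b
    if ok_a and ok_b:
        if word_a == word_b:
            return Selection(word_a, "A", True)
        # Both parities pass but the copies differ: an even number of bits
        # flipped on one link. Detected, but it is unknown which copy is right.
        return Selection(None, "none", False)
    if ok_a:
        return Selection(word_a, "A", True)
    if ok_b:
        return Selection(word_b, "B", True)
    return Selection(None, "none", False)


good = 0x1234                # arbitrary test word
p = parity_bit(good)         # parity as it would be encoded by the sender
single_flip = select(good ^ 0x0001, p, good, p)  # recovered from link B
double_flip = select(good ^ 0x0003, p, good, p)  # detected, not recoverable
```

The `double_flip` case reproduces the situation described above: the parity check alone cannot identify the corrupted copy, so the word can only be flagged as invalid.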

![DUAL Glink Block Diagram. Transmission side.](image)

**Figure 3:** DUAL Glink Block Diagram. Transmission side.

![DUAL Glink Block Diagram. Reception side.](image)

**Figure 4:** DUAL Glink Block Diagram. Reception side.

The switch must take into account the fact that the two transmissions are not necessarily aligned in time (different lengths of fibers, different delays in electronic devices, etc.). Furthermore, when a PLL type of error occurs on a link, the data and clock from that link are corrupted. The clock may even be absent for a while. This means that neither of the link clocks can be used as the "master clock" for the design. We therefore use an externally provided 40.08 MHz ATLAS clock onto which we resynchronise the data streams.

The switch uses dual-clock FIFOs as derandomisers. Data are written into each FIFO using the clock from the corresponding link. Data are read from the FIFOs using the ATLAS clock. Synchronisation of the two incoming data streams relies on the ATLAS format used to code the data [4]. When the Start Of Event is detected on one of the links, the authorisation to write into the corresponding FIFO is set permanently. As soon as both FIFOs have received the Start Of Event, the authorisation to read from both FIFOs is set permanently. The links are then synchronised. An error occurring on one link is detected in the module “ERROR DETECTION/ATLAS FORMAT”, which can be seen on the switch block diagram presented in Fig.5. The output switch first checks for errors on the other link and then directs the data from this other link to the output. The FIFOs are made deep enough to permit this switching to be done before corrupted data arrive at the output of the FIFO. The FIFO that belongs to the error side is then cleared and writing into it is forbidden until a new Start Of Event is detected. This is the sign that the link has come back to normal operation. The link is thus automatically resynchronised. The switch program has been embedded into an ALTERA Flex10k50E. It uses 30% of the available Embedded Array Blocks and less than 10% of the usable memory. For production, though, we would recommend the use of a component chosen from the newer and cheaper ACEX family.
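The write gating and automatic resynchronisation described above can be mimicked by a toy software model. The sketch below is not the FPGA design: the class and method names are hypothetical, the two clock domains are reduced to one loop, the extra FIFO depth that hides the switching latency is not modelled, and the rule used to let a recovered link rejoin the read-out (on the next event boundary of the active link) is an assumption.

```python
from collections import deque

SOE = "SOE"  # stand-in for the Start Of Event word of the ATLAS format


class LinkFifo:
    def __init__(self):
        self.words = deque()
        self.write_enabled = False  # set at Start Of Event, cleared on error
        self.read_enabled = False

    def write(self, word):
        """Link clock side: writing starts at a Start Of Event."""
        if word == SOE:
            self.write_enabled = True
        if self.write_enabled:
            self.words.append(word)

    def flag_error(self):
        """Clear the FIFO and forbid writing until the next Start Of Event."""
        self.words.clear()
        self.write_enabled = False
        self.read_enabled = False


class Switch:
    def __init__(self):
        self.a, self.b = LinkFifo(), LinkFifo()
        self.active = self.a

    def other(self, link):
        return self.b if link is self.a else self.a

    def flag_error(self, link):
        link.flag_error()
        if self.active is link:
            self.active = self.other(link)

    def read(self):
        """ATLAS clock side: return one word from the active link, or None."""
        act, stb = self.active, self.other(self.active)
        if not act.read_enabled:
            # Start only once both FIFOs have received a Start Of Event.
            if not (act.words and stb.words):
                return None
            act.read_enabled = stb.read_enabled = True
        if not act.words:
            return None
        if (not stb.read_enabled and stb.words
                and stb.words[0] == SOE and act.words[0] == SOE):
            stb.read_enabled = True  # rejoin on an event boundary (assumption)
        word = act.words.popleft()
        if stb.read_enabled and stb.words:
            stb.words.popleft()      # keep the standby FIFO in step
        return word


stream = [SOE, 1, 2, SOE, 3, 4, SOE, 5, 6]  # three short hypothetical events
sw = Switch()
out = []
for tick, word in enumerate(stream):
    if tick == 4:
        sw.flag_error(sw.a)  # link A fails during the second event
    sw.a.write(word)         # ignored while link A waits for a Start Of Event
    sw.b.write(word)
    out.append(sw.read())
```

In this run the output stream is taken from link B after the failure and remains complete, while link A starts writing again at the third Start Of Event.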

![DUAL Glink Switch Block Diagram.](image)

**Figure 5:** Dual Glink SWITCH Block Diagram.

**B. PERFORMANCE IN LABORATORY**

The prototype Dual Glink link worked very well in the laboratory. In order to test the design, we injected errors externally via a generator. Single bit flips, clock corruption, and frame and PLL losses were properly recovered by the link, even when injected at an unrealistically high rate of several hertz.

**C. PERFORMANCE UNDER RADIATION**

A neutron radiation test was performed with the Dual Glink. A custom Bit Error Tester was designed to check the quality of the transmitted data bit by bit. As expected, the SEUs occurring on the two sides are not correlated. The switch recovers almost all errors. The residual rate of errors corresponding to an even number of bit flips, which are detected but not corrected, is extremely low. Extrapolated to the ATLAS radiation environment, this rate corresponds to a dead time of about 5 seconds over 10 years of operation for a system of 1600 links. We consider it negligible.
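The benefit of uncorrelated upsets can be put into numbers with a rough Poisson estimate. The sketch below uses the Single Glink figures quoted earlier (0.65 error/link/hour, a blind period of about 10 ms, an event time of about 10 μs) and assumes independent links with a constant rate. It is an order-of-magnitude illustration only, not the extrapolation procedure of [3], and it is unrelated to the measured residual rate quoted above.

```python
import math

RATE_PER_HOUR = 0.65     # extrapolated errors per link per hour (Single Glink)
BLIND_WINDOW_S = 10e-3   # long lasting error: link blind for about 10 ms
EVENT_WINDOW_S = 10e-6   # transmission time of one event


def p_error_in_window(rate_per_hour, window_s):
    """Probability of at least one error in the window for a Poisson process."""
    mean = rate_per_hour / 3600.0 * window_s
    return -math.expm1(-mean)  # 1 - exp(-mean), accurate for tiny means


# Chance that the second link is hit while the first one is blind.
p_during_blind = p_error_in_window(RATE_PER_HOUR, BLIND_WINDOW_S)  # about 1.8e-6

# Chance that the second link is hit within one given event.
p_during_event = p_error_in_window(RATE_PER_HOUR, EVENT_WINDOW_S)  # about 1.8e-9
```

Under these assumptions, only about two in a million long lasting errors on one link would overlap with an error on the other link.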

**D. CONCLUSION**

The Dual Glink system provides a satisfactory answer to the problem of Single Event Upsets recorded during neutron irradiation. Such a system can recover from even higher SEU rates and provides a high safety margin in the extrapolation to the ATLAS conditions. As a by-product, it also provides a high level of immunity against single point failures (dead lasers, broken fibers, failed chips). It is a very robust design.

**III. THE TRIQUINT DESIGN**

**A. CONCEPT AND DESCRIPTION**

The “TriQuint solution” is based on GaAs serializer/deserializer chips produced by the TriQuint Semiconductor Company. This technology is radiation resistant [5] and well suited for high speed electronics. The TriQuint TQ8213 and TQ8223 chips are designed for the SONET OC48 standard of about 2.48 Gb/s. Although their nominal base frequency is 77.76 MHz, the set can operate at a frequency that is 3% higher, equal to twice the ATLAS data transmission frequency of 40.08 MHz. The chips provide simple serialization without encoding, framing or DC balance. The only control information provided by the reception chip is a flag that indicates a loss of PLL.
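The frequency figures quoted above can be checked with a short calculation. The sketch uses the standard OC-48 line rate of 2488.32 Mb/s; the assumption that the serialization ratio implied by the nominal figures is kept when the chips run at twice the ATLAS clock is ours.

```python
from fractions import Fraction

oc48_line_rate = Fraction("2488.32")  # Mb/s, SONET OC-48
nominal_base = Fraction("77.76")      # MHz, nominal base frequency
atlas_clock = Fraction("40.08")       # MHz

# Bits serialized per base clock period, as implied by the nominal figures.
ratio = oc48_line_rate / nominal_base

# Operating point used for the link: twice the ATLAS clock.
link_clock = 2 * atlas_clock                # 80.16 MHz
overclock = link_clock / nominal_base - 1   # fraction above nominal
line_rate = link_clock * ratio              # Mb/s, assuming the same ratio

overclock_percent = round(float(overclock) * 100, 2)
```

The ratio comes out as 32, the operating frequency is about 3.1% above nominal, and the resulting line rate is 2565.12 Mb/s, consistent with the 2.5 Gb/s quoted for the link.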

The concept of the TriQuint link is to provide a framing device at the interface between the parallel signals from the FEB and the link, and then to send the data twice at high speed over a link consisting of a serializer, a transceiver and a fiber. On the receiving end, the output of the deserializer is stripped of its frame in a programmable logic device and one set of data is sent to the ROD. In order to maintain the DC balance of the serial stream, the data are sent once normally and once inverted.
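Sending each word once normally and once inverted has two properties that a few lines of Python make explicit: the pair always contains as many ones as zeros, and a receiver can check that the two copies are exact complements. The sketch below assumes a 32-bit word and ignores the frame information added by the framer; the function names are hypothetical, and the comparison shown is one possible way of detecting errors rather than a description of the actual framer logic.

```python
WIDTH = 32                # assumed word width
MASK = (1 << WIDTH) - 1


def encode(word):
    """Return the word followed by its bitwise complement."""
    return (word & MASK, ~word & MASK)


def ones(x):
    """Number of bits set to one."""
    return bin(x).count("1")


def decode(first, second):
    """Return the word if the two copies are exact complements, else None."""
    if first ^ second == MASK:
        return first
    return None


pair = encode(0x0000000F)                    # a strongly unbalanced word
balance = ones(pair[0]) + ones(pair[1])      # always equals WIDTH
recovered = decode(*pair)                    # 0x0000000F
corrupted = decode(pair[0] ^ 0x1, pair[1])   # single bit flip is detected
```

Whatever the data word, the pair carries exactly 32 ones out of 64 bits, which is what keeps the serial stream DC balanced.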

The block diagram of the link is shown in Fig.6.

**B. PERFORMANCE IN LABORATORY**

For the laboratory prototype, the framer does not have to be radiation tolerant. It is implemented in an ALTERA 7128 FPGA. For operation in ATLAS, the framer will be designed in DMILL technology. The block diagram of the framer is shown in Fig.7.

The bandwidth-length product of the Plasma Optics fiber allows it to be used for this link as well. Since the transceivers must operate at 2.5 Gb/s, we selected the MLC-25-7-TL 2.5 Gb/s Methode transceiver. On the reception side we use a 2.5 Gb/s Finisar transceiver FTRG-8519-1.

Several TriQuint links worked in the laboratory. The Bit Error Rate has been measured to be better than $2\times10^{-15}$. The eye diagram of the link can be seen in Fig.8.

![Eye Diagram](image)

**Figure 8: Eye Diagram for the TriQuint link.**

**C. PERFORMANCE UNDER RADIATION**

We have not yet had time to perform a gamma radiation test on the TriQuint link. However, we are confident that the link will not be affected by gamma radiation. The serializer is expected to survive at least 100 kGy, given irradiation data from a TriQuint report [5], and the 2.5 Gb/s transceivers are expected to have properties similar to those of the lower speed versions used for the Glink links. This has been confirmed in the neutron case.

A neutron radiation test was performed on 3 boards equipped with TriQuint serializers and Methode 2.5 Gb/s transceivers. In the absence of a framer, only constant, balanced data were sent and analysed on line at the reception side. All 3 serializers and transceivers survived total doses equivalent to up to 60 years of operation in ATLAS. A small rate of SEU was observed. A board that was exposed to the total fluence expected in 60 years at ATLAS had a total of 2239 errors. We did not observe any bit flips. All errors consisted of a loss of frame for a time varying between 25 and 30 $\mu$s. Their origin is not yet fully understood. Extrapolated to ATLAS, these errors correspond to an integrated dead time of about 115 seconds over 10 years of operation for a system of 1600 links, which is negligible.

**D. CONCLUSION**

The link designed with a GaAs serializer/deserializer set was successfully tested and provides another solution to the SEU problem. SEUs are still observed, but the corresponding dead time, as extrapolated to ATLAS, is negligible (115 seconds over 10 years of operation). The design uses a custom-designed framer that would have to be ported to DMILL technology for operation in ATLAS. As a by-product, the framer provides true error detection. No error recovery is necessary. It is a valid option for the Front-End readout of the ATLAS Liquid Argon calorimeter.

**IV. CONCLUSION**

Both the Dual Glink and the TriQuint-based links provide an efficient answer to the problem raised by the SEUs that affect the Single Glink links in neutron radiation. In both cases the dead time, extrapolated to 10 years of operation in ATLAS, is very low (5 seconds for the Dual Glink and 115 seconds for the TriQuint link). In both cases full error detection is provided. Error correction is provided only by the Dual Glink solution. Both solutions require the development of a DMILL multiplexer or framer working at 80.16 MHz. Both solutions can be considered valid options for the Front-End readout of the ATLAS Liquid Argon calorimeter.

**REFERENCES**

[4] Format for the data read-out from the front-end boards. ATLAS note ATL-AL-LAL-ES-1.0