Analysis of the Failures and Corrective Actions for the LHC Cryogenics Radiation Tolerant Electronics and its Field Instruments

The LHC cryogenic system radiation tolerant electronics and their associated field instruments have been in nominal conditions since before the commissioning of the first LHC beams in September 2008. This system is made of about 15’000 field instruments (thermometers, pressure sensors, liquid helium level gauges, electrical heaters and position switches), 7’500 electronic cards and 853 electronic crates. Since mid-2008 a software tool has been deployed that allows an operator to report a problem and that then lists the corrective actions. The tool is a great help in detecting recurrent problems that may be tackled by a hardware or software consolidation. The corrective actions range from simple resets to the exchange of defective equipment or the repair of electrical connectors. However, a recurrent problem that heals by itself is present on some channels. This type of fault is extremely difficult to diagnose and it appears as a temporary opening of an electrical circuit; its duration can range from a few minutes to several months. This paper presents the main types of problems encountered during the last four years, their evolution over time, the various hardware or software consolidations that have resulted and whether they have had an impact on the availability of the LHC beam.


INTRODUCTION
The Large Hadron Collider (LHC) is installed inside a tunnel of 27 km circumference and most of its length is made of superconducting elements operating at a nominal temperature of 1.8 K or 4.2 K. The LHC operation relies on a huge quantity of sensors and actuators distributed more or less uniformly around the ring. The LHC has very restrictive access constraints, with an environment worse than that found in the industrial world because of the radiation field present when the beam is circulating. Furthermore, to save cabling costs, most of the electronics is installed under the main dipole magnets and is thus also exposed to this hostile LHC environment. The cryogenic instrumentation installed in the tunnel was designed to cope with the LHC requirements and, after more than 4 years of beam operation, the repairs and the consolidation campaigns on this instrumentation can be analyzed statistically.
All instrumentation malfunctions are reported by the operators in an electronic logbook that gives information on the problem, the tag-name of the channel and the associated electronic equipment details, and that allows the actions undertaken by the instrumentation specialists to be tracked. Any sensor, actuator or electronic card that has been installed in the LHC underground area has its history tracked in the LHC quality assurance system [1]; its operational status is listed and is accessed automatically when configuring the control system [2].
The historical analysis of the logged events corresponding to instrumentation malfunctions has made it possible to detect some weak points correlated with a particular component or location. The LHC logging service [3] makes it possible to analyze the historical trends, to check for cross-correlation between channels and to determine the exact timing of an event. Most of the problems are related to electronic equipment and electrical connections, and typically require a reset operation, the replacement of an electronic card or the consolidation of an electrical connector. The most difficult issues to tackle are related to induced noise on some analog measurements, to the recurrent appearance of intermittent electrical disconnections and to external cryogenic process perturbations that look like an instrumentation malfunction. The most reliable equipment are the sensors and actuators, for which most of the damage occurred during the manufacturing of the main components (magnets, cryogenic distribution line, electrical distribution feedbox, etc.) or during the installation in the LHC tunnel. For instance, no thermometers have been damaged during cryogenic operation and 9 problems were reported for the 1343 split profibus® valve positioners (5 cases concerned electrical connections, 3 the compressed air control and one an electronics failure).

CLASSIFICATION OF THE PROBLEMS AND CORRECTIVE ACTIONS
The analysis presented in this paper is based on the 2261 events that report a malfunctioning instrumentation channel; the corresponding period goes from May 2008 until February 2013. After analysis of the logged events, the problems are classified in families according to their origin (see Figure 1a). The families are: "no action" for events that have not resulted in a corrective action because of duplication or a non-applicable request for repair; "process" for hydraulic, thermal or control loop stability issues that are not an inherent problem of the instrument; "electrical", which includes blown fuses, loss of mains supply, electrical interference, damaged cards, erratic behavior, communication network loss, etc.; "radiation", which until now has been observed only as Single Event Upsets (SEU); and "intermittent" for losses of electrical contact that show as the opening of an electrical circuit loop. The corresponding corrective action can range from a simple remote reset all the way to a complex mechanical intervention on the LHC equipment. The most visible events are those that occur during LHC beam operation and that result in the dumping of the beam because the cryogenic control system generates an interlock to protect equipment.
Figure 1a shows the historical accumulation of the events, logged mainly by the LHC operators; the peak in 2009 corresponds to a consolidation campaign that was performed after the LHC accident, and the 28 events for year 2013 are not shown. Figure 1b shows a higher quantity of events for sectors 78 and 81; these were the first installed sectors and were used to fine tune our commissioning procedures. About 20 events per month were logged during the last two years of operation.

Process or Installation Issues
The diagnosis of a faulty instrumentation channel is complicated when its behavior is affected by an inappropriate installation or when it depends on the local thermo-hydraulic conditions. Typical examples are improperly thermalized temperature sensors that overestimate the actual temperature, or superconducting type liquid helium level gauges that are not at the expected geometrical position or that show wrong measurements around the lambda transition.
For the LHC, some superconducting liquid helium level gauges installed in stand-alone superconducting magnets operating at 4.2 K exhibited a very erratic behavior or showed no liquid while the superconducting magnet was overfilled and spilling excess liquid helium into the cryogenic distribution line (QRL). The reason was a very large heat in-leak through the gauge insertion tube, which presented thermo-acoustic oscillations; these were confirmed by measuring a pressure oscillation of about 18 Hz at the level of the warm electrical feed-through (see Figure 2a). This heat in-leak vaporized excess liquid, and the additional gaseous helium circulated through a relatively long and narrow outlet capillary (see Figure 2b). The resulting pressure head causes a much lower level in the gauge reservoir than in the magnet liquid vessel. To mitigate this problem, filling material to damp the thermo-acoustic oscillations was added above the gauge and, in some cases, an additional gaseous helium exhaust was made by connecting the top of the gauge directly to the QRL lines. This problem was considered a significant operational hindrance that required a mechanical intervention at the first possible occasion. The LHC incident of September 2008 provided the occasion to solve this issue by exchanging the long, narrow outlet capillary of 3 mm internal diameter with a short and flexible tube of 6 mm inner diameter.
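The two effects described above can be put into rough numbers. The following is a minimal sketch, not the authors' calculation: it assumes a simple hydrostatic balance for the level difference and laminar (Hagen-Poiseuille) gas flow in the outlet line, where the pressure drop at equal flow scales as length divided by the fourth power of the diameter. The 100 Pa excess pressure and the equal-length comparison are illustrative assumptions; only the 3 mm and 6 mm diameters come from the text.

```python
# Illustrative sketch only: values marked ASSUMED are not taken from the paper.
G = 9.81          # m/s^2, standard gravity
RHO_LHE = 125.0   # kg/m^3, approximate density of liquid helium near 4.2 K


def level_depression_m(delta_p_pa: float, rho: float = RHO_LHE) -> float:
    """Liquid level difference balancing a pressure difference (hydrostatic)."""
    return delta_p_pa / (rho * G)


def laminar_dp_ratio(d_old_m: float, d_new_m: float,
                     len_old_m: float, len_new_m: float) -> float:
    """New/old pressure drop at equal flow, ASSUMING laminar flow,
    where the drop scales as length / diameter**4."""
    return (len_new_m / len_old_m) * (d_old_m / d_new_m) ** 4


# ASSUMED 100 Pa of excess pressure above the liquid in the gauge reservoir.
print(f"level depression: {level_depression_m(100.0) * 100:.1f} cm")

# Diameters from the text (3 mm -> 6 mm); equal lengths ASSUMED for comparison.
print(f"pressure drop ratio: {laminar_dp_ratio(0.003, 0.006, 1.0, 1.0):.4f}")
```

Under these assumptions, doubling the diameter alone reduces the pressure drop by a factor of 16, and shortening the tube reduces it further in proportion to the length.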

Electrical Issues
Electrical issues concern mainly the field electronics that, when installed in the tunnel, is designed to withstand the radiation environment of the LHC [4]. This type of problem can be inherent to a malfunctioning card or due to an external perturbation such as a mains power cut, interference noise, loss of electrical connection, etc.
The mains is supplied by uninterruptible power supplies; however, power cuts do occur due to tests, human action or tripping of the protection switches. When mains power is restored, some electronic circuits do not recover their correct conditions and either a remote or a local reset is required. In view of the large quantity of cards this can draw a significant amount of resources. One such example was the 1'062 galvanic insulated cards for the High Temperature Superconductor (HTS) current leads, which on start-up sent an incorrect range feed-back and often required a local intervention. This problem was solved by reprogramming the communication gateway [5]. This gateway handles not only the communication with the field equipment and the Programmable Logic Controllers (PLCs), but also the conversion into physical units for the sensors, and it provides the raw data used for diagnostics.
A fuse to protect equipment is used on the galvanic insulated readout cards for the HTS lead temperatures and on the digital input cards that monitor the position of mechanical switches activated by the stem of the ON-OFF valves, thermostats and pressure switches. In the case of the current lead measurements, the fuse protects the equipment when there is a dielectric rupture during the electrical qualification of the LHC magnetic circuits; however, from 2009 a large increase in spontaneously blown fuses was observed, which hampered the LHC operation. This problem was observed in four of the 19 protected areas and was traced to an increased temperature in these areas combined with a far too conservative selection of the fuse current. From late 2009 the fuse rating was increased and there has been no further spontaneous loss of channels.
Unexplained noise on signals is observed from time to time and it is often very difficult to understand its origin or to correlate it with the operation of external equipment. The overall system is designed following good engineering practices such as the use of double shielded cables for low level signals and the filtering of analog signals.
Figure 3a shows an unexplained perturbation observed on a magnet temperature. This event lasted for about 5 days and no recurrence has been reported. For the HTS current leads a 100 ohm platinum sensor is used; it is referenced to ground at the level of the power converters that feed the superconducting magnets. The signal is acquired through a card that provides galvanic insulation, and noise can appear when the ground reference of the sensor is made at the level of the HTS lead itself. Until now the power converters and the associated cards for measuring the current lead temperatures have been located in the same area, which means that the ground level is the "same" for both types of equipment. When the ground of the sensor is referenced at the lead, the ground levels probably differ, as the connection is made with cables whose length varies from 40 m to 700 m. The LHC electrical ground is heavily interconnected and presents a very low impedance; however, when the sensor cables are used for estimating ground noise, it does present a rich level of harmonics. Ground noise would be expected to manifest as ac noise around the median level of the expected signal. For the insulated cards, however, a down-conversion mechanism probably exists that transfers the high frequencies into very low frequency components that sometimes result in an effective dc offset (see Figure 3b); note the noiseless record for the excitation current fed into the resistive platinum sensor. This mechanism is difficult to reproduce in the laboratory. Adequate filtering mechanisms are being investigated for future consolidation, and for some leads passive capacitive filtering has been added to tackle this problem.

Radiation Effects
The LHC instrumentation radiation environment depends both on the exact location around the tunnel and on the distance to the beam tubes. The cryogenic sensors and actuators can be exposed to high radiation doses, which are particularly high around the inner triplets. Some instruments such as temperature sensors cannot be repaired, and they were qualified as radiation hard equipment [6] up to doses typical of the LHC Long Straight Section (LSS). Until now, there is no indication of degradation for any type of sensor or actuator due to radiation effects.
Electronic equipment can be very sensitive to radiation, and in exposed areas only custom made, radiation tolerant electronics [4] are deployed for the cryogenic system. Commercial electronics are installed exclusively in "protected" areas; unfortunately fast hadrons are still present there and Single Event Upsets (SEU) cannot be ruled out. An SEU is a transitory, non-destructive radiation effect that changes the binary status of digital integrated circuits. For commercial devices the radiation effects are mitigated by a combination of improved software code and consolidation of hardware equipment, but in the long term radiation issues shall be solved by relocating the sensitive equipment to radiation free areas. The radiation effects on the commercial equipment used for the LHC cryogenics, and the associated mitigation strategy, are reported elsewhere [7].
The custom electronics is deployed in the LHC arcs and dispersion suppressor regions, and in the adjacent "protected" areas. The electronics to be placed inside the tunnel was designed to be radiation tolerant. However, the cards for applying ac power to electrical heaters and the galvanic insulated cards for measuring the HTS current lead temperatures were not intended to be radiation tolerant, although they were designed by using, whenever possible, our library of qualified Commercial Off The Shelf (COTS) components or the analog radiation-hard Application Specific Integrated Circuit (ASIC) used as voltage amplifier and current source for measuring resistance type sensors.
Radiation effects modify most of the electrical parameters of an integrated circuit and can induce Single Event Effects (SEE). Total Integrated Dose (TID) effects are detected through the deviation of dc offsets or bias currents, or through an increase in the consumed electrical power. For the crates installed inside the LHC, no increase in the consumed power is measurable; its variation is mainly correlated with the changes in the tunnel temperature.
SEU were observed exclusively on the galvanic insulated cards of the current leads (see Figure 4). The rate of SEU events depends on the beam luminosity, as can be seen when comparing Figures 4a and 4b: the faster the integrated luminosity increases, the higher the probability of SEU occurrence. The SEU on the insulated cards is traced to a commercial digital insulator circuit that transfers the range control bits. An SEU event switches the nominal excitation current from 100 µA to either 10 µA or 1 µA (see Figure 5a). A current one or two orders of magnitude lower results in a voltage (resistance) measurement reduced by the same proportion, yielding a completely wrong temperature measurement on the operator screens. Figure 5b shows the temperature of an HTS lead when affected by the first ever SEU on this particular equipment; it provoked a beam dump because of a completely wrong temperature measurement at the interface between the HTS and copper conductors. In this particular event, once the "observed" temperature reached about 30 K, the cryogenic operation team gave the green light to apply current to the magnet even though the HTS material was in the non-superconducting state. Such an SEU is potentially dangerous for the integrity of the equipment; fortunately the quench protection detection triggered the discharge of the circuit. This event was thoroughly investigated and, from then on, the signature of these events was well understood by the cryogenic operation team. The cards were consolidated against this type of SEU during the 2011-2012 Christmas shutdown by placing electrical straps that force the excitation current to its nominal value. Since then, not a single SEU has been observed.
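The proportional error caused by a range upset can be illustrated with a few lines of arithmetic. This is a minimal sketch, assuming that the readout keeps converting the measured voltage with the nominal 100 µA while the sensor actually receives 10 µA or 1 µA; the function name and the 20 ohm example resistance are hypothetical and serve only to show the scaling.

```python
# Minimal sketch of the range-upset error; not the actual readout firmware.
NOMINAL_CURRENT_A = 100e-6   # nominal excitation current quoted in the text


def apparent_resistance(true_resistance_ohm: float,
                        actual_current_a: float,
                        assumed_current_a: float = NOMINAL_CURRENT_A) -> float:
    """Resistance inferred when the conversion ASSUMES the nominal current
    although the sensor is excited with a different current."""
    voltage_v = true_resistance_ohm * actual_current_a  # Ohm's law at the sensor
    return voltage_v / assumed_current_a


TRUE_R_OHM = 20.0  # ASSUMED sensor resistance, for illustration only
for upset_current_a in (100e-6, 10e-6, 1e-6):
    r = apparent_resistance(TRUE_R_OHM, upset_current_a)
    print(f"I = {upset_current_a * 1e6:5.0f} uA -> apparent R = {r:.2f} ohm")
```

Because the resistance of a platinum sensor decreases with temperature, an apparent resistance 10 or 100 times too low reads as a temperature far below the real one, which is consistent with the misleading "observed" temperature described above.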

Intermittent Loss of Electrical Continuity
The intermittent loss of electrical connections is a recurrent effect observed for some LHC channels; it manifests either as a channel in error or as a "noisy" measurement, often accompanied by an offset with respect to the expected value. This problem has already caused some beam dumps and many delays in permitting the powering of magnetic circuits because of the loss of appropriate cryogenic conditions. The loss of electrical continuity can be evaluated from the data transmitted by the field electronics, which for resistive sensors include the excitation current amplitude and the associated sensor voltage; on the gateway [5] these values are used to calculate the sensor resistance and to make the conversion into physical units. Figure 6 shows the opening of the excitation current loop (a) for a current lead 100 ohm platinum sensor, and of the sensing voltage terminals (b) for a Cernox™ sensor installed in the QRL. The opening of the excitation current loop is easily detected: in normal conditions the current amplitude has a narrow range of values and its opening sets a very low value. The control system was consolidated to set in error any channel with an abnormal excitation current value. During an intermittent opening of the current loop the inferred temperature may look correct, and it is important to notify the cryogenic operator that the data is erroneous. The duration of the intermittent loss of electrical connections is random and can vary from about half a minute up to half a year (see Figure 7a); the sampling period of the field electronics is 1 second and shorter perturbations cannot be detected. This effect has been observed in 42 channels; about half of them had 2 or more intermittent losses of electrical continuity (Figure 7b). Most events last about a day; this is probably due to the typical repair delay when operation reports malfunctions on critical channels. The period separating two events can range from a relatively short period of a day all the way up to more than four years, as has been observed for a thermometer on an LHC current lead. A disproportionate share of the intermittent losses (78%) is observed on the LHC superconducting current lead sensors; the most probable cause is that these particular channels require continuous monitoring for the LHC beam operation and also have the highest number of electrical interconnects. By contrast, some non-critical problematic channels may be masked by the operators, in which case the systematic reporting of their malfunctions is lost.
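The excitation current check and the measurement of event durations described above can be sketched as follows. This is a hedged illustration, not the actual gateway or control system code: the `Sample` structure, the function names and the 90 µA to 110 µA acceptance window are assumptions chosen for the example, and only the opening of the current loop is covered (an opening of the voltage terminals would need a separate criterion).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Sample:
    t_s: float        # timestamp in seconds (field electronics sample once per second)
    current_a: float  # excitation current reported by the field electronics
    voltage_v: float  # sensor voltage reported by the field electronics


# ASSUMED acceptance window around the nominal 100 uA excitation current.
I_MIN_A = 90e-6
I_MAX_A = 110e-6


def resistance_or_error(s: Sample) -> Optional[float]:
    """Return the sensor resistance in ohm, or None to flag the channel in
    error when the excitation current is outside the accepted window."""
    if not (I_MIN_A <= s.current_a <= I_MAX_A):
        return None
    return s.voltage_v / s.current_a


def open_loop_episodes(samples: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Group consecutive flagged samples into (first, last) timestamp pairs."""
    episodes: List[Tuple[float, float]] = []
    start: Optional[float] = None
    last_bad = 0.0
    for s in samples:
        if resistance_or_error(s) is None:
            if start is None:
                start = s.t_s          # a new episode begins
            last_bad = s.t_s
        elif start is not None:
            episodes.append((start, last_bad))  # episode ended on the previous bad sample
            start = None
    if start is not None:              # record ends while the loop is still open
        episodes.append((start, last_bad))
    return episodes


# Hypothetical record: the loop opens at t = 1..2 s and again at t = 4 s.
record = [Sample(0, 100e-6, 0.010), Sample(1, 1e-9, 0.0), Sample(2, 1e-9, 0.0),
          Sample(3, 100e-6, 0.010), Sample(4, 0.0, 0.0)]
print(open_loop_episodes(record))
```

Flagging the sample instead of computing a resistance from a near-zero current avoids presenting a plausible but meaningless temperature to the operator.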
Intermittent loss of electrical continuity is extremely difficult to diagnose and repair. In some cases the simple disconnection and reconnection of an electrical circuit restores continuity even if the problem has been tracked to a different location. The fault location is found by an iterative process that may involve swapping the cables and electronics of neighboring channels. Most probably the electrical fault occurs at a connector interface, and the reliability of electrical connectors is a widely investigated subject (see for instance [8]). However, most of the studies are concerned with high current or high voltage contacts and little is found for low level analog signals. Nevertheless this issue cannot be neglected because, since 2010, nine LHC beam dumps have been caused by the intermittent opening of electrical connections (see Figure 8).

CONCLUSION
The LHC cryogenic operation is directly dependent on the quality of the instrumentation used for understanding the actual conditions of the cryogenic loads and for managing the distribution of the cryogenic fluids. The LHC depends on a massive number of instruments for which, during the initial years of operation, most of the weak points have been pinpointed and consolidated, as can be seen from the sharp decrease in the logged events since 2011 (Figure 1). To increase the robustness of the LHC cryogenic operation, duplicate communication networks and redundant instrumentation were part of the initial design; this has been instrumental in keeping a very low level of LHC operational disruptions due to the field instrumentation. However, at some point in the future the lifetime of the equipment may be reached or new weaknesses may become evident. When this happens it is important to track and analyze the faults in order to foresee the corrective actions required to operate the LHC, a machine with very limited access and very tough environmental conditions.
The raw size of the cryogenic instrumentation can be illustrated by the fact that it is made of 15'000 field instruments, 7'500 electronic cards and 853 crates. In view of these numbers, the 2'300 events logged since 2008, which presently occur at a rate of about 20 per month, show that the LHC cryogenic instrumentation is indeed very reliable. Furthermore, although no data was presented on broken or damaged electronic cards, since 2009 they amount to 60 cards, which is less than 1% of the total quantity.
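The orders of magnitude quoted above can be checked with a short calculation. This sketch uses only the figures given in the text; the per-instrument yearly rate is a derived illustration that assumes the recent rate of about 20 events per month stays constant, and it is not a reliability figure stated by the authors.

```python
# Figures quoted in the text.
instruments = 15_000
cards = 7_500
events_per_month = 20   # recent logging rate
damaged_cards = 60      # broken or damaged cards since 2009

# Fraction of electronic cards that were broken or damaged.
card_fraction = damaged_cards / cards

# Derived illustration: logged events per instrument per year at the recent rate.
events_per_instrument_year = events_per_month * 12 / instruments

print(f"damaged cards: {card_fraction:.1%}")
print(f"events per instrument per year: {events_per_instrument_year:.3f}")
```

The damaged card fraction of 0.8% agrees with the "less than 1%" stated above.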
The most visible events are those that result in a beam dump, and Figure 8 shows their monthly distribution since April 2010. The beam dumps are catalogued in various sources and 63 events are due to the cryogenic system, which includes SEU on commercial components, cold-compressor faults, process instabilities, etc. The field instrumentation malfunctions account for 16 dumps or 24% of the total attributed to the LHC cryogenics. About half of these dumps were caused by SEU, and the equipment has been consolidated against this type of event since December 2011. As shown in this paper there are some types of faults, like the intermittent loss of electrical contacts, that cannot be predicted, that provoke beam dumps (Figure 8) and whose repair cannot be assumed to be successful. Furthermore, the LHC is presently being upgraded, meaning that the existing instrumentation needs to be commissioned again and that some new weaknesses may become apparent with the increase of the LHC performance.

FIGURE 1. Historical accumulation per year and type (a); and per LHC sector (b) of logged events that report a malfunctioning LHC instrumentation channel.

FIGURE 2. (a) Pressure oscillations observed when the LHC magnet is immersed in liquid helium. (b) Cross section of a magnet, its liquid He level gauge container and interconnecting pipes.

FIGURE 3. Interference or noise induced by external perturbations. (a) Noise on magnet temperature during 4 days. (b) Noise due to a "local" ground reference observed on the current lead temperature; note that no noise is present in the excitation current.

FIGURE 4. SEU and luminosity during year 2011. (a) SEU recurrence per month. The black part of the histogram corresponds to SEU that resulted in a beam dump. (b) Integrated luminosity for the LHC beam in the CMS experiment.

FIGURE 5. SEU on insulated card. (a) Particle switches output on digital insulator of the "Range Controller" and (b) temperature trend for the actual lead temperature (solid line) and as shown on the operator screen (dashed line). The typical nominal temperature is 50 K, as is the case before the SEU.

FIGURE 6. Effects of an intermittent loss of electrical continuity on the thermometer temperature (dashed line), excitation current (continuous line) and sensor voltage (dotted line). Intermittent opening of the (a) current and (b) sensor voltage terminals.

FIGURE 7. Statistics on the intermittent loss of electrical connections: (a) recurrence versus duration of the event and (b) quantity of channels with an intermittent loss versus the loss recurrence per channel.

FIGURE 8. Quantity of beam dumps since 2010 caused by a malfunction on an instrumentation channel.