<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>FPGA-based tracking for the CMS Level-1 trigger using the tracklet algorithm</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>06/01/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10182633</idno>
					<idno type="doi">10.1088/1748-0221/15/06/P06024</idno>
					<title level='j'>Journal of Instrumentation</title>
<idno>1748-0221</idno>
<biblScope unit="volume">15</biblScope>
<biblScope unit="issue">06</biblScope>					

					<author>E. Bartz</author><author>G. Boudoul</author><author>R. Bucci</author><author>J. Chaves</author><author>E. Clement</author><author>D. Cranshaw</author><author>S. Dutta</author><author>Y. Gershtein</author><author>R. Glein</author><author>K. Hahn</author><author>E. Halkiadakis</author><author>M. Hildreth</author><author>S. Kyriacou</author><author>K. Lannon</author><author>A. Lefeld</author><author>Y. Liu</author><author>E. MacDonald</author><author>N. Pozzobon</author><author>A. Ryd</author><author>K. Salyer</author><author>P. Shields</author><author>L. Skinnari</author><author>K. Stenson</author><author>R. Stone</author><author>C. Strohman</author><author>K. Sung</author><author>Z. Tao</author><author>M. Trovato</author><author>K. Ulmer</author><author>S. Viret</author><author>B. Winer</author><author>P. Wittich</author><author>B. Yates</author><author>M. Zientek</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The high instantaneous luminosities expected following the upgrade of the Large Hadron Collider (LHC) to the High-Luminosity LHC (HL-LHC) pose major experimental challenges for the CMS experiment. A central component to allow efficient operation under these conditions is the reconstruction of charged particle trajectories and their inclusion in the hardwarebased trigger system. There are many challenges involved in achieving this: a large input data rate of about 20-40 Tb/s; processing a new batch of input data every 25 ns, each consisting of about 15,000 precise position measurements and rough transverse momentum measurements of particles ("stubs"); performing the pattern recognition on these stubs to find the trajectories; and producing 1Currently at Johnson and Johnson, New Brunswick, NJ, U.S.A. .]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>This paper describes a novel implementation of a charged particle trajectory reconstruction approach based on Field-Programmable Gate Arrays (FPGAs) for the CMS experiment <ref type="bibr">[1]</ref> at the CERN LHC. The LHC accelerator complex will undergo major upgrades, to be completed in 2026, to increase the instantaneous luminosity to approximately 7: 5 10 34 cm 2 s 1 <ref type="bibr">[2]</ref>. The "High-Luminosity LHC" (HL-LHC) upgrades will enable searches for undiscovered rare particle physics processes as well as detailed measurements of the properties of the Higgs boson. The HL-LHC will collide proton bunches every 25 ns, and each of these bunch collisions (an "event") will consist of an average of 200 proton-proton (pp) collisions. Only a small fraction of these collisions are of interest for further studies. A fast real-time selection, referred to as the Level-1 (L1) "trigger", is applied to decide whether a given collision should be considered for further analysis. The L1 trigger is implemented in custom hardware. The number of overlapping pp collisions per event, referred to as pileup (PU), in the HL-LHC era represents a large increase over previous data-taking eras (200 at the HL-LHC versus 30 during LHC Run-2), resulting in a significant challenge to the CMS trigger systemnew handles are required. One such handle is the inclusion of information from the charged particle 2020 JINST 15 P06024</p><p>Figure <ref type="figure">1</ref>. The efficiency of a single muon trigger with a L1 threshold of p T &gt;20 GeV as a function of muon p T (left) and the trigger rate as a function of the muon trigger threshold (right), shown for the stand-alone muon trigger (red) and when including L1 tracking (black) for various pseudorapidity ( ) ranges <ref type="bibr">[3]</ref>.</p><p>tracking system. This will be the first time that information from a solid-state tracking detector has been included in the hardware trigger at the high collision rates of the LHC.</p><p>Integrating charged particle tracking in the L1 trigger will improve lepton identification and momentum measurements as well as provide track isolation and vertex reconstruction. These additional handles have the potential to reduce the L1 trigger rates while maintaining trigger thresholds acceptable for the CMS physics program. One example of this is shown in figure <ref type="figure">1</ref>, where the efficiency of a single muon trigger as a function of the muon generated p T (left) and the corresponding L1 trigger rate (right) are shown <ref type="bibr">[3]</ref>. The red curves show the behavior of a standalone L1 muon trigger system, while the black curves show the performance of the same triggers when including L1 tracking. In particular, L1 tracking improves the momentum measurement, which translates to a sharper turn-on curve at the trigger threshold and hence a reduced trigger rate. The tracks must be delivered within 4 &#181;s in order to be used in the trigger decision.</p><p>To reconstruct the trajectories of charged particles, the CMS experiment includes a tracking detector, which will be replaced for the HL-LHC operation. This upgraded tracking detector will consist of an inner tracker based on silicon pixel modules (not available in the L1 trigger), and an outer tracker based on silicon modules with strips and macro-pixel sensors <ref type="bibr">[4]</ref>. The layout of the upgraded outer tracker, as proposed in ref. <ref type="bibr">[4]</ref>, is illustrated in figure <ref type="figure">2</ref>.1 Particles are produced at the interaction point and travel outwards in a 3.8 T uniform magnetic field, which is parallel to the z axis. The trajectory of a charged particle traversing this magnetic field is bent such that it forms a helix. In the r-plane, the helix forms a circle. The radius of this circle is proportional to the momentum in this plane, the transverse momentum, or p T , of the particle. The tracking detector is based on the concept that charged particles leave energy deposits ("hits") when crossing the sensitive detector material. In the upgraded outer tracker, closely spaced pairs of such hits will be linked on the detector front-end to form "stubs". With a sensor spacing of 1-4 mm, the relative position of the pairs of hits can be used to read out only those stubs that are likely to come from a particle with p T &gt;2 GeV (corresponding to a radius of curvature greater than 1.75 m). This momentum 1We use a right-handed coordinate system, with the origin at the nominal interaction point, the x axis pointing to the center of the LHC, the y axis pointing up (perpendicular to the LHC plane), and the z axis along the counterclockwisebeam. The azimuthal angle is measured in the x-y plane. One quarter of the layout of the upgraded CMS charged particle tracking detector (left), as proposed in ref. <ref type="bibr">[4]</ref>, showing the Tracker Barrel with Pixel Strip modules (TBPS) or Strip-Strip modules (TB2S) and the Tracker Endcap Double-Disks (TEDD). Not shown is the inner pixel part of the CMS charged particle tracking detector, as it is not available to use for the L1 tracking. The LHC beams cross at (0,0). In the central barrel region (extending to jzj 125 cm), there are six layers at radii from 23 cm to 110 cm. In the forward region (covering jzj 125 cm), there are five disks at z positions from 130 cm to 270 cm. The detector is divided into sectors, slices in that cover the entire length of the detector along z, for L1 track finding. In the x-y view of the barrel (right), 28 sectors are shown.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2020">JINST 15 P06024</head><p>selection on the stubs reduces the readout bandwidth requirement by a factor of 10 <ref type="bibr">[3]</ref>. In the L1 trigger, the stubs can be linked together to reconstruct the trajectories of the charged particles.</p><p>The upgraded outer tracker consists of a central portion with six detector layers parallel to the beam line, called the barrel, and five layers perpendicular to the beam line at large jzj, called the disks. The inner three layers in the barrel consist of so-called Pixel-Strip (PS) modules. The modules have a top sensor with 2.5 cm long silicon strips with 100 &#181;m "pitch" (segmentation in the bending plane), and a bottom sensor consisting of 1.5 mm wide macro-pixels, again with 100 &#181;m pitch. The inner three layers of the barrel are further divided into two regions: the "flat" region, near z = 0, where the modules are parallel to the beamline, and the "tilted" region, at higher jzj, where the modules are tilted toward the interaction point. The outer three barrel layers instead use "Strip-Strip" (2S) modules, where both sensors have 5 cm long strips with 90 &#181;m pitch. The modules on each of the five disks in each half of the detector are arranged in concentric circles, or "rings", around the beamline. The two disks closest to the interaction point in each half of the detector have 15 rings (ten consisting of PS modules and five of 2S modules), and the remaining three have twelve rings (seven consisting of PS modules and five of 2S modules). Both PS and 2S modules provide a precise position and momentum coordinate for stubs in the transverse plane, while only the PS modules give a precise measurement of the z coordinate. Further details on the upgraded CMS tracker detector are available in ref. <ref type="bibr">[4]</ref>. This paper discusses the linking of the stubs to form trajectories of charged particles, known as L1 tracking, and its associated challenges. In each collision of counter-rotating proton bunches (every 25 ns), about 15,000 stubs are formed. However, only about 10% of the stubs belong to trajectories of interest, so the majority of stubs needs to be filtered. The remaining stubs are combined to form, on average, 180 trajectories every 25 ns. This is the first time that data from the tracking detector is included in the CMS L1 trigger; previously, the amount of data to be processed and the computational complexity required, placed this task out of reach of FPGAs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2020">JINST 15 P06024</head><p>To summarize, some of the challenges involved in reconstructing trajectories of charged particles for the CMS L1 trigger are:</p><p>&#8226; Absorb approximately 15,000 stubs arriving each 25 ns. The total estimated input bandwidth is in the range of 20-40 Tb/s, depending on the beam conditions and trigger conditions.</p><p>&#8226; Perform pattern recognition to identify the stubs that belong to a given trajectory, rejecting stubs from low-momentum particles.</p><p>&#8226; Fit the stubs to extract optimal trajectory parameters.</p><p>&#8226; Complete all above steps within the available processing time ("latency") of 4 &#181;s, in order to feed into the decision of whether to retain the event, or discard it, before the on-detector data buffers are exhausted.</p><p>The "tracklet" approach for real-time track reconstruction in the hardware-based trigger system of CMS, presented in this paper, is one of three possible implementations that were considered by the collaboration. The other two approaches that were studied were a Hough-transform based approach using FPGAs <ref type="bibr">[5]</ref> and an associative memory based approach using a custom ASIC <ref type="bibr">[6]</ref>. The tracklet approach is a "road-search" algorithm, implemented using commercially available FPGA technology. Ever-increasing capability and programming flexibility make FPGAs ideal for performing fast track finding. The tracklet approach allows a naturally pipelined implementation with a modest overall system size. It also allows for simple software emulation of the algorithm. We present here the design of such a system and results from a hardware demonstrator system that implements end-to-end reconstruction, from input stubs to output trajectories, within the available latency and with a reasonable system size, for a slice of the detector. In addition to the results from the hardware demonstrator, some further improvements to the algorithm are presented.</p><p>Many software-based particle tracking algorithms use a road-search technique where track seeds are found and the trajectories extrapolated to look for matching stubs. This technique works well with the high-precision hits in particle detectors such as the CMS tracker. In the barrel part of the detector, the typical spatial position resolution of the stubs is about 30 &#181;m in r-and either 0.5 mm (inner layers) or 1.5 cm (outer layers) in z in a cylindrical detector volume of about 2 m in diameter and 5 m in length <ref type="bibr">[4]</ref>. Therefore, the search window (road) around the projected trajectory is small and the probability of finding false matches is low. However, with previous generations of FPGAs, the required computational power for implementing this type of tracking algorithm in the trigger was not available. Today, the large number of Digital Signal Processing (DSP) blocks and Random Access Memory (RAM) resources available in FPGAs make such an approach feasible. The use of FPGAs for the implementation of this algorithm provides a good match to the requirements. FPGAs provide high-speed serial, low-latency links that are well suited to bring the data into the FPGA for processing. The resources of FPGAs allow the implementation of a highly parallelized architecture, and their reconfigurability allows flexibility to changing needs. The precise space points from the stubs are used to determine trajectories of the particles using the DSP blocks in the FPGA. The DSP blocks are also used to calculate the final track parameters. The RAMs in the FPGA are used to implement the data movement and distributed storage required for this highly pipelined algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2020">JINST 15 P06024</head><p>This paper is organized as follows. First, the tracklet algorithm is discussed in section 2. Two configurations of the system are discussed in this paper. The first configuration (referred to as "tracklet 1.0") corresponds to that employed for a test that implemented a vertical slice of the system as a hardware demonstrator. Since the hardware demonstrator, extensions to the tracklet approach have been implemented that further improve the load balancing and resulting physics performance. This corresponds to the second configuration (referred to as "tracklet 2.0"). Except as indicated, this section describes both tracklet 1.0 and tracklet 2.0. Section 3 explains the structure of the firmware, which is common for both configurations. In section 4 we detail the results of the hardware demonstrator test, using the tracklet 1.0 configuration. Section 5 describes the foreseen overall system architecture. Section 6 reports on the physics performance of the current system (tracklet 2.0). Further developments are foreseen. These are discussed in section 7, along with conclusions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Tracklet algorithm</head><p>The goal of the real-time hardware-based track finding is to reconstruct the charged particle trajectories as an estimator of their momenta for particles with p T &gt;2 GeV, and to identify the track z 0 position with about 1 mm precision. The z 0 resolution is similar to the expected average separation of proton-proton collisions in the bunch collisions of the upgraded LHC, and thus allows precise determination of the collision vertices. The proposed tracklet method forms track seeds, "tracklets", from pairs of stubs in adjacent layers or disks. The tracklets provide roads to search for compatible stubs that are attached to form track candidates. A linearized 2 -fit determines the final track parameters.</p><p>The tracklet algorithm has been optimized to provide full coverage of the tracker with a small amount of redundancy in data duplication.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Algorithm overview</head><p>The tracklet algorithm proceeds in multiple steps, illustrated in figure <ref type="figure">3</ref>. The algorithm begins with a seeding step, where tracklets are formed from pairs of stubs in adjacent layers or disks. An initial estimate of the tracklet parameters is calculated from the two stubs and using the detector origin as a constraint in the r-plane. Seeds are rejected if they are inconsistent with a track with p T &gt;2 GeV and jz 0 j &lt;15 cm. The seeding is performed in multiple pairs of layers (or disks) to ensure good efficiency for full azimuthal coverage and for redundancy in the system. Nominally, seeding between layers 1+2, 3+4, 5+6, between disks 1+2, 3+4, and between layers 1 or 2 and disk 1, are considered. Different seeding layers can be used in the event of malfunction of a particular layer or disk.</p><p>The tracklets are projected to other layers and disks to search for matching stubs. These projections use predetermined search windows, derived from measured residuals between projected tracklets and stubs in simulated data. The tracklets are projected both inside-out and outside-in, i.e., towards and away from the collision point, as needed for a given seeding combination. If a matching stub is found, the stub is included in the track candidate and the difference between the projected tracklet position and the stub position, "residual", is stored. If there are multiple stubs matched in a given layer or disk, the stub with the smallest residual is retained for the track fit. In the first step (left) pairs of stubs (red stars) are combined to form seeds, or tracklets, for the track finding. Combined with the interaction point (0,0) a helical trajectory for the particle is formed, assuming a uniform magnetic field. This trajectory is projected (middle) to the other layers. Stubs in the other layers that are close to the projection (green stars) are selected as matches (right) to the tracklet to form a track. Final trajectory parameters are calculated using a linearized 2 fit.</p><p>A linearized 2 fit is performed for all stubs matched to the trajectory. The track fit implementation uses pre-calculated derivatives and the tracklet-stub residuals from the projection step. The linearized 2 fit corrects the initial tracklet parameters to give the final track parameters: inverse radius of curvature ( 1 ), azimuthal angle ( 0 ), polar angle (tan ), z intercept (z 0 ), and optionally the transverse impact parameter (d 0 ). The main source of duplicates is due to the same track, with the same hits, being found multiple times using different seeding layers. The initial duplicate removal we studied wanted to identify these duplicates and the use of the 2 from the fit was not initially considered. However, with further studies we have, in tracklet 2.0, now also considered the use of the 2 in the duplicate removal.</p><p>When implementing the algorithm on an FPGA, we work with fixed-precision math and loworder Taylor expansions of trigonometric functions. The number of bits kept are adjusted to ensure adequate precision. The loss of precision from using fixed-point calculations is negligible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Parallelization</head><p>The algorithm is parallelized in the following manner. First, the detector is split along azimuth into sections, called "sectors". The number of sectors is a tunable parameter and is chosen based on an optimization of the cost of FPGAs, the required per-sector input bandwidth, constraints due to the cabling interface of the on-detector electronics, the choice of time-multiplexing factor and the overall algorithm latency, as discussed in the following paragraphs. The nominal choice for the number of sectors is nine, while other values were also considered. For a given bunch crossing, each sector is assigned a dedicated hardware called the sector processor, and tracklet formation and track reconstruction are done in parallel in all of the sector processors. A small amount of data is duplicated near the boundaries of sectors to allow track formation to take place entirely within one sector processor and to avoid gaps in detector coverage. The data duplication scheme will be described in section 5.1.</p><p>The upper limit on the number of sectors is chosen such that a track with largest acceptable curvature (p T = 2 GeV) is contained in at most two sectors. This corresponds to 28 sectors, and</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2020">JINST 15 P06024</head><p>was the configuration used for the tracklet 1.0 hardware demonstrator. A minimal overlap for stubs in the even layers and disks were used, such that we can form the tracklets within a sector and maintain full seeding coverage. However, for matching stubs to projections in tracklet 1.0 we used a nearest neighbor sector communication for the projections that extended outside the sector. In addition to the projections that were sent to the nearest neighbor, matches that were found also needed to be sent back. This means that we had two steps in the algorithm where communication with neighboring sectors were needed. For the tracklet 2.0 implementation, the sector definition was revised to instead use 9 -sectors. These sectors are defined such that they include all stubs required to reconstruct tracks that pass through the sector at r crit = 55 cm. This avoids the need for the nearest neighbor communication and, as a consequence, saves approximately 1 &#181;s of latency.</p><p>The system of sector processors is replicated n times using a round-robin time multiplexing approach. Each system is entirely independent, and therefore, since new data are generated every 25 ns, each independent time multiplexed unit has to process a new event every n 25 ns. As with the number of sectors, the choice of time multiplexing factor n is driven by a balance of cost, efficiency, and needed processing power. For the current system, n = 18 is considered to balance these three factors; that is, each sector processor receives new data every 450 ns. For the 28 sector configuration, n = 6 was chosen, leading to each sector receiving data every 150 ns.</p><p>By construction, the system operates with a fixed latency. Each processing step proceeds for a fixed amount of time. If we have too many objects, some will not be processed, leading to truncation of processing and an algorithmic inefficiency.</p><p>The algorithm is further parallelized within sectors. In the serial algorithm, there are several places where loops over stubs or double loops over pairs of stubs are required. In a naive implementation, the time to process these parts of the algorithm scales as N or N 2 , where N is the number of stubs in the sector, if considering all possible combinations. The number of combinations, or combinatorics, is a challenge to the algorithm. The combinatorics in forming tracklets and matching tracklet projections to stubs is efficiently reduced by dividing sectors into smaller units in z and to allow additional parallel processing. These smaller units are referred to as "Virtual Modules" (VMs). Only a small fraction of VM pairs can form a valid tracklet -the majority would be inconsistent with a track originating at the point of collision and with high enough transverse momentum. Data are distributed into those VMs satisfying these requirements in an early stage of the algorithm. This subdivision efficiently reduces the number of combinations that need to be considered by the algorithm from the start. Additionally, each VM is processed in parallel. At the next stage of the algorithm, the amount of parallelism is reduced when the accepted VM pairs' (the tracklets) initial track parameters are calculated.</p><p>Two different configurations of VMs are considered. In the first, used in tracklet 1.0 for the demonstrator, a sector is divided into 8z 3(8z 4 &#186; VMs for the odd (even) layers, resulting in 24 (32) VMs per layer. In forming the tracklet seeds, only 96 of the 768 (= 24 32) possible combinations are consistent with a possible tracklet that has the appropriate z 0 and maximum curvature (minimum momentum); the others do not need to be considered. It is by this method of data binning that the VMs eliminate the need to expend cycles considering combinations that are known a priori not to lead to a viable tracklet candidate.</p><p>In the second configuration (long VMs) used in tracklet 2.0, a finer segmentation is used. The sectors are here subdivided into 24(32 ) VMs for the odd (even) layers. These VMs cover the</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2020">JINST 15 P06024</head><p>full length of the sector but, internally to the VM, the data are collected into eight z bins. With this configuration, 120 of the same 768 combinations lead to viable tracklet seeds. The advantage of this configuration is that it leads to more even resource usage (better load balancing) and the binning in z allows a significant reduction in the combinatorics that has to be tried when forming tracklets and matching projections to stub, as will be demonstrated in section 4. In the firmware implementation, the difference between the two versions of the algorithm is restricted to a few places in the design.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Firmware implementation</head><p>The tracklet algorithm is implemented in the V Hardware Description Language (HDL) <ref type="bibr">[7]</ref> as nine processing steps and two transmission steps <ref type="bibr">[8]</ref>. These processing steps are illustrated in figure <ref type="figure">4</ref>. The red boxes are processing modules and the data are stored in memories (blue boxes) between the different processing steps. The black lines indicate which processing modules read and write data from which memory. The implementation of the algorithm in the FPGA takes place in the following steps.</p><p>&#8226; Stub organization: (1) Sort the input stubs by their corresponding layer (LayerRouter), and</p><p>(2) into smaller units in z and , the VMs (VMRouter).</p><p>&#8226; Tracklet formation: (3) Select candidate stub pairs for the formation of tracklets (TrackletEngine), and (4) calculate the tracklet parameters and projections to other layers (TrackletCalculator module).</p><p>&#8226; Projections: (5) Transmission of projections pointing to neighboring sectors (ProjectionTranceiver). <ref type="bibr">(6)</ref> Route the projections based on smaller units (VMs) in z and (ProjectionRouter).</p><p>&#8226; Stub matching: <ref type="bibr">(7)</ref> Match projected tracklets to stubs (MatchEngine), and (8) calculate the difference in position between the stubs and projected tracklet (MatchCalculator). ( <ref type="formula">9</ref>) Transmission of matches between sectors (MatchTransceiver).</p><p>&#8226; Track fit: (10) Perform track fit; update the initial tracklet parameter estimate (TrackFit).</p><p>&#8226; Duplicate Removal: <ref type="bibr">(11)</ref> Remove tracks found multiple times (PurgeDuplicate).</p><p>Both the tracklet 1.0 and tracklet 2.0 projects follow the same structure; they differ in the internals of some of the modules and how the modules are inter-connected. Each of the steps outlined above corresponds to HDL modules (named in bold). These modules are hand-optimized. They can be customized with V parameter statements on instantiation to account for differences between use cases. For example, in the second step of stub organization, six router modules are needed to process the stubs in each layer. The bit assignment in the data differs between the inner and outer three layers of the barrel. On instantiation, a parameter is used to select the appropriate version. The project illustrated in figure <ref type="figure">4</ref> corresponds to 1 &#8260;4 of the barrel for one sector for the tracklet 1.0 configuration. A complete project would contain approximately eight times as many instantiations of the same set of modules. The wiring between modules is specified in a master project configuration file. This configuration file is processed with scripts to generate the top-level V , which is then synthesized using Xilinx Vivado 2016.1. These scripts also generate the module connection diagram shown in figure <ref type="figure">4</ref> and drive a bit-level C++ emulation of the system.</p><p>All processing modules follow a similar format where the input is read from memories filled by the previous step and the output is written to another set of memories. All processing modules across all sector processors use a single common clock (currently, 240 MHz). As soon as a new event arrives, the next step in the chain will start processing the previous event. This implies that at any given time, several events are in the processing pipeline depending on the number of processing steps. Though the project as implemented in the demonstrator used the same clock for all processing steps, the fact that we use memories as buffers between the different steps allows the use of different clock speeds for different processing modules.</p><p>An event identifier propagates with the data and is used by the processing steps to access the appropriate data. We use the event identifier in the top bits of the memory address. This assumes a fixed maximum number of entries per event in the memory buffer. The fixed latency design implies that the maximum number of entries that can be processed is known and as such the limitation due to the fixed number can be understood and tuned. Most of the data from a processing step is only used in the next step and thus we can make very shallow buffers that will hold only two events at the same time (writing one and reading the other). These small buffers are implemented as distributed</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2020">JINST 15 P06024</head><p>RAM in order to minimize the Block RAM (BRAM) resource usage in the FPGA. On the other hand, some data need to be stored for up to eight events since they will only be used later in the chain. These data are stored in BRAMs, but we try to minimize the usage of this resource as we have observed correlation of routing difficulties with the number of BRAMs used.</p><p>Since the calculations needed for routing the data are simple and using Lookup Tables (LUTs) is quick, most of the processing modules take only a few clock cycles to complete. We do not send the data to the next step immediately, but buffer it in memories until the allocated time is finished for the processing step. At this time, the module corresponding to the next step in the processing will start reading the data for the previous event and new data will be written for the current event.</p><p>We use the true dual-port memories available in the Xilinx FPGAs for our buffers such that we can write the data from one event while simultaneously reading from the previous one. These dual-port memories also allow different modules to exist in separate clock domains.</p><p>In addition to the nine processing modules, we also implement two steps of neighbor communication using optical high-speed serial links. The bending of charged particles in the magnetic field can cause trajectories to curl into neighboring sectors. In this instance, the projected position of the track is sent across fiber links to the neighbor sector processor to look for matching stubs. Simultaneously as each sector processor is sending data to its left and right neighbors, it is also receiving from them as well for the same purpose. This system configuration was chosen to reduce the amount of data duplication globally at the cost of some increase in latency. As discussed in section 2, in the tracklet 2.0 implementation nearest neighbor communication is not needed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Module examples</head><p>To illustrate the method in more detail, we present the functionality of two processing modules. Figure <ref type="figure">5</ref> shows schematically how the VMRouter works. The module receives a start signal every 150 ns for every new event. This VMRouter module reads stubs from three input layer memories. All stubs are written to the "AllStubs" memory in the full 36-bit format. In addition, based on their coordinates (and z), the stubs are routed to a specific output memory (VM stub memory) corresponding to a specific small area of the detector. Here, only coarse position information is retained and a six bit index into the AllStubs memory is saved such that we can later retrieve the precise stub position. The process loops over the input stubs and writes them out to different memories based on their position information.</p><p>A more complex example is the TrackletEngine processing module illustrated in figure <ref type="figure">6</ref>. This module forms pairs of stubs as seed candidates. As such, this module reads input stubs from two VM stub memories filled by the VMRouter module described previously, but since we are interested in forming pairs of stubs, this module implements a double nested loop over all pairs. For each pair the coarse position information is used in two LUTs to check that the seed candidate is consistent with a trajectory with the p T and z 0 requirements described above. If the stub pair passes this check, the indices of the stubs in the AllStub memories are saved in the output memory of candidate stub pairs. These indices are used in the next step, the TrackletCalculator, to retrieve the stubs and calculate the precise trajectory.</p><p>Figure <ref type="figure">7</ref> shows the distributions of the number of stub pairs that each TrackletEngine has to process. Since each step operates with a fixed latency, we have a maximum number of stub pairs that can be processed per event. With 450 ns per event and a clock speed of 240 MHz, a maximum  of 108 input stub pairs can be considered. As can be seen in the figure, there are cases where there are more than 108 input stubs; the 109th stub-pair and later will not be processed and could lead to an inefficiency of the tracking algorithm. However, due to the built-in redundancy of seeding in multiple layers, the ultimate effect of this truncation on the final efficiency is observed to be small, as discussed in section 6. Loss for 72 is 0.680% Loss for 108 is 0.081%</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2020">JINST 15 P06024</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CMS Phase-2 Simulation</head><p>All Pass Figure <ref type="figure">7</ref>. Simulation of the distribution of the number of stub pairs that TrackletEngines seeding in the two innermost barrel layers (L1+L2) have to process for t t events with an average pileup of 200. The grey curve shows the number of stub pairs that the module has to consider, while the black curve the number of stub pairs that pass. The blue line corresponds to 108 processing steps per bunch crossing and the red line to 72 processing steps. With a cut-off at 108 processing steps, we drop less than 0.1% of the stub pairs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Demonstrator system</head><p>To explore the feasibility of the tracklet system, a demonstrator was completed in 2016. The goal of the test was to implement an end-to-end working system, using simulated data, to validate the design methodology and system modeling. The hardware used for the demonstrator system were MicroTCA boards with a Xilinx Virtex-7 (XC7VX690T-2) FPGA <ref type="bibr">[9]</ref> and a Xilinx Zynq-7000 SoC for configuration and outside communication. These so-called Calorimeter Trigger Processor (CTP7) boards <ref type="bibr">[10]</ref> were developed for the current CMS trigger <ref type="bibr">[11,</ref><ref type="bibr">12]</ref>. Implementing a full sector in one FPGA on a processing board was out of reach for the older Virtex-7 class FPGAs. For this reason, the demonstrator system focused on the implementation of a half-sector. The simulated data were derived from a G -based simulation of the CMS detector <ref type="bibr">[13]</ref>. Data sets used included single particle (electron, muon, and pion) events, as well as fully simulated top quark-antiquark (t t) events. To accurately simulate HL-LHC conditions, up to 200 extra pp collisions were included in addition to the event under study.</p><p>The demonstrator system consisted of three sectors and one time-multiplexing slice. A total of four MicroTCA processing blades were used, one for the central sector, two for its nearest neighbor sectors, and one blade that acted as a data source (providing input stubs) and a data sink (accepting the final output tracks). The system configuration is shown in figure <ref type="figure">8</ref>. The demonstrator was fed with simulated data derived from the of the CMS detector mentioned above. An Advanced Mezzanine Card (AMC13) <ref type="bibr">[14]</ref> provided the central clock distribution. The inter-board communication used 8b/10b encoding with 10 Gb/s link speed. The demonstrator system is shown in figure <ref type="figure">9</ref>. The demonstrator system assumed 28 sector processors and a time multiplexing factor of six, leading to new data arriving every 150 ns.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2020">JINST 15 P06024</head><p>Sector n-1</p><p>Sector  Two complete implementations of the firmware project -one for half the barrel (+z) and one for a quarter of the barrel plus the forward (+z) endcap -were used to demonstrate the feasibility of this approach for the full range of the detector. These two projects covered two regions of the detector:</p><p>1. The barrel-only region. These tracks only traverse the cylindrical part of the detectors.</p><p>2. The hybrid region. Tracks that traverse both a part of the barrel region as well as the disk region of the detector.</p><p>By dividing the projects into these two regions, we exercised all regions of the detector while still working within the constraints of the Virtex-7 FPGAs.</p><p>A sketch of the tracker regions covered by each of the implementations is shown in figure <ref type="figure">10</ref>. As discussed in section 3, data formats and calculations are slightly different in the two regions of the detector. The differences were absorbed using parameter statements in the V code and determined at module instantiation time. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2020">JINST 15 P06024</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Validation and testing</head><p>Events were processed through the demonstrator as illustrated in figure <ref type="figure">8</ref>. First, input stubs obtained from simulations were written to a piece of hardware that emulates the data delivery in the final system. On a GO signal, stubs were sent to the three sector processor boards. A new event was sent to each sector board every 150 ns. The events were processed and projections and matches were sent to and received from neighboring boards. The final output tracks were received by the track sink board. Systematic studies were performed to compare the integer-based emulation of the tracklet algorithm with a HDL simulation of the FPGA using Xilinx Vivado, as well as with the output tracks from the demonstrator system. Full agreement was observed in processing single-track events between the emulation, FPGA simulation, and board output. Better than 99.9% agreement was observed with many-track events with high pileup (figure <ref type="figure">11</ref>). The demonstrator had a 28-fold azimuthal symmetry (i.e., 28 sectors), so we tested the full +z range by using different input data, corresponding to the different sectors, without any modifications of the demonstrator itself.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">System latency</head><p>For both the demonstrator system as well as the current configuration, each processing step of the tracklet algorithm takes a fixed number of clock cycles to process its input data. The processing modules' latency from receiving upstream data to producing the first result varies between 1-50 cycles depending on the module. Each module then continues to handle the data of the same event and writes to the memories for 150 ns (for a time multiplexing factor of six) before switching to the next event. For some of the steps where data transmission between the neighboring sectors was necessary, latency due to inter-board links was also included. The measured transmission latency was 316.7 ns (76 clock cycles), which included serialization and de-serialization via the Xilinx GTH transceivers, data propagation in 15 m long optical fibers, channel bonding, and time needed to prepare and pass data from processing modules to the transceivers. The total latency of the algorithm was therefore the sum of the processing module latencies and processing time, as well as inter-board data transmission latency, of all the processing steps. The latency of the hardware demonstrator also included the data transmission latency for receiving stubs from and sending final tracks back to the data source/sink blade. A summary of the estimated latency is shown in table <ref type="table">1</ref>. With a 240 MHz clock and a time-multiplex factor of six, the total estimated latency is 3345.8 ns. The total latency of the demonstrator was also measured with a clock counter on the data source/sink blade. We started the counter when sending out stubs, and record the counter outputs when receiving valid tracks. The actual measurement was done with 240 MHz processing clock. The measured latency was 3333 ns, which agrees within three clock cycles (0.4%) with the model.2</p><p>For the tracklet 2.0 release, we are implementing the algorithm in C/C++ using the Xilinx Vivado HLS tools <ref type="bibr">[15]</ref>. We believe HLS will produce code that be easier to maintain and provide a lower barrier for entry of new people to contribute to the development of the project. Tracklet 2.0 provides several improvements over the algorithm as implemented for the demonstrator. The algorithmic improvements from using the "long virtual modules" reduce the effects of truncation and the new sector definitions remove the need for nearest neighbor communication. One significant change is that the TMUX factor is now 18, which results in 450 ns of processing time. This time increase is offset by the reduction in the number of processing steps achieved removing the steps that require communication with the nearest neighbors. This allows us to keep within the 4 &#181;s latency target. Further reductions in latency are possible by combining processing steps into fewer modules. The estimated latency with this new configuration is shown in table <ref type="table">2</ref>.</p><p>The half-sector project included seeding in multiple layer and disk combinations (L1+L2, L3+L4, D1+D2, and D3+D4). This project consisted of the following processing modules: 12 LayerRouters, 22 VMRouters, 126 TrackletEngines, 8 TrackletCalculators, 22 ProjectionRouters, 156 MatchEngines, 22 MatchCalculators, 4 TrackFits, and one PurgeDuplicate. The resources used Table <ref type="table">1</ref>. Demonstrator latency model using the tracklet 1.0 configuration. For each step, the processing time and latency is given. For steps involving data transfers, the link latency is given. The model and measured latency agree within 0.4% (three clock cycles).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Step</head><p>Proc.</p><p>Step   in the demonstrator project are shown in table <ref type="table">3</ref>. The resource usage from the V synthesis for the full sector project is summarized in table 4. The most heavily used resource was BRAMs. We ran the project at 240 MHz, limited by external constraints. First, we operated the links at 10 Gb/s with a 8b/10b encoding. This meant that we transferred data packages of 32 bits at 250 MHz. We also wanted the clock frequency to be a multiple of the 40 MHz bunch collision rate in CMS. This meant that if we operated at 240 MHz and produced 32 bit data packages at this clock speed the links could transport the data. Increases in clock speed are being considered as a future improvement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Hardware platform and system architecture</head><p>A high-level overview of the tracker data flow is shown in figure <ref type="figure">12</ref>. The data from the tracker are sent off the detector to the Data, Trigger, and Control (DTC) cards <ref type="bibr">[4]</ref>. About 90% of the bandwidth is dedicated to a reduced-precision data stream that is sent to the L1 tracking system, or the "track trigger" (labeled TT in the figure and described in this paper). Data from the track trigger are sent downstream to a part of the L1 Trigger system (L1T) called the correlator system, where the information from the tracking detector is combined with information from other subdetectors to identify electrons, muons, and other physics quantities.3 Based on these data, a L1 Accept (L1A) is issued or not. If the L1A is issued, a signal is sent to the DTC, which then initiates the readout 3These quantities are used to decide if the event should be dropped or stored for further analysis. In total, CMS can store about 0: 0025% of the collisions for offline study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2020">JINST 15 P06024</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Cabling</head><p>The baseline strategy for the project assumes a tilted barrel geometry <ref type="bibr">[4]</ref>, 9 -sectors for track finding and a time-multiplex factor of 18 for the tracklet 2.0 configuration. The DTC layout assumed here is as follows: a total of 108 DTCs for a half-detector (+ and z); 54 for the PS modules, connected to the inner three layers of the half-barrel and the inner seven rings of the endcap, and 54 for the 2S modules, connected to the outer three layers of the half-barrel and the outer five rings of the endcap. The DTCs are arranged such that they cover "nonants" in (nonants is referring to the currently proposed division of the readout of the outer tracker in a nine-fold symmetry) over a half-length of the tracker, and so each track finder nonant and half-length connects to 12 DTCs (that is, 108 DTCs per half-length in z split into nine in ). Within each nonant and half-length, the modules connected to each DTC are distributed in both the barrel and disk regions in order to better balance the stub load in each DTC, and to minimize or avoid multiple hits along a track being routed to a single DTC. A summary of the DTC layout within a nonant and half-length is shown in table <ref type="table">5</ref>. We also assume a maximum of 72 modules connected per DTC, and that each DTC handles a maximum output of 672 Gb/s. This system also requires that the sector processor boards are arranged within two ATCA shelves for each n th time-multiplexed slice.</p><p>As mentioned above, the DTCs will process the data from each of the modules in the tracker. Each DTC will connect to one front-end cable, and each cable will handle 144 fibers; 72 front-end fibers running at 10 Gb/s (inner detector layers) or 5 Gb/s (outer detector layers) coming in, and 72 front-end fibers running at 2.5 Gb/s going out. The DTC sends data to the track trigger using 28 Gb/s cables, with a maximum capacity of 672 Gb/s for each DTC (i.e., 24 28 Gb/s links). The global coordinate information is sent to the track finder and uses 36 bits per stub, and with 64/66 bit encoding we assume a total of 45 bits per stub.</p><p>The layout of the DTCs' connections to the detector modules divides the detector intononants, and into two halves along z. Twelve DTCs are connected to each -nonant and z half, for a total of 12 18 = 216 DTCs assuming a 9-sector processor configuration; this corresponds to one sector processor per nonant. However, due to the curvature of tracks within the detector, a small amount of data duplication is needed near the boundaries of sectors to ensure that track formation can take place entirely within one sector processor. The tracker is divided into 9 sectors at a critical radius r c . Tracklets that are generated within a sector and, during the projection step, produce trajectories that cross r c within a particular sector, are kept in that sector. Tracklets that during the projection step instead cross r c in an adjacent sector are discarded (they will be rediscovered by the adjacent sector). At radii above and below r c , curves corresponding to the lowest acceptable p T are drawn from each edge of a sector, and all stubs that lie within these curves have the potential to form tracks within the sector, and so must be sent to the corresponding sector processor. The curves in adjacent sectors will overlap at radii above and below r c , and it is in these small regions where data duplication is necessary. The parameter r c is tunable and can be chosen based on an optimization of the algorithmic precision, algorithmic efficiency, load balancing in the DTCs, and constraints on the number of links available to the DTCs. If the nine sector processors were aligned with the DTC -nonants, the overlap regions on either side of a sector would require each DTC to communicate with three sector processors. To reduce the number of required output links from the DTCs, the 9 sectors are offset from the nonants such that each of the DTC nonants only needs to communicate with at most two sector processors. The demonstrator setup for tracklet 1.0 shows excellent agreement between the actual firmware results and the integer-based C++ emulation of the system. For single object events, the output tracks from the firmware have 100% bitwise compatibility with the integer-based emulation. For busier events, for example t t with an average pileup of 200, the emulation and firmware tracks agree to better than 99%, as was shown in figure <ref type="figure">11</ref>. These tests validate the use of the integer-based emulation code for extrapolating to future performance improvements. This section shows the performance of the tracklet 2.0 configuration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2020">JINST 15 P06024</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Performance of different steps of the tracklet algorithm</head><p>The efficiency for identifying charged particle trajectories, using different seeding combinations, is shown in figure <ref type="figure">13</ref> for a simulated sample of events. Each event contains a single muon without any additional pileup interactions, using the integer-based C++ emulation of the algorithm. With the multiple seeding combinations, the same particle trajectory is often found multiple times, which ensures redundancy and a complete detector coverage. As illustrated in figure <ref type="figure">14</ref>, already the resolution of the tracking seeds (tracklets) is good -within a factor of two or less of the resolution of the fitted track, motivating the feasibility of the road search algorithm. Given the resolution of the seeds, we can use narrow windows in projecting the tracklet seeds to other layers and disks to search for matching stubs. This is a key element in limiting the number of combinations that must be tried by the algorithm.</p><p>The track parameters found by the TrackletCalculator are used to project the tracklet to other layers and disks. This projection is performed in the TrackletCalculator to a nominal position of the layer or disk. In the MatchCalculator, once a candidate stub is found to which the projection is matched, an exact projection is calculated using the derivatives @ proj @r and @z proj @r. A projection-stub match is accepted if the r-and z (or r in disks) residuals pass a selection criterium that depends on (i) in which layer or disk the match is found and (ii) which combination of layers or disks were used to form the tracklet. The residuals are tuned to be nearly 100% efficient for trajectories from prompt particles originating from the primary interaction vertex.  </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>2We did not try to understand the source of the final missing three clock cycles, as an example of the law of diminishing return.</p></note>
		</body>
		</text>
</TEI>
