

# Reflections and Research Advice Upon Receiving the 2019 Eckert-Mauchly Award

**Mark D. Hill**

University of Wisconsin-Madison

**I AM HONORED**, grateful, and humbled to receive the 2019 Eckert-Mauchly award ([https://awards.acm.org/award\\_winners/hill\\_2155109](https://awards.acm.org/award_winners/hill_2155109)). I am humbled to have my name associated with the luminaries that have preceded me. I have known many prior recipients, from Sir Maurice Wilkes to my graduate student officemate Susan Eggers. Although I am the recipient of this award, the work is really “our” work, as it stems from the creativity and perspiration of more than 160 co-authors. Figure 1 shows a word cloud with the co-author’s names sized roughly logarithmically with the number of papers.

Rather than a historical tour, this essay—and the talk it is based on [Slides at [http://pages.cs.wisc.edu/~markhill/papers/markhill\\_eckert-](http://pages.cs.wisc.edu/~markhill/papers/markhill_eckert-)

[mauchly\\_2019.pptx](http://www.cs.wisc.edu/~markhill/papers/markhill_eckert-mauchly_2019.m4a) (and .pdf) with unofficial audio [http://www.cs.wisc.edu/~markhill/papers/markhill\\_eckert-mauchly\\_2019.m4a](http://www.cs.wisc.edu/~markhill/papers/markhill_eckert-mauchly_2019.m4a) and almost-complete video <https://youtu.be/kqrhBTK6SHE>.]—gives forward-looking advice on methods that we find valuable for doing research, illustrated with past examples.

Although I am the recipient of this award, the work is really “our” work, as it stems from the creativity and perspiration of more than 160 coauthors.

## SIMULATION FOR TESTING AND REFINING HYPOTHESES

A theme of our work is developing simulators to explore questions previously out of reach of existing tools. First, as PhD student with co-advisors David Patterson and Alan Jay Smith, we developed the Dinero uniprocessor trace-driven simulator (1980s Dinero predicated web pages but was later re-released: <http://www.cs.wisc.edu/~larus/warts.html>). Its ease of use and license facilitated its distribution to dozens of universities and a few companies. Second, as mostly assistant professors, we

*Digital Object Identifier 10.1109/MM.2019.2931345*

*Date of current version 10 September 2019.*



**Figure 1.** Word cloud of coauthors with Mark D. Hill.

created the *Wisconsin Wind Tunnel*.<sup>11</sup> WWT simulated a cache coherent shared memory computer on a non-shared memory Thinking Machine CM-5. It was execution-driven—so that memory behavior could influence program execution—but had three flaws (in retrospect): the CM-5 did not get faster while Moore’s Law accelerated software-only simulators; we could not share it much as the CM-5 was rare; and the simulator did not model operating system behavior. Third, we developed *GEMS*,<sup>7</sup> which booted an OS (initially Solaris) and included I/O devices. We did this hard task, because repeated feedback at our annual industrial affiliates meetings strongly advocated for full-system simulation. To ease work and speedup development, we implemented the performance model of *GEMS* but had it do functional simulation by working symbiotically with initially beta commercial Virtutech SimICS. This symbiosis was a blessing—it worked—but also a curse as it limited the spread of our influence, as not all wanted or could license SimICS. Fortunately, the (former) Michigan folks had implemented full-system functionality in their m5 simulator and proposed that we merge *GEMS*+m5 to form *gem5*.<sup>1</sup>

Our simulators have helped many do better research and have been cited 5000 times, including some citations where authors explain why they are not using them. We achieved difficult innovations, but we were also “creatively lazy” (which we advocate) wherein we did only what was needed after leveraging the work of others, e.g., SimICS and m5. After Dinero, wrote approximately zero lines of

code for these amazing simulators. Key contributors can be found in the author lists at the end of this essay.

You might think that simulation is the most important method in computer architecture, based on both the above paragraph and what you find reading in many papers. In fact, simulation is important for testing and refining hypotheses. While it is often the step most visible in papers, this is only one step in the Scientific Method ([https://en.wikipedia.org/wiki/Novum\\_Organum](https://en.wikipedia.org/wiki/Novum_Organum))<sup>9</sup>:

1. Pick a good problem.
2. Develop insight and first hypothesis.
3. Test and refine hypotheses.
4. Repeat steps as needed.

The Scientific Method’s use underlies modern science, and its value to computer architecture is no less, even as we sometimes seem to only implicitly apply it. Let us discuss its other steps in reverse order.

## DEVELOP INSIGHT AND FIRST HYPOTHESIS

### 1980s 3C Cache Misses

In the 1980s, we were fascinated by the memory hierarchy. The good news for these hierarchies is that—if properly designed—they provide cost-performance that far exceeds that of the technology levels they are created from. The bad news is that their proper design requires setting numerous parameters, often informed by a data deluge. How could we channel the deluge, following Hamming who said, “The purpose of computing is insight, not numbers”? Our intuition said that this was a good problem and turned out to be correct. It is not clear if we were good, lucky, or both. Still, in our experience, intuition matters for choosing research directions among many options.

In my PhD dissertation<sup>4</sup> (or easier to find subset<sup>5</sup>) with co-advisors David Patterson and Alan Jay Smith, we sought a taxonomy or model to give insight into cache misses and used a thesaurus to develop a memorable name. The result was the 3C Model with conflict misses for too little associativity, capacity misses for too small cache

size, and compulsory misses for never previously accessing a block/line. The 3Cs had explicit influence on Norm Jouppi developing victim caches and stream buffers a year later,<sup>6</sup> and went on to join the undergraduate canon. It did so, in part, because my Ph.D. coadvisor cowrote a popular textbook (Patterson). While the 3C Model was simple relative to other contemporary cache models, this simplicity is a factor in its longevity. We thus learned to prize simplicity.

#### 1990s Memory Consistency With Sequential Consistency (SC) for Data-Race Free Programs

In the 1990s, we were sure that shared-memory multiprocessors had “arrived.” It turns out we were off by a decade as Moore’s Law facilitated microprocessor performance improvements that allowed most markets to avoid multiprocessors, but we were correct that it would eventually happen. This illustrates that good research should anticipate trends, but need not get timing right, as is required for products. Sarita Adve and I—and others—wanted to put the correctness of multiprocessors on a firmer foundation. We saw that cache coherence could make caches invisible, but what then? Leslie Lamport’s SC model was elegant, but most real machines did not obey it. These multiprocessors exposed write buffers, out-of-order execution, and what we now call non-atomic stores. We instead wished to follow Einstein who said, “Everything should be made as simple as possible, but not simpler.”

A breakthrough started with a talk by Bart Miller on software datarace detection—and there is a lesson here about how attending talks can lead to new connections. We developed new intuition that there might be a connection between dataraces and the weak or relaxed memory models of the era. Nevertheless, it took hard thinking to make the connection both explicit and simple: specify a system to provide SC to data-race-free (DRF) programs. The SC for DRF model enabled a “have your cake and eat it too” situation. Hardware could do aggressive reordering for performance between synchronization operations, while almost all programmers could reason with relatively simple SC. Adve and others subsequently used SC for DRF as the cornerstone of the Java and C++ high-level language memory models, and many of us still use

it when specifying memory consistency in heterogeneous systems with both CPUs and general-purpose GPUs. All this from insight from a talk plus three decades of work.

#### 2000s LogTM Transaction Memory

In the 2000s, we were again sure that shared-memory multiprocessors had “arrived.” It turns out that this time we were right, but for a reason whose timing we did not predict: the end of Dennard scaling causing a “right turn” to multicore chips. We—initially Kevin Moore, David Wood, and I—were interested in multiprocessor programmability, now that correctness was arguably under control. The prevailing programming method coordinated threads using locks that were known to not compose and are subject to deadlock. For deadlock, consider a simple method that moves an item between two data structures by obtaining a lock first at the source and then a second lock at the destination. If one thread sought to move an item from A to B while a concurrent thread sought to move it from B to A, deadlock could occur with each thread holding one lock and unable to get the other. Transactional memory (TM) offered a potentially elegant solution. With TM, each thread could ask that a method be an atomic transaction and the TM system would “make it so,” sometimes having to abort and retry transactions. Existing TM systems, however, allowed micro-architectural elements (e.g., write buffer size and cache associativity) to affect what transactions could commit, required substantial changes to conventional systems, or both.

We sought a TM solution where the micro-architecture would not limit which transactions could commit and required at most a modest change to existing cache-coherent multiprocessors. To drive thinking on what to do, we first developed a taxonomy, recreated in Figure 2. In one dimension, TM systems had to detect conflicts among concurrent transactions, either when a transaction sought to commit (lazy) or earlier when reads and writes first occur (eager). Orthogonally on a write, all TM systems must do “version management” to keep the new value for possible commit and the old value for possible abort. Lazy version management puts the new value “on the side” (e.g., in a write buffer) and

|                    |       | Version Management                     |                                                         |
|--------------------|-------|----------------------------------------|---------------------------------------------------------|
|                    |       | Lazy                                   | Eager                                                   |
| Conflict Detection | Lazy  | DBMSs w/ optimistic CC<br>Stanford TCC | none                                                    |
|                    | Eager | MIT LTM<br>Intel/Brown VTM             | DBMSs w/ locking CC<br>MIT UTM<br>Wisconsin LogTM [new] |

**Figure 2.** 2006 taxonomy of hardware transactional memory systems (adapted from the article by Moore *et al.*<sup>8</sup>).

leaves the old value “in place” (in coherent memory) until commit. Conversely, eager version management saves the old value “on the side” and puts the new value “in place.”

This taxonomy assisted us in developing LogTM.<sup>8</sup> In particular, we decided to focus on the quadrant that modeled commercially successful database management systems with locking concurrency control. Consequently, LogTM combined eager conflict detection (using coherence) with eager version management (a per-thread log). LogTM—and its successor LogTM-SE<sup>12</sup>—enabled unbounded transactions with modest core changes and trivial memory system changes. While LogTM and other academic papers developed many promising ideas, and limited TM hardware is supported by Intel, IBM, and recently ARM, it is not universal. This is, in part, because, although multicore use is now ubiquitous (e.g., in the cloud), only a relatively few experts program directly with thread-level parallelism. Nevertheless, we should take the long view, as ideas can take time to flourish. For example, SIMD and vectors developed over decades of niche successes before their broader success with general-purpose GPUs and the SIMT model. More generally, we recommend developing taxonomies to structure deep thinking, recalling the “mother of all scientific taxonomies”: Mendeleev’s periodic table of the elements that focused efforts for uncovering missing elements.

#### 2010s System-of-a-Chip (SoC) Gables and Accelerator-Level Parallelism (ALP)

In the 2010s, SoCs grew up to have heterogeneous CPUs, GPUs, dozens of accelerators, interconnects, coherence, virtual memory, and even virtualization. Accelerators make specific computations faster, more predictable, and more energy-

efficient. As a Google intern (during my 2017–2018 sabbatical), my host Albert Meixner charged that we should make SoC design “more scientific.” Gasp! Thus, I fell into SoCs by luck after deciding to do another mind-expanding sabbatical in industry, putting myself in a position to get lucky.

To make some progress on this grandiose charge and frame early SoC thinking, Vijay Janapa Reddi developed a simple SoC model called Gables<sup>2</sup> for SoC hardware and software use cases. Specifically, Gables models each accelerator with a “roofline” (previously used for a whole multicore chip), including the important parameter of “operational intensity” that speaks to whether communication or computation is the bottleneck. George Box said, “All models are wrong; some are useful.” The community has yet to decide if the newly proposed Gables is useful. Nevertheless, it has already led to the important hypothesis that mobile SoCs, in particular—and arguably computing, more generally—must now deal with ALP wherein multiple accelerators are concurrently active. Broad ALP success will require the research community—maybe you!—to develop better “best practices” for targeting accelerators, managing accelerator concurrency, choreographing inter-accelerator communication, and productively programming them.

#### PICK A GOOD PROBLEM

A greatly underappreciated aspect of influential research is the first step of picking a problem from the infinite set of possible problems. This is creative and important, but rarely discussed. Good research problems fall at the intersection of two criteria, “If you can do it, people will care.” and “You can do (some of) it,” as illustrated in Figure 3.

In our experience, one should devote considerable thinking and time to picking problems to work on. One should not play Jeopardy! which is an American TV game show in which contestants are shown the answer and challenged to develop the question. You may laugh, but we have seen this in research: “I have a cool mechanism, so let me figure out what it is good for.” In our experience, this approach rarely produces groundbreaking and lasting research. Ask first, “What problem am I trying to solve and why?”

Operationally, we recommend looking for change. You can do good work without change,

but you have to be smarter and more creative than all that have preceded you, which is hard for important problems. In computer architecture, change arises from 1) software and applications above, e.g., exploding machine learning and augmented reality; 2) technology changes below, e.g., 3-D chip stacking and emerging non-volatile memory technologies; and 3) influences from other (sub)fields, e.g., miraculous progress in SAT solvers and using ML to optimize hardware.

For a concrete example, consider the work of David Patterson, Garth Gibson, and Randy Katz on redundant arrays of inexpensive disks or RAID<sup>10</sup> that I observed as a graduate student. You might think that the most creative part of the work was taking erasure codes, applying them to blocks, and rotating parity to avoid bottlenecks. Although this was creative, in our opinion, the greatest creativity in the project

came first in recognizing the problem. Historically, the most cost-effective way to store large data was on large washing-machine-sized disks. When personal computers exploded into the world, they soon adopted small disks, and the sales volume made these disks the most cost-effective place to store data. The problem: Can we use small PC disks to store large data? The problem within the problem: Without innovation, an array of PC disks is too unreliable. A single PC disk is not that reliable (the market wanted inexpensive), and an array of these disks will lose data in days. With this setup, the innovation of RAID now seems creative but not superhuman. The hard thinking to pick a good problem was well rewarded with a seminal paper with over 4000 citations and three test-of-time awards.

We conclude this section with four other comments regarding picking problems. First, spend considerable time and energy on picking the problem. Avoid jumping to solutions too fast, as you might solve the wrong problem. Second, seek simple ideas, especially for interfaces. In computer architecture and systems, even simple ideas get more complex when actually deployed. Be proud of simple ideas, as I tout with a web page



**Figure 3.** Pick problems at the smile.

(<http://pages.cs.wisc.edu/~markhill/includes/simple.html>). Woe to those in industry who have to deploy something that was already complex in the academic paper. Third, collaborate broadly with professional colleagues and students. It is not clear

Don't unduly worry about dividing credit, as credit often multiplies and collaboration usually enables something worthy of greater credit.

I ever unilaterally developed an idea that did not benefit from interactions with one or more of my 160 coauthors. Don't unduly worry about dividing credit, as credit often multiplies and collaboration usually enables something worthy of greater credit. Fourth, keep academic-industry connections strong, as computer architecture and systems are about influence, not intrinsic beauty. Impact and effort seem to vary proportionally from the smaller—talking to people at conferences and holding industrial affiliate meetings—to the larger—student internships and sabbaticals in industry (for me at Sun, AMD, and Google).

## THANKS AND GIVING FORWARD

I have been blessed to work with many great people—beginning with my PhD coadvisors David Patterson and Alan Jay Smith—and continuing with 160 coauthors, with the biggest word cloud font sizes in Figure 1 going to David Wood, Daniel Sorin, Michael Swift, James Larus, and Milo Martin. This work has occurred at three great public universities—Michigan for undergraduate, Berkeley for graduate, and Wisconsin where I am faculty—with funding largely from the U.S. National Science Foundation, most recently with grants CCF-161782, CCF-1734706, and CNS-1815656. As Newton said, “We can see further, because we stand on the shoulders of giants.” For my students and me, we have discovered that these giants go back at least as far as 990 AD in Constantinople ([http://pages.cs.wisc.edu/~markhill/Academic/Genealogy\\_Hill\\_Mark.pdf](http://pages.cs.wisc.edu/~markhill/Academic/Genealogy_Hill_Mark.pdf)).

For this, one cannot give back, but one can—and should—give forward. I have sought to give forward through more than thirty years of teaching, three years as department chair, and through service in several ways, including with ACM SIGARCH and, mostly recently, the Computing Community Consortium (<https://cra.org/ccc/>). Your opportunities and predilections may be different, but please give forward to honor those who gave to you.

I conclude by thanking my family: wife Sue, children Nicole and Greg, granddaughter Zeynep, sister Kathryn, and parents Toivo and Maria. My strong mother Maria passed away the morning after I received the 2019 Eckert-Mauchly award. She learned of the award a few weeks before, valued it, but cared more that her children sought to live with integrity and to follow the golden rule. May Maria rest in peace.

## ■ REFERENCES

1. N. Binkert *et al.*, "The gem5 simulator," *ACM SIGARCH Comput. Archit. News*, vol. 39, no. 2, pp. 1–7, 2011.
2. M. D. Hill and V. Janapa Reddi, "Gables: A roofline model for mobile SoCs," in *Proc. IEEE 25th Int. Symp. High Perform. Comput. Archit.*, 2019, pp. 317–330.
3. M. D. Hill and V. J. Reddi, "Accelerator-level parallelism," 2019. [Online]. Available: <https://arxiv.org/abs/1907.02064v1>
4. M. D. Hill, *Aspects of Cache Memory and Instruction Buffer Performance*. Ph.D. dissertation, Univ. California, Berkeley, Berkeley, CA, USA, Nov. 1987. [Online]. Available: [http://www.cs.wisc.edu/multifacet/theses/mark\\_hill\\_phd.pdf](http://www.cs.wisc.edu/multifacet/theses/mark_hill_phd.pdf)
5. M. D. Hill and A. J. Smith, "Evaluating associativity in CPU Caches," *IEEE Trans. Comput.*, vol. 38, no. 12, pp. 1612–1630, Dec. 1989.
6. N. P. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," *ACM SIGARCH Comput. Archit. News*, vol. 18, no. 2SI, pp. 364–373, 1990.
7. M. M. K. Martin *et al.*, "Multifacet's general execution-driven multiprocessor simulator (gems) toolset," *SIGARCH Comput. Archit. News*, vol. 33, pp. 92–99, Nov. 2005.
8. K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood, "Logtm: Log-based transactional memory," in *Proc. 12th Int. Symp. High-Perform. Comput. Archit.*, 2006, pp. 254–265.
9. J. R. Platt, "Strong inference," *Science*, vol. 146, no. 3642, pp. 347–353, 1964.
10. D. A. Patterson, G. Gibson, and R. H. Katz, "A case for redundant arrays of inexpensive disks (RAID)," in *Proc. ACM SIGMOD Int. Conf. Manage. Data*, vol. 17, 1988, pp. 109–116.
11. S. K. Reinhardt, M. D. Hill, J. R. Larus, A. R. Lebeck, J. C. Lewis, and D. A. Wood, "The wisconsin wind tunnel: Virtual prototyping of parallel computers," in *Proc. 1993 ACM SIGMETRICS Conf. Meas. Model. Comput. Syst.*, SIGMETRICS'93, 1993, pp. 48–60.
12. L. Yen *et al.*, "Logtm-se: Decoupling hardware transactional memory from caches," in *Proc. IEEE 13th Int. Symp. High Perform. Comput. Archit.*, 2007, pp. 261–272.

**Mark D. Hill** is with the University of Wisconsin-Madison, Madison, WI, USA. Contact him at: markhill@cs.wisc.edu.