<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>OM3: An Ordered Multi-level Min-Max Representation for Interactive Progressive Visualization of Time Series</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>06/13/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10515159</idno>
					<idno type="doi">10.1145/3589290</idno>
					<title level='j'>Proceedings of the ACM on Management of Data</title>
<idno>2836-6573</idno>
<biblScope unit="volume">1</biblScope>
<biblScope unit="issue">2</biblScope>					

					<author>Yunhai Wang</author><author>Yuchun Wang</author><author>Xin Chen</author><author>Yue Zhao</author><author>Fan Zhang</author><author>Eugene Wu</author><author>Chi-Wing Fu</author><author>Xiaohui Yu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>We present a novel multi-level representation of time series called OM3 that facilitates efficient interactive progressive visualization of large data stored in a database and supports various interactions such as resizing, panning, zooming, and visual query. Based on our proposed line-segment aggregation, this representation can produce error-free line visualizations that preserve the shape of a time series in windows of arbitrary sizes. To reduce the interaction latency, we develop an incremental tree-based query strategy to support progressive visualizations, allowing a finer control on the accuracy-time tradeoff. We quantitatively compare OM3 with state-of-the-art methods, including a method implemented on a leading time-series database InfluxDB, in two settings with databases residing either in the local area network or on the cloud. Results show that OM^3 maintains a low latency within 300~ms on the web browser and a high data reduction ratio regardless of the data size (ranging from millions to billions of records), achieving around 1,000 times faster than the state-of-the-art methods on the largest dataset experimented with.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>The past decades have witnessed an explosion of time-series data in many applications, from nancial engineering to manufacturing. A large amount of time-series data is collected by measuring variables over time at regular intervals, and usually stored in remote databases on cloud servers for subsequent analysis via interactive visualizations on the client side. By interacting with time series displayed as line charts, users can conduct analytical tasks <ref type="bibr">[1]</ref>, such as peak identi cation, trend analysis, and pattern search.</p><p>To improve the user exploration e ciency, a variety of interaction techniques <ref type="bibr">[1]</ref> have been developed, e.g., SignalLens <ref type="bibr">[16]</ref>, multi-focus zooming <ref type="bibr">[12,</ref><ref type="bibr">29]</ref>, Zenvisage <ref type="bibr">[27]</ref>, and a few visual query tools <ref type="bibr">[10,</ref><ref type="bibr">21]</ref>. Recently, Siddiqui et al. <ref type="bibr">[28]</ref> proposed an expressive shape search algebra, enabling users to interactively search for various desired patterns. However, almost all existing techniques require the whole data to be quickly loaded onto the client side for e cient rendering. Yet, the latency is often high for large time-series data stored on remote databases. This hinders smooth interactive exploration of temporal patterns at various resolutions.</p><p>A common approach to reducing the latency is data reduction <ref type="bibr">[4]</ref>, by aggregating the data to reduce its size before visualization. Various strategies have been proposed to preserve salient features in the reduction. However, most of them do not account for the perceptual e ects of, e.g., resizing the display, and thus often produce erroneous visualizations that can signi cantly distort the shape of the rendered data. This issue is addressed by the visualization-oriented time-series reduction method M4 <ref type="bibr">[13]</ref>, which nds essential records per pixel column (where each pixel column consists of all pixels with the same x coordinate) in the display window to preserve the exact rendering of an input time series, referred as error-free line visualizations. Yet, M4 cannot e ciently sustain smooth interactions for large time-series data. First, the time complexity of executing a query is O(n), where n is the time-series length. So, processing data with millions of records could easily take more than a second, which is beyond the latency limit <ref type="bibr">[17]</ref> for interactive visual analysis. Second, M4 independently issues and fully executes a new query for each user action on the visualization. For these reasons, it does not support continuous interactions like panning and zooming, and precludes most interaction techniques that are commonly used for time-series exploration.</p><p>Progressive visualization <ref type="bibr">[33]</ref> is a promising direction, where, instead of waiting for slow queries, the visualization immediately renders intermediate results that the user can potentially interact with. A prominent approach is based on IncVisage <ref type="bibr">[23]</ref>, which uses online sampling-based techniques to progressively reveal salient features. The approach quickly renders approximate visualizations (on the order of seconds) that update over time and eventually converge to the exact visualization (though this may take a minute or more). Each update sends a new result set to visualize, so network costs are linear in the number of updates-on the order of hundreds before convergence. These characteristics are ill-suited and not widely adopted for interactive visualization of large datasets.</p><p>In this paper, we present a new approach to interactive progressive visualization of arbitrarily-sized time-series. Our approach is designed to satisfy the following desiderata for interactive visualization of large time-series stored in a remote database: (i) ensuring error-free line visualizations at any scale; (ii) minimizing query latency and amount of data transfer to the visualization clients; (iii) supporting progressive re nement of intermediate visualization with a good trade-o between interaction latency and visualization quality; and (iv) o ering rich interactions for uid data exploration.</p><p>Our proposal centers around OM 3 , an ordered multi-level min-max representation of a timeseries dataset, which enables visualizations that preserve the shape of a time series displayed in any window size. This representation has &#8968;log n&#8969; levels and maintains minimum and maximum values at every time interval that is used to rasterize a pixel column in the display window with the width 2 i at the ith (0 &lt; i &lt; &#8968;log n&#8969;) level. In addition, it tracks the temporal ordering of the paired minimum and maximum values at the leaf level. To do so, we formulate the forward OM 3 transform to recursively aggregate the time series for di erent time intervals at di erent resolutions, constructing a hierarchy of coe cients obtained by aggregating and di ering between </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Technique</head><p>Vis. Qual. Data Tran. Prog. Vis. Int. Sup. the aggregate values of two paired aggregated samples. We store these coe cients in a database table, whose size is only three-quarters of the original time series. By default, the procedure is limited to time-series lengths that are powers of two, and we extend it to support missing data and arbitrary lengths with the same storage space. With OM 3 interactions issue queries over the hierarchy of coe cients and render the charts on the client using the data fetched from the server. A naive implementation would involve traversing all the coe cients falling into the speci c time range for each interaction. This, however, could be prohibitively expensive for large data. To address this issue, we propose a visualization-aware incremental query algorithm of time complexity O(w log(n)), where w is the display-window width and n is the time-series length. For every time interval corresponding to a pixel column in the visualization, it nds four M4 samples <ref type="bibr">[13]</ref> and reuses the query results from the previous round of interaction to attain a substantial speed-up. Since the query evaluation proceeds incrementally level by level through the hierarchy of coe cients, the rasterization is progressively performed with a dynamically changed number of pixels rendered on screen and rapidly converges to exact visualizations. For example, it takes less than 300 ms to reach the convergence for the dataset with 1B records. As a further optimization on the cost of traversing OM 3 , we develop an e cient pruning strategy in a breadth-rst search, which can achieve a pruning ratio of around 45% on most datasets. In doing so, interactive, progressive visualization of large time series can become feasible, even for data with billions of records. Table <ref type="table">1</ref> compares the design objectives of OM 3 and existing techniques along four considerations.</p><p>OM 3 can support a rich variety of interaction techniques such as panning, zooming, resizing, and TimeBox Search <ref type="bibr">[11]</ref>. We implement OM 3 using JavaScript with PostgreSQL <ref type="bibr">[20]</ref>, and quantitatively compare it with the baseline approaches: M4 on PostgreSQL and a leading time-series database In uxDB, and Haar wavelet on PostgreSQL, in two settings, with the database residing either in the local area network or on the cloud, on several real-world datasets. Evaluation results show that our method ensures error-free visualizations and o ers latency under 300 ms (the upper limit of interactive latency is 500 ms <ref type="bibr">[17]</ref>) and high data reduction ratio (99.5%), showing its capability to support progressive visualization of large time-series data. Besides, we conduct a case study to demonstrate the e ectiveness of our provided interaction techniques in the exploration of large data.</p><p>In summary, we make the following main contributions.</p><p>&#8226; We develop OM 3 , an ordered multi-level min-max representation that forms the basis for delivering interactive visualization of large time-series data with one billion records; &#8226; We propose an incremental query algorithm with an e cient pruning strategy to enable progressive visualization and rich uid interactions on the client; and &#8226; We quantitatively compare OM 3 with state-of-the-art methods, manifesting the e ectiveness of OM 3 in supporting interactive exploration of large time-series data. The rest of this paper is organized as follows. In Section 2, we provide a review of related work. In Section 3, we revisit min-max aggregation and present an extension to support error-free visualization of time series. Section 4 presents our new solution based on OM 3 . Section 5 reports the experimental results, and Section 6 concludes this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>Our work is related to the literature on interaction for line visualizations and data reduction of large time-series data. Below, we discuss these areas one by one.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Interaction for Line Visualizations</head><p>Time-series visualization <ref type="bibr">[1]</ref> has been extensively explored for facilitating users to discover trends and patterns at varying time scales. Line chart is still the most popular vehicle for presenting time series. However, displaying many time series as line charts on a limited screen inevitably produces heavy overplot. To address this issue, various lens-based interaction techniques <ref type="bibr">[16]</ref> and overview+detail interaction techniques <ref type="bibr">[12,</ref><ref type="bibr">29]</ref> have been developed for simultaneously showing the regions of interest in detail with an overview of the whole data. All these techniques are built on basic interaction operations: resizing, panning, and zooming, with the assumption that the latency of the data query is low. Yet, very few of them have been used for exploration of very large timeseries data stored in remote databases. In contrast, the proposed OM 3 and its associated incremental query algorithm focus on enabling e cient interactive visualization of remote time-series data with billions of records.</p><p>As a separate line of research that is concerned with e ciently identifying patterns of interest, visual query tools help search for line chart visualizations matching a query pattern speci ed through the user interface. Among them, sketch-based queries <ref type="bibr">[18,</ref><ref type="bibr">27,</ref><ref type="bibr">28,</ref><ref type="bibr">32]</ref> allow users to sketch a temporal pattern to select and show the line charts that match the sketch. Yet, this approach does not provide an overview of the whole data. Instead, Timeboxes <ref type="bibr">[11]</ref> and KD-Box <ref type="bibr">[34]</ref> act as lters, in which users can specify query constraints by manipulating rectangular boxes in the charts with all time series and then lter out the time series that do not pass through the boxes. Our multi-level representation OM 3 further extends such queries to explore the detail of lines of interest at ner levels.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Time-series Data Reduction</head><p>To reduce the latency for large data query during interactive visualization, a few data reduction techniques have been proposed for time-series data, which can be grouped into two categories: point aggregation and line simpli cation.</p><p>Point Aggregation. To ensure a manageable size for the query data, point aggregation methods often rst group the entire time series into a few time spans and then compute an aggregated value and an aggregated timestamp for each interval using aggregation functions such as minimum, maximum, mean, and median. A common method is the piece-wise aggregate approximation (PAA) <ref type="bibr">[14]</ref>, which selects the minimum ( rst) timestamp and computes an average value. Based on these aggregation functions, a few methods have been proposed to o er fast approximate query <ref type="bibr">[2,</ref><ref type="bibr">6]</ref>. Yet, regardless of the aggregation function being used, the line chart of the aggregated data may distort the original shape of the time series. To address the issue, M4 <ref type="bibr">[13]</ref> aims to preserve the shape by selecting two data points with the minimum and maximum values, and another two with the minimum and maximum timestamps in each pixel column, connecting the four data points in time order with the rest of the time series, and rasterizing the line segments in the pixel column.</p><p>With M4, the number of data points needed to be transferred from the database to the visualization client can be signi cantly reduced. However, it cannot facilitate smooth interaction of large timeseries data due to the following two major drawbacks. First, its query execution has high interaction latency. The most costly operation in the whole query process is the join operation, which has a computational complexity of O(n + 4w) for matching n data points with 4w aggregates. Second, M4 cannot reuse previous query results to support incremental update, which is essential to supporting interactive operations such as zooming and panning. As a result, it requires very frequent data query and transfer from the remote database during user exploration, thus hindering e ective exploration of large data. Although SampleAction <ref type="bibr">[8]</ref> and IFOCUS <ref type="bibr">[15]</ref> support incremental visualizations by using online aggregation, they cannot provide error-free visualizations until converging to precise results. In contrast, OM 3 enables users to interactively explore large-scale time-series data with exible interactions.</p><p>Line Simpli cation. Given a time-series curve, line simpli cation is another data-reduction approach that aims to preserve its rough shape with fewer curve segments. One widely-used method is the top-down Ramer-Douglas-Peucker <ref type="bibr">[7,</ref><ref type="bibr">9]</ref> algorithm, which joins the rst and last points in the curve and divides the curve into two segments, if the distance from the farthest point to the line is larger than a threshold. Then, each curve segment is recursively divided until all points of the original curve are within the simpli cation's tolerance. Fu et al. <ref type="bibr">[5]</ref> further attempts to preserve salient points in the simpli cation. However, these methods have a high computational complexity, O(n 2 ) or O(n log n), (n is the number of data points in the time series), so interactive queries cannot be supported for large data. Instead, INCVISAGE <ref type="bibr">[23]</ref> uses online sampling-based algorithms to incrementally generate curve segments, attaining 46x speedup relative to baselines. Yet, it provides only approximate visualizations and does not support interactive operations like zooming. Since the time complexity for constructing the OM 3 representation is only O(n) and querying with OM 3 is only O(w log n) (w is display-window width), OM 3 is able to support progressive interactive visualization of large time series.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">LINE-SEGMENT AGGREGATION</head><p>In this section, we revisit min-max aggregation and propose a line-segment aggregation of time series for generating error-free visualizations. As shown in Fig. <ref type="figure">1</ref>(b), simply using the min-max aggregation may produce errors in line-chart visualizations with missing pixels (e.g., E 1 ) and false pixels (e.g., E 2 ). To address this issue, M4 <ref type="bibr">[13]</ref> locates the two data points of maximum and minimum timestamps and the two data points of maximum and minimum data values in each pixel column and then connects these points to form line segments that represent the line chart. Fig. <ref type="figure">1</ref>(c) shows its result, which exactly matches the visualization produced by directly rasterizing the whole data points shown in Fig. <ref type="figure">1</ref>(a). However, some inter-column segments formed by M4 are redundant and unnecessary for yielding error-free visualizations.</p><p>is an ordered list of uniformly-sampled data points, where i is the index, t i is the i-th timestamp (t i &lt; t i+1 ), and v i is the data value at time t i .</p><p>De nition 2. An equidistant grouping of a time series refers to a non-overlapping partition of T into w groups of same width in time domain:</p><p>&#948; is round(n/w), and w is the display-window width. For each group, its value range is O i = [G min (T , w, i), G max (T , w, i)] de ned by the minimal and maximal values G min (T , w, i) and G max (T , w, i), respectively. De nition 3. A line visualization of time series T error-free, if the set of its rendered foreground pixels is the same as the set rendered by rasterizing all line segments over all data points in T .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>De nition 4.</head><p>For each group B i , the intra-column line segment is de ned by G min (T , w, i) and G max (T , w, i) of this group, while the inter-column line segment between B i and the next group B i+1 is de ned by connecting the last data record of B i and the rst data record of B i+1 . As shown in </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">OUR SOLUTION</head><p>Based on the line-segment aggregation described in Section 3, we propose the OM 3 representation. Particularly, we shall show how this representation facilitates interactive progressive visualization of large time series stored in remote databases with low latency. Fig. <ref type="figure">2</ref> shows the overall architecture, which consists of two stages: o ine preprocessing and online query.</p><p>O line Preprocessing. We apply our OM 3 preprocessor (see Section 4.1) to convert each time series stored in a relational database at a server into our OM 3 coe cients by a forward transform. This is a one-time pre-processing conducted locally on the server.</p><p>Online Query. Once the OM 3 coe cients are available, users can interactively explore the data on the client by retrieving just the necessary coe cients from the server to reconstruct the original data records via an inverse transform. To produce a visualization from scratch, the server can automatically produce query statements to retrieve coe cients based on the display-window width and the visible time range of the time series on the display window. Then, the client receives a stream of coe cients from the server to produce the visualization, while maintaining an OM 3 coe cient tree of the reconstructed data as cache to accelerate subsequent visualization generation during interactive exploration. To support smooth interactions, the server performs incremental queries to request data on demand, enabling the client to smoothly pan, zoom, resize, and query the time series by exploiting the tree hierarchy.</p><p>OM 3 inherently requires the time series to be complete (i.e., without missing values) and have a power-of-two length. It also requires the window width to be a power of two. However, none of these can be ensured for general time-series data during the interactive data exploration. Hence, we further extend OM 3 to address these issues. In this section, we will rst describe the forward and inverse OM 3 transforms and our extensions for the time series with missing data and arbitrary lengths, incremental tree-based query, acceleration strategies on OM 3 , and families of interactions supported by OM 3 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">OM 3 Transforms</head><p>Given a time series T of length n = 2 l , we obtain its hierarchical line-segment aggregation via two operations: (i) hierarchical min-max aggregation, and (ii) the ordering of the min-max aggregation results at the nest level. Operation (i) helps build the min-max lines, while operation (ii) facilitates the construction of the inter-column lines. We refer to such a line-segment aggregation result as the ordered multi-level min-max representation of the time series, i.e., OM 3 . So, the OM 3 transform converts time series T from an ordered list of data records to a set of coe cients that describe T in di erent levels of detail. Using these coe cients, the original data records at any level can be reconstructed by an inverse transform. Fig. <ref type="figure">3</ref>(a) shows an example four-level hierarchy produced from 16 data points. We can see that the eight coe cient tuples at level 3 are obtained by aggregating and di ering between two adjacent aggregate coe cients at level 4.</p><p>Note also that in our implementation, we directly compute A l -1 from input data without explicitly replicating the input data points to form A l . As shown in Fig. <ref type="figure">3</ref>(a), the coe cients at level 3 can be obtained by operating on the two adjacent data points. Since A l -1 encodes only the set of original sample values in the input without information on their order, we further store the ordering relationship between A l -1 2i and A l -1 2i+1 as the ordering coe cient O to facilitate the reconstruction of the input and set</p><p>2i is less than A l -1 2i+1 , and false, otherwise. Meanwhile, the detail coe cients at level l -1 (e.g., D l -1 2i and D l -1 2i+1 ) are redundant, as we do not need to reconstruct A l 2i and A l 2i+1 during the inverse transform. In Fig. <ref type="figure">3</ref>(b), the cells highlighted by the blue box mark the redundant detail coe cients. For short, we refer these aggregate and detail coe cients to be stored as the M 3 coe cients, and the combination of M 3 coe cients and ordering coe cients as the OM 3 coe cients. So, only n M 3 coe cients and half of the n Boolean ordering coe cients are indispensable. For e cient storage of the M 3 coe cients in database, we use an ID column to maintain the ordering of the records, see the left column of Fig. <ref type="figure">3(b)</ref>. Since n Booleans (ordering coe cients) take small space, we store them separately. For the input time series of length n, each data point has two values t i and v i , and thus the total number of required coe cients is approximately three-quarters of the input data size.</p><p>After removing the redundant detail coe cients, we can pack the aggregate and detail coe cients into a database table with two columns (see the example shown in Fig. <ref type="figure">3(b)</ref>) and the ordering coe cients into a table (see Fig. <ref type="figure">3(c</ref>)), which stores all the information necessary for reconstructing the input data points.</p><p>Inverse Transform. As the forward transform employs min and max aggregations, we need di erent mechanisms to reconstruct the aggregate coe cients. Given the aggregate and detail coe cients at level j -1, the aggregate coe cients at level j can be found by reversing Eqs. ( <ref type="formula">1</ref>) &amp; (2). For example, we can reverse Eq. (1) to nd A j 4i and A j 4i+2 based on the sign of D j-1 2i :</p><p>Similarly, we can reverse Eq. ( <ref type="formula">2</ref>) to nd A j 4i+1 and A j 4i+3 . For a display window with power-of-two width w = 2 j , the inverse transform requires fetching coe cients whose ID ranges [0, 2w -1] from the M 3 coe cient table and then performing the inverse transform from level 0 to level j -1 to reconstruct the original data points. Figs. <ref type="figure">3(d,</ref><ref type="figure">e</ref>) show a 16-samples time series reconstructed to t two-and four-pixel-wide windows, respectively. Yet, the visualization rendered by the reconstructed min-max aggregate in each pixel column misses some pixels; see the ones marked by the red boxes. The reason is that some inter-column pixels cannot be determined using only the min-max aggregation. We will address this issue with our tree-base query in Section 4.2.</p><p>Transformation of Non-canonical Data. To handle time series with missing values and/or of non-power-of-two length, which we collectively call non-canonical time series, we may simply ll the missing values with an estimated value (e.g., simple average) or pad it with zeros to extend its length to a power of two. However, value lling and zero padding will in uence the calculation of the minimum and maximum values, hindering the OM 3 transforms.</p><p>Instead, we represent a non-canonical time series as the one with a power-of-two length (see the bottom in Fig. <ref type="figure">4(a)</ref>) and pad the null value to the samples without data values. Then we construct a groups of the min-max aggregate values at level 2 and render these min-max lines on this window; we show also the original line chart (grey line) as a reference. Comparing the two charts, we can see that three pixels (red boxes) are missed in the generated visualization, as the four-pixel-width line chart does not provide appropriate data samples precisely for the three-pixel-width window, especially near the pixel-column boundary. Also, when using the correct min-max aggregate coe cients, only some pixels are missed (see Figs. <ref type="figure">3(d,</ref><ref type="figure">e</ref>)); yet, we can recover these pixels by some inter-column lines, as explained and shown in Theorem 1 (see Section 3).</p><p>To address the issue, we propose a two-step tree-based incremental query scheme for e ciently nding appropriate data points in each pixel column of the target window. Assuming that the width of the display window w is in range [2 p-1 , 2 p ] for some positive integer p, we rst retrieve the coe cients for the window of width 2 p from the server database and reconstruct 2 p data points via an inverse transform. Then, the client can successively retrieve detail coe cients of boundary nodes that contain the rst/last timestamp of each pixel column from the server to reconstruct the original data points at the nest level. Before performing this two-step query, we rst send the whole ordering coe cients from the server to the client. By nding the minimum and maximum data values in every pixel column of the target window and the indispensable inter-column samples within a single query, we can then obtain accurate aggregates for producing error-free visualizations. Yet, extensively visiting the coe cient tree all the way to the leaf nodes and transferring all information from server to client would lead to extra costs in computing time and network bandwidth. Also, some inter-column lines are unnecessary as shown by Theorem 1. These observations form the basis for reducing the tree traversal cost.</p><p>To avoid exhaustive tree traversal, we introduce a breadth-rst search method with an e cient pruning strategy to avoid unnecessary traversal, as outlined in Algorithm 1. When traversing each level, we initialize an empty list B to store those boundary nodes (line 6) that contain data points near the boundary between adjacent pixel columns. Then, for each node, we examine its time range against the time range of the nearby pixel columns in the target window. If the time range of a node falls entirely inside the time range of a pixel column, the node itself can provide appropriate information for updating the data value range (i.e., <ref type="bibr">[v_min,v_max]</ref>) of the pixel column (lines 7-11), so we do not need to further visit its child nodes to explore the ner-level time ranges. Otherwise, we append it to the list B (line 13). For the boundary node at a level greater than p, we remove it from B, if its data value range is already covered by the current data ranges of the two associated pixel columns (lines <ref type="bibr">[17]</ref><ref type="bibr">[18]</ref><ref type="bibr">[19]</ref><ref type="bibr">[20]</ref><ref type="bibr">[21]</ref>. Once all nodes are checked, we load the coe cients for all remaining nodes in B and reconstruct the corresponding data points at the next level (lines <ref type="bibr">[25]</ref><ref type="bibr">[26]</ref>. By recursively visiting only necessary child nodes with our pruning strategy level by level, we can e ciently nd the data value range of each pixel column in the target window and rasterize all inner-column pixels by the associated pixel column's data value range (line 29). At the last level, we rasterize all inner-column pixels and inter-column pixels de ned by the ordered boundary samples collected from B and O (line 16). In doing so, progressive visualization is achieved, where the visualization can be incrementally re ned during the level-by-level traversal (see Fig. <ref type="figure">5(d)</ref>).</p><p>Taking node <ref type="bibr">[12,</ref><ref type="bibr">20]</ref> in Fig. <ref type="figure">5</ref>(c) as an example. Its time range [0, 15] falls entirely in the rst pixel column's time range [0, 21] in the target three-pixel-width window, so we do not need to visit its child nodes and can initialize the data value range (minimum and maximum) of the rst pixel column as <ref type="bibr">[12,</ref><ref type="bibr">20]</ref>; see the left column of the level 2 result in Fig. <ref type="figure">5(d)</ref>. For node <ref type="bibr">[6,</ref><ref type="bibr">28]</ref>, its time range <ref type="bibr">[16,</ref><ref type="bibr">31]</ref> overlaps the rst two pixel columns in the target window, so we append it to the boundary node list and visit its child nodes <ref type="bibr">[7,</ref><ref type="bibr">19]</ref> (time range <ref type="bibr">[16,</ref><ref type="bibr">23]</ref>) and <ref type="bibr">[6,</ref><ref type="bibr">28]</ref> (time range <ref type="bibr">[24,</ref><ref type="bibr">31]</ref>) at level 3. Later, for node <ref type="bibr">[14,</ref><ref type="bibr">17]</ref> at level 4, its value range is covered by the value ranges of the adjacent two pixel columns <ref type="bibr">[7,</ref><ref type="bibr">20]</ref> and <ref type="bibr">[6,</ref><ref type="bibr">29]</ref>, so we do not need to visit its child nodes. For some pixel columns, considering only the minimum and maximum values within each pixel column may not be su cient to accurately rasterize its pixels. For example, the leaf node <ref type="bibr">[14,</ref><ref type="bibr">22]</ref> (time range <ref type="bibr">[42,</ref><ref type="bibr">43]</ref>) crosses the second pixel column (time range <ref type="bibr">[22,</ref><ref type="bibr">42]</ref>) and the third pixel column (time range [43, 63]), forming an indispensable inter-column line (see the highlighted one in Fig. <ref type="figure">5(b)</ref>). Hence, we put it into the boundary node list and connect its data points to rasterize the bottom gray pixel in the third column of the level 5 result in Fig. <ref type="figure">5(d)</ref>.</p><p>Fig. <ref type="figure">5</ref>(b) shows the exhaustive tree traversal case that loads all relevant coe cients of boundary nodes at one time. Fig. <ref type="figure">5(c</ref>) shows the nodes in the coe cient tree that we need to visit with our pruning strategy, and Fig. <ref type="figure">5(d)</ref> shows how the data value ranges and rasterization (grey pixels) are updated progressively for each pixel column. Note that the pruning strategy only searches necessary values, as we described in Theorem 1, and skips inter-column samples whose value ranges covered by both corresponding consecutive groups automatically. Since the value ranges of adjacent pixel columns depend on each other, we sequentially prune nodes and leave the parallel processing as future work.</p><p>Time Complexity. Given a window of width w, in the worst case, we need to traverse all the boundary nodes, so the time complexity is O(w log(n)). With our method, a large fraction of queries can be terminated early, irrespective of the length of the time series, so the time complexity is close to O(w). For the case shown in Fig. <ref type="figure">5(b)</ref>, there are 12 nodes from level 3 to level 5. Yet our method only needs to visit 10 of the nodes. Experimentally, we found that the pruning ratio is around or larger than 45% for all tested data (see Section 5.4). For the three-pixel-width window shown in Fig. <ref type="figure">5</ref>, directly applying M4 would fetch 12 samples for the three pixel columns, whereas our tree-based search would issue three database queries to fetch 20 coe cients from ve nodes. Note that as the data gets larger, our method would become faster. It is because our query can be done with a time complexity of O(w log(n)). In contrast, M4 has a time complexity of O(n), and n is typically much larger than w, as a time series could contain millions or even billions of samples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Additional Acceleration Strategies</head><p>The coe cients are stored in the database table with the schema (id, minc, maxc) as depicted in Fig. <ref type="figure">3(b)</ref>. For a window of width w &#8712; [2 p-1 , 2 p ], where p is a positive integer, we only need to load coe cients up to level p -1 by executing the following query: select minc, maxc where [id&gt;= start_id and index&lt;= end_id].</p><p>To further reduce the query latency, we introduce the following three additional strategies to accelerate the query.</p><p>Coe cients Prefetching. To better support drill-down operations, we can pre-load the coefcients at levels p and p + 1. As coe cients are stored contiguously, retrieving them from the database incurs minimal cost. On the other hand, the number of coe cients 2 p+2 is small compared to n (thanks to the OM 3 transform), so transferring all these coe cients can be done quickly.</p><p>Tree Caching. We can cache the reconstructed coe cient tree in the past interactions on the client side at runtime. By storing the tree as a list in the memory, we employ the least recently used (LRU) strategy for the cache replacement. During the interaction, the coe cients at the coarser levels are frequently used but they typically consume a small amount of memory. Hence, a small cache is enough for supporting uid interaction (see Section 5.2).</p><p>Query Merging. During the interaction based on incremental tree-based query, we need to fetch also the information on multiple boundary nodes from the database with the following query: select minc, maxc where [condition1] or ... or <ref type="bibr">[conditionN]</ref> where each condition corresponds to the starting and ending indices of a boundary node. To improve the query speed, we merge multiple conditions with continuous records together, e.g., the queries of nodes <ref type="bibr">[6,</ref><ref type="bibr">28]</ref> and <ref type="bibr">[13,</ref><ref type="bibr">35]</ref> at level 2 in Fig. <ref type="figure">5</ref>(b) can be merged. In doing so, we can reduce the overall query time by 40%.</p><p>Furthermore, we use two threads: one for analyzing the coe cients and the other for issuing queries to the database. Once the analysis is done, the queried coe cients at the ner level can be reused for the analysis at the next level e ciently.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Supported Interactions</head><p>OM 3 enables dynamic data fetching on demand for generating error-free visualizations. By storing the coe cients of multiple time series in the same database, we can allow interactive exploration of multiple large-scale time-series data with smooth interactions by concurrent query processing. Below, we introduce two families of common interactions that can be supported e ectively by OM 3 .</p><p>Resizing, Panning, and Zooming. One common feature of these interactions is that they all involve a new mapping between the time range of the data and the width of the display window. For instance, resizing shows the same time series on a window of changing sizes (see the example in Fig. <ref type="figure">5</ref>), whereas panning and zooming show a di erent time range of the time series on the same window.</p><p>For a time series shown in a window of width w &#8712; (2 p-1 , 2 p ], we can e ciently zoom into a time interval [t a , t b ] in four steps:</p><p>(1) compute node indices &#181; a and &#181; b associated with time stamps t a and t b , respectively, at level p;</p><p>(2) nd minimal extra level &#1013; such that |&#181; b&#181; a + 1|2 &#1013; &#8805; w;</p><p>(3) fetch the coe cients of all nodes within index range [&#181; a , &#181; b ] and their children from level p + 1 up to level p + &#1013;; and (4) perform the tree-based search to nd the boundary data points. With our incremental tree-based search, we can perform this process e ciently. For zooming in and out with a scaling factor of two, some boundary data points in the current view can be reused for the next view, thus further reducing the query cost. For a time series shown on a window of width w in [2 p-1 , 2 p ] and current time interval [t a , t b ], panning into a potentially overlapping interval [t c , t d ](t bt a = t dt c ) amounts to steps (1) and (4) for zooming. Timebox Query. Another family of common window query interactions is timebox query <ref type="bibr">[11]</ref>, in which users can select lines covered by a user-speci ed range or by a given rectangle. For multiple time series shown on the client visualization, timebox query retrieves a subset of lines that satisfy the user-speci ed constraints and highlights them in the current display window. However, to handle a large number of lines, the retrieval process proposed in <ref type="bibr">[11]</ref> is ine cient. Also, users cannot explore details of the queried lines. Thanks to the OM 3 coe cient tree, we can further accelerate the retrieval process and support zooming in and out of the queried lines.</p><p>Given a box constraint with a certain value range and time range, users want to nd the set of time series whose data values are entirely covered by the box in the box's time range. As each node in the coe cient tree stores the minimum and maximum data values for the time range covered by the node, we can e ciently perform a BFS search to nd the nodes whose time range is associated with the box's time range. There are three cases in the BFS search:</p><p>&#8226; Case (i): no intersection between the value range of the node and the value range of the box;</p><p>&#8226; Case (ii): the value range of the node is fully covered by the value range of the box; and</p><p>&#8226; Case (iii): the value range of the node is partially covered by the value range of the box, so we need to further explore the child nodes to check if it is Case (i) or Case (ii). Since this query is based on the given visualization, its processing does not require communication with the server, so it can run very fast solely on the client side. Once the queried lines are available, users are allowed to further interact with the queried lines by using the panning, zooming and resizing operations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EVALUATION</head><p>We developed our system in a client-server architecture. The client was implemented in JavaScript running on an 8-core 3.2GHz MacBook Pro with 32GB RAM, whereas the server was implemented using the PostgreSQL database (v14.2) and In uxDB (v2.0) running on an 8-core Intel Xeon Platinum 8269CY CPU with 2.5 GHz, 32GB RAM, and 250GB SSD with Ubuntu 22.04 in the Alibaba Cloud. We chose PostgreSQL because it was used for implementing the original M4 <ref type="bibr">[13]</ref>. As shown in Fig. <ref type="figure">2</ref>, the system can be con gured with the coe cients of any reversible multi-level transform, and thus we compare the OM 3 -based system with the one con gured by Haar wavelet <ref type="bibr">[3]</ref>, which applies the average function to compute the aggregate coe cients. Since the OM 3 representation is slightly di erent for time series with missing values, our implementation maintains di erent functions for preprocessing and reconstructing the time series without and with missing values. We performed the corresponding forward transforms in the server and stored the result coe cients in the PostgreSQL database. Besides comparing with the PostgreSQL-based M4 implementation that can generate error-free time-series visualizations, we further developed a version of M4 in In uxDB, which is a leading time-series database <ref type="bibr">[22]</ref> optimized for storing and querying time-series data. Since In uxDB requires the data with the column of timestamps, it cannot be used for storing Haar wavelet and OM 3 coe cients.</p><p>To show that OM 3 enables error-free time-series visualizations with low latency, we performed three quantitative comparisons with state-of-the-art methods in static, interactive, and progressive settings (Sections 5.1, 5.2, and 5.3) and a qualitative case study (Section 5.5). Since OM 3 needs an index while M4 runs immediately, the static setting reports the comparison between M4 and OM3's rst queries (with index cost), while the interactive setting focuses on the comparison on the subsequent queries. To study the in uence of the network transmission cost, we further conducted the same evaluation in an alternative setting with databases residing in the same local area network as the client, and found that the only di erence from the cloud setting is the round-trip time in the network (2 ms vs. 12 ms). Finally, we conducted an ablation study on OM 3 to learn the e ect of di erent acceleration strategies on tree-based query and a case study to demonstrate its e ectiveness for interactive exploration. Hence, we only show the evaluation results for the cloud setting here; the full evaluation results, including the screenshots and the corresponding scores of di erent measures on each dataset, can be found in the supplemental material.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Static Visualization</head><p>We compare line charts generated by various methods on the same data in terms of visual quality, data reduction ratio, and runtime. Procedure. For each dataset, we store it (or its coe cients) in the server database. Then, the client requests the server to send the data for generating line charts of six di erent window widths (200, 400, 600, 800, 1000, 1200) and same window height (600). Methods. Six methods are included for comparison: two versions of M4 <ref type="bibr">[13]</ref>, Haar wavelet <ref type="bibr">[30]</ref>, our method based on the multi-level min-max representation, and our OM 3 with and without the pruning strategy shown in Algorithm 1. Note that we implemented two versions of M4, namely M4-P based on PostgreSQL and M4-I based on In uxDB. Also, we refer OM 3 based on the multi-level min-max representation as M 3 and OM 3 without pruning as OM 3 -NP for short. M 3 and OM 3 use the same pruning-based incremental query algorithm with the only di erence being the coe cients stored at each node. Since the query results of In uxDB always contain the timestamp column, we implement M4 by only using the four extreme operators without the join operation required by the multiple copies of minima and maxima in each pixel column, where M4-P needs to nd all these duplicates from the server. In all methods, the maximum query and response times are caused by the dataset with &#8764;1.1 billion records. While our OM 3 takes only &#8764;380 ms, OM 3 -NP, M4-I, and M4-P require 800 ms, 9.1 min, and 13.2 min, respectively. So, OM 3 is around 1000 times faster than the state-of-the-art method M4 for error-free line plotting on the largest dataset.</p><p>To learn how the measures vary with window widths, we further explore two typical datasets, both with &#8764;8M records. The dataset used in Figs. <ref type="figure">8(a,</ref><ref type="figure">c</ref>) is a real-world dataset with missing values, while the one in Figs. <ref type="figure">8(b,</ref><ref type="figure">d</ref>) is a complete synthetic dataset with a power-of-two length. Here, we exclude Haar wavelet, as it cannot produce error-free visualizations. We can see that the data reduction ratio of all methods decreases as the window width increases, as more data points are needed for rendering more pixel columns. Among these methods, the decreasing rate of M4-I is the smallest and the ones of M4-P and M 3 on the stock price and synthetic dataset are the largest, respectively, while the ones of OM 3 and OM 3 -NP are in-between them. Yet, the data reduction ratio of OM 3 for both datasets is always larger than 0.99. Compared to OM 3 -NP, OM 3 takes only half amount of data points for generating the same visualization.</p><p>Regarding the response time, the observations are consistent with those shown in Fig. <ref type="figure">7</ref>. The latency of OM 3 is consistently less than 200 ms for both datasets as the window width increases, those of M 3 and OM 3 -NP are both around 300 ms, whereas those of M4-I and M4-P are around 1 s and 5 s, respectively. beginning and then remains stable. The reason is that OM 3 dynamically updates the coe cient tree on the client side and reuses the stored coe cients for the follow-up interactions, rather than performing each query from scratch, like in M4. Thus, with a small cache, resizing can done in 10 ms (see the three extreme points in Figs. <ref type="figure">9(a,</ref><ref type="figure">b</ref>)) for the changed window widths whose corresponding coe cients have been cached, and our zooming-out interaction takes only 12 ms on average (see Figs. <ref type="figure">9(e,</ref><ref type="figure">f</ref>)). Most panning, resizing, zooming-in, and the hybrid interactions take &lt;250 ms on average; thanks to our e cient query strategy. In contrast, the zooming-out and zooming-in operations with M4-P take around 4 seconds and over 6 seconds for the synthetic and stock price datasets, respectively, while both operations with M4-I take similar time (1 second) for both datasets. This is as expected, since M4-P requires querying more records in the database, while In uxDB is optimized for range queries over the time-series data <ref type="bibr">[26]</ref>. Last, M4-P is signi cantly slower on the stock price data than the synthetic data because this data has more duplicate values and the joint operation in M4-P causes more redundant data points. In contrast, our method and M4-I perform similarly on both datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Interactive Progressive Visualization</head><p>The last experiment focuses on exploring the e ectiveness of OM 3 -empowered progressive visualization of very large data. We conducted the evaluation on two large datasets: electrical power with 30M records and a synthetic data with 1B (2 30 ) records. After visualizing the data on a 1000 &#215; 600 window, we change this view by applying each of the three types of interactions (resizing, panning, and zooming-in) 20 times with di erent parameters. We did not test zooming-out, as it only takes around 15 ms for any data (see Figs. 9(e,f)) by reusing the cached coe cients. Once an interaction begins, we take the nal visualization as the reference to measure the SSIM value and the response time for each intermediate result.</p><p>Parameters. For resizing and zooming-in, the starting view shows the whole time range. The widths of the resized windows after 20 successive operations are 64, 128, 192, 256, 320, 384, 448, 512, 576, 640, 600, 1200, 1800, 2400, 3000, 3600, 4200, 4800, 5400, and 6000, while the zoomed relative time range is generated by [s%, s% + 100% -4% * r ], where s is a random start position and r is a unique integer in <ref type="bibr">[1,</ref><ref type="bibr">20]</ref>. For panning, we only show the data within the rst 50% time range in the display window and then pan the view to time range [2r %, (50 + 2r )%], where r is a unique integer in <ref type="bibr">[1,</ref><ref type="bibr">20]</ref>.</p><p>Results. To show that OM 3 can generate meaningful intermediate results at interactive speeds, we run 20 trials for each interaction according to the above parameters. For each trial, we measured the response time and the quality of the visualization generated by the results of tree-based query at each level (in terms of SSIM). The boxplots in Figs. <ref type="figure">10(a,</ref><ref type="figure">b</ref>) show that the response time to achieve 95% SSIM accuracy is less than 100 ms for panning and zooming-in with all parameters. Regarding resizing, the response time varies signi cantly with di erent parameters, but the largest time duration is still under 300 ms with window width 6000 for both data. To explore how the visualization quality improves over time, the line charts in Figs. <ref type="figure">10(c,</ref><ref type="figure">d</ref>) plot the curves between the SSIM values and the response time of a randomly-selected trial. We can see that the SSIM value quickly increases and reaches 0.95 within 75 ms, and then grows slowly to generate error-free visualizations. Such results indicate that our approach makes a trade-o between interactivity (&lt;70-100ms response) and progressive results (SSIM &#8805; 0.95).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Ablation Study</head><p>We conduct an ablation study on OM 3 to examine the e ect of the four major acceleration strategies: node pruning, coe cients prefetching, tree caching, and query merging. As such, we tested the produces error-free line charts. We thus formulate the forward and inverse OM 3 transforms to convert between a time series and a hierarchy of coe cients. This supports progressive error-free interactive any scale. We then extend OM 3 to time-series with missing values or non-power-of-two lengths. OM 3 only needs &#8776;3/4 space of the original time series.</p><p>To adopt OM 3 for smooth interactions, we design an incremental tree-based query method with time complexity O(w log(n)) (for display width w and time-series length n), and an e cient pruning strategy. We combine prefetching, caching, and merging query ranges to further reduce latency. These e ciently support the common time-series interactions, including resize, pan, and zoom. Both quantitative and qualitative evaluations show how OM 3 e ciently helps users explore billion-record datasets on remote cloud databases with &#8776;300ms latencies.</p><p>Future work can further reduce query times by adding lossless compression <ref type="bibr">[24]</ref> and using a predictive prefetching framework <ref type="bibr">[19]</ref>. When applying OM 3 to charts with multivariate time series, we can leverage line-based density representations, e.g., <ref type="bibr">[34]</ref>, to manage heavy visual clutter. We can also incorporate other types of visualizations such as trendlines <ref type="bibr">[23]</ref> to better support interactive exploration, since OM 3 faithfully recovers the original input data. We can also extend to streaming data via partial updates to the coe cient tree; most updates are likely local and do not change the min/max aggregates in higher levels of the tree (which would propagate to their descendants).</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Proc. ACM Manag. Data, Vol. 1, No. 2, Article 145. Publication date: June 2023.</p></note>
		</body>
		</text>
</TEI>
