Gap Filling and Interpolation (GFI)
A Swiss-Army Knife for Time Series Analytics
Gap Filling and Interpolation (GFI) is a set of patent-pending time series analytics features in Vertica. In this post, we walk through additional use cases to show how GFI enables Vertica users across a wide range of industry sectors to achieve a diverse set of goals.
Use Case #1: Aggregate bandwidth analysis for Video-On-Demand (VOD) applications
In a Telecommunications VOD workload, assume we have the following (simplified) fact table schema for tracking the quality of service (QoS) in serving video content to end users: vod_qos_fact(user_id, video_id, timestamp, bandwidth).
Each time a user requests a video clip, a row is inserted into the fact table, recording the timestamp and the bandwidth consumption of serving the video clip at that time. In addition, when the bandwidth consumption changes significantly for a video clip being served, a new row is inserted to record the new bandwidth consumption. An example fact table might be:
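(The rows below are hypothetical and shown only to illustrate the schema; the bandwidth column is in, say, Mbps.)

user_id | video_id | timestamp           | bandwidth
--------|----------|---------------------|----------
   1    |   362    | 2010-05-01 10:00:00 |     4
   1    |   362    | 2010-05-01 10:02:30 |     6
   2    |   719    | 2010-05-01 10:00:45 |     3
  ...   |   ...    |         ...         |    ...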
In this example, the VOD QoS analyst might like to know the following: for every minute, what is the aggregate bandwidth consumption across all users? This question can be answered in Vertica 4.0 in two pipelined steps.
First, the GFI time series analytics is used to compute each user's interpolated bandwidth consumption for every minute.
Second, the interpolated bandwidth consumption is summed across all users for every minute. The SQL formulation of these two steps is the following.
SELECT slice_time, SUM(i_bandwidth) AS agg_bandwidth
FROM (SELECT user_id, slice_time, TS_FIRST_VALUE(bandwidth) AS i_bandwidth
      FROM vod_qos_fact
      TIMESERIES slice_time AS '1 minute' OVER (PARTITION BY user_id ORDER BY timestamp)) tmp
GROUP BY slice_time;
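By default, TS_FIRST_VALUE gap-fills each one-minute slice with constant interpolation, i.e. the last observed bandwidth value is carried forward. If a linear estimate between consecutive readings is a better model of bandwidth consumption for your data, the interpolation scheme can be stated explicitly. A minimal sketch of that variant, assuming linear interpolation is what the analysis actually wants:

SELECT slice_time, SUM(i_bandwidth) AS agg_bandwidth
FROM (SELECT user_id, slice_time,
             -- 'linear' interpolates between consecutive readings; the default is 'const'
             TS_FIRST_VALUE(bandwidth, 'linear') AS i_bandwidth
      FROM vod_qos_fact
      TIMESERIES slice_time AS '1 minute' OVER (PARTITION BY user_id ORDER BY timestamp)) tmp
GROUP BY slice_time;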
In comparison to other popular alternatives such as stored procedures, Map/Reduce scripts, and ODBC/JDBC programs, the Vertica approach enjoys the following benefits:
- Code and architecture simplicity: The computation is expressed in SQL, and is performed within the Vertica database. No additional programming language or execution platform is needed.
- Efficiency and scalability: The Vertica approach effectively balances the data and computation across all cluster nodes, and minimizes the amount of disk and network I/O needed to accomplish this computation task.
These benefits apply to the following use cases as well.
Use Case #2: Storage compression for inventory management applications
In a retail workload, assume we have the following (simplified) fact table schema for inventory management: inventory_fact(item_id, item_location, timestamp, quantity_in_stock).
Every time the stock quantity of an item at a given location changes, a new row is inserted into this fact table with the updated quantity and the timestamp of that inventory change.
For example, I recently went to a nearby Apple Retail Store to purchase an iPad. After that transaction, suppose Apple's inventory_fact table is populated with a new row, indicating that for this particular Apple store, the quantity of iPads in stock has decreased by 1.
For a popular product like the iPad, the inventory level in each store may change thousands of times or more each day, creating a large number of rows in the inventory_fact table. However, for the purpose of inventory analysis, the inventory data can often be stored in a much coarser time granularity, such as one row per day for each item and location pair.
The GFI time series analytics in Vertica 4.0 can be used to efficiently aggregate the raw inventory data into a proper time granularity, thus consuming much less storage as well as significantly speeding up inventory analysis. Note that in comparison, traditional SQL analytic functions such as FIRST_VALUE cannot be used to correctly compute the aggregation (the reasoning behind this statement is left as an exercise for the interested reader).
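As a sketch, reusing the inventory_fact schema above and assuming that one end-of-day quantity per item and location is the granularity the analysis needs, a GFI query along these lines rolls the raw rows up into daily snapshots. TS_LAST_VALUE, the companion of TS_FIRST_VALUE, picks the last recorded quantity in each one-day slice and carries it forward through days with no changes:

SELECT item_id, item_location, slice_time AS day,
       -- last recorded quantity within each day, carried forward across quiet days
       TS_LAST_VALUE(quantity_in_stock) AS eod_quantity
FROM inventory_fact
TIMESERIES slice_time AS '1 day' OVER (PARTITION BY item_id, item_location ORDER BY timestamp);

The output, one row per item, location, and day, can then be stored in a much smaller daily fact table for subsequent inventory analysis.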
Use Case #3: Power reduction for wireless sensor applications
In a Smartdust project, computer scientists may deploy a large number of sensors across a remote region to measure environmental parameters such as light and temperature. The sensors transmit their measurement readings via wireless signals to one or more base stations, where installed applications process the measurements.
Since these sensors are battery-powered, efficient power management is crucial for prolonged operation. The key to efficient power management is to minimize the amount of data transmitted from the sensors to the base stations.
A base station application may require that measurement updates be available at regular time intervals, for example once every second. One straightforward implementation is to have each sensor transmit a new measurement for every interval. However, this transmission policy can cause a huge battery drain on the sensors, especially since consecutive measurements from the same sensor often carry the same value.
Instead, we could have the sensors submit their readings only when the measurement values change, and then use GFI on the base station application to losslessly recover the required measurement values at regular intervals.
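As a sketch, assume the base station stores the transmitted change events in a hypothetical table sensor_fact(sensor_id, timestamp, temperature). The per-second readings the application requires can then be reconstructed with a GFI query like this:

SELECT sensor_id, slice_time,
       -- default constant interpolation carries the last transmitted reading forward
       TS_FIRST_VALUE(temperature) AS temperature
FROM sensor_fact
TIMESERIES slice_time AS '1 second' OVER (PARTITION BY sensor_id ORDER BY timestamp);

Because a sensor only transmits when its value changes, carrying the last reported value forward recovers exactly the readings the sensor would otherwise have sent every second, which is why the recovery is lossless.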
Conclusion
The above three use cases illustrate the use of Vertica's GFI time series analytics in Telecommunications, Retail, and Science applications, respectively. You may be wondering how to assess whether GFI applies to your particular problem. One way is to look for the following common characteristics of problems to which GFI is applicable:
- You have multiple time series event streams (say, from multiple users or multiple stock symbols) whose events do not occur at regular time intervals, but you would like to aggregate or correlate over these event streams (use case #1)
- The raw data is stored at a finer (denser) time granularity than your analysis needs (use case #2)
- The raw data is stored at a sparser time granularity than your analysis needs (use case #3)
With Vertica, the built-in GFI functionality allows you to simplify and accelerate your time series analytics.