CSE-4/562 Spring 2019 - Streaming Queries

Streaming Queries

CSE-4/562 Spring 2019

March 8, 2019

Textbook:

Sequential Data

  • Temporal Data (today's focus)
  • Bi-Temporal (time the event happened + time the event was recorded)
  • Spatial (2d data, 3d data)
  • Spatio-Temporal Data (Spatial + Time)

Example Queries

Find the % change in monthly sales, each month


      SELECT A.Month, A.Sales-B.Sales / B.Sales 
      FROM (SELECT … AS Month, SUM(…) AS Sales FROM …) A, 
           (SELECT … AS Month, SUM(…) AS Sales FROM …) B 
      WHERE A.Month = B.Month + 1
    

Example Queries

Find the daily top-5 products by sales in the last week


      SELECT Product, SUM(…) AS Sales FROM … WHERE date = today - 1 
      ORDER BY Sales Desc LIMIT 5 UNION ALL 
      SELECT Product, SUM(…) AS Sales FROM … WHERE date = today - 2 
      ORDER BY Sales Desc LIMIT 5 UNION ALL …
    

Example Queries

Find the trailing n-day moving average of sales

… almost impossible to express if n is a parameter
(i.e., query size depends on N)

The WINDOW Operator

  1. Define a Sequence (i.e., sort the relation)
  2. Compute all subsequences
    • Fixed Physical Size: N records exactly.
    • Fixed Logical Size: Records within N units of time.
  3. Compute an aggregate for each subsequence (one output row per subsequence)

    SELECT L.state, T.month, 
       AVG(S.sales) OVER W as movavg
    FROM   Sales S, Times T, Locations L
    WHERE  S.timeid = T.timeid 
      AND  S.locid = L.locid
    WINDOW W AS ( 
       PARTITION BY L.state
       ORDER BY T.month
       RANGE BETWEEN INTERVAL ‘1’ MONTH PRECEDING
             AND INTERVAL ‘1’ MONTH FOLLOWING
    )
    
PARTITION BY
Like GROUP BY
ORDER BY
The sequence to create. The output has one row for each value of this column.
RANGE BETWEEN ... AND ...
Physical/Logical size of the window
[Aggregate] OVER [WindowName]
A single query can have multiple windows.
OLAP
Changing Queries, Fixed Data
OLTP
Changing Data, Minimal Queries
Streaming
Changing Data, Fixed Queries

Challenge: Need to react to new data as it arrives

Streaming Queries

  • Push vs Pull Data Flow
  • Revisit Joins
  • Revisit Indexing
  • Revisit Aggregation
  • Each operator operates independently
  • Each operator buffers input
  • Operators are scheduled (possibly in different threads)

Stream Joins

  • Problem 1: Eventually run out of buffer!
  • Problem 2: Looping over all buffer entries is slow

Idea 1: Mandate ONLY WINDOW queries

Idea 2: Index the buffer!

Challenge: Maintaining the index as tuples fall out of the buffer

Streaming Indexes

Tuples always enter from one side and exit out the other

Lots of lookups for active tuples.

What is the best layout?

Queue/Linked List
Insert/remove in (nearly) temporal order
Hash/Tree
Lookup (randomly ordered) Join Key

$O(1 + log(|W|))$ insertions.

$O(1 + log(|W|))$ expiration.

Streaming Aggregation

Ring Aggregates (Sum, Count, Average)
Add new values - $O(|\Delta|)$
Subtract old values - $O(|\Delta|)$
Semiring Aggregates (Min, Max)
Rescan for new max - $O(|W|)$

Summary

Push vs Pull Data Flow
Push is a better fit because sources produce data at different rates.
Revisit Joins
Focus on ripple-style WINDOW joins.
Revisit Indexing
Linked Hash/Tree Indexes for efficient windowed indexing.
Revisit Aggregation
Sliding window aggregates.