CSE-4/562 Spring 2019

Streaming Queries

March 8, 2019

Textbook:

Sequential Data

Temporal Data (today's focus)
Bi-Temporal (time the event happened + time the event was recorded)
Spatial (2d data, 3d data)
Spatio-Temporal Data (Spatial + Time)

Example Queries

Find the % change in monthly sales, each month


      SELECT A.Month, A.Sales-B.Sales / B.Sales 
      FROM (SELECT … AS Month, SUM(…) AS Sales FROM …) A, 
           (SELECT … AS Month, SUM(…) AS Sales FROM …) B 
      WHERE A.Month = B.Month + 1

Example Queries

Find the daily top-5 products by sales in the last week


      SELECT Product, SUM(…) AS Sales FROM … WHERE date = today - 1 
      ORDER BY Sales Desc LIMIT 5 UNION ALL 
      SELECT Product, SUM(…) AS Sales FROM … WHERE date = today - 2 
      ORDER BY Sales Desc LIMIT 5 UNION ALL …

Example Queries

Find the trailing n-day moving average of sales

… almost impossible to express if n is a parameter
(i.e., query size depends on N)

The `WINDOW` Operator

Define a Sequence (i.e., sort the relation)
Compute all subsequences
- Fixed Physical Size: N records exactly.
- Fixed Logical Size: Records within N units of time.
Compute an aggregate for each subsequence (one output row per subsequence)


    SELECT L.state, T.month, 
       AVG(S.sales) OVER W as movavg
    FROM   Sales S, Times T, Locations L
    WHERE  S.timeid = T.timeid 
      AND  S.locid = L.locid
    WINDOW W AS ( 
       PARTITION BY L.state
       ORDER BY T.month
       RANGE BETWEEN INTERVAL ‘1’ MONTH PRECEDING
             AND INTERVAL ‘1’ MONTH FOLLOWING
    )

PARTITION BY: Like GROUP BY
ORDER BY: The sequence to create. The output has one row for each value of this column.
RANGE BETWEEN ... AND ...: Physical/Logical size of the window
[Aggregate] OVER [WindowName]: A single query can have multiple windows.

OLAP: Changing Queries, Fixed Data
OLTP: Changing Data, Minimal Queries
Streaming: Changing Data, Fixed Queries

Challenge: Need to react to new data as it arrives

Streaming Queries

Push vs Pull Data Flow
Revisit Joins
Revisit Indexing
Revisit Aggregation

Each operator operates independently
Each operator buffers input
Operators are scheduled (possibly in different threads)

Stream Joins

Problem 1: Eventually run out of buffer!
Problem 2: Looping over all buffer entries is slow

Idea 1: Mandate ONLY WINDOW queries

Idea 2: Index the buffer!

Challenge: Maintaining the index as tuples fall out of the buffer

Streaming Indexes

Tuples always enter from one side and exit out the other

Lots of lookups for active tuples.

What is the best layout?

Queue/Linked List: Insert/remove in (nearly) temporal order
Hash/Tree: Lookup (randomly ordered) Join Key

$O(1 + log(|W|))$ insertions.

$O(1 + log(|W|))$ expiration.

Streaming Aggregation

Ring Aggregates (Sum, Count, Average): Add new values - $O(|\Delta|)$; Subtract old values - $O(|\Delta|)$
Semiring Aggregates (Min, Max): Rescan for new max - $O(|W|)$

Summary

Push vs Pull Data Flow: Push is a better fit because sources produce data at different rates.
Revisit Joins: Focus on ripple-style WINDOW joins.
Revisit Indexing: Linked Hash/Tree Indexes for efficient windowed indexing.
Revisit Aggregation: Sliding window aggregates.

Streaming Queries

CSE-4/562 Spring 2019

Textbook:

Sequential Data

Example Queries

Example Queries

Example Queries

The WINDOW Operator

Streaming Queries

Stream Joins

Streaming Indexes

What is the best layout?

Streaming Aggregation

Summary

The `WINDOW` Operator