CSE-4/562 Spring 2019 - Streaming Queries

March 8, 2019

- Temporal Data (today's focus)
- Bi-Temporal (time the event happened + time the event was recorded)
- Spatial (2d data, 3d data)
- Spatio-Temporal Data (Spatial + Time)

Find the % change in monthly sales, each month

```
SELECT A.Month, A.Sales-B.Sales / B.Sales
FROM (SELECT … AS Month, SUM(…) AS Sales FROM …) A,
(SELECT … AS Month, SUM(…) AS Sales FROM …) B
WHERE A.Month = B.Month + 1
```

Find the daily top-5 products by sales in the last week

```
SELECT Product, SUM(…) AS Sales FROM … WHERE date = today - 1
ORDER BY Sales Desc LIMIT 5 UNION ALL
SELECT Product, SUM(…) AS Sales FROM … WHERE date = today - 2
ORDER BY Sales Desc LIMIT 5 UNION ALL …
```

Find the trailing n-day moving average of sales

… almost impossible to express if n is a parameter

(i.e., query size depends on N)

`WINDOW`

Operator- Define a Sequence (i.e., sort the relation)
- Compute all subsequences
- Fixed
__Physical__Size: N records exactly. - Fixed
__Logical__Size: Records within N units of time.

- Fixed
- Compute an aggregate for each subsequence (one output row per subsequence)

```
SELECT L.state, T.month,
AVG(S.sales) OVER W as movavg
FROM Sales S, Times T, Locations L
WHERE S.timeid = T.timeid
AND S.locid = L.locid
WINDOW W AS (
PARTITION BY L.state
ORDER BY T.month
RANGE BETWEEN INTERVAL ‘1’ MONTH PRECEDING
AND INTERVAL ‘1’ MONTH FOLLOWING
)
```

`PARTITION BY`

- Like
`GROUP BY`

`ORDER BY`

- The sequence to create. The output has one row for each value of this column.
`RANGE BETWEEN ... AND ...`

- Physical/Logical size of the window
`[Aggregate] OVER [WindowName]`

- A single query can have multiple windows.

- OLAP
- Changing Queries, Fixed Data
- OLTP
- Changing Data, Minimal Queries
- Streaming
- Changing Data, Fixed Queries

**Challenge: ** Need to react to new data __as it arrives__

- Push vs Pull Data Flow
- Revisit Joins
- Revisit Indexing
- Revisit Aggregation

- Each operator operates independently
- Each operator buffers input
- Operators are scheduled (possibly in different threads)

**Problem 1:**Eventually run out of buffer!**Problem 2:**Looping over all buffer entries is slow

**Idea 1: ** Mandate ONLY `WINDOW`

queries

**Idea 2: ** Index the buffer!

**Challenge: ** Maintaining the index as tuples fall out of the buffer

Tuples always enter from one side and exit out the other

Lots of lookups for active tuples.

- Queue/Linked List
- Insert/remove in (nearly) temporal order
- Hash/Tree
- Lookup (randomly ordered) Join Key

$O(1 + log(|W|))$ insertions.

$O(1 + log(|W|))$ expiration.

- Ring Aggregates (Sum, Count, Average)
- Add new values - $O(|\Delta|)$
- Subtract old values - $O(|\Delta|)$
- Semiring Aggregates (Min, Max)
- Rescan for new max - $O(|W|)$

- Push vs Pull Data Flow
- Push is a better fit because sources produce data at different rates.
- Revisit Joins
- Focus on ripple-style
`WINDOW`

joins. - Revisit Indexing
- Linked Hash/Tree Indexes for efficient windowed indexing.
- Revisit Aggregation
- Sliding window aggregates.