CSE 662 - Languages and Runtimes for Big Data

Project Seeds

Reminder

Learned Index Structures due Weds (1 week)

Expectations

Checkpoint 1: Project Description (Due Sept 23, 11:59)

What is the specific challenge that you will solve?
What metrics will you use to evaluate success?
What deliverables will you produce?

Checkpoint 2: Progress Report (Due Oct 21, 11:59)

What challenges have you overcome so far?
How does your existing work compare to other, similar approaches?
What design decisions have you made so far and why?
How have your goals changed from checkpoint 1?
What challenges remain for you to overcome?

Checkpoint 3: Final Report (Due Dec 9, 11:59)

What specific challenges did you solve?
How does your final solution compare to other, similar approaches?
Were the design decisions you made correct and why?

Decentralized IoT Plumbing

What IoT Means

Lots of devices with...

Sensors (Temperature, RFID, Cameras): Inputs from the outside world.
Actuators (Robots, Lightbulbs, Conveyor Belts): Outputs to affect the outside world.
Reasonable Compute Resources: The ability to decide how to act on those inputs.

Core Idea

The user gives you...: A list of nodes (sensors/actuators); A list of activities (globally what to do and when)
Your code compiles and deploys...: Triggers for nodes (locally what to do and when)
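
As a rough illustration of the compile step, here is a minimal sketch that turns a global activity list into per-node trigger lists. The `Activity` and `Trigger` shapes, the threshold-only rule format, and the choice to evaluate each rule on the sensor's own node are all assumptions made for this example; deciding those things is part of the project.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Hypothetical global rule: when `sensor` reads above `threshold`, send `command` to `actuator`."""
    sensor: str
    threshold: float
    actuator: str
    command: str


@dataclass(frozen=True)
class Trigger:
    """Hypothetical local rule installed on a single node."""
    threshold: float
    target: str
    command: str


def compile_triggers(nodes, activities):
    """Split global activities into per-node trigger lists.

    Assumption: each activity is evaluated on the node that owns the sensor,
    which then sends the command to the actuator.
    """
    triggers = defaultdict(list)
    for act in activities:
        if act.sensor not in nodes or act.actuator not in nodes:
            raise ValueError(f"activity refers to an unknown node: {act}")
        triggers[act.sensor].append(Trigger(act.threshold, act.actuator, act.command))
    return dict(triggers)


def fire(triggers, node, reading):
    """Return the (target, command) messages that `node` sends for one reading."""
    return [(t.target, t.command) for t in triggers.get(node, []) if reading > t.threshold]


# Made-up deployment: one sensor, two actuators.
nodes = {"thermometer", "fan", "alarm"}
activities = [
    Activity("thermometer", 30.0, "fan", "on"),
    Activity("thermometer", 45.0, "alarm", "ring"),
]
local = compile_triggers(nodes, activities)
```

Placing every rule on the sensor node is the simplest possible plan. It ignores compute and network limits, which is where the optimization questions come in.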

Things to Think About...

How does the user specify activities to your system?
Which node (or nodes) is responsible for each required computation?
How do you get data from where it is to where the compute happens?
What resources (compute, network) will be needed to execute on your plan?
How do you optimize the necessary compute for one activity? Across all activities?

Uncertainty-Aware Machine Learning

Not all data sources are created equal.

Even within one data set, some data may be more trustworthy than others.

Mixed-Quality Training

How do you train a classifier, neural net, Markov model, etc. on mixed-quality data?

Preprocess the data ("fix" the errors)
Train separate models on subsets of the data
Ignore the errors and hope for the best

Problem: Usually easier to "fix" than to label missing data.

But what if the data is already labeled?

Core Idea

You get...: A dataset; Descriptions of uncertainty (what kind is up to you)
You make...: A model (of some sort) whose quality is higher when it uses the uncertainty labels than when it ignores them.

Ideally the model is interpretable as well.
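
One possible reading of "using labels", sketched below, is to treat each row's uncertainty label as a weight during training. The example is a weighted least-squares line fit in plain Python. The per-row `confidence` values and the toy data are invented for illustration, and weighting is only one of many ways a model could use such labels.

```python
def fit_line(xs, ys, weights=None):
    """Weighted least-squares fit of y = slope * x + intercept."""
    if weights is None:
        weights = [1.0] * len(xs)
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    cov = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    var = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    slope = cov / var
    return slope, mean_y - slope * mean_x


# Made-up data: the true relationship is y = 2x, but the last row is corrupted.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 2.0, 4.0, 6.0, 20.0]
# Hypothetical uncertainty labels: a confidence in [0, 1] for each row.
confidence = [1.0, 1.0, 1.0, 1.0, 0.05]

plain_slope, _ = fit_line(xs, ys)                  # ignores the labels
weighted_slope, _ = fit_line(xs, ys, confidence)   # uses the labels
```

On this toy data the weighted fit lands closer to the true slope of 2 than the unweighted one. A linear model also stays easy to interpret, which matches the stated preference for interpretability.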

Things to Think About

What statistical properties are you aiming for?
How should you describe uncertain data?
How should the model handle missing data? Less reliable data?
How does uncertainty in the training data affect the model's predictions?

Web-of-Trust for Crowdsourced Data

Crowdsourcing

Have a question?

Most people will give you a bad answer.

A few will give you a good answer.

The average of a bunch of bad answers and a few good answers is a good answer?

Crowdsourcing with Trust!

Web of Trust

Core Idea

You get...: A set of participants; A set of (possibly contradictory) facts stated by each participant; A set of trust levels for each pair of participants
You produce...: A (weighted?) set of facts for each user.
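
To make the input and output shapes concrete, here is a small sketch that scores each stated fact by the viewer's direct trust in whoever stated it. The dictionary layouts, the rule that missing trust counts as 0, and normalizing the weights per question are assumptions for this example, not part of the project statement.

```python
from collections import defaultdict


def weighted_facts(viewer, claims, trust):
    """Score each claimed value by the viewer's direct trust in whoever stated it.

    claims: {participant: {question: value}}
    trust:  {(truster, trustee): level in [0, 1]}; missing pairs count as 0.
    Returns {question: {value: normalized weight}} from the viewer's perspective.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for participant, stated in claims.items():
        level = 1.0 if participant == viewer else trust.get((viewer, participant), 0.0)
        for question, value in stated.items():
            scores[question][value] += level
    result = {}
    for question, by_value in scores.items():
        total = sum(by_value.values())
        if total > 0:
            result[question] = {v: s / total for v, s in by_value.items()}
    return result


# Made-up participants: two of three give the wrong answer.
claims = {
    "bob": {"capital_of_australia": "Sydney"},
    "carol": {"capital_of_australia": "Canberra"},
    "dave": {"capital_of_australia": "Sydney"},
}
trust = {("alice", "bob"): 0.1, ("alice", "carol"): 0.9, ("alice", "dave"): 0.2}

alice_view = weighted_facts("alice", claims, trust)
best_answer = max(alice_view["capital_of_australia"], key=alice_view["capital_of_australia"].get)
```

A plain majority vote would pick "Sydney" here. Weighting by trust lets the one well-trusted participant outweigh the two poorly trusted ones.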

Things to Think About

How do trust levels combine? (Transitively vs Additively)
How do derivations of contradictory facts combine (e.g., average trust vs most trusted wins)?
Can the model be maintained incrementally as new facts arrive/users change how much they trust other users?
What happens for pairs of users who don't know how much they trust each other?
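
One candidate answer to the transitive-combination and unknown-pair questions is sketched below: derive trust between two users as the best product of direct trust levels along any chain connecting them. The multiplicative rule, the "best path wins" choice, and the default of 0 for unconnected pairs are assumptions picked for illustration; an additive rule would behave differently.

```python
def transitive_trust(participants, direct):
    """Derive trust for every ordered pair as the best product of direct trust along any path.

    direct: {(truster, trustee): level in [0, 1]}
    Pairs with no connecting path get 0.0.
    """
    best = {
        (a, b): direct.get((a, b), 0.0)
        for a in participants
        for b in participants
        if a != b
    }
    # Floyd-Warshall-style relaxation, using max over products instead of min over sums.
    for mid in participants:
        for a in participants:
            for b in participants:
                if len({a, b, mid}) == 3:
                    via = best[(a, mid)] * best[(mid, b)]
                    if via > best[(a, b)]:
                        best[(a, b)] = via
    return best


# Made-up trust levels: alice has never rated carol directly.
participants = ["alice", "bob", "carol"]
direct = {("alice", "bob"): 0.8, ("bob", "carol"): 0.5}
derived = transitive_trust(participants, direct)
```

This version recomputes everything from scratch in cubic time, so it does not address incremental maintenance as facts and trust levels change.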

Sensitivity Analysis in Mimir

Problem: Often there is a very large number of possible worlds.

Solution: Break down possible worlds by choices.

Question: Which choices have the biggest impact on a query result?

Sensitivity/Influence

Sensitivity analysis and explanations for robust query evaluation in probabilistic databases.
Kanagal, Li, Deshpande (SIGMOD 2011)

Tracing data errors with view-conditioned causality
Meliou, Gatterbauer, Nath, Suciu (SIGMOD 2011)

Approach

Unit of Choice: Is a tuple (fact) in the source data or not?

Compute the "derivative" of the query result with respect to the probability of each source tuple.
Find the tuple that maximizes the derivative.
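
A minimal sketch of this approach, assuming independent source tuples and a Boolean query result described by a lineage function. Under that assumption the result probability is linear in each tuple's probability, so the derivative equals P(result | tuple present) minus P(result | tuple absent). The function names, the example lineage, and the brute-force enumeration of worlds are illustrative only and are not the algorithms from the cited papers.

```python
from itertools import product


def result_probability(tuple_probs, lineage):
    """Probability that the query result holds, by enumerating possible worlds.

    tuple_probs: {tuple_id: probability the tuple is in the source data}
    lineage: function from {tuple_id: bool} to bool (does the result hold in that world?)
    Assumes tuples are independent. Exponential in the number of tuples.
    """
    ids = list(tuple_probs)
    total = 0.0
    for present in product([False, True], repeat=len(ids)):
        world = dict(zip(ids, present))
        if lineage(world):
            weight = 1.0
            for tid in ids:
                weight *= tuple_probs[tid] if world[tid] else 1.0 - tuple_probs[tid]
            total += weight
    return total


def sensitivities(tuple_probs, lineage):
    """Derivative of P(result) with respect to each tuple's probability."""
    out = {}
    for tid in tuple_probs:
        with_t = result_probability({**tuple_probs, tid: 1.0}, lineage)
        without_t = result_probability({**tuple_probs, tid: 0.0}, lineage)
        out[tid] = with_t - without_t
    return out


# Hypothetical lineage: the result exists if t1 is present and at least one of t2, t3 is.
probs = {"t1": 0.9, "t2": 0.5, "t3": 0.2}
lineage = lambda w: w["t1"] and (w["t2"] or w["t3"])
influence = sensitivities(probs, lineage)
most_influential = max(influence, key=influence.get)
```

Here the derivatives work out to 0.6 for `t1`, 0.72 for `t2`, and 0.45 for `t3`, so `t2` is the tuple whose probability matters most.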

Mimir

Let queries call a nondeterministic "choice" function that decides which "world" to visit.


    SELECT CASE VGTerm("A", ROWID) WHEN 1 THEN "FOO" 
                                          ELSE "BAR" 
           END AS A, Input.*
    FROM Input;

VGTerm("A", ROWID) generates a separate value for each row.

Core Idea

You get...: A deterministic database; A non-deterministic query (and a set of tools for sampling from its outputs).
You produce...: Which "call" to the query has the biggest influence on the output.
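
To give a feel for the task, the sketch below imitates the setup in plain Python rather than through Mimir. A dictionary stands in for the per-row VGTerm("A", ROWID) calls, and the query sums an invented `amount` column over the rows labeled FOO. The table, the aggregate, and the "flip one choice and compare" influence measure are all assumptions for illustration. None of this is Mimir's actual API.

```python
import random

# Hypothetical deterministic input table: ROWID -> amount.
INPUT = {1: 1.0, 2: 10.0, 3: 100.0}


def sample_world(rng):
    """Pick one value per call site; each key plays the role of VGTerm("A", ROWID)."""
    return {rowid: rng.choice([0, 1]) for rowid in INPUT}


def run_query(world):
    """Label each row FOO or BAR as in the CASE expression, then sum the FOO amounts."""
    labels = {rowid: "FOO" if world[rowid] == 1 else "BAR" for rowid in INPUT}
    return sum(amount for rowid, amount in INPUT.items() if labels[rowid] == "FOO")


def influence(samples=200, seed=0):
    """Average absolute change in the output when one call's choice is flipped."""
    rng = random.Random(seed)
    totals = {rowid: 0.0 for rowid in INPUT}
    for _ in range(samples):
        world = sample_world(rng)
        for rowid in INPUT:
            flipped = {**world, rowid: 1 - world[rowid]}
            totals[rowid] += abs(run_query(world) - run_query(flipped))
    return {rowid: total / samples for rowid, total in totals.items()}


scores = influence()
most_influential_call = max(scores, key=scores.get)
```

For a plain sum, flipping a call always changes the output by that row's amount, so the call for row 3 dominates. Other queries and other influence measures will not be this simple, and running the query twice per call per sample is exactly the cost the efficiency question is about.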

Things to Think About

What kind(s) of influence measures make sense?
How to compute influence efficiently for all tuples in parallel?
Early pruning: Can some influence measures be computed exactly?

Sandboxed Python

Core Idea

You get...: Python Code; Inputs to the code (or a socket)
Your system produces...: Output for the code... without calling out of the sandbox.
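
As a starting point only, the sketch below shows one possible wrapper shape: a static check over the syntax tree that rejects imports and a few builtin names, followed by execution in a separate, isolated interpreter process. The policy list and function names are invented for this example. This is not a real sandbox: attribute tricks bypass the check easily, and the child process still has the parent's operating-system privileges.

```python
import ast
import subprocess
import sys

# Toy policy (an assumption for this sketch): names the checked code may not mention.
FORBIDDEN_NAMES = {"open", "exec", "eval", "compile", "__import__"}


def violations(source):
    """List constructs in `source` that this toy policy rejects."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            found.append(f"line {node.lineno}: import statement")
        elif isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            found.append(f"line {node.lineno}: use of {node.id}")
    return found


def run_checked(source, stdin_text="", timeout=5):
    """Run `source` in a separate interpreter if it passes the static check.

    Raises PermissionError when the check fails. The -I flag runs Python in
    isolated mode (no user site-packages, no PYTHON* environment variables).
    """
    problems = violations(source)
    if problems:
        raise PermissionError("; ".join(problems))
    done = subprocess.run(
        [sys.executable, "-I", "-c", source],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return done.stdout
```

A call such as `run_checked("print(int(input()) * 2)", "21\n")` passes the check and runs in the child process, while `run_checked("import os")` raises `PermissionError` before anything executes. Turning this into an enforceable guarantee would need OS-level isolation, which is what the questions about guarantees and tooling are getting at.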

Things to Think About

What security guarantees are you providing?
How can you prove to yourselves that those guarantees are enforced?
What tooling can you use to wrap/execute Python?

In-Class Assignment

Form a group of 3-4 people that you'll work with for the duration of the semester.
Come up with a clever group name (or one will be made up for you).
Challenge: Form a group with people you don't know or don't know well.