CSE 4/562 - Database Systems

Intro

CSE 4/562 – Database Systems

January 29, 2018

Why Are Databases Awesome?

They're Everywhere

$$$

Rank Organization Sales (B$) FY Market cap (B$) Headquarters
1United StatesMicrosoft86.62017601Redmond, WA, US
2United StatesOracle37.22017205Redwood City, CA, US
3GermanySAP23.22017117Walldorf, Germany
4United StatesSalesforce.com8.4201769San Francisco, CA, US
5United StatesVMware6.7201748Palo Alto, CA, US
6United StatesFiserv5.3201726Brookfield, WI, US
7United StatesAdobe Systems5201784San Jose, CA, US
8United StatesSymantec5.4201719Mountain View, CA, US
9SpainAmadeus IT Holdings4.3201725Madrid, Spain

5 of 9 Forbes Top Software Companies
Have a Focus on Data Management Systems

(Source wikipedia.org)

Interesting Problems

What is "Databases"?

How do we ask and answer questions about data?

  • Efficiency
  • Accuracy
  • Summarization
  • Data Curation

How do we manipulate and persist data?

  • Consistency
  • Correctness
  • Parallelism

Databases

Techniques

Data Modeling

Cost-Based Optimization

Recipes

Join Algorithms

Index Data Structures

Knowledge

The Memory Hierarchy

Data Consistency

Which Tools To Use

And When?

Template for 90% of Database Systems

For X, the best, correct choice is Y, at least when Z.

  1. How do you define Correct and Best?
  2. What correct alternatives are available?
  3. How do you find the best available alternative

General Course Information

People

  • Oliver Kennedy (me)
  • Gokhan Kul (Practicum Lead)
  • William Spoth (Project TA)
  • Saurav Singhi (Concept TA)
  • Carl Nuessle (Ninja)
  • [ TBD ]

Syllabus and Projects

https://odin.cse.buffalo.edu/teaching/cse-462/

https://odin.cse.buffalo.edu/teaching/cse-562/

(same link)

Course Forum

https://piazza.com/buffalo/spring2018/cse4562/home

Course Structure

Concepts (50% of Grade; Lectures on Mon/Wed)
  • Homework (10%; 12-15 Assignments, Keep Best 10)
  • Midterm (20% or 15%)
  • Comprehensive Final (20% or 25%)
Practicum (50% of Grade; Lectures on Fri)
  • Build a Relational Query Engine
  • 3-Person Group Project
  • 5 "Checkpoints"
$150 $50
Index
ToC
No Index
ToC Summary

Embedded Databases

  • SQLite (In your browser, computer, phone, fridge...)
  • Simple, Easy-To-Use Declarative Data Management
  • Critical for future tech: Part of Mobile, IoT, Web

Your Startup: Build the next great Embedded Database

We give you...

Data (CSV Files)

Schema Information (CREATE TABLE)

Questions (SQL Queries)


You give us...

Answers

(really really fast)

Real World Challenge

You get graded on your code's...

Correctness
~1/3 credit for getting the right answer.
Performance
~2/3 credit for getting it reasonably fast.

Checkpoint 0: "Hello World"

5/50 pts

  • Form groups
  • Submit a simple Java program
  • Make sure that the submission workflow works for you.

Checkpoint 1: "Intro to CSV"

10/50 pts

  • Parse SQL with JSQLParser
  • Load CSV Files
  • Project (Map) Data
  • Select (Filter) Data

Checkpoint 2: "Real SQL"

10/50 pts

  • Joins
  • Order By
  • Limit
  • Nested Queries
  • Interactive Prompt

Checkpoint 3: "Optimization"

15/50 pts

  • Aggregation
  • Actual Data (no naive algorithms)

Checkpoint 4: "The Real World"

10/50 pts

  • Too much data for memory
  • Time for precomputation

Ways to Fail

  • Start your project at the last minute
  • Don’t go to office hours
  • Don’t ask questions on Piazza
  • Wait until the deadline to submit for the first time
  • Cheat

Academic Integrity

Cheating is submitting any work that you did not perform by yourself as if you did.

References (when cited)
Wikipedia, Wikibooks (or similar): OK
Public Code
Stack Exchange (or similar): Not OK
Discussing concepts/ideas with classmates
“A hash index has O(1) lookups”: OK (except during exams 😇 )
Sharing code or answers with anyone
“Just have a look at how I implemented it”: NOT OK
For-hire code: NOT OK

MOSS

MOSS

Zero Tolerance
If I catch you submitting someone else’s code, you will fail the class.
Group Responsibility
If your teammate cheats on a group project, the entire group will be penalized.
Share Code, Share Blame
If someone else submits your code as their own, you will be penalized as well.

Questions/Concerns?

What does a Data Management System Do?

Analysis: Answering user-provided questions about data
What kind of tools can we give end-users?
  • Declarative Languages
  • Organizational Datastructures (e.g., Indexes)
Manipulation: Safely persisting and sharing data updates
What kind of tools can we give end-users?
  • Consistency Primitives
  • Data Validation Primitives

So let's talk structure...

Primitive
Basic building blocks like Int, Float, Char, String
Tuple
Several ‘fields’ of different types. (N-Tuple = N fields)
A Tuple has a ‘schema’ defining each field
Set
A collection of unique records, all of the same type
Bag
An unordered collection of records, all of the same type
List
An ordered collection of records, all of the same type

Your data is currently an Unordered Set
of Tuples with 100 fields each.

Tomorrow, you’ll be repeatedly asked for 1 specific attribute
of 5 specific rows identified by the first attribute

Can you do better?

Better Idea: Rewrite data into a 99-Tuple of Maps keyed on the 1st attribute

This representation is equivalent and better for your needs.

Declarative specifications make it easier to find equivalences.