CSE-4/562 Spring 2019 - Intro

Intro

CSE-4/562 Spring 2019

January 28, 2019

Textbook: Ch. 1, 2.1-2.2

Why Are Databases Awesome?

They're Everywhere

$$$

Rank Organization Sales (B$) FY Market cap (B$) Headquarters
1United StatesMicrosoft86.62017601Redmond, WA, US
2United StatesOracle37.22017205Redwood City, CA, US
3GermanySAP23.22017117Walldorf, Germany
4United StatesSalesforce.com8.4201769San Francisco, CA, US
5United StatesVMware6.7201748Palo Alto, CA, US
6United StatesFiserv5.3201726Brookfield, WI, US
7United StatesAdobe Systems5201784San Jose, CA, US
8United StatesSymantec5.4201719Mountain View, CA, US
9SpainAmadeus IT Holdings4.3201725Madrid, Spain

5 of 9 Forbes Top Software Companies
Have a Focus on Data Management Systems

(Source wikipedia.org)

Interesting Problems

What is "Databases"?

How do we ask and answer questions about data?

How do we manipulate and persist data?

Databases

Techniques

Data Modeling

Cost-Based Optimization

Recipes

Join Algorithms

Index Data Structures

Knowledge

The Memory Hierarchy

Data Consistency

Which Tools To Use

And When?

Template for 90% of this class

What is the best, correct technique for task X, when Y is true?

  1. How do you define Correct and Best?
  2. What correct alternatives are available?
  3. How do you find the best available alternative

General Course Information

Expectations

Algorithms / Data Structures
$O(\cdot)$ Analysis, Sort Algos, Trees, Hash Tables
How to use a database
CSE 4/560 (or equivalent)
Java (or Scala)
This means actual programming experience
(C++, C# or similar is usually good enough)

Me

Oliver Kennedy

TAs

Vicky Zheng Qiuling Suo

Ninjas

William Spoth Darshana Balakrishnan Carl Nuessle

Syllabus and Projects

https://odin.cse.buffalo.edu/teaching/cse-462/

https://odin.cse.buffalo.edu/teaching/cse-562/

(same link)

Course Forum

https://piazza.com/buffalo/spring2019/cse4562/home

$150 $50
Index
ToC
No Index
ToC Summary

Course Structure

Concepts (50% of Grade)
  • Homework (10%; ~12 Assignments, Drop any 4)
  • March 13: Midterm (20% or 15%)
  • May 17?: Comprehensive Final (20% or 25%)
Practicum (50% of Grade)
  • Build a Relational Query Engine
  • 3-Person Group Project
  • 4 Checkpoints (+ 5 free points for Checkpoint 0)

Embedded Databases

  • SQLite (In your browser, computer, phone, fridge...)
  • Simple, Easy-To-Use Declarative Data Management
  • Critical for future tech: Part of Mobile, IoT, Web

Your Startup: Build the next great Embedded Database

We give you...

Data (CSV Files)

Schema Information (CREATE TABLE)

Questions (SQL Queries)


You give us...

Answers

(really really fast)

Real World Challenge

You get graded on your code's...

Correctness
~1/3 credit for getting the right answer.
Performance
~2/3 credit for getting it reasonably fast.

Checkpoint 0: "Hello World"

5/50 pts

  • Form groups
  • Submit a simple Java program
  • Make sure that the submission workflow works for you.

Checkpoint 1: "Get it Working"

10/50 pts

  • Translate SQL to Relational Algebra
  • Load CSV Files
  • Run Basic Select, Project, Join Queries

Checkpoint 2: "Big Data"

10/50 pts

  • Order By
  • Limit
  • Nested Queries
  • Aggregation
  • Too much data for memory

Checkpoint 3: "Precomputation"

15/50 pts

  • You get a few minutes to pre-compute
  • Load data
  • Cache views
  • Build indexes

Checkpoint 4: "The Real World"

10/50 pts

  • We give you a Buffer Manager
  • Dynamic, skewed workload

Ways to Fail

  • Start your project at the last minute
  • Don’t go to office hours
  • Don’t ask questions on Piazza
  • Wait until the deadline to submit for the first time
  • Cheat

Academic Integrity

Cheating is submitting any work that you did not perform by yourself as if you did.

References (when cited)
Wikipedia, Wikibooks (or similar): OK
Public Code
Stack Exchange (or similar): Not OK
Discussing concepts/ideas with classmates
“A hash index has O(1) lookups”: OK (except during exams 😇 )
Sharing code or answers with anyone
“Just have a look at how I implemented it”: NOT OK
For-hire code: NOT OK

MOSS

MOSS

Zero Tolerance
If I catch you submitting someone else’s code (including pay-for-code services), you will fail the class.
Group Responsibility
If your teammate cheats on a group project, the entire group will be penalized.
Share Code, Share Blame
If someone else submits your code as their own, you will be penalized as well.

Questions/Concerns?

What does a Data Management System Do?

Analysis: Answering user-provided questions about data
What kind of tools can we give end-users?
  • Declarative Languages
  • Organizational Datastructures (e.g., Indexes)
Manipulation: Safely persisting and sharing data updates
What kind of tools can we give end-users?
  • Consistency Primitives
  • Data Validation Primitives

So let's talk structure...

Primitive
Basic building blocks like Int, Float, Char, String
Tuple
Several ‘fields’ of different types. (N-Tuple = N fields)
A Tuple has a ‘schema’ defining each field
Set
A collection of unique records, all of the same type
Bag
An unordered collection of records, all of the same type
List
An ordered collection of records, all of the same type

Your data is currently an Unordered Set
of Tuples with 100 fields each.

Tomorrow, you’ll be repeatedly asked for 1 specific attribute
of 5 specific rows identified by the first attribute

Can you do better?

Better Idea: Rewrite data into a 99-Tuple of Maps keyed on the 1st attribute

This representation is equivalent and better for your needs.

Declarative specifications make it easier to find equivalences.