
Checkpoint 3

CSE-4/562 Spring 2019

April 10, 2019

Phase 1
  • 5 minutes (total) to process CREATE TABLEs
  • Stream of queries, as before

Phase 2 (Restart)
  • No CREATE TABLE issued
  • Stream of queries, as before

Both phases in-memory

Optimization Ideas

  • Cache smaller tables in-memory
    Avoid repeated parsing overheads.
  • Build Indexes
    CREATE TABLEs will include PRIMARY KEY and other suggested indexes.
  • Gather statistics
    Pick the best available index / join order.
  • Save pre-parsed tables
    Avoid parsing overheads on first load.
  • Projection pushdown (sketched below)
    Avoid parsing, store smaller tuples, fewer memory copies.
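
As a quick illustration of the last idea (the relations $R$, $S$ and attributes $id$, $name$ are invented for this example): pushing the projection below the cross product keeps intermediate tuples narrow.

$$\pi_{name}(\sigma_{R.id = S.id}(R \times S)) \Rightarrow \pi_{name}(\sigma_{R.id = S.id}(\pi_{id, name}(R) \times \pi_{id}(S)))$$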

In-Memory Cost-Based Optimization

Remember the Real Goals

  1. Accurately rank the plans.
  2. Don't spend more time optimizing than you get back.
  3. Don't pick a plan that uses more memory than you have.

Typical cost centers

  1. Parsing
  2. Memory Copies (Tuple Construction)
  3. (Nearly) everything else

Not universal; every system is different.

HProf


  $> java -Xrunhprof:cpu=samples dubstep.Main
    

or


  $> java -Xrunhprof:cpu=time dubstep.Main
    

Sampling is faster; time is more accurate.

Output ends up in ./java.hprof.txt

Example java.hprof.txt

...
... lots of gobbledygook ...
...
CPU SAMPLES BEGIN (total = 126) Fri Oct 22 12:12:14 2004
rank   self  accum   count trace method
   1 53.17% 53.17%      67 300027 java.util.zip.ZipFile.getEntry
   2 17.46% 70.63%      22 300135 java.util.zip.ZipFile.getNextEntry
   3  5.56% 76.19%       7 300111 java.lang.ClassLoader.defineClass2
   4  3.97% 80.16%       5 300140 java.io.UnixFileSystem.list
   5  2.38% 82.54%       3 300149 java.lang.Shutdown.halt0
   6  1.59% 84.13%       2 300136 java.util.zip.ZipEntry.initFields
   7  1.59% 85.71%       2 300138 java.lang.String.substring
   8  1.59% 87.30%       2 300026 java.util.zip.ZipFile.open
   9  0.79% 88.10%       1 300118 com.sun.tools.javac.code.Type$ErrorType.<init>
  10  0.79% 88.89%       1 300134 java.util.zip.ZipFile.ensureOpen
    
https://docs.oracle.com/javase/7/docs/technotes/samples/hprof.html

The 'self' column shows how much time is spent in that method.

The 'count' column shows the number of samples/invocations.

The 'trace' column links back to a stack trace of the sample/invocation.


TRACE 300027:
java.util.zip.ZipFile.getEntry(ZipFile.java:Unknown line)
java.util.zip.ZipFile.getEntry(ZipFile.java:253)
java.util.jar.JarFile.getEntry(JarFile.java:197)
java.util.jar.JarFile.getJarEntry(JarFile.java:180)
    

Main Goal: Figure out where your code is spending its time, and make it spend less time there.

Idea: Minimize the number of "intermediate" tuples

Estimating IOs requires estimating $|Q(R)|$ and $|\delta_A(Q(R))|$

Cardinality Estimation

Operator      | RA                                   | Estimated Size
Table         | $R$                                  | $|R|$
Projection    | $\pi(Q)$                             | $|Q|$
Union         | $Q_1 \uplus Q_2$                     | $|Q_1| + |Q_2|$
Cross Product | $Q_1 \times Q_2$                     | $|Q_1| \times |Q_2|$
Sort          | $\tau(Q)$                            | $|Q|$
Limit         | $\texttt{LIMIT}_N(Q)$                | $N$
Selection     | $\sigma_c(Q)$                        | $|Q| \times \texttt{SEL}(c, Q)$
Join          | $Q_1 \bowtie_c Q_2$                  | $|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1 \times Q_2)$
Distinct      | $\delta_A(Q)$                        | $\texttt{UNIQ}(A, Q)$
Aggregate     | $\gamma_{A, B \leftarrow \Sigma}(Q)$ | $\texttt{UNIQ}(A, Q)$
  • $\texttt{SEL}(c, Q)$: Selectivity of $c$ on $Q$, or $\frac{|\sigma_c(Q)|}{|Q|}$
  • $\texttt{UNIQ}(A, Q)$: # of distinct values of $A$ in $Q$.
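
A minimal sketch of how these formulas might turn into code, assuming child cardinalities and selectivities have already been estimated (class and method names are illustrative, not part of any provided skeleton):

  // Toy cardinality estimator mirroring the table above.
  public class CardinalityEstimator {
      // |Q1 ⊎ Q2| = |Q1| + |Q2|   (bag union)
      static double union(double q1, double q2) { return q1 + q2; }

      // |Q1 × Q2| = |Q1| * |Q2|
      static double crossProduct(double q1, double q2) { return q1 * q2; }

      // |σ_c(Q)| = |Q| * SEL(c, Q)
      static double selection(double q, double selectivity) { return q * selectivity; }

      // |Q1 ⋈_c Q2| = |Q1| * |Q2| * SEL(c, Q1 × Q2)
      static double join(double q1, double q2, double selectivity) {
          return q1 * q2 * selectivity;
      }

      // |LIMIT_N(Q)| = N (capped at |Q| as a small safety refinement)
      static double limit(double q, long n) { return Math.min(q, n); }

      // |δ_A(Q)| = |γ_A(Q)| = UNIQ(A, Q), taken from statistics
      static double distinct(double uniqA) { return uniqA; }

      public static void main(String[] args) {
          // Example: σ_c(R ⋈ S) with |R| = 1000, |S| = 500, join SEL = 0.01,
          // and a 10% fallback selectivity for the condition c.
          double joined = join(1000, 500, 0.01);
          System.out.println(selection(joined, 0.1));   // prints 500.0
      }
  }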

Cardinality Estimation

(The Hard Parts)

$\sigma_c(Q)$ (Selectivity Estimation)
How many tuples will a condition $c$ allow to pass?
$\delta_A(Q)$ (Distinct Values Estimation)
How many distinct values of attribute(s) $A$ exist?

Many major databases (Oracle, Postgres, Teradata, etc.) use something like a 10% rule if they have nothing better.

(The specific % varies by DBMS. E.g., Teradata uses 10% for the first AND clause, and 75% for every subsequent clause)

Uniform Prior

We assume that for $\sigma_c(Q)$ or $\delta_A(Q)$...

  1. Basic statistics are known about $Q$:
    • COUNT(*)
    • COUNT(DISTINCT A) (for each A)
    • MIN(A), MAX(A) (for each numeric A)
  2. Attribute values are uniformly distributed.
  3. No inter-attribute correlations.

If necessary statistics aren't available (point 1), fall back to the 10% rule.

If the statistical assumptions (points 2, 3) aren't perfectly true, we'll still likely get a better estimate than the 10% rule.
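
A minimal sketch of gathering these per-column statistics in one pass, assuming tuples arrive as strings (a simplification, not the project's actual tuple representation):

  import java.util.HashSet;
  import java.util.Set;

  // One-pass statistics for a single column (illustrative sketch only).
  class ColumnStats {
      long rowCount = 0;                             // COUNT(*)
      final Set<String> values = new HashSet<>();    // backs COUNT(DISTINCT A)
      double min = Double.POSITIVE_INFINITY;         // MIN(A), numeric columns only
      double max = Double.NEGATIVE_INFINITY;         // MAX(A), numeric columns only

      void add(String value) {
          rowCount++;
          values.add(value);
          try {                                      // skip MIN/MAX for non-numeric values
              double v = Double.parseDouble(value);
              min = Math.min(min, v);
              max = Math.max(max, v);
          } catch (NumberFormatException ignored) { }
      }

      long countDistinct() { return values.size(); }
  }

Keeping every value in a HashSet is simple but memory-hungry; a sampled or sketch-based distinct counter could stand in for it if memory becomes a concern.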

COUNT(DISTINCT A)

$\texttt{UNIQ}(A, \pi_{A, \ldots}(R)) = \texttt{UNIQ}(A, R)$

$\texttt{UNIQ}(A, \sigma(R)) \approx \texttt{UNIQ}(A, R)$

$\texttt{UNIQ}(A, R \times S) = \texttt{UNIQ}(A, R)$ or $\texttt{UNIQ}(A, S)$

$$\max(\texttt{UNIQ}(A, R), \texttt{UNIQ}(A, S)) \leq \texttt{UNIQ}(A, R \uplus S) \leq \texttt{UNIQ}(A, R) + \texttt{UNIQ}(A, S)$$
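
For example (with made-up counts): if $\texttt{UNIQ}(A, R) = 100$ and $\texttt{UNIQ}(A, S) = 150$, the bounds give $150 \leq \texttt{UNIQ}(A, R \uplus S) \leq 250$, and either endpoint is a defensible estimate under the uniform prior.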

MIN(A), MAX(A)

$min_A(\pi_{A, \ldots}(R)) = min_A(R)$

$min_A(\sigma_{A, \ldots}(R)) \approx min_A(R)$

$min_A(R \times S) = min_A(R)$ or $min_A(S)$

$min_A(R \uplus S) = min(min_A(R), min_A(S))$

Estimating $\delta_A(Q)$ requires only COUNT(DISTINCT A)

Estimating Selectivity

Selectivity is a probability ($\texttt{SEL}(c, Q) = P(c)$)

$P(A = x_1) = \frac{1}{\texttt{COUNT(DISTINCT A)}}$
$P(A \in (x_1, x_2, \ldots, x_N)) = \frac{N}{\texttt{COUNT(DISTINCT A)}}$
$P(A \leq x_1) = \frac{x_1 - \texttt{MIN(A)}}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
$P(x_1 \leq A \leq x_2) = \frac{x_2 - x_1}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
$P(A = B) = \min\left( \frac{1}{\texttt{COUNT(DISTINCT A)}}, \frac{1}{\texttt{COUNT(DISTINCT B)}} \right)$
$P(c_1 \wedge c_2) = P(c_1) \cdot P(c_2)$
$P(c_1 \vee c_2) = 1 - (1 - P(c_1)) \cdot (1 - P(c_2))$

(With constants $x_1$, $x_2$, ...)
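
A minimal sketch of these estimates in code, falling back to the 10% rule when the needed statistics are missing (class and method names are made up for illustration):

  // Uniform-prior selectivity estimates (illustrative only).
  class Selectivity {
      static final double FALLBACK = 0.1;                 // the "10% rule"

      // P(A = x) = 1 / COUNT(DISTINCT A)
      static double equalsConstant(long countDistinctA) {
          return countDistinctA > 0 ? 1.0 / countDistinctA : FALLBACK;
      }

      // P(x1 <= A <= x2) = (x2 - x1) / (MAX(A) - MIN(A)), clamped to [0, 1]
      static double range(double x1, double x2, double minA, double maxA) {
          double width = maxA - minA;
          if (width <= 0) return FALLBACK;                 // no usable statistics
          return Math.min(1.0, Math.max(0.0, (x2 - x1) / width));
      }

      // P(A = B) = min(1 / COUNT(DISTINCT A), 1 / COUNT(DISTINCT B))
      static double equiJoin(long countDistinctA, long countDistinctB) {
          return Math.min(equalsConstant(countDistinctA), equalsConstant(countDistinctB));
      }

      // P(c1 AND c2) = P(c1) * P(c2)
      static double and(double p1, double p2) { return p1 * p2; }

      // P(c1 OR c2) = 1 - (1 - P(c1)) * (1 - P(c2))
      static double or(double p1, double p2) { return 1 - (1 - p1) * (1 - p2); }
  }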

Simplified Plan Exploration

  • Left-Deep Plans Only
    Avoid searching the full space of join orders (a greedy ordering sketch follows below).
  • Greedy index selection
    Separate join ordering and index selection.
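
A minimal sketch of one way to pick a left-deep order greedily: repeatedly append the relation whose join with the current prefix has the smallest estimated size. The `estimate` callback is an assumption standing in for whatever cardinality estimator is available; the greedy rule is one common heuristic, not the only one.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.function.BiFunction;

  // Greedy left-deep join ordering (illustrative heuristic only).
  class GreedyJoinOrder {
      // `estimate` maps (current left-deep prefix, candidate relation) to an estimated size.
      static <R> List<R> order(List<R> relations, BiFunction<List<R>, R, Double> estimate) {
          List<R> remaining = new ArrayList<>(relations);
          List<R> plan = new ArrayList<>();
          while (!remaining.isEmpty()) {
              R best = null;
              double bestCost = Double.POSITIVE_INFINITY;
              for (R candidate : remaining) {
                  double cost = estimate.apply(plan, candidate);
                  if (cost < bestCost) { bestCost = cost; best = candidate; }
              }
              plan.add(best);
              remaining.remove(best);
          }
          return plan;   // join left-to-right in this order
      }
  }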

Suggestion: Start Simple

  • In-Mem/On-Disk Index Builder
  • Optimizer += Index Selection (e.g., selectivity)
  • Index Scan
  • Index Nested Loop Join
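
For the last two items, a minimal sketch of an index nested loop join over an in-memory index on an equality predicate, assuming tuples are `String[]` and the inner relation's index maps a key to its matching tuples (a simplification, not the project's actual data structures):

  import java.util.ArrayList;
  import java.util.List;
  import java.util.NavigableMap;

  // Index nested loop join on an equality predicate (illustrative only).
  class IndexNestedLoopJoin {
      // For each outer tuple, probe the inner index instead of scanning the inner table.
      // A TreeMap-backed NavigableMap would also support range probes.
      static List<String[]> join(List<String[]> outer, int outerKeyCol,
                                 NavigableMap<String, List<String[]>> innerIndex) {
          List<String[]> results = new ArrayList<>();
          for (String[] left : outer) {
              List<String[]> matches = innerIndex.get(left[outerKeyCol]);  // index probe
              if (matches == null) continue;
              for (String[] right : matches) {
                  String[] joined = new String[left.length + right.length];
                  System.arraycopy(left, 0, joined, 0, left.length);
                  System.arraycopy(right, 0, joined, left.length, right.length);
                  results.add(joined);
              }
          }
          return results;
      }
  }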