CSE-4/562 Spring 2019

Probabilistic Databases

May 6, 2019

Textbook: PDB Concepts and C-Tables

(One Form of) Incomplete Databases

Define each choice as a variable
Tag each row with a boolean formula over variables
Each possible world is one assignment of values to variables
The possible world has all rows tagged with formulas that evaluate to "true"

Certain Tuple: A tuple that appears in all possible worlds
Possible Tuple: A tuple that appears in at least one possible world

Limitation: Can't distinguish between possible-but unlikely and possible-but very likely.

Idea: Make variables probabilistic

Example

$$\texttt{bob} = \begin{cases} 4 & p = 0.8 \\ 9 & p = 0.2\end{cases}$$

$$\texttt{carol} = \begin{cases} 3 & p = 0.4 \\ 8 & p = 0.6\end{cases}$$

$\mathcal R$	Name	ZipCode
1	Alice	10003	→ always
2	Bob	14260	→ if $\texttt{bob} = 4$
3	Bob	19260	→ if $\texttt{bob} = 9$
4	Carol	13201	→ if $\texttt{carol} = 3$
5	Carol	18201	→ if $\texttt{carol} = 8$


                SELECT COUNT(*) 
                FROM R NATURAL JOIN ZipCodeLookup 
                WHERE State = 'NY'

$$Q(\mathcal D) = \begin{cases} 1 & \textbf{if } \texttt{bob} = 9 \wedge \texttt{carol} = 8\\ 2 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 8 \\&\; \vee\; \texttt{bob} = 9 \wedge \texttt{carol} = 3\\ 3 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 3 \end{cases}$$

$$ = \begin{cases} 1 & p = 0.2 \times 0.6\\ 2 & p = 0.8 \times 0.6 + 0.2 \times 0.4\\ 3 & p = 0.8 \times 0.4 \end{cases}$$

$$ = \begin{cases} 1 & p = 0.12\\ 2 & p = 0.56\\ 3 & p = 0.32\end{cases}$$

$$Q(\mathcal D) = \begin{cases} 1 & p = 0.12\\ 2 & p = 0.56\\ 3 & p = 0.32\end{cases}$$

$E\left[Q(\mathcal D)\right] = 0.12+1.12+0.96 = 2.20$

$P\left[Q(\mathcal D) \geq 2\right] = 0.56+0.32 = 0.88$

In general, computing marginal probabilities for result tuples exactly is #P

... so we approximate

Idea 1: Sample. Pick (e.g.) 10 random possible worlds and compute results for each.

$$R_{1} \Leftarrow \{\; \texttt{bob} \rightarrow 4, \; \texttt{carol} \rightarrow 3\}$$

$\mathcal R_{1}$	Name	ZipCode
1	Alice	10003
2	Bob	14260
3	Carol	13201

$$\mathcal Q = \{\;3\;\}$$

$$R_{2} \Leftarrow \{\; \texttt{bob} \rightarrow 9, \; \texttt{carol} \rightarrow 8\}$$

$\mathcal R_{2}$	Name	ZipCode
1	Alice	10003
2	Bob	19260
3	Carol	18201

$$\mathcal Q = \{\;3,\;1\;\}$$

$$R_{3} \Leftarrow \{\; \texttt{bob} \rightarrow 4, \; \texttt{carol} \rightarrow 8\}$$

$\mathcal R_{3}$	Name	ZipCode
1	Alice	10003
2	Bob	14260
3	Carol	18201

$$\mathcal Q = \{\;3,\;1,\;2\;\}$$

$$R_{4} \Leftarrow \{\; \texttt{bob} \rightarrow 4, \; \texttt{carol} \rightarrow 8\}$$

$\mathcal R_{4}$	Name	ZipCode
1	Alice	10003
2	Bob	14260
3	Carol	18201

$$\mathcal Q = \{\;3,\;1,\;2,\;2\;\}$$

$$R_{5} \Leftarrow \{\; \texttt{bob} \rightarrow 9, \; \texttt{carol} \rightarrow 8\}$$

$\mathcal R_{5}$	Name	ZipCode
1	Alice	10003
2	Bob	19260
3	Carol	18201

$$\mathcal Q = \{\;3,\;1,\;2,\;2,\;1\;\}$$

Problem: Sloooooooooooow.

Can we make it faster?

Idea 1.A: Combine all samples into one query.

$\mathcal R$	Name	ZipCode	$\mathcal{ID}$
1	Alice	10003	1
2	Bob	14260	1
3	Carol	13201	1
4	Alice	10003	2
5	Bob	19260	2
6	Carol	18201	2
7	Alice	10003	3
8	Bob	14260	3
9	Carol	18201	3
10	Alice	10003	4
11	Bob	14260	4
12	Carol	18201	4
13	Alice	10003	5
14	Bob	19260	5
15	Carol	18201	5

$\mathcal Q$	Count	$\mathcal{ID}$
1	3	1
2	1	2
3	2	3
4	2	4
5	1	5

Querying Joint Sample Tables

$\pi_A(R) \rightarrow $ $\pi_{A, \mathcal{ID}}(R)$

$\sigma_\phi(R) \rightarrow $ $\sigma_{\phi}(R)$

$R \uplus S \rightarrow $ $R \uplus S$

$R \times S \rightarrow $ $\pi_{R.*, S.*, R.\mathcal{ID}}\big($$\sigma_{R.\mathcal{ID} = S.\mathcal{ID}}( $$ R \times S)\big)$

$\delta R \rightarrow $ $\delta R$

$_A\gamma_{Agg(*)}(R) \rightarrow $ $_{A, \mathcal{ID}}\gamma_{Agg(*)}(R)$

Still sloooooow.

There's a lot of repetition.

Idea 2.B Use native array-types in DBs

Tuple Bundles

$\mathcal R$	Name	ZipCode
1	Alice	10003
2	Bob	[14260, 19260, 14260, 14260, 19260]
3	Carol	[13201, 18201, 18201, 18201, 18201]

MCDB: a monte carlo approach to managing uncertain data (Jampani et. al.)

Querying Tuple Bundles

$\pi_A(R) \rightarrow $ $\pi_{A}(R)$

$\sigma_\phi(R) \rightarrow $ ?

Idea 1.B' Also mark which tuples are present in which samples

$\mathcal R$	Name	ZipCode	$\mathcal W$
1	Alice	10003	11111
2	Bob	[14260, 19260, 14260, 14260, 19260]	11111
3	Carol	[13201, 18201, 18201, 18201, 18201]	11111

↓ $\sigma_{InNYS(ZipCode)}(\mathcal R)$ ↓

$\mathcal R$	Name	ZipCode	$\mathcal W$
1	Alice	10003	11111
2	Bob	[14260, 19260, 14260, 14260, 19260]	10110
3	Carol	[13201, 18201, 18201, 18201, 18201]	10000

Querying Tuple Bundles

$\pi_A(R) \rightarrow $ $\pi_{A}(R)$

$\sigma_\phi(R) \rightarrow $ $\sigma_{\mathcal W = 0}($$\pi_{\mathcal W \;\&\; \vec \phi}(R))$

$R \uplus S \rightarrow $ $R \uplus S$

$R \times S \rightarrow $ $\sigma_{\mathcal{W} = 0}\big($$\pi_{R.*, S.*, R.\mathcal{W} \;\&\; S.\mathcal{W}}( $$ R \times S)\big)$

$_A\gamma_{Agg(B)}(R) \rightarrow $ $_A\gamma_{[ Agg\big(\textbf{if}(W[1])\{R.B[1]\}\big), Agg\big(\textbf{if}(W[2])\{R.B[2]\}\big), \ldots ]}(R)$

Querying Joint Sample Tables

$\pi_A(R) \rightarrow \pi_{A}(R)$

$\sigma_\phi(R) \rightarrow \sigma_{\mathcal W = 0}(\pi_{\mathcal W \;\&\; \vec \phi}(R))$

$R \uplus S \rightarrow R \uplus S$

$R \times S \rightarrow \sigma_{\mathcal{W} = 0}\big(\pi_{R.*, S.*, R.\mathcal{W} \;\&\; S.\mathcal{W}}( R \times S)\big)$

$_A\gamma_{Agg(B)}(R) \rightarrow $ $_A\gamma_{[ Agg\big(\textbf{if}(W[1])\{R.B[1]\}\big), Agg\big(\textbf{if}(W[2])\{R.B[2]\}\big), \ldots ]}(R)$

(Generate aggregates for each sample separately)

Good luck ever doing an equi-join.

Hope your group-by variables aren't uncertain.

Inefficient equi-joins on uncertain variables.

Inefficient aggregates with uncertain variables.

How many samples necessary to get desired precision?

Idea 2: Symbolic Execution (Provenance)

$\sigma_{count \geq 2}(Q) =$

$\texttt{bob} = 4 \wedge \texttt{carol} = 8 $
$\vee\; \texttt{bob} = 9 \wedge \texttt{carol} = 3 $
$\vee\; \texttt{bob} = 4 \wedge \texttt{carol} = 3$

$P[\sigma_{count \geq 2}(Q)] = ?$ $\approx$ #SAT

Computing Probabilities

$P[\texttt{x} \wedge \texttt{y}] = P[\texttt{x}] \cdot P[\texttt{y}]$
(iff $\texttt{x}$ and $\texttt{y}$ are independent)

$P[\texttt{x} \wedge \texttt{y}] = 0$
(iff $\texttt{x}$ and $\texttt{y}$ are mutually exclusive)

$P[\texttt{x} \vee \texttt{y}] = 1- (1-P[\texttt{x}]) \cdot (1-P[\texttt{y}])$
(iff $\texttt{x}$ and $\texttt{y}$ are independent)

$P[\texttt{x} \vee \texttt{y}] = P[\texttt{x}] + P[\texttt{y}]$
(iff $\texttt{x}$ and $\texttt{y}$ are mutually exclusive)

Good enough to get us the probability of any boolean formula over mutually exclusive or independent variables

... and otherwise?

Shannon Expansion

For a boolean formula $f$ and variable $\texttt{x}$:

$$f = (\texttt{x} \wedge f[\texttt{x}\backslash T]) \vee (\neg \texttt{x} \wedge f[\texttt{x}\backslash F])$$

Disjunction of mutually-exclusive terms!

... each a conjunction of independent terms.

... and $\texttt{x}$ removed from $f$

Ok... just keep applying Shannon!

Each application creates 2 new formulas (ExpTime!)

Idea 2.A: Combine the two. Use Shanon expansion as long as time/resources permit, then use a #SAT approximation.

Sprout: Lazy vs. eager query plans for tuple-independent probabilistic databases (Olteanu et. al.)

More Resources

MCDB: Sampling-based probabilistic databases
Sprout: "Any-time" Approximation.
Mimir: PL tricks to make ProbDBs faster
DeepDive: ProbDBs used in practice to populate Knowledge Bases.
Integrating and Ranking Uncertain Scientific Data: ProbDBs used in practice to predict gene expressions / propose experiments.