CSE-4/562 Spring 2019 - Probabilistic Databases

Probabilistic Databases

CSE-4/562 Spring 2019

May 6, 2019

Textbook: PDB Concepts and C-Tables


(One Form of) Incomplete Databases

  • Define each choice as a variable
  • Tag each row with a boolean formula over variables
  • Each possible world is one assignment of values to variables
  • The possible world has all rows tagged with formulas that evaluate to "true"
Certain Tuple
A tuple that appears in all possible worlds
Possible Tuple
A tuple that appears in at least one possible world

Limitation: Can't distinguish between possible-but unlikely and possible-but very likely.

Idea: Make variables probabilistic

Example

$$\texttt{bob} = \begin{cases} 4 & p = 0.8 \\ 9 & p = 0.2\end{cases}$$

$$\texttt{carol} = \begin{cases} 3 & p = 0.4 \\ 8 & p = 0.6\end{cases}$$

$\mathcal R$NameZipCode
1Alice10003→ always
2Bob14260→ if $\texttt{bob} = 4$
3Bob19260→ if $\texttt{bob} = 9$
4Carol13201→ if $\texttt{carol} = 3$
5Carol18201→ if $\texttt{carol} = 8$

                SELECT COUNT(*) 
                FROM R NATURAL JOIN ZipCodeLookup 
                WHERE State = 'NY'
    

$$Q(\mathcal D) = \begin{cases} 1 & \textbf{if } \texttt{bob} = 9 \wedge \texttt{carol} = 8\\ 2 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 8 \\&\; \vee\; \texttt{bob} = 9 \wedge \texttt{carol} = 3\\ 3 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 3 \end{cases}$$

$$ = \begin{cases} 1 & p = 0.2 \times 0.6\\ 2 & p = 0.8 \times 0.6 + 0.2 \times 0.4\\ 3 & p = 0.8 \times 0.4 \end{cases}$$

$$ = \begin{cases} 1 & p = 0.12\\ 2 & p = 0.56\\ 3 & p = 0.32\end{cases}$$

$$Q(\mathcal D) = \begin{cases} 1 & p = 0.12\\ 2 & p = 0.56\\ 3 & p = 0.32\end{cases}$$

$E\left[Q(\mathcal D)\right] = 0.12+1.12+0.96 = 2.20$

$P\left[Q(\mathcal D) \geq 2\right] = 0.56+0.32 = 0.88$

In general, computing marginal probabilities for result tuples exactly is #P

... so we approximate

Idea 1: Sample. Pick (e.g.) 10 random possible worlds and compute results for each.

$$R_{1} \Leftarrow \{\; \texttt{bob} \rightarrow 4, \; \texttt{carol} \rightarrow 8\}$$

$\mathcal R_{1}$NameZipCode
1Alice10003
2Bob14260
3Carol18201

$$\mathcal Q = \{\;2\;\}$$

 

 

$$R_{2} \Leftarrow \{\; \texttt{bob} \rightarrow 9, \; \texttt{carol} \rightarrow 8\}$$

$\mathcal R_{2}$NameZipCode
1Alice10003
2Bob19260
3Carol18201

$$\mathcal Q = \{\;2,\;1\;\}$$

 

 

$$R_{3} \Leftarrow \{\; \texttt{bob} \rightarrow 4, \; \texttt{carol} \rightarrow 8\}$$

$\mathcal R_{3}$NameZipCode
1Alice10003
2Bob14260
3Carol18201

$$\mathcal Q = \{\;2,\;1,\;2\;\}$$

 

 

$$R_{4} \Leftarrow \{\; \texttt{bob} \rightarrow 4, \; \texttt{carol} \rightarrow 8\}$$

$\mathcal R_{4}$NameZipCode
1Alice10003
2Bob14260
3Carol18201

$$\mathcal Q = \{\;2,\;1,\;2,\;2\;\}$$

 

 

$$R_{5} \Leftarrow \{\; \texttt{bob} \rightarrow 4, \; \texttt{carol} \rightarrow 3\}$$

$\mathcal R_{5}$NameZipCode
1Alice10003
2Bob14260
3Carol13201

$$\mathcal Q = \{\;2,\;1,\;2,\;2,\;3\;\}$$

 

 

Problem: Sloooooooooooow.

Can we make it faster?

Idea 1.A: Combine all samples into one query.

$\mathcal R$NameZipCode$\mathcal{ID}$
1Alice100031
2Bob142601
3Carol182011
4Alice100032
5Bob192602
6Carol182012
7Alice100033
8Bob142603
9Carol182013
10Alice100034
11Bob142604
12Carol182014
13Alice100035
14Bob142605
15Carol132015
$\mathcal Q$Count$\mathcal{ID}$
121
212
323
424
535

Querying Joint Sample Tables

$\pi_A(R) \rightarrow $ $\pi_{A, \mathcal{ID}}(R)$

$\sigma_\phi(R) \rightarrow $ $\sigma_{\phi}(R)$

$R \uplus S \rightarrow $ $R \uplus S$

$R \times S \rightarrow $ $\pi_{R.*, S.*, R.\mathcal{ID}}\big($$\sigma_{R.\mathcal{ID} = S.\mathcal{ID}}( $$ R \times S)\big)$

$\delta R \rightarrow $ $\delta R$

$_A\gamma_{Agg(*)}(R) \rightarrow $ $_{A, \mathcal{ID}}\gamma_{Agg(*)}(R)$

Still sloooooow.

There's a lot of repetition.

Idea 2.B Use native array-types in DBs

Tuple Bundles

$\mathcal R$NameZipCode
1Alice10003
2Bob[14260, 19260, 14260, 14260, 14260]
3Carol[18201, 18201, 18201, 18201, 13201]
MCDB: a monte carlo approach to managing uncertain data (Jampani et. al.)

Querying Tuple Bundles

$\pi_A(R) \rightarrow $ $\pi_{A}(R)$

$\sigma_\phi(R) \rightarrow $ ?

 

 

 

 

Idea 1.B' Also mark which tuples are present in which samples

$\mathcal R$NameZipCode$\mathcal W$
1Alice1000311111
2Bob[14260, 19260, 14260, 14260, 14260]11111
3Carol[18201, 18201, 18201, 18201, 13201]11111
↓ $\sigma_{InNYS(ZipCode)}(\mathcal R)$ ↓
$\mathcal R$NameZipCode$\mathcal W$
1Alice1000311111
2Bob[14260, 19260, 14260, 14260, 14260]10111
3Carol[18201, 18201, 18201, 18201, 13201]00001

Querying Tuple Bundles

$\pi_A(R) \rightarrow $ $\pi_{A}(R)$

$\sigma_\phi(R) \rightarrow $ $\sigma_{\mathcal W = 0}($$\pi_{\mathcal W \;\&\; \vec \phi}(R))$

$R \uplus S \rightarrow $ $R \uplus S$

$R \times S \rightarrow $ $\sigma_{\mathcal{W} = 0}\big($$\pi_{R.*, S.*, R.\mathcal{W} \;\&\; S.\mathcal{W}}( $$ R \times S)\big)$

$_A\gamma_{Agg(B)}(R) \rightarrow $ $_A\gamma_{[ Agg\big(\textbf{if}(W[1])\{R.B[1]\}\big), Agg\big(\textbf{if}(W[2])\{R.B[2]\}\big), \ldots ]}(R)$

Querying Joint Sample Tables

$\pi_A(R) \rightarrow \pi_{A}(R)$

$\sigma_\phi(R) \rightarrow \sigma_{\mathcal W = 0}(\pi_{\mathcal W \;\&\; \vec \phi}(R))$

$R \uplus S \rightarrow R \uplus S$

$R \times S \rightarrow \sigma_{\mathcal{W} = 0}\big(\pi_{R.*, S.*, R.\mathcal{W} \;\&\; S.\mathcal{W}}( R \times S)\big)$

$_A\gamma_{Agg(B)}(R) \rightarrow $ $_A\gamma_{[ Agg\big(\textbf{if}(W[1])\{R.B[1]\}\big), Agg\big(\textbf{if}(W[2])\{R.B[2]\}\big), \ldots ]}(R)$

(Generate aggregates for each sample separately)

Good luck ever doing an equi-join.

Hope your group-by variables aren't uncertain.

Inefficient equi-joins on uncertain variables.

Inefficient aggregates with uncertain variables.

How many samples necessary to get desired precision?

Idea 2: Symbolic Execution (Provenance)

$\sigma_{count \geq 2}(Q) =$

$\texttt{bob} = 4 \wedge \texttt{carol} = 8 $
$\vee\; \texttt{bob} = 9 \wedge \texttt{carol} = 3 $
$\vee\; \texttt{bob} = 4 \wedge \texttt{carol} = 3$

$P[\sigma_{count \geq 2}(Q)] = ?$ $\approx$ #SAT

Computing Probabilities

$P[\texttt{x} \wedge \texttt{y}] = P[\texttt{x}] \cdot P[\texttt{y}]$
(iff $\texttt{x}$ and $\texttt{y}$ are independent)

$P[\texttt{x} \wedge \texttt{y}] = 0$
(iff $\texttt{x}$ and $\texttt{y}$ are mutually exclusive)

$P[\texttt{x} \vee \texttt{y}] = 1- (1-P[\texttt{x}]) \cdot (1-P[\texttt{y}])$
(iff $\texttt{x}$ and $\texttt{y}$ are independent)

$P[\texttt{x} \vee \texttt{y}] = P[\texttt{x}] + P[\texttt{y}]$
(iff $\texttt{x}$ and $\texttt{y}$ are mutually exclusive)

Good enough to get us the probability of any boolean formula over mutually exclusive or independent variables

... and otherwise?

Shannon Expansion

For a boolean formula $f$ and variable $\texttt{x}$:

$$f = (\texttt{x} \wedge f[\texttt{x}\backslash T]) \vee (\neg \texttt{x} \wedge f[\texttt{x}\backslash F])$$

Disjunction of mutually-exclusive terms!

... each a conjunction of independent terms.

... and $\texttt{x}$ removed from $f$

Ok... just keep applying Shannon!

Each application creates 2 new formulas (ExpTime!)

Idea 2.A: Combine the two. Use Shanon expansion as long as time/resources permit, then use a #SAT approximation.

Sprout: Lazy vs. eager query plans for tuple-independent probabilistic databases (Olteanu et. al.)

More Resources

MCDB
Sampling-based probabilistic databases
Sprout
"Any-time" Approximation.
Mimir
PL tricks to make ProbDBs faster
DeepDive
ProbDBs used in practice to populate Knowledge Bases.
Integrating and Ranking Uncertain Scientific Data
ProbDBs used in practice to predict gene expressions / propose experiments.