mips.tex

	
% PLEASE USE UTF-8 ENCODING WHEN EDITING THIS FILE!

% WWW 2012 page limit: 10 pages

%\documentclass{llncs}
% \documentclass{acm_proc_article-sp}
\documentclass{sig-alternate}  % tighter style --> less pages (use the other one if we have enough space)
%\documentclass{www2012-accepted} 
%\usepackage{amsmath, amssymb}
%\usepackage{graphicx}
\usepackage{hyperref}
%\usepackage{listings}
\usepackage[utf8]{inputenc}
\usepackage{moreverb}

\newcommand{\todo}[1]{\textbf{ToDo: \textit{#1}}}
%\newcommand{\comm}[1]{\textbf{Comment: \textit{#1}}}\stackrel{\leftrightarrow}{}

\usepackage{enumitem}
\usepackage[usenames,dvipsnames]{color}
\newcommand{\sparqlwo}[2]{{\texttt{#1 \char '173} \\[1.5pt] \hspace*{.5cm}\parbox{6cm}{\tt #2} \\ \texttt{\char '175} }}
\newcommand{\sparql}[3]{{\texttt{#1 \char '173} \\[1.5pt] \hspace*{.5cm}\parbox{6cm}{\tt #2} \\ \texttt{\char '175} \\ \texttt{#3} }}
\newcommand{\slot}[3]{$\langle\texttt{#1},\text{#2},\text{\sf #3}\rangle$}
\newcommand{\argmax}{\operatornamewithlimits{arg\,max}}
\newcommand{\argmin}{\operatornamewithlimits{arg\,min}}

\usepackage{qtree}
\usepackage[usenames,dvipsnames,table]{xcolor}


\newcommand{\Drs}[2]{%%%%%%%%%%%%%%%%%%%%%%%%%
\(                   % begin maths mode
 \begin{array}{|l|}  %
 \hline              % top line
   \begin{array}{l}  %
    #1                % `Universe'
    \end{array} \\   %  end the `universe' part
 \hline              % line between Universe and Conditions
    \begin{array}{l} %
    #2                % the conditions
    \end{array} \\   % end the conditions part
 \hline              % bottom line
\end{array}          %
\)                   % end maths mode
                     }%%%%%%%%%%%%%%%%%%%%%%%%%


\begin{document}

\clubpenalty=10000 
\widowpenalty = 10000

%\conferenceinfo{WWW}{2012 Lyon, France}

\title{Duplicate-Aware Federated Query Processing over the Data Web}

\numberofauthors{2} 
\author{
% 1st. author
\alignauthor
Muhammad Saleem\\
       \affaddr{DERI Galway}\\
       \affaddr{??}\\ 
       \email{??}
% 2nd. author
\alignauthor
Axel-Cyrille Ngonga Ngomo\\
       \affaddr{Universit\"at Leipzig, IFI/AKSW}\\
       \affaddr{PO 100920, D-04009 Leipzig}\\
       \email{ngonga@informatik.uni-leipzig.de}

}

%\date{30 July 1999}
\maketitle

%\title{SPARQL Template Based Question Answering}
% \titlerunning{SPARQL Template Based Question Answering}  
%\author{Christina Unger\inst{1} \and Lorenz B{\"u}hmann\inst{2} \and Jens Lehmann\inst{2} \and Philipp Cimiano\inst{1}}
%\authorrunning{...} 
%\institute{
%Bielefeld University, CITEC, Semantic Computing Group\\
%Universit\"atsstra{\ss}e 21--23, 33615 Bielefeld\\ 
%\email{cunger|cimiano@cit-ec.uni-bielefeld.de}
%\and 
%University of Leipzig, Computer Science Institute, AKSW Group,
%\\Johannisgasse 26, 04103 Leipzig\\
%\email{buehmann|lehmann@informatik.uni-leipzig.de}
%}
\begin{abstract}

\end{abstract}

% A category with the (minimum) three required fields
\category{H.2.4}{Database Management}{Systems}[Distributed databases]
%A category including the fourth, optional field follows...
% \category{D.2.8}{Software Engineering}{Metrics}[complexity measures, performance measures]

\terms{Algorithms, Experimentation, Theory}

\keywords{Federated queries, SPARQL, deduplication}

\section{Introduction (AN+MS)}
Over the last years, the Linked Data Web has developed into a large compendium of interlinked data sets from diverse domains. 
One of the central principles underlying the architecture of these data sets is the reuse of URIs and vocabularies as well as the linking of knowledge bases.
One of the results of this architectural choice is that certain queries can only be answered by retrieving information from several knowledge bases.
This type of querying, called \emph{federated querying}, is of central importance for manifold applications such as question answering, knowledge retrieval and data integration.
In addition to the information necessary to answering queries being distributed, certain pieces of information (i.e., triples) can be found in several knowledge bases. 
For example, the name of movie directors can be found both in DBpedia and LinkedMDB.
Similarly, the authors of papers can be found in both the ACM and DBLP libraries.
We call triples that can be found in several knowledge bases across the Web of Data \emph{duplicates}.

While the importance of federated queries over the Web of Data has been stressed in previous work, the impact of duplicates has not received much attention.
Yet (as we will show in the remainder of this paper), a duplicate-aware approach to query processing can lead to more time-efficient and effective algorithms for federated queries.
In this paper, we address this drawback by presenting a duplicate-aware approach for query processing based on min-wise independent permutation (MIPS) vectors.
Our approach is able to predict the amount of new information contained in a knowledge base with a higher accuracy than the state of the art, leading to a better performance with respect to source ranking for both whole queries and fragment of queries as well as better result set size approximation. 
In the rest of this paper, we aim to show experimentally that our approach outperforms the state-of-the-art by running it against ?? methods on ?? queries. 
We begin by giving a brief overview of the state of the art in federated query processing.
In addition, we argue for the use of MIPS for the approximation of result size sets.
Thereafter, we present our duplicate-aware federated query processing approach.
After an overview of the datasets used in this paper, we present experimental results which corroborate the superior efficiency of our approach.
We conclude the paper with a discussion of our findings and an overview of future work.

\section{Related Work (MS)}
\subsection{Federated SPARQL queries}
\subsection{Filters}
Bloom
MIPS
etc.
\section{Notation}
In this section, we present the core of the notation that will be used throughout this paper.
We denote data sources with $S$ and the total number of data sources with $n$. 
The set of all possible result sets is denoted $R$ while the set of all possible SPARQL queries is labeled with $Q$. 
A data source ranking function $rank: S \times Q \rightarrow \{1 \ldots n\}$ is a function that assigns a ranking to each data source given a particular query $q \in Q$.
Note that for any source ranking function $rank$, we assume that $\forall S, S' \forall q \in Q : S  \neq S' \Longleftrightarrow rank(S, q) \neq rank (S', q)$.
A result set estimation function $est: S \times Q \rightarrow \mathbb{N}$ aims at approximating the size of the result set that will be returned by a given query.
Note that this function plays a crucial role in the processing of federated queries as it is most commonly used to decide upon the ranking of data sources for a given query.
The aim of a federated query system such as the one described in this work is thus to optimize its estimation function $est$ so as to ensure a ranking of the source close to the optimal ranking. %do we really need this sentence?
\section{Approach (MS)}
\subsection{Overview}
\subsection{MIPS}
Vector Construction
Union, overlap
\subsection{Index construction}
Compression Ratio Setting
\subsection{Source Selection}
\subsection{Source Ranking}
Result set estimation       
\subsection{Subquery generation}
\section{Experiments and Results (AN)}
\subsection{Experimental Setup (AN)}
\subsubsection{Datasets (3 datasets)}
\subsubsection{Queries (6 types)}
\subsubsection{Metrics (MSE)}
\subsection{Results (AN)}
\subsubsection{Ranking Error}
\subsubsection{Result Set Estimation Error}
\subsubsection{Execution Time}
\subsubsection{Size of MIPS vectors (on largest dataset)}
\section{Discussion (AN)}


\bibliographystyle{plain}
\bibliography{literature}


\end{document}