Data Engineering meets the Semantic Web (DESWeb)
In conjunction with ICDE 2016, Helsinki, Finland
— Blocking and Filtering Techniques for Link Discovery —
Link Discovery constitutes a core task for realizing the vision of Linked Data, as its goal is to connect the large volumes of RDF data that are still isolated in secluded silos. To this end, it essentially compares every resource description with all possible matches, a process of quadratic complexity. To scale to large datasets with millions of resources, Link Discovery typically relies on blocking and filtering techniques that skip a large portion of the unnecessary comparisons between unlikely matches. In this talk, we will introduce a taxonomy of the main blocking and filtering techniques that have been proposed in the literature. Based on this taxonomy, we will examine the methods employed by the main Link Discovery frameworks, as well as the stand-alone methods that are inherently crafted for Semantic Web data. We will also discuss their application to the domain of relational data, comparing them with methods that were originally developed for databases. We will conclude with directions for future research.
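To make the scalability argument concrete, the following is a minimal sketch of token blocking, one standard blocking technique: resources that share at least one token in their textual description land in the same block, and only pairs within a block become comparison candidates. The resource descriptions below are purely illustrative and are not taken from the talk.

```python
from collections import defaultdict
from itertools import combinations

# Toy resource descriptions (illustrative assumption, not real RDF data).
resources = {
    "r1": "barack obama president",
    "r2": "president obama",
    "r3": "eiffel tower paris",
    "r4": "tower of paris",
}

# Build blocks: one block per token, holding every resource containing it.
blocks = defaultdict(set)
for rid, description in resources.items():
    for token in description.split():
        blocks[token].add(rid)

# Collect candidate pairs only from within each block, deduplicating
# pairs that co-occur in several blocks.
candidates = set()
for block in blocks.values():
    for a, b in combinations(sorted(block), 2):
        candidates.add((a, b))

print(sorted(candidates))
```

Here the brute-force approach would perform 6 comparisons for 4 resources, while token blocking produces only the 2 candidate pairs (r1, r2) and (r3, r4); filtering techniques would then prune candidates further before the expensive similarity computation.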
George Papadakis is a postdoctoral researcher at the University of Athens. He received his PhD in Computer Science from the Leibniz University of Hanover and his Diploma in Computer Engineering from the National Technical University of Athens (NTUA). He has worked at the "Athena" Research Center, NCSR "Demokritos", NTUA, and the L3S Research Center. His research interests include entity resolution, link discovery, and web data mining. He received the best paper award at ACM Hypertext 2011.
— Cleaning the Web and using the Web for data cleaning —
The saying goes that "you spend 80% of your time preparing and cleaning the data ...". Data cleaning is indeed a key step in any data analytics pipeline. Data on the Web presents an opportunity both to aid data cleaning and to be its subject. In this talk, I will overview some of our research on cleaning data and enhancing its usability. I will first describe Katara, a knowledge base and crowd powered data cleaning system that, given a table, a knowledge base, and a crowd, interprets table semantics to align the table with the knowledge base, identifies correct and incorrect data, and generates top-k possible repairs for incorrect data. Katara exploits knowledge bases, such as YAGO and DBpedia, and crowdsourcing to achieve better accuracy in data cleaning. In another line of work, I will describe how we use Web tables, Web forms, and knowledge bases to discover transformations that convert data from one representation to another. These transformations are crucial in data integration and data curation. Finally, I will go over cleaning Web event databases; these databases collect temporal information about entities from the Web, such as the travels of famous people and cyber attacks. To clean these databases, we discover declarative rules that take their time dimension into account. Discovering such rules requires overcoming data sparsity, reporting delays, and errors due to inaccurate Web sources and Web extractors. We use machine learning methods, such as association measures and outlier detection, to discover the rules, together with an aggressive repair of the data in the mining step itself.
Mourad Ouzzani is a Principal Scientist with the Qatar Computing Research Institute. Before that, he was a Research Associate Professor at Purdue University. Mourad conducts research in data integration, data quality, spatio-temporal data management, and database systems for scientific data. He has published extensively in top database venues, including SIGMOD, VLDB, and ICDE. He recently received the best demo award at SIGMOD 2015.