Abstracts

Large-Scale Network Analysis with the Parallel Boost Graph Library

Dr. Doug Gregor, Indiana University

In recent years, our ability to collect network data has increased far beyond our capabilities to analyze this data. With this deluge of data, the simple, direct implementations of graph-theoretic algorithms no longer suffice to search through or analyze these massive data sets. Thus, we must turn to more advanced techniques, including graph compression, parallel computing, and distributed computing to process today's large-scale networks, such as web graphs and global social networks.

This talk will introduce the Parallel Boost Graph Library, an open-source library for the analysis of large-scale graph data on commodity clusters. The Parallel BGL contains a rich set of parallel graph algorithms and distributed graph data structures, which can divide very large graphs--billions of vertices and edges or more--across the many computers in a cluster. The Parallel BGL then coordinates the effort of these computers to solve large-scale problems in minutes that would otherwise be infeasible on a single computer. The Parallel BGL itself is designed as both a research platform, within which one can easily implemented multiple variations of an algorithm or data structure to evaluate their trade-offs in a real-world setting, and as an end-user library for solving large-scale graph problems in a variety of application domains. This talk will explore the design and use of the Parallel BGL, focusing on the application of the Parallel BGL to large-scale network analysis.


Designing Intrusion-tolerant Architectures for Distributed Services

Dr. Cristina Nita-Rotaru, Purdue University Department of Computer Science

During the last few years, there has been considerable progress in the design of Byzantine-resilient replication systems. The current state of the art performs well on small scale systems that are usually confined to local area networks, but are less practical in a wide-area setting. We present the first hierarchical Byzantine replication architecture tailored to systems that span multiple wide area sites, each consisting of several replicas. The architecture confines the effects of any malicious replica to its local site, reduces message complexity of wide area communication, and allows read-only queries to be performed locally within a site for the price of additional hardware. A prototype implementation is evaluated over several network topologies and is compared with a flat Byzantine fault-tolerant approach. The experimental results show considerable improvement over flat Byzantine replication algorithms, bringing the performance of Byzantine replication closer to existing benign fault-tolerant replication techniques over WANs. This is joint work with: Y. Amir, C. Danilov, J. Kirsch and J. Lane of Johns Hopkins; D. Dolev of Hebrew University of Jerusalem, and J. Olsen and D. Zage, Purdue University.


Fusion Approach to Opinionated Blog Detection

Professor Kiduk Yang, Indiana University

The talk will present a fusion approach to finding opinion about a given target in blog postings and describe WIDIT lab's participation in the TREC blog track task. We tackled the opinion blog retrieval task by breaking it down to two sequential subtasks: on-topic retrieval followed by opinion detection. Our opinion retrieval approach was to first apply traditional IR methods to retrieve on-topic blogs, and then boost the ranks of opinionated blogs using combined opinion scores generated by multiple opinion assessment methods. Our opinion module consists of Opinion Term Module, which identify opinions based on the frequency of opinion terms i.e., terms that only occur frequently in opinion blogs), Rare Term Module, which uses uncommon/rare terms (e.g., "sooo good") for opinion classification, IU Module, which uses IU (I and you) collocations, and Adjective-Verb Module, which uses computational linguistics' distribution similarity approach to learn the subjective language from training data.


Coupling Fragments of XPath Algebra with Structural Indices for XML

Professor Yuqing Melanie Wu, Indiana University

Coupling theoretical research with system design and implementation has worked very well for relational database systems. Since the debut of XML in late 90's, there have been numerous efforts in both theoretical and engineering research for this data model. However, the type of tight coupling among the two, as seen in the relational world, has yet to emerge.

In this talk, we will discuss a new methodology that is drawn from the classical study of the relational queries and the design of relational indices and query evaluation techniques. We extend the methodology to the context of XML. More specifically, we analyze the structural features of XML documents and propose the notions of label-path and label-path based partition of nodes and pairs in an XML document. We will introduce a family of algebras for sub-languages of XPath, and discover the tight coupling between the algebras and the partitions of data. We will show how this discovery can be used in reasoning about the existing indices and in designing novel indices for XML database systems.


Large-Scale Data Management for the Sciences

Professor Tanu Malik, Purdue University

Traditional enterprises and novel scientific applications are accumulating petabyte-scale datasets, which makes the need for large-scale data management more pressing than ever. Geographic distribution of the datasets accompanied by complex demands on data makes large-scale data management challenging. This is especially true for sciences that model complex physical and biological phenomena using data from multiple sources.

In this talk I will address two critical problems in scientific data management: combining large number of diverse data sources for execution of scientific queries and executing data-intensive scientific queries efficiently, in terms of both network and I/O, on these data sources. I will present SkyQuery--a system that federates data from several petabyte size, autonomous and heterogeneous astronomy databases scattered worldwide. Using SkyQuery, scientists can write declarative queries that compare and merge multiple astronomical datasets. For efficient query execution and scalability, I will present Bypass-Yield Caching--a novel caching framework for database systems that dramatically reduces the network bandwidth requirements of data-intensive federations such as SkyQuery making them good network citizens. Distributed applications such as the Bypass Yield Cache often rely on a priori knowledge of query cardinalities to make optimization decisions. In this context, I will present a black-box approach to selectivity estimation that is suitable for distributed applications.

The success of SkyQuery and its adoption by the National Virtual Observatory is an example of data management systems enabling scientific endeavors.


Reflection: a Powerful Tool for Database Programming

Professor Dirk Van Gucht, Indiana University

Reflection in programming languages is the ability, within the run of a program, to inspect the program in the current environment, and also generate and execute other programs. If we replace in this sentence the word "program" by "query" and the word "environment" by "database", we get that reflection in query languages is the ability, within the run of a query, to inspect the query in the current database, and also generate and execute other queries. When the database contains data, meta-data, schema and ontological information, queries, constraints, workflows, wrappers, mappings, etc., we get that query languages with reflection become a powerful tool for database programming. In this talk, I will give an overview of research on reflection in query languages and in particular show that it is relative straightforward to extend current query languages, such as SQL and XQuery, with this mechanism.


End to End Security: The End Game with WS-*?

Ruchith Fernando, Apache Software Foundation

The industry standard interoperable middleware security platform today is powered by a set of open security standards built fundamentally on XML and SOAP. The standards include WS-Security, WS-Trust, WS-SecureConverstion, and WS-Security Policy. These specifications address areas such as message level security, securing multiple message exchanges, establishing trust relationships between communicating parties and expressing security requirements and capabilities of Web services.

This presentation will explain how these specification contribute to establishing end-to-end applications security for distributed enterprise applications. The WS-* security platform covers the full spectrum of application security needs including authentication, integrity, non-repudiation, confidentiality and trust. We will also identify some of the new security issues that are raised by these specifications themselves.

Finally, we will discuss different emerging applications of Web services security standards in the industry in fields such as user centric identity management.


Memory-Efficient Dynamic Data Caching

Professor Amitabh Chaudhary, University of Notre Dame

We present an online algorithm for a general caching model: There is a remote data source that has data objects of various sizes and a local cache that can store copies of objects as long as its limited capacity is not exceeded. The cache receives an online sequence of user queries, each of which accesses multiple objects and returns a result, possibly of size much less than the total size of objects accessed. Meanwhile the source receives an online sequence of updates on the data objects. To answer a query with the latest data, a cache either has updated copies of all objects accessed by the query, or loads all accessed objects that it doesn't have and updates all accessed objects that it does have but are not updated, or ships the query directly to the source. To load new objects, the cache may have to evict some of the existing objects. It does this online, minimizing the amount of network traffic generated. We present a simple algorithm for the problem, which when combined with known algorithms for object (multi-size, multi-cost) caching gives an polylog(k)-competitive solution, where k is the ratio of the size of cache to the size of the smallest object. Further, we randomize the algorithm, getting the same performance bounds, but now eliminating the need to maintain any information about the sequence of queries already seen. This algorithm for the general model also serves as the best known online algorithm (which in addition is memory-efficient) for several generalizations of caching that have been studied such as file-bundle caching, bypass caching with overlapping queries, push pull based data partitioning, and replica maintenance systems.


Data Cleaning Techniques by means of Entity Resolution

Byung-Won On, Ph.D., Pennsylvania State University

Real data are "dirty." Despite active research into integrity constraint enforcement and data cleaning, data in real database applications are still dirty. To make matters worse, both diverse formats/usages of modern data and demands for large-scale data handling and integration from multiple independent, heterogeneous data sources make this problem even harder. In particular, to surmount the challenges for which conventional solutions against this problem no longer work, I focus on one type of problem known as Entity Resolution (ER) - the process of identifying and merging duplicate entities determined to represent the same real-world object. This problem frequently occurs in a wide variety of applications: information fusion, data warehousing, census survey, e-commerce, digital libraries, bio medical informatics, information retrieval, etc. In my thesis, I have studied three specialized types of ER problems: the Split Entity Resolution (SER) problem, the Grouped Entity Resolution (GER) problem, and the Mixed Entity Resolution (MER) problem. Despite the fact that the problems have been studied extensively, it is still challenging to de-duplicate complex entities in large-scale database management. Toward this problem, the goal of my research is to develop a set of generic domain-independent, effective, scalable data cleaning solutions in terms of compatibility with SQL and Database engine. Especially, in order to address the scalability issue, I developed a scalable two-step framework using blocking-based pruning method in the SER problem and a multi-level graph partitioning method in the MER problem. Moreover, to improve accuracy, I proposed two graph theoretic concepts in addition to the existing textual similarity measures -- one based on Quasi-Clique and the other based on Bipartite Matching -- and experimentally validate the superiority of the proposed solutions.

 

National Science Foundation IU School of Informatics Indiana University