Abstracts - Fall 2008


Using Metadata to Find Relevant Data in the e-Science Haystack

Scott Jensen

Indiana University, Ph.D. Candidate

As scientists increasingly have access to powerful computational grids through e-science portals, huge volumes of valuable scientific data are being generated. Being able to reuse this data for validation of experimental results and further research is crucial to the advancement of science - resulting in an increasing need for accurate and detailed metadata. Scientific communities use detailed XML schemas to describe their data, and our research looks at the characteristics of these metadata schemas - how they differ from general XML and how these differences can be exploited to address the particular requirements of scientists and enable scientific communities to easily catalog and search their data using the schema of their domain.


The Web is Smaller than it Seems

Indiana University, Ph.D. Candidate

The Web has grown beyond anyone's imagination. While significant research has been devoted to understanding aspects of the Web from the perspective of the documents that comprise it, we have little data on the relationships among the servers that comprise the Web. In this talk, we explore the extent to which Web servers are co-located with other Web servers in the Internet. In terms of the location of servers, we find that the Web is surprisingly small. This has important implications for the availability of Web servers in case of DoS attacks and blocklisting.


Beyond Reproducibility: Using Provenance to Streamline Data Exploration through Workflows

University of Utah

To analyze and understand the growing wealth of scientific data, complex computational processes need to be assembled, often requiring the combination of loosely-coupled resources, specialized libraries, distributed computing infrastructure, and Web services. Workflow (and workflow-based) systems have recently emerged as an alternative to the ad-hoc approaches to constructing computational tasks that are widely used in the scientific community. But although the benefits of using workflow systems are well known, the fact that workflows are hard to create and maintain has been a major barrier to wider adoption of this technology in the scientific domain. This is especially true for exploratory analysis tasks, where the path from data to insight requires a laborious, trial-and-error process in which users successively assemble, modify, and execute multiple workflows.

We advocate a data-centric view of workflow-based computational processes, where provenance of exploratory processes is captured through the workflow specifications and information about their evolution and impact on the data they manipulate. In this talk, we discuss how this detailed provenance information can be used to provide intuitive interfaces and tools that support collaborative analysis of scientific data. In particular, we will present a query-by-example interface for querying workflows whereby users query workflows through the same familiar interface they use to create them; a mechanism for semi-automatically creating and refining workflows by analogy, without requiring users to directly manipulate or edit the workflow specifications; and a recommendation system that guides users through the workflow design process by automatically suggesting completions based on a database of previously created workflows. We will also demonstrate how these tools have been implemented and can be used in VisTrails, an open-source provenance management system.

Joint work with Claudio T. Silva, Erik Anderson, Steven P. Callahan, Tommy Ellkvist, David Koop, Lauro Lins, Emanuele Santos, Carlos E. Scheidegger and Huy T. Vo.


Databases 'R 'Us: A Turning Point in Database Research?

University of Louisville

There is considerable talk in the database research community that we are at a turning point, and that a new agenda should refocus efforts on non-traditional areas. But what are these new areas? What role should database research play in them? And, more importantly, if there are areas out there in which databases should play an important role, how did we get to this point, where databases are not a player? In this talk we present our viewpoint on how we got into this situation and some of the things we should be doing to get out of it. We argue that database research has taken too narrow a view of the phenomenon of information flow. We make the ideas concrete by presenting some new research projects. All of these projects involve the basic idea of collaboration: collaboration among the users of a database (who form, implicitly or explicitly, a community, and can therefore be analyzed with tools recently developed in Social Network Theory and related fields), and collaboration between users and the database. Users are no longer passive recipients of whatever data the database offers them; they should be able to annotate the database (influencing not only content but also structure) and to direct the way the data is treated (creating workflows that control information processing). Clearly, some research in these areas already exists, but it is now taking center stage and reaching further, as applications like e-science come to dominate the landscape and demand that databases relinquish the absolute control of the data that they have enjoyed so far.


A Generative Model of Text Documents Capturing Bursts and Similarity

Indiana University

Various universal regularities characterize text from different domains and languages. Most notable are Zipf's law on the distribution of word frequencies, Heaps' law on vocabulary size, and the bursty nature of topical words. However, no single model of text generation explains how these properties emerge. Furthermore, no model exists to interpret the empirical distribution of similarity between documents. Here we present and validate a generative model that simultaneously produces all of these statistical features of textual corpora. Our results point to frequency ranking as a key mechanism for understanding language generation. Understanding the emergence of structure and topicality in written text can shed light on the collective cognitive processes we use to organize and store information, and find broad applications in literature analysis, Web mining, and social media.

Joint work with Mariangeles Serrano and Alessandro Flammini.
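
For readers unfamiliar with the two laws named in this abstract: Zipf's law says that a word's frequency falls off roughly as a power of its frequency rank, and Heaps' law says that the number of distinct words grows sublinearly with the length of the text. The sketch below is only an illustration of how these two quantities can be measured on a tokenized text; it is not the generative model presented in the talk, and the function names and the toy sentence are made up for the example.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) tuples ordered by decreasing count.

    Plotting count against rank on log-log axes is the usual way to
    look for Zipf-like behavior. Ties are broken alphabetically.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def vocabulary_growth(tokens):
    """Return the number of distinct words seen after each token.

    Plotting this against text length gives the curve that Heaps' law
    describes.
    """
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Toy text, invented for illustration; far too small to show either law.
sample_tokens = "the cat sat on the mat and the dog sat on the rug".split()
table = rank_frequency(sample_tokens)
growth = vocabulary_growth(sample_tokens)
```

On a real corpus one would tokenize more carefully (case folding, punctuation) and fit the exponents on log-log axes; both choices are left open here.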


Leveraging Provenance for Case-Based Support of e-Science Experimentation

Indiana University, Ph.D. Candidate

The emerging popularity of in silico experimentation within the scientific community has brought with it not only an abundance of resultant data, but also provenance -- metadata describing the pedigree of the results. As a knowledge source, provenance can be leveraged to provide automated assistance to scientists who need technical help, or to serve as an information source for planning which grid resources to employ in their experiments. Through several experiments examining a large collection of existing experimental workflows, we have found that case-based methods of generating suggestions are effective in providing quality assistance.