Abstracts


Semantic Music Retrieval

Dr. Ying Ding

Indiana University, School of Library and Information Science

The vision of the Semantic Web is to lift the current Web into semantic repositories where heterogeneous data can be queried and different services can be mashed up. The Web becomes a platform for integrating data and services. Ontologies, or agreed consensus, are the key to achieving that. In the cultural heritage area especially, cross-media and cross-archival retrieval has become the rallying cry. The EASAIER project (funded by the European Union) aims to enable enhanced access to sound archives by providing multiple methods of retrieval, integration with other media archives, and content enrichment. During this talk, I will share with you the development of this project.


Provenance Collection in the Life Science Grid

Bin Cao

Indiana University, Center for Data and Search Informatics

Cyberinfrastructure frameworks for science discovery are an accepted way of accessing analysis tools and computational and data resources on the Internet. Provenance plays an increasingly valuable role in scientific discovery by capturing a record of the activities that led to the creation of data. This record is essential to the long-term preservation and reuse of the data, and to making determinations of its quality. The Life Science Grid is an open source cyberinfrastructure framework for biochemical discovery developed by Eli Lilly Pharmaceutical Company. We are collaborating with the University of Manchester to add provenance and ontological semantics to the Life Science Grid to help researchers track their interactions with the discovery framework, better annotate the results, and work more efficiently. The project raises interesting challenges in instrumentation, annotation, and visualization of provenance data.


Pollution resilience for DNS resolvers

Andrew Kalafut

Indiana University, Department of Computer Science

The DNS is a cornerstone of the Internet, serving as a distributed database to translate host names to IP addresses, among other things. Unfortunately, no matter how securely an organization provisions and guards its own DNS infrastructure, it is at the mercy of others' provisioning when it comes to resolutions its resolvers perform on behalf of its clients -- even one compromised DNS server on the Internet can mislead an organization's clients to fake look-alike phishing Web sites or malware-serving sites. We propose a self-defense mechanism where the DNS resolvers collect a small amount of additional information for the DNS responses they receive and maintain a history of previous responses to guard their clients against misleading information from compromised DNS servers on the Internet. Any organization can choose to enhance its resolvers with our mechanism unilaterally, unlike DNSSEC, which can ensure correctness of information only if the remote DNS server deploys it.
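
The history idea can be illustrated with a minimal sketch. It is not the authors' mechanism: the abstract does not say what additional information is collected or how responses are compared. The class name `ResponseHistory`, its methods, and the rule "flag a response whose addresses share nothing with the history" are all assumptions made for illustration.

```python
from collections import defaultdict


class ResponseHistory:
    """Hypothetical per-name history of addresses seen in past DNS responses."""

    def __init__(self):
        # host name (lowercased) -> set of IP addresses accepted so far
        self._seen = defaultdict(set)

    def is_suspicious(self, name, addresses):
        """Return True if a name with history now resolves only to unseen addresses."""
        known = self._seen.get(name.lower())
        if not known:
            return False  # no history yet, nothing to compare against
        return known.isdisjoint(addresses)

    def record(self, name, addresses):
        """Remember the addresses of a response the resolver accepted."""
        self._seen[name.lower()].update(addresses)
```

A resolver using such a structure would call `is_suspicious` before answering a client and `record` after accepting a response. A real deployment would also need to handle legitimate address changes, which this sketch ignores.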


Distrust Reputation System For P2P Information Retrieval

Ruj Akavipat

Indiana University, Department of Computer Science

As peer-to-peer (P2P) information retrieval systems gain popularity, it is inevitable that some users will try to exploit them to push irrelevant information, such as unsolicited advertisements, to others. These exploiters are well known as spammers to email users and the Web IR community. So far, many approaches have been proposed to directly reduce the effect of spammers. However, there has not been enough emphasis on protecting against decoy peers that maliciously direct traffic to spammers. To fill this gap, this work focuses on handling such a threat by extending PageRank techniques to compute peers' reputations. The techniques proposed in this study integrate the propagation of distrust from known offenders to reduce the effect of decoy peers. Simulation results showed that the proposed techniques were effective against the decoys while preserving an acceptable quality of search results.
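
One way to picture the combination of PageRank with distrust propagation is the sketch below. It is an illustration under stated assumptions, not the technique evaluated in the talk: the function names, the backward propagation rule (a peer pointing to a distrusted peer inherits a decayed share), the `decay` and `rounds` parameters, and the final `rank * (1 - distrust)` combination are all hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; links maps a peer to the peers it points to."""
    peers = set(links) | {q for targets in links.values() for q in targets}
    n = len(peers)
    rank = {p: 1.0 / n for p in peers}
    for _ in range(iterations):
        # rank held by peers with no outgoing links is spread evenly
        dangling = sum(rank[p] for p in peers if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in peers}
        for p, targets in links.items():
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


def distrust(links, offenders, decay=0.5, rounds=3):
    """Propagate distrust backwards from known offenders (assumed rule)."""
    score = {p: 1.0 for p in offenders}
    for _ in range(rounds):
        updated = dict(score)
        for p, targets in links.items():
            worst = max((score.get(q, 0.0) for q in targets), default=0.0)
            updated[p] = max(updated.get(p, 0.0), worst * decay)
        score = updated
    return score


def reputation(links, offenders):
    """Discount each peer's PageRank by its propagated distrust (assumed combination)."""
    rank = pagerank(links)
    bad = distrust(links, offenders)
    return {p: rank[p] * (1.0 - bad.get(p, 0.0)) for p in rank}
```

With `links = {"a": ["b"], "b": ["a"], "decoy": ["spam"], "spam": []}` and `offenders = {"spam"}`, the decoy inherits half of the offender's distrust, so its reputation is discounted even though it is not itself a known offender.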


Socially Induced Semantic Networks and Applications

Ben Markines

Indiana University, Department of Computer Science

Social bookmarking systems allow Web users to actively annotate online resources. These annotations incorporate meta-information with Web pages in addition to the actual document contents. From a collection of socially annotated resources, we present various methods for quantifying the relationship between objects, i.e. tags or resources. These relationships can then be represented in a semantic similarity network where the nodes represent objects and the undirected weighted edges represent their relations. One problem of assembling and maintaining such a network is efficiency, i.e. the time and space complexity associated with graph algorithms. The complexity of these algorithms is typically quadratic. We present two techniques: the first addresses space complexity using a concept from the sparse matrix literature, and the second offers an incremental process that addresses both space and time limitations. We then present a number of applications leveraging a socially induced semantic similarity network. A recommendation engine and a Web navigation tool are evaluated through user studies. Finally, we explore spam detection to enhance the functionality of social bookmarking systems.
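
As a rough illustration of a tag similarity network stored sparsely, the sketch below computes cosine similarity between tags from (resource, tag) pairs and keeps only pairs that actually co-occur. The abstract does not name the similarity measures or the sparse matrix concept it uses, so cosine similarity, the function name `tag_similarity`, and the `threshold` parameter are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def tag_similarity(annotations, threshold=0.0):
    """Return {(tag_a, tag_b): cosine similarity} for tags sharing a resource.

    Pairs that never co-occur have similarity zero and are simply not stored,
    so memory grows with the number of nonzero edges rather than all tag pairs.
    """
    resources_of = defaultdict(set)  # tag -> resources annotated with it
    tags_of = defaultdict(set)       # resource -> tags applied to it
    for resource, tag in annotations:
        resources_of[tag].add(resource)
        tags_of[resource].add(tag)

    shared = defaultdict(int)  # (tag_a, tag_b) -> number of shared resources
    for tags in tags_of.values():
        for a, b in combinations(sorted(tags), 2):
            shared[(a, b)] += 1

    edges = {}
    for (a, b), count in shared.items():
        weight = count / sqrt(len(resources_of[a]) * len(resources_of[b]))
        if weight > threshold:
            edges[(a, b)] = weight
    return edges
```

Raising `threshold` prunes weak edges, trading some recall for a smaller network.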


What's in a session: Tracking individual behavior on the Web

Mark Meiss

Indiana University, Department of Computer Science

This talk describes interesting features of a dataset containing all HTTP requests generated by a thousand undergraduates over a span of two months. Preserving user identity in the data set allows us to discover novel properties of Web traffic that directly affect models of Web navigation. First, the popularity of Web sites (the number of users who contribute to their traffic) lacks any intrinsic mean and may be unbounded. Further, many aspects of the browsing behavior of individual users can be approximated by log-normal distributions even though their aggregate behavior is scale-free. Finally, users' click streams cannot be cleanly segmented into sessions using timeouts, complicating attempts to model their behavior. We propose instead a logical definition of sessions based on browsing activity as revealed by referrer URLs; a user may have several active sessions in their click stream at any one time. Applying a timeout to these logical sessions affects their statistics to a lesser extent than a purely timeout-based mechanism.
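
The referrer-based definition of a session can be sketched as follows. This is a simplified reading of the idea rather than the authors' procedure: the function name, the input shape (a time-ordered list of `(url, referrer)` pairs for one user), and the rule that a request with no known referrer starts a new session are assumptions.

```python
def logical_sessions(requests):
    """Group one user's requests into sessions by following referrer URLs.

    requests is a time-ordered list of (url, referrer) pairs; referrer is
    None when the request carried no referrer.
    """
    sessions = []    # each session is a list of URLs in request order
    session_of = {}  # URL -> index of the session it most recently appeared in
    for url, referrer in requests:
        if referrer is not None and referrer in session_of:
            index = session_of[referrer]  # continue the referrer's session
        else:
            index = len(sessions)         # no known referrer: start a new session
            sessions.append([])
        sessions[index].append(url)
        session_of[url] = index
    return sessions
```

Because membership is decided by the referrer rather than by elapsed time, requests from two browsing threads can interleave in the click stream and still land in separate sessions.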


An Adaptive Document Classifier Inspired by T-cell Cross-regulation in the Immune System

Alaa Abi-Haidar

Indiana University, School of Informatics

Over millions of years, the vertebrate immune system has evolved into one of the most complex and intelligent biological systems, whose function is to protect the body from disease. More specifically, the adaptive immune system is capable of learning about new harmful and harmless intruders and discriminating between them effectively. Several mathematical models have been proposed to simulate and understand the adaptive immune system and its functional subsystems.

We propose to develop a novel agent-based model of T-cell cross-regulation in the adaptive immune system, which we apply to binary classification problems analogous to that of the immune system. We expect our study to help immunologists better understand the general mechanism behind T-cell cross-regulation and also raise additional questions about the behavior of the adaptive immune system in general. Our aim is to understand how the cross-regulation model can deal with concept drift, robustness to noise, unbalanced classification, and other emerging complex behaviors in document classification. In addition, our goal is to generalize the model beyond binary classification towards a general-purpose multi-classifier. We will validate the model on various types of stream data. We will use data with fixed and varying numbers of classes (or categories), and the number of elements from each class can also vary over time. Finally, we intend to compare our model with other machine learning classifiers and conclude with insights that can help us better understand the vertebrate immune system and its application to document classification.


Programming Large Distributed Systems with High Level Abstractions

Dr. Douglas Thain

University of Notre Dame, Department of Computer Science and Engineering

Large distributed systems such as clusters, clouds, and grids remain very challenging to program. Although experts can tune distributed and multicore computers to achieve high performance, the non-expert may struggle to achieve a program that even functions correctly. To address this problem, we propose the use of high level abstractions to represent very large problems. An abstraction is a regularly structured framework into which an end-user may plug in simple sequential programs to create very large parallel programs. I will discuss several examples of abstractions -- AllPairs, Classify, and Wavefront -- that we have implemented on both multicore and distributed systems. Using these abstractions, we have enabled the solution of problems of unprecedented size in biometrics, data mining, economics, and genomics.
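
To make the notion of an abstraction concrete, here is a minimal sketch of an AllPairs-style helper. It assumes AllPairs means applying a user-supplied sequential function to every pairing of elements from two sets and returning the results as a matrix. The function name `all_pairs`, its parameters, and the use of a local thread pool in place of a cluster or grid are illustrative assumptions, not the speaker's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def all_pairs(set_a, set_b, compare, workers=4):
    """Return matrix[i][j] = compare(set_a[i], set_b[j]) using a pool of workers."""
    set_a, set_b = list(set_a), list(set_b)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # one map call per row; results come back in submission order
        rows = [pool.map(compare, [a] * len(set_b), set_b) for a in set_a]
        return [list(row) for row in rows]
```

The end user writes only `compare`; how the work is partitioned and scheduled is hidden behind the helper, which is the point of the abstraction.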


Bibliome Informatics in Computational Biology

Dr. Luis M. Rocha

Indiana University, School of Informatics

Literature mining (or bibliome informatics) is useful to infer biochemical and functional information about groups of genes and proteins; its objective is to automatically sort through huge collections of literature and suggest the most relevant pieces of information for a specific analysis task, e.g. the annotation of proteins. Until now, literature mining has been applied essentially to help annotate molecular entities such as genes and proteins. In the next few years the field is expected to move into bolder pursuits, such as the discovery of novel relationships among such entities, e.g. protein-protein and gene-disease interactions. Indeed, the second BioCreative competition, in which we participated, included a series of tasks on extraction of protein-protein interaction information from the literature. We describe our complex network approach to biomedical literature mining in proteomics. We also describe our work on the large-scale validation of bibliome algorithms.


Integration of Social and Environmental Data using Geographic Information Systems and Remote Sensing Techniques

Dr. Tom Evans

Indiana University, Department of Geography

Interactions between people and the environment are complex and dynamic. Direct relationships between two specific phenomena (e.g., population density and deforestation) are rare, if not nonexistent. More commonly, data from a variety of sources are needed to adequately understand and explain the social and biophysical factors that are a part of human-environment interactions. One method of integrating the various phenomena affecting human-environment relationships is by creating a spatial representation to provide a spatially explicit data modeling environment. A critical aspect of these spatially linked datasets is the decision of what spatial unit of analysis to use to study a specific social-biophysical process. This presentation discusses these spatial representations and implications for subsequent spatial data analysis.


Towards a Science of Science Cyberinfrastructure

Dr. Katy Borner

Victor H. Yngve Associate Professor of Information Science
Director of the Cyberinfrastructure for Network Science Center
Indiana University, School of Library and Information Science

What cyberinfrastructure is required to measure, model, analyze, and communicate scholarly data and ultimately scientific progress? This talk presents our efforts to create a science of science cyberinfrastructure that supports

This talk should be of particular interest to those who want to

  • Map their very own domain of research,
  • Test and compare data federation, mining, visualization algorithms on large scale datasets, or
  • Use advanced network science algorithms in their own research.


Transformation of LEAD to a Persistent Community Facility – Recent Enhancements and Future Capabilities

Craig Mattocks

University of North Carolina, Institute for Marine Sciences

LEAD (Linked Environments for Atmospheric Discovery), a service-oriented architecture (SOA) based cyberinfrastructure project in which meteorological analysis tools, forecast models, and data repositories can operate as dynamically adaptive, on-demand, grid-enabled systems, is currently undergoing a transformation to become a persistent, sustained facility upon which the atmospheric sciences community can rely. As part of this effort, LEAD scientists are collaborating with the developers of the WRF (Weather Research and Forecasting model) Portal at NOAA-ESRL-GSD, which is used at the Developmental Test Bed Center (DTC) in the Joint Numerical Test Bed (JNT) at NCAR, to develop enhanced, interoperable capabilities between the two portals. The new fused portal, with its more advanced and intuitive graphical user interface (GUI), will provide unprecedented forecasting capabilities by enabling users of all levels of sophistication and institutional capability to configure and run numerical weather simulations on powerful, shared or local computing resources.

In addition, plans are in place to incorporate the WRF-Var 3D/4D variational data assimilation system into LEAD workflows for improving the quality of numerical weather simulations. WRF-Var produces an optimal estimate of the true state of the atmosphere by integrating meteorological observations into a first-guess or previous “background” forecast, using an iterative minimization of a prescribed cost/penalty function and representative background error covariances. The improved analysis of the atmospheric state produced by WRF-Var can then be used to provide initial and boundary conditions for subsequent WRF simulations. The version of WRF-Var that will be used in this project will include a new interface for automatically ingesting weather observations directly from the real-time Meteorological Assimilation Data Ingest System (MADIS) data stream.
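
The idea of blending a background forecast with an observation by minimizing a cost function can be shown in a toy, single-variable form. This is a generic variational sketch, not WRF-Var code: real systems work on full state vectors with covariance matrices and an observation operator, whereas here the state is one number, the observation measures it directly, and the function names, parameter names, and gradient-descent minimizer are assumptions.

```python
def cost(x, background, observation, b_var, r_var):
    """Scalar cost: misfit to the background plus misfit to the observation,
    each weighted by the inverse of its error variance."""
    return 0.5 * (x - background) ** 2 / b_var + 0.5 * (observation - x) ** 2 / r_var


def minimize_cost(background, observation, b_var, r_var, steps=200):
    """Iteratively minimize the cost by gradient descent, starting from the background."""
    x = background
    # the cost is quadratic with constant curvature, so this step size is stable
    step = 0.5 / (1.0 / b_var + 1.0 / r_var)
    for _ in range(steps):
        gradient = (x - background) / b_var - (observation - x) / r_var
        x -= step * gradient
    return x
```

For this scalar case the minimizer has a closed form, `background + b_var / (b_var + r_var) * (observation - background)`, so the analysis moves toward the observation in proportion to how uncertain the background is.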

A potential future LEAD application is a workflow that includes a real-time, event-triggered storm surge prediction system, which was developed for the State of North Carolina to assist emergency managers with evacuation planning, decision-making and resource deployment during hurricane landfall events. Based on the WRF numerical weather prediction model and the ADCIRC (Advanced Circulation) coastal ocean model, this system provides a high-resolution, rapid-response assessment of the winds and flooding that are likely to occur as a tropical cyclone approaches the shoreline.


Cloud Computing with Nimbus

Argonne National Laboratory

Infrastructure-as-a-Service (IaaS) style cloud computing is emerging as a viable alternative to the acquisition and management of physical resources. But what exactly is cloud computing, how can we leverage it, and what opportunities does it open?

In this talk, I will give an overview of cloud computing and describe Nimbus -- a toolkit that provides an open source, EC2-compatible IaaS implementation as well as user-level tools adapting cloud computing to scientific needs. I will describe how application requirements drove the development of various Nimbus capabilities and how applications use these capabilities today. I will also discuss our experiences with configuring and running the Science Clouds -- a group of clouds in the academic domain available to scientific projects. Finally, I will discuss the emerging trends and innovation opportunities in cloud computing.


Intelligent Construction of Ensemble Machine Learning Algorithms

Indiana University

Ensemble machine learning models are often highly accurate on the supervised learning problem of classification. Combining groups of independent models allows for individual specialization and diversification with limited overfitting. The main drawback of using ensembles is the greatly increased computational resources required for training. In this talk, we will explore how training set preprocessing using clustering and singular value decomposition can be used to build accurate ensembles while keeping training times to a minimum. We will show results from several domains including sentiment/opinion mining, medical diagnosis, and spam detection.