Abstracts
Semantic Music Retrieval
Indiana University, School of Library and Information Science
The vision of the Semantic Web is to lift the current Web into semantic repositories where heterogeneous data can be queried and different services can be mashed up, so that the Web becomes a platform for integrating data and services. Ontologies, or agreed consensus, are the key to achieving that. In the cultural heritage area in particular, cross-media and cross-archival retrieval has become the slogan. The EU-funded EASAIER project aims to enable enhanced access to sound archives by providing multiple methods of retrieval, integration with other media archives, and content enrichment. In this talk, I will describe the development of this project.
Provenance Collection in the Life Science Grid
Indiana University, Center for Data and Search Informatics
Cyberinfrastructure frameworks for science discovery are an accepted way of accessing analysis tools and computational and data resources on the Internet. Provenance plays an increasingly valuable role in scientific discovery by capturing a record of the activities that led to the creation of data. This record is essential to the long-term preservation and reuse of the data, and to making determinations of its quality. The Life Science Grid is an open source cyberinfrastructure framework for biochemical discovery developed by Eli Lilly Pharmaceutical Company. We are collaborating with the University of Manchester to add provenance and ontological semantics to the Life Science Grid to help researchers track their interactions with the discovery framework, better annotate the results, and work more efficiently. The project raises interesting challenges in instrumentation, annotation, and visualization of provenance data.
Pollution resilience for DNS resolvers
Indiana University, Department of Computer Science
The DNS is a cornerstone of the Internet, serving as a distributed database to translate host names to IP addresses, among other things. Unfortunately, no matter how securely an organization provisions and guards its own DNS infrastructure, it is at the mercy of others' provisioning when it comes to resolutions its resolvers perform on behalf of its clients -- even one compromised DNS server in the Internet can mislead an organization's clients to fake look-alike phishing Web sites or malware-serving sites. We propose a self-defense mechanism where the DNS resolvers collect a small amount of additional information for the DNS responses they receive and maintain a history of previous responses to guard their clients against misleading information from compromised DNS servers in the Internet. Any organization can choose to enhance its resolvers with our mechanism unilaterally, unlike DNSSEC, which can ensure correctness of information only if the remote DNS server deploys it.
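The history-based idea can be illustrated with a minimal sketch. It is not the mechanism proposed in the talk: the abstract does not say what additional information is collected, so the class name `ResponseHistory` and the acceptance rule (accept a response only if it shares at least one address with earlier responses for the same name) are assumptions made purely for illustration.

```python
class ResponseHistory:
    """Hypothetical per-resolver history of addresses seen for each host name."""

    def __init__(self):
        self._seen = {}  # host name -> set of addresses accepted so far

    def check(self, name, addresses):
        """Return True if the response is consistent with history, else False."""
        addresses = set(addresses)
        previous = self._seen.get(name)
        if previous is None:
            # First response for this name: nothing to compare against.
            self._seen[name] = addresses
            return True
        if addresses & previous:
            # Overlaps with earlier answers: accept and extend the history.
            previous |= addresses
            return True
        # Completely disjoint from everything seen before: treat as suspicious.
        return False


history = ResponseHistory()
history.check("example.org", ["192.0.2.10"])
history.check("example.org", ["192.0.2.10", "192.0.2.11"])
suspicious = not history.check("example.org", ["203.0.113.66"])
```

A rule this crude would also flag legitimate address changes, which is presumably why a practical design needs the extra per-response information the abstract mentions.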
Distrust Reputation System For P2P Information Retrieval
Indiana University, Department of Computer Science
As peer-to-peer (P2P) information retrieval systems gain popularity, it is inevitable that some users will try to exploit them to push irrelevant information, such as unsolicited advertisements, to others. Such exploiters are well known to email users and the Web IR community as spammers. Many approaches have been proposed to directly reduce the effect of spammers. However, there has not been enough emphasis on protecting against decoy peers that maliciously direct traffic to spammers. To fill this gap, this work addresses that threat by extending PageRank techniques to compute peers' reputations. The proposed techniques integrate the propagation of distrust from known offenders to reduce the effect of decoy peers. Simulation results show that the proposed techniques are effective against decoys while preserving an acceptable level of search result quality.
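One way to picture distrust propagation is the sketch below. It is an illustration under stated assumptions, not the algorithm evaluated in this work: trust is plain PageRank over a peer link graph, distrust is a PageRank run on the reversed graph that teleports only to known offenders, and the two are combined with a hypothetical weight `beta`. The graph representation (a dict mapping every peer to the list of peers it points to) and all function names are likewise assumptions.

```python
def pagerank(graph, seeds=None, damping=0.85, iterations=50):
    """Power-iteration PageRank; teleports to `seeds` if given, else uniformly."""
    nodes = list(graph)
    targets = seeds if seeds else nodes
    teleport = {n: (1.0 / len(targets) if n in targets else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: hand its rank back through the teleport vector.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
        rank = nxt
    return rank


def reverse(graph):
    """Flip every edge so scores flow from a peer to those that point at it."""
    rev = {n: [] for n in graph}
    for n, outs in graph.items():
        for m in outs:
            rev[m].append(n)
    return rev


def reputation(graph, offenders, beta=1.0):
    """Trust minus weighted distrust; `beta` is an illustrative parameter."""
    trust = pagerank(graph)
    distrust = pagerank(reverse(graph), seeds=set(offenders))
    return {n: trust[n] - beta * distrust[n] for n in graph}


peers = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"],
         "decoy": ["spam", "a"], "spam": []}
scores = reputation(peers, offenders={"spam"})
```

In this toy graph the peer named `decoy` is never marked as an offender, yet it ends up with a lower score than the honest peers because it points at one.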
Socially Induced Semantic Networks and Applications
Indiana University, Department of Computer Science
Social bookmarking systems allow Web users to actively annotate online resources. These annotations associate meta-information with Web pages in addition to the actual document contents. From a collection of socially annotated resources, we present various methods for quantifying the relationship between objects, i.e. tags or resources. These relationships can then be represented in a semantic similarity network where the nodes represent objects and the undirected weighted edges represent their relations. One problem in assembling and maintaining such a network is efficiency, i.e. the time and space complexity associated with graph algorithms. The complexity of these algorithms is typically quadratic. We present two techniques: the first addresses space complexity using a concept from the sparse matrix literature, and the second offers an incremental process that addresses both space and time limitations. We then present a number of applications leveraging a socially induced semantic similarity network. A recommendation engine and a Web navigation tool are evaluated through user studies. Finally, we explore spam detection to enhance the functionality of social bookmarking systems.
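A small sketch may help make the similarity network concrete. It assumes annotations arrive as (user, resource, tag) triples and uses Jaccard overlap of tagged resources as the similarity measure; the abstract does not name the measures actually studied, so both the measure and the function name `tag_similarity_network` are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def tag_similarity_network(annotations, min_similarity=0.0):
    """Build undirected weighted edges between tags from (user, resource, tag) triples."""
    resources_by_tag = defaultdict(set)
    for _user, resource, tag in annotations:
        resources_by_tag[tag].add(resource)

    edges = {}
    # Comparing every pair of tags is the quadratic step.
    for t1, t2 in combinations(sorted(resources_by_tag), 2):
        r1, r2 = resources_by_tag[t1], resources_by_tag[t2]
        shared = len(r1 & r2)
        if shared == 0:
            continue  # store nothing for unrelated tags (sparse representation)
        similarity = shared / len(r1 | r2)
        if similarity > min_similarity:
            edges[(t1, t2)] = similarity
    return edges


bookmarks = [
    ("u1", "r1", "python"),
    ("u2", "r1", "code"),
    ("u1", "r2", "python"),
    ("u3", "r3", "music"),
]
network = tag_similarity_network(bookmarks)
```

Keeping only the non-zero edges in a dictionary limits memory use, but the pairwise loop is still quadratic in the number of tags, which is the cost the incremental process described above is meant to address.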
What's in a session: Tracking individual behavior on the Web
Indiana University, Department of Computer Science
This talk describes interesting features of a dataset containing all HTTP requests generated by a thousand undergraduates over a span of two months. Preserving user identity in the data set allows us to discover novel properties of Web traffic that directly affect models of Web navigation. First, the popularity of Web sites (the number of users who contribute to their traffic) lacks any intrinsic mean and may be unbounded. Further, many aspects of the browsing behavior of individual users can be approximated by log-normal distributions even though their aggregate behavior is scale-free. Finally, users' click streams cannot be cleanly segmented into sessions using timeouts, complicating attempts to model their behavior. We propose instead a logical definition of sessions based on browsing activity as revealed by referrer URLs; a user may have several active sessions in their click stream at any one time. Applying a timeout to these logical sessions affects their statistics to a lesser extent than a purely timeout-based mechanism.
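The referrer-based notion of a session can be sketched as follows. This is a simplified illustration, not the definition used in the study: it assumes each request for one user is a (url, referrer) pair in time order, starts a new session whenever the referrer is missing or unknown, and otherwise attaches the request to the session that last served the referring URL. The function name `assign_sessions` is hypothetical.

```python
def assign_sessions(requests):
    """Return one session id per request for a single user's click stream."""
    session_of_url = {}  # URL -> id of the session that most recently visited it
    session_ids = []
    next_id = 0
    for url, referrer in requests:
        if referrer and referrer in session_of_url:
            # The click came from a page already in a session: continue it.
            sid = session_of_url[referrer]
        else:
            # No usable referrer (typed URL, bookmark, ...): start a new session.
            sid = next_id
            next_id += 1
        session_of_url[url] = sid
        session_ids.append(sid)
    return session_ids


clicks = [
    ("a.com", None),
    ("b.com", None),
    ("a.com/x", "a.com"),
    ("b.com/y", "b.com"),
    ("a.com/x/z", "a.com/x"),
]
sessions = assign_sessions(clicks)
```

The interleaved requests in `clicks` end up in two sessions that are active at the same time, which a purely timeout-based split could not separate.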
An Adaptive Document Classifier Inspired by T-cell Cross-regulation in the Immune System
Indiana University, School of Informatics
Over millions of years, evolution has shaped the vertebrate immune system into one of the most complex and intelligent biological systems, whose function is to protect the body from disease. More specifically, the adaptive immune system is capable of learning about new harmful and harmless intruders and discriminating between them effectively. Several mathematical models have been proposed to simulate and understand the adaptive immune system and its functional subsystems.
We propose to develop a novel agent-based model of T-cell cross-regulation in the adaptive immune system, which we apply to binary classification problems analogous to the one the immune system faces. We expect our study to help immunologists better understand the general mechanism behind T-cell cross-regulation and also to raise additional questions about the behavior of the adaptive immune system in general. Our aim is to understand how the cross-regulation model can deal with concept drift, robustness to noise, unbalanced classification, and other emerging complex behaviors in document classification. In addition, our goal is to generalize the model beyond binary classification towards a general-purpose multi-class classifier. We will validate the model on various types of stream data. We will use data with fixed and varying numbers of classes (or categories), in which the number of elements from each class can also vary over time. Finally, we intend to compare our model with other machine learning classifiers and conclude with insights that can help us better understand the vertebrate immune system and its application to document classification.
Programming Large Distributed Systems with High Level Abstractions
University of Notre Dame, Department of Computer Science and Engineering
Large distributed systems such as clusters, clouds, and grids remain very challenging to program. Although experts can tune distributed and multicore computers to achieve high performance, the non-expert may struggle to achieve a program that even functions correctly. To address this problem, we propose the use of high level abstractions to represent very large problems. An abstraction is a regularly structured framework into which an end-user may plug in simple sequential programs to create very large parallel programs. I will discuss several examples of abstractions -- AllPairs, Classify, and Wavefront -- that we have implemented on both multicore and distributed systems. Using these abstractions, we have enabled the solution of problems of unprecedented size in biometrics, data mining, economics, and genomics.
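The shape of such an abstraction can be shown with a toy, single-machine sketch. It is not the implementation discussed in the talk: the function name `all_pairs`, its parameters, and the use of a thread pool are assumptions chosen only to show how a user-supplied sequential function plugs into a regular structure.

```python
from concurrent.futures import ThreadPoolExecutor


def all_pairs(set_a, set_b, compare, max_workers=4):
    """Return the matrix M with M[i][j] = compare(set_a[i], set_b[j])."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One lazy row per element of set_a; the work itself runs in the pool.
        rows = [pool.map(lambda b, a=a: compare(a, b), set_b) for a in set_a]
        return [list(row) for row in rows]


def difference(x, y):
    """A simple sequential function supplied by the end user."""
    return abs(x - y)


matrix = all_pairs([1, 5], [2, 4, 9], difference)
```

The end user writes only `difference`; how the pairs are partitioned and scheduled is left to the abstraction, which is the part a distributed version would replace.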
Bibliome Informatics in Computational Biology
Indiana University, School of Informatics
Literature mining (or bibliome informatics) is useful for inferring biochemical and functional information about groups of genes and proteins; its objective is to automatically sort through huge collections of literature and suggest the most relevant pieces of information for a specific analysis task, e.g. the annotation of proteins. Until now, literature mining has been applied essentially to help annotate molecular entities such as genes and proteins. In the next few years the field is expected to move into bolder pursuits, such as the discovery of novel relationships among such entities, e.g. protein-protein and gene-disease interactions. Indeed, the second BioCreative competition, in which we participated, included a series of tasks on the extraction of protein-protein interaction information from the literature. We describe our complex network approach to biomedical literature mining in proteomics. We also describe our work on the large-scale validation of bibliome algorithms.
Integration of Social and Environmental Data using Geographic Information Systems and Remote Sensing Techniques
Indiana University, Department of Geography
Interactions between people and the environment are complex and dynamic. Direct relationships between two specific phenomena (e.g., population density and deforestation) are rare, if not nonexistent. More commonly, data from a variety of sources are needed to adequately understand and explain the social and biophysical factors that are a part of human-environment interactions. One method of integrating the various phenomena affecting human-environment relationships is by creating a spatial representation to provide a spatially explicit data modeling environment. A critical aspect of these spatially linked datasets is the decision of what spatial unit of analysis to use to study a specific social-biophysical process. This presentation discusses these spatial representations and implications for subsequent spatial data analysis.
Towards a Science of Science Cyberinfrastructure
Victor H. Yngve Associate Professor of Information Science
Director of the Cyberinfrastructure for Network Science Center
Indiana University, School of Library and Information Science
What cyberinfrastructure is required to measure, model, analyze, and communicate scholarly data and ultimately scientific progress? This talk presents our efforts to create a science of science cyberinfrastructure that supports:
- Data access and federation via the Scholarly Database, http://sdb.slis.indiana.edu,
- Data preprocessing, modeling, analysis, and visualization using plug-and-play cyberinfrastructures such as the Network Workbench, http://nwb.slis.indiana.edu, and
- Communication of science to a general audience via the Mapping Science exhibit at http://scimaps.org.
This talk should be of particular interest to those who want to:
- Map their very own domain of research,
- Test and compare data federation, mining, and visualization algorithms on large-scale datasets, or
- Use advanced network science algorithms in their own research.
Transformation of LEAD to a Persistent Community Facility – Recent Enhancements and Future Capabilities
University of North Carolina, Institute for Marine Sciences
LEAD (Linked Environments for Atmospheric Discovery), a service-oriented architecture (SOA) based cyberinfrastructure project in which meteorological analysis tools, forecast models, and data repositories can operate as dynamically adaptive, on-demand, grid-enabled systems, is currently undergoing a transformation to become a persistent, sustained facility upon which the atmospheric sciences community can rely. As part of this effort, LEAD scientists are collaborating with the developers of the WRF (Weather Research and Forecasting model) Portal at NOAA-ESRL-GSD, which is used at the Developmental Test Bed Center (DTC) in the Joint Numerical Test Bed (JNT) at NCAR, to develop enhanced, interoperable capabilities between the two portals. The new fused portal, with its more advanced and intuitive graphical user interface (GUI), will provide unprecedented forecasting capabilities by enabling users of all levels of sophistication and institutional capability to configure and run numerical weather simulations on powerful, shared or local computing resources.
In addition, plans are in place to incorporate the WRF-Var 3D/4D variational data assimilation system into LEAD workflows for improving the quality of numerical weather simulations. WRF-Var produces an optimal estimate of the true state of the atmosphere by integrating meteorological observations into a first-guess or previous “background” forecast, using an iterative minimization of a prescribed cost/penalty function and representative background error covariances. The improved analysis of the atmospheric state produced by WRF-Var can then be used to provide initial and boundary conditions for subsequent WRF simulations. The version of WRF-Var that will be used in this project will include a new interface for automatically ingesting weather observations directly from the real-time Meteorological Assimilation Data Ingest System (MADIS) data stream.
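For orientation, the cost function minimized in variational assimilation is commonly written in the following generic 3D-Var form. The notation is the standard textbook one rather than anything taken from the abstract, and the formulation actually used in WRF-Var may contain additional terms: $\mathbf{x}$ is the analysis state, $\mathbf{x}_b$ the background forecast, $\mathbf{B}$ the background error covariance, $\mathbf{y}$ the observations, $H$ the observation operator, and $\mathbf{R}$ the observation error covariance.

```latex
J(\mathbf{x}) =
  \tfrac{1}{2}\,(\mathbf{x}-\mathbf{x}_b)^{\mathsf{T}}\,\mathbf{B}^{-1}\,(\mathbf{x}-\mathbf{x}_b)
  + \tfrac{1}{2}\,\bigl(\mathbf{y}-H(\mathbf{x})\bigr)^{\mathsf{T}}\,\mathbf{R}^{-1}\,\bigl(\mathbf{y}-H(\mathbf{x})\bigr)
```

The first term penalizes departure from the background forecast and the second penalizes misfit to the observations; the iterative minimization seeks the state that balances the two.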
A potential future LEAD application is a workflow that includes a real-time, event-triggered storm surge prediction system, which was developed for the State of North Carolina to assist emergency managers with evacuation planning, decision-making and resource deployment during hurricane landfall events. Based on the WRF numerical weather prediction model and the ADCIRC (Advanced Circulation) coastal ocean model, this system provides a high-resolution, rapid-response assessment of the winds and flooding that are likely to occur as a tropical cyclone approaches the shoreline.

