Completed Projects

The following is a list of projects completed by the research group of Prof. Michael Gertz, both at Heidelberg University (since 2008) and at the University of California, Davis (1997-2008).


PhD Graduate School CrowdAnalyser: Spatio-temporal Analysis of User-generated Content (U Heidelberg)

A key characteristic of the Web 2.0 is that data is voluntarily provided by users on the Internet through portals such as Wikipedia, YouTube, Flickr, Twitter, blogs, OpenStreetMap, and various social networks, at an unprecedented scale and a staggering rate. In today’s information society and knowledge economy, these portals provide a valuable resource for diverse application domains. The enormous potential of this data, crowdsourced from masses of volunteers, is increasingly recognized, but in many areas, especially in science, it is not utilized to its full potential. Several unsolved issues arise from these rapidly growing, highly dynamic, and heterogeneous streams of user-generated content. Addressing them means automatically assessing and harnessing this new type of poorly structured data for different application domains and, in particular, inferring new information from it. The participating research groups in Heidelberg have done pioneering work in these directions, especially in the context of utilizing geographic data. The objective of the graduate school is to develop novel methods and approaches for the quality-oriented analysis and exploration of crowdsourced Web 2.0 data, as well as to further improve and scale existing methods.

ITR: Adaptive Query Processing Architecture for Streaming Geospatial Image Data (UC Davis)

This NSF-funded research project developed models, techniques, and architectures for the adaptive processing of real-time, remotely sensed, streaming geospatial satellite image data. It developed a generic stream format for remotely sensed geospatial image data, called GeoStreams, based on widely used image data formats, to process continuous queries in an adaptive fashion and to deliver tailored geospatial image data to consumers. The GeoStream query model and query processing framework extend current models and techniques for processing continuous queries on data streams to handle complex queries on geospatial data. The main focus was on query processing techniques that optimize multi-query processing of spatial and sensor band selections. Additional important operators on GeoStreams include sensor band algebra, pixel neighborhood processing, spatial filtering, aggregation, and spatial reference transformations. Data from the National Oceanic and Atmospheric Administration's (NOAA) Geostationary Operational Environmental Satellite (GOES), directly accessible in this project, served as the primary testbed for the research. A major testbed application was in support of water balance studies: the goal was to calculate spatially distributed daily reference evapotranspiration maps for the State of California at high spatial resolution (1 km² to 16 km²) and in real time. The methodology combined GOES data with weather station data and ancillary spatial data such as digital elevation maps.
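
To make the stream query model concrete, the following Python sketch illustrates the flavor of continuous operators over a stream of geospatial image tiles. The Tile structure and the band-selection, spatial-filter, and aggregation operators are hypothetical simplifications for illustration, not the project's actual GeoStreams API.

# Hypothetical sketch of continuous operators over a stream of geospatial
# image tiles (illustration only; not the project's actual GeoStreams API).
from dataclasses import dataclass
from typing import Iterator, Tuple

import numpy as np


@dataclass
class Tile:
    band: str                                  # sensor band, e.g. "visible"
    bbox: Tuple[float, float, float, float]    # (min_x, min_y, max_x, max_y)
    pixels: np.ndarray                         # 2-D array of radiance values


def band_select(stream: Iterator[Tile], band: str) -> Iterator[Tile]:
    # Sensor band selection: keep only tiles of the requested band.
    return (t for t in stream if t.band == band)


def spatial_filter(stream: Iterator[Tile],
                   roi: Tuple[float, float, float, float]) -> Iterator[Tile]:
    # Spatial selection: keep tiles whose bounding box intersects the ROI.
    rx0, ry0, rx1, ry1 = roi
    for t in stream:
        x0, y0, x1, y1 = t.bbox
        if x0 < rx1 and x1 > rx0 and y0 < ry1 and y1 > ry0:
            yield t


def mean_aggregate(stream: Iterator[Tile]) -> Iterator[float]:
    # Per-tile aggregation: emit the mean pixel value of each incoming tile.
    return (float(t.pixels.mean()) for t in stream)


def continuous_query(source: Iterator[Tile],
                     roi: Tuple[float, float, float, float]) -> Iterator[float]:
    # Example continuous query: mean visible-band radiance over a region,
    # evaluated tile by tile as data arrives on the stream.
    return mean_aggregate(spatial_filter(band_select(source, "visible"), roi))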

COMET: COast-to-Mountain Environmental Transect (UC Davis)

The goal of this NSF-funded project was to develop a practical cyberinfrastructure prototype to facilitate the study of how multiple environmental factors, including climatic variability, affect major ecosystems along an elevation gradient from coastal California to the summit of the Sierra Nevada. The central scientific question was to understand the coupling between the strength of the California upwelling system and terrestrial ecosystem carbon exchange. Additional scientific goals were to better understand how atmospheric dust is transported to Lake Tahoe and to examine carbon flux in the coastal zone as moderated by upwelling processes. The geographic context comprises a diversity of ecosystems that are believed to be sensitive to climatological changes. The dispersion and complexity of the data needed to answer these scientific questions motivated the development of a state-of-the-art cyberinfrastructure to facilitate the research. This cyberinfrastructure is based on the integration of access to distributed and varied data collections and data streams, semantic registration of data, models and analysis tools, semantically aware data query mechanisms, and an orchestration system for advanced scientific workflows. Access to the cyberinfrastructure was provided through a web-based portal. Prof. Michael Gertz led this project until he moved from UC Davis to Heidelberg University.


Mining Problem-solving Behavior from Open-Source Repositories (UC Davis)

Open source systems such as Linux and Mozilla are both innovative and popular. They are built by informal groups of volunteers working in a distributed, asynchronous manner. Communication and coordination are mediated by emails and shared repositories (containing manuals, design documents, source code, and bug reports). These repositories constitute an extensive online record of user feedback, artifact evolution, and task-related problem-solving behavior. This data is publicly available and is amenable to modern data mining techniques as well as automated program analysis algorithms. Our goal is to integrate these automated analyses with social science methods to study the intrinsic relationship between community behavior and software engineering outcomes in large-scale open source projects, with a view to improving software engineering practice. We have assembled an interdisciplinary research team consisting of a software engineer, a social scientist, and a database researcher. Software systems have an enormous impact on the economy and on society; an improved understanding of the social processes underlying software development could well lead to faster development of cheaper, better software systems. Our interdisciplinary collaboration builds on existing work by explicitly connecting software engineering imperatives to the techniques of social science; thus we hope to evaluate the relevance of folklore principles such as "Conway's Law," which claims a relationship between artifact structure and community structure. Finally, this collaboration will help us formulate new, much-needed interdisciplinary pedagogy to train both undergraduate and graduate software engineers in the social aspects of software development: team building, work allocation, coordination, project management, and process improvement.


Security Analysis and Re-engineering of Databases (UC Davis)

Many of today's mission-critical databases have not been designed with a particular focus on security aspects such as integrity, confidentiality, and availability. Even if security mechanisms were used during the initial design, these mechanisms are often outdated due to new requirements and applications and do not reflect current security policies, thus leaving ways for insider misuse and intrusion. This NSF-funded research was concerned with analyzing various security aspects of mission-critical (relational) databases that are embedded in complex information system infrastructures. We proposed four complementary avenues of research: (1) models and techniques to profile the behavior of mission-critical data stored in databases, (2) algorithms to correlate (anomalous) data behavior with application/user behavior, (3) techniques to determine and model user profiles and roles from behavioral descriptions, and (4) the integration of these techniques, algorithms, and mechanisms into a security re-engineering workbench for (relational) databases. Two major themes formed the core of the proposed approaches. First, the analysis of database vulnerabilities and violations of security paradigms is data-driven, i.e., the behavior of the data is analyzed and modeled before it is correlated with users and applications. Second, we introduced the concept of an access path model to uniformly model and correlate data flow and access behavior among relations, users, and applications. This model allows security personnel to fully inspect database security aspects in complex settings in a focused, aspect (policy) driven fashion.
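
As a rough illustration of the data-driven profiling idea, the following Python sketch builds simple per-user access profiles over relations from an audit log and flags accesses that deviate strongly from a user's history. The log format, the frequency threshold, and the helper names are assumptions made for this example; they are not part of the project's workbench.

# Illustrative sketch only: per-user access profiles over relations, built
# from an audit log of (user, relation) events, plus a simple anomaly check.
from collections import Counter, defaultdict


def build_profiles(audit_log):
    # audit_log: iterable of (user, relation) access events.
    profiles = defaultdict(Counter)
    for user, relation in audit_log:
        profiles[user][relation] += 1
    return profiles


def anomalous_accesses(profiles, new_events, min_history=50, rare_ratio=0.01):
    # Flag accesses to relations a user has rarely or never touched before,
    # provided enough history exists to make the judgment meaningful.
    flagged = []
    for user, relation in new_events:
        history = profiles.get(user, Counter())
        total = sum(history.values())
        if total >= min_history and history[relation] / total < rare_ratio:
            flagged.append((user, relation))
    return flagged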


A Metadata-Driven Visualization Interface Technology for Scientific Data Exploration (UC Davis)

This project lays out our plan for research and development aimed at dramatically improving the process and outcome of scientific data analysis and visualization. The improvement will be achieved by coupling an expressive and extensible metadata management framework with novel visualization interfaces that facilitate effective reuse, sharing, and cross-exploration of visualization information, and it will thus make a profound impact on a broad range of scientific applications. The process of scientific visualization is inherently iterative. A good visualization comes from experimenting with visualization and rendering parameters to bring out the most relevant information in the data. This raises a question: considering the computer and human time we routinely invest in exploratory and production visualization, are there methodologies and mechanisms to enhance not only the productivity of scientists but also their understanding of the visualization process and the data used?

Recent advances in the field of data visualization have been made mainly in rendering and display technologies (such as real-time volume rendering and immersive environments), but little progress has been made in coherently managing, representing, and sharing information about the visualization process and its results (images and insights). Naturally, the various kinds of information about data exploration should be shared and reused to leverage the knowledge and experience scientists gain from visualizing scientific data. A visual representation of the data exploration process, along with expressive models for recording and querying task-specific information, helps scientists keep track of their visualization experience and findings, use it to generate new visualizations, and share it with others.

While previous research has addressed some related issues, a more comprehensive study remains to be done. Thus, we propose two complementary avenues of research: (1) new user interfaces for data visualization tasks, and (2) expressive metadata models supporting the recording and querying of information related to data exploration tasks. In addition, a set of user studies will be conducted on a web-based visualization testbed realizing (1) and (2) in order to refine the proposed methodologies and designs. Traditional user interfaces cannot support the increasingly complex process of scientific data exploration; a fundamental change in conventional designs and functionality must be made to offer more intuitive interaction, guidance, and enhanced perception. We will begin our study by enriching the graph-based and spreadsheet-like interfaces we have developed and also investigate alternative designs. An expressive and extensible metadata model representing the data exploration process and its embedded data visualization process is needed. Such a model, along with an appropriate user interface, makes it possible to manage diverse information about the input and results of the visualization process, analyze parameter coverage and usage, identify unexplored visualization spaces, and incorporate findings on the process and results in the form of visualization metadata. The model is independent of the actual visual interface used and is open in that its realization in the form of a metadata repository can be loosely coupled with a variety of different visualization tools. A set of interfaces and protocols to the repository will be designed to manage, query, and analyze visualization metadata gathered from and utilized by different visualization tools.
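
The following Python sketch gives a feel for what such visualization metadata records and a parameter-coverage query over them might look like. The VisRecord fields and the find_unexplored helper are hypothetical illustrations, not the project's actual metadata model.

# Hypothetical metadata records for visualization runs and a simple coverage
# query over them (not the project's actual metadata model).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VisRecord:
    dataset: str                   # identifier of the visualized dataset
    tool: str                      # visualization tool that produced the run
    parameters: Dict[str, float]   # visualization/rendering parameters used
    image_uri: str                 # reference to the resulting image
    notes: str = ""                # findings recorded by the scientist
    tags: List[str] = field(default_factory=list)


def find_unexplored(records: List[VisRecord], dataset: str, param: str,
                    lo: float, hi: float, bins: int = 10):
    # Return sub-ranges of a parameter's value range [lo, hi) that no
    # recorded visualization run of the given dataset has covered yet.
    width = (hi - lo) / bins
    covered = {int((r.parameters[param] - lo) // width)
               for r in records
               if r.dataset == dataset and param in r.parameters
               and lo <= r.parameters[param] < hi}
    return [(lo + i * width, lo + (i + 1) * width)
            for i in range(bins) if i not in covered]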


Scalable and Secure Information Republication (UC Davis)

Our society increasingly relies on prompt, accurate delivery of information over the Internet. Users need assurance that the information they get in this way is authentic, and they need to obtain this assurance cheaply and reliably.

The research centers on a new approach for engendering this confidence. The starting point is to separate the roles of the "owner" of a database and its "publisher" (or publishers). With this approach the user need not trust the publisher. Instead, the owner of the database provides the user with a small amount of "summary information". After that, the publisher not only answers the user's questions, but also provides, along with each answer, a short "digital certificate" of accuracy. Using the summary information, the certificate lets the user check that the information received is correct and complete. Developing and evaluating good schemes to construct these certificates is the key technical challenge.

The publisher need not maintain a trusted system, lowering its cost of doing business. The publisher can also more easily provide information from multiple owners. Overall, the approach should make it cheaper to obtain reliable data over the Internet, and will expand the settings in which the data is used.
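
The description above does not fix a particular certificate scheme, but one common way to realize such certificates is a Merkle hash tree: the owner publishes only the root hash (the summary information), and the publisher returns each answer together with the sibling hashes on the path to the root (the certificate). The Python sketch below illustrates this general idea under that assumption; it is not necessarily the scheme developed in the project.

# Merkle-hash-tree sketch (an assumed realization, not necessarily the
# project's scheme): the owner publishes only the root hash, the publisher
# returns each answer with the sibling hashes on its path to the root, and
# the user verifies the answer against the root.
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    # Root hash over a list of byte-string leaves (the owner's summary).
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])            # duplicate last node if odd
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def merkle_proof(leaves, index):
    # Certificate for leaf `index`: sibling hashes from the leaf to the root.
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[index ^ 1], index % 2))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof


def verify(root, leaf, proof):
    # Check the publisher's answer (leaf) against the owner's root hash.
    node = h(leaf)
    for sibling, node_is_right in proof:
        node = h(sibling + node) if node_is_right else h(node + sibling)
    return node == root

Because the root hash is short and comes directly from the owner, the user only has to trust that single value; any tampering by the publisher changes a hash along the verification path and the check fails.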


Outlier Regions in Sensor Networks (UC Davis)

Sensor networks play an important role in applications concerned with environmental monitoring, disaster management, and policy making. Effective and flexible techniques are needed to explore unusual environmental phenomena in sensor readings that are continuously streamed to applications. In this work, we developed models and techniques to detect outlier sensors and to efficiently construct outlier regions from the respective outlier sensors. For this, we utilized the concept of degree-based outliers. Compared to traditional binary outlier models (outlier versus non-outlier), this concept allows for a more fine-grained, context-sensitive analysis of anomalous sensor readings and, in particular, for the construction of heterogeneous outlier regions. The latter suitably reflect the heterogeneity among the outlier sensors and sensor readings that determine the spatial extent of outlier regions. Such regions furthermore support useful data exploration tasks. We demonstrated the effectiveness and utility of our approach using real-world and synthetic sensor data streams.
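
As an illustration of the degree-based outlier concept, the following Python sketch assigns each sensor an outlier degree based on how far its reading deviates from its spatial neighbors and then groups adjacent high-degree sensors into regions. The scoring function, neighborhood radius, and thresholds are simplified assumptions, not the exact models developed in this work.

# Simplified sketch of degree-based outlier detection and region construction
# for one snapshot of sensor readings (assumed scoring and grouping, not the
# exact models of this work).
import math


def outlier_degrees(sensors, radius, scale):
    # sensors: dict sensor_id -> (x, y, value). Assign each sensor an outlier
    # degree in [0, 1] based on its deviation from spatially nearby readings.
    degrees = {}
    for sid, (x, y, v) in sensors.items():
        neigh = [nv for nid, (nx, ny, nv) in sensors.items()
                 if nid != sid and math.hypot(nx - x, ny - y) <= radius]
        if not neigh:
            degrees[sid] = 0.0
            continue
        deviation = abs(v - sum(neigh) / len(neigh))
        degrees[sid] = min(1.0, deviation / scale)
    return degrees


def outlier_regions(sensors, degrees, radius, threshold=0.5):
    # Group spatially adjacent sensors whose degree exceeds the threshold
    # into regions (connected components under the neighborhood relation).
    outliers = [sid for sid, d in degrees.items() if d >= threshold]
    regions, seen = [], set()
    for sid in outliers:
        if sid in seen:
            continue
        region, stack = set(), [sid]
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            region.add(cur)
            cx, cy, _ = sensors[cur]
            stack += [o for o in outliers if o not in seen and
                      math.hypot(sensors[o][0] - cx, sensors[o][1] - cy) <= radius]
        regions.append(region)
    return regions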