Information Integration, Databases and Ontologies
Contents
  1. Introduction
  2. Approach
  3. Abstract Schemas, Schema Morphisms, and the SCIA Tool
  4. Ontologies
  5. Some Related Topics
  6. Brief Bibliography
  7. Some Other Links
  8. A BOBJ Approach

1. Introduction

Rapidly advancing technologies are changing the nature of research in almost every field of science, and many other fields as well. For example, cheap sensors can produce unprecedented volumes of information about ecological sites, cheap mass storage allows unprecedented amounts of such data to be preserved, and computer power allows it to be analyzed in unprecedentedly complex ways. A major difficulty in realizing the promise of all this arises from the need to integrate data from multiple sources, often formatted in incompatible ways, and even worse, represented using incompatible assumptions, some of which may be highly implicit. For example, data from one source may be based on weekly samples and measured in meters, while data from another source is sampled every 10 days, and measured in feet; an implicit assumption for the first source might be that missing data points are filled in with interpolated values, whereas the second source just omits them; they may also have been sampled at different times (e.g., noon vs. midnight). An example of a significant implicit variable is the elapsed time between taking a sample and analyzing it in the lab; if these elapsed times are sufficiently different at different sites, and if some substances decay rapidly, then measurements will have to be recalibrated in order to be compared meaningfully. Furthermore, the data to be analyzed may be stored in a variety of databases having different data models, formats, and platforms. Similar problems can arise in many other fields, including textual analysis, computer integrated manufacturing, molecular biology, and data mining.
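
To make the units mismatch concrete, here is a minimal sketch in BOBJ notation (the language discussed in Section 8 below); the module and operation names are ours, chosen purely for illustration, and we assume a built-in FLOAT module with the usual arithmetic, as in OBJ3.

    *** An illustrative sketch, not production code: recalibrating
    *** depth samples recorded in feet so that they can be compared
    *** with samples recorded in meters. Assumes a built-in FLOAT
    *** module, as in OBJ3.
    obj RECALIBRATE is
      protecting FLOAT .
      op ft2m : Float -> Float .   *** convert feet to meters
      var F : Float .
      eq ft2m(F) = F * 0.3048 .
    endo

    red ft2m(12.0) .   *** should reduce to 3.6576

Of course, real recalibration (e.g., correcting for decay between sampling and lab analysis) would involve more elaborate semantic functions, but the same equational style applies.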

The XML language for semi-structured data is rapidly gaining acceptance, and has been proposed as a solution for data integration problems, because it allows flexible coding and display of data, by using metadata to describe its structure (e.g., a DTD or an XML Schema). Although this is important, it is less useful than often thought, because it can only define the syntax of a class of documents. Moreover, even if an adequate semantics were available for each document class, this would still not support the integration of data that is represented in different ways, because it gives no way to translate among the different datasets. In addition to dealing with datasets that appear in computer readable documents and databases, users may also want to compare the results of simulation packages with empirical datasets. These may involve still other formats and implicit assumptions. Very complex workflows can easily arise in contemporary scientific research and in industrial and commercial practice.

The research described in this webpage is intended to address such problems of data integration, through the construction of a general tool called SCIA, the use of ontologies, and the development of supporting theory, as described below. We are also interested in the critical exploration of the limitations of such tools and methods.


2. Approach

A promising approach is to develop tools that go beyond syntax by using semantic metadata. But despite some optimistic projections to the contrary, the representation of meaning, in anything like the sense that humans use that term, is far beyond current information technology. As explored in detail in fields such as Computer Supported Cooperative Work (CSCW), understanding the meaning of a document often requires a deep understanding of its social context, including how it was produced, how it is used, its role in organizational politics, its relation to other documents, its relation to other organizations, and much more, depending on the particular situation. Moreover, all these contexts may be changing at a rapid rate, as may the documents themselves, and the context of the data is also often both indeterminate and evolving. Another complication is that the same document may be used in multiple ways, some of which can be very different from others.

These complexities mean that it is unrealistic to expect any single semantics to adequately reflect the meaning of the documents of some class for every purpose. Most attempts to deal with these problems in existing literature and practice are either ad hoc ("We just wrote some Perl scripts") or else are what we may call "high maintenance" solutions, involving complex infrastructure, such as commercial relational databases, high volume data storage centers, and ontologies written in specialized languages to describe semantics. Solutions of the first kind are typically undocumented and cannot be reused, whereas solutions of the second kind require considerable effort from highly skilled computer professionals, which can be frustrating for application experts, due to the difficulty of discovering, communicating, formalizing and especially updating, all the necessary contextual information. For this reason, many application scientists prefer to avoid high maintenance solutions, and do the data integration themselves in an ad hoc manner (often using graduate students or other assistants).

One approach is to provide tools to make data integration using semantic metadata much easier for application scientists to do themselves. Section 3 below describes SCIA, a GUI tool that metadata integration engineers can use to generate mappings between a virtual master database and local databases, from which end-user queries can be answered. A second, more flexible, approach is described in Section 8, using an ultra high level programming language based on equational logic.

It is important to know what role data integration actually plays in particular research settings, in order to design tools that will be useful in actual practice. In particular, data integration problems can have significant social dimensions. For example, ethnographic studies indicate that, as noted above, scientists often prefer simple tools that closely match current needs, rather than high maintenance general purpose tools of the kind computer scientists might prefer.

Although ontologies are promising for certain applications, many difficult problems remain, in part due to the essentially syntactic nature of ontology languages (e.g. OWL), the computationally intractable nature of highly expressive ontology languages (such as KIF), and the difficulty of interoperability among the many existing ontology languages, as well as among the ontologies written in those languages. Difficulties of another kind stem from the unrealistic expectations engendered by the many exaggerated claims made in the literature.

The goal of the research described in the papers Data, Schema and Ontology Integration and Information Integration in Institutions (see the bibliography below) is to provide a rigorous foundation for information integration that is not tied to any specific representational or logical formalism, using category theory to achieve independence from any particular choice of representation, and institutions to achieve independence from any particular choice of logic. The information flow and channel theories of Barwise and Seligman are generalized to any logic by using institutions; following the lead of Kent, this is combined with the formal conceptual analysis of Ganter and Wille, and the lattice of theories approach of Sowa. We also draw on the early categorical general systems theory of Goguen as a further generalization of information flow, and on Peirce to support triadic satisfaction.


3. Abstract Schemas, Schema Morphisms, and the SCIA Tool

It is unreasonable to expect fully automatic tools for information integration; in particular, it is difficult to find correct schema matches, especially where there are n-to-m matches, semantic functions, conditions, and/or diverse data models; it may not even be clear what correctness means in such situations. To address this, we have developed a tool called SCIA for XML Schema and DTD matching, which finds the "critical points" where user input is maximally useful, does as much as is reasonable automatically, identifies new critical points, and iterates these steps until convergence is achieved. Critical points are determined using path contexts and a combination of matching algorithms; a convenient GUI provides hints and accepts user input at critical points; and view generation supports data transformation and queries. Tests with various datasets show that critical points and path contexts can significantly reduce total user effort.

Semantic models are needed for mappings of XML DTDs and XML Schemas, relational and object oriented schemas, and even spreadsheets and structured files, all with integrity constraints. We have developed a theory of abstract schemas and abstract schema morphisms, which provides a semantics for n-to-m matches with semantic functions and/or conditions over diverse data models. The theory provides semantic foundations for our schema mapping tool.
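
As a concrete instance of the kind of match that arises at a critical point, consider a 2-to-1 match with a semantic function: a target field duration is computed from two source fields giving start and end day numbers. A minimal sketch in BOBJ notation, with hypothetical field names:

    *** An illustrative sketch (hypothetical schemas): a 2-to-1 schema
    *** match whose semantic function computes one target field from
    *** two source fields.
    obj DURATION-MATCH is
      protecting INT .
      op duration : Int Int -> Int .   *** semantic function for the match
      vars S E : Int .
      eq duration(S, E) = E - S .
    endo

    red duration(100, 110) .   *** should reduce to 10

Abstract schema morphisms give such matches a precise semantics, independently of whether the underlying data model is XML, relational, or a spreadsheet.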


4. Ontologies

Ontologies, in the sense of formal semantic theories for datasets (not the sense of academic philosophy), are increasingly being proposed, and even used, to support the integration of information that is stored in heterogeneous formats, especially in connection with the world wide web, but also for other, less chaotic, forms of distributed database. In particular, ontologies have been proposed as a key to the success of the so-called "semantic web."

Formally speaking, an ontology is a theory over a logic. Although this may sound straightforward, ontologies unfortunately are proliferating almost as quickly as the datasets that they are meant to describe. Therefore, integrating datasets whose semantics are given by different ontologies will require that their ontologies be integrated first. This task is greatly complicated by the fact that many different languages are in use for expressing ontologies, including OWL, Ontolingua, Flora, KIF, and RDF, each of which has its own logic. Therefore, to integrate ontologies, it may be necessary first to integrate the logics in which they are expressed. Moreover, dataset integration will also have to take account of the fact that the schemas describing structure are often expressed in different languages, reflecting different underlying data models, e.g., relational, object oriented, spreadsheet, and formatted file.

This tangle of questions can be approached using the theory of institutions, which provides an axiomatization of the notion of logical system, based on Tarski's idea that the notion of satisfaction is central. One can then define theories over an institution, and use theory morphisms to translate among ontologies over a given logic. The further notion of institution morphism is needed for translating between different logical systems (see the paper Institution Morphisms, by Joseph Goguen and Grigore Rosu), and morphisms of theories over different institutions are accommodated by Diaconescu's Grothendieck institution construction, as discussed in the papers Data, Schema and Ontology Integration and Information Integration in Institutions. Some limitations of ontologies are discussed in Ontology, Ontotheology, and Society. We intend to extend SCIA to handle ontology integration, and to take advantage of ontologies in dataset integration.
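
For readers new to institutions, the basic definition can be stated compactly (here in LaTeX notation, following Goguen and Burstall; this is standard material, not specific to the papers above):

    An institution consists of a category $\mathbf{Sign}$ of signatures,
    a sentence functor $Sen : \mathbf{Sign} \to \mathbf{Set}$, a model
    functor $Mod : \mathbf{Sign}^{op} \to \mathbf{Cat}$, and for each
    signature $\Sigma$ a satisfaction relation
    ${\models_\Sigma} \subseteq |Mod(\Sigma)| \times Sen(\Sigma)$,
    such that for every signature morphism $\varphi : \Sigma \to \Sigma'$,
    \[ M' \models_{\Sigma'} Sen(\varphi)(e)
       \iff Mod(\varphi)(M') \models_{\Sigma} e \]
    for all $M' \in |Mod(\Sigma')|$ and $e \in Sen(\Sigma)$. An ontology
    is then a theory $(\Sigma, E)$ with $E \subseteq Sen(\Sigma)$, and a
    theory morphism $\varphi : (\Sigma, E) \to (\Sigma', E')$ is a
    signature morphism such that every $\varphi$-translate of a sentence
    in $E$ is satisfied by every model of $E'$.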


5. Some Related Topics

Another research project in our group, called algebraic semiotics, is also useful here; it has the goal of developing a scientific understanding of basic issues of usability, representation and coordination that arise in interface design and related areas, especially the visualization of scientific data, and the organization of complex information using multimedia resources; there is also some focus on distributed cooperative work and on semiotics. For details, see the Short Overview of Algebraic Semiotics and the Brief Annotated Bibliography given there, as well as the User Interface Design homepage, and the UCSD course CSE 271.

Data integration is a good topic for combining our interests in algebraic semantics, user interface design with algebraic semiotics, and the sociology of science and technology (see the Sociology of Technology page, and the UCSD course CSE 275 for further information on our approach to these areas, which can help reveal what users of data integration services really need).


6. Brief Bibliography
  1. [Now Being Revised] Ontotheology, Ontology, and Society, to appear in a special issue of Int. J. Human Computer Studies, edited by Christopher Brewster and Kieron O'Hara; a postscript version is also available. This paper marshals ideas from philosophy, cognitive science, and sociology, in an attempt to discern some limitations of ontologies in the computer science technical sense. It is an expanded and revised version of Ontology, Society, and Ontotheology, in Formal Ontology in Information Systems, edited by Achille Varzi and Laure Vieu, IOS Press, pages 95-103, 2004, which is the proceedings of the Conference on Formal Ontology in Information Systems (FOIS'04); a postscript version is also available, as is the abstract; see also the Workshop on Potential of Cognitive Semantics for Ontologies.
     
  2. [New] A short essay, Support for Ontological Diversity and Evolution, written for the SEEK (Science Environment for Ecological Knowledge) project meeting on 27 October 2005. Argues that multiple ontologies for single domains are inevitable, and suggests technology and theory to support work in such an environment, including ways to detect and negotiate differences.
     
  3. [New] Data, Schema, Ontology, and Logic Integration, to appear in a book edited by Walter Carnielli, Miguel Dionisio, and Paulo Mateus; a postscript version is also available. An extended abstract appears in Proceedings, CombLog'04 Workshop, edited by Walter Carnielli, Miguel Dionisio, and Paulo Mateus, pages 21-31; held 28-30 July 2004, in Lisbon, Portugal; keynote address. Motivation and theory for a "data integration chain," from data to schema to ontology to ontology language to ontology logic integration; the main new ideas are abstract schema, abstract schema species, and abstract schema morphism.
     
  4. [Newly Revised] Information Integration in Institutions, paper for Jon Barwise memorial volume edited by Larry Moss. This paper unifies and/or generalizes several approaches to information, including the information flow theory of Barwise and Seligman, the formal conceptual analysis of Wille, the lattice of theories approach of Sowa, the categorical general systems theory of Goguen, and the cognitive semantic theories of Fauconnier, Turner, Gardenfors, and others. Its rigorous approach uses category theory to achieve independence from any particular choice of representation, and institutions to achieve independence from any particular choice of logic. Corelations, cocones, and colimits over arbitrary diagrams provide a very general formalization of information integration, and Grothendieck constructions extend this to several kinds of heterogeneity. Examples from databases, ontologies, cognitive semantics and other areas are treated. An unusual way to institutionalize databases is given in an appendix, inspired by C.S. Peirce's triadic semiotics. A postscript version is also available.
     
  5. [Newly Revised] What is a Concept?, by Joseph Goguen, in Proceedings of the 13th International Conference on Conceptual Structures (ICCS '05), edited by Frithjof Dau and Marie-Laure Mugnier, Springer Lecture Notes in Artificial Intelligence, volume 3596, pages 52-77, 2005; conference held 18-22 July 2005, in Kassel, Germany. Slides for the lecture are also available in pdf and in postscript. This paper surveys a number of approaches to concepts, focusing on cognitive, social, and formal approaches, and in particular unifies the symbolic mental spaces of Fauconnier and the geometric conceptual spaces of Gardenfors; ideas of Peirce and Latour help to unify the diversity of approaches.
     
  6. Three Perspectives on Information Integration. For a book of contributions to Seminar 04391, Semantic Interoperability and Integration, held from 20 to 24 September 2004, at Schloss Dagstuhl, Germany. There is also a postscript version.
     
  7. Critical Points for Interactive Schema Matching, with Guilian Wang, Young-Kwang Nam, and Kai Lin. Technical Report CS2004-0779, UCSD Department of Computer Science, 31 January 2004; this is the long version of a paper of the same name in Advanced Web Technologies and Applications, edited by Jeffrey Xu Yu, Xuemin Lin, Hongjun Lu and YanChun Zhang, Springer Lecture Notes in Computer Science, volume 3007, 2004, pages 654-664. Proceedings of Sixth Asia Pacific Web Conference, Hangzhou, China, 14-17 April 2004. The shorter published version is also available, as is a pdf version of the full report.
     
  8. A Metadata Tool for Retrieval from Heterogeneous Distributed XML Documents, by Young-Kwang Nam, Joseph Goguen, and Guilian Wang. In Proceedings, International Conference on Computational Science, edited by P.M.A. Sloot and others, Springer, Lecture Notes in Computer Science, volume 2660, pages 1020-1029, 2003. Melbourne, Australia, 2-4 June 2003. Describes a GUI tool for constructing correspondences between XML documents that support information retrieval from distributed collections.
     
  9. A Metadata Integration Assistant Generator for Heterogeneous Distributed Databases, with Young-Kwang Nam and Guilian Wang, in Proceedings, 16th Conference on Ontologies, DataBases, and Applications of Semantics for Large Scale Information Systems, Springer, Lecture Notes in Computer Science, volume 2519, 2002, pages 1332-1344; from a conference held in Irvine CA, 29-31 October 2002. An MS Word version is also available, with color figures. This describes an early version of the SCIA system.

7. Some Other Links
  1. The Database Research Group.
     
  2. A table of Links to Database Conferences and Journals, maintained by Jenny Wang.
     
  3. The IEEE standard upper ontology website; work of Robert Kent and others.
     
  4. Website of Erhard Rahm, University of Leipzig; database research, especially data adaptive workflow, metadata management, web usage mining.
     
  5. Website of Phil Bernstein, Microsoft Research; work on databases, especially data integration using metadata, called "model management."
     
  6. Website of Maurizio Lenzerini, Universita di Roma la Sapienza; database research, especially semantic web agents, data integration with metadata, and internet services.
     
  7. Website of Suad Alagic, University of Southern Maine, research on XML semantics, object oriented system semantics.
     
  8. Is the Semantic Web Hype?, slides by Mark Butler on RDF, XML, etc.

8. A BOBJ Approach

A different approach is to use the extremely high level BOBJ language (supplemented with standard programs for functions like reading data and doing statistical calculations). BOBJ is an algebraic programming and specification language, a recent member of the OBJ family, with a modern implementation in Java. It supports very high level abstractions, and very powerful parameterized modularization. Its user-definable syntax and semantics allow users to define both the syntax and semantics of complex data structures quickly, and its pattern matching makes it easy to define translations among data structures. In addition, its high level of abstraction makes it easier to write code, and its modularization makes it easier to reuse code; these points are especially important because our research shows that programs of this kind typically evolve through many iterations. An additional advantage of this approach is that users can also define their own application-oriented query languages, instead of being forced to use SQL (though SQL will still be needed to access some legacy databases, and it is used for illustration in the sample program given below). Unsurprisingly, BOBJ executes more slowly than conventional languages, since it is interpreted rather than compiled, but it should easily be adequate for the proposed applications.
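
As a small taste of the pattern matching style, the following module translates records from an author-first format to a title-first format. This is an illustrative sketch only, not the data1.obj program discussed below; all sorts, operations, and sample constants are hypothetical.

    *** An illustrative sketch, not the data1.obj code: equations used
    *** as rewrite rules to translate one record format into another.
    *** All names here are hypothetical.
    obj BIB-XLATE is
      sorts Author Title Src Tgt .
      ops smith jones : -> Author .       *** sample data
      ops ecology physics : -> Title .    *** sample data
      op src : Author Title -> Src .   *** source format: author first
      op tgt : Title Author -> Tgt .   *** target format: title first
      op translate : Src -> Tgt .
      var A : Author .
      var T : Title .
      eq translate(src(A, T)) = tgt(T, A) .
    endo

    red translate(src(smith, ecology)) .   *** reduces to tgt(ecology, smith)

Because the translation is given by equations, it is itself a small algebraic theory, which can be reused, composed with other translations, and reasoned about formally.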

To test this, we are writing programs to see if particular problems can be solved with relatively little effort, using concise, modular, reusable code. In the future, we hope to test the feasibility of our approach with larger case studies on real problems in ecology, using datasets developed by the Long Term Ecological Research (LTER) project, and stored in facilities administered by the San Diego Supercomputer Center (SDSC). We hope to support light-weight, user-produced data integration programs.

BOBJ is a logical programming language, in the sense that it (unlike Prolog) is rigorously based on a logic, in this case, three variants of order sorted equational logic, called loose, initial, and hidden. This means that any program written in BOBJ is a precise mathematical description of what it does, thus facilitating semantic integration and verification. For a general introduction to BOBJ, see the BOBJ entry in the OBJ family homepage. For more detail and examples, see the BOBJ language homepage.

Some initial experiments have been with bibliographies, since there are many of these on the web using XML. The sample BOBJ code (written by Kai Lin) is in the file data1.obj. Despite its small size (especially if we exclude the LIST module, which merely defines a standard data structure), this code does a lot of work: it accepts and parses a user query (in a subset of SQL), translates it into two database queries (also in SQL), integrates the answers, and then returns the result to the user. The last two blocks, beginning with "red query", are not code at all, but rather test cases, illustrating how to use the code. The output from running these two test cases is given in the file named out.

Unfortunately, it would take considerable space to explain this code in detail to readers who do not already have some background in OBJ, while on the other hand, readers who do have such a background (and who also know a little SQL) would find such an explanation largely redundant. Some tutorial material on algebraic semantics and OBJ may be found in the author's version of the UCSD graduate course CSE 230, on programming languages. A more recent, somewhat improved, version of this code may be found at data2.obj.


The URL of this document is
     www.cs.ucsd.edu/users/goguen/projs/data.html
and the html source includes links to webpages that give more detail on many topics.
To my research projects index page
Maintained by Joseph Goguen
Last modified: Tue Jan 3 20:21:15 PST 2006