Constraints on Information Selection in Terminographic Definitions: Towards Relevant Generic Relational Models
Summary of the PhD Dissertation
Definition writing is an essential activity in the development of terminological resources. It fulfills one of the major functions of terminological dictionaries and databases: conveying information about the concepts expressed by the terms of a given domain, and about their referents, in order to ease and enhance communication. Unlike term extraction or corpus building, definition writing is still mostly performed manually. Terminologists would greatly benefit from (semi-)automatic definition writing tools of the kind already available for these other terminographic activities. Such tools would not only accelerate the writing process but also enhance the consistency and, therefore, the overall quality of the definitions produced. The greater systematicity thus obtained would furthermore allow for broader use of terminological dictionaries and databases as lexical resources for natural language processing (NLP) systems and would facilitate domain-ontology development, a field which has growing ties with terminology.
The general objective of my work is thus to design and implement generic tools to assist in definition writing, whatever the terminographic context, the domain or the language. In my thesis, I explore more specifically the nature of dictionary definitions and of the activity of definition writing in terminology. Automating definition writing requires addressing a number of questions: What is a terminographic definition? What is it that we define? How do we define? And, especially, what are the questions raised by definition writing? A thorough examination of these questions leads to the main research topic of my thesis: the selection of defining information.
Typically, terminologists construct definitions using information found in texts written by domain experts. However, not all the pieces of information found in these texts can be considered defining and, of those that can, not all are relevant enough to be included in a definition. One of the most challenging tasks of definition writing is therefore the selection of defining information. The two main questions that definition writing raises, and that must be addressed in order to design and implement generic definition writing tools, are the following:
- What determines or influences information selection?
- What types of information are relevant in a definition?
Among the different factors that are acknowledged to constrain the selection of defining information, the one that is, a priori, the most independent of any domain and language is the level of reality. To answer these questions, I therefore hypothesize that domain-specific conceptualizations, which are “described” by terminological definitions, refer to different kinds of entities in the world and that the properties and relations of these entity types influence which information is relevant for defining the corresponding concepts. Information selection is thus partly a function of the type of entity defined. If this hypothesis holds, it becomes possible to propose defining models based on the properties and relations characterizing each type of entity.
To test this hypothesis, I propose to adopt the categories of an existing realist upper-level ontology, the Basic Formal Ontology (BFO), and their specifications. This ontology aims to represent the types of things that exist in the world, their properties and their relations to other types of entities. In BFO, entity types are organized according to philosophical distinctions and are consistent with the scientific knowledge of the world. I propose to adapt these categories to create relational models, and to use these models to describe the internal structure of existing definitions. The idea is that large-scale multi-domain and multilingual corpus analyses can be used to test the hypothesis and, if it is verified, to implement these models in a (semi-)automatic definition writing tool. A pilot experiment based on a corpus analysis of a sample of 240 terminological definitions extracted from 15 domains yielded encouraging results, with almost 75% of the relations expressed in the analyzed definitions belonging to the models associated with each entity type. This empirical study shows, moreover, which relations (characteristics related to each entity type) in these generic models are most relevant in terminological definitions. These results tend to confirm the tested hypothesis and pave the way for further consolidation work and, eventually, the implementation of the models in (semi-)automatic definition writing tools. The theoretical considerations underlying this methodological proposal also contribute to the foundations of an integrated theory of definitions in terminology.
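As a purely illustrative sketch of the kind of measurement reported for the pilot experiment, the following Python fragment computes the share of annotated relations that fall within the relational model of the entity type being defined. Everything in it is an assumption made for illustration: the relation inventories attached to each BFO category, the relation labels, and the three annotated definitions are invented and are not the models or the corpus data of the thesis.

```python
from collections import Counter

# Hypothetical relational models: for a few BFO categories, the set of
# relations assumed to be expected in a definition of that entity type.
# These inventories are illustrative, not the models of the thesis.
MODELS = {
    "material entity": {"is_a", "has_part", "part_of", "has_function", "located_in"},
    "process": {"is_a", "has_participant", "occurs_in", "has_part", "precedes"},
    "quality": {"is_a", "inheres_in"},
}

# Invented annotations: (entity type of the defined concept,
# relations expressed in its definition).
ANNOTATED = [
    ("material entity", ["is_a", "has_part", "has_function"]),
    ("process", ["is_a", "has_participant", "caused_by"]),
    ("quality", ["is_a", "inheres_in"]),
]


def coverage(annotated, models):
    """Return the share of relations covered by the models, plus a
    per-(entity type, relation) count of the covered relations."""
    total = 0
    in_model = 0
    per_relation = Counter()
    for entity_type, relations in annotated:
        allowed = models.get(entity_type, set())
        for rel in relations:
            total += 1
            if rel in allowed:
                in_model += 1
                per_relation[(entity_type, rel)] += 1
    share = in_model / total if total else 0.0
    return share, per_relation


share, per_relation = coverage(ANNOTATED, MODELS)
print(f"{share:.1%} of annotated relations fall within the models")
for (entity_type, rel), count in per_relation.most_common():
    print(entity_type, rel, count)
```

The per-relation counts correspond, in this toy setting, to asking which relations of a generic model are used most often in actual definitions; a real analysis would depend on the annotation scheme and relation inventory actually adopted.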