Language, Vision, and Music

Full Title: Language, Vision, and Music: Selected Papers from the 8th International Workshop on the Cognitive Science of Natural Language Processing, Galway, Ireland 1999
Author / Editor: Paul Mc Kevitt, Seán Ó Nualláin and Conn Mulvihill
Publisher: John Benjamins, 2002


Review © Metapsychology Vol. 8, No. 43
Reviewer: Daniel Mauro

Language, Vision, and Music is
a collection of 30 articles selected from the Eighth International Workshop on
the Cognitive Science of Natural Language Processing (CSNLP-8). Is rhythm an
important property in language, music and vision? How does one model creativity
in computers? What is the nature of synaesthesia? Can we improve human-computer
interface design by integrating the linguistic and visual modalities? These are
just a few of the kinds of questions to which one may find interesting answers
in this wide-ranging collection of papers. Language, vision and music appear to
share a number of fundamental attributes, among them, hierarchical
organization, recursivity, ambiguity and systematicity. The existence of such
shared properties indicates that language, music and vision – though seemingly
distinct as modes of understanding and interacting with the world – could be
instantiations of a general purpose cognitive system. In the context of this
multimodal theme, researchers from a variety of disciplines were invited to
contribute papers that examined interrelationships between language, vision and
music and explored the integration of these modalities in intelligent
multimedia computer systems.

The
resulting collection of papers is divided into three topic sections: Part I:
Language & vision; Part II: Language & music; Part III: Creativity. The
articles span a variety of themes and approaches in both human (e.g.,
synaesthesia, semantic priming) and artificial systems (e.g., virtual
perception, intellimedia). To give prospective readers a sense of what is
presented in this broad array of highly diverse and specialized topics, I have
included brief summaries of the general content of each of the three main
sections, while providing individual commentaries on nine articles that
warranted special attention because of their originality, interest or relevance
to the field. Readers who would prefer an abbreviated version of this review
may forgo the article summaries and skip to the last few paragraphs where I provide
a general overview of the collection.

Part I is edited by Paul Mc Kevitt
and brings together thirteen articles that integrate language and vision. Nine
of these papers investigate or are relevant to MultiModal Systems (also known
as Intelligent MultiMedia) while the remaining four articles address topics in
human cognition. The first paper describes how General Systems Theory can be applied to
the problem of multimedia integration. The next three articles examine various
multimodal language understanding applications. Five more papers investigate
computer applications, architectures and models of multimodal integration in
human-computer interface design, focusing mainly on speech and gesture. One
article looks at the relation of speech and vision in aphasic individuals. The
final three contributions explore the phenomenon of synaesthesia. I review four
articles from this section.

John Connolly examines the problem of multimedia integration from the perspective of General Systems Theory (GST). Multimedia integration is an attempt to combine different communication media (e.g., speech, vision) into a coherent whole, while GST is aimed at discovering general properties of different systems, of which multimedia systems are an example. In GST, context and holism turn out to be of paramount importance when designing complex systems: one needs to understand the nature of the subsystems, how the subsystems interact and where to form natural boundaries between those subsystems. In engineering, General Systems Theory can be applied in one of two ways: either as a solid engineering tool, or as a set of ideas that can be used to uncover new aspects of a particular subject. This paper is an example of the second approach. Connolly introduces a railway map as an example of a system whose constituents include both linguistic components (the names of stations) and non-linguistic ones (the network of railway lines). Seen as a simple system that is made up of mutually interdependent subsystems, the map helps to illustrate basic GST principles such as ‘the whole is more than the sum of the parts’ or ‘each component of the system has an effect upon the whole.’ Throughout the paper, Connolly uses the map example to demonstrate the kinds of design issues that can occur in a multimedia context and how the GST approach would deal with those issues. He provides persuasive arguments that a systems methodology, precisely because it emphasizes understanding how the components of a holistic system interrelate, is particularly relevant to the kinds of issues associated with multimedia integration. As an introduction to both General Systems Theory and the challenges of multimedia integration, the article provides a good backdrop for the other multimedia papers that follow.
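
To make the map example concrete, here is a minimal sketch of my own (not from Connolly’s paper) in which the two mutually interdependent subsystems are held apart in code: the route exists only in the non-linguistic network, but it can be communicated only through the linguistic labels.

```python
from collections import deque

# A toy railway map (my example, not Connolly's) with two subsystems:
# a non-linguistic one (the network of lines) and a linguistic one
# (the station names). Neither alone yields a usable route description.
lines = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}          # topology
names = {"A": "Galway", "B": "Athlone", "C": "Dublin"}     # labels

def describe_route(start, end):
    """Breadth-first search over the network, rendered via the labels."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return " -> ".join(names[s] for s in path)
        for nxt in lines[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(describe_route("A", "C"))  # Galway -> Athlone -> Dublin
```

Deleting either dictionary leaves the other intact yet destroys the map’s communicative function, which is the GST point in miniature.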

John Gurney, Elizabeth Klipple and Robert Winkler describe "A simulated language understanding agent using virtual perception." Their software agent uses virtual perception to perform a number of ‘spoken language navigation tasks’ in a virtual reality (VR) environment that simulates helicopter movement through realistic-looking terrain. The software agent acts as an interface between the virtual world and the human by interpreting spoken commands and implementing them as corresponding flight patterns that can be visually monitored by the user. Getting a computer to perform realistic navigation tasks (like tracking and following moving objects) using only speech-based commands is exceedingly difficult with current human-computer interfaces (HCIs). Accordingly, the authors motivate their "agent-based approach to HCI" by comparing their agent to a conventional database model. With the traditional approach, following simple navigational commands in a dynamic environment is cumbersome because the individual subtasks often conflict with one another. The difficulty, say Gurney and colleagues, is that the database behaves as though it were an "omniscient wizard" who knows where everything is in the virtual world yet is limited to adjusting dials and buttons in response to various navigational directives (e.g., go south for 5 km) — in other words, it is too detached from the virtual world. The authors’ innovative solution to this problem is to allow the software agent to see that world from the human point of view. They purposefully cripple the agent so that it knows only what a person would know rather than having a perfect, yet unrealistic, knowledge of the virtual environment. The result is an interface whose representations match much better with the user’s representations. Once the tasks are organized in a way that is consistent with how humans perceive them, much of the interference between apparently conflicting actions disappears. In addition to demonstrating a software agent that overcomes the problems of a challenging multimodal domain, Gurney, Klipple and Winkler provide a compelling example of how the perceptual and linguistic nuances of human-human interaction can provide insight into the problems of designing intelligent human-computer interfaces.
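
The ‘crippled perception’ idea can be caricatured in a few lines. The sketch below is my own reconstruction under invented assumptions (the object positions, field of view and range are not from Gurney et al.); the point is simply that the agent may act only on what falls within its simulated line of sight, just as a human pilot would.

```python
import math

# Ground-truth world state; the "omniscient wizard" would see all of it.
WORLD = {"truck": (120.0, 40.0), "bridge": (-300.0, 95.0)}

def visible(agent_pos, agent_heading_deg, fov_deg=60.0, max_range=200.0):
    """Return only those objects the agent could plausibly perceive."""
    ax, ay = agent_pos
    seen = {}
    for name, (ox, oy) in WORLD.items():
        dist = math.hypot(ox - ax, oy - ay)
        bearing = math.degrees(math.atan2(oy - ay, ox - ax))
        # Smallest signed angle between the heading and the bearing.
        off = (bearing - agent_heading_deg + 180) % 360 - 180
        if dist <= max_range and abs(off) <= fov_deg / 2:
            seen[name] = (dist, off)
    return seen

# A command like "follow the truck" is resolvable only if the truck is
# currently in view; the distant bridge simply does not exist for the agent.
print(visible(agent_pos=(0.0, 0.0), agent_heading_deg=20.0))
```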

Named after Douglas Adams’s science fiction story, A. L. Cohen-Rose and S. B. Christiansen’s paper "The Hitchhiker’s Guide to the Galaxy" describes a simple system called the Guide, which answers natural language queries about places to eat and drink in the form of short stories. A person using the system presents a spoken or typed request to the system. Perhaps they want to know where they can get a good organic meal in town. The Guide parses the query and attempts to interpret the intention of the user. Storytelling agents then access relevant ‘smarticles’ from a knowledge base of previously written reviews about restaurant topics. The smarticles are rated and sorted based on previous usage of the system and then graphically presented to the user. Cohen-Rose and Christiansen’s contextual approach to intelligent multimedia was motivated by limitations in traditional web-searching facilities. Most of us are familiar with the typical information-searching experience: you type in some keywords and a list of sites appears; however, you’re never quite sure which one might contain the precise information you are searching for, and the information is often spread across multiple links. While the authors’ research is potentially useful from an engineering or commercial standpoint, it is unlikely to provide any new theoretical insights into the nature of cognition or language. Having said that, Cohen-Rose and Christiansen are not focused on developing new cognitive theories here — their primary aim is to improve on the performance and user-friendliness of conventional search engines. The Guide has some definite advantages over such systems: it is contextually driven and therefore more consistent with how humans process information, it can handle complex full-sentence queries, relieving the user of inefficient keyword searches, and it customizes its responses to the user. Although it would be difficult to furnish the Guide with a contextual ‘story-oriented’ knowledge base that could handle the breadth of topics available on the Internet, the Guide nevertheless has the potential to become a practical information tool within limited domains.
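
As I read the paper, the retrieval step amounts to scoring stored fragments against the parsed query and boosting by prior usage. The following sketch is hypothetical: the term ‘smarticle’ is the authors’, but the tag-overlap scoring and all data are my invention, meant only to make the mechanism concrete.

```python
# Hypothetical knowledge base of review fragments ("smarticles").
SMARTICLES = [
    {"text": "Organic cafe on Shop Street with good vegetarian lunches.",
     "tags": {"organic", "vegetarian", "lunch"}, "uses": 12},
    {"text": "Late-night pizza near the quays.",
     "tags": {"pizza", "late-night"}, "uses": 30},
]

def rank(query_tags):
    """Overlap with the parsed query dominates; prior usage breaks ties."""
    def score(s):
        return (len(s["tags"] & query_tags), s["uses"])
    return sorted(SMARTICLES, key=score, reverse=True)

# "Where can I get a good organic meal?" -> parsed to a set of content tags.
for s in rank({"organic", "meal", "lunch"}):
    print(s["text"])
```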

Sean Day examines the unusual perceptual phenomenon known as synaesthesia. In synaesthesia, sensory modalities are confounded so that a perception in one modality (e.g., sounds) is accompanied by a physically nonexistent sensation in another (e.g., colors). In some synaesthetes, for example, the sounds of musical instruments will cause them to see particular colors. For Day, who is himself a synaesthete, the sound of a piano will produce a sky-blue cloud about a yard in front of him while a tenor saxophone is accompanied by an "image of electric purple neon lights" (170). Thus, the synaesthetic sensation is generally added to the primary perception rather than replacing it. Synaesthesia can be broken down into two basic types: what Day refers to as "synaesthesia proper," in which stimuli to one sensory modality trigger sensations in other sensory modalities, and the more common "cognitive synaesthesia," whereby culturally determined categorizational systems are associated with arbitrary sensations (e.g., colored letters or numbers). After a review of some basic facts, Day traces the ideas of numerous ancient astronomers, mathematicians and philosophers, suggesting that their theories about correspondences between planetary bodies and musical intervals provided "initial cornerstones to later theories on synaesthesia" (173). Day goes on to examine a number of interesting synaesthetic phenomena, including the incidence of rare forms of synaesthesia, the role of synaesthesia in composers, the notion of synaesthetic universals and the nature of drug-induced synaesthesia. All in all, Day’s paper provides a good overview of synaesthesia while attempting to address some common misconceptions about the area. The article is particularly useful because it is written by someone who is himself a synaesthete; having first-hand knowledge of the phenomenon, he can be expected to have a better idea than most theorists of "What synaesthesia is and is not." Regardless of your theoretical predispositions, synaesthesia is one of those appealing topics in cognitive science, the study of which is likely to offer insights into the complex nuances of multimodal integration in humans.

Part II is edited by Seán Ó
Nualláin and brings together twelve articles on language and music. This
section covers a wide variety of themes including: a metaphoric approach to
language and music; a semiotic analysis of music and language; auditory
structuring as a basis for musical aptitude and reading abilities; a comparison
of priming effects in music and language; the role of conscious and
subconscious processes for interpreting language and music; musical fragments
(emons) as self-contained emotionally-based information units; the lexicon of
the conductor’s face; virtual operas that integrate music, text and image;
multimedia (language, vision, sounds) compositions that are ‘improvised’ within
a shared virtual environment; tonality in Irish music; the relationship between
rhythm and language comprehension in children; and a comparison of contours in
speech and European musical traditions. I review four papers from this section.

In "Auditory structuring in
explaining dyslexia,"KaiKarma introduces a simple auditory
procedure that can serve as both a musical aptitude test and a diagnostic tool
for predicting reading performance in dyslexic individuals. As defined by
Karma, ‘auditory structuring’ is an intermediate-level auditory capacity that
involves perceiving temporal relationships between tones. He argues that
auditory structuring is an ideal measure of musical aptitude because it
captures the important relations between auditory elements while being
relatively culture neutral. Karma’s structurally-based musical aptitude test
consists of (nonmusical) sequences of alternating high/low notes and long/short
notes. Having established the relevance of auditory structuring within the
musical domain, he goes on to suggest that auditory structuring can also serve
as a powerful construct for dyslexia. Karma attempts to validate this idea
experimentally by testing dyslexics on his custom-designed musical aptitude
test and a related auditory/visual matching task. In keeping with initial
hypotheses, he finds that dyslexics perform significantly worse than control
subjects on the structural auditory tests and that combining the results of
these tasks increases the ability to predict (based on performance measures)
those subjects who are dyslexic. Karma’s approach is generally consistent
with recent theories suggesting that certain perceptual language disorders are
due to auditory temporal processing deficits (see any of Tallal’s post-1980
work); however, he tends to emphasize the structuring qualities of auditory
processes rather than deficits in the rate of temporal processing. In addition
to highlighting the auditory temporal bases of dyslexia, Karma’s work provides
evidence for a putative structural link between the auditory underpinnings of
music and language and serves as a nice example of a research paradigm that
draws together theory, experiment and application.
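
To give a flavor of what such structurally based, culture-neutral stimuli might look like, here is an invented sketch; the pitches, durations and same-or-different task format are my assumptions, not Karma’s actual test materials.

```python
import random

HIGH, LOW = 880.0, 440.0   # Hz; illustrative values only
LONG, SHORT = 0.6, 0.2     # seconds; illustrative values only

def make_pattern(length=6, seed=None):
    """A nonmusical sequence defined purely by pitch and duration contrasts."""
    rng = random.Random(seed)
    return [(rng.choice([HIGH, LOW]), rng.choice([LONG, SHORT]))
            for _ in range(length)]

def violate(pattern, seed=None):
    """Flip one element's pitch to create a same-or-different foil."""
    rng = random.Random(seed)
    i = rng.randrange(len(pattern))
    pitch, dur = pattern[i]
    foil = list(pattern)
    foil[i] = (HIGH if pitch == LOW else LOW, dur)
    return foil

standard = make_pattern(seed=1)
print(standard != violate(standard, seed=2))  # listener must detect the change
```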

Barbara Tillmann and Emmanuel
Bigand compare priming phenomena in language and music. Music and language are
both examples of systems in which discrete elements are hierarchically
organized into complex patterns according to structuring principles. Implicit
knowledge of these structural patterns, acquired through exposure, allows
experienced listeners to develop expectancies that can influence the
processing of ongoing musical and linguistic events.
Existing semantic priming research shows that target words are identified more
easily when they follow a semantically related prime word. For example,
the target word ‘bread’ is processed more quickly when it follows the
semantically related word ‘butter’ than when it follows a semantically
unrelated word like ‘doctor.’ Similar priming paradigms have been established
in the musical realm in which single chord or multiple chord sequences are
found to facilitate the processing of a target chord. Tillmann and Bigand
directly compare these harmonic and semantic priming paradigms by organizing
results from studies in each domain according to the kind of context used
(e.g., local contexts involve a single word/chord, global contexts involve
sentences or chord sequences and scrambled global contexts involve
interchanging element order). Briefly, general findings reveal that whereas
local and global contexts tend to produce similar priming effects in music and
language, combined and scrambled global contexts show divergent priming patterns
across domains. The authors conclude by discussing two cognitive models that
could account for such differences. Tillmann and Bigand are important
contributors to the field of music cognition. Their review of the music and
semantic priming literature is well organized, concise and provides a
comprehensive picture of the subject area without focusing unnecessarily on the
details of any particular study. However, the question arises as to whether
priming effects can capture the subtler nuances of musical and linguistic
processing. Although expectancies have been purported to reflect syntactic
musical principles, it is unclear to what extent the harmonic priming paradigm
is able to adequately reflect the essential properties of musical structures,
given that those structures make substantially different contributions to
‘meaning’ compared with linguistic expressions. As noted by Lerdahl and
Jackendoff (1983), one must be cautious when imposing a linguistic approach
onto a musical domain – the priming paradigm may be a valuable tool for
comparing the structural principles of language and music, but it should be
appropriately adapted to the peculiarities of those domains.
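
The logic of the priming comparison is simple enough to state in code. Here is a toy illustration with invented reaction times; facilitation shows up as faster responses to targets that follow related primes.

```python
from statistics import mean

# Invented trial data -- not from any study reviewed by Tillmann and Bigand.
trials = [
    {"prime": "butter", "target": "bread",  "related": True,  "rt_ms": 512},
    {"prime": "doctor", "target": "bread",  "related": False, "rt_ms": 587},
    {"prime": "nurse",  "target": "doctor", "related": True,  "rt_ms": 498},
    {"prime": "bread",  "target": "doctor", "related": False, "rt_ms": 570},
]

related = mean(t["rt_ms"] for t in trials if t["related"])
unrelated = mean(t["rt_ms"] for t in trials if not t["related"])
print(f"priming effect: {unrelated - related:.0f} ms")  # positive = facilitation
```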

Paul Nemirovsky and Glorianna
Davenport explore the fascinating idea that artistic mediums (e.g., music or
video) can be packaged into self-contained units that convey information. They
have developed something they refer to as the emon, "a small discrete unit
of aesthetic expression" which elicits predictable emotional effects that
can guide or direct human behavior in background information channels (255).
The authors’ emon approach was implemented and tested in their GuideShoes
system, a wearable information device that allows a user to navigate in an open
space (e.g., streets) by using musical emons as emotional cues. Nemirovsky and
Davenport motivate their emon approach to information delivery with a
hypothetical travel scenario. They ask you to imagine that you are a traveler
in a foreign city with no street names where you cannot speak the language.
Luckily you have your GuideShoes and headset. You tell it where you want to go,
whereupon GuideShoes connects to the Internet and registers your current
location and target destination. "As you start walking down the street, your headset
starts playing music. . . Musical patterns (emons) provide you with information
regarding the correctness of your direction" (256). Can musical structures
be used to communicate precise emotional meanings? The relationship between
pattern and meaning has long been of central interest to cognitive scientists
and musicologists. Because music lacks a precise semantics, theorists are
doubtful that music can be used to convey specific meanings, especially given
that listeners may derive different meanings from the same musical stimuli.
Nemirovsky and Davenport’s emon research challenges this widely accepted notion
and suggests that using aesthetic forms as self-contained emotionally-based
information sources can potentially simplify our perceptual world, particularly
in situations when we are faced with multiple cognitive demands. Limited to the
realm of theory, this idea might sound implausible. However, by incorporating
emons into their GuideShoes system, they have validated their information
approach in a real-world setting. Nemirovsky and Davenport’s emon research
raises some fundamental theoretical issues involving the role of musical
structures as emotional information carriers. The paper is timely
because both emotion and music are important fields of inquiry that have been
traditionally neglected in mainstream cognitive science.
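
The travel scenario implies a simple control loop: compare the walker’s heading with the bearing to the target and select an emon accordingly. The sketch below is my conjecture about one plausible mapping; the emon names and the 15-degree tolerance are invented, not Nemirovsky and Davenport’s design.

```python
import math

# Hypothetical emon repertoire; file names are placeholders.
EMONS = {"on_course": "calm_theme.mid",
         "veer_left": "rising_motif.mid",
         "veer_right": "falling_motif.mid"}

def pick_emon(position, heading_deg, target):
    """Choose a musical cue from the signed heading error to the target."""
    bearing = math.degrees(math.atan2(target[1] - position[1],
                                      target[0] - position[0]))
    error = (bearing - heading_deg + 180) % 360 - 180  # in (-180, 180]
    if abs(error) < 15:
        return EMONS["on_course"]
    return EMONS["veer_left"] if error > 0 else EMONS["veer_right"]

print(pick_emon(position=(0, 0), heading_deg=90, target=(100, 100)))
```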

Dilys Treharne examines
relationships between language comprehension and basic rhythmic abilities in
children. According to Treharne, such relationships exist because rhythm is
fundamental to language development – a "matrix of communication
skills" evolves that includes not only spoken language, but also
conceptual development, motor skills and social skills. Treharne argues that
rhythm forms a scaffolding upon which this communication matrix rests. In this
paper, she explores the link between rhythm and language comprehension within
an experimental setting. Treharne’s experiments reveal correlations between
children’s auditory verbal comprehension and their ability to imitate rhythms,
to judge the similarity of perceived rhythms and to infer missing words from a
sentence frame based on the rhythmic pattern of the word. Children who are good
at imitating and recognizing rhythms appear to be better at understanding the
meaning of sentences. In keeping with a growing literature on the subject,
Treharne’s study provides additional support for the idea that basic perceptual
rhythmic abilities are an essential ingredient of language comprehension.
Combined with the idea of a communication matrix, Treharne’s finding that
auditory verbal comprehension is correlated with both "non-verbal rhythmic
awareness and the ability to understand incomplete sentences using
rhythmic aspects of prosody" provides a theoretical framework for
language development that may have important clinical applications (322). For
example, children with certain language disorders could be trained to use
prosodic cues to facilitate comprehension.
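
Treharne’s central result is a set of correlations, which the following toy computation illustrates; the scores are invented, not her data.

```python
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Invented scores for seven children -- not Treharne's data.
rhythm_imitation = [3, 5, 4, 7, 8, 6, 9]
verbal_comprehension = [40, 55, 50, 70, 78, 64, 85]

print(f"r = {correlation(rhythm_imitation, verbal_comprehension):.2f}")
```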

One contribution to the section on
language and music, "On tonality in Irish traditional music" seemed a
little out of place. Though well written and informative, Ó Nualláin himself
describes the paper as dealing with the "political conditions that
hampered the full harmonic development of Irish music" (12). Nowhere in
this paper are there direct comparisons between music and language and thus it
is unclear how the paper fits into the multimodal language processing theme.

Part III on creativity is
introduced by Conn Mulvihill. This section includes a summary report of a panel
session that explored the question "What is creativity?" followed by
four papers on creativity. The first paper explores the analogical foundations
of creativity in language and the arts, tracing human history to find a decent
model of human computation. The second contribution examines the factors
involved in creative team performance within an economic context. The third
article takes a cultural approach to creativity, focusing on the centrality of humor
and metaphor in the Tarahumara Indian religion of Northern Mexico. The final
paper, which I review here, offers a computational perspective on creativity.

Conn Mulvihill and Micheál Colhoun
take on the challenging question "Is creativity algorithmic?" The
answer to this question has important implications for computational approaches
to cognition, as it would allow us to specify whether creative thinking can, in
principle, be programmed into computers. Mulvihill and Colhoun provide an
interesting and well thought out approach to the topic of creativity in the
context of language. According to the authors, in languages where form (medium)
and content (message) mix is where we most often find creativity. They cite
examples of this mixing of form and content in the creative arenas of art,
philosophy and biology. In the visual arts, for example, the content conveyed
in a painting will be influenced by the form (e.g., impressionism). Does this
interplay of form and content extend to computers? A number of early artificial
intelligence (AI) programs written in the programming language Lisp (e.g.,
Lenat’s work) produced interesting results that were initially taken to be creative,
but were ultimately found to be attributable to the richness of the
representational medium. In general, Mulvihill and Colhoun find that current
computer languages appear not to have the capacity for creativity, at least
according to their form/content requirement; algorithms can provide yes/no
answers to questions of form (e.g., compilers), but not to questions of
content, a limitation which they suggest may be inherent in the properties of
logical symbol systems themselves. Despite this apparent drawback, Mulvihill
and Colhoun propose that any (computer) language that did support creativity
would be marked by two additional properties (ambiguity and reflectivity) and
conclude with some thoughts on modeling the creative process.
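
Mulvihill and Colhoun’s point about compilers can be made concrete. An algorithm gives a decidable yes/no answer to a question of form (is this string a syntactically valid expression?), but there is no analogous procedure for content. The example is mine, not the authors’.

```python
import ast

def well_formed(source: str) -> bool:
    """A compiler-style check: pure form, fully decidable."""
    try:
        ast.parse(source, mode="eval")
        return True
    except SyntaxError:
        return False

print(well_formed("(1 + 2) * 3"))  # True  -- form is checkable
print(well_formed("(1 + 2) * "))   # False -- and so is its absence
# No analogous function exists for a question of content,
# such as is_creative(source).
```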

After reading this short section on
creativity, one should certainly not expect to come away with a precise notion
of the fundamental nature of creativity. Creativity is a challenging and little
understood area of human cognition that has been tackled within many
disciplines. Indeed, according to Rickards, one of the contributors to this
section, a universally accepted definition has yet to be found. Notwithstanding
the lack of consensus on exactly what creativity is, this final section
provides four interesting perspectives on the topic, with Mulvihill and
Colhoun’s paper on algorithms and Klein’s paper on the analogical foundations
of creativity offering two particularly promising approaches.

A guiding premise of this book is
that an integrative approach to language, vision and music can inform us about
the nature of both natural language and artificial communication systems and
the complex interrelationships that exist between language, mind and machine.
Contained in this volume are exciting examples of several sophisticated
multimodal computer systems, architectures and interfaces, original
experimental approaches relating language and music and some interesting work
on the difficult topic of creativity. Several of the articles, particularly a
few in the multimedia subset on vision and language, are presented by leaders
in their field. An important question is whether any of these researchers
successfully address one of the underlying themes outlined in the ‘Call for
Papers,’ namely, the notion of a modality-independent general purpose cognitive
system. Although few of the papers tackle this question directly, a potential
answer can be gleaned from an examination of recurring themes. Of particular
interest, the concept of rhythm (and related notions of timing and temporal
processing) appears in several of the articles in this collection, not only in
the section on language and music, where one would expect the theme to be
addressed, but also in a couple of the multimedia papers. Rhythm, it seems, is
not just a ubiquitous property in music and speech (e.g., prosody) but rather
finds a wider application in the cognitive realm. As noted by Ipke Wachsmuth,
one of the contributors to the section on language and vision,
"observations in diverse research areas suggest that human communicational
behavior is significantly rhythmic in nature" (118). In face-to-face
communication, speech, gesture and movement are highly coordinated across multiple
levels of temporal organization. Thus, rhythm may be an important structuring
and synchronizing principle underpinning the temporal aspects of cognition and
therefore relevant to research linking language, vision and music in both
natural and artificial domains.

Some cognitive researchers might
question the inclusion of music in this collection as an ‘important’ cognitive
modality having the same status as that of language or vision. The ability to
perceive music is certainly not essential for our everyday understanding of the
world in the same way that vision and language are. At the same time, there is
no culture in recorded human history that has been without music. Despite the
apparent centrality of music in human activities, as a field of study, music
has traditionally been accorded a less important role in cognitive science
relative to other domains. However, a recent upsurge in the literature on the
cognitive neuroscience of music appears to be changing this state of affairs.
Increasingly, there is evidence that even passive musical activities require
complex underlying processes and dedicated biological substrates and it has
been suggested that understanding the intricacies of musical processing may
provide insight into general properties of the mind and brain. In light of this
possibility, it is commendable that the conference organizers chose to include
music as part of their multimodal theme; indeed, it is promising that there
were enough high-quality contributions to warrant an entire section
on language and music.

What I
found most appealing about this volume is the number of research projects that
successfully blended theory, experiment and application — an ideal that is
striven for but often not evident in the subdisciplines of cognitive science.
As the papers in this collection employ a range of theoretical and
methodological approaches for understanding the integrated processing of
language, vision and music in both natural and computational contexts, there
should be something of interest here for almost everyone studying cognition.
That statement should be qualified, however. As with most reading endeavors,
what you get out of this book will depend in part on what you bring to it.
While many of the articles are well written, interesting and self-contained,
quite a few of them presuppose a basic familiarity with the subject matter. As
such, the book is geared primarily to cognitive scientists and those with
backgrounds in language, music and computer related or engineering disciplines
(particularly intelligent multimedia systems), although individuals with an
interest in specific topics (e.g., creativity, synaesthesia) may benefit from a
reading of selected papers.

At 433
pages, the book can be a little tough going, especially given that each of its
30 articles is packed with information. In taking on this collection of
specialized papers, readers new to the subject matter may at times find
themselves overwhelmed in a sea of details and challenged by having constantly
to shift gears from one conceptual mode of thought to another. However, the effort
required to get through these papers will be rewarded with a better
appreciation of both the range and difficulty of some central issues in a
thriving area of cognitive science. This collection is valuable in that it
brings together a wide range of interesting and diverse topics under the common
rubric of multimodality, an approach that promises to bear intellectual fruit
in the advancement of general-purpose cognitive theories. Human minds have the
capacity to effortlessly integrate linguistic, visual and musical stimuli in
real time. Consequently, progress in cognitive science can only come about when
researchers are able to study these and other perceptual and cognitive
modalities in a truly integrated fashion.

 

© 2004 Daniel Mauro

 

Daniel Mauro is
a senior PhD student in the cognitive science program at Carleton University
(Ottawa) and specializes in auditory temporal processing and musical cognition.
