The Grice Club


The club for all those whose members have no (other) club.

Is Grice the greatest philosopher that ever lived?


Wednesday, January 9, 2013

Automated Grice

Speranza

Commentary on E. Gola, S. Federici, N. Ruimy, and J. Wade, "Automated Translation Between Lexicon and Corpora"

Grice hated computers and anything to do with automata.

The reason was simple. When he was given a word processor, he noted that the word processor could not process his neologism (of sorts): pirot. The word processor would systematically correct that to "parrot".

Similarly with "sticky wicket".

"That's enough!", he thought.

His anti-computational thoughts he shared with Haugeland, of Berkeley, and they would discuss anti-automatism (i.e. the negation of Kantian freedom) no end.

In this paper, the authors argue for the relevance of the lexicon in the translation process and for the need to have at one's disposal wide-coverage, high-quality lexical resources. Access to rich lexical knowledge is in fact a fundamental requirement for a computational system to correctly analyze a text and generate its translation.
To this end, we will present an Italian lexicon that meets the requirements of MT systems, and we will show how its lexical information can be used in a translation process.
It must, however, be emphasized that building a large-coverage lexicon is a very costly and time-consuming process. That is why Computational Lexicography is today mostly oriented toward the development of methodologies and strategies that make the creation of lexicons easier and faster, through the automatic acquisition of data from corpora, from the Web, or by induction from existing resources. In this paper, we will show a bootstrapping method, based on a machine learning technique, that allows us to build at the same time a corpus-based lexicon and a tagged corpus, which grow incrementally together in a semi-automated way.
1. Machine Translation: historical overview and state of the art
Since the beginning of Artificial Intelligence (AI) and Natural Language Processing (NLP), studies and research have been devoted to realizing the dream of Machine Translation.
During the first decades of MT research, an articulated panorama of methodologies and strategies started taking shape. Classifying all the approaches is almost impossible, given that perspectives change along with the adopted parameters.
There is now a variety of MT systems which almost defies any neat classification. It is still often legitimate to apply the labels of the 1960s: practical vs. theoretical, empirical vs. perfectionist, direct vs. indirect, interlingual and transfer. But now there are new labels and new perspectives: interactive vs. fully automatic, ‘try-anything’ systems vs. ‘restricted language’ systems, mainframe systems vs. microcomputer or word-processor systems, AI-based systems vs. linguistics-oriented systems (Hutchins, 1986, p. 19).
For our purposes, we will focus on the distinction between direct and indirect strategies, which belong respectively to first- and second-generation MT systems.
Until the sixties, MT systems, called first-generation systems, followed a so-called direct strategy, in which a direct correspondence was established between the Source Language and the Target Language (henceforth, SL and TL). In this strategy, the SL was analyzed only from a morphological point of view. The output of the morphological analysis constituted the access point to the bilingual lexicon. In this way, a text could only be translated word-by-word. This strategy therefore failed to cope with the translation of ambiguous sentences or of sentences whose SL and TL syntactic structures differ, such as the Italian sentence Questo ragazzo piace a Maria (lit. this boy likes to Maria), whose English counterpart, Maria likes this boy, is structured quite differently.
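To make the limitation concrete, here is a minimal sketch of such a word-by-word pipeline in Python; the toy lexicon and function names are ours, purely for illustration:

```python
# A minimal sketch of a first-generation "direct strategy" translator, assuming
# a toy Italian-English bilingual lexicon; names are illustrative, not from the paper.

BILINGUAL_LEXICON = {
    "questo": "this", "ragazzo": "boy", "piace": "likes",
    "a": "to", "maria": "Maria",
}

def direct_translate(sentence: str) -> str:
    """Translate word-by-word: lexicon lookup only, no syntactic transfer."""
    words = sentence.lower().rstrip(".").split()
    return " ".join(BILINGUAL_LEXICON.get(w, w) for w in words)

# The structural divergence described above is exactly what this cannot fix:
print(direct_translate("Questo ragazzo piace a Maria"))
# -> "this boy likes to Maria"   (word order and argument structure are wrong;
#     the correct English is "Maria likes this boy")
```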
During the sixties, second-generation systems adopted an indirect strategy, following two approaches: a two-phase process known as the Interlingua approach and a three-phase process known as the Transfer approach.
In the Interlingua approach, a formal, abstract and language-independent representation interfaces source and target languages: a SL text is analyzed into an interlingual representation which is then synthesized into a TL text. In this view, a conceptual lexicon is required, the building of which is an extremely complex and controversial task.
For this reason, more realistic strategies, based on Transfer, are adopted. In this case, the translation steps are the following:
- analysis of a SL text into a SL formal representation;
- transfer of the SL formal representation into a TL formal representation;
- generation of a TL text from the TL formal representation.
In the Transfer approach, the structural analysis of the SL text is performed in different steps and leads to the building of a formal representation of the SL structures which, in the transfer phase, is mapped onto a formal representation of the TL structures. As to the lexical transfer, the SL lexical units are translated into TL lexical units using an electronic bilingual dictionary. During the synthesis phase, the TL formal representation is turned, following different steps, into a TL text. In this perspective, cultural aspects of different languages are taken into account. A minimal sketch of this three-phase pipeline is given below.
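The sketch assumes, purely hypothetically, that the formal representations are predicate-argument records; the rule and all names are ours, not an actual system's:

```python
# A minimal sketch of the three-phase Transfer pipeline (analysis, transfer,
# generation), with formal representations simplified to dictionaries.

def analyze(sl_text: str) -> dict:
    """Analysis: parse the SL text into an SL formal representation (stubbed)."""
    # For "Questo ragazzo piace a Maria": piacere(theme=ragazzo, experiencer=Maria)
    return {"pred": "piacere", "theme": "questo ragazzo", "experiencer": "Maria"}

def transfer(sl_repr: dict) -> dict:
    """Transfer: map the SL structure onto a TL structure via bilingual rules."""
    # Structural rule: piacere's experiencer becomes like's subject.
    return {"pred": "like",
            "subject": sl_repr["experiencer"],
            "object": {"questo ragazzo": "this boy"}[sl_repr["theme"]]}

def generate(tl_repr: dict) -> str:
    """Generation: linearize the TL formal representation into a TL text."""
    return f"{tl_repr['subject']} {tl_repr['pred']}s {tl_repr['object']}"

print(generate(transfer(analyze("Questo ragazzo piace a Maria"))))
# -> "Maria likes this boy"
```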
In spite of this innovation, disappointment with the feasibility of MT was growing, due to the “semantic barriers” that researchers encountered and that proved difficult to overcome.
Furthermore, in 1964, the US government sponsors asked the National Science Foundation to constitute a committee in order to evaluate the progress made in NLP in general and in the MT state of the art in particular. The committee produced in 1966 an "(in)famous report", as John Hutchins (1996) defined it: the ALPAC report (from the name of the committee, the Automatic Language Processing Advisory Committee). The ALPAC report stated that MT systems were slower, less precise and more expensive than human translators. The verdict was: "there is no immediate or predictable prospect of useful machine translation" (ALPAC, 1966).
It should be noted, though, that the ALPAC committee took into consideration only direct-strategy systems, which it evaluated negatively. For the next ten years, this assessment caused the U.S. Government to reduce its funding in this area dramatically. As a direct consequence, research in this field stopped in the US for over a decade, while it carried on in Canada, Germany and France.
It is only in the middle of the Seventies that we find a renewed interest in automated translation, with the emergence of third-generation systems based on Artificial Intelligence.
Starting from the 1990s, a new methodological approach emerged that makes use of large bodies of text (corpora) (Hunston, 2002). Among corpus-based systems, the most common approaches are statistics-based (SBMT) and example-based (EBMT) systems.
SBMT follows strategies in which SL and TL sentences are tentatively aligned on the basis of the probability that each word in the SL sentence corresponds to one or more words in the TL sentence. By contrast, the example-based methodology, suggested by Nagao in 1984 but implemented only in the 1990s, produces a translation by analogy, comparing the input sentence with a bilingual dictionary that includes examples and matching those most similar to the input (Nagao, 1984; Brown, 1999; Turcato et al., 1999).
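As a rough illustration of the example-based idea, the following sketch measures similarity as plain word overlap, a drastic simplification of real EBMT matching; the example base and names are invented:

```python
# A minimal sketch of example-based matching: return the stored translation of
# the most similar example, with similarity reduced to bare word overlap.

EXAMPLE_BASE = [
    ("a committee was set up to arrange the meeting", "istituire"),
    ("websites have also been set up by the council", "creare"),
]

def translate_by_analogy(sentence: str) -> str:
    """Return the translation attached to the most similar stored example."""
    words = set(sentence.lower().split())
    best = max(EXAMPLE_BASE,
               key=lambda ex: len(words & set(ex[0].split())))
    return best[1]

print(translate_by_analogy("a task force was set up by Harvard"))  # -> "istituire"
```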
In the same years, rule-based systems moved away from syntax-based representations toward more 'lexicalist' approaches. At its extreme, the essence of the lexicalist approach in MT system design is to reduce transfer rules to simple bilingual lexical equivalences. Such a drastic reduction was first put forward in the CRITTER project (Isabelle et al., 1988). The approach was explored in the ACQUILEX project, devoted primarily to the construction of multilingual lexicons for transfer-based MT (Sanfilippo et al., 1992), and is probably best known as the 'shake-and-bake' method described by Whitelock (1992). The requirement for structural representations - common to both transfer and interlingua approaches - is abandoned in favour of sets of semantic and syntactic constraints on lexical items. Translation involves the identification of TL lexical items which satisfy the semantic constraints attached to the SL lexical equivalents.
The 'bag' of target lexical items is then 'shaken' to generate an output text consistent with the syntax and semantics of the target language (Hutchins, 1993).
This ‘lexicalist’ turn led the MT community to an increasing interest in computational lexicons.
Today, Machine Translation systems usually follow either a corpus-based or a rule-based approach. The first trend comprises statistical and example-based approaches; in the second, emphasis is placed on lexical resources. In the following sections, we will propose an integration of these two approaches.
2. Relevance of the lexicon in MT
In order to produce a good translation, it is necessary to correctly understand the input text. It is precisely for this reason that Machine Translation is deemed one of the most difficult tasks in the field of AI language applications. Any translation process implies, in fact, the resolution of a whole range of problems regarding both the analysis and the generation of texts. In this context, the lexicon plays a crucial role. A robust translation system should be able to cope with a wide range of issues inherent in the complexity of natural language, such as the various types of ambiguity, non-literal uses, polysemy and so on. A poor lexicon fails to support these challenging tasks.
3. Lexicon and lexical problems in MT
Table 1 illustrates some of the most typical and frequent lexical problems that are encountered during a translation process and that a lexicon tailored for an MT system should be able to deal with.
The lexicons used in MT systems must have wide coverage and provide, for each lexical entry, a wide range of rich and varied information spanning all levels of linguistic description.
Direct-strategy MT systems used a single, very complex bilingual lexicon containing all grammatical information about both the SL and the TL lexical units, as well as the conditions for selecting the appropriate translation when several alternatives are possible.
Transfer-based MT systems, by contrast, use different monolingual lexicons (morphological, syntactic and semantic) containing all relevant information for each level of linguistic description, for both the analysis and generation phases. In the transfer phase, a bilingual lexicon is used. The transfer bilingual lexicon consists of lexical rules setting i) the correspondences between the lexical units described in the SL and TL monolingual semantic lexicons and ii) the conditions imposed on those equivalences. For example, when an SL word is translatable by different TL words, the lexical transfer rule selects the appropriate TL equivalent on the basis of the information provided by the two translational equivalents in their respective monolingual descriptions.
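By way of illustration only, a transfer lexical rule might be modelled as follows; the rule format is our own simplification, not an actual transfer formalism, and the borsa entries anticipate Table 2 below:

```python
# A minimal sketch of a bilingual transfer rule: each rule pairs an SL semantic
# unit with a TL unit and states the condition under which the equivalence holds.

TRANSFER_RULES = {
    # (SL lemma, condition on the monolingual description) -> TL lemma
    ("borsa", "gen."):  "bag",
    ("borsa", "econ."): "stock exchange",
}

def lexical_transfer(sl_lemma: str, domain: str) -> str:
    """Select the TL equivalent licensed by the SL unit's domain of use."""
    return TRANSFER_RULES[(sl_lemma, domain)]

print(lexical_transfer("borsa", "econ."))  # -> "stock exchange"
```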
In the domain of computational lexicography, a significant number of electronic lexical resources are now available, even though not all languages are equally represented. Most lexicons deal with a single level of linguistic description; some describe a single part of speech or are strictly theory-dependent. Some are created to describe the vocabulary of a particular domain; others to meet the requirements of a specific application.
Level      | Phenomenon              | Example
Phonology  | Homography              | it. pésca = en. fishing; it. pèsca = en. peach
Morphology | Homonymy                | it. legge, porta, sbarra: N & V; it. appunto: N & ADV
Syntax     | Syntagmatic realization | en. know + NP = it. conoscere; en. know + WH-clause = it. sapere
Semantics  | Homonymy                | fr. louer = en. to praise; fr. louer = en. to rent
Semantics  | Polysemy                | en. set up = it. piantare, erigere, mettere su, causare, installare, allestire, formare, etc.
Semantics  | Conceptual division     | en. corner = sp. rincón (internal), esquina (external)
Semantics  | Lexical gaps            | it. fuoricorso, consuocero: not lexicalized in English or French

Table 1.
Very few lexical resources, however, have the features required for use in an MT system. As a matter of fact, besides providing rich and varied information, a lexicon must guarantee the completeness and coherence of the encoded lexical data. Moreover, it must be conceived as a dynamic resource, not as a static, crystallized repertory of lexical information. Such a resource should be simple to update and expand, not only manually but above all through the automatic acquisition of information from textual resources, so as to reflect the continuous evolution of languages, meet new needs and address the issues which might emerge from the translation process. In this perspective, a generic lexical model and a modular architecture are essential for an electronic lexicon to be profitably exploitable.
A large computational lexical resource for the Italian language which presents these characteristics was developed at the Istituto di Linguistica Computazionale of the National Research Council in Pisa from 1996 to 2003.
4. The Lexical Resource
The computational lexicon PAROLE-SIMPLE-CLIPS (Ruimy et al., 1998; 2002; 2003), elaborated in the framework of three different projects (the European projects LE-PAROLE and LE-SIMPLE, and the Italian project Corpora e Lessici dell'italiano Parlato e Scritto, CLIPS), provides a wide-coverage, four-level description of the Italian language. This lexical resource was built from a multifunctional and multilingual perspective and in compliance with the international standards set out in the PAROLE-SIMPLE lexical model (Ruimy et al., 1998; Lenci et al., 2000).
This model, based on the EAGLES recommendations (Sanfilippo et al., 1998) and on an extended version of the GENELEX model (Antoni-Lay et al., 1994), is at the forefront of Computational Lexicography owing to some outstanding and innovative features. The flexible architecture of the model, as well as the building methodology, allows the coherent encoding of a wide range of highly structured information at the desired level of granularity. Consensually adopted at a European level for the building of twelve harmonized monolingual electronic lexicons, the PAROLE-SIMPLE lexical model became a de facto standard and subsequently strongly inspired the ISO standard for NLP lexicons, the Lexical Markup Framework metamodel (ISO-24613:2008).
The PAROLE-SIMPLE-CLIPS lexicon therefore offers the outstanding advantage of being compatible with eleven other lexicons developed for European languages, with which it shares the theoretical and representational model, the working methodology and a kernel of entries.
The lexicon is articulated in four independent but interrelated modules, corresponding respectively to the phonological, morphological, syntactic and semantic levels of linguistic representation. The complete description of a lexical unit therefore consists of a minimum of four interconnected entries, each providing a structured set of information relevant to the description level that hosts it.
A phonological entry accounts for the phonetic and phonological features of a lexical unit, while a morphological entry informs on its grammatical category and inflectional paradigm. A syntactic entry describes both the intrinsic and contextual properties of a lexical unit in one specific syntactic structure. The subcategorization frame is modelled in terms of the syntactic category, grammatical function, optionality and morphosyntactic, syntactic and lexical restrictions of the governed elements. Systematic frame alternations, such as the causative-inchoative alternation, are represented in a complex entry whereby the correspondence between the constituents of the two structures is specified.
The theoretical framework adopted for the representation of semantic information is based on the fundamental principles of the Generative Lexicon theory (Pustejovsky, 1995). In a generative lexicon, a semantic unit is modelled through four different levels of representation (namely the Qualia Structure, Event Structure, Argument Structure and Lexical Typing Structure) that account for the componential aspect of meaning, define the type of event denoted, describe its semantic context and set its hierarchical position with respect to other lexicon units.
The semantic lexicon is structured in terms of an ontology of semantic types (the SIMPLE ontology). In a semantic entry, which encodes a single meaning of a lexeme, membership in an ontological type represents the primary and most relevant information. Besides the ontological classification, the semantic unit is endowed with information concerning its domain of use; the type of event it denotes, where relevant; some distinctive semantic features; its links with other lexical units, among which synonymy and morphological derivation links; and membership in a class of regular polysemy. The semantic frame of predicative units is also described in terms of the semantic role and selectional restrictions of the arguments.
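To fix ideas, a SIMPLE-style semantic entry might be modelled along these lines; the field names are our simplification of the description above, not the actual PAROLE-SIMPLE-CLIPS record layout:

```python
# A minimal sketch of a SIMPLE-style semantic entry as a Python data structure.

from dataclasses import dataclass, field

@dataclass
class SemanticEntry:
    lemma: str                                     # the lexeme this sense belongs to
    ontological_type: str                          # membership in the SIMPLE ontology
    domain: str = "gen."                           # domain of use
    qualia: dict = field(default_factory=dict)     # extended qualia relations
    synonyms: list = field(default_factory=list)   # links to other lexical units

# One sense of Italian "ferri" (cf. Table 4 below):
ferri_1 = SemanticEntry(
    lemma="ferri",
    ontological_type="INSTRUMENT",
    qualia={"used_for": "sferruzzare"},  # telic role: what the entity is for
)
print(ferri_1)
```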
To express the links holding among sense units, the SIMPLE lexicographers benefited from a remarkably efficient expressive means, the Extended Qualia Structure. This representational tool was derived from the Qualia Structure, a four-role structure (formal, constitutive, agentive and telic) which is considered a mainstay of the Generative Lexicon theory for representing the multidimensionality of a word's meaning. The extended structure was created by defining, for each of the four Qualia roles, a subset of semantic relations. Such relations allow a much sharper expression of both the multidimensional aspect of a word sense and the nature of its syntagmatic and paradigmatic links to other lexical units. To give but one example, for the telic role, which informs about the function or purpose of an entity, the most appropriate relation may be selected among the following: 'used_for', 'used_by', 'used_as', 'used_against', 'is_the_activity_of', 'object_of_the_activity' and so on.
Moreover, in a new and revised version of the lexical-semantic database, called Simple_PLUS, the semantic representation has been enriched with significant information concerning the relationships holding between events and their participants and among co-participants in events (Ruimy, 2010).
This lexicon therefore offers a wide range of very rich information, especially at the semantic level. It is our deep conviction that an MT system could greatly benefit from such a wealth of lexical data, both for the granularity of the information provided and for its explicit formulation.
5. Lexical Semantics for the resolution of some MT problems
A translation process presupposes the understanding of the many and various aspects that characterize the input text. Besides the morphological and syntactic aspects, it is necessary to disambiguate the logical form of the sentence, checking the coherence among semantic restrictions and preferences of words. To establish an equivalence between a source and a target text a translator should also understand other semantic and pragmatic aspects (for example conversational implicatures, metaphors, ironic contexts, etc.), that are not easily detectable. In the following, we will briefly show how Lexical Semantics plays a central role in the resolution of problems that typically emerge in Machine Translation.
Word sense ambiguity is a pervasive characteristic of natural language. It is one of the main reasons for poor performance of Information Retrieval systems. In MT, lexical ambiguity may occur both in the analysis and the transfer phases. Its resolution, which is therefore considered a major problem, requires a large amount of rich lexical knowledge.
5.1.1. Polysemy / homonymy and domain knowledge
A polysemic SL word or two SL homonyms may translate in two different ways according to their domain of use (see Table 2). Matching information about the topic of the source text against the indication, in the monolingual lexicon, of the different domains of use of the ambiguous word enables the selection, in the bilingual lexicon, of the appropriate translation; a minimal sketch follows Table 2.
en. mouse   | (gen.)    | it. topo
en. mouse   | (inform.) | it. mouse
it. borsa   | (gen.)    | en. bag
it. borsa   | (econ.)   | en. stock exchange
it. calcolo | (gen.)    | en. calculation
it. calcolo | (med.)    | en. gallstone

Table 2.
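Here is the promised minimal sketch of domain-driven selection, using the Table 2 entries; the dictionaries and the fallback policy are illustrative assumptions:

```python
# A minimal sketch of domain-driven translation selection: match the source
# text's topic against each sense's domain of use to pick the equivalent.

SENSES = {
    "calcolo": [("gen.", "calculation"), ("med.", "gallstone")],
    "mouse":   [("gen.", "topo"),        ("inform.", "mouse")],
}

def select_translation(lemma: str, text_domain: str) -> str:
    """Pick the equivalent whose domain matches the text's topic; fall back to the generic sense."""
    senses = dict(SENSES[lemma])
    return senses.get(text_domain, senses["gen."])

print(select_translation("calcolo", "med."))   # -> "gallstone"
print(select_translation("calcolo", "sport"))  # -> "calculation" (generic fallback)
```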
5.1.2. Polysemy / homonymy and ontological classification
The semantic classification of a word sense is generally sufficient to discriminate among its different meanings or among homonyms and therefore to enable the selection of the relevant one from its different possible translational equivalents, as shown in Table 3 for Italian-English and Italian-French translations.
Italian                      | English          | French
ala [PART]                   | wing             | aile
ala [BODY_PART]              | wing             | aile
ala [ROLE]                   | winger           | ailier
espresso [ARTIFACT_DRINK]    | espresso         | express
espresso [VEHICLE]           | express (train)  | express
espresso [SEMIOTIC_ARTIFACT] | express (letter) | exprès

Table 3.
5.1.3. Polysemy / homonymy and contextual links
More complex situations emerge when two readings of a lemma cannot be disambiguated through their semantic classification or other paradigmatic information. In this case, syntagmatic and therefore contextual links may be used. In the example reported in Table 4, means for selecting the appropriate translation are provided not only by the domain of use but also by the semantic relations linking each ambiguous term to the predicate denoting its function.
Italian                                             | English
ferri_1 [INSTRUMENT] used_for sferruzzare (to knit) | knitting needles
ferri_2 [INSTRUMENT] used_for operare (to operate)  | surgical instruments

Table 4.
5.1.4. Polysemy / homonymy and semantic frame
The semantic frame description may also provide clues for solving lexical ambiguities. Two homonymous predicates may be distinguished by a different argument structure, either in the number of arguments they require (Table 5, first example) or in the semantic restrictions imposed on those arguments (Table 5, second example); a minimal sketch follows Table 5.
Italian                      | English
avvertire1: arg0, arg1, arg2 | to inform, to warn
avvertire2: arg0, arg1       | to feel, to notice
camminare1: arg0 = + animate | walk
camminare2: arg0 = - animate | work

Table 5.
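The promised sketch below encodes the Table 5 frames and selects a translation by checking which frame's constraints hold; the encoding is ours, not the lexicon's native format:

```python
# A minimal sketch of semantic-frame disambiguation: homonymous predicates are
# told apart by arity or by selectional restrictions on their arguments.

FRAMES = {
    "avvertire": [({"args": 3}, "to inform, to warn"),
                  ({"args": 2}, "to feel, to notice")],
    "camminare": [({"arg0_animate": True},  "walk"),
                  ({"arg0_animate": False}, "work")],
}

def disambiguate(lemma: str, observed: dict) -> str:
    """Return the translation of the first frame whose constraints all hold."""
    for constraints, translation in FRAMES[lemma]:
        if all(observed.get(k) == v for k, v in constraints.items()):
            return translation
    raise ValueError(f"no frame of {lemma!r} matches {observed}")

print(disambiguate("camminare", {"arg0_animate": False}))  # inanimate subject -> "work"
print(disambiguate("avvertire", {"args": 2}))              # -> "to feel, to notice"
```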
It is worth noting that the whole range of lexical semantic information used for solving the above cases of ambiguity is encoded in the lexicon presented in the previous section.
6. The Corpus-based Approach
In order to briefly illustrate how a corpus-based approach may work, we have decided to focus our attention on one specific example, the translation of the English phrasal verb ‘set up’ into Italian, gathering our samples from electronic texts on the Web and analyzing them with a KeyWord In Context (KWIC) tool.
The experiment outlined here was carried out using the following procedure. A lexical item was selected, for the purposes of this analysis the English phrasal verb ‘set up’ (Wade & Federici, 2006), since this item is problematic from a semantic point of view. It provides an interesting example of the highly polysemic nature of the English language, characterised by “remarkable range, flexibility and adaptability” (Crystal, 1988, p. 39). In this case the translator is required, for example, to consider the context-specific nature of the lexical item (Eco, 2003, p. 29) and the areas of “inherent fuzziness” (Bell, 1991, p. 102) encountered in establishing equivalence between one language and another. Indeed, ‘set’ alone has about 120 different meanings (cf. Collins Cobuild English Dictionary, 1995). With regard to ‘set up’, it was decided to first examine its meanings using a traditional bilingual dictionary, the Ragazzini-Zanichelli (2009). Secondly, a small sample of examples was collected from the web with a specifically designed search tool, followed by the manual examination and analysis of the gathered data and comparison with the information provided in the dictionary. The analysis was then extended through analogical comparison with the initial manual analysis, allowing the extraction of a wider sample of data.
To perform the kind of analysis described above, a tool was developed to acquire word concordances directly from the web. The tool is a combination of several web/linguistic tools:
- a web spider that acquires a predefined number of web pages;
- a segmenter that splits the acquired web pages;
- a rule-based lemmatiser;
- a KWIC (KeyWord In Context) tool;
- a self-learning analogy-based engine.
The web spider (cf. Federici, Wade, 2007) extracts web pages starting from a given web address, thus providing “a random snapshot of the current state of the Internet in a given language” (Sharoff, 2006, p. 437). The spider filters out all unneeded web overstructure (see Figure 1).
Figure 1
Then the lemmatiser associates each word form contained in the extracted web pages with the corresponding lemma. After the corpus has been cleaned and lemmatised, the KWIC reads the corpus by indexing all the lemmas. This is illustrated in Figure 2, where the word forms or lemmas appear in the keyword area on the left (2a), and clicking on the keyword of interest creates the concordances (2b). A minimal sketch of this KWIC step is given after Figure 2b.
Figure 2a
Figure 2b (keyword area and concordance area)
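The sketch assumes an already cleaned and tokenized corpus; the real tool chains a spider, segmenter and lemmatiser before this stage, and the function names here are ours:

```python
# A minimal sketch of the KWIC step: show each occurrence of a keyword with a
# fixed window of words of context on either side.

def kwic(corpus: str, keyword: str, window: int = 4) -> list[str]:
    """Return concordance lines for `keyword` with `window` words of context."""
    tokens = corpus.lower().split()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>40} | {tok} | {right}")
    return lines

corpus = ("a committee was set up to arrange the visit "
          "and websites were set up by the council")
for line in kwic(corpus, "set"):
    print(line)
```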
While this approach is certainly useful, as it enables the linguist to capture the real usage of a given word, it also suffers from a number of limitations:
- the manual analysis of data is extremely time-consuming;
- it is often not practicable to analyse all the examples, especially in large corpora, so only a selected number of examples are chosen as representative;
- there is a risk of human error and inconsistency in manual analysis.
7. Corpus vs. Dictionary
Our starting point was an analysis of the word senses provided in the bilingual English-Italian dictionary Ragazzini-Zanichelli (2009). The result is illustrated in Figure 3.
to set up (verbo transitivo)
1. mettere su; alzare; erigere; piantare; montare; installare; allestire
2. mettere su; montare; installare; allestire
3. mettere su; mettere in piedi; istituire; fondare; costituire; formare; aprire (un ufficio); avviare (un’azienda)
4. sistemare; mettere (q.) in affari (o politica ecc.); aiutare (q.) finanziariamente (politicamente ecc.)
5. lanciare (un grido)
6. causare; provocare; dare l’avvio (o il via) a
7. stabilire (sport)
8. comporre (tipog.)
9. tesare, arridare (naut.)
10. (fam.) rimettere in salute (o in forze; in sesto); tirare su
11. (fam.) montare un’accusa contro (q.); incastrare; mettere contro; mettersi a fare; fornire; essere forte; essere ben fornito
Figure 3
It should be noted that only a restricted number of entries provide contextualised examples of usage.
An initial analysis of the corpus created with the tools described above, on the other hand, reveals a significantly richer contextualised source. In fact, it becomes immediately apparent that there are cases not included in the dictionary, such as the meaning ‘creare’, which is the appropriate translation of ‘set up’ in the case illustrated below:
[…] useful information on the British Council's website, which was set up specifically for assistants to use in their placement countries. (“Foreign assistance” by Katie Phipps, «Education Guardian online», August 23rd, 2005)
In an experiment that analysed 600 contexts of ‘set up’, only 8 out of 17 translations were attested in the dictionary. While it may be argued that the entries in the dictionary could be the most frequent usages of ‘set up’, it does not seem to be the case if we consider that the dictionary covers only about 47% of the translations of ‘set up’ occurring in our corpus (see Figure 4).
Translations present in the dictionary: ALLESTIRE, AVVIARE, COSTITUIRE, FONDARE, FORMARE, INSTALLARE, ISTITUIRE, STABILIRE.
Translations not present in the dictionary: APPRONTARE, ATTUARE, CREARE, DEFINIRE, DIPINGERE, IMPOSTARE, ORGANIZZARE, PREPARARE, REALIZZARE.

Figure 4
From these analyses it emerges that examples extracted from real texts may be useful (i) to extend coverage of the lexicon and (ii) to refine semantic entries.
8. Extending the study: the ‘bootstrapping’ process
In order to extend the study and refine the data gathered, we need to use some type of Artificial Intelligence engine that (semi-)automatically carries out the annotation task. The procedure applied for the purposes of this study is called ‘bootstrapping’.
In the first step a small portion of the corpus was annotated manually, assigning a translation to each sample (see Figure 5):
Manually annotated concordances:
1. […] websites have also been set up/CREARE by the LSC […]
2. […] Websites have also been set up/CREARE and open days organised […]
3. […] an appeal panel has been set up/COSTITUIRE by the Dept. […]
4. […] a panel, task force, set up/COSTITUIRE by Harvard […]
Figure 5
In the second step, the annotation is extended automatically to the remaining concordances for ‘set up’ in the corpus. At this stage it is found that not all of the translations assigned are correct (see Figure 6).
Manually annotated concordances:
1. […] websites have also been set up/CREARE by the LSC […]
2. […] Websites have also been set up/CREARE and open days organised […]
3. […] an appeal panel has been set up/COSTITUIRE by the Dept. […]
4. […] a panel, task force, set up/COSTITUIRE by Harvard […]
New concordances (Automatic annotation)
1. There are […] much more useful information on the British Council’s website, which was set up/CREARE […] for assistants […] (CORRECT)
2. […] a committee was set up/ISTITUIRE to arrange […] (WRONG)
Figure 6
During this automatic annotation step the first occurrence of ‘set up’ is automatically annotated as ‘CREARE’, which is correct, while the second is automatically annotated as ‘ISTITUIRE’, which is incorrect. This is because the algorithm in this case failed to provide the appropriate translation for lack of evidence.
In the third step, therefore, further manual revision is necessary. During this last phase the correct interpretation is manually assigned to those keywords that have been wrongly annotated.
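To summarize the workflow, here is a minimal sketch of the three-step cycle, with the learner reduced to nearest-neighbour word overlap; the paper's analogy-based engine is more sophisticated, so treat this only as an outline:

```python
# A minimal sketch of the bootstrapping cycle: manual seed annotation,
# automatic extension, then manual revision feeding back into the seed.

def train(seed):
    """Step 1: the 'model' is just the manually annotated seed pairs."""
    return seed

def annotate(model, context):
    """Step 2: extend the annotation to a new concordance automatically."""
    words = set(context.lower().split())
    best = max(model, key=lambda ex: len(words & set(ex[0].lower().split())))
    return best[1]

seed = [
    ("websites have also been set up by the LSC", "CREARE"),
    ("an appeal panel has been set up by the Dept.", "COSTITUIRE"),
]
model = train(seed)
guess = annotate(model, "a committee was set up to arrange the visit")
print(guess)
# Step 3 is a human accepting or correcting this guess; the corrected pair is
# then appended to the seed, and the cycle repeats with a larger training set.
```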
9. Practical application of the procedure
We tested this procedure by setting up an experiment in which 600 contexts from a 1.5-million-word corpus were manually annotated by assigning a translation to each concordance of ‘set up’; 400 new contexts were then automatically annotated and finally revised manually.
The results were encouraging, since the correctness of the automatically assigned translations was about 49%. That is, almost half of the time the procedure assigns the correct translation, even when starting from a relatively small set of training samples. This is acceptable when compared to the high number of possible translations (17) and to weaker baselines, such as assigning a random interpretation (1/17 ≈ 6%) or always the most frequent one (‘avviare’, which accounts for only 12% of the cases).
Conclusions
In our hypothesis, the corpus-based process outlined above might prove very useful in enhancing lexical resources. This study aimed to create a dynamic, cyclical process in which the lexicon, in the case of our web-based experiment, is enhanced by corpus-based analysis, and the corpus-based analysis can then be automated thanks to the availability of richer and more precise lexical knowledge. This appears necessary when dealing with a dynamic process, as opposed to a static lexicon, which fails to provide a complete descriptive picture of current language use. With the application of automated methods, a wide set of new lexical data and knowledge can be collected and analyzed.
There is the need, therefore, for the implementation of systems which are able to dynamically extend/enhance/update lexicons with information acquired from large corpora and from the web. Our objective should be to set up a new generation of large-size, dynamic lexical resources that fully capture current language usage (how language is materially manifested) and use (the way in which language forms are used as a means of communication) (Widdowson, 1978, pp. 18-19).
ACKNOWLEDGEMENTS

This work is the outcome of a collaborative effort. However, for the specific requirements of the Italian academy, Elisabetta Gola is responsible for sections 1-3, Stefano Federici for sections 8-9, Nilda Ruimy for sections 4-5 and John Wade for sections 6-7.

REFERENCES
Antoni-Lay, M.-H., Francopoulo, G., Zaysser, L. (1994). Generic Model for Reusable Lexicons: The Genelex Project. Literary and Linguistic Computing, 9(1), 47-54.
Bell, R.T. (1991). Translation and Translating: Theory and Practice. Harlow: Longman.
Brown, P.F., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Mercer, R., Roossin, P. (1990). A statistical approach to language translation, Computational Linguistics, 16, 79-85.
Brown, R.D. (1999). Example-based machine translation, accessed on http://www.cs.cmu.edu/afs/cs.cmu.edu/user/ralf/pub/WWW/ebmt/ebmt.html [2000, January 6].
Collins Cobuild English Dictionary (1995). London: Harper Collins Publisher.
Crystal, D. (1988). The English Language, London: Penguin.
Dizionario Inglese-Italiano, Italiano-Inglese Ragazzini-Zanichelli (2009). Zanichelli Editore.
Eco, U. (2003). Dire quasi la stessa cosa: Esperienze di traduzione. Milano: Bompiani.
Federici, S., Wade J.C. (2007). Letting in the light and working with the Web: A dynamic corpus development approach to interpreting metaphor. In M. Davis, P. Rayson, S. Hunston P. Danielsson, eds. Proceedings of Corpus Linguistics Conference 2007, University of Birmingham (UK), http://corpus.bham.ac.uk/corplingproceedings07/paper/207_Paper.pdf.
Grice, H. P. (1938). Negation. The Grice Papers, UC/Berkeley, Bancroft Library.
Grice, H. P. (1989). Studies in the Way of Words. Cambridge, MA: Harvard University Press.
Horn, L. R. A brief history of negation.
Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge, UK: Cambridge University Press.
Hutchins, J. (1996). ALPAC: the (in)famous report, MT News International 14, June 1996, 9-12.
Hutchins, J. (1986). Machine Translation: Past, Present, Future. Chichester: Ellis Horwood Series in Computers and their Applications.
Hutchins, J. (1993). Latest Developments in Machine Translation Technology: Beginning a New Era in MT Research. In Proceedings of MT Summit IV: International cooperation for global communication, July 20-22, 1993, Kobe, Japan, 11-34.
Hutchins, J. (2003). Machine translation: general overview. In R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics, Oxford: Oxford University Press, 501-511.
Isabelle, P., Dymetman, M., Macklovitch, E. (1988). CRITTER: a translation system for agricultural market reports. Proceedings of the 12th Conference on Computational Linguistics, Volume 1, Budapest, 261-266.
Lenci, A., Bel, N., Busa, F., Calzolari N., Gola, E., Monachini, M., Ogonowski, A., Peters I., Peters, W., Ruimy, N., Villegas, M., Zampolli, A. (2000). SIMPLE: A General Framework for the Development of Multilingual Lexicon, International Journal of Lexicography, special issue, Dictionaries, Thesauri and Lexical-Semantic Relations, 13(4), 249-263.
Nagao, M. (1984). A framework of a mechanical translation between Japanese and English by analogy principle. In Artificial and Human Intelligence: edited review papers at the International NATO Symposium on Artificial and Human Intelligence sponsored by the Special Programme Panel held in Lyon, France, October, 1981. Amsterdam: Elsevier Science Publishers, 173-180.
Pustejovsky, J. (1995). The Generative Lexicon. Cambridge, MA: The MIT Press.
Ruimy, N., Corazzari, O., Gola, E., Spanu, A., Calzolari, N., Zampolli, A. (1998). The European LE-PAROLE Project: The Italian Syntactic Lexicon, LREC (1998) First International Conference on Language Resources and Evaluation Proceedings, I, Granada, Spain, 241-248.
Ruimy, N., Monachini, M., Distante, R., Guazzini, E., Molino, S., Ulivieri, M., Calzolari, N., Zampolli, A. (2002). CLIPS, a Multi-level Italian Computational Lexicon, LREC (2002) Third International Conference on language resources and evaluation proceedings, III, Las Palmas de Gran Canaria, 792-799.
Ruimy, N., Monachini, M., Gola, E., Calzolari, N., Del Fiorentino, M.C., Ulivieri, M., Rossi, S. (2003). A computational semantic lexicon of Italian: SIMPLE. In A. Zampolli, N. Calzolari, L. Cignoni (Eds.), Computational Linguistics in Pisa. Linguistica Computazionale, Special Issue, XVIII-XIX (II), Pisa-Roma: IEPI, 821-864.
Ruimy, N. (2010). Simple_PLUS: a network of lexical semantic relations. Procesamiento del Lenguaje Natural, 44, Sociedad Española para el Procesamiento del Lenguaje Natural, 99-106.
Sanfilippo, A., et al. (1998) EAGLES Preliminary recommendations on semantic encoding, The EAGLES Lexicon Interest Group, http://www.ilc.cnr.it/EAGLES/EAGLESLE.PDF.
Sharoff, S. (2006). Open-source corpora: using the net to fish for linguistic data, The International Journal of Corpus Linguistics, 11(4), 435-462.
Speranza, Join the Grice Club.
Turcato, D., Mcfetridge, P., Popowich, F., Toole, J. (1999). A unified example-based and lexicalist approach to machine translation. In Proceedings of the 8th International Conference on theoretical and methodological issues in Machine Translation (TMI '99), Chester.
Wade, J.C., Federici, S. (2006). Struttura-significato. Il processo di traduzione. In R. Pititto, S. Venezia (Eds.), Tradurre e comprendere: pluralità dei linguaggi e delle culture (Atti del XII Congresso Nazionale della Società di Filosofia del Linguaggio, Piano di Sorrento 2005), Roma: Aracne, 307-332.
Whitelock, P. (1992). Shake-and-bake translation, Proceedings of the 14th conference on Computational linguistics, 2, Nantes, 784-791.
Widdowson, H.G. (1978). Teaching Language as Communication. Oxford: Oxford University Press.
