MEANING will automatically collect and analyse language
data from the WWW on a large scale, and build more comprehensive multilingual
lexical knowledge bases to support improved word sense disambiguation (WSD).
Current web access applications are based on words; MEANING will open
the way for access to the Multilingual Web based on concepts, providing
applications with capabilities that significantly exceed those currently
available. MEANING will facilitate development of concept-based open-domain
Internet applications (such as Question Answering, Cross-Lingual Information
Retrieval, Summarisation, Text Categorisation, Event Tracking, Information
Extraction, and Machine Translation). Furthermore, MEANING will supply
a common conceptual structure to Internet documents, thus facilitating
knowledge management of web content.
Progress is being made in Human Language Technology (HLT), but there
is still a long way to go towards Natural Language Understanding (NLU). An important
step towards this goal is the development of technologies and resources
that deal with concepts rather than words. MEANING will develop concept-based
technologies and resources through large-scale knowledge processing over
the web, robust and fast machine learning algorithms, very large lexical
resources and novel strategies for combining them. Small-scale, isolated
experiments with limited infrastructure (such as Internet access, processing
power, and storage space) have no chance of bridging the gap to understanding.
Advances in this area can only be expected in the context of large-scale
long-term research projects.
MEANING will treat the web as a (huge) corpus to learn information from,
since even the largest conventional corpora available (e.g. the Reuters
corpus, the British National Corpus) are not large enough to yield
reliable, sufficiently detailed information about language behaviour.
Moreover, most European languages do not have large or diverse enough corpora
MEANING: Developing Multilingual Web-scale Language Technologies