One of the biggest challenges in Linguistic Engineering is to extract, store, organize, and update encyclopedic information, which is growing steadily. In recent years, notably through the Linked Data project, a large number of structured repositories have become accessible and linked on the Web, containing information about companies, products, scientific terms, writers, composers, works of art, music, geographic places, etc. These repositories are updated only when the well-structured external data sources they depend on are also updated. One of the challenges of the coming years is to update encyclopedic data automatically by extracting information not only from other structured sources, which have high maintenance costs, but also directly from source text, i.e., from linguistic corpora.
Driven by this need, the research group ProLNat @ GE, coordinated by Pablo Gamallo (Citius), researcher at the University of Santiago de Compostela and co-founder of Cilenis, has been working on the OntoPedia project since 2011. The project, funded by MICINN, aims to acquire, organize, and automatically update large amounts of encyclopedic information by designing and applying natural language processing and information extraction techniques. It focuses on written text in four languages: English, Spanish, Portuguese, and Galician. All the tools and resources generated by the project will be available under a free license (General Public License).
More precisely, the group has developed a system that extracts information about Named Entities (NEs) by mining a large corpus of encyclopedic knowledge (Wikipedia). The extracted information is stored in a knowledge base made up of a collection of triplets, which represent properties, basic facts, and events related to NEs and domain-specific terms. Within this project, we have also designed ONTOpedia, a search engine that operates on a 47 GB collection of triplets and retrieves the information requested by the user. Each triplet holds information about an NE or term and consists of three elements (Object1, Relation, Object2). Examples of triplets are the following:
Object1 | Relation | Object2
Mourinho | awards | award Limón in 2002…
Rajoy | current age | 58
Aneto | elevation | 3404 m
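As a rough illustration, a triplet can be modeled as a small immutable record with three named fields. The Python sketch below is an assumption made for clarity, not the project's actual storage format: the class name `Triplet` and its field names are hypothetical, and the values are copied from the examples above.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """One (Object1, Relation, Object2) fact.

    Hypothetical model for illustration; not ONTOpedia's own schema.
    """
    object1: str
    relation: str
    object2: str


# The three example triplets, kept as plain strings.
knowledge_base = [
    Triplet("Mourinho", "awards", "award Limón in 2002…"),
    Triplet("Rajoy", "current age", "58"),
    Triplet("Aneto", "elevation", "3404 m"),
]

# Fields can be read by name or unpacked by position.
for object1, relation, object2 in knowledge_base:
    print(f"{object1} | {relation} | {object2}")
```

Keeping all three fields as strings mirrors the way the examples are written; a real knowledge base would likely add typed values and provenance, which are left out here.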
Queries can be performed on any of the three elements of a triplet. For example, if a query consists of two keywords, “Mourinho” in the field Object1 and “award” in the field Relation, the system returns all triplets that match the query, including, besides the Limón Prize, World Soccer and his Honoris Causa.
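The field-based query just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function `search`, the `Triplet` record, and the case-insensitive substring matching are all hypothetical choices, and the real engine presumably uses an index rather than a linear scan over its triplet collection.

```python
from typing import Iterable, List, NamedTuple, Optional


class Triplet(NamedTuple):
    """Hypothetical (Object1, Relation, Object2) record."""
    object1: str
    relation: str
    object2: str


def search(
    triplets: Iterable[Triplet],
    object1: Optional[str] = None,
    relation: Optional[str] = None,
    object2: Optional[str] = None,
) -> List[Triplet]:
    """Return the triplets whose fields contain every keyword that was given.

    A field left as None is unconstrained. Matching is a case-insensitive
    substring test, which is an assumption made for this sketch.
    """
    wanted = (object1, relation, object2)
    hits = []
    for triplet in triplets:
        if all(
            keyword is None or keyword.lower() in field.lower()
            for keyword, field in zip(wanted, triplet)
        ):
            hits.append(triplet)
    return hits


kb = [
    Triplet("Mourinho", "awards", "award Limón in 2002…"),
    Triplet("Rajoy", "current age", "58"),
    Triplet("Aneto", "elevation", "3404 m"),
]

# "Mourinho" in Object1 and "award" in Relation, as in the example query.
hits = search(kb, object1="Mourinho", relation="award")
```

With only the three sample triplets loaded, `hits` holds the single Mourinho entry; over the full collection the same call would also return his other awards.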
Future work has several important goals: to improve the search engine through expansion with synonyms and translation equivalents; to build a computational architecture that allows the corpus to be updated, by designing and implementing specific tools for each source of information (apart from Wikipedia, information will be extracted from various newspapers and blogs); and to incorporate a new module for analyzing questions in natural language and then build a Question Answering system. Our goal is for users to obtain short answers by putting questions to the knowledge base in natural language: What awards did Mourinho win? How old is Rajoy? What is the altitude of Monte Aneto?
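To make the planned question-analysis step more concrete, the sketch below maps the three example questions onto (Object1, Relation) keyword pairs using hand-written patterns. It is purely illustrative and not the project's design: the pattern list, the relation keywords, and the function `question_to_query` are all hypothetical, and a real module would rely on linguistic analysis rather than fixed regular expressions.

```python
import re
from typing import Optional, Tuple

# Hypothetical question shapes, each paired with a Relation keyword.
PATTERNS = [
    (re.compile(r"^what awards did (?P<entity>.+?) win\?*$", re.IGNORECASE), "award"),
    (re.compile(r"^how old is (?P<entity>.+?)\?*$", re.IGNORECASE), "age"),
    (re.compile(r"^what is the altitude of (?P<entity>.+?)\?*$", re.IGNORECASE), "elevation"),
]


def question_to_query(question: str) -> Optional[Tuple[str, str]]:
    """Return (Object1 keyword, Relation keyword), or None if no pattern applies."""
    text = question.strip()
    for pattern, relation in PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("entity").strip(), relation
    return None


for q in (
    "What awards did Mourinho win?",
    "How old is Rajoy?",
    "What is the altitude of Monte Aneto?",
):
    print(q, "->", question_to_query(q))
```

Note that the third pattern hard-codes the link from "altitude" to the relation keyword "elevation"; this is the kind of mapping that the synonym expansion mentioned above would be expected to handle in a more general way.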
The idea underlying ONTOpedia is close to that of the renowned IBM Watson system, the first Q&A search engine to win a TV contest against two humans, on the popular American program Jeopardy.