Archive for the Category » Publication metadata «

Thursday, June 20th, 2013 | Author:

On 20 June 2013, Ben Companjen defended his MSc thesis on matching author names on publications to researcher profiles at the scale of the Netherlands. He carried out this research at DANS, where he applied and validated his techniques on a coupling between the NARCIS scholarly database and the researcher profile database VSOI.
“Probabilistically Matching Author Names to Researchers”[download]
Publications are the most important form of scientific communication, but science also consists of researchers, research projects and organisations. The goal of NARCIS (National Academic Research and Collaboration Information System) is to provide a complete and concise view of current science in the Netherlands.
Connecting publications in retrospect to the researchers, projects and organisations that created them is hard, because author identifiers are rarely used in publications and researcher profiles. There is too much data to identify all researchers in NARCIS manually, so an automatic method is needed to help complete the view of science in the Netherlands.
In this thesis, the problems that limit the automatic connection of author names in publications to researchers are explored, and a method to automatically connect publications and researchers is developed and evaluated.
In an experiment with two test sets, matching on the author names alone found the correct researcher for around 80% of the author names. However, none of the correct matches received the highest confidence among the returned matches. Over 90% of the correct matches were ranked second by confidence, and the other correct matches were ranked lower. Working with probabilistic results makes it possible to use the correct results even when they are not the best match. Many names that should not match were nevertheless included in the matches. The matching algorithm can be optimised to assign confidence to matches differently.
Including a matching function that compares publication titles with researchers' project titles did not improve the results, but better results are expected when more context elements are used to assign confidences.
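To make the idea of ranked, probabilistic matches concrete, here is a minimal sketch in Python. It is not the method from the thesis: the similarity measure (`difflib.SequenceMatcher`), the `normalise` step, the `threshold` parameter and the example names are all assumptions chosen for illustration. It only shows how one author name can yield several candidate researchers, each with a confidence, rather than a single yes/no answer.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lowercase, drop periods and commas, and sort the tokens so that
    'Vries, Anna de' and 'Anna de Vries' compare as equal."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(sorted(cleaned.split()))


def confidence(author_name: str, researcher_name: str) -> float:
    """String similarity in [0, 1]. This is an assumed stand-in for a real
    probabilistic model, not the confidence used in the thesis."""
    a, b = normalise(author_name), normalise(researcher_name)
    return SequenceMatcher(None, a, b).ratio()


def rank_candidates(author_name, researcher_names, threshold=0.5):
    """Return (researcher, confidence) pairs at or above the threshold,
    ordered from most to least confident."""
    scored = [(r, confidence(author_name, r)) for r in researcher_names]
    kept = [(r, c) for r, c in scored if c >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical names, not taken from NARCIS or VSOI.
    researchers = ["Jan Jansen", "Anne de Vries", "Anna de Vries"]
    for researcher, score in rank_candidates("Vries, Anna de", researchers):
        print(f"{score:.2f}  {researcher}")
```

Keeping every candidate above the threshold, instead of only the top one, is what allows a correct match that was ranked second to still be used later on.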

Monday, May 20th, 2013 | Author:

One of my Master's students, Oliver Jundt, has a paper at EUSFLAT 2013.
Sample-based XPath Ranking for Web Information Extraction
Oliver Jundt and Maurice van Keulen
Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive as some applications require information extraction from previously unseen websites. This paper approaches the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a ‘search – search result page – detail page’ setup. The approach is a wrapper induction approach which uses a small and easily obtainable set of sample data for ranking XPaths on their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute.
The paper will be presented at the EUSFLAT 2013 conference, 11-13 Sep 2013, Milan, Italy [details]
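The core idea of ranking XPaths against sample data can be sketched as follows. This is a simplified illustration, not the ranking function from the paper: it assumes the detail pages are well-formed XML (real web pages would first need an HTML parser), it uses the limited XPath subset supported by Python's `xml.etree.ElementTree`, and the scoring rule (fraction of samples extracted exactly) and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def score_xpath(xpath, pages, samples):
    """Fraction of pages on which the first node selected by `xpath`
    has text equal to the known sample value for that page."""
    if not pages:
        return 0.0
    hits = 0
    for page, expected in zip(pages, samples):
        root = ET.fromstring(page)  # assumes well-formed markup
        node = root.find(xpath)
        if node is not None and (node.text or "").strip() == expected:
            hits += 1
    return hits / len(pages)


def rank_xpaths(candidates, pages, samples):
    """Return (xpath, score) pairs ordered from best to worst score."""
    scored = [(xp, score_xpath(xp, pages, samples)) for xp in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Two made-up detail pages and the known price for each one.
    pages = [
        "<html><body><h1>Camera A</h1><span class='price'>100</span></body></html>",
        "<html><body><h1>Camera B</h1><span class='price'>250</span></body></html>",
    ]
    samples = ["100", "250"]
    candidates = [".//h1", ".//span[@class='price']"]
    for xpath, score in rank_xpaths(candidates, pages, samples):
        print(f"{score:.2f}  {xpath}")
```

With more sample pages, an XPath that only matches by coincidence on one page scores lower than one that extracts the wanted attribute consistently.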

Wednesday, September 01st, 2010 | Author:

Urgently wanted: a student for the graduation assignment below, for the ESCAPE project.
Automatic enrichment
ESCAPE is a project working towards a new form of scholarly communication that is no longer based on articles alone. It is based on semantic web technology, which allows broad knowledge about articles, data, results, researchers, projects, organisations, and the relations between them to be stored, queried and manipulated. Entering the data and knowledge is, however, rather labour-intensive. This assignment is about developing tools for automatic enrichment of the data and knowledge. At the lowest level, this means importing publication data from publishers' websites and the like; at a higher level, it also means enrichment by automatically creating links to Open Linked Data and other databases and websites.