Today, a PhD student of mine, Mohammad S. Khelghati, defended his thesis.
Deep Web Content Monitoring [Download]
In this thesis, we investigate the path towards a focused web harvesting approach which can automatically and efficiently query websites, navigate through results, download data, store it and track data changes over time. Such an approach can also help users access a complete collection of data relevant to their topics of interest and monitor it over time. To realize such a harvester, we focus on the following obstacles: finding methods that can achieve the best coverage in harvesting data for a topic; reducing the cost of harvesting a website, in terms of the number of submitted requests, by estimating its actual size; monitoring data changes over time in web data repositories; and combining our experiences in harvesting with the studies in the literature to suggest a general framework for designing and developing a web harvester. It is important to know how to configure harvesters so that they can be applied to different websites, domains and settings. These steps bring further improvements to the data coverage and monitoring functionalities of web harvesters and can help users such as journalists, business analysts, organizations and governments to reach the data they need without requiring extensive software and hardware facilities. With this thesis, we hope to have contributed to the goal of focused web harvesting and monitoring topics over time.
The project proposal “Time To Care: Using sensor technology to dynamically model social interactions of healthcare professionals at work in relation to healthcare quality” has been accepted in our university’s Tech4People program. The project is a cooperation with Educational Sciences (chair OWK) and Psychology of Conflict, Risk and Safety (chair PCRS) with whom the funded PhD student will be shared.
What I am particularly enthusiastic about in this project is that it is not only an interdisciplinary cooperation towards a shared goal, but that disciplinary research questions from each of the participating disciplines can also be answered. For me, it is a unique opportunity to test whether probabilistic modeling of the data quality problems / noise in the social interaction data obtained from the sensors indeed provides significantly different results when predicting team performance.
My PhD student Mohammad Khelghati released his web harvesting software, called HarvestED.
Today I gave a presentation at the Data Science Northeast Netherlands Meetup about
Managing uncertainty in data: the key to effective management of data quality problems [slides (PDF)]
Business analytics and data science are significantly impaired by a wide variety of ‘data handling’ issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems centers around data semantics and data quality. We have developed a generic method which is based on modeling such problems as uncertainty *in* the data. A recently conceived new kind of DBMS can store, manage, and query large volumes of uncertain data: the UDBMS or “Uncertain Database”. Together, they allow one to, e.g., postpone the resolution of data problems, assess what their influence is on analytical results, etc. We furthermore develop technology for data cleansing, web harvesting, and natural language processing which uses this method to deal with ambiguity of natural language and many other problems encountered when using unstructured data.
Dolf Trieschnigg and I received a subsidy to valorize some of the research results of the COMMIT/ TimeTrails, PayDIBI, and FedSS projects. The company involved is Mydatafactory.
SmartCOPI: Smart Consolidation of Product Information
[download public version of project proposal]
Maintaining the quality of detailed product data, ranging from data about required raw materials to detailed specifications of tools and spare parts, is of vital importance in many industries. Ordering or using wrong spare parts (based on wrong or incomplete information) may result in significant production loss or even impact health and safety. The web provides a wealth of information on products, provided in various formats and levels of detail and targeted at a variety of audiences. Semi-automatically locating, extracting and consolidating this information would be a “killer app” for enriching and improving product data quality, with a significant impact on production cost and quality. Mydatafactory, an industry partner new to COMMIT/, is interested in both the web harvesting and data cleansing technologies developed in COMMIT/-projects P1/Infiniti and P19/TimeTrails for this potential and for improving Mydatafactory’s data cleansing services. The ICT science questions behind data cleansing and web harvesting are how noise can be detected and reduced in discrete structured data, and how human cognitive skills in information navigation and extraction can be mimicked. Research results on these questions may benefit a wide range of applications from various domains such as fraud detection and forensics, creating a common operational picture, and safety in food and pharmaceuticals.
Today I gave a presentation at the SIKS Smart Auditing workshop at the University of Tilburg.
Tweakers.net, NU.nl and Kennislink.nl picked up the UT homepage news item on the research of my PhD student Mena Badieh Habib on Named Entity Extraction and Named Entity Disambiguation.
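Tweakers.net: UT laat politiecomputers tweets ‘begrijpen’ voor veiligheid bij evenementen (UT lets police computers ‘understand’ tweets for safety at events)
NU.nl: Universiteit Twente laat computers beter begrijpend lezen (University of Twente makes computers better at reading comprehension)
Kennislink.nl: Twentse computer leest beter (Twente computer reads better)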
The news feed of the UT homepage features an item on the research of my PhD student Mena Badieh Habib.
Computers leren beter begrijpend lezen dankzij UT-onderzoek (“Computers learn better reading comprehension thanks to UT research”; in Dutch).
Mena defended his PhD thesis entitled “Named Entity Extraction and Disambiguation for Informal Text – The Missing Link” on May 9th.
Today, a PhD student of mine, Mena Badieh Habib Morgan, defended his thesis.
Named Entity Extraction and Disambiguation for Informal Text – The Missing Link
Social media content represents a large portion of all textual content appearing on the Internet. These streams of user generated content (UGC) provide an opportunity and challenge for media analysts to analyze huge amounts of new data and use them to infer and reason with new information. A main challenge of natural language is its ambiguity and vagueness. When we move to informal language widely used in social media, the language becomes even more ambiguous and thus more challenging for automatic understanding. Named Entity Extraction (NEE) is a subtask of Information Extraction (IE) that aims to locate phrases (mentions) in the text that represent names of entities such as persons, organizations or locations regardless of their type. Named Entity Disambiguation (NED) is the task of determining which person, place, event, etc. is referred to by a mention. The main goal of this thesis is to mimic the human way of recognition and disambiguation of named entities, especially for domains that lack formal sentence structure. We propose a robust combined framework for NEE and NED in semi-formal and informal text. The achieved robustness has been proven to be valid across languages and domains and to be independent of the selected extraction and disambiguation techniques. It is also shown to be robust against a shortage of labeled training data and against the informality of the used language.
Today I’m going to give a presentation about my fraud detection research for the SCS chair.
Information Combination and Enrichment for Data-Driven Fraud Detection
Governmental organizations responsible for keeping certain types of fraud under control often use data-driven methods both for immediate detection of fraud and for fraud risk analysis aimed at more effectively targeting inspections. A blind spot in such methods is that the source data often represents a ‘paper reality’. Fraudsters will attempt to disguise themselves in the data they supply, painting a world in which they do nothing wrong. This blind spot can be counteracted by enriching the data with traces and indicators from more ‘real-world’ sources such as social media and the internet. One of the crucial data management problems in accomplishing this enrichment is how to capture and handle uncertainty in the data. The presentation will start with a real-world example, which is also used as a starting point for a problem generalization in terms of information combination and enrichment (ICE). We then present the ICE technology we have developed and a few more applications in which it has been or is intended to be applied. In terms of the 3 V’s of big data (volume, velocity, and variety), this presentation focuses on the third V: variety.
Date: Wednesday, April 16th, 2014
Room: ZI 2042
Time: 12:30-13:30 hrs