Archive for the Category » 5. Teaching «

Tuesday, May 10th, 2016

After a guest lecture in the second-year module “Data: From the Source to the Senses”, I was asked to organize a data wrangling workshop. After a short introduction and a demonstration of Trifacta, the data wrangling tool I proposed they use, I let them work on data from the university's timetabling system for compliance checking, exploration, or trend analysis purposes, or on data they had found themselves for their self-chosen data visualization project in the module. After the workshop, I asked who would use Trifacta in their project, and a majority of the hands went up. Most of them had been using Excel for their data wrangling tasks before that. Quite a success for Trifacta that the majority of the students were convinced of its value and strengths after just a 1.5-hour workshop.

I would also like to share the two example cases that came up with the students and that I found most convincing.

Example case 1:
One group of students found some data that had six columns of the form “Male A”, “Female A”, “Male B”, “Female B”, “Male C”, “Female C”. They wanted to reshape this into a form with two rows per original row, one for “Male” and one for “Female”, with a “Gender” column, three columns “A”, “B”, and “C”, and all the values in the other columns duplicated. We of course recognized this as a case for unpivot, but not a trivial one, because we needed to unpivot two sets of three columns at the same time! We achieved it by first nesting the three columns for both “Male” and “Female”, then doing the unpivot, and then unnesting the result. Some renaming of columns, some replacing of values, and some dropping of irrelevant columns in between, and done … for the whole data set! This particular group of students was quite impressed by this feat.
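For readers who would rather see this in code than in a wrangling tool: below is a minimal sketch of the same reshape in Python/pandas. The data frame and column names are made up for illustration; the students' actual recipe consisted of Trifacta's nest, unpivot, and unnest steps.

```python
import pandas as pd

# Made-up data with the same shape as the students' data set
df = pd.DataFrame({
    "Country": ["X", "Y"],
    "Male A": [10, 20], "Female A": [11, 21],
    "Male B": [12, 22], "Female B": [13, 23],
    "Male C": [14, 24], "Female C": [15, 25],
})

# Unpivot all six columns, split e.g. "Male A" into a gender and a category label,
# then pivot the category labels back into the columns A, B, and C.
long = df.melt(id_vars=["Country"], var_name="col", value_name="value")
long[["Gender", "cat"]] = long["col"].str.split(" ", expand=True)
result = (long.pivot(index=["Country", "Gender"], columns="cat", values="value")
              .reset_index())
# result now has two rows per original row: Country, Gender, A, B, C
```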

Example case 2:
Another group of students found data on countries that they wanted to use for a kind of network analysis. The data included 6 columns with up to 6 languages spoken in those countries. For the network analysis, they wanted to reshape this data into a form with one row per combination of countries in which the same language was spoken. They were obviously at a loss as to how to achieve that. Being a database person myself, I recognized this as a self-join on language. What we did was first unpivot on the six language columns to obtain 6 rows per country, one for each language. For countries with fewer than 6 languages, some of the rows had an empty cell in the new “Language” column. These were easily dropped with a few clicks. Then we generated the resulting data set so that we had the file twice, and did a join between the two files on their respective “Language” columns. As you can imagine, this particular group was also quite impressed that this could be achieved in a matter of minutes for the whole data set.
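Again purely as an illustration (not Trifacta's actual recipe), here is a minimal pandas sketch of this unpivot plus self-join, with made-up country and language data:

```python
import pandas as pd

# Made-up country data with up to three language columns
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Lang 1": ["en", "en", "fr"],
    "Lang 2": ["fr", None, None],
})

# Unpivot the language columns (one row per country/language combination)
# and drop the rows with an empty cell in the new Language column
langs = (df.melt(id_vars="Country", value_name="Language")
           .drop(columns="variable")
           .dropna(subset=["Language"]))

# Self-join on Language: one row per pair of countries sharing a language
pairs = langs.merge(langs, on="Language", suffixes=("_1", "_2"))
pairs = pairs[pairs["Country_1"] < pairs["Country_2"]]  # keep each pair only once
```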

One should know that these students typically do not have much programming or database experience. But with the suggestions made by Trifacta, possibly after modifying the commands behind these suggestions a bit, they were quickly able to do their own wrangling without any instruction beforehand. In my opinion, this suggests that the tool is suitable for a wide audience of not-that-technical users who need to do data-driven analyses.

Another observation I want to make is about data quality. Trifacta throws it in your face that your data is dirty (which is a good thing!). There are histograms above all columns, as well as a bar indicating the amount of trustworthy (green), suspicious (red), and missing (black) values. I have yet to see a data set that didn't have red and black parts for one or more columns. Of course there are data quality problems! Always! So I think it is very good that this is so in-your-face, making users aware that their data is dirty and that they need to do something about it to make their visualizations and analytic results reliable. I often use the term responsible analytics: a data scientist should know about the data quality problems in their data and how they affect the results (and be open and tell you about it). Furthermore, the functionality and suggestions for data cleaning are quite good. I use the ETL tool Pentaho Data Integration (a.k.a. Kettle) in another course, but I definitely think that Trifacta is better for detecting and solving data quality problems (as well as for doing transformations such as the ones above).
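To give an idea of what such per-column quality indicators boil down to, here is a rough pandas sketch with a hypothetical input file and a deliberately crude "does it parse as a number" check; Trifacta's own checks are type-aware and far richer:

```python
import pandas as pd

df = pd.read_csv("timetable.csv")  # hypothetical input file

# For every column: the fraction of missing values (the "black" part) and the
# fraction of values that do not parse as a number (a crude stand-in for "red")
for col in df.columns:
    missing = df[col].isna().mean()
    parsed = pd.to_numeric(df[col], errors="coerce")
    mismatched = (parsed.isna() & df[col].notna()).mean()
    print(f"{col}: {missing:.0%} missing, {mismatched:.0%} not numeric")
```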

In conclusion, good work Trifacta! A valuable addition to a data scientist’s toolbox.

Category: 5. Teaching, Data cleaning, Data science
Tuesday, March 3rd, 2015

I have been nominated for the “decentrale onderwijsprijs”, a teaching award issued by the computer science student association Inter-Actief. Part of the election process is that each nominee gives a short (10-15 min) mini-lecture. Mine was about “Onzekere databases” (Uncertain databases; in Dutch). The winner will be announced on March 9th, 2015.

Category: 5. Teaching
Friday, November 8th, 2013

Mathematics teacher Dick Meijer wrote a piece about TOM: ‘De Treurige TOM Top Tien’ (The Sad TOM Top Ten). I, on the other hand, am predominantly positive about TOM. In my role as module coordinator of ‘Parels der Informatica’ (Pearls of Computer Science), the first module of the Technical Computer Science (TI) programme, I wrote a piece as a counterweight to Dick's: De Fleurige TOM Top Tien (The Cheerful TOM Top Ten).

Wednesday, June 26th, 2013

ACM TechNews picked up the UT homepage news item Gauging the Risk of Fraud From Social Media about Henry Been's master project “Finding you on the Internet”.

Thursday, June 20th, 2013

On 20 June 2013, Ben Companjen defended his MSc thesis on matching author names on publications to researcher profiles on the scale of The Netherlands. He carried out this research at DANS where he applied and validated his techniques on a coupling between the NARCIS scholarly database and the researcher profile database VSOI.
“Probabilistically Matching Author Names to Researchers”[download]
Publications are the most important form of scientific communication, but science also consists of researchers, research projects, and organisations. The goal of NARCIS (National Academic Research and Collaboration Information System) is to provide a complete and concise view of current science in the Netherlands.
Connecting publications in retrospect to the researchers, projects, and organisations that created them is hard, because author identifiers are rarely used in publications and researcher profiles. There is too much data to identify all researchers in NARCIS manually, so an automatic method is needed to assist in completing the view of science in the Netherlands.
In this thesis the problems that limit automatic connection of author names in publications to researchers are explored and a method to automatically connect publications and researchers is developed and evaluated.
Using only the author names themselves finds the correct researcher for around 80% of the author names in an experiment using two test sets. However, none of the correct matches were given the highest confidence of the returned matches. Over 90% of the correct matches were ranked second by confidence; other correct matches were ranked lower. Using probabilistic results allows working with the correct results, even if they are not the best match. Many names that should not match were included in the matches. The matching algorithm can be optimised to assign confidence to matches differently.
Including a matching function that compares publication titles and researchers' project titles did not improve the results, but better results are expected when more context elements are used to assign confidences.

Tuesday, June 18th, 2013

The news feed of the UT homepage features an item about the master project of Henry Been:
Gauging the risk of fraud from social media.

Tuesday, June 18th, 2013

On 18 June 2013, Henry Been defended his MSc thesis on an attempt to find a person's Twitter account given only their name, address, telephone number, and email address, for the purpose of risk analysis for fraud detection. It turned out that he could determine a set of a few tens to hundreds of candidate Twitter accounts among which the correct one was indeed present in almost all cases. Henry also paid much attention to the ethical aspects surrounding this research. A news item on the UT homepage made it onto ACM TechNews.
“Finding you on the Internet: Entity resolution on Twitter accounts and real world people”[download]
Over the last years, online social network sites (SNS) have become very popular. There are many scenarios in which it might prove valuable to know which accounts on an SNS belong to a person. For example, the Dutch social investigative authority is interested in extracting characteristics of a person from Twitter to aid in their risk analysis for fraud detection.
In this thesis, a novel approach to finding a person's Twitter account using only known real-world information is developed and tested. The developed approach operates in three steps. First, a set of heuristic queries using known information is executed to find possibly matching accounts. Second, all these accounts are crawled and information about the account, and thus its owner, is extracted. Currently, the name, URLs, description, language of the tweets, and geotags are extracted. Third, all possible matches are examined and the correct account is determined.
This approach differs from earlier research in that it does not work with extracted and cleaned data sets, but directly with the Internet. The prototype has to cope with all the “noise” on the Internet, like slang, typos, incomplete profiles, etc. Another important part of the approach was the repetition of the three steps. It was expected that repeatedly discovering candidates, enriching them, and eliminating false positives would increase the chance that, over time, the correct account “surfaces”.
During development of the prototype, ethical concerns surrounding both the experiments and the application in practice were considered and judged morally justifiable.
Validation of the prototype in an experiment showed that the first step is executed very well. In an experiment with 12 subjects with a Twitter account, an inclusion of 92% was achieved. This means that for 92% of the subjects the correct Twitter account was found and thus included as a possible match. A number of variations of this experiment were run, which showed that inclusion of both first and last name is necessary to achieve this high inclusion. Leaving out physical addresses, e-mail addresses, and telephone numbers does not influence inclusion.
In contrast to those of the first step, the results of the third step were less accurate. The currently extracted features cannot be used to predict whether a possible match is actually the correct Twitter account or not. However, there is much ongoing research into feature extraction from tweets and Twitter accounts in general. It is therefore expected that enhancing feature extraction using new techniques will make it a matter of time before it is also possible to identify correct matches in the candidate set.

Monday, June 3rd, 2013

Two Master students of mine, Jasper Kuperus and Jop Hofsté, each have a paper at FORTAN 2013, co-located with EISIC 2013.
Increasing NER recall with minimal precision loss
Jasper Kuperus, Maurice van Keulen, and Cor Veenman
Named Entity Recognition (NER) is broadly used as a first step toward the interpretation of text documents. However, for many applications, such as forensic investigation, recall is currently inadequate, leading to loss of potentially important information. Entity class ambiguity cannot be resolved reliably due to the lack of context information or the exploitation thereof. Consequently, entity classification introduces too many errors, leading to severe omissions in answers to forensic queries.
We propose a technique based on multiple candidate labels effectively postponing decisions for entity classification to query time. Entity resolution exploits user feedback: a user is only asked for feedback on entities relevant to his/her query. Moreover, giving feedback can be stopped anytime when query results are considered good enough. We propose several interaction strategies that obtain increased recall with little loss in precision. [details]
Digital-forensics based pattern recognition for discovering identities in electronic evidence
Hans Henseler, Jop Hofsté, and Maurice van Keulen
With the pervasiveness of computers and mobile devices, digital forensics becomes more important in law enforcement. Detectives increasingly depend on the scarce support of digital specialists, which impedes the efficiency of criminal investigations. This paper proposes an algorithm to extract, merge, and rank identities that are encountered in the electronic evidence during processing. Two experiments are described, demonstrating that our approach can assist with the identification of frequently occurring identities so that investigators can prioritize the investigation of evidence units accordingly. [details]
Both papers will be presented at the FORTAN 2013 workshop, 12 Aug 2013, Uppsala, Sweden.

Sunday, May 26th, 2013

Following New Scientist and WebWereld, the UT homepage now also features an article about my identity extraction work together with Fox-IT: “Tracks Inspector brengt binnen paar uur netwerk van verdachte in kaart” (Tracks Inspector maps a suspect's network within a few hours; in Dutch).

Monday, May 20th, 2013

One of my Master students, Oliver Jundt, has a paper at EUSFLAT 2013.
Sample-based XPath Ranking for Web Information Extraction
Oliver Jundt and Maurice van Keulen
Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive as some applications require information extraction from previously unseen websites. This paper approaches the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a ‘search – search result page – detail page’ setup. The approach is a wrapper induction approach which uses a small and easily obtainable set of sample data for ranking XPaths on their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute.
The paper will be presented at the EUSFLAT 2013 conference, 11-13 Sep 2013, Milan, Italy. [details]