Archive for the Category » Data cleaning «

Tuesday, May 10th, 2016 | Author:

After a guest lecture in the second-year module “Data: From the Source to the Senses”, I was asked to organize a data wrangling workshop. After a short introduction and a demonstration of Trifacta, the data wrangling tool that I proposed they use, I let them work either on data from the university’s timetabling system (for compliance checking, exploration, or trend analysis) or on their own data that they had found for their self-chosen data visualization project in the module. After the workshop, I asked who would use Trifacta in their project and a majority of the hands went up. Most of them had been using Excel for their data wrangling tasks before that. Quite a success for Trifacta that the majority of the students were convinced of its value and strengths after just a 1.5-hour workshop.

I would also like to share the two example cases that came up with the students and that I found most convincing.

Example case 1:
One group of students found some data that had six columns of the form “Male A”, “Female A”, “Male B”, “Female B”, “Male C”, “Female C”. They wanted to reshape this into a form that had two rows per original row, one for “Male” and one for “Female”, with a column for “Gender”, three columns “A”, “B”, and “C”, and all the values in the other columns duplicated. We of course recognized it as a case for unpivot, but not a trivial one, because we needed to unpivot two sets of three columns at the same time! We achieved it by first nesting the three columns for both “Male” and “Female”, then doing the unpivot, and then unnesting the result. Some renaming of columns, some replacing of values, and dropping of irrelevant columns in between and done … for the whole data set! This particular group of students was quite impressed by this feat.
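
The reshaping described above can also be sketched in a few lines of plain Python. This is only an illustration of the target shape, not the Trifacta recipe we used in the workshop: the function name `unpivot_gender_columns`, the extra “Region” column, and the sample values are all made up for the example.

```python
def unpivot_gender_columns(rows, genders=("Male", "Female"), measures=("A", "B", "C")):
    """Turn each wide row into one row per gender, keeping all other columns.

    Assumes the wide columns are named "<gender> <measure>", e.g. "Male A".
    """
    wide_columns = {f"{g} {m}" for g in genders for m in measures}
    result = []
    for row in rows:
        # Columns outside the Male/Female block are duplicated into every new row.
        shared = {k: v for k, v in row.items() if k not in wide_columns}
        for g in genders:
            new_row = dict(shared)
            new_row["Gender"] = g
            for m in measures:
                # "Male A" and "Female A" both end up in column "A".
                new_row[m] = row[f"{g} {m}"]
            result.append(new_row)
    return result


# Hypothetical sample row; the real data set of the students is not shown here.
rows = [
    {"Region": "North",
     "Male A": 1, "Female A": 2,
     "Male B": 3, "Female B": 4,
     "Male C": 5, "Female C": 6},
]

long_rows = unpivot_gender_columns(rows)
for r in long_rows:
    print(r)
```

Each original row yields two rows, one with Gender “Male” and one with Gender “Female”, and the “Region” value is repeated in both.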

Example case 2:
Another group of students found data on countries that they wanted to use for a kind of network analysis. The data included 6 columns with up to 6 languages that were spoken in those countries. For the network analysis, they wanted to reshape this data into a form with one row per combination of countries where the same language was spoken. Understandably, they were at a loss how to achieve that. As a database person myself, I recognized this as a self-join on language. What we did was first unpivot on the six language columns to obtain 6 rows per country, one for each language. For countries with fewer than 6 languages, some of the rows had an empty cell in the new “Language” column. These were easily dropped with a few clicks. Then we generated the resulting data set, so that we had the same file twice. Then we did a join with the other file on their respective “Language” columns. As you can imagine, this particular group was also quite impressed that this could be achieved in a matter of a few minutes for the whole data set.
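
For readers who prefer to see the idea in code, here is a minimal plain-Python sketch of the same three steps: unpivot, drop empty cells, self-join on language. The function names, the column names “Language 1” to “Language 3”, and the toy data are assumptions for illustration only. Unlike a raw self-join, the sketch keeps each pair of countries once and leaves out the pairing of a country with itself, which is usually what a network analysis needs.

```python
from collections import defaultdict
from itertools import combinations


def unpivot_languages(countries, language_columns):
    """Return (country, language) pairs, skipping empty language cells."""
    pairs = []
    for row in countries:
        for col in language_columns:
            language = row.get(col)
            if language:  # drops None and "" (countries with fewer languages)
                pairs.append((row["Country"], language))
    return pairs


def self_join_on_language(pairs):
    """Return one (country_a, country_b, language) edge per shared language."""
    by_language = defaultdict(set)
    for country, language in pairs:
        by_language[language].add(country)
    edges = []
    for language in sorted(by_language):
        # Every unordered pair of countries that share this language.
        for a, b in combinations(sorted(by_language[language]), 2):
            edges.append((a, b, language))
    return edges


# Toy sample with three language columns instead of six.
countries = [
    {"Country": "Belgium", "Language 1": "Dutch", "Language 2": "French", "Language 3": "German"},
    {"Country": "Netherlands", "Language 1": "Dutch", "Language 2": "", "Language 3": ""},
    {"Country": "France", "Language 1": "French", "Language 2": "", "Language 3": ""},
    {"Country": "Switzerland", "Language 1": "German", "Language 2": "French", "Language 3": "Italian"},
]

pairs = unpivot_languages(countries, ["Language 1", "Language 2", "Language 3"])
edges = self_join_on_language(pairs)
for edge in edges:
    print(edge)
```

The resulting edge list has one row per combination of countries and shared language, which is the form the students wanted as input for their network analysis.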

One must know that these students typically do not have much programming or database experience. But with the suggestions made by Trifacta, possibly modifying the commands behind these suggestions a bit, they were quickly able to do their own wrangling without any instruction beforehand. In my opinion, this suggests that the tool is suitable for a wide audience of not-that-technical users that need to do data-driven analyses.

Another observation I want to make is about data quality. Trifacta throws it in your face that your data is dirty (which is a good thing!). There are histograms above all columns as well as a bar indicating the amount of trustworthy (green), suspicious (red), and missing (black) values. I have not seen a single data set that didn’t have red and black parts for one or more columns. Of course, there are data quality problems! Always! So, I think it is very good that this is so in-your-face, making users aware that their data is dirty and that they need to do something about it to make their visualizations and analytic results reliable. I often use the term responsible analytics: a data scientist should know about the data quality problems in their data and how they affect the results (and be open and tell you about it). Furthermore, the functionality and suggestions for data cleaning are quite good. I use the ETL tool Pentaho Data Integration, aka Kettle, in another course, but I definitely think that Trifacta is better for detecting and solving data quality problems (as well as for doing such transformations as above).
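
To make the idea of such a quality bar concrete, the sketch below counts trustworthy, suspicious, and missing values for a single column. It is a rough illustration under my own assumptions and says nothing about how Trifacta actually computes its bars: the function names, the “looks like an integer” check, and the sample column are hypothetical.

```python
def quality_profile(values, is_valid):
    """Count trusted, suspicious, and missing values in one column."""
    counts = {"trusted": 0, "suspicious": 0, "missing": 0}
    for value in values:
        if value is None or str(value).strip() == "":
            counts["missing"] += 1      # the black part of the bar
        elif is_valid(value):
            counts["trusted"] += 1      # the green part of the bar
        else:
            counts["suspicious"] += 1   # the red part of the bar
    return counts


def looks_like_integer(value):
    """Hypothetical validity check for a column that should hold whole numbers."""
    try:
        int(str(value).strip())
        return True
    except ValueError:
        return False


# Made-up column with typical problems: empty cells, "n/a", and a decimal value.
ages = ["23", "41", "", "n/a", "37", None, "12.5"]

profile = quality_profile(ages, looks_like_integer)
print(profile)
```

Even this tiny column has red and black parts, which is the point: knowing the counts is the first step towards deciding what to do about them.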

In conclusion, good work Trifacta! A valuable addition to a data scientist’s toolbox.

Category: 5. Teaching, Data cleaning, Data science
Tuesday, February 02nd, 2016 | Author:

The project proposal “Time To Care: Using sensor technology to dynamically model social interactions of healthcare professionals at work in relation to healthcare quality” has been accepted in our university’s Tech4People program. The project is a cooperation with Educational Sciences (chair OWK) and Psychology of Conflict, Risk and Safety (chair PCRS) with whom the funded PhD student will be shared.

What I am particularly enthusiastic about in this project is that it is not only an interdisciplinary cooperation towards a shared goal, but also that disciplinary research questions from each of the participating disciplines can be answered. For me, it is a unique opportunity to test whether probabilistic modeling of the data quality problems / noise in the social interaction data obtained from the sensors indeed provides significantly different results when predicting team performance.

Thursday, January 14th, 2016 | Author:

Today I gave a presentation at the Data Science Northeast Netherlands Meetup about
Managing uncertainty in data: the key to effective management of data quality problems [slides (PDF)]

Business analytics and data science are significantly impaired by a wide variety of ‘data handling’ issues, especially when data from different sources are combined and when unstructured data is involved. The root cause of many such problems centers around data semantics and data quality. We have developed a generic method which is based on modeling such problems as uncertainty *in* the data. A recently conceived new kind of DBMS can store, manage, and query large volumes of uncertain data: the UDBMS or “Uncertain Database”. Together, they allow one to, e.g., postpone the resolution of data problems, assess what their influence is on analytical results, etc. We furthermore develop technology for data cleansing, web harvesting, and natural language processing which uses this method to deal with ambiguity of natural language and many other problems encountered when using unstructured data.

Wednesday, October 14th, 2015 | Author:

Dolf Trieschnigg and I got some subsidy to valorize some of the research results of the COMMIT/ TimeTrails, PayDIBI, and FedSS projects. The company involved is Mydatafactory.
SmartCOPI: Smart Consolidation of Product Information
[download public version of project proposal]
Maintaining the quality of detailed product data, ranging from data about required raw materials to detailed specifications of tools and spare parts, is of vital importance in many industries. Ordering or using wrong spare parts (based on wrong or incomplete information) may result in significant production loss or even impact health and safety. The web provides a wealth of information on products, provided in various formats and detail levels and targeted at a variety of audiences. Semi-automatically locating, extracting and consolidating this information would be a “killer app” for enriching and improving product data quality with a significant impact on production cost and quality. Mydatafactory, an industry partner that is new to COMMIT/, is interested in both the web harvesting and data cleansing technologies developed in COMMIT/-projects P1/Infiniti and P19/TimeTrails for this potential and for improving Mydatafactory’s data cleansing services. The ICT science questions behind data cleansing and web harvesting are how noise can be detected and reduced in discrete structured data, and how human cognitive skills in information navigation and extraction can be mimicked. Research results on these questions may benefit a wide range of applications from various domains such as fraud detection and forensics, creating a common operational picture, and safety in food and pharmaceuticals.