Archive for the Category » Data science «

Thursday, September 07th, 2017 | Author:

Loved to talk about “Data Quality” at the Second Data Science MeetUp Twente on Responsible Data Analytics. Thanks to Susan Kamies of MoneyBird and Sven van Munster of Green Orange for organizing the event.
[Slides on Slideshare] [2nd Data Science MeetUp Twente]

Tuesday, May 10th, 2016 | Author:

After a guest lecture in the second year module “Data:From the Source to the Senses“, I was asked to organize a data wrangling workshop. After a short introduction and a demonstration of TriFacta, the data wrangling tool that I proposed them to use, I let them work on data from the university’s timetabling system for compliance checking, exploration, or trend analysis purposes, or their own data that they found for their self-chosen data visualization project in the module. After the workshop, I asked who would use Trifacta in their project and a majority of the hands went up. Most of them were using Excel for their data wrangling tasks before that. Quite a success for Trifacta that the majority of the students were convinced of its value and strengths after just a 1.5 hour workshop.

I also like to share two example cases of the students that came up and that I found most convincing.

Example case 1:
One group of students found some data that had six columns of the form “Male A”, “Female A”, “Male B”, “Female B”, “Male C”, “Female C”. They wanted to reshape this into a form that had two rows per original row, one for “Male” and one for “Female”, with both a column for “Gender”, three columns “A”, “B”, and “C”, and all the values in the other columns duplicated. We of course recognized it as a case for unpivot, but not a trivial one, because we needed to unpivot two sets of three columns at the same time! We achieved it by first nesting the three columns for both “Male” and “Female”, then doing the unpivot, and then unnesting the result. Some rename-ing of columns, some replace-ing of values, and dropping of irrelevant columns inbetween and done … for the whole data set! This particular group of students was quite impressed by this feat.

Example case 2:
Another group of students found data on countries that they wanted to use for a kind of network analysis. The data included 6 columns with up to 6 languages that were spoken in those countries. For the network analysis, they wanted to reshape this data into a form with one row per combination of countries where the same language was spoken. Obviously, they were at a loss how to achieve that. As a database-person myself, I recognized this as a self-join on language. What we did was first unpivot on the six language columns to obtain 6 rows per country, one for each language. For countries with less than 6 languages, some of the rows had an empty cell in the new “Language” column. They were easily dropped with a few clicks. Then we generated the resulting data set to have the file twice. The we did a join with the other file on their respective “Language” columns. As you can imagine, this particular group was also quite impressed that this could be achieved in a matter of a few minutes for the whole data set.

One must know that these students typically do not have much programming or database experience. But with the suggestions made by Trifacta, possibly modifying the commands behind these suggestions a bit, they were quickly able to do their own wrangling without any instruction beforehand. In my opinion, this suggests that the tool is suitable for a wide audience of not-that-technical users that need to do data-driven analyses.

Another observation I want to make is about data quality. Trifacta throws it in your face that your data is dirty (which is a good thing!). There are histograms above all columns as well as a bar indicating the amount of trustful (green), suspicious (red), and missing (black) values. I have seen no case of a data set, that didn’t have red and black parts for one or more columns. Of course, there are data quality problems! Always! So, I think it is very good that this is so in-your-face making users aware that their data is dirty and they need to do something about it to make their visualizations and analytic results reliable. I often use the term responsible analytics: a data scientist should know about data quality problems in their data and how the affect the results (and be open and tell you about it). Furthermore, the functionality and suggestions for data cleaning are quite good. I use the ETL tool Pentaho data integration, aka Kettle, in another course, but I definitely think that Trifacta is better for detecting and solving data quality problems (as well as for doing such transformations as above).

In conclusion, good work Trifacta! A valuable addition to a data scientist’s toolbox.

Category: 5. Teaching, Data cleaning, Data science  | Comments off