Uncategorized – Page 2 – Djoerd Hiemstra

Jop Hofste graduates on identity ranking in digital evidence data

Scalable identity extraction and ranking in Tracks Inspector

by Jop Hofste

The digital forensic world deals with a growing amount of data which should be processed. In general, investigators do not have the time to manually analyze all the digital evidence to get a good picture of the suspect. Most of the time investigations contain multiple evidence units per case. This research shows the extraction and resolution of identities out of evidence data. Investigators are supported in their investigations by proposing the involved identities to them. These identities are extracted from multiple heterogeneous sources like system accounts, emails, documents, address books and communication items. Identity resolution is used to merge identities at case level when multiple evidence units are involved.

The functionality for extracting, resolving and ranking identities is implemented and tested in the forensic tool Tracks Inspector. The implementation in Tracks Inspector is tested on five datasets. The results are compared with two other forensic products, Clearwell and Trident, on the extent to which they support the identity functionality. Tracks Inspector delivers very promising results compared to these products: its top 10 identities contain as many relevant identities as those of Clearwell and Trident, or more. Tracks Inspector also delivers high accuracy: in the tests, its precision is better than Clearwell's and its recall is approximately equal.
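The comparison above is stated in terms of relevant identities in the top 10, precision and recall. As a minimal sketch of how such numbers are commonly computed, the snippet below uses the standard definitions of precision and recall at a rank cutoff. The exact evaluation protocol of the thesis is not described here, so the function name, the cutoff handling and the example data are assumptions for illustration only.

```python
def precision_recall_at_k(ranked_identities, relevant_identities, k=10):
    """Standard precision and recall over the first k ranked items.

    ranked_identities: list of identities, best first (hypothetical input).
    relevant_identities: set of identities judged relevant for the case.
    """
    top_k = ranked_identities[:k]
    hits = sum(1 for identity in top_k if identity in relevant_identities)
    # Precision: which fraction of the returned identities is relevant.
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: which fraction of the relevant identities was returned.
    recall = hits / len(relevant_identities) if relevant_identities else 0.0
    return precision, recall


# Made-up example data, not taken from the thesis.
ranking = ["alice", "bob", "carol", "dave"]
relevant = {"alice", "carol", "erin"}
precision, recall = precision_recall_at_k(ranking, relevant, k=3)
```

A tool can score well on one measure and poorly on the other, which is why the abstract reports both.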

The contribution of this research is to show a method for the extraction and ranking of identities in Tracks Inspector. In the digital forensic world this is quite a new approach, because no other software products support this kind of functionality. Investigations can now start by exploring the most relevant identities in a case. The nodes which are involved in an identity can be quickly recognized. This means that the evidence data can be filtered at an early stage.

[download pdf]

Mark Kazemier graduates on social networks for primary education teachers

Integrating a social network into an administration system for primary education

by Mark Kazemier

Research by the Dutch educational inspectorate shows that there are still many problems within Dutch primary education (Inspectie van het onderwijs, 2010). Topicus develops ParnasSys, a pupil administration system that tries to solve these problems for primary education. Two of these problems are not solved by ParnasSys, however: teachers are uncertified and teaching material is often bad. With the recent increase in popularity of social networks, Topicus sees opportunities. This study shows that a social network should be integrated into ParnasSys as a stand-alone application. This means that when users log in to ParnasSys they get a new option to go to the social network, but the existing parts do not connect directly to the network.

Existing theory and implementations of social networks in education and corporations show that social networking creates new relationships between people that otherwise would not have existed. This leads to access to more information, new experiences and the creation of new content. The creation of new content can help teachers to select better teaching material, enhance their current teaching material and find solutions to issues they currently have in the classroom. They can also share their own experiences with others, helping other teachers increase their skills and experience.

When integrating a social network within ParnasSys there are two issues that need to be mitigated: 1) Copyright, 2) Privacy. Copyright can easily be mitigated by automatically posting all content on the network with a Creative Commons Attribution license. This means that everyone can use the content as long as they mention the author. When people post copyrighted content to the network, it can be removed once a takedown notice or report is received. Privacy is a more subtle issue. While privacy controls mitigate most of the issues, some issues persist. For example, when a teacher posts something about a pupil and the parent of this pupil is also a teacher with access to ParnasSys, this could lead to issues. The only way to mitigate this issue is by educating the users that those privacy issues exist.

It is recommended to integrate a social network within ParnasSys. There are two possibilities for further research. First, the research recommends integrating the social network as a stand-alone application to start with, but it is also recommended to look further into possibilities to connect several existing parts of ParnasSys with the network. For example, pages with information on tests could integrate with the network so that several users can work together on these tests. Second, finding information becomes more important as the network gets more users. While the interviews with users revealed no issues with finding information, this could become an issue in the future. It is therefore recommended to test several search methods and measure how many users use these methods to find the information they need.

[download pdf]

ACM SIGIR honors Norbert Fuhr

For pioneering contributions to approaches that now dominate the search industry, ACM SIGIR honors Norbert Fuhr from the University of Duisburg-Essen (Germany) with the 2012 Gerard Salton Award. Fuhr developed probabilistic retrieval models for databases and XML, and his research on probabilistic models anticipated the current interest in learning to rank approaches in search operations. Fuhr received the award at the ACM SIGIR Conference in Portland, Oregon, USA, where he gave the opening keynote address. Read more in the ACM Press release.

STW grant for StructWeb

Wim Korevaar received a valorization grant from the Dutch Technology Foundation STW for his proposal StructWeb: Structuring the Web for Organizations. The concept is based on an innovative information system, StructWeb, developed on the basis of the latest insights on search technology and making use of an intuitive user interface. The new technology will be used to help businesses and organizations to structure their vast information resources and make it easier for their staff and clients to access them.

More information at: structweb.nl.

Saving the Old IR Literature

The SIGIR project Saving the Old IR Literature has scanned and released a new batch of historic IR (Information Retrieval) papers, including early papers on the SMART system and papers on the development of test collections. The papers are written by, amongst others, Gerard Salton, Karen Sparck Jones, William Cooper, Keith van Rijsbergen, Stephen Robertson, Martin Kay, Michael Lesk, and Nicholas Belkin. The new batch is listed below and available from the SIGIR web site.

The collection contains some unique documents, for instance Karen Sparck Jones' and Keith van Rijsbergen's Report on the Need for and Provision for an 'IDEAL' Information Retrieval Test Collection written in 1975, which I anxiously searched for when doing my Ph.D. research. The document is an important milestone towards the current TREC conferences; work that already started in 1960 with Cyril Cleverdon's Cranfield experiments, one of Computer Science's earliest examples of empirical testing in a laboratory setting.

It's all there, enjoy!

Open source alternatives for Blackboard?

Since 2009, the University of Twente has used Blackboard as its online learning management system. However, Blackboard turns out to be very insecure; see for instance the news item (in Dutch) Universiteitssoftware blijkt langdurig lek. Among other things, it is not only possible but actually easy for students to hack into a teacher's account and invisibly change grades. As it turns out, this has been known amongst our students for quite some time.

Blackboard is a commercial system and its internals are a company secret. Kerckhoffs's principle states that a secure system must not require secrecy of its design, so that the design can be stolen by the enemy without causing trouble. In the design of software systems, this argument is used in favour of open source software security: security through obscurity is considered bad practice, see for instance Jaap-Henk Hoepman and Bart Jacobs' Communications of the ACM article Increased security through open source (CACM 50-1, 2007). So, maybe it is time to look at some of the open source alternatives out there, such as Sakai or Moodle. Both come with commercial support, in case our technical university does not want to invest in the expertise to deploy such a system in-house.

Keith van Rijsbergen retired

Keith van Rijsbergen is retiring this year. To celebrate his long successful career, you can download his book “Information Retrieval” in the popular epub format, an open format that is supported by most e-readers.

InformationRetrieval.epub

Since the publication in 1976 of the first edition of Van Rijsbergen’s book, it has established itself as a classic. The book gives a thorough introduction to “automatic ranked” retrieval, which today forms the basis of web search engines, but at that time was still highly experimental. The book covers all important information retrieval topics, but it is Van Rijsbergen’s personal view on information retrieval that makes the book so different from other scientific books on information retrieval: the book is written in the first person, a writing style I would normally not recommend for scientific documents. In this book, however, Van Rijsbergen’s personal style of writing inspired me a lot. Maybe it is his undisputed expertise, maybe it is his critical analysis of the work of others, or maybe it is merely his enthusiastic account of science. Whatever it is, it is a pleasure to read the book, even almost 35 years after its first publication. Here is a nice example, where Van Rijsbergen shares his view on significance tests:

Unfortunately, I have to agree with the findings of the Comparative Systems Laboratory in 1968, that there are no known statistical tests applicable to IR. This may sound like a counsel of defeat but let me hasten to add that it is possible to select a test which violates only a few of the assumptions it makes.

His analysis led me to use the paired sign test in my PhD thesis, and I motivated this by adding that Van Rijsbergen says I am allowed to do so. (Actually, he claims I am allowed to do so only conservatively, because some of the test’s assumptions are not met…) The book is also a no-nonsense book in many respects, with many practical approaches that are directly applicable. In several of our experiments, we used the stop word list printed in the book (see Table 2.1). This is science in its best form. Experiments should be easily reproducible, and what is easier than using an officially published stop word list?
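For readers who have not met the paired sign test mentioned above, here is a minimal sketch of the textbook two-sided version. It is not the code used in the thesis. It assumes that two systems are compared by a per-query score (for instance average precision), that tied queries are dropped, and that under the null hypothesis each system is equally likely to win a query. The function name and the example scores are made up.

```python
from math import comb


def paired_sign_test(scores_a, scores_b):
    """Two-sided exact sign test on paired per-query scores.

    Returns the p-value under the null hypothesis that each system
    wins a query with probability 0.5. Ties are discarded.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems need a score for every query")
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b  # number of non-tied queries
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins out of n fair coin flips.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the one-sided tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Made-up per-query scores for two hypothetical systems.
system_a = [0.5, 0.6, 0.7, 0.8, 0.9, 0.4]
system_b = [0.4, 0.5, 0.6, 0.7, 0.8, 0.4]
p_value = paired_sign_test(system_a, system_b)
```

The test only looks at which system wins each query, not by how much, which is why it needs so few assumptions about the distribution of the scores.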

So, if you are still looking for a good, personal, entertaining, no-nonsense, scientific book on information retrieval to be read by the pool during the holidays, please consider Information Retrieval. No e-reader yet? Then you can read the ebook using the EPUBReader Firefox addon.

[download epub]

Ralf Schimmel graduates on keyword suggestion

Keyword Suggestion for Search Engine Marketing

by Ralf Schimmel

Every person acquainted with the web is also a frequent user of search engines like Yahoo and Google. Any person with a web site makes this web site with a vision in mind; most of the time this entails being found on the web. Search engines offer several methods to users that help them to be found. One group of the techniques used in this field is Search Engine Optimization (SEO), which covers everything that can be done to optimize a web site for the search engine. The whole idea of SEO is to ensure that a web site is listed in the set of search results once a matching query is entered by a user. A second important part of the search engines is Search Engine Advertisement (SEA). Billions of dollars are paid by companies that bid on keywords that match their advertisements to a user's query. These keywords are hard to find: of course a company knows what it sells, but it does not know how the users search for the same products or services.

Advertising in search engines can be done in multiple ways. The focus of this research lies in finding many long-tail keywords, words that often have a low search volume, but which are cheap (low competition) and which are often specific enough to ensure high conversion rates (a visitor becomes a customer). Several keyword suggestion techniques are researched and evaluated for practical use. One applicable technique is chosen, implemented and evaluated. The chosen technique is a web-based technique that uses an undirected weighted graph of candidate terms (nodes), where the weight of an edge is the semantic similarity between the two nodes it connects, and where the term frequency of the term is stored in the node. The evaluation shows that it is a technique capable of suggesting a lot of relevant keywords that can be used for search engine marketing. According to the evaluation the technique is capable of using the term frequencies and the semantic similarities to find and rank suggestions based on popularity and relevance.

The most important conclusion is that, for single term suggestions, the system outperforms Google's suggestion system. Google's precision on single term suggestions is better than the precision of the new tool; however, the relative recall of Google is a lot worse, for both obvious and non-obvious single term suggestions. Currently the tool can only be used to complement Google's tool; however, once extended with support for multi term suggestions it can replace the entire system.
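To make the data structure in the abstract more concrete, here is a minimal sketch of an undirected weighted term graph with term frequencies stored in the nodes and semantic similarities on the edges. The abstract does not say how frequency and similarity are combined into a ranking, so the scoring formula below (similarity multiplied by the logarithm of the term frequency) is purely an assumption for illustration, as are the class name, the method names and the example terms.

```python
import math
from collections import defaultdict


class TermGraph:
    """Undirected weighted graph of candidate terms (illustrative only)."""

    def __init__(self):
        self.term_frequency = {}             # node -> term frequency
        self.similarity = defaultdict(dict)  # node -> {neighbor: edge weight}

    def add_term(self, term, frequency):
        self.term_frequency[term] = frequency

    def add_edge(self, term_a, term_b, weight):
        # Undirected: store the semantic similarity in both directions.
        self.similarity[term_a][term_b] = weight
        self.similarity[term_b][term_a] = weight

    def suggest(self, seed, limit=5):
        """Rank the neighbors of a seed term.

        Assumed score: similarity (relevance) times log(1 + frequency)
        (popularity). The thesis may well combine them differently.
        """
        scored = [
            (weight * math.log1p(self.term_frequency.get(neighbor, 0)), neighbor)
            for neighbor, weight in self.similarity.get(seed, {}).items()
        ]
        # Highest score first; break ties alphabetically for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [term for _, term in scored[:limit]]


# Made-up terms, frequencies and similarities.
graph = TermGraph()
for term, frequency in [("shoes", 100), ("sneakers", 50), ("boots", 10), ("laces", 5)]:
    graph.add_term(term, frequency)
graph.add_edge("shoes", "sneakers", 0.9)
graph.add_edge("shoes", "boots", 0.8)
graph.add_edge("shoes", "laces", 0.3)
suggestions = graph.suggest("shoes")
```

Keeping the frequency in the node and the similarity on the edge means popularity is stored once per term, while relevance is always relative to the seed term being expanded.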

[download pdf]

Searching in the free world

Google faced a cyber attack originating from computers in China that was serious enough for the company to send an ultimatum to the Chinese government:

…We have decided we are no longer willing to continue censoring our results on Google.cn, and so over the next few weeks we will be discussing with the Chinese government the basis on which we could operate an unfiltered search engine within the law, if at all…

See: Google's blog.

Sander Bockting wins ENIAC thesis award

Sander Bockting has won this year's ENIAC thesis award. ENIAC is the alumni association for former students of Computer Science, Business Information Technology and Telematics. Each year, ENIAC awards a prize for the best graduation thesis. The jury report reads:

The jury has decided to award the ENIAC thesis award 2009 to the thesis “Collection Selection for Distributed Web Search: Using Highly Discriminative Keys, Query-driven Indexing and ColRank” by Sander Bockting. The jury chose this thesis because of the relevance of the research, the scientific approach, and the large 'design' component (the prototype Sophos) embodied in the work. In addition, Sander's research offers a (possible) answer to the question of how to keep the internet accessible. Searching the internet, and the search engines that make it possible, fulfil a societal function in opening up information. Because of the strong growth of the internet, however, it is impossible to keep indexing the entire internet centrally. Moreover, this method gives a lot of power to the owners of a few central search engines. Sander shows that applying distributed search systems is a promising approach that can potentially make data more accessible while reducing the dependence on a few central search engines. The five techniques he compared are therefore an excellent basis for follow-up research that is relevant to both society and science.