Archive for the 'Machine Learning' Category

What country was this tweeted from?

Monday, August 3rd, 2015, posted by Djoerd Hiemstra

Determine the User Country of a Tweet

by Han van der Veen, Djoerd Hiemstra, Tijs van den Broek, Michel Ehrenhard, and Ariana Need

In the widely used message platform Twitter, about 2% of the tweets contain the geographical location through exact GPS coordinates (latitude and longitude). Knowing the location of a tweet is useful for many data analytics questions. This research looks at determining the location of tweets that do not contain GPS coordinates. An accuracy of 82% was achieved using a Naive Bayes model trained on features such as the user’s timezone, the user’s language, and the parsed user location. The classifier performs well on active Twitter countries such as the Netherlands and the United Kingdom. An analysis of errors made by the classifier shows that mistakes were made due to limited information and shared properties between countries, such as a shared timezone. A feature analysis was performed in order to see the effect of different features. The features timezone and parsed user location were the most informative features.
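As a rough illustration of the kind of model the abstract describes, here is a minimal sketch of a Naive Bayes classifier over categorical user features. The toy rows, the feature names (`timezone`, `language`, `location`), the add-one smoothing, and the function names are all assumptions made for this example; the paper's actual features, data, and implementation may differ.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count how often each label and each (feature, label, value) occurs."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature, label) -> Counter of values
    vocab = defaultdict(set)             # feature -> set of values seen in training
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(feature, label)][value] += 1
            vocab[feature].add(value)
    return label_counts, value_counts, vocab

def predict_country(model, row, alpha=1.0):
    """Return the label with the highest smoothed log-probability."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for feature, value in row.items():
            count = value_counts[(feature, label)][value]
            # The extra +1 on the vocabulary size reserves mass for unseen values.
            denom = n + alpha * (len(vocab[feature]) + 1)
            score += math.log((count + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy training data, not taken from the paper.
rows = [
    {"timezone": "Amsterdam", "language": "nl", "location": "Enschede"},
    {"timezone": "Amsterdam", "language": "nl", "location": "Utrecht"},
    {"timezone": "London", "language": "en", "location": "London"},
    {"timezone": "London", "language": "en", "location": "Leeds"},
]
labels = ["NL", "NL", "GB", "GB"]
model = train_naive_bayes(rows, labels)
```

Calling `predict_country(model, {"timezone": "Amsterdam", "language": "nl", "location": "Rotterdam"})` on this toy model picks `"NL"` even though the location string was never seen, because timezone and language still carry evidence. Two countries that share a timezone and a language would be much harder to separate, which matches the error analysis in the abstract.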

[download pdf]

Mike Kolkman graduates on cross-domain geocoding

Thursday, July 2nd, 2015, posted by Djoerd Hiemstra

Cross-domain textual geocoding: influence of domain-specific training data

by Mike Kolkman

Modern technology is more and more able to understand natural language. To do so, unstructured texts need to be analysed and structured. One such structuring method is geocoding, which is aimed at recognizing and disambiguating references to geographical locations in text. These locations can be countries and cities, but also streets and buildings, or even rivers and lakes. A word or phrase that refers to a location is called a toponym. Approaches to tackle the geocoding task mainly use natural language processing techniques and machine learning. The difficulty of the geocoding task depends on multiple aspects, one of which is the data domain. The domain of a text describes the type of the text, like its goal, degree of formality, and target audience. When texts come from two (or more) different domains, like a Twitter post and a news item, they are said to be cross-domain.
An analysis of baseline geocoding systems shows that identifying toponyms on cross-domain data still has room for improvement, as existing systems depend significantly on domain-specific metadata. Systems focused on Twitter data are often dependent on account information of the author and other Twitter-specific metadata. This causes the performance of these systems to drop significantly when applied to news item data.
This thesis presents a geocoding system, called XD-Geocoder, aimed at robust cross-domain performance by using text-based and lookup list based features only. Such a lookup list is called a gazetteer and contains a vast amount of geographical locations and information about these locations. Features are built up using word shape, part-of-speech tags, dictionaries and gazetteers. The features are used to train SVM and CRF classifiers.
Both classifiers are trained and evaluated on three corpora from three domains: Twitter posts, news items and historical documents. These evaluations show Twitter data to be the best for training out of the tested data sets, because both classifiers show the best overall performance when trained on tweets. However, this good performance might also be caused by the relatively high toponym-to-word ratio in the Twitter data used.
Furthermore, the XD-Geocoder was compared to existing geocoding systems. Although the XD-Geocoder is outperformed by state-of-the-art geocoders on single-domain evaluations (trained and evaluated on data from the same domain), it outperforms the baseline systems on cross-domain evaluations.
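To make the text-based and gazetteer-based features mentioned above more concrete, here is a minimal sketch of per-token feature extraction using word shape and a lookup list. The tiny `GAZETTEER`, the shape alphabet, and the feature names are assumptions for illustration only; part-of-speech tags are left out, and this is not the XD-Geocoder's actual feature set.

```python
import re

# Hypothetical toy gazetteer; a real one contains far more place names.
GAZETTEER = {"enschede", "amsterdam", "new york"}

def word_shape(token):
    """Map uppercase to X, lowercase to x, digits to d, then collapse repeats."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)   # done last so the inserted "d" stays
    return re.sub(r"(.)\1+", r"\1", shape)  # "Xxxxx" -> "Xx"

def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "lower": token.lower(),
        "shape": word_shape(token),
        "is_capitalized": token[:1].isupper(),
        "in_gazetteer": token.lower() in GAZETTEER,
        # Two-word lookup so that multi-word toponyms can also match.
        "bigram_in_gazetteer": f"{token} {nxt}".lower() in GAZETTEER,
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<s>",
    }
```

Feature dictionaries like these depend only on the text and the lookup list, not on Twitter account metadata, so the same extractor can be applied to tweets, news items, and historical documents before handing the features to a sequence classifier.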

[download pdf]

Niek Tax graduates on scaling learning to rank to big data

Tuesday, November 25th, 2014, posted by Djoerd Hiemstra

Scaling Learning to Rank to Big Data: Using MapReduce to Parallelise Learning to Rank

by Niek Tax


Learning to rank, the use of machine learning for the ranking task, is an increasingly important task within the scientific fields of machine learning and information retrieval. New learning to rank methods are generally evaluated in terms of ranking accuracy on benchmark test collections. However, comparison of learning to rank methods based on evaluation results is hindered by the lack of a standard set of evaluation benchmark collections. Furthermore, little research has been done on the scalability of the training procedure of learning to rank methods, to prepare us for input data sets that are getting larger and larger. This thesis concerns both the comparison of learning to rank methods using a sparse set of evaluation results on benchmark data sets and the speed-up that can be achieved by parallelising learning to rank methods using MapReduce.

In the first part of this thesis we propose a way to compare learning to rank methods based on a sparse set of evaluation results on a set of benchmark datasets. Our comparison methodology consists of two components: 1) Normalized Winning Number, which gives insight into the ranking accuracy of the learning to rank method, and 2) Ideal Winning Number, which gives insight into the degree of certainty concerning its ranking accuracy. Evaluation results of 87 learning to rank methods on 20 well-known benchmark datasets are collected through a structured literature search. ListNet, SmoothRank, FenchelRank, FSMRank, LRUF and LARF were found to be the best performing learning to rank methods in increasing order of Normalized Winning Number and decreasing order of Ideal Winning Number. Of these ranking algorithms, FenchelRank and FSMRank are pairwise ranking algorithms and the others are listwise ranking algorithms.
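The abstract does not spell out the formulas, so the following is only a sketch under an assumed definition: the winning number counts how often a method beats another method on a dataset where both report a result, the ideal winning number counts how many such comparisons were possible at all, and the normalized winning number is their ratio. The method names `A`, `B`, `C`, the dataset names, and all scores are made up for the example.

```python
def winning_numbers(results, method):
    """Return (wins, possible comparisons, wins / comparisons) for one method.

    results maps dataset -> {method: score}; a method that did not report
    a score on a dataset is simply absent from that inner dictionary.
    """
    wins = 0
    comparisons = 0
    for scores in results.values():
        if method not in scores:
            continue  # no result reported here, so nothing can be compared
        for other, other_score in scores.items():
            if other == method:
                continue
            comparisons += 1
            if scores[method] > other_score:
                wins += 1
    normalized = wins / comparisons if comparisons else None
    return wins, comparisons, normalized

# Made-up sparse evaluation results (higher is better).
results = {
    "D1": {"A": 0.40, "B": 0.35, "C": 0.30},
    "D2": {"A": 0.50, "B": 0.55},
    "D3": {"B": 0.60, "C": 0.45},
}
```

In this toy table, method `B` wins 3 of its 4 possible comparisons and `A` wins 2 of 3. The second number shows how much evidence backs the ratio, which is the role the abstract assigns to the Ideal Winning Number.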

In the second part of this thesis we analyse the speed-up of the ListNet training algorithm when implemented in the MapReduce computing model. We found that running ListNet on MapReduce comes with a job scheduling overhead in the range of 150-200 seconds per training iteration. This makes MapReduce very inefficient for processing small data sets with ListNet, compared to a single-machine implementation of the algorithm. The MapReduce implementation of ListNet was found to offer improvements in processing time for data sets that are larger than the physical memory of the single machine otherwise available for computation. In addition, we showed that ListNet tends to converge faster when a normalisation preprocessing procedure is applied to the input data. The training time of our cluster version of ListNet was found to grow linearly with data size. This shows that the cluster implementation of ListNet can be used to scale the ListNet training procedure to arbitrarily large data sets, given that enough data nodes are available for computation.
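One way to picture how a training iteration could be split into map and reduce steps is sketched below for a linear scoring model with the top-one ListNet loss. This is an assumption about the decomposition, not the thesis implementation: plain Python functions stand in for the MapReduce framework, each "map" call computes the gradient for one query, and the "reduce" step sums the partial gradients. All function names and the toy query are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def map_query(weights, query):
    """Map step: gradient of the top-one ListNet loss for a single query."""
    features, relevance = query
    scores = [sum(w * x for w, x in zip(weights, doc)) for doc in features]
    predicted = softmax(scores)
    target = softmax(relevance)
    grad = [0.0] * len(weights)
    for doc, p, t in zip(features, predicted, target):
        for j, x in enumerate(doc):
            grad[j] += (p - t) * x
    return grad

def reduce_gradients(gradients):
    """Reduce step: sum the per-query gradients component-wise."""
    return [sum(parts) for parts in zip(*gradients)]

def train_iteration(weights, queries, learning_rate=0.1):
    """One gradient descent step; on a cluster the map calls would run in parallel."""
    total = reduce_gradients([map_query(weights, q) for q in queries])
    return [w - learning_rate * g for w, g in zip(weights, total)]

def listnet_loss(weights, queries):
    """Cross entropy between target and predicted top-one distributions."""
    loss = 0.0
    for features, relevance in queries:
        scores = [sum(w * x for w, x in zip(weights, doc)) for doc in features]
        for p, t in zip(softmax(scores), softmax(relevance)):
            loss -= t * math.log(p)
    return loss

# Toy query: two documents with two features each, the first one more relevant.
queries = [([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])]
```

Because every iteration ends in a full reduce before the weights can be updated, a fixed per-job scheduling cost is paid once per iteration. That is consistent with the abstract's observation that the overhead dominates on small data sets and only pays off once the data no longer fits on a single machine.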

[download pdf]

Tesfay Aregay graduates on Ranking Factors for Web Search

Thursday, July 31st, 2014, posted by Djoerd Hiemstra

Ranking Factors for Web Search : Case Study In The Netherlands

by Tesfay Aregay

Search engines must constantly adjust their ranking functions to satisfy their users, while SEO companies and SEO specialists try to keep track of the factors prioritized by these ranking functions. In this thesis, the problem of identifying highly influential ranking factors for better ranking on search engines is examined in detail, looking at two different approaches currently in use and their limitations. The first approach is to calculate a correlation coefficient (e.g. Spearman rank) between a factor and the rank of its corresponding webpages (ranked documents in general) on a particular search engine. The second approach is to train a ranking model on datasets using machine learning techniques and select the features that contributed most to a better performing ranker. We present results that show whether or not combining the two approaches of feature selection can lead to a significantly better set of factors that improve the rank of webpages on search engines. We also provide results showing that calculating correlation coefficients between values of ranking factors and a webpage’s rank gives a stronger result if a dataset is used that contains a combination of the top few and the bottom few ranked pages. In addition, lists of ranking factors that contribute more to well-ranking webpages are provided for the Dutch web dataset (our case study) and the LETOR dataset.
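The first approach can be sketched in a few lines: compute the Spearman rank correlation between the values of one factor and the positions of the corresponding pages in the result list. The implementation below is a plain-Python sketch that assigns average ranks to ties and then takes the Pearson correlation of the ranks; the function names are chosen for this example and no data from the thesis is used.

```python
import math

def average_ranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def spearman(factor_values, positions):
    """Spearman rank correlation between a factor and result-list positions."""
    return pearson(average_ranks(factor_values), average_ranks(positions))
```

Since position 1 is the best rank, a factor that goes along with better ranking shows up as a negative coefficient when correlated with position. Some studies flip the sign for readability, so it is worth checking which convention a reported number follows.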

[download pdf]

Tesfay Aregay
Photo by @Indenty.