Introduction to Data Analysis Competitions Based on Kaggle Platform
About Machine Learning and The Kaggle Platform
The main task of machine learning is to build models capable of predicting the result based on input data that differs from those observed previously. For example, we want to predict the value of shares of a specific company at a given time, given the current value, the dynamics of the market and financial news.
Here the future value of shares will be a prediction, and the current value, dynamics, and news are input data.
Kaggle is primarily a platform for conducting competitions on data analysis and machine learning. The range of tasks is completely different – the classification of whales on species, the identification of cancer tumors, the valuation of the value of a real estate and so on. The organizer’s company formulates the problem, provides data and sponsors the prize fund. At the time of writing, three competitions are active, while the total prize fund is $1.25M – list of active competitions.
My last competition is Outbrain click prediction, the task is to predict which advertisement the user from the ones shown to him will click. The sponsor of the competition – Outbrain is engaged in the promotion of various content, such as blogs or news. They place their ad units on a variety of different resources, including cnn.com, washingtonpost.com, and others. As the company that receives money for clicks of users, they are interested in showing users potentially interesting content.
The formal description of the task – arrange the advertisement shown to the user in every page view in the descending order according to the probability of clicking on the advertisement.
The amount of provided data is quite large, for example, a click log file is about 80GB. An exact description of the input data can be found on the competition page.
The data provided is divided into two parts – the one for which the participants know which banner the user clicks (training data), and the data for which the result needs to be predicted is the test data. At the same time, Kaggle knows the real results for the test data.
Competitors use training data to build models that predict results for test data. In the end, these predictions are loaded back, where the platform, knowing the real results, shows the accuracy of the predictions.
General plan of participation in the competitions
- Analysis of the task and available data
- Search and study of possible solutions
- Development of a local mechanism for assessing the accuracy
- Generation of features for the model
- Implementation (if there is no algorithm available) + model building
- Testing accuracy
- Iteration of items 4 to 6, adding new models (takes most of the time)
- Ensemble and stack of models (optional)
- Forming a team(optional)
Beginning of work
First of all, it is necessary to understand the data that is available to the participants of the competition. On the Data page, there is a description of the structure and semantics, and on the Rules page of the competition – description of whether you can use external data sources, and if so – whether it is worth sharing them with the rest.
The first thing I usually do is extract all the data and pivot that to understand the structure and dependencies so that I know what is available to solve the task. For those purposes, it is convenient to use Jupyter + Python. You can build various graphs, statistical metrics, look at data distributions – do everything that will help to understand the data.
For example, let’s build the distribution of the number of advertisements in one ad unit:
Also, there is a Kernels section on Kaggle where you can execute code in place without the need to download data, plus usually there are guys who make various visualizations of the data, this is how we start using other people’s ideas.
If you have questions about the structure, dependencies or data values, you can search for the answer on the forum, or you can try to guess yourself and get an advantage over those who have not guessed (yet).
Also, sometimes there are Leaks in the data (dependencies, for example, temporary ones) that allow you to understand the value of the target variable (prediction) for a subset (at least) of test data.
For example, in Outbrain click prediction, from the data in the click log, you could understand that the user clicked on certain ads. Information about such leaks can be published on the forum and may be used by the participants without publicity.
Analysis of approaches to the solution of the problem
When everything is clear with the statement of the problem and the input data in general, I start collecting information – reading books, studying similar competitions, scientific publications. This is a wonderful period of competition when in very short time periods you significantly expand your knowledge in solving problems similar to ones in competition.
While studying historical competitions, I am reviewing its forum, where the winners usually describe their approaches + check the source code of solutions that is available.
Reading publications introduces us to state of the art results and approaches that are used to achieve them. It’s also great when you can find the source code for release.
For example, the last two Click-Prediction competitions were won by the same team with the same algorithm. Description of their decisions + source codes + reading of forums of these competitions roughly gives you an idea which you can start work on.
I always start by initializing the version control repository. It’s possible to lose valuable pieces of code even at the very beginning, and it is very unpleasant when you have to restore it. I usually use Git + Bitbucket (free private repositories)
Local (cross) validation
Validation, or predictions testing, is an estimate of accuracy based on the metric selected.
For example, in Outbrain click prediction, you need to predict which advertisement the user clicks on. Provided data is divided into two parts:
|Data type||The ID of the clicked ad for each impression|
|Training||Known to everybody|
|Test||Known only to the organizers of the competition|
The training of the models takes place on the training split of the data, in the hope that the accuracy of the test data will also improve, while it is assumed that the test and training data are taken from a single distribution.
In the process of training, there is often a moment when the accuracy on training data increases, but on test data – begins to fall. This is because the capacity of the model allows you to memorize or adjust to the test set.
This phenomenon is called overfitting, and we’ll talk about how to deal with it later, while it is enough to remember that it is necessary to check accuracy on the subset of data that model hasn’t seen during the training process.
Despite the fact that Kaggle makes it possible to assess the accuracy of the solution (about the test data) by uploading it to the site, this evaluation approach has some drawbacks:
- Limit on the number of uploads per day. Often 2-5
- Latency – calculation of solutions for a test suite, writing into a file, upload of a file
- Non-triviality of Automation
- The probability of overfitting on publicly available test data
It would be great to have a local system for estimating accuracy, which will allow us to:
- Quickly estimate the accuracy based on the selected metrics
- Automate selection of algorithms and/or hyperparameters
- Save evaluation results together with other information for further analysis
- Eliminate the risk of overfitting
For these purposes, doing local (cross) validation is what we need. The idea is simple – we divide the training set into a training and validation ones and use validation to evaluate the accuracy, compare the algorithms, but not for training.
There are many validation approaches, as well as methods of splitting data, but it is important to follow those rules:
- The split should not change over time to exclude influence on the result. For example, we can save separately the validation and training sets, save row indices or use the standard random.seed
- If you have features that are built on the entire data set – for example, the frequency of clicks on a particular advertisement depending on the time, such features should be calculated only using training set, not validation one, so that the knowledge of the answer will not be “flowing” into the model
- If the data is time-dependent – for example, the test set will contain data from the future, about the training set, it is necessary to have the same distribution in training/validation split
Check the adequacy of the local validation by comparing the improvement shown locally with the actual improvement obtained after uploading the solution to the Kaggle. It’s good when the changes in the local score give an idea of real improvement.
The model can be thought of as a black box or a function that receives data at the input and returns the result. The results can be of different types, most commonly used are:
|Model Type||Type of the returned result||Example|
|Regression||Real number||The value of a stock of company N in 1 minute|
|Classification||Category||Solvent / Insolvent borrower|
In the Outbrain competition, it was necessary to rank the advertisement by the probability of the user clicking on it. A simple and effective way to do this is to use the classification task, predict the likelihood of click, and sort the advertisement in the block according to the probabilities given by the classifier:
Predictions of the classifier:
The accepted format of the problem solution:
|5||57 56 58 55|
The evaluation metric is a function that shows how accurately the prediction corresponds to the actual data. For each type of task, there exist a lot of metrics – for example, regression is often based on the difference of squares of values. The competition used MEAP (Mean Average Precision) – a metric that takes into account not only the number of correct answers but also the difference in the order of ranking.
Let’s consider the simplest algorithm when we think that most likely the user will click on the most popular advertisement – the one which Click-through Rate (CTR) ratio is maximum. In this case, we will have two input values – the id of the advertisement and whether the advertisement was clicked on. There is no special machine learning here, just stats.
Let’s say it’s our training data, and under the displayId, we group ads, shown to one user in one block:
To form the first characteristic, we use the formula adClicked = max adId sum(adId, clicked == 1) / sum (adId)
Input values are often represented as a vector (often called a feature vector), and all observations form the matrix, often denoted by X. The target variable (in our case Clicked) is denoted as y
Now, when calculating prediction, for each displayId, we sort the displayed ads by feature_1 and get the following result:
The first thing to do is to check the accuracy of the model using the validation mechanism we have already developed. The model based on the frequency with smoothing returns much better result than random predictions (more is better):
|Frequency with smoothing||0.63380|
You can extend the model, and count the CTR based on the user’s region while counting statistics for all the combinations adId * country * state:
adClicked = max adId sum(adId, clicked == 1, country==displayCountry, state==displayState) / sum(adId)
It is important to build features only from the training subset of data, excluding validation and test sets, otherwise, we will not be able to adequately assess the accuracy of the model. If you use k-fold cross-validation, you will have to build these attributes k times.
Another approach is to generate features in such a way that will reduce/eliminate overfit of the model. For example, I added statistics on the frequency of clicks only for those advertisements where the number of views N> 10 (the value is selected during validation). Motivation – if we add frequencies where the number of ad views == 1, an algorithm with sufficient complexity (for example, a decision tree) will determine the possibility of this attribute to predict the answer uniquely and can only use it for prediction, while it is clear that the solution will be fairly naive.
The entire process of generating features from input data is usually called Feature Engineering and is often a decisive factor for the successful competition since algorithms and meta-algorithms for learning models are publicly available and well-known.
Outbrain competition features
I logically divide features used in Outbrain competition into three main groups:
- User features – person to whom the advertisement is displayed
- Advertiser features
- Context + Time
Let’s consider each group in more details:
Using a training set of data and a log of page views, you can mine a lot of interesting information about the user – which advertisements/advertising campaigns he/she clicked on and which ones were always ignored. Since the landing page information has been provided, it is possible to identify which category of pages or themes/entities the user is interested in – if the landing page is sport.cnn, and the user reads the sport news often at this time of the day or on that day of the week, it can be used as a feature.
Such and similar things will help later to find users with similar preferences and thru them to predict – whether the user will click on the advertisement.
I made the selection of features manually – based on the validation score change with/without the feature.
Here I started with meta-information about advertiser / landing-page / advertising campaign + CTR similar features based on geo-location, time of day, the day of the week – for example, stateCTRCamp: the frequency of clicks on the advertising campaign (combines ads) in one of the states.
Context is a page on which ads are displayed and time when the ad was shown + user information: geo + device type. Knowing the time, you can list all the ads and pages that the user visited\clicked yesterday\day before yesterday\within the last hour. You can also determine the content popular at the moment and so on.
Some of the features I had:
country, state, platform, county, pageDocumentCategories, countryCTRAdv, campaignId, advertiserId, userDocsSeenFromLogYesterday, userClickedThisAdvertiserTimes, hourOfDay, userDocsClickedToday, lastAdvUserClicked
The total number of features is about 120, most of them engineered manually, for example, userDocsSeenFromLogYesterday – documents from the click of the log viewed by the user yesterday (about the predicted ad display). Extended (while still incomplete) list is in the technical description of the solution.
Most of the used characteristics are categorical – for example, the country, and one-hot-encoding are used to convert to a binary feature. Some numerical characteristics were also converted into binary ones by assigning them to a numerical interval, some of them were smoothed with log (x + 1) and some features I left as is.
Encoded features occurring less than ten times, were not taken into account. The total number of features is > 5M, hashing has not been used to reduce the dimension.
An example of the simplest model - logistic regression
Let’s build simple logistic regression model, which gets 2 features as an input – frequency of click in country and state:
countryAdCTR = sum(adId, clicked == 1, country == displayCountry) / sum(adId)
stateAdCTR = sum(adId, clicked == 1, state == displayState) / sum(adId)
Formula (skipping bias for simplicity) for the probability of clicking on ad will be:
y* = f(z), z = w1 * countryAdCTR + w2 * stateAdCTR, f(z) = 1 / (1 + e(-z))
f(z) is a logistic function, which returns values in the interval [0: 1]. The learning algorithm selects the coefficients w1 and w2 in such a way as to reduce the difference between y * and y – which means to achieve the maximum similarity of predictions and real values.
Now let’s add the categorical features – advertiser and view_page_domain into the model, first converting them into binary ones using the one-hot-encoding method, for example:
The formula will be:
z = w1 * countryAdCTR + w2 * stateAdCTR + w3*advertiser + w4*view_page_domain
Since advertiser and view_page is a vector, both w3 and w4 will also be vectors
In CTR prediction, it is very important to consider the interaction of features, for example, for the page where the advertisement and the advertiser are displayed – the chance to click on the Gucci ad on the Vogue page is way different than on Adidas advertising. The model can be extended with additional features that model interaction between the advertiser and view_page:
z = w1 * countryAdCTR + w2 * stateAdCTR + w3*advertiser + w4*view_page_domain + w5*advertiser*view_page_domain
We already know that the advertiser and view_page are vectors, so the dimension of the vector w5 is the length of the advertiser vector * the length of the vector view_page.
There are several problems with this approach, however – first, it will be a huge vector – all possible domains multiplied by the number of all possible advertisers. Second, the resulting vector will be very sparse, and most values will never become 1 – most of the combinations we’ll never encounter in real life.
Factorization Machines (FM)
FM is great for CTR prediction tasks since they explicitly take into account the interaction of features while solving the problem of sparse data. An excellent description can be found in the original publication, here I will briefly touch the main idea – each value of the feature gets its latent vector of length k, and the result of the interaction of features is the dot product of vectors – see the formula in the publication in the Model Equation section.
Outbrain competition model
Field-aware Factorization Machines (FFM)
During the analysis of the best models, I found that the last 2 CTR prediction competitions were won by the guys with the Field-aware Factorization Machines (FFM) model. This model is an extension of FM, but now the features are divided into n fields – kind of feature group, such as a feature group consisting of documents previously viewed, a group for other advertisements in this ad unit, and so on. Now each feature is presented in the form of n vectors with dimension k – and has a different value in each other group of features, which allows more accurate modeling of the interaction between groups. For a description and details, see the publication.
FFM is very prone to overfit and to tackle this early stop is used – after each iteration, the accuracy on the validation set is evaluated. If the accuracy decreases – training stops. I made several changes to the standard library code, for example, I added val loss calculation based on the MEAP metric, same as was used to calculate leaderboard positions on Kaggle, instead of the standard log loss.
One of the teams that took place in the top 3 also added pair optimization into FFM.
To be able to stop early when learning the whole model, I used the divided training set into train/validation and used 5% as a validation set.
The final result is a simple average of the results of 5 models on different random test/validation distributions with slightly different sets of characteristics.
This method of mixing results on the subsets of the training set is called bagging (bootstrap aggregation) and is often used to reduce the variance, which improves the result. Also, it usually works well for me when mixing gradient boosting decision trees (xgboost / lightGBM).
What did not work?
Models based on gradient boosting consistently returned a worse result (comparable to FM), even with pairwise optimization. Also, generation of features for FFM based on the tree leaves from GBDT didn’t work well for me, and stacking of FFM → FFM or XGBOOST → FFM was constantly worse than FFM on the entire data set.
|My best single model result||0.69665|
|My best result (mix of 5 models)||0.69778|
|1st place best result||0.70145|
Code, infrastructure, and hardware
The original files preprocessing is done using Python, and I usually use Jupyter for data analysis purposes. I also filtered the click-log only for users encountered in the training/test suite, which allowed it to be reduced from 80 to 10GB.
The original Feature engineering was also written in Python, however, given the huge amount of data, and therefore the time needed to process it, I quickly switched to Scala. The difference in speed according to an approximate estimate was about 40 times (raw python was used).
To quickly iterate and evaluate improvements (when lucky), I used a subset of the data, about 1/10 of the total amount. This allowed me to get the result in about 5-10 minutes after the launch + models were fit into the laptop’s memory.
Generation of input files for the final model takes about 3 hours on a 6-core machine. The total amount of written files is > 500GB. The approximate learning and prediction time was 10-12 hours, with the use of memory for about 120GB.
The entire process is automated using scripts.
From my point of view, participation in competitions on Kaggle and similar is a great way to introduce yourself to real problem-solving using real data and evaluating your approach against real competitors.
Competing at a high level also often requires the dedication of time and effort, especially having full-time work plus no one canceled your personal life, so that such competitions can be considered as competition also with oneself – win against yourself in the first place.