
You can't easily beat Logistic Regression on Numerai. But can you improve it?

Numerai competitions offer a sample model, example_model.py, with each data set. It uses Logistic Regression (which, despite the name, is a classifier, not a regression) from sklearn. It's fast, and it's quite effective.
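Stripped to its essentials, the sklearn part of example_model.py does roughly this (a sketch only; X, Y, and x_prediction are the pandas objects built from the CSVs, as in the full source at the end of this post):

from sklearn import linear_model

# the baseline: plain logistic regression on all 50 features
model = linear_model.LogisticRegression(n_jobs=-1)
model.fit(X, Y)

# predict_proba returns [P(target = 0), P(target = 1)] per row;
# Numerai wants the probability that the target is 1
y_prediction = model.predict_proba(x_prediction)[:, 1]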

Let's see how it does on Tournament 89 data. For convenience, add the following snippet at the end of main() to calculate logloss on the validation rows:

# calculate logloss on the validation rows
# (they come first in numerai_tournament_data.csv, so the first
# len(validation_data) predictions line up with them)
validation_data = prediction_data.loc[prediction_data['data_type'] == "validation"]
num_validation_rows = len(validation_data)
eval_y = validation_data["target"]
predictions = results_df.iloc[:num_validation_rows]
print("logloss: %f" % metrics.log_loss(eval_y, predictions))

Running the example model gives us a logloss of 0.692946. For context, predicting 0.5 for every row would score ln 2 ≈ 0.693147, so the model beats a coin flip, but only barely:

mike@MacBook ~/D/n/89> time python example_model.py
Loading data...
Training...
Predicting...
Writing predictions to predictions.csv
logloss: 0.692946
       23.50 real        22.20 user         1.08 sys
mike@MacBook ~/D/n/89> 

Many of the 50 features in the data set are strongly correlated; you can verify that with a quick pandas sketch (the 0.9 threshold below is just an illustration):
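# one-off check: count strongly correlated feature pairs
corr = training_data[features].corr().abs()
pairs = [(a, b) for i, a in enumerate(features)
         for b in features[i + 1:] if corr.loc[a, b] > 0.9]
print("%d feature pairs with |correlation| > 0.9" % len(pairs))

What if we exclude some of them (10 out of 50) from the model? Let's try feature ranking with recursive feature elimination (RFE). Add the following import to example_model.py: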

from sklearn.feature_selection import RFE

Feature ranking works as follows (we want to keep only 40 features in this example). With its default step of 1, RFE repeatedly fits the model and drops the weakest remaining feature, judged by coefficient magnitude, until only the requested number is left:

#create the RFE model and select 40 attributes
print("Selecting 40 attributes...")
rfe = RFE(model, n_features_to_select=40)
print("Training...")
rfe = rfe.fit(X, Y)

The new version of the sample model takes about six times as long to run, produces a slightly lower logloss (0.692942), and also tells us which features it wants to keep or eliminate:

mike@MacBook ~/D/n/89> time python example_model_plus.py
Loading data...
Selecting 40 attributes...
Training...
[ True  True False  True  True  True False  True  True  True  True False
  True  True  True  True  True False  True  True  True False  True  True
  True  True False False  True  True  True  True False  True  True False
  True  True  True  True  True  True  True  True  True  True  True  True
  True False]
[ 1  1  4  1  1  1  2  1  1  1  1 10  1  1  1  1  1 11  1  1  1  5  1  1  1
  1  7  6  1  1  1  1  3  1  1  8  1  1  1  1  1  1  1  1  1  1  1  1  1  9]
Training...
Predicting...
Writing predictions to predictions.csv
Logloss: 0.692942
      148.61 real       144.62 user         3.56 sys
mike@MacBook ~/D/n/89> 
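The two arrays are positional: support_ flags the kept features with True, and ranking_ assigns 1 to every kept feature, with higher numbers for features eliminated earlier. To see names instead of booleans, a snippet like this (added after rfe.fit) does the trick:

# map RFE's positional output back to feature names
kept = [f for f, s in zip(features, rfe.support_) if s]
dropped = [f for f, s in zip(features, rfe.support_) if not s]
print("kept %d features; dropped: %s" % (len(kept), dropped))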

If you play with the number of features you want to keep (I've seen relatively good results with anywhere from 30 to 45), you may discover that you can make more accurate predictions than the example shows you.
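One rough way to automate that search is to hold out part of the training data and sweep n_features_to_select, keeping whichever value scores the lowest logloss. A sketch (the 30-45 range, step of 5, and 80/20 split are my choices, not anything Numerai prescribes):

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.2, random_state=0)
for n in range(30, 46, 5):
    candidate = RFE(model, n_features_to_select=n).fit(X_train, y_train)
    p = candidate.predict_proba(X_val)[:, 1]
    print("n=%d logloss: %f" % (n, metrics.log_loss(y_val, p)))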

This is it, right? Just upload the results to Numerai? WRONG. You still have the "originality" and "consistency" requirements to beat. I will cover those in future posts.

Full source code of the "improved" model with RFE:

#!/usr/bin/env python

"""
Example classifier on Numerai data using a logistic regression classifier.
To get started, install the required packages: pip install pandas numpy scikit-learn
"""

import pandas as pd
import numpy as np
from sklearn import metrics, preprocessing, linear_model
from sklearn.feature_selection import RFE


def main():
    # Set seed for reproducibility
    np.random.seed(0)

    print("Loading data...")
    # Load the data from the CSV files
    training_data = pd.read_csv('numerai_training_data.csv', header=0)
    prediction_data = pd.read_csv('numerai_tournament_data.csv', header=0)


    # Transform the loaded CSV data into numpy arrays
    features = [f for f in list(training_data) if "feature" in f]
    X = training_data[features]
    Y = training_data["target"]
    x_prediction = prediction_data[features]
    ids = prediction_data["id"]

    # This is your model that will learn to predict
    model = linear_model.LogisticRegression(n_jobs=-1)

    # create the RFE model and select 40 attributes
    print("Selecting 40 attributes...")
    rfe = RFE(model, n_features_to_select=40)
    print("Training...")
    rfe = rfe.fit(X, Y)
    # summarize the selection of the attributes
    print(rfe.support_)
    print(rfe.ranking_)

    print("Training...")
    # Your model is trained on the training_data

    print("Predicting...")
    # Your trained model is now used to make predictions on the numerai_tournament_data
    # The model returns two columns: [probability of 0, probability of 1]
    # We are just interested in the probability that the target is 1.
    y_prediction = rfe.predict_proba(x_prediction)
    results = y_prediction[:, 1]
    results_df = pd.DataFrame(data={'probability':results})
    joined = pd.DataFrame(ids).join(results_df)

    print("Writing predictions to predictions.csv")
    # Save the predictions out to a CSV file
    joined.to_csv("predictions.csv", index=False)
    # Now you can upload these predictions on numer.ai

    # calculate Logloss on the validation rows
    # (they come first in numerai_tournament_data.csv, so the first
    # len(validation_data) predictions line up with them)
    validation_data = prediction_data.loc[prediction_data['data_type'] == "validation"]
    num_validation_rows = len(validation_data)
    eval_y = validation_data["target"]
    predictions = results_df.iloc[:num_validation_rows]
    print("Logloss: %f" % metrics.log_loss(eval_y, predictions))


if __name__ == '__main__':
    main()