Numerai competitions offer a sample model with each data set, called *example_model.py*. It uses Logistic Regression (which, despite the name, is a classification method) from sklearn. It's fast, and it's quite effective.

Let's see how it does on Tournament 89 data. For convenience, add the following snippet at the end to calculate logloss:

```python
# calculate logloss
validation_data = prediction_data.loc[prediction_data['data_type'] == "validation"]
numValidationRows = validation_data.count()
eval_y = validation_data["target"]
predictions = pd.DataFrame(results_df.iloc[:numValidationRows['target']])
print("logloss: %f" % metrics.log_loss(eval_y, predictions))
```

Running the example model gives us a logloss of 0.692946:

```
mike@MacBook ~/D/n/89> time python example_model.py
Loading data...
Training...
Predicting...
Writing predictions to predictions.csv
logloss: 0.692946
       23.50 real        22.20 user         1.08 sys
mike@MacBook ~/D/n/89>
```
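For context, that number is barely better than guessing: a constant prediction of 0.5 scores exactly ln(2) ≈ 0.693147, regardless of the labels. A quick check (the labels here are arbitrary; any mix of 0s and 1s gives the same result for a constant 0.5 prediction):

```python
import numpy as np
from sklearn.metrics import log_loss

# A constant prediction of 0.5 scores ln(2) ≈ 0.693147 on any labels,
# so the example model's 0.692946 is only a hair better than a coin flip.
y_true = np.array([0, 1] * 50)
y_pred = np.full(100, 0.5)
print(log_loss(y_true, y_pred))  # ≈ 0.693147
```

This is why Numerai scores cluster so tightly just below 0.693: every model is fighting for the last few decimal places.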

Many of the 50 features in the data set are strongly correlated. What if we exclude some of them (10 out of 50) from the model? Let's try feature ranking with recursive feature elimination (RFE). Add the following import to *example_model.py*:

```python
from sklearn.feature_selection import RFE
```

Feature ranking will work as follows (we want to only keep 40 features in this example):

```python
# create the RFE model and select 40 attributes
print("Selecting 40 attributes...")
rfe = RFE(model, 40)
print("Training...")
rfe = rfe.fit(X, Y)
```
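As an aside, the correlation claim above is easy to verify yourself with pandas. A minimal sketch using synthetic stand-in data (with the real file you'd load `numerai_training_data.csv` instead, as in *example_model.py*):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data; with the real file you'd use:
#   training_data = pd.read_csv('numerai_training_data.csv', header=0)
rng = np.random.RandomState(0)
base = rng.rand(1000)
training_data = pd.DataFrame({
    'feature1': base,
    'feature2': base + 0.05 * rng.rand(1000),  # nearly a copy of feature1
    'feature3': rng.rand(1000),                # independent
})

features = [f for f in list(training_data) if "feature" in f]
corr = training_data[features].corr().abs()

# Count off-diagonal feature pairs with |correlation| above 0.8
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(int((upper > 0.8).sum().sum()), "strongly correlated pair(s)")
```

On the real Numerai data, the same count tells you how much redundancy RFE has to work with.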

The new version of the sample model takes a bit longer to run, produces a slightly lower logloss (0.692942), and also tells us which features it wants to keep or eliminate:

```
mike@MacBook ~/D/n/89> time python example_model_plus.py
Loading data...
Selecting 40 attributes...
Training...
[ True  True False  True  True  True False  True  True  True  True False
  True  True  True  True  True False  True  True  True False  True  True
  True  True False False  True  True  True  True False  True  True False
  True  True  True  True  True  True  True  True  True  True  True  True
  True False]
[ 1  1  4  1  1  1  2  1  1  1  1 10  1  1  1  1  1 11  1  1  1  5  1  1
  1  1  7  6  1  1  1  1  3  1  1  8  1  1  1  1  1  1  1  1  1  1  1  1
  1  9]
Training...
Predicting...
Writing predictions to predictions.csv
Logloss: 0.692942
      148.61 real       144.62 user         3.56 sys
mike@MacBook ~/D/n/89>
```
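Reading the two arrays: `support_` is a boolean mask over the features (True = kept), and `ranking_` assigns rank 1 to every kept feature, with 2, 3, ... marking the eliminated ones in reverse order of elimination (2 was dropped last). A small self-contained demo on synthetic data (note: newer scikit-learn versions require `n_features_to_select` as a keyword argument):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Tiny synthetic problem: keep 5 of 10 features
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# support_ marks kept features with True; ranking_ gives every kept
# feature rank 1 and numbers the eliminated ones 2, 3, ... in reverse
# order of elimination.
print(rfe.support_)
print(rfe.ranking_)
print("kept columns:", list(np.flatnonzero(rfe.support_)))
```

With 10 features and 5 kept, the ranks run from 1 to 6: five 1s for the survivors, then one rank per eliminated feature.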

If you play with the number of features you want to keep (I've seen relatively good results with anywhere from 30 to 45), you may discover that you can make more accurate predictions than the example shows you.
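One way to search that range systematically is to sweep `n_features_to_select` and score each setting with cross-validated logloss. A sketch using synthetic stand-in data (with the real files you'd reuse `X` and `Y` from *example_model.py*; `step=5` speeds up elimination, and newer scikit-learn wants the keyword form of the argument):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 50-feature tournament data
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=10, random_state=0)

# Sweep the number of features RFE keeps; lower logloss is better
for n in (30, 35, 40, 45):
    rfe = RFE(LogisticRegression(max_iter=1000),
              n_features_to_select=n, step=5)
    scores = cross_val_score(rfe, X, y, cv=3, scoring='neg_log_loss')
    print(n, -scores.mean())
```

On the Numerai data the differences between settings will be tiny, so cross-validation matters: a single train/validation split can easily flip the ordering.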

This is it, right? Just upload the results to Numerai? WRONG. You still have the "originality" and "consistency" requirements to meet. I will cover those in future posts.

Full source code of the "improved" model with RFE:

```python
#!/usr/bin/env python
"""
Example classifier on Numerai data using a logistic regression classifier.
To get started, install the required packages: pip install pandas numpy sklearn
"""

import pandas as pd
import numpy as np
from sklearn import metrics, preprocessing, linear_model
from sklearn.feature_selection import RFE


def main():
    # Set seed for reproducibility
    np.random.seed(0)

    print("Loading data...")
    # Load the data from the CSV files
    training_data = pd.read_csv('numerai_training_data.csv', header=0)
    prediction_data = pd.read_csv('numerai_tournament_data.csv', header=0)

    # Transform the loaded CSV data into numpy arrays
    features = [f for f in list(training_data) if "feature" in f]
    X = training_data[features]
    Y = training_data["target"]
    x_prediction = prediction_data[features]
    ids = prediction_data["id"]

    # This is your model that will learn to predict
    model = linear_model.LogisticRegression(n_jobs=-1)

    # create the RFE model and select 40 attributes
    print("Selecting 40 attributes...")
    rfe = RFE(model, 40)
    print("Training...")
    rfe = rfe.fit(X, Y)

    # summarize the selection of the attributes
    print(rfe.support_)
    print(rfe.ranking_)

    print("Training...")
    # Your model is trained on the training_data

    print("Predicting...")
    # Your trained model is now used to make predictions on the
    # numerai_tournament_data. predict_proba returns two columns:
    # [probability of 0, probability of 1]; we are only interested
    # in the probability that the target is 1.
    y_prediction = rfe.predict_proba(x_prediction)
    results = y_prediction[:, 1]
    results_df = pd.DataFrame(data={'probability': results})
    joined = pd.DataFrame(ids).join(results_df)

    print("Writing predictions to predictions.csv")
    # Save the predictions out to a CSV file
    joined.to_csv("predictions.csv", index=False)
    # Now you can upload these predictions on numer.ai

    # calculate logloss on the validation rows of the tournament data
    validation_data = prediction_data.loc[prediction_data['data_type'] == "validation"]
    numValidationRows = validation_data.count()
    eval_y = validation_data["target"]
    predictions = pd.DataFrame(results_df.iloc[:numValidationRows['target']])
    print("Logloss: %f" % metrics.log_loss(eval_y, predictions))


if __name__ == '__main__':
    main()
```