Instacart - Predictions on Test Set
In this notebook we load the Random Forest classifier trained in the Instacart Model Fitting notebook and use it to predict reordered products for the test set of the Instacart Market Basket Analysis Kaggle competition.
Support Functions
from IPython.display import Markdown, display
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy.stats
#----------------------------------------------------------------------
# Functions to load datasets into memory using space efficient data types.
def load_orders(path):
    def convert_eval_set(value):
        if 'prior' == value:
            return np.uint8(1)
        elif 'train' == value:
            return np.uint8(2)
        else:
            return np.uint8(3)  # 'test'

    def convert_days_since_prior_order(value):
        # 30 is the maximum value
        if '' == value:
            return np.int8(-1)
        else:
            return np.int8(np.float(value))

    orders = pd.read_csv(path,
                         dtype={'order_id': np.uint32,
                                'user_id': np.uint32,
                                'order_number': np.uint8,
                                'order_dow': np.uint8,
                                'order_hour_of_day': np.uint8},
                         converters={'eval_set': convert_eval_set,
                                     'days_since_prior_order': convert_days_since_prior_order})
    orders = orders.astype({'eval_set': np.uint8,
                            'days_since_prior_order': np.int8})
    return orders
def load_orders_prods(path):
    return pd.read_csv(path, dtype={'order_id': np.uint32,
                                    'product_id': np.uint32,
                                    'add_to_cart_order': np.uint8,
                                    'reordered': np.uint8})
#----------------------------------------------------------------------
# Function to generate markdown output
# Ref: https://stackoverflow.com/a/32035217
def printmd(string):
    display(Markdown(string))
Load Testing Data
orders = load_orders('data/split/sf_test_set_orders.csv')
orders_prods = load_orders_prods('data/split/sf_test_set_prior_order_products.csv')
# Split orders into the prior (history) orders and the final test orders.
prior_orders_only = orders[(1 == orders.eval_set)]
final_orders_only = orders[(1 != orders.eval_set)]
Meta Features - Mean Order Length and Mean Reorder Ratio Per Customer
# Compute mean order length per customer.
orders_length = orders_prods.groupby('order_id').add_to_cart_order.max().reset_index()
orders_length.rename(columns={'add_to_cart_order': 'total_items_ordered'}, inplace=True)
orders_length_merge = orders_length.merge(prior_orders_only[['order_id','user_id']], on='order_id')
orders_length_merge['order_id'] = orders_length_merge.order_id.astype(np.uint32)
mean_order_length_per_customer = orders_length_merge.groupby('user_id').total_items_ordered.mean().round().reset_index()
mean_order_length_per_customer['user_id'] = mean_order_length_per_customer.user_id.astype(np.uint32)
mean_order_length_per_customer.rename(columns={'total_items_ordered': 'mean_order_length'}, inplace=True)
mean_order_length_per_customer['mean_order_length'] = mean_order_length_per_customer.mean_order_length.astype(np.uint16)
del orders_length_merge
# Compute mean reorder ratio per customer.
# For each order compute ratio of re-ordered items to total ordered items.
orders_reorder_ratio = orders_prods.groupby('order_id').reordered.sum() / orders_length.set_index('order_id').total_items_ordered
orders_reorder_ratio = orders_reorder_ratio.reset_index()
del orders_length
# Exclude first orders, since none of the products have been ordered yet,
# and so reordered ratio would be zero, thus skewing the mean reorder ratio
# both overall and per user.
orders_reorder_ratio = orders_reorder_ratio.merge(prior_orders_only[prior_orders_only.order_number > 1], on='order_id')
orders_reorder_ratio.rename(columns={0: 'reorder_ratio'}, inplace=True)
orders_reorder_ratio['order_id'] = orders_reorder_ratio.order_id.astype(np.uint32)
mean_reorder_ratio_per_customer = orders_reorder_ratio.groupby('user_id').reorder_ratio.mean().reset_index()
mean_reorder_ratio_per_customer['user_id'] = mean_reorder_ratio_per_customer.user_id.astype(np.uint32)
mean_reorder_ratio_per_customer.rename(columns={'reorder_ratio': 'mean_reorder_ratio'}, inplace=True)
mean_reorder_ratio_per_customer['mean_reorder_ratio'] = mean_reorder_ratio_per_customer.mean_reorder_ratio.astype(np.float16)
del orders_reorder_ratio
mean_order_length_per_customer.head()
 | user_id | mean_order_length
---|---|---
0 | 3 | 7
1 | 4 | 4
2 | 6 | 5
3 | 11 | 13
4 | 12 | 15
mean_reorder_ratio_per_customer.head()
 | user_id | mean_reorder_ratio
---|---|---
0 | 3 | 0.718750
1 | 4 | 0.035706
2 | 6 | 0.142822
3 | 11 | 0.402588
4 | 12 | 0.181763
Feature Engineering
Merge Prior Orders, Ordered Products, Aisles and Departments
flat_order_prods = orders_prods.merge(prior_orders_only[['order_id','user_id','order_number','days_since_prior_order']], on='order_id')
flat_order_prods.head()
 | order_id | product_id | add_to_cart_order | reordered | user_id | order_number | days_since_prior_order
---|---|---|---|---|---|---|---
0 | 13 | 17330 | 1 | 0 | 45082 | 2 | 1
1 | 13 | 27407 | 2 | 0 | 45082 | 2 | 1
2 | 13 | 35419 | 3 | 0 | 45082 | 2 | 1
3 | 13 | 196 | 4 | 0 | 45082 | 2 | 1
4 | 13 | 44635 | 5 | 0 | 45082 | 2 | 1
Days Since First Order (DSFO) per Order per Customer
DSFO_popc = prior_orders_only.copy()
# days since first order per order per customer
# add one since each user's first order has days_since_prior_order set to -1.
DSFO_popc['DSFO'] = DSFO_popc.groupby(['user_id']).days_since_prior_order.cumsum() + 1
DSFO_popc['DSFO'] = DSFO_popc.DSFO.astype(np.uint16)
del DSFO_popc['eval_set']
del DSFO_popc['order_number']
del DSFO_popc['order_dow']
del DSFO_popc['order_hour_of_day']
del DSFO_popc['days_since_prior_order']
DSFO_popc.head()
 | order_id | user_id | DSFO
---|---|---|---
0 | 1374495 | 3 | 0
1 | 444309 | 3 | 9
2 | 3002854 | 3 | 30
3 | 2037211 | 3 | 50
4 | 2710558 | 3 | 62
Max Days Since First Order (DSFO) per Customer
max_DSFO_pc = DSFO_popc.groupby(['user_id']).DSFO.max().reset_index()
max_DSFO_pc.rename(columns={'DSFO': 'max_DSFO'}, inplace=True)
max_DSFO_pc.head()
 | user_id | max_DSFO
---|---|---
0 | 3 | 133
1 | 4 | 55
2 | 6 | 18
3 | 11 | 123
4 | 12 | 100
Number of Orders per Customer
orders_pc = prior_orders_only.groupby('user_id').order_number.max().reset_index()
orders_pc['user_id'] = orders_pc.user_id.astype(np.uint32)
orders_pc.rename(columns={'order_number': 'number_of_orders'}, inplace=True)
orders_pc.head()
 | user_id | number_of_orders
---|---|---
0 | 3 | 12
1 | 4 | 5
2 | 6 | 3
3 | 11 | 7
4 | 12 | 5
Final Summary for Products Ordered per Customer
# days since first order per product per order per customer
props_pppc = flat_order_prods[['order_id','product_id','reordered']].merge(DSFO_popc, on="order_id")
# aggregate to get properties for each product ordered for each customer
props_pppc = props_pppc.groupby(['user_id','product_id']).agg({'DSFO': [min, max],
                                                               'reordered': sum})
# flatten hierarchical column index
props_pppc = props_pppc.reset_index()
props_pppc.columns = ['_'.join(col).strip('_') for col in props_pppc.columns.values]
# add max_DSFO and total orders per customer
props_pppc = props_pppc.merge(max_DSFO_pc, on='user_id')
props_pppc = props_pppc.merge(orders_pc, on='user_id')
# change data types for space efficiency
props_pppc['user_id'] = props_pppc.user_id.astype(np.uint32)
props_pppc['product_id'] = props_pppc.product_id.astype(np.uint32)
# add days since last order for the customer's final order
props_pppc = props_pppc.merge(final_orders_only[['user_id','days_since_prior_order']], on="user_id")
props_pppc.rename(columns={'days_since_prior_order': 'last_order_DSLO'}, inplace=True)
# compute the reorder probability and the recency probability for each product
# ordered by each customer.
props_pppc['reorder_prob'] = (props_pppc['reordered_sum'] + 1) / props_pppc['number_of_orders']
# set recency_prob to zero when the final order was placed on the same day as the
# previous one; this also avoids a NaN from a zero denominator for customers whose
# prior orders were all placed on the same day as their first order (max_DSFO of zero).
props_pppc['recency_prob'] = (props_pppc['DSFO_max'] / (props_pppc['max_DSFO'] + props_pppc['last_order_DSLO'])).where(props_pppc['last_order_DSLO'] > 0, 0)
# change all float64 fields to float16
props_pppc['reorder_prob'] = props_pppc.reorder_prob.astype(np.float16)
props_pppc['recency_prob'] = props_pppc.recency_prob.astype(np.float16)
# drop the columns we no longer need
del props_pppc['DSFO_min']
del props_pppc['DSFO_max']
del props_pppc['reordered_sum']
del props_pppc['max_DSFO']
del props_pppc['number_of_orders']
del props_pppc['last_order_DSLO']
props_pppc['pred_reordered_prob'] = 0
props_pppc['pred_reordered_prob'] = props_pppc.pred_reordered_prob.astype(np.float16)
props_pppc.head()
 | user_id | product_id | reorder_prob | recency_prob | pred_reordered_prob
---|---|---|---|---|---
0 | 3 | 248 | 0.083313 | 0.062500 | 0.0
1 | 3 | 1005 | 0.083313 | 0.743164 | 0.0
2 | 3 | 1819 | 0.250000 | 0.527832 | 0.0
3 | 3 | 7503 | 0.083313 | 0.208374 | 0.0
4 | 3 | 8021 | 0.083313 | 0.062500 | 0.0
Predictions on Test Data
Load Random Forest Model to Make Reorder Predictions
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib
Extract Testing Feature Matrix
X_test = props_pppc.loc[:,['reorder_prob','recency_prob']].as_matrix()
Predict Product Reordering
The trained random forest model is loaded from disk and used to predict product reordering. The nearly 5 million products previously ordered by the test-set customers are processed in batches of at most one million, and the model is deleted and reloaded from disk for each batch. This keeps the process within the memory available to the Docker container used to carry out these experiments.
total_products = len(X_test)
print("Total products in test orders: {0}".format(total_products))
start_pos = 0
end_pos = 0
batch_size = 1000000
# use ceil so the final partial batch is not dropped by rounding
for i in range(0, int(np.ceil(total_products / batch_size))):
    clf = joblib.load('randomforest-all-training-data.pkl')
    start_pos = i * batch_size
    end_pos = start_pos + batch_size
    if end_pos > total_products:
        end_pos = total_products
    print("Processing {0} - {1}".format(start_pos, end_pos))
    y_hat_probs = clf.predict_proba(X_test[start_pos:end_pos])
    props_pppc.loc[start_pos:end_pos-1, ['pred_reordered_prob']] = y_hat_probs[:, 1]
    del clf
Total products in test orders: 4833292
Processing 0 - 1000000
[Parallel(n_jobs=2)]: Done 1 tasks | elapsed: 0.3s
[Parallel(n_jobs=2)]: Done 4 tasks | elapsed: 0.6s
[Parallel(n_jobs=2)]: Done 9 tasks | elapsed: 1.1s
[Parallel(n_jobs=2)]: Done 14 tasks | elapsed: 1.7s
[Parallel(n_jobs=2)]: Done 21 tasks | elapsed: 2.5s
[Parallel(n_jobs=2)]: Done 28 tasks | elapsed: 3.2s
[Parallel(n_jobs=2)]: Done 37 tasks | elapsed: 4.3s
[Parallel(n_jobs=2)]: Done 46 tasks | elapsed: 5.5s
[Parallel(n_jobs=2)]: Done 57 tasks | elapsed: 6.8s
[Parallel(n_jobs=2)]: Done 68 tasks | elapsed: 8.0s
[Parallel(n_jobs=2)]: Done 81 tasks | elapsed: 9.5s
[Parallel(n_jobs=2)]: Done 94 tasks | elapsed: 10.9s
[Parallel(n_jobs=2)]: Done 109 tasks | elapsed: 12.8s
[Parallel(n_jobs=2)]: Done 124 tasks | elapsed: 14.3s
[Parallel(n_jobs=2)]: Done 140 out of 140 | elapsed: 16.2s finished
Processing 1000000 - 2000000
[Parallel(n_jobs=2)]: Done 1 tasks | elapsed: 0.3s
[Parallel(n_jobs=2)]: Done 4 tasks | elapsed: 0.6s
[Parallel(n_jobs=2)]: Done 9 tasks | elapsed: 1.3s
[Parallel(n_jobs=2)]: Done 14 tasks | elapsed: 1.8s
[Parallel(n_jobs=2)]: Done 21 tasks | elapsed: 2.8s
[Parallel(n_jobs=2)]: Done 28 tasks | elapsed: 3.6s
[Parallel(n_jobs=2)]: Done 37 tasks | elapsed: 4.6s
[Parallel(n_jobs=2)]: Done 46 tasks | elapsed: 5.6s
[Parallel(n_jobs=2)]: Done 57 tasks | elapsed: 7.0s
[Parallel(n_jobs=2)]: Done 68 tasks | elapsed: 8.3s
[Parallel(n_jobs=2)]: Done 81 tasks | elapsed: 9.8s
[Parallel(n_jobs=2)]: Done 94 tasks | elapsed: 11.4s
[Parallel(n_jobs=2)]: Done 109 tasks | elapsed: 13.2s
[Parallel(n_jobs=2)]: Done 124 tasks | elapsed: 17.0s
[Parallel(n_jobs=2)]: Done 140 out of 140 | elapsed: 20.5s finished
Processing 2000000 - 3000000
[Parallel(n_jobs=2)]: Done 1 tasks | elapsed: 0.3s
[Parallel(n_jobs=2)]: Done 4 tasks | elapsed: 0.5s
[Parallel(n_jobs=2)]: Done 9 tasks | elapsed: 1.2s
[Parallel(n_jobs=2)]: Done 14 tasks | elapsed: 1.7s
[Parallel(n_jobs=2)]: Done 21 tasks | elapsed: 2.5s
[Parallel(n_jobs=2)]: Done 28 tasks | elapsed: 3.3s
[Parallel(n_jobs=2)]: Done 37 tasks | elapsed: 4.4s
[Parallel(n_jobs=2)]: Done 46 tasks | elapsed: 6.0s
[Parallel(n_jobs=2)]: Done 57 tasks | elapsed: 7.2s
[Parallel(n_jobs=2)]: Done 68 tasks | elapsed: 8.9s
[Parallel(n_jobs=2)]: Done 81 tasks | elapsed: 10.6s
[Parallel(n_jobs=2)]: Done 94 tasks | elapsed: 12.0s
[Parallel(n_jobs=2)]: Done 109 tasks | elapsed: 13.7s
[Parallel(n_jobs=2)]: Done 124 tasks | elapsed: 15.8s
[Parallel(n_jobs=2)]: Done 140 out of 140 | elapsed: 23.0s finished
Processing 3000000 - 4000000
[Parallel(n_jobs=2)]: Done 1 tasks | elapsed: 0.4s
[Parallel(n_jobs=2)]: Done 4 tasks | elapsed: 1.4s
[Parallel(n_jobs=2)]: Done 9 tasks | elapsed: 2.5s
[Parallel(n_jobs=2)]: Done 14 tasks | elapsed: 3.5s
[Parallel(n_jobs=2)]: Done 21 tasks | elapsed: 4.5s
[Parallel(n_jobs=2)]: Done 28 tasks | elapsed: 5.4s
[Parallel(n_jobs=2)]: Done 37 tasks | elapsed: 6.4s
[Parallel(n_jobs=2)]: Done 46 tasks | elapsed: 7.5s
[Parallel(n_jobs=2)]: Done 57 tasks | elapsed: 8.9s
[Parallel(n_jobs=2)]: Done 68 tasks | elapsed: 10.1s
[Parallel(n_jobs=2)]: Done 81 tasks | elapsed: 11.8s
[Parallel(n_jobs=2)]: Done 94 tasks | elapsed: 13.3s
[Parallel(n_jobs=2)]: Done 109 tasks | elapsed: 15.0s
[Parallel(n_jobs=2)]: Done 124 tasks | elapsed: 16.7s
[Parallel(n_jobs=2)]: Done 140 out of 140 | elapsed: 18.5s finished
Processing 4000000 - 4833292
[Parallel(n_jobs=2)]: Done 1 tasks | elapsed: 0.2s
[Parallel(n_jobs=2)]: Done 4 tasks | elapsed: 0.4s
[Parallel(n_jobs=2)]: Done 9 tasks | elapsed: 1.0s
[Parallel(n_jobs=2)]: Done 14 tasks | elapsed: 1.4s
[Parallel(n_jobs=2)]: Done 21 tasks | elapsed: 2.2s
[Parallel(n_jobs=2)]: Done 28 tasks | elapsed: 2.8s
[Parallel(n_jobs=2)]: Done 37 tasks | elapsed: 3.8s
[Parallel(n_jobs=2)]: Done 46 tasks | elapsed: 4.6s
[Parallel(n_jobs=2)]: Done 57 tasks | elapsed: 6.3s
[Parallel(n_jobs=2)]: Done 68 tasks | elapsed: 7.3s
[Parallel(n_jobs=2)]: Done 81 tasks | elapsed: 8.5s
[Parallel(n_jobs=2)]: Done 94 tasks | elapsed: 9.9s
[Parallel(n_jobs=2)]: Done 109 tasks | elapsed: 11.3s
[Parallel(n_jobs=2)]: Done 124 tasks | elapsed: 12.7s
[Parallel(n_jobs=2)]: Done 140 out of 140 | elapsed: 14.4s finished
Save Predictions to Disk
The predictions are saved to disk so that, if the Jupyter notebook kernel stops working, we do not need to compute the predictions again; loading them from disk is much faster.
props_pppc.to_csv('predictions.csv')
Load Predictions from Disk
def load_predictions_data(path):
    predictions = pd.read_csv(path, dtype={'user_id': np.uint32,
                                           'product_id': np.uint32,
                                           'pred_reordered_prob': np.float16},
                              usecols=['user_id','product_id','pred_reordered_prob'])
    return predictions
predictions = load_predictions_data('predictions.csv')
Add Order ID and User ID to Predictions
predictions = predictions.merge(final_orders_only[['order_id','user_id']], on='user_id')
predictions.sort_values(['order_id','pred_reordered_prob'], inplace=True, ascending=[True, False])
predictions.head()
 | user_id | product_id | pred_reordered_prob | order_id
---|---|---|---|---
858078 | 36855 | 13107 | 0.938965 | 17
858079 | 36855 | 21463 | 0.882324 | 17
858080 | 36855 | 38777 | 0.846680 | 17
858081 | 36855 | 21709 | 0.792969 | 17
858082 | 36855 | 47766 | 0.792969 | 17
Calculate Mean Number of Reordered Products per Customer
The expected number of reordered products in each customer's final order is estimated as the customer's mean order length multiplied by their mean reorder ratio. This estimate can optionally be used to keep only the top predicted products for each final order.
predictions = predictions.merge(mean_order_length_per_customer, on='user_id')
predictions = predictions.merge(mean_reorder_ratio_per_customer, on='user_id')
predictions['order_length'] = predictions.mean_order_length * predictions.mean_reorder_ratio
predictions['order_length'] = predictions['order_length'].round()
predictions.order_length = predictions.order_length.astype(int)
del predictions['user_id']
del predictions['mean_order_length']
del predictions['mean_reorder_ratio']
predictions_above_threshold = predictions.loc[predictions.pred_reordered_prob > 0.5]
del predictions['pred_reordered_prob']
predictions.head()
 | product_id | order_id | order_length
---|---|---|---
0 | 13107 | 17 | 2
1 | 21463 | 17 | 2
2 | 38777 | 17 | 2
3 | 21709 | 17 | 2
4 | 47766 | 17 | 2
Generate Reordered Products List for Each Last Order
LimitToMeanReorderLength = False
if LimitToMeanReorderLength:
    predictions_above_threshold = predictions_above_threshold.groupby('order_id').apply(
        lambda x: list(x['product_id'])[:list(x['order_length'])[0]])
else:
    predictions_above_threshold = predictions_above_threshold.groupby('order_id').apply(
        lambda x: list(x['product_id']))
predictions_above_threshold.head()
order_id
17 [13107, 21463, 38777, 21709, 47766, 26429, 392...
34 [39180, 39475, 47792, 47766, 2596, 16083, 4350...
137 [23794, 41787, 24852, 38689, 2326, 5134, 25890...
182 [9337, 39275, 13629, 5479, 47672, 47209, 33000...
257 [49235, 24852, 27104, 27966, 29837, 30233, 450...
dtype: object
Save Reordered Products List per Final Order to Disk
predictions_above_threshold.to_csv('submit_predictions_0_5.csv')
Determine List of Empty Orders and Save to Disk
Empty orders are final orders for which our predictive model does not expect any reordered products. Although the model assigns some probability to every product in a customer's history, for some customers none of those probabilities exceed the threshold we apply. The run shown here uses a threshold of 0.5, matching the file names above; as the results below show, a threshold of 0.7 gave the best mean F1-score on Kaggle.
# dataframe with single column full of final order ids
all_last_orders = pd.DataFrame(predictions['order_id'].unique(), columns=['order_id'])
predictions_above_threshold = predictions_above_threshold.reset_index()
empty_orders = all_last_orders.loc[~all_last_orders.order_id.isin(predictions_above_threshold.order_id.values)]['order_id']
empty_orders.reset_index().to_csv('empty_orders_0_5.csv', index=False, header=False, columns=['order_id'])
Verify That All Final Orders Were Included
There are 75,000 final orders, so to verify that we included all of them in our predictions, we sum the number of orders in the empty orders list along with the number of orders that include at least one reordered product.
printmd("Normal orders **{0}** + Empty orders **{1}** = **{2}** total orders.".format(
    len(predictions_above_threshold),
    len(empty_orders),
    len(empty_orders) + len(predictions_above_threshold)))
Normal orders 74875 + Empty orders 125 = 75000 total orders.
Prepare Kaggle Submission File
The two CSV files generated above, one for the normal final orders and the other for the empty final orders, need to be formatted and combined to satisfy the submission file format expected by Kaggle for this particular competition.
This was achieved with a simple bash script, prep_submission_file.sh, which uses the sed and cat tools.
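For illustration, here is a minimal Python sketch of that formatting step, assuming the Kaggle submission format of an order_id column and a space separated products column (with the literal string None for orders with no predicted reorders), and assuming both CSV files above were written without header rows. The prepare_submission function is hypothetical; the actual processing was done by the bash script.
import csv

def prepare_submission(normal_path, empty_path, out_path):
    # Hypothetical sketch only; the actual step used prep_submission_file.sh.
    with open(out_path, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(['order_id', 'products'])
        # Orders with at least one predicted reordered product: each input row holds
        # the order id and the bracketed product id list written by to_csv above.
        with open(normal_path) as f:
            for row in csv.reader(f):
                if not row:
                    continue
                order_id, products = row[0], row[1]
                writer.writerow([order_id, products.strip('[]').replace(',', '')])
        # Orders for which no product exceeded the probability threshold.
        with open(empty_path) as f:
            for row in csv.reader(f):
                if not row:
                    continue
                writer.writerow([row[0], 'None'])

prepare_submission('submit_predictions_0_5.csv', 'empty_orders_0_5.csv', 'submission_0_5.csv')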
Results on Kaggle Instacart Market Basket Analysis Competition
The following are the mean F1-scores achieved on the Instacart Market Basket Analysis competition hosted by Kaggle using only a probability threshold on top of the predictions from the trained random forest model.
Reorder Probability Threshold | Mean F1 Score
---|---
0.5 | 0.3343295
0.6 | 0.3558827
0.7 | 0.3622994
0.8 | 0.3276510
Finally, we performed one final test using a probability threshold of 0.7 as before but combined with truncating the reordered products list to satisfy the mean number of reordered products per customer. In this experiment, the mean F1-score reported by Kaggle was 0.3313422.
Conclusion
The best mean F1-score achieved using the fitted random forest model is 0.3622994. Although this score is nowhere near the winning score of 0.4091449, it is a good start, especially considering the simplicity of the features, the model, and the modest computational resources required.
Further Improvement
There are many things one could try to improve the model. One is to engineer additional features, for example a slightly different recency probability computed from the order number over the total number of orders, instead of the days since last order metric we used (a sketch follows below). Another option is to use clustering to identify customer types, which could then be explored individually, with a specific model fitted to each type.
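As a minimal sketch of that first idea, and assuming the flat_order_prods and orders_pc dataframes built earlier in this notebook are still in memory, an order-number based recency feature could look like the following; the last_order_number and recency_prob_on names are hypothetical and are not part of the fitted model.
# For each customer and product, find the most recent order (by order number) that
# contained the product, then divide by the customer's total number of prior orders.
last_ordered = flat_order_prods.groupby(['user_id','product_id']).order_number.max().reset_index()
last_ordered.rename(columns={'order_number': 'last_order_number'}, inplace=True)
alt_recency = last_ordered.merge(orders_pc, on='user_id')
# Products last seen in a customer's most recent orders score close to 1,
# products last seen only in their earliest orders score close to 0.
alt_recency['recency_prob_on'] = (alt_recency.last_order_number / alt_recency.number_of_orders).astype(np.float16)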
One can also use more powerful and complex methods, such as LSTMs. For instance, Kaggle user SJV used a mixture of deep learning models for feature extraction and prediction, attaining third place with a mean F1-score of 0.4081041; his approach is described in his competition write-up. Another common technique was F1-score maximization, also used by Onodera along with a large array of features to place second in the competition with a mean F1-score of 0.4082039. His write-up is likewise worth reading, but keep in mind that approximately 300GB of RAM are needed to fit and evaluate his models. In comparison, the models fitted and evaluated in these notebooks run in under 4GB of RAM.