Instacart - Splitting Training Data
This notebook splits the Instacart training data, as provided on Kaggle for the Instacart Market Basket Analysis competition, into a training set (80%) and a validation set (20%). The split is made by user, so all of a user's orders land in the same set.
import pandas as pd
import math
# load the data to split
prior = pd.read_csv('data/original/order_products__prior.csv')
train = pd.read_csv('data/original/order_products__train.csv')
orders = pd.read_csv('data/original/orders.csv')
# Get list of user IDs included in test set
test_user_ids = orders[orders.eval_set == "test"].user_id
# Get list of user IDs included in train set
orig_train_user_ids = orders[orders.eval_set == "train"].user_id
users_in_test_set = len(test_user_ids.index)
users_in_train_set = len(orig_train_user_ids.index)
print("Users in test set: {0}.".format(users_in_test_set))
print("Users in train set: {0}.".format(users_in_train_set))
print("\nSplit training into 80:20 - training:validation\n")
# validation gets one fifth of the original train users, rounded down
users_in_val_set = math.floor(users_in_train_set / 5)
users_in_train_set = users_in_train_set - users_in_val_set
print("Users in train set: {0}".format(users_in_train_set))
print("Users in validation set: {0}".format(users_in_val_set))
Users in test set: 75000.
Users in train set: 131209.
Split training into 80:20 - training:validation
Users in train set: 104968
Users in validation set: 26241
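# Take the first 20% of train users (in orders.csv row order, no shuffling) for validation;
# the remaining users stay in the training set. .iloc makes the positional slicing explicit.
val_user_ids = orig_train_user_ids.iloc[:users_in_val_set]
new_train_user_ids = orig_train_user_ids.iloc[users_in_val_set:]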
# Confirm newly created training and validation lists are of the expected length
actual_users_in_val_set = len(val_user_ids.index)
actual_users_in_train_set = len(new_train_user_ids.index)
print("Users in new train set: {0} - Match: {1}".format(actual_users_in_train_set, users_in_train_set==actual_users_in_train_set))
print("Users in new validation set: {0} - Match: {1}".format(actual_users_in_val_set, users_in_val_set==actual_users_in_val_set))
Users in new train set: 104968 - Match: True
Users in new validation set: 26241 - Match: True
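The split above takes the first fifth of the train users in the order they appear in orders.csv. If that order is not random with respect to user behaviour, shuffling before slicing is one alternative. The following is a minimal sketch of that variant on a toy Series, not the method used in this notebook; the helper name `split_user_ids` and the seed value are assumptions made for illustration.

```python
import pandas as pd

def split_user_ids(user_ids, seed=42):
    """Shuffle user IDs, then return (train_ids, val_ids) in an 80:20 ratio.

    Hypothetical helper: the validation set gets one fifth of the users,
    rounded down, mirroring the arithmetic used earlier in the notebook.
    """
    # frac=1 returns every row in random order; random_state makes it repeatable
    shuffled = user_ids.sample(frac=1, random_state=seed)
    n_val = len(shuffled) // 5
    return shuffled.iloc[n_val:], shuffled.iloc[:n_val]

# Toy stand-in for orig_train_user_ids
toy_user_ids = pd.Series(range(1, 11), name="user_id")
toy_train_ids, toy_val_ids = split_user_ids(toy_user_ids)
print(len(toy_train_ids), len(toy_val_ids))
```

Fixing the seed keeps the split reproducible across runs, which matters if the CSV files written later are regenerated.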
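# Write every order (prior orders included) of each user group to its own file.
# The data/split directory must already exist; to_csv does not create it.
orders[orders.user_id.isin(test_user_ids)].to_csv('data/split/sf_test_set_orders.csv', index=False)
orders[orders.user_id.isin(new_train_user_ids)].to_csv('data/split/sf_train_set_orders.csv', index=False)
orders[orders.user_id.isin(val_user_ids)].to_csv('data/split/sf_val_set_orders.csv', index=False)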
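Because the three order files are selected by user ID, they only partition orders.csv if no user appears in more than one group. A small check along these lines could confirm that before writing; this is a sketch on toy data, and the helper name `user_sets_disjoint` and the toy frame are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for orders.csv; the real file has more columns.
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "user_id": [10, 10, 20, 20, 30, 30],
    "eval_set": ["prior", "train", "prior", "train", "prior", "test"],
})

toy_test_ids = toy_orders[toy_orders.eval_set == "test"].user_id
toy_orig_train_ids = toy_orders[toy_orders.eval_set == "train"].user_id
toy_val_ids = toy_orig_train_ids.iloc[:1]
toy_new_train_ids = toy_orig_train_ids.iloc[1:]

def user_sets_disjoint(*id_series):
    """Return True if no user ID occurs in more than one of the given Series."""
    seen = set()
    for ids in id_series:
        current = set(ids)
        if seen & current:
            return False
        seen |= current
    return True

splits_ok = user_sets_disjoint(toy_test_ids, toy_new_train_ids, toy_val_ids)

# With disjoint user groups, the per-group order counts add up to the full table.
orders_covered = sum(
    toy_orders.user_id.isin(ids).sum()
    for ids in (toy_test_ids, toy_new_train_ids, toy_val_ids)
)
print(splits_ok, orders_covered == len(toy_orders))
```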
# Split the products of each user's last order (order_products__train.csv) between the validation and new training sets
val_orders = pd.read_csv('data/split/sf_val_set_orders.csv')
train[train.order_id.isin(val_orders.order_id)].to_csv('data/split/sf_val_set_last_order_products.csv', index=False)
train_orders = pd.read_csv('data/split/sf_train_set_orders.csv')
train[train.order_id.isin(train_orders.order_id)].to_csv('data/split/sf_train_set_last_order_products.csv', index=False)
# Split the prior-order products (order_products__prior.csv) across the training, validation and test sets
test_orders = pd.read_csv('data/split/sf_test_set_orders.csv')
prior[prior.order_id.isin(test_orders.order_id)].to_csv('data/split/sf_test_set_prior_order_products.csv', index=False)
prior[prior.order_id.isin(val_orders.order_id)].to_csv('data/split/sf_val_set_prior_order_products.csv', index=False)
prior[prior.order_id.isin(train_orders.order_id)].to_csv('data/split/sf_train_set_prior_order_products.csv', index=False)
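Since each prior order belongs to exactly one user, the three prior-product files should together contain every row of order_products__prior.csv exactly once. A row-count comparison is a cheap way to check this; the sketch below uses toy frames with assumed values rather than the real files.

```python
import pandas as pd

# Toy stand-in for order_products__prior.csv
toy_prior = pd.DataFrame({
    "order_id": [1, 1, 3, 5],
    "product_id": [100, 101, 100, 102],
})

# Toy order IDs per split, standing in for the order_id columns of the sf_*_set_orders.csv files
toy_split_order_ids = {
    "train": pd.Series([3, 4]),
    "val": pd.Series([1, 2]),
    "test": pd.Series([5, 6]),
}

# Same isin filter as the notebook, applied once per split
toy_parts = {
    name: toy_prior[toy_prior.order_id.isin(order_ids)]
    for name, order_ids in toy_split_order_ids.items()
}

# The order ID sets are disjoint, so matching totals mean no row was lost or duplicated
rows_in_parts = sum(len(part) for part in toy_parts.values())
partition_ok = rows_in_parts == len(toy_prior)
print({name: len(part) for name, part in toy_parts.items()}, partition_ok)
```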