In this project, we use machine learning techniques to predict a visitor’s shopping intention. The data consist of features extracted from the visit log of an online shopping website, such as the closeness of the visit to a special day and the number of product-related pages viewed by the visitor. Random forest (RF), support vector machine (SVM), and multilayer perceptron (MLP) models are used for prediction. We use oversampling to address the class imbalance and improve classifier performance. An ensemble using the predictions from the three models was also trained. The results show that, without oversampling, RF has higher accuracy and F1 score than SVM and MLP, and that oversampling improves the F1 score of all three models. The ensemble had the highest accuracy and F1 score among all models. Page value, the number of product-related pages visited, and the time spent on product-related pages are the most important features for predicting purchasing intention.
An online shopping website can attract many visitors, yet the conversion rate from visitors who are interested in a product to visitors who actually purchase it can be less than ideal. Identifying which aspects of a website to improve in order to turn more “visitors” into “buyers” is therefore crucial. Being able to predict potential “buyers” more accurately would also allow websites to target their advertising and other visual information to encourage purchases. The goal of this paper is to find a better model for predicting purchase intention and to identify the features that contribute most to the prediction.
The feature “Product Related” is the number of product-related pages the shopper visited during the session. Because these pages describe the products the shopper may want to buy, “Product Related” is likely an important feature for predicting the shopper’s intention.
The feature “Month” is the month of the visit, and the feature “Special Day” represents how close the visit is to a “special day” such as a holiday. Shoppers know that discounts are larger around some holidays, and they may wait for those holidays to buy items they were already interested in. Thus, the month can be a crucial feature for predicting purchasing intention.
We therefore hypothesize that Product Related and Month are the most important features for predicting a shopper’s intention.
The dataset used in this project comes from Sakar et al. (2019), in which the authors designed a system to predict online shoppers’ purchasing intention and page abandonment. Purchasing intention was cast as a binary classification problem: a viewer either made a final purchase or did not. The dataset consists of 12,330 sessions, each of which records one visit to the website. Each session belongs to a different user within a 1-year period, to avoid any tendency toward a specific campaign, special day, user profile, or period. The dataset is highly imbalanced: 84.5% (10,422) of the sessions did not end in a purchase.
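To make the imbalance concrete, the sketch below uses only the class counts quoted above to show what a trivial classifier that always predicts "no purchase" would score. The variable names are illustrative and are not part of the original analysis.

```r
# Class counts taken from the dataset description above
n_total <- 12330
n_no_purchase <- 10422
n_purchase <- n_total - n_no_purchase   # sessions that ended in a purchase

# A trivial baseline that always predicts "no purchase"
tp <- 0                # purchases correctly predicted
fp <- 0                # non-purchases predicted as purchases
fn <- n_purchase       # purchases missed
tn <- n_no_purchase    # non-purchases correctly predicted

baseline_accuracy <- (tp + tn) / n_total
baseline_recall <- tp / (tp + fn)

print(round(baseline_accuracy, 3))  # share of the majority class
print(baseline_recall)              # no purchase is ever detected
```

Because such a baseline already reaches about 84.5% accuracy while finding no buyers, the F1 score on the purchase class is a useful companion to accuracy for this dataset.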
| Feature | Feature Description |
|---|---|
| Administrative | Number of pages visited by the visitor about account management |
| Administrative duration | Total amount of time (in seconds) spent by the visitor on account management related pages |
| Informational | Number of pages visited by the visitor about Web site, communication and address information of the shopping site |
| Informational duration | Total amount of time (in seconds) spent by the visitor on informational pages |
| Product related | Number of product-related pages visited by the visitor |
| Product related duration | Total amount of time (in seconds) spent by the visitor on product related pages |
| Bounce rate | Average bounce rate value of the pages visited by the visitor |
| Exit rate | Average exit rate value of the pages visited by the visitor |
| Page value | Average page value of the pages visited by the visitor |
| Special day | Closeness of the site visiting time to a special day |
| OperatingSystems | Operating system of the visitor |
| Browser | Browser of the visitor |
| Region | Geographic region from which the session has been started by the visitor |
| TrafficType | Traffic source by which the visitor has arrived at the Web site (e.g., banner, SMS, direct) |
| VisitorType | Visitor type as “New Visitor,” “Returning Visitor,” and “Other” |
| Weekend | Boolean value indicating whether the date of the visit is weekend |
| Month | Month value of the visit date |
| Revenue | Class label indicating whether the visit has been finalized with a transaction |
After downloading the file, we imported it with read.csv() and used summary() to inspect statistics such as the minimum, maximum, and median of the 18 columns. There are 10 numerical and 8 categorical columns (the categorical ones include the class label Revenue).
library(zoo)         # na.locf()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), scroll_box()
library(dplyr)       # %>% and select()

data <- read.csv("online_shoppers_intention.csv")
data <- na.locf(data)  # fill missing values with the last observed value
data_sum <- summary(data)
data_sum %>% kable() %>%
  kable_styling("striped") %>%
  scroll_box(width = "700px", height = "400px")

| Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 0.000 | Min. : -1.00 | Min. : 0.0000 | Min. : -1.00 | Min. : 0.00 | Min. : -1.0 | Min. :0.000000 | Min. :0.00000 | Min. : 0.000 | Min. :0.00000 | May :3364 | Min. :1.000 | Min. : 1.000 | Min. :1.000 | Min. : 1.00 | New_Visitor : 1694 | Mode :logical | Mode :logical | |
| 1st Qu.: 0.000 | 1st Qu.: 0.00 | 1st Qu.: 0.0000 | 1st Qu.: 0.00 | 1st Qu.: 7.00 | 1st Qu.: 185.3 | 1st Qu.:0.000000 | 1st Qu.:0.01429 | 1st Qu.: 0.000 | 1st Qu.:0.00000 | Nov :2998 | 1st Qu.:2.000 | 1st Qu.: 2.000 | 1st Qu.:1.000 | 1st Qu.: 2.00 | Other : 85 | FALSE:9462 | FALSE:10422 | |
| Median : 1.000 | Median : 8.00 | Median : 0.0000 | Median : 0.00 | Median : 18.00 | Median : 600.2 | Median :0.003125 | Median :0.02511 | Median : 0.000 | Median :0.00000 | Mar :1907 | Median :2.000 | Median : 2.000 | Median :3.000 | Median : 2.00 | Returning_Visitor:10551 | TRUE :2868 | TRUE :1908 | |
| Mean : 2.322 | Mean : 80.98 | Mean : 0.5046 | Mean : 34.52 | Mean : 31.76 | Mean : 1195.7 | Mean :0.022139 | Mean :0.04298 | Mean : 5.889 | Mean :0.06143 | Dec :1727 | Mean :2.124 | Mean : 2.357 | Mean :3.147 | Mean : 4.07 | NA | NA | NA | |
| 3rd Qu.: 4.000 | 3rd Qu.: 93.79 | 3rd Qu.: 0.0000 | 3rd Qu.: 0.00 | 3rd Qu.: 38.00 | 3rd Qu.: 1469.2 | 3rd Qu.:0.016667 | 3rd Qu.:0.05000 | 3rd Qu.: 0.000 | 3rd Qu.:0.00000 | Oct : 549 | 3rd Qu.:3.000 | 3rd Qu.: 2.000 | 3rd Qu.:4.000 | 3rd Qu.: 4.00 | NA | NA | NA | |
| Max. :27.000 | Max. :3398.75 | Max. :24.0000 | Max. :2549.38 | Max. :705.00 | Max. :63973.5 | Max. :0.200000 | Max. :0.20000 | Max. :361.764 | Max. :1.00000 | Sep : 448 | Max. :8.000 | Max. :13.000 | Max. :9.000 | Max. :20.00 | NA | NA | NA | |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | (Other):1337 | NA | NA | NA | NA | NA | NA | NA | |
library(corrplot)  # corrplot()
library(psych)     # pairs.panels()

data_vis <- data
# Correlation matrix of the ten numerical features
col <- cor(data_vis[, c("Administrative", "Administrative_Duration", "Informational", "Informational_Duration", "ProductRelated", "ProductRelated_Duration", "BounceRates", "ExitRates", "PageValues", "SpecialDay")])
corrplot(col, method = "square", title = "Correlation Matrix for Online Shoppers Intention", tl.cex = 0.7, tl.col = "black", mar = c(1, 1, 1, 1))

pairs.panels(data_vis[c("ProductRelated", "ProductRelated_Duration")],
             method = "pearson",  # correlation method
             hist.col = "pink",
             density = TRUE)      # show density plots

pairs.panels(data_vis[c("BounceRates", "ExitRates")],
             method = "pearson",  # correlation method
             hist.col = "pink",
             density = TRUE)      # show density plots

# Order the months chronologically; values that match none of these labels become NA
data_vis$Month <- factor(data_vis$Month, levels = c("Feb", "Mar", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
barplot(table(data_vis$Month), main = "Popular Shopping Month", xlab = "Month", ylab = "Count", border = "navy",
        col = "pink")

We use support vector machine, random forest, and multilayer perceptron models to predict shoppers’ purchasing intention.
Support Vector Machines (SVM): SVMs are widely used for classification. An SVM uses a hyperplane to split the data into separate groups, or classes. The kernel trick implicitly maps the data to a higher-dimensional space before the hyperplane separates the classes.
Random Forest: A random forest is an ensemble of decision trees. Decision trees and random forests are useful for feature selection because every node splits on a single feature, so the contribution of each feature to the prediction can be measured.
Multilayer Perceptron (MLP): The MLP is one of the most commonly used neural networks for classification. Users may specify the number of hidden layers. The model optimizes a log-loss function using stochastic gradient descent.
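As a small illustration of the log-loss objective mentioned above, the following sketch computes binary log-loss for a few predicted probabilities. The function name `log_loss` and the example values are hypothetical; this is not the internal implementation of any particular MLP library.

```r
# Binary log-loss (cross-entropy) for labels y in {0, 1} and
# predicted probabilities p of the positive class.
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)   # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y <- c(1, 0, 1, 0)                        # hypothetical labels
confident_right <- c(0.9, 0.1, 0.8, 0.2)  # mostly correct, confident scores
uninformative <- c(0.5, 0.5, 0.5, 0.5)    # always predicts 50%

print(log_loss(y, confident_right))
print(log_loss(y, uninformative))         # equals log(2)
```

Lower values mean that the predicted probabilities agree better with the labels, which is what training tries to achieve.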
Set categorical features as class (factor) variables.
data$OperatingSystems <- as.factor(data$OperatingSystems)
data$Browser <- as.factor(data$Browser)
data$Region <- as.factor(data$Region)
data$TrafficType <- as.factor(data$TrafficType)
data$VisitorType <- as.factor(data$VisitorType)
data$Weekend <- as.factor(data$Weekend)
data$Month <- as.factor(data$Month)
data$Revenue <- as.factor(data$Revenue)

Train/test split
sample_size <- floor(0.9 * nrow(data))  # 90% of the sessions go to training
train_ind <- sample(seq_len(nrow(data)), size = sample_size)
data_train <- data[train_ind, ]
data_test <- data[-train_ind, ]
## Setting default kernel parameters
svm_pred <- predict(svm_model, data_test)
F1_Score(y_pred = svm_pred, y_true = data_test$Revenue, positive = "TRUE")
## [1] 0.5088339
## [1] 0.8872668
library(randomForest)
rf_model <- randomForest(Revenue ~ ., data = data_train, importance = TRUE, ntree = 15)
rf_pred <- predict(rf_model, data_test)
F1_Score(y_pred = rf_pred, y_true = data_test$Revenue, positive = "TRUE")
## [1] 0.626087
## [1] 0.8953771
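The F1 score for the positive class (`"TRUE"`) can be derived from a confusion matrix. The sketch below does this in base R on hypothetical labels; it illustrates the metric and is not the implementation inside the `F1_Score` function used in the analysis.

```r
# Hypothetical true labels and predictions for 10 test sessions
y_true <- factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE",
                   "FALSE", "FALSE", "FALSE", "FALSE", "TRUE"),
                 levels = c("FALSE", "TRUE"))
y_pred <- factor(c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE",
                   "TRUE", "FALSE", "FALSE", "FALSE", "FALSE"),
                 levels = c("FALSE", "TRUE"))

# Rows are predictions, columns are the true classes
cm <- table(predicted = y_pred, actual = y_true)

tp <- cm["TRUE", "TRUE"]    # predicted purchase, was a purchase
fp <- cm["TRUE", "FALSE"]   # predicted purchase, was not
fn <- cm["FALSE", "TRUE"]   # missed purchase
tn <- cm["FALSE", "FALSE"]  # correctly predicted no purchase

precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)
accuracy <- (tp + tn) / sum(cm)

print(cm)
print(c(f1 = f1, accuracy = accuracy))
```

On this toy example the accuracy is 0.7 while the F1 score is 4/7, which shows how the two metrics can diverge when the positive class is rare.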
library(RSNNS)  # mlp(), decodeClassLabels(), splitForTrainingAndTest()

# Convert the label and the categorical columns to integer codes for the MLP
data_numeric <- data
data_numeric$Revenue <- as.integer(as.factor(data_numeric$Revenue))
data_numeric$Month <- as.integer(as.factor(data_numeric$Month))
data_numeric$Weekend <- as.integer(as.factor(data_numeric$Weekend))
data_numeric$VisitorType <- as.integer(as.factor(data_numeric$VisitorType))
data_numeric$TrafficType <- as.integer(as.factor(data_numeric$TrafficType))
data_numeric$Region <- as.integer(as.factor(data_numeric$Region))
data_numeric$Browser <- as.integer(as.factor(data_numeric$Browser))
data_numeric$OperatingSystems <- as.integer(as.factor(data_numeric$OperatingSystems))

# Assumed: same split as for the other models (these two lines mirror the oversampled version)
data_numeric_train <- data_numeric[train_ind, ]
data_numeric_test <- data_numeric[-train_ind, ]

# Shuffle the training rows, then separate the 17 features from the label in column 18
mlp_train <- data_numeric_train[sample(nrow(data_numeric_train)), ]
mlp_train_x <- mlp_train[, 1:17]
mlp_train_y <- decodeClassLabels(mlp_train[, 18])
mlp_data <- splitForTrainingAndTest(mlp_train_x, mlp_train_y, ratio = 0.1)
mlp_model <- mlp(mlp_data$inputsTrain, mlp_data$targetsTrain, size = 50, learnFuncParams = c(0.001),
                 linOut = FALSE, hiddenActFunc = "Act_Logistic",
                 maxit = 50, inputsTest = mlp_data$inputsTest, targetsTest = mlp_data$targetsTest)

mlp_test_x <- data_numeric_test[, 1:17]
mlp_test_y <- decodeClassLabels(data_numeric_test[, 18])
mlp_pred <- predict(mlp_model, mlp_test_x, type = "class")

gold = as.numeric(mlp_test_y[, 2])
temp <- as.numeric(mlp_pred[, 2])
pred <- as.integer(temp > 0.2)  # classify as a purchase when the score exceeds 0.2

# Confusion-matrix counts over the whole test set (no hard-coded test size)
tp <- sum(gold == 1 & pred == 1)
fp <- sum(gold == 0 & pred == 1)
fn <- sum(gold == 1 & pred == 0)
tn <- sum(gold == 0 & pred == 0)

recall = tp / (tp + fn)
preci = tp / (tp + fp)
f1 = 2 * recall * preci / (recall + preci)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f1)
## [1] 0.1538462
print(accuracy)
## [1] 0.8572587
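The cut-off applied to the MLP output (0.2 above) trades precision against recall. The following is a minimal sketch, on made-up scores and labels, of how the F1 score changes with the threshold; `f1_at` is a hypothetical helper name and the numbers are not taken from the analysis.

```r
# Hypothetical predicted scores for the positive class and true labels
score <- c(0.05, 0.10, 0.15, 0.25, 0.30, 0.45, 0.55, 0.70, 0.85, 0.95)
gold  <- c(0,    0,    0,    1,    0,    0,    1,    0,    1,    1)

# F1 for the positive class at a given threshold
f1_at <- function(threshold, score, gold) {
  pred <- as.integer(score > threshold)
  tp <- sum(pred == 1 & gold == 1)
  fp <- sum(pred == 1 & gold == 0)
  fn <- sum(pred == 0 & gold == 1)
  if (tp == 0) return(0)            # no true positives: define F1 as 0
  2 * tp / (2 * tp + fp + fn)       # same as 2 * P * R / (P + R)
}

thresholds <- c(0.2, 0.5, 0.8)
f1_values <- sapply(thresholds, f1_at, score = score, gold = gold)
print(data.frame(threshold = thresholds, f1 = round(f1_values, 3)))
```

In practice, a threshold would be chosen on a validation split rather than on the test set, so that the reported test score stays unbiased.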
Because the data set is highly imbalanced, with far fewer users who made a purchase than users who did not, we used oversampling to increase the number of purchasing users by duplicating their entries roughly 5.5 times. After this process, the two classes contain the same number of entries in the training set.
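The idea of random oversampling can be sketched in a few lines of base R. The toy data frame below is hypothetical, and the code illustrates the technique rather than reproducing the `upSample` function used in the analysis.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical imbalanced training set: 8 non-buyers, 2 buyers
train <- data.frame(
  PageValues = c(0, 0, 0, 0, 0, 2, 0, 1, 30, 45),
  Revenue = factor(c(rep("FALSE", 8), rep("TRUE", 2)))
)

counts <- table(train$Revenue)
minority <- names(counts)[which.min(counts)]
n_extra <- max(counts) - min(counts)   # rows needed to balance the classes

# Draw extra minority rows with replacement until both classes are equal
minority_rows <- which(train$Revenue == minority)
extra_rows <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
train_over <- rbind(train, train[extra_rows, ])

print(table(train_over$Revenue))
```

Only the training split should be resampled in this way. The test set keeps its original class distribution so that the evaluation reflects real traffic.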
Oversampling
library(caret)  # upSample()

data_over <- read.csv("online_shoppers_intention.csv")
data_over <- na.locf(data_over)
data_over$OperatingSystems <- as.factor(data_over$OperatingSystems)
data_over$Browser <- as.factor(data_over$Browser)
data_over$Region <- as.factor(data_over$Region)
data_over$TrafficType <- as.factor(data_over$TrafficType)
data_over$VisitorType <- as.factor(data_over$VisitorType)
data_over$Weekend <- as.factor(data_over$Weekend)
data_over$Month <- as.factor(data_over$Month)
data_over$Revenue <- as.factor(data_over$Revenue)

# Same split as before; only the training set is oversampled
data_over_train <- data_over[train_ind, ]
data_over_test <- data_over[-train_ind, ]
data_over_train <- upSample(data_over_train, data_over_train$Revenue)
data_over_train <- select(data_over_train, -Class)  # drop the extra label column added by upSample()
## Setting default kernel parameters
svm_over_pred <- predict(svm_over, data_over_test)
F1_Score(y_pred = svm_over_pred, y_true = data_over_test$Revenue, positive = "TRUE")
## [1] 0.6285714
## [1] 0.8734793
library(randomForest)
rf_over <- randomForest(Revenue ~ ., data = data_over_train, importance = TRUE, ntree = 15)
rf_over_pred <- predict(rf_over, data_over_test)
F1_Score(y_pred = rf_over_pred, y_true = data_over_test$Revenue, positive = "TRUE")
## [1] 0.6368421
## [1] 0.8880779
# Convert the categorical columns to integer codes for the MLP
data_over_numeric <- data_over
data_over_numeric$Revenue <- as.factor(data_over_numeric$Revenue)
data_over_numeric$Month <- as.integer(as.factor(data_over_numeric$Month))
data_over_numeric$Weekend <- as.integer(as.factor(data_over_numeric$Weekend))
data_over_numeric$VisitorType <- as.integer(as.factor(data_over_numeric$VisitorType))
data_over_numeric$TrafficType <- as.integer(as.factor(data_over_numeric$TrafficType))
data_over_numeric$Region <- as.integer(as.factor(data_over_numeric$Region))
data_over_numeric$Browser <- as.integer(as.factor(data_over_numeric$Browser))
data_over_numeric$OperatingSystems <- as.integer(as.factor(data_over_numeric$OperatingSystems))

# Note: this split is taken from data_over (the full data), not from the upsampled data_over_train
data_over_numeric_train <- data_over_numeric[train_ind, ]
data_over_numeric_test <- data_over_numeric[-train_ind, ]

mlp_over_train_raw_order <- data_over_numeric_train
# Shuffle the training rows, then separate the 17 features from the label in column 18
mlp_over_train <- data_over_numeric_train[sample(nrow(data_over_numeric_train)), ]
mlp_over_train_x <- mlp_over_train[, 1:17]
mlp_over_train_y <- decodeClassLabels(mlp_over_train[, 18])
mlp_over_data <- splitForTrainingAndTest(mlp_over_train_x, mlp_over_train_y, ratio = 0.1)
mlp_over <- mlp(mlp_over_data$inputsTrain, mlp_over_data$targetsTrain, size = 20, learnFuncParams = c(0.001),
                maxit = 150, learnFunc = "Rprop", inputsTest = mlp_over_data$inputsTest, targetsTest = mlp_over_data$targetsTest)

mlp_over_test_x <- data_over_numeric_test[, 1:17]
mlp_over_test_y <- decodeClassLabels(data_over_numeric_test[, 18])
mlp_over_pred <- predict(mlp_over, mlp_over_test_x, type = "class")

gold = as.numeric(mlp_over_test_y[, 2])
temp <- as.numeric(mlp_over_pred[, 2])
pred <- as.integer(temp > 0.5)  # classify as a purchase when the score exceeds 0.5

# Confusion-matrix counts over the whole test set (no hard-coded test size)
tp <- sum(gold == 1 & pred == 1)
fp <- sum(gold == 0 & pred == 1)
fn <- sum(gold == 1 & pred == 0)
tn <- sum(gold == 0 & pred == 0)

recall = tp / (tp + fn)
preci = tp / (tp + fp)
f1 = 2 * recall * preci / (recall + preci)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f1)
## [1] 0.6167665
print(accuracy)
## [1] 0.8961882
An ensemble uses the predictions of several different models as input features and trains a new model on these predictions. In principle, the ensemble model can perform better than the individual models. In our case, the ensemble combines the SVM, Random Forest, and MLP models trained with oversampling.
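A minimal sketch of this stacking idea on synthetic data is given below. It swaps in a logistic regression (`glm`) as the meta-model instead of the random forest used in this project, and all names (`make_preds`, `meta_train`, `meta_test`) and flip rates are hypothetical. The point it illustrates is that the meta-model's inputs consist only of the base-model predictions, while the true label is used solely as the response.

```r
set.seed(1)  # arbitrary seed for the toy data

# Simulate 0/1 predictions of a base model that is wrong with probability flip_rate
make_preds <- function(y, flip_rate) {
  flip <- runif(length(y)) < flip_rate
  ifelse(flip, 1 - y, y)
}

# Hypothetical true labels for a training and a test split
y_train <- rbinom(200, 1, 0.3)
y_test  <- rbinom(50, 1, 0.3)

# Meta-features: one column per base model, no label column
meta_train <- data.frame(svm = make_preds(y_train, 0.25),
                         rf  = make_preds(y_train, 0.20),
                         mlp = make_preds(y_train, 0.30))
meta_test  <- data.frame(svm = make_preds(y_test, 0.25),
                         rf  = make_preds(y_test, 0.20),
                         mlp = make_preds(y_test, 0.30))

# Meta-model: logistic regression on the base predictions only.
# The true label is the response, never a predictor column.
meta_model <- glm(y ~ svm + rf + mlp, family = binomial(),
                  data = cbind(meta_train, y = y_train))
meta_prob <- predict(meta_model, newdata = meta_test, type = "response")
meta_pred <- as.integer(meta_prob > 0.5)

print(mean(meta_pred == y_test))  # accuracy of the stacked model on the toy data
```

In a real pipeline, the training meta-features are usually out-of-fold predictions, so that the meta-model does not learn from base-model outputs on rows those models were fitted on.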
# MLP predictions on the training rows (in their original order) and on the test rows
mlp_over_train_x <- mlp_over_train_raw_order[, 1:17]
mlp_over_train_y <- decodeClassLabels(mlp_over_train_raw_order[, 18])
mlp_over_pred_train <- predict(mlp_over, mlp_over_train_x, type = "class")
mlp_over_pred_test <- predict(mlp_over, mlp_over_test_x, type = "class")

# Threshold the purchase score at 0.5 and store the result as a FALSE/TRUE factor
mlp_over_pred_train_class <- as.factor(as.numeric(mlp_over_pred_train[, 2]) > 0.5)
mlp_over_pred_test_class <- as.factor(as.numeric(mlp_over_pred_test[, 2]) > 0.5)

rf_pred_train <- predict(rf_over, data_over_train)
svm_pred_train <- predict(svm_over, data_over_train)
# Note: the fourth column is the true label, so the meta-model receives the target as an input feature
ensemble_train <- cbind(svm_pred_train, rf_pred_train, mlp_over_pred_train_class, data_over_train$Revenue)
## Warning in cbind(svm_pred_train, rf_pred_train, mlp_over_pred_train_class, :
## number of rows of result is not a multiple of vector length (arg 3)
ensemble_train <- as.data.frame(ensemble_train)
colnames(ensemble_train) <- c("svm", "rf", "mlp", "true")

rf_pred_test <- predict(rf_over, data_over_test)
svm_pred_test <- predict(svm_over, data_over_test)
ensemble_test <- cbind(svm_pred_test, rf_pred_test, mlp_over_pred_test_class, data_over_test$Revenue)
ensemble_test <- as.data.frame(ensemble_test)
colnames(ensemble_test) <- c("svm", "rf", "mlp", "true")

ensemble <- randomForest(x = ensemble_train, y = data_over_train$Revenue, importance = TRUE, ntree = 3)
# ensemble <- ksvm(x = ensemble_train, y = data_over_train$Revenue, kernel = "vanilladot")
ensemble_pred <- predict(ensemble, ensemble_test)
F1_Score(y_pred = ensemble_pred, y_true = data_over_test$Revenue, positive = "TRUE")
## [1] 0.7986799
## [1] 0.9505272
Random Forest (without oversampling) is used for feature selection, to identify which features influence the prediction of shoppers’ buying intention.
##
## Call:
## randomForest(formula = Revenue ~ ., data = data_train, importance = TRUE, ntree = 15)
## Type of random forest: classification
## Number of trees: 15
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 11.32%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 8860 496 0.05301411
## TRUE 758 966 0.43967517
In feature selection, the Gini index plays a role similar to information gain: it measures how much a split on a feature reduces node impurity. According to both the accuracy-based and the Gini-based importance measures of the Random Forest model (without oversampling), both metrics decrease when PageValues is excluded. PageValues is by far the most important feature for predicting whether a viewer will make a final purchase.
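As a rough illustration of what a Gini-based importance measures, the sketch below computes the decrease in Gini impurity for a single split on toy data. The helper names `gini` and `gini_decrease` and all values are hypothetical. A random forest accumulates such decreases over every split and tree, which this single-split example does not reproduce.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class shares
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity decrease when the labels y are split into a left and a right child
gini_decrease <- function(y, left) {
  n <- length(y)
  gini(y) - (sum(left) / n) * gini(y[left]) - (sum(!left) / n) * gini(y[!left])
}

# Hypothetical sessions: purchase label and a page value
y <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
page_value <- c(30, 12, 25, 0, 0, 0, 0, 4)

informative_split <- page_value > 10                # separates buyers perfectly
uninformative_split <- rep(c(TRUE, FALSE), times = 4)  # alternating, unrelated to y

print(gini(y))
print(gini_decrease(y, informative_split))
print(gini_decrease(y, uninformative_split))
```

A split that separates the classes well removes most of the parent node's impurity, so a feature that enables such splits receives a high Gini-based importance.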
imp <- randomForest::importance(rf_model)
impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
op <- par(mfrow = c(2, 3))
for (i in 1:6) {  # use seq_along(impvar) to plot the marginal probabilities for all features
  partialPlot(rf_model, data_train, impvar[i], xlab = impvar[i],
              main = paste("Partial Dependence of 'Revenue'\n on ", impvar[i]))
}
par(op)  # restore the previous plotting layout

According to the partial dependence plots, some of the most important features are PageValues, ProductRelated, and Administrative_Duration. PageValues shows the earliest and steepest drop.
Based on the features that the feature selection methods have in common, we can postulate that the value of the pages, the number of pages the viewer goes through, and the time the viewer spends on the related pages are crucial features for predicting whether they will make a final purchase. Further statistical analysis is needed to test these hypotheses.
In the first round of model training, without oversampling, Random Forest has comparatively higher accuracy and F1 score than SVM and MLP. After introducing oversampling, the three models have similar accuracy and F1 scores.
Oversampling does help improve the performance of the models.
The ensemble using the predictions of the three models yields the best accuracy and F1 score among all.
Salient features are: Page Value; Product Related (number of product-related pages viewed); Product Related Duration (total time spent on product-related pages).
We can hypothesize that the more attractive the pages are, the more pages the viewer reads, and the more time the viewer spends on the pages, the more likely the viewer is to make a purchase.
The F1 score computed for the MLP without oversampling may be affected by bugs. The ensemble also needs further work before its advantage can be demonstrated reliably: its input features include the true label, and the cbind warning shows that the MLP predictions did not match the length of the other predictions. More research into suitable models should also be conducted, since the three models used here may not be the best options for the task.
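One thing that may be worth checking, stated here as an assumption rather than a diagnosis, is the input preparation for the MLP: the features are passed in on their original scales and the categorical columns as integer codes. The base-R sketch below shows standardization and one-hot encoding on a hypothetical mini data set; the values are illustrative only.

```r
# Hypothetical mini data set with one numeric and one categorical feature
sessions <- data.frame(
  ProductRelated_Duration = c(0, 185.3, 600.2, 1469.2, 63973.5),
  VisitorType = factor(c("New_Visitor", "Returning_Visitor", "Other",
                         "Returning_Visitor", "New_Visitor"))
)

# Standardize the numeric column to mean 0 and standard deviation 1
duration_scaled <- as.numeric(scale(sessions$ProductRelated_Duration))

# One-hot encode the categorical column (one indicator column per level)
visitor_onehot <- model.matrix(~ VisitorType - 1, data = sessions)

# Combined numeric input matrix for a neural network
mlp_input <- cbind(duration_scaled, visitor_onehot)
print(round(mlp_input, 2))
```

Integer codes impose an arbitrary order on categories such as browser or region, whereas indicator columns do not, and inputs on very different scales can saturate logistic units early in training.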
For future research, different data augmentation methods should be tried to improve model performance. Statistical analysis of the selected features can be conducted to examine their causal effect on purchasing intention. More parameter tuning should be done to improve the performance of the models. We can also experiment with different CNN architectures for better performance.
Dinov, Ivo D. (2018). Data Science and Predictive Analytics: Biomedical and Health Applications using R. Cham: Springer International Publishing.
Sakar, C. Okan, Polat, S. Olcay, Katircioglu, Mete, & Kastro, Yomi (2019). Real-time prediction of online shoppers’ purchasing intention using multilayer perceptron and LSTM recurrent neural networks. Neural Computing and Applications, 31(10), 6893–6908. London: Springer London.