
1 Abstract

In this project, we use machine learning techniques to predict visitors’ shopping intention. The data, extracted from the visit log of an online shopping website, consist of features such as the closeness of the visit to a special day and the number of product-related pages the visitor viewed. Random forest (RF), support vector machines (SVMs), and multilayer perceptron (MLP) are used for prediction. We use oversampling to improve the performance and scalability of the classifiers. We also ran an ensemble that uses the predictions from the three models. The results show that MLP has higher accuracy and F1 score than RF and SVM, and that oversampling improved model performance. The ensemble had the highest accuracy and F1 score of all. Page value, number of product-related pages visited, and time spent on product-related pages are the most important features for predicting purchasing intention.

2 Introduction

Online shopping websites may be popular with visitors, yet the conversion rate from those who are interested in a product to those who actually purchase it can be less than ideal. Correctly identifying the aspects a website could improve in order to turn more “visitors” into “buyers” is therefore crucial. Being able to predict potential “buyers” with higher accuracy would also allow websites to better target their advertising or other visual information to encourage purchases. The goal of this paper is to find a better model for predicting purchase intention and to identify the important features that contribute to the prediction.

The feature “Product Related” is the number of product-related pages a shopper visited. Because these pages describe the products shoppers want to buy, “Product Related” is an important feature for predicting a shopper’s intention.

The feature “Month” is the month of the visit date, and the feature “Special Day” represents how close the visit is to a “special day” such as a holiday. Shoppers know that discounts are much larger around some holidays, so they may wait for such a holiday to purchase items they were already interested in. Thus, the month can be a crucial feature for predicting purchasing intention.

As for the features that matter for purchase intention, we hypothesize that Product Related and Month are the most important features for predicting a shopper’s intention.

3 Data Exploration and Visualization

3.1 Data Description

The dataset used in this project comes from Sakar et al. (2019), in which the authors designed a system to predict online shoppers’ purchasing intention and page abandonment. Purchasing intention was cast as a binary classification problem: a viewer either made a final purchase or did not. The dataset consists of 12330 sessions, each recording one viewer’s visit to the website. Each session belongs to a different user within a 1-year period, to avoid any tendency toward a specific campaign, special day, user profile, or period. The dataset is highly imbalanced: 84.5% (10422) of the viewers did not make a final purchase.

| Feature | Description |
| --- | --- |
| Administrative | Number of pages visited by the visitor about account management |
| Administrative duration | Total amount of time (in seconds) spent by the visitor on account management related pages |
| Informational | Number of pages visited by the visitor about Web site, communication and address information of the shopping site |
| Informational duration | Total amount of time (in seconds) spent by the visitor on informational pages |
| Product related | Number of pages visited by visitor about product related pages |
| Product related duration | Total amount of time (in seconds) spent by the visitor on product related pages |
| Bounce rate | Average bounce rate value of the pages visited by the visitor |
| Exit rate | Average exit rate value of the pages visited by the visitor |
| Page value | Average page value of the pages visited by the visitor |
| Special day | Closeness of the site visiting time to a special day |
| OperatingSystems | Operating system of the visitor |
| Browser | Browser of the visitor |
| Region | Geographic region from which the session has been started by the visitor |
| TrafficType | Traffic source by which the visitor has arrived at the Web site (e.g., banner, SMS, direct) |
| VisitorType | Visitor type as “New Visitor”, “Returning Visitor”, and “Other” |
| Weekend | Boolean value indicating whether the date of the visit is weekend |
| Month | Month value of the visit date |
| Revenue | Class label indicating whether the visit has been finalized with a transaction |

3.2 Data Summary

After downloading the file, we imported it directly with read.csv() and summarized it to see statistics such as the max, min, and median values of the 18 features. There are 10 numerical features and 8 categorical features.
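
The import and summary step can be sketched as follows. This is a minimal illustration rather than the code behind the report: the inline text stands in for the real file, only a handful of the 18 columns are shown, and the file name in the comment is an assumption.

```r
# Tiny inline stand-in for the real CSV; columns mirror a few of the 18 features.
csv_text <- "ProductRelated,PageValues,Month,VisitorType,Weekend,Revenue
1,0,Feb,Returning_Visitor,FALSE,FALSE
10,12.5,Nov,New_Visitor,TRUE,TRUE
3,0,May,Returning_Visitor,FALSE,FALSE"

# With the real data this would be something like
# read.csv("online_shoppers_intention.csv", stringsAsFactors = TRUE)  # assumed file name
shoppers <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# summary() reports quartiles for numeric columns and counts for factors/logicals.
print(summary(shoppers))

# Count how many columns R treats as numeric.
n_numeric <- sum(sapply(shoppers, is.numeric))
```

Columns that hold integer codes (for example OperatingSystems or Browser) are read as numeric unless they are converted with `factor()`, which is why `summary()` may report quartiles for them.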

Numeric summaries:

| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
| --- | --- | --- | --- | --- | --- | --- |
| Administrative | 0.000 | 0.000 | 1.000 | 2.322 | 4.000 | 27.000 |
| Administrative_Duration | -1.00 | 0.00 | 8.00 | 80.98 | 93.79 | 3398.75 |
| Informational | 0.0000 | 0.0000 | 0.0000 | 0.5046 | 0.0000 | 24.0000 |
| Informational_Duration | -1.00 | 0.00 | 0.00 | 34.52 | 0.00 | 2549.38 |
| ProductRelated | 0.00 | 7.00 | 18.00 | 31.76 | 38.00 | 705.00 |
| ProductRelated_Duration | -1.0 | 185.3 | 600.2 | 1195.7 | 1469.2 | 63973.5 |
| BounceRates | 0.000000 | 0.000000 | 0.003125 | 0.022139 | 0.016667 | 0.200000 |
| ExitRates | 0.00000 | 0.01429 | 0.02511 | 0.04298 | 0.05000 | 0.20000 |
| PageValues | 0.000 | 0.000 | 0.000 | 5.889 | 0.000 | 361.764 |
| SpecialDay | 0.00000 | 0.00000 | 0.00000 | 0.06143 | 0.00000 | 1.00000 |
| OperatingSystems | 1.000 | 2.000 | 2.000 | 2.124 | 3.000 | 8.000 |
| Browser | 1.000 | 2.000 | 2.000 | 2.357 | 2.000 | 13.000 |
| Region | 1.000 | 1.000 | 3.000 | 3.147 | 4.000 | 9.000 |
| TrafficType | 1.00 | 2.00 | 2.00 | 4.07 | 4.00 | 20.00 |

Counts for the remaining variables:

| Variable | Counts |
| --- | --- |
| Month | May: 3364, Nov: 2998, Mar: 1907, Dec: 1727, Oct: 549, Sep: 448, (Other): 1337 |
| VisitorType | New_Visitor: 1694, Other: 85, Returning_Visitor: 10551 |
| Weekend (logical) | FALSE: 9462, TRUE: 2868 |
| Revenue (logical) | FALSE: 10422, TRUE: 1908 |

3.3 Data Visualization

  • Chart 1: a heat map of the correlations between the 10 numerical variables. We found that the number of pages of a given type visited and the time spent on that type of page are relatively highly correlated.

  • Chart 2: a pairs.panels plot (correlation, density, and histogram) of ProductRelated and ProductRelated_Duration, which have the second-highest correlation.

  • Chart 3: a pairs.panels plot of BounceRates and ExitRates, which have the highest correlation.

  • Chart 4: a barplot of visits by month, which shows that “May” is the most popular shopping month.

  • Chart 5: a plot of the visitor types, which shows that “Returning_Visitor” sessions far outnumber the other types.
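
The correlation analysis behind Charts 1 to 3 can be sketched in base R. The data below are simulated purely for illustration (the real columns are not reproduced here), and the variable names `sessions`, `corr_upper`, and `top_pair` are hypothetical.

```r
set.seed(1)
n <- 200

# Simulated sessions: duration grows with the number of product pages visited.
pages <- rpois(n, lambda = 20)
sessions <- data.frame(
  ProductRelated          = pages,
  ProductRelated_Duration = pages * 40 + rnorm(n, sd = 100),
  PageValues              = rexp(n, rate = 0.2)
)

# Pairwise Pearson correlations between the numeric columns.
corr <- cor(sessions)

# Keep only the upper triangle so each pair is considered once.
corr_upper <- corr
corr_upper[lower.tri(corr_upper, diag = TRUE)] <- NA

# Locate the pair with the largest absolute correlation.
top <- which(abs(corr_upper) == max(abs(corr_upper), na.rm = TRUE), arr.ind = TRUE)
top_pair <- c(rownames(corr)[top[1, "row"]], colnames(corr)[top[1, "col"]])
print(top_pair)
```

Calling `heatmap(corr, symm = TRUE)` on the same matrix would draw a basic heat map with base graphics.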

4 Model Training without Oversampling

We trained Support Vector Machine, Random Forest, and Multilayer Perceptron models to predict shoppers’ purchasing intention.

  • Support Vector Machines (SVM): SVM is very often used for classification. It uses hyperplanes to split the data into separate groups, or classes. Kernel tricks can help by transforming the data to a higher dimension before the hyperplanes split the data into groups.

  • Random Forest: Decision trees and Random Forest are very useful for feature selection, since every node makes a binary split on a single feature.

  • Multilayer Perceptron (MLP): MLP is one of the most often used neural networks for classification. Users may specify the number of hidden layers. The model optimizes a log-loss function using stochastic gradient descent.

4.3 Multilayer Perceptron Classifier

## [1] 0.1538462
## [1] 0.8572587
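
Accuracy and F1 can both be computed directly from the actual and predicted labels. The following is a minimal base-R sketch, assuming logical label vectors with TRUE as the purchase class; `classification_metrics` is a hypothetical helper and not the code that produced the numbers above.

```r
classification_metrics <- function(actual, predicted, positive = TRUE) {
  stopifnot(length(actual) == length(predicted))

  # Confusion-matrix cells for the positive class.
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)

  # Guard against division by zero when a class is never predicted or never present.
  precision <- if (tp + fp > 0) tp / (tp + fp) else 0
  recall    <- if (tp + fn > 0) tp / (tp + fn) else 0
  f1 <- if (precision + recall > 0) {
    2 * precision * recall / (precision + recall)
  } else {
    0
  }

  c(accuracy = mean(predicted == actual),
    precision = precision, recall = recall, f1 = f1)
}

# Toy labels: 3 purchases out of 8 sessions, of which the model finds only 1.
actual    <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
metrics <- classification_metrics(actual, predicted)
print(metrics)
```

On imbalanced data, a classifier that rarely predicts the minority class can keep a high accuracy while its F1 score stays low, so the two metrics are worth reading together.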

5 Model Training with Oversampling

Because the data set is highly imbalanced, with far fewer users who made a purchase than users who did not, we used oversampling to increase the number of purchasing users by simply duplicating their entries 5.5 times. After this process, the two categories (users who made a final purchase and users who did not) contain the same number of entries.

Oversampling
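
A minimal sketch of random oversampling in base R follows. It draws minority rows with replacement until the classes are balanced, which is a close variant of duplicating the entries a fixed number of times; the toy data frame and all variable names are assumptions for illustration.

```r
set.seed(42)

# Toy training set: 6 sessions without a purchase, 2 with a purchase.
train <- data.frame(
  PageValues = c(0, 0, 0, 0, 0, 0, 12.1, 30.4),
  Revenue    = c(rep(FALSE, 6), rep(TRUE, 2))
)

minority <- train[train$Revenue, ]
majority <- train[!train$Revenue, ]

# Draw extra minority rows (with replacement) until both classes have equal size.
n_extra <- nrow(majority) - nrow(minority)
extra   <- minority[sample(nrow(minority), n_extra, replace = TRUE), ]

train_over <- rbind(train, extra)
print(table(train_over$Revenue))
```

Oversampling is usually applied to the training split only, so that duplicated purchase sessions do not also appear in the data used for evaluation.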

5.3 Multilayer Perceptron Classifier

## [1] 0.6167665
## [1] 0.8961882

5.4 Ensemble

An ensemble uses the predictions of several different models as input features and trains a new model on these predictions. In theory, the ensemble model should perform better than the individual models. In our case, the ensemble combines the SVM, Random Forest, and MLP models with oversampling.

## Warning in cbind(svm_pred_train, rf_pred_train, mlp_over_pred_train_class, :
## number of rows of result is not a multiple of vector length (arg 3)
## [1] 0.7986799
## [1] 0.9505272
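
The `cbind` warning above means that one of the prediction vectors (argument 3) had a different length from the others, so R recycled values to fill the matrix. A stacking sketch that checks the lengths before combining is shown below. The base-model predictions are simulated, the logistic-regression meta-learner is an assumption (the report does not state which model was stacked on top), and all names are hypothetical.

```r
set.seed(7)
n <- 200
truth <- rbinom(n, 1, 0.3) == 1

# Simulated base-model predictions: each flips the true label with some probability.
noisy <- function(y, flip) ifelse(runif(length(y)) < flip, !y, y)
preds <- list(
  svm = noisy(truth, 0.20),
  rf  = noisy(truth, 0.15),
  mlp = noisy(truth, 0.25)
)

# Fail early instead of letting cbind() silently recycle a shorter vector.
stopifnot(all(lengths(preds) == length(truth)))

# Stack the predictions as features and fit the meta-learner on them.
stack <- data.frame(preds, Revenue = truth)
meta  <- glm(Revenue ~ svm + rf + mlp, data = stack, family = binomial)

ensemble_pred <- predict(meta, type = "response") > 0.5
ensemble_acc  <- mean(ensemble_pred == truth)
print(ensemble_acc)
```

If the base models were trained on differently sized data (for example, original versus oversampled rows), their predictions need to be generated on the same set of rows before stacking. Scores computed on the rows used to fit the meta-learner will also tend to be optimistic.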

6 Feature Selection

Random Forest (without oversampling) is used for feature selection, to identify which features influence the prediction of shoppers’ buying intention.

## 
## Call:
##  randomForest(formula = Revenue ~ ., data = data_train, importance = TRUE,      ntree = 15) 
##                Type of random forest: classification
##                      Number of trees: 15
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 11.32%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE  8860  496  0.05301411
## TRUE    758  966  0.43967517
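
The error rates in this output follow directly from the confusion matrix, whose rows are the actual classes and whose columns are the predicted classes. A short base-R check, with the counts copied from the output above:

```r
# Confusion matrix from the randomForest output (rows = actual, columns = predicted).
cm <- matrix(c(8860, 496,
                758, 966),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall out-of-bag error: share of all sessions that were misclassified.
oob_error <- 1 - sum(diag(cm)) / sum(cm)

print(round(class_error, 8))
print(round(100 * oob_error, 2))  # 11.32, matching the reported OOB estimate
```

The two class errors differ widely: about 5% of non-purchasing sessions are misclassified, against about 44% of purchasing sessions, which reflects the class imbalance.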

In feature selection, the Gini index measures how much splitting on a feature reduces node impurity. According to both the accuracy and Gini importance measures of the Random Forest model (without oversampling), both metrics decrease when PageValues is excluded. PageValues is by far the most important feature for predicting whether a viewer will make a final purchase.

According to the partial dependence plots, some of the most important features are PageValues, ProductRelated, and Administrative_Duration. PageValues shows the earliest and steepest drop.

Based on the features that the feature selection methods have in common, we can tentatively suggest that the value of the pages, the number of pages the viewer goes through, and the time the viewer spends on the related pages are crucial features for predicting whether the viewer will make a final purchase. Further statistical analysis is needed to test this hypothesis.
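
For reference, the Gini-based importance reported by Random Forest accumulates the decrease in Gini impurity over all splits that use a feature. The arithmetic for a single split can be sketched as below; the class counts are made up for illustration and do not come from the dataset.

```r
# Gini impurity of a node from its class counts: 1 - sum(p_k^2).
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical node of 100 sessions, split into two children on one feature.
parent <- c(no = 80, yes = 20)
left   <- c(no = 75, yes = 5)
right  <- c(no = 5,  yes = 15)

# Weighted impurity of the children, using each child's share of the sessions.
w_left  <- sum(left)  / sum(parent)
w_right <- sum(right) / sum(parent)
children <- w_left * gini(left) + w_right * gini(right)

# Impurity decrease credited to the splitting feature for this split.
decrease <- gini(parent) - children
print(decrease)
```

A feature that produces large decreases across many trees ends up with a high Gini importance.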

7 Result

7.1 Model prediction

  • In the first round, training without oversampling, Random Forest has comparatively higher accuracy and F1 score than SVM and MLP. After introducing oversampling, the models have similar accuracy and F1 scores.

  • Oversampling does help with improving the performance of the models.

  • The ensemble using the predictions of the three models yields the best accuracy and F1 score of all.

7.2 Feature selection

  • Salient features are: Page Value; Product Related (number of related pages viewed); Product Related Duration (total time spent on related pages).

  • We can hypothesize that the more attractive the pages are, the more pages the viewer reads, and the more time the viewer spends on the pages, the more likely the viewer is to make a purchase.

8 Discussion/Future Research

There may be bugs in the F1 score calculation for the MLP without oversampling. More research into a proper model should also be conducted, since the three models run here may not be the best options for the task. In addition, more work on the ensemble is needed to confirm that the ensemble model runs correctly and to demonstrate its advantage.

For future research, different methods of data augmentation should be tried to improve model performance. Statistical analysis of the selected features can be conducted to assess their causal effect on purchasing intention. More parameter tuning should be done to improve the performance of the models. We can also experiment with different CNN structures for better performance.

9 Acknowledgment

  • Thanks to Dr. Dinov for all the great instruction this semester.
  • Thanks to Xinyan Zhao for his great patience in helping with debugging and with oversampling.

10 References

Dinov, Ivo D. (2018). Data Science and Predictive Analytics: Biomedical and Health Applications using R. Cham: Springer International Publishing.

Sakar, C Okan, Polat, S Olcay, Katircioglu, Mete, & Kastro, Yomi (2019). Real-time prediction of online shoppers’ purchasing intention using multilayer perceptron and LSTM recurrent neural networks. Neural Computing and Applications, 31(10), 6893–6908. London: Springer London.
