You can access the GitHub repository here, and the documentation can be found here.
If you would like to view the report, please go to DR NTU. And if you like what we did, please cite the work as follows:
Ng, T. K. (2024). Streamsight: a toolkit for offline evaluation of recommender systems. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/181114
Numerous Recommender System (RS) toolkits for offline evaluation have been released over the
years. However, little emphasis has been placed on the temporal aspects of the data within the
frameworks of these toolkits. We noticed that current toolkits tend to prioritize complex
algorithm implementations and a wide variety of metrics for evaluating these algorithms.
Instead, we would like to take a step back and consider another angle of implementing toolkits
for RS: how the temporal aspects of the data should be handled in the data split scheme and
observed during the evaluation of RS.
This report introduces Streamsight, an open-source Python RS toolkit developed and made
available on the Python Package Index (PyPI). Streamsight provides a framework that addresses
the gaps discussed and implements the solutions proposed in this report. Streamsight offers an
end-to-end framework for developing and testing RS, mainly targeted at a global sliding window
as the proposed data split scheme and evaluation method that accounts for the temporal aspect.
By observing the temporal element, we aim to bring offline evaluation closer to the dynamic
data communication and flow of the online setting. The library provides programmers with APIs
that abstract the underlying implementation for easy and standardized use. The project and API
documentation can be found on GitHub and PyPI.
As technology evolves, Recommender Systems (RS) have become ubiquitous, driving user
engagement and influencing purchase decisions across various platforms. RS leverage algorithms
that recommend items to users based on their preferences or past behaviour, and their
effectiveness is evaluated using defined metrics such as Precision and Recall.
Over the years, several toolkits have been developed to provide frameworks for running and
reproducing experimental settings to train and evaluate these RS algorithms. Many of these
toolkits have been made open source and can be easily accessed on GitHub.
Examining the available toolkits through their documentation and codebases, we found that
current implementations tend to focus on designing complex algorithms for prediction and novel
metrics for evaluation, while overlooking temporal handling. These implementations fail to
capture the dynamic, time-dependent nature of user behaviour, assuming that traditional data
splitting and evaluation methods for RS are sufficient. User preferences can evolve over time,
influenced by trends or seasonality, a process termed concept drift. This temporal dimension
holds crucial information that RS can use to formulate better recommendations.
To address the above gaps, we have developed Streamsight, a Python toolkit for the evaluation
and reproducibility of RS in an offline setting that observes the context of time in its
framework and implementation. Our focus lies mainly in a data split scheme that respects a
global timeline to prevent data leakage, and in the evaluation of RS with a time context,
with emphasis on the computation of the metrics.
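To make the idea of a global-timeline split concrete, the sketch below gives a minimal illustration in plain Python with pandas. It is not Streamsight's actual API; the function name, column names, and parameters are assumptions made purely for illustration.

```python
import pandas as pd

def global_sliding_window_split(interactions: pd.DataFrame,
                                t_start: int,
                                window_size: int,
                                n_windows: int):
    """Yield (train, test) pairs cut on a single global timeline.

    interactions: DataFrame with columns ['user', 'item', 'timestamp'].
    t_start: timestamp at which the first evaluation window begins.
    window_size: width of each evaluation window (same unit as timestamps).
    n_windows: number of windows to slide over.
    """
    for k in range(n_windows):
        t_split = t_start + k * window_size
        # Training data: everything observed strictly before the split point,
        # so no future interaction can leak into the model.
        train = interactions[interactions["timestamp"] < t_split]
        # Test data: only the interactions that fall inside the next window.
        test = interactions[
            (interactions["timestamp"] >= t_split)
            & (interactions["timestamp"] < t_split + window_size)
        ]
        yield train, test
```

Each (train, test) pair produced this way can then be fed to a train-and-evaluate pipeline, with metrics computed per window and aggregated afterwards.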
Furthermore, to better simulate a production environment in an offline setting, we leverage the
design of the new split scheme to build two evaluation schemes for RS: the traditional pipeline
to train and evaluate algorithms, adapted to accommodate the new split scheme, and a streaming
scheme that provides API calls so that the RS and the evaluation platform can be decoupled.
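The decoupled, streaming-style interaction can be pictured as an evaluator that releases data and scores submissions window by window. The sketch below only illustrates this protocol under assumed interfaces; the class and method names are hypothetical and are not Streamsight's actual API.

```python
class StreamingEvaluator:
    """Illustrative evaluator that releases data and scores predictions per window."""

    def __init__(self, windows):
        # windows: iterable of (train, test) DataFrame pairs, e.g. produced
        # by the global sliding-window split sketched earlier.
        self._windows = list(windows)
        self._step = 0

    def release_training_data(self):
        # Hand the RS everything observed before the current split point.
        train, _ = self._windows[self._step]
        return train

    def get_users_to_predict(self):
        # Expose only the users appearing in the hidden window, not the items.
        _, test = self._windows[self._step]
        return test["user"].unique()

    def submit_predictions(self, predictions):
        # predictions: dict mapping user -> list of recommended items.
        _, test = self._windows[self._step]
        score = self._hit_rate(predictions, test)
        self._step += 1  # slide to the next window
        return score

    @staticmethod
    def _hit_rate(predictions, test):
        # Simple illustrative metric: fraction of test users for whom at least
        # one recommended item appears in the hidden window.
        truth = test.groupby("user")["item"].apply(set)
        hits = sum(
            1 for user, items in predictions.items()
            if user in truth.index and truth[user] & set(items)
        )
        return hits / max(len(truth), 1)
```

The point of this design is that the RS never sees the hidden window: it only receives the released training data and the users to predict for, mirroring how an online system serves requests without knowledge of future interactions.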
In this project, we have designed a toolkit that offers a framework for offline evaluation,
giving the research community and developers a set of tools that provides another perspective
from an evaluation standpoint. One might ask: with so many toolkits already offering various
algorithms and metrics, why consider Streamsight? Our answer is that we see a gap in the
research community: current evaluation schemes tend to oversimplify the complex, dynamic
nature of human interactions with the RS during evaluation, as highlighted by Sun [1].
As pointed out in [2], the basic Item-KNN performs well in an overall comparison with 20 other
algorithms and even ranks first in some tests. Overall, Item-KNN achieved the best mean rank
across various algorithms developed over the years (Table 1 of [2]). This intriguing finding
suggests that the research community could consider a shift in perspective: from designing new,
complex algorithms and metrics towards truly examining the evaluation of these RS, which tends
to be overlooked.
Sun's [1] strong advocacy for drawing the community's attention to the evaluation domain
underscores the very purpose of this project: we echo his perspective by providing a framework
that adheres closely to the dynamic nature of user behaviour and the temporal aspects that can
affect the evaluation of RS. Returning to the question raised at the start of this section,
our answer is that we hope Streamsight can highlight the current gaps in the community and
offer a solution that advances evaluation methods in the direction envisaged by Sun.
[1] A. Sun, “Take a Fresh Look at Recommender Systems from an Evaluation Standpoint,”
in Proceedings of the 46th International ACM SIGIR Conference on Research and
Development in Information Retrieval, in SIGIR ’23. New York, NY, USA: Association
for Computing Machinery, Jul. 2023, pp. 2629–2638. doi: 10.1145/3539618.3591931.
[2] D. McElfresh, S. Khandagale, J. Valverde, J. P. Dickerson, and C. White, “On the
Generalizability and Predictability of Recommender Systems”.
I would like to express my deepest gratitude to my supervisor, Assoc Prof Sun Aixin, for his
guidance and time throughout the project. His feedback and mentorship have been instrumental
in shaping the direction and completion of the project. The regular meetings amidst his busy
schedule have been extremely helpful in my understanding of the scope and impact of the project.
Finally, I would like to thank my family and friends for their continuous support throughout
my studies at NTU.