Recommendation.jl v0.4.0: Working with Missing Values, Data Typing, and Factorization Machines


2022-01-08



This article is part of the series: Building Recommender Systems in Julia

This week I released version 0.4.0 of Recommendation.jl, a library for recommender systems in Julia. As usual, you can install the package from Julia's central package registry and play with it as follows:

julia> using Pkg; Pkg.add("Recommendation")

It's been more than two years since the last version was released (and since I presented at JuliaCon 2019). This rework is in fact part of my New Year's resolution, which is about "writing more code to deliver actual value in the form of products".

Working on recommender systems in a programming language focused on scientific computing always brings a fresh perspective, and the experience lets me rethink how recommender interfaces should look in terms of, e.g., types, hyper-parameter representation, function naming, and computational efficiency. In our day-to-day work as developers, it's easy to overlook such essential aspects when we aggressively leverage well-developed, application-oriented tools such as scikit-learn and Spark MLlib.

In this release, there are three key updates I would like to highlight and invite you to think about.

What "missing" means in the context of recommenders

When it comes to implementing recommender systems, handling missing values in vectors and matrices (or tensors, in general) is one of the most important challenges. In practice, a missing value in user-item data can have several different meanings. For instance, the user may simply not have been exposed to the item yet (a truly missing observation). Or they may have seen the item but intentionally ignored it (negative feedback). Thinking carefully about these special cases is therefore a common problem developers encounter.

In Julia, there are multiple ways to represent "this is not an actual value that the element is supposed to be": missing, nothing, NaN, and undef.

Recommendation.jl v0.3.0 used NaN without much thought, but I had the impression that this design choice was semantically incorrect: NaN is returned by 0 / 0, for example, and the value explicitly represents "something went wrong". Meanwhile, undef represents uninitialized values that should eventually be filled in somehow, which is tricky to deal with and also semantically different from what we are looking for in the context of a user-item interaction matrix.
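To make the difference between these four concrete, here is a minimal sketch in plain Julia. It only uses Base functions and is not taken from Recommendation.jl; the variable names are made up for illustration.

```julia
# NaN is an ordinary Float64 produced by invalid arithmetic.
x = 0 / 0
x_is_nan = isnan(x)
x_equals_itself = (x == x)     # NaN is not equal even to itself

# `missing` (of type Missing) propagates through arithmetic.
missing_propagates = ismissing(1.0 + missing)

# `nothing` (of type Nothing) does not support arithmetic at all,
# so mixing it into a computation fails loudly with a MethodError.
nothing_error = try
    1.0 + nothing
    false
catch err
    err isa MethodError
end

# `undef` is only an array initializer: the contents are arbitrary
# until they are assigned, so we overwrite them before reading.
buf = Vector{Float64}(undef, 3)
fill!(buf, 0.0)
```

In other words, NaN and missing silently flow through numerical code, nothing stops it immediately, and undef is not a value you can test for afterwards.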

Therefore, I decided to leverage Julia's special-purpose types, Missing and Nothing, for our particular use case. Both represent "unknown" but are used in different ways:

Missing is mainly for model parameters, which should exist but have not been calculated yet.
Input data is expected to use Nothing as an explicit indication of "no value".

Both values are eventually filled with zero in the current implementation, but there is certainly room for improvement, e.g., filling with means or considering positive-only feedback recommendation techniques. Regardless, I believe staying away from NaN is an important step the recommender implementation needed to take.
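As a rough illustration of the filling strategies mentioned above, the following sketch uses Base's something and coalesce on toy vectors. The data and variable names are hypothetical, and this is not how the package implements filling internally.

```julia
# Hypothetical ratings from one user; `nothing` marks "no value" in the input.
ratings = Union{Nothing, Float64}[5.0, nothing, 3.0, nothing]

# Zero fill: `something` returns its first argument that is not `nothing`.
zero_filled = [something(r, 0.0) for r in ratings]

# Alternative: fill with the mean of the observed values.
observed = [r for r in ratings if r !== nothing]
mean_value = sum(observed) / length(observed)
mean_filled = [something(r, mean_value) for r in ratings]

# The same idea for parameters stored as `missing`, using `coalesce`,
# which returns its first argument that is not `missing`.
params = Union{Missing, Float64}[0.1, missing, 0.3]
params_filled = coalesce.(params, 0.0)
```

With zero fill, an unobserved item looks the same as the lowest possible score, which is one reason mean fill or positive-only techniques can be preferable.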

See this Stack Overflow answer for more thoughts.

Data typing

Beyond the conventional Float64 type, the library has generalized the value type associated with a user-item event (e.g., click, purchase, frequency, rating) as Infinite:

Infinite = Union{AbstractFloat, Integer}

mutable struct Event  # `type` is pre-1.0 syntax; Julia 1.x spells it `mutable struct`
   user::Integer
   item::Integer
   value::Infinite
end

Importantly, both AbstractFloat and Integer are abstract types, so their union Infinite accepts any concrete floating-point or integer type. This makes the interface applicable to a wider range of datasets. What "value" indicates differs from dataset to dataset, and the possibilities include not only missing values and floating-point numbers but also signed and unsigned integers.
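A small sketch of what this buys us in practice. DemoEvent below is a hypothetical stand-in that mirrors the definition above so the snippet runs on its own, without loading the package.

```julia
# Same alias as in the post; `const` is added because this is a standalone script.
const Infinite = Union{AbstractFloat, Integer}

# Hypothetical stand-in for the package's Event type.
struct DemoEvent
    user::Integer
    item::Integer
    value::Infinite
end

events = [
    DemoEvent(1, 10, 4.5),           # Float64 rating
    DemoEvent(2, 11, Float32(0.8)),  # Float32 score
    DemoEvent(3, 12, 7),             # Int count, e.g. frequency
    DemoEvent(4, 13, UInt8(1)),      # unsigned click flag
]

# Each event keeps the concrete type it was created with.
value_types = [typeof(e.value) for e in events]

# Values outside the union, such as strings, are not accepted.
string_is_allowed = "4.5" isa Infinite
```

The trade-off is that fields annotated with abstract types are less friendly to the compiler than concrete ones, so flexibility here may cost some performance.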

For data-driven applications in general, setting proper expectations for the types is critical so that all the subsequent numerical computations run correctly. From that perspective, ScientificTypes.jl is currently inspiring me to dig deeper into the problem in Julia. It attaches proper semantics to the standard types, which makes applications more usable, readable, and maintainable.

I will keep trying to extend the type coverage, including non-trivial string representations, and Recommendation.jl may eventually use ScientificTypes.jl as a dependency.

Factorization Machines

Last but not least, I have finally implemented and merged Factorization Machines, a polynomial regression-based general predictor widely used in the machine learning and recommendation communities. In fact, when I presented Recommendation.jl at JuliaCon 2019, several attendees reached out and asked me to support the algorithm in the package. I did a quick implementation afterward, but it stayed as a pull request for a while.

Currently, the basic combination of a least-squares loss and SGD-based optimization is the only option, but I will definitely support a ranking loss, along with extending how missing and numeric values are handled and represented, as discussed above.
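For readers unfamiliar with the model, here is a minimal sketch of a second-order Factorization Machine trained with squared loss and plain SGD. It is not the package's implementation: DemoFM, init_fm, fm_predict, sgd_step!, and train! are hypothetical names, the toy data is made up, and regularization is omitted for brevity.

```julia
using LinearAlgebra: dot
using Random

# Hypothetical container for the parameters of a second-order FM.
mutable struct DemoFM
    w0::Float64          # global bias
    w::Vector{Float64}   # linear weights, one per feature
    V::Matrix{Float64}   # factor matrix of size (n_features, k)
end

init_fm(n::Int, k::Int, rng::AbstractRNG) =
    DemoFM(0.0, zeros(n), 0.01 .* randn(rng, n, k))

# y(x) = w0 + <w, x> + 1/2 * sum_f [(sum_i V[i,f] x[i])^2 - sum_i V[i,f]^2 x[i]^2]
function fm_predict(m::DemoFM, x::Vector{Float64})
    s = m.V' * x                      # s[f] = sum_i V[i,f] * x[i]
    s2 = (m.V .^ 2)' * (x .^ 2)       # s2[f] = sum_i V[i,f]^2 * x[i]^2
    return m.w0 + dot(m.w, x) + 0.5 * sum(s .^ 2 .- s2)
end

# One SGD step on the squared loss 1/2 * (fm_predict(x) - y)^2.
# Returns the squared error measured before the update.
function sgd_step!(m::DemoFM, x::Vector{Float64}, y::Float64; lr=0.05)
    s = m.V' * x
    err = fm_predict(m, x) - y
    m.w0 -= lr * err
    m.w .-= lr .* err .* x
    # d y(x) / d V[i,f] = x[i] * s[f] - V[i,f] * x[i]^2
    m.V .-= lr .* err .* (x .* s' .- m.V .* (x .^ 2))
    return err^2
end

# Run several passes over the data and record the summed squared error per pass.
function train!(m::DemoFM, X::Vector{Vector{Float64}}, y::Vector{Float64};
                epochs=200, lr=0.05)
    losses = Float64[]
    for _ in 1:epochs
        push!(losses, sum(sgd_step!(m, xi, yi; lr=lr) for (xi, yi) in zip(X, y)))
    end
    return losses
end

# Toy data: one-hot user (2 users) followed by one-hot item (2 items).
X = [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
y = [5.0, 3.0, 4.0, 1.0]

model = init_fm(4, 2, MersenneTwister(42))
losses = train!(model, X, y)
```

The prediction uses the well-known reformulation of the pairwise term, which costs time linear in the number of features and factors rather than quadratic in the number of features.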

Speaking of supported algorithms, I'm currently thinking of capping the complexity at Factorization Machines. I originally implemented Recommendation.jl as a collection of basic recommendation techniques, mostly for educational purposes, so I would like to spend more time on sophistication and generalization rather than complication and over-optimization.

Ultimately, performance improvement is the topic I personally want to invest the most effort in. Interpretable, semantically meaningful code design sometimes sacrifices computational efficiency, and in many cases unnecessarily so. Since Julia is best known as an efficient programming language, it's worth thinking deeply about how to use its language features efficiently.


Author: Takuya Kitazawa

Takuya Kitazawa is a freelance software developer, previously working at a Big Tech and Silicon Valley-based start-up company where he wore multiple hats as a full-stack software developer, machine learning engineer, data scientist, and product manager. At the intersection of technological and social aspects of data-driven applications, he is passionate about promoting the ethical use of information technologies through his mentoring, business consultation, and public engagement activities. See CV for more information, or contact at [email protected].


Disclaimer

Opinions are my own and do not represent the views of organizations I belong or belonged to.
I am doing my best to ensure the accuracy and fair use of the information. However, there might be some errors, outdated information, or biased subjective statements because the main purpose of this blog is to jot down my personal thoughts as soon as possible before conducting an extensive investigation. Visitors understand the limitations and rely on any information at their own risk.
That said, if there is any issue with the content, please contact me so I can take the necessary action.