ApacheCon Europe 2019
Apache Hivemall is a collection of Hive user-defined functions for machine learning (ML). It lets us solve a wide variety of ML problems through Hive's scalable SQL-like interface. As a motivating example, a simple regression or classification model can be trained efficiently by executing a query of just 10 lines.
This session demonstrates this Hivemall functionality with a special focus on integration with Apache Spark; the Hivemall contributors have been actively working on Spark integration since the project entered the Apache Incubator. In particular, we take a deep dive into how it works in PySpark.
In PySpark, a SparkSession with Hive support enabled gives direct access to Hivemall's capabilities in each of the preprocessing, training, prediction, and evaluation phases. That is, we can simultaneously leverage the scalability of Hive/Spark and the flexibility of the Python ecosystem. We will ultimately see how the combination can be a deeply satisfying way to implement a practical end-to-end ML solution.
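To make the idea concrete, here is a minimal sketch of how a Hive-enabled SparkSession might register and call a Hivemall training function. It is not taken from the session itself: the helper names `build_training_query` and `train_with_hivemall`, the table name `train`, its `features` and `label` columns, the `-loss logloss` option, and the JAR path are all assumptions for illustration, and the Hivemall JAR has to be supplied separately.

```python
def build_training_query(table: str = "train") -> str:
    """Build a Hivemall-style query that trains a linear classifier.

    Assumes `table` has a `features` column (array of "name:value" strings)
    and a `label` column; both column names are hypothetical.
    """
    return (
        "SELECT feature, avg(weight) AS weight\n"
        "FROM (\n"
        "  SELECT train_classifier(features, label, '-loss logloss')"
        " AS (feature, weight)\n"
        f"  FROM {table}\n"
        ") t\n"
        "GROUP BY feature"
    )


def train_with_hivemall(jar_path: str, table: str = "train"):
    """Run the training query on a Hive-enabled SparkSession.

    `jar_path` is a placeholder for the location of a Hivemall JAR.
    """
    # Imported lazily so the query builder works without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hivemall-sketch")
        .config("spark.jars", jar_path)  # ship the Hivemall JAR with the job
        .enableHiveSupport()             # makes Hive UDFs/UDTFs usable in SQL
        .getOrCreate()
    )

    # Register the Hivemall UDTF under the name used in the query.
    spark.sql(
        "CREATE TEMPORARY FUNCTION train_classifier "
        "AS 'hivemall.classifier.GeneralClassifierUDTF'"
    )

    # The result is a regular Spark DataFrame of (feature, weight) rows.
    return spark.sql(build_training_query(table))
```

Because the result is an ordinary Spark DataFrame, it could then be passed to the rest of the Python ecosystem, for example via `toPandas()`, for inspection or evaluation.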
Author: Takuya Kitazawa
Takuya Kitazawa is a freelance software developer, minimalistic traveler, ultralight hiker & runner, and craft beer enthusiast. With a decade of experience at start-up companies and Big Tech, ranging from full-stack and machine-learning engineering to data science and product management, he currently works at the intersection of the technological and social aspects of data-driven applications. See CV for more information.