Home  >   Work  >   Tech Conference

2019 - 2021

Apache Hivemall Meets PySpark: Scalable Machine Learning with Hive, Spark, and Python

ApacheCon Europe 2019

Apache Hivemall Meets PySpark: Scalable Machine Learning with Hive, Spark, and Python @ ApacheCon Europe 2019


Apache Hivemall is a collection of Hive user-defined functions for machine learning (ML). The tool enables us to solve a wide variety of ML-related problems through the scalable SQL-like interface to Hive. To give a motivating example, simple regression and classification model can be efficiently trained by just executing 10 lines of a query.

This session demonstrates such Hivemall functionality with a special focus on integration with Apache Spark; the Hivemall contributors have been actively working on Spark integration since the project has entered the Apache Incubator. In particular, we deep-dive into how it works in PySpark.

In PySpark, SparkSession with Hive support enabled gives direct access to the Hivemall capabilities at each of preprocessing, training, prediction, and evaluation phases. That is, we can simultaneously leverage the scalability of Hive/Spark and flexibility of Python ecosystem. We will eventually see how the combination can be a deeply satisfying way to implement a practical end-to-end ML solution.



  Author: Takuya Kitazawa

Takuya Kitazawa is a freelance software developer, minimalistic traveler, ultralight hiker & runner, and craft beer enthusiast. While my area of specialty is in data & AI ethics and machine learning productization, I have worked full-stack throughout the career e.g., as a frontend/backend engineer, OSS developer, technical evangelist, solution architect, data scientist, and product manager. You can find what I am doing lately at my "now" page, and your inquiry is always welcome at [email protected], including comments on my blog posts.

  Schedule a call with me


  • Opinions are my own and do not represent the views of organizations I am/was belonging to.
  • I am doing my best to ensure the accuracy and fair use of the information. However, there might be some errors or biased subjective statements because the main purpose of this blog is to jot down my personal thoughts as soon as possible before conducting an extensive investigation. Visitors understand the limitations and rely on any information at their own risk.
  • That said, if there is any issue with the content, please contact me so I can take the necessary action.