ApacheCon North America 2019
Apache Hivemall is a scalable machine learning library for Apache Hive, Spark, and Pig. Hivemall allows us to apply a wealth of machine learning techniques to massive data stored in distributed storage by just writing a series of SQL-like queries. It provides classification, regression, recommendation, anomaly detection, and topic modeling functionalities in a scalable manner, along with a variety of auxiliary functions for data preprocessing and feature engineering.
This talk demonstrates the Hivemall library, with special emphasis on the new features merged after the first Apache Incubator release. Hivemall v0.5.2-incubating, the latest version as of April 2019, introduced a state-of-the-art generalized factorization model named Field-Aware Factorization Machines, as well as many useful UDFs (e.g., data sketching) originating from the Brickhouse Hive UDF package.
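To give a feel for what a Field-Aware Factorization Machine computes, here is a minimal plain-Python sketch of the scoring function. It is an illustration of the general model only, not Hivemall's implementation or API: the function name `ffm_score`, the dictionary-based parameter layout, and the inclusion of a global bias and linear terms (which some FFM formulations omit) are all assumptions made for this example.

```python
from itertools import combinations


def ffm_score(x, field_of, w0, w, V):
    """Score one example with a field-aware factorization machine.

    x        : dict mapping feature id -> feature value (non-zero entries only)
    field_of : dict mapping feature id -> field id
    w0       : global bias (assumed here; optional in some formulations)
    w        : dict mapping feature id -> linear weight (also optional)
    V        : dict mapping (feature id, field id) -> latent vector (list of floats)
    """
    # Bias plus linear part
    score = w0 + sum(w[i] * xi for i, xi in x.items())

    # Pairwise part: feature i uses the latent vector it keeps for the
    # field of feature j, and feature j uses the one it keeps for the field of i.
    for i, j in combinations(sorted(x), 2):
        vi = V[(i, field_of[j])]
        vj = V[(j, field_of[i])]
        dot = sum(a * b for a, b in zip(vi, vj))
        score += dot * x[i] * x[j]
    return score


# Hypothetical toy parameters: two features, each in its own field.
x = {0: 1.0, 1: 2.0}
field_of = {0: 0, 1: 1}
w0 = 0.5
w = {0: 0.1, 1: 0.2}
V = {
    (0, 0): [9.0, 9.0], (0, 1): [1.0, 2.0],
    (1, 0): [3.0, 4.0], (1, 1): [9.0, 9.0],
}
score = ffm_score(x, field_of, w0, w, V)
```

The difference from a plain factorization machine is that each feature keeps one latent vector per field rather than a single shared vector, so the vectors stored under `(0, 0)` and `(1, 1)` are never used in this two-feature example.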
We also show the roadmap of this incubating project. Open issues and pull requests include Apache Spark 2.4 support, implementation of new algorithms such as word2vec and multinomial logistic regression, and integration with widely used tools like XGBoost and LightGBM.
Author: Takuya Kitazawa
Takuya Kitazawa is a freelance software developer, minimalistic traveler, ultralight hiker & runner, and craft beer enthusiast. While my area of specialty is data & AI ethics and machine learning productization, I have worked full-stack throughout my career, e.g., as a frontend/backend engineer, OSS developer, technical evangelist, solution architect, data scientist, and product manager. You can find what I am doing lately at my "now" page, and your inquiries are always welcome at [email protected], including comments on my blog posts.
- Opinions are my own and do not represent the views of the organizations I belong to or have belonged to.
- I am doing my best to ensure the accuracy and fair use of the information. However, there might be some errors or biased, subjective statements because the main purpose of this blog is to jot down my personal thoughts as soon as possible, before conducting an extensive investigation. Visitors should understand these limitations and rely on any information at their own risk.
- That said, if there is any issue with the content, please contact me so I can take the necessary action.