Apache Hivemall: Query-Based Handy, Scalable Machine Learning on Hive




ODSC Europe 2018


Abstract

This talk introduces Apache Hivemall, a scalable machine learning library for Apache Hive, Spark and Pig, in the context of real-world large-scale data science.

Most importantly, Hivemall significantly simplifies machine learning workflows, including feature engineering, algorithm implementation and evaluation, because Hive lets us access distributed storage through handy SQL-like queries (HiveQL). Today, data scientists and machine learning engineers commonly struggle with numerous tiny code fragments and with pipelines that scale poorly because they are difficult to implement. By contrast, once Hivemall is installed, we can run a wealth of machine learning algorithms in a scalable manner by writing just dozens of lines of queries.
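To make the "query-based" idea concrete, here is a minimal sketch of prediction with a linear model written as a single SQL query. It is an illustration only, not Hivemall's API: SQLite (via Python's standard `sqlite3` module) stands in for Hive, and the table names `model` and `samples`, the feature names, the weights, and the `sigmoid` function registered below are all assumptions made up for this example.

```python
import math
import sqlite3


def predict_with_sql():
    """Score samples with a logistic model using only a join and an aggregate."""
    conn = sqlite3.connect(":memory:")
    # SQLite has no built-in sigmoid, so register one as a user-defined function.
    conn.create_function("sigmoid", 1, lambda x: 1.0 / (1.0 + math.exp(-x)))

    conn.executescript(
        """
        CREATE TABLE model (feature TEXT PRIMARY KEY, weight REAL);
        CREATE TABLE samples (row_id INTEGER, feature TEXT, value REAL);
        """
    )
    # Hypothetical, hand-picked weights: one row per feature.
    conn.executemany(
        "INSERT INTO model VALUES (?, ?)",
        [("bias", -1.0), ("clicks", 0.5), ("age", 0.02)],
    )
    # Samples stored in "long" form: one row per (sample, feature) pair.
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?, ?)",
        [
            (1, "bias", 1.0), (1, "clicks", 4.0), (1, "age", 50.0),
            (2, "bias", 1.0), (2, "clicks", 0.0), (2, "age", 20.0),
        ],
    )

    # Prediction = sigmoid(sum of weight * value), grouped per sample.
    query = """
        SELECT s.row_id, sigmoid(SUM(m.weight * s.value)) AS prob
        FROM samples AS s
        JOIN model AS m ON s.feature = m.feature
        GROUP BY s.row_id
        ORDER BY s.row_id
    """
    predictions = dict(conn.execute(query).fetchall())
    conn.close()
    return predictions


predictions = predict_with_sql()
for row_id, prob in predictions.items():
    print(f"row {row_id}: probability = {prob:.4f}")
```

Because the model is just another table, scoring becomes a join plus a `GROUP BY`, which is the kind of operation a SQL engine over distributed storage can parallelize without extra application code.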

Over the course of this session, the speaker talks about:

Which parts of modern, real-world machine learning and data science are painful
When Hivemall is notably preferable to other implementations of machine learning algorithms, and why
Who can benefit from the scalability and simplicity of Hivemall
What kinds of machine learning techniques are implemented in Hivemall, including classification, regression, anomaly detection, natural language processing and recommendation
How to install and use Hivemall, and how Hivemall implements a wide variety of machine learning algorithms in a scalable manner

Note that, since Hivemall officially provides a Dockerfile, attendees can immediately try out its functionality on their own laptops.

Additionally, this talk offers tips for using Hivemall more effectively by showing an example with a workflow engine. For example, Digdag, a distributed workflow engine, provides a simple way to run, organize and schedule complex, highly interdependent tasks either sequentially or in parallel; that is, the workflow engine makes real-world machine learning pipelines manageable. Since the workflow definition itself is written in the easy-to-use YAML format, engineers can handle pipelines much as they handle their own source code in terms of deployment, version control and modularity.
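The idea of running interdependent tasks "either sequentially or in parallel" can be sketched in a few lines. This is a toy illustration only, not Digdag's implementation or its workflow syntax: the `run_workflow` helper and the task names are hypothetical, and each task is a plain Python callable standing in for a query or a job.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workflow(tasks, deps, max_workers=4):
    """Run tasks in dependency order.

    Tasks whose dependencies have all finished form one stage and run in
    parallel; stages themselves run one after another.
    Returns (results by task name, list of stages in execution order).
    """
    results = {}
    stages = []
    remaining = set(tasks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining:
            ready = sorted(
                name for name in remaining
                if all(dep in results for dep in deps.get(name, ()))
            )
            if not ready:
                raise ValueError("cyclic or unsatisfiable dependencies")
            # Submit every ready task at once, then wait for the whole stage.
            futures = {name: pool.submit(tasks[name]) for name in ready}
            for name, future in futures.items():
                results[name] = future.result()
            stages.append(ready)
            remaining -= set(ready)
    return results, stages


# Hypothetical pipeline: feature preparation, two independent trainings, evaluation.
tasks = {
    "prepare_features": lambda: "features ready",
    "train_classifier": lambda: "classifier trained",
    "train_regressor": lambda: "regressor trained",
    "evaluate": lambda: "evaluation done",
}
deps = {
    "train_classifier": ["prepare_features"],
    "train_regressor": ["prepare_features"],
    "evaluate": ["train_classifier", "train_regressor"],
}

results, stages = run_workflow(tasks, deps)
for i, stage in enumerate(stages, start=1):
    print(f"stage {i}: {', '.join(stage)}")
```

Here the two training tasks share a stage because neither depends on the other, while evaluation waits for both. A real workflow engine adds scheduling, retries and distribution on top of this basic ordering.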

Ultimately, the speaker hopes the audience will come away with ideas for making machine learning handier and more manageable, and will be able to discuss what real-world machine learning should look like over the next couple of years.

Slides

Video

Author: Takuya Kitazawa (たくち)

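A software engineer from Nagano Prefecture, Japan, now based in Vancouver, Canada. Across both B2B and B2C domains, I have worked on productizing web technologies, data science, and machine learning, on customer onboarding support and consulting, and on outreach and education in related fields. I currently work as a freelancer, mainly for individuals and companies in North America (Canada), Asia (Japan), and Africa (Malawi). See my resume for a detailed background. I run through all sorts of towns, sometimes play in nature, and live the "everyday life" of each time and place. For comments, feedback, or work inquiries, contact [email protected].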