As a committer and PPMC member of Apache Hivemall (incubating), I attended and presented at ApacheCon North America and ApacheCon Europe.
Photo: Jan Michalko / plain schwarz
These were my very first ApacheCon experiences, and it was great to be part of the community in this commemorative year: yes, it's the 20th anniversary of the Apache Software Foundation.
The key message I took away from the conferences is the power of community. Of course, all the sessions were technically stimulating and exciting to learn from, but, more than that, I was impressed by how such a big conference is made possible by bringing together diverse people and projects from the OSS community.
ApacheCon North America @ Las Vegas
At this annual conference, we presented recent updates on the Hivemall project.
The audience seemed to have a highly practical point of view; through their questions, they were trying to get a deeper understanding of how Hivemall runs on MapReduce, Tez, and Spark.
When I mentioned the sketching algorithms Hivemall has imported from Brickhouse, someone suggested taking advantage of Apache DataSketches, an idea that had naturally come to mind from another ApacheCon session. I'm sure the sustainable Apache ecosystem makes it easier to consider this kind of inter-project collaboration.
As an attendee, the notable things I saw were:
- There were fewer machine learning sessions than I expected.
- Widely used Apache projects, CloudStack, Pulsar, Kafka, and Beam in particular, were drawing exceptional attention.
- The IoT ecosystem is growing in the community.
Unlike other "Big Data" conferences, I didn't see many ML-centric talks, and I believe the attendees' interests were more in how large companies are using Apache projects in their production-grade systems at scale.
For example, in a special series of talks named Beam Summit, Lyft introduced their use of Apache Beam for dynamic ride pricing. They did mention ML in the talk, but the point was how to make ML more operational rather than the details of their pricing logic. Similarly, Beam is an effective tool for implementing ML pipelines, as we can see in TensorFlow Extended.
Meanwhile, I can confidently say IoT is becoming an important use case for Apache projects. At ApacheCon NA, a dedicated session covered integration between MQTT and Kafka, IoT application development using Spark and Bahir, and an example of an industrial solution named StreamPipes.
More specifically, since MQTT is a protocol designed for IoT use cases that assume constrained devices and unreliable TCP/IP connections, Kafka nicely compensates for these limitations by providing a reliable, scalable, and decoupled path for IoT data streams. In terms of system architecture, there are multiple IoT design patterns inspired by the Lambda and Kappa architectures.
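To make the "decoupled path" idea a bit more concrete, here is a toy, in-memory sketch of my own, not something shown in the session. `ToyMqttBroker`, `ToyLog`, and `bridge` are hypothetical names, and no real MQTT or Kafka client is involved: the broker only models fire-and-forget delivery (roughly QoS 0 without retained messages), and the log only models an append-only sequence that consumers read by offset.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

Record = Tuple[int, str, bytes]  # (offset, key, value)
Callback = Callable[[str, bytes], None]


class ToyMqttBroker:
    """Hypothetical stand-in for an MQTT broker (QoS 0, no retained messages):
    a publish reaches only the subscribers attached at that moment."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, payload: bytes) -> None:
        # Nothing is stored: with no subscriber attached, the message is lost.
        for callback in self._subscribers[topic]:
            callback(topic, payload)


class ToyLog:
    """Hypothetical stand-in for a single Kafka topic partition:
    an append-only list that consumers read by offset, at their own pace."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, key: str, value: bytes) -> int:
        offset = len(self._records)
        self._records.append((offset, key, value))
        return offset

    def read(self, offset: int, max_records: int = 100) -> List[Record]:
        return self._records[offset:offset + max_records]


def bridge(broker: ToyMqttBroker, log: ToyLog, topic: str) -> None:
    """Forward every message published on `topic` into the durable log."""

    def forward(key: str, payload: bytes) -> None:
        log.append(key, payload)

    broker.subscribe(topic, forward)


broker = ToyMqttBroker()
log = ToyLog()

broker.publish("sensors/temp", b"19.5")  # lost: nothing is subscribed yet
bridge(broker, log, "sensors/temp")
broker.publish("sensors/temp", b"20.0")
broker.publish("sensors/temp", b"20.5")

# A consumer that starts late can still replay from offset 0,
# while another consumer resumes from its own offset.
print(log.read(0))
print(log.read(1))
```

The point of the sketch is only the shape of the pattern: producers talk to the lightweight broker, the bridge is the single always-on subscriber, and downstream consumers depend on the log rather than on the devices being reachable.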
Finally, the talk right before my session introduced an interesting project called Apache Unomi ("You-know-me"), an open-source customer data platform (CDP). As my company is building an enterprise CDP, I can sympathize with the developers over how difficult that is.
Attending an interesting talk about open source customer data platform...strongest motivation is in making sure privacy & transparency. ML capability is coming from PredictionIO. / “Apache Unomi™ Open Source Customer Data Platform | Main Page” https://t.co/TU7y7cKmQi— Takuya Kitazawa (@takuti) September 12, 2019
Ah, many surprises! I didn't expect to see a CDP talk at ApacheCon, nor such insightful IoT sessions.
ApacheCon Europe @ Berlin
While the European version of ApacheCon is not held regularly, it took place this year to celebrate the anniversary.
Happy birthday to @TheASF !! #apachecon #apache20 #aceu19 pic.twitter.com/NEFNMTOxS3— ApacheCon (@ApacheCon) October 23, 2019
In contrast to the generic introduction of Hivemall in my ACNA talk, this time I focused more on integrating Hivemall with PySpark.
The content is based on my previous article, Apache Hivemall in PySpark, and I heavily revised the sample Google Colab notebook to get it ready for the event.
As Spark is widely used for large-scale machine learning, I can easily imagine the audience was interested in Hivemall's internal implementation and scalability in comparison with Spark MLlib. I hope my talk and the Q&A helped convey these points, but I'm sure we need to explain more clearly on the documentation site to convince users of "Why Hivemall?".
After the session, I had a great conversation with Alexey Zinovyev, who presented on Ignite ML, in the context of distributed ML. We shared how challenging and interesting it is to build parallelizable implementations of well-known ML techniques, and we agreed to take a look at each other's projects.
As in North America, IoT-related talks had a strong presence across the conference. The very first talk, "How to Become an IoT Developer", motivated us to play with physical devices such as the Raspberry Pi, and the following talks introduced specific IoT projects from the community, including PLC4X, IoTDB, Mynewt, and NiFi, with cool demonstrations.
My first ApacheCon experiences were simply great, not only as a speaker but also as a developer working on real-world CDP and IoT applications at Arm.
If I were to give feedback, it would be nice to see more interaction with non-Apache projects and communities; the Apache community is solid, but it sometimes shows a closed, inflexible atmosphere that may make it harder to bring newcomers and new technology trends into the community.
Anyway, these community-driven conferences were highly productive and insightful for me, which is somewhat unusual compared to many other "enterprise-driven" ones.
Last updated: 2022-06-04
Author: Takuya Kitazawa
Takuya Kitazawa is a freelance software developer, minimalistic traveler, ultralight hiker & runner, and craft beer enthusiast. With a decade of experience as a full-stack software developer, machine learning engineer, data scientist, and product manager, I am currently working at the intersection of the technological and societal aspects of data-driven applications. You can find what I am doing lately on my "now" page, and your inquiries are always welcome at [email protected], including comments on my blog posts.