
2019-10-26

ApacheCon 2019 North America #ACNA19 & Europe #ACEU19

As a committer and PPMC member of Apache Hivemall (incubating), I attended and presented at both ApacheCon North America and ApacheCon Europe.

takuti.jpg

Photo: Jan Michalko / plain schwarz

These were my very first ApacheCon experiences, and it was great to be part of the community in this commemorative year - yes, it's the 20th anniversary of the Apache Software Foundation.

The key message I took away from the conferences is the power of community. Of course, all the sessions were technically stimulating and exciting to learn from, but, more than that, I was impressed by how such a big conference is made possible by bringing together diverse people and projects from the OSS community.

ApacheCon North America @ Las Vegas

At this annual conference, we presented recent updates on the Hivemall project.

I felt the audience had a highly practical point of view; through their questions, they were trying to get a deeper understanding of how Hivemall runs with MapReduce, Tez, and Spark.

When I mentioned the sketching algorithms Hivemall has imported from Brickhouse, somebody suggested taking advantage of Apache DataSketches, which naturally came to mind from another ApacheCon session. I'm sure the sustainable Apache ecosystem makes it easier to consider this kind of inter-project collaboration.

As an attendee, the notable things I saw were:

  • There were fewer machine learning sessions than I expected.
  • Widely used Apache projects, CloudStack, Pulsar, Kafka, and Beam in particular, drew exceptional attention.
  • The IoT ecosystem is growing in the community.

Unlike other "Big Data" conferences, there were not many ML-centric talks, and I believe attendees' interests were more in how large companies are using Apache projects in their production-grade systems at scale.

For example, in a special series of talks named Beam Summit, Lyft introduced their use of Apache Beam for dynamic ride pricing. They did mention "ML" in the talk, but the point was how to make ML more operational, rather than the details of their pricing logic. Similarly, Beam is an effective tool for implementing ML pipelines, as we can see in TensorFlow Extended.

Meanwhile, I can confidently say IoT is becoming an important use case of Apache projects. At ApacheCon NA, a dedicated session covered the integration between MQTT and Kafka, IoT application development using Spark and Bahir, and an example of an industrial solution named StreamPipes.

More specifically, since MQTT is a protocol designed for IoT use cases that assume constrained, unreliable TCP/IP connections, Kafka nicely compensates for those limitations by providing a reliable, scalable, decoupled path for IoT data streams. In terms of system architecture, there are multiple IoT design patterns inspired by the Lambda / Kappa architectures.
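To make the "decoupled path" idea concrete, here is a minimal toy sketch in plain Python. It is not the code shown in the session and it does not use any real MQTT or Kafka client; `DurableLog`, `Consumer`, and `on_mqtt_message` are hypothetical names that only imitate the concepts (an append-only partitioned log, and consumers that track their own offsets).

```python
from collections import defaultdict


class DurableLog:
    """Toy stand-in for a Kafka topic: one append-only list per partition."""

    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # A deterministic toy partitioner: the same key always maps to the
        # same partition, so records for one device keep their order.
        p = sum(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        return self.partitions[partition][offset:]


class Consumer:
    """Each consumer keeps its own offsets, so readers never block each other."""

    def __init__(self, log):
        self.log = log
        self.offsets = defaultdict(int)

    def poll(self):
        records = []
        for p in range(len(self.log.partitions)):
            batch = self.log.read(p, self.offsets[p])
            self.offsets[p] += len(batch)
            records.extend(batch)
        return records


def on_mqtt_message(log, topic, payload):
    """Hypothetical bridge callback: forward one MQTT message into the log."""
    return log.append(topic, payload)


log = DurableLog()
on_mqtt_message(log, "sensors/room1/temp", "21.5")
on_mqtt_message(log, "sensors/room2/temp", "19.0")
on_mqtt_message(log, "sensors/room1/temp", "21.7")

dashboard = Consumer(log)
archiver = Consumer(log)
first = dashboard.poll()  # the dashboard reads everything written so far
late = archiver.poll()    # the archiver starts later and still sees it all
```

The devices only ever talk to the bridge, while downstream applications read from the log at their own pace; that separation is what I mean by decoupling here.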

Finally, it should be noted that the talk right before my session introduced an interesting project called Apache Unomi ("You-know-me"), an open-source customer data platform (CDP). As my company is building an enterprise CDP, I can sympathize with its developers over how difficult that is.

Ah, many surprises! I didn't expect to see a CDP talk at ApacheCon, nor such insightful IoT sessions.

ApacheCon Europe @ Berlin

While the European version of ApacheCon is not held regularly, it took place this year to celebrate the anniversary.

In contrast to the generic introduction of Hivemall in my ACNA talk, I focused more on integrating Hivemall with PySpark.

The content is based on my previous article, Apache Hivemall in PySpark, and I heavily modified the sample Google Colab notebook to get it ready for the event.

As Spark is widely used for large-scale machine learning, I can easily imagine the audience was interested in Hivemall's internal implementation and scalability in comparison with Spark MLlib. I hope my talk and the Q&A helped get those points across, but I'm sure we need to clarify more on the documentation site to convince users of "Why Hivemall?".

After the session, I had a great conversation about distributed ML with Alexey Zinovyev, who presented on Ignite ML. We shared how challenging and interesting it is to implement well-known ML techniques in a parallelizable way, and we agreed to take a look at each other's projects.

As in North America, IoT-related talks had a strong presence across the conference. The very first talk, "How to Become an IoT Developer", motivated us to play with physical devices such as the Raspberry Pi, and the following talks introduced specific IoT projects from the community, including PLC4X, IoTDB, Mynewt, and NiFi, with cool demonstrations.

Bottom Line

My first ApacheCon experiences were simply great, not only as a speaker but also as a developer working on real-world CDP and IoT applications at Arm.

If I were to give feedback, it would be nicer to see more interaction with non-Apache projects and communities; the Apache community is solid, but it sometimes shows a closed, inflexible atmosphere that can make it harder to bring newcomers and new technology trends into the community.

Anyway, these community-driven conferences were super productive and insightful for me, and they are somewhat unusual compared to the many "enterprise-driven" ones.

Takuya Kitazawa (a.k.a. takuti) is an engineer working on machine learning, data science, and product development at Arm Treasure Data. Opinions are my own.