When you’re running a modern data cluster, something that is becoming increasingly commonplace and essential to businesses, you inevitably run into headaches.
Typically a wide variety of workloads run on a single cluster, which can make it a nightmare to manage and operate, much like managing traffic in a busy city. It’s a real pain for the operations folks who have to manage Spark, Hive, Impala and Kafka applications running on the same cluster. They have to worry about each app’s resource requirements, the time distribution of the cluster workloads, and the priority levels of each app or user, and then make sure everything runs like a predictable, well-oiled machine.
Anyone working in data ops will have a strong point of view here, since you’ll have no doubt spent countless hours, day in and day out, studying the behaviour of giant production clusters in search of insights into how to improve performance, predictability and stability. That holds whether it is a thousand-node Hadoop cluster running batch jobs or a five-hundred-node Spark cluster running AI, ML or some type of advanced, real-time analytics. More likely, it is 1,000 nodes of Hadoop connected via a 50-node Kafka cluster to a 500-node Spark cluster for processing.