Spark or Storm?

That is the question

Posted by Moises Trelles on September 05, 2016

Here are some major points to consider when choosing the right tool:

Latency:

Is the performance of the streaming application paramount? Storm can deliver sub-second latency much more easily and with fewer restrictions than Spark Streaming.
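The reason is Spark Streaming's micro-batch model: every record waits for its batch to close before it is processed, so latency is at least one batch interval. A minimal sketch (assuming the Spark Streaming 1.x API; app name and interval are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("latency-demo")

// The batch interval sets a floor on end-to-end latency: here every
// record waits up to one second for its micro-batch to be formed,
// regardless of how fast the processing itself is.
val ssc = new StreamingContext(conf, Seconds(1))
```

Storm, by contrast, processes each tuple as it arrives, with no batching step in the way.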


Development Cost:

Is it desirable to share a similar code base between batch processing and stream processing? In our case I'd say yes! With Spark, batch and streaming code look very much alike.
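To illustrate, here is a hedged sketch of the same word-count logic in both modes (function and variable names are my own; `sc` and `ssc` stand in for a `SparkContext` and `StreamingContext`):

```scala
import org.apache.spark.rdd.RDD

// The transformation logic, written once against the batch API.
def wordCount(lines: RDD[String]): RDD[(String, Int)] =
  lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// Batch: apply it to a static file.
//   val counts = wordCount(sc.textFile("input.txt"))

// Streaming: the very same operators apply per micro-batch,
// because a DStream is just a sequence of RDDs under the hood.
//   val streamCounts = ssc.socketTextStream("localhost", 9999)
//     .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
```

A Storm topology, on the other hand, shares almost nothing with a typical batch job.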


Message Delivery Guarantees:

Is it important to process every single record, or is some nominal amount of data loss acceptable? In our case I'd say it's important to process every record. All else being equal, Spark's micro-batch model makes exactly-once message delivery straightforward. Storm can provide all three delivery semantics (at-most-once, at-least-once, exactly-once), but achieving exactly-once delivery takes considerably more effort, typically through the Trident API.
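For a sense of the extra bookkeeping Storm asks of you, here is a sketch of its at-least-once mechanics (Storm 1.x Java API used from Scala; bolt boilerplate such as `prepare()` is elided, and the class and field names are illustrative):

```scala
import org.apache.storm.task.OutputCollector
import org.apache.storm.topology.OutputFieldsDeclarer
import org.apache.storm.tuple.{Fields, Tuple, Values}

class UppercaseBolt /* extends BaseRichBolt */ {
  private var collector: OutputCollector = _ // assigned in prepare()

  def execute(input: Tuple): Unit = {
    // Anchoring: passing the input tuple when emitting ties this
    // output into the input's tuple tree, so a downstream failure
    // causes the spout to replay the original tuple.
    collector.emit(input, new Values(input.getString(0).toUpperCase))
    collector.ack(input) // acknowledge only after the work succeeded
  }

  def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}
```

Forget to anchor or ack correctly and the guarantee silently degrades; in Spark the micro-batch either completes or is retried as a unit.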


Process Fault Tolerance:

Is high availability a primary concern? Both systems handle fault tolerance of this kind well, and in relatively similar ways.

Production Storm clusters run Storm processes under supervision; if a process fails, the supervisor restarts it automatically. State is kept in ZooKeeper, so a restarting process rereads its state from ZooKeeper and attempts to rejoin the cluster.
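The ZooKeeper dependency shows up directly in the cluster configuration. An illustrative `storm.yaml` fragment (hostnames are placeholders):

```yaml
# ZooKeeper ensemble holding cluster state for Nimbus and supervisors.
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"

# Worker slots a supervisor may restart processes into.
supervisor.slots.ports:
  - 6700
  - 6701
```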

Spark handles restarting workers via its resource manager: YARN or Spark's standalone manager. The standalone manager handles master-node failure with standby masters coordinated through ZooKeeper.
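Enabling standby masters in standalone mode is a small configuration change. An illustrative `spark-env.sh` fragment (ZooKeeper hosts and the znode path are placeholders):

```shell
# Standby masters: on active-master failure, a standby elected via
# ZooKeeper takes over and recovers the registered apps and workers.
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1.example.com:2181,zk2.example.com:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```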


Conclusion

Both Apache Spark Streaming and Apache Storm are great solutions to the streaming ingestion and transformation problem. Either system can be a great choice for part of an analytics stack. Choosing the right one is largely a matter of answering the questions above.