A Spark but not quite a fire
The debate about whether Hadoop MapReduce is on its way out and Apache Spark is the new leading platform for streaming Big Data continues to be a hot topic. Furthermore, those wishing to perform real-time analytics on their data, as opposed to the ETL process of yesterday, continue to question the validity of any solution that promises to do it all and cook you dinner. Let’s not wait around for answers but instead roll up our sleeves and dive in to see what the differences are, where to use one or the other, and how your organization can benefit from both.
Apache Spark, simply put, is an open-source data-processing framework that has taken the data analytics and real-time processing world by storm. Spark isn’t tied to the underpinnings of MapReduce and does its processing in memory wherever it can, meaning it scales out quickly on a horizontal backbone such as AWS, where EC2 instances can be deployed on demand. Spark can run in tandem with an in-memory database such as MemSQL or on top of YARN (Yet Another Resource Negotiator). One of its great features is the ability to handle streaming. Streaming real-time data from a source such as MemSQL is where Spark has really taken off, with financial transactions and life sciences as two primary use cases. Yahoo, Amazon and Group are just a few of the long list of companies using Spark in production.
When transitioning to a new type of database or database schema, the first question that comes to my mind is “how easy is it to use?” Today, if a vendor or application doesn’t have an API to write to that supports the normal PUT, POST and GET functions, it’s basically dead in the water. Databases are underpinned by SQL, and Spark supports SQL queries too; like MemSQL, it also has simple and easily accessible APIs for Scala and Java, as well as Python for those that need it. Notice I said Java when it pertains to Spark; the reason is that it ties back to MapReduce, which is written in Java. For anyone who has tried to program in MapReduce, you know the difficulties it presents. You could use something like Pig, but it’s still far from perfect. Spark features an interactive mode where you can get instant feedback from a call, while MapReduce does not provide this functionality. If you are currently looking at or have deployed Cloudera or Hortonworks, both Spark and MapReduce are included.
Apache Spark allows for much faster performance than a traditional MapReduce cluster. Why? As stated earlier, Spark does its processing in memory. Compare that to MapReduce, which at some point writes and persists data back to disk. This inherent need to write to disk makes MapReduce slower. When I first read about Spark, I thought the amount of memory needed to run it would have to be extraordinary, and to some extent it is. I used a simple CloudFormation stack on AWS to test Spark data ingest and quickly realized that to run at the speeds I needed, a lot more horsepower would be required. This in turn meant a heavy capital investment on my side, which I declined to pursue. Anyway, Spark keeps the data in memory until a call needs it. However, if the data is so large that it outgrows the memory reserved for Spark, performance can suffer seriously. MapReduce doesn’t carry that burden: it kills a process as soon as it completes, much like a container stops once the “kill” command is issued. When comparing Spark and MapReduce on performance, it’s easiest to think about it this way: any data that can live in memory should be a good fit for Spark; otherwise, it is better suited to MapReduce.
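The rule of thumb above can be written down as a rough sketch. Everything here is an illustrative assumption rather than part of Spark or Hadoop: the function names are made up, and the `usable_fraction` default of 0.6 is only a placeholder for the share of cluster memory you expect to be available for data.

```python
def fits_in_memory(dataset_gb: float, cluster_memory_gb: float,
                   usable_fraction: float = 0.6) -> bool:
    """Return True if the dataset fits in the memory we assume is usable.

    usable_fraction is a hypothetical knob: the share of total cluster
    memory assumed to be available for holding data.
    """
    return dataset_gb <= cluster_memory_gb * usable_fraction


def suggest_engine(dataset_gb: float, cluster_memory_gb: float,
                   usable_fraction: float = 0.6) -> str:
    """Apply the rule of thumb: data that fits in memory goes to Spark,
    anything larger goes to MapReduce."""
    if fits_in_memory(dataset_gb, cluster_memory_gb, usable_fraction):
        return "Spark"
    return "MapReduce"


if __name__ == "__main__":
    # 100 GB of data on a cluster with 256 GB of memory
    print(suggest_engine(100, 256))
    # 500 GB of data on the same cluster
    print(suggest_engine(500, 256))
```

Treat the result as a starting point for a conversation, not a verdict; the right threshold depends on your workload and is something to measure rather than assume.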
We just finished talking about performance, but like anything else, cost and the ability to scale out are always at the forefront of any great architect’s mind. You will be happy to know that Spark can run on commodity servers in a cloud such as AWS, Google or Azure. In fact, CloudFormation templates can be created to deploy a cluster containing “X” amount of memory, CPU, etc. If you want a refresher on CloudFormation, refer to my MemSQL demo and deployment. Because Spark works in memory, the memory allocated to your Spark cluster should be as large as, if not larger than, the amount of data the cluster needs to process. Again, the trade-off becomes: do you want more memory at a higher cost, or more disk space at a lower cost, with slower processing and more hardware, since disks inherently take up more physical space than memory does? Spark also integrates with BI tools through ODBC and JDBC, meaning the integrations it has today are the same as MapReduce’s.
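To make the sizing guideline concrete, here is a minimal back-of-the-envelope sketch. The function name, the `headroom` multiplier and the sample figures are all assumptions for illustration; it only does the arithmetic of "cluster memory at least as large as the data" and says nothing about real instance types or pricing.

```python
import math


def nodes_needed(dataset_gb: float, memory_per_node_gb: float,
                 headroom: float = 1.0) -> int:
    """Estimate how many nodes give the cluster at least as much memory
    as the data to be processed.

    headroom is a hypothetical multiplier: 1.0 means memory equal to the
    data size, 1.25 means 25% extra.
    """
    if memory_per_node_gb <= 0:
        raise ValueError("memory_per_node_gb must be positive")
    required_gb = dataset_gb * headroom
    # Round up: a fraction of a node still means one more node.
    return math.ceil(required_gb / memory_per_node_gb)


if __name__ == "__main__":
    # 1,000 GB of data on nodes with 64 GB of memory each
    print(nodes_needed(1000, 64))                 # memory equal to data
    print(nodes_needed(1000, 64, headroom=1.25))  # 25% extra memory
```

A node count like this could then be fed into whatever template you use to stand up the cluster, and multiplied by your own per-node price to compare the memory-heavy and disk-heavy options.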
Which to use?
Apache Spark is able to process data much faster because it holds data in memory, whereas MapReduce persists data down to disk. If you are looking to perform real-time data processing, Apache Spark is your answer. If you are looking to do batch processing of large quantities of data, think petabytes, then MapReduce is your answer. In reality, I have yet to come across a customer using Spark independently of MapReduce, and even customers using Hadoop MapReduce are scarce, as it is still a relatively new offering (Hadoop started showing up in the mainstream in 2005 and Spark made its debut in 2010).
How are you using Spark and/or MapReduce in your environment today?
Have you experienced any pain points with performance or scale?