Latest news about Bitcoin and all cryptocurrencies. Your daily crypto news habit.
Every day, we talk to companies who are in the early phases of building our their data infrastructure. A lot of times these conversations circle around which technology to pick for which job. For example, we often get the question âwhatâs betterâââSpark or Amazon Redshift?â, or âwhich one should we be using?â. Spark and Redshift are two very different technologies. Itâs not an either / or, itâs more of a âwhen do I use what?â. In this post, Iâll lay out some of the differences and when to use which technology.
At the time of this post, if you look under the hood of the most advanced tech start-ups in Silicon Valley, you will likely find both Spark and Redshift. Spark is getting a little bit more attention these days because itâs a new shiny toy. But they cover different use cases (âdish washer vs. fridgeâ, per Ricardo Vladimiro).
Letâs give you a decision-making framework that can guide you through your thinking:
- What it is
- Data architecture
- Spark and Redshift: Data engineering
What it is
Spark
Apache Spark is a data processing engine. With Spark you can:
- process batch and streaming workloads in real-time
- write applications in Java, Scala, Python and R
- use pre-built libraries for building those apps
There is a general execution engine (Spark Core) and all other functionality is built on top of.
Source: Databricks
People are excited about Spark for three reasons:
Spark is fast because it distributes data across a cluster, and processes that data in parallel. It tries to process data in memory, vs. shuffling things in and out of disk (like e.g. MapReduce does).
Spark is easy because it has a high level of abstraction, allowing you to write applications with less lines of code. Plus, Scala and R are attractive for data manipulation.
Spark is extensible via the pre-built libraries, e.g. for machine learning, streaming apps or data ingestion. These libraries are either part of Spark or 3rd party projects.
In short, the promise of Spark is to speed up development, make applications more portable and extensible, and make the actual application run faster.
A few more noteworthy points on Spark:
- Spark is open source, so you can download it and start running it yourself, e.g. on Amazonâs EC2. Companies like Databricks (founded by the people who created Spark) offer support to make your life easier.
- I think Kiyoto Tamura mentions a key distinction that isnât always clear. Spark âis NOT a databaseâ. You will need some sort of persistent data storage that Spark can pull data from (i.e. a data source, such as Amazon S3 orâââhint, hintâââRedshift).
- Spark then âreads data into memoryâ to process it. Once thatâs done, Spark will require a place to store / pass on the results (because Spark is not a database) . Could be back into S3. And from S3 into e.g. Redshift (see where this answer is going?).
Source: Spark StreamingâââSpark 2.1.1 Documentation.
You need to know how to write code to use Spark (the âwrite applicationsâ part). So the people who use Spark are typically developers.
Redshift
Amazon Redshift is an analytical database. With Redshift you can:
- build a central data warehouse unifying data from many sources
- run big, complex analytic queries against that data with SQL
- report and pass on the results to dashboards or other apps
Redshift is a managed service provided by Amazon. Raw data flows into Redshift (called âETLâ), where itâs processed and transformed at a regular cadence (âtransformationâ or âaggregationsâ), or on an ad-hoc basis (âad-hoc queriesâ). Another term for loading and transforming data is also âdata pipelinesâ.
Source: Amazon Web Services
People are excited about Redshift for three reasons:
Redshift is fast because its massively parallel processing (MPP) architecture distributes and parallelizes queries. Redshift allows a high query concurrency and processes queries in memory.
Redshift is easy because it can ingest structured, semi-structured and unstructured datasets (via S3 or DynamoDB) up to a petabyte or more, to then slice ân dice that data any way you can imagine with SQL.
Redshift is cheap because you can store data for a $935/TB annual fee (if you use the pricing for a 3-year reserved instance). That price-point is unheard of in the world of data warehousing.
In short, the promise of Redshift is to make data warehousing cheaper, faster and easier. You can analyze much bigger and complex datasets than ever before, and thereâs a rich ecosystem of tools that work with Redshift.
A few more noteworthy points about Redshift:
- Redshift is a âfully managed serviceâ. Iâd say the âmanaged serviceâ part of that is true, itâs fully managed from the hardware layer down. You run a credit card and youâre off to the races. The âfullyâ can be a bit misleadingâââlots of knobs to turn to get good performance.
- Your cluster comes empty. But there are plenty of data integration / ETL tools that allow you to quickly populate your cluster to start analyzing and reporting data for business intelligence (âBI Analyticsâ) purposes.
- Redshift is a database, so you can store a history of your raw data AND the results of your transformations. In April 2017, Amazon also introduced âRedshift Spectrumâ, which enables you to run queries against data in S3 (which is a much cheaper way of storing your data).
Source: Automating Analytic Workflows on AWS
You need to know how to write SQL queries to use Redshift (the ârun big, complex queriesâ part). So the people who use Redshift are typically analysts or data scientists.
In summary, one way to think about Spark and Redshift is to distinguish them by what they are, what you do with them, how you interact with them, and who the typical user is.
Source: image created for this blog post by intermix
Iâve hinted at how you see both Spark and Redshift deployed. That gets us to data architecture.
Data Architecture
In very simple terms, you can build an application with Spark, and then use Redshift both as a source and a destination for data.
Why would you do that? A key reason is the difference between Spark and Redshift in the way they process data, and how much time it takes to product a result.
- With Spark, you can do real-time stream processing, i.e. you get a real-time response to events in your data streams.
- With Redshift, you can do near-real time batch operations, i.e. you ingest small batches of events from data streams, to then run your analysis to get a response to events.
A highly simplified example: Fraud detection. You could build an app with Spark that detects fraud in real-time from e.g. a stream of bitcoin transactions. Given itâs near-real time character, Redshift would not be a great fit in this case.
But letâs say if you wanted to have more signals for your fraud detection, for better predictability. You could load data from Spark into Redshift. There, you join it with historic data on fraud patterns. But you canât do that in real-time, the result would come too late for you to block the transaction. So you use Spark to e.g. block a transaction in real-time, and then wait for the result from Redshift to decide if you keep blocking it, send it to a human for verification, or approve it.
In December 2017, the Amazon Big Data Blog had another example of using both Spark and Redshift: âPowering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learningâ. The post covers how to build a predictive app that tells you how likely a flight will be delayed. The prediction happens based on the time of day or the airline carrier, by using multiple data sources and processing them across Spark and Redshift.
You can see how the separation of âappsâ and âdata warehousingâ we created at the start of this post is in reality an area thatâs shifting or even merging. That takes us to the final part of this post: Data engineering.
Spark and Redshift: Data Engineering
The border between developers and business intelligence analysts / data scientists are fading. That has given rise to a new occupation: Data engineering. Iâll use a definition for data engineering by Maxime Beauchemin:
âIn relation to previously existing roles, the data engineering field [is] a superset of business intelligence and data warehousing that brings more elements from software engineering, [and it] integrates the operation of âbig dataâ distributed systemsâ.
Spark is such a âbig dataâ distributed system. Redshift is the data warehousing part. Data engineering is the discipline that brings both together. Thatâs because you see âcodeâ moving its way into data warehousing. Code allows you to author, schedule and monitor data pipelines that feed into Redshift, incl. the transformations on the data once it sits inside your cluster. And youâll very likely have to ingest data from Spark. And so the trend to âcodeâ in warehousing implies that knowing SQL is not sufficient any more. You need to know how to write code. Hence the âdata engineerâ.
This post is already way too long, but I hope it provides a useful summary on how to think about your data stack. For your big data architecture, you will likely end up using both Spark and Redshift, each one to fulfill a specific use case thatâs is best suited for.
Spark and Redshift: Which is better for big data? was originally published in Hacker Noon on Medium, where people are continuing the conversation by highlighting and responding to this story.
Disclaimer
The views and opinions expressed in this article are solely those of the authors and do not reflect the views of Bitcoin Insider. Every investment and trading move involves risk - this is especially true for cryptocurrencies given their volatility. We strongly advise our readers to conduct their own research when making a decision.