With our recent blog posts on new products like Amazon Redshift, you might be wondering what this technology is. I certainly was late last year, so I went on a quest to learn more about this wonderful new data warehousing product from Amazon that our team were talking about endlessly. Here is a summary of what I found out:
First of all I did the usual Google search for Redshift and had my brain stretched learning that “In physics, redshift happens when light or other electromagnetic radiation from an object is increased in wavelength, or shifted to the red end of the spectrum”. So I refined my search to Amazon Redshift. First up, the Wikipedia description:
“Amazon Redshift is a hosted data warehouse product which is part of the larger cloud computing platform Amazon Web Services (AWS). It is built on top of technology from the Massive parallel processing (MPP) data warehouse ParAccel by Actian. Redshift differs from Amazon’s other hosted database offering Amazon RDS by being able to handle analytics workloads on large scale datasets stored by a Column-oriented DBMS principle. To be able to handle large scale datasets Amazon is making use of massive parallel processing.”
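As a rough illustration of the "column-oriented" idea in that description, here is a toy Python sketch. The orders table, its column names and the `to_columns` helper are all made up for the example, and this is not how Redshift actually stores data. It only shows why an analytics query such as "total sales" is cheaper when each column is stored together: the query reads one list instead of every field of every row.

```python
# Toy row-oriented table: each record keeps all of its fields together.
# The data and column names are hypothetical.
rows = [
    {"order_id": 1, "region": "NZ", "amount": 120.0},
    {"order_id": 2, "region": "AU", "amount": 80.5},
    {"order_id": 3, "region": "NZ", "amount": 42.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented scan: every whole record is visited to total one field.
row_total = sum(record["amount"] for record in rows)

# Column-oriented scan: only the "amount" column is read.
column_total = sum(columns["amount"])

if __name__ == "__main__":
    print(row_total, column_total)
```

Both totals are the same; the difference is how much data has to be read to get there, which matters a great deal once the table has billions of rows.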
What Amazon’s website says:
“Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. You can start small for just $0.25 per hour with no commitments or upfront costs and scale to a petabyte or more for $1,000 per terabyte per year, less than a tenth of most other data warehousing solutions.”
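To put those quoted prices in perspective, here is a small Python sketch of the arithmetic. It assumes the $0.25 hourly rate applies to a single node left running all year (8,760 hours) and that a petabyte is counted as 1,000 terabytes; the function names are hypothetical, and actual AWS pricing changes often, so treat the numbers as illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def on_demand_annual_cost(hourly_rate, nodes=1):
    """Yearly cost of running `nodes` nodes non-stop at an hourly rate."""
    return hourly_rate * HOURS_PER_YEAR * nodes


def per_terabyte_annual_cost(terabytes, rate_per_tb_year=1000.0):
    """Yearly cost at a flat per-terabyte rate (the quoted $1,000 figure)."""
    return terabytes * rate_per_tb_year


if __name__ == "__main__":
    # One node at the quoted $0.25 per hour, running all year.
    print(on_demand_annual_cost(0.25))
    # A petabyte (assumed to be 1,000 TB) at the quoted per-terabyte rate.
    print(per_terabyte_annual_cost(1000))
```

Under these assumptions the "start small" option works out to $2,190 a year for one always-on node, and a full petabyte at the quoted rate to $1,000,000 a year.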
Amazon provide this cute YouTube overview clip, which is well worth a watch.
For the more technically inclined, this blog post from Brian Dailey titled “Amazon Redshift – What you need to know” provides a good overview of what’s under the covers, summarising one company’s experiences moving to Redshift and providing some insight into why they chose Redshift, its architecture, data extraction and optimisations. At the end of the post, their wrap-up concludes:
“The advantages of Redshift to any shop wishing to run adhoc queries against large sets of data is abundantly clear.
- It looks and smells like PostgreSQL 8.
- It’s much easier to use existing talent than learning to use a new tool like Hadoop.
- As others have pointed out, Hadoop is generally overutilized anyway.
- Even AirBNB analysts liked it so much they didn’t want to go back to PIG and HIVE.
- It’s less expensive than appliances like Vertica.
- It’s pricing scheme is much more clear than competing appliances.
Making the switch from PostgreSQL to Redshift has not only made our adhoc queries faster, it’s also saved us hosting costs.”
Architecture-wise, there are a few resources out there worth taking a look at:
- This SlideShare, “Amazon Redshift a whirl-wind tour”, is a quick read
- This AWS SlideShare (Welcome to the AWS cloud) is a good sales overview. Note that prices and compute power change often, so the figures in this presentation are likely out of date compared with the YouTube clip or the AWS website
- “Amazon Redshift a peek at it’s internal organs”, a good blog post from Kamil Bartocha, provides some commentary and diagrams on the architecture
To conclude, here is what I have learned:
- Amazon Redshift is specifically made for data warehouse processing on your AWS platform
- It’s a pretty new technology (announced in late 2012)
- It can scale and performs well on the constantly improving AWS platform
- It’s considered easier to learn (e.g. for RDBMS DBAs) than Hadoop, which has a steeper learning curve
- There are no upfront fees and you pay as you go
- People like its speed and price a lot
Hope this collection of links has given you an overview to get you started. Happy sharing, Vic.
Victoria MacLennan is a reformed techo from the data and information management world – who now focuses on creating jobs and opportunities.
She is passionate about Open Data, Data Privacy and Governance so will blog on those topics occasionally.