Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark
Deep dive into Spark's streaming module with Structured Streaming. Learn about Spark's micro-batch strategy and aggregations.
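To make the micro-batch idea concrete, here is a minimal pure-Python sketch (not Spark code; all names are illustrative): Structured Streaming treats a stream as an unbounded table, processes it in small micro-batches, and updates a running aggregation after each batch rather than recomputing from scratch.

```python
# Conceptual sketch of micro-batch aggregation, in plain Python.
# Not the Spark API: process_micro_batches and `stream` are hypothetical names.
from collections import Counter

def process_micro_batches(batches):
    """Fold each micro-batch of word events into a running count and
    yield the full aggregation state after every batch, similar in
    spirit to Structured Streaming's 'complete' output mode."""
    running = Counter()
    for batch in batches:
        running.update(batch)   # incremental update, not a full recompute
        yield dict(running)     # emit the current result table

stream = [["spark", "kafka"], ["spark"], ["flink", "spark"]]
results = list(process_micro_batches(stream))
# After the final batch, "spark" has been counted three times.
```

The key point the sketch shows is that each micro-batch only touches new data; the engine carries the aggregation state forward between batches.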
Learn about the similarities and differences between Spark and Hadoop, and why Spark is faster than Hadoop. Explore the challenges Spark tries to address, which will give you a good idea of the need for Spark, its performance, and its efficiency. Covers RDDs, and walks step by step through how the program you write gets translated into actual execution behind the scenes on a Spark cluster.
Understand Spark basics: Spark Core and RDDs
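The defining RDD behavior covered here is lazy evaluation: transformations only record a lineage, and an action triggers the actual computation. A toy pure-Python sketch of that idea (the class and its method names merely mirror Spark's RDD API; this is not Spark code):

```python
# Toy model of RDD laziness, plain Python. ToyRDD is a hypothetical name.
class ToyRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []   # recorded lineage; nothing executed yet

    def map(self, f):
        # Transformation: return a new "RDD" with an extended lineage.
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, pred):
        return ToyRDD(self.data, self.ops + [("filter", pred)])

    def collect(self):
        # Action: replay the recorded lineage over the data.
        result = self.data
        for kind, fn in self.ops:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

rdd = ToyRDD(range(1, 6)).map(lambda x: x * x).filter(lambda x: x % 2 == 1)
# Nothing has run yet; collect() materializes [1, 9, 25].
squares = rdd.collect()
```

Recording the lineage instead of eagerly computing is also what lets real Spark recompute lost partitions for fault tolerance.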
You’ll learn how to use Spark to work with big data and build machine learning models at scale, including how to wrangle and model massive datasets with PySpark. Learn about big data and how Spark fits into the big data ecosystem. Practice processing and cleaning datasets to get comfortable with Spark’s SQL and dataframe APIs. Debug and optimize your Spark code when running on a cluster. Use Spark’s Machine Learning Library to train machine learning models at scale.
Spark introduction: what is it, modules, data types, operations, aggregations, joins, developing applications
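Two of the operations this session covers, aggregations and joins, can be sketched in plain Python (the function names and sample data are illustrative; Spark's DataFrame API expresses the same logic as `groupBy(...).count()` and `df.join(...)`):

```python
# Plain-Python sketch of a group-by count and a hash-based inner join.
from collections import defaultdict

def group_count(rows, key):
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)

def inner_join(left, right, key):
    # Hash join: build an index on the right side, probe with the left.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index[l[key]]]

orders = [{"user": "ann", "item": "book"}, {"user": "bob", "item": "pen"},
          {"user": "ann", "item": "mug"}]
users = [{"user": "ann", "city": "Oslo"}, {"user": "bob", "city": "Lima"}]

counts = group_count(orders, "user")        # ann appears twice, bob once
joined = inner_join(orders, users, "user")  # each order row gains a city
```

The hash-join shape here is also roughly what Spark does in a broadcast join: the smaller table is turned into a lookup structure and the larger one streams past it.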
Apache Hive: history, what is it, data flow, modeling, types, modes, and main features. Differences from RDBMSs.
Learn about HBase: what is it, use cases and applications, storage and architecture. See a quick demo.
Learn Apache Flume basics, use cases, advantages, architecture and see an example of Twitter Data Streaming
Learn about file formats in Hadoop, their differences and when to choose each one
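The core distinction behind the Hadoop file formats compared here is storage layout: row-oriented formats (e.g. Avro) store whole records together, while columnar formats (e.g. Parquet, ORC) store each column contiguously. A hedged plain-Python sketch of why that matters for analytical reads:

```python
# Illustrative only: modeling row vs. columnar layout with Python lists.
rows = [{"id": 1, "name": "a", "score": 10},
        {"id": 2, "name": "b", "score": 20},
        {"id": 3, "name": "c", "score": 30}]

# Columnar layout: one contiguous list per field.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# A query like SELECT sum(score) reads just one column here...
total_columnar = sum(columns["score"])

# ...but must touch every full record in the row layout.
total_rows = sum(row["score"] for row in rows)
```

This is why columnar formats dominate scan-heavy analytics, while row formats suit write-heavy or whole-record access patterns.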
MapReduce (MR) design patterns in detail, including stages and good practices.
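The three MapReduce stages can be sketched with the canonical summarization pattern, word count, in plain Python (framework details such as input splits, combiners, and partitioners are omitted; the function names are illustrative):

```python
# Minimal word-count sketch of the map / shuffle-and-sort / reduce stages.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (key, 1) pair per word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle-and-sort: order by key so groupby gathers all values
    # for a key together, mimicking the framework's shuffle stage.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # Reduce: sum the values for each key.
    return {key: sum(v for _, v in values) for key, values in grouped}

lines = ["big data big ideas", "data pipelines"]
word_counts = reduce_phase(shuffle(map_phase(lines)))
# {'big': 2, 'data': 2, 'ideas': 1, 'pipelines': 1}
```

Most MR design patterns (filtering, joins, top-k) vary only what the map emits and how the reduce folds each key's group.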