
Course Outline

Introduction:

  • Apache Spark within the Hadoop Ecosystem
  • Overview of Python and Scala

Core Concepts (Theory):

  • System Architecture
  • Resilient Distributed Datasets (RDD)
  • Transformations and Actions (see the sketch after this list)
  • Stages, Tasks, and Dependencies
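
The distinction between transformations and actions above comes down to lazy evaluation: transformations only record lineage, and nothing executes until an action is called. A minimal PySpark sketch, assuming a local SparkContext (the app name and data are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-basics")

    nums = sc.parallelize(range(1, 11))           # build an RDD from a Python range
    squares = nums.map(lambda x: x * x)           # transformation: lazy, records lineage only
    evens = squares.filter(lambda x: x % 2 == 0)  # another lazy transformation
    total = evens.reduce(lambda a, b: a + b)      # action: triggers the actual job

    print(total)  # 220
    sc.stop()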

Hands-on Workshop: Mastering the Basics in the Databricks Environment

  • RDD API exercises
  • Basic transformation and action functions
  • PairRDD operations
  • Joining datasets
  • Caching strategies
  • DataFrame API exercises (see the sketch after this list)
  • Spark SQL
  • DataFrame operations: select, filter, group, sort
  • User Defined Functions (UDFs)
  • Exploration of the Dataset API
  • Streaming capabilities
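
As referenced in the list above, a short sketch of the DataFrame operations and a UDF; the app name, column names, and sample rows are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("df-basics").getOrCreate()

    sales = spark.createDataFrame(
        [("alice", "books", 12.0), ("bob", "games", 30.0), ("alice", "games", 8.5)],
        ["user", "category", "amount"],
    )

    # select / filter / group / sort in one chained expression
    per_user = (sales
                .filter(F.col("amount") > 5)
                .groupBy("user")
                .agg(F.sum("amount").alias("total"))
                .orderBy(F.desc("total")))

    # a simple UDF; a built-in function is preferred whenever one exists
    shout = F.udf(lambda s: s.upper(), StringType())
    per_user.withColumn("user_uc", shout(F.col("user"))).show()

    spark.stop()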

Hands-on Workshop: Deployment in the AWS Environment

  • AWS Glue fundamentals
  • Differentiating between AWS EMR and AWS Glue
  • Sample job implementations in both environments (a Glue skeleton follows this list)
  • Evaluating advantages and limitations
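
For orientation, the standard AWS Glue job skeleton looks roughly like this; the Data Catalog database and table names are placeholders, not part of the course material:

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # read a table registered in the Glue Data Catalog into a DynamicFrame
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="demo_db", table_name="demo_table")
    dyf.toDF().show()  # DynamicFrames convert to ordinary Spark DataFrames

    job.commit()

The same Spark logic runs largely unchanged on EMR; the GlueContext, DynamicFrame, and Job bookkeeping are the Glue-specific parts, which is one practical lens for the EMR-versus-Glue comparison above.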

Additional Topics:

  • Introduction to Apache Airflow orchestration (a minimal DAG sketch follows)
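
As a taste of the orchestration topic above, a minimal Airflow 2.x DAG sketch; the DAG id, schedule, and shell commands are invented for illustration:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_spark_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        transform = BashOperator(task_id="transform", bash_command="spark-submit job.py")
        extract >> transform  # run extract before transform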

Requirements

Programming proficiency (preferably in Python or Scala)

Foundational knowledge of SQL

Duration: 21 hours
