Course Outline

Section 1: Introduction to Hadoop

  • Hadoop history and core concepts
  • Ecosystem overview
  • Distributions
  • High-level architecture
  • Common Hadoop myths
  • Challenges associated with Hadoop
  • Hardware and software considerations
  • Lab: Initial exploration of Hadoop

Section 2: HDFS

  • Design principles and architecture
  • Core concepts (horizontal scaling, replication, data locality, rack awareness)
  • Daemons: NameNode, Secondary NameNode, DataNode
  • Communications and heartbeat mechanisms
  • Data integrity
  • Read and write paths
  • High Availability (HA) and Federation for NameNode
  • Labs: Interacting with HDFS
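The block-size and replication concepts above come down to simple arithmetic. The following is an illustrative plain-Java model (not the real HDFS client API); the 128 MB block size and replication factor of 3 used here are Hadoop's stock defaults.

```java
// Illustrative model of HDFS block math; not the actual HDFS API.
public class HdfsBlockMath {
    // Hadoop's default block size (128 MiB) and replication factor.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;
    static final int REPLICATION = 3;

    // Number of blocks a file of the given size occupies (last block may be partial).
    static long numBlocks(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Raw cluster storage consumed once every block is replicated.
    static long rawStorageBytes(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println("blocks: " + numBlocks(oneGiB));          // 8 blocks
        System.out.println("raw bytes: " + rawStorageBytes(oneGiB)); // 3 GiB of raw storage
    }
}
```

Note that a file smaller than one block still occupies exactly one block's metadata entry on the NameNode, which is why HDFS favors large files over many small ones.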

Section 3: MapReduce

  • Concepts and architecture
  • Daemons (MRV1): JobTracker and TaskTracker
  • Execution phases: Driver, Mapper, Shuffle/Sort, Reducer
  • MapReduce Version 1 versus Version 2 (YARN)
  • MapReduce internals
  • Introduction to Java MapReduce programming
  • Labs: Executing a sample MapReduce program
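The Mapper → Shuffle/Sort → Reducer phases listed above can be mimicked with ordinary Java collections. This is a toy single-process word-count sketch of the execution model, not Hadoop's `org.apache.hadoop.mapreduce` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy single-process word count tracing the MapReduce phases;
// illustrative only, not the org.apache.hadoop.mapreduce API.
public class MiniMapReduce {
    // Map phase: emit a (word, 1) pair for every token in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle/sort phase: group the emitted values by key, in sorted key order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the grouped counts for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    static Map<String, Integer> wordCount(String line) {
        return reduce(shuffle(map(line)));
    }

    public static void main(String[] args) {
        System.out.println(wordCount("to be or not to be")); // {be=2, not=1, or=1, to=2}
    }
}
```

In real Hadoop, the map and reduce steps run as distributed tasks and the shuffle moves data across the network between them; the data flow, however, is exactly this pipeline.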

Section 4: Pig

  • Pig compared to Java MapReduce
  • Pig job flow
  • Pig Latin language
  • ETL processes with Pig
  • Transformations and Joins
  • User-defined functions (UDF)
  • Labs: Writing Pig scripts for data analysis

Section 5: Hive

  • Architecture and design
  • Data types
  • SQL support within Hive
  • Creating Hive tables and executing queries
  • Partitions
  • Joins
  • Text processing capabilities
  • Labs: Various exercises on data processing using Hive

Section 6: HBase

  • Concepts and architecture
  • HBase versus RDBMS versus Cassandra
  • HBase Java API
  • Handling time series data in HBase
  • Schema design
  • Labs: Interacting with HBase via the shell; programming with the HBase Java API; schema design exercise
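A common schema-design pattern for time-series data in HBase is to salt and reverse the row key, so monotonically increasing timestamps don't hotspot a single region. The sketch below is plain Java illustrating that key layout; the bucket count and `salt|metric|reversed-timestamp` format are illustrative design choices, not part of the HBase API.

```java
// Illustrative time-series row-key builder for HBase schema design;
// the salt-bucket layout is a common pattern, not an HBase API call.
public class TimeSeriesKey {
    // Assumption: the table is pre-split into 16 regions, one per salt bucket.
    static final int SALT_BUCKETS = 16;

    static String rowKey(String metric, long epochMillis) {
        // Deterministic salt from the series id spreads writes across regions.
        int salt = Math.floorMod(metric.hashCode(), SALT_BUCKETS);
        // Reversed timestamp makes a forward scan return newest cells first.
        long reversed = Long.MAX_VALUE - epochMillis;
        return String.format("%02d|%s|%019d", salt, metric, reversed);
    }

    public static void main(String[] args) {
        System.out.println(rowKey("cpu.load", System.currentTimeMillis()));
    }
}
```

Because all rows for one metric share the same salt prefix, a reader can still scan a single series with a prefix scan, while writes for different metrics land in different regions.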

Requirements

  • Proficiency in the Java programming language, as most coding exercises are conducted in Java.
  • Familiarity with the Linux environment, including the ability to navigate the command line and edit files using vi or nano.

Lab Environment

Zero Install: Students are not required to install Hadoop software on their personal machines. A functional Hadoop cluster will be provided for use.

Participants will need:

  • An SSH client (Linux and Mac systems come with SSH clients by default; PuTTY is recommended for Windows users).
  • A web browser to access the cluster, with Firefox being the recommended option.
Duration

28 Hours
