Course Outline

  • Introduction
    • History and core concepts of Hadoop
    • The Hadoop ecosystem
    • Overview of Hadoop distributions
    • High-level architecture
    • Common myths surrounding Hadoop
    • Hadoop challenges (hardware and software)
    • Labs: Discussion of your Big Data projects and the problems you face
  • Planning and Installation
    • Choosing software and Hadoop distributions
    • Cluster sizing and growth planning
    • Selecting appropriate hardware and network configurations
    • Rack topology design
    • Installation procedures
    • Implementing multi-tenancy
    • Directory structures and log management
    • Benchmarking methodologies
    • Labs: Cluster installation and performance benchmark execution
  • HDFS Operations
    • Core concepts: horizontal scaling, replication, data locality, and rack awareness
    • Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
    • Health monitoring strategies
    • Administration via command-line and web browser interfaces
    • Adding storage capacity and replacing faulty drives
    • Labs: Familiarization with HDFS command-line utilities
  • Data Ingestion
    • Using Flume for log and data ingestion into HDFS
    • Using Sqoop to import data from SQL databases into HDFS and export it back to SQL
    • Implementing Hadoop data warehousing with Hive
    • Transferring data between clusters using distcp
    • Leveraging S3 as a complementary storage layer to HDFS
    • Best practices and architectural patterns for data ingestion
    • Labs: Setting up and utilizing Flume and Sqoop
  • MapReduce Operations and Administration
    • Evolution of parallel computing: Comparing HPC with Hadoop administration
    • Managing MapReduce cluster loads
    • Nodes and daemons (JobTracker, TaskTracker)
    • Guided walkthrough of the MapReduce UI
    • Configuring MapReduce
    • Job configuration details
    • Optimization techniques for MapReduce
    • Preventing errors: Guidelines for programmers
    • Labs: Running MapReduce example jobs
  • YARN: New Architecture and Capabilities
    • YARN design objectives and implementation architecture
    • Key components: ResourceManager, NodeManager, ApplicationMaster
    • Installing YARN
    • Job scheduling within YARN
    • Labs: Investigation of job scheduling mechanisms
  • Advanced Topics
    • Hardware monitoring
    • Cluster monitoring
    • Adding/removing servers and upgrading Hadoop versions
    • Backup, recovery, and business continuity planning
    • Oozie job workflows
    • Hadoop High Availability (HA)
    • Hadoop Federation
    • Securing your cluster with Kerberos
    • Labs: Establishing monitoring systems
  • Optional Tracks
    • Cloudera Manager: Cluster administration, monitoring, and routine tasks, including installation and usage. All exercises and labs in this track use the Cloudera distribution (CDH5).
    • Ambari: Cluster administration, monitoring, and routine tasks, including installation and usage. All exercises and labs in this track use the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0).

Requirements

  • Familiarity with fundamental Linux system administration
  • Basic scripting proficiency

Prior knowledge of Hadoop or distributed computing is not required; these concepts are introduced and explained throughout the course.

Lab Environment

Zero Installation Required: Students are not required to install Hadoop software on their personal devices. A fully functional Hadoop cluster will be provided for use during the sessions.

Participants will need the following tools:

  • An SSH client (Linux and Mac systems come with built-in SSH clients; for Windows users, PuTTY is recommended)
  • A web browser for cluster access (we recommend Firefox with the FoxyProxy extension)

Duration: 21 Hours