Course Outline
Introduction
- History and core concepts of Hadoop
- The Hadoop ecosystem
- Overview of Hadoop distributions
- High-level architecture
- Common myths surrounding Hadoop
- Hadoop challenges (hardware and software)
- Labs: Discussion of your Big Data projects and associated problems
Planning and Installation
- Choosing software and Hadoop distributions
- Cluster sizing and growth planning
- Selecting appropriate hardware and network configurations
- Rack topology design
- Installation procedures
- Implementing multi-tenancy
- Directory structures and log management
- Benchmarking methodologies
- Labs: Cluster installation and performance benchmark execution
HDFS Operations
- Core concepts: horizontal scaling, replication, data locality, and rack awareness
- Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
- Health monitoring strategies
- Administration via command-line and web browser interfaces
- Adding storage capacity and replacing faulty drives
- Labs: Familiarization with HDFS command-line utilities
Data Ingestion
- Using Flume for log and data ingestion into HDFS
- Using Sqoop to import data from SQL databases into HDFS and export it back to SQL
- Implementing Hadoop data warehousing with Hive
- Transferring data between clusters using distcp
- Leveraging S3 as a complementary storage layer to HDFS
- Best practices and architectural patterns for data ingestion
- Labs: Setting up and utilizing Flume and Sqoop
MapReduce Operations and Administration
- Evolution of parallel computing: Comparing HPC with Hadoop administration
- Managing MapReduce cluster loads
- Nodes and daemons (JobTracker, TaskTracker)
- Guided walkthrough of the MapReduce UI
- Configuring MapReduce
- Job configuration details
- Optimization techniques for MapReduce
- Preventing errors: Guidelines for programmers
- Labs: Running MapReduce example jobs
YARN: New Architecture and Capabilities
- YARN design objectives and implementation architecture
- Key components: ResourceManager, NodeManager, ApplicationMaster
- Installing YARN
- Job scheduling within YARN
- Labs: Investigation of job scheduling mechanisms
Advanced Topics
- Hardware monitoring
- Cluster monitoring
- Adding/removing servers and upgrading Hadoop versions
- Backup, recovery, and business continuity planning
- Oozie job workflows
- Hadoop High Availability (HA)
- HDFS Federation
- Securing your cluster with Kerberos
- Labs: Establishing monitoring systems
Optional Tracks
- Cloudera Manager: Cluster administration, monitoring, and routine tasks, including installation and usage. All exercises and labs in this track run on the Cloudera distribution (CDH5).
- Ambari: Cluster administration, monitoring, and routine tasks, including installation and usage. All exercises and labs in this track use the Ambari cluster manager and the Hortonworks Data Platform (HDP 2.0).
Requirements
- Familiarity with fundamental Linux system administration
- Basic scripting proficiency
Prior knowledge of Hadoop or distributed computing is not required; these concepts are introduced and explained throughout the course.
Lab Environment
Zero Installation Required: Students are not required to install Hadoop software on their personal devices. A fully functional Hadoop cluster will be provided for use during the sessions.
Participants will need the following tools:
- An SSH client (Linux and Mac systems come with built-in SSH clients; for Windows users, PuTTY is recommended)
- A web browser for cluster access (we recommend Firefox with the FoxyProxy extension)
Duration: 21 hours
Testimonials (1)
Hands-on exercises. Class should have been 5 days, but the 3 days helped to clear up a lot of questions that I had from working with NiFi already.