Hadoop Course Introduction
Hadoop is an Apache open-source framework written in Java that enables distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application runs in an environment that provides distributed storage and computation across those clusters. Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
At its core, Hadoop has two major layers:
- Processing/computation layer (MapReduce)
- Storage layer (Hadoop Distributed File System, HDFS)
MapReduce
MapReduce is a parallel programming model, devised at Google, for writing distributed applications that efficiently process large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. MapReduce programs run on Hadoop, an Apache open-source framework.
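To make the model concrete, here is a minimal word-count sketch in plain Java (not Hadoop's actual API): the map phase emits (word, 1) pairs, a shuffle step groups pairs by key, and the reduce phase sums the counts per key. The class and method names are illustrative only.

```java
import java.util.*;
import java.util.stream.*;

// A plain-Java illustration of the MapReduce model (not Hadoop's API).
public class WordCountSketch {

    // Map phase: split each input line into (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // Shuffle + reduce: group the pairs by key, then sum the values per key.
    static Map<String, Integer> run(List<String> lines) {
        return lines.stream()
                    .flatMap(WordCountSketch::map)
                    .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");
        System.out.println(run(input)); // word -> count map (iteration order varies)
    }
}
```

In real Hadoop jobs the same three phases run distributed: mappers execute on the nodes holding the input blocks, the framework shuffles intermediate pairs across the network, and reducers aggregate each key's values.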
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant: HDFS is highly fault-tolerant, is designed for deployment on low-cost hardware, and provides high-throughput access to application data, making it suitable for applications with large datasets.
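HDFS splits each file into fixed-size blocks and stores multiple replicas of each block across the cluster. The sketch below (plain Java, assuming the common but configurable defaults of a 128 MB block size and a replication factor of 3) shows the resulting arithmetic for a single file.

```java
public class HdfsBlockMath {

    // Number of HDFS blocks a file of the given size occupies
    // (ceiling division: the last block may be only partially filled).
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // common default block size: 128 MB
        int replication = 3;                  // common default replication factor

        long fileSize = 1000L * 1024 * 1024;  // a 1000 MB file
        long blocks = blockCount(fileSize, blockSize);
        long rawMb = fileSize * replication / (1024 * 1024);

        System.out.println(blocks); // 8 blocks: 7 full 128 MB blocks + 1 partial
        System.out.println(rawMb);  // 3000 MB of raw cluster storage consumed
    }
}
```

Replication is what makes HDFS fault-tolerant on low-cost hardware: losing a node loses only one copy of each affected block, and the remaining replicas are used to restore the target replication factor.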
Hadoop Online Training Course Content
- Module 01 - Hadoop Installation and Setup
- Module 02 - Introduction to Big Data Hadoop and Understanding HDFS and MapReduce
- Module 03 - Deep Dive in MapReduce
- Module 04 - Introduction to Hive
- Module 05 - Advanced Hive and Impala
- Module 06 - Introduction to Pig
- Module 07 - Flume, Sqoop and HBase
- Module 08 - Writing Spark Applications Using Scala
- Module 09 - Use Case Bobsrockets Package
- Module 10 - Introduction to Spark
- Module 12 - Working with RDDs in Spark
- Module 13 - Aggregating Data with Pair RDDs
- Module 14 - Writing and Deploying Spark Applications
- Module 15 - Project Solution Discussion and Cloudera Certification Tips and Tricks
- Module 16 - Parallel Processing
- Module 17 - Spark RDD Persistence
- Module 19 - Integrating Apache Flume and Apache Kafka
- Module 20 - Spark Streaming
- Module 21 - Improving Spark Performance
- Module 22 - Spark SQL and Data Frames
- Module 23 - Scheduling/Partitioning
- Module 24 - Hadoop Administration – Multi-node Cluster Setup Using Amazon EC2
- Module 25 - Hadoop Administration – Cluster Configuration
- Module 26 - Hadoop Administration – Maintenance, Monitoring and Troubleshooting
- Module 27 - ETL Connectivity with Hadoop Ecosystem (Self-Paced)
- Module 28 - Hadoop Application Testing
- Module 29 - Roles and Responsibilities of Hadoop Testing Professional
- Module 30 - Framework Called MRUnit for Testing of MapReduce Programs
- Module 32 - Test Execution
- Module 33 - Test Plan Strategy and Writing Test Cases for Testing Hadoop Application