O'Reilly - Hadoop and Spark Fundamentals

by Douglas Eadline | Released June 2018 | ISBN: 0134770862

https://www.oreilly.com/library/view/hadoop-and-spark/9780134770871/

9+ Hours of Video Instruction

The perfect (and fast) way to get started with Hadoop and Spark

Hadoop and Spark Fundamentals LiveLessons provides 9+ hours of video introduction to the Apache Hadoop Big Data ecosystem. The tutorial includes background information and explains the core components of Hadoop, including the Hadoop Distributed File System (HDFS), MapReduce, the YARN resource manager, and YARN Frameworks. In addition, it demonstrates how to use Hadoop at several levels, including the native Java interface, C++ pipes, and the universal streaming program interface. Examples include how to use benchmarks and high-level tools, including the Apache Pig scripting language, Apache Hive "SQL-like" interface, Apache Flume for streaming input, Apache Sqoop for import and export of relational data, and Apache Oozie for Hadoop workflow management. In addition, there is comprehensive coverage of Spark, PySpark, and the Zeppelin web-GUI. The steps for easily installing a working Hadoop/Spark system on a desktop/laptop and on a local stand-alone cluster using the powerful Ambari GUI are also included. All software used in these LiveLessons is open source and freely available for your use and experimentation. A bonus lesson includes a quick primer on the Linux command line as used with Hadoop and Spark.

About the Instructor
Douglas Eadline, PhD, began his career as a practitioner and a chronicler of the Linux cluster HPC revolution and now documents big data analytics. Starting with the first Beowulf Cluster how-to document, Doug has written hundreds of articles, white papers, and instructional documents covering High Performance Computing (HPC) and Data Analytics. Prior to starting and editing the popular ClusterMonkey.net website in 2005, he served as editor-in-chief for ClusterWorld Magazine, and was senior HPC editor for Linux Magazine. Currently, he is a writer and consultant to the HPC/Data Analytics industry and leader of the Limulus Personal Cluster Project.
He is author of Hadoop Fundamentals LiveLessons and Apache Hadoop YARN Fundamentals LiveLessons videos from Pearson, and book coauthor of Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 and Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale. He is also the sole author of Hadoop 2 Quick Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem.

Skill Level
- Beginner
- Intermediate

Learn How To
- Understand Hadoop design and key components
- Understand how the MapReduce process works in Hadoop
- Understand the relationship of Spark and Hadoop
- Understand key aspects of the new YARN design and Frameworks
- Use, administer, and program HDFS
- Run and administer Hadoop/Spark programs
- Write basic MapReduce/Spark programs
- Install Hadoop/Spark on a laptop/desktop
- Run Apache Pig, Hive, Flume, Sqoop, Oozie, and Spark applications
- Perform basic data ingest with Hive and Spark
- Use the Zeppelin web-GUI for Spark/Hive programming
- Install and administer Hadoop with the Apache Ambari GUI tool

Who Should Take This Course
- Users, developers, and administrators interested in learning the fundamental aspects and operations of the open source Hadoop and Spark ecosystems

Course Requirements
- Basic understanding of programming and development
- A working knowledge of Linux systems and tools
- Familiarity with Bash, Python, Java, and C++

Lesson 1: Background Concepts
This lesson introduces Hadoop and Spark along with the many aspects and features that enable the analysis of large unstructured data sets. Many discussions about Hadoop ignore the fundamental change Hadoop brings to data management. Doug explains this key point using the data lake metaphor, and then provides background on how the Hadoop data platform, MapReduce, and Spark fit into the data analytics landscape.
A bonus lesson is also included for new Linux users that provides the basics of the command line interface used throughout these lessons.

Lesson 2: Running Hadoop on a Desktop or Laptop
A real Hadoop installation, whether it be a local cluster or in the cloud, can be difficult to configure and possibly an expensive proposition. In order to make the examples of this tutorial more accessible, you learn how to install the Hortonworks HDP Sandbox on a desktop or laptop. The "Sandbox" is a freely available Hadoop virtual machine that provides a full Hadoop environment (including Spark). You can use this environment to try most of the examples in this tutorial. If you would rather learn about Hadoop and Spark installation details, Doug also demonstrates a direct single-machine (Linux) install using the latest Hadoop and Spark binary code.

Lesson 3: The Hadoop Distributed File System
The backbone of Hadoop is the Hadoop Distributed File System, or HDFS. In this lesson you learn the basics of HDFS and how it is different from many standard file systems used today. In particular, Doug explains why various design trade-offs provide HDFS with a performance edge in big data applications. You also learn how to navigate HDFS using the Hadoop tools and how to use HDFS in user programs. Finally, Doug presents some of the new features available in HDFS, including high availability, federation, snapshots, and NFS access.

Lesson 4: Hadoop MapReduce
If the Hadoop Distributed File System is the backbone of Hadoop, then MapReduce is the muscle that operates on big data. In this lesson, Doug shows you how MapReduce compares to a traditional search approach. From there, he shows you how to compile and run a Java MapReduce application. Deeper background on how MapReduce works is presented along with how to use MapReduce with other languages and how to do simple debugging of a MapReduce program.

Lesson 5: Hadoop MapReduce Examples
This lesson continues with MapReduce examples.
Doug first shows you a multifile word count program, and then moves on to a more practical log file analysis. From there, he demonstrates how to use a really large text file, like Wikipedia. The lesson concludes with some examples of running MapReduce benchmarks and using the YARN job browser.

Lesson 6: Higher Level Tools
While Hadoop is very effective at presenting a basic scalable MapReduce model, some higher-level approaches have been developed. In this lesson, Doug teaches you how to use Apache Pig, a Hadoop scripting language that simplifies using MapReduce. In addition, he shows you how to use Apache Hive QL, an SQL-like language that enables higher-level "ad hoc" queries using MapReduce and HDFS. And finally, the Oozie workflow manager is presented.

Lesson 7: Using the Spark Language
Spark has become a popular tool for data analytics. In this lesson, Doug provides some of the basic aspects of the Spark language and demonstrates the Python-Spark interface, PySpark, with a simple command line example. Additional aspects of the Spark language are also used in the next two lessons.

Lesson 8: Getting Data into Hadoop HDFS
The first, and often overlooked, step in data analytics is "data ingest." As was demonstrated in Lesson 3, files can be simply copied into HDFS. However, there are methods that can preserve and import structure that could be lost with simple copying. In this lesson, Doug demonstrates how to import data into Hive tables and use Spark to import data into HDFS. He also demonstrates importing log and other streaming data directly into HDFS using Apache Flume. Finally, a complete example of using Apache Sqoop to import and export a relational database into and out of HDFS is presented.

Lesson 9: Using the Zeppelin Web Interface
Although many of the early Hadoop applications were developed using the command line interface, new web-based GUI tools such as Apache Zeppelin offer a more user-friendly approach to application development.
In this lesson, a walk-through of the Zeppelin interface is provided and includes an example of how to create an interactive Zeppelin notebook for a simple Spark application.

Lesson 10: Learning Basic Hadoop Installation and Administration
One of the challenges facing Hadoop users and administrators is setting up a real cluster for production use. In this lesson, Doug teaches you how to use the Ambari web GUI to install, monitor, and administer a full Hadoop installation. He also provides a few important command line tools that will help with basic administration. Finally, some additional HDFS features such as snapshots and NFSv3 mounts are demonstrated.

About Pearson Video Training
Pearson publishes expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. These professional and personal technology videos feature world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, Pearson IT Certification, Prentice Hall, Sams, and Que. Topics include IT Certification, Network Security, Cisco Technology, Programming, Web Development, Mobile Development, and more. Learn more about Pearson Video training at http://www.informit.com/video.

Introduction
- Hadoop and Spark Fundamentals: Introduction 00:04:12
Lesson 1: Background Concepts
- Learning objectives 00:00:52
- 1.1 Understand Big Data and analytics 00:17:41
- 1.2 Understand Hadoop as a data platform 00:23:27
- 1.3 Understand Hadoop MapReduce basics 00:15:01
- 1.4 Understand Spark language basics 00:21:27
- 1.5 Learn the Linux command line features 00:24:15
- 1.6 Preview Hadoop V3 new features 00:15:54
Lesson 2: Running Hadoop on a Desktop or Laptop
- Learning objectives 00:01:13
- 2.1 Install Hortonworks Hadoop and Spark HDP Sandbox 00:16:08
- 2.2 Install from Hadoop sources--Part 1 00:30:07
- 2.2 Install from Hadoop sources--Part 2 00:15:47
- 2.3 Install from Spark sources 00:19:37
Lesson 3: The Hadoop Distributed File System
- Learning objectives 00:00:58
- 3.1 Understand HDFS basics 00:24:09
- 3.2 Use HDFS command line tools 00:27:01
- 3.3 Use HDFS in programs 00:17:30
- 3.4 Utilize additional features of HDFS 00:15:17
Lesson 4: Hadoop MapReduce
- Learning objectives 00:00:56
- 4.1 Understand the MapReduce paradigm 00:07:42
- 4.2 Develop and run a Java MapReduce application 00:15:48
- 4.3 Understand how MapReduce works 00:20:14
Lesson 5: Hadoop MapReduce Examples
- Learning objectives 00:00:47
- 5.1 Use the Streaming Interface 00:10:34
- 5.2 Use the Pipes interface 00:07:21
- 5.3 Run the Hadoop grep example 00:06:25
- 5.4 Debugging MapReduce 00:11:01
- 5.5 Understand Hadoop Version 2 MapReduce 00:07:57
- 5.6 Use Hadoop Version 2 features--Part 1 00:21:24
- 5.6 Use Hadoop Version 2 features--Part 2 00:17:58
Lesson 6: Higher Level Tools
- Learning objectives 00:00:43
- 6.1 Demonstrate a Pig example 00:07:56
- 6.2 Demonstrate a Hive example 00:06:34
- 6.3 Demonstrate an Oozie example--Part 1 00:28:40
- 6.3 Demonstrate an Oozie example--Part 2 00:17:28
Lesson 7: Using the Spark Language
- Learning objectives 00:00:37
- 7.1 Learn Spark language basics 00:39:30
- 7.2 Demonstrate a PySpark command line example 00:12:02
Lesson 8: Getting Data into Hadoop HDFS
- Learning objectives 00:01:02
- 8.1 Import data into Hive tables 00:22:36
- 8.2 Use Spark to import data into HDFS 00:25:27
- 8.3 Demonstrate a Flume example--Part 1 00:16:36
- 8.3 Demonstrate a Flume example--Part 2 00:15:54
- 8.4 Demonstrate a Sqoop example--Part 1 00:19:38
- 8.4 Demonstrate a Sqoop example--Part 2 00:17:32
Lesson 9: Using the Zeppelin Web Interface
- Learning objectives 00:00:39
- 9.1 Understand Zeppelin features 00:18:32
- 9.2 Deconstruct a Spark application in Zeppelin 00:19:00
Lesson 10: Learning Basic Hadoop Installation and Administration
- Learning objectives 00:00:47
- 10.1 Install and configure Hadoop using Ambari--Part 1 00:30:48
- 10.1 Install and configure Hadoop using Ambari--Part 2 00:33:20
- 10.2 Perform simple administration and monitoring with Ambari 00:46:36
- 10.3 Perform simple command line administration 00:30:23
- 10.4 Utilize additional features of HDFS 00:28:03
Summary
- Hadoop and Spark Fundamentals: Summary 00:04:08