Spark and Delta Lake Workshop

Pricing: $45-60 for the 6-hour hands-on workshop. See Eventbrite for details.

Register using Eventbrite.

PDS (Professional Development Seminar)

A 2-day virtual workshop that teaches you how to architect reliable, scalable solutions with Apache Spark and Delta Lake.

Details

Sat, April 2, 2022, 10:00 AM – 1:00 PM

Sun, April 3, 2022, 10:00 AM – 1:00 PM

Abstract 

This 2-day workshop teaches you what Apache Spark™ and Delta Lake are and how to use them in your data architectures to build reliable, large-scale distributed data pipelines. The course covers the features of Delta Lake that, alongside Spark SQL and Spark Structured Streaming, bring ACID transactions and time travel (data versioning) to your batch and streaming ETL workloads. Slides, demos, exercises, and Q&A sessions together should help you understand the concepts behind the modern data lakehouse architecture.

Motivation

Whether or not you are new to the field of data analytics and data science, you know that working with large amounts of data is a critical need for businesses today. For the first time, SF Bay ACM is partnering with Databricks to bring you this exciting workshop on Apache Spark and Delta Lake. Together, these two technologies put petabytes of data at your fingertips.

Sponsorship

Databricks is partially sponsoring this event, so we also have a rare opportunity to support our professional development activities with a significant price drop. Check below for details.

NOTE: While this is a virtual class, we will cap it at classroom size so that there is a strong focus on learning. There is a nominal charge for the 6 hours of lecture. Please sign up early, as we will keep the attendee count low. This is NOT a MOOC. Registration also includes a 1-year SFBay ACM membership ($20 value).

Content: You will have access to all the notebooks and training material used in the hands-on workshop.

Who is the course for?

  • Solution Architects
  • Data Engineers
  • Data Scientists

Structure

  • Six 55-min modules (10-min break between modules)
  • Each module: 15-min talk / 20-min labs / 15-min Q&A / 5-min buffer

Requirements

  • Sign up for Databricks Community Edition
  • Should have experience with SQL and Python

Saturday – Day 1: 10am-1pm, Pacific Time

Module 1. The Fundamentals of Apache Spark

  • Introduction to Databricks Community Edition
  • Loading and saving datasets (/databricks-datasets) [SQL]
  • Basic DataFrame Transformations [SQL]
  • Working with Spark tables [SQL]

Module 2. Intermediate Spark SQL

  • Aggregations [SQL]
  • Joins [SQL]
  • Basics of web UI

Module 3. Advanced Spark SQL

  • Windowed Aggregation [SQL]
  • Introduction to Spark Structured Streaming [Python, SQL]

Sunday – Day 2 (Delta Lake): 10am-1pm, Pacific Time

Module 4. Introduction to Delta Lake

  • Bringing Reliability to Data Lakes (Concepts)
  • Convert existing tables to Delta Lake [SQL]
  • Unified Batch and Streaming [Python, SQL]

Module 5. DML and Schema

  • Create, Insert, Update, Delete, Merge
  • Schema Enforcement and Evolution

Module 6. SQL and the Transaction Log

  • Delta Lake SQL
  • Time Travel
  • Transaction Log Fundamentals

Organizer & SFBay ACM Prof Dev Chair: Yashesh Shroff  @yashroff

For more information about registration, please contact the SF Bay Chapter of the ACM: yshroff at g | m | a i l

We look forward to seeing you at the workshop!

Sign-up / Registration

https://www.eventbrite.com/e/spark-and-delta-lake-workshop-tickets-277155448407?aff=website

Attendees who have signed up and paid for the PDS will receive a complimentary 1-year membership in the local ACM Chapter. (Active members will have their membership extended by one year.)