featuring SQL and DataFrames.
This course is around 5 hours long.
Having problems? Check the errata.
Introduction 6m 29s What do DataFrames and SparkSQL offer compared to SparkCore (RDDs)?
Getting Started 20m 10s We'll read in a DataSet (DataFrame) to get started.
Working with DataSets 29m 3s For our first real task with SparkSQL, we'll see how to do filters.
Full SQL Syntax 13m 45s How to query Spark using the full SQL syntax.
In Memory Data 15m 4s In Module 1 we used parallelize to work with in-memory data, which is useful for unit tests. This is how to do the same using DataFrames.
Grouping and Aggregating 12m 59s Understanding the Group By clause in SparkSQL.
Date Formatting 6m 30s How to use the date_format function in SparkSQL.
Multiple Groupings 13m 59s How to group by more than one column.
Ordering 16m 36s How to use the order by clause.
DataFrames API 28m 4s We've concentrated on the SQL syntax so far, but we can also use a Java API to do everything (and more) that SQL can.
Pivot Tables 21m 21s With DataFrames, we can produce pivot tables just as in spreadsheets and databases, but for Big Data!
General Aggregations 18m 49s The agg method is the most flexible aggregating function, so we'll see how to use it.
Practical Session 8m 12s A short exercise.
User Defined Functions 23m 55s How to use lambdas to add your own functions to the SQL syntax and DataFrame API.
Performance 25m 56s Using the SparkUI to analyse tasks. We ask the question: is the SQL syntax slower than the DataFrame API? Answers will follow in the next video...
HashAggregation 39m 21s Spark has two strategies for grouping. HashAggregation is extremely efficient but can only be used in restricted circumstances. Find out how to make sure HashAggregation is used instead of the (usually) slower SortAggregate routine.
SparkSQL vs SparkRDD 6m 55s Which performs "better"?
Update - Tuning the spark.sql.shuffle.partitions Property 8m 18s An update: by default, Spark uses a large number of partitions (200) when shuffling, such as when grouping. This can kill performance on small jobs. This is how to fix the problem.
Module Summary 2m 24s Coming up later in 2018 is a module on SparkML.