Introduction to Apache Spark

Big Data, characterized by its volume, velocity and veracity, requires specialized tools and technologies for analysis. Apache Spark has proven to be a leading industry tool for analyzing large datasets on a real-time basis. This course introduces the PySpark module (the R version of the course uses the sparklyr library), which allows for real-time, scalable and parallel analysis of large datasets.

Curriculum

  1. Introduction to Apache Spark and its use cases
  2. RDDs and DataFrames
  3. Running Spark on a Cluster
  4. Spark and Databases
  5. Spark and Streaming data
  6. Machine Learning with Spark
  7. Machine Learning with Spark 2
  8. Spark alternatives

Prerequisites

  • Introductory knowledge of statistics
  • Experience with one or more regression or classification models
  • Intermediate knowledge of Python/R

Related Courses