Rs. 1999  Rs. 599

Learn Spark for Data Science with Python

Created by Stanford and IIT alumni with work experience in Google and Microsoft, this course will teach you how to process data using Spark for analytics, machine learning and data science.

08h:00m
Lifetime access
130 learners
Course Introduction

Big data analysis is a valuable skill in the job market, and this course teaches one of the most in-demand technologies in big data analytics: Apache Spark.

What is Spark? If you are an analyst or a data scientist, you are probably used to juggling several tools, such as SQL, Python, R and Java, to work with data. Apache Spark is a fast cluster computing framework for large-scale data processing. With Spark, you have a single engine in which you can explore large datasets, run machine learning algorithms, and then use the same system to put your code into production.

Course Objectives

By the end of this course you will be able to:

  • Work with a variety of datasets, in tasks ranging from predicting airplane departure delays to analyzing social networks and product ratings.
  • Use the features and libraries of Spark, such as RDDs, DataFrames, Spark SQL, MLlib, Spark Streaming and GraphX.
  • Apply Apache Spark to a range of analytics and machine learning tasks.
  • Implement complex algorithms such as PageRank and music recommendations.

Prerequisites and Target Audience

Prerequisites for the course:

  • To take this course, you need to know Python. You should be able to write Python code directly in the PySpark shell. If you have IPython Notebook installed, the course shows you how to configure it for Spark.
  • To get a firm grasp of the Java module, you should know Java. An IDE that supports Maven, such as IntelliJ IDEA or Eclipse, would be useful.
  • All examples work with or without Hadoop. If you want to use Spark with Hadoop, you need Hadoop installed on your system, in either pseudo-distributed or cluster mode.
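
As a rough gauge of the Python level involved, here is a minimal sketch in plain Python (no Spark installation assumed) of the "total and count in one pass" pattern behind the lesson on average flight delay using aggregate(). It is an illustration only, not the course's own code: the delay values, the two-partition split, and the names seq_op and comb_op are all made up for this example.

```python
from functools import reduce

# Hypothetical flight delays in minutes, pre-split into two "partitions"
# the way Spark spreads an RDD across workers (made-up numbers).
partitions = [[10.0, 0.0, 25.0], [5.0, 40.0]]


def seq_op(acc, delay):
    """Fold one delay into a (total, count) accumulator within a partition."""
    total, count = acc
    return (total + delay, count + 1)


def comb_op(a, b):
    """Merge the (total, count) accumulators of two partitions."""
    return (a[0] + b[0], a[1] + b[1])


zero = (0.0, 0)

# Step 1: each partition is reduced independently to (total, count).
per_partition = [reduce(seq_op, part, zero) for part in partitions]

# Step 2: the per-partition results are merged into one (total, count).
total, count = reduce(comb_op, per_partition, zero)

average_delay = total / count
print(average_delay)
```

In the PySpark shell, the same two functions could be passed to an RDD's aggregate(zeroValue, seqOp, combOp) action, for example delays.aggregate((0.0, 0), seq_op, comb_op), where delays is a hypothetical RDD of delay values.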

Course Plan
Certificate of completion

1. You, This Course and Us
1 video
2. Introduction to Spark
8 videos
The PySpark Shell 04:50

Transformations and Actions 13:33

See it in Action : Munging Airlines Data with PySpark - I 10:13
3. Resilient Distributed Datasets
9 videos
RDD Characteristics: Partitions and Immutability 12:35

RDD Characteristics: Lineage, RDDs know where they came from 06:06

What can you do with RDDs? 11:08

Create your first RDD from a file 16:10

Average distance travelled by a flight using map() and reduce() operations 05:50

Get delayed flights using filter(), cache data using persist() 05:23

Average flight delay in one-step using aggregate() 15:11

Frequency histogram of delays using countByValue() 03:26

See it in Action : Analyzing Airlines Data with PySpark - II 06:25
4. Advanced RDDs: Pair Resilient Distributed Datasets
6 videos
Special Transformations and Actions 14:45

Average delay per airport, use reduceByKey(), mapValues() and join() 18:11

Average delay per airport in one step using combineByKey() 11:53

Get the top airports by delay using sortBy() 04:34

Lookup airport descriptions using lookup(), collectAsMap(), broadcast() 14:03

See it in Action : Analyzing Airlines Data with PySpark - III 04:58
5. Advanced Spark: Accumulators, Spark Submit, MapReduce, Behind The Scenes
7 videos
Get information from individual processing nodes using accumulators 13:35

See it in Action : Using an Accumulator variable 02:41

Long running programs using spark-submit 05:58

See it in Action : Running a Python script with Spark-Submit 03:58

Behind the scenes: What happens when a Spark script runs? 14:30

Running MapReduce operations 13:44

See it in Action : MapReduce with Spark 02:05
6. Java and Spark
5 videos
The Java API and Function objects 15:58

Pair RDDs in Java 04:49

Running Java code 03:49

Installing Maven 02:20

See it in Action : Running a Spark Job with Java 05:08
7. PageRank: Ranking Search Results
5 videos
What is PageRank? 16:44

The PageRank algorithm 06:15

Implement PageRank in Spark 12:01

Join optimization in PageRank using Custom Partitioning 07:27

8. Spark SQL
2 videos
DataFrames: RDDs + Tables 16:05

See it in Action : DataFrames and Spark SQL 04:50
9. MLlib in Spark: Build a recommendations engine
4 videos
Collaborative filtering algorithms 12:19

Latent Factor Analysis with the Alternating Least Squares method 11:39

Music recommendations using the Audioscrobbler dataset 07:51

Implement code in Spark using MLlib 16:05
10. Spark Streaming
4 videos
Introduction to streaming 09:55

Implement stream processing in Spark using DStreams 10:54

Stateful transformations using sliding windows 09:26

See it in Action : Spark Streaming 04:17
11. Graph Libraries
1 video
The Marvel social network using Graphs 18:01

Meet the Author


Loonycorn
4 alumni of Stanford, IIM-A and IITs, with experience at Google, Microsoft and Flipkart

Loonycorn is a team of four graduates of top universities. Janani Ravi, Vitthal Srinivasan, Swetha Kolalapudi and Navdeep Singh have spent years (decades, actually) working in the tech sector around the world.

  • Janani: Graduated from Stanford and worked for 7 years at Google (New York, Singapore). She has also worked at Flipkart and Microsoft.
  • Vitthal: Studied at Stanford; worked at Google (Singapore), Flipkart, Credit Suisse, and INSEAD.
  • Swetha: An IIM Ahmedabad and IIT Madras graduate with work experience at Flipkart.
  • Navdeep: An IIT Guwahati alumnus and longtime Flipkart employee.
Ratings and Reviews: 4.6/5