Recommendation Engines Using ALS in PySpark (MovieLens Dataset)

This tutorial gives an overview of how the Alternating Least Squares (ALS) algorithm works and, using the MovieLens dataset, walks through a code-level example of building a collaborative-filtering recommendation engine in PySpark.

One note on the TrainValidationSplit mentioned at 6:01: a more appropriate solution is to use CrossValidator, which performs true cross-validation. It evaluates each parameter combination on several different "folds" rather than on a single train/validation split, as TrainValidationSplit does. To incorporate this into the code, replace the "tvs = TrainValidationSplit(…" code chunk with this:

cv = CrossValidator(estimator=als,

numFolds can be set to any integer of 2 or more. A typical choice is around 5; I used 3 here to save time, since every additional fold means another round of model fits for each parameter combination.
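To make the idea of folds concrete, here is a minimal pure-Python sketch of how k-fold splitting partitions row indices. It is only an illustration: the helper name k_fold_indices is hypothetical, and Spark's CrossValidator assigns rows to folds randomly rather than in the fixed pattern used here.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train, validation) index lists, one pair per fold."""
    if num_folds < 2:
        raise ValueError("num_folds must be at least 2")
    indices = list(range(n_rows))
    for fold in range(num_folds):
        # Hold out every num_folds-th row, starting at offset `fold`.
        validation = indices[fold::num_folds]
        held_out = set(validation)
        # Everything that is not held out is used for training.
        train = [i for i in indices if i not in held_out]
        yield train, validation


splits = list(k_fold_indices(n_rows=6, num_folds=3))
for fold, (train, validation) in enumerate(splits):
    print(f"fold {fold}: train={train} validation={validation}")
```

Each row lands in a validation set exactly once, so every parameter combination is scored num_folds times and the scores are averaged.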

Also, be sure to replace the previous "tvs" with "cv" in the "model =" code chunk as well.


