PySpark Tutorial : Immutability and Lazy Processing

  • DataCamp
  • 2020-04-18
  • 902

Video description: PySpark Tutorial : Immutability and Lazy Processing

Want to learn more? Take the full course at https://learn.datacamp.com/courses/cl... at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.

---

Welcome back! We've had a quick discussion about data cleaning, data types, and schemas. Let's move on to some further Spark concepts - Immutability and Lazy Processing.

Normally in Python, and in most other languages, variables are fully mutable: their values can be changed at any time, as long as the variable is in scope.

While very flexible, this presents problems whenever multiple concurrent components try to modify the same data. Most languages work around these issues with constructs like mutexes and semaphores, which adds complexity, especially in non-trivial programs.
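
As a quick illustration (not from the video), here is the shared-mutable-state hazard in plain Python: two names bound to the same list object can step on each other, which is exactly what locks are meant to guard against in concurrent code.

    x = [1, 2, 3]
    y = x           # y is a second name for the same mutable list object
    y.append(4)     # a change made through one name...
    print(x)        # [1, 2, 3, 4] -- ...is visible through the other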

Unlike typical Python variables, Spark Data Frames are immutable. While not strictly required, immutability is often a component of functional programming.

We won't go into everything that implies here, but understand that Spark is designed to use immutable objects. Practically, this means Spark Data Frames are defined once and are not modifiable after initialization. If the variable name is reused, the original data is removed (assuming it's not in use elsewhere) and the variable name is reassigned to the new data.

While this seems inefficient, it actually allows Spark to share data across all cluster components. It can do so without worrying about concurrent access to the data.

This is a quick example of the immutability of data frames in Spark. It's OK if you don't understand the actual code; this example is more about the concept of what happens.

First, we create a data frame from a CSV file called voterdata.csv. This creates a new data frame definition and assigns it to the variable name voter_df.

Once created, we want to do two further operations. The first is to create a full year column by using a 2-digit year present in the data set and adding 2000 to each entry. This does not actually change the data frame at all. It copies the original definition, adds the transformation, and assigns it to the voter_df variable name.

Our second operation is similar - now we want to drop the original year column from the data frame. Again, this copies the definition, adds a transformation, and reassigns the variable name to this new object. The original objects are destroyed. Please note that the original year column is now permanently gone from this instance, though not from the underlying data (i.e., you could simply reload it into a new data frame if desired).
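
The narration above describes code shown on the slide; here is a minimal sketch of what it likely looks like. The file name voterdata.csv comes from the narration, while the column names year and fullyear and the header=True option are assumptions for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Define a data frame from the CSV file and bind it to voter_df
    voter_df = spark.read.csv('voterdata.csv', header=True)

    # Copy the definition, add a transformation that derives a 4-digit
    # year from the 2-digit one, and rebind the same variable name
    voter_df = voter_df.withColumn('fullyear', voter_df.year + 2000)

    # Copy again, drop the original 2-digit year column, and rebind
    voter_df = voter_df.drop(voter_df.year)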

You may be wondering how Spark does this so quickly, especially on large data sets. Spark can do this because of something called lazy processing.

Lazy processing in Spark is the idea that very little actually happens until an action is performed. In our previous example, we read a CSV file, added a new column, and deleted another. The trick is that no data was actually read, added, or modified; we only updated the instructions (known as transformations) for what we wanted Spark to do. This functionality allows Spark to perform the most efficient set of operations to get the desired result.
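
One way to see this for yourself (a sketch, not from the video): calling explain() on the data frame prints Spark's query plan without executing it, confirming that the steps above were only recorded.

    # Nothing above has touched the data on disk; Spark only recorded
    # a plan. explain() prints that plan without running it.
    voter_df.explain()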

The code example is the same as on the previous slide, but with a count() method call added. count() is classified as an action in Spark and will trigger processing of all the transformation operations.
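
Here is the same chain with the action appended (a sketch under the same assumed column names as above). Only the final count() call makes Spark read the file and run the transformations.

    voter_df = spark.read.csv('voterdata.csv', header=True)
    voter_df = voter_df.withColumn('fullyear', voter_df.year + 2000)
    voter_df = voter_df.drop(voter_df.year)

    # count() is an action: Spark now reads the CSV, applies the
    # transformations, and returns the number of rows
    print(voter_df.count())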

These concepts can be a little tricky to grasp without some examples. Let's practice these ideas in the coming exercises.


#DataCamp #PySparkTutorial #CleaningDatawithPySpark
