How to create a Dataset in Spark: 4 ways to create a Spark Dataset

https://bigdataelearning.com/course/a...
https://bigdataelearning.com/courses
https://bigdataelearning.com
A Dataset can be created using any of the following four methods:

1) Create a Dataset from a sequence of elements
2) Create a Dataset from a sequence of case classes
3) Create a Dataset from an RDD
4) Create a Dataset from a DataFrame

1. Create a Dataset from a sequence of elements

A Dataset can be created from a sequence of elements using the toDS() method. For example, suppose we have a numberSeq variable containing the elements 1 to 5. Applying toDS() to numberSeq gives us a Dataset.
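
The step above can be sketched roughly as follows. This is a minimal example, not the exact code from the video: the object name SeqToDataset, the build helper, the application name, and the local master setting are assumptions made for illustration. Note that toDS() on a Seq only becomes available once spark.implicits._ is imported (spark-shell does this automatically).

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical wrapper object; the source only names the numberSeq variable.
object SeqToDataset {

  // Builds a Dataset[Int] from a plain Scala sequence.
  def build(spark: SparkSession): Dataset[Int] = {
    // Brings the toDS() extension method and the Int encoder into scope.
    import spark.implicits._

    val numberSeq = Seq(1, 2, 3, 4, 5)
    numberSeq.toDS()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying the example out.
    val spark = SparkSession.builder()
      .appName("SeqToDataset")
      .master("local[*]")
      .getOrCreate()

    val numberDs = build(spark)
    numberDs.show()

    spark.stop()
  }
}
```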

2. Create a Dataset from a sequence of case classes

Now let’s see how to create a Dataset from a sequence of case classes using the toDS() method. For example, suppose we have a case class called “Employee” with two fields, “name” and “age”. We create a sequence, employeeSeq, with three employees in it, and then apply toDS() to employeeSeq. This creates a Dataset called employeeDs.
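
A minimal sketch of this step might look like the following. The Employee class, employeeSeq, and employeeDs come from the description above; the field types, the three sample employees, the object name CaseClassSeqToDataset, and the build helper are assumptions for illustration only.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Defined at top level (not inside a method) so Spark can derive an encoder for it.
// Field types are an assumption; the source only names the fields.
case class Employee(name: String, age: Int)

// Hypothetical wrapper object.
object CaseClassSeqToDataset {

  // Builds a Dataset[Employee] from a sequence of case class instances.
  def build(spark: SparkSession): Dataset[Employee] = {
    import spark.implicits._

    // Sample values are made up for the example.
    val employeeSeq = Seq(
      Employee("Alice", 30),
      Employee("Bob", 25),
      Employee("Carol", 41)
    )
    employeeSeq.toDS()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CaseClassSeqToDataset")
      .master("local[*]")
      .getOrCreate()

    val employeeDs = build(spark)
    employeeDs.printSchema()
    employeeDs.show()

    spark.stop()
  }
}
```

Because the Dataset is typed as Dataset[Employee], fields can be accessed as ordinary Scala members, for example employeeDs.filter(_.age > 26).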

3. Create a Dataset from an RDD

So far, we have seen how to create a Dataset from a sequence of elements or from a sequence of case classes. Now, let’s see how to create a Dataset from an RDD. I am going to create an RDD using the parallelize method of the Spark context. This creates an RDD that has two elements in it. Applying the toDS() method to it should create a Dataset.
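
The RDD route could be sketched like this. The description does not say which two elements the RDD holds, so the two strings below are placeholders, and the object name RddToDataset and the build helper are likewise assumptions.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical wrapper object.
object RddToDataset {

  // Creates a two-element RDD with parallelize and converts it to a Dataset.
  def build(spark: SparkSession): Dataset[String] = {
    // Needed for the toDS() extension method on RDDs.
    import spark.implicits._

    // Placeholder elements; the source only says the RDD has two elements.
    val rdd: RDD[String] = spark.sparkContext.parallelize(Seq("first", "second"))
    rdd.toDS()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddToDataset")
      .master("local[*]")
      .getOrCreate()

    val ds = build(spark)
    ds.show()

    spark.stop()
  }
}
```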

4. Create a Dataset from a DataFrame

Finally, let’s try creating a Dataset from a DataFrame. I am creating a case class called Person with two fields in it, name and age. Then I am creating a sequence with three Person objects in it.

Next, I create an RDD from the sequence of objects by passing the sequence to the parallelize method of the Spark context. Then I create a DataFrame from the RDD using the toDF() method. So now we have a DataFrame. Let’s convert it into a Dataset using the “as” method: df.as[Person], where the type parameter is the case class that describes the schema. This should give us a Dataset. Applying toDS() directly to the RDD should also create a Dataset.
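
Put together, the DataFrame route might look like the sketch below. The Person class and the toDF() and as steps follow the description; the field types, the three sample people, the object name DataFrameToDataset, and the build helper are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Defined at top level so Spark can derive an encoder for it.
// Field types are an assumption; the source only names the fields.
case class Person(name: String, age: Int)

// Hypothetical wrapper object.
object DataFrameToDataset {

  // Sequence -> RDD -> DataFrame -> Dataset[Person].
  def build(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._

    // Sample values are made up for the example.
    val personSeq = Seq(Person("Dana", 28), Person("Eli", 35), Person("Fay", 52))

    // Create an RDD from the sequence, then an untyped DataFrame from the RDD.
    val personRdd = spark.sparkContext.parallelize(personSeq)
    val df: DataFrame = personRdd.toDF()

    // as[Person] maps the DataFrame columns (name, age) onto the case class fields.
    df.as[Person]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameToDataset")
      .master("local[*]")
      .getOrCreate()

    val personDs = build(spark)
    personDs.show()

    spark.stop()
  }
}
```

The as conversion relies on the DataFrame column names and types matching the case class fields, which holds here because the DataFrame was itself built from Person objects.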

To summarize, a Dataset can be created from a sequence of elements, from a sequence of case classes, from an RDD, or from a DataFrame.
