Generate Fake Data using PySpark in 1 min


Learn more about the PySpark course: https://www.geekcoders.co.in/courses/...

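# Build a 100-row PySpark DataFrame from mimesis providers using farsante
import farsante
from mimesis import Person, Address, Datetime

p = Person('en')
ad = Address('en')
dt = Datetime()
# Pass the provider methods uncalled; they are invoked to fill each row
df = farsante.pyspark_df(
    [p.first_name, p.last_name, p.sex, p.age, ad.country, ad.country_code,
     ad.address, ad.city, ad.state, dt.year],
    100,
)
display(df)  # display() is a Databricks notebook helper; use df.show() elsewhere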
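
The farsante call above receives a list of functions rather than values, which is easy to miss. A minimal sketch of the idea, using only the standard library: each function is called once per row, and the column names are taken from the function names. This is an assumption about the general pattern, not farsante's actual source; `build_rows`, `first_name` and `age` are hypothetical stand-ins for the library and the mimesis provider methods.

```python
import random


def build_rows(funs, num_rows):
    """Call every provider once per row; name each column after its function."""
    rows = [tuple(f() for f in funs) for _ in range(num_rows)]
    columns = [f.__name__ for f in funs]
    return columns, rows


# Hypothetical stand-ins for mimesis provider methods such as p.first_name
def first_name():
    return random.choice(["Ada", "Linus", "Grace"])


def age():
    return random.randint(18, 90)


columns, rows = build_rows([first_name, age], 5)
print(columns)
for row in rows:
    print(row)
```

Because the functions are passed uncalled, every row gets fresh values. Writing `p.first_name()` with parentheses in the list would instead pass a single string, which is not callable.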


You can use the code below to generate the data using Faker:

from pyspark.sql import SparkSession
from faker import Faker
import random

# Initialize Faker and PySpark
fake = Faker()
spark = SparkSession.builder.appName("FakeData").getOrCreate()

# Function to generate fake data
def generate_fake_data(num_records):
    data = []
    for _ in range(num_records):
        data.append((
            fake.name(),
            fake.email(),
            fake.address(),
            fake.phone_number(),
            fake.date_of_birth(minimum_age=18, maximum_age=90).strftime("%Y-%m-%d"),
            random.randint(1000, 10000),  # random salary
        ))
    return data

# Number of fake records you want
num_records = 1000

# Generate the fake data
fake_data = generate_fake_data(num_records)

# Create PySpark DataFrame
columns = ["Name", "Email", "Address", "Phone", "Date_of_Birth", "Salary"]
df = spark.createDataFrame(fake_data, columns)

# Show some rows from the DataFrame
df.show(10, truncate=False)

# Stop Spark session (optional)
spark.stop()
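
The code above produces different rows on every run. If you want repeatable test data, one option is to pass the generators into the function so they can be seeded by the caller. The sketch below is a variant of generate_fake_data, not part of the original snippet; `generate_fake_rows` and its `fake` and `rng` parameters are hypothetical names. It assumes `fake` is any object exposing the same Faker methods used above.

```python
import random
from typing import Any, List, Tuple


def generate_fake_rows(fake: Any, rng: random.Random, num_records: int) -> List[Tuple]:
    """Build rows in the column order Name, Email, Address, Phone, Date_of_Birth, Salary."""
    rows = []
    for _ in range(num_records):
        dob = fake.date_of_birth(minimum_age=18, maximum_age=90)
        rows.append((
            fake.name(),
            fake.email(),
            fake.address(),
            fake.phone_number(),
            dob.strftime("%Y-%m-%d"),
            rng.randint(1000, 10000),  # salary drawn from the caller's seeded RNG
        ))
    return rows


# Usage, assuming the faker package is installed:
#   from faker import Faker
#   Faker.seed(42)  # seeds the random generator shared by Faker instances
#   rows = generate_fake_rows(Faker(), random.Random(42), 1000)
#   df = spark.createDataFrame(rows, columns)
```

Keeping the salary on a separate `random.Random` instance means other code that uses the global `random` module cannot shift the sequence.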


#pyspark #spark #bigdata #databricks
