Data cleansing importance in PySpark | Multiple date formats, clean special characters in header

Sreyobhilashi IT
For PySpark training or placement, please call +91-8500002025 or join this Telegram group: https://t.me/SparkTraining
This video covers the importance of data cleaning in PySpark and why Python is useful for it.
If you have multiple date formats, or want to remove special characters from the header, this video explains how to handle them.
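Before the full script below, here is a minimal, self-contained sketch of the core idea; the sample rows and the dirty column name birth-dob are made up for illustration. to_date() returns NULL when a string does not match the given pattern, so coalesce() over several to_date() calls keeps the first format that actually parses.

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, to_date, col
import re

demo_spark = SparkSession.builder.master("local[2]").config("spark.sql.legacy.timeParserPolicy", "LEGACY").appName("demo").getOrCreate()

# hypothetical sample: a dirty header plus the same date written in two different formats
demo = demo_spark.createDataFrame([("2022-03-23",), ("23-Mar-2022",)], ["birth-dob"])

# clean the header, then try each format; the first one that parses wins
demo = demo.toDF(*[re.sub('[^a-zA-Z0-9]', '', c) for c in demo.columns])
demo.withColumn("birthdob", coalesce(to_date(col("birthdob"), "yyyy-MM-dd"), to_date(col("birthdob"), "dd-MMM-yyyy"))).show()

The full script from the video follows.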

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, to_date, col


# LEGACY time parser policy keeps the old (Spark 2.x / SimpleDateFormat) date patterns working on Spark 3.x
spark = SparkSession.builder \
    .master("local[2]") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .appName("test") \
    .getOrCreate()

data="E:\\bigdata\\datasets\\dateusecases.txt"
df=spark.read.format("csv").option("header","true").load(data)
#df.show()
#df.printSchema()
# goal: normalize every date into yyyy-MM-dd format only
# data cleaning steps
# to_date(col, fmt) returns NULL when the string does not match fmt,
# so coalesce() keeps the first format that parses successfully
def dynamic_date(col, frmts=("yyyy-MM-dd", "dd-MMM-yyyy", "ddMMMMyyyy", "MM-dd-yyyy", "MMM/yyyy/dd")):
    return coalesce(*[to_date(col, fmt) for fmt in frmts])

import re
# strip all non-alphanumeric characters from the column names
cols = [re.sub('[^a-zA-Z0-9]', '', c) for c in df.columns]
ndf = df.toDF(*cols)
# parse the mixed-format date column into a proper DateType column
res = ndf.withColumn("birthdob", dynamic_date(col("birthdob")))
res.printSchema()
#res.show()
# data processing: query the cleaned data with Spark SQL
res.createOrReplaceTempView("tab")
result = spark.sql("select * from tab where birthdob='2022-03-23'")
result.show()
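A quick sanity check, not part of the video's script: any birthdob value that none of the listed formats could parse is left NULL by coalesce(), so unparsed rows are easy to find.

# rows whose date string matched none of the formats (birthdob stayed NULL)
res.filter(col("birthdob").isNull()).show()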