How to use PySpark DataFrame API? | DataFrame Operations on Spark

BI Insights Inc
In this tutorial we continue with PySpark. In the previous session we covered the setup, learned the basics of PySpark, and explored a few of the features it offers, for example the DataFrame API and Spark SQL. In this session we explore these features further before we dive into building data pipelines with PySpark (Spark API).
Spark is a distributed engine designed for processing large amounts of data. It offers scalability beyond a single machine. If you encounter a pandas memory error due to data size, then it is time to explore Spark: it is designed for large data, and it is the engine behind AWS Glue.

Link to GitHub repo: https://github.com/hnawaz007/pythonda...

PySpark documentation: https://www.oracle.com/java/technolog...

Databricks case on Pandas API on Spark: https://www.databricks.com/blog/2021/...

Subscribe to our channel: haqnawaz

-------------------------------------------
Follow me on social media!

GitHub: https://github.com/hnawaz007
Instagram: bi_insights_inc
LinkedIn: haq-nawaz

-------------------------------------------

#apachespark  #pyspark #dataframe  

Topics covered in this video:
0:00 - Introduction to PySpark
0:28 - Spark in current context of Data
1:16 - Spark DataFrame API
2:22 - Jupyter Notebook
3:00 - Read Data from Database
4:06 - DataFrame API Operations - Rename and Select
4:35 - Sort DataFrame
5:14 - Filter Operation in DataFrame and Spark SQL
7:40 - DataFrame & SQL Join & Aggregate Operation
9:22 - Create new Columns based on condition
11:06 - Replace Null & Drop Columns
Published 2 years ago, on 1401/08/07 (Iranian calendar).
5,346 views