How to process large datasets with pandas | Avoid out-of-memory issues when loading data into pandas

BI Insights Inc
In this tutorial, we cover how to handle large datasets with pandas. I have received a few questions about handling a dataset that is larger than the available memory of the computer. How can we process such datasets with pandas?
My first suggestion would be to filter the data before loading it into a pandas DataFrame. Second, use a distributed engine that is designed for big data; examples include Dask, Apache Flink, Kafka, and Spark. We are covering Spark in our recent series. These systems use a cluster of computers, called nodes, to process data, and they can handle terabytes of data depending on the available nodes.
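To illustrate the first suggestion, here is a minimal sketch of pushing the filter down to the database so only the rows and columns you need ever reach pandas. The connection string, table name (sales), and column names are hypothetical placeholders:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string -- replace with your own database.
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# Push the filter into SQL: select only the needed columns and rows
# instead of pulling the entire table into memory.
query = """
    SELECT order_id, order_date, amount
    FROM sales
    WHERE order_date >= '2022-01-01'
"""
df = pd.read_sql(query, engine)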
Anyway, let's say we have a medium-sized dataset in a relational database and we want to process it with pandas. How can we safely load it into pandas?
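One answer, covered later in the video as batch processing on the client, is the chunksize parameter of pd.read_sql. A minimal sketch, reusing the hypothetical engine and sales table from the snippet above:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# With chunksize set, read_sql returns an iterator of DataFrames,
# so only one batch of rows is materialized in pandas at a time.
for chunk in pd.read_sql("SELECT * FROM sales", engine, chunksize=50_000):
    # Process each batch independently, e.g. aggregate and discard it.
    print(chunk.shape)

Note that with a default client-side cursor the database driver may still buffer the full result set in memory; chunksize only controls how pandas materializes it. That is where the server-side cursor, linked below, comes in.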

SQLAlchemy docs on stream results: https://docs.sqlalchemy.org/en/20/cor...
Pandas-dev GitHub PR for server side cursor: https://github.com/pandas-dev/pandas/...
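Tying the two links above together: SQLAlchemy's stream_results execution option requests a server-side cursor, so rows are streamed from the database rather than buffered on the client. A minimal sketch under the same hypothetical setup as the earlier snippets:

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# stream_results=True asks the driver for a server-side cursor
# (supported by drivers such as psycopg2), so rows arrive in
# batches from the server instead of being buffered client-side.
with engine.connect().execution_options(stream_results=True) as conn:
    for chunk in pd.read_sql(text("SELECT * FROM sales"), conn, chunksize=50_000):
        # Each chunk is fetched from the server as iteration proceeds.
        print(chunk.shape)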

#pandas #memorymanagement #batchprocessing

Subscribe to our channel:
haqnawaz

---------------------------------------------
Follow me on social media!

Github: https://github.com/hnawaz007
Instagram: bi_insights_inc
LinkedIn: haq-nawaz

---------------------------------------------

#ETL #Python #SQL

Topics covered in this video:
0:00 - Introduction to handling large data in pandas
0:19 - Recommendations for large datasets
0:58 - Why does the memory error occur?
1:26 - Pandas batching or a server-side cursor as a solution
1:49 - Simple example with Jupyter Notebook
3:04 - Method Two: Batch processing on the client
4:56 - Method Three: Batch processing on the server
6:19 - Pandas-dev PR for the server-side cursor
6:36 - Pandas batching overview and summary