Ana Chaloska - To One-Hot or Not: A guide to feature encoding and when to use what | PDAMS 2023

PyData
664 views - 9 months ago
Have you ever struggled with a multitude of columns created by One Hot Encoder? Or decided to look beyond it, but found it hard to decide which feature encoder would be a good replacement?
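
To make the column problem concrete, here is a minimal, library-free sketch of one-hot encoding. The `one_hot` helper and the toy data are hypothetical illustrations, not material from the talk; the point is only that the number of output columns equals the number of distinct categories.

```python
def one_hot(values):
    """Return (column_names, rows) for a one-hot encoding of `values`."""
    categories = sorted(set(values))
    # One 0/1 indicator per distinct category, for every input value.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical toy column with 4 distinct categories.
cities = ["Amsterdam", "Berlin", "Amsterdam", "Paris", "Skopje", "Berlin"]
columns, encoded = one_hot(cities)
print(len(columns))  # 4 columns for 4 distinct cities
print(encoded[0])    # [1, 0, 0, 0]

# A high-cardinality column (assumed example: 1000 distinct user ids)
# produces 1000 columns, which is the problem described above.
user_ids = [f"user_{i}" for i in range(1000)]
wide_columns, _ = one_hot(user_ids)
print(len(wide_columns))  # 1000
```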

Good news: many encoding techniques have been developed to address different types of categorical data. This talk provides an overview of the encoding methods available in data science and guidance on deciding which one is appropriate for the data at hand.

Join this talk if you would like to hear why feature encoding matters and why you should not default to One Hot Encoding in every scenario. The talk starts with commonly used approaches and progresses to more advanced and powerful techniques that can help extract meaningful information from the data.

After this talk, for each encoder presented, you will know:
- When to use it
- When NOT to use it
- Important considerations specific to the encoder
- A Python library that offers a built-in implementation of the encoder, for easy integration into feature engineering pipelines
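
As a rough illustration of what "integration into feature engineering pipelines" usually means, the sketch below follows the common fit/transform convention: learn a mapping from training data, then apply it to new data. The `FrequencyEncoder` class is a hypothetical, library-free example and not the API of any specific package; mapping unseen categories to 0.0 is an assumption made for this sketch.

```python
from collections import Counter


class FrequencyEncoder:
    """Hypothetical fit/transform encoder; not taken from a specific library."""

    def fit(self, values):
        # Learn the relative frequency of each category from training data.
        counts = Counter(values)
        total = len(values)
        self.freq_ = {cat: n / total for cat, n in counts.items()}
        return self

    def transform(self, values):
        # Assumption: categories unseen during fit are encoded as 0.0.
        return [self.freq_.get(v, 0.0) for v in values]


train = ["red", "blue", "red", "green"]
test = ["blue", "purple"]  # "purple" never appears in the training data

encoder = FrequencyEncoder().fit(train)
print(encoder.transform(train))  # [0.5, 0.25, 0.5, 0.25]
print(encoder.transform(test))   # [0.25, 0.0]
```

Keeping the learned mapping on the fitted object is what lets the same encoding be reused on validation and production data.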

I will explore different feature encoding approaches and provide guidance for decision-making. I will cover simpler methods like Label, One Hot, and Frequency encoding, progressing to powerful techniques like Target and Rare Label encoding. Finally, I will explain more complex approaches like Weight of Evidence, Hash, and CatBoost encoding. I will close the talk by summarizing the key takeaways.
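
Two of the techniques named above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: the helper names, the `min_count` threshold, the `"Rare"` label, and the additive smoothing formula are all choices made for this example, and library implementations may differ.

```python
from collections import Counter, defaultdict


def rare_label_encode(values, min_count=2, rare_label="Rare"):
    """Group categories seen fewer than `min_count` times under one label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else rare_label for v in values]


def target_encode(values, targets, smoothing=1.0):
    """Replace each category with a smoothed mean of the target.

    Assumed formula: (category_sum + smoothing * global_mean)
                     / (category_count + smoothing)
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), Counter()
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    mapping = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return [mapping[v] for v in values], mapping


# Hypothetical toy data: a colour column and a binary target.
colors = ["red", "red", "blue", "blue", "green", "teal"]
bought = [1, 1, 0, 1, 0, 0]

grouped = rare_label_encode(colors)
print(grouped)  # ['red', 'red', 'blue', 'blue', 'Rare', 'Rare']

encoded, mapping = target_encode(grouped, bought)
print({c: round(m, 3) for c, m in mapping.items()})
```

In practice the target mapping should be learned on training data only, since computing it on the rows being scored leaks the target into the features.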

Target Audience:
Data scientists and anyone interested in feature encoding

Previous experience with feature encoders can be useful but is not mandatory to follow the talk.

===

www.pydata.org

PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.

00:00 Welcome!
00:10 Help us add time stamps or captions to this video! See the description for details.

Want to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVi...
Published 9 months ago, on 1402/09/01 (Solar Hijri calendar).