NLP Demystified 5: Basic Bag-of-Words and Measuring Document Similarity

Future Mojo
Course playlist: Natural Language Processing Demystified

After preprocessing our text, we take our first step in turning text into numbers so our machines can start working with them. We'll explore:
- a simple "bag-of-words" (BoW) approach.
- how to use cosine similarity to measure document similarity.
- the shortcomings of this BoW approach.
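As a rough illustration of the first two points, here is a minimal sketch in plain Python. It is not taken from the course notebook: the toy documents, the whitespace tokenizer, and the helper names `bag_of_words` and `cosine_similarity` are assumptions made for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens.

    Splitting on whitespace is a stand-in for real preprocessing
    (tokenization, lemmatization, stop-word removal).
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    # Only tokens present in both documents contribute to the dot product.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]
vectors = [bag_of_words(doc) for doc in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])  # shared vocabulary
sim_02 = cosine_similarity(vectors[0], vectors[2])  # no shared vocabulary
print(f"doc 0 vs doc 1: {sim_01:.2f}")
print(f"doc 0 vs doc 2: {sim_02:.2f}")
```

The two cat sentences share several tokens and so score well above zero, while the finance sentence shares none and scores exactly zero.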

In the demo, we'll use a combination of spaCy and scikit-learn to build BoW representations and perform simple document similarity search.
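For a sense of what such a search might look like, here is a hedged sketch using scikit-learn's `CountVectorizer` and `cosine_similarity`. It is not the notebook's code: the corpus and query are made up, and the spaCy preprocessing step is left out, so the vectorizer's default tokenizer is used instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus; the demo works with its own data.
corpus = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]

# Learn the vocabulary and build the BoW matrix (documents x vocabulary).
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

# Project a query into the same vocabulary space.
# Tokens that were never seen during fitting are ignored.
query = "a cat on a mat"
query_vec = vectorizer.transform([query])

# Score the query against every document and pick the closest one.
scores = cosine_similarity(query_vec, bow)[0]
best = int(scores.argmax())
print(f"best match: {corpus[best]!r} (score {scores[best]:.2f})")
```

Note that `CountVectorizer` by default lowercases text and drops single-character tokens, so the "a" in the query does not contribute to the score.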

Colab notebook: https://colab.research.google.com/git...

Timestamps:
00:00:00 Basic bag-of-words (BoW)
00:00:22 The need for vectors
00:00:53 Selecting and extracting features from our data
00:04:04 Idea: similar documents share similar vocabulary
00:04:46 Turning a corpus into a BoW matrix
00:07:10 What vectorization helps us accomplish
00:08:20 Measuring document similarity
00:11:09 Shortcomings of basic BoW
00:12:37 Capturing a bit of context with n-grams
00:14:10 DEMO: creating basic BoW with scikit-learn and spaCy
00:17:47 DEMO: measuring document similarity
00:18:40 DEMO: creating n-grams with scikit-learn
00:19:35 Basic BoW recap
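
One shortcoming covered in the video is that basic BoW discards word order, and n-grams recover a little of it. The sketch below illustrates the idea in plain Python; the sentences and the helpers `ngrams` and `bow_with_ngrams` are assumptions made for this example, not code from the demo.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bow_with_ngrams(text, max_n=2):
    """Count all n-grams from length 1 up to max_n."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


a = "dog bites man"
b = "man bites dog"

# With single words only, the two sentences are indistinguishable.
unigrams_match = bow_with_ngrams(a, max_n=1) == bow_with_ngrams(b, max_n=1)

# Adding bigrams ("dog bites" vs "man bites") separates them.
bigrams_match = bow_with_ngrams(a, max_n=2) == bow_with_ngrams(b, max_n=2)

print(unigrams_match, bigrams_match)
```

The trade-off is vocabulary size: each extra n-gram length adds many new columns to the BoW matrix.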

This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.

Visit https://www.nlpdemystified.org/ to learn more.
Published 2 years ago, on 1401/02/07 (Solar Hijri calendar).
Viewed 9,704 times.