NLP Demystified 5: Basic Bag-of-Words and Measuring Document Similarity
Course playlist: Natural Language Processing Demystified
After preprocessing our text, we take our first step in turning text into numbers so our machines can start working with them. We'll explore:
- a simple "bag-of-words" (BoW) approach.
- how to use cosine similarity to measure document similarity.
- the shortcomings of this BoW approach.
In the demo, we'll use a combination of spaCy and scikit-learn to build BoW representations and perform simple document similarity search.
Colab notebook: https://colab.research.google.com/git...
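To make the bag-of-words and cosine similarity ideas above concrete, here is a minimal sketch in plain Python. It is not the notebook's code: the demo uses spaCy and scikit-learn, while this version uses only the standard library. The whitespace tokenizer, the helper names (`tokenize`, `build_vocabulary`, `bow_vector`, `cosine_similarity`), and the three-sentence corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: a naive lowercase, whitespace-based tokenizer.
    # A real pipeline would use a proper tokenizer (e.g. spaCy).
    return text.lower().split()


def build_vocabulary(corpus):
    # The vocabulary is the sorted set of all distinct tokens in the corpus.
    return sorted({token for doc in corpus for token in tokenize(doc)})


def bow_vector(text, vocabulary):
    # One count per vocabulary term; word order is discarded.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for empty documents.
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: two similar sentences and one unrelated one.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply today",
]

vocabulary = build_vocabulary(corpus)
bow_matrix = [bow_vector(doc, vocabulary) for doc in corpus]

print(vocabulary)
print(cosine_similarity(bow_matrix[0], bow_matrix[1]))  # shared vocabulary
print(cosine_similarity(bow_matrix[0], bow_matrix[2]))  # no shared vocabulary
```

The first two sentences share "the", "sat", and "on", so their vectors point in a similar direction, while the third shares no terms with the first and scores zero. In scikit-learn, `CountVectorizer` plays the role of `build_vocabulary` plus `bow_vector`.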
Timestamps:
00:00:00 Basic bag-of-words (BoW)
00:00:22 The need for vectors
00:00:53 Selecting and extracting features from our data
00:04:04 Idea: similar documents share similar vocabulary
00:04:46 Turning a corpus into a BoW matrix
00:07:10 What vectorization helps us accomplish
00:08:20 Measuring document similarity
00:11:09 Shortcomings of basic BoW
00:12:37 Capturing a bit of context with n-grams
00:14:10 DEMO: creating basic BoW with scikit-learn and spaCy
00:17:47 DEMO: measuring document similarity
00:18:40 DEMO: creating n-grams with scikit-learn
00:19:35 Basic BoW recap
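The n-gram segments listed in the timestamps address one shortcoming of basic BoW: word order is lost. The sketch below, again in plain Python rather than the scikit-learn code used in the demo, shows how adding bigrams lets two sentences with identical word counts be told apart. The function names and example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    # Slide a window of size n over the token list and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n_min=1, n_max=2):
    # Assumption: naive lowercase, whitespace-based tokenization.
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Same words, different order.
a_unigrams = ngram_counts("dog bites man", 1, 1)
b_unigrams = ngram_counts("man bites dog", 1, 1)
print(a_unigrams == b_unigrams)  # unigram counts cannot tell them apart

a_with_bigrams = ngram_counts("dog bites man", 1, 2)
b_with_bigrams = ngram_counts("man bites dog", 1, 2)
print(a_with_bigrams == b_with_bigrams)  # bigrams capture the difference
```

In scikit-learn, the `ngram_range` parameter of `CountVectorizer` serves the same purpose as `n_min` and `n_max` here. The trade-off is a larger, sparser vocabulary as `n` grows.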
This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.
Visit https://www.nlpdemystified.org/ to learn more.
Published 2 years ago, on 1401/02/07.
Viewed 9,704 times.