NLP Demystified 5: Basic Bag-of-Words and Measuring Document Similarity

Course playlist: Natural Language Processing Demystified

After preprocessing our text, we take our first step in turning text into numbers so our machines can start working with them. We'll explore:
a simple "bag-of-words" (BoW) approach.
how to use cosine similarity to measure document similarity.
the shortcomings of this BoW approach.
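
To make these ideas concrete, here is a minimal sketch in plain Python. It is not the code from the video: the helper names `bag_of_words` and `cosine_similarity`, the toy sentences, and the whitespace tokenizer (a stand-in for real preprocessing) are all assumptions made for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens (a stand-in for real preprocessing)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
]
vectors = [bag_of_words(d) for d in docs]

# Documents that share vocabulary score high; unrelated ones score zero.
sim_close = cosine_similarity(vectors[0], vectors[1])
sim_far = cosine_similarity(vectors[0], vectors[2])

# A shortcoming of basic BoW: word order is discarded, so these
# two sentences end up with identical representations.
same_bag = bag_of_words("dog bites man") == bag_of_words("man bites dog")
```

The last line illustrates one of the shortcomings: "dog bites man" and "man bites dog" mean different things but produce the same bag of words.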

In the demo, we'll use a combination of spaCy and scikit-learn to build BoW representations and perform simple document similarity search.

Colab notebook: https://colab.research.google.com/git...

Timestamps:
00:00:00 Basic bag-of-words (BoW)
00:00:22 The need for vectors
00:00:53 Selecting and extracting features from our data
00:04:04 Idea: similar documents share similar vocabulary
00:04:46 Turning a corpus into a BoW matrix
00:07:10 What vectorization helps us accomplish
00:08:20 Measuring document similarity
00:11:09 Shortcomings of basic BoW
00:12:37 Capturing a bit of context with n-grams
00:14:10 DEMO: creating basic BoW with scikit-learn and spaCy
00:17:47 DEMO: measuring document similarity
00:18:40 DEMO: creating n-grams with scikit-learn
00:19:35 Basic BoW recap

This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.

Visit https://www.nlpdemystified.org/ to learn more.
