Python Code for BERT Paragraph Vector Embedding w/ Transformers (PyTorch, Colab)

After BERT "Sentence Embedding" (   • How to code BERT Word + Sentence Vect...  ) ... NOW Part 2:
BERT "Paragraph Vector Embedding" in a high-dimensional vector space, where semantically similar sentences/paragraphs/texts end up close together!

Part 1 of this video is called:
How to code BERT Word + Sentence Vectors (Embedding) w/ Transformers?
and linked here:    • How to code BERT Word + Sentence Vect...  

Plus a simple PCA visualization using a BERT-base model, where three semantic clusters become visible in 2D PCA. But (hint): for a vector space with more than 1000 dimensions, I would recommend UMAP for dimensionality reduction and HDBSCAN for clustering, as sketched below.
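A hedged sketch of both steps follows. It reuses the paragraph vectors from the snippet above (in practice you would embed many more texts), projects them to 2D with scikit-learn's PCA for a quick plot, and then shows the UMAP + HDBSCAN alternative mentioned above. Library choices (scikit-learn, matplotlib, umap-learn, hdbscan) and parameters are assumptions:

# Sketch: 2D PCA plot, then UMAP + HDBSCAN as an alternative for high-dim spaces
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X = paragraph_vectors.numpy()          # (n_texts, 768) matrix from the sketch above

pca_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(pca_2d[:, 0], pca_2d[:, 1])
plt.title("BERT paragraph vectors, PCA to 2D")
plt.show()

# Nonlinear reduction + density-based clustering (parameters are illustrative)
import umap
import hdbscan

embedding_2d = umap.UMAP(n_components=2, metric="cosine").fit_transform(X)
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(embedding_2d)
print(labels)    # -1 marks points HDBSCAN treats as noise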

Great, informative sources and Colab notebooks (which I reference):
https://mccormickml.com/2019/07/22/BE...
https://jalammar.github.io/a-visual-g...
https://github.com/VincentK1991/BERT_...

Principal component analysis (PCA):
Linear dimensionality reduction using the Singular Value Decomposition of the data to project it onto a lower-dimensional space. The input data is centered, but not scaled, per feature before applying the SVD.
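As a small check of that description, the following sketch (my own, with made-up random data) centers the data without scaling, takes the SVD, and confirms that projecting onto the top right singular vectors matches scikit-learn's PCA up to sign:

# Sketch: PCA via SVD on centered (not scaled) data, compared against sklearn
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

Xc = X - X.mean(axis=0)                       # center only, no per-feature scaling
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
proj_svd = Xc @ Vt[:2].T                      # project onto first 2 principal directions

proj_sklearn = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(proj_svd), np.abs(proj_sklearn)))   # True (components match up to sign)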

Limitation of PCA:
PCA is used in exploratory data analysis and for building predictive models. It is commonly used for dimensionality reduction: each data point is projected onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variance as possible. The first principal component can equivalently be defined as the direction that maximizes the variance of the projected data.
PCA captures linear correlations between features but fails when that assumption is violated, and it is not optimized for class separability (see the example below).
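An illustrative (assumed) example of that limitation: two concentric circles are perfectly separable by radius, but PCA's linear 1D projection mixes the two classes, because the class structure is not aligned with any direction of maximum variance:

# Sketch: PCA cannot separate nonlinearly structured classes (concentric circles)
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
proj = PCA(n_components=1).fit_transform(X).ravel()

# The 1D projections of the two classes overlap heavily:
print("class 0 range:", proj[y == 0].min(), proj[y == 0].max())
print("class 1 range:", proj[y == 1].min(), proj[y == 1].max())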

#datascience
#machinelearningwithpython
#embedding
#vectorspace
