Embedding and Searching a PDF

LangChain
Published July 9, 2023

LangChain has a PyPDFLoader document loader that can load a PDF file. It requires the pypdf package to be installed (pip install pypdf).

Load a PDF file

from langchain.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader("attention_is_all_you_need.pdf")
pdf_pages = pdf_loader.load()
print(f'total pages: {len(pdf_pages)}')
total pages: 15

The PDF is loaded into a list of Document objects, one per page, each with two fields: page_content and metadata.

print(pdf_pages[0].page_content[0:600])
Attention Is All You Need
Ashish Vaswani
Google Brain
avaswani@google.com
Noam Shazeer
Google Brain
noam@google.com
Niki Parmar
Google Research
nikip@google.com
Jakob Uszkoreit
Google Research
usz@google.com
Llion Jones
Google Research
llion@google.com
Aidan N. Gomez
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser
Google Brain
lukaszkaiser@google.com
Illia Polosukhin
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also con
pdf_pages[0].metadata
{'source': 'attention_is_all_you_need.pdf', 'page': 0}

Split the document

LangChain recommends RecursiveCharacterTextSplitter for generic text: it tries the separators in order (paragraph breaks, then line breaks, sentence ends, and spaces) and recursively splits any piece that is still larger than chunk_size.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,     # target chunk size, in characters
    chunk_overlap=150,   # characters of overlap between consecutive chunks
    # try paragraph breaks first, then line breaks, sentence ends, then spaces
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)

docs = text_splitter.split_documents(pdf_pages)
print(f'total documents: {len(docs)}')
total documents: 37
print(docs[0].page_content)
Attention Is All You Need
Ashish Vaswani
Google Brain
avaswani@google.com
Noam Shazeer
Google Brain
noam@google.com
Niki Parmar
Google Research
nikip@google.com
Jakob Uszkoreit
Google Research
usz@google.com
Llion Jones
Google Research
llion@google.com
Aidan N. Gomez
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser
Google Brain
lukaszkaiser@google.com
Illia Polosukhin
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-
to-German translation task, improving over the existing best results, including
ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,
our model establishes a new single-model state-of-the-art BLEU score of 41.8 after
training for 3.5 days on eight GPUs, a small fraction of the training costs of the
best models from the literature. We show that the Transformer generalizes well to

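Embed and store in a vector database

Before the chunks can be searched, they need to be embedded and stored in a vector store (this is where the vectordb used below comes from). A minimal sketch of this step, assuming OpenAIEmbeddings and a local Chroma collection; any embedding model and vector store supported by LangChain can be substituted.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Requires the openai and chromadb packages and an OPENAI_API_KEY in the environment.
embedding = OpenAIEmbeddings()

# Embed every chunk and keep the collection on disk ("chroma_db" is an arbitrary path).
vectordb = Chroma.from_documents(
    documents=docs,
    embedding=embedding,
    persist_directory="chroma_db"
)
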
Similarity search with enforced diversity

Use max_marginal_relevance_search to get results that are both relevant to the query and diverse among themselves: it first fetches fetch_k chunks by similarity, then selects the k of them that best balance relevance with mutual dissimilarity.

q = 'scaled dot-product attention'

results = vectordb.max_marginal_relevance_search(q, k=4, fetch_k=6)
print(results[0].page_content)
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several
attention layers running in parallel.
3.2.1 Scaled Dot-Product Attention
We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of
queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the
query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the
values.
In practice, we compute the attention function on a set of queries simultaneously, packed together
into a matrix Q. The keys and values are also packed together into matrices K and V. We compute
the matrix of outputs as:
Attention(Q, K, V) = softmax(QK^T / √dk) V    (1)
The two most commonly used attention functions are additive attention [2], and dot-product (multi-
plicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor
of 1/√dk. Additive attention computes the compatibility function using a feed-forward network with
a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is
much faster and more space-efficient in practice, since it can be implemented using highly optimized
matrix multiplication code.
While for small values of dk the two mechanisms perform similarly, additive attention outperforms
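
For comparison, similarity_search returns the top-k closest chunks without the diversity re-ranking. The metadata attached by PyPDFLoader survives splitting and embedding, so each result can be traced back to its source page. A quick sketch:

plain_results = vectordb.similarity_search(q, k=4)
for doc in plain_results:
    # each chunk still carries the source file and page number from PyPDFLoader
    print(doc.metadata)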