What Are Vector Embeddings and How Are They Used in AI?

Vector embeddings are a critical component of machine learning and AI. They are numerical representations of data such as words, documents, images, video, and audio that let a computer compare and analyze that data mathematically. For text, the idea is to place words in a multi-dimensional space where words with similar meanings cluster together.

Let's take a deep dive into how vector embeddings work and how they can be utilized.


What Are Vector Embeddings?

Vector embeddings are mathematical representations of objects like words, phrases, or even whole documents in a continuous vector space. They capture the semantic meaning of these objects by assigning each one a position in that space, so that objects with similar meanings sit close together. Once text has been transformed into vectors, algorithms can operate on it mathematically, performing operations such as similarity calculation, clustering, or classification.
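To make "similarity calculation" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions and come from a trained model), and `cosine_similarity` is a hypothetical helper name, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, made up for this example only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

# Vectors pointing in similar directions score closer to 1.
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

With these toy values, "cat" scores higher against "dog" than against "car", which is the clustering behavior described above.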

Usage in AI Applications

  1. Natural Language Processing (NLP): In NLP, embeddings are used to capture the meaning of words or sentences, enabling tasks like sentiment analysis, text classification, and language translation.
  2. Information Retrieval: In search engines, embeddings allow the system to understand and match the user's query with relevant documents.
  3. Recommendation Systems: Embeddings help in understanding user preferences and content to provide personalized recommendations.
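As a rough illustration of the recommendation case, one simple approach (an assumption here, not the only way to do it) is to represent a user as the average of the embeddings of items they liked, then rank unseen items by how well they match that profile. The item names and two-dimensional vectors below are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical item embeddings; the axes loosely mean "action" and "romance".
item_vectors = {
    "space_battle": [0.9, 0.1],
    "heist_thriller": [0.8, 0.2],
    "wedding_comedy": [0.1, 0.9],
    "war_epic": [0.7, 0.3],
}

liked = ["space_battle", "heist_thriller"]

# Represent the user as the average of the items they liked.
user_profile = mean_vector([item_vectors[name] for name in liked])

# Score the remaining items against the profile and rank them.
candidates = [name for name in item_vectors if name not in liked]
ranked = sorted(
    candidates,
    key=lambda name: dot(user_profile, item_vectors[name]),
    reverse=True,
)
print(ranked)
```

Production systems typically learn user and item embeddings jointly, so treat the averaging step as a simplification.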

A Practical Example with Code

Below is a practical code walkthrough demonstrating the use of vector embeddings to analyze a given corpus (text about Japan) and answer a specific query about it. The code uses the sentence-transformers library to accomplish this task.

Loading the Model

from sentence_transformers import SentenceTransformer, util

# Load a pretrained sentence-embedding model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Preprocessing the Corpus

Here, the corpus (text about Japan) is turned into an array of sentences for further processing.

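# Set the corpus from the first page of Wikipedia
corpus = "Japan is an island country in East Asia..."

# Turn it into a list of sentences.
# Note: a naive split on '.' also breaks decimal numbers into fragments
# and leaves an empty string after the final period.
docs = corpus.split('.')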
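Splitting on every period is simple but fragile. The sketch below shows one possible alternative using only the standard library: split after sentence-ending punctuation only when whitespace follows, so decimals stay intact. `split_sentences` is a hypothetical helper, and `sample` is a short stand-in because the corpus text is elided in the walkthrough.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' only when whitespace follows the mark."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

# Stand-in text for illustration; not the full corpus.
sample = (
    "The Greater Tokyo Area has more than 37.4 million residents. "
    "Japan is an island country in East Asia."
)

naive = sample.split('.')
better = split_sentences(sample)

print(len(naive), naive)
print(len(better), better)
```

This still splits after abbreviations that are followed by a space, so a dedicated sentence tokenizer may be a better fit for real text.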
Encoding the Corpus

The following code snippet encodes the corpus into a set of vectors.

# Encode the documents as 384-dimensional vectors, one vector per sentence
corpus_vector = model.encode(docs)

Query Encoding

To find relevant information within the corpus, a query is encoded.

# Encode the question with the same model so it shares the corpus vector space
query = "How many islands are comprised of Japan?"
query_vector = model.encode(query)

Calculating Similarity

The cosine similarity between the corpus vectors and the query vector is calculated to find the relevant sentences.

# Calculate cosine similarity between the corpus of vectors and the query vector
scores = util.cos_sim(query_vector, corpus_vector)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Format each (doc, score) pair as a readable string
doc_score_strings = [f"Score: {score}, Document: {doc}" for doc, score in doc_score_pairs]

# Output passages & scores, one per line
for line in doc_score_strings:
    print(line)

Output

The results are sorted from most to least similar. The top-scoring sentence mentions the archipelago of 6852 islands, which is the answer to the query.

Score: 0.7428829073905945, Document:  Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu, Shikoku, Kyushu, and Okinawa
Score: 0.7245738506317139, Document: Japan is an island country in East Asia
Score: 0.7163315415382385, Document:  Japan is divided into 47 administrative prefectures and eight traditional regions
Score: 0.5539742708206177, Document:  Japan has been inhabited since the Upper Paleolithic period (30,000 BC), though the first written mention of the archipelago appears in a Chinese chronicle (the Book of Han) finished in the 2nd century AD
Score: 0.48450547456741333, Document:  Japan is the eleventh most populous country in the world, as well as one of the most densely populated and urbanized
Score: 0.46953386068344116, Document: [...]nd Taiwan in the south
Score: 0.405386745929718, Document:  Japan has the world's highest life expectancy, though it is experiencing a decline in population
Score: 0.40184539556503296, Document:  Under the 1947 constitution, Japan has maintained a unitary parliamentary constitutional monarchy with a bicameral legislature, the National Diet
Score: 0.3986291289329529, Document:  Although Japan has renounced its right to declare war, the country maintains Self-Defense Forces that rank as one of the world's strongest militaries
Score: 0.3984721899032593, Document:  The culture of Japan is well known around the world, including its art, cuisine, music, and popular culture, which encompasses prominent comic, animation and video game industries
Score: 0.38563060760498047, Document:  The Greater Tokyo Area is the most populous metropolitan area in the world, with more than 37
Score: 0.33849799633026123, Document:  Amidst a rise in militarism and overseas colonization, Japan invaded China in 1937 and entered World War II as an Axis power in 1941
Score: 0.32281097769737244, Document:  In 1854, a United States fleet forced Japan to open trade to the West, which led to the end of the shogunate and the restoration of imperial power in 1868
Score: 0.3130512237548828, Document:  After World War II, Japan experienced record growth in an economic miracle, becoming the second-largest economy in the world by 1972 but has stagnated since 1995 in what is referred to as the Lost Decades
Score: 0.29516148567199707, Document: 5 million on narrow coastal plains
Score: 0.28122079372406006, Document:  After a century-long period of civil war, the country was reunified in 1603 under the Tokugawa shogunate, which enacted an isolationist foreign policy
Score: 0.26617658138275146, Document:  Beginning in the 12th century, political power was held by a series of military dictators (shōgun) and feudal lords (daimyō) and enforced by a class of warrior nobility (samurai)
Score: 0.24469870328903198, Document:  It is a member of numerous international organizations, including the United Nations (since 1956), OECD, G20 and Group of Seven
Score: 0.24037012457847595, Document:  About three-fourths of the country's terrain is mountainous, concentrating its population of 125
Score: 0.23301862180233002, Document: 4 million residents
Score: 0.12032614648342133, Document:  Its economy is the world's third-largest by nominal GDP and the fourth-largest by PPP
Score: 0.06605695933103561, Document:  A global leader in the automotive, robotics and electronics industries, the country has made significant contributions to science and technology
Score: 0.0658482164144516, Document: 
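In practice you rarely want every sentence back, and the ranked output includes an empty string left over from the split. The following sketch keeps only the best few non-empty results. `top_k` and its `min_score` parameter are hypothetical names introduced here, and the sample pairs are a handful of entries taken from the output with scores rounded to four decimals.

```python
import heapq

def top_k(doc_score_pairs, k=3, min_score=0.0):
    """Return the k highest-scoring (doc, score) pairs, skipping blank docs."""
    cleaned = [
        (doc.strip(), score)
        for doc, score in doc_score_pairs
        if doc.strip() and score >= min_score
    ]
    return heapq.nlargest(k, cleaned, key=lambda pair: pair[1])

# A few (doc, score) pairs from the walkthrough output, scores rounded.
sample_pairs = [
    (" Japan is divided into 47 administrative prefectures and eight traditional regions", 0.7163),
    ("", 0.0658),
    ("Japan is an island country in East Asia", 0.7246),
    (" 4 million residents", 0.2330),
]

best = top_k(sample_pairs, k=2)
for doc, score in best:
    print(f"{score:.4f}  {doc}")
```

The same function can be called on the full `doc_score_pairs` list from the walkthrough; raising `min_score` drops weak matches entirely.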
