GPT Embeddings

Not Magic - Just Math


Barry S. Stahl

Solution Architect & Developer

@bsstahl@cognitiveinheritance.com

https://CognitiveInheritance.com


Favorite Physicists & Mathematicians

Favorite Physicists

  1. Harold "Hal" Stahl
  2. Carl Sagan
  3. Richard Feynman
  4. Marie Curie
  5. Nikola Tesla
  6. Albert Einstein
  7. Neil deGrasse Tyson
  8. Niels Bohr
  9. Galileo Galilei
  10. Michael Faraday

Other notables: Stephen Hawking, Edwin Hubble

Favorite Mathematicians

  1. Ada Lovelace
  2. Alan Turing
  3. Johannes Kepler
  4. René Descartes
  5. Isaac Newton
  6. Leonardo Fibonacci
  7. George Boole
  8. Blaise Pascal
  9. Carl Friedrich Gauss
  10. Grace Hopper

Other notables: Daphne Koller, Grady Booch, Evelyn Berezin

Some OSS Projects I Run

  1. Liquid Victor : Media tracking and aggregation [used to assemble this presentation]
  2. Prehensile Pony-Tail : A static site generator built in C#
  3. TestHelperExtensions : A set of extension methods helpful when building unit tests
  4. Conference Scheduler : A conference schedule optimizer
  5. IntentBot : A microservices framework for creating conversational bots on top of Bot Framework
  6. LiquidNun : Library of abstractions and implementations for loosely-coupled applications
  7. Toastmasters Agenda : A C# library and website for generating agendas for Toastmasters meetings
  8. ProtoBuf Data Mapper : A C# library for mapping and transforming ProtoBuf messages

http://GiveCamp.org

GiveCamp.png
  bss-100-achievement-unlocked-1024x250.png

The OpenAI API

  • Chat Completions
    • ChatGPT gets most of its power here
  • Embeddings
    • Enable additional features that can be used with Chat Completions
    • Especially useful operationally

Questions to be Answered

  • What are embeddings?
  • What do they represent?
  • How do we compare/contrast them?
  • How can we use them operationally?

Embeddings

  • A point in multi-dimensional space
  • Mathematical representation of a word or phrase
  • Encode both semantic and contextual information
  • Model: text-embedding-ada-002
  • Use 1536 dimensions
  • Are normalized to unit length
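Because ada-002 embeddings come back already normalized to unit length, cosine similarity reduces to a plain dot product. A minimal sketch of what unit-length normalization means, using a toy 2-D vector and no API call (a real 1536-D embedding would come from the embeddings endpoint):

```python
import math

def normalize(vec):
    """Scale a vector so its L2 norm (its length) is 1.0."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

raw = [3.0, 4.0]      # toy 2-D stand-in for a 1536-D embedding
unit = normalize(raw) # [0.6, 0.8]
norm = math.sqrt(sum(x * x for x in unit))  # ≈ 1.0
```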

3-D Space Projected into 2-D

Necker_cube_with_background.png
  Ram - Just Statements.png
  Ram - With Terms.png
  Ram - With Clusters.png

Cosine Similarity & Distance

Relate vectors based on the angle between them

  • Cosine Similarity ranges from -1 to 1, where:

    • +1 indicates that the vectors represent similar semantics & context
    • 0 indicates that the vectors are orthogonal (no similarity)
    • -1 indicates that the vectors have opposing semantics & context
  • Cosine Distance is defined as 1 - cosine similarity where:

    • 0 = Synonymous
    • 1 = Orthogonal
    • 2 = Antonymous
Cosine Unit Circle - Enhanced.jpg
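The two measures above can be sketched in a few lines of Python; the toy vectors below stand in for real embeddings and simply exercise the three landmark cases:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

# The three landmark cases from the list above:
synonymous = cosine_distance([1, 0], [1, 0])   # 0.0
orthogonal = cosine_distance([1, 0], [0, 1])   # 1.0
antonymous = cosine_distance([1, 0], [-1, 0])  # 2.0
```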

Cosine Distance

Cosine Distance 989x600.png

Cosine Distance

Angles2.svg

Clustering

  • Unsupervised machine learning technique
  • Clusters form around centroids (the geometric center of the cluster)
  • Data points are grouped (clustered) based on their similarity
    • Minimize the error (distance from centroid)
  • Embeddings cluster with others of similar semantic and contextual meaning
  • Advantages
    • No need to define a distance threshold
  • Disadvantages
    • Quality is use-case dependent
    • Requires the number of clusters to be specified
k-means results.png
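The assign-to-nearest-centroid / recompute-centroid loop described above can be sketched as a toy k-means in plain Python (a real system would more likely use a library such as scikit-learn; the sample points are fabricated):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: repeatedly assign each point to the nearest centroid,
    then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two well-separated blobs cluster into two groups of three.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```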

Embedding Distance

  • Synonym: "Happy" is closer to "Joyful" than to "Sad"
  • Language: "The Queen" is very close to "La Reina"
  • Idiom: "He kicked the bucket" is closer to "He died" than to "He kicked the ball"
  • Sarcasm: "Well, look who's on time" is closer to "Actually Late" than to "Actually Early"
  • Homonym: "Bark" (dog sound) is closer to "Howl" than to "Bark" (tree layer)
  • Collocation: "Fast food" is closer to "Junk food" than to "Fast car"
  • Proverb: "The early bird catches the worm" is closer to "Success comes to those who prepare well and put in effort" than to "A bird in the hand is worth two in the bush"
  • Metaphor: "Time is money" is closer to "Don't waste your time" than to "Time flies"
  • Simile: "He is as brave as a lion" is closer to "He is very courageous" than to "He is a lion"
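The first of these comparisons can be sketched with hand-made toy vectors; the inequality is the point, and the numbers are fabricated for illustration (real values would come from the embeddings endpoint):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

# Hand-made 3-D stand-ins for real 1536-D embeddings (illustration only).
happy  = [0.9, 0.4, 0.1]
joyful = [0.8, 0.5, 0.2]
sad    = [-0.7, 0.3, 0.6]

d_joyful = cosine_distance(happy, joyful)  # small: similar meaning
d_sad    = cosine_distance(happy, sad)     # large: opposing meaning
```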

Operational Architecture

Operational Embeddings-Start.png

Operational Architecture

Operational Embeddings-Full.png

Vector Databases

  • Designed to store/retrieve high-dimensional vectors
  • Values are retrieved using similarity searches
  • Leverage data structures such as k-d trees
  • Examples
    • Azure Cognitive Search
    • Redis
    • Qdrant
    • Pinecone
    • Chroma
VectorDB-650x650.png
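At its core, the similarity search is a nearest-neighbor query; a vector database replaces this brute-force linear scan with an index structure. A sketch using hypothetical documents and toy 3-D vectors:

```python
import math

def nearest(query_vec, index, top_k=2):
    """Brute-force similarity search; a vector DB accelerates exactly this
    lookup with index structures such as k-d trees."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    return sorted(index, key=lambda doc: cos(query_vec, doc[1]), reverse=True)[:top_k]

# Hypothetical documents with toy vectors (real ones would be 1536-D).
index = [
    ("dogs are loyal pets",   [0.9, 0.1, 0.0]),
    ("cats are independent",  [0.7, 0.6, 0.1]),
    ("the stock market fell", [0.0, 0.2, 0.9]),
]
hits = nearest([0.8, 0.2, 0.1], index)
```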

Classification

Grouping data into categories based on features of each item

  • Can be used for:
    • Identifying which known group a new item belongs to
    • Grouping items with shared properties together (clustering)
    • Normalization of input/output
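One common embedding-based approach to the first use is nearest-centroid classification: a new item gets the label of the category centroid it is most similar to. A sketch with hypothetical categories and toy 2-D vectors:

```python
import math

def classify(embedding, category_centroids):
    """Return the category whose centroid is most similar to the embedding."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    return max(category_centroids, key=lambda name: cos(embedding, category_centroids[name]))

# Hypothetical centroids computed earlier from labeled examples.
centroids = {"billing": [0.9, 0.1], "support": [0.1, 0.9]}
label = classify([0.8, 0.3], centroids)  # "billing"
```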

Polarity Detection

Determines if the input is an affirmative or negative response to a question

  • "I'm a canine lover" is an affirmative response to "Are dogs your favorite pets?"
  • "Nobody does" is a negative response to "Do you like JavaScript?"
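Polarity detection can be sketched by comparing a response's embedding against "yes" and "no" anchor embeddings; the vectors below are toy values, and real ones would come from the API:

```python
import math

def polarity(response_vec, yes_vec, no_vec):
    """Label a response by whichever anchor ('yes' or 'no') it is more similar to."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    return "affirmative" if cos(response_vec, yes_vec) > cos(response_vec, no_vec) else "negative"

# Toy vectors: "I'm a canine lover" leans toward the 'yes' anchor.
result = polarity([0.9, 0.2], [1.0, 0.0], [0.0, 1.0])  # "affirmative"
```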

Sentiment Analysis

Determines the emotional tone of a response

  • "I love speaking at great conferences like this" => Enthusiasm
  • "I had to miss so many great conferences due to covid" => Regret

Retrieval Augmented Generation (RAG)

  • Combines the benefits of retrieval-based and generative models
    • Identify and retrieve relevant information
    • Augment the context of the generative model
    • Generate responses based on the augmented context
  • Potential uses include
    • Explore large sets of documentation conversationally
    • Generate recommendations and insights based on retrieved relevant information
    • Summarization of articles in light of known relevant information
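The retrieve-then-augment flow can be sketched as follows; the corpus, toy vectors, and prompt template are all illustrative assumptions, and a real system would pass the resulting prompt to a chat-completion call:

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def build_rag_prompt(question, question_vec, corpus, top_k=1):
    """Retrieve the most similar documents and prepend them as context."""
    ranked = sorted(corpus, key=lambda doc: cos(question_vec, doc[1]), reverse=True)
    context = "\n".join(text for text, _ in ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical corpus with toy 2-D vectors.
corpus = [
    ("Beary answers questions about Barry's talks.", [0.9, 0.1]),
    ("K-D trees partition space for fast lookups.",  [0.1, 0.9]),
]
prompt = build_rag_prompt("Who is Beary?", [0.8, 0.2], corpus)
```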

Beary - The Beary Barry Bot

Beary_600x600.png

Beary Flow

Beary Demo - Flowchart - Horizontal Flow.png

Beary Embeddings Json Snippet

Beary Embeddings Json Snippet.png

Using LLM Output Has Dangers

herebedragons.jpg

Model Answers May Be

  • Incomplete
  • Poorly phrased
  • Outright wrong
No Takesies backsies.png

The model is biased

  • Not handling the bias makes bias a feature of your app
  • Prevent all predictable biases
  • Watch for unpredictable biases
bias logo - large.jpg

Embeddings are Reversible

  • Researchers have had success in reversing embeddings
    • Using distance measurements against a large Vector DB
    • Using models trained to predict the text from the embedding
  • Embeddings can be thought of like a hash
    • Data is obscured, but not encrypted
  • Do not expect embeddings alone to protect PII
    • Encrypt or tokenize all PII before embedding
Simpleicons_Interface_unlocked-padlock - Red 600x600.png
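One way to tokenize PII before embedding is a reversible substitution whose mapping stays in a local vault and never leaves your system. This sketch handles only e-mail addresses, and the token format is an assumption:

```python
import re

def tokenize_pii(text, vault):
    """Replace e-mail addresses with opaque tokens before the text is embedded.
    The vault maps each token back to the original value for later de-tokenization."""
    def repl(match):
        token = f"<PII_{len(vault)}>"
        vault[token] = match.group(0)
        return token
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", repl, text)

vault = {}
safe = tokenize_pii("Contact barry@example.com for details", vault)
# safe == "Contact <PII_0> for details"; only `safe` is sent for embedding
```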

When Should AI be Used?

  • When all possible biases have been accounted for
  • When all sensitive data has been removed, tokenized or encrypted
  • When the stochastic nature of responses has been accounted for
    • A wrong answer is no worse than no answer
    • Outputs have been fully constrained
    • A human is in-the-loop to fix the inevitable errors

What Are Embeddings?

  • Arrays of 1536 floating-point values
  • Structured numeric data that represents unstructured text
  • Representations of the semantics and context of the source text
  • Vectors that support standard mathematical operations

Resources

IntroToEmbeddings_QR.png