GPT Embeddings

Not Magic - Just Math


Barry S. Stahl

Solution Architect & Developer

@bsstahl@cognitiveinheritance.com

https://CognitiveInheritance.com


Favorite Physicists & Mathematicians

Favorite Physicists

  1. Harold "Hal" Stahl
  2. Carl Sagan
  3. Richard Feynman
  4. Marie Curie
  5. Nikola Tesla
  6. Albert Einstein
  7. Neil deGrasse Tyson
  8. Niels Bohr
  9. Galileo Galilei
  10. Michael Faraday

Other notables: Stephen Hawking, Edwin Hubble

Favorite Mathematicians

  1. Ada Lovelace
  2. Alan Turing
  3. Johannes Kepler
  4. René Descartes
  5. Isaac Newton
  6. Leonardo Fibonacci
  7. George Boole
  8. Blaise Pascal
  9. Carl Friedrich Gauss
  10. Grace Hopper

Other notables: Daphne Koller, Grady Booch, Evelyn Berezin

Some OSS Projects I Run

  1. Liquid Victor : Media tracking and aggregation [used to assemble this presentation]
  2. Prehensile Pony-Tail : A static site generator built in C#
  3. TestHelperExtensions : A set of extension methods helpful when building unit tests
  4. Conference Scheduler : A conference schedule optimizer
  5. IntentBot : A microservices framework for creating conversational bots on top of Bot Framework
  6. LiquidNun : Library of abstractions and implementations for loosely-coupled applications
  7. Toastmasters Agenda : A C# library and website for generating agendas for Toastmasters meetings
  8. ProtoBuf Data Mapper : A C# library for mapping and transforming ProtoBuf messages

http://GiveCamp.org

GiveCamp.png
  bss-100-achievement-unlocked-1024x250.png

The OpenAI API

  • Chat Completions
    • ChatGPT gets most of its power here
  • Embeddings
    • Enable additional features that can be used with Chat Completions
    • Especially useful operationally

Questions to be Answered

  • What are embeddings?
  • What do they represent?
  • How do we compare/contrast them?
  • How can we use them operationally?

Embeddings

  • A point in multi-dimensional space
  • Mathematical representation of a word or phrase
  • Encode both semantic and contextual information
  • Model: text-embedding-ada-002
  • Use 1536 dimensions
  • Are normalized to unit length
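Because ada-002 embeddings come back already normalized to unit length, cosine similarity reduces to a plain dot product. A minimal sketch of what unit-length normalization means, using a toy 2-D vector and no API call (a real 1536-D embedding would come from the embeddings endpoint):

```python
import math

def normalize(vec):
    """Scale a vector so its L2 norm (its length) is 1.0."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

raw = [3.0, 4.0]      # toy 2-D stand-in for a 1536-D embedding
unit = normalize(raw) # [0.6, 0.8]
norm = math.sqrt(sum(x * x for x in unit))  # ≈ 1.0
```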

3-D Space Projected into 2-D

Necker_cube_with_background.png
  Ram - Just Statements.png
  Ram - With Terms.png
  Ram - With Clusters.png

Cosine Similarity & Distance

Relate vectors based on the angle between them

  • Cosine Similarity ranges from -1 to 1, where:

    • +1 indicates that the vectors represent similar semantics & context
    • 0 indicates that the vectors are orthogonal (no similarity)
    • -1 indicates that the vectors have opposing semantics & context
  • Cosine Distance is defined as 1 - cosine similarity where:

    • 0 = Synonymous
    • 1 = Orthogonal
    • 2 = Antonymous
Cosine Unit Circle - Enhanced.jpg
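The two measures above can be sketched in a few lines of Python; the toy vectors below stand in for real embeddings and simply exercise the three landmark cases:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

# The three landmark cases from the list above:
synonymous = cosine_distance([1, 0], [1, 0])   # 0.0
orthogonal = cosine_distance([1, 0], [0, 1])   # 1.0
antonymous = cosine_distance([1, 0], [-1, 0])  # 2.0
```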

Cosine Distance

Cosine Distance 989x600.png

Cosine Distance

Angles2.svg

Clustering

  • Unsupervised machine learning technique
  • Clusters form around centroids (the geometric center of the cluster)
  • Data points are grouped (clustered) based on their similarity
    • Minimize the error (distance from centroid)
  • Embeddings cluster with others of similar semantic and contextual meaning
  • Advantages
    • No need to define a distance threshold
  • Disadvantages
    • Quality is use-case dependent
    • Requires the number of clusters to be specified
k-means results.png
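The assign-to-nearest-centroid / recompute-centroid loop described above can be sketched as a toy k-means in plain Python (a real system would more likely use a library such as scikit-learn; the sample points are fabricated):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: repeatedly assign each point to the nearest centroid,
    then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two well-separated blobs cluster into two groups of three.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```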

Embedding Distance

  • Synonym: "Happy" is closer to "Joyful" than to "Sad"
  • Language: "The Queen" is very close to "La Reina"
  • Idiom: "He kicked the bucket" is closer to "He died" than to "He kicked the ball"
  • Sarcasm: "Well, look who's on time" is closer to "Actually Late" than to "Actually Early"
  • Homonym: "Bark" (dog sound) is closer to "Howl" than to "Bark" (tree layer)
  • Collocation: "Fast food" is closer to "Junk food" than to "Fast car"
  • Proverb: "The early bird catches the worm" is closer to "Success comes to those who prepare well and put in effort" than to "A bird in the hand is worth two in the bush"
  • Metaphor: "Time is money" is closer to "Don't waste your time" than to "Time flies"
  • Simile: "He is as brave as a lion" is closer to "He is very courageous" than to "He is a lion"
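The first of these comparisons can be sketched with hand-made toy vectors; the inequality is the point, and the numbers are fabricated for illustration (real values would come from the embeddings endpoint):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

# Hand-made 3-D stand-ins for real 1536-D embeddings (illustration only).
happy  = [0.9, 0.4, 0.1]
joyful = [0.8, 0.5, 0.2]
sad    = [-0.7, 0.3, 0.6]

d_joyful = cosine_distance(happy, joyful)  # small: similar meaning
d_sad    = cosine_distance(happy, sad)     # large: opposing meaning
```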

Operational Architecture

Operational Embeddings-Start.png

Operational Architecture

Operational Embeddings-Full.png

Vector Databases

  • Designed to store/retrieve high-dimensional vectors
  • Values are retrieved using similarity searches
  • Leverage data structures such as k-d trees
  • Examples
    • Azure Cognitive Search
    • Redis
    • Qdrant
    • Pinecone
    • Chroma
VectorDB-650x650.png
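At its core, the similarity search is a nearest-neighbor query; a vector database replaces this brute-force linear scan with an index structure. A sketch using hypothetical documents and toy 3-D vectors:

```python
import math

def nearest(query_vec, index, top_k=2):
    """Brute-force similarity search; a vector DB accelerates exactly this
    lookup with index structures such as k-d trees."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    return sorted(index, key=lambda doc: cos(query_vec, doc[1]), reverse=True)[:top_k]

# Hypothetical documents with toy vectors (real ones would be 1536-D).
index = [
    ("dogs are loyal pets",   [0.9, 0.1, 0.0]),
    ("cats are independent",  [0.7, 0.6, 0.1]),
    ("the stock market fell", [0.0, 0.2, 0.9]),
]
hits = nearest([0.8, 0.2, 0.1], index)
```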

Classification

Grouping data into categories based on features of each item

  • Can be used for:
    • Identifying which known group a new item belongs to
    • Grouping items with shared properties together (clustering)
    • Normalization of input/output
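One common embedding-based approach to the first use is nearest-centroid classification: a new item gets the label of the category centroid it is most similar to. A sketch with hypothetical categories and toy 2-D vectors:

```python
import math

def classify(embedding, category_centroids):
    """Return the category whose centroid is most similar to the embedding."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    return max(category_centroids, key=lambda name: cos(embedding, category_centroids[name]))

# Hypothetical centroids computed earlier from labeled examples.
centroids = {"billing": [0.9, 0.1], "support": [0.1, 0.9]}
label = classify([0.8, 0.3], centroids)  # "billing"
```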

Polarity Detection

Determines if the input is an affirmative or negative response to a question

  • "I'm a canine lover" is an affirmative response to "Are dogs your favorite pets?"
  • "Nobody does" is a negative response to "Do you like JavaScript?"
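Polarity detection can be sketched by comparing a response's embedding against "yes" and "no" anchor embeddings; the vectors below are toy values, and real ones would come from the API:

```python
import math

def polarity(response_vec, yes_vec, no_vec):
    """Label a response by whichever anchor ('yes' or 'no') it is more similar to."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    return "affirmative" if cos(response_vec, yes_vec) > cos(response_vec, no_vec) else "negative"

# Toy vectors: "I'm a canine lover" leans toward the 'yes' anchor.
result = polarity([0.9, 0.2], [1.0, 0.0], [0.0, 1.0])  # "affirmative"
```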

Sentiment Analysis

Determines the emotional tone of a response

  • "I love speaking at great conferences like this" => Enthusiasm
  • "I had to miss so many great conferences due to covid" => Regret

Retrieval Augmented Generation (RAG)

  • Combines the benefits of retrieval-based and generative models
    • Identify and retrieve relevant information
    • Augment the context of the generative model
    • Generate responses based on the augmented context
  • Potential uses include
    • Explore large sets of documentation conversationally
    • Generate recommendations and insights based on retrieved relevant information
    • Summarization of articles in light of known relevant information
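The retrieve-then-augment flow can be sketched as follows; the corpus, toy vectors, and prompt template are all illustrative assumptions, and a real system would pass the resulting prompt to a chat-completion call:

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def build_rag_prompt(question, question_vec, corpus, top_k=1):
    """Retrieve the most similar documents and prepend them as context."""
    ranked = sorted(corpus, key=lambda doc: cos(question_vec, doc[1]), reverse=True)
    context = "\n".join(text for text, _ in ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical corpus with toy 2-D vectors.
corpus = [
    ("Beary answers questions about Barry's talks.", [0.9, 0.1]),
    ("K-D trees partition space for fast lookups.",  [0.1, 0.9]),
]
prompt = build_rag_prompt("Who is Beary?", [0.8, 0.2], corpus)
```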

Beary - The Beary Barry Bot

Beary_600x600.png

Beary Flow

Beary Demo - Flowchart - Horizontal Flow.png

Beary Embeddings Json Snippet

Beary Embeddings Json Snippet.png

Using LLM Output Has Dangers

herebedragons.jpg

Model Answers May Be

  • Incomplete
  • Poorly phrased
  • Outright wrong
No Takesies backsies.png

The model is biased

  • Not handling the bias makes bias a feature of your app
  • Prevent all predictable biases
  • Watch for unpredictable biases
bias logo - large.jpg

Embeddings are Reversible

  • Researchers have had success in reversing embeddings
    • Using distance measurements against a large Vector DB
    • Using models trained to predict the text from the embedding
  • Embeddings can be thought of like a hash
    • Data is obscured, but not encrypted
  • Do not expect embeddings alone to protect PII
    • Encrypt or tokenize all PII before embedding
Simpleicons_Interface_unlocked-padlock - Red 600x600.png
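One way to tokenize PII before embedding is a reversible substitution whose mapping stays in a local vault and never leaves your system. This sketch handles only e-mail addresses, and the token format is an assumption:

```python
import re

def tokenize_pii(text, vault):
    """Replace e-mail addresses with opaque tokens before the text is embedded.
    The vault maps each token back to the original value for later de-tokenization."""
    def repl(match):
        token = f"<PII_{len(vault)}>"
        vault[token] = match.group(0)
        return token
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", repl, text)

vault = {}
safe = tokenize_pii("Contact barry@example.com for details", vault)
# safe == "Contact <PII_0> for details"; only `safe` is sent for embedding
```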

When Should AI be Used?

  • When all possible biases have been accounted for
  • When all sensitive data has been removed, tokenized or encrypted
  • When the stochastic nature of responses has been accounted for
    • A wrong answer is no worse than no answer
    • Outputs have been fully constrained
    • A human is in-the-loop to fix the inevitable errors

What Are Embeddings?

  • Arrays of 1536 floating-point values
  • Structured numeric data that represents unstructured text
  • Representations of the semantics and context of the source text
  • Vectors that support standard mathematical operations

Resources

IntroToEmbeddings_QR.png