Best Posts by Damien Benveniste, PhD on LinkedIn

Deep Learning requires much more of an ARCHITECT mindset than traditional Machine Learning. In a sense, part of the feature engineering work has moved into the design of very specialized computational blocks built from smaller units (LSTM, convolutional, embedding, fully connected, …). When architecting a model, I always advise starting with a simple network so that you can build your intuition; jumping right away into a Transformer model may not be the best way to start.

Most of the advancements in Machine Learning over the past 10 years have come from smart rearrangements of the simple units presented here. Obviously, I am omitting activation functions and a few other components, but you get the idea.

A convolution layer is meant to learn local correlations. Multiple successive blocks of convolution and pooling layers let you learn correlations at multiple scales, and they can be used on image data (conv2d), text data (text is just a time series of categorical variables), or time series (conv1d). You can encode text data with an embedding layer followed by a couple of conv1d layers, and you can encode a time series with a stack of conv1d and pooling layers.
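To make this concrete, here is a minimal sketch of such an embedding + conv1d text encoder in PyTorch; the vocabulary size, dimensions, and kernel sizes are arbitrary choices for illustration:

"""
import torch
import torch.nn as nn

class ConvTextEncoder(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv1 = nn.Conv1d(embed_dim, hidden_dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)    # collapse the time dimension

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)           # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                   # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        return self.pool(x).squeeze(-1)         # (batch, hidden_dim) text representation

encoder = ConvTextEncoder()
tokens = torch.randint(0, 10_000, (4, 50))      # a batch of 4 sequences of 50 token IDs
print(encoder(tokens).shape)                    # torch.Size([4, 256])
"""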

I advise against using LSTM layers when possible. Their iterative computation does not allow for good parallelism, which leads to very slow training (even with the CUDA LSTM). For text and time series, ConvNets are much faster to train because they exploit parallel matrix computation, and they tend to perform on par with LSTM networks (https://lnkd.in/g-6Z6qCN). One reason Transformers became the leading building block for text learning tasks is their superior parallelism compared to LSTMs, which makes realistically much bigger training data sets feasible.

Here are a few dates to understand the DL timeline:
- (1989) Convolution layer and average pooling: https://lnkd.in/gtv_Q7iv
- (1997) LSTM layer: https://lnkd.in/gCWJjxJv
- (2003) Embedding layer: https://lnkd.in/g3iCBQNf
- (2007) Max Pooling: https://lnkd.in/ge9KKCME
- (2012) Feature dropout: https://lnkd.in/g49Sp6HE
- (2012) Transfer learning: https://lnkd.in/g9yWA86k
- (2013) Word2Vec Embedding: https://lnkd.in/gC62AchR 
- (2013) Maxout network: https://lnkd.in/gC_KvJjT
- (2014) GRU layer: https://lnkd.in/g-rRQ6km
- (2014) Dropout layers: https://lnkd.in/gkHUqYDE
- (2014) GloVe Embedding: https://lnkd.in/gA8bnnX2
- (2015) Batch normalization: https://lnkd.in/gmptQTXY
- (2016) Layer normalization: https://lnkd.in/gTad4iHE
- (2016) Instance Normalization: https://lnkd.in/g7SA_Z3q
- (2017) Self Attention layer and transformers: https://lnkd.in/gUts7Sjq
- (2018) Group Normalization: https://lnkd.in/gMv7KehG

----
Don't forget to subscribe to my ML newsletter: TheAiEdge.io
#machinelearning #datascience #artificialintelligence
[Post image]
It took me some time to warm up to the idea that Prompt Engineering is actually worth digging into! Not only did I realize that it is not a trivial field of research, but it is also critical if we want to build applications using LLMs. We are surely barely getting started with our understanding of the subject, so it is time to jump into it!

Typically, people ask the LLM direct questions. That is called zero-shot prompting. If you provide a few examples in the prompt, it is called few-shot prompting:

"""
Example question
Example answer

Question
What is the answer?
"""

In a few-shot prompt, you can get better results if you showcase the reasoning that leads to a specific answer. This is called "Chain of Thought". You can induce a similar behavior in zero-shot mode by prompting the model to "think step by step", which is referred to as "Inception". Chain of Thought can take the form of intermediate questions and answers (e.g. "Do I need more information to solve the problem? -> Yes I do"); this is called "Self-ask". You can induce targeted answers by referring to concepts or analogies (e.g. "Imagine you are a physics professor. Answer this question:"); this is called "Memetic Proxy". If you rerun the same query multiple times, you will get different answers, so choosing a consistent answer over multiple queries increases the quality of the answers; this is called "Self-consistency". If you exchange multiple messages with the LLM, it is a good idea to use a memory pattern so that it can refer to previous interactions.
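For illustration, a few-shot Chain of Thought prompt could look like this (the worked examples are made up for the sketch):

"""
Q: A store has 12 apples and sells 5. How many are left?
A: The store starts with 12 apples and sells 5, so 12 - 5 = 7. The answer is 7.

Q: A train travels at 60 km/h for 2 hours. How far does it go?
A:
"""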

It becomes interesting when LLMs are given access to tools or databases. Based on the question, they can decide to use a tool. We can provide examples of question-action pairs (e.g. "What is the age of the universe?" -> [search on Wikipedia]); this is called "Act". We get better results with "ReAct" when we induce intermediate thoughts from the LLM (e.g. "What is the age of the universe?" -> "I need to find more information on the universe" -> [search on Wikipedia]).
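Here is a rough Python sketch of what such a ReAct loop can look like; call_llm, search_wikipedia, and REACT_PROMPT are hypothetical placeholders, not a specific library's API:

"""
# Minimal ReAct-style loop (sketch): the model alternates Thought/Action steps,
# we execute the action, append the Observation, and repeat until it answers.
def react_answer(question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(REACT_PROMPT + transcript)   # emits "Thought: ..." then "Action: ..." or "Answer: ..."
        transcript += step + "\n"
        if "Answer:" in step:
            return step.split("Answer:")[-1].strip()
        if "Action: search[" in step:
            query = step.split("Action: search[")[-1].split("]")[0]
            transcript += f"Observation: {search_wikipedia(query)}\n"
    return transcript                                # fall back to the raw transcript if no answer emerged
"""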

You can chain prompts! For example, you can prompt the model to use a tool, extract the information returned by the tool API call in a following prompt, and use the result in the next prompt to answer the initial question. You can solve complex problems by inducing a plan of action, where each step of the plan generates its own chain of actions using ReAct. For example, AutoGPT uses the "Plan and execute" pattern on autopilot to solve complex problems. I guarantee that "Plan and execute" will be the source of many new startups in the coming years!

LLMs can be thought of as some sort of flexible subroutines taking inputs and producing outputs. Prompts are the molds that shape the subroutines to solve a specific problem.

----
Receive 50 ML lessons (100 pages) when subscribing to our newsletter: TheAiEdge.io
Want to become an ML engineer? Join our next Masterclass: MasterClass.TheAiEdge.io
#machinelearning #datascience #artificialintelligence
[Post image]
You should learn to DEPLOY your Machine Learning models! How you deploy is dictated by business requirements: you should not start any ML development before you know how you are going to deploy the resulting model. There are 4 main ways to deploy ML models:

- Batch deployment - The predictions are computed at a defined frequency (for example, daily), and the resulting predictions are stored in a database where they can easily be retrieved when needed. However, we cannot use more recent data, and the predictions can quickly become outdated. Look at this article on how Airbnb progressively moved from batch to real-time deployments: https://lnkd.in/gCeZWFrR

- Real-time deployment - The "real-time" label describes the synchronous process where a user requests a prediction and the request is pushed through HTTP API calls to a backend service, which in turn pushes it to an ML service. It is great if you need personalized predictions that use recent contextual information, such as the time of day or the user's recent searches. The problem is that, until the user receives the prediction, the backend and ML services are stuck waiting for it to come back. To handle additional parallel requests from other users, you need to rely on multi-threaded processes and horizontal scaling by adding servers. Here are simple tutorials on real-time deployments with Flask and Django: https://lnkd.in/gtUjMqaK, https://lnkd.in/g4eGEyFr (a minimal Flask sketch follows this list).

- Streaming deployment - This allows for a more asynchronous process: an event can trigger the start of the inference process. For example, as soon as you land on the Facebook page, the ads-ranking process can be triggered, and by the time you scroll, the ad is ready to be presented. The request is queued in a message broker such as Kafka, and the ML model handles it when it is ready to do so. This frees up the backend service and saves a lot of computation power through efficient queueing. The resulting predictions can be queued as well and consumed by backend services when needed. Here is a Kafka tutorial: https://lnkd.in/g9qUTv9X (see the streaming sketch after this list).

- Edge deployment - The model is deployed directly on the client, such as a web browser, a mobile phone, or an IoT device. This results in the fastest inference, and the model can also predict offline (disconnected from the internet), but models usually need to be pretty small to fit on the smaller hardware. For example, here is a tutorial to deploy YOLO on iOS: https://lnkd.in/gUE8id5J
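
Here is the minimal real-time sketch referenced above: a bare-bones Flask endpoint wrapping a model. The pickled scikit-learn model and the JSON field names are assumptions for illustration.

"""
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:       # hypothetical serialized scikit-learn model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]        # e.g. {"features": [5.1, 3.5, 1.4, 0.2]}
    prediction = model.predict([features])[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
"""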
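And here is the streaming sketch referenced above, using kafka-python: a worker consumes inference requests from one topic and publishes predictions to another. Topic names, message fields, and the model object are assumptions for illustration.

"""
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "inference-requests",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

for message in consumer:                                   # blocks until a request arrives
    request = message.value
    prediction = model.predict([request["features"]])[0]   # "model" loaded as in the Flask sketch
    producer.send("inference-results", {"id": request["id"], "prediction": float(prediction)})
"""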

----
Receive 50 ML lessons (100 pages) when subscribing to our newsletter: TheAiEdge.io
#machinelearning #datascience #artificialintelligence
[Post image]
"Machine Learning is JUST statistics!" Sure! But before you start congratulating yourself for rehashing the same dogma over and over, can you answer the following questions?

- Why is finding a set of weights for a Neural Network, such that the network produces the correct output for all the training examples, an NP-hard problem? https://lnkd.in/eW2qeEZK
- Why is the Feature Selection problem NP-complete? https://lnkd.in/eYh7bU6U
- Why is the Hyperparameter Optimization problem NP-complete? https://lnkd.in/e_Rwr2JW
- How would you implement Logistic Regression in a distributed manner? https://lnkd.in/ecEv776k, https://lnkd.in/eUd7hX_J
- What are the pros and cons of an Iteratively Reweighted Least Squares implementation over a Gradient Descent implementation for Logistic Regression? https://lnkd.in/eFWZCWnU
- How do you efficiently design a parallelized implementation of a Gradient Boosting Algorithm? https://lnkd.in/egsShBmr
- What are the trade-offs of building the trees in a breadth-first-search (BFS) manner vs a depth-first-search (DFS) manner for a Random Forest algorithm? https://lnkd.in/e3DU4-JJ
- How do you modify the breadth-first-search algorithm to build efficient KD-trees for K-nearest neighbors? https://lnkd.in/eJ7nEvkB, https://lnkd.in/e5pF9syy
- Why are the algorithms that parallelize on GPUs slightly different from the ones that parallelize on CPUs? https://lnkd.in/eY-_8Wz5
- What is the effect of precision (e.g. float16 vs float32) in training Neural Networks? https://lnkd.in/e5-2ADAd, https://lnkd.in/eZCicQ-z
- How do you implement Logistic Regression on a quantum computing unit? https://lnkd.in/eVQxg3JD
- Why can Logistic Regression perfectly learn the outcomes of an AND or an OR logical gate, but not of an XOR logical gate? (See the sketch after this list.) https://lnkd.in/e2JwD3zW, https://lnkd.in/e-y6XzYR
- What are the pros and cons of using Dynamic Programming vs Monte Carlo methods to optimize the Bellman equations? https://lnkd.in/ednVNZGR
- Why does the Temporal-Difference Learning method lead to more stable convergence of Reinforcement Learning algorithms? https://lnkd.in/e5Z_hS_K
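
Here is the quick sketch referenced in the XOR question above, using scikit-learn to show that a linear model handles AND and OR but cannot separate XOR:

"""
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {"AND": [0, 0, 0, 1], "OR": [0, 1, 1, 1], "XOR": [0, 1, 1, 0]}

for gate, y in targets.items():
    accuracy = LogisticRegression().fit(X, y).score(X, y)
    # AND and OR reach 1.0; XOR stays below 1.0 because no single linear boundary separates it
    print(gate, accuracy)
"""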

Now that you have answered those questions (or tried to!), can we take a minute to appreciate the absurdity of the initial claim in this post? Thank you!

#machinelearning #statistics
The TikTok recommender system is widely regarded as one of the best in the world at the scale at which it operates. It can recommend videos or ads, and even the other big tech companies could not compete. Recommending on a platform like TikTok is tough because the training data is non-stationary: a user's interests can change in a matter of minutes, and the number of users, videos, and ads keeps changing.

The predictive performance of a recommender system on a social media platform deteriorates in a matter of hours, so it needs to be updated as often as possible. TikTok built a streaming engine to ensure the model is continuously trained in an online manner. The model server generates features for the model to recommend videos, and in return, the user interacts with the recommended items. This feedback loop leads to new training samples that are immediately sent to the training server. The training server holds a copy of the model, and the model parameters are updated in the parameter server. Every minute, the parameter server synchronizes itself with the production model.

The recommendation model is several terabytes in size, so synchronizing such a big model across the network is very slow. That is why the model is only partially updated. The leading cause of non-stationarity (concept drift) comes from the sparse variables (users, videos, ads, etc.) that are represented by embedding tables. When a user interacts with a recommended item, only the vectors associated with the user and the item get updated, along with some of the weights of the network. Therefore, only the updated vectors get synchronized every minute, and the network weights are synchronized on a longer time frame.

Typical recommender systems use fixed embedding tables, and the categories of the sparse variables get assigned to a vector through a hash function. Typically, the hash size is smaller than the number of categories, so multiple categories get assigned to the same vector; for example, multiple users share the same vector. This helps deal with the cold-start problem for new users and puts a constraint on the maximum memory the whole table will use, but it also tends to reduce the performance of the model because user behaviors get conflated. Instead, TikTok uses dynamic embedding sizes so that new users can be assigned their own vector. They use a collisionless hashing function, so each user gets its own vector. Low-activity users do not influence the model performance much, so those low-occurrence IDs, as well as stale IDs, are dynamically removed. This keeps the embedding table small while preserving the quality of the model.
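
To illustrate the difference, here is a toy PyTorch sketch contrasting a fixed hashed embedding table, where IDs can collide, with a collisionless table that grows one dedicated vector per user. The sizes and eviction policy are placeholders, not TikTok's actual implementation:

"""
import torch
import torch.nn as nn

hash_size, dim = 1_000, 16
hashed_table = nn.Embedding(hash_size, dim)       # fixed size: many user IDs map to the same row

def hashed_vector(user_id):
    row = hash(user_id) % hash_size               # collisions conflate different users' behaviors
    return hashed_table(torch.tensor(row))

dynamic_table = {}                                # collisionless: one dedicated vector per seen user ID

def collisionless_vector(user_id):
    if user_id not in dynamic_table:
        dynamic_table[user_id] = nn.Parameter(torch.randn(dim) * 0.01)
    return dynamic_table[user_id]                 # stale or low-activity IDs can be evicted to bound memory
"""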

Here is the TikTok paper: https://lnkd.in/g9fA62GD!
#machinelearning #datascience #artificialintelligence

--
👉 Learn more Machine Learning on my website: https://www.TheAiEdge.io
--
[Post image]
Here is how, in 3 lines of code, I give ChatGPT access to:

- Wikipedia
- A calculator
- Google search
- Python
- Wolfram alpha
- The terminal
- The latest news
- Podcast APIs
- Current weather information

I could have given it access to more tools but I was afraid it was going to take over the world!
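
The post does not show the code itself, but it is presumably something along the lines of the (older) LangChain agent API below; the tool names and exact function signatures are assumptions and vary across LangChain versions:

"""
from langchain.llms import OpenAI
from langchain.agents import load_tools, initialize_agent

llm = OpenAI(temperature=0)
tools = load_tools(["wikipedia", "llm-math", "serpapi", "terminal"], llm=llm)   # additional tools load the same way
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

agent.run("Who is the current president of France, and what is his age raised to the power of 2?")
"""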

----
Receive 50 ML lessons (100 pages) when subscribing to our newsletter: TheAiEdge.io
Want to become an ML engineer? Join our next Masterclass: MasterClass.TheAiEdge.io
#machinelearning #datascience #artificialintelligence
[Post image]
We have recently seen a surge of vector databases in this era of generative AI. The idea behind vector databases is to index the data with vectors that relate to that data. Hierarchical Navigable Small World (HNSW) is one of the most efficient ways to build indexes for vector databases. The idea is to build a similarity graph and traverse that graph to find the nodes that are closest to a query vector.

Navigable Small World (NSW) networks are built through a process that produces graphs that are efficient to search. Let's imagine we have multiple vectors we need to index. We build a graph by adding them one after the other and connecting each new node to its most similar neighbors.

When building the graph, we need to decide on a similarity metric so that the search is optimized for the specific metric used to query items. Initially, when adding nodes, the density is low and the edges tend to connect nodes that are far apart in similarity. Little by little, the density increases and the edges get shorter and shorter. As a consequence, the graph is composed of long edges that allow us to traverse long distances in the graph and short edges that capture closer neighbors. Because of this, we can quickly traverse the graph from one side to the other and look for nodes at a specific location in the vector space.

When we want to find the nearest neighbor to a query vector, we initiate the search by starting at one node (node A in the image). Among its neighbors (D, G, C), we look for the node closest to the query (D). We iterate over that process until there is no neighbor closer to the query. Once we cannot move anymore, we have found a close neighbor to the query. The search is approximate, and the node found may not be the closest, as the algorithm may get stuck in a local minimum.
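That greedy NSW search can be sketched in a few lines of Python; the graph here is a simple adjacency dict, for illustration rather than a production implementation:

"""
import numpy as np

def greedy_nsw_search(graph, vectors, query, start):
    # Move to whichever neighbor is closest to the query until no neighbor improves the distance.
    current = start
    best_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for neighbor in graph[current]:
            dist = np.linalg.norm(vectors[neighbor] - query)
            if dist < best_dist:
                current, best_dist = neighbor, dist
                improved = True
    return current, best_dist    # an approximate nearest neighbor (may be a local minimum)
"""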

The problem with NSW is that we spend a lot of iterations traversing the graph to arrive at the right node. The idea behind Hierarchical Navigable Small World is to build multiple graph layers, each layer less dense than the next. Each layer represents the same vector space, but not all vectors are added to the graph: a node is included in the graph at layer L with probability P(L). All the nodes are included in the final layer (if we have N layers, P(N) = 1), and the probability gets smaller as we move toward the first layers, so P(L) < P(L + 1).

The first layers allow us to traverse longer distances at each iteration, whereas in the last layer each iteration tends to cover shorter distances. When we search for a node, we start in layer 1, and once the NSW algorithm finds the closest neighbor in that layer, we move down to the next layer. This allows us to find the approximate nearest neighbor in fewer iterations on average.

----
Find more similar content in my newsletter: TheAiEdge.io
#machinelearning #datascience #artificialintelligence
[Post image]
If you think about Transformers, chances are you are thinking about NLP applications, but how can we use Transformers for data types other than text? Actually, you can use Transformers on any data that you are able to express as a sequence of vectors, which is what Transformers feed on! Typically, any sequence or time series of data points should fit the bill.

Let's consider image data, for example. An image is not per se a sequence of data, but the local correlation of the pixels sure resembles the concept. For the Vision Transformer (ViT: https://lnkd.in/gPC_iFaV), the team at Google simply cut each image into patches that were flattened and linearly projected into vectors. By feeding images to Transformers through this process, they found that typical CNNs perform better on small amounts of data, but Transformers overtake CNNs when the scale of the data is very high.
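The patching step itself is simple; here is a minimal PyTorch sketch of turning a batch of images into a sequence of linearly projected patch vectors (shapes and dimensions are arbitrary examples):

"""
import torch
import torch.nn as nn
import torch.nn.functional as F

images = torch.randn(8, 3, 224, 224)                           # a batch of 8 RGB images
patch = 16
patches = F.unfold(images, kernel_size=patch, stride=patch)    # (8, 3*16*16, 196) non-overlapping patches
tokens = patches.transpose(1, 2)                               # (8, 196, 768): a sequence of 196 flattened patches
projection = nn.Linear(3 * patch * patch, 512)
embeddings = projection(tokens)                                # (8, 196, 512): ready to feed a Transformer encoder
print(embeddings.shape)
"""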

Time series are obviously good candidates for Transformers. For example, in the Temporal Fusion Transformer (https://lnkd.in/gfMTHYBc), the time series is transformed into right-sized vectors through LSTM layers which, as the authors say, capture the short-term correlations of the data, while the multi-head attention layers take care of capturing the long-term correlations. They beat all the time series benchmarks with this model, but I wonder how scalable it is with those LSTM layers! You can use it in PyTorch: https://lnkd.in/gzisFCUF

Sequencing proteins seems to be an obvious application of Transformers, considering the language analogy of amino acid sequences. Here, you just need to have an amino acid embedding to capture the semantic representation of protein unit tokens. Here is a Nature article on generating new proteins with Transformers: https://lnkd.in/gzeiuZ8w, and here is its BioaRXiv version: https://lnkd.in/gQgHg-sm.

Reinforcement Learning, expressed as a Markovian sequence of states, actions, and rewards, is another good one. For the Decision Transformer (https://lnkd.in/giJCnXJX), they encoded each state, action, and reward as a vector and concatenated them into one final vector. For example, in the case of video games, a state can simply be the image on the screen at time t, and you extract the latent features with a CNN. An action can be encoded with an embedding, and a scalar reward can be seen as a vector with one dimension. Apparently, they beat all the benchmarks as well! You can find the code here: https://lnkd.in/gwFdrZHX.

Looking forward to seeing what Transformers are going to achieve in the coming years!

--
👉 Get a Free Machine Learning PDF (100+ pages) with 50+ tips by subscribing to my newsletter today: TheAiEdge.io
--
#machinelearning #datascience #artificialintelligence
[Post image]
Graph Databases should be the better choice for Retrieval Augmented Generation (RAG)! We have seen the debate RAG vs fine-tuning, but what about Vector databases vs Graph databases?

In both cases, we maintain a database of information that an LLM can query to answer a specific question. In the case of vector databases, we partition the data into chunks, encode the chunks into vector representations using an LLM, and index the data by their vector representations. Once we have a question, we retrieve the nearest neighbors to the vector representation of the question. The advantage is the fuzzy matching of the question to chunks of data. We don't need to query a specific word or concept; we simply retrieve semantically similar vectors. The problem is that the retrieved data may contain a lot of irrelevant information, which might confuse the LLM.

In the context of graphs, we extract the relationships between the different entities in the text, and we construct a knowledge base of the information contained within the text. An LLM is good at extracting that kind of triplet information:

[ENTITY A] -> [RELATIONSHIP] -> [ENTITY B]
 
For example: 
- A [cow] IS an [animal]
- A [cow] EATS [plants]
- An [animal] IS a [living thing]
- A [plant] IS a [living thing]

Once the information is parsed, we can store it in a graph database. The information stored is the knowledge base, not the original text. For information retrieval, the LLM needs to come up with an Entity query related to the question to retrieve the related entities and relationships. The retrieved information is much more concise and to the point than in the case of vector databases. This context should provide much more useful information for the LLM to answer the question. The problem is that the query matching needs to be exact, and if the entities captured in the database are slightly semantically or lexically different, the query will not return the right information.

I wonder if there is a way to merge the advantages of vector and graph databases. We could parse the entities and relationships, but index them by their vector representations in a graph database. This way, information retrieval could be performed using approximate nearest neighbor search instead of exact matching. Does that exist already?
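
As a rough sketch of that hybrid idea: store the triplets as a graph, but index the entity nodes by embeddings so that lookups can be approximate. Here, embed is a hypothetical sentence-embedding function, not a specific library call:

"""
import numpy as np

triplets = [("cow", "IS", "animal"), ("cow", "EATS", "plants"), ("animal", "IS", "living thing")]
entities = sorted({e for s, _, o in triplets for e in (s, o)})
entity_vectors = np.stack([embed(e) for e in entities])           # one vector per entity node

def fuzzy_entity_lookup(query_entity, top_k=1):
    # Map a possibly misspelled or paraphrased entity to its closest stored entity.
    q = embed(query_entity)
    sims = entity_vectors @ q / (np.linalg.norm(entity_vectors, axis=1) * np.linalg.norm(q))
    return [entities[i] for i in np.argsort(-sims)[:top_k]]       # e.g. "cows" -> ["cow"]

def related_triplets(entity):
    # Retrieve the concise knowledge-base facts attached to an entity.
    return [t for t in triplets if entity in (t[0], t[2])]
"""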

#machinelearning #datascience #artificialintelligence

--
👉 Get a Free Machine Learning PDF (100+ pages) with 50+ tips by subscribing to my newsletter today: https://TheAiEdge.io
--
[Post image]
In this new Machine Learning era dominated by LLMs, Knowledge Distillation is going to be at the forefront of LLMOps. For widespread adoption and further development of generative ML, we first need to make those models more manageable to deploy and fine-tune.

Just to put some numbers on how unmanageable it can be: SOTA models these days have about ~500B parameters, which represents at least ~1TB of GPU memory to operate with specialized infrastructure. That's a minimum of ~$60,000 - $100,000 per year per deployed model just for inference servers, and that doesn't include fine-tuning or the typical elastic load-balancing costs required for reliability best practices. Not impossible, but a rather high barrier to entry for most businesses.

I have always felt that knowledge distillation was a silent hero of this era of transformer-type language models. There are tons of distilled BERT-like models on Hugging Face, for example. The concept behind distillation is actually pretty simple. Let's assume you have a large pre-trained model; in the context of LLMs, it could be pre-trained with self-supervised learning and fine-tuned in an RLHF fashion, for example. That pre-trained model now becomes the all-knowing teacher for a smaller student model. If we call the teacher model T and the student model S, we want to learn the parameters of S such that

T(x) = y_t ≈ y_s = S(x)

For some input data x, we want the predictions y_t and y_s produced by T and S to be as close to each other as possible. To train the student model, we simply pass the training data through the teacher and the student and update the student's parameters by minimizing the loss function l(y_t, y_s) and back-propagating its gradient; typically, we use cross-entropy as the loss function. Think of it as typical supervised learning where the training data is the same as or similar to the teacher's training data, but the ground-truth label for the student is the output prediction of the teacher. You can read more about it in this survey: https://lnkd.in/gCmzGDhq.
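Here is a minimal PyTorch sketch of that soft-target loss; the temperature softening is a common addition from the original distillation literature, not something specific to this post:

"""
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between the teacher's softened distribution and the student's.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # scaling by T^2 keeps gradient magnitudes comparable across temperatures
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean() * temperature ** 2

# toy usage: y_t and y_s are the logits produced by T(x) and S(x) on the same batch
y_t = torch.randn(32, 10)
y_s = torch.randn(32, 10, requires_grad=True)
loss = distillation_loss(y_s, y_t)
loss.backward()
"""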

With the advent of prompt engineering, we now understand better how to extract the right piece of knowledge from LLMs. Techniques like Chain-of-Thought (CoT) greatly improved LLM performance on few-shot learning tasks. The team at Google just published an article (https://lnkd.in/gfjwhbq3) using CoT to improve the distillation process. The idea is to have the student LLM predict the rationales for the predictions alongside the predictions themselves, and to minimize a loss function between the teacher's rationale and the student's rationale. Basically, by forcing the LLM to explain its predictions, they were able to beat the distillation SOTA. For example, they outperformed a 540B-parameter PaLM model with a 770M-parameter T5 model after distillation! I think this paper will have a huge impact in the coming year!

----
Find more similar content in my newsletter: TheAiEdge.io
#machinelearning #datascience #artificialintelligence
[Post image]
Your Machine Learning model LOSS is your GAIN! At least if you choose the right one! Unfortunately, there are even more loss functions available than redundant articles about ChatGPT.

When it comes to regression problems, the Huber and Smooth L1 losses are the best of both worlds between the MSE and the MAE: they are differentiable at 0 and limit the weight of outliers for large values. The LogCosh loss has the same advantage and a similar shape to the Huber loss. The mean absolute percentage error and the mean squared logarithmic error greatly mitigate the effect of outliers. The Poisson loss is widely used for count targets, which can only be positive.
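A quick way to see the difference is to compare the built-in PyTorch losses on a toy batch that contains one outlier:

"""
import torch
import torch.nn as nn

preds = torch.tensor([2.5, 0.0, 8.0])
targets = torch.tensor([3.0, -0.5, 2.0])                  # the last pair is an "outlier" residual of 6

print(nn.MSELoss()(preds, targets))                       # dominated by the outlier (squared error)
print(nn.L1Loss()(preds, targets))                        # robust to the outlier, but not differentiable at 0
print(nn.HuberLoss(delta=1.0)(preds, targets))            # quadratic near 0, linear for large residuals
print(nn.SmoothL1Loss(beta=1.0)(preds, targets))          # same idea, PyTorch's Smooth L1 variant
"""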

For classification problems, the cross-entropy loss tends to be king. I find the focal cross-entropy quite interesting, as it gives more weight to the samples where the model is less confident, putting more focus on the "hard" samples to classify. The KL divergence is another information-theoretic metric, and I would assume it is less stable than cross-entropy on small batches due to the more fluctuating averages of the logs. The hinge loss is the original loss of the SVM algorithm; the squared hinge is simply its square, and the soft margin is a softer, differentiable version of it.

Ranking losses tend to be extensions of the pointwise ones, penalizing the model when two samples are ranked in a different order than in the ground truth. The margin ranking, soft pairwise hinge, and pairwise logistic losses are extensions of the hinge loss. However, ranking loss functions are painfully slow to compute, as the time complexity is O(N^2) where N is the number of samples within a batch.

Contrastive learning is a very simple way to learn aligned semantic representations of multimodal data. For example, triplet margin loss was used in Facenet (https://lnkd.in/g8Js5MRq) and cosine embedding loss in CLIP (https://lnkd.in/eGNMirji). The hinge embedding loss is similar but we replace the cosine similarity with the Euclidean distance.

Deep Learning had a profound effect on Reinforcement Learning, allowing us to train models with high state and action dimensionalities. For Q-learning, the loss can simply take the form of the MSE for the residuals of the Bellman equation. In the case of Policy gradient, the loss is the cross-entropy of the action probabilities weighted by the Q-value.

And those are just a small subset of what exists. To get a sense of what is out there, a simple approach is to take a look at the PyTorch (https://lnkd.in/g63Z26NY) and TensorFlow (https://lnkd.in/gRUHvmbg) documentation. These lecture notes seem to be worth a read: https://lnkd.in/gR4srZQJ.

----
Receive 50 ML lessons (100 pages) when subscribing to our newsletter: TheAiEdge.io
#machinelearning #datascience #artificialintelligence
[Post image]
