Claim 35 Post Templates from the 7 best LinkedIn Influencers

Avi Chawla

These are the best posts from Avi Chawla.

16 viral posts with 30,608 likes, 490 comments, and 2,875 shares.
11 image posts, 0 carousel posts, 4 video posts, 1 text post.

👉 Go deeper on Avi Chawla's LinkedIn with the ContentIn Chrome extension 👈

Best Posts by Avi Chawla on LinkedIn

25 most important mathematical definitions in data science in a single frame.

How many of them do you know?

--
Find a more detailed explanation here: https://lnkd.in/gibud-c6
--

Some of the terms are pretty self-explanatory, so I won't go through each of them, like:
- Gradient Descent, Normal Distribution, Sigmoid, Correlation, Cosine similarity, Naive Bayes, F1 score, ReLU, Softmax, MSE, MSE + L2 regularization, KMeans, Linear regression, SVM, Log loss.

Here are the remaining terms:

- MLE: Used to estimate the parameters of a statistical model by maximizing the likelihood of the observed data.

- Z-score: A standardized value that indicates how many standard deviations away a data point is from the mean.

- OLS: A closed-form solution for linear regression obtained by minimizing squared error (equivalently, the MLE under Gaussian noise).

- Entropy: A measure of the uncertainty or randomness of a random variable. It is often utilized in decision trees and the t-SNE algorithm.

- Eigenvectors: Vectors that do not change direction after a linear transformation. The principal components in PCA are obtained using eigenvectors of the data's covariance matrix.

- R2 (R-squared): It measures the proportion of variance explained by a regression model.

- KL divergence: Assesses how much information is lost when one distribution is used to approximate another distribution. It is used as a loss function in the t-SNE algorithm.

- SVD: A factorization technique that decomposes a matrix into three other matrices. It is fundamental in linear algebra for applications like dimensionality reduction.

- Lagrange multipliers: A mathematical technique to solve constrained optimization problems. For instance, consider an optimization problem with an objective function f(x) and assume that the constraints are g(x)=0 and h(x)=0. Lagrange multipliers help us solve this. This is used in PCA's derivation.
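To make a couple of these concrete, here is a minimal NumPy sketch (the sample values and probabilities are made up purely for illustration) that computes z-scores and the entropy of a discrete distribution:

import numpy as np

x = np.array([4.0, 7.0, 9.0, 10.0, 15.0])      # toy sample
z_scores = (x - x.mean()) / x.std()             # standard deviations away from the mean

p = np.array([0.5, 0.25, 0.25])                 # a discrete probability distribution
entropy = -np.sum(p * np.log2(p))               # uncertainty in bits (1.5 here)
print(z_scores, entropy)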

👉 Over to you: Of course, this is not an all-encompassing list. What other mathematical definitions will you include here?
Post image by Avi Chawla
I reviewed 1,000+ Python libraries and discovered these hidden gems I never knew even existed.

Here are some of them that will make you fall in love with Python and its versatility (even more).

Please read the full list here: https://bit.ly/py-gems

1) PyGWalker: Analyze a Pandas DataFrame in a Tableau-like interface in Jupyter.
Link: https://bit.ly/pyg-walker

2) Science plots: Make professional matplotlib plots for presentations, research papers, etc.
Link: https://bit.ly/sciplt

3) CleverCSV: Resolve parsing errors while reading CSV files with Pandas.
Link: https://bit.ly/clv-csv

4) fastparquet: Speed up Pandas' Parquet I/O by 5x.
Link: https://bit.ly/fparquet

5) Dovpanda: Generate helpful hints as you write your Pandas code.
Link: https://bit.ly/dv-pnda

6) Drawdata: Draw a 2D dataset of any shape in a notebook by dragging the mouse.
Link: https://bit.ly/data-dr

7) nbcommands: Search code in Jupyter notebooks easily rather than manually doing it.
Link: https://bit.ly/nb-cmnds

8) Bottleneck: Speed up NumPy methods 25x, especially when the array has NaN values.
Link: https://bit.ly/btlneck

9) multipledispatch: Enable function overloading in Python.
Link: https://bit.ly/func-ove

10) Aquarel: Style matplotlib plots.
Link: https://bit.ly/py-aql

11) Uniplot: Lightweight plotting in the terminal with Unicode.
Link: https://bit.ly/py-uni

12) pydbgen: Random pandas dataframe generator.
Link: https://bit.ly/pydbgen

13) modelstore: Version machine learning models for better tracking.
Link: https://bit.ly/mdl-str

14) Pigeon: Annotate data with button clicks in Jupyter notebook.
Link: https://bit.ly/py-pgn

15) Optuna: A framework for faster/better hyperparameter optimization (a minimal usage sketch follows this list).
Link: https://bit.ly/py-optuna

16) Pampy: Simple, intuitive, and faster pattern matching. Works on numerous data structures.
Link: https://bit.ly/py-pmpy

17) Typeguard: Enforce type annotations in Python.
Link: https://bit.ly/typeguard

18) KnockKnock: Decorator that notifies upon model training completion.
Link: https://bit.ly/knc-knc

19) Gradio: Create an elegant UI for an ML model.
Link: https://bit.ly/py-grd

20) Parse: Reverse f-strings by specifying patterns.
Link: https://bit.ly/py-prs

21) handcalcs: Write and display mathematical equations in Jupyter.
Link: https://bit.ly/py-hcals

22) Osquery: Write SQL-based queries to explore operating system data.
Link: https://bit.ly/py-osqry

23) D3Blocks: Create and export interactive plots as HTML. (Matplotlib/Plotly plots lose interactivity when exported.)
Link: https://bit.ly/py-d3

24) itables: Show Pandas dataframes as interactive tables.
Link: https://bit.ly/py-itbls

25) jellyfish: Perform approximate and phonetic string matching.
Link: https://bit.ly/jly-fsh
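As promised above, here is a minimal Optuna sketch (item 15); the objective, search range, and trial count are toy values chosen only for illustration:

import optuna

def objective(trial):
    x = trial.suggest_float("x", -10, 10)        # sample a candidate value
    return (x - 2) ** 2                          # toy loss to minimize

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)                         # should land close to {"x": 2}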

That's a wrap!!

What cool Python libraries would you add to this list?

👇 Drop your suggestions in the replies below 👇

👉 Check out my daily newsletter to learn something new about Python and Data Science every day: https://bit.ly/DailyDS.
11 plots in data science that are used 90% of the time

(with precise usage 👇)

Visualizations are critical in understanding complex data patterns and relationships.

They offer a concise way to:
- understand the intricacies of statistical models
- validate model assumptions
- evaluate model performance, and much more.

The visual below depicts the 11 most important and must-know plots in data science:

1) KS Plot:
- It is used to assess the distributional differences.
- The core idea is to measure the maximum distance between the cumulative distribution functions (CDF) of two distributions.
- The lower the maximum distance, the more likely they belong to the same distribution.
- Thus, instead of a "plot", it is mainly interpreted as a "statistical test" to determine distributional differences.

2) SHAP Plot:
- It summarizes feature importance to a model's predictions by considering interactions/dependencies between them.
- It is useful in determining how different values (low or high) of a feature affect the overall output.

3) ROC Curve:
- It depicts the tradeoff between the true positive rate (good performance) and the false positive rate (bad performance) across different classification thresholds (see the sketch after this list).

4) Precision-Recall Curve:
- It depicts the tradeoff between Precision and Recall across different classification thresholds.

5) QQ Plot:
- It assesses the distributional similarity between observed data and a theoretical distribution.
- It plots the quantiles of the two distributions against each other.
- Deviations from the straight line indicate a departure from the assumed distribution.

6) Cumulative Explained Variance Plot:
- It is useful in determining the number of dimensions we can reduce our data to while preserving max variance during PCA.

7) Elbow Curve:
- The plot helps identify the optimal number of clusters for the k-means algorithm.
- The point of the elbow depicts the ideal number of clusters.

8) Silhouette Curve:
- The Elbow curve is often ineffective when you have plenty of clusters.
- Silhouette Curve is a better alternative, as depicted above.

9) Gini-Impurity and Entropy:
- They are used to measure the impurity or disorder of a node or split in a decision tree.
- The plot compares Gini impurity and Entropy across different splits.
- This provides insights into the tradeoff between these measures.

10) Bias-Variance Tradeoff:
- It is used to find the right balance between the bias and the variance of a model against complexity.

11) PDP:
- Depicts the dependence between target and features.
- A plot between the target and one feature → 1-way PDP.
- A plot between the target and two features → 2-way PDP.
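As a quick illustration of item 3, here is a minimal scikit-learn sketch of a ROC curve; the synthetic dataset and the logistic regression model are placeholders:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]            # probability of the positive class

fpr, tpr, _ = roc_curve(y, scores)               # TPR/FPR across thresholds
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y, scores):.2f}")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()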

👉 Over to you: Do you find it easier to interpret plots over numbers?
____
If you want to learn AI/ML engineering, get this free PDF (530+ pages) with 150+ core DS/ML lessons.

Get here: https://lnkd.in/gi6xKmDc
____
Find me → Avi Chawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
Post image by Avi Chawla
Scatter plots are extremely useful for visualizing two sets of numerical variables. But when you have, say, thousands of data points, scatter plots can get too dense to interpret.

Hexbins can be a good choice in such cases. As the name suggests, they bin the area of a chart into hexagonal regions. Each region is assigned a color intensity based on the method of aggregation used (the number of points, for instance).

Hexbins are especially useful for understanding the spread of data and are often considered an elegant alternative to a scatter plot. Moreover, binning makes it easier to identify data clusters and depict patterns.
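A minimal matplotlib sketch (correlated random data, purely for illustration):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=10_000).T

plt.hexbin(x, y, gridsize=40, cmap="viridis")    # each hexagon is colored by point count
plt.colorbar(label="count")
plt.show()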

#python #datascience
Post image by Avi Chawla
The time complexity of 10 popular ML algorithms in a single frame.

Understanding the run time of ML algorithms is important because it helps us:
- Build a core understanding of an algorithm.
- Understand the data-specific conditions that allow us to use an algorithm.

For instance, using SVM or t-SNE on large datasets is infeasible because of their polynomial relation with data size.

Similarly, using OLS on a high-dimensional dataset makes no sense because its run-time grows cubically with total features.

--
👉 Join 80k+ data scientists and get a Free data science PDF (550+ pages) with 320+ posts by subscribing to my daily newsletter: https://lnkd.in/gzfJWHmu
--

👉 Over to you: Can you tell the inference run-time of KMeans Clustering?
Post image by Avi Chawla
Pandas is getting outdated.

Here's a new alternative you should consider switching to 👇.

To begin, Pandas has many limitations.

For instance, Pandas:
- always adheres to single-core computation
- offers no lazy execution
- creates bulky DataFrames
- is slow on large datasets, and many more

Polars is a lightning-fast DataFrame library that addresses these limitations.

It provides two APIs:
- Eager: Executed instantly, like Pandas.
- Lazy: Executed only when one needs the results.

The visual presents a comparison of Polars and Pandas on various parameters.

It's clear that Polars is much more efficient than Pandas.
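To make the lazy API concrete, here is a minimal sketch; the file name and column names are placeholders, and nothing is computed until collect() is called (note: group_by is spelled groupby in older Polars releases):

import polars as pl

result = (
    pl.scan_csv("sales.csv")                     # lazy: builds a query plan, reads nothing yet
      .filter(pl.col("amount") > 100)
      .group_by("region")
      .agg(pl.col("amount").sum())
      .collect()                                 # the optimized plan executes here
)
print(result)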

--
👉 Get a Free Data Science PDF (350+ pages) with 250+ posts by subscribing to my daily newsletter today: https://bit.ly/DailyDS.
--

👉 Over to you: What are some other good alternatives to Pandas that you are aware of?

#datascience #python
Post image by Avi Chawla
Most common Pandas operations and their SQL translations in one frame.

SQL and Pandas are both powerful tools for data scientists to work with data. Thus, proficiency in both frameworks is extremely crucial.
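For instance, one of the most common translation pairs, using a tiny made-up DataFrame:

import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "SF"], "sales": [10, 20, 30]})

# Pandas
out = df.groupby("city")["sales"].sum().reset_index()

# SQL equivalent:
# SELECT city, SUM(sales) AS sales FROM df GROUP BY city;
print(out)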

--
👉 Get a Free Data Science PDF (550+ pages) with 320+ tips by subscribing to my daily newsletter today: https://bit.ly/DailyDS.
--

Over to you: What other Pandas to SQL translations will you include here?

#python
.
Post image by Avi Chawla
FireDucks makes Pandas 20x Faster...

...by changing JUST ONE LINE of code.

Pandas has several limitations:
- Pandas always adheres to a single-core computation on a CPU.
- Pandas always creates bulky DataFrames.
- Pandas always follows an eager execution mode (every operation triggers immediate computation), which is why it cannot prepare a smart execution plan that optimizes the entire sequence of operations.

FireDucks is a heavily optimized alternative to Pandas with exactly the same API, and it addresses these limitations.


There are three ways to use it:
1) Load the extension: %load_ext fireducks.pandas; import pandas as pd
2) Import FireDucks instead of Pandas: import fireducks.pandas as pd
3) If you have a Python script, execute it as follows: python3 -m fireducks.pandas code.py

Done!

The speed up compared to Pandas and Polars is evident from the video below.

As per FireDucks' official benchmarks, it can be ~20x faster than Pandas and ~2x faster than Polars.

👉 I covered this in detail and why it is effective here: https://lnkd.in/gtRRcXiD.

👉 Over to you: What are some other ways to accelerate Pandas operations in general?
Box plots are quite common in data analysis. But they can be misleading at times. Here's why.

A box plot is a graphical representation of just five numbers extracted from the data. These are: min, first quartile, median, third quartile, and max.

Thus, two different datasets with the same five values will produce identical box plots. This, at times, can be misleading and one may draw wrong conclusions.

The takeaway is NOT that box plots should not be used. Instead, look at the underlying distribution too. Here, histograms and violin plots can help.

Lastly, always remember that when you condense a dataset, you don't see the whole picture. You are losing essential information.
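A minimal matplotlib sketch of the idea; the unimodal and bimodal samples are made up purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
unimodal = rng.normal(0, 1.5, 1000)
bimodal = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.boxplot([unimodal, bimodal])                 # five-number summaries only
ax2.violinplot([unimodal, bimodal])              # the underlying shapes become visible
plt.show()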

👉 Check out my daily newsletter to learn something new about Python and Data Science every day: https://bit.ly/DailyDS.

#python #datascience #statistics #datavisualization
Post image by Avi Chawla
Microsoft open-sourced a powerful data analysis tool:

(it's AI-powered and no-code) 👇

Data Formulator is an innovative tool from Microsoft that uses LLMs to transform data to speed up data analysis and create visualizations.

Key features include:
☑ AI-powered data transformation
☑ Interactive drag-and-drop UI for data visualization
☑ Seamless integration of UI and natural language inputs

You can also create visualizations beyond the initial dataset. Data Formulator automatically identifies a need for computation, creates those fields for you, and outputs the visualization.

Find the GitHub repo in the comments!

👉 P.S. Do you like no-code tools for data analysis?
_____
If you want to learn AI/ML engineering, I have put together a free PDF (530+ pages) with 150+ core DS/ML lessons.

Get here: https://lnkd.in/gi6xKmDc
_____
Find me → Avi Chawla
Every day, I share tutorials and insights on ML, LLMs, and RAGs.
Most common magic methods in Python in a single frame.
.
.
Magic methods offer immense flexibility to define the behavior of class objects in certain scenarios. Thus, awareness of them is extremely crucial for developing elegant and intuitive pipelines.
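A minimal sketch of a few of the most common ones; the Dataset class is just an illustration:

class Dataset:
    def __init__(self, rows):
        self.rows = rows

    def __len__(self):                           # enables len(ds)
        return len(self.rows)

    def __getitem__(self, idx):                  # enables ds[0] and iteration
        return self.rows[idx]

    def __repr__(self):                          # enables a readable print(ds)
        return f"Dataset(n={len(self.rows)})"

ds = Dataset([{"x": 1}, {"x": 2}])
print(len(ds), ds[0], ds)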

--
👉 Get a Free Data Science PDF (550+ pages) with 320+ posts by subscribing to my daily newsletter today: https://bit.ly/DailyDS.
--

👉 Over to you: What other magic methods will you include here? Which ones do you use the most?

#python
.
Post image by Avi Chawla
11 plots that data scientists use 95% of the time.

Besides the regular box, bar, and line plots, there are many more visualizations that are critical to:
- understand the intricacies of statistical models
- validate model assumptions
- evaluate model performance, and much more.

The visual below depicts the 11 most important and must-know plots in data science:

👉 Find a more vivid explanation with visuals here: https://lnkd.in/gWt_JfP2.

1) KS Plot:
- It is used to assess the distributional differences.
- The core idea is to measure the maximum distance between the cumulative distribution functions (CDF) of two distributions.
- The lower the maximum distance, the more likely they belong to the same distribution.
- Thus, instead of a "plot", it is mainly interpreted as a "statistical test" to determine distributional differences.

2) SHAP Plot:
- It summarizes feature importance to a model's predictions by considering interactions/dependencies between them.
- It is useful in determining how different values (low or high) of a feature affect the overall output.

3) ROC Curve:
- It depicts the tradeoff between the true positive rate (good performance) and the false positive rate (bad performance) across different classification thresholds.

4) Precision-Recall Curve:
- It depicts the tradeoff between Precision and Recall across different classification thresholds.

5) QQ Plot:
- It assesses the distributional similarity between observed data and a theoretical distribution.
- It plots the quantiles of the two distributions against each other.
- Deviations from the straight line indicate a departure from the assumed distribution.

6) Cumulative Explained Variance Plot:
- It is useful in determining the number of dimensions we can reduce our data to while preserving max variance during PCA.

7) Elbow Curve:
- The plot helps identify the optimal number of clusters for the k-means algorithm.
- The point of the elbow depicts the ideal number of clusters.

8) Silhouette Curve:
- The Elbow curve is often ineffective when you have plenty of clusters.
- Silhouette Curve is a better alternative, as depicted above.

9) Gini-Impurity and Entropy:
- They are used to measure the impurity or disorder of a node or split in a decision tree.
- The plot compares Gini impurity and Entropy across different splits.
- This provides insights into the tradeoff between these measures.

10) Bias-Variance Tradeoff:
- It is used to find the right balance between the bias and the variance of a model against complexity.

11) PDP:
- Depicts the dependence between target and features.
- A plot between the target and one feature → 1-way PDP.
- A plot between the target and two features → 2-way PDP.

--
👉 Get a free data science PDF (530+ pages) with 150+ core data science and machine learning lessons: https://lnkd.in/gzfJWHmu
--

👉 Over to you: Which important plots have I missed here?
Post image by Avi Chawla
Traditional RAG vs. HyDE, explained visually.
.
.
One critical problem with the traditional RAG system is that questions are not semantically similar to their answers.

Consider you want to find a sentence similar to "What is ML?".

It is likely that "What is AI?" is more similar to it than "Machine learning is fun."

Due to this semantic dissimilarity, several irrelevant contexts get retrieved during the retrieval step.

HyDE solves this.

The following visual depicts how it differs from traditional RAG.

Here's how it works:

- Use an LLM to generate a hypothetical answer H for the query Q (this answer does not have to be entirely correct).

- Embed the answer using a contriever model to get E (bi-encoders trained using contrastive learning are famously used here).

- Use the embedding E to query the vector database and fetch relevant context (C).

- Pass the hypothetical answer H + retrieved-context C + query Q to the LLM to produce an answer.

Done!
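In pseudocode, the flow looks roughly like this; llm_generate, embed, and vector_db are hypothetical stand-ins for your LLM client, bi-encoder/contriever, and vector store:

def hyde_answer(query, llm_generate, embed, vector_db, top_k=5):
    # 1) Generate a hypothetical answer (it may contain hallucinated details)
    hypothetical = llm_generate(f"Answer the question: {query}")
    # 2) Embed it with the contriever / bi-encoder
    h_embedding = embed(hypothetical)
    # 3) Retrieve real documents closest to the hypothetical answer
    context = vector_db.search(h_embedding, top_k=top_k)
    # 4) Let the LLM answer the original query with the retrieved context
    prompt = f"Question: {query}\nHypothetical answer: {hypothetical}\nContext: {context}\nAnswer:"
    return llm_generate(prompt)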

Now, of course, the generated hypothetical answer will likely contain hallucinated details.

But this does not severely affect performance, thanks to the contriever model (the one that embeds).

More specifically, this model is trained using contrastive learning and it also functions as a near-lossless compressor whose task is to filter out the hallucinated details of the fake document.

This produces a vector embedding that is expected to be more similar to the embeddings of the actual documents than the embedding of the question itself.

Several studies have shown that HyDE improves the retrieval performance compared to the traditional embedding model.

But this comes at the cost of increased latency and more LLM usage.

I'll cover a hands-on of HyDE in the Daily Dose of Data Science newsletter soon.

Join here: https://lnkd.in/gB6HTzm8.

Also, get a free data science PDF (530+ pages) with 150+ core DS/ML lessons.

👉 Over to you: What are some other ways to improve RAG?
Post image by Avi Chawla
After loading any dataframe in Jupyter, we preview it.

But it hardly tells us anything about the data.

One has to dig deeper by analyzing it, which involves simple yet repetitive code.

Instead, use Jupyter-DataTables.

It's an open-source tool that supercharges the default preview of a DataFrame.

The preview provides many common operations, such as:
- sorting
- filtering
- exporting
- plotting column distribution
- printing data types
- pagination, and more.

--
👉 Get a Free Data Science PDF (550+ pages) with 320+ posts by subscribing to my daily newsletter today: https://bit.ly/DailyDS.
--

👉 Over to you: What are some other cool Jupyter tools you are aware of?

#python
.
Here's an underrated technique to immensely boost your data analysis in Jupyter 👇:
.
.

When using Jupyter, folks often:
- re-run the same cells after modifying the code/input slightly.

This makes data exploration:
- irreproducible,
- tedious, and
- unorganized.

Instead, leverage interactive controls using IPywidgets.

A single decorator (interact) allows you to add:
- sliders
- dropdowns
- text fields, and more.

As a result, you can:
- explore your data interactively
- speed up data exploration
- avoid repetitive cell modifications and executions
- organize your data analysis
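A minimal sketch; the DataFrame, column name, and slider range are placeholders:

import pandas as pd
from ipywidgets import interact

df = pd.DataFrame({"price": range(100)})

@interact(threshold=(0, 100, 5))                 # renders a slider above the cell output
def filter_rows(threshold=50):
    return df[df["price"] > threshold]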

--
👉 Get a Free Data Science PDF (550+ pages) with 320+ posts by subscribing to my daily newsletter today: https://bit.ly/DailyDS.
--

👉 Over to you: What are some ways to elegantly explore data in Jupyter?

#datascience
.
🚀 70x Faster Pandas by changing just one line of code.
.
.
It is challenging to work on large datasets in Pandas.

To speed up its operations, try Modin.

It provides instant run-time improvements with no extra effort. By changing just the import statement, you can use Modin with the same API as Pandas.
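The one-line change, sketched below; the CSV path and column name are placeholders (Modin distributes the work over Ray or Dask under the hood):

# import pandas as pd                            # before
import modin.pandas as pd                        # after: same API, parallel execution

df = pd.read_csv("large_dataset.csv")
print(df.groupby("some_column").mean())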

--
👉 Get a Free Data Science PDF (550+ pages) with 320+ tips by subscribing to my daily newsletter today: https://bit.ly/DailyDS.
--

👉 What are some other ways to speed up Pandas?

#python
Post image by Avi Chawla
