<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Epochs of Data Insights]]></title><description><![CDATA[Simplifying the "WHY" and "HOW" of Data Science and Analytics topics. Join in for in-depth bi-weekly practical and actionable insights.]]></description><link>https://analyticalnikita.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!jT43!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F675a2477-6a67-4e93-b0a2-2cae74032449_1080x1080.png</url><title>Epochs of Data Insights</title><link>https://analyticalnikita.substack.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 30 Jun 2026 19:40:14 GMT</lastBuildDate><atom:link href="https://analyticalnikita.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Nikita Prasad]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[analyticalnikita@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[analyticalnikita@substack.com]]></itunes:email><itunes:name><![CDATA[Nikita Prasad]]></itunes:name></itunes:owner><itunes:author><![CDATA[Nikita Prasad]]></itunes:author><googleplay:owner><![CDATA[analyticalnikita@substack.com]]></googleplay:owner><googleplay:email><![CDATA[analyticalnikita@substack.com]]></googleplay:email><googleplay:author><![CDATA[Nikita Prasad]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[What is Decision Tree Regressor?]]></title><description><![CDATA[Must-Know Interview Topic for Every Aspiring Data 
Scientist]]></description><link>https://analyticalnikita.substack.com/p/what-is-decision-tree-regressor</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/what-is-decision-tree-regressor</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Sat, 14 Jun 2025 10:34:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5Tbl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e68287a-8606-44ac-a849-7b3845938948_1200x1091.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let&#8217;s break down the core of <strong>Decision Tree Regression</strong>: what it is, how it works, and why it matters from a data scientist&#8217;s perspective, along with a sample problem statement.</p><p><em>By the end, you will have a solid grasp of this algorithm, which helps with both interview preparation and day-to-day work as a data scientist.</em></p><blockquote><p><strong>Quick Pause</strong>: <em>If you&#8217;re new here, <strong>subscribe</strong>. My goal is to make Data Science easy for you. &#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h3><strong>What is a Decision Tree?</strong></h3><p>A decision tree is an algorithm that uses a tree-like structure of decisions to predict either a target value (regression) or a target class (classification). 
</p><p>Before diving into how decision trees work, let us become familiar with the basic structure and terminologies of a decision tree:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Tbl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e68287a-8606-44ac-a849-7b3845938948_1200x1091.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Tbl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e68287a-8606-44ac-a849-7b3845938948_1200x1091.png 424w, https://substackcdn.com/image/fetch/$s_!5Tbl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e68287a-8606-44ac-a849-7b3845938948_1200x1091.png 848w, https://substackcdn.com/image/fetch/$s_!5Tbl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e68287a-8606-44ac-a849-7b3845938948_1200x1091.png 1272w, https://substackcdn.com/image/fetch/$s_!5Tbl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e68287a-8606-44ac-a849-7b3845938948_1200x1091.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5Tbl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e68287a-8606-44ac-a849-7b3845938948_1200x1091.png" width="1200" height="1091" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e68287a-8606-44ac-a849-7b3845938948_1200x1091.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1091,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image.png" title="image.png" srcset="https://substackcdn.com/image/fetch/$s_!5Tbl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e68287a-8606-44ac-a849-7b3845938948_1200x1091.png 424w, https://substackcdn.com/image/fetch/$s_!5Tbl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e68287a-8606-44ac-a849-7b3845938948_1200x1091.png 848w, https://substackcdn.com/image/fetch/$s_!5Tbl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e68287a-8606-44ac-a849-7b3845938948_1200x1091.png 1272w, https://substackcdn.com/image/fetch/$s_!5Tbl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e68287a-8606-44ac-a849-7b3845938948_1200x1091.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Root Node</strong>: The topmost node representing all data points.</p></li><li><p><strong>Splitting</strong>: It refers to dividing a node into two or more sub-nodes.</p></li><li><p><strong>Decision Node</strong>: Nodes further split into sub-nodes; a split node.</p></li><li><p><strong>Leaf / Terminal Node</strong>: Nodes that do not split; final results.</p></li><li><p><strong>Branch / Sub-Tree</strong>: Subsection of the entire tree.</p></li><li><p><strong>Parent and Child Node</strong>: Parent node divides into sub-nodes; children are the sub-nodes.</p></li><li><p><strong>Pruning</strong>: Removing sub-nodes of a decision node is called pruning. 
Pruning is often done in decision trees to prevent overfitting.</p></li></ul><div><hr></div><h2>What Is a Decision Tree Regressor?</h2><p>A <strong>Decision Tree Regressor</strong> is used when the target is <strong>continuous</strong>, such as a house price or a stock value.</p><p>It learns the input-output relationship by breaking the dataset into smaller segments.<br>At each node, the algorithm chooses the feature and split point that minimize the prediction error.</p><h3>How It Works</h3><ul><li><p>The algorithm examines the feature values in the training data.</p></li><li><p>It builds a tree whose leaves predict continuous outputs.</p></li><li><p>At each split, it tries to reduce the <strong>mean squared error (MSE)</strong> (or another error metric).</p></li><li><p>The process continues until certain stopping criteria are met.</p></li></ul><h3><strong>How Is a Decision Tree Classifier Different?</strong></h3><p>Classification trees predict categorical outcomes (for example, yes or no), while regression trees predict numerical values, such as the price of a stock.</p><p>Classification and regression trees are powerful tools for analyzing data.</p><div><hr></div><h2><strong>Training Decision Trees</strong></h2><p>A decision tree represents a hierarchical series of binary decisions. 
Training one involves:</p><ul><li><p>Letting the algorithm find the <strong>best split</strong> at each level.</p></li><li><p>Building branches until one of the following holds:</p><ul><li><p>The tree reaches a <strong>maximum depth</strong>, or</p></li><li><p>The <strong>data in a node is pure enough</strong>, or</p></li><li><p>A <strong>minimum number of samples</strong> is reached.</p></li></ul></li></ul><p>Rather than relying on manually set rules, the algorithm figures out the best split conditions from the data on its own.</p><blockquote><p>Do not forget to explore the full implementation, including code and data: <em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/DT-and-RF/dt_acme_expenses.ipynb">GitHub repository</a></strong></em> &#128072;&#127995;</p></blockquote><div><hr></div><p>Alright, that&#8217;s a wrap! If you&#8217;ve made it this far, thank you! <em>Stay tuned with<strong> <a href="https://substack.com/@analyticalnikita">ME</a></strong></em>, so you won&#8217;t miss out on future updates.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzOTY3MDMxLCJpYXQiOjE3NDE5NDIzMzYsImV4cCI6MTc0NDUzNDMzNiwiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.JWw4pdU7jCGSV5bwMlHnG8gazyOqGmuOTgSol7f4g3o&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Found this helpful? Leave a &#8220;heart&#8221; &#10084;&#65039;! And if you have any questions, suggestions, or thoughts, do drop me a line below. 
&#128395;&#65039;&#128071;</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/what-is-decision-tree-regressor?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/what-is-decision-tree-regressor?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[Reducing Overfitting Using Regularization]]></title><description><![CDATA[Difference between Ridge and Lasso Regularization]]></description><link>https://analyticalnikita.substack.com/p/reducing-overfitting-using-regularization</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/reducing-overfitting-using-regularization</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Sat, 07 Jun 2025 10:36:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xWv5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd007152-5024-4ee5-8bb9-d3a056916770_1400x680.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Linear-Regression/multiple-linear-regression.ipynb">Linear Regression</a></strong></em> assumes that the predictors are not strongly correlated (no multicollinearity) and works best when the data is not too noisy. 
</p><p>But in real-world datasets, we often have:</p><ul><li><p><strong>Too many features</strong>.</p></li><li><p><strong>Correlated predictors</strong>.</p></li><li><p><strong>Noise, outliers</strong>, and much more.</p></li></ul><p>That is why we need <strong>regularization</strong>.</p><p>Let&#8217;s break down what it is, the different types of regularization techniques, and how to choose between them like a data scientist who knows what they&#8217;re doing.</p><blockquote><p><em>And hey, if you&#8217;re new here, <strong>subscribe</strong>. My goal is to simplify Data Science for you. &#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><p>Let&#8217;s dive in!</p><h2><strong>Regularization in Machine Learning</strong></h2><p>Regularization is a technique that reduces errors caused by <strong>overfitting</strong> on training data by adding a penalty term to the model&#8217;s cost function. </p><p>The penalty discourages complex models with large coefficients, promoting simpler and more generalizable solutions. </p><p>Regularization helps improve a model&#8217;s performance on unseen data and enhances its overall robustness.</p><h2><strong>Techniques of Regularization</strong></h2><p>There are two main regularization techniques:</p><ol><li><p><strong>Ridge Regression (L2 Regularization)</strong></p></li><li><p><strong>Lasso Regression (L1 Regularization)</strong></p></li></ol><p>Each influences the model&#8217;s behavior in a different way.</p><h3>1. 
Ridge Regression (or L2 Regularization)</h3><p>Ridge Regression adds a <strong>squared penalty</strong> to the coefficients:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K5gV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f25dff2-51f8-4c9f-911e-382080c72f7a_400x79.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K5gV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f25dff2-51f8-4c9f-911e-382080c72f7a_400x79.png 424w, https://substackcdn.com/image/fetch/$s_!K5gV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f25dff2-51f8-4c9f-911e-382080c72f7a_400x79.png 848w, https://substackcdn.com/image/fetch/$s_!K5gV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f25dff2-51f8-4c9f-911e-382080c72f7a_400x79.png 1272w, https://substackcdn.com/image/fetch/$s_!K5gV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f25dff2-51f8-4c9f-911e-382080c72f7a_400x79.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K5gV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f25dff2-51f8-4c9f-911e-382080c72f7a_400x79.png" width="400" height="79" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f25dff2-51f8-4c9f-911e-382080c72f7a_400x79.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:79,&quot;width&quot;:400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8152,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/165283336?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f25dff2-51f8-4c9f-911e-382080c72f7a_400x79.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K5gV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f25dff2-51f8-4c9f-911e-382080c72f7a_400x79.png 424w, https://substackcdn.com/image/fetch/$s_!K5gV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f25dff2-51f8-4c9f-911e-382080c72f7a_400x79.png 848w, https://substackcdn.com/image/fetch/$s_!K5gV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f25dff2-51f8-4c9f-911e-382080c72f7a_400x79.png 1272w, https://substackcdn.com/image/fetch/$s_!K5gV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f25dff2-51f8-4c9f-911e-382080c72f7a_400x79.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ul><li><p><em>RSS </em>is the residual sum of squares.</p></li><li><p><em>&#955;</em> (lambda) controls the strength of the penalty.</p></li></ul><h4>What It Does:</h4><ul><li><p>Keeps all 
features in the model, but <strong>shrinks coefficients</strong> toward zero.</p></li><li><p>Especially helpful when predictors are <strong>correlated</strong>.</p></li><li><p>Doesn&#8217;t eliminate features&#8212;just balances them.</p></li></ul><h4>Use Ridge When:</h4><ul><li><p>You want to <strong>keep all variables</strong>, but reduce model complexity.</p></li><li><p><strong>Multicollinearity</strong> is an issue.</p></li><li><p>Interpretability isn&#8217;t your top priority.</p></li></ul><div><hr></div><h3>2. Lasso Regression (or L1 Regularization)</h3><p>Lasso uses the <strong>absolute value</strong> of coefficients:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F1I9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d4ad18-1915-443e-be2a-f6bc7c1a3c58_424x97.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F1I9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d4ad18-1915-443e-be2a-f6bc7c1a3c58_424x97.png 424w, https://substackcdn.com/image/fetch/$s_!F1I9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d4ad18-1915-443e-be2a-f6bc7c1a3c58_424x97.png 848w, https://substackcdn.com/image/fetch/$s_!F1I9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d4ad18-1915-443e-be2a-f6bc7c1a3c58_424x97.png 1272w, https://substackcdn.com/image/fetch/$s_!F1I9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d4ad18-1915-443e-be2a-f6bc7c1a3c58_424x97.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!F1I9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d4ad18-1915-443e-be2a-f6bc7c1a3c58_424x97.png" width="424" height="97" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c9d4ad18-1915-443e-be2a-f6bc7c1a3c58_424x97.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:97,&quot;width&quot;:424,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7961,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/165283336?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d4ad18-1915-443e-be2a-f6bc7c1a3c58_424x97.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F1I9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d4ad18-1915-443e-be2a-f6bc7c1a3c58_424x97.png 424w, https://substackcdn.com/image/fetch/$s_!F1I9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d4ad18-1915-443e-be2a-f6bc7c1a3c58_424x97.png 848w, https://substackcdn.com/image/fetch/$s_!F1I9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d4ad18-1915-443e-be2a-f6bc7c1a3c58_424x97.png 1272w, https://substackcdn.com/image/fetch/$s_!F1I9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d4ad18-1915-443e-be2a-f6bc7c1a3c58_424x97.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a></figure></div><h4>What It Does:</h4><ul><li><p>Shrinks some coefficients exactly to <strong>zero</strong>.</p></li><li><p>Performs <strong>feature selection</strong> automatically.</p></li><li><p>Helps build <strong>sparse models</strong>&#8212;great for high-dimensional datasets.</p></li></ul><h4>Use Lasso When:</h4><ul><li><p>You want to <strong>select important features</strong> and ignore the rest.</p></li><li><p>Your dataset has <strong>many variables</strong>, but not all of them matter.</p></li><li><p>You care about <strong>simpler, interpretable models</strong>.</p></li></ul><div><hr></div><h3>Bonus: ElasticNet</h3><p>Why choose between Ridge and Lasso when you can have both?</p><p><strong>ElasticNet</strong> combines both L1 and L2 penalties:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hmAt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4941ec1-4a42-4a6d-a4b8-a78ad47f457e_424x56.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hmAt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4941ec1-4a42-4a6d-a4b8-a78ad47f457e_424x56.png 424w, https://substackcdn.com/image/fetch/$s_!hmAt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4941ec1-4a42-4a6d-a4b8-a78ad47f457e_424x56.png 848w, https://substackcdn.com/image/fetch/$s_!hmAt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4941ec1-4a42-4a6d-a4b8-a78ad47f457e_424x56.png 1272w, 
https://substackcdn.com/image/fetch/$s_!hmAt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4941ec1-4a42-4a6d-a4b8-a78ad47f457e_424x56.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hmAt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4941ec1-4a42-4a6d-a4b8-a78ad47f457e_424x56.png" width="424" height="56" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4941ec1-4a42-4a6d-a4b8-a78ad47f457e_424x56.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:56,&quot;width&quot;:424,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7539,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/165283336?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4941ec1-4a42-4a6d-a4b8-a78ad47f457e_424x56.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hmAt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4941ec1-4a42-4a6d-a4b8-a78ad47f457e_424x56.png 424w, https://substackcdn.com/image/fetch/$s_!hmAt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4941ec1-4a42-4a6d-a4b8-a78ad47f457e_424x56.png 848w, https://substackcdn.com/image/fetch/$s_!hmAt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4941ec1-4a42-4a6d-a4b8-a78ad47f457e_424x56.png 1272w, 
https://substackcdn.com/image/fetch/$s_!hmAt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4941ec1-4a42-4a6d-a4b8-a78ad47f457e_424x56.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Best for:</p><ul><li><p>High-dimensional data.</p></li><li><p>Datasets with correlated predictors.</p></li></ul><blockquote><p><em><strong>Note</strong>: Elastic Net, which combines both of the above methods, often performs better empirically.</em></p></blockquote><p>Here&#8217;s a <strong>visual comparison</strong> of <strong>Ridge Regression (L2)</strong>, <strong>Lasso Regression (L1)</strong>, and <strong>Elastic Net (L1 + L2)</strong>, showing how each method <strong>constrains the coefficient estimates</strong> during regularization, as discussed above.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xWv5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd007152-5024-4ee5-8bb9-d3a056916770_1400x680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xWv5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd007152-5024-4ee5-8bb9-d3a056916770_1400x680.png 424w, https://substackcdn.com/image/fetch/$s_!xWv5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd007152-5024-4ee5-8bb9-d3a056916770_1400x680.png 848w, https://substackcdn.com/image/fetch/$s_!xWv5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd007152-5024-4ee5-8bb9-d3a056916770_1400x680.png 1272w, 
https://substackcdn.com/image/fetch/$s_!xWv5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd007152-5024-4ee5-8bb9-d3a056916770_1400x680.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xWv5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd007152-5024-4ee5-8bb9-d3a056916770_1400x680.png" width="1400" height="680" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd007152-5024-4ee5-8bb9-d3a056916770_1400x680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:680,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:214068,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/165283336?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd007152-5024-4ee5-8bb9-d3a056916770_1400x680.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xWv5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd007152-5024-4ee5-8bb9-d3a056916770_1400x680.png 424w, https://substackcdn.com/image/fetch/$s_!xWv5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd007152-5024-4ee5-8bb9-d3a056916770_1400x680.png 848w, 
https://substackcdn.com/image/fetch/$s_!xWv5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd007152-5024-4ee5-8bb9-d3a056916770_1400x680.png 1272w, https://substackcdn.com/image/fetch/$s_!xWv5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd007152-5024-4ee5-8bb9-d3a056916770_1400x680.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If you&#8217;d like to explore the full implementation, including code and data, then checkout: <em><strong><a 
href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Linear-Regression/ridge-regression.ipynb">GitHub Repository</a></strong></em> &#128072;&#127995;</p><div><hr></div><p>Stay tuned with<strong> </strong><em><strong><a href="https://substack.com/@analyticalnikita">ME</a></strong></em>, so you won&#8217;t miss out on future updates.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/working-with-text-data-tokenization?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzODk3MDA4LCJpYXQiOjE3MzYyNjI3MTMsImV4cCI6MTczODg1NDcxMywiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.9YJtyeDdbdG7YRaC7fxDDMCJsWzsYR2hfBW6KlmWeS4&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption"><em>Before you go, leave a &#8220;heart&#8221; &#10084;&#65039; and if you have any questions, suggestions, or thoughts, do drop me a line below. 
&#128395;&#65039;&#128071;</em></p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/reducing-overfitting-using-regularization?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/reducing-overfitting-using-regularization?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[Text Feature Extraction Made Simple ]]></title><description><![CDATA[Beginner&#8217;s Guide to Text Vectorization for NLP Tasks]]></description><link>https://analyticalnikita.substack.com/p/text-feature-extraction-made-simple</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/text-feature-extraction-made-simple</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Sat, 31 May 2025 10:36:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9jFU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab392eff-bdc4-4129-8b43-c52189bc499c_1400x950.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In most cases, text classification systems have two main parts:</p><ol><li><p><strong>Feature Extraction Component (Turning Text into Numbers)</strong>: </p><ol><li><p>This part takes a piece of text and turns it into a set of features or characteristics (basically, numbers). 
</p></li><li><p>These features help the system understand what's important in the text.</p></li></ol></li><li><p><strong>Classifier/Regressor Component (Making a Decision)</strong>:</p><ol><li><p>Once the features are generated, this part of the system uses them to decide which category label or fine-grained score to assign to the text. </p></li><li><p>It matches the features against a list of known categories or scores and assigns the most appropriate one to the text.</p></li></ol></li></ol><p>By the end of this read, you will have a solid understanding of:</p><ul><li><p><em><strong>What Text Vectorization is,</strong></em></p></li><li><p><em><strong>Why it is essential when working with text data, and</strong></em></p></li><li><p><em><strong>Which popular techniques you can use.</strong></em></p></li></ul><p>This is helpful for day-to-day work as a Data Scientist.</p><blockquote><p><em>And hey, if you&#8217;re new here, <strong>Subscribe</strong>; my goal is to simplify Data Science for you. &#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><p>Let&#8217;s dive in!</p><h2>What is Feature Extraction from Text or Text Vectorization?</h2><p>Machine learning algorithms require numerical input. 
</p><p>Hence, we need Text Vectorization to convert textual data into a numerical format, enabling algorithms to process and analyze it effectively.</p><h2>Why do we need Feature Extraction from Text?</h2><p>Text vectorization captures important information from sequential text data.</p><p>This includes word frequency, relationships between words, and semantic and syntactic meaning.</p><p>Simply put:</p><ul><li><p>Which words are common?</p></li><li><p>Which words are rare?</p></li><li><p>How are words connected?</p></li><li><p>What might the sentence mean?</p></li></ul><p>All of this helps the machine learn and <strong>make better predictions</strong>.</p><h2>Text Feature Extraction Techniques</h2><p>Here's a comparison table of common <strong>Text Feature Extraction Techniques</strong> used in Natural Language Processing (NLP). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9jFU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab392eff-bdc4-4129-8b43-c52189bc499c_1400x950.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9jFU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab392eff-bdc4-4129-8b43-c52189bc499c_1400x950.png 424w, https://substackcdn.com/image/fetch/$s_!9jFU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab392eff-bdc4-4129-8b43-c52189bc499c_1400x950.png 848w, https://substackcdn.com/image/fetch/$s_!9jFU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab392eff-bdc4-4129-8b43-c52189bc499c_1400x950.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9jFU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab392eff-bdc4-4129-8b43-c52189bc499c_1400x950.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9jFU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab392eff-bdc4-4129-8b43-c52189bc499c_1400x950.png" width="1400" height="950" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab392eff-bdc4-4129-8b43-c52189bc499c_1400x950.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:950,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:140298,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/164857651?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab392eff-bdc4-4129-8b43-c52189bc499c_1400x950.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9jFU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab392eff-bdc4-4129-8b43-c52189bc499c_1400x950.png 424w, https://substackcdn.com/image/fetch/$s_!9jFU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab392eff-bdc4-4129-8b43-c52189bc499c_1400x950.png 848w, 
https://substackcdn.com/image/fetch/$s_!9jFU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab392eff-bdc4-4129-8b43-c52189bc499c_1400x950.png 1272w, https://substackcdn.com/image/fetch/$s_!9jFU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab392eff-bdc4-4129-8b43-c52189bc499c_1400x950.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Tips for Use</strong>:</h3><p>Want to try these techniques? 
Here&#8217;s how to choose:</p><ul><li><p>Choose <strong>Word2Vec / FastText / GloVe</strong> for capturing <strong>semantic relationships</strong>.</p></li><li><p>Use <strong>BERT or Transformer-based models</strong> when <strong>context</strong> is critical (e.g., QA, NER).</p></li></ul><div><hr></div><h4><em>&#128071;&#127995; Additionally, you can also check out:</em></h4><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;886ccb66-e8d0-4cff-8872-e8b8ec796b7a&quot;,&quot;caption&quot;:&quot;In the initial phase of the input processing workflow, the input text is segmented into separate tokens using tiktoken library.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How LLMs Embeds Input Tokens?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:13692717,&quot;name&quot;:&quot;Nikita Prasad&quot;,&quot;bio&quot;:&quot;&#128187; Documenting my Learning in Simplified ways at Epochs of Data Insights. &#128640; Join my data-driven journey, today! 
&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ac9b439-e213-42c3-a12f-7a10f429b3f0_2160x2160.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-02-04T10:46:54.956Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd9a0569-7639-45f6-994c-02a489cf1adc_1587x2245.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://analyticalnikita.substack.com/p/how-llms-embeds-input-tokens&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:153967003,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:3,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Epochs of Data Insights&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F675a2477-6a67-4e93-b0a2-2cae74032449_1080x1080.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><p>Stay tuned with<strong> </strong><em><strong><a href="https://substack.com/@analyticalnikita">ME</a></strong></em>, so you won&#8217;t miss out on future updates.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/working-with-text-data-tokenization?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzODk3MDA4LCJpYXQiOjE3MzYyNjI3MTMsImV4cCI6MTczODg1NDcxMywiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.9YJtyeDdbdG7YRaC7fxDDMCJsWzsYR2hfBW6KlmWeS4&quot;,&quot;text&quot;:&quot;Share&quot;}" 
data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption"><em>Before you go, leave a &#8220;heart&#8221; &#10084;&#65039; and if you have any questions, suggestions, or thoughts, do drop me a line below. &#128395;&#65039;&#128071;</em></p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/text-feature-extraction-made-simple?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/text-feature-extraction-made-simple?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[Explain Bagging vs. Boosting]]></title><description><![CDATA[...The Most Important Interview Question!]]></description><link>https://analyticalnikita.substack.com/p/explain-bagging-vs-boosting</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/explain-bagging-vs-boosting</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Sat, 17 May 2025 10:36:15 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/242060f4-85bf-4361-9efe-fb6564c42ecb_1000x630.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the most frequently asked data science interview questions: <em><strong>Differentiate between Bagging and Boosting</strong></em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!v0YT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b8641ca-831a-4ebf-be5a-b3ee65349dc8_1920x1080.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v0YT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b8641ca-831a-4ebf-be5a-b3ee65349dc8_1920x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!v0YT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b8641ca-831a-4ebf-be5a-b3ee65349dc8_1920x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!v0YT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b8641ca-831a-4ebf-be5a-b3ee65349dc8_1920x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!v0YT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b8641ca-831a-4ebf-be5a-b3ee65349dc8_1920x1080.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v0YT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b8641ca-831a-4ebf-be5a-b3ee65349dc8_1920x1080.jpeg" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4b8641ca-831a-4ebf-be5a-b3ee65349dc8_1920x1080.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:123175,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/163765219?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b8641ca-831a-4ebf-be5a-b3ee65349dc8_1920x1080.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v0YT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b8641ca-831a-4ebf-be5a-b3ee65349dc8_1920x1080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!v0YT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b8641ca-831a-4ebf-be5a-b3ee65349dc8_1920x1080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!v0YT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b8641ca-831a-4ebf-be5a-b3ee65349dc8_1920x1080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!v0YT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b8641ca-831a-4ebf-be5a-b3ee65349dc8_1920x1080.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Today, let&#8217;s understand the core differences in how these algorithms work.</p><p>And how they relate to two powerful ML algorithms: <strong>Random Forest</strong> (bagging) and <strong>XGBoost</strong> (boosting).</p><blockquote><p>Make sure to hit that <em><strong>Subscribe</strong></em> button, so you don&#8217;t miss when I put out any new posts!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><p>In machine learning, <strong>ensemble methods</strong> combine 
multiple models to improve performance. </p><p>Two of the most popular ensemble techniques are <strong>bagging</strong> and <strong>boosting</strong>. </p><p>Understanding their differences is crucial to choosing the right model for your problem. </p><p>Let's explore! </p><div><hr></div><h3>&#127794; What is Bagging?</h3><p><strong>Bagging</strong> stands for <em>Bootstrap Aggregating</em>. </p><p>The idea is simple:</p><ul><li><p>It trains multiple models (usually decision trees), each on a different <strong>random subset</strong> of the data.</p></li><li><p>These subsets are created through <strong>sampling with replacement</strong> (i.e., bootstrap sampling).</p></li><li><p>Predictions are averaged (for regression) or combined by majority vote (for classification).</p></li></ul><p><strong>Random Forest</strong> is the most widely used bagging algorithm. It builds many decision trees, each also restricted to a random subset of features at every split, and combines their predictions to get a more accurate and stable result.</p><div><hr></div><h3>&#128640; What is Boosting?</h3><p><strong>Boosting</strong> is a sequential technique:</p><ul><li><p>It builds models <strong>one after another</strong>, where each new model focuses on the <strong>mistakes</strong> of the previous ones.</p></li><li><p>Instead of resampling, it either adjusts the <strong>weights</strong> of the training data points (as in AdaBoost) or fits each new model to the remaining errors of the ensemble so far (as in gradient boosting).</p></li><li><p>The final prediction is a <strong>weighted sum</strong> of all models.</p></li></ul><p><strong>XGBoost</strong> (short for <strong>Extreme Gradient Boosting</strong>) is a highly efficient and scalable boosting algorithm known for its accuracy and speed.</p><div><hr></div><p>Here&#8217;s a side-by-side comparison:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K4Tl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49691b3-008d-460b-8ebb-1a764fd91224_1000x630.jpeg" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K4Tl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49691b3-008d-460b-8ebb-1a764fd91224_1000x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!K4Tl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49691b3-008d-460b-8ebb-1a764fd91224_1000x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!K4Tl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49691b3-008d-460b-8ebb-1a764fd91224_1000x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!K4Tl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49691b3-008d-460b-8ebb-1a764fd91224_1000x630.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K4Tl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49691b3-008d-460b-8ebb-1a764fd91224_1000x630.jpeg" width="1000" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f49691b3-008d-460b-8ebb-1a764fd91224_1000x630.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91232,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/163765219?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49691b3-008d-460b-8ebb-1a764fd91224_1000x630.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K4Tl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49691b3-008d-460b-8ebb-1a764fd91224_1000x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!K4Tl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49691b3-008d-460b-8ebb-1a764fd91224_1000x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!K4Tl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49691b3-008d-460b-8ebb-1a764fd91224_1000x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!K4Tl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff49691b3-008d-460b-8ebb-1a764fd91224_1000x630.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Both Random Forest and XGBoost are powerful algorithms:</p><ul><li><p>Use <strong>Random Forest</strong> if you want a fast, robust, <em><strong>low-<a href="https://analyticalnikita.substack.com/p/5-important-biases-you-should-know">bias</a>, low-variance model</strong></em> that&#8217;s easy to tune and less prone to overfitting.</p></li><li><p>Use <strong>XGBoost</strong> if you need <em><strong>maximum predictive power</strong></em> and are ready to spend time tuning the parameters.</p></li></ul><p>In practice, it&#8217;s common to try both and compare performance using cross-validation.</p><div><hr></div><p>If you&#8217;d like to explore the full implementation, including code and data, check out the <em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes">GitHub Repository</a></strong></em> 
&#128072;&#127995;</p><div><hr></div><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/working-with-text-data-tokenization?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzODk3MDA4LCJpYXQiOjE3MzYyNjI3MTMsImV4cCI6MTczODg1NDcxMywiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.9YJtyeDdbdG7YRaC7fxDDMCJsWzsYR2hfBW6KlmWeS4&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption"><em>Before you go, leave a &#8220;heart&#8221; &#10084;&#65039; and let me know if you have any questions, suggestions, or thoughts. &#128395;&#65039;&#128071;</em></p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/explain-bagging-vs-boosting?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/explain-bagging-vs-boosting?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[How to Prevent Gradient Boosting Model from Overfitting?]]></title><description><![CDATA[...The Most Important Interview Question!]]></description><link>https://analyticalnikita.substack.com/p/how-to-prevent-gradient-boosting</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/how-to-prevent-gradient-boosting</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Sat, 26 Apr 2025 10:36:12 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!U21_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99494941-24d1-48ea-8997-e445b411fba4_788x574.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A good model should <strong>learn</strong>, not <strong>memorize</strong>.</p><p>That&#8217;s how you build models that are smart, simple, and ready for real-world problems.</p><p>Let&#8217;s find out how you can stop a Gradient Boosting model from overfitting.</p><blockquote><p>Make sure to hit that <em><strong>Subscribe</strong></em> button, so you don&#8217;t miss when I put out any new posts!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h2>What is the Gradient Boosting Algorithm?</h2><p>Gradient Boosting is an ensemble technique that builds multiple weak learners, typically shallow decision trees, sequentially and combines their predictions into a single, more accurate strong learner.</p><p>It can be used for <em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Voting-Ensemble/gradient_boosting_classifier.ipynb">classification</a></strong></em> and <em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Voting-Ensemble/gradient_boosting_regressor.ipynb">regression</a></strong></em> tasks. 
&#128072; </p><h4><br>Advantages of Gradient Boosting Model:</h4><ul><li><p>Robustness to missing values and outliers</p></li><li><p>Effective handling of high cardinality categorical features</p></li></ul><div><hr></div><h2>How to prevent Overfitting in GBM?</h2><p>Let&#8217;s discuss three important hyper-parameters provided by Sklearn to stop a Gradient Boosting model from overfitting.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U21_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99494941-24d1-48ea-8997-e445b411fba4_788x574.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U21_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99494941-24d1-48ea-8997-e445b411fba4_788x574.png 424w, https://substackcdn.com/image/fetch/$s_!U21_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99494941-24d1-48ea-8997-e445b411fba4_788x574.png 848w, https://substackcdn.com/image/fetch/$s_!U21_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99494941-24d1-48ea-8997-e445b411fba4_788x574.png 1272w, https://substackcdn.com/image/fetch/$s_!U21_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99494941-24d1-48ea-8997-e445b411fba4_788x574.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U21_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99494941-24d1-48ea-8997-e445b411fba4_788x574.png" width="788" height="574" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99494941-24d1-48ea-8997-e445b411fba4_788x574.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image.png" title="image.png" srcset="https://substackcdn.com/image/fetch/$s_!U21_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99494941-24d1-48ea-8997-e445b411fba4_788x574.png 424w, https://substackcdn.com/image/fetch/$s_!U21_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99494941-24d1-48ea-8997-e445b411fba4_788x574.png 848w, https://substackcdn.com/image/fetch/$s_!U21_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99494941-24d1-48ea-8997-e445b411fba4_788x574.png 1272w, https://substackcdn.com/image/fetch/$s_!U21_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99494941-24d1-48ea-8997-e445b411fba4_788x574.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>1. Using Fewer Trees</h3><p>In Gradient Boosting, you build small models (decision trees) one after another, each one correcting the errors of the trees before it.</p><ul><li><p>If you build too many trees, the model can become too complex and start fitting noise.</p></li><li><p>Building fewer trees can help keep the model simple and general.</p></li></ul><p>You can control the number of trees with the <code>n_estimators</code> parameter.</p><pre><code>from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=20)</code></pre><p>Here, you tell the model to build only 20 trees (scikit-learn&#8217;s default is 100).</p><div><hr></div><h3>2. Make Trees Shorter</h3><p>Shorter trees mean the model learns simple patterns, not complicated noise.</p><p>You can control how tall or deep the trees are by setting <code>max_depth</code>.</p><pre><code>gbm = GradientBoostingClassifier(max_depth=3)</code></pre><p>A depth of 3 means each tree can ask at most three questions in a row before making a decision.</p><div><hr></div><h3>3. Slow Down the Learning</h3><p>Another way to prevent overfitting is to make the model learn slowly.</p><p>You can do this by setting a smaller <code>learning_rate</code>.</p><pre><code>gbm = GradientBoostingClassifier(learning_rate=0.1)</code></pre><p>Note that 0.1 is also scikit-learn&#8217;s default, so values below it (for example, 0.05) slow the learning down further.</p><p>If the learning rate is too <em>high</em>, the model can jump too quickly to conclusions.</p><p>A small learning rate makes the model take <strong>small, careful steps</strong>.</p><blockquote><p><em><strong>Tip:</strong> If you lower the learning rate, you might need <strong>more trees</strong> to reach good performance.</em></p></blockquote><p>If you&#8217;d like to explore the full implementation, including code and data, check out the <em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Voting-Ensemble/gradient_boosting_regressor.ipynb">GitHub Repository</a></strong></em> &#128072;&#127995;</p><div><hr></div><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/working-with-text-data-tokenization?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzODk3MDA4LCJpYXQiOjE3MzYyNjI3MTMsImV4cCI6MTczODg1NDcxMywiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.9YJtyeDdbdG7YRaC7fxDDMCJsWzsYR2hfBW6KlmWeS4&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p 
class="cta-caption"><em>Before you go... leave a &#8220;heart&#8221; &#10084;&#65039; and let me know if you have any questions, suggestions, or thoughts.&#128395;&#65039;&#128071;</em></p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/how-to-prevent-gradient-boosting?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/how-to-prevent-gradient-boosting?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[Evaluation Metrics For Classification Models]]></title><description><![CDATA[What Every Data Scientist Should Know!]]></description><link>https://analyticalnikita.substack.com/p/evaluation-metrics-for-classification</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/evaluation-metrics-for-classification</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Sat, 19 Apr 2025 10:36:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bDBV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcd6bdf-9f94-4139-8bc3-038879df5666_989x470.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When evaluating the performance of a classification model like logistic regression, <em>accuracy</em> is usually the first metric that comes to mind for beginner data scientists.</p><p>It&#8217;s simple, intuitive and easy to compute.</p><p>But is it enough?</p><p>In many cases &#8212; especially when dealing with imbalanced data &#8212; accuracy alone can be misleading.</p><p>Let&#8217;s go through other 
reliable evaluation metrics in this read, in a simple and clear way.</p><blockquote><p><em>If you&#8217;re new here, <strong>Subscribe</strong>: my goal is to make Data Science easy for you. &#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h2><strong>What is Accuracy?</strong></h2><p>This metric tells you how often the classifier is correct. </p><p>You can calculate it as:</p><p><code>Accuracy = (Number of Correct Predictions) &#247; (Total Number of Predictions)</code></p><p>In Python, you can compute the accuracy of your model with scikit-learn&#8217;s <code>accuracy_score</code>.</p><pre><code>from sklearn.metrics import accuracy_score

print(accuracy_score(test_target, y_pred))  # Output: 0.853</code></pre><p>This means the model was right <em>85.3%</em> of the time.</p><h3><strong>But When Can Accuracy Be Misleading?</strong></h3><p>Accuracy fails when classes are<em> imbalanced</em>. </p><p>In cases where one class dominates the dataset, a classifier might achieve high accuracy by simply predicting the dominant class for all instances. For example, if 95% of the samples belong to one class, a model that always predicts that class scores 95% accuracy without ever detecting the minority class.</p><blockquote><p>I&#8217;ve already covered a high-level overview of <em><strong>How to Handle Imbalanced Datasets. </strong></em>Missed it? (really?) Go check it<em> <strong><a href="https://analyticalnikita.substack.com/p/handling-imbalanced-dataset-in-ml">out</a></strong>. 
&#128071;&#127995;</em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;4aa92bb0-b211-4780-8276-41ddb06f2cd8&quot;,&quot;caption&quot;:&quot;In this read, I want to focus on Imbalanced Datasets &#8212; a common challenge in real-world applications.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Handling Imbalanced Dataset in ML&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:13692717,&quot;name&quot;:&quot;Nikita Prasad&quot;,&quot;bio&quot;:&quot;&#128187; Documenting my Learning in Simplified ways at Epochs of Data Insights. &#128640; Join my data-driven journey, today! &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ac9b439-e213-42c3-a12f-7a10f429b3f0_2160x2160.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-03-15T07:40:37.036Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36610319-e029-4870-985a-3f50b0dcc5b9_689x547.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://analyticalnikita.substack.com/p/handling-imbalanced-dataset-in-ml&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:158985316,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:3,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Epochs of Data 
Insights&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F675a2477-6a67-4e93-b0a2-2cae74032449_1080x1080.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><p>This is exactly why data scientists must know other evaluation metrics.</p><div><hr></div><h2><strong>What is a Confusion Matrix?</strong></h2><p>Simply put, it&#8217;s a table that describes the performance of a classification model. </p><p>It presents a breakdown of the correct and incorrect predictions for each class.</p><h3><strong>Why Should You Use a Confusion Matrix?</strong></h3><p>It provides more insight than accuracy alone. </p><p>It allows you to see where the model is making errors, such as confusing one class with another.</p><blockquote><p>The same idea of false positives and false negatives (Type I and Type II errors) also plays a vital role in designing effective <em>A/B tests</em> and accurately interpreting the results. </p><p>In case you missed the Most Important Interview Question! 
Find it <em><strong><a href="https://analyticalnikita.substack.com/p/understanding-type-i-and-type-ii">here</a></strong></em>: &#128071;&#127995;</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;77f0b1cf-ec46-48ac-ace1-9a07ed9cbdcb&quot;,&quot;caption&quot;:&quot;Without a solid understanding of statistics, it can be challenging to design effective A/B tests and accurately interpret the results.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Understanding Type I and Type II Errors in Hypothesis Testing&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:13692717,&quot;name&quot;:&quot;Nikita Prasad&quot;,&quot;bio&quot;:&quot;&#128187; Documenting my Learning in Simplified ways at Epochs of Data Insights. &#128640; Join my data-driven journey, today! &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ac9b439-e213-42c3-a12f-7a10f429b3f0_2160x2160.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-01-31T11:13:02.713Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd077019a-7c63-4015-bdc3-732deb181d07_1000x630.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://analyticalnikita.substack.com/p/understanding-type-i-and-type-ii&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:155236339,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:3,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Epochs of Data 
Insights&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F675a2477-6a67-4e93-b0a2-2cae74032449_1080x1080.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><h2><strong>So&#8230; Can You Calculate Accuracy from the Confusion Matrix?</strong></h2><p>Yes, you can.</p><p>You just have to sum up the correct predictions (<em>True Positives and True Negatives</em>) and divide by the total number of predictions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bDBV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcd6bdf-9f94-4139-8bc3-038879df5666_989x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bDBV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcd6bdf-9f94-4139-8bc3-038879df5666_989x470.png 424w, https://substackcdn.com/image/fetch/$s_!bDBV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcd6bdf-9f94-4139-8bc3-038879df5666_989x470.png 848w, https://substackcdn.com/image/fetch/$s_!bDBV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcd6bdf-9f94-4139-8bc3-038879df5666_989x470.png 1272w, https://substackcdn.com/image/fetch/$s_!bDBV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcd6bdf-9f94-4139-8bc3-038879df5666_989x470.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!bDBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcd6bdf-9f94-4139-8bc3-038879df5666_989x470.png" width="989" height="470" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fcd6bdf-9f94-4139-8bc3-038879df5666_989x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:989,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image.png" title="image.png" srcset="https://substackcdn.com/image/fetch/$s_!bDBV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcd6bdf-9f94-4139-8bc3-038879df5666_989x470.png 424w, https://substackcdn.com/image/fetch/$s_!bDBV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcd6bdf-9f94-4139-8bc3-038879df5666_989x470.png 848w, https://substackcdn.com/image/fetch/$s_!bDBV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcd6bdf-9f94-4139-8bc3-038879df5666_989x470.png 1272w, https://substackcdn.com/image/fetch/$s_!bDBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcd6bdf-9f94-4139-8bc3-038879df5666_989x470.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>where:</p><ul><li><p><strong>True Positives (TP)</strong>: It is the case where you predicted <em><strong>Yes</strong></em> and the real output was also <em><strong>Yes</strong></em>.</p></li><li><p><strong>True Negatives (TN)</strong>: It is the case where you predicted <em><strong>No</strong></em> and the real output was also <em><strong>No</strong></em>.</p></li><li><p><strong>False Positives (FP)</strong>: It is the case where you predicted <em><strong>Yes</strong></em> but it was actually <em><strong>No</strong></em>.</p></li><li><p><strong>False Negatives (FN)</strong>: It is the case where you predicted <em><strong>No</strong></em> but it was actually <em><strong>Yes</strong></em>.</p></li></ul><blockquote><p><em><strong>Note</strong>: 
The reverse is not possible. You cannot reconstruct a full Confusion Matrix from the Accuracy alone, because accuracy is a single number and many different matrices produce the same value.</em></p></blockquote><div><hr></div><h2><strong>What is Precision?</strong></h2><p>It measures how many of the classifier&#8217;s positive predictions are actually correct. </p><p>Here&#8217;s its formula:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Go_o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f58e54c-1d1e-4334-b6be-11e80ba55007_444x90.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Go_o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f58e54c-1d1e-4334-b6be-11e80ba55007_444x90.png 424w, https://substackcdn.com/image/fetch/$s_!Go_o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f58e54c-1d1e-4334-b6be-11e80ba55007_444x90.png 848w, https://substackcdn.com/image/fetch/$s_!Go_o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f58e54c-1d1e-4334-b6be-11e80ba55007_444x90.png 1272w, https://substackcdn.com/image/fetch/$s_!Go_o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f58e54c-1d1e-4334-b6be-11e80ba55007_444x90.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Go_o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f58e54c-1d1e-4334-b6be-11e80ba55007_444x90.png" width="444" height="90" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f58e54c-1d1e-4334-b6be-11e80ba55007_444x90.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:90,&quot;width&quot;:444,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4081,&quot;alt&quot;:&quot;image.png&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image.png" title="image.png" srcset="https://substackcdn.com/image/fetch/$s_!Go_o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f58e54c-1d1e-4334-b6be-11e80ba55007_444x90.png 424w, https://substackcdn.com/image/fetch/$s_!Go_o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f58e54c-1d1e-4334-b6be-11e80ba55007_444x90.png 848w, https://substackcdn.com/image/fetch/$s_!Go_o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f58e54c-1d1e-4334-b6be-11e80ba55007_444x90.png 1272w, https://substackcdn.com/image/fetch/$s_!Go_o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f58e54c-1d1e-4334-b6be-11e80ba55007_444x90.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Say, if the classifier predicts <em><strong>Yes</strong></em> 10 times, and only 7 were actually correct, then the precision is 7 out of 10, or 0.7.</p><div><hr></div><h2><strong>What is Recall or Sensitivity?</strong></h2><p>Recall tells you how many actual positive cases your model was able to identify.</p><p>Here&#8217;s 
its formula:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DLVm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F877eef7b-d6da-4e75-b8e6-2e710cffce28_444x76.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DLVm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F877eef7b-d6da-4e75-b8e6-2e710cffce28_444x76.png 424w, https://substackcdn.com/image/fetch/$s_!DLVm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F877eef7b-d6da-4e75-b8e6-2e710cffce28_444x76.png 848w, https://substackcdn.com/image/fetch/$s_!DLVm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F877eef7b-d6da-4e75-b8e6-2e710cffce28_444x76.png 1272w, https://substackcdn.com/image/fetch/$s_!DLVm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F877eef7b-d6da-4e75-b8e6-2e710cffce28_444x76.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DLVm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F877eef7b-d6da-4e75-b8e6-2e710cffce28_444x76.png" width="444" height="76" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/877eef7b-d6da-4e75-b8e6-2e710cffce28_444x76.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:76,&quot;width&quot;:444,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4007,&quot;alt&quot;:&quot;image.png&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image.png" title="image.png" srcset="https://substackcdn.com/image/fetch/$s_!DLVm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F877eef7b-d6da-4e75-b8e6-2e710cffce28_444x76.png 424w, https://substackcdn.com/image/fetch/$s_!DLVm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F877eef7b-d6da-4e75-b8e6-2e710cffce28_444x76.png 848w, https://substackcdn.com/image/fetch/$s_!DLVm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F877eef7b-d6da-4e75-b8e6-2e710cffce28_444x76.png 1272w, https://substackcdn.com/image/fetch/$s_!DLVm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F877eef7b-d6da-4e75-b8e6-2e710cffce28_444x76.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Basically, it answers the question: &#8220;<em>Out of all the real <strong>Yes </strong>cases, how many did the model find?&#8221;</em></p><div><hr></div><h3>Trade-Off between Precision &amp; Recall</h3><p>Often there&#8217;s a trade-off between precision and recall.</p><p>Meaning increasing one can decrease the other. 
Which one to prioritize depends on the application: for instance, a disease screening test usually favors recall, while a spam filter usually favors precision.</p><p>Many real-world applications require a balance, and this is where the F1 Score comes in.</p><div><hr></div><h2><strong>What is the F1 Score?</strong></h2><p>The F1 score is the <em>harmonic mean of precision and recall.</em> </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tauN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e627f87-ab69-49a3-8294-813b30d3d1b0_1352x158.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tauN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e627f87-ab69-49a3-8294-813b30d3d1b0_1352x158.png 424w, https://substackcdn.com/image/fetch/$s_!tauN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e627f87-ab69-49a3-8294-813b30d3d1b0_1352x158.png 848w, https://substackcdn.com/image/fetch/$s_!tauN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e627f87-ab69-49a3-8294-813b30d3d1b0_1352x158.png 1272w, https://substackcdn.com/image/fetch/$s_!tauN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e627f87-ab69-49a3-8294-813b30d3d1b0_1352x158.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tauN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e627f87-ab69-49a3-8294-813b30d3d1b0_1352x158.png" width="1352" height="158" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e627f87-ab69-49a3-8294-813b30d3d1b0_1352x158.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:158,&quot;width&quot;:1352,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image.png" title="image.png" srcset="https://substackcdn.com/image/fetch/$s_!tauN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e627f87-ab69-49a3-8294-813b30d3d1b0_1352x158.png 424w, https://substackcdn.com/image/fetch/$s_!tauN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e627f87-ab69-49a3-8294-813b30d3d1b0_1352x158.png 848w, https://substackcdn.com/image/fetch/$s_!tauN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e627f87-ab69-49a3-8294-813b30d3d1b0_1352x158.png 1272w, https://substackcdn.com/image/fetch/$s_!tauN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e627f87-ab69-49a3-8294-813b30d3d1b0_1352x158.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>It's useful when you want to find a balance between precision and recall, especially when classes are imbalanced.</p><h3>Why Harmonic Mean?</h3><p>Because it punishes extreme values more. 
</p><p>So if either precision or recall is very low, the F1 Score will also be low, which makes it helpful with imbalanced classes.</p><div><hr></div><h2><strong>What is Specificity?</strong></h2><p>While recall tells you how good the model is at identifying positives, <em>specificity</em> tells you how good it is at identifying negatives: TN &#247; (TN + FP).</p><p>It is useful in domains where detecting negatives correctly is just as important as detecting positives.</p><div><hr></div><h2><strong>Classification Report</strong></h2><p>Scikit-learn provides a <em>classification report</em> that summarizes the key metrics for each class. You can generate it with the following code. </p><pre><code>from sklearn.metrics import classification_report
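
# --- Illustrative addition (a sketch, not from the original notebook) ---
# The labels below are made-up assumptions so that this block runs on its own.
# 1 stands for "Yes" (positive) and 0 stands for "No" (negative).
y_test = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # hypothetical true labels
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # hypothetical model predictions

# Count the four confusion-matrix cells by hand.
pairs = list(zip(y_test, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted Yes, actually Yes
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted No, actually No
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted Yes, actually No
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted No, actually Yes

# The metrics discussed above, written out from those four counts.
accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, specificity, f1)

# The report printed next should agree with these values for class 1.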
print(classification_report(y_test, y_pred))</code></pre><p>For each class, the report includes:</p><ul><li><p>Precision</p></li><li><p>Recall</p></li><li><p>F1 score</p></li><li><p>Support (the number of actual samples for each class in the specified dataset)</p></li></ul><p>If you&#8217;d like to explore the full implementation, including code and data, check out the <em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Logistic-Regression/fmnist-logistic-regression.ipynb">GitHub Repository</a></strong></em> &#128072;&#127995;</p><p>Understanding these metrics helps data scientists make informed decisions about model performance, fairness, and suitability for real-world use.</p><div><hr></div><p>Thanks for reading!</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/working-with-text-data-tokenization?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzODk3MDA4LCJpYXQiOjE3MzYyNjI3MTMsImV4cCI6MTczODg1NDcxMywiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.9YJtyeDdbdG7YRaC7fxDDMCJsWzsYR2hfBW6KlmWeS4&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption"><em>Before you go... 
leave a &#8220;heart&#8221; &#10084;&#65039; and let me know if you have any questions, suggestions, or thoughts.&#128395;&#65039;&#128071;</em></p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/evaluation-metrics-for-classification?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/evaluation-metrics-for-classification?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[Hiding Future Tokens ]]></title><description><![CDATA[Illustrated Guide to Causal Attention Mechanism]]></description><link>https://analyticalnikita.substack.com/p/hiding-future-tokens</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/hiding-future-tokens</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Sat, 12 Apr 2025 10:40:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b79b337-58a2-4cb2-b1bc-2155bfcf3162_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this read, we are modifying the previous <em><strong><a href="https://analyticalnikita.substack.com/p/scaled-dot-product-attention-explained">self-attention mechanism</a></strong></em> into a <em><strong>causal self-attention mechanism</strong></em>.</p><p>I highly recommend giving it a read if you haven&#8217;t already &#128071;&#127995;:</p><div class="digest-post-embed" 
data-attrs="{&quot;nodeId&quot;:&quot;e3a2a753-271a-44ef-b2ac-91d1cbfd03af&quot;,&quot;caption&quot;:&quot;Previously, I have covered a high-level overview about the Simple Attention Mechanism without Trainable Weights. Missed it? (really?) Go check it out.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Scaled Dot-Product Attention Explained! &quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:13692717,&quot;name&quot;:&quot;Nikita Prasad&quot;,&quot;bio&quot;:&quot;&#128187; Documenting my Learning in Simplified ways at Epochs of Data Insights. &#128640; Join my data-driven journey, today! &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ac9b439-e213-42c3-a12f-7a10f429b3f0_2160x2160.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-03-29T08:36:43.491Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://analyticalnikita.substack.com/p/scaled-dot-product-attention-explained&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:153967051,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Epochs of Data 
Insights&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F675a2477-6a67-4e93-b0a2-2cae74032449_1080x1080.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><blockquote><p><em>And while you&#8217;re at it, <strong>subscribe to me</strong> so you&#8217;ll never miss any more of these reads.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h2>What is Causal Self-Attention Mechanism?</h2><p><em>Causal or Masked self-attention</em> ensures that the model's prediction for a certain position in a sequence is only dependent on the <em><strong>known previous outputs, and not on future tokens.</strong></em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l7CL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b142406-39c4-41fb-904f-959527828759_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l7CL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b142406-39c4-41fb-904f-959527828759_1456x1048.png 424w, 
https://substackcdn.com/image/fetch/$s_!l7CL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b142406-39c4-41fb-904f-959527828759_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!l7CL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b142406-39c4-41fb-904f-959527828759_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!l7CL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b142406-39c4-41fb-904f-959527828759_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l7CL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b142406-39c4-41fb-904f-959527828759_1456x1048.png" width="1456" height="1048" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b142406-39c4-41fb-904f-959527828759_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:97972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/159606505?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b142406-39c4-41fb-904f-959527828759_1456x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!l7CL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b142406-39c4-41fb-904f-959527828759_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!l7CL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b142406-39c4-41fb-904f-959527828759_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!l7CL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b142406-39c4-41fb-904f-959527828759_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!l7CL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b142406-39c4-41fb-904f-959527828759_1456x1048.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>How to Hide Future Tokens with Causal Attention?</h2><p>In causal attention, the attention weights above the diagonal are masked.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xHfo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a530d-b245-491b-bd76-8381eeca954f_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xHfo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a530d-b245-491b-bd76-8381eeca954f_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!xHfo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a530d-b245-491b-bd76-8381eeca954f_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!xHfo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a530d-b245-491b-bd76-8381eeca954f_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!xHfo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a530d-b245-491b-bd76-8381eeca954f_1456x1048.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!xHfo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a530d-b245-491b-bd76-8381eeca954f_1456x1048.png" width="1456" height="1048" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/185a530d-b245-491b-bd76-8381eeca954f_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:150445,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/159606505?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a530d-b245-491b-bd76-8381eeca954f_1456x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xHfo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a530d-b245-491b-bd76-8381eeca954f_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!xHfo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a530d-b245-491b-bd76-8381eeca954f_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!xHfo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a530d-b245-491b-bd76-8381eeca954f_1456x1048.png 1272w, 
https://substackcdn.com/image/fetch/$s_!xHfo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F185a530d-b245-491b-bd76-8381eeca954f_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is to ensure that for any given input, the LLM is unable to utilize future tokens while calculating the context vectors with the attention weight.</p><p>To achieve this, for each given token, we mask out the future tokens (the ones that come after the current token in the input text).</p><p>The simplest way is to mask out the unnormalized attention scores 
above the diagonal with negative infinity before they enter the softmax function.</p><pre><code><code>mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)
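
# Hedged sketch (not from the original post): a self-contained toy check that
# softmax turns the -inf entries into exact zeros, so each row of attention
# weights still sums to 1. `toy_scores` is an assumed stand-in for attn_scores;
# the division by sqrt(d_k) used in the CausalAttention class is left out here.
import torch  # repeated so this check runs on its own

toy_scores = torch.tensor([[0.1, 0.2, 0.3],
                           [0.4, 0.5, 0.6],
                           [0.7, 0.8, 0.9]])
toy_mask = torch.triu(torch.ones(3, 3), diagonal=1).bool()  # True above the diagonal
toy_weights = torch.softmax(toy_scores.masked_fill(toy_mask, -torch.inf), dim=-1)
toy_row_sums = toy_weights.sum(dim=-1)  # every entry is 1, up to float rounding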

"""
Output: 
tensor([[0.0988,   -inf,   -inf,   -inf,   -inf],
        [0.1345, 0.1951,   -inf,   -inf,   -inf],
        [0.1330, 0.1957, 0.2100,   -inf,   -inf],
        [0.1985, 0.2716, 0.2985, 0.1392,   -inf],
        [0.0688, 0.1003, 0.1079, 0.0486, 0.1117]],
       grad_fn=&lt;MaskedFillBackward0&gt;)
"""</code></code></pre><blockquote><p><em><strong>Note</strong>: After softmax is applied to these masked scores, the attention weights in each row sum to 1.</em></p></blockquote><h3>Masking Additional Attention Weights With Dropout</h3><p>In addition, we apply dropout to reduce <em>overfitting</em> during training of the LLM.</p><blockquote><p><em>Dropout is a deep learning technique in which randomly selected hidden-layer units are ignored during training to prevent overfitting and improve generalization.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!emwp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5c250f-bb42-4ab5-9400-1d667c770f22_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!emwp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5c250f-bb42-4ab5-9400-1d667c770f22_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!emwp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5c250f-bb42-4ab5-9400-1d667c770f22_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!emwp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5c250f-bb42-4ab5-9400-1d667c770f22_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!emwp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5c250f-bb42-4ab5-9400-1d667c770f22_1456x1048.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!emwp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5c250f-bb42-4ab5-9400-1d667c770f22_1456x1048.png" width="1456" height="1048" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3b5c250f-bb42-4ab5-9400-1d667c770f22_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:142751,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/159606505?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5c250f-bb42-4ab5-9400-1d667c770f22_1456x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!emwp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5c250f-bb42-4ab5-9400-1d667c770f22_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!emwp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5c250f-bb42-4ab5-9400-1d667c770f22_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!emwp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5c250f-bb42-4ab5-9400-1d667c770f22_1456x1048.png 1272w, 
https://substackcdn.com/image/fetch/$s_!emwp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b5c250f-bb42-4ab5-9400-1d667c770f22_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In GPT Models, dropout can be applied in several places, such as:</p><ul><li><p>after computing the attention weights</p></li><li><p>or, after multiplying the attention weights with the value vectors</p></li></ul><p>Though, it is recommended to apply the dropout mask after <em>computing the attention weights</em>.</p><p>Furthermore, in this specific example, 
we use a dropout rate of 50%, which means randomly masking out half of the attention weights. </p><p>Later while training the GPT model, we will use a lower dropout rate, such as 0.1 or 0.2.</p><p>If we apply a dropout rate of 0.5 (50%), the non-dropped values will be scaled accordingly by a factor of:</p><pre><code>1 / (1 - <code>dropout_rate</code>) = 1/0.5 = 2</code></pre><p>This scaling is crucial to maintain the overall balance of the attention weights, ensuring that the average influence of the attention mechanism remains consistent during both the training and inference phases.</p><div><hr></div><h2>Implementing <em>CausalAttention</em> Class</h2><p>Now, we are ready to implement a self-attention class, including the causal and dropout masks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9X_4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b79b337-58a2-4cb2-b1bc-2155bfcf3162_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9X_4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b79b337-58a2-4cb2-b1bc-2155bfcf3162_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!9X_4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b79b337-58a2-4cb2-b1bc-2155bfcf3162_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!9X_4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b79b337-58a2-4cb2-b1bc-2155bfcf3162_1456x1048.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9X_4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b79b337-58a2-4cb2-b1bc-2155bfcf3162_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9X_4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b79b337-58a2-4cb2-b1bc-2155bfcf3162_1456x1048.png" width="1456" height="1048" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b79b337-58a2-4cb2-b1bc-2155bfcf3162_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:98056,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/159606505?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b79b337-58a2-4cb2-b1bc-2155bfcf3162_1456x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9X_4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b79b337-58a2-4cb2-b1bc-2155bfcf3162_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!9X_4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b79b337-58a2-4cb2-b1bc-2155bfcf3162_1456x1048.png 848w, 
https://substackcdn.com/image/fetch/$s_!9X_4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b79b337-58a2-4cb2-b1bc-2155bfcf3162_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!9X_4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b79b337-58a2-4cb2-b1bc-2155bfcf3162_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Step 1</strong>: Compared to the previous <em><a 
href="https://analyticalnikita.substack.com/p/scaled-dot-product-attention-explained">SelfAttention class</a></em>, we added a dropout layer.</p><p><strong>Step 2</strong>: The <code>register_buffer</code> call is also a new addition; it stores the causal mask as a non-trainable buffer that moves to the same device as the model.</p><p><strong>Step 3</strong>: Transpose dimensions 1 and 2, keeping the batch dimension at the first position (i.e., 0).</p><pre><code><code>import torch
import torch.nn as nn

class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape # New batch dimension b
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)  # Swap dims 1 and 2; batch dim 0 stays in place
        # Slice the mask to `:num_tokens` in case the batch has fewer tokens than context_length
        attn_scores.masked_fill_(  # Trailing `_` means the operation is in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights) 

        context_vec = attn_weights @ values
        return context_vec</code></code></pre><blockquote><p><em><strong>Note:</strong> In PyTorch, operations with a trailing underscore (_) are performed in-place, avoiding unnecessary memory copies.</em></p></blockquote><p>Instantiating <code>CausalAttention</code> Class:</p><pre><code><code>print(d_in)
print(d_out)

torch.manual_seed(123)
# Instantiating CausalAttention Class
context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)

context_vecs = ca(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
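
# Hedged sketch (not from the original post): a standalone look at nn.Dropout,
# which the CausalAttention class applies to the attention weights. `toy_drop`
# and `toy_ones` are illustrative names. The instance above uses dropout=0.0,
# so its outputs are unaffected either way.
import torch  # repeated so this check runs on its own

toy_drop = torch.nn.Dropout(0.5)
toy_ones = torch.ones(6, 6)
toy_drop.train()                  # training mode: each entry becomes 0 or 1 / (1 - 0.5) = 2
toy_dropped = toy_drop(toy_ones)
toy_drop.eval()                   # evaluation mode: dropout is a no-op
toy_unchanged = toy_drop(toy_ones)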

"""
Output: 
3
2
tensor([[[-0.3585,  0.0415],
         [-0.4451,  0.0252],
         [-0.4656,  0.0393],
         [-0.5461, -0.0500],
         [-0.5129, -0.0367]],

        [[-0.3585,  0.0415],
         [-0.4451,  0.0252],
         [-0.4656,  0.0393],
         [-0.5461, -0.0500],
         [-0.5129, -0.0367]]], grad_fn=&lt;UnsafeViewBackward0&gt;)
context_vecs.shape: torch.Size([2, 5, 2])
"""</code></code></pre><p>As we can see, the resulting context vector is a 3D tensor of shape (2, 5, 2): 2 inputs in the batch, 5 tokens each, with every token now represented by a 2-dimensional embedding.</p><blockquote><p><em><strong>Note</strong>: Dropout is only applied during training (<code>model.train()</code>), not during inference (<code>model.eval()</code>).</em></p></blockquote><p>If you&#8217;d like to explore the full implementation, including code and data, check out the <em><strong><a href="https://github.com/nikitaprasad21/LLM-Cheat-Code/blob/main/Attention-Mechanism/Causal_Self_Attention_Mechanism.ipynb">GitHub Repository</a></strong></em> &#128072;&#127995;</p><div><hr></div><p>And that&#8217;s a wrap! </p><p>Next, we will expand on this concept and implement a multi-head attention module that runs several such causal attention mechanisms in parallel. </p><p>If you&#8217;ve made it this far, thank you so much! <em>Stay tuned with<strong> <a href="https://substack.com/@analyticalnikita">ME</a></strong></em>, so you won&#8217;t miss out on future updates.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzOTY3MDMxLCJpYXQiOjE3NDI2MzM2MTYsImV4cCI6MTc0NTIyNTYxNiwiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.3EDcUz0ZGbdQYaxUDtwo4Ug--8dAAMsgQ_z8NK7J2hs&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">&#10084;&#65039; If you found this helpful, leave a &#8220;heart&#8221;! And if you have any questions, suggestions, or thoughts, do drop me a line below. 
&#128395;&#65039;&#128071;</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/hiding-future-tokens?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/hiding-future-tokens?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[5 Ways to Write Better PyTorch Code ]]></title><description><![CDATA[Everyday Tips and Tricks to Improve Your Deep Learning Workflow!]]></description><link>https://analyticalnikita.substack.com/p/5-ways-to-write-better-pytorch-code</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/5-ways-to-write-better-pytorch-code</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Sat, 05 Apr 2025 10:36:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4385507c-914c-4e38-9a3f-a468e9be887c_1000x630.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>With large-scale advancements in Large Language Models and Generative AI, keeping up with best coding practices in PyTorch can give you a significant edge in this competitive market.</p><p>In this read, I discuss 5 practical tips to improve your PyTorch code!</p><blockquote><p><em>Want to learn the &#8220;<strong>WHY</strong>&#8221; and &#8220;<strong>HOW</strong>&#8221; of ML, Gen-AI, Data Science &amp; Analytics and much more? It&#8217;s all here. 
&#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><p>But first, let&#8217;s answer one question:</p><h2>Why PyTorch? </h2><p>The internet is full of debates over PyTorch vs. TensorFlow. While both have their respective pros and cons, PyTorch stands out for its flexibility and widespread adoption in both academia and industry. </p><p>With its <em><strong>Dynamic Computation Graph</strong></em> and powerful <em><strong>GPU acceleration</strong></em>, PyTorch makes deep learning development seamless and efficient.</p><p>That being said, let&#8217;s jump into the core of this article:</p><h2>How to write better PyTorch code? </h2><p>Here are 5 practical <em><strong>tips</strong></em> that can <em>elevate your coding practices</em>:</p><h3>1. Utilizing DataLoaders for Streamlining Data Processing</h3><p>While working with large datasets, manually loading data using loops can slow down the training process.</p><p>Instead, you can use PyTorch&#8217;s <code>DataLoader</code>, which handles batching and shuffling for you and can load data in parallel using worker processes (<em>multiprocessing</em>) for speed.</p><pre><code><code>import torch
from torch.utils.data import DataLoader, TensorDataset

# Creating a sample dataset
data = torch.randn(1000, 3)  # 1000 samples, 3 features
labels = torch.randint(0, 2, (1000,))  # Binary labels

# Converting dataset to PyTorch dataset (TensorDataset)
dataset = TensorDataset(data, labels)

# Use DataLoader for batching and shuffling the data
data_loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for batch in data_loader:
    input_batch, target_batch = batch
    print(input_batch.shape, target_batch.shape)  # Output: torch.Size([32, 3]) torch.Size([32]) (the last batch holds the remaining 8 samples)</code></code></pre><div><hr></div><h3>2. Using GPU Acceleration Efficiently </h3><p>One of the biggest advantages of PyTorch is its seamless support for GPUs. </p><p>However, many beginners forget to transfer their models and data to the GPU, which can result in slow CPU-based computation. </p><p>So, here&#8217;s the proper way to utilize your GPU:</p><pre><code><code>device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)  # Move model to GPU
data, labels = data.to(device), labels.to(device)  # Move data to GPU</code></code></pre><p>Additionally, you can also check for GPU&#8217;s memory usage:</p><pre><code><code>print(torch.cuda.memory_summary())</code></code></pre><p>This helps you <em>avoid memory overflow</em> issues when working with large models.</p><div><hr></div><h3>3. Using Automatic Mixed Precision (AMP) for Faster Training </h3><p>PyTorch provides native support for Mixed Precision Training, allowing you to <em>speed up model training while reducing memory usage</em>.</p><pre><code><code>from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

for data, labels in data_loader:
    optimizer.zero_grad()
    with autocast():  # Enables mixed precision
        outputs = model(data)
        loss = loss_fn(outputs, labels)
    
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()</code></code></pre><blockquote><p><em><strong>Note</strong>: GradScaler (gradient scaling) helps prevent gradients with small magnitudes from flushing to zero (&#8220;underflowing&#8221;) when training with mixed precision.</em></p></blockquote><p>On GPUs with hardware support for reduced precision, Mixed Precision Training can give a noticeable speed-up (a 2-3x increase is often quoted), usually with little or no loss in accuracy. The actual gain depends on your model and hardware.</p><div><hr></div><h3>4. Implementing Checkpoints for Long Training Sessions </h3><p>You probably know that DL models can take hours (or even days) to train. </p><p>So, to prevent losing progress due to unexpected shutdowns, it is recommended to always <em>save model checkpoints</em> at regular intervals.</p><pre><code><code># Saving Model
torch.save(model.state_dict(), "model_checkpoint.pth")

# Loading Model
model.load_state_dict(torch.load("model_checkpoint.pth"))</code></code></pre><p>Additionally, consider saving the optimizer state as well, so that training can resume seamlessly, as in the code snippet below:</p><pre><code># Saving model and optimizer state in one checkpoint
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": loss,
}, "checkpoint.pth")</code></pre><div><hr></div><h3>5. Profile Your Code for Performance Optimization </h3><p>PyTorch also provides you with a built-in profiler that helps detect bottlenecks in your code, such as <em>slow data loading, inefficient computations, or memory leaks</em>.</p><p>You can use this code:</p><pre><code>from torch.profiler import profile, record_function, ProfilerActivity

# Profiling the Code
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_inference"):
        model(data)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))</code></pre><p>This can help you optimize performance by identifying the most time-consuming parts of your code.</p><div><hr></div><p>As an aspiring data scientist, using these tips will not only help you streamline your deep learning workflows but also enhance your productivity.</p><blockquote><p><em><strong>Bonus:</strong> Read about how to find the best values for your hyperparameters and </em>utilize the Full Potential of your Computationally EXPENSIVE Neural Networks! &#128071;&#127995;</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6dfdb9fb-721e-4c33-9cd4-8e6bca5cc1aa&quot;,&quot;caption&quot;:&quot;As a data scientist, you never know the best value of your hyperparmeters to improve the performance of machine learning algorithms or statistical models.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Bayesian Hyperparameter Optimization Using Optuna&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:13692717,&quot;name&quot;:&quot;Nikita Prasad&quot;,&quot;bio&quot;:&quot;&#128187; Documenting my Learning in Simplified ways at Epochs of Data Insights. &#128640; Join my data-driven journey, today! 
&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ac9b439-e213-42c3-a12f-7a10f429b3f0_2160x2160.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-01-21T11:22:06.552Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d8b1838-5673-49e3-9ddf-8d4008b6bd1c_1000x630.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://analyticalnikita.substack.com/p/bayesian-hyperparameter-optimization&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:154889609,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:3,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Epochs of Data Insights&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F675a2477-6a67-4e93-b0a2-2cae74032449_1080x1080.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><p>Got more cool PyTorch tips? <strong>Comment Down</strong> &#128071;</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/5-ways-to-write-better-pytorch-code?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">If you found these tips helpful, leave a &#8220;heart&#8221;&#10084;&#65039;! And if you have any questions/ suggestions/ thoughts, do drop me a line below. 
&#128395;&#65039;&#128071;</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/5-ways-to-write-better-pytorch-code?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/5-ways-to-write-better-pytorch-code?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[Understanding Data and Concept Drift]]></title><description><![CDATA[...The Most Important Interview Question!]]></description><link>https://analyticalnikita.substack.com/p/understanding-data-and-concept-drift</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/understanding-data-and-concept-drift</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Tue, 01 Apr 2025 10:36:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oSoC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53b8433-3ebb-4ce2-bfbb-c122801b949c_1000x630.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>No model works forever.</p><p>YES, it degrades over time!</p><p>Machine learning models are trained with historical data.</p><p>But when deployed in the real-world, they encounter live data that constantly evolves. 
</p><p>As the environment changes, models can become outdated and lose accuracy over time&#8212;a phenomenon called <em><strong>model drift</strong></em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oSoC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53b8433-3ebb-4ce2-bfbb-c122801b949c_1000x630.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oSoC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53b8433-3ebb-4ce2-bfbb-c122801b949c_1000x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!oSoC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53b8433-3ebb-4ce2-bfbb-c122801b949c_1000x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!oSoC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53b8433-3ebb-4ce2-bfbb-c122801b949c_1000x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oSoC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53b8433-3ebb-4ce2-bfbb-c122801b949c_1000x630.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oSoC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53b8433-3ebb-4ce2-bfbb-c122801b949c_1000x630.jpeg" width="1000" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a53b8433-3ebb-4ce2-bfbb-c122801b949c_1000x630.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:87149,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/159607567?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53b8433-3ebb-4ce2-bfbb-c122801b949c_1000x630.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oSoC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53b8433-3ebb-4ce2-bfbb-c122801b949c_1000x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!oSoC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53b8433-3ebb-4ce2-bfbb-c122801b949c_1000x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!oSoC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53b8433-3ebb-4ce2-bfbb-c122801b949c_1000x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oSoC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa53b8433-3ebb-4ce2-bfbb-c122801b949c_1000x630.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So, let&#8217;s dive deeper into why this happens and how you can prevent model decay.</p><blockquote><p><em>&#9208;&#65039;<strong> Quick Pause</strong>: If you&#8217;re new here, I&#8217;d highly appreciate if you <strong>subscribe </strong>to receive bi-weekly data tips and insights &#8212; directly into your inbox. 
&#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h2>How Can Your Predictions FAIL?</h2><p>Imagine you&#8217;re a data scientist working on a <em>cab fare prediction model</em>.</p><p>Your goal is to ensure that predicted fares closely match real-world pricing.</p><p>Your model likely relies on input features such as:</p><ul><li><p>Distance of the ride,</p></li><li><p>Time of day (peak vs. non-peak),</p></li><li><p>Traffic conditions,</p></li><li><p>Weather conditions, etc.</p></li></ul><p>Everything works fine&#8212;until one day, your product manager rushes in with a serious issue: <em>fares are way off</em><strong>!</strong></p><p>Despite careful training and validation, the model&#8217;s accuracy has plummeted.</p><div><hr></div><h2>What&#8217;s Going Wrong?</h2><p>After ruling out any data quality issues, two usual suspects emerge: <em><strong>data drift</strong></em> and <em><strong>concept drift</strong></em>&#8212;two critical challenges in production ML systems.</p><p>So, it&#8217;s important to understand the difference between them because they require different approaches to detect and fix.</p><div><hr></div><h2><strong>What is Data Drift?</strong> </h2><p><strong>Data drift</strong> happens when the statistical distribution of the input data changes relative to the data the model was trained on.</p><p>This shift leads to unreliable predictions because the model encounters unfamiliar patterns.</p><h4><strong>Example:</strong></h4><p>Your cab fare model was trained on normal traffic patterns. Then <strong>new highways open</strong>, drastically changing travel times. 
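</p><p>To make this kind of shift measurable, here is a minimal sketch in plain Python that computes the Population Stability Index (PSI) between trip durations seen at training time and after the change. The durations are made-up numbers, and the helper name <code>psi</code>, the ten quantile bins and the small floor for empty bins are assumptions chosen for illustration, not a standard implementation:</p><pre><code>import bisect
import math
import random


def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference and a live sample."""
    ref = sorted(expected)
    # Interior bin edges taken from the reference sample's quantiles
    edges = [ref[len(ref) * i // n_bins] for i in range(1, n_bins)]

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor at a tiny value so an empty bin does not cause log(0)
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


random.seed(0)
# Hypothetical trip durations in minutes (invented for this sketch)
train_minutes = [random.gauss(30, 5) for _ in range(5000)]
same_minutes = [random.gauss(30, 5) for _ in range(5000)]  # traffic unchanged
fast_minutes = [random.gauss(22, 5) for _ in range(5000)]  # after new highways

print("PSI, unchanged traffic:", round(psi(train_minutes, same_minutes), 3))
print("PSI, after new highways:", round(psi(train_minutes, fast_minutes), 3))
</code></pre><p>A commonly quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift. In this sketch, the unchanged sample should land near zero and the post-highway sample far above that range.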
</p><p>Now, the model&#8217;s predictions are unreliable.</p><h3>How to Detect and Fix Data Drift?</h3><h4>Detection Methods:</h4><ol><li><p><strong>Feature Distribution Monitoring</strong>: Compare real-time input distributions with historical training data using measures such as the <em>Kolmogorov-Smirnov (KS)</em> test or the <em>Population Stability Index (PSI)</em>.</p></li><li><p><strong>Drift Metrics</strong>: Track changes in the <em>mean, variance,</em> and<em> percentiles</em> of key input features over time.</p></li><li><p><strong>Out-of-Distribution (OOD) Detection</strong>: Use anomaly detection models to flag inputs that differ significantly from training data.</p></li></ol><h4>Mitigation Strategies:</h4><ul><li><p><strong>Frequent Model Retraining</strong>: Periodically update the model using the latest data.</p></li><li><p><strong>Adaptive Models</strong>: Implement online learning techniques that allow models to adapt dynamically to new data distributions.</p></li></ul><div><hr></div><h2><strong>What is Concept Drift?</strong> </h2><p>Unlike data drift, <strong>concept drift </strong>occurs when the relationship between features and the target changes.</p><p>Practically speaking, the meaning of what you&#8217;re trying to predict has changed. </p><h4><strong>Example:</strong></h4><p>Say your cab fare model was trained on a <strong>per-mile pricing</strong> system. Then stakeholders decide to switch to a <strong>flat-rate fare</strong> structure. 
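</p><p>The pricing switch can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the fare rules, the noise level and the simple least-squares line standing in for the fare model. The point is only to show distances with the same distribution producing very different errors once the pricing rule changes:</p><pre><code>import random

random.seed(1)
# Hypothetical ride distances in miles; same distribution before and after
miles_before = [random.uniform(1, 20) for _ in range(2000)]
miles_after = [random.uniform(1, 20) for _ in range(2000)]


def per_mile_fare(miles):
    return 3.0 + 2.5 * miles  # assumed base fee plus per-mile rate


def flat_fare(miles):
    return 25.0  # assumed flat rate, independent of distance


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Train the stand-in model on the old per-mile fares (with a little noise)
fares_before = [per_mile_fare(m) + random.gauss(0, 1) for m in miles_before]
slope, intercept = fit_line(miles_before, fares_before)


def mean_abs_error(xs, ys):
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Fares observed after the switch to a flat rate
fares_after = [flat_fare(m) + random.gauss(0, 1) for m in miles_after]

mae_before = mean_abs_error(miles_before, fares_before)
mae_after = mean_abs_error(miles_after, fares_after)
print("MAE under per-mile pricing:", round(mae_before, 2))
print("MAE under flat-rate pricing:", round(mae_after, 2))
</code></pre><p>The fitted line tracks the old per-mile fares closely, then misses badly once the flat rate applies.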
</p><p>The input features remain the same, but their relationship to fare price has changed&#8212;making the model&#8217;s predictions inaccurate.</p><h3><strong>How to Detect and Fix Concept Drift?</strong></h3><h4><strong>Detection Methods:</strong></h4><ol><li><p><strong>Feature-Target Relationship Analysis</strong>: Continuously track correlations between input variables (e.g., distance, time) and the target variable (fare price).</p></li><li><p><strong>Error Distribution Monitoring</strong>: If prediction errors systematically increase over time, it may indicate concept drift.</p></li><li><p><strong>Model Comparison</strong>: Maintain a simple baseline model; if it starts to outperform the production model, concept drift may be at play.</p></li></ol><h4><strong>Mitigation Strategies:</strong></h4><ul><li><p><strong>Business Rule Awareness</strong>: Work closely with product teams to stay updated on evolving pricing policies.</p></li><li><p><strong>Hybrid Approaches</strong>: Combine rule-based adjustments with ML models to accommodate dynamic pricing structures.</p></li></ul><div><hr></div><blockquote><p><em><strong>REMEMBER</strong></em>: <em>To prevent silent model degradation, always ask:</em></p><ul><li><p><em>Has the <strong>data changed</strong>? </em></p></li><li><p><em>Has the <strong>business logic shifted</strong>?</em></p></li><li><p><em>Is the <strong>model still aligned with reality</strong>?</em></p></li></ul><p><em>By actively monitoring for data and concept drift, retraining models and collaborating with domain experts, you can stay ahead of model decay before it impacts business decisions. </em></p></blockquote><p><strong>Comment Down</strong> &#128071;: <em>Have you ever faced model degradation? 
How did you handle it?</em></p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzOTY3MDMxLCJpYXQiOjE3NDEwMjE3OTgsImV4cCI6MTc0MzYxMzc5OCwiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.BcAvyA5l6iao5MFj8b00rCuka3kkluditHSBt6fISkY&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">&#10084;&#65039; If you found this helpful, leave a &#8220;heart&#8221;! And if you have any questions/ suggestions/ thoughts, do drop me a line below. &#128395;&#65039;&#128071;</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/understanding-data-and-concept-drift?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/understanding-data-and-concept-drift?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[Scaled Dot-Product Attention Explained! 
]]></title><description><![CDATA[Beginners Friendly In-Depth Illustrated Guide]]></description><link>https://analyticalnikita.substack.com/p/scaled-dot-product-attention-explained</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/scaled-dot-product-attention-explained</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Sat, 29 Mar 2025 08:36:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hCuG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Previously, I covered a high-level overview of the <em><strong>Simple Attention Mechanism without Trainable Weights. </strong></em>Missed it? (really?) Go check it<em> <strong><a href="https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over">out</a></strong>. </em></p><blockquote><p><em>And while you&#8217;re at it, <strong>Subscribe</strong> so that you don&#8217;t miss any more content like this.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><p>In this read, let&#8217;s gear up to improve the model by adding some trainable weights, just like those used in the <em>original transformer architecture</em>, the <em>GPT models</em>, and most other popular <em>LLMs</em>.</p><blockquote><p><em><strong>Note</strong>: This Self-Attention Mechanism is also called "<strong>Scaled Dot-Product Attention</strong>".</em></p></blockquote><p>Here&#8217;s the overall idea (similar to 
before):</p><ul><li><p>Computing <em><strong>context vectors</strong></em> as <em>weighted sums</em> over the <em><strong>input vectors, </strong></em>specific to a certain input element.</p></li><li><p>For this, you need <em><strong>attention weights</strong></em> (normalized attention scores that sum up to 1, using the <em>softmax function</em>).</p></li></ul><p>Here&#8217;s the modified architecture: &#128071;&#127995;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hCuG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hCuG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!hCuG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!hCuG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!hCuG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hCuG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png" width="1456" height="1048" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:110364,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/153967051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hCuG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!hCuG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!hCuG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png 1272w, 
https://substackcdn.com/image/fetch/$s_!hCuG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82fb939a-b7ba-4330-8af4-209943a4c320_1456x1048.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You can realise, there&#8217;re only slight differences compared to the <em><strong><a href="https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over">basic attention mechanism</a></strong></em>, introduced earlier:</p><ul><li><p>The most notable difference is the introduction of weight matrices (<em><strong>W<sub>q</sub></strong></em>, 
<em><strong>W<sub>k</sub></strong></em>, and <em><strong>W<sub>v</sub></strong></em>) that are updated during model training.</p><ul><li><p>These trainable weight matrices are crucial so that the model (specifically, the attention module inside the model) can learn to produce <em>reliable</em> context vectors.</p></li></ul></li><li><p>Also, you now have to scale the attention scores by dividing them by the square root of the key embedding dimension, <em><strong>&#8730;d<sub>k</sub></strong></em> (i.e., <code>d_k**0.5</code>).</p></li></ul><p>So, that being said, let&#8217;s get into the code and explore more!</p><div><hr></div><h2>Implementing Self-Attention with Trainable Weights</h2><p>Let me start by introducing the three trainable weight matrices: <em><strong>W<sub>q</sub></strong></em>, <em><strong>W<sub>k</sub></strong></em>, and <em><strong>W<sub>v</sub></strong></em>.</p><p>These three matrices are used to project the embedded input tokens, <strong>x<sub>i</sub></strong>, into query, key, and value vectors via matrix multiplication:</p><ul><li><p>Query vector: <em><strong>q<sub>i</sub> = W<sub>q</sub>x<sub>i</sub></strong></em></p></li><li><p>Key vector: <em><strong>k<sub>i</sub> = W<sub>k</sub>x<sub>i</sub></strong></em></p></li><li><p>Value vector: <em><strong>v<sub>i</sub> = W<sub>v</sub>x<sub>i</sub></strong></em></p></li></ul><p>The embedding dimensions of the input <em>x </em>and the query vector <em>q</em> can be the same or different, depending on the model's design and specific implementation.</p><pre><code><code>import torch

input_emb = torch.tensor([
    [0.12, 0.45, 0.67],  # "Attention"
    [0.34, 0.56, 0.78],  # "Mechanism"
    [0.23, 0.57, 0.91],  # "drives"
    [0.76, 0.88, 0.45],  # "contextual"
    [0.54, 0.12, 0.34]   # "embedding"
], dtype=torch.float32)</code></code></pre><blockquote><p><em><strong>Note</strong></em>: In GPT models, the input and output dimensions are usually the same.</p></blockquote><p>But for illustration purposes, we&#8217;re choosing different input and output dimensions to make the computation easier to follow:</p><pre><code><code>d_in = input_emb.shape[1] # the input embedding size, d=3
d_out = 2                 # the output embedding size, d=2</code></code></pre><h4><strong>Following steps are implemented in the below code snippet:</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QLRQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6781ea20-74fd-4923-b242-b72d4ea09f03_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QLRQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6781ea20-74fd-4923-b242-b72d4ea09f03_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!QLRQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6781ea20-74fd-4923-b242-b72d4ea09f03_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!QLRQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6781ea20-74fd-4923-b242-b72d4ea09f03_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!QLRQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6781ea20-74fd-4923-b242-b72d4ea09f03_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QLRQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6781ea20-74fd-4923-b242-b72d4ea09f03_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6781ea20-74fd-4923-b242-b72d4ea09f03_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102810,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/153967051?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6781ea20-74fd-4923-b242-b72d4ea09f03_1456x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QLRQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6781ea20-74fd-4923-b242-b72d4ea09f03_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!QLRQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6781ea20-74fd-4923-b242-b72d4ea09f03_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!QLRQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6781ea20-74fd-4923-b242-b72d4ea09f03_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!QLRQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6781ea20-74fd-4923-b242-b72d4ea09f03_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Step 1: Convert Input to Query, Key, and Value Vectors</strong>: Multiply the <em><a href="https://analyticalnikita.substack.com/p/how-llms-embeds-input-tokens">input embeddings</a></em> by the weight matrices (<strong>Wq, Wk, Wv</strong>) to get <strong>queries (Q), keys (K), and values (V)</strong>.</p><p><strong>Step 2: Compute Attention Scores:</strong> Take the <strong>dot product of Queries and Keys</strong> to measure how much each word should focus on the others.</p><p><strong>Step 3: Scale the Scores:</strong> Divide by the <strong>square root of the key dimension</strong> (&#8730;d&#8342;) to keep the values stable.</p><p><strong>Step 4: Apply Softmax</strong>: Convert each row of scores into <strong>probabilities</strong> that sum to 1 (higher score = more focus).</p><p><strong>Step 5: Compute Context Vectors</strong>: Multiply 
<strong>attention weights with Value vectors (V)</strong> to get the final <strong>context representation</strong>.</p><p><strong>Step 6: Return Context Vectors</strong>: These final <strong>context vectors</strong> are returned and will be <strong>used in further layers</strong> of the GPT model.</p><pre><code><code>import torch.nn as nn

class SelfAttention(nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec</code></code></pre><blockquote><p><strong>Note</strong>: <em>Instead of manually defining trainable weight matrices, we use <strong>PyTorch&#8217;s linear layers</strong> (</em><code>nn.Linear</code><em>), which come with a preferred weight initialization scheme that leads to more stable model training.</em></p></blockquote><p>Let&#8217;s instantiate an object of the <code>SelfAttention</code> class and apply it to our input embeddings:</p><pre><code><code>torch.manual_seed(123)
sa = SelfAttention(d_in, d_out)
print(sa(input_emb))

"""
Output: 
tensor([[-0.5128, -0.0366],
        [-0.5141, -0.0376],
        [-0.5143, -0.0377],
        [-0.5143, -0.0377],
        [-0.5129, -0.0367]], grad_fn=&lt;MmBackward0&gt;)
"""</code></code></pre><p>Bravo! You have successfully implemented a self-attention mechanism.</p><p>However, we&#8217;ve only scratched the surface.</p><blockquote><p><em>Want to go deeper? <strong>Subscribe</strong> for upcoming deep dives into GPT Architectures!</em></p></blockquote><p>If you&#8217;d like to explore the full implementation, including code and data, check out the <em><strong><a href="https://github.com/nikitaprasad21/LLM-Cheat-Code/blob/main/Attention-Mechanism/Self_Attention_Mechanism.ipynb">GitHub repository</a></strong></em> &#128072;&#127995;</p><div><hr></div><p>And that&#8217;s a wrap! If you&#8217;ve made it this far, thank you so much! <em>Stay tuned with<strong> <a href="https://substack.com/@analyticalnikita">ME</a></strong></em>, so you won&#8217;t miss out on future updates.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzOTY3MDMxLCJpYXQiOjE3NDI2MzM2MTYsImV4cCI6MTc0NTIyNTYxNiwiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.3EDcUz0ZGbdQYaxUDtwo4Ug--8dAAMsgQ_z8NK7J2hs&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">&#10084;&#65039; If you found this helpful, leave a &#8220;heart&#8221;! And if you have any questions, suggestions, or thoughts, do drop me a line below. 
&#128395;&#65039;&#128071;</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/scaled-dot-product-attention-explained?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/scaled-dot-product-attention-explained?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[6 Important Biases You Should Know]]></title><description><![CDATA[That RUIN Your Data-Driven Decisions!]]></description><link>https://analyticalnikita.substack.com/p/5-important-biases-you-should-know</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/5-important-biases-you-should-know</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Tue, 25 Mar 2025 10:36:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!doQA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55e8211-8435-4f68-a2be-0da990753ce2_1000x630.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Many times, data (or how you interpret it) can be misleading. </p><p>And as a data scientist, your job is to extract meaningful insights from data. </p><p>But, <em><strong>biases</strong></em> can silently creep into the models, leading to inaccurate predictions and poor decision-making. 
</p><p>So, let&#8217;s explore 6 most critical biases you must watch out for and how to handle them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!doQA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55e8211-8435-4f68-a2be-0da990753ce2_1000x630.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!doQA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55e8211-8435-4f68-a2be-0da990753ce2_1000x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!doQA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55e8211-8435-4f68-a2be-0da990753ce2_1000x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!doQA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55e8211-8435-4f68-a2be-0da990753ce2_1000x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!doQA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55e8211-8435-4f68-a2be-0da990753ce2_1000x630.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!doQA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55e8211-8435-4f68-a2be-0da990753ce2_1000x630.jpeg" width="1000" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c55e8211-8435-4f68-a2be-0da990753ce2_1000x630.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:105655,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/159809412?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55e8211-8435-4f68-a2be-0da990753ce2_1000x630.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!doQA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55e8211-8435-4f68-a2be-0da990753ce2_1000x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!doQA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55e8211-8435-4f68-a2be-0da990753ce2_1000x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!doQA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55e8211-8435-4f68-a2be-0da990753ce2_1000x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!doQA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc55e8211-8435-4f68-a2be-0da990753ce2_1000x630.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><em>&#9208;&#65039;<strong> Quick Pause</strong>: If you&#8217;re new here, I&#8217;d highly appreciate it if you <strong>subscribe </strong>to receive bi-weekly data tips and insights directly in your inbox. &#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h3>1. <strong>Correlation vs. 
Causation Fallacy</strong> &#8211; Not Everything That Moves Together Is Connected</h3><p>A company notices that employees who wear headphones are more productive. </p><p>But does wearing headphones cause higher productivity, or are highly focused employees simply more likely to wear headphones?</p><p>This is the classic scenario of confusing <strong>correlation </strong><em>(two things happening together) </em>with<strong> causation </strong><em>(one thing directly causing the other)</em><strong>.</strong></p><h4><strong>How to avoid it?</strong> </h4><ul><li><p>Always ask: <strong>Could there be a third factor influencing both variables? </strong></p></li><li><p>Use <strong>controlled experiments</strong> (e.g., A/B testing) to establish causation. </p></li></ul><blockquote><p><em><strong>BONUS READ:</strong></em> &#128071;&#127995;</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8e5f4403-2324-4f3f-bc89-f7302f777dd8&quot;,&quot;caption&quot;:&quot;Correlation Doesn't Implies Causation!&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;showDescription&quot;:true,&quot;showImage&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Correlation Doesn't Implies Causation!&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:13692717,&quot;name&quot;:&quot;Nikita Prasad&quot;,&quot;bio&quot;:&quot;&#128187; Documenting my Learning in Simplified ways at Epochs of Data Insights. &#128640; Join my data-driven journey, today! 
&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ac9b439-e213-42c3-a12f-7a10f429b3f0_2160x2160.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-01-08T14:22:57.232Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F200d585b-6b28-42d6-8c04-b86b8be8f4c0_1000x630.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://analyticalnikita.substack.com/p/correlation-doesnt-implies-causation&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:154065800,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:4,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Epochs of Data Insights&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F675a2477-6a67-4e93-b0a2-2cae74032449_1080x1080.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><div><hr></div><h3><strong>2. Simpson&#8217;s Paradox</strong> &#8211; When Trends Trick You</h3><p>Imagine you&#8217;re analyzing hospital performance and find that Hospital A has a higher survival rate than Hospital B. </p><p>Seems simple and clear, right? </p><p>But when you break it down by patient severity, Hospital B actually has better survival rates for both mild and critical cases. 
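</p><p>To make this concrete, here is a minimal Python sketch with made-up patient counts. The numbers, the <code>data</code> dictionary, and the <code>overall_rate</code> helper are all hypothetical and chosen only so that the reversal shows up:</p><pre><code><code># Hypothetical (survived, treated) counts per hospital and severity group.
# All numbers are invented purely for illustration.
data = {
    "Hospital A": {"mild": (855, 900), "critical": (30, 100)},
    "Hospital B": {"mild": (98, 100), "critical": (360, 900)},
}


def overall_rate(groups):
    """Survival rate across all severity groups of one hospital."""
    survived = sum(s for s, _ in groups.values())
    treated = sum(t for _, t in groups.values())
    return survived / treated


for hospital, groups in data.items():
    # Aggregated view: one number per hospital.
    print(hospital, "overall:", round(overall_rate(groups), 3))
    # Segmented view: one number per severity group.
    for severity, (survived, treated) in groups.items():
        print("  ", severity, round(survived / treated, 3))</code></code></pre><p>With these invented counts, Hospital A comes out ahead overall (88.5% vs. 45.8%), yet Hospital B is ahead within both the mild group (98% vs. 95%) and the critical group (40% vs. 30%). 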
</p><p>What happened?</p><p>This is called <strong>Simpson&#8217;s Paradox.</strong>  </p><p>When groups of different sizes are combined, a trend that appears in the aggregated data can reverse once you break it down by group.</p><p>The problem arises because the subgroups differ in size and behavior, for example when one hospital treats far more critical patients than the other.</p><h4><strong>How to avoid it?</strong> </h4><ul><li><p>Always segment your data to check if trends hold across different groups. </p></li><li><p>Look at <strong>confounding variables</strong>: hidden factors that could explain the results.</p></li></ul><div><hr></div><h3>3. <strong>Survivorship Bias</strong> &#8211; The Data You Don&#8217;t See Matters</h3><p>During World War II, engineers studied bullet holes on returning planes to decide where to add armor. </p><p>Their first instinct was to reinforce the areas with the most bullet holes.</p><p>But they were missing a key insight: <strong>planes hit in other areas never returned</strong>.</p><p>This is <strong>survivorship bias</strong>, focusing only on successful cases while ignoring failures.</p><h4><strong>How to avoid it?</strong> </h4><ul><li><p>Ask <strong>what&#8217;s missing from the dataset</strong> before drawing conclusions. </p></li><li><p>Look at failure cases as well, not just success stories.</p></li></ul><div><hr></div><h3>4. <strong>Selection Bias</strong> &#8211; The Wrong Sample, the Wrong Conclusions</h3><p>Imagine you analyze customer feedback and find that <strong>95% of users love your app.</strong> </p><p>You conclude that your product is performing exceptionally well. </p><p>However, your data only includes responses from active users, those who <strong>already enjoy the app.</strong></p><p><strong>The problem?</strong> </p><p>You're missing feedback from <strong>users who churned</strong> or never engaged in the first place. 
</p><p>If you only analyze feedback from happy users, you might <strong>overestimate customer satisfaction</strong> and miss critical pain points that drive users away.</p><p>This is <strong>selection bias</strong>&#8212;when your sample doesn&#8217;t represent the entire population.</p><h4><strong>How to avoid it?</strong> </h4><ul><li><p> Use <strong>random sampling</strong> to ensure diverse representation. </p></li><li><p> Be mindful of who is excluded from the dataset.</p></li></ul><div><hr></div><h3>5. <strong>Confirmation Bias</strong> &#8211; Seeing What You Want to See</h3><p>People who believe in a conspiracy theory tend to <strong>mostly read sources</strong> that support it.</p><p>That&#8217;s <strong>confirmation bias</strong>&#8212;favoring information that confirms your existing beliefs while ignoring contradictory evidence.</p><p>In data science, this happens when you test models but only report metrics that support your hypothesis.</p><h4><strong>How to avoid it?</strong> </h4><ul><li><p>Challenge your assumptions&#8212;ask, <strong>What would prove me wrong?</strong> </p></li><li><p>Perform <strong>A/B testing</strong> and validate results using multiple methods.</p></li></ul><div><hr></div><h3>6. <strong>Omitted Variable Bias</strong> &#8211; The Missing Piece of the Puzzle</h3><p>A study finds that students who drink coffee score higher on exams. </p><p>But it <strong>ignores</strong> the fact that students who drink coffee might also study more.</p><p><strong>Omitted variable bias</strong> happens when you ignore an important factor that affects both the cause and effect.</p><h4><strong>How to avoid it?</strong> </h4><ul><li><p>Identify <strong>all possible influences</strong> before drawing conclusions. 
</p></li><li><p>Use domain expertise and statistical tests to check for missing variables.</p></li></ul><div><hr></div><p>I hope this guide <strong>helped you gain a deeper understanding</strong> of biases and their impact, so you can build <strong>robust models</strong> and produce <strong>accurate, fair insights</strong>.</p><blockquote><p><em><strong>REMEMBER</strong></em>: </p><ul><li><p><em>always question your data, </em></p></li><li><p><em>validate your assumptions,</em></p></li><li><p><em>look for the bigger picture rather than only what&#8217;s easy to measure, and</em></p></li><li><p><em>be wary of drawing conclusions from observational data alone.</em></p></li></ul></blockquote><p><strong>Comment below</strong> &#128071;: <em>Have you ever made a decision based on data that later turned out to be misleading?</em></p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzOTY3MDMxLCJpYXQiOjE3NDEwMjE3OTgsImV4cCI6MTc0MzYxMzc5OCwiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.BcAvyA5l6iao5MFj8b00rCuka3kkluditHSBt6fISkY&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">&#10084;&#65039; If you found this helpful, leave a &#8220;heart&#8221;! And if you have any questions, suggestions, or thoughts, do drop me a line below. 
&#128395;&#65039;&#128071;</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/5-important-biases-you-should-know?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/5-important-biases-you-should-know?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[5 Assumptions of Linear Regression]]></title><description><![CDATA[(BIG MISTAKE) 50% of Data Enthusiasts Overlook This!]]></description><link>https://analyticalnikita.substack.com/p/5-assumptions-of-linear-regression</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/5-assumptions-of-linear-regression</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Tue, 18 Mar 2025 10:36:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dF5z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bf0c17-52af-4c63-aa1f-27f59bb0030b_1344x960.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When building a linear regression model, it&#8217;s tempting to focus solely on achieving a <strong>high adjusted R&#178;</strong> and <strong>minimizing mean squared error</strong>. 
</p><p>However, a model isn&#8217;t truly reliable unless it meets key statistical assumptions.</p><p><em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/tree/main/Linear-Regression">Linear regression</a></strong></em> relies on several assumptions to ensure the validity and reliability of the model's results.</p><p>In this read, let&#8217;s explore the <em><strong>5 Main Assumptions of Linear Regression</strong></em>, in detail.</p><blockquote><p><em>Before that, if you&#8217;re new here <strong>Subscribe</strong>, as my goal is to simplify Data Science for you. &#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><p>Now, let&#8217;s get started! </p><div><hr></div><h2><strong>Assumptions of Linear Regression</strong>:</h2><p>These assumptions are often asked in <em>Data Science interviews</em>, so understanding them is crucial!</p><h2>1. Linearity Among Dependent and Independent Variables</h2><p>For linear regression to work, the relationship between the independent variables (predictors) and the dependent variable (response) should be <em>approximately linear</em>. </p><p>This means that the change in the response variable should be proportional to changes in the predictor variables. 
</p><h3><strong>How to check?</strong></h3><p>You can verify this assumption by examining:</p><ul><li><p><strong>Scatter plots</strong>: Check if the data points form a straight-line pattern.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dF5z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bf0c17-52af-4c63-aa1f-27f59bb0030b_1344x960.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dF5z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bf0c17-52af-4c63-aa1f-27f59bb0030b_1344x960.png 424w, https://substackcdn.com/image/fetch/$s_!dF5z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bf0c17-52af-4c63-aa1f-27f59bb0030b_1344x960.png 848w, https://substackcdn.com/image/fetch/$s_!dF5z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bf0c17-52af-4c63-aa1f-27f59bb0030b_1344x960.png 1272w, https://substackcdn.com/image/fetch/$s_!dF5z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bf0c17-52af-4c63-aa1f-27f59bb0030b_1344x960.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dF5z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bf0c17-52af-4c63-aa1f-27f59bb0030b_1344x960.png" width="1344" height="960" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14bf0c17-52af-4c63-aa1f-27f59bb0030b_1344x960.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:960,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image.png" title="image.png" srcset="https://substackcdn.com/image/fetch/$s_!dF5z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bf0c17-52af-4c63-aa1f-27f59bb0030b_1344x960.png 424w, https://substackcdn.com/image/fetch/$s_!dF5z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bf0c17-52af-4c63-aa1f-27f59bb0030b_1344x960.png 848w, https://substackcdn.com/image/fetch/$s_!dF5z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bf0c17-52af-4c63-aa1f-27f59bb0030b_1344x960.png 1272w, https://substackcdn.com/image/fetch/$s_!dF5z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bf0c17-52af-4c63-aa1f-27f59bb0030b_1344x960.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>Residual plots: Ensure there is no curved pattern.</p></li></ul><h2>2. No Multicollinearity in the Data</h2><p>Multicollinearity occurs when one predictor variable is <em>highly correlated</em> with one or more other predictor variables.  </p><p>This makes it difficult for the model to determine which variable is actually influencing the dependent variable.</p><h3><strong>How to check?</strong></h3><ul><li><p><strong>Variance Inflation Factor (VIF):</strong> Measures how much a predictor&#8217;s variance is increased by multicollinearity. 
If VIF &gt; 5 or 10, it indicates high multicollinearity.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zUGY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bcd12fc-3c6d-4882-b9e5-11ec1251746d_1117x394.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zUGY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bcd12fc-3c6d-4882-b9e5-11ec1251746d_1117x394.png 424w, https://substackcdn.com/image/fetch/$s_!zUGY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bcd12fc-3c6d-4882-b9e5-11ec1251746d_1117x394.png 848w, https://substackcdn.com/image/fetch/$s_!zUGY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bcd12fc-3c6d-4882-b9e5-11ec1251746d_1117x394.png 1272w, https://substackcdn.com/image/fetch/$s_!zUGY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bcd12fc-3c6d-4882-b9e5-11ec1251746d_1117x394.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zUGY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bcd12fc-3c6d-4882-b9e5-11ec1251746d_1117x394.png" width="1117" height="394" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1bcd12fc-3c6d-4882-b9e5-11ec1251746d_1117x394.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:1117,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:30940,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/159316137?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bcd12fc-3c6d-4882-b9e5-11ec1251746d_1117x394.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zUGY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bcd12fc-3c6d-4882-b9e5-11ec1251746d_1117x394.png 424w, https://substackcdn.com/image/fetch/$s_!zUGY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bcd12fc-3c6d-4882-b9e5-11ec1251746d_1117x394.png 848w, https://substackcdn.com/image/fetch/$s_!zUGY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bcd12fc-3c6d-4882-b9e5-11ec1251746d_1117x394.png 1272w, https://substackcdn.com/image/fetch/$s_!zUGY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bcd12fc-3c6d-4882-b9e5-11ec1251746d_1117x394.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Correlation Matrix:</strong> Identifies highly correlated predictor variables using the <code>.corr()</code> method of a Pandas DataFrame.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bOJO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55aa84a7-ed26-40b0-afd5-bbad02a003fa_766x536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bOJO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55aa84a7-ed26-40b0-afd5-bbad02a003fa_766x536.png 424w, 
https://substackcdn.com/image/fetch/$s_!bOJO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55aa84a7-ed26-40b0-afd5-bbad02a003fa_766x536.png 848w, https://substackcdn.com/image/fetch/$s_!bOJO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55aa84a7-ed26-40b0-afd5-bbad02a003fa_766x536.png 1272w, https://substackcdn.com/image/fetch/$s_!bOJO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55aa84a7-ed26-40b0-afd5-bbad02a003fa_766x536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bOJO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55aa84a7-ed26-40b0-afd5-bbad02a003fa_766x536.png" width="766" height="536" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55aa84a7-ed26-40b0-afd5-bbad02a003fa_766x536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:536,&quot;width&quot;:766,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bOJO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55aa84a7-ed26-40b0-afd5-bbad02a003fa_766x536.png 424w, 
https://substackcdn.com/image/fetch/$s_!bOJO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55aa84a7-ed26-40b0-afd5-bbad02a003fa_766x536.png 848w, https://substackcdn.com/image/fetch/$s_!bOJO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55aa84a7-ed26-40b0-afd5-bbad02a003fa_766x536.png 1272w, https://substackcdn.com/image/fetch/$s_!bOJO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55aa84a7-ed26-40b0-afd5-bbad02a003fa_766x536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><strong>Important Note: </strong><em>Just because two features are correlated doesn&#8217;t mean one causes the other. This is known as the <strong><a href="https://analyticalnikita.substack.com/p/correlation-doesnt-implies-causation">Correlation vs Causation Fallacy</a>.</strong></em></p><p><em>While this may seem obvious, computers can't really differentiate between correlation and causation, which is why human insight is required.</em></p></blockquote><h2>3. Normality of Residuals (or Errors)</h2><p>The residuals (errors) should be <em>normally distributed</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F3H8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72c7071-dc29-4770-a804-0b4dc97e85c4_834x552.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F3H8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72c7071-dc29-4770-a804-0b4dc97e85c4_834x552.png 424w, https://substackcdn.com/image/fetch/$s_!F3H8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72c7071-dc29-4770-a804-0b4dc97e85c4_834x552.png 848w, https://substackcdn.com/image/fetch/$s_!F3H8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72c7071-dc29-4770-a804-0b4dc97e85c4_834x552.png 1272w, https://substackcdn.com/image/fetch/$s_!F3H8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72c7071-dc29-4770-a804-0b4dc97e85c4_834x552.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F3H8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72c7071-dc29-4770-a804-0b4dc97e85c4_834x552.png" width="834" height="552" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c72c7071-dc29-4770-a804-0b4dc97e85c4_834x552.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:552,&quot;width&quot;:834,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image.png" title="image.png" srcset="https://substackcdn.com/image/fetch/$s_!F3H8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72c7071-dc29-4770-a804-0b4dc97e85c4_834x552.png 424w, https://substackcdn.com/image/fetch/$s_!F3H8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72c7071-dc29-4770-a804-0b4dc97e85c4_834x552.png 848w, https://substackcdn.com/image/fetch/$s_!F3H8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72c7071-dc29-4770-a804-0b4dc97e85c4_834x552.png 1272w, https://substackcdn.com/image/fetch/$s_!F3H8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72c7071-dc29-4770-a804-0b4dc97e85c4_834x552.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><strong>Note</strong>: <em>While this is not required for estimating regression coefficients, it is crucial for hypothesis testing and constructing confidence intervals.</em></p></blockquote><h3><strong>How to check?</strong></h3><p>Normality of errors can be checked using techniques such as:</p><ul><li><p><strong>Kernel density estimation (KDE): </strong>A smoothed representation of the residual distribution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!INxt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb32692df-ad56-4f0c-9bcf-4f99dfba1081_479x477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!INxt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb32692df-ad56-4f0c-9bcf-4f99dfba1081_479x477.png 424w, https://substackcdn.com/image/fetch/$s_!INxt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb32692df-ad56-4f0c-9bcf-4f99dfba1081_479x477.png 848w, https://substackcdn.com/image/fetch/$s_!INxt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb32692df-ad56-4f0c-9bcf-4f99dfba1081_479x477.png 1272w, https://substackcdn.com/image/fetch/$s_!INxt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb32692df-ad56-4f0c-9bcf-4f99dfba1081_479x477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!INxt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb32692df-ad56-4f0c-9bcf-4f99dfba1081_479x477.png" width="479" height="477" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b32692df-ad56-4f0c-9bcf-4f99dfba1081_479x477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:477,&quot;width&quot;:479,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!INxt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb32692df-ad56-4f0c-9bcf-4f99dfba1081_479x477.png 424w, https://substackcdn.com/image/fetch/$s_!INxt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb32692df-ad56-4f0c-9bcf-4f99dfba1081_479x477.png 848w, https://substackcdn.com/image/fetch/$s_!INxt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb32692df-ad56-4f0c-9bcf-4f99dfba1081_479x477.png 1272w, https://substackcdn.com/image/fetch/$s_!INxt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb32692df-ad56-4f0c-9bcf-4f99dfba1081_479x477.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Q-Q Plot:</strong> If residuals align with the straight reference line, they are normally distributed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hNZo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01b8aab-5624-4791-86c3-744345b4d91f_1049x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hNZo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01b8aab-5624-4791-86c3-744345b4d91f_1049x559.png 424w, https://substackcdn.com/image/fetch/$s_!hNZo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01b8aab-5624-4791-86c3-744345b4d91f_1049x559.png 848w, 
https://substackcdn.com/image/fetch/$s_!hNZo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01b8aab-5624-4791-86c3-744345b4d91f_1049x559.png 1272w, https://substackcdn.com/image/fetch/$s_!hNZo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01b8aab-5624-4791-86c3-744345b4d91f_1049x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hNZo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01b8aab-5624-4791-86c3-744345b4d91f_1049x559.png" width="1049" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a01b8aab-5624-4791-86c3-744345b4d91f_1049x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1049,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hNZo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01b8aab-5624-4791-86c3-744345b4d91f_1049x559.png 424w, https://substackcdn.com/image/fetch/$s_!hNZo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01b8aab-5624-4791-86c3-744345b4d91f_1049x559.png 848w, 
https://substackcdn.com/image/fetch/$s_!hNZo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01b8aab-5624-4791-86c3-744345b4d91f_1049x559.png 1272w, https://substackcdn.com/image/fetch/$s_!hNZo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa01b8aab-5624-4791-86c3-744345b4d91f_1049x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Shapiro-Wilk Test:</strong> A statistical test for normality.</p></li></ul><h2>4. 
Homoscedasticity Among the Data</h2><p>Homoscedasticity means that the variance of the errors should <em>remain the same</em> for all values of the predictors (independent variables). </p><p>If the variance changes (gets bigger or smaller), the assumption is violated, and this is called <strong>heteroscedasticity</strong> (not preferable).</p><h3><strong>How to check?</strong></h3><ul><li><p><strong>Residual plots:</strong> Residuals should be randomly scattered. If there is a funnel shape, it indicates heteroscedasticity.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!989w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3459636f-4be3-48f4-b8a2-01321d6040ad_870x509.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!989w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3459636f-4be3-48f4-b8a2-01321d6040ad_870x509.png 424w, https://substackcdn.com/image/fetch/$s_!989w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3459636f-4be3-48f4-b8a2-01321d6040ad_870x509.png 848w, https://substackcdn.com/image/fetch/$s_!989w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3459636f-4be3-48f4-b8a2-01321d6040ad_870x509.png 1272w, https://substackcdn.com/image/fetch/$s_!989w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3459636f-4be3-48f4-b8a2-01321d6040ad_870x509.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!989w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3459636f-4be3-48f4-b8a2-01321d6040ad_870x509.png" width="870" height="509" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3459636f-4be3-48f4-b8a2-01321d6040ad_870x509.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:509,&quot;width&quot;:870,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!989w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3459636f-4be3-48f4-b8a2-01321d6040ad_870x509.png 424w, https://substackcdn.com/image/fetch/$s_!989w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3459636f-4be3-48f4-b8a2-01321d6040ad_870x509.png 848w, https://substackcdn.com/image/fetch/$s_!989w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3459636f-4be3-48f4-b8a2-01321d6040ad_870x509.png 1272w, https://substackcdn.com/image/fetch/$s_!989w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3459636f-4be3-48f4-b8a2-01321d6040ad_870x509.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the plot above, the spread of the residuals does not noticeably increase or decrease across the test predictor, which suggests the data is homoscedastic.</p><h2>5. 
No Auto-correlation of Residuals</h2><p>Auto-correlation happens when the residuals are <em>related to each other</em> over time.</p><p>This usually occurs in time-series data where past values influence future values.</p><blockquote><h4><strong>Why is this bad?</strong></h4><ul><li><p>If residuals are correlated, the model underestimates the standard error, making predictions unreliable.</p></li></ul></blockquote><h3><strong>How to check?</strong></h3><ul><li><p><strong>Durbin-Watson test:</strong> A statistical test for detecting autocorrelation.</p></li><li><p><strong>Residual plot over time:</strong> Look for patterns instead of randomness.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VFeT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4007a91d-1115-48a2-9331-babf81cf29fc_870x509.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VFeT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4007a91d-1115-48a2-9331-babf81cf29fc_870x509.png 424w, https://substackcdn.com/image/fetch/$s_!VFeT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4007a91d-1115-48a2-9331-babf81cf29fc_870x509.png 848w, https://substackcdn.com/image/fetch/$s_!VFeT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4007a91d-1115-48a2-9331-babf81cf29fc_870x509.png 1272w, https://substackcdn.com/image/fetch/$s_!VFeT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4007a91d-1115-48a2-9331-babf81cf29fc_870x509.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VFeT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4007a91d-1115-48a2-9331-babf81cf29fc_870x509.png" width="870" height="509" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4007a91d-1115-48a2-9331-babf81cf29fc_870x509.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:509,&quot;width&quot;:870,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VFeT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4007a91d-1115-48a2-9331-babf81cf29fc_870x509.png 424w, https://substackcdn.com/image/fetch/$s_!VFeT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4007a91d-1115-48a2-9331-babf81cf29fc_870x509.png 848w, https://substackcdn.com/image/fetch/$s_!VFeT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4007a91d-1115-48a2-9331-babf81cf29fc_870x509.png 1272w, https://substackcdn.com/image/fetch/$s_!VFeT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4007a91d-1115-48a2-9331-babf81cf29fc_870x509.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></li></ul><p>The plot above shows no visible pattern or correlation among the residual points.</p><p>When there is no pattern, autocorrelation is unlikely to be present.</p><div><hr></div><p>And that&#8217;s a wrap! Whether you&#8217;re a beginner diving into machine learning or an expert, understanding these assumptions will help you build better models.</p><p>If you&#8217;d like to explore the full implementation, including code and data, check out the <em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Linear-Regression/lr_assumptions.ipynb">GitHub Repository</a></strong></em>. 
&#128072;&#127995;</p><p>Also <em>stay tuned with<strong> <a href="https://substack.com/@analyticalnikita">ME</a></strong></em>, so you won&#8217;t miss out on future updates.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/working-with-text-data-tokenization?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzODk3MDA4LCJpYXQiOjE3MzYyNjI3MTMsImV4cCI6MTczODg1NDcxMywiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.9YJtyeDdbdG7YRaC7fxDDMCJsWzsYR2hfBW6KlmWeS4&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption"><em>Before you go.. leave a &#8220;heart&#8221; &#10084;&#65039; and if you have any questions/ suggestions/ thoughts, do drop me a line below. &#128395;&#65039;&#128071;</em></p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/5-assumptions-of-linear-regression?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/5-assumptions-of-linear-regression?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[Handling Imbalanced Dataset in ML]]></title><description><![CDATA[Easy Explanation for Data Science Interviews]]></description><link>https://analyticalnikita.substack.com/p/handling-imbalanced-dataset-in-ml</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/handling-imbalanced-dataset-in-ml</guid><dc:creator><![CDATA[Nikita 
Prasad]]></dc:creator><pubDate>Sat, 15 Mar 2025 07:40:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xKDd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36610319-e029-4870-985a-3f50b0dcc5b9_689x547.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this read, I want to focus on<em><strong> Imbalanced Datasets </strong>&#8212; a common challenge in </em> real-world applications. </p><p>Knowing how to deal with it not only helps you tackle interview questions but also enables you to build better predictive models.</p><blockquote><p><strong>Quick Pause</strong>: <em>If you&#8217;re new here <strong>Subscribe </strong>&#8212; my goal is to make Data Science easy for you. &#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h2>What Is Imbalanced Data?</h2><p>Imbalanced data refers to a dataset within which one or some of the classes (or labels) make up the majority of the data set, leaving far fewer examples of others. 
</p><p>This issue applies to both <em>classification</em> and <em>regression tasks</em>.</p><ul><li><p>In <strong>classification</strong>, it can occur in <em>binary classification</em>, <em>multi-class classification</em>, and <em>multi-label classification.</em></p><ul><li><p>For instance, consider a dataset where 95% of the data belongs to class &#8220;blue&#8221; and the remaining 5% belongs to class &#8220;orange&#8221; (see the plot below).</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xKDd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36610319-e029-4870-985a-3f50b0dcc5b9_689x547.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xKDd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36610319-e029-4870-985a-3f50b0dcc5b9_689x547.png 424w, https://substackcdn.com/image/fetch/$s_!xKDd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36610319-e029-4870-985a-3f50b0dcc5b9_689x547.png 848w, https://substackcdn.com/image/fetch/$s_!xKDd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36610319-e029-4870-985a-3f50b0dcc5b9_689x547.png 1272w, https://substackcdn.com/image/fetch/$s_!xKDd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36610319-e029-4870-985a-3f50b0dcc5b9_689x547.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!xKDd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36610319-e029-4870-985a-3f50b0dcc5b9_689x547.png" width="689" height="547" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36610319-e029-4870-985a-3f50b0dcc5b9_689x547.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:689,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xKDd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36610319-e029-4870-985a-3f50b0dcc5b9_689x547.png 424w, https://substackcdn.com/image/fetch/$s_!xKDd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36610319-e029-4870-985a-3f50b0dcc5b9_689x547.png 848w, https://substackcdn.com/image/fetch/$s_!xKDd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36610319-e029-4870-985a-3f50b0dcc5b9_689x547.png 1272w, https://substackcdn.com/image/fetch/$s_!xKDd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36610319-e029-4870-985a-3f50b0dcc5b9_689x547.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><ul><li><p>In <strong>regression</strong> problems, imbalance occurs when there are <em>outlier values</em> that are much lower or much higher than the median or mean of the data.</p><ul><li><p>For example, when predicting customer lifetime value, most customers may have average spending, but a few high-value customers can skew the dataset.</p></li></ul></li></ul><blockquote><p><em><strong>Note</strong>: In many real-world problems, such as fraud detection, diagnosis of rare diseases, and anomaly detection, the data is <strong>inherently imbalanced</strong>.</em></p></blockquote><h2>Why is Imbalanced Data a Problem?</h2><ul><li><p><strong>Biased Model Performance</strong>: </p><ul><li><p>Models trained on imbalanced data tend to be biased towards the majority 
class, resulting in <em><strong>poor generalization</strong></em>.</p></li></ul></li><li><p><strong>Misleading Evaluation Metrics</strong>: </p><ul><li><p>Traditional metrics like <em><strong>accuracy</strong> </em>can be misleading. A model may achieve high accuracy simply by predicting the majority class most of the time, while completely ignoring the minority class. With a 95/5 class split, for example, always predicting the majority class already yields 95% accuracy.</p></li></ul></li></ul><h2>How to Deal With Imbalanced Data?</h2><p>Previously, in this <em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Feature-Engineering/imbalanced-data.ipynb">GitHub repository</a></strong></em>, I covered three different methods in detail:</p><ul><li><p><em><strong>Data-level methods (Resampling techniques), </strong></em></p></li><li><p><em><strong>Model-level methods (Algorithm adjustments),</strong></em></p></li><li><p><em><strong>Metrics-level methods (Choosing better evaluation metrics)</strong></em></p></li></ul><p>But is there a better approach? Let&#8217;s explore some additional techniques.</p><h2>How to Fix Imbalanced Data?</h2><p>Other than resampling the dataset or selecting appropriate algorithms (like tree-based models), you can also consider combining multiple techniques for better results:</p><h3>1. Under-Sampling with Ensemble Learning</h3><p>The idea is to use all samples of the minority class while creating smaller subsets from the majority class, then ensemble multiple models trained on these subsets. </p><p>Imagine you have two classes in a binary classification task.</p><p>Class A has 1000 examples, while Class B has only 100 examples.</p><p>You can split Class A into 10 smaller groups, each containing 100 examples.</p><p>Train 10 different models, each on one small group from Class A plus all examples in Class B.</p><p>Finally, combine the predictions from these models (<strong>Ensemble Learning</strong>), for example by majority vote or by averaging predicted probabilities, to improve metrics.</p>
<h3>2. Up-Sampling with Adjusted Loss Function</h3><p>Another method is to adjust the model&#8217;s loss function after resampling.</p><p>You can up-sample the minority class until a desired ratio is reached and then recalculate the class weights for the new distribution.</p><p>Next, you pass these adjusted weights to the model&#8217;s loss function, ensuring that it pays more attention to the minority class during training.</p><blockquote><p>These are just two of the many possible ways to handle class imbalance, and both are commonly discussed in Data Science interviews.</p><p>However, you can experiment with different combinations of techniques. The general idea is to use multiple techniques together to address class imbalance.</p></blockquote><h2>Choosing the Right Metrics</h2><p>Picking the right evaluation metric is also crucial when working with imbalanced data sets.</p><blockquote><p><strong>Key Considerations: </strong></p><p>Before we talk about which metrics are appropriate for imbalanced data sets, there are two important things to keep in mind:</p><ul><li><p>First, always <strong>evaluate on the original, unmodified dataset</strong> rather than resampled data to avoid overly optimistic scores.</p></li><li><p>Second, the test dataset should <strong>represent the original data distribution</strong> as closely as possible.</p></li></ul></blockquote><p>As I mentioned earlier, <strong>accuracy </strong>can be misleading when classes are imbalanced, as the performance of the model on the majority class will dominate the accuracy score.</p><p>So, a better choice is to look at accuracy for each class individually, meaning you should evaluate the model&#8217;s performance on the minority class separately to get a clearer picture.</p><p>Other tools, such as the <strong>Precision-Recall (PR) curve</strong>, help identify a threshold that works best for the dataset. The PR curve puts more emphasis on how many predictions the model got right out of the total number it 
predicted to be positive, which is helpful when dealing with imbalanced data sets.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fdNh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b65c8f-99d2-4621-a903-b77a0d482b3d_1213x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fdNh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b65c8f-99d2-4621-a903-b77a0d482b3d_1213x918.png 424w, https://substackcdn.com/image/fetch/$s_!fdNh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b65c8f-99d2-4621-a903-b77a0d482b3d_1213x918.png 848w, https://substackcdn.com/image/fetch/$s_!fdNh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b65c8f-99d2-4621-a903-b77a0d482b3d_1213x918.png 1272w, https://substackcdn.com/image/fetch/$s_!fdNh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b65c8f-99d2-4621-a903-b77a0d482b3d_1213x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fdNh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b65c8f-99d2-4621-a903-b77a0d482b3d_1213x918.png" width="1213" height="918" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5b65c8f-99d2-4621-a903-b77a0d482b3d_1213x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:918,&quot;width&quot;:1213,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48192,&quot;alt&quot;:&quot;Precision-Recall Curve of a Logistic Regression Model and a No Skill Classifier&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Precision-Recall Curve of a Logistic Regression Model and a No Skill Classifier" title="Precision-Recall Curve of a Logistic Regression Model and a No Skill Classifier" srcset="https://substackcdn.com/image/fetch/$s_!fdNh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b65c8f-99d2-4621-a903-b77a0d482b3d_1213x918.png 424w, https://substackcdn.com/image/fetch/$s_!fdNh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b65c8f-99d2-4621-a903-b77a0d482b3d_1213x918.png 848w, https://substackcdn.com/image/fetch/$s_!fdNh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b65c8f-99d2-4621-a903-b77a0d482b3d_1213x918.png 1272w, https://substackcdn.com/image/fetch/$s_!fdNh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b65c8f-99d2-4621-a903-b77a0d482b3d_1213x918.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Another commonly used metric is the <strong>ROC-AUC</strong>. 
From the ROC curve, we can pick a threshold that raises recall while keeping the false positive rate low.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OV3Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a990a5-cfc0-40a9-aae7-0dd8f6b4c90e_578x455.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OV3Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a990a5-cfc0-40a9-aae7-0dd8f6b4c90e_578x455.png 424w, https://substackcdn.com/image/fetch/$s_!OV3Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a990a5-cfc0-40a9-aae7-0dd8f6b4c90e_578x455.png 848w, https://substackcdn.com/image/fetch/$s_!OV3Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a990a5-cfc0-40a9-aae7-0dd8f6b4c90e_578x455.png 1272w, https://substackcdn.com/image/fetch/$s_!OV3Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a990a5-cfc0-40a9-aae7-0dd8f6b4c90e_578x455.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OV3Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a990a5-cfc0-40a9-aae7-0dd8f6b4c90e_578x455.png" width="728" height="573.0795847750865" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50a990a5-cfc0-40a9-aae7-0dd8f6b4c90e_578x455.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:455,&quot;width&quot;:578,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OV3Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a990a5-cfc0-40a9-aae7-0dd8f6b4c90e_578x455.png 424w, https://substackcdn.com/image/fetch/$s_!OV3Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a990a5-cfc0-40a9-aae7-0dd8f6b4c90e_578x455.png 848w, https://substackcdn.com/image/fetch/$s_!OV3Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a990a5-cfc0-40a9-aae7-0dd8f6b4c90e_578x455.png 1272w, https://substackcdn.com/image/fetch/$s_!OV3Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50a990a5-cfc0-40a9-aae7-0dd8f6b4c90e_578x455.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>However, the problem with the ROC curve is that it treats both classes equally, so it is less sensitive to improvements on the minority class.</p><p>This makes it less helpful than the PR curve when classes are heavily imbalanced.</p><div><hr></div><p>Alright, that&#8217;s a wrap! If you&#8217;ve made it this far &#8212; thank you! 
<em>Stay tuned with<strong> <a href="https://substack.com/@analyticalnikita">ME</a></strong></em>, so you won&#8217;t miss out on future updates.</p><blockquote><p>Do not forget to explore the full implementation, including code and data: <em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Feature-Engineering/imbalanced-data.ipynb">Github Repository</a></strong></em> &#128072;&#127995;</p></blockquote><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzOTY3MDMxLCJpYXQiOjE3NDE5NDIzMzYsImV4cCI6MTc0NDUzNDMzNiwiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.JWw4pdU7jCGSV5bwMlHnG8gazyOqGmuOTgSol7f4g3o&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Found this helpful? Leave a &#8220;heart&#8221;&#10084;&#65039;! And if you have any questions/ suggestions/ thoughts, do drop me a line below. 
&#128395;&#65039;&#128071;</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/handling-imbalanced-dataset-in-ml?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/handling-imbalanced-dataset-in-ml?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[Why Large Language Models Hallucinate?]]></title><description><![CDATA[Everything You NEED to Know as a Data Scientist]]></description><link>https://analyticalnikita.substack.com/p/why-large-language-models-hallucinate</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/why-large-language-models-hallucinate</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Sat, 08 Mar 2025 10:34:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f886ec71-a933-4924-b62b-c106088058a7_1000x630.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>OpenAI recently announced GPT-4.5 with claims of significantly lower hallucination rates, making it more accurate and reliable.</p><p>However, after extensive testing, many users, including AI researchers debating on Reddit, suggest that &#8220;<em><a href="https://www.reddit.com/r/singularity/comments/1j06srh/gpt45_hallucination_rate_in_practice_is_too_high/?rdt=64420">GPT-4.5 hallucination rate, in practice, is too high for reasonable use</a></em>&#8221;.</p><p>In this read, let&#8217;s dive into <em>Hallucinations in LLMs, explore why they happen, and discuss strategies to mitigate 
them</em>.</p><blockquote><p><em><strong>Quick Pause</strong>: If you find these tips helpful, do not forget to</em> <em><strong>subscribe</strong> and stick around to dive deeper into Python &amp; ML insights. &#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h2>What Are LLM Hallucinations?</h2><p>While LLMs can generate fluent and coherent text on a wide range of topics and domains, they&#8217;re also prone to hallucination.</p><p>Hallucination in LLMs refers to the phenomenon where the model deviates from factual accuracy or contextual logic. </p><p>This can range from minor inconsistencies to generating entirely fabricated outputs, outright incorrect facts, or erroneous explanations.</p><p>You can categorize hallucinations across different levels of granularity:</p><ul><li><p><strong>Sentence Contradictions</strong>: The LLM generates statements that contradict each other within the same response. </p><ul><li><p>Something like <em>&#8220;Einstein was born in 1879 in Germany. However, his birthplace is Switzerland&#8221;.</em></p></li></ul></li><li><p><strong>Prompt Contradictions</strong>: The generated response contradicts the prompt that was used to generate it.</p><ul><li><p>For example, if you ask an LLM to write a formal email declining an invitation and it returns a rude response, that would be in direct contradiction to what you asked.</p></li></ul></li><li><p><strong>Factual Contradictions</strong>: The LLM states something that is factually incorrect. 
</p><ul><li><p>For instance, in one test, GPT&#8209;4.5 incorrectly stated that <em>&#8220;The Third Circuit held personal jurisdiction existed&#8221;</em>. However, the actual ruling was the exact opposite. Such factual errors are especially concerning.</p></li></ul></li></ul><div><hr></div><p>Now that we have answered what LLM hallucinations are, let&#8217;s understand the reasons behind them:</p><h2>Why Do Hallucinations Happen?</h2><p>Understanding why LLMs hallucinate is not straightforward, as the way they derive their output remains partly a black box, even to their engineers.</p><p>But a number of common causes have been identified:</p><ul><li><p><strong>Limited Contextual Understanding:</strong> Even though models like GPT&#8209;4.5 are pretrained on extensive and diverse datasets, they may still lose vital context when sentences or clauses span long distances.</p></li><li><p><strong>Training Data Limitations: </strong>If the training data is biased or lacks the necessary knowledge for specific queries, the model might fill in the gaps with &#8220;most probable&#8221; text rather than the &#8220;true&#8221; information.</p></li><li><p><strong>Output Generation Methods:</strong> Newer models might focus more on generating &#8220;emotionally intelligent&#8221; or so-called human-like responses rather than factually verified outputs, leading to convincing yet potentially incorrect responses.</p></li><li><p><strong>Benchmarking vs. 
Real Use:</strong> Although OpenAI&#8217;s benchmarks (such as the SimpleQA benchmark) claim a reduced hallucination rate of 37.1%, in practice these metrics often fail to capture the full complexity of real-world scenarios.</p></li></ul><div><hr></div><h2>How Can We Fix the Hallucination Problem?</h2><p>While no solution is perfect, some promising approaches can help reduce hallucinations when using LLMs:</p><ol><li><p><strong>Clear and Specific Prompts</strong>: The more precise and detailed your input prompt, the more likely the LLM is to generate relevant and accurate outputs.</p></li><li><p><strong>Active Mitigation Strategies: </strong>Adjusting model settings, such as lowering the temperature parameter, can reduce the randomness of the output and potentially reduce hallucinations.</p></li><li><p><strong>Integrate Web Search:</strong> Models that combine reasoning with real-time web search tend to verify facts more accurately, leading to more reliable outputs.</p></li></ol><p>So while LLMs like GPT-4.5 may sometimes hallucinate, understanding the causes and employing these mitigation strategies can help you harness their true potential.</p><div><hr></div><p><strong>&#129781;&#127995;Over to You: </strong><em>What&#8217;s your experience with AI hallucinations? </em>Drop a comment below, and let&#8217;s discuss! <em>&#128395;&#65039;&#128071;</em></p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/working-with-text-data-tokenization?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzODk3MDA4LCJpYXQiOjE3MzU5MDgyMDEsImV4cCI6MTczODUwMDIwMSwiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.6r1ERfqwFCJwDfcSbNBjM-aU7g-W0iFek6Y8qbD3Fzc&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption"><em>&#128226; Before you go.. 
leave a &#8220;heart&#8221; &#10084;&#65039;. Stay tuned with<strong> <a href="https://substack.com/@analyticalnikita">ME</a></strong></em>, so you won&#8217;t miss out on future updates.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/why-large-language-models-hallucinate?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/why-large-language-models-hallucinate?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[Often Used ACTIVATION FUNCTIONS In Neural Networks ]]></title><description><![CDATA[Must-Know Concept to Avoid Struggling with Model Performance]]></description><link>https://analyticalnikita.substack.com/p/often-used-activation-functions-in</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/often-used-activation-functions-in</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Tue, 04 Mar 2025 10:34:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Egba!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26912d90-8e52-4697-a9a7-abe96d97055a_800x1131.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Imagine training a DL model for hours, only to find that it <strong>fails to converge</strong>, gets stuck with <strong>vanishing gradients</strong>, or <strong>struggles with accuracy</strong>. 
</p><p>You may tweak hyperparameters, increase training data, and even adjust the learning rate&#8212;yet the issue persists.</p><p>What if the <strong>real problem lies in your activation function?</strong></p><p>In this guide, let&#8217;s <strong>break down the most commonly used activation functions</strong>, their strengths and weaknesses, and how to choose the <strong>right one for your deep learning model</strong>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Egba!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26912d90-8e52-4697-a9a7-abe96d97055a_800x1131.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Egba!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26912d90-8e52-4697-a9a7-abe96d97055a_800x1131.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Egba!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26912d90-8e52-4697-a9a7-abe96d97055a_800x1131.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Egba!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26912d90-8e52-4697-a9a7-abe96d97055a_800x1131.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Egba!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26912d90-8e52-4697-a9a7-abe96d97055a_800x1131.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Egba!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26912d90-8e52-4697-a9a7-abe96d97055a_800x1131.jpeg" width="800" height="1131" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26912d90-8e52-4697-a9a7-abe96d97055a_800x1131.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1131,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94553,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/154065578?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26912d90-8e52-4697-a9a7-abe96d97055a_800x1131.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Egba!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26912d90-8e52-4697-a9a7-abe96d97055a_800x1131.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Egba!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26912d90-8e52-4697-a9a7-abe96d97055a_800x1131.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Egba!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26912d90-8e52-4697-a9a7-abe96d97055a_800x1131.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Egba!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26912d90-8e52-4697-a9a7-abe96d97055a_800x1131.jpeg 
1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><em>&#9208;&#65039;<strong> Quick Pause</strong>: If you&#8217;re new here, I&#8217;d highly appreciate it if you <strong>subscribe </strong>to receive bi-weekly data tips and insights &#8212; directly in your inbox. 
&#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h3>&#10145;&#65039; There are two types of Activation Functions:</h3><p>1. Linear Activation Functions<br>2. Non-linear Activation Functions</p><p>First things first, let&#8217;s answer:</p><h2>&#128736;&#65039; Why Use Non-Linear Activation Functions?</h2><p>Real-world data is rarely linear. Adding a non-linear activation function lets the network capture that non-linearity, which helps DL models perform well on complex tasks. Without it, a stack of layers would still amount to a single linear transformation.</p><p><strong>Adding non-linearity ensures that:</strong><br>&#10004; The model can learn <strong>hierarchical representations</strong><br>&#10004; It enables deep networks to approximate <strong>any function</strong><br>&#10004; It helps solve <strong>complex real-world problems</strong> in NLP, vision, and more</p><h3>Here's the list of Non-Linear Activation Functions:&#128071;&#127995;</h3><p><br>1&#65039;&#8419; &#120294;&#120310;&#120308;&#120314;&#120316;&#120310;&#120305; - A historically popular function that squashes its input into the range (0, 1). But it can cause the Vanishing Gradient Problem in deep networks.<br><br>2&#65039;&#8419; &#120295;&#120302;&#120315;&#120309; - Similar to the sigmoid but scales the output to be in the range (-1, 1). 
It's zero-centered, which can help mitigate the Vanishing Gradient Problem to some extent.<br><br>3&#65039;&#8419; &#120293;&#120306;&#120287;&#120296; <strong>(Rectified Linear Unit)</strong> - It's computationally efficient and helps with the vanishing gradient problem but can cause the "<em><strong>Dying ReLU</strong></em>" problem.<br><br>4&#65039;&#8419; &#120287;&#120306;&#120302;&#120312;&#120326; &#120293;&#120306;&#120287;&#120322; - To tackle the dying ReLU problem, Leaky ReLU uses a small slope for negative values instead of 0.<br><br>5&#65039;&#8419; &#120291;&#120293;&#120306;&#120287;&#120296; <strong>(Parametric ReLU) </strong>- Similar to Leaky ReLU, but the slope for negative values is learned during training rather than being predefined.<br><br>6&#65039;&#8419; &#120280;&#120287;&#120296; <strong>(Exponential Linear Unit)</strong> - Tries to push the mean activations closer to zero. It maps negative inputs to values between -&#945; and 0, which can make the model more robust.<br><br>7&#65039;&#8419; &#120294;&#120280;&#120287;&#120296; <strong>(Scaled Exponential Linear Unit) </strong>- Like ELU, but with scaling, making it self-normalizing. Under specific conditions, it can keep activations at mean 0 and variance 1.<br><br>8&#65039;&#8419; &#120294;&#120316;&#120307;&#120321;&#120317;&#120313;&#120322;&#120320; - A smooth approximation of the ReLU function whose output is always positive.<br><br>9&#65039;&#8419; &#120294;&#120316;&#120307;&#120321;&#120320;&#120310;&#120308;&#120315; - Divides the input by 1 plus its absolute value. 
Similar to tanh but less commonly used.<br><br>1&#65039;&#8419;0&#65039;&#8419; &#120283;&#120302;&#120319;&#120305; &#120294;&#120310;&#120308;&#120314;&#120316;&#120310;&#120305; - A piecewise linear approximation of the sigmoid function that is more computationally efficient than the regular sigmoid.<br><br>1&#65039;&#8419;1&#65039;&#8419; &#120294;&#120324;&#120310;&#120320;&#120309; - A self-gated activation function discovered by researchers at Google. It's smooth and non-monotonic, and outperforms ReLU in some cases.<br><br>1&#65039;&#8419;2&#65039;&#8419; &#120288;&#120310;&#120320;&#120309; - Combines softplus and tanh, and has been shown to outperform many traditional activations in deep networks.</p><blockquote><p><em><strong>Note</strong></em>: <em>All these functions are typically used in hidden layers, while <strong>SoftMax</strong> is preferred as the output-layer function for <strong>Multi-class classification</strong>.</em></p></blockquote><div><hr></div><p>I hope this guide <strong>helped you gain a deeper understanding</strong> of activation functions and their impact on building <strong>robust neural network models</strong>. </p><p>For most business problems, I <strong>keep it simple</strong>&#8212;using <strong>ReLU for hidden layers and Softmax for multi-class classification</strong>. If the results are reasonable, I move forward. Only after evaluating the final pipeline with the client do I experiment with <strong>new layers and activation function combinations</strong> to optimize performance.</p><div><hr></div><p><strong>Comment Down</strong> &#128071;: <em>Which activation function do you use most often when training a neural network? 
</em></p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzOTY3MDMxLCJpYXQiOjE3NDEwMjE3OTgsImV4cCI6MTc0MzYxMzc5OCwiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.BcAvyA5l6iao5MFj8b00rCuka3kkluditHSBt6fISkY&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">&#10084;&#65039; If you found this helpful, leave a &#8220;heart&#8221;! And if you have any questions/ suggestions/ thoughts, do drop me a line below. &#128395;&#65039;&#128071;</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/often-used-activation-functions-in?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/often-used-activation-functions-in?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[How Do LLMs Remember Context Over Long Sequences?]]></title><description><![CDATA[Building Simple Attention Mechanism From Scratch]]></description><link>https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Sat, 01 Mar 2025 10:34:35 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec42fb2d-f24d-47bb-8e0d-4fb6add4e2c3_1456x1048.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>OpenAI has finally introduced the much-awaited research preview of <em><strong>GPT&#8209;4.5</strong></em>, claimed to be their most advanced model yet &#8212; with<em> a broader knowledge base, improved intent-following abilities, and greater &#8220;EQ&#8221; (Emotional Intelligence). </em></p><p>This motivated me to write a detailed article on the <em><strong>fundamentals of the Attention Mechanism</strong></em><strong> </strong>&#8212; the core of GPT models, along with an implementation of a <strong>Simple</strong> <strong>Attention Mechanism</strong> from scratch to truly understand how it works.</p><blockquote><p><em>&#9208;&#65039;<strong> Quick Pause</strong>: If you&#8217;re new here, I&#8217;d highly appreciate it if you <strong>subscribe </strong>to receive bi-weekly data tips and insights &#8212; directly in your inbox. 
&#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h2>The Problem with Modeling Long Sequences</h2><p>In tasks like machine translation, word-by-word translation <em>doesn&#8217;t work</em> because translation requires <em>contextual understanding and grammatical alignment between the source and target languages</em>.</p><p>Prior to the introduction of transformer models, encoder-decoder RNNs were commonly used for machine translation tasks.</p><p>In this setup, the encoder processes the sequence of tokens from the source language step by step, updating a hidden state (a kind of intermediate layer within the neural network) as it goes.</p><p>This leads to a loss of context, especially in long, complex sentences where dependencies might span long distances.</p><p>The reason is that the final hidden state has to condense the entire input sequence into a single vector.</p><p><strong>Solution? 
Self-Attention Mechanism!</strong></p><h2><strong>What is the Self-Attention Mechanism?</strong></h2><p>Through an attention mechanism, the text-generating decoder part of the network can selectively access different parts of the input tokens.</p><p>&#128161; <strong>Key Idea: </strong>Certain input tokens carry more significance (weight) than others when generating a specific output token; attending to them selectively improves LLM performance.</p><blockquote><p><em><strong><a href="https://medium.com/gitconnected/self-attention-networks-beginners-friendly-in-depth-understanding-0f2d605a8f23">Self-attention in transformers</a></strong> &#8212; sometimes referred to as <strong>intra-attention</strong> &#8212; is a mechanism that allows the inputs to interact with each other (&#8220;self&#8221;) in order to determine what they should focus on (&#8220;attention&#8221;).</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lN-S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27214b0a-25a6-48b9-bb67-0ae7bb00b438_963x622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lN-S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27214b0a-25a6-48b9-bb67-0ae7bb00b438_963x622.png 424w, https://substackcdn.com/image/fetch/$s_!lN-S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27214b0a-25a6-48b9-bb67-0ae7bb00b438_963x622.png 848w, https://substackcdn.com/image/fetch/$s_!lN-S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27214b0a-25a6-48b9-bb67-0ae7bb00b438_963x622.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lN-S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27214b0a-25a6-48b9-bb67-0ae7bb00b438_963x622.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lN-S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27214b0a-25a6-48b9-bb67-0ae7bb00b438_963x622.png" width="963" height="622" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/27214b0a-25a6-48b9-bb67-0ae7bb00b438_963x622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:622,&quot;width&quot;:963,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lN-S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27214b0a-25a6-48b9-bb67-0ae7bb00b438_963x622.png 424w, https://substackcdn.com/image/fetch/$s_!lN-S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27214b0a-25a6-48b9-bb67-0ae7bb00b438_963x622.png 848w, https://substackcdn.com/image/fetch/$s_!lN-S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27214b0a-25a6-48b9-bb67-0ae7bb00b438_963x622.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lN-S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27214b0a-25a6-48b9-bb67-0ae7bb00b438_963x622.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></blockquote><p>In simple terms, they process <em>n</em> inputs and return <em>n </em>outputs. 
Each output aggregates these interactions, weighted by the attention scores computed for the corresponding input.</p><div><hr></div><h2>Implementing a Simple Attention Mechanism</h2><p>For illustration purposes, let&#8217;s implement a simple version of self-attention, which does not contain any trainable weights (for now). </p><p>Suppose we are given an input sequence <em>x<sub>1</sub></em> to <em>x<sub>T</sub></em>:</p><ul><li><p>The input is a text (for example, a sentence like "<em>Attention Mechanism drives contextual embedding</em>") that has already been converted into <em><strong><a href="https://analyticalnikita.substack.com/p/how-llms-embeds-input-tokens">token embeddings</a></strong></em>.</p></li><li><p>For instance, <em>x<sub>1</sub></em> is a d-dimensional vector representing the word "Attention", <em>x<sub>2</sub></em> represents &#8220;Mechanism&#8221;, and so forth.</p></li></ul><blockquote><p><strong>Goal:</strong> To compute a context vector <em>z<sub>i</sub></em> for each input sequence element <em>x<sub>i</sub></em> in <em>x<sub>1</sub></em> to <em>x<sub>T</sub></em>, <em>where z and x have the same dimension</em>.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_EhD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec42fb2d-f24d-47bb-8e0d-4fb6add4e2c3_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_EhD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec42fb2d-f24d-47bb-8e0d-4fb6add4e2c3_1456x1048.png 424w, 
https://substackcdn.com/image/fetch/$s_!_EhD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec42fb2d-f24d-47bb-8e0d-4fb6add4e2c3_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!_EhD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec42fb2d-f24d-47bb-8e0d-4fb6add4e2c3_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!_EhD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec42fb2d-f24d-47bb-8e0d-4fb6add4e2c3_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_EhD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec42fb2d-f24d-47bb-8e0d-4fb6add4e2c3_1456x1048.png" width="1456" height="1048" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec42fb2d-f24d-47bb-8e0d-4fb6add4e2c3_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:110792,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/153967031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec42fb2d-f24d-47bb-8e0d-4fb6add4e2c3_1456x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!_EhD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec42fb2d-f24d-47bb-8e0d-4fb6add4e2c3_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!_EhD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec42fb2d-f24d-47bb-8e0d-4fb6add4e2c3_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!_EhD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec42fb2d-f24d-47bb-8e0d-4fb6add4e2c3_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!_EhD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec42fb2d-f24d-47bb-8e0d-4fb6add4e2c3_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The code below walks through the figure above step by step.</p><p>In the tensor below, each row represents a word, and each column represents an embedding dimension:</p><pre><code><code>import torch

input_emb = torch.tensor([
    [0.12, 0.45, 0.67],  # "Attention"
    [0.34, 0.56, 0.78],  # "Mechanism"
    [0.23, 0.57, 0.91],  # "drives"
    [0.76, 0.88, 0.45],  # "contextual"
    [0.54, 0.12, 0.34]   # "embedding"
], dtype=torch.float32)</code></code></pre><p>We use the first input element, x<sub>1</sub>, as an example to compute the context vector z<sub>1</sub>; later, we will generalize this to compute all context vectors.</p><h3>Step 1: Attention Scores (&#969;) </h3><p>The first step is to compute the unnormalized attention scores by taking the dot product between the query x<sub>1</sub> and every input token, including x<sub>1</sub> itself:</p><blockquote><p><em>Attention scores are raw, unnormalized values that indicate how relevant one input element is to another.</em></p></blockquote><p>They are computed by comparing the<em> query vector (Q)</em> of one element with the <em>key vectors (K) </em>of all elements. In this simplified version without trainable weights, the input embeddings themselves play both roles.</p><pre><code><code>query = input_emb[0]  # 1st input token is the query

attn_scores_1 = torch.empty(input_emb.shape[0])
for i, x_i in enumerate(input_emb):
    attn_scores_1[i] = torch.dot(x_i, query) # dot product (transpose not necessary here since they are 1-dim vectors)

print(attn_scores_1)
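
# A quick sanity check in plain Python (a sketch added for illustration;
# the names x_attention, x_mechanism and dot_by_hand are not from the post).
# A dot product multiplies matching entries and adds up the results:
x_attention = [0.12, 0.45, 0.67]   # same values as input_emb[0]
x_mechanism = [0.34, 0.56, 0.78]   # same values as input_emb[1]
dot_by_hand = sum(a * b for a, b in zip(x_attention, x_mechanism))
# 0.12*0.34 + 0.45*0.56 + 0.67*0.78 = 0.0408 + 0.2520 + 0.5226, about 0.8154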

# Output: tensor([0.6658, 0.8154, 0.8938, 0.7887, 0.3466])</code></code></pre><blockquote><p><strong>Note</strong>: <em>A <strong>dot product</strong> multiplies two vectors element-wise and sums the resulting products.</em></p></blockquote><h3>Step 2: Attention Weights (&#945;)</h3><p>Attention weights represent the relative importance of one element to another in a probabilistic manner (values between 0 and 1).</p><blockquote><p><strong>Note</strong>: <em>A larger weight means greater relevance.</em></p></blockquote><p>Let&#8217;s normalize the unnormalized attention scores ("omegas", &#969;) so that they sum up to 1:</p><pre><code><code>attn_weights_1_temp = attn_scores_1 / attn_scores_1.sum()

print("Attention weights:", attn_weights_1_temp)
print("Sum:", attn_weights_1_temp.sum())
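
# Side note (a sketch with made-up numbers, not from the post): dividing by
# the sum only behaves like a probability when every score is positive.
# With one negative score, a "weight" turns negative:
toy_scores = [2.0, -1.0, 1.0]                            # hypothetical scores
toy_weights = [s / sum(toy_scores) for s in toy_scores]  # [1.0, -0.5, 0.5]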

# Output: Attention weights: tensor([0.1897, 0.2323, 0.2546, 0.2247, 0.0987])
#         Sum: tensor(1.0000)</code></code></pre><blockquote><p><strong>Note</strong>: <em>In practice, it is recommended to use the softmax function for normalization, as it is better at handling extreme values and has more desirable gradient properties during training.</em></p></blockquote><p>So, let&#8217;s use the PyTorch implementation of softmax, which also normalizes the vector elements so that they sum up to 1:</p><pre><code><code>attn_weights_1 = torch.softmax(attn_scores_1, dim=0)

print("Attention weights:", attn_weights_1)
print("Sum:", attn_weights_1.sum())
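
# The same softmax written out in plain Python (a sketch added for
# illustration; it reuses the rounded Step 1 scores, and the names
# step1_scores, shifted, exps and softmax_by_hand are not from the post):
import math
step1_scores = [0.6658, 0.8154, 0.8938, 0.7887, 0.3466]
shifted = [s - max(step1_scores) for s in step1_scores]  # subtract the max for numerical stability
exps = [math.exp(s) for s in shifted]
softmax_by_hand = [e / sum(exps) for e in exps]          # about [0.190, 0.220, 0.238, 0.214, 0.138]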

# Output: Attention weights: tensor([0.1896, 0.2202, 0.2381, 0.2144, 0.1378])
#         Sum: tensor(1.0000)</code></code></pre><h3>Step 3: Context Vectors (z)</h3><p>In this step, the input embedding vectors are combined into a context vector.</p><p>The context vector aims to capture both semantic and syntactic information from the input embeddings.</p><p>It is the key component that encodes a weighted representation of the input sequence, capturing the most relevant information for each element by considering its relationship with all other tokens.</p><blockquote><p><strong>Note</strong>: <em>Context size (not to be confused with the context vector) is the maximum number of previous tokens the LLM looks at before predicting the next token.</em></p></blockquote><p>Let&#8217;s compute the context vector <em>z<sub>1</sub></em> by multiplying each embedded input token <em>x<sub>i</sub></em> by its attention weight and summing the resulting vectors:</p><pre><code><code>query = input_emb[0] # 1st input token is the query

context_vec_1 = torch.zeros(query.shape)
for i,x_i in enumerate(input_emb):
    context_vec_1 += attn_weights_1[i]*x_i

print(context_vec_1)
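
# The same weighted sum written out in plain Python (a sketch added for
# illustration, using the rounded softmax weights from Step 2; the names
# weights_step2, rows and context_by_hand are not from the post):
weights_step2 = [0.1896, 0.2202, 0.2381, 0.2144, 0.1378]
rows = [[0.12, 0.45, 0.67], [0.34, 0.56, 0.78], [0.23, 0.57, 0.91],
        [0.76, 0.88, 0.45], [0.54, 0.12, 0.34]]  # same values as input_emb
# zip(*rows) walks over embedding dimensions; each entry is a weighted sum over tokens
context_by_hand = [sum(w * v for w, v in zip(weights_step2, col)) for col in zip(*rows)]
# context_by_hand is roughly [0.390, 0.550, 0.659]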

# Output: tensor([0.3897, 0.5495, 0.6587])</code></code></pre><p>This context vector is a weighted combination of all input embeddings, as seen from the first token.</p><div><hr></div><h2>Computing Attention Weights for All Input Tokens</h2><p>Above, we computed the attention weights and context vector for input 1.</p><p>Next, let&#8217;s generalize this computation to all tokens in the input embeddings. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TyaQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb50f75d5-fef6-4525-9898-65d8c560ae5d_1456x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TyaQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb50f75d5-fef6-4525-9898-65d8c560ae5d_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!TyaQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb50f75d5-fef6-4525-9898-65d8c560ae5d_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!TyaQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb50f75d5-fef6-4525-9898-65d8c560ae5d_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!TyaQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb50f75d5-fef6-4525-9898-65d8c560ae5d_1456x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TyaQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb50f75d5-fef6-4525-9898-65d8c560ae5d_1456x1048.png" width="1456" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b50f75d5-fef6-4525-9898-65d8c560ae5d_1456x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:129217,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/153967031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb50f75d5-fef6-4525-9898-65d8c560ae5d_1456x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TyaQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb50f75d5-fef6-4525-9898-65d8c560ae5d_1456x1048.png 424w, https://substackcdn.com/image/fetch/$s_!TyaQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb50f75d5-fef6-4525-9898-65d8c560ae5d_1456x1048.png 848w, https://substackcdn.com/image/fetch/$s_!TyaQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb50f75d5-fef6-4525-9898-65d8c560ae5d_1456x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!TyaQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb50f75d5-fef6-4525-9898-65d8c560ae5d_1456x1048.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Applying <strong>step 1</strong> to all pairs of tokens gives the unnormalized attention score matrix:</p><pre><code><code>attn_scores = inputs @ inputs.T  # Compute scores for all pairs of tokens
print(attn_scores)

"""
Output: 
tensor([[0.6658, 0.8154, 0.8938, 0.7887, 0.3466],
        [0.8154, 1.0376, 1.1072, 1.1022, 0.5160],
        [0.8938, 1.1072, 1.2059, 1.0859, 0.5020],
        [0.7887, 1.1022, 1.0859, 1.5545, 0.6690],
        [0.3466, 0.5160, 0.5020, 0.6690, 0.4216]])
"""</code></code></pre><p>Similar to <strong>step 2</strong> previously, we normalize each row so that the values in each row sum to 1:</p><pre><code><code>attn_weights = torch.softmax(attn_scores, dim=-1) # Normalize Attention Scores
print(attn_weights)

"""
Output: 
tensor([[0.1896, 0.2202, 0.2381, 0.2144, 0.1378],
        [0.1766, 0.2206, 0.2365, 0.2353, 0.1309],
        [0.1821, 0.2254, 0.2488, 0.2207, 0.1231],
        [0.1481, 0.2026, 0.1994, 0.3185, 0.1314],
        [0.1721, 0.2039, 0.2010, 0.2376, 0.1855]])
"""</code></code></pre><p>Lastly, applying <strong>step 3</strong> to compute all context vectors:</p><pre><code><code>all_context_vecs = attn_weights @ input_emb # Compute all context vectors
print(all_context_vecs)

"""
Output: 
tensor([[0.3897, 0.5495, 0.6587],
        [0.4001, 0.5606, 0.6560],
        [0.3899, 0.5589, 0.6653],
        [0.4455, 0.5898, 0.6267],
        [0.4169, 0.5375, 0.6272]])
"""</code></code></pre><p>As a sanity check, the previously computed context vector z<sub>1</sub> appears as the 1st row above:</p><pre><code><code>print("Previous 1st context vector:", context_vec_1)

# Output: Previous 1st context vector: tensor([0.3897, 0.5495, 0.6587])</code></code></pre><p>Now, each token has a dynamically computed representation based on its relationship with all other tokens.</p><blockquote><p><em>Want to go deeper? <strong>Subscribe</strong> for upcoming deep dives into Transformer Architectures!</em></p></blockquote><p>If you&#8217;d like to explore the full implementation, including code and data, check out the <em><strong><a href="https://github.com/nikitaprasad21/LLM-Cheat-Code/blob/main/Attention-Mechanism/Simplified_Attention_Mechanism.ipynb">GitHub Repository</a></strong></em> &#128072;&#127995;</p><div><hr></div><p>And that&#8217;s a wrap! If you&#8217;ve made it this far, thank you so much. <em>Stay tuned with<strong> <a href="https://substack.com/@analyticalnikita">ME</a></strong></em>, so you won&#8217;t miss out on future updates.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">&#10084;&#65039; If you found this helpful, leave a &#8220;heart&#8221;! And if you have any questions, suggestions, or thoughts, do drop me a line below. 
&#128395;&#65039;&#128071;</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/how-do-llms-remember-context-over?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[How Python Lists Can Be RISKY?]]></title><description><![CDATA[A Must-Know Guide to Avoid These Costly Mistakes!]]></description><link>https://analyticalnikita.substack.com/p/why-python-lists-can-be-risky</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/why-python-lists-can-be-risky</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Tue, 25 Feb 2025 10:49:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DBuW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010e0fce-fe77-4b4c-85fa-5b9498864492_1100x642.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Unlike tuples (which are immutable), lists allow modifications, making them both powerful and risky.</p><h2>How?</h2><p>&#128071;&#127995;Consider the following snippet of code:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DBuW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010e0fce-fe77-4b4c-85fa-5b9498864492_1100x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DBuW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010e0fce-fe77-4b4c-85fa-5b9498864492_1100x642.png 424w, https://substackcdn.com/image/fetch/$s_!DBuW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010e0fce-fe77-4b4c-85fa-5b9498864492_1100x642.png 848w, https://substackcdn.com/image/fetch/$s_!DBuW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010e0fce-fe77-4b4c-85fa-5b9498864492_1100x642.png 1272w, https://substackcdn.com/image/fetch/$s_!DBuW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010e0fce-fe77-4b4c-85fa-5b9498864492_1100x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DBuW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010e0fce-fe77-4b4c-85fa-5b9498864492_1100x642.png" width="1100" height="642" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/010e0fce-fe77-4b4c-85fa-5b9498864492_1100x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:642,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231320,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://analyticalnikita.substack.com/i/157792728?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010e0fce-fe77-4b4c-85fa-5b9498864492_1100x642.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DBuW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010e0fce-fe77-4b4c-85fa-5b9498864492_1100x642.png 424w, https://substackcdn.com/image/fetch/$s_!DBuW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010e0fce-fe77-4b4c-85fa-5b9498864492_1100x642.png 848w, https://substackcdn.com/image/fetch/$s_!DBuW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010e0fce-fe77-4b4c-85fa-5b9498864492_1100x642.png 1272w, https://substackcdn.com/image/fetch/$s_!DBuW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F010e0fce-fe77-4b4c-85fa-5b9498864492_1100x642.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>At first glance, you might <strong>expect </strong><code>b</code><strong> to remain </strong><code>[1, 3, 4]</code>, but instead, <strong>both </strong><code>a</code><strong> and </strong><code>b</code><strong> change</strong>. <strong>Why?</strong></p><p>If you&#8217;re a data scientist, ignoring such risks may lead to <em>incorrect analysis, unexpected bugs, and memory issues in</em> ML models.</p><blockquote><p><em><strong>Quick Pause</strong>: If you find these tips helpful, do not forget to</em> <em><strong>subscribe</strong> and stick around to dive deeper into Python &amp; ML insights. 
&#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><p>Let&#8217;s break down the issue.</p><h2>What&#8217;s Happening?</h2><p>When you execute <code>b = a</code>, you aren&#8217;t creating a new list.</p><p>Instead, both <code>a</code> and <code>b</code> now refer to the same list in memory.</p><p>So, when you do <code>a.append(2)</code>, you&#8217;re modifying the original list, which means both <code>a</code> and <code>b</code> reflect that change.</p><p>Imagine this type of unintended modification corrupting your entire data science project while working with large datasets.</p><div><hr></div><h2>How to FIX It?</h2><h3><strong>1. Copying a List Using </strong><code>.copy()</code><strong> Method</strong></h3><p>The <code>.copy()</code> method returns a shallow copy of the original list.</p><p>This works fine as long as the list <strong>does not</strong> contain <em><strong>nested objects</strong></em>. For instance:</p><pre><code><code>a = [1, 2, 3]  
b = a.copy()  # Now b is a separate list  
a.append(6)  

print(a)     # Output: [1, 2, 3, 6]  
print(b)     # Output: [1, 2, 3]  (Fixed!)  
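# For contrast, a minimal sketch of what plain assignment does.
# (`x` and `y` are illustrative names, not part of the example above.)
x = [1, 2, 3]
y = x             # no copy is made: y is a second name for the same list
x.append(6)
print(y)          # Output: [1, 2, 3, 6]
print(x is y)     # Output: True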

print(id(a)) # Output: 131938375649664 (the exact value varies per run)
print(id(b)) # Output: 131938374499648 (different from id(a))</code></code></pre><p>Now, <code>b</code> stays unchanged because <code>.copy()</code> creates a new list instead of another reference to the same list.</p><p>However, if the list contains <em><strong>nested objects</strong></em>, <code>.copy()</code> copies only the references to those nested objects, not the objects themselves.</p><p>The code snippet below can help you understand this better:</p><pre><code><code>original_list = [1, [2, 3], [4, 5]]
shallow_copied_list = original_list.copy()

# Modify the original list
original_list[1][0] = 'X'

# Modify the copied list
shallow_copied_list[2][1] = 9

print(original_list)             # Output: [1, ['X', 3], [4, 9]]
print(shallow_copied_list)       # Output: [1, ['X', 3], [4, 9]]
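# A small self-contained check of the same behavior.
# (`outer` and `clone` are illustrative names, not part of the example above.)
outer = [1, [2, 3]]
clone = outer.copy()
print(outer is clone)          # Output: False (the outer lists are distinct)
print(outer[1] is clone[1])    # Output: True  (the nested list is shared)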

print(id(original_list))         # Output: 133878327171712 (ids vary per run)
print(id(shallow_copied_list))   # Output: 133878327557248</code></code></pre><blockquote><p>&#9888;&#65039; <strong>Problem:</strong> A <strong>shallow copy</strong> creates a <strong>new outer list</strong> at a different memory location (hence, a different <code>id</code>). </p><p>However, the <strong>nested objects inside the list still reference the same memory addresses</strong> as in the original list.</p><p>So, modifying a nested object will affect both copies.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l_9x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F823153b7-49af-47ce-a997-99f12f7f2bb9_2560x1179.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l_9x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F823153b7-49af-47ce-a997-99f12f7f2bb9_2560x1179.webp 424w, https://substackcdn.com/image/fetch/$s_!l_9x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F823153b7-49af-47ce-a997-99f12f7f2bb9_2560x1179.webp 848w, https://substackcdn.com/image/fetch/$s_!l_9x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F823153b7-49af-47ce-a997-99f12f7f2bb9_2560x1179.webp 1272w, https://substackcdn.com/image/fetch/$s_!l_9x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F823153b7-49af-47ce-a997-99f12f7f2bb9_2560x1179.webp 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!l_9x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F823153b7-49af-47ce-a997-99f12f7f2bb9_2560x1179.webp" width="2560" height="1179" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/823153b7-49af-47ce-a997-99f12f7f2bb9_2560x1179.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1179,&quot;width&quot;:2560,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39696,&quot;alt&quot;:&quot;Shallow Copy and Deep Copy&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Shallow Copy and Deep Copy" title="Shallow Copy and Deep Copy" srcset="https://substackcdn.com/image/fetch/$s_!l_9x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F823153b7-49af-47ce-a997-99f12f7f2bb9_2560x1179.webp 424w, https://substackcdn.com/image/fetch/$s_!l_9x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F823153b7-49af-47ce-a997-99f12f7f2bb9_2560x1179.webp 848w, https://substackcdn.com/image/fetch/$s_!l_9x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F823153b7-49af-47ce-a997-99f12f7f2bb9_2560x1179.webp 1272w, https://substackcdn.com/image/fetch/$s_!l_9x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F823153b7-49af-47ce-a997-99f12f7f2bb9_2560x1179.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>2. Copying a List Using the </strong><code>copy.deepcopy()</code><strong> Function</strong></h3><p>The <code>copy.deepcopy()</code> function creates a completely independent copy, ensuring that all nested objects are duplicated rather than just their references.</p><pre><code><code>import copy

original_list = [1, [2, 3], [4, 5]]
shallow_copied_list = copy.copy(original_list)
deep_copied_list = copy.deepcopy(original_list) 

# Modify the original list
original_list[1][0] = 'X'

print(original_list)            # Output: [1, ['X', 3], [4, 5]]
print(shallow_copied_list)      # Output: [1, ['X', 3], [4, 5]]
print(deep_copied_list)         # Output: [1, [2, 3], [4, 5]] (No unintended modification)
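# A small self-contained check that deepcopy also duplicates nested objects.
# (`nested` and `deep_clone` are illustrative names, not part of the example above.)
import copy

nested = [1, [2, 3]]
deep_clone = copy.deepcopy(nested)
nested[1].append(4)
print(nested[1] is deep_clone[1])   # Output: False (the nested list was copied)
print(deep_clone)                   # Output: [1, [2, 3]]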

print(id(original_list))         # Output: 140288711654720 (ids vary per run)
print(id(shallow_copied_list))   # Output: 140288711602432
print(id(deep_copied_list))      # Output: 140288711597440</code></code></pre><p>Now, the deep-copied list remains completely independent because <code>deepcopy()</code> <strong>recursively copies all nested objects</strong>.</p><h3><strong>3. Avoid Using Lists as Default Arguments in Functions</strong></h3><p>Using mutable objects like <strong>lists as default function arguments</strong> can create <strong>unexpected behavior</strong>, because the default list is created once, when the function is defined, and is then shared across calls. You can read more about it <em><strong><a href="https://medium.com/gitconnected/3-python-coding-mistakes-you-must-avoid-0ea2259fec76">here</a></strong></em>.</p><div><hr></div><p>Additionally, as a curious Python developer, you may also wonder:</p><ul><li><p>How is data processed and manipulated in memory?</p></li><li><p>How does that affect the quality of the program?</p></li></ul><p>This article provides a <em><strong><a href="https://www.analyticsvidhya.com/blog/2024/09/mutable-vs-immutable-objects-in-python/">Comprehensive overview of Mutable vs Immutable Objects in Python</a></strong></em> and explains <em><strong>why they are crucial for effective programming</strong></em>.</p><div><hr></div><p><em>Stay tuned with<strong> <a href="https://substack.com/@analyticalnikita">ME</a></strong></em>, so you won&#8217;t miss out on future updates.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/working-with-text-data-tokenization?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzODk3MDA4LCJpYXQiOjE3MzU5MDgyMDEsImV4cCI6MTczODUwMDIwMSwiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.6r1ERfqwFCJwDfcSbNBjM-aU7g-W0iFek6Y8qbD3Fzc&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption"><em>&#128226; Before you go, 
leave a &#8220;heart&#8221; &#10084;&#65039; and if you have any questions, suggestions, or thoughts, do drop me a line below. &#128395;&#65039;&#128071;</em></p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/why-python-lists-can-be-risky?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/why-python-lists-can-be-risky?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[Why Your Data Presentations Are Failing?]]></title><description><![CDATA[4 Data Storytelling RULES Every Analyst Should Master]]></description><link>https://analyticalnikita.substack.com/p/why-your-data-presentations-are-failing</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/why-your-data-presentations-are-failing</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Sat, 15 Feb 2025 08:31:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4df7436e-61a3-416a-b66f-d6190e1e2e8c_1000x630.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let me guess, if you&#8217;re here, then you&#8217;re probably struggling to improve your data visualization skills.</p><p>And you&#8217;re looking for ways to make the process easier.</p><p>Stay with me as I walk you through some useful tips to break down your analysis into digestible chunks.</p><blockquote><p><em>&#9208;&#65039;<strong> Quick Pause</strong>: If you&#8217;re new here, I&#8217;d highly appreciate it if you <strong>subscribe </strong>to receive bi-weekly data tips and insights &#8212; directly 
into your inbox. &#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><p>So, let&#8217;s begin!</p><p>Data can be complex and difficult to understand.</p><p>As a data analyst or scientist, you can make data more relatable and easier for your audience&#8212;particularly non-technical stakeholders in the corporate world&#8212;by weaving it into a story.</p><div class="pullquote"><p>&#128204; <strong>Remember</strong>: Your goal is to communicate data-driven insights and recommendations, not to showcase an extensive vocabulary or big numbers.</p></div><p>After all, <em><strong>data isn&#8217;t just numbers; it&#8217;s the foundation for better decision-making </strong></em>across all domains, from <strong>product development</strong> to <strong>marketing</strong> and <strong>financial forecasting</strong>.</p><h2><strong>How to Present Your Data Effectively?</strong></h2><p>As a data analyst, your role is to make sense of past and present data: uncover trends and relationships in your key metrics, then report observations and insights about what has happened so that the business can make informed decisions.</p><h3>Use these simple rules to stand out in your presentations:</h3><ol><li><p>Even if it seems boring, begin by explaining <strong>WHY </strong>the analysis is being conducted. This ensures everyone is on the same page.</p></li><li><p>Be brief with the <strong>HOW</strong>: explain how you arrived at the results, which is usually your methodology.</p></li><li><p>Then move to the <strong>WHAT</strong>: present the results, insights, and recommendations. 
Offer a side-by-side comparison of different KPIs and Metrics. Visuals can help to make your story more engaging and memorable.</p></li><li><p>Lastly, always highlight the <strong>key takeaways </strong>for your audience, particularly regarding future needs. So that your audience understands their roles, and takes action accordingly.</p></li></ol><h4>Here&#8217;s the key: </h4><p><strong>Tell a clear and concise story</strong>. Don't overload your colleagues with too much information.</p><blockquote><blockquote><p>&#10145;&#65039; Also, do not miss: <strong><a href="https://medium.com/learning-data/choosing-the-right-charts-for-effective-business-insights-41060082aac4">Choosing the Right Charts for Effective Business Insights</a></strong></p></blockquote></blockquote><p>Thanks for reading!</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/working-with-text-data-tokenization?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzODk3MDA4LCJpYXQiOjE3MzYyNjI3MTMsImV4cCI6MTczODg1NDcxMywiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.9YJtyeDdbdG7YRaC7fxDDMCJsWzsYR2hfBW6KlmWeS4&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption"><em>Before you go.. 
leave a &#8220;heart&#8221; &#10084;&#65039; and let me know if you have any questions, suggestions, or thoughts. &#128395;&#65039;&#128071;</em></p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/why-your-data-presentations-are-failing?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/why-your-data-presentations-are-failing?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[6 Regression Techniques You Must Know!]]></title><description><![CDATA[Key Concepts Explained Including IMPLEMENTATION]]></description><link>https://analyticalnikita.substack.com/p/6-regression-techniques-you-must</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/6-regression-techniques-you-must</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Tue, 11 Feb 2025 10:37:31 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2edf0a79-5b0c-4399-bcda-9c0a6ec2dfbc_1000x630.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When it comes to predicting trends, analyzing data, or making data-driven decisions, <em><strong>regression analysis</strong></em> is one of the most powerful tools.</p><p>Whether you&#8217;re forecasting sales, setting optimal product prices, or predicting customer behavior, understanding regression techniques is essential.</p><p>So, let&#8217;s dive into <em><strong>six key regression techniques</strong></em> you should know and when to use them.</p><blockquote><p><em><strong>Subscribe</strong>: my goal here is to make Machine 
Learning easy for you. &#128071;&#127995;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h2>1. Linear Regression &#8212; The Classic Approach</h2><p>If you&#8217;re new to regression, Linear Regression is where it all begins.</p><p>It&#8217;s simple, easy to interpret, and works well when your data follows a <em><strong>linear relationship</strong></em> (a straight-line trend).</p><h4><em>Here&#8217;s the GitHub Repository (&#11088; this repo):</em></h4><ul><li><p><em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Linear-Regression/simple-linear-regression.ipynb">Simple Linear Regression</a></strong></em></p></li><li><p><em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Linear-Regression/multiple-linear-regression.ipynb">Multiple Linear Regression</a></strong></em></p></li></ul><h4><em>Use it when:</em></h4><ul><li><p>You want to predict a continuous outcome based on one or more features.</p></li><li><p>Your data has a roughly <strong>straight-line </strong>relationship.</p></li></ul><div><hr></div><h2>2. 
Polynomial Regression &#8212; Non-Linear Relationships</h2><p>Not all data follows a straight-line pattern &#8212; sometimes, it curves!</p><p>That&#8217;s where you need Polynomial Regression &#8212; it fits a curved trend by adding higher-degree terms to the equation instead of fitting a straight line.</p><h4><em>Here&#8217;s the GitHub Repository (&#11088; this repo):</em></h4><ul><li><p><em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Linear-Regression/polynomial-regression.ipynb">Polynomial Regression</a></strong></em> </p></li></ul><h4><em>Use it when:</em></h4><ul><li><p>Your data shows a <strong>non-linear trend</strong>.</p></li><li><p>You need to capture <strong>curvatures</strong> in the relationship.</p></li></ul><div><hr></div><h2><strong>3. Ridge Regression &#8212; Handling Overfitting</strong></h2><p>When your model learns too much from the training data and struggles to generalize, Ridge Regression helps by <em><strong>adding a penalty term</strong></em> to shrink coefficients.</p><p>This penalty term makes the model less complex, which often improves predictions on unseen data.</p><h4><em>Here&#8217;s the GitHub Repository (&#11088; this repo):</em></h4><ul><li><p><em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Linear-Regression/ridge-regression.ipynb">Ridge Regression (L2 Regularization)</a></strong></em></p></li><li><p><em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Linear-Regression/ridge-regression-gradient-descent.ipynb">Ridge Regression Using Gradient Descent</a></strong></em></p></li></ul><h4><em>Use it when:</em></h4><ul><li><p>You&#8217;re working with <strong>high-dimensional data</strong>.</p></li><li><p><strong>Overfitting</strong> is a concern in your model.</p></li></ul><div><hr></div><h2>4. 
Lasso Regression &#8212; Feature Selection Included</h2><p>Lasso Regression is similar to Ridge but goes one step further &#8212; it can <em><strong>shrink some coefficients to exactly zero</strong></em>.</p><p>This helps automatically select the most important features while dropping the rest.</p><h4><em>Here&#8217;s the GitHub Repository (&#11088; this repo):</em></h4><ul><li><p><em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Linear-Regression/lasso-regression.ipynb">LASSO Regression (L1 Regularization)</a></strong></em></p></li></ul><h4><em>Use it when:</em></h4><ul><li><p>You want both <strong>regularization</strong> and <strong>feature selection</strong>.</p></li><li><p>Your dataset has a lot of irrelevant or redundant variables.</p></li></ul><div><hr></div><h2><strong>5. Elastic Net &#8212; The Best of Both Worlds</strong></h2><p>Can&#8217;t decide between Ridge and Lasso? </p><p><strong>Elastic Net</strong> combines both! </p><p>It strikes a balance by applying both <strong>L1 (Lasso)</strong> and <strong>L2 (Ridge)</strong> regularization, making it a great choice when you suspect <strong>some</strong> features should be eliminated but not all.</p><h4><em><strong>Use it when:</strong></em></h4><ul><li><p>Your dataset is <strong>high-dimensional</strong> and may have <strong>correlated features</strong>.</p></li><li><p>You need <strong>better stability</strong> than Lasso alone.</p></li></ul><div><hr></div><h2>6. Logistic Regression &#8212; Classification Problems</h2><p>Despite its name, <strong>Logistic Regression</strong> isn&#8217;t used for predicting numbers&#8212;it&#8217;s for <strong>classification</strong> (yes/no, true/false, spam/not spam, etc.). 
</p><p>Instead of predicting a continuous value, it predicts the <strong>probability</strong> of belonging to a category.</p><h4><em>Here&#8217;s the GitHub Repository (&#11088; this repo):</em></h4><ul><li><p><em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Logistic-Regression/weather_forcaste.ipynb">Logistic Regression</a></strong></em></p></li></ul><h4><em>Use it when:</em></h4><ul><li><p>Your target variable is <strong>binary</strong> (e.g., pass/fail, buy/don&#8217;t buy).</p></li><li><p>You need an <strong>interpretable</strong> model for decision-making.</p></li></ul><div><hr></div><h3><strong>Wrapping Up &#8211; Which Regression Model Should You Use?</strong></h3><p>There&#8217;s no one-size-fits-all regression technique&#8212;it all depends on your data and problem!</p><p>Next time you&#8217;re working on a machine learning project, take a step back and think about which regression method fits best. </p><p>Picking the right one could make all the difference in the reliability of your predictions!</p><p>If you&#8217;d like to explore the full implementation, including code and data, don&#8217;t forget to check out the repositories linked above. </p><div><hr></div><p><strong>&#128073; </strong><em>Over to you</em>:</p><p>Which regression technique do you use the most? </p><p>Drop your thoughts in the comments!</p><p>Thanks for reading!</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/working-with-text-data-tokenization?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzODk3MDA4LCJpYXQiOjE3MzYyNjI3MTMsImV4cCI6MTczODg1NDcxMywiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.9YJtyeDdbdG7YRaC7fxDDMCJsWzsYR2hfBW6KlmWeS4&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption"><em>Before you go.. 
leave a &#8220;heart&#8221; &#10084;&#65039; and let me know if you have any questions, suggestions, or thoughts.&#128395;&#65039;&#128071;</em></p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/6-regression-techniques-you-must?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/6-regression-techniques-you-must?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item><item><title><![CDATA[Outliers Can Ruin Your Analysis ]]></title><description><![CDATA[(What, Why and How to Handle Them!)]]></description><link>https://analyticalnikita.substack.com/p/outliers-can-ruin-your-analysis</link><guid isPermaLink="false">https://analyticalnikita.substack.com/p/outliers-can-ruin-your-analysis</guid><dc:creator><![CDATA[Nikita Prasad]]></dc:creator><pubDate>Sat, 08 Feb 2025 10:36:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lhwp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faea810b2-2aad-41f5-9988-c36f879de960_701x465.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Raw data is never perfect. You might come across values that seem way off compared to the rest of the data. These are called <em><strong>Outliers</strong></em>!</p><p>Let&#8217;s break this down in simple terms:</p><blockquote><p><strong>Quick Pause</strong>: <em>If you&#8217;re new here <strong>Subscribe </strong>&#8212; my goal is to make Data Science easy for you. 
&#128071;&#127995; </em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/subscribe?"><span>Subscribe now</span></a></p></blockquote><div><hr></div><h1>What are Outliers? &#129300;</h1><p>An outlier is an observation in a dataset that lies <em><strong>far</strong></em> from the rest of the values. </p><p>Simply put, outliers are data points that differ significantly from the other values in a dataset. </p><p>That means an outlier can be much larger or smaller than the majority of values, which makes it stand out.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lhwp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faea810b2-2aad-41f5-9988-c36f879de960_701x465.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lhwp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faea810b2-2aad-41f5-9988-c36f879de960_701x465.png 424w, https://substackcdn.com/image/fetch/$s_!lhwp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faea810b2-2aad-41f5-9988-c36f879de960_701x465.png 848w, https://substackcdn.com/image/fetch/$s_!lhwp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faea810b2-2aad-41f5-9988-c36f879de960_701x465.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lhwp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faea810b2-2aad-41f5-9988-c36f879de960_701x465.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lhwp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faea810b2-2aad-41f5-9988-c36f879de960_701x465.png" width="701" height="465" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aea810b2-2aad-41f5-9988-c36f879de960_701x465.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:465,&quot;width&quot;:701,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:73239,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lhwp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faea810b2-2aad-41f5-9988-c36f879de960_701x465.png 424w, https://substackcdn.com/image/fetch/$s_!lhwp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faea810b2-2aad-41f5-9988-c36f879de960_701x465.png 848w, https://substackcdn.com/image/fetch/$s_!lhwp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faea810b2-2aad-41f5-9988-c36f879de960_701x465.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lhwp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faea810b2-2aad-41f5-9988-c36f879de960_701x465.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h1>Why Do They Occur?</h1><p>An outlier may occur due to natural <em><strong>variability in the data</strong></em>, or due to <em><strong>experimental bias </strong>or<strong> human error</strong></em>.</p><blockquote><p><em><strong>Note</strong>: They may indicate <strong>heavy skewness </strong>in the data (<strong>heavy-tailed 
distribution</strong>).</em></p></blockquote><h1>What Do They Affect?</h1><p>Outliers can distort statistical results, leading to misleading conclusions.</p><p>Their impact depends on which statistical measure you&#8217;re using.</p><ul><li><p><strong>Mean</strong> is the measure of central tendency most affected by outliers, which in turn inflates the <em><strong>Standard Deviation</strong></em>.</p></li><li><p><strong>Median</strong> is robust to outliers and is commonly preferred when dealing with skewed data.</p></li><li><p><strong>Mode</strong> is used if there are outliers and about &#189; or more of the data has the same value.</p></li></ul><div><hr></div><h1>How to Detect Outliers?</h1><p>If a dataset is small, you can often spot outliers just by looking at the data. </p><p>But for larger datasets, we need proper visualization and statistical techniques to find them accurately.</p><p>Below are some techniques for detecting outliers:</p><h3>1. <em><strong>Z-scores (Standard Deviation Method)</strong></em></h3><ul><li><p><strong>Criteria</strong>: Works best for normally distributed data.</p></li><li><p><strong>Rule</strong>: Any data point whose Z-score falls outside<em> &#177;3 (more than 3 standard deviations from the mean)</em> is considered an outlier.</p></li><li><p><strong>Formula:</strong> <em><strong>Z = (X &#8722; &#956;) / &#963;</strong></em></p><p>where <strong>X</strong> is the data point, <strong>&#956;</strong> is the mean, and <strong>&#963;</strong> is the standard deviation.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DQSe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a80c00-14dc-44f2-80ea-1a1d013dc2a5_1000x630.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!DQSe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a80c00-14dc-44f2-80ea-1a1d013dc2a5_1000x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!DQSe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a80c00-14dc-44f2-80ea-1a1d013dc2a5_1000x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!DQSe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a80c00-14dc-44f2-80ea-1a1d013dc2a5_1000x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!DQSe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a80c00-14dc-44f2-80ea-1a1d013dc2a5_1000x630.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DQSe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a80c00-14dc-44f2-80ea-1a1d013dc2a5_1000x630.jpeg" width="1000" height="630" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62a80c00-14dc-44f2-80ea-1a1d013dc2a5_1000x630.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67020,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!DQSe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a80c00-14dc-44f2-80ea-1a1d013dc2a5_1000x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!DQSe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a80c00-14dc-44f2-80ea-1a1d013dc2a5_1000x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!DQSe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a80c00-14dc-44f2-80ea-1a1d013dc2a5_1000x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!DQSe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a80c00-14dc-44f2-80ea-1a1d013dc2a5_1000x630.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>2. 
<em><strong>Interquartile Range (IQR) Method</strong></em></h3><ul><li><p><strong>Criteria</strong>: Best for skewed or non-normally distributed data.</p></li><li><p><strong>Rule</strong>: Any data point that lies more than 1.5 &#215; IQR above <em>Q3</em> or below <em>Q1</em> is an outlier.</p></li></ul><p>In statistics, the interquartile range (IQR) measures the difference between the third and first quartiles of a dataset.</p><ul><li><p><strong>Formula:</strong><em><strong> IQR = Q3 &#8722; Q1</strong></em></p></li><li><p><strong>Outlier Thresholds:</strong> Lower Bound = <em>Q1 &#8722; 1.5 &#215; IQR</em></p><p>Upper Bound = <em>Q3 + 1.5 &#215; IQR</em></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I1oe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28b855c6-abee-4e43-97b1-4a6057a648fe_1091x547.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I1oe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28b855c6-abee-4e43-97b1-4a6057a648fe_1091x547.png 424w, https://substackcdn.com/image/fetch/$s_!I1oe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28b855c6-abee-4e43-97b1-4a6057a648fe_1091x547.png 848w, 
https://substackcdn.com/image/fetch/$s_!I1oe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28b855c6-abee-4e43-97b1-4a6057a648fe_1091x547.png 1272w, https://substackcdn.com/image/fetch/$s_!I1oe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28b855c6-abee-4e43-97b1-4a6057a648fe_1091x547.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I1oe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28b855c6-abee-4e43-97b1-4a6057a648fe_1091x547.png" width="1091" height="547" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28b855c6-abee-4e43-97b1-4a6057a648fe_1091x547.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:1091,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:98249,&quot;alt&quot;:&quot;image.png&quot;,&quot;title&quot;:&quot;image.png&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image.png" title="image.png" srcset="https://substackcdn.com/image/fetch/$s_!I1oe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28b855c6-abee-4e43-97b1-4a6057a648fe_1091x547.png 424w, https://substackcdn.com/image/fetch/$s_!I1oe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28b855c6-abee-4e43-97b1-4a6057a648fe_1091x547.png 848w, 
https://substackcdn.com/image/fetch/$s_!I1oe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28b855c6-abee-4e43-97b1-4a6057a648fe_1091x547.png 1272w, https://substackcdn.com/image/fetch/$s_!I1oe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28b855c6-abee-4e43-97b1-4a6057a648fe_1091x547.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p><em><strong>Note</strong>: A boxplot shows outliers as individual points that fall outside the &#8220;whiskers&#8221; 
of the box.</em></p></blockquote><h3>3. <em>Detecting Outliers Using Percentiles</em></h3><p>Define a <code>custom range</code> that keeps all data points lying between the 0.5th and 99.5th percentiles of the dataset; anything outside this range is treated as an outlier.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9xJT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb26e5bf4-ccff-495a-a03a-744f29fd1172_1000x630.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9xJT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb26e5bf4-ccff-495a-a03a-744f29fd1172_1000x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!9xJT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb26e5bf4-ccff-495a-a03a-744f29fd1172_1000x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!9xJT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb26e5bf4-ccff-495a-a03a-744f29fd1172_1000x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!9xJT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb26e5bf4-ccff-495a-a03a-744f29fd1172_1000x630.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9xJT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb26e5bf4-ccff-495a-a03a-744f29fd1172_1000x630.jpeg" width="1000" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b26e5bf4-ccff-495a-a03a-744f29fd1172_1000x630.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58060,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9xJT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb26e5bf4-ccff-495a-a03a-744f29fd1172_1000x630.jpeg 424w, https://substackcdn.com/image/fetch/$s_!9xJT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb26e5bf4-ccff-495a-a03a-744f29fd1172_1000x630.jpeg 848w, https://substackcdn.com/image/fetch/$s_!9xJT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb26e5bf4-ccff-495a-a03a-744f29fd1172_1000x630.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!9xJT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb26e5bf4-ccff-495a-a03a-744f29fd1172_1000x630.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h1>How to Handle Outliers?</h1><p>Once you&#8217;ve identified outliers, the next question is: <em>What should you do with them?</em></p><p>The answer depends on the cause of the outlier and its impact on your analysis. 
</p><p>Below are some methods for treating outliers:</p><h4><em>Method 1: Trimming/Removing Outliers</em></h4><p>If an outlier is caused by <em><strong>human error </strong></em>or<em><strong> faulty measurement</strong></em>, it&#8217;s safe to remove it.</p><p>However, removing outliers without a valid reason is not good practice, as it can distort the real data and reduce the accuracy of your analysis.</p><h4><em>Method 2: Quantile-Based Flooring and Capping (Winsorization)</em></h4><p>Instead of removing outliers, you can limit their impact by<em> <strong>capping values above a chosen threshold</strong></em>, say the 90th percentile, and <em><strong>flooring values below another</strong></em>, say the 10th percentile. 
</p><h4><em>Method 3: Median Imputation</em></h4><p>Because the mean is highly influenced by outliers, it is advisable to <em><strong>replace the outliers with the median value</strong></em>, which preserves the dataset&#8217;s integrity.</p><h4><em>Method 4: Use Transformations</em></h4><p>It is recommended to apply transformations when dealing with highly skewed data. </p><p><em><strong>Logarithmic </strong></em>or <em><strong>Square Root Transformations</strong></em> can reduce the influence of extreme values.</p><h2>When to Keep Outliers?</h2><p>Yes, not all outliers should be removed! </p><p>Sometimes they provide important insights, which is why we need to keep them.</p><ul><li><p>In <strong>Fraud Detection</strong>, extreme values could indicate fraudulent transactions.</p></li><li><p>In <strong>Medical Data</strong>, rare values might highlight a significant medical condition.</p></li><li><p>In <strong>Business Analysis</strong>, an unexpected spike in sales might indicate a successful marketing campaign. 
</p></li></ul><p>So, always investigate the cause before deciding to remove or modify outliers.</p><p>If you&#8217;d like to explore the full implementation, including code and data, then check out the <em><strong><a href="https://github.com/nikitaprasad21/ML-Cheat-Codes/blob/main/Feature-Engineering/Outlier-Handling/ouliers_basics.ipynb">GitHub Repository</a></strong></em> &#128072;&#127995;</p><div><hr></div><p>If you enjoyed this deep dive, <em>stay tuned with<strong> <a href="https://substack.com/@analyticalnikita">ME</a></strong></em>, so you won&#8217;t miss out on future updates.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/working-with-text-data-tokenization?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjoxMzY5MjcxNywicG9zdF9pZCI6MTUzODk3MDA4LCJpYXQiOjE3MzU5MDgyMDEsImV4cCI6MTczODUwMDIwMSwiaXNzIjoicHViLTM2MjM0MTciLCJzdWIiOiJwb3N0LXJlYWN0aW9uIn0.6r1ERfqwFCJwDfcSbNBjM-aU7g-W0iFek6Y8qbD3Fzc&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption"><em>Before you go.. leave a &#8220;heart&#8221; &#10084;&#65039; and if you have any questions, suggestions, or thoughts, do drop me a line below. 
&#128395;&#65039;&#128071;</em></p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://analyticalnikita.substack.com/p/outliers-can-ruin-your-analysis?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://analyticalnikita.substack.com/p/outliers-can-ruin-your-analysis?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>Until next time, happy learning!</p><p>&#8212; <em><strong><a href="https://substack.com/@analyticalnikita">Nikita Prasad</a></strong></em></p>]]></content:encoded></item></channel></rss>